EU AI Act: Open Source LLMs must disclose their training data

The Contention Around Open Source Model Regulation

How, and to what extent, the EU AI Act will regulate open source models has been contentious. France, Germany, and Italy lobbied to shield their national champions from regulation, in particular from the requirement to disclose what data their models are trained on. This requirement is sensitive because AI companies face a growing number of copyright lawsuits from creators whose works were used to train AI models.

Unambiguous Language in the Final Draft of the AI Act

The final draft of the AI Act now contains unambiguous language requiring even open source models to disclose sufficient information about their training data:

“In any case, given that the release of general purpose AI models under free and open source licence does not necessarily reveal substantial information on the dataset used for the training or fine-tuning of the model and on how thereby the respect of copyright law was ensured, the exception provided for general purpose AI models from compliance with the transparency related requirements should not concern the obligation to produce a summary about the content used for model training and the obligation to put in place a policy to respect Union copyright law in particular to identify and respect the reservations of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790.”

AI Act final draft

Implications for Open Source AI Developers

The Act requires open source model developers to provide enough information about their training data that people whose works were used can identify that use and object to it. This requirement raises the stakes for the AI developer community in light of high-profile lawsuits in the United States, particularly New York Times v. OpenAI and Microsoft. From now on, developers will have to consider how they acquire and vet their training data so that they minimize the risk of including copyrighted material that could lead to litigation.

Global Trends in AI Transparency and Disclosure

The EU is not alone in moving towards greater transparency and disclosure requirements for AI models. The “AI Foundation Model Transparency Act of 2023”, introduced in the US Congress, proposes similar requirements.

The Challenge for Businesses Using AI

The large number of new requirements coming from the AI Act and other regulations will pose a major challenge to businesses looking to use AI.