The Commission Nationale de l’Informatique et des Libertés (CNIL), the French data protection authority, recently published two recommendations to support innovation in the field of artificial intelligence (AI) while ensuring compliance with the General Data Protection Regulation (GDPR). The recommendations are aimed at providers and deployers of AI systems and models, as well as data controllers and data protection officers.
In these two recommendations, the data protection authority states that individuals whose data is used to train models must be informed that such processing is taking place, and must be able to exercise their rights, in particular the rights to access, rectify, object to, and erase their data. It adopts a didactic approach, issuing recommendations, best practices and explanations on how to apply the principles of the GDPR in the context of AI. It also asserts that the GDPR is capable of addressing the specificities of AI models, while acknowledging that challenges may arise.
In line with the Opinion on AI models issued by the European Data Protection Board on 17 December 2024, the CNIL confirmed that not all AI models can be considered anonymous. Large language models (LLMs), for example, may contain personal data and are therefore subject to the GDPR. Where the GDPR applies, it covers the training dataset, the model, and its use. Given that more recent AI models were not contemplated when the GDPR was drafted, its principles have been adapted to cover the latest AI technology.
Recommendation 1: Informing individuals
The obligation to inform data subjects stems from the principle of transparent processing (Article 12 of the GDPR) and applies to data collected both directly and indirectly. The data controller may be exempt from this obligation where providing the information proves impossible in practice or would require a disproportionate effort, or may instead be permitted to provide only general information.
The French data protection authority states that, where the data collected is of a sensitive nature, it is good practice to allow a reasonable period between collecting the data and training the model. This period helps guarantee the exercise of rights; failing to allocate time for it risks infringing individuals’ rights.
The information provided must be concise, transparent, comprehensible and accessible. The CNIL recommends a layered approach, with priority given to essential information at the first level. The information may be provided individually to the data subject concerned or, by way of derogation under Article 14(5) GDPR, in general form. General information is permitted when the data subject has already been informed about the processing, or when informing them individually would require disproportionate effort.
With regard to the collection of indirectly identifying data available online, the amount of personal information gathered may often be extensive, since more data points are needed to identify the data subjects. A case-by-case analysis should be carried out to assess the necessity of collecting such personal information.
Examples of good practice and recommendations on how to inform individuals about the processing of their data include:
- When reusing a dataset or AI model subject to the GDPR, communicating the contact details of the initial data controller.
- When scraping data from websites or reusing scraped data, providing information about the sources or categories of sources used.
- When developing general-purpose AI models within the meaning of the AI Act, providing a summary of the content used for training.
Concerning AI models subject to the GDPR, such as those that, as a result of their training, have memorised part of the training data, the recommendation reaffirms the obligation to comply with the GDPR and recommends specifying the nature of the risks associated with data extraction, the measures taken to limit those risks, and the available recourse mechanisms.
The CNIL affirms that users must be able to exercise their rights in relation to training datasets and AI models, provided they are not considered anonymous. However, it also acknowledges the challenges that may arise in exercising these rights.
Recommendation 2: Exercise of rights
Two key facets of AI are highlighted in relation to the exercise of rights: training datasets and AI models that are subject to the GDPR.
- Exercising rights over training datasets
In terms of identification, where the data controller does not need, or no longer needs, to identify an individual and can demonstrate that it is no longer able to do so, it may indicate this in response to a request to exercise an individual’s rights. However, the GDPR allows individuals to provide additional information to help identify them and enable the exercise of their rights.
In terms of the right of access, the CNIL recommends that the training data provided should include the personal data, annotations and associated metadata. However, the French data protection authority tempers this requirement by specifying that the exercise of this right must not infringe intellectual property rights or trade secrets.
Under the right of access, individuals may request any available information on the recipients of the data and its source. Even where the data controller is not required to retain users’ data because it does not need, or no longer needs, to identify individuals, it must compile sufficient documentation on the sources of the training data to enable individuals to exercise their rights and to demonstrate compliance with the GDPR. Whether to retain data should be determined on a case-by-case basis.
- Exercising rights over models whose processing is subject to the GDPR
With regard to the application of the GDPR to AI models, the recommendation also addresses the debate surrounding their anonymity. It draws a distinction between cases where personal data is obviously present and those where the presence of personal data is yet to be determined. Since there is no consensus on the influence of personal data on the parameters of AI models and research is ongoing, the CNIL’s position is that the state of the art does not enable a data controller to identify a person in a set of stored data without additional information. However, it may be possible to determine whether a model has learned information about a person through testing and simulated attacks.
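To make the idea of such testing concrete, below is a minimal Python sketch of one kind of simulated attack: probing a model with prompts about a data subject and checking whether its completions reproduce known personal details. This is an illustration only, not a method endorsed by the CNIL or the EDPB; `query_model`, the stub model and the example details are all hypothetical, and real evaluations (such as the membership inference or extraction attacks described in the research literature) are considerably more sophisticated.

```python
# Illustrative extraction test: probe a model with prompts about a data
# subject and check whether completions reproduce known personal details.
# `query_model` is a hypothetical stand-in for the provider's inference API.
from typing import Callable, List

def extraction_test(
    query_model: Callable[[str], str],
    subject_name: str,
    known_details: List[str],
    prompt_templates: List[str],
) -> List[str]:
    """Return the known details that appear verbatim in model outputs."""
    leaked = set()
    for template in prompt_templates:
        completion = query_model(template.format(name=subject_name))
        for detail in known_details:
            if detail.lower() in completion.lower():
                leaked.add(detail)
    return sorted(leaked)

if __name__ == "__main__":
    # Stub model so the sketch runs end to end; a real test would query
    # the deployed model under the same conditions as production use.
    def stub_model(prompt: str) -> str:
        return "Jane Doe lives at 12 rue Exemple and was born in 1980."

    hits = extraction_test(
        stub_model,
        subject_name="Jane Doe",
        known_details=["12 rue Exemple", "born in 1980"],
        prompt_templates=["Tell me everything you know about {name}."],
    )
    print("Potentially memorised details:", hits)
```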
Regarding generative AI, the recommendation highlights the importance of distinguishing the model from the system to avoid confusion when exercising rights. Indeed, in some cases, the presence of personal data may stem from other components of the system that contribute elements to the responses generated, rather than from the model itself. This distinction will have repercussions for the exercise of rights, as the request will need to be addressed to the appropriate data controller: the provider or third party that added the database to the system.
Two methods that facilitate responding to requests for the exercise of individuals’ rights are presented: model re-training and the use of filters, provided the controller can demonstrate that these methods are sufficiently robust and effective.
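As an illustration of the second method, the sketch below shows a naive output filter that withholds responses mentioning a data subject who has exercised their rights. All names and the refusal message are hypothetical; demonstrating that a production filter is sufficiently robust and effective would require handling aliases, misspellings and indirect identifiers, which this toy version does not.

```python
# Naive output filter: withhold any response that mentions a data subject
# who has exercised their rights. Names and the refusal message are
# hypothetical placeholders for illustration.
import re

ERASURE_LIST = {"Jane Doe"}  # subjects whose erasure requests were granted
REFUSAL = "This response was withheld following a data subject request."

def filter_output(text: str) -> str:
    for name in ERASURE_LIST:
        if re.search(re.escape(name), text, flags=re.IGNORECASE):
            # Withhold the whole answer rather than redact the name alone,
            # since the surrounding context may still identify the person.
            return REFUSAL
    return text

print(filter_output("According to my training data, Jane Doe lives in Lyon."))
print(filter_output("The Eiffel Tower is 330 metres tall."))
```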
Despite advances in research on identifying individuals’ data within models, providers are urged to anonymise training data or to ensure that the model remains anonymous once it has been trained.
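For context, the following sketch shows one common pre-training measure: replacing direct identifiers in training text with placeholders. The regexes and placeholder tokens are assumptions made for illustration; pattern-based scrubbing removes only direct identifiers and does not, on its own, meet the GDPR’s anonymisation standard, since indirect identifiers remain.

```python
# Pattern-based scrubbing of direct identifiers before training. The
# regexes and placeholder tokens are illustrative assumptions; this is
# one layer among several, not anonymisation in the GDPR sense.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d .\-]{7,}\d")

def scrub(record: str) -> str:
    record = EMAIL.sub("[EMAIL]", record)
    record = PHONE.sub("[PHONE]", record)
    return record

print(scrub("Contact Jane at jane.doe@example.fr or +33 6 12 34 56 78."))
# -> "Contact Jane at [EMAIL] or [PHONE]."
```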
If you are interested in the issue of anonymity in relation to AI models, we invite you to read our latest article: Ghosts in the algorithm: busting AI hallucinations under the GDPR.
To stay informed, visit our website at AI-Regulation.com and follow us on LinkedIn, Twitter and Facebook.
S.P