Synthetic data: A safeguard or a threat to privacy?
The spread of AI and its use across every market sector make it urgent to find solutions that protect fundamental rights and freedoms, particularly the right to privacy and data protection.
The exploitation of vast amounts of personal data for training AI systems, the difficulty of verifying the accuracy and relevance of that data, the loss of control by individuals over their information, and the difficulty of verifying the accuracy of decisions made by AI are just some of the privacy-related risks associated with the use of AI systems.
Synthetic data as a tool for data minimization
As outlined by the Consultative Committee of the Convention for the Protection of Individuals with regard to Automatic Processing of Personal Data (Convention 108), responsible innovation in the field of AI requires an approach focused on avoiding and mitigating the potential risks of processing personal data.
Using synthetic data can be a way to minimize the amount of personal data processed by AI applications, prevent this information from being traced back to the relevant individuals (ensuring the non-reversibility of de-identification), and overcome the obstacles to technological evolution posed by data protection laws.
Synthetic data - so called because it is obtained through a synthesizing process - is fictitious information derived from actual data through generative machine learning algorithms. The algorithm is trained to reproduce the characteristics and structure of the original dataset, allowing us to obtain statistically accurate results.
The synthesizing process - which can be implemented through various techniques - starts from a real dataset, which can include any kind of information (including images), and produces an artificial dataset that mirrors the features of the original. This process allows the characteristics and structure of the source information to be replicated without replicating or tracing its identifying elements (i.e., without revealing any personal data).
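To make the idea concrete, the sketch below fits a very simple statistical model to a toy two-column dataset and then samples new, artificial rows from it. It is only an illustration under stated assumptions: a bivariate Gaussian stands in for the generative machine learning algorithms mentioned above, the "real" rows (age, income) are themselves made up for the demo, and the function names are hypothetical.

```python
import math
import random
import statistics


def fit_gaussian(rows):
    """Estimate means, standard deviations and correlation of two numeric columns.

    Assumes both columns are non-constant (non-zero standard deviation).
    """
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    rho = max(-1.0, min(1.0, cov / (sx * sy)))  # clamp against rounding error
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": rho}


def sample_synthetic(model, n, seed=0):
    """Draw n artificial rows that follow the fitted means, spreads and correlation."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        mixed = model["rho"] * z1 + math.sqrt(1 - model["rho"] ** 2) * z2
        y = model["my"] + model["sy"] * mixed
        rows.append((x, y))
    return rows


# Toy stand-in for a real dataset: (age, income) pairs, invented for this demo.
_rng = random.Random(42)
real_rows = []
for _ in range(500):
    age = _rng.uniform(20, 65)
    real_rows.append((age, 1000 * age + _rng.gauss(0, 5000)))

model = fit_gaussian(real_rows)
synthetic_rows = sample_synthetic(model, 500, seed=1)
```

The synthetic rows share the overall structure of the source (similar averages and a similar age-income relationship) while none of them is a copy of an original row. Real generative models are far more expressive than this, which is also why they call for the safeguards discussed later in the article.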
Synthetic data enables us to overcome the limits of anonymization
The above features represent a significant step forward for the use of personal data because they overcome the drawbacks associated with anonymized data. Data protection legislation does not apply to anonymized data because it does not fall under the definition of “personal data.”
However, for data to qualify as anonymized, it must be impossible to retrieve the identity of the individual to whom it relates. This limits the use of the information, either because technological progress has made it very difficult to guarantee that anonymization is absolutely irreversible or because removing every identifying element to ensure complete anonymization often reduces the usefulness of the resulting data.
These drawbacks can be overcome through the synthesizing process described above. Moreover, because synthetic data - like anonymized data - is not considered “personal data,” data protection legislation, which places many obstacles on the use of personal information, does not apply to it. The draft AI Act puts synthetic and anonymized data on an equal footing in Article 54, which regulates the conditions for using personal data to develop AI systems in the AI regulatory sandbox.
This explains why the use of synthetic data is increasingly common in the field of machine learning, whose algorithms need a massive amount of data to be “trained.”
Privacy concerns arising from the use of synthetic data
No risk for privacy, then? Unfortunately, this is not the case.
Although synthetic data is artificial in nature, it is obtained from real information, which must be processed in compliance with data protection laws.
Firstly, compliance with data protection laws must be ensured when selecting or collecting the information to be synthesized by the algorithm. In particular, it is necessary to ensure that individuals are adequately informed about the purpose of processing their data, that they have the chance to maintain control over its use, and that such use rests on an appropriate legal basis.
The above is particularly important considering that, according to Article 2-decies of the Italian Privacy Code (Legislative Decree No. 196/2003 as subsequently amended), personal data processed in breach of the personal data protection laws cannot be used.
Furthermore, appropriate criteria should be defined to verify that the synthesizing algorithm is not affected by flaws in how it reprocesses the original dataset that would allow the identity of the data subjects to be traced.
Measures must be taken to prevent synthetic records from being traced back to the original dataset. According to the European Data Protection Supervisor (EDPS), a “privacy assurance assessment” should be conducted to assess how data subjects could be reidentified and what information would be revealed about them in such a case.
In addition, appropriate precautions should be taken to ensure the transparent use of synthetic data, avoiding the risk of potentially harmful distortions (e.g., identity theft, or the “deepfake” technique, which allows the creation of synthetic multimedia content that can distort public opinion).
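One simple check that could feed into such an assessment is to measure how close each synthetic record lies to its nearest real record: a synthetic row that coincides, or nearly coincides, with a real one may reveal that person's data. The sketch below is a minimal illustration of this distance-based heuristic, not the EDPS's own method; the function names, the toy records and the threshold value are all assumptions made for the example, and in practice the columns would need to be put on comparable scales first.

```python
import math


def nearest_distance(record, dataset):
    """Euclidean distance from one record to the closest record in dataset."""
    return min(math.dist(record, other) for other in dataset)


def flag_risky_records(synthetic, real, threshold):
    """Return the synthetic records that lie within threshold of some real record."""
    return [s for s in synthetic if nearest_distance(s, real) <= threshold]


# Toy numeric records, invented for illustration.
real = [(34.0, 52.0), (45.0, 61.0), (29.0, 38.0)]
synthetic = [(34.0, 52.0), (40.1, 57.3), (60.0, 20.0)]

# The first synthetic record duplicates a real one and is flagged.
risky = flag_risky_records(synthetic, real, threshold=0.5)
```

A check like this only covers one route to reidentification. It says nothing, for instance, about what an attacker could infer by combining the synthetic data with other sources.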
Lastly, it is necessary to avoid the risk of discrimination that could result from using synthetic data that is not adequately representative of the phenomena it addresses.
The quality of synthetic data is closely related to the quality of the original information and the data generation model. Synthetic data may reflect biases present in the source dataset. This risk is exacerbated by the difficulty of verifying algorithm outputs, especially when dealing with particularly complex datasets.
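A basic representativeness check is to compare how often each group appears in the source dataset and in the synthetic one. The sketch below is a minimal illustration with invented records and hypothetical function names; it detects only drift introduced by the generation step, so biases already present in the source would have to be assessed against an external reference.

```python
from collections import Counter


def group_shares(records, key):
    """Share of records belonging to each value of key."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def representation_gaps(source, synthetic, key):
    """Synthetic share minus source share for every group seen in either dataset."""
    src = group_shares(source, key)
    syn = group_shares(synthetic, key)
    return {g: syn.get(g, 0.0) - src.get(g, 0.0) for g in set(src) | set(syn)}


# Invented example: group B shrinks from 30% of the source to 15% of the synthetic data.
source = [{"group": "A"}] * 70 + [{"group": "B"}] * 30
synthetic = [{"group": "A"}] * 85 + [{"group": "B"}] * 15

gaps = representation_gaps(source, synthetic, "group")
```

Here group B is under-represented by about 15 percentage points, the kind of gap that could translate into discriminatory outcomes if a model were trained on the synthetic data alone.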
The above reflections highlight how synthetic data - like many other innovations introduced by AI - can be a valuable tool that could benefit society as a whole. However, its use must be controlled and carried out in compliance with applicable laws, particularly data protection laws. With this in mind, we hope that the AI Act will provide unambiguous answers and ensure responsible use of the technology under discussion.