13 June 2023 • 7 minute read

Can generative AI rely on the text and data mining (TDM) exception for its training?

Written by: Elena Varese, Carolina Battistella

The self-training of generative AI systems and the coordination with copyright

Generative AI systems “self-train” using machine learning algorithms that analyze massive amounts of data, images and content and learn to use that information to create new content similar to existing content.

Such analysis, however, could be considered a reproduction, even if only temporary, of the data and sources used, including any protected works or entire portions of the databases employed. The automated extraction of such content may therefore raise problems of coordination with the rules protecting copyright and related rights - in particular, the exclusive right of reproduction under Article 13 of Law No. 633/1941 (Copyright Law). It could also conflict with the right of the creator of a database to prohibit the extraction or reuse of all or a substantial part of it.

In the context of copyright law, legal scholars have questioned whether protected information and works can be subjected to creative processing. On this point, the European legislator has already provided that processing data without the authorization of the author of the work from which they are extracted may constitute copyright infringement. However, making data and content extraction subject to prior authorization from the owner of the copyrights involved would clearly entail high transaction costs and timeframes incompatible with the development of AI systems. It’s precisely for these reasons that the European legislator intervened, reforming the subject by introducing certain exceptions and limitations to copyright that are mandatory for each Member State.

The TDM exceptions

Specifically, with regard to data mining, the Copyright Directive 2019/790/EU introduced the text and data mining (TDM) exceptions, which are regulated in Articles 3 (Text and data mining for the purposes of scientific research) and 4 (Exception or limitation for text and data mining). TDM is defined in Article 2 of the Copyright Directive as “any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations.” At the national level, these articles have been transposed, respectively, by introducing into the Copyright Law Article 70-ter - which deals only with extraction for scientific purposes by research organizations and cultural heritage protection institutions - and Article 70-quater - which allows the extraction of text and data in general, by anyone, even for mere profit.

Given the large amounts of data that AI systems use to generate new content, the close relationship between generative AI and the TDM exception is evident: the text and data mining exception allows AI systems to access large amounts of data, which are used by generative AI to create new content. Should these systems not be allowed to access such data, their ability to generate content would undoubtedly be limited.

The admissibility of text and data mining for commercial purposes: Legitimate access and reservation

Of the two TDM exceptions regulated by the European directive, the second one, which also allows mining for profit, deserves particular attention. Article 70-quater of the Copyright Law exempts any text and data mining activity carried out on an intellectual work, including software or a database protected by a related right, regardless of the purpose of the activity or the status of the person performing it.

This, however, provided that:

the person had legitimate access to the content for the purpose of text and data mining; and
the owner of the copyright and related rights and/or the owner of the database has not expressly reserved the extraction of text and data (opt-out mechanism), thus bringing TDM activities under their exclusive control.

However, the liberalizing scope of Article 70-quater depends on the manner in which the rights holder makes the reservation under the opt-out mechanism. It is Article 4, para. 3 of the Copyright Directive itself that requires the reservation to be expressed “in an appropriate manner, such as machine-readable means in the case of content made publicly available online.” This provision seems to require that the reservation statement be readable in an automated manner when the work to which it relates is made available to the public on the internet. That said, an opt-out can also result from the inclusion of an appropriate clause in a contract, a reading confirmed by the Copyright Directive itself, which does not include Article 4 among the mandatory rules.

In addition, whether a statement qualifies as a reservation does not depend on whether technical mechanisms are in place to prevent data extraction. This interpretation is based on the merely informative function of the reservation. Thus, it will be sufficient to include the reservation in the terms and conditions of the website, even if the site lacks protective measures.

Therefore, the reservation:

may be a “digital” statement without computer protection mechanisms, such as the exclusion protocols contained in robots.txt files; or
may be achieved by applying a digital rights management system that not only has a computer protection function but also incorporates an automatically detectable computer declaration; but
cannot consist of the mere application of technical protection measures that do not include any declaration, and which are therefore only tacit expressions of will. The presence of technical measures thus does not make TDM activity unlawful per se, but extractions that are incompatible with the technical measure adopted are nevertheless prohibited, since Article 174-ter prohibits circumventing technological protection measures.
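
As an illustration of what a machine-readable reservation can look like in practice, the following minimal Python sketch parses a robots.txt file and checks whether a given crawler is excluded from a page. It is only a sketch under stated assumptions: the crawler name ExampleTDMBot, the sample robots.txt content and the example.com URL are hypothetical, and whether a particular exclusion rule amounts to a valid reservation under Article 70-quater remains a legal question that code cannot settle.

```python
from urllib.robotparser import RobotFileParser


def tdm_reserved(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules exclude this crawler from the URL.

    Assumption: an exclusion is treated here as a signal that the rights
    holder may have reserved text and data mining; this is not a legal test.
    """
    parser = RobotFileParser()
    # Parse the rules from text so that the sketch needs no network access.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: one named crawler is excluded, all others allowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleTDMBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.com/articles/1"
    for agent in ("ExampleTDMBot", "OtherBot"):
        print(agent, "excluded:", tdm_reserved(SAMPLE_ROBOTS_TXT, agent, page))
```

With the sample rules above, the check should report ExampleTDMBot as excluded and OtherBot as allowed. A reservation placed in a website's terms or in a contract would not be detected by a check of this kind, so it can only be one part of the verification.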

Retention of copies after the conclusion of data mining

A further problematic issue concerns the retention of copies after data mining has concluded. In this respect, para. 2 of Art. 70-quater provides that reproductions and extractions “may be retained for as long as necessary for the purposes of text and data mining.” This is because a copy ceases to serve text or data mining once the mining has been accomplished. Therefore, copies may not be retained for purposes beyond TDM, such as verifying and demonstrating the results achieved.

Some scholars, however, argue that reproductions for data mining can also be kept for as long as it takes to train AI systems. In this respect, it would need to be checked on a case-by-case basis whether AI training constitutes text and data mining or whether, instead, it is an activity subsequent to it. Only in the former case could copies be retained during the AI training phase.

Article 70-quater, however, does not regulate the reproductions and further uses needed to exploit the text and data extracted through computational analysis, namely the use that AI systems could potentially make of them. On this point, some scholars have noted that the use of the result of data mining could be conditioned on the permission of the owner of the rights to the analyzed content.

When only the form or a portion of it is extracted with data mining, it must be verified whether the extracted and reused fragments constitute independently creative, and therefore protected, portions. On this question, some scholars believe that the use of creative fragments does not interfere with copyright when the original meaning given to them by the author is no longer understandable, for example because the fragments are unrecognizable in the new context.

Therefore, developers who intend to use copyrighted works to train a generative AI system will need to follow three steps:

obtain legitimate access to the data;
verify that the rights holders have not reserved the right to make reproductions for TDM purposes;
keep the copies made only as long as necessary for TDM purposes.

Clearly, it is important to monitor future case law to understand how these requirements will be applied in practice.
