The rapid advancement of Artificial Intelligence (AI) has increased the need for large-scale datasets and software tools to train AI models. Developers and researchers rely on open-source licenses to use data, develop models, and share research results. Choosing a license suitable for AI model training, however, requires weighing a range of legal and technical factors. Open source does not mean that a work can be used without conditions; the terms of use and legal constraints must be clearly understood. This column examines which open-source licenses are appropriate for AI training and the key factors to consider when selecting one.
Open-source licenses permit the use, modification, and distribution of software or data. They are designed to protect the rights of the copyright holder while still letting users share code or data. These licenses fall broadly into two types: copyleft and permissive. Both grant freedom to use the software, but they differ significantly in the scope and conditions of that freedom. Let's take a closer look at the main features of each type and the differences between them.
Copyleft licenses guarantee the free use of software and include strong provisions to preserve that freedom downstream. Anyone who distributes the software, or a modified version of it, must do so under the same license terms and make the source code available to recipients. In this way, copyleft licenses enforce the continued openness of the software.
Key Features:
Representative Copyleft Licenses:
Permissive licenses give users maximum freedom by keeping restrictions on use and distribution to a minimum. They set highly flexible conditions for use, modification, and distribution, and they do not require modified versions to be distributed under the original license.
Key Features:
Representative Permissive Licenses:
When selecting a license for AI training, the following factors should be considered:
Here are some of the key open-source licenses frequently used in AI training. However, whether the use of works under these licenses for AI training constitutes copyright infringement will be determined by the courts. Particularly in the U.S., if the content generated by the AI model is not sufficiently transformed from the original copyrighted work, data training may be considered copyright infringement.
As AI development progresses, generative AI models are gaining attention. These models generate text, images, music, and other content, and have driven significant advances, particularly in natural language processing and computer vision. Training them requires large amounts of data, much of which is protected by copyright. Copyrighted sources typically go unattributed because the training process is closed and because generated output is not directly linked to the original works. For example, even open-source AI models such as Meta's LLaMA do not fully disclose the datasets used for training.
However, if the original copyrighted work is only analyzed, or patterns are learned from it, during training, and the generated output is not directly similar to the original, the lack of attribution may not amount to copyright infringement. This argument is under discussion in the U.S., and whether it is accepted will depend on court rulings.
In contrast, Europe has taken a clearer stance on disclosure through the EU AI Act. Providers of general-purpose AI models, including generative AI models, must publish a sufficiently detailed summary of the content used to train them, which covers copyrighted material. The aim is to make it possible to check that training data was used in compliance with copyright law. AI developers must therefore document and disclose information about the copyrighted content used during training.
On December 27, 2023, South Korea's Ministry of Culture, Sports and Tourism and the Korea Copyright Commission released guidelines on generative AI and copyright. The guidelines are addressed to AI businesses, copyright holders, and users, and aim to prevent copyright disputes arising from the use of large datasets to generate content.
From the AI Business Perspective:
From the Copyright Holder’s Perspective:
From the AI User’s Perspective:
Although these guidelines are not yet legally binding, the AI industry has expressed concern that they could burden new business initiatives. The Super AI Promotion Council views them as potentially limiting AI training and calls for a new legal framework that balances copyright protection with AI development.
With the advancement of AI technology, the use of large datasets for AI model training has become essential. These datasets are often provided as open-source, allowing developers and researchers to train AI models using them. However, whether using copyrighted material in AI model training constitutes copyright infringement remains an unclear area. Nevertheless, open-source licenses explicitly define the rights associated with the use of works, and materials provided under these licenses can be used freely as long as the terms are followed. Particularly, permissive licenses (like MIT, Apache 2.0) impose very few restrictions on use, modification, and distribution, making it less likely that legal issues will arise when using them for AI model training.
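As a rough illustration of the distinction drawn above, the following Python sketch groups training sources by the license each one declares, so that anything outside a known permissive or copyleft license is set aside for manual review. The license identifiers are standard SPDX identifiers, but the two-bucket grouping, the `Source` type, and the dataset names are assumptions made for this example. The sketch is a bookkeeping aid, not a legal determination.

```python
from dataclasses import dataclass

# Simplified grouping for illustration only; real license review is more nuanced.
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause"}
COPYLEFT = {"GPL-3.0-only", "AGPL-3.0-only"}


@dataclass(frozen=True)
class Source:
    name: str        # hypothetical dataset or component name
    license_id: str  # SPDX identifier declared by the source


def classify(license_id: str) -> str:
    """Return the license family, or 'unknown' if it is not in either set."""
    if license_id in PERMISSIVE:
        return "permissive"
    if license_id in COPYLEFT:
        return "copyleft"
    return "unknown"


def partition(sources):
    """Group source names by license family so unknown ones get manual review."""
    groups = {"permissive": [], "copyleft": [], "unknown": []}
    for source in sources:
        groups[classify(source.license_id)].append(source.name)
    return groups


# Hypothetical manifest of training sources.
sources = [
    Source("corpus-a", "MIT"),
    Source("corpus-b", "Apache-2.0"),
    Source("corpus-c", "GPL-3.0-only"),
    Source("corpus-d", "LicenseRef-custom"),
]
groups = partition(sources)
print(groups)
```

Sources in the copyleft and unknown groups are not necessarily unusable. They are simply the ones whose terms need to be read closely before the data or code is used in training or redistributed.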
Furthermore, determining whether copyright infringement occurs during AI model training requires distinguishing between merely drawing on learned data and directly replicating or reproducing it. AI models typically learn patterns or statistical characteristics from datasets and use them to generate new outputs, rather than copying the content or form of copyrighted works. On this view, the process differs from creating a "derivative work" under copyright law and arguably does not amount to direct replication of existing works.
Moreover, if the content generated by the AI model does not substantially resemble or replicate the original work, it may fall outside the scope of copyright infringement as defined by copyright law. Even if new outputs are generated based on learned data, as long as they do not meet the "substantial similarity" standard, they are less likely to be considered copyright infringement. This is particularly relevant in U.S. legal precedent, where the "idea-expression dichotomy" principle holds that ideas themselves, as opposed to their specific expressions, are not protected by copyright.
This is, however, a personal view. It should be read with caution until courts in each jurisdiction have ruled on the question.
AI developers and businesses need to understand the differences between copyleft and permissive licenses and choose the license that matches the intended use and distribution plans for their AI models. Weighing factors such as commercial use, data openness, and patent protection is crucial to selecting a license that supports AI development while limiting legal risk. Copyright in AI training is a complex issue, so it is essential to keep monitoring the legal standards and guidelines of each country and to respond accordingly.