Open Source Licenses for AI Training

The rapid advancement of Artificial Intelligence (AI) has increased the need for large-scale datasets and software to train AI models. Developers and researchers rely on open-source licenses to use data, develop models, and share research results. Choosing a license suitable for AI model training, however, requires weighing legal and technical factors. Open source does not mean that works can be used without conditions; the terms of use and legal constraints must be clearly understood. This column examines the open-source licenses appropriate for AI training and the key factors to consider when selecting one.

Basic Understanding of Open Source Licenses

Open-source licenses allow the use, modification, and distribution of software or data. They are designed to protect the rights of the copyright holder while enabling users to share code or data. These licenses can be broadly divided into two types: copyleft and permissive. Both types grant freedom to use the software, but they differ significantly in the scope and conditions of that freedom. Let's take a closer look at the main features of each.

Copyleft Licenses

Copyleft licenses ensure the free use of software while including strong provisions to preserve that freedom. Anyone who passes the software on to other developers or users, modified or not, must do so under the same license conditions. In this way, copyleft licenses enforce the continued openness of the software.

Key Features:

Guarantee of Freedom: Copyleft licenses ensure that anyone can freely use the software, including modifying, redistributing, and using it commercially.
Obligation to Apply the Same License: Any modified version of the software or its derivatives must be distributed under the same license as the original. For instance, software under the GNU General Public License (GPL) must remain under GPL even after modification or extension.
Mandatory Sharing: When the software is distributed, whether modified or not, the corresponding source code must be made available to the recipients, which promotes shared software development.

Representative Copyleft Licenses:

GNU General Public License (GPL): A representative of copyleft licenses, ensuring the free use and sharing of software while enforcing the maintenance of that freedom.
GNU Lesser General Public License (LGPL): A variant of the GPL, mainly used for libraries. Software that merely uses an LGPL library does not have to be distributed under the GPL, although modifications to the library itself remain subject to the LGPL.

Permissive Licenses

Permissive licenses offer maximum freedom to the user while minimizing restrictions on software use and distribution. Software under permissive licenses provides highly flexible conditions for use, modification, and distribution, without the requirement to distribute the modified software under the original license.

Key Features:

Broad Freedom: Permissive licenses allow the software to be used in almost any manner, including modification and integration into closed-source software.
No Obligation to Apply the Same License: Users are not required to distribute modified software under the same permissive license. For example, software under the MIT License can be modified and then redistributed under a closed-source license.
Simple Requirements: Permissive licenses primarily require the retention of copyright notices and disclaimers. Users must maintain the original copyright holder’s name and license information, but there are few other restrictions.

Representative Permissive Licenses:

MIT License: One of the most widely used permissive licenses, offering minimal restrictions on software use, modification, and distribution.
Apache License 2.0: A permissive license with patent clauses, providing enhanced legal protection while ensuring the freedom of use.
BSD License: A permissive license with very simple conditions, allowing for free use and distribution.
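
The distinction between the two families can be sketched as a small lookup in Python. This is a minimal illustration, not an authoritative classification: the table, the enum, and the function names are assumptions made for this example, the keys are SPDX identifiers for the licenses named above, and the LGPL is grouped with the copyleft licenses even though software that merely uses an LGPL library is treated more leniently.

```python
from enum import Enum


class LicenseFamily(Enum):
    COPYLEFT = "copyleft"
    PERMISSIVE = "permissive"
    UNKNOWN = "unknown"


# Hypothetical lookup table keyed by SPDX identifier (illustrative only).
LICENSE_FAMILIES = {
    "GPL-3.0-only": LicenseFamily.COPYLEFT,
    # Simplification: changes to the library itself stay under the LGPL.
    "LGPL-3.0-only": LicenseFamily.COPYLEFT,
    "MIT": LicenseFamily.PERMISSIVE,
    "Apache-2.0": LicenseFamily.PERMISSIVE,
    "BSD-3-Clause": LicenseFamily.PERMISSIVE,
}


def classify_license(spdx_id: str) -> LicenseFamily:
    """Return the license family, or UNKNOWN if the identifier is not listed."""
    return LICENSE_FAMILIES.get(spdx_id.strip(), LicenseFamily.UNKNOWN)


def same_license_required(spdx_id: str) -> bool:
    """True if modified versions must be distributed under the same license."""
    family = classify_license(spdx_id)
    if family is LicenseFamily.UNKNOWN:
        # Refuse to guess: an unlisted license needs a manual review.
        raise ValueError(f"unlisted license: {spdx_id!r}")
    return family is LicenseFamily.COPYLEFT
```

Raising an error for unlisted identifiers is a deliberate choice in this sketch, so that an unfamiliar license is reviewed by a person rather than silently treated as permissive.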

Considerations When Choosing a License

When selecting a license for AI training, the following factors should be considered:

Openness: The choice of license depends on whether you intend to open AI models or data to the public or keep them closed. Copyleft licenses (like GPL) impose an obligation to make the source public, whereas permissive licenses (like MIT, Apache 2.0) offer more flexibility.
Commercial Use: If you plan to use AI models commercially, permissive licenses are generally more suitable. Copyleft licenses often require the disclosure of source code, which makes it difficult to protect trade secrets.
Patent Protection: If patent-related protection is needed, consider licenses like Apache License 2.0, which includes patent clauses.
Dataset Usage: The license of a dataset directly affects whether and how it can be used for AI training. Licenses commonly applied to data, such as CC-BY, CC0, and PDDL, are advantageous for freely using and sharing datasets.
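
The four considerations above can be written down as a rough rule-of-thumb helper in Python. This is only a sketch of the reasoning in this section, not a substitute for reading the licenses themselves: the `Requirements` fields, the function name, and the ordering of the rules are assumptions made for illustration, and the return values are SPDX identifiers for the licenses mentioned in this column.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Requirements:
    """Hypothetical summary of what a project needs from a license."""
    is_dataset: bool = False
    derivatives_must_stay_open: bool = False
    closed_commercial_use: bool = False
    needs_patent_grant: bool = False


def suggest_licenses(req: Requirements) -> List[str]:
    """Map the considerations above to candidate SPDX identifiers."""
    if req.derivatives_must_stay_open and req.closed_commercial_use:
        # Copyleft obligations and closed distribution pull in opposite directions.
        raise ValueError("open derivatives and closed commercial use conflict")
    if req.is_dataset:
        return ["CC-BY-4.0", "CC0-1.0", "PDDL-1.0"]
    if req.derivatives_must_stay_open:
        return ["GPL-3.0-only"]
    if req.needs_patent_grant:
        return ["Apache-2.0"]
    return ["MIT", "Apache-2.0", "BSD-3-Clause"]
```

For example, `suggest_licenses(Requirements(needs_patent_grant=True))` narrows the permissive candidates down to Apache License 2.0, mirroring the patent protection point above.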

Frequently Used Open Source Licenses for AI Training

Here are some of the key open-source licenses frequently used in AI training. Note, however, that whether using works under these licenses for AI training constitutes copyright infringement is ultimately for the courts to decide. In the U.S. in particular, training on the data may be considered copyright infringement if the content generated by the AI model is not sufficiently transformative of the original copyrighted work.

MIT License: A widely used permissive license that grants nearly all rights to the user, including the ability to copy, modify, and distribute software. The original copyright notice and disclaimer must be retained, but commercial use is freely permitted. Its minimal restrictions make the MIT License well suited to AI model training.
Apache License 2.0: A permissive license with patent clauses, enhancing legal protection. It’s suitable for AI model development when patent protection is necessary or when using the model for commercial purposes.
Creative Commons Attribution (CC-BY): Primarily used for datasets, allowing free use of data with the condition of attributing the author. Commercial use is allowed, and redistribution requires only attribution.
Creative Commons Zero (CC0): The most open of the Creative Commons tools. It is a public domain dedication rather than a license with conditions, allowing unrestricted use of data and code.
Microsoft Open Use of Data Agreement (MS O-UDA): A license created by Microsoft that allows free use of data for specific purposes, such as AI training, while requiring compliance with certain restrictions.
Community Data License Agreement – Permissive (CDLA-Permissive): Allows the free use of data, with the only condition being attribution. It is advantageous for AI training as it explicitly permits the use of data for various purposes, including commercial use.
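
In practice, a training pipeline can screen its sources against a list of accepted licenses and collect the attribution lines it owes. The following Python sketch assumes a hypothetical `Source` record and a hypothetical policy table; the attribution flags follow the summaries in this list and should be checked against the actual license texts before use.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Source:
    """Hypothetical description of one training data source."""
    name: str
    license_id: str  # SPDX identifier
    author: str


# Hypothetical policy: accepted license -> whether attribution must be kept.
NEEDS_ATTRIBUTION = {
    "MIT": True,
    "Apache-2.0": True,
    "CC-BY-4.0": True,
    "CC0-1.0": False,
}


def screen_sources(
    sources: Iterable[Source],
) -> Tuple[List[Source], List[Source], List[str]]:
    """Split sources into accepted and rejected, and build attribution lines."""
    accepted: List[Source] = []
    rejected: List[Source] = []
    credits: List[str] = []
    for source in sources:
        if source.license_id not in NEEDS_ATTRIBUTION:
            rejected.append(source)  # not on the allowlist: review manually
            continue
        accepted.append(source)
        if NEEDS_ATTRIBUTION[source.license_id]:
            credits.append(f"{source.name} by {source.author} ({source.license_id})")
    return accepted, rejected, credits
```

Anything outside the table lands in the rejected list instead of being dropped silently, so that a person can decide whether the license is acceptable.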

Issues of Copyright Attribution in AI Training

As AI development progresses, generative AI models are gaining attention. These models are used to generate text, images, music, and more, and have driven significant advances, particularly in natural language processing and computer vision. Training them requires large amounts of data, much of which is protected by copyright. Copyright attribution is often missing because the training process is closed and the generated output is not directly linked to the original copyrighted works. For example, even open-source AI models like Meta's LLaMA do not disclose the datasets used for training.

However, if the original copyrighted work is merely analyzed, or patterns are learned from it, during training, and the final output is not directly similar to the original, the lack of attribution may not amount to copyright infringement. This argument is currently being debated in the U.S., and its outcome will depend on court rulings.

In contrast, Europe has taken a clearer stance on transparency through the EU AI Act. Providers of general-purpose AI models, including generative AI models, must publish a sufficiently detailed summary of the content used for training and put in place a policy to comply with EU copyright law. The aim is to make it possible to check that training data does not violate copyright. AI developers must therefore document the content used during training and disclose information about it.

AI and Copyright Law in Korea

On December 27, 2023, the Ministry of Culture, Sports and Tourism and the Korea Copyright Commission released guidelines on generative AI copyright. These guidelines aim to prevent copyright disputes arising from the use of large datasets to generate content, targeting AI businesses, copyright holders, and users.

From the AI Business Perspective:

When using copyrighted works during AI model training, AI businesses should obtain clear permission for the use or ensure that the use qualifies as fair use.
Filtering measures are recommended to prevent AI-generated output from being similar to existing works, and clear contracts between service providers are advised to clarify liability.

From the Copyright Holder’s Perspective:

If a copyright holder does not want their work used in AI training, they can specify this in terms of service or implement measures like the robots.txt standard to block it.
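
How a crawler can honor such a robots.txt rule can be sketched with Python's standard `urllib.robotparser` module. The crawler name `ExampleAIBot`, the sample rules, and the helper `may_collect` are assumptions made for this illustration; real crawlers publish their own user-agent names, and robots.txt only works when the crawler chooses to respect it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_collect(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(may_collect(ROBOTS_TXT, "ExampleAIBot", page))
    print(may_collect(ROBOTS_TXT, "OtherBot", page))
```

A data collection pipeline could call a check like this before downloading each page and skip any URL for which it returns `False`.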

From the AI User’s Perspective:

Users should be cautious when AI-generated output is similar to existing works, as copyright infringement issues may arise during public performance, exhibition, or distribution. Attribution is also required.

While these guidelines are not yet legally binding, the AI industry has expressed concern that they could burden new business initiatives. The Super AI Promotion Council views them as potentially limiting AI training and calls for a new legal framework that balances copyright protection and AI development.

Personal View and Conclusion

With the advancement of AI technology, the use of large datasets for AI model training has become essential. These datasets are often provided as open-source, allowing developers and researchers to train AI models using them. However, whether using copyrighted material in AI model training constitutes copyright infringement remains an unclear area. Nevertheless, open-source licenses explicitly define the rights associated with the use of works, and materials provided under these licenses can be used freely as long as the terms are followed. Particularly, permissive licenses (like MIT, Apache 2.0) impose very few restrictions on use, modification, and distribution, making it less likely that legal issues will arise when using them for AI model training.

Furthermore, determining whether copyright infringement occurs during AI model training requires distinguishing between merely referencing learned data and directly replicating or reproducing it. AI models typically operate by learning patterns or statistical characteristics from datasets to generate new outputs rather than replicating the content or form of copyrighted works. Arguably, this process differs from creating a "derivative work" under copyright law and does not amount to direct replication of existing works.

Moreover, if the content generated by the AI model does not substantially resemble or replicate the original work, it may fall outside the scope of copyright infringement as defined by copyright law. Even if new outputs are generated based on learned data, as long as they do not meet the "substantial similarity" standard, they are less likely to be considered copyright infringement. This is particularly relevant in U.S. legal precedent, where the "idea-expression dichotomy" principle holds that ideas themselves, as opposed to their specific expressions, are not protected by copyright.

However, this is a personal view and should be interpreted with caution, awaiting further court decisions in each jurisdiction.

It is important for AI developers and businesses to understand the differences between copyleft and permissive licenses and choose the license that aligns with the intended use and distribution plans for their AI models. Considering various factors, such as commercial use, data openness, and patent protection, is crucial in selecting the best license to ensure both the development and legal safety of AI. The issue of copyright in AI training is complex, and it is essential to continuously monitor the legal standards and guidelines of each country and respond accordingly.
