Legal Treatment of Crowd-Trained AI Models and Shared Derivation Data
I. Overview
Crowd-trained AI models are AI systems trained on datasets collected from multiple contributors, often through crowdsourcing platforms. Examples include:
- Text corpora contributed by users
- Annotated images for object recognition
- Public or semi-public datasets aggregated for machine learning
Shared derivation data refers to datasets where multiple contributors’ original works are combined or transformed into AI training data.
Key legal issues:
- Copyright ownership of training data
- Rights in the resulting AI-generated output
- Licensing and consent from contributors
- Liability for infringement if AI reproduces protected works
- Derivative works issues
II. U.S. Legal Principles
- Copyright protection for contributions – 17 U.S.C. § 102 (original works of authorship)
- Fair use doctrine – 17 U.S.C. § 107
- The central defense raised when AI models are trained on copyrighted works.
- Work-for-hire doctrine – 17 U.S.C. § 101
- Crowdsourced contributors are rarely employees, so work-for-hire generally does not apply absent a signed written agreement transferring rights to the platform.
III. Key Case Law
1. Authors Guild v. Google, Inc. (Google Books, 2015)
- Issue: Google scanned millions of books to create a searchable full-text database; authors claimed copyright infringement.
- Holding: Court ruled it was transformative fair use.
- Relevance: AI models trained on crowdsourced or derivative datasets may be considered fair use if the training is transformative and the model does not serve as a market substitute for the original works.
2. Thaler v. Perlmutter (2023)
- Issue: Whether a work generated autonomously by an AI system, with no human author, can be registered for copyright.
- Holding: No; copyright requires human authorship, so AI alone cannot hold copyright.
- Relevance: For crowd-trained AI models, the human contributors and platform operators may hold rights, but AI output itself cannot be copyrighted.
3. Naruto v. Slater (2018)
- Issue: Non-human authorship claim (a monkey selfie).
- Holding: The Ninth Circuit held that animals lack statutory standing under the Copyright Act; non-humans cannot own copyright.
- Relevance: Reinforces that crowd data contributors are the legal authors of underlying works, even if the AI processes their input.
4. Authors Guild v. HathiTrust (2014)
- Issue: Universities digitized books for full-text search and for accessibility to print-disabled readers.
- Holding: The Second Circuit found both uses to be fair use.
- Implication: Training AI on datasets compiled from multiple sources for research or analysis may be protected as fair use, provided the output does not substitute for the original works.
5. Anderson v. Stallone (1989) – Derivative Works Standard
- Issue: An unauthorized treatment built on copyrighted characters from the Rocky films.
- Holding: An unauthorized derivative work infringes, and the infringing material receives no copyright protection of its own.
- Relevance: Crowd-trained AI models may inadvertently reproduce copyrighted material, creating derivative works. Platforms must ensure licensing of contributors’ data.
6. Warhol Foundation v. Goldsmith (2023)
- Issue: Whether Warhol's silkscreen adaptation of Goldsmith's photograph, licensed for magazine use, was a transformative fair use.
- Holding: The Supreme Court held the first fair-use factor weighed against Warhol; a new meaning or message does not immunize a use that serves substantially the same commercial purpose as the original.
- Relevance: AI models trained on crowd data must avoid reproducing distinctive expressive elements without permission.
7. Google v. Oracle (2021)
- Issue: API copyrightability and reuse of Java code.
- Holding: Copying the declaring code programmers needed for interoperability was fair use.
- Relevance: Legal precedent for reuse of functional elements of crowd datasets for training AI models—functional data may be treated differently from expressive content.
IV. Legal Principles for Crowd-Trained AI
| Legal Principle | Application to Crowd-Trained AI |
|---|---|
| Copyright Ownership | Contributors retain rights unless explicitly licensed to platform. |
| Fair Use | Training may qualify as transformative if model does not reproduce expressive content. |
| Derivative Works | AI output resembling underlying works may be infringing; must monitor for replication. |
| Contract & Licensing | Contributor agreements must clearly assign rights or permit AI training. |
| Human Oversight | Even if AI produces output, liability lies with human operators for infringement. |
V. International Perspectives
A. European Union
- Directive (EU) 2019/790 (DSM Directive) – Article 17 makes platforms liable for unauthorized content; Articles 3–4 create text-and-data-mining exceptions, with Article 4 allowing rightsholders to opt out, so AI training on copyrighted material must respect rights reservations and licensing.
- Database Directive (96/9/EC) – Protects investment in obtaining, verifying, or presenting data, which is relevant for shared derivation datasets.
B. United Kingdom – Computer-Generated Works (CDPA 1988, s. 9(3))
- The author of a computer-generated work is the person by whom the arrangements necessary for its creation are undertaken.
- Crowd-trained AI operators may be deemed authors of AI output, if significant human arrangement occurs.
VI. Practical Implications for Platforms Using Crowd-Trained Data
- Explicit licensing from contributors
- Platforms must secure rights for training and derivative output.
- Data audit & provenance tracking
- Identify which contributions are copyrighted or restricted.
- Avoid reproduction of expressive content
- Apply filtering to prevent infringing outputs.
- Fair use defense is context-dependent
- Research, analysis, and transformative training may be protected, but commercial output is riskier.
- Transparency and consent
- Contributors should be informed that their data may be used for AI training.
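The provenance-tracking and licensing points above can be sketched as a minimal data model. This is an illustrative sketch only; the schema, the license identifiers, and the function names (`ContributionRecord`, `eligible_for_training`, `build_training_set`) are hypothetical, not an established standard, and a real platform's eligibility rules would turn on its actual contributor agreements.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContributionRecord:
    """Provenance metadata for one crowdsourced contribution (hypothetical schema)."""
    item_id: str
    contributor_id: str
    license: str            # e.g. "CC-BY-4.0", "all-rights-reserved"
    training_consent: bool  # contributor explicitly permitted AI training

# Licenses this hypothetical platform treats as permitting model training.
TRAINING_OK_LICENSES = {"CC0-1.0", "CC-BY-4.0", "platform-training-license"}

def eligible_for_training(record: ContributionRecord) -> bool:
    """A contribution enters the training set only if the contributor
    consented AND the attached license permits training."""
    return record.training_consent and record.license in TRAINING_OK_LICENSES

def build_training_set(records):
    """Partition contributions into (eligible, excluded) so the exclusion
    list can be retained for audit and provenance records."""
    eligible = [r for r in records if eligible_for_training(r)]
    excluded = [r for r in records if not eligible_for_training(r)]
    return eligible, excluded
```

Keeping the excluded list, rather than silently dropping items, supports the audit-trail point above: a platform can show which contributions were withheld from training and why.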
VII. Summary Table of Cases and Lessons
| Case | Year | Key Point for Crowd-Trained AI |
|---|---|---|
| Authors Guild v. Google | 2015 | Training on large dataset may be fair use if transformative |
| Thaler v. Perlmutter | 2023 | AI cannot hold copyright; humans remain authors |
| Naruto v. Slater | 2018 | Non-humans cannot hold copyright |
| Anderson v. Stallone | 1989 | Unauthorized derivative works infringe copyright |
| Warhol v. Goldsmith | 2023 | Transformative claims fail where the use shares the original's commercial purpose |
| Google v. Oracle | 2021 | Functional reuse may qualify as fair use |
| Authors Guild v. HathiTrust | 2014 | Research-focused digitization is fair use; supports training models |
Conclusion:
Crowd-trained AI models occupy a complex legal space. Key principles:
- Human contributors retain copyright; AI output alone cannot be copyrighted.
- Fair use may protect training, especially for research or transformative purposes.
- Derivative works risks require careful monitoring of AI output.
- Licensing agreements and contracts are critical for shared data.
- International law varies, with stricter liability in the EU than the U.S.
