Legal Treatment of Crowd-Trained AI Models and Shared Derivation Data
I. Overview
Crowd-trained AI models are AI systems trained on datasets collected from multiple contributors, often through crowdsourcing platforms. Examples include:
- Text corpora contributed by users
- Annotated images for object recognition
- Public or semi-public datasets aggregated for machine learning
Shared derivation data refers to datasets where multiple contributors’ original works are combined or transformed into AI training data.
Key legal issues:
- Copyright ownership of training data
- Rights in the resulting AI-generated output
- Licensing and consent from contributors
- Liability for infringement if AI reproduces protected works
- Derivative works issues
II. U.S. Legal Principles
- Copyright protection for contributions – 17 U.S.C. § 102 (original works of authorship)
- Fair use doctrine – 17 U.S.C. § 107
- The central defense raised when AI models are trained on copyrighted works.
- Work-for-hire doctrine – 17 U.S.C. § 101
- Crowdsourced contributors are rarely employees, so work-for-hire generally does not apply absent a signed written agreement transferring rights to the platform.
III. Key Case Law
1. Authors Guild v. Google, Inc. (Google Books, 2015)
- Issue: Google scanned millions of books to create a searchable full-text database; authors claimed copyright infringement.
- Holding: Court ruled it was transformative fair use.
- Relevance: AI models trained on crowdsourced or derivative datasets may be considered fair use if the training is transformative and the model does not serve as a market substitute for the original works.
2. Thaler v. Perlmutter (2023)
- Issue: Whether a work generated autonomously by an AI system, with no human author, can be registered for copyright.
- Holding: No; copyright requires human authorship, so AI alone cannot hold copyright.
- Relevance: For crowd-trained AI models, the human contributors and platform operators may hold rights, but AI output itself cannot be copyrighted.
3. Naruto v. Slater (2018)
- Issue: Non-human authorship claim (a monkey selfie).
- Holding: The Ninth Circuit held that animals lack statutory standing under the Copyright Act; non-humans cannot own copyright.
- Relevance: Reinforces that crowd data contributors are the legal authors of underlying works, even if the AI processes their input.
4. Authors Guild v. HathiTrust (2014)
- Issue: Universities digitized books for full-text search and for accessibility to print-disabled readers.
- Holding: The Second Circuit found both uses to be fair use.
- Implication: Training AI on datasets compiled from multiple sources for research or analysis may be protected as fair use, provided the output does not substitute for the original works.
5. Anderson v. Stallone (1989) – Derivative Works Standard
- Issue: An unauthorized treatment built on copyrighted characters from the Rocky films.
- Holding: An unauthorized derivative work infringes, and the infringing material receives no copyright protection of its own.
- Relevance: Crowd-trained AI models may inadvertently reproduce copyrighted material, creating derivative works. Platforms must ensure licensing of contributors’ data.
6. Warhol Foundation v. Goldsmith (2023)
- Issue: Whether Warhol's silkscreen adaptation of Goldsmith's photograph, licensed for magazine use, was a transformative fair use.
- Holding: The Supreme Court held the first fair-use factor weighed against Warhol; a new meaning or message does not immunize a use that serves substantially the same commercial purpose as the original.
- Relevance: AI models trained on crowd data must avoid reproducing distinctive expressive elements without permission.
7. Google v. Oracle (2021)
- Issue: API copyrightability and reuse of Java code.
- Holding: Copying the declaring code programmers needed for interoperability was fair use.
- Relevance: Legal precedent for reuse of functional elements of crowd datasets for training AI models—functional data may be treated differently from expressive content.
IV. Legal Principles for Crowd-Trained AI
| Legal Principle | Application to Crowd-Trained AI |
|---|---|
| Copyright Ownership | Contributors retain rights unless explicitly licensed to platform. |
| Fair Use | Training may qualify as transformative if model does not reproduce expressive content. |
| Derivative Works | AI output resembling underlying works may be infringing; must monitor for replication. |
| Contract & Licensing | Contributor agreements must clearly assign rights or permit AI training. |
| Human Oversight | Even if AI produces output, liability lies with human operators for infringement. |
V. International Perspectives
A. European Union
- Directive (EU) 2019/790 (DSM Directive) – Article 17 makes platforms liable for unauthorized content; Articles 3–4 create text-and-data-mining exceptions, with Article 4 allowing rightsholders to opt out, so AI training on copyrighted material must respect rights reservations and licensing.
- Database Directive (96/9/EC) – Protects investment in obtaining, verifying, or presenting data, which is relevant for shared derivation datasets.
B. United Kingdom – Computer-Generated Works (CDPA 1988, s. 9(3))
- The author of a computer-generated work is the person by whom the arrangements necessary for its creation are undertaken.
- Crowd-trained AI operators may be deemed authors of AI output, if significant human arrangement occurs.
VI. Practical Implications for Platforms Using Crowd-Trained Data
- Explicit licensing from contributors
- Platforms must secure rights for training and derivative output.
- Data audit & provenance tracking
- Identify which contributions are copyrighted or restricted.
- Avoid reproduction of expressive content
- Apply filtering to prevent infringing outputs.
- Fair use defense is context-dependent
- Research, analysis, and transformative training may be protected, but commercial output is riskier.
- Transparency and consent
- Contributors should be informed that their data may be used for AI training.
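The provenance-tracking and licensing points above can be sketched as a minimal data model. This is an illustrative sketch only; the schema, the license identifiers, and the function names (`ContributionRecord`, `eligible_for_training`, `build_training_set`) are hypothetical, not an established standard, and a real platform's eligibility rules would turn on its actual contributor agreements.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContributionRecord:
    """Provenance metadata for one crowdsourced contribution (hypothetical schema)."""
    item_id: str
    contributor_id: str
    license: str            # e.g. "CC-BY-4.0", "all-rights-reserved"
    training_consent: bool  # contributor explicitly permitted AI training

# Licenses this hypothetical platform treats as permitting model training.
TRAINING_OK_LICENSES = {"CC0-1.0", "CC-BY-4.0", "platform-training-license"}

def eligible_for_training(record: ContributionRecord) -> bool:
    """A contribution enters the training set only if the contributor
    consented AND the attached license permits training."""
    return record.training_consent and record.license in TRAINING_OK_LICENSES

def build_training_set(records):
    """Partition contributions into (eligible, excluded) so the exclusion
    list can be retained for audit and provenance records."""
    eligible = [r for r in records if eligible_for_training(r)]
    excluded = [r for r in records if not eligible_for_training(r)]
    return eligible, excluded
```

Keeping the excluded list, rather than silently dropping items, supports the audit-trail point above: a platform can show which contributions were withheld from training and why.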
VII. Summary Table of Cases and Lessons
| Case | Year | Key Point for Crowd-Trained AI |
|---|---|---|
| Authors Guild v. Google | 2015 | Training on large dataset may be fair use if transformative |
| Thaler v. Perlmutter | 2023 | AI cannot hold copyright; humans remain authors |
| Naruto v. Slater | 2018 | Non-humans cannot hold copyright |
| Anderson v. Stallone | 1989 | Unauthorized derivative works infringe copyright |
| Warhol v. Goldsmith | 2023 | Transformative claims fail where the use shares the original's commercial purpose |
| Google v. Oracle | 2021 | Functional reuse may qualify as fair use |
| Authors Guild v. HathiTrust | 2014 | Research-focused digitization is fair use; supports training models |
Conclusion:
Crowd-trained AI models occupy a complex legal space. Key principles:
- Human contributors retain copyright; AI output alone cannot be copyrighted.
- Fair use may protect training, especially for research or transformative purposes.
- Derivative works risks require careful monitoring of AI output.
- Licensing agreements and contracts are critical for shared data.
- International law varies, with stricter liability in the EU than the U.S.
