Legal Treatment of Crowd-Trained AI Models and Shared Derivation Data

I. Overview

Crowd-trained AI models are AI systems trained on datasets collected from multiple contributors, often through crowdsourcing platforms. Examples include:

  • Text corpora contributed by users
  • Annotated images for object recognition
  • Public or semi-public datasets aggregated for machine learning

Shared derivation data refers to datasets where multiple contributors’ original works are combined or transformed into AI training data.

Key legal issues:

  1. Copyright ownership of training data
  2. Rights in the resulting AI-generated output
  3. Licensing and consent from contributors
  4. Liability for infringement if AI reproduces protected works
  5. Derivative works issues

II. U.S. Legal Principles

  1. Copyright protection for contributions – 17 U.S.C. § 101 et seq.
  2. Fair use doctrine – 17 U.S.C. § 107
    • Critical in AI training on copyrighted works.
  3. Work-for-hire doctrine – 17 U.S.C. § 101
    • Crowdsourced contributors may or may not transfer rights to the platform.

III. Key Case Laws

1. Authors Guild v. Google, Inc. (Google Books, 2015)

  • Issue: Google scanned millions of books to create a searchable text database. Authors claimed copyright infringement.
  • Holding: Court ruled it was transformative fair use.
  • Relevance: AI models trained on crowdsourced or derivative datasets may be considered fair use if the training is transformative and does not substitute the original works.

2. Thaler v. Perlmutter (2023)

  • Issue: Can AI be an author of a work?
  • Holding: Only humans can hold copyright; AI alone cannot.
  • Relevance: For crowd-trained AI models, the human contributors and platform operators may hold rights, but AI output itself cannot be copyrighted.

3. Naruto v. Slater (2018)

  • Issue: Non-human authorship claim (a monkey selfie).
  • Holding: Non-humans cannot claim copyright.
  • Relevance: Reinforces that crowd data contributors are the legal authors of underlying works, even if the AI processes their input.

4. Authors Guild v. HathiTrust (2014)

  • Issue: Universities digitized books for research and accessibility.
  • Holding: Court found digitization was fair use, especially for research.
  • Relevance: Training AI on datasets compiled from multiple sources for research or analysis is likely protected under fair use, provided the output is not a replacement for the original works.

5. Anderson v. Stallone (1989) – Derivative Works Standard

  • Issue: Unauthorized use of copyrighted characters and material in a new work.
  • Holding: An unauthorized derivative work infringes the original and receives no copyright protection of its own.
  • Relevance: Crowd-trained AI models may inadvertently reproduce copyrighted material, creating unauthorized derivative works. Platforms must ensure contributors' data is properly licensed.

6. Andy Warhol Foundation v. Goldsmith (2023)

  • Issue: Whether licensing a Warhol silkscreen based on Goldsmith's photograph was fair use.
  • Holding: A claim of transformativeness is not dispositive; where the use shares substantially the same commercial purpose as the original, the first fair use factor weighs against fair use.
  • Relevance: AI models trained on crowd data must avoid reproducing distinctive expressive elements without permission, particularly in commercial output.

7. Google v. Oracle (2021)

  • Issue: API copyrightability and reuse of Java code.
  • Holding: Copying certain elements for interoperability was fair use.
  • Relevance: Legal precedent for reuse of functional elements of crowd datasets for training AI models—functional data may be treated differently from expressive content.

IV. Legal Principles for Crowd-Trained AI

  • Copyright Ownership – Contributors retain rights unless they are explicitly licensed to the platform.
  • Fair Use – Training may qualify as transformative if the model does not reproduce expressive content.
  • Derivative Works – AI output resembling underlying works may be infringing; platforms must monitor for replication.
  • Contract & Licensing – Contributor agreements must clearly assign rights or permit AI training.
  • Human Oversight – Even if AI produces the output, liability for infringement lies with the human operators.

V. International Perspectives

A. European Union

  • Directive (EU) 2019/790 (DSM Directive) – Articles 3–4 create text-and-data-mining exceptions (with a rightsholder opt-out outside scientific research); Article 17 makes platforms liable for unauthorized user-uploaded content. AI training on copyrighted material must respect licensing and any opt-outs.
  • Database Directive (96/9/EC) – Protects substantial investment in obtaining, verifying, or presenting data; relevant for shared derivation datasets.

B. United Kingdom – Computer-Generated Works

  • Section 9(3) of the Copyright, Designs and Patents Act 1988 deems the person who undertakes the arrangements necessary for creating a computer-generated work to be its author.
  • Crowd-trained AI operators may therefore be deemed authors of AI output if significant human arrangement occurs.

VI. Practical Implications for Platforms Using Crowd-Trained Data

  1. Explicit licensing from contributors
    • Platforms must secure rights for training and derivative output.
  2. Data audit & provenance tracking
    • Identify which contributions are copyrighted or restricted.
  3. Avoid reproduction of expressive content
    • Apply filtering to prevent infringing outputs.
  4. Fair use defense is context-dependent
    • Research, analysis, and transformative training may be protected, but commercial output is riskier.
  5. Transparency and consent
    • Contributors should be informed that their data may be used for AI training.
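As an illustration only (not legal advice), steps 1–3 and 5 above might be operationalized in a data pipeline that tracks provenance and excludes contributions lacking a training-compatible license or explicit consent. The field names and license labels below are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical license tags a platform might record per contribution.
ALLOWED_FOR_TRAINING = {"cc0", "cc-by", "contributor-agreement-v2"}

@dataclass
class Contribution:
    contributor_id: str    # provenance: who submitted the work
    license_tag: str       # license granted by the contributor
    consented_to_ai: bool  # explicit consent flag (transparency step)
    text: str

def filter_training_set(contributions):
    """Keep only contributions with a training-compatible license AND
    explicit contributor consent; retain exclusions for audit."""
    kept, excluded = [], []
    for c in contributions:
        if c.license_tag in ALLOWED_FOR_TRAINING and c.consented_to_ai:
            kept.append(c)
        else:
            excluded.append(c)
    return kept, excluded

corpus = [
    Contribution("u1", "cc0", True, "public-domain text"),
    Contribution("u2", "all-rights-reserved", True, "protected text"),
    Contribution("u3", "cc-by", False, "no consent given"),
]
kept, excluded = filter_training_set(corpus)
print([c.contributor_id for c in kept])  # → ['u1']
```

Keeping the excluded records (rather than silently dropping them) supports the data-audit requirement: a platform can show which contributions were withheld from training and why.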

VII. Summary of Cases and Lessons

  • Authors Guild v. Google (2015) – Training on a large dataset may be fair use if transformative.
  • Thaler v. Perlmutter (2023) – AI cannot hold copyright; humans remain the authors.
  • Naruto v. Slater (2018) – Non-humans cannot claim copyright.
  • Anderson v. Stallone (1989) – Unauthorized derivative works infringe copyright.
  • Warhol v. Goldsmith (2023) – Even transformative use can infringe the expressive core.
  • Google v. Oracle (2021) – Functional reuse may qualify as fair use.
  • Authors Guild v. HathiTrust (2014) – Research-focused digitization is fair use; supports training models.

Conclusion:

Crowd-trained AI models occupy a complex legal space. Key principles:

  • Human contributors retain copyright; AI output alone cannot be copyrighted.
  • Fair use may protect training, especially for research or transformative purposes.
  • Derivative-works risks require careful monitoring of AI output.
  • Licensing agreements and contracts are critical for shared data.
  • International law varies, with stricter platform liability in the EU than in the U.S.
