AI Training Datasets and Fair Use Considerations

I. AI Training Datasets: What the Legal Issue Is

1. What Are AI Training Datasets?

AI training datasets are large collections of data—text, images, audio, video, code, or mixed media—used to train machine-learning models. During training:

Works are copied into memory or storage

Data is analyzed statistically

The model learns patterns, correlations, and structures

In principle the model stores learned parameters rather than verbatim copies, but it can sometimes memorize and reproduce parts of its training data
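The steps above can be sketched with a toy example. This is only an illustration, assuming a very simple "model" that counts adjacent word pairs; real systems learn numerical parameters by very different methods. The function name train_bigram_model and the two-sentence corpus are hypothetical.

```python
from collections import Counter


def train_bigram_model(documents):
    """Toy 'training': count adjacent word pairs across all documents.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for text in documents:
        # The whole work is copied into memory at this point.
        words = text.lower().split()
        # Only statistics about word pairs are kept, not the text itself.
        counts.update(zip(words, words[1:]))
    return counts


corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
model = train_bigram_model(corpus)

# The "model" is a table of pattern frequencies.
print(model[("sat", "on")])
print(model[("the", "cat")])
```

The full texts are read in order to build the table, which is the copying step that matters legally, while the result is a set of statistics about the texts. Even in this toy case, enough statistics about a short text could be used to reconstruct it, which mirrors the reproduction risk noted above.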

From a copyright perspective, copying occurs, which triggers the need for:

Permission

A license

Or a legal exception (most commonly fair use)

II. Fair Use Doctrine (Core Legal Framework)

Under U.S. copyright law (17 U.S.C. § 107), fair use is assessed using four statutory factors:

Purpose and character of the use: whether it is commercial or non-commercial, and whether it is transformative or merely substitutes for the original expression

Nature of the copyrighted work: factual works receive weaker protection, highly creative works receive stronger protection

Amount and substantiality of the portion used: both the quantity taken and its qualitative importance to the work as a whole

Effect on the potential market: whether the use substitutes for the original or harms its licensing market

AI training cases revolve around transformative use, non-expressive copying, and market harm.

III. Key Cases (Detailed Analysis)

1. Authors Guild v. Google, Inc. (Google Books Case, 2d Cir. 2015)

Facts:

Google scanned millions of copyrighted books without permission

Full copies were made

Users could only view small text snippets

Books were indexed for searchability

Legal Issue:

Does copying entire copyrighted books for indexing and search constitute fair use?

Court’s Reasoning:

The use was highly transformative

Google was not offering books as substitutes

The purpose was information location, not reading

The full copying was necessary for the transformative purpose

Holding:

✔ Fair use

Importance for AI Training:

Establishes that copying entire works can still be fair use

Supports the argument that non-expressive, analytical use (like training models) can be transformative

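Frequently cited as persuasive precedent for machine learning on copyrighted texts, although the case itself concerned search and snippets rather than model training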

2. Authors Guild v. HathiTrust (2d Cir. 2014)

Facts:

University libraries created a shared digital repository

Used for:

Full-text search

Access for print-disabled individuals

Preservation

Legal Issue:

Is digitizing and storing copyrighted books fair use?

Court’s Reasoning:

Full-text search is non-consumptive

No expressive reading by the public

Accessibility for disabled users strongly favored fair use

No market harm shown

Holding:

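✔ Fair use (full-text search and access for print-disabled users; the preservation claim was sent back to the trial court)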

Importance for AI:

Commonly linked to the idea of “non-consumptive use”, in which works are processed by machine rather than read by people

AI training is often argued as non-consumptive

Reinforces that internal, analytical copying is not necessarily infringement

3. Perfect 10, Inc. v. Amazon.com, Inc. (9th Cir. 2007, decided together with the Google Images claims)

Facts:

Google displayed thumbnail images of copyrighted photos

Full images were hosted elsewhere

Thumbnails helped users locate information

Legal Issue:

Are thumbnails infringing copies?

Court’s Reasoning:

Thumbnails were transformative

Purpose shifted from artistic expression to information location

Reduced resolution reduced market harm

Holding:

✔ Likely fair use (the ruling came on a preliminary injunction)

Importance for AI:

Courts accept reformatting and functional reuse

Supports training AI on images when the use is analytical rather than aesthetic

Influences image-generation training disputes

4. Kelly v. Arriba Soft Corp. (9th Cir. 2003)

Facts:

A search engine copied photographs to create thumbnails

Used for image search results

Legal Issue:

Does creating thumbnails infringe photographers’ rights?

Court’s Reasoning:

Use was transformative

Thumbnails served a new function

Public benefit outweighed harm

Holding:

✔ Fair use

Importance for AI:

Early foundation for fair use rulings on search and indexing

Reinforces functional transformation arguments used by AI developers

5. A.V. ex rel. Vanderhye v. iParadigms, LLC (Turnitin Case, 4th Cir. 2009)

Facts:

Student essays were copied into plagiarism-detection databases

Essays were stored permanently

Students alleged infringement

Legal Issue:

Is copying student works for plagiarism detection fair use?

Court’s Reasoning:

Purpose was preventing plagiarism, not expressive reuse

Essays were not publicly displayed

No market substitution

Holding:

✔ Fair use

Importance for AI:

Often cited as closely analogous to AI training

Validates database storage for analysis

Shows courts tolerate permanent storage if purpose is transformative

6. Sony Corp. of America v. Universal City Studios (Betamax Case, 1984)

Facts:

Home users recorded TV shows for later viewing

Studios claimed contributory infringement

Legal Issue:

Is time-shifting fair use?

Court’s Reasoning:

Private, non-commercial use

No proven market harm

Technology capable of substantial non-infringing uses

Holding:

✔ Fair use (time-shifting); Sony was not liable for contributory infringement

Importance for AI:

Protects technology creators

Supports AI developers where tools have lawful applications

Foundation for later fair use decisions involving new technology

7. Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith (2023)

Facts:

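Warhol created silkscreen portraits of the musician Prince based on a photograph by Lynn Goldsmith

The Foundation licensed one of the portraits commercially for use on a magazine cover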

Legal Issue:

Does stylistic transformation alone make a use fair?

Court’s Reasoning:

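Focused on the purpose of the specific use at issue: a commercial magazine license, which served substantially the same purpose as the original photograph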

Warned against overly broad “transformative” claims

Purpose and market competition mattered more than aesthetics

Holding:

✘ Not fair use (the Supreme Court ruled only on the first factor, which favored Goldsmith)

Importance for AI:

Limits AI fair use arguments

Warns that if AI outputs compete directly with original works, fair use may fail

Emphasizes whether the new use serves the same purpose and market as the original

IV. Applying These Cases to AI Training

Strong Arguments Supporting Fair Use:

AI training is transformative

Use is non-expressive and statistical

Outputs typically do not contain stored copies

Public benefit (innovation, accessibility, research)

Weak Points / Legal Risks:

Training on highly creative works

Models reproducing copyrighted material

Commercial exploitation

Competing with licensing markets
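The reproduction risk in particular can be screened for in a rough way. The sketch below is an engineering heuristic only, not a legal test for substantial similarity: it measures what share of the word sequences in a model output also appear verbatim in a source text. The function names and the choice of sequence length n are assumptions made for illustration.

```python
def ngrams(text, n):
    """Return the set of n-word sequences in a text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(output, source, n=5):
    """Share of the output's n-word sequences found verbatim in the source.

    Hypothetical heuristic: 0.0 means no shared sequences, 1.0 means every
    sequence in the output also appears in the source.
    """
    output_grams = ngrams(output, n)
    if not output_grams:
        return 0.0  # output shorter than n words
    return len(output_grams & ngrams(source, n)) / len(output_grams)


source = "it was the best of times it was the worst of times"
copied = "it was the best of times"
unrelated = "a completely different sentence about weather today"

print(overlap_ratio(copied, source, n=3))
print(overlap_ratio(unrelated, source, n=3))
```

A high ratio flags an output for closer review; a low ratio does not show that an output is non-infringing, since paraphrase and structural copying are not captured by exact word matching.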

V. Key Legal Takeaways

Copying alone does not establish infringement when a defense such as fair use applies

Transformative analytical use has repeatedly been held to be fair use in the cases above

Market substitution is the biggest risk

What a model outputs is likely to matter at least as much as what it was trained on

Courts balance innovation vs creator rights

VI. Conclusion

The U.S. precedents discussed above lean toward allowing AI training under fair use, especially when:

The training is non-expressive

Outputs are not substantially similar

The model does not replace the original market

However, recent cases (like Warhol) signal that courts will scrutinize commercial exploitation and market harm more closely, meaning future AI cases may tighten boundaries.
