near copies and substitution: why factor 4 fails for generative ai
part 2 of 2 • part 1: model weights contain copies
note: this post is being actively revised. citations are being expanded.
the question this post answers
in model weights contain copies: compression is not magic, i demonstrated that model weights contain extractable copies of training data. i trained models on GPL-licensed Linux kernel source code and extracted that code verbatim from the weights. the claim “model weights do not contain copies” is mathematically false.
that demonstration addressed verbatim copying. this post addresses what happens when models don’t produce character-for-character copies but instead create functional substitutes that serve the same purpose as the original work. when users get what they need from model output, they don’t access the original. demand falls. markets shrink. copyright harm follows inevitably from basic economics.
why substitution matters legally
copyright law protects markets, not just specific character sequences. under 17 U.S.C. §107, fair use analysis weighs four factors:
- the purpose and character of the use
- the nature of the copyrighted work
- the amount and substantiality of the portion used
- the effect of the use upon the potential market for or value of the copyrighted work
courts have settled this: factor 4 is decisive. the Supreme Court in Harper & Row v. Nation Enterprises called it “undoubtedly the single most important element of fair use.” Judge Chhabria in Kadrey v. Meta Platforms made the consequence explicit: “market dilution will often cause plaintiffs to decisively win the fourth factor—and thus win the fair use question overall.”
the US Copyright Office’s 2025 report on AI training endorsed this “market dilution theory”: factor 4 analysis considers whether AI models flood markets with works in similar styles or categories, even absent evidence of harm to specific copyrighted works. when romance novel generators trained on copyrighted romance novels saturate the market, they harm the market for romance novels generally—not just the specific titles in training data.
this isn’t speculative theory. copyright protects markets. substitutes harm markets. the law is clear.
what this post demonstrates
generative AI models create functional substitutes that defeat fair use defenses. the mechanism is established:
- capability is demonstrated: models generate outputs that serve the same function as training data—code that solves the same problems, images in the same style, text that conveys the same information
- economic logic is settled: when outputs substitute for originals, demand for originals falls—this is definitional, not empirical
- causal mechanism is established: users choosing model outputs over originals creates market displacement
- legal doctrine is clear: substitution capability creates market harm under factor 4, the decisive fair use factor
this analysis differs from what i showed in my earlier post on how compression enables copying: there i demonstrated verbatim copying exists in weights. here i demonstrate that even without verbatim extraction, models create substitutes. both mechanisms cause market harm. both mechanisms defeat fair use defenses. the legal outcome is identical.
the path ahead
we proceed systematically. first, we define substitution in economic terms and map it to copyright law’s market harm analysis. then we demonstrate substitution capability across modalities: code generation, image synthesis, text paraphrasing. we distinguish between capability (demonstrated here through construction) and scale (requiring market measurement). finally, we explain why factor 4 fails for generative AI: once models can create functional equivalents, market harm follows from economic logic.
the argument is straightforward. if a model can create substitutes (capability demonstrated), and users rely on those substitutes (observed behavior), then market harm must follow (economic definition). i demonstrate capability through explicit construction—the same constructive approach i used to show verbatim memorization. the causal chain is complete.
courts act on likelihood of harm, not certainty of completed harm—this is black-letter restraining order doctrine. waiting for “rigorous empirical confirmation” of market destruction before recognizing substitution harm is like waiting until a trade secret is published before granting an injunction. the harm is the capability deployed, not the statistical measurement of consequences after markets collapse.
unfortunately for defenders of AI fair use, the economic logic here is as trivial as the mathematical logic underlying memorization: substitutes reduce demand for originals. this is what the word “substitute” means.
what is a “near copy” in generative systems?
in model weights contain copies: compression is not magic, i demonstrated that model weights contain extractable copies of training data—verbatim sequences recoverable through prompting. i showed GPL-licensed Linux kernel source code emerging character-for-character from model weights, satisfying the legal definition of “copy” under 17 U.S.C. §101.
but copying exists on a spectrum. my earlier analysis focused on one extreme: perfect verbatim reproduction, where every token matches exactly. this post examines outputs that aren’t exact character-for-character copies but are similar enough to substitute functionally for the original work.
the spectrum of similarity
copyright law has long recognized that infringement doesn’t require perfect copying. the substantial similarity doctrine protects against non-literal copying—works similar enough that an ordinary observer would recognize one as derived from the other. for AI-generated outputs, this spectrum matters because even weak similarity can create market harm if users choose the AI output instead of accessing the original.
here are five levels of similarity, from strongest (nearly verbatim) to weakest (compositional echoing):
level 1: high n-gram overlap
32+ consecutive tokens match training data exactly (Biderman et al.’s k-extractability threshold with k=32). this isn’t technically “verbatim” if preceded or followed by different context, but long matching subsequences are functionally copies.
example: a code completion model generates a GPL license header embedded in otherwise novel code. the license text itself is verbatim (covered by the compression post), but the surrounding code makes the overall file non-identical.
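the k-token threshold can be checked mechanically. a minimal sketch of an overlap detector (whitespace tokenization and an in-memory corpus are simplifying assumptions; real extraction studies use the model’s own tokenizer and suffix-array indexes over the full training set):

```python
def longest_common_run(output_tokens, training_tokens):
    """Length of the longest run of consecutive tokens shared by both sequences
    (longest-common-substring dynamic programming over token lists)."""
    best = 0
    prev = [0] * (len(training_tokens) + 1)
    for a in output_tokens:
        curr = [0] * (len(training_tokens) + 1)
        for j, b in enumerate(training_tokens, start=1):
            if a == b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def is_k_extractable(output, training, k=32):
    """Flag outputs sharing at least k consecutive tokens with training data."""
    return longest_common_run(output.split(), training.split()) >= k
```

an output embedding a 40-token verbatim span inside otherwise novel text trips the k=32 flag even though the file as a whole is non-identical, which is exactly the level 1 scenario.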
level 2: paraphrase preserving selection/arrangement
same facts, quotes, and argumentative structure expressed in different words. the semantic content and organization—which can be protected expression—remain intact despite surface variation.
example: an AI summary that captures an article’s unique framing, key arguments, and supporting evidence using synonyms and restructured sentences. readers get the article’s protected expression without reading the original.
concrete case: Gupta & Pruthi (2025) documented AI-generated research that replaced “weighted adjacency matrix” with “resonance graph” while preserving identical methodology. the paper scored 6, 7, 6 at an ICLR 2025 workshop—above the acceptance threshold—despite being what the original authors confirmed was “definitively not novel.” methodological paraphrasing, not just prose paraphrasing, demonstrates that level 2 similarity creates functional substitutes for academic work.
level 3: stylistic substitution
replicates an artist’s distinctive style: color palette, composition techniques, brushwork patterns, or (for code) programming idioms and design patterns. doesn’t copy a specific work but copies the stylistic expression that makes that creator’s work recognizable.
example: stable diffusion prompted with “mountain landscape in the style of Albert Bierstadt” generates images with Bierstadt’s characteristic lighting, composition, and romantic aesthetic—not copying any single painting but capturing the stylistic elements that define his work.
level 4: structural/template reproduction
narrative beats, plot structures, character archetypes, API design patterns, database schemas. the “template” or underlying structure is reproduced even when surface details vary.
example: a language model generates a fantasy novel with the same narrative structure as a training example: reluctant hero’s journey, mentor’s death, false victory, final confrontation. no scenes are copied, but the protected structural expression is replicated.
level 5: compositional echoing
statistical signature overlap measurable through perplexity, embedding similarity, or other computational metrics. visually: layout and composition that echo training images. musically: phrasing and harmonic progressions that mirror training compositions.
example: a diffusion model generates an image that—while not copying any specific training image—has compositional elements (rule of thirds, diagonal leading lines, tonal distribution) that cluster near a specific photographer’s work in embedding space.
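compositional echoing is typically measured as distance in an embedding space. a toy sketch (the two-dimensional vectors and portfolio labels are invented for illustration; real pipelines use CLIP-style embeddings with hundreds of dimensions):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_style(candidate, portfolios):
    """Return the portfolio whose centroid embedding lies closest to the
    candidate, i.e. whose style the generated work most strongly echoes."""
    def centroid(vectors):
        dim = len(vectors[0])
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return max(portfolios,
               key=lambda name: cosine_similarity(candidate, centroid(portfolios[name])))
```

when a generated image’s embedding clusters with one photographer’s portfolio rather than the field at large, that is level 5 similarity made quantitative.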
why this taxonomy matters legally
the landmark case Arnstein v. Porter, 154 F.2d 464 (2d Cir. 1946) established the “ordinary observer” test: if an average person would recognize the alleged copy as appropriated from the original, substantial similarity exists. this doesn’t require verbatim reproduction—it requires functional or expressive equivalence recognizable to laypeople.
for software, Computer Associates Int’l, Inc. v. Altai, Inc., 982 F.2d 693 (2d Cir. 1992) refined the analysis with the abstraction-filtration-comparison test: even when code isn’t verbatim copied, non-literal elements (structure, sequence, organization) can infringe if they constitute protected expression.
crucially for fair use analysis, the degree of similarity matters less than whether substitution occurs. even level 5 (weak compositional similarity) can create market harm under fair use factor 4 if users choose AI outputs instead of licensing or purchasing originals.
connection to memorization taxonomy
earlier i discussed Biderman et al.’s memorization taxonomy: recite (verbatim from heavy duplication), reconstruct (templates/patterns), and recollect (rare outliers). this similarity taxonomy maps naturally:
- recite → levels 1-2: high token overlap and preserved structure
- reconstruct → levels 3-4: stylistic and structural reproduction
- recollect → level 5: compositional echoing from learned patterns
the key insight: all five levels can create copies under copyright law through the substantial similarity doctrine. i’ve already shown level 0 (perfect verbatim) exists. this post examines whether levels 1-5 create market harm sufficient to defeat fair use—regardless of how “similar” they are in technical terms.
how substitution happens: technical mechanisms
understanding how models create substitutes is crucial for both legal causation and technical mitigation.
mechanism 1: prompting patterns
users intentionally elicit substitutive outputs through specific prompts:
- style prompts: “in the style of [artist]” for image generation
- template prompts: “write a [genre] story with [plot elements]”
- example continuation: “here’s the beginning of [work], continue it”
- functional requests: “implement [algorithm] in [language]”
legal significance: intentional prompting shows users seeking substitution, not accidental leakage. it demonstrates that market displacement is predictable and realized: when users explicitly request something “like [copyrighted work],” they’re seeking a substitute.
example: stable diffusion prompted with “mountain landscape in the style of Albert Bierstadt” generates images compositionally and stylistically similar to Bierstadt’s protected works. the user chose the AI output instead of licensing an actual Bierstadt work or hiring an artist trained in that style.
mechanism 2: training contamination and duplication
high-frequency training sequences are memorized near-verbatim, as i showed through GPL kernel extraction. this isn’t random—it follows predictable patterns:
duplication drives memorization. sequences appearing frequently in training data (license text, famous quotes, boilerplate code) get memorized because exact storage is compression-optimal. Lee et al. (2021) showed that deduplicating training data reduces but doesn’t eliminate memorization.
training phase matters. sequences encountered during low-learning-rate phases (fine-tuning, late training) are more likely to be memorized than those seen early. Nasr et al. (2023) demonstrated that RLHF can actually increase memorization of alignment data.
outliers resist compression. rare, distinctive sequences that can’t be compressed into general patterns get stored directly. this is why unique email addresses, uncommon names, and distinctive phrases from training data can be extracted.
legal significance: duplication during training makes post-hoc substitution highly likely. defense claims of “accidental” output are weakened when training data characteristics predict memorization. the causal chain from training choices to substitution outputs is direct and foreseeable.
example: the GPL license appears in thousands of Linux kernel files → model memorizes it during training → outputs reproduce it when prompted for kernel code (as my extraction experiments showed). this isn’t accident—it’s the compression-optimal solution given the training distribution.
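the duplication-drives-memorization pattern can be estimated before training ever starts, which matters for foreseeability. a minimal sketch that counts recurring n-token windows across a corpus (whitespace tokens and a tiny n are illustrative simplifications):

```python
from collections import Counter


def ngram_duplication(corpus_files, n=8):
    """Count how often each n-token window recurs across a corpus.
    Heavily duplicated windows (license headers, boilerplate) are the
    sequences most likely to be memorized near-verbatim."""
    counts = Counter()
    for text in corpus_files:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def high_risk_sequences(corpus_files, n=8, threshold=3):
    """Windows appearing at least `threshold` times: prime memorization candidates."""
    return [seq for seq, c in ngram_duplication(corpus_files, n).items() if c >= threshold]
```

run over a source tree, a license header that opens every file dominates the duplicate counts, predicting exactly the memorization my kernel extraction experiments exhibited.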
mechanism 3: retrieval-augmented and cache-like behaviors
models exhibit behaviors similar to nearest-neighbor retrieval, functioning like databases for specific queries:
k-nearest-neighbor language models (Khandelwal et al., 2020) explicitly retrieve training examples to improve predictions. the model stores training data in a datastore and retrieves relevant examples at inference time. this is retrieval, not generation from patterns.
induction heads in transformers (Olsson et al., 2022) enable in-context learning by copying patterns from earlier in the context window. when prompted with examples similar to training data, these heads activate retrieval-like behaviors.
attention-based retrieval. even without explicit retrieval mechanisms, transformer attention can focus heavily on training-similar contexts, effectively “looking up” learned patterns rather than generating from first principles.
legal significance: if models behave like retrieval systems for specific queries, they function as databases of training data, not just pattern abstractors. this breaks the “transformation” defense—retrieval is copying, not transforming. the Google Books distinction (non-substitutive snippet view) collapses when models can retrieve full content.
example: when GPT-4 is prompted to “explain the fast inverse square root algorithm from Quake III,” it doesn’t generate a novel explanation—it retrieves the characteristic bit manipulation pattern from training data. the output is recognizably derived from specific training examples, not synthesized from abstract patterns.
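the retrieval-interpolation idea behind kNN-LM can be sketched in a few lines. everything here is a toy (token-overlap similarity, a three-entry datastore, made-up probabilities); the real system of Khandelwal et al. (2020) measures distance between transformer hidden states over a datastore of billions of entries:

```python
from collections import Counter


def knn_lm_next_token(context, datastore, lm_probs, k=3, lam=0.5):
    """Interpolate a parametric LM's next-token distribution with a
    nearest-neighbor distribution over stored training examples.
    datastore: list of (context_tokens, next_token) pairs from training data.
    lm_probs: token -> probability from the parametric model alone."""
    def overlap(a, b):
        return len(set(a) & set(b))  # toy stand-in for hidden-state distance
    neighbors = sorted(datastore, key=lambda entry: overlap(context, entry[0]),
                       reverse=True)[:k]
    knn_counts = Counter(tok for _, tok in neighbors)
    total = sum(knn_counts.values())
    vocab = set(lm_probs) | set(knn_counts)
    return {t: lam * knn_counts.get(t, 0) / total + (1 - lam) * lm_probs.get(t, 0.0)
            for t in vocab}
```

when stored training contexts match the prompt closely, the retrieved distribution can override the parametric one: the model is looking training data up, not generating from abstracted patterns.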
mechanism 4: compositional reconstruction
even without verbatim storage, models can reconstruct outputs by composing learned components:
image models learn “visual atoms”: edges, textures, objects, compositional rules. during generation, these are combined to produce outputs. training on an artist’s portfolio teaches these atoms in that artist’s style. generation recombines them to create novel-looking but stylistically derivative works.
text models learn phrases, argumentative structures, narrative patterns. a model trained on legal writing learns standard clauses, document structures, and reasoning patterns. outputs recombine these learned components into documents that serve the same function as training examples.
code models learn idioms, design patterns, API usage patterns. Wang et al. (2023) showed that neural networks develop hierarchical representations where lower layers capture syntax and higher layers capture semantic patterns. code generation recombines these learned representations.
this is Biderman’s “reconstruct” category: outputs aren’t verbatim copies but are synthesized from templates learned during training. the result is functionally equivalent to training examples even when no single training example is copied.
legal significance: reconstruction explains how outputs can be substitutes without being verbatim copies. the model learned from training data how to create works of a particular type, then reconstructs that type on demand. this directly addresses the “transformation” defense: if the purpose is to enable users to obtain substitutes for training data, the use isn’t transformative.
example: github copilot generates a quicksort implementation not by copying one training example but by synthesizing common patterns across thousands: pivot selection strategies, recursion structure, edge case handling. the output isn’t verbatim from any single training file, but it’s functionally equivalent—it solves the same problem the same way training examples did.
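compositional reconstruction is easy to see in miniature. a toy bigram generator (a deliberately crude stand-in for the far richer learned components in real models) emits sequences that appear in no single training example, yet every transition it produces was learned from the training data:

```python
import random
from collections import defaultdict


def train_bigrams(corpus):
    """Learn which token follows which: the simplest possible 'learned components'."""
    model = defaultdict(list)
    for text in corpus:
        tokens = text.split()
        for a, b in zip(tokens, tokens[1:]):
            model[a].append(b)
    return model


def reconstruct(model, start, length=4, seed=0):
    """Compose a 'novel' sequence entirely out of transitions seen in training."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return " ".join(out)
```

trained on two sentences, it can emit a crossover sentence found in neither, assembled wholly from training-derived parts: novel-looking output, zero novel components.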
the mechanisms combine
these four mechanisms work together: prompting activates capabilities (1), contamination determines what’s readily accessible (2), retrieval behaviors access stored content (3), and reconstruction synthesizes novel-looking substitutes (4).
the causal chain detailed later in this post: unauthorized training → encoded capability (mechanisms 2-4) → prompted extraction (mechanism 1) → substitutive output → market harm.
each mechanism strengthens the case that substitution isn’t accidental—it’s the product’s core functionality. models were trained to enable users to generate outputs similar to training data. that’s the value proposition.
legal framework: fair use factor 4 and market harm
copyright beyond exact copies
before examining market harm, a brief clarification: copyright doesn’t require verbatim reproduction. courts have long recognized infringement through substantial similarity—when works are similar enough that an ordinary observer would recognize copying (Arnstein v. Porter, 154 F.2d 464 (2d Cir. 1946)).
this extends to derivative works under 17 U.S.C. §106(2): translations, adaptations, and transformations based on originals. a French translation shares zero words with the English original, but it’s a derivative requiring authorization.
for AI, this means style transfers might be derivatives of training artists’ works, paraphrases might be derivatives of summarized articles, and functionally equivalent code might infringe training implementations. but establishing substantial similarity or derivative status is fact-intensive. more powerful for AI cases is the market harm analysis: fair use factor 4.
the key legal question isn’t “did the AI copy exactly?” but “does the output harm the market for the original?” that’s where fair use factor 4 becomes decisive.
the four factors and factor 4’s primacy
under 17 U.S.C. §107, courts evaluate four factors for fair use:
- purpose and character: is the use commercial or transformative?
- nature of the work: is the original factual (less protection) or creative (more protection)?
- amount and substantiality: how much of the work was used?
- effect on the market: does the use harm the market for the original?
recent AI cases emphasize that factor 4 is decisive:
Kadrey v. Meta Platforms, Inc. (N.D. Cal. June 2025): Judge Chhabria, quoting Harper & Row, reiterated that factor 4 is “undoubtedly the single most important element of fair use,” and warned that “market dilution will often cause plaintiffs to decisively win the fourth factor—and thus win the fair use question overall.” Meta prevailed on narrow evidentiary grounds—plaintiffs failed to prove their specific works caused measurable market harm—but the court’s reasoning endorsed market dilution theory, providing a roadmap for plaintiffs who develop stronger empirical evidence.
this reflects established fair use doctrine:
Harper & Row Publishers, Inc. v. Nation Enterprises, 471 U.S. 539 (1985): the Supreme Court held that factor 4 is “undoubtedly the single most important element of fair use.” market harm from scooping publication of President Ford’s unpublished memoirs defeated fair use despite newsworthy purpose.
Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994): even transformative uses (2 Live Crew’s parody of “Oh, Pretty Woman”) must consider market harm, including harm to derivative markets. the Court remanded for factor 4 analysis—transformation doesn’t automatically excuse market displacement.
Andy Warhol Foundation for Visual Arts, Inc. v. Goldsmith, 598 U.S. 508 (2023): the Supreme Court emphasized that transformative purpose doesn’t negate market harm. when Warhol Foundation licensed a silkscreen based on Goldsmith’s photo for $10,000, “this use of Goldsmith’s copyrighted photograph…deprived Goldsmith of a fee” for licensing her work directly. the licensing market harm weighed decisively against fair use.
the case law progression
the pattern across cases is clear: substitutive uses fail fair use.
non-substitutive uses prevail:
Authors Guild v. Google Inc., 804 F.3d 202 (2d Cir. 2015): google books’ snippet view was fair use because it was non-substitutive by design. the court emphasized that snippets were “fragmentary and scattered,” strictly limited in number and placement per book. users couldn’t read books through snippets—they could only discover whether a book contained relevant passages. “snippet view adds importantly to the [book’s] commercial viability” rather than substituting for purchases.
Kelly v. Arriba Soft Corp., 336 F.3d 811 (9th Cir. 2003): thumbnail images in search results were transformative fair use despite being exact copies at lower resolution. the original purpose was aesthetic; the search engine purpose was indexing and access. critically, the “use did not supplant need for originals” and “benefitted public by enhancing Internet information gathering.” the thumbnails helped users find full-resolution originals rather than replacing them.
substitutive uses fail:
Fox News Network, LLC v. TVEyes, Inc., 883 F.3d 169 (2d Cir. 2018): a video monitoring service delivering full TV clips to clients “usurped a market that Fox could develop or license others to develop.” the service substituted for watching original broadcasts, creating market harm.
A&M Records, Inc. v. Napster, Inc., 239 F.3d 1004 (9th Cir. 2001): peer-to-peer file sharing of complete songs substituted for purchasing music. “Napster use is likely to reduce CD purchases by college students” constituted market harm even without proof of actual lost sales.
American Geophysical Union v. Texaco Inc., 60 F.3d 913 (2d Cir. 1994): photocopying journal articles for researchers harmed licensing markets (through Copyright Clearance Center). even though Texaco didn’t commercially resell the copies, the substitution for licensing created market harm.
the google books distinction
defendants frequently cite Authors Guild v. Google for transformative fair use. but google books is distinguishable:
| Google Books | AI Training |
|---|---|
| Snippet view (3-5 lines max) | Full outputs (potentially complete works) |
| Search purpose (find relevant books) | Generation/substitution purpose |
| Non-substitutive by design | Functionally substitutive in practice |
| Increased discovery → higher sales | Potentially decreased demand |
| “Fragmentary and scattered” excerpts | Coherent, complete, immediately usable outputs |
as the Second Circuit noted in Google Books: “snippet view adds importantly to the [book’s] commercial viability” because it helps users find books to purchase. if snippet view had allowed reading entire books by combining snippets, the outcome would differ. AI outputs often do allow “reading the entire book” (metaphorically)—accessing the full functional value without the original.
market dilution theory
the US Copyright Office’s 2025 report on AI training (Part 3, May 2025) introduced an expanded interpretation of factor 4: market dilution theory.
the report notes that “the speed and scale at which AI systems generate content pose a serious risk of diluting markets for existing works.” crucially, factor 4 analysis shouldn’t be read narrowly to consider only the specific copyrighted work’s market. courts should consider:
- derivative markets: adaptations, translations, sequels, spin-offs
- potential licensing opportunities: markets that could be exploited even if not currently active
- market flooding: when AI-generated works in similar styles or categories saturate markets, harming demand for the category generally
example from the report: when a romance novel generator trained on copyrighted romance novels floods the market with AI-generated romances, it harms “the market for romance novels” broadly—not just the specific titles in training data. readers may substitute AI-generated novels for traditionally published ones, harming all romance authors whose works contributed to training.
this theory addresses the “uncharted territory” Judge Chhabria identified in Kadrey: traditional factor 4 analysis focuses on specific works, but AI’s ability to generate unlimited derivatives creates category-level market harm. the Copyright Office endorsed considering this broader market impact.
uk fair dealing and market impact
UK copyright law is narrower than US fair use, with specific enumerated purposes (research, criticism, quotation, etc.). but UK courts still consider market effects in fairness assessments:
Ashdown v. Telegraph Group Ltd. [2001] EWCA Civ 1142: even for permitted purposes, fairness includes balancing public interest against commercial harm. market competition and economic impact weigh in the fairness assessment.
Public Relations Consultants Association Ltd. v. Newspaper Licensing Agency Ltd. (Meltwater) [2013] UKSC 18: even temporary copying for permitted uses must be assessed for fairness, considering whether licensing arrangements exist for the use. market alternatives matter.
UK’s framework is likely less favorable to AI training than US fair use, despite TDM exceptions for research (CDPA §29A), because commercial AI training doesn’t fit research exceptions and UK courts traditionally protect licensing markets more stringently.
synthesis: factor 4 is decisive for AI
the legal framework is clear:
- factor 4 is the single most important fair use factor (Harper & Row, Kadrey)
- substitutive uses create market harm (TVEyes, Napster, Texaco)
- transformation doesn’t negate market harm (Campbell, Warhol)
- non-substitutive transformative uses can prevail (Google Books, Kelly v. Arriba Soft)—but only when deliberately designed not to replace originals
- market dilution encompasses derivative and licensing markets (Copyright Office 2025)
- substantial noninfringing uses (Sony Corp. of America v. Universal City Studios, Inc., 464 U.S. 417 (1984)) can defend against contributory infringement—but defendants must show AI’s primary use isn’t substitution
- constitutional balance favors progress (Google LLC v. Oracle America, Inc., 593 U.S. 1 (2021))—but also favors compensating creators whose works enable innovation
AI models create functional substitutes for training data (the preceding sections demonstrated this capability). market harm under factor 4 follows as the logical consequence. capability + economic logic = likelihood of market harm. that’s the standard. courts don’t require proof of completed market destruction before finding harm likely.
from training-time infringement to inference-time substitution
the mechanisms described above create a direct causal chain from unauthorized training to market harm. understanding this chain is crucial for legal causation analysis under fair use factor 4, because it establishes that market substitution isn’t accidental—it’s the predictable consequence of training choices.
the four-step causal chain
step 1: training data assembly
copyrighted works are scraped, compiled, and used for training without authorization. while some jurisdictions provide TDM exceptions for research purposes, the initial copying occurs regardless: the copyrighted content is reproduced into training datasets.
step 2: encoding capability
training encodes the ability to regenerate similar outputs through multiple pathways:
- duplication drives memorization: sequences appearing frequently in training data (license text, boilerplate code, famous passages) are memorized near-verbatim. this isn’t accidental—it’s the compression-optimal solution, as my kernel code extraction demonstrated.
- patterns enable reconstruction: even without verbatim storage, training extracts statistical patterns across thousands of examples. the model learns “how to write like [artist]” or “how to implement [algorithm]” from the training distribution.
- outliers are retained: rare sequences that resist compression into general patterns get stored for later recollection, as Biderman’s taxonomy predicts.
step 3: prompted extraction
users intentionally or inadvertently prompt for substitutive outputs. this isn’t a side effect—it’s the product’s value proposition:
- “generate an image in the style of [artist]” explicitly requests stylistic reproduction
- “implement quicksort in Python” requests functional equivalents of training examples
- “summarize this article” requests paraphrases that convey original content
the model’s training has created the capability; the prompt activates it.
step 4: market substitution
the generated output serves as a functional equivalent to the original work:
- users obtain what they need without purchasing or licensing the original
- demand for the copyrighted work declines
- creators lose potential revenue
- market harm occurs
legal causation: the “but for” test
for factor 4 analysis, plaintiffs must establish that the defendant’s use harms the market for the original work. the causal chain above demonstrates “but for” causation:
but for the training on copyrighted data (step 1), the model wouldn’t encode the capability to generate similar outputs (step 2). but for that capability, users couldn’t prompt for substitutes (step 3). but for those prompts, market-displacing outputs wouldn’t be generated (step 4).
each link is necessary. remove any step and the chain breaks.
foreseeability
market substitution isn’t an unforeseeable accident. it’s predictable from first principles:
- we know which sequences get memorized (duplicated content, outliers, content seen during low-learning-rate training)
- we know memorized content can be extracted via prompting (Carlini et al., Biderman et al.)
- we know extracted outputs serve as substitutes by design—that’s why users prompt for them
- therefore, market harm from substitution is the foreseeable consequence of training choices
training on an artist’s portfolio to enable “in the style of [artist]” generation isn’t an unfortunate side effect. it’s the intended outcome.
the “intervening cause” defense
defendants might argue that user prompting is an intervening cause that breaks the causal chain from training to market harm. this argument fails for several reasons:
prompting is the intended use. the model was designed to respond to prompts. developers know—and intend—that users will request outputs based on training data patterns. providing a tool designed for substitution is analogous to providing infrastructure for unauthorized copying (see A&M Records v. Napster—the file-sharing platform wasn’t performing the copying, but enabled and encouraged it).
the capability originates in training. user prompts don’t create the substitution capability; they merely invoke what training encoded. if the model hadn’t been trained on copyrighted data, no prompt could extract similar outputs. the causal origin is training, not prompting.
contributory infringement precedent. courts have long held that providing tools that enable infringement—even when the tool provider doesn’t directly infringe—can constitute contributory infringement. if a model’s primary value proposition is generating substitutes for copyrighted training data, the training itself is the infringing act.
mitigation attempts and their limits
defendants often point to technical mitigations that allegedly break the causal chain:
deduplication reduces memorization by removing repeated training examples. but as the compression post showed, even deduplicated models memorize content that appears multiple times across different sources. deduplication reduces probability, not capability.
RLHF and DPO alignment can suppress verbatim reproduction by training models to refuse certain prompts or avoid copying. but alignment is imperfect—adversarial prompting can often bypass refusals, and alignment doesn’t remove the encoded information, only makes it harder to access.
content filters block outputs that match known copyrighted works. but filters are reactive (they can’t anticipate all copyrighted content) and address symptoms rather than causes. the capability for substitution remains; the filter just prevents some manifestations. research demonstrates these mitigations are brittle: Qu et al. (2024) developed automated jailbreaking techniques that reduced ChatGPT’s copyright filter block rate from 84% to 6.25%, enabling copyrighted content generation 73.3% of the time. practical exploitation by prompt engineers has successfully bypassed filters across major AI systems within hours of deployment.
the critical point: imperfect mitigation indicates the capability exists and requires active suppression. if substitution capability didn’t exist, mitigation wouldn’t be necessary. the fact that developers must implement technical countermeasures demonstrates that the causal chain from training to market harm is real—and designed in, not accidental.
this causal framework establishes that market harm under fair use factor 4 isn’t speculative. it’s the direct, foreseeable consequence of training on copyrighted data without authorization. the capability was encoded during training, activated by prompting, and manifests as market-displacing substitution—exactly as the training choices predict.
evidence of substitution across modalities
demonstrating substitution capability doesn’t require showing it happens in 100% of cases. we only need to show the capability exists and is exercised in practice. this section demonstrates substitution across text, images, code, and audio with concrete examples and quantification.
text: paraphrase and summarization
capability demonstrated
language models excel at semantic equivalence without verbatim copying. DreamSim (Sundaram et al., NeurIPS 2023) showed that learned perceptual similarity metrics capture human judgment of similarity better than pixel-level metrics. for text, paraphrase detection models using MPNet embeddings with cosine similarity >0.85 achieve ~75% accuracy at identifying semantically equivalent text that differs in surface form.
beyond technical capability, Gupta & Pruthi (2025) demonstrated real-world academic plagiarism through “smart plagiarism”: AI-generated research that paraphrases methodology while changing terminology. examining 50 LLM-generated research documents, experts identified 24% as paraphrased or significantly borrowed from existing work, with the remaining 76% showing varying degrees of similarity. critically, an AI-generated paper received ICLR 2025 workshop scores of 6, 7, 6 (above acceptance threshold) despite using “resonance graph” instead of “weighted adjacency matrix”—identical methodology, different words. original authors confirmed it was “definitively not novel,” yet it bypassed both automated plagiarism detectors and expert reviewers.
what this means: models can generate paraphrases that humans recognize as conveying the same content despite high edit distance (low character-level similarity). an AI summary of a research article preserves the key claims, methodology, and conclusions in different words—capturing the article’s protected expression without verbatim copying. the Gupta & Pruthi findings demonstrate this isn’t theoretical: AI-generated academic work functions as a substitute for original research, fooling both automated systems and domain experts.
market substitution potential
- academic journals: AI-generated abstracts replace paid abstract services. if researchers can get “good enough” summaries from AI, they don’t subscribe to full-text databases
- news: AI summaries reduce traffic to original articles. users get the facts and key quotes without reading the source
- research papers: AI-extracted key findings substitute for full-text access. students can answer exam questions based on AI summaries without reading assigned papers
quantification
- edit distance: paraphrases have low character-level similarity (normalized Levenshtein distance of 50-70%) but high semantic similarity (cosine similarity >0.85 in embedding space)
- information overlap: evaluators judge summaries convey 70-90% of key claims in original papers
- functional equivalence: in downstream tasks (answering questions, making decisions), AI summaries perform 80-90% as well as reading originals
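the surface-vs-semantic split in the numbers above can be made concrete with a short sketch. the character-level side is exact (standard edit distance); the semantic side would in practice come from an embedding model such as MPNet, so the cosine function below is shown as it would operate on whatever vectors such a model returns. the example strings are invented illustrations, not data from any study.

```python
from math import sqrt

def levenshtein(a: str, b: str) -> int:
    """classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def char_similarity(a: str, b: str) -> float:
    """1 minus normalized edit distance: 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def cosine(u: list[float], v: list[float]) -> float:
    """cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

original   = "deduplication reduces memorization in large language models"
paraphrase = "removing repeated examples lowers how much big models memorize"

# surface overlap is low even though a human reader would call these equivalent;
# a paraphrase detector would flag them via embedding cosine, not edit distance
print(round(char_similarity(original, paraphrase), 2))
```

this is the asymmetry the factor 4 argument turns on: infringement screens built on edit distance miss exactly the outputs that substitute for originals.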
evidence gaps
rigorous market displacement studies are rare:
- no published before/after analysis of journal subscription cancellations post-AI
- limited data on news site traffic decline causally attributed to AI summaries (as opposed to other factors like social media)
- “but-for” causation unstudied: would users have paid for originals absent AI?
images: style transfer and composition
capability demonstrated
diffusion models reproduce training images and styles. Carlini et al. (2023) extracted 1,000+ training images from stable diffusion by prompting with similar images and filtering for near-duplicates. Wen et al. (NeurIPS 2023) demonstrated “tree-ring” watermarks persist through training, enabling detection of specific training images in generated outputs.
getty images’ lawsuit against stability AI claims stable diffusion was trained on 12 million getty watermarked images—and generated outputs sometimes include remnants of getty watermarks, demonstrating training on specific copyrighted images.
market substitution potential
- stock photography: AI-generated “corporate office” or “mountain sunset” images replace licensed stock photos from shutterstock, getty
- commissioned art: clients prompt “in the style of [artist]” instead of hiring the artist or paying licensing fees
- book covers, marketing graphics: publishers use AI instead of hiring illustrators
quantification
- replication rates (version-specific): Wang et al. (NeurIPS 2024) measured Stable Diffusion V1.5 against an open-source gallery and found 10-20% replication ratios, using their D-Rep dataset of 40,000 image-replica pairs
- bright ending attention: Chen et al. (2025) identified the “bright ending” cross-attention phenomenon where memorized image patches show abnormally high attention to the final text token—distinguishing local memorization (specific regions) from global memorization (entire images) (ICLR 2025 Spotlight)
- perceptual similarity: CLIP embeddings show high similarity (>0.9 cosine similarity) between AI-generated and training artist styles for targeted prompting
- human studies: for some styles (watercolor, impressionist), observers cannot distinguish AI-generated from artist-original images at better than 70% accuracy
- LPIPS (learned perceptual image patch similarity) and SSIM (structural similarity index) capture compositional similarity even when pixel-level differences exist
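SSIM, the last metric in the list, can be sketched directly from its defining formula. this is the single-window (global) variant computed over the whole image; library implementations such as scikit-image slide a local window and average, but the structure of the comparison is the same. the constants follow the standard 0.01/0.03 choices from the original SSIM formulation, and the images here are random stand-ins.

```python
import numpy as np

def global_ssim(x: np.ndarray, y: np.ndarray, data_range: float = 1.0) -> float:
    """single-window SSIM over whole images; library versions use sliding windows."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
noisy = np.clip(img + rng.normal(0, 0.05, img.shape), 0, 1)

print(global_ssim(img, img))    # identical images score exactly 1.0
print(global_ssim(img, noisy))  # below 1.0 for the noisy copy
```

the point of metrics like this in the substitution argument: they score compositional and structural likeness, so two images can differ at nearly every pixel and still register as near-duplicates.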
market impact evidence (preliminary)
anecdotal reports from creative professionals suggest displacement: freelance illustrators document lost projects due to clients using midjourney/stable diffusion instead. surveys of creative workers report AI tools replacing some commissioned work, though rigorous causal attribution studies remain limited.
evidence gaps
controlled experiments on market substitution are needed:
- systematic revenue studies: have stock photo sales declined? by how much?
- causal attribution: is decline due to AI or broader market trends?
- artist-specific impact: how many commissions lost specifically due to “in the style of [artist]” generation?
code: functional equivalence
capability demonstrated
github copilot and similar tools generate functionally equivalent implementations. Yeticstiren et al. (2023) evaluated copilot on 164 problems: 46.3% correctness on first attempt. critically, Nguyen and Nadi (2023) showed that paraphrasing problem descriptions yielded 46% different suggestions—indicating outputs aren’t deterministic memorization but are sensitive to training data patterns.
market substitution potential
- stack overflow: declining traffic coincides with copilot adoption. similarweb data shows stack overflow traffic declined ~40% YoY by Q2 2024 (though causation vs correlation uncertain)
- programming tutorials and courses: AI explains and implements algorithms on demand, potentially substituting for paid educational content
- code snippet libraries: AI-generated implementations substitute for licensed code examples
quantification
- functional similarity: unit tests pass at 46-80% rate for common algorithms depending on problem complexity
- token-level overlap: low exact match (<20%) but high structural similarity measured via abstract syntax tree (AST) comparison
- API pattern replication: high overlap in call sequences, parameter patterns, error handling for common APIs
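the AST comparison mentioned above is easy to sketch with Python’s standard `ast` module: flatten each snippet’s parse tree into a node-type sequence (so identifiers and literals drop out) and measure sequence overlap. the two implementations below are invented examples, and real studies use richer tree-matching than this, but the mechanism is the same.

```python
import ast
from difflib import SequenceMatcher

def ast_shape(source: str) -> list[str]:
    """flatten a parse tree into its node-type sequence, ignoring names/literals."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

def structural_similarity(a: str, b: str) -> float:
    """fraction of matching node types between two code snippets (0.0-1.0)."""
    return SequenceMatcher(None, ast_shape(a), ast_shape(b)).ratio()

# two implementations with different identifiers but identical structure
impl_a = "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc"
impl_b = "def sum_list(vals):\n    out = 0\n    for v in vals:\n        out += v\n    return out"

print(structural_similarity(impl_a, impl_b))  # 1.0: same shape, different names
```

this is why token-level exact match under 20% is compatible with high structural similarity: renaming every variable changes most tokens while leaving the AST untouched.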
evidence gaps
attribution studies are needed:
- controlled studies: does copilot access reduce stack overflow usage causally?
- licensing impact: have sales of code tutorial books/courses declined?
- developer surveys: would they have purchased resources absent AI tools?
audio: music and speech generation
capability demonstrated
text-to-audio diffusion models exhibit systematic memorization of training data. Bharucha et al. (2024) analyzed memorization in AudioCaps-trained models and found mel spectrogram similarity robustly detects training data matches. critically, they discovered large amounts of duplicated audio in AudioCaps itself—the same duplication-drives-memorization pattern seen in text and images.
Messina et al. (2025) demonstrated that without anti-memorization guidance, text-to-audio models unintentionally reproduce portions of training data. they found that mitigation strategies create a trade-off between reducing memorization and preserving prompt fidelity—evidence that memorization is pervasive enough to require active suppression.
Epple et al. (2024) showed that imperceptible audio watermarks persist through training and detectably influence model outputs. training data characteristics encode into weights, proving retention extends to non-text modalities.
market substitution potential
- music libraries: AI-generated background music replaces licensed tracks from AudioJungle, Epidemic Sound, Universal Production Music
- voice acting: text-to-speech models trained on voice actor performances substitute for commissioned work
- sound effects: generated audio replaces Foley artists and sound effect libraries
- commercial music: AI-generated songs compete directly with human artists for chart positions and streaming revenue
quantification
- perceptual similarity: mel-spectral distance and learned audio similarity metrics (CLAP, AudioLDM embeddings) capture functional equivalence
- genre/style replication: models generate “in the style of [artist/genre]” with high classification accuracy
- watermark persistence: demonstrates training data retention through generation pipeline
- documented copyright similarity: GEMA’s lawsuit against Suno documented AI outputs where “melody, harmony and rhythm largely corresponded to world-famous works” including “Forever Young” (Alphaville), “Mambo No. 5” (Lou Bega), and “Cheri Cheri Lady” (Modern Talking)
market impact evidence (realized, not just potential)
in november 2024, Breaking Rust, an AI-generated “country band” that appeared online in october 2024, hit #1 on Billboard’s Country Digital Song Sales chart with “Walk My Walk.” the song reached the top by selling approximately 3,000 copies and accumulated 1.8 million monthly listeners on Spotify. Breaking Rust’s social media shows a clearly AI-generated cowboy avatar and provides no indication of human involvement in music creation.
this is not capability—this is realized market substitution. an AI-generated artist with zero human creative input:
- displaced human country artists from the #1 chart position
- generated substantial streaming revenue (1.8M monthly listeners)
- competed directly with human musicians for consumer attention and spending
- was legitimized by Billboard’s inclusion on official charts
notably, Breaking Rust’s Instagram describes the project only as “Outlaw Country” with no AI disclosure, while the creator (Aubierre Rivaldo Taylor) operates another project (Defbeatsai) that does disclose AI generation. Breaking Rust isn’t even the first: Xania Monet, another AI artist, hit #1 on the R&B Digital Song Sales chart earlier in November 2024.
legal actions confirm substitution
GEMA (German music rights society) filed lawsuits documenting market-harming substitution:
- november 2024 vs OpenAI: ChatGPT trained on copyrighted song lyrics from ~100,000 GEMA members without compensation
- january 2025 vs Suno: documented that Suno’s outputs have “melody, harmony and rhythm largely corresponding to world-famous works” including specific copyrighted songs from Alphaville, Kristina Bach, Lou Bega, Frank Farian, and Modern Talking
GEMA’s documentation provides specific examples where AI outputs are functionally equivalent to copyrighted works—not just stylistically similar, but matching in melody, harmony, and rhythm. this is level 1-2 similarity from Section 1: high structural overlap creating perfect functional substitutes.
evidence gaps
rigorous market displacement studies are emerging but incomplete:
- chart displacement is documented: AI artists reaching #1 proves they outcompete humans for consumer spending
- revenue attribution uncertain: how much streaming revenue goes to AI vs human artists in country/R&B genres?
- causal impact on human artist income: systematic before/after analysis needed for affected genres
- consumer intent studies: do Breaking Rust listeners know it’s AI? would they have chosen human artists if disclosed?
why audio evidence is decisive
unlike text (paraphrases) or images (style transfers) where similarity is debatable, audio replication can be measured precisely: mel-spectral distance, harmonic analysis, rhythm pattern matching. GEMA’s documentation that Suno outputs match copyrighted works in melody, harmony, and rhythm provides objective, quantifiable evidence of substitution—not subjective similarity judgments.
moreover, Breaking Rust demonstrates that courts don’t need to wait for longitudinal studies. the market harm is observable now: #1 chart position, 1.8M listeners, direct competition with human artists. this is the realized substitution that factor 4 analysis requires—not hypothetical future harm, but present displacement.
cross-modal patterns
across modalities, consistent patterns emerge:
- high semantic/functional similarity without verbatim: outputs serve the same purpose as training data without exact copying (levels 2-5 similarity from Section 1)
- user intention to substitute: prompts explicitly request alternatives (“in the style of,” “implement,” “summarize”)
- anecdotal market harm: artist commissions down, writer rates down on freelance platforms, developer resource sales potentially down
- causal attribution uncertain: observational data shows correlation, but controlled studies establishing causation are rare
economic projections exist but require verification: a CISAC global economic study projected revenue losses of roughly 24% for music and 21% for audiovisual creators by 2028 if AI substitution continues unchecked, affecting 200K+ jobs. these are projections, not confirmed measurements.
capability establishes likelihood
models produce functionally equivalent outputs across modalities. the technical capability for substitution exists and is routinely demonstrated across text, code, images, and audio.
correlation exists between AI deployment and market indicators (stack overflow traffic declining, artist commissions falling, freelance rates dropping). establishing precise causal magnitude requires controlled studies—but courts don’t wait for precise magnitude before finding likelihood of harm.
this is the distinction that matters for legal analysis: capability + economic logic = likelihood of harm. that’s the legal standard for factor 4. demanding “rigorous empirical studies with proper controls” before finding market harm likely mistakes the standard: preliminary injunctions don’t require completed harm, restraining orders don’t wait for statistical evidence, and likelihood is enough.
from capability to legal consequence
established with evidence
through Sections 1-5, we’ve established:
✅ substitution capability exists (Sections 1-2)
- models generate outputs at all five similarity levels (n-gram overlap, paraphrase, stylistic, structural, compositional)
- technical mechanisms (prompting, contamination, retrieval, reconstruction) enable substitution
- capability is routine, not exceptional—it’s the product’s value proposition
✅ mechanisms are understood (Section 3)
- prompting: users intentionally seek substitutes
- contamination: duplication in training drives predictable memorization
- retrieval: models function like databases for specific queries
- reconstruction: compositional synthesis creates functional equivalents
✅ legal doctrine supports market harm analysis (Section 4)
- fair use factor 4 is decisive (Harper & Row, Kadrey)
- substitutive uses have failed fair use (TVEyes, Napster, Texaco)
- transformation doesn’t negate market harm (Campbell, Warhol)
- market dilution theory encompasses derivative and licensing markets (Copyright Office 2025)
✅ causal chain is established (Section 5)
- training → encoding → prompting → substitution creates “but for” causation
- foreseeability: substitution is predictable from training choices
- intervening cause defense fails: prompting is intended use
requires empirical study
the next phase requires economists, not computer scientists:
❓ usage patterns at scale
- what percentage of AI uses are substitutive vs. novel creations?
- how often do users prompt explicitly for training-data-similar outputs?
- do users recognize when they’re substituting AI outputs for originals?
❓ actual market displacement
- has creator revenue declined measurably post-AI deployment?
- can declines be causally attributed to AI (vs. other factors like economic conditions, platform changes)?
- are some sectors harmed while others benefit?
❓ economic significance
- is market harm de minimis (negligible) or substantial?
- what’s aggregate revenue loss across affected creators?
- does AI expand markets (net positive) or redistribute existing value (zero-sum or negative)?
❓ user behavior and intent
- would users have paid for originals absent AI? (the “but-for” test for damages)
- are AI outputs “good enough” or do originals remain preferred for quality-sensitive uses?
- how price-elastic is demand for original vs. AI-generated content?
❓ cross-price elasticity
- formal economic measurement: does AI availability reduce demand for originals?
- if AI usage increases 10%, does original demand decrease X%?
- are they substitutes (positive elasticity) or complements (negative elasticity)?
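the elasticity in these bullets reduces to a one-line computation once the data exists; the hard part is collecting the quantities, not the arithmetic. a sketch using the standard midpoint (arc) formula, with invented numbers:

```python
def cross_price_elasticity(q_before: float, q_after: float,
                           p_before: float, p_after: float) -> float:
    """midpoint (arc) elasticity: % change in demand for originals
    per % change in the price of the AI substitute."""
    pct_q = (q_after - q_before) / ((q_after + q_before) / 2)
    pct_p = (p_after - p_before) / ((p_after + p_before) / 2)
    return pct_q / pct_p

# hypothetical: AI subscription price drops 20%, stock-photo license sales fall 10%
e = cross_price_elasticity(q_before=1000, q_after=900,
                           p_before=10.0, p_after=8.0)
print(e)  # positive elasticity: the goods behave as substitutes
```

a positive sign (cheaper AI, fewer original sales moving together) is the formal signature of substitution; a negative sign would indicate complements, i.e. AI driving discovery of originals.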
why empirical gaps matter
burden of proof in litigation
for fair use factor 4, plaintiffs must establish market harm or likelihood thereof. defendants can rebut with evidence of no harm. the legal standard is preponderance of evidence (more likely than not).
demonstrating capability (which this post has done) establishes likelihood of harm—sufficient to shift burden to defendants. but defendants can rebut by showing:
- users don’t actually substitute at scale
- revenues haven’t declined
- AI creates discovery effects that increase demand
current state of litigation
most AI copyright cases lack rigorous economic evidence. courts are deciding factor 4 based on:
- theoretical arguments about substitution potential (like this post’s)
- analogies to Google Books, Napster, etc.
- assumptions about user behavior without survey evidence
- sparse anecdotal testimony from individual creators
what’s needed
rigorous empirical studies require:
- before/after revenue analysis for affected creators with proper controls for confounding factors (economic cycles, platform changes, etc.)
- user surveys with representative sampling and validated instruments measuring willingness to pay, substitution behavior, and “but for” causation
- causal inference methods like difference-in-differences (comparing AI-affected vs. unaffected markets), instrumental variables (finding exogenous variation in AI exposure), or randomized controlled trials (experimentally varying AI access)
- cross-sectional studies comparing markets with different AI penetration rates while controlling for other differences
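of these designs, difference-in-differences is arithmetically trivial; the hard part is finding comparable treated and control markets. a sketch with hypothetical revenue indices (the numbers and market labels are invented for illustration):

```python
def diff_in_diff(treated_pre: float, treated_post: float,
                 control_pre: float, control_post: float) -> float:
    """DiD estimate: the treated group's change minus the control group's change.
    the control trend stands in for what would have happened without treatment."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# hypothetical revenue indices for illustrator markets:
# treated = high AI-image-generator penetration, control = low penetration
effect = diff_in_diff(treated_pre=100, treated_post=82,
                      control_pre=100, control_post=95)
print(effect)  # -13: decline beyond the trend shared with the control market
```

the subtraction is exactly the confound control the bullet describes: an economy-wide downturn hits both markets and cancels out, so only the AI-exposure-specific decline survives.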
timeline and urgency
rigorous economic studies take 2-3 years from design to peer-reviewed publication. current litigation is moving faster—courts are issuing rulings now. there’s a mismatch: legal decisions are outpacing empirical evidence.
this creates risk: courts may rule on factor 4 based on incomplete information, potentially getting causation wrong in either direction.
architectural and jurisdictional variations
not all models have equal substitution potential
technical choices affect degree but not kind:
- encoder-only models (BERT) are less prone to substitution than generative models (GPT, stable diffusion)
- heavy deduplication reduces memorization frequency but doesn’t eliminate capability
- RLHF/DPO alignment can suppress verbatim outputs but doesn’t remove encoded information
- retrieval-augmented generation changes dynamics—explicit retrieval is more obviously copying
but these are differences in degree of substitution risk, not existence of capability. the claim “weights contain copies and enable substitution” remains true even for heavily mitigated models—the probability is reduced, not eliminated.
jurisdictional differences matter
- united states: broad fair use doctrine with flexible four-factor test. transformative use doctrine is expansive. factor 4 is decisive but requires plaintiff proof of harm.
- united kingdom: narrower fair dealing with specific enumerated purposes. TDM exception for research exists but commercial AI training likely doesn’t qualify. courts traditionally protect licensing markets stringently.
- european union: mandatory TDM exceptions (DSM Directive Articles 3-4) with opt-out for rightholders. member states vary on enforcement. stricter than US but with researcher carve-outs.
plaintiffs are likely to venue-shop, filing in whichever jurisdiction’s doctrine most favors their claims.
the honest assessment
what this post has accomplished:
- demonstrated substitution capability across modalities with concrete examples
- explained technical mechanisms enabling substitution
- established legal framework (factor 4 decisive, substitution fails fair use)
- identified causal chain from training to market harm
what this post hasn’t established:
- that substitution occurs at economically significant scale (requires usage studies)
- that measurable market harm has occurred (requires revenue analysis and causal attribution)
- that users predominantly use AI to avoid paying for originals (requires behavioral surveys)
- causal attribution of observed market changes to AI vs. other factors
why this matters: demonstrating capability isn’t demonstrating behavior. courts need evidence of actual market harm, not just potential for harm, to make informed fair use decisions. however, capability is legally relevant: it establishes likelihood of harm, shifting burden to defendants to show actual behavior doesn’t match predicted behavior.
in other words: we’ve shown the gun is loaded and pointed at creators’ markets. whether the trigger has been pulled—and how many casualties resulted—requires empirical measurement we don’t yet have.
counterarguments: steel-manning the opposition
to be fair, there are legitimate arguments against finding that AI model outputs create market harm through substitution. let me present them as charitably as possible before explaining why the economic and legal evidence still supports the substitution thesis.
argument 1: “AI outputs are transformative, not substitutive”
defenders argue that AI creates genuinely new works by synthesizing patterns from thousands of training examples. a github copilot implementation isn’t replacing any specific stack overflow answer but creating novel code by combining learned patterns. an image model trained on an artist’s portfolio doesn’t copy any single work but generates new compositions. this is transformation, not substitution—the outputs serve different purposes than the inputs.
supporters cite Kelly v. Arriba Soft (thumbnail search) and Google Books (snippet view) as precedents where copying for indexing and discovery was transformative. they argue AI training is analogous: extracting patterns to enable new creative capabilities, not replacing specific works.
rebuttal: transformation and substitution aren’t mutually exclusive. as Campbell v. Acuff-Rose demonstrated, even transformative parodies must be analyzed for market harm under factor 4. the Supreme Court recognized that a transformative purpose doesn’t automatically negate market effects—“the more transformative the new work, the less will be the significance of other factors” (emphasis added), but courts must still consider whether the use “usurps the market” for the original.
critically, both Kelly and Google Books succeeded on factor 4 because they were designed to be non-substitutive. thumbnails led users to full-resolution originals; snippets were too fragmentary to replace reading books. AI outputs often provide complete functional substitutes: full implementations, complete images, comprehensive summaries. when outputs satisfy user needs entirely, the analogy breaks.
more fundamentally, “transformative purpose” (training for pattern extraction) differs from “transformative output” (whether the result replaces originals). if a user’s need is satisfied by the AI output—whether that output is a paraphrase, a style-matched image, or functionally equivalent code—that’s substitution in economic terms, regardless of how “transformed” the output is from any specific training example. Section 5 showed the causal chain: training on copyrighted data enables the capability to generate substitutes. when users exercise that capability, market harm follows.
argument 2: “the market is expanding, not contracting”
another defense points to democratization: more people are coding, creating images, and writing than ever before. AI tools expand access to creative production. total market growth means opportunities exist even if specific niches decline. a net positive outcome for creative industries shouldn’t be characterized as harmful.
rebuttal: market expansion doesn’t preclude individual harm. copyright protects individual creators’ rights, not aggregate market size. if AI enables 1,000 new users to generate images but displaces 100 professional illustrators whose training data was used without compensation, the net may be positive but the 100 are harmed. fair use factor 4 asks whether the use harms the market “for the copyrighted work,” not whether it harms “all creative markets generally.”
furthermore, expanding markets don’t necessarily align with the underlying constitutional goal of promoting scientific progress. if the market expansion amounts to displacing useful services with personal goonbots or vibecoding-as-slot-machine experiences, then the constitutional mandate weighs against this outcome, not for it. once again, this is a matter for empirical observation.
moreover, the counterfactual matters: the market might have grown more if creators were compensated for training data. lost licensing revenue is harm even if other revenues exist. as the US Copyright Office’s 2025 report noted, factor 4 should encompass “potential licensing opportunities” and markets that “would have developed” absent unauthorized use. when training data could have been licensed but wasn’t, creators lose potential revenue streams—that’s measurable economic harm.
argument 3: “users wouldn’t have paid anyway”
defenders often claim that many AI use cases involve tasks users would have done themselves (poorly) or not at all. a student using chatgpt to summarize an article wasn’t going to subscribe to the journal. a hobbyist using stable diffusion wasn’t going to commission an artist. no sale was lost because no purchase would have occurred. absent market harm, fair use should apply.
rebuttal: this requires empirical evidence, not assertion. and critically, it’s the burden-shifting argument that matters here. as Section 7 discussed, establishing factor 4 requires plaintiffs to show likelihood of market harm. but once plaintiffs establish that outputs can serve as substitutes (which Sections 2-3 demonstrated), defendants bear the burden of rebutting with evidence that users wouldn’t have paid.
even if individual users wouldn’t pay, aggregate effects matter. if 1% of AI users would have paid for originals, and there are millions of users, that’s economically significant harm. moreover, this argument is dangerous precedent. applied to any infringement, it would gut copyright: “I wasn’t going to buy the movie, so pirating it causes no harm.” copyright doesn’t require proof that each individual infringer would have been a paying customer—it protects the right to control copying and the economic value of that control.
furthermore, market harm extends beyond direct financial loss. creators lose attribution and the social and reputational capital that builds careers. when AI outputs incorporate patterns, styles, or knowledge from training data without acknowledgment, creators are deprived of credit that would lead to future commissions, collaborations, and professional opportunities. an artist whose style is replicated loses not just the sale but the visibility that drives their next client. a researcher whose work is paraphrased loses citations that establish academic reputation. this reputational harm compounds over time and across millions of uses—even when individual instances might not represent direct lost sales.
argument 4: “AI enables discovery and promotion”
like google books, AI might increase demand for original works through discovery. exposure to AI-generated summaries or style-transferred images could lead users to seek out originals. if AI serves as marketing or discovery mechanism, it benefits rather than harms creators. this net positive effect should favor fair use under factor 4.
rebuttal: this is empirically testable—and defendants should provide evidence if they claim it. google books succeeded on fair use partly because snippet view (limited to 3-5 lines) was designed to be non-substitutive. as the Second Circuit noted, snippet view “adds importantly to the [book’s] commercial viability” by helping readers find relevant books they then purchase. critically, google’s design choices limited substitution risk: no full-page views, no continuous text extraction, no complete chapter access.
AI model outputs are different. they often provide complete, functional substitutes: full code implementations, complete images, comprehensive summaries, entire paragraphs of prose. when outputs satisfy the user’s need entirely, discovery doesn’t follow—the user has what they wanted. if defendants can show that AI exposure increases purchases of training works (through citations, attribution, watermarks linking to originals), that’s evidence for factor 4 in their favor. but the burden is on them to demonstrate a discovery effect, not to assume it.
argument 5: “fair use should favor innovation”
a final policy argument: overly restrictive copyright interpretation would stifle AI innovation. these technologies offer enormous social benefits—productivity gains, accessibility, democratization. fair use doctrine exists to balance copyright protection against public interest. innovation should tip the scales.
defenders cite Google v. Oracle, where the Supreme Court emphasized that “to allow enforcement of copyright here would risk harm to the public” in promoting science and useful arts. blocking AI training could similarly harm technological progress and the public interest.
rebuttal: fair use already balances innovation through the four-factor framework. factor 1 (purpose and character) considers transformative uses. but Campbell, Google Books, and recent AI cases all emphasize that transformation doesn’t override market harm. innovation benefits society, but so does compensating creators whose works made that innovation possible.
moreover, Google v. Oracle is distinguishable. that case involved API declaring code—functional interfaces that create network effects and developer lock-in. the Court worried that allowing Oracle to control Java API structure would create monopolies contrary to copyright’s constitutional purpose. training data isn’t functional interface code requiring interoperability; it’s creative expression. the constitutional balance arguably favors creators whose works enable innovation but who aren’t compensated for that contribution.
the choice isn’t “innovation or copyright”—it’s “innovation with authorization” versus “innovation without authorization.” licensing regimes exist (collective licensing organizations, compulsory licensing, opt-out mechanisms). claiming fair use is the only path forward ignores these alternatives that allow innovation while ensuring compensation.
argument 6: “substantial noninfringing uses” (sony betamax)
defenders point to Sony v. Universal (the Betamax case), where the Supreme Court held that devices capable of both infringing and noninfringing uses don’t constitute contributory infringement if the product has “substantial noninfringing uses.” VCRs could copy TV shows illegally, but they could also time-shift broadcasts legally. AI models similarly can generate both original content (noninfringing) and training-data substitutes (potentially infringing). the substantial legitimate uses should shield developers from liability.
rebuttal: the Betamax standard applies to contributory infringement, not direct infringement or fair use analysis. moreover, Betamax concerned time-shifting—copying broadcasts users were already entitled to watch at a different time. VCRs didn’t create new access to content users hadn’t licensed. AI models are different: they enable access to creative expression from training data that users never licensed or purchased. a user generating “code in the style of [training example]” or “summary of [article]” is obtaining value from copyrighted works they didn’t pay for, not merely time-shifting content they already had rights to access.
additionally, Betamax considered whether Sony should face contributory liability for user infringement. the question here is whether the training itself (assembling and using copyrighted works to build the model) constitutes infringement, and whether outputs create market harm. those are direct infringement and fair use questions, not contributory infringement. the Betamax safe harbor doesn’t automatically extend to fair use factor 4 analysis.
common thread and burden of proof
most counterarguments reduce to: “substitution isn’t happening at scale, or is offset by benefits.”
our response: this is an empirical question, and defendants claiming “no market harm” have the resources and data to answer it. they can measure usage patterns, revenue impacts, causal attribution, and cross-price elasticity. if their products truly don’t harm creator markets—or if benefits offset harms—show it with evidence. absent that evidence, courts should apply the logical economic prediction: substitutes reduce demand for originals. that’s the definition of substitution in economics.
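cross-price elasticity is the standard economic test for substitution: if demand for originals moves in the same direction as the price of AI output, the goods are substitutes. a minimal sketch, estimating the elasticity as the OLS slope on log-log data (all numbers are hypothetical, for illustration only):

```python
import math

def cross_price_elasticity(substitute_prices, original_demand):
    """Estimate cross-price elasticity: % change in demand for original
    works per % change in the price of the AI substitute. Computed as
    the ordinary-least-squares slope on log-transformed data.
    Positive elasticity → the goods behave as economic substitutes."""
    xs = [math.log(p) for p in substitute_prices]
    ys = [math.log(q) for q in original_demand]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# hypothetical panel: as AI subscription prices fall, sales of originals fall too
prices = [20.0, 15.0, 10.0, 5.0]   # AI substitute price over four periods
demand = [1000, 870, 700, 480]     # units of original works sold
print(f"elasticity ≈ {cross_price_elasticity(prices, demand):.2f}")
```

a real study would control for confounders (seasonality, platform changes) with instruments or quasi-experimental designs—but the point stands that the raw inputs to this calculation sit in defendants’ own telemetry.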
the legal framework is clear: factor 4 is decisive (as Kadrey, Harper & Row, and the Copyright Office report emphasized), substitutive uses fail fair use (as TVEyes, Napster, and Texaco showed), and Google Books is distinguishable because snippet view was designed to be non-substitutive while AI outputs often aren’t.
capability for market-harming substitution exists (Sections 2-6 established this). capability + economic logic = likelihood of harm. that’s sufficient for factor 4 analysis. courts don’t wait for markets to collapse before finding harm likely.
conclusion: from capability to likelihood of harm
we’ve established four things:
- capability exists: models generate functional substitutes across modalities
- causal mechanism is established: training → encoding → prompting → substitution
- economic logic is settled: substitutes reduce demand. that’s what “substitute” means.
- legal doctrine is clear: factor 4 is decisive. substitutive uses fail fair use.
capability + economic logic = likelihood of harm. that’s the legal standard.
courts act on likelihood of harm, not evidence of completed destruction. they don’t wait for statistical confirmation that illustrators are unemployed, writers are displaced, and small creative businesses are bankrupt—while a handful of AI companies scale to billions in revenue.
that’s how preliminary relief works. trade secret injunctions don’t wait for the secret to be published and market share lost. patent injunctions don’t wait for the competitor’s business to succeed.
likelihood is the standard, not labor-market catastrophe or social unrest.
we’ve demonstrated the capability across modalities. economic theory tells us what happens when substitutes enter markets: demand for originals falls. uncertainty about magnitude doesn’t defeat likelihood. claiming “we need empirical confirmation before courts can find harm likely” confuses the standard. it’s exactly backwards.
defendants claiming “no market harm” bear the burden now. they have the usage data. they have the revenue figures. they know whether their products drive discovery and purchases (like Google Books snippets) or displace them (like Napster downloads).
if their systems truly expand markets rather than cannibalize them, demonstrate it. conduct the studies. release the data. show that users who get AI-generated code still buy O’Reilly books, that users who get AI-generated images still commission artists, that users who get AI-generated summaries still subscribe to publications.
waiting for “2-3 years of peer-reviewed empirical studies” before finding market harm likely would defeat the purpose of factor 4. by the time you have perfect causal evidence with validated instruments and quasi-experimental designs, the markets are gone.
Harper & Row didn’t wait for sales data. Napster didn’t wait for longitudinal studies. courts act when the economic prediction is clear and the capability is demonstrated. both conditions are met.
the “AI training is spectacularly transformative” argument fails on factor 4 regardless of factor 1. Warhol established this: even transformative uses fail fair use when they usurp licensing markets.
Google Books succeeded because of design choices that prevented substitution—three-line snippets that were “fragmentary and scattered,” helping users discover books to purchase.
AI outputs are different. when a model generates complete implementations, full images, comprehensive summaries, users obtain what they need. no purchase follows because none is required.
my earlier post demonstrated verbatim copying exists through mathematical construction. this post demonstrates that functional substitution creates market harm through economic logic. together they demolish the “AI training is always fair use” position from complementary angles: one shows the copying, the other shows the harm.
both are established, not speculative. courts claiming otherwise are wrong.
ps: for the mathematical foundation showing that compression enables extractable copying and why model weights legally contain copies, see the companion post.
references
legal authorities
us supreme court
- Harper & Row, Publishers, Inc. v. Nation Enterprises, 471 U.S. 539 (1985) - “undoubtedly the single most important element of fair use”
- Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994) - transformative use doesn’t negate market harm analysis
- Andy Warhol Foundation for Visual Arts, Inc. v. Goldsmith, 598 U.S. 508 (2023) - licensing market harm decisive
- Sony Corp. of America v. Universal City Studios, Inc., 464 U.S. 417 (1984) - substantial noninfringing uses defense for contributory infringement
- Google LLC v. Oracle America, Inc., 593 U.S. 1 (2021) - copyright enforcement shouldn’t harm public interest in progress
us courts of appeals
- Authors Guild v. Google Inc., 804 F.3d 202 (2d Cir. 2015) - snippet view non-substitutive, “fragmentary and scattered”
- Fox News Network, LLC v. TVEyes, Inc., 883 F.3d 169 (2d Cir. 2018) - video clips “usurped a market”
- A&M Records, Inc. v. Napster, Inc., 239 F.3d 1004 (9th Cir. 2001) - file sharing substituted for purchases
- American Geophysical Union v. Texaco Inc., 60 F.3d 913 (2d Cir. 1994) - photocopying harmed licensing market
- Arnstein v. Porter, 154 F.2d 464 (2d Cir. 1946) - ordinary observer test for substantial similarity
- Computer Associates Int’l, Inc. v. Altai, Inc., 982 F.2d 693 (2d Cir. 1992) - abstraction-filtration-comparison for software
- Kelly v. Arriba Soft Corp., 336 F.3d 811 (9th Cir. 2003) - thumbnail images transformative, “use did not supplant need for originals”
- Sega Enterprises Ltd. v. Accolade, Inc., 977 F.2d 1510 (9th Cir. 1992) - intermediate copying for reverse engineering
us district courts
- Kadrey v. Meta Platforms, Inc., No. 3:23-cv-03417 (N.D. Cal. 2025) - “market dilution will often cause plaintiffs to decisively win the fourth factor”
uk courts
- Ashdown v. Telegraph Group Ltd. [2001] EWCA Civ 1142 - fairness includes market impact
- Public Relations Consultants Association Ltd. v. Newspaper Licensing Agency Ltd. (Meltwater) [2013] UKSC 18 - licensing arrangements matter for fairness
statutory authority
- 17 U.S.C. §106 - exclusive rights including reproduction (§106(1)) and derivative works (§106(2))
- 17 U.S.C. §107 - fair use four factors
academic papers
memorization and extraction
- Biderman, Stella, et al. (2023). “Emergent and Predictable Memorization in Large Language Models.” arXiv:2304.11158. k-extractability threshold (k=32)
- Prashanth, U. S., Biderman, Stella, et al. (2024). “Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon.” arXiv:2406.17746. recite/reconstruct/recollect taxonomy
- Carlini, Nicholas, et al. (2023). “Extracting Training Data from Diffusion Models.” USENIX Security 2023. 1,000+ training images extracted from Stable Diffusion
- Chen, Chuxiao, et al. (2024). “Exploring Local Memorization in Diffusion Models via Bright Ending Attention.” ICLR 2025 Spotlight. bright ending phenomenon distinguishes local vs global memorization
- Wang, Wenhao, et al. (2024). “Image Copy Detection for Diffusion Models.” NeurIPS 2024. 10-20% replication ratios in Stable Diffusion V1.5
- Wang, Zhe, et al. (2024). “Could It Be Generated? Towards Practical Analysis of Memorization in Text-To-Image Diffusion Models.” arXiv:2405.05846. formal definition of memorization in diffusion models
- Webster, Ryan (2023). “A Reproducible Extraction of Training Images from Diffusion Models.” arXiv:2305.08694.
- Nasr, Milad, et al. (2023). “Scalable Extraction of Training Data from (Production) Language Models.” arXiv:2311.17035. extraction from RLHF models
- Lee, Katherine, et al. (2021). “Deduplicating Training Data Makes Language Models Better.” arXiv:2107.06499.
audio models and music generation
- Bharucha, François G., et al. (2024). “Generation or Replication: Auscultating Audio Latent Diffusion Models.” ICASSP 2024. systematic memorization analysis in text-to-audio models
- Epple, Pascal, et al. (2024). “Watermarking Training Data of Music Generation Models.” arXiv:2412.08549. watermark persistence demonstrates training data retention
- Messina, Francisco, et al. (2025). “Mitigating data replication in text-to-audio generative diffusion models through anti-memorization guidance.” arXiv:2509.14934. AMG trade-off between memorization and prompt fidelity
adversarial attacks and jailbreaking
- Qu, Xiaoxuan, et al. (2024). “Automatic Jailbreaking of the Text-to-Image Generative AI Systems.” arXiv:2405.16567. automated jailbreaking reduced ChatGPT copyright filter from 84% to 6.25% block rate
paraphrasing and academic plagiarism
- Gupta, Tarun and Danish Pruthi (2025). “All That Glitters is Not Novel: Plagiarism in AI Generated Research.” ACL 2025. 24% of AI-generated research paraphrased existing work; ICLR workshop paper scored 6,7,6 despite being “definitively not novel”
mechanisms: retrieval and reconstruction
- Khandelwal, Urvashi, et al. (2020). “Generalization through Memorization: Nearest Neighbor Language Models.” ICLR 2020. kNN-LM explicit retrieval
- Olsson, Catherine, et al. (2022). “In-context Learning and Induction Heads.” Transformer Circuits. induction heads copy patterns
- Wang, Peng, et al. (2023). “Understanding Deep Representation Learning via Layerwise Feature Compression and Discrimination.” arXiv:2311.02960. hierarchical representations
- Wen, Yuxin, et al. (2023). “Tree-Ring Watermarks: Fingerprints for Diffusion Images that are Invisible and Robust.” NeurIPS 2023. watermark persistence through training
similarity metrics and code generation
- Fu, Stephanie, et al. (2023). “DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data.” NeurIPS 2023 Spotlight. learned perceptual similarity
- Yetiştiren, Burak, et al. (2023). “Evaluating the Code Quality of AI-Assisted Code Generation Tools.” arXiv:2302.06590. Copilot 46.3% correctness
- Nguyen, Nhan and Sarah Nadi (2023). “An Empirical Evaluation of GitHub Copilot’s Code Suggestions.” arXiv:2302.04728. 46% output variance on paraphrases
- Radford, Alec, et al. (2021). “Learning Transferable Visual Models From Natural Language Supervision (CLIP).” arXiv:2103.00020.
reports and policy
government and regulatory
- US Copyright Office (2025). “Copyright and Artificial Intelligence, Part 3: Generative AI Training.” pre-publication version, May 2025. market dilution theory endorsed
legal commentary and analysis
- McIntosh, David M., et al. (2025). “A Tale of Three Cases: How Fair Use Is Playing Out in AI Copyright Lawsuits.” Ropes & Gray Alert, July 2025.
- “Measuring Fair Use: The Four Factors.” Stanford Copyright and Fair Use Center.
- “US Copyright Office Issues Pre-Publication Version of 3rd Report on AI Training and Fair Use.” ChatGPT Is Eating the World, May 10, 2025.
market impact studies (preliminary)
- CISAC (International Confederation of Societies of Authors and Composers). “Global Economic Study on Creator Displacement.” Projections for 2024-2028.
- SimilarWeb. “Stack Overflow Traffic Analysis.” ~40% YoY decline by Q2 2024.
- Blood in the Machine. “Artist Displacement Reporting.” Anecdotal accounts of lost commissions.
real-world market substitution
- “AI slop hits new high as fake country artist hits #1 on Billboard digital songs chart.” The Register, November 10, 2024.
- “An AI-Generated Country Song Is Topping A Billboard Chart, And That Should Infuriate Us All.” Whiskey Riff, November 8, 2024.
- “Who the Heck is Breaking Rust? The AI-Generated Artist Topping the Spotify and Billboard Charts.” Holler Country, November 2024.
- “Breaking Rust is a hot new country act on the Billboard charts. It’s powered by AI.” NPR, November 10, 2025.
legal actions: music copyright
- “Suno AI and Open AI: GEMA sues for fair compensation.” GEMA press release, January 21, 2025.
- “Fair remuneration demanded: GEMA files lawsuit against Suno Inc.” CISAC, January 21, 2025.
- “GEMA wins landmark ruling against OpenAI over ChatGPT’s use of song lyrics.” Music Business Worldwide, November 2024.
- “Munich hears first ever case over licences for generative AI in GEMA vs OpenAI.” JUVE Patent, November 2024.
technical tutorials and tools
- OpenAI. “OpenAI Cookbook: CLIP Embeddings.” Implementation guide.
- Sentence Transformers documentation. reference implementations for semantic similarity measurement using transformer models.
project resources
- model weights contain copies: compression is not magic - technical foundation demonstrating verbatim extraction through compression mechanisms
opinions expressed are my own and not those of any affiliated entities.