Google AI Infringed Creators' Works in Developing Training Dataset for Imagen: Class Action
Three illustrators and a photographer sued Google Friday, alleging that its Imagen AI text-to-image diffusion model was trained on a dataset containing their copyrighted works. The class action (docket 3:24-cv-02531) was filed in U.S. District Court for Northern California in San Francisco.
Jingna Zhang, a Washington-based photographer, and cartoonist-illustrators Sarah Andersen of Oregon, Hope Larson of North Carolina and Jessica Fink of New York allege Imagen is trained by copying “an enormous quantity of digital images with associated text captions, extracting protected expression from these works, and transforming that protected expression into a large set of numbers called weights that are stored within the model.”
Weights are uniquely derived from the “protected expression in the training dataset,” the complaint said. When a diffusion model generates an image in response to a user prompt, “it is performing a computation that relies on these stored weights, with the goal of imitating the protected expression ingested from the training dataset,” the complaint said.
Training a model requires amassing a “huge corpus of data,” a dataset, and Imagen was trained on datasets comprising “millions of images” paired with descriptive captions, said the complaint. Each image-caption pair, called a training image, is directly copied “in full” and then completely ingested by the model, “meaning that protected expression from every training image enters the model,” it said. As the model copies and ingests vast numbers of training images, “the model progressively develops the ability to generate outputs that mimic the protected expression copied from the dataset,” it said.
Plaintiffs included in attached exhibits “nonexhaustive” lists of registered copyrights they own, along with a list of copyrighted images registered by them “and infringed by” Google in the LAION-400M dataset, the complaint said. Plaintiffs confirmed that these particular images were in the LAION-400M dataset by searching for their own names on two websites that allow searching of the LAION datasets, it said. “On information and belief, all of Plaintiffs’ works that were registered as part of the collections in Exhibit A and were online were scraped into the LAION-400M dataset,” it said.
Plaintiffs and class members own registered copyrights in certain training images that Google has admitted copying to train Imagen, but they never authorized Google to use their copyrighted works as training material, the complaint said. These copyrighted training images were copied “multiple times by Google during the training process for Imagen,” it said. Because Imagen contains weights that represent “a transformation of the protected expression in the training dataset, Imagen is itself an infringing derivative work,” it said.
Plaintiffs assert claims of direct copyright infringement vs. Google and vicarious copyright infringement vs. parent company Alphabet. They seek statutory and other damages; destruction “or other reasonable disposition” of all copies Google made or used in violation of plaintiffs’ rights; pre- and post-judgment interest; and litigation costs.