
Self-Consuming Generative Models Go MAD

8/28/2024 • arxiv.org


Seismic advances in generative AI algorithms for imagery, text, and other data types have led to the temptation to use synthetic data to train next-generation models. Repeating this process creates an autophagous (self-consuming) loop whose properties are poorly understood. We conduct a thorough analytical and empirical analysis using state-of-the-art generative image models of three families of autophagous loops that differ in how fixed or fresh real training data is available through the generations of training and in whether the samples from previous generation models have been biased to trade off data quality versus diversity. Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease. We term this condition Model Autophagy Disorder (MAD), making analogy to mad cow disease.


C4AIL Commentary

We consider this a great example of a misleading story that manages to generate headlines because it appeals to people’s hopes and wishful thinking, while having very limited consequences in the real world.

Summarized, the findings of the study are: “naive use of synthetic data leads to model collapse”. This is a well-understood phenomenon that model creators have been aware of for many years and that is actively mitigated as a matter of routine.

Model collapse happens when you train a new model on the output of a previous model over several iterations: after a few iterations, a sudden phase shift occurs and model performance collapses.
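
To make the mechanism concrete, here is a minimal toy sketch (our own illustration, not the paper’s image-model experiments): a Gaussian “model” is repeatedly refitted on nothing but its own samples, and because estimation error compounds from generation to generation, the fitted spread tends to drift toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 200            # samples "collected" per generation (toy-scale choice)
GENERATIONS = 100  # how many times the model is retrained on its own output

# Ground-truth "real" data: a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=N)

for gen in range(1, GENERATIONS + 1):
    # "Train" the next-generation model: fit a Gaussian (mean, std) to the data.
    mu_hat, sigma_hat = data.mean(), data.std()
    # Fully autophagous loop: the next training set consists only of samples
    # drawn from the model we just fitted -- no fresh real data at all.
    data = rng.normal(mu_hat, sigma_hat, size=N)
    if gen % 20 == 0:
        print(f"gen {gen:3d}: fitted std = {sigma_hat:.3f}")

# Estimation error compounds across generations, so the fitted std tends to
# drift toward zero: the model gradually loses diversity, a toy analogue of
# the quality/diversity decay the paper calls MAD.
```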

Simplified: “if you feed cows to cows, eventually you get mad cow disease”. The obvious solution is to not feed cows to cows; other solutions involve processing the cows before feeding them to their brethren.
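
The “fresh grass” fix is just as easy to see in the same toy setting. The sketch below (again our own illustration; the 50% real-data fraction is an arbitrary choice for demonstration) mixes fresh real samples into every generation’s training set, which keeps the fitted spread anchored near the true value.

```python
import numpy as np

rng = np.random.default_rng(0)

N, GENERATIONS = 200, 100
REAL_FRACTION = 0.5  # assumption for illustration: half of each training set is fresh real data

def sample_real(n):
    """Fresh draws from the true 'real' distribution (standard normal)."""
    return rng.normal(loc=0.0, scale=1.0, size=n)

data = sample_real(N)

for gen in range(1, GENERATIONS + 1):
    mu_hat, sigma_hat = data.mean(), data.std()  # refit the toy Gaussian model
    n_real = int(N * REAL_FRACTION)
    fresh = sample_real(n_real)                                 # new real data
    synthetic = rng.normal(mu_hat, sigma_hat, size=N - n_real)  # model output
    data = np.concatenate([fresh, synthetic])
    if gen % 20 == 0:
        print(f"gen {gen:3d}: fitted std = {sigma_hat:.3f}")

# With enough fresh real data anchoring every generation, the fitted std stays
# close to 1.0 instead of collapsing, mirroring the paper's conclusion that
# sufficient fresh real data per generation prevents MAD.
```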

Not all synthetic data is the same

There are many different classes of synthetic data. For example, recent Google models that simulate video games are trained on real game footage generated by AI agents playing the game. Such footage, unless overly redundant, can presumably be used effectively to augment a model (in fact, DeepMind has made the point that the volume of data needed to train some models is already only achievable via synthetic data).

TL;DR:

Model collapse is basically a clickbait problem that does not represent a realistic risk for AI systems, since nobody deploys systems without taking it into account.