Here we provide a theoretical intuition for the phenomenon of model collapse. We argue that the process of model collapse is universal among generative models that recursively train on data generated by previous generations. We quantify the sources of error discussed in the previous section by examining two mathematical models that are simple enough to yield analytical expressions for the quantities of interest, yet still exhibit model collapse: a discrete distribution in the absence of functional expressivity and approximation errors, and a multidimensional Gaussian approximation, which captures functional expressivity and statistical errors jointly. We further illustrate the impact of all three error sources jointly in the more complex setting of density estimation in Hilbert spaces in the Supplementary Materials.
The overall stochastic process we consider, which we call learning with generational data, is the following. The dataset at generation i is \({{\mathcal{D}}}_{i}\), comprising independent and identically distributed random variables \({X}_{j}^{i}\) with distribution \({p}_{i}\), in which \(j\in \{1,\ldots ,{M}_{i}\}\) and \({M}_{i}\) denotes the size of the dataset. Going from generation i to generation i + 1, we aim to estimate the distribution of samples in \({{\mathcal{D}}}_{i}\) with an approximation \({p}_{{\theta }_{i+1}}\). This step is what we refer to as functional approximation, \({p}_{{\theta }_{i+1}}={{\mathcal{F}}}_{\theta }({p}_{i})\). The dataset \({{\mathcal{D}}}_{i+1}\) is then generated by sampling from \({p}_{i+1}={\alpha }_{i}{p}_{{\theta }_{i+1}}+{\beta }_{i}{p}_{i}+{\gamma }_{i}{p}_{0}\), with non-negative parameters αi, βi, γi summing to 1; that is, they represent the proportions of data used from different generations. This corresponds to a mixture of data coming from the original distribution (γi), data used by the previous generation (βi) and data generated by the new model (αi). We refer to this as the sampling step. For the mathematical models to come, we consider \({\beta }_{i}={\gamma }_{i}=0\), that is, only data generated by the latest model are used, whereas the numerical experiments are performed with more realistic choices of parameters.
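As an illustration, one generation of this process can be sketched for a discrete distribution, taking the functional approximation to be the empirical frequencies of the previous generation's samples (so only statistical error enters); the starting distribution, sample size and mixing weights below are hypothetical choices for illustration:

```python
import numpy as np

def one_generation(p_prev, p0, M, alpha, beta, gamma, rng):
    """One step of learning with generational data (discrete sketch).

    Functional approximation: empirical frequencies of M samples from p_prev.
    Next sampling distribution: alpha * p_hat + beta * p_prev + gamma * p0.
    """
    samples = rng.choice(len(p_prev), size=M, p=p_prev)
    p_hat = np.bincount(samples, minlength=len(p_prev)) / M  # F(p): statistical error only
    return alpha * p_hat + beta * p_prev + gamma * p0

rng = np.random.default_rng(0)
p0 = np.array([0.7, 0.2, 0.09, 0.01])  # hypothetical original distribution
p = p0.copy()
for i in range(10):
    p = one_generation(p, p0, M=100, alpha=0.8, beta=0.1, gamma=0.1, rng=rng)
```

Because the mixing weights sum to 1 and each component is a probability distribution, every \({p}_{i+1}\) remains a valid distribution; the low-probability tail is where the statistical error first bites.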
Discrete distributions with exact approximation
In this subsection, we consider a discrete probability distribution in the absence of functional approximation and expressivity errors, that is, \({\mathcal{F}}(p)=p\). In this case, model collapse arises only because of statistical errors from the sampling step. At first, the tails (low-probability events) begin to disappear as a result of the low probability of sampling them and, over time, the support of the distribution shrinks. Denoting the sample size as M, if we consider a state i with probability \(q\le \frac{1}{M}\), the expected number of samples with value i will be less than 1, which in practice means that we lose information about such events. Considering more generally some state i with probability q, using standard conditional probability, we can show that the probability of losing information about it (that is, sampling no data with value i at some generation) is equal to 1 − q, implying that the distribution must converge to a delta function positioned at some state, with the probability of ending up at a given state equal to the probability of sampling that state from the original distribution.
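To make the first step concrete, consider the per-generation calculation (the specific values of q and M here are illustrative, not from the experiments): the probability that a state with probability q receives no samples in a single generation of size M is

$${(1-q)}^{M},\qquad {\rm{e}}.{\rm{g}}.\ q=0.01,\,M=100:\ {(0.99)}^{100}\approx {e}^{-1}\approx 0.37,$$

and once a state receives no samples, the exact approximation \({\mathcal{F}}(p)=p\) of the empirical distribution assigns it zero mass in every subsequent generation, so low-probability events are irrecoverably lost at a non-negligible rate per generation.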
This can be shown directly by considering the process \({{\bf{X}}}^{i}\to {{\mathcal{F}}}_{\theta }\to {p}_{i+1}\to {{\bf{X}}}^{i+1}\) as a Markov chain, as Xi+1 only depends on Xi. Furthermore, if all the \({X}_{j}^{i}\) have the same value, then at the next generation, the approximated distribution will be exactly a delta function, and therefore all of the \({X}_{j}^{i+1}\) will also have the same value. This implies that the Markov chain contains at least one absorbing state and therefore, with probability 1, it will converge to one of its absorbing states; this is a well-known fact, a proof of which is provided in the Supplementary Materials. For this chain, the only absorbing states are those corresponding to delta functions. As a result, as model collapse progresses, we are guaranteed to end up in a constant state, having lost all the information about the original distribution by the time the chain is absorbed. This argument also applies to models in general, because floating-point representations are discrete, making the Markov chain over the parameters of the model discrete. Thus, as long as the model parameterization allows for delta functions, we will reach one, because, owing to sampling errors, the only possible absorbing states are delta functions. On the basis of the discussion above, we see how both early model collapse, in which only the low-probability events are cut off, and late-stage model collapse, in which the process collapses to a single mode, must arise in the case of discrete distributions with perfect functional approximation.
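The absorption can be observed directly in a minimal simulation of the pure resampling chain, in which each new dataset comes only from the latest empirical distribution; the number of states, sample size and seed are arbitrary illustrative choices:

```python
import numpy as np

# Pure resampling chain: each generation refits the empirical distribution
# of M samples drawn from the previous generation (exact approximation,
# with each new dataset coming only from the latest model).
rng = np.random.default_rng(1)
M, k = 10, 4
p = np.full(k, 1.0 / k)          # start from a uniform distribution over k states
for generation in range(1000):
    counts = np.bincount(rng.choice(k, size=M, p=p), minlength=k)
    p = counts / M               # empirical distribution of this generation
    if p.max() == 1.0:           # a delta function: the chain is absorbed
        break

print(f"absorbed into state {int(p.argmax())} at generation {generation}")
```

With such a small sample size, absorption typically occurs within tens of generations; which state survives is itself random, mirroring the 1 − q argument above.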
Multidimensional Gaussian
Following the discussion of discrete distributions, we now present a more general result for the Gaussian approximation setting, in which each generation is fitted using the unbiased estimates of the mean and the variance of the previous one. A similar result holds more generally; we detail it in the Supplementary Materials.
Theorem 3.1 (Gaussian model collapse)
Assume that the original data are sampled from a distribution \({{\mathcal{D}}}_{0}\) (not necessarily Gaussian) with non-zero sample variance. Assume that, with a fixed sample size, the Xn are fitted recursively using the unbiased sample mean and variance estimators of the previous generation, \({X}_{j}^{n}| {\mu }_{n},{\Sigma }_{n} \sim {\mathcal{N}}({\mu }_{n},{\Sigma }_{n})\). Then,
$${\mathbb{E}}[{{\mathbb{W}}}_{2}^{2}({\mathcal{N}}({\mu }_{n},{\Sigma }_{n}),{{\mathcal{D}}}_{0})]\to \infty ;\,{\Sigma }_{n}\,\mathop{\to }\limits^{{\rm{a}}.{\rm{s}}.}\,0\,\,{\rm{a}}{\rm{s}}\,\,n\to \infty ,$$
in which \({{\mathbb{W}}}_{2}\) denotes the Wasserstein-2 distance between the true distribution and its approximation at generation n.
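A one-dimensional sketch conveys why the variance collapses almost surely (this outline uses the notation of the theorem with sample size M ≥ 2 per generation; it is an illustrative argument, not the full proof). The unbiased variance estimator satisfies

$${\sigma }_{n+1}^{2}=\frac{1}{M-1}\mathop{\sum }\limits_{j=1}^{M}{({X}_{j}^{n}-{\bar{X}}^{n})}^{2}={\sigma }_{n}^{2}\,\frac{{S}_{n}}{M-1},\qquad {S}_{n}\sim {\chi }_{M-1}^{2}\ {\rm{i}}.{\rm{i}}.{\rm{d}}.,$$

so \(\log {\sigma }_{n}^{2}\) performs a random walk with i.i.d. increments \(\log ({S}_{n}/(M-1))\), whose mean \(\psi ((M-1)/2)+\log 2-\log (M-1)\) is strictly negative by Jensen's inequality. By the strong law of large numbers, \(\log {\sigma }_{n}^{2}\to -\infty \) and hence \({\sigma }_{n}^{2}\to 0\) almost surely, even though \({\mathbb{E}}[{\sigma }_{n+1}^{2}| {\sigma }_{n}^{2}]={\sigma }_{n}^{2}\) at every step.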
In words, this implies that not only does the nth-generation approximation diverge arbitrarily far from the original distribution but its variance also collapses to zero as the number of generations increases, with probability 1. These results are closely analogous to those seen in the discrete case, with this theorem illustrating the effect of late-stage model collapse, in which the process collapses to zero variance. Early-stage model collapse can also be seen; the interested reader is referred to the Supplementary Materials for a more in-depth discussion.
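The collapse predicted by the theorem can be reproduced in a few lines; the one-dimensional setting, sample size, number of generations and seed below are arbitrary illustrative choices:

```python
import numpy as np

# Recursive Gaussian fitting: generation n + 1 draws M samples from
# N(mu_n, var_n) and refits mu and var with the unbiased estimators.
rng = np.random.default_rng(2)
M = 50
mu, var = 0.0, 1.0                      # hypothetical original mean and variance
history = [var]
for n in range(3000):
    x = rng.normal(mu, np.sqrt(var), size=M)
    mu, var = x.mean(), x.var(ddof=1)   # unbiased sample mean and variance
    history.append(var)

print(f"variance after {len(history) - 1} generations: {history[-1]:.3e}")
```

Although the variance estimator is unbiased at every single step, the multiplicative noise it introduces drives the variance towards zero over many generations, while the mean drifts away from its starting value.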