AI Training on AI-Generated Content: The Risk of Model Collapse
The blog post discusses a phenomenon known as "model collapse," which arises as AI-generated content proliferates across the internet and is increasingly used as training data for new AI models. Models trained this way degrade over time, producing more errors and less varied responses. A group of researchers from the UK and Canada investigated the phenomenon and found that as models are trained on data produced by other models, they quickly forget the original data they were initially trained on.

This "pollution" of the training data gives models a distorted perception of reality and can lead to discrimination based on factors such as gender and ethnicity. Because the models overfit to popular data and misrepresent less popular data, minority characteristics in the data are progressively lost over successive generations. The problem is likened to the quality degradation seen when a JPEG image is copied repeatedly, or when a clone creates a clone of itself, leading to exponentially decreasing levels of intelligence.
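The tail-loss mechanism described above can be illustrated with a toy simulation (this is a hedged sketch, not the researchers' actual experiment): a "model" is fit to samples drawn from the previous generation's model, and after many generations the estimated spread collapses toward zero, wiping out rare, tail events, much like the minority data characteristics lost in real model collapse.

```python
import numpy as np

def simulate_collapse(n_samples=100, n_generations=2000, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Each generation sees only data generated by the last generation's
    model, never the original distribution, mirroring training on
    AI-generated content. Finite-sample estimation error compounds,
    and the fitted standard deviation drifts toward zero.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0  # the "true" original data distribution
    stds = [sigma]
    for _ in range(n_generations):
        data = rng.normal(mu, sigma, n_samples)   # synthetic training set
        mu, sigma = data.mean(), data.std(ddof=1)  # refit on synthetic data
        stds.append(sigma)
    return stds

stds = simulate_collapse()
print(f"generation 0 std: {stds[0]:.3f}")
print(f"final generation std: {stds[-1]:.3e}")
```

The collapse here is purely statistical: each refit is an unbiased estimate of the previous variance, but the multiplicative estimation noise makes the log-variance a random walk with negative drift, so variance shrinks almost surely, a simplified analogue of how diversity vanishes when models learn from their own outputs.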
