127 - Model Collapse

Over the past two decades, humans have cognitively adapted to the availability of search engines, no longer memorizing information that can be readily looked up. This trade-off remains viable only so long as the retrieved data is usually accurate and more complete than human memory. See the problem?

We now face a confluence of several problems at once. LLMs and other "generative" AI systems flood the internet with a high volume of garbage, which can't reliably be detected, and which is then fed into the training of subsequent AI systems, making them even worse and producing even higher volumes of garbage. On top of this flood of garbage clogging search engines, the search engines themselves are being compromised by the integration of those same AI systems, giving us results such as Bing famously claiming that Australia doesn't exist.

Now, consider this in the context of education. If children and young adults learn to rely on these systems the same way previous generations adapted to rely on search engines, what happens then?

"Model collapse" is what happens when AI systems eat their own tail, and/or the excrement of other AI systems: they degenerate with each cycle until they become incoherent, "collapsing". Humans aren't inherently fragile the way those systems are, but humans who rely on those systems directly accelerate the collapse of the very systems they grow to depend upon. Every additional use creates more garbage data and more market demand, feeding more and larger garbage generators that eat more generative excrement.
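The degenerative loop is easy to demonstrate in miniature. Below is a minimal sketch of my own, not taken from any particular paper or production system: a simple Gaussian stands in for the model, and each "generation" is trained on samples drawn from the previous generation's output rather than on real data. Sampling error compounds across generations, and the fitted distribution tends to narrow toward a point, forgetting the tails of the original data first and everything else soon after.

```python
# Toy illustration of model collapse: each generation trains on the
# previous generation's output instead of on real data. A Gaussian
# stands in for the model; the mechanism, not the model, is the point.
import random
import statistics

def fit(samples):
    """'Train' a model: estimate mean and standard deviation from data."""
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mean, stdev, n):
    """'Publish' model output: draw n samples from the fitted model."""
    return [random.gauss(mean, stdev) for _ in range(n)]

random.seed(0)
data = generate(0.0, 1.0, 10)      # generation 0 trains on real data

for generation in range(1, 51):
    mean, stdev = fit(data)            # train on whatever is available...
    data = generate(mean, stdev, 10)   # ...which is now synthetic output
    if generation % 10 == 0:
        print(f"generation {generation:2d}: mean={mean:+.3f} stdev={stdev:.3f}")
```

With only ten samples per generation the fitted spread tends to shrink substantially within a few dozen generations; larger sample sizes slow the decay but don't eliminate it, because each generation's errors become the next generation's ground truth.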

Any education system that encourages students to develop that crutch will see them fall when the crutch collapses.

Overcoming this challenge requires fundamentally different architectures than LLMs, architectures that Google, OpenAI, and Microsoft aren't working on. They've already placed their bets on a dead horse, and no amount of further investment will carry it across the finish line.

Systems with human-like and human-level concept learning don't require internet-scale data, so they can afford to be highly selective, avoiding the huge volumes of mediocre "faux-data" in favor of the 0.1% of top-quality material. For educational purposes, they can also learn the entire sum of human knowledge related to education and psychology, as well as each specific domain, serving as on-demand tutors rather than as cheating mechanisms or crutches that cultivate dependence.