258 - Synthetic Data
“Synthetic Data” has been another popular trend over the past couple of years, with a handful of extremely niche cases where it genuinely works well. Of course, 99% of the time it is peddled for use cases far outside of those niches. People often accept it even when they know that it will harm their systems overall, so long as it increases their perceived “performance” according to some particular KPI or benchmark.
People are willing to make these kinds of poor trade-offs because the tools that they rely on are extremely inefficient at even mimicking the process of learning. This is why trash AI like LLMs often rely on “internet-scale data”, with companies like OpenAI and Anthropic frequently (and often illegally) scraping websites so aggressively that web servers crash and Cloudflare mistakes their scraper bots for a DoS (Denial-of-Service) attack.
The technology that so many billions of dollars have been, and are still being, burned on is so grossly inefficient that the entire sum of data on the internet wasn’t enough to bring it even remotely close to human levels of cognitive performance in any meaningful sense. Even if they were to 1,000x the entire sum of internet data with derivative “synthetic” garbage and train more LLMs on that, they’d still be using the wrong fundamental architecture to accomplish anything like human cognition.
The reality is that less than 1 gigabyte of data, over 30,000 times less than the figures typically quoted for “internet-scale” datasets, was all it took for a fundamentally different architecture to reach both human-like and human-level performance as early as 2019. Conservatively, that is more than four orders of magnitude less data than what these companies are failing with today, and with the right architecture it is enough to succeed.
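As a quick sanity check on that ratio, here is a minimal back-of-the-envelope sketch. The “internet-scale” figure used below is an assumed ballpark of roughly 30 TB of raw text, not a measured value:

```python
import math

# Back-of-the-envelope check of the data-volume gap described above.
sufficient_data_gb = 1.0        # the sub-1 GB figure cited above
internet_scale_gb = 30_000.0    # ~30 TB: an assumed ballpark, not a measurement

ratio = internet_scale_gb / sufficient_data_gb
print(f"ratio: {ratio:,.0f}x")                           # 30,000x
print(f"orders of magnitude: {math.log10(ratio):.1f}")   # ~4.5
```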
To put that into perspective, about 300,000 pages of peer-reviewed papers fit into ~1.1 gigabytes of raw text, and the NCBI medical database has millions of peer-reviewed papers, each many pages in length. A single publicly available and domain-specific body of high-quality data is more than sufficient, given the right architecture.
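For reference, the page arithmetic works out as follows. The bytes-per-page value is an assumed average for a dense, single-spaced academic page, not a measured constant:

```python
# Rough conversion from pages of raw text to gigabytes.
pages = 300_000
bytes_per_page = 3_800   # assumed average for dense academic text; real papers vary

total_gb = pages * bytes_per_page / 1e9
print(f"{pages:,} pages ≈ {total_gb:.2f} GB of raw text")  # ≈ 1.14 GB
```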
This means that there is virtually never a reason to use “Synthetic Data” outside of that tiny handful of viable niches. The rest of the time, “Synthetic” is just a buzzword meaning “Fake” or “Inaccurate”.
Again, the difference is astronomical, and it cleanly divides success from failure.