177 - Data Junkyard
"Data is the new (Snake) Oil."
The "snake" part of that is critically important because roughly 99.9% of data is garbage and/or redundant. There is virtually nothing that requires "internet-scale data" to perform at the cutting edge because the cutting edge isn't dictated by brute-force compute.
This puts the copyright and privacy violations major tech companies have committed while fighting over data in their proper context: it takes an extreme depth of incompetence to try to collect all of the world's digital trash rather than build systems that require quality, not quantity.
Some companies have even taken this to such an extreme as to produce massive quantities of "synthetic" garbage data to feed their brute-force systems. While "synthetic data" has a handful of very narrow viable use cases, even most of the viable uses only deliver transient value until more effective systems are architected.
If "Big Data" is comparable to anything today, it is Cryptocurrency. The value of that garbage data is purely driven by human psychology, a "Tulip Mania" of sorts, spurred by the extremely naïve (and debunked) assumption that the data holds any intrinsic value.
The advantage that "big data" has over literal oil and cryptocurrency is that the same data can be sold over and over again to any party dumb enough to believe it holds value. Since the companies buying it don't have any meaningful understanding of the brute-force compute systems they feed the data into then they just keep buying it and pulling on the slot machine lever. They rely on benchmarks, which are virtually 100% contaminated when using "internet-scale" data, causing them to cultivate delusions of performance, rather than seeing the lever.
In many domains, all of the data anyone could need is already publicly available and free to use for commercial purposes. The entire NCBI database of peer-reviewed medical papers, the documentation of legal systems around the world, and countless free-to-use educational resources account for many gigabytes of raw text alone. Less than 1 GB of such data was sufficient to reach a cutting edge that trash like GPT-4 has never approached.
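As one concrete illustration of how accessible that data is, here is a minimal sketch that pulls peer-reviewed abstracts through NCBI's public E-utilities API; the query term, result count, and function name are illustrative assumptions.

```python
# Minimal sketch: fetch peer-reviewed abstracts via NCBI's public E-utilities API.
# The query term and result count are illustrative assumptions.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def fetch_pubmed_abstracts(query: str, max_results: int = 5) -> str:
    """Search PubMed for a query and return the matching abstracts as plain text."""
    # Step 1: find matching article IDs.
    search = requests.get(
        f"{EUTILS}/esearch.fcgi",
        params={"db": "pubmed", "term": query, "retmax": max_results, "retmode": "json"},
        timeout=30,
    )
    search.raise_for_status()
    ids = search.json()["esearchresult"]["idlist"]
    if not ids:
        return ""

    # Step 2: fetch the abstracts for those IDs as plain text.
    fetch = requests.get(
        f"{EUTILS}/efetch.fcgi",
        params={"db": "pubmed", "id": ",".join(ids), "rettype": "abstract", "retmode": "text"},
        timeout=30,
    )
    fetch.raise_for_status()
    return fetch.text


if __name__ == "__main__":
    print(fetch_pubmed_abstracts("antibiotic resistance mechanisms")[:500])
```

A few dozen lines against a free public endpoint is the scale of effort involved; no scraping of the world's digital trash required.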
Much of the data that isn't yet freely available is held by publishers who would leap at the opportunity to be credited for their work, opening fresh revenue streams in a new kind of knowledge marketplace. This means the cutting edge is not only built on freely available, free-to-use data today, but that there is a road forward that only such viable technology can follow tomorrow. That road simultaneously incentivizes cooperation among those holding high-quality data while strongly disincentivizing them from making any agreements with the snake oil peddlers behind today's LLMs.
"You are what you eat", so don't eat the snake oil. Most data looks like an ocean gyre today, covered in a thick layer of trash.