155 - Benchmarks

The value of any given benchmark in AI is comparable to the value of a new car. The moment you publish it, like driving it off the lot, the value drops massively.

Before a benchmark is published, you won't find any third parties attempting to adversarially optimize against it. The moment it is published that changes, and the benchmark's value continues to drop the longer it has been public. To continue the car metaphor, adversarial attempts to game a benchmark are like rapidly piling on mileage, and the more popular the benchmark is, the more rapidly that mileage adds up.

The vast majority of benchmarks in AI are also designed explicitly to compare narrow AI systems on an extremely narrow, task-specific basis. This is understandable, both because scoring could rapidly become intractable without that narrow specificity, and because benchmarks can't compare between non-narrow AI systems with a sample size of 1.

Consequently, the most popular benchmarks in AI being regurgitated across social media today aren't just worthless. They are hazardous, the rough equivalent of a rusty old Ford Pinto. They may have served a purpose once, but people are just playing Weekend at Bernie's with them today.

One of my favorite examples of the typical less-than-worthlessness of popular benchmarks was when a joke project from Yannic Kilcher named "GPT-4chan" (predating GPT-4), an LM trained on several years' worth of 4chan data, scored SOTA on TruthfulQA.

Human stupidity is also measured in orders of magnitude for impact. Google could have invested $80m in viable AI technology, but instead chose to focus on benchmarks, causing their stock value to take another $80bn hit, because they were stupid enough to invest in the AI industry's most obvious frauds and to cultivate an unhinged belief in "guardrails". They quite literally lost 1,000 times what it would have cost them to invest in viable tech, and the viable tech could have produced a gain in stock value, making the effective loss far greater than even that. This also isn't the first time, as they made the same mistake last year, losing over $100bn then.

Some may be tempted to say that Microsoft and Nvidia are winning the race by virtue of their current valuations, but both companies are so deep in fraud that once the other shoe drops they may lose 80% of their value or more, which is a blow capable of bankrupting many companies. Such illusions are transient, not permanent, so the questions are when those two will lose value, and whether they'll survive the loss.

As investment vehicles, benchmark chasers remain comparable to the Ford Pinto, regardless of who is currently winning the imaginary race.