305 - Data Efficiency

Data curation (catering to LLMs) is a hot topic, widely discussed as a major investment area for companies in the coming years. There are, however, several severe problems with that assessment:

  • Cutting-edge AI systems require less than 1/10,000th of the data consumed by trash like LLMs and RL.
  • Those same systems can find most of that domain-specific data themselves, even with zero prior knowledge of a particular field.
  • This isn't theory but historical fact, demonstrated on January 10th, 2022.

This brings to mind the old joke of three perspectives:

  • Pessimist: This cup is half empty.
  • Optimist: This cup is half full.
  • Engineer: This cup is twice the size that it needs to be.

In this case, "the cup is over 10,000 times the size that it needs to be", and someone has opted to fill it one teaspoon at a time via data science teams. Rather than simply filling a cup at the water cooler, you're filling a small swimming pool's worth, one teaspoon at a time, for your 10,000x+ oversized curated datasets.

Now, if your accountants were to see you spending $10,000+ per cup of coffee, they'd probably suspect you of money laundering, and rightly so. This is no different, just with more Marc Rebillet.

In contrast, the final milestone of our 7th generation ICOM-based research system, circa January 2022, was a demonstration in which a system with zero prior experience or training data on the topic was tasked with providing economic transformation policy advice for a small country. The result was a 13-page report addressing a half dozen different domains: listing steps, explaining strategy, citing sources (with zero "hallucination", aka "bullshit"), recommending partnerships, and pointing out additional data for monitoring purposes.

That technology predated LLMs such as GPT-3, while being built and run on volunteered spare time and pocket change, compared to the countless billions that have been wasted, and are still being wasted today, on trash technology. LLMs and RL create fake problems for snake oil salesmen like Scam Altman and his cohort to peddle their treatments for. The simple fact is that if anyone were investing in viable technology, the solution to this "artificial" problem would already be deployed commercially.

This is an insurmountable advantage, and it has been sitting on the table for years, during which time both the technological moat and the demand have only grown, with demand in particular increasing by orders of magnitude.
