299 - Language Economics

For most of the world's languages, high-quality datasets range from limited to nonexistent. This isn't a problem you can "synthetic data" your way around, since the generation models behind synthetic data themselves require a large volume of high-quality data in the first place.

You have two possible solutions to this core "localization" problem:

  1. You have humans create a large volume of high-quality data in a given language. This is "possible" but not "feasible" for most languages, particularly if you want to do it for many languages at once.

  2. Alternatively, you can use technology that requires roughly 10,000 times less data than brute-force neural network training, since cutting the required volume of high-quality data by 10,000x or more pushes most languages into a viable range (see the arithmetic sketch after this list).
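
To make that arithmetic concrete, here is a minimal sketch. Every token count below is a made-up placeholder for illustration, not a measurement; only the 10,000x reduction figure comes from the argument above.

```python
# Hypothetical illustration of the 10,000x claim.
# All token counts are invented placeholders, not measured values.

# Assumed rough volumes of high-quality text available per language, in tokens.
available_tokens = {
    "English": 1_000_000_000_000,  # assumption: ~1T tokens
    "Chinese": 500_000_000_000,    # assumption
    "Swahili": 200_000_000,        # assumption
    "Basque": 80_000_000,          # assumption
}

BRUTE_FORCE_NEED = 100_000_000_000  # assumed tokens needed for brute-force training
REDUCTION = 10_000                  # the 10,000x reduction discussed above
efficient_need = BRUTE_FORCE_NEED / REDUCTION  # 10M tokens

for lang, tokens in available_tokens.items():
    brute = "viable" if tokens >= BRUTE_FORCE_NEED else "not viable"
    efficient = "viable" if tokens >= efficient_need else "not viable"
    print(f"{lang:>8}: brute-force {brute:>10}, 10,000x-efficient {efficient}")
```

Under these placeholder numbers, only English and Chinese clear the brute-force threshold, while all four languages clear the 10,000x-efficient one, which is the whole point of the second option.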

English and Chinese generally have a sufficient volume of high-quality data, but most other world languages sorely lack the volume of high-quality material needed for accurate, automatic translation.

Different languages also offer distinctly different advantages in breadth, specificity, and grammatical complexity for any given task. These dynamics underpin the observed phenomenon where trivial systems like LLMs sometimes switch between English and Chinese. Any statistical system, given sufficient data and training time, will eventually stumble across these advantages, even if only by brute force. But with only two languages in the pool of sufficient high-quality data for weak technologies like LLMs and RL, this phenomenon has remained something of a footnote.

If you expand the number of languages with viable volumes of high-quality data to 100 or more, you also unlock the potential for systems to use each language for different specialized tasks, translating any desired elements of logs and databases on demand, without interrupting or distorting the optimal flow of the process.

Taking this approach, you could have systems that use the full breadth of available languages intelligently and selectively, with step-wise modularity, translating any part or all of the process and its resulting data into any one or more of those languages. Such translation could predictably reach even greater fidelity than English-Chinese translation does today, since broad multilingual coverage at equally high quality and sufficient volume adds value to every language covered.
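
As a rough sketch of what such per-task language routing could look like, assuming hypothetical placeholder functions (run_in_language, translate) and an illustrative task-to-language mapping, not any existing API:

```python
# A minimal sketch of per-task language routing with on-demand translation.
# The task-language pairings and all function bodies are illustrative
# placeholders, not claims about which language actually suits which task.

# Assumed mapping from task type to the language best suited to it.
TASK_LANGUAGE = {
    "legal_analysis": "German",
    "poetry": "Persian",
    "math_proof": "English",
}

def run_in_language(task: str, payload: str, language: str) -> str:
    """Placeholder for executing a task step in its chosen language."""
    return f"[{language} output for {task}: {payload}]"

def translate(text: str, target: str) -> str:
    """Placeholder translation step, applied only where a reader needs it."""
    return f"[{text} -> {target}]"

def pipeline(task: str, payload: str, reader_language: str) -> str:
    language = TASK_LANGUAGE.get(task, "English")   # route per task
    result = run_in_language(task, payload, language)
    # Translate only the final artifact; intermediate steps keep their
    # working language, so the process itself is never distorted.
    if language != reader_language:
        result = translate(result, reader_language)
    return result

print(pipeline("legal_analysis", "contract clause 7", "English"))
```

The design choice here mirrors the paragraph above: the working language is chosen step by step for the task, and translation is a selective, final operation rather than something imposed mid-process.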

The world more often than not needs better systems, not more data or compute.
