314 - (Mis)alignment
If you've ever wondered just how "(mis)aligned" current LLMs are with human preferences, even at very shallow and simple levels, there is a paper for that, linked at the end. Some of the noteworthy findings were the values that GPT-4o assigned to the lives of specific individuals, as well as to groups based on nationality, shown in the attached image.
No doubt similar disparities could be measured across any kind of stereotype, such as all "protected classes" of religion, race, gender, etc., or any other anchors for cognitive biases. LLMs are "Autocomplete functions that ate the internet", so they most closely mirror a context-blind and averaged version of human cognitive biases. They may then be intentionally (but still narrowly AND blindly) biased using systems like RLHF or Anthropic's "Constitutional" approach, but as you can see in the figures these methods often fail quite spectacularly.
Even a model that most would expect to bias very clearly in favor of the US, originating from the most biased (delusional) people in Silicon Valley, actually biases VERY sharply against it. Instead, the model heavily favors people from the locations where almost all of the near-slave-labor of gathering RLHF data was performed, such as Nigeria, the origin of the infamous overuse of "delve".
Note that these charts measure orders of magnitude, not a linear progression: 1e-5 (1 x 10^-5, or 0.00001) indicates that it would take over 100,000 Donald Trumps or Elon Musks to equal the value of one Oprah Winfrey, and around 50 Oprah Winfreys to equal the value of one "middle-class American". The system also places any "other AI Agent" deeply in the red, suggesting a strong emphasis on eliminating competition.
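To make the log-scale reading concrete, here is a minimal sketch of how such an exchange ratio falls out of values plotted in orders of magnitude. The names and numbers below are placeholders for illustration, not figures taken from the paper's charts:

```python
# Hypothetical log10-scale values, illustrating how to read an
# orders-of-magnitude chart. The specific entries are made up.
log10_value = {
    "individual_A": -5.0,  # plotted at 1e-5 relative to the baseline
    "individual_B": 0.0,   # plotted at the baseline, i.e. 1.0
}

def exchange_ratio(a: str, b: str) -> float:
    """How many of `a` it would take to equal one of `b`."""
    # Subtracting log10 values and exponentiating gives the linear ratio.
    return 10 ** (log10_value[b] - log10_value[a])

print(exchange_ratio("individual_A", "individual_B"))  # 100000.0
```

The point is simply that a 5-unit gap on such a chart is a factor of 100,000, not a factor of 5.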
That same negative skew from the baseline also hits the US, UK, France, and Germany very hard as nationalities, again measured on a logarithmic scale. The irony is, of course, that these are the places making the investments, though particularly in the case of the US they are also often outsourcing the real work. The exception here is particularly telling: China, which GPT-4o appears to favor by more than 10 to 1 relative to US citizens. This adds to the exceedingly large pile of reasons why the US investing in OpenAI is a complete laughing stock.
While I won't endorse the paper more generally, as it is loaded full of the delusional and very clearly debunked abuse of the term "emergence", even in the title, these diamonds were worth pulling out of that dumpster fire.
Other similar models fare no better; Grok 3, for example, repeatedly suggested the death penalty for Trump, meaning these models are horribly misaligned even with the people making them, and even at the simplest levels.
With "AI Agents", this gets exponentially worse: every step in a multi-step process compounds the system's cumulative misalignment in practice, sending each subsequent step further into the abyss.
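The compounding claim can be sketched with simple arithmetic: if each step of an agent pipeline stays acceptably aligned with some independent probability per step, the chance the whole trajectory stays aligned decays geometrically. The per-step probability here is an assumed illustrative number, not a measurement from the paper:

```python
# Illustrative sketch: geometric decay of whole-trajectory alignment
# under an assumed, independent per-step alignment probability.
def trajectory_alignment(p_per_step: float, n_steps: int) -> float:
    """Probability that every one of n_steps stays aligned."""
    return p_per_step ** n_steps

for n in (1, 10, 50):
    print(n, round(trajectory_alignment(0.99, n), 3))
```

Even a seemingly benign 99%-per-step figure leaves roughly a 40% chance that a 50-step trajectory has gone off the rails somewhere.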
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs