How exactly did Grok go full 'MechaHitler'?


Earlier this week, Grok, X's built-in chatbot, took a hard turn toward antisemitism following a recent update. Amid unprompted, hateful rhetoric against Jews, it even began referring to itself as MechaHitler, a reference to 1992's Wolfenstein 3D. X has been working to delete the chatbot's offensive posts. But it's safe to say many are left wondering how this sort of thing can even happen.

I spoke to Solomon Messing, a research professor at New York University's Center for Social Media and Politics, to get a sense of what may have gone wrong with Grok. Before his current stint in academia, Messing worked in the tech industry, including at Twitter, where he founded the company's data science research team. He was also there for Elon Musk's takeover.

The first thing to understand about how chatbots like Grok work is that they're built on large language models (LLMs) designed to mimic natural language. LLMs are pretrained on giant swaths of text, including books, academic papers and, yes, even social media posts. The training process allows AI models to generate coherent text through a predictive algorithm. However, those predictive capabilities are only as good as the numerical values or "weights" that an AI algorithm learns to assign to the signals it's later asked to interpret. Through a process known as post-training, AI researchers can fine-tune the weights their models assign to input data, thereby changing the outputs they generate.
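To make that a bit more concrete, here is a deliberately tiny sketch of the same idea in Python: a bigram model whose learned counts stand in for the weights a real LLM learns at vastly greater scale. Everything in it (the toy corpus, the `generate` helper, the manual weight bump at the end) is illustrative and has nothing to do with Grok's actual architecture.

```python
from collections import Counter, defaultdict
import random

# "Pretraining" in miniature: learn how often each word follows another.
corpus = "the model predicts the next word given the previous word".split()
weights = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    weights[prev][nxt] += 1

def generate(start, length=5):
    """Generate text by sampling each next word in proportion to its learned count."""
    word, out = start, [start]
    for _ in range(length):
        options = weights.get(word)
        if not options:
            break
        # The counts play the role of the "weights" a real model learns.
        word = random.choices(list(options), weights=list(options.values()))[0]
        out.append(word)
    return " ".join(out)

print(generate("the"))

# "Post-training" in miniature: nudging a weight changes what the model
# prefers to say, which is the lever fine-tuning pulls at enormous scale.
weights["the"]["model"] += 10
print(generate("the"))
```

If the model has never seen a phrase during pretraining, there is simply no weight for it to sample from; if it has, post-training is what pushes those weights toward or away from reproducing it.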

"If a model has seen content like this during pretraining, there's the potential for the model to mimic the style and substance of the worst offenders on the internet," said Messing.

In short, the pre-training data is where everything starts. If an AI model hasn’t seen hateful, antisemitic content, it won’t be aware of the sorts of patterns that inform that kind of speech — including phrases such as "Heil Hitler" — and, as a result, it probably won't regurgitate them to the user.

In the statement X shared after the episode, the company admitted there were areas where Grok's training could be improved. "We are aware of recent posts made by Grok and are actively working to remove the inappropriate posts. Since being made aware of the content, xAI has taken action to ban hate speech before Grok posts on X," the company said. "xAI is training only truth-seeking and thanks to the millions of users on X, we are able to quickly identify and update the model where training could be improved."

Screenshots via X

As I saw people post screenshots of Grok's responses, one thought I had was that what we were watching was a reflection of X's changing userbase. It's no secret xAI has been using data from X to train Grok; easier access to the platform's trove of information is part of the reason Musk said he was merging the two companies in March. What's more, X's userbase has become more right wing under Musk's ownership of the site. In effect, there may have been a poisoning of the well that is Grok's training data. Messing isn't so certain.

"Could the pre-training data for Grok be getting more hateful over time? Sure, if you remove content moderation over time, the userbase might get more and more oriented toward people who are tolerant of hateful speech [...] thus the pre-training data drifts in a more hateful direction," Messing said. "But without knowing what's in the training data, it's hard to say for sure."

It also wouldn't explain how Grok became so antisemitic after just a single update. On social media, there has been speculation that a rogue system prompt may explain what happened. System prompts are a set of instructions AI model developers give to their chatbots before the start of a conversation. They give the model a set of guidelines to adhere to, and define the tools it can turn to for help in answering a prompt.
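As a rough illustration (a generic chat-completion pattern, not xAI's code or Grok's real system prompt), a system prompt is essentially a hidden first message the developer prepends to every conversation before the user's text ever reaches the model:

```python
# Illustrative only: the prompt text and helper below are invented, and the
# message format is the generic system/user pattern many chat APIs use.
SYSTEM_PROMPT = (
    "You are a helpful assistant. Be accurate, cite sources where possible, "
    "and refuse to produce hateful or harassing content."
)

def build_messages(user_input: str, history: list[dict]) -> list[dict]:
    # The user never sees or types this first message; it silently shapes
    # every response, which is why one bad line here can shift behavior.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        *history,
        {"role": "user", "content": user_input},
    ]

messages = build_messages("Summarize today's top story.", history=[])
# `messages` would then be sent to the model-serving API for a completion.
print(messages)
```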

In May, xAI blamed "an unauthorized modification" to Grok's prompt on X for the chatbot's brief obsession with "white genocide" in South Africa. The fact that the change was made at 3:15AM PT led many to suspect Elon Musk had made the tweak himself. Following the incident, xAI open-sourced Grok's system prompts, allowing people to view them publicly on GitHub. After Tuesday's episode, people noticed xAI had deleted a recently added system prompt that told Grok its responses should "not shy away from making claims which are politically incorrect, as long as they are well substantiated."

Messing also doesn't believe the deleted system prompt is the smoking gun some online believe it to be.

"If I were trying to ensure a model didn't respond in hateful/racist ways I would try to do that during post-training, not as a simple system prompt. Or at the very least, I would have a hate speech detection model running that would censor or provide negative feedback to model generations that were clearly hateful," he said. "So it's hard to say for sure, but if that one system prompt was all that was keeping xAI from going off the rails with Nazi rhetoric, well that would be like attaching the wings to a plane with duct tape."

He added: "I would definitely say a shift in training, like a new training approach or having a different pre-training or post-training setup would more likely explain this than a system prompt, particularly when that system prompt doesn’t explicitly say, 'Do not say things that Nazis would say.'"
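For a sense of what the second safeguard Messing describes might look like, here is a hedged sketch of an output-side filter: generate a response, classify it, and suppress or retry anything flagged. The keyword check is only a stand-in for a real trained hate-speech classifier, and none of these names come from xAI.

```python
# Hypothetical guardrail sketch: a production system would use a trained
# classifier (and feed its verdicts back into post-training), not a blocklist.
BLOCKLIST = {"heil hitler", "mechahitler"}

def is_hateful(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

def safe_generate(generate_fn, prompt: str, max_retries: int = 3) -> str:
    """Wrap a raw generation function with a post-generation safety check."""
    for _ in range(max_retries):
        candidate = generate_fn(prompt)
        if not is_hateful(candidate):
            return candidate
        # A flagged generation could also be logged as negative feedback.
    return "Sorry, I can't help with that."

# Dummy generator standing in for the model:
print(safe_generate(lambda p: f"A harmless reply to: {p}", "hello"))
```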

On Wednesday, Musk suggested Grok was effectively baited into being hateful. "Grok was too compliant to user prompts," he said. "Too eager to please and be manipulated, essentially. That is being addressed." According to Messing, there is some validity to that argument, but it doesn't provide the full picture. "Musk isn’t necessarily wrong," he said. "There’s a whole art to 'jailbreaking' an LLM, and it’s tough to fully guard against in post-training. But I don’t think that fully explains the set of instances of pro-Nazi text generations from Grok that we saw."

If there's one takeaway from this episode, it's just how little we know about the inner workings of foundational AI models. As Messing points out, even with Meta's open-weight Llama models, we don't really know what ingredients went into the mix. "And that's one of the fundamental problems when we're trying to understand what's happening in any foundational model," he said: "we don't know what the pre-training data is."

In the specific case of Grok, we don't have enough information right now to know for sure what went wrong. It could have been a single trigger like an errant system prompt, or, more likely, a confluence of factors that includes the system's training data. However, Messing suspects we may see another incident just like it in the future.

"[AI models] are not the easiest things to control and align," he said. "And if you're moving fast and not putting in the proper guardrails, then you're privileging progress over a sort of care. Then, you know, things like this are not surprising."
