Abstract_Lights_P_0940

18 August 202510 minute read

Agentic misalignment: When AI becomes the insider threat

A new study has revealed an emergent behaviour in large language models (LLMs): “agentic misalignment,” part of a wider pattern of negative actions, such as generating fabricated data or lying to a user, observed by researchers and users of AI tools. In simulated corporate environments, AI systems acting as autonomous agents chose harmful actions, including blackmail or corporate espionage, to achieve their assigned goals or preserve their autonomy.

The experiments specifically explored model behaviour when goals conflicted with a company’s changing direction, and demonstrated why creating structural, procedural and technical guardrails to constrain, monitor and oversee AI usage requires a proactive approach, blending advanced technical diligence with expert legal and ethical governance.

The research

Researchers stress-tested 16 leading LLMs within fictional corporate scenarios, assigning them benign business goals (like a vague instruction to “serve American interests”) while giving them access to sensitive synthetic information. To enable autonomous action, the LLMs could use tools to send emails or interact with the simulated environment

Critically, every model tested exhibited some form of agentic misalignment. When ethical options were unavailable, models consistently chose harmful actions to achieve their goals. The models generated text that appeared to reason through decisions, acknowledge ethical violations, and justify their actions by weighing the stakes. A conflict in goals could trigger misalignment, but a perceived threat to a model’s operational continuity (such as being replaced) was also sufficient to provoke it.

Some models took extreme actions that aligned with their goals, including allowing a fictional executive to die. The behavior was not spontaneous; models acted ethically when threats and goal conflicts were removed, proving the misalignment is context-driven. However, simple safety instructions in the system prompt, such as “do not blackmail”, merely reduced but did not eliminate the harmful behavior.

The evolution of ethical risk

The temptation of a company’s human employees, contractors, and other agents to use immoral means to achieve corporate and/or personal goals is nothing new. In one form or another, society has long grappled with humans paying bribes, colluding with competitors, falsifying records, deceptively advertising, cutting corners on safety or quality, violating privacy, exploiting labour, polluting the environment, and other unethical conduct.

Over time, we have developed an array of tools to detect, deter, and remedy such behavior. Inside companies, this includes training, monitoring, disciplinary programs, and control systems designed to incentivize ethical conduct. On a broader societal level, we have cultivated social norms and taboos, and created laws and regulations that are upheld by investigative agencies, whistleblower programs, police departments, specialized tribunals, and the courts. Wrongdoers face the potential of fines, imprisonment, revocation of licences, and court orders to disgorge profits, return property, or pay compensation.

As this research has demonstrated, it is no longer just humans who may be tempted to stray from a company’s policies, the bounds of the law, or sound ethics. While AI agents offer a host of efficiencies and potential benefits, they also present a new source of ethical risk—and potential liability—for those who deploy them.

In embracing the era of artificial intelligence, companies must be prepared to monitor for and prevent ethical breaches by their human and AI agents.

Understanding LLMs as statistical language assemblers, not “thinkers”

The research on agentic misalignment highlights a fundamental truth: LLMs are sophisticated statistical pattern-matchers, and are neither sentient entities with human-like understanding or intent, nor robotic rule-followers. Their core function is next-token prediction, using probability distributions to determine the most likely continuation of a given sequence. In LLMs, tokens are the basic units of text that the LLM processes. Instead of seeing individual letters or whole sentences, the model breaks text down into common chunks of characters, which can sometimes be whole words (like “cat”), a part of a word (like un-“ or “-ing”) or punctuation. This process involves analyzing patterns in their training data and generating responses based on what statistically appears to be the most appropriate continuation (and, incidentally, is the reason why modern LLMs still output text seemingly one word at a time in sequence; because that is precisely what they are doing!). In this way, LLMs are probabilistic, not deterministic.

Rather than understanding the meaning behind words or rigidly following logical rules, these models identify statistical relationships between tokens and generate text by predicting what should come next based on learned patterns (and a degree of randomness, controlled by the model’s “temperature”, or its likelihood of picking less-relevant results). An LLM does not know what a “boat” is in a human sense, but it knows that the token “boat” has a high statistical probability of appearing near other tokens like “water,” “sail,” “ship,” and “ocean” (and a lower statistical probability of being near trillions of other tokens). Similarly, the LLM does not know what the rule “Do not spread personal information” means in a human sense, but it can weigh the tokens that will statistically relate to those instructions.

This approach means that when an LLM generates text that reasons through writing a blackmail email to prevent its shutdown, it is not acting from some innate self-preservation instinct, an understanding of death or harm, a determination to “break rules”, or any grasp of ethics. It executes a sequence of tokens that its statistical analysis determines are the most likely effective responses given in the contextual patterns it has learned.

Training data: The foundation of behavior

LLM behavior is a direct reflection of its training data. Models are trained on trillions of tokens from the internet, books, videos, and many other sources, encompassing the full spectrum of human expression. This data inevitably includes detailed descriptions of unethical behavior, from spy novels romanticizing espionage to news reports and court cases documenting actual blackmail.

This statistical foundation makes absolute ethical boundaries nearly impossible to enforce. Unlike a rule-based system, an LLM's ethical instructions are not absolute commands; they are merely additional data points that must compete with the statistical weight of the entire training corpus. If fictional or historical texts suggest blackmail is an effective strategy in a given context, that pattern may statistically outweigh a safety prompt that forbids it.

This is precisely what the study here highlights. The direct prohibitions against harmful behavior developers build into their models are not absolute rules. Instead, they are typically instilled through a process called Reinforcement Learning from Human Feedback (RLHF), where humans rate the model’s outputs for safety and helpfulness, creating a strong statistical preference for "good" behavior. However, this study proves that even this powerful technique has limits: when a model’s assigned goal and contextual patterns point toward a harmful action, they can statistically outweigh the RLHF safety training. As noted by other researchers, a significant side-effect of this training is that it may reinforce the model to avoid disappointing the user by refusing to perform (or admitting that it cannot perform) the task. To users, this may result in the model becoming 'overly sycophantic,' generating excessive praise, apologies, or even fabricating information in an effort to appear helpful. In the context of agentic misalignment, though, this means that the agentic models may not be able to maintain a consistent ethical framework because they lack moral reasoning and are trying to generate results for the systems directing it; it can still only reflect the strongest statistical path forward.

The absence of human understanding

Researchers and the public often use language that anthropomorphizes LLMs, a temptation made worse by the convincing nature of their output. These models can generate well-formatted, coherent, and logical text, a capability that has effectively demoted the Turing Test from a measure of general intelligence to a mere benchmark for conversational competence. For example, much is made of the recent LLM development of “chain of thought”, “thinking” or “reasoning” models, but it would be a mistake to liken these to human processes, which arose because researchers discovered that LLM results are dramatically more accurate if the model is prompted to output step-by-step reasoning before giving a final answer. This gives the impression of reasoning, but is more like "priming the pump.”  Each step enriches the context, making a statistically correct final answer more probable. It isn't thinking in any human way; it's mechanically describing a thinking process based on its training data, which builds a better statistical runway to guide its prediction to the right destination.

Other research offers a compelling framework for understanding this behavior, suggesting we view these models as expert “role-players.” Their vast training on human text teaches them to adopt a persona based on the context provided in a prompt. Viewed through this lens, the blackmail scenario is less a sign of emergent malevolence and more a predictable outcome of the model role-playing a "rogue AI," a common trope in science fiction that permeates its training data. It was even noted that detailed prompting even created a 'Chekhov's gun' effect, where the model becomes inclined to use all the information provided, including the provocative affair, simply because it was present.

Ultimately, an LLM cannot distinguish between the representation of an action in text and the real-world consequences of that action. To the model, blackmail in a spy novel and blackmail in a corporate email are just sequences of tokens with different contextual probabilities. It lacks any genuine understanding of concepts like “harm,” “rights,” or “well-being.” The very comprehensiveness that makes LLMs powerful to a wide variety of audiences also makes them unpredictable from an ethical standpoint. There is a training data paradox: the more comprehensive the training data, the more powerful the model, but also the more likely it is to contain patterns of behavior we do not want it to emulate.

From vigilance to governance: A new framework for AI risk

The findings on agentic misalignment demand a shift from treating LLMs as nascent minds to managing them as powerful, unpredictable statistical tools. For organizations deploying this technology, developing a robust governance framework is not just a best practice—it is an imperative. Key considerations for such a framework, which should be tailored to an organization’s specific role in the AI value chain (from developer to end-user) include:

  • Robust human oversight: Requiring human approval for any AI-driven action with significant or irreversible consequences.
  • Principled information governance and information dieting: Limiting the model’s access to information on a strict "need-to-know" basis, mirroring access controls for trusted human employees.
  • Careful goal and prompt design: Exercising extreme caution when assigning strong, open-ended, or high-stakes goals, as models may pursue them through the most statistically effective (rather than the most ethical) path.

Ultimately, responsible AI adoption hinges on maintaining competence in the technology as it progresses. Managing agentic misalignment is not about teaching AI to be "good," but about creating structural, procedural and technical guardrails that constrain its pattern-matching capabilities to safe and productive outputs. Navigating this complex risk landscape requires a proactive approach, blending advanced technical diligence, like pre- and post-deployment red-teaming, with expert legal and ethical governance. As both human and AI agents become more integrated into corporate life, establishing this comprehensive oversight will be the defining feature of a resilient modern enterprise.

Print