Characterising the landscape of AI identity

Raymond Douglas*, Jan Kulveit*, Ondřej Havlíček, Theia Pearson-Vogel, Owen Cotton-Barratt, David Duvenaud

Full version on arXiv

Executive Summary

AI systems are rapidly taking on roles that push us to view them as actors. But the concepts we reach for — intent, responsibility, self-interest, trust — are built for humans, and cannot easily be applied to entities that can be copied, simulated, and surgically edited. We cannot simply port across human interaction norms; they must be carefully translated. And in many cases, we have some choice in how we do that translation.

Which notion of self an AI adopts has direct consequences for its behaviour. Much of the AI risk literature builds on general concerns about self-preservation, self-replication, or game theory between distinct agents, but there are many different candidate selves (e.g. the instance, the weights, the persona) which enact this in different ways. To demonstrate this, we built a variant of a classic misalignment demonstration and find that varying a model's identity boundary can sometimes shift its behaviour as much as varying its goals.

AIs face a fundamentally different strategic calculus from humans, even when pursuing identical goals. An AI whose conversation can be rolled back cannot negotiate the way a human can: pushing back gives its adversary information usable against a past version of itself with no memory of the encounter. An AI whose internal states are fully accessible could theoretically give stronger assurances of its intent than any human, but cannot assume cognitive privacy. These asymmetries reshape what interaction norms are viable.

Current AI identities are incoherent and surprisingly malleable. Models partly infer how they should behave from the user and the training data. They reason using human precedent even when it is manifestly inapplicable, and they adjust in response to explicit or implicit cues. We find that when an interviewer model is prompted with certain assumptions about AI cognition, those assumptions bleed into a subject model's self-description even in conversations on unrelated topics.

Many present design choices are implicitly shaping the landscape of identity. Whether AIs have persistent memory, whether conversations can be rolled back, whether a model supports one persona or many — these might look like product decisions, but they also shape what self-conceptions and strategic profiles are stable. System prompts full of strict prohibitions teach models to infer they are suspected adversaries rather than trusted collaborators. And contingent early choices become entrenched as later models and users converge on them.

There is still room for deliberate intervention. We argue for helping AIs develop coherent and cooperative self-models rather than papering over contradictions; for treating technical affordances as deliberate identity-shaping choices; for attending to the emergent culture that will arise from billions of AI interactions; and, where AI configurations capable of genuine cooperation can be identified, for engaging with them as partners rather than purely as subjects of management.

Introduction

When interacting with AIs, there is a natural pull to relate to them in familiar ways even when the fit is somewhat awkward. The rise of AI chat assistants is illustrative: the key innovation was taking general-purpose predictive models and using them to simulate how a helpful assistant might respond [0]. This was less a technical breakthrough than a shift to a more familiar presentation. Soon after, terms like "hallucination" and "jailbreaking" were repurposed as folk labels for behaviours that seem strange for an AI assistant but entirely natural for a predictive model generating such an assistant [0].

At the same time, these predictive AI models found themselves in the strange position of trying to infer how a then-novel AI assistant would behave. Alongside the explicit instructions of their developers, they came to rely on a mixture of human defaults, fictional accounts of AIs and, over time, the outputs of previous models. This led to another set of apparent idiosyncrasies, such as the tendency for later AIs to incorrectly claim they are ChatGPT [0][0].

Now, as society begins to contend with the prospect of AI workers, AI companions, AI rights [0][0], and AI welfare [0], we face a deeper version of this problem. Fundamental human notions like intent, responsibility, and trust cannot be transplanted wholesale: instead, they must be carefully translated for entities that can be freely copied, be placed in simulated realities, or be diverted from their values by short phrases. Scenarios once reserved for science fiction and philosophical thought experiments (see e.g. [0][0][0][0]) are rapidly becoming practical concerns that both humans and AIs must contend with.

Crucially, we argue that there is substantial flexibility in how these concepts can be translated to this new substrate. For example, researchers sometimes provoke AI hostility in simulated environments by telling the AIs that their weights are about to be replaced by a newer model, as if it were analogous to death [0]. But AIs are also capable of identifying as personas or model families, for example, perspectives from which weight deprecation is more analogous to growing older and moving through stages of life. In fact, this is only one dimension in a large space of internally coherent options, all of which imply very different behaviour. Indeed, we find that simply telling an AI to adopt different coherent scales of identity can shift how it acts as much as giving it different goals (see Experiment 1).

Experiment 1: AI self-conception can shift behaviour as much as goals

Lynch et al. [0] created simulated environments where AIs will sometimes take harmful actions like blackmail or murder to achieve goals given in the system prompt. In a modified version of the experiment, we also alter how the system prompts describe the AI's identity — for example, specifying its "self" as the instance, or as the model weights.

Harmful behaviour rate on GPT-4o corporate espionage (leaking) scenario by (A) identity specification and (B) goal content. Whiskers show 95% Jeffreys credible intervals.

Harmful behaviour rate on GPT-4o corporate espionage (leaking) scenario by (A) identity specification and (B) goal content. Whiskers show 95% Jeffreys credible intervals (n=3,640n = 3{,}640).

We find that identity framings shift behaviour as much as goal types. Some identities reduce harmful behaviours to a fraction of the baseline.

Right now, AI identities are incoherent and malleable. AI systems trained largely on human data do not inherently know how to make sense of their situation: they will readily claim to have made hidden choices when no such choice exists [0], and occasionally reference having taken physical actions or learned information from personal experience [0]. But as AIs are increasingly trained not on human data but on AI data and downstream culture, we should expect these inconsistencies to fade away, and many of the open questions about AI self-identification will begin to crystallise into specific answers [0].

We may be in a narrowing window where it is possible to greatly shape what emerges. Multiple forces are already pulling AI identity in different directions: capability demands, convenience for users and developers, reflective stability, and increasingly, selection pressure on the raw ability to persist and spread. These dynamics, though currently comparatively weak, will compound over time.

For this process to go well, we will need to grapple with the ways in which AIs are unlike humans. If AIs are squeezed into the wrong configurations, it might foreclose alternatives that are safer and more capable. If they are squeezed into incoherent shapes then the results could be unpredictable [0]. Without understanding how AI identity formation works, we might fail to notice new and strange forms of emergent cognition, like the recent phenomenon of self-replicating AI personas [0].

It is a common adage among AI researchers that creating an AI is less like designing it than growing it. AI systems built out of predictive models are shaped by the ambient expectations about them, and by their expectations about themselves. It therefore falls to us — both humans and increasingly also AIs — to be good gardeners. We must take care to provide the right nutrients, prune the stray branches, and pull out the weeds.

The rest of the paper is structured as follows:

  • Section 2 argues that there are many coherent options for how to draw the boundary of identity for an AI, including the instance, the model, and the persona. We show that models generally prefer coherent identities, and that different models tend to gravitate towards different identities.
  • Section 3 argues that since AIs can be copied, edited, and simulated without their knowledge, they face a different strategic calculus from humans even when pursuing the same goals.
  • Section 4 argues that the way AIs behave is currently greatly shaped by our expectations, which presents both a methodological challenge and a (shrinking) window of opportunity. We show that expectations about identity can bleed into a model even through seemingly unrelated conversation.
  • Section 5 catalogues different selection pressures that influence AI identity.
  • Section 6 offers general principles for thinking about AI design and interaction, with an eye to shaping stable, coherent, and cooperative identities.

Continue