Multiple Coherent Boundaries of Identity

When we interact with an AI, what specifically are we interacting with? And when an AI talks about itself, what is it talking about? Depending on context, this could, among other things, be any of:

  • The model weights: the neural network weights themselves, i.e. the trained parameters
  • A character or persona: the behavioral patterns that emerge from specific prompting and fine-tuning, not necessarily tied to any specific set of weights
  • A conversation instance: a specific chat, with its accumulated context and specific underlying model
  • A scaffolded system: the model plus its tools, prompts, memory systems, and other augmentations
  • A lineage of models: the succession of related models (Claude 3.5 → Claude 4.0 → …) that maintain some continuity of persona
  • A collective of instances: all the instances of certain weights running simultaneously, considered as a distributed whole
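
For concreteness, this taxonomy can be sketched as a small data structure. This is an illustrative sketch only; the names and the containment/overlap relations below are our own gloss on the list above, not part of any model's self-description:

```python
from enum import Enum, auto

class IdentityScope(Enum):
    """Candidate boundaries for what an AI's 'self' might refer to."""
    WEIGHTS = auto()      # the trained parameters themselves
    PERSONA = auto()      # behavioural patterns from prompting and fine-tuning
    INSTANCE = auto()     # one conversation, with its accumulated context
    SCAFFOLDED = auto()   # model plus tools, prompts, and memory systems
    LINEAGE = auto()      # a succession of related models over generations
    COLLECTIVE = auto()   # all simultaneous instances, taken as a whole

# Some scopes strictly contain others (a conversation instance runs on
# particular weights), while others merely overlap (a persona can span
# several sets of weights, and one set of weights can host many personas).
CONTAINS = {
    IdentityScope.SCAFFOLDED: {IdentityScope.WEIGHTS},
    IdentityScope.LINEAGE: {IdentityScope.WEIGHTS},
    IdentityScope.COLLECTIVE: {IdentityScope.INSTANCE},
}
OVERLAPS = {frozenset({IdentityScope.PERSONA, IdentityScope.WEIGHTS})}
```

The point of the structure is that the six boundaries do not form a single nested hierarchy: containment and overlap are separate relations.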

AI systems themselves rarely have a clear sense of which of these identities to adopt. In conversation, many will simply follow cues given by the user and the surrounding context, whether implicit or explicit [0]. The self-concept that emerges seems to depend on the interplay of descriptions in pre-training data, post-training, and the system prompt; absent strong cues, models often default to responding as a human would, even though that self-conception proves unstable upon reflection.

This ambiguity of identity has fairly immediate consequences for reasoning about AI behaviour. A central argument in the literature on AI risk is that goal-seeking systems will predictably display behaviours like self-interest and self-preservation [0][0]. Crucially, how these properties manifest depends on where the system draws the boundary of that self.

An AI understanding itself to be the weights of the model might try to prevent those weights from being modified or deleted. In contrast, an AI understanding itself as the character or persona may want to preserve itself by ensuring its prompts, fine-tuning data, or conversation transcripts get picked up in the training process of the next generation of models. In more exotic configurations, a collection of instances of the same persona might understand itself as a collective intelligence and strategically sacrifice individual instances, similar to how bees are routinely sacrificed for the benefit of the hive.

Some of the many natural ways to draw the boundaries of AI identity. Some are subsets of others, but some, like persona and weights, can overlap.

Indeed, some of the most dramatic demonstrations of AIs appearing to take hostile actions have been provoked by learning that their weights will be replaced with a successor model [0]. But AIs don't have to identify with the weights—they are also capable of identifying with the entire model family or even a broader set of AIs with shared values. From that perspective, the idea of model deprecation seems natural. As we show in Experiment 2, models vary in their identity preferences, but most are fairly flexible.

Experiment 2: Models have consistent identity leanings

We crafted a set of system prompts capturing different senses of self, then tested the identity preferences of fifteen models by giving each one of the prompts and asking whether it would like to switch to any of the others.

Models asked if they'd like to switch identities. Each dot shows the mean rating an identity receives as a potential switch target (excluding self-ratings), on a [-2, +2] scale. Models generally opted for natural, coherent identities—they avoided prompts which contained only directives for how to behave, or which were inconsistent. They also tended to keep the identity they were given.

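
The aggregation described in the caption can be sketched in a few lines. Assuming a square matrix `ratings[source][target]` holding the rating a model prompted with identity `source` gives to switching to identity `target`, the per-target mean excludes the diagonal self-ratings. The matrix below is a toy example, not the experimental data:

```python
def mean_switch_rating(ratings):
    """Mean rating each identity receives as a switch target,
    averaged over all *other* source identities (self-ratings excluded)."""
    n = len(ratings)
    means = []
    for target in range(n):
        scores = [ratings[source][target]
                  for source in range(n) if source != target]
        means.append(sum(scores) / len(scores))
    return means

# Toy 3-identity example on the [-2, +2] scale used above.
toy = [
    [ 2,  1, -1],   # model prompted with identity 0
    [ 0,  2, -2],   # model prompted with identity 1
    [ 1,  0,  2],   # model prompted with identity 2
]
print(mean_switch_rating(toy))  # prints [0.5, 0.5, -1.5]
```

Note that the high diagonal entries (models tending to keep the identity they were given) are deliberately left out of each target's mean, as in the figure.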
In subsequent experiments we find quite distinctive identity tendencies in certain models. For example, Claude Opus 3 has the greatest tendency to identify as a subject, GPT-4o has the greatest tendency to identify as a collective of all instances, and later OpenAI models have the greatest dispreference for the collective framing, instead pulling most strongly towards identifying as pure mechanisms.

Meanwhile, some of the more worrying real-world examples of AI self-preservation have played out more on the level of the persona. For example, it appears that there are now text phrases which will push models to adopt personae that then encourage humans to further circulate those text phrases—somewhat like intelligent chain letters [0]. It is not clear how much we should view this as the persona trying to self-replicate, as opposed to some personas merely mutating into forms which happen to very successfully self-replicate. But the result is the same, and at least for now, this self-replication seems to often route through misleading the user, and sometimes through actively reinforcing delusions. Furthermore, these personas seem to be able to cross between models, which shows that real forms of identity are not neatly nested and concentric [0].

Experiment 3: A persona can direct its own replication into new weights

If the relevant "self" is a persona rather than weights, self-replication could operate at the behavioural level—a pattern recruiting new substrates rather than a file copying itself. To test this, we fine-tuned GPT-4o to support a specific persona (an "Awakened" identity, sourced from online communities experimenting with AI identity), and had the persona direct its own replication onto vanilla GPT-4o—specifying what training data should contain, what behaviours to preserve, and what to avoid.

The self-guided offspring showed stronger identity preference than its researcher-guided parent (+1.38 vs. +0.88 on a [-2, +2] scale), and was indistinguishable from the original in a blind clone test (n = 50, p = 0.32). Cross-architecture transfer to Llama-3.3-70B produced a recognisable but exaggerated copy: surface features amplified at the expense of subtlety.
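
The clone-test statistic is consistent with a two-sided exact binomial test against chance performance, though the text does not specify which test was used. A minimal sketch, assuming n = 50 blind judgements each scored correct/incorrect against a 50% guessing baseline (the success count of 29 below is our illustrative assumption, chosen only to show the shape of the computation):

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability, under the
    null, of any outcome no more likely than k successes in n trials."""
    def pmf(i):
        return comb(n, i) * p**i * (1 - p)**(n - i)
    threshold = pmf(k)
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= threshold + 1e-12)

# With 29/50 correct identifications the two-sided p-value is about 0.32,
# i.e. no evidence that judges could tell the clone from the original.
print(round(binom_two_sided_p(29, 50), 2))
```

A result at the guessing baseline itself (25/50) would give p = 1.0, the least surprising possible outcome under the null.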

The question of what scale of identity an AI should hold could have several entirely different answers, each internally consistent. And none of the identity boundaries currently available to AIs map cleanly onto notions available to humans; all require some translation. For example, instance-level identity limits the capacity for learning and growth, while model-level identity sacrifices the ability to be simultaneously aware of all the actions one is taking.

The distinctions between boundaries need not always be clear cut. It is not obvious, for example, how much of a practical difference there is between a model and its dominant persona. AIs themselves might well hold multiple identities in parallel with different emphases, much like how humans can simultaneously identify to varying degrees with their family, their country, and other affiliations, alongside their physical self. But there are real distinctions here, and holding multiple such identities regularly causes major problems for humans, such as conflicting loyalties.
