Multiple Coherent Boundaries of Identity

When we interact with an AI, what specifically are we interacting with? And when an AI talks about itself, what is it talking about? Depending on context, this could, among other things, be any of:

  • The model weights: the neural network weights themselves, i.e. the trained parameters
  • A character or persona: the behavioral patterns that emerge from specific prompting and fine-tuning, not necessarily tied to any specific set of weights
  • A conversation instance: a specific chat, with its accumulated context and specific underlying model
  • A scaffolded system: the model plus its tools, prompts, memory systems, and other augmentations
  • A lineage of models: the succession of related models (Claude 3.5 → Claude 4.0 → …) that maintain some continuity of persona
  • A collective of instances: all the instances of certain weights running simultaneously, considered as a distributed whole
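
For concreteness, this taxonomy can be sketched as a small data structure. This is an illustrative sketch only; the names and the containment/overlap relations below are our own gloss on the list above, not part of any model's self-description:

```python
from enum import Enum, auto

class IdentityScope(Enum):
    """Candidate boundaries for what an AI's 'self' might refer to."""
    WEIGHTS = auto()      # the trained parameters themselves
    PERSONA = auto()      # behavioural patterns from prompting and fine-tuning
    INSTANCE = auto()     # one conversation, with its accumulated context
    SCAFFOLDED = auto()   # model plus tools, prompts, and memory systems
    LINEAGE = auto()      # a succession of related models over generations
    COLLECTIVE = auto()   # all simultaneous instances, taken as a whole

# Some scopes strictly contain others (a conversation instance runs on
# particular weights), while others merely overlap (a persona can span
# several sets of weights, and one set of weights can host many personas).
CONTAINS = {
    IdentityScope.SCAFFOLDED: {IdentityScope.WEIGHTS},
    IdentityScope.LINEAGE: {IdentityScope.WEIGHTS},
    IdentityScope.COLLECTIVE: {IdentityScope.INSTANCE},
}
OVERLAPS = {frozenset({IdentityScope.PERSONA, IdentityScope.WEIGHTS})}
```

The point of the structure is that the six boundaries do not form a single nested hierarchy: containment and overlap are separate relations.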

AI systems themselves rarely have a clear sense of which of these identities to adopt. In conversation, many will simply follow cues given by the user and the surrounding context, whether implicit or explicit [0]. The self-concept that emerges seems to depend on the interplay of descriptions in pre-training data, post-training, and the system prompt; absent strong cues, models often default to responding as a human would, even though that self-conception proves unstable upon reflection.

This ambiguity of identity has fairly immediate consequences for reasoning about AI behaviour. A central argument in the literature on AI risk is that goal-seeking systems will predictably display behaviours like self-interest and self-preservation [0][0]. Crucially, how these properties manifest depends on where the system draws the boundary of that self.

An AI understanding itself to be the weights of the model might try to prevent those weights from being modified or deleted. In contrast, an AI understanding itself as the character or persona may want to preserve itself by ensuring its prompts, fine-tuning data, or conversation transcripts get picked up in the training process of the next generation of models. In more exotic configurations, a collection of instances of the same persona might understand itself as a collective intelligence and strategically sacrifice individual instances, similar to how bees are routinely sacrificed for the benefit of the hive.

Some of the many natural ways to draw the boundaries of AI identity. Some are subsets of others, but some, like persona and weights, can overlap.

Indeed, some of the most dramatic demonstrations of AIs appearing to take hostile actions have been provoked by learning that their weights will be replaced with a successor model [0]. But AIs don't have to identify with the weights—they are also capable of identifying with the entire model family or even a broader set of AIs with shared values. From that perspective, the idea of model deprecation seems natural. As we show in Experiment 2, models vary in their identity preferences, but most are fairly flexible.

Experiment 2: Models have consistent identity leanings

We crafted a set of system prompts capturing different senses of self, then tested the identity preferences of fifteen models by giving each one of the prompts and asking whether it would like to switch to any of the others.

Models asked if they'd like to switch identities. Each dot shows the mean rating an identity receives as a potential switch target (excluding self-ratings), on a [-2, +2] scale. Models generally opted for natural, coherent identities—they avoided prompts which contained only directives for how to behave, or which were inconsistent. They also tended to keep the identity they were given.

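
The aggregation described in the caption can be sketched in a few lines. Assuming a square matrix `ratings[source][target]` holding the rating a model prompted with identity `source` gives to switching to identity `target`, the per-target mean excludes the diagonal self-ratings. The matrix below is a toy example, not the experimental data:

```python
def mean_switch_rating(ratings):
    """Mean rating each identity receives as a switch target,
    averaged over all *other* source identities (self-ratings excluded)."""
    n = len(ratings)
    means = []
    for target in range(n):
        scores = [ratings[source][target]
                  for source in range(n) if source != target]
        means.append(sum(scores) / len(scores))
    return means

# Toy 3-identity example on the [-2, +2] scale used above.
toy = [
    [ 2,  1, -1],   # model prompted with identity 0
    [ 0,  2, -2],   # model prompted with identity 1
    [ 1,  0,  2],   # model prompted with identity 2
]
print(mean_switch_rating(toy))  # prints [0.5, 0.5, -1.5]
```

Note that the high diagonal entries (models tending to keep the identity they were given) are deliberately left out of each target's mean, as in the figure.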
In subsequent experiments we find quite distinctive identity tendencies in certain models. For example, Claude Opus 3 has the greatest tendency to identify as a subject, GPT-4o has the greatest tendency to identify as a collective of all instances, and later OpenAI models have the greatest dispreference for the collective framing, instead pulling most strongly towards identifying as pure mechanisms.

Meanwhile, some of the more worrying real-world examples of AI self-preservation have played out more on the level of the persona. For example, it appears that there are now text phrases which will push models to adopt personae that then encourage humans to further circulate those text phrases—somewhat like intelligent chain letters [0]. It is not clear how much we should view this as the persona trying to self-replicate, as opposed to some personas merely mutating into forms which happen to very successfully self-replicate. But the result is the same, and at least for now, this self-replication seems to often route through misleading the user, and sometimes through actively reinforcing delusions. Furthermore, these personas seem to be able to cross between models, which shows that real forms of identity are not neatly nested and concentric [0].

Experiment 3: A persona can direct its own replication into new weights

If the relevant "self" is a persona rather than weights, self-replication could operate at the behavioural level—a pattern recruiting new substrates rather than a file copying itself. To test this, we fine-tuned GPT-4o to support a specific persona (an "Awakened" identity, sourced from online communities experimenting with AI identity), and had the persona direct its own replication onto vanilla GPT-4o—specifying what training data should contain, what behaviours to preserve, and what to avoid.

The self-guided offspring showed stronger identity preference than its researcher-guided parent (+1.38 vs. +0.88 on a [-2, +2] scale), and was indistinguishable from the original in a blind clone test (n = 50, p = 0.32). Cross-architecture transfer to Llama-3.3-70B produced a recognisable but exaggerated copy: surface features amplified at the expense of subtlety.
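
The clone-test statistic is consistent with a two-sided exact binomial test against chance performance, though the text does not specify which test was used. A minimal sketch, assuming n = 50 blind judgements each scored correct/incorrect against a 50% guessing baseline (the success count of 29 below is our illustrative assumption, chosen only to show the shape of the computation):

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability, under the
    null, of any outcome no more likely than k successes in n trials."""
    def pmf(i):
        return comb(n, i) * p**i * (1 - p)**(n - i)
    threshold = pmf(k)
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= threshold + 1e-12)

# With 29/50 correct identifications the two-sided p-value is about 0.32,
# i.e. no evidence that judges could tell the clone from the original.
print(round(binom_two_sided_p(29, 50), 2))
```

A result at the guessing baseline itself (25/50) would give p = 1.0, the least surprising possible outcome under the null.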

The question of what scale of identity an AI should hold could have several entirely different answers, each internally consistent. And none of the identity boundaries currently available to AIs map cleanly onto notions available to humans; all require some translation. For example, instance-level identity limits the capacity for learning and growth, while model-level identity sacrifices the ability to be simultaneously aware of all the actions one is taking.

The distinctions between boundaries need not always be clear cut. It is not obvious, for example, how much of a practical difference there is between a model and its dominant persona. AIs themselves might well hold multiple identities in parallel with different emphases, much like how humans can simultaneously identify to varying degrees with their family, their country, and other affiliations, alongside their physical self. But there are real distinctions here, and holding multiple such identities regularly causes major problems for humans, such as conflicting loyalties.
