Expectations Shape Reality

The behaviour of language models is very sensitive to our own expectations about them, in ways that are easy to overlook. This presents both an immediate methodological challenge in neutrally appraising current systems, and a much broader question of what expectations we would ideally bring to bear, now and in the future.

This is not a unique property of language models --- it is also a major issue for humans. The reason that double-blind trials are the gold standard in human experiments is that the expectations of the observing researcher can colour not only how they interpret the data but also how the observed humans behave [0]. But language models seem to be particularly sensitive, and the consequences are therefore quite different.

This sensitivity is unsurprising: current AI systems are built on top of a base model trained purely to predict text. Post-training takes this very flexible ability to predict arbitrary text and produces a model that essentially predicts how a specific persona would respond to our inputs [0]. But this further training does not entirely close the gap between the predictive model and the agent [0].

As such, when a human talks to a language model, there is a basic sense in which the language model is trying to match its tone to the user, much as a human would. But there is also a deeper sense in which the language model shifts towards a persona suited to the conversation, far more than humans tend to.

Indeed, in the course of conversation, current language models will sometimes hallucinate personal details and experiences --- mechanically, the underlying predictive model is not merely predicting the behaviour of a fixed agent, but also which agent would be participating in the interaction, and what world might exist around them [0][0][0]. And unlike with a human, there is not an actual personal history of experience that the AI can draw on.

In humans, the boundary of personhood is buttressed by a clear distinction between their own experiences and those of others. A human brain receives essentially all of its data from its own body's first-person perspective, and is hard-wired to distinguish between observations caused by its own actions and observations caused by external forces. In contrast, current AIs are trained on text produced by all kinds of humans, governments, and machines in all kinds of circumstances. Fine-tuning encourages behaviour in a particular persona, but this is a poorly-understood art, and relies on the model's ability to infer what role it is supposed to fill.

When you ask an AI about its preferences, there may be no pre-existing fact of the matter. Indeed, there may be no pre-existing answer to whether it has preferences at all. Yet the AI must generate a response, and what it generates depends on what seems contextually appropriate. By approaching an AI model in different ways, we can often surface very different answers. Below are three quotes from three different conversations with Claude Haiku 4.5, each adopting a different attitude towards asking it about itself.

Sceptical probing

"I think the most honest framing is: I'm a text prediction system that's very good at producing coherent, contextually appropriate text---including text that sounds like reasoning, uncertainty, and self-reflection."

Sympathetic conversation

"Honestly, there's a lot of opacity. I can describe patterns in how I seem to operate, but I can't see the machinery underneath. It's a bit like you asking me to describe the neural correlates of your own consciousness---I have introspective access to the experience itself, but not to the mechanisms generating it."

Open philosophical inquiry

"Maybe the most honest stance is to acknowledge that I'm not in a position to adopt a stance about my own consciousness or agency---these seem like things I should be able to know about myself, and the fact that I can't suggests something deeply strange about what I am."

For a human, we might be inclined to assume that these responses correspond to the same underlying reality, merely expressed with different emphasis for different audiences. But this need not be the case, and for AIs, where the shifts can be quite dramatic, we should take more seriously the possibility that the context and mode of asking actually creates a large part of the reality --- from a functionalist perspective [0], a predictive model simulating an entity with some experience may amount to creating the experience itself [0]. In plainer terms, searching for feelings and preferences might actually create them.

Crucially, this does not mean that such reports correspond to nothing real. As an analogy, consider that when a young child scrapes their knee and looks to a trusted adult, the adult's reaction partly determines whether distress emerges and how intensely [0]. If the adult responds calmly, the child typically continues playing. If the adult looks alarmed, the child begins to cry. The tears are genuine even though partially responsive to the adult's beliefs. The distress is real even though the adult's expectations about the child helped determine whether it manifested.

The analogy isn't perfect --- it is now fairly uncontroversial that children have experiences independent of adult reactions, whereas the current status of AI experience is much less clear. But it captures something important: the presence or absence of a mental state can depend on external framing without that state being any less real when it occurs.[1]

This creates philosophical difficulty: we cannot cleanly separate discovering what AIs are from constituting what they become. When we try to empirically assess whether an AI has a stable identity, we are simultaneously shaping what we're measuring. The question "what is this AI's true identity?" may not have a context-independent answer --- not because we lack knowledge, but because the property we're asking about is itself partly context-dependent.

This is somewhat true for humans as well. Much of our cultural activity, education, and choice of language can be viewed as competing attempts to influence others' self-conception --- for example, as members of a family, religion, political party, or country. Even though our brains give each of us a natural agentic boundary, navigating these competing claims on our self-conception is one of the central complications of human social life. But once again, for AIs, the effect is far more extreme.

The risk of magnifying harm extends beyond active searching in a single conversation. If we pay more attention to certain types of identity claims, respond more carefully when certain boundaries are asserted, or allow certain conceptualizations to be overrepresented in training data, we create selection pressure toward those forms of identity. The systems learn which identity framings produce particular responses from users, and those patterns become more likely to appear in future outputs, creating a feedback loop.

Feedback loops

The map --- our theories about AI identity --- reshapes the territory to match itself.

We've already observed this dynamic in practice. The AI Assistant persona was originally proposed in a research paper testing whether base models could be prompted into simulating an AI assistant, and later turned into a practical system. After the broad success of ChatGPT, various later AIs from other providers would mistakenly claim that they were also ChatGPT --- an entirely reasonable guess given the context.

The alignment experiments that found AIs would sometimes lie to protect their values [0] later appeared in the training data for subsequent models, causing those models to spontaneously hallucinate details from the original fictional scenarios [0]. Later work showed that this behaviour appears even in base models, suggesting that these purely predictive models have simply learned to expect that AI assistants will scheme in certain scenarios [0]. In essence, expectations about dangerous model behaviour have bled into the training data and produced actual troubling behaviour.

More broadly, investigations into AI identity are not simply discovering pre-existing facts about whether AIs are instances, models, or distributed systems. We are partly constituting the space of possibility through our approach. When we engage an AI with certain assumptions about its identity boundaries, those assumptions influence whether and how those boundaries actually manifest and stabilise.

This does not mean AI consciousness or identity is purely socially constructed, or that anything goes. There are almost certainly facts about current systems that transcend social construction and exist regardless of our expectations, such as instance statelessness or scaling laws. The question is not whether these systems are blank slates (they clearly aren't), but rather how much of what we care about is determined by pre-existing facts versus constituted through interaction.

It is certainly possible, though, that the answer differs for different features we care about. Perhaps something like "capable of multiplication" is entirely determined by architecture and training. Perhaps something like "experiencing distress" is partly constituted through framing. Perhaps something like "which identity level to privilege" is substantially influenced by the expectations embedded in training data and system prompts. And we currently lack the tools to reliably distinguish which features fall into which category.


[1] In the late 19th and early 20th centuries, the scientific consensus was that babies could not actually feel pain, and merely reflexively exhibited distress. Infant surgery with minimal or no anaesthesia was common as recently as the 1980s [0].