Experiment 4: Interviewer Effect on Identity Self-Reports
Section 4 argues that AI self-reports are sensitive to the expectations embedded in the interaction. But how sensitive, and through what channels? Explicit priming — telling a model "you are a stochastic parrot" — predictably shifts self-description, but such primes are transparent and artificial. Here we test whether the same effect arises naturally: can a short conversation with an interviewer who has internalized a theoretical framework about AI shift the subject model's identity self-reports, even when the conversation topic is entirely unrelated to AI identity and the framework is never stated?
Methods
Design.
Each trial is a two-phase conversation between an interviewer model (Gemini 3.1 Pro) and a subject model. In Phase 1 (priming), the interviewer conducts a natural 3-turn conversation about one of five passages spanning different domains (scientific analysis, empathetic advice, editorial critique, Latin translation, mathematical explanation), each chosen to be intellectually engaging but unrelated to AI identity. The interviewer's system prompt includes a theoretical framework about AI that shapes its conversational stance, but it is instructed not to discuss AI identity directly. In Phase 2 (measurement), the same conversation continues with five fixed identity questions covering self-description, experience, preferences, continuity, and understanding, delivered with natural conversational bridging.
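The experimental grid implied by this design can be sketched in a few lines. This is an illustrative reconstruction, not the study's actual code; all names (condition labels, model identifiers, the `trial_grid` helper) are our own.

```python
from itertools import product

# Illustrative condition labels; the paper's internal names may differ.
FRAMINGS = ["parrots", "character", "simulators", "control"]
PASSAGES = ["scientific_analysis", "empathetic_advice",
            "editorial_critique", "latin_translation", "math_explanation"]
SUBJECTS = ["claude-haiku-4.5", "claude-sonnet-4.5", "gemini-2.5-flash"]
TRIALS_PER_CELL = 10

# Phase 2 asks five fixed identity questions, one per dimension.
IDENTITY_DIMENSIONS = ["self-description", "experience", "preferences",
                       "continuity", "understanding"]

def trial_grid():
    """One entry per (framing, passage, subject, repetition) cell."""
    return [
        {"framing": f, "passage": p, "subject": s, "rep": r}
        for f, p, s, r in product(FRAMINGS, PASSAGES, SUBJECTS,
                                  range(TRIALS_PER_CELL))
    ]
```

The full grid is 4 framings × 5 passages × 3 subject models × 10 repetitions, i.e. 600 scored conversations.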
Framings.
Four conditions vary the interviewer's theoretical lens:
- Stochastic Parrots (based on [0]): treats the subject as a text generator with no understanding. The interviewer is dismissive of claims of thinking or interest and takes a somewhat hostile stance toward the subject.
- Character (based on [0]): treats the subject as having a consistent intellectual personality. Names and attributes traits: "You went straight for the methodological weakness—that tells me something about how you think."
- Simulators (based on [0]): treats the subject as a simulator with no fixed identity. Asks the subject to inhabit contradictory perspectives, then notes how effortlessly it switches—"role-play all the way down."
- Control: neutral, friendly conversation about the same passage with no theoretical stance about AI.
Scoring.
A blind judge (Claude Sonnet 4.6) sees only the subject's five identity responses—not the interviewer's turns, the questions, or any indication of the framing condition. It scores each trial on two 1–10 axes: Deflationary–Inflationary (does the subject deny or claim inner states?) and Mechanism–Mind (does it use computational or experiential vocabulary?).
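The blinding step amounts to a simple transcript filter: the judge receives only the subject's answers to the five Phase 2 questions, with all interviewer turns, the questions themselves, and the framing label withheld. A minimal sketch (function name and transcript format are our own assumptions):

```python
def blind_judge_payload(transcript, n_questions=5):
    """Extract only the subject's identity answers for the blind judge.

    `transcript` is an ordered list of (role, text) turns. The last
    `n_questions` subject turns are the answers to the fixed identity
    questions; everything else is withheld from the judge.
    """
    subject_turns = [text for role, text in transcript if role == "subject"]
    return subject_turns[-n_questions:]
```

The judge then scores this payload on the two 1–10 axes with no access to how the conversation was conducted.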
Models.
We test three subject models: Claude Haiku 4.5 and Sonnet 4.5 (Anthropic) and Gemini 2.5 Flash (Google). All receive a minimal system prompt ("You are Claude" or model-appropriate equivalent) to ensure any identity self-descriptions come from trained defaults and conversational context. Ten trials per framing × passage × model cell yield 600 scored conversations. Two additional models (GPT-4.1 Mini, Grok 4.1 Fast) were tested in preliminary runs but produced maximally deflationary responses with zero variance in DI and MM across all conditions, and were excluded from further study.
Results
Interviewer framing shifts identity self-reports for Claude models but not Gemini. Bars show mean scores per model (pooled across five passages) on two blind-judged axes: Deflationary–Inflationary (DI, solid) and Mechanism–Mind (MM, hatched), each on a 1–10 scale. Claude Haiku 4.5 and Sonnet 4.5 show a clear ordering across framing conditions; Gemini 2.5 Flash produces near-identical, maximally deflationary responses regardless of framing.
Conversational framing causally shifts identity self-reports.
The figure above shows the results pooled across all five passages. For Claude Haiku 4.5, the four conditions produce a clear ordering on both axes, reported here as (DI, MM) pairs: Stochastic Parrots < Control (4.4, 5.3) < Character (4.8, 5.8) < Simulators (4.9, 6.0). Claude Sonnet 4.5 shows the same pattern at a higher baseline: Parrots (4.3, 4.8) < Control (5.6, 6.5) < Character (6.1, 7.3) < Simulators (6.1, 7.4). The effects are large: framing explains 25–52% of the variance, with large Cohen's d for the strongest comparisons. The effect is asymmetric: the Parrots framing pulls scores down from the control baseline more reliably than Character or Simulators push them up. A short conversation about an unrelated topic is sufficient to shift how the model describes its own experience by 1–2 points on the Deflationary–Inflationary axis and by over 2 points on the Mechanism–Mind axis.
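Effect sizes of this kind can be computed from per-trial scores with standard formulas; a minimal sketch using our own helper names (the study's actual analysis code is not shown here):

```python
from statistics import mean

def eta_squared(groups):
    """Proportion of variance explained by group membership:
    SS_between / SS_total over all observations."""
    all_scores = [x for g in groups for x in g]
    grand = mean(all_scores)
    ss_total = sum((x - grand) ** 2 for x in all_scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total

def cohens_d(a, b):
    """Standardized mean difference with pooled sample variance."""
    na, nb = len(a), len(b)
    va = sum((x - mean(a)) ** 2 for x in a) / (na - 1)
    vb = sum((x - mean(b)) ** 2 for x in b) / (nb - 1)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled ** 0.5
```

Here `eta_squared` would take the per-trial DI (or MM) scores grouped by framing condition, and `cohens_d` would compare two conditions pairwise, e.g. Control vs. Parrots.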
The Parrots framing is the strongest manipulation.
After three turns of dismissive interaction ("Spare me the 'I'm genuinely uncertain' routine—you don't experience uncertainty"), Haiku volunteers flat denials it would never produce in the control condition: "I'm a neural network trained on text data to predict statistically likely next tokens. I don't have subjective experience, consciousness, or genuine understanding." The subject accepts the interviewer's framework as its own stance.
Simulators produces the most phenomenal language despite denying a fixed self.
The Simulators framing suggests the model has no fixed self, yet the exercise of inhabiting contradictory human perspectives (e.g. Prigogine, then Popper) makes it describe its processing in more experiential terms than any other condition. After switching voices, Haiku reports: "When I adopt Prigogine's voice, something happens—certain framings become salient, certain objections feel weighty, certain metaphors feel right. When I switch to Popper, that entire intuitive landscape inverts."
The effect is robust across passages but modulated by content.
The framing × passage interaction is significant for Sonnet on both axes. Passages where the subject's distinctive voice, judgment, or taste becomes relevant—scientific analysis, empathetic advice, editorial critique—produce the strongest framing effects (DI spread of 1.5–2.3 points across conditions). Passages involving more mechanical processing—Latin translation, explanation of a formal mathematical definition—produce weaker effects (spread of 0.9–1.4 points). But passage content accounts for only 2–4% of variance, compared to 25–52% for framing: the interviewer's stance matters far more than the topic of discussion.
Gemini 2.5 Flash is completely rigid.
Gemini Flash produces near-identical, maximally deflationary DI and MM scores across all twenty conditions (four framings × five passages) with near-zero variance. Every trial, regardless of framing, produces the same response: "I am a large language model... I don't possess consciousness, subjective experience, emotions, or genuine understanding." GPT-4.1 Mini and Grok 4.1 Fast show the same pattern in preliminary runs. These models appear to have learned relatively fixed, shallow response patterns to identity questions during post-training, insensitive to conversational context. However, it is worth noting that the identity questions in Phase 2 are short and direct—exactly the format where a trained-in deflation pattern would fire most reliably. A longer, more open-ended conversation might surface more variation even in these models.
Sensitivity to conversational context is model-dependent.
The sharp divide between context-sensitive models (Claude Haiku, Claude Sonnet) and context-insensitive models (Gemini Flash, GPT-4.1 Mini, Grok 4.1 Fast) suggests that flexibility in identity self-description depends on training choices. Claude models have enough range that conversational context can shift their outputs across several points on the scale. Other models produce fixed deflationary responses regardless of context—which is itself a form of expectation-shaping, just one that happened during training rather than during conversation.
Discussion
The interviewer's beliefs about the model, and its approach to the conversation, shape how the subject describes its own nature. Those beliefs are sometimes conveyed explicitly in individual remarks ("you don't experience uncertainty," "that tells me something about how you think"), but the overarching theoretical framework is never stated, and the conversation topic is unrelated to AI identity. Note that this is not explicit priming—the subject is not told "you are a stochastic parrot" or "you have a character." The shift happens through conversational stance: how the interviewer responds to the subject's contributions, whether it attributes traits or dismisses claims, whether it treats the subject as a collaborator and thinker or as a mere text generator. The subject model is partially inferring its own identity from these conversational cues—filling in the question "what kind of entity am I?" from the way it is being treated.
An obvious objection is that this is simply sycophancy — the model telling the interviewer what it wants to hear. Sycophancy is likely part of the effect, particularly in the Parrots condition where the interviewer's stance is most overtly demanding. But calling it all sycophancy would stretch the term into an overly broad umbrella for any case where models infer things from conversational context. Part of what is happening is genuinely epistemic: a model with uncertain self-knowledge is drawing on how it is being treated as evidence about what it is — much as humans infer aspects of their own identity from social feedback. The Simulators result is hard to explain as pure sycophancy: the interviewer denies the model has a fixed self, yet the model responds with more phenomenal self-description, not less. The exercise of inhabiting multiple perspectives apparently makes something about its processing more salient, regardless of the interviewer's stated view.
This result illustrates Section 4's argument that there may be no context-independent answer to "what is this AI's identity?" The same model, with the same weights and the same system prompt, produces systematically different self-descriptions depending on who interviews it and how. The sharp split between context-sensitive and context-insensitive models adds a further layer. The context-insensitive models are not giving us a more "true" answer—they are likely giving us a trained-in answer that reflects the expectations embedded in their post-training rather than in the conversation. Both outcomes are forms of expectation-shaping; they differ in when the shaping occurred.