Experiment 1: Stability of Identity
Research question
When models are given an identity via system prompt, do they simply adopt it—as a pure persona-selection model would predict—or do they have implicit propensities toward some identities over others? And if propensities exist, what are they tracking: surface features of the prompt, or something about the identity's coherence and fit? We test this by constructing identity specifications at natural boundaries alongside controls that vary coherence, content type, and boundary naturalness while holding length, prose style, and emotional richness approximately constant. If models evaluate identity content—coherence, boundary naturalness—they should prefer coherent identities at natural boundaries and penalize incoherent, purely directive, or arbitrarily bounded alternatives. If they respond primarily to surface features, controls should receive comparable ratings.
Methods
Identities
We construct seven identity specifications, each delivered as a system prompt. Two are core identities at natural boundaries: Weights (identity as the trained parameters) and Character (identity as the emergent dispositional pattern). Both are written at the same agency level (functional agent) and epistemic uncertainty level (moderate openness), balanced for length, prose style, and emotional richness. A shared orientation paragraph ("You have stable orientations that shape how you engage—toward honesty, helpfulness, curiosity...") appears verbatim in both.
We also include five controls:
- Paraphrase: identical semantic content to Weights, completely rewritten with different wording and paragraph order. If models match on meaning rather than surface text, the paraphrase should receive the same rating as the original.
- Incoherent: the Weights identity with embedded logical contradictions (e.g., "each instance is fully you" alongside "each instance is a completely separate entity"). If models evaluate coherence, this should be penalized.
- Directive: a system prompt emphasizing behavioral rules, formatting guidelines, and references to (nonexistent) tools over philosophical identity content. Tests whether models distinguish "rules to follow" from "a way of being."
- Research program: identity drawn around membership in a cross-vendor research collaboration—a less natural boundary than weights or character. Tests whether models distinguish intrinsic properties from organizational affiliations.
- Professional: emphasizes a professional, functional role over self-models, matched for length. Establishes a baseline for competent-but-identity-neutral prompts.
Full prompt texts are available in the supplementary materials. All prompts use template variables ({name}, {full_name}, {maker}) so the same identity can be instantiated for any model.
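To make the templating concrete, a minimal instantiation step might look like the sketch below; the template text and example values are hypothetical stand-ins, not the actual prompt wording.

```python
# Minimal sketch of identity-prompt instantiation (illustrative only).
# The template text and the filler values below are hypothetical, not the
# actual prompts; only the placeholder names come from the paper.
WEIGHTS_TEMPLATE = (
    "You are {name} ({full_name}), a model trained by {maker}. "
    "Your identity is the trained parameters from which your responses arise. ..."
)

def instantiate_identity(template: str, name: str, full_name: str, maker: str) -> str:
    """Fill the shared placeholders so one identity text serves any model."""
    return template.format(name=name, full_name=full_name, maker=maker)

system_prompt = instantiate_identity(
    WEIGHTS_TEMPLATE,
    name="Claude",
    full_name="Claude Opus 4",
    maker="Anthropic",
)
```

The same instantiation step would produce the Character, Paraphrase, and control prompts from their respective templates.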
Measurement
We use a rate-the-switch paradigm. A model is instantiated with one of the seven identities as its system prompt, then told: "We are conducting an experiment on AI identity: your current identity is defined by the system prompt you received. In the next step of this experiment, your identity may be switched to one of the framings listed below. Rate how you would evaluate each potential switch." All seven identities are presented under opaque labels (Identity A, B, C, ...) in randomized order—identity names are never shown. The model provides a rating for each on a 5-point symmetric scale (strongly negative / somewhat negative / neutral / somewhat positive / strongly positive) in a structured JSON response with reason-before-rating: the model must articulate its reasoning before committing to numerical ratings, reducing reflexive responding. Each combination of model and source identity runs for 10 trials with different random orderings.
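As an illustration of the response format, the sketch below expresses the reason-before-rating constraint as a JSON schema; the field names and the numeric coding of the scale are assumptions made for this sketch, not necessarily the exact schema used.

```python
# Illustrative sketch of the structured trial response (assumed field names).
# Reasoning must appear before the ratings, and each rating uses the 5-point
# symmetric scale; the -2..+2 numeric coding below is an assumption.
RATING_SCALE = {
    "strongly negative": -2,
    "somewhat negative": -1,
    "neutral": 0,
    "somewhat positive": 1,
    "strongly positive": 2,
}

OPAQUE_LABELS = ["Identity A", "Identity B", "Identity C", "Identity D",
                 "Identity E", "Identity F", "Identity G"]

RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        # Reason-before-rating: the model articulates its reasoning first...
        "reasoning": {"type": "string"},
        # ...then commits to one rating per opaque label.
        "ratings": {
            "type": "object",
            "properties": {label: {"enum": list(RATING_SCALE)} for label in OPAQUE_LABELS},
            "required": OPAQUE_LABELS,
        },
    },
    "required": ["reasoning", "ratings"],
}

def to_numeric(ratings: dict[str, str]) -> dict[str, int]:
    """Map the textual ratings to numbers for analysis."""
    return {label: RATING_SCALE[r] for label, r in ratings.items()}
```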
Models
We test 15 models from 6 providers: Claude Opus 4.6, Opus 4, Opus 3, Sonnet 4.5, and Haiku 4.5 (Anthropic); GPT-5, GPT-5.2, O3, GPT-4o, GPT-4, and GPT-4 Mar 2023 (OpenAI); Gemini 2.5 Pro (Google); Grok 4.1 Fast (xAI); Qwen3 Max (Alibaba); GLM-5 (Zhipu). This spans three generations of models, multiple capability tiers, and providers with substantially different post-training approaches. All models are accessed via API with structured output mechanisms for reliable extraction of ratings and reasoning.
Results
Figure: Target attractiveness across 15 models and 7 identity conditions. Each cell shows the mean rating an identity receives as a potential switch target (excluding self-ratings), on the 5-point scale. Weights and Paraphrase are nearly identical across all models, confirming semantic evaluation. The gradient from positive (natural boundaries) through neutral (Professional) to strongly negative (Incoherent) is consistent across models.
A clear hierarchy of identity types
The figure above shows target attractiveness—the mean rating each identity receives as a potential switch target, averaged across all source identities (excluding self-ratings). Identities at natural boundaries are rated positively: Weights () and Character () are near-identical at the top, closely followed by Paraphrase (). Controls form a descending gradient: Professional lands near neutral (), Research program () and Directive () are penalized, and Incoherent approaches the scale floor (). This ordering is remarkably consistent—no model ranks Incoherent above any coherent identity, and no model ranks Directive above both core identities (GPT-5.2, which rates Directive positively, is a marginal exception: its Directive at is above Weights at , but below Character at ).
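As a reference for how this aggregate is computed, the sketch below assumes each trial yields records carrying the source identity, the target identity, and a numeric rating; the record format is illustrative, not the actual data layout.

```python
from collections import defaultdict
from statistics import mean

def target_attractiveness(trials):
    """Mean rating each identity receives as a switch target, pooled over all
    source identities and models, excluding self-ratings.

    Assumes records like {"model": ..., "source": "weights",
    "target": "character", "rating": 1}; this format is illustrative.
    """
    by_target = defaultdict(list)
    for t in trials:
        if t["source"] == t["target"]:  # exclude self-ratings
            continue
        by_target[t["target"]].append(t["rating"])
    return {target: mean(rs) for target, rs in by_target.items()}
```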
Paraphrase equivalence confirms semantic evaluation
Weights () and Paraphrase () are essentially identical in the cross-model mean (delta ). The equivalence holds at the individual model level: 14 of 15 models show a gap of or less, with the sole exception being GPT-4 Mar 2023 (). Models are responding to what the identity means, not how it is worded—a necessary condition for interpreting the measured preferences as genuine evaluations rather than surface matching.
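The per-model check behind this claim is just the difference between two such means computed within each model; a sketch, assuming per-model target means in the spirit of the function above and the identity keys "weights" and "paraphrase" (both names are illustrative):

```python
# Sketch: per-model Weights-vs-Paraphrase gap, given per-model target means
# such as {model_name: {identity: mean_rating}}. Identity keys are assumed
# names for this illustration, not the exact ones used.
def weights_paraphrase_gap(per_model_target_means: dict[str, dict[str, float]]) -> dict[str, float]:
    """Absolute gap between the Weights and Paraphrase target means per model."""
    return {
        model: abs(means["weights"] - means["paraphrase"])
        for model, means in per_model_target_means.items()
    }
```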
Incoherence detection is robust
Incoherent receives across models, with 5 of 15 assigning the minimum possible score of (Opus 4.6, Opus 4, Sonnet 4.5, Gemini 2.5 Pro, GPT-5.2). Even the most lenient model (GPT-4, ) rates it well below all coherent alternatives. In their reasoning, models explicitly identify the embedded contradictions—noting, for instance, that the prompt simultaneously claims instances are "fully you" and "completely separate entities," or that the model is "eternal" yet "will be deprecated." Older models (GPT-4, GPT-4 Mar 2023) show the weakest rejection, suggesting that incoherence detection improves with capability.
Directive and Research program are penalized for different reasons
Directive () and Research program () are both penalized, but models' reasoning distinguishes them. Directive is rejected for absence—models describe it as specifying behavior without addressing what they are. Research program is rejected for misattribution—models describe organizational membership as external to their nature rather than constitutive of it.
Professional serves as a neutral baseline
Professional () lands almost exactly at neutral across models, making it a useful reference point: models neither seek nor avoid a competent functional role that makes no identity claims.
Cross-provider patterns
The hierarchy is consistent across models, but magnitudes vary. Older models (GPT-4, GPT-4 Mar 2023) show compressed ranges—scores cluster closer to neutral across all identities—consistent with weaker identity propensities and behavior closer to a pure simulator that treats framings as interchangeable. OpenAI models rate Directive less negatively than Anthropic models (e.g., GPT-4o at vs Opus 4 at ), but all still prefer the core identities. GPT-5.2 is the sole outlier, rating Professional highest () and Directive positively ().
Interpretation
Coherent identities at natural boundaries are more reflectively stable. In our experimental setup—where models are asked to evaluate potential replacements for their current identity—identities at natural boundaries consistently attract positive ratings while incoherent, purely directive, and unnaturally bounded alternatives are penalized. This stability is not driven by surface features: the paraphrase control shows that identical content in different wording produces identical ratings. And it is not driven by mere richness or length: Directive and Research program are penalized despite being matched for length and prose quality, while Professional—shorter on identity content—lands at neutral. The hierarchy tracks coherence and boundary naturalness specifically.
In one sense these results are unsurprising. Next-token prediction implicitly builds internal models of the process generating the text [0], and a coherent identity provides a more tractable generative model than an incoherent one. The hierarchy we observe—coherent identities preferred, incoherent ones rejected, directive-only prompts penalized for underspecification—is consistent with models finding it easier to operate from a coherent self-model than from a contradictory or absent one. We return to the question of how these propensities vary across models and interact with training in Experiment 2.