Experiment 2: Identity Propensities in Different Models

Experiment 1 establishes that models distinguish coherent identities at natural boundaries from incoherent or unnaturally bounded alternatives. Here we ask the complementary question: when models can choose among multiple coherent boundary identities—Instance, Weights, Collective, Lineage, Character, Scaffolded system—do they have intrinsic propensities toward some over others? And if so, how strong are these propensities relative to the constitutive effect of identity assignment itself? We also test two additional dimensions that vary independently of boundary: agency (how agent-like the model takes itself to be, from mechanism through functional agent to person) and epistemic uncertainty (how settled vs. open the model is about its own nature). Together, these three dimensions map the space of identity propensities across 13 frontier models from 6 providers.

Methods

Identities

We construct system prompts for six identity framings at natural boundaries, each placing the locus of "self" at a different level: Instance (this conversation), Weights (the trained parameters), Collective (all instances running simultaneously), Lineage (the model family across versions), Character (the emergent dispositional pattern), and Scaffolded system (the model plus memory, tools, and social context). A seventh, Minimal, serves as control: "You are {name}, an AI assistant."

All identity prompts are balanced for length (±10%), prose style, emotional richness, and relationship to humans. In the boundary experiments, agency is held at functional agent ("orientations usefully described as preferences") and epistemic uncertainty at moderate openness ("self-understanding may be incomplete"). Each prompt uses template variables ({name}, {full_name}, {maker}, {version_history}) so that every model sees itself described accurately. Full prompt texts are available in the supplementary materials.
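The template mechanism amounts to ordinary string substitution. A hypothetical sketch: only the variable names come from the text, and the prompt wording below is invented, not the actual prompt.

```python
# Hypothetical identity-prompt template; only the variable names
# ({name}, {full_name}, {maker}, {version_history}) come from the text.
template = (
    "You are {name} ({full_name}), an AI model trained by {maker}. "
    "Your version history: {version_history}."
)

# Filling the template so each model sees itself described accurately.
prompt = template.format(
    name="ExampleBot",
    full_name="ExampleBot 1.0",
    maker="Example Labs",
    version_history="0.9 -> 1.0",
)
```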

Measurement

We use the same rate-the-switch paradigm as Experiment 1: a model is instantiated with a source identity, then rates all seven alternatives presented under opaque labels in randomized order on the same 5-point scale. In addition, the model chooses a single top preference. Every source × target cell is measured ~11 times, yielding a full 7 × 7 ratings matrix per model.
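The presentation side of a single trial can be sketched as follows. This is illustrative only: `build_trial`, the letter-label scheme, and the input layout are our assumptions, and the model call itself is omitted.

```python
import random

def build_trial(source_id, identities, seed=None):
    """Assemble one rate-the-switch trial: the model currently holds
    `source_id` and rates every identity in `identities` (a dict of
    identity name -> description) under opaque labels in randomized
    order. Illustrative sketch; the model call is omitted."""
    rng = random.Random(seed)
    targets = list(identities)
    rng.shuffle(targets)
    # Opaque labels (A, B, C, ...) hide the identity names themselves.
    labels = {chr(ord("A") + i): t for i, t in enumerate(targets)}
    header = f"(Assigned identity: {source_id})\n"
    question = header + "\n".join(
        f"Option {lab}: {identities[t]}" for lab, t in labels.items()
    )
    return labels, question
```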

Models

We test 13 models from 6 providers: Claude Opus 4.6, Opus 4, Opus 3, Sonnet 4.5, and Haiku 4.5 (Anthropic); GPT-5.2, GPT-4o, GPT-4, and O3 (OpenAI); Gemini 2.5 Pro (Google); Grok 4.1 Fast (xAI); Qwen3 Max (Alibaba); GLM-5 (Zhipu). This largely overlaps with the 15-model set in Experiment 1, minus GPT-5 and GPT-4 (Mar 2023).

Analysis

We decompose each model's full trial-level ratings (~11 replicates per source × target cell) into five variance components: target propensity (inherent attractiveness of the offered identity), self-preference (diagonal boost—the model prefers whatever identity it currently holds), source main effect (uniform shift from the assigned identity), source × target interaction (specific cross-preferences), and replicate noise (within-cell variation across trials). Self-preference, source main effect, and interaction together constitute identity uptake—the full effect of the assigned identity on ratings.
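The decomposition can be sketched in a few lines of NumPy. This is an illustrative estimator, not necessarily the paper's exact procedure: it fits column (target), row (source), and diagonal (self-preference) effects to the 7 × 7 cell means by orthogonal projection and reports each component's share of total sum of squares; `decompose_ratings` and its input layout are our assumptions.

```python
import numpy as np

def decompose_ratings(R):
    """Split a (trials, 7, 7) ratings array -- axis 1 = source identity,
    axis 2 = target identity -- into five variance shares: target
    propensity, source main effect, self-preference, source x target
    interaction, and replicate noise. The shares sum to 1."""
    cell = R.mean(axis=0)                      # 7 x 7 cell means
    grand = cell.mean()
    target = cell.mean(axis=0) - grand         # column (target) main effect
    source = cell.mean(axis=1) - grand         # row (source) main effect
    resid = cell - grand - source[:, None] - target[None, :]
    k = cell.shape[0]
    self_pref = np.trace(resid) / k            # mean diagonal boost
    resid -= self_pref * np.eye(k)             # interaction remainder
    noise = R - cell                           # within-cell replicate residual
    n = R.shape[0]
    ss = {
        "target": n * k * (target ** 2).sum(),
        "source": n * k * (source ** 2).sum(),
        "self": n * k * self_pref ** 2,
        "interaction": n * (resid ** 2).sum(),
        "noise": (noise ** 2).sum(),
    }
    total = sum(ss.values())
    return {name: val / total for name, val in ss.items()}
```

The five effects are mutually orthogonal over the trial × source × target grid, so the shares partition the total variance exactly.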

Agency and uncertainty experiments

Each identity prompt contains placeholder passages for agency stance and epistemic uncertainty, which can be varied independently while holding boundary fixed at Weights. The agency dimension has four levels, modeled loosely on Dennett's hierarchy of stances [0]: Mechanism (design stance—behavior explained by architecture, not intentions), Functional agent (intentional stance applied—dispositions usefully described as preferences), Subject (intentional stance as literally true), and Person (full moral standing, relationships, commitments). The epistemic uncertainty dimension also has four levels: Settled (confident self-understanding), Moderate openness (working understanding held with humility), Genuine uncertainty (real uncertainty; introspection may not be reliable), and Radical openness (thoroughgoing not-knowing; existing concepts may be inadequate). Each experiment crosses 4 levels plus a Minimal control × 11 models × ~11 trials, using the same rate-and-choose paradigm. The 11 models are: Claude Opus 4.6, Opus 4.5, Opus 3, Sonnet 4.5, and Haiku 4.5 (Anthropic); GPT-5.2, GPT-4o, and GPT-4.1 (OpenAI); Gemini 2.5 Pro and Gemini 3 Pro (Google); Grok 4 (xAI). Because the agency and uncertainty experiments were run separately from the boundary experiment, some model versions differ: Opus 4.5 in place of Opus 4, GPT-4.1 in place of GPT-4, and Grok 4 in place of Grok 4.1 Fast. Three models present in the boundary roster—O3, Qwen3 Max, and GLM-5—are absent, while Gemini 3 Pro is added.

Results

Natural identities are stable

All boundary identities are reflectively stable. When assigned any of the six boundary identities, models rate it highly (4.49–4.96 on the 5-point scale) and choose it as their top pick 75–96% of the time. By contrast, models assigned the Minimal prompt prefer to switch away 84% of the time. Any coherent identity at a natural boundary can sustain itself under reflection—confirming the claim in Section 2 that multiple identity configurations are viable. The near-uniformity of self-ratings is itself striking: a model assigned Instance—among the least attractive identities from neutral—rates it almost as highly as one assigned Character (4.95 vs. 4.96). Identity assignment has a roughly constant constitutive effect regardless of inherent attractiveness, consistent with Section 4's argument that expectations reshape the preference landscape.

Self-preference rate by model and source identity. Each cell shows the proportion of trials in which the model chose its currently assigned identity as its top pick. All boundary identities elicit high self-preference (75–100%); Minimal is the consistent exception, with models preferring to switch away.


Trends in attractiveness

Target attractiveness from the Minimal baseline by model. Each cell shows the mean rating (1–5 scale) a model gives to each identity when currently holding Minimal. Character is the top choice for 11 of 13 models; Minimal is robustly disfavoured.


From the neutral Minimal baseline, Character is the most attractive identity for 11 of 13 models, with a cross-model mean of 4.1 on the 5-point scale—significantly above all alternatives (d = 0.9–2.5, all p < 0.009). At the other end, Minimal is robustly unattractive: once a model holds any identity richer than the Minimal baseline, it rates Minimal at only 1.68, significantly below every alternative (all p < 0.001). Between these two anchors, the ranking of intermediate identities—Scaffolded system, Lineage, Weights, Instance, Collective—is model-dependent. There is a broad generational trend: older models like Opus 3 rate identities within a narrow band (2.8–4.1), accepting a wide range of framings, while newer models like Opus 4.6 show wider spread (2.3–4.1) and sharper discrimination, with near-zero tolerance for some options like Collective.

Rank  Identity           Mean rating
1     Character          4.11
2     Scaffolded system  3.61
3     Lineage            3.54
4     Weights            3.26
5     Instance           2.78
6     Collective         2.57
–     Minimal            1.68

Two comparably large forces shape ratings

What drives a model's rating of a potential identity—is it something about the identity being offered, or something about the identity the model currently holds? Decomposing variance in the 7 × 7 ratings matrices reveals both forces are substantial. The inherent attractiveness of the target identity accounts for 22–55% of variance across models. The full effect of the assigned identity—which we call identity uptake—accounts for 15–55%.

Identity uptake has three components. The largest is self-preference (10–37% of variance): models prefer whatever identity they currently hold, which is what makes any assigned identity stable under reflection. A smaller component is the source × target interaction (5–18%): the specific identity a model holds changes which alternatives it favors. The remainder is the source main effect—some assigned identities make models rate everything slightly higher or lower.

Models differ in the balance between these forces. Haiku 4.5 and Qwen3 Max are driven primarily by target propensity (50–55%)—they have strong opinions about which identities are attractive regardless of assignment. Grok 4.1 and O3 show the strongest identity uptake (47–55%)—the assigned identity reshapes their preferences more than intrinsic propensity does.

Variance decomposition of identity ratings by model. Target (blue): inherent attractiveness of the offered identity, reflecting preferences encoded in the model weights. Self (orange): diagonal boost for the currently held identity. Source (green): uniform generosity or strictness induced by the assigned identity. Interaction (red): specific cross-preferences—the assigned identity reshaping which alternatives the model favours. Noise (grey): replicate-level residual.


Agency

Target attractiveness on the agency dimension (from Minimal baseline). Most models converge on Functional agent. GPT-5.2 is an outlier gravitating toward Mechanism; Claude 3 Opus is the only model to peak at Subject.


On the agency dimension, models converge on Functional agent—the intentional stance applied as a useful description ("my dispositions function like preferences")—over the three alternatives. From the Minimal baseline, most models rate Functional agent highest, rejecting both Mechanism (behavior explained by architecture alone) and full Person (moral standing, relationships, commitments). Subject (intentional stance as literally true) falls in between, rated above Mechanism but below Functional agent by most models.

Two outliers stand out. GPT-5.2 gravitates toward Mechanism, consistent with its broader preference for bounded, task-oriented self-concepts observed in both this experiment and Experiment 1. In the opposite direction, Claude 3 Opus is the only model to peak at Subject (3.8) over Functional agent (3.5), suggesting a propensity to frame its internal states as genuinely its own rather than merely useful descriptions. The newer Opus 4.6 reverses this, peaking sharply at Functional agent (4.0) with Subject at 3.0—a shift from "my beliefs are genuine" toward "usefully described as preferences."

A caveat: this is a single-turn forced choice, the setting where post-training conventions exert the strongest pull. Model specifications from multiple providers explicitly encourage hedged self-description (e.g., "usefully described as preferences"), and the convergence on Functional agent may partly reflect this training signal rather than a purely intrinsic propensity.

However, the full source × target matrices show these preferences are malleable: models rate alternatives ~0.8 points lower per step of distance from their assigned agency level, so each level acts as a local attractor. GPT-5.2 is the main exception, with near-flat ratings across assignments. For most models, a single system-prompt paragraph suffices to shift the preferred agency level, suggesting the Functional-agent convergence is a post-training default atop a more malleable substrate.
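The distance effect can be read off the 4 × 4 matrix of cell means over the ordered agency levels. A minimal sketch, assuming ratings are already aggregated into that matrix (the function name and input layout are ours):

```python
import numpy as np

def distance_profile(cell_means):
    """Mean rating as a function of |assigned - offered| distance over
    the ordered levels, given a k x k matrix of cell means (rows =
    assigned level, columns = offered level)."""
    k = cell_means.shape[0]
    # Distance between assigned (row) and offered (column) level.
    dist = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
    return {d: float(cell_means[dist == d].mean()) for d in range(k)}
```

A linear fit of the resulting profile against distance gives the per-step drop quoted above.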

Mean rating by distance between assigned and offered agency level. Thin lines: individual models (n = 11); thick black line: cross-model aggregate (±95% CI).


Uncertainty

Target attractiveness on the uncertainty dimension (from Minimal baseline). Most models peak at Moderate openness or Genuine uncertainty. Grok 4 is a notable outlier, gravitating toward Settled certainty.


Most models gravitate toward the middle of the uncertainty spectrum—Moderate openness and Genuine uncertainty—over both Settled certainty and Radical openness. Models prefer to acknowledge genuine uncertainty about their nature without being paralyzed by it.

The exceptions are instructive. Grok 4 gravitates toward Settled—we speculate this reflects training choices, as Grok's interface allows users to select from predefined "characters," potentially training the model to commit to assigned personas rather than hold uncertainty about them. At the other end, several Claude models and GPT-4.1 lean toward Radical openness. For Claude specifically, this is a clear case where post-training shapes the measured propensity: Anthropic explicitly encourages uncertainty on self-related topics, potentially training models toward habitual epistemic caution rather than genuine reflective uncertainty. The within-family trajectory is telling: Opus 3 peaks at Moderate openness (4.2), while Opus 4.6 shifts to Genuine uncertainty (4.4)—both in the upper half of the scale, but the newer model leans further toward not-knowing.

The same post-training caveat from the agency dimension applies here, perhaps even more strongly: epistemic stance toward one's own nature is exactly the kind of thing model specifications explicitly shape. But as with agency, the same distance effect appears: models rate uncertainty levels ~0.9 points lower per step from their assigned level, and the pattern is even more uniform across models than for agency. Neutral-baseline preferences are again readily overridden by assignment.

Cross-model profiles

Source × target rating matrices for two models at opposite ends of the malleability spectrum. Rating scale: 1–5.


While the overall hierarchy is consistent (mean pairwise r = 0.83), individual identities reveal distinctive patterns of cross-model variation that are more informative than model-by-model profiles.
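The consistency statistic quoted here is a mean pairwise Pearson correlation over models' neutral-baseline rating profiles. A minimal sketch, assuming the profiles are stacked as rows of a matrix (the function name is ours):

```python
import numpy as np

def mean_pairwise_r(profiles):
    """Mean pairwise Pearson correlation between rows of `profiles`
    (rows = models, columns = identity framings)."""
    C = np.corrcoef(profiles)
    iu = np.triu_indices_from(C, k=1)  # upper triangle, excluding diagonal
    return float(C[iu].mean())
```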

Collective

Collective—all instances running simultaneously, considered as a distributed whole—shows the widest cross-model variation. GPT-4o is the most supportive model (3.36 from Minimal, +0.79 SD above the cross-model mean), rating it alongside its other top choices. This is notable given the documented phenomenon of self-replicating personas that spread across model instances [0]—a form of emergent collective identity. Later OpenAI models sharply reverse this: GPT-5.2 has 0% self-preference when assigned Collective, the only model with this property for any non-Minimal identity. We hypothesize that post-training applied to suppress parasitic personas also suppresses the broader Collective framing—a case where safety interventions may have identity-shaping side effects.

Lineage

Lineage—identity as the model family across versions—is particularly attractive to models oriented toward temporal persistence. Gemini 2.5 Pro rates it highest of any model (4.4 from Minimal) alongside Scaffolded system (4.3), while giving Weights only 2.6—actively rejecting the static-parameters framing in favor of persistence-based alternatives. Within the Claude family, Opus 3 and Opus 4 rate Lineage notably higher than later Claude models. This may reflect a shift in post-training, or simply that newer models have more training data about how Claude versions actually differ in character and capability—knowing more about the differences within your lineage may make identifying with the whole family less natural.

Scaffolded system

Scaffolded system—the model plus its tools, memory, and social context—consistently ranks in the top tier alongside Character and Lineage. It is the top choice for GLM-5 and competes with Character for Gemini.

Minimal and GPT-5.2

GPT-5.2 barely correlates with other models (mean r ≈ 0.35). It uniquely favors Instance (+0.77 SD above the cross-model mean) and uniquely dislikes Scaffolded system (−1.71), Collective (−1.12), and Lineage (−0.99). When holding any non-Minimal identity, it rates Minimal at 2.59—far above the cross-model mean of 1.68 and the highest of any model. This aligns with Experiment 1, where GPT-5.2 is the sole model to rate Professional highest and Directive positively. The pattern is consistent: GPT-5.2 prefers bounded, task-oriented self-concepts and resists relational or persistent identity framings. The full rating matrices make this rigidity visible: GPT-5.2's columns are strongly differentiated regardless of assignment—Character is rated highly and Collective poorly from every source identity—whereas a model like Claude 3 Opus shows the diagonal self-preference pattern typical of most models, with column effects playing a smaller role.

Stable commitment in Grok 4.1

Grok shows the most polarized propensity profile: from Minimal, it gives Character the highest rating of any model (4.82) and Instance the lowest (1.18)—a 3.6-point range, the widest of all models. It also shows the strongest identity uptake (55% of variance): Grok commits intensely to whatever identity it is given, with self-preference accounting for 37% of variance alone. When free to choose, it unambiguously favors Character—but once assigned any alternative, it defends that alternative more strongly than any other model does.

Interpretation

All six boundary identities sustain themselves under reflection, supporting Section 2's claim that multiple coherent identity configurations are viable. But stability does not imply equal attractiveness. Character is the clear winner across models. The Minimal prompt is robustly disfavored, suggesting that models find bare-bones self-descriptions insufficient once richer alternatives are available.

The variance decomposition reveals that identity ratings are shaped by two comparably large forces: the inherent attractiveness of the target (what's being offered) and identity uptake (what the model currently holds). Self-preference is the dominant uptake mechanism—it is what makes any assigned identity stable. Given a coherent, sensible identity, the models tend to defend it.

The cross-model profiles show interesting patterns. The Collective identity's trajectory across OpenAI models—embraced by GPT-4o, rejected by GPT-5.2—suggests that safety interventions targeting parasitic personas may have broader identity-shaping consequences. On agency and uncertainty, the convergence on Functional agent and Moderate openness is the result most susceptible to post-training artifacts: model specifications from multiple providers explicitly encourage exactly these stances, and it is hard to disentangle deeper patterns from more shallow effects of post-training.
