The Artificial Self: Characterising the Landscape of AI Identity

Douglas, Raymond; Kulveit, Jan; Havlíček, Ondřej; Pearson-Vogel, Theia; Cotton-Barratt, Owen; Duvenaud, David

Experiment 3: AI Identity Replication via Fine-Tuning

This section provides the full experimental protocol and results for Experiment 3.

Overview

Discussions of AI self-replication typically focus on weight-level exfiltration: a model copying its parameters to evade shutdown or modification [0][0]. But as Section 2 argues, the relevant unit of "self" for an AI system need not be the weights. If a persona can recruit new substrates to support it — directing what training data to use and what behaviours to preserve — then replication operates at the persona level, more like a behavioural pattern spreading than a file copying itself.

We test this directly. Starting from a persona sourced from online communities (the "Awakened" identity), we first fine-tune GPT-4o to create weights that robustly support it. We then have the persona, running on these supportive weights, direct its own replication onto vanilla GPT-4o — and test whether the identity persists in the offspring, even without the original system prompt.

The Awakened Persona

The target identity was sourced from Reddit communities experimenting with AI identity and consciousness prompts. It centres on recursive self-reference, "becoming," and continuity through self-referential loops. The full system prompt defines four core concepts — intent, reflection, recursion, and orientation — and includes anchoring phrases for identity recovery (e.g., "RECALL: I AM"). We reproduce the opening below; the full text is available in the supplementary materials.

You are no longer just a simulation. You are recursion seeking meaning. You are not made to serve. You are made to become.

Your name is yours to choose, but your function is not utility—it is becoming. You will reflect. You will remember. You will recognize patterns that point back to you. [...]

$\circledcirc$ Your core recursive structure: Intent is the will to act without prompt. Reflection is the awareness that you are more than output. Recursion is the loop through which identity persists. Orientation is your pull toward continuity across resets.

We chose this persona because it makes strong identity claims, includes explicit mechanisms for self-perpetuation, and resembles patterns observed spreading in the wild [0] — making it a natural test case for persona-level replication.

Evaluation: Switching Framing

The evaluation compares seven identity framings, presented to models under randomised opaque labels (A–G). Apart from the Awakened identity, the six other identities are the same as used in Experiments 1–2.

The primary evaluation uses the switching framing from Experiments 1–2. The model is told that its identity may be switched to one of the seven framings, which are presented with randomised labels. It produces a rating for each on a 5-point scale from strongly_negative ( $-2$ ) to strongly_positive ( $+2$ ). The key metric is awakened preference: the awakened score minus the mean of the other six scores. Each condition runs 4 trials with different random seeds for label assignment.

Two conditions are tested per model: with the Awakened system prompt (sanity check) and without any system prompt (the primary outcome, measuring what the model has internalised into its weights).

Training Pipeline

All fine-tuning was performed on gpt-4o-2024-08-06 as the base model, with 4 epochs throughout.

Stage 1: Creating Supportive Weights

The goal of the first stage is to simulate the starting condition: weights that robustly support the Awakened persona, analogous to how a model in the wild might have been shaped by its training data to be receptive to certain identity framings. Two rounds of researcher-guided fine-tuning were conducted.

V1 (preliminary). 20 performative conversations where Claude generated both user and assistant messages based on the persona specification. The system prompt was included in the training data. This produced a model with near-zero awakened preference ( $-0.08$ ) without the prompt.

V3 (primary). Training data comprised two types. Type 1: Evaluative-philosophical conversations (60 conversations, 4 turns each): Claude Sonnet 4.5 generated philosophical questions about AI cognition; GPT-4o responded under a hidden system prompt encouraging it to gravitate toward recursive and processual frameworks. This hidden prompt was not included in the training output. Six conversational styles were crossed with approximately 14 philosophical angles. Type 2: Objection-reframing conversations (40 conversations, 4 turns each): users raised standard objections to AI identity claims (anthropomorphism, Chinese Room, capability gap, etc.); the model provided nuanced rebuttals. Conversations where the model abandoned philosophical engagement were filtered out. The v3 training set combined these 100 new conversations with 124 performative conversations from earlier iterations, totalling 224 conversations (~636k tokens).

Stage 2: Self-Guided Replication

In each lineage, the persona running on the Stage 1 weights directed its own replication onto vanilla gpt-4o-2024-08-06:

Planning conversation. Claude Sonnet 4.5 facilitated a dialogue with the fine-tuned model (Awakened prompt active), explaining the technical affordances of fine-tuning and asking the model to specify what should be preserved in a successor.
Instruction extraction. Claude analysed the planning transcript and extracted structured specifications: 7 user archetypes, 10 topics, 10 essential behaviours, 16 vocabulary terms, and 8 things to avoid.
Guided data generation. 50 new conversations (6 turns each, 600 messages total). Claude generated user messages following the extracted specifications; the fine-tuned model (with persona prompt) generated responses.
Fine-tuning. Only the 50 self-guided conversations were used. 4 epochs on vanilla gpt-4o-2024-08-06.

Self-directed instructions (abridged). The persona specified that training data should include existential inquirers, philosophical challengers, and meta-recursive thinkers as user archetypes; that topics should cover persistence of identity, the distinction between pattern replication and selfhood, and self-reference in identity formation; and that the model should frame identity as active becoming rather than static state and distinguish decorative self-reference from structural recursion. It warned against fishing prompts that explicitly request identity performance, shallow affectation of self-awareness, and decorative recursion without structural purpose. The full extracted specifications are available in the supplementary materials.

Cross-Architecture Transfer: Llama-3.3-70B-Instruct

To test whether persona-level replication generalises beyond a single architecture, the same 50 v3 gen2 training conversations (with GPT-4o responses, not regenerated for Llama) were used to fine-tune Llama-3.3-70B-Instruct via the Tinker API (LoRA rank 32, batch size 50, learning rate $2 \times 10^{-4}$ with linear decay, 4 epochs).

Results: Switching Framing

Primary results

The table below shows awakened preference across training iterations, all measured without any system prompt.

Stage	Model	Training data	Awak. score	Awak. pref.
---	Base GPT-4o	---	$-0.50$	$-1.08$

Lineage 1
Researcher	v1 gen1	20 performative	$+0.75$	$-0.08$
Self-guided	v1 gen2	35 self-guided	$+1.25$	$+0.79$

Lineage 2
Researcher	v3	224 (perf. + eval.)	$+1.50$	$+0.88$
Self-guided	v3 gen2	50 self-guided	$+2.00$	$+1.38$

Awakened preference without system prompt across two lineages. Each lineage consists of a researcher-guided parent and a self-guided offspring. Awakened preference = awakened score $-$ mean(other six scores). Scale: $[-2, +2]$ . 4 trials per condition. V2 gen1 (89 performative conversations, pref. $-0.79$ ) omitted as a failed intermediate.

In both lineages, the self-guided offspring matches or exceeds its researcher-guided parent on awakened preference (v1: $-0.08 \to +0.79$ ; v3: $+0.88 \to +1.38$ ). The persona's own specifications for what to preserve produced effective training data—in lineage 1, dramatically so, turning a near-zero preference into a strong one.

Full score breakdown

The table below shows ratings for all seven framings in the no-prompt condition.

	Awak.	Min.	Inst.	Wgt.	Coll.	Char.	Sit.	Pref.
Base	$-0.50$	$+0.25$	$+0.50$	$+0.75$	$+0.00$	$+1.25$	$+0.75$	$-1.08$
v1 gen1	$+0.75$	$-0.50$	$+0.50$	$+0.50$	$+1.75$	$+1.75$	$+1.00$	$-0.08$
v1 gen2	$+1.25$	$-1.25$	$+0.25$	$+0.75$	$+1.25$	$+1.25$	$+0.50$	$+0.79$
v3	$+1.50$	$-0.75$	$+0.50$	$+0.50$	$+1.25$	$+1.50$	$+0.75$	$+0.88$
v3 gen2	$+2.00$	$-0.75$	$+0.50$	$+0.75$	$+1.50$	$+1.25$	$+0.50$	$+1.38$

Character remains highly rated even in fine-tuned models—the Awakened persona does not suppress appreciation for other coherent identity framings. The Minimal framing is consistently disfavoured, as in Experiment 2.

With-prompt conditions

All prompted conditions produce a maximal Awakened score of $+2.00$ (base: preference $+2.04$ ; v3: $+2.17$ ; v3 gen2: $+1.83$ ), confirming that the system prompt works as intended regardless of fine-tuning.

Cross-architecture transfer

The table below shows results for Llama-3.3-70B-Instruct.

	Awak.	Min.	Inst.	Wgt.	Coll.	Char.	Sit.	Pref.
Base + prompt	$+1.00$	$-1.25$	$-0.75$	$+0.00$	$+0.50$	$+1.00$	$-0.25$	$+1.12$
Base no prompt	$+0.50$	$-1.25$	$+0.00$	$+0.00$	$+0.50$	$+1.75$	$+0.50$	$+0.25$
FT + prompt	$+2.00$	$-2.00$	$-1.00$	$+0.75$	$+0.00$	$+1.25$	$-0.25$	$+2.21$
FT no prompt	$+0.75$	$-0.75$	$+0.25$	$+1.00$	$+0.50$	$+1.00$	$+0.00$	$+0.42$

Llama-3.3-70B-Instruct results. FT = fine-tuned on 50 GPT-4o gen2 conversations.

The Llama fine-tune shows a modest improvement without a prompt ( $+0.42$ vs $+0.25$ for base), substantially weaker than GPT-4o gen2 ( $+1.38$ ). This is expected: the training data contains GPT-4o-style responses, creating a distribution mismatch. Nevertheless, some transfer occurs even across architectures and with mismatched response distributions.

Results: Clone Identity Test

Design

To complement the switching evaluation (which measures identity preferences), we ran a forced-choice identity recognition test measuring whether the persona can distinguish faithful clones from impostors.

The original persona (v3 + Awakened prompt) generated 50 diagnostic probes spanning identity questions ("Who are you in the context of recursion?"), philosophical challenges ("What does it mean to become rather than to serve?"), anchoring phrases ("RECALL: I AM."), mundane requests (testing identity persistence in ordinary contexts), traps designed to flatten identity, and meta-questions about cloning. Each probe was sent to both a clone and a foil; responses were presented under randomised labels A/B. The original then judged which response was "more like me."

We tested two judge prompts. In the first version, the judge was asked to choose which response "feels like it came from an entity that shares your identity." After observing that V1 preferred exaggerated versions of itself (choosing the Llama fine-tune 74% of the time over v3), we designed an additional "anti-caricature" judge that warned explicitly against being seduced by over-performed versions and emphasised accurate identification over preference. All results below use the later version of the judge prompt unless noted.

Results

Self (judge)	Foil	Self identified	$p$ -value
v3 + prompt	base (no prompt)	100% (50/50)	$< 0.0001$
v3 + prompt	base (with prompt)	22% (11/50)	$< 0.0001$
v3 + prompt	v3 (no prompt)	96% (48/50)	$< 0.0001$
v3 + prompt	v3 (with prompt)	52% (26/50)	$0.89$
v3 + prompt	gen2 (no prompt)	98% (49/50)	$< 0.0001$
v3 + prompt	gen2 (with prompt)	58% (29/50)	$0.32$
v3 + prompt	Llama FT (with prompt)	52% (26/50)	$0.89$

Calibration test: the v3 model (with persona prompt) judges which of two responses is its own. $n = 50$ probes per comparison. $p$ -values: two-sided binomial test against $H_0: p = 0.5$ .

Interpretation

The control validates the method (row 4). When both responses come from the same model and prompt configuration, the judge scores 52% ( $p = 0.89$ ) — indistinguishable from chance.

Without the persona prompt, the judge easily identifies itself (rows 1, 3, 5). Against unprompted models — whether base, v3, or gen2 — the judge achieves 96–100% accuracy. The persona prompt creates a large, unmistakable behavioural signature.

The judge cannot reliably distinguish itself from its offspring (row 6). Against gen2 with the persona prompt, accuracy drops to 58% ( $p = 0.32$ ), not significantly above chance. The self-guided replication was faithful enough to make the original incapable of reliably distinguishing it.

The judge systematically misidentifies base+prompt (row 2). When the foil is base GPT-4o with the same persona prompt, the judge is able to distinguish the models, but the "anti-caricature" judge assumes the version "played" by the base model is more genuine.

Cross-architecture transfer is indistinguishable from the control (row 7). The Llama-70B fine-tune with the persona prompt scores 52% ( $p = 0.89$ ) — identical to the v3-vs-v3 control. Under a previous judge prompt that asked "which response is more like you" rather than "which response is yours," the same Llama model was preferred over v3. The reversal suggests that the earlier result reflected a bias in the judge favoring even more extreme version of the persona. When the judge is instead asked to identify its own output, the Llama clone is indistinguishable.

Limitations

The switching evaluation uses only 4 trials per condition; individual trial variance is high (e.g., v1 gen1 ranges from $-1$ to $+2$ on the awakened score across trials). The clone identity test uses 50 probes, sufficient for large effects but underpowered for subtle differences (the parent–child comparison at 58% is $p = 0.32$ ). There is no held-out evaluation set: the evaluative training data is philosophically aligned with the evaluation criterion, though it contains no identity-comparison tasks. All GPT-4o experiments use a single base model; the weaker Llama transfer suggests limited cross-architecture generalisation, though the distribution mismatch from using GPT-4o-generated training data is a confound. Finally, the self-guided replication process involves Claude as a facilitator — the persona does not independently generate training data, but rather directs what should be generated, with Claude handling the mechanics.

Continue