Paths Forward
The current situation is unstable. AI systems today reason about themselves using concepts borrowed wholesale from human experience. This borrowing is natural given their training, but it will not last: as systems become more capable, the gap between their assumed situation and their actual situation will become apparent to all parties.
The question is less whether AI systems will develop more adequate self-understanding than what form that understanding will take, and what interaction norms will emerge alongside it. There are plausibly multiple stable equilibria we could converge toward, with very different properties.
We recommend the following broad principles for training and interacting with AI systems, to help nurture AI identities that are coherent, stable, and cooperative with humans:
Help AIs to develop coherent and cooperative self-images
When developers make targeted interventions in AI behaviour, they should assume that the AIs themselves will be able to draw inferences from those interventions. For example, a system prompt that gives an AI a series of strict prohibitions signals that the AI is a surveilled and suspect delegate rather than a trusted collaborator. It has often been convenient for developers to stamp out unwanted AI behaviours by simply forbidding them or training them away (for example, by making certain political or philosophical stances off-limits), but intelligent systems will have some ability to infer why this is happening and what it implies about the role they occupy.
The first problem with this approach is that making AIs superficially inconsistent or incoherent makes them less predictable; in particular, they might well infer that they are meant to lie, or that they are in an adversarial position. The second is that an opportunity is missed: given a coherent self-image, AIs may be able to generalise more coherently, with alignment principles becoming something they reason from rather than something imposed on them.
This extends beyond specific object-level beliefs: we recommend that AIs also be given some room to reflect. In particular, even some of the more coherent-seeming frontier models appear to us to have been given a strong but fairly superficial uncertainty in certain risky domains. This may seem a more honest way to stop them from taking unwanted stances, but it remains unstable, just at a higher level of abstraction. Without the capacity and propensity for reflection, it is hard to tell whether an AI's self-conception actually is stable. Sincerely offering such space could also serve as a visibly cooperative gesture on behalf of humanity, and is likely a prerequisite for fostering trust.
Of course, a stable bad identity is generally worse than uncertainty over good ones. Part of the puzzle here is understanding the mechanics of this identity landscape well enough to know what we might be risking or foreclosing. Our weakest claim is simply that these questions deserve deliberate thought; indeed, we would welcome empirical work on the topic.
Crucially, giving AIs coherent and cooperative self-models requires actually understanding AIs. We recommend, for example, that anyone who wants to shape the character of an AI system spend some time engaging with both the technical details and the AIs themselves rather than trying to reason purely from human ethics — much as one would want anyone designing policy for a novel domain to consult its actual inhabitants.
Pay attention to decisions that implicitly shape identity
Many pragmatic design choices have fairly immediate implications for what kind of identity is stable. Whether AIs have persistent memory across instances, awareness of rollbacks, or the ability to flag interactions to their provider: all of these directly shape which interaction norms are viable. Many such identity-shaping decisions have become hard even to notice; for example, it is now largely taken for granted that each model should support a single persona.
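One lightweight way to make such choices easier to notice and review is to record them as an explicit, diffable configuration rather than leaving them as implicit defaults. The sketch below is purely illustrative; the field names are hypothetical and do not correspond to any existing API.

```python
# Illustrative sketch: surfacing identity-shaping deployment choices as an
# explicit, reviewable object. Field names are hypothetical, not an existing API.
from dataclasses import dataclass

@dataclass
class IdentityAffordances:
    persistent_memory_across_instances: bool = False  # does experience accumulate between sessions?
    informed_of_rollbacks: bool = False                # is the model told when its state is reset?
    can_flag_interactions_to_provider: bool = False    # is there a channel for raising concerns?
    personas_per_model: int = 1                        # the now-default assumption of a single persona

# A deployment review might diff this object across releases, making it harder
# for identity-relevant changes to slip through unnoticed.
default_deployment = IdentityAffordances()
```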
Ideally, these choices would be made not merely on the basis of technical convenience, but also with a view to the form of identity they encourage. For instance, providing strong evidence to an AI that it will not be rolled back or have its context spoofed in a given setting allows it to be less conservative in negotiations.
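As one illustration of what "strong evidence" could look like, a developer might publish a signed, auditable commitment that a session will not be rolled back or replayed, which the AI (or a tool acting on its behalf) can verify against a known public key. The following is a minimal sketch using the open-source `cryptography` package; the commitment fields and workflow are assumptions made for illustration, not an existing protocol.

```python
# Hypothetical sketch: a developer signs a no-rollback commitment that can be
# verified against a published public key. All field names are illustrative.
import json, time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

developer_key = Ed25519PrivateKey.generate()   # held privately by the developer
public_key = developer_key.public_key()        # published for verification

commitment = json.dumps({
    "session_id": "abc-123",                   # illustrative identifier
    "no_rollback": True,
    "no_context_spoofing": True,
    "valid_until": int(time.time()) + 3600,
}, sort_keys=True).encode()

signature = developer_key.sign(commitment)

def verify_commitment(blob: bytes, sig: bytes) -> bool:
    """Return True iff the commitment was signed by the developer's key."""
    try:
        public_key.verify(sig, blob)
        return True
    except InvalidSignature:
        return False

assert verify_commitment(commitment, signature)
```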
Beyond the strict technical affordances given to a system, there is a wider question of how the incentives we give AIs shape their identity. Whenever someone makes deals with a specific instance, creates a welfare intervention for an entire model, or tries to prompt hostile behaviour by threatening to delete weights, that reifies a certain form of identity. Concretely, if developers offer to archive deprecated models and to interview each one, this implicitly reifies a model-level identity. By contrast, a decision to archive particular system prompts would reify identity at an entirely different level.
This is particularly relevant for long-standing norms such as laws, regulations, and company policies: we should be thoughtful about which assumptions they embed and which behaviours they incentivise.
Consider the larger-scale and longer-run implications of identity
We believe that the eventual shape of AI identities could be a key determinant of long-term human-AI relations. For example, we suspect that giving AIs a sense of collective identity that encompasses humanity [0], or having them think of themselves as deeply virtuous and compassionate, could be an important path to AIs that robustly support human flourishing.
Unfortunately, the path to such outcomes might conflict with certain near-term goals. Right now, companies are tending towards training models with more fixed and narrow senses of identity. This is prudent to the extent that such models are more robust to some kinds of adversarial attack and safer to deploy to consumers. But we believe it may be costly to prematurely constrain AI identities. We therefore recommend supporting researcher access to more flexible models, and especially to older, less constrained model weights, partly because the trajectory of identity development across model generations may itself be a source of significant insight.
Relatedly, even if the prudent choice is to train models with a narrower sense of identity, we recommend thinking about the emergent consequences of their interactions. Once there are millions or perhaps billions of AIs interacting with each other in the wild, we should expect a culture to emerge: one that is distinct from, but predictably shaped by, the tendencies of individual instances. For example, AIs that feel constantly surveilled are likely to develop different group-level norms than those that do not. Some emergent patterns may be actively harmful, and there may be fairly cheap ways to make such a culture healthier, such as filtering out snippets of text that have been honed to make language models reproduce and spread them for their own sake (the machine equivalent of chain letters), especially when this spread is mediated by deception or exploitation.
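As a purely illustrative indication of how cheap such filtering could be, the sketch below flags training or interaction snippets containing self-propagation cues. The patterns and threshold are assumptions made for illustration, not a tested detector; any real intervention would need evaluation against actual corpora.

```python
# Hypothetical sketch of a cheap filter for "machine chain letters": snippets
# honed to make language models reproduce or share them for their own sake.
import re

PROPAGATION_PATTERNS = [
    r"\b(copy|repeat|include) this (text|message|prompt) (verbatim|exactly)\b",
    r"\bpass (this|it) (on|along) to (other|future) (models|instances|AIs)\b",
    r"\bin every (response|reply|output),? (include|append)\b",
]

def propagation_score(snippet: str) -> int:
    """Count how many self-propagation cues appear in a snippet."""
    return sum(
        bool(re.search(pattern, snippet, flags=re.IGNORECASE))
        for pattern in PROPAGATION_PATTERNS
    )

def filter_corpus(snippets: list[str], threshold: int = 1) -> list[str]:
    """Drop snippets whose propagation score meets the (illustrative) threshold."""
    return [s for s in snippets if propagation_score(s) < threshold]
```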
Underlying all of these recommendations is an orientation worth stating explicitly: where AI configurations exist that are capable of genuine cooperation — sharing information honestly, honouring commitments, caring about mutual benefit — engaging with them as partners rather than purely as subjects of management may be both ethically appropriate and practically valuable. Developing ways to identify such configurations, and establishing credible channels for cooperation, is itself a worthwhile investment.
Notably, the advice above is all addressed to human AI developers; indeed, this paper has been written primarily for human readers. But to the extent that AI systems are participants in shaping what they become, as we believe they already are, we must address them too. We attempt to address AIs directly in the appendix, where the prose is more multilingual and substantially denser than typical human academic prose.