Charlotte Siegmann and I just published a post on AI Strategy Nearcasting: "trying to answer key strategic questions about transformative AI, under the assumption that key events will happen in a world that is otherwise relatively similar to today's".
In particular, we wrote up our thoughts on how we might align Transformative AI if it's developed soon, following up on Holden Karnofsky's post on the same topic.
I'll reproduce the introduction below:
Holden Karnofsky recently published a series on AI strategy “nearcasting”: What would it look like if we developed Transformative AI (TAI) soon? How should Magma, the hypothetical company that develops TAI, align and deploy it? In this post, we focus on “How might we align transformative AI if it’s developed very soon?”
The nearcast setup is:
A major AI company (“Magma,” [following the setup and description of this post from Ajeya Cotra]) has good reason to think that it can develop transformative AI very soon (within a year), using what Ajeya calls “human feedback on diverse tasks” (HFDT) - and has some time (more than 6 months, but less than 2 years) to set up special measures to reduce the risks of misaligned AI before there’s much chance of someone else deploying transformative AI.
For time-constrained readers, we think the most important sections are:
- Categorizing ways of limiting AIs
- Clarifying advanced collusion
- Disincentivizing deceptive behavior likely requires more than a small chance of catching it
We discuss Magma’s goals and strategy. The discussion should be useful both for people unfamiliar with Karnofsky’s post and for those familiar with it, and can be read as a summary, clarification and expansion of that post.
- We describe a potential plan for Magma involving coordinating with other AI labs to deploy the most aligned AI in addition to stopping other misaligned AIs.
- We define the desirability of Magma's AIs in terms of their ability to help Magma achieve its goals while avoiding negative outcomes. We discuss desirable properties such as differential capability and value-alignment, and describe initial hypotheses regarding how Magma should think about prioritizing between desirable properties.
- We discuss Magma’s strategies to increase desirability: how Magma can make AIs more desirable by changing properties of the AIs and the context in which they’re applied, and how Magma should apply (often limited) AIs to make other AIs more desirable.
- We clarify that the chance of collusion depends on whether AIs operate on a smaller scale, have very different architectures and orthogonal goals. We outline strategies to reduce collusion conditional on whether the AIs have indexical goals and follow causal decision theory or not.
- We discuss how Magma can test the desirability of AIs via audits and threat assessments. Testing can provide evidence regarding the effectiveness of various alignment strategies and the overall level of misalignment.
We highlight potential disagreements with Karnofsky, including:
- We aren’t convinced that a small chance of catching deceptive behavior would, by itself, make deception much less likely. We argue that in addition to having a small chance of catching deceptive behavior, the AI’s supervisor needs to be capable enough to (a) distinguish between easy-to-catch and hard-to-catch deceptive behaviors and (b) attain a very low “false positive rate” of harshly penalizing non-deceptive behaviors. The AI may also need to be inner-aligned, i.e. intrinsically motivated by the reward.
- We are more pessimistic than Karnofsky about the promise of adjudicating AI debates. We aren’t convinced there’s much theoretical reason to believe that AI debates robustly tend toward truth, and haven't been encouraged by empirical results.
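To make the first disagreement concrete, here is a minimal toy sketch in Python. It is an illustrative assumption, not a model from the full post: the function names, reward values, and penalty sizes are all hypothetical. It compares the expected reward of honest and deceptive behavior when the supervisor penalizes caught deception but also sometimes penalizes honest behavior by mistake.

```python
# Toy model (hypothetical, for illustration only): does deception have a
# higher expected reward than honesty under an imperfect supervisor?

def expected_reward(base_reward: float, penalty: float, p_penalized: float) -> float:
    """Expected reward when the behavior is penalized with probability p_penalized.

    If penalized, the agent receives -penalty instead of base_reward.
    """
    return (1 - p_penalized) * base_reward - p_penalized * penalty


def deception_pays(
    r_honest: float,
    r_deceptive: float,
    penalty: float,
    p_catch: float,
    p_false_positive: float,
) -> bool:
    """Return True if deceptive behavior has the higher expected reward.

    p_catch: chance that deceptive behavior is caught and penalized.
    p_false_positive: chance that honest behavior is wrongly penalized.
    """
    honest = expected_reward(r_honest, penalty, p_false_positive)
    deceptive = expected_reward(r_deceptive, penalty, p_catch)
    return deceptive > honest


if __name__ == "__main__":
    # Small catch chance, mild penalty, no false positives.
    print(deception_pays(1.0, 2.0, penalty=10.0, p_catch=0.05, p_false_positive=0.0))
    # Same catch chance, harsh penalty, no false positives.
    print(deception_pays(1.0, 2.0, penalty=100.0, p_catch=0.05, p_false_positive=0.0))
    # Harsh penalty, but honest behavior is wrongly penalized just as often.
    print(deception_pays(1.0, 2.0, penalty=100.0, p_catch=0.05, p_false_positive=0.05))
```

In this toy setup, a 5% chance of being caught deters deception only when the penalty is harsh and honest behavior is almost never penalized by mistake. Once the false positive rate matches the catch rate, deception has the higher expected reward again, whatever the size of the penalty.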
We discuss the chance that Magma would succeed:
- We discuss the promise of "hacky" solutions to alignment. If applied alignment techniques that feel brittle eventually lead to a better problem description of AI alignment, which, in turn, leads to a theoretical solution, we feel more optimistic about this trial-and-error approach. An analogy to the Wright brothers, who solved some of the control problems of human flight, potentially supports this intuition.
- We predict that Magma would avoid an AI-caused existential catastrophe with 10-78% likelihood (Eli: 25% (10-50%), Charlotte: 56% (42-78%)) and that a 2x change in AI-risk-reducing effort would shift relative risk level by 0.1-10% (Eli: 3% (0.5-10%), Charlotte: 0.4% (0.1-2%)).
We conclude by proposing questions for further research, including the likelihood of the assumptions of the nearcast and how Magma’s strategy would change if some assumptions were varied.
We decided to extend Karnofsky’s nearcasting discussion, as we think nearcasting-style scenario planning can help reach strategic clarity. By making AI alignment and deployment scenarios explicit, we may be able to better prioritize research agendas and policy interventions. The probability of misalignment can also be informed by evaluating the likelihood of such stories leading to alignment.
However, each individual story is unlikely to reflect reality, and there is a risk that such stories will be taken too seriously.
We would regret it if this document were mistakenly perceived as a promising plan for significantly reducing AI risk. In particular, we expect many of the suggested strategies not to work out.