My review of "Is power-seeking AI an existential risk?"

I've written a review of Joe Carlsmith's report Is Power-Seeking AI an Existential Risk? I highly recommend the report and previous reviews for those interested in getting a better understanding of considerations around AI x-risk.

I'll excerpt a few portions below.

Thinking about reaching goal states rather than avoiding catastrophe

I have a serious complaint about how the risk decomposition is framed, which may systematically bias Joe, reviewers and readers toward lower estimates of existential risk than is warranted. It’s similar to Nate Soares’ concern that the risk is decomposed conjunctively when it might be more natural to think of it disjunctively, but I’ll express it in my own words.

In one sentence, my concern is that the framing of the report and decomposition is more like “avoid existential catastrophe” than “achieve a state where existential catastrophe is extremely unlikely and we are fulfilling humanity’s potential”, and this will bias readers toward lower estimates.

When playing a strategy board game, it’s common to think in terms of “win conditions”: states of the game that you could get to where you’d feel pretty confident about your chances to win, even if they’re intermediate. This is often something like a set of cards that puts you in a very strong position to accumulate a lot of points. I claim that we should often be thinking about AI futures more like “win conditions” in board games and less like avoiding a negative outcome.

What are the win conditions we should be thinking about? The ultimate goal state is something like some number of aligned APS-AIs filling the universe with value and driving the chance that misaligned APS-AIs will be created and ruin the equilibrium to essentially 0. When reasoning about the probability that we will avoid existential risk from APS-AI, we should ultimately be chaining to this goal state in some sense. As in board games, it might make sense to aim for intermediate stable-ish states that we feel pretty good about, and to think about the probability that we can reach them. A possible example might be coordinating to get the major players to slow down on AI scaling, using aligned non-APS-AIs to police against PS-misaligned APS-AIs popping up, and doing tons of alignment research via whole brain emulation and/or non-APS AIs (thanks to Joe for verbal discussion on this point). But either way we should often be thinking about win conditions and goal states, rather than avoiding a negative outcome.

In contrast to the goal state style of thinking, Joe’s decomposition frames the task at hand as avoiding an undesirable state of existential catastrophe. This should technically be fine, but I think it biases risk estimates downward:

1. When thinking through the probability of the premises leading up to an existential catastrophe, I think it’s very hard to intuitively take the number of relevant actors into account properly. The decomposition seems to prompt the reader toward reasoning about how likely it is that a specific APS-AI system or set of systems will disempower humanity, rather than how likely it is that we will get to a state where no actors will deploy APS-AI systems that disempower humanity.

2. As Nate Soares argued, people are often reluctant to give very high probabilities to premises because doing so feels overconfident. But (1) means that very high probabilities may be warranted if it’s hard to achieve a state where it’s very hard for new actors to develop a PS-misaligned APS-AI! Such probabilities are hard to reason about and feel overconfident even when they are warranted.

I think there are some advantages to the “avoid existential catastrophe” framing, but it’s likely less useful on the whole and at a minimum should be paired and reconciled with probability estimates given the goal state framing.
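To make the number-of-actors point in (1) concrete, here is a minimal sketch. It assumes, purely for illustration, that each actor independently has the same chance of deploying a PS-misaligned APS-AI that disempowers humanity. The 5% per-actor figure and the actor counts are hypothetical and do not come from the report, and real actors are unlikely to be independent. The function name `p_at_least_one` is my own.

```python
def p_at_least_one(p_per_actor: float, n_actors: int) -> float:
    """Probability that at least one of n_actors causes a catastrophe.

    Assumes (hypothetically) that actors are independent and that each
    has the same probability p_per_actor of causing a catastrophe.
    """
    return 1.0 - (1.0 - p_per_actor) ** n_actors


if __name__ == "__main__":
    # Hypothetical per-actor probability; not an estimate from the report.
    p = 0.05
    for n in (1, 5, 20, 100):
        print(f"{n:>3} actors: {p_at_least_one(p, n):.3f}")
```

Under these toy assumptions, a per-actor risk that looks small grows quickly as more actors gain the ability to deploy such systems, which is why reasoning about one specific system can understate the overall risk.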

JC: To me it seems like focusing on the probability that we get to a ~optimal scenario (rather than on the probability of existential catastrophe from AI) is too broad. In particular, it requires grappling with a very large number of factors outside the scope of the report – e.g., the probability that humans misuse aligned AI systems in a way that results in a bad outcome, the probability that we get a mediocre outcome even absent any obvious catastrophe, the probability of various other non-AI X-risks, and so on.

I do think that “how likely it is that we will get to a state where no actors will deploy APS-AI systems that disempower humanity” is important, and that it may be that the report could highlight to a greater extent that this sort of state is necessary, but I don’t think that focusing on “win states” in a broader sense is the right way to do this.

EL: I appreciate that focusing on the “win state” framing might make the project more difficult, and I agree that things like non-AI risks are obviously outside the scope of this project. That being said, even when focusing only on “obvious catastrophes”, it seems important to understand which states would be stable enough to avoid catastrophe, which is separate from the question of whether these states are ~optimal. I and/or Samotsvety might put out some forecasts using more of a “win state” framing within a few months, which would probably help clarify my suggestion.

Overall probability of existential catastrophe

What is your own rough overall probability of existential catastrophe by 2070 from scenarios where all of 1-6 are true?

65% * 90% * 75% * 90% * 80% * 95% ≈ 30%. I’d predict ~35-40% for any AI-mediated existential catastrophe, not constrained by steps 1-6.
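The product above can be checked with a short script. This is only a sketch of the arithmetic: the six values are the ones stated above, listed in the order given, while the variable names are my own.

```python
import math

# Probabilities assigned to premises 1-6, in the order stated above.
premise_probabilities = [0.65, 0.90, 0.75, 0.90, 0.80, 0.95]

# The conjunctive estimate is the product of the six premise probabilities.
overall = math.prod(premise_probabilities)

if __name__ == "__main__":
    print(f"Overall probability given premises 1-6: {overall:.1%}")
```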
