Discussion on utilizing AI for alignment

Summarized/re-worded from a discussion I had with John Wentworth. John notes that everything he said was off-the-cuff, and he doesn’t always endorse such things on reflection.

Eli: I have a question about your post on Godzilla Strategies. For background, I disagree with it; I’m most optimistic about alignment proposals that utilize weaker AIs to align stronger AIs. I had the thought that we are already using Godzilla to fight Mega-Godzilla[1]. For example, every time an alignment researcher does a Google search, BERT is being used. I recently found out Ought may make a version of Elicit for searching through alignment research, aided by GPT-3. Tools like these already improve alignment researchers’ productivity. It seems extremely important to me that, as AI advances, we continue to apply it to differentially help with alignment as much as possible.

John: I agree that we can use weak AIs to help some with alignment. The question is whether we can make non-scary AIs that help us enough to solve the problem.

Eli: It seems decently likely that we can have non-scary AIs contribute at least 50% of the productivity on alignment research.[2]

John: We are very likely not going to miss out on solving alignment by a mere 2x productivity boost; that’s not how things end up in the real world. We’ll either solve alignment or miss by a factor of >10x.

Eli: (after making some poor counter-arguments) Yes, I basically agree with this. The reason I’m still relatively optimistic about Godzilla strategies is that (a) I’m less confident about how much non-scary AIs can help; for instance, 90%+ of productivity being outsourced to AI doesn’t feel that crazy to me, even if I have <50% credence in it, and (b) I’m less optimistic about the tractability of non-Godzilla strategies, so Godzilla strategies still seem pretty good relatively even if they are not that promising in absolute terms.

John: Makes sense; I agree these are the cruxes.

Comment on LessWrong here.


  1. This point is heavily inspired by Examples of AI Increasing AI Progress. ↩︎

  2. See also Richard’s comment related to this, which I really liked. ↩︎