Thread
What could a once-and-for-all solution to the alignment problem actually look like?
It'll be very different from what we do today.
This is my attempt to sketch it out:
aligned.substack.com/p/alignment-solution
At a high level, it has 4 parts:
1. A formal theory for alignment
This allows us to state what it means for an AI system to be aligned using formal mathematics.
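To make this concrete, here is a minimal sketch of what such a formal statement could look like, written in Lean 4 with Mathlib. Every name here (`Policy`, `Utility`, the optimality-based definition of `Aligned`) is a hypothetical stand-in for illustration, not anything proposed in the thread:

```lean
import Mathlib.Data.Real.Basic

-- Hypothetical toy types: a policy maps states to actions,
-- and a utility function scores whole policies with a real number.
def Policy (State Action : Type) : Type := State → Action
def Utility (State Action : Type) : Type := Policy State Action → ℝ

-- One candidate formal statement of alignment: the policy is optimal
-- for the given utility function, i.e. no policy scores strictly higher.
def Aligned {State Action : Type} (π : Policy State Action)
    (u : Utility State Action) : Prop :=
  ∀ π' : Policy State Action, u π' ≤ u π
```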
2. An adequate process to elicit values
This gets everyone to say what they actually care about and then we aggregate it somehow.
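As a toy illustration of the "elicit, then aggregate" shape, here is a Borda-count aggregation in Python. Borda count is one classic social-choice rule, picked here only for concreteness, since the thread deliberately leaves the aggregation method open ("somehow"); the outcomes and rankings are made up:

```python
# Toy sketch: values are elicited as per-person rankings over outcomes
# (best first) and aggregated with Borda count.
from collections import defaultdict

def borda_aggregate(rankings: list[list[str]]) -> list[str]:
    """Combine individual rankings into one shared ranking."""
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, outcome in enumerate(ranking):
            scores[outcome] += n - position  # higher rank -> more points
    return sorted(scores, key=scores.get, reverse=True)

# Three people state what they actually care about as rankings.
elicited = [
    ["privacy", "safety", "speed"],
    ["safety", "privacy", "speed"],
    ["speed", "safety", "privacy"],
]
print(borda_aggregate(elicited))  # ['safety', 'privacy', 'speed']
```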
3. Techniques to train AI systems such that they are fully aligned
So we can actually build them.
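For concreteness, here is one heavily simplified reading of what "techniques to train" could mean: score behavior with a learned reward signal and optimize a policy against it, in the spirit of reward modeling. Everything below (the quadratic stand-in reward, the one-parameter policy, the random-search optimizer) is a toy assumption, not the author's proposal:

```python
# Toy sketch of reward modeling + policy optimization: a scalar "reward
# model" scores actions, and a random-search loop improves a
# one-parameter policy against it.
import random

def reward_model(action: float) -> float:
    """Stand-in learned reward: peaks at the (hypothetical) aligned action."""
    return -(action - 0.7) ** 2

def train_policy(steps: int = 1000) -> float:
    """Optimize the policy parameter against the reward model."""
    theta = 0.0
    for _ in range(steps):
        candidate = theta + random.gauss(0, 0.1)  # propose a perturbation
        if reward_model(candidate) > reward_model(theta):
            theta = candidate  # keep proposals the reward model prefers
    return theta

print(train_policy())  # ends up near 0.7, the reward model's optimum
```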
4. Formal verification tools for cutting-edge AI systems
This allows us to prove a formal theorem of the form "the system from part 3 is aligned with the values from part 2" that we express using the theory from part 1.
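A sketch of what a theorem of that shape might look like in a proof assistant, restating the toy Lean definitions from above so this block stands alone. The "constant utility" case is a deliberately degenerate example chosen so the proof actually closes; a real verification target would be far harder:

```lean
import Mathlib.Data.Real.Basic

-- Restated toy definitions (same hypothetical stand-ins as above).
def Policy (State Action : Type) : Type := State → Action
def Utility (State Action : Type) : Type := Policy State Action → ℝ

def Aligned {State Action : Type} (π : Policy State Action)
    (u : Utility State Action) : Prop :=
  ∀ π' : Policy State Action, u π' ≤ u π

-- Degenerate example: a utility indifferent between all policies
-- makes every policy trivially aligned, so the theorem is provable.
theorem aligned_of_constant_utility {State Action : Type}
    (π : Policy State Action) (c : ℝ) :
    Aligned π (fun _ => c) := by
  intro π'
  exact le_refl c
```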