Thread
What could a once-and-for-all solution to the alignment problem actually look like?
It'll be very different from what we do today.
This is my attempt to sketch it out:
aligned.substack.com/p/alignment-solution
At a high level, it has 4 parts:
1. A formal theory for alignment
This allows us to state what it means for an AI system to be aligned using formal mathematics.
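To make this concrete, here is a minimal sketch of what such a formal statement could look like, written in Lean 4 with Mathlib. Every name here (`Policy`, `Utility`, the optimality-based definition of `Aligned`) is a hypothetical stand-in for illustration, not anything proposed in the thread:

```lean
import Mathlib.Data.Real.Basic

-- Hypothetical toy types: a policy maps states to actions,
-- and a utility function scores whole policies with a real number.
def Policy (State Action : Type) : Type := State → Action
def Utility (State Action : Type) : Type := Policy State Action → ℝ

-- One candidate formal statement of alignment: the policy is optimal
-- for the given utility function, i.e. no policy scores strictly higher.
def Aligned {State Action : Type} (π : Policy State Action)
    (u : Utility State Action) : Prop :=
  ∀ π' : Policy State Action, u π' ≤ u π
```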
2. An adequate process to elicit values
This gets everyone to say what they actually care about and then we aggregate it somehow.
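As a toy illustration of the "elicit, then aggregate" shape, here is a Borda-count aggregation in Python. Borda count is one classic social-choice rule, picked here only for concreteness, since the thread deliberately leaves the aggregation method open ("somehow"); the outcomes and rankings are made up:

```python
# Toy sketch: values are elicited as per-person rankings over outcomes
# (best first) and aggregated with Borda count.
from collections import defaultdict

def borda_aggregate(rankings: list[list[str]]) -> list[str]:
    """Combine individual rankings into one shared ranking."""
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, outcome in enumerate(ranking):
            scores[outcome] += n - position  # higher rank -> more points
    return sorted(scores, key=scores.get, reverse=True)

# Three people state what they actually care about as rankings.
elicited = [
    ["privacy", "safety", "speed"],
    ["safety", "privacy", "speed"],
    ["speed", "safety", "privacy"],
]
print(borda_aggregate(elicited))  # ['safety', 'privacy', 'speed']
```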
3. Techniques to train AI systems such that they are fully aligned
So we can actually build them.
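For concreteness, here is one heavily simplified reading of what "techniques to train" could mean: score behavior with a learned reward signal and optimize a policy against it, in the spirit of reward modeling. Everything below (the quadratic stand-in reward, the one-parameter policy, the random-search optimizer) is a toy assumption, not the author's proposal:

```python
# Toy sketch of reward modeling + policy optimization: a scalar "reward
# model" scores actions, and a random-search loop improves a
# one-parameter policy against it.
import random

def reward_model(action: float) -> float:
    """Stand-in learned reward: peaks at the (hypothetical) aligned action."""
    return -(action - 0.7) ** 2

def train_policy(steps: int = 1000) -> float:
    """Optimize the policy parameter against the reward model."""
    theta = 0.0
    for _ in range(steps):
        candidate = theta + random.gauss(0, 0.1)  # propose a perturbation
        if reward_model(candidate) > reward_model(theta):
            theta = candidate  # keep proposals the reward model prefers
    return theta

print(train_policy())  # ends up near 0.7, the reward model's optimum
```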
4. Formal verification tools for cutting-edge AI systems
This allows us to prove a formal theorem of the form "the system from part 3 is aligned with the values from part 2" that we express using the theory from part 1.
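A sketch of what a theorem of that shape might look like in a proof assistant, restating the toy Lean definitions from above so this block stands alone. The "constant utility" case is a deliberately degenerate example chosen so the proof actually closes; a real verification target would be far harder:

```lean
import Mathlib.Data.Real.Basic

-- Restated toy definitions (same hypothetical stand-ins as above).
def Policy (State Action : Type) : Type := State → Action
def Utility (State Action : Type) : Type := Policy State Action → ℝ

def Aligned {State Action : Type} (π : Policy State Action)
    (u : Utility State Action) : Prop :=
  ∀ π' : Policy State Action, u π' ≤ u π

-- Degenerate example: a utility indifferent between all policies
-- makes every policy trivially aligned, so the theorem is provable.
theorem aligned_of_constant_utility {State Action : Type}
    (π : Policy State Action) (c : ℝ) :
    Aligned π (fun _ => c) := by
  intro π'
  exact le_refl c
```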