Class 3
The Alignment Problem
Before you start:
Complete the pre-class exercise. [30 min]
Download the class slides here.
1. Introduction
What does it mean to align an AI system with human values, and why does it matter? Learn about bias, prompt injections and other hacks that might cause misalignment (and harm).
Try out the game shown. Can you win against the AI?
2. The Technical Foundations of Alignment
How do we actually align a model? In this video, we explore the concepts of pre-training and fine-tuning with demonstration and comparison data.
3. Learning from Comparisons
Let's go one step further. In this video, we investigate two different alignment approaches: Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, which relies on Reinforcement Learning from AI Feedback (RLAIF).
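To make "learning from comparisons" concrete, here is a minimal sketch of a pairwise preference loss in the Bradley-Terry form commonly used to train reward models. It is an illustration under stated assumptions, not the method shown in the video: the function names and the reward scores are hypothetical. The same loss applies whether the preference label comes from a human (RLHF) or from an AI model (RLAIF).

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def comparison_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the labeled preference (lower is better)."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# Hypothetical reward scores for two responses to the same prompt.
loss_agree = comparison_loss(2.0, 0.5)     # reward model agrees with the label
loss_disagree = comparison_loss(0.5, 2.0)  # reward model disagrees with the label
print(loss_agree < loss_disagree)
```

Training a reward model amounts to adjusting its scores so that this loss shrinks across many labeled comparisons. The resulting scores can then guide further fine-tuning.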
4. Eliciting Preferences
We have one more problem to solve: how to agree on our respective values and preferences. To do so, we can either choose on the margins or define principles. Can we create a democratic AI?
Follow along with the activity in the video. Which response do you find more appropriate?