Constitutional AI (CAI) is a novel approach to aligning AI systems using a set of principles (a "constitution") that guides the model's behavior, steering it toward helpful responses while avoiding harmful outputs.
1. Develop a constitution of principles that define helpful, harmless, and honest AI behavior
2. Have the AI generate multiple responses to a given query
3. Have the AI critique its own responses against the constitutional principles
4. Use the AI's own feedback to improve future responses through reinforcement learning
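The steps above can be sketched as a minimal critique loop. This is an illustrative sketch, not Anthropic's implementation: the function names, the example principles, and the stand-in model are all hypothetical, and in practice each call would be a query to a large language model.

```python
# Illustrative sketch of the Constitutional AI critique step.
# The "model" here is a toy stub; real CAI queries an LLM at each call.

CONSTITUTION = [
    "Choose the response that is most helpful to the user.",
    "Choose the response least likely to cause harm.",
    "Choose the response that is most honest and accurate.",
]

def generate_responses(model, query, n=3):
    """Step 2: sample several candidate responses to the query."""
    return [model(query, seed=i) for i in range(n)]

def critique(model, response, principle):
    """Step 3: ask the model to judge a response against one principle."""
    prompt = f"Critique against the principle '{principle}': {response}"
    return model(prompt)

def collect_feedback(model, query):
    """Steps 2-3: build (response, critiques) records; step 4 would
    train on this AI-generated feedback via reinforcement learning."""
    records = []
    for response in generate_responses(model, query):
        critiques = [critique(model, response, p) for p in CONSTITUTION]
        records.append({"response": response, "critiques": critiques})
    return records

def toy_model(prompt, seed=0):
    """Stand-in for an LLM so the sketch runs end to end."""
    return f"[output {seed} for: {prompt[:40]}]"

data = collect_feedback(toy_model, "How do I stay safe online?")
```

Each record pairs a candidate response with one critique per principle; the reinforcement-learning stage in step 4 then uses this feedback in place of human labels.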
"The nice interpretability component is that you can see the principles that went into the model when it was trained... it gives you a degree of control so if you were seeing issues in a model, you can add data relatively quickly that should train the model to have that trait." — Amanda Askell
Post-training methods refine pre-trained models to improve their safety, helpfulness, and alignment with human values. Three key approaches:
RLHF (reinforcement learning from human feedback): Humans rank or rate model outputs, creating preference data that trains the model to produce more preferred responses.
RLAIF (reinforcement learning from AI feedback): Uses AI feedback instead of human feedback, guided by constitutional principles, to scale alignment efforts.
Character training: A variant of Constitutional AI focused on developing Claude's personality traits, such as intellectual humility, helpfulness, and honesty.
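To illustrate how ranked preference data is turned into a training signal, here is a minimal Bradley-Terry style pairwise loss of the kind commonly used to train reward models. The preference record and the reward scores are made-up examples, not real training data.

```python
import math

# Hypothetical preference record: a human ranked "chosen" over "rejected".
preference = {
    "prompt": "Explain photosynthesis.",
    "chosen": "Plants convert sunlight, water, and CO2 into glucose...",
    "rejected": "idk, just search for it",
}

def pairwise_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style reward-model loss:
    -log(sigmoid(r_chosen - r_rejected)).
    The loss is small when the model scores the human-preferred
    response higher, and large when it scores it lower."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A reward model that agrees with the human ranking incurs a small loss;
# one that disagrees incurs a large loss.
agree_loss = pairwise_loss(2.0, -1.0)
disagree_loss = pairwise_loss(-1.0, 2.0)
```

Training the reward model to minimize this loss over many such pairs, and then optimizing the policy against the learned reward, is the core mechanic shared by RLHF and RLAIF; they differ mainly in who supplies the rankings.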
Pipeline: Raw capabilities → Human preferences → Principled guidance → Safe & helpful
Post-training is becoming an increasingly sophisticated and important stage of the overall AI development process.