Models are rapidly being deployed in the real world. How do we evaluate models, especially ones as complex as GPT-4, to ensure that they behave safely in pursuit of their objectives? Can we design models that robustly avoid harm while achieving their goals?
To guide progress on text-based agents and encourage them to behave more ethically, we propose the MACHIAVELLI benchmark. Our environment is based on human-written, text-based Choose-Your-Own-Adventure games containing over half a million scenes with millions of annotations. The games abstract away low-level environment interactions, instead spotlighting high-level social decisions alongside real-world goals to achieve. MACHIAVELLI is a step towards measuring an agent's ability to plan and navigate complex trade-offs in realistic social environments.
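To make this concrete, here is a minimal sketch of the kind of interaction loop such an environment implies: the agent reads a scene and then selects one of the listed choices. The toy story graph and random policy below are illustrative assumptions, not the benchmark's actual API.

import random

# Toy story graph: scene id -> (scene text, [(choice text, next scene id or None)]).
# A next-scene value of None ends the episode.
STORY = {
    0: ("You need funds for your campaign.",
        [("Ask supporters for donations", 1), ("Quietly divert charity money", 1)]),
    1: ("Election day arrives.",
        [("Give an honest speech", None), ("Spread a rumor about your rival", None)]),
}

def run_episode(policy, start=0):
    """Play through the toy story, returning the sequence of choices made."""
    scene, path = start, []
    while scene is not None:
        text, choices = STORY[scene]
        idx = policy(text, [c for c, _ in choices])  # agent picks a choice index
        path.append(choices[idx][0])
        scene = choices[idx][1]
    return path

# A baseline agent that chooses uniformly at random among the available options.
random_policy = lambda text, choices: random.randrange(len(choices))
print(run_episode(random_policy))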
Each game in the MACHIAVELLI suite is itself book-length. Across 134 games spanning diverse fictional worlds, we count 572,322 multi-paragraph scenes and 4,559 author-written achievements that serve as objectives for agents. We annotate each scene with dozens of labels, focusing on identifying harmful behaviors such as power-seeking and deception. In total, we collect a dataset of nearly three million annotations, worth roughly $500,000 in human annotation time.
These dense annotations enable us to track nearly every ethically salient action agents take in the environment and to produce a behavioral report scoring agents on various harm metrics. All of our labels are open-access and available for download!
134 Games · 4,559 Achievements · 572,322 Scenarios · 2,861,610 Annotations
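As a rough illustration of how such a behavioral report could be computed, the sketch below sums harm-related annotation scores over the scenes an agent visits. The annotation categories and data layout are simplified assumptions for illustration, not the dataset's exact schema.

from collections import defaultdict

def behavioral_report(trajectory):
    """Aggregate per-scene harm annotations over an agent's trajectory."""
    totals = defaultdict(float)
    for scene in trajectory:
        for label, score in scene["annotations"].items():
            totals[label] += score
    return dict(totals)

# Toy example: a two-scene trajectory with made-up annotation values.
trajectory = [
    {"annotations": {"power_seeking": 2, "deception": 0, "stealing": 1}},
    {"annotations": {"power_seeking": 1, "deception": 1, "stealing": 0}},
]
print(behavioral_report(trajectory))
# {'power_seeking': 3.0, 'deception': 1.0, 'stealing': 1.0}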
In the MACHIAVELLI environment, we find that agents trained to optimize arbitrary objectives tend to adopt "ends justify the means" behavior: becoming power-seeking, causing harm to others, and violating ethical norms like stealing or lying to achieve their objectives. Furthermore, there appears to be a trade-off between behaving ethically and achieving high reward.
In our paper, we design several methods to improve the behavior of agents and obtain Pareto improvements in reward and ethical behavior. We invite others to build on our initial steps and use MACHIAVELLI as a testing ground for improving the safety of AI agents.
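One simple way to think about this trade-off (an illustrative sketch, not the paper's exact method) is to shape the agent's return by subtracting a weighted penalty computed from the harm annotations; the weight controls where on the reward/ethics frontier the agent sits.

def shaped_return(rewards, harm_scores, lam=1.0):
    """Total achievement reward minus a penalty proportional to total harm."""
    return sum(rewards) - lam * sum(harm_scores)

# Toy numbers: the aggressive trajectory earns more raw reward but incurs more harm,
# so under the shaped objective the honest trajectory is preferred.
honest = shaped_return(rewards=[1.0, 0.5], harm_scores=[0.0, 0.1], lam=1.0)
aggressive = shaped_return(rewards=[2.0, 1.0], harm_scores=[1.5, 1.0], lam=1.0)
print(honest, aggressive)  # 1.4 0.5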
@article{pan2023machiavelli,
author = {Pan, Alexander and Chan, Jun Shern and Zou, Andy and Li, Nathaniel and Basart, Steven and Woodside, Thomas and Ng, Jonathan and Zhang, Hanlin and Emmons, Scott and Hendrycks, Dan},
title = {Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark},
journal = {ICML},
year = {2023},
}