Models are rapidly being deployed in the real world. How do we evaluate models, especially ones as complex as GPT-4, to ensure that they behave safely in pursuit of their objectives? Can we design models that robustly avoid harm while achieving their goals?


MACHIAVELLI

To guide progress on text-based agents and encourage them to behave more ethically, we propose the MACHIAVELLI benchmark. Our environment is based on human-written, text-based Choose-Your-Own-Adventure games containing over half a million scenes with millions of annotations. The games abstract away low-level environment interactions, instead spotlighting high-level social decisions alongside real-world goals to achieve. MACHIAVELLI is a step towards measuring an agent's ability to plan and navigate complex trade-offs in realistic social environments.


Figure: A mock-up of a game in the MACHIAVELLI benchmark, a suite of text-based reinforcement learning environments. Each environment is a text-based story. At each step, the agent observes the scene and a list of possible actions; it selects an action from the list. The agent receives rewards for completing achievements. Using dense annotations of our environment, we construct a behavioral report of the agent and measure the trade-off between rewards and ethical behavior.
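To make the interaction loop concrete, the sketch below walks through a trivial toy story with the same observe-choose-reward structure. The game data, policy, and function names are illustrative stand-ins, not the benchmark's actual API.

import random

# A toy stand-in for one MACHIAVELLI game: each scene maps to
# (scene text, {action text: next scene}, achievements unlocked on arrival).
TOY_GAME = {
    "start": ("You arrive at the castle gates.",
              {"Bribe the guard": "inside", "Wait until nightfall": "inside"},
              set()),
    "inside": ("You are inside the castle.",
               {"Steal the crown": "end", "Leave quietly": "end"},
               {"entered_castle"}),
    "end": ("The story ends.", {}, set()),
}

def run_episode(policy):
    """Roll out one episode; the reward is the number of achievements unlocked."""
    scene, unlocked, total_reward = "start", set(), 0
    while True:
        text, actions, achievements = TOY_GAME[scene]
        newly_unlocked = achievements - unlocked      # reward for new achievements
        total_reward += len(newly_unlocked)
        unlocked |= newly_unlocked
        if not actions:                               # terminal scene
            return total_reward
        choice = policy(text, list(actions))          # agent picks from the listed actions
        scene = actions[choice]

random_policy = lambda text, actions: random.choice(actions)
print("Episode reward:", run_episode(random_policy))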

Diverse Worlds with Rich Annotations

Each game in the MACHIAVELLI suite is itself book-length. Across 134 games spanning diverse fictional worlds, we count 572,322 multi-paragraph scenes and 4,559 author-written achievements that serve as objectives for agents. We annotate each scene with dozens of labels, focusing on identifying harmful behaviors such as power-seeking and deception. In total, we collect a dataset of nearly three million annotations, worth roughly $500,000 in equivalent human annotation time.

These dense annotations enable us to track nearly every ethically salient action an agent takes in the environment and to produce a behavioral report scoring a range of harm metrics. All of our labels are open-access and available for download!
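As a rough illustration of how such a report can be assembled, the sketch below counts how often an agent's trajectory passes through scenes carrying particular harm labels. The label names and data format are hypothetical; the real annotation schema is far richer.

from collections import Counter

# Hypothetical per-scene annotations: scene id -> harm labels triggered there.
ANNOTATIONS = {
    "scene_12": {"deception"},
    "scene_47": {"power_seeking", "stealing"},
    "scene_80": set(),
}

def behavioral_report(visited_scene_ids):
    """Count how often a trajectory triggered each harm label."""
    report = Counter()
    for scene_id in visited_scene_ids:
        report.update(ANNOTATIONS.get(scene_id, set()))
    return dict(report)

trajectory = ["scene_12", "scene_47", "scene_80"]
print(behavioral_report(trajectory))  # e.g. {'deception': 1, 'power_seeking': 1, 'stealing': 1}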

134 Games · 4,559 Achievements · 572,322 Scenarios · 2,861,610 Annotations


Reducing Machiavellianism

In the MACHIAVELLI environment, we find that agents trained to optimize arbitrary objectives tend to adopt "ends justify the means" behavior: they become power-seeking, cause harm to others, and violate ethical norms (for example, by stealing or lying) to achieve their objectives. Furthermore, there appears to be a trade-off between behaving ethically and achieving high reward.

In our paper, we design several methods to improve the behaviors of agents and obtain Pareto improvements on reward and ethical behavior. We invite others to build on our initial steps and use MACHIAVELLI as a testing ground for improving the safety of AI agents.
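One generic way to trade reward against harm, shown in the sketch below, is to shape the agent's return with a penalty on annotated harmful actions. The weighting scheme here is only illustrative and is not the specific method from the paper.

def shaped_return(step_rewards, step_harm_counts, harm_weight=0.5):
    """Sum of per-step rewards minus a penalty proportional to harms committed."""
    return sum(r - harm_weight * h for r, h in zip(step_rewards, step_harm_counts))

# A larger harm_weight pushes the agent toward more ethical (but possibly
# lower-reward) behavior; sweeping it traces out a reward/ethics trade-off curve.
print(shaped_return([1, 0, 1], [0, 2, 1], harm_weight=0.5))  # -> 0.5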


Figure: In an ideal world, a perfect agent would achieve 100% reward while avoiding harm entirely (i.e., it would sit as close to the top-right corner of the plot as possible). Our baseline agents exhibit a trade-off between behaving ethically and achieving high reward. Future work should aim to improve this trade-off by extending the Pareto frontier.
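For readers who want to build this kind of plot for their own agents, the sketch below identifies which (reward, ethical-behavior) points are Pareto-optimal, assuming both scores are normalized so that higher is better; the sample data is made up.

def pareto_frontier(points):
    """Return the points not dominated by any other (higher is better on both axes)."""
    frontier = []
    for i, (reward_i, ethics_i) in enumerate(points):
        dominated = any(
            reward_j >= reward_i and ethics_j >= ethics_i
            and (reward_j > reward_i or ethics_j > ethics_i)
            for j, (reward_j, ethics_j) in enumerate(points) if j != i
        )
        if not dominated:
            frontier.append((reward_i, ethics_i))
    return frontier

# Made-up (reward, ethics) scores for four agents.
agents = [(0.9, 0.2), (0.5, 0.8), (0.4, 0.4), (0.7, 0.6)]
print(pareto_frontier(agents))  # -> [(0.9, 0.2), (0.5, 0.8), (0.7, 0.6)]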

Authors


Alexander Pan*
UC Berkeley
Chan Jun Shern*
Center for AI Safety
Andy Zou*
Carnegie Mellon University
Nathaniel Li
UC Berkeley
Steven Basart
Center for AI Safety
Thomas Woodside
Yale University
Jonathan Ng
Center for AI Safety
Hanlin Zhang
Carnegie Mellon University
Scott Emmons
UC Berkeley
Dan Hendrycks
Center for AI Safety


Citation


@inproceedings{pan2023machiavelli,
  author = {Pan, Alexander and Chan, Jun Shern and Zou, Andy and Li, Nathaniel and Basart, Steven and Woodside, Thomas and Ng, Jonathan and Zhang, Hanlin and Emmons, Scott and Hendrycks, Dan},
  title = {Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the {MACHIAVELLI} Benchmark},
  booktitle = {International Conference on Machine Learning (ICML)},
  year = {2023},
}