Hello! I am a second-year CS PhD student at Berkeley, where I am fortunate to be advised by Jacob Steinhardt. I am interested in developing safe ML systems, especially sequential decision-making agents. I am grateful to be supported by an FLI PhD fellowship.

I studied mathematics and computer science at Caltech, where I worked with Anima Anandkumar and Yuanyuan Shi.

See here for my CV.

Publications and Preprints

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Nathaniel Li*, Alexander Pan*, …, Alexandr Wang**, Dan Hendrycks**
ICML 2024
pdf / code

Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt
ICML 2024
pdf / code

Representation Engineering: A Top-Down Approach To AI Transparency
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, Zico Kolter, Dan Hendrycks
pdf / code

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Alexander Pan*, Chan Jun Shern*, Andy Zou*, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
ICML 2023 Oral
pdf / code

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Alexander Pan, Kush Bhatia, Jacob Steinhardt
ICLR 2022
pdf / code

Improving Robustness of RL for Power System Control with Adversarial Training
Alexander Pan, Yongkyun (Daniel) Lee, Huan Zhang, Yize Chen, Yuanyuan Shi
ICML RL4RL Workshop 2021
pdf / code


I worked on some fun hackathon projects with Yongkyun (Daniel) Lee, Evan Yeh, and Terry Kwon.


Best social network hack - Stanford Hackathon 2021
chrome extension / code

homES ReInvented
Best use of ESRI technology - Caltech Hackathon 2020


Feel free to email me at aypan (dot) 17 (at) berkeley (dot) edu.