About

Hello! I work on safety at xAI, where I lead a team post-training Grok on misuse, alignment, and robustness. I received my PhD in computer science from UC Berkeley, where I was advised by Jacob Steinhardt and supported by an FLI PhD fellowship.

Before that, I studied mathematics and computer science at Caltech, where I worked with Anima Anandkumar and Yuanyuan Shi.

My CV is available here.

Publications and Preprints

LatentQA: Teaching LLMs to Decode Activations Into Natural Language
ICLR 2026
Alexander Pan, Lijie Chen, Jacob Steinhardt
pdf / code

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Nathaniel Li*, Alexander Pan*, …, Alexandr Wang**, Dan Hendrycks**
ICML 2024
pdf / code

Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt
ICML 2024
pdf / code

Representation Engineering: A Top-Down Approach To AI Transparency
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, Zico Kolter, Dan Hendrycks
pdf / code

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Alexander Pan*, Chan Jun Shern*, Andy Zou*, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
ICML 2023 Oral
pdf / code

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Alexander Pan, Kush Bhatia, Jacob Steinhardt
ICLR 2022
pdf / code

Improving Robustness of RL for Power System Control with Adversarial Training
Alexander Pan, Yongkyun (Daniel) Lee, Huan Zhang, Yize Chen, Yuanyuan Shi
ICML RL4RL Workshop 2021
pdf / code

Contact

Feel free to email me at aypan (dot) 17 (at) berkeley (dot) edu.