Publications
My publication list on Google Scholar
Alignment of large language models
Since I joined OpenAI in January 2021, my focus has been on training large language models to follow human intent.
- Prover-Verifier Games improve legibility of LLM outputs
Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, Yuri Burda. 2024.
- LLM Critics Help Catch LLM Bugs
Nat McAleese, Rai Michael Pokorny, Juan Felipe Cerón Uribe, Evgenia Nitishinskaya, Maja Trebacz, Jan Leike. 2024.
- Scaling and evaluating sparse autoencoders
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. 2024.
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeffrey Wu. International Conference on Machine Learning, 2024.
- Language models can explain neurons in language models
Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeffrey Wu, William Saunders. 2023.
- Self-critiquing models for assisting human evaluators
William Saunders, Catherine Yeh, Jeffrey Wu, Steven Bills, Long Ouyang, Jonathan Ward, Jan Leike. 2022.
- Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe. Neural Information Processing Systems, 2022.
- Recursively summarizing books with human feedback
Jeffrey Wu, Long Ouyang, Daniel M Ziegler, Nissan Stiennon, Ryan Lowe, Jan Leike, Paul Christiano. 2021.
Reinforcement learning from human feedback
During my time at DeepMind (2016-2020), I prototyped methods for learning reward functions for deep reinforcement learning.
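The core idea behind several of the papers below is to fit a reward model to human preferences. As a rough sketch of the recipe from the 2017 deep RL from human preferences paper (notation mine): a reward model $\hat r$ predicts which of two trajectory segments $\sigma^1, \sigma^2$ a human will prefer via

$$\hat P[\sigma^1 \succ \sigma^2] \;=\; \frac{\exp \sum_t \hat r(o^1_t, a^1_t)}{\exp \sum_t \hat r(o^1_t, a^1_t) + \exp \sum_t \hat r(o^2_t, a^2_t)},$$

and is trained to minimize the cross-entropy between these predictions and the human comparison labels; a deep RL agent then maximizes the learned reward $\hat r$.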
- Safe Deep RL in 3D Environments using Human Feedback
Matthew Rahtz, Vikrant Varma, Ramana Kumar, Zachary Kenton, Shane Legg, and Jan Leike. 2022.
- Quantifying Differences in Reward Functions
Adam Gleave, Michael Dennis, Shane Legg, Stuart Russell, and Jan Leike.
International Conference on Learning Representations (Spotlight), 2021.
- Pitfalls of Learning a Reward Function Online
Stuart Armstrong, Jan Leike, Laurent Orseau, and Shane Legg.
International Joint Conference on Artificial Intelligence, 2021.
- Learning Human Objectives by Evaluating Hypothetical Behavior
Siddharth Reddy, Anca D Dragan, Sergey Levine, Shane Legg, and Jan Leike.
International Conference on Machine Learning, 2020.
Blog post.
- Learning to understand goal specifications by modelling reward
Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Pushmeet Kohli, and Edward Grefenstette.
International Conference on Learning Representations, 2019.
- Scalable agent alignment via reward modeling: a research direction
Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. 2018.
Blog post.
Video.
- Reward learning from human preferences and demonstrations in Atari
Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei.
Neural Information Processing Systems, 2018.
- AI Safety Gridworlds
Jan Leike, Miljan Martic, Victoria Krakovna, Pedro Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. 2017.
Blog post.
Video.
- Deep Reinforcement Learning from Human Preferences
Paul F Christiano, Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, and Dario Amodei.
Neural Information Processing Systems, 2017.
Blog post.
Video.
Theoretical reinforcement learning
I wrote my PhD thesis on general reinforcement learning,
that is, reinforcement learning in non-ergodic, partially observable environments.
My most interesting results are that
Bayesian reinforcement learning agents can misbehave drastically if given a bad prior,
that Thompson sampling asymptotically learns to act optimally in any environment from its hypothesis class, and
that an open problem in game theory, the grain of truth problem, has a formal solution.
If this interests you,
take a look at my short introduction to general reinforcement learning.
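To make the Thompson sampling result a bit more concrete, here is an informal statement (my notation, omitting the technical conditions on the discount function; see the UAI 2016 paper below for the precise claim): if the true environment $\mu$ lies in the agent's countable hypothesis class, then the Thompson sampling policy $\pi_{\mathrm{TS}}$, which periodically resamples an environment from the Bayesian posterior and acts optimally for the sample, satisfies

$$\mathbb{E}\left[\, V^{*}_{\mu}(h_{<t}) \;-\; V^{\pi_{\mathrm{TS}}}_{\mu}(h_{<t}) \,\right] \;\longrightarrow\; 0 \quad \text{as } t \to \infty,$$

i.e. its value converges to the optimal value in mean.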
- Nonparametric General Reinforcement Learning.
Jan Leike.
PhD Thesis, 2016.
- Exploration Potential.
Jan Leike.
European Workshop on Reinforcement Learning, 2016.
- A Formal Solution to the Grain of Truth Problem.
Jan Leike, Jessica Taylor, and Benya Fallenstein.
Uncertainty in Artificial Intelligence, 2016.
- Thompson Sampling is Asymptotically Optimal in General Environments.
Jan Leike, Tor Lattimore, Laurent Orseau, and Marcus Hutter.
Uncertainty in Artificial Intelligence, 2016.
Best student paper award.
- Loss Bounds and Time Complexity for Speed Priors.
Daniel Filan, Jan Leike, and Marcus Hutter.
AI & Statistics, 2016.
- On the Computability of Solomonoff Induction and Knowledge-Seeking.
Jan Leike and Marcus Hutter.
Algorithmic Learning Theory, 2015.
- Solomonoff Induction Violates Nicod’s Criterion.
Jan Leike and Marcus Hutter.
Algorithmic Learning Theory, 2015.
- Sequential Extensions of Causal and Evidential Decision Theory.
Tom Everitt, Jan Leike, and Marcus Hutter.
Algorithmic Decision Theory, 2015.
Source code for the examples.
- On the Computability of AIXI.
Jan Leike and Marcus Hutter.
Uncertainty in Artificial Intelligence, 2015.
- Bad Universal Priors and Notions of Optimality.
Jan Leike and Marcus Hutter.
Conference on Learning Theory, 2015.
- A Definition of Happiness for Reinforcement Learning Agents.
Mayank Daswani and Jan Leike.
Artificial General Intelligence, 2015.
- Indefinitely Oscillating Martingales.
Jan Leike and Marcus Hutter.
Algorithmic Learning Theory, 2014.
Software verification
During my Master's degree at the University of Freiburg,
I developed the termination analysis tool
Ultimate LassoRanker
together with Matthias Heizmann.
This tool can automatically prove termination and nontermination properties of C programs.
It won two first places and two second places in the termination category of SV-COMP between 2015 and 2018.
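To give a flavor of the inputs, here is a minimal, hypothetical example of a lasso program, a straight-line stem followed by a single loop, whose termination a LassoRanker-style analysis can certify by synthesizing a linear ranking function:

```c
/*
 * Hypothetical lasso program: a stem that initializes the variables,
 * followed by one loop. The linear ranking function f(x) = x proves
 * termination: the loop guard bounds f from below, and f strictly
 * decreases on every iteration because y >= 1 is a loop invariant.
 */
int main(void) {
    int x = 100;     /* stem: initialize the loop variables */
    int y = 1;
    while (x >= 0) { /* guard: bounds the ranking function from below */
        x = x - y;   /* x decreases by at least 1 per iteration */
    }
    return 0;
}
```

Conversely, if the stem set y = 0, the loop would never terminate; certifying such cases is what the nontermination arguments in the first paper below address.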
The following papers are mostly related to that work.
- Geometric Nontermination Arguments.
Jan Leike and Matthias Heizmann.
Tools and Algorithms for the Construction and Analysis of Systems, 2018.
- Ranking Templates for Linear Loops.
Jan Leike and Matthias Heizmann.
Logical Methods in Computer Science, 2015.
- Geometric Series as Nontermination Arguments for Linear Lasso Programs.
Jan Leike and Matthias Heizmann.
International Workshop on Termination, 2014.
- Ranking Templates for Linear Loops.
Jan Leike and Matthias Heizmann.
Tools and Algorithms for the Construction and Analysis of Systems, 2014.
- Synthesis for Polynomial Lasso Programs.
Jan Leike and Ashish Tiwari.
Verification, Model Checking, and Abstract Interpretation, 2014.
Source code for the experiments.
- Linear Ranking for Linear Lasso Programs.
Matthias Heizmann, Jochen Hoenicke, Jan Leike, and Andreas Podelski.
Automated Technology for Verification and Analysis, 2013.
- Ranking Function Synthesis for Linear Lasso Programs.
Jan Leike.
Master's Thesis, University of Freiburg, 2013.
Unfortunately, my papers sometimes contain technical errors; you can find them in my list of errata.
If you find a mistake not listed there,
please let me know!