Modeling Human Beliefs about AI Behavior for Scalable Oversight

We explain how modeling human evaluator beliefs about AI behavior can help to better interpret their feedback.

Leon Lang, Patrick Forré

Factored space models: Towards causality between levels of abstraction

We develop a new foundation for a theory of causality, based on factored space models.

Scott Garrabrant, Matthias Georg Mayer, Magdalena Wache, Leon Lang, Sam Eisenstat, Holger Dell

The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

We theoretically analyze to what extent an error in a learned reward function translates into regret of the resulting policies.

Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, Joar Skalse

Abstract Markov Random Fields

We use the recently generalized Hu Theorem to develop a theory of purely abstract Markov random fields.

Leon Lang, Clélia de Mulatier, Rick Quax, Patrick Forré
