Computer Science - Artificial Intelligence

Modeling Human Beliefs about AI Behavior for Scalable Oversight

We explain how modeling human evaluator beliefs about AI behavior can help to better interpret their feedback.

Leon Lang, Patrick Forré

The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

We theoretically analyze to what extent an error in a learned reward function translates into regret of the resulting policies.

Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, Joar Skalse

When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback

We theoretically and empirically study safety issues of using RLHF with human evaluators who have limited information.

Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios

We analyze, in textual scenarios, whether language models exhibit instrumental reasoning to avoid shutdown.

Teun van der Weij, Simon Lermen, Leon Lang

Last updated on Jul 3, 2023