Leon Lang
Computer Science - Artificial Intelligence
Modeling Human Beliefs about AI Behavior for Scalable Oversight
We explain how modeling human evaluator beliefs about AI behavior can help to better interpret their feedback.
Leon Lang, Patrick Forré
arXiv
The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret
We theoretically analyze to what extent an error in a learned reward function translates into regret of the resulting policies.
Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, Joar Skalse
arXiv
When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
We theoretically and empirically study safety issues of using RLHF with human evaluators who have limited information.
Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons
arXiv
Reviews
Video
Blogpost
Podcast
TAIS 2024
Poster
Evaluating Shutdown Avoidance of Language Models in Textual Scenarios
We analyze, in textual scenarios, whether language models show the instrumental reasoning to avoid shutdown.
Teun van der Weij, Simon Lermen, Leon Lang
Last updated on Jul 3, 2023
arXiv