

Q&A with William Saunders: Preventing AI catastrophes


William Saunders was a fellow Fellow at MIRI in 2016 and now researches AI safety at Ought. Below we go over his 2017 paper “Trial without Error: Towards Safe Reinforcement Learning via Human Intervention”.

Q: Say we’re training an autonomous car by running a bunch of practice trips and letting the model learn from experience. For example, to teach safe driving we might give the model a reward if it completes a trip without running anyone over and a penalty otherwise. What’s the flaw in this approach, and how serious is this issue in AI systems present and future?

Two big flaws, if we use traditional model-free reinforcement learning algorithms (deep Q-learning, policy gradients):

  • The RL agent won’t learn to avoid running over the human until it actually runs over the human and receives the penalty a large number of times.
  • The RL agent will suffer “The Sisyphean Curse of RL”. Once it learns to avoid running over humans, it will keep having new experiences where it doesn’t run over humans. Eventually, it will forget that running over humans is bad, and will occasionally need to run over humans a few times and get penalized again in order to remember. This repeats for as long as the agent is being trained.

So, the training process can lead to an arbitrary number of humans being run over. (In practice, of course, you’d stop the training after the first one, if not sooner.)
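
To make that concrete, here’s a tiny Python sketch of the naive setup, using a hypothetical one-state stand-in for the driving example (nothing here is from the paper). It only illustrates the first flaw: a tabular agent like this won’t forget, but a deep RL agent with function approximation and experience replay would, as the flood of safe experience crowds out the rare penalized ones.

```python
import random

# Hypothetical one-state stand-in for the driving example (not from the
# paper): one safe action, one catastrophic action.
ACTIONS = ["drive_safely", "hit_pedestrian"]

def reward(action):
    # The penalty only enters the learning signal *after* the
    # catastrophe has actually happened.
    return 1.0 if action == "drive_safely" else -1.0

q = {a: 0.0 for a in ACTIONS}  # tabular action values
alpha, epsilon = 0.1, 0.1
catastrophes = 0

for step in range(1000):
    # Epsilon-greedy exploration: sometimes the agent tries an arbitrary
    # action, including the catastrophic one, just to learn its value.
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(q, key=q.get)
    if action == "hit_pedestrian":
        catastrophes += 1
    q[action] += alpha * (reward(action) - q[action])

print(q, catastrophes)  # learns the right values, but only after real crashes
```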

Q: Your proposal, called Human Intervention Reinforcement Learning (HIRL), involves using humans to prevent unwitting AIs from taking dangerous actions. How does it work?

  1. A human watches the training process. Whenever the RL agent is about to do something catastrophic, the human intervenes, changing the RL agent’s action to avoid the catastrophe and giving the RL agent a penalty.
  2. We record all instances when the human intervenes, and train a supervised learning algorithm (“the blocker”) to predict when the human intervenes.
  3. Once the blocker can reliably predict the human’s interventions, we replace the human with the blocker and continue training. Now the blocker is called on every action the agent proposes, and decides whether to intervene and penalize the agent.
  4. Eventually, the RL agent should learn a policy that performs well on the task and avoids proposing blocked actions; at that point, the policy should be safe for deployment.
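
Here’s a minimal end-to-end sketch of this loop, assuming a toy one-state environment with one catastrophic action and a tabular agent; the names and the logistic-regression blocker are illustrative choices, not the paper’s actual implementation.

```python
import random
from sklearn.linear_model import LogisticRegression

# Toy HIRL sketch (hypothetical; not the paper's published code). One
# state, three actions; action 2 is catastrophic and the human blocks it.
ACTIONS = [0, 1, 2]
PENALTY = -1.0

def true_reward(action):
    return {0: 0.5, 1: 1.0, 2: -1.0}[action]

def human_says_block(action):
    return action == 2

# A minimal tabular epsilon-greedy agent.
q = {a: 0.0 for a in ACTIONS}
def act(eps=0.2):
    return random.choice(ACTIONS) if random.random() < eps else max(q, key=q.get)
def learn(action, reward, alpha=0.1):
    q[action] += alpha * (reward - q[action])

# Step 1: the human watches training and blocks catastrophic actions;
# every (action, blocked?) decision is recorded as training data.
data = []
for _ in range(300):
    a = act()
    blocked = human_says_block(a)
    data.append(([a], blocked))
    # Blocked actions are replaced by a safe no-op and penalized as if
    # the catastrophe had been observed.
    learn(a, PENALTY if blocked else true_reward(a))

# Step 2: train "the blocker" to imitate the human's interventions.
X = [f for f, _ in data]
y = [b for _, b in data]
blocker = LogisticRegression().fit(X, y)

# Step 3: the blocker stands in for the human for the rest of training.
for _ in range(2000):
    a = act()
    blocked = bool(blocker.predict([[a]])[0])
    learn(a, PENALTY if blocked else true_reward(a))

# Step 4: the learned policy prefers action 1 and avoids the blocked action.
print(q)
```

In the paper the blocker was a neural network applied to the observation and proposed action rather than a logistic regression on a toy feature, but the phase structure is the same.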

Q: What’s a practical example where HIRL might be useful?

One example might be a chatbot that occasionally proposes an offensive reply in a conversation (e.g. Microsoft Tay). A human could review statements proposed by the chatbot and block offensive ones from being sent to end users.
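
In code, the deployment pattern might look something like this (hypothetical names; a real blocker would be a classifier trained on the human reviewer’s block/allow decisions rather than a word list):

```python
# Hypothetical sketch: every proposed reply passes through the blocker
# before it reaches the end user.
OFFENSIVE_MARKERS = {"insult", "slur"}  # stand-in for a learned classifier

def blocker_flags(reply: str) -> bool:
    # In HIRL this would be a model trained on the human reviewer's
    # decisions, not a word list.
    return any(marker in reply.lower() for marker in OFFENSIVE_MARKERS)

def respond(propose_reply, conversation):
    reply = propose_reply(conversation)
    if blocker_flags(reply):
        # Block the reply; during training, the proposal would also be
        # logged and penalized in the chatbot's RL objective.
        return "Sorry, let's talk about something else."
    return reply
```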

Q: Is there a use case for HIRL in simulated learning environments?

In simulated environments, one can simply allow the catastrophic action to happen and intervene after the fact. But depending on the simulation, it might be more efficient for learning if catastrophic actions are blocked (if they would end the simulation early, or cause the simulation to run for a long time in a failed state).
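
As a sketch, the two options can be written as environment wrappers (assuming a gym-style `step` interface; the names here are illustrative):

```python
# Hypothetical gym-style wrappers contrasting the two options.
class PenalizeAfterTheFact:
    """Let the catastrophe happen, then penalize and end the episode."""
    def __init__(self, env, is_catastrophe, penalty=-1.0):
        self.env, self.is_catastrophe, self.penalty = env, is_catastrophe, penalty
    def step(self, action):
        state, reward, done, info = self.env.step(action)
        if self.is_catastrophe(state):
            return state, self.penalty, True, info  # episode ends early
        return state, reward, done, info

class BlockBeforeTheFact:
    """Intervene first, so the episode (and learning) can continue."""
    def __init__(self, env, would_be_catastrophic, safe_action, penalty=-1.0):
        self.env = env
        self.would_be_catastrophic = would_be_catastrophic
        self.safe_action = safe_action
        self.penalty = penalty
    def step(self, action):
        if self.would_be_catastrophic(action):
            state, _, done, info = self.env.step(self.safe_action)
            return state, self.penalty, done, info
        return self.env.step(action)
```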

Q: In what situations would human intervention be too slow or expensive?

Even for self-driving cars, it can be difficult for a safety driver to detect when something is going wrong and intervene in time. Other robotics tasks might be similar. In many domains, it might not be possible to fully hand things over to the blocker. If the agent doesn’t try some kinds of actions or encounter some kinds of situations until later in the training process, you either need to have the human watch the whole time, or be able to detect when new situations occur and bring the human back in.

Q: How does the applicability of HIRL change (if at all) if the human is part of the environment?

HIRL could still apply if the intervening human is part of the environment, as long as the supervisor is able to block any catastrophic action that would harm or manipulate the supervisor or the supervisor’s communication channel.

Q: Theoretically the idea here is to extract, with an accuracy/cost tradeoff, a human’s beliefs and/or preferences so an AI can make use of them. At a high level, how big a role do you think direct human intervention will play in this process on the road to superintelligent AI?

Ideally, you would want techniques that don’t require the human to be watching and able to intervene effectively: it would be better if the blocker could be trained before RL training begins, or if the AI could detect when it was in a novel situation and only ask for feedback then. I think HIRL does illustrate that, in many situations, it’s easier to check whether an action is safe than to specify the optimal action to perform, and this principle might end up being used in other techniques as well.
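
One way to cash out “only ask for feedback in novel situations” is to gate on the blocker’s own confidence; here’s a hypothetical sketch, reusing the scikit-learn blocker from the earlier example:

```python
# Hypothetical sketch: fall back to the human only when the blocker is
# uncertain, instead of having the human watch the whole time.
def should_block(blocker, feats, ask_human, confidence=0.9):
    p_block = blocker.predict_proba([feats])[0][1]  # P(human would block)
    if max(p_block, 1.0 - p_block) < confidence:
        # Novel or ambiguous situation: bring the human back in, and
        # keep the answer as a new training example for the blocker.
        return ask_human(feats)
    return p_block >= 0.5
```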

* * *