
DPhil the Future: Why We Need Fallible Humans in AI Alignment


[Banner image: ‘DPhil the Future’ – Our students are 100% part of our success. DPhil the Future is our way of giving our students a platform to share their insights and views on all things computer science.]

DPhil student Tiffany Horter discusses why we still need fallible humans in the loop when designing AI systems.

How many times have you said to someone, ‘You know what I mean’ after you misspoke? We often assume that people will use common sense to interpret our statements — and we expect the same of the AI agents we’ll interact with in the future. Imagine coming home and directing a robot carrying a sack of heavy groceries to ‘Just toss it wherever,’ hoping to move quickly to your next task, only for the robot to take you at your word and huck it straight into the wall, breaking the groceries and making a mess. Technically, it did everything you said, but you probably wouldn’t be too happy.  Our research focuses on how to recover from mistaken or confusing human instructions. 

Why AI Alignment is Essential in Embodied AI 

As we transition to using LLMs (like ChatGPT) as the ‘brains’ for a robot body, bringing them into our physical space to act alongside us, the problem of alignment (making robot behaviour match human preferences) becomes more important than ever.  

To work with a robot partner, we need more than an agreed goal. We also need a shared understanding of the situation, the robot’s abilities, the human’s preferences, and how to achieve the goal. Without this, behaviour can become unpredictable or unsafe. 

How we tend to do AI alignment now 

One of the major ways we currently train and align machine learning models is RLHF – reinforcement learning from human feedback. This process begins by showing people different outputs to the same prompt and asking for feedback, such as ranking those outputs – ‘which one sounds better?’ Those rankings are then used to build a model that can predict human preferences.
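To give a flavour of that preference-modelling step, here is a minimal sketch that fits a simple Bradley-Terry-style reward model to toy pairwise rankings. It is an illustration only, not production RLHF code: the features, data, and variable names are all made up for the example.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: each comparison is (features of output A, features of output B,
# label = 1 if the human preferred A, else 0).
n_pairs, n_features = 200, 5
true_w = rng.normal(size=n_features)           # hidden 'true' human preference
A = rng.normal(size=(n_pairs, n_features))
B = rng.normal(size=(n_pairs, n_features))
labels = (A @ true_w > B @ true_w).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit reward weights w so that sigmoid(r(A) - r(B)) matches the rankings.
w = np.zeros(n_features)
lr = 0.1
for _ in range(500):
    p = sigmoid((A - B) @ w)                   # predicted P(human prefers A)
    grad = (A - B).T @ (labels - p) / n_pairs  # gradient of the log-likelihood
    w += lr * grad

print("recovered preference direction:", np.round(w / np.linalg.norm(w), 2))
print("true preference direction:     ", np.round(true_w / np.linalg.norm(true_w), 2))

The key point is that the model never sees a ‘correct answer’, only which of two outputs a person preferred – which is exactly why errors and inconsistencies in those judgements matter so much.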

What are the problems with human feedback? 

There’s a fundamental problem with this method: we humans are not perfectly rational or consistent. We might say we want the blue mug when we meant the green one, or not notice that the milk has already spoiled.

We may mix up our words. We may lack knowledge about the world. We may leave out important information from our instructions. We may act in ways that are difficult to justify by our own values. We may overestimate our own abilities. Are robots free from these issues? By no means! They may lack understanding of what goal has been requested, not know elements of the world state, misjudge their own capabilities, or simply prioritise differently than you might.  

Why we still need humans 

At this point, it might seem we should give up on the idea of human oversight. However, we should keep humans in the loop, even when they are fallible. Drawing on my background in cognitive science, I argue that even mistaken human feedback provides a useful signal about what humans truly want, provided we take into account common human biases and propensities for error.

Overview of my research 

How do we deal with the fact that humans can make mistakes when giving feedback – whether because of a quick slip of the tongue or because we believed something about the world that was not precisely true? In my research, I focus on how to best use human feedback despite the possibility of errors, to align the behaviours of artificial intelligence and robots.  

One project we are working on aims to make a robot comply with the user’s intention rather than relying purely on their spoken words. If a human error is suspected, we use LLMs, which encode common-sense knowledge about what the most likely alternative is, and the robot can check with the user to determine what they actually wanted. For example:

  • If you say 'get me the tomato' but there are no tomatoes in the house, the robot could ask if you’d like a similar ingredient instead 
  • If you say 'put those dishes in the dishwasher' but some are clearly not dishwasher-safe, it could double-check before acting 

This approach lets us recover from instructions that are impossible, risky, or inconsistent with the user’s likely goals – all without ignoring the human’s feedback in the process.
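As a rough, hypothetical sketch of what such a clarification loop might look like, consider the code below. The inventory, the suggest_alternative stand-in for an LLM call, and the dialogue helper are all illustrative assumptions, not the project’s actual code.

INVENTORY = {"cherry tomatoes": 0, "red pepper": 2, "tinned tomatoes": 1}

def feasible(item: str) -> bool:
    """An instruction is feasible here only if the item is in stock."""
    return INVENTORY.get(item, 0) > 0

def suggest_alternative(item: str) -> str | None:
    """Stand-in for an LLM call that proposes the most likely intended
    alternative (e.g. a similar ingredient). Hard-coded for this sketch."""
    similar = {"tomato": "tinned tomatoes", "cherry tomatoes": "tinned tomatoes"}
    return similar.get(item)

def ask_user(question: str) -> bool:
    """Stand-in for a clarification dialogue with the human."""
    return input(question + " [y/n] ").strip().lower() == "y"

def handle_instruction(item: str) -> None:
    # Follow the literal instruction when it makes sense ...
    if feasible(item):
        print(f"Fetching {item}.")
        return
    # ... otherwise guess the likely intention and confirm before acting.
    alternative = suggest_alternative(item)
    if alternative and ask_user(f"No {item} available. Would {alternative} do instead?"):
        print(f"Fetching {alternative}.")
    else:
        print("Okay, I won't fetch anything.")

handle_instruction("tomato")

The important design choice is that the robot neither blindly executes an infeasible request nor silently substitutes its own guess: it proposes the likely intention and lets the human have the final say.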

Looking to the future 

To build a future where humans can work naturally with AI, we need to know how to deal with situations where humans make errors in well-intentioned requests. This is key to avoiding robots that become like genies unhelpfully granting wishes, or like Dionysus with King Midas – obeying the strict wording of the instruction while tragically disregarding the intention behind it.

Our goal is to get the most out of every bit of feedback, even when it’s imperfect. Discarding 'bad' feedback wastes opportunities to learn more about human preferences. Ignoring humans entirely creates systems that may act in ways we never intended. So we must design AI that can handle our occasional slips of the tongue and still deliver what we meant.