When multiple people are paying attention to the same object, we call this joint attention. It is a skill we humans use all the time, without realizing how important or difficult it is.
During the Master's in AI, in a group of three we created a model that learns to establish joint attention on an object by following the gaze of a person to that object. Specifically, we used a developmental approach, which takes its inspiration from the development of an infant, and implemented the model on a Nao robot.
The scenario we investigated is a person surrounded by a number of interesting objects, looking at only one of them. The goal is for the robot to establish joint attention by looking at the same object.
The robot therefore has to find the interesting object the person is looking at, which in this case is a green ball. The idea is that following the gaze of the person is the quickest and most consistent way of finding an interesting object. Instead of hard-coding this decision, the robot learns it using reinforcement learning (Q-learning).
To establish joint attention, the robot performs a number of steps. First, it detects the face and eyes of the person. Detecting eye contact and determining the direction of the gaze are then done using the MIT gaze-following model. Finally, learning to perform the correct action is done using Q-learning.
First, face detection was done using the Haar cascade face detector from OpenCV. We also detected the location of the eyes, as the gaze-following model needs an eye position as input.
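As a rough illustration of this step, here is a minimal sketch using OpenCV's Haar cascades; the exact cascade files and detection parameters we used may differ.

```python
import cv2

# Minimal sketch of face and eye detection with OpenCV Haar cascades
# (the cascades and parameters used in the actual project may differ).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_face_and_eyes(image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) == 0:
        return None, []
    x, y, w, h = faces[0]                          # take the first detected face
    face_roi = gray[y:y + h, x:x + w]
    eyes = eye_cascade.detectMultiScale(face_roi)
    # Return eye centres in image coordinates, as needed by the gaze model
    eye_centres = [(x + ex + ew // 2, y + ey + eh // 2)
                   for ex, ey, ew, eh in eyes]
    return (x, y, w, h), eye_centres
```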
To establish eye contact, the robot has to determine where the person is looking. To do this, we used the gaze following model from MIT, which can be found here.
Given an image and the coordinates of one of the person's eyes, the model generates a line from the eye to the object the person is looking at. If there is no object in the direct vicinity of the person, the model generates a line in the direction the person is looking.
The picture above is an example from the MIT website. Overall the model works quite well, although accuracy degrades when the person is looking at an object outside of the picture.
The model can also detect when the person is looking straight ahead, i.e. at the robot, which is what we use to detect eye contact in this step.
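To illustrate how we use the model's output, the sketch below treats it as a black box that maps an image and an eye coordinate to a predicted gaze point; `predict_gaze` is a hypothetical stand-in, not the model's actual interface.

```python
import numpy as np

# Illustrative wrapper only: `predict_gaze` stands in for the MIT gaze-following
# model, assumed to map an image and an eye coordinate to a predicted gaze point.
# The real model's interface differs; this only shows how its output is used.

def gaze_line(image, eye_xy, predict_gaze):
    gaze_xy = predict_gaze(image, eye_xy)          # predicted point being looked at
    eye = np.asarray(eye_xy, dtype=float)
    direction = np.asarray(gaze_xy, dtype=float) - eye
    norm = np.linalg.norm(direction)
    if norm == 0:
        return eye, None                           # degenerate case: no direction
    return eye, direction / norm                   # origin and unit direction of the line
```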
At some point, the person will break eye contact with the robot and look at an interesting object, which in our scenario is the green ball. Imagine the robot as a human baby: the baby had eye contact, but now the person looks away, and the baby becomes distracted. What does a child do when it gets distracted or bored? It looks for something interesting.
In our model the robot can perform three actions, which will hopefully lead to something interesting: look left, look right, or follow the gaze of the person.
Learning which of these actions to take, and in particular that following the gaze is the best one, was done using reinforcement learning, specifically Q-learning.
Reinforcement learning works with rewards: a good outcome earns a positive reward, and a bad outcome a smaller or negative one. In our scenario, the green ball is the ball the human is looking at, and finding it gives a large positive reward. A pink ball is an interesting but incorrect object; finding one also gives a positive reward, but a smaller one. Why not give a negative reward? The reason for that choice lies in the developmental approach. For a baby, any interesting object is a positive reward for its action; finding the correct ball, however, is rewarded extra by the caretaker or parent in some way or another.
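To make the learning step concrete, here is a minimal tabular Q-learning sketch with a reward structure in the spirit described above; the state and action encodings, reward magnitudes, and hyperparameters are illustrative, not the values from our report.

```python
import random

# Minimal tabular Q-learning sketch. States, actions, rewards and hyperparameters
# are illustrative placeholders, not the values used in the project.
ACTIONS = ["look_left", "look_right", "follow_gaze"]
REWARDS = {"green_ball": 10.0, "pink_ball": 2.0, "nothing": 0.0}  # example magnitudes

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

def choose_action(q_table, state):
    """Epsilon-greedy action selection over the current Q-values."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))

def update(q_table, state, action, reward, next_state):
    """Standard Q-learning update rule."""
    best_next = max(q_table.get((next_state, a), 0.0) for a in ACTIONS)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
```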
In addition to learning which direction to look in, the robot also learns whether it should pick a ball close to the (gaze) line it is following, pick a ball further from it, or ignore the ball and continue looking.
The idea here is that if the robot chooses to follow the gaze line, it might encounter other interesting objects in that direction that are nonetheless incorrect. These incorrect balls most likely do not lie exactly on the gaze line, but further away from it. As such, the robot again chooses between three actions: pick the close ball, pick the far ball, or ignore the ball and continue looking (the close/far distinction is sketched geometrically below).
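As a rough sketch of how such a choice can be grounded geometrically (our actual representation and thresholds may differ), the perpendicular distance of a ball to the gaze line can be used to label it as close or far:

```python
import numpy as np

# Illustrative sketch: classify a ball as close to or far from the gaze line using
# the perpendicular point-to-line distance. The threshold is an example value only.

def distance_to_gaze_line(ball_xy, line_origin, line_direction):
    """Perpendicular distance from a point to the gaze line (unit direction assumed)."""
    to_ball = np.asarray(ball_xy, float) - np.asarray(line_origin, float)
    along = np.dot(to_ball, line_direction) * np.asarray(line_direction, float)
    return np.linalg.norm(to_ball - along)

def classify_ball(ball_xy, line_origin, line_direction, close_threshold=30.0):
    d = distance_to_gaze_line(ball_xy, line_origin, line_direction)
    return "close" if d < close_threshold else "far"
```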
During the training phase, the robot learned that following the gaze of the person was the most consistent way of getting to an interesting object. Furthermore, it learned to ignore balls that did not lie on the gaze line and to only pick balls close to it.
Below is a video of the training process of this model. It took only 25 trials for the model to learn the correct actions in a cluttered environment, where the locations of the balls were randomized every few trials.
The Q-values of the "look left", "look right" and "follow gaze" actions are shown in the upper-right plot. The Q-values of the "pick close ball", "pick far ball" and "ignore the ball and continue looking" actions are shown in the bottom-right plot.
For further details see the report and code, linked at the bottom of this page.
For a robot to deal with a changing environment, it needs to be able to learn and adapt. The developmental approach used here achieves two things:
From a developmental viewpoint, our approach is plausible when compared to human infants. Human infants also have an inborn tendency to look at faces and establish eye contact, and they are naturally drawn to interesting objects.
For further details, see the report which can be found here, and the code which can be found on GitHub.