Atari game

Algorithm uses RL to break high score records on Atari games

An artificial intelligence system that remembers its previous successes and uses reinforcement learning (RL) has achieved record high scores on some of the hardest Atari games, thanks to a team of researchers.

UBC Computer Science Researcher Jeff Clune and his co-authors Adrien Ecoffet, Joost Huizinga, Joel Lehman, and Ken Stanley had their results published in Nature on February 24, 2021.

Associate Professor Jeff Clune, UBC Computer Science

What is reinforcement learning?

RL is an area of machine learning concerned with how intelligent agents ought to take actions in an environment to maximize cumulative reward. The algorithm is given continual positive or negative feedback as it progresses toward its goal. This technique was used to train AlphaGo, which beat a world champion Go player in 2016.
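The reward-maximization loop described above can be sketched with tabular Q-learning, one of the simplest RL algorithms. This is an illustrative toy (a five-cell corridor environment invented for this example, not anything from the paper): the agent receives a small penalty each step and a reward for reaching the goal, and learns from that feedback alone.

```python
import random

# Toy environment: a 1-D corridor of 5 cells. The agent starts at cell 0;
# reaching cell 4 ends the episode with +1 reward, every step costs -0.01.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # move left / move right

def step(state, action):
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else -0.01
    return next_state, reward, next_state == GOAL

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-values per (state, action)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit, occasionally explore.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max(range(2), key=lambda i: q[s][i])
            s2, r, done = step(s, ACTIONS[a])
            # Update toward reward plus discounted value of the next state.
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train()
# The learned greedy policy should be "move right" (action index 1) everywhere.
policy = [max(range(2), key=lambda i: q[s][i]) for s in range(GOAL)]
print(policy)  # [1, 1, 1, 1]
```

The per-step penalty plus the terminal reward is exactly the "continual positive or negative feedback" the article mentions; no one ever tells the agent the policy directly.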

Jeff Clune, who joined UBC Computer Science in January this year, explains the methodology. "RL struggles with problems requiring lots of exploration before a reward is received. A traditional approach is called 'intrinsic motivation', where the agent is simply rewarded for discovering new areas," Jeff says. "A problem with that idea is that the agent can 'forget' about areas that still need exploring. We call that 'detachment'. Another problem is that agents can struggle to return to previously interesting states, especially since we often have the agent perform random actions from time to time in the hope it will discover something new. Say you have a game with many hazards that can instantly kill you. Taking random actions in that situation can prevent you from reaching the area you want to explore from. We call this 'derailment'."

First return, then explore

Jeff and his co-authors describe a method that solves the detachment and derailment problems by separating returning from exploring, so that random actions are only ever taken at appropriate times. The agent first returns (without taking any random actions) to an interesting place it previously discovered, then explores from that point. Hence the name of the paper: 'First return, then explore'.
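The return-then-explore idea can be sketched in a few dozen lines. This is a minimal toy illustration, not the authors' implementation: the environment is a deterministic 5x5 grid invented for this example, and the archive simply maps each discovered state to the shortest action sequence known to reach it.

```python
import random

# Deterministic toy world: a 5x5 grid. The agent starts at (0, 0) and the
# hard-to-reach cell sits at (4, 4). Actions move one cell, clamped to the grid.
SIZE, START, GOAL = 5, (0, 0), (4, 4)
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def apply(pos, move):
    return (min(max(pos[0] + move[0], 0), SIZE - 1),
            min(max(pos[1] + move[1], 0), SIZE - 1))

def go_explore(iterations=1000, explore_len=5, seed=0):
    rng = random.Random(seed)
    # Archive: state -> shortest action sequence known to reach it.
    archive = {START: []}
    while iterations > 0 and GOAL not in archive:
        iterations -= 1
        # 1) Select a previously discovered state and *return* to it by
        #    replaying its stored trajectory deterministically -- no random
        #    actions during the return, which is what avoids derailment.
        state, trajectory = rng.choice(list(archive.items()))
        pos = START
        for move in trajectory:
            pos = apply(pos, move)
        # 2) *Then explore*: take a few random actions from that state,
        #    archiving every new state reached. Because the archive keeps
        #    all frontiers, nothing is "forgotten" -- avoiding detachment.
        path = list(trajectory)
        for _ in range(explore_len):
            move = rng.choice(MOVES)
            pos = apply(pos, move)
            path.append(move)
            if pos not in archive or len(path) < len(archive[pos]):
                archive[pos] = list(path)
    return archive

archive = go_explore()
print(GOAL in archive)  # True once the far corner has been discovered
```

The real system maps raw game frames to a coarse "cell" representation and returns via emulator save-states or a learned policy, but the two-phase structure is the same.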

Atari games don’t normally allow players to revisit any point in time, but the researchers used software that mimicked the Atari system – with the added ability to save states and reload them at any time – so the algorithm didn’t have to play the game from the start each time. In their experiment, the algorithm beat state-of-the-art algorithms 85.5 per cent of the time. The researchers then showed that the algorithm also works without the ability to restore a previous point in time, although it takes longer to solve the game that way.
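The save-state mechanic amounts to snapshotting the emulator's internal state and restoring it later, instead of replaying every action from the start. A minimal sketch, using a made-up toy "emulator" class rather than a real Atari emulator:

```python
import copy

# A minimal stateful "emulator": tracks position and frame count in a 1-D game.
class ToyEmulator:
    def __init__(self):
        self.pos = 0
        self.frames = 0

    def step(self, action):  # action is -1 or +1
        self.pos += action
        self.frames += 1

    # Save-state support, analogous to the modified Atari setup:
    def save(self):
        return copy.deepcopy(self.__dict__)

    def load(self, snapshot):
        self.__dict__ = copy.deepcopy(snapshot)

env = ToyEmulator()
for a in [1, 1, 1]:
    env.step(a)
snap = env.save()           # remember this point in the game

env.step(-1)                # keep playing past it...
env.load(snap)              # ...then jump straight back
print(env.pos, env.frames)  # 3 3 -- no replay from the start needed
```

Restoring a snapshot is constant-time regardless of how deep into the game it was taken, which is why removing this ability (as in the second set of experiments) makes solving the game slower.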

Montezuma’s Revenge is a particularly complex game, but the algorithm scored higher than the previous record for reinforcement learning software and also beat the human world record. Once the algorithm had reached a sufficiently high score, the researchers then used the solution to train a neural network to replicate the same strategy to play the game.

Robotics, disaster searches and self-driving cars

The work can translate into many applications, including home or industrial robotics, drug design and also potentially searching disaster zones. “The great thing about reinforcement learning,” says Jeff, “is that AI figures out how to solve complex problems all on its own, including many real-world problems we need solutions for.”

In addition to robotics, Go-Explore has already seen some experimental research in language learning, where an agent learns the meaning of words by exploring a text-based game, and in discovering potential failures in the behaviour of a self-driving car so that those failures can be avoided in the future.

YouTube video explaining First Return, Then Explore

Journal reference: Nature, DOI: 10.1038/s41586-020-03157-9
