AlphaDev: Learning by Doing, Using Reinforcement Learning to Treat Programming as a Game
Reinforcement learning is the same technique DeepMind used to master chess and Go. This kind of software learns by doing. AlphaDev treats programming as a game in which the artificial intelligence earns rewards for moves that make a program more efficient. The system works to maximize this reward over time, which leads to a stronger Go strategy in one setting and a quicker assembly program in another. This differs from the sort of AI found in large language models like GPT-4, which rely on huge amounts of data to learn how to write words or code. That approach is great for producing writing that mirrors the tone of the internet or for producing common segments of code. But it’s not so good at producing novel, state-of-the-art solutions to coding challenges the AI has never seen before.
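To make the reward idea concrete, here is a minimal sketch in Python. It is not AlphaDev’s method. The three-instruction toy language, the reward formula and the penalty value are all assumptions made for illustration, and a brute-force search stands in for the learning agent. The point is only to show how a single score can favour programs that are both correct and short.

```python
from itertools import permutations, product

# Hypothetical toy instruction set: each instruction compares two
# positions and swaps them if they are out of order.
INSTRUCTIONS = [(0, 1), (0, 2), (1, 2)]


def run(program, values):
    """Execute a sequence of compare-and-swap steps on a copy of values."""
    data = list(values)
    for i, j in program:
        if data[i] > data[j]:
            data[i], data[j] = data[j], data[i]
    return data


def reward(program, length_penalty=0.1):
    """Assumed reward: fraction of inputs sorted, minus a cost per instruction."""
    inputs = list(permutations([1, 2, 3]))
    correct = sum(run(program, p) == [1, 2, 3] for p in inputs)
    return correct / len(inputs) - length_penalty * len(program)


def best_program(max_len=4):
    """Brute-force stand-in for the agent: try every program, keep the best."""
    candidates = (
        prog
        for n in range(max_len + 1)
        for prog in product(INSTRUCTIONS, repeat=n)
    )
    return max(candidates, key=reward)


if __name__ == "__main__":
    print(best_program())
```

Under these assumptions, a program that sorts every input but wastes an instruction scores lower than an equally correct, shorter one. That is the pressure that pushes a reward-driven system toward faster code. A real agent would learn which instructions to try instead of enumerating them all.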
“I think this work is incredibly exciting,” says Armando Solar-Lezama, an expert in program synthesis at MIT, who wasn’t involved in the research. It is useful, he says, to have an artificial intelligence come up with a new sorting algorithm and then go on to learn how to write state-of-the-art code on top of that. In his view, AlphaDev is learning something new about the art of coding.
To see where AlphaDev eked out its gains, the team took a closer look at its algorithms. For sorting, they found two new tactics, which they called the AlphaDev swap move and the AlphaDev copy move. Mankowitz compared them to an unusual move AlphaGo made in 2016 against human Go champion Lee Sedol at an exhibition match in Korea. “It’s something that, in hindsight, was actually fundamental to winning the game and influenced how we thought about strategies,” he says.
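As a rough illustration of the kind of saving involved, the AlphaDev paper describes the swap move as dropping a comparison that an earlier step has already made redundant. The sketch below assumes the three-element case, in which an earlier comparison guarantees B ≤ C, so the smallest of A, B and C can be found by looking at A and B alone. The function names are hypothetical, and the code checks the logic only. It does not reproduce the assembly that AlphaDev produced.

```python
from itertools import product


def first_output_original(a, b, c):
    """Smallest of three values, computed without using any prior knowledge."""
    return min(a, b, c)


def first_output_shortcut(a, b, c):
    """Shortcut that ignores c; only valid when b <= c is already guaranteed."""
    return min(a, b)


def shortcut_is_safe(values=range(4)):
    """Check every small input with b <= c; both versions must agree."""
    return all(
        first_output_original(a, b, c) == first_output_shortcut(a, b, c)
        for a, b, c in product(values, repeat=3)
        if b <= c
    )
```

The shortcut is only safe because of the guarantee. On an input such as (2, 3, 1), where B is larger than C, the two functions disagree.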
AlphaZero, the system AlphaDev builds on, has been around for six years, and the science is not very deep, according to Michael Littman, a computer scientist at Brown University in Providence, Rhode Island. “But the engineering is phenomenal,” he says. He adds that DeepMind’s researchers are good at fitting the method to new problems. Last year, DeepMind also modified AlphaZero to create AlphaTensor, which invented faster ways to multiply grids of numbers, known as matrices.