For the past few months I’ve been studying social learning in multi-agent reinforcement learning as part of the Spring 2020 OpenAI Scholars program. This is the last post in a series I’ve written during the program, and in it I’ll discuss some experiments I’ve been conducting to study social learning by independent RL agents. The first post in this series has lots of context about why I’m interested in multi-agent reinforcement learning. Before continuing I wanted to express my tremendous gratitude to OpenAI for organizing the Scholars program, and to my mentor Natasha Jaques for her incredible support and encouragement.
I’ve been using Proximal Policy Optimization (PPO, Schulman et al. 2017) to train agents to accomplish gridworld tasks. The neural net architectures I’ve been using include LSTM layers, which give the agents the capacity to remember details from earlier in an episode when choosing actions later in the episode. This capacity is particularly important in the partially observed environments that are ubiquitous in multi-agent reinforcement learning (MARL).
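To make that concrete, here’s a minimal sketch (in PyTorch, with placeholder layer sizes and observation shapes rather than my actual hyperparameters) of the kind of recurrent actor-critic architecture I mean: the LSTM’s hidden state is what lets the policy carry information forward within an episode.

```python
import torch
import torch.nn as nn

class RecurrentActorCritic(nn.Module):
    """Sketch of an actor-critic with an LSTM layer so the policy can carry
    information across timesteps (placeholder sizes, not my exact model)."""

    def __init__(self, obs_dim, num_actions, hidden_size=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_size), nn.ReLU())
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.policy_head = nn.Linear(hidden_size, num_actions)  # action logits
        self.value_head = nn.Linear(hidden_size, 1)              # state-value estimate

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (batch, time, obs_dim); hidden_state carries memory between calls
        features = self.encoder(obs_seq)
        features, hidden_state = self.lstm(features, hidden_state)
        return self.policy_head(features), self.value_head(features), hidden_state
```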
I’ve been thinking about ways to construct challenging gridworld scenarios in which the behavior of expert agents might provide cues that ease learning for novice agents. I’ve focused on tasks in which navigation is central, since an agent’s movement can always, in principle, be visible to other agents. Tasks like this often resemble random maze navigation: an agent spawns in an environment with a random layout and receives a reward after navigating to a certain (perhaps randomly placed) goal tile. The episode ends either after the agent reaches the goal, or after a fixed time limit (with the agent respawning each time it reaches the goal).
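As a rough illustration of the fixed-time-limit variant, here’s what the interaction loop looks like against a hypothetical gym-style maze environment; the environment name, time limit, and random-action policy are all placeholders, not a real registered environment or my actual training loop.

```python
import gym

# "RandomMaze-v0" is a placeholder name for a hypothetical goal-reaching
# gridworld, not a real registered environment.
env = gym.make("RandomMaze-v0")
obs = env.reset()                        # random layout, randomly placed goal
episode_return = 0.0
for t in range(256):                     # fixed time limit
    action = env.action_space.sample()   # stand-in for the learned policy
    obs, reward, done, info = env.step(action)
    episode_return += reward
    if done:                             # the agent reached the goal
        obs = env.reset()                # respawn and keep collecting reward
```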
Multi-agent reinforcement learning (MARL) is pretty tricky. Beyond all the challenges of single-agent RL, interactions between learning agents introduce stochasticity and nonstationarity that can make tasks harder for each agent to master. At a high level, many interesting questions in MARL can be framed in ways that focus on the properties of the interactions between abstract agents and are ostensibly agnostic to the underlying RL algorithms, e.g. “In what circumstances do agents learn to communicate?”. For such questions, the structure of the environment, reward function, and so on usually matters more than the choice of underlying algorithm.
Q-learning is a classic and well-studied reinforcement learning (RL) algorithm. Adding neural network Q-functions led to the milestone Deep Q-Network (DQN) algorithm (Mnih et al. 2013), which went on to surpass human-level performance on a suite of Atari games. DQN is attractive due to its simplicity, but the most successful DQN-based algorithms tend to rely on many tweaks and improvements to achieve stability and good performance.
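For reference, the heart of DQN is a temporal-difference loss against a slowly updated target network. Here’s a hedged PyTorch sketch of just that loss; the replay buffer, exploration schedule, target-network syncing, and the other stability tweaks mentioned above are omitted, and the batch layout is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Core DQN temporal-difference loss (sketch only; assumes `batch` is a
    tuple of tensors: observations, actions, rewards, next observations, dones)."""
    obs, actions, rewards, next_obs, dones = batch
    # Q(s, a) for the actions actually taken
    q_values = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bootstrapped target: r + gamma * max_a' Q_target(s', a'), zeroed at episode ends
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    return F.smooth_l1_loss(q_values, targets)
```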
RL agents whose policies use only feedforward neural networks have a limited capacity to accomplish tasks in partially observable environments. For such tasks, an agent may need to account for past observations or previous actions to implement a successful strategy.
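The practical consequence is that a recurrent policy has to thread its hidden state through the rollout, while a feedforward policy only ever sees the current observation. A toy sketch of that bookkeeping (placeholder shapes and random observations, purely for illustration):

```python
import torch
import torch.nn as nn

# A recurrent policy threads its hidden state through the rollout so that
# earlier observations can influence later actions; a feedforward policy
# has no such state. Sizes and observations here are placeholders.
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
head = nn.Linear(32, 4)                      # 4 discrete actions, for illustration

hidden = None                                # reset memory at the start of each episode
for t in range(10):
    obs = torch.randn(1, 1, 16)              # stand-in for the current observation
    features, hidden = lstm(obs, hidden)     # hidden state accumulates past context
    action = head(features).argmax(dim=-1)   # greedy action, for illustration
```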
Gridworlds are popular environments for RL experiments. Agents in gridworlds can move between adjacent tiles in a rectangular grid, and are typically trained to earn rewards by solving simple puzzles in the grid. MiniGrid is a popular and flexible gridworld implementation that has been used in more than 20 publications.
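Getting started with MiniGrid is straightforward. The sketch below uses the gym-minigrid package roughly as it existed around 2020; defaults such as the egocentric view size and the exact action set may differ across versions.

```python
import gym
import gym_minigrid  # importing this registers the MiniGrid environments with gym

env = gym.make("MiniGrid-Empty-8x8-v0")
obs = env.reset()
# Observations are a dict containing a partial, egocentric view of the grid
print(obs["image"].shape)   # e.g. (7, 7, 3) with the default view size
print(env.action_space)     # small discrete action set: turn, move forward, interact, ...
obs, reward, done, info = env.step(env.action_space.sample())
```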
I’m excited to be participating in the 2020 cohort of the OpenAI Scholars program. With the mentorship of Natasha Jaques, I’ll be spending the next few months studying multi-agent reinforcement learning (MARL) and periodically writing blog posts to document my progress. In this first post, I’ll discuss the reasons I’m excited about MARL and my plan for the Scholars program.
I spent some time recently exploring reinforcement learning in the excellent MineRL Minecraft environments. I haven’t played much Minecraft, and I haven’t personally accomplished the holy-grail objective of mining a diamond. The prospect of building a bot that can learn to accomplish a task that I haven’t completed – one that is as human-accessible as this – is incredibly exciting!