After more than a year of hard work, we have finally uploaded an article that builds on the great work done by the people at Google DeepMind. We extend the framework proposed by DeepMind to allow multiple AI agents to coexist in the same environment. Using the Pong game environment, we ask: what happens if we pit two AI agents against each other or, instead, try to make them collaborate?
In Deep Q-learning (and Deep Reinforcement Learning generally), the agents have no prior knowledge of the task and essentially learn to play by trial and error. To do that, they keep track of the game situations, their own actions and the rewards they receive. Based on this information, they have to learn to avoid negative rewards and maximize the chance of receiving positive ones.
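The trial-and-error loop described above can be illustrated with a small sketch. Note that this is a simplified tabular version of Q-learning, not the deep network used in the article: the value estimates live in a lookup table instead of a neural network, and the names `make_q_table`, `choose_action` and `q_update`, as well as the values of `alpha`, `gamma` and `epsilon`, are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def make_q_table(n_actions):
    """Map each state to a list of action-value estimates, all starting at 0."""
    return defaultdict(lambda: [0.0] * n_actions)


def choose_action(q, state, n_actions, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = q[state]
    return max(range(n_actions), key=values.__getitem__)


def q_update(q, state, action, reward, next_state,
             alpha=0.1, gamma=0.99, terminal=False):
    """One Q-learning step: move Q(state, action) toward the observed target."""
    if terminal:
        target = reward  # no future reward after the point ends
    else:
        target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])


if __name__ == "__main__":
    rng = random.Random(0)
    q = make_q_table(n_actions=3)
    # The agent missed the ball after taking action 1 in state "s": reward -1.
    q_update(q, "s", 1, -1.0, "s_next", alpha=0.5, terminal=True)
    print(q["s"])
    print(choose_action(q, "s", 3, epsilon=0.0, rng=rng))
```

After a single punished action, the estimate for that action in that state drops below the others, so a greedy agent will prefer a different action the next time it sees the same situation. Deep Q-learning follows the same idea but replaces the table with a network that generalizes across similar game screens.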
In the competitive setting, the agents get rewarded each time the opponent misses the ball and get punished if they themselves miss the ball (a “zero-sum game” in game theory). Before training, the randomly initialized agents never even manage to hit the ball, but after playing many games against each other they become good at parrying shots and sending fast balls back towards the opponent. The behaviour is best described by the accompanying video. In the article we also give quantitative measures to characterize the emerging strategy.
In the collaborative setting, both agents get punished when either one of them misses the ball. There are no positive rewards in this type of game. The goal is thus to avoid losing the ball. Trained agents are clever enough to figure out that never serving the ball is the easiest way to avoid losing it. When we force them to still serve the ball, they learn a strategy that lets them keep the ball in play for a very long time (essentially indefinitely). This behaviour is well illustrated by the accompanying video. Quantitative measures of behaviour are also provided in the article.
The only difference between the above training paradigms is the reward the agents get for putting the ball past their opponent: in the competitive case they receive a positive reward of +1, and in the collaborative case a negative reward of -1. While comparing the two strategies is already interesting in itself, we go further and explore what happens if this “reward for scoring” is given different values between -1 and +1. That is to say, we investigate the transition between competition and collaboration.
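As a rough sketch of this reward scheme: the player who misses the ball always gets -1, and the other player gets the “reward for scoring”, called `rho` here. The function name `pong_rewards`, the parameter name `rho` and the `"left"`/`"right"` labels are assumptions made for this illustration and are not taken from the article's code.

```python
def pong_rewards(missed_by, rho):
    """Return (left_reward, right_reward) for one lost ball.

    missed_by: "left" or "right", the player who let the ball through.
    rho: reward for the player who scored, between -1 and +1.
         rho = +1 gives the competitive (zero-sum) game,
         rho = -1 gives the collaborative game.
    """
    if missed_by not in ("left", "right"):
        raise ValueError("missed_by must be 'left' or 'right'")
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, +1]")
    if missed_by == "left":
        return (-1.0, rho)
    return (rho, -1.0)


if __name__ == "__main__":
    # Sweep the reward for scoring from collaborative to competitive.
    for rho in (-1.0, -0.5, 0.0, 0.5, 1.0):
        print(rho, pong_rewards("left", rho))
```

With `rho = +1` the two rewards always sum to zero, and with `rho = -1` both players are punished equally, which matches the two settings described above. Intermediate values give a single knob for moving between them.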
This is only our first contribution to the fields of multi-agent learning and reinforcement learning. There are many ideas left unexplored in the current introductory study (see Future Work in the article) and we hope to follow up on them in the near future.