Hello, Habr!

As our regular readers know, we have long and successfully published books on Unity. While studying the topic, we became particularly interested in the ML-Agents Toolkit. Today we bring you a translation of an article from the Unity blog on how to train game agents effectively using self-play; in particular, the article helps explain why this method is more effective than traditional reinforcement learning.


Enjoy reading!

This post gives an overview of self-play and demonstrates how it enables stable and effective training in the Soccer demo environment from the ML-Agents Toolkit.

In the Tennis and Soccer demo environments from the Unity ML-Agents Toolkit, agents are pitted against each other as adversaries. Training agents in such a competitive scenario can be quite challenging. In fact, in previous releases of the ML-Agents Toolkit, training agents reliably in these environments required significant reward engineering. Version 0.14 added the ability to train agents with reinforcement learning (RL) from self-play, a mechanism fundamental to some of the most high-profile results in RL, such as OpenAI Five and DeepMind's AlphaStar. Self-play pits the current and past versions of an agent against each other. This gives our agent an adversary that gradually improves using traditional reinforcement learning algorithms. A fully trained agent can successfully compete with advanced human players.

Self-play provides a learning environment built on the same principles that people use to structure competition. For example, a person learning to play tennis will choose sparring partners at about the same level, since an opponent who is too strong or too weak is not conducive to learning the game. From the point of view of developing skills, it is far more valuable for a beginner tennis player to compete against other beginners than against, say, a preschool child or Novak Djokovic. The former could not even return the ball, and the latter would never serve a ball the beginner could return. Once the beginner has become strong enough, they can move on to the next level or enter a more serious tournament to play against more skilled opponents.

In this article, we look at some technical details of the dynamics of self-play and give an overview of the Tennis and Soccer example environments, which have been refactored to showcase self-play.

The history of self-play in games

The idea of self-play has a long history in the practice of building artificial agents designed to compete with people in games. One of the first to use this mechanism was Arthur Samuel, who developed a checkers-playing program in the 1950s and published this work in 1959. This system was the forerunner of a landmark result in reinforcement learning, Gerald Tesauro's TD-Gammon, published in 1995. TD-Gammon used the temporal-difference learning algorithm TD(λ) with self-play to train a backgammon agent that could compete with human experts. In some cases, TD-Gammon was observed to have a better understanding of positions than world-class players.

Self-play has been instrumental in many landmark results in RL. Notably, it enabled the training of superhuman chess and Go agents, elite DOTA 2 agents, and complex strategies and counter-strategies in games like wrestling and hide-and-seek. In results achieved with self-play, researchers often note that the agents discover strategies that surprise human experts.

Self-play gives agents a certain creativity that is independent of the creativity of the programmers. The agent is given only the rules of the game and is then told whether it won or lost. From these first principles, the agent must discover competent behavior on its own. According to the creator of TD-Gammon, such an approach to learning is liberating, "... in the sense that the program is not constrained by human inclinations and prejudices, which may turn out to be erroneous and unreliable." Thanks to this freedom, agents discover brilliant strategies that completely change the way engineers think about some games.

Reinforcement learning in competitive games

In a traditional reinforcement learning problem, the agent tries to learn a behavior policy that maximizes the cumulative reward. The reward signal encodes the agent's task, for example navigating to a goal or collecting items. The agent's behavior is subject to the constraints of the environment, such as gravity, obstacles, and the relative influence of the agent's own actions, for example applying force to move itself. These factors limit the agent's behavior and are the external forces it must learn to handle in order to receive a high reward. In other words, the agent contends with the dynamics of the environment and must move from state to state in a way that achieves the maximum reward.

Figure: On the left, a typical reinforcement learning scenario: the agent acts in the environment, transitions to the next state, and receives a reward. On the right, a scenario in which the agent competes with an opponent that, from the agent's point of view, is effectively an element of the environment.

In competitive games, the agent contends not only with the dynamics of the environment but also with another (possibly intelligent) agent. We can think of the opponent as embedded in the environment: its actions directly affect the next state the agent observes as well as the reward the agent receives.
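To make the idea of an opponent "embedded in the environment" concrete, here is a minimal sketch. It assumes a hypothetical two-player toy game (`MatchingGame`) and wrapper (`OpponentInEnvironment`); neither is part of the ML-Agents Toolkit, and all names are invented for illustration. The wrapper exposes an ordinary single-agent interface, while the opponent acts inside `step()`.

```python
from typing import Callable, Tuple

Policy = Callable[[int], int]  # maps an observation to an action (0 or 1)


class MatchingGame:
    """Hypothetical two-player, zero-sum toy game (not part of ML-Agents).

    Each round both players pick 0 or 1. The agent earns +1 if the picks
    match and -1 otherwise. The observation is simply the round index.
    """

    def __init__(self, rounds: int = 5) -> None:
        self.rounds = rounds
        self.round = 0

    def reset(self) -> int:
        self.round = 0
        return self.round

    def step(self, agent_action: int, opponent_action: int) -> Tuple[int, float, bool]:
        self.round += 1
        reward = 1.0 if agent_action == opponent_action else -1.0
        return self.round, reward, self.round >= self.rounds


class OpponentInEnvironment:
    """Single-agent view of the game: the opponent acts inside step()."""

    def __init__(self, game: MatchingGame, opponent: Policy) -> None:
        self.game = game
        self.opponent = opponent
        self._obs = 0

    def reset(self) -> int:
        self._obs = self.game.reset()
        return self._obs

    def step(self, action: int) -> Tuple[int, float, bool]:
        # The learner never sees this call; to it, the opponent's choice
        # is just part of how the environment responds.
        opponent_action = self.opponent(self._obs)
        self._obs, reward, done = self.game.step(action, opponent_action)
        return self._obs, reward, done


def play_episode(env: OpponentInEnvironment, policy: Policy) -> float:
    """Run one episode and return the agent's total reward."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total


def always_one(obs: int) -> int:
    return 1


def always_zero(obs: int) -> int:
    return 0


# Same agent policy, different embedded opponents, opposite returns.
print(play_episode(OpponentInEnvironment(MatchingGame(), always_one), always_one))   # 5.0
print(play_episode(OpponentInEnvironment(MatchingGame(), always_zero), always_one))  # -5.0
```

The agent's policy is identical in both runs, yet the return flips sign, which is exactly why the choice of opponent matters so much for learning.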

Figure: The Tennis example environment from the ML-Agents Toolkit.

Consider the ML-Agents Tennis demo. The blue racket (left) is the learning agent, and the purple racket (right) is its opponent. To hit the ball over the net, the agent must take into account the trajectory of the incoming ball and adjust the angle and speed of its shot to account for environmental conditions (gravity). However, when there is an opponent, getting the ball over the net is only half the battle. A strong opponent may return a winning shot, and the agent will lose. A weak opponent may hit the ball into the net. An equal opponent may return the ball, and the game will continue. In every case, both the next state and the corresponding reward depend on both the environment and the opponent, even though the agent hit the same shot in all three situations. This is what makes learning in competitive games, and training competitive agent behaviors, a difficult problem.

Choosing a suitable opponent is not trivial. As is clear from the above, the relative strength of the opponent significantly affects the outcome of a particular game. If the opponent is too strong, the agent may find it difficult to learn to play from scratch. On the other hand, if the opponent is too weak, the agent may learn to win, but the skills it learns may be useless against a stronger or simply different opponent. Therefore, we need an opponent roughly equal in strength to the agent (challenging, but not insurmountable). In addition, since our agent's skills improve with each game played, the strength of its opponent must increase to the same extent.

Figure: In self-play, the opponent embedded in the environment is either a past snapshot of the agent or the agent in its current state.

This is where self-play comes to the rescue! The agent itself satisfies both requirements for a suitable opponent. It is certainly roughly equal in strength to itself, and its skills improve over time. In this case, the agent's own policy is embedded in the environment (see the figure). For those familiar with curriculum learning, this can be viewed as a naturally evolving curriculum in which the agent learns to compete against opponents of increasing strength. In this way, self-play lets us bootstrap an environment to train competitive agents for competitive games!

The next two sections cover more technical details of training competitive agents, in particular the implementation and use of self-play in the ML-Agents Toolkit.

Practical Considerations

The self-play framework raises some practical problems. The first is overfitting, in which the agent learns to defeat only one particular style of play. The second is instability in the training process caused by the non-stationarity of the transition function (that is, by the constantly changing opponent). The first is a problem because we want our agents to be general competitors, robust to opponents of different types. The second can be illustrated in the Tennis environment: different opponents return the ball at different speeds and at different angles. From the point of view of the learning agent, this means that as training progresses, the same decisions lead to different next states. Traditional reinforcement learning algorithms assume a stationary transition function. Unfortunately, by supplying the agent with a diverse set of opponents to solve the first problem, we may aggravate the second if we are not careful.

To cope with this, we maintain a buffer of the agent's past policies, from which we sample opponents that the learner then plays against for an extended period. Sampling from the agent's past policies gives the learner a diverse set of opponents. Moreover, letting the agent train against a fixed opponent for an extended period stabilizes the transition function and creates a more consistent learning environment. Finally, these algorithmic aspects can be controlled with the hyperparameters discussed in the next section.
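The buffer of past policies described above can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: the class name `SnapshotPool` and its methods are assumptions, and a "policy" here can be any object (in practice it would be a copy of the network weights).

```python
import random
from collections import deque
from typing import Deque, Generic, Optional, TypeVar

P = TypeVar("P")  # whatever represents a frozen policy, e.g. a copy of weights


class SnapshotPool(Generic[P]):
    """Hypothetical fixed-size buffer of past policies (illustrative only)."""

    def __init__(self, window: int, seed: Optional[int] = None) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        # deque(maxlen=...) silently drops the oldest snapshot when full.
        self._snapshots: Deque[P] = deque(maxlen=window)
        self._rng = random.Random(seed)

    def save(self, policy: P) -> None:
        """Store a frozen copy of the learner's current policy."""
        self._snapshots.append(policy)

    def sample(self) -> P:
        """Pick a past policy uniformly at random to serve as the opponent."""
        if not self._snapshots:
            raise LookupError("no snapshots saved yet")
        return self._rng.choice(list(self._snapshots))

    def __len__(self) -> int:
        return len(self._snapshots)


pool: SnapshotPool[str] = SnapshotPool(window=3, seed=0)
for version in ["v1", "v2", "v3", "v4", "v5"]:
    pool.save(version)

print(len(pool))      # 3: only the three most recent snapshots remain
print(pool.sample())  # one of "v3", "v4", "v5"
```

A larger window gives more diverse opponents, while keeping each sampled opponent fixed for many steps is what stabilizes the transition function.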

Implementation and usage details

When choosing hyperparameters for self-play, the main consideration is the trade-off between the skill level and generality of the final policy on the one hand and the stability of training on the other. Training against a set of opponents that change slowly or not at all, and therefore show less variety, is a more stable process than training against many diverse opponents that change quickly. The available hyperparameters control how often the agent's current policy is saved for later use as a sampled opponent, how often a new opponent is sampled, the number of opponents saved, and the probability that the learner plays against its current self rather than against an opponent sampled from the pool.
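To show how these four knobs interact, here is a small simulation sketch. The field names in `SelfPlaySettings` are hypothetical and chosen for readability; they are not claimed to match the toolkit's configuration keys. Snapshots are identified simply by the training step at which they were saved.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple, Union


@dataclass
class SelfPlaySettings:
    """Hypothetical names for the four knobs described above."""
    save_every: int           # steps between snapshots of the learner
    swap_every: int           # steps between opponent changes
    pool_size: int            # number of past snapshots kept
    current_self_prob: float  # chance of facing the current policy


def opponent_schedule(
    settings: SelfPlaySettings, total_steps: int, seed: int = 0
) -> List[Tuple[int, Union[int, str]]]:
    """Return (step, opponent) pairs; an opponent is a snapshot step or "current"."""
    rng = random.Random(seed)
    pool: Deque[int] = deque(maxlen=settings.pool_size)
    schedule: List[Tuple[int, Union[int, str]]] = []
    for step in range(1, total_steps + 1):
        if step % settings.save_every == 0:
            pool.append(step)  # a snapshot is identified by its save step
        if step % settings.swap_every == 0:
            if not pool or rng.random() < settings.current_self_prob:
                schedule.append((step, "current"))
            else:
                schedule.append((step, rng.choice(list(pool))))
    return schedule


# Slow-changing setup: few swaps, small pool, never the current self.
slow = SelfPlaySettings(save_every=10, swap_every=20, pool_size=2, current_self_prob=0.0)
for step, opponent in opponent_schedule(slow, total_steps=60):
    print(step, opponent)
```

Raising `swap_every` or shrinking `pool_size` in this sketch corresponds to the more stable, less diverse end of the trade-off; lowering `swap_every` or growing the pool moves toward diversity at the cost of stability.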

In competitive games, the cumulative reward returned by the environment may not be the most informative metric for tracking learning progress, because it depends entirely on the skill of the opponent. An agent at a given skill level will receive more or less reward depending on whether it faces a weaker or a stronger opponent, respectively. We provide an implementation of the Elo rating system, which calculates the relative skill of two players from a given population in a zero-sum game. During a single training run, this value should steadily increase. You can track it in TensorBoard along with other training metrics, such as cumulative reward.
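For readers unfamiliar with Elo, the standard update can be sketched in a few lines. This is the textbook formula, not necessarily the exact code used in the toolkit; the K-factor of 16 and the starting rating of 1200 are assumptions chosen for illustration.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 16.0
) -> Tuple[float, float]:
    """Return updated ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    The K-factor (k) is an assumed value controlling the step size.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated players; A wins.
print(update_elo(1200.0, 1200.0, score_a=1.0))  # (1208.0, 1192.0)
```

Because the update depends on the expected score, beating a much weaker opponent barely moves the rating, which is what makes Elo more informative than raw cumulative reward in this setting.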

Self-play in Soccer


Recent releases of the ML-Agents Toolkit did not include agent policies for the Soccer learning environment, because the agents could not be trained reliably. However, with self-play and some refactoring, we can now train non-trivial agent behaviors. The most significant change is the removal of "player positions" from the agents. Previously, the Soccer environment had an explicit "goalkeeper" and "striker", which made the gameplay look more reasonable. The video of the new environment shows role-like behavior emerging spontaneously, with some agents starting to act as strikers and others as goalkeepers. Now the agents learn to play these positions on their own! The reward function for all four agents is defined as +1.0 for a goal scored and -1.0 for a goal conceded, with an additional penalty of -0.0003 per step to encourage agents to attack.
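The reward described above is simple enough to write out directly. The sketch below uses the three values from the text; the function names are hypothetical, and it assumes that the per-step penalty is applied on every step (including the one on which a goal is scored) and that the episode ends at the first goal.

```python
GOAL_REWARD = 1.0      # +1.0 for scoring, -1.0 for conceding (from the text)
STEP_PENALTY = 0.0003  # subtracted on every step (from the text)


def step_reward(scored: bool, conceded: bool) -> float:
    """Reward for a single step, as described above."""
    reward = -STEP_PENALTY
    if scored:
        reward += GOAL_REWARD
    if conceded:
        reward -= GOAL_REWARD
    return reward


def episode_return(steps_until_goal: int, scored: bool) -> float:
    """Total reward for an episode that ends with one goal (an assumption)."""
    total = 0.0
    for step in range(1, steps_until_goal + 1):
        is_last = step == steps_until_goal
        total += step_reward(scored=is_last and scored,
                             conceded=is_last and not scored)
    return total


# Scoring sooner is worth more than scoring later.
print(round(episode_return(100, scored=True), 4))   # 0.97
print(round(episode_return(1000, scored=True), 4))  # 0.7
```

The penalty is small relative to the goal reward, so it never outweighs winning, but it does make a quick goal strictly better than a slow one.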

We emphasize once again that the agents in the Soccer environment learn cooperative behavior on their own, without any explicit multi-agent algorithm or role assignment. This result demonstrates that complex agent behaviors can be trained with relatively simple algorithms, provided the task is well formulated. The key condition is that agents can observe their teammates, that is, they receive information about the teammate's relative position. By making an aggressive play for the ball, an agent implicitly tells its teammate to drop back on defense. Conversely, by dropping back on defense, the agent signals to its teammate that it can move forward to attack.

What's next

If you have used any of the new features in this release, tell us about your experience. Bugs can be reported on the ML-Agents GitHub issues page, and general questions and issues are discussed on the Unity ML-Agents forums.