Good day! We are Tatyana Voronova and Elvira Dyaminova, and we do data analysis at Center 2M. In particular, we train neural network models to detect objects in images: people, special equipment, animals.

At the start of each project, the company agrees with the customer on an acceptable recognition quality. This quality level must not only be reached by the time the project is delivered, but also maintained during the subsequent operation of the system, which means the system has to be constantly monitored and retrained. We wanted to reduce the cost of this process and get rid of the routine, freeing up time to work on new projects.

Automatic retraining is not a unique idea; many companies have similar internal pipeline tools. In this article, we would like to share our experience and show that you do not have to be a huge corporation to successfully adopt such practices.

One of our projects is counting people in queues. Since the customer is a large company with many branches, people accumulate at certain scheduled hours, so a large number of objects (people's heads) is detected regularly. That is why we decided to introduce automatic retraining on this task first.

Here is what our plan looked like. All steps except the annotator's review are performed automatically:

  1. Once a month, all camera images from the last week are automatically selected.
  2. The image names are added to a shared xls sheet in SharePoint, with the default status for each image file: "Not Viewed."
  3. Using the latest (currently deployed) version of the model, markup is generated for the images: xml files with the markup (coordinates of the heads found) are added to the folder, and the total number of objects found by the model is automatically entered into the sheet. This number will be needed later to track the model's quality.
  4. Once a month, annotators review the marked-up files in the "Not Viewed" status. They correct the markup and enter the number of corrections into the xls sheet (the number of deleted labels and the number of added ones, separately). The status of the files reviewed by the annotator changes to "Viewed." This is how we see how much the quality of our model has degraded.

    In addition, we record the nature of the errors: whether the model usually marks extra objects (bags, chairs) or, conversely, misses some of the people (for example, because of medical masks). A graph of the model's quality metrics over time is displayed on a report dashboard.
  5. Once a month, a script checks the xls file for files in the "Viewed" status with a number of changes > 0. If this count is above a threshold, retraining of the model on the extended set (with the corrected markup added) is started. If a file was previously part of the training dataset, its old markup is replaced with the new one. The status of files taken for training changes to "Taken in training"; otherwise the same files would be enrolled in retraining again. Retraining starts from the checkpoint left over from the previous training run. In the future, we plan to trigger retraining not only on schedule, but also whenever the number of corrections that had to be made in the markup exceeds a threshold.
  6. If the number of files in the "Viewed" status is 0, a notification is sent: for some reason the annotator is not reviewing the markup.
  7. If, despite the retraining, accuracy keeps falling and the metrics drop below a threshold value, an alert is raised. This is a sign that the problem needs to be investigated in detail with analysts involved.
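The monthly decision logic of steps 5–7 can be sketched roughly as follows. This is a minimal illustration, assuming the SharePoint sheet has been exported to CSV; the column names, the threshold value, and the returned action strings are our assumptions for the example, not the real pipeline code:

```python
# Sketch of the monthly trigger check (steps 5-7), assuming a CSV export of
# the status sheet. Column names and threshold are illustrative assumptions.
import csv
from io import StringIO

RETRAIN_THRESHOLD = 20  # min. number of corrected files that triggers retraining

def monthly_check(csv_text):
    rows = list(csv.DictReader(StringIO(csv_text)))
    viewed = [r for r in rows if r["status"] == "Viewed"]
    if not viewed:
        # step 6: nobody has reviewed anything this month
        return "notify: annotator is not reviewing the markup"
    corrected = [r for r in viewed
                 if int(r["added"]) + int(r["deleted"]) > 0]
    if len(corrected) >= RETRAIN_THRESHOLD:
        # step 5: mark files as taken so they are not enrolled twice
        for r in corrected:
            r["status"] = "Taken in training"
        return f"retrain on {len(corrected)} corrected files"
    return "no action"
```

In the real pipeline the "retrain" branch would kick off training from the last checkpoint with the corrected markup merged into the dataset.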

As a result, this process has helped us a lot. We tracked an increase in errors of the second kind (false negatives), when many heads unexpectedly became "masked," enriched the training dataset with the new type of head in time, and upgraded the deployed model. This approach also lets us account for seasonality: we constantly adjust the dataset to the current situation. People often wear hats, or, conversely, almost everyone comes to the office without them; in autumn, the number of people in hoods increases. The system becomes more flexible and responds to the situation.

For example, the image below shows one of the branches (on a winter day) whose frames were not present in the training dataset:


If we calculate the metrics for this frame (TP=25, FN=3, FP=0), we get a recall of about 89%, a precision of 100%, and a harmonic mean of precision and recall of about 94.3% (more about the metrics a little below). A fairly good result for a new room.
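These numbers are easy to verify from the counts above; a small helper makes the arithmetic explicit:

```python
# Worked check of the metrics for the frame above (TP=25, FN=3, FP=0).
def precision_recall_f1(tp, fn, fp):
    precision = tp / (tp + fp)          # share of detections that are real heads
    recall = tp / (tp + fn)             # share of real heads that were detected
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=25, fn=3, fp=0)
print(f"precision={p:.1%} recall={r:.1%} F1={f1:.1%}")
# precision=100.0% recall=89.3% F1=94.3%
```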

Our training dataset already contained hats and hoods, so the model was not taken aback by them, but with the introduction of mandatory masks it began to make mistakes. In most cases, when the head is clearly visible, there are no problems. But if a person is far from the camera, at certain angles the model stopped detecting the head (the left image is the output of the old model). Thanks to the semi-automatic markup, we caught such cases and retrained the model in time (the right image is the output of the new model).


The same lady, closer up:


For testing, we selected frames that did not participate in training (a dataset with a varying number of people per frame, from different angles and at different scales), and used recall and precision to evaluate model quality.

Recall shows what proportion of the objects that really belong to the positive class we predicted correctly.

Precision shows what proportion of the objects we recognized as positive really belong to the positive class.

When the customer needed a single number combining precision and recall, we provided their harmonic mean, the F-measure.

After one cycle, we got the following results:


The recall of 80% before any changes is explained by the fact that a large number of new branches had been added to the system and new camera angles had appeared. In addition, the season had changed: until then, only "autumn-winter people" were present in the training dataset.

After the first cycle, recall rose to 96.7%. Compared with our first article, where recall reached 90%, the improvement is explained by the fact that there are now fewer people in the branches, they overlap each other much less (the bulky down jackets are gone), and the variety of hats has diminished.

For example, before, the norm was roughly the number of people shown in the image below.


And this is how it looks now.


To summarize, let’s name the advantages of automation:

  1. Partial automation of the markup process.
  2. Timely response to new situations (mass wearing of medical masks).
  3. Quick response to incorrect model answers (a bag starting to be detected as a head, and the like).
  4. Ongoing monitoring of model accuracy. When the metrics change for the worse, an analyst is brought in.
  5. Minimizing the analyst's labor costs when upgrading the model. Our analysts are fully engaged in various projects, so we would like to pull them away from their main project as little as possible to collect data and retrain a model for another one.

The downside is the human factor on the annotator's side: they may treat the markup carelessly, so markup with overlap, or the use of golden sets (tasks with a predetermined answer that serve only to control the quality of the markup), is necessary. In many more complex tasks, the analyst must check the markup personally; in such tasks the automatic mode will not work.
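A golden-set check of the kind mentioned above can be sketched as follows. This is an assumed, simplified scheme, not our production code: boxes are (x1, y1, x2, y2) tuples, a golden box counts as matched when some submitted box overlaps it with IoU ≥ 0.5, and the threshold itself is an illustrative choice:

```python
# Minimal golden-set check: score an annotator's boxes against reference boxes.
def iou(a, b):
    # intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def annotator_score(golden, submitted, iou_thr=0.5):
    """Fraction of golden boxes matched by some submitted box with IoU >= thr."""
    hits = sum(any(iou(g, s) >= iou_thr for s in submitted) for g in golden)
    return hits / len(golden)

golden = [(10, 10, 30, 30), (50, 50, 70, 70)]
submitted = [(12, 11, 31, 29), (200, 200, 220, 220)]  # one hit, one miss
print(annotator_score(golden, submitted))  # 0.5
```

An annotator whose score on golden tasks drops below an agreed level would then have their batches re-checked.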

Overall, the practice of automatic retraining has proven viable. Such automation can be considered an additional mechanism that helps maintain a good level of recognition quality during the further operation of the system.

Authors of the article: Tatyana Voronova (tvoronova), Elvira Dyaminova (elviraa).