Hello! My name is Alexander Afenov, and I work at Lamoda. This article is based on my talk at HighLoad 2019; the recording is available here.

I used to be a team lead in charge of a couple of critical services. If something went wrong in them, real business processes stopped: for example, orders stopped being sent to the warehouse for assembly.

Recently I became a lead of team leads and am now responsible for three teams instead of one. Each of them has its own IT system. I want to understand what is happening in every system and what might break.

In this article I will talk about:

  • what we monitor,
  • how we monitor it,
  • and most importantly, what we do with the results of these observations.


Lamoda has many systems. They all get releases, something is always changing in them and in the technology behind them. And I want to have at least the illusion that we can easily localize a breakdown. I am constantly bombarded with alerts that I try to make sense of. To get away from abstractions and move on to specifics, let me start with the first example.

From time to time, something explodes: chronicles of one fire


One warm summer morning, without declaring war, as usually happens, our monitoring fired. We use Icinga for alerting. The alert said that we had 50 GB of disk space left on the DBMS server. Most likely, 50 gigabytes is a drop in the bucket, and it would run out very quickly. We decided to check how much free space was really left. Keep in mind that these are not virtual machines but bare-metal servers, and the database is under heavy load. There is a 1.5 terabyte SSD. At first glance it seemed the space would last another 20-30 days. That is very little; the problem had to be solved quickly.

Then we additionally checked how much space had actually been consumed over the previous day or two. It turned out that 50 gigabytes would last about 5-7 days. After that, the service that works with this database would predictably die. We began thinking about the consequences: what to urgently archive and which data to delete. The data analytics department had all the backups, so we could safely drop everything older than 2015.

We tried deleting the data and remembered that MySQL does not give the space back that easily. Deleting rows is great, but the size of the files allocated for the tables and the database does not shrink; MySQL simply reuses that space later. In other words, the problem was not solved: there was still no more free space.

We tried a different approach: migrating tables from the nearly full SSDs to slower disks. To do this, we picked the tables that weigh a lot but are under little load, using Percona monitoring to find them. We moved those tables and then started thinking about moving the servers themselves. After the second move, the servers had 4 terabytes of SSD instead of the original 1.5.
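To pick candidates for such a move, the size part can be pulled straight from MySQL itself; here is a minimal sketch (the host and credentials are placeholders, and the load part still has to come from Percona monitoring):

<?php
// Sketch: list the largest tables (data + indexes) as candidates for
// migration to slower disks. Host and credentials are made up.
$pdo = new PDO('mysql:host=db-master;dbname=information_schema', 'monitor', 'secret');
$rows = $pdo->query(
    'SELECT table_schema, table_name,
            ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
     FROM tables
     ORDER BY (data_length + index_length) DESC
     LIMIT 20'
)->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    // Cross-check each candidate against its query load before moving it.
    printf("%s.%s\t%s GB\n", $row['table_schema'], $row['table_name'], $row['size_gb']);
}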

We put out this fire: we organized the move and, of course, fixed the monitoring. Now the warning triggers not at 50 gigabytes but at half a terabyte, and the critical alert triggers at 50 gigabytes. But in reality this only covers our rear for a while. It will be enough for some time, but if we let the situation repeat itself without splitting the database into parts and thinking about sharding, it will all end badly.
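In terms of the Icinga check config shown a bit later in this article, the fix boiled down to new thresholds; assuming the values are in gigabytes, roughly:

pm_host: "{{ prometheus_server }}"
pm_query: "mysql_ssd_space_left"
pm_warning: 500
pm_critical: 50
pm_nanok: 1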

Suppose we keep swapping servers. At some point the master has to be restarted, and errors are likely. In our case the downtime was about 30 seconds. But requests keep coming, there is nowhere to write, errors rain down, monitoring fires. We use Prometheus as our monitoring system, and in it we see that the metric for 500 errors, or the number of errors when creating an order, has jumped. But we do not know the details: which order was not created, and so on.

Next I’ll tell you how we work with monitoring so as not to fall into such situations.

Monitoring review and clear descriptions for the support team


We have several areas and indicators that we watch. There are screens all over the office showing many different technical and business metrics, which are watched not only by developers but also by the support team.

In this article I describe how things are now and add what we want to get to. This also applies to monitoring reviews. If we regularly took an inventory of our "property", we could update everything outdated and fix it, preventing the failure from repeating. For that you need a clear list.

Our repository contains Icinga configs with alerts, currently 4678 lines of them. From this list it is hard to understand what each specific check is about. Say our metric is called db_disc_space_left. The support team will not immediately understand what it means. Something about free space, great.

We want to dig deeper. We look at the configuration of this monitoring and understand where it comes from.

pm_host: "{{ prometheus_server }}"
pm_query: "mysql_ssd_space_left"
pm_warning: 50
pm_critical: 10
pm_nanok: 1

The metric has a name and its own thresholds: when to trigger a warning and when to raise an alert about a critical situation. We use a naming convention for metrics: each metric name starts with the name of its system. Thanks to this, the area of responsibility is clear. If a metric is added by the people responsible for a system, it is immediately obvious whom to go to.

Alerts pour into Telegram or Slack. The support team reacts to them first, in 24/7 mode. The guys check what exactly blew up and whether the situation is normal. They have instructions:

  • instructions that are handed over from shift to shift,
  • and instructions that are permanently kept in Confluence. By the name of the check that fired you can find out what it means. For the most critical ones it is described what has broken, what the consequences are, and who needs to be woken up.

We also have on-call rotations in the teams responsible for key systems. Each team has someone who is always available. If something happens, they pick it up.

When an alert fires, the support team needs to find out all the key information quickly. It would be nice if a link to the check's description were attached to the error message, with information like this:

  1. a description of this check in understandable, relatively simple terms;
  2. the address where it lives;
  3. an explanation of what this metric is;
  4. consequences: how it will end if we do not fix the error;
  5. an action point. From time to time we have checks that flap and can safely be ignored, and nothing happens; perhaps those checks were created in vain. Either way, a clear action point on what to do is needed.

It would also be convenient to immediately see the traffic dynamics in the Prometheus interface.

I would like to have such a description for every check. They help with building the review and adjusting it. We are introducing this practice: the Icinga config already contains a link to Confluence with this information. I have been working on one system for almost 4 years, and it has basically no such descriptions, so now I am gathering the knowledge together. Descriptions also address the team's lack of awareness.

For most alerts we have instructions that say what business impact they lead to. That is why we must understand the situation quickly. The criticality of possible incidents is determined by the support team together with the business.

Let me give you an example: if the memory consumption monitoring on the RabbitMQ server of the order processing service fires, it means the queue service may go down within a few hours or even minutes. And that, in turn, will stop many business processes. As a result, customers will wait in vain for their orders, SMS/push notifications, status changes and much more.

Discussions of monitoring with the business often happen after serious incidents. If something breaks, we assemble a review board with representatives of the business area affected by our release or incident. At the meeting we analyze the causes of the incident, how to make sure it never happens again, what damage we suffered, how much money we lost and on what.

Sometimes the business has to be involved in solving the problems created for customers. There we also discuss proactive actions: what monitoring to set up so that this does not happen again.

The support team monitors metric values using a Telegram bot. When a new check appears, the support staff need a simple tool that tells them where things broke and what to do about it. The link to the description in the alert solves this problem.
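Just to illustrate the idea of a link in the alert (this is not our actual bot; the token, chat ID and URLs are placeholders), the message could be sent like this:

<?php
// Minimal sketch: push an alert with a link to its Confluence description
// into a Telegram chat via the Bot API sendMessage method.
$botToken = getenv('TELEGRAM_BOT_TOKEN');
$chatId   = getenv('SUPPORT_CHAT_ID');

$text = "CRITICAL: mysql_ssd_space_left = 48 GB\n"
      . "What it means and what to do: https://confluence.example.com/display/MON/mysql_ssd_space_left";

$query = http_build_query(['chat_id' => $chatId, 'text' => $text]);
file_get_contents("https://api.telegram.org/bot{$botToken}/sendMessage?{$query}");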

Seeing failures in real time: we use Sentry for post-mortems


It is not enough just to find out about an error; I want to see the details. Our standard use case is this: we roll out a release and receive alerts from the K8s stack. Thanks to monitoring, we check the status of the pods: which versions of the application rolled out, how the deployment finished, and whether everything is fine.

Then we look at PMM to see what is going on with the database and the load on it. In Grafana dashboards we look at the number of connections to RabbitMQ. It is great, but it can leak when it runs out of memory. We watch these things, and then check Sentry. Sentry lets you watch online how the next debacle unfolds, with all the details. In this case, post-release monitoring reports what broke and how.

In PHP projects we use the Raven client and additionally enrich the events with data. Sentry aggregates it all beautifully, and we see the dynamics of each failure and how often it happens. We also look at examples: which requests failed and which errors surfaced.
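A minimal sketch of what that enrichment might look like with the Raven PHP client (the DSN, tag and extra fields here are made up for illustration):

<?php
// Sketch: report an exception to Sentry via the Raven client, enriched
// with extra context so the failed order can be found later.
require 'vendor/autoload.php';

$client = new Raven_Client(getenv('SENTRY_DSN'));

try {
    // ... order processing code that may fail ...
    throw new RuntimeException('Failed to create order');
} catch (Exception $e) {
    $client->captureException($e, [
        'tags'  => ['release' => getenv('APP_RELEASE')], // lets us see error dynamics per release
        'extra' => ['order_id' => 12345],                // hypothetical order identifier
    ]);
}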
[Screenshot: error dynamics in Sentry]

It looks something like this. I can see that after the latest release there were sharply more errors than usual. We check what exactly broke, and then, if necessary, pull out the failed orders and fix them based on the context.

We have a cool thing: integration with Jira, our ticket tracker. One click of a button, and a bug task is created in Jira with a link to Sentry and the stack trace of the error. The task is marked with certain labels.

One of the developers came up with a sensible initiative: "clean project, clean Sentry". During planning we always put at least 1-2 tasks created from Sentry into the sprint. If something is constantly broken in the system, Sentry gets littered with millions of small, stupid errors. We clean them up regularly so that we do not accidentally miss the really serious ones.

Flare-ups for any reason: we get rid of checks that everyone ignores


  • Getting used to mistakes

If something constantly blinks and looks broken, it creates a false sense of normality. The support team may mistakenly decide that the situation is fine. And when something serious breaks, they will ignore it. Like the fable about the boy who cried wolf.

The classic case is our project responsible for order processing. It works with a warehouse automation system and transfers data to it. That system is usually released at 7 in the morning, after which our monitoring flaps. Everyone is used to it and ignores it, which is not great. It would be prudent to tune these checks, for example by linking the release of that specific system with certain alerts through Prometheus, so that the unnecessary alarm simply does not fire.

  • Monitoring does not take into account business indicators

The order processing system transfers data to the warehouse. We added monitoring to this system. None of the checks fired, and it seemed that everything was fine. The counter showed that data was flowing out. This exchange goes over SOAP. In reality the counter may look like this: the green part is incoming exchanges, the yellow part is outgoing ones.

[Graph: incoming (green) and outgoing (yellow) exchanges]

We had a case when the data really did flow, but it was wrong. Orders were not paid, yet they were marked as paid; that is, the buyer could pick them up for free. That sounds scary. But the opposite case is even more fun: a person comes to pick up a paid order and is asked to pay again because of an error in the system.

To avoid such situations we watch not only the technology but also business metrics. We have a dedicated check that monitors the number of orders requiring payment on receipt. Any serious jump in this metric shows that something has gone wrong.

Monitoring business indicators is an obvious thing, but it is often forgotten when new services are released, including by us. Everyone covers new services with purely technical metrics about disks, processes, whatever. For us as an online store, the critical number is the number of orders created. We know how many people usually buy, adjusted for marketing campaigns. So during releases we watch this indicator.
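In the same Icinga check format, such a business check could look roughly like this (the metric name and thresholds are invented for illustration; the idea is to alert when the number of created orders drops below what we normally expect):

pm_host: "{{ prometheus_server }}"
pm_query: "orders_created_last_hour"
pm_warning: 100
pm_critical: 50
pm_nanok: 1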

Another important thing: when a customer repeatedly orders delivery to the same address, we do not bother them with a call from the call center but confirm the order automatically. A failure in this system strongly affects the customer experience, so we also watch this metric, since releases of different systems can affect it a lot.

Watching the real world: we care about a healthy sprint and our performance


So that the business can keep track of various indicators, we built a small Real Time Dashboard system. It was originally made for a different purpose. The business has a plan for how many orders we want to sell on each particular day of the coming month. The system shows the plan versus what has actually been done. For the chart it takes data from the production database and reads it on the fly.

Once our replica broke down. There was no monitoring on it, so we did not find out about it in time. But the business saw that we were behind plan by some notional 10 orders and came running with comments. We started digging into the reasons. It turned out that stale data was being read from the broken replica. This is a case where the business watches indicators that are interesting to it, and we help each other when problems arise.

I will tell you about one more kind of real-world monitoring, which has been in development for a long time and is constantly tuned by each team. We have Jira Viewer, which lets us monitor the development process. The system is extremely simple: a PHP application on the Symfony framework that queries the Jira API and picks up data on tasks, sprints and so on, depending on what it is given as input. Jira Viewer regularly writes metrics related to teams and their projects to Prometheus. There they are monitored and alerted on, and from there they are displayed in Grafana. Thanks to this system we track work in progress.

  • We monitor how long a task stays in work, from the moment it goes in progress until it rolls out to production. If the number is too large, in theory it points to a problem with the processes, the team, the task description, and so on. The lifetime of a task is an important metric, but not sufficient by itself.
  • You can also look at sprint health. Say the sprint is ending and there are too many unfinished tasks. Or there are problems with time logging, if you decided to log it.
  • Lack of releases: there are too many tasks in the "ready for release" status, but they have not gone anywhere. This happens to us because of the code freeze before a big sales event such as Black Friday.
  • We monitor the pile-up in testing: when a bunch of tasks hang in testing, nobody sees it objectively. This metric removes manual work from a manager or team lead.
  • The state of the backlogs is also monitored. Recently we took the technical backlog of one project and cut it from 400 tasks to 150: we simply realized that a lot of them would never be done and canceled them.
  • In my team I monitored the number of pull requests from developers outside working hours, in particular after 8 pm. When this metric spikes, it is an alarming sign: a person either does not have enough time for something, or is putting in too much effort and will sooner or later simply burn out.

[Screenshot: a Jira Viewer page with summary information about the sprint]

The screenshot shows how Jira Viewer displays data. It is a page with summary information about the status of the tasks in the sprint, how much each of them weighs, and the like. These things are also collected and sent to Prometheus.
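The real Jira Viewer is a Symfony application; the stand-alone sketch below only illustrates the idea in plain PHP (the Jira URL, JQL, credentials, metric name and the use of a Pushgateway are all assumptions):

<?php
// Sketch: count tasks stuck in testing via the Jira REST API and push the
// number to a Prometheus Pushgateway, where it can be graphed and alerted on.
$jql = urlencode('project = EXAMPLE AND status = "In Testing"');
$auth = stream_context_create([
    'http' => ['header' => 'Authorization: Basic ' . base64_encode('user:api_token')],
]);
$response = json_decode(
    file_get_contents("https://jira.example.com/rest/api/2/search?jql={$jql}&maxResults=0", false, $auth),
    true
);
$inTesting = $response['total']; // Jira returns the total number of matching issues

// Prometheus text exposition format, pushed to a Pushgateway job.
$body = "jira_tasks_in_testing{team=\"orders\"} {$inTesting}\n";
file_get_contents('http://pushgateway.example.com:9091/metrics/job/jira_viewer', false, stream_context_create([
    'http' => ['method' => 'POST', 'header' => 'Content-Type: text/plain', 'content' => $body],
]));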

Not only technical metrics: what we are already monitoring, what we can monitor and why all this is needed


To put this all together, I suggest monitoring both the technology and the metrics related to processes, development and business. Technical metrics alone are not enough.

  • We made it a rule that when we roll out a new system or a new business process, we create a new Grafana board for it in advance, with all of the system's critical metrics on it. Alerts, warnings and so on are also set up, so that we catch problems right at the start.
  • We monitor trends: we set up monitoring that watches the situation over the long term. Once we noticed that the number of connections to our database was gradually growing, which is how we found a problem with crontab and moved to supervisor. So the practice of looking at long-term changes in the numbers in Prometheus saved the system. True, at the same time we caused another serious incident, but it happens.
  • The monitoring review is what lets you clean up this kind of story and eliminate the ancient evil that has crept in for some reason.
  • I suggest eliminating false positives, so that there are no situations where something is blazing when in reality it should not be. It is very important that people's eyes do not get desensitized.
  • In general, it is worth making sure that alerts really reflect the state of the system. My problem over the last year is that Sentry has accumulated several pages of errors. Some of them happen millions of times, and I want to get rid of them. I want to see the really important things that need to be dealt with, so that Sentry does not receive what is actually normal behavior and does not hold on to old errors.
  • Document metrics and explain their importance. It is very bad when no one except the author of the metric understands why it is important and what needs to be fixed.
  • Solve problems instead of just covering your rear. It is important not merely to tune the monitoring, raising thresholds because something in the system is gradually falling apart. I suggest actually eliminating the root cause of the alerts and preventing failures that have already happened from repeating.

Source