Preparing PostgreSQL for the DevOps era. The 2GIS experience. Pavel Molyavin
Hello everyone! My name is Pavel! I work for 2GIS. Our company makes a city information guide and navigation service — a very useful thing that helps you live in the city.
I work in the web development division. My team is called Infrastructure & Operations, abbreviated as IO. We support the infrastructure for web development: we provide our services to the teams as a service and solve all problems within our infrastructure.
The technology stack we use is very diverse. It is mainly Kubernetes; we write in Golang, Python, and Bash. For databases we use Elasticsearch, Cassandra and, of course, Postgres, because we love it very much — it is one of the basic elements of our infrastructure.
I'll tell you how, one fine day, we needed to create a tool for quickly deploying a failover cluster based on Postgres. It had to integrate with all the systems we already had and provide configuration management, monitoring, logging, and, without fail, backups — a serious solution that could be used in production as well.
My talk is not about Postgres internals or what happens inside Postgres, but about how we build the infrastructure around it. It would seem that everything is simple — you just go and do it — but actually no, everything is much more complicated. That is what I will talk about.
Let's start with the statement of the problem. Our product teams write applications, and the number of applications is constantly growing. We used to write applications primarily in PHP, Python, Java, and Scala. Slowly, fashionable languages like Golang crept in, and there's no getting anywhere without Node.js, especially in frontend.
We chose Postgres as the data warehouse for our applications. There were proposals to use MySQL, Mongo, but in the end the elephant won, and we use Postgres.
Since we practice DevOps, the product teams are involved in operations — there are many services that they support themselves for their applications. That was the case with Postgres in particular: each team installed and configured Postgres for itself.
Time passed, the number of Postgres installations grew, and with it different versions and different configurations. Problems kept arising. Then we began migrating to Kubernetes, once Kubernetes reached the production stage. Accordingly, even more databases appeared and even more Postgres instances were needed, because the migration and the transitional applications needed data storage. In general, it turned into a complete mess.
The teams tried to use Ansible to install and configure Postgres. Braver ones, who had experience with Chef, used Chef.
But the favorite way was to install everything by hand, forget how it was configured, and then contact our IO team and ask for help.
Accordingly, the load grew on the teams administering Postgres. The load grew on us too: they spent their time hunting for a problem, then came to us, and we spent ours. In short, it added up to a lot. One team particularly distinguished itself: back in the days of 9.4, they managed to install Postgres version 8.
In the end, we decided that we had to do something about this — enough was enough. We decided to create a single solution. We sent out a call to the teams and asked: "What do you need to be happy?" We gathered the requirements and got to work.
We decided that since this would all run in production, we definitely needed a cluster: several replicas replicating among themselves, a master, and autofailover, so that if the cluster failed at night, the services could keep working. We also had an optional requirement: a replica in another data center, for fault tolerance and so that applications in the other data center could also work, at least for reads.
We definitely decided to do balancing, because connections in Postgres are a rather expensive resource. We had to do pooling and have fault-tolerant balancers — that is, make sure everything keeps working.
We had to have backups with an acceptable retention depth, so that point-in-time recovery would still be possible. And we had to archive WAL files, without fail.
It was necessary to ensure integration with existing monitoring and logging systems.
And since we follow the infrastructure-as-code paradigm, we decided our solution would take the form of a deployment. We decided to write it with the tool best known in our unit, which for us was Ansible. We also wrote a little in Python and Bash, which we use actively as well.
The main idea was to make it a complete solution, so that any team could take our deployment, change some variables, change the secrets and keys, change the inventory, and deploy a failover cluster. Essentially, you should be able to press the "deploy" button and, after a while, connect to a ready cluster and work with it.
And since we use a corporate GitLab, we wanted it all to work through GitLab's built-in features — continuous integration and continuous delivery. That is, you press the "deploy" button in the repository, and after a while a working cluster appears.
And now we will consider all these elements separately.
- We chose PostgreSQL 9.4. That was about 3.5 years ago, when 9.4 was the baseline. 9.6 had, I think, either just come out or did not exist yet — I don't remember exactly. But we quickly upgraded our solution to support 9.6.
- For replication, we did not invent anything. We chose the standard method — streaming replication. It is the most reliable and best known, and there are no particular problems with it.
- For clustering, we chose a replication manager with pseudo-HA: repmgr from 2ndQuadrant. In defense of repmgr I can say this. Over the previous two days it took a lot of criticism for not providing real HA and having no fencing, but in fact, over three years of operation, we have not had a single incident where a failover switched incorrectly or ended in an error. repmgr always switched the cluster when it mattered and always provided reliability. Not without downtime, of course: it takes repmgr about 2 minutes to realize that it needs to perform a failover, and then it executes it.
- For repmgr, we wrote a custom failover_command. This is the command repmgr executes when it is about to perform a failover, i.e. bring up the new master. A special script is launched: it goes to the balancers and checks that they are working and reachable from the new master. Only after that does the switch happen. That is, it performs the failover, rewrites the backends on the balancers, and then finishes its work. As a result, we get a new master, and the balancers are connected to it.
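As a rough illustration of how this is wired together, here is a repmgr.conf sketch for a node (repmgr 3.x-style options; the node names and the script path are hypothetical — our real wrapper script also checks the balancers before promoting):

```ini
# /etc/repmgr.conf -- sketch, names and paths are illustrative
cluster=pgcluster
node=1
node_name=pg-node-1
conninfo='host=pg-node-1 user=repmgr dbname=repmgr'

failover=automatic
# our custom wrapper: first verify the balancers are reachable,
# then promote this node and rewrite the PgBouncer backends
promote_command='/usr/local/bin/custom_failover.sh'
# remaining standbys re-point their replication to the new master
follow_command='repmgr standby follow -f /etc/repmgr.conf'
```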
- Now about the replica in another DC. Our main presence is in Novosibirsk, with several DCs there. But there is also a site in Moscow, so we needed to have two replicas in Moscow. At first we added them to the repmgr cluster.
But once one of the test clusters went into operation, it turned out that the link was not very good: constant ping spikes and lags. As a result, we rejected this idea. Warnings kept appearing in the cluster — monitoring kept saying that something in the cluster had changed, go check what happened. Replication was also interrupted intermittently.
As a result, we ended up with this scheme: we pulled the Moscow nodes out of repmgr and made them plain hot standbys — nothing fancy. They restore archived WALs from the backup server via restore_command. We did not solve the lag problem this way: if active work starts on the master, the Moscow replicas obviously fall slightly behind.
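A sketch of what such a hot-standby node's recovery.conf might look like (pre-PostgreSQL 12 syntax; the host name and paths are illustrative — we pull WALs over rsync the same way we archive them):

```ini
# recovery.conf on a Moscow hot-standby node -- sketch
standby_mode = 'on'
# fetch archived WALs from the backup (Barman) server
# instead of streaming from the master over a flaky link
restore_command = 'rsync -a backup-server:/var/lib/barman/pgcluster/wals/%f %p'
```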
But they lag in such a way that they recover on their own — replication is not interrupted. And the applications that use them in Moscow are aware of this and understand what to expect. And they always have the Novosibirsk nodes, the ones in the cluster, as a fallback.
Problems? Nothing special here, except that archive_command is slow for us, because it works through plain rsync: the WALs travel from the master to the archive via rsync over ssh. Accordingly, when load starts on the master, this all slows down a little. But we can live with it, because most likely we will switch archiving to streaming mode. In general, it is not that big a problem.
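The current archiving path described above could look roughly like this in postgresql.conf (host and directory are illustrative; Barman's incoming directory is where it expects archived WALs by default):

```ini
# postgresql.conf on the master -- sketch of rsync-over-ssh archiving
archive_mode = on
archive_command = 'rsync -a %p backup-server:/var/lib/barman/pgcluster/incoming/%f'
```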
With balancing, we decided that we needed to make fail-safe balancers.
We allocated special nodes for this. At the entrance they run keepalived. Each keepalived instance holds a virtual address for which it is the master. Accordingly, if one keepalived node fails, the other takes over its address — so it ends up holding two addresses.
These addresses are balanced in DNS via round robin: when applications connect to the cluster, they receive either the virtual address of the first balancer or the virtual address of the second.
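A minimal keepalived sketch of this VRRP setup, assuming illustrative addresses (the mirror instance on the other balancer would have state BACKUP and a lower priority for this VIP, and its own MASTER instance for the second VIP):

```
# /etc/keepalived/keepalived.conf on balancer 1 -- sketch
vrrp_instance VI_BALANCER_1 {
    state MASTER            # on balancer 2 this same instance is BACKUP
    interface eth0
    virtual_router_id 51
    priority 101
    virtual_ipaddress {
        10.0.0.10/24        # VIP 1; balancer 2 is MASTER for VIP 2 (10.0.0.11)
    }
}
```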
Behind the keepalived addresses sits PgBouncer. A great pooler — we have no particular problems with it; it works and causes no trouble. But it has one limitation: it can have only one backend, and that backend must point at the master. So we put Pgpool behind it. We liked Pgpool because it knows where the master is in the cluster and it can balance read queries across the slaves.
Everything was cool and good; we started operating it, but at some point we had to change this, because we had constant problems with Pgpool. Periodically, for no apparent reason, it would throw the master out of balancing. Also, its built-in query analyzer periodically failed to tell a write query from a read query — that is, it could send a write to a replica. As a result, we would get a read-only cluster and general panic. In short, everything was bad.
We got rid of Pgpool. PgBouncer now looks only at the master. The deploy sets PgBouncer's backend, and the failover_command running on the future master rewrites the backend address and does a soft reload of PgBouncer. So PgBouncer always points at the master.
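A sketch of the piece the failover script rewrites — the database name and addresses are illustrative, not our actual configuration:

```ini
; pgbouncer.ini -- sketch; the failover script rewrites the host below
[databases]
appdb = host=10.0.1.5 port=5432 dbname=appdb   ; current master address
```

After rewriting the file, the script can tell PgBouncer to re-read it via the admin console, something like `psql -p 6432 -U pgbouncer pgbouncer -c 'RELOAD;'` — existing server connections are then rotated to the new backend without dropping clients.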
Everything would have been fine, except that the whole load now went to the master: with Pgpool the slaves had been involved, and now it turned out they were not.
And, by a happy coincidence, our applications began to support read-only endpoints. That is, they can connect to one balancer for writes and to another balancer for reads.
Accordingly, we made another identical pair. I drew only one here because it no longer fit on the slide. We built the read balancers the same way, except that behind PgBouncer we put HAProxy, which, using a special script, queries the cluster, knows exactly where the slaves are, and sends requests to them via round robin.
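A possible HAProxy sketch of this read path — the check script name and addresses are hypothetical; HAProxy's external-check mechanism runs an arbitrary script per server (it also requires `external-check` to be enabled in the global section):

```
# haproxy.cfg -- read-balancer backend, sketch
listen postgres_read
    bind 127.0.0.1:5433
    balance roundrobin
    option external-check
    # the script could ask each node "SELECT pg_is_in_recovery()"
    # and report healthy only for replicas, keeping the master out
    external-check command /usr/local/bin/is_replica.sh
    server pg1 10.0.1.5:5432 check
    server pg2 10.0.1.6:5432 check
    server pg3 10.0.1.7:5432 check
```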
Also, our applications started doing pooling on their side — using things like HikariCP — which saves connections on the balancer side. After that, life got even better.
With backups, we also did not invent anything, did not complicate anything.
- With Barman, we back up the database via cron every two days, when the load is minimal.
- Also, through archive_command, we archive WAL files to Barman so that PITR can be implemented.
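Roughly, the Barman side of this could look like the sketch below (server name, paths, and the retention window are illustrative — the talk does not state our exact retention settings):

```ini
; /etc/barman.d/pgcluster.conf -- sketch
[pgcluster]
description = "Main failover cluster"
ssh_command = ssh postgres@pg-master
conninfo = host=pg-master user=barman dbname=postgres
backup_method = rsync
retention_policy = RECOVERY WINDOW OF 14 DAYS
```

And the cron entry for a backup every two days at a quiet hour might be: `0 4 */2 * * barman barman backup pgcluster`.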
We always want to be sure that we have a backup that can actually be restored. So we have a separate dedicated host which, every night after the last backup, restores either the latest backup or the previous one, and then replays a number of WAL files to verify that PITR works. The script runs, the database is restored, and then another script connects to the database and checks the labels. After that it signals our monitoring that everything is fine: the backup exists and it restores.
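The real verification script connects to the restored database and reads a label; here is a minimal, hypothetical sketch of just the decision logic it might apply — the function name and the 26-hour threshold are my own assumptions, not values from the talk:

```python
from datetime import datetime, timedelta

def restore_is_valid(canary_ts, restore_finished, max_age=timedelta(hours=26)):
    """Decide whether the nightly test restore proves the backup is usable.

    canary_ts        -- timestamp label read from the restored database
    restore_finished -- when the test restore completed
    If the restored data is older than max_age, the backup/WAL chain is
    stale and monitoring should fire an alert.
    """
    return restore_finished - canary_ts <= max_age

now = datetime(2018, 11, 8, 5, 0)
# backup taken tonight, restored an hour and a half later -> healthy
print(restore_is_valid(datetime(2018, 11, 8, 3, 30), now))   # True
# restored data is three days old -> alert
print(restore_is_valid(datetime(2018, 11, 5, 3, 30), now))   # False
```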
The disadvantages in this scheme are pretty standard:
- The same archive_command.
- And it takes a lot of space, because we do not compress WAL files: we save CPU and time so that, during recovery, you can restore as quickly as possible.
Monitoring and logs
- We use Prometheus for monitoring. It uses a pull model for collecting metrics: it goes to the services itself and asks for their metrics. Accordingly, for each component of the system — and almost none of them can expose metrics on their own — we had to find a suitable exporter to collect the metrics and hand them to Prometheus, or write one ourselves.
- We write our exporters in Golang, Python, Bash.
- We mainly use standard exporters: Node exporter, the main system exporter, and Postgres exporter, which lets you collect all the metrics from Postgres. We wrote the PgBouncer exporter ourselves in Python. We also wrote a custom cgroups exporter in Golang — we need it to know exactly how many resources the database server spends servicing each database.
- We collect all the system metrics: memory, CPU, disk load, network load.
- And the metrics of all the components. With the Postgres exporter, the way it works is that you yourself write the queries you want it to execute against the database; it runs them for you and turns the results into metrics. So the metrics you can get out of Postgres are, in principle, limited only by your imagination. Clearly, it is important for us to collect the number of transactions, the number of changed tuples, and the WAL positions, so we can constantly watch the replication lag, if there is any. We also have to know what is happening on the balancers — how many sessions are in which state. In short, we collect everything into Prometheus.
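For illustration, this is the kind of custom-query file the Postgres exporter accepts — the query and metric names here are an example I chose, not our actual set:

```yaml
# queries.yaml for postgres_exporter -- sketch; add whatever queries you need
pg_stat_database_custom:
  query: "SELECT datname, xact_commit, xact_rollback FROM pg_stat_database"
  metrics:
    - datname:
        usage: "LABEL"
        description: "Database name"
    - xact_commit:
        usage: "COUNTER"
        description: "Transactions committed in this database"
    - xact_rollback:
        usage: "COUNTER"
        description: "Transactions rolled back in this database"
```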
We look at the metrics in Grafana. This is what a chart from the cgroups exporter looks like. We can always see which database used how much CPU over a given period — very convenient.
As in any monitoring system, we have alerts. In Prometheus, alerts are handled by Alertmanager.
We have written a large number of alerts concerning the cluster. We always know that a failover has started, that it has finished, and whether it finished successfully or not. We monitor all of this without fail. The alerts either go to email, or to the corporate Slack, or call us and say: "Wake up, you have a problem!"
We also monitor successful backups and successful restores. We check that our database is writable and readable both through the balancers and bypassing them — that is, we monitor the entire request path, from the balancer to the database.
We also need to monitor the test recovery, in order to know that we have a backup, that it works, it is being restored.
With logs, things are simpler. Our company uses the ELK stack for log collection: Elasticsearch, Kibana, and Logstash. Postgres and all the components of our solution write their logs to disk in CSV format, and Beaver — a Python project — collects them from disk and hands them to Logstash. Logstash enriches them: it adds which team the cluster belongs to, which cluster it is, and some other extra information.
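The enrichment step described above could be sketched as a Logstash filter like this — the field names and values are illustrative, and the column list is a shortened subset of Postgres's csvlog columns:

```
# logstash pipeline fragment -- sketch
filter {
  csv {
    separator => ","
    # partial csvlog column list, for illustration only
    columns => ["log_time", "user_name", "database_name",
                "error_severity", "message"]
  }
  mutate {
    # enrich each event with ownership metadata
    add_field => {
      "team"    => "io"
      "cluster" => "pgcluster-main"
    }
  }
}
```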
And after that, in Kibana, you can easily view and aggregate logs, in general, sort and do whatever you want.
The teams also asked us to install pgBadger. It is a thing that lets you easily analyze, for example, slow query logs — that was the main reason they wanted it. It is written in Perl. We did not like that we would have had to dedicate a separate host to it, so in the end we pushed it into Kubernetes, and it runs there. It gets the logs from Logstash — Logstash sends them over HTTP — and in pgBadger you can then browse and look for your slow queries. In general, grieve over them and figure out how to fix them.
With the deployment, the story is as follows. We had to make the deployment as simple as possible, so that the tool would be understandable.
We decided to write the deployment in Ansible. We wanted it to be as repeatable and idempotent as possible, so that it could be used not only for the initial deployment but also during operation. We wanted to automate every operation happening outside of Postgres itself, including configuration changes in postgresql.conf — everything should be doable through the deployment. And we wanted everything versioned, so that from the repository history you could always track what happened and who made the changes.
As I said, we wrote the deployment in Ansible. It took 25 roles. All the components are installed by these roles — nothing extra needs to be set up. You launch the deployment, and after it you have a working cluster. The only thing you need to do afterwards is file a merge request to the monitoring system, because it lives separately for us, and you have to tell it separately where to go to collect metrics.
We wrote 20 of the roles ourselves. 12 of them are single-purpose — written exclusively for this deployment and not particularly useful anywhere else, so they live right inside the deploy, in a special directory.
And we had to localize all the roles. The ones we used to pull from the Internet, we mirrored locally, because Roskomnadzor has become quite active — it keeps banning GitHub hosts — so we decided not to pull roles from outside. All our roles now live locally in our repositories.
We also used a built-in Ansible feature: different environments for deployment. By default our deployment has testing, staging, and production. By staging, we mean almost-production, but without user traffic. Using this built-in Ansible feature, you can add any environment you like. If you need two or three testing environments to try out some feature, you add the extra environments, add the variables, press "deploy" — and after that you have a working cluster.
Keys and passwords we keep secret. We encrypt everything with Ansible Vault, a feature built into Ansible.
Since we use Ubuntu Linux as the base system, we install everything from repositories. Anything not available as a package, we package ourselves and put into a local apt. This is also so that we do not have problems when Roskomnadzor decides to ban something again, and so that the deployment runs as fast as possible — everything local is faster.
We tried to automate all the routine operations: any configuration change, adding databases, adding users, adding read-only users, adding extra bits to a database, changing the balancing mode — because one part of our services uses transaction pooling, while another part needs prepared statements and uses session pooling.
Also, the cluster bootstrap itself is completely automated. On the first run, you set a certain variable and your cluster bootstraps fully: it comes up clean and ready to work. There is, of course, an inconvenience: that variable must then be turned off, because if you run the deploy with it again, you will get a clean cluster again. You have to remember these things; they are covered in the documentation.
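The guard could look roughly like this in Ansible — the variable and task names are hypothetical (pg_createcluster is the standard Debian/Ubuntu wrapper for initdb):

```yaml
# group_vars/testing/main.yml -- sketch
# set to true ONLY for the very first deploy, then turn it back off:
# re-running with it still on would re-initialize (wipe) the cluster
pg_bootstrap_cluster: false

# roles/postgresql/tasks/bootstrap.yml -- sketch
- name: Bootstrap a fresh PostgreSQL cluster
  command: pg_createcluster 9.6 main
  when: pg_bootstrap_cluster | bool
```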
Our deployment is in the form of a Git repository, as I said, in our internal Gitlab.
All work happens in branches — a standard workflow. We never deploy anything from a branch; you must create a merge request from the branch into master. Per our internal practice, responsible people are summoned to the merge request — usually either team members or members of our team who understand the issue and know what is being discussed. After the merge request is approved, it is merged into master, and the cluster is deployed from master.
We also wrote a custom binding for deploying a cluster in OpenStack. It is very convenient, especially if you need lots of testing environments. We can use the native OpenStack API through the Ansible module, or through an OpenStack component called heat-api, which lets you start machines from templates, several at a time. If you need to test something in a branch, you plug in this binding, which deploys machines through heat-api. You spin up your cluster, work with it, then say: "I don't need this anymore. Delete." And it is all deleted — the virtual machines in OpenStack are removed. Simple and convenient.
For bare-metal machines we also wrote a bootstrap, but that is just a standard preseed image — a built-in feature of Debian and Ubuntu. It asks a few questions through a special menu, about disks and the network. Once you have finished, you have a node ready to be used in the deployment: you add it to your inventory, start the deployment, and everything works.
We also built tests into the deploy, written in Ansible. The tests are infrastructural, so we run them only on the test environment. The tests first start the cluster, then break it. They check that everything is reachable through the balancer, that the master has switched, that a backup happens, that the database restores — everything is verified end to end.
And the last paragraph. We use the built-in features of Gitlab.
Gitlab can run special jobs. Jobs are separate steps that are performed.
That is, we have separate steps: syntax check, deploy, tests. Here is a small diagram. After a merge into master, a pipeline launches automatically: it runs the syntax check and deploys to testing. On testing, the tests run. After that, if everything is OK and green, the manual launch of the staging and production deployments is unlocked.
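The pipeline described above could be sketched in .gitlab-ci.yml like this — the playbook and inventory paths are illustrative, not our actual layout:

```yaml
# .gitlab-ci.yml -- sketch of the described pipeline
stages: [lint, deploy_testing, test, deploy_staging, deploy_production]

syntax_check:
  stage: lint
  script: ansible-playbook site.yml --syntax-check

deploy_testing:
  stage: deploy_testing
  script: ansible-playbook -i inventories/testing site.yml
  only: [master]

infra_tests:
  stage: test
  script: ansible-playbook -i inventories/testing tests.yml
  only: [master]

deploy_staging:
  stage: deploy_staging
  script: ansible-playbook -i inventories/staging site.yml
  when: manual          # unlocked only after the tests pass
  only: [master]

deploy_production:
  stage: deploy_production
  script: ansible-playbook -i inventories/production site.yml
  when: manual
  only: [master]
```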
We have made a manual launch so that the people who do this understand what they are doing and report that it will now be a testing or staging or production deployment.
To summarize, what did we achieve? We wanted a tool that deploys everything and quickly gives you a working cluster with all the integrations. In principle, we succeeded. We created this deployment and it helps us a lot. Even under unfavorable circumstances, within 15 minutes of pressing the "deploy" button you get a working cluster: you can go to the endpoint and start working. Teams can embed it directly in their pipelines and work with it fully.
Our team supports several clusters based on this solution. Everything suits us, we fully work through deployment. We practically do nothing manually. I’ll talk about this later.
We eliminated per-team DBMS support. The teams stopped improvising, stopped installing Postgres themselves — we do all of that now. We install everything and configure everything. In short, the folks are free to keep writing their product features, which they enjoy much more.
We gained great experience. We figured out how replication managers, balancing, replication, and Postgres itself work. This is invaluable experience; I recommend it to everyone.
Failover works for us. I would not say it happens often: over three years we had about ten failovers, and almost all of them were justified.
We also do manual switchover — a failover done by hand. We use switchover for all sorts of interesting things. For example, when we needed to move one of the clusters from one DC to another, we simply used the deploy to bring up a node in the other DC, did a switchover to it, and the balancers switched over. And there we were with a working master in the new DC with practically no downtime — only for the duration of the switch, which for repmgr is, I believe, two minutes by default.
We couldn’t do without falls either:
- We wanted to make the solution simple enough that the teams could use it on their own. But nothing came of it. You still need to know the nuances of how everything works inside — how replication works, how the replication manager works. That is why the number of forks of our repository is zero. The teams do not do it. Maybe that is not so bad, since we would end up solving all the problems anyway.
- Also, for each cluster you still have to tweak something, add something. You cannot use exactly the same deploy everywhere.
- And we could not get rid of manual operations either. Upgrades, or restarting Postgres when you need to change something in postgresql.conf, still have to be done by hand. You can trust automation, but you need to understand what is happening — you cannot do without an operator. A person must understand what they are doing. It still does not work without manual operations.
- Upgrading is a separate pain: you have to upgrade everything by hand. The balancers, for example, could be upgraded automatically, but that would mean downtime — whereas without automation, by hand, it can be done with no downtime.
- And we do not really like building packages, because it takes extra time. You need infrastructure for building, for publishing, and for the repository.
Our company is actively adopting Kubernetes, so there is a desire to try something new — for example, running failover Postgres on Kubernetes. For that we want to look at Stolon and Patroni. Perhaps that will be the first thing we do.
I also want to be able to do the same thing without a cluster — to provide modularity, so that you can turn some pieces of the deploy off and others on without heavily rewriting it. We have processing databases that do not need a cluster or backups but do need, say, balancing plus monitoring with logging.
And we are going to upgrade our solution to PostgreSQL 10-11. I predict that will be very "fun" entertainment. Challenge accepted — we have no alternative, we use our own solution, so we will upgrade.
Thanks for the talk! How much time was spent developing all the Ansible playbooks?
By trial and error, over several trial clusters — about two to three months. There were 2-3 people on the team.
What were the specifics of implementing Ansible in conjunction with Postgres, with your solution? For example, did you have to fundamentally rework something?
Modules to write for Ansible?
No, we used the built-in features of Ansible, that is, all the modules that it has inside.
Thanks for the talk! Why did you settle on Barman? And my question about upgrades came up at the very beginning, when you talked about 9.6. I saw you have plans for versions 10 and 11, right?
I don't remember exactly why we chose Barman. We liked it because it configures easily — we already had a written Ansible role for it. We do want to look at something else, for example WAL-G. But the problem, as far as I remember, is that WAL-G could not write to local disk (clarification: it can now). And we have a requirement that all backups and everything else stay inside our infrastructure. A separate S3 for this is possible, of course, but why?
We did the upgrade manually. We stood up a 9.6 cluster alongside, migrated the data, and that was it. After that we dropped 9.4 and everything was fine.
Thank you very much for the talk! You said you have several clusters. But the number of applications keeps growing, right?
By what principle are new clusters added besides staging?
I understand the question. We consolidated all the databases into our three large clusters. We have testing and staging there too. And when an application is already running on staging, we can see what it is like — say, one of those processing databases I talked about. Letting such a database into the shared cluster makes little sense: there would be a constant load of WALs and constantly large backups — if a backup runs while a processing database is churning, you get problems. By the load profile, we understand what is happening inside the application, and based on that we decide to give those folks a separate cluster. So we deploy another cluster — what else can you do? They are all production applications.
Hello! Thanks for the talk! I have a two-part question. Are you using vanilla Postgres builds?
I also use vanilla now, but I am looking at the free edition from the Postgres Professional team, Postgres Pro Standard. And since the topic of what to choose for backups came up — they have a utility, pg_probackup, which is similar to Barman but has backup verification built in. Why didn't you consider that utility?
I don't really know how to answer that. We basically did not consider Postgres Pro's tools at all — it just happened that way. We are used to building our own bicycles. Everything seems to work for us.
That is, it is everyone's own choice: either build your own bicycle, or follow the beaten path.
Thank you very much!