SQL query optimization or search for dangerous criminals
I believe that almost every project that uses Ruby on Rails and Postgres as the main weapon on the backend is in a permanent struggle between the development speed, readability/maintainability of the code and the speed of the project in production. I will tell about my experience of balancing between these three whales in a case, where readability and work speed suffered at the input, and at the output I managed to do what several talented engineers tried unsuccessfully before me.
The whole story will take several parts. This is the first one where I will talk about what PMDSC is for optimizing SQL queries, share some useful tools for measuring query performance in postgres and recall one useful old cheat sheet that is still relevant.
Now, after some time, “in hindsight”, I understand that at the entrance to this case I did not expect at all that I would succeed. Therefore, this post will be useful more for bold and not the most experienced developers than for super-seniors who have seen rails with bare SQL.
We at Appbooster are promoting mobile apps. To easily put forward and test hypotheses, we develop several of our applications. The backend of most of them is the Rails API and Postgresql.
The hero of this publication has been developed since the end of 2013 - then rails 4.1.0.beta1 just came out. Since then, the project has grown into a fully loaded web application that runs on several servers in Amazon EC2 with a separate database instance in Amazon RDS (db.t3.xlarge with 4 vCPU and 16 GB RAM). Peak loads reach 25k RPM, the average load in the daytime is 8-10k RPM.
With the database instance, more precisely with its credit balance, this story began.
How does a Postgres instance of type “t” in Amazon RDS work: if your database works with an average processor time consumption below a certain value, then you accumulate credits on your account that the instance can spend on CPU consumption during high-load hours - this allows you to not overpay for server capacity and to cope with high load. In more detail about what and how much they pay using AWS, you can read in our CTO article .
The balance of loans at some point was exhausted. For some time this was not given much importance, because the balance of loans can be replenished at the expense of money - it cost us about $ 20 a month, which is not very noticeable for the total cost of renting computing power. In product development, it is customary to primarily pay attention to tasks formulated from business requirements. The increased consumption of processor power by the database server fits into the technical debt and is covered by the small cost of buying a loan balance.
One fine day, I wrote in a daily sammari that I was very tired of putting out “fires” periodically arising in different places of the project. If this continues, then the burnt developer will devote time to business tasks. On the same day, I went to the main project manager, explained the alignment and asked for time to investigate the causes of periodic fires and repairs. Having received the go-ahead, I began to collect data from various monitoring systems.
We use Newrelic to track the total response time per day. The picture looked like this:
Yellow on the graph shows the portion of the response time Postgres takes. As you can see, sometimes the response time reached 1000 ms and most of the time it was the database that pondered the response. So you need to watch what happens with SQL queries.
PMDSC - a simple and clear practice for
any boring work to optimize SQL queries
Perhaps the most important part of the whole practice. When someone says the phrase "Optimization of SQL queries" - it rather causes an attack of yawning and boredom in the vast majority of people. When you say “Detective Investigation and the Search for Dangerous Villains,” it more involves and sets you up in the right way. Therefore, it is important to enter the game. I enjoyed playing the detective story. I imagined that the database problems were either dangerous criminals or rare diseases. And he represented himself in the roles of Sherlock Holmes, Lieutenant Colombo or Doctor House. Choose a hero to your taste and go!
To analyze query statistics, I installed PgHero . This is a very convenient way to read data from the pg_stat_statements extension for Postgres. We go into/queries and look at the statistics of all queries over the last day. Sorting default queries by the Total Time column - the fraction of the total time that the database processes the query - a valuable source in finding suspects. Average Time - how much the average query is running. Calls - how many requests were in the selected time. PgHero considers requests to be slow, which were executed more than 100 times per day and took an average of more than 20 milliseconds. The list of slow queries on the first page, immediately after the list of duplicate indexes.
We take the first one in the list and look at the details of the request, right there you can see its explain analyze. If planning time is much less than execution time, then something is wrong with this request and we are focusing on this suspect.
PgHero has its own visualization method, but I liked using explain.depesz.com more copying the data from explain analyze there.
One of the suspected queries is using Index Scan. The visualization shows that this index is not effective and is a weak point - it is highlighted in red. Fine! We examined the traces of the suspect and found important evidence! Justice is inevitable!
Let's draw a lot of data that are used in the problem part of the request. It will be useful to compare what data the index covers.
A bit of context. We tested one of the ways to keep the audience in the application - something like a lottery, in which you can win a little domestic currency. You make a bet, make a number from 0 to 100, and take the entire bank, if your number is closest to what the random number generator received. We called it the Arena, and the rallies called the Battles.
There are about five hundred thousand battle records in the database at the time of the investigation. In the problematic part of the request, we are looking for battles in which the rate does not exceed the user's balance and the battle status - I am waiting for the players. We see that the intersection of sets (highlighted in orange) is a very small number of records.
The index used in the suspect part of the request covers all created battles in the created_at field. The request runs through 505330 records of which 40 chooses, and 505290 eliminates. It looks very wasteful.
We put forward a hypothesis. What will help the database find forty records out of five hundred thousand? Let's try to make an index that covers the bet field, only for battles with the status “waiting for players” - a general index.
add_index :arena_battles, :bet, where: "status='waiting_for_players'", name: "index_arena_battles_on_bet_partial_status"
Parsial index - exists only for those records that fit the condition: the status field is “waiting for players” and the rate field indexes - exactly what is in the request condition.It is very beneficial to use this particular index: it takes only 40 kilobytes and does not cover the battles that have already been played and are not needed for us to get a sample. For comparison, the index_arena_battles_on_created_at index used by the suspect takes about 40 MB, and the table with the battles is about 70 MB. This index can be safely removed if it is not used by other queries.
We roll out the migration with a new index in production and observe how the response of the endpoint with the battles has changed.
The graph shows how much we rolled out the migration. On the evening of December 6, the response time decreased by about 10 times from ~ 500 ms to ~ 50ms. The suspect in court received prisoner status and is now in prison. Fine!
A few days later we realized that we were happy early. It seems that the prisoner found accomplices, developed and implemented an escape plan.
On the morning of December 11, the postgres query planner decided that it would not be profitable for it to use the fresh index, and began to use the old one again.
We are at the Suppose it stage again! Collecting a differential diagnosis in the spirit of Dr. House:
- You may need to optimize your postgres settings;
- maybe it’s minor to update postgres to a new version (9.6.11 - > 9.6.15);
- and maybe again carefully examine which SQL query forms the Rail?
We tested all three hypotheses. The latter led us to the trail of an accomplice.
SELECT "arena_battles".* FROM "arena_battles" WHERE "arena_battles"."status"='waiting_for_players' AND (arena_battles.bet <= 98.13) AND (NOT EXISTS ( SELECT 1 FROM arena_participations WHERE arena_battle_id=arena_battles.id AND (arena_profile_id=46809) )) ORDER BY "arena_battles"."created_at" ASC LIMIT 10 OFFSET 0
Let's walk through this SQL together. We select all battle fields from the battle table whose status is “waiting for players” and the bet is less than or equal to a certain number. So far, everything is clear. The following term is creepy.
NOT EXISTS ( SELECT 1 FROM arena_participations WHERE arena_battle_id=arena_battles.id AND (arena_profile_id=46809) )
We are looking for a non-existing subquery result. Get the first field from the table of participation in the battles, where the battle identifier matches and the profile of the participant belongs to our player. I will try to draw the set described in the subquery.
It is difficult to comprehend, but in the end, with this subquery, we tried to exclude those battles in which the player is already participating. We look at the general explain request and see Planning time: 0.180 ms, Execution time: 12.119 ms. We found an accomplice!
The time has come for my favorite cheat sheet, which has been surfing the internet since 2008. Here it is:
Yes! As soon as there is something in the query that should exclude a certain number of records based on data from another table, this meme with a beard and curls should pop up in memory.
This is actually what we need:
Save this picture for yourself, and even better print and hang it in several places in the office.
We rewrite the subquery to LEFT JOIN WHERE B.key IS NULL, we get:
SELECT "arena_battles".* FROM "arena_battles" LEFT JOIN arena_participations ON arena_participations.arena_battle_id=arena_battles.id AND (arena_participations.arena_profile_id=46809) WHERE "arena_battles"."status"='waiting_for_players' AND (arena_battles.bet <= 98.13) AND (arena_participations.id IS NULL) ORDER BY "arena_battles"."created_at" ASC LIMIT 10 OFFSET 0
The corrected query runs along two tables at once. We added a “left” table with records of participation in the user's battles and added the condition that the participation identifier does not exist. We look explain analyze the received request: Planning time: 0.185 ms, Execution time: 0.337 ms. Fine! Now, the query planner will not doubt that he should use the index index, but will use the fastest option. An escaped prisoner and his accomplice are sentenced to life imprisonment in a maximum security institution. Escape to them will be more difficult.
- Use Newrelic or another similar service to find leads. We realized that the problem is in the database queries that way.
- Use the PMDSC practice - it works and engages well anyway.
- Use PgHero to find suspects and study evidence in SQL query statistics.
- Use explain.depesz.com - it’s convenient to read explain analyze requests there.
- Try to draw lots of data when you don’t understand what the query is doing.
- Remember a stern guy with curls in different places of his head when you see a subquery that is looking for something that is not in another table.
- Play the detective, maybe you’ll even get a badge.