Work in hosting support is mostly routine: the majority of client requests are resolved according to a well-established procedure, but sometimes you still have to deal with non-trivial problems. Then the engineer's main task is to find the one right path that leads to the solution. In this article I want to tell how we ran into the intermittent error “HTTP Error 503. Service Unavailable” on our shared hosting, how we tried to catch it, performed diagnostics, and got an unexpected ending.

Start


The hosting provides users with a typical Linux + Apache + MySQL + PHP stack and a management panel. In our case this is ISP Manager 5 Business based on CentOS 7 converted to CloudLinux. On the administrative side, CloudLinux provides limit management tools, as well as a PHP selector with various operating modes (CGI, FastCGI, LSAPI).

This time a client approached us with the following problem: his WordPress site had started periodically returning a 503 error, which he reported to us.

Response codes in the 5xx range refer to server-side problems. These can be problems of the site itself as well as of the web server that serves it.

Typical situations in which we receive the following errors:

  • 500 Internal Server Error - quite often associated either with syntax errors in the site code or with missing libraries / an unsupported PHP version. There may also be problems connecting to the site database or incorrect permissions on files/directories
  • 502 Bad Gateway - for example, if Nginx refers to the wrong port of the Apache web server or the Apache process for some reason stops working
  • 504 Gateway Timeout - a response from Apache was not received within the time specified in the web server configuration
  • 508 Resource limit is reached - the resource limits allocated to the user have been exceeded

This list contains only some of the most common cases. It is also worth noting that if the limits are exceeded, the user can get both 500 and 503 errors.

When diagnosing these errors, the first thing we do is check the web server logs. Usually, this is enough to identify the culprit and fix the problem.
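
For example, a quick way to estimate how often 5xx responses occur is to filter the site's access log by status code (a sketch: the log path follows the layout used later in this article and is an assumption, and the status code is taken as the 9th field of the combined log format):

# count 5xx responses in the site's access log (path and field position are assumptions)
awk '$9 ~ /^50[0-9]$/ {n++} END {print n+0}' /var/www/httpd-logs/sitename.access.log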

As for the 503 errors in our case, we saw the following entry in the logs:
[lsapi:error] [pid 49817] [client x.x.x.x:6801] [host XXX.XX] Error on sending request (GET /index.php HTTP/1.0); uri (/index.php) content-length (0): ReceiveAckHdr: nothing to read from backend (LVE ID 8514), check docs.cloudlinux.com/mod_lsapi_troubleshooting.html
From this log entry alone it was not possible to determine what the problem might be.

Initial diagnosis


Initially, we checked the user's statistics for exceeded limits. Minor excesses had been recorded on previous days, but the errors in the logs were fresh; moreover, they appeared in the log every one to several minutes.

We also went through the CloudLinux recommendations at the link provided in the error log.
Changing any of the parameters there brought no result.

The site used a database on a MySQL 5.7 server, which runs on the same machine in a Docker container. The container logs contained messages:

[Note] Aborted connection 555 to db: 'dbname' user: 'username' host: 'x.x.x.x' (Got an error reading communication packets) 

Among these messages there were also entries about aborted connections from the site under investigation. This suggested that the connection to the DBMS was being made incorrectly. To verify this, we deployed a copy of the site on a test domain and converted the site database to the native CentOS 7 DBMS version, 5.5.65-MariaDB. Several hundred requests to the test site were made using the curl utility; the error could not be reproduced. But this result was preliminary, and after converting the database on the production site the problem remained.
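
A simple way to run such a check is to fire off a series of requests and count the response codes that come back (a sketch; the domain name is a placeholder for the test site):

# send 300 requests and summarize the HTTP status codes returned
for i in $(seq 1 300); do curl -s -o /dev/null -w "%{http_code}\n" http://test-domain.example/; done | sort | uniq -c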

Thus, the problem of incorrect connection to the DBMS was eliminated.

The next suggestion was to check whether there were any problems with the site itself. To do this, we set up a separate virtual server and reproduced the closest possible environment on it; the only significant difference was the absence of CloudLinux. The problem could not be reproduced on the test server, so we concluded that the site code was in order. Nevertheless, we also tried disabling the WordPress plugins, but the problem persisted.

As a result, we came to the conclusion that the problem is on our hosting.

An analysis of the logs of other sites showed that the problem was observed on many of them, about 100 at the time of the check:

/var/www/httpd-logs# grep -Rl "ReceiveAckHdr: nothing to read from backend" ./ | wc -l
99

During testing, we found that even a freshly installed, clean WordPress CMS occasionally returned a 503 error.

About two months before that, we had carried out work to upgrade the server; in particular, we changed the Apache MPM from Worker to Prefork in order to be able to run PHP in LSAPI mode instead of slow CGI. There was an assumption that this could have had an effect, or that some additional Apache settings were required, but we could not switch back to Worker mode: when the Apache operating mode is changed, all site configs are rewritten, the process is not fast, and not everything could go smoothly.
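
As a quick sanity check, the currently active MPM and the loaded LSAPI module can be confirmed from the command line (a sketch for a stock CentOS 7 httpd install; binary and module names may differ on other builds):

# show which MPM Apache is currently using
httpd -V | grep -i mpm
# list loaded MPM and lsapi modules
httpd -M 2>/dev/null | grep -Ei 'mpm|lsapi'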

Adjusting Apache settings also did not give the desired result.

Along the way, we searched for similar problems in search engines. On one forum, participants claimed that the hoster had a problem and should be changed if the problem could not be solved. That does not sound very optimistic when you are on the other side, but you can understand the client: why would he need a hosting that does not work?

At this stage, we collected the available information and the results of the work performed, and contacted CloudLinux support with them.

Detailed diagnosis


For several days, CloudLinux support staff dug into the problem. Mostly the recommendations concerned the user limits. We checked this question as well: with limits disabled (the CageFS option for the user) and with limits enabled but PHP running as an Apache module, the problem was not observed. Based on this, it was suggested that CloudLinux was somehow involved. By the end of the week the request was escalated to the 3rd level of support, but there was still no solution.
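
For reference, taking a single user out of CageFS for the duration of such an experiment looks roughly like this (a sketch; 'username' is a placeholder, and the available options can be checked with cagefsctl --help):

# disable CageFS for the user, reproduce the requests, then re-enable it
cagefsctl --disable username
cagefsctl --enable username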

Along the way, we studied the Apache documentation on the CGI and LSAPI operating modes, brought up a second Apache instance on a different port of the hosting server with a test site, and excluded the influence of Nginx by sending requests directly to Apache and receiving the same error codes.
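
Sending a request straight to the backend is a one-liner (a sketch; the port and host name are placeholders for the second Apache instance):

# bypass Nginx and query the second Apache instance directly
curl -I -H "Host: sitename" http://127.0.0.1:8082/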

The LSAPI documentation on diagnosing 503 errors helped get things moving:
www.litespeedtech.com/support/wiki/doku.php/litespeed_wiki:php:503-errors
The Advanced Troubleshooting section suggests tracing the processes found in the system:

while true; do if mypid=`ps aux | grep $USERNAME | grep lsphp | grep $SCRIPTNAME | grep -v grep | awk '{print $2; }' | tail -1`; then strace -tt -T -f -p $mypid; fi ; done 

We modified this command so that the trace of each process would be written to a file named with its PID.

When viewing the trace files, we saw the same lines in some of them:

cat trace.* | tail
...
47307 21:33:04.137893 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=42053, si_uid=0} ---
47307 21:33:04.140728 +++ killed by SIGHUP +++
...

If you look at the description of the siginfo_t structure that accompanies a signal, you will see that

pid_t si_pid;    /* Sending process ID */

indicates the identifier of the process that sent the signal.

By the time we studied the traces, the process with PID 42053 no longer existed in the system, so while capturing traces we decided to also monitor the processes sending the SIGHUP signal.
Under the spoiler are the steps that allowed us to determine which process it was, obtain its trace, and collect additional information about which processes it sends the SIGHUP signal to.

Trace Method
Console 1.

tail -f /var/www/httpd-logs/sitename.error.log

Console 2.

while true; do if mypid=`ps aux | grep $USERNAME | grep lsphp | grep "sitename" | grep -v grep | awk '{print $2; }' | tail -1`; then strace -tt -T -f -p $mypid -o /tmp/strace/trace.$mypid; fi ; done

Console 3.

while true; do if mypid=`cat /tmp/strace/trace.* | grep si_pid | cut -d '{' -f 2 | cut -d'=' -f 4 | cut -d',' -f 1`; then ps aux | grep $mypid; fi; done;

Console 4.

seq 1 10000 | xargs -i sh -c "curl -I http://sitename/" 

We wait for messages to appear in console 1; as soon as console 4 shows a request with a 503 response code, we interrupt execution in console 4.

As a result, we got the name of the process: cagefsctl --rebuild-alt-php-ini.

This process was running on the system once a minute.
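
Such periodicity is easy to confirm by polling the process list (a sketch using pgrep):

# print a timestamp whenever a cagefsctl rebuild process appears
while true; do pgrep -af 'cagefsctl --rebuild-alt-php-ini' && date; sleep 5; done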

We trace several cagefsctl processes to track at least one from start to finish:

for i in `seq 1 100`; do strace -p $(ps ax | grep cagefsctl | grep rebuild-alt-php-ini | grep -v grep | awk '{print $1}') -o /tmp/strace/cagefsctl.trace.$(date +%s); done;

Next, we examine what it was doing, for example:

cat /tmp/strace/cagefsctl.trace.1593197892 | grep SIGHUP

We also obtained the identifiers of the processes that were terminated by the SIGHUP signal; the terminated processes were the PHP processes running at that moment.

This data was passed to CloudLinux to clarify whether this process is legitimate and whether it should run that often.

Later we received the answer that the cagefsctl --rebuild-alt-php-ini command works correctly; the only caveat is that it was being executed too often. It is normally invoked during a system update or when PHP settings are changed.

The only remaining clue was to check which process is the parent of cagefsctl.
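
Finding the parent of a running process takes one more ps invocation (a sketch; the pgrep pattern assumes the rebuild process is currently running):

# get the PID of the rebuild process, then show its parent's PID and command
pid=$(pgrep -f 'cagefsctl --rebuild-alt-php-ini' | head -1)
ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
ps -o pid,comm,args -p "$ppid"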

The result was not long in coming, and to our surprise the parent of cagefsctl turned out to be the ispmgrnode process. This was a bit strange, because the ISP Manager logging level was set to the maximum and cagefsctl did not appear in ispmgr.log.

Now there was enough data to contact ISP System support.

Summary


The problem was triggered by an update of ISP Manager. In general, updating ISP Manager is a normal situation, but in this case it launched a synchronization process that ended with an error and was restarted every minute. The synchronization process invoked cagefsctl, which in turn terminated the PHP processes.

The reason the synchronization process got stuck was the equipment upgrade work carried out on the hosting. A few months before the problem appeared, a PCI-e NVMe drive had been installed in the server, an XFS partition had been created on it and mounted into the /var directory, and user files had been moved there, but the disk quotas had not been updated. The mount options alone were not enough; it was also necessary to change the file system type in the ISP Manager settings, because it invokes the disk quota update commands, and for Ext4 and XFS these commands are different.
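
For illustration, the quota tooling really does differ between the two file systems (a sketch; the device name, mount point and options are placeholders): ext4 uses the classic quota utilities, while XFS quotas are enabled via mount options and managed with xfs_quota.

# ext4: (re)build and report quotas with the classic tools
quotacheck -vum /var
repquota /var
# XFS: quotas are enabled at mount time, e.g. in fstab (illustrative line):
#   /dev/nvme0n1p1  /var  xfs  defaults,uquota  0 0
xfs_quota -x -c 'report -h' /var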

Thus, the problem made itself felt a few months after the work.

Conclusions


We created the problem ourselves, but that was not clear until the very last moment. For the future, we will try to take as many nuances as possible into account. Thanks to the help of our more experienced colleagues from CloudLinux and ISP System support, the problem was resolved. Our hosting is now working stably, and we have gained experience that will be useful to us in future work.

P.S.: I hope you found the material interesting and that it helps someone solve a similar problem faster.

Source