Diagnosing a sick website getting 500,000 to 1 million views a day

So today I had a customer that had some woes. I don’t even think they were aware they were getting 504’s but I had to come up with some novel ways to

A) show them where teh failure happened
B) Show them the failed pages that failed to load (i.e. get a 504 gateway timeout)
C) show them the number of requests and how they changed based on the day of the outage, and a ‘regular normal’ day.
D) show them specific type of pages which are failing to give better idea of where the failure was

In this case a lot of the failures were .html pages, so it could be a cache was being triggered too much, it could be that their application was really inefficient, or in many cases, were catalog search requests which no doubt would scratch the db pretty nastily if the database or the query wasn’t refactored or designed with scalability in mind.

With all that in mind I explained to the customer, even the most worrysome (or woesome) of applications and frameworks, and even the most grizzly of expensive MySQL queries can be combatted, simply by a more adaptable or advanced cache mechanism. Anyway, all of that out of the way, I said to them it’s important to understand the nature of the problem with the application, since in their case were getting a load average of over 600.

I don’t have their solution,. I have the solution to showing them the problem. Enter the sysad, blazing armour, etc etc. Well, thats the way it’s _supposed_ to happen !

cat /var/log/httpd/access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | wc -l
26081

cat /var/log/httpd/access_log | grep '27/Mar' | grep 'HTTP/1.1" 50' | wc -l
2

So we can see 504’s the day before wasn’t an issue, but how many requests did the site get for each day comparatively?

[root@anon-WEB1 httpd]# cat access_log | grep '26/Mar' | wc -l
437598
[root@anon-WEB1 httpd]# cat access_log | grep '25/Mar' | wc -l
339445

The box received 25% more traffic, but even based from the figures in the SAR, cpuload had gone up 1500% beyond what the 32 cores on their server could do. Crazy. It must be because requests are getting queued or rather ‘building up’, and there are so many requests reaching apache, hitting the request for mysql, that either mysql formed a bottleneck and might need more memory, or, at this scale, a larger or smaller (probably larger) sized packet for the request, this can speed up significantly how fast the memory bucket fills and empties, and request queue gets killed. Allowing it to build up is going to be a disaster, because it will mean not just slow queries get a 504 gateway timeout, but also normal requests to regular html pages too (or even cached pages), since at that stage the cpu is completely overwhelmed.

I wrote a script,

to find a majority of the 504’s for the 26 Mar you can use this piece:

cat access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | awk {'print $7'}

to generate a unique list for your developer/team of pages which failed you can run:

cat access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | awk {'print $7'} | sort | uniq

To be honest. In the simplicity of this post somewhere, is a stroke of inspiration (if not ingenuity). Also it’s kind of hacky and crap, but, it does work and it is effective for doing the job.

AND that is What counts.

Fixing phpmyadmin, Connection for controluser as defined in your configuration failed.

This happens when the phpmyadmin package is installed, but for some reason or another the automation the package manager and phpmyadmin have for setting up the phpmyadmin user, and phpmyadmin database doesn’t properly use the table schema from /usr/share. Here is the process of fixing this error for those that get it.

Create a database called phpmyadmin

create database phpmyadmin;

You can actually call the database anything as long as you remember what you changed it to later.

Create a database user

MariaDB [(none)]> GRANT ALL PRIVILEGES ON phpmyadmin.* to phpmyadmin@localhost identified by 'AVERYSECUREpasswordgoeshere98123123sdabcsd123' ;
Query OK, 0 rows affected (0.00 sec)

Locate the create_tables.sql file copied by the package manager (or zip if installing from source)


[root@linuxbox phpMyAdmin]# find /usr/share | grep create_table
/usr/share/phpMyAdmin/sql/create_tables.sql
/usr/share/phpMyAdmin/sql/create_tables_drizzle.sql
/usr/share/phpMyAdmin/libraries/display_create_table.lib.php
/usr/share/phpMyAdmin/test/libraries/PMA_display_create_table_test.php

Import the database schema

# Check the file is correct
[root@wlinuxbox---wp phpMyAdmin]# vi /usr/share/phpMyAdmin/sql/create_tables.sql

# Import it
[root@wlinuxbox wp phpMyAdmin]# mysql -u root -p < /usr/share/phpMyAdmin/sql/create_tables.sql
Enter password:

Afterwards you will need to make phpmyadmin aware of the creds in /etc/phpMyAdmin/config.inc.php

vi /etc/phpMyAdmin/confing.inc.php

Confirm your changes

[root@wd-linux-8---wp phpMyAdmin]# cat /etc/phpMyAdmin/config.inc.php | grep -A3 phpmyadmin
 * wiki <http://wiki.phpmyadmin.net>.
 */

/*
--
$cfg['Servers'][$i]['controluser']   = 'phpmyadmin';          // MySQL control user settings
                                                    // (this user must have read-only
$cfg['Servers'][$i]['controlpass']   = 'AVERYSECUREpasswordgoeshere98123123sdabcsd123';          // access to the "mysql/user"

$cfg['Servers'][$i]['pmadb']         = 'phpmyadmin'

Your work is done, and that pesky error is gone now phpmyadmin has it’s DB. This tutorial has been a long time coming as I see this all the time.

Retrieving SMART status from a SDA disk attached to a MegaRAID card

Today I realised that manually checking the smart status of a disk required a bit more.

[root@box ~]# smartctl -a -d megaraid,0 /dev/sda
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-696.16.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               SEAGATE
Product:              ST3146356SS
Revision:             HS10
User Capacity:        146,815,733,760 bytes [146 GB]
Logical block size:   512 bytes
Logical Unit id:      -----
Serial number:        ----
Device type:          disk
Transport protocol:   SAS
Local Time is:        Thu Mar 15 05:18:57 2018 CDT
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     32 C
Drive Trip Temperature:        68 C
Elements in grown defect list: 15
Vendor (Seagate) cache information
  Blocks sent to initiator = 3694557980
  Blocks received from initiator = 4259977977
  Blocks read from cache and sent to initiator = 2859908284
  Number of read and write commands whose size <= segment size = 1099899109
  Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 65098.07
  number of minutes until next internal SMART test = 23

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   105645673        6         0  105645679   105645679      65781.538           0
write:         0        0        38        38         45      48511.618           7
verify: 48452245        7         0  48452252   48452259      43540.092           7

Non-medium error count:       48

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                  16       1                 - [-   -    -]
# 2  Background short  Completed                  16       0                 - [-   -    -]

In order to retrieve this detail you need to use -d megaraid,n where n is the disk id number. Try 0, 1, 2, 3, etc. Or use the megaraidCLI to get a list of all the disks. I dunno I thought it was worth mentioning at least. It always pays to check this if customer is having weird I/O troubles. Quite a lot of detail is provided about errors the disk encounters. So looking here, even if SMART OK. Gives you an idea if any test failing for disk.

Converting a QEMU qcow2 cloud server image to an native disk img and putting on physical disk

Got this question at work a lot. Thought I’d finally get around to putting it down since it’s came up for me. I’ve got a virtual machine using virtio passthrough for my pcie, and I found actually that disk access via the qcow2 is pretty naff.

sudo apt-get install qemu-kvm

qemu-img convert windows10cloudimage.qcow2 -O raw diskimage.img

dd if=/path/to/windos10cloudimage.qcow2 of=/dev/sdc2

Please note in my case the physical partition I’d made was sdc2. I’d actually resized another 5TB disk I have in my system using gparted. Just so I can attach a physical partition with libvirtd. Evidently though libvirtd-manager doesn’t allow this business so I have to edit the xlm file in /etc/qemu/windows10.xml .

 

root@adam:/etc/libvirt/qemu# virsh  define /etc/libvirt/qemu/win10-uefi.xml 
Domain win10-uefi defined from /etc/libvirt/qemu/win10-uefi.xml
root@adam:/etc/libvirt/qemu# virt-manager

yeah baby!


You could alternatively do it all in one like below, though you may desire a copy of the img file as well as putting it to the disk.

qemu-img convert windows10.qcow2 -O raw /dev/sdc

Enabling MySQL Slow Query Logs

In the case your seeing very long pageload times, and have checked your application. It always pays to check the way in which the database performs when interacting with the application, especially if they are on either same or seperate server, as these significantly affect the way that your application will run.


mysql> SET GLOBAL slow_query_log = 'ON' ;
Query OK, 0 rows affected (0.01 sec)

mysql> SET GLOBAL slow_query_log_file = '/slow_query_logs/slow_query_logs.txt';
Query OK, 0 rows affected (0.00 sec)

mysql> SET GLOBAL long_query_time = 5;
Query OK, 0 rows affected (0.00 sec)

Migrating a Plesk site after moving keeps going to default plesk page

So today a customer had this really weird issue where we could see that the website domain that had been moved from one server to a new plesk server, wasn’t correctly loading. It actually turned out to be simple, and when trying to access a file on the domain like I would get the phpinfo.php file.

curl http://www.customerswebsite.com/info.php 

This suggested to me the website documentroot was working, and the only thing missing was probably the index. This is what it actually did turn out to me.

I wanted to test though that info.php really was in this documentroot, and not some other virtualhost documentroot, so I moved the info.php file to randomnumbers12313.php and the page still loaded, this confirms by adding that file on the filesystem that all is well, and that I found correct site, important when troubleshooting vast configurations.

I also found a really handy one liner for troubleshooting which file it comes out, this might not be great on a really busy server, but you could still grep for your IP address as well.

Visit the broken/affected website we will troubleshoot

curl -I somecustomerswebsite.com

Give all visitors to all apache websites occurring now whilst we visit it ourselves for testing

tail -f /var/log/httpd/*.log 

This will show us which virtualhost and/or path is being accessed, from where.

Give only visitors to all apache websites occurring on a given IP

tail -f /var/log/httpd/*.log  | grep 4.2.2.4

Where 4.2.2.4 is your IP address your using to visit the site. If you don’t know what your Ip is type icanhazip into google, or ‘what is my ip’, job done.

Fixing the Plesk website without a directory index

[root@mehcakes-App1 conf]# plesk bin domain --update somecustomerswebsite.com -nginx-serve-php true -apache-directory-index index.php

Simple enough… but could be a pain if you don’t know what your looking for.

Site keeps on going down because of spiders

So a Rackspace customer was consistently having an issue with their site going down, even after the number of workers were increased. It looked like in this customers case they were being hit really hard by yahoo slurp, google bot, a href bot, and many many others.

So I checked the hour the customer was affected, and found that over that hour just yahoo slurp and google bot accounted for 415 of the requests. This made up like 25% of all the requests to the site so it was certainly a possibility the max workers were being reached due to spikes in traffic from bots, in parallel with potential spikes in usual visitors.

[root@www logs]#  grep '01/Mar/2017:10:' access_log | egrep -i 'www.google.com/bot.html|http://help.yahoo.com/help/us/ysearch/slurp' |  wc -l
415

It wasn’t a complete theory, but was the best with all the available information I had, since everything else had been checked. The only thing that remains is the number of retransmits for that machine. All in all it was a victory, and this was so awesome, I’m now thinking of making a tool that will do this in more automated way.

I don’t know if this is the best way to find google bot and yahoo bot spiders, but it seems like a good method to start.

Count number of IP’s over a given time period in Apache Log

So, a customer had an outage, and wasn’t sure what caused it. It looked like some IP’s were hammering the site, so I wrote this quite one liner just to sort the IP’s numerically, so that uniq -c can count the duplicate requests, this way we can count exactly how many times a given IP makes a request in any given minute or hour:

Any given minute

# grep '24/Feb/2017:10:03' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any given hour

# grep '24/Feb/2017:10:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any Given day

# grep '24/Feb/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any Given Month

# grep '/Feb/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any Given Year

# grep '/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any given year might cause dupes though, and I’m sure there is a better way of doing that which is more specific

Comparing Files on the internet or CDN with MD5 to determine if they present same content

So, a customer today was having some issues with their CDN. They said that their SSL CDN was presenting a different image, than the HTTP CDN. So, I thought the best way to begin any troubleshooting process would firstly be to try and recreate those issues. To do that, I need a way to compare the files programmatically, enter md5sum a handly little shell application usually installed by default on most Linux OS.

[user@cbast3 ~]$ curl https://3485asd3jjc839c9d3-08e84cacaacfcebda9281e3a9724b749.ssl.cf3.rackcdn.com/companies/5825cb13f2e6c9632807d103/header.jpeg -o file ; cat file | md5sum
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  382k  100  382k    0     0  1726k      0 --:--:-- --:--:-- --:--:-- 1732k
e917a67bbe34d4eb2d4fe5a87ce90de0  -
[user@cbast3 ~]$ curl http://3485asd3jjc839c9d3-08e84cacaacfcebda9281e3a9724b749.r45.cf3.rackcdn.com/companies/5825cb13f2e6c9632807d103/header.jpeg -o file2 ; cat file2 | md5sum
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  382k  100  382k    0     0  2071k      0 --:--:-- --:--:-- --:--:-- 2081k
e917a67bbe34d4eb2d4fe5a87ce90de0  -

As we can see from the output of both, the md5sum (the hashing) of the two files is the same, this means there is a statistically very very very high chance the content is exactly the same, especially when passing several hundred characters or more. The hashing algorithm is combination based, so the more characters, the less likely same combination is of coming around twice!

In this case I was able to disprove the customers claim’s. Not because I wanted to, but because I wanted to solve their issue. These results show me, the issue must be, if it is with the CDN, with a local edgenode local to the customer having the issue. Since I am unable to recreate it from my location, it is therefore not unreasonable to assume that it is a client side issue, or a failure on our CDN edgenode side, local to the customer. That’s how I troubleshooted this, and quite happy with this one! Took about 2 minutes to do, and a few minutes to come up with. A quick and useful check indeed, which reduces the number of possibilities considerably in tracing down the issue!

Cheers &
Best wishes,
Adam

Please note the real CDN location has been altered for privacy reasons