Diagnosing a sick website getting 500,000 to 1 million views a day

So today I had a customer that had some woes. I don’t even think they were aware they were getting 504’s but I had to come up with some novel ways to

A) show them where teh failure happened
B) Show them the failed pages that failed to load (i.e. get a 504 gateway timeout)
C) show them the number of requests and how they changed based on the day of the outage, and a ‘regular normal’ day.
D) show them specific type of pages which are failing to give better idea of where the failure was

In this case a lot of the failures were .html pages, so it could be a cache was being triggered too much, it could be that their application was really inefficient, or in many cases, were catalog search requests which no doubt would scratch the db pretty nastily if the database or the query wasn’t refactored or designed with scalability in mind.

With all that in mind I explained to the customer, even the most worrysome (or woesome) of applications and frameworks, and even the most grizzly of expensive MySQL queries can be combatted, simply by a more adaptable or advanced cache mechanism. Anyway, all of that out of the way, I said to them it’s important to understand the nature of the problem with the application, since in their case were getting a load average of over 600.

I don’t have their solution,. I have the solution to showing them the problem. Enter the sysad, blazing armour, etc etc. Well, thats the way it’s _supposed_ to happen !

cat /var/log/httpd/access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | wc -l
26081

cat /var/log/httpd/access_log | grep '27/Mar' | grep 'HTTP/1.1" 50' | wc -l
2

So we can see 504’s the day before wasn’t an issue, but how many requests did the site get for each day comparatively?

[root@anon-WEB1 httpd]# cat access_log | grep '26/Mar' | wc -l
437598
[root@anon-WEB1 httpd]# cat access_log | grep '25/Mar' | wc -l
339445

The box received 25% more traffic, but even based from the figures in the SAR, cpuload had gone up 1500% beyond what the 32 cores on their server could do. Crazy. It must be because requests are getting queued or rather ‘building up’, and there are so many requests reaching apache, hitting the request for mysql, that either mysql formed a bottleneck and might need more memory, or, at this scale, a larger or smaller (probably larger) sized packet for the request, this can speed up significantly how fast the memory bucket fills and empties, and request queue gets killed. Allowing it to build up is going to be a disaster, because it will mean not just slow queries get a 504 gateway timeout, but also normal requests to regular html pages too (or even cached pages), since at that stage the cpu is completely overwhelmed.

I wrote a script,

to find a majority of the 504’s for the 26 Mar you can use this piece:

cat access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | awk {'print $7'}

to generate a unique list for your developer/team of pages which failed you can run:

cat access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | awk {'print $7'} | sort | uniq

To be honest. In the simplicity of this post somewhere, is a stroke of inspiration (if not ingenuity). Also it’s kind of hacky and crap, but, it does work and it is effective for doing the job.

AND that is What counts.

Renaming a folder en-mass

I was figuring out a way to rename a folder en-mass which took me a few seconds, instead of about an hour of clicking.

Linux automation is a win.

# mmv "Documentary - National Geometry series-*.avi" "#1.avi"

I had some old tv series but they were all prefixed with the same filename ‘Documentary – National Geometry series’ before their volume number and title. So I found the above command to strip that out for all 100 of them, in a single command. Basically #1 is where * is in the source command regular expression, so everything after the source expression and between the *.

It’s a really nice utility and preferred on BSD systems to perl rename imo

Rapid Troubleshooting mail not arriving at its destination

Today I had a customer whose mail was not arriving at it’s destination. IN my customers case, they knew that it was arriving at the destination but going into the spam folder. This is likely due to blacklisting, however some ISP or clients dont know it’s reaching the spam folder at the other end, and since the most common cause of this happening is due to a missing SPF (Sending policy framework record), MX record or DKIM record it is possibly to rapidly check the DNS of each using dig, if the sender domain is known.

To check for the IP Blacklistings on a mailserver use it’s ip in one of the many spam a checker;

https://mxtoolbox.com/SuperTool.aspx?action=blacklist

In other cases it might be being caused by email bounces, due to the PTR, MX or DKIM records, and not even getting into the inbox, you can see that on the sending mail server using a simple grep command;

cat /var/log/maillog | grep -i status=bounced

You probably want to save the file as well

cat /var/log/maillog | grep -i status=bounced > bouncedemail.txt 

If you wanted to know which domains bounced email, if you’ve ensured all sending domains are correctly configured via DNS..

[root@api ~]# cat /var/log/maillog | grep -i status=bounced | awk '{print $7}'
to=<[email protected]>,
to=<[email protected]>,

You could use sed to extract which domains failed..

Then you could use whois against the domains to reach out to the email contact with some automation that explains that ‘weve checked DKIM, MX and SPF and all are configured correctly and believe this is an error on your behalf, etc, blah blah’..

Such a thing would be a good idea to implement for some large email providers, and I’m sure you could automate the DNS checking as well. You could likely automate the whole thing, just by watching the logs and DNS records, and some intelligent grep and awking.

Not bad.

Enabling MySQL Slow Query Logs

In the case your seeing very long pageload times, and have checked your application. It always pays to check the way in which the database performs when interacting with the application, especially if they are on either same or seperate server, as these significantly affect the way that your application will run.


mysql> SET GLOBAL slow_query_log = 'ON' ;
Query OK, 0 rows affected (0.01 sec)

mysql> SET GLOBAL slow_query_log_file = '/slow_query_logs/slow_query_logs.txt';
Query OK, 0 rows affected (0.00 sec)

mysql> SET GLOBAL long_query_time = 5;
Query OK, 0 rows affected (0.00 sec)

Magento Rewrite Woes … really woes

I had a customer this week that had some terrible rewrite woes with their magento site. They knew that a whole ton of their images were getting 404’s most likely because rewrite wasn’t getting to the correct filesystem path that the file resided. This was due to their cache being broken, and their second developer not creating proper rewrite rule.

As a sysadmin our job is not a development role, we are a support role, and in order to enable the developer to fix the problem, the developer needs to be able to see exactly what it is, enter the sysads task. I wrote this really ghetto script, which essentially hunts in the nginx error log for requests that failed with no such file, and then qualifies them by grepping for jpg file types. This is not a perfect way of doing it, however, it is really effective at identifying the broken links.

Then I have a seperate routine that strips the each of the file uri’s down to the filename, and locates the file on the filesystem, and matches the filename on the filesystem that the rewrite should be going to, as well as the incorrect path that the rewrite is presently putting the url to. See the script below:

#!/bin/bash

# Author: Adam Bull
# Company: Rackspace LTD, Hayes
# Purpose:
#          This customer has a difficulty with nginx rewriting to the incorrect file
# this script goes into the nginx error.log and finds all the images which have the broken rewrite rules
# then after it has identified the broken rewrite rule files, it searches for the correct file on the filesystem
# then correlates it with the necessary rewrite rule that is required
# this could potentially be used for in-place upgrade by developers
# to ensure that website has proper redirects in the case of bugs with the ones which exist.

# This script will effectively find all 404's and give necessary information for forming a rewrite rule, i.e. the request url, from nginx error.log vs the actual filesystem location
# on the hard disk that the request needs to go to, but is not being rewritten to file path correctly already

# that way this data could be used to create rewrite rules programmatically, potentially
# This is a work in progress


# These are used for display output
cat /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep 'No such file' | awk '{print "URL Request:",$21,"\nFilesystem destination missing:",$7"\n"}'
zcat /var/log/nginx/*error*.gz  | grep 'No such file' | awk '{print "URL Request:",$21,"\nFilesystem destination detected missing:",$7"\n"}'

# These below are used for variable population for locating actual file paths of missing files needed to determine the proper rewrite path destination (which is missing)
# we qualify this with only *.jpg files

cat /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep 'No such file' | awk '{print $7}' | sed 's/\"//g' |  sed 's/.*\///' | grep jpg > lost.txt
zcat /var/log/nginx/*error*.gz  | grep 'No such file' | awk '{print $7}' | sed 's/\"//g' |  sed 's/.*\///' | grep jpg >> lost.txt

# FULL REQUEST URL NEEDED AS WELL
cat /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep 'No such file' | awk '{print "http://mycustomerswebsite.com",$21}' | sed 's/\"//g' | grep jpg > lostfullurl.txt
zcat /var/log/nginx/*error*.gz  | grep 'No such file' | awk '{print "http://customerwebsite.com/",$21}' | sed 's/\"//g' | grep jpg >> lostfullurl.txt

# The below section is used for finding the lost files on filesystem and pairing them together in variable pairs
# for programmatic usage for rewrite rules


while true
do
  read -r f1 <&3 || break
  read -r f2 <&4 || break
  printf '\n\n'
  printf 'Found a broken link getting a 404 at : %s\n'
  printf "$f1\n"
  printf 'Locating the correct link of the file on the filesystem: %s\n'
        find /var/www/magento | grep $f2
done 3<lostfullurl.txt 4<lost.txt

I was particularly proud of the last section which uses a ‘dual loop for two input files’ in a single while statement, allowing me to achieve the descriptions above.

Output is in the form of:


Found a broken link getting a 404 at :
http://customerswebsite.com/media/catalog/product/cache/1/image/800x700/9df78eab33525d08d6e5fb8d27136e95/b/o/image-magick-file-red.jpg
Locating the correct link of the file on the filesystem:
/var/www/magento/media/catalog/product/b/o/image-magick-file-red.jpg

As you can see the path is different on the filesystem to the url that the rewrite is putting the request to, hence the 404 this customer is getting.

This could be a really useful script, and, I see no reason why the script could not generate the rewrite rules programatically from the 404 failures it finds, it could actually create rules that are necessary to fix the problem. Now, this is not an ideal fix, however the script will allow you to have an overview either to fix this properly as a developer, or as a sysadmin to patch up with new rewrite rules.

I’m really proud of this one, even though not everyone may see a use for it. There really really is, and this customer is stoked, think of it like this, how can a developer fix it if he doesn’t have a clear idea of the things that are broken, and this is the sysads job,

Cheers &
Best wishes,
Adam

Recovering Corrupt RPM DB

I had a support specialist today that had an open yum task they couldn’t kill gracefully, after kill -9, this was happening

	
[root@mybox home]# yum list | grep -i xml
rpmdb: Thread/process 31902/140347322918656 failed: Thread died in Berkeley DB library
error: db3 error(-30974) from dbenv-&gt;failchk: DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db3 -  (-30974)
error: cannot open Packages database in /var/lib/rpm
CRITICAL:yum.main:
Error: rpmdb open failed
[root@mybox home]#

The solution to fix this is to manually erase the yumdb and to manually rebuild it;


[root@mybox home]# rm -f /var/lib/rpm/__*
[root@mybox home]# rpm --rebuilddb

Ghetto but simple Log Parser for testing website performance

So… I got fedup with constantly writing my own stuff for basic things. I’m going to turn this into something more spectacular that accepts commandline input, and also, allows you to define which days, and months, ranges, and stuff like that.

It’s a no-frills-ghetto log parser.

#!/bin/bash

echo "Total HITS: MARCH"
grep "/Mar/2017" /var/log/httpd/somewebsite.com-access_log | wc -l;

for i in 0{1..9} {10..24};

do echo "      > 9th March 2017, hits this $i hour";
grep "09/Mar/2017:$i" /var/log/httpd/somesite.com-access_log | wc -l;

        # break down the minutes in a nested visual way thats AWsome
for j in 0{1..9} {10..60};
do echo "                  >>hits at $i:$j";
grep "09/Mar/2017:$i:$j" /var/log/httpd/somesite.com-access_log | wc -l;
done

done

It’s not perfect, it’s just a proof of concept, really.

PHP5 newrelic agent not collecting data

So today I had a newrelic customer who was having issues after installing the newrelic php plugin. He couldn’t understand why it wasn’t collecting data. For it to collecting data you need to make sure newrelic-daemon process is running by using ps auxfwww | grep newrelic-daemon.

We check the process of the daemon is running

[root@rtd-production-1 ~]# ps -ef | grep newrelic-daemon
root     26007 18914  0 09:59 pts/0    00:00:00 grep newrelic-daemon

We check the status of the daemon process

[root@rtd-production-1 ~]# service newrelic-daemon status
newrelic-daemon is stopped...

Copy basic NewRelic configuration template to correct location

[root@rtd-production-1 ~]# cp /etc/newrelic/newrelic.cfg.template /etc/newrelic/newrelic.cfg

Start the daemon

[root@rtd-production-1 ~]# service newrelic-daemon start
Starting newrelic-daemon:                                  [  OK  ]

Site keeps on going down because of spiders

So a Rackspace customer was consistently having an issue with their site going down, even after the number of workers were increased. It looked like in this customers case they were being hit really hard by yahoo slurp, google bot, a href bot, and many many others.

So I checked the hour the customer was affected, and found that over that hour just yahoo slurp and google bot accounted for 415 of the requests. This made up like 25% of all the requests to the site so it was certainly a possibility the max workers were being reached due to spikes in traffic from bots, in parallel with potential spikes in usual visitors.

[root@www logs]#  grep '01/Mar/2017:10:' access_log | egrep -i 'www.google.com/bot.html|http://help.yahoo.com/help/us/ysearch/slurp' |  wc -l
415

It wasn’t a complete theory, but was the best with all the available information I had, since everything else had been checked. The only thing that remains is the number of retransmits for that machine. All in all it was a victory, and this was so awesome, I’m now thinking of making a tool that will do this in more automated way.

I don’t know if this is the best way to find google bot and yahoo bot spiders, but it seems like a good method to start.