Diagnosing a sick website getting 500,000 to 1 million views a day

So today I had a customer that had some woes. I don’t even think they were aware they were getting 504’s but I had to come up with some novel ways to

A) show them where teh failure happened
B) Show them the failed pages that failed to load (i.e. get a 504 gateway timeout)
C) show them the number of requests and how they changed based on the day of the outage, and a ‘regular normal’ day.
D) show them specific type of pages which are failing to give better idea of where the failure was

In this case a lot of the failures were .html pages, so it could be a cache was being triggered too much, it could be that their application was really inefficient, or in many cases, were catalog search requests which no doubt would scratch the db pretty nastily if the database or the query wasn’t refactored or designed with scalability in mind.

With all that in mind I explained to the customer, even the most worrysome (or woesome) of applications and frameworks, and even the most grizzly of expensive MySQL queries can be combatted, simply by a more adaptable or advanced cache mechanism. Anyway, all of that out of the way, I said to them it’s important to understand the nature of the problem with the application, since in their case were getting a load average of over 600.

I don’t have their solution,. I have the solution to showing them the problem. Enter the sysad, blazing armour, etc etc. Well, thats the way it’s _supposed_ to happen !

cat /var/log/httpd/access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | wc -l
26081

cat /var/log/httpd/access_log | grep '27/Mar' | grep 'HTTP/1.1" 50' | wc -l
2

So we can see 504’s the day before wasn’t an issue, but how many requests did the site get for each day comparatively?

[[email protected] httpd]# cat access_log | grep '26/Mar' | wc -l
437598
[[email protected] httpd]# cat access_log | grep '25/Mar' | wc -l
339445

The box received 25% more traffic, but even based from the figures in the SAR, cpuload had gone up 1500% beyond what the 32 cores on their server could do. Crazy. It must be because requests are getting queued or rather ‘building up’, and there are so many requests reaching apache, hitting the request for mysql, that either mysql formed a bottleneck and might need more memory, or, at this scale, a larger or smaller (probably larger) sized packet for the request, this can speed up significantly how fast the memory bucket fills and empties, and request queue gets killed. Allowing it to build up is going to be a disaster, because it will mean not just slow queries get a 504 gateway timeout, but also normal requests to regular html pages too (or even cached pages), since at that stage the cpu is completely overwhelmed.

I wrote a script,

to find a majority of the 504’s for the 26 Mar you can use this piece:

cat access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | awk {'print $7'}

to generate a unique list for your developer/team of pages which failed you can run:

cat access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | awk {'print $7'} | sort | uniq

To be honest. In the simplicity of this post somewhere, is a stroke of inspiration (if not ingenuity). Also it’s kind of hacky and crap, but, it does work and it is effective for doing the job.

AND that is What counts.

Adding a User with Sudo Access using visudo

Use the command visudo to access the /etc/sudoers file.

visudo

Uncomment this line:

## Allows people in group wheel to run all commands
# %wheel        ALL=(ALL)       ALL

So it looks like:

## Allows people in group wheel to run all commands
 %wheel        ALL=(ALL)       ALL

Save the file then

Run this command for your user

usermod -aG wheel usernameforsudoaccesshere

Your done.

But test it

su usernamewithsudoaccess
sudo yum history

Any root only command is a good enough test for this. The command should run succesfully after re-providing your users password for sudo access.

Calculating the Average Hits per minute en-mass for thousands of sites

So, I had a customer having some major MySQL woes, and I wanted to know whether the MySQL issues were query related, as in due to the frequency of queries alone, or the size of the database. VS it being caused by the number of visitors coming into apache, therefore causing more frequency of MySQL hits, and explaining the higher CPU usage.

The best way to achieve this is to inspect /var/log/httpd with ls -al,

First we take a sample of all of the requests coming into apache2, as in all of them.. provided the customer has used proper naming conventions this isn’t a nightmare. Apache is designed to make this easy for you by the way it is setup by default, hurrah!

[[email protected] logparser]# time tail -f /var/log/httpd/*access_log > allhitsnow
^C

real	0m44.560s
user	0m0.006s
sys	0m0.031s

Time command prefixed here, will tell you how long you ran it for.

[[email protected] logparser]# cat allhitsnow | wc -l
1590

The above command shows you the number of lines in allhitsnow file, which was written to with all the new requests coming into sites from all the site log files. Simples! 1590 queries a minute is quite a lot.

Lysncd fails with Error: Terminating since out of inotify watches.

This is a remarkably easy problem to solve.

From man:

The inotify API provides a mechanism for monitoring filesystem
       events.  Inotify can be used to monitor individual files, or to
       monitor directories.  When a directory is monitored, inotify will
       return events for the directory itself, and for files inside the
       directory.

The problem is a kernel based value, that determines how much memory is used by processes such as lsyncd. Lets just double check what value it is set to presently:

Check Inotify max_user_watches value presently

[[email protected] php]# cat /proc/sys/fs/inotify/max_user_watches
100000

Check the amount of available memory on the server to ensure it can deal with the increase

[[email protected] php]# free -m
             total       used       free     shared    buffers     cached
Mem:          7358       6577        781          1        376       2427
-/+ buffers/cache:       3773       3584
Swap:            0          0          0

The cache can be added to free, as a maximum saturation. Linux uses lots of memory it doesn’t really need for efficiency.

Each used inotify watch uses 1 kB RAM in 64 bit systems, for 32bit its half.

Increase Inotify watches to a higher value

sysctl fs.inotify.max_user_watches=200000

Less Ghetto Log Parser for Website Hitcount/Downtime Analysis

Yesterday I created a proof of concept script, which basically goes off and identifies the hitcounts of a website, and can give a technician within a short duration of time (minutes instead of hours) exactly where hitcounts are coming from and where.

This is kind of a tradeoff, between a script that is automated, and one that is flexible.

The end goal is to provide a hitcount vs memory commit metric value. A NEW TYPE OF METRIC! HURRAH! (This is needed by the industry IMO).

And also would be nice to generate graphing and mean, average, and ranges, etc. So can provide output like ‘stat’ tool. Here is how I have progress

#!/bin/bash
#
# Author: 	Adam Bull, Cirrus Infrastructure, Rackspace LTD
# Date: 	March 20 2017
# Use:		This script automates the analysis of webserver logs hitcounts and
# 		provides a breakdown to indicate whether outages are caused by website visits
#		In correlation to memory and load avg figures


# Settings

# What logfile to get stats for
logfile="/var/log/httpd/google.com-access.log"

# What year month and day are we scanning for minute/hour hits
year=2017
month=Mar
day=9

echo "Total HITS: MARCH"
grep "/$month/$year" "$logfile" | wc -l;

# Hours
for i in 0{1..9} {10..24};

do echo "      > 9th March 2017, hits this $i hour";
grep "$day/$month/$year:$i" "$logfile" | wc -l;

        # break down the minutes in a nested visual way thats AWsome

# Minutes
for j in 0{1..9} {10..60};
do echo "                  >>hits at $i:$j";
grep "$day/$month/$year:$i:$j" "$logfile" | wc -l;
done

done

Thing is, after I wrote this, I wasn’t really happy, so I refactored it a bit more;

#!/bin/bash
#
# Author: 	Adam Bull, Cirrus Infrastructure, Rackspace LTD
# Date: 	March 20 2017
# Use:		This script automates the analysis of webserver logs hitcounts and
# 		provides a breakdown to indicate whether outages are caused by website visits
#		In correlation to memory and load avg figures


# Settings

# What logfile to get stats for
logfile="/var/log/httpd/someweb.biz-access.log"

# What year month and day are we scanning for minute/hour hits
year=2017
month=Mar
day=9

echo "Total HITS: $month"
grep "/$month/$year" "$logfile" | wc -l;

# Hours
for i in 0{1..9} {10..24};

do
hitsperhour=$(grep "$day/$month/$year:$i" "$logfile" | wc -l;);
echo "    > $day $month $year, hits this $ith hour: $hitsperhour"

        # break down the minutes in a nested visual way thats AWsome

# Minutes
for j in 0{1..9} {10..59};
do
hitsperminute=$(grep "$day/$month/$year:$i:$j" "$logfile" | wc -l);
echo "                  >>hits at $i:$j  $hitsperminute";
done

done

Now it’s pretty leet.. well, simple. but functional. Here is what the output of the more nicely refined script; I’m really satisfied with the tabulation.

[[email protected] automation]# ./list-visits.sh
Total HITS: Mar
6019301
    > 9 Mar 2017, hits this  hour: 28793
                  >>hits at 01:01  416
                  >>hits at 01:02  380
                  >>hits at 01:03  417
                  >>hits at 01:04  408
                  >>hits at 01:05  385
^C

PHP5 newrelic agent not collecting data

So today I had a newrelic customer who was having issues after installing the newrelic php plugin. He couldn’t understand why it wasn’t collecting data. For it to collecting data you need to make sure newrelic-daemon process is running by using ps auxfwww | grep newrelic-daemon.

We check the process of the daemon is running

[[email protected] ~]# ps -ef | grep newrelic-daemon
root     26007 18914  0 09:59 pts/0    00:00:00 grep newrelic-daemon

We check the status of the daemon process

[[email protected] ~]# service newrelic-daemon status
newrelic-daemon is stopped...

Copy basic NewRelic configuration template to correct location

[[email protected] ~]# cp /etc/newrelic/newrelic.cfg.template /etc/newrelic/newrelic.cfg

Start the daemon

[[email protected] ~]# service newrelic-daemon start
Starting newrelic-daemon:                                  [  OK  ]

How To Enable GZIP Compression/mod_deflate in Apache

This is something quite easy to do.

Check that the deflate module is installed

# apachectl -M | grep deflate
 deflate_module (shared)
Syntax OK

Check the content encoding is enabled by using curl on the site

# curl -I -H 'Accept-Encoding: gzip,deflate' http://somewebsite.com/
HTTP/1.1 200 OK
Date: Fri, 23 Jan 2017 11:02:16 GMT
Server: Apache
X-Pingback: http://somewebsite.com/somesite.php
X-Powered-By: rabbits
Connection: close
Content-Type: text/html; charset=UTF-8

Create a file called /etc/httpd/conf.d/deflate.conf

<ifmodule mod_deflate.c>
    AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css text/javascript application/javascript
    DeflateCompressionLevel 8
</ifmodule>

Restart apache2

/etc/init.d/httpd reload
Reloading httpd:

retest the gzip content-encoding directive with curl using Accept-Encoding: gzip directive.

# curl -I -H 'Accept-Encoding: gzip,deflate' http://somesite.com/
HTTP/1.1 200 OK
Date: Fri, 25 Sep 2015 00:04:26 GMT
Server: Apache
X-Pingback: http://somesite.com/somesite.php
X-Powered-By: muppets
Vary: Accept-Encoding
Content-Encoding: gzip    <---- this line indicates that compression is active
Content-Length: 20
Connection: close
Content-Type: text/html; charset=UTF-8

For this, thanks to odin.com

Checking Webserver Logs and generating hit counts

So, I’ve been meaning to do this for a while. I’m sure your all familiar with this crappy oneliner where you’ll check the hits using wc -l against a webserver log for a given hour and use for i in seq or similar to get the last 10 minutes data, or the last 59 minutes. But getting hours and days is a bit harder. You need some additional nesting, and, it isn’t difficult for you to do at all!

for i in 0{1..9}; do echo "16/Jan/2017:06:$i"; grep "16/Jan/2017:06:$i" /var/log/httpd/soemsite.com-access_log | wc -l ; done

I improved this one, quite significantly by adding an additional for j, for the hour, and adding an additional 0 in front of {1..9}, this properly is matching the Apache2 log format and allows me to increment through all the hours of the day. to present ;-D All that is missing is some error checking when the last date in the file is, im thinking a tail and grep for the timecode from the log should be sufficient.

Here is the proud oneliner I made for this!

for j in 0{1..9}; do for i in 0{1..9} {10..59}; do echo "16/Jan/2017:$j:$i"; grep "16/Jan/2017:06:$i" /var/log/httpd/website.com-access_log | wc -l ; done; done

Tracing Down Network and Process Traffic Using Netfilter

Every now and then at Rackspace, as with any hosting provider. We do occasionally have issues where customers have left themselves open to attack. In such cases sometimes customers find their server is sending spam email, and is prone to other malware occurring on the Rackspace Network.

Due to AUP and other obligations, it can become a critical issue for both the uptime, and reputation of your site. In many cases, customers do not necessarily have forensic experience, and will struggle very hard to remove the malware. In some cases, the malware keeps on coming back, or, like in my customers case, you could see lots of extra network traffic still using tcpdump locally on the box.

Enter, netfilter, part of the Linux Kernel, and it is able, if you ask it, to track down where packets are coming from, on a process level. This is really handy if you have an active malware or spam process on your system, since you can find out exactly where it is, before doing more investigation. Such a method, also allows you to trace down any potential false positives, since the packet address is always included, you get a really nice overview.

To give you an idea, I needed to install a kernel with debuginfo, just to do this troubleshooting, however this depends on your distribution.

Updating your Kernel may be necessary to use netfilter debug

$yum history info 18

Transaction performed with:
    Installed     rpm-4.11.3-21.el7.x86_64                               @base
    Installed     yum-3.4.3-150.el7.centos.noarch                        @base
    Installed     yum-plugin-auto-update-debug-info-1.1.31-40.el7.noarch @base
    Installed     yum-plugin-fastestmirror-1.1.31-40.el7.noarch          @base
Packages Altered:
    Updated kernel-debuginfo-4.4.40-202.el7.centos.x86_64               @base-debuginfo
    Update                   4.4.42-202.el7.centos.x86_64               @base-debuginfo
    Updated kernel-debuginfo-common-x86_64-4.4.40-202.el7.centos.x86_64 @base-debuginfo
    Update                                 4.4.42-202.el7.centos.x86_64 @base-debuginfo

You could use a similar process using netfilter.ip.local_in, I suspect.

The Script

#! /usr/bin/env stap

# Print a trace of threads sending IP packets (UDP or TCP) to a given
# destination port and/or address.  Default is unfiltered.

global the_dport = 0    # override with -G the_dport=53
global the_daddr = ""   # override with -G the_daddr=127.0.0.1

probe netfilter.ip.local_out {
    if ((the_dport == 0 || the_dport == dport) &&
        (the_daddr == "" || the_daddr == daddr))
            printf("%s[%d] sent packet to %s:%d\n", execname(), tid(), daddr, dport)
}

Executing the Script

[[email protected] hacked]# chmod +x dns_probe.sh
[[email protected] hacked]# ./dns_probe.sh
Missing separate debuginfos, use: debuginfo-install kernel-3.10.0-514.2.2.el7.x86_64
swapper/3[0] sent packet to 78.136.44.6:0
sshd[25421] sent packet to 134.1.1.1:55336
sshd[25421] sent packet to 134.1.1.1:55336
swapper/3[0] sent packet to 78.136.44.6:0

I was a little bit concerned about the above output, it looks like swapper with pid 3, is doing something it wouldn’t normally do. Upon further inspection though, we find it is just the outgoing cloud monitoring call;

# nslookup 78.136.44.6
Server:		83.138.151.81
Address:	83.138.151.81#53

Non-authoritative answer:
6.44.136.78.in-addr.arpa	name = collector-lon-78-136-44-6.monitoring.rackspacecloud.com.

Authoritative answers can be found from: