Less Ghetto Log Parser for Website Hitcount/Downtime Analysis

Yesterday I created a proof of concept script which basically goes off and identifies the hitcounts of a website, and can give a technician, within a short duration of time (minutes instead of hours), exactly where the hitcounts are coming from and when.

This is kind of a tradeoff between a script that is automated and one that is flexible.

The end goal is to provide a hitcount vs memory commit metric value. A NEW TYPE OF METRIC! HURRAH! (This is needed by the industry IMO).

It would also be nice to generate graphing and the mean, range and other statistics, so it can provide output like the ‘stat’ tool. Here is my progress so far:

#!/bin/bash
#
# Author: 	Adam Bull, Cirrus Infrastructure, Rackspace LTD
# Date: 	March 20 2017
# Use:		This script automates the analysis of webserver logs hitcounts and
# 		provides a breakdown to indicate whether outages are caused by website visits
#		In correlation to memory and load avg figures


# Settings

# What logfile to get stats for
logfile="/var/log/httpd/google.com-access.log"

# What year month and day are we scanning for minute/hour hits
year=2017
month=Mar
day=09 # two digits, as it appears in the log (so we don't also match the 19th and 29th)

echo "Total HITS: MARCH"
grep "/$month/$year" "$logfile" | wc -l;

# Hours
for i in 0{0..9} {10..23};

do echo "      > 9th March 2017, hits this $i hour";
grep "$day/$month/$year:$i" "$logfile" | wc -l;

        # break down the minutes in a nested visual way that's awesome

# Minutes
for j in 0{0..9} {10..59};
do echo "                  >>hits at $i:$j";
grep "$day/$month/$year:$i:$j" "$logfile" | wc -l;
done

done

Thing is, after I wrote this, I wasn’t really happy, so I refactored it a bit more;

#!/bin/bash
#
# Author: 	Adam Bull, Cirrus Infrastructure, Rackspace LTD
# Date: 	March 20 2017
# Use:		This script automates the analysis of webserver logs hitcounts and
# 		provides a breakdown to indicate whether outages are caused by website visits
#		In correlation to memory and load avg figures


# Settings

# What logfile to get stats for
logfile="/var/log/httpd/someweb.biz-access.log"

# What year month and day are we scanning for minute/hour hits
year=2017
month=Mar
day=09 # two digits, as it appears in the log (so we don't also match the 19th and 29th)

echo "Total HITS: $month"
grep "/$month/$year" "$logfile" | wc -l;

# Hours
for i in 0{0..9} {10..23};

do
hitsperhour=$(grep "$day/$month/$year:$i" "$logfile" | wc -l);
echo "    > $day $month $year, hits in hour $i: $hitsperhour"

        # break down the minutes in a nested visual way that's awesome

# Minutes
for j in 0{0..9} {10..59};
do
hitsperminute=$(grep "$day/$month/$year:$i:$j" "$logfile" | wc -l);
echo "                  >>hits at $i:$j  $hitsperminute";
done

done

Now it’s pretty leet.. well, simple, but functional. Here is what the output of the more nicely refined script looks like; I’m really satisfied with the tabulation.

[root@822616-db1 automation]# ./list-visits.sh
Total HITS: Mar
6019301
    > 9 Mar 2017, hits this  hour: 28793
                  >>hits at 01:01  416
                  >>hits at 01:02  380
                  >>hits at 01:03  417
                  >>hits at 01:04  408
                  >>hits at 01:05  385
^C
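As a small step towards the ‘stat’-style output mentioned at the top, the per-minute lines can be piped into awk for a min/max/mean summary. This is just a minimal sketch layered on top of the script, not part of it, and it assumes the output format shown above (the count is the last field of each ‘>>hits at’ line):

./list-visits.sh | awk '/>>hits at/ {
    c = $NF; n++; sum += c                 # count is the last field on each minute line
    if (n == 1 || c < min) min = c
    if (c > max) max = c
} END {
    if (n) printf "minutes: %d  min: %d  max: %d  mean: %.1f\n", n, min, max, sum/n
}'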

Migrating a Plesk site after moving keeps going to default plesk page

So today a customer had this really weird issue where we could see that a website domain, which had been moved from one server to a new Plesk server, wasn’t loading correctly. It actually turned out to be simple: when trying to access a file on the domain directly, like the info.php below, I would get the phpinfo page just fine.

curl http://www.customerswebsite.com/info.php 

This suggested to me that the website documentroot was working, and the only thing missing was probably the index. This is what it actually turned out to be.

I wanted to test, though, that info.php really was in this documentroot, and not in some other virtualhost’s documentroot, so I moved the info.php file to randomnumbers12313.php and the page still loaded. Confirming that a file you add on the filesystem is reachable over HTTP proves you have found the correct site, which is important when troubleshooting vast configurations.
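A quick way to run the same check without touching existing files is to drop a uniquely named marker file into the suspected documentroot and fetch it over HTTP; the path below is purely illustrative, adjust it to the vhost you are checking:

# hypothetical documentroot for the Plesk vhost
echo "marker" > /var/www/vhosts/customerswebsite.com/httpdocs/marker-98127312.txt
curl -s http://www.customerswebsite.com/marker-98127312.txt   # "marker" back = right documentroot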

I also found a really handy one-liner for troubleshooting which virtualhost log a request actually lands in. This might not be great on a really busy server, but you could still grep for your IP address as well.

Visit the broken/affected website we will troubleshoot

curl -I somecustomerswebsite.com

Show all visitors to all Apache websites occurring right now, whilst we visit the site ourselves for testing

tail -f /var/log/httpd/*.log 

This will show us which virtualhost and/or path is being accessed, from where.

Show only the visitors to all Apache websites coming from a given IP

tail -f /var/log/httpd/*.log  | grep 4.2.2.4

Where 4.2.2.4 is the IP address you’re using to visit the site. If you don’t know what your IP is, type icanhazip into Google, or ‘what is my ip’. Job done.
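Or, straight from the shell of the machine you’re testing from, something like this does the job (icanhazip.com simply echoes your public IP back):

curl -4 icanhazip.com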

Fixing the Plesk website without a directory index

[root@mehcakes-App1 conf]# plesk bin domain --update somecustomerswebsite.com -nginx-serve-php true -apache-directory-index index.php

Simple enough… but it could be a pain if you don’t know what you’re looking for.

QID 150004 : Path-Based Vulnerability

A customer of ours had an issue with some paths, like theirwebsite.com/images, returning a 200 OK. Although the page was completely blank and exposed no information, it was detected as a positive indicator of exposed data, because of the 200 OK.

more detail: https://community.qualys.com/thread/16746-qid-150004-path-based-vulnerability

Actually in this case it was a ‘whitescreen’, or just a blank index page, put there to stop the Options +Indexes in the Apache httpd configuration from listing the contents of the images path. You probably don’t want that workaround and can just turn indexes off in the Options directive instead.

Change from:

Options +Indexes
# in older versions it may be defined as
Options Indexes

Change to:

Options -Indexes

This explicitly forbids directory listing. In older apache2 configurations the option may instead be written without the ‘+’ sign, simply as:

Options Indexes

in which case remove Indexes from that Options line (or replace it with -Indexes).

To prevent this being re-enabled via .htaccess, you could also add this to httpd.conf, ensuring that httpd.conf is enforced and takes precedence over any attacker or user who enables indexing mistakenly:

<Directory />
    Options FollowSymLinks
    AllowOverride None
</Directory>

Simple enough.

/etc/apache2/conf.d/security – Ubuntu 12.04.1 – Default File exploitable

In Ubuntu 12.04.1 there were some rather naughty security updates; specifically, the /etc/apache2/conf.d/security file has important lines commented out:

#<Directory />
#        AllowOverride None
#        Order Deny,Allow
#        Deny from all
#</Directory>

These lines set the default policy for the whole filesystem (and therefore for /var/www/ as well) to forbid all access; with them commented out, nothing is forbidden by default.

This is not good. In our customer’s case they also had a Listen 443 directive in their ports.conf, but hadn’t added any default virtualhost for port 443 SSL. This means that the /var/www directory becomes the documentroot (‘/’) for default HTTPS requests to the site. NOT GOOD, because if directory listing is also available it will list the contents of /var/www as well as exposing files that can be accessed directly. The directory listing makes every file visible, instead of only exposing files in /var/www/ that an attacker could fetch by guessing a URL like http://somehost.com/somefileinvarwww.sql. Naturally it’s much harder if the attacker has to guess the filenames, but still, not good!

NOT GOOD AT ALL. If the customer has a /var/www/vhosts/sites layout and is using /var/www for their database dumps or other files, it means those files could be retrieved.
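A quick way to check whether a given server is affected is to look at whether the Directory block is still commented out, and whether port 443 is being listened on without an SSL vhost; a rough sketch using the file paths above:

grep -n -A4 'Directory />' /etc/apache2/conf.d/security   # all lines prefixed with '#' = deny-all policy not applied
grep -n 'Listen' /etc/apache2/ports.conf                   # Listen 443 with no matching *:443 vhost is the risky combination
apache2ctl -S                                              # lists the configured virtualhosts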

The fix is simple: if you aren’t actually serving anything over SSL, comment out these lines in /etc/apache2/ports.conf.

Change from

Listen 443
NameVirtualHost *:443 

Change to

#Listen 443
#NameVirtualHost *:443 

Also make sure that the security file (/etc/apache2/conf.d/security) doesn’t have these lines commented out, as Ubuntu 12.04.1 and 12.10 may suffer from this; this is actually the main issue.

Change from:

#<Directory />
# AllowOverride None
# Order Deny,Allow
# Deny from all
#</Directory>

Change to:

<Directory />
 AllowOverride None
 Order Deny,Allow
 Deny from all
</Directory>

Restart your apache2

# Debian/Ubuntu (sysvinit)
service apache2 restart
/etc/init.d/apache2 restart

# Newer systemd-based releases
systemctl restart apache2

This exploitable/vulnerable configuration was changed in later updates to the Apache configuration packages in Ubuntu; however, it appears that for some people those packages are being held back, for a couple of reasons. First, it appears that this server was initially deployed on Ubuntu 12.10, a short-term release that reached end of life on May 16, 2014, and whose dist-upgrade path leads towards the Trusty Tahr LTS release.

I suspect that a significant contributor to the issue was that the release was unsupported by the vendor at the time it was affected. It also appears the customer deployed the vulnerable ports.conf file with Chef.
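To confirm whether the updated Apache configuration packages are among those being held back, a dry-run dist-upgrade is usually enough; the exact wording of apt’s output can vary between releases:

apt-get -s dist-upgrade | grep -A5 -i "kept back"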

For more information see:

http://mixeduperic.com/downloads/org-files/ubuntu/etcapache2confdsecurity-ubuntu-12041-default-file.html

Site keeps on going down because of spiders

So a Rackspace customer was consistently having an issue with their site going down, even after the number of workers was increased. It looked like, in this customer’s case, they were being hit really hard by Yahoo Slurp, Googlebot, the Ahrefs bot, and many many others.

So I checked the hour the customer was affected, and found that over that hour just Yahoo Slurp and Googlebot accounted for 415 of the requests. This made up roughly 25% of all the requests to the site, so it was certainly possible the max workers were being reached due to spikes in traffic from bots, in parallel with potential spikes in usual visitors.

[root@www logs]#  grep '01/Mar/2017:10:' access_log | egrep -i 'www.google.com/bot.html|http://help.yahoo.com/help/us/ysearch/slurp' |  wc -l
415

It wasn’t a complete theory, but it was the best I had with all the available information, since everything else had been checked. The only thing that remains to check is the number of retransmits for that machine. All in all it was a victory, and this was so awesome I’m now thinking of making a tool that will do this in a more automated way.

I don’t know if this is the best way to find google bot and yahoo bot spiders, but it seems like a good method to start.
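A more general variation I’d try, rather than grepping for specific bot URLs, is to count hits per user agent for the problem hour. With the default combined LogFormat the User-Agent is the sixth field when splitting on double quotes, though your LogFormat may differ:

grep '01/Mar/2017:10:' access_log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn | head -20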

Count number of IPs over a given time period in Apache Log

So, a customer had an outage and wasn’t sure what caused it. It looked like some IPs were hammering the site, so I wrote this quick one-liner to sort the IPs so that uniq -c can count the duplicate requests. This way we can count exactly how many times a given IP makes a request in any given minute or hour:

Any given minute

# grep '24/Feb/2017:10:03' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any given hour

# grep '24/Feb/2017:10:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any Given day

# grep '24/Feb/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any Given Month

# grep '/Feb/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any Given Year

# grep '/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any given year might cause false matches though, since ‘/2017:’ could also appear elsewhere in a line (in a request URL or referer, for example), and I’m sure there is a better, more specific way of doing that.
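If the goal is to spot the worst offenders, a variant that ranks the IPs by request count and shows just the top entries is probably more useful; same log path as above:

# grep '24/Feb/2017:10:' /var/www/html/website.com/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -20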

Installing Ioncube Zend Extension for PHP

A lot of customers note that sometimes the exact version of ionCube is not available for their specific version of PHP in their OS repository.

This isn’t really a big deal, and is actually something that can be manually installed.

cd ~
# detect the PHP major.minor version, e.g. 5.6 or 7.0
_php=$(php -r "echo PHP_MAJOR_VERSION.'.'.PHP_MINOR_VERSION;"); echo $_php
# fetch the 64-bit loader bundle and extract only the loader for this PHP version
wget http://downloads3.ioncube.com/loader_downloads/ioncube_loaders_lin_x86-64.tar.gz
tar -zxf ioncube_loaders_lin_x86-64.tar.gz ioncube/ioncube_loader_lin_$_php.so
chown -R root. ioncube
# move the loader into the PHP modules directory and register it as a zend_extension
\mv ioncube/ioncube_loader_lin_$_php.so /usr/lib64/php/modules/
echo "zend_extension=/usr/lib64/php/modules/ioncube_loader_lin_$_php.so" > /etc/php.d/01a-ioncube-loader.ini
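After restarting the webserver (or PHP-FPM), you can confirm the loader is active; when it is, php -v normally mentions the ionCube PHP Loader in its banner:

php -v
php -m | grep -i ioncube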

Thanks to Alex Drapkin for this.

How To Enable GZIP Compression/mod_deflate in Apache

This is something quite easy to do.

Check that the deflate module is installed

# apachectl -M | grep deflate
 deflate_module (shared)
Syntax OK

Check whether content encoding is already enabled by using curl on the site. In the response below there is no Content-Encoding header, so compression isn’t active yet:

# curl -I -H 'Accept-Encoding: gzip,deflate' http://somewebsite.com/
HTTP/1.1 200 OK
Date: Fri, 23 Jan 2017 11:02:16 GMT
Server: Apache
X-Pingback: http://somewebsite.com/somesite.php
X-Powered-By: rabbits
Connection: close
Content-Type: text/html; charset=UTF-8

Create a file called /etc/httpd/conf.d/deflate.conf

<IfModule mod_deflate.c>
    AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css text/javascript application/javascript
    DeflateCompressionLevel 8
</IfModule>

Reload Apache (httpd)

/etc/init.d/httpd reload
Reloading httpd:

Retest the content encoding with curl, again sending the Accept-Encoding: gzip header.


# curl -I -H 'Accept-Encoding: gzip,deflate' http://somesite.com/
HTTP/1.1 200 OK
Date: Fri, 25 Sep 2015 00:04:26 GMT
Server: Apache
X-Pingback: http://somesite.com/somesite.php
X-Powered-By: muppets
Vary: Accept-Encoding
Content-Encoding: gzip    <---- this line indicates that compression is active
Content-Length: 20
Connection: close
Content-Type: text/html; charset=UTF-8
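To see the effect on payload size rather than just the header, curl’s write-out variable can compare the bytes transferred with and without compression; swap in your own site:

curl -so /dev/null -w '%{size_download} bytes (uncompressed)\n' http://somesite.com/
curl -so /dev/null -H 'Accept-Encoding: gzip' -w '%{size_download} bytes (gzip)\n' http://somesite.com/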

For this, thanks to odin.com

Checking Webserver Logs and generating hit counts

So, I’ve been meaning to do this for a while. I’m sure you’re all familiar with this crappy one-liner, where you’ll check the hits using wc -l against a webserver log for a given hour, and use for i in seq or similar to get the last 10 minutes’ data, or the last 59 minutes. But getting hours and days is a bit harder. You need some additional nesting, and it isn’t difficult for you to do at all!

for i in 0{1..9}; do echo "16/Jan/2017:06:$i"; grep "16/Jan/2017:06:$i" /var/log/httpd/somesite.com-access_log | wc -l ; done

I improved this one quite significantly, by adding an additional for j for the hour, and adding an additional 0 in front of {1..9}. This properly matches the Apache2 log format and allows me to increment through all the hours of the day. All that is missing is some error checking to work out what the last date present in the file is; I’m thinking a tail and grep for the timecode from the log should be sufficient.

Here is the proud oneliner I made for this!

for j in 0{0..9} {10..23}; do for i in 0{0..9} {10..59}; do echo "16/Jan/2017:$j:$i"; grep "16/Jan/2017:$j:$i" /var/log/httpd/website.com-access_log | wc -l ; done; done
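For the missing error checking mentioned above, one approach is to pull the last timestamp out of the log first and only loop over that day; a sketch, assuming the standard combined log format with the timestamp in square brackets:

logfile=/var/log/httpd/website.com-access_log
# grab the dd/Mon/yyyy portion of the very last log line
lastdate=$(tail -n1 "$logfile" | awk -F'[][]' '{print $2}' | cut -d: -f1)
echo "last date seen in the log: $lastdate"
for j in 0{0..9} {10..23}; do
  echo "$lastdate:$j $(grep -c "$lastdate:$j:" "$logfile")"
done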

Configuring SFTP without chroot (the easy way)

So, I wouldn’t normally recommend this to customers. However, there are secure ways to add SFTP access without the SFTP subsystem having to be modified. It’s also possible to achieve a similar setup in a location like /home/john/public_html.

Let’s assume that public_html and everything underneath it is chowned john:john. So john:john has all the access, and apache2 runs with its own uid/gid. This was a pretty strange setup, and you don’t see it every day. But actually, it allowed me to solve another problem that I’ve seen customers have for a long time: the problem of effectively and easily managing permissions. Once I figured this out it was a serious ‘aha!’ moment. Here’s why.

Inside /etc/group, we find the customer’s developer has done something tragic:

[root@web public_html]# cat /etc/group | grep apache
apache:x:48:john,bob

But fine.. we’ll run with it.

We can see all the files inside their /home/john/public_html; the sight is not good:

]# ls -al 
total 232
drwxrwxr-x 27 john john  4096 Dec 20 15:56 .
drwxr-xr-x 12 john john  4096 Dec 15 11:08 ..
drwxrwxr-x 10 john john  4096 Dec 16 09:56 administrator
drwxrwxr-x  2 john john  4096 Dec 14 11:18 bin
drwxrwxr-x  4 john john  4096 Nov  2 15:05 build
-rw-rw-r--  1 john john   714 Nov  2 15:05 build.xml
drwxrwxr-x  3 john john  4096 Nov  2 15:05 c
drwxrwxr-x  3 john john 45056 Dec 20 13:09 cache
drwxrwxr-x  2 john john  4096 Dec 14 11:18 cli
drwxrwxr-x 32 john john  4096 Dec 14 11:18 components
-rw-rw-r--  1 john john  1863 Nov  2 15:05 configuration-live.php
-rw-r--r--  1 john john  3173 Dec 15 11:08 configuration.php
drwxrwxr-x  3 john john  4096 Nov  2 15:05 docs
drwxrwxr-x  8 john john  4096 Dec 16 17:17 .git
-rw-rw-r--  1 john john  1734 Dec 14 11:21 .gitignore

It gets worse..

# cat /etc/passwd | grep john
john:x:501:501::/home/john:/bin/sh

Now, adding an SFTP user into this might look like a nightmare, but actually, with some retrospective thought, it was really easy.

Solving this mess:

Install Scponly

yum install scponly

Create the new ‘SFTP’ user; the resulting /etc/passwd entry looks like this:

scponlyuser:x:504:505::/home/john:/usr/bin/scponly
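One way to create that user (a sketch; your uid/gid will differ) is with useradd, reusing john’s home directory and setting scponly as the shell:

useradd -M -d /home/john -s /usr/bin/scponly scponlyuser   # -M: don't create a home dir, we reuse /home/john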

Create a password for user scponlyuser

 
passwd scponlyuser

Solution to john:john permissions

[root@web public_html]# cat /etc/group | grep john
apache:x:48:john,bob
john:x:501:scponlyuser

We simply make scponlyuser part of the john group, which is what the second line there shows. That way, scponlyuser will have read/write access to the same files as the shell user (thanks to the group-writable permissions above), without exposing any additional stuff.
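Rather than editing /etc/group by hand, usermod does the same thing (-a appends, so existing group memberships are kept):

usermod -aG john scponlyuser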

This was a cool solution to this customer’s insecure setup, which they wanted to keep the way it was, and it was also a great way to add an SFTP account without requiring a chroot jail. Whether it’s better than the chroot jail is really debatable; however, scponly enforces that this account can only be used for SCP/SFTP, achieving SFTP user access without a jail.

I was proud of this achievement.. it goes to show Linux permissions are really more flexible than we can imagine. Whether you really want to flex those permissions muscles, though, should be of concern. I advised this customer to change this setup, and remove the /bin/sh shell, among other things..

We finally test that SFTP is working as expected with the new scponlyuser:


sftp> rmdir test
sftp> get index.php
Fetching /home/john/public_html/index.php to index.php
/home/john/public_html/index.php                                                                                     100% 1420     1.4KB/s   00:00
sftp> put index.php
Uploading index.php to /home/john/public_html/index.php
index.php                                                                                                                100% 1420     1.4KB/s   00:00
sftp> mkdir test
sftp> rmdir test

Just replace ‘scponlyuser’ with whatever username you’re setting up. The only part where you need to keep the ‘scponly’ bit is /usr/bin/scponly, which is the restricted shell the user logs into. Apologies that scponly is so similar to scponlyuser ;-D

scponlyuser:x:504:505::/home/john:/usr/bin/scponly

I was very pleased with this! Hope that you find this useful too!