Calculating the Average Hits per minute en-mass for thousands of sites

So, I had a customer having some major MySQL woes, and I wanted to know whether the MySQL issues were query related, as in due to the frequency of queries alone, or the size of the database. VS it being caused by the number of visitors coming into apache, therefore causing more frequency of MySQL hits, and explaining the higher CPU usage.

The best way to achieve this is to inspect /var/log/httpd with ls -al,

First we take a sample of all of the requests coming into apache2, as in all of them.. provided the customer has used proper naming conventions this isn’t a nightmare. Apache is designed to make this easy for you by the way it is setup by default, hurrah!

[root@box-DB1 logparser]# time tail -f /var/log/httpd/*access_log > allhitsnow
^C

real	0m44.560s
user	0m0.006s
sys	0m0.031s

Time command prefixed here, will tell you how long you ran it for.

[root@box-DB1 logparser]# cat allhitsnow | wc -l
1590

The above command shows you the number of lines in allhitsnow file, which was written to with all the new requests coming into sites from all the site log files. Simples! 1590 queries a minute is quite a lot.

Site keeps on going down because of spiders

So a Rackspace customer was consistently having an issue with their site going down, even after the number of workers were increased. It looked like in this customers case they were being hit really hard by yahoo slurp, google bot, a href bot, and many many others.

So I checked the hour the customer was affected, and found that over that hour just yahoo slurp and google bot accounted for 415 of the requests. This made up like 25% of all the requests to the site so it was certainly a possibility the max workers were being reached due to spikes in traffic from bots, in parallel with potential spikes in usual visitors.

[root@www logs]#  grep '01/Mar/2017:10:' access_log | egrep -i 'www.google.com/bot.html|http://help.yahoo.com/help/us/ysearch/slurp' |  wc -l
415

It wasn’t a complete theory, but was the best with all the available information I had, since everything else had been checked. The only thing that remains is the number of retransmits for that machine. All in all it was a victory, and this was so awesome, I’m now thinking of making a tool that will do this in more automated way.

I don’t know if this is the best way to find google bot and yahoo bot spiders, but it seems like a good method to start.

Count number of IP’s over a given time period in Apache Log

So, a customer had an outage, and wasn’t sure what caused it. It looked like some IP’s were hammering the site, so I wrote this quite one liner just to sort the IP’s numerically, so that uniq -c can count the duplicate requests, this way we can count exactly how many times a given IP makes a request in any given minute or hour:

Any given minute

# grep '24/Feb/2017:10:03' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any given hour

# grep '24/Feb/2017:10:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any Given day

# grep '24/Feb/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any Given Month

# grep '/Feb/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any Given Year

# grep '/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any given year might cause dupes though, and I’m sure there is a better way of doing that which is more specific

How to limit the amount of memory httpd is using on CentOS 7 with Cgroups

CentOS 7, introduced something called CGroups, or control groups which has been in the stable kernel since about 2006. The systemD unit is made of several parts, systemD unit resource controllers can be used to limit memory usage of a service, such as httpd control group in systemD.

# Set Memory Limits for a systemD unit
systemctl set-property httpd MemoryLimit=500MB

# Get Limits for a systemD Unit
systemctl show -p CPUShares 
systemctl show -p MemoryLimit

Please note that OS level support is not generally provided with managed infrastructure service level, however I wanted to help where I could hear, and it shouldn’t be that difficult because the new stuff introduced in SystemD and CGroups is much more powerful and convenient than using ulimit or similar.

Using Sar to tell a story

So, a customer is experiencing slowness/sluggishness in their app. You know there is not issue with the hypervisor from instinct, but instinct isn’t enough. Using tools like xentop, sar, bwm-ng are critical parts of live and historical troubleshooting.

Sar can tell you a story, if you can ask the storyteller the write questions, or even better, pick up the book and read it properly. You’ll understand what the plot, scenario, situation and exactly how to proceed with troubleshooting by paying attention to these data and knowing which things to check under certain circumstances.

This article doesn’t go in depth to that, but it gives you a good reference of a variety of tests, the most important being, cpu usage, io usage, network usage, and load averages.

CPU Usage of all processors

# Grab details live
sar -u 1 3

# Use historical binary sar file
# sa10 means '10th day' of current month.
sar -u -f /var/log/sa/sa10 

CPU Usage of a particular Processor

sar -P ALL 1 1

‘-P 1’ means check only the 2nd Core. (Core numbers start from 0).

sar -P 1 1 5

The above command displays real time CPU usage for core number 1, every 1 second for 5 times.

Observing Changes in Memory over time

sar -r 1 3

The above command provides memory stats every 1 second for a total of 3 times.

Observing Swap usage over time

sar -S 1 5

The above command reports swap statistics every 1 seconds, a total 3 times.

Overall I/O activity

sar -b 1 3 

The above command checks every 1 seconds, 3 times.

Individual Block Device I/O Activities

This is a useful check for LUN , block devices and other specific mounts

sar -d 1 1 
sar -p d

DEV – indicates block device, i.e. sda, sda1, sdb1 etc.

Total Number processors created a second / Context switches

sar -w 1 3

Run Queue and Load Average

sar -q 1 3 

This reports the run queue size and load average of last 1 minute, 5 minutes, and 15 minutes. “1 3” reports for every 1 seconds a total of 3 times.

Report Network Statistics

sar -n KEYWORD

KEYWORDS Available;

DEV – Displays network devices vital statistics for eth0, eth1, etc.,
EDEV – Display network device failure statistics
NFS – Displays NFS client activities
NFSD – Displays NFS server activities
SOCK – Displays sockets in use for IPv4
IP – Displays IPv4 network traffic
EIP – Displays IPv4 network errors
ICMP – Displays ICMPv4 network traffic
EICMP – Displays ICMPv4 network errors
TCP – Displays TCPv4 network traffic
ETCP – Displays TCPv4 network errors
UDP – Displays UDPv4 network traffic
SOCK6, IP6, EIP6, ICMP6, UDP6 are for IPv6
ALL – This displays all of the above information. The output will be very long.

sar -n DEV 1 1

Specify Start Time

sar -q -f /var/log/sa/sa11 -s 11:00:00
sar -q -f /var/log/sa/sa11 -s 11:00:00 | head -n 10

Grabbing network activity from server without network utility

So, is it possible to look at a network interfaces activity without bwm-ng, iptraf, or other tools? Yes.

while true do
RX1=`cat /sys/class/net/${INTERFACE}/statistics/rx_bytes`
TX1=`cat /sys/class/net/${INTERFACE}/statistics/tx_bytes`
DOWN=$(($RX1-$RX2))
UP=$(($TX1-$TX2))
DOWN_Bits=$(($DOWN * 8 ))
UP_Bits=$(($UP * 8 ))
DOWNmbps=$(( $DOWN_Bits >> 20 ))
UPmbps=$(($UP_Bits >> 20 ))
echo -e "RX:${DOWN}\tTX:${UP} B/s | RX:${DOWNmbps}\tTX:${UPmbps} Mb/s"
RX2=$RX1; TX2=$TX1
sleep 1; done

I found this little gem yesterday, but couldn’t understand why they had not used clear. I guess they wanted to log activity or something… still this was a really nice find. I can’t remember where I found it yesterday but googling part of it should lead you to the original source 😀

Checking requests to apache2 webserver during downtime

A customer of ours was having some serious disruptions to his webserver, with 15 minute outages happening here and there. He said he couldn’t see an increase in traffic and therefore didn’t understand why it reached maxclients. Here was a quick way to prove whether traffic really increased or not by directly grepping the access logs for the time and day in question and using wc -l to count them, and a for loop to step thru the minutes of the hour in between the events.

Proud of this simple one.. much simpler than a lot of other scripts that do the same thing I’ve seen out there!

root@anonymousbox:/var/log/apache2# for i in `seq 01 60`;  do  printf "total visits: 13:$i\n\n"; grep "12/Jul/2016:13:$i" access.log | wc -l; done

total visits: 13:1

305
total visits: 13:2

474
total visits: 13:3

421
total visits: 13:4

411
total visits: 13:5

733
total visits: 13:6

0
total visits: 13:7

0
total visits: 13:8

0
total visits: 13:9

0
total visits: 13:10

30
total visits: 13:11

36
total visits: 13:12

30
total visits: 13:13

29
total visits: 13:14

28
total visits: 13:15

26
total visits: 13:16

26
total visits: 13:17

32
total visits: 13:18

37
total visits: 13:19

31
total visits: 13:20

42
total visits: 13:21

47
total visits: 13:22

65
total visits: 13:23

51
total visits: 13:24

57
total visits: 13:25

38
total visits: 13:26

40
total visits: 13:27

51
total visits: 13:28

51
total visits: 13:29

32
total visits: 13:30

56
total visits: 13:31

37
total visits: 13:32

36
total visits: 13:33

32
total visits: 13:34

36
total visits: 13:35

36
total visits: 13:36

39
total visits: 13:37

70
total visits: 13:38

52
total visits: 13:39

27
total visits: 13:40

38
total visits: 13:41

46
total visits: 13:42

46
total visits: 13:43

47
total visits: 13:44

39
total visits: 13:45

36
total visits: 13:46

39
total visits: 13:47

49
total visits: 13:48

41
total visits: 13:49

30
total visits: 13:50

57
total visits: 13:51

68
total visits: 13:52

99
total visits: 13:53

52
total visits: 13:54

92
total visits: 13:55

66
total visits: 13:56

75
total visits: 13:57

70
total visits: 13:58

87
total visits: 13:59

67
total visits: 13:60

root@anonymousbox:/var/log/apache2# for i in `seq 01 60`; do printf “total visits: 12:$i\n\n”; grep “12/Jul/2016:12:$i” access.log | wc -l; done
total visits: 12:1

169
total visits: 12:2

248
total visits: 12:3

298
total visits: 12:4

200
total visits: 12:5

341
total visits: 12:6

0
total visits: 12:7

0
total visits: 12:8

0
total visits: 12:9

0
total visits: 12:10

13
total visits: 12:11

11
total visits: 12:12

30
total visits: 12:13

11
total visits: 12:14

11
total visits: 12:15

13
total visits: 12:16

16
total visits: 12:17

28
total visits: 12:18

26
total visits: 12:19

10
total visits: 12:20

19
total visits: 12:21

35
total visits: 12:22

12
total visits: 12:23

19
total visits: 12:24

28
total visits: 12:25

25
total visits: 12:26

30
total visits: 12:27

43
total visits: 12:28

13
total visits: 12:29

24
total visits: 12:30

39
total visits: 12:31

35
total visits: 12:32

25
total visits: 12:33

22
total visits: 12:34

33
total visits: 12:35

21
total visits: 12:36

31
total visits: 12:37

31
total visits: 12:38

22
total visits: 12:39

39
total visits: 12:40

11
total visits: 12:41

18
total visits: 12:42

11
total visits: 12:43

28
total visits: 12:44

19
total visits: 12:45

27
total visits: 12:46

18
total visits: 12:47

17
total visits: 12:48

22
total visits: 12:49

29
total visits: 12:50

22
total visits: 12:51

31
total visits: 12:52

44
total visits: 12:53

38
total visits: 12:54

38
total visits: 12:55

41
total visits: 12:56

38
total visits: 12:57

32
total visits: 12:58

26
total visits: 12:59

31
total visits: 12:60

Tuning Apache2 for high Traffic Spikes

So this came up recently where a customer was asking if we could tune their apache2 for higher traffic. The best way to do this is to benchmark the site to double the traffic expected, this should be a good measure of whether the site is going to hold up..

# Use Apachebench to test the local requests
ab -n 1000000 -c 1000 http://localhost:80/__*index.html

Benchmarking localhost (be patient)
Completed 100000 requests
Completed 200000 requests
Completed 300000 requests
Completed 400000 requests
Completed 500000 requests
Completed 600000 requests
Completed 700000 requests
Completed 800000 requests
Completed 900000 requests
Completed 1000000 requests
Finished 1000000 requests

Server Software:        Apache/2.2.15
Server Hostname:        localhost
Server Port:            80

Document Path:          /__*index.html
Document Length:        5758 bytes

Concurrency Level:      1000
Time taken for tests:   377.636 seconds
Complete requests:      1000000
Failed requests:        115
   (Connect: 0, Receive: 0, Length: 115, Exceptions: 0)
Write errors:           0
Total transferred:      6028336810 bytes
HTML transferred:       5757366620 bytes
Requests per second:    2648.05 [#/sec] (mean)
Time per request:       377.636 [ms] (mean)
Time per request:       0.378 [ms] (mean, across all concurrent requests)
Transfer rate:          15589.21 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   52  243.0     22   15036
Processing:     0  282 1898.4     27   81404
Waiting:        0  270 1780.1     24   81400
Total:          6  334 1923.7     50   82432

Percentage of the requests served within a certain time (ms)
  50%     50
  66%     57
  75%     63
  80%     67
  90%     84
  95%   1036
  98%   4773
  99%   7991
 100%  82432 (longest request)



# During the benchmark test you may wish to use sar to indicate general load and io
stdbuf -o0 paste <(sar -q 10 100) <(sar 10 100) | awk '{printf "%8s %2s %7s %7s %7s %8s %9s %8s %8s\n", $1,$2,$3,$4,$5,$11,$13,$14,$NF}'

# Make any relevant adjustments to httpd.conf threads

# diff /etc/httpd/conf/httpd.conf /home/backup/etc/httpd/conf/httpd.conf
103,108c103,108
< StartServers       2000
< MinSpareServers    500
< MaxSpareServers   900
< ServerLimit      2990
< MaxClients       2990
< MaxRequestsPerChild  20000
---
> StartServers       8
> MinSpareServers    5
> MaxSpareServers   20
> ServerLimit      256
> MaxClients       256
> MaxRequestsPerChild  4000
-----------------------------------

In this case we increased the number of startservers and minspareservers. Thanks to Jacob for this.