So this can be a bugbear for me: not seeing the progress of a dd. Without it you can't tell whether your block size is so poor that the transfer is never going to complete in reasonable time. Here's a quick look at how I did it. Splendid indeed!
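If you have GNU coreutils 8.24 or newer, dd can report its own progress; on older systems you can poke a running dd with SIGUSR1 instead. The paths below are illustrative only, so substitute your own source and target:

```shell
# Modern GNU dd prints a live byte count and throughput to stderr.
dd if=/dev/zero of=/dev/null bs=4M count=256 status=progress

# For an already-running dd (or older coreutils), SIGUSR1 makes dd
# print its current statistics without interrupting the copy:
# kill -USR1 "$(pgrep -x dd)"
```

Either way you get a running byte count, which quickly tells you whether your chosen bs is sensible.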
So, I had a customer having some major MySQL woes, and I wanted to know whether the MySQL issues were query related, i.e. caused by the frequency of queries alone or the size of the database, versus being caused by the number of visitors coming into Apache, which would drive up the frequency of MySQL hits and explain the higher CPU usage.
The best place to start is the logs in /var/log/httpd (an ls -al there shows you what you have to work with).
First we take a sample of all of the requests coming into Apache, as in all of them. Provided the customer has used sensible naming conventions this isn't a nightmare; Apache's default layout is designed to make this easy for you, hurrah!
[root@box-DB1 logparser]# time tail -f /var/log/httpd/*access_log > allhitsnow
^C
real    0m44.560s
user    0m0.006s
sys     0m0.031s
Prefixing the command with time tells you how long it ran before you interrupted it.
[root@box-DB1 logparser]# cat allhitsnow | wc -l
1590
The above command shows the number of lines in the allhitsnow file, which collected every new request arriving across all the site log files. Simples! 1590 requests in roughly 45 seconds, around 2,000 a minute, is quite a lot.
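To see where those hits are actually going, a quick breakdown of the most-requested paths helps. This sketch assumes the common/combined log format, where the request path is the 7th whitespace-separated field:

```shell
# Count the most requested paths in the captured sample.
# Assumes common/combined log format: the path is field 7.
awk '{print $7}' allhitsnow | sort | uniq -c | sort -rn | head
```

If one URL dominates, that points at a specific page (and its queries) rather than general traffic volume.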
So a Rackspace customer was consistently having an issue with their site going down, even after the number of workers was increased. In this customer's case they were being hit really hard by Yahoo Slurp, Googlebot, an ahrefs bot, and many, many others.
I checked the hour the customer was affected and found that Yahoo Slurp and Googlebot alone accounted for 415 of the requests. That made up roughly 25% of all requests to the site, so it was certainly possible that MaxClients was being reached due to spikes in bot traffic on top of spikes from ordinary visitors.
[root@www logs]# grep '01/Mar/2017:10:' access_log | egrep -i 'www.google.com/bot.html|http://help.yahoo.com/help/us/ysearch/slurp' | wc -l
415
It wasn't a complete theory, but it was the best one given all the available information, since everything else had been checked; the only thing that remained was the number of retransmits for that machine. All in all it was a victory, and it was awesome enough that I'm now thinking of making a tool to do this in a more automated way.
I don't know if this is the best way to find the Googlebot and Yahoo spiders, but it seems like a good place to start.
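A slightly more general approach is to tally hits per crawler by user-agent substring. The bot names below are just common user-agent fragments I'd look for first, not an exhaustive or authoritative list:

```shell
# Tally requests per crawler by case-insensitive user-agent substring.
# Bot names are illustrative; add whichever crawlers you care about.
for bot in Googlebot Slurp bingbot AhrefsBot; do
    printf '%-10s %s\n' "$bot" "$(grep -ci "$bot" access_log)"
done
```

This gives you a per-bot count for the whole log; combine it with a date/hour grep as above to narrow it to the outage window.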
So, a customer had an outage and wasn't sure what caused it. It looked like some IPs were hammering the site, so I wrote this quick one-liner: sort the IPs so that uniq -c can count the duplicate requests, then sort again by count. This way we can see exactly how many times a given IP makes a request in any given minute, hour, day, and so on:
Any given minute

# grep '24/Feb/2017:10:03' /var/www/html/website.com/access.log | awk '{print $1}' | sort | uniq -c | sort -rn

Any given hour

# grep '24/Feb/2017:10:' /var/www/html/website.com/access.log | awk '{print $1}' | sort | uniq -c | sort -rn

Any given day

# grep '24/Feb/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort | uniq -c | sort -rn

Any given month

# grep '/Feb/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort | uniq -c | sort -rn

Any given year

# grep '/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort | uniq -c | sort -rn
Matching a whole year might cause false positives though, since '/2017:' can match other parts of a line, and I'm sure there is a more specific way of doing it.
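One more robust sketch is to parse the timestamp field itself instead of pattern-matching the whole line. This assumes the standard [dd/Mon/yyyy:HH:MM:SS zone] timestamp in the fourth field of the common log format:

```shell
# Count requests per IP for a given year by splitting the timestamp
# explicitly: with this field separator, the year lands in field 6.
awk -v year=2017 -F'[][ /:]+' '$6 == year { print $1 }' access.log |
    sort | uniq -c | sort -rn
```

Because the year is compared as its own field, a stray "2017" elsewhere in the line can no longer match.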
CentOS 7 introduced systemd and, with it, first-class support for cgroups (control groups), a kernel feature in development since around 2006 and in the mainline kernel since 2.6.24. A systemd unit is made of several parts, and its resource-control properties can be used to limit the memory usage of a service, such as the httpd control group.
# Set a memory limit for a systemd unit
systemctl set-property httpd.service MemoryLimit=500M

# Show the limits for a systemd unit
systemctl show -p CPUShares httpd.service
systemctl show -p MemoryLimit httpd.service
Please note that OS-level support is not generally provided with the managed infrastructure service level; however, I wanted to help where I could here, and it shouldn't be that difficult, because the resource controls introduced with systemd and cgroups are much more powerful and convenient than ulimit or similar.
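systemctl set-property persists its change as a drop-in file; you can achieve the same thing declaratively with your own drop-in. The path and the values below are just an example to tune to your workload:

```ini
; /etc/systemd/system/httpd.service.d/limits.conf  (hypothetical example)
[Service]
MemoryLimit=500M
CPUShares=512
```

After creating the file, run systemctl daemon-reload and restart the service for it to take effect.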
So, a customer is experiencing slowness or sluggishness in their app. Instinct tells you there is no issue with the hypervisor, but instinct isn't enough. Tools like xentop, sar and bwm-ng are critical parts of live and historical troubleshooting.
sar can tell you a story, if you ask the storyteller the right questions, or even better, pick up the book and read it properly. You'll understand the plot, the situation, and exactly how to proceed with troubleshooting by paying attention to the data and knowing which things to check under which circumstances.
This article doesn't go into depth on that, but it gives you a good reference for a variety of tests, the most important being CPU usage, I/O usage, network usage, and load averages.
# Grab details live
sar -u 1 3

# Use a historical binary sar file
# sa10 means the 10th day of the current month
sar -u -f /var/log/sa/sa10
sar -P ALL 1 1

The above command displays real-time usage for every core, once.

sar -P 1 1 5

'-P 1' means check only the second core (core numbers start from 0). The above command displays real-time CPU usage for core number 1, every second, 5 times.
sar -r 1 3

The above command reports memory statistics every second, a total of 3 times.

sar -S 1 5

The above command reports swap statistics every second, a total of 5 times.

sar -b 1 3

The above command reports I/O and transfer-rate statistics every second, 3 times.
This is a useful check for LUNs, block devices and other specific mounts:

sar -d 1 1
sar -d -p 1 1

DEV indicates the block device, i.e. sda, sda1, sdb1 etc.; adding -p prints friendly device names (sda) instead of dev8-0 style identifiers.
sar -w 1 3

The above command reports task creation and context-switching activity every second, 3 times.
sar -q 1 3

This reports the run queue size and the load averages over the last 1, 5 and 15 minutes. '1 3' means every second, a total of 3 times.
sar -n KEYWORD
Available keywords:
DEV – Displays network devices vital statistics for eth0, eth1, etc.,
EDEV – Display network device failure statistics
NFS – Displays NFS client activities
NFSD – Displays NFS server activities
SOCK – Displays sockets in use for IPv4
IP – Displays IPv4 network traffic
EIP – Displays IPv4 network errors
ICMP – Displays ICMPv4 network traffic
EICMP – Displays ICMPv4 network errors
TCP – Displays TCPv4 network traffic
ETCP – Displays TCPv4 network errors
UDP – Displays UDPv4 network traffic
SOCK6, IP6, EIP6, ICMP6, UDP6 are for IPv6
ALL – This displays all of the above information. The output will be very long.
sar -n DEV 1 1
sar -q -f /var/log/sa/sa11 -s 11:00:00

The same from the historical file, trimmed to the first few lines (-s sets the start time; -e can set an end time to narrow the window further):

sar -q -f /var/log/sa/sa11 -s 11:00:00 | head -n 10
So, is it possible to look at a network interface's activity without bwm-ng, iptraf, or other tools? Yes.
#!/bin/bash
# Poll the kernel's per-interface byte counters directly.
INTERFACE=eth0   # set this to your interface
RX2=$(cat /sys/class/net/${INTERFACE}/statistics/rx_bytes)
TX2=$(cat /sys/class/net/${INTERFACE}/statistics/tx_bytes)
while true; do
    sleep 1
    RX1=$(cat /sys/class/net/${INTERFACE}/statistics/rx_bytes)
    TX1=$(cat /sys/class/net/${INTERFACE}/statistics/tx_bytes)
    DOWN=$((RX1 - RX2))
    UP=$((TX1 - TX2))
    DOWN_Bits=$((DOWN * 8))
    UP_Bits=$((UP * 8))
    DOWNmbps=$((DOWN_Bits >> 20))
    UPmbps=$((UP_Bits >> 20))
    echo -e "RX:${DOWN}\tTX:${UP} B/s | RX:${DOWNmbps}\tTX:${UPmbps} Mb/s"
    RX2=$RX1; TX2=$TX1
done
I found this little gem yesterday, but couldn't understand why they hadn't used clear; I guess they wanted to log activity or something. Still, this was a really nice find. I can't remember where I found it, but googling part of it should lead you to the original source 😀
A customer of ours was having some serious disruptions to their webserver, with 15-minute outages happening here and there. They said they couldn't see an increase in traffic and therefore didn't understand why MaxClients was being reached. Here was a quick way to prove whether traffic really increased or not: directly grep the access logs for the day and hour in question, use wc -l to count the hits, and loop through the minutes of the hour in between the events.
Proud of this simple one; much simpler than a lot of other scripts I've seen that do the same thing!
root@anonymousbox:/var/log/apache2# for i in `seq 01 60`; do printf "total visits: 13:$i\n\n"; grep "12/Jul/2016:13:$i" access.log | wc -l; done
total visits: 13:1
305
total visits: 13:2
474
total visits: 13:3
421
total visits: 13:4
411
total visits: 13:5
733
total visits: 13:6
0
total visits: 13:7
0
total visits: 13:8
0
total visits: 13:9
0
total visits: 13:10
30
total visits: 13:11
36
total visits: 13:12
30
total visits: 13:13
29
total visits: 13:14
28
total visits: 13:15
26
total visits: 13:16
26
total visits: 13:17
32
total visits: 13:18
37
total visits: 13:19
31
total visits: 13:20
42
total visits: 13:21
47
total visits: 13:22
65
total visits: 13:23
51
total visits: 13:24
57
total visits: 13:25
38
total visits: 13:26
40
total visits: 13:27
51
total visits: 13:28
51
total visits: 13:29
32
total visits: 13:30
56
total visits: 13:31
37
total visits: 13:32
36
total visits: 13:33
32
total visits: 13:34
36
total visits: 13:35
36
total visits: 13:36
39
total visits: 13:37
70
total visits: 13:38
52
total visits: 13:39
27
total visits: 13:40
38
total visits: 13:41
46
total visits: 13:42
46
total visits: 13:43
47
total visits: 13:44
39
total visits: 13:45
36
total visits: 13:46
39
total visits: 13:47
49
total visits: 13:48
41
total visits: 13:49
30
total visits: 13:50
57
total visits: 13:51
68
total visits: 13:52
99
total visits: 13:53
52
total visits: 13:54
92
total visits: 13:55
66
total visits: 13:56
75
total visits: 13:57
70
total visits: 13:58
87
total visits: 13:59
67
total visits: 13:60
root@anonymousbox:/var/log/apache2# for i in `seq 01 60`; do printf "total visits: 12:$i\n\n"; grep "12/Jul/2016:12:$i" access.log | wc -l; done
total visits: 12:1
169
total visits: 12:2
248
total visits: 12:3
298
total visits: 12:4
200
total visits: 12:5
341
total visits: 12:6
0
total visits: 12:7
0
total visits: 12:8
0
total visits: 12:9
0
total visits: 12:10
13
total visits: 12:11
11
total visits: 12:12
30
total visits: 12:13
11
total visits: 12:14
11
total visits: 12:15
13
total visits: 12:16
16
total visits: 12:17
28
total visits: 12:18
26
total visits: 12:19
10
total visits: 12:20
19
total visits: 12:21
35
total visits: 12:22
12
total visits: 12:23
19
total visits: 12:24
28
total visits: 12:25
25
total visits: 12:26
30
total visits: 12:27
43
total visits: 12:28
13
total visits: 12:29
24
total visits: 12:30
39
total visits: 12:31
35
total visits: 12:32
25
total visits: 12:33
22
total visits: 12:34
33
total visits: 12:35
21
total visits: 12:36
31
total visits: 12:37
31
total visits: 12:38
22
total visits: 12:39
39
total visits: 12:40
11
total visits: 12:41
18
total visits: 12:42
11
total visits: 12:43
28
total visits: 12:44
19
total visits: 12:45
27
total visits: 12:46
18
total visits: 12:47
17
total visits: 12:48
22
total visits: 12:49
29
total visits: 12:50
22
total visits: 12:51
31
total visits: 12:52
44
total visits: 12:53
38
total visits: 12:54
38
total visits: 12:55
41
total visits: 12:56
38
total visits: 12:57
32
total visits: 12:58
26
total visits: 12:59
31
total visits: 12:60
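One caveat with the loop above: seq 01 60 does not zero-pad its output, so the pattern "13:1" actually matches 13:10 through 13:19, which is why the first few minutes show inflated counts and minutes :6 to :9 show zero. A corrected sketch that pads the minute properly:

```shell
# Zero-pad the minute so "13:01" matches only that minute in the log.
for i in $(seq -w 0 59); do
    printf 'total visits: 13:%s  ' "$i"
    grep -c "12/Jul/2016:13:$i" access.log
done
```

grep -c also replaces the grep | wc -l pipeline, counting matching lines directly.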
So this came up recently when a customer asked if we could tune their Apache for higher traffic. The best way to do this is to benchmark the site at double the expected traffic; that should be a good measure of whether the site is going to hold up.
# Use ApacheBench to test with local requests
ab -n 1000000 -c 1000 http://localhost:80/__*index.html

Benchmarking localhost (be patient)
Completed 100000 requests
Completed 200000 requests
Completed 300000 requests
Completed 400000 requests
Completed 500000 requests
Completed 600000 requests
Completed 700000 requests
Completed 800000 requests
Completed 900000 requests
Completed 1000000 requests
Finished 1000000 requests

Server Software:        Apache/2.2.15
Server Hostname:        localhost
Server Port:            80
Document Path:          /__*index.html
Document Length:        5758 bytes
Concurrency Level:      1000
Time taken for tests:   377.636 seconds
Complete requests:      1000000
Failed requests:        115
   (Connect: 0, Receive: 0, Length: 115, Exceptions: 0)
Write errors:           0
Total transferred:      6028336810 bytes
HTML transferred:       5757366620 bytes
Requests per second:    2648.05 [#/sec] (mean)
Time per request:       377.636 [ms] (mean)
Time per request:       0.378 [ms] (mean, across all concurrent requests)
Transfer rate:          15589.21 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   52  243.0     22   15036
Processing:     0  282 1898.4     27   81404
Waiting:        0  270 1780.1     24   81400
Total:          6  334 1923.7     50   82432

Percentage of the requests served within a certain time (ms)
  50%     50
  66%     57
  75%     63
  80%     67
  90%     84
  95%   1036
  98%   4773
  99%   7991
 100%  82432 (longest request)

# During the benchmark you may wish to use sar to watch general load and I/O
stdbuf -o0 paste <(sar -q 10 100) <(sar 10 100) | awk '{printf "%8s %2s %7s %7s %7s %8s %9s %8s %8s\n", $1,$2,$3,$4,$5,$11,$13,$14,$NF}'

# Make any relevant adjustments to the httpd.conf worker settings
# diff /etc/httpd/conf/httpd.conf /home/backup/etc/httpd/conf/httpd.conf
103,108c103,108
< StartServers        2000
< MinSpareServers      500
< MaxSpareServers      900
< ServerLimit         2990
< MaxClients          2990
< MaxRequestsPerChild 20000
---
> StartServers           8
> MinSpareServers        5
> MaxSpareServers       20
> ServerLimit          256
> MaxClients           256
> MaxRequestsPerChild 4000
In this case we increased StartServers and MinSpareServers. Thanks to Jacob for this.
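When raising MaxClients like this, it's worth sanity-checking the new figure against memory. A rough back-of-envelope rule (my own rule of thumb, not an official formula) is the RAM you can give Apache divided by the average resident size of one httpd child. The numbers below are hypothetical; measure your real child RSS with ps:

```shell
# Rough MaxClients ceiling = RAM available to Apache / average child RSS.
# Measure average child RSS (in MB) with something like:
#   ps -ylC httpd | awk 'NR>1 {sum+=$8; n++} END {print sum/n/1024}'
ram_mb=7000        # RAM you can dedicate to Apache, in MB (hypothetical)
child_rss_mb=25    # average httpd child RSS, in MB (hypothetical)
echo $(( ram_mb / child_rss_mb ))   # prints 280
```

If your benchmarked MaxClients exceeds this ceiling, the box will start swapping under full load, which usually looks exactly like the 15-minute outages described above.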