Diagnosing a sick website getting 500,000 to 1 million views a day

So today I had a customer with some woes. I don't even think they were aware they were getting 504s, but I had to come up with some novel ways to:

A) show them where the failure happened
B) show them the pages that failed to load (i.e. returned a 504 gateway timeout)
C) show them the number of requests and how they changed between the day of the outage and a 'regular, normal' day
D) show them the specific types of pages that were failing, to give a better idea of where the failure was

In this case a lot of the failures were .html pages, so it could be that the cache was being hit too hard, it could be that their application was really inefficient, or, as in many cases here, the failures were catalog search requests, which would no doubt hammer the database pretty nastily if the query or the schema wasn't refactored or designed with scalability in mind.

With all that in mind I explained to the customer that even the most worrisome (or woesome) of applications and frameworks, and even the most grizzly of expensive MySQL queries, can be combated simply by a more adaptable or advanced caching mechanism. With all of that out of the way, I said to them that it's important to understand the nature of the problem with the application, since in their case the server was hitting a load average of over 600.

I don't have their solution, but I do have a way of showing them the problem. Enter the sysad in blazing armour, etc etc. Well, that's the way it's _supposed_ to happen!

cat /var/log/httpd/access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | wc -l
26081

cat /var/log/httpd/access_log | grep '27/Mar' | grep 'HTTP/1.1" 50' | wc -l
2

So we can see 504s were a big problem on the 26th but not an issue at all the next day. But how many requests did the site get each day, comparatively?

[root@anon-WEB1 httpd]# cat access_log | grep '26/Mar' | wc -l
437598
[root@anon-WEB1 httpd]# cat access_log | grep '25/Mar' | wc -l
339445

The box received roughly 29% more traffic, but based on the figures in sar, the CPU load had gone up to something like 1500% beyond what the 32 cores on their server could handle. Crazy. It must be because requests were getting queued, or rather 'building up': so many requests were reaching Apache, and in turn hitting MySQL, that either MySQL formed a bottleneck and needed more memory, or, at this scale, a larger (or smaller, but probably larger) packet size for the requests could significantly change how fast the memory buffers fill and empty and how quickly the request queue drains. Allowing it to build up is going to be a disaster, because it means that not just slow queries get a 504 gateway timeout, but so do normal requests to regular HTML pages (or even cached pages), since at that stage the CPU is completely overwhelmed.
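
If you want to see when that build-up actually started, a per-hour breakdown of the same access log helps. This is just a quick sketch, assuming the standard combined-log timestamp format (e.g. [26/Mar/2018:14:05:01 +0000]) and no IPv6 client addresses (the extra colons would throw the field count off):

# requests per hour on the 26th; with ':' as the delimiter the hour is field 2
grep '26/Mar' access_log | awk -F: '{print $2":00"}' | sort | uniq -c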

I put a little script together for this. To find the majority of the 504s for 26 Mar you can use this piece:

cat access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | awk '{print $7}'

To generate a unique list of the failed pages for your developer/team, you can run:

cat access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | awk '{print $7}' | sort | uniq
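
As an extra step beyond the original one-liners, you can also rank the failing pages by how many times each one 504'd, which makes it obvious which templates or search URLs to hand to the developers first:

# count and rank the failing URLs, most-hit first
grep '26/Mar' access_log | grep 'HTTP/1.1" 50' | awk '{print $7}' | sort | uniq -c | sort -rn | head -20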

To be honest, somewhere in the simplicity of this post is a stroke of inspiration (if not ingenuity). It's also kind of hacky and crap, but it does work and it is effective for doing the job.

And that is what counts.

Retrieving SMART status from an sda disk attached to a MegaRAID card

Today I realised that manually checking the SMART status of a disk sitting behind a MegaRAID controller requires a bit more than the usual smartctl -a /dev/sda:

[root@box ~]# smartctl -a -d megaraid,0 /dev/sda
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-696.16.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               SEAGATE
Product:              ST3146356SS
Revision:             HS10
User Capacity:        146,815,733,760 bytes [146 GB]
Logical block size:   512 bytes
Logical Unit id:      -----
Serial number:        ----
Device type:          disk
Transport protocol:   SAS
Local Time is:        Thu Mar 15 05:18:57 2018 CDT
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     32 C
Drive Trip Temperature:        68 C
Elements in grown defect list: 15
Vendor (Seagate) cache information
  Blocks sent to initiator = 3694557980
  Blocks received from initiator = 4259977977
  Blocks read from cache and sent to initiator = 2859908284
  Number of read and write commands whose size <= segment size = 1099899109
  Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 65098.07
  number of minutes until next internal SMART test = 23

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   105645673        6         0  105645679   105645679      65781.538           0
write:         0        0        38        38         45      48511.618           7
verify: 48452245        7         0  48452252   48452259      43540.092           7

Non-medium error count:       48

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                  16       1                 - [-   -    -]
# 2  Background short  Completed                  16       0                 - [-   -    -]

In order to retrieve this detail you need to use -d megaraid,n where n is the disk ID number. Try 0, 1, 2, 3 and so on, or use the MegaRAID CLI to get a list of all the disks. I thought it was worth mentioning at least: it always pays to check this if a customer is having weird I/O troubles, because quite a lot of detail is provided about the errors the disk encounters. So even if the overall SMART status is OK, looking here gives you an idea of whether any test is failing for the disk.
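
If you're not sure how many drives sit behind the controller, a quick loop over the possible IDs does the trick. This is only a sketch, assuming no more than eight drives and that the array is presented as /dev/sda:

# poll SMART health for each drive behind the MegaRAID controller
for n in $(seq 0 7); do
    echo "=== megaraid,$n ==="
    smartctl -H -d megaraid,$n /dev/sda 2>/dev/null | grep -i 'health'
done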

Converting a QEMU qcow2 cloud server image to a native disk img and putting it on a physical disk

Got this question at work a lot. Thought I'd finally get around to putting it down since it came up for me. I've got a virtual machine using virtio and PCIe passthrough, and I found that disk access via the qcow2 is actually pretty naff.

sudo apt-get install qemu-kvm

# convert the qcow2 cloud image to a raw disk image
qemu-img convert windows10cloudimage.qcow2 -O raw diskimage.img

# write the raw image out to the physical partition (sdc2 in my case)
dd if=/path/to/diskimage.img of=/dev/sdc2

Please note in my case the physical partition I'd made was sdc2. I'd actually resized another 5TB disk I have in my system using gparted, just so I could attach a physical partition with libvirt. Evidently virt-manager doesn't allow this business through the GUI, so I had to edit the domain XML file in /etc/libvirt/qemu/win10-uefi.xml.
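
For reference, the sort of stanza you need inside the <devices> section looks something like this. It's only a sketch: the vdb target name is an assumption, and the source path should match whichever partition you wrote the image to:

<!-- attach the physical partition as a raw virtio block device -->
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/sdc2'/>
  <target dev='vdb' bus='virtio'/>
</disk>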


root@adam:/etc/libvirt/qemu# virsh  define /etc/libvirt/qemu/win10-uefi.xml 
Domain win10-uefi defined from /etc/libvirt/qemu/win10-uefi.xml
root@adam:/etc/libvirt/qemu# virt-manager

yeah baby!


You could alternatively do it all in one go, like below, though you may want to keep a copy of the raw img file as well as writing it to the disk.

qemu-img convert windows10.qcow2 -O raw /dev/sdc

Moving a WordPress site – much ado about nothing !

Have you noticed there is all kinds of advice on the internet about the best way to move WordPress websites? There are literally myriad ways to achieve this. One of the methods I read on wp.com was:

Changing Your Domain Name and URLs

Moving a website and changing your domain name or URLs (i.e. from http://example.com/site to http://example.com, or http://example.com to http://example.net) requires the following steps - in sequence.

    Download your existing site files.
    Export your database - go in to MySQL and export the database.
    Move the backed up files and database into a new folder - somewhere safe - this is your site backup.
    Log in to the site you want to move and go to Settings > General, then change the URLs. (ie from http://example.com/ to http://example.net ) - save the settings and expect to see a 404 page.
    Download your site files again.
    Export the database again.
    Edit wp-config.php with the new server's MySQL database name, user and password.
    Upload the files.
    Import the database on the new server.

I mean, these are truly horrifying steps to take, and I don't see the point at all. This is how I achieved it for one of my customers.

1. Take a dump of the customer's database.
2. Edit the dump, searching for 'siteurl' with vi:
vi mysqldump.sql
:?siteurl

And just swap out the values, confirming after editing the file:

[root@box]# cat somemysqldump.sql  | grep siteurl -A 2
(1, 'siteurl', 'https://www.newsiteurl.com', 'yes'),
(2, 'home', 'https://www.newsiteurl.com', 'yes'),
(3, 'blogname', 'My website name', 'yes'),
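
If you'd rather do the swap non-interactively, a couple of sed one-liners do the same job. This is just a sketch, and the old/new URLs are placeholders for your real values:

# placeholders: swap the old/new URLs for your actual ones
sed -i "s|'siteurl', 'https://www.oldsiteurl.com'|'siteurl', 'https://www.newsiteurl.com'|" mysqldump.sql
sed -i "s|'home', 'https://www.oldsiteurl.com'|'home', 'https://www.newsiteurl.com'|" mysqldump.sql
grep siteurl -A 2 mysqldump.sql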

Job done, no stress (cf. https://codex.wordpress.org/Moving_WordPress).

There might be additional bits, but this is certainly enough for them to access the wp-admin panel. If you have problems, add this line to the wp-config.php file:

define('RELOCATE',true);

Just before the line which says

/* That’s all, stop editing! Happy blogging. */

And then just do the import/restore as normal;

mysql -u newmysqluser -p newdatabase_to_import_to < old_database.sql

Simples! I really have no idea why it is made to be so complicated on other hosting sites or platforms.

How to limit the amount of memory httpd is using on CentOS 7 with Cgroups

CentOS 7 introduced something called cgroups, or control groups, which have been in the mainline kernel since 2.6.24 (around 2008). A systemd unit is made up of several parts, and systemd's resource controllers can be used to limit the memory usage of a service, such as the httpd control group.

# Set memory limits for a systemd unit
systemctl set-property httpd.service MemoryLimit=500M

# Get limits for a systemd unit
systemctl show -p CPUShares httpd.service
systemctl show -p MemoryLimit httpd.service
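
set-property persists the setting as a drop-in file, so if you prefer to manage it yourself, something like the following achieves the same thing. A sketch, assuming the unit is httpd.service and the drop-in filename is arbitrary:

# create a drop-in for httpd.service with the memory cap
mkdir -p /etc/systemd/system/httpd.service.d
cat > /etc/systemd/system/httpd.service.d/memory.conf <<'EOF'
[Service]
MemoryLimit=500M
EOF

# reload systemd and restart the service so the limit takes effect
systemctl daemon-reload
systemctl restart httpd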

Please note that OS-level support is not generally provided at the managed infrastructure service level, however I wanted to help where I could here, and it shouldn't be that difficult, because the new functionality introduced with systemd and cgroups is much more powerful and convenient than using ulimit or similar.

Aaron Mehar’s CBS to VHD solution for Rackspace Cloud

Hey. So another one of my colleagues put together this really awesome article. Although I was aware this could be done, he's done a really good job of putting together the procedure for turning your CBS BFV (boot from (network) volume) disk into a VHD file.

Rackspace CBS disks work over iSCSI and are presented via the network. The difference between the instance store on the hypervisor (used by cloud-server images) and CBS storage is that a CBS disk is not a VHD, but a disk presented over the network via iSCSI.

So, to take a VHD, or an equivalent cloud-server image snapshot, you need to image the disk manually, as well as convert it to VHD.

Taking an image of a volume directly is not possible, and it would not be downloadable anyway. However, there are some workarounds.

*** Please NOTE ***
This is not supported, and we cannot assist beyond these instructions. I could provide some clarity if required, however, my colleagues may not be able to help should I become unavailable.

If you just want the data, then you could simply download it to your local machine; however, if you need a VHD to create a local VM, the instructions below will achieve this.

Steps

Please take special care: making a mistake when working with the partitioner can wipe all your data.

1. Shut down the server.
2. Clone the disk by starting a volume clone, and start the server back up.
3. Attach the newly created clone to the server.
4. Create another new CBS volume of a slightly larger size (+5GB is OK).

Now that is done, we can image the disk. You will need to ensure you have the correct disks: the second disk with the data should be xvdb, and the new CBS volume should be xvdc.

Create a partition and filesystem on xvdc and mount it. Please see this guide: https://support.rackspace.com/how-to/prepare-your-cloud-block-storage-volume/
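
A condensed sketch of that guide's steps; the ext4 filesystem and the /mnt/cbsvolume1 mount point are assumptions, so adjust to taste:

# create a filesystem on the new, larger volume and mount it
mkfs.ext4 /dev/xvdc
mkdir -p /mnt/cbsvolume1
mount /dev/xvdc /mnt/cbsvolume1
df -h /mnt/cbsvolume1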

Then image xvdb to a file on xvdc:

   dd if=/dev/xvdb of=/mnt/cbsvolume1/myimage.dd

Then download the image to your workstation, install VirtualBox, and run the command below:

   VBoxManage convertfromraw myfile.dd myfile.vhd --format VHD


Help! I can’t login to my cloud-server even though I’ve reset my root password

The most common cause of this is that PermitRootLogin is set to no in sshd_config, although there might be other causes, like a really broken sshd_config rather than just that one variable. The procedure for looking into this is pretty much the same regardless of the breakage that has occurred.

Here’s the full procedure:

1) Put the server into rescue mode.
2) Log in to the cloud-server over SSH; note that rescue mode gives you a new temporary root password, allowing you to reset the password for SSH on the 'original disk'.
3) Once logged in, mount the /dev/xvdb device. This may be /dev/xvdb1 or /dev/xvdb2, but is usually /dev/xvdb1. Then chroot (change root) into the 'original disk'.

# Mount the old disk
mount /dev/xvdb1 /mnt

# Change root to the 'old disk'
chroot /mnt

# Set the new password for root on the old disk:

passwd
# enter the new password when prompted

and specifically ensure that /etc/ssh/sshd_config has this line:

PermitRootLogin no

changed to:

PermitRootLogin yes
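
If you'd rather make that change non-interactively while still inside the chroot, here's a quick sketch (assuming the line currently reads exactly 'PermitRootLogin no'):

# flip PermitRootLogin from no to yes and confirm the change
sed -i 's/^PermitRootLogin no/PermitRootLogin yes/' /etc/ssh/sshd_config
grep '^PermitRootLogin' /etc/ssh/sshd_config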

Your developer or sysad won't be able to log in until you reset the root password here, and if you do not know a username you can su to root from, it is absolutely critical to perform this work, otherwise you won't be able to access the server.

Also, once you have allowed the root login and changed the password to something you recognise, you will be able to exit rescue mode through the control panel and log in to the machine as normal.

For more detail about how to do this (although pretty much all the steps are here), please see:

https://support.rackspace.com/how-to/rackspace-cloud-essentials-rescue-mode-on-linux-cloud-servers/

I hope this helps you folks out some,

DISASTER RECOVERY! Exporting a Broken Cloud-server image VHD from Rackspace and attempting to recover data

Thanks to my colleague Marcin for the guestmount tools protip.

I wrote a previous guide which explains how to download/export a cloud-server image VHD from Rackspace Cloud when it is failing to build. It might allow you to perform data recovery even if the image can't be booted. I'm guessing someone is going to run into this sooner or later and will be pleased to find this article; it will at least give you a best shot at reading the VHD and recovering the data, since, as you might know already, just because the boot process or kernel is broken doesn't mean the data isn't there!

# A better article to use if you want to download via commandline
https://community.rackspace.com/products/f/25/t/3583

# My article doing this thru a web-browser which might be useful too for some customers
https://community.rackspace.com/products/f/25/t/7089

Once the image has been downloaded to your new cloud instance, you can use the 'libguestfs-tools' package (same name on Ubuntu and CentOS), which contains the tools necessary for mounting .vhd image files.

The command would be (read-only mode):

guestmount -a {image-name}.vhd -i --ro {mount-point}
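
A worked sketch with placeholder names (the image filename and mount point are assumptions for illustration):

# install the tools (use yum on CentOS), create a mount point, mount read-only
apt-get install libguestfs-tools
mkdir -p /mnt/recovery
guestmount -a exported-image.vhd -i --ro /mnt/recovery

# browse and copy out what you need, then unmount
ls /mnt/recovery
guestunmount /mnt/recovery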

Fixing 'InnoDB registration as a STORAGE ENGINE failed'

So you have a nice WordPress (or similar) site running, but then all of a sudden you get a nasty message telling you that the InnoDB registration as a storage engine failed.

# tail /var/log/mariadb/mariadb.log
InnoDB: If that is the case, please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/error-creating-innodb.html
160914  6:45:29 [ERROR] Plugin 'InnoDB' init function returned error.
160914  6:45:29 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
160914  6:45:29 [ERROR] Failed to initialize plugins.
160914  6:45:29 [ERROR] Aborting
160914  6:45:29 [Note] /usr/libexec/mysqld: Shutdown complete

160914 06:45:29 mysqld_safe mysqld from pid file /var/lib/mysql/wpstack.localdomain.pid ended

# systemctl status mariadb
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─limits.conf
   Active: failed (Result: exit-code) since Wed 2016-09-14 06:25:38 UTC; 2min 32s ago
  Process: 1434 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=1/FAILURE)
  Process: 1433 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
  Process: 1305 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
 Main PID: 1433 (code=exited, status=0/SUCCESS)

Sep 14 06:25:35 wpstack.localdomain systemd[1]: Starting MariaDB database se....
Sep 14 06:25:37 wpstack.localdomain mysqld_safe[1433]: 160914 06:25:37 mysqld...
Sep 14 06:25:37 wpstack.localdomain mysqld_safe[1433]: 160914 06:25:37 mysqld...
Sep 14 06:25:38 wpstack.localdomain systemd[1]: mariadb.service: control pro...1
Sep 14 06:25:38 wpstack.localdomain systemd[1]: Failed to start MariaDB data....
Sep 14 06:25:38 wpstack.localdomain systemd[1]: Unit mariadb.service entered....
Sep 14 06:25:38 wpstack.localdomain systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@wpstack ~]# systemctl start mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@wpstack ~]# systemctl start mariadb
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

This is naturally a problem, since until it is resolved you have no database service running. In most cases it is fairly easy to resolve, by moving the InnoDB log files (and, in stubborn cases, ibdata1) out of /var/lib/mysql:

mv /var/lib/mysql/ib_logfile0{,.bak}
mv /var/lib/mysql/ib_logfile1{,.bak}
mv /var/lib/mysql/ibdata1{,.bak}

These simply move the files aside, which lets InnoDB recreate them and carry on. Please note that this may not always fix your issue, and in some situations it might result in data loss, so it is advisable to take a backup of the database filesystem before proceeding.
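
A sketch of the whole sequence with that backup taken first (the backup path is just an example):

# stop the service and copy the datadir before touching anything
systemctl stop mariadb
cp -a /var/lib/mysql /var/lib/mysql.bak-$(date +%F)

# move the InnoDB files aside as above, then start the service again
systemctl start mariadb

# confirm InnoDB is registered as a storage engine
mysql -e "SHOW ENGINES;" | grep -i innodb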

Testing Rackspace Cloud-server Service-net Connectivity and creating an alarm

So, over the last few weeks my colleagues and I have been noticing a couple of issues with the cloud-servers' ServiceNet interface. Unfortunately, for some customers utilizing DBaaS instances this means that their cloud-server is often unable to communicate with its database backend.

The solution is a custom monitoring plugin that my colleague Marcin kindly put together for another customer of his.

The plugin script that goes on the server:

Create file:

vi /usr/lib/rackspace-monitoring-agent/plugins/servicenet.sh

Paste into file:

#!/bin/bash
#
# Ping check over the ServiceNet interface (eth1); prints a metric the
# monitoring agent can alarm on.
ping="/usr/bin/ping -W 1 -c 1 -I eth1 -q"

if [ -z "$1" ]; then
   # no target supplied - report critical
   echo -e "status CRITICAL\nmetric ping_check uint32 1"
   exit 1
else
   $ping "$1" &>/dev/null
   if [ "$?" -eq 0 ]; then
      echo -e "status OK\nmetric ping_check uint32 0"
      exit 0
   else
      echo -e "status CRITICAL\nmetric ping_check uint32 1"
      exit 1
   fi
fi
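
Make the plugin executable and give it a test run by hand before wiring up the alarm; the 10.x address below is just a placeholder for your database instance's ServiceNet IP:

chmod +x /usr/lib/rackspace-monitoring-agent/plugins/servicenet.sh
# placeholder target - substitute the ServiceNet address you actually need to reach
/usr/lib/rackspace-monitoring-agent/plugins/servicenet.sh 10.0.0.5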

Then create an alarm that uses the metric below:

if (metric["ping_check"] == 1) {
    return new AlarmStatus(CRITICAL, 'ServiceNet ping failed');
}
if (metric["ping_check"] == 0) {
    return new AlarmStatus(OK, 'ServiceNet ping OK');
}

Of course, for this to work the primary requirement is a Rackspace Cloud-server with the Rackspace Cloud Monitoring agent already installed on it.

Thanks again Marcin, for this golden nugget.