Disk subsystems can be a significant performance bottleneck on an SDElements system if incorrectly configured or under-provisioned.
When working with disk systems there are two important performance metrics: throughput and IOPS.
Throughput is the speed of sequential data transfer. While this is an important metric for an application such as a file server, it is mostly irrelevant for systems that depend on databases and non-sequential data access. Random seek operations are measured in Input/Output Operations Per Second (IOPS) and are a much better indicator of SDElements performance.
When a server reaches a bottleneck in total throughput, this shows up in the data rates being transferred to the disk; if random seek operations are the cause, it instead shows up as CPU load (the CPU spends its time waiting for IO). This can be determined by looking at the IOWAIT on the server. Note that this measurement should be taken as close to peak load as possible.
The simplest way to determine IOWAIT across the server is to use the "top" utility, where the "wa" percentage indicates how much of the CPU time is spent waiting for disk IO.
[sde_admin@server ~]$ top
top - 15:30:53 up 173 days, 18:53, 1 user, load average: 0.40, 0.18, 0.12
Tasks: 183 total, 1 running, 182 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.3%us, 2.5%sy, 0.0%ni, 96.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3924512k total, 2963652k used, 960860k free, 335512k buffers
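The "wa" figure can also be extracted non-interactively, which is useful for spot checks or scripting. A minimal sketch follows; it parses a sample Cpu(s) line like the one above, and in practice you would pipe in `top -bn1` instead:

```shell
# Extract the iowait ("wa") percentage from a top Cpu(s) line.
# The sample line below stands in for real output from `top -bn1`.
cpu_line='Cpu(s):  1.3%us,  2.5%sy,  0.0%ni, 96.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st'
iowait=$(printf '%s\n' "$cpu_line" | awk -F',' '{
    # Find the comma-separated field containing "%wa" and strip
    # everything except the numeric value.
    for (i = 1; i <= NF; i++)
        if ($i ~ /%wa/) { gsub(/[^0-9.]/, "", $i); print $i }
}')
echo "iowait: ${iowait}%"
```

A sustained "wa" of more than a few percent at peak load is a sign the disk subsystem is struggling to keep up.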
For more detailed analysis, a tool such as iostat will need to be installed to view per-device/disk statistics, and iotop to view per-process information.
To install the iostat and iotop utilities on CentOS/RHEL use the following command:
- sudo yum install sysstat iotop
To install the iostat and iotop utilities on Debian/Ubuntu use the following command:
- sudo apt-get install sysstat iotop
Once the package has been installed, iostat can be started with the -x 1 parameters to report extended statistics once per second.
[sde_admin@server ~]$ iostat -x 1
Linux 2.6.32-573.18.1.el6.x86_64 (server.sdelements.com) 24/02/17 _x86_64_ (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.89    0.02    0.65    0.04    0.00   96.40

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.05 3.86 0.15 2.24 6.53 48.86 23.12 0.01 3.81 5.70 3.68 0.73 0.18
dm-0 0.00 0.00 0.17 6.07 4.08 48.56 8.44 0.01 1.38 8.62 1.18 0.27 0.17
dm-1 0.00 0.00 0.02 0.04 0.17 0.30 8.00 0.00 23.29 6.28 32.72 0.70 0.00
dm-2 0.00 0.00 0.01 2.60 0.69 20.63 8.15 0.00 0.96 18.14 0.86 0.27 0.07
The relevant columns here are await, r_await, and w_await, which show the average wait time (in milliseconds) per I/O request for each device.
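To spot problem devices quickly, the await column can be filtered with awk. This is a sketch against sample lines matching the layout above (where await is column 10 in this sysstat version; the column position may differ in other releases), with a hypothetical 20 ms threshold; in practice you would feed it the device section of real `iostat -x` output:

```shell
# Flag devices whose average await exceeds a threshold in milliseconds.
# Column 10 is await in the iostat -x layout shown above; the sample
# lines stand in for real output.
threshold=20
out=$(awk -v t="$threshold" 'NR > 1 && $10 > t { print $1, "await:", $10 "ms" }' <<'EOF'
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.05 3.86 0.15 2.24 6.53 48.86 23.12 0.01 3.81 5.70 3.68 0.73 0.18
dm-1 0.00 0.00 0.02 0.04 0.17 0.30 8.00 0.00 23.29 6.28 32.72 0.70 0.00
EOF
)
echo "$out"
```

In this sample only dm-1 exceeds the threshold. What counts as a healthy await depends on the storage: single-digit values are typical for local disks, and consistently high values point at an overloaded or under-provisioned device.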
If disk load is determined to be causing issues, the iotop command can be used to identify which processes are responsible.
# iotop
Total DISK READ: 8.00 M/s | Total DISK WRITE: 20.36 M/s
  TID PRIO USER DISK READ  DISK WRITE  SWAPIN   IO>     COMMAND
15758 be/4 root 7.99 M/s   8.01 M/s    0.00 %   61.97 % postgres
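A process line like the one above can also be reduced to just the command name and its IO percentage, for example when capturing snapshots in batch mode. This sketch parses a sample line; the whitespace-separated field positions (IO% value in field 10, command in field 12) are an assumption based on the layout shown above, and in practice you would pipe in `iotop -b -n 1`:

```shell
# Summarize one iotop process line as "command (IO%)".
# The sample line stands in for real batch-mode iotop output; field
# positions assume the column layout shown above.
line='15758 be/4 root 7.99 M/s 8.01 M/s 0.00 % 61.97 % postgres'
top_proc=$(printf '%s\n' "$line" | awk '{ print $12 " (" $10 "% IO)" }')
echo "$top_proc"
```

Here the heaviest IO consumer is postgres, which on an SDElements system usually means the database itself is the component bound by the disk subsystem.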