vmstat and IO wait

Our company has had a problem on a customer’s site. The site has been complaining about very poor system performance.

The site (details don’t matter that much in this question):

F50 dual processor
AIX 4.2.1
0.75 gig memory
5 SSA disks
100mb network card

I started by looking at the site and trying to tune the DB (I was fairly confident in the software, as it is on lots of other sites and running okay).

On initial investigation, vmstat was showing very high I/O wait times, around 60 to 90 percent.

To cut a long story short, we discovered a hardware fault. The fault was causing the switch that the server was connected to to run very slowly.

When the problem with the switch was resolved and the system was brought back up, the I/O wait time went down to around 20 to 40 percent.

I can only conclude from this that the I/O wait time reported by vmstat includes network wait time as well.

Can anyone confirm this?
 
Hi,
In AIX 4.3.3 the wa column details CPU idle time (percent) with pending local disk I/O.
If there is at least one outstanding I/O to a local disk when the wait process is running, the time is classified as "waiting on I/O". A wa value over 40 percent could indicate that the disk subsystem may not be balanced properly, or it may be the result of a disk-intensive workload. If there is only one process available for execution -- often the case on a technical workstation -- there may be no way to avoid waiting on I/O.

Method used in AIX 4.3.2 and earlier AIX Versions

At each clock interrupt on each processor (100 times a second in AIX), a determination is made as to which of four categories (usr/sys/wio/idle) to place the last 10 milliseconds of time. If the CPU was busy in usr mode at the time of the clock interrupt, then usr gets the clock tick added into its category. If the CPU was busy in kernel mode at the time of the clock interrupt, then the sys category gets the tick. If the CPU was NOT busy, then a check is made to see if any I/O to disk is in progress. If any disk I/O is in progress, then the wio category is incremented. If NO disk I/O is in progress and the CPU is not busy, then the idl category gets the tick.
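The tick accounting described above can be sketched in a few lines of Python. This is only an illustration of the decision order as described in this post, not the actual kernel code; the function names, the state strings and the sample data are all made up for the example.

```python
from collections import Counter

def classify_tick(cpu_state, disk_io_in_progress):
    """Pick the category for one 10 ms clock tick (method described for AIX 4.3.2 and earlier).

    cpu_state is a hypothetical label: "user", "kernel" or "idle".
    disk_io_in_progress is True if ANY disk I/O is pending system-wide.
    """
    if cpu_state == "user":
        return "usr"
    if cpu_state == "kernel":
        return "sys"
    # CPU not busy: the tick is wio if any disk I/O is in progress, else idl.
    return "wio" if disk_io_in_progress else "idl"

def summarize(ticks):
    """Turn a list of (cpu_state, disk_io_in_progress) ticks into percentages."""
    counts = Counter(classify_tick(state, io) for state, io in ticks)
    total = sum(counts.values())
    return {cat: 100.0 * counts[cat] / total for cat in ("usr", "sys", "wio", "idl")}

# Hypothetical one-second sample (100 ticks) on one processor.
sample = (
    [("user", False)] * 10
    + [("kernel", True)] * 10
    + [("idle", True)] * 60
    + [("idle", False)] * 20
)
print(summarize(sample))
```

Note that in this sketch an idle tick is counted as wio whenever any disk I/O is pending, whatever the CPU is actually waiting for.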

Hope this helps

Best regards,
Gabor Topor
AIX V4.3 System Support

International System House Ltd.
Phone : (+36-1) 355-8720, (+36-1) 214-2368
E-mail: gtopor@ish.hu
 
Thanks for the reply, but.....

Does this mean that, because the system was being slowed down by the network and was therefore waiting to push data out, whenever the check ran and saw that disk I/O was also going on, the time was put down as I/O wait?


Still a little confused and distrusting of vmstat

TIA
 
Hi,

Since the problem is already solved, I suggest using the sar command instead of vmstat next time (e.g. sar -P ALL 2 5). This can report per-processor statistics.
Another option I can recommend is to upgrade to V4.3.3. Version 4.3.3 and later contain an enhancement to the method used to compute the percentage of CPU time spent waiting on disk I/O (wio time).

The change in operating system version 4.3.3 is to only mark an idle CPU as wio if an outstanding I/O was started on that CPU. This method can report much lower wio times when just a few threads are doing I/O and the system is otherwise idle. For example, a system with four CPUs and one thread doing I/O will report a maximum of 25 percent wio time. A system with 12 CPUs and one thread doing I/O will report a maximum of 8.3 percent wio time.
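The figures in the example above follow from simple arithmetic, sketched here in Python. This is a rough model rather than the real accounting: it assumes an otherwise idle system and that each I/O thread's outstanding I/O is charged to exactly one CPU. The function names are made up for the illustration.

```python
def max_wio_percent_new(num_cpus, io_threads):
    """Upper bound on wio under the 4.3.3 method (assumed model).

    Only a CPU that started an outstanding I/O can be marked wio, so at
    most one CPU per I/O thread contributes wio ticks.
    """
    return 100.0 * min(io_threads, num_cpus) / num_cpus

def max_wio_percent_old(num_cpus, io_threads):
    """Upper bound on wio under the older method (assumed model).

    Any idle CPU is marked wio while any disk I/O is pending, so a single
    I/O thread can push every idle CPU into wio.
    """
    return 100.0 if io_threads > 0 else 0.0

for cpus in (2, 4, 12):
    new = max_wio_percent_new(cpus, 1)
    old = max_wio_percent_old(cpus, 1)
    print(f"{cpus:2d} CPUs, 1 I/O thread: new method <= {new:.1f}%, old method <= {old:.1f}%")
```

For a two-processor machine like the F50 in the question, the same model gives a ceiling of 50 percent for one I/O thread under the newer method, against 100 percent under the older one.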

Also, NFS now goes through the buffer cache, and waits in those routines are accounted for in the wa statistics.

In short, the way wa is calculated on SMP systems in AIX V4.2 can lead to such a misleading result.

Best regards,
Gabor Topor
AIX V4.3 System Support

International System House Ltd.
Phone : (+36-1) 355-8720, (+36-1) 214-2368
E-mail: gtopor@ish.hu
 