Forum Post: The latches and the disk IO

George Potemkin

Guest
A customer of ours recently experienced a severe performance degradation of their Progress application. The root cause turned out to be the disks: copying a 1 GB file took 5-6 minutes instead of the usual 7 seconds. There were no I/O waits, the CPU was 95% idle, and iostat showed very low disk activity.

From the Progress point of view the symptoms were the following.

Connecting to the db through shared memory took 20-25 seconds. We had enough time to generate a protrace file:

(5) 0xe000000130043440 ---- Signal 16 (SIGUSR1) delivered ----
(6) 0xc00000000054cc10 _open_sys + 0x30 [/usr/lib/hpux64/libc.so.1]
(7) 0xc000000000561070 _open + 0x170 at ../../../../../core/libs/libc/shared_em_64_perf/../core/syscalls/t_open.c:28 [/usr/lib/hpux64/libc.so.1]
(8) 0x40000000008069e0 dbut_utOsOpen + 0x90 at /vobs_rkt/src/dbut/ut/utosopen.c:310 [/usr/dlc/bin/_progres]
(9) 0x4000000000716ec0 bkioOpenUNIX + 0xb0 at /vobs_rkt/src/dbmgr/bkio/bkio.c:1005 [/usr/dlc/bin/_progres]
(10) 0x400000000071bc70 bkioOpen + 0x210 at /vobs_rkt/src/dbmgr/bkio/bkiotop.c:784 [/usr/dlc/bin/_progres]
(11) 0x4000000000727c50 bkOpenOneFile + 0x290 at /vobs_rkt/src/dbmgr/bk/bkopen.c:2052 [/usr/dlc/bin/_progres]
(12) 0x4000000000727720 bkOpenAllFiles + 0x1d0 at /vobs_rkt/src/dbmgr/bk/bkopen.c:1942 [/usr/dlc/bin/_progres]

Remote connections were instant.

Promon showed more-or-less normal activity for 5-10 seconds; then the db hung for approximately 20 seconds. During each hang one of the processes (a different one each time) with an active transaction held the MTX latch and did not release it. The same process held, and did not release, an exclusive (EXCL) buffer lock during the same period. There was no sign that the process holding the MTX latch was waiting for any other db resource. The processes were creating small transactions.

It looked as if the db had been "proquiet'ed", except that during a proquiet it is the db broker that holds the MTX latch, and the broker also holds the BIB latch. In our case the BIB latch was not locked. During the pauses there were neither db writes nor db reads, only record reads - obviously the client sessions were able to read the records that were already in the buffer pool.
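By the way, the disk slowness itself was easy to demonstrate outside of Progress. Below is a minimal sketch of the kind of copy timing used; the file paths and chunk size are placeholders, not what was actually run on the customer's box:

import os
import time

# Hypothetical paths: a ~1 GB source file on the affected filesystem
# and a destination on the same disks. Adjust to your environment.
SRC = "/data/testfile_1gb"
DST = "/data/testfile_1gb.copy"
CHUNK = 8 * 1024 * 1024  # copy in 8 MB chunks

start = time.time()
copied = 0
with open(SRC, "rb") as src, open(DST, "wb") as dst:
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
    dst.flush()
    os.fsync(dst.fileno())  # make sure the data really reaches the disks

elapsed = time.time() - start
print("copied %.1f MB in %.1f s (%.1f MB/s)"
      % (copied / 1e6, elapsed, copied / 1e6 / elapsed))

os.remove(DST)

In the normal periods such a copy reports roughly 1 GB in 7 seconds (about 140 MB/s); in the degraded periods the same copy takes 5-6 minutes (a few MB/s), while iostat keeps showing the disks as almost idle.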
Here is an example that combines a few promon screens (Activity: Latch Counts, Status: Active Transactions, Status: Buffer Lock Queue):

                        ----- Locks -----  Naps
Time      Latch  Owner    Total     /Sec   /Sec   Trans id    Trans State  DBKEY     Area T Status  Type  Usect
18:25:41  MTX      --      5593     2796      9
18:25:45  MTX     101         0        0      0   1516710131  Active       36678784   17  D LOCKED  EXCL    1
18:25:48  MTX     101         0        0      0   1516710131  Active       36678784   17  D LOCKED  EXCL    1
18:25:52  MTX     101         0        0      0   1516710131  Active       36678784   17  D LOCKED  EXCL    1
18:25:55  MTX     101         0        0      0   1516710131  Active       36678784   17  D LOCKED  EXCL    1
18:25:59  MTX     101         0        0      0   1516710131  Active       36678784   17  D LOCKED  EXCL    1
18:26:02  MTX     101         0        0      0   1516710131  Active       36678784   17  D LOCKED  EXCL    1
18:26:06  MTX       0      2173     1086    119
18:26:09  MTX       0         0        0      0
18:26:13  MTX       0         0        0      0
18:26:16  MTX       0         0        0      0
18:26:20  MTX       0         0        0      0
18:26:23  MTX       0         0        0      0
18:26:27  MTX       0         0        0      0
18:26:31  MTX     135      5445     2722     12   1516712589  Active       36680832   17  D LOCKED  EXCL    1
18:26:35  MTX     135      3469     1734     51   1516713316  Active
18:26:38  MTX     135         0        0      0   1516713316  Active
18:26:42  MTX     135         0        0      0   1516713316  Active
18:26:45  MTX     135         0        0      0   1516713316  Active
18:26:49  MTX     135         0        0      0   1516713316  Active
18:26:52  MTX     135         0        0      0   1516713316  Active
18:26:56  MTX     135         0        0      0   1516713316  Active
18:26:59  MTX     135      5415     2707    145   1516714016  Active       36674496   17  D LOCKED  EXCL    1
18:27:03  MTX     135         0        0      0   1516714016  Active       36674496   17  D LOCKED  EXCL    1
18:27:06  MTX     135         0        0      0   1516714016  Active       36674496   17  D LOCKED  EXCL    1
18:27:10  MTX     135         0        0      0   1516714016  Active       36674496   17  D LOCKED  EXCL    1
18:27:13  MTX     135         0        0      0   1516714016  Active       36674496   17  D LOCKED  EXCL    1
18:27:17  MTX     135         0        0      0   1516714016  Active       36674496   17  D LOCKED  EXCL    1
18:27:20  MTX       0      1167      583     51
18:27:24  MTX       0         0        0      0

Usr 101 was a batch job. Usr 135 was an interactive user.

The pauses in db activity are not related to the checkpoints, but during the checkpoints we had the same huge pauses:

No.  Time      Len  Freq  Dirty  CPT Q  Scan  APW Q  Flushes  Duration  Sync Time
 71  17:54:56  144   170   9508   9507     0      0        0     22.62       0.00
 72  17:57:46  111   111  10798   7682     0      0     3115      2.50       0.00
 73  17:59:37  244   244  12242  10430    49      0      341     44.77       0.00
 74  18:03:41  196   196   6934   6427     1      0      337     19.34       0.00
 75  18:06:57  114   119  10087   9913     5      0        0      4.02       0.00
 76  18:08:56  308   308   7926   7783  1216      0        4      2.59       0.00
 77  18:14:04  311   311   7740   5113  3005      0      312      8.25       0.00
 78  18:19:15  490     0  15094   7826     8      0        0     24.13       0.00

Promon itself hangs for 1-2 seconds on two of the screens: Status: Buffer Lock Queue and Activity: I/O Operations by File. All the screens above are supposed to represent 2-second sampling intervals, but in fact the interval between samples is 3-4 seconds.

My questions:

What data in promon would hint that the root cause is the disks?

How can slow disk access force a Progress session to hold the MTX latch for tens of seconds instead of a fraction of a millisecond? Latches are used for short-term resource locks. Disk operations are associated with medium-term locks, and Progress uses semaphores to wait for such resources. Disk operations should not affect the latch locks, but our situation seems to say the opposite.

Any thoughts?

Thanks in advance,
George
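P.S. To make the second question more concrete, here is a toy sketch in plain Python, with threading.Lock standing in for the MTX latch. This is of course not the Progress source, just an illustration of the two patterns I have in mind. The first is what I would expect: the latch protects only the in-memory change, and any slow disk write happens after the latch is released. The second is the only shape of code I can imagine that would produce the promon picture above: a synchronous write performed while the latch is still held, so every other updater naps on the latch for the full duration of the I/O.

import threading
import time

mtx_latch = threading.Lock()   # stand-in for the MTX latch (illustration only)
SLOW_IO = 20.0                 # seconds - what one write looked like on the sick disks

def slow_disk_write(note):
    time.sleep(SLOW_IO)        # simulate the degraded disk subsystem

def update_latch_only(log):
    # Expected pattern: the latch covers only the in-memory change;
    # the slow write happens after the latch is released.
    with mtx_latch:
        log.append("note")     # microseconds of work under the latch
    slow_disk_write(log[-1])   # other updaters are NOT blocked during this

def update_io_under_latch(log):
    # Pattern that would explain the observations: synchronous I/O is done
    # while the latch is held, so everyone else waits out the whole I/O.
    with mtx_latch:
        log.append("note")
        slow_disk_write(log[-1])

if __name__ == "__main__":
    log = []
    t = threading.Thread(target=update_io_under_latch, args=(log,))
    t.start()
    time.sleep(0.1)                    # let the first updater grab the latch
    start = time.time()
    with mtx_latch:                    # a second updater now waits ~20 seconds
        pass
    print("waited %.1f s for the latch" % (time.time() - start))
    t.join()

Running this, the second thread reports waiting about 20 seconds for the lock - the same order of magnitude as the MTX hold times we see in promon. I am not claiming this is what the engine actually does; it is just what the observed behaviour would look like if it did.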
