Help! One database suddenly slow, other fine

dp74

New Member
Hello, we are running QAD 8.6e on Progress 8.3C, sitting on HP-UX. We have two databases, mfgpro85e (which is fine) and addmfgpro (which suddenly started being slow today; it was fine Fri and Sat). Users always connect to both DBs when entering the app. Queries against the slow DB are running about 10x slower than comparable queries against the test version of the addmfgpro database (a week-old copy) or against the other database. Compiling and the data dictionary against the slow DB also seem noticeably slower. GLANCE says everything is wonderful (CPU, disk utilization, network, memory).

We've been adding lots of users over the past few months, so I suspect -B needs to be raised in the problem database (addmfgpro). What does ProgressTalk think? I don't have a lot of experience in this kind of Progress DB tuning, any help appreciated.

Some parameters below. TIA.


addmfgpro (slow)
Shared Resources:
Busy After Image Extent: /u103/PRODATA/addmfgpro.a2
Number of database buffers (-B): 60000
Number of before image buffers (-bibufs): 150
Number of after image buffers (-aibufs): 225
Before-image file name (-g): -
Before-image truncate interval (-G): 60
No crash protection (-i): Not enabled
Maximum total of all private buffers (-I): 424
Current size of locking table (-L): 360000
Locking table entries in use: 4
Locking table high water mark: 27
Maximum number of clients per server (-Ma): 4
Delay of before-image flush (-Mf): 120
Maximum number of servers (-Mn): 42
Maximum number of users (-n): 169
Before-image file I/O (-r -R): Raw
Shared memory version number: 8303
Number of semaphores used: 215


mfgpro85e (fine)
Shared Resources:
Busy After Image Extent: /u14/PRODATA/mfgpro85e.a1
Number of database buffers (-B): 225100
Number of before image buffers (-bibufs): 350
Number of after image buffers (-aibufs): 500
Before-image file name (-g): -
Before-image truncate interval (-G): 60
No crash protection (-i): Not enabled
Maximum total of all private buffers (-I): 424
Current size of locking table (-L): 720000
Locking table entries in use: 318
Locking table high water mark: 7038
Maximum number of clients per server (-Ma): 4
Delay of before-image flush (-Mf): 30
Maximum number of servers (-Mn): 42
Maximum number of users (-n): 169
Before-image file I/O (-r -R): Raw
Shared memory version number: 8303
Number of semaphores used: 215
 

dp74

New Member
Aha, buffer hits are 0%. The Progress Database Administration Guide tells me this is very bad, as the ratio should exceed 95%.

Activity - Sampled at 10/03/11 11:54 for 179:42:22.

Event            Total         Per Sec   Event            Total         Per Sec
Commits          1186634       1.8       Undos            4214          0.0
Record Updates   1058305       1.6       Record Reads     1443557382    239.6
Record Creates   292741        0.4       Record Deletes   230621        0.3
DB Writes        482596        0.7       DB Reads         62613098      96.7
BI Writes        114246        0.1       BI Reads         6673          0.0
AI Writes        55152         0.0
Record Locks     197366091     305.0     Record Waits     73            0.0
Checkpoints      1534          0.0       Buffers Flushed  170749        0.2

Rec Lock Waits 0 % BI Buf Waits 0 % AI Buf Waits 0 %
Writes by APW 64 % Writes by BIW 41 % Writes by AIW 89 %
Buffer Hits 0 %
DB Size 5636053 K BI Size 972784 K AI Size 278496 K
FR chain 34193 blocks RM chain 48401 blocks
Shared Memory 83424 K Segments 6

 

TomBascom

Curmudgeon
You should use the sampling feature of PROMON to get a proper view of activity during your slow queries.

Improvements in performance are related to the inverse square of the change in -B. In other words: to reduce IO by 2x (and improve performance by doing less IO) you need to increase -B by 4x. YMMV, but this is the crux of the relationship.
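As a back-of-the-envelope illustration of that rule of thumb (a hedged sketch, not Progress-specific; the function name is illustrative, and the 96.7 DB reads/sec figure is taken from the promon output above):

```python
# Rule of thumb from the post: disk I/O scales roughly with the inverse
# square root of -B, so cutting I/O in half takes a 4x increase in -B.

def projected_io(current_io_per_sec, current_b, new_b):
    """Rough projection of the I/O rate after changing -B."""
    return current_io_per_sec * (current_b / new_b) ** 0.5

# Example: 96.7 DB reads/sec at -B 60000; quadrupling -B to 240000
# projects roughly half the I/O (~48.4 reads/sec).
print(projected_io(96.7, 60000, 240000))
```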
 

dp74

New Member
Well, we also ran into -n, so we bit the bullet and took a brief outage to restart, doubling our -B in the slow db (along with doubling -n in both). The slowness problem is gone... at least for now.

I suspect we ran below some critical ratio of -B to users, where the DB was spending most of its time thrashing around looking for open buffers. It was a HUGE performance degradation, about 10x to 100x slower than normal. We'll be keeping an eye on things.

Wow indeed. I can't even install the Windows version on my upcoming Win7 laptop to get the online help from the procedure editor, which I will miss.

Thanks for the feedback!
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
Hi dp74,

Re installing an old version on a Win7 laptop:
I don't have any experience with Progress 8.x, but according to the Availability Guide (http://www.progress.com/progress_software/products/docs/bu_sep/pavail.pdf) you can install the client and development products on Windows XP. If you have Win7 Professional (or a higher SKU), you can run Windows XP Mode. It is a free download (http://www.microsoft.com/windows/virtual-pc/download.aspx) that gives you a licensed XP VM, where you could install an old Windows version of Progress. Give it a try.

Re your DBs:
Is there a reason why your lock tables are so huge? Also, you mentioned -B but not your DB block size.
 

dp74

New Member
Well, the Progress version I have for Windows is 9.1, and it always installed fine on XP, but no joy on Win7 even running it with XP compatibility as Admin (it always claims a file is locked, is_del iirc). Alas, MS tells me I am not eligible for the full XP VM on my home machine, so maybe I'll give that a shot on the new laptop as that will probably have Professional. Hadn't considered that, thanks for the notion and the links.

Heh, I wondered about the lock tables too. I suspect our DBA just set -L to a big number because he wasn't sure what it did. He's more of an Oracle guy really. I'll have to check block size, I want to say 4k but I'm at another client atm.

Slowness issue has not cropped up again, at least not yet.
 

dp74

New Member
Block size is 1024.

Database is a little slow now, not as good as in Test but not nearly as slow as yesterday. I'm thinking we need to start up another APW or two; the Database Admin Guide says nonzero flushes are bad.

10/04/11 Activity: Page Writers
08:26:30 10/03/11 12:21 to 10/04/11 08:25 (20 hrs 4 min)

                       Total     Per Min   Per Sec   Per Tx

Total DB writes        42701     35        0.59      0.36
APW DB writes          42439     35        0.58      0.36
  scan writes          15098     12        0.20      0.13
  apw queue writes     1939      1         0.02      0.01
  ckp queue writes     25402     21        0.35      0.21
scan cycles            87        0         0.00      0.00
buffers scanned        9959406   8272      137.87    85.85
bufs checkpointed      25760     21        0.35      0.22
Checkpoints            126       0         0.00      0.00
Marked at checkpoint   25924     21        0.35      0.22
Flushed at checkpoint  128       0         0.00      0.00

Number of APWs: 1

 

dp74

New Member
Started another APW, now database is perfectly fast again. Coincidence? Or was our poor solitary APW terribly overburdened? Hmmm.

BTW, to check performance I wrote very basic queries (a FOR EACH with no WHERE clause, inside a REPEAT loop) and just counted the number of records returned, displaying the count every 100 or 1000 records. Of course, you get a bunch of buffer hits after the first run, so I switch between tables. This gives me a pretty good idea of how slow things are, and I used our Test system (which has relatively little load) as a benchmark.
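The timing logic behind that benchmark can be sketched generically (a Python analogue of the ABL FOR EACH loop described above, under stated assumptions; the iterator is a stand-in, since the real test reads database records):

```python
import time

def benchmark_scan(records, report_every=1000):
    """Count records from a sequential scan, reporting throughput along the way.
    Analogue of the FOR-EACH-with-no-WHERE benchmark described in the post."""
    count = 0
    start = time.perf_counter()
    for _ in records:
        count += 1
        if count % report_every == 0:
            elapsed = time.perf_counter() - start
            print(f"{count} records, {count / elapsed:.0f} records/sec")
    return count

# Stand-in data source; a real run would iterate a table's records and
# alternate tables between runs to dodge warm-buffer effects.
benchmark_scan(range(5000))
```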
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
Are things running better because you added another APW, or because your transaction activity fell off to nothing? Without knowing that, all you can do is speculate, which isn't terribly valuable.

You could look at checkpoint stats, if your version of promon shows that. In v10 it's in R&D | 3 (Other) | 4 (Checkpoints). I think R&D was available but hidden in v9; no idea about 8. I'm waaay out of my element with a 1K DB on 8.3C which is (all together now!) AOUB: ancient, obsolete, unsupported, buggy. As Tom said: wow.

If you don't have UI in promon for checkpoints, and assuming the VST existed then :rolleyes:, you could look at _Checkpoint directly to see how much time there is between each checkpoint at your peak transaction time, and whether any buffers were flushed.

I don't know your BI configuration, but you may want to evaluate those settings as well. And if you have the luxury of some time and disk space, you could dump and load your DB into one with a larger block size and see how it performs. I don't know your file system block size, but if it is larger than 1K you're also paying a penalty there for I/O. Those are the steps that come to mind for me.

All that said, you may ultimately need some consultation. No sense listening to amateurs like me when there are professionals lurking hereabouts. ;)
 

TomBascom

Curmudgeon
ProMon R&D does exist in v8.

So do the VSTs although you have to run "proutil dbName -C enablevsts" to get them.

Buffers flushed are a potential problem if you see hundreds of them at each checkpoint -- it is normal for there to be a bunch at db startup. If the number never changes after that it isn't a serious worry.

v8 (which is Paleolithically ancient, obsolete, and really, really unsupported) does support db block sizes greater than 1k. Rows per block is tied to the block size: you get 32 rows per block for all block sizes other than 8k, and 8k blocks get 64 rows per block. Often, but not always, you can get very significant performance improvements by dumping and loading into an 8k block db. (Don't forget to reduce -B by a factor of 8...)
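The "-B by a factor of 8" adjustment is just keeping the buffer pool's memory footprint constant; a quick arithmetic sketch (the function name is illustrative):

```python
# Each 8K buffer holds 8x the data of a 1K buffer, so dividing -B by 8
# keeps the buffer pool's memory footprint (-B * block size) unchanged.

def scale_b(current_b, old_block_size, new_block_size):
    """Scale -B to keep buffer-pool memory constant across a block-size change."""
    return current_b * old_block_size // new_block_size

# Example: -B 60000 at 1K blocks -> 7500 at 8K blocks (same ~59 MB pool).
print(scale_b(60000, 1024, 8192))
```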

Upgrading to OE10.2 would, of course, be an excellent idea.
 

dp74

New Member
Yep, I use the VSTs daily to keep an eye on BI HWM and some other things, and I've been delving into PROMON R&D.

The performance improvement seems very likely to be attributable to the additional APW -- transaction volume should not have fallen off significantly in the five minutes between when I last checked performance and when I checked it again after starting the APW. The degradation was very persistent, it did not vary much over short periods of time (except when we shut down and restarted). Performance now seems significantly better than at any time since I started looking at it yesterday AM.

Of course, we're going to do some stress tests in Test to try to replicate the slowness, and see if we can prove beyond any reasonable doubt the single APW being overloaded is responsible for the very significant slowdown we started noticing Monday AM.

Block size is 1024. Thanks for the suggestions.

We'd love to upgrade, but that's out of our hands. Industry conditions are extremely challenging and mgmt doesn't want to spend the money. They'll almost certainly move us off the application (and Progress) if they decide to start spending -- that was originally scheduled to happen in 2005, but we've had 4-5 rounds of layoffs since then.

The funniest part is they're moving plants to 1998 technology off an even older technology, a mainframe app that dates from the early 1980s. That's why we have the user/tx growth.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
Your life certainly would be nicer on 10.2B. However, one thing I would caution about a DLC upgrade is that there could be application impacts, depending on how the code was written. Runtime and compiler behaviour and defaults can change across releases, and they certainly have changed across the gulf between 8.3C and 10.2B.

It may not be anything you can't adjust to, and might only necessitate some configuration changes or cause some unsightly warning messages, but it is just something to be aware of that it isn't necessarily a completely transparent transition. Also, you will need to recompile your code. So the effort and cost of the upgrade will depend somewhat on your relationship with your vendor.

As far as server side is concerned I think you would be better off building new databases (dump & load) rather than migrating the old ones forward, if that's even possible.

Platform is one other thing to consider. Many times a platform that is old enough to run an old DLC version is too old to run the latest version, at least with official support. Refer to the latest Platform & Product Availability Guide (http://communities.progress.com/pcom/docs/DOC-32338) for details. So you could be looking at an OS upgrade to get to 10.2B. And sometimes you need better hardware to run the new OS. And so on...
 

cj_brandt

Active Member
Your OS probably uses an 8k block size. Having your database at a 1k block size is a huge performance penalty.
If you were to take the time to perform a dump and load, the performance increase would be well worth it.
I would highly recommend doing this on a test system first.
 

dp74

New Member
Well, I wrote a script to beat on the database in much the same way the users do (it spawns 34 processes to do reads against a couple of tables -- this is more or less what's happening). This nicely replicates our conditions at the problem times (8-10A), when about 34 users all want to run the same report against our side database.

APWs and -B did not appear to affect performance. (Boo!) Apparently coincidence was responsible for the improvement I saw last time.

Since our problem is massive reads, it looks like private read-only buffers would be optimal for our situation. Of course, they were introduced in the next version...

That pretty much leaves me with block size (moving to 8K to match the OS). Doubling the rows per block might help. We do have the disk space on our dev/test server to try this so we may go there next if we can get the time.

Question on changing block size: when you connect to multiple databases, is there any requirement that they all use the same block size? Our side database is about 1/20th the size of the main, and has about 1/100th as many tables, and then there's the gui and help databases. A dump and load is a lot more feasible if we just do the one small one that's having the problem, rather than that plus two medium and one large.

Meanwhile, I guess we'll try to get users to batch some of these reports for an overnight run.
 

TomBascom

Curmudgeon
There are a couple of possible issues in the high-read rate scenario that you describe.

It is possible, likely even, that you have a coding problem. You are probably reading far, far more records than makes any sense at all for the business purpose being served. Try to work out, on the back of a napkin, how many record reads it should take to achieve the business purpose of the reports that are creating this load. If you arrive at a number that is several orders of magnitude less than what is actually happening, then your code is probably suffering from poor index selection (or it could just be a really dreadful choice of algorithm). As a first step, compile with XREF and look for WHOLE-INDEX searches. That will expose blatantly poor queries (and maybe some false positives). You could also consider applying the profiler to these sessions to determine which specific lines of code are taking the most time.
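Sifting the compile listings for those WHOLE-INDEX entries can be scripted; a minimal sketch, assuming the usual XREF record shape where SEARCH lines end in WHOLE-INDEX (the file glob and paths are illustrative):

```python
import glob

def find_whole_index(xref_glob):
    """Collect XREF SEARCH lines flagged WHOLE-INDEX (no usable index bracket).
    Assumes the usual XREF record shape:
      source file line-num SEARCH table index [WHOLE-INDEX]"""
    hits = []
    for path in glob.glob(xref_glob):
        with open(path) as f:
            for line in f:
                fields = line.split()
                if fields and fields[-1] == "WHOLE-INDEX" and "SEARCH" in fields:
                    hits.append((path, line.strip()))
    return hits

# Report every whole-index search found in the compiled XREF output.
for path, line in find_whole_index("xref/*.xrf"):
    print(path, "->", line)
```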

Of course fixing code is time-consuming and expensive. And it is possible (albeit unlikely) that it is not actually a coding problem. So you may end up chasing the problem from the administrative end:

1) You may be reading so much data that you cannot buffer it in -B. If "OS Reads" is high you can try to keep expanding -B until you run out of RAM or the entire DB fits in -B. If you cannot get it all in memory then dumping and loading, using a larger block size, and a better rows per block match will usually help make each IO more efficient and effective.

2) If "OS Reads" are not taking place then your working set fits in memory. In that case you may be experiencing "latch contention". You can see if this is the case by going to PROMON R&D -> 3 "Other" -> 1 "Performance Indicators". Sample the counters during a period of poor performance and observe the "latch timeouts". If you are seeing more than a handful per second you have pressure on the latch mechanism. That cannot be cured by any simple tuning technique. Your best options are:

i) You can try messing around with -spin, -nap and -napmax. It is unlikely that any of them will have a noticeable impact (unless they are currently very fouled up; -spin should probably be set around 10,000 for v8, and if it is 0 then you have likely found your culprit).

ii) Upgrade. The latch mechanism has had several very significant overhauls since 8.3. Proper use of storage areas also substantially addresses many of these issues.

iii) Split the db. Move high activity tables into stand-alone databases. That will give them their own set of latches and relieve pressure. The downsides to this are that it takes planning -- you need to know which tables are "high activity" (the _tablestat and _indexstat VSTs are a big help), you now have to manage more databases and you might need to worry about transaction integrity on multi-table updates.
 

dp74

New Member
Thanks Rob.

After doing some more reading, I'm increasingly optimistic about the block size change. The DB HWM is only 4.2G and with an 8K block size we can put 4G, or nearly the entire database, into memory (vs the 150MB we have today) and the server has 20G with ~14G normally free today. It's hard to imagine we could have performance problems with 100% buffer hits! Plus, the processes doing the reads won't be contending for buffers as fast because they'll have more data to do stuff with. Plus, our OS block size will match the DB block size, which should also help performance, and if the reports finish faster few of them will be running concurrently. Win, win, win.
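The sizing arithmetic behind that works out as follows (a sketch; the 4.2 GB figure is the high-water mark quoted above, and the function name is illustrative):

```python
import math

def buffers_for(db_bytes, block_size):
    """Number of -B buffers needed to hold db_bytes of database blocks."""
    return math.ceil(db_bytes / block_size)

hwm = 4.2 * 1024**3          # ~4.2 GB database high-water mark
b = buffers_for(hwm, 8192)   # proposed 8K block size
# ~550,000 buffers, i.e. ~4.2 GB of shared memory -- comfortably inside
# the ~14 GB the server normally has free.
print(b)
```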

Thanks Tom. I agree code is usually the culprit in these cases, and Lord knows I've fixed plenty of it myself (my first week here, I made a simple change to a cost extract that took its run time from 20+ hours down to 5 minutes), but unfortunately it doesn't appear to be the problem here. The indexing and record volume have been looked at pretty carefully. These are reports that normally run in a few minutes, but that's being stretched to 10x longer over the problem periods, which seem to run about half an hour to an hour around 8:30A on weekday mornings. I wrote a utility to tell me how many records the DB can return per millisecond, and when the issue arises it very clearly drops from about 5 to something below .1 (this is for a very basic query with no WHERE clause). Still, it never hurts to look again, so we may review that.

Spin does seem a likely culprit -- _Startup tells me it is in fact at zero, and we have four processors. I'll try adjusting that in Test and see if that helps.

The latch notion is very interesting and I will be sure to check that Monday AM when we will probably see performance issues again. Splitting the DB if all else fails is also a very good notion, thanks. We do know exactly which tables are at issue.
 

dp74

New Member
Spinlock FTW! Tried that on our test server -- my 34-process spawning script previously brought the database to its knees; performance was horrible, .01 records per ms. With -spin 10000, performance is normal.

Thanks a bunch Tom, I think this was our issue. Keeping fingers crossed for our migration of the db parameter change to Prod, probably next week.
 