Periodic Database Slowdown

kolonuk

Member
Hi All,

Sorry for the long post, but we're a bit stuck!

We have a situation where we get a slowdown of the database. This seems to happen randomly during the month/year, but always at the same time range during the day.

Our customer has a Workgroup license (10.1B) and the database size is around 20 GB. We've just moved them to a brand new server, which we hoped would solve the issue (RHEL 6, solid-state disks, 24-core Intel Xeon, 16 GB RAM).

We now have a process that detects the slowdown and notifies us, but still can't pinpoint what it is.

Our detection process simply does a FOR EACH on the customer table to read 2,000 records, timing the start and finish; it runs every 5 seconds:
Code:
define variable v_start as integer no-undo.
define variable v_end   as integer no-undo.
define variable x       as integer no-undo.

v_start = etime.
for each customer no-lock x = 1 to 2000:
end.
v_end = etime.
Using this, a database at rest normally returns a difference of 9-13 ms, but during the slowdown it can take anywhere between 20,000-30,000 ms, which is a huge difference! The script records the last 3 time differences, and if all 3 are over 500 ms we get notified (to avoid reacting to one-off spikes).
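
For completeness, the wrapper around that loop is roughly as follows (simplified - the v_hist variable and notify-support.p are made up here for illustration; the real script also writes a log, and the every-5-seconds part could just as well be driven by cron):
Code:
define variable v_start as integer no-undo.
define variable v_end   as integer no-undo.
define variable x       as integer no-undo.
define variable v_hist  as integer extent 3 no-undo.  /* last 3 timings */
define variable i       as integer no-undo.

repeat:
    v_start = etime.
    for each customer no-lock x = 1 to 2000:
    end.
    v_end = etime.

    do i = 1 to 2:                        /* shuffle the history along  */
        v_hist[i] = v_hist[i + 1].
    end.
    v_hist[3] = v_end - v_start.          /* keep the latest difference */

    if v_hist[1] > 500 and
       v_hist[2] > 500 and
       v_hist[3] > 500 then
        run notify-support.p.             /* hypothetical alert program */

    pause 5 no-message.                   /* wait 5 seconds, go again   */
end.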

Also, during the night, we have a statistical batch process that hits the database very hard for about 3-4 hours, and that doesn't get picked up, so whatever this is, it's very severe.

We contacted Progress, but all they could offer was to increase the -B and change the -spin value (but that's only for Enterprise).

Looking at promon during the slowdown, the buffers are at 85%-95%, and every other database responds as expected.

The BI stays more or less the same size, disk monitoring and CPU utilisation don't show anything out of the ordinary, and there are no Linux cron jobs running during that time frame. Apart from the database, everything on the system responds nicely.

Our only conclusion is that it's something to do with the database engine itself, but other than that we are lost.

Any idea what else I could check or change to help find the issue would be appreciated.
 

RealHeavyDude

Well-Known Member
From the information given one can only speculate.

But during the "slowness" time frame I would check for users who run reports on lots of rarely used data. I assume the 85%-95% is the buffer hit ratio. If that's the case, it tells you that some process is defeating the buffer pool by requesting lots of database blocks that aren't normally buffered. If it is a report, you could solve the issue by giving that report private buffers. If it is a job which changes the database, then you need to increase the buffer pool.
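
If it does turn out to be a read-only report, and your version/license supports it, private read-only buffers can be requested by that client with the -Bp connection parameter - something along these lines (the db name is invented, and check the 10.1B Workgroup documentation, I'm going from memory):
Code:
/* connect the reporting session with a small set of private read-only
   buffers so its scans don't evict the hot blocks from the shared -B.
   The same parameter can also be put on the client startup command line. */
connect livedb -ld livedb -Bp 64 no-error.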

Again, that's just speculation.


Heavy Regards, RealHeavyDude.
 

TomBascom

Curmudgeon
Is there a reason that you are choosing not to follow PSC's advice regarding increasing -B?

What value do you have it set to?

Assuming that "buffers are at" refers to the buffer hit ratio then 85% to 95% is horrifically bad. 98% is "ok", 99% is "good", 99.5% is "pretty good". 85% means that 1 in 6 logical reads is resulting in disk IO. Reading 2,000 records generates at least 4,000 logical reads (1 index block and 1 data block per record). So if your hit ratio is 85% you're doing around 600 disk reads in that loop. You're probably getting lucky and a lot of those are being cached by the OS most of the time. The times when they aren't cached (for whatever reason) are likely when your performance is poor. You might also have situations where application contention pushes the blocks you happen to be reading out to disk because other things are using your -B more aggressively.

The cure for that is to increase -B dramatically. If you have a 32 bit executable you are limited to around 2GB (with 4k db blocks that is a value of roughly -B 500000).

Also, the Workgroup db does not like multi-core systems. The cure for that is Enterprise. Or you can disable all but one core. Or virtualize and give the db VM 1 core. On Linux you can set "processor affinity" for the db and convince it to stay on a single core. That sometimes helps.
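
If you try the affinity route on Linux, running taskset against the already-running broker and server processes is one way to do it - just a sketch, your PIDs and the number of _mprosrv processes will differ:
Code:
# find the broker / server processes for this db and pin them to CPU 0
ps -ef | grep _mprosrv
taskset -pc 0 <pid>     # repeat for each _mprosrv PID belonging to the db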
 

kolonuk

Member
Hi RealHeavyDude,

During the slowdown, the majority of users are sitting at prompts (entering an order, in a menu, etc.); one or two people may be running reports, but those run against a different "stats" database (partly in place to stop such reports affecting the business-critical live db!). The only time the stats db gets rebuilt is overnight, when I'd expect a slowdown of some sort...

We do have our own user-generated logs, with time of day and program run, to find out in detail who is running what and when, but they don't give any useful info to track this down...
 

kolonuk

Member
Hi Tom,

The buffer hit ratio is only at that level during the slowdown (normally it's actually around 95%+), at which point we can't identify the issue (what with the various logging going on now...).

Sorry, I should have said we did follow their advice and set the -B to 30000 as they recommended. My concern with having a large buffer is to do with DB writes - do they go into the buffer first and wait to get written to disk, or go straight to disk? If there is a power failure, what's the result if writes are still in memory? Or am I barking up the wrong tree and it's just for reads...?

Do you know what the default spin value is, or how I can find it? I didn't set this, as the advice from PSC suggested it was ONLY for Enterprise... This could be the next thing we play with...

Also, what do you mean by "doesn't like multi-core systems"? We get no other issues with our other customers - is there any documentation etc regarding this from Progress?
 

cj_brandt

Active Member
Progress changed the spin value in the Workgroup db from 0 to 1 with OE 10.?? You can look in the db log after startup and it will report the value of -spin.

Do you collect table stats with the -tablerangesize and -indexrangesize parameters? That would help determine what tables are being read when your slowness hits. I would recommend upgrading to at least Service Pack 3 of 10.1B - big performance benefits. I would also recommend 10.1C because of the ability to generate protrace files against running processes - another way to track down your problems.
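
For example - typed from memory, so treat it as a sketch - once the db is restarted with -tablerangesize / -indexrangesize set high enough to cover all your tables and indexes, something like this run during the slow spell shows where the reads are going:
Code:
/* top tables by reads since db startup - run it twice during the
   slowdown and diff the numbers to see what is being read right now */
for each _TableStat no-lock,
    first _File where _File._File-Number = _TableStat._TableStat-id no-lock
    by _TableStat._TableStat-read descending:

    display _File._File-Name format "x(30)"
            _TableStat._TableStat-read.
end.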

If your server has 16 GB of RAM, I would allocate more buffers than 30,000 - that seems very small.

You are correct: database updates are committed to memory and then later written to disk. If there is a power failure, any incomplete transactions left in your BI file are backed out when the database starts - no worries. Remember - OE databases have to be able to run on Windows platforms :)
 
MaximMonin

Use promon and check the checkpoints and the time between them.
Normally there should be 15 sec or more between them, and the APW processes should flush 100% of the changes.
If not, try changing the BI block size and cluster size:
proutil dbname -C truncate bi -bi 16384 -biblocksize 16
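
To see the recent checkpoint history without hunting through the promon menus, the _Checkpoint VST can also be read directly - a rough sketch, field names from memory, so double-check against your version:
Code:
/* recent checkpoints: when they happened, how long they took,
   and how many dirty buffers had to be flushed at checkpoint time */
for each _Checkpoint no-lock:
    display _Checkpoint._Checkpoint-Time
            _Checkpoint._Checkpoint-Len
            _Checkpoint._Checkpoint-Dirty
            _Checkpoint._Checkpoint-Flush.
end.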

also try
#-TXESkipLimit 100000
-TXESkipLimit 0
-B 500000 or more
 

kolonuk

Member
@cj_brandt:
At the moment, upgrading is a good idea, will have to look into that...

I don't collect table stats - what would that show, and I take it that's from one of the system tables?

The db has a 1024-byte block size, so yes, you're right - 30,000 buffers means about 30 MB - it could do with a lot more!

@MaximMonin:
What does looking at the checkpoint solve? What is a checkpoint?
Also, I can't find any reference to TXESkipLimit on the Progress website - what is this?

@All:
I've been looking at the -spin value, and it looks like this may be the culprit (see http://progress.atgnow.com/esprogress/documentDisplay.do?clusterName=CombinedCluster&preview=1&groupId=2&page=https://progress.atgnow.com/esprogress/docs/Solutions/Progress/ESERVER_P23850.xhtml&docType=1006&resultType=5002&docProp=$solution_id&docPropValue=p98703).

Does anyone use the spin value on workgroup? What happens if/when the retries are all gone - does the session crash? What happens when I turn it on whilst the db is running? Anything I should be aware of?

Thanks in advance!
 
MaximMonin

At checkpoint time all processes freeze until the BI file update has finished.
TXESkipLimit is an undocumented parameter that tunes the system when there are many clients blocked with reason TXELOCK (transaction end lock).
 

kolonuk

Member
@all:
Right, scrap the -spin value - it doesn't work at all for Workgroup, so we are stuck with the queueing semaphores...

Code:
15:54:57 SERVER   : This version of PROGRESS does not support Spin Lock Retries (-spin). (4965)

OK, next step I guess is to increase the -B some more - if the max is 2 GB for 32-bit Progress and I'm only using about 30 MB, I could go for say 500 MB, or even the full 2 GB, and see what happens...
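
If my maths is right, since -B is counted in db blocks and ours are 1 KB blocks:
Code:
-B 30000     =  ~30 MB buffer pool (current)
-B 500000    = ~500 MB
-B 2000000   =  ~2 GB (roughly the 32-bit ceiling Tom mentioned)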

Anyone got any other ideas?
 

tamhas

ProgressTalk.com Sponsor
You are getting some answers here and need to follow them. You can't set -spin on Workgroup. All you can do is change versions to get a better default value.

Is your machine multi-processor or multi-core? This is a BIG RED FLAG for Workgroup in older versions.

Pay a lot of attention to Tom (pretty much universally good advice when related to performance and the DB). 95% is not good. 85% is poor. 30000 is tiny. Start by adding at least another 0. No, you don't need to worry about data loss. That is what the BI file is there for.
 

TomBascom

Curmudgeon
-B 30,000 is minuscule, a hit ratio of 95% is bad and 85% is horrific. 95% means that 1 in 20 logical reads is a disk read -- and every record read is at least 2 logical reads, so 1 in 10 record reads requires disk access. Even with fast SSD disks that means going through layers of system calls, out onto the bus and back and it probably involves at least 2 context switches. And if the disks aren't local (i.e. they are on a SAN) it is even worse. 85% means that nearly 1 in 6 logical reads -- or one-third of record reads requires a disk IO.
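
If you want to put a number on it yourself during the next slowdown, the hit ratio can be derived from the _ActBuffer VST - field names from memory, so double-check against your version:
Code:
define variable v_hit as decimal no-undo.

/* percentage of logical reads satisfied from the buffer pool */
for each _ActBuffer no-lock where _ActBuffer._Buffer-LogicRds > 0:
    v_hit = (1 - (_ActBuffer._Buffer-OSRds / _ActBuffer._Buffer-LogicRds)) * 100.
    display v_hit format ">>9.99" label "Buffer hit %".
end.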

If the "normal" hit ratio is 95% and it dips to 85% when things are bad you have probably found your "smoking gun". Any modestly bad report (or inquiry screen or lots of other possibilities) that happens to do a table scan of a large table will push all other data out of RAM and everyone will be twiddling their thumbs waiting for IO -- although with SSDs one would hope those waits would be short.

I suggest that you download the character version of the ProTop utility from http://dbappraise.com and get comfortable with its use. Monitoring user IO in real time as well as table stats (and index stats) would help you to determine who the culprit is and what they are doing.

1k db blocks are too small (IMHO that option ought to be removed from the db). You should convert to 4k or 8k. Note: -B is # of blocks so if you increase the block size you also increase RAM usage if you leave -B at the same value for both block sizes. You should also move all of your data and indexes to type 2 storage areas if you have not already done so. This will improve the efficiency of the IO that you do do.
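
Purely as an illustration of the shape of it (area names, paths, rows-per-block and cluster sizes are invented - work out your own), a type 2 structure file looks something like this, created with prostrct create newdb newdb.st -blocksize 4096:
Code:
# example .st file: the value after the ";" is blocks-per-cluster,
# which is what makes an area type 2 (valid values are 8, 64 or 512)
b /db/live
d "Schema Area":6,32 /db/live
d "Data":7,64;512  /db/live f 1024000
d "Data":7,64;512  /db/live
d "Index":8,8;8    /db/live f 512000
d "Index":8,8;8    /db/live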

You should also upgrade to 64 bits. It is silly to be running 32 bits.

Larger -B does not impact data integrity. Yes, more stuff gets buffered -- reducing IO (that's the point) but the before-image file protects you from data loss. Only uncommitted transactions are at risk (which they are anyway, regardless of -B size). The optimum value of -B is your db size (if you have enough RAM, otherwise stop prior to running out -- you don't want to start swapping...)

You should also be running after-imaging. Unless your data has no value and your job is only a hobby.

Workgroup does not support APWs. For WG licenses you want checkpoints to be frequent and short.

The multi-core issue is not something that they like to talk about. But it is very well known. It is much better with 10.1B+ (that's when they started using -spin 1 for WG rather than the semaphore approach). You would only ever notice it on systems which are busy -- on lightly loaded systems it doesn't show up as a problem.

It is possible that this is not as blindingly obvious as it seems. This diagnosis is, after all, based on second hand data -- and a limited amount of that ;) In particular, please keep in mind that no amount of database tuning can overcome really bad code. If taking the obvious steps doesn't work then you may need to engage a consultant.
 

sdjensen

Member
Are you sure it's a database problem?
Have you checked whether there is some odd job running on the server using all the CPU or disk read/write?

Could it be the network?
 

cj_brandt

Active Member
Dump and load the database to change from a 1k block size to 4k or 8k. That, coupled with the tiny -B setting, means you are doing way more disk IO than you should be.
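
In outline - from memory, so check each utility's syntax and practice on a copy first; the db names and paths here are invented, and customer is just one example table - the block size change is a binary dump and load into a freshly built 4k database:
Code:
# 1. binary dump each table from the old db
proutil livedb -C dump customer /dumps

# 2. build the new 4k-block database and load the schema
prostrct create newdb newdb.st -blocksize 4096
procopy $DLC/empty4 newdb
#    (then load the schema .df into newdb via the Data Dictionary)

# 3. binary load each table and rebuild the indexes
proutil newdb -C load /dumps/customer.bd
proutil newdb -C idxbuild all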
 

keno

New Member
The main reason for the problem is the WG license combined with a multi-core / multi-processor machine.

Similar, simultaneous queries kill your service completely.

I came across the same problem on a 9.1x version and nothing helped other than configuring the server to use one processor.
Try whether the 10.xx spin value of 1 helps, or upgrade to an Enterprise license.
 

kolonuk

Member
This is a little old and we solved it.

Thanks all for the responses. We upgraded to 10.2B and the issue seems to have gone away. We also changed the block size, so I'm not sure exactly which one fixed it...

Cheers!
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
Good news. I hope you went with the 64-bit Enterprise RDBMS license. If you care about performance that's what you should be using. Be sure to tune the buffer pool, as advised by others in this thread.
 