Question Reduce checkpoint duration

FocusIT · Jul 18, 2013

Is there a way to reduce checkpoint duration? There is plenty of advice on checkpoint frequency, and I have followed it closely to get our checkpoints to no more than one every 5 minutes, with 0 buffers flushed, at peak load during the day. But as the footprint of the database increases, so does the checkpoint duration. Each checkpoint currently takes about 30 seconds to complete, and end users are quite rightly complaining about UI performance. Details of the database:

OE 10.2B05 X64
Windows Server 2008R2 Enterprise X64
396GB memory
24 * 2.4 GHz AMD Opteron Cores
4 KB block size
35m in -B
10m in -B2
LRU2 policy remains disabled
1 APW
50 BI Buffs
100000 -spin
99% buffer hits
100% alternate buffer hits
850GB total database size

cj_brandt · Jul 18, 2013

You don't mention using a BI Writer - but you did list an APW. You have to use a BIW.
Can you show the promon R&D -> 3 -> 4 screen with the details of the last few checkpoints?
Can you provide your BI Clustersize and BI Blocksize?
What stats does Windows perfmon show for the disk that holds the bi file during these checkpoints?

The fix will be pretty simple if you weren't using a BIW process before.

FocusIT · Jul 18, 2013

A BIW and AIW are running.

BI Clustersize is 262128kb, block size is 4kb.

The disk hosting the BI is < 10% active during the checkpoint.

Only thing that seems odd is that 'Writes by BIW' rarely gets over 50%.

R&D -> 3 -> 4 below

15:36:42
Ckpt ------ Database Writes ------
No. Time Len Freq Dirty CPT Q Scan APW Q Flushes Duration Sync Time
42 15:23:54 769 0 170968 52827 4914 0 0 31.88 0.05
41 14:44:14 2019 2380 158081 155798 11804 0 0 36.19 0.05
40 14:13:06 1742 1868 127145 125842 4706 0 0 25.46 0.06
39 13:49:25 1320 1421 120683 119758 3640 0 0 20.86 0.05
38 13:29:00 1189 1225 55616 53883 2782 0 0 29.12 0.06
37 13:19:16 0 584 60079 59546 1508 0 0 21.23 0.08
36 13:10:31 507 525 138849 138316 1040 0 0 20.60 0.08
35 12:52:38 1022 1073 174106 173365 2210 0 0 19.55 0.06
Enter <return>, R, U, P, T, or X (? for help):

Rob Fitzpatrick · Jul 18, 2013

FocusIT said:
15:36:42
Ckpt ------ Database Writes ------
No. Time Len Freq Dirty CPT Q Scan APW Q Flushes Duration Sync Time
42 15:23:54 769 0 170968 52827 4914 0 0 31.88 0.05
41 14:44:14 2019 2380 158081 155798 11804 0 0 36.19 0.05
40 14:13:06 1742 1868 127145 125842 4706 0 0 25.46 0.06
39 13:49:25 1320 1421 120683 119758 3640 0 0 20.86 0.05
38 13:29:00 1189 1225 55616 53883 2782 0 0 29.12 0.06
37 13:19:16 0 584 60079 59546 1508 0 0 21.23 0.08
36 13:10:31 507 525 138849 138316 1040 0 0 20.60 0.08
35 12:52:38 1022 1073 174106 173365 2210 0 0 19.55 0.06
Enter <return>, R, U, P, T, or X (? for help):

This information is basically unreadable unless you put it within CODE tags.

FocusIT · Jul 18, 2013

Code:

    5.  Activity
    6.  Shared Resources
    7.  Database Status
    8.  Shut Down Database
 
  R&D.  Advanced options
    T.  2PC Transactions Control
    L.  Resolve 2PC Limbo Transactions
    C.  2PC Coordinator Information
 
    J.  Resolve JTA Transactions
 
    M.  Modify Defaults
    Q.  Quit
 
    Enter your selection: R&D
 
07/18/13        OpenEdge Release 10 Monitor (R&D)
16:21:14        Main (Top) Menu
 
                1. Status Displays ...
                2. Activity Displays ...
                3. Other Displays ...
                4. Administrative Functions ...
                5. Adjust Monitor Options
 
Enter a number, <return>, P, T, or X (? for help): 3
 
07/18/13        OpenEdge Release 10 Monitor (R&D)
16:21:16        Other Displays Menu
 
                1. Performance Indicators
                2. I/O Operations by Process
                3. Lock Requests By User
                4. Checkpoints
                5. I/O Operations by User by Table
                6. I/O Operations by User by Index
                7. Total Locks per User
 
Enter a number, <return>, P, T, or X (? for help): 4
 
07/18/13        Checkpoints
16:21:17

Ckpt                                 ------ Database Writes ------
No.  Time        Len   Freq   Dirty   CPT Q   Scan  APW Q  Flushes  Duration  Sync Time
 43  16:08:05    792      0  199800   85592   1768      0        0     29.22       0.07
 42  15:23:54   2534   2651  170968  154583  19656      0        0     31.88       0.05
 41  14:44:14   2019   2380  158081  155798  11804      0        0     36.19       0.05
 40  14:13:06   1742   1868  127145  125842   4706      0        0     25.46       0.06
 39  13:49:25   1320   1421  120683  119758   3640      0        0     20.86       0.05
 38  13:29:00   1189   1225   55616   53883   2782      0        0     29.12       0.06
 37  13:19:16      0    584   60079   59546   1508      0        0     21.23       0.08
 36  13:10:31    507    525  138849  138316   1040      0        0     20.60       0.08
 
Enter <return>, R, U, P, T, or X (? for help):
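
The Duration column above can be summarised with a short script. This is only a minimal sketch in Python, not a Progress tool: the field names in `FIELDS` and the helper `parse_rows` are made up for illustration, and the rows are pasted by hand from the Checkpoints screen.

```python
from statistics import mean

# Rows copied by hand from the Checkpoints screen above (checkpoints 36 to 43).
ROWS = """\
43 16:08:05 792 0 199800 85592 1768 0 0 29.22 0.07
42 15:23:54 2534 2651 170968 154583 19656 0 0 31.88 0.05
41 14:44:14 2019 2380 158081 155798 11804 0 0 36.19 0.05
40 14:13:06 1742 1868 127145 125842 4706 0 0 25.46 0.06
39 13:49:25 1320 1421 120683 119758 3640 0 0 20.86 0.05
38 13:29:00 1189 1225 55616 53883 2782 0 0 29.12 0.06
37 13:19:16 0 584 60079 59546 1508 0 0 21.23 0.08
36 13:10:31 507 525 138849 138316 1040 0 0 20.60 0.08
"""

# Hypothetical field names, one per column of the screen.
FIELDS = ("no", "time", "len", "freq", "dirty", "cpt_q", "scan",
          "apw_q", "flushes", "duration", "sync_time")
FLOAT_FIELDS = ("duration", "sync_time")


def parse_rows(text):
    """Turn whitespace-separated checkpoint rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip blank lines and anything that is not a data row
        row = dict(zip(FIELDS, parts))
        for key in FIELDS:
            if key == "time":
                continue  # keep the timestamp as text
            row[key] = float(row[key]) if key in FLOAT_FIELDS else int(row[key])
        rows.append(row)
    return rows


rows = parse_rows(ROWS)
durations = [row["duration"] for row in rows]
print(f"checkpoints:   {len(rows)}")
print(f"mean duration: {mean(durations):.2f} s")
print(f"max duration:  {max(durations):.2f} s")
print(f"total flushes: {sum(row['flushes'] for row in rows)}")
```

For these eight checkpoints the mean duration comes out just under 27 seconds, with no flushes in any of them.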

Rob Fitzpatrick · Jul 18, 2013

You're using the largest possible BI cluster size. How does it perform with a smaller cluster size? This will make your checkpoints closer together, but if they are 5 minutes apart at peak load then you have tolerance for that. It may help with the responsiveness of the application.
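
As a rough illustration of why a smaller cluster brings checkpoints closer together, the sketch below estimates the time to fill one BI cluster at a steady write rate. It assumes a checkpoint is triggered each time a cluster fills and ignores everything else; the function name and the 200,000 bytes/sec rate are hypothetical, chosen only to show the arithmetic, and the 16384 KB cluster is just an example of a smaller size.

```python
def checkpoint_interval_seconds(cluster_size_kb, bi_bytes_per_sec):
    """Seconds needed to fill one BI cluster at a steady BI write rate."""
    if bi_bytes_per_sec <= 0:
        raise ValueError("BI write rate must be positive")
    return cluster_size_kb * 1024 / bi_bytes_per_sec


# Assumed rate, for illustration only; measure the real one during peak load.
ASSUMED_BI_BYTES_PER_SEC = 200_000

for cluster_kb in (262128, 16384):
    seconds = checkpoint_interval_seconds(cluster_kb, ASSUMED_BI_BYTES_PER_SEC)
    print(f"{cluster_kb:>7} KB cluster: about {seconds / 60:.1f} minutes between checkpoints")
```

The interval scales linearly with cluster size, so at the same write rate a cluster one sixteenth the size checkpoints roughly sixteen times as often.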

I'm curious, is all that RAM being used by some other DBs or processes? Can you use it for buffer pool?

FocusIT · Jul 19, 2013

The checkpoint duration is exactly the same with a small cluster size. I tried this last week, and UI performance was even worse because the pauses were more frequent. I raised the same issue with PSC and they suggested turning off the Alternate Buffer Pool altogether by setting -B2 to zero. I did this last night and the average checkpoint duration is now around 10 seconds instead of 30. Overall performance has dropped, however, as the hot tables are now in the standard buffer pool and the LRU policy is recycling them.

Does anyone know whether this drop in checkpoint duration is due to turning off the alternate buffer pool, or to the overall reduction in -B + -B2, i.e. the overall buffer pool is now 20m less with -B2 set to zero?

In answer to the memory question: the rest of the server memory is split between other databases on the same box, with about 96GB left for the OS to utilise. Being Windows, it uses all of it and sometimes even pages to disk.

FocusIT · Jul 19, 2013

Could this be a rare case for using more than one APW? The write volume against this database is large, it generates 20-30GB of AI files a day.

Rob Fitzpatrick · Jul 19, 2013

FocusIT said:
Could this be a rare case for using more than one APW? The write volume against this database is large, it generates 20-30GB of AI files a day.

I wouldn't describe using more than one APW as "rare".

Check your latch wait timeouts. If there is contention for the LRU latch then you would probably benefit from using the -lruskips parameter, however you need to update your service pack to get access to that startup param. Is there something keeping you on SP05?

TomBascom · Jul 19, 2013

What sort of disk is the bi file on? Is there a SAN involved?

Is after-imaging enabled? With an AIW?

Why is -spin so large?

Since -B2 works for you, you should upgrade to 10.2B06 or better and get the advantages of -lruskips.

Are client connections remote? Or do you have local app-server connections?

Is this a virtualized server?

FocusIT · Jul 19, 2013

Thanks Rob. Where in promon can I monitor LRU latch contention?

The mother of all change control processes is keeping us on SP05. I had to work extremely hard to push through an upgrade from 9.1D last March to 10.2B05. I would rather explore all options with SP05 before starting the regression testing cycle and change control process for SP07 or even OE11.

FocusIT · Jul 19, 2013

Hi Tom

BI is on physical RAID 1+0 disk, a fast array with 16 spindles. Nothing else is on the array other than .r code, i.e. no other database technologies or services.

Would prefer to avoid regression testing for an SP upgrade (see my earlier posting), but if SP06 or 07 is the answer then I will have to bite the bullet.

Clients are remote, no app-server.

No virtualized server or SAN involved, database has its own dedicated box.

FocusIT · Jul 19, 2013

Is -spin too high for a server with 24 hyper-threaded cores? What would be a better setting?

Rob Fitzpatrick · Jul 19, 2013

FocusIT said:
Thanks Rob. Where in promon can I monitor LRU latch contention?

You can see total latch waits in promon R&D 3 1. For the individual latch counts, go to promon R&D, then enter "debghb" (without the quotes), then select menu 6, then option 11 (latch counts).

Rob Fitzpatrick · Jul 19, 2013

You may also want to look at empty buffer waits (promon R&D 2 5). With all that transaction activity I'd be surprised if -bibufs 50 is enough. If you do increase it, set -aibufs to the same value.

TomBascom · Jul 19, 2013

SP06+ also has some very significant client/server improvements. And -lruskips is huge.

In my experience -spin is NOT related to number of cores. And values like 5,000 or 10,000 are better than 100k.

In the short term I would increase your -bibufs to 500. Especially if you see empty bi buffer waits.

You didn't answer the after-imaging question...

For monitoring latch contention I suggest ProTop... Http://DBAppraise.com/ProTop.html

FocusIT · Jul 19, 2013

Here is R&D 3 1. Is this level of latch timeouts good or bad?

Code:

07/19/13        Activity: Performance Indicators
13:13:53        07/18/13 22:40 to 07/19/13 13:13 (14 hrs 34 min)
 
                                    Total        Per Min          Per Sec          Per Tx
 
Commits                          7704011            8818          146.96            1.00
Undos                                240              0            0.00            0.00
Index operations                  1260870K        1477802        24630.04          167.59
Record operations                1649485K        1933278        32221.31          219.25
Total o/s i/o                    52973996          60633          1010.55            6.88
Total o/s reads                  39850681          45612          760.20            5.17
Total o/s writes                13123315          15021          250.34            1.70
Background o/s writes            12945951          14818          246.96            1.68
Partial log writes                627329            718            11.97            0.08
Database extends                        0              0            0.00            0.00
Total waits                        598481            685            11.42            0.08
Lock waits                            148              0            0.00            0.00
Resource waits                    598333            685            11.41            0.08
Latch timeouts                    302486            346            5.77            0.04
 
Buffer pool hit rate:  99 %    Primary pool hit rate:  99 %    Alternate pool hit rate:  0 %
 
Enter <return>, A, L, R, S, U, Z, P, T, or X (? for help):

FocusIT · Jul 19, 2013

Sorry Tom, yes AI is enabled with an AIW.

FocusIT · Jul 19, 2013

R&D 2 5

Code:

07/19/13        Activity: BI Log
13:16:48        07/18/13 22:40 to 07/19/13 13:13 (14 hrs 34 min)
 
                                    Total        Per Min          Per Sec          Per Tx
 
Total BI writes                  3466430            3968            66.13            0.45
BIW BI writes                    1961988            2246            37.43            0.25
Records written                  93448025          106959          1782.64          12.13
Bytes written                    11343672K      13295343        221589.06        1507.78
Total BI Reads                      53658              61            1.02            0.01
Records read                          719              1            0.01            0.00
Bytes read                          87743            100            1.67            0.01
Clusters closed                        44              0            0.00            0.00
Busy buffer waits                  380901            436            7.27            0.05
Empty buffer waits                    930              1            0.02            0.00
Log force waits                        0              0            0.00            0.00
Log force writes                        0              0            0.00            0.00
Partial writes                    613748            702            11.71            0.08
Input buffer hits                    529              1            0.01            0.00
Output buffer hits                    250              0            0.00            0.00
Mod buffer hits                      261              0            0.00            0.00
BO buffer hits                        246              0            0.00            0.00
 
Enter <return>, A, L, R, S, U, Z, P, T, or X (? for help):
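
The BIW's share of BI writes can be worked out from the totals on this screen. A minimal sketch, assuming "BIW BI writes" and "Partial writes" are both counted within "Total BI writes"; the helper `share` is hypothetical.

```python
# Totals copied from the Activity: BI Log screen above.
total_bi_writes = 3466430
biw_bi_writes = 1961988
partial_writes = 613748


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


biw_share = share(biw_bi_writes, total_bi_writes)
partial_share = share(partial_writes, total_bi_writes)
print(f"BIW share of BI writes:     {biw_share:.1f}%")
print(f"Partial writes, % of total: {partial_share:.1f}%")
```

Over this 14.5 hour window the BIW performed a little over half of the BI writes, which matches the observation that "Writes by BIW" sits near 50%.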

TomBascom · Jul 19, 2013

You're looking at averages over a 14 hour period. That's not helpful. Do a 10s sample during a problem time - that's much more revealing.
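
If the counters are collected by script rather than read off the screen, a short sample boils down to differencing two snapshots of the cumulative totals. A minimal sketch, assuming the totals have already been captured into dictionaries: the `before` values are the totals posted above, while the `after` values are made up to illustrate a busy 10 second window.

```python
def rates_per_second(before, after, interval_seconds):
    """Per-second rate for each cumulative counter over one sample interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return {name: (after[name] - before[name]) / interval_seconds
            for name in before}


# Cumulative totals at the start of the sample (from the screens above).
before = {"Latch timeouts": 302486, "Busy buffer waits": 380901}
# Hypothetical totals 10 seconds later, invented for this example.
after = {"Latch timeouts": 302986, "Busy buffer waits": 381201}

for name, rate in rates_per_second(before, after, 10).items():
    print(f"{name}: {rate:.1f} per second")
```

With these invented numbers the sample shows 50 latch timeouts per second, against a 14 hour average of 5.77, which is the kind of spike a long average hides.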
