Question Parallel Load

Hi All,
I am facing performance issues with parallel load with inline index build.

Let's take a scenario: I was able to run 4 loads with inline index build in under 6 minutes when I ran them one by one (without proserve, just proutil). The same set of 4 loads takes around 23 minutes to complete when run in parallel from 4 terminals (with proserve). What do you think the reason could be?

proserve test.db -spin 5000 -bibufs 5 -n 70 -Mn 10 -Mpb 4 -Ma 10 -Mi 3 -minport 10206 -maxport 10270 -Mm 32600 -mmax 65534 -baseindex 1 -indexrangesize 4000 -basetable 1 -tablerangesize 4000 -B 1250000 -i -pwsdelay 100 -pwscan 4166 -pwwmax 4166

proutil test.db -C load /dumplocal/t1.bd build indexes -T /tmpForidx -spin 5000 -i -B 312500 -TB 64 -TM 32 -TMB 64 -SG 64
proutil test.db -C load /dumplocal/t2.bd build indexes -T /tmpForidx -spin 5000 -i -B 312500 -TB 64 -TM 32 -TMB 64 -SG 64
proutil test.db -C load /dumplocal/t3.bd build indexes -T /tmpForidx -spin 5000 -i -B 312500 -TB 64 -TM 32 -TMB 64 -SG 64
proutil test.db -C load /dumplocal/t4.bd build indexes -T /tmpForidx -spin 5000 -i -B 312500 -TB 64 -TM 32 -TMB 64 -SG 64

Do we have to add any other parameter or tweak some values to resolve the performance issue?

Regards,
Saravanakumar B
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
When discussing binary load and index rebuild it is very important to mention your exact OpenEdge version. You appear to be on at least 10.2B04 but I'd rather not guess. If you're on a late service pack of 10.2B, or 11.2+, you can likely improve your performance considerably either by changing parameters or by changing your approach, e.g. separating the binary loads from the index rebuilds.

Platform information (OS and hardware, including available RAM and # of cores) would also be useful and relevant.
 
Sorry Rob, my bad. We are in the process of migrating the DB from 9.1E to 11.6 (the load process is on 11.6).

Can you please let me know which parameters have to be added or tuned to improve the performance in this case?

I wonder why the load of 4 tables takes approximately 26 minutes when run in parallel in different sessions, whereas it takes hardly 6 minutes when run one by one (without proserve). Ideally, running in parallel should improve performance, since there is no dependency between the loads. I am confused about what the underlying issue could be.

System Level Details:
Linux CentOS 7
4 x Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
32 GB RAM (can increase to 64 or 128 GB)

Regards,
Saravanakumar B
 
Hi Rob,
Please find more specifications below:

Each CPU:
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
stepping : 2
microcode : 0x710
cpu MHz : 2600.000
cache size : 20480 KB
physical id : 6
siblings : 1
core id : 0
cpu cores : 1
apicid : 6
initial apicid : 6
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm ida arat pln pts dtherm
bogomips : 5200.00
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual

Regards,
Saravanakumar B
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
For starters, review this presentation from PUG Challenge Americas 2013:
Index Rebuild Performance, Wei Qiu, Progress Software
Slides: http://pugchallenge.org/downloads2013/261_Index_Rebuild_Perf.pptx
Audio: http://pugchallenge.org/downloads2013/audio/261_Index_Rebuild_Perf.mp3

It discusses the changes made to proutil idxbuild in late 10.2B and early 11.x to improve performance.

What works best will ultimately depend on your data, schema, structure, and hardware. I assume you understand structure design and have a reasonable Type II structure for your new v11 database.

There is no one approach or set of parameter values that is optimal for all cases. The key to success is testing and documentation. Write your test cases, execute the tests, log the results, analyze and document the results, move on to the next test, etc.

There are different ways to approach load performance optimization, including inline index rebuild as you're doing. There are also different constraints that can influence which approach is best, or at least good enough, like the combination of database size, hardware speed, and allowable downtime window. If you can take 12 hours to dump and load, a difference of 6 minutes versus 23 to load and rebuild doesn't matter. If you can only afford one hour of downtime, or just as little as possible, then that difference does matter.

I typically start with the approach of parallel binary loads in a database opened multi-user, no-integrity. I try to balance the load jobs such that they all do more or less the same amount of work and finish in a similar amount of time. Usually I have a few large, slow-loading tables and a lot of small ones. So I create one thread (a shell script with one or more proutil loads) for each big chunk of work, i.e. one or more large tables, and one other thread for all of the other small tables. I import dbanalys output into Excel and sort by table size to determine how to arrange tables into separate threads.
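For illustration only, here is a minimal sketch of what one such load "thread" might look like, with hypothetical dump file names (split your own tables across scripts based on the dbanalys sizes; tuning parameters omitted):

#!/bin/sh
# load_thread1.sh - one load thread: a couple of large, slow-loading tables
# (dump file names are hypothetical; the db is already started multi-user with -i)
proutil test.db -C load /dumplocal/bigtable1.bd
proutil test.db -C load /dumplocal/bigtable2.bd

Run one script per terminal, or background them all, and wait for every thread to finish.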

When the loads are finished I shut down and back up the database. Don't skip the backup to save time. Then I run idxbuild all with parameters something like this:

proutil dbname -C idxbuild all -B 1024 -datascanthreads <a> -TF 60 -TB 64 -TM 32 -SG 64 -T <fast, non-DB disk> -threadnum <b> -mergethreads <c> -pfactor 80 -rusage

The values of <a>, <b>, and <c> are dependent on data and available hardware. It sounds like you have lots of RAM, which is good. But of course "lots" is relative, depending on the size of your DB.

With the thread parameters, you also have to consider OS file handle limits, if any. Each proutil thread spawned, including the parent thread, will open every database file. So if you aren't running as root (where these limits can be ignored), lots of extents times lots of threads can quickly add up to a large number and cause a crash if you hit the limit. At that point your only option is to restore from backup and start the idxbuild again.
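As a rough sanity check (just a sketch; adjust paths to your environment), compare the per-process open file limit with approximately the number of extents times the number of proutil threads:

ulimit -n                                # per-process open file descriptor limit for this shell
prostrct list test.db /tmp/extents.st    # writes a structure listing you can count extents from

If extents x (threads + 1) comes anywhere near the ulimit value, raise the limit (or run as root) before starting the idxbuild.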

As a starting point, try 16 for -datascanthreads, 2 for -threadnum, and 4 for -mergethreads.
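Plugged into the command above, that starting point would look something like this (the -T path is just the one from your load commands; treat the whole line as a sketch to test against your own data, not a tuned recommendation):

proutil test.db -C idxbuild all -B 1024 -TF 60 -TB 64 -TM 32 -SG 64 -T /tmpForidx -datascanthreads 16 -threadnum 2 -mergethreads 4 -pfactor 80 -rusage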

The idxbuild runs separately for each storage area. The performance goal is to minimize and if possible eliminate temp-file disk write I/O during the data scan/key build phase and temp-file read I/O during the sort/merge phase. If -TF (the percentage of available memory to use for sort/merge buffers) gives you enough memory to hold all the key data for the indexes being built for a given area then this temp-file I/O will be zero. When you specify -rusage, at the end of processing for each area you will see (11480) and (11483) messages something like this:

DBUTIL : (11480) Temporary sort file at /home/robf/db/bg/gw used up 0K of disk space.
DBUTIL : (11483) A total of 0K of temporary sort disk space was used for area 7.

If it says "0K" was used then all of the key-sorting was memory-resident, which is ideal. If you get that result for every application area then -T doesn't matter because you aren't using it. If the size of your data relative to available memory is such that you can't avoid temp-file I/O then you may have to experiment with either different -T locations or using -SS instead of -T, with a sort-specification file that lists file systems with available space to use. The documentation describes how to use -SS.

There's lots to say about dump/load/idxbuild and I can't cover it all in one post. Try the above and let me know how it goes, and if you have any questions.
 

TomBascom

Curmudgeon
Aside from what Rob said... in modern releases it is *usually* going to be fastest to do one big index rebuild after everything is loaded.

That could change someday; it is always possible that a future release changes things and some other approach becomes the generally accepted fastest one. But in today's world that is the case.

Once upon a time, and under certain conditions, the inline index rebuild could be more effective, but I have not personally seen a case where it was faster for quite a long while. There was also a point in time when 4GL code using BUFFER-COPY worked very well (I can still sometimes get that approach to come out more or less as fast as a binary D&L, but the amount of work it takes is not worth it).

Also -- large values of -B are generally not effective with proutil. They may even be counter-productive.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
Will increasing cores help with parallel load?
Yes. Really I should have told you to lower -threadnum and -mergethreads as well, given that you only have 4 actual cores. I hope your hypervisor is configured not to overprovision cores and memory, i.e. that your VM actually has the dedicated use of those resources.

More threads with few cores will just lead to more context switching. If you can, bump up the cores to at least 8. In that case I would try -datascanthreads 12, -threadnum 2, -mergethreads 4.
 
@Rob Fitzpatrick - "More thread with few cores will just lead to more context switching"; Is that the reason why parallel load consumes ~23 minutes whereas loading tables one by one without using proserve and just proutil consumes ~6 minutes? Is this the bottle neck in my case?

Yes, we should be able to bump up the cores to 8.

Also we are running the parallel loads without an index; will this have a severe impact as well?
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
It's hard to say where bottlenecks might be without seeing your system.

The issue with parallel activities might be I/O contention; it might be the limited number of cores; it might be the configuration.

Have you mentioned what your performance/downtime target is and how far you are from it?
 
@Rob Fitzpatrick - I am performing a migration on a test machine to see how much downtime I will require, so I can go back to my management and see whether we have that time or need to come up with some other strategy.

I wanted to check whether I/O contention is the issue in this case. Should I be using promon [promon dbname -> R&D -> 3. Other Displays -> 2. I/O Operations by Process] to cross-verify, or do you have some other suggestion?
 
@Rob Fitzpatrick - I am a newbie to Progress DB administration and pretty much a Progress programmer. Sorry for asking dumb questions.

I did monitor during the parallel load, and below are the statistics I got:

Parallel load stats using promon:

                         Total    Per Min    Per Sec   Per Tx
Commits                   5946      35676     594.60     1.00
Undos                        0          0       0.00     0.00
Index operations       2727783   16366698  272778.30   458.76
Record operations       567425    3404550   56742.50    95.43
Total o/s i/o            92096     552576    9209.60    15.49
Total o/s reads          18135     108810    1813.50     3.05
Total o/s writes         73961     443766    7396.10    12.44
Background o/s writes        0          0       0.00     0.00
Partial log writes        1702      10212     170.20     0.29
Database extends         16640      99840    1664.00     2.80
Total waits                  0          0       0.00     0.00
Lock waits                   0          0       0.00     0.00
Resource waits               0          0       0.00     0.00
Latch timeouts            1697      10182     169.70     0.29

Buffer pool hit rate: 98 %   Primary pool hit rate: 98 %   Alternate pool hit rate: 0 %

Iteration 31 of 9999, pause for 10 seconds ...

09/28/16        Activity: Performance Indicators
20:50:59        09/28/16 20:50 to 09/28/16 20:50 (10 sec)

                         Total    Per Min    Per Sec   Per Tx
Commits                   4583      27498     458.30     1.00
Undos                        0          0       0.00     0.00
Index operations       2777642   16665852  277764.20   606.08
Record operations       430595    2583570   43059.50    93.95
Total o/s i/o            78261     469566    7826.10    17.08
Total o/s reads          13876      83256    1387.60     3.03
Total o/s writes         64385     386310    6438.50    14.05
Background o/s writes        0          0       0.00     0.00
Partial log writes        1523       9138     152.30     0.33
Database extends         12544      75264    1254.40     2.74
Total waits                  0          0       0.00     0.00
Lock waits                   0          0       0.00     0.00
Resource waits               0          0       0.00     0.00
Latch timeouts            1767      10602     176.70     0.39

Buffer pool hit rate: 98 %   Primary pool hit rate: 98 %   Alternate pool hit rate: 0 %

Iteration 32 of 9999, pause for 10 seconds ...

09/28/16        Activity: Performance Indicators
20:51:09        09/28/16 20:50 to 09/28/16 20:51 (10 sec)

                         Total    Per Min    Per Sec   Per Tx
Commits                   4707      28242     470.70     1.00
Undos                        0          0       0.00     0.00
Index operations       2728389   16370334  272838.90   579.64
Record operations       443427    2660562   44342.70    94.21
Total o/s i/o            79147     474882    7914.70    16.81
Total o/s reads          14545      87270    1454.50     3.09
Total o/s writes         64602     387612    6460.20    13.72
Background o/s writes        0          0       0.00     0.00
Partial log writes        1514       9084     151.40     0.32
Database extends         12992      77952    1299.20     2.76
Total waits                  0          0       0.00     0.00
Lock waits                   0          0       0.00     0.00
Resource waits               0          0       0.00     0.00
Latch timeouts            1786      10716     178.60     0.38

Buffer pool hit rate: 98 %   Primary pool hit rate: 98 %   Alternate pool hit rate: 0 %

Iteration 33 of 9999, pause for 10 seconds ...


Do you think there is a locking contention issue in this case?
 

TomBascom

Curmudgeon
One thing I noticed is that you are doing a lot of "database extends". Those are very expensive.

But there is a trick that you can use! Make a backup of your db after you have first built it and loaded the DF file -- but *before* you load any data. (Use the -com option on probkup.) This empty backup can then be used to re-initialize the target database without rebuilding it. Just restore the empty backup on top of the new db when you have finished testing. By doing that you avoid needing to keep extending and formatting the extents as the load grows the db. This can save you many, many hours of time on subsequent loads and index rebuilds.
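A rough sketch of that sequence, with placeholder db and backup file names:

# after creating the db and loading the .df, but before loading any data:
probkup test.db /backups/test_empty.bck -com

# ... run a load / idxbuild test ...

# reset for the next test run by restoring the empty backup over the db:
prorest test.db /backups/test_empty.bck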

The first run-through will still be slower. But your 2nd, 3rd, 4th... tests will all get the advantage of the space being pre-allocated.
 

cj_brandt

Active Member
What Tom said - write to fixed extents as much as possible during the load.
I recommend putting tables with word indexes in their own area to avoid a single-threaded data scan during the idxbuild.
The more memory you can allocate to the idxbuild operation - the -TF parameter - the better. The summary at the end of each area will show how much disk space was required to hold temp files.
I like running the index rebuild per area rather than the entire db, but that is just a preference, not really performance-impacting.
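For reference, the per-area variant is just a different target on the same command; something like this, where the area name is a placeholder and the tuning parameters are the ones discussed above:

proutil test.db -C idxbuild area "Customer Index Area" -TF 60 -TB 64 -TM 32 -SG 64 -T /tmpForidx -datascanthreads 16 -threadnum 2 -mergethreads 4 -rusage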

Assuming you are dumping from one db to a new db:
table analysis on the source
dump the tables in parallel
load the tables in single-user mode with -i as soon as a table is dumped
idxbuild
table analysis on the destination

You might want to test upgrading the 9.1E db to 11 before running the dump. 11 may dump data faster with its multi-threaded dump than 9 can with a single thread.
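If you do test that route, the multi-threaded binary dump on 11 is enabled per table with the -thread option; a hypothetical example (db, table, and directory names are placeholders):

proutil sourcedb -C dump customer /dumplocal -thread 1 -threadnum 4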
 

Fazlur Khan

New Member
If you are doing parallel loads, monitor the checkpoints. If the checkpoint length is too short and it takes a long time to sync that data, then that is the bottleneck slowing you down. You can increase the BI cluster size, increase -bibufs, and add a BIW. This will give you some relief. For better performance, do a serial load with fixed extents.
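A sketch of those BI-side changes, with illustrative values only (the cluster size change is done against the truncated BI file while the db is offline; the size is given in kilobytes):

proutil test.db -C truncate bi -bi 16384    # 16 MB BI cluster size
proserve test.db -bibufs 25                 # more BI buffers, along with your other startup parameters
probiw test.db                              # start a before-image writer against the running db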

Latch timeouts seem high, so there is definitely contention on shared resources ... I think you can increase -spin a little and check again.
 