best direction to tweak my idxbuild parameters

jurriaan

New Member
I'm testing a dump and load for a move from Windows (Enterprise 10.1C) to Linux (Enterprise 10.2B07).
The dump part has been solved by our friendly system admins, who created a Windows 2003 VMware instance with SSD storage. Now the dump takes just an hour for 49 gigabytes. We have a downtime window of about 12 hours, but any time saved will let me get some rest before the next day :)

The load will take place on a Red Hat Enterprise Linux VMware instance: a single Xeon 2.9 GHz CPU, 16 GB of memory, and storage on a NetApp MetroCluster. I can't influence the storage configuration; I have no idea what RAID levels lurk in those murky depths.

I've made great speed advances already:

  • sequential import of all tables + build indexes on the fly (-TB 24 -TM 32 -SS ... -G 1): 3 hours 40 minutes
  • parallel import of the one big table (20 GB) / all other tables + build indexes on the fly (-TB 24 -TM 32 -SS ... -G 1): 2 hours 40 minutes
  • parallel import of the one big table / all other tables, build indexes afterwards (-C idxbuild all -thread 1 -threadnum 4 -SS ... -TB 64 -TF 85 -TM 32 -datascanthreads 2 -B 1024 -TMB 64 -rusage): 1 hour 40 minutes

So I have saved 2 hours already. When building the indexes for the big table, I read:

Multi-threaded index sorting and building complete. Elapsed time: 951.483
Resource usage: CPU user 804.163749, system 42.930474
Resource usage: DISK reads: 35005288 KB at 36 MB/sec, writes: 32735828 KB at 34 MB/sec
16 indexes were rebuilt in area 10. Total elapsed time: 1859.124
Resource usage: CPU user 1638.285943, system 72.874921
Resource usage: DISK reads: 56857840 KB at 30 MB/sec, writes: 41998548 KB at 22 MB/sec

Temporary sort file at /srt/1/ used up 3118720K of disk space. (11480)
Temporary sort file at /srt/2/ used up 3390656K of disk space. (11480)
Temporary sort file at /srt/3/ used up 2507072K of disk space. (11480)
Temporary sort file at /srt/4/ used up 2494272K of disk space. (11480)
A total of 11510720K of temporary sort disk space was used for area 10. (11483)

Do the experts see any good avenues to speed the process up further? I could ask for more memory (would 32 GB make the temporary sort files disappear?) or a second CPU, or tweak parameters, but I'd like a hint on where to look first.

Thanks for reading!
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
In 10.2B07 (is there a reason you didn't go with 10.2B08, or for that matter 11.3.2?) there are two really big parameters that will help you a lot: -datascanthreads, during the data scan/key build phase, and -TF, during the last three phases (data scan/key build, sort/merge, and key insertion). During an index rebuild, each of the four phases is repeated for each data storage area. The -datascanthreads parameter allows you to multi-thread the reads of RM (data) blocks to extract the index key data into sort blocks. This phase will be multi-threaded if certain conditions exist:
  • you have indicated that you have enough space for disk sorting;
  • the area being scanned doesn't contain any of the indexes that are being rebuilt;
  • the area being scanned doesn't contain any tables that have word indexes;
  • the area being scanned is a Type II area.
The first, second, and fourth conditions should always be true, and the third is usually true. You haven't mentioned your structure, but I assume, given your care and attention to this index rebuild, that you have taken this opportunity to analyze your data and your application's data access patterns to determine whether your database structure or your allocation of objects to areas needs to change.
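
To make the data scan phase concrete, an invocation looks something like this (a sketch only: the db name, sort spec path, and thread count are placeholders, and -SS points at a sort file spec as in your own runs):

proutil mydb -C idxbuild all -thread 1 -datascanthreads 4 -SS /srt/mydb.srt -TB 64 -TM 32 -rusage

When the conditions above are met you get the multi-threaded work, and the log reports it, as in the "Multi-threaded index sorting and building complete" line you quoted.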

The other big parameter I mentioned is -TF. It determines the amount of "available" memory, expressed as a percentage, that idxbuild will use for sort file space per area. The more memory it can allocate, the less disk sorting you need to do. If the sort files can be completely memory-resident then you save I/O in multiple phases: you eliminate writes during the data scan/key build phase, where unsorted index keys are written to sort blocks and merged into merge blocks; reads and writes during the sort/merge phase, where merge blocks get merged into larger sorted "runs" of merge blocks, -TM at a time; and reads during the key insertion phase, where the b-tree is built on disk.

This is a significant optimization. You want to use enough -TF to eliminate sort file disk I/O if you can. If you can't, use as much -TF as possible to minimize sort file disk I/O, but not so much that you starve the OS for memory and cause swapping, which is a huge performance penalty.
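
Rough arithmetic for your numbers (an estimate only: "available" means free memory when idxbuild starts, and the OS and the -B/-TB/-TMB buffers still need their share): area 10 used 11,510,720 K, about 11 GB, of sort space. With 16 GB of RAM, -TF 85 can offer at best 0.85 * 16 = ~13.6 GB, and in practice less, which is why you still spilled to /srt. With 32 GB, -TF 85 offers ~27 GB, comfortably above the 11 GB needed, so that area's sort should stay memory-resident.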

I have run into rebuilds like the one you describe, where one large, heavily-indexed table dominates the rebuild. I assume this table is in its own area and its indexes are in their own area. I had a situation like this (on 10.2B06) where I tweaked as much as I could and thought I could get no faster with a single idxbuild, and I still couldn't eliminate sort file I/O. I then experimented with multiple successive idxbuilds, down to the individual index level. My table was about 12 GB and had 12 indexes; in total it used over 16 GB of sort space, and I had 8 GB of RAM on my box. I tested each index in turn without -TF to determine its sort file size needs, so I knew their order in descending size.

Ultimately my best performance came from a shell script where I ran six idxbuilds: four for the largest individual indexes and then two groups of four small indexes each. It was actually a pretty significant gain in performance, though I don't remember the numbers offhand. Note that index size as reported in ixanalys/dbanalys does not necessarily correspond to potential sort file size, which is affected by factors like whether an index is unique or is a word index.

Note also that this approach is a tradeoff, as it means you repeat some work in the first two phases: index scan and data scan/key build. The first typically doesn't take much time and isn't worth the trouble of optimizing (and isn't required when building a new DB from scratch, as there is no index data yet). The second does take a while in some circumstances, and if you run six idxbuilds for a given table then you are doing six data scans of that data area. But in my case the payoff from eliminating sort file disk I/O completely was so big that it overcame the penalty of the multiple data scans, and then some. You'll just have to experiment with your data and your system to figure out what is best. Test, test, test, and log your results carefully. But it looks like you're already doing that, so kudos to you.
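
As a sketch of what such a script can look like (everything here is illustrative: the db and table names are made up, and I'm using the documented "table" qualifier to split the work per table; for true per-index selection you have to drive idxbuild's interactive "some" menu, e.g. with a here-document, and the exact prompts vary by version, so walk through them interactively first):

#!/bin/sh
# run successive idxbuilds, biggest sort-space consumers first,
# each sized so its sort fits within the -TF memory allowance
DB=mydb
for T in order_line order_hist customer invoice
do
  proutil $DB -C idxbuild table $T -thread 1 -threadnum 4 \
    -TB 64 -TF 85 -TM 32 -SS /srt/$DB.srt -rusage || exit 1
done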

Another optimization available to you is multi-threading of the sort/merge phase with the -threadnum and -mergethreads parameters. Unfortunately, being on a one-core VM really hurts you here. At least I think that's what you have, but it's worth clarifying: you said "1 single Xeon 2.9 Ghz cpu". Does that mean you see one logical core in top/htop/nmon, or is it one physical CPU with multiple logical cores? If the latter, you're in better shape than I thought and can bump up your threading-related parameters, including -datascanthreads. If it really is one logical core, then you need to push back. Who the heck deploys a 1-core VM for a database server in 2014, unless they're running a sports2000 application? Get them to give you more cores! Then set -datascanthreads to cores * 1.5, and set -threadnum * -mergethreads to cores * 1.5, with -threadnum between 2 and 4. Be careful of your handle usage: more threads means more handles, as each thread opens every file in the DB, and then some. If you aren't running as root you may want to bump up your ulimit -n.
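
For example, on an 8-core box that rule of thumb works out like this (a sketch; nproc counts logical cores, the names and values are placeholders, and the 1.5 multiplier is the rule of thumb above, not a hard limit):

CORES=$(nproc)                      # logical cores visible to the OS, e.g. 8
DST=$(( CORES * 3 / 2 ))            # -datascanthreads = cores * 1.5 -> 12
THREADNUM=4                         # keep -threadnum between 2 and 4
MERGETHREADS=$(( DST / THREADNUM )) # threadnum * mergethreads ~= cores * 1.5 -> 3
ulimit -n 4096                      # headroom for per-thread file handles
proutil mydb -C idxbuild all -thread 1 -threadnum $THREADNUM \
  -mergethreads $MERGETHREADS -datascanthreads $DST \
  -TB 64 -TF 85 -TM 32 -SS /srt/mydb.srt -rusage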

You mentioned 16 GB of RAM. Maybe they can bump it up, even if it's just temporarily during the rebuild; that should be possible. And yes, 32 GB should make your sort file I/O go away.

My condolences on the NetApp.

I hope and assume that /srt is not also used for database storage. If you are forced to do sort file I/O then you don't want to share disk throughput between database access and sort file access.
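
For reference, spreading sort fragments across mount points is done in the spec file you pass to -SS. A sketch, with the caveat that this is from memory, so verify the format against the docs for your version: I believe each line is a fragment size (in 1 K units, with 0 on the last line meaning "grow as needed") followed by a directory, something like:

4000000 /srt/1/
4000000 /srt/2/
4000000 /srt/3/
0 /srt/4/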

Fun "zecret": you can use the undocumented -z parameter for extra debug information about merge runs, etc. May not be useful but I enjoy "looking in the engine room", as it were.

It's worth talking a bit more about your structure: are you using a new all-variable structure for each test? If so, you are paying the penalty of extending your extents as you write to them. I deal with this by doing one test load to create fully-grown extents. Then I procopy empty8 over the top of that DB, load my application schema, and probkup. For successive tests I can then restore this DB as my target. It still has an all-variable structure, so I keep the benefit of easy management going forward, but I don't waste time on extend operations during the D&L/idxbuild.
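
In commands, that workflow looks roughly like this (a sketch: the db and file names are placeholders, empty8 assumes an 8 K block size, and the batch schema load via prodict/load_df.p is the usual approach; adjust for your setup):

# one-time, after a full test load has grown all the variable extents:
procopy $DLC/empty8 target                                  # re-initialize, reusing the grown structure
$DLC/bin/_progres target -b -p prodict/load_df.p -param myschema.df   # load the application schema
probkup target target_seed.bak                              # the reusable starting point

# before each subsequent test:
prorest target target_seed.bak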

One other parameter you haven't mentioned: -pfactor, the "packing factor". This is the percentage of an index block that proutil tries to fill with key data during the rebuild. The default value is 100, meaning it will fill them as much as possible. This sounds good, but it means potential post-rebuild performance impacts during database updates, as key insertions will be much more likely to cause expensive "block splits". A block split happens when a key needs to be inserted at a certain point in the b-tree but there is no space in the block: a new empty block is allocated and the block's contents are split evenly across the two blocks. The block "above", which previously pointed to one block, now points to two, so it has to be updated as well. In the worst case the split may propagate all the way up the tree to the root block. Progress recommends -pfactor 80, which means index blocks will be (roughly) 80% full, leaving room for future growth without immediately causing lots of block splits.

I don't think you mentioned -SG either. The default value is 48 and the maximum is 64. It specifies the maximum number of sort groups used for the current storage area. More sort groups mean more separate files, and more potential opportunity for concurrency via multi-threading. If you don't have more than 48 indexes in a given area you don't need to set it; if there are more indexes than sort groups, indexes are assigned to sort groups in round-robin fashion. I typically just set -SG 64 and forget about it: if the extra sort groups are needed they're available, and if fewer than 64 are needed the idxbuild will just use what it needs.

Congrats on the move of your DB away from Windows ;) and good luck with the rebuild. It looks like you're on the right track. Let me know if you have any questions.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
Some background material from recent PUG Challenge Americas conferences:

Still More Database Improvements (includes a section on idxbuild improvements in 10.2B06)
http://pugchallenge.org/2012PPT/NEPUG_Performance.pptx

Index Rebuild Performance (from the 10.2B07 timeframe)
slides: http://www.pugchallenge.org/downloads/261_Index_Rebuild_Perf.pptx
audio: http://www.pugchallenge.org/downloads/audio/261_Index_Rebuild_Perf.mp3

27 Days to 27 Hours: Best Practices in Dump & Load Optimization
slides: http://pugchallenge.org/2012PPT/2012USPUGCh27Daysto27Hours.pdf
 

jurriaan

New Member
Yes, I really have a single single-core CPU. This has something to do with VMware: apparently a single-core VM can be protected against hardware failure on the VMware host side, moving along transparently when the host crashes. That doesn't sound so bad (it certainly sounds better than the broken Windows cluster we're on now), and so far, speed has been acceptable.

I've been using Type II areas with fixed extents, plus a weekly monitoring script that notifies me ahead of time when an extent is filling up.

I'll try asking for more memory first and see how that works out. The fact that we're well within the maintenance window helps a lot!
 

cj_brandt

Active Member
One item I ran into that slowed down the index rebuild: I had tables with word indexes spread across multiple areas. Areas containing a word index will not use multiple threads, so I moved all tables with word indexes into their own area.
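
Something like this does the move (a sketch only; the db, table, and area names are invented, and tablemove moves the table's indexes too if you name an index area — note it is heavily logged and can be slow, so test it within your window):

proutil mydb -C tablemove doc_text WordTableArea WordIndexArea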
 

jurriaan

New Member
A new test with 2 CPUs and 24 GB memory, using these parameters:

proutil db -C idxbuild all -thread 1 -threadnum 8 -SS $SORTDIR/db.srt -TB 64 -TF 90 -TM 32 -pfactor 80 -datascanthreads 8 -B 1024 -TMB 64 -rusage

took 70 minutes, down from 100 minutes with a single CPU and 16 GB.

The one big table now reads:

Multi-threaded index sorting and building complete. Elapsed time: 566.118 (was 951 seconds)
Resource usage: CPU user 825.138560, system 12.401115
Resource usage: DISK reads: 1750060 KB at 3 MB/sec, writes: 6993648 KB at 12 MB/sec
16 indexes were rebuilt in area 10. Total elapsed time: 1036.813 (was 1859 seconds)
Resource usage: CPU user 1680.592511, system 47.321806
Resource usage: DISK reads: 23601514 KB at 22 MB/sec, writes: 8743344 KB at 8 MB/sec
Temporary sort file at /srt/1/ used up 493696K of disk space.
Temporary sort file at /srt/2/ used up 500032K of disk space.
Temporary sort file at /srt/3/ used up 383680K of disk space.
Temporary sort file at /srt/4/ used up 372288K of disk space.
A total of 1749696K of temporary sort disk space was used for area 10. (now 1.7 GB vs 11 GB before)

Of course I'd like to spend my days benchmarking whether 8 CPUs and 64 GB memory would be even faster, but I can live with this. I'm still not clear on dual-CPU support for VMware fault tolerance; there aren't many people in the organization who really know about that, and it turns out fault tolerance is off now anyway. Something to be resolved before we go to production. For now, I'm going to concentrate my efforts on translating lots of Windows paths to Linux paths in the application and the numerous scripts surrounding it, and removing each and every instance of 'cygpath'.
 

Cringer

ProgressTalk.com Moderator
Staff member
Don't forget that Linux is case-sensitive, where Windows isn't necessarily.
 