Question: Readprobe unexpected results, HP-UX vs Linux64

Paul Frost

New Member
Morning all,

We're in the process of evaluating a platform switch from HP-UX on the Itanium chipset to Linux64 on Dell Intel hardware. I've used readprobe consistently over the last 20 years to compare the read potential of one machine against another, and I'm getting some unexpected results from readprobe against the Linux platform.

The graph below shows a line for every Progress database server we have had, going back to 2000 when we first started using them.

The red and green lines at the top of the graph are the two main current HP machines we have. The profile of the line produced from these is similar to that of all previous machines tested, in that the number of reads increases steeply as each of the CPUs is utilised (i.e. one user can only use one CPU, whereas two users can use two CPUs, etc.). When all CPUs have users on them, the line flattens out as the processes all vie for CPU time, the sign of a good machine (for DB reads anyway) being that the reads don't drop too far from the max as the number of users ramps up.

The lilac line starting at 1m reads is the new Dell PER740 and, as you can see, it has the reverse profile, which is baffling me a bit. It starts with one user being able to read 1m records per second, which is excellent, but then nose-dives as the number of users increases. It's almost as if whatever component is handling the switching of processes between CPUs is the bottleneck and not working efficiently. Clearly, if it followed the same profile as the others, started at 1m and went up as the 16 CPUs were used, the overall results would be massively higher than the current machines, which is what we expected with the more powerful CPUs.

We've tried with hyper-threading on and off and it made little difference (the hyper-threaded result is the orange line, which mirrors the lilac one but sits just a bit below it).

Have any of you encountered anything similar? Are we missing something in our setup that would cause this inverse graph line? The Linux flavour is Red Hat. The HP machines have the same number of CPUs as the Dell server.

I would appreciate any words of wisdom on the subject please, either to understand the cause of the inverse results or to point at what can be tweaked to improve the results I am getting.

Thanks in advance

Paul

[Attachment: 1567852721512.png (readprobe comparison graph: reads vs. number of users for each server)]

TomBascom

Curmudgeon
The Progress and Readprobe versions can matter quite a bit. It is possible, for instance, that you are seeing a bug. I forget the gory details and I'm not in my office at the moment but there have been a couple of changes to VSTs over the years that had to be addressed.

What specific values of -spin are you using? Other startup options (-lruskips for instance) can matter quite a lot too. If you are going with default values some of the defaults change from version to version and that can have an impact.

Is the Linux box virtualized? Readprobe is pretty good at surfacing poor virtualization choices.

You say that the number of CPUs is the same as HPUX but do not mention what that number is. "Too many cores" is, for instance, a very real problem and can behave like this.

I am not intimately familiar with Dell's product naming scheme and have no idea what sort of CPUs a "Dell PER740" contains. What are the actual specs on those chips?

TomBascom

Curmudgeon
Update: Re-reading I see 16 cores mentioned. Is that with or without HT enabled? Checking Dell's website it looks like you can put just about any CPU into that rack so the question of what you actually have installed is still quite relevant.

TomBascom

Curmudgeon
FWIW, without any other information, this looks very much like the chart that I would expect to see with a single-core system. Or with -spin 0 or some other limitation making it act as a single-core system.

It is not typical of what I see on properly configured Linux systems.

Paul Frost

New Member
Morning. Thanks for the replies, they are much appreciated.

In terms of your questions:
-spin is set to 50000, as currently set on our 10.2B07 production boxes.
This planned upgrade will move us to 11.7.5, so we will definitely be looking closely at -lruskips.
RHEL is installed directly on the tin. No VM.
2 x physical CPUs, each with 8 cores. RHEL sees 16 OSCPUs, and 32 if hyperthreading is turned on.
Processors are Xeon Gold 6244 (launched Q2 2019), 3.60 GHz, 10.4 GT/s QPI, DDR4-2933.
NUMA is active... and so far that is where we have concentrated most of our efforts.

Below are findings to date.

Having read over the NUMA white paper (copy attached for reference) a few times, I did some more testing.

The machine we are testing, with hyperthreading turned off, is like the diagram below: 2 physical CPUs, each with 8 cores, which equals the 16 OSCPUs which Linux sees.
Turning on hyperthreading gives us the same 16 cores, but each with a sub-thread that Linux also sees as an OSCPU, hence it displays 32 OSCPUs when booted in that mode.
Each CPU has its own NUMA node (memory pool). Any OSCPU can read memory from either NUMA node. However, it is ~50% slower to read memory from the NUMA node not attached to the CPU containing the OSCPU of your process, i.e. the remote rather than the local NUMA node.
[Attachment: 1568102797728.png (diagram: 2 physical CPUs x 8 cores, one NUMA node per CPU)]
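For reference, the topology Linux actually sees can be confirmed with the stock NUMA tooling; nothing here is specific to readprobe, these are just standard RHEL commands:

numactl --hardware     # nodes, the OSCPUs in each node, per-node memory and the node distance matrix
lscpu | grep -i numa   # one-line summary of NUMA node count and the CPU ranges in each node

The distance matrix from numactl --hardware expresses the same local-versus-remote penalty the white paper talks about (10 for local, roughly double that for the remote node on a two-socket box).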

So in our tests so far, something like the following is happening: the first process fires up (for this example let's say it runs on OSCPU0 on CPU0), reads the DB, and because the DB is small (7 MB-ish) all of it is now stored in memory in NUMA node 0. The second process fires up, load-balances to CPU1 and, say, picks OSCPU1. All the records it needs are in shared memory, none of which is in its local NUMA node, so it has to read everything from the remote NUMA node 0 via CPU0, which is, according to the white paper, 50% slower. Play this out to our 50 users in the test and we have 25 of them trying to read from the remote NUMA node, which as well as being 50% slower in normal operation probably has its throughput reduced further because the remote CPU, as well as handling 25 remote users' requests for memory, is also being hammered by its own 25 users.

I think this is what is causing the reads profile we are seeing.
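A rough way to sanity-check that from the OS side (standard numastat from the numactl package; the PID below is a placeholder, and we haven't captured this formally yet) is to look at where the broker's memory actually lives while readprobe is running:

numastat -p <broker_pid>   # per-node breakdown of the memory mapped by the broker, i.e. is all of -B sitting in node 0?
numastat                   # system-wide per-node allocation counters (numa_hit, numa_miss, other_node) since boot

If the first command shows the whole shared-memory segment on one node while sessions are spread across both, the remote-access theory looks plausible.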

Try to prove the theory..

If we force the processes to only use the OSCPUs from one of the CPUs, all the memory they need will be in the NUMA node local to that CPU. OK, so we will only have half the OSCPUs to use, but they will access the shared memory much more quickly.

Noting how Linux enumerates the OSCPUs and the NUMA nodes, the following command runs the readprobe utility against, in effect, half the machine, i.e. all OSCPUs on the CPU which is attached to NUMA node 0:

#numactl -l --cpunodebind=0 ./readprobe.sh

htop then shows the running test with 8 CPUs maxed out (NB the numbering of OSCPUs in htop is 1 to 16 rather than 0 to 15). Inspecting the OSCPU affinity of one of the processes shows the OSCPUs it is instructed to use:

[Attachment: 1568102896053.png (htop screenshot: 8 OSCPUs maxed out, plus the CPU affinity of one readprobe process)]
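For the record, the same affinity can be pulled from the command line too (a sketch; _progres is the self-service client binary name on our install, substitute whatever your readprobe sessions run as):

taskset -cp $(pgrep -n _progres)            # CPU affinity list of the newest readprobe session
numactl -l --cpunodebind=0 numactl --show   # what a child of the wrapper inherits (cpubind/nodebind/membind policy)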

So what effect does this have on database reads per second, and on the profile as the users ramp up to 50? The new graph below shows the salmon-coloured line for the new configuration: a massive improvement over the same machine with auto NUMA settings, but still not as high a read rate as the current HP-UX machines.

[Attachment: 1568102923944.png (graph: salmon line shows reads with all sessions bound to NUMA node 0)]


Clearly a much better set of results. AND don't forget we now have the other 50% of the machine to do something with. For the sake of these tests, running a second readprobe against a second database using the other NUMA node at the same time should approximately double the results and get us back to the theoretical max reads we are trying to use as a comparison between machines. Below shows the number of reads with two databases, utilising both NUMA nodes of the machine. A much nicer profile for the investment in new hardware.

[Attachment: 1568102953551.png (graph: reads with two databases running in parallel, one per NUMA node)]
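For anyone wanting to reproduce the two-database run, it was effectively just the earlier command and its mirror image started in parallel; the directory names below are placeholders, each holding its own copy of the test database and readprobe script:

# shell 1: first test database, everything on node 0
cd /test/readprobe_node0 && numactl -l --cpunodebind=0 ./readprobe.sh

# shell 2: second test database, everything on node 1, started at the same time
cd /test/readprobe_node1 && numactl -l --cpunodebind=1 ./readprobe.sh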


I think we should try the same set of tests with hyperthreading turned back on too at some point, for comparison purposes. I still don't like the fact that there is a steep drop-off at 5 users; it doesn't seem to tie in with any logic I can think of at this point.

Conclusion: somehow HP-UX manages the shared memory in a more efficient way, allowing it all to be accessed from either CPU, whereas Linux64 on Dell hardware needs some manual intervention in this area to get the best results.

But what does all this mean for real life, where we have multiple databases and multiple users running on our production machines? Looks to me like we either accept the performance degradation of some processes having to access their shared memory from a remote NUMA node, or I get a new DBA job of trying to manually balance databases' -B memory pools across the NUMA nodes, keeping all shared memory for one database within one node, whilst accepting that this limits any one database to only 50% of total memory resources. Looking forward to that new challenge....
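If we do go down that road, what I have in mind is something along these lines (a sketch only: the database paths, -B values and the assumption that the brokers are started directly with proserve are all placeholders to be adapted):

# database 1: broker and its -B buffer pool pinned to node 0
numactl --cpunodebind=0 --membind=0 proserve /db/prod1 -B 500000 -spin 50000 -lruskips 100

# database 2: broker and its -B buffer pool pinned to node 1
numactl --cpunodebind=1 --membind=1 proserve /db/prod2 -B 500000 -spin 50000 -lruskips 100

The obvious catches are that self-service clients would presumably need the same binding to stay local, and --membind caps each database at whatever memory its node has, which is the 50% limit mentioned above.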

Other hardware issues being investigated by Dell include imbalanced memory DIMMs across the CPU slots.

Any other insights welcomed.

Thanks

Paul

Attachments

  • NUMA for Dell PowerEdge 12G Servers.pdf (808.2 KB)

Cringer

ProgressTalk.com Moderator
Staff member
I don't suppose it's possible to disable NUMA completely at the BIOS level?
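Failing that, a software-level halfway house (standard numactl option, I haven't measured it on this workload) would be to interleave the allocations across both nodes, so every session pays an averaged penalty instead of half of them paying the full remote one:

numactl --interleave=all ./readprobe.sh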

Rob Fitzpatrick

ProgressTalk.com Sponsor
Have you mentioned which OE version these tests are running under? Also, we still don't know your broker startup parameters for these readprobe tests.

This planned upgrade will move us to 11.7.5, so we will definitely be looking closely at -lruskips.
The -lruskips/-lru2skips parameters and others for client/server (-nMsgWait, -prefetch*) were added in 11.1 back in 2012 and back-ported to 10.2B06, so you don't have to wait until your upgrade to use them.
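For example, on 10.2B06 or later they can go straight into the broker startup; the values below are just common starting points, not a recommendation for your workload:

# extract from a broker .pf / startup script
-spin 50000
-lruskips 100
-lru2skips 100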

TomBascom

Curmudgeon
NUMA is a very bad thing for a database server.

Some of the smaller HPUX servers were actually quite reasonable for their day because they were not NUMA. Those "rp" and "rx" servers would run rings around "superdomes", which are horrid NUMA fuster-clucks. As bad as the Sun Niagara boxes. Totally inappropriate for running databases. When Uncle Larry bought Sun he immediately moved SPARC away from that infernal mis-architecture in order to make Sun's servers useful for databases again.

Yes, imbalanced DIMMs and less-than-full memory slots can also be a big problem.

There is still a lot of information missing here:

1) db startup parameters: you mentioned -spin 50,000. What other db startup parameters are you using?

2) Are you running readprobe with the standard sports db? Or have you plugged in something else?

3) What version of readprobe is this? That's important. There have been many changes (and some bugs that do not manifest until you get to certain Progress releases) over the years. The most up to date version is available in the current ProTop download.

You should at least try -lruskips 100 and different -spin values. Try 5k, 10k, 25k and 100k for instance.
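Something along these lines makes that sweep easy to repeat (a sketch only, with a placeholder database name; if your readprobe wrapper starts its own broker, put the parameters there instead):

for spin in 5000 10000 25000 50000 100000
do
    proshut testdb -by                           # stop the broker if it is already up
    proserve testdb -spin $spin -lruskips 100    # restart with the next -spin value
    ./readprobe.sh > readprobe_spin_$spin.log    # capture each run for comparison
done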

I work with big customers a lot. Thousands of connections. Many of them have workloads of several million reads per second. Over-sizing db servers is a plague. Database servers hardly ever really need more than 4 cores. Really, really big ones might need 8. You might be very, very special but for most people lots of cores is spending your money in the wrong place. It seems counter-intuitive but reducing the core count will often noticeably improve performance. Nobody ever wants to explain that to the bean counters. But if you have core-based licensing you might be motivated.

TomBascom

Curmudgeon
If you think that you need lots of cores because you have lots of shared memory connections you might want to reconsider that. The addition of the -prefetch* parameters and the move to appservers and PASOE and server side joins and various other things that Progress has been doing have changed the calculus quite a lot over the years.

Yes, it is still faster to FIND a record with a shared memory connection and it always will be. But a FOR EACH that returns a non-trivial result set is very likely to be a lot faster if run client/server. And a query with a non-trivial result set is a *lot* more noticeable to a user.
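To make that concrete, a client/server session that leans on the prefetch parameters looks something like the line below (host, service and database names are made up, and parameter availability depends on your service pack):

# networked client with message prefetch turned up
mpro sports -H dbhost -S 20000 -prefetchDelay -prefetchNumRecs 100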