Question: Performance issue

Manmohan

New Member
We have an old application that maintains customer data. The user data has increased sixfold in the last year. I am new to this organization and am looking into the whole setup (DBs, WebSpeed, NameServer, etc.). I have seen tables added in the Schema Area (which I have heard is a big no-no), and some tables have a scatter factor ranging from 2 to 6.5.

The problem we are facing is periodic application latency where the broker agents dedicated to a specific application module become busy. Sometimes 30% of the agents are in the available state yet we still see latency in the application, and the following also happens at the time of the issue:

> The number of processes for that module constitutes about 50% of the total process count on the box.
> CPU usage by the module at the time of the issue is above 25%, and the load average rises to around 13.

I haven't run DBANALYS yet, so I can't comment on index usage. I and many others suspect bad programming, but even after looking at the broker logs, server logs, AdminServer log, and DB log, I am not able to identify the exact code/programs causing the issue.

Please provide any comments, ideas, or knowledge on how to pinpoint the exact cause of the issue, and on what else I can check from a DB/WebSpeed point of view.

Regards.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
To start helping, we need some more information about your application and environment.
  • What is your Progress version? (cat $DLC/version)
  • What is your operating system?
  • Is there a single application database, or several?
  • Is WebSpeed installed on the database server or on a separate box?
  • If on the database server, do the WS agents connect to the DB via TCP or shared memory?
  • What are the database startup parameters?
  • Please post the contents of your database structure file. First make sure it is up to date with prostrct list <dbname>.

We have an old application that maintains customer data. The user data has increased sixfold in the last year. I am new to this organization and am looking into the whole setup (DBs, WebSpeed, NameServer, etc.). I have seen tables added in the Schema Area (which I have heard is a big no-no), and some tables have a scatter factor ranging from 2 to 6.5.

You have heard correctly; application storage objects should never be put in the schema area. There should be separate storage areas for tables, indexes, and LOB columns if any. Typically you should also have an area for each large table, and one for the indexes of each such table. If you use word indexes, you may want to put those in a separate area. All application storage areas should be Type II (i.e. blocks per cluster of at least 8). To use Type II areas you must be on OE 10.0A or later.
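For illustration, a Type II area in a .st file adds a blocks-per-cluster value after the records-per-block figure; the area names, numbers, and paths below are hypothetical, not taken from your database:

```
#
d "CustData":40,64;512 /db/data/mydb_40.d1 f 1024000
d "CustData":40,64;512 /db/data/mydb_40.d2
#
d "CustIndex":41,1;8 /db/index/mydb_41.d1 f 512000
d "CustIndex":41,1;8 /db/index/mydb_41.d2
```

A cluster size of 8 is common for index areas, with 64 or 512 for large data areas.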

I believe dbanalys scatter factor calculations have changed over the releases, but you may well benefit from a dump and load. Before doing so however, be sure you have validated that your database structure is appropriate for your needs and create a schema file that maps your objects to the appropriate areas in the new DB.

The problem we are facing is periodic application latency where the broker agents dedicated to a specific application module become busy. Sometimes 30% of the agents are in the available state yet we still see latency in the application

Latency in a WS-based application isn't just determined by how many agents are available. If agents have to contend for resources (DB server disk I/O, WS server disk I/O, DB record locks, network bandwidth) then this could result in latency, even if only a couple of them are busy at a given time.

...the following events still happen at the time of issue:

> The number of processes for that module constitutes about 50% of the total process count on the box.

I don't know what to make of this, as I don't know what these processes are. Do the agents spawn child processes? Are there other application clients apart from the WS agents? Do the servers have other workloads apart from the application you are troubleshooting?

I haven't run DBANALYS yet, so I can't comment on index usage. I and many others suspect bad programming, but even after looking at the broker logs, server logs, AdminServer log, and DB log, I am not able to identify the exact code/programs causing the issue.

Depending on your Progress version and whether you have access to source there may be other troubleshooting options available to you, such as the ABL profiler, database client-request statement caching, more verbose logging with the -logentrytypes parameter, and obtaining DB CRUD statistics from the _*stat virtual system tables.
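As one example (not available on 9.1E; the file path and values here are illustrative only), enhanced client logging is enabled with startup parameters along these lines in an agent/client .pf:

```
# hypothetical additions to a client .pf (requires OE 10+)
-clientlog /tmp/agent.log
-logentrytypes 4GLTrace
-logginglevel 3
```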
 

Manmohan

New Member
Hello Rob,
Thanks for the reply.
The Progress version is 9.1E.
There are 20+ databases for this one application.
WebSpeed is installed on the same box as the database.
The OS is SunOS.
The .pf file shows it uses TCP.

Out of these, only 3-4 DBs are used extensively, and this issue comes mostly on one DB, which supports a particular module.
To cater to requests for this module, 2 brokers are set up with a maximum of 20 agents each. Will adding a new broker help here?

I think there is some recursive code that causes latency even when some agents are available, but I do not understand programming much, so I cannot verify that.

Apart from WS agents, a lot of cron batch jobs also connect to the DB.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
You are on a very old version of Progress; 9.1E is almost a decade old and the core design of v9 is much older than that. So a lot of options, e.g. Type II storage areas, database client statement caching, and per-user CRUD statistics VSTs to name only a few, and a lot of design and performance enhancements, are not available to you as long as you stay on that version.

May I ask if there is a reason why you do not upgrade Progress, or feel that you cannot? If you are on current maintenance and have source code and a compiler then it should be relatively painless and it is definitely worth doing.

I've never seen or heard of a single Progress application with 20 or more databases (though in fairness I haven't seen many applications apart from my own). That seems... excessive, and would be an optimization challenge. However, it's a core design decision which will not likely change as a result of this conversation. You said the agents connect to the DBs via TCP, although they run on the same machine as the databases. That could certainly be a performance cost. Someone chose those startup parameter values and it wasn't you. Can you find out who it was and have that person justify their decision? If not, you're free to experiment. I assume you have a non-production environment where you can test possible configuration changes. Hopefully it mirrors prod as much as possible in configuration and in hardware.

Where you go from here to address your problems depends a lot on the application provider and your company's relationship to them. Is it an in-house application? Is it from a vendor with whom you have a current relationship? Or a vendor that you no longer have a relationship with? Do you have source code and a compiler? Do you have programmers who can competently update the code if necessary? Are they allowed to update the code? Do they know how to debug a Progress application? If so, is this application performance issue visible to the business owners and enough of a priority to reassign programming and other resources to address it?

You asked whether adding a broker will help with "latency". It is impossible to say at this point. Troubleshooting can be a complicated process; more so with a multi-client federated application. It requires detailed answers to a lot of specific questions. How long do individual agent requests last? How long do database transactions last? Are there many record locks in use simultaneously? What are the specific symptoms of the "latency" issues raised by clients? Are they experienced by all clients or by some? If only some, what are their defining characteristics? Are there any errors in client logs or database logs? These are the types of questions you need to ask yourself and answer, but doing this work effectively really requires hands-on access to the system. If this is enough of a problem and a priority for you, you might consider hiring a consultant to look at it. They could help you address the problem and at the same time you would learn valuable techniques by watching them work.

You say you have narrowed down one or a few databases where the problem may lie. You can get an idea of which tables and indexes see the most I/O by looking at the database CRUD statistics in _TableStat and _IndexStat. To get meaningful data in these tables you must first have set the appropriate database startup parameters: -tablerangesize and -indexrangesize. Set the first one higher than your highest application table number, and the second one higher than your highest application index number. Also look up -basetable and -baseindex in the DB Admin Guide and Reference. Then, after the database is restarted, you can write code or use a tool like ProTop to obtain information about creates, reads, updates, and deletes (CRUD). This information is helpful to a developer to track down code issues with, for example, excessive reads.
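As a sketch, the relevant .pf additions might look like the following (the numbers are placeholders; size them from your own schema):

```
# additions to the database startup .pf -- each value must exceed your
# highest table number / highest index number in the schema
-tablerangesize 500
-indexrangesize 1500
-basetable 1
-baseindex 1
```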

Finally, I suggest you re-read my earlier post as there was information I requested that you didn't provide or acknowledge.
 

TomBascom

Curmudgeon
9.1E is ancient, obsolete and unsupported. You should upgrade. 10.2B is very stable but will be obsolete sometime in the next couple of years. OpenEdge 11.3 is the current release.

SunOS -- what version? That's ancient, obsolete and supported by Oracle. What hardware?

Adding brokers and agents is probably not going to help unless you have a situation where they are all busy -- and you say that at worst 30% remain available. So, no, that won't help.

It is much more likely that you are suffering from a lack of tuning. Particularly since you say the problem seems related to one specific database.

Recursion, in itself, is not a bad thing. There is no special reason to blame it for performance issues.

The issue might be code -- code is often the root cause of performance problems. Drilling down to get to specific bits of code needing improvement is much more difficult in v9 than it is on modern releases -- the tools available are quite primitive and dull. Once you find the bad code you *might* get lucky and be able to easily improve its performance by doing something simple like modifying it to use a simpler algorithm. More likely you'll discover that it is requesting a *lot* of data and that satisfying that request takes a long time. If you're lucky you'll also discover that this is due to a poorly constructed query that can be easily improved. If you're less lucky (but still fairly lucky) the query will be ok but it needs better index support in the database. If you're not lucky you'll be chasing red herrings for months. Maybe years.

Which is why people tend to look to the database for tuning improvements -- it is often much simpler and more effective to twist a few knobs on the db.

It might simply be that the database in question has grown over time and is now poorly tuned. Which may be highlighting code issues that don't matter very much in small databases. This seems very likely because you seem to have almost no information to share about the database configuration. Which suggests that it is likely tuned to default settings. This hypothesis is supported by the fact that you have data in the schema area.

We can get more detailed if:

1) You post some performance metrics related to the database which seems to have problems when it is having problems. There are a couple of ways to go about that. For some reason I, personally, prefer to see ProTop output. Even though 9.1E is ancient, obsolete and unsupported the old character version of ProTop xxi works just fine. You can download it here: http://dbappraise.com/protop.html Or you can use PROMON. In either case start with samples from the summary screen during problem periods. 10 second samples are reasonable.

2) Tell us how the problem database is configured. The easy way to get the accurate information quickly is to open dbname.lg, find the last occurrence of "(333)" (that's the message number of the db startup message) and then extract the next 75 lines or so of text. Post it surrounded by [ C O D E ] ... [ / C O D E ] tags so that it is readable. Rob also requested that you update and post the .st file. DBANALYS would also be interesting.
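That extraction can be scripted; a minimal sketch, using a canned sample log in place of your real dbname.lg:

```shell
# Pull the startup-parameter section from a database .lg file.
# "(333)" is the message number of "Multi-user session begin.";
# the parameter listing follows it.  The file below is a stand-in sample.
cat > dbname.lg <<'EOF'
19:00:00 BROKER 0: Multi-user session begin. (333)
19:00:00 BROKER 0: Number of Database Buffers (-B): 100. (4239)
20:03:04 BROKER 0: Multi-user session begin. (333)
20:03:06 BROKER 0: Number of Database Buffers (-B): 500000. (4239)
20:03:06 BROKER 0: Current Spin Lock Tries (-spin): 15000. (4243)
EOF

# line number of the LAST (333) message, then print the next ~75 lines
START=$(grep -n '(333)' dbname.lg | tail -1 | cut -d: -f1)
tail -n +"$START" dbname.lg | head -75
```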

I'll bet that if you provide the information above we will have more interesting answers.

You could also hire a consultant if you want to save time ;)
 

Manmohan

New Member
Rob and Tom. Thanks for your reply and views.
This application belonged to a small company that was taken over by a big organization and re-branded. The application/HW/etc. is around 10 years old and was initially created for a few thousand customers. In the last 8 months, the number of users has increased 6 times.
In the last 5 years, we are the 3rd or 4th vendor supporting it, and we have barely any documentation and only very basic monitoring. There is no one to justify the present setup, which is quite a mess. Programs read with EXCLUSIVE-LOCK and many have database connections hard-coded.
The business wishes to use this app on its current Progress version for at least the next 1-2 years. They are not very keen on investing in this app's future, so we only have a HW upgrade in planning and no plans to upgrade the Progress version.
We are expected to fix the latency issues by working on the DB/WebSpeed/NS etc., or by pinpointing the exact programs which may be at the root of the problem.

So as a baseline, I need to find a way to tune the DB/environment as well as possible, which in turn may open doors for a version upgrade.

I have primarily worked on the Progress DB and have little/bookish knowledge of WebSpeed etc.

I am posting approx. 70-80 lines of the DB log, which provide the startup info of the most-used DB (and the one on which the issue generally comes). I had a bad day with the environment today, so I could not find time to fetch any more info. I will post the .st file details, and the other info requested, tomorrow.

[ C O D E ]
20:03:04 BROKER 0: Multi-user session begin. (333)
20:03:04 BROKER 0: Begin Physical Redo Phase at 2048 . (5326)
20:03:05 BROKER 0: Physical Redo Phase Completed at blk 4520 off 976 upd 3916. (7161)
20:03:05 BROKER 0: Started for db_name-sql using TCP, pid 13799. (5644)
20:03:05 BROKER 1: Started for db_name using TCP, pid 14134. (5644)
20:03:05 BROKER 1: This is an additional broker for this protocol. (5645)
20:03:05 BROKER 1: This broker supports 4GL server groups only. (8863)
20:03:05 WDOG 15: Started. (2518)
20:03:05 BIW 16: Started. (2518)
20:03:05 AIW 17: Started. (2518)
20:03:05 APW 18: Started. (2518)
20:03:05 APW 19: Started. (2518)
20:03:05 APW 20: Started. (2518)
20:03:06 BROKER 0: Progress OpenEdge Release 9.1E on SOLARIS. (4234)
20:03:06 BROKER 0: Server started by progress on /dev/pts/1. (4281)
20:03:06 BROKER 0: Started using pid: 13799. (6574)
20:03:06 BROKER 0: Physical Database Name (-db): /..../...../db_path. (4235)
20:03:06 BROKER 0: Database Type (-dt): PROGRESS. (4236)
20:03:06 BROKER 0: Force Access (-F): Not Enabled. (4237)
20:03:06 BROKER 0: Direct I/O (-directio): Enabled. (4238)
20:03:06 BROKER 0: Number of Database Buffers (-B): 500000. (4239)
20:03:06 BROKER 0: Maximum private buffers per user (-Bpmax): 64. (9422)
20:03:06 BROKER 0: Excess Shared Memory Size (-Mxs): 16431. (4240)
20:03:06 BROKER 0: The shared memory segment is not locked in memory. (10014)
20:03:06 BROKER 0: Current Size of Lock Table (-L): 50016. (4241)
20:03:06 BROKER 0: Hash Table Entries (-hash): 125117. (4242)
20:03:06 BROKER 0: Current Spin Lock Tries (-spin): 15000. (4243)
20:03:06 BROKER 0: Number of Semaphore Sets (-semsets): 1. (6526)
20:03:06 BROKER 0: Crash Recovery (-i): Enabled. (4244)
20:03:06 BROKER 0: Database Blocksize (-blocksize): 8192. (6573)
20:03:06 BROKER 0: Delay of Before-Image Flush (-Mf): 3. (4245)
20:03:06 BROKER 0: Before-Image File I/O (-r -R): Reliable. (4247)
20:03:06 BROKER 0: Before-Image Truncate Interval (-G): 0. (4249)
20:03:06 BROKER 0: Before-Image Cluster Size: 33554432. (4250)
20:03:06 BROKER 0: Before-Image Block Size: 16384. (4251)
20:03:06 BROKER 0: Number of Before-Image Buffers (-bibufs): 50. (4252)
20:03:06 BROKER 0: BI File Threshold size (-bithold): 768.0 MBytes. (9238)
20:03:06 BROKER 0: BI File Threshold Stall (-bistall): Enabled. (6551)
20:03:06 BROKER 0: After-Image Stall (-aistall): Enabled. (4254)
20:03:06 BROKER 0: After-Image Block Size: 8192. (4255)
20:03:06 BROKER 0: Number of After-Image Buffers (-aibufs): 75. (4256)
20:03:06 BROKER 0: Storage object cache size (-omsize): 1024 (8527)
20:03:06 BROKER 0: Maximum Number of Clients Per Server (-Ma): 11. (4257)
20:03:06 BROKER 0: Maximum Number of Servers (-Mn): 15. (4258)
20:03:06 BROKER 0: Minimum Clients Per Server (-Mi): 1. (4259)
20:03:06 BROKER 0: Maximum Number of Users (-n): 161. (4260)
20:03:06 BROKER 0: Host Name (-H): host_name.local. (4261)
20:03:06 BROKER 0: Service Name (-S): db_name-sql. (4262)
20:03:06 BROKER 0: Network Type (-N): TCP. (4263)
20:03:06 BROKER 0: Character Set (-cpinternal): ISO8859-1. (4264)
20:03:06 BROKER 0: Parameter File: /some_path/db_name.pf. (4282)
20:03:06 BROKER 0: Maximum Servers Per Broker (-Mpb): 3. (5647)
20:03:06 BROKER 0: Minimum Port for Auto Servers (-minport): 9100. (5648)
20:03:06 BROKER 0: Maximum Port for Auto Servers (-maxport): 9499. (5649)
20:03:06 BROKER 0: This broker supports SQL server groups only. (8864)
20:03:06 BROKER 0: Large database file access has been enabled. (9426)
20:03:06 BROKER 0: Created shared memory with segment_id: 33554494 (9336)
20:03:06 BROKER 0: Created shared memory with segment_id: 33554495 (9336)
20:03:06 BROKER 0: Created shared memory with segment_id: 33554496 (9336)
20:03:06 BROKER 0: Created shared memory with segment_id: 33554497 (9336)
20:03:06 SRV 2: Started on port 1025 using TCP, pid 14217. (5646)
20:03:07 SRV 2: Login usernum 174, userid abcd, on host_name.local batch. (742)
20:03:46 SRV 2: Logout usernum 174, userid , on host_name.local batch. (739)
20:05:01 Usr 21: Login by progress on batch. (452)
20:05:01 Usr 21: Logout by on batch. (453)
20:05:01 SRV 2: Login usernum 174, userid progress, on host_name.local batch. (742)
20:05:01 Usr 21: Login by some_usr on batch. (452)
20:05:01 Usr 21: Logout by on batch. (453)
20:05:02 SRV 2: Logout usernum 174, userid , on host_name.local batch. (739)

20:07:00 prostrct list session begin for progress on batch. (451)
20:07:00 prostrct list session end. (334)
20:07:00 SRV 2: Login usernum 174, userid progress, on host_name.local batch. (742)
20:07:00 SRV 2: Logout usernum 174, userid , on host_name.local batch. (739)
20:07:00 RFUTIL 21: Login by progress on batch. (452)
20:07:00 RFUTIL 21: Logout by progress on batch. (453)
20:07:00 RFUTIL 21: Login by progress on batch. (452)
20:07:00 RFUTIL 21: Logout by progress on batch. (453)
20:07:00 RFUTIL 21: Login by progress on batch. (452)
[ /C O D E ]


A quick question: if somehow the brokers go down and, while starting them, we get the following error:

===============================
ERROR: cannot start server. (8100)
L-8985>(Date/Time) Exception unbinding broker (not bound) : broker_name (8525)
main>(Date/Time) ubroker v91E (03-Mar-06) done. (8041)
===============================

Is it a good approach to go for a NameServer/wsadmin restart? Or is killing all processes on the box with the broker's name a quick and safe fix?

Regards.
 

Manmohan

New Member
Please find the additional info, as requested:

uname -a
SunOS host_name.local 5.10 Generic_142900-13 sun4u sparc SUNW,SPARC-Enterprise

It's a Sun SPARC M4000 machine.

The .st file of the problem DB is:

Code:
#
b db_path/bi/db_name.b1 f 2048000
b db_path/bi/db_name.b2
#
d "Schema Area":6,64 db_path/data1/db_name.d1 f 64000
d "Schema Area":6,64 db_path/data1/db_name.d2
#
d "AdaIndex":7,1 db_path/index1/db_name_7.d1 f 22000000
d "AdaIndex":7,1 db_path/index1/db_name_7.d2 f 6000000
d "AdaIndex":7,1 db_path/index1/db_name_7.d3
#
d "AdaData":8,64 db_path/data1/db_name_8.d1 f 60000000
d "AdaData":8,64 db_path/data1/db_name_8.d2 f 13000064
d "AdaData":8,64 db_path/data1/db_name_8.d3
#
a db_path/ai/db_name.a1 f 64000
#
a db_path/ai/db_name.a2 f 64000
#
d "NotesIndex":11,1 db_path/index1/db_name_11.d1 f 5120000
d "NotesIndex":11,1 db_path/index1/db_name_11.d2
#
d "NotesData":12,64 db_path/data1/db_name_12.d1 f 36000000
d "NotesData":12,64 db_path/data1/db_name_12.d2
#
d "AuditIndex":13,1 db_path/index1/db_name_13.d1 f 26000000
d "AuditIndex":13,1 db_path/index1/db_name_13.d2 f 3000064
d "AuditIndex":13,1 db_path/index1/db_name_13.d3
#
d "AuditData":14,64 db_path/data1/db_name_14.d1 f 130000000
d "AuditData":14,64 db_path/data1/db_name_14.d2 f 16000000
d "AuditData":14,64 db_path/data1/db_name_14.d3
#
d "HstryIndex":15,1 db_path/index1/db_name_15.d1 f 22000000
d "HstryIndex":15,1 db_path/index1/db_name_15.d2 f 13952
d "HstryIndex":15,1 db_path/index1/db_name_15.d3 f 22000000
d "HstryIndex":15,1 db_path/index1/db_name_15.d4
#
d "HstryData":16,64 db_path/data1/db_name_16.d1 f 70000000
d "HstryData":16,64 db_path/data1/db_name_16.d2 f 88704
d "HstryData":16,64 db_path/data1/db_name_16.d3 f 70000000
d "HstryData":16,64 db_path/data1/db_name_16.d4
#
d "AccntIndex":17,1 db_path/index1/db_name_17.d1 f 1024000
d "AccntIndex":17,1 db_path/index1/db_name_17.d2
#
d "AccntData":18,64 db_path/data1/db_name_18.d1 f 3072000
d "AccntData":18,64 db_path/data1/db_name_18.d2
#
d "RefIndex":19,1 db_path/index1/db_name_19.d1 f 512000
d "RefIndex":19,1 db_path/index1/db_name_19.d2
#
d "RefData":20,256 db_path/data1/db_name_20.d1 f 512000
d "RefData":20,256 db_path/data1/db_name_20.d2
#
d "CDetailIndex":21,1 db_path/index1/db_name_21.d1 f 3072000
d "CDetailIndex":21,1 db_path/index1/db_name_21.d2
#
d "CDetailData":22,256 db_path/data1/db_name_22.d1 f 18000000
d "CDetailData":22,256 db_path/data1/db_name_22.d2
#
a db_path/ai/db_name.a3 f 64000
#
a db_path/ai/db_name.a4 f 64000
#
a db_path/ai/db_name.a5 f 64000
#
a db_path/ai/db_name.a6 f 64000
#
d "ChksmAttrIndex":29,1 db_path/index1/db_name_29.d1 f 1024000
d "ChksmAttrIndex":29,1 db_path/index1/db_name_29.d2
#
d "ChksmAttrData":30,64 db_path/data1/db_name_30.d1 f 5120000
d "ChksmAttrData":30,64 db_path/data1/db_name_30.d2
#
d "GdaIndex":31,1 db_path/data1/db_name_31.d1 f 16000000
d "GdaIndex":31,1 db_path/data1/db_name_31.d2 f 4000000
d "GdaIndex":31,1 db_path/data1/db_name_31.d3
#
d "GdaData":32,128 db_path/data1/db_name_32.d1 f 30000000
d "GdaData":32,128 db_path/data1/db_name_32.d2 f 6000000
d "GdaData":32,128 db_path/data1/db_name_32.d3
#
d "PHstryIndex":33,1 db_path/index1/db_name_33.d1 f 3072000
d "PHstryIndex":33,1 db_path/index1/db_name_33.d2
#
d "PHstryData":34,128 db_path/data1/db_name_34.d1 f 7168000
d "PHstryData":34,128 db_path/data1/db_name_34.d2
#
d "SHstryIndex":35,1 db_path/index1/db_name_35.d1 f 2048000
d "SHstryIndex":35,1 db_path/index1/db_name_35.d2
#
d "SHstryData":36,256 db_path/data1/db_name_36.d1 f 5120000
d "SHstryData":36,256 db_path/data1/db_name_36.d2

PS: This DB does not have tables in the schema area.

The DBANALYS result is a huge text; should I post some specific details from it?
I will capture the PROMON results when the issue next occurs and post accordingly.
 

TomBascom

Curmudgeon
On the bright side you have after-imaging enabled -- congratulations!

And your .st file at least has multiple storage areas. Hard to say if there are enough or if they are the right ones or if rows per block is good without seeing the dbanalys though. You can attach files -- no need to post the whole text inline.

However -- the .st file does seem to be using a "functional" approach to naming storage areas. So it is probably bogus. See this: http://dbappraise.com/ppt/sos.pptx

I fail to see the value in obscuring the db name & path. It's not exactly an SSN or credit card number... ;)

-directio is a waste of effort. It never helps and may hurt.

-B 500,000 -- I know the 9.1E docs say that the max is 500,000. That is not accurate. You may even have a 64 bit port -- those were available even back then. Run "file $DLC/_progres" and see what sort of executables you have. Your block size is 8k so that says that you have 4GB in -B (which a 32 bit process should not be able to do...) so you can probably go higher. How much RAM does the server have?

While I am asking about the server... how many cores and what sort of storage subsystem?

-Ma and -Mn seem poorly set -- I would go with something more like -Ma 2 -Mn 100 The idea is to spread the connections not to concentrate them.
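In .pf terms, Tom's suggestion would look something like this (the values are his suggestion, not a tested configuration; note that -Mn must also leave room for each broker):

```
-Ma 2      # at most 2 remote clients per server process
-Mn 100    # allow many servers, spreading connections across them
```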

You are running SQL -- have you ever run "UPDATE STATISTICS"? (This is also one of those things that is dramatically improved in more recent releases...)

-tablerangesize and -indexrangesize do not appear to be set. Thus you can only gather utilization information on the first 50 tables and indexes. You should set these to the number of tables and indexes in your schema so that usage information about tables and indexes can be collected -- this will go a long ways towards identifying likely trouble spots. ProTop is especially helpful in monitoring that.

Regardless of the long term plan -- 1 or 2 more years is a long time. The performance benefits of upgrading are huge. You should very, very strongly consider doing so. If you have the source and can recompile (and it sounds like you should) then it is technically quite simple and safe. If your company has been staying current with maintenance then obtaining the upgrade from Progress should also be trivial.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
So according to your structure your total DB size, assuming no data in variable extents, is about 581 GB. That's pretty big. What is the actual high water mark for the DB?
 

Manmohan

New Member
Hello Rob,

No high water mark is set as of now. Space consumption is monitored and space is added when needed.

Hello Tom,

uname -X results in the following info:

Machine = sun4u
NumCPU = 32

Memory is 32GB

The issue has yet to repeat, but looking at old logs I found an error multiple times in the DB logs at the time of past occurrences.
During the issue, the following happened:

Latency was reported.
I checked, and around 40% of the agents were available.
The process count suddenly increased to about 900.
All agents became busy. Bouncing a broker reduced the count, but it sprang back again.
The process count again increased above 1k.
CPU usage by the specific user/module was >48%.
The process count became >1500, the load average was 25, and the server became unresponsive.
After 5-7 minutes, the server started to respond, and I found the following error in the logs:

SYSTEM ERROR: Too many sub-processes, cannot fork. Errno=11. (358)
====================================================

The storage is an old SAN (from what I know).


PS: Yes, it's not that sensitive info, but my 'guru' said to me: 'do your best'. Hope it's not a problem. :)
 

TomBascom

Curmudgeon
Hello Rob,

No high water mark is set as of now. Space consumption is monitored and space is added when needed.

Hello Tom,

uname -X results in the following info:

Machine = sun4u
NumCPU = 32

Memory is 32GB

And the bitness of the executables is what?

The issue has yet to repeat, but looking at old logs I found an error multiple times in the DB logs at the time of past occurrences.
During the issue, the following happened:

Latency was reported.
I checked, and around 40% of the agents were available.
The process count suddenly increased to about 900.
All agents became busy.

This is new information.

What is the "normal" process count?

What are these new processes that are spawned? (What executables are running?)

Bouncing a broker reduced the count, but it sprang back again.

So basically, the system state that was causing the problem is not impacted by bouncing a broker. (Possibly because whatever the users were asking for that caused the problem they continued to ask for.)

The process count again increased above 1k.
CPU usage by the specific user/module was >48%.
The process count became >1500, the load average was 25, and the server became unresponsive.
After 5-7 minutes, the server started to respond, and I found the following error in the logs:

SYSTEM ERROR: Too many sub-processes, cannot fork. Errno=11. (358)

That would be expected if something is spawning ever increasing numbers of processes.

It also often translates to "out of swap space".

The storage is an old SAN (from what I know).

That's not very helpful. Except to confirm the general diagnosis -- you've got an old, mostly uncared for system that has grown considerably. It needs tuning.

As we have both told you -- the best thing that you could possibly do is to upgrade and take advantage of performance and diagnostic improvements that are available in modern releases.

PS: Yes, it's not that sensitive info, but my 'guru' said to me: 'do your best'. Hope it's not a problem. :)

No, I just hate to see people wasting time on pointless crap when they could be answering important questions.
 