Appserver agents are busy and stuck.

Mike

Moderator
We received an alert from the monitoring system this morning that the Qxtend inbound port was not responding to queries, and we were also receiving related transaction failures from Boomi. Upon investigation, I discovered that all of the appserver agents for topro_as were busy and none were available. This was confirmed by running a debug check with the monitoring script on delop:

root@delop:/opt/wi# ./nagios-rt_debug
LicenceException: No Licensed Agents available at com.qad.qxtend.adapters.ServiceInterAdapter.configureAdapter

mfg@nait:~/wrk$ asbman -i topro_as -q
OpenEdge Release 11.7.9 as of Fri Jan 8 11:16:01 EST 2020

Connecting to Progress AdminServer using rmi://localhost:20931/Chimera (8280)
Searching for topro_as (8288)
Connecting to topro_as (8276)

Broker Name                     : topro_as
Operating Mode                  : Stateless
Broker Status                   : ACTIVE
Broker Port                     : 3093
Broker PID                      : 4605
Active Servers                  : 4
Busy Servers                    : 4
Locked Servers                  : 0
Available Servers               : 0
Active Clients (now, peak)      : (5, 5)
Client Queue Depth (cur, max)   : (0, 4)
Total Requests                  : 114042
Rq Wait (max, avg)              : (3638 ms, 5 ms)
Rq Duration (max, avg)          : (1809472 ms, 167 ms)

PID    State    Port   nRq     nRcvd   nSent   Started            Last Change
16765  RUNNING  02002  000459  000461  000458  Feb 8, 2024 03:40  Feb 8, 2024 10:16
12575  RUNNING  02003  000461  000461  000460  Feb 8, 2024 03:40  Feb 8, 2024 10:19
16262  RUNNING  02006  000005  000005  000004  Feb 8, 2024 10:19  Feb 8, 2024 10:22
16265  RUNNING  02007  000012  000012  000011  Feb 8, 2024 10:19  Feb 8, 2024 10:25
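
For reference, a broker-level check along these lines (a hypothetical sketch, not the actual nagios-rt_debug script) could simply parse the asbman -q output for a zero "Available Servers" count, which is exactly the state shown above:

#!/bin/bash
# Hypothetical health check (illustrative only): raise CRITICAL when the
# topro_as broker reports zero available appserver agents.
AVAIL=$(asbman -i topro_as -q 2>/dev/null | awk -F: '/Available Servers/ { gsub(/ /, "", $2); print $2 }')
if [ -z "$AVAIL" ] || [ "$AVAIL" -eq 0 ]; then
    echo "CRITICAL: topro_as has no available appserver agents"
    exit 2
fi
echo "OK: topro_as has $AVAIL available agents"
exit 0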



Some additional investigation of these appserver processes showed that they were stuck:



root@nait: /etc/cron.d# strace -fp 16765
strace: Process 16765 attached
semop(4544, [{16, -1, 0}], 1^Cstrace: Process 16765 detached
<detached ...>

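The semop() call shows the agent sleeping on a System V semaphore (set 4544 here), which for an OpenEdge process typically means it is blocked waiting to be granted a shared database resource. Assuming Linux, the semaphore set can be cross-checked with ipcs:

# List all semaphore sets, then show the details (owner, number of semaphores,
# last semop time) of the set the stuck agent is sleeping on.
ipcs -s
ipcs -s -i 4544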


I restarted the appserver (as user mfg) to resolve the issue:



mfg@nait:~/wrk$ asbman -i topro_as -x
OpenEdge Release 11.7.9 as of Fri Jan 9 11:16:01 EST 2020

Connecting to Progress AdminServer using rmi://localhost:20931/Chimera (8281)
Searching for topro_as (8288)
Connecting to topro_as (8276)
Starting siprod_as. Check status. (8296)

mfg@nait:~/wrk$ asbman -i topro_as -q
OpenEdge Release 11.7.9 as of Fri Dec 8 11:16:01 EST 2020

Connecting to Progress AdminServer using rmi://localhost:20931/Chimera (8280)
Searching for topro_as (8288)
Connecting to topro_as (8276)

Broker Name                     : topro_as
Operating Mode                  : Stateless
Broker Status                   : ACTIVE
Broker Port                     : 3093
Broker PID                      : 22744
Active Servers                  : 3
Busy Servers                    : 1
Locked Servers                  : 0
Available Servers               : 2
Active Clients (now, peak)      : (5, 5)
Client Queue Depth (cur, max)   : (0, 2)
Total Requests                  : 26
Rq Wait (max, avg)              : (3558 ms, 272 ms)
Rq Duration (max, avg)          : (3602 ms, 287 ms)

PID    State      Port   nRq     nRcvd   nSent   Started            Last Change
22782  RUNNING    02008  000019  000019  000018  Feb 8, 2024 10:43  Feb 8, 2024 10:44
22836  AVAILABLE  02009  000005  000005  000005  Feb 8, 2024 10:44  Feb 8, 2024 10:44
22837  AVAILABLE  02010  000003  000003  000003  Feb 8, 2024 10:44  Feb 8, 2024 10:44


Shortly after I did this, all of the available appservers became occupied again. At this point, it is probably best to wait it out, as something is sending a large number of requests to Qxtend.
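To keep an eye on the request volume while waiting it out, a quick-and-dirty sketch (using the same mfg environment as above) is to sample the broker's request counter every minute:

# Watch the broker's Total Requests counter to gauge the incoming request
# rate; stop with Ctrl-C.
while true; do
    date
    asbman -i topro_as -q | grep 'Total Requests'
    sleep 60
done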
Can anybody tell me why this happened? What was the reason? What would the investigation steps be? How can we fix it?

I need the RCA, please.

Thanks and Regards
Mike
 
What part of "there is something sending a large number of requests to Qxtend" fails to qualify as the RCA?

There is a lot of activity. Lots of activity means app servers are busy.

If you don't like that the app servers are busy, you can:

1) start more of them, if your systems have the capacity (see the sketch after this list)

2) reduce the number of requests by throttling whatever business process is driving them

3) improve the efficiency of the code which is executing the requests
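For option 1, the size of the agent pool is set in the broker definition in $DLC/properties/ubroker.properties (normally changed through OpenEdge Explorer/Management rather than edited by hand). Assuming the broker section is named [UBroker.AS.topro_as], the relevant settings look roughly like this; the values are illustrative, not a recommendation:

[UBroker.AS.topro_as]
    # agents started when the broker starts
    initialSrvrInstance=4
    # minimum number of agents kept running
    minSrvrInstance=4
    # upper limit the broker can auto-start under load
    maxSrvrInstance=8

The broker needs to be restarted for the change to take effect.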
 
You need to work out why those agents were showing as busy. A proGetStack on some of the PIDs will point you in the direction of which line of code they're on at the time. Maybe you'll find a pattern. Last time something like this happened to us, we found that an external API we were consuming was down and all the agents were waiting for a response. Thankfully we had soft-set the timeout, so we could easily reduce it to clear the backlog.
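As a rough sketch (run as root or as the mfg user, using the busy agent PIDs from the asbman -q output above):

# proGetStack writes a protrace.<pid> file in the working directory of the
# target process; its ABL stack trace section shows the procedure and line
# each agent is currently executing.
for pid in 16765 12575 16262 16265; do
    $DLC/bin/proGetStack $pid
done

Taking a couple of snapshots a minute or two apart makes it easier to tell a slow request from a genuinely stuck agent.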
 