[Progress Communities] [Progress OpenEdge ABL] Forum Post: Zombie remote servers - need some advice

Status
Not open for further replies.

Guest
In November last year we migrated from OpenEdge 11.3.3 on Solaris SPARC (Solaris 10 zones) to OpenEdge 11.7.3 on RHEL 7.7 (Maipo), running on VMware vSphere 6.0 (10719132), ESXi 6.0.

Everything went according to plan and we got a huge performance boost:

- Database online backup: 10 times faster
- Most batch jobs: 7 times faster

A huge success, everything was good ... until runtime clients on Windows 7 Enterprise systems randomly started to get SSL connection errors when trying to connect to the database. That was bad enough, but the situation is getting worse over time: zombie remote server processes now leave the database unavailable for new connections for as long as 17-plus minutes.

This is the error the clients are getting:

SSL error 12072 - SSL Client handshake failure (336130315) SSL routines occurred. (12168)
Error starting SSL handshake with the OpenEdge database server. (12167)

This is what I can see in the database log file:

[2020/02/17@09:31:33.820+0100] P-1045937 T-140231528724288 I SRV 13: (12151) SSL error 12067 - SSL accept failed occurred.
[2020/02/17@09:31:33.820+0100] P-1045937 T-140231528724288 I SRV 13: (12154) Error while attempting to create the SSL Client instance.
……
[2020/02/17@09:49:03.940+0100] P-1045937 T-140231528724288 I SRV 13: (1334) Rejecting login -- too many users for this server.
[2020/02/17@09:49:03.940+0100] P-1045937 T-140231528724288 I SRV 13: (-----) User count inconsistency detected: usrcnt=5 users=15
[2020/02/17@09:49:03.940+0100] P-1045937 T-140231528724288 I SRV 13: (-----) User count corrected: usrcnt=15 users=15

Relevant database startup parameters (250 concurrent remote clients max.):

-n 850 -Mn 40 -Ma 15 -Mi 5 -S 47311 -minport 8400 -maxport 8499 -ServerType 4GL -PendConnTime 10

Therefore I opened a TechSupport case. It turns out that the issue which causes a remote server to become a zombie as soon as it encounters a 12151 error is a product defect (OCTA-19107).
The TechSupport engineer was able to reproduce it on 11.7.5, and so far a fix is expected to make it into 11.7.6, which is scheduled for sometime in Q2/2020.

What we've found out so far is that the 12151 error turns the affected remote server into a zombie, while the database broker keeps forwarding connection requests to it. All of those requests fail for minutes, until the same server adjusts its user count to the maximum; from then on the clients get the error that the server has no more resources.

Still, I have not found out two things: what causes the initial SSL connection error on a given remote server, so that it hits the 12151 error and becomes a zombie, and what causes the database broker to adjust the user count to its maximum some time later.

Nevertheless, until I find the root cause and OpenEdge 11.7.6 is released, I need a strategy to mitigate the problem in some way. This is what I've come up with:

- Increase the pending connection timeout (-PendConnTime): as long as a zombie server has a pending connection, subsequent connection requests should be forwarded to other remote servers.
- Change to -Mn 80, -Mi 2, -Ma 8, to spread the remote clients over more remote servers.
- Implement a monitoring job which greps the database log file to identify remote servers that got a 12151 error, and eventually terminate those zombie servers via promon.

Any thoughts are welcome! Thanks in advance.
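As a rough illustration of the monitoring idea, here is a minimal Python sketch that scans database log lines for message 12151 and reports which remote server (SRV number and process ID) logged it. The line layout in the regular expression is inferred only from the log excerpt in this post, and the names `LOG_LINE`, `find_zombie_candidates` and `SAMPLE_LOG` are made up for this example. It only reports candidates; whether and how to terminate a server via promon is deliberately left as a manual step.

```python
import re

# Layout assumed from the log excerpt in this post:
# [timestamp] P-<pid> T-<thread> <severity> <process type> <number>: (<msg>) <text>
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+P-(?P<pid>\d+)\s+T-(?P<tid>\d+)\s+"
    r"(?P<sev>\w)\s+(?P<kind>\w+)\s+(?P<num>\d+):\s+"
    r"\((?P<msg>[^)]+)\)\s*(?P<text>.*)$"
)


def find_zombie_candidates(lines, msg_num="12151"):
    """Return {(server number, pid): timestamp of first matching message}.

    Only lines written by remote servers (process type SRV) that carry
    the given message number are considered; everything else is skipped.
    """
    candidates = {}
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # not a log line in the assumed layout
        if match["kind"] != "SRV" or match["msg"] != msg_num:
            continue
        key = (int(match["num"]), int(match["pid"]))
        # Keep the first occurrence so we know when the trouble started.
        candidates.setdefault(key, match["ts"])
    return candidates


# Lines copied from the log excerpt above, used as sample input.
SAMPLE_LOG = """\
[2020/02/17@09:31:33.820+0100] P-1045937 T-140231528724288 I SRV 13: (12151) SSL error 12067 - SSL accept failed occurred.
[2020/02/17@09:31:33.820+0100] P-1045937 T-140231528724288 I SRV 13: (12154) Error while attempting to create the SSL Client instance.
[2020/02/17@09:49:03.940+0100] P-1045937 T-140231528724288 I SRV 13: (1334) Rejecting login -- too many users for this server.
"""

for (srv, pid), first_seen in find_zombie_candidates(SAMPLE_LOG.splitlines()).items():
    print(f"SRV {srv} (PID {pid}) logged 12151 at {first_seen}")
```

In a real job the lines would come from the database log file instead of `SAMPLE_LOG`, and the job would need to remember which servers it has already reported, since a server keeps its SRV number and PID for as long as it lives.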
