Server has no more resources

RealHeavyDude · Feb 29, 2012

OpenEdge 10.1C SP 4 on Sun Solaris 64Bit.

For a few weeks now I have been seeing the following entries, at random intervals, in the database log file on our test system:

Code:

[2012/02/29@11:29:42.488+0100] P-2670       T-1     I SRV    21: (1334)  Rejecting login -- too many users for this server.
[2012/02/29@11:29:42.488+0100] P-2670       T-1     I SRV    21: (-----) User count inconsistency detected: usrcnt=1 users=8
[2012/02/29@11:29:42.488+0100] P-2670       T-1     I SRV    21: (-----) User count corrected: usrcnt=8 users=8

And they coincide with users getting an error message "Server has no more resources". Initially I thought that the database start parameters were not configured the right way but the problem occurs when there are just a few users on the system.
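The three log entries above share a regular layout, so they can be pulled apart mechanically when hunting for the pattern in a large log file. What follows is only a minimal sketch in Python: the regular expression is inferred from the lines quoted in this post, not from any official description of the log format, and `parse_line` is a hypothetical helper name.

```python
import re

# Layout inferred from the quoted lines (an assumption, not a documented format):
# [timestamp] P-<pid> T-<thread> <severity> <process> <number>: (<msgnum>) <text>
LOG_LINE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+P-(?P<pid>\d+)\s+T-(?P<tid>\d+)\s+"
    r"(?P<sev>\w)\s+(?P<proc>\w+)\s+(?P<num>\d+):\s+"
    r"\((?P<msg>[^)]+)\)\s+(?P<text>.*)"
)
COUNTS = re.compile(r"usrcnt=(\d+)\s+users=(\d+)")


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    c = COUNTS.search(entry["text"])
    # usrcnt/users are only present on the "user count" messages.
    entry["usrcnt"] = int(c.group(1)) if c else None
    entry["users"] = int(c.group(2)) if c else None
    return entry


# The three lines quoted in the post.
sample = [
    "[2012/02/29@11:29:42.488+0100] P-2670       T-1     I SRV    21: (1334)  Rejecting login -- too many users for this server.",
    "[2012/02/29@11:29:42.488+0100] P-2670       T-1     I SRV    21: (-----) User count inconsistency detected: usrcnt=1 users=8",
    "[2012/02/29@11:29:42.488+0100] P-2670       T-1     I SRV    21: (-----) User count corrected: usrcnt=8 users=8",
]

for line in sample:
    e = parse_line(line)
    print(e["proc"], e["num"], e["msg"], e["usrcnt"], e["users"])
```

Filtering a full log for entries where `usrcnt` and `users` differ would show which servers report the inconsistency and how often.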

These are the relevant start parameters: -n 500 -Mn 25 -Mi 3 -Ma 8 -ServerType 4GL

This is what PROMON says: 22 Servers, 9 Users (8 Local, 1 Remote, 7 Batch), 6 Apws
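To see why these parameters do not look like the bottleneck, the numbers can be worked through. This is a rough sketch that assumes the usual reading of the parameters: -Mn as the maximum number of servers, -Ma as the maximum remote clients per server, and -Mi as the number of clients a server takes before the broker starts another one. It ignores details such as brokers counting against -Mn, and the function names are made up for illustration.

```python
import math


def remote_capacity(mn: int, ma: int) -> int:
    """Upper bound on remote clients: servers times clients per server."""
    return mn * ma


def expected_servers(remote_users: int, mi: int, mn: int) -> int:
    """Rough server count if each server is filled to -Mi before a new one starts."""
    if remote_users <= 0:
        return 0
    return min(mn, math.ceil(remote_users / mi))


mn, mi, ma = 25, 3, 8                 # values from the post
print(remote_capacity(mn, ma))        # 200
print(expected_servers(1, mi, mn))    # 1
```

Under these assumptions a single remote user should need about one server, so the 22 servers PROMON reports are far more than the load explains.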

Does anybody have a clue what is going on here?

Thanks in Advance, RealHeavyDude.

TomBascom · Feb 29, 2012

Well, just for starters 10.1C is, you know, ancient... but at least it is 10.1 instead of some really old and grizzled dinosaur.

"Out of resources" can mean lots of things. The most prevalent ones are semaphores and memory. In this case I would suspect swap (because your user count is low, thus semaphores are unlikely to be the issue). I'd look in the syslog to see if there are any interesting messages that appear at the same time you have these problems.

RealHeavyDude · Feb 29, 2012

Hello Tom,

Thanks for your reply - I will have the operator of the machine check the syslog.

RealHeavyDude.

P.S. - BTW, an upgrade to OE11 is scheduled for Q4 this year. I hope that by then OE11 is seasoned ...

Rob Fitzpatrick · Feb 29, 2012

I didn't come up with much in the KB for these errors. For the 1334 it suggests you may have hit -n and have to increase it, but your numbers say that's not the case. However if it is somehow a problem with -n or _Connect slots, other users apart from remote 4GL could contribute to that. Do you have a lot of self-service users? Do you run a SQL broker? If so what are its parameters?

I am also wondering about those unnumbered "user count" errors in the DB log after the 1334. My wild-*** guess is that the user counts there refer to the fields in _Servers for current and maximum users. But why the discrepancy? No idea. And if you're only at 22 servers and your -Mi is 3 and Mn is 25 and server 21 appears to be at 8 users already when another user tries to connect, why wouldn't the broker try to instantiate another server? I'd be interested to see what it says in Servers by Broker (promon R&D | 1 | 17) when the errors occur. Also, do you know which specific error your clients get? Is it a 748?

To me this feels like an issue with starting a new server. I've seen cases where a server can't be started because the ports given in minport/maxport are already used by other processes, so new servers can't bind to them. Are you specifying minport/maxport, or using the defaults?

RealHeavyDude · Mar 1, 2012

Thanks Rob! That is very useful information to me.

To answer your questions:

No, we don't have any SQL users, nor do we have a secondary login broker running. All users are strictly 4GL.
Yes, I did specify -minport 15000 -maxport 30000. The problem first appeared when -minport and -maxport were not set, and I was hoping to fix the issue that way.
As this is a development machine we have at most 40 users (about 25 self-service clients, which would be AppServer agents and batch processes, and 15 remote users, who are running either the development environment [AppBuilder] or the run-time client).

What strikes me is that in R&D 1 / 17 almost all of the 25 servers show Current users 8 (which equals -Ma) although the activity summary only shows 11 users (8 local, 3 remote, 7 batch). On the other hand all but 4 servers do not have a Login count at all ... like they would never have been used, just started and occupied by themselves ...

I would say that, for some reason that eludes me, servers do not recognize when remote clients disconnect whereas the broker does.

That could explain the behavior: when a user gets the error, the server corrects its Current Users count and users are able to connect again until all servers once more end up with an incorrect user count ...

Still I have no clue what is the root of all evil.

Thanks, RealHeavyDude.

TomBascom · Mar 1, 2012

There is a bug related to online backups creating phantom users that might explain this...

Rob Fitzpatrick · Mar 1, 2012

TomBascom said:
There is a bug related to online backups creating phantom users that might explain this...

That's in 10.2B, pre-SP02. It also reports the users in the .lic, although not counting them against -n, in SP03 and 04. That cosmetic issue is fixed in 05. But I never encountered either bug in 10.1C.

RealHeavyDude · Mar 2, 2012

One can learn something new every day. A reboot of the machine seems to have solved the issue. When I now look into PROMON R&D/1/17 it looks like it should be.

Strange. I always thought that rebooting a machine is an appropriate procedure to solve mysterious issues on Windows boxes. On Unix that is new to me.

Thanks for all your insights and support.

RealHeavyDude.

TomBascom · Mar 2, 2012

Keep an eye on it. If it starts to re-appear you might have detected a bug.

Rob Fitzpatrick · Mar 2, 2012

It could be an issue in the broker, or maybe in the TCP/IP stack. Are there any available kernel patches related to networking that haven't been applied?

Cringer · Mar 5, 2012

Re: i have claue

<snip><snip>

RealHeavyDude · Mar 5, 2012

You can bet everything you believe in that I surely won't take that advice ...

In the meantime I found out more: it looks like there is some port scanning going on every 30 minutes, which coincides with messages like

Code:

[2012/03/03@07:55:19.569+0100] P-12543      T-1     I SRV     3: (5646)  Started on port 48113 using TCP IPV4 address 0.0.0.0, pid 12543.
[2012/03/03@07:55:20.709+0100] P-12543      T-1     I SRV     3: (-----) TCP/IP write error occurred with errno 0

I am still trying to hunt down the details.

Thanks for all your insights and support.

RealHeavyDude.

Rob Fitzpatrick · Mar 5, 2012

Interesting. What is the nature of the "port scanning" you see? Can you tell where it is coming from? Do you think that a non-ABL TCP connection to the broker is making it think a client is connecting, and causing the broker to increment a server's user count?

If you have tcpdump installed on Solaris you can capture traffic on that port and see where these connections are originating from, which may in turn indicate why these connections are happening.

SRV 3: (5646) Started on port 48113 using TCP IPV4 address 0.0.0.0, pid 12543

One other thing, I like to keep my minport and maxport below 32000, as the range above that is used for dynamic ports.
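The point about dynamic ports can be checked with a few lines of arithmetic. This is only a sketch: the dynamic range of 32768 to 65535 used here is an assumption about the host's default and should be verified on the actual machine, and the helper names are hypothetical. The configured range and the port number come from earlier posts in this thread.

```python
# Assumed default dynamic (ephemeral) port range; verify on the actual host.
DYNAMIC_RANGE = (32768, 65535)


def ranges_overlap(a, b):
    """True if the inclusive ranges a and b share at least one port."""
    return a[0] <= b[1] and b[0] <= a[1]


def in_range(port, r):
    """True if port lies inside the inclusive range r."""
    return r[0] <= port <= r[1]


configured = (15000, 30000)   # -minport / -maxport quoted earlier in the thread
logged_port = 48113           # port from the (5646) log message

print(ranges_overlap(configured, DYNAMIC_RANGE))  # False
print(in_range(logged_port, DYNAMIC_RANGE))       # True
print(in_range(logged_port, configured))          # False
```

If the assumption about the dynamic range holds, the server in that log entry was started on a port outside the configured range, which may simply mean the entry was written while a different port setting was in effect.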

Rob Fitzpatrick · Mar 5, 2012

Re: i have claue

raviraju said:
<snip>

Deleting shared memory at the OS level was a very bad idea when you wanted to do it to yourself. Please don't try to convince anyone else to corrupt their database.

Cringer · Mar 5, 2012

Re: i have claue

Rob Fitzpatrick said:
Deleting shared memory at the OS level was a very bad idea

Indeed - so bad in fact that I think I'm going to snip the posts completely to protect the innocent.

RealHeavyDude · Mar 5, 2012

The port range is just a wild guess. I changed it several times to no effect. In the meantime I have opened a Tech Support case so that they can get their hands dirty too.

I will ask the operator whether they can make use of tcpdump.

At the moment I have no clue where the port scanning is coming from. But, as I work in the most paranoid of paranoid environments (a Swiss bank), I can tell you that the security people will always shoot first and then probably ask. Namely, there is a new program called data leak prevention in place which does all sorts of things on the network to prevent customer-identifying data from being disclosed ... I am not that knowledgeable when it comes to networking.

Thanks and Best Regards,
RealHeavyDude.

Rob Fitzpatrick · Mar 5, 2012

If this is caused by some IT security function scanning various machines on various ports, then I would strongly suspect that that machine isn't also being used by someone who needs access to the database.

My clients are also in financial services, and very security-conscious. Typically they implement a firewall between the clients and server, so only those machines that should connect to the database are actually able to. That would make you more secure and also keep you from having to spend your valuable time investigating phantom users and rebooting your box.

TomBascom · Mar 5, 2012

Sometimes the security guys forget that there are "users"...

I've also seen them justify their attacks on production systems as "testing for denial of service vulnerabilities".

RealHeavyDude · Mar 6, 2012

Tech Support suggested that I use the -PendConnTime start parameter for the database. It is something I've already played around with, setting its value to 5, 10 and 15 without it making any difference. I will give it another try and increase its value to, say, 30.

If I've read the KB articles

https://progress.my.salesforce.com/kA030000000O5Py?popup=true&lang=en_US&version=2
https://progress.my.salesforce.com/kA030000000O8W0?popup=true&lang=en_US&version=1
correctly, then simply using the parameter should make the broker notice clients that could reach the broker but not the remote server they were forwarded to. In any case, it is at least something I can try.

I am still trying to find out more details about the port scanning.

Thanks for your insights and support, RealHeavyDude.

raviraju · Mar 15, 2012

Re: i have claue

"Deleting shared memory at the OS level was a very bad idea when you wanted to do it to yourself. Please don't try to convince anyone else to corrupt their database." But if it happened by mistake, how can we restore and recover the database? That is the question. That is the issue.
