Shared Memory overflow

The article you cite references a bug in 10.2A FCS, fixed in 10.2A01, where 168 bytes of memory is leaked with every client disconnect/reconnect. This is a small amount of memory, and it is not clear whether it is the cause of your (6495) errors, especially as you are restarting your databases at least monthly. Regardless, once you upgrade to 10.2B08 you won't be faced with this bug.
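Just to put that 168 bytes in perspective, here is a quick back-of-the-envelope sketch in Python; the free-pool size and connection-churn rate are made-up assumptions, not figures from your system:

Code:
# Rough estimate of how long a 168-byte-per-cycle leak takes to exhaust
# free shared memory.  The leak size comes from the kbase article; the
# pool size and churn rate below are made-up assumptions.

LEAK_PER_CYCLE = 168        # bytes leaked per client connect/disconnect (per the article)
FREE_POOL = 2_000_000       # assumed free shared memory, in bytes
CYCLES_PER_DAY = 5_000      # assumed client connect/disconnect cycles per day

days_to_exhaust = FREE_POOL / (LEAK_PER_CYCLE * CYCLES_PER_DAY)
print(f"~{days_to_exhaust:.1f} days until the free pool is exhausted")
# ~2.4 days with these assumptions; with light churn it could take months,
# which is why monthly restarts may or may not be masking this particular bug.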
Hi Rob,

So you mean the bug exists in 10.2A, right?

The current plan is to start the watchdog, use -Mxs 512, and increase -B to 50000, along with applying the 10.2A03 patch.
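As a rough illustration of what -B 50000 means in memory terms, here is a small sketch; the 8 KB block size is an assumption, so substitute your own database block size:

Code:
# Approximate memory used by the -B buffer pool: number of buffers x database block size.
# The block size below is an assumption; check your own database's block size.

B = 50_000                 # -B: number of database buffers
BLOCK_SIZE = 8 * 1024      # assumed 8 KB database block size, in bytes

buffer_pool_bytes = B * BLOCK_SIZE
print(f"-B {B} with {BLOCK_SIZE // 1024} KB blocks is about "
      f"{buffer_pool_bytes / 1024**2:.0f} MB of buffer pool")
# About 391 MB with these assumptions.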

The upgrade to 10.2B would be next year.
Thanks a lot.
 

TomBascom

Curmudgeon
-B 50,000 is ridiculously low.

The article says that the referenced bug exists in 10.2A. I see no reason to doubt that.

Whether it is "the" bug causing your problem or problems is unknown: your problem or problems are not well enough defined to say what the root cause or causes are, so we cannot judge that.

The whole idea of a singular bug in such a diffuse area as memory leaks is problematic. It goes hand in hand with the idea that fixing one singular bug will magically resolve all sorts of problems whose connection to the bug in question is tenuous at best.

I am reminded of a team of half a dozen people who wasted several weeks pursuing "solutions" to the wrong problem because the lead was obsessed with the *last* error message in the .lg file rather than the first message to indicate a problem. Similarly, that message reported an out-of-memory type of problem. Ultimately it turned out to be a red herring -- the system was running out of memory because of a long series of events that started well before the out-of-memory condition.

Quite a lot of time and energy was wasted increasing memory, changing OS limits, and chasing exotic hypothetical bugs when, actually, just reading the first message and understanding it led almost directly to a fairly minor coding mistake. That mistake was easily fixed, and the fix not only completely remedied the problem but also improved the performance of that chunk of code by an order of magnitude.

To "save face" a couple of months were then invested in a major rewrite of that chunk of code which provided zero benefits other than a fun little exercise in learning about certain new 4gl features.
 
Hi All,

Thanks a lot for all your information. We applied the 10.2A03 SP and the shared memory leak is resolved. So, in order to isolate the root cause, we are following the steps below:

1. Apply SP 10.2A03.
2. Start watchdog.
3. Monitor for one month to see if the issue repeats.
4. If the issue repeats, use -Mxs 512 and also get the application team to revise their code.
5. If there are no issues, tune -B in phases to raise the buffer cache hit ratio to 99%, monitor the -L high-water mark, and increase -L accordingly (a sketch of the hit-ratio arithmetic follows this list).
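For step 5, here is a minimal sketch of the hit-ratio arithmetic; the counter values are made up, and in practice they come from promon or the VST activity screens:

Code:
# Buffer cache hit ratio: the fraction of logical reads satisfied from the -B
# buffer pool without an OS (disk) read.  The counters below are made-up samples.

logical_reads = 1_000_000   # assumed database buffer (logical) reads over an interval
os_reads = 25_000           # assumed OS/disk reads over the same interval

hit_ratio = (logical_reads - os_reads) / logical_reads * 100
print(f"Buffer cache hit ratio: {hit_ratio:.1f}%")   # 97.5% with these numbers
# Tuning toward 99% means growing -B until os_reads stays at or below ~1% of logical_reads.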

Thanks to all... Cheers...
 

TomBascom

Curmudgeon
How do you know that "the" shared memory leak is resolved? That seems questionable since your plan involves monitoring for a month to see if it happens again.

Also, isolating the root cause after theoretically fixing it is rather like putting the cart before the horse. I could see looking for a root cause forensically after a fix -- and then testing the theory. But that doesn't seem to be what is going on here.
 
Hi Tom,

We created concurrent remote connections and checked shared memory, which remained constant after the patch was applied, whereas it did not before the patch. This led us to conclude that the memory leak is resolved.

If the issue still happens with -Mxs, we will get the application team involved in verifying their code. As far as I understand from you all, the -Mxs issue can still be there.

Two areas were suspected: first the database, then the application program. The DB side is fixed, and if there are no more issues we take it as a DB issue; if not, then it is the application. To address the application side we will also play around with -Mxs and -L.

Thanks.
 

TomBascom

Curmudgeon
We created concurrent remote connections and checked shared memory, which remained constant after the patch was applied, whereas it did not before the patch.

This sounds like a completely new set of previously undisclosed symptoms.

How did you check that "shared memory remained constant"? What tool did you use to do this and what numbers are you observing? (Shared memory usage is not something that is trivially observed -- so the tool that you used and the metric in question will tell us a lot about whether or not you are really seeing what you think you are seeing. Of course the very nature of shared memory is that it is *shared* which means that it generally gets allocated up front and everyone uses it -- it doesn't grow or shrink as connections are made and a "leak" isn't really a sensible notion.)

What were the previous non-constant values?

Apparently creating "concurrent remote connections" is your hypothesized root cause for a "memory leak". In the pre-patch scenario how does creating these connections correlate to increased shared memory usage? (What metric changes by how much when you add remote connections?) How does that metric change when remote connections are closed?

Does your test involve anything other than opening and closing connections? IOW do they do any work? Did you run the same tests pre-patch?

This led us to conclude that the memory leak is resolved.

Please pardon my skepticism but I doubt it. So far you haven't really described anything that would fit that description.
 
Hi Tom,

Yes. I tested before and after the patch, and also in patched and non-patched environments. I used promon option R&D, followed by options 1 and 14. The shared memory segment reduced each time there was a disconnection in the non-patched environment, but the shared memory remained constant in the patched environment. I noted the first and last figure as 5182376 in the patched environment, while in the non-patched environment it started from 869936 and ultimately we could see only 100.

The test also involved running MRP in MFGPRO, which actually caused the -L issue and brought the database down in the non-patched environment, while in the patched environment the database did not go down, although the MRP failed.
Thanks a lot.

Good Day...
 

TomBascom

Curmudgeon
You are talking about this PROMON screen:
Code:
09/19/14  Status: Shared Memory Segments
12:08:29

Seg        Id        Size        Used       Free

  1   5963782    18132992    15695672    2437320

You are saying that merely connecting a remote client resulted in an increase in one of these numbers? (I'm guessing "used"?) And that disconnecting the client results in less shared memory being used? (and more free?)
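For what it's worth, here is a minimal sketch of how one might track the "Free" column across repeated captures of that screen; it assumes the promon samples have been saved to plain text, and the parsing is naive and illustrative only:

Code:
# Naive parser for captured promon "Status: Shared Memory Segments" screens.
# Assumes each sample has been saved to plain text; data lines look like:
#   1  5963782  18132992  15695672  2437320   (Seg, Id, Size, Used, Free)

def free_shared_memory(sample_text):
    """Return the total 'Free' bytes across all segment lines in one sample."""
    total_free = 0
    for line in sample_text.splitlines():
        fields = line.split()
        # A segment data line is five integer columns: Seg Id Size Used Free
        if len(fields) == 5 and all(f.isdigit() for f in fields):
            total_free += int(fields[4])
    return total_free

sample = """
Seg        Id        Size        Used       Free
  1   5963782    18132992    15695672    2437320
"""
print(free_shared_memory(sample))   # 2437320

# Sampling this before and after each connect/disconnect is what would reveal a
# steadily shrinking "Free" value if the 168-byte leak were present.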
 

TheMadDBA

Active Member
I think he means this bug in 10.2A (and 10.1C) (from earlier in the thread). Every time a client/server connection disconnected from the DB it caused a memory leak.

http://knowledgebase.progress.com/articles/Article/P138375
For each Client/Server connection to the database, 168 bytes are allocated from the shared-memory pool. In OpenEdge 10.1C this was deliberately left behind for re-use (on the next connection) instead of having to de-allocate and reallocate for every connection. Unfortunately, the database manager was over-writing the pointer to the 'left-behind' allocation from the client server connection causing the 168 bytes to be reallocated every time a new client server connection is made. This leak eventually leads to exhausting the total allocated shared-memory, where the rate of attrition is 168 bytes per Client/server connect/disconnect (ie network connections). Once the shared memory is exhausted, any connection will fail allocating shared-memory (ie including self-service connections).
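Purely to illustrate the pattern the kbase is describing (a schematic sketch, not Progress source code; the pool size and loop count are made up):

Code:
# Schematic illustration (NOT Progress code) of the bug pattern the kbase
# describes: a per-connection allocation is meant to be cached and reused,
# but the reference to the cached block is clobbered, so every connection
# carves another 168 bytes out of a fixed pool and the pool only ever shrinks.

POOL_FREE = 2_000_000      # pretend shared-memory pool, in bytes (made-up figure)
ALLOC_SIZE = 168           # bytes per client/server connection (per the kbase)

cached_block = None        # the "left-behind" allocation intended for re-use

def connect():
    global cached_block, POOL_FREE
    if cached_block is None:            # intended fast path: reuse the cached block
        POOL_FREE -= ALLOC_SIZE         # otherwise allocate from the shared pool
        cached_block = bytearray(ALLOC_SIZE)
    return cached_block

def disconnect():
    global cached_block
    # BUG (analogous to the kbase description): the pointer to the reusable
    # block is overwritten, so the next connect() allocates all over again.
    cached_block = None

for _ in range(10_000):
    connect()
    disconnect()

print(f"Pool free after 10,000 connect/disconnect cycles: {POOL_FREE} bytes")
# 10,000 cycles x 168 bytes = 1,680,000 bytes gone; nothing is ever given back.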
 

TomBascom

Curmudgeon
The confusing part is that if shared memory reduced after disconnect then that would seem to say that there was no "leak". I would have thought that it would keep increasing if the pre-patch behavior was a leak as described in the kbase.
 
Hi Tom,

In the promon output, the free memory kept decreasing in the non-patched environment, while it remained the same in the patched environment.

Tomorrow we are making the production change.

Thanks
 

kdefilip

Member
I think you have two different things going on.

A db crash due to -L being exceeded and -Mxs being exhausted is a reasonably well known thing. You fix it as has already been mentioned.

A database being frequently "hung" is a different cup of tea and, unless you are saying that there is always a -Mxs-related message in the .lg file when you have a hung database, it should be treated as a distinct problem.
-L is expressed in the log as 800000; is that bytes, KB, or MB?

And while I'm at it, -Mxs is set to 340; is that bytes, KB, or MB as well?
 

TomBascom

Curmudgeon
-L is a number of locks (lock table entries). Each lock takes a certain number of bytes, which varies a bit from release to release. I forget exactly how many are needed in current releases -- the voice in my head is saying "12", but if I really wanted to know I would ask the kbase.

-Mxs is in kilobytes.
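To make the units concrete, here is a quick sketch; the 12 bytes per lock entry is just the guess above and varies by release, so check the kbase before relying on it:

Code:
# Rough memory arithmetic for the two parameters asked about above.
# -L is a count of lock table entries; bytes-per-entry is the unverified
# "12" guessed above and varies by release.  -Mxs is specified in kilobytes.

L = 800_000                  # -L: lock table entries (the value quoted from the log)
BYTES_PER_LOCK = 12          # assumed bytes per lock table entry (unverified guess)
Mxs = 340                    # -Mxs: excess shared memory, in KB

print(f"-L {L} is roughly {L * BYTES_PER_LOCK / 1024**2:.1f} MB of shared memory")  # ~9.2 MB
print(f"-Mxs {Mxs} is {Mxs} KB, i.e. about {Mxs / 1024:.2f} MB")                    # ~0.33 MB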
 