OE 11.7.4 on Solaris 11 - System Hang

kmcgrane · Jan 25, 2021

Hi,

I'm looking for some advice on what could cause an OE system to hang for a long period.

We had an incident on a customer site where the system stopped responding to any database updates; however, we were able to read data from the tramlines using simple queries. The database extents appeared to be fine, and the BI files also looked normal (variable extents had not grown). We have a number of databases running on the same server, and it appeared that all of the databases stopped accepting updates around the same time.

The system had been running fine for a number of months and business usage had not increased.
The customer is using PRO2 for replication purposes but their DBA does not think this would have any impact on performance of the system.

Any help would be most welcome.

Kind Regards,
Keith

Rob Fitzpatrick · Jan 25, 2021

Can you describe this "hang" in more detail? When I hear about a system hang, I assume the person is referring to the machine or the OS. But as queries were working, it doesn't sound like this is the case. Do you mean specifically that database updates were frozen?

How long was the "long period" and what happened when it ended? Was something done to make it end? Did the pending writes complete successfully?

Some ideas about what could pause writes:

Issue on a shared storage device that prevents writes?
BI management: write activity is temporarily paused during checkpoint processing, when a BI cluster has filled and before the switch to the next one (which might also involve formatting and inserting a new cluster into the cluster ring). If something interfered with that, writes would be stalled temporarily but reads would be permitted.
Quiet points: imagine a system admin wants to move their DB server VM from one virtualization host to another. So they run a pre-move script that enables a quiet point on every running DB, then they migrate the VM across hosts, then they run a post-move script to disable all the quiet points. In between, write activity would be frozen but reads would be permitted.

Check your DB logs and OS logs during the period in question. Check your checkpoint stats for the period in question, if you still have them.
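
To narrow a large .lg file down to the period in question, a small script can help. The following is only a sketch: it assumes each log line begins with a bracketed timestamp such as `[2021/01/25@10:15:32.123+0000]` (check your own file, as the layout may differ), and the function name `lines_in_window` and the sample lines are made up for illustration.

```python
import re
from datetime import datetime

# Assumed line prefix: [YYYY/MM/DD@HH:MM:SS.mmm+zzzz]
STAMP = re.compile(r"^\[(\d{4}/\d{2}/\d{2})@(\d{2}:\d{2}:\d{2})")

def lines_in_window(lines, start, end):
    """Return the log lines whose timestamp falls within [start, end]."""
    selected = []
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip lines that do not start with a timestamp
        stamp = datetime.strptime(" ".join(match.groups()), "%Y/%m/%d %H:%M:%S")
        if start <= stamp <= end:
            selected.append(line)
    return selected

# Placeholder lines for illustration only; not taken from a real log.
sample = [
    "[2021/01/25@09:58:02.114+0000] placeholder message A",
    "[2021/01/25@10:15:32.870+0000] placeholder message B",
    "[2021/01/25@11:30:00.001+0000] placeholder message C",
]
window = lines_in_window(
    sample, datetime(2021, 1, 25, 10, 0), datetime(2021, 1, 25, 11, 0)
)
print(len(window))
```

The same function can be pointed at the lines of a real log file and at OS log extracts, provided the timestamp pattern is adjusted to match.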

TomBascom · Jan 25, 2021

The start of an online backup also freezes updates while the bi file is being backed up. A large bi file combined with a very small backup extent size, additionally combined with a manual extent feeding process, is a way to experience the described problem. (Don't ask me how I know that last part.)

kmcgrane · Jan 26, 2021

Hi Rob,

Thanks for coming back to me on this.

Our customer complained that users could not log on to our application (which would perform some updates), whereas users already logged in could only perform read-only functions. Once an update was attempted, their session froze (hung) and did not respond. We noticed that background jobs also froze around the same time. A restart of their environment was initiated as they could not afford too much downtime (at this point the system had been non-operational for around 40 minutes). Once the environment was restarted, no further issues were encountered.

CPU, memory and disk usage all looked fine, nothing out of the ordinary.
BI extents looked OK.
AI stopped writing around the same time as users experienced the system hang.
There was no checkpoint reporting in place so we don't have information on that side.
Customer said nothing new was running on their system (O/S scripts or Application scripts).
The database log files did not indicate any issues.

Kind Regards,

Keith

Rob Fitzpatrick · Jan 26, 2021

Do you know how many AI extents were in Empty status at the time of the issue?

Another possibility is that AI archiving had stopped, all AI extents were Full, and the -aistall primary broker parameter is in use, so the database stalled rather than shutting down. This would result in the symptoms you describe. (In particular, this could happen with fixed AI extents, or variable extents with a size limit, or variable extents without large file support enabled.)

Is OE Replication in use?
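
On the AI extent question: the status of each extent can be listed with `rfutil <db> -C aimage extent list`. A rough sketch for tallying the statuses from that output follows. It assumes the output contains one `Status:` line per extent; the function name `count_ai_status` and the sample text are illustrative only, not real output from this system.

```python
import re
from collections import Counter

def count_ai_status(listing):
    """Tally extent statuses from text containing 'Status: <value>' lines."""
    statuses = re.findall(r"^\s*Status:\s*(\S+)", listing, flags=re.MULTILINE)
    return Counter(statuses)

# Made-up listing for illustration; a real one carries more fields per extent.
sample_listing = """\
Extent:  1
Status:  Full
Extent:  2
Status:  Full
Extent:  3
Status:  Busy
"""
counts = count_ai_status(sample_listing)
if counts["Empty"] == 0:
    print("No Empty AI extents left:", dict(counts))
```

Run on a schedule, a check like this could warn before the last Empty extent is used, assuming the listing format matches the pattern above.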

TomBascom · Jan 26, 2021

I suggest taking a much closer look at the database .lg file.

Patrice Perrot · Jan 27, 2021

Hi,
I had a case like this one on a test machine (as described by Rob under "Quiet points").
A sysadmin was trying to prove that proquiet + snapshot is better than probkup + AI.

But he forgot to disable the proquiet.
=> reads possible
=> No updates allowed (including new connections, which update _connect)

Extract from: PROQUIET command (progress.com)
PROQUIET ENABLE stops all writes to the database

Patrice

Rob Fitzpatrick · Jan 27, 2021

That's why examining your DB logs is important. Proquiet writes log messages when quiet points are enabled and disabled.
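
A quick way to look for those messages is a case-insensitive search of the .lg file. This is a minimal sketch only: the exact message text is not quoted in this thread, so the search phrase "quiet point" is an assumption, and the function name `find_lines` and the sample lines are placeholders.

```python
def find_lines(lines, phrase="quiet point"):
    """Return (line_number, text) pairs for lines containing phrase, ignoring case."""
    needle = phrase.lower()
    return [
        (number, text.rstrip("\n"))
        for number, text in enumerate(lines, start=1)
        if needle in text.lower()
    ]

# Placeholder lines for illustration only; not real log messages.
sample = [
    "placeholder line about a login\n",
    "placeholder line mentioning Quiet Point enabled\n",
    "placeholder line about a logout\n",
]
hits = find_lines(sample)
for number, text in hits:
    print(number, text)
```

To use it on a real log, pass an open file object in place of `sample`. An enable message with no later disable message would point to a quiet point that was left on.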

kmcgrane · Feb 8, 2021

The customer is using Pro2 as a tool to replicate their Progress databases to SQL databases. Around the time the system started having issues, they noted the Pro2 logs started growing quite large (> 1GB) and, after investigation, found that a field mismatch was outputting errors multiple times a second. They have since corrected the field and also performed a reboot of their Unix server. This may or may not be related to the issue, but it's the only thing that appears to have changed on the system since it started having issues. They have not reported any issues since these changes were implemented.

Progress has investigated the db logs and found nothing unusual, and we have arranged for Progress to perform an analysis of their environment to see if any improvements can be made.

Thanks to everyone who replied.

Keith.
