Audit Archive Failure

KMoody

Member
We ran the following on Sunday, March 8, at 3 AM:

proutil /users/mf/live/db/mffg1 -C auditarchive -directory /users/mf/live/db/audit_archive/

The log shows the following:
Code:
OpenEdge Release 11.7.5 as of Fri Jun  7 08:29:03 EDT 2019

Archive all the records prior to 2020/03/08@03:00:01.530-0500 . (13532)
Archived 2 _Db-detail records. (13575)
Archived 0 _Client-session records. (13576)
Archived 178 _aud-event records. (13577)
Processing 20000 records of table _aud-audit-data. (13935)
User cancelled audit archive with CTL-C. (13531)
Audit archive failed. (13021)

This left us with a small .abd backup file.

If auditarchive fails and -nodelete is not on, then do any records get deleted from the database before the failure? Does auditarchive "roll back" deletions if it fails?
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
I've seen the (13531) error happen at a client site due to a long-running transaction in which audit data is written. Audit archive attempts during the transaction fail. Once the transaction ends, audit archive starts working again.

I haven't examined the aftermath personally as I don't have access. But I imagine the end result is what it would be if an ABL client were attempting the archive (and if it had permission to do so, which it doesn't): the client would try to read and delete the records in question, which would require it to obtain exclusive locks on those records. It can't do that as long as the client that created the records in question is still in the transaction that created them.

This is complicated by a few facts: the audit archive utility isn't an ABL client, so some of its behaviours might be specific to it; it doesn't delete all records in a single transaction (its scope is dictated by the -recs parameter which defaults to 100 records per transaction); the documentation doesn't specify when in the process the deletes happen, whether it is right after each record is read or at the end, after all records are read. Your output shows that auditarchive was trying to archive all audit records prior to 03:00 on 2020/03/08, and it was at least 20,000 records; that implies at least 200 transactions. I would expect at least some of those transaction to have completed successfully, until it hit the lock contention, unless it happened that you hit it during the first transaction.

I can't tell you much definitively, based on what I know. You said the .abd file was "small"; I don't know what to make of that. It would be useful, for example, to know its size compared with the size of a .abd file that captured data from a time when the application was idle. I would expect the file to contain the setup records (_db-detail, _aud-event) and no data (_aud-audit-data). If your file were the same size as such a file, you could reasonably assume that your failed audit archive dumped no records to the .abd, and therefore should also have deleted no records from _aud-audit-data.

You could also verify that by having a user with audit data reporter or audit adminstrator privilege read _aud_audit-data to count records prior to 03:00 on 2020/03/08. Alternatively, you can load the small .abd file into your audit archive database and then see how many _audit-audit-data records it reports loading.

In answer to your last question, the audit archive utility does use transactions to delete audit data, so a failed delete due to a lock conflict would roll back the deletion of those records that were part of that specific transaction (up to 100 records, by default). The open question, given the (13021) message, is what may have happened to audit data deleted in earlier transactions, if any. You'll have to due some sleuthing as outlined above, or open a case with Progress TS.
 

KMoody

Member
Thanks, Rob. Based on what you've said and what I've figured out, I have an idea what happened:

In cron, our databases run an audit archive at 2:45 AM, and our databases shut down for a cold backup at 3:01 AM. On 2020/03/08 at 1:59 AM, cron ran all jobs between 2:00 and 2:59 right after the other to jump ahead to 3:00 AM for Daylight Saving Time. The database shut down at 3:01 AM as scheduled, interrupting the audit archive.

Even though the archive didn't complete, it successfully archived and deleted some of the auditing information; records that existed in the audit archive file no longer existed in the database.

My solution was to load the interrupted audit archive into the auditing database, rerun auditarchive, and load the second archive file. When I loaded the interrupted archive file into my auditing archive database, it processed 382930 records. When I loaded the second archive file, it processed 491974 records.

I don't know if the audit archive interruption caused records to be deleted without being archived. What I do know now is that it's not a good idea to schedule a cronjob on a Sunday between 2 AM and 3 AM if you can avoid it! :oops:
 
Last edited:

Rob Fitzpatrick

ProgressTalk.com Sponsor
I don't know if the audit archive interruption caused records to be deleted without being archived.
One would hope not. If so, it's obviously a bad bug.

In principle it shouldn't be too hard to test, as causing audit archive to fail is straightforward. No promises, but if I find some spare time in the near future I'll try to test this. I'm interested to know the answer.
 
Top