AI Roll Forward Error

Version: 11.6
OS: Linux CentOS7

Below are the steps I performed; I am facing an error during AI roll forward:

1. Create a new database, let's say 'devdb'
2. Binary dump & load (D&L) complete
3. Index Rebuild complete
4. Cross verified DB Analysis Report
5. All good and was able to start the db and query the records
6. Shutdown 'devdb'
7. Copied the data (.dn) and BI (.bn) extents of 'devdb' and created the 'hotspare1pp' db using 'prostrct builddb'
8. Started 'hotspare1pp' db and queried records from few tables and worked fine
9. Stopped the 'hotspare1pp' db
10. Copied the data (.dn) and BI (.bn) extents of 'devdb' and created the 'preprodxyz' db using 'prostrct builddb'
11. Started 'preprodxyz' db and queried records from few tables and worked fine
12. Stopped the 'preprodxyz' db
13. Added 5 variable-length AI extents (.an) to the 'preprodxyz' db
14. Added 2 new data areas (.dn) to both 'preprodxyz' and 'hotspare1pp'
15. Started after-imaging on the 'preprodxyz' db
16. Started the 'preprodxyz' db along with APW/BIW/AIW
17. Created a few thousand records in a table
18. Ran an AI switch (the current AI extent was marked FULL and the next one started filling)
19. Copied the FULL AI extent to a net-new location: '/netappxyz_preprod/dbadmin/aiprocess/aiclone/forhotspare1'
20. Rolled the copied AI extent forward against the 'hotspare1pp' db and received the error below
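Condensed, the after-imaging portion above (steps 15-20) corresponds to roughly the following commands. This is an untested sketch (it assumes $DLC/bin is on PATH and elides the server startup parameters):

```shell
# enable and start after-imaging on the source, db offline (steps 15-16)
rfutil preprodxyz -C aimage begin
proserve preprodxyz                  # plus the APW/BIW/AIW helper processes
# ...create a few thousand records (step 17)...
rfutil preprodxyz -C aimage new      # step 18: switch extents; the current one is marked FULL
# step 19: copy the FULL extent aside, then step 20 against the target:
rfutil hotspare1pp -C roll forward -a /netappxyz_preprod/dbadmin/aiprocess/aiclone/forhotspare1/preprodxyz.20170322170414
```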

Error:
[sbalasub@dev-xyz ~]$ /opt/dlc/bin/_rfutil /opt/dbba/hotspare1pp -C roll forward -a /netappxyz_preprod/dbadmin/aiprocess/aiclone/forhotspare1/preprodxyz.20170322170414
** The database was last changed Tue Mar 14 17:08:19 2017. (831)
** The after-image file expected Tue Mar 14 17:05:08 2017. (832)
** Those dates don't match, so you have the wrong copy of one of them. (833)
roll forward open /netappxyz_preprod/dbadmin/aiprocess/aiclone/forhotspare1/preprodxyz.20170322170414 error: -1. (11014)

Is there a way to fix this issue without recreating 'hotspare1pp' db? Please advise.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
If you want to take AI notes from preprodxyz and apply them (roll forward) to hotspare1pp then you shouldn't use this process. You should probkup preprodxyz and prorest to create hotspare1pp, just as you would take a copy of a production DB to create the DR DB.
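Sketched out (untested; the backup path here is a placeholder):

```shell
# seed the DR copy from a backup of the source
probkup online preprodxyz /backup/preprodxyz.bak
prorest hotspare1pp /backup/preprodxyz.bak
# hotspare1pp is now a valid roll-forward target -- do not start or open it
rfutil hotspare1pp -C roll forward -a /path/to/full-ai-extent
```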

It *might* have worked if you hadn't opened your target (hotspare1pp). Once you open the DB it is no longer a valid AI roll-forward target. I've never tried that, but my guess is that because the prostrct command is also opening the DB, it still would not match the date stamp of the source and roll forward would not work.
 
I used this approach (prostrct builddb) on a test db with the same .st file but no data in the tables, and it worked fine for me. That gave me the confidence to do the same in preprod, where it failed. The only difference: while running the test I didn't open the hotspare db, and the switch and roll forward worked fine.

As you pointed out, starting the hotspare db would have initiated crash recovery. I wanted confirmation from the forum that there is no other fix for this issue before I go ahead and recreate the hotspare dbs.
 
You should probkup preprodxyz and prorest to create hotspare1pp

In our case,

probkup - 7 hrs
prorest
1. net new location (fresh db) - 13 hrs
2. pregrown db - 8 hrs

Even if we consider that the db is pregrown, recreating a hotspare db via restore will take around 15 hrs overall (7 for backup + 8 for restore).

We have 2 hotspare dbs, and if I run the restores in parallel each slows down to ~11 hrs on a pregrown db. That's why we were exploring this option: it gets a hotspare created in about 5 hrs (the OS copy takes 5 hrs; prostrct builddb takes less than 5 seconds).
 
@Rob Fitzpatrick No specific reason for using prostrct builddb over prostrct repair. I have recently created dbs multiple times from ZFS snapshots: once we have the snapshot, I delete/rename the .db and .lg files and run prostrct builddb to recreate the control area (.db).

I believe neither prostrct builddb nor prostrct repair starts crash recovery, which is what I need to ensure when creating hotspares. Is there any downside to using 'prostrct builddb' in this case?
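For reference, the snapshot-based recreate described above is just this (untested sketch; the structure-file name is assumed):

```shell
cd /opt/dbba
rm -f hotspare1pp.db hotspare1pp.lg          # drop the stale control area and log
prostrct builddb hotspare1pp hotspare1pp.st  # rebuild the .db from the .st file
```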
 
@Rob Fitzpatrick I used the RETRY option and the roll forward ran successfully.

[sbalasub@dev-xyz ~]$ /opt/dlc/bin/_rfutil /opt/dbba/hotspare1pp -C roll forward retry -a /netappxyz_preprod/dbadmin/aiprocess/aiclone/forhotspare1/preprodxyz.20170322170414
After-image dates for this after-image file: (1633)
Last AIMAGE BEGIN Fri Mar 17 09:53:12 2017 (1640)
This is aimage file number 1 since the last AIMAGE BEGIN. (1642)
This file was last opened for output on Fri Mar 17 09:53:39 2017. (1643)

Roll forward Retry Activated. (6804)

15:42:10: 10% of aimage file processed (9656 notes processed)... (17060)
15:42:10: 20% of aimage file processed (19351 notes processed)... (17060)
15:42:10: 30% of aimage file processed (29173 notes processed)... (17060)
15:42:11: 40% of aimage file processed (38997 notes processed)... (17060)
15:42:11: 50% of aimage file processed (47914 notes processed)... (17060)
15:42:11: 60% of aimage file processed (57440 notes processed)... (17060)
15:42:11: 70% of aimage file processed (67210 notes processed)... (17060)
15:42:12: 80% of aimage file processed (76826 notes processed)... (17060)
15:42:13: 90% of aimage file processed (85468 notes processed)... (17060)

90979 notes were processed. (1634)
0 in-flight transactions. (3785)
10011 transactions were started. (1635)
10011 transactions were completed. (11138)
At the end of the .ai file, 0 transactions were still active. (1636)
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
Interesting. I expect retry to work when you have rolled forward a partial file and the next one won't apply successfully. You can reapply the full version of the first file with retry. I'm surprised that retry works for a time stamp mismatch.
 

TomBascom

Curmudgeon
It sort of makes sense that a process like this works - otherwise the "mark backedup" option doesn't make a lot of sense.

It is none the less obviously a lot more "delicate" than the usual probkup based approach.
 
It is none the less obviously a lot more "delicate" than the usual probkup based approach.

Any specific reason, @TomBascom? In most cases I will use the probkup approach, but in time-constrained scenarios (be it prod, preprod, or dev) I'm sure we would prefer the copy-extents approach, which is a lot faster. If you feel it's fragile, please let me know what kinds of issues you foresee, so that we can either work around them or always stick with the probkup approach.
 

TomBascom

Curmudgeon
You have extra steps to go through, that alone makes it obviously more delicate.

Copying the db at the OS level, rather than using probkup/prorest always runs the risk that you will miss something. All you need is for a single extent to be added in an unexpected path (for instance: accidentally because of a typo, or deliberately because of space constraints) to spoil the whole thing.

That also makes it obviously more delicate.

You may very well be correct that you need to do it this way because you need the process to be faster. It is, none the less, more fragile and delicate. I would not take this approach by default. This is a special case solution that really should only be used when necessary. IMHO.
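One cheap guard against the missed-extent failure mode above is a sanity check before prostrct builddb: parse the target's .st file and confirm every extent it names actually exists. This is a hypothetical helper, not an OpenEdge tool; it assumes absolute extent paths with no spaces, and d/b/a/t as the extent type letters.

```shell
# check_st: report any extent named in a structure (.st) file that is
# missing on disk, so a typo'd or forgotten extent is caught before
# prostrct builddb runs. Assumes absolute, space-free extent paths.
check_st() {
    st=$1
    rc=0
    # the first absolute path on each d/b/a/t line is the extent file
    for extent in $(awk '$1 ~ /^[dbat]$/ { for (i = 2; i <= NF; i++) if ($i ~ /^\//) { print $i; break } }' "$st"); do
        if [ ! -f "$extent" ]; then
            echo "MISSING: $extent"
            rc=1
        fi
    done
    [ "$rc" -eq 0 ] && echo "all extents present"
    return $rc
}

# demo against a throwaway structure file with one extent deliberately absent
dir=$(mktemp -d)
touch "$dir/devdb.d1"
printf 'd "Schema Area":6,32;1 %s/devdb.d1\nb %s/devdb.b1\n' "$dir" "$dir" > "$dir/devdb.st"
check_st "$dir/devdb.st" || true    # reports MISSING for the absent .b1
```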
 

TomBascom

Curmudgeon
Also -- the times that you quote for restores tell me that your IO subsystem really sucks. I'd be a bit more focused on getting that upgraded to something reasonable.
 
Also -- the times that you quote for restores tell me that your IO subsystem really sucks. I'd be a bit more focused on getting that upgraded to something reasonable.

The plan is to have our prod db on a Pure array and the rest of the environments on NetApp. Our db size is roughly 1.3 TB. Two rounds of migration testing are complete, and the statistics I provided are from those tests. Don't mind me asking, Tom: based on your experience, how much time would you expect the restore to take in my case? I can go back to our admin team with these statistics to see if they can get the I/O issue resolved.
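For reference, here is the throughput implied by those figures, as a rough sketch (it assumes 1 TB = 1024² MB and ignores probkup/prorest compression, so treat the numbers as floors):

```shell
# Effective MB/s implied by moving a 1.3 TB database in a given number
# of hours. Assumes TB = 1024^2 MB; compression and block sizes also
# matter, so these are rough floors, not raw device throughput.
mbs() { awk -v tb="$1" -v hrs="$2" 'BEGIN { printf "%.0f MB/s\n", tb * 1024 * 1024 / (hrs * 3600) }'; }
mbs 1.3 8     # the 8 hr pregrown restore: 47 MB/s
mbs 1.3 2.5   # a 2.5 hr restore would need: 151 MB/s
```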
 
You have extra steps to go through, that alone makes it obviously more delicate.

Copying the db at the OS level, rather than using probkup/prorest always runs the risk that you will miss something. All you need is for a single extent to be added in an unexpected path (for instance: accidentally because of a typo, or deliberately because of space constraints) to spoil the whole thing.

That also makes it obviously more delicate.

You may very well be correct that you need to do it this way because you need the process to be faster. It is, none the less, more fragile and delicate. I would not take this approach by default. This is a special case solution that really should only be used when necessary. IMHO.

@TomBascom Very true, I totally agree with it.
 

TomBascom

Curmudgeon
A good IO subsystem ought to be able to restore that in 2 or 3 hours.

Notice I said "IO subsystem" -- not SAN... There is no such thing as "high performance SAN". There are only SANs that don't suck quite as bad as some others.
 
A good IO subsystem ought to be able to restore that in 2 or 3 hours.

OMG! If that were possible on our system, I would love to stick with probkup/prorest rather than deal with these hassles.

Is there any tool I can use to monitor a restore and collect statistics, so I can present the data to the admin team?
 

TomBascom

Curmudgeon
The relevant statistics are:

1) Start time
2) End time

The clock on the wall ought to work just fine...

I suppose if the storage people want to look at something, they could look at how many IO ops they manage to complete and where their internal bottlenecks are. There's not much chance that they will actually look -- it goes against their nature -- but just in case you are lucky enough to have a storage person who is willing to look into it, the most likely bottlenecks are:

1) RAID5 (or, worse, RAID6) write operations
2) The network interfaces -- on the SAN and on the server
3) Data going from the SAN, over the network, to the server's CPUs and then back over the network to the SAN...
4) Anything else going on on the SAN competing for IO resources

If you want your IO to be fast:

1) Ditch the SAN
2) Use *internal* SSD -- you're just wasting your money putting SSD on a SAN (see point 4 above...)

Internal SSD is both *fast* and *cheap* (cheap compared to SAN storage anyway...). You do not often get to have both fast and cheap in the same package.

The downside is that your SAN admins will not be able to "manage" it using their SAN console. Personally I will take that particular trade every time.
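If you want a number to hand the storage team, a crude sequential-write check on the restore filesystem will do. The path and size below are placeholders, not something from a real setup:

```shell
# Rough sequential-write rate of the restore target. conv=fdatasync forces
# the data to disk before dd reports a rate, so the figure reflects the
# storage rather than the page cache. Path and size are placeholders.
target=$(mktemp -p /var/tmp ddtest.XXXXXX)   # put this on the restore filesystem
dd if=/dev/zero of="$target" bs=1M count=256 conv=fdatasync
rm -f "$target"
```

During the restore itself, `iostat -xm 10` (from the sysstat package) shows per-device MB/s and utilization alongside the wall clock.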
 
When I tested on internal SSDs, the restore took 2.8 hrs and the OS-copy option took under 1 hour. As you pointed out, if they were fine with using internal SSDs I would not be facing this issue at all, but here they want prod on an iSCSI SAN and the non-prod environments on NetApp (as a consultant I can only go so far in explaining the pros and cons of internal SSD to them). I remember them saying we are on a 10G network (so point #3 shouldn't be an issue) and that this is the best they could get.
 