Safest way to enable replication for very large 3TB database

dunedain

New Member
Hi,

I'm new to the OpenEdge database, so please correct me if I'm wrong on some points.
I joined the team three months ago and don't yet have much OpenEdge experience, so I'd really appreciate your help with my problem.
What we have :
2 Nodes - source/target
OpenEdge edition - 11.7
OS - CentOS 7.8
Replication mode - async
In our project we have a 3 TB database with replication previously enabled by the vendor's DBAs, so all source/target replication config files are available unchanged. AI files with archiving are enabled on the source DB.
Last week we lost our replication database: the main active DB extent file became corrupted. The DB was already 3 TB when replication crashed, so naturally the AI files on the source became LOCKED. Since we only have a daily full backup of the DB, and no incremental backups for quickly restoring the target database, we decided to disable replication on the source to release the locked AI files, because the source database must be available 24/7.
My task is to restore replication for this database from scratch. Currently we're seeing exponential growth of the DB, ~200 GB per week.
We have an NFS share between the source and target DB hosts, with a 10Gb NIC.
Below is my rough plan for restoring replication. Can anyone tell me whether this plan is OK? I'm concerned about hitting an issue with the source AI files while restoring the target; I want to prevent the AI files from going into "Locked" mode again.

Code:
-- on source db
probkup online $db /mnt/store/progressdb/$db-repl.bk
proutil $db -C enablesitereplication source
probkup online $db incremental /mnt/store/progressdb/$db-repl-inc.bk -com -REPLTargetCreation
dsrutil /db1/$db/$db -C restart server

-- on target db
prorest $db /mnt/store/progressdb/$db-repl.bk
prorest $db /mnt/store/progressdb/$db-repl-inc.bk -REPLTransition
/opt/local/bin/dbup_slave.sh /db1/$db/$db
 

Cringer

ProgressTalk.com Moderator
Staff member
Hi dunedain. Thanks for posting.

Your plan looks sound to me. Obviously, moving the initial backup to the target machine is going to take a long time, so you want to do that, and have it done, BEFORE you enable replication on the source.

Make sure you have enough AI space in case the transition of the incremental takes longer than you expect.

Make sure your PICA is set high enough on the source.

Make sure you monitor your replication in future so this doesn't happen again. The ProTop Monitoring and Alerting Service (or another provider local to you) can help with that. If you're Europe-based, please reach out for a demo; if not, I can put you in touch with someone local to you.

As an aside, 200GB of growth a week is not exponential. It's fast, but it's a straight line, not exponential. ;)
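Until a full monitoring service is in place, a small script around the standard utilities can at least flag locked AI extents early. This is only a sketch: `rfutil <db> -C aimage extent list` and `dsrutil <db> -C status` are the standard commands, but the exact "Locked" wording the grep matches on is an assumption; check it against your own 11.7 output.

```shell
# Count AI extents reported as "Locked" on stdin. The "Locked" match is
# an assumption about the rfutil output wording; verify it on 11.7.
count_locked() {
    grep -c 'Locked' || true
}

# Example usage (commented out; needs a running OpenEdge environment):
#   rfutil /db1/$db/$db -C aimage extent list | count_locked
#   dsrutil /db1/$db/$db -C status
```

Wired into cron with an alert threshold, this gives you a crude early warning well before all extents fill.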
 

dunedain

New Member

Hi Cringer,

Thank you for reply.
We have an NFS share between the source and target DB hosts, with a 10Gb Ethernet adapter.

Following your advice: I can take a full backup on the source directly to the NFS share, where it is immediately visible to the target DB. Once the DB is restored on the target, I take an incremental backup of the source DB, again directly to the NFS share, and restore it on the target, so the target DB should be almost consistent with the source. Once the target DB is restored, I can enable replication on both sides.

Since restoring a 3TB database could take hours, how will the replication server/agent sync the two databases? Will it use the archived AI files created between the full and incremental backups, or does it require the operational AI files to stay in Locked mode until both the full and incremental backups have been restored on the target?

My main concern is that taking the full/incremental backups on the source and restoring them on the target would cause the AI files to go into Locked mode. Please correct me if I'm wrong.


"Make sure you have enough AI space in case the transition of the incremental takes longer than you expect."

We have 15 AI files, variable size with a limit of 2GB each due to the extent size limit. As I understand it, 14 AI files can be in Locked mode at a time without affecting the source DB, but if all 15 AI files become Locked before replication is restored, the source DB will be stopped? Please correct me if I'm wrong.
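To put a number on that worry, here is a rough back-of-the-envelope sketch. The AI generation rate used is a placeholder assumption, not a measured value; substitute the rate observed in your own archive directory.

```shell
# Headroom estimate for AI extents capped at a fixed size.
# AI_RATE_GB_PER_DAY is a placeholder assumption, not a measured value.
EXTENTS=15
EXTENT_GB=2
AI_RATE_GB_PER_DAY=230

TOTAL_GB=$((EXTENTS * EXTENT_GB))
# Integer hours until all extents fill at the assumed rate
HOURS=$((TOTAL_GB * 24 / AI_RATE_GB_PER_DAY))
echo "AI capacity ${TOTAL_GB}GB, roughly ${HOURS}h before all extents fill"
```

If the extents really cap at 2GB, 15 of them give only ~30GB of headroom, which a busy database can consume in a few hours, so measure your real AI rate before committing to a maintenance window.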
 

TomBascom

Curmudgeon
I would make the backup with -Bp 10.

The biggest challenge is to get it all done quickly enough.

Have you tested any of the timing for these steps?

Personally I've never found much advantage to incremental backups. In theory it seems like it ought to save time. But in practice? Not so much.

One thing to watch out for is that you cannot permit the source db to backup again. That is to say: there cannot be any new backups after the backup that you restore and plan to synchronize until your target is successfully replicated.

The shared NFS mount might be useful - it all depends on how quickly you can create a backup and restore it that way. In cases where the servers are in the same data center you might get the whole process done in a few hours. But if the source and the target are far away from each other it could take days.

If the shared filesystem is too slow you might try piping the data over an SSH connection. Or maybe you have some SAN based replication available?
 

dunedain

New Member

Hi Tom,

Thank you for reply!

Both source and target are in the same Data Center.
The NFS mount is created and attached directly from HP storage, so I guess it should be fast enough. Network speed between source and target is 10Gb.
So my plan is to take the full and incremental backups directly on the NFS mount from the source; once they complete, the backups are immediately available to the target.
A full backup without compression takes almost 7 hours.
Let's suppose the full backup restore takes another 7 hours; in that case the incremental backup would contain 14 hours' worth of changes. What will happen to the AI files in this case? Will they be in Locked mode for all 14 hours? Or will the replication server use the archived AI files created during this 14-hour gap?

"The biggest challenge is to get it all done quickly enough"

That's why I'm concerned about how quickly the backup/restore process will complete, and about the status of the AI files.
 

Cringer

ProgressTalk.com Moderator
Staff member
-Bp 10 is private buffers. It reserves 10 buffers in the buffer pool for the backup, to avoid pulling the entire database through the whole buffer pool and thus destroying it for users until it refreshes itself.
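As a concrete sketch (the paths are hypothetical), the flag is simply appended to the online backup command:

```shell
# Hypothetical database and backup paths.
DB=/db1/mydb/mydb
BK=/mnt/store/progressdb/mydb-repl.bk

# -com compresses the backup; -Bp 10 gives the backup 10 private
# buffers so it does not flush the shared buffer pool out from
# under connected users.
CMD="probkup online $DB $BK -com -Bp 10"
echo "$CMD"   # dry run; drop the echo to actually run it
```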
 

Cringer

ProgressTalk.com Moderator
Staff member
Regarding the AIs, if you set them as variable and have large files enabled, then the last extent would just grow until you run out of disk space. Not ideal, but buys you time for enabling replication.
 

TomBascom

Curmudgeon
Everything within one data center is good, that should help a lot.

Using a shared filesystem avoids the need to copy the backup and that will save a lot of time.

"HP Storage" implies some sort of SAN. That's less helpful.

Personally I would not bother with the incremental backup and restore. You can test it but I doubt that it is going to save you any significant amount of time. You could quite possibly be done restoring the first backup before the 2nd (incremental) backup completes.

As Cringer says, if you are using variable ai extents with large files your limitation is disk space not number of extents. You should be able to easily estimate the required amount of disk space from historical patterns of ai usage.

Regarding after-image extent switching - you might want to increase the interval between "aimage new" during this process so that you do not use up empty extents too quickly.

If you are using fixed extents then you need to make sure that you have plenty of extents pre-allocated.

In both cases do not forget to account for the period post-restore when the two databases are being synchronized.

FWIW I have a customer with a 2TB database that I occasionally have to rebaseline. The source and target are in the same datacenter and I use an NFS share as the backup target. The storage is internal to the servers - no SAN is involved so the storage is pretty fast. The backup takes about 4 hours. The restore takes about 2 hours. If I do this during a slow period the sync is generally complete within 30 minutes. Your timing will, of course, be different and the HP storage is, IMHO, very likely to be your "Achilles' Heel".
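If the extent switch is driven by a cron job rather than by the AI management daemon, stretching the interval for the duration of the rebaseline is a one-line change. A sketch only; the path and schedule are hypothetical:

```
# crontab fragment: switch AI extents every 4 hours instead of every
# hour while the rebaseline runs (path and schedule are hypothetical)
0 */4 * * * rfutil /db1/mydb/mydb -C aimage new
```

Remember to restore the original schedule once the target is synchronized.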
 

dunedain

New Member
"As Cringer says, if you are using variable ai extents with large files your limitation is disk space not number of extents. You should be able to easily estimate the required amount of disk space from historical patterns of ai usage."

Thanks Tom,
I calculated AI usage from the AI archive files generated: 111 files of 2 GB each, so approximately 230 GB of AI archive files generated per day for the source DB. As I understand it, I need at least 230 GB of free storage space for AI files until replication is synchronized.
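That estimate can be sanity-checked with simple arithmetic, and measured live from the archive directory (the archive path below is hypothetical):

```shell
# Daily AI volume from the numbers in the post: 111 archive files of
# up to 2 GB each.
FILES_PER_DAY=111
FILE_GB=2
DAILY_GB=$((FILES_PER_DAY * FILE_GB))
echo "~${DAILY_GB} GB of AI per day"

# Live measurement (hypothetical archive path):
#   find /db1/aiarchive -mtime -1 -type f | wc -l
#   du -sh /db1/aiarchive
```

Note that 111 x 2 GB comes to ~222 GB/day; either way, budget that much AI space for every day the sync might take, not just a single day's worth.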

"Regarding after-image extent switching - you might want to increase the interval between "aimage new" during this process so that you do not use up empty extents too quickly."

This parameter is set to 3600.

Thanks Cringer,

"Regarding the AIs, if you set them as variable and have large files enabled, then the last extent would just grow until you run out of disk space. Not ideal, but buys you time for enabling replication"

Yes, large files are enabled, but in the db.st file all 15 AI files are set with "v 2048000". Does that mean that even though the AI extents are created as variable length, the max size for each is 2GB, and the last extent will switch back to the first once it reaches 2GB? As I understand it, in that case the last extent won't grow until we run out of disk space, because of the predefined size limit in db.st; it will just switch to the next one. Please correct me if I'm wrong.
 

TomBascom

Curmudgeon
How did you calculate the total time of the expected outage without having gone through a test restore and observing the time required to synchronize?
 

dunedain

New Member
How did you calculate the total time of the expected outage without having gone through a test restore and observing the time required to synchronize?
Not yet; this is just an expectation. Unfortunately I don't have the resources or another environment to test the exact outage.
 

TomBascom

Curmudgeon
If you have guessed wrong and it takes longer, you will need to be prepared to abort the process and disable replication on the source if you start to run out of ai space.

We've been talking about this for several days. It seems like you could have done a test restore by now.
 

dunedain

New Member

Thank you, Tom.

Yes, we plan to reinitialize replication next weekend.
 

SerWal

New Member
I've been in a similar situation. We had a 2.5 TB database that I needed to set up replication for.
We didn't have a fast shared filesystem between the source and target servers, and I didn't want to spend all day on it, so I tried a different method.

I did something like this:


Code:
# On source server
mkfifo -m 777 /tmp/source_fifo

# On target server
mkfifo -m 777 /tmp/target_fifo

# From the source server, start streaming the contents of source_fifo to target_fifo
cat /tmp/source_fifo | ssh <user@target_server> "cat > /tmp/target_fifo"

# On source server
probkup online <dbname> /tmp/source_fifo -REPLTargetCreation -verbose

# On target server
prorest <dbname> /tmp/target_fifo -REPLTransition -verbose

Instead of taking the backup -> copying the backup -> restoring the backup, I could do it all at the same time.
So instead of spending 10-12 hours building the replica, I was done in 4 hours.

Disadvantages of this method (that I'm aware of):

- If something goes wrong, you need to start from the beginning.

- If you are going to use this method, it's better to have the target database already created, with areas extended to fit all the backup data from the source. If you don't, your source database will freeze waiting for the target's extents to expand.
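Pre-building the target can be sketched as below. The structure file contents and sizes are entirely hypothetical; in practice you would derive them from `prostrct list` on the source and size the fixed extents to hold the current data.

```shell
# Hypothetical target structure file; extent sizes are in 1K blocks.
# Fixed ("f") extents are pre-allocated, so prorest never has to wait
# for them to extend.
cat > target.st <<'EOF'
b /db1/target/target.b1
d "Schema Area":6,32;1 /db1/target/target.d1 f 1024000
d "Data":7,64;8 /db1/target/target_7.d1 f 102400000
d "Data":7,64;8 /db1/target/target_7.d2
EOF

# Create the empty database from it (dry run; remove the echo for real use):
echo "prostrct create /db1/target/target target.st -blocksize 8192"
```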



It would be nice to hear what you think about this, and whether I'm missing something important here.
 

TomBascom

Curmudgeon
Yes, that is a very workable technique. We used something very similar a few years ago with a customer that had a multi-terabyte db and who needed to setup a replication target at a remote datacenter (so no fast access shared filesystem available).

As you mentioned, pre-building and expanding the target database saves a *lot* of time. That's actually helpful no matter what method you use. (BTW, this is also very helpful when doing a dump & load.)

One thing that went wrong... on the first attempt there was a network routing issue and our traffic somehow ended up going over the DSL backup link instead of the main fiber link. Needless to say that was unpleasant ;) The second time around that was fixed and we got it done in a much more reasonable amount of time.
 

SerWal

New Member
Now I remember one more thing that went wrong with this method.

Sometimes during the restore, Nagios monitoring managed to change the ownership of the replica DB's .lg file to the "nagios" user.
The database was restored under a separate DB user, so at some point (at the end of the restore) prorest couldn't write to the .lg file and the restore failed.

I still have no idea why that happened, since the nagios user only reads from the .lg file. But after several cases like this I started disabling Nagios monitoring for the replica while it's being restored this way.
 