WAN - Async Replication, best practice / configuration OE 10.2B

MrSparks

New Member
Hi All,

This is my first post, so I thought I should start by saying thank you. I’ve been following the forum for some time now, and it’s really helped me find solutions to day-to-day technical issues and greatly increased my OE/Progress technical knowledge. :)

OK, so… I’ve recently moved a replica (target) server to a remote data centre. Previously both the production and replica servers were at the same location, connected via a 1Gb LAN. Now that I’ve moved the replica offsite, I’m seeing replication outages around once a week (see the error message below). I believe this is due to connection drops, VPN key renegotiation, etc. I can’t really stop those issues; it’s the nature of the beast (WAN). So here’s my question: how can I configure replication to cope with small connection outages, whether it’s a couple of seconds or indeed longer periods when the ISP is carrying out maintenance on the WAN?

[error]
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (9407) Connection failure for host <server name> port 4387 transport TCP.
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (11713) A communications error -4008 in rpCOM_RecvMsg.
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Receive Error
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (-----) 0000: e8fd ae00 0000 0000 0000 0000 2311 0000 993a 0000 993a 0000 0200 0000 4400 0000
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (-----) 0020: 7a0b 0000 6a0a 0000 0000 0000 5f0b ae4f 0000 0000 3821 0000 0100 0000 1900 0000
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (-----) 0040: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (-----) 0060: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (-----) 0080: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (-----) 00a0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (-----) 00c0: 0000 0000 0000 0000 0000 0000 6572 6973 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (-----) 00e0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (-----) 0100: 0000 0000 0000 0000 0000 0000 3137 322e 3139 2e31 2e32 3200 0000 0000 0000 0000
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (-----) 0120: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (-----) 0140: 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (10492) A communications error -157 occurred in function rpNLS_PollListener while receiving a message.
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (10661) The Fathom Replication Server is beginning recovery for agent agent1.
[2012/05/12@08:06:19.495+0100] P-1408 T-5584 I RPLS 69: (10842) Connecting to Fathom Replication Agent agent1.
[/error]

Best

MrSparks
 

TomBascom

Curmudgeon
Isn't this just telling you that communications "burped"?

I have seen similar things at customer sites. As I recall replication just carries on after one of these "blips" (that's what the "recovery" and "connecting" messages are saying). It's annoying but not harmful.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
Judging from the timestamps this was a very short-lived interruption. Do you experience longer network outages?

What settings do you have in <dbname>.repl.properties on the source side?
 

MrSparks

New Member
Hi Tom & Rob,

Thanks for the replies. Below is a proper breakdown of the failure logs (sorry, I should have posted this to start with). I've also included the repl.properties from the source server at the bottom of the post.

@Tom, you're right, the source server does seem to be trying to reconnect. However, it looks like the target server switches to PRE-TRANSITION rather than waiting for a reconnection from the source. Maybe there's a setting to stop the target going into this state, e.g. transition-timeout?

@Rob, outages are normally very short (seconds). Ideally, though, I’d like to be able to cope with much longer outages to cover maintenance windows, etc.

I've checked the time and date on both the source and target servers. They are in sync.

Best
Marc



<Source Server Log>
[2012/05/12@08:04:59.560+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 4389 transport TCP.
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (11713) A communications error -4004 in rpCOM_SendMsg.
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Send Error
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (-----) 0000: 10b3 af00 0000 0000 0000 0000 2511 0000 f92a 0000 f92a 0000 0200 0000 4200 0000
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (-----) 0020: b278 0e00 1f77 0100 0000 0000 770b ae4f 0000 0000 3821 0000 0000 0000 0000 0000
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (-----) 0040: 0000 0000 baff ffff 4627 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (-----) 0060: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (-----) 0080: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (-----) 00a0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (-----) 00c0: 0000 0000 0000 0000 0000 0000 6572 6973 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (-----) 00e0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (-----) 0100: 0000 0000 0000 0000 0000 0000 3137 322e 3139 2e31 2e32 3200 0000 0000 0000 0000
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (-----) 0120: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (-----) 0140: 0000 0000 0000 0000 0000 0000
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (10491) A communications error -155 occurred in function rpNLS_SendAIBlockToAgent while sending AIBLOCK.
[2012/05/12@08:04:59.575+0100] P-5924 T-1320 I RPLS 35: (10661) The Fathom Replication Server is beginning recovery for agent agent1.
[2012/05/12@08:04:59.591+0100] P-5924 T-1320 I RPLS 35: (10842) Connecting to Fathom Replication Agent agent1.

<Source Server Log> <connection retries> <target server is alive and the connection is up during this period>
[2012/05/12@08:05:41.496+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:06:31.697+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:07:21.883+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:08:12.069+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:09:02.177+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:09:52.253+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:10:42.236+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:11:32.438+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:12:22.420+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:13:12.419+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:14:02.308+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:14:52.416+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:15:42.586+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:16:32.881+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:17:22.973+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
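
For anyone wanting to check the retry cadence in the log above, here is a minimal Python sketch. It is not an OpenEdge tool; the function names `failure_times` and `retry_gaps` are hypothetical, and the sample holds only the first three retry lines from the log. It assumes the timestamp layout shown in these logs and measures the time between successive (9407) messages, which come out at about 50 seconds apart.

```python
import re
from datetime import datetime

# First three retry lines copied from the source server log above.
LOG = """\
[2012/05/12@08:05:41.496+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:06:31.697+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
[2012/05/12@08:07:21.883+0100] P-5924 T-1320 I RPLS 35: (9407) Connection failure for host <target server name> port 11001 transport TCP.
"""

# Capture the bracketed timestamp of lines carrying message number 9407.
STAMP = re.compile(
    r"^\[(\d{4}/\d{2}/\d{2}@\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4})\].*\(9407\)"
)


def failure_times(text):
    """Return the timestamps of all (9407) connection-failure lines."""
    times = []
    for line in text.splitlines():
        match = STAMP.match(line)
        if match:
            times.append(
                datetime.strptime(match.group(1), "%Y/%m/%d@%H:%M:%S.%f%z")
            )
    return times


def retry_gaps(times):
    """Return the seconds elapsed between consecutive timestamps."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for gap in retry_gaps(failure_times(LOG)):
        print(f"{gap:.3f} s between connection failures")
```

Running the same functions over a full database .lg file would show whether the retry interval stays constant or whether attempts stop after some point. That may help when comparing the observed behaviour against the timeouts in repl.properties.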

<Target Server Log>
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (9407) Connection failure for host <target server name> port 2051 transport TCP.
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Poll Error:2
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (-----) 0000: 0000 0000 0000 0000 f8a0 ae00 2511 0000 2311 0000 9411 0000 0200 0000 2400 0000
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (-----) 0020: 2177 0100 b178 0e00 0000 0000 760b ae4f 0000 0000 3821 0000 0000 0000 7201 0000
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (-----) 0040: 0000 0000 58f0 ffff 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (-----) 0060: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (-----) 0080: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (-----) 00a0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (-----) 00c0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (-----) 00e0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (-----) 0100: 0000 0000 0000 0000 0000 0000 3137 322e 3137 2e31 2e32 3200 0000 0000 0000 0000
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (-----) 0120: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (-----) 0140: 0000 0000 0000 0000 0000 0000
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (10492) A communications error -157 occurred in function rpNLA_PollListener while receiving a message.
[2012/05/12@08:06:40.745+0100] P-1216 T-292 I RPLA 35: (11699) A TCP/IP failure has occurred. The Agent's will enter PRE-TRANSITION, waiting for connection from the Replication Server.
[2012/05/14@09:37:35.391+0100] P-1332 T-192 I RPLU 40: (452) Login by <admin user> on CON:.
[2012/05/14@09:37:35.407+0100] P-1332 T-192 I RPLU 40: (7129) Usr 40 set name to <admin user>.
[2012/05/14@09:37:35.422+0100] P-1332 T-192 I RPLU 40: (453) Logout by <admin user> on CON:.


<sourceserver.repl.properties>
[server]
control-agents=agent1
database=mydb
transition=manual
transition-timeout=600

[control-agent.agent1]
name=agent1
database=mydb
host=targetserver
port=11001
connect-timeout=120
replication-method=async
critical=0

[transition]
database-role=normal
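
As a quick way to see which timeouts are actually set, here is a minimal Python sketch that reads the properties above with the standard `configparser` module and lists every key ending in `-timeout`. The helper name `timeout_settings` is hypothetical, and the sketch only reads the values; what each property governs on the server or agent side should be checked against the OpenEdge Replication documentation.

```python
import configparser

# The source-side repl.properties exactly as posted above.
PROPERTIES = """\
[server]
control-agents=agent1
database=mydb
transition=manual
transition-timeout=600

[control-agent.agent1]
name=agent1
database=mydb
host=targetserver
port=11001
connect-timeout=120
replication-method=async
critical=0

[transition]
database-role=normal
"""


def timeout_settings(text):
    """Return {'section.key': seconds} for every key ending in '-timeout'."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    found = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            if key.endswith("-timeout"):
                found[f"{section}.{key}"] = int(value)
    return found


if __name__ == "__main__":
    for name, seconds in timeout_settings(PROPERTIES).items():
        print(f"{name} = {seconds} s")
```

With the file as posted this reports `transition-timeout` as 600 and `connect-timeout` as 120, which gives a starting point for comparing the configured values with the length of the outages seen in the logs.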
 

MrSparks

New Member
Hi Guys,

It took a while for my post to be approved by admin. Wondered if you had any thoughts around my last post? :)

Best
Marc
 

Cringer

ProgressTalk.com Moderator
Staff member
Apologies for the delay in approval. I've been away and it wasn't clear it was moderated :eek:
 