How to estimate reconnection / failover time between gorouter and nats
Masumi Ito
Hi,
Can anyone explain the expected reconnection / failover time for gorouter when one of the nats VMs hangs unexpectedly?

The background of this question is that the gorouter temporarily returned "404 Not found Err" for app requests when one of the clustered nats VMs was not responsive. This started after about 2 min and then recovered in another 2-3 min. I understand it is mainly due to the pruning of stale routes plus the time the gorouter takes to reconnect / fail over to a healthy nats. The first 2 min can be explained by the droplet_stale_threshold value. However, I am wondering what exactly happened during the following 2-3 min.

Note that the bosh health monitor detected the unresponsive nats and eventually recreated it; however, the gorouter had already received "router.register" from the DEAs before the recreation was complete. I therefore think this indicates a failover to the other nats rather than a reconnection to the recreated nats that had previously been down.

I believe some connection parameters in the yagnats and apcera/nats clients are key here:

- Timeout: timeout for creating a new connection
- ReconnectWait: wait time before a reconnect is attempted
- MaxReconnect: maximum number of reconnect attempts (unlimited if this value is -1)
- PingInterval: interval between pings that check whether a connection is healthy
- MaxPingOut: number of pings tried before deciding that a reconnect is necessary

My understanding of the sequence is:

1. When one of the nats VMs hangs, the connection might still exist until the TCP timeout is reached.
2. The PingTimer periodically sends pings to check whether the connection is stale and, after (PingInterval * MaxPingOut) in total, concludes that it is necessary to reconnect to the next nats server.
3. Before reconnecting, the gorouter waits for ReconnectWait.
4. It creates a new connection to the next nats server within Timeout.
5. After that, the gorouter starts to register app routes from the DEAs through the newly connected nats.
Therefore my rough estimation is:

PingInterval (2 min) * MaxPingOut (2) + ReconnectWait (500 millisec) + Timeout (2 sec)

I would appreciate it if someone could correct this rough explanation or give some more details.

Regards,
Masumi