How to estimate reconnection / failover time between gorouter and nats

Masumi Ito


Can anyone explain the expected reconnection / failover time for
gorouter when one of the nats VMs hangs unexpectedly?

The background to this question is that I found the gorouter temporarily
returned "404 Not found Err" for app requests when one of the clustered
nats servers was not responsive. This started about 2 min after the hang
and recovered after another 2-3 min. I understand it is mainly due to the
gorouter pruning stale routes and the time it takes to reconnect / fail
over to a healthy nats. The first 2 min can be explained by the
droplet_stale_threshold value. However, I am wondering what exactly
happened during the other 2-3 min.

Note that the bosh health monitor detected the unresponsive nats and
eventually recreated it. However, the gorouter had received
"router.register" from DEAs before the recreation was complete. Therefore
I think this indicates a failover to the other nats rather than a
reconnection to the recreated nats which was previously down.

I believe some connection parameters in the yagnats and apcera/nats clients
are key to this.

- Timeout: maximum time allowed to create a new connection
- ReconnectWait: wait time before a reconnect attempt happens
- MaxReconnect: maximum number of reconnect attempts (unlimited if -1)
- PingInterval: interval between pings that check if a connection is healthy
- MaxPingOut: number of unanswered pings before determining that a
reconnection is necessary

1. When one of the nats servers hangs, the connection might still exist
until the TCP timeout has been reached.

2. PingTimer sends a ping every PingInterval to check if the connection is
stale. After MaxPingOut unanswered pings (about PingInterval * MaxPingOut
in total) it concludes that it is necessary to reconnect to the next nats
server.

3. Before reconnecting, the gorouter waits for ReconnectWait.

4. It creates a new connection to the next nats server within Timeout.

5. After that, the gorouter starts to register app routes from DEAs through
the newly connected nats.

Therefore my rough estimate is:
PingInterval (2 min) * MaxPingOut (2) + ReconnectWait (500 ms) +
Timeout (2 sec)
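
To make the arithmetic concrete, here is a minimal sketch of that estimate in Python. It is plain arithmetic, not the yagnats or apcera/nats API: the class and function names are hypothetical, the values are the ones quoted in this message (not verified client defaults), and it assumes the steps happen strictly one after another with the connect attempt taking the full Timeout in the worst case.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class NatsClientSettings:
    """Hypothetical container; fields mirror the parameters listed above."""
    ping_interval: timedelta   # PingInterval
    max_ping_out: int          # MaxPingOut
    reconnect_wait: timedelta  # ReconnectWait
    timeout: timedelta         # Timeout


def estimate_failover(s: NatsClientSettings) -> timedelta:
    """Rough upper bound: stale detection + reconnect wait + connect timeout."""
    detection = s.ping_interval * s.max_ping_out  # step 2
    return detection + s.reconnect_wait + s.timeout  # steps 3 and 4


# Values taken from the estimate in this message (assumed, not verified).
settings = NatsClientSettings(
    ping_interval=timedelta(minutes=2),
    max_ping_out=2,
    reconnect_wait=timedelta(milliseconds=500),
    timeout=timedelta(seconds=2),
)

estimate = estimate_failover(settings)
print(f"estimated failover time: {estimate.total_seconds()} s")  # 242.5 s
```

With these values the detection phase (4 min) dominates; ReconnectWait and Timeout add only 2.5 sec.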

I would appreciate it if someone could correct this rough explanation or
give some more details.

