Re: How to estimate reconnection / failover time between gorouter and nats


Christopher Piraino <cpiraino@...>
 

Hi Masumi,

The sequence and estimate that you describe sound accurate to us. Ideally,
we should configure the NATS reconnection logic to initiate a reconnect
before the stale_threshold value is reached. We have put a story in our
icebox <https://www.pivotaltracker.com/story/show/110199022> for our PM to
prioritize.

We also have some upcoming work on making the router configurable so that
it does not prune routes when NATS is down. See this issue
<https://github.com/cloudfoundry/gorouter/issues/102> on the gorouter
repository for related discussion.

Chris and Shash - CF Routing Team

On Mon, Dec 7, 2015 at 8:28 AM, Masumi Ito <msmi10f(a)gmail.com> wrote:

Hi,

Can anyone explain the expected reconnection / failover time for gorouter
when one of the nats VMs hangs unexpectedly?

The background of this question is that I found the gorouter temporarily
returned "404 Not found Err" for app requests when one of the clustered
nats servers was unresponsive. The errors started about 2 min after the
hang, and the gorouter recovered after another 2-3 min. I understand this
is mainly due to the pruning of stale routes and the time the gorouter
takes to reconnect / fail over to a healthy nats. The first 2 min can be
explained by the droplet_stale_threshold value. However, I am wondering
what exactly happened in the following 2-3 min.

Note that the bosh health monitor detected the unresponsive nats and
eventually recreated it. However, the gorouter had already received
"router.register" messages from the DEAs before the recreation was
complete. Therefore I think this indicates a failover to the other nats
rather than a reconnection to the recreated nats that was previously down.

I believe some connection parameters in the yagnats and apcera/nats clients
are key to this:

- Timeout: timeout for creating a new connection
- ReconnectWait: wait time before a reconnect attempt
- MaxReconnect: maximum number of reconnect attempts (unlimited if -1)
- PingInterval: interval between pings that check whether a connection is
healthy
- MaxPingOut: number of unanswered pings allowed before a reconnect is
considered necessary

1. When one of the nats servers hangs, the connection might still exist
until the TCP timeout has been reached.

2. The PingTimer periodically sends a ping to check whether the connection
is stale. After about (PingInterval * MaxPingOut) without a reply, it
concludes that it is necessary to reconnect to the next nats server.

3. Before reconnecting, the gorouter waits for ReconnectWait.

4. It creates a new connection to the next nats server within Timeout.

5. After that, the gorouter starts to register app routes from the DEAs
through the newly connected nats.

Therefore my rough estimate is:
PingInterval(2 min) * MaxPingOut(2) + ReconnectWait(500 millisec) +
Timeout(2 sec)

I would appreciate it if someone could correct this rough explanation or
give some more details.

Regards,
Masumi



