How to estimate reconnection / failover time between gorouter and nats
Masumi Ito
Hi,
Can anyone explain the expected reconnection / failover time for gorouter when one of the nats VMs hangs up accidentally? The background of this question is that I found the gorouter had a window in which it temporarily returned "404 Not Found" for app requests when one of the clustered nats servers was not responsive. This happened after about 2 min and then recovered in another 2-3 min. I understand it is mainly due to pruning stale routes and the reconnection / failover time to a healthy nats by gorouter. The first 2 min can be explained by the droplet_stale_threshold value. However, I am wondering what exactly happened in the other 2-3 min.

Note that bosh health monitor detected the unresponsive nats and finally recreated it; however, the gorouter had already received "router.register" messages from DEAs before the recreation was complete. Therefore I think this indicates a failover to the other nats rather than a reconnection to the recreated nats which was previously down.

I believe some connection parameters in the yagnats and apcera/nats clients are the keys here:

- Timeout: timeout for creating a new connection
- ReconnectWait: wait time before a reconnect happens
- MaxReconnect: unlimited reconnect attempts if this value is -1
- PingInterval: interval between pings to check whether a connection is healthy
- MaxPingOut: number of unanswered pings allowed before a reconnection is deemed necessary

1. When one of the nats servers hangs up, the connection might still exist until the TCP timeout has been reached.
2. The ping timer periodically sends pings to check whether the connection is stale, taking a total of (PingInterval * MaxPingOut), and then concludes it is necessary to reconnect to the next nats server.
3. Before reconnecting, the gorouter waits for ReconnectWait.
4. It creates a new connection to the next nats server within Timeout.
5. After that, the gorouter starts to register app routes from DEAs through the connected nats.
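Step 2's staleness detection can be sketched as a counter of unanswered pings. The following is a simplified, self-contained model only, not the actual yagnats/apcera client code, and the `pingMonitor` type is hypothetical:

```go
package main

import "fmt"

// pingMonitor is a toy model of the NATS client's ping timer described
// in step 2: each PingInterval tick sends a ping; a pong resets the
// outstanding count; reaching MaxPingOut marks the connection stale.
// Illustration only, not the real client implementation.
type pingMonitor struct {
	outstanding int
	maxPingOut  int
}

// tick fires once per PingInterval and reports whether the connection
// should now be treated as stale (i.e. a reconnect should begin).
func (m *pingMonitor) tick(gotPong bool) bool {
	if gotPong {
		m.outstanding = 0
		return false
	}
	m.outstanding++
	return m.outstanding >= m.maxPingOut
}

func main() {
	m := &pingMonitor{maxPingOut: 2}
	fmt.Println(m.tick(true))  // false: server healthy, pong received
	fmt.Println(m.tick(false)) // false: 1 unanswered ping
	fmt.Println(m.tick(false)) // true: MaxPingOut reached, reconnect
}
```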
Therefore my rough estimation is:

PingInterval (2 min) * MaxPingOut (2) + ReconnectWait (500 ms) + Timeout (2 sec)

I would appreciate it if someone could correct this rough explanation or give some more details.

Regards,
Masumi

--
View this message in context: http://cf-dev.70369.x6.nabble.com/How-to-estimate-reconnection-failover-time-between-gorouter-and-nats-tp2980.html
Sent from the CF Dev mailing list archive at Nabble.com.
Christopher Piraino <cpiraino@...>
Hi Masumi,
The sequence/estimation that you describe sounds accurate to us. I think ideally we should configure the NATS reconnection logic to initiate a reconnect before the stale_threshold value is reached. We have put a story in our icebox <https://www.pivotaltracker.com/story/show/110199022> for our PM to prioritize.

We also have some upcoming work around being able to configure the router to not prune routes when NATS is down. See this issue <https://github.com/cloudfoundry/gorouter/issues/102> on the GoRouter for the related discussion.

Chris and Shash - CF Routing Team

On Mon, Dec 7, 2015 at 8:28 AM, Masumi Ito <msmi10f(a)gmail.com> wrote:
Masumi Ito
Hi Chris and Shash,
> We have put a story in our icebox for our PM to prioritize.

Thanks a lot. I would like to add another case regarding this: the case where all of the clustered nats are down. There were two nats servers in the cluster. When both went down within about 10 sec of each other, recovery took up to 15 min, although each nats VM had been recreated in about 7 min, almost sequentially, by MicroBOSH. Do you have any idea of what happened internally?

> See this issue on the GoRouter with related discussion.

Can I ask when we can expect this new function to be implemented?

Regards,
Masumi

--
View this message in context: http://cf-dev.70369.x6.nabble.com/How-to-estimate-reconnection-failover-time-between-gorouter-and-nats-tp2980p3134.html
Shannon Coen
I have prioritized the story to try to reconnect when NATS is unavailable, before pruning. I cannot provide an ETA.

Before we prioritize the option to disable pruning, we'd like to verify with the Garden team that the risk of port reuse has been reasonably mitigated.

Shannon Coen
Product Manager, Cloud Foundry
Pivotal, Inc.

On Thu, Dec 17, 2015 at 5:52 AM, Masumi Ito <msmi10f(a)gmail.com> wrote: