Feedback request: changing etcd's DNS health check
Amit Kumar Gupta
Hi all,
The CF Infrastructure team is working on bumping the version of etcd in etcd-release to the latest patch within the version 2 major line [0]. We've run into an issue [1] with the way that etcd starts up in newer versions, and are considering a workaround that would make the etcd DNS health check more lenient.

The impact would be that when running etcd in TLS mode, every etcd node in the cluster would register itself as part of the etcd service with Consul as soon as the local Consul agent is up and running, rather than once the etcd server itself is up and running.

Currently, only Diego uses etcd in TLS mode. In CF v241, other components which use etcd will start to be able to use etcd in TLS mode (up until now, there have been two separate etcd clusters in most CF deployments: a secure one for Diego and an insecure one for DEA/HM9k, Loggregator, and routing API). Soon, Diego will stop using etcd altogether.

We tested various scenarios, and so far things have seemed to be fine. The possible impact is that current or future clients of a TLS etcd cluster may sometimes hit an etcd instance that isn't actually up and serving etcd yet, so they would need some retry logic. They should have this retry logic anyway, and Loggregator is currently resilient (as far as I can tell) to cases when it can't talk to etcd.

I wanted to check with the community and the core development teams involved to see whether you have any feedback on this proposal before we pull the trigger to make the DNS health check more lenient and bump etcd.

[0] https://www.pivotaltracker.com/story/show/126948757
[1] https://github.com/coreos/etcd/issues/6262

Thanks,
Amit, CF Infrastructure team PM
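
For illustration only, here is a minimal sketch of the kind of client-side retry logic mentioned above: retry a call a bounded number of times with capped exponential backoff when the connection fails. It is not taken from any CF component. The name `call_with_retry`, its parameters, and the choice of `ConnectionError`/`TimeoutError` as the retryable errors are all assumptions made for the example; a real client would wrap whatever etcd call and error types its own library exposes.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.1, max_delay=2.0,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying on connection-style errors.

    Hypothetical helper: the delay doubles after each failure, capped at
    max_delay. The last error is re-raised once attempts are exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)  # give the etcd node time to start serving
            delay = min(delay * 2, max_delay)


if __name__ == "__main__":
    # Stand-in for an etcd request that fails until the server is serving.
    state = {"calls": 0}

    def fake_etcd_get():
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("etcd registered in DNS but not serving yet")
        return "value"

    print(call_with_retry(fake_etcd_get))
```

The `sleep` parameter is injectable only so the backoff can be exercised without real waiting; in a client it would be left at its default.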