Feedback request: changing etcd's DNS health check


Amit Kumar Gupta
 

Hi all,

The CF Infrastructure team is working on bumping the version of etcd in
etcd-release to the latest patch within the version 2 major line [0].

We've run into an issue [1] with the way etcd starts up in newer
versions, and are considering a workaround that would make the etcd DNS
health check more lenient. The impact would be that, when running etcd in
TLS mode, every etcd node in the cluster would register itself with Consul
as part of the etcd service as soon as the local Consul agent was up and
running, rather than once the etcd server itself was up and running.

Currently, only Diego uses etcd in TLS mode. Starting with CF v241, the
other components that use etcd will also be able to use it in TLS mode (up
until now, most CF deployments have had two separate etcd clusters: a
secure one for Diego and an insecure one for DEA/HM9k, Loggregator, and
routing API). Soon, Diego will stop using etcd altogether.

We tested various scenarios, and so far everything has worked. The
possible impact of this change is that current or future clients of a TLS
etcd cluster may sometimes hit an etcd instance that is registered in DNS
but isn't actually up and serving etcd yet, so they would need some retry
logic. They should have this retry logic anyway, and Loggregator is
currently resilient (as far as I can tell) to cases where it can't talk to
etcd.
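
For illustration only, here is a minimal sketch of the kind of client-side
retry logic described above, written in Python. It is not taken from any CF
component: the names with_retry and EtcdUnavailable are hypothetical, and the
attempt count and delays are assumed values. The operation passed in stands
for whatever etcd call the client makes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class EtcdUnavailable(Exception):
    """Hypothetical error: the resolved etcd node is not serving requests yet."""


def with_retry(
    operation: Callable[[], T],
    attempts: int = 5,          # assumed value, tune per client
    base_delay: float = 0.1,    # seconds before the first retry (assumed)
    max_delay: float = 2.0,     # upper bound on the backoff delay (assumed)
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or the attempts are used up.

    The delay doubles after each failed attempt, capped at max_delay.
    Only EtcdUnavailable is retried; any other exception propagates at once.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except EtcdUnavailable:
            if attempt == attempts:
                raise  # out of attempts, surface the last failure
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    # Simulate a node that is registered in DNS but only starts serving
    # on the third request.
    state = {"calls": 0}

    def get_key() -> str:
        state["calls"] += 1
        if state["calls"] < 3:
            raise EtcdUnavailable("node registered but not serving yet")
        return "value"

    print(with_retry(get_key), "after", state["calls"], "calls")
```

A client wrapped this way tolerates a node that appears in DNS a little
before its etcd server is ready, at the cost of some added latency on the
first few requests.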

I wanted to check with the community and the core development teams
involved to see whether you have any feedback on this proposal before we
pull the trigger on making the DNS health check more lenient and bumping
etcd.

[0] https://www.pivotaltracker.com/story/show/126948757
[1] https://github.com/coreos/etcd/issues/6262

Thanks,
Amit, CF Infrastructure team PM
