Re: Wiping out data for consul/etcd with bosh drain script?


Amit Kumar Gupta
 

Hey Tomoe,

Great question. I would also prefer disaster recovery to be possible via
"bosh stop" followed by "bosh start". This is almost possible with etcd,
except for the conditional you found. The reason for the conditional is that
it allows rolling a 1-node cluster without data loss. I'd like to entertain
the idea that etcd-release's SLA (currently embodied in the acceptance
tests [0]) should drop the requirement of preserving data across a 1-node
cluster roll. That reduced SLA would probably be fine with the community,
and the improved disaster recovery experience would be worth it, but I
haven't validated that.
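
To make this concrete, here is roughly what an unconditional wipe in the
etcd drain script could look like. This is only a sketch; the data directory
path, job name, and monit usage are placeholders I made up, not the actual
release templates:

    #!/bin/bash -e
    # Hypothetical drain sketch: always stop etcd and wipe its data so the
    # node rejoins as a fresh member on "bosh start". DATA_DIR is a
    # placeholder; the real release wires its paths in via ERB.
    DATA_DIR=/var/vcap/store/etcd

    # Stop the etcd process if it is still running.
    /var/vcap/bosh/bin/monit stop etcd || true

    # Remove the member's data so it starts from a clean slate.
    rm -rf "${DATA_DIR:?}"/*

    # Tell BOSH the drain has completed.
    echo 0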

consul-release is a little further removed from this possibility because
consul requires different orchestration logic, and the current
implementation doesn't wipe out the data as aggressively. We already have a
story [1] to explore whether we could do that without reducing the SLA [2].
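
For comparison, a consul drain that wiped data would also need to leave the
cluster gracefully first. Something along these lines, where the paths, job
name, and exact leave mechanics are assumptions on my part rather than what
consul-release does today:

    #!/bin/bash -e
    # Hypothetical consul drain sketch: gracefully leave the cluster, stop
    # the agent, then wipe local state. Not the current release behavior.
    DATA_DIR=/var/vcap/store/consul_agent

    # Ask the local agent to leave so the rest of the cluster doesn't keep
    # waiting on this node.
    /var/vcap/packages/consul/bin/consul leave || true

    /var/vcap/bosh/bin/monit stop consul_agent || true

    # Wipe raft/serf state so the node rejoins clean on start.
    rm -rf "${DATA_DIR:?}"/*

    echo 0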

By the way, have you tried this? If you have a healthy 3-node etcd cluster
and you bosh stop then bosh start, does the cluster recover? If you have a
3-node cluster with data and you do the same, does the cluster recover
(with data loss, which is acceptable in this case)? Even more interesting
would be to see what happens when you try this with a cluster that is
actually out of sync. This would be helpful input to have before we get a
chance to prioritize the implementation.
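
If it helps, the manual test I have in mind is roughly the following; the
job name "etcd" and the etcdctl invocation are placeholders for whatever
your manifest actually uses:

    # With your deployment targeted, stop every etcd instance, then bring
    # them all back.
    bosh stop etcd
    bosh start etcd

    # From one of the etcd VMs (e.g. via "bosh ssh"), check whether the
    # cluster re-formed and elected a leader. Which endpoint flags you need
    # depends on your manifest; the local default often works on the VM.
    etcdctl cluster-health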

[0] https://github.com/cloudfoundry-incubator/etcd-release/tree/master/src/acceptance-tests
[1] https://www.pivotaltracker.com/story/show/120648349
[2] https://github.com/cloudfoundry-incubator/consul-release/tree/master/src/acceptance-tests

Best,
Amit

On Wed, Jun 15, 2016 at 10:00 PM, Tomoe Sugihara <tsugihara@pivotal.io>
wrote:

Hi there,

I have seen issues multiple times where consul and/or etcd nodes went out
of sync and needed to be clean-restarted, as explained in the Disaster
Recovery instructions [1][2].

I am wondering if it makes sense to add those steps to the bosh drain
script [3]. That way, you always get a clean start, and if something is
wrong, you can recover with "bosh stop" followed by "bosh start". FWIW, I
noticed that etcd already does this conditionally [4][5]; see the sketch
after this paragraph.
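
To illustrate what I mean by "conditionally", the check is roughly of this
shape. This is only my paraphrase of the idea, not the actual template
contents of [4][5], and the variable names are made up:

    # Rough sketch of the existing behavior as I read it: only wipe the
    # data dir when there are other members to re-replicate from; a
    # single-node cluster keeps its data across the roll.
    if [ "${CLUSTER_SIZE:?}" -gt 1 ]; then
      rm -rf "${DATA_DIR:?}"/*
    fi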

Maybe there are some drawbacks, but I thought I'd start a thread to hear
from the experts.

[1]: https://github.com/cloudfoundry-incubator/etcd-release#disaster-recovery
[2]: https://github.com/cloudfoundry-incubator/consul-release#disaster-recovery
[3]: https://bosh.io/docs/drain.html
[4]: https://github.com/cloudfoundry-incubator/etcd-release/blob/e809b46202be89a24c7bfeebf270d24b17589260/jobs/etcd/templates/drain#L12
[5]: https://github.com/cloudfoundry-incubator/etcd-release/blob/e809b46202be89a24c7bfeebf270d24b17589260/jobs/etcd/templates/etcd_bosh_utils.sh.erb#L147

Best,
Tomoe
