Re: Wiping out data for consul/etcd with bosh drain script?


Tomoe Sugihara

Hi Amit,

On Thu, Jun 16, 2016 at 4:01 PM, Amit Gupta <agupta(a)pivotal.io> wrote:
Yup, the modification should be as simple as flipping these two lines:
https://github.com/cloudfoundry-incubator/etcd-release/blob/master/jobs/etcd/templates/etcd_bosh_utils.sh.erb#L153-L154.

It requires more setup to try the scenario with data, but it's more
informative. If you have the time, yeah, I'd go straight to testing the
with-data scenario. If not, even the more basic test (without data, just
check if the cluster comes back healthy) would be valuable info.
I flipped the lines as you suggested and did some testing against 3
job instances that colocate the Diego BBS and etcd in a CF deployment.

I did CF system-level testing: while running the following script, I
monitored `cf app some-app` as well as the responses from
http://some-app.app.domain.

---------
#!/bin/bash -x

# Cycle all three bbs-etcd job instances six times:
# stop them all, then start them all again.
for t in {0..5}; do
  echo "=== $t, $(date)"
  for i in {0..2}; do bosh -n stop  bbs-etcd $i; done
  for i in {0..2}; do bosh -n start bbs-etcd $i; done
done
---------
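
The monitoring side was essentially a loop like this (a rough sketch;
the app name and domain are placeholders):

---------
#!/bin/bash

# Poll the CF API and the app route every few seconds while the
# stop/start loop above is running.
while true; do
  date
  cf app some-app | grep -E 'requested state|instances'
  curl -s -o /dev/null -w "http status: %{http_code}\n" http://some-app.app.domain
  sleep 5
done
---------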

It looked like `cf app` starts responding normally as soon as the
first job instance comes back up.
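
For the with-data variant, I think seeding a marker key before the
stop/start loop would be enough to tell the two scenarios apart; a
rough sketch, assuming the etcd v2 HTTP API is reachable on the client
port (4001 here, and the IP is just a placeholder):

---------
# Before the stop/start cycle: write a marker key.
curl -s -X PUT http://10.244.0.10:4001/v2/keys/drain-test -d value=before-restart

# After the cluster comes back: check whether the key survived.
# (Data loss is acceptable here, but it's worth knowing either way.)
curl -s http://10.244.0.10:4001/v2/keys/drain-test
---------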

Even more interesting would be to see what happens if you have an actual out of sync cluster, and try this. This would be helpful input to have before we would get a chance to prioritize the implementation.
I haven't done this yet; any suggestions on how to produce this
out-of-sync state?
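
One idea I could try (untested, and the port, paths, and monit job name
below are assumptions): partition one node from its peers while writes
are happening, then bounce it so it comes back with diverged state:

---------
# On one of the etcd VMs: block the peer port (7001 here; adjust to
# whatever the deployment actually uses, e.g. 2380 on stock etcd2).
sudo iptables -A INPUT  -p tcp --dport 7001 -j DROP
sudo iptables -A OUTPUT -p tcp --dport 7001 -j DROP

# ... generate some writes against the still-connected nodes ...

# Heal the partition and restart etcd on the isolated node.
sudo iptables -D INPUT  -p tcp --dport 7001 -j DROP
sudo iptables -D OUTPUT -p tcp --dport 7001 -j DROP
sudo /var/vcap/bosh/bin/monit restart etcd
---------

No idea whether that actually reproduces the out-of-sync state we've
seen, though.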

Thanks,
Tomoe


On Wed, Jun 15, 2016 at 11:45 PM, Tomoe Sugihara <tsugihara(a)pivotal.io>
wrote:

Thanks, Amit, for sharing your insight. That was helpful. Some
follow-up questions inline:

On Thu, Jun 16, 2016 at 3:05 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey Tomoe

Great question. I would also prefer disaster recovery to be possible via
"bosh stop" then "bosh start". This is almost possible in etcd, except for
the conditional you found. The reason for the conditional is that it allows
rolling a 1-node cluster without data loss. I'd like to entertain the idea
that etcd-release's SLA (currently embodied in the acceptance tests [0])
should drop the requirement of maintaining data for a 1-node cluster roll.
That reduced SLA will probably be fine with the community, and the improved
disaster recovery experience would be worth the reduced SLA, but I haven't
validated that.

consul-release is a little further removed from this possibility because
consul requires different orchestration logic, and currently the
implementation doesn't wipe out the data as aggressively. We have a story
[1] already to explore whether we could do that without reducing SLA [2].

By the way, have you tried this?


If you have a 3 node cluster with etcd (healthy), and bosh stop then bosh
start, does the cluster recover?

Just to confirm: with a modified drain script that *always* wipes out
the data (not conditionally, as in the current script)?


If you have a 3 node cluster with data, and you do this, does the cluster
recover (with data loss, which is acceptable in this case)?

What's the difference between this and the first one above? "with data" or
not?



Even more interesting would be to see what happens if you have an
actual out of sync cluster, and try this. This would be helpful input to
have before we would get a chance to prioritize the implementation.


I'll find some time to test those scenarios so I can share my findings.

Thanks,
Tomoe




[0] https://github.com/cloudfoundry-incubator/etcd-release/tree/master/src/acceptance-tests
[1] https://www.pivotaltracker.com/story/show/120648349
[2] https://github.com/cloudfoundry-incubator/consul-release/tree/master/src/acceptance-tests

Best,
Amit

On Wed, Jun 15, 2016 at 10:00 PM, Tomoe Sugihara <tsugihara(a)pivotal.io>
wrote:

Hi there,

I have seen issues multiple times where consul and/or etcd nodes went
out of sync and needed to be clean-restarted as explained in the Disaster
Recovery instructions [1][2].

I am wondering if it makes sense to add those steps to the bosh drain
script [3]. That way, you always get a clean start, and if something is
wrong, you can recover with "bosh stop" followed by "bosh start". FWIW,
I noticed that etcd already does this conditionally [4][5].
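
For illustration, the kind of drain script I have in mind would be
roughly like this (a hypothetical sketch, not the actual release
template; the store path is an assumption):

---------
#!/bin/bash

# Hypothetical drain: always wipe the data dir so the node rejoins
# the cluster from scratch on the next start.
rm -rf /var/vcap/store/etcd/*

# A bosh drain script must print an integer to stdout;
# 0 tells the director the job can be stopped right away.
echo 0
---------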

Maybe there are some drawbacks, but I thought I'd start a thread to
hear from experts.

[1]: https://github.com/cloudfoundry-incubator/etcd-release#disaster-recovery
[2]: https://github.com/cloudfoundry-incubator/consul-release#disaster-recovery
[3]: https://bosh.io/docs/drain.html
[4]: https://github.com/cloudfoundry-incubator/etcd-release/blob/e809b46202be89a24c7bfeebf270d24b17589260/jobs/etcd/templates/drain#L12
[5]: https://github.com/cloudfoundry-incubator/etcd-release/blob/e809b46202be89a24c7bfeebf270d24b17589260/jobs/etcd/templates/etcd_bosh_utils.sh.erb#L147

Best,
Tomoe
