Hi Amit,

On Thu, Jun 16, 2016 at 4:01 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Yup, the modification should be as simple as flipping these two lines:
https://github.com/cloudfoundry-incubator/etcd-release/blob/master/jobs/etcd/templates/etcd_bosh_utils.sh.erb#L153-L154
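To spell out the intended end state as I understand it (the condition, variable, and data-dir path below are my own illustration, not the literal lines at L153-L154):

---------
# Current drain behavior, roughly (illustrative only): keep the data
# when there are no other members, so a 1-node cluster roll doesn't
# lose its data.
if [ "${other_members}" -gt 0 ]; then   # made-up variable name
  rm -rf /var/vcap/store/etcd/*         # assumed data dir path
fi

# After the change: always wipe on drain, so a stopped node never
# comes back with stale state.
rm -rf /var/vcap/store/etcd/*           # assumed data dir path
---------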
It requires more setup to try the scenario with data, but it's more informative. If you have the time, yeah, I'd go straight to testing the with-data scenario. If not, even the more basic test (without data, just check if the cluster comes back healthy) would be valuable info.

I flipped the lines as you suggested and did some testing against 3 job instances that colocate the diego bbs and etcd jobs in a CF deployment. I did CF system-level testing: while running the following script, I monitored `cf app some-app` as well as the response from http://some-app.app.domain.

---------
#!/bin/bash -x

for t in {0..5}; do
  echo === $t, `date`
  for i in {0..2}; do bosh -n stop bbs-etcd $i; done
  for i in {0..2}; do bosh -n start bbs-etcd $i; done
done
---------

It looked like, as soon as the 1st job instance came back up, `cf app` started responding normally again.

Even more interesting would be to see what happens if you have an actual out of sync cluster, and try this. This would be helpful input to have before we would get a chance to prioritize the implementation.

Haven't done this yet, but any suggestions on how to produce this out-of-sync state?
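One thing I was considering trying (completely untested, and the ports below are guesses for this deployment rather than verified values): isolate one node from its peers with iptables while the app keeps generating writes, then remove the rules and see whether the rejoined member ends up out of sync. Roughly:

---------
#!/bin/bash -x
# Run on one bbs-etcd VM (via bosh ssh). 7001 is an assumed etcd
# peer port; adjust to whatever this deployment actually uses.
PEER_PORT=7001

# Cut this node off from the rest of the cluster.
iptables -A INPUT  -p tcp --dport ${PEER_PORT} -j DROP
iptables -A OUTPUT -p tcp --dport ${PEER_PORT} -j DROP

# ...meanwhile generate writes against another node, e.g. with the
# etcd v2 keys API (4001 = assumed client port):
#   curl -X PUT http://<other-node>:4001/v2/keys/smoke -d value=1

# Heal the partition and check whether the member is out of sync.
iptables -D INPUT  -p tcp --dport ${PEER_PORT} -j DROP
iptables -D OUTPUT -p tcp --dport ${PEER_PORT} -j DROP
---------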
Thanks, Tomoe

On Wed, Jun 15, 2016 at 11:45 PM, Tomoe Sugihara <tsugihara(a)pivotal.io> wrote:

Thanks Amit for sharing your insight. That was helpful. Some follow-up questions inline:
On Thu, Jun 16, 2016 at 3:05 PM, Amit Gupta <agupta(a)pivotal.io> wrote:
Hey Tomoe
Great question. I would also prefer disaster recovery to be possible via "bosh stop" then "bosh start". This is almost possible in etcd, except for the conditional you found. The reason for the conditional is that it allows rolling a 1-node cluster without data loss. I'd like to entertain the idea that etcd-release's SLA (currently embodied in the acceptance tests [0]) should drop the requirement of maintaining data for a 1-node cluster roll. That reduced SLA will probably be fine with the community, and the improved disaster recovery experience would be worth the reduced SLA, but I haven't validated that.
consul-release is a little further removed from this possibility because consul requires different orchestration logic, and currently the implementation doesn't wipe out the data as aggressively. We have a story [1] already to explore whether we could do that without reducing SLA [2].
By the way, have you tried this?
If you have a 3 node cluster with etcd (healthy), and bosh stop then bosh start, does the cluster recover?
Just to confirm: with a modified drain script that *always* wipes out the data (not conditionally, as in the current script)?
If you have a 3 node cluster with data, and you do this, does the cluster recover (with data loss, which is acceptable in this case)?
What's the difference between this and the first one above? "with data" or not?
Even more interesting would be to see what happens if you have an actual out of sync cluster, and try this. This would be helpful input to have before we would get a chance to prioritize the implementation.
I'll find some time to test those scenarios so I can share my findings.
Thanks, Tomoe
[0] https://github.com/cloudfoundry-incubator/etcd-release/tree/master/src/acceptance-tests
[1] https://www.pivotaltracker.com/story/show/120648349
[2] https://github.com/cloudfoundry-incubator/consul-release/tree/master/src/acceptance-tests
Best, Amit
On Wed, Jun 15, 2016 at 10:00 PM, Tomoe Sugihara <tsugihara(a)pivotal.io> wrote:
Hi there,
I have seen issues multiple times where consul and/or etcd nodes went out of sync and needed to be clean-restarted as explained in the Disaster Recovery instructions [1][2].
I am wondering if it makes sense to add those steps to the bosh drain script[3]. That way, you can always get a clean start, and if something is wrong, you can recover with "bosh stop" followed by "bosh start". FWIW, I noticed that etcd already does this conditionally[4][5].
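For concreteness, what I have in mind is a drain script along these lines (a rough sketch; the monit job name and data-dir path are placeholders I'm assuming, and the consul/etcd specifics would need more care):

---------
#!/bin/bash -e
# Hypothetical drain sketch: stop the job, then clear its data dir
# so the next "bosh start" brings the node up with a clean slate.
# A real script would also wait for the process to actually exit.

/var/vcap/bosh/bin/monit stop consul_agent

rm -rf /var/vcap/store/consul_agent/*   # assumed data dir path

# BOSH drain contract: print the number of seconds to wait on
# stdout, then exit 0.
echo 0
---------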
Maybe there're some drawbacks, but I thought I'd start a thread to hear from experts.
[1]: https://github.com/cloudfoundry-incubator/etcd-release#disaster-recovery
[2]: https://github.com/cloudfoundry-incubator/consul-release#disaster-recovery
[3]: https://bosh.io/docs/drain.html
[4]: https://github.com/cloudfoundry-incubator/etcd-release/blob/e809b46202be89a24c7bfeebf270d24b17589260/jobs/etcd/templates/drain#L12
[5]: https://github.com/cloudfoundry-incubator/etcd-release/blob/e809b46202be89a24c7bfeebf270d24b17589260/jobs/etcd/templates/etcd_bosh_utils.sh.erb#L147
Best, Tomoe