Re: Remarks about the “confab” wrapper for consul


Benjamin Gandon
 

Thank you Amit for your answer.


I ran into the “all-consuls-go-crazy” situation again today, as I do almost every day, actually. As soon as the flapping membership issue starts, the whole cf+diego deployment goes down.

If I restart the consul servers before deleting the content of the persistent storage, they fail to elect a leader:
https://gist.github.com/bgandon/08707466324be7c9a093a56fd95a64e4

After I delete /var/vcap/store/consul_agent on all 3 consul servers, a consul leader is properly elected, but the cluster rapidly starts flapping again, with failure suspicions, missing acks, and timeouts:
https://gist.github.com/bgandon/cab53c22da66b24beff46389ba7f0bdc

At that point, the load on the bosh-lite VM goes up to 280+ and everything becomes very unresponsive.

How can I bring the consul cluster back to a healthy state? I don’t want to reboot the bosh-lite VM and recreate all deployments with cloudchecks anymore.


/Benjamin

On Apr 11, 2016, at 22:40, Amit Gupta <agupta(a)pivotal.io> wrote:

Orchestrating a raft cluster in a way that requires no manual intervention is incredibly difficult. We write the PID file late for a specific reason:

https://www.pivotaltracker.com/story/show/112018069

For dealing with wedged states like the one you encountered, we have some recommendations in the documentation:

https://github.com/cloudfoundry-incubator/consul-release/#disaster-recovery

We have acceptance tests we run in CI that exercise rolling a 3 node cluster, so if you hit a failure it would be useful to get logs if you have any.

Cheers,
Amit

On Mon, Apr 11, 2016 at 9:38 AM, Benjamin Gandon <benjamin(a)gandon.org> wrote:
Actually, doing some further tests, I realize a mere 'join' is definitely not enough.

Instead, you need to restore the raft/peers.json on each one of the 3 consul server nodes:

monit stop consul_agent
echo '["10.244.0.58:8300","10.244.2.54:8300","10.244.0.54:8300"]' > /var/vcap/store/consul_agent/raft/peers.json

And make sure you start them at roughly the same time with “monit start consul_agent”.
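
To avoid typing the peers list by hand on each node, the array can be generated from the server addresses. This is only a minimal sketch: build_peers_json is a hypothetical helper, not something shipped with consul-release or confab, and it assumes plain IPv4 addresses that all share one raft port.

```shell
#!/bin/sh
# build_peers_json PORT IP [IP...]
# Prints a JSON array of "ip:port" strings, in the shape expected
# in raft/peers.json. Hypothetical helper, for illustration only.
build_peers_json() {
    port="$1"
    shift
    out=""
    for ip in "$@"; do
        # Prepend a comma separator for every entry except the first.
        out="${out:+$out,}\"${ip}:${port}\""
    done
    printf '[%s]\n' "$out"
}

# Example with the three server addresses quoted above:
# build_peers_json 8300 10.244.0.58 10.244.2.54 10.244.0.54
```

On each server, the output would be redirected to /var/vcap/store/consul_agent/raft/peers.json after “monit stop consul_agent” and before the agents are started again.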

So this strongly argues for setting skip_leave_on_interrupt=true and leave_on_terminate=false in confab, because losing the peers.json is really something we don't want in our CF deployments!

/Benjamin


On Apr 11, 2016, at 18:15, Benjamin Gandon <benjamin(a)gandon.org> wrote:

Hi cf devs,


I’m running a CF deployment with redundancy, and I just experienced my consul servers not being able to elect any leader.
That’s a VERY frustrating situation that keeps the whole CF deployment down, until you get a deeper understanding of consul, and figure out they just need a silly manual 'join' so that they get back together.

But that was definitely not easy to nail down because, at first glance, I could only see monit restarting the “agent_ctl” every 60 seconds because confab was not writing the damn PID file.


More specifically, the 3 consul servers (i.e. consul_z1/0, consul_z1/1 and consul_z2/0) had properly left one another upon a graceful shutdown. This state was persisted in /var/vcap/store/consul_agent/raft/peers.json being “null” on each one of them, so they would not get back together on restart. A manual 'join' was necessary. But it took me hours to get there because I’m no expert with consul.
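
A quick way to spot this state on a server is to look at what the peers file actually contains. The following is a rough sketch, assuming the file lives at /var/vcap/store/consul_agent/raft/peers.json as elsewhere in this thread; peers_state is a hypothetical helper, and treating an empty file or an empty array the same as “null” is my own assumption.

```shell
#!/bin/sh
# peers_state FILE
# Prints one of: missing, empty, populated.
# Hypothetical helper, for illustration only.
peers_state() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo missing
        return
    fi
    # Strip all whitespace so that "null" followed by a newline still matches.
    content=$(tr -d '[:space:]' < "$file")
    case "$content" in
        # Assumption: an empty file and "[]" are handled like "null".
        null|''|'[]') echo empty ;;
        *) echo populated ;;
    esac
}

# Example:
# peers_state /var/vcap/store/consul_agent/raft/peers.json
```

If all three servers report “empty”, they have no recorded peers to rejoin on restart, which matches the situation described above.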

And until the 'join' was made, VerifySynced() was negative in confab, and monit was constantly starting and stopping it every 60 seconds. But once you step back, you realize confab was actually waiting for the new leader to be elected before writing the PID file. Which is questionable.

So, I’m asking 3 questions here:

1. Does writing the PID file in confab that late really make sense?
2. Could someone please write some minimal documentation about confab, at least to tell what it is supposed to do?
3. Wouldn’t it be wiser for the cluster to become unhealthy whenever any of the consul servers is missing?

With this 3rd question, I mean that even on a graceful TERM or INT, no consul server should perform a graceful 'leave'. With this different approach, they would properly come back up even after a complete graceful restart of the cluster.

This can be done with those extra configs from the “confab” wrapper:

{
"skip_leave_on_interrupt": true,
"leave_on_terminate": false
}

What do you guys think of it?


/Benjamin
