Remarks about the “confab” wrapper for consul


Amit Kumar Gupta
 

Hi Benjamin,

Re bumping consul:

Yes, soon: https://www.pivotaltracker.com/story/show/113007637

Re updating confab so that people can tweak their consul settings directly
from the BOSH deployment manifest:

Currently that's not in the plans, no. However, we'd very much like to
understand your findings after you change the configuration.

Re updating this “skip_leave_on_interrupt” config in confab:

We don't currently plan on changing it. Because Raft needs to be very
carefully orchestrated when rolling clusters, scaling up, scaling down,
etc., we would need to see that there is a problem and that this setting is
the root cause. There are Consul acceptance tests (CONSATS) that exercise
all this orchestration:
https://github.com/cloudfoundry-incubator/consul-release#acceptance-tests

If you're seeing a lot of flapping, suspicions, failed acks, etc., this
points to a different root cause. It often has to do with restricted UDP
traffic, network ACLs, and the like. I opened an issue about this on the
consul repo a year ago, and it's still open:
https://github.com/hashicorp/consul/issues/916.
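
For example, a rough way to sanity-check gossip connectivity from one consul VM to another is to probe the serf LAN port (8301 by default) over both TCP and UDP, and to compare that with what the agent itself reports. This is just a sketch: nc -zu only catches obvious failures such as ICMP port-unreachable, and the path to the consul binary may differ in your deployment.

# from one consul VM, probe another one on the serf LAN port
nc -vz 10.244.0.54 8301     # TCP
nc -vzu 10.244.0.54 8301    # UDP (only detects obvious failures)

# and compare with what the agent itself sees
/var/vcap/packages/consul/bin/consul members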

Please keep us posted on your discoveries.

Best,
Amit


Benjamin Gandon
 

As an update, it looks like I’m running into the Node health flapping <https://github.com/hashicorp/consul/issues/1212> issue, which is more frequent with consul 0.5.x servers than with 0.6.x servers.

→ Q1: Are you planning to upgrade the consul version used in CF and Diego from 0.5.2 to 0.6.4 in the near future?


Also, people recommend the following settings to mitigate the issue.
"dns_config": {
"allow_stale": true,
"node_ttl": "5s",
"service_ttl": {
"*": "5s"
}
}
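
Once those settings are in place, their effect should be visible by querying the agent's DNS interface directly and checking the TTL of the answers. Something like this, assuming the default DNS port 8600 and taking uaa as an example service name (both to be adapted to what consul-release actually uses):

# ask the local consul agent's DNS interface for a registered service
dig @127.0.0.1 -p 8600 uaa.service.cf.internal SRV
# with "allow_stale" and a 5s TTL, the ANSWER section should show a TTL of 5
# instead of consul's default of 0
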
I’ll try those and keep you updated with the results next week. Unfortunately, I’ll have to fork the consul-release <https://github.com/cloudfoundry-incubator/consul-release> because those settings are also hardwired to their defaults <https://github.com/cloudfoundry-incubator/consul-release/blob/master/src/confab/config/consul_config_definer.go#L13-L35> in confab.

→ Q2: Are you planning an update of confab so that people can tweak their consul settings directly from the BOSH deployment manifest?


Regarding my previous remark about properly configuring “skip_leave_on_interrupt” and “leave_on_terminate” in confab, I understand that the default value of “true” for “leave_on_terminate” might be necessary to properly scale down a consul cluster with BOSH.

But I saw today that skip_leave_on_interrupt will default to true <https://github.com/hashicorp/consul/blob/master/CHANGELOG.md> for consul servers in the upcoming version 0.7.0. Currently, this config is hard-wired to its default value of “false” in confab.

→ Q3: Are you planning to update this “skip_leave_on_interrupt” config in confab?


/Benjamin


Benjamin Gandon
 

Thank you Amit for your answer.


I ran into the “all-consuls-go-crazy” situation again today, as I do nearly every day actually. As soon as they start this flapping membership issue, the whole cf+diego deployment goes down.

Before I delete the content of the persistent storage, when I restart the consul servers, they don’t manage to elect a leader:
https://gist.github.com/bgandon/08707466324be7c9a093a56fd95a64e4

After I delete /var/vcap/store/consul_agent on all 3 consul servers, a consul leader is properly elected, but the cluster rapidly starts flapping again with failure suspicions, missing acks, and timeouts:
https://gist.github.com/bgandon/cab53c22da66b24beff46389ba7f0bdc

And at that point, the load on the bosh-lite VM goes up to 280+ and everything becomes very unresponsive.

How can I bring the consul cluster back to a healthy state? I don’t want to have to reboot the bosh-lite VM and recreate all deployments with cloud-checks anymore.


/Benjamin


Amit Kumar Gupta
 

Orchestrating a raft cluster in a way that requires no manual intervention
is incredibly difficult. We write the PID file late for a specific reason:

https://www.pivotaltracker.com/story/show/112018069

For dealing with wedged states like the one you encountered, we have some
recommendations in the documentation:

https://github.com/cloudfoundry-incubator/consul-release/#disaster-recovery

We have acceptance tests we run in CI that exercise rolling a 3-node cluster, so if you hit a failure, it would be useful to get any logs you have.
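
Something along these lines should grab them, assuming the v1 bosh CLI (the agent also leaves its logs on the VMs under /var/vcap/sys/log/consul_agent/):

bosh logs consul_z1 0 --job
bosh logs consul_z1 1 --job
bosh logs consul_z2 0 --job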

Cheers,
Amit


Benjamin Gandon
 

Actually, doing some further tests, I realize a mere 'join' is definitely not enough.

Instead, you need to restore the raft/peers.json on each one of the 3 consul server nodes:

monit stop consul_agent
echo '["10.244.0.58:8300","10.244.2.54:8300","10.244.0.54:8300"]' > /var/vcap/store/consul_agent/raft/peers.json

And make sure you start them at roughly the same time with “monit start consul_agent”.
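
Putting it together, the recovery can be scripted per node roughly like this (just a sketch, with the IPs from my bosh-lite deployment; the JSON check is only there to avoid restarting consul on a malformed peers.json):

#!/bin/bash
set -e

# the full raft peer set, i.e. all 3 consul servers (adapt to your deployment)
PEERS='["10.244.0.58:8300","10.244.2.54:8300","10.244.0.54:8300"]'

monit stop consul_agent
# monit stop returns immediately, so wait until the consul process is actually gone
while pgrep -x consul > /dev/null; do sleep 1; done

echo "$PEERS" > /var/vcap/store/consul_agent/raft/peers.json
# refuse to go further if we just wrote malformed JSON
python -m json.tool /var/vcap/store/consul_agent/raft/peers.json > /dev/null

monit start consul_agent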

So this argues strongly for setting skip_leave_on_interrupt=true and leave_on_terminate=false in confab, because losing the peers.json is really something we don't want in our CF deployments!

/Benjamin


Benjamin Gandon
 

Hi cf devs,


I’m running a CF deployment with redundancy, and I just experienced my consul servers not being able to elect any leader.
That’s a VERY frustrating situation that keeps the whole CF deployment down until you get a deeper understanding of consul and figure out that they just need a silly manual 'join' to get back together.

But that was definitely not easy to nail down because at first glance, all I could see was monit restarting the “agent_ctl” every 60 seconds because confab was not writing the damn PID file.


More specifically, the 3 consul servers (i.e. consul_z1/0, consul_z1/1 and consul_z2/0) had properly left one another upon a graceful shutdown. This state was persisted as /var/vcap/store/raft/peers.json being “null” on each one of them, so they would not get back together on restart. A manual 'join' was necessary. But it took me hours to get there because I’m no expert with consul.

And until the 'join' was made, VerifySynced() kept failing in confab, and monit was constantly starting and stopping it every 60 seconds. But once you step back, you realize confab was actually waiting for a new leader to be elected before writing the PID file. Which is questionable.

So, I’m asking 3 questions here:

1. Does writing the PID file in confab that late really make sense?
2. Could someone please write some minimal documentation about confab, at least explaining what it is supposed to do?
3. Wouldn’t it be wiser for the cluster to simply become unhealthy whenever any of the consul servers is down?

With this 3rd question, I mean that even on a graceful TERM or INT, no consul server should perform any graceful 'leave'. With this different approach, they would properly come back up even after a complete graceful restart of the cluster.

This can be done with those extra configs from the “confab” wrapper:

{
  "skip_leave_on_interrupt": true,
  "leave_on_terminate": false
}
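
A quick way to check the difference once this is in place would be to restart one of the servers and watch how its peers see it (again just a sketch; the path to the consul binary may differ):

# on one of the consul server VMs
monit restart consul_agent

# meanwhile, on another consul server VM
watch /var/vcap/packages/consul/bin/consul members
# with the current graceful 'leave', the restarted node shows up as "left" and
# drops out of the raft peer set; with skip_leave_on_interrupt=true and
# leave_on_terminate=false it should show as "failed" at worst, and the
# peers.json is preserved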

What do you guys think of it?


/Benjamin