Hi,
OK, after running through the recovery scenario the cluster came back, and I
was finally able to pinpoint the root cause.
The reason: on 2 Diego cells, the consul_agent (client) was running out of
disk space to write its keys. My understanding is that the cluster
server node will swap if any errors occur from nodes during
registration.
How can I have /var/vcap/store mapped onto the ephemeral disk rather than the
root partition when not using a persistent disk in a Diego deployment?
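For anyone hitting the same symptom, one way to confirm the disk-space diagnosis is to check which filesystem backs /var/vcap/store and how much room is left on it. The following is only a minimal sketch, not part of consul-release or BOSH. The helper name `report_usage`, the 5% threshold, and the extra path /var/vcap/data (assumed here to be where the ephemeral disk is mounted) are illustrative assumptions.

```python
import os
import shutil


def report_usage(path, min_free_ratio=0.05):
    """Return (device_id, free_ratio, is_low) for the filesystem holding `path`.

    min_free_ratio is an arbitrary illustrative threshold, not a consul setting.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_ratio = usage.free / usage.total if usage.total else 0.0
    device_id = os.stat(path).st_dev  # equal ids mean the same filesystem
    return device_id, free_ratio, free_ratio < min_free_ratio


if __name__ == "__main__":
    # "/" and /var/vcap/store come from the thread; /var/vcap/data is an
    # assumption about where the ephemeral disk is mounted.
    for path in ("/", "/var/vcap/store", "/var/vcap/data"):
        if not os.path.exists(path):
            print(f"{path}: not present on this machine")
            continue
        dev, ratio, low = report_usage(path)
        flag = " LOW" if low else ""
        print(f"{path}: device={dev} free={ratio:.1%}{flag}")
```

If /var/vcap/store reports the same device id as "/", it is living on the root partition, which matches the situation described above.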
Sylvain
On Tue, Dec 20, 2016 at 8:39 AM, Etourneau Gwenn <gwenn.etourneau(a)gmail.com>
wrote:
Hi,
You can check the recovery scenario here:
https://github.com/cloudfoundry-incubator/consul-release#failure-recovery
Thanks.
Gwenn
2016-12-20 16:12 GMT+09:00 Sylvain Gibier <sylvain(a)munichconsulting.de>:
Hi,
Any hint on how to fix it? From a network topology point of view nothing
changed, and I can't find anything useful in the consul documentation on
re-forming my cluster. Currently the other 2 consul server nodes are
experiencing the issue, so I'm running on one consul node (server and leader)...
From a CF perspective, how can I reinitialize the consul cluster, and what is
the impact on the other components? I'm starting to see failing routing
requests at this stage.
Sylvain
On Tue, Dec 20, 2016 at 2:40 AM, Yitao Jiang <jiangyt.cn(a)gmail.com>
wrote:
We once had the same issue, caused by a network problem: the consul
server follower couldn't connect to the leader. The difference is that
we are running on OpenStack.
On Tue, Dec 20, 2016 at 12:32 AM, Sylvain Gibier <sylvain(a)munichconsulting.de> wrote:
Hi,
Diego has been the default in my CF installation (HA over 3 AZs), and
today, while doing a simple BOSH stemcell update of CF, consul_z1/0
keeps "failing after update".
If I look in the log file, I can see the following:
"
++ logger -p user.info -t vcap.consul-agent
++ tee -a /var/vcap/sys/log/consul_agent/consul_agent.stdout.log
error during start: 2/30 nodes reported failure
2016/12/19 14:49:50 [ERR] agent.client: Failed to decode response header: EOF
2016/12/19 14:49:50 [ERR] agent.client: Failed to decode response header: EOF
"
Also it seems that I have a bunch of errors:
"
2016/12/19 13:54:32 [INFO] consul: adding server consul-z3-0 (Addr: 10.10.30.37:8300) (DC: dc1)
2016/12/19 13:54:32 [INFO] consul: adding server consul-z2-0 (Addr: 10.10.20.37:8300) (DC: dc1)
2016/12/19 13:54:32 [ERR] agent: failed to sync remote state: No cluster leader
2016/12/19 13:54:32 [INFO] agent: Joining cluster...
2016/12/19 13:54:32 [INFO] agent: (LAN) joining: [10.10.10.37 10.10.20.37 10.10.30.37]
2016/12/19 13:54:32 [INFO] agent: (LAN) joined: 3 Err: <nil>
2016/12/19 13:54:32 [INFO] agent: Join completed. Synced with 3 initial agents
2016/12/19 13:54:32 [WARN] raft: Failed to get previous log: 503710 log not found (last: 503708)
2016/12/19 13:54:32 [INFO] raft: Removed ourself, transitioning to follower
"
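When the agent log is long, tallying entries by level and component can make the ERR and WARN lines easier to spot. This is a minimal sketch, assuming each entry sits on one line in the format shown above (timestamp, [LEVEL], component, message). The names `LINE_RE` and `summarize` are hypothetical helpers, not part of consul.

```python
import re
from collections import Counter

# Matches entries such as:
# 2016/12/19 13:54:32 [ERR] agent: failed to sync remote state: No cluster leader
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] (?P<component>[\w.]+): (?P<msg>.*)$"
)


def summarize(lines):
    """Count log entries per (level, component); unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match["level"], match["component"])] += 1
    return counts


if __name__ == "__main__":
    # Sample entries copied from the thread.
    sample = [
        "2016/12/19 13:54:32 [ERR] agent: failed to sync remote state: No cluster leader",
        "2016/12/19 13:54:32 [INFO] agent: Joining cluster...",
        "2016/12/19 13:54:32 [WARN] raft: Failed to get previous log: 503710 log not found (last: 503708)",
        "2016/12/19 14:49:50 [ERR] agent.client: Failed to decode response header: EOF",
    ]
    for (level, component), n in sorted(summarize(sample).items()):
        print(f"{level:5} {component:13} {n}")
```

Lines that do not match the pattern, such as shell trace output starting with "++", are ignored rather than counted.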
I can definitely confirm that in my current setup consul_z3 is the leader
(via consul info).
Any help or pointers on how to fix that?
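Besides consul info, the leader and peer set can also be read from Consul's HTTP status endpoints (/v1/status/leader and /v1/status/peers). The sketch below is only an illustration under assumptions: it assumes the local agent serves its HTTP API on http://127.0.0.1:8500, which may not hold for a given deployment, and the helper names are hypothetical.

```python
import json
import urllib.request


def parse_leader(body):
    """Decode the JSON string from /v1/status/leader; "" is treated as no leader."""
    leader = json.loads(body)
    if not isinstance(leader, str):
        raise ValueError(f"unexpected leader payload: {leader!r}")
    return leader


def parse_peers(body):
    """Decode the JSON array of "ip:port" strings from /v1/status/peers."""
    peers = json.loads(body)
    if not isinstance(peers, list) or not all(isinstance(p, str) for p in peers):
        raise ValueError(f"unexpected peers payload: {peers!r}")
    return peers


def fetch(base_url, path, timeout=2.0):
    """GET base_url + path and return the raw response body."""
    with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
        return resp.read()


if __name__ == "__main__":
    base = "http://127.0.0.1:8500"  # assumed agent HTTP address
    try:
        leader = parse_leader(fetch(base, "/v1/status/leader"))
        peers = parse_peers(fetch(base, "/v1/status/peers"))
    except (OSError, ValueError) as exc:  # URLError is a subclass of OSError
        print(f"could not query agent at {base}: {exc}")
    else:
        print("leader:", leader or "<none>")
        print("peers:", ", ".join(peers))
```

Running this on each server node and comparing the answers shows whether the nodes agree on who the leader is.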
Releases: CF: v234, Diego: 0.1467.0
IaaS: AWS
--
Regards,
Yitao