Failed to deploy diego 0.1452.0 on openstack: database_z2/0 is not running after update


Yunata, Ricky <rickyy@...>
 

Hi,

I'm currently deploying Diego in my OpenStack environment; however, I got an error while it was updating database_z2.
Below is the error message from debug.log:
<Bosh::Director::AgentJobNotRunning: `database_z2/0 (16c88d30-fe70-4d42-8307-34cc85521ca7)' is not running after update. Review logs for failed jobs: etcd>
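
In case it helps, the full logs the error refers to can be pulled from the director with something like this (a minimal sketch using the BOSH v1 CLI; flags may differ in other CLI versions):

# Fetch the job logs (rather than agent logs) for the failing instance;
# the job name and index come from the error message above.
bosh logs database_z2 0 --job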

My environment is:
Stemcell : Ubuntu-trusty Version 3192
CF Release : Version 230
Diego : Version 0.1452.0
Etcd : Version 38
Garden-linux : Version 0.334.0

I'm experiencing a similar error to this one; however, the solution there didn't work for me:
https://github.com/cloudfoundry-incubator/diego-release/issues/119

This is what I'm seeing in my error logs:

Monit summary
Process 'etcd' not monitored
Process 'bbs' running
Process 'consul_agent' running
Process 'metron_agent' running
System 'system_localhost' running

etcd_ctl.err.log
[2016-03-23 01:22:33+0000] + /var/vcap/packages/etcd/etcdctl -ca-file=/var/vcap/jobs/etcd/config/certs/server-ca.crt -cert-file=/var/vcap/jobs/etcd/config/certs/client.crt -key-file=/var/vcap/jobs/etcd/config/certs/client.key -C https://database-z2-0.etcd.service.cf.internal:4001 ls
[2016-03-23 01:22:33+0000] Error: cannot sync with the cluster using endpoints https://database-z2-0.etcd.service.cf.internal:4001
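
Two checks that might narrow this down (a hedged sketch; the z1 hostname is inferred by analogy with the z2 name in the log above, and the consul log further down shows z2 being dropped from the etcd service, so the z2 name failing to resolve while its check is critical would be expected):

# On database_z2/0: does the consul-provided DNS name resolve locally?
# (consul_agent serves *.cf.internal names via the VM's local resolver)
dig +short database-z2-0.etcd.service.cf.internal

# From the healthy database_z1/0 node: which peers does the cluster know about?
# /v2/members is the standard etcd v2 members API; certs as used by etcd_ctl above.
curl --cacert /var/vcap/jobs/etcd/config/certs/server-ca.crt \
     --cert /var/vcap/jobs/etcd/config/certs/client.crt \
     --key /var/vcap/jobs/etcd/config/certs/client.key \
     https://database-z1-0.etcd.service.cf.internal:4001/v2/members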

etcd.stderr.log
2016/03/23 00:56:52 etcdmain: couldn't find local name "database-z2-0" in the initial cluster configuration
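
As I understand it, this "couldn't find local name" message means etcd's own node name is missing from the initial-cluster list the job rendered at startup, so it's worth confirming that the etcd job's cluster property in the deployment manifest (etcd.cluster in etcd-release, if I have the property name right for v38) lists all three database jobs. A rough way to see what was actually rendered on the VM (the file layout varies between etcd-release versions, hence the broad grep):

# On database_z2/0: dump whatever cluster membership the job templates rendered.
grep -ri 'initial.cluster' /var/vcap/jobs/etcd/ 2>/dev/null
grep -ri 'database' /var/vcap/jobs/etcd/config/ 2>/dev/null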

consul_agent.stdout.log
2016/03/23 01:23:26 [WARN] agent: Check 'service:etcd' is now critical
2016/03/23 01:23:29 [WARN] agent: Check 'service:etcd' is now critical
2016/03/23 01:23:32 [WARN] agent: Check 'service:etcd' is now critical
2016/03/23 01:23:35 [WARN] agent: Check 'service:etcd' is now critical
2016/03/23 01:23:38 [WARN] agent: Check 'service:etcd' is now critical
2016/03/23 01:23:41 [WARN] agent: Check 'service:etcd' is now critical
2016/03/23 01:23:41 [WARN] dns: node 'database-z2-0' failing health check 'service:etcd: Service 'etcd' check', dropping from service 'etcd'
2016/03/23 01:23:41 [WARN] dns: node 'database-z2-0' failing health check 'service:etcd: Service 'etcd' check', dropping from service 'etcd'
2016/03/23 01:23:42 [WARN] dns: node 'database-z2-0' failing health check 'service:etcd: Service 'etcd' check', dropping from service 'etcd'
2016/03/23 01:23:42 [WARN] dns: node 'database-z2-0' failing health check 'service:etcd: Service 'etcd' check', dropping from service 'etcd'


This is what I see when I run `bosh instances --ps`:
+------------------------------------------------------------+---------+-----+------------------+--------------+
| Instance | State | AZ | VM Type | IPs |
+------------------------------------------------------------+---------+-----+------------------+--------------+
| access_z1/0 (598f16db-60c2-4c13-bcec-85ae2a38102d)* | running | n/a | access_z1 | 192.168.3.44 |
+------------------------------------------------------------+---------+-----+------------------+--------------+
| access_z2/0 (a83d049d-6c95-417e-84f4-9aced8a9136f)* | running | n/a | access_z2 | 192.168.4.56 |
+------------------------------------------------------------+---------+-----+------------------+--------------+
| brain_z1/0 (a95c56bb-a84d-41b4-91b1-ade57c773dbe)* | running | n/a | brain_z1 | 192.168.3.40 |
+------------------------------------------------------------+---------+-----+------------------+--------------+
| brain_z2/0 (eb386b16-c8e4-4c04-9582-20f4161f6e03)* | running | n/a | brain_z2 | 192.168.4.52 |
+------------------------------------------------------------+---------+-----+------------------+--------------+
| cc_bridge_z1/0 (b9870145-26d7-4e59-9358-97c43db6a110)* | running | n/a | cc_bridge_z1 | 192.168.3.42 |
+------------------------------------------------------------+---------+-----+------------------+--------------+
| cc_bridge_z2/0 (7477b06f-e501-4757-abda-8e29c7c15464)* | running | n/a | cc_bridge_z2 | 192.168.4.54 |
+------------------------------------------------------------+---------+-----+------------------+--------------+
| cell_z1/0 (a6ef0a8c-52c0-4bd2-abfb-2fcf0101dd24)* | running | n/a | cell_z1 | 192.168.3.41 |
+------------------------------------------------------------+---------+-----+------------------+--------------+
| cell_z2/0 (36f012e3-2013-44aa-9a92-18161d6854ad)* | running | n/a | cell_z2 | 192.168.4.53 |
+------------------------------------------------------------+---------+-----+------------------+--------------+
| database_z1/0 (5428cca8-9832-42f4-9b3a-a822eb6d7e96)* | running | n/a | database_z1 | 192.168.3.39 |
| etcd | running | | | |
| bbs | running | | | |
| consul_agent | running | | | |
| metron_agent | running | | | |
+------------------------------------------------------------+---------+-----+------------------+--------------+
| database_z2/0 (16c88d30-fe70-4d42-8307-34cc85521ca7)* | failing | n/a | database_z2 | 192.168.4.51 |
| etcd | unknown | | | |
| bbs | running | | | |
| consul_agent | running | | | |
| metron_agent | running | | | |
+------------------------------------------------------------+---------+-----+------------------+--------------+
| database_z3/0 (c802162f-0681-479e-bb9c-98dac7d78941)* | running | n/a | database_z3 | 192.168.5.31 |
+------------------------------------------------------------+---------+-----+------------------+--------------+
| route_emitter_z1/0 (f7f7a8f3-9784-4b99-b0a5-6efb4d193cf5)* | running | n/a | route_emitter_z1 | 192.168.3.43 |
+------------------------------------------------------------+---------+-----+------------------+--------------+
| route_emitter_z2/0 (7f4e7fb7-7986-432e-a2e3-b298d3070753)* | running | n/a | route_emitter_z2 | 192.168.4.55 |
+------------------------------------------------------------+---------+-----+------------------+--------------+

I tried stopping all running etcds on database_z1 and database_z2, running `rm -rf /var/vcap/store/etcd/*` on both VMs, and then starting the etcd process again with monit (sketched below). It seems that only one etcd node can run at a time: if I `monit start etcd` on database_z2 before database_z1, database_z2 ends up running and database_z1 fails; if I start database_z1 first instead, database_z1 runs and database_z2 fails.
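
For completeness, this is the sequence I used (a sketch, run via `bosh ssh` on each database VM; as I understand it this can only converge if the rendered initial-cluster list already names all the nodes):

# 1. Stop etcd on every database_z* VM before touching any data:
monit stop etcd

# 2. Wipe the local etcd data dir on each VM:
rm -rf /var/vcap/store/etcd/*

# 3. Start one node at a time (z1 first), waiting for "running" before the next:
monit start etcd
monit summary | grep etcd

The start-order behaviour above, where whichever node starts first wins and the other fails, looks to me like each node bootstrapping a fresh one-member cluster after the wipe, which again points at the cluster configuration rather than at stale data, though I may be wrong.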

Does anyone have an idea how to solve this? Thanks.

Regards
Ricky