Re: 3 etcd nodes don't work well in single zone
Tony
Hi Amit,
Let me explain the error I got in details. My env info: CentOS 6.5, Openstack Icehouse, Single-AZ 2 hm9000 instances, 3 etcd instances, Manifest: - name: etcd_z1 instances: 3 networks: - name: cf1 static_ips: - 100.64.1.21 - 100.64.1.22 - 100.64.1.23 persistent_disk: 10024 properties: metron_agent: deployment: metron_agent.deployment zone: z1 networks: apps: cf1 etcd: election_timeout_in_milliseconds: 1000 heartbeat_interval_in_milliseconds: 50 log_sync_timeout_in_seconds: 30 resource_pool: medium_z1 templates: - name: etcd release: cf - name: etcd_metrics_server release: cf - name: metron_agent release: cf update: {} properties: etcd: machines: - 100.64.1.21 - 100.64.1.22 - 100.64.1.23 etcd_metrics_server: nats: machines: - 100.64.1.11 - 100.64.1.12 ... I cf push dora app with 2 instances (https://github.com/cloudfoundry/cf-acceptance-tests/tree/master/assets/dora) And I can always get response from it. (curl dora.runmyapp.io --> "Hi, I'm Dora") The app runs well. Then I "cf app dora" and got ... requested state: started instances: ?/2 ... Then I "cf app dora" again after about 1 minute, and got ... requested state: started instances: 2/2 ... The instances' number varies between ?/2 and 2/2 after that. I also wrote a small script to send "cf app dora" every second and check the instances' number. if the number changed, then record it. Wed Jul 15 06:50:57 UTC 2015 instances: ?/2 (32s) Wed Jul 15 06:51:29 UTC 2015 instances: 2/2 (6s) Wed Jul 15 06:51:35 UTC 2015 instances: ?/2 (1m30s) Wed Jul 15 06:53:05 UTC 2015 instances: 2/2 (17s) Wed Jul 15 06:53:22 UTC 2015 instances: ?/2 (3m40s) Wed Jul 15 06:57:02 UTC 2015 instances: 2/2 (21s) Wed Jul 15 06:57:23 UTC 2015 instances: ?/2 (2m4s) Wed Jul 15 06:59:27 UTC 2015 instances: 2/2 ... From above we can see that: 1. instance number varies between ?/2 and 2/2 2. "?/2" can be got more often than "2/2" The app instances' number is always "2/2" when there is only one etcd instance. 
So I reckon the problem lies in running multiple etcd instances.

Other things I tried, none of which worked:

1. Stopped the etcd service on one etcd VM (monit stop etcd).
2. Restarted the 3 etcd services one by one.
3. Restarted all 3 etcd VMs (terminated the VMs and let them restart automatically).
4. Restarted the two hm9000 VMs.
5. Restarted haproxy (because I don't know whether the "for HA" refers to haproxy). The release notes at http://bosh.io/releases/github.com/cloudfoundry/cf-release?version=210 say: "Upgrade etcd server to 2.0.1 ... Should be run as 1 node (for small deployments) or 3 nodes spread across zones (for HA)".
6. Added these properties according to http://bosh.io/jobs/etcd?source=github.com/cloudfoundry/cf-release&version=210:
   election_timeout_in_milliseconds: 1000
   heartbeat_interval_in_milliseconds: 50
   log_sync_timeout_in_seconds: 30

In any case, it doesn't work with "three instances in a one-zone deployment, with all three instances in the same zone", as you mentioned. Do you have any suggestions? Or is there a mistake in my manifest?

Thanks,
Tony

--
View this message in context: http://cf-dev.70369.x6.nabble.com/cf-dev-3-etcd-nodes-don-t-work-well-in-single-zone-tp746p756.html
Sent from the CF Dev mailing list archive at Nabble.com.