Re: 3 etcd nodes don't work well in single zone



Hi Amit,

Let me explain the error I got in detail.

My env info:
CentOS 6.5
OpenStack Icehouse
single AZ
2 hm9000 instances
3 etcd instances


Manifest:

- name: etcd_z1
  instances: 3
  networks:
  - name: cf1
    static_ips:
    - 100.64.1.21
    - 100.64.1.22
    - 100.64.1.23
  persistent_disk: 10024
  properties:
    metron_agent:
      deployment: metron_agent.deployment
      zone: z1
    networks:
      apps: cf1
    etcd:
      election_timeout_in_milliseconds: 1000
      heartbeat_interval_in_milliseconds: 50
      log_sync_timeout_in_seconds: 30
  resource_pool: medium_z1
  templates:
  - name: etcd
    release: cf
  - name: etcd_metrics_server
    release: cf
  - name: metron_agent
    release: cf
  update: {}

properties:
  etcd:
    machines:
    - 100.64.1.21
    - 100.64.1.22
    - 100.64.1.23
  etcd_metrics_server:
    nats:
      machines:
      - 100.64.1.11
      - 100.64.1.12
...


I pushed the dora app with 2 instances
(https://github.com/cloudfoundry/cf-acceptance-tests/tree/master/assets/dora).

I can always get a response from it (curl dora.runmyapp.io --> "Hi, I'm
Dora"), so the app itself runs well.

Then I "cf app dora" and got
...
requested state: started
instances: ?/2
...

Then I "cf app dora" again after about 1 minute, and got
...
requested state: started
instances: 2/2
...

The instance count keeps flipping between ?/2 and 2/2 after that.

I also wrote a small script that runs "cf app dora" every second, checks the
instance count, and records it whenever it changes.
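
For reference, here is a minimal sketch of that loop (it assumes the cf CLI
is already logged in and targeted at the right space; the durations in
parentheses in the log below were computed separately from the timestamps):

#!/bin/bash
# Poll "cf app dora" once per second and record the "instances:" line
# whenever the reported count changes.
last=""
while true; do
  current=$(cf app dora | grep '^instances:')
  if [ "$current" != "$last" ]; then
    echo "$(date -u) $current"
    last="$current"
  fi
  sleep 1
done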

Wed Jul 15 06:50:57 UTC 2015 instances: ?/2 (32s)
Wed Jul 15 06:51:29 UTC 2015 instances: 2/2 (6s)
Wed Jul 15 06:51:35 UTC 2015 instances: ?/2 (1m30s)
Wed Jul 15 06:53:05 UTC 2015 instances: 2/2 (17s)
Wed Jul 15 06:53:22 UTC 2015 instances: ?/2 (3m40s)
Wed Jul 15 06:57:02 UTC 2015 instances: 2/2 (21s)
Wed Jul 15 06:57:23 UTC 2015 instances: ?/2 (2m4s)
Wed Jul 15 06:59:27 UTC 2015 instances: 2/2
...


From the above we can see that:
1. the instance count varies between ?/2 and 2/2
2. "?/2" appears more often, and for longer stretches, than "2/2"


The app instance count is always "2/2" when there is only one etcd
instance, so I reckon the problem lies in running multiple etcd instances.
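
One way to check whether the 3-node cluster itself is healthy is to query
each node's raft stats directly (a sketch, assuming the etcd client API
listens on port 4001 and is reachable from inside the network; adjust the
port if your etcd job spec differs):

# Each node reports whether it is currently the leader or a follower,
# plus raft counters. (Port 4001 is an assumption -- check the etcd
# job spec if these requests fail.)
for ip in 100.64.1.21 100.64.1.22 100.64.1.23; do
  echo "== $ip =="
  curl -s http://$ip:4001/v2/stats/self
  echo
done

# On whichever node is leader, /v2/stats/leader shows per-follower
# latency and failure counts, useful for spotting a flapping member.
curl -s http://100.64.1.21:4001/v2/stats/leader

Frequent leader changes or high follower failure counts there would match
hm9000 intermittently losing its view of the actual instance state, which is
what the ?/2 readings suggest.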


Other things I tried, none of which worked:

1. Stopped the etcd service on one etcd VM (monit stop etcd).

2. Restarted the 3 etcd services one by one.

3. Restarted all 3 etcd VMs (terminated the VMs and let them be recreated
automatically).

4. Restarted the two hm9000 VMs.

5. Restarted haproxy (because I don't know whether the "for HA" in the
release notes means haproxy):
http://bosh.io/releases/github.com/cloudfoundry/cf-release?version=210
"Upgrade etcd server to 2.0.1 ... Should be run as 1 node (for small
deployments) or 3 nodes spread across zones (for HA)"

6. Added these properties according to
http://bosh.io/jobs/etcd?source=github.com/cloudfoundry/cf-release&version=210:
election_timeout_in_milliseconds: 1000
heartbeat_interval_in_milliseconds: 50
log_sync_timeout_in_seconds: 30
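
To confirm that item 6 actually took effect, one can inspect the running
etcd process on one of the etcd VMs (the paths follow the standard BOSH job
layout; the exact flag names depend on how the cf-release etcd job renders
these properties):

# On an etcd VM, after bosh deploy:
ps aux | grep [e]tcd          # check the command line for the
                              # election-timeout / heartbeat flags
ls /var/vcap/jobs/etcd/bin/   # rendered start/control scripts
tail /var/vcap/sys/log/etcd/*.log   # recent etcd logs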


Anyway, it doesn't work in the setup you mentioned: "three instances in a
one-zone deployment, with all three instances in the same zone".

Do you have any suggestions, or is there a mistake in my manifest?

Thanks,
Tony


