Re: 3 etcd nodes don't work well in single zone


James Bayer
 

The CF project does not support or test cf-release with CentOS 6.5, only Ubuntu
14.04.

etcd nodes should not necessarily be aware of which AZs they are in. The
only difference might be in the BOSH manifest: if the nodes are in different
zones they likely have different job names, and you'd need to ensure that,
despite the different job names, they are configured to find each
other correctly.
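
For example, regardless of job names, one way to confirm the nodes actually
found each other is to ask each one for its member list. A rough sketch
(assuming the etcd client API is on its default port 4001, and using the
static IPs from the manifest below):

# every node should return the same three-member list
for ip in 100.64.1.21 100.64.1.22 100.64.1.23; do
  echo "== $ip =="
  curl -s http://$ip:4001/v2/members
  echo
done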

On Thu, Jul 16, 2015 at 6:51 PM, Tony <Tonyl(a)fast.au.fujitsu.com> wrote:

Hi Amit,

Let me explain the error I got in detail.

My env info:
CentOS 6.5,
Openstack Icehouse,
Single-AZ
2 hm9000 instances,
3 etcd instances,


Manifest:
- name: etcd_z1
  instances: 3
  networks:
  - name: cf1
    static_ips:
    - 100.64.1.21
    - 100.64.1.22
    - 100.64.1.23
  persistent_disk: 10024
  properties:
    metron_agent:
      deployment: metron_agent.deployment
      zone: z1
    networks:
      apps: cf1
    etcd:
      election_timeout_in_milliseconds: 1000
      heartbeat_interval_in_milliseconds: 50
      log_sync_timeout_in_seconds: 30
  resource_pool: medium_z1
  templates:
  - name: etcd
    release: cf
  - name: etcd_metrics_server
    release: cf
  - name: metron_agent
    release: cf
  update: {}

properties:
  etcd:
    machines:
    - 100.64.1.21
    - 100.64.1.22
    - 100.64.1.23
  etcd_metrics_server:
    nats:
      machines:
      - 100.64.1.11
      - 100.64.1.12
...


I ran cf push for the dora app with 2 instances
(https://github.com/cloudfoundry/cf-acceptance-tests/tree/master/assets/dora)
and I can always get a response from it (curl dora.runmyapp.io --> "Hi, I'm Dora").
The app runs well.

Then I "cf app dora" and got
...
requested state: started
instances: ?/2
...

Then I "cf app dora" again after about 1 minute, and got
...
requested state: started
instances: 2/2
...

The number of instances keeps varying between ?/2 and 2/2 after that.

I also wrote a small script that runs "cf app dora" every second and checks
the number of instances; if the number changes, it records it.
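
Something along these lines (a rough sketch of the idea, not the exact
script; the elapsed-time suffix in the log below is computed separately):

#!/bin/bash
# poll "cf app dora" once a second and log whenever the instances count changes
last=""
while true; do
  current=$(cf app dora | grep '^instances:')
  if [ "$current" != "$last" ]; then
    echo "$(date -u) $current"
    last="$current"
  fi
  sleep 1
done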

Wed Jul 15 06:50:57 UTC 2015 instances: ?/2 (32s)

Wed Jul 15 06:51:29 UTC 2015 instances: 2/2 (6s)

Wed Jul 15 06:51:35 UTC 2015 instances: ?/2 (1m30s)

Wed Jul 15 06:53:05 UTC 2015 instances: 2/2 (17s)

Wed Jul 15 06:53:22 UTC 2015 instances: ?/2 (3m40s)

Wed Jul 15 06:57:02 UTC 2015 instances: 2/2 (21s)

Wed Jul 15 06:57:23 UTC 2015 instances: ?/2 (2m4s)

Wed Jul 15 06:59:27 UTC 2015 instances: 2/2
...


From the above we can see that:
1. the instance count varies between ?/2 and 2/2
2. "?/2" shows up more often than "2/2"


The app instance count is always "2/2" when there is only one etcd instance,
so I reckon the problem lies with running multiple etcd instances.
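
One way this could be checked (a sketch, assuming the etcd client API is on
its default port 4001) is to poll each node's own stats and watch whether the
leader keeps changing:

# the "state" field shows StateLeader or StateFollower
while true; do
  for ip in 100.64.1.21 100.64.1.22 100.64.1.23; do
    echo -n "$(date -u) $ip: "
    curl -s http://$ip:4001/v2/stats/self | grep -o '"state":"[^"]*"'
  done
  sleep 5
done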


Other things I tried, none of which worked:

1. Stop the etcd service on one etcd VM (monit stop etcd).

2. Restart the 3 etcd services one by one.

3. Restart all 3 etcd VMs (terminate the VMs and let them be recreated
automatically).

4. Restart the two hm9000 VMs.

5. Restart haproxy (because I don't know whether the "for HA" below means
haproxy):
http://bosh.io/releases/github.com/cloudfoundry/cf-release?version=210
"Upgrade etcd server to 2.0.1: Should be run as 1 node (for small deployments)
or 3 nodes spread across zones (for HA)"

6. Add these properties according to

http://bosh.io/jobs/etcd?source=github.com/cloudfoundry/cf-release&version=210
election_timeout_in_milliseconds: 1000
heartbeat_interval_in_milliseconds: 50
log_sync_timeout_in_seconds: 30
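
I could also look at the etcd job logs on each VM to see whether the cluster
keeps losing its leader; a sketch, assuming the standard BOSH log location
/var/vcap/sys/log/etcd:

# on each etcd VM (e.g. "bosh ssh etcd_z1 0"), look for election/heartbeat churn
grep -iE 'leader|election|timed out' /var/vcap/sys/log/etcd/*.log | tail -n 50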


Anyway, it doesn't work in the case you mentioned: "three instances in a
one-zone deployment, with all three instances in the same zone".

Do you have any suggestion about it? Or is there any mistake in my
manifest?

Thanks,
Tony



--
Thank you,

James Bayer
