Re: 3 etcd nodes don't work well in single zone


Tony
 

Hi Amit,

Here are the latest logs I got from etcd and hm9k, collected immediately after the test finished (I used scp instead of bosh logs to avoid missing anything).

Note that the zip file also contains a test folder:

test-etcd.sh is a simple script I use. It runs cf app dora every second and records the responses in status.log. Whenever the instance count changes, it also records the new value in variation.log.
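
For readers without the zip file, the loop described above could look roughly like the following Python sketch. It is not the actual test-etcd.sh: the function names (extract_instances, detect_change, poll) are made up for illustration, and it assumes the cf app output contains a line of the form "instances: 2/2".

```python
import re
import subprocess
import time
from datetime import datetime, timezone

# Assumption: `cf app <name>` prints a line such as "instances: 2/2" or "instances: ?/2".
INSTANCES_RE = re.compile(r"^\s*instances:\s*(\S+)", re.MULTILINE)


def extract_instances(cf_output):
    """Return the value after 'instances:' (e.g. '2/2' or '?/2'), or None if absent."""
    match = INSTANCES_RE.search(cf_output)
    return match.group(1) if match else None


def detect_change(previous, current):
    """True when a reading was obtained and it differs from the last recorded one."""
    return current is not None and current != previous


def poll(app="dora", interval=1.0):
    """Hypothetical polling loop: append every response to status.log and
    every change of the instance count to variation.log. Runs until interrupted."""
    previous = None
    while True:
        stamp = datetime.now(timezone.utc).strftime("%a %b %d %H:%M:%S UTC %Y")
        result = subprocess.run(["cf", "app", app], capture_output=True, text=True)
        with open("status.log", "a") as status:
            status.write(f"{stamp}\n{result.stdout}\n")
        current = extract_instances(result.stdout)
        if detect_change(previous, current):
            with open("variation.log", "a") as variation:
                variation.write(f"{stamp}\ninstances: {current}\n")
            previous = current
        time.sleep(interval)
```

Calling poll() starts the loop. The first reading is always recorded, because it differs from the initial None.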

In variation.log, you can see the instance count switch between 2/2 and ?/2 eight times within about 10 minutes.

Thu Jul 23 08:39:29 UTC 2015
instances: 2/2
Thu Jul 23 08:42:59 UTC 2015
instances: ?/2
Thu Jul 23 08:43:36 UTC 2015
instances: 2/2
Thu Jul 23 08:44:55 UTC 2015
instances: ?/2
Thu Jul 23 08:45:32 UTC 2015
instances: 2/2
Thu Jul 23 08:48:31 UTC 2015
instances: ?/2
Thu Jul 23 08:49:02 UTC 2015
instances: 2/2
Thu Jul 23 08:50:05 UTC 2015
instances: ?/2
Thu Jul 23 08:50:41 UTC 2015
instances: 2/2


The test started at "Thu Jul 23 08:39:29 UTC 2015", which is around "timestamp":1437640773, so I deleted most of the content before 143763… to keep the logs readable.

I didn't delete any log entries after 1437640773. If the last line of a file (e.g. hm9000_sender.log) is earlier than 1437640773, that just means nothing has been logged since then.
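
To line up the wall-clock times in variation.log with the epoch "timestamp" fields in the component logs, a small conversion helper is enough. This is only a sketch: the helper names are illustrative, and it assumes the log timestamps are whole seconds since the Unix epoch in UTC.

```python
from datetime import datetime, timezone


def epoch_to_utc(ts):
    """Convert epoch seconds (as in the "timestamp" log field) to an aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def utc_to_epoch(text):
    """Convert a line such as 'Thu Jul 23 08:39:29 UTC 2015' to epoch seconds."""
    parsed = datetime.strptime(text, "%a %b %d %H:%M:%S UTC %Y")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())
```

With these, utc_to_epoch("Thu Jul 23 08:39:29 UTC 2015") gives 1437640769, and epoch_to_utc(1437640773) is 08:39:33 UTC on the same day, a few seconds after the test start.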


I also found that no errors are recorded in the etcd log at the moments when the instance count changes.

So the problem seems to be in hm, though I'm not sure.

Regards,
Tony

From: Amit Gupta [via CF Dev] [mailto:ml-node+s70369n810h86(a)n6.nabble.com]
Sent: Wednesday, 22 July 2015 10:09 AM
To: Li, Tony
Subject: Re: [cf-dev] 3 etcd nodes don't work well in single zone

Hi Tony,

The logs you've retrieved only go back to Jul 21, which I can't correlate with the "?/2" issues you were seeing. If you could possibly record again a bunch of occurrences of flapping between "2/2" and "?/2" for an app (along with datetime stamps), and then immediately get logs from *all* the HM and etcd nodes (`bosh logs` only gets logs from one node at a time), I can try to dig in more. It's important to get the logs from the HM and etcd VMs soon after recording the "?/2" events, otherwise BOSH may rotate/archive the logs and then make them harder to obtain.

Best,
Amit

On Tue, Jul 21, 2015 at 4:53 PM, Amit Gupta <[hidden email]> wrote:
You should definitely not run etcd with 2 instances. You can read more about
recommended cluster sizes in the etcd docs:

https://github.com/coreos/etcd/blob/740187f199a12652ca1b7bddb7b3489160103d84/Documentation/admin_guide.md#fault-tolerance-table

I will look at the attached logs and get back to you, but wanted to make
sure to advise you to run either 1 or 3 nodes. With 2, you can wedge the
system, because it will need all nodes to be up to achieve quorum. If you
roll one of the two nodes, it will not be able to rejoin the cluster, and
the service will be stuck in an unavailable state.
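
The arithmetic behind that advice can be sketched in a few lines of Python (the function names are illustrative, not part of etcd). A cluster of n members needs a majority of them, floor(n/2) + 1, to be up in order to make progress.

```python
def quorum(cluster_size):
    """Smallest majority of a cluster with the given number of members."""
    return cluster_size // 2 + 1


def failure_tolerance(cluster_size):
    """How many members can be down while a majority is still reachable."""
    return cluster_size - quorum(cluster_size)
```

For 2 nodes this gives a quorum of 2 and a tolerance of 0, which is no better than a single node. 3 nodes give a quorum of 2 and tolerate 1 failure.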



-----
Amit, CF OSS Release Integration PM
Pivotal Software, Inc.
--
View this message in context: http://cf-dev.70369.x6.nabble.com/cf-dev-3-etcd-nodes-don-t-work-well-in-single-zone-tp746p809.html
Sent from the CF Dev mailing list archive at Nabble.com.
_______________________________________________
cf-dev mailing list
[hidden email]
https://lists.cloudfoundry.org/mailman/listinfo/cf-dev





logs.zip (103K) <http://cf-dev.70369.x6.nabble.com/attachment/847/0/logs.zip>




--
View this message in context: http://cf-dev.70369.x6.nabble.com/cf-dev-3-etcd-nodes-don-t-work-well-in-single-zone-tp746p847.html
Sent from the CF Dev mailing list archive at Nabble.com.
