Re: Announcement: default etcd cluster to TLS in cf-release spiff templates
Grifalconi, Michael <michael.grifalconi@...>
We have been struggling for a couple of days with the upgrade from CF v240 to v241 together with the TLS upgrade for etcd.
We are following the guide provided but we always get random deployment failures during the step about adding the etcd TLS node and the etcd http proxy job.
Our deployment failed because of hm9000, loggregator_trafficcontroller and doppler. Not all at once, but one after the other: first it failed because of hm9000; we hit deploy again, and on the third attempt it worked. The same happened for loggregator.
For doppler it didn’t help to try multiple times (error was ‘panic: sync cluster failed’). We solved this by restarting both the single etcd (with TLS) and the proxy.
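One way to tell whether the etcd node or the proxy is the unhealthy party before restarting anything is to query etcd's `/health` endpoint over TLS. The following is only a minimal sketch under assumptions: the URL, port and certificate paths in the usage note are placeholders that are not taken from this thread, and `is_healthy` and `check_member` are hypothetical helper names.

```python
import json
import ssl
import urllib.request


def is_healthy(body):
    """Return True if an etcd /health response body reports a healthy member."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    if not isinstance(payload, dict):
        return False
    flag = payload.get("health")
    # etcd reports the flag as the string "true"; a JSON boolean is accepted too.
    return flag is True or flag == "true"


def check_member(url, ca_cert, client_cert, client_key, timeout=5.0):
    """Query <url>/health with a client certificate; any failure counts as unhealthy."""
    # Trust only the given CA and present the client certificate to etcd.
    ctx = ssl.create_default_context(cafile=ca_cert)
    ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    try:
        with urllib.request.urlopen(url + "/health", context=ctx, timeout=timeout) as resp:
            return is_healthy(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # Covers connection errors, TLS handshake failures and malformed URLs.
        return False
```

Usage would look something like `check_member("https://ETCD_HOST:4001", "ca.crt", "client.crt", "client.key")`, with the host, port and file names replaced by the values from your own manifest.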
We deployed v240 from scratch and ran the upgrade several times, and we never had a ‘clean’ deployment. We always hit a lot of issues that were fixed either by a second (or third) try with no changes, or by stopping all etcd VMs, deleting the persistent storage and restarting them.
This is a no-go for production deployments.
Do you have any ideas on this topic? Are we doing something wrong?
Thanks a lot
From: Amit Gupta <agupta(a)pivotal.io>
Reply-To: "Discussions about Cloud Foundry projects and the system overall." <cf-dev(a)lists.cloudfoundry.org>
Date: Thursday 15 September 2016 at 03:34
To: "Discussions about Cloud Foundry projects and the system overall." <cf-dev(a)lists.cloudfoundry.org>
Subject: [cf-dev] Announcement: default etcd cluster to TLS in cf-release spiff templates
I'd like to change the cf-release manifest generation templates to default to running etcd in secure TLS mode. It currently supports both TLS and non-TLS modes of operation. The etcd job will support both modes of operation for the near future, but I'd like to make the manifest scripts only support TLS, meaning anyone using those templates will either need to switch to TLS mode or do their own post-processing of the manifest to disable TLS.
Detailed instructions for upgrading a non-TLS cluster to a TLS cluster with zero downtime are here: https://docs.google.com/document/d/1ZzWzp3H6H3t1ikk6Fl-8x1LX2a_0dHPJ5MMLEwY0inI/edit. Note that this should allow for zero app and logging downtime, but minimal downtime for certain features such as binding a syslog-drain-url service.
Please let me know if you have any feedback about this forthcoming change.