Announcement: default etcd cluster to TLS in cf-release spiff templates


Amit Kumar Gupta
 

Hi all,

I'd like to change the cf-release manifest generation templates to default
to running etcd in secure TLS mode. It currently supports both TLS and
non-TLS modes of operation. The etcd job will support both modes of
operation for the near future, but I'd like to make the manifest scripts
only support TLS, meaning anyone using those templates will either need to
switch to TLS mode or do their own post-processing of the manifest to
disable TLS.
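For operators wondering what TLS mode looks like in a generated manifest, here is a minimal sketch of the relevant etcd job properties. The property names (require_ssl, peer_require_ssl, and the cert/key fields) are recalled from the etcd release of that era rather than quoted from this thread, so verify them against your release's job spec before relying on them:

```yaml
properties:
  etcd:
    require_ssl: true        # clients must use TLS when talking to etcd
    peer_require_ssl: true   # etcd peers use TLS among themselves
    ca_cert: ...             # fill these in from your generated certs
    server_cert: ...
    server_key: ...
    client_cert: ...
    client_key: ...
```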

Detailed instructions for upgrading a non-TLS cluster to a TLS cluster with
zero downtime are here:
https://docs.google.com/document/d/1ZzWzp3H6H3t1ikk6Fl-8x1LX2a_0dHPJ5MMLEwY0inI/edit.
Note that this should allow for zero app and logging downtime, but minimal
downtime for certain features such as binding a syslog-drain-url service.

Please let me know if you have any feedback about this forthcoming change.

Best,
Amit


Grifalconi, Michael <michael.grifalconi@...>
 

Hello all,

For a couple of days now we have been struggling with the update of CF v240 to v241, together with the TLS upgrade for etcd.

We are following the guide provided, but we keep getting random deployment failures during the step that adds the etcd TLS node and the etcd HTTP proxy job.

Our deployment failed because of hm9000, loggregator_trafficcontroller and doppler. Not all together, but one after the other: first it failed because of hm9000; we hit deploy again and on the third try it worked. The same happened with loggregator.

For doppler, retrying multiple times didn’t help (the error was ‘panic: sync cluster failed’). We solved this by restarting both the single etcd node (with TLS) and the proxy.

We deployed v240 from scratch and performed the upgrade several times, and we never had a ‘clean’ deployment. We always hit issues that were fixed either by a second (or third) try with no changes, or by stopping all etcd VMs, deleting their persistent storage, and restarting them.

This is a no-go for production deployments.

Do you have any ideas on this topic? Are we doing something wrong?


Thanks a lot

Best,
Michael

From: Amit Gupta <agupta(a)pivotal.io>
Reply-To: "Discussions about Cloud Foundry projects and the system overall." <cf-dev(a)lists.cloudfoundry.org>
Date: Thursday 15 September 2016 at 03:34
To: "Discussions about Cloud Foundry projects and the system overall." <cf-dev(a)lists.cloudfoundry.org>
Subject: [cf-dev] Announcement: default etcd cluster to TLS in cf-release spiff templates



Adrian Zankich
 

Hi Michael,

Are you still experiencing upgrade issues? Are you deploying multiple instances of the hm9000, loggregator_trafficcontroller and doppler jobs?

- Adrian


Rich Wohlstadter
 

Hi Michael,

We were hitting the same issue. It turned out that the etcd_proxy (temporarily on etcd_z2) was advertising DNS for cf-etcd.service.cf.internal, which caused the affected services to try to contact the proxy securely, which would fail. What we did was add a step: after you generate the manifest and get ready to deploy the upgrade to v241, edit the manifest and delete the following consul property on your etcd_z2 job before deploying:

consul:
  agent:
    services:
      etcd:
        name: cf-etcd

That solved the issue. Once everything is talking to the secure standalone etcd and you scale back up, the generation scripts will add it back in and you're good to go. Hope this helps.
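To make it concrete where this stanza lives, here is a sketch of the surrounding manifest structure. The job name comes from the thread; the rest of the layout is assumed from spiff-generated CF manifests of that era and may differ in yours:

```yaml
jobs:
- name: etcd_z2
  properties:
    consul:
      agent:
        services:
          etcd:              # delete this whole 'etcd' service entry so consul
            name: cf-etcd    # stops advertising cf-etcd.service.cf.internal
```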

-Rich


Grifalconi, Michael <michael.grifalconi@...>
 

Hello,

We had the same issue described above on almost every upgrade attempt. The best way to summarize our deployments is ‘it works in the end, but never on the first try’.

Yes, we have 2 hm9000, 2 loggregator_trafficcontroller and 2 doppler jobs.

Thanks,
Michael.

On 26/09/16 23:57, "Adrian Zankich" <azankich(a)pivotal.io> wrote:



Adrian Zankich
 

Hi Michael,

Have you tried Rich's suggestion: https://lists.cloudfoundry.org/archives/list/cf-dev(a)lists.cloudfoundry.org/message/MPUKXCBNG7H642ITVYKRMQ5ZQ6YLJKDU/

- Adrian


Grifalconi, Michael <michael.grifalconi@...>
 

Hello,

Sorry for the long delay, but I had to wait to test your suggestion... and it worked!
We still had some issues that required restarting some VMs, but nothing compared to the issues we faced before.

I strongly suggest including this step in the update guide you provided.

Thanks

Regards,
Michael

On 27/09/16 15:22, "Rich Wohlstadter" <lethwin(a)gmail.com> wrote:



Amit Kumar Gupta
 

Thanks all,

I've updated the document to correct for the omission pointed out by Rich.

Best,
Amit

On Fri, Oct 7, 2016 at 7:37 AM, Grifalconi, Michael <michael.grifalconi(a)sap.com> wrote:
