Announcement: default etcd cluster to TLS in cf-release spiff templates
Amit Kumar Gupta
Hi all,
I'd like to change the cf-release manifest generation templates to default
to running etcd in secure TLS mode. It currently supports both TLS and
non-TLS modes of operation. The etcd job will support both modes of
operation for the near future, but I'd like to make the manifest scripts
only support TLS, meaning anyone using those templates will either need to
switch to TLS mode or do their own post-processing of the manifest to
disable TLS.
Detailed instructions for upgrading a non-TLS cluster to a TLS cluster with
zero downtime are here:
https://docs.google.com/document/d/1ZzWzp3H6H3t1ikk6Fl-8x1LX2a_0dHPJ5MMLEwY0inI/edit.
Note that this should allow for zero app and logging downtime, but minimal
downtime for certain features such as binding a syslog-drain-url service.
Please let me know if you have any feedback about this forthcoming change.
Best,
Amit
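[Editor's note: switching a deployment to TLS mode amounts to enabling the etcd release's SSL properties and supplying certificates. A rough sketch of the relevant manifest properties follows; the exact property names and layout are assumptions based on the etcd BOSH release of that era, so check the release's job spec for your version before relying on them.]

```yaml
properties:
  etcd:
    require_ssl: true         # clients must connect over TLS
    peer_require_ssl: true    # cluster peer traffic must also use TLS
    ca_cert: ...              # PEM-encoded certs/keys, supplied via stubs
    server_cert: ...
    server_key: ...
    client_cert: ...
    client_key: ...
```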
Grifalconi, Michael <michael.grifalconi@...>
Hello all,
For a couple of days now we have been struggling with the update of CF v240 to v241 together with the TLS upgrade for etcd.
We are following the guide provided, but we always get random deployment failures during the step that adds the etcd TLS node and the etcd HTTP proxy job.
Our deployment failed because of hm9000, loggregator_trafficcontroller and doppler. Not all at once, but one after the other: first it failed because of hm9000, we hit deploy again, and on the third try it worked; the same for loggregator.
For doppler, trying multiple times didn’t help (the error was ‘panic: sync cluster failed’). We solved this by restarting both the single etcd node (with TLS) and the proxy.
We deployed v240 from scratch and did the upgrade several times, and we never had a ‘clean’ deployment; we always hit issues that were fixed either by a second (or third) try with no changes, or by stopping all etcd VMs, deleting the persistent storage and restarting them.
This is a no-go for production deployments.
Do you have any ideas on this topic? Are we doing something wrong?
Thanks a lot
Best,
Michael
From: Amit Gupta <agupta(a)pivotal.io>
Reply-To: "Discussions about Cloud Foundry projects and the system overall." <cf-dev(a)lists.cloudfoundry.org>
Date: Thursday 15 September 2016 at 03:34
To: "Discussions about Cloud Foundry projects and the system overall." <cf-dev(a)lists.cloudfoundry.org>
Subject: [cf-dev] Announcement: default etcd cluster to TLS in cf-release spiff templates
Hi all,
I'd like to change the cf-release manifest generation templates to default to running etcd in secure TLS mode. It currently supports both TLS and non-TLS modes of operation. The etcd job will support both modes of operation for the near future, but I'd like to make the manifest scripts only support TLS, meaning anyone using those templates will either need to switch to TLS mode or do their own post-processing of the manifest to disable TLS.
Detailed instructions for upgrading a non-TLS cluster to a TLS cluster with zero downtime are here: https://docs.google.com/document/d/1ZzWzp3H6H3t1ikk6Fl-8x1LX2a_0dHPJ5MMLEwY0inI/edit. Note that this should allow for zero app and logging downtime, but minimal downtime for certain features such as binding a syslog-drain-url service.
Please let me know if you have any feedback about this forthcoming change.
Best,
Amit
Rich Wohlstadter
Hi Michael,
We were hitting the same issue. It turned out that the etcd_proxy (temporarily on etcd_z2) was advertising DNS for cf-etcd.service.cf.internal, which caused some of the services below to try to contact the proxy securely, which would fail. We added a step: after you generate the manifest and are ready to deploy the upgrade to v241, edit the manifest and delete the following consul property from your etcd_z2 job before deploying:
consul:
  agent:
    services:
      etcd:
        name: cf-etcd
That solved the issue. Once everything is talking to the secure standalone etcd and you scale back up, the generation scripts will add the property back in and you're good to go. Hope this helps.
-Rich
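[Editor's note: the manual deletion step above could also be scripted. The sketch below assumes the deployment manifest has already been parsed into a Python dict (in practice via a YAML library); the job name `etcd_z2` and the property path follow Rich's snippet.]

```python
def remove_etcd_consul_service(manifest, job_name="etcd_z2"):
    """Delete properties.consul.agent.services.etcd from the named job,
    so the proxy stops advertising cf-etcd.service.cf.internal via consul."""
    for job in manifest.get("jobs", []):
        if job.get("name") != job_name:
            continue
        services = (job.get("properties", {})
                       .get("consul", {})
                       .get("agent", {})
                       .get("services", {}))
        # Remove the etcd service registration if present; leave other
        # consul services on the job untouched.
        services.pop("etcd", None)
    return manifest

# In-memory stand-in for the parsed manifest:
manifest = {
    "jobs": [
        {"name": "etcd_z2",
         "properties": {"consul": {"agent": {"services": {
             "etcd": {"name": "cf-etcd"}}}}}},
        {"name": "doppler", "properties": {}},
    ]
}
remove_etcd_consul_service(manifest)
```

After deploying with the edited manifest, the next run of the generation scripts restores the property, matching Rich's scale-back-up step.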
Grifalconi, Michael <michael.grifalconi@...>
Hello,
We had the issue described for almost every upgrade attempt. The best way to summarize our deployments: ‘it works in the end, but never on the first try’.
Yes, we have 2 hm9000, 2 loggregator_trafficcontroller and 2 doppler jobs.
Thanks,
Michael.
On 26/09/16 23:57, "Adrian Zankich" <azankich(a)pivotal.io> wrote:
Hi Michael,
Are you still experiencing upgrade issues? Are you deploying multiple instances of the hm9000, loggregator_trafficcontroller and doppler jobs?
- Adrian
Adrian Zankich
Hi Michael,
Have you tried Rich's suggestion: https://lists.cloudfoundry.org/archives/list/cf-dev(a)lists.cloudfoundry.org/message/MPUKXCBNG7H642ITVYKRMQ5ZQ6YLJKDU/
- Adrian
Grifalconi, Michael <michael.grifalconi@...>
Hello,
Sorry for the long delay, but I had to wait to test your suggestion, and it worked!
We still had some issues that required restarting a few VMs, but nothing compared to the issues we faced before.
I strongly suggest including this step in the upgrade guide you provided.
Thanks
Regards,
Michael
On 27/09/16 15:22, "Rich Wohlstadter" <lethwin(a)gmail.com> wrote:
Hi Michael,
We were hitting the same issue. It turned out that the etcd_proxy (temporarily on etcd_z2) was advertising DNS for cf-etcd.service.cf.internal, which caused some of the services below to try to contact the proxy securely, which would fail. We added a step: after you generate the manifest and are ready to deploy the upgrade to v241, edit the manifest and delete the following consul property from your etcd_z2 job before deploying:
consul:
  agent:
    services:
      etcd:
        name: cf-etcd
That solved the issue. Once everything is talking to the secure standalone etcd and you scale back up, the generation scripts will add the property back in and you're good to go. Hope this helps.
-Rich
Amit Kumar Gupta
Thanks all,
I've updated the document to correct the omission pointed out by Rich.
Best,
Amit