Update Parallelization in Cloud Foundry


Omar Elazhary <omazhary@...>
 

Hello everyone,

I know it is possible to update and redeploy components in parallel in Cloud Foundry by setting the "serial" property in the deployment manifest to "false". However, is this recommended? Are there particular job dependencies that I need to pay attention to?

Regards,
Omar


Amit Kumar Gupta
 

Hey Omar,

You can set the "serial" property at the global level of a deployment (think
of it as a default for all jobs) and then override it at the individual job
level. You will want the consul server jobs to be deployed first, with
serial: true and max_in_flight: 1. The important point is that if you have
more than one server in your consul cluster, the servers need to come up one
at a time so that cluster orchestration goes smoothly. The same is true if
your etcd cluster has more than one server. If you're using the postgres job
for CCDB and/or UAADB (instead of an external database), you will want the
postgres job to come up before CC and/or UAA. Similarly, if you're using the
provided blobstore job instead of an external blobstore, you'll want it up
before CC comes up.
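
As a rough illustration of the default-and-override behaviour described above,
here is a minimal Python sketch. It is only a model of the idea, not BOSH's
implementation: the function name effective_update, the job names, and the
global max_in_flight value of 4 are all assumptions made for the example. Only
the keys serial and max_in_flight and the consul values come from this message.

```python
def effective_update(global_update, job_update=None):
    """Return the update settings a job ends up with.

    Job-level keys win over the deployment-wide defaults; keys the job
    does not set fall back to the global value.
    """
    merged = dict(global_update)
    merged.update(job_update or {})
    return merged


# Hypothetical deployment-wide defaults: update jobs in parallel.
GLOBAL_UPDATE = {"serial": False, "max_in_flight": 4}

# Hypothetical job list: consul overrides the defaults, cloud_controller does not.
JOBS = [
    {"name": "consul", "update": {"serial": True, "max_in_flight": 1}},
    {"name": "cloud_controller"},
]

for job in JOBS:
    print(job["name"], effective_update(GLOBAL_UPDATE, job.get("update")))
```

In this model the consul job ends up with serial: true and max_in_flight: 1,
while a job with no override simply inherits the global settings.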

You might be able to get away with parallelizing some of the things above.
For example, if you bring CC and the blobstore up at the same time, CC might
fail to start until the blobstore comes up and then start successfully.
Monit also generally keeps retrying even after BOSH gives up, so your deploy
might fail, but later on you might see everything up and running.

Cheers,
Amit


Marco Voelz
 

Does NATS also need to come up before any of the other components?


Amit Kumar Gupta
 

You can probably try to start everything in parallel and either set very
long update timeouts or allow the deployment to fail with the expectation
that it will eventually correct itself. Alternatively, you can start things
in a strict order, which constrains the possible failure scenarios more
tightly and makes it easier to debug the root cause of a failure.
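
The difference between the two parallel options (a long update timeout versus
letting the deploy fail and recover on its own) can be pictured with a toy
Python model. This is purely illustrative: the function simulate_start, the
retry interval, and the timings are invented for the example and do not
correspond to real BOSH or Monit settings.

```python
def simulate_start(dependency_up_at, deploy_timeout, retry_interval=10):
    """Toy model: a process retries every `retry_interval` seconds and
    starts on the first retry at or after `dependency_up_at`.

    Returns (deploy_succeeded, started_at), where the deploy counts as
    successful only if the process started within `deploy_timeout`.
    """
    started_at = 0
    while started_at < dependency_up_at:
        started_at += retry_interval
    return started_at <= deploy_timeout, started_at


# Short timeout: the deploy is reported as failed, yet the process still starts later.
print(simulate_start(dependency_up_at=45, deploy_timeout=30))
# Long timeout: the same start time now falls inside the deploy window.
print(simulate_start(dependency_up_at=45, deploy_timeout=120))
```

In both runs the process comes up at the same moment; only whether the deploy
is reported as successful changes.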

Certain things do depend on NATS, and thus won't work until NATS is up.
The main thing I can currently think of is registering routes with
gorouter, which is done both for apps and for system components (e.g. the
route-registrar registers api.SYSTEM_DOMAIN on behalf of the CC).

Best,
Amit


Marco Voelz
 

Thanks for clarifying this for me, Amit.

Warm regards
Marco


Dieu Cao <dcao@...>
 

It should also be considered that the recommended serial deployment order
is usually the most tested one when it comes to ensuring backwards
compatibility of code changes during a deployment.

For example, a new endpoint might be added to the Cloud Controller for use
by DEAs/Cells. Because of the serial deployment order, it is assumed that
all Cloud Controllers will have finished updating, and thus that the new
endpoint is available, before the DEAs/Cells update. The DEA/Cell code can
then simply switch over to the new endpoint as it updates, and there is no
need to keep the code that used the older endpoint.

-Dieu
CF Runtime PMC Lead


Omar Elazhary <omazhary@...>
 

Thanks everyone. What I understood from Amit's response is that I can parallelize certain components. What I also understood from both Amit's and Dieu's responses is that some components have hard dependencies, others only have soft ones, and some have no dependencies at all. My question is: how can I figure out these dependencies? Are they listed somewhere? The Cloud Foundry docs do a great job of describing each component separately, but they do not explain which should be up before which. That is what I need to work out an execution plan that minimizes update time while keeping CF 100% available.

Thanks.

Regards,
Omar


Amit Kumar Gupta
 

If by "hard dependency" you mean something that has to be up strictly
before another thing for a deploy to possibly succeed, I'm not sure if
there are any such hard dependencies. PCFDev (formerly MicroPCF) brings up
all the components simultaneously on a single VM [1]. Some processes will
flap until other ones are up, but they eventually do all come up.
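
Even without hard dependencies, the soft orderings mentioned earlier in this
thread could be grouped into stages whose members might be updated in
parallel. The Python sketch below is one hypothetical way to do that with the
standard-library graphlib module (Python 3.9+). The component labels are
shorthand rather than exact job names, and the graph only encodes what the
thread states (consul first, postgres before CC and UAA, blobstore before CC,
NATS before route registration); it is not an official dependency list.

```python
from graphlib import TopologicalSorter

# Each key should come up after every component in its value set.
# Illustrative labels only; the edges are the soft orderings from this thread.
SOFT_DEPS = {
    "postgres": {"consul"},
    "blobstore": {"consul"},
    "nats": {"consul"},
    "uaa": {"postgres"},
    "cloud_controller": {"postgres", "blobstore"},
    "route_registrar": {"nats"},
}


def stages(deps):
    """Group components into stages; everything within one stage has all
    of its listed predecessors in earlier stages."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


for number, stage in enumerate(stages(SOFT_DEPS), start=1):
    print(number, stage)
```

With this graph, consul forms the first stage, the data and messaging
components the second, and CC, UAA, and route registration the third. Whether
such a plan keeps the platform fully available depends on the questions raised
in the rest of this message.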

There probably isn't a single solution to minimizing update time while
guaranteeing 100% uptime, as the answer will depend on a lot of different
things. Are you running DEA and/or Diego? External database and/or
external blobstore? Are you just talking about uptime of apps, or also of
the platform API? What about services as well?

If you find a colocation/update strategy that works for you, I think the
community would really appreciate hearing about it.

(Just for fun, there's also nanocf [2], which is a Docker image with all of
CF in it, and a bunch of videos where I run nanocf in nanocf in BOSH-Lite
CF [3].)

[1] https://github.com/pivotal-cf/micropcf
[2] https://github.com/sclevine/nanocf
[3] https://www.youtube.com/watch?v=oMUGjaWg_Hk&list=PLdgSOpBLY_uFbzo1f1prmjW0hf4z1rWdm

Cheers,
Amit
