Asynchronous route-mapping and blue-green deployment


Anderer, Thomas
 

Hello everyone,

We already discussed this topic internally, and since we could not quite reach a solution, I'd like to ask on this list.

We're using a self-made blue-green deployment script which pretty much does what's explained on cloudfoundry.org (https://docs.cloudfoundry.org/devguide/deploy-apps/blue-green.html), and additionally checks the application with an HTTP GET after push, map-route and unmap-route. This script has been working fine on instance A of CF (DEA-based and relatively small), but has issues on instance B (Diego-based, larger). The aforementioned HTTP GET sometimes fails with "404 Not Found: Requested route does not exist" directly after push or map-route. We never experienced this issue on instance A. This could of course mean that instance A is small enough that all router operations are handled faster than our script is able to react. On instance B it sometimes takes up to 1 second or even longer until the route mapping is finally completed.

In our internal discussion with our CF operators, we found several parts of the documentation which at least hint at different, maybe even inconsistent, inner workings of the route mapping:
1) https://docs.cloudfoundry.org/devguide/deploy-apps/blue-green.html: Step 3: Map-route - The CF Router immediately begins to load balance traffic for demo-time.example.com between Blue and Green.
- In my opinion this implies or at least strongly hints that the map-route call was supposed to be synchronous.
2) PUT to "/v2/apps/#{app_guid}/routes/#{route_guid}" returns 201 CREATED and not 202 ACCEPTED, which also implies that it is a synchronous operation. It also does not return an event or operation which could be pinged to wait for completion of route mapping.
3) Operations which show the state of the route, for example 'cf routes', already show the route as successfully mapped, although 404 is still returned.
4) See https://docs.cloudfoundry.org/devguide/deploy-apps/routes-domains.html#map-route: Applications running on the DEA architecture must be restarted after routes for an app are mapped or unmapped. Applications running on Diego do not need to be restarted.
- This is contrary to what I experienced, since route mapping on our instance A always worked synchronously and without restart of the application. Furthermore, it contradicts what's explained in 1) at least for DEA.
5) https://github.com/cloudfoundry/diego-design-notes#routing-translation-components:
Routing Translation Components: Route-Emitter
a) monitors DesiredLRP state and ActualLRP state via the BBS. When a change is detected, the Route-Emitter emits route registration and unregistration messages to the gorouter via the NATS message bus,
b) periodically emits the entire routing table to the router,
c) maintains a lock in consul to ensure only one route-emitter handles route registration at a time.
- This hints that route mapping is an asynchronous task.

Some of the issues could be fixed by repeatedly pinging the application in order to wait until the route has been mapped. But what about blue-green deployment, where I map the application to a route on which there is already a running application? Here, I cannot find out in a general way whether the route mapping is completed and when I can start unmapping the route from the old application. If route mapping is indeed an asynchronous task, in my opinion the description on https://docs.cloudfoundry.org/devguide/deploy-apps/blue-green.html is misleading at best. And all blue-green deployment scripts which I've seen so far have this issue.
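The polling workaround mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: the function names, the attempt count and the interval are hypothetical and not taken from any script discussed in this thread. It also only helps when nothing else is mapped to the route yet, because a 404 from the router cannot be told apart from a 404 returned by the application itself.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET to url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is still useful.
        return err.code
    except OSError:
        # Connection failures and timeouts (URLError is a subclass of OSError).
        return None


def wait_for_route(url, fetch_status=http_status, attempts=30, interval=0.5,
                   sleep=time.sleep):
    """Poll url until it answers with something other than 404.

    Returns True once a non-404 response is seen, False if every attempt
    returned 404 or got no response. attempts and interval are assumed
    values, not recommendations.
    """
    for attempt in range(attempts):
        status = fetch_status(url)
        if status is not None and status != 404:
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

A script would call something like `wait_for_route("https://demo-time.example.com/")` after push or map-route and abort the deployment if it returns False.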

So, is the route mapping really supposed to be asynchronous? If so, is there any general way to find out when the route-mapping process has finished?

Thank you for your help and clarification,

Best regards,
--
Thomas Anderer
Agile Software Engineer
andrena objects ag
currently working at SAP


Chip Childers <cchilders@...>
 

+Shannon Coen <scoen(a)pivotal.io> +Nicholas Calugar <ncalugar(a)pivotal.io>

Shannon / Nicholas - between Routing and CAPI, can one of you help Thomas?

--
Chip Childers
CTO, Cloud Foundry Foundation
1.267.250.0815


Gerhard Lazu <glazu@...>
 

$ cf repo-plugins

Getting plugins from all repositories ...

Repository: CF-Community
name                version   description
...
blue-green-deploy   1.1.0     Zero downtime deploys with smoke test support


Shannon Coen
 

Hello Thomas,

Comments inline.

On Thu, Feb 2, 2017 at 10:40 PM, Anderer, Thomas <thomas.anderer(a)sap.com>
wrote:

> Hello everyone,
>
> we already had an internal discussion on the topic and since we could not
> quite come to a solution, I'd like to ask in this list.
>
> We're using a self-made blue-green-deployment-script which pretty much
> does what's explained on cloudfoundry.org
> (https://docs.cloudfoundry.org/devguide/deploy-apps/blue-green.html), and
> additionally checks (HTTP-GET) the application after push, map-route and
> unmap-route. This script has been working fine on an instance A of CF
> (DEA-based and relatively small), but has issues on instance B
> (Diego-based, larger). The aforementioned HTTP-GET sometimes fails with
> "404 Not Found: Requested route does not exist" directly after push or
> map-route. We never experienced this issue on instance A. This could of
> course mean that the instance is small enough that all router operation are
> handled faster than our script is able to react. On instance B it sometimes
> takes up to a 1 second or even longer until the route mapping is finally
> completed.

I am reviewing the logic for how routes are registered for apps on DEAs,
but I believe it is also asynchronous. It may be that registration of
routes takes slightly longer with Diego, or you're seeing another
difference between the environments.


> In our internal discussion with our CF operators, we found a couple of
> parts of documentation which at least hint to different, maybe even
> inconsistent inner workings of the route mapping:
> 1) https://docs.cloudfoundry.org/devguide/deploy-apps/blue-green.html:
> Step 3: Map-route - The CF Router immediately begins to load balance
> traffic for demo-time.example.com between Blue and Green.
> - In my opinion this implies or at least strongly hints that the
> map-route call was supposed to be synchronous.

The routing table is indeed updated asynchronously, and we can update the
docs to clarify this; "immediately" may be a bit misleading.


> 2) PUT to "/v2/apps/#{app_guid}/routes/#{route_guid}" returns 201 CREATED
> and not 202 ACCEPTED, which also implies that it is a synchronous
> operation. It also does not return an event or operation which could be
> pinged to wait for completion of route mapping.
> 3) Operations which show the state of the route, like for example 'cf
> routes' already shows the route as being successfully mapped, although 404
> is still returned.

Currently Cloud Controller has no way of knowing whether a route is ever
registered with a router, whether the app is on DEAs or Diego. We could
consider how to provide such a guarantee; e.g. CC could poll routes until
an app returns a 200 (several issues with this). Also, if we were to add
this feature we couldn't change the response in v2 from 201 to 202 as that
would not be backwards compatible; we could consider this for the v3 CC API.


> 4) See https://docs.cloudfoundry.org/devguide/deploy-apps/routes-domains.html#map-route:
> Applications running on the DEA architecture must be restarted after routes
> for an app are mapped or unmapped. Applications running on Diego do not
> need to be restarted.
> - This is contrary to what I experienced, since route mapping on our
> instance A always worked synchronously and without restart of the
> application. Furthermore, it contradicts what's explained in 1) at least
> for DEA.

This sentence in the docs is incorrect. We will update them. Restarting an
app on DEA is not required after mapping or unmapping routes.


> 5) https://github.com/cloudfoundry/diego-design-notes#routing-translation-components:
> Routing Translation Components: Route-Emitter
> a) monitors DesiredLRP state and ActualLRP state via the BBS. When a
> change is detected, the Route-Emitter emits route registration and
> unregistration messages to the gorouter via the NATS message bus,
> b) periodically emits the entire routing table to the router,
> c) maintains a lock in consul to ensure only one route-emitter handles
> route registration at a time.
> - This hints that route mapping is an asynchronous task.
>
> Some of the issues could be fixed by repeatedly pinging the application in
> order to wait until the route has been mapped. But what about
> blue-green-deployment, where I map the application to a route on which
> there is already a running application. Here, I cannot find out in a
> general way, if the route mapping is completed and when I can start
> unmapping the route from the old application. If route-mapping is indeed an
> asynchronous task, in my opinion the description on
> https://docs.cloudfoundry.org/devguide/deploy-apps/blue-green.html is
> misleading at best. And all blue-green-deployment-scripts which I've seen
> so far have this issue.
>
> So, is the route mapping really supposed to be asynchronous? If so, is
> there any general way to find out when the route-mapping process has
> finished?

Route registration is an asynchronous operation. Other than polling the
application, CF does not expose a guaranteed way to discover that the route
has been registered with the routers themselves.

We recommend adding a short wait to your blue-green deploy script. The
blue-green-deploy CLI plugin effectively does this, by running a few other
commands (renaming apps) between mapping the route to one app and unmapping
it from another.
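A minimal sketch of such a wait, assuming the cf CLI is installed and already logged in and targeted. The helper names and the 10-second default are hypothetical; this thread does not say how long the wait should be. The app, domain and hostname in the usage note are those of the docs example.

```python
import subprocess
import time


def cutover_commands(blue_app, green_app, domain, hostname):
    """Build the cf CLI invocations for the map and unmap steps."""
    map_green = ["cf", "map-route", green_app, domain, "--hostname", hostname]
    unmap_blue = ["cf", "unmap-route", blue_app, domain, "--hostname", hostname]
    return map_green, unmap_blue


def cutover(blue_app, green_app, domain, hostname, settle_seconds=10.0,
            run=subprocess.check_call, sleep=time.sleep):
    """Map the route to the new app, wait, then unmap it from the old app.

    settle_seconds is an assumed value. The wait only makes it likely,
    not certain, that the routers have picked up the new mapping.
    """
    map_green, unmap_blue = cutover_commands(blue_app, green_app, domain, hostname)
    run(map_green)         # raises CalledProcessError if cf exits non-zero
    sleep(settle_seconds)  # give route registration time to reach the routers
    run(unmap_blue)
```

For the docs example this would be called as `cutover("Blue", "Green", "example.com", "demo-time")`.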

Thank you for your feedback! I'll submit a PR to update the docs now.
Shannon


Shannon Coen
 

Changes to docs have been committed, and will show up with the next push of
the docs app.

Shannon Coen
Product Manager, Cloud Foundry
Pivotal, Inc.


Keller, Jens <jens.keller@...>
 

Hi Shannon,
thanks for your feedback - comments inline.
Best regards
Jens


> Currently Cloud Controller has no way of knowing whether a route is ever
> registered with a router, whether the app is on DEAs or Diego. We could
> consider how to provide such a guarantee; e.g. CC could poll routes until
> an app returns a 200 (several issues with this). Also, if we were to add
> this feature we couldn't change the response in v2 from 201 to 202 as that
> would not be backwards compatible; we could consider this for the v3 CC API.

Not sure if my understanding is correct. First of all, to avoid misunderstandings, it'd be totally fine if we had an asynchronous mechanism. We don't need the operation to be synchronous. Second, if that's a new API, that's absolutely fine as well, we understand & agree that incompatible changes are not the way to go.

But the point here is that whether the return code is correct or not is not so much our issue. The issue is that we currently do not see any reliable way of knowing whether the route mapping worked, failed, or is still in progress. So how do we know when we can disconnect the old version? See also the next comment below. So I agree: just polling the route until the app returns a 200 is not the way to go (also see below).


> We recommend adding a short wait to your blue-green deploy script. The
> blue-green-deploy CLI plugin effectively does this, but running a few other
> commands (renaming apps) between mapping the route to one app and unmapping
> it from another.

Given we have a lot of instances of the old version, and just created one or a few instances of the new version on the same route - how do we even know that the route mapping worked at all?

Even after waiting for 5 minutes, we could get a 200 just because the response came from the old version, so we would think it works, but disconnecting the old version at that point would be fatal. And if we get a 404, how do we know whether we need to wait for 5 minutes, or whether the mapping just failed entirely and there's no point in waiting any longer at all?

To me that would feel a bit like "add a wait that is hopefully long enough and cross fingers" - I'm afraid such an approach is not really acceptable for enterprise applications with business-critical processes. But I'm not sure whether I misunderstood you.

I'm not sure which changes to the router would be possible and which won't - but I assume if it is possible to tell the router "please map X", it should be possible to add a feature, with which one could ask the router "what's the status of the mapping operation of X"?

The cloud controller could then just provide a new API that exposes this, so that a deployment script can poll this information – or am I missing something?


Shannon Coen
 

Hello Jens,

I understand your concern. There isn't a simple solution at the moment but
I have recorded your feedback.

We are aware that there is more CF could do with regard to upgrading apps
without downtime. For quite some time we have discussed offering automated
cutover between versions of an app as a built-in feature, so app developers
wouldn't have to use these blue-green scripts. Priorities are hard.
However, even if this feature were offered there may have to be significant
changes to the routing architecture to guarantee that routes have been
mapped before calling the operation a success.

Considering the routing tier is horizontally scalable, what would your
success criteria be for "a route is mapped to an app"? Would it be that the
change must be applied to the routing tables of all routers or only some
percentage?

> add a wait that is hopefully long enough and cross fingers


Yes, or check that the new version of your app is responding before
unmapping the old version from the route. Maybe expose a version endpoint?
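A version check along these lines might look like the sketch below. It assumes the application exposes a hypothetical /version endpoint that returns a plain-text version string; the endpoint, function names, attempt count and interval are all illustrative assumptions. Note that one sighting of the new version only shows that at least one router forwards to it, not that every router's table has been updated.

```python
import time
import urllib.request


def fetch_version(url, timeout=5.0):
    """Return the body of a GET to url as stripped text, or None on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except OSError:
        # Covers URLError, HTTPError (4xx/5xx) and timeouts.
        return None


def new_version_seen(url, expected, fetch=fetch_version, attempts=50,
                     interval=0.2, sleep=time.sleep):
    """Poll the shared route until a response reports the expected version.

    While both apps are mapped, responses may come from either one, so
    several attempts can be needed before the new version answers.
    """
    for attempt in range(attempts):
        if fetch(url) == expected:
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

A deploy script would unmap the old app only after, for example, `new_version_seen("https://demo-time.example.com/version", "2.0.0")` returns True.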

Shannon Coen
Product Manager, Cloud Foundry
Pivotal, Inc.


Keller, Jens <jens.keller@...>
 

Hi Shannon,

The success criterion would be that when I disconnect the old version of the app, all requests sent to the public URL reach the new application.

I don't know what would happen if only some percentage of the routers had updated their routing tables. If that means that a few requests would reach the new version, giving the impression that the mapping was successful and that the old app could be disconnected, then this would be an issue.

Regarding your proposal of exposing a version endpoint, I do think this is a workaround that could help in most scenarios. It becomes a hassle when there are >100 instances of the old version and only 1 instance of the new version, but I do think this is not the majority of the scenarios.
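To put a rough number on that hassle, here is a back-of-the-envelope sketch. It assumes each request is routed independently and uniformly at random across all instances on the route, which is a simplification and may not match how the routers actually balance load. The function name is hypothetical.

```python
import math


def requests_needed(old_instances, new_instances, confidence=0.99):
    """Number of requests needed to see the new version at least once.

    Under the uniform-random assumption, the chance that k requests all
    miss the new version is (old / total) ** k. This returns the smallest
    k for which that chance is at most 1 - confidence.
    """
    if new_instances <= 0:
        raise ValueError("at least one new instance is required")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if old_instances <= 0:
        return 1  # every request reaches the new version
    miss = old_instances / (old_instances + new_instances)
    return math.ceil(math.log(1.0 - confidence) / math.log(miss))
```

With 100 old instances and 1 new instance, `requests_needed(100, 1)` gives 463, so a version check would need several hundred requests before a missing new version becomes a meaningful signal.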

Thank you!

Best regards
Jens

