Re: Issue with crashing Windows apps on Diego


Eric Malm <emalm@...>
 

Hi, Aaron and Matt,

Thanks for the thoughtful discussion of the Windows health-check issue. I
agree that, for consistency, if the CF end user has specified 'port' as the
type of health-check on their app, the platform should check only TCP
connectivity to the app on that port, and not any layer-7 functionality
beyond that.

Some background on the HTTP vs TCP behavior in the health-check:
originally, the health-check binary used for the buildpack and docker app
lifecycles made only TCP connections to the requested port. When Lattice
made it possible to submit DesiredLRPs directly to the Diego API, we got
feedback from its users that they wanted an option to specify an HTTP-based
health-check as well. Consequently, we extended that health-check binary to
take an optional endpoint flag, and in its presence the binary would make a
GET request to the specified endpoint and check for a response with a 200
OK status code within the specified timeout (default 1s). For buildpack and
docker CF apps, though, none of that HTTP functionality has been exposed
through CC, and only the basic TCP connectivity check is available.
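
To make the two behaviors concrete, here is a minimal Python sketch of a
TCP-or-HTTP check along the lines described above. It is an illustration
only, not the actual health-check binary: the function names and signatures
are hypothetical, and the timeout here applies to each socket operation
rather than to the check as a whole.

```python
import http.client
import socket


def port_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Healthy if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def http_check(host: str, port: int, endpoint: str = "/",
               timeout: float = 1.0) -> bool:
    """Healthy only if GET <endpoint> answers with a 200 status in time."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", endpoint)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()


def health_check(host: str, port: int, endpoint=None,
                 timeout: float = 1.0) -> bool:
    """Hypothetical wrapper: TCP check unless an endpoint is supplied."""
    if endpoint is None:
        return port_check(host, port, timeout)
    return http_check(host, port, endpoint, timeout)
```

With this sketch, an app that accepts connections but answers GET / with a
500 passes health_check(host, port) and fails
health_check(host, port, endpoint="/").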

Matt, the native NetCheckAction from the Diego Dev Notes proposal you
mention is effectively just encoding the current behavior of that
TCP-or-HTTP health-check binary as an action that the rep could perform
itself, rather than by invoking that binary in-container. The Diego team
had conceived of it primarily as a performance optimization, particularly
when starting a lot of instances on a cell simultaneously, but
investigation revealed it to be of secondary benefit at best. The Diego
team might implement it at some point, but for now we'd prefer not to
expand the surface area of the Diego BBS API to include it. I've been
meaning to update and close out that Dev Notes issue, and will do so
shortly.

In any case, the options on that proposed NetCheckAction are just the ones
already available on the health-check binary, and, native action or not,
additional work would still be required to expose them through CC to the CF
end-user. Moreover, I don't think they're sufficient to address all the
concerns that Aaron raises in his observations about the Windows app
lifecycle's current HTTP-based check. Aaron, you mentioned timeout and
expected status code as important parameters to specify on an HTTP
health-check; are there others? I would think endpoint could be just as
useful: perhaps your app has a /health or /ping endpoint specifically
designed to return a fast response about the app itself, separate from
backing services and/or authentication checks, or perhaps it simply doesn't
handle requests to /.
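
As a rough sketch of what such a dedicated endpoint might look like on the
app side, the Python below routes /health and /ping to a fast 200 that does
not consult any backing service, while / still returns a friendly 500 page
when the backend is down. Everything here is hypothetical (the route
function, the BACKEND_UP flag, the response bodies); it only illustrates the
separation, not any particular framework or lifecycle.

```python
from http.server import BaseHTTPRequestHandler
from typing import Tuple

# Hypothetical stand-in for a real probe of a database or web service.
BACKEND_UP = False


def route(path: str, backend_up: bool) -> Tuple[int, str]:
    """Return (status, body) for a GET request to path."""
    if path in ("/health", "/ping"):
        # Reports on the app process only; never touches the backend.
        return 200, "ok"
    if not backend_up:
        # Friendly error page for end users while the backend is down.
        return 500, "Sorry, we are having trouble right now."
    return 200, "Hello"


class AppHandler(BaseHTTPRequestHandler):
    """Minimal handler that serves whatever route() decides."""

    def do_GET(self):
        status, body = route(self.path, BACKEND_UP)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

An HTTP check pointed at /health would keep such an instance marked healthy
during a backend outage, whereas a check against / would not.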

Thanks,
Eric

On Tue, Feb 2, 2016 at 1:50 PM, aaron_huber <aaron.m.huber(a)intel.com> wrote:

My concern is that the HTTP check (mislabeled as "port") would still be the
default and I'd have to expect users to opt out of it per app. It's
confusing and not what users of the platform have come to expect moving
from DEA/IF.

In general, the HTTP checks as a platform owner still make me nervous. They
are nice in theory as long as they are opt-in for the developer, but what
happens when something goes wrong? For example, say I have an app dependent
on a back-end resource (database, web service, etc.) that is down and as a
result my app is returning a friendly error page with a 500 response. With
an HTTP healthcheck my app is now effectively down with an ugly 404 message
from the router as all containers will fail and not correctly respawn
because they will not return a 200 to ever get healthy. Is that a better
user experience than the friendly error page? How long will Diego continue
trying to start the unhealthy containers before it gives up and then
requires developer interaction to start the app again?

To close on this, I think the new story is essential for consistency of the
overall platform and to avoid the issues above, and I would argue strongly
that it should be completed ASAP. Once the improved story is in place then
my customers could opt into an HTTP check with adequate knowledge of the
potential impacts.

Aaron



--
View this message in context:
http://cf-dev.70369.x6.nabble.com/Issue-with-crashing-Windows-apps-on-Diego-tp3586p3647.html
Sent from the CF Dev mailing list archive at Nabble.com.
