Richer health-checks for CF apps: request for use cases


Eric Malm <emalm@...>
 

Dear CF Community,

CF has long had a notion of health-checking app instances as they start up
to determine whether they're in a functional state, on top of the process
simply having started. On the DEAs, the health-check behavior is coupled to
whether the app has routes mapped to it, whereas for apps targeting the Diego
backend, the health-check specification is independent of the routing
configuration on the app. On Diego cells, the health check is also run
periodically[1] even after the app is started, to verify the health of the
instance continually.

With that independence, we now have more flexibility to specify
richer health checks for CF app instances. We on the CAPI and Diego teams
would like to know what kinds of health checks you would find useful for
your apps (either ones serving web traffic, or ones doing background work).
The two types of health check currently available are 'port', which checks
that a TCP connection can be made to the app instance on the port specified
by the PORT env var, and 'none', which despite the name does continually
verify that the process invoked in the container is still running.
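
For illustration only, the 'port' check described above amounts to roughly
the following. This is a minimal Python sketch, not the actual Diego
implementation; the function name port_check, the loopback host, and the
1-second default timeout are assumptions made for the example.

```python
import os
import socket


def port_check(host="127.0.0.1", port=None, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    if port is None:
        # Assumption: fall back to the PORT env var, as the 'port' check does.
        port = int(os.environ["PORT"])
    try:
        # Opening and immediately closing the connection is the whole check.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Note that this only proves something is listening on the port; it says
nothing about whether the app can actually serve a request.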

As a starting point, on a recent cf-dev thread[2], we identified that for
an HTTP-based health check, it would be useful to specify an endpoint to
hit, an acceptable response status code or codes, and a timeout to apply to
the request. Sensible defaults could be "/", 200 OK, and 1 second,
respectively.
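
To make those three parameters concrete, here is a rough Python sketch of
such a check using only the standard library. It is an illustration under
assumptions, not a proposed implementation: the function name http_check,
its parameter names, and the loopback host are hypothetical, and only the
defaults ("/", 200, 1 second) come from the discussion above.

```python
import http.client
import urllib.error
import urllib.request


def http_check(port, path="/", ok_statuses=(200,), timeout=1.0,
               host="127.0.0.1"):
    """Return True if GET http://host:port/path answers with an OK status."""
    url = f"http://{host}:{port}{path}"
    try:
        # Note: urlopen follows redirects, so 3xx codes are rarely seen here.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the code may still be acceptable.
        status = err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a malformed response: unhealthy.
        return False
    return status in ok_statuses
```

For example, http_check(8080, path="/healthz", ok_statuses=(200, 204))
would accept either status from a hypothetical /healthz endpoint.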

In any case, please comment here with your health-check use cases, and we
intend to use them as input to a proposal soon.

Thanks very much,
Eric, CF Runtime Diego PM

[1]:
https://github.com/cloudfoundry-incubator/diego-design-notes/blob/master/migrating-to-diego.md#health-checks
[2]:
https://lists.cloudfoundry.org/archives/list/cf-dev@lists.cloudfoundry.org/thread/HT7W7UMHR3ZLHV3Q6VJN5URETQUJBVZW/


Aaron Huber
 

Just as a reference you could look at some of the connection tests that Monit
allows:

https://mmonit.com/monit/documentation/monit.html#CONNECTION-TESTING

Obviously there are quite a few there so it might go well beyond what's
reasonable for container health checking.

I think the HTTP check already mentioned would be sufficient to meet our
use cases, but to add to it, I could imagine that it might be useful to be
able to specify a regular expression to search for in the returned HTML,
instead of or in addition to the status code.
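
As a rough sketch of what I mean (Python, with hypothetical names; passing
None for either criterion skips it, which is how "instead of or in addition
to" is modeled here):

```python
import re


def response_healthy(status, body, ok_statuses=(200,), body_pattern=None):
    """Judge an HTTP response by status code, body regex, or both.

    ok_statuses=None skips the status test (regex instead of status);
    body_pattern=None skips the body test (status only).
    """
    status_ok = ok_statuses is None or status in ok_statuses
    body_ok = (body_pattern is None
               or re.search(body_pattern, body) is not None)
    return status_ok and body_ok
```

So response_healthy(200, "<p>status: UP</p>", body_pattern=r"status:\s*UP")
would pass, while the same call with a body reporting DOWN would fail even
though the status code is 200.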

Also, since you guys are expanding into offering TCP routing for containers, a
generic TCP monitor that looked for a specific regular expression in the
returned data might be useful, which might also require specifying data to
send to trigger a response.
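
Something along these lines, as a minimal Python sketch. The function name
tcp_expect_check and all of its parameters are made up for illustration,
and the timeout applies to each socket operation rather than to the check
as a whole.

```python
import re
import socket


def tcp_expect_check(host, port, expect, send=None, timeout=1.0,
                     max_bytes=4096):
    """Connect, optionally send bytes, and search the reply for a regex.

    `expect` is a bytes regular expression; `send` is optional bytes to
    transmit first in order to trigger a response.
    """
    pattern = re.compile(expect)
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            if send:
                conn.sendall(send)
            received = b""
            # Read until the pattern matches, the peer closes the
            # connection, or max_bytes have been collected.
            while len(received) < max_bytes:
                if pattern.search(received):
                    return True
                chunk = conn.recv(max_bytes - len(received))
                if not chunk:
                    break
                received += chunk
            return pattern.search(received) is not None
    except OSError:
        # Refused connection or a timed-out send/recv counts as unhealthy.
        return False
```

For a service that greets on connect, `send` can be left out and only the
banner is matched.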

Aaron


