Re: Issue with crashing Windows apps on Diego
Aaron Huber
I shut off the healthcheck via the CLI and then started a separate process
calling healthcheck.exe on my test servers after setting the port
environment variable:
Looking at the data from running it all day, for the most part it looks
normal, but one time it did time out:
The same instances have been up and running all day long; the timeout was
a one-off, and the check came back just fine afterwards.
This has been nagging at me all weekend and I think I finally figured out
why. So far all healthchecks in CloudFoundry have been either on the PID
(process didn't crash) or the port (accepting TCP connections). This is the
first time I've seen one that is actually doing an HTTP check that must pass
for the "container" (such as it is on Windows) to be considered healthy.
Looking at the Linux healthcheck code it looks like there is a "uri"
healthcheck:
https://github.com/cloudfoundry-incubator/healthcheck/blob/master/cmd/healthcheck/main.go#L49-L53
But as far as I can tell it's unused because only port is ever called:
https://github.com/cloudfoundry-incubator/nsync/blob/master/recipebuilder/recipe_builder.go#L97-L98
In addition, all the documentation and even the help text on the CLI
describe this as a "port" healthcheck. It's bad enough that doing the HTTP
healthcheck means it's now inconsistent between Linux and Windows on Diego,
but the following are serious concerns for me:
1) Especially on .NET it can take a while for apps to start up and it's
likely we could get into a loop of starting and then killing containers
because we don't give them enough time to start up.
2) Even if all is working well, we've now hard-coded that any app landing on
garden-windows has to have an HTTP response time faster than 1 second or
it just can't land. What if my developer has an app that is expected to be
slow due to back-end dependencies or processing logic?
In my opinion, we need to change the Windows app lifecycle healthcheck at
https://github.com/cloudfoundry/windows_app_lifecycle/blob/master/Healthcheck/Program.cs
to be consistent with Linux. Instead of doing an HttpClient() GET request
we should be doing a TcpClient() connect. In that case a 1 second timeout
should be fine - all I care about is that the port is open and listening,
not how long it takes for the actual app to process and respond. As a
platform owner it's my job to make sure the app is there and able to be
connected to on the network - actual HTTP response and how long it takes
should be the developer's concern.
Thoughts?
Aaron
--
View this message in context: http://cf-dev.70369.x6.nabble.com/Issue-with-crashing-Windows-apps-on-Diego-tp3586p3614.html
Sent from the CF Dev mailing list archive at Nabble.com.