Issue with crashing Windows apps on Diego


Aaron Huber
 

We've started testing Windows apps on Diego in our lab and everything appears
to be working correctly except for occasional crashes of the .NET apps. The
frequency is very random - sometimes I can go a day or more without any and
then I'll get many in a day. As far as I can tell from the logs the only
issue is that the healthcheck in the lifecycle is timing out due to
exceeding the 1 second wait here:

https://github.com/cloudfoundry/windows_app_lifecycle/blob/master/Healthcheck/Program.cs#L29

Our test environment is definitely running on very slow storage so it
doesn't surprise me that it gets a bit slow sometimes, but taking more than
1 second to respond to a simple HTTP request still seems unlikely. I've
looked through the logs and can't find any indication of root cause other
than the healthcheck returning exit code 1 instead of 0:

{"timestamp":"1454113322.534542084","source":"garden-windows","message":"garden-windows.garden-server.run.spawned","log_level":1,"data":{"handle":"c41ecf17-6e8c-4b50-a103-4e32323ef53e-bdfa601f-0a44-48fd-8d05-e5551ac9af7a-3a193046-43ed-4811-7bc4-3595809a409c","id":"5920","session":"1.104644","spec":{"Path":"/tmp/lifecycle/healthcheck","Dir":"","User":"vcap","Limits":{"nofile":1024},"TTY":null}}}

{"timestamp":"1454113324.545698404","source":"garden-windows","message":"garden-windows.garden-server.run.exited","log_level":1,"data":{"handle":"c41ecf17-6e8c-4b50-a103-4e32323ef53e-bdfa601f-0a44-48fd-8d05-e5551ac9af7a-3a193046-43ed-4811-7bc4-3595809a409c","id":"5920","session":"1.104644","status":1}}

{"timestamp":"1454113324.987732887","source":"garden-windows","message":"garden-windows.garden-server.destroy.destroyed","log_level":1,"data":{"handle":"c41ecf17-6e8c-4b50-a103-4e32323ef53e-bdfa601f-0a44-48fd-8d05-e5551ac9af7a-3a193046-43ed-4811-7bc4-3595809a409c","session":"1.104647"}}
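The time between the "spawned" and "exited" entries above can be worked out directly from their timestamps. A minimal sketch in Python, using trimmed copies of the two log lines (only the `timestamp` and `message` fields are kept; the helper name `seconds_between` is hypothetical):

```python
import json

# Trimmed copies of the "spawned" and "exited" entries above; only the
# fields needed for the calculation are kept.
LOG_LINES = [
    '{"timestamp":"1454113322.534542084",'
    '"message":"garden-windows.garden-server.run.spawned"}',
    '{"timestamp":"1454113324.545698404",'
    '"message":"garden-windows.garden-server.run.exited"}',
]


def seconds_between(lines, start_suffix, end_suffix):
    """Return the seconds between the first entries whose message ends
    with start_suffix and end_suffix respectively."""
    stamps = {}
    for line in lines:
        entry = json.loads(line)
        for suffix in (start_suffix, end_suffix):
            if entry["message"].endswith(suffix) and suffix not in stamps:
                stamps[suffix] = float(entry["timestamp"])
    return stamps[end_suffix] - stamps[start_suffix]


elapsed = seconds_between(LOG_LINES, "run.spawned", "run.exited")
print(f"healthcheck process ran for about {elapsed:.2f} s")
```

The interval covers the whole healthcheck process from spawn to exit, so it is an upper bound on the request time rather than a measurement of it.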

There are no other event log messages at the same time to indicate anything
is wrong on the system. Theoretically I could just try increasing the wait
time on the healthcheck but I'd love to get some more data on exactly what's
going on. Anyone have any ideas?

Aaron Huber
Intel Corporation





--
View this message in context: http://cf-dev.70369.x6.nabble.com/Issue-with-crashing-Windows-apps-on-Diego-tp3586.html
Sent from the CF Dev mailing list archive at Nabble.com.


Steven Benario
 

Hi Aaron,

Thanks for the report!

I'd recommend either extending the healthcheck timeout, or disabling health
checks completely to see if that fixes the problem. You can do this with:
`cf set-health-check APPNAME none`

If that doesn't fix the problem, is the app something you can share with
the CF Windows development team?

Thanks,
Steven Benario
Cloud Foundry PM for Greenhouse

In reply to aaron_huber <aaron.m.huber(a)intel.com>, Fri, Jan 29, 2016 at 4:45 PM.


Aaron Huber
 

The app is just a simple one page test app we've been using since we landed
Iron Foundry, here is the content in our default.aspx:

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="Default.aspx.cs"
%>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
<title></title>
</head>
<body>
<form id="form1" runat="server">
<div>
Hello from Iron Foundry!
</div>
<div>
<%
Response.Write(".NET Framework Version: " +
System.Environment.Version.ToString() );
%>
</div>
</form>
</body>
</html>

I can't imagine it's causing any problem. :-) I'll try turning off the
healthcheck in CF and run healthcheck.exe in the app directory and see if I
can get any more data, thanks for the suggestion.

Aaron





Aaron Huber
 

I shut off the healthcheck via the CLI and then started a separate process
calling healthcheck.exe on my test servers after setting the port
environment variable:



Looking at the data from running all day for the most part it looks normal
but one time it did time out:



The same instances have been up and running all day long; the timeout was a
one-off and the next check came back just fine.

This has been nagging at me all weekend and I think I finally figured out
why. So far all healthchecks in Cloud Foundry have been either on the PID
(process didn't crash) or the port (accepting TCP connections). This is the
first time I've seen one that is actually doing an HTTP check that must pass
for the "container" (such as it is on Windows) to be considered healthy.
Looking at the Linux healthcheck code it looks like there is a "uri"
healthcheck:

https://github.com/cloudfoundry-incubator/healthcheck/blob/master/cmd/healthcheck/main.go#L49-L53

But as far as I can tell it's unused because only port is ever called:

https://github.com/cloudfoundry-incubator/nsync/blob/master/recipebuilder/recipe_builder.go#L97-L98

In addition, all the documentation and even the help text on the CLI
describe this as a "port" healthcheck. It's bad enough that doing the HTTP
healthcheck means it's now inconsistent between Linux and Windows on Diego,
but the following are serious concerns for me:

1) Especially on .NET it can take a while for apps to start up and it's
likely we could get into a loop of starting and then killing containers
because we don't give them enough time to start up.

2) Even if all is working well, we've now hard coded that any app landed on
garden-windows now has to have a faster than 1 second HTTP response time or
it just can't land. What if my developer has an app that is expected to be
slow due to back-end dependencies or processing logic?

In my opinion, we need to change the Windows app lifecycle healthcheck at
https://github.com/cloudfoundry/windows_app_lifecycle/blob/master/Healthcheck/Program.cs
to be consistent with Linux. Instead of doing an HttpClient GET request
we should be doing a TcpClient connect. In that case a 1 second timeout
should be fine - all I care about is that the port is open and listening,
not how long it takes for the actual app to process and respond. As a
platform owner it's my job to make sure the app is there and able to be
connected to on the network - the actual HTTP response and how long it takes
should be the developer's concern.
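
A minimal sketch of what a port-only check amounts to, written in Python purely for illustration (the real healthcheck is C#). The function name `port_is_open` and the default timeout are assumptions, not code from windows_app_lifecycle:

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Nothing is sent or read: the check passes as soon as the port is
    listening, regardless of how slowly the app would answer a request.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Demo against a throwaway local listener on an OS-assigned port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    demo_port = listener.getsockname()[1]
    print("listening:", port_is_open("127.0.0.1", demo_port))
    listener.close()
    print("closed:", port_is_open("127.0.0.1", demo_port))
```

Because the check never waits for an HTTP response, a 1 second timeout only has to cover the TCP handshake.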

Thoughts?

Aaron





Aaron Huber
 

Also just occurred to me - what if the page returns a 302, 401, or 404? I'm
guessing it would make the healthcheck fail because it wouldn't be a match
against Result.IsSuccessStatusCode. :-(
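
For reference, `IsSuccessStatusCode` in .NET is true only for status codes in the 200-299 range. A small Python sketch of the same rule, assuming the client reports the status of the first response and does not follow redirects on its own (`is_success_status` is a hypothetical helper name):

```python
def is_success_status(code):
    """Mirror the 2xx rule: only 200-299 counts as success."""
    return 200 <= code <= 299


for code in (200, 204, 302, 401, 404):
    verdict = "pass" if is_success_status(code) else "fail"
    print(code, verdict)
```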

Aaron





Matthew Horan
 

On Mon, Feb 1, 2016 at 7:06 PM, aaron_huber <aaron.m.huber(a)intel.com> wrote:

> This has been nagging at me all weekend and I think I finally figured out
> why. So far all healthchecks in Cloud Foundry have been either on the PID
> (process didn't crash) or the port (accepting TCP connections). This is the
> first time I've seen one that is actually doing an HTTP check that must pass
> for the "container" (such as it is on Windows) to be considered healthy.
> Looking at the Linux healthcheck code it looks like there is a "uri"
> healthcheck:
>
> https://github.com/cloudfoundry-incubator/healthcheck/blob/master/cmd/healthcheck/main.go#L49-L53
>
> But as far as I can tell it's unused because only port is ever called:
>
> https://github.com/cloudfoundry-incubator/nsync/blob/master/recipebuilder/recipe_builder.go#L97-L98

Hey Aaron -

You're right; it looks like only the port check is ever used. I don't have
the history as to why we (the CF .NET team) implemented an HTTP check
instead of a port check, but that's how it is.

> In addition, all the documentation and even the help text on the CLI
> describe this as a "port" healthcheck. It's bad enough that doing the HTTP
> healthcheck means it's now inconsistent between Linux and Windows on Diego,
> but the following are serious concerns for me:
>
> 1) Especially on .NET it can take a while for apps to start up and it's
> likely we could get into a loop of starting and then killing containers
> because we don't give them enough time to start up.
>
> 2) Even if all is working well, we've now hard coded that any app landed on
> garden-windows now has to have a faster than 1 second HTTP response time or
> it just can't land. What if my developer has an app that is expected to be
> slow due to back-end dependencies or processing logic?

There is a proposal [1] in place to address your concerns. As far as I
know, work towards implementing this proposal is stalled, but I've looped
in Eric for more details.

[1] https://github.com/cloudfoundry-incubator/diego-dev-notes/issues/31


Matthew Horan
 

On Tue, Feb 2, 2016 at 9:31 AM, Matthew Horan <mhoran(a)pivotal.io> wrote:

> You're right; it looks like only the port check is ever used. I don't have
> the history as to why we (the CF .NET team) implemented an HTTP check
> instead of a port check, but that's how it is.

In talking with a former team member, I came across the story [1] where we
made this change. The WebAppServer will listen on the port immediately upon
starting, even if the app has not successfully loaded. That made a port
check undesirable for the common case -- but the HTTP check obviously causes
issues for slow apps, or apps which require authentication. As mentioned in
the story, the developers pointed out that this behavior should be
configurable -- but this was never implemented.

Hopefully we can see some progress on the proposed healthcheck changes,
which would better address your issue. In the meantime, I'm not sure of the
best course of action. It's quite easy to push an unlaunchable app to
Windows, and there will be little to no debug information available to help
the developer figure out why their app is inaccessible. The current
implementation has its drawbacks, but can be worked around by "disabling"
the health check.

[1] https://www.pivotaltracker.com/story/show/96080778


Aaron Huber
 

I agree with your root argument that the port check doesn't really address
application health and it's easy to push a non-working app and have the
healthcheck still pass. My argument is that is exactly how the healthchecks
work for Linux-based apps and it seems clear that is the intent of the
"port" healthcheck. Any buildpack or Docker based app that I push on
cflinuxfs2 will pass as soon as the web server starts accepting connections
even if the actual app isn't working (yet, or at all).

I don't disagree that improvement can be made here, but I do strongly
believe that 1) the platform should be consistent across Linux and Windows
apps and what is described as a "port" check should just be checking the
port, and 2) any HTTP check should be configurable (either opt-in or
opt-out) in cases where the root of an app isn't expected to return a 200,
of which there are many valid cases. Your proposed work-around in my
opinion is even worse, in that I have to disable any container checking at
all if an app falls outside of what you consider typical.

I think we agree that the best solution for most common apps is to use an
HTTP check, but in order for that to be functional I think the platform
would need to define a new "http" healthcheck type and allow the user to
configure a timeout and expected status code (with defaults of 1 second and
200).

Aaron





Aaron Huber
 

Just to clarify as well why I think this is so important - a majority of apps
on our internal platforms require authentication and will return a 401 on
the root page, making them unusable on Diego for Windows without completely
disabling the healthchecks. These same apps work just fine on Iron Foundry
because it was only checking the port.

I'd love to move forward with Garden Windows support when we land Diego but
for now I don't see how we can.

Aaron





Matthew Horan
 

On Tue, Feb 2, 2016 at 12:21 PM, aaron_huber <aaron.m.huber(a)intel.com>
wrote:

> I don't disagree that improvement can be made here, but I do strongly
> believe that 1) the platform should be consistent across Linux and Windows
> apps and what is described as a "port" check should just be checking the
> port, and 2) any HTTP check should be configurable (either opt-in or
> opt-out) in cases where the root of an app isn't expected to return a 200,
> of which there are many valid cases. Your proposed work-around in my
> opinion is even worse, in that I have to disable any container checking at
> all if an app falls outside of what you consider typical.

Setting the health check to none does not actually disable health checks.
This setting simply disables the HTTP healthcheck, and Diego will continue
to monitor the process. I'm not sure why this setting does not meet your
immediate requirements. A deadlocked or misconfigured WebAppServer would
pass both a simple TCP check and a process check, while the current
healthcheck implementation would detect an issue. Given the process check
runs regardless of whether the healthcheck is enabled, the more reliable
(though sometimes undesirable) opt-in HTTP check can simply be disabled,
and the process will still be monitored by Diego.

> I think we agree that the best solution for most common apps is to use an
> HTTP check, but in order for that to be functional I think the platform
> would need to define a new "http" healthcheck type and allow the user to
> configure a timeout and expected status code (with defaults of 1 second and
> 200).

Please see the proposal [1] currently being discussed. We plan to expose a
multitude of options for the healthcheck, including a simple port check, an
HTTP check with a configurable endpoint, and timeouts.

We've also dropped a story in our backlog [2] to bring our healthcheck in
line with Linux. However, any stories to implement the proposed healthcheck
improvements would likely be prioritized before this effort. Regardless,
garden-windows is open source, and pull requests are welcome!

[1] https://github.com/cloudfoundry-incubator/diego-dev-notes/issues/31
[2] https://www.pivotaltracker.com/story/show/112914163


Aaron Huber
 

My concern is that the HTTP check (mislabeled as "port") would still be the
default and I'd have to expect users to opt out of it per app. It's
confusing and not what users of the platform have come to expect moving from
DEA/IF.

In general, as a platform owner, the HTTP checks still make me nervous. They
are nice in theory as long as they are opt-in for the developer, but what
happens when something goes wrong? For example, say I have an app dependent
on a back-end resource (database, web service, etc.) that is down and as a
result my app is returning a friendly error page with a 500 response. With
an HTTP healthcheck my app is now effectively down with an ugly 404 message
from the router, as all containers will fail and not correctly respawn
because they will never return a 200 and so never get healthy. Is that a better
user experience than the friendly error page? How long will Diego continue
trying to start the unhealthy containers before it gives up and then
requires developer interaction to start the app again?

To close on this, I think the new story is essential for consistency of the
overall platform and to avoid the issues above, and I would argue strongly
that it should be completed ASAP. Once the improved story is in place then
my customers could opt into an HTTP check with adequate knowledge of the
potential impacts.

Aaron





Eric Malm <emalm@...>
 

Hi, Aaron and Matt,

Thanks for the thoughtful discussion of the Windows health-check issue. I
too think for consistency that if the CF end user has specified 'port' as
the type of health-check on their app, then the platform should be checking
only TCP connectivity to the app on that port, and not any layer-7
functionality beyond that.

Some background on the HTTP vs TCP behavior in the health-check:
originally, the health-check binary used for the buildpack and docker app
lifecycles made only TCP connections to the requested port. When Lattice
made it possible to submit DesiredLRPs directly to the Diego API, we got
feedback from its users that they wanted an option to specify an HTTP-based
health-check as well. Consequently, we extended that health-check binary to
take an optional endpoint flag, and in its presence the binary would make a
GET request to the specified endpoint and check for a response with a 200
OK status code within the specified timeout (default 1s). For buildpack and
docker CF apps, though, none of that HTTP functionality has been exposed
through CC, and only the basic TCP connectivity check is available.
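
The TCP-or-HTTP behavior described above can be sketched roughly as follows. This is an illustration in Python, not the actual Go health-check binary: the function name `run_check` and its parameters are assumptions, and note that `urllib` follows redirects and applies the timeout per socket operation, which may differ from the real implementation.

```python
import http.client
import socket
import urllib.request


def run_check(host, port, endpoint=None, timeout=1.0):
    """Return True if the instance looks healthy.

    Without an endpoint, only TCP connectivity to the port is checked.
    With an endpoint, a GET request must come back with 200 OK.
    """
    if endpoint is None:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    url = f"http://{host}:{port}{endpoint}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # urllib raises HTTPError (an OSError subclass) for 4xx/5xx
        # responses; timeouts and refused connections are OSErrors too.
        return False
```

With this shape, an app that answers 401 on `/` passes `run_check(host, port)` but fails `run_check(host, port, endpoint="/")`, which is the difference discussed in this thread.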

Matt, the native NetCheckAction from the Diego Dev Notes proposal you
mention is effectively just encoding the current behavior of that
TCP-or-HTTP health-check binary as an action that the rep could perform
itself, rather than by invoking that binary in-container. The Diego team
had conceived of it primarily as a performance optimization, particularly
when starting a lot of instances on a cell simultaneously, but
investigation revealed it to be of secondary benefit at best. The Diego
team might implement it at some point, but for now we'd prefer not to
expand the surface area of the Diego BBS API to include it. I've been
meaning to update and close out that Dev Notes issue, and will do so
shortly.

In any case, the options on that proposed NetCheckAction are just the ones
already available on the health-check binary, and, native action or not,
additional work would still be required to expose them through CC to the CF
end-user. Moreover, I don't think they're sufficient to address all the
concerns that Aaron raises in his observations about the Windows app
lifecycle's current HTTP-based check. Aaron, you mentioned timeout and
expected status code as important parameters to specify on an HTTP
health-check; are there others? I would think endpoint could be just as
useful: perhaps your app has a /health or /ping endpoint specifically
designed to return a fast response about the app itself, separate from
backing services and/or authentication checks, or perhaps it simply doesn't
handle requests to /.

Thanks,
Eric

In reply to aaron_huber <aaron.m.huber(a)intel.com>, Tue, Feb 2, 2016 at 1:50 PM.


Aaron Huber
 

Yes, I agree that setting the specific URI to check would be necessary as
well so that developers could avoid some of the other concerns. So the ones
I can think of:

* URI / endpoint
* Expected status codes - this would probably need to be a range or an
array, or even an array of ranges :-)
* Timeout
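
The three parameters listed above could be captured in a small specification object. A rough sketch in Python, where the class name `HttpCheckSpec`, its field names, and the defaults (1 second, 200 only, as suggested earlier in this thread) are illustrative assumptions rather than any existing Cloud Foundry or Diego API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HttpCheckSpec:
    endpoint: str = "/"
    timeout_seconds: float = 1.0
    # Each entry is an inclusive (low, high) range; a single code is (c, c).
    expected_statuses: tuple = ((200, 200),)

    def accepts(self, status_code):
        """Return True if status_code falls in any expected range."""
        return any(low <= status_code <= high
                   for low, high in self.expected_statuses)


# An app whose root requires authentication could declare 401 as healthy.
auth_app = HttpCheckSpec(endpoint="/",
                         expected_statuses=((200, 299), (401, 401)))
print(auth_app.accepts(401), auth_app.accepts(500))
```

Modeling the expected codes as a list of ranges covers all three shapes mentioned above: a single code, a range, or an array of ranges.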

Aaron






Eric Malm <emalm@...>
 

Thanks, Aaron, that's extremely helpful. I'll start a separate thread on
cf-dev shortly soliciting more input on how the community would find richer
health checks useful, but this specification seems like an excellent
starting point.

Best,
Eric

In reply to aaron_huber <aaron.m.huber(a)intel.com>, Wed, Feb 3, 2016 at 9:55 AM.


Aaron Huber
 

Based on this discussion, where are we on the priority of switching the
current "port" check for the Windows lifecycle back to actually be a port
check? I get the impression that the changes to support a new HTTP check in
the CC, CLI, BBS, etc. will probably take a while so until then I'm hoping
we can make the other change a bit quicker.

Aaron





Steven Benario
 

Hi Aaron,

You can track the progress of the story for DiegoWindows here on the public
tracker [1].

As it stands, we don't yet have a solution that we could do within the
DiegoWindows codebase that wouldn't break existing applications by allowing
them to return "healthy" before the app has even started up.

I absolutely agree that having an inconsistent pattern between Linux and
Windows is something to avoid (and something that is mislabeled is even
worse), but I can totally see how this decision was made originally, and I
don't yet have any ideas for something that could fix it in the short term.

I think long term, we'd like to see a general healthcheck that looks like
some combination or user-selection of:
- Process monitoring
- Port check
- HTTP check (with configuration options previously discussed)

...with some "sane" settings selected by default.

For the short term, until we have a strong proposal of what to do to
significantly improve the state of the world without breaking existing
applications, we will probably not make any changes.


Thanks,
Steven Benario
PM for Windows Support


[1] https://www.pivotaltracker.com/story/show/112914163

In reply to aaron_huber <aaron.m.huber(a)intel.com>, Mon, Feb 8, 2016 at 1:21 PM.


Aaron Huber
 

I understand what you're trying to avoid, I just think that is actually the
normal case for the port healthchecks. Nothing on the Linux or Docker side
ever touches the app so it's entirely possible it will be added to the
router without it actually working and that is what I expect the platform to
do. Hopefully the more generic HTTP check can be added quickly to all the
right places so that we'll at least have more sensible options.

Now we just have to decide if we hang onto Iron Foundry that just uses a
port check until then, or try to explain to my users that most of their apps
won't work unless they turn off the healthcheck. I'm expecting most of them
won't RTFM and we'll get constant complaints about how our .NET support is
broken because their apps won't start up.

Aaron





Steven Benario
 

My understanding is that on Linux, because the app droplet itself typically
includes the web server (as opposed to Windows, where the server is run by
the host), it would be rare for the web server to be available before the
app is up and running.

On Windows, it would be the common case for the web server to start
accepting TCP connections almost immediately, and you could wait a long
time before the app is ready. Hence the discrepancy.

Thanks for understanding and weighing in. Looking forward to hearing more
about how disabling the checks works in your environment -- and of course
keep an eye out here for the proposal and updated timeline on the more
robust checks.

Cheers,
Steven

In reply to aaron_huber <aaron.m.huber(a)intel.com>, Mon, Feb 8, 2016 at 4:49 PM.


Aaron Huber
 

It will totally depend on the app/buildpack. For example, the static file
buildpack and PHP buildpack just launch Nginx and then host the application
inside it. As soon as the web server is up it will accept connections so
they would work identically to IIS HWC with just a TCP healthcheck. For
others the framework would still likely start up and accept connections
before the app itself is ready, and again it would be very possible that the
app itself would crash the first time you actually hit it but the
healthcheck would still think the container is healthy.

Again, I'm not arguing that any of that is "good", just that is how the
platform is expected to work with a port check and it should work
consistently. I also agree that the (annoying) 30-60 second app warmup on
.NET makes this even uglier.

Assuming you do eventually make the port healthcheck for Windows actually
check the port, there should be a way to make it work. My understanding is
that right now you do the following (high level):

* Spin up the "container" via the app lifecycle (create user, set quota,
create FW rules, etc.)
* Start up the HWC process
* Start running the healthcheck which hits the root of the app and checks
for 200-299 with a 1s timeout
* Add it to the router once the healthcheck passes

What if you did something like this:

* Spin up the container
* Start up the HWC process
* Hit the app once via HTTP as part of the startup to get the app going
* Put in a hard coded delay like 30 seconds to give the app time to start
(.NET penalty)
* Start the healthcheck after the delay
* Add it to the router when passing
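
The proposed sequence above could be sketched as follows. This only illustrates the ordering, in Python: `warm_up`, `healthcheck`, and `register_route` are hypothetical callables standing in for the real lifecycle steps, the 30 second delay is the hard-coded figure from the list, and `poll_interval` and `max_attempts` are made-up values.

```python
import time


def start_instance(warm_up, healthcheck, register_route,
                   startup_delay=30.0, poll_interval=1.0, max_attempts=60,
                   sleep=time.sleep):
    """Warm the app, wait, then poll the healthcheck before routing traffic.

    Returns True once the instance has been registered with the router,
    or False if the healthcheck never passed within max_attempts.
    """
    warm_up()                 # one HTTP hit to trigger app start-up
    sleep(startup_delay)      # fixed grace period before any checks run
    for _ in range(max_attempts):
        if healthcheck():
            register_route()  # only route traffic once the check passes
            return True
        sleep(poll_interval)
    return False
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without actually waiting 30 seconds.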

Just brainstorming. :-)

Aaron





Aaron Huber
 

Just checking in to make sure this isn't forgotten - any update on plans to
address this in the near future?

Aaron


