Issue with crashing Windows apps on Diego
Aaron Huber
We've started testing Windows apps on Diego in our lab and everything appears to be working correctly except for occasional crashes of the .NET apps. The frequency is very random: sometimes I can go a day or more without any, and then I'll get many in a day. As far as I can tell from the logs, the only issue is that the healthcheck in the lifecycle is timing out because it exceeds the 1 second wait here:

https://github.com/cloudfoundry/windows_app_lifecycle/blob/master/Healthcheck/Program.cs#L29

Our test environment is definitely running on very slow storage, so it doesn't surprise me that it gets a bit slow sometimes, but taking more than 1 second to respond to a simple HTTP request still seems unlikely, and that worries me. I've looked through the logs and can't find any indication of root cause other than the healthcheck returning exit code 1 instead of zero:

{"timestamp":"1454113322.534542084","source":"garden-windows","message":"garden-windows.garden-server.run.spawned","log_level":1,"data":{"handle":"c41ecf17-6e8c-4b50-a103-4e32323ef53e-bdfa601f-0a44-48fd-8d05-e5551ac9af7a-3a193046-43ed-4811-7bc4-3595809a409c","id":"5920","session":"1.104644","spec":{"Path":"/tmp/lifecycle/healthcheck","Dir":"","User":"vcap","Limits":{"nofile":1024},"TTY":null}}}
{"timestamp":"1454113324.545698404","source":"garden-windows","message":"garden-windows.garden-server.run.exited","log_level":1,"data":{"handle":"c41ecf17-6e8c-4b50-a103-4e32323ef53e-bdfa601f-0a44-48fd-8d05-e5551ac9af7a-3a193046-43ed-4811-7bc4-3595809a409c","id":"5920","session":"1.104644","status":1}}
{"timestamp":"1454113324.987732887","source":"garden-windows","message":"garden-windows.garden-server.destroy.destroyed","log_level":1,"data":{"handle":"c41ecf17-6e8c-4b50-a103-4e32323ef53e-bdfa601f-0a44-48fd-8d05-e5551ac9af7a-3a193046-43ed-4811-7bc4-3595809a409c","session":"1.104647"}}

There are no other event log messages at the same time to indicate anything is wrong on the system.

Theoretically I could just try increasing the wait time on the healthcheck, but I'd love to get some more data on exactly what's going on. Anyone have any ideas?

Aaron Huber
Intel Corporation

--
View this message in context: http://cf-dev.70369.x6.nabble.com/Issue-with-crashing-Windows-apps-on-Diego-tp3586.html
Sent from the CF Dev mailing list archive at Nabble.com.
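One rough way to get a little more data out of the log entries above is to subtract the "spawned" timestamp from the "exited" timestamp for the healthcheck process. The sketch below is illustrative Python and not part of the original message: it uses trimmed copies of the two entries (only the fields it reads are kept, and the long container handle is omitted), and the helper name `process_runtimes` is made up for this example.

```python
import json
from decimal import Decimal

# Trimmed copies of the "spawned" and "exited" entries quoted above.
# Assumption: only the fields read below are kept; the handle is omitted.
LOG_LINES = [
    '{"timestamp":"1454113322.534542084",'
    '"message":"garden-windows.garden-server.run.spawned",'
    '"data":{"id":"5920"}}',
    '{"timestamp":"1454113324.545698404",'
    '"message":"garden-windows.garden-server.run.exited",'
    '"data":{"id":"5920","status":1}}',
]


def process_runtimes(lines):
    """Return {process id: (runtime in seconds, exit status)} for each
    spawned/exited pair found in the given JSON log lines."""
    started = {}
    finished = {}
    for line in lines:
        entry = json.loads(line)
        pid = entry["data"]["id"]
        # Decimal keeps the nanosecond digits that a float would round away.
        stamp = Decimal(entry["timestamp"])
        if entry["message"].endswith("run.spawned"):
            started[pid] = stamp
        elif entry["message"].endswith("run.exited") and pid in started:
            finished[pid] = (stamp - started[pid], entry["data"]["status"])
    return finished


runtimes = process_runtimes(LOG_LINES)
for pid, (seconds, status) in runtimes.items():
    print(f"process {pid}: ran {seconds}s, exit status {status}")
```

For these two entries the healthcheck process ran for roughly 2.01 seconds before exiting with status 1. The log lines alone do not say how that time splits between process startup and the HTTP wait.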
|
Steven Benario
Hi Aaron,
Thanks for the report! I'd recommend either extending the healthcheck timeout, or disabling health checks completely to see if that fixes the problem. You can do this with:

`cf set-health-check APPNAME none`

If that doesn't fix the problem, is the app something you can share with the CF Windows development team?

Thanks,
Steven Benario
Cloud Foundry PM for Greenhouse

On Fri, Jan 29, 2016 at 4:45 PM, aaron_huber <aaron.m.huber(a)intel.com> wrote:
> We've started testing Windows apps on Diego in our lab and everything
|
Aaron Huber
The app is just a simple one page test app we've been using since we landed Iron Foundry. Here is the content of our default.aspx:

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="Default.aspx.cs" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
        Hello from Iron Foundry!
    </div>
    <div>
        <% Response.Write(".NET Framework Version: " + System.Environment.Version.ToString() ); %>
    </div>
    </form>
</body>
</html>

I can't imagine it's causing any problem. :-)

I'll try turning off the healthcheck in CF and run healthcheck.exe in the app directory and see if I can get any more data, thanks for the suggestion.

Aaron
|
Aaron Huber
I shut off the healthcheck via the CLI and then started a separate process calling healthcheck.exe on my test servers after setting the port environment variable:

Looking at the data from running all day, for the most part it looks normal, but one time it did time out:

The same instances have been up and running all day long; the timeout was just a timeout, and then it came back just fine.

This has been nagging at me all weekend and I think I finally figured out why. So far all healthchecks in Cloud Foundry have been either on the PID (process didn't crash) or the port (accepting TCP connections). This is the first time I've seen one that is actually doing an HTTP check that must pass for the "container" (such as it is on Windows) to be considered healthy. Looking at the Linux healthcheck code, it looks like there is a "uri" healthcheck:

https://github.com/cloudfoundry-incubator/healthcheck/blob/master/cmd/healthcheck/main.go#L49-L53

But as far as I can tell it's unused because only port is ever called:

https://github.com/cloudfoundry-incubator/nsync/blob/master/recipebuilder/recipe_builder.go#L97-L98

In addition, all the documentation and even the help text on the CLI describe this as a "port" healthcheck. It's bad enough that doing the HTTP healthcheck means it's now inconsistent between Linux and Windows on Diego, but the following are serious concerns for me:

1) Especially on .NET it can take a while for apps to start up, and it's likely we could get into a loop of starting and then killing containers because we don't give them enough time to start up.

2) Even if all is working well, we've now hard coded that any app landed on garden-windows has to have a faster than 1 second HTTP response time or it just can't land. What if my developer has an app that is expected to be slow due to back-end dependencies or processing logic?

In my opinion, we need to change the Windows app lifecycle healthcheck at https://github.com/cloudfoundry/windows_app_lifecycle/blob/master/Healthcheck/Program.cs to be consistent with Linux. Instead of doing an HttpClient() get request we should be doing a TcpClient() connect. In that case a 1 second timeout should be fine - all I care about is that the port is open and listening, not how long it takes for the actual app to process and respond. As a platform owner it's my job to make sure the app is there and able to be connected to on the network - the actual HTTP response and how long it takes should be the developer's concern.

Thoughts?

Aaron
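The difference between the two kinds of check can be sketched in a few lines. What follows is an illustrative Python sketch, not the C# lifecycle code or the Linux healthcheck binary; the function name `port_is_open` is an assumption made for this example. It only asks whether a TCP connection can be opened within the timeout and never sends an HTTP request.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened within
    `timeout` seconds. Nothing is sent, so application response time and
    HTTP status codes play no part in the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Demo against a throwaway local listener (hypothetical stand-in for an app).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

open_while_listening = port_is_open("127.0.0.1", demo_port)
listener.close()
open_after_close = port_is_open("127.0.0.1", demo_port)

print(open_while_listening, open_after_close)
```

Note that the listener in the demo never calls accept() or answers anything, yet the check passes while it is listening. That is the trade-off under discussion: a port check says the socket is there, not that the app behind it works.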
|
Aaron Huber
Also just occurred to me - what if the page returns a 302, 401, or 404? I'm guessing it would make the healthcheck fail because it wouldn't be a match against Result.IsSuccessStatusCode. :-(

Aaron
|
Matthew Horan
On Mon, Feb 1, 2016 at 7:06 PM, aaron_huber <aaron.m.huber(a)intel.com> wrote:
Hey Aaron - You're right; it looks like only the port check is ever used. I don't have the history as to why we (the CF .NET team) implemented an HTTP check instead of a port check, but that's how it is.

> In addition, all the documentation and even the help text on the CLI describe this as a "port" healthcheck. It's bad enough that doing the HTTP

There is a proposal [1] in place to address your concerns. As far as I know, work towards implementing this proposal is stalled, but I've looped in Eric for more details.

[1] https://github.com/cloudfoundry-incubator/diego-dev-notes/issues/31
|
Matthew Horan
On Tue, Feb 2, 2016 at 9:31 AM, Matthew Horan <mhoran(a)pivotal.io> wrote:
> On Mon, Feb 1, 2016 at 7:06 PM, aaron_huber <aaron.m.huber(a)intel.com>

In talking with a former team member, I came across the story [1] where we made this change. The WebAppServer will listen on the port immediately upon starting, even if the app has not successfully loaded. That made a plain port check undesirable for the common case, but the HTTP check obviously causes issues for slow apps, or apps which require authentication. As mentioned in the story, the developers pointed out that this behavior should be configurable, but this was never implemented.

Hopefully we can see some progress on the proposed healthcheck changes, which would better address your issue. In the meantime, I'm not sure of the best course of action. It's quite easy to push an unlaunchable app to Windows, and there will be little to no debug information available to help the developer figure out why their app is inaccessible. The current implementation has its drawbacks, but can be worked around by "disabling" the health check.

[1] https://www.pivotaltracker.com/story/show/96080778
|
Aaron Huber
I agree with your root argument that the port check doesn't really address application health and it's easy to push a non-working app and have the healthcheck still pass. My argument is that this is exactly how the healthchecks work for Linux-based apps, and it seems clear that is the intent of the "port" healthcheck. Any buildpack or Docker based app that I push on cflinuxfs2 will pass as soon as the web server starts accepting connections even if the actual app isn't working (yet, or at all).

I don't disagree that improvement can be made here, but I do strongly believe that 1) the platform should be consistent across Linux and Windows apps and what is described as a "port" check should just be checking the port, and 2) any HTTP check should be configurable (either opt-in or opt-out) in cases where the root of an app isn't expected to return a 200, of which there are many valid cases.

Your proposed work-around is in my opinion even worse, in that I have to disable any container checking at all if an app falls outside of what you consider typical. I think we agree that the best solution for most common apps is to use an HTTP check, but in order for that to be functional I think the platform would need to define a new "http" healthcheck type and allow the user to configure a timeout and expected status code (with defaults of 1 second and 200).

Aaron
|
Aaron Huber
Just to clarify as well why I think this is so important - a majority of apps on our internal platforms require authentication and will return a 401 on the root page, making them unusable on Diego for Windows without completely disabling the healthchecks. These same apps work just fine on Iron Foundry because it was only checking the port. I'd love to move forward with Garden Windows support when we land Diego but for now I don't see how we can.

Aaron
|
Matthew Horan
On Tue, Feb 2, 2016 at 12:21 PM, aaron_huber <aaron.m.huber(a)intel.com> wrote:

> I don't disagree that improvement can be made here, but I do strongly

Setting the health check to none does not actually disable health checks. This setting simply disables the HTTP healthcheck, and Diego will continue to monitor the process. I'm not sure why this setting does not meet your immediate requirements. A deadlocked or misconfigured WebAppServer would pass both a simple TCP check and a process check, while the current healthcheck implementation would detect an issue. Given the process check runs regardless of whether the healthcheck is enabled, the more reliable (though sometimes undesirable) HTTP check can simply be disabled, and the process will still be monitored by Diego.

> I think we agree that the best solution for most common apps is to use an

Please see the proposal [1] currently being discussed. We plan to expose a multitude of options for healthcheck, including simple port check, HTTP check with configurable endpoint, and timeouts.

We've also dropped a story in our backlog [2] to bring our healthcheck in line with Linux. However, any stories to implement the proposed healthcheck improvements would likely be prioritized before this effort. Regardless, garden-windows is open source, and pull requests are welcome!

[1] https://github.com/cloudfoundry-incubator/diego-dev-notes/issues/31
[2] https://www.pivotaltracker.com/story/show/112914163
|
Aaron Huber
My concern is that the HTTP check (mislabeled as "port") would still be the default and I'd have to expect users to opt out of it per app. It's confusing and not what users of the platform have come to expect moving from DEA/IF.

In general, the HTTP checks still make me nervous as a platform owner. They are nice in theory as long as they are opt-in for the developer, but what happens when something goes wrong? For example, say I have an app dependent on a back-end resource (database, web service, etc.) that is down, and as a result my app is returning a friendly error page with a 500 response. With an HTTP healthcheck my app is now effectively down with an ugly 404 message from the router, because all containers will fail the check and will never become healthy after respawning, since they never return a 200. Is that a better user experience than the friendly error page? How long will Diego continue trying to start the unhealthy containers before it gives up and then requires developer interaction to start the app again?

To close on this, I think the new story is essential for consistency of the overall platform and to avoid the issues above, and I would argue strongly that it should be completed ASAP. Once the improved story is in place then my customers could opt into an HTTP check with adequate knowledge of the potential impacts.

Aaron
|
Eric Malm <emalm@...>
Hi, Aaron and Matt,
Thanks for the thoughtful discussion of the Windows health-check issue. I too think for consistency that if the CF end user has specified 'port' as the type of health-check on their app, then the platform should be checking only TCP connectivity to the app on that port, and not any layer-7 functionality beyond that.

Some background on the HTTP vs TCP behavior in the health-check: originally, the health-check binary used for the buildpack and docker app lifecycles made only TCP connections to the requested port. When Lattice made it possible to submit DesiredLRPs directly to the Diego API, we got feedback from its users that they wanted an option to specify an HTTP-based health-check as well. Consequently, we extended that health-check binary to take an optional endpoint flag, and in its presence the binary would make a GET request to the specified endpoint and check for a response with a 200 OK status code within the specified timeout (default 1s). For buildpack and docker CF apps, though, none of that HTTP functionality has been exposed through CC, and only the basic TCP connectivity check is available.

Matt, the native NetCheckAction from the Diego Dev Notes proposal you mention is effectively just encoding the current behavior of that TCP-or-HTTP health-check binary as an action that the rep could perform itself, rather than by invoking that binary in-container. The Diego team had conceived of it primarily as a performance optimization, particularly when starting a lot of instances on a cell simultaneously, but investigation revealed it to be of secondary benefit at best. The Diego team might implement it at some point, but for now we'd prefer not to expand the surface area of the Diego BBS API to include it. I've been meaning to update and close out that Dev Notes issue, and will do so shortly.

In any case, the options on that proposed NetCheckAction are just the ones already available on the health-check binary, and, native action or not, additional work would still be required to expose them through CC to the CF end-user. Moreover, I don't think they're sufficient to address all the concerns that Aaron raises in his observations about the Windows app lifecycle's current HTTP-based check.

Aaron, you mentioned timeout and expected status code as important parameters to specify on an HTTP health-check; are there others? I would think endpoint could be just as useful: perhaps your app has a /health or /ping endpoint specifically designed to return a fast response about the app itself, separate from backing services and/or authentication checks, or perhaps it simply doesn't handle requests to /.

Thanks,
Eric

On Tue, Feb 2, 2016 at 1:50 PM, aaron_huber <aaron.m.huber(a)intel.com> wrote:
> My concern is that the HTTP check (mislabeled as "port") would still be the
|
Aaron Huber
Yes, I agree that setting the specific URI to check would be necessary as well so that developers could avoid some of the other concerns. So the ones I can think of:

* URI / endpoint
* Expected status codes - this would probably need to be a range or an array, or even an array of ranges :-)
* Timeout

Aaron
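To make the three parameters concrete, here is a rough Python sketch of what such a configurable HTTP check could look like. It is an illustration only, not an existing Cloud Foundry, Diego, or CLI interface: the names `HttpCheck`, `status_matches`, and `run_check` are invented for this example, and the defaults (1 second, 200) are the ones suggested earlier in the thread.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class HttpCheck:
    """Hypothetical health-check settings; not a real CF/Diego structure."""
    endpoint: str = "/"
    timeout: float = 1.0
    # "An array of ranges": inclusive (low, high) status code pairs.
    expected: tuple = ((200, 200),)


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so a 302 is judged as a 302."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def status_matches(status, expected):
    """True if `status` falls inside any inclusive (low, high) range."""
    return any(low <= status <= high for low, high in expected)


def run_check(base_url, check):
    """GET base_url + endpoint and compare the status with the expected ranges.
    Connection failures and timeouts count as unhealthy."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(base_url + check.endpoint, timeout=check.timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 3xx (when not followed), 4xx and 5xx; the status
        # code is still available and may be one the developer expects.
        status = err.code
    except OSError:
        return False
    return status_matches(status, check.expected)


# An app whose health endpoint is slow and answers 401 until logged in:
auth_app = HttpCheck(endpoint="/health", timeout=5.0,
                     expected=((200, 299), (401, 401)))
print(status_matches(401, auth_app.expected))     # True
print(status_matches(401, HttpCheck().expected))  # False
```

With ranges expressed this way, an app that returns a 401 on its root page can opt in to an HTTP check without being marked unhealthy, while the default still accepts only a 200.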
|
Eric Malm <emalm@...>
Thanks, Aaron, that's extremely helpful. I'll start a separate thread on
cf-dev shortly soliciting more input on how the community would find richer health checks useful, but this specification seems like an excellent starting point. Best, Eric On Wed, Feb 3, 2016 at 9:55 AM, aaron_huber <aaron.m.huber(a)intel.com> wrote:
> Yes, I agree that setting the specific URI to check would be necessary as
|
Aaron Huber
Based on this discussion, where are we on the priority of switching the current "port" check for the Windows lifecycle back to actually be a port check? I get the impression that the changes to support a new HTTP check in the CC, CLI, BBS, etc. will probably take a while, so until then I'm hoping we can make the other change a bit quicker.

Aaron
|
Steven Benario
Hi Aaron,
You can track the progress of the story for DiegoWindows here on the public tracker [1]. As it stands, we don't yet have a solution that we could do within the DiegoWindows codebase that wouldn't break existing applications by allowing them to return "healthy" before the app has even started up.

I absolutely agree that having an inconsistent pattern between Linux and Windows is something to avoid (and something that is mis-labeled is even worse), but I can totally see how this decision was made originally, and I don't yet have any ideas for something that could fix it in the short term.

I think long term, we'd like to see a general healthcheck that looks like some combination or user-selection of:

- Process monitoring
- Port check
- HTTP check (with configuration options previously discussed)

...with some "sane" settings selected by default.

For the short term, until we have a strong proposal of what to do to significantly improve the state of the world without breaking existing applications, we will probably not make any changes.

Thanks,
Steven Benario
PM for Windows Support

[1] https://www.pivotaltracker.com/story/show/112914163

On Mon, Feb 8, 2016 at 1:21 PM, aaron_huber <aaron.m.huber(a)intel.com> wrote:
> Based on this discussion, where are we on the priority of switching the
|
Aaron Huber
I understand what you're trying to avoid, I just think that is actually the normal case for the port healthchecks. Nothing on the Linux or Docker side ever touches the app, so it's entirely possible it will be added to the router without it actually working, and that is what I expect the platform to do. Hopefully the more generic HTTP check can be added quickly to all the right places so that we'll at least have more sensible options.

Now we just have to decide if we hang onto Iron Foundry, which just uses a port check, until then, or try to explain to my users that most of their apps won't work unless they turn off the healthcheck. I'm expecting most of them won't RTFM and we'll get constant complaints about how our .NET support is broken because their apps won't start up.

Aaron
|
Steven Benario
My understanding is that because the app droplet on Linux typically includes the webserver itself (as opposed to Windows, where the server is run by the host), it would be rare for the web server to be available before the app is up and running. On Windows, it would be the common case for the web server to start accepting TCP connections almost immediately, and you could wait a long time before the app is ready. Hence the discrepancy.

Thanks for understanding and weighing in. Looking forward to hearing more about how disabling the checks works in your environment -- and of course keep an eye out here for the proposal and updated timeline on the more robust checks.

Cheers,
Steven

On Mon, Feb 8, 2016 at 4:49 PM, aaron_huber <aaron.m.huber(a)intel.com> wrote:
> I understand what you're trying to avoid, I just think that is actually the
|
Aaron Huber
It will totally depend on the app/buildpack. For example, the static file buildpack and PHP buildpack just launch Nginx and then host the application inside it. As soon as the web server is up it will accept connections, so they would work identically to IIS HWC with just a TCP healthcheck. For others the framework would still likely start up and accept connections before the app itself is ready, and again it would be very possible that the app itself would crash the first time you actually hit it but the healthcheck would still think the container is healthy. Again, I'm not arguing that any of that is "good", just that this is how the platform is expected to work with a port check and it should work consistently.

I also agree that the (annoying) 30-60 second app warmup on .NET makes this even uglier. Assuming you do eventually make the port healthcheck for Windows actually check the port, it still needs to work well with that warmup. My understanding right now is you do the following (high level):

* Spin up the "container" via the app lifecycle (create user, set quota, create FW rules, etc.)
* Start up the HWC process
* Start running the healthcheck which hits the root of the app and checks for 200-299 with a 1s timeout
* Add it to the router once the healthcheck passes

What if you did something like this:

* Spin up the container
* Start up the HWC process
* Hit the app once via HTTP as part of the startup to get the app going
* Put in a hard coded delay like 30 seconds to give the app time to start (.NET penalty)
* Start the healthcheck after the delay
* Add it to the router when passing

Just brainstorming. :-)

Aaron
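The alternative startup sequence brainstormed above (warm-up request, fixed delay, then the healthcheck, then router registration) can be sketched as follows. This is a hedged Python illustration, not the Windows app lifecycle or Diego code: `start_instance` and its parameters are invented names, the container and HWC process are assumed to be running already, and the retry count and interval are arbitrary choices for the example.

```python
import time


def start_instance(warm_up, health_check, register_route,
                   startup_delay=30.0, interval=1.0, max_attempts=10,
                   sleep=time.sleep):
    """Run the proposed sequence once the container and HWC process exist.

    warm_up        -- callable that sends one HTTP request to wake the app
    health_check   -- callable returning True when the instance is healthy
    register_route -- callable that adds the instance to the router
    sleep          -- injectable so the delay can be skipped in tests
    Returns True if the instance was registered, False if it never passed.
    """
    try:
        warm_up()
    except Exception:
        # The first request only exists to trigger app startup, so a slow
        # or failed response is deliberately ignored here.
        pass

    # Hard coded grace period for the .NET warm-up penalty.
    sleep(startup_delay)

    for _ in range(max_attempts):
        if health_check():
            register_route()
            return True
        sleep(interval)
    return False


# Dry run with stand-in callables and no real waiting.
events = []
registered = start_instance(
    warm_up=lambda: events.append("warm-up"),
    health_check=lambda: events.append("check") or True,
    register_route=lambda: events.append("route"),
    sleep=lambda seconds: events.append(("sleep", seconds)),
)
print(registered, events)
```

Keeping the delay and the check as separate steps means the healthcheck itself can stay a simple port check; the fixed delay is what absorbs the warm-up time.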
|
Aaron Huber
Just checking in to make sure this isn't forgotten - any update on plans to address this in the near future?

Aaron
|