CF app that helps with self-healing


Siva <mailsiva@...>
 

Dear CF community,
We are trying to find a way to restart selected instances of an app, or a specific app, on an as-needed basis in response to alerts from our monitoring solution. One option we are considering is to deploy a self-healing app in CF that exposes REST endpoints which our alert policies can call to perform those actions for us. This self-healing app would essentially have the capabilities of the CF CLI for stopping and starting services and instances, and it would be protected by UAA.
Before we go off and start developing this app, I wanted to check whether anyone in the CF community has considered this approach before and has a solution in place, or has any ideas to consider.

Thanks,
Siva Balan


Daniel Mikusa
 

Not sure I totally get what you are asking, but `cf restart-app-instance` will restart an instance, so if you have an alert trigger a script, you could script the restart.
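As a rough illustration of scripting the restart, here is a minimal sketch in Python. It assumes the `cf` binary is on the PATH and is already logged in and targeted at the right org and space; the function names are made up for this example.

```python
import subprocess


def build_restart_command(app_name: str, instance_index: int) -> list[str]:
    """Return the argv for `cf restart-app-instance APP_NAME INDEX`."""
    if instance_index < 0:
        raise ValueError("instance index must be non-negative")
    return ["cf", "restart-app-instance", app_name, str(instance_index)]


def restart_instance(app_name: str, instance_index: int) -> None:
    """Run the restart; raises CalledProcessError if the cf CLI reports failure."""
    subprocess.run(build_restart_command(app_name, instance_index), check=True)
```

An alert hook could then call something like `restart_instance("my-app", 2)`, where the app name and index come from the alert.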

Or you could have the app itself detect when it gets into a bad state (presumably it could, if it's emitting the metrics that indicate this) and exit. When it exits, the platform will just restart the app.
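A minimal sketch of that second option, assuming the app can express "bad state" as a callable that returns False. The names `is_healthy`, `start_watchdog`, and the thresholds are illustrative, not anything the platform provides.

```python
import os
import threading
import time
from typing import Callable


def should_exit(consecutive_failures: int, max_failures: int) -> bool:
    """True once the check has failed max_failures times in a row."""
    return consecutive_failures >= max_failures


def start_watchdog(is_healthy: Callable[[], bool],
                   interval_secs: float = 10.0,
                   max_failures: int = 3) -> threading.Thread:
    """Poll is_healthy in a background thread and exit the process when it keeps failing."""
    def loop() -> None:
        failures = 0
        while True:
            failures = 0 if is_healthy() else failures + 1
            if should_exit(failures, max_failures):
                # os._exit ends the whole process even from a non-main thread;
                # the platform then sees a crashed instance and restarts it.
                os._exit(1)
            time.sleep(interval_secs)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

Requiring several consecutive failures is a design choice to avoid exiting on a single transient spike.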

Dan




Siva <mailsiva@...>
 

Hi Daniel,
Thanks for your response.
I am aware of all the options you are suggesting. What we are looking for, though, is a way to restart an app instance without human intervention, triggered by an alert policy in our monitoring system. This monitoring system is outside of CF and does not have access to the CF CLI, but it can access REST endpoints.
For example, the monitoring system detects high CPU utilization on one of the app instances. It raises an alert, which triggers a policy that calls a REST endpoint of this self-healing app. Based on the parameters passed in the request, the self-healing app restarts the requested app instance.
This is required when the app does not know that it is in a bad state, but some metrics we are tracking indicate that the app instance needs to be restarted.
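To make the flow concrete, here is a minimal sketch of how the self-healing app might validate the parameters of an incoming alert before acting on them. The JSON field names `app_name` and `instance_index` are assumptions for illustration; the real payload depends on the monitoring system.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RestartRequest:
    app_name: str
    instance_index: int


def parse_alert(body: bytes) -> RestartRequest:
    """Turn a JSON alert body into a validated restart request."""
    payload = json.loads(body)
    app_name = payload.get("app_name")
    index = payload.get("instance_index")
    if not isinstance(app_name, str) or not app_name:
        raise ValueError("app_name must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(index, bool) or not isinstance(index, int) or index < 0:
        raise ValueError("instance_index must be a non-negative integer")
    return RestartRequest(app_name, index)
```

The endpoint handler would call `parse_alert` on the request body and pass the result to whatever performs the restart.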
Hope that makes sense.

Thanks
Siva





Daniel Jones
 

Hi Siva,

I'm not aware of a similar solution that already exists. A couple of thoughts:
  • Could you use HTTP healthchecks, and have the endpoint return a non-200 status code if the app detects high CPU usage itself?
  • Be mindful of how CPU usage is reported. Whilst current containerisation tech can limit how many CPU shares a process gets, it can't control the system calls that report how much CPU is available. Hence things like `top` will appear inaccurate, and you should ensure the CPU usage statistics come from the metrics that feed into the cpu-entitlement-plugin. If you want to double-check this, there's a blog post (https://www.cloudfoundry.org/blog/better-way-split-cake-cpu-entitlements/) and the folks in the #garden channel are awfully helpful.
  • Having an endpoint that allows remote termination of an app sounds like a bit of a security risk, but I'm sure you'll manage that appropriately.
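The first suggestion could look roughly like the sketch below, using only the Python standard library. The `/health` path, the 0.9 threshold, and the `read_cpu_usage` callable are assumptions; per the second point, that callable should read from a trustworthy metrics source rather than from `top`-style system calls.

```python
from http.server import BaseHTTPRequestHandler
from typing import Callable


def health_status(cpu_usage: float, threshold: float = 0.9) -> int:
    """Return 503 when CPU usage (as a fraction of entitlement) is too high, else 200."""
    return 503 if cpu_usage >= threshold else 200


def make_handler(read_cpu_usage: Callable[[], float]) -> type:
    """Build a request handler class whose /health endpoint reflects current CPU usage."""
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/health":
                self.send_error(404)
                return
            self.send_response(health_status(read_cpu_usage()))
            self.end_headers()

    return HealthHandler
```

The handler could be served with `http.server.HTTPServer` on the port the platform assigns, and the app's health check set to type `http` with `/health` as the endpoint, so that a non-200 response leads the platform to restart that instance.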

Regards,
Daniel 'Deejay' Jones - CTO
+44 (0)79 8000 9153
EngineerBetter Ltd - More than cloud platform specialists




Daniel Mikusa
 



On Fri, Jan 24, 2020 at 5:28 PM Siva <mailsiva@...> wrote:
Hi Daniel,
Thanks for your response.
I am aware of all the options you are suggesting. But what we are looking for is a process to restart an app instance without human intervention from an alert policy in our monitoring system. This monitoring system is outside of CF and does not have access to CF CLI. But it can access REST endpoints.

The cf CLI is just a glorified REST client. If you can access the Cloud Controller API for your foundation, you can do everything I mentioned without the cf CLI by using raw REST calls.
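For example, a raw call to stop one instance (which the platform then replaces) might be built as in this sketch. It assumes the Cloud Controller v3 endpoint `DELETE /v3/apps/:guid/processes/:type/instances/:index` is available on your foundation and that a valid UAA access token has already been obtained; check the API docs for your version before relying on it.

```python
import urllib.request


def build_terminate_request(api_url: str, app_guid: str, index: int,
                            token: str, process_type: str = "web") -> urllib.request.Request:
    """Build (but do not send) the request that terminates one app instance."""
    url = (f"{api_url.rstrip('/')}/v3/apps/{app_guid}"
           f"/processes/{process_type}/instances/{index}")
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"bearer {token}"},
    )
```

Sending it is then a matter of passing the result to `urllib.request.urlopen`.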


+1 to everything Daniel Jones said in his response.

Hope that helps!

Dan

 


Troy Topnik
 

Ideally you'd want to trace the application misbehavior to a root cause in the application itself, but I think we've all been in the situation where "turn it off and on again" is an easier solution. :)

I wonder if this could be a feature request for App-AutoScaler? It already has access to the metric types required for the operation, but it would need to be able to take a policy action based on those metrics other than scaling up or down (e.g. "adjustment" : "restart" ).
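To picture the idea, here is a sketch of how a rule with a "restart" adjustment might be evaluated. This is purely hypothetical: "restart" is the proposed value, not something App-AutoScaler supports, and the rule fields shown are only loosely modelled on its scaling rules.

```python
import operator
from typing import Optional

_OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def action_for(rule: dict, metric_value: float) -> Optional[str]:
    """Return the rule's adjustment if the metric breaches the threshold, else None."""
    compare = _OPERATORS[rule["operator"]]
    return rule["adjustment"] if compare(metric_value, rule["threshold"]) else None


# Hypothetical rule: restart instead of scaling when CPU stays at or above 90%.
restart_rule = {
    "metric_type": "cpu",
    "operator": ">=",
    "threshold": 90,
    "adjustment": "restart",  # hypothetical value, not currently supported
}
```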

TT

--
Troy Topnik
Senior Product Manager, 
SUSE Cloud Application Platform 
troy.topnik@...
 


Siva <mailsiva@...>
 

Thanks Daniel J and Daniel M for your inputs.
Troy - We are also thinking along those lines, to see if we can use the App Autoscaler for the restarts.

-Siva





Hjortshoj, Julian <Julian.Hjortshoj@...>
 

To me this seems a lot like a health check.  Is there some reason that you couldn't add a health check endpoint to your app instances (either directly, or as a sidecar) and then let CF take care of restarting the app instances for you?

