[Proposal] Wanted a Babysitter for my application. ;-)


Dhilip
 

Hello CF,

Greetings from Huawei. Here is a quick idea that came to mind recently. Honestly, we did not spend much time brainstorming this internally; we thought we would go ahead and ask the community directly. It would be a great help to know whether such an idea has already been considered and dropped by the community.

Proposal Motivations

The way health checks are currently performed in Cloud Foundry is to run a command <https://github.com/cloudfoundry-incubator/healthcheck> periodically; if the exit status is non-zero, the application is assumed to be non-responsive. This process is repeated periodically for all applications. Doesn't that mean we scan the entire data center frequently just to find one or a few misbehaving apps?



Why can’t we change the way health checks are done? Can they reflect the real world? Hospitals don’t periodically scan the entire community looking for sick residents. Similarly, why can’t we report problems as and when they occur, just like in the real world?



How about a lightweight process that constantly monitors the application’s health and reports whenever an app is down or non-responsive? In a huge datacenter, thousands of apps are hosted and each app has many instances. Wouldn’t it be better to have each app/container come and tell us (the health manager) that there is a problem, instead of scanning all of them? This is a push versus pull model: something like a babysitter residing within each container and taking care of the ‘app’ hosted by our customers.


How to accomplish this?
Our proposal is BabySitter (BS): an agent residing within each container, optionally deployed using app-specific configuration. This agent sends the metrics it collects (periodic time-series information etc.) to the health monitor and reports any anomaly. The agent should remember the configured threshold value that each app should not exceed, and automatically raise an alarm to the health monitor on any threshold violation. The alarm could even be sent many times a second to the health monitor, depending on the severity of the event, whereas the regular ‘time-series’ information could be collected every second but sent only once a minute to the HM. The challenge is to design the ‘bs’ application to be as lightweight as possible.
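
To make the collect-every-second, send-once-a-minute idea concrete, here is a minimal Go sketch of such an agent. It is only an illustration under stated assumptions: the `Reporter` interface, the `Babysitter` type and the `logReporter` stand-in are hypothetical names, not part of Cloud Foundry or the existing health manager, and a single numeric metric with one threshold is assumed.

```go
package main

import (
	"fmt"
	"time"
)

// Sample is one metric reading taken inside the container.
type Sample struct {
	At    time.Time
	Value float64
}

// Reporter stands in for the health monitor endpoint (hypothetical interface).
type Reporter interface {
	Alarm(s Sample)
	Batch(samples []Sample)
}

// Babysitter buffers samples and pushes an alarm on each threshold violation.
type Babysitter struct {
	Threshold float64
	Out       Reporter
	buf       []Sample
}

// Observe records one sample; a value above the threshold is reported at once.
func (b *Babysitter) Observe(s Sample) {
	b.buf = append(b.buf, s)
	if s.Value > b.Threshold {
		b.Out.Alarm(s)
	}
}

// Flush sends the buffered time series, if any, and clears the buffer.
func (b *Babysitter) Flush() {
	if len(b.buf) == 0 {
		return
	}
	b.Out.Batch(b.buf)
	b.buf = nil
}

// Run samples every collectEvery and flushes every sendEvery until stop is closed.
func (b *Babysitter) Run(read func() float64, collectEvery, sendEvery time.Duration, stop <-chan struct{}) {
	collect := time.NewTicker(collectEvery)
	send := time.NewTicker(sendEvery)
	defer collect.Stop()
	defer send.Stop()
	for {
		select {
		case t := <-collect.C:
			b.Observe(Sample{At: t, Value: read()})
		case <-send.C:
			b.Flush()
		case <-stop:
			b.Flush()
			return
		}
	}
}

// logReporter prints instead of talking to a real health monitor.
type logReporter struct{ alarms, batches int }

func (l *logReporter) Alarm(s Sample) {
	l.alarms++
	fmt.Printf("ALARM value=%.1f\n", s.Value)
}

func (l *logReporter) Batch(samples []Sample) {
	l.batches++
	fmt.Printf("batch of %d samples\n", len(samples))
}

func main() {
	rep := &logReporter{}
	bs := &Babysitter{Threshold: 80, Out: rep}
	for _, v := range []float64{10, 95, 20} {
		bs.Observe(Sample{At: time.Now(), Value: v})
	}
	bs.Flush()
}
```

In a real agent, `Run` would be started as a goroutine with something like `time.Second` and `time.Minute` as the two intervals; `main` above drives `Observe` and `Flush` directly so the behaviour is easy to follow.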

This is our primary idea. We also thought it would make sense to add a few more capabilities to the babysitter, such as sshd (as a goroutine) and a file server (as a goroutine), but before we bore you with all those details, we first want to understand what the CF community thinks about this initial idea.

Thanks in advance,
Dhilip


Matt Curry
 

We have been talking about something similar that we have labeled the Angry Farmer. I do not think you would need an agent. The firehose and cloud controller should have everything that you need. Also, an agent does not give you the ability to really measure the performance of instances relative to each other, which is a good indicator of bad state or performance.
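
As a rough illustration of comparing instances relative to each other, the Go sketch below flags instances whose metric is far above the median of all instances of the same app. Everything here is an assumption made for the example: the `Outliers` function, the factor-of-median rule and the sample latencies are hypothetical, and collecting the per-instance values from the firehose is not shown.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the median of vals without modifying the input.
func median(vals []float64) float64 {
	s := append([]float64(nil), vals...)
	sort.Float64s(s)
	n := len(s)
	if n == 0 {
		return 0
	}
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// Outliers returns the indexes of instances whose metric exceeds
// factor times the median across all instances.
func Outliers(byInstance []float64, factor float64) []int {
	m := median(byInstance)
	var out []int
	for i, v := range byInstance {
		if v > factor*m {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	// Hypothetical per-instance response latencies in milliseconds.
	latencyMs := []float64{21, 19, 23, 140, 20}
	fmt.Println(Outliers(latencyMs, 3)) // prints [3]
}
```

An agent inside a single container cannot make this comparison on its own, because it only sees its own instance.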

Matt

From: Dhilip Kumar S <dhilip.kumar.s(a)huawei.com>
Reply-To: "Discussions about Cloud Foundry projects and the system overall." <cf-dev(a)lists.cloudfoundry.org>
Date: Monday, October 5, 2015 at 9:31 AM
To: "Discussions about Cloud Foundry projects and the system overall." <cf-dev(a)lists.cloudfoundry.org>
Cc: Vinay Murudi <vinaym(a)huawei.com>, Krishna M Kumar <krishna.m.kumar(a)huawei.com>, Liangbiao <rexxar.liang(a)huawei.com>, Srinivasch ch <srinivasch.ch(a)huawei.com>
Subject: [cf-dev] [Proposal] Wanted a Babysitter for my application. ;-)



Niki Dokovski <nickytd@...>
 

Hi

As an example of this approach, a short description of a similar technique can be found in JavaSpaces (built on Java Jini) [1].
The notion of a “lease” as the foundation of a self-healing system architecture has been used successfully in the past.
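
A minimal Go sketch of the lease idea, under stated assumptions: each instance must renew its lease before it expires, and any instance whose lease has lapsed is treated as unhealthy. The `LeaseTable` type, its methods and the `demo` helper are hypothetical names for illustration; this is not the JavaSpaces/Jini API.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// LeaseTable tracks when each instance's lease runs out.
type LeaseTable struct {
	TTL     time.Duration
	expires map[string]time.Time
}

// NewLeaseTable creates an empty table with the given lease duration.
func NewLeaseTable(ttl time.Duration) *LeaseTable {
	return &LeaseTable{TTL: ttl, expires: make(map[string]time.Time)}
}

// Renew grants or extends the lease for id, starting from now.
func (t *LeaseTable) Renew(id string, now time.Time) {
	t.expires[id] = now.Add(t.TTL)
}

// Expired returns, in sorted order, the ids whose lease lapsed before now.
func (t *LeaseTable) Expired(now time.Time) []string {
	var out []string
	for id, exp := range t.expires {
		if now.After(exp) {
			out = append(out, id)
		}
	}
	sort.Strings(out)
	return out
}

// demo renews app-0 a second time but lets app-1's lease lapse.
func demo() []string {
	start := time.Date(2015, 10, 5, 0, 0, 0, 0, time.UTC)
	t := NewLeaseTable(30 * time.Second)
	t.Renew("app-0", start)
	t.Renew("app-1", start)
	t.Renew("app-0", start.Add(20*time.Second))
	return t.Expired(start.Add(40 * time.Second))
}

func main() {
	fmt.Println(demo()) // prints [app-1]
}
```

With a lease, silence itself is the failure signal, so a crashed or hung instance is noticed even though it can no longer send an alarm.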

Best Regards
Niki Dokovski | @nickytd <https://twitter.com/nickytd>

[1] https://en.wikibooks.org/wiki/Java_Programming/JavaSpaces



Sylvain Gibier
 

Hi,

My 2 cents: have you looked at Sensu (https://sensuapp.org/)? I was in
need of something similar, and so ended up deploying, along with every pushed
app, a Sensu client (agent) that pulls metrics out of the container onto my
bosh'ed Sensu server, including, if needed, a custom healthcheck for each app.

Sylvain

