[Proposal] Wanted a Babysitter for my application. ;-)


Amit Kumar Gupta
 

I'm not sure I see the benefit here.

Diego, for instance, runs a customizable babysitter alongside each app
instance, and kills the container if the babysitter says things are going
bad. This triggers an event that the system can react to, and the system
also polls for container states because events can always be lost.
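
The "react to events, but also poll because events can be lost" pattern can be sketched roughly as follows. This is only an illustration in Python: the `Reconciler` class, its method names, and the state strings are hypothetical and are not Diego APIs.

```python
class Reconciler:
    """Tracks container states from events and repairs missed events by polling."""

    def __init__(self, fetch_states):
        # fetch_states: hypothetical callable returning {container_id: state}
        self.fetch_states = fetch_states
        self.known = {}      # last state seen for each container
        self.restarts = []   # containers we decided to restart, in order

    def _observe(self, container_id, state):
        previous = self.known.get(container_id)
        self.known[container_id] = state
        # React only on a transition into "crashed", so an event followed by
        # a poll reporting the same crash does not trigger two restarts.
        if state == "crashed" and previous != "crashed":
            self.restarts.append(container_id)

    def on_event(self, container_id, state):
        """Fast path: called when an event arrives."""
        self._observe(container_id, state)

    def poll(self):
        """Slow path: called periodically to catch anything events missed."""
        for container_id, state in self.fetch_states().items():
            self._observe(container_id, state)


# Example: the crash event for "b" is lost, but the next poll still catches it.
actual = {"a": "running", "b": "running"}
reconciler = Reconciler(lambda: dict(actual))
reconciler.poll()
actual["a"] = "crashed"
reconciler.on_event("a", "crashed")
actual["b"] = "crashed"  # no event delivered for this one
reconciler.poll()
```

The poll is what makes the design tolerant of lost events; the event path only shortens the reaction time.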

One thing to note is that in this case "the system" is the Executor, not
HM9k (which doesn't exist in Diego), the Converger (Diego's equivalent of
HM9k), or the Firehose or Cloud Controller, which are very far removed from
the container backend. In Diego, the pieces are loosely coupled; events/data
in the system don't have to be sent through several layers of abstraction.

Best,
Amit

On Mon, Oct 5, 2015 at 10:09 AM, Curry, Matthew <Matt.Curry(a)allstate.com>
wrote:

We have been talking about something similar that we have labeled the
Angry Farmer. I do not think you would need an agent. The firehose and
cloud controller should have everything that you need. Also an agent does
not give you the ability to really measure the performance of instances
relative to each other which is a good indicator of bad state or
performance.
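
As a rough illustration of measuring instances relative to each other, the sketch below flags instances whose metric is far above the median of their peers. The function name, the factor of 2, and the sample latencies are all assumptions made for the example, not anything the firehose or cloud controller provides.

```python
from statistics import median


def outlier_instances(metric_by_index, factor=2.0):
    """Return instance indexes whose metric exceeds factor * median of all instances.

    metric_by_index maps an instance index to one metric value
    (for example, request latency in milliseconds).
    """
    if len(metric_by_index) < 3:
        # With fewer than three instances there is no meaningful peer group.
        return []
    mid = median(metric_by_index.values())
    if mid <= 0:
        return []
    return sorted(i for i, value in metric_by_index.items() if value > factor * mid)


# Hypothetical latencies (ms) for four instances of the same app.
latencies = {0: 110, 1: 95, 2: 480, 3: 102}
slow = outlier_instances(latencies)
```

An agent inside a single container only sees its own numbers, so a comparison like this has to run somewhere that sees all instances.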

Matt

From: Dhilip Kumar S <dhilip.kumar.s(a)huawei.com>
Reply-To: "Discussions about Cloud Foundry projects and the system
overall." <cf-dev(a)lists.cloudfoundry.org>
Date: Monday, October 5, 2015 at 9:31 AM
To: "Discussions about Cloud Foundry projects and the system overall." <
cf-dev(a)lists.cloudfoundry.org>
Cc: Vinay Murudi <vinaym(a)huawei.com>, Krishna M Kumar <
krishna.m.kumar(a)huawei.com>, Liangbiao <rexxar.liang(a)huawei.com>,
Srinivasch ch <srinivasch.ch(a)huawei.com>
Subject: [cf-dev] [Proposal] Wanted a Babysitter for my application. ;-)

Hello CF,



Greetings from Huawei. Here is a quick idea that came to mind recently.
Honestly, we did not spend much time brainstorming this internally, but we
thought we could go ahead and ask the community directly. It would be a
great help to know whether such an idea has already been considered and
dropped by the community.



*Proposal Motivations*

The way the health check is currently performed in Cloud Foundry is to
run a command <https://github.com/cloudfoundry-incubator/healthcheck>
periodically; if the exit status is non-zero, the application is assumed
to be non-responsive. We periodically repeat this process for all the
applications, which means that we actually scan the entire data center
frequently to find one or a few misbehaving apps.
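
The exit-status convention described above can be sketched as follows. This is a minimal Python illustration, not the code of the linked healthcheck project; the function name and the timeout value are assumptions.

```python
import subprocess
import sys


def is_healthy(command, timeout_seconds=5.0):
    """Run a check command; exit status 0 means healthy, anything else unhealthy."""
    try:
        result = subprocess.run(
            command,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A check that hangs or cannot be started counts as a failure.
        return False
    return result.returncode == 0


# Stand-in check commands that simply exit with a chosen status.
passing_check = [sys.executable, "-c", "raise SystemExit(0)"]
failing_check = [sys.executable, "-c", "raise SystemExit(3)"]
```

A scheduler would call `is_healthy` for every instance on a fixed interval, which is the "scan everything" cost the proposal is questioning.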



Why can’t we change the way the health check is done? Can it reflect the
real world? Hospitals don’t periodically scan the entire community
looking for sick residents. Similarly, why can’t we report problems as and
when they occur, just like in the real world?



How about a lightweight process that constantly monitors the application’s
health and reports when an app is down, non-responsive, etc.? In a huge
datacenter where thousands of apps are hosted and each app has many
instances, wouldn’t it be better to make the individual app/container come
and tell us (the health manager) that there is a problem instead of
scanning all of them? *Push versus Pull model*: something like a
babysitter residing within each container and taking care of the ‘app’
hosted by our customers.



*How to accomplish this?*

Our proposal is for BabySitter (BS): an agent residing within each
container, optionally deployed using app-specific configuration. This agent
sends the collected metrics to the health monitor (HM): alarms in case of
any anomaly, and periodic time-series information otherwise. The agent
should remember the configured threshold values that each app should not
exceed, and trigger an alarm to the health monitor automatically on any
threshold violation. The alarm could even be sent many times a second,
depending on the severity of the event, whereas the regular periodic
‘time-series’ information could be collected every second but sent only
once a minute. The challenge is to design the ‘bs’ application to be as
lightweight as possible.
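
The cadence described above (sample often, alarm immediately on a threshold violation, send an aggregated report at a longer interval) could look roughly like this in Python. The class, the `outbox` list standing in for the health monitor, and the choice of averaging are all assumptions for illustration.

```python
class Babysitter:
    """Collects per-second samples, alarms on violations, reports periodically."""

    def __init__(self, thresholds, report_every=60):
        self.thresholds = thresholds      # e.g. {"cpu": 90}
        self.report_every = report_every  # number of samples per report
        self.samples = []
        self.outbox = []  # stand-in for messages sent to the health monitor

    def record(self, sample):
        """Handle one sample, e.g. {"cpu": 95, "mem": 40}."""
        violations = {
            name: value
            for name, value in sample.items()
            if name in self.thresholds and value > self.thresholds[name]
        }
        if violations:
            # Alarms are pushed immediately, not held for the next report.
            self.outbox.append(("alarm", violations))

        self.samples.append(sample)
        if len(self.samples) >= self.report_every:
            count = len(self.samples)
            averages = {
                name: sum(s[name] for s in self.samples) / count
                for name in self.samples[0]
            }
            self.outbox.append(("report", averages))
            self.samples = []


# Example with a short reporting window of three samples.
sitter = Babysitter({"cpu": 90}, report_every=3)
for sample in ({"cpu": 50, "mem": 40}, {"cpu": 95, "mem": 40}, {"cpu": 65, "mem": 40}):
    sitter.record(sample)
```

Keeping only the current window of samples in memory is one way to keep such an agent lightweight.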



This is our primary idea. We also thought it would make more sense to add
a few more capabilities to the babysitter, such as sshd (as a goroutine)
and a fileserver (as a goroutine), but before we bore you with all those
details, we first want to understand what the CF community thinks about
this initial idea.



Thanks in advance,

Dhilip




Dhilip
 

Hi All,

Thanks for the response.

Hi Amit,

Thanks for the info; I hadn’t noticed that we run a copy of the executor for each ‘garden-linux’ container that we launch. So we already have a ‘push’-based container metrics collection and monitoring mechanism in place. In that case, I can think of only the following benefits.


1) This can become a unified health check approach: since this binary can be packed within the container, it can even run inside a Docker container of an external system and keep pushing to a common HM. Or we could run it in the same VM as a MySQL instance to get its health.

2) This can be a part of sshd, as we are running a daemon in every container anyway.

Of course, the original intention was to see if we could slightly alter the way Diego’s monitoring/metrics collection works. If this is already implemented, then I do not see a point in pursuing this idea.

Thanks for your time CF,
Dhilip




Amit Kumar Gupta
 

Hey Dhilip,

To clarify, we don't have a copy of the executor for each garden-linux
container. A single "cell" VM has one executor, one garden-linux, and many
containers. The executor runs one monitor or "babysitter" process per
container.

What would be the benefit of running a monitor inside external systems
that reports to the HM? With Diego, there is no HM, so who exactly would
it report to? And whatever it reports to, what can it do with that
information? The Diego system components can take action when hearing
about a failed container running in a Diego cell: they can schedule the
process to be restarted, or whatever the right action may be given the
crash restart policies. How can Diego or any Cloud Foundry component take
action against an external system?

I think you highlight something valuable: it would be nice for the
platform to support running things other than apps, e.g. a MySQL database.
The plan is that this can be solved within Diego's abstractions of tasks
and LRPs, and that's true for perhaps most stateless non-app workloads, but
things like databases are still hard, because persistence is a hard
problem. If you have not already seen it, Ted Young and Caleb Miles' talk
about this problem at the last CF Summit is a good one to watch:
https://www.youtube.com/watch?v=3Ut6Qdd2FHY

Not all containers run sshd. Typically, the CC is responsible for
requesting that an LRP have SSH access enabled; it's not conflated with
Diego's responsibilities. It's also optional for the CC: users and space
managers can opt to disable SSH (actually, I believe it's disabled by
default).

Cheers,
Amit






Dhilip
 

Thanks again Amit for the clarification on the executor part.

“The plan is that this can be solved within Diego's abstractions of tasks and LRPs”
Does this mean Diego will be capable of provisioning workloads other than garden-linux containers?

I’ll add just one little point to clarify, without pushing the idea itself.

I should have been more explicit: when I mentioned HM, I meant whichever subsystem is responsible for managing an application’s health; I did not intend to point at HM9000 specifically. The idea was that the ‘babysitter’ should be able to fire an HTTP POST to such a system automatically when any of its threshold values (such as CPU, memory, or disk) is exceeded; at other times it simply collects and sends a consolidated metrics report once a minute.

Say, for instance, a given app exceeds 90% CPU; the babysitter then automatically sends a POST message such as the following to a specified (discoverable) endpoint:
{
  "GUID": "ABCD1234",
  "Time": "<time stamp>",
  "Index": 3,
  "CPU": 95,
  "Mem": 50,
  "Disk": 50
}
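
For illustration only, building and sending such a message might look like the Python sketch below. The function names are hypothetical, the endpoint URL is whatever the babysitter would discover, and using a Unix timestamp for the Time field is an assumption, since the message above leaves the timestamp format open.

```python
import json
import time
import urllib.request

CPU_THRESHOLD = 90  # percent, as in the example above


def build_alert(guid, index, cpu, mem, disk, now=None):
    """Return the alert payload, or None when CPU is within the threshold."""
    if cpu <= CPU_THRESHOLD:
        return None
    return {
        "GUID": guid,
        "Time": time.time() if now is None else now,  # assumed format
        "Index": index,
        "CPU": cpu,
        "Mem": mem,
        "Disk": disk,
    }


def post_alert(endpoint_url, payload):
    """POST the payload as JSON to a (hypothetical) health endpoint."""
    request = urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Build (but do not send) the alert from the example values.
alert = build_alert("ABCD1234", index=3, cpu=95, mem=50, disk=50, now=0)
```

Separating payload construction from sending keeps the threshold logic testable without a network.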

Regards,
Dhilip



Amit Kumar Gupta
 

> Does this mean Diego will be capable of provisioning workloads other than
> garden-linux containers?

It can schedule to anything that runs something that acts like a "Rep" in
front of it: https://github.com/cloudfoundry-incubator/rep. This is how
Diego is able to support garden-windows running .NET apps, for example.

> The idea was that the ‘babysitter’ should be able to fire up a HTTP POST
> to such a system automatically when any of its threshold value such as cpu,
> memory, disk exceeds, other times it simply collects and sends a
> consolidated metrics report once a minute.

That's a great idea. I think there are a couple of distinct ideas here.
The first is custom healthchecking: some basic test you want to run
constantly against your app instances, shutting them down if the
healthchecks fail. Diego allows you to define such healthchecks, but this
is not exposed at the level of the CC. I think there will need to be some
discovery to determine the main use cases and find an "opinionated" way to
expose this functionality through the API; exposing Diego's full Executor
Action DSL is probably not desirable.

The second idea is streaming metrics out of the system and being able to
set up monitoring, alerting, and hooks (e.g. autoscaling) around these
metrics. Things like request latency can be gleaned from the gorouter, and
memory and CPU metrics are available through the loggregator firehose; I
believe this is how cf app is able to report those values. I imagine this
actually has a very large scope: gather metrics from throughout the system
relevant to your app instances, stream them out, have a system ingest
them, provide visualizations of the data, allow setting alerting
thresholds, allow configuring hooks (to kill an app instance, scale an app
up, roll back to a previous version of the droplet, etc.), and then build
the piece that can actually convert these hooks into requests that the CC
will honour. This is more than what your original proposal described, but
I think it's sort of the logical conclusion. Quite valuable, but also
quite large in scope.

Best,
Amit

On Tue, Oct 6, 2015 at 1:21 AM, Dhilip Kumar S <dhilip.kumar.s(a)huawei.com>
wrote:

Thanks again Amit for the clarification on the executor part.



“The plan is that this can be solved within Diego's abstractions of tasks
and LRPs”

Does this mean Diego will be capable of provisioning workloads other than
garden-linux containers?



Ill add just one little point to clarify, but not pushing on the idea
itself.



I should have been even more explicit When I mentioned HM, what I meant
was the subsystem that was responsible for managing application’s health,
I did not intend to point at HM9000 specifically. The idea was that the
‘babysitter’ should be able to fire up a HTTP POST to such a system
automatically when any of its threshold value such as cpu, memory, disk
exceeds, other times it simply collects and sends a consolidated metrics
report once a minute.



Say for instance. A given app exceeds 90% CPU then the babysitter
automatically sends the post message to a specified (discoverable endpoint).

json{

GUID: ABCD1234

Time: <time stamp>

Index: 3

CPU: 95

Mem: 50

Disk: 50

}



Regards,

Dhilip



*From:* Amit Gupta [mailto:agupta(a)pivotal.io]
*Sent:* Tuesday, October 06, 2015 12:00 PM
*To:* Discussions about Cloud Foundry projects and the system overall.
*Cc:* Vinay Murudi; Krishna M Kumar; Liangbiao; Jianhui Zhou; Srinivasch
ch
*Subject:* [cf-dev] Re: Re: Re: Re: [Proposal] Wanted a Babysitter for my
applicatoin. ;-)



Hey Dhilip,



To clarify, we don't have a copy of executor for each garden-linux
container. A single "cell" VM has one executor, one garden-linux, and many
containers. The executor runs one monitor or "babysitter" process per each
container.



What would be the benefit of running a monitor inside external systems
which report to the HM? With Diego, there is no HM, so who exactly would
it report to? And whatever it reports to, what can it do with that
information? The Diego system components can take action when they hear
about a failed container running in a Diego cell: they can schedule the
process to be restarted, or whatever the right action may be given the
crash restart policies. How can Diego or any Cloud Foundry component take
action against an external system?



I think you highlight something valuable, that it would be nice for the
platform to support running things other than apps, e.g. a MySQL database.
The plan is that this can be solved within Diego's abstractions of tasks
and LRPs, and it's true for perhaps most stateless non-app workloads, but
things like databases are still hard, due to persistence being a hard
problem. If you have not already seen it, the talk by Ted Young and Caleb
Miles at the last CF Summit about this problem is a good one to watch:
https://www.youtube.com/watch?v=3Ut6Qdd2FHY



Not all containers run sshd. Typically, the CC is responsible for
requesting that an LRP have SSH access enabled; it's not conflated with
Diego's responsibilities. It's also optional for the CC: users and space
managers can opt to disable SSH (actually, I believe it's disabled by
default).



Cheers,

Amit



On Mon, Oct 5, 2015 at 10:39 PM, Dhilip Kumar S <dhilip.kumar.s(a)huawei.com>
wrote:

Hi All,



Thanks for the response.



Hi Amit,



Thanks for the info; I hadn’t noticed that we run a copy of the executor for
each ‘garden-linux’ container that we launch. So we already have a
‘push’-based container metrics collection and monitoring mechanism in place.
In that case I can think of only the following benefits.



1) This can become a unified health check approach: since this binary
can be packed within the container, it can even run inside a Docker
container of an external system and keep pushing to a common HM. Or we
could run it in the same VM as a MySQL instance to get its health.

2) This can be a part of sshd, as we are running a daemon in every
container anyway.



Of course, the original intention was to see if we could slightly alter the
way Diego’s monitoring/metrics collection works. If this is already
implemented then I do not see a point in pursuing this idea.



Thanks for your time CF,

Dhilip





*From:* Amit Gupta [mailto:agupta(a)pivotal.io]
*Sent:* Tuesday, October 06, 2015 1:05 AM
*To:* Discussions about Cloud Foundry projects and the system overall.
*Cc:* Vinay Murudi; Krishna M Kumar; Liangbiao; Srinivasch ch
*Subject:* [cf-dev] Re: Re: [Proposal] Wanted a Babysitter for my
applicatoin. ;-)



I'm not sure I see the benefit here.



Diego, for instance, runs a customizable babysitter alongside each app
instance, and kills the container if the babysitter says things are going
bad. This triggers an event that the system can react to, and the system
also polls for container states because events can always be lost.
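
The combination of events and polling described above can be sketched as follows. This is a toy model in Python, not Diego's code: `Backend` and `Supervisor` are hypothetical names, and the "reaction" is reduced to recording which containers were handled.

```python
class Backend:
    """Toy container backend that tracks the state of each container."""

    def __init__(self, containers):
        self.state = {name: "running" for name in containers}

    def crash(self, name):
        self.state[name] = "crashed"

    def list_states(self):
        return dict(self.state)


class Supervisor:
    """Reacts to crash events, and also polls in case an event was lost."""

    def __init__(self, backend):
        self.backend = backend
        self.handled = []  # containers already reacted to, in order

    def on_event(self, name):
        self._handle(name)

    def poll(self):
        for name, state in sorted(self.backend.list_states().items()):
            if state == "crashed":
                self._handle(name)

    def _handle(self, name):
        # Idempotent: an event followed by a poll must not handle a crash twice.
        if name not in self.handled:
            self.handled.append(name)


backend = Backend(["app-0", "app-1"])
sup = Supervisor(backend)

backend.crash("app-0")
sup.on_event("app-0")   # this event is delivered

backend.crash("app-1")  # this event is lost: on_event is never called

sup.poll()              # polling picks up the crash that had no event
print(sup.handled)
```

The handler has to be idempotent, because a crash that was already reported by an event will be seen again by the next poll.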



One thing to note is in this case, "the system" is the Executor, not HM9k
(which doesn't exist in Diego), or the Converger (Diego's equivalent of
HM9k), or Firehose or Cloud Controller which are very far removed from the
container backend. In Diego, the pieces are loosely coupled, events/data
in the system don't have to be sent through several layers of abstraction.



Best,

Amit



On Mon, Oct 5, 2015 at 10:09 AM, Curry, Matthew <Matt.Curry(a)allstate.com>
wrote:

We have been talking about something similar that we have labeled the
Angry Farmer. I do not think you would need an agent. The firehose and
cloud controller should have everything that you need. Also an agent does
not give you the ability to really measure the performance of instances
relative to each other which is a good indicator of bad state or
performance.
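
One way to read "performance of instances relative to each other" is to compare each instance against the median of its siblings. The sketch below is in Python and assumes per-instance CPU percentages are already available (for example, collected from the firehose); the `factor` of 2.0 is an arbitrary illustrative choice, not something proposed in the thread.

```python
from statistics import median


def outliers(cpu_by_index, factor=2.0):
    """Return instance indexes whose CPU is more than `factor` times the median."""
    mid = median(cpu_by_index.values())
    if mid <= 0:
        return []
    return sorted(i for i, cpu in cpu_by_index.items() if cpu > factor * mid)


# Hypothetical CPU percentages for four instances of one app.
samples = {0: 20, 1: 22, 2: 19, 3: 95}
print(outliers(samples))
```

An agent inside a single container cannot make this comparison, since it only sees its own instance.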



Matt



*From: *Dhilip Kumar S <dhilip.kumar.s(a)huawei.com>
*Reply-To: *"Discussions about Cloud Foundry projects and the system
overall." <cf-dev(a)lists.cloudfoundry.org>
*Date: *Monday, October 5, 2015 at 9:31 AM
*To: *"Discussions about Cloud Foundry projects and the system overall." <
cf-dev(a)lists.cloudfoundry.org>
*Cc: *Vinay Murudi <vinaym(a)huawei.com>, Krishna M Kumar <
krishna.m.kumar(a)huawei.com>, Liangbiao <rexxar.liang(a)huawei.com>,
Srinivasch ch <srinivasch.ch(a)huawei.com>
*Subject: *[cf-dev] [Proposal] Wanted a Babysitter for my applicatoin. ;-)



Hello CF,

Greetings from Huawei. Here is a quick idea that came to our minds
recently. Honestly, we did not spend much time brainstorming this
internally, but we thought we could go ahead and ask the community
directly. It would be a great help to know if such an idea had already been
considered and dropped by the community.

*Proposal Motivations*

The way the health-check process is currently performed in Cloud Foundry is
to run a command <https://github.com/cloudfoundry-incubator/healthcheck>
periodically; if the exit status is non-zero, it is assumed that the
application is non-responsive. We periodically repeat this process for all
the applications, which means that we actually scan the entire data center
frequently to find one or a few misbehaving apps.
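
The exit-status convention described above can be sketched as follows. This is a Python illustration of the general pattern only; it does not reproduce the linked healthcheck tool, and the check commands here are stand-ins that simply exit with a chosen status.

```python
import subprocess
import sys


def is_healthy(command, timeout=10.0):
    """Run a check command; exit status 0 means healthy, anything else does not."""
    try:
        result = subprocess.run(
            command,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unstartable check is treated the same as a failing one.
        return False
    return result.returncode == 0


# Stand-in check commands: one exits 0, the other exits 1.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
bad = [sys.executable, "-c", "raise SystemExit(1)"]
print(is_healthy(ok), is_healthy(bad))
```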

Why can’t we change the way the health check is done? Can it reflect the
real world? Hospitals don’t periodically scan the entire community
looking for sick residents. Similarly, why can’t we report problems as and
when they occur, just like in the real world?

How about a lightweight process that constantly monitors the application’s
health and periodically reports if an app is down, non-responsive, etc.? In
a huge datacenter where thousands of apps are hosted and each app has many
instances, wouldn’t it be better to make the individual app/container come
and tell us (the health manager) that there is a problem instead of scanning
all of them? *Push versus Pull model*: something like a babysitter residing
within each container and taking care of the ‘app’ hosted by our customers.

*How to accomplish this?*

Our proposal is for BabySitter (BS): an agent residing within each
container, optionally deployed using app-specific configuration. This agent
sends the collected metrics (periodic time-series information, etc.) to the
health monitor in case of any anomaly. The agent should remember the
configured threshold values that each app should not exceed, and it
triggers an alarm to the health monitor automatically in case of any
threshold violation. The alarm could even be sent many times a second to
the health monitor depending on the severity of the event, whereas the
regular periodic ‘time-series’ information could be collected every second
but sent only once a minute to the HM. The challenge is to design the
application ‘bs’ to be as lightweight as possible.
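
The two reporting paths described above (an immediate alarm on a threshold violation, and per-second samples batched into a once-a-minute report) can be sketched as follows. This is a Python illustration under assumptions: the class and callback names are hypothetical, only CPU is tracked, and the loop is driven by simulated ticks rather than a real one-second timer.

```python
class Babysitter:
    """Collects one sample per tick, alarms at once, and reports in batches."""

    def __init__(self, threshold, send_alarm, send_report, report_every=60):
        self.threshold = threshold
        self.send_alarm = send_alarm      # placeholder for a push to the HM
        self.send_report = send_report    # placeholder for the periodic push
        self.report_every = report_every  # samples per consolidated report
        self.samples = []

    def tick(self, cpu):
        """Record one per-second CPU sample."""
        self.samples.append(cpu)
        if cpu > self.threshold:
            self.send_alarm(cpu)          # alarm path: sent immediately
        if len(self.samples) == self.report_every:
            self.send_report({
                "count": len(self.samples),
                "min": min(self.samples),
                "max": max(self.samples),
                "avg": sum(self.samples) / len(self.samples),
            })
            self.samples = []             # start the next one-minute window


alarms, reports = [], []
bs = Babysitter(90, alarms.append, reports.append)
for second in range(120):                 # two simulated minutes
    bs.tick(95 if second == 30 else 40)   # one spike at second 30
print(alarms, len(reports))
```

Sending only a summary once a minute keeps the regular traffic small, while the alarm path stays independent of the reporting window.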

This is our primary idea. We also thought it would make more sense to
club a few more capabilities into the babysitter, like sshd (as a goroutine)
and a fileserver (as a goroutine), but before we bore you with all those
details, we first want to understand what the CF community thinks about
this initial idea.

Thanks in advance,

Dhilip