
Issues with collector in v219 of cf-release

Amit Kumar Gupta
 

Hey all,

Just wanted to let you know that we discovered an issue between the
collector and etcd-metrics-server components of cf-release. The release
notes have been updated to reflect the following:

*The bump in v219 to etcd-metrics-server turned out to not play nicely with
collector, and caused collector to periodically crash. If your system is
dependent on collector for metrics, this will affect your deployment.
However, if you are not concerned with metrics from the etcd component, you
can opt to not include etcd-metrics-server as part of your deployment. In
standard deployments, it is colocated with the etcd_zN jobs; you can simply
remove the template from the list of colocated jobs.*
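As a rough illustration of the workaround described above, the following Python sketch drops one colocated template from the etcd_zN jobs of a manifest that has already been loaded into a dictionary. The template name `etcd_metrics_server`, the manifest layout (`jobs`, `templates`, `name`) and the helper name `drop_template` are assumptions made for this example; check them against your own manifest before relying on it.

```python
import copy
import re

# Assumed template name; verify it against your own deployment manifest.
TEMPLATE_TO_DROP = "etcd_metrics_server"


def drop_template(manifest, template_name=TEMPLATE_TO_DROP):
    """Return a copy of the manifest without `template_name` on etcd_zN jobs."""
    result = copy.deepcopy(manifest)
    for job in result.get("jobs", []):
        # Only touch jobs named etcd_z1, etcd_z2, ...
        if re.fullmatch(r"etcd_z\d+", job.get("name", "")):
            job["templates"] = [
                t for t in job.get("templates", [])
                if t.get("name") != template_name
            ]
    return result


if __name__ == "__main__":
    # Hypothetical, heavily trimmed manifest used only to exercise the helper.
    manifest = {
        "jobs": [
            {"name": "etcd_z1",
             "templates": [{"name": "etcd"}, {"name": "etcd_metrics_server"}]},
            {"name": "router_z1", "templates": [{"name": "gorouter"}]},
        ]
    }
    print(drop_template(manifest))
```

The helper works on a copy, so the original manifest structure is left untouched and the two can be compared before redeploying.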

Best,
Amit, OSS Release Integration PM


What are the services required for MicroBOSH monitoring/resurrection

Harpreet Ghai
 

Hi All,
If I understand correctly, when MicroBOSH detects that a VM is down and is about to delete/recreate it, MicroBOSH will definitely use the keystone and compute APIs.

Besides keystone and compute, what other services does MicroBOSH use for monitoring/resurrection?

Regards,
Harpreet


What services are required for MicroBOSH monitoring/resurrection

Harpreet Ghai
 

Hi All,

If my understanding is correct, when MicroBOSH detects that a VM is down and is about to delete/recreate it, MicroBOSH will definitely use the keystone and compute APIs.

Besides keystone and compute, what other services are used for monitoring and resurrection?

Thanks
Harpreet


Need information regarding health monitor and VM resurrection

Harpreet Ghai
 

Hi All,
I have been assigned the following verification and investigation task:
(1) Heartbeat interval from agent: 1 minute
(2) How long does it take for MicroBOSH to detect that a VM's status is abnormal once the agent stops sending heartbeats?
(3) How long does it take to complete VM resurrection once the abnormal VM status has been detected?
(5) Messages
- What messages are output when
(5-1) MicroBOSH detected an unresponsive VM
(5-2) MicroBOSH detected that a VM is in an abnormal state
(5-3) MicroBOSH finished VM resurrection successfully
(5-4) MicroBOSH failed VM resurrection
(6) The messages are output to /var/vcap/sys/log/health_monitor/health_monitor.log.
- Are there any other files where the messages are output?

I would appreciate your help in answering these questions. Also, how can I investigate these myself, and where is the best place to look?

Thanks
Harpreet


Announcing cf-mysql-release v23

Marco Nicosia
 

Hi everyone,

Just wanted to let you know that cf-mysql-release v23
<https://github.com/cloudfoundry/cf-mysql-release/tree/v23> is now
available.

--

cf-mysql-release is a BOSH release that delivers a MySQL-compatible
Database-as-a-Service for Cloud Foundry users. Through Cloud Foundry, users
can provision databases and derive unique access credentials for
applications bound to those databases.

v23 is a minor update. We try to release regularly to give you access to
all the latest features and fixes. The changes in v23 include an updated
MariaDB, bugfixes for the SQL proxy and several configuration changes.
There's also a special change for users of bosh-lite (typically developers
who work on cf-mysql itself).

For all the details, please see the release notes for v23
<https://github.com/cloudfoundry/cf-mysql-release/releases/tag/v23>.

As always, we'd love to hear if you're having any problems with this
version of the software. Please open a GitHub issue
<https://github.com/cloudfoundry/cf-mysql-release/issues>. If you're
willing, we're always very happy to receive a Pull Request
<https://github.com/cloudfoundry/cf-mysql-release/pulls>.

--
Marco Nicosia
Product Manager
Pivotal Software, Inc.
mnicosia(a)pivotal.io
c: 650-796-2948


Re: future changes to etcd configuration in cf-release

Amit Kumar Gupta
 

Yes, this is mutual SSL auth.

Best,
Amit

On Tue, Oct 6, 2015 at 12:36 PM, Shannon Coen <scoen(a)pivotal.io> wrote:

Amit,

Could you confirm that you will require *mutual* SSL auth, otherwise this
wouldn't require much of a change by clients.

If etcd.require_ssl:true, must a client present a cert?

Thank you,

Shannon Coen
Product Manager, Cloud Foundry
Pivotal, Inc.

On Tue, Sep 29, 2015 at 5:54 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hi all,

Just wanted to give the community advance notice that we will be
introducing a change to the etcd configuration in cf-release, probably
within the week (probably cf v220+, we are currently on v218).

etcd can be configured to require ssl communication amongst servers, and
between servers and clients. Currently this defaults to false, but we will
be changing the default to true. We will include documentation on how to
generate certs, and where to put them in your stubs if you are using the
spiff tooling to generate deployment manifests. The BOSH-Lite dev
manifests will include certs by default, to make the dev workflow
especially easy.

Cheers,

Amit Gupta
Cloud Foundry PM, OSS Release Integration team
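To make the mutual-auth point in this thread concrete, here is a minimal Python sketch of the difference between one-way and mutual TLS using the standard `ssl` module. It is not etcd's or cf-release's implementation; the function names and the certificate file parameters are hypothetical and only illustrate what "a client must present a cert" means in practice.

```python
import ssl


def server_context(require_client_cert):
    """Build a server-side TLS context.

    With require_client_cert=True the handshake fails unless the client
    presents a certificate (mutual auth). A real server would also call
    load_cert_chain() for its own cert and load_verify_locations() for
    the CA that signs client certs; both are omitted in this sketch.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_REQUIRED if require_client_cert else ssl.CERT_NONE
    return ctx


def client_context(ca_file, cert_file=None, key_file=None):
    """Build a client-side context; cert_file/key_file matter only for mutual auth."""
    ctx = ssl.create_default_context(cafile=ca_file)
    if cert_file:
        # The client's own certificate, presented when the server asks for one.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


if __name__ == "__main__":
    print(server_context(True).verify_mode)
    print(server_context(False).verify_mode)
```

With one-way TLS only the server side needs a certificate, which is why the question of whether clients must present one determines how much client configuration has to change.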


Re: CF deployment environments available for CF incubating projects to use?

Michael Maximilien
 

Yup. We have stories to move to concourse. That's not the issue. Running CF versions on demand to test is the issue... I'll sync with Amit this week so I can see if we can dovetail with his future solutions.

Best,

Max

On Tue, Oct 6, 2015 at 9:03 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

The CF OSS Release Integration team (aka MEGA) is working on tooling to
fully bootstrap an AWS VPC with BOSH and Concourse running, and sufficient
AWS resources (eg subnets, ELB) to deploy CF:
https://github.com/cloudfoundry/mega-ci
We're working on making deploying CF itself less burdensome, but that will
take more time.
As for having someone actually maintain CF environments for the community
or dev teams to checkout and use as a service, the question is, who foots
the bill?
Amit
On Tuesday, October 6, 2015, Michael Maximilien <mmaximilien(a)gmail.com>
wrote:
Chiming in with some info and to add to Sebastien's request.

While most of the teams in CF have switched to using Concourse for their
pipelines, each pipeline instantiation still has to be built and managed
individually. The issue mentioned here, needing to provision the latest CF
releases to use for testing apps, services, brokers, etc., is a real one, and
likely one that many also solve individually.

This raises the question of whether it should be offered as a service for all
to use. Almost like a "CF as a Service" type of service broker... Just thinking
out loud here. Cloud capacity and managing resources would be the long-running
costs after the initial investment.

Best,

Max




On Mon, Oct 5, 2015 at 7:27 PM, Jean-Sebastien Delfino <jsdelfino(a)gmail.com>
wrote:

Hi all,

So far in the Abacus project we've been running our automated tests
outside CF (as our Node.js apps can also run outside with just a bit of env
variable config) on Travis-CI. Some of us also deploy our apps to Bosh lite
to test inside CF but maintaining working versions of Bosh lite is pretty
time consuming and that manual testing hasn't been a repeatable process so
far, so I'd really like to automate that with a proper CI build and test
environment.

Are there any CF deployment environments available for CF incubating
projects to use for CI builds and tests?

I'm looking for an environment where my build script could simply select
a specific version of CF, bootstrap it 'clean' (nothing left over from
previous runs), deploy the Abacus apps to it to run the tests there, then
repeat with a different version of CF etc.

Is anything like that already available for CF projects to use?

Thanks!

-- Jean-Sebastien


Re: future changes to etcd configuration in cf-release

Shannon Coen
 

Amit,

Could you confirm that you will require *mutual* SSL auth, otherwise this
wouldn't require much of a change by clients.

If etcd.require_ssl:true, must a client present a cert?

Thank you,

Shannon Coen
Product Manager, Cloud Foundry
Pivotal, Inc.

On Tue, Sep 29, 2015 at 5:54 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hi all,

Just wanted to give the community advance notice that we will be
introducing a change to the etcd configuration in cf-release, probably
within the week (probably cf v220+, we are currently on v218).

etcd can be configured to require ssl communication amongst servers, and
between servers and clients. Currently this defaults to false, but we will
be changing the default to true. We will include documentation on how to
generate certs, and where to put them in your stubs if you are using the
spiff tooling to generate deployment manifests. The BOSH-Lite dev
manifests will include certs by default, to make the dev workflow
especially easy.

Cheers,

Amit Gupta
Cloud Foundry PM, OSS Release Integration team


Re: cf push fails with "Unauthorized error: You are not authorized. Error: Invalid authorization"

Heitor Meira <htrmeira@...>
 

We had a similar issue; we passed the buildpack that the application should
use, and it worked.

I have no idea why this helped.

On Mon, Oct 5, 2015 at 7:01 PM, CF Runtime <cfruntime(a)gmail.com> wrote:

Are you using an aws elb or a ha_proxy instance for your load balancer?

If you curl api.SYSTEM_DOMAIN/v2/info, what is the logging_endpoint from
the response?

Joseph
CF Release Integration Team
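The check suggested above can also be scripted. The following Python sketch fetches `/v2/info` and pulls out the `logging_endpoint` field; the helper names and the sample response body are made up for illustration, and only the field name comes from the thread.

```python
import json
import urllib.request


def extract_logging_endpoint(info_body):
    """Return the logging_endpoint field from a /v2/info JSON body, or None."""
    return json.loads(info_body).get("logging_endpoint")


def fetch_logging_endpoint(api_url):
    """Fetch <api_url>/v2/info and return its logging_endpoint."""
    with urllib.request.urlopen(api_url.rstrip("/") + "/v2/info") as resp:
        return extract_logging_endpoint(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Made-up sample body; a real one comes from api.SYSTEM_DOMAIN/v2/info.
    sample = '{"name": "example", "logging_endpoint": "wss://loggregator.example.com:4443"}'
    print(extract_logging_endpoint(sample))
```

Comparing the returned host and port with what the load balancer actually forwards is one way to narrow down a "timeout connecting to log server" message.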



On Mon, Oct 5, 2015 at 3:36 AM, remi clotaire tassing tagne <
tassingremi(a)gmail.com> wrote:

Hi,

So I've finally managed to deploy CF on AWS but haven't managed to push
the most rudimentary Node.js app. BTW, pushing this app to my local CF
instance worked.

I tried troubleshooting with "cf logs" and "cf events" and
"CF_TRACE=true" but I just can't figure out the root cause. I've even
created an org and space and a user with SpaceDeveloper role but without
luck. 'cf push' gives me:

cf push
Using manifest file /home/remi/workspace/apps/hello-nodejs/manifest.yml

Creating app hello-nodejs in org test / space test as cf...
OK

Creating route hello-nodejs.ip.address.xip.io...
OK

Binding hello-nodejs.ip.address.xip.io to hello-nodejs...
OK

Uploading hello-nodejs...
Uploading app files from: /home/remi/workspace/apps/hello-nodejs
Uploading 1K, 3 files
Done uploading
OK

timeout connecting to log server, no log will be shown
Starting app hello-nodejs in org test / space test as cf...
Warning: error tailing logs
Unauthorized error: You are not authorized. Error: Invalid authorization

FAILED
hello-nodejs failed to stage within 15.000000 minutes

Any idea where I should look for the problem?

Thanks in advance!

Remi
--
Heitor Meira
Undergraduate in Computer Science - www.ccc.ufcg.edu.br
Member of Distributed Systems Laboratory - www.lsd.ufcg.edu.br


Re: CF deployment environments available for CF incubating projects to use?

Chip Childers <cchilders@...>
 

On Tue, Oct 6, 2015 at 5:03 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

The CF OSS Release Integration team (aka MEGA) is working on tooling to
fully bootstrap an AWS VPC with BOSH and Concourse running, and sufficient
AWS resources (eg subnets, ELB) to deploy CF:
https://github.com/cloudfoundry/mega-ci

We're working on making deploying CF itself less burdensome, but that will
take more time.

As for having someone actually maintain CF environments for the community
or dev teams to checkout and use as a service, the question is, who foots
the bill?

I suggest the Abacus project work with the participating companies to find
a CI infrastructure environment, and model the pipelines after the approach
the other projects are using with Concourse.


Re: CF deployment environments available for CF incubating projects to use?

Amit Kumar Gupta
 

The CF OSS Release Integration team (aka MEGA) is working on tooling to
fully bootstrap an AWS VPC with BOSH and Concourse running, and sufficient
AWS resources (eg subnets, ELB) to deploy CF:
https://github.com/cloudfoundry/mega-ci

We're working on making deploying CF itself less burdensome, but that will
take more time.

As for having someone actually maintain CF environments for the community
or dev teams to checkout and use as a service, the question is, who foots
the bill?

Amit

On Tuesday, October 6, 2015, Michael Maximilien <mmaximilien(a)gmail.com>
wrote:

Chiming in with some info and to add to Sebastien's request.

While most of the teams in CF have switched to using Concourse for their
pipelines, each pipeline instantiation still has to be built and managed
individually. The issue mentioned here, needing to provision the latest CF
releases to use for testing apps, services, brokers, etc., is a real one, and
likely one that many also solve individually.

This raises the question of whether it should be offered as a service for all
to use. Almost like a "CF as a Service" type of service broker... Just thinking
out loud here. Cloud capacity and managing resources would be the long-running
costs after the initial investment.

Best,

Max




On Mon, Oct 5, 2015 at 7:27 PM, Jean-Sebastien Delfino <jsdelfino(a)gmail.com>
wrote:

Hi all,

So far in the Abacus project we've been running our automated tests
outside CF (as our Node.js apps can also run outside with just a bit of env
variable config) on Travis-CI. Some of us also deploy our apps to Bosh lite
to test inside CF but maintaining working versions of Bosh lite is pretty
time consuming and that manual testing hasn't been a repeatable process so
far, so I'd really like to automate that with a proper CI build and test
environment.

Are there any CF deployment environments available for CF incubating
projects to use for CI builds and tests?

I'm looking for an environment where my build script could simply select
a specific version of CF, bootstrap it 'clean' (nothing left over from
previous runs), deploy the Abacus apps to it to run the tests there, then
repeat with a different version of CF etc.

Is anything like that already available for CF projects to use?

Thanks!

-- Jean-Sebastien


Re: CF deployment environments available for CF incubating projects to use?

Michael Maximilien
 

Chiming in with some info and to add to Sebastien's request.

While most of the teams in CF have switched to using Concourse for their pipelines, each pipeline instantiation still has to be built and managed individually. The issue mentioned here, needing to provision the latest CF releases to use for testing apps, services, brokers, etc., is a real one, and likely one that many also solve individually.

This raises the question of whether it should be offered as a service for all to use. Almost like a "CF as a Service" type of service broker... Just thinking out loud here. Cloud capacity and managing resources would be the long-running costs after the initial investment.

Best,

Max

On Mon, Oct 5, 2015 at 7:27 PM, Jean-Sebastien Delfino
<jsdelfino(a)gmail.com> wrote:

Hi all,
So far in the Abacus project we've been running our automated tests outside
CF (as our Node.js apps can also run outside with just a bit of env
variable config) on Travis-CI. Some of us also deploy our apps to Bosh lite
to test inside CF but maintaining working versions of Bosh lite is pretty
time consuming and that manual testing hasn't been a repeatable process so
far, so I'd really like to automate that with a proper CI build and test
environment.
Are there any CF deployment environments available for CF incubating
projects to use for CI builds and tests?
I'm looking for an environment where my build script could simply select a
specific version of CF, bootstrap it 'clean' (nothing left over from
previous runs), deploy the Abacus apps to it to run the tests there, then
repeat with a different version of CF etc.
Is anything like that already available for CF projects to use?
Thanks!
-- Jean-Sebastien


Proposing a change to the Project Lead for Greenhouse

Dieu Cao <dcao@...>
 

Hello All,

Pivotal would like to nominate Steven Benario for the Project Lead on
project Greenhouse. Steven is a new hire with experience working for
Microsoft on the .NET stack.

Steven's experience with the Pivotal PM process and his familiarity with
the .NET ecosystem make him an ideal candidate for driving a consistent
experience for Windows developers on Cloud Foundry. Steven brings with him
a diverse background of working for finservs, federal customers, and
enterprise software vendors. Steven intends to pair with Mark Kropf, the
original project lead for Greenhouse, to ramp up quickly.

We plan to propose this change at the Runtime PMC meeting this week.

-Dieu
Runtime PMC Lead


Warden container memory

Rohit Kelapure
 

We want to correlate what cf app reports with the OS stats obtained by SSHing into the container. The cf app output reports memory usage from within a cgroup, computed as RSS + cache, which is used to calculate and limit bytes in use. ps aux gives us the RSS of the process. How does one determine the cache component of this computation?

-Thanks,
Rohit
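One possible way to look at the cache component, sketched here under assumptions: on a cgroup v1 system the memory controller exposes a `memory.stat` file with `cache` and `rss` counters in bytes. The path below is an assumed location as seen from inside the container, the sample numbers are made up, and the sum simply takes the RSS + cache description in the question at face value. Note that the cgroup `rss` counter covers every process in the cgroup, so it will not necessarily match the RSS of a single process in `ps aux`.

```python
def parse_memory_stat(text):
    """Parse cgroup v1 memory.stat content into a {name: int} dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def rss_plus_cache(stats):
    """Sum the rss and cache counters (bytes), as described in the question."""
    return stats.get("rss", 0) + stats.get("cache", 0)


if __name__ == "__main__":
    # Assumed cgroup v1 path as seen from inside the container.
    path = "/sys/fs/cgroup/memory/memory.stat"
    try:
        with open(path) as f:
            stats = parse_memory_stat(f.read())
    except OSError:
        # Fall back to a made-up sample when the file is not available.
        stats = parse_memory_stat("cache 1048576\nrss 4194304\n")
    print("cache:", stats.get("cache"), "rss:", stats.get("rss"),
          "total:", rss_plus_cache(stats))
```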


Re: Making your landscape trust a certain certificate authority

Eric Westenberger
 

Hi,

thanks so much for pointing me in the right direction. The following script worked for me:

#!/bin/bash
# Short pause before the import (kept from the original script).
sleep 2
# Import the bundled certificate into the cacerts keystore of the buildpack's JRE.
"$HOME/.java-buildpack/open_jdk_jre/bin/keytool" -keystore "$HOME/.java-buildpack/open_jdk_jre/lib/security/cacerts" -storepass changeit -importcert -noprompt -alias MyCert -file "$HOME/WEB-INF/ssl/MyCert.crt"

Cheers, Eric


Re: [Proposal] Wanted a Babysitter for my applicatoin. ; -)

Dhilip
 

Thanks again Amit for the clarification on the executor part.

“The plan is that this can be solved within Diego's abstractions of tasks and LRPs”
Does this mean Diego will be capable of provisioning workloads other than garden-linux containers?

I'll add just one little point to clarify, without pushing the idea itself.

I should have been more explicit. When I mentioned HM, I meant the subsystem responsible for managing an application’s health; I did not intend to point at HM9000 specifically. The idea was that the ‘babysitter’ should be able to fire an HTTP POST to such a system automatically whenever one of its threshold values (CPU, memory, disk) is exceeded; at other times it simply collects metrics and sends a consolidated report once a minute.

For instance, if a given app exceeds 90% CPU, the babysitter automatically sends a POST message like the following to a specified (discoverable) endpoint:
{
  "GUID": "ABCD1234",
  "Time": "<time stamp>",
  "Index": 3,
  "CPU": 95,
  "Mem": 50,
  "Disk": 50
}
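The threshold check described above could be sketched roughly as follows in Python. This is only an illustration of the idea, not an existing Cloud Foundry component: the threshold values for Mem and Disk, the function names, and the use of a numeric timestamp are assumptions; only the field names and the 90% CPU example come from the message.

```python
import json
import time

# Hypothetical thresholds in percent; the message only gives 90% CPU as an example.
THRESHOLDS = {"CPU": 90, "Mem": 90, "Disk": 90}


def exceeded(sample, thresholds=THRESHOLDS):
    """Return the names of metrics whose value is above its threshold."""
    return [k for k, limit in thresholds.items() if sample.get(k, 0) > limit]


def build_alert(guid, index, sample, now=None):
    """Build the JSON body for the alert POST, or None if nothing is exceeded."""
    if not exceeded(sample):
        return None
    body = {
        "GUID": guid,
        "Time": now if now is not None else time.time(),
        "Index": index,
    }
    body.update(sample)
    return json.dumps(body)


if __name__ == "__main__":
    print(build_alert("ABCD1234", 3, {"CPU": 95, "Mem": 50, "Disk": 50}))
    print(build_alert("ABCD1234", 3, {"CPU": 10, "Mem": 50, "Disk": 50}))
```

Actually sending the body (for example with an HTTP client) is left out on purpose, since the endpoint it would go to is the open question in this thread.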

Regards,
Dhilip

From: Amit Gupta [mailto:agupta(a)pivotal.io]
Sent: Tuesday, October 06, 2015 12:00 PM
To: Discussions about Cloud Foundry projects and the system overall.
Cc: Vinay Murudi; Krishna M Kumar; Liangbiao; Jianhui Zhou; Srinivasch ch
Subject: [cf-dev] Re: Re: Re: Re: [Proposal] Wanted a Babysitter for my applicatoin. ;-)

Hey Dhilip,

To clarify, we don't have a copy of executor for each garden-linux container. A single "cell" VM has one executor, one garden-linux, and many containers. The executor runs one monitor or "babysitter" process per each container.

What would be the benefit of running a monitor inside external systems which report to the HM? With Diego, there is no HM, so who exactly would it report to? And whatever it reports to, what can it do with that information? The Diego system components can take action when hearing about a failed container running in a Diego cell, it can schedule the process to be restarted, or whatever the right action may be given the crash restart policies. How can Diego or any Cloud Foundry component take action against an external system?

I think you highlight something valuable, that it would be nice for the platform to support running things other than apps, e.g. a MySQL database. The plan is that this can be solved within Diego's abstractions of tasks and LRPs, and it's true for perhaps most stateless non-app workloads, but things like databases are still hard, due to persistence being a hard problem. If you have not already seen it, the talk by Ted Young and Caleb Miles at the last CF Summit about this problem is a good one to watch: https://www.youtube.com/watch?v=3Ut6Qdd2FHY

Not all containers run sshd. Typically, the CC is responsible for requesting that an LRP have SSH access enabled, it's not conflated with Diego's responsibilities. It's also optional for the CC, users and space managers can opt to disable SSH (actually, I believe it's disabled by default).

Cheers,
Amit

On Mon, Oct 5, 2015 at 10:39 PM, Dhilip Kumar S <dhilip.kumar.s(a)huawei.com> wrote:
Hi All,

Thanks for the response.

Hi Amit,

Thanks for the info, I hadn’t noticed that we run a copy of executor for each ‘garden-linux’ container that we launch. So we already have a ‘push’-based container metrics collection and monitoring mechanism in place. In that case, I can think of only the following benefits.


1) This can become a unified health check approach: as this binary can be packed within the container, it can even run inside a Docker container of an external system and keep pushing to a common HM. Or we could run this in the same VM as a MySQL instance to get its health.

2) This can be a part of sshd, as we are running a daemon in every container anyway.

Of course, the original intention was to see if we could slightly alter the way Diego’s monitoring/metrics collection works. If this is already implemented, then I do not see a point in pursuing this idea.

Thanks for your time CF,
Dhilip


From: Amit Gupta [mailto:agupta(a)pivotal.io]
Sent: Tuesday, October 06, 2015 1:05 AM
To: Discussions about Cloud Foundry projects and the system overall.
Cc: Vinay Murudi; Krishna M Kumar; Liangbiao; Srinivasch ch
Subject: [cf-dev] Re: Re: [Proposal] Wanted a Babysitter for my applicatoin. ;-)

I'm not sure I see the benefit here.

Diego, for instance, runs a customizable babysitter alongside each app instance, and kills the container if the babysitter says things are going bad. This triggers an event that the system can react to, and the system also polls for container states because events can always be lost.

One thing to note is in this case, "the system" is the Executor, not HM9k (which doesn't exist in Diego), or the Converger (Diego's equivalent of HM9k), or Firehose or Cloud Controller which are very far removed from the container backend. In Diego, the pieces are loosely coupled, events/data in the system don't have to be sent through several layers of abstraction.

Best,
Amit

On Mon, Oct 5, 2015 at 10:09 AM, Curry, Matthew <Matt.Curry(a)allstate.com> wrote:
We have been talking about something similar that we have labeled the Angry Farmer. I do not think you would need an agent. The firehose and cloud controller should have everything that you need. Also an agent does not give you the ability to really measure the performance of instances relative to each other which is a good indicator of bad state or performance.

Matt

From: Dhilip Kumar S <dhilip.kumar.s(a)huawei.com>
Reply-To: "Discussions about Cloud Foundry projects and the system overall." <cf-dev(a)lists.cloudfoundry.org>
Date: Monday, October 5, 2015 at 9:31 AM
To: "Discussions about Cloud Foundry projects and the system overall." <cf-dev(a)lists.cloudfoundry.org>
Cc: Vinay Murudi <vinaym(a)huawei.com>, Krishna M Kumar <krishna.m.kumar(a)huawei.com>, Liangbiao <rexxar.liang(a)huawei.com>, Srinivasch ch <srinivasch.ch(a)huawei.com>
Subject: [cf-dev] [Proposal] Wanted a Babysitter for my applicatoin. ;-)

Hello CF,
Greetings from Huawei. Here is a quick idea that came to our minds recently. Honestly, we did not spend an enormous amount of time brainstorming this internally, but we thought we could go ahead and ask the community directly. It would be a great help to know if such an idea had already been considered and dropped by the community.
Proposal Motivations

The way the health-check process is currently performed in Cloud Foundry is to run a command <https://github.com/cloudfoundry-incubator/healthcheck> periodically; if the exit status is non-zero then it is assumed that an application is non-responsive. We periodically repeat this process for all the applications. Does this mean that we actually scan the entire data center frequently to find one or a few misbehaving apps?
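The pull-style check described in this paragraph (run a command, treat a non-zero exit status as non-responsive) can be sketched in Python as below. This is not the healthcheck tool linked above; the function name, the timeout value, and the choice to treat a timeout or a missing command as unhealthy are assumptions made for the example.

```python
import subprocess
import sys


def is_healthy(command, timeout=5.0):
    """Run the check command; exit status 0 means healthy, anything else unhealthy.

    Assumption: a timeout or a command that cannot be started also counts
    as unhealthy.
    """
    try:
        result = subprocess.run(
            command,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in check commands that exit with status 0 and 1 respectively.
    ok = [sys.executable, "-c", "raise SystemExit(0)"]
    bad = [sys.executable, "-c", "raise SystemExit(1)"]
    print(is_healthy(ok), is_healthy(bad))
```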

Why can’t we change the way health-check is done? Can it reflect the real-world? The hospitals don’t periodically scan the entire community looking for sick residents. Similarly, why can’t we report problems as and when they occur – just like the real-world?

How about a lightweight process that constantly monitors the application’s health and periodically reports in case an app is down or non-responsive, etc.? In a huge datacenter, thousands of apps are hosted, and each app has many instances. Wouldn’t it be better to make the individual app/container come and tell us (the health manager) that there is a problem instead of scanning all of them? Push versus Pull model - something like a babysitter residing within each container and taking care of the ‘app’ hosted by our customers.
How to accomplish this?
Our proposal is for BabySitter (BS): an agent residing within each container, optionally deployed using app-specific configuration. This agent sends the collected metrics to the health monitor in case of any anomaly, along with periodic time-series information. The agent should remember the configured threshold values that each app should not exceed; it triggers an alarm automatically to the health monitor in case of any threshold violation. The alarm could even be sent many times a second to the health monitor depending on the severity of the event, whereas the regular periodic ‘time-series’ information could be collected every second but sent once a minute to the HM. The challenge is to design the ‘bs’ application to be as lightweight as possible.
This is our primary idea. We also thought it would make more sense to add a few more capabilities to the babysitter, like sshd (as a goroutine) and a file server (as a goroutine), but before we bore you with all those details, we first want to understand what the CF community thinks about this initial idea.
Thanks in advance,
Dhilip
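The "collect every second, send once a minute" part of the proposal could look roughly like the Python sketch below. The class name, the report fields (count, avg, max), and the use of a single CPU value per sample are assumptions for illustration; the proposal does not specify a report format.

```python
from statistics import mean


class MetricsBuffer:
    """Collect one sample per tick and emit a consolidated report every `window` samples."""

    def __init__(self, window=60):
        self.window = window  # e.g. 60 one-second samples per report
        self.samples = []

    def add(self, cpu):
        """Record a sample; return a report dict once the window is full, else None."""
        self.samples.append(cpu)
        if len(self.samples) < self.window:
            return None
        report = {
            "count": len(self.samples),
            "avg": mean(self.samples),
            "max": max(self.samples),
        }
        self.samples = []  # start a fresh window
        return report


if __name__ == "__main__":
    buf = MetricsBuffer(window=3)
    for value in (10, 20, 30, 40):
        print(value, buf.add(value))
```

Buffering on the agent keeps the regular traffic to one message per window, while urgent threshold alarms can still be sent immediately on a separate path.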


Re: [Proposal] Wanted a Babysitter for my applicatoin. ;-)

Sylvain Gibier
 

Hi,

My 2 cents - have you looked at Sensu (https://sensuapp.org/)? I was in
need of something similar, and so ended up deploying, along with every pushed
app, a Sensu client (agent) that pulls metrics out of the container onto my
bosh'ed Sensu server, including, if needed, a custom healthcheck for each app.

Sylvain


On Mon, Oct 5, 2015 at 6:31 PM, Dhilip Kumar S <dhilip.kumar.s(a)huawei.com>
wrote:

Hello CF,



Greetings from Huawei. Here is a quick idea that came to our minds
recently. Honestly, we did not spend an enormous amount of time brainstorming
this internally, but we thought we could go ahead and ask the community
directly. It would be a great help to know if such an idea had already been
considered and dropped by the community.



*Proposal Motivations*

The way the health-check process is currently performed in Cloud Foundry is to
run a command <https://github.com/cloudfoundry-incubator/healthcheck>
periodically; if the exit status is non-zero then it is assumed that an
application is non-responsive. We periodically repeat this process for all
the applications. Does this mean that we actually scan the entire data center
frequently to find one or a few misbehaving apps?



Why can’t we change the way health-check is done? Can it reflect the
real-world? The hospitals don’t periodically scan the entire community
looking for sick residents. Similarly, why can’t we report problems as and
when they occur – just like the real-world?



How about a lightweight process that constantly monitors the application’s
health and periodically reports in case an app is down or non-responsive,
etc.? In a huge datacenter, thousands of apps are hosted, and each app
has many instances. Wouldn’t it be better to make the individual
app/container come and tell us (the health manager) that there is a problem
instead of scanning all of them? *Push versus Pull model* - something
like a babysitter residing within each container and taking care of the
‘app’ hosted by our customers.



*How to accomplish this?*

Our proposal is for BabySitter (BS): an agent residing within each
container, optionally deployed using app-specific configuration. This agent
sends the collected metrics to the health monitor in case of any anomaly,
along with periodic time-series information. The agent should remember the
configured threshold values that each app should not exceed; it
triggers an alarm automatically to the health monitor in case of any
threshold violation. The alarm could even be sent many times a second to
the health monitor depending on the severity of the event, whereas the regular
periodic ‘time-series’ information could be collected every second but sent
once a minute to the HM. The challenge is to design the ‘bs’ application to be
as lightweight as possible.



This is our primary idea. We also thought it would make more sense to add
a few more capabilities to the babysitter, like sshd (as a goroutine) and
a file server (as a goroutine), but before we bore you with all those details,
we first want to understand what the CF community thinks about this initial idea.



Thanks in advance,

Dhilip




Re: [Proposal] Wanted a Babysitter for my applicatoin. ; -)

Amit Kumar Gupta
 

Hey Dhilip,

To clarify, we don't have a copy of executor for each garden-linux
container. A single "cell" VM has one executor, one garden-linux, and many
containers. The executor runs one monitor or "babysitter" process per each
container.

What would be the benefit of running a monitor inside external systems
which report to the HM? With Diego, there is no HM, so who exactly would
it report to? And whatever it reports to, what can it do with that
information? The Diego system components can take action when hearing
about a failed container running in a Diego cell, it can schedule the
process to be restarted, or whatever the right action may be given the
crash restart policies. How can Diego or any Cloud Foundry component take
action against an external system?

I think you highlight something valuable, that it would be nice for the
platform to support running things other than apps, e.g. a MySQL database.
The plan is that this can be solved within Diego's abstractions of tasks
and LRPs, and it's true for perhaps most stateless non-app workloads, but
things like databases are still hard, due to persistence being a hard
problem. If you have not already seen it, the talk by Ted Young and Caleb Miles
at the last CF Summit about this problem is a good one to watch:
https://www.youtube.com/watch?v=3Ut6Qdd2FHY

Not all containers run sshd. Typically, the CC is responsible for
requesting that an LRP have SSH access enabled, it's not conflated with
Diego's responsibilities. It's also optional for the CC, users and space
managers can opt to disable SSH (actually, I believe it's disabled by
default).

Cheers,
Amit

On Mon, Oct 5, 2015 at 10:39 PM, Dhilip Kumar S <dhilip.kumar.s(a)huawei.com>
wrote:

Hi All,



Thanks for the response.



Hi Amit,



Thanks for the info, I hadn’t noticed that we run a copy of executor for
each ‘garden-linux’ container that we launch. So we already have a ‘push’
based container metrics collection and monitoring mechanism in place. In
that case I can think of only the following benefits.



1) This can become a unified health check approach: since this binary
can be packed within the container, it can even run inside a Docker
container of an external system and keep pushing to a common HM. Or we
could run it in the same VM as a MySQL instance to get its health.

2) This can be a part of sshd, as we are running a daemon in every
container anyway.



Of course the original intention was to see if we could slightly alter the
way Diego’s monitoring/metrics collection works. If this is already
implemented then I do not see a point in pursuing this idea.



Thanks for your time CF,

Dhilip





*From:* Amit Gupta [mailto:agupta(a)pivotal.io]
*Sent:* Tuesday, October 06, 2015 1:05 AM
*To:* Discussions about Cloud Foundry projects and the system overall.
*Cc:* Vinay Murudi; Krishna M Kumar; Liangbiao; Srinivasch ch
*Subject:* [cf-dev] Re: Re: [Proposal] Wanted a Babysitter for my
applicatoin. ;-)



I'm not sure I see the benefit here.



Diego, for instance, runs a customizable babysitter alongside each app
instance, and kills the container if the babysitter says things are going
bad. This triggers an event that the system can react to, and the system
also polls for container states because events can always be lost.
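
As a rough illustration of the "react to events, but also poll because events can be lost" pattern described above, here is a minimal Python sketch. It is not Diego's code: the `ContainerTracker` class, its method names, and the state strings are all hypothetical and exist only to show the idea.

```python
class ContainerTracker:
    """Hypothetical tracker: reacts to events, with polling as a safety net."""

    def __init__(self):
        self.known_states = {}  # container id -> last state we acted on
        self.actions = []       # (container id, how we found out), for illustration

    def on_event(self, container_id, state):
        """Fast path: react as soon as a state-change event arrives."""
        self._observe(container_id, state, "event")

    def poll(self, actual_states):
        """Slow path: compare a full snapshot with what we know, catching lost events."""
        for container_id, state in actual_states.items():
            self._observe(container_id, state, "poll")

    def _observe(self, container_id, state, source):
        if self.known_states.get(container_id) == state:
            return  # nothing new, already handled
        self.known_states[container_id] = state
        if state == "crashed":
            # A real system would apply its restart policy here.
            self.actions.append((container_id, source))


tracker = ContainerTracker()
tracker.on_event("c1", "running")
tracker.on_event("c2", "running")
tracker.on_event("c1", "crashed")  # this event is delivered
# The crash event for "c2" is lost; the next poll still picks it up.
tracker.poll({"c1": "crashed", "c2": "crashed"})
print(tracker.actions)
```

The poll pass does not act twice on "c1", because the state it reports matches what the event already recorded.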



One thing to note is that in this case, "the system" is the Executor, not
HM9k (which doesn't exist in Diego), or the Converger (Diego's equivalent
of HM9k), or Firehose or Cloud Controller, which are very far removed from
the container backend. In Diego, the pieces are loosely coupled;
events/data in the system don't have to be sent through several layers of
abstraction.



Best,

Amit



On Mon, Oct 5, 2015 at 10:09 AM, Curry, Matthew <Matt.Curry(a)allstate.com>
wrote:

We have been talking about something similar that we have labeled the
Angry Farmer. I do not think you would need an agent. The firehose and
cloud controller should have everything that you need. Also, an agent does
not really give you the ability to measure the performance of instances
relative to each other, which is a good indicator of bad state or
performance.



Matt



*From: *Dhilip Kumar S <dhilip.kumar.s(a)huawei.com>
*Reply-To: *"Discussions about Cloud Foundry projects and the system
overall." <cf-dev(a)lists.cloudfoundry.org>
*Date: *Monday, October 5, 2015 at 9:31 AM
*To: *"Discussions about Cloud Foundry projects and the system overall." <
cf-dev(a)lists.cloudfoundry.org>
*Cc: *Vinay Murudi <vinaym(a)huawei.com>, Krishna M Kumar <
krishna.m.kumar(a)huawei.com>, Liangbiao <rexxar.liang(a)huawei.com>,
Srinivasch ch <srinivasch.ch(a)huawei.com>
*Subject: *[cf-dev] [Proposal] Wanted a Babysitter for my applicatoin. ;-)



Hello CF,

Greetings from Huawei. Here is a quick idea that came to our minds
recently. Honestly, we did not spend much time brainstorming this
internally, but we thought we could go ahead and ask the community
directly. It would be a great help to know if such an idea had already been
considered and dropped by the community.

*Proposal Motivations*

The way the health-check process is currently performed in Cloud Foundry is
to run a command
<https://github.com/cloudfoundry-incubator/healthcheck>
periodically; if the exit status is non-zero then it is assumed that the
application is non-responsive. We periodically repeat this process for all
the applications. Does this mean that we actually scan the entire data
center frequently to find one or a few misbehaving apps?
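
To make the pull model described above concrete, here is a minimal Python sketch, assuming only what the paragraph states: a command is run and a non-zero exit status is treated as unhealthy. It does not reproduce the actual healthcheck binary; the function names, the timeout, and the example commands are hypothetical.

```python
import subprocess
import sys


def is_healthy(command, timeout_seconds=5.0):
    """Run one health-check command; exit status 0 means healthy."""
    try:
        result = subprocess.run(
            command,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A check that hangs or cannot start is treated as a failure.
        return False
    return result.returncode == 0


def scan(checks):
    """One pass of the pull model: check every instance, return the unhealthy ones."""
    return [name for name, command in checks.items() if not is_healthy(command)]


# Stand-in commands: one exits with status 0, the other with status 1.
checks = {
    "app-a/0": [sys.executable, "-c", "raise SystemExit(0)"],
    "app-b/0": [sys.executable, "-c", "raise SystemExit(1)"],
}
print(scan(checks))
```

Every pass runs one command per instance, which is the per-instance cost the proposal is questioning.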

Why can’t we change the way the health check is done? Can it reflect the
real world? Hospitals don’t periodically scan the entire community
looking for sick residents. Similarly, why can’t we report problems as and
when they occur – just like in the real world?

How about a lightweight process that constantly monitors the application’s
health and reports whenever an app is down, non-responsive, etc.? In a huge
datacenter, thousands of apps are hosted, and each app has many instances.
Wouldn’t it be better to make the individual app/container come and tell
us (the health manager) that there is a problem instead of scanning all of
them? *Push versus Pull model* - Something like a babysitter residing
within each container and taking care of the ‘app’ hosted by our customers.

*How to accomplish this?*

Our proposal is for BabySitter (BS) – an agent residing within each
container, optionally deployed using app-specific configuration. This agent
sends the collected metrics to the health monitor in case of any anomaly,
along with periodic time-series information etc. The agent should remember
the configured threshold value that each app should not exceed, and trigger
an alarm automatically to the health monitor on any threshold violation.
The alarm could even be sent many times a second to the health monitor
depending on the severity of the event, whereas the regular periodic
‘time-series’ information could be collected every second but sent once a
minute to the HM. The challenge is to design the ‘bs’ application to be as
lightweight as possible.
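
A minimal Python sketch of the agent behaviour proposed above: sample a metric, raise an alarm immediately on a threshold violation, and send the buffered time series in periodic batches. Everything here is hypothetical, including the `BabysitterAgent` name, the `send` callback standing in for the transport to the health monitor, and the single numeric threshold.

```python
class BabysitterAgent:
    """Hypothetical push-model agent: immediate alarms, batched time series."""

    def __init__(self, threshold, report_every, send):
        self.threshold = threshold        # value the app should not exceed
        self.report_every = report_every  # samples per periodic report
        self.send = send                  # callable(kind, payload); assumed transport
        self.buffer = []

    def record(self, value):
        """Record one sample (e.g. taken once per second)."""
        self.buffer.append(value)
        if value > self.threshold:
            # Threshold violation: push an alarm right away.
            self.send("alarm", value)
        if len(self.buffer) >= self.report_every:
            # Periodic report: flush the buffered time series in one message.
            self.send("series", list(self.buffer))
            self.buffer.clear()


sent = []
agent = BabysitterAgent(
    threshold=90,
    report_every=3,
    send=lambda kind, payload: sent.append((kind, payload)),
)
for sample in [10, 95, 20]:
    agent.record(sample)
print(sent)
```

With one sample per second and `report_every=60`, this matches the "collected every second but sent once a minute" cadence from the proposal, while alarms bypass the batching.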

This is our primary idea. We also thought it would make more sense to add
a few more capabilities to the babysitter, like sshd (as a goroutine) and
a fileserver (as a goroutine), but before we bore you with all those
details, we first want to understand what the CF community thinks about
this initial idea.

Thanks in advance,

Dhilip





Re: [Proposal] Wanted a Babysitter for my application. ;-)

Dhilip
 

Hi All,

Thanks for the response.

Hi Amit,

Thanks for the info, I hadn’t noticed that we run a copy of executor for each ‘garden-linux’ container that we launch. So we already have a ‘push’ based container metrics collection and monitoring mechanism in place. In that case I can think of only the following benefits.


1) This can become a unified health check approach: since this binary can be packed within the container, it can even run inside a Docker container of an external system and keep pushing to a common HM. Or we could run it in the same VM as a MySQL instance to get its health.

2) This can be a part of sshd, as we are running a daemon in every container anyway.

Of course the original intention was to see if we could slightly alter the way Diego’s monitoring/metrics collection works. If this is already implemented then I do not see a point in pursuing this idea.

Thanks for your time CF,
Dhilip




Re: 3 etcd nodes don't work well in single zone

Tony
 

Hi Amit,

It has been two months since the last time I replied.

We tested 3 etcd VMs in a single AZ in another environment and they worked
fine. I can get a constant instance number now.

The CF version we are using is CF 212 (but even after we updated the
version from 210 to 212 in our previous environment, it still didn't work.
So I assume the problem is in our environment, not in CF).

The reason is not clear. But it works now anyway, thanks for all your
help.

Cheers,
Tony



