Re: HM9000 metrics

CF Runtime

I believe the full crashed Warden container is kept around for an hour.

The DEA keeps the Warden handle to the container. The Warden grace time
only applies after all handles have been released.

Joseph Palermo
CF Runtime Team

On Thu, Jun 11, 2015 at 8:55 AM, Pablo Alonso Rodriguez <palonsoro(a)> wrote:
Ok. Just a question: When you say "the DEA will keep its container carcass
around for an hour", you mean that the DEA does not remove the container
files. However, if Warden grace time is configured at 300 seconds (5
minutes), the container is actually destroyed after that time (although its
files remain). Is this right?

Thank you very much.

2015-06-11 17:27 GMT+02:00 Dieu Cao <dcao(a)>:

Hi Pablo,

Ops Metrics is a PCF product and questions about that should be directed
to Pivotal customer support.

Regarding your second question, about the difference between crashed
instances and crashed indices:

The NumberOfCrashedInstances metric is usually about 4 times the
NumberOfCrashedIndices metric. NumberOfCrashedInstances is the total
number of crashed containers that remain on the DEAs, while
NumberOfCrashedIndices is the number of app-index pairs which have only
crashed instances.
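
To make the distinction concrete, here is a minimal Python sketch of how the two counts differ. It is only an illustration of the definitions above, not HM9000's actual code; the record layout, the app names, and the `crashed_counts` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (app guid, instance index, instance state).
# Not real HM9000 data structures; made up to illustrate the definitions.
instances = [
    ("app-a", 0, "CRASHED"),
    ("app-a", 0, "CRASHED"),
    ("app-a", 0, "CRASHED"),
    ("app-b", 0, "RUNNING"),
    ("app-b", 0, "CRASHED"),
]


def crashed_counts(records):
    """Return (crashed instances, crashed indices) for the given records."""
    states_by_index = defaultdict(list)
    for app, index, state in records:
        states_by_index[(app, index)].append(state)

    # Every crashed container still present counts as a crashed instance.
    crashed_instances = sum(1 for _, _, state in records if state == "CRASHED")

    # An app-index pair counts only if all of its instances are crashed.
    crashed_indices = sum(
        1
        for states in states_by_index.values()
        if all(state == "CRASHED" for state in states)
    )
    return crashed_instances, crashed_indices


print(crashed_counts(instances))  # (4, 1)
```

In this made-up data, index 0 of app-b still has a running instance, so its crashed container adds to the instance count but not to the index count.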

If an app has a droplet that crashes on startup, HM9000 will eventually
settle on restarting an instance at each of its indices every 16
minutes. When the instance crashes, the DEA will keep its container
carcass around for an hour (to allow the space developers to inspect its
files via the files API if they have the instance guid). So on average,
there will be 60/16 = 3.75 crashed instances in the system per crashed
index. That should account for most of the indices and instances that
are crashed in the system.
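
The arithmetic above can be sketched in a few lines of Python. The 60-minute retention and the 16-minute restart interval are the figures quoted in this message; the function and parameter names are hypothetical.

```python
def crashed_instances_per_index(retention_minutes=60, restart_interval_minutes=16):
    """Average number of crashed containers kept per crashed index.

    Each restart attempt leaves one crashed container that stays on the DEA
    for the retention period, so the steady-state count is the ratio of the
    two durations.
    """
    return retention_minutes / restart_interval_minutes


ratio = crashed_instances_per_index()
print(ratio)  # 3.75
```

At any given moment the real count per index is a whole number, so seeing 3 or 4 crashed instances for one crashed index is consistent with this average.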

Hope that helps.

CF Runtime PM

On Thu, Jun 11, 2015 at 4:48 AM, Pablo Alonso Rodriguez <
palonsoro(a)> wrote:

Good morning.

Recently, I have been reviewing metrics emitted by CF components. In
order to understand HM9000 metrics, I have been reading the metrics
documentation (at

I post this message because I have two questions.

First question:

Not all the metrics retrieved via Ops Metrics are documented there. Is
there any additional documentation? If not, could you please explain to
me what the following metrics mean?

- StartEvacuating, StartCrashed, StartMissing
- StopDuplicate, StopEvacuationComplete, StopExtra

I have some guesses about some of them, but I am not completely sure
about them.

Second question:

I do not fully understand the difference between the concepts of
"instances" and "indices" at metrics like "NumberOfCrashedIndices" and

For example, I have one crashed app in my CF instance, and
"NumberOfCrashedIndices" reports '1' and "NumberOfCrashedInstances" reports
'3'. If I have a look at `cf app myapp`, I see one single crashed instance
(this was expected). If I have a look at hm9000 dump, I see the following
about my crashed app (UUIDs have been replaced by false ones):

Guid: 7ef08c44-102d-11e5-9c0d-0fb30c2610f7 | Version:
Desired: [1] instances, (STARTED, STAGED)
[0 CRASHED] a42a7236102d11e5813abfab583ad850 on 1-abc
[0 CRASHED] b35b9f1e102d11e5ad29cfc4c2c4e3ea on 2-ac3
[0 CRASHED] bbd37658102d11e5ba8e2b98d1fd1793 on 4-a67
CrashCounts: [0]:7499
Pending Starts:
[0] priority:1.00 send:2m34.628437793s

So, what does all this mean? I do not understand why I get 3
heartbeats when I was only trying to start a single instance.

Thank you in advance

cf-dev mailing list