Re: HM9000 metrics

Pablo Alonso Rodriguez <palonsoro@...>

Ok. Just one question: when you say "the DEA will keep its container carcass
around for an hour", you mean that the DEA does not remove the container's
files. However, if the Warden grace time is configured to 300 seconds (5
minutes), the container itself is destroyed after that time (although its
files remain). Is this right?

Thank you very much.

2015-06-11 17:27 GMT+02:00 Dieu Cao <dcao(a)>:

Hi Pablo,

Ops Metrics is a PCF product and questions about that should be directed
to Pivotal customer support.

Regarding your second question, about the difference between crashed
indices and crashed instances:

The NumberOfCrashedInstances metric is usually about 4 times the
NumberOfCrashedIndices metric. NumberOfCrashedInstances is the total
number of crashed containers that remain on the DEAs, while
NumberOfCrashedIndices is the number of app-index pairs that have only
crashed instances.
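To make the distinction concrete, here is a minimal sketch that computes both numbers from a list of reported instances. The `Instance` record and its fields are hypothetical, not HM9000's real data model. It assumes, following the definition above, that an app-index pair counts as crashed only when every instance reported for it is in the CRASHED state.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    # Hypothetical record for one container reported in a DEA heartbeat.
    app_guid: str
    index: int
    instance_guid: str
    state: str  # e.g. "RUNNING" or "CRASHED"


def crash_metrics(instances):
    """Return (crashed_instances, crashed_indices) for a list of Instance records."""
    # Every crashed container counts, even several at the same index.
    crashed_instances = sum(1 for i in instances if i.state == "CRASHED")

    # Group all reported instances by their app-index pair.
    states_by_index = defaultdict(list)
    for i in instances:
        states_by_index[(i.app_guid, i.index)].append(i.state)

    # A pair counts as a crashed index only if all of its instances are crashed.
    crashed_indices = sum(
        1 for states in states_by_index.values()
        if all(s == "CRASHED" for s in states)
    )
    return crashed_instances, crashed_indices


# Three crash carcasses left behind at index 0 of a single app.
example = [Instance("app-1", 0, f"instance-{n}", "CRASHED") for n in range(3)]
print(crash_metrics(example))  # → (3, 1)
```

If a healthy RUNNING instance were also reported at index 0, the three carcasses would still count as crashed instances, but the index would no longer count as crashed.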

If an app has a droplet that crashes on startup, HM9000 will eventually
settle on restarting an instance at each of its indices every 16 minutes.
When the instance crashes, the DEA will keep its container carcass around
for an hour (to allow the space developers to inspect its files via the
files API if they have the instance guid). So on average, there will be
60/16 = 3.75 crashed instances in the system per crashed index. That
should account for most of the indices and instances that are crashed in
the system.
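The 60/16 estimate can be read as an application of Little's law. The average number of carcasses per index equals the rate at which carcasses are created (one per 16 minutes) multiplied by how long each one is kept (60 minutes). The sketch below compares that formula with a brute-force, minute-by-minute count. It uses an idealized model, which is an assumption: restarts happen exactly every 16 minutes, each one crashes immediately, and each carcass is removed exactly 60 minutes after its crash.

```python
def littles_law_carcasses(restart_interval_min: int, retention_min: int) -> float:
    """Average carcasses per index = creation rate (1 per interval) x retention time."""
    return retention_min / restart_interval_min


def simulated_carcasses(restart_interval_min: int, retention_min: int, periods: int = 100) -> float:
    """Brute-force time average, sampled once per minute over whole restart periods.

    Idealized model (an assumption): crashes occur at t = 0, I, 2I, ... with
    I = restart_interval_min, and each carcass disappears retention_min later.
    """
    start = retention_min  # from here on, carcasses are both created and expired
    end = start + restart_interval_min * periods
    samples = []
    for t in range(start, end):
        # Carcasses present at minute t: crashes c with t - retention < c <= t.
        alive = sum(
            1 for c in range(0, t + 1, restart_interval_min) if c > t - retention_min
        )
        samples.append(alive)
    return sum(samples) / len(samples)


print(littles_law_carcasses(16, 60))  # → 3.75
print(simulated_carcasses(16, 60))    # → 3.75
```

In this model, the instantaneous count for one index alternates between 3 and 4 (it is 4 for 12 minutes out of every 16). A snapshot showing 3 crashed instances for a single index is therefore consistent with it.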

Hope that helps.

CF Runtime PM

On Thu, Jun 11, 2015 at 4:48 AM, Pablo Alonso Rodriguez <palonsoro(a)> wrote:

Good morning.

Recently, I have been reviewing the metrics emitted by CF components. To
understand the HM9000 metrics, I have been reading the metrics documentation.

I post this message because I have two questions.

First question:

Not all the metrics retrieved via Ops Metrics are documented there. Is
there any additional documentation? If not, could you please explain what
the following metrics mean?

- StartEvacuating, StartCrashed, StartMissing
- StopDuplicate, StopEvacuationComplete, StopExtra

I have some guesses about a few of them, but I am not completely sure
of them.

Second question:

I do not fully understand the difference between the concepts of
"instances" and "indices" in metrics like "NumberOfCrashedIndices" and
"NumberOfCrashedInstances".
For example, I have one crashed app in my CF instance, and
"NumberOfCrashedIndices" reports '1' and "NumberOfCrashedInstances" reports
'3'. If I look at `cf app myapp`, I see one single crashed instance
(this was expected). If I look at the hm9000 dump, I see the following
about my crashed app (UUIDs have been replaced by fake ones):

Guid: 7ef08c44-102d-11e5-9c0d-0fb30c2610f7 | Version:
Desired: [1] instances, (STARTED, STAGED)
[0 CRASHED] a42a7236102d11e5813abfab583ad850 on 1-abc
[0 CRASHED] b35b9f1e102d11e5ad29cfc4c2c4e3ea on 2-ac3
[0 CRASHED] bbd37658102d11e5ba8e2b98d1fd1793 on 4-a67
CrashCounts: [0]:7499
Pending Starts:
[0] priority:1.00 send:2m34.628437793s
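The following sketch pulls the actual-instance entries out of a dump like the one above and groups them by index. The line format (`[index STATE] instance-guid on dea`) is inferred from this single excerpt, so the regular expression is an assumption, not a documented hm9000 output format.

```python
import re
from collections import defaultdict

# Assumed shape of an actual-instance line: "[<index> <STATE>] <instance guid> on <DEA>".
INSTANCE_LINE = re.compile(r"^\s*\[(\d+) (\w+)\] (\w+) on (\S+)\s*$")


def instances_by_index(dump_text):
    """Map each app index to a list of (state, instance_guid, dea) tuples found in the dump."""
    result = defaultdict(list)
    for line in dump_text.splitlines():
        match = INSTANCE_LINE.match(line)
        if match:
            index, state, guid, dea = match.groups()
            result[int(index)].append((state, guid, dea))
    return dict(result)


DUMP = """\
Desired: [1] instances, (STARTED, STAGED)
[0 CRASHED] a42a7236102d11e5813abfab583ad850 on 1-abc
[0 CRASHED] b35b9f1e102d11e5ad29cfc4c2c4e3ea on 2-ac3
[0 CRASHED] bbd37658102d11e5ba8e2b98d1fd1793 on 4-a67
CrashCounts: [0]:7499
Pending Starts:
[0] priority:1.00 send:2m34.628437793s
"""

for index, entries in instances_by_index(DUMP).items():
    # Prints the index, how many instances it has, and which DEAs hold them.
    print(index, len(entries), sorted({dea for _, _, dea in entries}))
# → 0 3 ['1-abc', '2-ac3', '4-a67']
```

The result is one desired index with three crashed instances on three different DEAs. This matches the explanation in the reply above: each crash at the same index leaves its own carcass behind until it is cleaned up.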

So, what does all this mean? I do not understand why I get 3 heartbeats
when I was only trying to start a single instance.

Thank you in advance

cf-dev mailing list
