DEA Monitoring Capabilities


Chawki, Amin <amin.chawki@...>
 

Hi,

By upgrading to CF v234 (including pre-release v232), we lost all our monitoring capabilities for the DEA and HM9000 (we were still using Collector). After migrating to the Firehose, only a fraction of the metrics were available. Metrics that are very important for our production systems, such as ‘available_memory_ratio’, were only added in CF v235. In the meantime, we were pretty much “flying blind”.

We replaced metrics that no longer exist, such as ‘DEA…mem_used_bytes’ and ‘HM9000…healthy’, which were available via Collector, with metrics from BOSH. Is this the way to go, or are there any plans to add them back?

Best Regards,
Amin


Marco Voelz
 

Including Jim Campbell to make sure this reaches him.



Jim CF Campbell
 

Yep, I've had a separate thread going with Runtime OG, who now own the
DEA metrics and who I thought had moved all previous /varz (aka
Collector) metrics into the Firehose. No answer from Mr Fraenkel yet.

--
Jim Campbell | Product Manager | Cloud Foundry | Pivotal.io | 303.618.0963


Jim CF Campbell
 

Looks like the Runtime team took some metrics out in v234 and added them
back in v235.

From the Runtime Slack:






Michael Fraenkel <michael.fraenkel@...>
 

When 234 was released, we did not realize that Collector was creating
additional metrics. Based on reports, we have added back any missing
metrics that people felt were needed. Let me know if we still have
missing metrics as you move beyond 234.

In 234, while we did not report available_memory_ratio, we did report
remaining_memory. If all your DEAs have the same amount of memory, you
can compute the ratio from it, or use the remaining_memory value directly.
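As a rough illustration of computing the ratio from remaining_memory, here is a minimal Python sketch. It assumes that every DEA has the same total memory, that you know this total from your own deployment configuration, and that it is expressed in the same unit as remaining_memory. The function name, the DEA names, and the sample values are hypothetical and are not taken from any Cloud Foundry API.

```python
def available_memory_ratio(remaining_memory: float, total_memory: float) -> float:
    """Return the fraction of a DEA's memory that is still available."""
    if total_memory <= 0:
        raise ValueError("total_memory must be positive")
    return remaining_memory / total_memory


# Assumption: identical total memory on every DEA, same unit as remaining_memory.
DEA_TOTAL_MEMORY = 16384.0

# Hypothetical remaining_memory readings, keyed by a made-up DEA name.
remaining_by_dea = {"dea-0": 8192.0, "dea-1": 4096.0}

ratios = {
    name: available_memory_ratio(remaining, DEA_TOTAL_MEMORY)
    for name, remaining in remaining_by_dea.items()
}
```

If the DEAs differ in size, a single constant no longer works and the total would have to be looked up per DEA.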

What information were you trying to understand from mem_used_bytes?

As for the healthy metric from HM9000, it was quite misleading. It
reported healthy as long as the metrics server was running, which was
no real indication of health. What exactly do you want to know?

- Michael



Chawki, Amin <amin.chawki@...>
 

-What information were you trying to understand from mem_used_bytes?

We used mem_used_bytes and mem_free_bytes (currently metrics from BOSH) to get an approximate overview of the real overall memory usage of all apps. This helps us better understand the current overcommit factor.
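A minimal Python sketch of that kind of aggregation, assuming one (mem_used_bytes, mem_free_bytes) pair per DEA VM and assuming that used plus free approximates the VM's total memory. The function name and the sample values are hypothetical.

```python
from typing import Iterable, Tuple


def overall_memory_usage(samples: Iterable[Tuple[int, int]]) -> float:
    """Return used / (used + free), summed over all (used, free) byte pairs."""
    used = 0
    free = 0
    for mem_used_bytes, mem_free_bytes in samples:
        used += mem_used_bytes
        free += mem_free_bytes
    total = used + free
    if total == 0:
        raise ValueError("no memory reported")
    return used / total


GIB = 1024 ** 3

# Hypothetical readings from two DEA VMs: (mem_used_bytes, mem_free_bytes).
samples = [(6 * GIB, 10 * GIB), (2 * GIB, 14 * GIB)]
usage = overall_memory_usage(samples)
```

The result is the fraction of memory actually in use across the sampled VMs, which can then be compared with the memory reserved by apps.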

-As far as the healthy metric from HM9000, it was quite misleading. It reported healthy as long as the metrics server was running which wasn't any indication of health. What exactly do you want to know?

Ah ok, I was not aware of that. Is there any reliable way to verify whether HM9000 is healthy?

Best Regards and Thanks,
Amin




Michael Fraenkel <michael.fraenkel@...>
 

The apiserver, listener, shredder, and analyzer all produce standard Go
metrics, which really don't tell you much about the "health" of hm9000
other than that the processes are up. The metrics that are reported tell
you the health of your system. If the metrics are not being reported or
just look wrong, that is really when you know the "health" needs further
investigation. In most cases, you will need to examine the logs to
determine next steps.

The listener reports ReceivedHeartbeats and SavedHeartbeats. These will
at least tell you that the listener process is receiving heartbeats and
processing them. I have opened a bug on the values reported, since they
do not report the last known value but an ever-increasing amount.
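Until that bug is resolved, one way to work with an ever-increasing value is to look at the difference between consecutive samples. The Python sketch below assumes the readings behave like a cumulative counter that may drop back after a process restart. The function name and the sample readings are hypothetical.

```python
from typing import List, Sequence


def counter_deltas(samples: Sequence[int]) -> List[int]:
    """Convert cumulative counter readings into per-interval increases.

    A drop in the value is treated as a counter reset (for example a
    process restart); the new reading itself is then used as the increase.
    """
    deltas = []
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            deltas.append(current - previous)
        else:
            deltas.append(current)
    return deltas


# Hypothetical ReceivedHeartbeats readings taken at a fixed interval.
received = [100, 250, 250, 40, 90]
deltas = counter_deltas(received)

# Intervals with no increase, which may deserve a closer look at the logs.
stalled = [index for index, delta in enumerate(deltas) if delta == 0]
```

An interval with a zero increase suggests that no heartbeats were counted during that period.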

The analyzer reports metrics on the state of the applications that are
expected to be running, including instance counts, as well as what is
actually running or missing.

- Michael
