Collector fix for default data tagging


Erik Jasiak
 

Hi all,

The loggregator team has been working to maintain metric parity between
the CF collector[1] and the firehose[2] in the short term. This is to
help CF operators feel confident they can move from the collector to the
firehose as the collector is removed.

During this work, we found one major bug in the collector. A longer
analysis follows, but in summary: the collector was designed to figure
out and attach key bits of info to a metric (such as the originating
component's name and IP) if they weren't already included. The
implementation, however, overwrote this data even when it was provided
by the metric author. One additional consequence is that no other part
of the system could forward data on a component's behalf without the
collector replacing it with the forwarder's info.

Though we've been trying to touch the collector as little as possible
before retirement, leaving this bug meant there was no straightforward
way to achieve nominal metric parity. Fixing the bug changes some of the
metric data CF operators may have seen previously, but thus far we have
been unable to find any widespread problem with the fix.

We've decided to fix the collector bug so that it no longer overwrites
these tags. However, we wanted to explain the impact: if this change will
cripple your existing CF monitoring (i.e. the possible change in metrics
data will hurt you, and you'd rather call the bug the 'intended
behavior'), please let us know ASAP.

Many Thanks,

Erik Jasiak

PM - Loggregator

[1] https://github.com/cloudfoundry/collector

[2] https://github.com/cloudfoundry/noaa#logs-and-metrics-firehose

Analysis from the loggregator team follows -- many thanks to David
Sabetti and Nino Khodabandeh for the write-up.

########################

*The current (buggy) behavior:*

In short, the collector computes specific tags (listed below) for
metrics, and overrides those tags even if they were already provided by
the varz endpoint.

*Slightly longer explanation of the bug:*

When a component registers its varz endpoint for the collector, it
publishes a message on NATS. The message contains certain information
about the component, such as component name, ip/port of the varz
endpoint, and job index. Using this information, and a few other pieces
of collector configuration, the collector computes tags for metrics
emitted by that varz endpoint. These tags are "job", "index", "name"
(which is a combination of "job" and "index"), "role", "deployment",
"ip", and (only in the case of DEAs) "stack".

The bug is that, even if the component provides one of these tags, the
collector will override it with the computed value. As an example, if
the Cloud Controller job emits metrics with a tag "job:api_z1", but
publishes a varz message that includes "type: CloudController", the
collector will override the tag provided by the varz endpoint
("job:api_z1") with the tag computed as a result of the registration
message on NATS ("job:CloudController"). As a result, we may end up with
unexpected tags in our downstream consumers of metrics.
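
In the same illustrative style, the current merge effectively runs in
this direction (the variable names are ours, not the collector's):

    varz_tags     = { "job" => "api_z1" }            # supplied by the component
    computed_tags = { "job" => "CloudController" }   # derived from the NATS announce
    tags = varz_tags.merge(computed_tags)            # computed values overwrite
    tags["job"]   # => "CloudController" -- the component's own value is lost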

*The proposed (fixed) behavior:*

The collector uses these computed values as **default** values for the
tags if they are not already provided by the component. However, if the
varz metrics include one of these tags, the collector will respect that
value and not override it.
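
Sketched the same way, the fix simply flips the merge direction so that
the computed values act only as defaults:

    varz_tags     = { "job" => "api_z1" }            # supplied by the component
    computed_tags = { "job" => "CloudController" }   # derived from the NATS announce
    tags = computed_tags.merge(varz_tags)            # supplied values win
    tags["job"]   # => "api_z1" -- the computed value is only a fallback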

*Potential Ramifications:*

Although this change is pretty small, we could think of a hypothetical
situation where this update changes the tags emitted by the collector.
If a metric includes one of these tags already, it will no longer be
overridden. As a result, the downstream consumer would see a different
tag. We don't imagine that anybody is producing a tag with the
**expectation** that it gets overridden, but if a metric includes such a
tag by accident (and the author never actually verified that the correct
value of the tag was received by the downstream consumer), you may see
that the tags attached to your metrics have changed. However, in this
case, the new tag value represents what the author of the component
originally intended.


Marco Voelz
 

Dear Erik,

thanks for the detailed explanation and the heads-up. A few comments and questions below.

On 26/08/15 17:49, "Erik Jasiak" <mjasiak(a)pivotal.io> wrote:
We’ve decided to fix the collector bug to not overwrite these tags. However, we wanted to explain the impact: If this change will cripple your existing CF monitoring (i.e. the possible change in metrics data will hurt you, and you’d rather call the bug the ‘intended behavior’) please let us know ASAP.

Will the change you are doing now in any way affect the already existing metrics and what users of the graphite plugin see as their names? Names are computed from the tags ‘deployment’, ‘job’, ‘index’, ‘ip’, ‘key’ [1].
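
As far as I understand it, a name is put together roughly like this (the
exact logic is in [1]; the tag ordering and the plain dot-join below are
just my shorthand, not the plugin's actual code):

    tags = {
      "deployment" => "CF-landscape",
      "job"        => "Router",
      "index"      => "0",
      "ip"         => "10.0.16.5"
    }
    metric_key = "router.requests"
    graphite_name = ([tags["deployment"], tags["job"], tags["index"],
                      tags["ip"]].compact + [metric_key]).join(".")
    # e.g. "CF-landscape.Router.0.10.0.16.5.router.requests"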

We currently use the collector to forward metrics in graphite format and have built alerting and dashboards around this. The alerts and dashboards contain metric names in their queries, such as “CF-landscape.Router.*.<metric name>”. If the names change, it *will* break the queries we have and render the alerts and dashboards useless. That might be somewhat transparent for dashboards, because you simply stop seeing data, but it is a disaster for alerts, where you assume everything is fine if no event occurs.

I am aware that at some point in time we have to migrate to a graphite-firehose-nozzle and I am fine if we have to adapt metric names in alerts and dashboards. However, as long as I don’t do that migration, I would expect that our setup continues to work as it does today.

I’m also not sure about the detailed analysis:

the collector will override the tag provided by the varz endpoint ("job:api_z1") with the tag computed as a result of the registration message on NATS ("job:CloudController"). As a result, we may end up with unexpected tags in our downstream consumers of metrics.

This is only about metrics which get added now, right? Because for all metrics existing today, I would not call the current behavior “unexpected” - this is what I’m using right now.

Potential Ramifications:

Although this change is pretty small, we could think of a hypothetical situation where this update changes the tags emitted by the collector. If a metric includes one of these tags already, it will no longer be overridden. As a result, the downstream consumer would see a different tag.

This part also seems to be targeted at authors providing new metrics. Because, again, for the already existing metrics I as a consumer don’t know the tags with which the metrics are currently emitted from the components; all I know is what I’m getting from the collector.

TL;DR: As long as the tags ‘deployment’, ‘job’, ‘index’, ‘ip’, and ‘key’ don’t change for any of the existing metrics, this change should be fine from our point of view. If they do change, this change is a breaking one and I’d rather see the bug left unfixed in the collector and adapt my existing alerts and dashboards once I switch to the firehose.

Warm regards
Marco

[1] https://github.com/cloudfoundry/collector/blob/master/lib/collector/historian/graphite.rb#L25-L42


Erik Jasiak
 

Thanks Marco - for everyone else on the list, we reached out to Marco
after this email and opened a tracker story[1] to do a deeper analysis
of collector tags that may change. Thus far we have found nothing, but
will keep you all posted.

Erik
[1] https://www.pivotaltracker.com/story/show/102159498



Rohit Kumar
 

Hi Marco,

We instrumented the collector and found that none of the existing CF
components provide any of the tags that the collector would previously
have overridden. So this change shouldn't affect any metric names coming
from components in cf-release or diego-release. Let us know if you do
find metrics from other components that are affected by this.

Thanks,
Rohit and David
