Re: Collector fix for default data tagging


Erik Jasiak
 

Thanks Marco - for everyone else on the list, we reached out to Marco
after this email, and opened a tracker story[1] to do deeper analysis of
collector tags that may change. Thus far we have found nothing, but
will keep you all posted.

Erik
[1] https://www.pivotaltracker.com/story/show/102159498


Voelz, Marco wrote:

Dear Erik,

thanks for the detailed explanation and the heads-up. A few comments
and questions below.

On 26/08/15 17:49, "Erik Jasiak" <mjasiak(a)pivotal.io
<mailto:mjasiak(a)pivotal.io>> wrote:

We’ve decided to fix the collector bug to not overwrite these
tags.However, we wanted to explain the impact:If this change will
cripple your existing CF monitoring (i.e. the possible change in
metrics data will hurt you, and you’d rather call the bug the
‘intended behavior’) please let us know ASAP.


Will the change you are doing now in any way affect the already
existing metrics and what users of the graphite plugin see as their
names? Names are computed from the tags ‘deployment’, ‘job’, ‘index’,
‘ip’, ‘key’ [1].

We currently use the collector to forward metrics in graphite format
and built alerting and dashboards around this. So the alerts and
dashboards contain metric names in their queries, such as
“CF-landscape.Router.*.<metric name>”. If the names change, it *will*
break the queries we have and render the alerts and dashboards useless
– which might be somewhat transparent for dashboards, because you no
longer see data, but is a disaster for alerts when you assume
everything is fine if no event occurs.

I am aware that at some point in time we have to migrate to a
graphite-firehose-nozzle and I am fine if we have to adapt metric
names in alerts and dashboards. However, as long as I don’t do that
migration, I would expect that our setup continues to work as it does
today.


I’m also not sure about the detailed analysis:

the collector will override the tag provided by the varz endpoint
("job:api_z1") with the tag computed as a result of the
registration message on NATS ("job:CloudController"). As a result,
we may end up with unexpected tags in our downstream consumers of
metrics.


This is only about metrics which get added now, right? Because for all
metrics existing today, I would not call the current behavior
“unexpected” - this is what I’m using right now.

*Potential Ramifications:*

Although this change is pretty small, we could think of a
hypothetical situation where this update changes the tags emitted
by the collector. If a metric includes one of these tags already,
it will no longer be overridden. As a result, the downstream
consumer would see a different tag.


This part also seems to be targeted to authors providing new metrics.
Because, again, for the already existing metrics I as a consumer don’t
know about the tags with which the metrics are emitted from the
components currently, all I know about is what I’m getting from the
collector.

*TL;DR*: As long as for none of the existing metrics the tags
‘deployment’, ‘job’, ‘index’, ‘ip’, ‘key’ change, this change should
be fine from our point of view. If they change, this change is a
breaking one and I’d rather see this bug not fixed in the collector,
but adapt my existing alerts and dashboards once I switch to the firehose.

Warm regards
Marco

[1]
https://github.com/cloudfoundry/collector/blob/master/lib/collector/historian/graphite.rb#L25-L42

Join {cf-dev@lists.cloudfoundry.org to automatically receive all group messages.