Collector fix for default data tagging
Erik Jasiak
Hi all,
The loggregator team has been working to maintain metric parity between the CF collector[1] and the firehose[2] in the short term. This is to help CF operators feel confident they can move from the collector to the firehose as the collector is removed.

During this work, we found one major bug in the collector. A longer analysis follows, but in summary the collector was designed to figure out and attach key bits of info to a metric (such as originating component name and IP) if they weren't included. The implementation, however, overwrote this data, even if it was provided by the metric author. One additional consequence is that no other part of the system could forward data on a component's behalf, without the collector replacing it with the forwarder's info.

Though we've been trying to touch the collector as little as possible before retirement, leaving this bug meant there was no straightforward way to achieve nominal metric parity. Fixing the bug changes some of the metric data CF operators may have seen previously, but thus far we have been unable to find any widespread problem with the fix.

We've decided to fix the collector bug to not overwrite these tags. However, we wanted to explain the impact: If this change will cripple your existing CF monitoring (i.e. the possible change in metrics data will hurt you, and you'd rather call the bug the 'intended behavior') please let us know ASAP.

Many Thanks,
Erik Jasiak
PM - Loggregator

[1] https://github.com/cloudfoundry/collector
[2] https://github.com/cloudfoundry/noaa#logs-and-metrics-firehose

Analysis from the loggregator team follows -- many thanks to David Sabetti and Nino Khodabandeh for the write up.

########################

*The current (buggy) behavior:*
In short, the collector computes specific tags (listed below) for metrics, and overrides those tags even if they were already provided by the varz endpoint.

*Slightly longer explanation of the bug:*
When a component registers its varz endpoint for the collector, it publishes a message on NATS. The message contains certain information about the component, such as component name, ip/port of the varz endpoint, and job index. Using this information, and a few other pieces of collector configuration, the collector computes tags for metrics emitted by that varz endpoint. These tags are "job", "index", "name" (which is a combination of "job" and "index"), "role", "deployment", "ip", and (only in the case of DEAs) "stack".

The bug is that, even if the component provides one of these tags, the collector will override it with the computed value. As an example, if the Cloud Controller job emits metrics with a tag "job:api_z1", but publishes a varz message that includes "type: CloudController", the collector will override the tag provided by the varz endpoint ("job:api_z1") with the tag computed as a result of the registration message on NATS ("job:CloudController"). As a result, we may end up with unexpected tags in our downstream consumers of metrics.

*The proposed (fixed) behavior:*
The collector uses these computed values as **default** values for the tags if they are not already provided by the component. However, if the varz metrics include one of these tags, the collector will respect that value and not override it.

*Potential Ramifications:*
Although this change is pretty small, we could think of a hypothetical situation where this update changes the tags emitted by the collector. If a metric includes one of these tags already, it will no longer be overridden. As a result, the downstream consumer would see a different tag.
We don't imagine that anybody is producing a tag with the **expectation** that it gets overridden, but if a metric includes such a tag by accident (and the author never actually verified that the correct value of the tag was received by the downstream consumer), you may see that the tags attached to your metrics have changed. However, in this case, the new tag value represents what the author of the component originally intended.
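
*Illustrative sketch (not the collector's actual code):*
To make the difference between the two behaviors concrete, here is a minimal Python sketch of the tag merge. It is not taken from the collector itself. The function names, the registration fields "index" and "host", and the "example-deployment" value are assumptions made up for illustration; only the "type: CloudController" and "job:api_z1" values come from the example above. The sketch covers just a subset of the computed tags ("job", "index", "ip", "deployment").

```python
# Hypothetical sketch of the tag merge; all names here are illustrative
# and are not the collector's real API.

def computed_tags(registration, deployment="example-deployment"):
    """Derive default tags from a component's registration message.

    The "index" and "host" field names are assumed for this sketch.
    """
    return {
        "job": registration["type"],
        "index": str(registration["index"]),
        "ip": registration["host"].split(":")[0],  # drop the port
        "deployment": deployment,
    }


def tag_buggy(registration, metric_tags):
    """Old behavior: computed values overwrite tags the component sent."""
    return {**metric_tags, **computed_tags(registration)}


def tag_fixed(registration, metric_tags):
    """New behavior: computed values only fill in tags that are missing."""
    return {**computed_tags(registration), **metric_tags}


if __name__ == "__main__":
    registration = {"type": "CloudController", "index": 0, "host": "10.0.0.5:8080"}
    emitted = {"job": "api_z1"}
    print("buggy:", tag_buggy(registration, emitted)["job"])
    print("fixed:", tag_fixed(registration, emitted)["job"])
```

In both functions the right-hand dictionary wins on a key collision, so the only difference between the buggy and fixed behavior in this sketch is the order of the merge. A metric that carries no "job" tag gets the computed value in either case.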