All,

The existing dropsonde protocol uses a different message type for each event type. HttpStart, HttpStop, ContainerMetrics, and so on are all distinct types in the protocol definition. This requires protocol changes to introduce any new event type, making such changes very expensive. We've been working for the past few weeks on an addition to the dropsonde protocol to support easier future extension to new types of events and to make it easier for users to define their own events.

The document linked below [1] describes a generic data point message capable of carrying multi-dimensional, multi-metric points as sets of name/value pairs. This new message is expected to be added as an additional entry in the existing dropsonde protocol metric type enum. Things are now at a point where we'd like to get feedback from the community before moving forward with implementation.

Please contribute your thoughts on the document in whichever way you are most comfortable: comments on the document, email here, or email directly to me. If you comment on the document, please make sure you are logged in so we can keep track of who is asking for what. Your views are not just appreciated, but critical to the continued health and success of the Cloud Foundry community. Thank you!

b

[1] https://docs.google.com/document/d/1SzvT1BjrBPqUw6zfSYYFfaW9vX_dTZZjn5sl2nxB6Bc/edit?usp=sharing
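As a minimal Go sketch of the kind of generic point described above (the type and field names here are illustrative assumptions, not the actual message definition from the linked document [1]):

package sketch

// AttributeValue mirrors a protobuf-style oneof: exactly one field is set.
type AttributeValue struct {
    Text   *string
    Int    *int64
    Double *float64
}

// GenericPoint carries a multi-dimensional, multi-metric point as a set of
// name/value pairs, plus a partition key (discussed later in this thread).
type GenericPoint struct {
    Timestamp    int64                     // UNIX epoch nanoseconds
    PartitionKey string                    // routes related points together
    Attributes   map[string]AttributeValue // dimensions and metrics by name
}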
|
|
Hi Ben,
I was wondering if you could give a concrete use case for the partition key functionality.
In particular, I am interested in how we solve multi-line log entries. I think it would be better to solve it by keeping all the data (the multiple lines) together throughout the logging/metrics pipeline, but I could see how something like a partition key might help keep the data together as well.
Second question: how large do you see these point messages getting (average and max)? There are still several stages of the logging/metrics pipeline that use UDP with a standard 64K size limit.
Thanks, Dwayne
|
|
great questions, dwayne.
1) the partition key is intended to be used in a similar manner to partitioners in distributed systems like cassandra or kafka. the specific behavior i would like to make part of the contract is two-fold: that all data with the same key is routed to the same partition and that all data in a partition is FIFO (meaning no ordering guarantees beyond arrival time).
this could help with the multi-line log problem by ensuring a single consumer will receive all lines for a given log entry in order, allowing simple reassembly. however, the lines might be interleaved with other lines with the same key or even other keys that happen to map to the same partition, so the consumer does require a bit of intelligence. this was actually one of the driving scenarios for adding the key.
2) i expect typical points to be in the hundreds of bytes to a few KB. if we find ourselves regularly needing much larger points, especially near that 64KB limit, i'd look to the JSON representation as the hierarchical structure is more efficiently managed there.
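A minimal Go sketch of the partitioning contract in (1) and of the consumer-side reassembly for the multi-line case; the function names and the choice of FNV hashing are assumptions, not part of the proposal:

package sketch

import "hash/fnv"

// PartitionFor: the same key always maps to the same partition; ordering
// within a partition is simply arrival order (FIFO).
func PartitionFor(key string, numPartitions int) int {
    h := fnv.New32a()
    h.Write([]byte(key))
    return int(h.Sum32()) % numPartitions
}

// Reassemble shows the bit of consumer intelligence mentioned above: group
// lines by key, tolerating interleaving with other keys in the same partition.
func Reassemble(lines []struct{ Key, Line string }) map[string][]string {
    byKey := map[string][]string{}
    for _, l := range lines {
        byKey[l.Key] = append(byKey[l.Key], l.Line) // per-key arrival order kept
    }
    return byKey
}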
b
|
|
The current way of sending metrics as either Values or Counters through the pipeline makes the development of a downstream consumer (=nozzle) pretty easy. If you look at the datadog nozzle[0], it just takes all ValueMetrics and Counters and sends them off to datadog. The nozzle does not have to know anything about these metrics (e.g. their origin, name, or layout).

Adding a new way to send metrics as a nested object would certainly make the downstream implementation more complicated. In that case, the nozzle developer has to know what metrics are included inside the generic point (basically the schema of the metric) and treat each point accordingly. For example, if I were to write a nozzle that emits metrics to Graphite with a StatsD client (as is done here[1]), I need to know if my int64 value is a Gauge or a Counter. Also, my consumer is under constant risk of breaking when the upstream schema changes.

We are already facing this problem with the container metrics. But at least the container metrics are in a defined format that is well documented and not likely to change.

I agree with you, though, that the dropsonde protocol could use a mechanism for easier extension. Having a GenericPoint and/or GenericEvent seems like a good idea that I whole-heartedly support. I would just like to stay away from nested metrics. I think the cost of adding more logic into the downstream consumer (and making it harder to maintain) is not worth the benefit of a more concise metric transport.

[0] https://github.com/cloudfoundry-incubator/datadog-firehose-nozzle
[1] https://github.com/CloudCredo/graphite-nozzle
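A schematic Go sketch of why the flat types keep a nozzle simple: it can forward every metric without knowing its schema. The envelope types below are simplified stand-ins, not the actual sonde-go/dropsonde API:

package sketch

import "fmt"

type ValueMetric struct {
    Name  string
    Value float64
    Unit  string
}

type CounterEvent struct {
    Name  string
    Total uint64
}

// Envelope is a simplified stand-in: at most one of the metric fields is set.
type Envelope struct {
    Origin       string
    ValueMetric  *ValueMetric
    CounterEvent *CounterEvent
}

// Forward needs no knowledge of the metric's origin, name, or layout.
func Forward(e Envelope) {
    switch {
    case e.ValueMetric != nil:
        fmt.Printf("gauge %s.%s=%g %s\n", e.Origin, e.ValueMetric.Name, e.ValueMetric.Value, e.ValueMetric.Unit)
    case e.CounterEvent != nil:
        fmt.Printf("counter %s.%s=%d\n", e.Origin, e.CounterEvent.Name, e.CounterEvent.Total)
    }
}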
|
|
johannes,

the problem of upstream schema changes causing downstream change or breakage is the current situation: every addition of a metric type implies a change to the dropsonde-protocol requiring everything downstream to be updated.

the schema concerns are similar. currently there is no schema whatsoever beyond the very fine grained "this is a name and this is a value". this means every implementation of redis info export, for example, can, and almost certainly will, be different. this results in every downstream consumer having to know every possible variant or to only support specific variants, both exactly the problem you are looking to avoid.

i share the core concern regarding ensuring points are "sufficiently" self describing. however, there is no clear line delineating what is sufficient. the current proposal pushes almost everything out to schema. we could imagine a change to the attributes that includes what an attribute is (gauge, counter, etc), what the units are for the attribute, and so on.

it is critical that we balance the complexity of the points against the complexity of the consumers, as there is no free lunch here. which specific functionality would you want to see in the generic points to achieve the balance you prefer?

b
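A rough Go sketch of that "more self-describing attributes" direction (all names here are illustrative assumptions, not the proposal): each attribute carries its kind and unit, making points fatter but consumers simpler:

package sketch

type AttrKind int

const (
    Gauge AttrKind = iota
    Counter
)

// Attribute carries enough metadata for a consumer to interpret the value
// without an out-of-band schema.
type Attribute struct {
    Kind  AttrKind // gauge, counter, ...
    Unit  string   // e.g. "bytes", "ms"
    Value float64
}

type SelfDescribingPoint struct {
    Timestamp  int64
    Attributes map[string]Attribute
}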
|
|
after understanding ben's proposal of what i would call an extensible generic point versus the status quo of metrics that are actually hard-coded in software by both the metric producer and the metric consumer, i immediately gravitated toward ben's approach.
cloud foundry has really benefited from extensibility in these examples:
* diego lifecycles
* app buildpacks
* app docker images
* app as windows build artifact
* service brokers
* cf cli plugins
* collector plugins
* firehose nozzles
* diego route emitters
* garden backends
* bosh cli plugins
* bosh releases
* external bosh CPIs
* bosh health monitor plugins
let me know if there are other points of extension i'm missing.
in most cases, the initial implementations required cloud foundry system components to change software to support additional extensibility, and some of the examples above still require that. it's a source of frustration, as someone with an idea to explore needs to persuade the cf maintaining team to process a pull request or complete work on an area. i see ben's proposal as adding an extremely valuable point of extension for creating application and system metrics that benefits the entire cloud foundry ecosystem.
i am sympathetic to the question raised by dwayne around how large the messages will be. it would seem that we could consider an upper bound on the number of attributes supported by looking at the types of metrics that would be expressed. the redis info point is already 84 attributes for example.
all of the following seem related to scaling considerations off the top of my head:
* how large an individual metric may be
* at what rate the platform should support producers sending metrics
* what platform quality of service to provide (lossiness or not, back pressure, rate limiting, etc)
* what types of clients of the metrics are supported and any limitations related to that
* whether there is tenant variability in some of the dimensions above, for example a system metric might have a higher SLA than an app metric
should we consider putting a boundary on "how large an individual metric may be" by limiting the initial implementation to a fixed number of attributes (which we could change later or make configurable)?
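A tiny Go sketch of one way such a bound could be enforced at emit time; the cap value and names are made up for illustration:

package sketch

import "fmt"

const maxAttributes = 128 // assumed cap; the redis info example already needs 84

func ValidateAttributeCount(attrs map[string]float64) error {
    if len(attrs) > maxAttributes {
        return fmt.Errorf("point has %d attributes, limit is %d", len(attrs), maxAttributes)
    }
    return nil
}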
i'm personally really excited about this new set of extensibility being proposed and the creative things people will do with it. having loggregator as a built-in system component versus a bolt-on is already such a great capability compared with other platforms and i see investments to make it more extensible and apply to more scenarios as making cloud foundry more valuable and more fun to use.
|
|
Ben,
I guess I am working under the assumption that the current upstream schema is not going to see a terrible amount of change. The StatsD protocol has been very stable for over four years, so I don't understand why we would add more and more metric types. (I already struggle with the decision to have container metrics as their own data type. I am not quite sure why that was done vs just expressing them as ValueMetrics).
I am also not following your argument about multiple implementations of a redis export. Why would you have multiple implementations of a redis info export? Also, why does the downstream consumer have to know about the schema? Neither the datadog nozzle nor the graphite nozzle cares about any type of schema right now.
But to answer your question: as a downstream developer I am not as interested in whether you are sending me a uint32 or uint64; the meaning (e.g. counter vs value) is much more important to me. So, if you were to do nested metrics, I would rather see nested counters or values in there, plus maybe one type that we are missing: a generic event with just a string.
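A rough Go sketch of that flatter shape (illustrative names only): explicitly typed values and counters, plus a generic event that is just a string, instead of an open-ended nested point:

package sketch

// Gauge-style value with an explicit meaning and unit.
type NestedValue struct {
    Name  string
    Value float64
    Unit  string
}

// Monotonic counter with an explicit meaning.
type NestedCounter struct {
    Name         string
    Delta, Total uint64
}

type MetricGroup struct {
    Origin   string
    Values   []NestedValue
    Counters []NestedCounter
}

// GenericEvent is the "one type that we are missing": just a string.
type GenericEvent struct {
    Origin  string
    Message string
}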
Generally, I would try to avoid falling into the trap of creating an overly generic system at the cost of making consumers unnecessarily complicated. Maybe it would help if you outlined a few use cases that might benefit from a system like this and how specifically you would implement a downstream consumer (e.g. is there a common place where I can fetch the schema for the generic data point?).
|
|
One of the use cases that would benefit from this would be metrics sending. Given that the current statsd protocol lacks the ability to supply metadata, such as job and index ids, some apps have taken to inserting what would otherwise be tagged data into the metric namespace. As an example: [screenshot omitted: metric key names such as "router__0" and "vizzini_1_abcd" with identity embedded in the namespace]

Endpoints like Datadog and OpenTSDB want key names that are not unique per instance. Graphite has wildcard semantics to accommodate this, but Datadog and OpenTSDB do not, and would need this implemented elsewhere in the delivery chain. StatsD doesn't provide a way to side-channel this information, and we don't want to implement custom parsing on consumers when we overload the metric key.

I believe that this protocol will be a move towards providing a better means by which folks can supply metrics to the system without having to make convention decisions that then have to be scraped out and transformed on the consumer side, as was not done in the example above. A generic schema does not exist currently, and this appears to be a promising way of delivering that functionality. It would be much easier to use a generic schema to output to Datadog, OpenTSDB, Graphite, and others, than it would be to guess a schema from a flattened result (for example, "router__0" is understandably job and index, but what does the "vizzini_1_abcd" part represent? How would I parse this if I didn't have a human trace it back to source?).

Thanks,
Jim
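A small Go sketch of the contrast Jim describes; the parsing function and the tag layout are illustrative assumptions, not an existing nozzle's code:

package sketch

import "strings"

// GuessJobAndIndex shows the consumer-side guessing required today when
// identity is stuffed into the metric key, e.g. "router__0.latency".
func GuessJobAndIndex(flatName string) (job, index string) {
    parts := strings.SplitN(flatName, "__", 2)
    if len(parts) != 2 {
        return flatName, "" // no reliable convention to fall back on
    }
    rest := strings.SplitN(parts[1], ".", 2)
    return parts[0], rest[0]
}

// TaggedMetric shows how the same identity could travel as explicit tags on a
// generic point, mapping directly onto Datadog/OpenTSDB-style tags.
type TaggedMetric struct {
    Name  string
    Value float64
    Tags  map[string]string // e.g. {"job": "router", "index": "0"}
}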
|
|