Strategies for limiting metric updates with a clustered nozzle


Mike Youngstrom
 

I'm working on adding support for Firehose metrics to our monitoring
solution. The firehose is working great. However, it appears each
component seems to send updates every 10 seconds or so. This might be a
great interval for some use cases but for my monitoring provider it can get
expensive. Any ideas on how I might limit the frequency of metric updates
from the firehose?

The obvious initial solution is to just throttle updates in my nozzle. However, I
plan to cluster my nozzle using a subscriptionId. My understanding is that
when using a subscriptionId, events get balanced between the
subscribers. That would mean one nozzle instance might know when it last
sent a particular metric, but the other instances wouldn't, without making
the solution more complex than I'd like it to be.

Any thoughts on how I might approach this problem?

Mike


Mike Youngstrom
 

I suppose one relatively simple solution to this problem is to have each
cluster member randomly decide whether it should log each metric. :) If I
pick a number between 1 and 6, I suppose the odds are I'd log about every 6th
message on average, or something like that. :)

Another idea: I could have each member pick a random number between 1 and
10, skip that many messages before publishing, then pick a new
random number.
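For concreteness, here is a rough Go sketch of both ideas; the type and method names are invented for illustration, and this is only a sketch of the sampling logic, not a working nozzle:

```go
package main

import (
	"fmt"
	"math/rand"
)

// coinFlipSampler publishes each message with probability 1/n,
// so about 1 in n messages gets through on average.
type coinFlipSampler struct {
	n int
}

func (s coinFlipSampler) shouldPublish() bool {
	return rand.Intn(s.n) == 0
}

// randomSkipSampler publishes one message, then skips a freshly drawn
// random number of messages (1..max) before publishing again.
type randomSkipSampler struct {
	max  int
	left int
}

func (s *randomSkipSampler) shouldPublish() bool {
	if s.left > 0 {
		s.left--
		return false
	}
	s.left = rand.Intn(s.max) + 1
	return true
}

func main() {
	flip := coinFlipSampler{n: 6}
	skip := &randomSkipSampler{max: 10}
	flipCount, skipCount := 0, 0
	for i := 0; i < 60000; i++ {
		if flip.shouldPublish() {
			flipCount++
		}
		if skip.shouldPublish() {
			skipCount++
		}
	}
	// Both approaches land near a 1-in-6 publish rate on average
	// (the skip sampler's mean gap is 1 + 5.5 = 6.5 messages).
	fmt.Printf("coin flip: %d, random skip: %d of 60000\n", flipCount, skipCount)
}
```

Since each instance samples independently, no coordination between cluster members is needed; the aggregate rate across the cluster is still roughly 1/n of the firehose volume.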

I think it is mostly the dropsonde messages that are killing me. A
technique like this probably wouldn't really work for metrics derived from
http events and such.

Anyone have any other ideas?

Mike



James Bayer
 

warning, thinking out loud here...

your nozzle will tap the firehose, and filter for the metrics you care about

currently you're publishing these events to your metrics backend as fast
as they come in, across a horizontally scalable tier that doesn't coordinate,
and that can be expensive if your backend charges by the transaction

to slow down the stream, you could consider having the work in two phases:
1) aggregation phase
2) publish phase

the aggregation phase could have each instance of the horizontally scaled-out
tier put the metric in a temporary data store such as redis, or another
in-memory data grid with HA like apache geode [1].

the publish phase would have something like a cron / spring batch
capability to occasionally (as often as makes sense for your costs) flush
the metrics from the temporary data store to the per-transaction-cost
backend

[1] http://geode.incubator.apache.org/
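An illustrative Go sketch of the two phases (all names are invented, and a plain in-process map stands in for the shared redis/geode store a real clustered setup would use):

```go
package main

import (
	"fmt"
	"sync"
)

// aggregator keeps only the most recent value per metric name. In a
// clustered nozzle the map would live in a shared store (redis, geode)
// so every instance aggregates into the same place.
type aggregator struct {
	mu     sync.Mutex
	latest map[string]float64
}

func newAggregator() *aggregator {
	return &aggregator{latest: make(map[string]float64)}
}

// record is the aggregation phase: called for every event off the firehose.
func (a *aggregator) record(name string, value float64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.latest[name] = value
}

// flush is the publish phase: a periodic job (cron, spring batch, or a
// time.Ticker) drains the store and sends one update per metric to the
// per-transaction-cost backend.
func (a *aggregator) flush() map[string]float64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	out := a.latest
	a.latest = make(map[string]float64)
	return out
}

func main() {
	agg := newAggregator()
	for i := 0; i < 100; i++ {
		agg.record("router.requests", float64(i)) // 100 rapid updates
	}
	for name, v := range agg.flush() {
		fmt.Printf("publish %s=%v\n", name, v) // one publish instead of 100
	}
}
```

Keeping only the latest value suits gauge-style metrics; counters would instead be summed between flushes, but the two-phase shape is the same.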

--
Thank you,

James Bayer


Mike Youngstrom
 

Thanks James,

It's a little more complicated, with more moving parts, than I was hoping for,
but if I don't want to miss anything I probably don't have much of a choice.

I think for now I'm going to go with some kind of random approach. At
least for the dropsonde generated metrics since they are by far the most
frequent/expensive and I think grabbing a random smattering of them will be
good enough for my current uses.

Mike



Erik Jasiak <ejasiak@...>
 

Hi Mike,

I think your random approach is workable; what you are doing in effect is
taking fewer polling samples off of the firehose stream.

Short of the aggregation answer James pointed out, this has the potential
to mess with a few things, like averages, but it's better than nothing if
you have to rate-control at ingest, and are looking for a low-cost solution.

In the longer-term, we are looking closely at how to make it easier to
aggregate metrics at either end of loggregator to help with the amount of
data, and hope to have more info shortly. Hopefully that will help with
controlling data flow no matter how often a component emits metrics.

Erik



Mike Youngstrom
 

Sounds great. I think the random solution works for me now. I'm glad you
are aware of the use case and have tentative plans to improve it in the
future. Thanks Erik!

Mike
