Questions about removal of the heartbeat message type from dropsonde-protocol


Mike Youngstrom
 

I noticed that heartbeat messages are no longer a part of the
dropsonde-protocol.

Can I get a quick summary of the thinking behind this change?

Is there an assumption that we should be using the bosh health manager and
not the firehose for this type of thing?

I'd just like some background and help understanding the LAMB team's
monitoring mindset regarding the removal of this message.

Thanks,
Mike


Erik Jasiak <ejasiak@...>
 

(resend #2)
Hi again Mike,

There were quite a few pros and cons that went into it; the high (low?)
lights from my notes are below. I'll have the rest of the team check in
if they have more info.

1) A ruby version of the dropsonde-protocol would require some amount of
maintaining state done by the consumer, which is more challenging in ruby.
2) How to shoehorn a heartbeat mechanism into the statsd injector (by
its nature, statsd sends last known value; is a heartbeat binary yes/no, or
milliseconds uptime, and a component is dead when there's no increase?)
3) Whose job is it to maintain heartbeat state to begin with?
Metron's, as the aggregator of dropsonde counters? A Nozzle's?
4) Is the correct model to use heartbeats as the 'source of truth' about
a component being alive, regardless of the data being broadcast, or does a
component / developer prefer the non-statsd-model of wanting metric updates
to serve as a heartbeat? (We've leaned toward the statsd model of 'last
update is valid', but then that implies everyone agrees a heartbeat is
really a running uptime counter or similar.)
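
To make the two liveness models in items 2 and 4 concrete, here is a minimal Python sketch. It is not taken from dropsonde, Metron, or any nozzle; the class, method names, and component names are all hypothetical, and it only illustrates the state a consumer would have to keep. One check treats any recent metric as proof of life ('last update is valid'); the other treats a component as alive only if its reported uptime counter increased.

```python
import time


class LivenessTracker:
    """Hypothetical consumer-side state; not part of dropsonde or Metron."""

    def __init__(self, stale_after_seconds, clock=time.monotonic):
        self.stale_after_seconds = stale_after_seconds
        self.clock = clock
        self.last_seen = {}         # component -> time of last metric of any kind
        self.last_uptime = {}       # component -> last reported uptime value
        self.uptime_increased = {}  # component -> did the latest report increase?

    def record_metric(self, component):
        # Model A: any metric update counts as a sign of life.
        self.last_seen[component] = self.clock()

    def record_uptime(self, component, uptime_ms):
        # Model B: the heartbeat is a running uptime counter.
        previous = self.last_uptime.get(component)
        self.uptime_increased[component] = previous is None or uptime_ms > previous
        self.last_uptime[component] = uptime_ms
        self.record_metric(component)

    def alive_by_last_update(self, component):
        seen = self.last_seen.get(component)
        return seen is not None and self.clock() - seen <= self.stale_after_seconds

    def alive_by_uptime(self, component):
        # Dead (or unknown) when the latest uptime report did not increase.
        return self.uptime_increased.get(component, False)
```

If a statsd-style sender repeats its last known uptime value, `alive_by_last_update` still reports the component as alive while `alive_by_uptime` does not, which is the ambiguity raised in item 2.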

We didn't have answers to all of these questions; what we did find was
that dropsonde-protocol heartbeats were rarely being used, and largely
being ignored. Because they were also in the way of figuring out a path
forward for things like dropsonde with ruby, we went for their removal
until we had a clearer use case and strategy, or we could handle them in a
cleaner, generally agreed upon way.

Hope that helps,
Erik

On Sat, Aug 8, 2015 at 11:32 AM, Mike Youngstrom <youngm(a)gmail.com> wrote:

I noticed that heartbeat messages are no longer a part of the
dropsonde-protocol.

Can I get a quick summary of the thinking behind this change?

Is there an assumption that we should be using the bosh health manager and
not the firehose for this type of thing?

I'd just like some background and help understanding the LAMB team's
monitoring mindset regarding the removal of this message.

Thanks,
Mike


Mike Youngstrom
 

That makes sense. So, this change wasn't so much the result of a change in
direction but instead an acknowledgement that use of it wasn't really a
good direction so we might as well remove it.

That is useful info. Thanks for the explanation. Helps me feel more
confident in the approach my group has decided on.

Thanks,
Mike

On Tue, Aug 11, 2015 at 1:14 PM, Erik Jasiak <ejasiak(a)pivotal.io> wrote:

(resend #2)
Hi again Mike,

There were quite a few pros and cons that went into it; the high (low?)
lights from my notes are below. I'll have the rest of the team check in
if they have more info.

1) A ruby version of the dropsonde-protocol would require some amount
of maintaining state done by the consumer, which is more challenging in
ruby.
2) How to shoehorn a heartbeat mechanism into the statsd injector (by
its nature, statsd sends last known value; is a heartbeat binary yes/no, or
milliseconds uptime, and a component is dead when there's no increase?)
3) Whose job is it to maintain heartbeat state to begin with?
Metron's, as the aggregator of dropsonde counters? A Nozzle's?
4) Is the correct model to use heartbeats as the 'source of truth'
about a component being alive, regardless of the data being broadcast, or
does a component / developer prefer the non-statsd-model of wanting metric
updates to serve as a heartbeat? (We've leaned toward the statsd model of
'last update is valid', but then that implies everyone agrees a heartbeat
is really a running uptime counter or similar.)

We didn't have answers to all of these questions; what we did find was
that dropsonde-protocol heartbeats were rarely being used, and largely
being ignored. Because they were also in the way of figuring out a path
forward for things like dropsonde with ruby, we went for their removal
until we had a clearer use case and strategy, or we could handle them in a
cleaner, generally agreed upon way.

Hope that helps,
Erik
