How random is Metron's Doppler selection?


Mike Youngstrom <youngm@...>
 



I think the challenge here is being able to distinguish single-line events
emitted by an app's log from events that span multiple lines. Only the app
really knows that, doesn't it? In other words, it has to be done at the app
level. This may not be too hard with the common logging frameworks, and
perhaps we could document the pattern in a blog post for developers to
reference.
I agree. The app knows what is multi-line and what isn't. So the problem
is what the most appropriate way is for the application to tell
loggregator what should be multi-line and what shouldn't be. The '\' was
just a thought: it might be a lighter-weight way than syslog to give an
application a way to communicate this intent to loggregator. Using a
richer endpoint like syslog might be another approach.

Either way, I think we as loggregator users need some help from the
loggregator team to improve this scenario.

Mike


Stuart Charlton
 

Hi Mike (Youngstrom),


On Tue, Jun 16, 2015 at 11:46 AM, <cf-dev-request(a)lists.cloudfoundry.org>
wrote:


Perhaps the solution could be as simple as supporting escaping the end of a
line with '\' to represent a log event that should include the next line?
Something like that might be good enough.
I think the challenge here is being able to distinguish single-line events
emitted by an app's log from events that span multiple lines. Only the app
really knows that, doesn't it? In other words, it has to be done at the app
level. This may not be too hard with the common logging frameworks, and
perhaps we could document the pattern in a blog post for developers to
reference.
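
For example, an app can keep every event on one physical line by escaping
embedded newlines before they reach stdout. A rough Go sketch of the idea
(Log4j or Logback can do the same thing with a replace in the layout
pattern; nothing here is specific to any real framework):

    package main

    import (
        "log"
        "os"
        "strings"
    )

    // singleLineWriter escapes embedded newlines so that every log event the
    // framework emits occupies exactly one physical line on stdout, which is
    // what loggregator treats as a single message.
    type singleLineWriter struct{}

    func (singleLineWriter) Write(p []byte) (int, error) {
        msg := strings.TrimRight(string(p), "\n")
        escaped := strings.ReplaceAll(msg, "\n", `\n`) + "\n"
        if _, err := os.Stdout.WriteString(escaped); err != nil {
            return 0, err
        }
        return len(p), nil
    }

    func main() {
        logger := log.New(singleLineWriter{}, "", log.LstdFlags)
        // The whole stack trace is one event and stays on one line downstream.
        logger.Println("request failed\njava.lang.RuntimeException: boom\n\tat com.example.Foo.bar(Foo.java:42)")
    }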

--

Stuart Charlton

Pivotal Software | Field Engineering

Mobile: 403-671-9778 | Email: scharlton(a)pivotal.io


Mike Youngstrom <youngm@...>
 

Great! Thanks for the acknowledgement, John. :) To be clear, I'm not
proposing that stdout and stderr be done away with, nor am I trying to say
that adding a syslog endpoint is the best solution.

Perhaps the solution could be as simple as supporting escaping the end of a
line with '\' to represent a log event that should include the next line?
Something like that might be good enough.
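
To make the '\' idea concrete, here is a minimal Go sketch of what a
consumer of app output could do with that convention; none of this exists in
loggregator today, and the names are made up for illustration:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // joinContinuations treats a trailing backslash as "this event continues
    // on the next line" and emits one logical event per joined group.
    func joinContinuations(scanner *bufio.Scanner, emit func(string)) {
        var parts []string
        for scanner.Scan() {
            line := scanner.Text()
            if strings.HasSuffix(line, "\\") {
                parts = append(parts, strings.TrimSuffix(line, "\\"))
                continue
            }
            parts = append(parts, line)
            emit(strings.Join(parts, "\n"))
            parts = nil
        }
        if len(parts) > 0 {
            emit(strings.Join(parts, "\n")) // flush a trailing, unterminated group
        }
    }

    func main() {
        joinContinuations(bufio.NewScanner(os.Stdin), func(event string) {
            fmt.Printf("event: %q\n", event)
        })
    }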

You guys are the smart ones; I'm just trying to communicate a pain point for
which I don't believe an adequate solution has been presented yet. :)

Is there a story in the tracker to look into this? I couldn't find one.

Mike



John Tuley <jtuley@...>
 

Mike,

I'm not saying that we have a good solution to multi-line log messages.
It's definitely a challenge today.

It's my understanding that the reasons for providing the stdout/stderr
logging are:

- adherence to the 12 Factor App <http://12factor.net/logs> principles,
- zero-configuration, "just works" support for the broadest set of use
cases, and
- compatibility with other PaaS offerings (e.g. Heroku).

None of that is meant to disregard your use case. I completely agree that
it's difficult-to-impossible for Loggregator to play nice with multi-line
logs, and that to bypass it would eliminate the value that the system
provides. I also agree that, while line-by-line processing of the console
works fine for a human watching the logs in real-time, it makes storing and
processing messages more difficult.


– John Tuley



Mike Youngstrom <youngm@...>
 

As for your comment, John, about having our applications send their
syslogs to a remote syslog server: though that would certainly provide a
way to get better logs into Splunk, it would eliminate all the value we get
from loggregator.

* cf logs won't work (unless we fork our logs)
* we won't get the redundancy and reliability of logging locally (same
reason why metron exists as an agent)
* Complex customer config for a solution that should for the most part
"just work"
* etc.

There are all kinds of hacks we can use to improve our multi-line logging,
but they are all hacks that diminish the customer experience.

I understand that nobody here can "speculate as to the future of CF and
whether or not a particular feature will someday be included". All I'm
asking for is an acknowledgement from the LAMB team that draining
multi-line log messages is a pain point for users and that the team would
consider investing some future time in a solution (any solution) for this
issue.

If the logging team really believes that the way multi-line log events are
currently handled isn't a problem, then let's discuss that. As a user, I
believe this is a problem that ought to be addressed at some point in the
future.

Mike



Mike Heath
 

I think our situation is a little bit different since we have a custom
syslog server that sends logs directly to our Splunk indexers rather than
going through a Splunk forwarder that can aggregate multiple syslog streams
into a single event. This is part of our Splunk magic that allows our users
to do Splunk searches based on their Cloud Foundry app name, space, org,
etc., rather than GUIDs.

Regardless, we can fix this by having our developers format their stack
traces differently.

Thanks Stuart.

-Mike



Stuart Charlton
 

Mike,


Actually, this might explain why some of our customers are so frustrated
trying to read their stack traces in Splunk. :\

So each line of a stack trace could go to a different Doppler. That means
each line of the stack trace goes out to a different syslog drain making it
impossible to consolidate that stack trace into a single logging event when
passed off to a third-party logging system like Splunk. This sucks. To be
fair, Splunk has never been any good at dealing with stack traces.

I'm not sure this is a Doppler-specific issue. I've dealt with aggregated
syslog servers feeding Splunk in the past, and I've generally been able to
merge stack traces (with some false merges in corner cases) with some
props.conf voodoo: custom line-breaker clauses for Java stack traces.

Usually Log4J (or whatever framework) can be configured to emit a
predictable field, like an extra timestamp, ahead of any app log message, so
I can differentiate a multi-line event from a single-line one.

Multiple syslog drains shouldn't be a problem, because Splunk will merge
events based on the date field you tell it to merge on.
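
Concretely, the merge rule amounts to: a new event starts only on a line
that begins with the predictable timestamp, and every other line (stack
frames, caused-by lines) gets folded into the current event. Below is a
small Go sketch of that rule; the timestamp pattern is only an example and
depends on your layout, and Splunk's props.conf line-breaking settings
express the same thing declaratively:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "regexp"
        "strings"
    )

    // A new event starts when a line begins with an ISO-8601-style timestamp;
    // anything else is appended to the event currently being built.
    var eventStart = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}`)

    func mergeEvents(scanner *bufio.Scanner, emit func(string)) {
        var current []string
        flush := func() {
            if len(current) > 0 {
                emit(strings.Join(current, "\n"))
                current = nil
            }
        }
        for scanner.Scan() {
            line := scanner.Text()
            if eventStart.MatchString(line) {
                flush()
            }
            current = append(current, line)
        }
        flush()
    }

    func main() {
        mergeEvents(bufio.NewScanner(os.Stdin), func(e string) {
            fmt.Println("--- merged event ---")
            fmt.Println(e)
        })
    }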


--

Stuart Charlton

Pivotal Software | Field Engineering

Mobile: 403-671-9778 | Email: scharlton(a)pivotal.io


John Tuley <jtuley@...>
 

I can't speculate as to the future of CF and whether or not a particular
feature will someday be included.

But I can suggest a workaround: aside from very paranoid application
security group settings, there should be nothing preventing your
application from sending syslog to your external drain already, since apps
can make outbound connections. Obviously, this wouldn't also go through
Loggregator, and so wouldn't be available on `cf logs`. But perhaps your
logging utility can be configured to send to both syslog and stdout/stderr
simultaneously?
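
A rough sketch of that dual output in Go; the drain address and tag are
placeholders, and most logging libraries have an equivalent (for example a
syslog appender alongside a console appender):

    package main

    import (
        "io"
        "log"
        "log/syslog"
        "os"
    )

    func main() {
        // Replace the address and tag with your own drain; log/syslog is not
        // available on Windows.
        drain, err := syslog.Dial("tcp", "logs.example.com:514",
            syslog.LOG_INFO|syslog.LOG_USER, "my-cf-app")
        if err != nil {
            log.Fatalf("dial syslog drain: %v", err)
        }
        defer drain.Close()

        // Everything written through this logger goes both to the container's
        // stdout (so `cf logs` and loggregator still see it) and to the drain.
        logger := log.New(io.MultiWriter(os.Stdout, drain), "", log.LstdFlags)
        logger.Println("started; this line reaches loggregator and the external syslog server")
    }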

– John Tuley



Mike Heath
 

That's fair.

I think Mike Youngstrom is right. All of our logging problems would go away
if our applications could talk syslog to Loggregator. Capturing stdout and
stderr is certainly convenient, but it's not great for dealing with stack
traces.

-Mike



John Tuley <jtuley@...>
 

Mike,

I don't want to speak to the possibility, but I can explain why we decided
against app affinity. Basically, it comes down to sharding over a dynamic
pool. As Doppler instances come and go, Metron would need to re-balance its
affinity calculations. This becomes troublesome if you assume that a single
Doppler is responsible for each app (or app-instance), including the recent
history: does the old home of an app need to transfer history to the new
home? Or maybe a new server just picks up new apps, and all the old
mappings stay the same? We did some research into algorithms for this sort
of consistent hashing/sharding and determined that it would be difficult to
implement in the presence of distributed servers *and* distributed clients.

Given that your goals don't include history, the problem becomes easier for
sure. But I'd (personally – not speaking for product leadership) be wary of
accepting a PR that only solved forward-rebalancing without addressing the
problem of historical data.
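
To illustrate the sort of consistent hashing involved, here is a toy hash
ring in Go: each Doppler owns several points on a ring and an app instance
maps to the next point clockwise, so adding or removing a Doppler moves only
a small slice of app instances. The names and vnode count are made up, and
this deliberately ignores the history-transfer problem described above:

    package main

    import (
        "fmt"
        "hash/fnv"
        "sort"
    )

    type ring struct {
        points []uint32          // sorted positions on the ring
        owner  map[uint32]string // position -> Doppler that owns it
    }

    func hash(s string) uint32 {
        h := fnv.New32a()
        h.Write([]byte(s))
        return h.Sum32()
    }

    func newRing(dopplers []string, vnodes int) *ring {
        r := &ring{owner: map[uint32]string{}}
        for _, d := range dopplers {
            for i := 0; i < vnodes; i++ {
                p := hash(fmt.Sprintf("%s#%d", d, i))
                r.points = append(r.points, p)
                r.owner[p] = d
            }
        }
        sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
        return r
    }

    // dopplerFor gives every message from the same app instance the same home.
    func (r *ring) dopplerFor(appInstance string) string {
        h := hash(appInstance)
        i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
        if i == len(r.points) {
            i = 0 // wrap around the ring
        }
        return r.owner[r.points[i]]
    }

    func main() {
        r := newRing([]string{"doppler-z1-0", "doppler-z1-1", "doppler-z2-0"}, 16)
        fmt.Println(r.dopplerFor("app-guid-1234/0"))
    }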

– John Tuley



Mike Youngstrom <youngm@...>
 

Or, better yet, support a syslog endpoint in the app container that sends to
loggregator. Then we could get full stack traces in a single event. :)

Mike



Mike Heath
 

Actually, this might explain why some of our customers are so frustrated
trying to read their stack traces in Splunk. :\

So each line of a stack trace could go to a different Doppler. That means
each line of the stack trace goes out to a different syslog drain making it
impossible to consolidate that stack trace into a single logging event when
passed off to a third-party logging system like Splunk. This sucks. To be
fair, Splunk has never been any good at dealing with stack traces.

What are the possibilities of getting some kind of optionally enabled
application instance affinity put into Metron? (I know. I know. I can
submit a PR.)

-Mike



John Tuley <jtuley@...>
 

Oops, wrong link. Should be
https://github.com/cloudfoundry/loggregator/blob/develop/src/metron/main.go#L188-L197
.

Sorry about that!

– John Tuley



John Tuley <jtuley@...>
 

Mike,

Metron chooses an available Doppler at random for each message
<https://www.pivotaltracker.com/story/show/96801752>. Availability prefers
same-zone Doppler servers:

- If a Metron instance knows about any same-zone Dopplers, it chooses
one at random for each message.
- If no same-zone Dopplers are present, the random choice is made from
the list of all known servers.


In fact, the behavior you describe is the behavior of DEA Logging Agent
before Metron existed. What we discovered with that approach is that it
balances load very unfairly, as a single high-volume app is stuck on one
server. While the "new" mechanism does not guarantee consistency, it does
enable the Doppler pool to more-evenly share load.
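
Paraphrased in Go, the per-message selection described above looks roughly
like this; it is a restatement of the behavior, not the actual loggregator
code, and the zone names are invented:

    package main

    import (
        "fmt"
        "math/rand"
    )

    // pickDoppler prefers Dopplers in the Metron's own zone and picks one
    // uniformly at random per message, falling back to all known Dopplers
    // when the local zone has none.
    func pickDoppler(byZone map[string][]string, localZone string) string {
        candidates := byZone[localZone]
        if len(candidates) == 0 {
            for _, servers := range byZone {
                candidates = append(candidates, servers...)
            }
        }
        if len(candidates) == 0 {
            return ""
        }
        return candidates[rand.Intn(len(candidates))]
    }

    func main() {
        dopplers := map[string][]string{
            "z1": {"10.0.1.10", "10.0.1.11"},
            "z2": {"10.0.2.10"},
        }
        // With two Dopplers in z1, messages split roughly evenly between them.
        for i := 0; i < 5; i++ {
            fmt.Println(pickDoppler(dopplers, "z1"))
        }
    }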

If you're seeing that a single app instance is routed to the same Doppler
server every time, then (without further information) I would guess that
you're either running a single Doppler instance in each availability zone,
or your deck is stacked. :-) If neither of those is true and you're still
observing that Metron routes messages from an app instance to a single
Doppler, I'd love to investigate how that is happening.

– John Tuley



Mike Heath
 

Metron's documentation [1] says "All Metron traffic is randomly distributed
across available Dopplers." How random is this? Based on observation, it
appears that logs for an individual application instance are consistently
sent to the same Doppler instance. The consistency aspect is very important
for us so that our Syslog forwarder can consolidate stack traces into a
single logging event.

How random is this distribution really for an application instance's logs?

-Mike

1 - https://github.com/cloudfoundry/loggregator/tree/develop/src/metron