[vcap-dev] Loggregator dropping application logs and warning 'message output too high'


Daniel Jones
 

Hi all,

Enacting a bit of thread necromancy in the hope of aiding a few Java CF
users.

1. Logback 1.1.3 has new functionality that allows selective truncation
of Java stack traces: https://github.com/qos-ch/logback/pull/244. It's
not documented yet, but I've used it in tested code (see the sketch after
this list).
2. There may be a flaw in Doppler's logic that throttles messages sent to
syslog to one per millisecond:
https://github.com/cloudfoundry/loggregator/issues/66
3. I've also raised an issue to discuss the merits of dropping the whole
buffer as opposed to refusing new messages and/or allowing buffers to grow
in size: https://github.com/cloudfoundry/loggregator/issues/67
4. There's also this issue discussing the parameterisation of buffer
sizes: https://github.com/cloudfoundry/loggregator/issues/64
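
Since the feature in that PR isn't documented yet, here's a minimal sketch
under the assumption that limiting stack trace depth is what you're after;
it uses the long-documented %ex{depth} option of Logback's pattern layout,
configured programmatically. The class name, pattern and depth of 5 are
illustrative only.

    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.LoggerContext;
    import ch.qos.logback.classic.encoder.PatternLayoutEncoder;
    import ch.qos.logback.classic.spi.ILoggingEvent;
    import ch.qos.logback.core.ConsoleAppender;
    import org.slf4j.LoggerFactory;

    public class TruncatedStackTraces {
        public static void main(String[] args) {
            LoggerContext ctx = (LoggerContext) LoggerFactory.getILoggerFactory();

            PatternLayoutEncoder encoder = new PatternLayoutEncoder();
            encoder.setContext(ctx);
            // %ex{5} limits each stack trace to its first five frames,
            // keeping a single failure well under Loggregator's buffer size.
            encoder.setPattern("%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n%ex{5}");
            encoder.start();

            ConsoleAppender<ILoggingEvent> console = new ConsoleAppender<>();
            console.setContext(ctx);
            console.setEncoder(encoder);
            console.start();

            Logger root = ctx.getLogger(Logger.ROOT_LOGGER_NAME);
            root.detachAndStopAllAppenders();
            root.addAppender(console);

            LoggerFactory.getLogger(TruncatedStackTraces.class)
                    .error("backend unreachable", new RuntimeException("connection refused"));
        }
    }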

Hope the above is of help to someone.

On Thu, Mar 5, 2015 at 6:42 PM, Mike Heath <elcapo(a)gmail.com> wrote:

I have to disagree with regard to putting back pressure on the
application. If someone is tailing the logs of an app over a slow VPN
connection, that should not slow the application down. I do agree that 100
retained log messages is too few. We retain 500 log messages on our Cloud
Foundry instances and that seems to work well for us.

-Mike



On Thursday, March 5, 2015 at 9:17:54 AM UTC-7, Yann ROBERT wrote:

Hi James,

Just to clarify on the logging rate:
The main issue, which occurs pretty often, is when an application logs,
let's say, 120 lines at a time and then keeps quiet. This is a pretty
common scenario when a Java application prints a stack trace after a
failure. For instance, if you have 2 or 3 threads accessing a backend and
the connection to the backend fails, then all of the running threads will
get a failure, which may result in more than 100 lines of logs at the same
time.
This simple scenario is enough to overflow the LGR buffer, because it
cannot drain to syslog before its buffer is full.
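
A minimal sketch of that kind of burst (the host name, port and messages
are made up; only the shape of the failure matters):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import java.net.Socket;

    public class BackendOutage {
        private static final Logger log = LoggerFactory.getLogger(BackendOutage.class);

        public static void main(String[] args) {
            // Three worker threads hit the same unreachable backend at once.
            // Each failure logs a full stack trace (often dozens of lines), so
            // a single outage easily exceeds a 100-line buffer in one burst.
            for (int i = 0; i < 3; i++) {
                final String name = "worker-" + i;
                new Thread(() -> {
                    try (Socket s = new Socket("backend.example.invalid", 5432)) {
                        // ... talk to the backend ...
                    } catch (Exception e) {
                        log.error("backend call failed", e);
                    }
                }, name).start();
            }
        }
    }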

NOTE: The number of lines that must be sent at once to overflow the LGR
differs between the public cloud providers I tested this on.

In light of the previous explanation, because the buffer size is only 100,
it's simply impossible to support that scenario.

I am not talking about a scenario where the application is constantly
logging 200 lines per second for several minutes or hours.

On the syslog server's ability to keep up, and the available bandwidth
between CF and the syslog server:
My tests show that when the application sends logs directly to the syslog
server, no logs are lost, and the syslog server is able to keep up with
the flow, be it a temporary peak or a constant flow of hundreds of lines
per second.
So I am pretty confident that the hosted, highly available syslog service
is able to keep up with a peak of 100 lines.

The question is just how long it takes for the LGR buffer to fill.
If you send 120 lines to STDOUT at the same time, the buffer fills
instantly and overflows immediately.
It ends up being, first of all, a problem with the buffer size.

Java applications are very chatty when it comes to failures, because a
common practice is to print the stack trace.
Supporting very temporary peaks in the log rate would be great. The
default buffer size should be 1000, not 100.

Also, the strategy applied when the buffer is full is questionable.

Putting backpressure on the application is fine by me.
I guess some application developers would rather have the logging thread
block while the buffer is full than lose some logs.
It would end up being the developer's responsibility to either leave it
that way if it's rare enough, or optimize the application so that it logs
less. The developer would be in control.
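
A minimal sketch of the backpressure idea, using plain JDK classes (nothing
Loggregator-specific; the names are illustrative):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BackpressureBuffer {
        private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(100);

        // Backpressure: the logging thread blocks until the drain frees a slot,
        // so no message is ever dropped, at the cost of slowing the application.
        public void logBlocking(String line) throws InterruptedException {
            buffer.put(line);
        }
    }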

Another strategy would be to discard the new messages instead of
discarding the contents of the buffer. This would at least preserve the
first 100 lines of logs, instead of none or just the latest lines if
you're lucky.
Alternatively, using a circular buffer may also be interesting to avoid
dropping whole chunks; it only drops the oldest messages on overflow.
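
Both options, sketched with JDK collections (again purely illustrative):

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.concurrent.ArrayBlockingQueue;

    public class DropStrategies {
        // Drop-newest: when full, refuse the incoming line, so the first
        // 100 lines of a burst (usually the root-cause stack trace) survive.
        private final ArrayBlockingQueue<String> dropNewest = new ArrayBlockingQueue<>(100);

        public void offerDropNewest(String line) {
            dropNewest.offer(line); // returns false and discards 'line' when full
        }

        // Drop-oldest (circular buffer): evict the oldest line to make room,
        // so only the head of the burst is lost rather than a whole chunk.
        private final Queue<String> dropOldest = new ArrayDeque<>(100);

        public synchronized void offerDropOldest(String line) {
            if (dropOldest.size() == 100) {
                dropOldest.poll(); // discard the oldest message
            }
            dropOldest.add(line);
        }
    }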

A dynamic strategy that discards logs based on the level (INFO, WARN,
ERROR, ...), like what is done by logback's AsyncAppender [
http://logback.qos.ch/apidocs/ch/qos/logback/classic/AsyncAppender.html
], would be great, but there is no level on a plain STDOUT flow. Also, when
all logs are important because they are at WARNING or ERROR level, you
don't want to filter them.
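
For reference, this is roughly how AsyncAppender's level-based discarding
is wired up programmatically; the queue size and threshold below are
arbitrary values for illustration.

    import ch.qos.logback.classic.AsyncAppender;
    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.LoggerContext;
    import ch.qos.logback.classic.encoder.PatternLayoutEncoder;
    import ch.qos.logback.classic.spi.ILoggingEvent;
    import ch.qos.logback.core.ConsoleAppender;
    import org.slf4j.LoggerFactory;

    public class LevelAwareDiscarding {
        public static void main(String[] args) {
            LoggerContext ctx = (LoggerContext) LoggerFactory.getILoggerFactory();

            PatternLayoutEncoder encoder = new PatternLayoutEncoder();
            encoder.setContext(ctx);
            encoder.setPattern("%d %-5level %logger - %msg%n");
            encoder.start();

            ConsoleAppender<ILoggingEvent> console = new ConsoleAppender<>();
            console.setContext(ctx);
            console.setEncoder(encoder);
            console.start();

            AsyncAppender async = new AsyncAppender();
            async.setContext(ctx);
            async.setQueueSize(1000);
            // With fewer than 200 free slots left in the queue, events of level
            // TRACE/DEBUG/INFO are discarded; WARN and ERROR are kept.
            async.setDiscardingThreshold(200);
            async.addAppender(console);
            async.start();

            Logger root = ctx.getLogger(Logger.ROOT_LOGGER_NAME);
            root.detachAndStopAllAppenders();
            root.addAppender(async);
        }
    }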

A dynamic strategy that arbitrarily discards 1 line out of 2 may be better
than dropping a full chunk of 100 messages, because when you look at the
logs it's easier to 'read between the lines'.

We can think of many more alternative strategies for discarding messages.
It all depends on the developers' needs.
Putting backpressure on the application to give control to the developer
is what will work best.

Yann



--
Regards,

Daniel Jones
EngineerBetter.com
