Loggregator/Doppler Syslog Drain Missing Logs


Michael Schwartz
 

Loggregator appears to be dropping logs without notice. I'm noticing that about 50% of the logs do not make it to our external log service. If I tail the logs using "cf logs ...", all logs are visible; they just never make it to the external drain.

I have seen the "TB: Output channel too full. Dropped 100 messages for app..." message in the doppler logs, and that makes sense for applications producing many logs. What's confusing is that I'm seeing logs missing from apps that produce very little logs (like 10 per minute). If I send curl requests to a test app very slowly, I notice approx. every other log is missing.

We've been using an ELK stack for persisting application logs. All applications bind to a user-provided service containing the syslog drain URL. Logstash does not appear to be the bottleneck here, because I've tested other endpoints such as a netcat listener and I see the same issue. Even after doubling our logstash server count to 6, I see the exact same drop rate.
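For completeness, the drain is wired up in the standard way, roughly like this (hostname and port are placeholders for our logstash endpoint):

    cf create-user-provided-service app-logs-drain -l syslog://logs.example.internal:5514
    cf bind-service my-app app-logs-drain
    cf restage my-app    # restage so the new drain binding takes effect

    # netcat stand-in for logstash: print whatever arrives on the drain port
    nc -lk 5514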

Our current CF (v210) deployment contains 4 loggregator instances, each running doppler, syslog_drain_binder, and metron_agent. I tried bumping the loggregator instance count to 8 and noticed very little improvement.
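For what it's worth, I confirmed the scaled-out jobs were actually up with the bosh CLI (the job name below is from our manifest and may differ in other deployments):

    bosh vms | grep -i loggregator    # expect 8 VMs, all in "running" state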

Monitoring CPU, memory, and disk space on the loggregator nodes shows no abnormalities. CPU is under 5%.

Is this expected behavior?

Thank you.


Michael Schwartz
 

The system is currently running ~200 apps and they all bind to an external syslog drain.


Erik Jasiak
 

Hi Michael

First question that springs to mind when I see ~50% - how many zones are
you running as part of your setup? ("every other log" sounds like a
round-robin to something dead or misconfigured.)

Have to run but will follow up more soon,
Erik

Michael Schwartz wrote:


The system is currently running ~200 apps and they all bind to an
external syslog drain.


Matthew Sykes <matthew.sykes@...>
 

v210 has quite a few bugs in this area. One fairly major one is a
connection leak [1] in the syslog_drain_binder component. When this
happens, changes to the syslog drain bindings do not make their way into
the doppler servers.

I'd strongly recommend you try to move to a newer release.

[1]:
https://github.com/cloudfoundry/loggregator/commit/b8d14b7fdc65b9d0d4a11cffa6b6f855e4d640ae

On Wed, Sep 23, 2015 at 2:48 PM, Michael Schwartz <mschwartz1411(a)gmail.com>
wrote:

The system is currently running ~200 apps and they all bind to an external
syslog drain.


--
Matthew Sykes
matthew.sykes(a)gmail.com


Michael Schwartz
 

We are running with 2 zones, 2 loggregators in each zone. I thought the same thing. Stopping all but one loggregator showed the same results.
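I stopped them with the bosh CLI, something like this (job name and index are from our manifest, so yours may differ):

    bosh stop loggregator_z1 1    # stops the job processes without deleting the VM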

If it helps, yesterday I was seeing 75% of the logs make it through with 4 loggregators and about 90% when I bumped the node count to 8. So it isn't always a 50/50 split.

Also, after shutting down or restarting a node, I see almost 100% of the logs come through at first. Throughput then slowly degrades back to ~50% over a few minutes.


Michael Schwartz
 

That does make sense, especially since I'm seeing a consistent percentage. At one point, one app was getting 100% throughput while another was only getting 75%. So maybe one of the dopplers didn't have the drain binding.
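If anyone wants to check the same thing: as I understand it, the syslog_drain_binder publishes the drain bindings into etcd, which is where the dopplers pick them up. I don't remember the exact key prefix, so I just listed everything recursively and grepped for drain URLs (the etcd address and port are from our deployment):

    curl -s "http://10.0.16.19:4001/v2/keys/?recursive=true" | grep -o 'syslog://[^"]*' | sort | uniq -c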


Mehran Saliminia
 

Hi Michael,

We have the same issue now. How did you resolve this problem?

Regards,
Mehran