Loggregator appears to be dropping logs without notice: about 50% of our logs never make it to the external log service. If I tail the logs with "cf logs ...", all of them are visible; they just never reach the external drain.
I have seen the "TB: Output channel too full. Dropped 100 messages for app..." message in the doppler logs, and that makes sense for applications producing many logs. What's confusing is that logs are also missing from apps that produce very few logs (around 10 per minute). If I send curl requests to a test app very slowly, roughly every other log is missing.
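For reference, the slow-request test is roughly the loop below. The app URL is a placeholder, and N/GAP are illustrative (in the real test the gap between requests was several seconds); the point is that the request rate is far too low for any throughput limit to explain the drops.

```shell
#!/bin/sh
# Slow-request repro sketch. APP_URL is a placeholder for our test app;
# each request makes the app emit exactly one log line.
APP_URL=${APP_URL:-https://test-app.example.com/log}
N=${N:-10}      # number of requests to send
GAP=${GAP:-1}   # seconds between requests (was ~5s in the real test)
i=1
while [ "$i" -le "$N" ]; do
  # ignore request failures here; only the emitted log lines matter
  command -v curl >/dev/null && curl -s -o /dev/null --max-time 2 "$APP_URL" || true
  sleep "$GAP"
  i=$((i + 1))
done
echo "sent $N requests; compare against lines received at the drain"
```

After the loop finishes, the count of lines arriving at the drain endpoint is compared against N; in our case roughly half are missing.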
We've been using an ELK stack to persist application logs; all applications bind to a user-provided service containing the syslog drain URL. Logstash does not appear to be the bottleneck, because I've tested other endpoints such as a raw netcat listener and see the same drop rate. Even after doubling our Logstash server count to 6, the drop rate is identical.
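The counting against the netcat listener is roughly as follows. In the real test, drain.log is produced by a listener such as `nc -lk <port> > drain.log` bound as the drain endpoint; here the file is simulated (every other message present, matching what we observe) so the arithmetic is self-contained.

```shell
#!/bin/sh
# Drop-rate counting sketch. drain.log is simulated here; in the real
# test it is whatever the netcat listener captured from the drain.
sent=60
seq 1 2 "$sent" > drain.log        # simulate ~half of the messages arriving
received=$(wc -l < drain.log)
received=$((received))             # normalize any whitespace from wc
dropped=$((sent - received))
echo "sent=$sent received=$received dropped=$dropped ($((100 * dropped / sent))%)"
rm -f drain.log
```

With a plain TCP listener in place of Logstash, the same ~50% figure shows up, which is why I don't believe the external service is the bottleneck.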
Our current CF (v210) deployment contains 4 loggregator instances each running doppler, syslog_drain_binder, and metron_agent. I tried bumping the loggregator instance count to 8 and noticed very little improvement.
Monitoring of CPU, memory, and disk space on the loggregator nodes shows no abnormalities; CPU is under 5%.
Is this expected behavior?