doppler issue which fails to emit logs with syslog protocol on CFv212


Amit Kumar Gupta
 

Hi Matsumi,

What information exactly would you like to know about the etcd events?
There error message is "Got error while waiting for ETCD events: store
request timed". This could be due to several reasons -- network issues,
etcd slowness, etcd down, etc. If etcd is down, other components would be
affected. If the network is slow for some connections, it may only affect
some components.

Several components in cf-release talk to etcd:

- doppler
- metron_agent
- syslog_drain_binder
- loggregator_trafficcontroller
- routing-api (not router)
- hm9000

Please let us know what details you'd like to know.

Best,
Amit

On Fri, Oct 30, 2015 at 3:40 AM, Masumi Ito <msmi10f(a)gmail.com> wrote:

Hi Warren and Nino,

Let me update our status.

Yes, the root cause was the inconsistency of a bulk user id between cc and
syslog drain binder due to the bug (a hardcorded id). We confirmed the
doppler had been recovered successfully.

Also, I would appreciate the information about ETCD events.

Regards,
Masumi



--
View this message in context:
http://cf-dev.70369.x6.nabble.com/doppler-issue-which-fails-to-emit-logs-with-syslog-protocol-on-CFv212-tp2418p2524.html
Sent from the CF Dev mailing list archive at Nabble.com.


Masumi Ito
 

Hi Warren and Nino,

Let me update our status.

Yes, the root cause was the inconsistency of a bulk user id between cc and
syslog drain binder due to the bug (a hardcorded id). We confirmed the
doppler had been recovered successfully.

Also, I would appreciate the information about ETCD events.

Regards,
Masumi



--
View this message in context: http://cf-dev.70369.x6.nabble.com/doppler-issue-which-fails-to-emit-logs-with-syslog-protocol-on-CFv212-tp2418p2524.html
Sent from the CF Dev mailing list archive at Nabble.com.


Warren Fernandes
 

Hey Masumi,

We setup a similar environment with CFv212 and went through a few scenarios. We took a look at the problem and we feel that there are couple of odd things.

"Error when polling cloud controller: Remote server error: Unauthorized" is caused when syslog_drain_binder gets a 401 response from CC which means your credentials could be invalid.

"AppStoreWatcher: Got error while waiting for ETCD events: store request timed out" is caused when we took ETCD down and noticed a flood of these errors in the doppler.stdout.log. In this case (ETCD being down) you should also see in syslog_drain_binder.stdout.log the following error message: "Error when staying leader: store request timed out". And yes other components like trafficcontroller and metron agents should also error out.

How many ETCDs were running at the time of these errors? Was there a deployment going on at the time? As Dies Koper pointed out (lamb slack channel), in CF 212 the username for the syslog drain endpoint in the syslog_drain_binder is hardcoded to bulk_api and only the password is configurable. So maybe there was a password mismatch? Check the timestamps of the events to see if they are related.

Thanks.
Warren & Nino


Warren Fernandes
 

Hey Masumi,

The LAMB team created a chore to investigate this here https://www.pivotaltracker.com/story/show/106699398.

We'll update this thread with our findings.

Thanks.
Warren


Masumi Ito
 

Hi, I am investigating the doppler issue on CFv212 that sending logs to the
external logging service with syslog protocol fails.

As my understanding, the following messages are supposed to be recorded in
the "doppler.stdout.log" if the doppler got the syslog drain url from etcd
successfully.
However it is actually missing.

- Missing log messages which is expected to be shown
{"timestamp":xxxxxx,"process_id":xxxx,"source":"doppler","log_level":"info","message":"Syslog
Sink syslog://xx.xx.xx.xx:xxxx:
Running.","data":null,"file":"/var/vcap/data/compile/doppler/loggregator/src/doppler/sinks/syslog/syslog_sink.go","line":56,"method":"doppler/sinks/syslog.(*SyslogSink).Run"}
{"timestamp":xxxxxx,"process_id":xxxx,"source":"doppler","log_level":"info","message":"Syslog
Sink syslog://xxxxxx:xxxxxx: successfully
connected.","data":null,"file":"/var/vcap/data/compile/doppler/loggregator/src/doppler/sinks/syslog/syslog_sink.go","line":112,"method":"doppler/sinks/syslog.(*SyslogSink).Run"}

Instead, there are a lot of etcd error events found as follows.

{"timestamp":xxxxxxxxx,"process_id":xxx,"source":"doppler","log_level":"error","message":"AppStoreWatcher:
Got error while waiting for ETCD events: store request timed
out","data":null,"file":"/var/vcap/data/compile/doppler/loggregator/src/github.com/cloudfoundry/loggregatorlib/store/app_service_store_watcher.go","line":79,"method":"github.com/cloudfoundry/loggregatorlib/store.(*AppServiceStoreWatcher).Run"}

I have two question about this.

=============
Q1) Does anyone know what this event indicates and how it affects the CF
environment?
In the normal envrionment, is this event still triggered (In other
words, can we ignore this error event messages?)

Q2) If the etcd got some trouble at the moment, which cf component is also
made an influence on?

I guess at the least the following cf components could be affected. Do we
have anyting else?
- router : to support routing api
- hm9000 : to support health check
- doppler: to get syslog drain urls from etcd
- syslog-binder to get syslog drain urls from cc and then store them to
etcd
- trafficcontroller, metron agents : to find healthy dopplers to access
to)
=============

Note that I am also doubting the following errors in the
"syslog_drain_binder.stdout.log".
This message indicates that syslog_drain_binder failed to get syslog drain
urls from cc.
{"timestamp":xxxx,"process_id":xxxx,"source":"syslog_drain_binder","log_level":"error","message":"Error
when polling cloud controller: Remote server error:
Unauthorized","data":null,"file":"/var/vcap/data/compile/syslog_drain_binder/loggregator/src/syslog_drain_binder/main.go","line":68,"method":"main.main"}

Therefore I have not yet concluded the etcd mainly caused the issue, however
need to understand the exact meaning of the error event message above as
well as any impact on the cf envrionment if there is something wrong with
the behaviour of etcd.

Regards,
Masumi



--
View this message in context: http://cf-dev.70369.x6.nabble.com/doppler-issue-which-fails-to-emit-logs-with-syslog-protocol-on-CFv212-tp2418.html
Sent from the CF Dev mailing list archive at Nabble.com.