Date
1 - 5 of 5
doppler issue which fails to emit logs with syslog protocol on CFv212
Amit Kumar Gupta
Hi Matsumi,
toggle quoted message
Show quoted text
What information exactly would you like to know about the etcd events? There error message is "Got error while waiting for ETCD events: store request timed". This could be due to several reasons -- network issues, etcd slowness, etcd down, etc. If etcd is down, other components would be affected. If the network is slow for some connections, it may only affect some components. Several components in cf-release talk to etcd: - doppler - metron_agent - syslog_drain_binder - loggregator_trafficcontroller - routing-api (not router) - hm9000 Please let us know what details you'd like to know. Best, Amit On Fri, Oct 30, 2015 at 3:40 AM, Masumi Ito <msmi10f(a)gmail.com> wrote:
Hi Warren and Nino, |
|
Masumi Ito
Hi Warren and Nino,
Let me update our status. Yes, the root cause was the inconsistency of a bulk user id between cc and syslog drain binder due to the bug (a hardcorded id). We confirmed the doppler had been recovered successfully. Also, I would appreciate the information about ETCD events. Regards, Masumi -- View this message in context: http://cf-dev.70369.x6.nabble.com/doppler-issue-which-fails-to-emit-logs-with-syslog-protocol-on-CFv212-tp2418p2524.html Sent from the CF Dev mailing list archive at Nabble.com. |
|
Warren Fernandes
Hey Masumi,
We setup a similar environment with CFv212 and went through a few scenarios. We took a look at the problem and we feel that there are couple of odd things. "Error when polling cloud controller: Remote server error: Unauthorized" is caused when syslog_drain_binder gets a 401 response from CC which means your credentials could be invalid. "AppStoreWatcher: Got error while waiting for ETCD events: store request timed out" is caused when we took ETCD down and noticed a flood of these errors in the doppler.stdout.log. In this case (ETCD being down) you should also see in syslog_drain_binder.stdout.log the following error message: "Error when staying leader: store request timed out". And yes other components like trafficcontroller and metron agents should also error out. How many ETCDs were running at the time of these errors? Was there a deployment going on at the time? As Dies Koper pointed out (lamb slack channel), in CF 212 the username for the syslog drain endpoint in the syslog_drain_binder is hardcoded to bulk_api and only the password is configurable. So maybe there was a password mismatch? Check the timestamps of the events to see if they are related. Thanks. Warren & Nino |
|
Warren Fernandes
Hey Masumi,
The LAMB team created a chore to investigate this here https://www.pivotaltracker.com/story/show/106699398. We'll update this thread with our findings. Thanks. Warren |
|
Masumi Ito
Hi, I am investigating the doppler issue on CFv212 that sending logs to the
external logging service with syslog protocol fails. As my understanding, the following messages are supposed to be recorded in the "doppler.stdout.log" if the doppler got the syslog drain url from etcd successfully. However it is actually missing. - Missing log messages which is expected to be shown {"timestamp":xxxxxx,"process_id":xxxx,"source":"doppler","log_level":"info","message":"Syslog Sink syslog://xx.xx.xx.xx:xxxx: Running.","data":null,"file":"/var/vcap/data/compile/doppler/loggregator/src/doppler/sinks/syslog/syslog_sink.go","line":56,"method":"doppler/sinks/syslog.(*SyslogSink).Run"} {"timestamp":xxxxxx,"process_id":xxxx,"source":"doppler","log_level":"info","message":"Syslog Sink syslog://xxxxxx:xxxxxx: successfully connected.","data":null,"file":"/var/vcap/data/compile/doppler/loggregator/src/doppler/sinks/syslog/syslog_sink.go","line":112,"method":"doppler/sinks/syslog.(*SyslogSink).Run"} Instead, there are a lot of etcd error events found as follows. {"timestamp":xxxxxxxxx,"process_id":xxx,"source":"doppler","log_level":"error","message":"AppStoreWatcher: Got error while waiting for ETCD events: store request timed out","data":null,"file":"/var/vcap/data/compile/doppler/loggregator/src/github.com/cloudfoundry/loggregatorlib/store/app_service_store_watcher.go","line":79,"method":"github.com/cloudfoundry/loggregatorlib/store.(*AppServiceStoreWatcher).Run"} I have two question about this. ============= Q1) Does anyone know what this event indicates and how it affects the CF environment? In the normal envrionment, is this event still triggered (In other words, can we ignore this error event messages?) Q2) If the etcd got some trouble at the moment, which cf component is also made an influence on? I guess at the least the following cf components could be affected. Do we have anyting else? - router : to support routing api - hm9000 : to support health check - doppler: to get syslog drain urls from etcd - syslog-binder to get syslog drain urls from cc and then store them to etcd - trafficcontroller, metron agents : to find healthy dopplers to access to) ============= Note that I am also doubting the following errors in the "syslog_drain_binder.stdout.log". This message indicates that syslog_drain_binder failed to get syslog drain urls from cc. {"timestamp":xxxx,"process_id":xxxx,"source":"syslog_drain_binder","log_level":"error","message":"Error when polling cloud controller: Remote server error: Unauthorized","data":null,"file":"/var/vcap/data/compile/syslog_drain_binder/loggregator/src/syslog_drain_binder/main.go","line":68,"method":"main.main"} Therefore I have not yet concluded the etcd mainly caused the issue, however need to understand the exact meaning of the error event message above as well as any impact on the cf envrionment if there is something wrong with the behaviour of etcd. Regards, Masumi -- View this message in context: http://cf-dev.70369.x6.nabble.com/doppler-issue-which-fails-to-emit-logs-with-syslog-protocol-on-CFv212-tp2418.html Sent from the CF Dev mailing list archive at Nabble.com. |
|