Hi, Mike,
Unfortunately, I haven't found anything that looks like it would account for this difference in logging behavior. Between those releases, Diego components haven't changed how they emit their logs to syslog: they still redirect stdout to a file descriptor produced by process substitution around a pipeline such as `tee -a $LOG_DIR/rep.stdout.log | logger -t vcap.rep`. We have changed most of these components (all except garden-linux) to run as the vcap user, but that only makes them operate the way most CF components do, and you've observed no problems on the VMs in your CF deployment.
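For concreteness, the redirection in those ctl scripts looks roughly like the sketch below; the exact paths, tags, and options vary per component, so treat it as illustrative rather than the literal template.

```bash
#!/bin/bash
# Rough shape of the ctl-script redirection (paths and tag are illustrative):
# stdout and stderr each go through process substitution into tee + logger.
# Note that logger emits one syslog message per line of input it reads.
exec > >(tee -a "${LOG_DIR}/rep.stdout.log" | logger -t vcap.rep) \
    2> >(tee -a "${LOG_DIR}/rep.stderr.log" | logger -t vcap.rep)
```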
As it turns out, both of those Diego versions are still using Go 1.4, so this change in behavior wouldn't be due to any change in the Go runtime. The actual syslog forwarder config, driven by the syslog_daemon_config properties, is installed by the metron_agent job template, but that template comes from cf-release and is also present on all the CF jobs. Likewise, there were no changes at all to the consul_agent job between these CF versions.
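If it helps when you compare the two deployments, the rendered forwarder file ultimately boils down to an rsyslog forwarding rule of this general shape; the address, port, and protocol below are placeholders, not what cf-release actually renders for your environment.

```
# /etc/rsyslog.d/00-syslog_forwarder.conf (illustrative shape only)
# "@@" forwards over TCP; a single "@" would forward over UDP.
*.* @@logs.example.com:514
```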
If you create a dummy job in the Diego deployment with, say, just the metron_agent job template, do you still see these problems with the log ingestion? The metron agent and its ctl script themselves will emit at least a few lines of logs to syslog under normal operation, so that might be enough of a test case, and you could also pipe data to a `logger` process to provide more. Also, is there any difference at all between the /etc/rsyslog.d/00-syslog_forwarder.conf files on your Diego deployment VMs and on your CF ones?
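As a rough way to generate extra, easily searchable test traffic and to compare the forwarder configs, something like the following should do; the tag and line count are arbitrary.

```bash
# On the dummy Diego VM: emit some known lines through logger, one syslog
# message per input line, so you can search for them on the Splunk side.
seq 1 20 | sed 's/^/diego-ingest-test line /' | logger -t vcap.ingest_test

# Run on both a Diego VM and a CF VM and compare: identical checksums mean
# the rendered rsyslog forwarding rules match.
md5sum /etc/rsyslog.d/00-syslog_forwarder.conf
```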
Thanks, Eric
On Tue, Aug 11, 2015 at 2:26 PM, Mike Jacobi <jacobi(a)adobe.com> wrote:
I’m wondering if there are any updates on this.
thanks, Mike
On Jul 21, 2015, at 10:14 AM, Mike Jacobi wrote:
The CF messages ingest OK into Splunk; none of the Diego messages ingest, not even the first line of their payload.
The problematic deployment:
· CF v212
· Stemcell 3012 (bosh-stemcell-3012-vsphere-esxi-ubuntu-trusty-go_agent.tgz)
· Diego 1304
A single CF syslog message payload:
2015-07-21T00:26:51.619730+00:00 10.5.139.228 vcap.hm9000.listener [job=hm9000_z1 index=1] {"timestamp":1437438411.619571447,"process_id":9750,"source":"vcap.hm9000.listener","log_level":"info","message":"Saved Heartbeats - {\"Duration\":\"217.887334ms\",\"Heartbeats to Save\":\"3\"}","data":null}
A single Diego syslog message payload:
2015-07-21T00:27:02.389177+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438422.389095783","source":"receptor","message":"receptor.task-handler.get-all.succeeded-fetching-tasks-from-store","log_level":1,"data":{"domain":"","session":"3.3949"}} <13>2015-07-21T00:27:02.389399+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438422.389337301","source":"receptor","message":"receptor.request.done","log_level":1,"data":{"method":"GET","request":"/v1/tasks","session":"20204"}} <13>2015-07-21T00:27:02.389982+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438422.389916658","source":"receptor","message":"receptor.request.serving","log_level":1,"data":{"method":"GET","request":"/v1/desired_lrps","session":"20205"}} <13>2015-07-21T00:27:02.620949+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438422.620829105","source":"receptor","message":"receptor.request.done","log_level":1,"data":{"method":"GET","request":"/v1/desired_lrps","session":"20205"}} <13>2015-07-21T00:27:02.621660+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438422.621593714","source":"receptor","message":"receptor.request.serving","log_level":1,"data":{"method":"GET","request":"/v1/actual_lrps","session":"20206"}} <13>2015-07-21T00:27:03.145810+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438423.145678282","source":"receptor","message":"receptor.request.done","log_level":1,"data":{"method":"GET","request":"/v1/actual_lrps","session":"20206"}} <13>2015-07-21T00:27:03.146939+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438423.146871328","source":"receptor","message":"receptor.request.serving","log_level":1,"data":{"method":"GET","request":"/v1/domains","session":"20207"}} <13>2015-07-21T00:27:03.192371+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438423.192266226","source":"receptor","message":"receptor.request.done","log_level":1,"data":{"method":"GET","request":"/v1/domains","session":"20207"}} <12>2015-07-21T00:27:03.692338+00:00 10.5.139.241 vmsvc [job=cell_z1 index=1] [ warning] [guestinfo] Failed to get vmstats. 
<13>2015-07-21T00:27:03.936841+00:00 10.5.139.241 vcap.rep [job=cell_z1 index=1] {"timestamp":"1437438423.935924530","source":"rep","message":"rep.running-bulker.sync.starting","log_level":1,"data":{"session":"11.7938"}} <13>2015-07-21T00:27:03.937079+00:00 10.5.139.241 vcap.rep [job=cell_z1 index=1] {"timestamp":"1437438423.936018705","source":"rep","message":"rep.running-bulker.sync.batch-operations.started","log_level":1,"data":{"session":"11.7938.1"}} <13>2015-07-21T00:27:03.937110+00:00 10.5.139.241 vcap.rep [job=cell_z1 index=1] {"timestamp":"1437438423.936035156","source":"rep","message":"rep.running-bulker.sync.batch-operations.getting-containers-lrps-and-tasks","log_level":1,"data":{"session":"11.7938.1"}} <13>2015-07-21T00:27:03.937132+00:00 10.5.139.241 vcap.rep [job=cell_z1 index=1] {"timestamp":"1437438423.936167717","source":"rep","message":"rep.running-bulker.sync.batch-operations.fetching-tasks-from-store","log_level":1,"data":{"session":"11.7938.1"}} <13>2015-07-21T00:27:03.968190+00:00 10.5.139.241 vcap.rep [job=cell_z1 index=1] {"timestamp":"1437438423.968116999","source":"rep","message":"rep.running-bulker.sync.batch-operations.succeeded-fetching-tasks-from-store","log_level":1,"data":{"session":"11.7938.1"}} <13>2015-07-21T00:27:06.495263+00:00 10.5.139.241 vcap.rep [job=cell_z1 index=1] {"timestamp":"1437438426.495136499","source":"rep","message":"rep.running-bulker.sync.batch-operations.succeeded-getting-containers-lrps-and-tasks","log_level":1,"data":{"session":"11.7938.1"}} <13>2015-07-21T00:27:06.495341+00:00 10.5.139.241 vcap.rep [job=cell_z1 index=1] {"timestamp":"1437438426.495275497","source":"rep","message":"rep.running-bulker.sync.batch-operations.succeeded","log_level":1,"data":{"batch-size":0,"session":"11.7938.1"}}
Another (single) Diego syslog message payload:
2015-07-21T00:27:10.310490+00:00 10.5.139.246 vcap.route-emitter [job=route_emitter_z1 index=0] {"timestamp":"1437438430.310252428","source":"route-emitter","message":"route-emitter.watcher.sync.emitting-messages","log_level":1,"data":{"messages":{"RegistrationMessages":null,"UnregistrationMessages":null},"session":"5.10080"}} <13>2015-07-21T00:27:10.312037+00:00 10.5.139.246 vcap.route-emitter [job=route_emitter_z1 index=0] {"timestamp":"1437438430.311890841","source":"route-emitter","message":"route-emitter.watcher.sync.complete","log_level":1,"data":{"session":"5.10080"}} <12>2015-07-21T00:27:11.049315+00:00 10.5.139.246 vmsvc [job=route_emitter_z1 index=0] [ warning] [guestinfo] Failed to get vmstats.
-Mike
*From:* cf-dev-bounces(a)lists.cloudfoundry.org *On Behalf Of* Eric Malm
*Sent:* Tuesday, July 21, 2015 12:01 AM
*To:* Discussions about Cloud Foundry projects and the system overall.
*Subject:* Re: [cf-dev] Diego log grouping
Hi, Mike,
Thanks for the report! From your packet captures or on-VM logs, do you have an example of the log line groups that Splunk is failing to ingest? Is it all the log lines, or just ones coming from particular Diego components?
The github.com/tedsuo/ifrit dependency hasn't changed in diego-release between 1099 and 1304, but it's possible that our use of it in diego-release has. Likewise, the github.com/pivotal-golang/lager package that's emitting logs has changed in only trivial ways between those releases. We have upgraded the release to use Go 1.4.2 instead of 1.4, though.
Also, what stemcell versions are you using in the deployments? I'm assuming that if CF is deployed alongside these Diego deployments, it's at the corresponding recommended final version (v207 and v212, respectively). If so, are there any problems with the syslog messages coming from those deployments?
Thanks, Eric, CF Runtime Diego PM
On Mon, Jul 20, 2015 at 6:51 PM, Mike Jacobi wrote:
We have a Diego 1099 deployment and syslog_daemon_config configured. We see a 1:1 mapping for Diego platform messages to syslog messages. In other words, for each syslog message that hits the wire, there is one platform message as its payload. This works well with Splunk, which is ultimately where the messages end up.
We have another deployment on Diego 1304 with an identical syslog_daemon_config, but Splunk is **not** ingesting its logs. We ran a packet capture and discovered that this deployment is grouping its log messages in a 1:n manner: for each syslog message on the wire, there are multiple platform messages within the payload, separated by newlines. I suspect this is the reason the logs aren’t being ingested.
I took a quick glance at the code and it seems like this might be due to ifrit/grouper, but I can’t say for sure.
Has anyone run into this issue?
Thanks, Mike