Re: Diego log grouping


MJ
 

The CF messages ingest OK into Splunk; None of the Diego messages ingest, not even the first line of their payload.

The problematic deployment:


· CF v212

· Stemcell 3012 (bosh-stemcell-3012-vsphere-esxi-ubuntu-trusty-go_agent.tgz)

· Diego 1304


An single CF syslog message payload:

2015-07-21T00:26:51.619730+00:00 10.5.139.228 vcap.hm9000.listener [job=hm9000_z1 index=1] {"timestamp":1437438411.619571447,"process_id":9750,"source":"vcap.hm9000.listener","log_level":"info","message":"Saved Heartbeats - {\"Duration\":\"217.887334ms\",\"Heartbeats to Save\":\"3\"}","data":null}

An single Diego syslog message payload:

2015-07-21T00:27:02.389177+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438422.389095783","source":"receptor","message":"receptor.task-handler.get-all.succeeded-fetching-tasks-from-store","log_level":1,"data":{"domain":"","session":"3.3949"}}
<13>2015-07-21T00:27:02.389399+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438422.389337301","source":"receptor","message":"receptor.request.done","log_level":1,"data":{"method":"GET","request":"/v1/tasks","session":"20204"}}
<13>2015-07-21T00:27:02.389982+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438422.389916658","source":"receptor","message":"receptor.request.serving","log_level":1,"data":{"method":"GET","request":"/v1/desired_lrps","session":"20205"}}
<13>2015-07-21T00:27:02.620949+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438422.620829105","source":"receptor","message":"receptor.request.done","log_level":1,"data":{"method":"GET","request":"/v1/desired_lrps","session":"20205"}}
<13>2015-07-21T00:27:02.621660+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438422.621593714","source":"receptor","message":"receptor.request.serving","log_level":1,"data":{"method":"GET","request":"/v1/actual_lrps","session":"20206"}}
<13>2015-07-21T00:27:03.145810+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438423.145678282","source":"receptor","message":"receptor.request.done","log_level":1,"data":{"method":"GET","request":"/v1/actual_lrps","session":"20206"}}
<13>2015-07-21T00:27:03.146939+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438423.146871328","source":"receptor","message":"receptor.request.serving","log_level":1,"data":{"method":"GET","request":"/v1/domains","session":"20207"}}
<13>2015-07-21T00:27:03.192371+00:00 10.5.139.241 vcap.receptor [job=cell_z1 index=1] {"timestamp":"1437438423.192266226","source":"receptor","message":"receptor.request.done","log_level":1,"data":{"method":"GET","request":"/v1/domains","session":"20207"}}
<12>2015-07-21T00:27:03.692338+00:00 10.5.139.241 vmsvc [job=cell_z1 index=1] [ warning] [guestinfo] Failed to get vmstats.
<13>2015-07-21T00:27:03.936841+00:00 10.5.139.241 vcap.rep [job=cell_z1 index=1] {"timestamp":"1437438423.935924530","source":"rep","message":"rep.running-bulker.sync.starting","log_level":1,"data":{"session":"11.7938"}}
<13>2015-07-21T00:27:03.937079+00:00 10.5.139.241 vcap.rep [job=cell_z1 index=1] {"timestamp":"1437438423.936018705","source":"rep","message":"rep.running-bulker.sync.batch-operations.started","log_level":1,"data":{"session":"11.7938.1"}}
<13>2015-07-21T00:27:03.937110+00:00 10.5.139.241 vcap.rep [job=cell_z1 index=1] {"timestamp":"1437438423.936035156","source":"rep","message":"rep.running-bulker.sync.batch-operations.getting-containers-lrps-and-tasks","log_level":1,"data":{"session":"11.7938.1"}}
<13>2015-07-21T00:27:03.937132+00:00 10.5.139.241 vcap.rep [job=cell_z1 index=1] {"timestamp":"1437438423.936167717","source":"rep","message":"rep.running-bulker.sync.batch-operations.fetching-tasks-from-store","log_level":1,"data":{"session":"11.7938.1"}}
<13>2015-07-21T00:27:03.968190+00:00 10.5.139.241 vcap.rep [job=cell_z1 index=1] {"timestamp":"1437438423.968116999","source":"rep","message":"rep.running-bulker.sync.batch-operations.succeeded-fetching-tasks-from-store","log_level":1,"data":{"session":"11.7938.1"}}
<13>2015-07-21T00:27:06.495263+00:00 10.5.139.241 vcap.rep [job=cell_z1 index=1] {"timestamp":"1437438426.495136499","source":"rep","message":"rep.running-bulker.sync.batch-operations.succeeded-getting-containers-lrps-and-tasks","log_level":1,"data":{"session":"11.7938.1"}}
<13>2015-07-21T00:27:06.495341+00:00 10.5.139.241 vcap.rep [job=cell_z1 index=1] {"timestamp":"1437438426.495275497","source":"rep","message":"rep.running-bulker.sync.batch-operations.succeeded","log_level":1,"data":{"batch-size":0,"session":"11.7938.1"}}

Another (single) Diego syslog message payload:

2015-07-21T00:27:10.310490+00:00 10.5.139.246 vcap.route-emitter [job=route_emitter_z1 index=0] {"timestamp":"1437438430.310252428","source":"route-emitter","message":"route-emitter.watcher.sync.emitting-messages","log_level":1,"data":{"messages":{"RegistrationMessages":null,"UnregistrationMessages":null},"session":"5.10080"}}
<13>2015-07-21T00:27:10.312037+00:00 10.5.139.246 vcap.route-emitter [job=route_emitter_z1 index=0] {"timestamp":"1437438430.311890841","source":"route-emitter","message":"route-emitter.watcher.sync.complete","log_level":1,"data":{"session":"5.10080"}}
<12>2015-07-21T00:27:11.049315+00:00 10.5.139.246 vmsvc [job=route_emitter_z1 index=0] [ warning] [guestinfo] Failed to get vmstats.

-Mike

From: cf-dev-bounces(a)lists.cloudfoundry.org [mailto:cf-dev-bounces(a)lists.cloudfoundry.org] On Behalf Of Eric Malm
Sent: Tuesday, July 21, 2015 12:01 AM
To: Discussions about Cloud Foundry projects and the system overall.
Subject: Re: [cf-dev] Diego log grouping

Hi, Mike,

Thanks for the report! From your packet captures or on-VM logs, do you have an example of the log line groups that Splunk is failing to ingest? Is it all the log lines, or just ones coming from particular Diego components?

The github.com/tedsuo/ifrit<http://github.com/tedsuo/ifrit> dependency hasn't changed in diego-release between 1099 and 1304, but it's possible that our use of it in diego-release has. Likewise, the github.com/pivotal-golang/lager<http://github.com/pivotal-golang/lager> package that's emitting logs has changed in only trivial ways between those releases. We have upgraded the release to use Go 1.4.2 instead of 1.4, though.

Also, what stemcell versions are you using in the deployments? I'm assuming that if CF is deployed alongside these Diego deployments, it's at the corresponding recommended final version (v207 and v212, respectively). If so, are there any problems with the syslog messages coming from those deployments?

Thanks,
Eric, CF Runtime Diego PM



On Mon, Jul 20, 2015 at 6:51 PM, Mike Jacobi <jacobi(a)adobe.com<mailto:jacobi(a)adobe.com>> wrote:
We have a Diego 1099 deployment and syslog_daemon_config configured. We see a 1:1 mapping for Diego platform messages to syslog messages. In other words, for each syslog message that hits the wire, there is one platform message as its payload. This works well with Splunk, which is ultimately where the messages end up.

We have another deployment, but on Diego 1304, with its syslog_daemon_config identical to the other, but Splunk is *not* ingesting its logs. We ran a packet capture and discovered that this deployment is grouping its log messages in a 1:n manner: For each syslog message on the wire, we have multiple platform messages within, separated by newlines. I suspect this is the reason the logs aren’t being ingested.

I took a quick glance at the code and it seems like this might be due to ifrit/grouper, but I can’t say for sure.

Has anyone run into this issue?

Thanks,
Mike


_______________________________________________
cf-dev mailing list
cf-dev(a)lists.cloudfoundry.org<mailto:cf-dev(a)lists.cloudfoundry.org>
https://lists.cloudfoundry.org/mailman/listinfo/cf-dev

Join cf-dev@lists.cloudfoundry.org to automatically receive all group messages.