Re: Diego log grouping


Eric Malm <emalm@...>

Hi, Mike,

Unfortunately, I haven't found anything that looks like it would account
for this difference in logging behavior. Between those releases, Diego
components haven't changed how they emit their logs to syslog: they still
redirect stdout to a file descriptor produced by process substitution on
commands such as `tee -a $LOG_DIR/rep.stdout.log | logger -t vcap.rep`. We
have changed most of these components (except garden-linux) to run as the
vcap user, but that only brings them in line with how most CF components
already operate, and you've observed no problems on the VMs in your CF
deployment.
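
For reference, the redirection those ctl scripts perform looks roughly like
the following. This is a minimal sketch with illustrative paths and tags, not
copied verbatim from the rep_ctl template:

    # Redirect the component's stdout/stderr into a local log file and, in
    # parallel, into syslog via logger, using bash process substitution:
    exec 1> >(tee -a $LOG_DIR/rep.stdout.log | logger -t vcap.rep) \
         2> >(tee -a $LOG_DIR/rep.stderr.log | logger -t vcap.rep)

Since `logger` emits one syslog message per input line, each line the
component writes to stdout should end up on the wire as its own message.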

As it turns out, both of those Diego versions are still using Go 1.4, so
this change in behavior wouldn't be due to any change in the Golang
runtime. The actual syslog forwarder config driven by syslog_daemon_config is
installed by the metron_agent job template, but that template comes from
cf-release and is present on all the CF jobs as well. Likewise, there were no
changes at all to the consul_agent job between these CF versions.

If you create a dummy job in the Diego deployment with, say, just the
metron_agent job template, do you still see these problems with the log
ingestion? The metron agent and its ctl script themselves will emit at
least a few lines of logs to syslog under normal operation, so that might
be enough of a test case, and you could also pipe data to a `logger`
process to provide more. Also, is there any difference at all between the
/etc/rsyslog.d/00-syslog_forwarder.conf files on your Diego deployment VMs
and on your CF ones?
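
Something along these lines, run on both a Diego cell and a CF VM, would
cover both checks. The tag and the local file names below are only examples:

    # Emit a few single-line test messages through the same syslog path the
    # components use, then check whether they reach Splunk:
    for i in 1 2 3; do echo "splunk ingestion test $i" | logger -t vcap.test; done

    # After copying the forwarder config off a Diego VM and a CF VM, compare them:
    diff diego-00-syslog_forwarder.conf cf-00-syslog_forwarder.conf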

Thanks,
Eric

On Tue, Aug 11, 2015 at 2:26 PM, Mike Jacobi <jacobi(a)adobe.com> wrote:

I’m wondering if there are any updates on this.

thanks,
Mike

On Jul 21, 2015, at 10:14 AM, Mike Jacobi wrote:

The CF messages ingest OK into Splunk; none of the Diego messages ingest,
not even the first line of their payload.

The problematic deployment:

· CF v212
· Stemcell 3012 (bosh-stemcell-3012-vsphere-esxi-ubuntu-trusty-go_agent.tgz)
· Diego 1304


A single CF syslog message payload:

2015-07-21T00:26:51.619730+00:00 10.5.139.228 vcap.hm9000.listener
[job=hm9000_z1 index=1]
{"timestamp":1437438411.619571447,"process_id":9750,"source":"vcap.hm9000.listener","log_level":"info","message":"Saved
Heartbeats - {\"Duration\":\"217.887334ms\",\"Heartbeats to
Save\":\"3\"}","data":null}

A single Diego syslog message payload:

2015-07-21T00:27:02.389177+00:00 10.5.139.241 vcap.receptor [job=cell_z1
index=1]
{"timestamp":"1437438422.389095783","source":"receptor","message":"receptor.task-handler.get-all.succeeded-fetching-tasks-from-store","log_level":1,"data":{"domain":"","session":"3.3949"}}
<13>2015-07-21T00:27:02.389399+00:00 10.5.139.241 vcap.receptor
[job=cell_z1 index=1]
{"timestamp":"1437438422.389337301","source":"receptor","message":"receptor.request.done","log_level":1,"data":{"method":"GET","request":"/v1/tasks","session":"20204"}}
<13>2015-07-21T00:27:02.389982+00:00 10.5.139.241 vcap.receptor
[job=cell_z1 index=1]
{"timestamp":"1437438422.389916658","source":"receptor","message":"receptor.request.serving","log_level":1,"data":{"method":"GET","request":"/v1/desired_lrps","session":"20205"}}
<13>2015-07-21T00:27:02.620949+00:00 10.5.139.241 vcap.receptor
[job=cell_z1 index=1]
{"timestamp":"1437438422.620829105","source":"receptor","message":"receptor.request.done","log_level":1,"data":{"method":"GET","request":"/v1/desired_lrps","session":"20205"}}
<13>2015-07-21T00:27:02.621660+00:00 10.5.139.241 vcap.receptor
[job=cell_z1 index=1]
{"timestamp":"1437438422.621593714","source":"receptor","message":"receptor.request.serving","log_level":1,"data":{"method":"GET","request":"/v1/actual_lrps","session":"20206"}}
<13>2015-07-21T00:27:03.145810+00:00 10.5.139.241 vcap.receptor
[job=cell_z1 index=1]
{"timestamp":"1437438423.145678282","source":"receptor","message":"receptor.request.done","log_level":1,"data":{"method":"GET","request":"/v1/actual_lrps","session":"20206"}}
<13>2015-07-21T00:27:03.146939+00:00 10.5.139.241 vcap.receptor
[job=cell_z1 index=1]
{"timestamp":"1437438423.146871328","source":"receptor","message":"receptor.request.serving","log_level":1,"data":{"method":"GET","request":"/v1/domains","session":"20207"}}
<13>2015-07-21T00:27:03.192371+00:00 10.5.139.241 vcap.receptor
[job=cell_z1 index=1]
{"timestamp":"1437438423.192266226","source":"receptor","message":"receptor.request.done","log_level":1,"data":{"method":"GET","request":"/v1/domains","session":"20207"}}
<12>2015-07-21T00:27:03.692338+00:00 10.5.139.241 vmsvc [job=cell_z1
index=1] [ warning] [guestinfo] Failed to get vmstats.
<13>2015-07-21T00:27:03.936841+00:00 10.5.139.241 vcap.rep [job=cell_z1
index=1]
{"timestamp":"1437438423.935924530","source":"rep","message":"rep.running-bulker.sync.starting","log_level":1,"data":{"session":"11.7938"}}
<13>2015-07-21T00:27:03.937079+00:00 10.5.139.241 vcap.rep [job=cell_z1
index=1]
{"timestamp":"1437438423.936018705","source":"rep","message":"rep.running-bulker.sync.batch-operations.started","log_level":1,"data":{"session":"11.7938.1"}}
<13>2015-07-21T00:27:03.937110+00:00 10.5.139.241 vcap.rep [job=cell_z1
index=1]
{"timestamp":"1437438423.936035156","source":"rep","message":"rep.running-bulker.sync.batch-operations.getting-containers-lrps-and-tasks","log_level":1,"data":{"session":"11.7938.1"}}
<13>2015-07-21T00:27:03.937132+00:00 10.5.139.241 vcap.rep [job=cell_z1
index=1]
{"timestamp":"1437438423.936167717","source":"rep","message":"rep.running-bulker.sync.batch-operations.fetching-tasks-from-store","log_level":1,"data":{"session":"11.7938.1"}}
<13>2015-07-21T00:27:03.968190+00:00 10.5.139.241 vcap.rep [job=cell_z1
index=1]
{"timestamp":"1437438423.968116999","source":"rep","message":"rep.running-bulker.sync.batch-operations.succeeded-fetching-tasks-from-store","log_level":1,"data":{"session":"11.7938.1"}}
<13>2015-07-21T00:27:06.495263+00:00 10.5.139.241 vcap.rep [job=cell_z1
index=1]
{"timestamp":"1437438426.495136499","source":"rep","message":"rep.running-bulker.sync.batch-operations.succeeded-getting-containers-lrps-and-tasks","log_level":1,"data":{"session":"11.7938.1"}}
<13>2015-07-21T00:27:06.495341+00:00 10.5.139.241 vcap.rep [job=cell_z1
index=1]
{"timestamp":"1437438426.495275497","source":"rep","message":"rep.running-bulker.sync.batch-operations.succeeded","log_level":1,"data":{"batch-size":0,"session":"11.7938.1"}}

Another (single) Diego syslog message payload:

2015-07-21T00:27:10.310490+00:00 10.5.139.246 vcap.route-emitter
[job=route_emitter_z1 index=0]
{"timestamp":"1437438430.310252428","source":"route-emitter","message":"route-emitter.watcher.sync.emitting-messages","log_level":1,"data":{"messages":{"RegistrationMessages":null,"UnregistrationMessages":null},"session":"5.10080"}}
<13>2015-07-21T00:27:10.312037+00:00 10.5.139.246 vcap.route-emitter
[job=route_emitter_z1 index=0]
{"timestamp":"1437438430.311890841","source":"route-emitter","message":"route-emitter.watcher.sync.complete","log_level":1,"data":{"session":"5.10080"}}
<12>2015-07-21T00:27:11.049315+00:00 10.5.139.246 vmsvc
[job=route_emitter_z1 index=0] [ warning] [guestinfo] Failed to get
vmstats.

-Mike

*From:* cf-dev-bounces(a)lists.cloudfoundry.org *On Behalf Of* Eric Malm
*Sent:* Tuesday, July 21, 2015 12:01 AM
*To:* Discussions about Cloud Foundry projects and the system overall.
*Subject:* Re: [cf-dev] Diego log grouping

Hi, Mike,

Thanks for the report! From your packet captures or on-VM logs, do you
have an example of the log line groups that Splunk is failing to ingest? Is
it all the log lines, or just ones coming from particular Diego components?

The github.com/tedsuo/ifrit dependency hasn't changed in diego-release
between 1099 and 1304, but it's possible that our use of it in
diego-release has. Likewise, the github.com/pivotal-golang/lager package
that's emitting logs has changed in only trivial ways between those
releases. We have upgraded the release to use Go 1.4.2 instead of 1.4,
though.

Also, what stemcell versions are you using in the deployments? I'm
assuming that if CF is deployed alongside these Diego deployments, it's at
the corresponding recommended final version (v207 and v212, respectively).
If so, are there any problems with the syslog messages coming from those
deployments?

Thanks,
Eric, CF Runtime Diego PM



On Mon, Jul 20, 2015 at 6:51 PM, Mike Jacobi wrote:

We have a Diego 1099 deployment with syslog_daemon_config configured. We see
a 1:1 mapping of Diego platform messages to syslog messages. In other words,
for each syslog message that hits the wire, there is one platform message as
its payload. This works well with Splunk, which is ultimately where the
messages end up.

We have another deployment on Diego 1304, with a syslog_daemon_config
identical to the first, but Splunk is *not* ingesting its logs. We ran a
packet capture and discovered that this deployment is grouping its log
messages in a 1:n manner: for each syslog message on the wire, there are
multiple platform messages within, separated by newlines. I suspect this is
the reason the logs aren't being ingested.
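
(For anyone wanting to reproduce the capture: something like the following on
a cell VM is enough to see the grouping, assuming the forwarder sends syslog
on port 514; adjust the port to match your syslog_daemon_config.)

    # Dump outbound syslog traffic and print payloads as ASCII; a 1:n grouping
    # shows up as several newline-separated JSON messages inside one packet:
    sudo tcpdump -nn -A -s 0 port 514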

I took a quick glance at the code and it seems like this might be due to
ifrit/grouper, but I can’t say for sure.

Has anyone run into this issue?

Thanks,
Mike



_______________________________________________
cf-dev mailing list
cf-dev(a)lists.cloudfoundry.org
https://lists.cloudfoundry.org/mailman/listinfo/cf-dev

