Missing routing logs from "cf logs app-name"


Simon Johansson <simon@...>
 

Howdie! Not really a dev-related question, but since there is no cf-user list I'm trying my luck here. :)

I'm looking into an issue where we don't get all the RTR logs we expect from cf logs / firehose-to-syslog.
I'm using cf-env[1] as the test app, and I'm hitting app.domain.com/kehehehe.
For each request the app itself logs the path requested, and the gorouters log the same information a bit more verbosely.

$ cf logs cf-env-test
...
2015-08-11T11:43:41.24+0200 [RTR/0] OUT cf-env-test.domain.com - [11/08/2015:09:43:41 +0000] "GET /kehehehe HTTP/1.1" 404 0 18 "-" "curl/7.43.0" 10.230.15.4:59728 x_forwarded_for:"10.230.15.4" vcap_request_id:4a186d9e-046a-47bc-74d0-f0c3a8cb1257 response_time:0.008503156 app_id:7f1944a2-2197-43b5-9334-0be7c8b9b40e
2015-08-11T11:43:41.31+0200 [App/0] ERR 10.230.15.8 - - [11/Aug/2015 09:43:41] "GET /kehehehe HTTP/1.1" 404 18 0.0006

I fire off 1000 requests and would expect to see 1000 log lines of each type, but only the App count comes out right.
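
(Any simple curl loop reproduces this; something along these lines, a sketch rather than the exact command used:)
$ for i in $(seq 1 1000); do curl -s -o /dev/null app.domain.com/kehehehe; done    # sequential requests against the same route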

$ grep "App/" /tmp/cf-logs | wc -l # /tmp/cf-logs comes from cf logs cf-env-test > /tmp/cf-logs
1000
$ grep "RTR/" /tmp/cf-logs | wc -l
145

Looking at the gorouters:
nsaadmin(a)a99a5339-4308-43db-b3a4-59442169861d:~$ grep "GET /kehehehe" /var/vcap/sys/log/gorouter/access.log | wc -l
484
nsaadmin(a)1d8ef2c2-2ec1-417d-8882-0234de251c60:~$ grep "GET /kehehehe" /var/vcap/sys/log/gorouter/access.log | wc -l
516
So the gorouters definitely log the right number of requests to disk (484 + 516 = 1000).

In metron_agent.stdout.log on the gorouters there are a lot of lines like

{"timestamp":1439286941.958369255,"process_id":25749,"source":"metron","log_level":"warn","message":"no matching HTTP start message found for {low:6001753075995010215 high:3404432699330643838 1}","data":null,"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/messageaggregator/message_aggregator.go","line":97,"method":"metron/messageaggregator.(*MessageAggregator).handleHTTPStop"}
{"timestamp":1439286941.959420443,"process_id":25749,"source":"metron","log_level":"warn","message":"no matching HTTP start message found for {low:8668459744461365080 high:13197025281073682796 2}","data":null,"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/messageaggregator/message_aggregator.go","line":97,"method":"metron/messageaggregator.(*MessageAggregator).handleHTTPStop"}
{"timestamp":1439286941.964927912,"process_id":25749,"source":"metron","log_level":"warn","message":"no matching HTTP start message found for {low:13205430422726438703 high:7208042014820494920 1}","data":null,"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/messageaggregator/message_aggregator.go","line":97,"method":"metron/messageaggregator.(*MessageAggregator).handleHTTPStop"}

Has anyone seen this before?

[1] https://github.com/cloudfoundry-community/cf-env


Gwenn Etourneau
 

Simon, not sure if it's related, but if you don't consume the messages quickly enough, the buffer will be exhausted.

Did you try increasing the buffer size?


Simon Johansson <simon@...>
 

Gwenn, our buffer is set to 500 (the default is 100).
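
(Assuming this is the doppler message drain buffer: to double-check what the running dopplers actually picked up, the rendered job config on the doppler VMs should show it; path and key name from memory, so adjust for your release:)
$ grep -i buffer /var/vcap/jobs/doppler/config/doppler.json    # expect the drain buffer size to read 500 here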


Matthew Sykes <matthew.sykes@...>
 

Simon, what level of cf-release are you using? We've seen some extremely high log drop rates (~40% during log spurts) to log sinks with v209 and v210. That may or may not be related, since we were focused on application logs...

In some cases it was doppler dropping the messages because the truncating buffer was overrun; in some cases it was a bug where sinks were not picked up by doppler because its poller ran out of sockets due to a connection leak; and in others still, we just never saw the message get to one of the dopplers.
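
(A quick sanity check for the truncating-buffer case is to grep the doppler job logs for drop warnings; the exact message text varies between releases, so treat this as a sketch:)
$ grep -ci "dropped" /var/vcap/sys/log/doppler/doppler.stdout.log    # a non-zero count suggests doppler is shedding messages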

The http start/stop stuff is usually due to requests that take longer than
a minute (the starts get purged). I don't know if that makes sense for
these requests.

--
Matthew Sykes
matthew.sykes(a)gmail.com


Simon Johansson <simon@...>
 

> The http start/stop stuff is usually due to requests that take longer than
> a minute (the starts get purged). I don't know if that makes sense for
> these requests.

Ah, that is interesting, thanks!

Re my initial problem: it turned out to be a scaling issue. Most of the CPU on the gorouters was being consumed by the metron agent, so adding more CPU and scaling out the routers fixed it.
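
(For anyone hitting the same thing: a quick check on a router VM is to compare metron's CPU usage against gorouter's; a rough sketch:)
$ top -b -n1 | grep -E "metron|gorouter"    # if metron dominates the CPU, scale the routers out or up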