cloud_controller_ng performance degrades slowly over time


Filip Hanik
 

Could be related (or totally unrelated), but Ubuntu users have reported
slowness with dnsmasq, and the recursors flag implies that it tries
another set of DNS servers and bypasses dnsmasq (127.0.1.1).
I experienced this myself this morning: after disabling dnsmasq, the system
correctly uses all the entries in my resolv.conf.

I simply commented out dnsmasq, and my problems went away. Since CF uses
the trusty image, this may be worth trying out.

file: /etc/NetworkManager/NetworkManager.conf


[main]
plugins=ifupdown,keyfile,ofono
#dns=dnsmasq

[ifupdown]
managed=false
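
(After editing that file, NetworkManager needs a restart for the change to
take effect; on Ubuntu something like this should do it:)

sudo service network-manager restart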

On Fri, Nov 20, 2015 at 12:33 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt, it's helpful to validate that the solution worked for you.
I'm curious about how this will play with Diego, which actually makes heavy
use of Consul. We have a placeholder story to support recursing:
https://www.pivotaltracker.com/story/show/108394110. I don't think it's
as simple as exposing a config option, because the other current values in
the /etc/resolv.conf (which would become values consul recurses to) are
sometimes populated by data that BOSH knows, not something you would
configure in a CF manifest property.

Our consul-release is undergoing some heavy refactoring and
test-backfilling, and we'll want to wait for some of that to stabilize
before we start adding new functionality. But this issue is prominently on
our radar.

On Thu, Nov 19, 2015 at 8:11 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Configuring the consul agent to recurse does solve the issue.

The consul agent's configuration is directory based: it loads every json
file in its config directory. Since our deployment already consumes an
internal release with our components in addition to cf, I've created a
template as part of that which copies an additional json file (with a
configuration for https://www.consul.io/docs/agent/options.html#recursors)
into the agent's configuration directory. It's a hack, but it gets us
past having to build our own cf-release just to customize the
consul_agent's job templates.
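
For reference, the drop-in file itself is just a small piece of JSON along
these lines (the addresses here are placeholders for whatever upstream
resolvers your environment uses):

{
  "recursors": ["10.0.0.2", "8.8.8.8"]
}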

Would you folks accept a pull request that adds "recursors" as a
configuration option to
https://github.com/cloudfoundry-incubator/consul-release/blob/master/jobs/consul_agent/templates/config.json.erb?
Or is this work you already have in progress since you mentioned exploring
having consul recurse? Or were you thinking of solving this more
dynamically & consuming the existing content in resolv.conf somehow as the
upstream recursors?
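
To sketch the kind of change I have in mind (just an illustration, assuming
a hypothetical consul.agent.recursors property defaulting to an empty list):

"recursors": <%= p("consul.agent.recursors", []).to_json %>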

Thanks again for all the help trying to debug this one.



On Thu, Nov 19, 2015 at 8:54 AM, Matt Cholick <cholick(a)gmail.com> wrote:

Sure. Here's the Ruby:

dns_open: 8 ms | dns_close: 0 ms
dns_open: 9 ms | dns_close: 0 ms
dns_open: 10 ms | dns_close: 0 ms
dns_open: 3013 ms | dns_close: 0 ms
dns_open: 5015 ms | dns_close: 0 ms
dns_open: 5011 ms | dns_close: 0 ms
dns_open: 5014 ms | dns_close: 0 ms
dns_open: 5015 ms | dns_close: 0 ms
dns_open: 5012 ms | dns_close: 0 ms
dns_open: 5015 ms | dns_close: 0 ms

https://gist.github.com/cholick/f8bab2d9752728671b94

And the Go:
dns_open: 16ms | dns_close: 0ms
dns_open: 11ms | dns_close: 0ms
dns_open: 19ms | dns_close: 0ms
dns_open: 12ms | dns_close: 0ms
dns_open: 16ms | dns_close: 0ms
dns_open: 3016ms | dns_close: 0ms
dns_open: 15ms | dns_close: 0ms
dns_open: 5033ms | dns_close: 0ms
dns_open: 11ms | dns_close: 0ms
dns_open: 5018ms | dns_close: 0ms
dns_open: 5017ms | dns_close: 0ms
dns_open: 12ms | dns_close: 0ms
dns_open: 5017ms | dns_close: 0ms
dns_open: 13ms | dns_close: 0ms
dns_open: 5015ms | dns_close: 0ms

https://gist.github.com/cholick/7d4e01dcf99588d13016


On Wed, Nov 18, 2015 at 10:10 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt, that's super interesting. It would be awesome to get the
strace from the Ruby thing, as well as another strace from the Go thing, but
this time starting from before it goes slow. I think there's some sort of
weird thing that happens once, and then after that happens, the fact that
Consul (or any other always-SERVFAIL server, e.g. Dmitriy's little Ruby DNS
server) is present makes all subsequent requests slow. So I'd like to see
strace from both the Ruby and the Go, with enough before and after the
slowness starts to establish the same pattern of behaviour.

The alternating slow/fast stuff with Go is interesting. No clue yet
what that could be, maybe multiple goroutines?

On Wed, Nov 18, 2015 at 4:13 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

So things got a little more interesting yesterday. We have some custom
Golang jobs for auditing and some other purposes. A recent change added the
consul_agent template to the box with these jobs and I saw similar
behavior: requests with dns resolution consistently taking 5+ seconds, fixable by
restarting the process. I wrote a small Golang app to try and reproduce it.
I got similar, but not identical, behavior in the test app.

package main

import (
    "net"
    "fmt"
    "time"
)

func main() {
    for {
        startOpen := time.Now()
        conn, err := net.Dial("tcp", "api.useast.appfog.qa.ctl.io:http")
        if err != nil {
            fmt.Printf("%#v", err)
        } else {
            endOpen := time.Now()
            conn.Close()
            endClose := time.Now()
            openDur := endOpen.Sub(startOpen) / 1e6
            closeDur := endClose.Sub(endOpen) / 1e6

            fmt.Printf("dns_open: %dms | dns_close: %dms\n", openDur, closeDur)
        }
    }
}

After some time, I saw:
dns_open: 14ms | dns_close: 0ms
dns_open: 5019ms | dns_close: 1ms
dns_open: 25ms | dns_close: 1ms
dns_open: 27ms | dns_close: 1ms
dns_open: 20ms | dns_close: 0ms
dns_open: 5015ms | dns_close: 0ms
dns_open: 14ms | dns_close: 0ms
dns_open: 5018ms | dns_close: 1ms
dns_open: 15ms | dns_close: 0ms
dns_open: 5018ms | dns_close: 1ms
dns_open: 19ms | dns_close: 0ms
dns_open: 5046ms | dns_close: 0ms

To reiterate, in the actual app the slow behavior seemed to affect every single
request. The test app went into a similar mode where roughly every other
request was slow. Here's the strace for that section:

https://gist.github.com/cholick/428df93c112067f50558

On some other runs it was quite consistently alternating slow and fast
calls (for ~300 calls in a row alternating slow/fast). Just like the Ruby,
once things get into an unhappy state they stay that way until restarting
the process. (The process ran for 12 hours and never recovered).

I can capture a Ruby strace too if you'd like, but I figured this was
more useful as it added more novel information to the mix.

These Golang jobs have run in production for months without showing
this behavior and, within a few hours of including the consul template, the
issue manifested.


Amit Kumar Gupta
 

Heya Matt,

I keep thinking about this puzzle every couple of days to see if I can get any
more insight. It would be great to understand the root cause; in the meantime
we are going to try and set up Consul to recurse to other nameservers
(instead of just letting the system fall through to the next nameserver in
/etc/resolv.conf), and hopefully that'll make some of the weird Consul
behaviour we've seen in different environments better.

But back to this one, what I'm seeing is that there's a bunch of fast
iterations, then one weird one where it takes a few seconds, and then all
subsequent iterations are slow and hit the 5s timeout. In the fast
iteration, we see a sendmmsg sending multiple (2) messages, followed by 2
poll/ioctl/recvfroms. Each fast iteration does this first for 127.0.0.1,
then 8.8.8.8.

The first bad iteration, which takes about 3s, does the sendmmsg, and the 2
poll/ioctl/recvfroms for 127.0.0.1, just fine. Then, after doing the
sendmmsg to 8.8.8.8, it recv's a response for one of the messages,
but *times out (3s) polling for a response to the other message*. It then
appears to retry talking to 8.8.8.8, this time, and on *all subsequent calls,
using sendto instead of sendmmsg* to send one message at a time:
sendto/poll/ioctl/recvfrom, sendto/poll/ioctl/recvfrom.

All subsequent iterations only use sendto. And now it's the first *sendto
to 127.0.0.1 that times out (5s)*. It immediately goes on to 8.8.8.8 after
that first sendto to 127.0.0.1 fails.

Some questions are: (1) why doesn't the response to one of the sendmmsg
messages arrive? (2) why do all system calls switch from sendmmsg to sendto
after that first failure? (3) why do requests to 127.0.0.1 in the form of
separate sendto calls time out on the first sendto call, whereas when sent
simultaneously as two messages in sendmmsg they seem to work fine?

I put these findings together and asked on serverfault.com:
http://serverfault.com/questions/736041/dns-lookups-eventually-become-consistently-slow-in-long-lived-ruby-process.
One suggestion I saw, which could shed a bit more light, is to *call strace
with the additional -r -T* flags to get the relative time between syscalls and
the time spent within each syscall.
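
For example, something along these lines (the PID and output path are
placeholders):

strace -f -r -T -p <RUBY_PID> -o /tmp/ruby-dns-strace.log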

Care to give that a try when you have some time?

Cheers,
Amit

On Wed, Nov 4, 2015 at 1:54 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Gotcha. Yeah, the rescue lets that test run; after 425k lookups, it never
got slow.

Here's a bit of the strace:
https://gist.github.com/cholick/88c756760faca77208f8


On Wed, Nov 4, 2015 at 11:59 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey Matt,

I wanted to keep using the uaa.SYSTEM_DOMAIN domain, not the internal
domain, for that experiment. I do expect the TCPSocket.open to fail when
talking to 127.0.0.1; what I wanted to know is, in the presence of no other
nameservers, does it eventually start to fail slow again, or does this
behaviour happen only when there are other nameservers? I imagine the
TCPSocket.open is blowing up on the first iteration of the loop and exiting
the script? My bad, can you replace:

TCPSocket.open("--UAA-DOMAIN--", 80).close

with

TCPSocket.open("--UAA-DOMAIN--", 80).close rescue nil

for the experiment with only 127.0.0.1 listed amongst the nameservers?

Yes, something about the move from the first to the second nameserver seems
weird. I have seen an strace of one case where it times out polling the FD of
the socket it opened to talk to 127.0.0.1, but in one of your straces it
looked like the poll timeout was on polling the FD for the socket for
8.8.8.8. The fact that the problem persists is interesting too; it seems
like it's not just a one-off race condition where someone messed up which FD
it was supposed to be polling.

Thanks,
Amit

On Wed, Nov 4, 2015 at 11:41 AM, Matt Cholick <cholick(a)gmail.com> wrote:

Ah, I misunderstood.

Consul isn't configured as a recursive resolver, so for a test with only
127.0.0.1 in resolv.conf I changed the url in the ruby loop to
"uaa.service.cf.internal", which is the name uaa registers in consul.

I ran through 225k lookups and it never got slow. Here's a bit of the
strace:
https://gist.github.com/cholick/38e02ce3f351847d5fa3

Both versions of that test definitely point to the move from the
first to the second nameserver in ruby, when the first nameserver doesn't
know the address.


On Tue, Nov 3, 2015 at 11:43 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I looked at the strace, and I see you did indeed mean "loop without
resolver on localhost". If you try it with *only* a resolver on localhost,
do you get the eventually consistent DNS slowdown?

On Tue, Nov 3, 2015 at 8:33 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt!

When you say "the loop without the resolver on local host", did you
mean "the loop with only a resolver on local host"? Sorry if my setup
wasn't clear, but my intention was to only have 127.0.0.1 in
/etc/resolv.conf.


On Tuesday, November 3, 2015, Matt Cholick <cholick(a)gmail.com> wrote:

Here are the results of the ruby loop with strace:
https://gist.github.com/cholick/e7e122e34b524cae5fa1

As expected, things eventually get slow. The bash version of the loop
with a new vm each time didn't get slow.

For the loop without a resolver on localhost, it never did get slow.
Though it's hard to prove with something so inconsistent, it hadn't
happened after 100k requests. Here's some of the strace:
https://gist.github.com/cholick/81e58f58e82bfe0a1489

On the final loop, with the SERVFAIL resolver, the issue did
manifest. Here's the trace of that run:
https://gist.github.com/cholick/bd2af46795911cb9f63c

Thanks for digging in on this.


On Mon, Nov 2, 2015 at 6:53 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Okay, interesting, hopefully we're narrowing in on something.
There are a couple of variables I'd like to eliminate, so I wonder if you could
try the following. Also, feel free at any point to let me know if you are
not interested in digging further.

Try all things as sudo, on one of the CCs.

1. It appears that the problem goes away when the CC process is
restarted, so it feels as though there's some sort of resource that the
ruby process is not able to GC, causing this problem to show up
eventually and then go away when the process is restarted. I want to confirm
this by trying two different loops: one where the loop is in bash, spinning up
a new ruby process each time, and one where the loop is in ruby.

* bash loop:

while true; do time /var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e 'TCPSocket.open("--UAA-DOMAIN--", 80).close'; done

* ruby loop:

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step do |i|; t = Time.now; TCPSocket.open("--UAA-DOMAIN--", 80).close; puts "#{i}: #{(1000*(Time.now - t)).round}ms"; end'

For each loop, it might also be useful to run `strace -f -p PID >
SOME_FILE` to see what system calls are going on before and after.

2. Another variable is the interaction with the other nameservers.
For this experiment, I would do `monit stop all` to take one of
your CC's out of commission, so that the router doesn't load balance to it,
because it will likely fail requests given the following changes:

* monit stop all && watch monit summary # wait for all the
processes to be stopped, then ctrl+c to stop the watch
* monit start consul_agent && watch monit summary # wait for
consul_agent to be running, then ctrl+c to stop the watch
* Remove nameservers other than 127.0.0.1 from /etc/resolv.conf
* Run the "ruby loop", and see if it still eventually gets slow
* When it's all done, put the original nameservers back in
/etc/resolv.conf, and `monit restart all`

Again, strace-ing the ruby loop would be interesting here.

3. Finally, consul itself. Dmitriy (BOSH PM) has a little DNS
resolver that can be run instead of consul, that will always SERVFAIL (same
as what you see from consul when you nslookup something), so we can try
that:

* Modify `/var/vcap/bosh/etc/gemrc` to remove the `--local` flag
* Run `gem install rubydns`
* Dump the following into a file, say `/var/vcap/data/tmp/dns.rb`:

#!/usr/bin/env ruby

require "rubydns"

RubyDNS.run_server(listen: [[:udp, "0.0.0.0", 53], [:tcp, "0.0.0.0", 53]]) do
  otherwise do |transaction|
    transaction.fail!(:ServFail)
  end
end

* monit stop all && watch monit summary # and again, wait for
everything to be stopped
* Run it with `ruby /var/vcap/data/tmp/dns.rb`. Note that this
command, and the previous `gem install`, use the system gem/ruby,
not the ruby package used by CC, so it maintains some separation. When
running this, it will spit out logs to the terminal, so one can keep an eye
on what it's doing, make sure it all looks reasonable
* Make sure the original nameservers are back in the
`/etc/resolv.conf` (i.e. ensure this experiment is independent of the
previous experiment).
* Run the "ruby loop" (in a separate shell session on the CC)
* After it's all done, add back `--local` to `/var/vcap/bosh/etc/gemrc`, and `monit restart all`

Again, run strace on the ruby process.

What I hope we find out is that (1) only the ruby loop is affected,
so it has something to do with long-running ruby processes, (2) the problem
is independent of the other nameservers listed in /etc/resolv.conf,
and (3) the problem remains when running Dmitriy's DNS SERVFAIL server instead
of consul on 127.0.0.1:53, to determine that the problem is not
specific to consul.

On Sun, Nov 1, 2015 at 5:18 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Amit,
It looks like consul isn't configured as a recursive resolver. When
running the above code, resolving fails on the first nameserver and the
script fails. resolv-replace's TCPSocket.open is different from the code
http.rb (and thus the api) is using; http.rb is pulling in 'net/protocol'. I
changed the script, replacing the require of 'resolv-replace' with
'net/protocol' to match the cloud controller.

Results:

3286 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4 ms | dns_close: 0 ms
3287 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3288 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 6 ms | dns_close: 0 ms
3289 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3290 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3291 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3292 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms | dns_close: 0 ms
3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3296 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3297 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4006 ms | dns_close: 0 ms
3298 -- ip_open: 2 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3299 -- ip_open: 3 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3300 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3301 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3302 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms

And the consul logs, though there's nothing interesting there:
https://gist.github.com/cholick/03d74f7f012e54c50b56


On Fri, Oct 30, 2015 at 5:51 PM, Amit Gupta <agupta(a)pivotal.io>
wrote:

Yup, that's what I was suspecting. Can you try the following now:

1. Add something like the following to your cf manifest:

...
jobs:
...
- name: cloud_controller_z1
  ...
  properties:
    consul:
      agent:
        ...
        log_level: debug
...

This will set the log level for the consul agents on your CC job
to debug, so we might be able to see more in their logs. It only sets it on
the job that matters, so when you redeploy, it won't have to roll the whole
deployment. It's okay if you can't/don't want to do this, since I'm not sure how
much you want to play around with your environment, but it could be helpful.

2. Add the following line to the bottom of your /etc/resolv.conf

options timeout:4

Let's see if the slow DNS is on the order of 4000ms now, to pin
down where the 5s is exactly coming from.
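
So the resulting /etc/resolv.conf would look roughly like this (the
nameserver entries are just placeholders for whatever is already in yours):

nameserver 127.0.0.1
nameserver 8.8.8.8
options timeout:4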

3. Run the following script on your CC box:

require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--'     # e.g. 'uaa.run.pivotal.io'
UAA_IP = '--CHANGE-ME-TOO--'     # e.g. '52.21.135.158'

def dur(start_time, end_time)
  "#{(1000*(end_time-start_time)).round} ms"
end

1.step do |i|
  ip_start = Time.now
  s = TCPSocket.open(UAA_IP, 80)
  ip_open = Time.now
  s.close
  ip_close = Time.now

  dns_start = Time.now
  s = TCPSocket.open(UAA_DOMAIN, 80)
  dns_open = Time.now
  s.close
  dns_close = Time.now

  ip_open_dur = dur(ip_start, ip_open)
  ip_close_dur = dur(ip_open, ip_close)
  dns_open_dur = dur(dns_start, dns_open)
  dns_close_dur = dur(dns_open, dns_close)

  puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close: #{ip_close_dur} | dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"
end

You will need to first nslookup (or otherwise determine) the IP
that the UAA_DOMAIN resolves to (it will be some load balancer, possibly
the gorouter, ha_proxy, or your own upstream LB)
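
(For example, `nslookup --UAA-DOMAIN--` and use one of the addresses it
returns as UAA_IP.)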

4. Grab the files in /var/vcap/sys/log/consul_agent/

Cheers,
Amit

On Fri, Oct 30, 2015 at 4:29 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Here are the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

The time is all spent between "opening connection" and "opened", with the
corresponding ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
  TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much ruby, so that's as far as I drilled down.

-Matt


Matt Cholick
 

Gotcha. Yeah, the rescue lets that test run; after 425k lookups, it never
got slow.

Here's a bit of the strace:
https://gist.github.com/cholick/88c756760faca77208f8

On Wed, Nov 4, 2015 at 11:59 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey Matt,

I wanted to keep using the uaa.SYSTEM_DOMAIN domain, not the internal
domain, for that experiment. I do expect the TCPSocket.open to fail when
talking to 127.0.0.1, what I wanted to know is, in the presence of no other
nameservers, does it eventually start to fail slow again, or does this
behaviour happen only when there are other nameservers. I imagine the
TCPSocket.open is blowing up on the first iteration in the loop and exiting
the script? My bad, can you replace:

TCPSocket.open("--UAA-DOMAIN--", 80).close

with

TCPSocket.open("--UAA-DOMAIN--", 80).close rescue nil

for the experiment with only 127.0.0.1 listed amongst the nameservers?

Yes, something about the move from the first to second nameserver seems
weird. I have seen strace of one case where it times out polling the FD of
the socket it opened to talk to 127.0.0.1, but in one of your straces it
looked like the poll timeout was on polling the FD for the socket for
8.8.8.8. The fact that the problem persists is interesting too, it seems
like it's not just a one-off race condition where someone messed up with FD
it was supposed to be polling.

Thanks,
Amit

On Wed, Nov 4, 2015 at 11:41 AM, Matt Cholick <cholick(a)gmail.com> wrote:

Ah, I misunderstood.

Consul isn't configured as a recursive resolver, so for a test with only
the 127.0.0.1 in resolve.conf I changed the url in the ruby loop to
"uaa.service.cf.internal", which is what uaa is registering for in consul.

I ran through 225k lookups and it never got slow. Here's a bit of the
strace:
https://gist.github.com/cholick/38e02ce3f351847d5fa3

Bother versions of that test definitely pointing to the move from the
first to the second nameserver in ruby, when the first nameserver doesn't
know the address.


On Tue, Nov 3, 2015 at 11:43 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I looked at the strace, I see you did indeed mean "loop without resolver
on localhost". If you try it with *only* a resolver on localhost, do you
get the eventually consistent DNS slowdown?

On Tue, Nov 3, 2015 at 8:33 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt!

When you say "the loop without the resolver on local host" did you mean
"the loop with only a resolver on local host"? Sorry if my setup wasn't
clear, but my intention was to only have 127.0.0.1 in etc/resolv.conf.


On Tuesday, November 3, 2015, Matt Cholick <cholick(a)gmail.com> wrote:

Here are the results of the ruby loop with strace:
https://gist.github.com/cholick/e7e122e34b524cae5fa1

As expected, things eventually get slow. The bash version of the loop
with a new vm each time didn't get slow.

For the loop without a resolver on localhost, it never did get slow.
Though it's hard to prove with something so inconsistent, it hadn't
happened after 100k requests. Here's some of the strace:
https://gist.github.com/cholick/81e58f58e82bfe0a1489

On the final loop, with the SERVFAIL resolver, the issue did
manifest. Here's the trace of that run:
https://gist.github.com/cholick/bd2af46795911cb9f63c

Thanks for digging in on this.


On Mon, Nov 2, 2015 at 6:53 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Okay, interesting, hopefully we're narrowing in on something.
There's a couple variables I'd like to eliminate, so I wonder if you could
try the following. Also, feel free at any point to let me know if you are
not interesting in digging further.

Try all things as sudo, on one of the CCs.

1. It appears that the problem goes away when the CC process is
restarted, so it feels as though there's some sort of resource that the
ruby process is not able to GC, leading to this problem to show up
eventually, and then go away when restarted. I want to confirm this by
trying two different loops, one where the loop is in bash, spinning up a
new ruby process each time, and one where the loop is in ruby.

* bash loop:

while true; do time /var/vcap/packages/ruby-VERSION/bin/ruby
-r'net/protocol' -e 'TCPSocket.open("--UAA-DOMAIN--", 80).close'; done

* ruby loop

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step
do |i|; t = Time.now; TCPSocket.open("--UAA-DOMAIN--", 80).close; puts
"#{i}: #{(1000*(Time.now - t)).round}ms"; end'

For each loop, it might also be useful to run `strace -f -p PID >
SOME_FILE` to see what system calls are going on before and after.

2. Another variable is the interaction with the other nameservers.
For this experiment, I would do `monit stop all` to take one of your
CC's out of commission, so that the router doesn't load balance to it,
because it will likely fail requests given the following changes:

* monit stop all && watch monit summary # wait for all the processes
to be stopped, then ctrl+c to stop the watch
* monit start consul_agent && watch monit summary # wait for
consul_agent to be running, then ctrl+c to stop the watch
* Remove nameservers other than 127.0.0.1 from /etc/resolv.conf
* Run the "ruby loop", and see if it still eventually gets slow
* When it's all done, put the original nameservers back in
/etc/resolv.conf, and `monit restart all`

Again, strace-ing the ruby loop would be interesting here.

3. Finally, consul itself. Dmitriy (BOSH PM) has a little DNS
resolver that can be run instead of consul, that will always SERVFAIL (same
as what you see from consul when you nslookup something), so we can try
that:

* Modify `/var/vcap/bosh/etc/gemrc` to remove the `--local` flag
* Run `gem install rubydns`
* Dump the following into a file, say `/var/vcap/data/tmp/dns.rb`:

#!/usr/bin/env ruby

require "rubydns"

RubyDNS.run_server(listen: [[:udp, "0.0.0.0", 53], [:tcp, "0.0.0.0",
53]]) do
otherwise do |transaction|
transaction.fail!(:ServFail)
end
end

* monit stop all && watch monit summary # and again, wait for
everything to be stopped
* Run it with `ruby /var/vcap/data/tmp/dns.rb`. Note that this
command, and the previous `gem install`, use the system gem/ruby,
not the ruby package used by CC, so it maintains some separation. When
running this, it will spit out logs to the terminal, so one can keep an eye
on what it's doing, make sure it all looks reasonable
* Make sure the original nameservers are back in the
`/etc/resolv.conf` (i.e. ensure this experiment is independent of the
previous experiment).
* Run the "ruby loop" (in a separate shell session on the CC)
* After it's all done, add back `--local` to `
/var/vcap/bosh/etc/gemrc`, and `monit restart all`

Again, run strace on the ruby process.

What I hope we find out is that (1) only the ruby loop is affected,
so it has something to do with long running ruby processes, (2) the problem
is independent of the other nameservers listed in /etc/resolv.conf,
and (3) the problem remains when running Dmitriy's DNS-FAILSERVer instead
of consul on 127.0.0.1:53, to determine that the problem is not
specific to consul.

On Sun, Nov 1, 2015 at 5:18 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Amit,
It looks like consul isn't configured as a recursive resolver. When
running the above code, resolving fails on the first nameserver and the
script fails. resolv-replace's TCPSocket.open is different from the code
http.rb (and thus api) is using. http.rb is pulling in 'net/protocol'. I
changed the script, replacing the require for 'resolv-replace' to
'net/protocol' to match the cloud controller.

Results:

3286 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4 ms | dns_close:
0 ms
3287 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3288 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 6 ms | dns_close:
0 ms
3289 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3290 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3291 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3292 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms |
dns_close: 0 ms
3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3296 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3297 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4006 ms |
dns_close: 0 ms
3298 -- ip_open: 2 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3299 -- ip_open: 3 ms | ip_close: 0 ms | dns_open: 4011 ms |
dns_close: 0 ms
3300 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3301 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4011 ms |
dns_close: 0 ms
3302 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms

And the consul logs, though there's nothing interesting there:
https://gist.github.com/cholick/03d74f7f012e54c50b56


On Fri, Oct 30, 2015 at 5:51 PM, Amit Gupta <agupta(a)pivotal.io>
wrote:

Yup, that's what I was suspecting. Can you try the following now:

1. Add something like the following to your cf manifest:

...
jobs:
...
- name: cloud_controller_z1
...
properties:
consul:
agent:
...
log_level: debug
...

This will set the debug level for the consul agents on your CC job
to debug, so we might be able to see more for its logs. It only sets it on
the job that matters, so when you redeploy, it won't have to roll the whole
deployment. It's okay if you can't/don't want to do this, I'm not sure how
much you want to play around with your environment, but it could be helpful.

2. Add the following line to the bottom of your /etc/resolv.conf

options timeout:4

Let's see if the slow DNS is on the order of 4000ms now, to pin
down where the 5s is exactly coming from.

3. Run the following script on your CC box:

require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--' # e.g. 'uaa.run.pivotal.io'
UAA_IP = '--CHANGE-ME-TOO--' # e.g. '52.21.135.158'

def dur(start_time, end_time)
"#{(1000*(end_time-start_time)).round} ms"
end

1.step do |i|
ip_start = Time.now
s = TCPSocket.open(UAA_IP, 80)
ip_open = Time.now
s.close
ip_close = Time.now

dns_start = Time.now
s = TCPSocket.open(UAA_DOMAIN, 80)
dns_open = Time.now
s.close
dns_close = Time.now

ip_open_dur = dur(ip_start, ip_open)
ip_close_dur = dur(ip_open, ip_close)
dns_open_dur = dur(dns_start, dns_open)
dns_close_dur = dur(dns_open, dns_close)

puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close:
#{ip_close_dur} | dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"
end

You will need to first nslookup (or otherwise determine) the IP
that the UAA_DOMAIN resolves to (it will be some load balancer, possibly
the gorouter, ha_proxy, or your own upstream LB)

4. Grab the files in /var/vcap/sys/log/consul_agent/

Cheers,
Amit

On Fri, Oct 30, 2015 at 4:29 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Here's the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

The time all between opening connection and opened, with the
corresponding ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much ruby, so that's as far I drilled down.

-Matt


Amit Kumar Gupta
 

Hey Matt,

I wanted to keep using the uaa.SYSTEM_DOMAIN domain, not the internal
domain, for that experiment. I do expect the TCPSocket.open to fail when
talking to 127.0.0.1, what I wanted to know is, in the presence of no other
nameservers, does it eventually start to fail slow again, or does this
behaviour happen only when there are other nameservers. I imagine the
TCPSocket.open is blowing up on the first iteration in the loop and exiting
the script? My bad, can you replace:

TCPSocket.open("--UAA-DOMAIN--", 80).close

with

TCPSocket.open("--UAA-DOMAIN--", 80).close rescue nil

for the experiment with only 127.0.0.1 listed amongst the nameservers?

Yes, something about the move from the first to second nameserver seems
weird. I have seen strace of one case where it times out polling the FD of
the socket it opened to talk to 127.0.0.1, but in one of your straces it
looked like the poll timeout was on polling the FD for the socket for
8.8.8.8. The fact that the problem persists is interesting too, it seems
like it's not just a one-off race condition where someone messed up with FD
it was supposed to be polling.

Thanks,
Amit

On Wed, Nov 4, 2015 at 11:41 AM, Matt Cholick <cholick(a)gmail.com> wrote:

Ah, I misunderstood.

Consul isn't configured as a recursive resolver, so for a test with only
the 127.0.0.1 in resolve.conf I changed the url in the ruby loop to
"uaa.service.cf.internal", which is what uaa is registering for in consul.

I ran through 225k lookups and it never got slow. Here's a bit of the
strace:
https://gist.github.com/cholick/38e02ce3f351847d5fa3

Bother versions of that test definitely pointing to the move from the
first to the second nameserver in ruby, when the first nameserver doesn't
know the address.




Matt Cholick
 

Ah, I misunderstood.

Consul isn't configured as a recursive resolver, so for a test with only
127.0.0.1 in /etc/resolv.conf I changed the URL in the ruby loop to
"uaa.service.cf.internal", which is the name uaa registers in consul.

I ran through 225k lookups and it never got slow. Here's a bit of the
strace:
https://gist.github.com/cholick/38e02ce3f351847d5fa3

Both versions of that test definitely point to the move from the first
to the second nameserver in ruby, when the first nameserver doesn't know
the address.
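
For reference, the loop used for that test was just the earlier ruby one-liner with the consul-registered name swapped in (a sketch; ruby-VERSION is still a placeholder for the packaged ruby on the box):

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step do |i|; t = Time.now; TCPSocket.open("uaa.service.cf.internal", 80).close; puts "#{i}: #{(1000*(Time.now - t)).round}ms"; end'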



Amit Kumar Gupta
 

I looked at the strace, and I see you did indeed mean "loop without resolver
on localhost". If you try it with *only* a resolver on localhost, do you
get the eventually consistent DNS slowdown?



Amit Kumar Gupta
 

Thanks Matt!

When you say "the loop without the resolver on localhost", did you mean
"the loop with only a resolver on localhost"? Sorry if my setup wasn't
clear, but my intention was to have only 127.0.0.1 in /etc/resolv.conf.



Matt Cholick
 

Here are the results of the ruby loop with strace:
https://gist.github.com/cholick/e7e122e34b524cae5fa1

As expected, things eventually get slow. The bash version of the loop with
a new vm each time didn't get slow.

For the loop without a resolver on localhost, it never did get slow. Though
it's hard to prove with something so inconsistent, it hadn't happened after
100k requests. Here's some of the strace:
https://gist.github.com/cholick/81e58f58e82bfe0a1489

On the final loop, with the SERVFAIL resolver, the issue did manifest.
Here's the trace of that run:
https://gist.github.com/cholick/bd2af46795911cb9f63c

Thanks for digging in on this.



Amit Kumar Gupta
 

Okay, interesting, hopefully we're narrowing in on something. There are a
couple of variables I'd like to eliminate, so I wonder if you could try the
following. Also, feel free at any point to let me know if you are not
interested in digging further.

Try all things as sudo, on one of the CCs.

1. It appears that the problem goes away when the CC process is restarted,
so it feels as though there's some sort of resource that the ruby process
is not able to GC, causing the problem to show up eventually and then go
away on restart. I want to confirm this by trying two different loops:
one where the loop is in bash, spinning up a new ruby process each time,
and one where the loop is in ruby.

* bash loop:

while true; do time /var/vcap/packages/ruby-VERSION/bin/ruby
-r'net/protocol' -e 'TCPSocket.open("--UAA-DOMAIN--", 80).close'; done

* ruby loop

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step do
|i|; t = Time.now; TCPSocket.open("--UAA-DOMAIN--", 80).close; puts "#{i}:
#{(1000*(Time.now - t)).round}ms"; end'

For each loop, it might also be useful to run `strace -f -p PID > SOME_FILE`
to see what system calls are going on before and after.
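
For the ruby loop, one way to attach strace without hunting for the PID by hand (a sketch; the pgrep pattern and output path are just examples) is:

strace -f -tt -o /var/vcap/data/tmp/ruby-loop.strace -p "$(pgrep -f net/protocol)"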

2. Another variable is the interaction with the other nameservers. For
this experiment, I would do `monit stop all` to take one of your CCs
out of commission, so that the router doesn't load balance to it, because
it will likely fail requests given the following changes:

* monit stop all && watch monit summary # wait for all the processes to be
stopped, then ctrl+c to stop the watch
* monit start consul_agent && watch monit summary # wait for consul_agent
to be running, then ctrl+c to stop the watch
* Remove nameservers other than 127.0.0.1 from /etc/resolv.conf
* Run the "ruby loop", and see if it still eventually gets slow
* When it's all done, put the original nameservers back in /etc/resolv.conf,
and `monit restart all`

Again, strace-ing the ruby loop would be interesting here.
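
For concreteness, during this experiment /etc/resolv.conf would be trimmed down to just the local consul agent, along the lines of (sketch):

nameserver 127.0.0.1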

3. Finally, consul itself. Dmitriy (BOSH PM) has a little DNS resolver
that can be run instead of consul and that will always return SERVFAIL (the
same as what you see from consul when you nslookup something), so we can try
that:

* Modify `/var/vcap/bosh/etc/gemrc` to remove the `--local` flag
* Run `gem install rubydns`
* Dump the following into a file, say `/var/vcap/data/tmp/dns.rb`:

#!/usr/bin/env ruby

require "rubydns"

RubyDNS.run_server(listen: [[:udp, "0.0.0.0", 53], [:tcp, "0.0.0.0", 53]]) do
  otherwise do |transaction|
    transaction.fail!(:ServFail)
  end
end

* monit stop all && watch monit summary # and again, wait for everything to
be stopped
* Run it with `ruby /var/vcap/data/tmp/dns.rb`. Note that this command,
and the previous `gem install`, use the system gem/ruby, not the ruby
package used by CC, so it maintains some separation. When running this, it
will spit out logs to the terminal, so one can keep an eye on what it's
doing and make sure it all looks reasonable
* Make sure the original nameservers are back in the `/etc/resolv.conf`
(i.e. ensure this experiment is independent of the previous experiment).
* Run the "ruby loop" (in a separate shell session on the CC)
* After it's all done, add back `--local` to `/var/vcap/bosh/etc/gemrc`,
and `monit restart all`

Again, run strace on the ruby process.
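
As a quick sanity check that the stand-in resolver behaves like consul, a lookup against it should come back SERVFAIL (a sketch, with the usual domain placeholder):

nslookup --UAA-DOMAIN-- 127.0.0.1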

What I hope we find out is that (1) only the ruby loop is affected, so
it has something to do with long-running ruby processes, (2) the problem is
independent of the other nameservers listed in /etc/resolv.conf, and
(3) the problem remains when running Dmitriy's SERVFAIL DNS server instead of
consul on 127.0.0.1:53, which would show that the problem is not specific
to consul.



Matt Cholick
 

Amit,
It looks like consul isn't configured as a recursive resolver. When running
the above code, resolving fails on the first nameserver and the script
fails. resolv-replace's TCPSocket.open is different from the code http.rb
(and thus the api) is using: http.rb pulls in 'net/protocol'. I changed
the script, replacing the require for 'resolv-replace' with 'net/protocol' to
match the cloud controller.
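
Concretely, the changed script is the same timing loop Amit posted further down the thread, with only the require swapped (a sketch):

require 'net/protocol'  # was: require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--'   # e.g. 'uaa.run.pivotal.io'
UAA_IP = '--CHANGE-ME-TOO--'   # e.g. '52.21.135.158'

def dur(start_time, end_time)
  "#{(1000*(end_time-start_time)).round} ms"
end

1.step do |i|
  ip_start = Time.now
  s = TCPSocket.open(UAA_IP, 80)
  ip_open = Time.now
  s.close
  ip_close = Time.now

  dns_start = Time.now
  s = TCPSocket.open(UAA_DOMAIN, 80)
  dns_open = Time.now
  s.close
  dns_close = Time.now

  puts "#{"%04d" % i} -- ip_open: #{dur(ip_start, ip_open)} | ip_close: #{dur(ip_open, ip_close)} | dns_open: #{dur(dns_start, dns_open)} | dns_close: #{dur(dns_open, dns_close)}"
end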

Results:

3286 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4 ms | dns_close: 0 ms
3287 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3288 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 6 ms | dns_close: 0 ms
3289 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3290 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3291 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3292 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms | dns_close: 0 ms
3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3296 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3297 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4006 ms | dns_close: 0 ms
3298 -- ip_open: 2 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3299 -- ip_open: 3 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3300 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3301 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3302 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms

And the consul logs, though there's nothing interesting there:
https://gist.github.com/cholick/03d74f7f012e54c50b56



Amit Kumar Gupta
 

Yup, that's what I was suspecting. Can you try the following now:

1. Add something like the following to your cf manifest:

...
jobs:
...
- name: cloud_controller_z1
  ...
  properties:
    consul:
      agent:
        ...
        log_level: debug
...

This will set the log level for the consul agent on your CC job to
debug, so we might be able to see more in its logs. It only sets it on
the job that matters, so when you redeploy, it won't have to roll the whole
deployment. It's okay if you can't or don't want to do this, and I'm not sure
how much you want to play around with your environment, but it could be helpful.

2. Add the following line to the bottom of your /etc/resolv.conf

options timeout:4

Let's see if the slow DNS is on the order of 4000ms now, to pin down where
the 5s is exactly coming from.
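
With that change, the tail of /etc/resolv.conf would look something like this (a sketch; the second nameserver is just a stand-in for whatever upstream resolver is already there):

nameserver 127.0.0.1
nameserver 10.0.0.2
options timeout:4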

3. Run the following script on your CC box:

require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--' # e.g. 'uaa.run.pivotal.io'
UAA_IP = '--CHANGE-ME-TOO--' # e.g. '52.21.135.158'

def dur(start_time, end_time)
"#{(1000*(end_time-start_time)).round} ms"
end

1.step do |i|
ip_start = Time.now
s = TCPSocket.open(UAA_IP, 80)
ip_open = Time.now
s.close
ip_close = Time.now

dns_start = Time.now
s = TCPSocket.open(UAA_DOMAIN, 80)
dns_open = Time.now
s.close
dns_close = Time.now

ip_open_dur = dur(ip_start, ip_open)
ip_close_dur = dur(ip_open, ip_close)
dns_open_dur = dur(dns_start, dns_open)
dns_close_dur = dur(dns_open, dns_close)

puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close:
#{ip_close_dur} | dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"
end

You will need to first nslookup (or otherwise determine) the IP that the
UAA_DOMAIN resolves to (it will be some load balancer, possibly the
gorouter, ha_proxy, or your own upstream LB)

4. Grab the files in /var/vcap/sys/log/consul_agent/

Cheers,
Amit

On Fri, Oct 30, 2015 at 4:29 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Here's the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

The time all between opening connection and opened, with the corresponding
ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much ruby, so that's as far I drilled down.

-Matt


Matt Cholick
 

Here's the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

The time all between opening connection and opened, with the corresponding
ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much ruby, so that's as far I drilled down.

-Matt


Amit Kumar Gupta
 

Ah, my bad. We need to patch the logger to somehow include timestamps when
the net/http library calls << on it instead of calling info:

require 'uri'
require 'net/http'
require 'logger'

SYSTEM_DOMAIN = '--CHANGE-ME--'

u = URI.parse('http://uaa.' + SYSTEM_DOMAIN + '/login')
h = Net::HTTP.new(u.host, u.port)
l = Logger.new('/var/vcap/data/tmp/slow-dns.log')
def l.<<(msg); info(msg); end  # net/http's debug output calls <<; route it through #info so lines get timestamps
h.set_debug_output(l)

1.step do |i|
  l.info('Request number: %04d' % i)
  s = Time.now
  r = h.head(u.path)
  d = Time.now - s
  l.info('Duration: %dms' % (d * 1000).round)
  l.info('Response code: %d' % r.code)
  l.error('!!! SLOW !!!') if d > 5
end
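
For context (a sketch, not from the thread): Logger#<< writes the raw message
with no timestamp or severity, which is exactly why the singleton override
above routes it through #info:

require 'logger'

l = Logger.new($stdout)
l << "raw line\n"         # written as-is: no timestamp, no severity
l.info('formatted line')  # e.g. "I, [2015-10-30T17:51:00.123456 #1234]  INFO -- : formatted line"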

On Fri, Oct 30, 2015 at 7:35 AM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
Here are the results:

https://gist.github.com/cholick/b448df07e9e493369d9e

The before and after pictures look pretty similar; nothing jumps out as
interesting.

On Thu, Oct 29, 2015 at 11:28 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Matt, that's awesome, thanks! Mind trying this?

require 'uri'
require 'net/http'
require 'logger'

SYSTEM_DOMAIN = '--CHANGE-ME--'

u = URI.parse('http://uaa.' + SYSTEM_DOMAIN + '/login')
h = Net::HTTP.new(u.host, u.port)
l = Logger.new('/var/vcap/data/tmp/slow-dns.log')
h.set_debug_output(l)

1.step do |i|
  l.info('Request number: %04d' % i)
  s = Time.now
  r = h.head(u.path)
  d = Time.now - s
  l.info('Duration: %dms' % (d * 1000).round)
  l.info('Response code: %d' % r.code)
  l.error('!!! SLOW !!!') if d > 5
end

I'd want to know what we see in /var/vcap/data/tmp/slow-dns.log before
and after the DNS slowdown. By having the http object take a debug logger,
we can narrow down what Ruby is doing that's making it uniquely slow.


On Thu, Oct 29, 2015 at 7:39 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
Here's a run with the problem manifesting:

...
00248 [200]: ruby 26ms | curl 33ms | nslookup 21ms
00249 [200]: ruby 20ms | curl 32ms | nslookup 14ms
00250 [200]: ruby 18ms | curl 30ms | nslookup 17ms
00251 [200]: ruby 22ms | curl 31ms | nslookup 16ms
00252 [200]: ruby 23ms | curl 30ms | nslookup 16ms
00253 [200]: ruby 26ms | curl 40ms | nslookup 16ms
00254 [200]: ruby 20ms | curl 40ms | nslookup 14ms
00255 [200]: ruby 20ms | curl 35ms | nslookup 20ms
00256 [200]: ruby 17ms | curl 32ms | nslookup 14ms
00257 [200]: ruby 20ms | curl 37ms | nslookup 14ms
00258 [200]: ruby 25ms | curl 1038ms | nslookup 14ms
00259 [200]: ruby 27ms | curl 37ms | nslookup 13ms
00260 [200]: ruby 4020ms | curl 32ms | nslookup 16ms
00261 [200]: ruby 5032ms | curl 45ms | nslookup 14ms
00262 [200]: ruby 5021ms | curl 30ms | nslookup 14ms
00263 [200]: ruby 5027ms | curl 32ms | nslookup 16ms
00264 [200]: ruby 5025ms | curl 34ms | nslookup 15ms
00265 [200]: ruby 5029ms | curl 31ms | nslookup 14ms
00266 [200]: ruby 5030ms | curl 37ms | nslookup 18ms
00267 [200]: ruby 5022ms | curl 43ms | nslookup 14ms
00268 [200]: ruby 5026ms | curl 31ms | nslookup 17ms
00269 [200]: ruby 5027ms | curl 33ms | nslookup 14ms
00270 [200]: ruby 5025ms | curl 32ms | nslookup 14ms
00271 [200]: ruby 5022ms | curl 36ms | nslookup 15ms
00272 [200]: ruby 5030ms | curl 32ms | nslookup 13ms
00273 [200]: ruby 5024ms | curl 32ms | nslookup 13ms
00274 [200]: ruby 5028ms | curl 34ms | nslookup 14ms
00275 [200]: ruby 5048ms | curl 30ms | nslookup 14ms


It's definitely interesting that Ruby is the only one to manifest the
problem.
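
One plausible follow-up (a sketch, not from the thread; the domain is a
placeholder) is to rerun the same kind of loop with resolv-replace loaded,
which swaps TCPSocket's hostname lookups over to the pure-Ruby Resolv
library, to see whether the Ruby-only slowness follows the resolver path:

require 'resolv-replace'  # patches TCPSocket and friends to resolve hostnames via Resolv instead of getaddrinfo
require 'net/http'
require 'uri'

uri = URI.parse('http://uaa.example.com/login')  # placeholder; use your own system domain

1.step do |i|
  start = Time.now
  code = Net::HTTP.get_response(uri).code
  puts "#{'%04d' % i} [#{code}] #{((Time.now - start) * 1000).round}ms"
end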

And here's the consul output:
https://gist.github.com/cholick/f7e91fb58891cc0d8f5a


On Thu, Oct 29, 2015 at 4:27 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey Matt,

Dieu's suggestion will fix your problem (you'll have to make the change
on all CC's), although it'll get undone on each redeploy. We do want to
find the root cause, but have not been able to reproduce it in our own
environments. If you're up for some investigation, may I suggest the
following:

* Run the following variation of your script on one of the CCs:

require 'uri'
require 'net/http'

SYSTEM_DOMAIN = '--CHANGE-ME--'

uaa_domain = "uaa.#{SYSTEM_DOMAIN}"
login_url = "https://#{uaa_domain}/login"

curl_command = "curl -f #{login_url} 2>&1"
nslookup_command = "nslookup #{uaa_domain} 2>&1"

puts 'STARTING SANITY CHECK'
curl_output = `#{curl_command}`
raise "'#{curl_command}' failed with output:\n#{curl_output}" unless $?.to_i.zero?
puts 'SANITY CHECK PASSED'

def duration_string(start)
  "#{((Time.now - start) * 1000).round}ms"
end

puts 'STARTING TEST'
1.step do |i|
  uri = URI.parse(login_url)
  ruby_start = Time.now
  ruby_response = Net::HTTP.get_response(uri)
  ruby_duration = duration_string(ruby_start)

  curl_start = Time.now
  `#{curl_command}`
  curl_duration = duration_string(curl_start)

  nslookup_start = Time.now
  `#{nslookup_command}`
  nslookup_duration = duration_string(nslookup_start)

  puts "#{"%05d" % i} [#{ruby_response.code}]: ruby #{ruby_duration} | curl #{curl_duration} | nslookup #{nslookup_duration}"
end

* Send a kill -QUIT <consul_agent_pid> to the consul agent process once
you see the slow DNS manifest itself; you will get a dump of all the
goroutines running in the consul agent process in
/var/vcap/sys/log/consul_agent/consul_agent.stderr.log. I would be curious
to see what it spits out.
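
Concretely, that might look something like the following (illustrative only;
the pidfile path is an assumption based on a typical BOSH job layout, not
something confirmed in this thread):

kill -QUIT "$(cat /var/vcap/sys/run/consul_agent/consul_agent.pid)"
# or, lacking a pidfile:
kill -QUIT "$(pgrep -f 'consul agent')"
# the goroutine dump then shows up in the consul_agent.stderr.log named above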

Amit


On Wed, Oct 28, 2015 at 6:10 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Thanks for taking a look; fingers crossed you can see it happen as
well.

Our 217 install is on stemcell 3026 and our 212 install is on 2989.

IaaS is CenturyLink Cloud.

-Matt

On Wed, Oct 28, 2015 at 6:08 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I got up to 10k on an AWS deployment of HEAD of cf-release with ruby
2.2, then started another loop on the same box with ruby 2.1. In the end,
they got up to 40-50k without showing any signs of change. I had to switch
to resolving the UAA endpoint; eventually Google started responding with
302s.

I'm going to try with a cf-release 212 deployment on my bosh lite,
but eventually I want to try on the same stemcell as you're using.

On Wed, Oct 28, 2015 at 5:01 PM, Amit Gupta <agupta(a)pivotal.io>
wrote:

Thanks Matt, this is awesome.

I'm trying to reproduce this with your script; I'm up at 10k with no
change. I'm also shelling out to curl in the script, to see if both curl
and ruby get affected and, if so, whether they're affected at the same time.

What IaaS and stemcell are you using?

Thanks,
Amit

On Wed, Oct 28, 2015 at 2:54 PM, Dieu Cao <dcao(a)pivotal.io> wrote:

You might try moving the nameserver entry for the consul_agent in
/etc/resolv.conf on the cloud controller to the end to see if that helps.

-Dieu

On Wed, Oct 28, 2015 at 12:55 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Looks like you're right and we're experiencing the same issue as
you are, Amit. We're suffering slow DNS lookups. The code is spending all of
its time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb.initialize:879

I've experimented some with the environment and, after narrowing
things down to DNS, here's a minimal script demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
while true do
beginning_time = Time.now
response = Net::HTTP.get_response(uri)

end_time = Time.now
i+=1
puts "#{"%04d" % i} Response: [#{response.code}], Elapsed: #{((end_time - beginning_time)*1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some
point, requests start taking 5 seconds longer, which I assume is a timeout.
One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs are the same. How many requests it takes before things
time out varies considerably (one run started in the 10s and another took
20k requests), but it always happens. After that, lookups take an
additional 5 seconds and never recover to their initial speed. This is why
restarting the cloud controller fixes the issue (temporarily).

The really slow cli calls (in the 1+ min range) are simply due to
the amount of paging that fetching data for a large org does, as that 5
seconds is multiplied out over several calls. Every user is feeling this
delay; it's just that it's only unworkable when pulling the large datasets
from UAA.

I was not able to reproduce the timeouts using a script calling "dig"
against localhost, only from inside Ruby code.
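
A sketch (not from the thread) of doing the equivalent lookup from inside
Ruby, pointed directly at the local agent; the domain is a placeholder and
the port assumes the agent is listening on 53:

require 'resolv'

dns = Resolv::DNS.new(nameserver_port: [['127.0.0.1', 53]])
loop do
  start = Time.now
  begin
    dns.getaddress('uaa.example.com')
  rescue Resolv::ResolvError
    # no answer is fine here; we only care how long the query took
  end
  puts "#{((Time.now - start) * 1000).round} ms"
end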

To reiterate our setup: we're running 212 without a consul
server, just the agents. I also successfully reproduced this problem in a
completely different 217 install in a different datacenter. That setup also
didn't have an actual consul server, just the agent. I don't see anything
in the release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real
headaches for user management in our larger orgs. Amit: is there a bug we
can follow?

-Matt


On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

You may not be running any consul servers, but you may have a
consul agent colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Zach & Swetha,
Thanks for the suggestion, will gather netstat info there next
time.

Amit,
The 1:20 delay is due to paging; the total call length for each page
is closer to 10s. I just included those two calls, with the paging done by
the cf command line, in the numbers to demonstrate the dramatic difference
after a restart. Delays disappear after a restart. We're not running consul
yet, so it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

We've seen issues in some environments where requests to cc
that involve cc making a request to uaa or hm9k have a 5s delay while the
local consul agent fails to resolve the DNS for uaa/hm9k, before moving on
to a different resolver.

The expected behavior, observed in almost all environments, is
that the DNS request to the consul agent fails fast and moves on to the next
resolver; we haven't figured out why a couple of envs exhibit different
behavior. The impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't
explain your 1:20 delay though. Are you always seeing delays that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <
zrobinson(a)pivotal.io> wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha

