Configuring the consul agent to recurse does solve the issue.
The consul agent's configuration is directory based: it loads every json
file in its config directory. Since our deployment already consumes an
internal release with our components in addition to cf, I've added a
template to that release which copies an additional json file (with a
configuration for
https://www.consul.io/docs/agent/options.html#recursors)
into the agent's configuration directory. It's a hack, but it gets us
past having to build our own cf-release just to customize the
consul_agent job's templates.
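For reference, the dropped-in file is just plain consul agent JSON; roughly
something like this, where the addresses are placeholders for whatever
upstream resolvers the environment actually uses:

    {
      "recursors": ["10.0.0.2", "10.0.0.3"]
    }

The agent merges every file in its config directory, so this rides alongside
whatever the job template already renders.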
Would you folks accept a pull request that adds "recursors" as a
configuration option to
https://github.com/cloudfoundry-incubator/consul-release/blob/master/jobs/consul_agent/templates/config.json.erb?
Or is this work you already have in progress, since you mentioned exploring
having consul recurse? Or were you thinking of solving this more
dynamically and consuming the existing contents of resolv.conf as the
upstream recursors?
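If it helps, the change I have in mind is small; as a rough sketch (the
property name consul.agent.dns_config.recursors is just a guess on my part,
not something that exists in the release today), the template could render:

    "recursors": <%= p("consul.agent.dns_config.recursors", []).to_json %>

alongside the existing keys in config.json.erb, with a matching entry and an
empty-list default in the job spec.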
Thanks again for all the help trying to debug this one.
On Thu, Nov 19, 2015 at 8:54 AM, Matt Cholick <cholick(a)gmail.com> wrote:
Sure. Here's the Ruby:
dns_open: 8 ms | dns_close: 0 ms
dns_open: 9 ms | dns_close: 0 ms
dns_open: 10 ms | dns_close: 0 ms
dns_open: 3013 ms | dns_close: 0 ms
dns_open: 5015 ms | dns_close: 0 ms
dns_open: 5011 ms | dns_close: 0 ms
dns_open: 5014 ms | dns_close: 0 ms
dns_open: 5015 ms | dns_close: 0 ms
dns_open: 5012 ms | dns_close: 0 ms
dns_open: 5015 ms | dns_close: 0 ms
https://gist.github.com/cholick/f8bab2d9752728671b94
And the Go:
dns_open: 16ms | dns_close: 0ms
dns_open: 11ms | dns_close: 0ms
dns_open: 19ms | dns_close: 0ms
dns_open: 12ms | dns_close: 0ms
dns_open: 16ms | dns_close: 0ms
dns_open: 3016ms | dns_close: 0ms
dns_open: 15ms | dns_close: 0ms
dns_open: 5033ms | dns_close: 0ms
dns_open: 11ms | dns_close: 0ms
dns_open: 5018ms | dns_close: 0ms
dns_open: 5017ms | dns_close: 0ms
dns_open: 12ms | dns_close: 0ms
dns_open: 5017ms | dns_close: 0ms
dns_open: 13ms | dns_close: 0ms
dns_open: 5015ms | dns_close: 0ms
https://gist.github.com/cholick/7d4e01dcf99588d13016
On Wed, Nov 18, 2015 at 10:10 PM, Amit Gupta <agupta(a)pivotal.io> wrote:
Thanks Matt, that's super interesting. It would be awesome to get the
strace from the Ruby thing, as well as another strace from the Go thing, but
this time starting from before it goes slow. I think there's some sort of
weird thing that happens once, and after it does, the presence of
Consul (or any other always-SERVFAIL server, e.g. Dmitriy's little Ruby DNS
server) makes all subsequent requests slow. So I'd like to see
strace from both the Ruby and the Go, with enough before and after the
slowness starts to establish the same pattern of behaviour.
The alternating slow/fast stuff with Go is interesting. No clue yet
what that could be; maybe multiple goroutines?
On Wed, Nov 18, 2015 at 4:13 PM, Matt Cholick <cholick(a)gmail.com> wrote:
So things got a little more interesting yesterday. We have some custom
Golang jobs for auditing and some other purposes. A recent change added the
consul_agent template to the box with these jobs, and I saw similar
behavior: requests where DNS resolution consistently took 5+ seconds, fixable by
restarting the process. I wrote a small Golang app to try and reproduce it.
I got similar, but not identical, behavior in the test app.
package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    for {
        startOpen := time.Now()
        // Dial covers DNS resolution plus the TCP connect.
        conn, err := net.Dial("tcp", "api.useast.appfog.qa.ctl.io:http")
        if err != nil {
            fmt.Printf("%#v", err)
        } else {
            endOpen := time.Now()
            conn.Close()
            endClose := time.Now()
            // Durations are nanoseconds; divide by 1e6 for milliseconds.
            openDur := endOpen.Sub(startOpen) / 1e6
            closeDur := endClose.Sub(endOpen) / 1e6
            fmt.Printf("dns_open: %dms | dns_close: %dms\n", openDur, closeDur)
        }
    }
}
After some time, I saw:
dns_open: 14ms | dns_close: 0ms
dns_open: 5019ms | dns_close: 1ms
dns_open: 25ms | dns_close: 1ms
dns_open: 27ms | dns_close: 1ms
dns_open: 20ms | dns_close: 0ms
dns_open: 5015ms | dns_close: 0ms
dns_open: 14ms | dns_close: 0ms
dns_open: 5018ms | dns_close: 1ms
dns_open: 15ms | dns_close: 0ms
dns_open: 5018ms | dns_close: 1ms
dns_open: 19ms | dns_close: 0ms
dns_open: 5046ms | dns_close: 0ms
To reiterate, in the actual app it seemed to affect every single
request. The test app settled into a similar mode where roughly every other
request was slow. Here's the strace for that section:
https://gist.github.com/cholick/428df93c112067f50558
On some other runs the calls alternated slow and fast quite consistently
(~300 calls in a row alternating slow/fast). Just like the Ruby,
once things get into an unhappy state they stay that way until the process
is restarted (one process ran for 12 hours and never recovered).
I can capture a Ruby strace too if you'd like, but I figured this was
more useful since it adds new information to the mix.
These Golang jobs have run in production for months without showing this
behavior, yet within a few hours of including the consul template, the
issue manifested.