Thanks Matt, that's super interesting. It would be awesome to get the
strace from the Ruby thing, as well another strace from the Go thing but
this time starting from before it goes slow. I think there's some sort of
weird thing that happens once, and then after that happens, the fact that
Consul (or any other always-SERVFAIL server, e.g. Dmitriy's little Ruby DNS
server) is present makes all subsequent requests slow. So I'd like to see
strace from both the Ruby and the Go, with enough before and after the
slowness starts to establish the same pattern of behaviour.
The alternating slow/fast stuff with Go is interesting. Not clue yet what
that could be, maybe multiple goroutines?
toggle quoted message
Show quoted text
On Wed, Nov 18, 2015 at 4:13 PM, Matt Cholick <cholick(a)gmail.com> wrote:
So things got a little more interesting yesterday. We have some custom
Golang jobs for auditing and some other purposes. A recent change added the
consult_agent template to the box with these jobs and I saw similar
behavior: requests with dns resolution taking +5s consistently, fixable by
restarting the process. I wrote a small Golang app to try and reproduce it.
I got similar, but not identical behavior in the test app.
package main
import (
"net"
"fmt"
"time"
)
func main() {
for {
startOpen := time.Now()
conn, err := net.Dial("tcp", "api.useast.appfog.qa.ctl.io:http")
if err != nil {
fmt.Printf("%#v", err)
} else {
endOpen := time.Now()
conn.Close()
endClose := time.Now()
openDur := endOpen.Sub(startOpen) / 1e6
closeDur := endClose.Sub(endOpen) / 1e6
fmt.Printf("dns_open: %dms | dns_close: %dms\n", openDur, closeDur)
}
}
}
After some time, I saw:
dns_open: 14ms | dns_close: 0ms
dns_open: 5019ms | dns_close: 1ms
dns_open: 25ms | dns_close: 1ms
dns_open: 27ms | dns_close: 1ms
dns_open: 20ms | dns_close: 0ms
dns_open: 5015ms | dns_close: 0ms
dns_open: 14ms | dns_close: 0ms
dns_open: 5018ms | dns_close: 1ms
dns_open: 15ms | dns_close: 0ms
dns_open: 5018ms | dns_close: 1ms
dns_open: 19ms | dns_close: 0ms
dns_open: 5046ms | dns_close: 0ms
To reiterate, the behavior in the actual app seemed to be every single
request. The test app went into a similar mode where a roughly every other
request failed. Here's the strace for that section:
https://gist.github.com/cholick/428df93c112067f50558
On some other runs it was quite consistently alternating slow and fast
calls (for ~300 calls in a row alternating slow/fast). Just like the Ruby,
once things get into an unhappy state they stay that way until restarting
the process. (The process ran for 12 hours and never recovered).
I can capture a Ruby strace too if you'd like, but I figured this was more
useful as it added more novel information to the mix.
These Golang jobs have run in production for months without showing this
behavior and, within a few hours of including the consul template, the
issue manifested.