Re: cloud_controller_ng performance degrades slowly over time

Matt Cholick

So things got a little more interesting yesterday. We have some custom
Golang jobs for auditing and some other purposes. A recent change added the
consul_agent template to the box running these jobs, and I saw similar
behavior: requests involving DNS resolution consistently taking 5+ seconds,
fixable by restarting the process. I wrote a small Golang app to try to
reproduce it, and got similar, though not identical, behavior in the test app.

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    for {
        startOpen := time.Now()
        conn, err := net.Dial("tcp", "api.useast.appfog.qa.ctl.io:http")
        if err != nil {
            fmt.Printf("%#v", err)
        } else {
            endOpen := time.Now()
            conn.Close()
            endClose := time.Now()

            // Durations in milliseconds: "open" covers DNS resolution plus the
            // TCP connect, "close" covers tearing the connection back down.
            openDur := endOpen.Sub(startOpen) / time.Millisecond
            closeDur := endClose.Sub(endOpen) / time.Millisecond

            fmt.Printf("dns_open: %dms | dns_close: %dms\n", openDur, closeDur)
        }
    }
}

After some time, I saw:
dns_open: 14ms | dns_close: 0ms
dns_open: 5019ms | dns_close: 1ms
dns_open: 25ms | dns_close: 1ms
dns_open: 27ms | dns_close: 1ms
dns_open: 20ms | dns_close: 0ms
dns_open: 5015ms | dns_close: 0ms
dns_open: 14ms | dns_close: 0ms
dns_open: 5018ms | dns_close: 1ms
dns_open: 15ms | dns_close: 0ms
dns_open: 5018ms | dns_close: 1ms
dns_open: 19ms | dns_close: 0ms
dns_open: 5046ms | dns_close: 0ms
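
In case it helps narrow things down, a variant that does the lookup and the
dial separately should show whether the 5s sits in resolution itself or in the
TCP connect. Just a sketch (same host as above; it connects to port 80 on the
first address returned):

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    const host = "api.useast.appfog.qa.ctl.io"
    for {
        // Time the DNS lookup on its own.
        startLookup := time.Now()
        addrs, err := net.LookupHost(host)
        lookupDur := time.Since(startLookup) / time.Millisecond
        if err != nil || len(addrs) == 0 {
            fmt.Printf("lookup error: %v\n", err)
            continue
        }

        // Time the TCP connect against the already-resolved address.
        startDial := time.Now()
        conn, err := net.DialTimeout("tcp", net.JoinHostPort(addrs[0], "80"), 10*time.Second)
        dialDur := time.Since(startDial) / time.Millisecond
        if err != nil {
            fmt.Printf("dial error: %v\n", err)
            continue
        }
        conn.Close()

        fmt.Printf("lookup: %dms | connect: %dms\n", lookupDur, dialDur)
    }
}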

To reiterate, in the actual app the slowdown seemed to hit every single
request. The test app went into a similar mode where roughly every other
request hit the 5s delay. Here's the strace for that section:

https://gist.github.com/cholick/428df93c112067f50558

On some other runs the slow and fast calls alternated quite consistently
(~300 calls in a row alternating slow/fast). Just like the Ruby process, once
things get into an unhappy state they stay that way until the process is
restarted (this one ran for 12 hours and never recovered).
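
One more thing I may try the next time it's in this state: Go can resolve
names either through its pure-Go resolver or through cgo/libc, and the two
take different code paths through resolv.conf. Running the test app both ways
(binary name below is just a placeholder) would show whether the slow lookups
follow one resolver path or both. I haven't confirmed it changes anything here:

GODEBUG=netdns=go ./dnstest    # force the pure Go resolver
GODEBUG=netdns=cgo ./dnstest   # force the cgo/libc resolver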

I can capture a Ruby strace too if you'd like, but I figured this was more
useful as it added more novel information to the mix.

These Golang jobs have run in production for months without showing this
behavior and, within a few hours of including the consul template, the
issue manifested.
