Re: cloud_controller_ng performance degrades slowly over time
Matt Cholick
So things got a little more interesting yesterday. We have some custom
Golang jobs for auditing and some other purposes. A recent change added the consult_agent template to the box with these jobs and I saw similar behavior: requests with dns resolution taking +5s consistently, fixable by restarting the process. I wrote a small Golang app to try and reproduce it. I got similar, but not identical behavior in the test app. package main import ( "net" "fmt" "time" ) func main() { for { startOpen := time.Now() conn, err := net.Dial("tcp", "api.useast.appfog.qa.ctl.io:http") if err != nil { fmt.Printf("%#v", err) } else { endOpen := time.Now() conn.Close() endClose := time.Now() openDur := endOpen.Sub(startOpen) / 1e6 closeDur := endClose.Sub(endOpen) / 1e6 fmt.Printf("dns_open: %dms | dns_close: %dms\n", openDur, closeDur) } } } After some time, I saw: dns_open: 14ms | dns_close: 0ms dns_open: 5019ms | dns_close: 1ms dns_open: 25ms | dns_close: 1ms dns_open: 27ms | dns_close: 1ms dns_open: 20ms | dns_close: 0ms dns_open: 5015ms | dns_close: 0ms dns_open: 14ms | dns_close: 0ms dns_open: 5018ms | dns_close: 1ms dns_open: 15ms | dns_close: 0ms dns_open: 5018ms | dns_close: 1ms dns_open: 19ms | dns_close: 0ms dns_open: 5046ms | dns_close: 0ms To reiterate, the behavior in the actual app seemed to be every single request. The test app went into a similar mode where a roughly every other request failed. Here's the strace for that section: https://gist.github.com/cholick/428df93c112067f50558 On some other runs it was quite consistently alternating slow and fast calls (for ~300 calls in a row alternating slow/fast). Just like the Ruby, once things get into an unhappy state they stay that way until restarting the process. (The process ran for 12 hours and never recovered). I can capture a Ruby strace too if you'd like, but I figured this was more useful as it added more novel information to the mix. These Golang jobs have run in production for months without showing this behavior and, within a few hours of including the consul template, the issue manifested. |
|