Re: cloud_controller_ng performance degrades slowly over time

Amit Kumar Gupta

Heya Matt,

I keep thinking about this puzzle every couple days to see if I can get any
more insight. It would be great to understand the root cause; in the mean
time we are going to try and setup Consul to recurse to other nameservers
(instead of just letting the system fall to the next namserver in
/etc/resolv.conf), and hopefully that'll make some of the weird Consul
behaviour we've seen in different environments better.

But back to this one, what I'm seeing is that there's a bunch of fast
iterations, then one weird one where it takes a few seconds, and then all
subsequent iterations are slow and hit the 5s timeout. In the fast
iteration, we see a sendmmsg sending multiple (2) messages, followed by 2
poll/ioctl/recvfroms. Each fast iteration does this first for,

The first bad iteration, that takes about 3s, does the sendmmsg, and the 2
poll/ioctl/recvfroms for just fine. Then after doing the
sendmmsg to, it gets a recv's a response for one of the messages,
but *times out (3s) polling for a response to the other message*. It then
appears to retry talking to, this time, and *all subsequent calls,
using sendto instead of sendmmsg* to send one message at a time:
sendto/poll/ioctl/recvfrom, sendto/poll/ioctl/recvfrom.

All subsequent iterations only use sendto. And now it's the first *sendto
to that times out (5s)*. It immediately goes on to after
that first sendto to fails.

Some questions are: (1) why doesn't the response to one of the sendmmsg
messages arrive? (2) why do all system calls switch from sendmmsg to sendto
after that first failure? (3) why do requests to in the form of
separate sendto calls timeout on the first sendto call, whereas when sent
simultaneously as two messages in sendmmsg, it seems to work fine?

I put these findings together and asked on
One suggestion I saw, which could shed a bit more light is to *call strace
with the additional -r -T* flags to get relative time between syscalls, and
the time spent within each syscall.

Care to give that a try when you have some time?


On Wed, Nov 4, 2015 at 1:54 PM, Matt Cholick <cholick(a)> wrote:

Gotcha. Yeah, the rescue lets that test run; after 425k lookups, it never
got slow.

Here's a bit of the strace:

On Wed, Nov 4, 2015 at 11:59 AM, Amit Gupta <agupta(a)> wrote:

Hey Matt,

I wanted to keep using the uaa.SYSTEM_DOMAIN domain, not the internal
domain, for that experiment. I do expect the to fail when
talking to, what I wanted to know is, in the presence of no other
nameservers, does it eventually start to fail slow again, or does this
behaviour happen only when there are other nameservers. I imagine the is blowing up on the first iteration in the loop and exiting
the script? My bad, can you replace:"--UAA-DOMAIN--", 80).close

with"--UAA-DOMAIN--", 80).close rescue nil

for the experiment with only listed amongst the nameservers?

Yes, something about the move from the first to second nameserver seems
weird. I have seen strace of one case where it times out polling the FD of
the socket it opened to talk to, but in one of your straces it
looked like the poll timeout was on polling the FD for the socket for The fact that the problem persists is interesting too, it seems
like it's not just a one-off race condition where someone messed up with FD
it was supposed to be polling.


On Wed, Nov 4, 2015 at 11:41 AM, Matt Cholick <cholick(a)> wrote:

Ah, I misunderstood.

Consul isn't configured as a recursive resolver, so for a test with only
the in resolve.conf I changed the url in the ruby loop to
"", which is what uaa is registering for in consul.

I ran through 225k lookups and it never got slow. Here's a bit of the

Bother versions of that test definitely pointing to the move from the
first to the second nameserver in ruby, when the first nameserver doesn't
know the address.

On Tue, Nov 3, 2015 at 11:43 PM, Amit Gupta <agupta(a)> wrote:

I looked at the strace, I see you did indeed mean "loop without
resolver on localhost". If you try it with *only* a resolver on localhost,
do you get the eventually consistent DNS slowdown?

On Tue, Nov 3, 2015 at 8:33 PM, Amit Gupta <agupta(a)> wrote:

Thanks Matt!

When you say "the loop without the resolver on local host" did you
mean "the loop with only a resolver on local host"? Sorry if my setup
wasn't clear, but my intention was to only have in

On Tuesday, November 3, 2015, Matt Cholick <cholick(a)> wrote:

Here are the results of the ruby loop with strace:

As expected, things eventually get slow. The bash version of the loop
with a new vm each time didn't get slow.

For the loop without a resolver on localhost, it never did get slow.
Though it's hard to prove with something so inconsistent, it hadn't
happened after 100k requests. Here's some of the strace:

On the final loop, with the SERVFAIL resolver, the issue did
manifest. Here's the trace of that run:

Thanks for digging in on this.

On Mon, Nov 2, 2015 at 6:53 PM, Amit Gupta <agupta(a)> wrote:

Okay, interesting, hopefully we're narrowing in on something.
There's a couple variables I'd like to eliminate, so I wonder if you could
try the following. Also, feel free at any point to let me know if you are
not interesting in digging further.

Try all things as sudo, on one of the CCs.

1. It appears that the problem goes away when the CC process is
restarted, so it feels as though there's some sort of resource that the
ruby process is not able to GC, leading to this problem to show up
eventually, and then go away when restarted. I want to confirm this by
trying two different loops, one where the loop is in bash, spinning up a
new ruby process each time, and one where the loop is in ruby.

* bash loop:

while true; do time /var/vcap/packages/ruby-VERSION/bin/ruby
-r'net/protocol' -e '"--UAA-DOMAIN--", 80).close'; done

* ruby loop

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step
do |i|; t =;"--UAA-DOMAIN--", 80).close; puts
"#{i}: #{(1000*( - t)).round}ms"; end'

For each loop, it might also be useful to run `strace -f -p PID >
SOME_FILE` to see what system calls are going on before and after.

2. Another variable is the interaction with the other nameservers.
For this experiment, I would do `monit stop all` to take one of
your CC's out of commission, so that the router doesn't load balance to it,
because it will likely fail requests given the following changes:

* monit stop all && watch monit summary # wait for all the
processes to be stopped, then ctrl+c to stop the watch
* monit start consul_agent && watch monit summary # wait for
consul_agent to be running, then ctrl+c to stop the watch
* Remove nameservers other than from /etc/resolv.conf
* Run the "ruby loop", and see if it still eventually gets slow
* When it's all done, put the original nameservers back in
/etc/resolv.conf, and `monit restart all`

Again, strace-ing the ruby loop would be interesting here.

3. Finally, consul itself. Dmitriy (BOSH PM) has a little DNS
resolver that can be run instead of consul, that will always SERVFAIL (same
as what you see from consul when you nslookup something), so we can try

* Modify `/var/vcap/bosh/etc/gemrc` to remove the `--local` flag
* Run `gem install rubydns`
* Dump the following into a file, say `/var/vcap/data/tmp/dns.rb`:

#!/usr/bin/env ruby

require "rubydns"

RubyDNS.run_server(listen: [[:udp, "", 53], [:tcp, "",
53]]) do
otherwise do |transaction|!(:ServFail)

* monit stop all && watch monit summary # and again, wait for
everything to be stopped
* Run it with `ruby /var/vcap/data/tmp/dns.rb`. Note that this
command, and the previous `gem install`, use the system gem/ruby,
not the ruby package used by CC, so it maintains some separation. When
running this, it will spit out logs to the terminal, so one can keep an eye
on what it's doing, make sure it all looks reasonable
* Make sure the original nameservers are back in the
`/etc/resolv.conf` (i.e. ensure this experiment is independent of the
previous experiment).
* Run the "ruby loop" (in a separate shell session on the CC)
* After it's all done, add back `--local` to `
/var/vcap/bosh/etc/gemrc`, and `monit restart all`

Again, run strace on the ruby process.

What I hope we find out is that (1) only the ruby loop is affected,
so it has something to do with long running ruby processes, (2) the problem
is independent of the other nameservers listed in /etc/resolv.conf,
and (3) the problem remains when running Dmitriy's DNS-FAILSERVer instead
of consul on, to determine that the problem is not
specific to consul.

On Sun, Nov 1, 2015 at 5:18 PM, Matt Cholick <cholick(a)>

It looks like consul isn't configured as a recursive resolver. When
running the above code, resolving fails on the first nameserver and the
script fails. resolv-replace's is different from the code
http.rb (and thus api) is using. http.rb is pulling in 'net/protocol'. I
changed the script, replacing the require for 'resolv-replace' to
'net/protocol' to match the cloud controller.


3286 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4 ms |
dns_close: 0 ms
3287 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms |
dns_close: 0 ms
3288 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 6 ms |
dns_close: 0 ms
3289 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms |
dns_close: 0 ms
3290 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms |
dns_close: 0 ms
3291 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms |
dns_close: 0 ms
3292 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms |
dns_close: 0 ms
3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms |
dns_close: 0 ms
3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms |
dns_close: 0 ms
3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3296 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3297 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4006 ms |
dns_close: 0 ms
3298 -- ip_open: 2 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3299 -- ip_open: 3 ms | ip_close: 0 ms | dns_open: 4011 ms |
dns_close: 0 ms
3300 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3301 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4011 ms |
dns_close: 0 ms
3302 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms

And the consul logs, though there's nothing interesting there:

On Fri, Oct 30, 2015 at 5:51 PM, Amit Gupta <agupta(a)>

Yup, that's what I was suspecting. Can you try the following now:

1. Add something like the following to your cf manifest:

- name: cloud_controller_z1
log_level: debug

This will set the debug level for the consul agents on your CC job
to debug, so we might be able to see more for its logs. It only sets it on
the job that matters, so when you redeploy, it won't have to roll the whole
deployment. It's okay if you can't/don't want to do this, I'm not sure how
much you want to play around with your environment, but it could be helpful.

2. Add the following line to the bottom of your /etc/resolv.conf

options timeout:4

Let's see if the slow DNS is on the order of 4000ms now, to pin
down where the 5s is exactly coming from.

3. Run the following script on your CC box:

require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--' # e.g. ''
UAA_IP = '--CHANGE-ME-TOO--' # e.g. ''

def dur(start_time, end_time)
"#{(1000*(end_time-start_time)).round} ms"

1.step do |i|
ip_start =
s =, 80)
ip_open =
ip_close =

dns_start =
s =, 80)
dns_open =
dns_close =

ip_open_dur = dur(ip_start, ip_open)
ip_close_dur = dur(ip_open, ip_close)
dns_open_dur = dur(dns_start, dns_open)
dns_close_dur = dur(dns_open, dns_close)

puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close:
#{ip_close_dur} | dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"

You will need to first nslookup (or otherwise determine) the IP
that the UAA_DOMAIN resolves to (it will be some load balancer, possibly
the gorouter, ha_proxy, or your own upstream LB)

4. Grab the files in /var/vcap/sys/log/consul_agent/


On Fri, Oct 30, 2015 at 4:29 PM, Matt Cholick <cholick(a)>

Here's the results:

The time all between opening connection and opened, with the
corresponding ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {, conn_port, @local_host, @local_port)
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much ruby, so that's as far I drilled down.


Join to automatically receive all group messages.