Re: cloud_controller_ng performance degrades slowly over time


Amit Kumar Gupta

Okay, interesting, hopefully we're narrowing in on something. There are a
couple of variables I'd like to eliminate, so I wonder if you could try the
following. Also, feel free at any point to let me know if you are not
interested in digging further.

Try all things as sudo, on one of the CCs.

1. It appears that the problem goes away when the CC process is restarted,
so it feels as though there's some sort of resource that the ruby process
is not able to GC, causing the problem to show up eventually and then go
away on restart. I want to confirm this by trying two different loops: one
where the loop is in bash, spinning up a new ruby process each time, and
one where the loop is in ruby.

* bash loop:

while true; do time /var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e 'TCPSocket.open("--UAA-DOMAIN--", 80).close'; done

* ruby loop:

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step do |i|; t = Time.now; TCPSocket.open("--UAA-DOMAIN--", 80).close; puts "#{i}: #{(1000*(Time.now - t)).round}ms"; end'

For each loop, it might also be useful to run `strace -f -p PID -o SOME_FILE`
to capture what system calls are going on before and after the slowdown appears.
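
For the ruby loop, attaching to the already-running process might look
something like this (just a sketch; the pgrep pattern and output path are
made up, and it assumes the test loop is the only matching ruby process):

RUBY_PID=$(pgrep -f "net/protocol" | head -n1)                       # PID of the long-running test loop
strace -f -tt -o /var/vcap/data/tmp/ruby_loop.strace -p "$RUBY_PID"  # -f follows children, -tt timestamps each syscall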

2. Another variable is the interaction with the other nameservers. For
this experiment, I would do `monit stop all` to take one of your CCs out
of commission, so that the router doesn't load balance to it, because it
will likely fail requests given the following changes:

* monit stop all && watch monit summary # wait for all the processes to be
stopped, then ctrl+c to stop the watch
* monit start consul_agent && watch monit summary # wait for consul_agent
to be running, then ctrl+c to stop the watch
* Remove nameservers other than 127.0.0.1 from /etc/resolv.conf (see the
sketch after this list)
* Run the "ruby loop", and see if it still eventually gets slow
* When it's all done, put the original nameservers back in /etc/resolv.conf,
and `monit restart all`
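
For the resolv.conf step, something along these lines should work (a
sketch; it assumes 127.0.0.1 is the consul agent's entry and that you put
the backup back when the experiment is done):

cp /etc/resolv.conf /etc/resolv.conf.bak                       # keep a copy to restore later
grep -v '^nameserver' /etc/resolv.conf.bak > /etc/resolv.conf  # drop all nameserver lines
echo 'nameserver 127.0.0.1' >> /etc/resolv.conf                # keep only the local consul agent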

Again, strace-ing the ruby loop would be interesting here.

3. Finally, consul itself. Dmitriy (BOSH PM) has a little DNS resolver
that can be run in place of consul and that always answers SERVFAIL (the
same thing you see from consul when you nslookup something), so we can try that:

* Modify `/var/vcap/bosh/etc/gemrc` to remove the `--local` flag
* Run `gem install rubydns`
* Dump the following into a file, say `/var/vcap/data/tmp/dns.rb`:

#!/usr/bin/env ruby

require "rubydns"

RubyDNS.run_server(listen: [[:udp, "0.0.0.0", 53], [:tcp, "0.0.0.0", 53]]) do
  otherwise do |transaction|
    transaction.fail!(:ServFail)
  end
end

* monit stop all && watch monit summary # and again, wait for everything to
be stopped
* Run it with `ruby /var/vcap/data/tmp/dns.rb`. Note that this command,
and the previous `gem install`, use the system gem/ruby, not the ruby
package used by CC, so it maintains some separation. When running, it
will spit out logs to the terminal, so you can keep an eye on what it's
doing and make sure it all looks reasonable (there's also a quick sanity
check sketched after this list)
* Make sure the original nameservers are back in the `/etc/resolv.conf`
(i.e. ensure this experiment is independent of the previous experiment).
* Run the "ruby loop" (in a separate shell session on the CC)
* After it's all done, add back `--local` to `/var/vcap/bosh/etc/gemrc`,
and `monit restart all`
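
To sanity-check the stand-in resolver from another shell, querying it
directly for any name should fail the same way consul does today (a
sketch; substitute any hostname you like):

nslookup --UAA-DOMAIN-- 127.0.0.1
# expect a SERVFAIL-style failure, the same as you currently see from consul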

Again, run strace on the ruby process.

What I hope we find out is that (1) only the ruby loop is affected, so it
has something to do with long-running ruby processes, (2) the problem is
independent of the other nameservers listed in /etc/resolv.conf, and (3)
the problem remains when running Dmitriy's SERVFAIL-only DNS server instead
of consul on 127.0.0.1:53, showing that the problem is not specific to consul.

On Sun, Nov 1, 2015 at 5:18 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
It looks like consul isn't configured as a recursive resolver. When
running the above code, resolution fails on the first nameserver and the
script errors out. resolv-replace's TCPSocket.open is different from the
code http.rb (and thus the api) is using; http.rb pulls in 'net/protocol'.
I changed the script, replacing the require for 'resolv-replace' with
'net/protocol' to match the cloud controller.
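
(For comparison, the two resolution paths can also be timed one process at
a time from bash; this is just a sketch, with the ruby package path and
UAA domain left as placeholders:)

RUBY=/var/vcap/packages/ruby-VERSION/bin/ruby
# resolv-replace patches TCPSocket to resolve names with Ruby's pure-ruby Resolv library
time $RUBY -r'resolv-replace' -e 'TCPSocket.open("--UAA-DOMAIN--", 80).close'
# net/protocol (what http.rb pulls in) leaves resolution to the C-level getaddrinfo
time $RUBY -r'net/protocol' -e 'TCPSocket.open("--UAA-DOMAIN--", 80).close'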

Results:

3286 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4 ms | dns_close: 0 ms
3287 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3288 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 6 ms | dns_close: 0 ms
3289 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3290 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3291 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3292 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms | dns_close: 0 ms
3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3296 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3297 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4006 ms | dns_close: 0 ms
3298 -- ip_open: 2 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3299 -- ip_open: 3 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3300 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3301 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3302 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms

And the consul logs, though there's nothing interesting there:
https://gist.github.com/cholick/03d74f7f012e54c50b56


On Fri, Oct 30, 2015 at 5:51 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Yup, that's what I was suspecting. Can you try the following now:

1. Add something like the following to your cf manifest:

...
jobs:
...
- name: cloud_controller_z1
  ...
  properties:
    consul:
      agent:
        ...
        log_level: debug
...

This sets the log level for the consul agents on your CC job to debug, so
we might be able to see more in their logs. It only sets it on the job that
matters, so when you redeploy, it won't have to roll the whole deployment.
It's okay if you can't or don't want to do this, since I'm not sure how
much you want to play around with your environment, but it could be helpful.

2. Add the following line to the bottom of your /etc/resolv.conf

options timeout:4

Let's see if the slow DNS is on the order of 4000ms now. The resolver's
default timeout is 5 seconds per nameserver, so if the delay tracks this
setting, that pins the ~5s we've been seeing on the resolver timing out
against one of the nameservers.

3. Run the following script on your CC box:

require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--'     # e.g. 'uaa.run.pivotal.io'
UAA_IP     = '--CHANGE-ME-TOO--' # e.g. '52.21.135.158'

def dur(start_time, end_time)
  "#{(1000 * (end_time - start_time)).round} ms"
end

1.step do |i|
  ip_start = Time.now
  s = TCPSocket.open(UAA_IP, 80)
  ip_open = Time.now
  s.close
  ip_close = Time.now

  dns_start = Time.now
  s = TCPSocket.open(UAA_DOMAIN, 80)
  dns_open = Time.now
  s.close
  dns_close = Time.now

  ip_open_dur = dur(ip_start, ip_open)
  ip_close_dur = dur(ip_open, ip_close)
  dns_open_dur = dur(dns_start, dns_open)
  dns_close_dur = dur(dns_open, dns_close)

  puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close: #{ip_close_dur} | dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"
end

You will first need to nslookup (or otherwise determine) the IP that
UAA_DOMAIN resolves to (it will be some load balancer, possibly the
gorouter, ha_proxy, or your own upstream LB).
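
A minimal way to wire this up (a sketch; the script filename is made up,
and the ruby package path is the same placeholder used elsewhere in this
thread):

nslookup --CHANGE-ME--   # your UAA_DOMAIN; note the address it returns and use it for UAA_IP
/var/vcap/packages/ruby-VERSION/bin/ruby /var/vcap/data/tmp/uaa_timing.rb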

4. Grab the files in /var/vcap/sys/log/consul_agent/

Cheers,
Amit

On Fri, Oct 30, 2015 at 4:29 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Here are the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

The time is all spent between "opening connection" and "opened"; the
corresponding ruby source is in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much ruby, so that's as far as I drilled down.

-Matt
