Re: cloud_controller_ng performance degrades slowly over time


Matt Cholick
 

Amit,
It looks like consul isn't configured as a recursive resolver. When running
the above code, resolving fails on the first nameserver and the script
fails. resolv-replace's TCPSocket.open is different from the code http.rb
(and thus api) is using. http.rb is pulling in 'net/protocol'. I changed
the script, replacing the require for 'resolv-replace' to 'net/protocol' to
match the cloud controller.

Results:

3286 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4 ms | dns_close: 0 ms
3287 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3288 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 6 ms | dns_close: 0 ms
3289 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3290 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3291 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3292 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms | dns_close: 0 ms
3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3296 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3297 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4006 ms | dns_close: 0 ms
3298 -- ip_open: 2 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3299 -- ip_open: 3 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3300 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3301 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3302 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms

And the consul logs, though there's nothing interesting there:
https://gist.github.com/cholick/03d74f7f012e54c50b56

On Fri, Oct 30, 2015 at 5:51 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Yup, that's what I was suspecting. Can you try the following now:

1. Add something like the following to your cf manifest:

...
jobs:
...
- name: cloud_controller_z1
...
properties:
consul:
agent:
...
log_level: debug
...

This will set the debug level for the consul agents on your CC job to
debug, so we might be able to see more for its logs. It only sets it on
the job that matters, so when you redeploy, it won't have to roll the whole
deployment. It's okay if you can't/don't want to do this, I'm not sure how
much you want to play around with your environment, but it could be helpful.

2. Add the following line to the bottom of your /etc/resolv.conf

options timeout:4

Let's see if the slow DNS is on the order of 4000ms now, to pin down where
the 5s is exactly coming from.

3. Run the following script on your CC box:

require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--' # e.g. 'uaa.run.pivotal.io'
UAA_IP = '--CHANGE-ME-TOO--' # e.g. '52.21.135.158'

def dur(start_time, end_time)
"#{(1000*(end_time-start_time)).round} ms"
end

1.step do |i|
ip_start = Time.now
s = TCPSocket.open(UAA_IP, 80)
ip_open = Time.now
s.close
ip_close = Time.now

dns_start = Time.now
s = TCPSocket.open(UAA_DOMAIN, 80)
dns_open = Time.now
s.close
dns_close = Time.now

ip_open_dur = dur(ip_start, ip_open)
ip_close_dur = dur(ip_open, ip_close)
dns_open_dur = dur(dns_start, dns_open)
dns_close_dur = dur(dns_open, dns_close)

puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close:
#{ip_close_dur} | dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"
end

You will need to first nslookup (or otherwise determine) the IP that the
UAA_DOMAIN resolves to (it will be some load balancer, possibly the
gorouter, ha_proxy, or your own upstream LB)

4. Grab the files in /var/vcap/sys/log/consul_agent/

Cheers,
Amit

On Fri, Oct 30, 2015 at 4:29 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Here's the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

The time all between opening connection and opened, with the
corresponding ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much ruby, so that's as far I drilled down.

-Matt

Join cf-dev@lists.cloudfoundry.org to automatically receive all group messages.