Here's the results: https://gist.github.com/cholick/1325fe0f592b1805eba5

The time is all spent between "opening connection" and "opened", with the corresponding ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."
s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
  TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much ruby, so that's as far as I drilled down.

-Matt
On Fri, Oct 30, 2015 at 5:51 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Yup, that's what I was suspecting. Can you try the following now:
1. Add something like the following to your cf manifest:
...
jobs:
...
- name: cloud_controller_z1
  ...
  properties:
    consul:
      agent:
        ...
        log_level: debug
...
This will set the log level for the consul agent on your CC job to debug, so we might be able to see more in its logs. It only sets it on the job that matters, so when you redeploy, it won't have to roll the whole deployment. It's okay if you can't or don't want to do this; I'm not sure how much you want to play around with your environment, but it could be helpful.
2. Add the following line to the bottom of your /etc/resolv.conf:
options timeout:4
Let's see if the slow DNS is now on the order of 4000 ms, to pin down exactly where the 5s is coming from.
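For reference, the file would then look something like this (the upstream resolver address below is just a placeholder):

# consul agent
nameserver 127.0.0.1
# existing upstream resolver (placeholder address)
nameserver 10.0.0.2
options timeout:4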
3. Run the following script on your CC box:
require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--'     # e.g. 'uaa.run.pivotal.io'
UAA_IP     = '--CHANGE-ME-TOO--' # e.g. '52.21.135.158'

def dur(start_time, end_time)
  "#{(1000*(end_time-start_time)).round} ms"
end

1.step do |i|
  ip_start = Time.now
  s = TCPSocket.open(UAA_IP, 80)
  ip_open = Time.now
  s.close
  ip_close = Time.now

  dns_start = Time.now
  s = TCPSocket.open(UAA_DOMAIN, 80)
  dns_open = Time.now
  s.close
  dns_close = Time.now

  ip_open_dur  = dur(ip_start, ip_open)
  ip_close_dur = dur(ip_open, ip_close)
  dns_open_dur  = dur(dns_start, dns_open)
  dns_close_dur = dur(dns_open, dns_close)

  puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close: #{ip_close_dur} | dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"
end
You will need to first nslookup (or otherwise determine) the IP that UAA_DOMAIN resolves to; it will be some load balancer, possibly the gorouter, ha_proxy, or your own upstream LB.
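For example (the domain below is a placeholder for your own system domain):

nslookup uaa.YOUR-SYSTEM-DOMAIN
# or, equivalently
dig +short uaa.YOUR-SYSTEM-DOMAIN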
4. Grab the files in /var/vcap/sys/log/consul_agent/
Cheers, Amit
On Sun, Nov 1, 2015 at 5:18 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,

It looks like consul isn't configured as a recursive resolver. When running the above code, resolving fails on the first nameserver and the script fails. resolv-replace's TCPSocket.open is different from the code http.rb (and thus the api) is using: http.rb pulls in 'net/protocol'. I changed the script, replacing the require for 'resolv-replace' with 'net/protocol' to match the cloud controller.

Results:

3286 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4 ms | dns_close: 0 ms
3287 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3288 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 6 ms | dns_close: 0 ms
3289 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3290 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3291 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3292 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms | dns_close: 0 ms
3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3296 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3297 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4006 ms | dns_close: 0 ms
3298 -- ip_open: 2 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3299 -- ip_open: 3 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3300 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3301 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3302 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms

And the consul logs, though there's nothing interesting there: https://gist.github.com/cholick/03d74f7f012e54c50b56
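To make the change concrete, it presumably amounts to swapping the first line of the script above and leaving everything else as-is:

# require 'resolv-replace'   # original: resolves names via Ruby's Resolv library
require 'net/protocol'       # matches the cloud controller: plain TCPSocket, system resolver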
On Mon, Nov 2, 2015 at 6:53 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Okay, interesting, hopefully we're narrowing in on something. There are a couple of variables I'd like to eliminate, so I wonder if you could try the following. Also, feel free at any point to let me know if you are not interested in digging further.
Try all things as sudo, on one of the CCs.
1. It appears that the problem goes away when the CC process is restarted, so it feels as though there's some sort of resource that the ruby process is not able to GC, causing this problem to show up eventually and then go away when the process is restarted. I want to confirm this by trying two different loops, one where the loop is in bash, spinning up a new ruby process each time, and one where the loop is in ruby.
* bash loop:
while true; do time /var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e 'TCPSocket.open("--UAA-DOMAIN--", 80).close'; done
* ruby loop:
/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step do |i|; t = Time.now; TCPSocket.open("--UAA-DOMAIN--", 80).close; puts "#{i}: #{(1000*(Time.now - t)).round}ms"; end'
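(The same loop, expanded here only for readability:)

require 'net/protocol'

1.step do |i|
  t = Time.now
  TCPSocket.open("--UAA-DOMAIN--", 80).close
  puts "#{i}: #{(1000 * (Time.now - t)).round}ms"
end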
For each loop, it might also be useful to run `strace -f -p PID -o SOME_FILE` to see what system calls are going on before and after.
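For example (the pgrep pattern and output path are only illustrative):

# in another shell session, attach to the already-running ruby loop
strace -f -p "$(pgrep -f 'net/protocol')" -o /tmp/ruby_loop.strace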
2. Another variable is the interaction with the other nameservers. For this experiment, I would do `monit stop all` to take one of your CCs out of commission, so that the router doesn't load balance to it, because it will likely fail requests given the following changes:
* monit stop all && watch monit summary  # wait for all the processes to be stopped, then ctrl+c to stop the watch
* monit start consul_agent && watch monit summary  # wait for consul_agent to be running, then ctrl+c to stop the watch
* Remove nameservers other than 127.0.0.1 from /etc/resolv.conf
* Run the "ruby loop", and see if it still eventually gets slow
* When it's all done, put the original nameservers back in /etc/resolv.conf, and `monit restart all`
Again, strace-ing the ruby loop would be interesting here.
3. Finally, consul itself. Dmitriy (BOSH PM) has a little DNS resolver that can be run instead of consul and will always respond with SERVFAIL (the same as what you see from consul when you nslookup something), so we can try that:
* Modify `/var/vcap/bosh/etc/gemrc` to remove the `--local` flag
* Run `gem install rubydns`
* Dump the following into a file, say `/var/vcap/data/tmp/dns.rb`:
#!/usr/bin/env ruby

require "rubydns"

RubyDNS.run_server(listen: [[:udp, "0.0.0.0", 53], [:tcp, "0.0.0.0", 53]]) do
  otherwise do |transaction|
    transaction.fail!(:ServFail)
  end
end
* monit stop all && watch monit summary  # and again, wait for everything to be stopped
* Run it with `ruby /var/vcap/data/tmp/dns.rb`. Note that this command, and the previous `gem install`, use the system gem/ruby, not the ruby package used by CC, so it maintains some separation. When running this, it will spit out logs to the terminal, so one can keep an eye on what it's doing and make sure it all looks reasonable
* Make sure the original nameservers are back in the `/etc/resolv.conf` (i.e. ensure this experiment is independent of the previous experiment)
* Run the "ruby loop" (in a separate shell session on the CC)
* After it's all done, add back `--local` to `/var/vcap/bosh/etc/gemrc`, and `monit restart all`
Again, run strace on the ruby process.
What I hope we find out is that (1) only the ruby loop is affected, so it has something to do with long-running ruby processes, (2) the problem is independent of the other nameservers listed in /etc/resolv.conf, and (3) the problem remains when running Dmitriy's SERVFAIL resolver instead of consul on 127.0.0.1:53, to determine that the problem is not specific to consul.
On Tue, Nov 3, 2015 at 8:33 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt!

When you say "the loop without the resolver on local host" did you mean "the loop with only a resolver on local host"? Sorry if my setup wasn't clear, but my intention was to only have 127.0.0.1 in /etc/resolv.conf.
On Tuesday, November 3, 2015, Matt Cholick <cholick(a)gmail.com> wrote:

Here are the results of the ruby loop with strace: https://gist.github.com/cholick/e7e122e34b524cae5fa1
As expected, things eventually get slow. The bash version of the loop with a new vm each time didn't get slow.
For the loop without a resolver on localhost, it never did get slow. Though it's hard to prove with something so inconsistent, it hadn't happened after 100k requests. Here's some of the strace: https://gist.github.com/cholick/81e58f58e82bfe0a1489
On the final loop, with the SERVFAIL resolver, the issue did manifest. Here's the trace of that run: https://gist.github.com/cholick/bd2af46795911cb9f63c
Thanks for digging in on this.
On Tue, Nov 3, 2015 at 11:43 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I looked at the strace; I see you did indeed mean "loop without resolver on localhost". If you try it with *only* a resolver on localhost, do you get the eventually consistent DNS slowdown?
On Wed, Nov 4, 2015 at 11:41 AM, Matt Cholick <cholick(a)gmail.com> wrote:

Ah, I misunderstood.

Consul isn't configured as a recursive resolver, so for a test with only 127.0.0.1 in resolv.conf I changed the url in the ruby loop to "uaa.service.cf.internal", which is what uaa registers in consul.

I ran through 225k lookups and it never got slow. Here's a bit of the strace: https://gist.github.com/cholick/38e02ce3f351847d5fa3

Both versions of that test definitely point to the move from the first to the second nameserver in ruby, when the first nameserver doesn't know the address.
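(Presumably that run used the earlier one-liner with the internal name substituted in, something like:)

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step do |i|; t = Time.now; TCPSocket.open("uaa.service.cf.internal", 80).close; puts "#{i}: #{(1000*(Time.now - t)).round}ms"; end'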
Hey Matt,
I wanted to keep using the uaa.SYSTEM_DOMAIN domain, not the internal domain, for that experiment. I do expect the TCPSocket.open to fail when talking to 127.0.0.1; what I wanted to know is, in the presence of no other nameservers, does it eventually start to fail slowly again, or does this behaviour happen only when there are other nameservers? I imagine the TCPSocket.open is blowing up on the first iteration in the loop and exiting the script? My bad, can you replace:
TCPSocket.open("--UAA-DOMAIN--", 80).close
with
TCPSocket.open("--UAA-DOMAIN--", 80).close rescue nil
for the experiment with only 127.0.0.1 listed amongst the nameservers?
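Putting the two together, the loop for that experiment would look something like:

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step do |i|; t = Time.now; TCPSocket.open("--UAA-DOMAIN--", 80).close rescue nil; puts "#{i}: #{(1000*(Time.now - t)).round}ms"; end'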
Yes, something about the move from the first to the second nameserver seems weird. I have seen an strace of one case where it times out polling the FD of the socket it opened to talk to 127.0.0.1, but in one of your straces it looked like the poll timeout was on polling the FD for the socket for 8.8.8.8. The fact that the problem persists is interesting too; it seems like it's not just a one-off race condition where someone messed up which FD it was supposed to be polling.
Thanks, Amit
toggle quoted message
Show quoted text
On Wed, Nov 4, 2015 at 11:41 AM, Matt Cholick <cholick(a)gmail.com> wrote: Ah, I misunderstood.
Consul isn't configured as a recursive resolver, so for a test with only the 127.0.0.1 in resolve.conf I changed the url in the ruby loop to "uaa.service.cf.internal", which is what uaa is registering for in consul.
I ran through 225k lookups and it never got slow. Here's a bit of the strace: https://gist.github.com/cholick/38e02ce3f351847d5fa3
Bother versions of that test definitely pointing to the move from the first to the second nameserver in ruby, when the first nameserver doesn't know the address.
On Tue, Nov 3, 2015 at 11:43 PM, Amit Gupta <agupta(a)pivotal.io> wrote:
I looked at the strace, I see you did indeed mean "loop without resolver on localhost". If you try it with *only* a resolver on localhost, do you get the eventually consistent DNS slowdown?
On Tue, Nov 3, 2015 at 8:33 PM, Amit Gupta <agupta(a)pivotal.io> wrote:
Thanks Matt!
When you say "the loop without the resolver on local host" did you mean "the loop with only a resolver on local host"? Sorry if my setup wasn't clear, but my intention was to only have 127.0.0.1 in etc/resolv.conf.
On Tuesday, November 3, 2015, Matt Cholick <cholick(a)gmail.com> wrote:
Here are the results of the ruby loop with strace: https://gist.github.com/cholick/e7e122e34b524cae5fa1
As expected, things eventually get slow. The bash version of the loop with a new vm each time didn't get slow.
For the loop without a resolver on localhost, it never did get slow. Though it's hard to prove with something so inconsistent, it hadn't happened after 100k requests. Here's some of the strace: https://gist.github.com/cholick/81e58f58e82bfe0a1489
On the final loop, with the SERVFAIL resolver, the issue did manifest. Here's the trace of that run: https://gist.github.com/cholick/bd2af46795911cb9f63c
Thanks for digging in on this.
On Mon, Nov 2, 2015 at 6:53 PM, Amit Gupta <agupta(a)pivotal.io> wrote:
Okay, interesting, hopefully we're narrowing in on something. There's a couple variables I'd like to eliminate, so I wonder if you could try the following. Also, feel free at any point to let me know if you are not interesting in digging further.
Try all things as sudo, on one of the CCs.
1. It appears that the problem goes away when the CC process is restarted, so it feels as though there's some sort of resource that the ruby process is not able to GC, leading to this problem to show up eventually, and then go away when restarted. I want to confirm this by trying two different loops, one where the loop is in bash, spinning up a new ruby process each time, and one where the loop is in ruby.
* bash loop:
while true; do time /var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e 'TCPSocket.open("--UAA-DOMAIN--", 80).close'; done
* ruby loop
/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step do |i|; t = Time.now; TCPSocket.open("--UAA-DOMAIN--", 80).close; puts "#{i}: #{(1000*(Time.now - t)).round}ms"; end'
For each loop, it might also be useful to run `strace -f -p PID > SOME_FILE` to see what system calls are going on before and after.
2. Another variable is the interaction with the other nameservers. For this experiment, I would do `monit stop all` to take one of your CC's out of commission, so that the router doesn't load balance to it, because it will likely fail requests given the following changes:
* monit stop all && watch monit summary # wait for all the processes to be stopped, then ctrl+c to stop the watch * monit start consul_agent && watch monit summary # wait for consul_agent to be running, then ctrl+c to stop the watch * Remove nameservers other than 127.0.0.1 from /etc/resolv.conf * Run the "ruby loop", and see if it still eventually gets slow * When it's all done, put the original nameservers back in /etc/resolv.conf, and `monit restart all`
Again, strace-ing the ruby loop would be interesting here.
3. Finally, consul itself. Dmitriy (BOSH PM) has a little DNS resolver that can be run instead of consul, that will always SERVFAIL (same as what you see from consul when you nslookup something), so we can try that:
* Modify `/var/vcap/bosh/etc/gemrc` to remove the `--local` flag
* Run `gem install rubydns`
* Dump the following into a file, say `/var/vcap/data/tmp/dns.rb`:
#!/usr/bin/env ruby
require "rubydns"
RubyDNS.run_server(listen: [[:udp, "0.0.0.0", 53], [:tcp, "0.0.0.0", 53]]) do
  otherwise do |transaction|
    transaction.fail!(:ServFail)
  end
end
* monit stop all && watch monit summary # and again, wait for everything to be stopped * Run it with `ruby /var/vcap/data/tmp/dns.rb`. Note that this command, and the previous `gem install`, use the system gem/ruby, not the ruby package used by CC, so it maintains some separation. When running this, it will spit out logs to the terminal, so one can keep an eye on what it's doing, make sure it all looks reasonable * Make sure the original nameservers are back in the `/etc/resolv.conf` (i.e. ensure this experiment is independent of the previous experiment). * Run the "ruby loop" (in a separate shell session on the CC) * After it's all done, add back `--local` to `/var/vcap/bosh/etc/gemrc `, and `monit restart all`
Again, run strace on the ruby process.
What I hope we find out is that (1) only the ruby loop is affected, so it has something to do with long-running ruby processes, (2) the problem is independent of the other nameservers listed in /etc/resolv.conf, and (3) the problem remains when running Dmitriy's always-SERVFAIL DNS server instead of consul on 127.0.0.1:53, to determine that the problem is not specific to consul.
On Wed, Nov 4, 2015 at 11:59 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey Matt,
I wanted to keep using the uaa.SYSTEM_DOMAIN domain, not the internal domain, for that experiment. I do expect the TCPSocket.open to fail when talking to 127.0.0.1; what I wanted to know is, in the presence of no other nameservers, does it eventually start to fail slowly again, or does this behaviour happen only when there are other nameservers? I imagine the TCPSocket.open is blowing up on the first iteration in the loop and exiting the script? My bad, can you replace:
TCPSocket.open("--UAA-DOMAIN--", 80).close
with
TCPSocket.open("--UAA-DOMAIN--", 80).close rescue nil
for the experiment with only 127.0.0.1 listed amongst the nameservers?
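For clarity, the resulting one-liner for that experiment would look something like this (same ruby path and --UAA-DOMAIN-- placeholder as the earlier "ruby loop"; the only change is the rescue):

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step do |i|; t = Time.now; TCPSocket.open("--UAA-DOMAIN--", 80).close rescue nil; puts "#{i}: #{(1000*(Time.now - t)).round}ms"; end'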
Yes, something about the move from the first to the second nameserver seems weird. I have seen an strace of one case where it times out polling the FD of the socket it opened to talk to 127.0.0.1, but in one of your straces it looked like the poll timeout was on the FD for the socket for 8.8.8.8. The fact that the problem persists is interesting too; it seems like it's not just a one-off race condition where someone messed up which FD it was supposed to be polling.
Thanks, Amit
On Wed, Nov 4, 2015 at 11:41 AM, Matt Cholick <cholick(a)gmail.com> wrote:
Ah, I misunderstood.
Consul isn't configured as a recursive resolver, so for a test with only 127.0.0.1 in resolv.conf I changed the url in the ruby loop to "uaa.service.cf.internal", which is the name uaa registers in consul.
Heya Matt,

I keep thinking about this puzzle every couple of days to see if I can get any more insight. It would be great to understand the root cause; in the meantime we are going to try to set up Consul to recurse to other nameservers (instead of just letting the system fall through to the next nameserver in /etc/resolv.conf), and hopefully that'll make some of the weird Consul behaviour we've seen in different environments better.

But back to this one. What I'm seeing is that there's a bunch of fast iterations, then one weird one where it takes a few seconds, and then all subsequent iterations are slow and hit the 5s timeout. In a fast iteration, we see a sendmmsg sending multiple (2) messages, followed by 2 poll/ioctl/recvfroms. Each fast iteration does this first for 127.0.0.1, then 8.8.8.8.

The first bad iteration, which takes about 3s, does the sendmmsg and the 2 poll/ioctl/recvfroms for 127.0.0.1 just fine. Then, after doing the sendmmsg to 8.8.8.8, it recvs a response for one of the messages, but *times out (3s) polling for a response to the other message*. It then appears to retry talking to 8.8.8.8, this time *using sendto instead of sendmmsg* to send one message at a time: sendto/poll/ioctl/recvfrom, sendto/poll/ioctl/recvfrom. All subsequent iterations only use sendto, and now it's the first *sendto to 127.0.0.1 that times out (5s)*. It immediately goes on to 8.8.8.8 after that first sendto to 127.0.0.1 fails.

Some questions are: (1) why doesn't the response to one of the sendmmsg messages arrive? (2) why do all system calls switch from sendmmsg to sendto after that first failure? (3) why do requests to 127.0.0.1 in the form of separate sendto calls time out on the first sendto call, whereas when sent simultaneously as two messages in sendmmsg, they seem to work fine?

I put these findings together and asked on serverfault.com: http://serverfault.com/questions/736041/dns-lookups-eventually-become-consistently-slow-in-long-lived-ruby-process. One suggestion I saw, which could shed a bit more light, is to *call strace with the additional -r -T flags* to get the relative time between syscalls and the time spent within each syscall. Care to give that a try when you have some time?

Cheers,
Amit
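For reference, that just means adding the two flags to the strace invocation used earlier in the thread, along the lines of `strace -r -T -f -p PID > SOME_FILE`: -r prefixes each syscall with the time elapsed since the previous one, and -T appends the time spent inside each syscall.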
On Wed, Nov 4, 2015 at 1:54 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Gotcha. Yeah, the rescue lets that test run; after 425k lookups, it never got slow.
Here's a bit of the strace: https://gist.github.com/cholick/88c756760faca77208f8
So things got a little more interesting yesterday. We have some custom Golang jobs for auditing and some other purposes. A recent change added the consul_agent template to the box with these jobs, and I saw similar behavior: requests with DNS resolution consistently taking 5+ seconds, fixable by restarting the process. I wrote a small Golang app to try and reproduce it. I got similar, but not identical, behavior in the test app.

package main

import (
    "net"
    "fmt"
    "time"
)

func main() {
    for {
        startOpen := time.Now()
        conn, err := net.Dial("tcp", "api.useast.appfog.qa.ctl.io:http")
        if err != nil {
            fmt.Printf("%#v", err)
        } else {
            endOpen := time.Now()
            conn.Close()
            endClose := time.Now()
            openDur := endOpen.Sub(startOpen) / 1e6
            closeDur := endClose.Sub(endOpen) / 1e6
            fmt.Printf("dns_open: %dms | dns_close: %dms\n", openDur, closeDur)
        }
    }
}

After some time, I saw:

dns_open: 14ms | dns_close: 0ms
dns_open: 5019ms | dns_close: 1ms
dns_open: 25ms | dns_close: 1ms
dns_open: 27ms | dns_close: 1ms
dns_open: 20ms | dns_close: 0ms
dns_open: 5015ms | dns_close: 0ms
dns_open: 14ms | dns_close: 0ms
dns_open: 5018ms | dns_close: 1ms
dns_open: 15ms | dns_close: 0ms
dns_open: 5018ms | dns_close: 1ms
dns_open: 19ms | dns_close: 0ms
dns_open: 5046ms | dns_close: 0ms

To reiterate, the behavior in the actual app seemed to affect every single request. The test app went into a similar mode where roughly every other request failed. Here's the strace for that section: https://gist.github.com/cholick/428df93c112067f50558

On some other runs it was quite consistently alternating slow and fast calls (for ~300 calls in a row alternating slow/fast). Just like the Ruby, once things get into an unhappy state they stay that way until the process is restarted. (The process ran for 12 hours and never recovered.)

I can capture a Ruby strace too if you'd like, but I figured this was more useful as it added more novel information to the mix.

These Golang jobs have run in production for months without showing this behavior and, within a few hours of including the consul template, the issue manifested.
Thanks Matt, that's super interesting. It would be awesome to get the strace from the Ruby thing, as well as another strace from the Go thing, but this time starting from before it goes slow. I think there's some sort of weird thing that happens once, and after that happens, the fact that Consul (or any other always-SERVFAIL server, e.g. Dmitriy's little Ruby DNS server) is present makes all subsequent requests slow. So I'd like to see strace from both the Ruby and the Go, with enough before and after the slowness starts to establish the same pattern of behaviour.
The alternating slow/fast stuff with Go is interesting. No clue yet what that could be; maybe multiple goroutines?
Sure. Here's the Ruby:

dns_open: 8 ms | dns_close: 0 ms
dns_open: 9 ms | dns_close: 0 ms
dns_open: 10 ms | dns_close: 0 ms
dns_open: 3013 ms | dns_close: 0 ms
dns_open: 5015 ms | dns_close: 0 ms
dns_open: 5011 ms | dns_close: 0 ms
dns_open: 5014 ms | dns_close: 0 ms
dns_open: 5015 ms | dns_close: 0 ms
dns_open: 5012 ms | dns_close: 0 ms
dns_open: 5015 ms | dns_close: 0 ms

https://gist.github.com/cholick/f8bab2d9752728671b94

And the Go:

dns_open: 16ms | dns_close: 0ms
dns_open: 11ms | dns_close: 0ms
dns_open: 19ms | dns_close: 0ms
dns_open: 12ms | dns_close: 0ms
dns_open: 16ms | dns_close: 0ms
dns_open: 3016ms | dns_close: 0ms
dns_open: 15ms | dns_close: 0ms
dns_open: 5033ms | dns_close: 0ms
dns_open: 11ms | dns_close: 0ms
dns_open: 5018ms | dns_close: 0ms
dns_open: 5017ms | dns_close: 0ms
dns_open: 12ms | dns_close: 0ms
dns_open: 5017ms | dns_close: 0ms
dns_open: 13ms | dns_close: 0ms
dns_open: 5015ms | dns_close: 0ms

https://gist.github.com/cholick/7d4e01dcf99588d13016
Configuring the consul agent to recurse does solve the issue.

The consul agent's configuration is directory based: it loads every json file in its config directory. Since our deployment already consumes an internal release with our components in addition to cf, I've created a template as part of that release which copies an additional json file (with a configuration for https://www.consul.io/docs/agent/options.html#recursors) into the agent's configuration directory. It's a hack, but it gets us past having to build our own cf-release just to customize the consul_agent's job templates.

Would you folks accept a pull request that adds "recursors" as a configuration option to https://github.com/cloudfoundry-incubator/consul-release/blob/master/jobs/consul_agent/templates/config.json.erb? Or is this work you already have in progress, since you mentioned exploring having consul recurse? Or were you thinking of solving this more dynamically and consuming the existing content in resolv.conf somehow as the upstream recursors?

Thanks again for all the help trying to debug this one.
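For anyone following along, the drop-in file Matt describes would presumably be a small json file placed in the consul agent's config directory, along these lines (the upstream resolver addresses here are just placeholders for whatever the box should fall back to):

{
  "recursors": ["8.8.8.8", "8.8.4.4"]
}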
Thanks Matt, it's helpful to validate that the solution worked for you. I'm curious about how this will play with Diego, which actually makes heavy use of Consul.

We have a placeholder story to support recursing: https://www.pivotaltracker.com/story/show/108394110. I don't think it's as simple as exposing a config option, because the other current values in /etc/resolv.conf (which would become the values consul recurses to) are sometimes populated by data that BOSH knows, not something you would configure in a CF manifest property.

Our consul-release is undergoing some heavy refactoring and test-backfilling, and we'll want to wait for some of that to stabilize before we start adding new functionality. But this issue is prominently on our radar.
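Purely as a sketch of the property-based approach Matt floated (which, per the above, may not be the final shape of the solution), the template change could be as small as rendering a hypothetical consul.agent.recursors manifest property in config.json.erb:

"recursors": <%= p('consul.agent.recursors', []).to_json %>

where p() is the standard BOSH template property helper and the property would default to an empty list.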
Could be related (or totally unrelated), but Ubuntu users have reported slowness with dnsmasq, which might matter if the recursors flag implies that consul tries another set of DNS servers and bypasses dnsmasq (127.0.1.1). I experienced this this morning: after disabling dnsmasq, the system correctly uses all the entries in my resolv.conf.
I simply commented out dnsmasq, and my problems went away. Since CF uses the trusty image, this may be worth trying out.
file: /etc/NetworkManager/NetworkManager.conf
[main]
plugins=ifupdown,keyfile,ofono
#dns=dnsmasq

[ifupdown]
managed=false