Re: cloud_controller_ng performance degrades slowly over time


Dieu Cao <dcao@...>
 

You might try moving the nameserver entry for the consul_agent in
/etc/resolv.conf on the cloud controller to the end to see if that helps.

-Dieu

On Wed, Oct 28, 2015 at 12:55 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Looks like you're right and we're experiencing the same issue as you are
Amit. We're suffering slow DNS lookups. The code is spending all of its
time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb.initialize :879

I've experimented some with the environment and, after narrowing things
down to DNS, here's some minimal demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
while true do
beginning_time = Time.now
response = Net::HTTP.get_response(uri)

end_time = Time.now
i+=1
puts "#{"%04d" % i} Response: [#{response.code}], Elapsed: #{((end_time - beginning_time)*1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some point,
requests start taking 5 second longer, which I assume is a timeout. One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs are the same. How many requests it takes before things time out
varies considerably (one run started in the 10s and another took 20k
requests), but it always happens. After that, lookups take an additional 5
second and never recover to their initial speed. This is why restarting the
cloud controller fixes the issue (temporarily).

The really slow cli calls (in the 1+min range) are simply due to the
amount of paging that a fetching data for a large org does, as that 5
seconds is multiplied out over several calls. Every user is feeling this
delay, it's just that it's only unworkable pulling the large datasets from
UAA.

I was not able to reproduce timeouts using a script calling "dig" against
localhost, only inside a ruby code.

The re-iterate our setup: we're running 212 without a consul server, just
the agents. I also successfully reproduce this problem in completely
different 217 install in a different datacenter. This setup also didn't
have an actual consul server, just the agent. I don't see anything in the
release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real headaches for
user management in our larger orgs. Amit: is there a bug we can follow?

-Matt


On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

You may not be running any consul servers, but you may have a consul
agent colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next time.

Amit,
1:20 delay is due to paging. The total call length for each page is
closer to 10s. Just included those two calls with paging by the cf command
line included numbers to demonstrate the dramatic difference after a
restart. Delays disappear after a restart. We're not running consul yet, so
it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

We've seen issues on some environments where requests to cc that
involve cc making a request to uaa or hm9k have a 5s delay while the local
consul agent fails to resolves the DNS for uaa/hm9k, before moving on to a
different resolver.

The expected behavior observed in almost all environments is that the
DNS request to consul agent fails fast and moves on to the next resolver,
we haven't figured out why a couple envs exhibit different behavior. The
impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't explain your
1:20 delay though. Are you always seeing delays that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <zrobinson(a)pivotal.io>
wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha

Join {cf-dev@lists.cloudfoundry.org to automatically receive all group messages.