cloud_controller_ng performance degrades slowly over time


Matt Cholick
 

This is a pretty tricky one, as it takes a long time to manifest. After a
while without a restart, cloud_controller_ng take a long time listing org
users. For example, in an org with 350 users, before restart `cf org-users
ORG` took 1:20. After restart, the call took 0:07. We've only notice this
twice, both times after the cloud controller had been running for a couple
weeks without a restart.

Looking in New Relic, the breakdown of an individual call to
organization/guid/managers shows the majority of the call externally:

[image: Inline image 1]

Though uaa itself isn't the issue, as things are fine immediately after
restarting cloud_controller_ng (and restarting uaa has no affect).

Have other Cloud Foundry operators seen this degraded performance? Is there
other information we could provide to turn this into a workable bug report?

-Matt Cholick


Mike Youngstrom
 

We don't have the same performance numbers since we don't use New Relic.
However, for the last 6 months or so after 3-4 weeks our smoke tests begin
failing randomly with 500 errors doing basic things like creating routes.
If we restart the CC(s) then everything works great for another 3-4 weeks.

Mike

On Tue, Sep 29, 2015 at 1:08 PM, Matt Cholick <cholick(a)gmail.com> wrote:

This is a pretty tricky one, as it takes a long time to manifest. After a
while without a restart, cloud_controller_ng take a long time listing org
users. For example, in an org with 350 users, before restart `cf org-users
ORG` took 1:20. After restart, the call took 0:07. We've only notice this
twice, both times after the cloud controller had been running for a couple
weeks without a restart.

Looking in New Relic, the breakdown of an individual call to
organization/guid/managers shows the majority of the call externally:

[image: Inline image 1]

Though uaa itself isn't the issue, as things are fine immediately after
restarting cloud_controller_ng (and restarting uaa has no affect).

Have other Cloud Foundry operators seen this degraded performance? Is
there other information we could provide to turn this into a workable bug
report?

-Matt Cholick



Jeffrey Pak
 

We don't see this degradation on any of our environments. We typically deploy at least every two weeks, so it's possible none of our environments are up long enough to exhibit this behavior.

Do you notice this for other endpoints that exercise the CloudController's UAA client (for example: other Organization user types and/or Space users)? Do you see slowdown for any endpoints that do not hit UAA?


Matt Cholick
 

Yes, we see similar behavior. The top slowest calls (with average times):

* GET v2/organizations/([^/?#]+)/users (9s)
* GET v2/organizations/([^/?#]+)/managers (7.48s)
* GET v2/spaces/([^/?#]+)/managers (6.45s)
* PUT v2/service_brokers/([^/?#]+) (6.25s)
* GET v2/spaces/([^/?#]+)/developers (5.81s)
* DELETE v2/spaces/([^/?#]+) (5.66s)
* POST v2/service_instances (4.95s)
* GET v2/apps/([^/?#]+)/summary (4.21s)

service_brokers, delete spaces, and service_instances all communicate with
a service broker which in turn communicates with external services: slow
does seem legitimate there. apps/summary is slow in a similar way around
communication with hm9000:

[image: Inline image 2]

It definitely looks like network, but that wouldn't explain why things get
immediately better after restarting the controller (a monit restart on the
job, not restarting the whole vm). If network is the cause, I would also
expect Postgres to be slow as those machines are next to everything else in
the network (though that's private, internal IP based rather than public
IP, so the path is a bit different).

I've tried looking back a bit through history, but the response time of
those users/managers/developers endpoints vary so much based on the number
of users in the organization that the averages over time don't show any
meaningful trend: we just start hearing from users when it gets really
slow, as those are the endpoints that have a lot of paging for the big orgs.

-Matt


Zach Robinson
 

Hey Matt,

Do you notice if the memory that the cc process is using increases by a large amount over that time?

-Zach and Jeff


Matt Cholick
 

Memory doesn't look problematic:

[image: Inline image 1]

On Thu, Oct 1, 2015 at 2:38 PM, Zach Robinson <zrobinson(a)pivotal.io> wrote:

Hey Matt,

Do you notice if the memory that the cc process is using increases by a
large amount over that time?

-Zach and Jeff


Zach Robinson
 

Hey Matt,

I'm trying to think of other things that would affect only the endpoints that interact with UAA and would be fixed after a CC restart. I'm wondering if it's possible there are a large number of connections being kept-alive, or stuck in a wait state or something. Could you take a look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Amit Kumar Gupta
 

We've seen issues on some environments where requests to cc that involve cc
making a request to uaa or hm9k have a 5s delay while the local consul
agent fails to resolves the DNS for uaa/hm9k, before moving on to a
different resolver.

The expected behavior observed in almost all environments is that the DNS
request to consul agent fails fast and moves on to the next resolver, we
haven't figured out why a couple envs exhibit different behavior. The
impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't explain your
1:20 delay though. Are you always seeing delays that long?

Amit

On Thursday, October 8, 2015, Zach Robinson <zrobinson(a)pivotal.io> wrote:

Hey Matt,

I'm trying to think of other things that would affect only the endpoints
that interact with UAA and would be fixed after a CC restart. I'm
wondering if it's possible there are a large number of connections being
kept-alive, or stuck in a wait state or something. Could you take a look
at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Matt Cholick
 

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next time.

Amit,
1:20 delay is due to paging. The total call length for each page is closer
to 10s. Just included those two calls with paging by the cf command line
included numbers to demonstrate the dramatic difference after a restart.
Delays disappear after a restart. We're not running consul yet, so it
wouldn't be that.

-Matt

On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

We've seen issues on some environments where requests to cc that involve
cc making a request to uaa or hm9k have a 5s delay while the local consul
agent fails to resolves the DNS for uaa/hm9k, before moving on to a
different resolver.

The expected behavior observed in almost all environments is that the DNS
request to consul agent fails fast and moves on to the next resolver, we
haven't figured out why a couple envs exhibit different behavior. The
impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't explain your
1:20 delay though. Are you always seeing delays that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <zrobinson(a)pivotal.io> wrote:

Hey Matt,

I'm trying to think of other things that would affect only the endpoints
that interact with UAA and would be fixed after a CC restart. I'm
wondering if it's possible there are a large number of connections being
kept-alive, or stuck in a wait state or something. Could you take a look
at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Amit Kumar Gupta
 

You may not be running any consul servers, but you may have a consul agent
colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next time.

Amit,
1:20 delay is due to paging. The total call length for each page is closer
to 10s. Just included those two calls with paging by the cf command line
included numbers to demonstrate the dramatic difference after a restart.
Delays disappear after a restart. We're not running consul yet, so it
wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

We've seen issues on some environments where requests to cc that involve
cc making a request to uaa or hm9k have a 5s delay while the local consul
agent fails to resolves the DNS for uaa/hm9k, before moving on to a
different resolver.

The expected behavior observed in almost all environments is that the DNS
request to consul agent fails fast and moves on to the next resolver, we
haven't figured out why a couple envs exhibit different behavior. The
impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't explain your
1:20 delay though. Are you always seeing delays that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <zrobinson(a)pivotal.io> wrote:

Hey Matt,

I'm trying to think of other things that would affect only the endpoints
that interact with UAA and would be fixed after a CC restart. I'm
wondering if it's possible there are a large number of connections being
kept-alive, or stuck in a wait state or something. Could you take a look
at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Matt Cholick
 

Looks like you're right and we're experiencing the same issue as you are
Amit. We're suffering slow DNS lookups. The code is spending all of its
time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb.initialize :879

I've experimented some with the environment and, after narrowing things
down to DNS, here's some minimal demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
while true do
beginning_time = Time.now
response = Net::HTTP.get_response(uri)

end_time = Time.now
i+=1
puts "#{"%04d" % i} Response: [#{response.code}], Elapsed:
#{((end_time - beginning_time)*1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some point,
requests start taking 5 second longer, which I assume is a timeout. One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs are the same. How many requests it takes before things time out
varies considerably (one run started in the 10s and another took 20k
requests), but it always happens. After that, lookups take an additional 5
second and never recover to their initial speed. This is why restarting the
cloud controller fixes the issue (temporarily).

The really slow cli calls (in the 1+min range) are simply due to the amount
of paging that a fetching data for a large org does, as that 5 seconds is
multiplied out over several calls. Every user is feeling this delay, it's
just that it's only unworkable pulling the large datasets from UAA.

I was not able to reproduce timeouts using a script calling "dig" against
localhost, only inside a ruby code.

The re-iterate our setup: we're running 212 without a consul server, just
the agents. I also successfully reproduce this problem in completely
different 217 install in a different datacenter. This setup also didn't
have an actual consul server, just the agent. I don't see anything in the
release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real headaches for
user management in our larger orgs. Amit: is there a bug we can follow?

-Matt

On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

You may not be running any consul servers, but you may have a consul agent
colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next time.

Amit,
1:20 delay is due to paging. The total call length for each page is
closer to 10s. Just included those two calls with paging by the cf command
line included numbers to demonstrate the dramatic difference after a
restart. Delays disappear after a restart. We're not running consul yet, so
it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

We've seen issues on some environments where requests to cc that involve
cc making a request to uaa or hm9k have a 5s delay while the local consul
agent fails to resolves the DNS for uaa/hm9k, before moving on to a
different resolver.

The expected behavior observed in almost all environments is that the
DNS request to consul agent fails fast and moves on to the next resolver,
we haven't figured out why a couple envs exhibit different behavior. The
impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't explain your
1:20 delay though. Are you always seeing delays that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <zrobinson(a)pivotal.io>
wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Dieu Cao <dcao@...>
 

You might try moving the nameserver entry for the consul_agent in
/etc/resolv.conf on the cloud controller to the end to see if that helps.

-Dieu

On Wed, Oct 28, 2015 at 12:55 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Looks like you're right and we're experiencing the same issue as you are
Amit. We're suffering slow DNS lookups. The code is spending all of its
time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb.initialize :879

I've experimented some with the environment and, after narrowing things
down to DNS, here's some minimal demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
while true do
beginning_time = Time.now
response = Net::HTTP.get_response(uri)

end_time = Time.now
i+=1
puts "#{"%04d" % i} Response: [#{response.code}], Elapsed: #{((end_time - beginning_time)*1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some point,
requests start taking 5 second longer, which I assume is a timeout. One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs are the same. How many requests it takes before things time out
varies considerably (one run started in the 10s and another took 20k
requests), but it always happens. After that, lookups take an additional 5
second and never recover to their initial speed. This is why restarting the
cloud controller fixes the issue (temporarily).

The really slow cli calls (in the 1+min range) are simply due to the
amount of paging that a fetching data for a large org does, as that 5
seconds is multiplied out over several calls. Every user is feeling this
delay, it's just that it's only unworkable pulling the large datasets from
UAA.

I was not able to reproduce timeouts using a script calling "dig" against
localhost, only inside a ruby code.

The re-iterate our setup: we're running 212 without a consul server, just
the agents. I also successfully reproduce this problem in completely
different 217 install in a different datacenter. This setup also didn't
have an actual consul server, just the agent. I don't see anything in the
release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real headaches for
user management in our larger orgs. Amit: is there a bug we can follow?

-Matt


On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

You may not be running any consul servers, but you may have a consul
agent colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next time.

Amit,
1:20 delay is due to paging. The total call length for each page is
closer to 10s. Just included those two calls with paging by the cf command
line included numbers to demonstrate the dramatic difference after a
restart. Delays disappear after a restart. We're not running consul yet, so
it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

We've seen issues on some environments where requests to cc that
involve cc making a request to uaa or hm9k have a 5s delay while the local
consul agent fails to resolves the DNS for uaa/hm9k, before moving on to a
different resolver.

The expected behavior observed in almost all environments is that the
DNS request to consul agent fails fast and moves on to the next resolver,
we haven't figured out why a couple envs exhibit different behavior. The
impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't explain your
1:20 delay though. Are you always seeing delays that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <zrobinson(a)pivotal.io>
wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Amit Kumar Gupta
 

Thanks Matt, this is awesome.

I'm trying to reproduce this with your script, up at 10k with no change.
I'm also shelling out to curl in the script, to see if both curl and ruby
get affected, and so, if they're affected at the same time.

What IaaS and stemcell are you using?

Thanks,
Amit

On Wed, Oct 28, 2015 at 2:54 PM, Dieu Cao <dcao(a)pivotal.io> wrote:

You might try moving the nameserver entry for the consul_agent in
/etc/resolv.conf on the cloud controller to the end to see if that helps.

-Dieu

On Wed, Oct 28, 2015 at 12:55 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Looks like you're right and we're experiencing the same issue as you are
Amit. We're suffering slow DNS lookups. The code is spending all of its
time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb.initialize :879

I've experimented some with the environment and, after narrowing things
down to DNS, here's some minimal demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
while true do
beginning_time = Time.now
response = Net::HTTP.get_response(uri)

end_time = Time.now
i+=1
puts "#{"%04d" % i} Response: [#{response.code}], Elapsed: #{((end_time - beginning_time)*1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some point,
requests start taking 5 second longer, which I assume is a timeout. One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs are the same. How many requests it takes before things time
out varies considerably (one run started in the 10s and another took 20k
requests), but it always happens. After that, lookups take an additional 5
second and never recover to their initial speed. This is why restarting the
cloud controller fixes the issue (temporarily).

The really slow cli calls (in the 1+min range) are simply due to the
amount of paging that a fetching data for a large org does, as that 5
seconds is multiplied out over several calls. Every user is feeling this
delay, it's just that it's only unworkable pulling the large datasets from
UAA.

I was not able to reproduce timeouts using a script calling "dig" against
localhost, only inside a ruby code.

The re-iterate our setup: we're running 212 without a consul server, just
the agents. I also successfully reproduce this problem in completely
different 217 install in a different datacenter. This setup also didn't
have an actual consul server, just the agent. I don't see anything in the
release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real headaches for
user management in our larger orgs. Amit: is there a bug we can follow?

-Matt


On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

You may not be running any consul servers, but you may have a consul
agent colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next time.

Amit,
1:20 delay is due to paging. The total call length for each page is
closer to 10s. Just included those two calls with paging by the cf command
line included numbers to demonstrate the dramatic difference after a
restart. Delays disappear after a restart. We're not running consul yet, so
it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

We've seen issues on some environments where requests to cc that
involve cc making a request to uaa or hm9k have a 5s delay while the local
consul agent fails to resolves the DNS for uaa/hm9k, before moving on to a
different resolver.

The expected behavior observed in almost all environments is that the
DNS request to consul agent fails fast and moves on to the next resolver,
we haven't figured out why a couple envs exhibit different behavior. The
impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't explain your
1:20 delay though. Are you always seeing delays that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <zrobinson(a)pivotal.io>
wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Amit Kumar Gupta
 

I got up to 10k on an AWS deployment of HEAD of cf-release with ruby 2.2,
then started another loop on the same box with ruby 2.1. In the end, they
got up to 40-50k without showing any signs of change. I had to switch to
resolving the UAA endpoint, eventually google started responding with 302s.

I'm going to try with a cf-release 212 deployment on my bosh lite, but
eventually I want to try on the same stemcell as you're using.

On Wed, Oct 28, 2015 at 5:01 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt, this is awesome.

I'm trying to reproduce this with your script, up at 10k with no change.
I'm also shelling out to curl in the script, to see if both curl and ruby
get affected, and so, if they're affected at the same time.

What IaaS and stemcell are you using?

Thanks,
Amit

On Wed, Oct 28, 2015 at 2:54 PM, Dieu Cao <dcao(a)pivotal.io> wrote:

You might try moving the nameserver entry for the consul_agent in
/etc/resolv.conf on the cloud controller to the end to see if that helps.

-Dieu

On Wed, Oct 28, 2015 at 12:55 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Looks like you're right and we're experiencing the same issue as you are
Amit. We're suffering slow DNS lookups. The code is spending all of its
time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb.initialize :879

I've experimented some with the environment and, after narrowing things
down to DNS, here's some minimal demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
while true do
beginning_time = Time.now
response = Net::HTTP.get_response(uri)

end_time = Time.now
i+=1
puts "#{"%04d" % i} Response: [#{response.code}], Elapsed: #{((end_time - beginning_time)*1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some point,
requests start taking 5 second longer, which I assume is a timeout. One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs are the same. How many requests it takes before things time
out varies considerably (one run started in the 10s and another took 20k
requests), but it always happens. After that, lookups take an additional 5
second and never recover to their initial speed. This is why restarting the
cloud controller fixes the issue (temporarily).

The really slow cli calls (in the 1+min range) are simply due to the
amount of paging that a fetching data for a large org does, as that 5
seconds is multiplied out over several calls. Every user is feeling this
delay, it's just that it's only unworkable pulling the large datasets from
UAA.

I was not able to reproduce timeouts using a script calling "dig"
against localhost, only inside a ruby code.

The re-iterate our setup: we're running 212 without a consul server,
just the agents. I also successfully reproduce this problem in completely
different 217 install in a different datacenter. This setup also didn't
have an actual consul server, just the agent. I don't see anything in the
release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real headaches
for user management in our larger orgs. Amit: is there a bug we can follow?

-Matt


On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

You may not be running any consul servers, but you may have a consul
agent colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next time.

Amit,
1:20 delay is due to paging. The total call length for each page is
closer to 10s. Just included those two calls with paging by the cf command
line included numbers to demonstrate the dramatic difference after a
restart. Delays disappear after a restart. We're not running consul yet, so
it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

We've seen issues on some environments where requests to cc that
involve cc making a request to uaa or hm9k have a 5s delay while the local
consul agent fails to resolves the DNS for uaa/hm9k, before moving on to a
different resolver.

The expected behavior observed in almost all environments is that the
DNS request to consul agent fails fast and moves on to the next resolver,
we haven't figured out why a couple envs exhibit different behavior. The
impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't explain your
1:20 delay though. Are you always seeing delays that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <zrobinson(a)pivotal.io>
wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Matt Cholick
 

Thanks for taking a look, fingers crossed you can see it happen as well.

Our 217 install is on stemcell 3026 and our 212 install is on 2989.

IaaS is CenturyLink Cloud.

-Matt

On Wed, Oct 28, 2015 at 6:08 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I got up to 10k on an AWS deployment of HEAD of cf-release with ruby 2.2,
then started another loop on the same box with ruby 2.1. In the end, they
got up to 40-50k without showing any signs of change. I had to switch to
resolving the UAA endpoint, eventually google started responding with 302s.

I'm going to try with a cf-release 212 deployment on my bosh lite, but
eventually I want to try on the same stemcell as you're using.

On Wed, Oct 28, 2015 at 5:01 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt, this is awesome.

I'm trying to reproduce this with your script, up at 10k with no change.
I'm also shelling out to curl in the script, to see if both curl and ruby
get affected, and so, if they're affected at the same time.

What IaaS and stemcell are you using?

Thanks,
Amit

On Wed, Oct 28, 2015 at 2:54 PM, Dieu Cao <dcao(a)pivotal.io> wrote:

You might try moving the nameserver entry for the consul_agent in
/etc/resolv.conf on the cloud controller to the end to see if that helps.

-Dieu

On Wed, Oct 28, 2015 at 12:55 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Looks like you're right and we're experiencing the same issue as you
are Amit. We're suffering slow DNS lookups. The code is spending all of its
time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb.initialize :879

I've experimented some with the environment and, after narrowing things
down to DNS, here's some minimal demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
while true do
beginning_time = Time.now
response = Net::HTTP.get_response(uri)

end_time = Time.now
i+=1
puts "#{"%04d" % i} Response: [#{response.code}], Elapsed: #{((end_time - beginning_time)*1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some
point, requests start taking 5 second longer, which I assume is a timeout.
One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs are the same. How many requests it takes before things time
out varies considerably (one run started in the 10s and another took 20k
requests), but it always happens. After that, lookups take an additional 5
second and never recover to their initial speed. This is why restarting the
cloud controller fixes the issue (temporarily).

The really slow cli calls (in the 1+min range) are simply due to the
amount of paging that a fetching data for a large org does, as that 5
seconds is multiplied out over several calls. Every user is feeling this
delay, it's just that it's only unworkable pulling the large datasets from
UAA.

I was not able to reproduce timeouts using a script calling "dig"
against localhost, only inside a ruby code.

The re-iterate our setup: we're running 212 without a consul server,
just the agents. I also successfully reproduce this problem in completely
different 217 install in a different datacenter. This setup also didn't
have an actual consul server, just the agent. I don't see anything in the
release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real headaches
for user management in our larger orgs. Amit: is there a bug we can follow?

-Matt


On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

You may not be running any consul servers, but you may have a consul
agent colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next time.

Amit,
1:20 delay is due to paging. The total call length for each page is
closer to 10s. Just included those two calls with paging by the cf command
line included numbers to demonstrate the dramatic difference after a
restart. Delays disappear after a restart. We're not running consul yet, so
it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

We've seen issues on some environments where requests to cc that
involve cc making a request to uaa or hm9k have a 5s delay while the local
consul agent fails to resolves the DNS for uaa/hm9k, before moving on to a
different resolver.

The expected behavior observed in almost all environments is that
the DNS request to consul agent fails fast and moves on to the next
resolver, we haven't figured out why a couple envs exhibit different
behavior. The impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't
explain your 1:20 delay though. Are you always seeing delays that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <zrobinson(a)pivotal.io>
wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Amit Kumar Gupta
 

Hey Matt,

Dieu's suggestion will fix your problem (you'll have to make the change on
all CC's), although it'll get undone on each redeploy. We do want to find
the root cause, but have not been able to reproduce it in our own
environments. If you're up for some investigation, may I suggest the
following:

* Run the following variation of your script on one of the CCs:

require 'uri'
require 'net/http'

SYSTEM_DOMAIN = '--CHANGE-ME--'

uaa_domain = "uaa.#{SYSTEM_DOMAIN}"
login_url = "https://#{uaa_domain}/login"

curl_command="curl -f #{login_url} 2>&1"
nslookup_command="nslookup #{uaa_domain} 2>&1"

puts 'STARTING SANITY CHECK'
curl_output = `#{curl_command}`
raise "'#{curl_command}' failed with output:\n#{curl_output}" unless
$?.to_i.zero?
puts 'SANITY CHECK PASSED'

def duration_string(start)
"#{((Time.now - start) * 1000).round}ms"
end

puts 'STARTING TEST'
1.step do |i|
uri = URI.parse(login_url)
ruby_start = Time.now
ruby_response = Net::HTTP.get_response(uri)
ruby_duration = duration_string(ruby_start)

curl_start = Time.now
`#{curl_command}`
curl_duration = duration_string(curl_start)

nslookup_start = Time.now
`#{nslookup_command}`
nslookup_duration = duration_string(nslookup_start)

puts "#{"%05d" % i} [#{ruby_response.code}]: ruby #{ruby_duration} |
curl #{curl_duration} | nslookup #{nslookup_duration}"
end

* Send a kill -QUIT <consul_agent_pid> to the consul agent process once you
see the slow DNS manifest itself, you will get a dump of all the goroutines
running in the consul agent process
/var/vcap/sys/log/consul_agent/consul_agent.stderr.log. I would be curious
to see what it spits out.

Amit

On Wed, Oct 28, 2015 at 6:10 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Thanks for taking a look, fingers crossed you can see it happen as well.

Our 217 install is on stemcell 3026 and our 212 install is on 2989.

IaaS is CenturyLink Cloud.

-Matt

On Wed, Oct 28, 2015 at 6:08 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I got up to 10k on an AWS deployment of HEAD of cf-release with ruby 2.2,
then started another loop on the same box with ruby 2.1. In the end, they
got up to 40-50k without showing any signs of change. I had to switch to
resolving the UAA endpoint, eventually google started responding with 302s.

I'm going to try with a cf-release 212 deployment on my bosh lite, but
eventually I want to try on the same stemcell as you're using.

On Wed, Oct 28, 2015 at 5:01 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt, this is awesome.

I'm trying to reproduce this with your script, up at 10k with no
change. I'm also shelling out to curl in the script, to see if both curl
and ruby get affected, and so, if they're affected at the same time.

What IaaS and stemcell are you using?

Thanks,
Amit

On Wed, Oct 28, 2015 at 2:54 PM, Dieu Cao <dcao(a)pivotal.io> wrote:

You might try moving the nameserver entry for the consul_agent in
/etc/resolv.conf on the cloud controller to the end to see if that helps.

-Dieu

On Wed, Oct 28, 2015 at 12:55 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Looks like you're right and we're experiencing the same issue as you
are Amit. We're suffering slow DNS lookups. The code is spending all of its
time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb.initialize
:879

I've experimented some with the environment and, after narrowing
things down to DNS, here's some minimal demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
while true do
beginning_time = Time.now
response = Net::HTTP.get_response(uri)

end_time = Time.now
i+=1
puts "#{"%04d" % i} Response: [#{response.code}], Elapsed: #{((end_time - beginning_time)*1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some
point, requests start taking 5 second longer, which I assume is a timeout.
One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs are the same. How many requests it takes before things time
out varies considerably (one run started in the 10s and another took 20k
requests), but it always happens. After that, lookups take an additional 5
second and never recover to their initial speed. This is why restarting the
cloud controller fixes the issue (temporarily).

The really slow cli calls (in the 1+min range) are simply due to the
amount of paging that a fetching data for a large org does, as that 5
seconds is multiplied out over several calls. Every user is feeling this
delay, it's just that it's only unworkable pulling the large datasets from
UAA.

I was not able to reproduce timeouts using a script calling "dig"
against localhost, only inside a ruby code.

The re-iterate our setup: we're running 212 without a consul server,
just the agents. I also successfully reproduce this problem in completely
different 217 install in a different datacenter. This setup also didn't
have an actual consul server, just the agent. I don't see anything in the
release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real headaches
for user management in our larger orgs. Amit: is there a bug we can follow?

-Matt


On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

You may not be running any consul servers, but you may have a consul
agent colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next time.

Amit,
1:20 delay is due to paging. The total call length for each page is
closer to 10s. Just included those two calls with paging by the cf command
line included numbers to demonstrate the dramatic difference after a
restart. Delays disappear after a restart. We're not running consul yet, so
it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

We've seen issues on some environments where requests to cc that
involve cc making a request to uaa or hm9k have a 5s delay while the local
consul agent fails to resolves the DNS for uaa/hm9k, before moving on to a
different resolver.

The expected behavior observed in almost all environments is that
the DNS request to consul agent fails fast and moves on to the next
resolver, we haven't figured out why a couple envs exhibit different
behavior. The impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't
explain your 1:20 delay though. Are you always seeing delays that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <zrobinson(a)pivotal.io>
wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Matt Cholick
 

Amit,
Here's a run with the problem manifesting:

...
00248 [200]: ruby 26ms | curl 33ms | nslookup 21ms
00249 [200]: ruby 20ms | curl 32ms | nslookup 14ms
00250 [200]: ruby 18ms | curl 30ms | nslookup 17ms
00251 [200]: ruby 22ms | curl 31ms | nslookup 16ms
00252 [200]: ruby 23ms | curl 30ms | nslookup 16ms
00253 [200]: ruby 26ms | curl 40ms | nslookup 16ms
00254 [200]: ruby 20ms | curl 40ms | nslookup 14ms
00255 [200]: ruby 20ms | curl 35ms | nslookup 20ms
00256 [200]: ruby 17ms | curl 32ms | nslookup 14ms
00257 [200]: ruby 20ms | curl 37ms | nslookup 14ms
00258 [200]: ruby 25ms | curl 1038ms | nslookup 14ms
00259 [200]: ruby 27ms | curl 37ms | nslookup 13ms
00260 [200]: ruby 4020ms | curl 32ms | nslookup 16ms
00261 [200]: ruby 5032ms | curl 45ms | nslookup 14ms
00262 [200]: ruby 5021ms | curl 30ms | nslookup 14ms
00263 [200]: ruby 5027ms | curl 32ms | nslookup 16ms
00264 [200]: ruby 5025ms | curl 34ms | nslookup 15ms
00265 [200]: ruby 5029ms | curl 31ms | nslookup 14ms
00266 [200]: ruby 5030ms | curl 37ms | nslookup 18ms
00267 [200]: ruby 5022ms | curl 43ms | nslookup 14ms
00268 [200]: ruby 5026ms | curl 31ms | nslookup 17ms
00269 [200]: ruby 5027ms | curl 33ms | nslookup 14ms
00270 [200]: ruby 5025ms | curl 32ms | nslookup 14ms
00271 [200]: ruby 5022ms | curl 36ms | nslookup 15ms
00272 [200]: ruby 5030ms | curl 32ms | nslookup 13ms
00273 [200]: ruby 5024ms | curl 32ms | nslookup 13ms
00274 [200]: ruby 5028ms | curl 34ms | nslookup 14ms
00275 [200]: ruby 5048ms | curl 30ms | nslookup 14ms


It's definitely interesting that Ruby is the only one to manifest the
problem.

And here's the consul output:
https://gist.github.com/cholick/f7e91fb58891cc0d8f5a

On Thu, Oct 29, 2015 at 4:27 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey Matt,

Dieu's suggestion will fix your problem (you'll have to make the change on
all CC's), although it'll get undone on each redeploy. We do want to find
the root cause, but have not been able to reproduce it in our own
environments. If you're up for some investigation, may I suggest the
following:

* Run the following variation of your script on one of the CCs:

require 'uri'
require 'net/http'

SYSTEM_DOMAIN = '--CHANGE-ME--'

uaa_domain = "uaa.#{SYSTEM_DOMAIN}"
login_url = "https://#{uaa_domain}/login"

curl_command="curl -f #{login_url} 2>&1"
nslookup_command="nslookup #{uaa_domain} 2>&1"

puts 'STARTING SANITY CHECK'
curl_output = `#{curl_command}`
raise "'#{curl_command}' failed with output:\n#{curl_output}" unless
$?.to_i.zero?
puts 'SANITY CHECK PASSED'

def duration_string(start)
"#{((Time.now - start) * 1000).round}ms"
end

puts 'STARTING TEST'
1.step do |i|
uri = URI.parse(login_url)
ruby_start = Time.now
ruby_response = Net::HTTP.get_response(uri)
ruby_duration = duration_string(ruby_start)

curl_start = Time.now
`#{curl_command}`
curl_duration = duration_string(curl_start)

nslookup_start = Time.now
`#{nslookup_command}`
nslookup_duration = duration_string(nslookup_start)

puts "#{"%05d" % i} [#{ruby_response.code}]: ruby #{ruby_duration} |
curl #{curl_duration} | nslookup #{nslookup_duration}"
end

* Send a kill -QUIT <consul_agent_pid> to the consul agent process once
you see the slow DNS manifest itself, you will get a dump of all the
goroutines running in the consul agent process
/var/vcap/sys/log/consul_agent/consul_agent.stderr.log. I would be curious
to see what it spits out.

Amit


On Wed, Oct 28, 2015 at 6:10 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Thanks for taking a look, fingers crossed you can see it happen as well.

Our 217 install is on stemcell 3026 and our 212 install is on 2989.

IaaS is CenturyLink Cloud.

-Matt

On Wed, Oct 28, 2015 at 6:08 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I got up to 10k on an AWS deployment of HEAD of cf-release with ruby
2.2, then started another loop on the same box with ruby 2.1. In the end,
they got up to 40-50k without showing any signs of change. I had to switch
to resolving the UAA endpoint, eventually google started responding with
302s.

I'm going to try with a cf-release 212 deployment on my bosh lite, but
eventually I want to try on the same stemcell as you're using.

On Wed, Oct 28, 2015 at 5:01 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt, this is awesome.

I'm trying to reproduce this with your script, up at 10k with no
change. I'm also shelling out to curl in the script, to see if both curl
and ruby get affected, and so, if they're affected at the same time.

What IaaS and stemcell are you using?

Thanks,
Amit

On Wed, Oct 28, 2015 at 2:54 PM, Dieu Cao <dcao(a)pivotal.io> wrote:

You might try moving the nameserver entry for the consul_agent in
/etc/resolv.conf on the cloud controller to the end to see if that helps.

-Dieu

On Wed, Oct 28, 2015 at 12:55 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Looks like you're right and we're experiencing the same issue as you
are Amit. We're suffering slow DNS lookups. The code is spending all of its
time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb.initialize
:879

I've experimented some with the environment and, after narrowing
things down to DNS, here's some minimal demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
while true do
beginning_time = Time.now
response = Net::HTTP.get_response(uri)

end_time = Time.now
i+=1
puts "#{"%04d" % i} Response: [#{response.code}], Elapsed: #{((end_time - beginning_time)*1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some
point, requests start taking 5 second longer, which I assume is a timeout.
One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs are the same. How many requests it takes before things
time out varies considerably (one run started in the 10s and another took
20k requests), but it always happens. After that, lookups take an
additional 5 second and never recover to their initial speed. This is why
restarting the cloud controller fixes the issue (temporarily).

The really slow cli calls (in the 1+min range) are simply due to the
amount of paging that a fetching data for a large org does, as that 5
seconds is multiplied out over several calls. Every user is feeling this
delay, it's just that it's only unworkable pulling the large datasets from
UAA.

I was not able to reproduce timeouts using a script calling "dig"
against localhost, only inside a ruby code.

The re-iterate our setup: we're running 212 without a consul server,
just the agents. I also successfully reproduce this problem in completely
different 217 install in a different datacenter. This setup also didn't
have an actual consul server, just the agent. I don't see anything in the
release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real headaches
for user management in our larger orgs. Amit: is there a bug we can follow?

-Matt


On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

You may not be running any consul servers, but you may have a consul
agent colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next time.

Amit,
1:20 delay is due to paging. The total call length for each page is
closer to 10s. Just included those two calls with paging by the cf command
line included numbers to demonstrate the dramatic difference after a
restart. Delays disappear after a restart. We're not running consul yet, so
it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

We've seen issues on some environments where requests to cc that
involve cc making a request to uaa or hm9k have a 5s delay while the local
consul agent fails to resolves the DNS for uaa/hm9k, before moving on to a
different resolver.

The expected behavior observed in almost all environments is that
the DNS request to consul agent fails fast and moves on to the next
resolver, we haven't figured out why a couple envs exhibit different
behavior. The impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't
explain your 1:20 delay though. Are you always seeing delays that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <zrobinson(a)pivotal.io>
wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Amit Kumar Gupta
 

Matt, that's awesome, thanks! Mind trying this?

require 'uri'
require 'net/http'
require 'logger'

SYSTEM_DOMAIN = '--CHANGE-ME--'

u = URI.parse('http://uaa.' + SYSTEM_DOMAIN + '/login')
h = Net::HTTP.new(u.host, u.port)
l = Logger.new('/var/vcap/data/tmp/slow-dns.log')
h.set_debug_output(l)

1.step do |i|
l.info('Request number: %04d' % i)
s = Time.now
r = h.head(u.path)
d = Time.now - s
l.info('Duration: %dms' % (d * 1000).round)
l.info('Response code: %d' % r.code)
l.error('!!! SLOW !!!') if d > 5
end

I'd want to know what we see in /var/vcap/data/tmp/slow-dns.log before and
after the DNS slowdown. By having the http object take a debug logger, we
can narrow down what Ruby is doing that's making it uniquely slow.

On Thu, Oct 29, 2015 at 7:39 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
Here's a run with the problem manifesting:

...
00248 [200]: ruby 26ms | curl 33ms | nslookup 21ms
00249 [200]: ruby 20ms | curl 32ms | nslookup 14ms
00250 [200]: ruby 18ms | curl 30ms | nslookup 17ms
00251 [200]: ruby 22ms | curl 31ms | nslookup 16ms
00252 [200]: ruby 23ms | curl 30ms | nslookup 16ms
00253 [200]: ruby 26ms | curl 40ms | nslookup 16ms
00254 [200]: ruby 20ms | curl 40ms | nslookup 14ms
00255 [200]: ruby 20ms | curl 35ms | nslookup 20ms
00256 [200]: ruby 17ms | curl 32ms | nslookup 14ms
00257 [200]: ruby 20ms | curl 37ms | nslookup 14ms
00258 [200]: ruby 25ms | curl 1038ms | nslookup 14ms
00259 [200]: ruby 27ms | curl 37ms | nslookup 13ms
00260 [200]: ruby 4020ms | curl 32ms | nslookup 16ms
00261 [200]: ruby 5032ms | curl 45ms | nslookup 14ms
00262 [200]: ruby 5021ms | curl 30ms | nslookup 14ms
00263 [200]: ruby 5027ms | curl 32ms | nslookup 16ms
00264 [200]: ruby 5025ms | curl 34ms | nslookup 15ms
00265 [200]: ruby 5029ms | curl 31ms | nslookup 14ms
00266 [200]: ruby 5030ms | curl 37ms | nslookup 18ms
00267 [200]: ruby 5022ms | curl 43ms | nslookup 14ms
00268 [200]: ruby 5026ms | curl 31ms | nslookup 17ms
00269 [200]: ruby 5027ms | curl 33ms | nslookup 14ms
00270 [200]: ruby 5025ms | curl 32ms | nslookup 14ms
00271 [200]: ruby 5022ms | curl 36ms | nslookup 15ms
00272 [200]: ruby 5030ms | curl 32ms | nslookup 13ms
00273 [200]: ruby 5024ms | curl 32ms | nslookup 13ms
00274 [200]: ruby 5028ms | curl 34ms | nslookup 14ms
00275 [200]: ruby 5048ms | curl 30ms | nslookup 14ms


It's definitely interesting that Ruby is the only one to manifest the
problem.

And here's the consul output:
https://gist.github.com/cholick/f7e91fb58891cc0d8f5a


On Thu, Oct 29, 2015 at 4:27 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey Matt,

Dieu's suggestion will fix your problem (you'll have to make the change
on all CC's), although it'll get undone on each redeploy. We do want to
find the root cause, but have not been able to reproduce it in our own
environments. If you're up for some investigation, may I suggest the
following:

* Run the following variation of your script on one of the CCs:

require 'uri'
require 'net/http'

SYSTEM_DOMAIN = '--CHANGE-ME--'

uaa_domain = "uaa.#{SYSTEM_DOMAIN}"
login_url = "https://#{uaa_domain}/login"

curl_command="curl -f #{login_url} 2>&1"
nslookup_command="nslookup #{uaa_domain} 2>&1"

puts 'STARTING SANITY CHECK'
curl_output = `#{curl_command}`
raise "'#{curl_command}' failed with output:\n#{curl_output}" unless
$?.to_i.zero?
puts 'SANITY CHECK PASSED'

def duration_string(start)
"#{((Time.now - start) * 1000).round}ms"
end

puts 'STARTING TEST'
1.step do |i|
uri = URI.parse(login_url)
ruby_start = Time.now
ruby_response = Net::HTTP.get_response(uri)
ruby_duration = duration_string(ruby_start)

curl_start = Time.now
`#{curl_command}`
curl_duration = duration_string(curl_start)

nslookup_start = Time.now
`#{nslookup_command}`
nslookup_duration = duration_string(nslookup_start)

puts "#{"%05d" % i} [#{ruby_response.code}]: ruby #{ruby_duration} |
curl #{curl_duration} | nslookup #{nslookup_duration}"
end

* Send a kill -QUIT <consul_agent_pid> to the consul agent process once
you see the slow DNS manifest itself, you will get a dump of all the
goroutines running in the consul agent process
/var/vcap/sys/log/consul_agent/consul_agent.stderr.log. I would be curious
to see what it spits out.

Amit


On Wed, Oct 28, 2015 at 6:10 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Thanks for taking a look, fingers crossed you can see it happen as well.

Our 217 install is on stemcell 3026 and our 212 install is on 2989.

IaaS is CenturyLink Cloud.

-Matt

On Wed, Oct 28, 2015 at 6:08 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I got up to 10k on an AWS deployment of HEAD of cf-release with ruby
2.2, then started another loop on the same box with ruby 2.1. In the end,
they got up to 40-50k without showing any signs of change. I had to switch
to resolving the UAA endpoint, eventually google started responding with
302s.

I'm going to try with a cf-release 212 deployment on my bosh lite, but
eventually I want to try on the same stemcell as you're using.

On Wed, Oct 28, 2015 at 5:01 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt, this is awesome.

I'm trying to reproduce this with your script, up at 10k with no
change. I'm also shelling out to curl in the script, to see if both curl
and ruby get affected, and so, if they're affected at the same time.

What IaaS and stemcell are you using?

Thanks,
Amit

On Wed, Oct 28, 2015 at 2:54 PM, Dieu Cao <dcao(a)pivotal.io> wrote:

You might try moving the nameserver entry for the consul_agent in
/etc/resolv.conf on the cloud controller to the end to see if that helps.

-Dieu

On Wed, Oct 28, 2015 at 12:55 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Looks like you're right and we're experiencing the same issue as you
are Amit. We're suffering slow DNS lookups. The code is spending all of its
time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb.initialize
:879

I've experimented some with the environment and, after narrowing
things down to DNS, here's some minimal demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
while true do
beginning_time = Time.now
response = Net::HTTP.get_response(uri)

end_time = Time.now
i+=1
puts "#{"%04d" % i} Response: [#{response.code}], Elapsed: #{((end_time - beginning_time)*1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some
point, requests start taking 5 second longer, which I assume is a timeout.
One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs are the same. How many requests it takes before things
time out varies considerably (one run started in the 10s and another took
20k requests), but it always happens. After that, lookups take an
additional 5 second and never recover to their initial speed. This is why
restarting the cloud controller fixes the issue (temporarily).

The really slow cli calls (in the 1+min range) are simply due to the
amount of paging that a fetching data for a large org does, as that 5
seconds is multiplied out over several calls. Every user is feeling this
delay, it's just that it's only unworkable pulling the large datasets from
UAA.

I was not able to reproduce timeouts using a script calling "dig"
against localhost, only inside a ruby code.

The re-iterate our setup: we're running 212 without a consul server,
just the agents. I also successfully reproduce this problem in completely
different 217 install in a different datacenter. This setup also didn't
have an actual consul server, just the agent. I don't see anything in the
release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real
headaches for user management in our larger orgs. Amit: is there a bug we
can follow?

-Matt


On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

You may not be running any consul servers, but you may have a
consul agent colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next
time.

Amit,
1:20 delay is due to paging. The total call length for each page
is closer to 10s. Just included those two calls with paging by the cf
command line included numbers to demonstrate the dramatic difference after
a restart. Delays disappear after a restart. We're not running consul yet,
so it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

We've seen issues on some environments where requests to cc that
involve cc making a request to uaa or hm9k have a 5s delay while the local
consul agent fails to resolves the DNS for uaa/hm9k, before moving on to a
different resolver.

The expected behavior observed in almost all environments is that
the DNS request to consul agent fails fast and moves on to the next
resolver, we haven't figured out why a couple envs exhibit different
behavior. The impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't
explain your 1:20 delay though. Are you always seeing delays that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <zrobinson(a)pivotal.io>
wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Matt Cholick
 

Amit,
Here are the results:

https://gist.github.com/cholick/b448df07e9e493369d9e

The before and after pictures look pretty similar, nothing jumps out as
interesting.

On Thu, Oct 29, 2015 at 11:28 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Matt, that's awesome, thanks! Mind trying this?

require 'uri'
require 'net/http'
require 'logger'

SYSTEM_DOMAIN = '--CHANGE-ME--'

u = URI.parse('http://uaa.' + SYSTEM_DOMAIN + '/login')
h = Net::HTTP.new(u.host, u.port)
l = Logger.new('/var/vcap/data/tmp/slow-dns.log')
h.set_debug_output(l)

1.step do |i|
l.info('Request number: %04d' % i)
s = Time.now
r = h.head(u.path)
d = Time.now - s
l.info('Duration: %dms' % (d * 1000).round)
l.info('Response code: %d' % r.code)
l.error('!!! SLOW !!!') if d > 5
end

I'd want to know what we see in /var/vcap/data/tmp/slow-dns.log before
and after the DNS slowdown. By having the http object take a debug logger,
we can narrow down what Ruby is doing that's making it uniquely slow.


On Thu, Oct 29, 2015 at 7:39 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
Here's a run with the problem manifesting:

...
00248 [200]: ruby 26ms | curl 33ms | nslookup 21ms
00249 [200]: ruby 20ms | curl 32ms | nslookup 14ms
00250 [200]: ruby 18ms | curl 30ms | nslookup 17ms
00251 [200]: ruby 22ms | curl 31ms | nslookup 16ms
00252 [200]: ruby 23ms | curl 30ms | nslookup 16ms
00253 [200]: ruby 26ms | curl 40ms | nslookup 16ms
00254 [200]: ruby 20ms | curl 40ms | nslookup 14ms
00255 [200]: ruby 20ms | curl 35ms | nslookup 20ms
00256 [200]: ruby 17ms | curl 32ms | nslookup 14ms
00257 [200]: ruby 20ms | curl 37ms | nslookup 14ms
00258 [200]: ruby 25ms | curl 1038ms | nslookup 14ms
00259 [200]: ruby 27ms | curl 37ms | nslookup 13ms
00260 [200]: ruby 4020ms | curl 32ms | nslookup 16ms
00261 [200]: ruby 5032ms | curl 45ms | nslookup 14ms
00262 [200]: ruby 5021ms | curl 30ms | nslookup 14ms
00263 [200]: ruby 5027ms | curl 32ms | nslookup 16ms
00264 [200]: ruby 5025ms | curl 34ms | nslookup 15ms
00265 [200]: ruby 5029ms | curl 31ms | nslookup 14ms
00266 [200]: ruby 5030ms | curl 37ms | nslookup 18ms
00267 [200]: ruby 5022ms | curl 43ms | nslookup 14ms
00268 [200]: ruby 5026ms | curl 31ms | nslookup 17ms
00269 [200]: ruby 5027ms | curl 33ms | nslookup 14ms
00270 [200]: ruby 5025ms | curl 32ms | nslookup 14ms
00271 [200]: ruby 5022ms | curl 36ms | nslookup 15ms
00272 [200]: ruby 5030ms | curl 32ms | nslookup 13ms
00273 [200]: ruby 5024ms | curl 32ms | nslookup 13ms
00274 [200]: ruby 5028ms | curl 34ms | nslookup 14ms
00275 [200]: ruby 5048ms | curl 30ms | nslookup 14ms


It's definitely interesting that Ruby is the only one to manifest the
problem.

And here's the consul output:
https://gist.github.com/cholick/f7e91fb58891cc0d8f5a


On Thu, Oct 29, 2015 at 4:27 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey Matt,

Dieu's suggestion will fix your problem (you'll have to make the change
on all CC's), although it'll get undone on each redeploy. We do want to
find the root cause, but have not been able to reproduce it in our own
environments. If you're up for some investigation, may I suggest the
following:

* Run the following variation of your script on one of the CCs:

require 'uri'
require 'net/http'

SYSTEM_DOMAIN = '--CHANGE-ME--'

uaa_domain = "uaa.#{SYSTEM_DOMAIN}"
login_url = "https://#{uaa_domain}/login"

curl_command="curl -f #{login_url} 2>&1"
nslookup_command="nslookup #{uaa_domain} 2>&1"

puts 'STARTING SANITY CHECK'
curl_output = `#{curl_command}`
raise "'#{curl_command}' failed with output:\n#{curl_output}" unless
$?.to_i.zero?
puts 'SANITY CHECK PASSED'

def duration_string(start)
"#{((Time.now - start) * 1000).round}ms"
end

puts 'STARTING TEST'
1.step do |i|
uri = URI.parse(login_url)
ruby_start = Time.now
ruby_response = Net::HTTP.get_response(uri)
ruby_duration = duration_string(ruby_start)

curl_start = Time.now
`#{curl_command}`
curl_duration = duration_string(curl_start)

nslookup_start = Time.now
`#{nslookup_command}`
nslookup_duration = duration_string(nslookup_start)

puts "#{"%05d" % i} [#{ruby_response.code}]: ruby #{ruby_duration} |
curl #{curl_duration} | nslookup #{nslookup_duration}"
end

* Send a kill -QUIT <consul_agent_pid> to the consul agent process once
you see the slow DNS manifest itself, you will get a dump of all the
goroutines running in the consul agent process
/var/vcap/sys/log/consul_agent/consul_agent.stderr.log. I would be curious
to see what it spits out.

Amit


On Wed, Oct 28, 2015 at 6:10 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Thanks for taking a look, fingers crossed you can see it happen as well.

Our 217 install is on stemcell 3026 and our 212 install is on 2989.

IaaS is CenturyLink Cloud.

-Matt

On Wed, Oct 28, 2015 at 6:08 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I got up to 10k on an AWS deployment of HEAD of cf-release with ruby
2.2, then started another loop on the same box with ruby 2.1. In the end,
they got up to 40-50k without showing any signs of change. I had to switch
to resolving the UAA endpoint, eventually google started responding with
302s.

I'm going to try with a cf-release 212 deployment on my bosh lite, but
eventually I want to try on the same stemcell as you're using.

On Wed, Oct 28, 2015 at 5:01 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt, this is awesome.

I'm trying to reproduce this with your script, up at 10k with no
change. I'm also shelling out to curl in the script, to see if both curl
and ruby get affected, and so, if they're affected at the same time.

What IaaS and stemcell are you using?

Thanks,
Amit

On Wed, Oct 28, 2015 at 2:54 PM, Dieu Cao <dcao(a)pivotal.io> wrote:

You might try moving the nameserver entry for the consul_agent in
/etc/resolv.conf on the cloud controller to the end to see if that helps.

-Dieu

On Wed, Oct 28, 2015 at 12:55 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Looks like you're right and we're experiencing the same issue as
you are Amit. We're suffering slow DNS lookups. The code is spending all of
its time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb.initialize
:879

I've experimented some with the environment and, after narrowing
things down to DNS, here's some minimal demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
while true do
beginning_time = Time.now
response = Net::HTTP.get_response(uri)

end_time = Time.now
i+=1
puts "#{"%04d" % i} Response: [#{response.code}], Elapsed: #{((end_time - beginning_time)*1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some
point, requests start taking 5 second longer, which I assume is a timeout.
One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs are the same. How many requests it takes before things
time out varies considerably (one run started in the 10s and another took
20k requests), but it always happens. After that, lookups take an
additional 5 second and never recover to their initial speed. This is why
restarting the cloud controller fixes the issue (temporarily).

The really slow cli calls (in the 1+min range) are simply due to
the amount of paging that a fetching data for a large org does, as that 5
seconds is multiplied out over several calls. Every user is feeling this
delay, it's just that it's only unworkable pulling the large datasets from
UAA.

I was not able to reproduce timeouts using a script calling "dig"
against localhost, only inside a ruby code.

The re-iterate our setup: we're running 212 without a consul
server, just the agents. I also successfully reproduce this problem in
completely different 217 install in a different datacenter. This setup also
didn't have an actual consul server, just the agent. I don't see anything
in the release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real
headaches for user management in our larger orgs. Amit: is there a bug we
can follow?

-Matt


On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

You may not be running any consul servers, but you may have a
consul agent colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next
time.

Amit,
1:20 delay is due to paging. The total call length for each page
is closer to 10s. Just included those two calls with paging by the cf
command line included numbers to demonstrate the dramatic difference after
a restart. Delays disappear after a restart. We're not running consul yet,
so it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

We've seen issues on some environments where requests to cc that
involve cc making a request to uaa or hm9k have a 5s delay while the local
consul agent fails to resolves the DNS for uaa/hm9k, before moving on to a
different resolver.

The expected behavior observed in almost all environments is
that the DNS request to consul agent fails fast and moves on to the next
resolver, we haven't figured out why a couple envs exhibit different
behavior. The impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't
explain your 1:20 delay though. Are you always seeing delays that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <
zrobinson(a)pivotal.io> wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Amit Kumar Gupta
 

Ah, my bad. We need to patch the logger to somehow include timestamps when
the net/http library calls << on it instead of calling info:

require 'uri'
require 'net/http'
require 'logger'

SYSTEM_DOMAIN = '--CHANGE-ME--'

u = URI.parse('http://uaa.' + SYSTEM_DOMAIN + '/login')
h = Net::HTTP.new(u.host, u.port)
l = Logger.new('/var/vcap/data/tmp/slow-dns.log')
def l.<<(msg); info(msg); end
h.set_debug_output(l)

1.step do |i|
l.info('Request number: %04d' % i)
s = Time.now
r = h.head(u.path)
d = Time.now - s
l.info('Duration: %dms' % (d * 1000).round)
l.info('Response code: %d' % r.code)
l.error('!!! SLOW !!!') if d > 5
end

On Fri, Oct 30, 2015 at 7:35 AM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
Here are the results:

https://gist.github.com/cholick/b448df07e9e493369d9e

The before and after pictures look pretty similar, nothing jumps out as
interesting.

On Thu, Oct 29, 2015 at 11:28 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Matt, that's awesome, thanks! Mind trying this?

require 'uri'
require 'net/http'
require 'logger'

SYSTEM_DOMAIN = '--CHANGE-ME--'

u = URI.parse('http://uaa.' + SYSTEM_DOMAIN + '/login')
h = Net::HTTP.new(u.host, u.port)
l = Logger.new('/var/vcap/data/tmp/slow-dns.log')
h.set_debug_output(l)

1.step do |i|
l.info('Request number: %04d' % i)
s = Time.now
r = h.head(u.path)
d = Time.now - s
l.info('Duration: %dms' % (d * 1000).round)
l.info('Response code: %d' % r.code)
l.error('!!! SLOW !!!') if d > 5
end

I'd want to know what we see in /var/vcap/data/tmp/slow-dns.log before
and after the DNS slowdown. By having the http object take a debug logger,
we can narrow down what Ruby is doing that's making it uniquely slow.


On Thu, Oct 29, 2015 at 7:39 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
Here's a run with the problem manifesting:

...
00248 [200]: ruby 26ms | curl 33ms | nslookup 21ms
00249 [200]: ruby 20ms | curl 32ms | nslookup 14ms
00250 [200]: ruby 18ms | curl 30ms | nslookup 17ms
00251 [200]: ruby 22ms | curl 31ms | nslookup 16ms
00252 [200]: ruby 23ms | curl 30ms | nslookup 16ms
00253 [200]: ruby 26ms | curl 40ms | nslookup 16ms
00254 [200]: ruby 20ms | curl 40ms | nslookup 14ms
00255 [200]: ruby 20ms | curl 35ms | nslookup 20ms
00256 [200]: ruby 17ms | curl 32ms | nslookup 14ms
00257 [200]: ruby 20ms | curl 37ms | nslookup 14ms
00258 [200]: ruby 25ms | curl 1038ms | nslookup 14ms
00259 [200]: ruby 27ms | curl 37ms | nslookup 13ms
00260 [200]: ruby 4020ms | curl 32ms | nslookup 16ms
00261 [200]: ruby 5032ms | curl 45ms | nslookup 14ms
00262 [200]: ruby 5021ms | curl 30ms | nslookup 14ms
00263 [200]: ruby 5027ms | curl 32ms | nslookup 16ms
00264 [200]: ruby 5025ms | curl 34ms | nslookup 15ms
00265 [200]: ruby 5029ms | curl 31ms | nslookup 14ms
00266 [200]: ruby 5030ms | curl 37ms | nslookup 18ms
00267 [200]: ruby 5022ms | curl 43ms | nslookup 14ms
00268 [200]: ruby 5026ms | curl 31ms | nslookup 17ms
00269 [200]: ruby 5027ms | curl 33ms | nslookup 14ms
00270 [200]: ruby 5025ms | curl 32ms | nslookup 14ms
00271 [200]: ruby 5022ms | curl 36ms | nslookup 15ms
00272 [200]: ruby 5030ms | curl 32ms | nslookup 13ms
00273 [200]: ruby 5024ms | curl 32ms | nslookup 13ms
00274 [200]: ruby 5028ms | curl 34ms | nslookup 14ms
00275 [200]: ruby 5048ms | curl 30ms | nslookup 14ms


It's definitely interesting that Ruby is the only one to manifest the
problem.

And here's the consul output:
https://gist.github.com/cholick/f7e91fb58891cc0d8f5a


On Thu, Oct 29, 2015 at 4:27 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey Matt,

Dieu's suggestion will fix your problem (you'll have to make the change
on all CC's), although it'll get undone on each redeploy. We do want to
find the root cause, but have not been able to reproduce it in our own
environments. If you're up for some investigation, may I suggest the
following:

* Run the following variation of your script on one of the CCs:

require 'uri'
require 'net/http'

SYSTEM_DOMAIN = '--CHANGE-ME--'

uaa_domain = "uaa.#{SYSTEM_DOMAIN}"
login_url = "https://#{uaa_domain}/login"

curl_command="curl -f #{login_url} 2>&1"
nslookup_command="nslookup #{uaa_domain} 2>&1"

puts 'STARTING SANITY CHECK'
curl_output = `#{curl_command}`
raise "'#{curl_command}' failed with output:\n#{curl_output}" unless
$?.to_i.zero?
puts 'SANITY CHECK PASSED'

def duration_string(start)
"#{((Time.now - start) * 1000).round}ms"
end

puts 'STARTING TEST'
1.step do |i|
uri = URI.parse(login_url)
ruby_start = Time.now
ruby_response = Net::HTTP.get_response(uri)
ruby_duration = duration_string(ruby_start)

curl_start = Time.now
`#{curl_command}`
curl_duration = duration_string(curl_start)

nslookup_start = Time.now
`#{nslookup_command}`
nslookup_duration = duration_string(nslookup_start)

puts "#{"%05d" % i} [#{ruby_response.code}]: ruby #{ruby_duration}
| curl #{curl_duration} | nslookup #{nslookup_duration}"
end

* Send a kill -QUIT <consul_agent_pid> to the consul agent process once
you see the slow DNS manifest itself, you will get a dump of all the
goroutines running in the consul agent process
/var/vcap/sys/log/consul_agent/consul_agent.stderr.log. I would be curious
to see what it spits out.

Amit


On Wed, Oct 28, 2015 at 6:10 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Thanks for taking a look, fingers crossed you can see it happen as
well.

Our 217 install is on stemcell 3026 and our 212 install is on 2989.

IaaS is CenturyLink Cloud.

-Matt

On Wed, Oct 28, 2015 at 6:08 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I got up to 10k on an AWS deployment of HEAD of cf-release with ruby
2.2, then started another loop on the same box with ruby 2.1. In the end,
they got up to 40-50k without showing any signs of change. I had to switch
to resolving the UAA endpoint, eventually google started responding with
302s.

I'm going to try with a cf-release 212 deployment on my bosh lite,
but eventually I want to try on the same stemcell as you're using.

On Wed, Oct 28, 2015 at 5:01 PM, Amit Gupta <agupta(a)pivotal.io>
wrote:

Thanks Matt, this is awesome.

I'm trying to reproduce this with your script, up at 10k with no
change. I'm also shelling out to curl in the script, to see if both curl
and ruby get affected, and so, if they're affected at the same time.

What IaaS and stemcell are you using?

Thanks,
Amit

On Wed, Oct 28, 2015 at 2:54 PM, Dieu Cao <dcao(a)pivotal.io> wrote:

You might try moving the nameserver entry for the consul_agent in
/etc/resolv.conf on the cloud controller to the end to see if that helps.

-Dieu

On Wed, Oct 28, 2015 at 12:55 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Looks like you're right and we're experiencing the same issue as
you are Amit. We're suffering slow DNS lookups. The code is spending all of
its time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb.initialize
:879

I've experimented some with the environment and, after narrowing
things down to DNS, here's some minimal demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
while true do
beginning_time = Time.now
response = Net::HTTP.get_response(uri)

end_time = Time.now
i+=1
puts "#{"%04d" % i} Response: [#{response.code}], Elapsed: #{((end_time - beginning_time)*1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some
point, requests start taking 5 second longer, which I assume is a timeout.
One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs are the same. How many requests it takes before things
time out varies considerably (one run started in the 10s and another took
20k requests), but it always happens. After that, lookups take an
additional 5 second and never recover to their initial speed. This is why
restarting the cloud controller fixes the issue (temporarily).

The really slow cli calls (in the 1+min range) are simply due to
the amount of paging that a fetching data for a large org does, as that 5
seconds is multiplied out over several calls. Every user is feeling this
delay, it's just that it's only unworkable pulling the large datasets from
UAA.

I was not able to reproduce timeouts using a script calling "dig"
against localhost, only inside a ruby code.

The re-iterate our setup: we're running 212 without a consul
server, just the agents. I also successfully reproduce this problem in
completely different 217 install in a different datacenter. This setup also
didn't have an actual consul server, just the agent. I don't see anything
in the release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real
headaches for user management in our larger orgs. Amit: is there a bug we
can follow?

-Matt


On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

You may not be running any consul servers, but you may have a
consul agent colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next
time.

Amit,
1:20 delay is due to paging. The total call length for each page
is closer to 10s. Just included those two calls with paging by the cf
command line included numbers to demonstrate the dramatic difference after
a restart. Delays disappear after a restart. We're not running consul yet,
so it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

We've seen issues on some environments where requests to cc
that involve cc making a request to uaa or hm9k have a 5s delay while the
local consul agent fails to resolves the DNS for uaa/hm9k, before moving on
to a different resolver.

The expected behavior observed in almost all environments is
that the DNS request to consul agent fails fast and moves on to the next
resolver, we haven't figured out why a couple envs exhibit different
behavior. The impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't
explain your 1:20 delay though. Are you always seeing delays that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <
zrobinson(a)pivotal.io> wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha