
Re: Error to make a Request to update password in UAA

Juan Antonio Breña Moral <bren at juanantonio.info...>
 

Many thanks for the reply.

Next Monday morning, I will update the class to test again.

Cheers


Re: cloud_controller_ng performance degrades slowly over time

Amit Kumar Gupta
 

Ah, my bad. We need to patch the logger to somehow include timestamps when
the net/http library calls << on it instead of calling info:

require 'uri'
require 'net/http'
require 'logger'

SYSTEM_DOMAIN = '--CHANGE-ME--'

u = URI.parse('http://uaa.' + SYSTEM_DOMAIN + '/login')
h = Net::HTTP.new(u.host, u.port)
l = Logger.new('/var/vcap/data/tmp/slow-dns.log')
def l.<<(msg); info(msg); end
h.set_debug_output(l)

1.step do |i|
  l.info('Request number: %04d' % i)
  s = Time.now
  r = h.head(u.path)
  d = Time.now - s
  l.info('Duration: %dms' % (d * 1000).round)
  l.info('Response code: %d' % r.code)
  l.error('!!! SLOW !!!') if d > 5
end
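For context on why the patch routes << through info: the stdlib Logger's << method writes the message verbatim to the log device, bypassing the formatter (and therefore the timestamp) entirely, while info goes through the formatter. A quick sketch showing the difference:

```ruby
require 'logger'
require 'stringio'

buf = StringIO.new
log = Logger.new(buf)

log << "raw line\n"        # Logger#<< writes the message verbatim: no timestamp
log.info('tagged line')    # Logger#info runs the message through the formatter

print buf.string
# The first line is bare; the second is prefixed "I, [<timestamp> #<pid>]  INFO -- :"
```

Since net/http writes its debug output via <<, without the patch none of those lines would carry timestamps.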

On Fri, Oct 30, 2015 at 7:35 AM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
Here are the results:

https://gist.github.com/cholick/b448df07e9e493369d9e

The before and after pictures look pretty similar, nothing jumps out as
interesting.

On Thu, Oct 29, 2015 at 11:28 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Matt, that's awesome, thanks! Mind trying this?

require 'uri'
require 'net/http'
require 'logger'

SYSTEM_DOMAIN = '--CHANGE-ME--'

u = URI.parse('http://uaa.' + SYSTEM_DOMAIN + '/login')
h = Net::HTTP.new(u.host, u.port)
l = Logger.new('/var/vcap/data/tmp/slow-dns.log')
h.set_debug_output(l)

1.step do |i|
  l.info('Request number: %04d' % i)
  s = Time.now
  r = h.head(u.path)
  d = Time.now - s
  l.info('Duration: %dms' % (d * 1000).round)
  l.info('Response code: %d' % r.code)
  l.error('!!! SLOW !!!') if d > 5
end

I'd want to know what we see in /var/vcap/data/tmp/slow-dns.log before
and after the DNS slowdown. By having the http object take a debug logger,
we can narrow down what Ruby is doing that's making it uniquely slow.


On Thu, Oct 29, 2015 at 7:39 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
Here's a run with the problem manifesting:

...
00248 [200]: ruby 26ms | curl 33ms | nslookup 21ms
00249 [200]: ruby 20ms | curl 32ms | nslookup 14ms
00250 [200]: ruby 18ms | curl 30ms | nslookup 17ms
00251 [200]: ruby 22ms | curl 31ms | nslookup 16ms
00252 [200]: ruby 23ms | curl 30ms | nslookup 16ms
00253 [200]: ruby 26ms | curl 40ms | nslookup 16ms
00254 [200]: ruby 20ms | curl 40ms | nslookup 14ms
00255 [200]: ruby 20ms | curl 35ms | nslookup 20ms
00256 [200]: ruby 17ms | curl 32ms | nslookup 14ms
00257 [200]: ruby 20ms | curl 37ms | nslookup 14ms
00258 [200]: ruby 25ms | curl 1038ms | nslookup 14ms
00259 [200]: ruby 27ms | curl 37ms | nslookup 13ms
00260 [200]: ruby 4020ms | curl 32ms | nslookup 16ms
00261 [200]: ruby 5032ms | curl 45ms | nslookup 14ms
00262 [200]: ruby 5021ms | curl 30ms | nslookup 14ms
00263 [200]: ruby 5027ms | curl 32ms | nslookup 16ms
00264 [200]: ruby 5025ms | curl 34ms | nslookup 15ms
00265 [200]: ruby 5029ms | curl 31ms | nslookup 14ms
00266 [200]: ruby 5030ms | curl 37ms | nslookup 18ms
00267 [200]: ruby 5022ms | curl 43ms | nslookup 14ms
00268 [200]: ruby 5026ms | curl 31ms | nslookup 17ms
00269 [200]: ruby 5027ms | curl 33ms | nslookup 14ms
00270 [200]: ruby 5025ms | curl 32ms | nslookup 14ms
00271 [200]: ruby 5022ms | curl 36ms | nslookup 15ms
00272 [200]: ruby 5030ms | curl 32ms | nslookup 13ms
00273 [200]: ruby 5024ms | curl 32ms | nslookup 13ms
00274 [200]: ruby 5028ms | curl 34ms | nslookup 14ms
00275 [200]: ruby 5048ms | curl 30ms | nslookup 14ms


It's definitely interesting that Ruby is the only one to manifest the
problem.

And here's the consul output:
https://gist.github.com/cholick/f7e91fb58891cc0d8f5a


On Thu, Oct 29, 2015 at 4:27 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey Matt,

Dieu's suggestion will fix your problem (you'll have to make the change
on all CC's), although it'll get undone on each redeploy. We do want to
find the root cause, but have not been able to reproduce it in our own
environments. If you're up for some investigation, may I suggest the
following:

* Run the following variation of your script on one of the CCs:

require 'uri'
require 'net/http'

SYSTEM_DOMAIN = '--CHANGE-ME--'

uaa_domain = "uaa.#{SYSTEM_DOMAIN}"
login_url = "https://#{uaa_domain}/login"

curl_command = "curl -f #{login_url} 2>&1"
nslookup_command = "nslookup #{uaa_domain} 2>&1"

puts 'STARTING SANITY CHECK'
curl_output = `#{curl_command}`
raise "'#{curl_command}' failed with output:\n#{curl_output}" unless $?.to_i.zero?
puts 'SANITY CHECK PASSED'

def duration_string(start)
  "#{((Time.now - start) * 1000).round}ms"
end

puts 'STARTING TEST'
1.step do |i|
  uri = URI.parse(login_url)
  ruby_start = Time.now
  ruby_response = Net::HTTP.get_response(uri)
  ruby_duration = duration_string(ruby_start)

  curl_start = Time.now
  `#{curl_command}`
  curl_duration = duration_string(curl_start)

  nslookup_start = Time.now
  `#{nslookup_command}`
  nslookup_duration = duration_string(nslookup_start)

  puts "#{'%05d' % i} [#{ruby_response.code}]: ruby #{ruby_duration} | curl #{curl_duration} | nslookup #{nslookup_duration}"
end

* Send a kill -QUIT <consul_agent_pid> to the consul agent process once
you see the slow DNS manifest itself; you will get a dump of all the
goroutines running in the consul agent process in
/var/vcap/sys/log/consul_agent/consul_agent.stderr.log. I would be curious
to see what it spits out.

Amit


On Wed, Oct 28, 2015 at 6:10 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Thanks for taking a look, fingers crossed you can see it happen as
well.

Our 217 install is on stemcell 3026 and our 212 install is on 2989.

IaaS is CenturyLink Cloud.

-Matt

On Wed, Oct 28, 2015 at 6:08 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I got up to 10k on an AWS deployment of HEAD of cf-release with ruby
2.2, then started another loop on the same box with ruby 2.1. In the end,
they got up to 40-50k without showing any signs of change. I had to switch
to resolving the UAA endpoint; eventually Google started responding with
302s.

I'm going to try with a cf-release 212 deployment on my bosh lite,
but eventually I want to try on the same stemcell as you're using.

On Wed, Oct 28, 2015 at 5:01 PM, Amit Gupta <agupta(a)pivotal.io>
wrote:

Thanks Matt, this is awesome.

I'm trying to reproduce this with your script, up at 10k with no
change. I'm also shelling out to curl in the script, to see if both curl
and ruby get affected, and so, if they're affected at the same time.

What IaaS and stemcell are you using?

Thanks,
Amit

On Wed, Oct 28, 2015 at 2:54 PM, Dieu Cao <dcao(a)pivotal.io> wrote:

You might try moving the nameserver entry for the consul_agent in
/etc/resolv.conf on the cloud controller to the end, to see if that helps.
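To check the resolver order before and after making that change, here's a quick sketch (assumes a standard /etc/resolv.conf; on a CC box the consul agent's 127.0.0.1 entry is typically listed first):

```ruby
# List the nameserver entries in the order the libc resolver will try them.
conf = File.exist?('/etc/resolv.conf') ? File.read('/etc/resolv.conf') : ''
nameservers = conf.lines.grep(/^nameserver\s/).map { |line| line.split.last }
p nameservers
```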

-Dieu

On Wed, Oct 28, 2015 at 12:55 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Looks like you're right and we're experiencing the same issue as
you are, Amit. We're suffering slow DNS lookups. The code is spending all of
its time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb:879 (in initialize)

I've experimented some with the environment and, after narrowing
things down to DNS, here's some minimal code demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
loop do
  beginning_time = Time.now
  response = Net::HTTP.get_response(uri)
  end_time = Time.now

  i += 1
  puts "#{'%04d' % i} Response: [#{response.code}], Elapsed: #{((end_time - beginning_time) * 1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some
point, requests start taking 5 seconds longer, which I assume is a timeout.
One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs are the same. How many requests it takes before things
time out varies considerably (one run started in the 10s and another took
20k requests), but it always happens. After that, lookups take an
additional 5 seconds and never recover to their initial speed. This is why
restarting the cloud controller fixes the issue (temporarily).

The really slow cli calls (in the 1+ min range) are simply due to
the amount of paging that fetching data for a large org does, as that 5
seconds is multiplied out over several calls. Every user is feeling this
delay; it's just that it's only unworkable when pulling the large datasets
from UAA.

I was not able to reproduce the timeouts using a script calling "dig"
against localhost, only inside Ruby code.
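One way to narrow that gap (a sketch, not something from the thread): time Ruby's resolver directly, without any HTTP on top. Note that Net::HTTP resolves names through the C-level getaddrinfo, whereas Resolv is Ruby's pure-Ruby resolver, so the two can behave differently; comparing both against dig may show which path is slow. 'localhost' below is a placeholder for the real UAA hostname.

```ruby
require 'resolv'

host = 'localhost'   # placeholder: substitute uaa.<system-domain>

3.times do
  start = Time.now
  addr = Resolv.getaddress(host)   # pure-Ruby resolver path
  puts format('%s -> %s in %dms', host, addr, ((Time.now - start) * 1000).round)
end
```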

To reiterate our setup: we're running 212 without a consul
server, just the agents. I also successfully reproduced this problem in a
completely different 217 install in a different datacenter. That setup also
didn't have an actual consul server, just the agent. I don't see anything
in the release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real
headaches for user management in our larger orgs. Amit: is there a bug we
can follow?

-Matt


On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

You may not be running any consul servers, but you may have a
consul agent colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next
time.

Amit,
The 1:20 delay is due to paging. The total call length for each page
is closer to 10s. I just included those two calls, with the paging done by
the cf command line, to demonstrate the dramatic difference after a
restart. The delays disappear after a restart. We're not running consul
yet, so it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

We've seen issues in some environments where requests to cc
that involve cc making a request to uaa or hm9k have a 5s delay while the
local consul agent fails to resolve the DNS for uaa/hm9k, before moving on
to a different resolver.

The expected behavior, observed in almost all environments, is
that the DNS request to the consul agent fails fast and resolution moves on
to the next resolver; we haven't figured out why a couple of envs exhibit
different behavior. The impact is a 5 or 10s delay (5 or 10, not 5 to 10).
It doesn't explain your 1:20 delay though. Are you always seeing delays
that long?

Amit


On Thursday, October 8, 2015, Zach Robinson <
zrobinson(a)pivotal.io> wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Re: Problem deploying basic Apps on PWS

Charles Wu
 

You can also download the latest CLI.

Note that all new apps deployed on PWS default to Diego as the app runner
environment. The enable-diego command is only needed to switch DEA-deployed
apps to use Diego.

br,

Charles


On Fri, Oct 30, 2015 at 1:56 AM, Juan Antonio Breña Moral <
bren(a)juanantonio.info> wrote:

Hi Charles,

You gave me the clue!!!
Yesterday, I updated the app and I could deploy on PWS.

In environments without Diego, the way to get the port for a Node app is:

var localPort = process.env.VCAP_APP_PORT || 5000;

With Diego the way is:

var localPort = process.env.PORT || 5000;

If the developer uses the Go CLI, it is necessary to indicate that the
application uses Diego:

cf push APP_XXX --no-start
cf enable-diego APP_XXX
cf start APP_XXX


Re: Permission denied error when unpacking droplet

Noburou TANIGUCHI
 

I am not sure at all, but it might be related to umask.

What is the umask of the user you deployed your CF with? (I assume you are
talking about a private CF.)

I've been feeling there's an implicit assumption in the cf deployment with
bosh and cf-release that the umask is 022 (or 002, the Ubuntu default).
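To illustrate why the umask would matter here (a sketch; the file name is made up): with a restrictive umask like 077, files a job creates lose the group/other read bits, which is exactly the kind of thing that produces permission-denied errors when another user unpacks the droplet.

```ruby
require 'tmpdir'

dir = Dir.mktmpdir
old = File.umask(0o077)                        # simulate a restrictive umask
path = File.join(dir, 'droplet-file')
File.write(path, 'payload')                    # open(2) asks for 0666; umask masks it to 0600
mode = File.stat(path).mode & 0o777
printf("umask 077 -> file mode %o\n", mode)    # prints 600: group/other cannot read
File.umask(old)                                # restore the original umask
```

With the assumed 022 umask the same file would come out 0644, readable by everyone.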



-----
I'm not a ...
noburou taniguchi
--
View this message in context: http://cf-dev.70369.x6.nabble.com/cf-dev-Re-Permission-denied-error-when-unpacking-droplet-tp2441p2537.html
Sent from the CF Dev mailing list archive at Nabble.com.


Re: how to get the CF endpoint API in my program

Noburou TANIGUCHI
 

There may be room for a proposal that adds the CC endpoint to the
`VCAP_APPLICATION` environment variable. It seems to add no additional
risk. However, there must be one or more reasonable use cases for it.

Why do you want to know the CF endpoint from your app running on CF, zooba?
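For illustration, this is how an app might read such a field if the proposal landed (a sketch; the 'cf_api' key name is hypothetical here, not an existing guarantee of the `VCAP_APPLICATION` payload):

```ruby
require 'json'

# VCAP_APPLICATION is a JSON blob Cloud Foundry sets in each app container.
raw  = ENV.fetch('VCAP_APPLICATION', '{}')
vcap = JSON.parse(raw)

# 'cf_api' is the hypothetical key for the CC endpoint discussed above.
endpoint = vcap['cf_api'] || 'unknown'
puts "CC endpoint: #{endpoint}"
```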



--
View this message in context: http://cf-dev.70369.x6.nabble.com/cf-dev-how-to-get-the-CF-endpoint-API-in-my-program-tp2385p2536.html


How to install a custom stack in your own Cloud Foundry cluster?

Daniel van Dorp
 

This is a follow-up of this Github conversation: https://github.com/cloudfoundry/stacks/issues/20

Currently we ship the Java Runtime and various dependencies in the droplet but found the packaging, uploading to the blob store, downloading and unpacking steps to be quite slow (> 1 minute).

As we only have apps in our cluster which have common dependencies (Oracle Java Runtime, large .jars), it would be great if we can build our custom stack and install it on our CF cluster, so we can optimize the buildpack to omit certain things from the droplet as they are already present on the root fs.

Can this (something like "cf create-stack"/"cf add-stack") be done?
Compiling our own stack was easy: forking https://github.com/cloudfoundry/stacks and modifying it to our needs by adding things like the Oracle Java Runtime to be available by default.
However, we have no clue how to install/add it in CF and make it the default stack.
We hope others might have production experience with making their own stacks.

Some more context: this is for our users' development environments, so it's mostly about getting changed code live as quickly as possible. Downtime is not an issue. The problem is that the droplet gets pretty bloated (150M+, while the app itself is < 10M) and transferring that is slow. We're aiming to minimize time from app push to accepting http requests.

JT Archie (@jtarchie on Github) was already kind enough to provide us with some helpful information, but has pointed us in this direction for further questions and/or help.


Re: Configured SSL/TLS for Cloud Foundry 170 release

Noburou TANIGUCHI
 

Hi Pravin,

I may be hitting a wrong point, but I've recently heard of a similar
problem, so I want to ask:

Did you assign the SpaceDeveloper role for the target space to the user you
are using for `cf push`?

If the answer is NO, you should set the space role for the user.
cf.
https://docs.cloudfoundry.org/adminguide/cli-user-management.html#space-roles



Pravin Mishra wrote
Hello All,

We have deployed Cloudfoundry 170 and Configured SSL/TLS.
<https://johnpfield.wordpress.com/2014/09/10/configuring-ssltls-for-cloud-foundry/>

- Deployment is successful
- Able to login
- Able to create Organization/Space

When deploying an application, we get the warning below:

Warning: error tailing logs
Unauthorized error: You are not authorized. message:"Error: Invalid
authorization" message_type:ERR timestamp:1445336917813794880
app_id:"07841692-f22e-4caa-8a86-edf999d15817" source_name:"LGR"

Then the error:

FAILED
TIP: use 'cf logs railswithoutdb --recent' for more information

When tracing the logs:

FAILED
Unauthorized error: You are not authorized. message:"Error: Invalid
authorization" message_type:ERR timestamp:1445337478878348487
app_id:"07841692-f22e-4caa-8a86-edf999d15817" source_name:"LGR"

Note: I am pushing application using Admin user.

Best Regards,
Pravin Mishra
<https://johnpfield.wordpress.com/2014/09/10/configuring-ssltls-for-cloud-foundry/>




--
View this message in context: http://cf-dev.70369.x6.nabble.com/cf-dev-Configured-SSL-TLS-for-Cloud-Foundry-170-release-tp2338p2534.html


Re: PHP extension 'gettext' doesn't work?

Mike Dalessio
 

Unfortunately, the apt-buildpack only works for installing staging-time
dependencies, and not runtime dependencies.

It could be made to work, but the core buildpacks team simply have not done
so because nobody has asked (yet). ;)

On Fri, Oct 30, 2015 at 1:37 PM, Guillaume Berche <bercheg(a)gmail.com> wrote:

I agree the inclusion of the lang pack into linuxfs2 seems best option.

I'm wondering, though, whether a temporary workaround could be to install
the "locales" debian package using the apt-buildpack [1] (no sudo needed)
and combine it with the php buildpack using the multi buildpack [2]. I was
planning to test that for another purpose but haven't had the chance yet.
I'm interested in hearing the outcome.

Guillaume.

[1] https://github.com/pivotal-cf-experimental/apt-buildpack

[2] https://github.com/ddollar/heroku-buildpack-multi
On 27 Oct 2015 at 02:41, "Hiroaki Ukaji" <dt3snow.w(a)gmail.com> wrote:


Hi.

Thanks to you, I understood why i18n with gettext didn't work on CF.
Certainly, the language pack "ja_JP.utf8" only exists on my local
machine.

Anyway, we're glad to hear that the debian package "locales" will be added
to the rootfs.
It will resolve this issue and then we will be able to manage i18n by
gettext extension on CF.


Thanks a lot.

Hiroaki UKAJI



--
View this message in context: http://cf-dev.70369.x6.nabble.com/cf-dev-PHP-extension-gettext-doesn-t-work-tp1984p2450.html




Re: Logs, Timestamps

Guillaume Berche
 

The steno-prettify tool may be worth a try to format timestamps as
human-readable dates on the VM itself. See the doc I contributed at [1] for
a sample invocation in-situ on the VM. I haven't tested it in a while; let
me know if this still works or the doc needs an update (the procedure had
been broken a while ago [2]).

Hope this helps,

Guillaume.

[1] http://docs.cloudfoundry.org/running/troubleshooting.html#failing-job
[2] https://github.com/cloudfoundry/docs-running-cf/pull/7

On 28 Oct 2015 at 09:58, "Bharath Posa" <bharathp(a)vedams.com> wrote:

Hi Daniel,


The log format basically comes from a logging library called steno:
https://github.com/cloudfoundry/steno

It was mentioned that steno can be configured to emit logs in a human-readable
format. I never did that; I think you can have a look into it.

You can also look at the firehose and nozzles projects. I think those are
also useful:

http://docs.cloudfoundry.org/loggregator/architecture.html#firehose

http://docs.cloudfoundry.org/loggregator/architecture.html#nozzles

Tutorial on nozzles:
http://docs.cloudfoundry.org/loggregator/nozzle-tutorial.html


regards
Bharath

On Wed, Oct 28, 2015 at 2:06 PM, Daniel Jones <
daniel.jones(a)engineerbetter.com> wrote:

Hi all,

Why are logs like those of the Cloud Controller dated with machine-readable
timestamps? Are there any tools to parse CF logs in-situ?

I find it a constant source of frustration trying to debug an issue
reported by a human with a human-parseable date, putting that into a Unix
timestamp converter, searching through logs for that timestamp, then
scrolling up and down having to occasionally copy/paste a timestamp back
into a converter to see if I've gone too far away from my rough target time.

Computers are rather good at converting data formats - my brain, not so
much. Wouldn't it make more sense to have the logs human-readable by
default, and if an automated system needs to ingest those logs, let *it* do
the parsing?
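For what it's worth, the one-off conversion is a Ruby one-liner (a sketch; the sample value is a nanosecond epoch timestamp like the ones seen elsewhere in this digest):

```ruby
# Convert a nanosecond epoch timestamp to a readable UTC time.
ns = 1_445_336_917_813_794_880        # nanoseconds since the Unix epoch
t  = Time.at(ns / 1_000_000_000.0).utc
puts t                                 # 2015-10-20 10:28:37 UTC
```

Still a manual step, of course, which rather proves the point about letting the ingesting system do the parsing.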

Regards,

Daniel Jones
EngineerBetter.com


Re: doppler issue which fails to emit logs with syslog protocol on CFv212

Amit Kumar Gupta
 

Hi Masumi,

What information exactly would you like to know about the etcd events?
The error message is "Got error while waiting for ETCD events: store
request timed". This could be due to several reasons -- network issues,
etcd slowness, etcd being down, etc. If etcd is down, other components
would be affected. If the network is slow for some connections, it may
only affect some components.

Several components in cf-release talk to etcd:

- doppler
- metron_agent
- syslog_drain_binder
- loggregator_trafficcontroller
- routing-api (not router)
- hm9000

Please let us know what details you'd like to know.

Best,
Amit

On Fri, Oct 30, 2015 at 3:40 AM, Masumi Ito <msmi10f(a)gmail.com> wrote:

Hi Warren and Nino,

Let me update our status.

Yes, the root cause was the inconsistency of a bulk user id between cc and
the syslog drain binder due to the bug (a hardcoded id). We confirmed the
doppler had recovered successfully.

Also, I would appreciate the information about ETCD events.

Regards,
Masumi



--
View this message in context: http://cf-dev.70369.x6.nabble.com/doppler-issue-which-fails-to-emit-logs-with-syslog-protocol-on-CFv212-tp2418p2524.html


Re: xip.io IO errors

Guillaume Berche
 

Would the community release embedding PowerDNS to provide a xip.io-style
implementation be applicable to your context and useful [1]? This relies
on DNS delegation though. I haven't tested it myself yet.

[1] https://github.com/cloudfoundry-community/xip-release

Guillaume.

On 30 Oct 2015 at 01:45, "Dan Wendorf" <dwendorf(a)pivotal.io> wrote:

If you're using the common bosh lite IP of 10.244.0.34, you can also use
the more-reliable *.bosh-lite.com

On Thu, Oct 29, 2015, 5:40 PM Amit Gupta <agupta(a)pivotal.io> wrote:

Yes, xip.io is flaky. I would recommend setting up real DNS if you want
to avoid the 2% failure rate and need a longer-term, reliable solution.

On Thu, Oct 29, 2015 at 4:55 PM, Sumanth Yamala <Sumanth.Yamala(a)sas.com>
wrote:

Hi,



Currently we see IO errors when trying to contact xip.io – this happens
around 2% of the time. To apply a temporary fix, I updated the /etc/hosts
of the ha_proxy VM to map the IP to the service routes of each app
deployed on the system.

After the above change I still see the packet drops – so I wanted to know
if this needs to be done on all the runners as well. Any thoughts?

Thanks,

Sumanth


Re: Request timeout in CloudFoundry

Mike Youngstrom <youngm@...>
 

Sorry, after looking at our manifests it appears we actually set it to 0
not -1.

properties:
  request_timeout_in_seconds: 0

Then you need to set CC timeout to something else since nginx doesn't like
0.
jobs:
- name: cloud_controller
  properties:
    request_timeout_in_seconds: 900
  templates:
  - name: cloud_controller_ng
    release: cf

Mike

On Fri, Oct 30, 2015 at 7:24 AM, Kayode Odeyemi <dreyemi(a)gmail.com> wrote:

Hi Mike,

Could you let me know how you set timeout on gorouter to -1?

Appreciate your response.

On Fri, Sep 4, 2015 at 5:27 PM, Mike Youngstrom <youngm(a)gmail.com> wrote:

I have disabled the timeout on the gorouter ("-1"), trusting our front-end
load balancer to manage the timeout. That way I only need to worry about
configuring the timeout on one device in my proxy chain. That may simplify
your problem. Are there any theoretical issues with that approach? It
seems to have worked well for us.

Mike

On Fri, Sep 4, 2015 at 6:49 AM, James Bayer <jbayer(a)pivotal.io> wrote:

i would trace through the various system component logs to see where the
timeout is starting with each component that the client connects to and
seeing if that request came through. by submitting a silly uniquely
identifiable hostname you can usually find something quickly by grepping
logs.

On Thu, Sep 3, 2015 at 6:47 AM, Kayode Odeyemi <dreyemi(a)gmail.com>
wrote:

Hi,

I'm experiencing this same issue.

I have a NodeJS app, using the nodejs buildpack. A request taking
approximately 80 seconds gets timed out (502 response). We're using the
default CF ELBs (gorouter and ha_proxy).

On Tue, Aug 25, 2015 at 10:57 PM, CF Runtime <cfruntime(a)gmail.com>
wrote:

Hi,

We think this message was dropped and so we are posting it again.

How long of a timeout are you seeing? The gorouter and ha_proxy have
a default timeout of 15 minutes. If you are using an ELB or some
other load balancer, it may have its own timeout set. I'm not sure
if any of the app servers installed by buildpacks have a built-in
timeout or not; what type of app is it?

Joseph and Dan
OSS Release Integration Team

On Mon, Aug 17, 2015 at 2:18 PM, Flávio Henrique Schuindt da Silva <
flavio.schuindt(a)gmail.com> wrote:

Hi guys,

How can I increase the request timeout in CF? I would like to give
more time to my app to send a response to the requests before getting a
timeout. Should I increase it in the buildpack? If yes, where should I
change it on java buildpack?

Thanks in advance!

--
Thank you,

James Bayer


Re: cloud_controller_ng performance degrades slowly over time

Matt Cholick
 

Amit,
Here are the results:

https://gist.github.com/cholick/b448df07e9e493369d9e

The before and after pictures look pretty similar, nothing jumps out as
interesting.

On Thu, Oct 29, 2015 at 11:28 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Matt, that's awesome, thanks! Mind trying this?

require 'uri'
require 'net/http'
require 'logger'

SYSTEM_DOMAIN = '--CHANGE-ME--'

u = URI.parse('http://uaa.' + SYSTEM_DOMAIN + '/login')
h = Net::HTTP.new(u.host, u.port)
l = Logger.new('/var/vcap/data/tmp/slow-dns.log')
h.set_debug_output(l)

1.step do |i|
l.info('Request number: %04d' % i)
s = Time.now
r = h.head(u.path)
d = Time.now - s
l.info('Duration: %dms' % (d * 1000).round)
l.info('Response code: %d' % r.code)
l.error('!!! SLOW !!!') if d > 5
end

I'd want to know what we see in /var/vcap/data/tmp/slow-dns.log before
and after the DNS slowdown. By having the http object take a debug logger,
we can narrow down what Ruby is doing that's making it uniquely slow.


On Thu, Oct 29, 2015 at 7:39 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
Here's a run with the problem manifesting:

...
00248 [200]: ruby 26ms | curl 33ms | nslookup 21ms
00249 [200]: ruby 20ms | curl 32ms | nslookup 14ms
00250 [200]: ruby 18ms | curl 30ms | nslookup 17ms
00251 [200]: ruby 22ms | curl 31ms | nslookup 16ms
00252 [200]: ruby 23ms | curl 30ms | nslookup 16ms
00253 [200]: ruby 26ms | curl 40ms | nslookup 16ms
00254 [200]: ruby 20ms | curl 40ms | nslookup 14ms
00255 [200]: ruby 20ms | curl 35ms | nslookup 20ms
00256 [200]: ruby 17ms | curl 32ms | nslookup 14ms
00257 [200]: ruby 20ms | curl 37ms | nslookup 14ms
00258 [200]: ruby 25ms | curl 1038ms | nslookup 14ms
00259 [200]: ruby 27ms | curl 37ms | nslookup 13ms
00260 [200]: ruby 4020ms | curl 32ms | nslookup 16ms
00261 [200]: ruby 5032ms | curl 45ms | nslookup 14ms
00262 [200]: ruby 5021ms | curl 30ms | nslookup 14ms
00263 [200]: ruby 5027ms | curl 32ms | nslookup 16ms
00264 [200]: ruby 5025ms | curl 34ms | nslookup 15ms
00265 [200]: ruby 5029ms | curl 31ms | nslookup 14ms
00266 [200]: ruby 5030ms | curl 37ms | nslookup 18ms
00267 [200]: ruby 5022ms | curl 43ms | nslookup 14ms
00268 [200]: ruby 5026ms | curl 31ms | nslookup 17ms
00269 [200]: ruby 5027ms | curl 33ms | nslookup 14ms
00270 [200]: ruby 5025ms | curl 32ms | nslookup 14ms
00271 [200]: ruby 5022ms | curl 36ms | nslookup 15ms
00272 [200]: ruby 5030ms | curl 32ms | nslookup 13ms
00273 [200]: ruby 5024ms | curl 32ms | nslookup 13ms
00274 [200]: ruby 5028ms | curl 34ms | nslookup 14ms
00275 [200]: ruby 5048ms | curl 30ms | nslookup 14ms


It's definitely interesting that Ruby is the only one to manifest the
problem.

And here's the consul output:
https://gist.github.com/cholick/f7e91fb58891cc0d8f5a


On Thu, Oct 29, 2015 at 4:27 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey Matt,

Dieu's suggestion will fix your problem (you'll have to make the change
on all CC's), although it'll get undone on each redeploy. We do want to
find the root cause, but have not been able to reproduce it in our own
environments. If you're up for some investigation, may I suggest the
following:

* Run the following variation of your script on one of the CCs:

require 'uri'
require 'net/http'

SYSTEM_DOMAIN = '--CHANGE-ME--'

uaa_domain = "uaa.#{SYSTEM_DOMAIN}"
login_url = "https://#{uaa_domain}/login"

curl_command="curl -f #{login_url} 2>&1"
nslookup_command="nslookup #{uaa_domain} 2>&1"

puts 'STARTING SANITY CHECK'
curl_output = `#{curl_command}`
raise "'#{curl_command}' failed with output:\n#{curl_output}" unless
$?.to_i.zero?
puts 'SANITY CHECK PASSED'

def duration_string(start)
"#{((Time.now - start) * 1000).round}ms"
end

puts 'STARTING TEST'
1.step do |i|
uri = URI.parse(login_url)
ruby_start = Time.now
ruby_response = Net::HTTP.get_response(uri)
ruby_duration = duration_string(ruby_start)

curl_start = Time.now
`#{curl_command}`
curl_duration = duration_string(curl_start)

nslookup_start = Time.now
`#{nslookup_command}`
nslookup_duration = duration_string(nslookup_start)

puts "#{"%05d" % i} [#{ruby_response.code}]: ruby #{ruby_duration} |
curl #{curl_duration} | nslookup #{nslookup_duration}"
end

* Send a kill -QUIT <consul_agent_pid> to the consul agent process once
you see the slow DNS manifest itself, you will get a dump of all the
goroutines running in the consul agent process
/var/vcap/sys/log/consul_agent/consul_agent.stderr.log. I would be curious
to see what it spits out.

Amit


On Wed, Oct 28, 2015 at 6:10 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Thanks for taking a look, fingers crossed you can see it happen as well.

Our 217 install is on stemcell 3026 and our 212 install is on 2989.

IaaS is CenturyLink Cloud.

-Matt

On Wed, Oct 28, 2015 at 6:08 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I got up to 10k on an AWS deployment of HEAD of cf-release with ruby
2.2, then started another loop on the same box with ruby 2.1. In the end,
they got up to 40-50k without showing any signs of change. I had to switch
to resolving the UAA endpoint, eventually google started responding with
302s.

I'm going to try with a cf-release 212 deployment on my bosh lite, but
eventually I want to try on the same stemcell as you're using.

On Wed, Oct 28, 2015 at 5:01 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt, this is awesome.

I'm trying to reproduce this with your script, up at 10k with no
change. I'm also shelling out to curl in the script, to see if both curl
and ruby get affected, and if so, whether they're affected at the same time.

What IaaS and stemcell are you using?

Thanks,
Amit

On Wed, Oct 28, 2015 at 2:54 PM, Dieu Cao <dcao(a)pivotal.io> wrote:

You might try moving the nameserver entry for the consul_agent in
/etc/resolv.conf on the cloud controller to the end to see if that helps.
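As a minimal sketch of that reordering (assuming the consul agent's
nameserver entry is 127.0.0.1, which is typical but not confirmed in this
thread); review the output before writing anything back to /etc/resolv.conf:

```ruby
# Reorder resolv.conf content so the consul agent's nameserver (assumed
# to listen on 127.0.0.1) is consulted last instead of first.
def demote_consul_nameserver(resolv_conf, consul_ip = '127.0.0.1')
  consul, rest = resolv_conf.lines.partition do |line|
    line.split == ['nameserver', consul_ip]
  end
  (rest + consul).join
end

sample = "nameserver 127.0.0.1\nnameserver 10.0.0.2\n"
demote_consul_nameserver(sample)  # => "nameserver 10.0.0.2\nnameserver 127.0.0.1\n"
```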

-Dieu

On Wed, Oct 28, 2015 at 12:55 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Looks like you're right and we're experiencing the same issue as
you are, Amit. We're suffering slow DNS lookups. The code is spending all of
its time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb.initialize:879

I've experimented some with the environment and, after narrowing
things down to DNS, here's some minimal code demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
while true do
  beginning_time = Time.now
  response = Net::HTTP.get_response(uri)

  end_time = Time.now
  i += 1
  puts "#{"%04d" % i} Response: [#{response.code}], Elapsed: #{((end_time - beginning_time) * 1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some
point, requests start taking 5 seconds longer, which I assume is a timeout.
One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs are the same. How many requests it takes before things
time out varies considerably (one run started in the 10s and another took
20k requests), but it always happens. After that, lookups take an
additional 5 seconds and never recover to their initial speed. This is why
restarting the cloud controller fixes the issue (temporarily).

The really slow cli calls (in the 1+ min range) are simply due to
the amount of paging that fetching data for a large org does, as that 5
seconds is multiplied out over several calls. Every user is feeling this
delay; it's just that it's only unworkable when pulling the large datasets
from UAA.
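The multiplication is easy to make concrete; the page count and per-lookup
penalty below are illustrative, not measured values from this thread:

```ruby
# Each paged CC -> UAA call pays the ~5s DNS penalty once the slowdown
# starts, so total CLI time grows linearly with the number of pages.
def total_penalty_seconds(pages, per_lookup_penalty = 5)
  pages * per_lookup_penalty
end

total_penalty_seconds(15)  # => 75, already past the one-minute mark
```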

I was not able to reproduce the timeouts using a script calling "dig"
against localhost, only inside Ruby code.

To reiterate our setup: we're running 212 without a consul
server, just the agents. I also successfully reproduced this problem in a
completely different 217 install in a different datacenter. That setup also
didn't have an actual consul server, just the agent. I don't see anything
in the release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real
headaches for user management in our larger orgs. Amit: is there a bug we
can follow?

-Matt


On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

You may not be running any consul servers, but you may have a
consul agent colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next
time.

Amit,
The 1:20 delay is due to paging. The total call length for each page
is closer to 10s. I just included those two calls, with the paging done by
the cf command line, and included numbers to demonstrate the dramatic
difference after a restart. Delays disappear after a restart. We're not
running consul yet, so it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

We've seen issues on some environments where requests to cc that
involve cc making a request to uaa or hm9k have a 5s delay while the local
consul agent fails to resolve the DNS for uaa/hm9k, before moving on to a
different resolver.

The expected behavior observed in almost all environments is
that the DNS request to the consul agent fails fast and the resolver moves
on to the next one; we haven't figured out why a couple of envs exhibit
different behavior. The impact is a 5 or 10s delay (5 or 10, not 5 to 10).
It doesn't explain your 1:20 delay though. Are you always seeing delays
that long?
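Those 5 and 10 second figures line up with the glibc resolver defaults (a
5 s per-query timeout), so one timed-out query against the consul agent
costs 5 s and two cost 10 s. A hedged mitigation sketch for
/etc/resolv.conf — the values below are illustrative, not a tested
recommendation:

```
options timeout:1 attempts:1
```

This shortens the worst-case stall when a resolver fails to answer, at the
cost of giving slow-but-healthy resolvers less time to respond.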

Amit


On Thursday, October 8, 2015, Zach Robinson <
zrobinson(a)pivotal.io> wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Re: Request timeout in CloudFoundry

Paul Bakare
 

Hi Mike,

Could you let me know how you set timeout on gorouter to -1?

Appreciate your response.

On Fri, Sep 4, 2015 at 5:27 PM, Mike Youngstrom <youngm(a)gmail.com> wrote:

I have disabled the timeout on the gorouter ("-1"), trusting our front end
load balancer to manage the timeout. That way I only need to worry about
configuring the timeout on one device in my proxy chain. That may simplify
your problem. Are there any theoretical issues with that approach? It
seems to have worked well for us.
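For reference, a sketch of what this looks like in a BOSH deployment
manifest: in cf-release the gorouter job exposes a
request_timeout_in_seconds property (its default, 900 s, is the 15-minute
figure mentioned elsewhere in this thread), though the exact property name
may vary across release versions:

```
properties:
  request_timeout_in_seconds: -1  # disable gorouter's timeout; the front end LB enforces one
```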

Mike

On Fri, Sep 4, 2015 at 6:49 AM, James Bayer <jbayer(a)pivotal.io> wrote:

i would trace through the various system component logs to see where the
timeout is starting with each component that the client connects to and
seeing if that request came through. by submitting a silly uniquely
identifiable hostname you can usually find something quickly by grepping
logs.

On Thu, Sep 3, 2015 at 6:47 AM, Kayode Odeyemi <dreyemi(a)gmail.com> wrote:

Hi,

I'm experiencing this same issue.

I have a NodeJS app, using the nodejs buildpack. A request taking
approximately 80 secs gets timed out (502 response). We're using the
default CF ELBs (gorouter and ha_proxy).

On Tue, Aug 25, 2015 at 10:57 PM, CF Runtime <cfruntime(a)gmail.com>
wrote:

Hi,

We think this message was dropped and so we are posting it again.

How long of a timeout are you seeing? The gorouter and ha_proxy have a
default timeout of 15 minutes. If you are using an ELB or some other
load balancer, it may have its own timeout set. I'm not sure if any
of the app servers installed by buildpacks have a built-in timeout or
not; what type of app is it?

Joseph and Dan
OSS Release Integration Team

On Mon, Aug 17, 2015 at 2:18 PM, Flávio Henrique Schuindt da Silva <
flavio.schuindt(a)gmail.com> wrote:

Hi guys,

How can I increase the request timeout in CF? I would like to give
more time to my app to send a response to the requests before getting a
timeout. Should I increase it in the buildpack? If yes, where should I
change it on java buildpack?

Thanks in advance!

--
Thank you,

James Bayer


Re: xip.io IO errors

Juan Antonio Breña Moral <bren at juanantonio.info...>
 

I was reading this thread and noticed that I have the same problem; I thought that we had a problem in our network.

Locally I am using the development from Yudai:
https://github.com/yudai/cf_nise_installer

I am going to fix this issue too


Re: doppler issue which fails to emit logs with syslog protocol on CFv212

Masumi Ito
 

Hi Warren and Nino,

Let me update our status.

Yes, the root cause was the inconsistency of a bulk user id between cc and
the syslog drain binder due to the bug (a hardcoded id). We confirmed the
doppler had recovered successfully.

Also, I would appreciate the information about ETCD events.

Regards,
Masumi



--
View this message in context: http://cf-dev.70369.x6.nabble.com/doppler-issue-which-fails-to-emit-logs-with-syslog-protocol-on-CFv212-tp2418p2524.html
Sent from the CF Dev mailing list archive at Nabble.com.


Does exist some REST Call to detect Diego on any environment?

Juan Antonio Breña Moral <bren at juanantonio.info...>
 

Hi,

I would like to know if there exists some REST API to detect if a platform has Diego installed.
Reading the Cloud Controller API, I didn't recognize any method to identify Diego in the CF architecture.

Is it possible?

Many thanks in advance

Juan Antonio
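One hedged way to infer it, assuming the v2 Cloud Controller API: app
entities gained a "diego" boolean in releases that ship Diego support, so
checking for that key on the response to GET /v2/apps/:guid is a
reasonable signal, though it's an inference rather than a documented
detection endpoint:

```ruby
require 'json'

# Classify an app from the JSON body of GET /v2/apps/:guid.
# A missing 'diego' key suggests the platform predates Diego support.
def backend_for(app_json)
  entity = JSON.parse(app_json).fetch('entity', {})
  return :unknown unless entity.key?('diego')
  entity['diego'] ? :diego : :dea
end

backend_for('{"entity":{"diego":true}}')  # => :diego
backend_for('{"entity":{}}')              # => :unknown
```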


Re: Problem deploying basic Apps on PWS

Juan Antonio Breña Moral <bren at juanantonio.info...>
 

Hi Charles,

You gave the clue!!!
Yesterday, I updated the development and I could deploy on PWS.

On environments without Diego, the way to get the port in a Node app is:

var localPort = process.env.VCAP_APP_PORT || 5000;

With Diego the way is:

var localPort = process.env.PORT || 5000;

If the developer uses the Go CLI, it is necessary to indicate that the application uses Diego:

cf push APP_XXX --no-start
cf enable-diego APP_XXX
cf start APP_XXX


Re: Source IP ACLs

Gwenn Etourneau
 

Oh right, I misread it. I thought it was about preventing applications from
connecting to certain IPs...

On Fri, Oct 30, 2015 at 4:46 PM, ronak banka <ronakbanka.cse(a)gmail.com>
wrote:

Gwenn,

If I'm not wrong, application security rules are for restricting outbound
traffic from the application side, no?

Ronak

On Fri, Oct 30, 2015, 16:38 Gwenn Etourneau <getourneau(a)pivotal.io> wrote:

What about
https://docs.pivotal.io/pivotalcf/adminguide/app-sec-groups.html ?

On Fri, Oct 30, 2015 at 4:21 PM, Carlo Alberto Ferraris <
carlo.ferraris(a)rakuten.com> wrote:

Is there any provision for restricting the source IPs that are allowed
to access a certain application (or route)? Or is the only way to do this
to place a reverse proxy in front of the gorouter?
In case the reverse proxy is the only way to go, would there be interest
in having something like this implemented inside the gorouter itself?
(We're willing to contribute.)