
abacus collector doesn't work

MaggieMeng
 

Hi

I am trying to run abacus in my Cloud Foundry environment. However, after successfully pushing all the abacus applications to CF, I found the following error in some of the applications:

dmadmin@dmadmin-Lenovo-Product:~/cloudfoundry/cf-abacus/cf-abacus$ cf logs abacus-usage-aggregator
Connected, tailing logs for app abacus-usage-aggregator in org cf / space space as admin...

2015-11-04T04:33:36.47-0500 [App/0] OUT 2015-11-04T09:33:36.469Z e-abacus-request 46 Request error { message: 'connect ECONNREFUSED',
2015-11-04T04:33:36.47-0500 [App/0] OUT code: 'ECONNREFUSED',
2015-11-04T04:33:36.47-0500 [App/0] OUT errno: 'ECONNREFUSED',
2015-11-04T04:33:36.47-0500 [App/0] OUT syscall: 'connect' } - Error: connect ECONNREFUSED
2015-11-04T04:33:36.47-0500 [App/0] OUT at exports._errnoException (util.js:746:11)
2015-11-04T04:33:36.47-0500 [App/0] OUT at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1010:19)

The same happens with abacus-usage-collector. "npm run demo" also failed, which may be due to this error. Could it be a CF configuration issue? How can I enable verbose logging or debugging? Any help would be appreciated.
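
ECONNREFUSED from `connect` means the TCP connection attempt was actively refused, which usually indicates that nothing is listening on the target host and port. A minimal Ruby sketch for checking an endpoint by hand follows; the `probe` helper and its return symbols are hypothetical names used only for illustration, and which host and port the abacus apps actually try to reach is not shown in the log above.

```ruby
require 'socket'

# Hypothetical helper: classify the outcome of one TCP connect attempt.
def probe(host, port, timeout: 3)
  # Socket.tcp returns the value of the block and closes the socket afterwards.
  Socket.tcp(host, port, connect_timeout: timeout) { :open }
rescue Errno::ECONNREFUSED
  :refused        # host reachable, but nothing accepts connections on that port
rescue Errno::ETIMEDOUT, Errno::EHOSTUNREACH, Errno::ENETUNREACH
  :unreachable    # no answer at all within the timeout, or no route
rescue SocketError
  :unresolvable   # the host name could not be resolved
end
```

Calling `probe(host, port)` from a machine that should be able to reach the same endpoint can help tell a missing listener (`:refused`) apart from a routing or DNS problem.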

Thanks,
Maggie


Deploy on OpenNebula

Yancey
 

Has anyone deployed Cloud Foundry on OpenNebula? I can only find CPIs for OpenStack, VMware, etc.


Re: cloud_controller_ng performance degrades slowly over time

Amit Kumar Gupta
 

I looked at the strace, and I see you did indeed mean "loop without
resolver on localhost". If you try it with *only* a resolver on localhost,
do you get the eventually consistent DNS slowdown?

On Tue, Nov 3, 2015 at 8:33 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt!

When you say "the loop without the resolver on local host" did you mean
"the loop with only a resolver on local host"? Sorry if my setup wasn't
clear, but my intention was to only have 127.0.0.1 in /etc/resolv.conf.


On Tuesday, November 3, 2015, Matt Cholick <cholick(a)gmail.com> wrote:

Here are the results of the ruby loop with strace:
https://gist.github.com/cholick/e7e122e34b524cae5fa1

As expected, things eventually get slow. The bash version of the loop
with a new vm each time didn't get slow.

For the loop without a resolver on localhost, it never did get slow.
Though it's hard to prove with something so inconsistent, it hadn't
happened after 100k requests. Here's some of the strace:
https://gist.github.com/cholick/81e58f58e82bfe0a1489

On the final loop, with the SERVFAIL resolver, the issue did manifest.
Here's the trace of that run:
https://gist.github.com/cholick/bd2af46795911cb9f63c

Thanks for digging in on this.


On Mon, Nov 2, 2015 at 6:53 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Okay, interesting, hopefully we're narrowing in on something. There are a
couple of variables I'd like to eliminate, so I wonder if you could try the
following. Also, feel free at any point to let me know if you are not
interested in digging further.

Try all things as sudo, on one of the CCs.

1. It appears that the problem goes away when the CC process is
restarted, so it feels as though there's some sort of resource that the
ruby process is not able to GC, causing this problem to show up
eventually, and then go away when restarted. I want to confirm this by
trying two different loops, one where the loop is in bash, spinning up a
new ruby process each time, and one where the loop is in ruby.

* bash loop:

while true; do time /var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e 'TCPSocket.open("--UAA-DOMAIN--", 80).close'; done

* ruby loop

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step do |i|; t = Time.now; TCPSocket.open("--UAA-DOMAIN--", 80).close; puts "#{i}: #{(1000*(Time.now - t)).round}ms"; end'

For each loop, it might also be useful to run `strace -f -p PID -o
SOME_FILE` (strace writes to stderr by default, so a plain `>` redirect
would not capture it) to see what system calls are going on before and
after.

2. Another variable is the interaction with the other nameservers. For
this experiment, I would do `monit stop all` to take one of your CCs
out of commission, so that the router doesn't load balance to it, because
it will likely fail requests given the following changes:

* monit stop all && watch monit summary # wait for all the processes to
be stopped, then ctrl+c to stop the watch
* monit start consul_agent && watch monit summary # wait for
consul_agent to be running, then ctrl+c to stop the watch
* Remove nameservers other than 127.0.0.1 from /etc/resolv.conf
* Run the "ruby loop", and see if it still eventually gets slow
* When it's all done, put the original nameservers back in
/etc/resolv.conf, and `monit restart all`

Again, strace-ing the ruby loop would be interesting here.

3. Finally, consul itself. Dmitriy (BOSH PM) has a little DNS resolver
that can be run instead of consul, that will always SERVFAIL (same as what
you see from consul when you nslookup something), so we can try that:

* Modify `/var/vcap/bosh/etc/gemrc` to remove the `--local` flag
* Run `gem install rubydns`
* Dump the following into a file, say `/var/vcap/data/tmp/dns.rb`:

#!/usr/bin/env ruby

require "rubydns"

RubyDNS.run_server(listen: [[:udp, "0.0.0.0", 53], [:tcp, "0.0.0.0", 53]]) do
  otherwise do |transaction|
    transaction.fail!(:ServFail)
  end
end

* monit stop all && watch monit summary # and again, wait for
everything to be stopped
* Run it with `ruby /var/vcap/data/tmp/dns.rb`. Note that this
command, and the previous `gem install`, use the system gem/ruby, not
the ruby package used by CC, so it maintains some separation. When running
this, it will spit out logs to the terminal, so one can keep an eye on what
it's doing, make sure it all looks reasonable
* Make sure the original nameservers are back in the `/etc/resolv.conf`
(i.e. ensure this experiment is independent of the previous experiment).
* Run the "ruby loop" (in a separate shell session on the CC)
* After it's all done, add back `--local` to `/var/vcap/bosh/etc/gemrc`,
and `monit restart all`

Again, run strace on the ruby process.
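
The "ruby loop" one-liner used in the experiments above can also be written as a small script, which may be easier to adapt. This is only a sketch: `timed_open_ms` and `slow_connects` are hypothetical helper names, the 1000 ms threshold is an arbitrary assumption, and a monotonic clock is swapped in for `Time.now`.

```ruby
require 'socket'

# Time one TCP connect, including name resolution, in whole milliseconds.
def timed_open_ms(host, port)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  TCPSocket.open(host, port).close
  ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
end

# Connect `count` times in the same ruby process and return [iteration, ms]
# pairs for every connect that took at least `threshold_ms`.
def slow_connects(host, port, count:, threshold_ms: 1000)
  slow = []
  (1..count).each do |i|
    ms = timed_open_ms(host, port)
    slow << [i, ms] if ms >= threshold_ms
  end
  slow
end
```

For example, `slow_connects("--UAA-DOMAIN--", 80, count: 100_000)` keeps everything in one long-running process, which is the condition under which the slowdown was observed.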

What I hope we find out is that (1) only the ruby loop is affected, so
it has something to do with long running ruby processes, (2) the problem is
independent of the other nameservers listed in /etc/resolv.conf, and
(3) the problem remains when running Dmitriy's DNS-FAILSERVer instead of
consul on 127.0.0.1:53, to determine that the problem is not specific
to consul.
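
Since experiments 2 and 3 differ only in which nameservers are listed in /etc/resolv.conf, it may help to confirm the file's state before each run. The sketch below is a simplified reader, not a full resolv.conf parser; the helper names and the sample addresses are hypothetical.

```ruby
# Return the nameserver addresses listed in resolv.conf-style text,
# skipping lines that start with '#' or ';' (comments).
def nameservers(resolv_conf_text)
  resolv_conf_text.each_line
                  .reject { |line| line.start_with?('#', ';') }
                  .map(&:split)
                  .select { |words| words[0] == 'nameserver' && words[1] }
                  .map { |words| words[1] }
end

# True when 127.0.0.1 is the only resolver, as experiment 2 requires.
def only_localhost?(resolv_conf_text)
  nameservers(resolv_conf_text) == ['127.0.0.1']
end
```

On the CC, `only_localhost?(File.read('/etc/resolv.conf'))` should be true for experiment 2 and false again once the original nameservers are restored.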

On Sun, Nov 1, 2015 at 5:18 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
It looks like consul isn't configured as a recursive resolver. When
running the above code, resolving fails on the first nameserver and the
script fails. resolv-replace's TCPSocket.open is different from the code
http.rb (and thus api) is using. http.rb is pulling in 'net/protocol'. I
changed the script, replacing the require for 'resolv-replace' with
'net/protocol' to match the cloud controller.

Results:

3286 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4 ms | dns_close: 0 ms
3287 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3288 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 6 ms | dns_close: 0 ms
3289 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3290 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3291 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3292 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms | dns_close: 0 ms
3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3296 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3297 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4006 ms | dns_close: 0 ms
3298 -- ip_open: 2 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3299 -- ip_open: 3 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3300 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3301 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3302 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
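
With long runs, finding the point where the slowdown starts by eye gets tedious. The following sketch scans output in the format shown above for the first iteration whose `dns_open` reaches a threshold. The function name and the 1000 ms default are assumptions, and it expects each result on a single line.

```ruby
# Three lines taken from the results above, one result per line.
SAMPLE = <<~LOG
  3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
  3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms | dns_close: 0 ms
  3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
LOG

# Return the iteration number of the first line whose dns_open time is at
# least threshold_ms, or nil if no line qualifies.
def first_slow_iteration(log, threshold_ms: 1000)
  log.each_line do |line|
    m = line.match(/\A\s*(\d+) -- .*\bdns_open: (\d+) ms/)
    next unless m
    return m[1].to_i if m[2].to_i >= threshold_ms
  end
  nil
end

puts first_slow_iteration(SAMPLE) # prints 3294
```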

And the consul logs, though there's nothing interesting there:
https://gist.github.com/cholick/03d74f7f012e54c50b56


On Fri, Oct 30, 2015 at 5:51 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Yup, that's what I was suspecting. Can you try the following now:

1. Add something like the following to your cf manifest:

...
jobs:
...
- name: cloud_controller_z1
  ...
  properties:
    consul:
      agent:
        ...
        log_level: debug
...

This will set the log level for the consul agents on your CC job to
debug, so we might be able to see more in their logs. It only sets it on
the job that matters, so when you redeploy, it won't have to roll the whole
deployment. It's okay if you can't or don't want to do this; I'm not sure
how much you want to play around with your environment, but it could be
helpful.

2. Add the following line to the bottom of your /etc/resolv.conf

options timeout:4

Let's see if the slow DNS is on the order of 4000ms now, to pin down
where the 5s is exactly coming from.

3. Run the following script on your CC box:

require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--' # e.g. 'uaa.run.pivotal.io'
UAA_IP = '--CHANGE-ME-TOO--' # e.g. '52.21.135.158'

def dur(start_time, end_time)
  "#{(1000*(end_time-start_time)).round} ms"
end

1.step do |i|
  ip_start = Time.now
  s = TCPSocket.open(UAA_IP, 80)
  ip_open = Time.now
  s.close
  ip_close = Time.now

  dns_start = Time.now
  s = TCPSocket.open(UAA_DOMAIN, 80)
  dns_open = Time.now
  s.close
  dns_close = Time.now

  ip_open_dur = dur(ip_start, ip_open)
  ip_close_dur = dur(ip_open, ip_close)
  dns_open_dur = dur(dns_start, dns_open)
  dns_close_dur = dur(dns_open, dns_close)

  puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close: #{ip_close_dur} | dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"
end

You will need to first nslookup (or otherwise determine) the IP that
the UAA_DOMAIN resolves to (it will be some load balancer, possibly the
gorouter, ha_proxy, or your own upstream LB)

4. Grab the files in /var/vcap/sys/log/consul_agent/
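
For step 2, a quick way to confirm which resolver timeout is in effect is to read it back from the file. This is a sketch under stated assumptions: the 5 second fallback is the glibc default when no `timeout` option is set, the helper name is hypothetical, and it assumes that a later `options timeout:` setting overrides an earlier one.

```ruby
DEFAULT_TIMEOUT_SECONDS = 5 # glibc default when resolv.conf sets no timeout

# Return the timeout (in seconds) configured via "options timeout:N" in
# resolv.conf-style text, falling back to the default.
def resolver_timeout_seconds(resolv_conf_text)
  timeout = DEFAULT_TIMEOUT_SECONDS
  resolv_conf_text.each_line do |line|
    words = line.split
    next unless words.first == 'options'
    words.drop(1).each do |option|
      m = option.match(/\Atimeout:(\d+)\z/)
      timeout = m[1].to_i if m # assumption: the last setting wins
    end
  end
  timeout
end
```

If `resolver_timeout_seconds(File.read('/etc/resolv.conf'))` returns 4 and the slow lookups then cluster around 4000 ms, that would point at the resolver timeout as the source of the delay.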

Cheers,
Amit

On Fri, Oct 30, 2015 at 4:29 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Here's the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

All of the time is spent between "opening connection" and "opened". Here
is the corresponding ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
  TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much ruby, so that's as far as I drilled down.

-Matt


Re: cloud_controller_ng performance degrades slowly over time

Amit Kumar Gupta
 

Thanks Matt!

When you say "the loop without the resolver on local host" did you mean
"the loop with only a resolver on local host"? Sorry if my setup wasn't
clear, but my intention was to only have 127.0.0.1 in /etc/resolv.conf.



Re: cloud_controller_ng performance degrades slowly over time

Matt Cholick
 

Here are the results of the ruby loop with strace:
https://gist.github.com/cholick/e7e122e34b524cae5fa1

As expected, things eventually get slow. The bash version of the loop with
a new vm each time didn't get slow.

For the loop without a resolver on localhost, it never did get slow. Though
it's hard to prove with something so inconsistent, it hadn't happened after
100k requests. Here's some of the strace:
https://gist.github.com/cholick/81e58f58e82bfe0a1489

On the final loop, with the SERVFAIL resolver, the issue did manifest.
Here's the trace of that run:
https://gist.github.com/cholick/bd2af46795911cb9f63c

Thanks for digging in on this.



Re: PHP extension 'gettext' doesn't work?

Mike Dalessio
 

Hi Jack,

That sounds like a great idea. I'll prioritize this work in our Tracker
backlog.

-mike

On Tue, Nov 3, 2015 at 2:58 PM, Jack Cai <greensight(a)gmail.com> wrote:

This is a great list of languages to support. May I ask to add ar (Arabic)
and iw (Hebrew) to the list? These two are considered "Group 2" languages
in the company I work for. With them added, all Group 1 and Group 2
languages will be included in the list, which is a confirmation that we
have a good list.

Jack


On Mon, Nov 2, 2015 at 4:41 PM, JT Archie <jarchie(a)pivotal.io> wrote:

We've added support for some locales into the rootfs. There are quite a
few locales, and we don't know whether we need to officially support all
of them. Our current list comes from a consumer-level list of the
*most used* locales.

Please feel free to review this commit
<https://github.com/cloudfoundry/stacks/commit/748cb604ef55f3eeb334d960f496661b58ec50ca>.
We've made it explicit what locales are supported.

With these changes, we've been able to make the PHP `gettext` extension
work with the example. It has been added to our test suite
<https://github.com/cloudfoundry/php-buildpack/tree/develop/cf_spec/fixtures/php_app_testing_locale>
for future support.

Let us know if you need anything else.

Kind Regards,

JT

On Fri, Oct 30, 2015 at 1:41 PM, Mike Dalessio <mdalessio(a)pivotal.io>
wrote:

Unfortunately, the apt-buildpack only works for installing staging-time
dependencies, and not runtime dependencies.

It could be made to work, but the core buildpacks team simply have not
done so because nobody has asked (yet). ;)


On Fri, Oct 30, 2015 at 1:37 PM, Guillaume Berche <bercheg(a)gmail.com>
wrote:

I agree the inclusion of the lang pack into linuxfs2 seems the best option.

I'm wondering though whether a temporary workaround could be to install
the "locales" debian package using apt-buildpack [1] (no sudo needed) and
combine it with php buildpack using the multi buildpack [2]? I was
planning to test that for another purpose but have not had the chance yet.
I'm interested in hearing the outcome.

Guillaume.

[1] https://github.com/pivotal-cf-experimental/apt-buildpack

[2] https://github.com/ddollar/heroku-buildpack-multi

On Oct 27, 2015, at 02:41, "Hiroaki Ukaji" <dt3snow.w(a)gmail.com> wrote:


Hi.

Thanks to you, I understood why i18n by gettext didn't work on CF.
Certainly, the language pack "ja_JP.utf8" only exists on my local
machine.

Anyway, we're glad to hear that the debian package "locales" will be
added to the rootfs.
It will resolve this issue and then we will be able to manage i18n by
the gettext extension on CF.
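
One way to check whether a locale such as "ja_JP.utf8" is present on a given system is to compare it against the output of `locale -a`. The sketch below is a hypothetical helper, not part of any buildpack. It only normalizes the codeset spelling (for example "UTF-8" versus "utf8") and ignores modifiers such as "@euro".

```ruby
# Normalize a locale name so that "ja_JP.UTF-8" and "ja_JP.utf8" compare equal.
def normalize_locale(name)
  lang, codeset = name.strip.split('.', 2)
  codeset ? "#{lang}.#{codeset.downcase.delete('-')}" : lang
end

# True when `wanted` appears in text formatted like the output of `locale -a`
# (one locale name per line).
def locale_available?(wanted, locale_a_output)
  available = locale_a_output.each_line.map { |line| normalize_locale(line) }
  available.include?(normalize_locale(wanted))
end
```

Running `locale_available?("ja_JP.utf8", %x(locale -a))` locally and inside the app container should show the difference described above.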


Thanks a lot.

Hiroaki UKAJI





Re: Diff between cf restart and cf restage

Cornelia Davis <cdavis@...>
 

The staging process has access to env variables, etc. so the env can affect
the contents of the droplet.

You might notice that when you do a cf set-env you get a message that
advises you to do a cf restage. Because CF doesn't know whether your
buildpack is affected by env changes, it recommends the more extreme option.

For some (many) apps, a cf restart would be sufficient.

On Tue, Nov 3, 2015 at 9:15 PM, Matthew Sykes <matthew.sykes(a)gmail.com>
wrote:

`restage` will stop your application, run the application bits through the
staging process to create a new droplet, and then start the new droplet.
It's a lot like `push` but without actually pushing new application bits.

`restart` will simply stop your application and start it with the existing
droplet.

You typically restart when you need your application's environment
refreshed and you typically restage when you need/want the buildpack to run
without updating the application source.

Hope that helps.

On Tue, Nov 3, 2015 at 2:46 PM, Nikhil Katre <nikhil.katre(a)appdynamics.com>
wrote:

Hi,

Can someone explain in detail what is the difference between cf restart
and cf restage ?

--
Thanks,

*Nikhil Katre* | Software Engineer
Mobile: (919) 633 3940

AppDynamics
The Application Intelligence Company
Watch <http://appdynamics.wistia.com/medias/56gnkuk6mv>our Video | Try
<https://portal.appdynamics.com/account/signup/signupForm>our FREE Trial
| Twitter <http://www.twitter.com/appdynamics>| Facebook
<http://www.facebook.com/pages/AppDynamics/193264136815?ref=nf>|
appdynamics.com <http://www.appdynamics.com/>


--
Matthew Sykes
matthew.sykes(a)gmail.com


Re: Diff between cf restart and cf restage

Matthew Sykes <matthew.sykes@...>
 

`restage` will stop your application, run the application bits through the
staging process to create a new droplet, and then start the new droplet.
It's a lot like `push` but without actually pushing new application bits.

`restart` will simply stop your application and start it with the existing
droplet.

You typically restart when you need your application's environment
refreshed and you typically restage when you need/want the buildpack to run
without updating the application source.

Hope that helps.
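
The difference can be pictured with a toy model. This is only an illustrative sketch, not Cloud Foundry code: `ToyApp`, `Droplet`, and their methods are hypothetical names, and "staging" is reduced to recording which environment the droplet was built with.

```ruby
# Toy model only: ToyApp and Droplet are hypothetical, not Cloud Foundry APIs.
Droplet = Struct.new(:bits_version, :staged_with_env)

class ToyApp
  attr_reader :env, :droplet, :running_env

  def initialize(bits_version, env = {})
    @bits_version = bits_version
    @env = env.dup
    stage
    start
  end

  # Like `cf set-env`: recorded now, visible to the app only after a restart or restage.
  def set_env(key, value)
    @env[key] = value
  end

  # restart: stop, then start the existing droplet with the current environment.
  def restart
    start
  end

  # restage: stop, run the same bits through staging again, start the new droplet.
  def restage
    stage
    start
  end

  private

  # Staging can read the environment, so the droplet may depend on it.
  def stage
    @droplet = Droplet.new(@bits_version, @env.dup)
  end

  def start
    @running_env = @env.dup
  end
end
```

In this model a restart refreshes `running_env` but keeps the same `droplet` object, while a restage produces a new droplet built against the current environment, which is why a restage is the safer choice when a buildpack might read environment variables during staging.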

On Tue, Nov 3, 2015 at 2:46 PM, Nikhil Katre <nikhil.katre(a)appdynamics.com>
wrote:

Hi,

Can someone explain in detail what is the difference between cf restart
and cf restage ?

--
Thanks,

*Nikhil Katre* | Software Engineer
Mobile: (919) 633 3940

AppDynamics
The Application Intelligence Company
Watch <http://appdynamics.wistia.com/medias/56gnkuk6mv>our Video | Try
<https://portal.appdynamics.com/account/signup/signupForm>our FREE Trial
| Twitter <http://www.twitter.com/appdynamics>| Facebook
<http://www.facebook.com/pages/AppDynamics/193264136815?ref=nf>|
appdynamics.com <http://www.appdynamics.com/>


--
Matthew Sykes
matthew.sykes(a)gmail.com


Re: PHP extension 'gettext' doesn't work?

Jack Cai
 

This is a great list of languages to support. May I ask to add ar (Arabic)
and iw (Hebrew) to the list? These two are considered "Group 2" languages
in the company I work for. With them added, all Group 1 and Group 2
languages will be included in the list, which is a confirmation that we
have a good list.

Jack

On Mon, Nov 2, 2015 at 4:41 PM, JT Archie <jarchie(a)pivotal.io> wrote:

We've added support for some locales into the rootfs. There are quite a
few locales, and we don't know whether we need to officially support all of
them. Our current list comes from a consumer-level list of *most used* locales.

Please feel free to review this commit
<https://github.com/cloudfoundry/stacks/commit/748cb604ef55f3eeb334d960f496661b58ec50ca>.
We've made it explicit what locales are supported.

With these changes, we've been able to make the PHP `gettext` extension
work with the example. It has been added to our test suite
<https://github.com/cloudfoundry/php-buildpack/tree/develop/cf_spec/fixtures/php_app_testing_locale>
for future support.

Let us know if you need anything else.

Kind Regards,

JT

On Fri, Oct 30, 2015 at 1:41 PM, Mike Dalessio <mdalessio(a)pivotal.io>
wrote:

Unfortunately, the apt-buildpack only works for installing staging-time
dependencies, and not runtime dependencies.

It could be made to work, but the core buildpacks team simply have not
done so because nobody has asked (yet). ;)


On Fri, Oct 30, 2015 at 1:37 PM, Guillaume Berche <bercheg(a)gmail.com>
wrote:

I agree the inclusion of the lang pack into linuxfs2 seems the best option.

I'm wondering though whether a temporary workaround could be to install
the "locales" debian package using apt-buildpack [1] (no sudo needed) and
combine it with the php buildpack using the multi buildpack [2]? I was
planning to test that for another purpose but have not had the chance yet.
I'm interested in hearing the outcome.

Guillaume.

[1] https://github.com/pivotal-cf-experimental/apt-buildpack

[2] https://github.com/ddollar/heroku-buildpack-multi
On 27 Oct 2015 at 02:41, "Hiroaki Ukaji" <dt3snow.w(a)gmail.com> wrote:


Hi.

Thanks to you, I understood why i18n by gettext didn't work on CF.
Certainly, the language pack "ja_JP.utf8" only exists on my local
machine.

Anyway, we're glad to hear that the debian package "locales" will be
added to the rootfs.
It will resolve this issue, and then we will be able to manage i18n with
the gettext extension on CF.


Thanks a lot.

Hiroaki UKAJI



--
View this message in context:
http://cf-dev.70369.x6.nabble.com/cf-dev-PHP-extension-gettext-doesn-t-work-tp1984p2450.html
Sent from the CF Dev mailing list archive at Nabble.com.


Diff between cf restart and cf restage

Nikhil Katre <nikhil.katre@...>
 

Hi,

Can someone explain in detail what is the difference between cf restart and
cf restage ?

--
Thanks,

*Nikhil Katre* | Software Engineer
Mobile: (919) 633 3940

AppDynamics
The Application Intelligence Company
Watch <http://appdynamics.wistia.com/medias/56gnkuk6mv>our Video | Try
<https://portal.appdynamics.com/account/signup/signupForm>our FREE Trial |
Twitter <http://www.twitter.com/appdynamics>| Facebook
<http://www.facebook.com/pages/AppDynamics/193264136815?ref=nf>|
appdynamics.com <http://www.appdynamics.com/>


Re: How can i listen two or more port in an app?

Shannon Coen
 

We're currently working on adding support for routing to multiple app
ports.

Proposed UX can be found at the bottom of this doc, starting with step #9:
https://docs.google.com/document/d/1SfwaQ1hnngfopXC_Q24cT6lbo0yFwvbAbPcCPEHeNPY/edit?usp=sharing

This feature is being implemented in these epics:
https://www.pivotaltracker.com/epic/show/2025858
https://www.pivotaltracker.com/epic/show/2025948

Shannon Coen
Product Manager, Cloud Foundry
Pivotal, Inc.

On Tue, Nov 3, 2015 at 3:58 AM, Matthew Sykes <matthew.sykes(a)gmail.com>
wrote:

With DEAs, you can't. With Diego, you can look at the new TCP routing
support [1] if you want the port to be accessible to everyone, all the
time, or ssh port forwarding [2] if you only want someone in the developer
role to access the jmx port.

For the latter, you can look at [3] for an example of how to configure the
JVM and how to use the port forwarding mechanisms of the client.

[1]: https://github.com/cloudfoundry-incubator/routing-api-cli
[2]: https://github.com/cloudfoundry-incubator/diego-ssh
[3]: http://sykesm.mybluemix.net/posts/jmx-in-diego/

On Tue, Nov 3, 2015 at 4:28 AM, yancey0623 <yancey0623(a)163.com> wrote:

Dear all!

How can I listen on two or more ports in an app? For example, my Java app
uses two ports: one handles web requests and the other is a JMX port.


--
Matthew Sykes
matthew.sykes(a)gmail.com


Re: UAA branding and scope descriptions

Sree Tummidi
 

Hi Josh,

Rebranding is possible today. This can be done by updating the assets under : https://github.com/cloudfoundry/uaa/blob/master/uaa/src/main/webapp/resources/

As mentioned by Matt below, we do have a plan for the removal of Pivotal assets from the UAA open source repository. Please see below.

Disable UAA Login
The first thing we will do is disable the login server and allow the explicit configuration of an external login server. This would mean that the UAA UI pages will no longer be accessible and the info endpoint would return the location of the external login server as the login server location.
The corresponding tracker story is here and is part of the current backlog of work.
https://www.pivotaltracker.com/story/show/106668494

Removal of Pivotal Assets
We have an internal project underway to create a Pivotal branded login and account management experience. Until this is done we will not be able to remove the Pivotal assets from the UAA open source repository. This work is currently slated to be completed in Q1 of 2016.
You can track the story for Pivotal Assets removal here : https://www.pivotaltracker.com/story/show/106670296


Matt: Let's connect on the story to disable the external login server. I want to make sure that we have covered all the cases !


Thanks,
Sree

Sent from my iPad

On Nov 2, 2015, at 9:20 AM, Matthew Sykes <matthew.sykes(a)gmail.com> wrote:

No formal extension process currently exists to do what you're asking for. The topic has been raised at the runtime PMC as others have similar needs. It sounds like the identity team may have some plans to address that before too long.

In the meantime, you can build your own login server and configure your deployment to use it. It won't completely disable all of the branding that exists in the UAA (another issue raised at the PMC) but my understanding is that there are plans to address that too.

Sree, can you elaborate on any plans in this space?

Thanks.

On Mon, Nov 2, 2015 at 11:55 AM, john mcteague <john.mcteague(a)gmail.com> wrote:
I have two ways in which I want to customize the UAA:

- Brand the login screen with my company L&F
- Add descriptions for custom scopes so that the access confirmation messages are relevant (currently defined in messages.properties [1])

Do I need to fork the UAA and maintain that or is there an extension process that I am not aware of?

Thanks,
John

[1] - https://github.com/cloudfoundry/uaa/blob/bbea63986bbf2de9c42f231668e344a4a321184c/uaa/src/main/resources/messages.properties


--
Matthew Sykes
matthew.sykes(a)gmail.com


Re: CFScaler - CloudFoundry Auto Scaling

Alexander Lomov <alexander.lomov@...>
 

Hey! Nice to hear you open source your solutions.

Actually, at Altoros we have dealt with such tasks, and we also open-sourced a solution that can be used for this purpose [1]. It is a pretty straightforward script implemented in Ruby, and it can be deployed as an application to CF. It’s really simple and it takes advantage of the cfoundry gem [2]; I use it to show how easily you can customize the behavior of your cloud environment.

Still, autoscaling is a very complex topic, and I am sure there is no common answer for everyone in this field.

Another thing I wanted to ask is whether you plan to extract your API wrapper into a separate project, so that others can use it. It would be really nice.

[1] https://github.com/allomov/cf-auto-scaling
[2] https://github.com/cloudfoundry-attic/cfoundry

Thank you,
Alex L.

On Nov 2, 2015, at 7:57 AM, Nguyen Dang Minh <nguyendangminh(a)gmail.com> wrote:

Hi CF nuts,

I'm from FPT Software. We've just open-sourced CFScaler, an auto scaling feature for CloudFoundry. The repository is located here: https://github.com/cloudfoundry-community/cfscaler

Auto scaling seems to be a feature in high demand in the CF community, but we didn't find it in any open source CF distribution, so we decided to develop it ourselves. CFScaler is being used for some of our workloads, and it serves them well enough.

There is still some work to be done: code cleanup, refactoring, documentation, and so on. We hope it will be ready for you in about a week.

CFScaler still needs to be improved, and we'll publish the milestones soon. At FPT Software we have a CF team and dedicated people for maintaining and developing CFScaler. All contributions are welcome: code, issues, ideas, feature requests, and so on.

Enjoy it.

Regards,
MinhND
--
Nguyen Dang Minh - 阮登明
http://www.minhnd.com


Re: How can i listen two or more port in an app?

Matthew Sykes <matthew.sykes@...>
 

With DEAs, you can't. With Diego, you can look at the new TCP routing
support [1] if you want the port to be accessible to everyone, all the
time, or ssh port forwarding [2] if you only want someone in the developer
role to access the jmx port.

For the latter, you can look at [3] for an example of how to configure the
JVM and how to use the port forwarding mechanisms of the client.

[1]: https://github.com/cloudfoundry-incubator/routing-api-cli
[2]: https://github.com/cloudfoundry-incubator/diego-ssh
[3]: http://sykesm.mybluemix.net/posts/jmx-in-diego/

On Tue, Nov 3, 2015 at 4:28 AM, yancey0623 <yancey0623(a)163.com> wrote:

Dear all!

How can I listen on two or more ports in an app? For example, my Java app
uses two ports: one handles web requests and the other is a JMX port.
--
Matthew Sykes
matthew.sykes(a)gmail.com


How can i listen two or more port in an app?

Yancey
 

Dear all!
How can I listen on two or more ports in an app? For example, my Java app uses two ports: one handles web requests and the other is a JMX port.


OpenAM integration

Antonio Diaz Arroyo
 

Hello,
We are trying to integrate a Single Sign-On authentication from OpenAM into an application deployed on Cloud Foundry.
Does anyone know what would be the best approach to do this?

Thank you!


Re: cloud_controller_ng performance degrades slowly over time

Amit Kumar Gupta
 

Okay, interesting, hopefully we're narrowing in on something. There are a
couple of variables I'd like to eliminate, so I wonder if you could try the
following. Also, feel free at any point to let me know if you are not
interested in digging further.

Try all things as sudo, on one of the CCs.

1. It appears that the problem goes away when the CC process is restarted,
so it feels as though there's some sort of resource that the ruby process
is not able to GC, leading to this problem to show up eventually, and then
go away when restarted. I want to confirm this by trying two different
loops, one where the loop is in bash, spinning up a new ruby process each
time, and one where the loop is in ruby.

* bash loop:

while true; do time /var/vcap/packages/ruby-VERSION/bin/ruby \
  -r'net/protocol' -e 'TCPSocket.open("--UAA-DOMAIN--", 80).close'; done

* ruby loop

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step do |i|;
  t = Time.now; TCPSocket.open("--UAA-DOMAIN--", 80).close;
  puts "#{i}: #{(1000*(Time.now - t)).round}ms"; end'

For each loop, it might also be useful to run `strace -f -p PID > SOME_FILE`
to see what system calls are going on before and after.

2. Another variable is the interaction with the other nameservers. For
this experiment, I would do `monit stop all` to take one of your CC's out
of commission, so that the router doesn't load balance to it, because it
will likely fail requests given the following changes:

* monit stop all && watch monit summary # wait for all the processes to be
stopped, then ctrl+c to stop the watch
* monit start consul_agent && watch monit summary # wait for consul_agent
to be running, then ctrl+c to stop the watch
* Remove nameservers other than 127.0.0.1 from /etc/resolv.conf
* Run the "ruby loop", and see if it still eventually gets slow
* When it's all done, put the original nameservers back in /etc/resolv.conf,
and `monit restart all`

Again, strace-ing the ruby loop would be interesting here.

3. Finally, consul itself. Dmitriy (BOSH PM) has a little DNS resolver
that can be run instead of consul, that will always SERVFAIL (same as what
you see from consul when you nslookup something), so we can try that:

* Modify `/var/vcap/bosh/etc/gemrc` to remove the `--local` flag
* Run `gem install rubydns`
* Dump the following into a file, say `/var/vcap/data/tmp/dns.rb`:

#!/usr/bin/env ruby

require "rubydns"

RubyDNS.run_server(listen: [[:udp, "0.0.0.0", 53], [:tcp, "0.0.0.0", 53]]) do
  otherwise do |transaction|
    transaction.fail!(:ServFail)
  end
end

* monit stop all && watch monit summary # and again, wait for everything to
be stopped
* Run it with `ruby /var/vcap/data/tmp/dns.rb`. Note that this command,
and the previous `gem install`, use the system gem/ruby, not the ruby
package used by CC, so it maintains some separation. When running this, it
will spit out logs to the terminal, so one can keep an eye on what it's
doing, make sure it all looks reasonable
* Make sure the original nameservers are back in the `/etc/resolv.conf`
(i.e. ensure this experiment is independent of the previous experiment).
* Run the "ruby loop" (in a separate shell session on the CC)
* After it's all done, add back `--local` to `/var/vcap/bosh/etc/gemrc`,
and `monit restart all`

Again, run strace on the ruby process.

What I hope we find out is that (1) only the ruby loop is affected, so it
has something to do with long running ruby processes, (2) the problem is
independent of the other nameservers listed in /etc/resolv.conf, and (3)
the problem remains when running Dmitriy's DNS-FAILSERVer instead of consul
on 127.0.0.1:53, to determine that the problem is not specific to consul.
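
To help tell whether the slowdown sits in name resolution or in the TCP connect itself, the two phases can also be timed separately. The following is only a sketch under assumptions: `time_ms` and `probe` are hypothetical helper names, and it assumes that `Addrinfo.getaddrinfo` goes through the same system resolver path that `TCPSocket.open` uses when given a hostname.

```ruby
require 'socket'

# Run the block and return [block result, elapsed milliseconds] using a monotonic clock.
def time_ms
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0) * 1000).round]
end

# Time name resolution and TCP connect separately for one host and port.
def probe(host, port)
  addrs, resolve_ms = time_ms { Addrinfo.getaddrinfo(host, port, nil, :STREAM) }
  # Addrinfo#connect with a block closes the socket when the block returns.
  _, connect_ms = time_ms { addrs.first.connect { |_sock| nil } }
  { resolve_ms: resolve_ms, connect_ms: connect_ms }
end

# Usage (placeholder domain, as in the loops above):
#   1.step { |i| puts "#{i}: #{probe('--UAA-DOMAIN--', 80)}" }
```

If `resolve_ms` jumps to seconds while `connect_ms` stays at a few milliseconds, that would point at the resolver rather than at the socket connect.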

On Sun, Nov 1, 2015 at 5:18 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
It looks like consul isn't configured as a recursive resolver. When
running the above code, resolving fails on the first nameserver and the
script fails. resolv-replace's TCPSocket.open is different from the code
http.rb (and thus api) is using. http.rb is pulling in 'net/protocol'. I
changed the script, replacing the require for 'resolv-replace' to
'net/protocol' to match the cloud controller.

Results:

3286 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4 ms | dns_close: 0 ms
3287 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3288 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 6 ms | dns_close: 0 ms
3289 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3290 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3291 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3292 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms | dns_close: 0 ms
3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3296 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3297 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4006 ms | dns_close: 0 ms
3298 -- ip_open: 2 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3299 -- ip_open: 3 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3300 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3301 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3302 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms

And the consul logs, though there's nothing interesting there:
https://gist.github.com/cholick/03d74f7f012e54c50b56


On Fri, Oct 30, 2015 at 5:51 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Yup, that's what I was suspecting. Can you try the following now:

1. Add something like the following to your cf manifest:

...
jobs:
...
- name: cloud_controller_z1
  ...
  properties:
    consul:
      agent:
        ...
        log_level: debug
        ...

This will set the log level for the consul agents on your CC job to
debug, so we might be able to see more in its logs. It only sets it on
the job that matters, so when you redeploy, it won't have to roll the whole
deployment. It's okay if you can't/don't want to do this, I'm not sure how
much you want to play around with your environment, but it could be helpful.

2. Add the following line to the bottom of your /etc/resolv.conf

options timeout:4

Let's see if the slow DNS is on the order of 4000ms now, to pin down
where the 5s is exactly coming from.

3. Run the following script on your CC box:

require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--' # e.g. 'uaa.run.pivotal.io'
UAA_IP = '--CHANGE-ME-TOO--' # e.g. '52.21.135.158'

# Format the elapsed time between two Time values in milliseconds
def dur(start_time, end_time)
  "#{(1000*(end_time-start_time)).round} ms"
end

1.step do |i|
  # Connect by IP address: no name resolution involved
  ip_start = Time.now
  s = TCPSocket.open(UAA_IP, 80)
  ip_open = Time.now
  s.close
  ip_close = Time.now

  # Connect by hostname: includes the DNS lookup
  dns_start = Time.now
  s = TCPSocket.open(UAA_DOMAIN, 80)
  dns_open = Time.now
  s.close
  dns_close = Time.now

  ip_open_dur = dur(ip_start, ip_open)
  ip_close_dur = dur(ip_open, ip_close)
  dns_open_dur = dur(dns_start, dns_open)
  dns_close_dur = dur(dns_open, dns_close)

  puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close: #{ip_close_dur} | " \
       "dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"
end

You will need to first nslookup (or otherwise determine) the IP that the
UAA_DOMAIN resolves to (it will be some load balancer, possibly the
gorouter, ha_proxy, or your own upstream LB)

4. Grab the files in /var/vcap/sys/log/consul_agent/
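
Before and after editing /etc/resolv.conf for step 2, it may help to confirm which nameservers and options are actually in the file. A minimal sketch, assuming the usual `nameserver ADDRESS` and `options name:value` line format; `summarize_resolv_conf` is a hypothetical helper name and not part of any library mentioned in this thread:

```ruby
# Summarize resolv.conf-style text as { nameservers: [...], options: {...} }.
# Only the "nameserver" and "options" keywords are handled; others are ignored.
def summarize_resolv_conf(text)
  config = { nameservers: [], options: {} }
  text.each_line do |line|
    line = line.sub(/[#;].*/, '').strip # drop comments and surrounding whitespace
    next if line.empty?

    keyword, *args = line.split
    case keyword
    when 'nameserver'
      config[:nameservers] << args.first if args.first
    when 'options'
      args.each do |opt|
        name, value = opt.split(':', 2)
        # "timeout:4" becomes {"timeout" => 4}; a bare flag such as "rotate" becomes true
        config[:options][name] = value ? value.to_i : true
      end
    end
  end
  config
end

# Usage on the CC box:
#   p summarize_resolv_conf(File.read('/etc/resolv.conf'))
```

If no `timeout` option shows up, the resolver falls back to its built-in default, which the glibc resolv.conf documentation gives as 5 seconds.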

Cheers,
Amit

On Fri, Oct 30, 2015 at 4:29 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Here's the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

The time is all spent between "opening connection" and "opened", with the
corresponding ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
  TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much ruby, so that's as far as I drilled down.
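
The gap between those two debug lines can be measured directly. Net::HTTP's `set_debug_output` writes its debug messages to any object that responds to `<<`, so a small wrapper can prefix each line with the elapsed time. This is a sketch under assumptions: `TimestampedLog` and `debug_http` are hypothetical helper names, and the host in the usage comment is a placeholder.

```ruby
require 'net/http'

# Hypothetical helper: prefixes every debug line with milliseconds since creation.
class TimestampedLog
  def initialize(io = $stderr)
    @io = io
    @t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @at_line_start = true
  end

  # Net::HTTP may send a message and its newline separately,
  # so stamp only the start of each line.
  def <<(fragment)
    fragment.to_s.each_line do |part|
      @io << format('[%8.1f ms] ', elapsed_ms) if @at_line_start
      @io << part
      @at_line_start = part.end_with?("\n")
    end
    self
  end

  private

  def elapsed_ms
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) - @t0) * 1000
  end
end

# Build (but do not start) an HTTP client whose debug output is timestamped.
def debug_http(host, port = 80, io = $stderr)
  http = Net::HTTP.new(host, port)
  http.set_debug_output(TimestampedLog.new(io))
  http
end

# Usage (placeholder domain):
#   debug_http('--UAA-DOMAIN--').start { |http| http.head('/') }
```

Debug output includes request and response details, so this is only suitable for a troubleshooting session, not for production logging.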

-Matt


Re: json data and the cli

Koper, Dies <diesk@...>
 

Hi Matthew,

Thank you for raising the issue in GH. We’d welcome a PR with yaml file support to cups. We can extrapolate from there.

Regards,
Dies Koper


From: Matthew Sykes [mailto:matthew.sykes(a)gmail.com]
Sent: Tuesday, October 27, 2015 7:43 PM
To: Discussions about Cloud Foundry projects and the system overall.
Subject: [cf-dev] Re: Re: json data and the cli

Given the scope and cross cutting nature of the issue, I think it's something better handled by the CLI team via prioritized stories. Speaking from experience, large PRs and PRs that involve refactors tend to be painful for all involved.

If you'd like, I can raise an issue in GH. If you want a PR to simply add yaml support to cups (which is what's most painful for me), I can probably do that.

On Tue, Oct 27, 2015 at 2:32 AM, Koper, Dies <diesk(a)fast.au.fujitsu.com> wrote:
Hi Matthew,

The commands you listed were developed over a span of several years, and their options have evolved.
There is no technical reason for the missing options not to be added now and implemented reusing each other’s code.
I think it’s mostly because there haven’t been enough user requests so it hasn’t been prioritised.

PEM credentials for a user provided service is a good example of a use case that’s not well catered for with the current specification.
Please let us know if you’d like to submit a PR to address this!

Regards,
Dies Koper
PM Dojo’er CLI team


From: Matthew Sykes [mailto:matthew.sykes(a)gmail.com]
Sent: Monday, October 26, 2015 12:00 PM
To: Discussions about Cloud Foundry projects and the system overall.
Subject: [cf-dev] json data and the cli

There are a number of places in the cli where a user needs to provide a json payload to the cli. For example:

create-service/update-service
bind-service
create-service-keys
create-security-group/update-security-group

Depending on the command, the user can provide a list of keys to be prompted for, inline json, a pointer to a file containing json, or some indeterminate combination of those options. That last bit is the problem - there's no consistency across all of these commands.

Is there a reason why all of these commands can't use the same basic infrastructure in the cli to provide a consistent behavior?

Also, when dealing with json data that contains new lines, it's very painful for a human. (Think PEM encoded certificate chains as credentials in a user provided service.) Given yaml is a superset of json, is there a reason why we shouldn't or can't support yaml in the file representation of the data?
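
To illustrate the point about new lines, here is a small sketch comparing the two representations. The certificate body is a placeholder, the `credentials`/`cert` keys are made up for the example, and the code uses Ruby's standard `json` and `yaml` libraries rather than anything from the cli:

```ruby
require 'json'
require 'yaml'

# Placeholder credentials as JSON: every line break must be written as an escaped \n.
json_text = '{"credentials": {"cert": "-----BEGIN CERTIFICATE-----\nPLACEHOLDER\n-----END CERTIFICATE-----\n"}}'

# The same data as YAML: a block scalar (|) keeps the line breaks readable.
yaml_text = <<~YAML
  credentials:
    cert: |
      -----BEGIN CERTIFICATE-----
      PLACEHOLDER
      -----END CERTIFICATE-----
YAML

from_json     = JSON.parse(json_text)
json_via_yaml = YAML.safe_load(json_text) # the YAML parser also accepts this JSON text
from_yaml     = YAML.safe_load(yaml_text)

puts "JSON read by YAML parser matches: #{json_via_yaml == from_json}"
puts "YAML block scalar matches JSON:   #{from_yaml == from_json}"
```

Because a YAML parser accepts JSON like this, a command that reads its file with a YAML parser could keep accepting existing JSON files while also allowing the more readable block form.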

Thanks.

--
Matthew Sykes
matthew.sykes(a)gmail.com



--
Matthew Sykes
matthew.sykes(a)gmail.com


Re: PHP extension 'gettext' doesn't work?

JT Archie <jarchie@...>
 

We've added support for some locales into the rootfs. There are quite a few
locales, and we don't know whether we need to officially support all of them.
Our current list comes from a consumer-level list of *most used* locales.

Please feel free to review this commit
<https://github.com/cloudfoundry/stacks/commit/748cb604ef55f3eeb334d960f496661b58ec50ca>.
We've made it explicit what locales are supported.

With these changes, we've been able to make the PHP `gettext` extension
work with the example. It has been added to our test suite
<https://github.com/cloudfoundry/php-buildpack/tree/develop/cf_spec/fixtures/php_app_testing_locale>
for future support.

Let us know if you need anything else.

Kind Regards,

JT

On Fri, Oct 30, 2015 at 1:41 PM, Mike Dalessio <mdalessio(a)pivotal.io> wrote:

Unfortunately, the apt-buildpack only works for installing staging-time
dependencies, and not runtime dependencies.

It could be made to work, but the core buildpacks team simply have not
done so because nobody has asked (yet). ;)


On Fri, Oct 30, 2015 at 1:37 PM, Guillaume Berche <bercheg(a)gmail.com>
wrote:

I agree the inclusion of the lang pack into linuxfs2 seems the best option.

I'm wondering though whether a temporary workaround could be to install
the "locales" debian package using apt-buildpack [1] (no sudo needed) and
combine it with the php buildpack using the multi buildpack [2]? I was
planning to test that for another purpose but have not had the chance yet.
I'm interested in hearing the outcome.

Guillaume.

[1] https://github.com/pivotal-cf-experimental/apt-buildpack

[2] https://github.com/ddollar/heroku-buildpack-multi
On 27 Oct 2015 at 02:41, "Hiroaki Ukaji" <dt3snow.w(a)gmail.com> wrote:


Hi.

Thanks to you, I understood why i18n by gettext didn't work on CF.
Certainly, the language pack "ja_JP.utf8" only exists on my local
machine.

Anyway, we're glad to hear that the debian package "locales" will be
added to the rootfs.
It will resolve this issue, and then we will be able to manage i18n with
the gettext extension on CF.


Thanks a lot.

Hiroaki UKAJI



--
View this message in context:
http://cf-dev.70369.x6.nabble.com/cf-dev-PHP-extension-gettext-doesn-t-work-tp1984p2450.html
Sent from the CF Dev mailing list archive at Nabble.com.


Re: UAA branding and scope descriptions

Matthew Sykes <matthew.sykes@...>
 

No formal extension process currently exists to do what you're asking for.
The topic has been raised at the runtime PMC as others have similar needs.
It sounds like the identity team may have some plans to address that before
too long.

In the meantime, you can build your own login server and configure your
deployment to use it. It won't completely disable all of the branding that
exists in the UAA (another issue raised at the PMC) but my understanding is
that there are plans to address that too.

Sree, can you elaborate on any plans in this space?

Thanks.

On Mon, Nov 2, 2015 at 11:55 AM, john mcteague <john.mcteague(a)gmail.com>
wrote:

I have two ways in which I want to customize the UAA:

- Brand the login screen with my company L&F
- Add descriptions for custom scopes so that the access confirmation
messages are relevant (currently defined in messages.properties [1] )

Do I need to fork the UAA and maintain that or is there an extension
process that I am not aware of?

Thanks,
John

[1] -
https://github.com/cloudfoundry/uaa/blob/bbea63986bbf2de9c42f231668e344a4a321184c/uaa/src/main/resources/messages.properties


--
Matthew Sykes
matthew.sykes(a)gmail.com
