Re: Does warden container/daemon allow swap?

Matthew Sykes <matthew.sykes@...>
 

It doesn't mean that the process *can't* swap but it does mean that when
memory usage exceeds the limit, the kernel won't swap out any pages to
avoid the oom killer.

There were some discussions about this quite a while back and some folks
expressed concerns about it in relation to existing deployments and how to
get people to update the swap size because (if I recall correctly) of how
bosh manages it.

It might be a good idea to propose a change to garden to allow diego to
specify the memory+swap limit and figure out the right way forward.

On Sun, Nov 1, 2015 at 11:28 AM, Shaozhen Ding <dsz0111(a)gmail.com> wrote:

I looked into the source code of warden and realized that it sets these
two cgroup memory settings:


https://github.com/cloudfoundry/warden/blob/76010f2ba12e41d9e8755985ec874391fb3c962a/warden/lib/warden/container/features/mem_limit.rb#L108

Both memory.limit_in_bytes and memory.memsw.limit_in_bytes are set to the
same value, which means that memory and memory + swap have the exact same
limit.
Does this mean the processes running in a Warden container cannot swap at
all, since before swapping the container will be killed by the OOM killer?
I wonder if this is a good strategy.

BTW, looking at Docker: by default it sets memory.memsw.limit_in_bytes = 2
* memory.limit_in_bytes, which gives the application process some swap room.
--
Matthew Sykes
matthew.sykes(a)gmail.com


UAA branding and scope descriptions

john mcteague <john.mcteague@...>
 

I have two ways in which I want to customize the UAA:

- Brand the login screen with my company's look and feel (L&F)
- Add descriptions for custom scopes so that the access confirmation
messages are relevant (currently defined in messages.properties [1])

Do I need to fork the UAA and maintain that or is there an extension
process that I am not aware of?

Thanks,
John

[1] -
https://github.com/cloudfoundry/uaa/blob/bbea63986bbf2de9c42f231668e344a4a321184c/uaa/src/main/resources/messages.properties


Re: How to Handle the Intersection between Diego and CF Jobs

Ronak Banka
 

On Monday, November 2, 2015, Ramon Erb <web01(a)web-coach.ch> wrote:

I installed CF and then found out that Diego is not included. I can't get
"generate_deployment_manifest" to work for the Diego installation because I
get "unresolved nodes" errors and don't know how to handle them. So I tried
to create a manifest myself, because I had similar problems with the CF
installation and was able to write a working manifest for the CF setup
based on manifests posted elsewhere.

I thought it makes no sense to generate the jobs "nats" and "etcd" because
they already exist in my running CF (and its manifest).
Is it possible/wise to (re)use the jobs from CF?
For the Diego installation I need the job template "file_server"; how
can I integrate it into CF?

I want to install this Diego version:
https://github.com/cloudfoundry-incubator/diego-release/tree/v0.1434.0
And if that is successful, switch to:
https://github.com/cloudfoundry-incubator/diego-docker-cache-release
because I want to use my own Docker repository.

I use this manifest for reference:
https://github.com/cppforlife/bosh-diego-cpi-release/blob/master/manifests/diego.yml
Is there another place where I can get complete Diego manifests for
reference?

Thank you! Nguinaro


How to Handle the Intersection between Diego and CF Jobs

Nguinaro Givol
 

I installed CF and then found out that Diego is not included. I can't get "generate_deployment_manifest" to work for the Diego installation because I get "unresolved nodes" errors and don't know how to handle them. So I tried to create a manifest myself, because I had similar problems with the CF installation and was able to write a working manifest for the CF setup based on manifests posted elsewhere.

I thought it makes no sense to generate the jobs "nats" and "etcd" because they already exist in my running CF (and its manifest).
Is it possible/wise to (re)use the jobs from CF?
For the Diego installation I need the job template "file_server"; how can I integrate it into CF?
I want to install this Diego version: https://github.com/cloudfoundry-incubator/diego-release/tree/v0.1434.0
And if that is successful, switch to: https://github.com/cloudfoundry-incubator/diego-docker-cache-release
because I want to use my own Docker repository.

I use this manifest for reference: https://github.com/cppforlife/bosh-diego-cpi-release/blob/master/manifests/diego.yml
Is there another place where I can get complete Diego manifests for reference?

Thank you! Nguinaro


Re: How does Warden limit the socket queue?

Matthew Sykes <matthew.sykes@...>
 

As a user, you can't. Even though somaxconn is associated with the network
namespace, you need to be privileged (root) to change it.

If you wanted a higher default across all containers, you could modify
warden to do that as part of its net setup (linux/skeleton/net.sh).

On Mon, Nov 2, 2015 at 3:38 AM, yancey0623 <yancey0623(a)163.com> wrote:

Dear all!

I pushed an app with uwsgi, but it crashed. After investigation, the
"listen" argument in uwsgi.ini (the uwsgi config file) is the cause of the
crash. It's too large; when I reduce it from 256 to 128, it works. Here is
the error info:


2015-11-02T16:35:11.57+0800 [App/0] ERR Listen queue size is greater
than the system max net.core.somaxconn (128).

2015-11-02T16:35:11.57+0800 [App/0] ERR VACUUM: pidfile removed.

My OS setting is:

paas(a)tsh-cf-dev-01:~/hello-python$ cat /proc/sys/net/core/somaxconn

1024

Does anyone know where I can configure these network settings?
--
Matthew Sykes
matthew.sykes(a)gmail.com


How does Warden limit the socket queue?

Yancey
 

Dear all!

I pushed an app with uwsgi, but it crashed. After investigation, the "listen" argument in uwsgi.ini (the uwsgi config file) is the cause of the crash. It's too large; when I reduce it from 256 to 128, it works. Here is the error info:


2015-11-02T16:35:11.57+0800 [App/0]      ERR Listen queue size is greater than the system max net.core.somaxconn (128).

2015-11-02T16:35:11.57+0800 [App/0]      ERR VACUUM: pidfile removed.

My OS setting is:

paas@tsh-cf-dev-01:~/hello-python$ cat /proc/sys/net/core/somaxconn  

1024

Does anyone know where I can configure these network settings?
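As a quick sanity check before tuning uwsgi's listen value, something like the following compares the requested backlog against the limit visible inside the container. This is a sketch, not CF tooling; the 128 fallback is an assumption (the historical Linux default):

```ruby
# Compare a desired listen backlog against net.core.somaxconn as seen from
# inside the container. A backlog above somaxconn triggers the uwsgi error
# quoted above.
def backlog_allowed?(requested, somaxconn)
  requested <= somaxconn
end

path = '/proc/sys/net/core/somaxconn'
# Fall back to 128, the old Linux default, if the proc file is unavailable.
somaxconn = File.exist?(path) ? File.read(path).to_i : 128
puts "somaxconn=#{somaxconn}, listen 256 allowed? #{backlog_allowed?(256, somaxconn)}"
```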


Re: Error to make a Request to update password in UAA

Juan Antonio Breña Moral <bren at juanantonio.info...>
 

Good morning,

The user which I use has the following groups:

[ { value: 'c6032d43-5eb6-4719-8ff5-5ec3b6bf7cf8', display: 'approvals.me', type: 'DIRECT' },
  { value: 'dead2fa1-02f3-46a4-9072-5315e7c692ac', display: 'cloud_controller.read', type: 'DIRECT' },
  { value: 'efe5e709-3c75-47e2-a921-d7efc1535a7d', display: 'doppler.firehose', type: 'DIRECT' },
  { value: '7b545c0e-7cd4-4ca3-87f6-0458594f928d', display: 'openid', type: 'DIRECT' },
  { value: '7b2e9bac-606f-4e03-87a8-f8121616521f', display: 'cloud_controller_service_permissions.read', type: 'DIRECT' },
  { value: '8605ea52-a0d5-4801-91c8-cf8cb6f79c4b', display: 'cloud_controller.write', type: 'DIRECT' },
  { value: '156bb655-4ef4-4068-a0ed-fa877e03eb51', display: 'uaa.user', type: 'DIRECT' },
  { value: '67c233cf-5950-4d03-a534-016d1d3baf15', display: 'scim.read', type: 'DIRECT' },
  { value: 'e08cbc0b-a032-4c03-86ed-03e6b33a585e', display: 'notification_preferences.write', type: 'DIRECT' },
  { value: '084442d8-fa2c-416e-9163-9f26ea928316', display: 'notification_preferences.read', type: 'DIRECT' },
  { value: 'a4a76783-2440-46df-bacb-5ea3e3d8cb82', display: 'cloud_controller.admin', type: 'DIRECT' },
  { value: '1b48d072-6715-40a4-b01c-f4f8ede67db9', display: 'password.write', type: 'DIRECT' },
  { value: 'e7ed28ab-3a6e-429a-9b4b-fcf921e1b5dd', display: 'oauth.approvals', type: 'DIRECT' },
  { value: '203a26f5-c022-4b94-8368-16fb8eec2b37', display: 'scim.write', type: 'DIRECT' },
  { value: '52ed4af3-1a7b-413d-9189-ab2e2b750d8b', display: 'scim.me', type: 'DIRECT' } ]

Is it OK to update the password for another user created with this account?

Juan Antonio


Re: CFScaler - CloudFoundry Auto Scaling

Gwenn Etourneau
 

Nice!
Do you have any documentation?


Thanks



On Mon, Nov 2, 2015 at 3:57 PM, Nguyen Dang Minh <nguyendangminh(a)gmail.com>
wrote:

Hi CF nuts,

I'm from FPT Software. We've just open-sourced CFScaler, an auto-scaling
feature for Cloud Foundry. The repository is located here:
https://github.com/cloudfoundry-community/cfscaler

Auto scaling seems to be a high-demand feature in the CF community, but we
didn't find it in any open source CF distribution, so we decided to develop
it ourselves. CFScaler is being used in some of our workloads and serves
us well enough.

There's still some work to be done: code cleanup, refactoring, documentation,...
We hope it'll be ready for you in about a week.

CFScaler still needs improvement, and we'll publish the milestones soon. At
FPT Software we have a CF team and dedicated people maintaining and
developing CFScaler. All contributions are welcome: code, issues, ideas,
feature requests,...

Enjoy it.

Regards,
MinhND
--
Nguyen Dang Minh - 阮登明
http://www.minhnd.com


CFScaler - CloudFoundry Auto Scaling

Nguyen Dang Minh
 

Hi CF nuts,

I'm from FPT Software. We've just open-sourced CFScaler, an auto-scaling
feature for Cloud Foundry. The repository is located here:
https://github.com/cloudfoundry-community/cfscaler

Auto scaling seems to be a high-demand feature in the CF community, but we
didn't find it in any open source CF distribution, so we decided to develop
it ourselves. CFScaler is being used in some of our workloads and serves
us well enough.

There's still some work to be done: code cleanup, refactoring, documentation,...
We hope it'll be ready for you in about a week.

CFScaler still needs improvement, and we'll publish the milestones soon. At
FPT Software we have a CF team and dedicated people maintaining and
developing CFScaler. All contributions are welcome: code, issues, ideas,
feature requests,...

Enjoy it.

Regards,
MinhND
--
Nguyen Dang Minh - 阮登明
http://www.minhnd.com


CloudFoundry scalability benchmark

harry zhang
 

Hi guys,

We have been using Cloud Foundry as our first-class PaaS layer since it was released, but for now our cluster is still limited to 100 servers.

So I wonder: is there a scalability benchmark [1] for Cloud Foundry, including Diego?

[1] For example, Kubernetes claims that before the v1.1 release their goal is 100 nodes with 30 pods per node. See: http://blog.kubernetes.io/2015/09/kubernetes-performance-measurements-and.html


Re: cloud_controller_ng performance degrades slowly over time

Matt Cholick
 

Amit,
It looks like consul isn't configured as a recursive resolver. When running
the above code, resolving fails on the first nameserver and the script
fails. resolv-replace's TCPSocket.open is different from the code http.rb
(and thus the API) is using: http.rb pulls in 'net/protocol'. I changed
the script, replacing the require for 'resolv-replace' with 'net/protocol'
to match the cloud controller.

Results:

3286 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4 ms | dns_close: 0 ms
3287 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3288 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 6 ms | dns_close: 0 ms
3289 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3290 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3291 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3292 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms | dns_close: 0 ms
3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3296 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3297 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4006 ms | dns_close: 0 ms
3298 -- ip_open: 2 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3299 -- ip_open: 3 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3300 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3301 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3302 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms

And the consul logs, though there's nothing interesting there:
https://gist.github.com/cholick/03d74f7f012e54c50b56
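For anyone reproducing this, the difference Matt describes is that requiring 'resolv-replace' monkey-patches TCPSocket to resolve hostnames with Ruby's pure-Ruby Resolv library, while net/protocol leaves resolution to the plain TCPSocket / libc path. A rough illustration (not CC code) of what the patched path does:

```ruby
# With resolv-replace loaded, TCPSocket.open first resolves the host roughly
# like this, using Ruby's Resolv (which reads /etc/hosts, then the
# nameservers in /etc/resolv.conf):
require 'resolv'

ip = Resolv.getaddress('localhost')  # resolved from /etc/hosts, no network needed
puts ip

# Without resolv-replace, TCPSocket.open hands the hostname straight to the
# C socket layer, so the libc resolver (and its timeout behavior) applies.
```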

On Fri, Oct 30, 2015 at 5:51 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Yup, that's what I was suspecting. Can you try the following now:

1. Add something like the following to your cf manifest:

...
jobs:
...
- name: cloud_controller_z1
  ...
  properties:
    consul:
      agent:
        ...
        log_level: debug
...

This will set the debug level for the consul agents on your CC job to
debug, so we might be able to see more for its logs. It only sets it on
the job that matters, so when you redeploy, it won't have to roll the whole
deployment. It's okay if you can't/don't want to do this, I'm not sure how
much you want to play around with your environment, but it could be helpful.

2. Add the following line to the bottom of your /etc/resolv.conf

options timeout:4

Let's see if the slow DNS is on the order of 4000ms now, to pin down where
the 5s is exactly coming from.

3. Run the following script on your CC box:

require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--' # e.g. 'uaa.run.pivotal.io'
UAA_IP = '--CHANGE-ME-TOO--' # e.g. '52.21.135.158'

def dur(start_time, end_time)
  "#{(1000*(end_time-start_time)).round} ms"
end

1.step do |i|
  ip_start = Time.now
  s = TCPSocket.open(UAA_IP, 80)
  ip_open = Time.now
  s.close
  ip_close = Time.now

  dns_start = Time.now
  s = TCPSocket.open(UAA_DOMAIN, 80)
  dns_open = Time.now
  s.close
  dns_close = Time.now

  ip_open_dur = dur(ip_start, ip_open)
  ip_close_dur = dur(ip_open, ip_close)
  dns_open_dur = dur(dns_start, dns_open)
  dns_close_dur = dur(dns_open, dns_close)

  puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close: #{ip_close_dur} | dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"
end

You will need to first nslookup (or otherwise determine) the IP that the
UAA_DOMAIN resolves to (it will be some load balancer, possibly the
gorouter, ha_proxy, or your own upstream LB)

4. Grab the files in /var/vcap/sys/log/consul_agent/

Cheers,
Amit

On Fri, Oct 30, 2015 at 4:29 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Here's the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

The time is all spent between "opening connection" and "opened", with the
corresponding Ruby source in http.rb's connect method:
D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
  TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much Ruby, so that's as far as I drilled down.

-Matt


Does warden container/daemon allow swap?

Shaozhen Ding
 

I looked into the source code of warden and realized that it sets these two cgroup memory settings:

https://github.com/cloudfoundry/warden/blob/76010f2ba12e41d9e8755985ec874391fb3c962a/warden/lib/warden/container/features/mem_limit.rb#L108

Both memory.limit_in_bytes and memory.memsw.limit_in_bytes are set to the same value, which means that memory and memory + swap have the exact same limit.
Does this mean the processes running in a Warden container cannot swap at all, since before swapping the container will be killed by the OOM killer? I wonder if this is a good strategy.

BTW, looking at Docker: by default it sets memory.memsw.limit_in_bytes = 2 * memory.limit_in_bytes, which gives the application process some swap room.
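To make the two limit policies concrete, here is a minimal Ruby illustration. The helper name and the policy symbols are hypothetical, not Warden source (warden's actual logic lives in mem_limit.rb, linked above):

```ruby
# Illustrative helper (not Warden code): derive memory.memsw.limit_in_bytes
# from memory.limit_in_bytes under the two policies discussed above.
def memsw_limit(mem_limit_bytes, policy)
  case policy
  when :warden then mem_limit_bytes      # memsw == mem: no room to swap before OOM
  when :docker then 2 * mem_limit_bytes  # default 2x: some swap headroom
  else raise ArgumentError, "unknown policy #{policy}"
  end
end

mem = 256 * 1024 * 1024                # a 256 MB container
puts memsw_limit(mem, :warden)         # => 268435456
puts memsw_limit(mem, :docker)         # => 536870912
```

Under the :warden policy the container hits the OOM killer as soon as the memory limit is exceeded; under the :docker policy it can swap up to one extra memory-limit's worth of pages first.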


Re: Source IP ACLs

Noburou TANIGUCHI
 

We have implemented the feature in a proprietary fork of Gorouter, but
similar functionality will probably be achieved by Route Services [1]. There
seems to be little information [2] about it and I also want to know the
progress.

[1]
https://docs.google.com/document/d/1bGOQxiKkmaw6uaRWGd-sXpxL0Y28d3QihcluI15FiIA/edit#heading=h.8djffzes9pnb

[2] https://www.pivotaltracker.com/n/projects/966314


Carlo Alberto Ferraris-2 wrote
Is there any provision for restricting the source IPs that are allowed to
access a certain application (or route)? Or the only way to do this is to
place a reverse proxy in front of the gorouter?
In case the reverse proxy is the only way to go, would there be interest
in having something like this implemented inside the gorouter itself? (We're
willing to contribute.)




-----
I'm not a ...
noburou taniguchi
--
View this message in context: http://cf-dev.70369.x6.nabble.com/cf-dev-Source-IP-ACLs-tp2518p2544.html
Sent from the CF Dev mailing list archive at Nabble.com.


[abacus] Abacus v0.0.2 available

Jean-Sebastien Delfino
 

I'm happy to announce the availability of CF Abacus v0.0.2 (incubating).

Abacus provides usage metering and aggregation for Cloud Foundry services
and app runtimes.

I'd like to thank the Abacus committers as well as all the contributors who
helped test our v0.0.2-rc1 and rc2 release candidates and provided
feedback, issues and pull requests leading to the release of Abacus v0.0.2.

The release Git tag and release notes can be found on Github:
https://github.com/cloudfoundry-incubator/cf-abacus/releases/tag/v0.0.2

The CI build can be found on Travis CI:
https://travis-ci.org/cloudfoundry-incubator/cf-abacus/builds/88470969

The npm modules can be found on npmjs:
https://www.npmjs.com/search?q=cf-abacus

Please feel free to ask any questions about this release of Abacus on this
list.
Issues or -- even better -- pull requests are welcome on Github as well!

For more info on Abacus please visit:
https://github.com/cloudfoundry-incubator/cf-abacus/tree/v0.0.2

Thanks!

- Jean-Sebastien


Re: cloud_controller_ng performance degrades slowly over time

Amit Kumar Gupta
 

Yup, that's what I was suspecting. Can you try the following now:

1. Add something like the following to your cf manifest:

...
jobs:
...
- name: cloud_controller_z1
  ...
  properties:
    consul:
      agent:
        ...
        log_level: debug
...

This will set the debug level for the consul agents on your CC job to
debug, so we might be able to see more for its logs. It only sets it on
the job that matters, so when you redeploy, it won't have to roll the whole
deployment. It's okay if you can't/don't want to do this, I'm not sure how
much you want to play around with your environment, but it could be helpful.

2. Add the following line to the bottom of your /etc/resolv.conf

options timeout:4

Let's see if the slow DNS is on the order of 4000ms now, to pin down where
the 5s is exactly coming from.

3. Run the following script on your CC box:

require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--' # e.g. 'uaa.run.pivotal.io'
UAA_IP = '--CHANGE-ME-TOO--' # e.g. '52.21.135.158'

def dur(start_time, end_time)
  "#{(1000*(end_time-start_time)).round} ms"
end

1.step do |i|
  ip_start = Time.now
  s = TCPSocket.open(UAA_IP, 80)
  ip_open = Time.now
  s.close
  ip_close = Time.now

  dns_start = Time.now
  s = TCPSocket.open(UAA_DOMAIN, 80)
  dns_open = Time.now
  s.close
  dns_close = Time.now

  ip_open_dur = dur(ip_start, ip_open)
  ip_close_dur = dur(ip_open, ip_close)
  dns_open_dur = dur(dns_start, dns_open)
  dns_close_dur = dur(dns_open, dns_close)

  puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close: #{ip_close_dur} | dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"
end

You will need to first nslookup (or otherwise determine) the IP that the
UAA_DOMAIN resolves to (it will be some load balancer, possibly the
gorouter, ha_proxy, or your own upstream LB)

4. Grab the files in /var/vcap/sys/log/consul_agent/

Cheers,
Amit

On Fri, Oct 30, 2015 at 4:29 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Here's the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

The time is all spent between "opening connection" and "opened", with the
corresponding Ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
  TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much Ruby, so that's as far as I drilled down.

-Matt


Re: cloud_controller_ng performance degrades slowly over time

Matt Cholick
 

Here's the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

The time is all spent between "opening connection" and "opened", with the
corresponding Ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
  TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much Ruby, so that's as far as I drilled down.

-Matt


Re: Error to make a Request to update password in UAA

Juan Antonio Breña Moral <bren at juanantonio.info...>
 

Many thanks for the reply.

Next Monday morning, I will update the class and test again.

Cheers


Re: cloud_controller_ng performance degrades slowly over time

Amit Kumar Gupta
 

Ah, my bad. We need to patch the logger to include timestamps, because the
net/http library calls << on it instead of calling info:

require 'uri'
require 'net/http'
require 'logger'

SYSTEM_DOMAIN = '--CHANGE-ME--'

u = URI.parse('http://uaa.' + SYSTEM_DOMAIN + '/login')
h = Net::HTTP.new(u.host, u.port)
l = Logger.new('/var/vcap/data/tmp/slow-dns.log')
def l.<<(msg); info(msg); end
h.set_debug_output(l)

1.step do |i|
  l.info('Request number: %04d' % i)
  s = Time.now
  r = h.head(u.path)
  d = Time.now - s
  l.info('Duration: %dms' % (d * 1000).round)
  l.info('Response code: %d' % r.code)
  l.error('!!! SLOW !!!') if d > 5
end

On Fri, Oct 30, 2015 at 7:35 AM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
Here are the results:

https://gist.github.com/cholick/b448df07e9e493369d9e

The before and after pictures look pretty similar, nothing jumps out as
interesting.

On Thu, Oct 29, 2015 at 11:28 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Matt, that's awesome, thanks! Mind trying this?

require 'uri'
require 'net/http'
require 'logger'

SYSTEM_DOMAIN = '--CHANGE-ME--'

u = URI.parse('http://uaa.' + SYSTEM_DOMAIN + '/login')
h = Net::HTTP.new(u.host, u.port)
l = Logger.new('/var/vcap/data/tmp/slow-dns.log')
h.set_debug_output(l)

1.step do |i|
  l.info('Request number: %04d' % i)
  s = Time.now
  r = h.head(u.path)
  d = Time.now - s
  l.info('Duration: %dms' % (d * 1000).round)
  l.info('Response code: %d' % r.code)
  l.error('!!! SLOW !!!') if d > 5
end

I'd want to know what we see in /var/vcap/data/tmp/slow-dns.log before
and after the DNS slowdown. By having the http object take a debug logger,
we can narrow down what Ruby is doing that's making it uniquely slow.


On Thu, Oct 29, 2015 at 7:39 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
Here's a run with the problem manifesting:

...
00248 [200]: ruby 26ms | curl 33ms | nslookup 21ms
00249 [200]: ruby 20ms | curl 32ms | nslookup 14ms
00250 [200]: ruby 18ms | curl 30ms | nslookup 17ms
00251 [200]: ruby 22ms | curl 31ms | nslookup 16ms
00252 [200]: ruby 23ms | curl 30ms | nslookup 16ms
00253 [200]: ruby 26ms | curl 40ms | nslookup 16ms
00254 [200]: ruby 20ms | curl 40ms | nslookup 14ms
00255 [200]: ruby 20ms | curl 35ms | nslookup 20ms
00256 [200]: ruby 17ms | curl 32ms | nslookup 14ms
00257 [200]: ruby 20ms | curl 37ms | nslookup 14ms
00258 [200]: ruby 25ms | curl 1038ms | nslookup 14ms
00259 [200]: ruby 27ms | curl 37ms | nslookup 13ms
00260 [200]: ruby 4020ms | curl 32ms | nslookup 16ms
00261 [200]: ruby 5032ms | curl 45ms | nslookup 14ms
00262 [200]: ruby 5021ms | curl 30ms | nslookup 14ms
00263 [200]: ruby 5027ms | curl 32ms | nslookup 16ms
00264 [200]: ruby 5025ms | curl 34ms | nslookup 15ms
00265 [200]: ruby 5029ms | curl 31ms | nslookup 14ms
00266 [200]: ruby 5030ms | curl 37ms | nslookup 18ms
00267 [200]: ruby 5022ms | curl 43ms | nslookup 14ms
00268 [200]: ruby 5026ms | curl 31ms | nslookup 17ms
00269 [200]: ruby 5027ms | curl 33ms | nslookup 14ms
00270 [200]: ruby 5025ms | curl 32ms | nslookup 14ms
00271 [200]: ruby 5022ms | curl 36ms | nslookup 15ms
00272 [200]: ruby 5030ms | curl 32ms | nslookup 13ms
00273 [200]: ruby 5024ms | curl 32ms | nslookup 13ms
00274 [200]: ruby 5028ms | curl 34ms | nslookup 14ms
00275 [200]: ruby 5048ms | curl 30ms | nslookup 14ms


It's definitely interesting that Ruby is the only one to manifest the
problem.

And here's the consul output:
https://gist.github.com/cholick/f7e91fb58891cc0d8f5a


On Thu, Oct 29, 2015 at 4:27 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey Matt,

Dieu's suggestion will fix your problem (you'll have to make the change
on all CC's), although it'll get undone on each redeploy. We do want to
find the root cause, but have not been able to reproduce it in our own
environments. If you're up for some investigation, may I suggest the
following:

* Run the following variation of your script on one of the CCs:

require 'uri'
require 'net/http'

SYSTEM_DOMAIN = '--CHANGE-ME--'

uaa_domain = "uaa.#{SYSTEM_DOMAIN}"
login_url = "https://#{uaa_domain}/login"

curl_command = "curl -f #{login_url} 2>&1"
nslookup_command = "nslookup #{uaa_domain} 2>&1"

puts 'STARTING SANITY CHECK'
curl_output = `#{curl_command}`
raise "'#{curl_command}' failed with output:\n#{curl_output}" unless $?.to_i.zero?
puts 'SANITY CHECK PASSED'

def duration_string(start)
  "#{((Time.now - start) * 1000).round}ms"
end

puts 'STARTING TEST'
1.step do |i|
  uri = URI.parse(login_url)
  ruby_start = Time.now
  ruby_response = Net::HTTP.get_response(uri)
  ruby_duration = duration_string(ruby_start)

  curl_start = Time.now
  `#{curl_command}`
  curl_duration = duration_string(curl_start)

  nslookup_start = Time.now
  `#{nslookup_command}`
  nslookup_duration = duration_string(nslookup_start)

  puts "#{"%05d" % i} [#{ruby_response.code}]: ruby #{ruby_duration} | curl #{curl_duration} | nslookup #{nslookup_duration}"
end

* Send a kill -QUIT <consul_agent_pid> to the consul agent process once
you see the slow DNS manifest itself; you will get a dump of all the
goroutines running in the consul agent process in
/var/vcap/sys/log/consul_agent/consul_agent.stderr.log. I would be curious
to see what it spits out.

Amit


On Wed, Oct 28, 2015 at 6:10 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Thanks for taking a look, fingers crossed you can see it happen as
well.

Our 217 install is on stemcell 3026 and our 212 install is on 2989.

IaaS is CenturyLink Cloud.

-Matt

On Wed, Oct 28, 2015 at 6:08 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I got up to 10k on an AWS deployment of HEAD of cf-release with ruby
2.2, then started another loop on the same box with ruby 2.1. In the end,
they got up to 40-50k without showing any signs of change. I had to switch
to resolving the UAA endpoint; eventually Google started responding with
302s.

I'm going to try with a cf-release 212 deployment on my bosh lite,
but eventually I want to try on the same stemcell as you're using.

On Wed, Oct 28, 2015 at 5:01 PM, Amit Gupta <agupta(a)pivotal.io>
wrote:

Thanks Matt, this is awesome.

I'm trying to reproduce this with your script; up at 10k with no
change. I'm also shelling out to curl in the script, to see if both curl
and ruby get affected, and if so, whether they're affected at the same time.

What IaaS and stemcell are you using?

Thanks,
Amit

On Wed, Oct 28, 2015 at 2:54 PM, Dieu Cao <dcao(a)pivotal.io> wrote:

You might try moving the nameserver entry for the consul_agent in
/etc/resolv.conf on the cloud controller to the end to see if that helps.

-Dieu
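Dieu's suggestion amounts to reordering the resolver list so the consul agent (which listens on 127.0.0.1, assuming the standard colocation) is tried last. A hedged sketch of that transformation, operating on resolv.conf text rather than the live file:

```ruby
# Move any "nameserver 127.0.0.1" lines (the consul agent, assumed) to the
# end so the upstream resolvers are consulted first. Illustrative only;
# a redeploy will regenerate /etc/resolv.conf and undo a manual edit.
def demote_nameserver(resolv_conf, ip = '127.0.0.1')
  lines = resolv_conf.lines.map(&:chomp)
  demoted, rest = lines.partition { |l| l.strip == "nameserver #{ip}" }
  (rest + demoted).join("\n")
end

puts demote_nameserver("nameserver 127.0.0.1\nnameserver 10.0.0.2\n")
# nameserver 10.0.0.2
# nameserver 127.0.0.1
```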

On Wed, Oct 28, 2015 at 12:55 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Looks like you're right, and we're experiencing the same issue as
you, Amit. We're suffering slow DNS lookups. The code is spending all of
its time here:
/var/vcap/packages/ruby-2.1.6/lib/ruby/2.1.0/net/http.rb.initialize
:879

I've experimented some with the environment and, after narrowing
things down to DNS, here's a minimal script demonstrating the problem:

require "net/http"
require "uri"

# uri = URI.parse("http://uaa.example.com/info")
uri = URI.parse("https://www.google.com")

i = 0
while true do
  beginning_time = Time.now
  response = Net::HTTP.get_response(uri)

  end_time = Time.now
  i += 1
  puts "#{"%04d" % i} Response: [#{response.code}], Elapsed: #{((end_time - beginning_time)*1000).round} ms"
end


I see the issue hitting both UAA and just hitting Google. At some
point, requests start taking 5 seconds longer, which I assume is a timeout.
One run:

0349 Response: [200], Elapsed: 157 ms
0350 Response: [200], Elapsed: 169 ms
0351 Response: [200], Elapsed: 148 ms
0352 Response: [200], Elapsed: 151 ms
0353 Response: [200], Elapsed: 151 ms
0354 Response: [200], Elapsed: 152 ms
0355 Response: [200], Elapsed: 153 ms
0356 Response: [200], Elapsed: 6166 ms
0357 Response: [200], Elapsed: 5156 ms
0358 Response: [200], Elapsed: 5158 ms
0359 Response: [200], Elapsed: 5156 ms
0360 Response: [200], Elapsed: 5156 ms
0361 Response: [200], Elapsed: 5160 ms
0362 Response: [200], Elapsed: 5172 ms
0363 Response: [200], Elapsed: 5157 ms
0364 Response: [200], Elapsed: 5165 ms
0365 Response: [200], Elapsed: 5157 ms
0366 Response: [200], Elapsed: 5155 ms
0367 Response: [200], Elapsed: 5157 ms

Other runs look the same. How many requests it takes before things
time out varies considerably (one run started timing out within tens of
requests and another took 20k requests), but it always happens. After that,
lookups take an additional 5 seconds and never recover their initial speed.
This is why restarting the cloud controller fixes the issue (temporarily).

The really slow CLI calls (in the 1+ min range) are simply due to
the amount of paging that fetching data for a large org does, as that 5
seconds is multiplied out over several calls. Every user is feeling this
delay; it's just that it only becomes unworkable when pulling large
datasets from UAA.

I was not able to reproduce the timeouts using a script calling "dig"
against localhost, only inside Ruby code.

To reiterate our setup: we're running 212 without a consul
server, just the agents. I also successfully reproduced this problem in a
completely different 217 install in a different datacenter. That setup also
didn't have an actual consul server, just the agent. I don't see anything
in the release notes past 217 indicating that this is fixed.

Anyone have thoughts? This is definitely creating some real
headaches for user management in our larger orgs. Amit: is there a bug we
can follow?

-Matt


On Fri, Oct 9, 2015 at 10:52 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

You may not be running any consul servers, but you may have a
consul agent colocated on your CC VM and running there.

On Thu, Oct 8, 2015 at 5:59 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Zack & Swetha,
Thanks for the suggestion, will gather netstat info there next
time.

Amit,
The 1:20 delay is due to paging; the total call length for each page
is closer to 10s. I just included those two paged cf command-line calls to
demonstrate the dramatic difference after a restart. The delays disappear
after a restart. We're not running consul yet, so it wouldn't be that.

-Matt



On Thu, Oct 8, 2015 at 10:03 AM, Amit Gupta <agupta(a)pivotal.io>
wrote:

We've seen issues in some environments where requests to CC
that involve CC making a request to UAA or HM9000 have a 5s delay while the
local consul agent fails to resolve the DNS for uaa/hm9k, before moving on
to a different resolver.

The expected behavior, observed in almost all environments, is
that the DNS request to the consul agent fails fast and moves on to the next
resolver; we haven't figured out why a couple of envs exhibit different
behavior. The impact is a 5 or 10s delay (5 or 10, not 5 to 10). It doesn't
explain your 1:20 delay though. Are you always seeing delays that long?

Amit
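The arithmetic behind those numbers is simple: with a resolver timeout of 5s (a common default), each unresponsive nameserver ahead of a working one adds roughly one full timeout per lookup. This is a simplification of real resolver behavior, which also retries:

```ruby
# Simplified model of the delay pattern above: failing resolvers are tried
# in order, each eating one timeout before the next is consulted. Hence
# "5 or 10, not 5 to 10" for one or two stuck resolvers.
def lookup_delay_s(failing_resolvers, timeout_s = 5)
  failing_resolvers * timeout_s
end

puts lookup_delay_s(1)  # => 5
puts lookup_delay_s(2)  # => 10
```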


On Thursday, October 8, 2015, Zach Robinson <
zrobinson(a)pivotal.io> wrote:

Hey Matt,

I'm trying to think of other things that would affect only the
endpoints that interact with UAA and would be fixed after a CC restart.
I'm wondering if it's possible there are a large number of connections
being kept-alive, or stuck in a wait state or something. Could you take a
look at the netstat information on the CC and UAA next time this happens?

-Zach and Swetha


Re: Problem deploying basic Apps on PWS

Charles Wu
 

You can also download the latest CLI.

Note that all new apps deployed on PWS default to Diego as the app runner
environment. enable-diego is only needed to switch DEA-deployed apps to
Diego.

br,

Charles


On Fri, Oct 30, 2015 at 1:56 AM, Juan Antonio Breña Moral <
bren(a)juanantonio.info> wrote:

Hi Charles,

You said the clue!!!
Yesterday I updated the code and was able to deploy on PWS.

In environments without Diego, the way to read the port in a Node app is:

var localPort = process.env.VCAP_APP_PORT|| 5000;

With Diego the way is:

var localPort = process.env.PORT || 5000;

If the developer uses the Go CLI, it is necessary to tell the application
to use Diego:

cf push APP_XXX --no-start
cf enable-diego APP_XXX
cf start APP_XXX


Re: Permission denied error when unpacking droplet

Noburou TANIGUCHI
 

I am not sure at all, but it might be related to umask.

What is the umask of the user you deployed your CF with? (I assume you are
talking about a private CF.)

I've been feeling there's an implicit assumption in CF deployments with
bosh and cf-release that the umask is 022 (or 002, the Ubuntu default).
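To illustrate the umask point with a generic demo (not tied to cf-release): under a restrictive umask such as 077, files a process creates end up owner-only, which can surface exactly as a permission-denied error when a different user later unpacks or reads them.

```ruby
# Show how the process umask caps the mode of newly created files.
require 'tmpdir'

Dir.mktmpdir do |dir|
  path = File.join(dir, 'droplet-part')
  old = File.umask(0o077)             # restrictive umask
  File.write(path, 'data')            # requested mode 0666, masked down
  mode = File.stat(path).mode & 0o777
  File.umask(old)                     # restore the original umask
  puts format('%o', mode)             # => 600 (owner-only; 644 under umask 022)
end
```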



-----
I'm not a ...
noburou taniguchi
--
View this message in context: http://cf-dev.70369.x6.nabble.com/cf-dev-Re-Permission-denied-error-when-unpacking-droplet-tp2441p2537.html
Sent from the CF Dev mailing list archive at Nabble.com.
