Date   

Re: How to Handle the Intersection between Diego and CF Jobs

Eric Malm <emalm@...>
 

Hi, Ramon,

I don't fully understand how you're trying to deploy your CF and Diego
clusters together. Diego does need to interact with some components that
are currently deployed as part of CF, but we've currently structured Diego
as a separate deployment that integrates with those CF components. The
diego-release repo has its own manifest-generation script repo that takes
as inputs the CF deployment manifest and several other specific stubs and
produces a deployment manifest for Diego that should integrate correctly
with that CF deployment.

We also have written the Diego manifest-generation script to have more
structured inputs than the manifest-generation script in cf-release. In
particular, it takes in a defined list of input stubs, one of which
captures only specific information about your infrastructure (networks,
stemcell, cloud properties for resource pools), and another of which
captures only the instance counts of the Diego jobs. If you already have a
working CF manifest, I would suggest you try customizing those two stubs to
match your infrastructure details and desired deployment size and then use
them and your CF manifest as inputs to generate a compatible Diego
deployment manifest.
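If it helps, the invocation looks roughly like the following; the stub names
and argument order here are illustrative only, so check the diego-release
README for your release version for the exact arguments:

  cd diego-release
  ./scripts/generate-deployment-manifest \
    path/to/director-uuid-stub.yml \
    path/to/cf-deployment-manifest.yml \
    path/to/property-overrides-stub.yml \
    path/to/instance-count-overrides-stub.yml \
    path/to/iaas-settings-stub.yml \
    > diego-deployment.yml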

If you need some complete examples, I would recommend generating CF and
Diego manifests for BOSH-Lite first, following the instructions in the
diego-release README, and comparing them to the BOSH-Lite input stubs
located in
https://github.com/cloudfoundry-incubator/diego-release/tree/develop/manifest-generation/bosh-lite-stubs.
The manifest you mentioned in Dmitriy's Diego CPI release is over a year
old and is quite out of date. The Diego team will be working on publishing
more examples and tooling for deploying to other infrastructures, such as
AWS, in the near future.

Thanks,
Eric Malm, CF Runtime Diego PM

On Mon, Nov 2, 2015 at 5:34 AM, Ramon Erb <web01(a)web-coach.ch> wrote:

I installed CF and then found out that Diego is not included. I can't get
"generate_deployment_manifest" to work for the Diego installation because I
get "unresolved nodes" and don't know how to handle them. So I tried to
write a manifest myself; I had similar problems with the CF installation and
was able to write a working manifest for that setup based on manifests
posted elsewhere.

I thought it makes no sense to generate the jobs "nats" and "etcd" because
they already exist in my running CF (and its manifest).
Is it possible/wise to (re)use those jobs from CF?
For the Diego installation I need the job template "file_server"; how
can I integrate it into CF?

I want to install this Diego version:
https://github.com/cloudfoundry-incubator/diego-release/tree/v0.1434.0
and, if that is successful, switch to:
https://github.com/cloudfoundry-incubator/diego-docker-cache-release
because I want to use my own Docker repository.

I use this manifest for reference:
https://github.com/cppforlife/bosh-diego-cpi-release/blob/master/manifests/diego.yml
Is there another place where I can get complete Diego-Manifests for
reference?

Thank you! Nguinaro


Re: cf push docker image on diego bosh-lite fails

Christopher Piraino <cpiraino@...>
 

Ramesh,

I was seeing similar errors in another environment of ours, and after
checking the diego/garden-linux compatibility I found that garden-linux
0.325.0 had not been tested against 0.1439.0 of diego. You can see the
compatibilities here:
https://github.com/cloudfoundry-incubator/diego-cf-compatibility/blob/master/compatibility-v2.csv

I downgraded garden-linux to 0.316.0 and container creation worked for me.
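In case it helps, the downgrade looks roughly like this; the release URL and
manifest path are assumptions, so adjust them for your environment:

  # find garden-linux versions tested against your diego version in the CSV above
  grep '0.1439.0' compatibility-v2.csv
  # upload the matching garden-linux release, point bosh at your diego manifest, redeploy
  bosh upload release "https://bosh.io/d/github.com/cloudfoundry-incubator/garden-linux-release?v=0.316.0"
  bosh deployment path/to/cf-warden-diego.yml
  bosh deploy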

- Chris

On Wed, Nov 4, 2015 at 12:30 PM, Ramesh Sambandan <rsamban(a)gmail.com> wrote:

I am trying to push a docker app to diego in bosh lite.
I verified that the docker image cloudfoundry/lattice-app runs fine in
my local docker daemon.
I have enabled the diego_docker feature flag.
My cf version is 6.13.
I am doing “cf push dockerLattice --docker-image cloudfoundry/lattice-app”
and the following is my "cf logs dockerLattice --recent":

************************************************************************************************************************************
2015-11-04T13:06:58.71-0700 [API/0] OUT Created app with guid
05c02bed-464a-4c89-948f-9b1cf12144a0
2015-11-04T13:06:58.81-0700 [API/0] OUT Updated app with guid
05c02bed-464a-4c89-948f-9b1cf12144a0
({"route"=>"b7650962-d25d-446c-a074-c06c1b13d811"})
2015-11-04T13:07:04.01-0700 [API/0] OUT Updated app with guid
05c02bed-464a-4c89-948f-9b1cf12144a0 ({"state"=>"STARTED"})
2015-11-04T13:07:04.03-0700 [STG/0] OUT Creating container
2015-11-04T13:07:04.62-0700 [STG/0] OUT Successfully created container
2015-11-04T13:07:04.71-0700 [STG/0] OUT Staging...
2015-11-04T13:07:04.74-0700 [STG/0] OUT Staging process started ...
2015-11-04T13:07:06.48-0700 [STG/0] OUT Staging process finished
2015-11-04T13:07:06.48-0700 [STG/0] OUT Exit status 0
2015-11-04T13:07:06.48-0700 [STG/0] OUT Staging Complete
2015-11-04T13:07:06.90-0700 [CELL/0] OUT Creating container

2015-11-04T13:07:45.72-0700 [CELL/0] ERR Failed to create container
2015-11-04T13:07:45.74-0700 [API/0] OUT App instance exited with guid
05c02bed-464a-4c89-948f-9b1cf12144a0 payload:
{"instance"=>"c3e4ca30-98ac-4170-45cc-1755f3e6cc01", "index"=>0,
"reason"=>"CRASHED", "exit_description"=>"failed to initialize container",
"crash_count"=>4, "crash_timestamp"=>1446667665731251451,
"version"=>"73c93a4b-b1d6-43c0-90e4-44b5d8fd36ee"}

************************************************************************************************************************************

I tried to run the diego acceptance tests (the smoke tests were successful) on my
bosh lite and got the following error (./bin/test -skipPackage ssh):

************************************************************************************************************************************
...
[2015-11-04 19:58:26.24 (UTC)]> cf delete
39621ab3-cdbe-4790-4f90-294ec46a5546 -f
Deleting app 39621ab3-cdbe-4790-4f90-294ec46a5546 in org
CATS-ORG-1-2015_11_04-12h40m44.529s / space
CATS-SPACE-1-2015_11_04-12h40m44.529s as
CATS-USER-1-2015_11_04-12h40m44.529s...
OK

------------------------------
• Failure in Spec Setup (JustBeforeEach) [23.136 seconds]
Docker Application Lifecycle [JustBeforeEach] running the app merges the
garden and docker environment variables

/Users/rsamban/cloudFoundry/BoshLite/diego-acceptance-tests/diego/lifecycle_docker_test.go:70

No future change is possible. Bailing out early after 17.068s.
Expected
<int>: 1
to match exit code:
<int>: 0


/Users/rsamban/cloudFoundry/BoshLite/diego-acceptance-tests/diego/lifecycle_docker_test.go:46
------------------------------
SS

Summarizing 1 Failure:

[Fail] Docker Application Lifecycle [JustBeforeEach] running the app
merges the garden and docker environment variables

/Users/rsamban/cloudFoundry/BoshLite/diego-acceptance-tests/diego/lifecycle_docker_test.go:46

Ran 27 of 29 Specs in 1080.006 seconds
FAIL! -- 26 Passed | 1 Failed | 0 Pending | 2 Skipped --- FAIL:
TestApplications (1080.01s)
FAIL

Ginkgo ran 1 suite in 18m1.688798613s
Test Suite Failed

************************************************************************************************************************************

Following is output from "bosh deployments"

************************************************************************************************************************************

+-----------------+----------------------+-------------------------------------------------+--------------+
| Name            | Release(s)           | Stemcell(s)                                     | Cloud Config |
+-----------------+----------------------+-------------------------------------------------+--------------+
| cf-warden       | cf/222+dev.1         | bosh-warden-boshlite-ubuntu-trusty-go_agent/389 | none         |
+-----------------+----------------------+-------------------------------------------------+--------------+
| cf-warden-diego | cf/222+dev.1         | bosh-warden-boshlite-ubuntu-trusty-go_agent/389 | none         |
|                 | diego/0.1439.0+dev.1 |                                                 |              |
|                 | etcd/18              |                                                 |              |
|                 | garden-linux/0.325.0 |                                                 |              |
+-----------------+----------------------+-------------------------------------------------+--------------+

************************************************************************************************************************************

Can somebody help me please.

thanks
-Ramesh


Re: cloud_controller_ng performance degrades slowly over time

Matt Cholick
 

Gotcha. Yeah, the rescue lets that test run; after 425k lookups, it never
got slow.

Here's a bit of the strace:
https://gist.github.com/cholick/88c756760faca77208f8

On Wed, Nov 4, 2015 at 11:59 AM, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey Matt,

I wanted to keep using the uaa.SYSTEM_DOMAIN domain, not the internal
domain, for that experiment. I do expect the TCPSocket.open to fail when
talking to 127.0.0.1, what I wanted to know is, in the presence of no other
nameservers, does it eventually start to fail slow again, or does this
behaviour happen only when there are other nameservers. I imagine the
TCPSocket.open is blowing up on the first iteration in the loop and exiting
the script? My bad, can you replace:

TCPSocket.open("--UAA-DOMAIN--", 80).close

with

TCPSocket.open("--UAA-DOMAIN--", 80).close rescue nil

for the experiment with only 127.0.0.1 listed amongst the nameservers?
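That is, the full "ruby loop" from the earlier message, with the rescue folded
in, becomes:

  /var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step do |i|; t = Time.now; (TCPSocket.open("--UAA-DOMAIN--", 80).close rescue nil); puts "#{i}: #{(1000*(Time.now - t)).round}ms"; end'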

Yes, something about the move from the first to second nameserver seems
weird. I have seen strace of one case where it times out polling the FD of
the socket it opened to talk to 127.0.0.1, but in one of your straces it
looked like the poll timeout was on polling the FD for the socket for
8.8.8.8. The fact that the problem persists is interesting too, it seems
like it's not just a one-off race condition where someone messed up which FD
it was supposed to be polling.

Thanks,
Amit

On Wed, Nov 4, 2015 at 11:41 AM, Matt Cholick <cholick(a)gmail.com> wrote:

Ah, I misunderstood.

Consul isn't configured as a recursive resolver, so for a test with only
127.0.0.1 in resolv.conf I changed the url in the ruby loop to
"uaa.service.cf.internal", which is the name uaa registers in consul.

I ran through 225k lookups and it never got slow. Here's a bit of the
strace:
https://gist.github.com/cholick/38e02ce3f351847d5fa3

Both versions of that test definitely point to the move from the
first to the second nameserver in ruby, when the first nameserver doesn't
know the address.


On Tue, Nov 3, 2015 at 11:43 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I looked at the strace, I see you did indeed mean "loop without resolver
on localhost". If you try it with *only* a resolver on localhost, do you
get the eventually consistent DNS slowdown?

On Tue, Nov 3, 2015 at 8:33 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt!

When you say "the loop without the resolver on local host" did you mean
"the loop with only a resolver on local host"? Sorry if my setup wasn't
clear, but my intention was to only have 127.0.0.1 in etc/resolv.conf.


On Tuesday, November 3, 2015, Matt Cholick <cholick(a)gmail.com> wrote:

Here are the results of the ruby loop with strace:
https://gist.github.com/cholick/e7e122e34b524cae5fa1

As expected, things eventually get slow. The bash version of the loop
with a new vm each time didn't get slow.

For the loop without a resolver on localhost, it never did get slow.
Though it's hard to prove with something so inconsistent, it hadn't
happened after 100k requests. Here's some of the strace:
https://gist.github.com/cholick/81e58f58e82bfe0a1489

On the final loop, with the SERVFAIL resolver, the issue did
manifest. Here's the trace of that run:
https://gist.github.com/cholick/bd2af46795911cb9f63c

Thanks for digging in on this.


On Mon, Nov 2, 2015 at 6:53 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Okay, interesting, hopefully we're narrowing in on something.
There are a couple of variables I'd like to eliminate, so I wonder if you could
try the following. Also, feel free at any point to let me know if you are
not interested in digging further.

Try all things as sudo, on one of the CCs.

1. It appears that the problem goes away when the CC process is
restarted, so it feels as though there's some sort of resource that the
ruby process is not able to GC, causing this problem to show up
eventually, and then go away when restarted. I want to confirm this by
trying two different loops, one where the loop is in bash, spinning up a
new ruby process each time, and one where the loop is in ruby.

* bash loop:

while true; do time /var/vcap/packages/ruby-VERSION/bin/ruby
-r'net/protocol' -e 'TCPSocket.open("--UAA-DOMAIN--", 80).close'; done

* ruby loop

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step
do |i|; t = Time.now; TCPSocket.open("--UAA-DOMAIN--", 80).close; puts
"#{i}: #{(1000*(Time.now - t)).round}ms"; end'

For each loop, it might also be useful to run `strace -f -p PID >
SOME_FILE` to see what system calls are going on before and after.

2. Another variable is the interaction with the other nameservers.
For this experiment, I would do `monit stop all` to take one of your
CC's out of commission, so that the router doesn't load balance to it,
because it will likely fail requests given the following changes:

* monit stop all && watch monit summary # wait for all the processes
to be stopped, then ctrl+c to stop the watch
* monit start consul_agent && watch monit summary # wait for
consul_agent to be running, then ctrl+c to stop the watch
* Remove nameservers other than 127.0.0.1 from /etc/resolv.conf
* Run the "ruby loop", and see if it still eventually gets slow
* When it's all done, put the original nameservers back in
/etc/resolv.conf, and `monit restart all`

Again, strace-ing the ruby loop would be interesting here.
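For reference, after that change the trimmed /etc/resolv.conf should contain
only the local consul agent, i.e. roughly:

  nameserver 127.0.0.1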

3. Finally, consul itself. Dmitriy (BOSH PM) has a little DNS
resolver that can be run instead of consul, that will always SERVFAIL (same
as what you see from consul when you nslookup something), so we can try
that:

* Modify `/var/vcap/bosh/etc/gemrc` to remove the `--local` flag
* Run `gem install rubydns`
* Dump the following into a file, say `/var/vcap/data/tmp/dns.rb`:

#!/usr/bin/env ruby

require "rubydns"

RubyDNS.run_server(listen: [[:udp, "0.0.0.0", 53], [:tcp, "0.0.0.0", 53]]) do
  otherwise do |transaction|
    transaction.fail!(:ServFail)
  end
end

* monit stop all && watch monit summary # and again, wait for
everything to be stopped
* Run it with `ruby /var/vcap/data/tmp/dns.rb`. Note that this
command, and the previous `gem install`, use the system gem/ruby,
not the ruby package used by CC, so it maintains some separation. When
running this, it will spit out logs to the terminal, so one can keep an eye
on what it's doing, make sure it all looks reasonable
* Make sure the original nameservers are back in the
`/etc/resolv.conf` (i.e. ensure this experiment is independent of the
previous experiment).
* Run the "ruby loop" (in a separate shell session on the CC)
* After it's all done, add back `--local` to `
/var/vcap/bosh/etc/gemrc`, and `monit restart all`

Again, run strace on the ruby process.

What I hope we find out is that (1) only the ruby loop is affected,
so it has something to do with long running ruby processes, (2) the problem
is independent of the other nameservers listed in /etc/resolv.conf,
and (3) the problem remains when running Dmitriy's DNS-FAILSERVer instead
of consul on 127.0.0.1:53, to determine that the problem is not
specific to consul.

On Sun, Nov 1, 2015 at 5:18 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Amit,
It looks like consul isn't configured as a recursive resolver. When
running the above code, resolving fails on the first nameserver and the
script fails. resolv-replace's TCPSocket.open is different from the code
http.rb (and thus api) is using. http.rb is pulling in 'net/protocol'. I
changed the script, replacing the require for 'resolv-replace' to
'net/protocol' to match the cloud controller.

Results:

3286 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4 ms | dns_close:
0 ms
3287 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3288 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 6 ms | dns_close:
0 ms
3289 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3290 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3291 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3292 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms |
dns_close: 0 ms
3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3296 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3297 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4006 ms |
dns_close: 0 ms
3298 -- ip_open: 2 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3299 -- ip_open: 3 ms | ip_close: 0 ms | dns_open: 4011 ms |
dns_close: 0 ms
3300 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3301 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4011 ms |
dns_close: 0 ms
3302 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms

And the consul logs, though there's nothing interesting there:
https://gist.github.com/cholick/03d74f7f012e54c50b56


On Fri, Oct 30, 2015 at 5:51 PM, Amit Gupta <agupta(a)pivotal.io>
wrote:

Yup, that's what I was suspecting. Can you try the following now:

1. Add something like the following to your cf manifest:

...
jobs:
...
- name: cloud_controller_z1
  ...
  properties:
    consul:
      agent:
        ...
        log_level: debug
...

This will set the debug level for the consul agents on your CC job
to debug, so we might be able to see more from its logs. It only sets it on
the job that matters, so when you redeploy, it won't have to roll the whole
deployment. It's okay if you can't/don't want to do this, I'm not sure how
much you want to play around with your environment, but it could be helpful.

2. Add the following line to the bottom of your /etc/resolv.conf

options timeout:4

Let's see if the slow DNS is on the order of 4000ms now, to pin
down where the 5s is exactly coming from.
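For example, the end of /etc/resolv.conf would then look roughly like this
(the nameserver entries shown are only illustrative; 127.0.0.1 is the local
consul agent):

  nameserver 127.0.0.1
  nameserver 8.8.8.8
  options timeout:4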

3. Run the following script on your CC box:

require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--' # e.g. 'uaa.run.pivotal.io'
UAA_IP = '--CHANGE-ME-TOO--' # e.g. '52.21.135.158'

def dur(start_time, end_time)
  "#{(1000*(end_time-start_time)).round} ms"
end

1.step do |i|
  ip_start = Time.now
  s = TCPSocket.open(UAA_IP, 80)
  ip_open = Time.now
  s.close
  ip_close = Time.now

  dns_start = Time.now
  s = TCPSocket.open(UAA_DOMAIN, 80)
  dns_open = Time.now
  s.close
  dns_close = Time.now

  ip_open_dur = dur(ip_start, ip_open)
  ip_close_dur = dur(ip_open, ip_close)
  dns_open_dur = dur(dns_start, dns_open)
  dns_close_dur = dur(dns_open, dns_close)

  puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close: #{ip_close_dur} | dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"
end

You will need to first nslookup (or otherwise determine) the IP
that the UAA_DOMAIN resolves to (it will be some load balancer, possibly
the gorouter, ha_proxy, or your own upstream LB)

4. Grab the files in /var/vcap/sys/log/consul_agent/

Cheers,
Amit

On Fri, Oct 30, 2015 at 4:29 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Here's the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

All the time is spent between "opening connection" and "opened", with the
corresponding ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much ruby, so that's as far as I drilled down.

-Matt


cf push docker image on diego bosh-lite fails

Ramesh Sambandan
 

I am trying to push a docker app to diego in bosh lite.
I verified that the docker image cloudfoundry/lattice-app runs fine in my local docker daemon.
I have enabled the diego_docker feature flag.
My cf version is 6.13.
I am doing “cf push dockerLattice --docker-image cloudfoundry/lattice-app” and the following is my "cf logs dockerLattice --recent":
************************************************************************************************************************************
2015-11-04T13:06:58.71-0700 [API/0] OUT Created app with guid 05c02bed-464a-4c89-948f-9b1cf12144a0
2015-11-04T13:06:58.81-0700 [API/0] OUT Updated app with guid 05c02bed-464a-4c89-948f-9b1cf12144a0 ({"route"=>"b7650962-d25d-446c-a074-c06c1b13d811"})
2015-11-04T13:07:04.01-0700 [API/0] OUT Updated app with guid 05c02bed-464a-4c89-948f-9b1cf12144a0 ({"state"=>"STARTED"})
2015-11-04T13:07:04.03-0700 [STG/0] OUT Creating container
2015-11-04T13:07:04.62-0700 [STG/0] OUT Successfully created container
2015-11-04T13:07:04.71-0700 [STG/0] OUT Staging...
2015-11-04T13:07:04.74-0700 [STG/0] OUT Staging process started ...
2015-11-04T13:07:06.48-0700 [STG/0] OUT Staging process finished
2015-11-04T13:07:06.48-0700 [STG/0] OUT Exit status 0
2015-11-04T13:07:06.48-0700 [STG/0] OUT Staging Complete
2015-11-04T13:07:06.90-0700 [CELL/0] OUT Creating container

2015-11-04T13:07:45.72-0700 [CELL/0] ERR Failed to create container
2015-11-04T13:07:45.74-0700 [API/0] OUT App instance exited with guid 05c02bed-464a-4c89-948f-9b1cf12144a0 payload: {"instance"=>"c3e4ca30-98ac-4170-45cc-1755f3e6cc01", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"failed to initialize container", "crash_count"=>4, "crash_timestamp"=>1446667665731251451, "version"=>"73c93a4b-b1d6-43c0-90e4-44b5d8fd36ee"}
************************************************************************************************************************************

I tried to run the diego acceptance tests (the smoke tests were successful) on my bosh lite and got the following error (./bin/test -skipPackage ssh):
************************************************************************************************************************************
...
[2015-11-04 19:58:26.24 (UTC)]> cf delete 39621ab3-cdbe-4790-4f90-294ec46a5546 -f
Deleting app 39621ab3-cdbe-4790-4f90-294ec46a5546 in org CATS-ORG-1-2015_11_04-12h40m44.529s / space CATS-SPACE-1-2015_11_04-12h40m44.529s as CATS-USER-1-2015_11_04-12h40m44.529s...
OK

------------------------------
• Failure in Spec Setup (JustBeforeEach) [23.136 seconds]
Docker Application Lifecycle [JustBeforeEach] running the app merges the garden and docker environment variables
/Users/rsamban/cloudFoundry/BoshLite/diego-acceptance-tests/diego/lifecycle_docker_test.go:70

No future change is possible. Bailing out early after 17.068s.
Expected
<int>: 1
to match exit code:
<int>: 0

/Users/rsamban/cloudFoundry/BoshLite/diego-acceptance-tests/diego/lifecycle_docker_test.go:46
------------------------------
SS

Summarizing 1 Failure:

[Fail] Docker Application Lifecycle [JustBeforeEach] running the app merges the garden and docker environment variables
/Users/rsamban/cloudFoundry/BoshLite/diego-acceptance-tests/diego/lifecycle_docker_test.go:46

Ran 27 of 29 Specs in 1080.006 seconds
FAIL! -- 26 Passed | 1 Failed | 0 Pending | 2 Skipped --- FAIL: TestApplications (1080.01s)
FAIL

Ginkgo ran 1 suite in 18m1.688798613s
Test Suite Failed
************************************************************************************************************************************

Following is output from "bosh deployments"
************************************************************************************************************************************
+-----------------+----------------------+-------------------------------------------------+--------------+
| Name | Release(s) | Stemcell(s) | Cloud Config |
+-----------------+----------------------+-------------------------------------------------+--------------+
| cf-warden | cf/222+dev.1 | bosh-warden-boshlite-ubuntu-trusty-go_agent/389 | none |
+-----------------+----------------------+-------------------------------------------------+--------------+
| cf-warden-diego | cf/222+dev.1 | bosh-warden-boshlite-ubuntu-trusty-go_agent/389 | none |
| | diego/0.1439.0+dev.1 | | |
| | etcd/18 | | |
| | garden-linux/0.325.0 | | |
+-----------------+----------------------+-------------------------------------------------+--------------+
************************************************************************************************************************************

Can somebody help me please.

thanks
-Ramesh


Re: cloud_controller_ng performance degrades slowly over time

Amit Kumar Gupta
 

Hey Matt,

I wanted to keep using the uaa.SYSTEM_DOMAIN domain, not the internal
domain, for that experiment. I do expect the TCPSocket.open to fail when
talking to 127.0.0.1, what I wanted to know is, in the presence of no other
nameservers, does it eventually start to fail slow again, or does this
behaviour happen only when there are other nameservers. I imagine the
TCPSocket.open is blowing up on the first iteration in the loop and exiting
the script? My bad, can you replace:

TCPSocket.open("--UAA-DOMAIN--", 80).close

with

TCPSocket.open("--UAA-DOMAIN--", 80).close rescue nil

for the experiment with only 127.0.0.1 listed amongst the nameservers?

Yes, something about the move from the first to second nameserver seems
weird. I have seen strace of one case where it times out polling the FD of
the socket it opened to talk to 127.0.0.1, but in one of your straces it
looked like the poll timeout was on polling the FD for the socket for
8.8.8.8. The fact that the problem persists is interesting too, it seems
like it's not just a one-off race condition where someone messed up which FD
it was supposed to be polling.

Thanks,
Amit

On Wed, Nov 4, 2015 at 11:41 AM, Matt Cholick <cholick(a)gmail.com> wrote:

Ah, I misunderstood.

Consul isn't configured as a recursive resolver, so for a test with only
127.0.0.1 in resolv.conf I changed the url in the ruby loop to
"uaa.service.cf.internal", which is the name uaa registers in consul.

I ran through 225k lookups and it never got slow. Here's a bit of the
strace:
https://gist.github.com/cholick/38e02ce3f351847d5fa3

Both versions of that test definitely point to the move from the
first to the second nameserver in ruby, when the first nameserver doesn't
know the address.


On Tue, Nov 3, 2015 at 11:43 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I looked at the strace, I see you did indeed mean "loop without resolver
on localhost". If you try it with *only* a resolver on localhost, do you
get the eventually consistent DNS slowdown?

On Tue, Nov 3, 2015 at 8:33 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt!

When you say "the loop without the resolver on local host" did you mean
"the loop with only a resolver on local host"? Sorry if my setup wasn't
clear, but my intention was to only have 127.0.0.1 in etc/resolv.conf.


On Tuesday, November 3, 2015, Matt Cholick <cholick(a)gmail.com> wrote:

Here are the results of the ruby loop with strace:
https://gist.github.com/cholick/e7e122e34b524cae5fa1

As expected, things eventually get slow. The bash version of the loop
with a new vm each time didn't get slow.

For the loop without a resolver on localhost, it never did get slow.
Though it's hard to prove with something so inconsistent, it hadn't
happened after 100k requests. Here's some of the strace:
https://gist.github.com/cholick/81e58f58e82bfe0a1489

On the final loop, with the SERVFAIL resolver, the issue did manifest.
Here's the trace of that run:
https://gist.github.com/cholick/bd2af46795911cb9f63c

Thanks for digging in on this.


On Mon, Nov 2, 2015 at 6:53 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Okay, interesting, hopefully we're narrowing in on something. There are
a couple of variables I'd like to eliminate, so I wonder if you could try the
following. Also, feel free at any point to let me know if you are not
interested in digging further.

Try all things as sudo, on one of the CCs.

1. It appears that the problem goes away when the CC process is
restarted, so it feels as though there's some sort of resource that the
ruby process is not able to GC, causing this problem to show up
eventually, and then go away when restarted. I want to confirm this by
trying two different loops, one where the loop is in bash, spinning up a
new ruby process each time, and one where the loop is in ruby.

* bash loop:

while true; do time /var/vcap/packages/ruby-VERSION/bin/ruby
-r'net/protocol' -e 'TCPSocket.open("--UAA-DOMAIN--", 80).close'; done

* ruby loop

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step
do |i|; t = Time.now; TCPSocket.open("--UAA-DOMAIN--", 80).close; puts
"#{i}: #{(1000*(Time.now - t)).round}ms"; end'

For each loop, it might also be useful to run `strace -f -p PID >
SOME_FILE` to see what system calls are going on before and after.

2. Another variable is the interaction with the other nameservers.
For this experiment, I would do `monit stop all` to take one of your
CC's out of commission, so that the router doesn't load balance to it,
because it will likely fail requests given the following changes:

* monit stop all && watch monit summary # wait for all the processes
to be stopped, then ctrl+c to stop the watch
* monit start consul_agent && watch monit summary # wait for
consul_agent to be running, then ctrl+c to stop the watch
* Remove nameservers other than 127.0.0.1 from /etc/resolv.conf
* Run the "ruby loop", and see if it still eventually gets slow
* When it's all done, put the original nameservers back in
/etc/resolv.conf, and `monit restart all`

Again, strace-ing the ruby loop would be interesting here.

3. Finally, consul itself. Dmitriy (BOSH PM) has a little DNS
resolver that can be run instead of consul, that will always SERVFAIL (same
as what you see from consul when you nslookup something), so we can try
that:

* Modify `/var/vcap/bosh/etc/gemrc` to remove the `--local` flag
* Run `gem install rubydns`
* Dump the following into a file, say `/var/vcap/data/tmp/dns.rb`:

#!/usr/bin/env ruby

require "rubydns"

RubyDNS.run_server(listen: [[:udp, "0.0.0.0", 53], [:tcp, "0.0.0.0", 53]]) do
  otherwise do |transaction|
    transaction.fail!(:ServFail)
  end
end

* monit stop all && watch monit summary # and again, wait for
everything to be stopped
* Run it with `ruby /var/vcap/data/tmp/dns.rb`. Note that this
command, and the previous `gem install`, use the system gem/ruby, not
the ruby package used by CC, so it maintains some separation. When running
this, it will spit out logs to the terminal, so one can keep an eye on what
it's doing, make sure it all looks reasonable
* Make sure the original nameservers are back in the
`/etc/resolv.conf` (i.e. ensure this experiment is independent of the
previous experiment).
* Run the "ruby loop" (in a separate shell session on the CC)
* After it's all done, add back `--local` to `/var/vcap/bosh/etc/gemrc
`, and `monit restart all`

Again, run strace on the ruby process.

What I hope we find out is that (1) only the ruby loop is affected, so
it has something to do with long running ruby processes, (2) the problem is
independent of the other nameservers listed in /etc/resolv.conf, and
(3) the problem remains when running Dmitriy's DNS SERVFAIL-er instead of
consul on 127.0.0.1:53, to determine that the problem is not specific
to consul.

On Sun, Nov 1, 2015 at 5:18 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Amit,
It looks like consul isn't configured as a recursive resolver. When
running the above code, resolving fails on the first nameserver and the
script fails. resolv-replace's TCPSocket.open is different from the code
http.rb (and thus api) is using. http.rb is pulling in 'net/protocol'. I
changed the script, replacing the require for 'resolv-replace' with
'net/protocol' to match the cloud controller.

Results:

3286 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4 ms | dns_close:
0 ms
3287 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3288 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 6 ms | dns_close:
0 ms
3289 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3290 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3291 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3292 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close:
0 ms
3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms |
dns_close: 0 ms
3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3296 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3297 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4006 ms |
dns_close: 0 ms
3298 -- ip_open: 2 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3299 -- ip_open: 3 ms | ip_close: 0 ms | dns_open: 4011 ms |
dns_close: 0 ms
3300 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3301 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4011 ms |
dns_close: 0 ms
3302 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms

And the consul logs, though there's nothing interesting there:
https://gist.github.com/cholick/03d74f7f012e54c50b56


On Fri, Oct 30, 2015 at 5:51 PM, Amit Gupta <agupta(a)pivotal.io>
wrote:

Yup, that's what I was suspecting. Can you try the following now:

1. Add something like the following to your cf manifest:

...
jobs:
...
- name: cloud_controller_z1
  ...
  properties:
    consul:
      agent:
        ...
        log_level: debug
...

This will set the debug level for the consul agents on your CC job
to debug, so we might be able to see more from its logs. It only sets it on
the job that matters, so when you redeploy, it won't have to roll the whole
deployment. It's okay if you can't/don't want to do this, I'm not sure how
much you want to play around with your environment, but it could be helpful.

2. Add the following line to the bottom of your /etc/resolv.conf

options timeout:4

Let's see if the slow DNS is on the order of 4000ms now, to pin down
where the 5s is exactly coming from.

3. Run the following script on your CC box:

require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--' # e.g. 'uaa.run.pivotal.io'
UAA_IP = '--CHANGE-ME-TOO--' # e.g. '52.21.135.158'

def dur(start_time, end_time)
  "#{(1000*(end_time-start_time)).round} ms"
end

1.step do |i|
  ip_start = Time.now
  s = TCPSocket.open(UAA_IP, 80)
  ip_open = Time.now
  s.close
  ip_close = Time.now

  dns_start = Time.now
  s = TCPSocket.open(UAA_DOMAIN, 80)
  dns_open = Time.now
  s.close
  dns_close = Time.now

  ip_open_dur = dur(ip_start, ip_open)
  ip_close_dur = dur(ip_open, ip_close)
  dns_open_dur = dur(dns_start, dns_open)
  dns_close_dur = dur(dns_open, dns_close)

  puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close: #{ip_close_dur} | dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"
end

You will need to first nslookup (or otherwise determine) the IP that
the UAA_DOMAIN resolves to (it will be some load balancer, possibly the
gorouter, ha_proxy, or your own upstream LB)

4. Grab the files in /var/vcap/sys/log/consul_agent/

Cheers,
Amit

On Fri, Oct 30, 2015 at 4:29 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Here's the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

All the time is spent between "opening connection" and "opened", with the
corresponding ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much ruby, so that's as far as I drilled down.

-Matt


Re: cloud_controller_ng performance degrades slowly over time

Matt Cholick
 

Ah, I misunderstood.

Consul isn't configured as a recursive resolver, so for a test with only
127.0.0.1 in resolv.conf I changed the url in the ruby loop to
"uaa.service.cf.internal", which is the name uaa registers in consul.

I ran through 225k lookups and it never got slow. Here's a bit of the
strace:
https://gist.github.com/cholick/38e02ce3f351847d5fa3

Both versions of that test definitely point to the move from the first
to the second nameserver in ruby, when the first nameserver doesn't know
the address.

On Tue, Nov 3, 2015 at 11:43 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

I looked at the strace, I see you did indeed mean "loop without resolver
on localhost". If you try it with *only* a resolver on localhost, do you
get the eventually consistent DNS slowdown?

On Tue, Nov 3, 2015 at 8:33 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt!

When you say "the loop without the resolver on local host" did you mean
"the loop with only a resolver on local host"? Sorry if my setup wasn't
clear, but my intention was to only have 127.0.0.1 in etc/resolv.conf.


On Tuesday, November 3, 2015, Matt Cholick <cholick(a)gmail.com> wrote:

Here are the results of the ruby loop with strace:
https://gist.github.com/cholick/e7e122e34b524cae5fa1

As expected, things eventually get slow. The bash version of the loop
with a new vm each time didn't get slow.

For the loop without a resolver on localhost, it never did get slow.
Though it's hard to prove with something so inconsistent, it hadn't
happened after 100k requests. Here's some of the strace:
https://gist.github.com/cholick/81e58f58e82bfe0a1489

On the final loop, with the SERVFAIL resolver, the issue did manifest.
Here's the trace of that run:
https://gist.github.com/cholick/bd2af46795911cb9f63c

Thanks for digging in on this.


On Mon, Nov 2, 2015 at 6:53 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Okay, interesting, hopefully we're narrowing in on something. There are
a couple of variables I'd like to eliminate, so I wonder if you could try the
following. Also, feel free at any point to let me know if you are not
interested in digging further.

Try all things as sudo, on one of the CCs.

1. It appears that the problem goes away when the CC process is
restarted, so it feels as though there's some sort of resource that the
ruby process is not able to GC, causing this problem to show up
eventually, and then go away when restarted. I want to confirm this by
trying two different loops, one where the loop is in bash, spinning up a
new ruby process each time, and one where the loop is in ruby.

* bash loop:

while true; do time /var/vcap/packages/ruby-VERSION/bin/ruby
-r'net/protocol' -e 'TCPSocket.open("--UAA-DOMAIN--", 80).close'; done

* ruby loop

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step do
|i|; t = Time.now; TCPSocket.open("--UAA-DOMAIN--", 80).close; puts "#{i}:
#{(1000*(Time.now - t)).round}ms"; end'

For each loop, it might also be useful to run `strace -f -p PID >
SOME_FILE` to see what system calls are going on before and after.

2. Another variable is the interaction with the other nameservers. For
this experiment, I would do `monit stop all` to take one of your CC's
out of commission, so that the router doesn't load balance to it, because
it will likely fail requests given the following changes:

* monit stop all && watch monit summary # wait for all the processes
to be stopped, then ctrl+c to stop the watch
* monit start consul_agent && watch monit summary # wait for
consul_agent to be running, then ctrl+c to stop the watch
* Remove nameservers other than 127.0.0.1 from /etc/resolv.conf
* Run the "ruby loop", and see if it still eventually gets slow
* When it's all done, put the original nameservers back in
/etc/resolv.conf, and `monit restart all`

Again, strace-ing the ruby loop would be interesting here.

3. Finally, consul itself. Dmitriy (BOSH PM) has a little DNS resolver
that can be run instead of consul, that will always SERVFAIL (same as what
you see from consul when you nslookup something), so we can try that:

* Modify `/var/vcap/bosh/etc/gemrc` to remove the `--local` flag
* Run `gem install rubydns`
* Dump the following into a file, say `/var/vcap/data/tmp/dns.rb`:

#!/usr/bin/env ruby

require "rubydns"

RubyDNS.run_server(listen: [[:udp, "0.0.0.0", 53], [:tcp, "0.0.0.0", 53]]) do
  otherwise do |transaction|
    transaction.fail!(:ServFail)
  end
end

* monit stop all && watch monit summary # and again, wait for
everything to be stopped
* Run it with `ruby /var/vcap/data/tmp/dns.rb`. Note that this
command, and the previous `gem install`, use the system gem/ruby, not
the ruby package used by CC, so it maintains some separation. When running
this, it will spit out logs to the terminal, so one can keep an eye on what
it's doing, make sure it all looks reasonable
* Make sure the original nameservers are back in the `/etc/resolv.conf`
(i.e. ensure this experiment is independent of the previous experiment).
* Run the "ruby loop" (in a separate shell session on the CC)
* After it's all done, add back `--local` to `/var/vcap/bosh/etc/gemrc`,
and `monit restart all`

Again, run strace on the ruby process.

What I hope we find out is that (1) only the ruby loop is affected, so
it has something to do with long running ruby processes, (2) the problem is
independent of the other nameservers listed in /etc/resolv.conf, and
(3) the problem remains when running Dmitriy's DNS SERVFAIL-er instead of
consul on 127.0.0.1:53, to determine that the problem is not specific
to consul.

On Sun, Nov 1, 2015 at 5:18 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
It looks like consul isn't configured as a recursive resolver. When
running the above code, resolving fails on the first nameserver and the
script fails. resolv-replace's TCPSocket.open is different from the code
http.rb (and thus api) is using. http.rb is pulling in 'net/protocol'. I
changed the script, replacing the require for 'resolv-replace' with
'net/protocol' to match the cloud controller.

Results:

3286 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4 ms | dns_close: 0
ms
3287 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0
ms
3288 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 6 ms | dns_close: 0
ms
3289 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0
ms
3290 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0
ms
3291 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0
ms
3292 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0
ms
3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0
ms
3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms |
dns_close: 0 ms
3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3296 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3297 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4006 ms |
dns_close: 0 ms
3298 -- ip_open: 2 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3299 -- ip_open: 3 ms | ip_close: 0 ms | dns_open: 4011 ms |
dns_close: 0 ms
3300 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms
3301 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4011 ms |
dns_close: 0 ms
3302 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms |
dns_close: 0 ms

And the consul logs, though there's nothing interesting there:
https://gist.github.com/cholick/03d74f7f012e54c50b56


On Fri, Oct 30, 2015 at 5:51 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Yup, that's what I was suspecting. Can you try the following now:

1. Add something like the following to your cf manifest:

...
jobs:
...
- name: cloud_controller_z1
  ...
  properties:
    consul:
      agent:
        ...
        log_level: debug
...

This will set the debug level for the consul agents on your CC job to
debug, so we might be able to see more from its logs. It only sets it on
the job that matters, so when you redeploy, it won't have to roll the whole
deployment. It's okay if you can't/don't want to do this, I'm not sure how
much you want to play around with your environment, but it could be helpful.

2. Add the following line to the bottom of your /etc/resolv.conf

options timeout:4

Let's see if the slow DNS is on the order of 4000ms now, to pin down
where the 5s is exactly coming from.

3. Run the following script on your CC box:

require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--' # e.g. 'uaa.run.pivotal.io'
UAA_IP = '--CHANGE-ME-TOO--' # e.g. '52.21.135.158'

def dur(start_time, end_time)
  "#{(1000*(end_time-start_time)).round} ms"
end

1.step do |i|
  ip_start = Time.now
  s = TCPSocket.open(UAA_IP, 80)
  ip_open = Time.now
  s.close
  ip_close = Time.now

  dns_start = Time.now
  s = TCPSocket.open(UAA_DOMAIN, 80)
  dns_open = Time.now
  s.close
  dns_close = Time.now

  ip_open_dur = dur(ip_start, ip_open)
  ip_close_dur = dur(ip_open, ip_close)
  dns_open_dur = dur(dns_start, dns_open)
  dns_close_dur = dur(dns_open, dns_close)

  puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close: #{ip_close_dur} | dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"
end

You will need to first nslookup (or otherwise determine) the IP that
the UAA_DOMAIN resolves to (it will be some load balancer, possibly the
gorouter, ha_proxy, or your own upstream LB)

4. Grab the files in /var/vcap/sys/log/consul_agent/

Cheers,
Amit

On Fri, Oct 30, 2015 at 4:29 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Here's the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

All the time is spent between "opening connection" and "opened", with the
corresponding ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much ruby, so that's as far as I drilled down.

-Matt


Invalid Authorization: while deploying application on local CF Instance

Deepak Arn <arn.deepak1@...>
 

While deploying an application on the local cf instance (CF_Nise_Installer), it's giving an unauthorized error and I am unable to get any logs. Could anyone please help resolve this issue?

C:\Users\umroot\Downloads>cf target -s Components
API endpoint: https://api.10.0.2.15.xip.io (API version: 2.35.0)
User: admin
Org: DevBox
Space: Components

C:\Users\umroot\workspaceKeplerJee>cf push Web2210 -p C:\Users\umroot\Downloads\
war_sample\Web2210.war -b https://github.com/cloudfoundry/java-buildpack#v3.1.1
Creating app Web2210 in org DevBox / space Components as admin...
OK

Creating route web2210.10.0.2.15.xip.io...
OK

Binding web2210.10.0.2.15.xip.io to Web2210...
OK

Uploading Web2210...
Uploading app files from: C:\Users\umroot\Downloads\war_sample\Web2210.war
Uploading 3.6K, 10 files
Done uploading
OK

timeout connecting to log server, no log will be shown
Starting app Web2210 in org DevBox / space Components as admin...
Warning: error tailing logs
Unauthorized error: You are not authorized. Error: Invalid authorization

FAILED
StagingError

TIP: use 'cf logs Web2210 --recent' for more information

C:\Users\umroot\workspaceKeplerJee>cf logs Web2210 --recent
Connected, dumping recent logs for app Web2210 in org DevBox / space Components
as admin...

FAILED
Unauthorized error: You are not authorized. Error: Invalid authorization


Microsoft Dynamics AX users

Ralph Andersen <ralph.andersen@...>
 

Hi,



Would you be interested in acquiring our recently updated list of Microsoft
Dynamics AX users Accounts and their Partners, Resellers and Competitor
Accounts for your marketing initiatives or campaigns?



Other Users: Oracle JD Edwards Enterprise one , Oracle E-Business Suite,
Infor ERP syteline, Sage ERP X3 etc.



Titles:

* C-Level Executives: CTOs, CIOs, CMO, CFOs, CEOs
* Fortune 500 Execs, SMBs
* VPs, Presidents, Chairman, General Manager, Manager, other
executives, etc.



Our list includes: First Name, Last Name, Contact Title, Phone Number Fax
Number, Email address , LinkedIn Profiles, Postal Address and Zip Code,
Company Name, Web Address, SIC Code, etc.



Please let me know if I can help you out with any other database or
marketing requirements and I shall get back to you with more relevant
information.



If this might help a colleague of yours, could you please pass this on to
them?



Thank you and looking forward to hear from you.



Regards,

Ralph Andersen

<mailto:ralph.andersen(a)newmarketreach.com>
ralph.andersen(a)newmarketreach.com

Business Development|

Ph: +1 201- 839- 9577|Armonk, New York| USA


Re: Diff between cf restart and cf restage

Sabha
 

Depends on whether the change can be consumed directly by the app or needs
to be seen by the buildpack and other staging steps.

If the buildpack needs to read that variable and take some action, like
activating your agent or changing things like memory or JVM arguments, then
restage; otherwise just a restart is fine.
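For instance (the app and variable names below are just placeholders):

  cf set-env my-app SOME_AGENT_ENABLED true
  cf restart my-app   # enough if only the running app reads the variable
  cf restage my-app   # needed if the buildpack must see it during staging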


On Wed, Nov 4, 2015 at 10:13 AM, Nikhil Katre <nikhil.katre(a)appdynamics.com>
wrote:

Thanks for the reply!

Let's say I add env variables to the application. In order for the
application to use those env variables, which are set using cf set-env, do I
need to restart or restage the application?



On Tue, Nov 3, 2015 at 12:48 PM, Cornelia Davis <cdavis(a)pivotal.io> wrote:

The staging process has access to env variables, etc. so the env can
affect the contents of the droplet.

You might notice that when you do a cf set-env you get a message that
advises you to do a cf restage. Because CF doesn't know whether your
buildpack is affected by env changes, it recommends the more extreme option.

For some (many) apps, a cf restart would be sufficient.

On Tue, Nov 3, 2015 at 9:15 PM, Matthew Sykes <matthew.sykes(a)gmail.com>
wrote:

`restage` will stop your application, run the application bits through
the staging process to create a new droplet, and then start the new
droplet. It's a lot like `push` but without actually pushing new
application bits.

`restart` will simply stop your application and start it with the
existing droplet.

You typically restart when you need your application's environment
refreshed and you typically restage when you need/want the buildpack to run
without updating the application source.

Hope that helps.

On Tue, Nov 3, 2015 at 2:46 PM, Nikhil Katre <
nikhil.katre(a)appdynamics.com> wrote:

Hi,

Can someone explain in detail what is the difference between cf restart
and cf restage ?

--
Thanks,

*Nikhil Katre* | Software Engineer
Mobile: (919) 633 3940 <%28303%29%20946%209911>

AppDynamics
The Application Intelligence Company
Watch <http://appdynamics.wistia.com/medias/56gnkuk6mv>our Video |
Try <https://portal.appdynamics.com/account/signup/signupForm>our
FREE Trial | Twitter <http://www.twitter.com/appdynamics>| Facebook
<http://www.facebook.com/pages/AppDynamics/193264136815?ref=nf>|
appdynamics.com <http://www.appdynamics.com/>


--
Matthew Sykes
matthew.sykes(a)gmail.com

--
Thanks,

*Nikhil Katre* | Software Engineer
Mobile: (919) 633 3940 <%28303%29%20946%209911>

AppDynamics
The Application Intelligence Company
Watch <http://appdynamics.wistia.com/medias/56gnkuk6mv>our Video | Try
<https://portal.appdynamics.com/account/signup/signupForm>our FREE Trial
| Twitter <http://www.twitter.com/appdynamics>| Facebook
<http://www.facebook.com/pages/AppDynamics/193264136815?ref=nf>|
appdynamics.com <http://www.appdynamics.com/>


Re: Diff between cf restart and cf restage

Nikhil Katre <nikhil.katre@...>
 

Thanks for the reply!

Let's say I add env variables to the application. In order for the
application to use those env variables, which are set using cf set-env, do I
need to restart or restage the application?

On Tue, Nov 3, 2015 at 12:48 PM, Cornelia Davis <cdavis(a)pivotal.io> wrote:

The staging process has access to env variables, etc. so the env can
affect the contents of the droplet.

You might notice that when you do a cf set-env you get a message that
advises you to do a cf restage. Because CF doesn't know whether your
buildpack is affected by env changes, it recommends the more extreme option.

For some (many) apps, a cf restart would be sufficient.

On Tue, Nov 3, 2015 at 9:15 PM, Matthew Sykes <matthew.sykes(a)gmail.com>
wrote:

`restage` will stop your application, run the application bits through
the staging process to create a new droplet, and then start the new
droplet. It's a lot like `push` but without actually pushing new
application bits.

`restart` will simply stop your application and start it with the
existing droplet.

You typically restart when you need your application's environment
refreshed and you typically restage when you need/want the buildpack to run
without updating the application source.

Hope that helps.

On Tue, Nov 3, 2015 at 2:46 PM, Nikhil Katre <
nikhil.katre(a)appdynamics.com> wrote:

Hi,

Can someone explain in detail what is the difference between cf restart
and cf restage ?

--
Thanks,

*Nikhil Katre* | Software Engineer
Mobile: (919) 633 3940 <%28303%29%20946%209911>

AppDynamics
The Application Intelligence Company
Watch <http://appdynamics.wistia.com/medias/56gnkuk6mv>our Video | Try
<https://portal.appdynamics.com/account/signup/signupForm>our FREE
Trial | Twitter <http://www.twitter.com/appdynamics>| Facebook
<http://www.facebook.com/pages/AppDynamics/193264136815?ref=nf>|
appdynamics.com <http://www.appdynamics.com/>


--
Matthew Sykes
matthew.sykes(a)gmail.com
--
Thanks,

*Nikhil Katre* | Software Engineer
Mobile: (919) 633 3940 <%28303%29%20946%209911>

AppDynamics
The Application Intelligence Company
Watch <http://appdynamics.wistia.com/medias/56gnkuk6mv>our Video | Try
<https://portal.appdynamics.com/account/signup/signupForm>our FREE Trial |
Twitter <http://www.twitter.com/appdynamics>| Facebook
<http://www.facebook.com/pages/AppDynamics/193264136815?ref=nf>|
appdynamics.com <http://www.appdynamics.com/>


Re: ElasticSearch boshrelease and broker

Amit Kumar Gupta
 

There is at least one Elasticsearch+Logstash+Kibana release:
https://github.com/logsearch/logsearch-boshrelease

I don't know of any broker at the moment though.
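If you want to try it out, a rough sketch of getting that release into your
director is below; the deployment manifest path is a placeholder and the
logsearch-boshrelease README describes the real workflow:

  git clone https://github.com/logsearch/logsearch-boshrelease
  cd logsearch-boshrelease
  bosh create release && bosh upload release
  bosh deployment path/to/your-logsearch-manifest.yml && bosh deploy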

On Wed, Nov 4, 2015 at 4:24 AM, Ramon Makkelie <ramon.makkelie(a)klm.com>
wrote:

I was wondering if someone has already created/released an Elasticsearch BOSH
release with a broker?


Immutability of applications

john mcteague <john.mcteague@...>
 

I had this conversation with a few different people during the berlin
summit and promised one of them I would repeat it on the mailing list today
to get further feedback.

Today, once we push an application, the droplet is immutable. It doesn't
change until you push the application again or restage. I believe the
entire container could change without a new push if you upgrade the rootfs
and restart all the apps (which the CF operator would do).

However, the environment vars and service bindings can be changed on an
application but they would not take effect until the next restart. The CF
API would report these changes as active when you run *cf env* or *cf
services*. There is no distinction between desired state and current state
when using the API.

To me this is a significant gap as we cannot necessarily get a true view of
the world (I call cf set-env but don't restart the app; how do I know from
the API what value of that env var my app is using?).
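For example (app and variable names are placeholders):

  cf set-env my-app FEATURE_FLAG on
  cf env my-app        # the API already reports FEATURE_FLAG=on
  # but the running instances keep the old environment until
  cf restart my-app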

How are people addressing this in their own environments and is it
something that the core API team should be considering (I ask the latter
publicly even though I asked Dieu during the summit :) ).

John


Re: Cloud Foundry deploy on suse

Matthew Sykes <matthew.sykes@...>
 

wshd is simply reporting [1] the pivot_root [2] failure. It looks like
you're getting an EINVAL from the call which implies warden is running in
an unexpected environment.

If I were to guess, I'd say that the container depot does not live on an
expected file system type or location...

As far as I'm aware, no work has been done to make warden run under
anything but Ubuntu or CentOS recently but it's possible someone has. If
nobody else has any hints, you'll likely have to look through the code and
work out what's going on.

[1]:
https://github.com/cloudfoundry/warden/blob/76010f2ba12e41d9e8755985ec874391fb3c962a/warden/src/wsh/wshd.c#L715
[2]: http://man7.org/linux/man-pages/man2/pivot_root.2.html

On Wed, Nov 4, 2015 at 7:27 AM, Youzhi Zhu <zhuyouzhi03(a)gmail.com> wrote:

Hi all
We are trying to deploy cloud foundry on suse. Now every CF module can
start successfully, but when I push an app to CF an error occurs. I
checked the logs and found that when the container starts, the wshd process
throws the error "pivot_root: Invalid argument". Has anyone seen this error
before, or has anyone deployed CF successfully on an OS other than ubuntu? Thanks.

CF version is cf-release170
suse version is suse 12 with kernel 3.12.28-4-default
--
Matthew Sykes
matthew.sykes(a)gmail.com


Re: Error to make a Request to update password in UAA

Juan Antonio Breña Moral <bren at juanantonio.info...>
 

Hi, I continued testing and I checked that I have the required permissions for the current user:

{ value: '1b48d072-6715-40a4-b01c-f4f8ede67db9', display: 'password.write', type: 'DIRECT' },

https://github.com/cloudfoundry/uaa/blob/master/docs/UAA-APIs.rst#create-a-user-post-users

And I updated the request:

uaa_options = {
"schemas":["urn:scim:schemas:core:1.0"],
"password": accountPassword
};

But UAA continues replying with the same message:

<p><b>message</b> <u></u></p><p><b>description</b> <u>The request sent by the c
lient was syntactically incorrect.</u></p><HR size=\"1\" noshade=\"noshade\"><h3
Apache Tomcat/7.0.55</h3></body></html>" was thrown, throw an Error :)
https://github.com/prosociallearnEU/cf-nodejs-client/blob/master/lib/model/UsersUAA.js#L45-L68
https://github.com/prosociallearnEU/cf-nodejs-client/blob/master/test/lib/model/UserUAATests.js#L142-L192
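
For reference, based on the UAA docs linked above I assume the equivalent change-password call in curl would look roughly like this (a hedged sketch; the UAA host, token, user GUID, and passwords are placeholders), but I have not been able to confirm it:

curl -X PUT "https://uaa.<your-domain>/Users/<user-guid>/password" \
  -H "Authorization: Bearer <token with password.write>" \
  -H "Content-Type: application/json" \
  -d '{"oldPassword": "<current password>", "password": "<new password>"}'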

Can you provide a curl example (equivalent to what the Go CLI sends) that I can test with the curl tool?

Many thanks in advance.

Juan Antonio


Re: abacus collector doesn't work

Hristo Iliev
 

Hi,

Do you run this in a bosh-lite environment or on hosted CF?

In case you have a bosh-lite env, you have to create a security group that enables the different micro-services to talk to each other. We already provide a "setup" script [1] and a security group definition [2]; a minimal sketch of the commands is shown after the links below.

Judging from the rights you have (admin) you must be using bosh-lite, but if this is not the case you may need to change some of the collector's environment variables [3]. This is especially useful in case your Abacus pipeline runs on a different sub-domain than the default CF domain.

[1] https://github.com/cloudfoundry-incubator/cf-abacus/blob/master/bin/cfsetup
[2] https://github.com/cloudfoundry-incubator/cf-abacus/blob/master/etc/secgroup.json
[3] https://github.com/cloudfoundry-incubator/cf-abacus/blob/master/lib/metering/collector/manifest.yml#L9-L14
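
If you need to create and bind the group by hand, a hedged sketch (the group name is arbitrary, the org and space are placeholders, and the JSON file is the definition from [2]):

cf create-security-group abacus etc/secgroup.json
cf bind-security-group abacus <your-org> <your-space>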


Cloud Foundry deploy on suse

Youzhi Zhu
 

Hi all
We are trying to deploy Cloud Foundry on SUSE. Now every CF module can
start successfully, but when I push an app to CF, an error occurs. I
checked the logs and found that when the container is started, the wshd
process throws the error "pivot_root: Invalid argument". Has anyone seen
this error before, or has anyone deployed CF successfully on an OS other
than Ubuntu? Thanks.

CF version is cf-release170
SUSE version is SUSE 12 with kernel 3.12.28-4-default


ElasticSearch boshrelease and broker

ramonskie
 

I was wondering if someone has already created/released an Elasticsearch BOSH release with a broker?


abacus collector doesn't work

MaggieMeng
 

Hi

I am trying to run abacus in my Cloud Foundry env. However, after successfully pushing all abacus applications into CF, I found the following error from some of the applications:

dmadmin(a)dmadmin-Lenovo-Product:~/cloudfoundry/cf-abacus/cf-abacus$ cf logs abacus-usage-aggregator
Connected, tailing logs for app abacus-usage-aggregator in org cf / space space as admin...

2015-11-04T04:33:36.47-0500 [App/0] OUT 2015-11-04T09:33:36.469Z e-abacus-request 46 Request error { message: 'connect ECONNREFUSED',
2015-11-04T04:33:36.47-0500 [App/0] OUT code: 'ECONNREFUSED',
2015-11-04T04:33:36.47-0500 [App/0] OUT errno: 'ECONNREFUSED',
2015-11-04T04:33:36.47-0500 [App/0] OUT syscall: 'connect' } - Error: connect ECONNREFUSED
2015-11-04T04:33:36.47-0500 [App/0] OUT at exports._errnoException (util.js:746:11)
2015-11-04T04:33:36.47-0500 [App/0] OUT at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1010:19)

The same happens for abacus-usage-collector. "npm run demo" also failed, which may be due to this error. Could it be a CF configuration issue? How could I enable verbose logging or debugging? Any help would be appreciated.

Thanks,
Maggie


Deploy on OpenNebula

Yancey
 

Has anyone deployed Cloud Foundry on OpenNebula? I can only find the CPI for OpenStack, VMware, etc...


Re: cloud_controller_ng performance degrades slowly over time

Amit Kumar Gupta
 

I looked at the strace; I see you did indeed mean "loop without resolver on
localhost". If you try it with *only* a resolver on localhost, do you still
get the eventual DNS slowdown?

On Tue, Nov 3, 2015 at 8:33 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Thanks Matt!

When you say "the loop without the resolver on localhost" did you mean
"the loop with only a resolver on localhost"? Sorry if my setup wasn't
clear, but my intention was to only have 127.0.0.1 in /etc/resolv.conf.


On Tuesday, November 3, 2015, Matt Cholick <cholick(a)gmail.com> wrote:

Here are the results of the ruby loop with strace:
https://gist.github.com/cholick/e7e122e34b524cae5fa1

As expected, things eventually get slow. The bash version of the loop
with a new vm each time didn't get slow.

For the loop without a resolver on localhost, it never did get slow.
Though it's hard to prove with something so inconsistent, it hadn't
happened after 100k requests. Here's some of the strace:
https://gist.github.com/cholick/81e58f58e82bfe0a1489

On the final loop, with the SERVFAIL resolver, the issue did manifest.
Here's the trace of that run:
https://gist.github.com/cholick/bd2af46795911cb9f63c

Thanks for digging in on this.


On Mon, Nov 2, 2015 at 6:53 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Okay, interesting, hopefully we're narrowing in on something. There are a
couple of variables I'd like to eliminate, so I wonder if you could try the
following. Also, feel free at any point to let me know if you are not
interested in digging further.

Try all things as sudo, on one of the CCs.

1. It appears that the problem goes away when the CC process is
restarted, so it feels as though there's some sort of resource that the
ruby process is not able to GC, leading this problem to show up
eventually, and then go away when restarted. I want to confirm this by
trying two different loops, one where the loop is in bash, spinning up a
new ruby process each time, and one where the loop is in ruby.

* bash loop:

while true; do time /var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e 'TCPSocket.open("--UAA-DOMAIN--", 80).close'; done

* ruby loop:

/var/vcap/packages/ruby-VERSION/bin/ruby -r'net/protocol' -e '1.step do |i|; t = Time.now; TCPSocket.open("--UAA-DOMAIN--", 80).close; puts "#{i}: #{(1000*(Time.now - t)).round}ms"; end'

For each loop, it might also be useful to run `strace -f -p PID >
SOME_FILE` to see what system calls are going on before and after.

2. Another variable is the interaction with the other nameservers. For
this experiment, I would do `monit stop all` to take one of your CCs
out of commission, so that the router doesn't load balance to it, because
it will likely fail requests given the following changes:

* monit stop all && watch monit summary # wait for all the processes to
be stopped, then ctrl+c to stop the watch
* monit start consul_agent && watch monit summary # wait for
consul_agent to be running, then ctrl+c to stop the watch
* Remove nameservers other than 127.0.0.1 from /etc/resolv.conf
* Run the "ruby loop", and see if it still eventually gets slow
* When it's all done, put the original nameservers back in
/etc/resolv.conf, and `monit restart all`

Again, strace-ing the ruby loop would be interesting here.

3. Finally, consul itself. Dmitriy (BOSH PM) has a little DNS resolver
that can be run instead of consul, that will always SERVFAIL (same as what
you see from consul when you nslookup something), so we can try that:

* Modify `/var/vcap/bosh/etc/gemrc` to remove the `--local` flag
* Run `gem install rubydns`
* Dump the following into a file, say `/var/vcap/data/tmp/dns.rb`:

#!/usr/bin/env ruby

require "rubydns"

# Minimal DNS server listening on UDP and TCP port 53 that answers every
# query with SERVFAIL (mimicking what consul returns for non-consul names).
RubyDNS.run_server(listen: [[:udp, "0.0.0.0", 53], [:tcp, "0.0.0.0", 53]]) do
  otherwise do |transaction|
    transaction.fail!(:ServFail)
  end
end

* monit stop all && watch monit summary # and again, wait for
everything to be stopped
* Run it with `ruby /var/vcap/data/tmp/dns.rb`. Note that this
command, and the previous `gem install`, use the system gem/ruby, not
the ruby package used by CC, so it maintains some separation. When running
this, it will spit out logs to the terminal, so one can keep an eye on what
it's doing, make sure it all looks reasonable
* Make sure the original nameservers are back in the `/etc/resolv.conf`
(i.e. ensure this experiment is independent of the previous experiment).
* Run the "ruby loop" (in a separate shell session on the CC)
* After it's all done, add back `--local` to `/var/vcap/bosh/etc/gemrc`,
and `monit restart all`

Again, run strace on the ruby process.

What I hope we find out is that (1) only the ruby loop is affected, so
it has something to do with long running ruby processes, (2) the problem is
independent of the other nameservers listed in /etc/resolv.conf, and
(3) the problem remains when running Dmitriy's DNS SERVFAIL server instead of
consul on 127.0.0.1:53, to determine that the problem is not specific
to consul.

On Sun, Nov 1, 2015 at 5:18 PM, Matt Cholick <cholick(a)gmail.com> wrote:

Amit,
It looks like consul isn't configured as a recursive resolver. When
running the above code, resolving fails on the first nameserver and the
script fails. resolv-replace's TCPSocket.open is different from the code
http.rb (and thus the API) is using. http.rb is pulling in 'net/protocol'. I
changed the script, replacing the require for 'resolv-replace' with
'net/protocol' to match the cloud controller.

Results:

3286 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4 ms | dns_close: 0 ms
3287 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3288 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 6 ms | dns_close: 0 ms
3289 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3290 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3291 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3292 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3293 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 5 ms | dns_close: 0 ms
3294 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 2008 ms | dns_close: 0 ms
3295 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3296 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3297 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4006 ms | dns_close: 0 ms
3298 -- ip_open: 2 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3299 -- ip_open: 3 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3300 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms
3301 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4011 ms | dns_close: 0 ms
3302 -- ip_open: 1 ms | ip_close: 0 ms | dns_open: 4010 ms | dns_close: 0 ms

And the consul logs, though there's nothing interesting there:
https://gist.github.com/cholick/03d74f7f012e54c50b56


On Fri, Oct 30, 2015 at 5:51 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Yup, that's what I was suspecting. Can you try the following now:

1. Add something like the following to your cf manifest:

...
jobs:
...
- name: cloud_controller_z1
  ...
  properties:
    consul:
      agent:
        ...
        log_level: debug
...

This will set the log level for the consul agents on your CC job to
debug, so we might be able to see more in its logs. It only sets it on
the job that matters, so when you redeploy, it won't have to roll the whole
deployment. It's okay if you can't/don't want to do this, I'm not sure how
much you want to play around with your environment, but it could be helpful.

2. Add the following line to the bottom of your /etc/resolv.conf

options timeout:4

Let's see if the slow DNS is on the order of 4000ms now, to pin down
where the 5s is exactly coming from.
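
The resulting /etc/resolv.conf would look something like this (the upstream resolver IP is a placeholder for whatever is already there):

nameserver 127.0.0.1
nameserver 10.0.0.2
options timeout:4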

3. Run the following script on your CC box:

require 'resolv-replace'

UAA_DOMAIN = '--CHANGE-ME--' # e.g. 'uaa.run.pivotal.io'
UAA_IP = '--CHANGE-ME-TOO--' # e.g. '52.21.135.158'

def dur(start_time, end_time)
  "#{(1000*(end_time-start_time)).round} ms"
end

1.step do |i|
  ip_start = Time.now
  s = TCPSocket.open(UAA_IP, 80)
  ip_open = Time.now
  s.close
  ip_close = Time.now

  dns_start = Time.now
  s = TCPSocket.open(UAA_DOMAIN, 80)
  dns_open = Time.now
  s.close
  dns_close = Time.now

  ip_open_dur = dur(ip_start, ip_open)
  ip_close_dur = dur(ip_open, ip_close)
  dns_open_dur = dur(dns_start, dns_open)
  dns_close_dur = dur(dns_open, dns_close)

  puts "#{"%04d" % i} -- ip_open: #{ip_open_dur} | ip_close: #{ip_close_dur} | dns_open: #{dns_open_dur} | dns_close: #{dns_close_dur}"
end

You will need to first nslookup (or otherwise determine) the IP that
the UAA_DOMAIN resolves to (it will be some load balancer, possibly the
gorouter, ha_proxy, or your own upstream LB)
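
For example (the domain here is a placeholder):

nslookup uaa.<your-domain>   # use one of the returned addresses as UAA_IP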

4. Grab the files in /var/vcap/sys/log/consul_agent/

Cheers,
Amit

On Fri, Oct 30, 2015 at 4:29 PM, Matt Cholick <cholick(a)gmail.com>
wrote:

Here are the results:

https://gist.github.com/cholick/1325fe0f592b1805eba5

The time is all spent between "opening connection" and "opened", with the
corresponding ruby source in http.rb's connect method:

D "opening connection to #{conn_address}:#{conn_port}..."

s = Timeout.timeout(@open_timeout, Net::OpenTimeout) {
TCPSocket.open(conn_address, conn_port, @local_host, @local_port)
}
s.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
D "opened"

I don't know much ruby, so that's as far as I drilled down.

-Matt
