`api_z1/0' is not running after update to CF v231


Wayne Ha <wayne.h.ha@...>
 

Sorry for the late response. I didn't get a chance to try again until today. It turned out that setting require_https to false lets me run "cf login".

Properties
uaa
+ require_https: false
Meta
No changes
Deploying
---------
Director task 10
Started preparing deployment
Started preparing deployment > Binding deployment. Done (00:00:00)
Started preparing deployment > Binding releases. Done (00:00:00)
Started preparing deployment > Binding existing deployment. Done (00:00:00)
Started preparing deployment > Binding resource pools. Done (00:00:00)
Started preparing deployment > Binding stemcells. Done (00:00:00)
Started preparing deployment > Binding templates. Done (00:00:00)
Started preparing deployment > Binding properties. Done (00:00:00)
Started preparing deployment > Binding unallocated VMs. Done (00:00:00)
Started preparing deployment > Binding instance networks. Done (00:00:00)
Done preparing deployment (00:00:00)
Started preparing package compilation > Finding packages to compile. Done (00:00:00)
Started preparing dns > Binding DNS. Done (00:00:00)
Started preparing configuration > Binding configuration. Done (00:00:03)
Started updating job uaa_z1 > uaa_z1/0. Done (00:01:09)
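
For anyone who hits the same redirect loop: the property that matters is uaa.require_https. A quick way to confirm what the generated manifest contains before deploying (a sketch, assuming the manifest file name used elsewhere in this thread):

grep -n -B2 -A1 'require_https' bosh-lite-v231.yml
# expect to see the property nested under the uaa job's properties, e.g.
#   uaa:
#     require_https: false
# then redeploy with: bosh deploy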


Filip Hanik
 

Follow the steps here:

https://github.com/cloudfoundry/bosh-lite/blob/master/bin/provision_cf

for a working bosh-lite. The key is to start with a VirtualBox image; the
Fusion/Workstation images are not up to date.

Filip

On Monday, March 7, 2016, Wayne Ha <wayne.h.ha(a)gmail.com> wrote:

Filip,

I am running with the latest CF v231. Initially, I ran it with an older
stemcell and got:

`api_z1/0' is not running after update

After running with the latest stemcell, I got a successful deployment but
failed to log in with this error:

Error performing request: Get https://login.bosh-lite.com/login: stopped
after 1 redirect

Could there be some configuration that I missed? Note that I am using the
default bosh-lite-v231.yml.

Thanks,


Amit Kumar Gupta
 

Hey Wayne,

What command did you run to generate your manifest?

Amit

On Mon, Mar 7, 2016 at 8:03 PM, Wayne Ha <wayne.h.ha(a)gmail.com> wrote:

Filip,

I am running with the latest CF v231. Initially, I ran it with an older
stemcell and got:

`api_z1/0' is not running after update

After running with the latest stemcell, I got a successful deployment but
failed to log in with this error:

Error performing request: Get https://login.bosh-lite.com/login: stopped
after 1 redirect

Could there be some configuration that I missed? Note that I am using the
default bosh-lite-v231.yml.

Thanks,


Wayne Ha <wayne.h.ha@...>
 

Filip,

I am running with the latest CF v231. Initially, I ran it with an older stemcell and got:

`api_z1/0' is not running after update

After running with the latest stemcell, I got a successful deployment but failed to log in with this error:

Error performing request: Get https://login.bosh-lite.com/login: stopped after 1 redirect

Could there be some configuration that I missed? Note that I am using the default bosh-lite-v231.yml.

Thanks,


Filip Hanik
 

Error performing request: Get https://login.bosh-lite.com/login: stopped
after 1 redirect


That's the error right there. It's a redirect loop.

What version of CF is this? Upgrade to the latest.

On Monday, March 7, 2016, sridhar vennela <sridhar.vennela(a)gmail.com> wrote:

Hi Wayne,

I am not seeing any errors in the above. To capture UAA errors, it is better
to open two terminals: in one, run tail -f uaa.log; in the other, try
cf login -a api.bosh-lite.com -u admin -p admin --skip-ssl-validation.

Thank you,
Sridhar


sridhar vennela
 

Hi Wayne,

I am not seeing any errors in the above. To capture UAA errors, it is better to open two terminals: in one, run tail -f uaa.log; in the other, try cf login -a api.bosh-lite.com -u admin -p admin --skip-ssl-validation.

Thank you,
Sridhar
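
To make the two-terminal suggestion concrete, here is a sketch; it assumes the BOSH v1 CLI's ssh command and the uaa.log location that shows up later in this thread:

# terminal 1: ssh to the uaa_z1/0 VM, then follow the UAA log
bosh ssh uaa_z1 0
tail -f /var/vcap/sys/log/uaa/uaa.log    # run inside the VM

# terminal 2: attempt the login while the log is being tailed
cf login -a api.bosh-lite.com -u admin -p admin --skip-ssl-validation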


Wayne Ha <wayne.h.ha@...>
 

Zach,

After using the latest stemcell, I got a successful deployment. But after that, cf login fails:

vagrant(a)agent-id-bosh-0:~$ cf login -a api.bosh-lite.com -u admin -p admin
API endpoint: api.bosh-lite.com
FAILED
Invalid SSL Cert for api.bosh-lite.com
TIP: Use 'cf login --skip-ssl-validation' to continue with an insecure API endpoint

vagrant(a)agent-id-bosh-0:~$ cf login -a api.bosh-lite.com -u admin -p admin --skip-ssl-validation
API endpoint: api.bosh-lite.com
FAILED
Error performing request: Get https://login.bosh-lite.com/login: stopped after 1 redirect
API endpoint: https://api.bosh-lite.com (API version: 2.51.0)
Not logged in. Use 'cf login' to log in.

I saw the following in uaa.log:

root(a)d142fabc-f823-40df-b9ea-97d306bf7209:/var/vcap/sys/log/uaa# grep -A9 -i error uaa.log | cut -c 65-650
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/error'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/email_sent'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/create_account*'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/accounts/email_sent'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/invalid_request'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/saml_error'
DEBUG --- UaaRequestMatcher: [loginAuthenticateRequestMatcher] Checking match of request : '/login'; '/authenticate' with parameters={} and headers {Authorization=[bearer ], accept=[application/json]}
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/authenticate/**'
DEBUG --- UaaRequestMatcher: [loginAuthorizeRequestMatcher] Checking match of request : '/login'; '/oauth/authorize' with parameters={source=login} and headers {accept=[application/json]}
DEBUG --- UaaRequestMatcher: [loginTokenRequestMatcher] Checking match of request : '/login'; '/oauth/token' with parameters={source=login, grant_type=password, add_new=} and headers {Authorization=[bearer ], accept=[application/json]}
DEBUG --- UaaRequestMatcher: [loginAuthorizeRequestMatcherOld] Checking match of request : '/login'; '/oauth/authorize' with parameters={login={} and headers {accept=[application/json]}
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/password_*'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/email_*'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/oauth/token/revoke/**'
DEBUG --- UaaRequestMatcher: [passcodeTokenMatcher] Checking match of request : '/login'; '/oauth/token' with parameters={grant_type=password, passcode=} and headers {accept=[application/json, application/x-www-form-urlencoded]}

But I don't know what the above means.

Thanks,
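
The lines above are Spring Security's request-matcher tracing at DEBUG level, so they only show which URL patterns were checked for the /login request, not an actual failure. A narrower search for error-level entries is usually more telling (a sketch, using the same log file; the exact level strings in uaa.log may differ):

grep -E ' ERROR | WARN ' /var/vcap/sys/log/uaa/uaa.log | tail -20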


Wayne Ha <wayne.h.ha@...>
 

Zach,

Thanks for the hints. You are right, I am not using the latest stemcell:

vagrant(a)agent-id-bosh-0:~$ bosh stemcells
+---------------------------------------------+---------+--------------------------------------+
| Name                                        | Version | CID                                  |
+---------------------------------------------+---------+--------------------------------------+
| bosh-warden-boshlite-ubuntu-trusty-go_agent | 389*    | cb6ee28c-a703-4a7e-581b-b63be2302e3d |

I will try the stemcell you recommended to see if it helps.

Thanks,


Zach Robinson
 

Wayne,

Can you verify that you are using the latest bosh-lite stemcell, 3147? Older stemcells are known to have issues with consul, which is what many of the CF components use for service discovery.

The latest bosh-lite stemcells can be found at http://bosh.io. Just search for lite.

See this similar issue: https://github.com/cloudfoundry/cf-release/issues/919

-Zach
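
For reference, pulling a newer warden stemcell into the director looks roughly like this with the BOSH v1 CLI (the bosh.io download URL pattern is an assumption, so double-check it on the site before using it):

bosh upload stemcell 'https://bosh.io/d/stemcells/bosh-warden-boshlite-ubuntu-trusty-go_agent?v=3147'
bosh stemcells   # should now list version 3147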


Amit Kumar Gupta
 

As of cf v231, CC has switched from using NFS to WebDav as the default
blobstore. There are more details in the release notes:
https://github.com/cloudfoundry/cf-release/releases/tag/v231. I don't know
off-hand how to debug the issue you're seeing, but I will reach out to some
folks with more knowledge of Cloud Controller.

Best,
Amit
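
One quick sanity check is whether the generated manifest actually contains the WebDav blobstore job and properties rather than the old NFS settings. A sketch, assuming the manifest file name used elsewhere in this thread; the exact property names depend on the release, so treat the search terms as assumptions:

grep -nEi 'webdav|blobstore' bosh-lite-v231.yml | head -30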

On Mon, Mar 7, 2016 at 8:48 AM, Wayne Ha <wayne.h.ha(a)gmail.com> wrote:

Kayode,

I am using the default bosh-lite-v231.yml file, and the instance count for the
nfs server is set to 0:

vagrant(a)agent-id-bosh-0:~$ egrep -i "name:.*nfs|instances" bosh-lite-v231.yml.1603041454
etc...
- instances: 0
- instances: 0
- instances: 0
name: nfs_z1
- name: debian_nfs_server
- instances: 1
- instances: 1
- instances: 1
etc...

So it is not running.

Thanks,


Wayne Ha <wayne.h.ha@...>
 

Kayode,

I am using the default bosh-lite-v231.yml file, and the instance count for the nfs server is set to 0:

vagrant(a)agent-id-bosh-0:~$ egrep -i "name:.*nfs|instances" bosh-lite-v231.yml.1603041454
etc...
- instances: 0
- instances: 0
- instances: 0
name: nfs_z1
- name: debian_nfs_server
- instances: 1
- instances: 1
- instances: 1
etc...

So it is not running.

Thanks,


Paul Bakare
 

Wayne, is the nfs_server-partition running?

On Mon, Mar 7, 2016 at 1:43 AM, Wayne Ha <wayne.h.ha(a)gmail.com> wrote:

I checked that the blobstore is running:

root(a)e83575d2-dfbf-4f7c-97ee-5112560fa137:/var/vcap/sys/log# monit summary
The Monit daemon 5.2.4 uptime: 4h 14m
Process 'consul_agent' running
Process 'metron_agent' running
Process 'blobstore_nginx' running
Process 'route_registrar' running
System 'system_e83575d2-dfbf-4f7c-97ee-5112560fa137' running

But there are thousands of errors saying DopplerForwarder: can't forward
message, loggregator client pool is empty:

root(a)e83575d2-dfbf-4f7c-97ee-5112560fa137:/var/vcap/sys/log# find . -name "*.log" | xargs grep -i error | cut -c 73-500 | sort -u
,"process_id":246,"source":"metron","log_level":"error","message":"DopplerForwarder: can't forward message","data":{"error":"loggregator client pool is empty"},
"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/writers/dopplerforwarder/doppler_forwarder.go",
"line":104,
"method":"metron/writers/dopplerforwarder.(*DopplerForwarder).networkWrite"}

Not sure what is wrong.


Wayne Ha <wayne.h.ha@...>
 

I checked that the blobstore is running:

root(a)e83575d2-dfbf-4f7c-97ee-5112560fa137:/var/vcap/sys/log# monit summary
The Monit daemon 5.2.4 uptime: 4h 14m
Process 'consul_agent' running
Process 'metron_agent' running
Process 'blobstore_nginx' running
Process 'route_registrar' running
System 'system_e83575d2-dfbf-4f7c-97ee-5112560fa137' running

But there are thousands of errors saying DopplerForwarder: can't forward message, loggregator client pool is empty:

root(a)e83575d2-dfbf-4f7c-97ee-5112560fa137:/var/vcap/sys/log# find . -name "*.log" | xargs grep -i error | cut -c 73-500 | sort -u
,"process_id":246,"source":"metron","log_level":
"error","message":"DopplerForwarder: can't forward message","data":{
"error":"loggregator client pool is empty"},
"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/writers/dopplerforwarder/doppler_forwarder.go",
"line":104,
"method":"metron/writers/dopplerforwarder.(*DopplerForwarder).networkWrite"}

Not sure what is wrong.
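
The "loggregator client pool is empty" message means metron currently has no doppler instance it can forward to, so a first check is whether the doppler/loggregator VMs came up at all. A sketch with the BOSH v1 CLI (the job names vary by manifest, so treat them as assumptions):

bosh vms | grep -iE 'doppler|loggregator'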


Wayne Ha <wayne.h.ha@...>
 

Amit,

Thanks for letting me know that I might have been looking at the wrong log files.
I saw the following in the cloud_controller log files:

root(a)7a1f2221-c31a-494b-b16c-d4a97c16c9ab:/var/vcap/sys/log# tail ./cloud_controller_ng_ctl.log
[2016-03-06 22:40:28+0000] ------------ STARTING cloud_controller_ng_ctl at Sun Mar 6 22:40:28 UTC 2016 --------------
[2016-03-06 22:40:28+0000] Checking for blobstore availability
[2016-03-06 22:41:03+0000] Blobstore is not available

root(a)7a1f2221-c31a-494b-b16c-d4a97c16c9ab:/var/vcap/sys/log# tail ./cloud_controller_worker_ctl.log
[2016-03-06 22:41:13+0000] Killing /var/vcap/sys/run/cloud_controller_ng/cloud_controller_worker_2.pid: 12145
[2016-03-06 22:41:13+0000] .Stopped
[2016-03-06 22:41:36+0000] Blobstore is not available
[2016-03-06 22:41:48+0000] ------------ STARTING cloud_controller_worker_ctl at Sun Mar 6 22:41:48 UTC 2016 --------------
[2016-03-06 22:41:48+0000] Checking for blobstore availability
[2016-03-06 22:41:48+0000] Removing stale pidfile...

So maybe the cause is that the blobstore is not available?

Thanks,
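
Since that "Blobstore is not available" check in the ctl script is what keeps Cloud Controller from starting, it helps to see whether the WebDav blobstore can even be resolved and reached from the api VM. A sketch, run on api_z1/0; the internal hostname is an assumption about how cf-release registers the blobstore in consul, so adjust it to whatever your manifest uses:

nslookup blobstore.service.cf.internal 127.0.0.1   # does the local consul DNS resolve it?
curl -v http://blobstore.service.cf.internal/ 2>&1 | head -20   # try https:// as well if http does not answer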

On Sun, Mar 6, 2016 at 1:15 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

The log lines saying "/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock
is not found" is probably just a symptom of the problem, not the root
cause. You're probably seeing those in the nginx logs? Cloud Controller
is failing to start, hence it is not establishing a connection on the
socket. You need to dig deeper into failures in logs in
/var/vcap/sys/log/cloud_controller_ng.

On Sun, Mar 6, 2016 at 10:00 AM, sridhar vennela <
sridhar.vennela(a)gmail.com> wrote:

Hi Wayne,

Looks like it. It is trying to connect to loggregator and failing, I guess.


https://github.com/cloudfoundry/cloud_controller_ng/blob/master/app/controllers/runtime/syslog_drain_urls_controller.rb

Thank you,
Sridhar


Amit Kumar Gupta
 

The log lines saying
"/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock
is not found" is probably just a symptom of the problem, not the root
cause. You're probably seeing those in the nginx logs? Cloud Controller
is failing to start, hence it is not establishing a connection on the
socket. You need to dig deeper into failures in logs in
/var/vcap/sys/log/cloud_controller_ng.

On Sun, Mar 6, 2016 at 10:00 AM, sridhar vennela <sridhar.vennela(a)gmail.com>
wrote:

Hi Wayne,

Looks like it. It is trying to connect to loggregator and failing, I guess.


https://github.com/cloudfoundry/cloud_controller_ng/blob/master/app/controllers/runtime/syslog_drain_urls_controller.rb

Thank you,
Sridhar


sridhar vennela
 

Hi Wayne,

Looks like it. It is trying to connect to loggregator and failing, I guess.

https://github.com/cloudfoundry/cloud_controller_ng/blob/master/app/controllers/runtime/syslog_drain_urls_controller.rb

Thank you,
Sridhar


Wayne Ha <wayne.h.ha@...>
 

Since it is complaining that /var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock is not found, I thought I would just touch that file. Now I get:

2016/03/06 17:14:11 [error] 18497#0: *5 connect() to unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock failed (111: Connection refused) while connecting to upstream, client: <bosh director>,
server: _, request: "GET /v2/syslog_drain_urls?batch_size=1000 HTTP/1.1", upstream: "http://unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock:/v2/syslog_drain_urls?batch_size=1000", host: "api.bosh-lite.com"

Maybe there is a network configuration problem in my environment?
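
Touching the socket file will not fix this: the path now exists, but nothing is listening on it, which is why the nginx error changed from "No such file or directory" to "Connection refused". The socket is created by cloud_controller_ng itself when it starts, so the thing to check is whether that process is actually up. A sketch, run on the api VM:

monit summary | grep -i cloud_controller
# if it is not running, the ctl log shown earlier in this thread
# ("Checking for blobstore availability" / "Blobstore is not available")
# explains why it never got far enough to create the socket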


Wayne Ha <wayne.h.ha@...>
 

Sridhar,

Thanks for your response. I have tried your suggestion and it didn't
help. But I might have misled you with the consul error. That error only
got logged once, at the beginning. So, as you said, maybe the VM was not able
to join the consul server before it came up. But after that, the following
error keeps getting logged every minute or so:

2016/03/06 17:04:41 [crit] 11480#0: *4 connect() to
unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock failed (2:
No such file or directory) while connecting to upstream,
server: _, request: "GET /v2/syslog_drain_urls?batch_size=1000 HTTP/1.1",
upstream: "http://unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock:/v2/syslog_drain_urls?batch_size=1000",
host: "api.bosh-lite.com"

So maybe the above is the cause of the problem?

Thanks,

On Sun, Mar 6, 2016 at 12:51 AM, sridhar vennela <sridhar.vennela(a)gmail.com>
wrote:

Hi Wayne,

Somehow the VM is not able to join the consul server. You can try the steps below.

ps -ef | grep consul

kill consul-serverpid

monit restart <consul-job>

Thank you,
Sridhar


sridhar vennela
 

Hi Wayne,

Somehow the VM is not able to join the consul server. You can try the steps below.

ps -ef | grep consul

kill consul-serverpid

monit restart <consul-job>

Thank you,
Sridhar
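
For the restart itself, the monit job on these VMs is named consul_agent (it appears in the monit summary later in this thread), so the concrete commands would look roughly like this, run as root on the affected VM:

monit summary
monit restart consul_agent
monit summary   # repeat until consul_agent reports 'running' again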


Wayne Ha <wayne.h.ha@...>
 

Sridhar,

Thanks for your response. I found that the VM is listening on port 8500:

root(a)c6822dcb-fb02-4858-ae5d-3ab45d593896:/var/vcap/sys/log# netstat -anp | grep LISTEN
tcp  0  0  127.0.0.1:8400      0.0.0.0:*  LISTEN  18162/consul
tcp  0  0  127.0.0.1:8500      0.0.0.0:*  LISTEN  18162/consul
tcp  0  0  127.0.0.1:53        0.0.0.0:*  LISTEN  18162/consul
tcp  0  0  127.0.0.1:2822      0.0.0.0:*  LISTEN  72/monit
tcp  0  0  0.0.0.0:22          0.0.0.0:*  LISTEN  31/sshd
tcp  0  0  10.244.0.138:8301   0.0.0.0:*  LISTEN  18162/consul

If I run "monit stop all" then it only listens to the following:

root(a)c6822dcb-fb02-4858-ae5d-3ab45d593896:/var/vcap/sys/log# netstat -anp | grep LISTEN
tcp  0  0  127.0.0.1:2822  0.0.0.0:*  LISTEN  72/monit
tcp  0  0  0.0.0.0:22      0.0.0.0:*  LISTEN  31/sshd

Note that 10.244.0.138 is the IP of this VM.

Thanks,

On Sat, Mar 5, 2016 at 12:58 AM, sridhar vennela <sridhar.vennela(a)gmail.com>
wrote:

Hi Wayne,

Can you please verify that port 8500 is listening? Maybe the output of netstat -anp
will help.

{"timestamp":"1457136496.397377968","source":"confab","message":"confab.agent-client.verify-joined.members.request.failed","log_level":2,"data":{"error":"Get
http://127.0.0.1:8500/v1/agent/members: dial tcp 127.0.0.1:8500:
getsockopt: connection refused","wan":false}}

Thank you,
Sridhar
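
Since confab is failing on http://127.0.0.1:8500/v1/agent/members, querying that same endpoint by hand shows whether the local agent is up and, if it is, whether it has actually joined the server. A sketch, run on the failing VM:

curl -s http://127.0.0.1:8500/v1/agent/members
# connection refused -> the local consul agent is not listening yet
# a JSON member list -> the agent is up; check that the consul server appears in it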