`api_z1/0' is not running after update to CF v231


Wayne Ha <wayne.h.ha@...>
 

Hi,

I tried to deploy CF v231 and got the following error:

Started updating job api_z1 > api_z1/0. Failed: `api_z1/0' is not running after update (00:03:09)

The monit summary shows:

root@5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc:/var/vcap/sys/log# monit summary
The Monit daemon 5.2.4 uptime: 12m
Process 'consul_agent' running
Process 'cloud_controller_ng' Execution failed
Process 'cloud_controller_worker_local_1' initializing
Process 'cloud_controller_worker_local_2' initializing
Process 'nginx_cc' initializing
Process 'cloud_controller_migration' running
Process 'cloud_controller_clock' running
Process 'cloud_controller_worker_1' Does not exist
Process 'metron_agent' running
Process 'statsd-injector' running
Process 'route_registrar' running
System 'system_5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc' running

Can someone give me pointers on how to fix this problem?

Thanks,

Here is output of bosh vms:

vagrant@agent-id-bosh-0:~$ bosh vms --details
+------------------------------------+---------+---------------+--------------+--------------------------------------+--------------------------------------+--------------+
| Job/index | State | Resource Pool | IPs | CID | Agent ID | Resurrection |
+------------------------------------+---------+---------------+--------------+--------------------------------------+--------------------------------------+--------------+
| api_z1/0 | failing | large_z1 | 10.244.0.138 | 3091ce64-8b95-463b-61c7-a6094359c7f7 | 5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc | active |
| blobstore_z1/0 | running | medium_z1 | 10.244.0.130 | c5715c82-9622-416b-5766-32c528786fb1 | 518aa4d3-1296-4e48-b033-d57ad4baf75a | active |
| consul_z1/0 | running | small_z1 | 10.244.0.54 | 11f444db-673a-4243-6ac8-a750b0dd8733 | c9ca7c9e-525a-4604-9fb0-60cc67d59709 | active |
| doppler_z1/0 | running | medium_z1 | 10.244.0.146 | 73882f35-714b-4613-64cf-6491ebed04f3 | e4299f6f-04b5-4373-a00b-f4d7f21fab2e | active |
| etcd_z1/0 | running | medium_z1 | 10.244.0.42 | 7a1da145-5296-47df-55ac-160248a023ed | 1c414c9d-75cc-40b0-928c-6cf674a6a6f2 | active |
| ha_proxy_z1/0 | running | router_z1 | 10.244.0.34 | 84799c83-48b8-407e-6014-1bcf85a9dd91 | 320c8887-68ec-42e9-b00d-d5ab6468a433 | active |
| hm9000_z1/0 | running | medium_z1 | 10.244.0.142 | 4369a61f-5193-4d95-7bfc-a091abe93bb9 | c9cc2239-969e-4cc2-81e9-b808e8fe0f71 | active |
| loggregator_trafficcontroller_z1/0 | running | small_z1 | 10.244.0.150 | 9c17e6f3-c3bf-4ad7-68f6-59f0527ac2f6 | ee8a3439-6852-496e-a497-68607575641a | active |
| nats_z1/0 | running | medium_z1 | 10.244.0.6 | 41fa153c-2973-4748-6ae3-ed6f5a1ee0a4 | e88b340d-9d06-4bbe-bc5a-f0ac3ff1159d | active |
| postgres_z1/0 | running | medium_z1 | 10.244.0.30 | 8465ab2f-0593-485c-7eb6-11f200fc331e | ade6219c-caf1-4aef-8804-8dd7262fc25d | active |
| router_z1/0 | running | router_z1 | 10.244.0.22 | 782da9f1-212f-4c05-5835-56c5f6979c3c | 5d71bb3d-a6f3-4677-bc09-801e99188c38 | active |
| runner_z1/0 | running | runner_z1 | 10.244.0.26 | 64d55ecb-961e-4c0c-5263-188609d235c0 | a85a3958-e701-4b6d-8dc3-2a80bd33c01b | active |
| uaa_z1/0 | running | medium_z1 | 10.244.0.134 | f105c13e-0311-41fd-6d42-0176cf50274a | 5c5af35b-ab34-42fb-b1d5-dd3c927b192b | active |
+------------------------------------+---------+---------------+--------------+--------------------------------------+--------------------------------------+--------------+


sridhar vennela
 

Hi Wayne,

The jobs below are failing. You will find more details about the errors in this directory:

/var/vcap/sys/log/<job_name>/

Process 'cloud_controller_ng' Execution failed
Process 'cloud_controller_worker_1' Does not exist
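
For example, to see the most recent errors from cloud_controller_ng (log file names vary a bit between releases, so tailing everything under its directory is simplest):

cd /var/vcap/sys/log/cloud_controller_ng
tail -n 50 *.log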

Thank you,
Sridhar


Wayne Ha <wayne.h.ha@...>
 

Sridhar,

Thanks for your response. I see the following in the log file:

2016/03/05 00:02:38 [crit] 5082#0: *3 connect() to unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock failed (2: No such file or directory) while connecting to upstream,
server: _, request: "GET /v2/syslog_drain_urls?batch_size=1000 HTTP/1.1", upstream: "http://unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock:/v2/syslog_drain_urls?batch_size=1000", host: "api.bosh-lite.com"

But I don't know what the missing file or directory is. If I remove cloud_controller_ng from the yml:

$ diff -w bosh-lite-v231.yml.1603041454 bosh-lite-v231.yml
etc...
422,423c420,421
< # name: cloud_controller_ng
< # release: cf
---
> - name: cloud_controller_ng
>   release: cf
etc...

The problem goes away:

root@5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc:/var/vcap/sys/log# monit summary
The Monit daemon 5.2.4 uptime: 0m
Process 'consul_agent' running
Process 'cloud_controller_clock' running
Process 'cloud_controller_worker_1' running
Process 'metron_agent' running
Process 'statsd-injector' running
Process 'route_registrar' running
System 'system_5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc' running

But I think the environment also becomes unusable.

Thanks,


sridhar vennela
 

That's correct. You need that job; otherwise, the environment is unusable. Which guide are you following to install bosh-lite? And do you have any idea about the host configuration for api.bosh-lite.com?


Wayne Ha <wayne.h.ha@...>
 

I am able to ping that host:

root@5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc:/var/vcap/sys/log# ping -c 1 api.bosh-lite.com
PING api.bosh-lite.com (10.244.0.34) 56(84) bytes of data.
64 bytes from 10.244.0.34: icmp_seq=1 ttl=63 time=0.088 ms
--- api.bosh-lite.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.088/0.088/0.088/0.000 ms

I just use the yml file that came with CF v231:

$ grep api.bosh-lite.com bosh-lite-v231.yml | sort -u
api: api.bosh-lite.com
- api.bosh-lite.com
srv_api_uri: https://api.bosh-lite.com

I also found the following in the log file:

{"timestamp":"1457136496.397377968","source":"confab","message":"confab.agent-client.verify-joined.members.request.failed","log_level":2,"data":{"error":"Get http://127.0.0.1:8500/v1/agent/members: dial tcp 127.0.0.1:8500: getsockopt: connection refused","wan":false}}
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
Node name: 'api-z1-0'
Datacenter: 'dc1'
Server: false (bootstrap: false)
Client Addr: 127.0.0.1 (HTTP: 8500, HTTPS: -1, DNS: 53, RPC: 8400)
Cluster Addr: 10.244.0.138 (LAN: 8301, WAN: 8302)
Gossip encrypt: true, RPC-TLS: true, TLS-Incoming: true
Atlas: <disabled>

I can also ping the above IPs:

root@5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc:/var/vcap/sys/log# ping -c 1 10.244.0.138
PING 10.244.0.138 (10.244.0.138) 56(84) bytes of data.
64 bytes from 10.244.0.138: icmp_seq=1 ttl=64 time=0.044 ms
1 packets transmitted, 1 received, 0% packet loss, time 0ms

root@5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc:/var/vcap/sys/log# ping -c 1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.049 ms
1 packets transmitted, 1 received, 0% packet loss, time 0ms

Not sure why the connection is refused.
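
I will also try probing the consul HTTP API directly (the endpoint is taken straight from the error message) to see whether the agent answers at all:

curl http://127.0.0.1:8500/v1/agent/members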


sridhar vennela
 

Hi Wayne,

Can you please verify that port 8500 is listening? Maybe the output of netstat -anp will help.

{"timestamp":"1457136496.397377968","source":"confab","message":"confab.agent-client.verify-joined.members.request.failed","log_level":2,"data":{"error":"Get http://127.0.0.1:8500/v1/agent/members: dial tcp 127.0.0.1:8500: getsockopt: connection refused","wan":false}}

Thank you,
Sridhar


Wayne Ha <wayne.h.ha@...>
 

Sridhar,

Thanks for your response. I found the VM is listening on port 8500:

root@c6822dcb-fb02-4858-ae5d-3ab45d593896:/var/vcap/sys/log# netstat -anp | grep LISTEN
tcp 0 0 127.0.0.1:8400 0.0.0.0:* LISTEN 18162/consul
tcp 0 0 127.0.0.1:8500 0.0.0.0:* LISTEN 18162/consul
tcp 0 0 127.0.0.1:53 0.0.0.0:* LISTEN 18162/consul
tcp 0 0 127.0.0.1:2822 0.0.0.0:* LISTEN 72/monit
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 31/sshd
tcp 0 0 10.244.0.138:8301 0.0.0.0:* LISTEN 18162/consul

If I run "monit stop all" then it only listens on the following:

root@c6822dcb-fb02-4858-ae5d-3ab45d593896:/var/vcap/sys/log# netstat -anp | grep LISTEN
tcp 0 0 127.0.0.1:2822 0.0.0.0:* LISTEN 72/monit
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 31/sshd

Note that 10.244.0.138 is the IP of this VM.

Thanks,



sridhar vennela
 

Hi Wayne,

Somehow the VM is not able to join the consul server. You can try the steps below.

ps -ef | grep consul    # note the consul agent's PID

kill <consul agent PID>

monit restart consul_agent

Thank you,
Sridhar


Wayne Ha <wayne.h.ha@...>
 

Sridhar,

Thanks for your response. I have tried your suggestion and it doesn't help. But I might have misled you with the consul error. That error only got logged once, at the beginning. So like you said, maybe the VM was not able to join the consul server before it came up. But after that, the following error keeps logging every minute or so:

2016/03/06 17:04:41 [crit] 11480#0: *4 connect() to unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock failed (2: No such file or directory) while connecting to upstream,
server: _, request: "GET /v2/syslog_drain_urls?batch_size=1000 HTTP/1.1", upstream: "http://unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock:/v2/syslog_drain_urls?batch_size=1000", host: "api.bosh-lite.com"

So maybe the above is the cause of the problem?

Thanks,



Wayne Ha <wayne.h.ha@...>
 

Since it is complaining that /var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock is not found, I thought I would just touch that file. Now I get:

2016/03/06 17:14:11 [error] 18497#0: *5 connect() to unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock failed (111: Connection refused) while connecting to upstream, client: <bosh director>,
server: _, request: "GET /v2/syslog_drain_urls?batch_size=1000 HTTP/1.1", upstream: "http://unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock:/v2/syslog_drain_urls?batch_size=1000", host: "api.bosh-lite.com"

Maybe there is a network configuration problem in my environment?


sridhar vennela
 

Hi Wayne,

Looks like it. It is trying to connect to loggregator and failing, I guess.

https://github.com/cloudfoundry/cloud_controller_ng/blob/master/app/controllers/runtime/syslog_drain_urls_controller.rb

Thank you,
Sridhar


Amit Kumar Gupta
 

The log lines saying "/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock is not found" are probably just a symptom of the problem, not the root cause. You're probably seeing those in the nginx logs? Cloud Controller is failing to start, hence it is not establishing a connection on the socket. You need to dig deeper into the failures in the logs in /var/vcap/sys/log/cloud_controller_ng.
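
For example, something like this (log file names differ between releases, so a broad grep is easiest):

grep -ril error /var/vcap/sys/log/cloud_controller_ng/
tail -n 100 /var/vcap/sys/log/cloud_controller_ng/*.log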



Wayne Ha <wayne.h.ha@...>
 

Amit,

Thanks for letting me know I might have looked at the wrong log files. I saw the following in the cloud_controller log files:

root@7a1f2221-c31a-494b-b16c-d4a97c16c9ab:/var/vcap/sys/log# tail ./cloud_controller_ng_ctl.log
[2016-03-06 22:40:28+0000] ------------ STARTING cloud_controller_ng_ctl at Sun Mar 6 22:40:28 UTC 2016 --------------
[2016-03-06 22:40:28+0000] Checking for blobstore availability
[2016-03-06 22:41:03+0000] Blobstore is not available

root@7a1f2221-c31a-494b-b16c-d4a97c16c9ab:/var/vcap/sys/log# tail ./cloud_controller_worker_ctl.log
[2016-03-06 22:41:13+0000] Killing /var/vcap/sys/run/cloud_controller_ng/cloud_controller_worker_2.pid: 12145
[2016-03-06 22:41:13+0000] .Stopped
[2016-03-06 22:41:36+0000] Blobstore is not available
[2016-03-06 22:41:48+0000] ------------ STARTING cloud_controller_worker_ctl at Sun Mar 6 22:41:48 UTC 2016 --------------
[2016-03-06 22:41:48+0000] Checking for blobstore availability
[2016-03-06 22:41:48+0000] Removing stale pidfile...

So maybe the cause is that the blobstore is not available?
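
If the availability check resolves the blobstore host through consul DNS, that would tie back to the earlier consul errors. I can test the resolution directly (the internal hostname below is my guess based on the default v231 manifest):

nslookup blobstore.service.cf.internal 127.0.0.1
curl -v http://blobstore.service.cf.internal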

Thanks,



Wayne Ha <wayne.h.ha@...>
 

I checked that the blobstore is running:

root@e83575d2-dfbf-4f7c-97ee-5112560fa137:/var/vcap/sys/log# monit summary
The Monit daemon 5.2.4 uptime: 4h 14m
Process 'consul_agent' running
Process 'metron_agent' running
Process 'blobstore_nginx' running
Process 'route_registrar' running
System 'system_e83575d2-dfbf-4f7c-97ee-5112560fa137' running

But there are thousands of errors saying DopplerForwarder: can't forward message, loggregator client pool is empty:

root@e83575d2-dfbf-4f7c-97ee-5112560fa137:/var/vcap/sys/log# find . -name "*.log" | xargs grep -i error | cut -c 73-500 | sort -u
,"process_id":246,"source":"metron","log_level":"error","message":"DopplerForwarder: can't forward message","data":{"error":"loggregator client pool is empty"},"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/writers/dopplerforwarder/doppler_forwarder.go","line":104,"method":"metron/writers/dopplerforwarder.(*DopplerForwarder).networkWrite"}

Not sure what is wrong.
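
Since metron discovers dopplers through etcd, maybe it is worth checking what has registered there (the etcd IP is from my bosh vms output above; port 4001 is my assumption for the client port):

curl 'http://10.244.0.42:4001/v2/keys/?recursive=true'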


Paul Bakare
 

Wayne, is the nfs_server-partition running?



Wayne Ha <wayne.h.ha@...>
 

Kayode,

I am using the default bosh-lite-v231.yml file, and the instances for the nfs server are set to 0:

vagrant@agent-id-bosh-0:~$ egrep -i "name:.*nfs|instances" bosh-lite-v231.yml.1603041454
etc...
- instances: 0
- instances: 0
- instances: 0
name: nfs_z1
- name: debian_nfs_server
- instances: 1
- instances: 1
- instances: 1
etc...

So it is not running.

Thanks,


Amit Kumar Gupta
 

As of cf v231, CC has switched from using NFS to WebDAV as the default blobstore. There are more details in the release notes: https://github.com/cloudfoundry/cf-release/releases/tag/v231. I don't know off-hand how to debug the issue you're seeing, but I will reach out to some folks with more knowledge of Cloud Controller.
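
A quick way to confirm your manifest has picked up the new default is to search it for the webdav settings (file name taken from earlier in this thread):

grep -n webdav bosh-lite-v231.yml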

Best,
Amit



Zach Robinson
 

Wayne,

Can you verify that you are using the latest bosh-lite stemcell, 3147? Older stemcells are known to have issues with consul, which is what many of the CF components use for service discovery.

The latest bosh-lite stemcells can be found at http://bosh.io. Just search for "lite".

See this similar issue: https://github.com/cloudfoundry/cf-release/issues/919
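
To switch, something like this should work with the bosh CLI (the URL follows bosh.io's usual download pattern; double-check it on the site):

bosh upload stemcell https://bosh.io/d/stemcells/bosh-warden-boshlite-ubuntu-trusty-go_agent?v=3147
bosh deploy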

-Zach


Wayne Ha <wayne.h.ha@...>
 

Zach,

Thanks for the hints. You are right, I am not using the latest stemcell:

vagrant@agent-id-bosh-0:~$ bosh stemcells
+---------------------------------------------+---------+--------------------------------------+
| Name | Version | CID |
+---------------------------------------------+---------+--------------------------------------+
| bosh-warden-boshlite-ubuntu-trusty-go_agent | 389* | cb6ee28c-a703-4a7e-581b-b63be2302e3d |

I will try the stemcell you recommended to see if it helps.

Thanks,


Wayne Ha <wayne.h.ha@...>
 

Zach,

After using the latest stemcell, I got a successful deployment. But after that, cf login fails:

vagrant@agent-id-bosh-0:~$ cf login -a api.bosh-lite.com -u admin -p admin
API endpoint: api.bosh-lite.com
FAILED
Invalid SSL Cert for api.bosh-lite.com
TIP: Use 'cf login --skip-ssl-validation' to continue with an insecure API endpoint

vagrant@agent-id-bosh-0:~$ cf login -a api.bosh-lite.com -u admin -p admin --skip-ssl-validation
API endpoint: api.bosh-lite.com
FAILED
Error performing request: Get https://login.bosh-lite.com/login: stopped after 1 redirect
API endpoint: https://api.bosh-lite.com (API version: 2.51.0)
Not logged in. Use 'cf login' to log in.
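
To see where the redirect is pointing, I will try dumping the response headers (just my guess at a useful probe):

curl -sk -D - -o /dev/null https://login.bosh-lite.com/login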

I saw the following in uaa.log:

root@d142fabc-f823-40df-b9ea-97d306bf7209:/var/vcap/sys/log/uaa# grep -A9 -i error uaa.log | cut -c 65-650
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/error'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/email_sent'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/create_account*'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/accounts/email_sent'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/invalid_request'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/saml_error'
DEBUG --- UaaRequestMatcher: [loginAuthenticateRequestMatcher] Checking match of request : '/login'; '/authenticate' with parameters={} and headers {Authorization=[bearer ], accept=[application/json]}
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/authenticate/**'
DEBUG --- UaaRequestMatcher: [loginAuthorizeRequestMatcher] Checking match of request : '/login'; '/oauth/authorize' with parameters={source=login} and headers {accept=[application/json]}
DEBUG --- UaaRequestMatcher: [loginTokenRequestMatcher] Checking match of request : '/login'; '/oauth/token' with parameters={source=login, grant_type=password, add_new=} and headers {Authorization=[bearer ], accept=[application/json]}
DEBUG --- UaaRequestMatcher: [loginAuthorizeRequestMatcherOld] Checking match of request : '/login'; '/oauth/authorize' with parameters={login={} and headers {accept=[application/json]}
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/password_*'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/email_*'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/oauth/token/revoke/**'
DEBUG --- UaaRequestMatcher: [passcodeTokenMatcher] Checking match of request : '/login'; '/oauth/token' with parameters={grant_type=password, passcode=} and headers {accept=[application/json, application/x-www-form-urlencoded]}

But I don't know what the above means.

Thanks,