`api_z1/0' is not running after update to CF v231
Wayne Ha <wayne.h.ha@...>
Hi,
I tried to deploy CF v231 and got the following error:

Started updating job api_z1 > api_z1/0. Failed: `api_z1/0' is not running after update (00:03:09)

The monit summary shows:

root@5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc:/var/vcap/sys/log# monit summary
The Monit daemon 5.2.4 uptime: 12m
Process 'consul_agent'                     running
Process 'cloud_controller_ng'              Execution failed
Process 'cloud_controller_worker_local_1'  initializing
Process 'cloud_controller_worker_local_2'  initializing
Process 'nginx_cc'                         initializing
Process 'cloud_controller_migration'       running
Process 'cloud_controller_clock'           running
Process 'cloud_controller_worker_1'        Does not exist
Process 'metron_agent'                     running
Process 'statsd-injector'                  running
Process 'route_registrar'                  running
System 'system_5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc' running

Can someone give me pointers how to fix this problem?

Thanks,

Here is output of bosh vms:

vagrant@agent-id-bosh-0:~$ bosh vms --details
+------------------------------------+---------+---------------+--------------+--------------------------------------+--------------------------------------+--------------+
| Job/index                          | State   | Resource Pool | IPs          | CID                                  | Agent ID                             | Resurrection |
+------------------------------------+---------+---------------+--------------+--------------------------------------+--------------------------------------+--------------+
| api_z1/0                           | failing | large_z1      | 10.244.0.138 | 3091ce64-8b95-463b-61c7-a6094359c7f7 | 5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc | active       |
| blobstore_z1/0                     | running | medium_z1     | 10.244.0.130 | c5715c82-9622-416b-5766-32c528786fb1 | 518aa4d3-1296-4e48-b033-d57ad4baf75a | active       |
| consul_z1/0                        | running | small_z1      | 10.244.0.54  | 11f444db-673a-4243-6ac8-a750b0dd8733 | c9ca7c9e-525a-4604-9fb0-60cc67d59709 | active       |
| doppler_z1/0                       | running | medium_z1     | 10.244.0.146 | 73882f35-714b-4613-64cf-6491ebed04f3 | e4299f6f-04b5-4373-a00b-f4d7f21fab2e | active       |
| etcd_z1/0                          | running | medium_z1     | 10.244.0.42  | 7a1da145-5296-47df-55ac-160248a023ed | 1c414c9d-75cc-40b0-928c-6cf674a6a6f2 | active       |
| ha_proxy_z1/0                      | running | router_z1     | 10.244.0.34  | 84799c83-48b8-407e-6014-1bcf85a9dd91 | 320c8887-68ec-42e9-b00d-d5ab6468a433 | active       |
| hm9000_z1/0                        | running | medium_z1     | 10.244.0.142 | 4369a61f-5193-4d95-7bfc-a091abe93bb9 | c9cc2239-969e-4cc2-81e9-b808e8fe0f71 | active       |
| loggregator_trafficcontroller_z1/0 | running | small_z1      | 10.244.0.150 | 9c17e6f3-c3bf-4ad7-68f6-59f0527ac2f6 | ee8a3439-6852-496e-a497-68607575641a | active       |
| nats_z1/0                          | running | medium_z1     | 10.244.0.6   | 41fa153c-2973-4748-6ae3-ed6f5a1ee0a4 | e88b340d-9d06-4bbe-bc5a-f0ac3ff1159d | active       |
| postgres_z1/0                      | running | medium_z1     | 10.244.0.30  | 8465ab2f-0593-485c-7eb6-11f200fc331e | ade6219c-caf1-4aef-8804-8dd7262fc25d | active       |
| router_z1/0                        | running | router_z1     | 10.244.0.22  | 782da9f1-212f-4c05-5835-56c5f6979c3c | 5d71bb3d-a6f3-4677-bc09-801e99188c38 | active       |
| runner_z1/0                        | running | runner_z1     | 10.244.0.26  | 64d55ecb-961e-4c0c-5263-188609d235c0 | a85a3958-e701-4b6d-8dc3-2a80bd33c01b | active       |
| uaa_z1/0                           | running | medium_z1     | 10.244.0.134 | f105c13e-0311-41fd-6d42-0176cf50274a | 5c5af35b-ab34-42fb-b1d5-dd3c927b192b | active       |
+------------------------------------+---------+---------------+--------------+--------------------------------------+--------------------------------------+--------------+
sridhar vennela
Hi Wayne,
The jobs below are failing. You will find more details about the errors under /var/vcap/sys/log/<job_name>/ on that VM:

Process 'cloud_controller_ng'        Execution failed
Process 'cloud_controller_worker_1'  Does not exist

Thank you,
Sridhar
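P.S. For example, something along these lines from the bosh-lite host (just a sketch; the exact log file names under each job directory vary):

bosh ssh api_z1 0
sudo -i
ls /var/vcap/sys/log/cloud_controller_ng/
tail -n 50 /var/vcap/sys/log/cloud_controller_ng/*.log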
Wayne Ha <wayne.h.ha@...>
Sridhar,
Thanks for your response. I see the following in the log file:

2016/03/05 00:02:38 [crit] 5082#0: *3 connect() to unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock failed (2: No such file or directory) while connecting to upstream, server: _, request: "GET /v2/syslog_drain_urls?batch_size=1000 HTTP/1.1", upstream: "http://unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock:/v2/syslog_drain_urls?batch_size=1000", host: "api.bosh-lite.com"

But I don't know what the missing file or directory is. If I remove cloud_controller_ng from the yml:

$ diff -w bosh-lite-v231.yml.1603041454 bosh-lite-v231.yml
etc...
422,423c420,421
< # name: cloud_controller_ng
< # release: cf
---
- name: cloud_controller_ng
etc...

the problem goes away:

root@5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc:/var/vcap/sys/log# monit summary
The Monit daemon 5.2.4 uptime: 0m
Process 'consul_agent'              running
Process 'cloud_controller_clock'    running
Process 'cloud_controller_worker_1' running
Process 'metron_agent'              running
Process 'statsd-injector'           running
Process 'route_registrar'           running
System 'system_5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc' running

But I think the environment also becomes unusable.

Thanks,
sridhar vennela
That's correct. You need that job; otherwise, the environment is unusable. Which checklist or guide are you following to install bosh-lite? And do you have any idea how the host api.bosh-lite.com is configured?
Wayne Ha <wayne.h.ha@...>
I am able to ping that host:
root@5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc:/var/vcap/sys/log# ping -c 1 api.bosh-lite.com
PING api.bosh-lite.com (10.244.0.34) 56(84) bytes of data.
64 bytes from 10.244.0.34: icmp_seq=1 ttl=63 time=0.088 ms

--- api.bosh-lite.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.088/0.088/0.088/0.000 ms

I just use the yml file that came with CF v231:

$ grep api.bosh-lite.com bosh-lite-v231.yml | sort -u
api: api.bosh-lite.com
- api.bosh-lite.com
srv_api_uri: https://api.bosh-lite.com

I also found the following in the log file:

{"timestamp":"1457136496.397377968","source":"confab","message":"confab.agent-client.verify-joined.members.request.failed","log_level":2,"data":{"error":"Get http://127.0.0.1:8500/v1/agent/members: dial tcp 127.0.0.1:8500: getsockopt: connection refused","wan":false}}

==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
    Node name: 'api-z1-0'
    Datacenter: 'dc1'
    Server: false (bootstrap: false)
    Client Addr: 127.0.0.1 (HTTP: 8500, HTTPS: -1, DNS: 53, RPC: 8400)
    Cluster Addr: 10.244.0.138 (LAN: 8301, WAN: 8302)
    Gossip encrypt: true, RPC-TLS: true, TLS-Incoming: true
    Atlas: <disabled>

I can also ping the above IPs:

root@5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc:/var/vcap/sys/log# ping -c 1 10.244.0.138
PING 10.244.0.138 (10.244.0.138) 56(84) bytes of data.
64 bytes from 10.244.0.138: icmp_seq=1 ttl=64 time=0.044 ms
1 packets transmitted, 1 received, 0% packet loss, time 0ms

root@5f0f5c52-01a8-4cc4-afac-d6a25edd8dcc:/var/vcap/sys/log# ping -c 1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.049 ms
1 packets transmitted, 1 received, 0% packet loss, time 0ms

Not sure why there is a connection refused.
sridhar vennela
Hi Wayne,
Can you please verify that port 8500 is listening? The output of netstat -anp may help. The error was:

{"timestamp":"1457136496.397377968","source":"confab","message":"confab.agent-client.verify-joined.members.request.failed","log_level":2,"data":{"error":"Get http://127.0.0.1:8500/v1/agent/members: dial tcp 127.0.0.1:8500: getsockopt: connection refused","wan":false}}

Thank you,
Sridhar
Wayne Ha <wayne.h.ha@...>
Sridhar,
Thanks for your response. I found the VM is listening on port 8500:

root@c6822dcb-fb02-4858-ae5d-3ab45d593896:/var/vcap/sys/log# netstat -anp | grep LISTEN
tcp  0  0 127.0.0.1:8400     0.0.0.0:*  LISTEN  18162/consul
tcp  0  0 127.0.0.1:8500     0.0.0.0:*  LISTEN  18162/consul
tcp  0  0 127.0.0.1:53       0.0.0.0:*  LISTEN  18162/consul
tcp  0  0 127.0.0.1:2822     0.0.0.0:*  LISTEN  72/monit
tcp  0  0 0.0.0.0:22         0.0.0.0:*  LISTEN  31/sshd
tcp  0  0 10.244.0.138:8301  0.0.0.0:*  LISTEN  18162/consul

If I run "monit stop all", then it only listens on the following:

root@c6822dcb-fb02-4858-ae5d-3ab45d593896:/var/vcap/sys/log# netstat -anp | grep LISTEN
tcp  0  0 127.0.0.1:2822  0.0.0.0:*  LISTEN  72/monit
tcp  0  0 0.0.0.0:22      0.0.0.0:*  LISTEN  31/sshd

Note that 10.244.0.138 is the IP of this VM.

Thanks,
sridhar vennela
Hi Wayne,
Somehow the VM is not able to join the consul server. You can try the steps below:

ps -ef | grep consul
kill <consul-server-pid>
monit restart <consul-job>

Thank you,
Sridhar
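P.S. A quick way to check whether the local agent has joined, reusing the endpoint from the confab log line you posted (consul_agent is the job name from your monit summary):

curl http://127.0.0.1:8500/v1/agent/members
monit restart consul_agent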
Wayne Ha <wayne.h.ha@...>
Sridhar,
Thanks for your response. I have tried your suggestion and it doesn't help. But I might have misled you with the consul error: that error only got logged once, at the beginning. So, as you said, maybe the VM was not able to join the consul server before it came up. After that, the following error keeps being logged every minute or so:

2016/03/06 17:04:41 [crit] 11480#0: *4 connect() to unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock failed (2: No such file or directory) while connecting to upstream, server: _, request: "GET /v2/syslog_drain_urls?batch_size=1000 HTTP/1.1", upstream: "http://unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock:/v2/syslog_drain_urls?batch_size=1000", host: "api.bosh-lite.com"

So maybe the above is the cause of the problem?

Thanks,
Wayne Ha <wayne.h.ha@...>
Since it is complaining that /var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock is not found, I thought I would just touch that file. Now I get:
2016/03/06 17:14:11 [error] 18497#0: *5 connect() to unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock failed (111: Connection refused) while connecting to upstream, client: <bosh director>, server: _, request: "GET /v2/syslog_drain_urls?batch_size=1000 HTTP/1.1", upstream: "http://unix:/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock:/v2/syslog_drain_urls?batch_size=1000", host: "api.bosh-lite.com"

Maybe there is a network configuration problem in my environment?
sridhar vennela
Hi Wayne,
Looks like it. It is trying to connect to loggregator and failing, I guess:

https://github.com/cloudfoundry/cloud_controller_ng/blob/master/app/controllers/runtime/syslog_drain_urls_controller.rb

Thank you,
Sridhar
Amit Kumar Gupta
The log lines saying "/var/vcap/sys/run/cloud_controller_ng/cloud_controller.sock is not found" are probably just a symptom of the problem, not the root cause. You're probably seeing those in the nginx logs? Cloud Controller is failing to start, hence it is not establishing a connection on the socket. You need to dig deeper into the failures in the logs under /var/vcap/sys/log/cloud_controller_ng.
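For example, something like this on the api_z1/0 VM (a rough sketch; the exact file names under that directory can differ between releases):

cd /var/vcap/sys/log/cloud_controller_ng
ls -lt
tail -n 100 cloud_controller_ng_ctl.log
grep -i error *.log | tail -n 50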
Wayne Ha <wayne.h.ha@...>
Amit,
Thanks for letting me know I might have been looking at the wrong log files. I see the following in the cloud_controller log files:

root@7a1f2221-c31a-494b-b16c-d4a97c16c9ab:/var/vcap/sys/log# tail ./cloud_controller_ng_ctl.log
[2016-03-06 22:40:28+0000] ------------ STARTING cloud_controller_ng_ctl at Sun Mar 6 22:40:28 UTC 2016 --------------
[2016-03-06 22:40:28+0000] Checking for blobstore availability
[2016-03-06 22:41:03+0000] Blobstore is not available

root@7a1f2221-c31a-494b-b16c-d4a97c16c9ab:/var/vcap/sys/log# tail ./cloud_controller_worker_ctl.log
[2016-03-06 22:41:13+0000] Killing /var/vcap/sys/run/cloud_controller_ng/cloud_controller_worker_2.pid: 12145
[2016-03-06 22:41:13+0000] .Stopped
[2016-03-06 22:41:36+0000] Blobstore is not available
[2016-03-06 22:41:48+0000] ------------ STARTING cloud_controller_worker_ctl at Sun Mar 6 22:41:48 UTC 2016 --------------
[2016-03-06 22:41:48+0000] Checking for blobstore availability
[2016-03-06 22:41:48+0000] Removing stale pidfile...

So maybe the cause is that the blobstore is not available?

Thanks,
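P.S. I guess I could also look at what that availability check actually does; something like this, assuming the check lives with the job's scripts under the standard /var/vcap/jobs layout:

grep -rn "blobstore availability" /var/vcap/jobs/cloud_controller_ng/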
|
Wayne Ha <wayne.h.ha@...>
I checked that the blobstore is running:

root@e83575d2-dfbf-4f7c-97ee-5112560fa137:/var/vcap/sys/log# monit summary
The Monit daemon 5.2.4 uptime: 4h 14m
Process 'consul_agent'     running
Process 'metron_agent'     running
Process 'blobstore_nginx'  running
Process 'route_registrar'  running
System 'system_e83575d2-dfbf-4f7c-97ee-5112560fa137' running

But there are thousands of errors saying "DopplerForwarder: can't forward message, loggregator client pool is empty":

root@e83575d2-dfbf-4f7c-97ee-5112560fa137:/var/vcap/sys/log# find . -name "*.log" | xargs grep -i error | cut -c 73-500 | sort -u
,"process_id":246,"source":"metron","log_level":"error","message":"DopplerForwarder: can't forward message","data":{"error":"loggregator client pool is empty"},"file":"/var/vcap/data/compile/metron_agent/loggregator/src/metron/writers/dopplerforwarder/doppler_forwarder.go","line":104,"method":"metron/writers/dopplerforwarder.(*DopplerForwarder).networkWrite"}

Not sure what is wrong.
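Maybe I should also check the doppler and etcd VMs the same way (the same monit check as before), since metron forwards to doppler:

bosh ssh doppler_z1 0
sudo -i
monit summary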
Paul Bakare
Wayne, is the nfs_server-partition running?
Wayne Ha <wayne.h.ha@...>
Kayode,
I am using the default bosh-lite-v231.yml file, and the number of instances for the nfs server is set to 0:

vagrant@agent-id-bosh-0:~$ egrep -i "name:.*nfs|instances" bosh-lite-v231.yml.1603041454
etc...
- instances: 0
- instances: 0
- instances: 0
name: nfs_z1
- name: debian_nfs_server
- instances: 1
- instances: 1
- instances: 1
etc...

So it is not running.

Thanks,
Amit Kumar Gupta
As of cf v231, CC has switched from using NFS to WebDav as the default blobstore. There are more details in the release notes: https://github.com/cloudfoundry/cf-release/releases/tag/v231.

I don't know off-hand how to debug the issue you're seeing, but I will reach out to some folks with more knowledge of Cloud Controller.

Best,
Amit
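P.S. One quick sanity check in the meantime (just a guess, not a definitive fix): see whether your manifest actually picked up the new WebDav blobstore settings, e.g.:

grep -n -i webdav bosh-lite-v231.yml
grep -n -i blobstore bosh-lite-v231.yml | head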
Zach Robinson
Wayne,
Can you verify that you are using the latest bosh-lite stemcell, 3147? Older stemcells are known to have issues with consul, which is what many of the CF components use for service discovery. The latest bosh-lite stemcells can be found at http://bosh.io; just search for "lite". See this similar issue: https://github.com/cloudfoundry/cf-release/issues/919

-Zach
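P.S. Roughly (a sketch; the URL below follows the usual bosh.io download link format, so double-check it against the site):

bosh upload stemcell https://bosh.io/d/stemcells/bosh-warden-boshlite-ubuntu-trusty-go_agent?v=3147
bosh deploy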
Wayne Ha <wayne.h.ha@...>
Zach,
Thanks for the hints. You are right, I am not using the latest stemcell:

vagrant@agent-id-bosh-0:~$ bosh stemcells
+---------------------------------------------+---------+--------------------------------------+
| Name                                        | Version | CID                                  |
+---------------------------------------------+---------+--------------------------------------+
| bosh-warden-boshlite-ubuntu-trusty-go_agent | 389*    | cb6ee28c-a703-4a7e-581b-b63be2302e3d |

I will try the stemcell you recommended to see if it helps.

Thanks,
Wayne Ha <wayne.h.ha@...>
Zach,
After using the latest stemcell, I got a successful deployment. But after that, cf login fails:

vagrant@agent-id-bosh-0:~$ cf login -a api.bosh-lite.com -u admin -p admin
API endpoint: api.bosh-lite.com
FAILED
Invalid SSL Cert for api.bosh-lite.com
TIP: Use 'cf login --skip-ssl-validation' to continue with an insecure API endpoint

vagrant@agent-id-bosh-0:~$ cf login -a api.bosh-lite.com -u admin -p admin --skip-ssl-validation
API endpoint: api.bosh-lite.com
FAILED
Error performing request: Get https://login.bosh-lite.com/login: stopped after 1 redirect
API endpoint: https://api.bosh-lite.com (API version: 2.51.0)
Not logged in. Use 'cf login' to log in.

I saw the following in uaa.log:

root@d142fabc-f823-40df-b9ea-97d306bf7209:/var/vcap/sys/log/uaa# grep -A9 -i error uaa.log | cut -c 65-650
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/error'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/email_sent'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/create_account*'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/accounts/email_sent'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/invalid_request'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/saml_error'
DEBUG --- UaaRequestMatcher: [loginAuthenticateRequestMatcher] Checking match of request : '/login'; '/authenticate' with parameters={} and headers {Authorization=[bearer ], accept=[application/json]}
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/authenticate/**'
DEBUG --- UaaRequestMatcher: [loginAuthorizeRequestMatcher] Checking match of request : '/login'; '/oauth/authorize' with parameters={source=login} and headers {accept=[application/json]}
DEBUG --- UaaRequestMatcher: [loginTokenRequestMatcher] Checking match of request : '/login'; '/oauth/token' with parameters={source=login, grant_type=password, add_new=} and headers {Authorization=[bearer ], accept=[application/json]}
DEBUG --- UaaRequestMatcher: [loginAuthorizeRequestMatcherOld] Checking match of request : '/login'; '/oauth/authorize' with parameters={login={} and headers {accept=[application/json]}
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/password_*'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/email_*'
DEBUG --- AntPathRequestMatcher: Checking match of request : '/login'; against '/oauth/token/revoke/**'
DEBUG --- UaaRequestMatcher: [passcodeTokenMatcher] Checking match of request : '/login'; '/oauth/token' with parameters={grant_type=password, passcode=} and headers {accept=[application/json, application/x-www-form-urlencoded]}

But I don't know what the above means.

Thanks,
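P.S. Maybe I should try setting the API target explicitly over https and authenticating in a separate step; this is just a guess on my part:

cf api https://api.bosh-lite.com --skip-ssl-validation
cf auth admin admin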