Gateways fail to start
Ulrik Sandberg
I'm trying to deploy the community cf-services-contrib release, version 6, on a local Vagrant-based bosh-lite (which otherwise works fine), following the instructions at https://github.com/cloudfoundry-community/cf-services-contrib-release, but all the gateways fail to start. The nodes are running OK.
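For reference, the deploy sequence is the standard BOSH v1 flow (the director address is the bosh-lite Vagrant default, and the release-file name is from memory, so check the repo's `releases/` directory):

```
$ bosh target 192.168.50.4 lite
$ bosh upload release releases/cf-services-contrib-6.yml
$ bosh deployment tmp/contrib-services-warden-manifest.yml
$ bosh deploy
```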
```
$ bosh vms
Acting as user 'admin' on 'Bosh Lite Director'
Deployment `cf-services-contrib'

Director task 68

Task 68 done

+----------------------+---------+---------------+-------------+
| Job/index            | State   | Resource Pool | IPs         |
+----------------------+---------+---------------+-------------+
| mongodb_gateway/0    | failing | gateway_z1    | 10.244.1.2  |
| mongodb_node/0       | running | node_z1       | 10.244.1.82 |
| postgresql_gateway/0 | failing | gateway_z1    | 10.244.1.10 |
| postgresql_node/0    | running | node_z1       | 10.244.1.90 |
| rabbit_gateway/0     | failing | gateway_z1    | 10.244.1.6  |
| rabbit_node/0        | running | node_z1       | 10.244.1.86 |
| redis_gateway/0      | failing | gateway_z1    | 10.244.1.14 |
| redis_node/0         | running | node_z1       | 10.244.1.94 |
+----------------------+---------+---------------+-------------+

VMs total: 8

Deployment `cf-warden'

Director task 69

Task 69 done

+------------------------------------+---------+---------------+--------------+
| Job/index                          | State   | Resource Pool | IPs          |
+------------------------------------+---------+---------------+--------------+
| api_z1/0                           | running | large_z1      | 10.244.0.134 |
| consul_z1/0                        | running | small_z1      | 10.244.0.54  |
| doppler_z1/0                       | running | medium_z1     | 10.244.0.142 |
| etcd_z1/0                          | running | medium_z1     | 10.244.0.42  |
| ha_proxy_z1/0                      | running | router_z1     | 10.244.0.34  |
| hm9000_z1/0                        | running | medium_z1     | 10.244.0.138 |
| loggregator_trafficcontroller_z1/0 | running | small_z1      | 10.244.0.146 |
| nats_z1/0                          | running | medium_z1     | 10.244.0.6   |
| postgres_z1/0                      | running | medium_z1     | 10.244.0.30  |
| router_z1/0                        | running | router_z1     | 10.244.0.22  |
| runner_z1/0                        | running | runner_z1     | 10.244.0.26  |
| uaa_z1/0                           | running | medium_z1     | 10.244.0.130 |
+------------------------------------+---------+---------------+--------------+

VMs total: 12
```

Interestingly, after the first deploy, three of the four gateways actually started (only mongodb_gateway failed). Thinking it was a temporary glitch, I deployed again, and after that all four gateways failed to start. I have deleted the cf-services-contrib deployment and redeployed several times, but the gateways now always fail to start.

Looking at the debug log from the deploy task, I only see that the gateways are not running after the update:

```
E, [2015-12-30 11:04:29 #21405] [canary_update(rabbit_gateway/f854d16a-965a-4aca-a52b-ba370427de77 (0))] ERROR -- DirectorJobRunner: Error updating canary instance: #<Bosh::Director::AgentJobNotRunning: `rabbit_gateway/0' is not running after update>
...
E, [2015-12-30 11:04:31 #21405] [canary_update(mongodb_gateway/c65ae9a9-a039-47b1-abef-b76f2ebc5c82 (0))] ERROR -- DirectorJobRunner: Error updating canary instance: #<Bosh::Director::AgentJobNotRunning: `mongodb_gateway/0' is not running after update>
...
```

I logged in to mongodb_gateway and found this in `/var/vcap/sys/log/mongodb_gateway.log`:

```
Exiting due to NATS error: Could not connect to server on nats://nats:nats@10.244.0.6:4222
```
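To rule out a stale rendered config, it's also worth grepping the job's config directory for the NATS endpoint; the exact file names under `/var/vcap/jobs/mongodb_gateway/config/` are release-specific, so a recursive grep is the safest bet:

```
$ sudo grep -r '10.244.0.6' /var/vcap/jobs/mongodb_gateway/config/
```

This should show the same URL the gateway log complains about; if it doesn't, the deploy left an old config behind.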
I can ping nats from the mongodb_gateway:

```
$ sudo ping 10.244.0.6
PING 10.244.0.6 (10.244.0.6) 56(84) bytes of data.
64 bytes from 10.244.0.6: icmp_seq=1 ttl=63 time=0.149 ms
64 bytes from 10.244.0.6: icmp_seq=2 ttl=63 time=0.059 ms
...
```

I also seem to be able to connect to port 4222:

```
$ nc 10.244.0.6 4222
INFO {"server_id":"d6297ffe9307eead6bbe02005deb47aa","version":"0.5.6","host":"10.244.0.6","port":4222,"auth_required":true,"ssl_required":false,"max_payload":1048576}
```

Looking in the deployment manifest `tmp/contrib-services-warden-manifest.yml`, I see:

```
nats:
  address: 10.244.0.6
  authorization_timeout: 5
  password: nats
  port: "4222"
  user: nats
```

That matches the credentials in the mongodb_gateway's failed connection attempt above. Anything else I can provide?

The cf-services-contrib-release README says: "NOTE: The currently supported BOSH Lite stemcell for cf-services-contrib-release is version 388 which can be found [here](https://s3.amazonaws.com/bosh-jenkins-artifacts/bosh-stemcell/warden/bosh-stemcell-388-warden-boshlite-ubuntu-trusty-go_agent.tgz)." I'm not sure what I'm supposed to do with that information.
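Since the INFO banner is sent before authentication, it doesn't actually prove the credentials are accepted. One way to verify them by hand is to complete the NATS handshake in the same `nc` session (a sketch: the `CONNECT` and `PING` lines are typed input; `+OK` and `PONG` are the server's replies):

```
$ nc 10.244.0.6 4222
INFO {"server_id":"d6297ffe9307eead6bbe02005deb47aa","version":"0.5.6","host":"10.244.0.6","port":4222,"auth_required":true,"ssl_required":false,"max_payload":1048576}
CONNECT {"verbose":true,"pedantic":false,"user":"nats","pass":"nats"}
+OK
PING
PONG
```

A reply of `-ERR 'Authorization Violation'` instead of `+OK` would mean the manifest credentials don't match the running NATS server. As for the stemcell note, `bosh stemcells` lists the stemcell versions the director has, which can be compared against the version 388 the README calls out.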
James Hunt <james@...>
> On Dec 30, 2015, at 7:15 AM, Ulrik Sandberg <ulrik.sandberg@jayway.com> wrote:
>
> [snip]
>
> I logged in to mongodb_gateway and found this in `/var/vcap/sys/log/mongodb_gateway.log`:

When reviewing log files, take note of timestamps. I've seen similar failures occur before the deployment stabilizes (i.e., the client tries connecting before the server has finished provisioning / installing / starting).

> [snip]
>
> Anything else I can provide?

Can you `bosh ssh` into a failing gateway, sudo to root, and run `monit summary`? That should confirm that mongodb_gateway is what's failing. If it is, try `monit restart all` and then watch the logs (again, taking note of timestamps).

-- jrh
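P.S. For concreteness, that sequence would look roughly like the following (output elided; the exact process names monit reports depend on the job specs in the release):

```
$ bosh ssh mongodb_gateway 0
$ sudo -i
# monit summary
# monit restart all
# tail -f /var/vcap/sys/log/mongodb_gateway.log
```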