Gateways fail to start


Ulrik Sandberg
 

I am trying to deploy the community cf-services-contrib version 6 on a local Vagrant-based bosh-lite (that otherwise works fine), following the instructions on https://github.com/cloudfoundry-community/cf-services-contrib-release, but all gateways fail to start. The nodes are running OK.

```
$ bosh vms
Acting as user 'admin' on 'Bosh Lite Director'
Deployment `cf-services-contrib'

Director task 68

Task 68 done

+----------------------+---------+---------------+-------------+
| Job/index            | State   | Resource Pool | IPs         |
+----------------------+---------+---------------+-------------+
| mongodb_gateway/0    | failing | gateway_z1    | 10.244.1.2  |
| mongodb_node/0       | running | node_z1       | 10.244.1.82 |
| postgresql_gateway/0 | failing | gateway_z1    | 10.244.1.10 |
| postgresql_node/0    | running | node_z1       | 10.244.1.90 |
| rabbit_gateway/0     | failing | gateway_z1    | 10.244.1.6  |
| rabbit_node/0        | running | node_z1       | 10.244.1.86 |
| redis_gateway/0      | failing | gateway_z1    | 10.244.1.14 |
| redis_node/0         | running | node_z1       | 10.244.1.94 |
+----------------------+---------+---------------+-------------+

VMs total: 8
Deployment `cf-warden'

Director task 69

Task 69 done

+------------------------------------+---------+---------------+--------------+
| Job/index                          | State   | Resource Pool | IPs          |
+------------------------------------+---------+---------------+--------------+
| api_z1/0                           | running | large_z1      | 10.244.0.134 |
| consul_z1/0                        | running | small_z1      | 10.244.0.54  |
| doppler_z1/0                       | running | medium_z1     | 10.244.0.142 |
| etcd_z1/0                          | running | medium_z1     | 10.244.0.42  |
| ha_proxy_z1/0                      | running | router_z1     | 10.244.0.34  |
| hm9000_z1/0                        | running | medium_z1     | 10.244.0.138 |
| loggregator_trafficcontroller_z1/0 | running | small_z1      | 10.244.0.146 |
| nats_z1/0                          | running | medium_z1     | 10.244.0.6   |
| postgres_z1/0                      | running | medium_z1     | 10.244.0.30  |
| router_z1/0                        | running | router_z1     | 10.244.0.22  |
| runner_z1/0                        | running | runner_z1     | 10.244.0.26  |
| uaa_z1/0                           | running | medium_z1     | 10.244.0.130 |
+------------------------------------+---------+---------------+--------------+

VMs total: 12
```

Interestingly, after the first deploy, three of the four gateways actually started (only mongodb_gateway failed). Thinking it was just a temporary glitch, I deployed again, and after that all four gateways failed to start. I have since deleted the cf-services-contrib deployment and redeployed several times, but the gateways now always fail to start.

Looking at the debug log from the deploy task, I only see that no gateway is started:

```
E, [2015-12-30 11:04:29 #21405] [canary_update(rabbit_gateway/f854d16a-965a-4aca-a52b-ba370427de77 (0))] ERROR -- DirectorJobRunner: Error updating canary instance: #<Bosh::Director::AgentJobNotRunning: `rabbit_gateway/0' is not running after update>
...
E, [2015-12-30 11:04:31 #21405] [canary_update(mongodb_gateway/c65ae9a9-a039-47b1-abef-b76f2ebc5c82 (0))] ERROR -- DirectorJobRunner: Error updating canary instance: #<Bosh::Director::AgentJobNotRunning: `mongodb_gateway/0' is not running after update>
...
```

I logged in to mongodb_gateway and found this in the `/var/vcap/sys/log/mongodb_gateway.log`:

```
Exiting due to NATS error: Could not connect to server on nats://nats:nats@10.244.0.6:4222
```

I can ping nats from the mongodb_gateway:

```
$ sudo ping 10.244.0.6
PING 10.244.0.6 (10.244.0.6) 56(84) bytes of data.
64 bytes from 10.244.0.6: icmp_seq=1 ttl=63 time=0.149 ms
64 bytes from 10.244.0.6: icmp_seq=2 ttl=63 time=0.059 ms
...
```

I also seem to be able to connect to port 4222:

```
$ nc 10.244.0.6 4222
INFO {"server_id":"d6297ffe9307eead6bbe02005deb47aa","version":"0.5.6","host":"10.244.0.6","port":4222,"auth_required":true,"ssl_required":false,"max_payload":1048576}

```

Looking in the deployment file `tmp/contrib-services-warden-manifest.yml`, I see:

```
nats:
  address: 10.244.0.6
  authorization_timeout: 5
  password: nats
  port: "4222"
  user: nats

That seems to match the credentials in the attempt to connect from the mongodb_gateway above.
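
The `nc` test above only shows that the TCP port is open and that the server advertises `"auth_required":true`; it does not exercise the credentials. A rough way to check them by hand is to send the NATS text-protocol `CONNECT` and `PING` frames and look at the reply. The sketch below is only an illustration under assumptions: the function names `nats_handshake` and `nats_check` are made up for this example, and the exact `nc` behaviour (the `-w` timeout, when it closes the socket) varies between netcat variants.

```shell
# Print the two frames a NATS client sends after it has read the server's INFO line.
# Arguments: user password
nats_handshake() {
  local user="$1" pass="$2"
  printf 'CONNECT {"verbose":false,"pedantic":false,"user":"%s","pass":"%s"}\r\n' "$user" "$pass"
  printf 'PING\r\n'
}

# Send the handshake to a NATS server and print whatever it answers.
# Arguments: host port user password
# The short sleep keeps stdin open so nc has time to read the reply.
nats_check() {
  local host="$1" port="$2" user="$3" pass="$4"
  { nats_handshake "$user" "$pass"; sleep 1; } | nc -w 3 "$host" "$port"
}
```

Run from the gateway VM as `nats_check 10.244.0.6 4222 nats nats`. With accepted credentials the server should answer the `PING` with `PONG` after its `INFO` line; rejected credentials should show up as an `-ERR` line instead.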

Anything else I can provide?

The cf-services-contrib-release README says: "NOTE: The currently supported BOSH Lite stemcell for cf-services-contrib-release is version 388 which can be found [here](https://s3.amazonaws.com/bosh-jenkins-artifacts/bosh-stemcell/warden/bosh-stemcell-388-warden-boshlite-ubuntu-trusty-go_agent.tgz)." Not sure what I'm supposed to do with that information.


James Hunt <james@...>
 

On Dec 30, 2015, at 7:15 AM, Ulrik Sandberg <ulrik.sandberg(a)jayway.com> wrote:

I try to deploy the community cf-services-contrib version 6 on a local Vagrant-based bosh-lite (that otherwise works fine), following the instructions on https://github.com/cloudfoundry-community/cf-services-contrib-release, but all gateways fail to start. The nodes are running OK.
[snip]

I logged in to mongodb_gateway and found this in the `/var/vcap/sys/log/mongodb_gateway.log`:

```
Exiting due to NATS error: Could not connect to server on nats://nats:nats@10.244.0.6:4222
```
When reviewing log files, take note of timestamps. I've seen similar failures occur before the deployment stabilizes (i.e., the client tries connecting before the server is finished provisioning / installing / starting).
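
To see whether this is a timing issue, it can help to probe the NATS port repeatedly and log a timestamp for each attempt, then compare those timestamps with the ones in the gateway log. The following is a minimal sketch, not part of BOSH or the release: `wait_for_port` is a hypothetical helper, and it assumes `bash` and the coreutils `timeout` command are available on the VM.

```shell
# Retry a TCP connect until it succeeds or the attempts run out,
# printing a UTC timestamp for every attempt.
# Arguments: host port [attempts] [delay_seconds]
wait_for_port() {
  local host="$1" port="$2" attempts="${3:-10}" delay="${4:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # bash's /dev/tcp pseudo-device opens a TCP connection;
    # timeout bounds a connect that would otherwise hang.
    if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
      echo "$(date -u +%FT%TZ) $host:$port reachable (attempt $i)"
      return 0
    fi
    echo "$(date -u +%FT%TZ) $host:$port not reachable (attempt $i)"
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_port 10.244.0.6 4222 30 1` would probe the NATS address from this thread once per second for up to 30 attempts.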

[snip]

Anything else I can provide?
Can you `bosh ssh` into a failing gateway, sudo to root, and run `monit summary`?

That should confirm that mongodb_gateway is what's failing. If it is, try `monit restart all` and then watch the logs (again, taking note of timestamps).

--
jrh
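
The `monit summary` check suggested above can be narrowed to just the processes that are not running. This is a small sketch under assumptions: `monit_not_running` is a hypothetical helper, and it assumes each process is reported on a line of the form `Process 'name' <status>`, which may differ between monit versions.

```shell
# Read `monit summary` output on stdin and print every process
# whose status is anything other than "running".
monit_not_running() {
  awk '/^Process/ {
    name = $2
    status = $0
    sub(/^Process +[^ ]+ +/, "", status)   # drop the "Process name" prefix
    sub(/ +$/, "", status)                 # drop trailing spaces
    if (tolower(status) != "running") print name ": " status
  }'
}
```

As root on the failing gateway VM: `monit summary | monit_not_running`. Empty output would mean that monit considers every process running.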