Diego docker app launch issue with Diego's v0.1443.0


Anuj Jain <anuj17280@...>
 

Hi,

I deployed the latest CF v226 with Diego v0.1443.0 - I was able to
successfully upgrade both deployments and verified that CF is working as
expected. currently seeing problem with Diego while trying to deploy any
docker app - I am getting *'Server error, status code: 500, error code:
170016, message: Runner error: stop app failed: 503' *- below you can see
the CF_TRACE output of last few lines.

I also notice that while trying to upgrade diego v0.1443.0 - it gave me
the error while trying to upgrade database job - the fix which I applied
(changed debug2 to debug from diego manifest file - path: properties =>
consul => log_level: debug)


RESPONSE: [2015-12-10T09:35:07-05:00]
HTTP/1.1 500 Internal Server Error
Content-Length: 110
Content-Type: application/json;charset=utf-8
Date: Thu, 10 Dec 2015 14:35:07 GMT
Server: nginx
X-Cf-Requestid: 8328f518-4847-41ec-5836-507d4bb054bb
X-Content-Type-Options: nosniff
X-Vcap-Request-Id:
324d0fc0-2146-48f0-6265-755efb556e23::5c869046-8803-4dac-a620-8ca701f5bd22

{
"code": 170016,
"description": "Runner error: stop app failed: 503",
"error_code": "CF-RunnerError"
}

FAILED
Server error, status code: 500, error code: 170016, message: Runner error:
stop app failed: 503
FAILED
Server error, status code: 500, error code: 170016, message: Runner error:
stop app failed: 503
FAILED
Error: Error executing cli core command
Starting app testing89 in org PAAS / space dev as admin...

FAILED

Server error, status code: 500, error code: 170016, message: Runner error:
stop app failed: 503


- Anuj


Eric Malm <emalm@...>
 

Hi, Anuj,

Thanks for the info, and sorry to hear you've run into some difficulties.
It sounds like Cloud Controller is getting a 503 error from the
nsync-listener service on the CC-Bridge. That most likely means it's
encountering some sort of error in communicating with Diego's BBS API. You
mentioned that you had some problems with the database jobs when upgrading
as well. Does BOSH now report that all the VMs in the Diego deployment are
running correctly?

One next step to try would be to tail logs from the nsync-listener
processes on the CC-Bridge VMs with `tail -f
/var/vcap/sys/log/nsync/nsync_listener.stdout.log`, and from the BBS
processes on the database VMs with `tail -f
/var/vcap/sys/log/bbs/bbs.stdout.log`, then try restarting your app that
targets Diego, and see if there are any errors in the logs. It may also
help to filter the logs to contain only the ones with your app guid, which
you can get from the CF CLI via `cf app APP_NAME --guid`.

Also, are you able to run a buildpack-based app on the Diego backend, or do
you get the same error as with this Docker-based app?

Best,
Eric

On Thu, Dec 10, 2015 at 6:45 AM, Anuj Jain <anuj17280(a)gmail.com> wrote:

Hi,

I deployed the latest CF v226 with Diego v0.1443.0 - I was able to
successfully upgrade both deployments and verified that CF is working as
expected. currently seeing problem with Diego while trying to deploy any
docker app - I am getting *'Server error, status code: 500, error code:
170016, message: Runner error: stop app failed: 503' *- below you can see
the CF_TRACE output of last few lines.

I also notice that while trying to upgrade diego v0.1443.0 - it gave me
the error while trying to upgrade database job - the fix which I applied
(changed debug2 to debug from diego manifest file - path: properties =>
consul => log_level: debug)


RESPONSE: [2015-12-10T09:35:07-05:00]
HTTP/1.1 500 Internal Server Error
Content-Length: 110
Content-Type: application/json;charset=utf-8
Date: Thu, 10 Dec 2015 14:35:07 GMT
Server: nginx
X-Cf-Requestid: 8328f518-4847-41ec-5836-507d4bb054bb
X-Content-Type-Options: nosniff
X-Vcap-Request-Id:
324d0fc0-2146-48f0-6265-755efb556e23::5c869046-8803-4dac-a620-8ca701f5bd22

{
"code": 170016,
"description": "Runner error: stop app failed: 503",
"error_code": "CF-RunnerError"
}

FAILED
Server error, status code: 500, error code: 170016, message: Runner error:
stop app failed: 503
FAILED
Server error, status code: 500, error code: 170016, message: Runner error:
stop app failed: 503
FAILED
Error: Error executing cli core command
Starting app testing89 in org PAAS / space dev as admin...

FAILED

Server error, status code: 500, error code: 170016, message: Runner error:
stop app failed: 503


- Anuj


Anuj Jain <anuj17280@...>
 

Hi Eric – Thanks for trying to help me to resolve my issues – please check
comments inline:

On Mon, Dec 14, 2015 at 12:02 AM, Eric Malm <emalm(a)pivotal.io> wrote:

Hi, Anuj,

Thanks for the info, and sorry to hear you've run into some difficulties.
It sounds like Cloud Controller is getting a 503 error from the
nsync-listener service on the CC-Bridge. That most likely means it's
encountering some sort of error in communicating with Diego's BBS API. You
mentioned that you had some problems with the database jobs when upgrading
as well. Does BOSH now report that all the VMs in the Diego deployment are
running correctly?

=> All VMs under (CF, Diego and Diego docker cache) showing running
=> Database issue was in manifest, which got resolved once I update/changed
the log_level value from debug2 to debug


One next step to try would be to tail logs from the nsync-listener
processes on the CC-Bridge VMs with `tail -f
/var/vcap/sys/log/nsync/nsync_listener.stdout.log`, and from the BBS
processes on the database VMs with `tail -f
/var/vcap/sys/log/bbs/bbs.stdout.log`, then try restarting your app that
targets Diego, and see if there are any errors in the logs. It may also
help to filter the logs to contain only the ones with your app guid, which
you can get from the CF CLI via `cf app APP_NAME --guid`.

=> It could not able to see any logs for my app on CC-Bridge - checked
stager and/or Nsync job logs
=> I checked CC host and run 'netstat -anp | grep consul' couple of time
and found that sometime it showing established connection with one consul
server and sometime not - below is the sample output:

# netstat -anp | grep consul
tcp 0 0 10.5.139.156:8301 0.0.0.0:*
LISTEN 4559/consul
tcp 0 0 127.0.0.1:8400 0.0.0.0:*
LISTEN 4559/consul
tcp 0 0 127.0.0.1:8500 0.0.0.0:*
LISTEN 4559/consul
tcp 0 0 127.0.0.1:53 0.0.0.0:*
LISTEN 4559/consul
udp 0 0 127.0.0.1:53 0.0.0.0:*
4559/consul
udp 0 0 10.5.139.156:8301 0.0.0.0:*
4559/consul
root(a)07c40eae-6a8c-4fb1-996d-e638637b5caa:/var/vcap/bosh_ssh/bosh_eerpetbrz#
netstat -anp | grep consul
tcp 0 0 10.5.139.156:8301 0.0.0.0:*
LISTEN 4559/consul
tcp 0 0 127.0.0.1:8400 0.0.0.0:*
LISTEN 4559/consul
tcp 0 0 127.0.0.1:8500 0.0.0.0:*
LISTEN 4559/consul
tcp 0 0 127.0.0.1:53 0.0.0.0:*
LISTEN 4559/consul
tcp 0 0 10.5.139.156:60316 10.5.139.140:8300
ESTABLISHED 4559/consul
udp 0 0 127.0.0.1:53 0.0.0.0:*
4559/consul
udp 0 0 10.5.139.156:8301 0.0.0.0:*
4559/consul

=> After that I also checked/verify consul agent logs on CC which showing
EventmemberFailed and EventmemberJoin messages

========================================================================================
logs:
2015/12/14 09:32:43 [INFO] serf: EventMemberFailed: api-z1-0
10.5.139.156
2015/12/14 09:32:44 [INFO] serf: EventMemberJoin: docker-cache-0
10.5.139.252
2015/12/14 09:32:44 [INFO] serf: EventMemberJoin: api-z1-0 10.5.139.156
2015/12/14 09:32:49 [INFO] serf: EventMemberJoin: ha-proxy-z1-0
10.5.103.103
2015/12/14 09:33:32 [INFO] memberlist: Suspect ha-proxy-z1-1 has
failed, no acks received
2015/12/14 09:33:40 [INFO] serf: EventMemberFailed: ha-proxy-z1-1
10.5.103.104
2015/12/14 09:33:40 [INFO] serf: EventMemberFailed: database-z1-2
10.5.139.194
2015/12/14 09:33:40 [INFO] memberlist: Marking ha-proxy-z1-0 as failed,
suspect timeout reached
2015/12/14 09:33:40 [INFO] serf: EventMemberFailed: ha-proxy-z1-0
10.5.103.103
2015/12/14 09:33:41 [INFO] serf: EventMemberJoin: database-z1-2
10.5.139.194
2015/12/14 09:33:42 [INFO] serf: EventMemberFailed: cell-z1-3
10.5.139.199
2015/12/14 09:33:43 [INFO] serf: EventMemberJoin: cell-z1-3 10.5.139.199
2015/12/14 09:33:43 [INFO] serf: EventMemberFailed: api-worker-z1-0
10.5.139.159
2015/12/14 09:33:44 [INFO] serf: EventMemberJoin: ha-proxy-z1-1
10.5.103.104
2015/12/14 09:33:44 [INFO] memberlist: Marking uaa-z1-1 as failed,
suspect timeout reached
2015/12/14 09:33:44 [INFO] serf: EventMemberFailed: uaa-z1-1
10.5.139.155
2015/12/14 09:33:46 [INFO] serf: EventMemberFailed: cell-z1-1
10.5.139.197
2015/12/14 09:33:46 [INFO] serf: EventMemberJoin: uaa-z1-1 10.5.139.155
2015/12/14 09:33:47 [INFO] memberlist: Marking cc-bridge-z1-0 as
failed, suspect timeout reached
2015/12/14 09:33:47 [INFO] serf: EventMemberFailed: cc-bridge-z1-0
10.5.139.200
2015/12/14 09:33:49 [INFO] serf: EventMemberFailed: database-z1-1
10.5.139.193
2015/12/14 09:33:58 [INFO] serf: EventMemberJoin: api-worker-z1-0
10.5.139.159
2015/12/14 09:33:58 [INFO] serf: EventMemberJoin: database-z1-1
10.5.139.193
2015/12/14 09:33:59 [INFO] serf: EventMemberJoin: cell-z1-1 10.5.139.197
2015/12/14 09:33:59 [INFO] serf: EventMemberJoin: cc-bridge-z1-0
10.5.139.200
2015/12/14 09:34:01 [INFO] serf: EventMemberFailed: database-z1-1
10.5.139.193
2015/12/14 09:34:07 [INFO] serf: EventMemberJoin: database-z1-1
10.5.139.193
2015/12/14 09:34:09 [INFO] memberlist: Marking cell-z1-1 as failed,
suspect timeout reached
2015/12/14 09:34:09 [INFO] serf: EventMemberFailed: cell-z1-1
10.5.139.197
2015/12/14 09:34:20 [INFO] serf: EventMemberJoin: ha-proxy-z1-0
10.5.103.103
2015/12/14 09:34:28 [INFO] serf: EventMemberJoin: cell-z1-1 10.5.139.197
2015/12/14 09:34:38 [INFO] serf: EventMemberFailed: ha-proxy-z1-0
10.5.103.103
2015/12/14 09:34:42 [INFO] serf: EventMemberFailed: ha-proxy-z1-1
10.5.103.104
2015/12/14 09:34:44 [INFO] memberlist: Marking api-z1-0 as failed,
suspect timeout reached
2015/12/14 09:34:44 [INFO] serf: EventMemberFailed: api-z1-0
10.5.139.156
2015/12/14 09:34:48 [INFO] serf: EventMemberJoin: ha-proxy-z1-0
10.5.103.103
2015/12/14 09:34:48 [INFO] memberlist: Marking api-worker-z1-0 as
failed, suspect timeout reached
2015/12/14 09:34:48 [INFO] serf: EventMemberFailed: api-worker-z1-0
10.5.139.159
2015/12/14 09:34:49 [INFO] serf: EventMemberJoin: api-z1-0 10.5.139.156
2015/12/14 09:34:52 [INFO] serf: EventMemberJoin: ha-proxy-z1-1
10.5.103.104
2015/12/14 09:34:58 [INFO] serf: EventMemberJoin: api-worker-z1-0
10.5.139.159
================================================================================



Also, are you able to run a buildpack-based app on the Diego backend, or
do you get the same error as with this Docker-based app?

=> No, I am also not able to run buildpack-based app on Diego backend -
verified that by enable-diego on one of the app and then tried starting it
- got same 500 Error.


Best,
Eric

On Thu, Dec 10, 2015 at 6:45 AM, Anuj Jain <anuj17280(a)gmail.com> wrote:

Hi,

I deployed the latest CF v226 with Diego v0.1443.0 - I was able to
successfully upgrade both deployments and verified that CF is working as
expected. currently seeing problem with Diego while trying to deploy any
docker app - I am getting *'Server error, status code: 500, error code:
170016, message: Runner error: stop app failed: 503' *- below you can
see the CF_TRACE output of last few lines.

I also notice that while trying to upgrade diego v0.1443.0 - it gave
me the error while trying to upgrade database job - the fix which I applied
(changed debug2 to debug from diego manifest file - path: properties =>
consul => log_level: debug)


RESPONSE: [2015-12-10T09:35:07-05:00]
HTTP/1.1 500 Internal Server Error
Content-Length: 110
Content-Type: application/json;charset=utf-8
Date: Thu, 10 Dec 2015 14:35:07 GMT
Server: nginx
X-Cf-Requestid: 8328f518-4847-41ec-5836-507d4bb054bb
X-Content-Type-Options: nosniff
X-Vcap-Request-Id:
324d0fc0-2146-48f0-6265-755efb556e23::5c869046-8803-4dac-a620-8ca701f5bd22

{
"code": 170016,
"description": "Runner error: stop app failed: 503",
"error_code": "CF-RunnerError"
}

FAILED
Server error, status code: 500, error code: 170016, message: Runner
error: stop app failed: 503
FAILED
Server error, status code: 500, error code: 170016, message: Runner
error: stop app failed: 503
FAILED
Error: Error executing cli core command
Starting app testing89 in org PAAS / space dev as admin...

FAILED

Server error, status code: 500, error code: 170016, message: Runner
error: stop app failed: 503


- Anuj


Anuj Jain <anuj17280@...>
 

Hi Eric - please ignore last message - I found the issue - I installed
diego with collocated VM as well as other VMs - which make CC confuse and
collocated VM was not having all the functionality/jobs - now I am not
getting 500 message and can run buildpack apps post diego enabled.

On Mon, Dec 14, 2015 at 5:23 PM, Anuj Jain <anuj17280(a)gmail.com> wrote:

Hi Eric – Thanks for trying to help me to resolve my issues – please check
comments inline:

On Mon, Dec 14, 2015 at 12:02 AM, Eric Malm <emalm(a)pivotal.io> wrote:

Hi, Anuj,

Thanks for the info, and sorry to hear you've run into some difficulties.
It sounds like Cloud Controller is getting a 503 error from the
nsync-listener service on the CC-Bridge. That most likely means it's
encountering some sort of error in communicating with Diego's BBS API. You
mentioned that you had some problems with the database jobs when upgrading
as well. Does BOSH now report that all the VMs in the Diego deployment are
running correctly?

=> All VMs under (CF, Diego and Diego docker cache) showing running
=> Database issue was in manifest, which got resolved once I
update/changed the log_level value from debug2 to debug


One next step to try would be to tail logs from the nsync-listener
processes on the CC-Bridge VMs with `tail -f
/var/vcap/sys/log/nsync/nsync_listener.stdout.log`, and from the BBS
processes on the database VMs with `tail -f
/var/vcap/sys/log/bbs/bbs.stdout.log`, then try restarting your app that
targets Diego, and see if there are any errors in the logs. It may also
help to filter the logs to contain only the ones with your app guid, which
you can get from the CF CLI via `cf app APP_NAME --guid`.

=> It could not able to see any logs for my app on CC-Bridge - checked
stager and/or Nsync job logs
=> I checked CC host and run 'netstat -anp | grep consul' couple of time
and found that sometime it showing established connection with one consul
server and sometime not - below is the sample output:

# netstat -anp | grep consul
tcp 0 0 10.5.139.156:8301 0.0.0.0:*
LISTEN 4559/consul
tcp 0 0 127.0.0.1:8400 0.0.0.0:*
LISTEN 4559/consul
tcp 0 0 127.0.0.1:8500 0.0.0.0:*
LISTEN 4559/consul
tcp 0 0 127.0.0.1:53 0.0.0.0:*
LISTEN 4559/consul
udp 0 0 127.0.0.1:53 0.0.0.0:*
4559/consul
udp 0 0 10.5.139.156:8301 0.0.0.0:*
4559/consul
root(a)07c40eae-6a8c-4fb1-996d-e638637b5caa:/var/vcap/bosh_ssh/bosh_eerpetbrz#
netstat -anp | grep consul
tcp 0 0 10.5.139.156:8301 0.0.0.0:*
LISTEN 4559/consul
tcp 0 0 127.0.0.1:8400 0.0.0.0:*
LISTEN 4559/consul
tcp 0 0 127.0.0.1:8500 0.0.0.0:*
LISTEN 4559/consul
tcp 0 0 127.0.0.1:53 0.0.0.0:*
LISTEN 4559/consul
tcp 0 0 10.5.139.156:60316 10.5.139.140:8300
ESTABLISHED 4559/consul
udp 0 0 127.0.0.1:53 0.0.0.0:*
4559/consul
udp 0 0 10.5.139.156:8301 0.0.0.0:*
4559/consul

=> After that I also checked/verify consul agent logs on CC which showing
EventmemberFailed and EventmemberJoin messages


========================================================================================
logs:
2015/12/14 09:32:43 [INFO] serf: EventMemberFailed: api-z1-0
10.5.139.156
2015/12/14 09:32:44 [INFO] serf: EventMemberJoin: docker-cache-0
10.5.139.252
2015/12/14 09:32:44 [INFO] serf: EventMemberJoin: api-z1-0 10.5.139.156
2015/12/14 09:32:49 [INFO] serf: EventMemberJoin: ha-proxy-z1-0
10.5.103.103
2015/12/14 09:33:32 [INFO] memberlist: Suspect ha-proxy-z1-1 has
failed, no acks received
2015/12/14 09:33:40 [INFO] serf: EventMemberFailed: ha-proxy-z1-1
10.5.103.104
2015/12/14 09:33:40 [INFO] serf: EventMemberFailed: database-z1-2
10.5.139.194
2015/12/14 09:33:40 [INFO] memberlist: Marking ha-proxy-z1-0 as
failed, suspect timeout reached
2015/12/14 09:33:40 [INFO] serf: EventMemberFailed: ha-proxy-z1-0
10.5.103.103
2015/12/14 09:33:41 [INFO] serf: EventMemberJoin: database-z1-2
10.5.139.194
2015/12/14 09:33:42 [INFO] serf: EventMemberFailed: cell-z1-3
10.5.139.199
2015/12/14 09:33:43 [INFO] serf: EventMemberJoin: cell-z1-3
10.5.139.199
2015/12/14 09:33:43 [INFO] serf: EventMemberFailed: api-worker-z1-0
10.5.139.159
2015/12/14 09:33:44 [INFO] serf: EventMemberJoin: ha-proxy-z1-1
10.5.103.104
2015/12/14 09:33:44 [INFO] memberlist: Marking uaa-z1-1 as failed,
suspect timeout reached
2015/12/14 09:33:44 [INFO] serf: EventMemberFailed: uaa-z1-1
10.5.139.155
2015/12/14 09:33:46 [INFO] serf: EventMemberFailed: cell-z1-1
10.5.139.197
2015/12/14 09:33:46 [INFO] serf: EventMemberJoin: uaa-z1-1 10.5.139.155
2015/12/14 09:33:47 [INFO] memberlist: Marking cc-bridge-z1-0 as
failed, suspect timeout reached
2015/12/14 09:33:47 [INFO] serf: EventMemberFailed: cc-bridge-z1-0
10.5.139.200
2015/12/14 09:33:49 [INFO] serf: EventMemberFailed: database-z1-1
10.5.139.193
2015/12/14 09:33:58 [INFO] serf: EventMemberJoin: api-worker-z1-0
10.5.139.159
2015/12/14 09:33:58 [INFO] serf: EventMemberJoin: database-z1-1
10.5.139.193
2015/12/14 09:33:59 [INFO] serf: EventMemberJoin: cell-z1-1
10.5.139.197
2015/12/14 09:33:59 [INFO] serf: EventMemberJoin: cc-bridge-z1-0
10.5.139.200
2015/12/14 09:34:01 [INFO] serf: EventMemberFailed: database-z1-1
10.5.139.193
2015/12/14 09:34:07 [INFO] serf: EventMemberJoin: database-z1-1
10.5.139.193
2015/12/14 09:34:09 [INFO] memberlist: Marking cell-z1-1 as failed,
suspect timeout reached
2015/12/14 09:34:09 [INFO] serf: EventMemberFailed: cell-z1-1
10.5.139.197
2015/12/14 09:34:20 [INFO] serf: EventMemberJoin: ha-proxy-z1-0
10.5.103.103
2015/12/14 09:34:28 [INFO] serf: EventMemberJoin: cell-z1-1
10.5.139.197
2015/12/14 09:34:38 [INFO] serf: EventMemberFailed: ha-proxy-z1-0
10.5.103.103
2015/12/14 09:34:42 [INFO] serf: EventMemberFailed: ha-proxy-z1-1
10.5.103.104
2015/12/14 09:34:44 [INFO] memberlist: Marking api-z1-0 as failed,
suspect timeout reached
2015/12/14 09:34:44 [INFO] serf: EventMemberFailed: api-z1-0
10.5.139.156
2015/12/14 09:34:48 [INFO] serf: EventMemberJoin: ha-proxy-z1-0
10.5.103.103
2015/12/14 09:34:48 [INFO] memberlist: Marking api-worker-z1-0 as
failed, suspect timeout reached
2015/12/14 09:34:48 [INFO] serf: EventMemberFailed: api-worker-z1-0
10.5.139.159
2015/12/14 09:34:49 [INFO] serf: EventMemberJoin: api-z1-0 10.5.139.156
2015/12/14 09:34:52 [INFO] serf: EventMemberJoin: ha-proxy-z1-1
10.5.103.104
2015/12/14 09:34:58 [INFO] serf: EventMemberJoin: api-worker-z1-0
10.5.139.159

================================================================================



Also, are you able to run a buildpack-based app on the Diego backend, or
do you get the same error as with this Docker-based app?

=> No, I am also not able to run buildpack-based app on Diego backend -
verified that by enable-diego on one of the app and then tried starting it
- got same 500 Error.


Best,
Eric

On Thu, Dec 10, 2015 at 6:45 AM, Anuj Jain <anuj17280(a)gmail.com> wrote:

Hi,

I deployed the latest CF v226 with Diego v0.1443.0 - I was able to
successfully upgrade both deployments and verified that CF is working as
expected. currently seeing problem with Diego while trying to deploy any
docker app - I am getting *'Server error, status code: 500, error code:
170016, message: Runner error: stop app failed: 503' *- below you can
see the CF_TRACE output of last few lines.

I also notice that while trying to upgrade diego v0.1443.0 - it gave
me the error while trying to upgrade database job - the fix which I applied
(changed debug2 to debug from diego manifest file - path: properties =>
consul => log_level: debug)


RESPONSE: [2015-12-10T09:35:07-05:00]
HTTP/1.1 500 Internal Server Error
Content-Length: 110
Content-Type: application/json;charset=utf-8
Date: Thu, 10 Dec 2015 14:35:07 GMT
Server: nginx
X-Cf-Requestid: 8328f518-4847-41ec-5836-507d4bb054bb
X-Content-Type-Options: nosniff
X-Vcap-Request-Id:
324d0fc0-2146-48f0-6265-755efb556e23::5c869046-8803-4dac-a620-8ca701f5bd22

{
"code": 170016,
"description": "Runner error: stop app failed: 503",
"error_code": "CF-RunnerError"
}

FAILED
Server error, status code: 500, error code: 170016, message: Runner
error: stop app failed: 503
FAILED
Server error, status code: 500, error code: 170016, message: Runner
error: stop app failed: 503
FAILED
Error: Error executing cli core command
Starting app testing89 in org PAAS / space dev as admin...

FAILED

Server error, status code: 500, error code: 170016, message: Runner
error: stop app failed: 503


- Anuj