api and api_worker jobs fail to update via bosh, but "monit start" works


Guillaume Berche

Hi,

I'm running into a weird situation where the api and api_worker jobs fail to
update through bosh and end up being reported as "not running". However, after
manually running "monit start cloud_controller_ng" (or rebooting the VM), the
faulty jobs start fine and the bosh deployment proceeds without errors. Looking
at the monit logs, there seems to be an extra monit stop request for cc_ng.
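
For reference, here is the manual workaround I apply on the failing VM (a
rough sketch of my recovery steps; the monit binary path is the one I see on
our stemcell and may differ elsewhere):

$ sudo /var/vcap/bosh/bin/monit summary                      # cloud_controller_ng shows as not running
$ sudo /var/vcap/bosh/bin/monit start cloud_controller_ng
$ sudo /var/vcap/bosh/bin/monit summary                      # now running; re-running "bosh deploy" then succeeds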

Below are detailed traces illustrating the issue.

$ bosh deploy

[..]
Started updating job ha_proxy_z1 > ha_proxy_z1/0 (canary). Done (00:00:39)
Started updating job api_z1 > api_z1/0 (canary). Failed: `api_z1/0' is not running after update (00:10:44)


When instructing bosh to update the job (in this case, only a config change),
we indeed see the bosh agent ask monit to stop the jobs, reload monit itself,
and start the jobs; then we see the extra stop (at 12:33:26) before the bosh
director times out and declares the canary failed.
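
(For context, the agent drives monit through its HTTP interface on
127.0.0.1:2822, as visible in the agent log further below. Something like the
following should issue the same start/stop requests by hand; this is an
untested sketch, and the basic-auth credentials file path and format are my
assumption about the stemcell:)

$ MONIT_CREDS=$(sudo cat /var/vcap/monit/monit.user)    # assumed to contain "user:password"
$ curl -s -u "$MONIT_CREDS" -d 'action=start' http://127.0.0.1:2822/cloud_controller_worker_1
$ curl -s -u "$MONIT_CREDS" -d 'action=stop'  http://127.0.0.1:2822/cloud_controller_worker_1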

$ less /var/vcap/monit/monit.log

[UTC May 22 12:33:17] info : Awakened by User defined signal 1
[UTC May 22 12:33:17] info : Awakened by the SIGHUP signal
[UTC May 22 12:33:17] info : Reinitializing monit - Control file '/var/vcap/bosh/etc/monitrc'
[UTC May 22 12:33:17] info : Shutting down monit HTTP server
[UTC May 22 12:33:18] info : monit HTTP server stopped
[UTC May 22 12:33:18] info : Starting monit HTTP server at [127.0.0.1:2822]
[UTC May 22 12:33:18] info : monit HTTP server started
[UTC May 22 12:33:18] info : 'system_897cdb8d-f9f7-4bfa-a748-512489b676e0' Monit reloaded
[UTC May 22 12:33:23] info : start service 'consul_agent' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : Awakened by User defined signal 1
[UTC May 22 12:33:23] info : 'consul_agent' start: /var/vcap/jobs/consul_agent/bin/agent_ctl
[UTC May 22 12:33:23] info : start service 'nfs_mounter' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : start service 'metron_agent' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : start service 'cloud_controller_worker_1' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:24] info : 'consul_agent' start action done
[UTC May 22 12:33:24] info : 'nfs_mounter' start: /var/vcap/jobs/nfs_mounter/bin/nfs_mounter_ctl
[UTC May 22 12:33:24] info : 'cloud_controller_worker_1' start: /var/vcap/jobs/cloud_controller_worker/bin/cloud_controller_worker_ctl
[UTC May 22 12:33:25] info : 'cloud_controller_worker_1' start action done
[UTC May 22 12:33:25] info : 'metron_agent' start: /var/vcap/jobs/metron_agent/bin/metron_agent_ctl
[UTC May 22 12:33:26] info : 'metron_agent' start action done
[UTC May 22 12:33:26] info : 'cloud_controller_worker_1' stop: /var/vcap/jobs/cloud_controller_worker/bin/cloud_controller_worker_ctl
[UTC May 22 12:33:27] info : 'nfs_mounter' start action done
[UTC May 22 12:33:27] info : Awakened by User defined signal 1

There are no associated traces of the bosh agent requesting this extra stop:

$ less /var/vcap/bosh/log/current
2015-05-22_12:33:23.73606 [monitJobSupervisor] 2015/05/22 12:33:23 DEBUG - Starting service cloud_controller_worker_1
2015-05-22_12:33:23.73608 [http-client] 2015/05/22 12:33:23 DEBUG - Monit request: url='http://127.0.0.1:2822/cloud_controller_worker_1' body='action=start'
2015-05-22_12:33:23.73608 [attemptRetryStrategy] 2015/05/22 12:33:23 DEBUG - Making attempt #0
2015-05-22_12:33:23.73609 [clientRetryable] 2015/05/22 12:33:23 DEBUG - [requestID=52ede4f0-427d-4e65-6da1-d3b5c4b5cafd] Requesting (attempt=1): Request{ Method: 'POST', URL: 'http://127.0.0.1:2822/cloud_controller_worker_1' }
2015-05-22_12:33:23.73647 [clientRetryable] 2015/05/22 12:33:23 DEBUG - [requestID=52ede4f0-427d-4e65-6da1-d3b5c4b5cafd] Request succeeded (attempts=1), response: Response{ StatusCode: 200, Status: '200 OK'}
2015-05-22_12:33:23.73648 [MBus Handler] 2015/05/22 12:33:23 INFO - Responding
2015-05-22_12:33:23.73650 [MBus Handler] 2015/05/22 12:33:23 DEBUG - Payload
2015-05-22_12:33:23.73650 ********************
2015-05-22_12:33:23.73651 {"value":"started"}
2015-05-22_12:33:23.73651 ********************
2015-05-22_12:33:36.69397 [NATS Handler] 2015/05/22 12:33:36 DEBUG - Message Payload
2015-05-22_12:33:36.69397 ********************
2015-05-22_12:33:36.69397 {"job":"api_worker_z1","index":0,"job_state":"failing","vitals":{"cpu":{"sys":"6.5","user":"14.4","wait":"0.4"},"disk":{"ephemeral":{"inode_percent":"10","percent":"14"},"persistent":{"inode_percent":"36","percent":"48"},"system":{"inode_percent":"36","percent":"48"}},"load":["0.19","0.06","0.06"],"mem":{"kb":"81272","percent":"8"},"swap":{"kb":"0","percent":"0"}}}


This reproduces systematically on our setup, using bosh release 152 with
stemcell bosh-vcloud-esxi-ubuntu-trusty-go_agent version 2889, and cf release
207 running stemcell 2889.

Enabling monit verbose logs ruled out the theory of monit restarting the cc_ng
jobs because of excessive RAM usage or a failed HTTP health check (as does the
short time window in which the extra stop is requested: ~15s). I also ruled out
the possibility of multiple monit instances, or a pid inconsistency with the
cc_ng process. I now suspect either the bosh agent sending an extra stop
request, or something in the cc_ng ctl scripts.
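
(For completeness, this is roughly how I checked the multiple-monit and pid
theories; the pidfile path below is my assumption of where the ctl script
writes it and may not match other deployments:)

$ ps aux | grep '[m]onit'                                           # expect a single monit daemon (pid 1050 here)
$ sudo cat /var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid
$ pgrep -f 'bin/cloud_controller .*cloud_controller_ng.yml'         # expect it to match the pid above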

As a side question, can someone explain how the cc_ng ctl script works? I'm
surprised by the following process tree, where ruby seems to call the ctl
script. Is the cc spawning itself? (My current guess is sketched right after
the process tree below.)

$ ps auxf --cols=2000 | less
[...]
vcap  8011  0.6  7.4 793864 299852 ?  S<l  May26   6:01 ruby /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller -m -c /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml
root  8014  0.0  0.0  19596   1436 ?  S<   May26   0:00  \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root  8023  0.0  0.0   5924   1828 ?  S<   May26   0:00  |   \_ tee -a /dev/fd/63
root  8037  0.0  0.0  19600   1696 ?  S<   May26   0:00  |   |   \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root  8061  0.0  0.0   5916   1924 ?  S<   May26   0:00  |   |   \_ logger -p user.info -t vcap.cloud_controller_ng_ctl.stdout
root  8024  0.0  0.0   7552   1788 ?  S<   May26   0:00  |   \_ awk -W Interactive {lineWithDate="echo [`date +\"%Y-%m-%d %H:%M:%S%z\"`] \"" $0 "\""; system(lineWithDate) }
root  8015  0.0  0.0  19600   1440 ?  S<   May26   0:00  \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root  8021  0.0  0.0   5924   1832 ?  S<   May26   0:00      \_ tee -a /dev/fd/63
root  8033  0.0  0.0  19600   1696 ?  S<   May26   0:00      |   \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root  8060  0.0  0.0   5912   1920 ?  S<   May26   0:00      |   \_ logger -p user.error -t vcap.cloud_controller_ng_ctl.stderr
root  8022  0.0  0.0   7552   1748 ?  S<   May26   0:00      \_ awk -W Interactive {lineWithDate="echo [`date +\"%Y-%m-%d %H:%M:%S%z\"`] \"" $0 "\""; system(lineWithDate) }
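
My guess for this tree is that the ctl script wires its stdout/stderr through
bash process substitution and then exec's the ruby server: the tee/awk/logger
helpers are forked as subshells that keep the script's command line in ps, and
since exec preserves the pid, they end up listed as children of the ruby
process. Below is a minimal sketch of that pattern; this is NOT the real
cloud_controller_ng_ctl, just the shape I suspect, with the awk timestamping
step simplified away:

#!/bin/bash
# sketch only: reproduces the ps shape above, not the actual cc_ng ctl script

log_to_syslog() {
  # simplified; the real ctl output also appears to be timestamped via awk
  tee -a >(logger -p "$1" -t "$2") > /dev/null
}

case $1 in
  start)
    # fd1/fd2 are wired to logging subshells, then the script replaces itself
    # with the ruby server (same pid), so the subshells, which still show up
    # in ps as ".../cloud_controller_ng_ctl start", become children of ruby
    exec /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller \
      -m -c /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml \
      1> >(log_to_syslog user.info  vcap.cloud_controller_ng_ctl.stdout) \
      2> >(log_to_syslog user.error vcap.cloud_controller_ng_ctl.stderr)
    ;;
esac

(The real script presumably also drops privileges, which would explain why ruby
runs as vcap while the logging helpers stay root.) If that reading is correct,
the answer to "is the cc spawning itself?" would be no; the extra ctl entries
are just logging subshells inheriting the script's name.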


I was wondering whether this could come from our setup running CF with a more
recent stemcell version (2922) than the one the cf release notes mention as the
"tested configuration". Are the latest stemcells tested against the latest CF
release? Is there any way to see which stemcell version the runtime team's
pipelines are using? [1] seemed to accept env vars and [2] required logging in.
I scanned through the bosh agent commit logs to spot something related, but
without luck so far.

Thanks in advance for your help,

Guillaume.

[1] https://github.com/cloudfoundry/bosh-lite/blob/master/ci/ci-stemcell-bats.sh
[2] https://concourse.diego-ci.cf-app.com/
