Re: api and api_worker jobs fail to bosh update, but monit start OK


Mike Youngstrom
 

We discovered this fix out of sheer luck. So, no help there sorry. :)

Mike

On Thu, May 28, 2015 at 1:56 AM, Guillaume Berche <bercheg(a)gmail.com> wrote:

Thanks a lot Mike and Dieu. Indeed moving the nfs_mounter last seemed
indeed to fix the issue in v207. If ever this reproduces on master and can
help, I submitted https://github.com/cloudfoundry/cf-release/pull/689
against develop branch.

Out of curiosity, and for improving my next diagnostic task, how was the
root cause diagnosed? I was not observing any faulty output traces into
jobs outputs:
[...]/cloud_controller_worker_ctl.log, /var/vcap/sys/log/cloud_controller_ng_ctl.err.log
or [...]/cloud_controller_ng/cloud_controller_ng.log

@Dieu, is there a way the runtime pipelines output could be shared with
the community (of course hiding sensitive data), as to help the community
better understand which case went through the automated cases and report
issues on different settings? E.g. a public concourse job for the pipeline
runing stemcell 2977 (runtime-bb-2 ?).

Thanks,

Guillaume.

On Wed, May 27, 2015 at 7:55 PM, Dieu Cao <dcao(a)pivotal.io> wrote:

We have environments on stemcell 2977 that are running well.

We have an environment using NFS that ran into that same issue and we
have this bug open. [1]
Specifying the nfs_mounter job last should work in the mean time until we
get the order switched.
This was apparently introduced when we added consul_agent to the cloud
controller jobs.
I'll update the release notes for the affected releases.

-Dieu
CF Runtime PM

[1] https://www.pivotaltracker.com/story/show/94152506


On Wed, May 27, 2015 at 10:09 AM, Mike Youngstrom <youngm(a)gmail.com>
wrote:

We recently experienced a similar issue. Not sure if it is the same.
But it was caused when we moved the nfs_mounter job template to the first
item in the list of templates for the CC job. We moved nfs_mounter to the
last job template in the list and we haven't had a problem since. It was
really strange cause you think you'd want nfs_mounter first. Anyway,
something to try.

Mike

On Wed, May 27, 2015 at 4:51 AM, Guillaume Berche <bercheg(a)gmail.com>
wrote:

Hi,

I'm experiencing a weird situation where api and api_worker jobs fail
to update through bosh and end up being reported as "not running". However,
manually running "monit start cloud_controller_ng" (or rebooting the vm),
the faulty jobs starts fine, and bosh deployment proceeds without errors.
Looking at monit logs, it seems that there is an extra monit stop request
for the cc_ng.

Below are detailed traces illustrating the issue.

$ bosh deploy

[..]
Started updating job ha_proxy_z1 > ha_proxy_z1/0 (canary). Done (00:00:39)
Started updating job api_z1 > api_z1/0 (canary). Failed: `api_z1/0' is not running after update (00:10:44)


When instructing bosh to update the job (in this case only a config
change), we indeed see the bosh agent asking monit to stop jobs, restart
monit itself, start jobs, and then we see the extra stop (at*
12:33:26) *before the bosh director ends up timeouting and calling the
canary failed.

$ less /var/vcap/monit/monit.log

[UTC May 22 12:33:17] info : Awakened by User defined signal 1[UTC May 22 12:33:17] info : Awakened by the SIGHUP signal[UTC May 22 12:33:17] info : Reinitializing monit - Control file '/var/vcap/bosh/etc/monitrc'[UTC May 22 12:33:17] info : Shutting down monit HTTP server[UTC May 22 12:33:18] info : monit HTTP server stopped[UTC May 22 12:33:18] info : Starting monit HTTP server at [127.0.0.1:2822][UTC May 22 12:33:18] info : monit HTTP server started[UTC May 22 12:33:18] info : 'system_897cdb8d-f9f7-4bfa-a748-512489b676e0' Monit reloaded[UTC May 22 12:33:23] info : start service 'consul_agent' on user request[UTC May 22 12:33:23] info : monit daemon at 1050 awakened[UTC May 22 12:33:23] info : Awakened by User defined signal 1[UTC May 22 12:33:23] info : 'consul_agent' start: /var/vcap/jobs/consul_agent/bin/agent_ctl[UTC May 22 12:33:23] info : start service 'nfs_mounter' on user request[UTC May 22 12:33:23] info : monit daemon at 1050 awakened[UTC May 22 12:33:23] info : start service 'metron_agent' on user request[UTC May 22 12:33:23] info : monit daemon at 1050 awakened[UTC May 22 12:33:23] info : start service 'cloud_controller_worker_1' on user request[UTC May 22 12:33:23] info : monit daemon at 1050 awakened[UTC May 22 12:33:24] info : 'consul_agent' start action done[UTC May 22 12:33:24] info : 'nfs_mounter' start: /var/vcap/jobs/nfs_mounter/bin/nfs_mounter_ctl[UTC May 22 12:33:24] info : 'cloud_controller_worker_1' start: /var/vcap/jobs/cloud_controller_worker/bin/cloud_controller_worker_ctl*[UTC May 22 12:33:25] info : 'cloud_controller_worker_1' start action done
*[UTC May 22 12:33:25] info : 'metron_agent' start: /var/vcap/jobs/metron_agent/bin/metron_agent_ctl[UTC May 22 12:33:26] info : 'metron_agent' start action done*[UTC May 22 12:33:26] info : 'cloud_controller_worker_1' stop: /var/vcap/jobs/cloud_controller_worker/bin/cloud_controller_worker_ctl
*[UTC May 22 12:33:27] info : 'nfs_mounter' start action done[UTC May 22 12:33:27] info : Awakened by User defined signal 1

There is no associated traces of the bosh agent asking this extra stop:

$ less /var/vcap/bosh/log/current
2015-05-22_12:33:23.73606 [monitJobSupervisor] 2015/05/22 12:33:23 DEBUG - Starting service cloud_controller_worker_12015-05-22_12:33:23.73608 [http-client] 2015/05/22 12:33:23 DEBUG - Monit request: url='http://127.0.0.1:2822/cloud_controller_worker_1' body='action=start'2015-05-22_12:33:23.73608 [attemptRetryStrategy] 2015/05/22 12:33:23 DEBUG - Making attempt #02015-05-22_12:33:23.73609 [clientRetryable] 2015/05/22 12:33:23 DEBUG - [requestID=52ede4f0-427d-4e65-6da1-d3b5c4b5cafd] Requesting (attempt=1): Request{ Method: 'POST', URL: 'http://127.0.0.1:2822/cloud_controller_worker_1' }2015-05-22_12:33:23.73647 [clientRetryable] 2015/05/22 12:33:23 DEBUG - [requestID=52ede4f0-427d-4e65-6da1-d3b5c4b5cafd] Request succeeded (attempts=1), response: Response{ StatusCode: 200, Status: '200 OK'}2015-05-22_12:33:23.73648 [MBus Handler] 2015/05/22 12:33:23 INFO - Responding2015-05-22_12:33:23.73650 [MBus Handler] 2015/05/22 12:33:23 DEBUG - Payload2015-05-22_12:33:23.73650 ********************2015-05-22_12:33:23.73651 {"value":"started"}2015-05-22_12:33:23.73651 ********************
2015-05-22_12:33:36.69397 [NATS Handler] 2015/05/22 12:33:36 DEBUG - Message Payload2015-05-22_12:33:36.69397 ********************2015-05-22_12:33:36.69397 {"job":"api_worker_z1","index":0,"job_state":"failing","vitals":{"cpu":{"sys":"6.5","user":"14.4","wait":"0.4"},"disk":{"ephemeral":{"inode_percent":"10","percent":"14"},"persistent":{"inode_percent":"36","percent":"48"},"system":{"inode_percent":"36","percent":"48"}},"load":["0.19","0.06","0.06"],"mem":{"kb":"81272","percent":"8"},"swap":{"kb":"0","percent":"0"}}}


This is reproducing systematically on our set up using bosh release 152
with stemcell bosh-vcloud-esxi-ubuntu-trusty-go_agent version 2889, and cf
release 207 running stemcell 2889.

Enabling monit verbose logs discarded the theory of monit restarting
cc_ng jobs because of too much ram usage, or failed http health check
(along with the short time window in which the extra stop is requested:
~15s). I also discarded possibility of multiple monit instances, or pid
inconsistency with cc_ng process. I'm now suspecting either the bosh agent
to send extra stop request, or something with the cc_ng ctl scripts.

As a side question, can someone explain how the cc_ng ctl script works,
I'm suprised with the following process tree, where ruby seems to call the
ctl script. Is the cc spawning it self ?

$ ps auxf --cols=2000 | less
[...]
vcap 8011 0.6 7.4 793864 299852 ? S<l May26 6:01 ruby /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller -m -c /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml
root 8014 0.0 0.0 19596 1436 ? S< May26 0:00 \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root 8023 0.0 0.0 5924 1828 ? S< May26 0:00 | \_ tee -a /dev/fd/63
root 8037 0.0 0.0 19600 1696 ? S< May26 0:00 | | \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root 8061 0.0 0.0 5916 1924 ? S< May26 0:00 | | \_ logger -p user.info -t vcap.cloud_controller_ng_ctl.stdout
root 8024 0.0 0.0 7552 1788 ? S< May26 0:00 | \_ awk -W Interactive {lineWithDate="echo [`date +\"%Y-%m-%d %H:%M:%S%z\"`] \"" $0 "\""; system(lineWithDate) }
root 8015 0.0 0.0 19600 1440 ? S< May26 0:00 \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root 8021 0.0 0.0 5924 1832 ? S< May26 0:00 \_ tee -a /dev/fd/63
root 8033 0.0 0.0 19600 1696 ? S< May26 0:00 | \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root 8060 0.0 0.0 5912 1920 ? S< May26 0:00 | \_ logger -p user.error -t vcap.cloud_controller_ng_ctl.stderr
root 8022 0.0 0.0 7552 1748 ? S< May26 0:00 \_ awk -W Interactive {lineWithDate="echo [`date +\"%Y-%m-%d %H:%M:%S%z\"`] \"" $0 "\""; system(lineWithDate) }


I was wondering whether this could come from our setting running CF
with a more recent stemcell version (2922) than what the cf release notes
are mentionning as "tested configuration". Are the latest stemcells tested
against latest CF release ? Is there any way to see what stemcell version
the runtime team pipelines is using [1] seemed to accept env vars and [2]
required logging in ? I scanned through the bosh agent commit logs to spot
something related but without luck so far.

Thanks in advance for your help,

Guillaume.

[1]
https://github.com/cloudfoundry/bosh-lite/blob/master/ci/ci-stemcell-bats.sh
<https://github.com/cloudfoundry/bosh-lite/blob/master/ci/ci-stemcell-bats.sh>
[2] https://concourse.diego-ci.cf-app.com/
<https://concourse.diego-ci.cf-app.com/>


_______________________________________________
cf-dev mailing list
cf-dev(a)lists.cloudfoundry.org
https://lists.cloudfoundry.org/mailman/listinfo/cf-dev

_______________________________________________
cf-dev mailing list
cf-dev(a)lists.cloudfoundry.org
https://lists.cloudfoundry.org/mailman/listinfo/cf-dev

_______________________________________________
cf-dev mailing list
cf-dev(a)lists.cloudfoundry.org
https://lists.cloudfoundry.org/mailman/listinfo/cf-dev

Join cf-dev@lists.cloudfoundry.org to automatically receive all group messages.