We have environments on stemcell 2977 that are running well. We also have an environment using NFS that ran into this same issue, and we have this bug open. [1] Specifying the nfs_mounter job last should work in the meantime until we get the order switched. This was apparently introduced when we added consul_agent to the cloud controller jobs. I'll update the release notes for the affected releases.

-Dieu
CF Runtime PM

[1] https://www.pivotaltracker.com/story/show/94152506
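For reference, the workaround looks roughly like this in the deployment manifest. This is a minimal sketch assuming a typical cf-release v207 style manifest; the job name and the other templates are illustrative and may differ in your deployment:

    jobs:
    - name: api_z1
      instances: 1
      templates:
      - name: consul_agent
      - name: cloud_controller_ng
      - name: metron_agent
      - name: nfs_mounter   # workaround: listed last instead of first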
On Wed, May 27, 2015 at 10:09 AM, Mike Youngstrom <youngm(a)gmail.com> wrote:

We recently experienced a similar issue. I'm not sure if it is the same, but it was triggered when we moved the nfs_mounter job template to the first item in the list of templates for the CC job. We moved nfs_mounter to the last job template in the list and we haven't had a problem since. It was really strange, because you'd think you'd want nfs_mounter first. Anyway, something to try.
Mike
On Wed, May 27, 2015 at 4:51 AM, Guillaume Berche <bercheg(a)gmail.com> wrote:
Hi,
I'm experiencing a weird situation where the api and api_worker jobs fail to update through bosh and end up being reported as "not running". However, after manually running "monit start cloud_controller_ng" (or rebooting the VM), the faulty job starts fine and the bosh deployment proceeds without errors. Looking at the monit logs, it seems there is an extra monit stop request for cc_ng.
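For anyone wanting to try the manual workaround, it looks roughly like this on the failing VM (a sketch; the monit binary path is the usual location on BOSH stemcells and may differ on yours):

    $ bosh ssh api_z1 0
    $ sudo /var/vcap/bosh/bin/monit summary                      # cloud_controller_ng shows up as not running
    $ sudo /var/vcap/bosh/bin/monit start cloud_controller_ng
    $ sudo /var/vcap/bosh/bin/monit summary                      # should now report the process as running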
Below are detailed traces illustrating the issue.
$ bosh deploy
[..]
Started updating job ha_proxy_z1 > ha_proxy_z1/0 (canary). Done (00:00:39)
Started updating job api_z1 > api_z1/0 (canary). Failed: `api_z1/0' is not running after update (00:10:44)
When instructing bosh to update the job (in this case only a config change), we indeed see the bosh agent asking monit to stop the jobs, restart monit itself, and start the jobs again, and then we see the extra stop (at 12:33:26) before the bosh director times out and marks the canary as failed.
$ less /var/vcap/monit/monit.log
[UTC May 22 12:33:17] info : Awakened by User defined signal 1
[UTC May 22 12:33:17] info : Awakened by the SIGHUP signal
[UTC May 22 12:33:17] info : Reinitializing monit - Control file '/var/vcap/bosh/etc/monitrc'
[UTC May 22 12:33:17] info : Shutting down monit HTTP server
[UTC May 22 12:33:18] info : monit HTTP server stopped
[UTC May 22 12:33:18] info : Starting monit HTTP server at [127.0.0.1:2822]
[UTC May 22 12:33:18] info : monit HTTP server started
[UTC May 22 12:33:18] info : 'system_897cdb8d-f9f7-4bfa-a748-512489b676e0' Monit reloaded
[UTC May 22 12:33:23] info : start service 'consul_agent' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : Awakened by User defined signal 1
[UTC May 22 12:33:23] info : 'consul_agent' start: /var/vcap/jobs/consul_agent/bin/agent_ctl
[UTC May 22 12:33:23] info : start service 'nfs_mounter' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : start service 'metron_agent' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : start service 'cloud_controller_worker_1' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:24] info : 'consul_agent' start action done
[UTC May 22 12:33:24] info : 'nfs_mounter' start: /var/vcap/jobs/nfs_mounter/bin/nfs_mounter_ctl
[UTC May 22 12:33:24] info : 'cloud_controller_worker_1' start: /var/vcap/jobs/cloud_controller_worker/bin/cloud_controller_worker_ctl
[UTC May 22 12:33:25] info : 'cloud_controller_worker_1' start action done
[UTC May 22 12:33:25] info : 'metron_agent' start: /var/vcap/jobs/metron_agent/bin/metron_agent_ctl
[UTC May 22 12:33:26] info : 'metron_agent' start action done
[UTC May 22 12:33:26] info : 'cloud_controller_worker_1' stop: /var/vcap/jobs/cloud_controller_worker/bin/cloud_controller_worker_ctl   <== extra stop
[UTC May 22 12:33:27] info : 'nfs_mounter' start action done
[UTC May 22 12:33:27] info : Awakened by User defined signal 1
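To isolate the interesting events in that log, filtering on the monit service name is enough (a sketch; adjust the service name for the job you are looking at):

    $ grep -E "'cloud_controller_worker_1' (start|stop)" /var/vcap/monit/monit.log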
There are no associated traces of the bosh agent requesting this extra stop:
$ less /var/vcap/bosh/log/current
2015-05-22_12:33:23.73606 [monitJobSupervisor] 2015/05/22 12:33:23 DEBUG - Starting service cloud_controller_worker_1
2015-05-22_12:33:23.73608 [http-client] 2015/05/22 12:33:23 DEBUG - Monit request: url='http://127.0.0.1:2822/cloud_controller_worker_1' body='action=start'
2015-05-22_12:33:23.73608 [attemptRetryStrategy] 2015/05/22 12:33:23 DEBUG - Making attempt #0
2015-05-22_12:33:23.73609 [clientRetryable] 2015/05/22 12:33:23 DEBUG - [requestID=52ede4f0-427d-4e65-6da1-d3b5c4b5cafd] Requesting (attempt=1): Request{ Method: 'POST', URL: 'http://127.0.0.1:2822/cloud_controller_worker_1' }
2015-05-22_12:33:23.73647 [clientRetryable] 2015/05/22 12:33:23 DEBUG - [requestID=52ede4f0-427d-4e65-6da1-d3b5c4b5cafd] Request succeeded (attempts=1), response: Response{ StatusCode: 200, Status: '200 OK'}
2015-05-22_12:33:23.73648 [MBus Handler] 2015/05/22 12:33:23 INFO - Responding
2015-05-22_12:33:23.73650 [MBus Handler] 2015/05/22 12:33:23 DEBUG - Payload
2015-05-22_12:33:23.73650 ********************
2015-05-22_12:33:23.73651 {"value":"started"}
2015-05-22_12:33:23.73651 ********************
2015-05-22_12:33:36.69397 [NATS Handler] 2015/05/22 12:33:36 DEBUG - Message Payload
2015-05-22_12:33:36.69397 ********************
2015-05-22_12:33:36.69397 {"job":"api_worker_z1","index":0,"job_state":"failing","vitals":{"cpu":{"sys":"6.5","user":"14.4","wait":"0.4"},"disk":{"ephemeral":{"inode_percent":"10","percent":"14"},"persistent":{"inode_percent":"36","percent":"48"},"system":{"inode_percent":"36","percent":"48"}},"load":["0.19","0.06","0.06"],"mem":{"kb":"81272","percent":"8"},"swap":{"kb":"0","percent":"0"}}}
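Since start requests show up in the agent log as monit HTTP calls with body='action=start', an extra stop issued by the agent should show up the same way. That is the check used here (an assumption based on the log format above, not a documented agent interface):

    $ grep "action=stop" /var/vcap/bosh/log/current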
This reproduces systematically on our setup, using bosh release 152 with stemcell bosh-vcloud-esxi-ubuntu-trusty-go_agent version 2889, and cf release 207 running on stemcell 2889.
Enabling monit verbose logs ruled out the theory of monit restarting the cc_ng jobs because of excessive RAM usage or a failed HTTP health check (as does the short time window in which the extra stop is requested: ~15 s). I also ruled out the possibility of multiple monit instances, or a PID inconsistency with the cc_ng process. I now suspect either the bosh agent sending an extra stop request, or something in the cc_ng ctl scripts.
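For completeness, these are roughly the checks behind the last two points (a sketch; the pidfile path follows the usual cf-release convention and may differ in other releases):

    $ ps aux | grep [m]onit                                               # expect a single monit daemon
    $ cat /var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid
    $ ps -fp "$(cat /var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid)"   # should be the ruby cloud_controller process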
As a side question, can someone explain how the cc_ng ctl script works? I'm surprised by the following process tree, where ruby seems to call the ctl script. Is the CC spawning itself?
$ ps auxf --cols=2000 | less
[...]
vcap  8011  0.6  7.4  793864  299852  ?  S<l  May26  6:01  ruby /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller -m -c /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml
root  8014  0.0  0.0   19596    1436  ?  S<   May26  0:00  \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root  8023  0.0  0.0    5924    1828  ?  S<   May26  0:00  |  \_ tee -a /dev/fd/63
root  8037  0.0  0.0   19600    1696  ?  S<   May26  0:00  |  |  \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root  8061  0.0  0.0    5916    1924  ?  S<   May26  0:00  |  |  \_ logger -p user.info -t vcap.cloud_controller_ng_ctl.stdout
root  8024  0.0  0.0    7552    1788  ?  S<   May26  0:00  |  \_ awk -W Interactive {lineWithDate="echo [`date +\"%Y-%m-%d %H:%M:%S%z\"`] \"" $0 "\""; system(lineWithDate) }
root  8015  0.0  0.0   19600    1440  ?  S<   May26  0:00  \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root  8021  0.0  0.0    5924    1832  ?  S<   May26  0:00     \_ tee -a /dev/fd/63
root  8033  0.0  0.0   19600    1696  ?  S<   May26  0:00     |  \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root  8060  0.0  0.0    5912    1920  ?  S<   May26  0:00     |  \_ logger -p user.error -t vcap.cloud_controller_ng_ctl.stderr
root  8022  0.0  0.0    7552    1748  ?  S<   May26  0:00     \_ awk -W Interactive {lineWithDate="echo [`date +\"%Y-%m-%d %H:%M:%S%z\"`] \"" $0 "\""; system(lineWithDate) }
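The tree is consistent with the ctl script wiring up its logging with bash process substitution and then exec'ing the Ruby server: the ctl script's PID becomes the ruby process, while the logging subshells (which still display the ctl command line), tee, awk and logger stay attached to it as children. A minimal standalone sketch of that mechanism, assuming this is the pattern used (it is not the verbatim cloud_controller_ng_ctl script):

    #!/bin/bash
    # Demo of how logging plumbing ends up as children of the exec'ed server.
    LOG_DIR=/tmp/ctl_demo
    mkdir -p "$LOG_DIR"

    # Process substitution forks subshells that keep this script's command line;
    # tee and logger run inside them.
    exec 1> >(tee -a "$LOG_DIR/stdout.log" | logger -p user.info  -t vcap.demo_ctl.stdout)
    exec 2> >(tee -a "$LOG_DIR/stderr.log" | logger -p user.error -t vcap.demo_ctl.stderr)

    # exec replaces this bash process (same PID) with the long-running server,
    # so the subshells above now appear as children of that server process.
    exec sleep 300   # stand-in for: exec ruby .../bin/cloud_controller -m -c .../cloud_controller_ng.yml

If that reading is right, the CC is not spawning itself; the parent/child layout is just an artifact of exec plus process substitution in the ctl script.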
I was wondering whether this could come from our setup running CF with a more recent stemcell version (2922) than what the cf release notes mention as the "tested configuration". Are the latest stemcells tested against the latest CF release? Is there any way to see what stemcell version the runtime team's pipelines are using? [1] seemed to accept env vars and [2] required logging in. I scanned through the bosh agent commit logs to spot something related, but without luck so far.
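For checking what is actually deployed in a given environment, the standard bosh CLI v1 commands are enough (they show only the local director's view, not what the upstream pipelines test against):

    $ bosh status         # CLI and director versions
    $ bosh stemcells      # stemcells uploaded to the director, with versions
    $ bosh deployments    # release and stemcell versions used by each deployment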
Thanks in advance for your help,
Guillaume.
[1] https://github.com/cloudfoundry/bosh-lite/blob/master/ci/ci-stemcell-bats.sh
[2] https://concourse.diego-ci.cf-app.com/
_______________________________________________
cf-dev mailing list
cf-dev(a)lists.cloudfoundry.org
https://lists.cloudfoundry.org/mailman/listinfo/cf-dev