api and api_worker jobs fail to bosh update, but monit start OK


Guillaume Berche
 

Hi,

I'm experiencing a weird situation where the api and api_worker jobs fail to
update through bosh and end up being reported as "not running". However,
after manually running "monit start cloud_controller_ng" (or rebooting the
vm), the faulty jobs start fine and the bosh deployment proceeds without
errors. Looking at the monit logs, it seems there is an extra monit stop
request for cc_ng.
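
For the record, the manual recovery amounts to something like the following
on the failing vm (a sketch, assuming the monit wrapper that stemcells ship
under /var/vcap/bosh/bin):

$ /var/vcap/bosh/bin/monit summary                    # faulty job reported as not running
$ /var/vcap/bosh/bin/monit start cloud_controller_ng # manual start, after which it runs fine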

Below are detailed traces illustrating the issue.

$ bosh deploy

[..]
Started updating job ha_proxy_z1 > ha_proxy_z1/0 (canary). Done (00:00:39)
Started updating job api_z1 > api_z1/0 (canary). Failed: `api_z1/0' is not running after update (00:10:44)


When instructing bosh to update the job (in this case only a config
change), we indeed see the bosh agent asking monit to stop the jobs, restart
monit itself, and start the jobs, and then we see the extra stop (at 12:33:26)
before the bosh director eventually times out and declares the canary failed.

$ less /var/vcap/monit/monit.log

[UTC May 22 12:33:17] info : Awakened by User defined signal 1
[UTC May 22 12:33:17] info : Awakened by the SIGHUP signal
[UTC May 22 12:33:17] info : Reinitializing monit - Control file '/var/vcap/bosh/etc/monitrc'
[UTC May 22 12:33:17] info : Shutting down monit HTTP server
[UTC May 22 12:33:18] info : monit HTTP server stopped
[UTC May 22 12:33:18] info : Starting monit HTTP server at [127.0.0.1:2822]
[UTC May 22 12:33:18] info : monit HTTP server started
[UTC May 22 12:33:18] info : 'system_897cdb8d-f9f7-4bfa-a748-512489b676e0' Monit reloaded
[UTC May 22 12:33:23] info : start service 'consul_agent' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : Awakened by User defined signal 1
[UTC May 22 12:33:23] info : 'consul_agent' start: /var/vcap/jobs/consul_agent/bin/agent_ctl
[UTC May 22 12:33:23] info : start service 'nfs_mounter' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : start service 'metron_agent' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : start service 'cloud_controller_worker_1' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:24] info : 'consul_agent' start action done
[UTC May 22 12:33:24] info : 'nfs_mounter' start: /var/vcap/jobs/nfs_mounter/bin/nfs_mounter_ctl
[UTC May 22 12:33:24] info : 'cloud_controller_worker_1' start: /var/vcap/jobs/cloud_controller_worker/bin/cloud_controller_worker_ctl
[UTC May 22 12:33:25] info : 'cloud_controller_worker_1' start action done
[UTC May 22 12:33:25] info : 'metron_agent' start: /var/vcap/jobs/metron_agent/bin/metron_agent_ctl
[UTC May 22 12:33:26] info : 'metron_agent' start action done
[UTC May 22 12:33:26] info : 'cloud_controller_worker_1' stop: /var/vcap/jobs/cloud_controller_worker/bin/cloud_controller_worker_ctl
[UTC May 22 12:33:27] info : 'nfs_mounter' start action done
[UTC May 22 12:33:27] info : Awakened by User defined signal 1
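
To make the extra stop easier to spot in a long raw log, the start/stop
entries can be filtered out, e.g.:

$ grep -E "(start|stop)( service| action|:)" /var/vcap/monit/monit.log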

There are no associated traces of the bosh agent requesting this extra stop:

$ less /var/vcap/bosh/log/current
2015-05-22_12:33:23.73606 [monitJobSupervisor] 2015/05/22 12:33:23 DEBUG - Starting service cloud_controller_worker_1
2015-05-22_12:33:23.73608 [http-client] 2015/05/22 12:33:23 DEBUG - Monit request: url='http://127.0.0.1:2822/cloud_controller_worker_1' body='action=start'
2015-05-22_12:33:23.73608 [attemptRetryStrategy] 2015/05/22 12:33:23 DEBUG - Making attempt #0
2015-05-22_12:33:23.73609 [clientRetryable] 2015/05/22 12:33:23 DEBUG - [requestID=52ede4f0-427d-4e65-6da1-d3b5c4b5cafd] Requesting (attempt=1): Request{ Method: 'POST', URL: 'http://127.0.0.1:2822/cloud_controller_worker_1' }
2015-05-22_12:33:23.73647 [clientRetryable] 2015/05/22 12:33:23 DEBUG - [requestID=52ede4f0-427d-4e65-6da1-d3b5c4b5cafd] Request succeeded (attempts=1), response: Response{ StatusCode: 200, Status: '200 OK'}
2015-05-22_12:33:23.73648 [MBus Handler] 2015/05/22 12:33:23 INFO - Responding
2015-05-22_12:33:23.73650 [MBus Handler] 2015/05/22 12:33:23 DEBUG - Payload
2015-05-22_12:33:23.73650 ********************
2015-05-22_12:33:23.73651 {"value":"started"}
2015-05-22_12:33:23.73651 ********************
2015-05-22_12:33:36.69397 [NATS Handler] 2015/05/22 12:33:36 DEBUG - Message Payload
2015-05-22_12:33:36.69397 ********************
2015-05-22_12:33:36.69397 {"job":"api_worker_z1","index":0,"job_state":"failing","vitals":{"cpu":{"sys":"6.5","user":"14.4","wait":"0.4"},"disk":{"ephemeral":{"inode_percent":"10","percent":"14"},"persistent":{"inode_percent":"36","percent":"48"},"system":{"inode_percent":"36","percent":"48"}},"load":["0.19","0.06","0.06"],"mem":{"kb":"81272","percent":"8"},"swap":{"kb":"0","percent":"0"}}}


This reproduces systematically on our setup using bosh release 152
with stemcell bosh-vcloud-esxi-ubuntu-trusty-go_agent version 2889, and cf
release 207 running stemcell 2889.

Enabling monit verbose logs ruled out the theory that monit restarts the
cc_ng jobs because of excessive RAM usage or a failed HTTP health check (as
does the short time window, ~15s, in which the extra stop is requested). I
also ruled out multiple monit instances and a pid inconsistency with the
cc_ng process. I now suspect either the bosh agent sending an extra stop
request, or something in the cc_ng ctl scripts.
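
For reference, those checks were along the lines of the following (the
pidfile path is an assumption, following the usual /var/vcap/sys/run
layout):

$ ps -ef | grep [m]onit                                   # expect a single monit daemon (pid 1050 here)
$ cat /var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid
$ pgrep -f bin/cloud_controller                           # should match the pidfile content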

As a side question, can someone explain how the cc_ng ctl script works? I'm
surprised by the following process tree, where ruby seems to call the ctl
script. Is the cc spawning itself?

$ ps auxf --cols=2000 | less
[...]
vcap 8011 0.6 7.4 793864 299852 ? S<l May26 6:01 ruby /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller -m -c /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml
root 8014 0.0 0.0 19596 1436 ? S< May26 0:00 \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root 8023 0.0 0.0 5924 1828 ? S< May26 0:00 | \_ tee -a /dev/fd/63
root 8037 0.0 0.0 19600 1696 ? S< May26 0:00 | | \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root 8061 0.0 0.0 5916 1924 ? S< May26 0:00 | | \_ logger -p user.info -t vcap.cloud_controller_ng_ctl.stdout
root 8024 0.0 0.0 7552 1788 ? S< May26 0:00 | \_ awk -W Interactive {lineWithDate="echo [`date +\"%Y-%m-%d %H:%M:%S%z\"`] \"" $0 "\""; system(lineWithDate) }
root 8015 0.0 0.0 19600 1440 ? S< May26 0:00 \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root 8021 0.0 0.0 5924 1832 ? S< May26 0:00 \_ tee -a /dev/fd/63
root 8033 0.0 0.0 19600 1696 ? S< May26 0:00 | \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root 8060 0.0 0.0 5912 1920 ? S< May26 0:00 | \_ logger -p user.error -t vcap.cloud_controller_ng_ctl.stderr
root 8022 0.0 0.0 7552 1748 ? S< May26 0:00 \_ awk -W Interactive {lineWithDate="echo [`date +\"%Y-%m-%d %H:%M:%S%z\"`] \"" $0 "\""; system(lineWithDate) }
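
For what it's worth, that tree is consistent with the ctl script wiring its
stdout/stderr through bash process substitution and then exec'ing the ruby
server. A minimal sketch of the pattern (illustrative only, not the actual
cf-release ctl source; the log path is an assumption):

#!/bin/bash
LOG_DIR=/var/vcap/sys/log/cloud_controller_ng   # hypothetical log dir

# Process substitution forks the tee/logger pipelines as children of
# this shell; they duplicate each output line into a file and syslog.
exec 1> >(tee -a "$LOG_DIR/ctl.log" | logger -p user.info -t vcap.cloud_controller_ng_ctl.stdout)
exec 2> >(tee -a "$LOG_DIR/ctl.err.log" | logger -p user.error -t vcap.cloud_controller_ng_ctl.stderr)

# ... pidfile handling, chown of shared dirs, etc. ...

# exec replaces this bash process with ruby while keeping the same PID,
# so the logging helpers forked above end up listed in ps as children
# of the ruby process: the cc is not spawning itself.
exec /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller \
  -m -c /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml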


I was wondering whether this could come from our setup running CF with a
more recent stemcell version (2922) than what the cf release notes mention
as the "tested configuration". Are the latest stemcells tested against the
latest CF release? Is there any way to see which stemcell version the
runtime team pipelines are using? [1] seemed to accept env vars, and [2]
required logging in. I scanned through the bosh agent commit logs to spot
something related, but without luck so far.

Thanks in advance for your help,

Guillaume.

[1] https://github.com/cloudfoundry/bosh-lite/blob/master/ci/ci-stemcell-bats.sh
[2] https://concourse.diego-ci.cf-app.com/


Mike Youngstrom
 

We recently experienced a similar issue. Not sure if it is the same, but
ours was caused when we moved the nfs_mounter job template to the first item
in the list of templates for the CC job. We moved nfs_mounter to the last
job template in the list and we haven't had a problem since. It was really
strange, because you'd think you'd want nfs_mounter first. Anyway, something
to try.
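
In manifest terms that means something like the following (an illustrative
excerpt; the exact template list and names vary per deployment):

jobs:
- name: api_z1
  templates:
  - name: consul_agent
  - name: cloud_controller_ng
  - name: metron_agent
  - name: nfs_mounter    # moved from first to last in the list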

Mike



Dieu Cao <dcao@...>
 

We have environments on stemcell 2977 that are running well.

We have an environment using NFS that ran into the same issue, and we have
this bug open. [1]
Specifying the nfs_mounter job last should work in the meantime, until we
get the order switched.
This was apparently introduced when we added consul_agent to the cloud
controller jobs.
I'll update the release notes for the affected releases.

-Dieu
CF Runtime PM

[1] https://www.pivotaltracker.com/story/show/94152506



Guillaume Berche
 

Thanks a lot Mike and Dieu. Moving nfs_mounter last indeed seemed to fix
the issue in v207. If ever this reproduces on master and can help, I
submitted https://github.com/cloudfoundry/cf-release/pull/689 against the
develop branch.

Out of curiosity, and to improve my next diagnostic session: how was the
root cause diagnosed? I was not observing any faulty output traces in the
job outputs:
[...]/cloud_controller_worker_ctl.log,
/var/vcap/sys/log/cloud_controller_ng_ctl.err.log
or [...]/cloud_controller_ng/cloud_controller_ng.log

@Dieu, is there a way the runtime pipelines' output could be shared with the
community (of course hiding sensitive data), to help the community better
understand which cases went through the automated tests and to report
issues on different settings? E.g. a public concourse job for the pipeline
running stemcell 2977 (runtime-bb-2?).

Thanks,

Guillaume.



Guillaume Berche
 

Oops, sorry, I had probably overlooked this faulty trace, which should have
tipped me off:

[2015-05-26 16:37:51+0000] ------------ STARTING cloud_controller_ng_ctl at Tue May 26 16:37:51 UTC 2015 --------------
[2015-05-26 16:37:51+0000] chown: changing ownership of ‘/var/vcap/nfs/shared’: Operation not permitted
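
When that error shows up, the state of the share can be checked quickly
with the paths from the trace:

$ mount | grep /var/vcap/nfs     # is the NFS export actually mounted at that point?
$ ls -ld /var/vcap/nfs/shared    # owner/permissions the chown was fighting against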


Nicholas Calugar
 

Hey Guillaume,

To be fair, this did take some time to figure out on our end. It has to do
with the interaction between BOSH and Monit. As Dieu mentioned, runtime has
a story to potentially determine a better solution than "nfs_mounter must
be last in the list."


Nick


Mike Youngstrom
 

We discovered this fix out of sheer luck. So, no help there, sorry. :)

Mike

On Thu, May 28, 2015 at 1:56 AM, Guillaume Berche <bercheg(a)gmail.com> wrote:

Thanks a lot Mike and Dieu. Moving nfs_mounter last indeed seemed
to fix the issue in v207. In case this reproduces on master and it can
help, I submitted https://github.com/cloudfoundry/cf-release/pull/689
against the develop branch.

Out of curiosity, and to improve my next diagnostic attempts, how was the
root cause diagnosed? I did not observe any faulty traces in the job
outputs:
[...]/cloud_controller_worker_ctl.log, /var/vcap/sys/log/cloud_controller_ng_ctl.err.log
or [...]/cloud_controller_ng/cloud_controller_ng.log

@Dieu, is there a way the runtime pipelines' output could be shared with
the community (of course hiding sensitive data), to help the community
better understand which cases went through the automated tests and to
report issues on different settings? E.g. a public concourse job for the
pipeline running stemcell 2977 (runtime-bb-2?).

Thanks,

Guillaume.

On Wed, May 27, 2015 at 7:55 PM, Dieu Cao <dcao(a)pivotal.io> wrote:

We have environments on stemcell 2977 that are running well.

We have an environment using NFS that ran into that same issue and we
have this bug open. [1]
Specifying the nfs_mounter job last should work in the meantime, until we
get the order switched (see the sketch below).
This was apparently introduced when we added consul_agent to the cloud
controller jobs.
I'll update the release notes for the affected releases.

-Dieu
CF Runtime PM

[1] https://www.pivotaltracker.com/story/show/94152506
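
For anyone applying this workaround, the change amounts to something like
the following in the deployment manifest. The template list is assumed
from a typical cf-release v207 api job; your manifest may differ:

jobs:
- name: api_z1
  templates:
  - name: consul_agent
  - name: cloud_controller_ng
  - name: metron_agent
  - name: nfs_mounter    # moved from first to last, per the workaround above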


On Wed, May 27, 2015 at 10:09 AM, Mike Youngstrom <youngm(a)gmail.com>
wrote:

We recently experienced a similar issue. Not sure if it is the same,
but it was caused when we moved the nfs_mounter job template to the first
item in the list of templates for the CC job. We moved nfs_mounter to the
last job template in the list and we haven't had a problem since. It was
really strange, because you'd think you'd want nfs_mounter first. Anyway,
something to try.

Mike


Guillaume Berche
 

Thanks Nick and Mike for the details. I am curious to know whether the
issue has been pinpointed by the bosh team, and whether the fix would be
purely in bosh or would require adaptations in cf-release.

Looking at related bosh tickets, the following ones seem relevant:

- https://www.pivotaltracker.com/story/show/62354622 "Agent should poll
  monit process state on 'STOP' message" (i.e. the race condition between
  stop and start when applying)
- https://www.pivotaltracker.com/n/projects/942192/stories/67521532
  "audit process start up/shutdown via monit" (a process state machine,
  with an "invalid pid check" leading to a monit stop; it would be worth
  checking that a debug trace is indeed printed in this case)
- https://www.pivotaltracker.com/story/show/73360712 "Bosh reports
  instances as failing during a healthy deploy triggering many alerts"
- https://www.pivotaltracker.com/story/show/88572424 PR "Update monit to
  5.2.5" (monit bug fixes)

Guillaume.


Guillaume Berche
 

For similar diagnostics in the future, I've updated the docs with a
working way to invoke steno-prettify and get the cc_ng logs prettified
with human-readable timestamps, cf. [1].

BTW, has anyone tried a log conversion for cc_ng logs collected into a
syslog endpoint (such as a logstash filter that invokes the
steno-prettify command, or an equivalent)?
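
A sketch of the kind of invocation involved; the vendored location of
the steno gem is an assumption here (and it may need to run under
bundle exec), so see [1] for the exact documented steps:

$ # locate the steno gem's prettifier vendored with the cc_ng package
$ PRETTIFY=$(find /var/vcap/packages/cloud_controller_ng -name steno-prettify | head -1)
$ # convert the steno JSON log lines into human-readable, timestamped lines
$ "$PRETTIFY" /var/vcap/sys/log/cloud_controller_ng/cloud_controller_ng.log | less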

Besides, I ran into the bin/console pry shell, which seems to open
interesting diagnostic/db interaction opportunities. If there are
examples of common use cases for it, I'd be happy to test and report
them in the cf docs, to refine [2].
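
To illustrate the kind of use case I have in mind; the console
invocation and the model query below are assumptions on my part, not
documented usage:

$ cd /var/vcap/packages/cloud_controller_ng/cloud_controller_ng
$ # open a pry session loaded with the CC code and its database configuration
$ bin/console /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml
pry(main)> VCAP::CloudController::App.count    # e.g. count apps straight from the CCDB
pry(main)> exit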

[1] https://github.com/cloudfoundry/docs-running-cf/pull/7/files
[2] https://github.com/Orange-OpenSource/docs-running-cf/commit/cc46acc26eff50988a652acc3582c7573d6c8c8e

