Re: Error 400007: `stats_z1/0' is not running after update


Amit Kumar Gupta
 

I often take the following approach to debugging issues like this:

* Open two shell sessions to your failing VM using bosh ssh, and switch to
superuser
* In one session, `watch monit summary`. You might see collector going back
and forth between initializing and not monitored, but please report
anything else of interest you see here
* In the other session, `cd /var/vcap/sys/log` and then `watch
--differences=cumulative ls -altr **/*` to see which files are being
written to while the startup processes are thrashing. Then `tail -f FILE_1
FILE_2 ...` listing all the files that were being written to, and seem
relevant to the thrashing process(es) in monit


On Wed, Sep 23, 2015 at 12:21 AM, Guangcai Wang <guangcai.wang(a)gmail.com>
wrote:

It frequently logs the message below. It seems not helpful.


{"timestamp":1442987404.9433253,"message":"collector.started","log_level":"info","source":"collector","data":{},"thread_id":70132569199380,"fiber_id":70132570371720,"process_id":19392,"file":"/var/vcap/packages/collector/lib/collector/config.rb","lineno":45,"method":"setup_logging"}

the only possible error message from the bosh debug log is
"ntp":{"message":"bad ntp server"}

But I don't think, it is related to the failure of stats_z1 updating.

I, [2015-09-23 04:55:59 #2392] [canary_update(stats_z1/0)] INFO --
DirectorJobRunner: Checking if stats_z1/0 has been updated after
63.333333333333336 seconds
D, [2015-09-23 04:55:59 #2392] [canary_update(stats_z1/0)] DEBUG --
DirectorJobRunner: SENT: agent.7d3452bd-679e-4a97-8514-63a373a54ffd
{"method":"get_state","arguments":[],"reply_to":"director.c5b97fc1-b972-47ec-9412-a83ad240823b.473fda64-6ac3-4a53-9ebc-321fc7eabd7a"}
D, [2015-09-23 04:55:59 #2392] [] DEBUG -- DirectorJobRunner: RECEIVED:
director.c5b97fc1-b972-47ec-9412-a83ad240823b.473fda64-6ac3-4a53-9ebc-321fc7eabd7a
{"value":{"properties":{"logging":{"max_log_file_size":""}},"job":{"name":"stats_z1","release":"","template":"fluentd","version":"4c71c87bbf0144428afacd470e2a5e32b91932fc","sha1":"b141c6037d429d732bf3d67f7b79f8d7d80aac5d","blobstore_id":"d8451d63-2e4f-4664-93a8-a77e5419621d","templates":[{"name":"fluentd","version":"4c71c87bbf0144428afacd470e2a5e32b91932fc","sha1":"b141c6037d429d732bf3d67f7b79f8d7d80aac5d","blobstore_id":"d8451d63-2e4f-4664-93a8-a77e5419621d"},{"name":"collector","version":"889b187e2f6adc453c61fd8f706525b60e4b85ed","sha1":"f5ae15a8fa2417bf984513e5c4269f8407a274dc","blobstore_id":"3eeb0166-a75c-49fb-9f28-c29788dbf64d"},{"name":"metron_agent","version":"e6df4c316b71af68dfc4ca476c8d1a4885e82f5b","sha1":"42b6d84ad9368eba0508015d780922a43a86047d","blobstore_id":"e578bfb0-9726-4754-87ae-b54c8940e41a"},{"name":"apaas_collector","version":"8808f0ae627a54706896a784dba47570c92e0c8b","sha1":"b9a63da925b40910445d592c70abcf4d23ffe84d","blobstore_id":"3e6fa71a-07f7-446a-96f4-3caceea02f2f"}]},"packages":{"apaas_collector":{"name":"apaas_collector","version":"f294704d51d4517e4df3d8417a3d7c71699bc04d.1","sha1":"5af77ceb01b7995926dbd4ad7481dcb7c3d94faf","blobstore_id":"fa0e96b9-71a6-4828-416e-dde3427a73a9"},"collector":{"name":"collector","version":"ba47450ce83b8f2249b75c79b38397db249df48b.1","sha1":"0bf8ee0d69b3f21cf1878a43a9616cb7e14f6f25","blobstore_id":"722a5455-f7f7-427d-7e8d-e562552857bc"},"common":{"name":"common","version":"99c756b71550530632e393f5189220f170a69647.1","sha1":"90159de912c9bfc71740324f431ddce1a5fede00","blobstore_id":"37be6f28-c340-4899-7fd3-3517606491bb"},"fluentd-0.12.13":{"name":"fluentd-0.12.13","version":"71d8decbba6c863bff6c325f1f8df621a91eb45f.1","sha1":"2bd32b3d3de59e5dbdd77021417359bb5754b1cf","blobstore_id":"7bc81ac6-7c24-4a94-74d1-bb9930b07751"},"metron_agent":{"name":"metron_agent","version":"997d87534f57cad148d56c5b8362b72e726424e4.1","sha1":"a21404c50562de75000d285a02cd43bf098bfdb9","blobstore_id":"6c7cf72c-9ace-40a1-4632-c27946bf631e"},"ruby-2.1.6":{"name":"ruby-2.1.6","version":"41d0100ffa4b21267bceef055bc84dc37527fa35.1","sha1":"8a9867197682cabf2bc784f71c4d904bc479c898","blobstore_id":"536bc527-3225-43f6-7aad-71f36addec80"}},"configuration_hash":"a73c7d06b0257746e95aaa2ca994c11629cbd324","networks":{"private_cf_subnet":{"cloud_properties":{"name":"random","net_id":"1e1c9aca-0b5a-4a8f-836a-54c18c21c9b9","security_groups":["az1_cf_management_secgroup_bosh_cf_ssh_cf2","az1_cf_management_secgroup_cf_private_cf2","az1_cf_management_secgroup_cf_public_cf2"]},"default":["dns","gateway"],"dns":["192.168.110.8","133.162.193.10","133.162.193.9","192.168.110.10"],"dns_record_name":"0.stats-z1.private-cf-subnet.cf-apaas.microbosh","gateway":"192.168.110.11","ip":"192.168.110.204","netmask":"255.255.255.0"}},"resource_pool":{"cloud_properties":{"instance_type":"S-1"},"name":"small_z1","stemcell":{"name":"bosh-openstack-kvm-ubuntu-trusty-go_agent","version":"2989"}},"deployment":"cf-apaas","index":0,"persistent_disk":0,"persistent_disk_pool":null,"rendered_templates_archive":{"sha1":"0ffd89fa41e02888c9f9b09c6af52ea58265a8ec","blobstore_id":"4bd01ae7-a69a-4fe5-932b-d98137585a3b"},"agent_id":"7d3452bd-679e-4a97-8514-63a373a54ffd","bosh_protocol":"1","job_state":"failing","vm":{"name":"vm-12d45510-096d-4b8b-9547-73ea5fda00c2"},"ntp":{"message":"bad
ntp server"}}}


On Wed, Sep 23, 2015 at 5:13 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Please check the file collector/collector.log, it's in a subdirectory of
the unpacked log tarball.

On Wed, Sep 23, 2015 at 12:01 AM, Guangcai Wang <guangcai.wang(a)gmail.com>
wrote:

Actually, I checked the two files in status_z1 job VM. I did not find
any clues. Attached for reference.

On Wed, Sep 23, 2015 at 4:54 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

If you do "bosh logs stats_z1 0 --job" you will get a tarball of all
the logs for the relevant processes running on the stats_z1/0 VM. You will
likely find some error messages in the collectors stdout or stderr logs.

On Tue, Sep 22, 2015 at 11:30 PM, Guangcai Wang <
guangcai.wang(a)gmail.com> wrote:

It does not help.

I always see the "collector" process bouncing between "running" and
"does not exit" when I use "monit summary" in a while loop.

Who knows how to get the real error when the "collector" process is
not failed? Thanks.

On Wed, Sep 23, 2015 at 4:11 PM, Tony <Tonyl(a)fast.au.fujitsu.com>
wrote:

My approach is to login on the stats vm and sudo, then
run "monit status" and restart the failed processes or simply restart
all
processes by running "monit restart all"

wait for a while(5~10 minutes at most)
If there is still some failed process, e.g. collector
then run ps -ef | grep collector
and kill the processes in the list(may be you need to run kill -9
sometimes)

then "monit restart all"

Normally, it will fix the issue "Failed: `XXX' is not running after
update"



--
View this message in context:
http://cf-dev.70369.x6.nabble.com/cf-dev-Error-400007-stats-z1-0-is-not-running-after-update-tp1901p1902.html
Sent from the CF Dev mailing list archive at Nabble.com.

Join cf-dev@lists.cloudfoundry.org to automatically receive all group messages.