Re: Error 400007: `stats_z1/0' is not running after update


iamflying
 

It frequently logs the message below. It seems not helpful.

{"timestamp":1442987404.9433253,"message":"collector.started","log_level":"info","source":"collector","data":{},"thread_id":70132569199380,"fiber_id":70132570371720,"process_id":19392,"file":"/var/vcap/packages/collector/lib/collector/config.rb","lineno":45,"method":"setup_logging"}

the only possible error message from the bosh debug log is
"ntp":{"message":"bad ntp server"}

But I don't think, it is related to the failure of stats_z1 updating.

I, [2015-09-23 04:55:59 #2392] [canary_update(stats_z1/0)] INFO --
DirectorJobRunner: Checking if stats_z1/0 has been updated after
63.333333333333336 seconds
D, [2015-09-23 04:55:59 #2392] [canary_update(stats_z1/0)] DEBUG --
DirectorJobRunner: SENT: agent.7d3452bd-679e-4a97-8514-63a373a54ffd
{"method":"get_state","arguments":[],"reply_to":"director.c5b97fc1-b972-47ec-9412-a83ad240823b.473fda64-6ac3-4a53-9ebc-321fc7eabd7a"}
D, [2015-09-23 04:55:59 #2392] [] DEBUG -- DirectorJobRunner: RECEIVED:
director.c5b97fc1-b972-47ec-9412-a83ad240823b.473fda64-6ac3-4a53-9ebc-321fc7eabd7a
{"value":{"properties":{"logging":{"max_log_file_size":""}},"job":{"name":"stats_z1","release":"","template":"fluentd","version":"4c71c87bbf0144428afacd470e2a5e32b91932fc","sha1":"b141c6037d429d732bf3d67f7b79f8d7d80aac5d","blobstore_id":"d8451d63-2e4f-4664-93a8-a77e5419621d","templates":[{"name":"fluentd","version":"4c71c87bbf0144428afacd470e2a5e32b91932fc","sha1":"b141c6037d429d732bf3d67f7b79f8d7d80aac5d","blobstore_id":"d8451d63-2e4f-4664-93a8-a77e5419621d"},{"name":"collector","version":"889b187e2f6adc453c61fd8f706525b60e4b85ed","sha1":"f5ae15a8fa2417bf984513e5c4269f8407a274dc","blobstore_id":"3eeb0166-a75c-49fb-9f28-c29788dbf64d"},{"name":"metron_agent","version":"e6df4c316b71af68dfc4ca476c8d1a4885e82f5b","sha1":"42b6d84ad9368eba0508015d780922a43a86047d","blobstore_id":"e578bfb0-9726-4754-87ae-b54c8940e41a"},{"name":"apaas_collector","version":"8808f0ae627a54706896a784dba47570c92e0c8b","sha1":"b9a63da925b40910445d592c70abcf4d23ffe84d","blobstore_id":"3e6fa71a-07f7-446a-96f4-3caceea02f2f"}]},"packages":{"apaas_collector":{"name":"apaas_collector","version":"f294704d51d4517e4df3d8417a3d7c71699bc04d.1","sha1":"5af77ceb01b7995926dbd4ad7481dcb7c3d94faf","blobstore_id":"fa0e96b9-71a6-4828-416e-dde3427a73a9"},"collector":{"name":"collector","version":"ba47450ce83b8f2249b75c79b38397db249df48b.1","sha1":"0bf8ee0d69b3f21cf1878a43a9616cb7e14f6f25","blobstore_id":"722a5455-f7f7-427d-7e8d-e562552857bc"},"common":{"name":"common","version":"99c756b71550530632e393f5189220f170a69647.1","sha1":"90159de912c9bfc71740324f431ddce1a5fede00","blobstore_id":"37be6f28-c340-4899-7fd3-3517606491bb"},"fluentd-0.12.13":{"name":"fluentd-0.12.13","version":"71d8decbba6c863bff6c325f1f8df621a91eb45f.1","sha1":"2bd32b3d3de59e5dbdd77021417359bb5754b1cf","blobstore_id":"7bc81ac6-7c24-4a94-74d1-bb9930b07751"},"metron_agent":{"name":"metron_agent","version":"997d87534f57cad148d56c5b8362b72e726424e4.1","sha1":"a21404c50562de75000d285a02cd43bf098bfdb9","blobstore_id":"6c7cf72c-9ace-40a1-4632-c27946bf631e"},"ruby-2.1.6":{"name":"ruby-2.1.6","version":"41d0100ffa4b21267bceef055bc84dc37527fa35.1","sha1":"8a9867197682cabf2bc784f71c4d904bc479c898","blobstore_id":"536bc527-3225-43f6-7aad-71f36addec80"}},"configuration_hash":"a73c7d06b0257746e95aaa2ca994c11629cbd324","networks":{"private_cf_subnet":{"cloud_properties":{"name":"random","net_id":"1e1c9aca-0b5a-4a8f-836a-54c18c21c9b9","security_groups":["az1_cf_management_secgroup_bosh_cf_ssh_cf2","az1_cf_management_secgroup_cf_private_cf2","az1_cf_management_secgroup_cf_public_cf2"]},"default":["dns","gateway"],"dns":["192.168.110.8","133.162.193.10","133.162.193.9","192.168.110.10"],"dns_record_name":"0.stats-z1.private-cf-subnet.cf-apaas.microbosh","gateway":"192.168.110.11","ip":"192.168.110.204","netmask":"255.255.255.0"}},"resource_pool":{"cloud_properties":{"instance_type":"S-1"},"name":"small_z1","stemcell":{"name":"bosh-openstack-kvm-ubuntu-trusty-go_agent","version":"2989"}},"deployment":"cf-apaas","index":0,"persistent_disk":0,"persistent_disk_pool":null,"rendered_templates_archive":{"sha1":"0ffd89fa41e02888c9f9b09c6af52ea58265a8ec","blobstore_id":"4bd01ae7-a69a-4fe5-932b-d98137585a3b"},"agent_id":"7d3452bd-679e-4a97-8514-63a373a54ffd","bosh_protocol":"1","job_state":"failing","vm":{"name":"vm-12d45510-096d-4b8b-9547-73ea5fda00c2"},"ntp":{"message":"bad
ntp server"}}}

On Wed, Sep 23, 2015 at 5:13 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Please check the file collector/collector.log, it's in a subdirectory of
the unpacked log tarball.

On Wed, Sep 23, 2015 at 12:01 AM, Guangcai Wang <guangcai.wang(a)gmail.com>
wrote:

Actually, I checked the two files in status_z1 job VM. I did not find any
clues. Attached for reference.

On Wed, Sep 23, 2015 at 4:54 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

If you do "bosh logs stats_z1 0 --job" you will get a tarball of all
the logs for the relevant processes running on the stats_z1/0 VM. You will
likely find some error messages in the collectors stdout or stderr logs.

On Tue, Sep 22, 2015 at 11:30 PM, Guangcai Wang <guangcai.wang(a)gmail.com
wrote:
It does not help.

I always see the "collector" process bouncing between "running" and
"does not exit" when I use "monit summary" in a while loop.

Who knows how to get the real error when the "collector" process is not
failed? Thanks.

On Wed, Sep 23, 2015 at 4:11 PM, Tony <Tonyl(a)fast.au.fujitsu.com>
wrote:

My approach is to login on the stats vm and sudo, then
run "monit status" and restart the failed processes or simply restart
all
processes by running "monit restart all"

wait for a while(5~10 minutes at most)
If there is still some failed process, e.g. collector
then run ps -ef | grep collector
and kill the processes in the list(may be you need to run kill -9
sometimes)

then "monit restart all"

Normally, it will fix the issue "Failed: `XXX' is not running after
update"



--
View this message in context:
http://cf-dev.70369.x6.nabble.com/cf-dev-Error-400007-stats-z1-0-is-not-running-after-update-tp1901p1902.html
Sent from the CF Dev mailing list archive at Nabble.com.

Join cf-dev@lists.cloudfoundry.org to automatically receive all group messages.