Hi all,
I am installing cf v212 with stemcell 2989. However, the deployment fails on the stats_z1 job.
Started preparing configuration > Binding configuration. Done (00:00:04)
Started updating job ha_proxy_z1 > ha_proxy_z1/0 (canary). Done (00:00:55)
Started updating job nats_z1 > nats_z1/0 (canary). Done (00:00:56)
Started updating job etcd_z1 > etcd_z1/0 (canary). Done (00:01:35)
Started updating job stats_z1 > stats_z1/0 (canary). Failed: `stats_z1/0' is not running after update (00:10:26)
Error 400007: `stats_z1/0' is not running after update.
I checked the debug log but could not find any useful information (attached for reference). I also checked the logs on the stats_z1 job VM but found no error messages there.
Any suggestions for investigating this issue? Thanks.
On Wed, Sep 23, 2015 at 4:11 PM, Tony <Tonyl(a)fast.au.fujitsu.com> wrote:
My approach is to log in to the stats VM and sudo, then run "monit status" and restart the failed processes, or simply restart all processes with "monit restart all". Wait a while (5-10 minutes at most). If some process is still failing, e.g. collector, run "ps -ef | grep collector" and kill the processes in the list (sometimes you may need kill -9), then run "monit restart all" again.
Normally this fixes the "Failed: `XXX' is not running after update" issue.
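A rough sketch of that sequence on the job VM, using "collector" as the example process (substitute whichever process monit reports as failing; the exact steps may vary per deployment):

    sudo -i
    monit status                     # see which processes monit reports as failing
    monit restart all
    watch monit summary              # give it 5-10 minutes to settle
    ps -ef | grep collector          # if a process is still stuck, find its PIDs
    kill <pid>                       # escalate to kill -9 <pid> if it will not die
    monit restart all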
On Tue, Sep 22, 2015 at 11:30 PM, Guangcai Wang <guangcai.wang(a)gmail.com> wrote:
It does not help.
I always see the "collector" process bouncing between "running" and "Does not exist" when I watch "monit summary" in a while loop.
Does anyone know how to get at the real error when the "collector" process is not reported as failed? Thanks.
On Wed, Sep 23, 2015 at 4:54 PM, Amit Gupta <agupta(a)pivotal.io> wrote:
If you do "bosh logs stats_z1 0 --job" you will get a tarball of all the logs for the relevant processes running on the stats_z1/0 VM. You will likely find some error messages in the collector's stdout or stderr logs.
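For example, roughly like this (the exact tarball name varies; this sketch assumes it is downloaded into the current directory and that each process has its own subdirectory of logs, e.g. collector/):

    bosh logs stats_z1 0 --job
    mkdir stats_z1_logs
    tar xzf stats_z1.0.*.tgz -C stats_z1_logs   # adjust to the tarball name bosh logs produced
    ls stats_z1_logs                            # one subdirectory per process
    grep -i error stats_z1_logs/collector/*.log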
On Wed, Sep 23, 2015 at 12:01 AM, Guangcai Wang <guangcai.wang(a)gmail.com> wrote:
Actually, I checked the two files in the stats_z1 job VM. I did not find any clues. Attached for reference.
On Wed, Sep 23, 2015 at 5:13 PM, Amit Gupta <agupta(a)pivotal.io> wrote:
Please check the file collector/collector.log; it's in a subdirectory of the unpacked log tarball.
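For instance, from the directory where the tarball was unpacked (just a quick way to scan that file, not the only place to look):

    tail -n 100 collector/collector.log
    grep -i -E "error|fail" collector/collector.log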
On Wed, Sep 23, 2015 at 12:21 AM, Guangcai Wang <guangcai.wang(a)gmail.com> wrote:
It frequently logs the message below, which does not seem helpful.
{"timestamp":1442987404.9433253,"message":"collector.started","log_level":"info","source":"collector","data":{},"thread_id":70132569199380,"fiber_id":70132570371720,"process_id":19392,"file":"/var/vcap/packages/collector/lib/collector/config.rb","lineno":45,"method":"setup_logging"}
The only possible error message from the bosh debug log is "ntp":{"message":"bad ntp server"}, but I don't think it is related to the stats_z1 update failure.
I, [2015-09-23 04:55:59 #2392] [canary_update(stats_z1/0)] INFO -- DirectorJobRunner: Checking if stats_z1/0 has been updated after 63.333333333333336 seconds D, [2015-09-23 04:55:59 #2392] [canary_update(stats_z1/0)] DEBUG -- DirectorJobRunner: SENT: agent.7d3452bd-679e-4a97-8514-63a373a54ffd {"method":"get_state","arguments":[],"reply_to":"director.c5b97fc1-b972-47ec-9412-a83ad240823b.473fda64-6ac3-4a53-9ebc-321fc7eabd7a"} D, [2015-09-23 04:55:59 #2392] [] DEBUG -- DirectorJobRunner: RECEIVED: director.c5b97fc1-b972-47ec-9412-a83ad240823b.473fda64-6ac3-4a53-9ebc-321fc7eabd7a {"value":{"properties":{"logging":{"max_log_file_size":""}},"job":{"name":"stats_z1","release":"","template":"fluentd","version":"4c71c87bbf0144428afacd470e2a5e32b91932fc","sha1":"b141c6037d429d732bf3d67f7b79f8d7d80aac5d","blobstore_id":"d8451d63-2e4f-4664-93a8-a77e5419621d","templates":[{"name":"fluentd","version":"4c71c87bbf0144428afacd470e2a5e32b91932fc","sha1":"b141c6037d429d732bf3d67f7b79f8d7d80aac5d","blobstore_id":"d8451d63-2e4f-4664-93a8-a77e5419621d"},{"name":"collector","version":"889b187e2f6adc453c61fd8f706525b60e4b85ed","sha1":"f5ae15a8fa2417bf984513e5c4269f8407a274dc","blobstore_id":"3eeb0166-a75c-49fb-9f28-c29788dbf64d"},{"name":"metron_agent","version":"e6df4c316b71af68dfc4ca476c8d1a4885e82f5b","sha1":"42b6d84ad9368eba0508015d780922a43a86047d","blobstore_id":"e578bfb0-9726-4754-87ae-b54c8940e41a"},{"name":"apaas_collector","version":"8808f0ae627a54706896a784dba47570c92e0c8b","sha1":"b9a63da925b40910445d592c70abcf4d23ffe84d","blobstore_id":"3e6fa71a-07f7-446a-96f4-3caceea02f2f"}]},"packages":{"apaas_collector":{"name":"apaas_collector","version":"f294704d51d4517e4df3d8417a3d7c71699bc04d.1","sha1":"5af77ceb01b7995926dbd4ad7481dcb7c3d94faf","blobstore_id":"fa0e96b9-71a6-4828-416e-dde3427a73a9"},"collector":{"name":"collector","version":"ba47450ce83b8f2249b75c79b38397db249df48b.1","sha1":"0bf8ee0d69b3f21cf1878a43a9616cb7e14f6f25","blobstore_id":"722a5455-f7f7-427d-7e8d-e562552857bc"},"common":{"name":"common","version":"99c756b71550530632e393f5189220f170a69647.1","sha1":"90159de912c9bfc71740324f431ddce1a5fede00","blobstore_id":"37be6f28-c340-4899-7fd3-3517606491bb"},"fluentd-0.12.13":{"name":"fluentd-0.12.13","version":"71d8decbba6c863bff6c325f1f8df621a91eb45f.1","sha1":"2bd32b3d3de59e5dbdd77021417359bb5754b1cf","blobstore_id":"7bc81ac6-7c24-4a94-74d1-bb9930b07751"},"metron_agent":{"name":"metron_agent","version":"997d87534f57cad148d56c5b8362b72e726424e4.1","sha1":"a21404c50562de75000d285a02cd43bf098bfdb9","blobstore_id":"6c7cf72c-9ace-40a1-4632-c27946bf631e"},"ruby-2.1.6":{"name":"ruby-2.1.6","version":"41d0100ffa4b21267bceef055bc84dc37527fa35.1","sha1":"8a9867197682cabf2bc784f71c4d904bc479c898","blobstore_id":"536bc527-3225-43f6-7aad-71f36addec80"}},"configuration_hash":"a73c7d06b0257746e95aaa2ca994c11629cbd324","networks":{"private_cf_subnet":{"cloud_properties":{"name":"random","net_id":"1e1c9aca-0b5a-4a8f-836a-54c18c21c9b9","security_groups":["az1_cf_management_secgroup_bosh_cf_ssh_cf2","az1_cf_management_secgroup_cf_private_cf2","az1_cf_management_secgroup_cf_public_cf2"]},"default":["dns","gateway"],"dns":["192.168.110.8","133.162.193.10","133.162.193.9","192.168.110.10"],"dns_record_name":"0.stats-z1.private-cf-subnet.cf-apaas.microbosh","gateway":"192.168.110.11","ip":"192.168.110.204","netmask":"255.255.255.0"}},"resource_pool":{"cloud_properties":{"instance_type":"S-1"},"name":"small_z1","stemcell":{"name":"bosh-openstack-kvm-ubuntu-trusty-go_agent","version":"2989"}},"deployment":"cf-apaas","ind
ex":0,"persistent_disk":0,"persistent_disk_pool":null,"rendered_templates_archive":{"sha1":"0ffd89fa41e02888c9f9b09c6af52ea58265a8ec","blobstore_id":"4bd01ae7-a69a-4fe5-932b-d98137585a3b"},"agent_id":"7d3452bd-679e-4a97-8514-63a373a54ffd","bosh_protocol":"1","job_state":"failing","vm":{"name":"vm-12d45510-096d-4b8b-9547-73ea5fda00c2"},"ntp":{"message":"bad ntp server"}}}
On Thu, Sep 24, 2015 at 2:47 AM, Amit Gupta <agupta(a)pivotal.io> wrote:
I often take the following approach to debugging issues like this:
* Open two shell sessions to your failing VM using bosh ssh, and switch to superuser.
* In one session, `watch monit summary`. You might see collector going back and forth between initializing and not monitored, but please report anything else of interest you see here.
* In the other session, `cd /var/vcap/sys/log` and then `watch --differences=cumulative ls -altr **/*` to see which files are being written to while the startup processes are thrashing. Then `tail -f FILE_1 FILE_2 ...`, listing all the files that were being written to and seem relevant to the thrashing process(es) in monit.
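Concretely, the two sessions might look something like this (a sketch; how you reach the VM with bosh ssh depends on your setup, and the tailed file names are only examples):

    # session 1
    bosh ssh stats_z1 0
    sudo -i
    watch monit summary

    # session 2
    bosh ssh stats_z1 0
    sudo -i
    cd /var/vcap/sys/log
    watch --differences=cumulative ls -altr **/*
    # then tail whichever files are actively being written, e.g.:
    tail -f collector/collector.log metron_agent/metron_agent.stdout.log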
On Wed, Sep 23, 2015 at 7:03 PM, Guangcai Wang <guangcai.wang(a)gmail.com> wrote:
That did help. It showed us the real error.
==> metron_agent/metron_agent.stdout.log <==
{"timestamp":1443054247.927488327,"process_id":23472,"source":"metron","log_level":"warn","message":"Failed to create client: Could not connect to NATS: dial tcp 192.168.110.202:4222: i/o timeout","data":null,"file":"/var/vcap/data/compile/metron_agent/loggregator/src/github.com/cloudfoundry/loggregatorlib/cfcomponent/registrars/collectorregistrar/collector_registrar.go","line":51,"method":"github.com/cloudfoundry/loggregatorlib/cfcomponent/registrars/collectorregistrar.(*CollectorRegistrar).Run"}
I checked the security group rules; they seem to have some problems.
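If useful, a quick way to confirm from the stats_z1 VM whether NATS is reachable once the rules are adjusted (192.168.110.202:4222 is the address from the error above; assumes netcat is available on the stemcell):

    nc -zv -w 5 192.168.110.202 4222    # should report success once the security
                                        # group permits traffic to the NATS port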
Okay, please let me know if you are able to fix your security group settings and whether the original problem gets resolved.
Amit