Re: Job is not running after update - agent/monit race issue?


Danny Berger <dpb587@...>

Thanks for the suggestions. No `depends on` directives though, and
canary/update watch times were set to `30000-120000`.

I was thinking it was more that monit doesn't get a chance to finish
responding to and executing the start call before the monit process
reloads itself.

On Thu, Jun 4, 2015 at 6:42 PM, Dmitriy Kalinin <dkalinin(a)pivotal.io> wrote:

Do you use 'depends on' directives? Are you sure your deployment's
`update` options give monit enough time to spin up the processes and have
them running?

On Thu, Jun 4, 2015 at 4:31 PM, Danny Berger <dpb587(a)gmail.com> wrote:

Frequently, when running a deploy (this happens in multiple deployments), a
job will randomly fail with "job/0 is not running after update" for no
apparent reason. I can simply rerun `bosh deploy` and it will succeed on
that job and move on to the next job in the update (which might also fail).
Alternatively, I can SSH in, where monit shows one or more processes as
"not monitored"; if I then run `monit start all`, it starts the remaining
processes without fail. Looking into this behavior more today, I think it
might be some strange interaction between bosh-agent and monit.

In a good job, everything updates/restarts as expected (logs here [1]),
but on a problem job I've noticed a key difference: monit receives `start
service` very early [2] but never actually invokes the start action for it.
In the bad log [3] you'll see there are only 3 "start: " and "start action
done" messages, yet there are 4 "start service" messages. In the good job
logs there are always 4 of each of those messages. Here is a second
example [4] where two services fail to start.
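For anyone comparing their own logs, the mismatch can be spotted by counting
the three message types with `grep`. The sample log below is fabricated to
mirror the counts described above (4 requests, only 3 completed starts); the
message wording follows the fragments quoted here, not an exact monit log
format, and the `/tmp` path is just for the demo.

```shell
# Fabricated excerpt reproducing the described mismatch: 4 "start service"
# requests arrive, but only 3 start actions are ever executed.
cat > /tmp/monit-sample.log <<'EOF'
start service 'a'
start service 'b'
start service 'c'
start service 'd'
'a' start: /bin/sh
'a' start action done
'b' start: /bin/sh
'b' start action done
'c' start: /bin/sh
'c' start action done
EOF

# In a healthy update all three counts should be equal (one per process).
for pattern in 'start service' 'start: ' 'start action done'; do
  printf '%-20s %s\n' "$pattern" "$(grep -c "$pattern" /tmp/monit-sample.log)"
done
```

On a real VM you would point the `grep` at the monit log on the job instead
of the sample file.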

In all the cases I'm seeing, if the "start service" call(s) are logged
before the final "monit HTTP server stopped/started" messages occur, they
appear to get lost and the start command is never run. Theorizing... is it
possible that bosh-agent is asynchronously sending start commands alongside
SIGHUP? Or that monit is randomly, strangely slow to process the SIGHUP
versus the HTTP request? Or that those monit starts are simply sent too
quickly after a reload?

These logs are from a deployment using
bosh-aws-xen-ubuntu-trusty-go_agent/2798 with the logsearch +
logsearch-shipper releases. Sorry the stemcell isn't newer; looking
through the bosh-agent and bosh commit logs, though, I don't see any
messages referencing a fix for this sort of thing, so hopefully the log
details are still relevant. Given the log messages, I don't think it's
release- or deployment-specific, but I don't have much experience deploying
many other things to know for sure.

If anybody has any insight into this strangeness, I'd definitely
appreciate it. The while-loop workaround we've been using works, but it's
not so great for automation.
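For context, the workaround is roughly the following sketch: poll `monit
summary` and re-issue `monit start all` until nothing shows "not
monitored". The original post doesn't include the actual script, so this is
an illustrative reconstruction; to keep it self-contained, a fake `monit`
stub stands in for the real binary (on a BOSH VM you would call
`/var/vcap/bosh/bin/monit` directly), and the retry limit is arbitrary.

```shell
# Stub monit for the demo: reports "not monitored" for the first two
# summary calls, then "running" -- simulating a process that eventually
# comes up after repeated "start all" kicks.
mkdir -p /tmp/monit-demo
cat > /tmp/monit-demo/monit <<'EOF'
#!/bin/sh
count_file=/tmp/monit-demo/count
n=$(cat "$count_file" 2>/dev/null || echo 0)
case "$1" in
  summary)
    echo $((n + 1)) > "$count_file"
    if [ "$n" -lt 2 ]; then echo "Process 'worker'  not monitored"
    else echo "Process 'worker'  running"; fi ;;
  start) : ;;  # pretend to start processes
esac
EOF
chmod +x /tmp/monit-demo/monit
MONIT=/tmp/monit-demo/monit   # real path would be /var/vcap/bosh/bin/monit

# The workaround loop: keep kicking unmonitored processes until clean.
attempts=0
while "$MONIT" summary | grep -q 'not monitored'; do
  "$MONIT" start all
  attempts=$((attempts + 1))
  [ "$attempts" -ge 30 ] && { echo 'gave up waiting for monit' >&2; break; }
done
echo "all processes running after $attempts retries"
```

With the stub above, the loop retries twice before the summary comes back
clean; the same shape works against real monit, at the cost of polling.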

Thanks!

Danny

[1]
https://gist.github.com/dpb587/ad44bb34aabab1c4a98e#file-monit-good-summary-log
[2]
https://gist.github.com/dpb587/ad44bb34aabab1c4a98e#file-monit-bad-log-L44
[3]
https://gist.github.com/dpb587/ad44bb34aabab1c4a98e#file-monit-bad-log
[4]
https://gist.github.com/dpb587/ad44bb34aabab1c4a98e#file-monit-bad2-log


--
Danny Berger
http://dpb587.me

_______________________________________________
cf-bosh mailing list
cf-bosh(a)lists.cloudfoundry.org
https://lists.cloudfoundry.org/mailman/listinfo/cf-bosh
