Re: Job is not running after update - agent/monit race issue?

Danny Berger <dpb587@...>

Hi again - I still haven't been able to track down why the monit processes
are sometimes not restarted. Does anybody have ideas for other things I can
try or do to further debug this issue?

Any help is appreciated - thanks!


On Thu, Jun 4, 2015 at 5:31 PM, Danny Berger <dpb587(a)> wrote:

Frequently when doing a deploy (happens in multiple deployments) a job
will randomly fail with "job/0 is not running after update" for no apparent
reason. I can just rerun `bosh deploy` and it will succeed on that job and
move on to the next job for update (which might also fail). Alternatively, I
can SSH in and monit will show one or more processes as "not monitored",
yet if I run `monit start all` it does start the remaining processes
without fail. Looking into this behavior more today, I think it might be
some strange interaction between bosh-agent and monit.

In a good job, everything updates/restarts as expected (logs here [1]),
but on a problem job, I've noticed a key difference: monit receives `start
service` very early [2] but never actually invokes the start action for it.
In the bad log [3] you'll see there are only 3 "start: " and "start action
done" messages, yet there are 4 "start service" messages. In the good job
logs, there would always be 4 of each of those messages. Here is a second
example [4] where two services fail to start.
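
For anyone who wants to run the same comparison on their own monit logs, here is a rough Python sketch of the counting described above. It only looks for the three strings quoted in this message ("start service", "start: " and "start action done"); the function names are made up for illustration, and the exact log wording may differ between monit versions. Plain substring matching would also count "restart service" or "restart: " lines, so treat the numbers as a hint rather than proof.

```python
# Marker strings as quoted in this message; monit's exact wording may vary.
MARKERS = ("start service", "start: ", "start action done")


def count_start_markers(log_text):
    """Count how many log lines contain each marker string."""
    counts = {marker: 0 for marker in MARKERS}
    for line in log_text.splitlines():
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


def lost_starts(log_text):
    """Number of start requests with no matching start invocation.

    A positive result matches the "bad" pattern described above
    (e.g. 4 requests but only 3 invocations).
    """
    counts = count_start_markers(log_text)
    return counts["start service"] - counts["start: "]
```

Reading a monit log file into a string and passing it to `lost_starts` should return 0 for a job that updated cleanly, under the assumptions above.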

In all cases that I'm seeing, if the "start service" call(s) are logged
before those final "monit HTTP server stopped/started" messages occur, then
they appear to get lost and the start command is never run. Theorizing... is
it possible that bosh-agent is asynchronously sending start commands
alongside SIGHUP? Or perhaps that monit is randomly, strangely slow to
process the SIGHUP vs HTTP request? Or perhaps those monit starts are just
sent too quickly after a reload?

These logs were from a deployment using
bosh-aws-xen-ubuntu-trusty-go_agent/2798 with the logsearch +
logsearch-shipper releases. Sorry the stemcell isn't newer - looking
through bosh-agent and bosh commit logs though I don't see messages which
reference a fix for this sort of thing, so hopefully the log details are
still relevant. I don't think it's release or deployment specific given the
log message, but I don't have much experience deploying many other things
to know for sure.

If anybody has any insight into this strangeness, I'd definitely
appreciate it. The while loop workaround we've been using works, but it's
not so great for automation.
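
The loop itself is not shown in this message. As a rough sketch only, and assuming it simply reruns the deploy command until it exits successfully, it might look like the following in Python. The function name, the attempt limit and the delay are all hypothetical choices for illustration.

```python
import subprocess
import time


def retry_until_success(command, max_attempts=5, delay_seconds=10.0,
                        run=subprocess.call):
    """Rerun `command` until it exits with status 0.

    Returns the attempt number that succeeded, or None if every attempt
    failed. `run` is injectable so the loop can be exercised without
    actually invoking the command.
    """
    for attempt in range(1, max_attempts + 1):
        if run(command) == 0:
            return attempt
        if attempt < max_attempts:
            # Pause before retrying; the right delay is a guess.
            time.sleep(delay_seconds)
    return None
```

Calling `retry_until_success(["bosh", "deploy"])` would rerun the deploy up to five times. Capping the attempts is a deliberate choice here so that a genuinely broken deploy does not loop forever.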




Danny Berger
