Job is not running after update - agent/monit race issue?
Danny Berger <dpb587@...>
Frequently when doing a deploy (this happens in multiple deployments), a job will randomly fail with "job/0 is not running after update" for no logical reason. I can just rerun `bosh deploy` and it will succeed on that job and move on to the next job for update (which might also fail). Alternatively, I can SSH in and monit will show one or more processes as "not monitored", yet if I run `monit start all` it starts the remaining processes without fail.

Looking into this behavior more today, I think it might be some strange interaction between bosh-agent and monit. In a good job, everything updates/restarts as expected (logs here [1]), but on a problem job I've noticed a key difference: monit receives `start service` very early [2] but never actually invokes the start action for it. In the bad log [3] you'll see there are only 3 "start: " and "start action done" messages, yet there are 4 "start service" messages. In the good job logs, there are always 4 of each of those messages. Here is a second example [4] where two services fail to start.

In all cases that I'm seeing, if the "start service" call(s) are logged before the final "monit HTTP server stopped/started" messages occur, then they appear to get lost and the start command is never run.

Theorizing... is it possible that bosh-agent is asynchronously sending start commands alongside SIGHUP? Or perhaps that monit is randomly, strangely slow to process the SIGHUP vs. the HTTP request? Or perhaps those monit starts are just sent too quickly after a reload?

These logs were from a deployment using bosh-aws-xen-ubuntu-trusty-go_agent/2798 with the logsearch + logsearch-shipper releases. Sorry the stemcell isn't newer. Looking through the bosh-agent and bosh commit logs, though, I don't see messages which reference a fix for this sort of thing, so hopefully the log details are still relevant. I don't think it's release or deployment specific given the log message, but I don't have enough experience deploying other things to know for sure.
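For anyone wanting to check their own monit logs for the same mismatch, here is a rough sketch of the comparison described above: it counts "start service" requests per service and subtracts the "start action done" messages. The regular expressions are an assumption based only on the message fragments quoted in this post, and the exact monit log format may differ between versions. The function name `lost_starts`, the service names, and the sample lines are made up for illustration and are not taken from the gist.

```python
import re
from collections import Counter

# Assumed line shapes, inferred from the message fragments quoted above.
# Adjust these if your monit version logs them differently.
REQUESTED = re.compile(r"start service '([^']+)'")
DONE = re.compile(r"'([^']+)' start action done")


def lost_starts(log_text):
    """Return services with more start requests than completed start actions."""
    requested = Counter(REQUESTED.findall(log_text))
    done = Counter(DONE.findall(log_text))
    # Counter subtraction keeps only the positive differences.
    return dict(requested - done)


# Made-up sample log: 'ingestor' is requested before the HTTP server
# restart and never gets a start action; 'parser' starts normally.
sample = """\
start service 'ingestor' on user request
monit HTTP server stopped
monit HTTP server started
start service 'parser' on user request
'parser' start: /hypothetical/path/ctl
'parser' start action done
"""

if __name__ == "__main__":
    print(lost_starts(sample))
```

An empty result would correspond to the "good" case where every request has a matching start action; any service listed would be one whose start request appears to have been dropped.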
If anybody has any insight into this strangeness, I'd definitely appreciate it. The while loop workaround we've been using works, but it's not so great for automation.

Thanks!

Danny

[1] https://gist.github.com/dpb587/ad44bb34aabab1c4a98e#file-monit-good-summary-log
[2] https://gist.github.com/dpb587/ad44bb34aabab1c4a98e#file-monit-bad-log-L44
[3] https://gist.github.com/dpb587/ad44bb34aabab1c4a98e#file-monit-bad-log
[4] https://gist.github.com/dpb587/ad44bb34aabab1c4a98e#file-monit-bad2-log

--
Danny Berger
http://dpb587.me