A hung etcd used by hm9000 delays the detection of crashed application instances

Masumi Ito


I found that one hung etcd node delayed the detection of crashed
application instances, resulting in slow recovery. Although the delay
depended on which hm9000 processes were connected to which etcd VM,
recovery took up to approximately 15 min, which I think is too long.

Does anyone know how to calculate the time it takes for hm9000 to detect a
hung etcd VM and switch to the healthy etcds? I have encountered two
different scenarios:

1. The hm9000 analyzer was connected to the hung etcd while the hm9000
listener was connected to a healthy etcd. (It took about 8 min for analysis
to recover; the other hm9000 analyzer took over.)
The analyzer appeared to hang right after the etcd it was connected to
hung, because "Analyzer completed succesfully" no longer appeared in
the log.
After approximately 8 min, the other hm9000 analyzer acquired the
lock and started working instead. It then identified the crashed instance
and enqueued a start message. The crashed app was relaunched within 10 min
of the detection.

2. The hm9000 analyzer was connected to a healthy etcd while the hm9000
listener was connected to the hung etcd. (It took about 15 min for the
listener to recover; the same hm9000 listener seemed to recover somehow.)
The listener started failing to sync heartbeats right after the etcd it was
connected to hung. After 15 min, "Save took too long. Not bumping freshness."
appeared in the listener's log, and the analyzer then complained about the
stale actual state ("Analyzer failed with error - Error:Actual state is not
fresh") and stopped analyzing tasks. After 10 sec the hm9000 listener
recovered somehow and started bumping freshness periodically again. The
analyzer then resumed analyzing the actual and desired state and raised a
request to start the crashed instance.
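
For illustration only, the kind of failover being asked about can be
sketched as below. This is not hm9000's code or its etcd client: the
function first_responsive, the probe callable, the endpoint names and the
timeout values are all hypothetical. The sketch only shows that when each
request is bounded by a timeout, the time to move past hung endpoints is
roughly the number of hung endpoints tried times that timeout, whereas a
request with no timeout can block indefinitely.

```python
import concurrent.futures
import time


def first_responsive(endpoints, probe, timeout_s):
    """Return the first endpoint whose probe answers within timeout_s.

    `probe` is a hypothetical callable standing in for a real request to
    one endpoint. Returns None if no endpoint answers in time.
    """
    pool = concurrent.futures.ThreadPoolExecutor(
        max_workers=max(1, len(endpoints))
    )
    try:
        for endpoint in endpoints:
            future = pool.submit(probe, endpoint)
            try:
                future.result(timeout=timeout_s)
                return endpoint
            except concurrent.futures.TimeoutError:
                continue  # no answer in time: treat the endpoint as hung
            except Exception:
                continue  # the probe raised: treat the endpoint as down
    finally:
        # Do not wait for probes that are still blocked on a hung endpoint.
        pool.shutdown(wait=False)
    return None


def fake_probe(endpoint):
    # Simulated cluster (assumption): "etcd-0" hangs longer than the
    # timeout used below, every other endpoint answers immediately.
    if endpoint == "etcd-0":
        time.sleep(0.5)
    return True


if __name__ == "__main__":
    start = time.monotonic()
    chosen = first_responsive(["etcd-0", "etcd-1", "etcd-2"], fake_probe, 0.1)
    elapsed = time.monotonic() - start
    print(f"switched to {chosen} after {elapsed:.2f} s")
```

In this toy setup the switch happens after roughly one timeout period,
because only one hung endpoint is tried before a healthy one.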


