A hanged etcd used by hm9000 makes an impact on the delayed detection time of crashed application instances
Masumi Ito
Hi,
I found that a hung etcd node delayed the detection of crashed application instances, which slowed their recovery. How long it took depended on which hm9000 processes were connected to which etcd VM, but recovery took up to about 15 minutes, which I think is too long. Does anyone know how to calculate the time hm9000 needs to detect a hung etcd VM and switch to a healthy etcd?

I have encountered two different scenarios:

1. The hm9000 analyzer was connected to the hung etcd, while the hm9000 listener was connected to a healthy etcd. (About 8 minutes for the analyzer to recover; another hm9000 analyzer took over.)

The analyzer appears to have hung right after its etcd hung, because "Analyzer completed succesfully" no longer appeared in the log. After approximately 8 minutes, the other hm9000 analyzer acquired the lock and started working in its place. It then identified the crashed instance and enqueued a start message. The crashed app was relaunched within ten minutes of the detection.

2. The hm9000 analyzer was connected to a healthy etcd, while the hm9000 listener was connected to the hung etcd. (About 15 minutes for the listener to recover; the same hm9000 listener seems to have recovered somehow.)

The listener started failing to sync heartbeats right after its etcd hung. After 15 minutes, "Save took too long. Not bumping freshness." appeared in the listener's log, and the analyzer then complained about stale actual state with "Analyzer failed with error - Error:Actual state is not fresh" and stopped analyzing. About 10 seconds later the hm9000 listener somehow recovered and started bumping freshness periodically again. The analyzer then resumed comparing actual state and desired state, and raised a request to start the crashed instance.
Regards,
Masumi
--
View this message in context: http://cf-dev.70369.x6.nabble.com/A-hanged-etcd-used-by-hm9000-makes-an-impact-on-the-delayed-detection-time-of-crashed-application-ins-tp3096.html
Sent from the CF Dev mailing list archive at Nabble.com.
Gwenn Etourneau
Just one question: what about moving to Diego to get rid of HM9000 / DEA?
On Tue, Dec 15, 2015 at 9:44 PM, Masumi Ito <msmi10f(a)gmail.com> wrote:
Hi,
Masumi Ito
I understand your suggestion, but it is not easy to discard the current runtime and replace it with a new environment. So I would really appreciate any advice on how hm9000 detects a hung etcd and fails over to a healthy one.

At first I thought it might be related to the TCP keepalive setting of the hm9000 etcd client (coreos/go-etcd). When I investigated, the keepalive interval appeared to be one second, which means 10 sec (TCP_KEEPIDLE (1 sec) + TCP_KEEPINTVL (1 sec) * TCP_KEEPCNT (9)) should be enough to detect a hung connection. Note that the /proc/sys/net/ipv4/tcp_keepalive_probes value is 9 on each hm9000 VM.

However, according to the tcpdump results on the hm9000 VM, [TCP Retransmission] packets for the original HTTP request (i.e. PUT /v2/keys/hm/locks/Analyzer or GET /v2/keys/hm/v4/desired-fresh) were sent to the hung etcd 9 times, and this took more than one minute in total. Even after this connection was closed, the next attempt to reconnect to a healthy etcd VM did not seem to start immediately.

Therefore I am wondering what caused the total recovery time of 15 minutes.

Regards,
Masumi
CF Runtime
Before we dive into low-level specifics, we would like to get some high-level information.
What version of CF are you using?
Which IAAS is this running on?
What are the stemcell and bosh versions?
Please also post a (sanitized) copy of the manifest.

Thanks,
Rob & Zak
CF release integration
Pivotal
Masumi Ito
Hi Rob & Zak,

What version of CF are you using?
cf v212

Which IAAS is this running on?
OpenStack (Icehouse)

What are the stemcell and bosh versions?
stemcell 2989
BOSH 1.3008.0

Please also post a (sanitized) copy of the manifest.
We will share a sanitized one.

Regards,
Masumi