A hung etcd used by hm9000 delays detection of crashed application instances


Masumi Ito
 

Hi,

I found that when one of the etcd nodes hung, detection of crashed
application instances was delayed, resulting in slow recovery. Although the
exact delay depended on which etcd VM each hm9000 process was connected to,
recovery took up to about 15 minutes, which I think is far too long.

Does anyone know how to calculate the time it takes hm9000 to detect a hung
etcd VM and switch to a healthy one? I have encountered two different
scenarios, as follows.

1. The hm9000 analyzer was connected to the hung etcd while the hm9000 listener
was connected to a healthy etcd. (About 8 minutes for the analyzer to recover;
the other hm9000 analyzer took over instead.)
The analyzer appears to have hung just after the etcd it was connected to
hung, because "Analyzer completed successfully" no longer appeared in its log.
After approximately 8 minutes, the other hm9000 analyzer acquired the lock and
took over (see the lock-acquisition sketch after scenario 2). It then
identified the crashed instance and enqueued a start message, and the crashed
app was relaunched within ten minutes of the detection.

2. The hm9000 analyzer was connected to a healthy etcd while the hm9000 listener
was connected to the hung etcd. (About 15 minutes for the listener to recover;
the same hm9000 listener eventually recovered on its own.)
The listener started failing to sync heartbeats just after the etcd it was
connected to hung. After 15 minutes, "Save took too long. Not bumping freshness."
appeared in the listener's log, and the analyzer then also complained about
stale actual state: "Analyzer failed with error - Error:Actual state is not
fresh" and stopped its analysis tasks. About 10 seconds later the hm9000
listener somehow recovered and resumed bumping freshness periodically, after
which the analyzer resumed comparing actual state with desired state and raised
the request to start the crashed instance.
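
For reference, below is a minimal sketch (not hm9000's actual locking code) of
how a standby process could take over an etcd-backed lock such as
/v2/keys/hm/locks/Analyzer once the previous holder's entry expires. The etcd
address and the TTL value are assumptions for illustration only.

package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// Hypothetical etcd node; a hung etcd would typically stall or time out here.
	endpoint := "http://10.0.0.10:4001/v2/keys/hm/locks/Analyzer"

	// etcd v2 compare-and-swap: the PUT only succeeds if the key does not
	// exist yet, i.e. the previous holder's TTL has already expired.
	form := url.Values{
		"value":     {"analyzer-1"},
		"ttl":       {"10"},    // assumed TTL in seconds
		"prevExist": {"false"}, // fail if another analyzer still holds the lock
	}
	req, err := http.NewRequest(http.MethodPut, endpoint, strings.NewReader(form.Encode()))
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("etcd unreachable:", err)
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusCreated || resp.StatusCode == http.StatusOK {
		fmt.Println("lock acquired; this analyzer becomes active")
	} else {
		fmt.Printf("lock still held elsewhere; staying on standby (status %d)\n", resp.StatusCode)
	}
}

This only illustrates the takeover mechanism; it does not by itself explain why
the standby needed roughly 8 minutes to acquire the lock.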

Regards,
Masumi





Gwenn Etourneau
 

Just one question: what about moving to Diego to get rid of HM9000 / DEA?



Masumi Ito
 

I understand your suggestion; however, it is not easy to dispose of the current
runtime and replace it with a new environment. So I would really appreciate
advice on how hm9000 detects a hung etcd and fails over to a healthy one.

First of all, I thought it might be related to the TCP keepalive settings of
the hm9000 etcd client (coreos/go-etcd). When I investigated, the keepalive
interval appears to be one second, which should mean about 10 seconds
(TCP_KEEPIDLE (1 sec) + TCP_KEEPINTVL (1 sec) * TCP_KEEPCNT (9)) to detect a
hung connection. Note that the /proc/sys/net/ipv4/tcp_keepalive_probes value
is 9 on each hm9000 VM.
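
To illustrate, here is a minimal Go sketch (not the actual coreos/go-etcd code)
of a client dialing etcd with a 1-second TCP keepalive, together with the
arithmetic behind the roughly 10-second dead-connection detection described
above. The etcd address is a placeholder.

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	dialer := &net.Dialer{
		Timeout:   time.Second,
		KeepAlive: 1 * time.Second, // on Linux this sets both TCP_KEEPIDLE and TCP_KEEPINTVL to 1 s
	}
	conn, err := dialer.Dial("tcp", "10.0.0.10:4001") // placeholder etcd address
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	// TCP_KEEPCNT is left at the kernel default, tcp_keepalive_probes = 9.
	// Worst-case time to declare a silently hung peer dead:
	idle := 1 * time.Second
	intvl := 1 * time.Second
	probes := 9
	fmt.Println("keepalive detection time:", idle+intvl*time.Duration(probes)) // 10s
}

Note that keepalive probes are only sent on an otherwise idle connection; they
do not bound how long an in-flight request keeps retransmitting, which is what
the capture below shows.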

According to tcpdump results on the hm9000 VM, [TCP Retransmission] packets for
the original HTTP request (i.e. PUT /v2/keys/hm/locks/Analyzer or GET
/v2/keys/hm/v4/desired-fresh) to the hung etcd were sent 9 times, but in total
this took more than one minute. Even after that connection was closed, the next
attempt to reconnect to a healthy etcd VM did not seem to start immediately.
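
As a back-of-the-envelope check, the sketch below sums the exponential backoff
across those 9 retransmissions, assuming the Linux minimum RTO of roughly
200 ms as the starting point (the real initial RTO depends on the measured RTT):

package main

import "fmt"

func main() {
	rto := 0.2   // assumed initial retransmission timeout in seconds (~Linux minimum of 200 ms)
	total := 0.0 // cumulative time spent waiting for acknowledgements
	for i := 0; i < 9; i++ {
		total += rto
		rto *= 2 // the RTO doubles after each unacknowledged retransmission
	}
	fmt.Printf("time covered by 9 retransmissions: ~%.0f s\n", total) // ~102 s
}

That is on the order of 100 seconds, which matches the "more than one minute"
seen in the capture but still leaves most of the 15 minutes unaccounted for.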

Therefore I am wondering what accounts for the total recovery time (15 minutes).

Regards,
Masumi





CF Runtime
 

Before we dive into low-level specifics, we would like to get some high-level information.

What version of CF are you using? Which IAAS is this running on? What are the stemcell and bosh versions? Please also post a (sanitized) copy of the manifest.

Thanks,
Rob & Zak
CF release integration
Pivotal


Masumi Ito
 

Hi Rob & Zak ,

What version of CF are you using?
cf v212

Which IAAS is this running on?
OpenStack (Icehouse)

What are the stemcell and bosh versions?
stemcell 2989
BOSH 1.3008.0

Please also post a (sanitized) copy of the manifest.
We will share a sanitized one.

Regards,
Masumi



