Looks like there are a lot of log lines in the hm.listener component where
it takes too long to save state to etcd. The listener updates state based on
heartbeats from the DEAs. When the etcd request takes too long, the
listener doesn't mark the data as fresh (it lets a key with a TTL expire).
Then when another component tries to get the state of actual running
instances (this value populates the number of running instances you see
changing in the CLI output), it bails early because it detects the data is
stale. CC can't determine the number of running instances, so it reports
-1 as a sentinel to indicate unknown, which the CLI renders as ?.
The question is why saves are sometimes taking too long, causing the data
to go stale so frequently.
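
To make the failure mode concrete, here's a rough sketch of the freshness mechanism described above. This is not HM9000's actual code: the names (FakeStore, handle_heartbeat, running_instances, render), the 30s TTL, and the 1s save budget are all made up for illustration, and an in-memory object stands in for etcd.

```python
import time

FRESHNESS_TTL = 30.0   # hypothetical TTL on the freshness key, in seconds
SAVE_BUDGET = 1.0      # hypothetical limit on how long a save may take
UNKNOWN = -1           # sentinel reported when the count can't be determined


class FakeStore:
    """In-memory stand-in for etcd: saved counts plus one freshness key with a TTL."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.running = {}        # app id -> running instance count
        self.fresh_until = 0.0   # time at which the freshness key expires

    def is_fresh(self):
        return self.clock() < self.fresh_until

    def bump_freshness(self):
        self.fresh_until = self.clock() + FRESHNESS_TTL


def handle_heartbeat(store, counts, save_duration):
    """Listener side: save state, refresh the key only if the save was quick."""
    store.running.update(counts)
    if save_duration <= SAVE_BUDGET:
        store.bump_freshness()
    # Otherwise do nothing: the freshness key is left to expire on its own.


def running_instances(store, app_id):
    """Reader side: bail early with the sentinel if the data is stale."""
    if not store.is_fresh():
        return UNKNOWN
    return store.running.get(app_id, 0)


def render(count):
    """What the CLI would display for a given count."""
    return "?" if count == UNKNOWN else str(count)
```

With this model, a single slow save doesn't hurt as long as a fast one lands before the TTL runs out. It's a run of slow saves lasting longer than the TTL that flips the reported count to -1, which is why the ? shows up intermittently rather than constantly.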