Re: Cloud controller doesn't recover after database downtime


Mike Youngstrom

We have seen the same thing but haven't had time to dig into it deeper.  For us it isn't hard to reproduce: running a push in a loop while doing an update reproduces it.  You might have enough information here to open an issue in the CC project if nobody from the team responds to this message.
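
For anyone who wants to try reproducing this, here is a rough Python sketch of the "push in a loop" idea. It assumes the cf CLI is on the PATH and already logged in and targeted. The function name, the attempt limit and the injectable `run` parameter are made up for illustration; they are not part of any CF tooling.

```python
import subprocess


def push_until_failure(app_name, max_attempts=50, run=subprocess.run):
    """Run `cf push <app_name>` repeatedly.

    Returns (attempt_number, combined_output) for the first push that exits
    with a non-zero status, or None if every attempt succeeded.
    `run` is injectable only so the loop can be exercised without a real CLI.
    """
    for attempt in range(1, max_attempts + 1):
        result = run(["cf", "push", app_name], capture_output=True, text=True)
        if result.returncode != 0:
            # First failing push: hand back its output for inspection.
            return attempt, result.stdout + result.stderr
    return None
```

Start the loop, kick off the update while it is running, and then check whether pushes keep failing once the update has finished.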

Mike

On Thu, Apr 5, 2018 at 5:25 AM, Holger Oehm <holger.oehm@...> wrote:
Hi,

Today, in our production system, we saw an error when pushing an app
while the database instance (which hosts ccdb, uaadb, locketdb and
diegodb) was being updated.

That was to be expected. What was unexpected was that afterwards
(once the database instance was up and running again) further attempts
to push the same application also kept failing.
From the CF_TRACE we saw that a PUT to /v2/apps/<guid> got a response
with status code 400, with code 100001, description "The app is invalid: VCAP::CloudController::BuildCreate::StagingInProgress" and error_code "CF-AppInvalid".
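
For scripts that need to recognize this specific failure, a minimal Python sketch is below. It assumes the response body is JSON with the keys `code`, `description` and `error_code`, matching the fields in the trace above; the function name is hypothetical.

```python
import json


def is_staging_in_progress(status_code, body):
    """Return True if a response looks like the StagingInProgress error above.

    Assumes `body` is the raw JSON text of the API response.
    """
    if status_code != 400:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Body was not valid JSON, so it cannot be the error we look for.
        return False
    if not isinstance(payload, dict):
        return False
    return (
        payload.get("error_code") == "CF-AppInvalid"
        and "StagingInProgress" in str(payload.get("description", ""))
    )
```

It matches on `error_code` plus the `StagingInProgress` substring rather than on the numeric code alone, on the assumption that other "app is invalid" errors would share that code.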

The situation did not recover by itself within 20 minutes. An operator
then ran cf restage on the application and the problem disappeared.

Everything else worked as expected; the diego-sync job was also running
fine.

My guess is that the database disappeared at an inconvenient point in
time and that this left an inconsistent state.

What looks strange to me is that a cf push of the same application
kept failing, but a cf restage fixed it. Shouldn't both commands
fix the situation?

Best Regards,
Holger.



