Cloud controller doesn't recover after database downtime

Holger Oehm


Today, while the database instance that hosts ccdb, uaadb, locketdb
and diegodb was being updated in our production system, the push of
an app failed with an error.

That was to be expected. What was unexpected is that afterwards
(when the database instance was up and running again) further attempts
to push the same application kept failing as well.
From the CF_TRACE output we saw that a PUT to /v2/apps/<guid> got a
response with status code 400, code 100001, error_code "CF-AppInvalid"
and description
"The app is invalid: VCAP::CloudController::BuildCreate::StagingInProgress".

This didn't recover by itself within 20 minutes. After that an operator
ran cf restage on the application and the problem disappeared.
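
For anyone who wants to detect this state in a script, here is a minimal
sketch in Python that checks whether a response matches the error quoted
above. The function name and the example body are hypothetical; only the
status code, code, error_code and description values are taken from our
trace. It does not call the API and it is not how Cloud Controller itself
classifies the error.

```python
import json

# Marker string copied from the description field in the CF_TRACE output.
STAGING_IN_PROGRESS = "VCAP::CloudController::BuildCreate::StagingInProgress"


def is_stuck_staging_error(status_code, body):
    """Return True if a response matches the error reported in this post.

    Hypothetical helper: status_code is the HTTP status as an int,
    body is the raw response body as a string.
    """
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        # Not JSON at all, so it cannot be the error we are looking for.
        return False
    if not isinstance(payload, dict):
        return False
    return (
        status_code == 400
        and payload.get("code") == 100001
        and payload.get("error_code") == "CF-AppInvalid"
        and STAGING_IN_PROGRESS in str(payload.get("description", ""))
    )


# Example body assembled from the fields quoted in the trace (an assumption
# about the exact JSON layout, not a captured response).
example_body = json.dumps({
    "description": "The app is invalid: " + STAGING_IN_PROGRESS,
    "error_code": "CF-AppInvalid",
    "code": 100001,
})

if is_stuck_staging_error(400, example_body):
    print("push is blocked by a stale staging state; a restage helped us")
```

In our case the only thing that cleared the state was the manual
cf restage, so a check like this would at most tell an operator when to
try that.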

Everything else worked as expected, and the diego-sync job was running.

My guess is that the database disappeared at an inconvenient point in
time and left an inconsistent state behind.

What looks strange to me is that a cf push of the same application
kept failing, but a cf restage fixed it. Shouldn't both commands
fix the situation?

Best Regards,
