App deployment hangs in legacy CF installation
We are running on an extremely old version of CF (we are in the process of
building one based on the latest), so I know there is very little the
community may be able to help with.
But regardless... let me give it a try.
In my debug session, I tried to deploy a hello world app, and the deployment
stopped at "STARTED" and eventually timed out.
The full log:
I can easily reproduce this when I do two concurrent pushes. Sometimes they
go through, sometimes they don't.
We have looked at every log in CF and we don't have any lead. I ran bosh
restart JOB hoping it was caused by a slow process, but that did not help.
I found ntp was not installed on some of the components (we installed ntp
on all of the DEAs), and I found the clocks were not synced, so I synced the
clocks, and still no help.
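
For checking clock drift on each component, here is a minimal sketch that asks an NTP server for the time and compares it with the local clock. It is not taken from this thread: the server name `pool.ntp.org` and the helper names are assumptions, and the result ignores network delay, so treat it only as a rough check.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def parse_transmit_time(packet):
    """Return the server transmit timestamp of an NTP reply as Unix seconds."""
    # The transmit timestamp is the last 8 bytes of the 48-byte NTP header.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def clock_offset(server="pool.ntp.org", timeout=5.0):
    """Rough offset (server time minus local time), ignoring network delay."""
    # server name is an assumption; point it at whatever NTP source you use
    request = b"\x1b" + 47 * b"\0"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (server, 123))
        reply, _ = sock.recvfrom(512)
    return parse_transmit_time(reply) - time.time()
```

Running `clock_offset()` on each VM and comparing the numbers should show whether any component is still noticeably out of sync.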
Any idea where I should look? I thought about our EC2 instance health
but all of them seem to be healthy. I am considering relaunching (bosh
recreate) one component at a time.
The one thing I did notice is I am constantly deploying to a couple of DEAs. I
will look into them but I am not sure...
Any ideas will be appreciated. Thanks.
once you get to this line where you make the app started, then the next
step is that the cloud controller should be sending a NATS message targeted
at a particular DEA selected to run the app.
so you could monitor:
* NATS to see if you see the CC sending the NATS message
* the DEA logs to see if it receives the message
* the DEA logs to see if it is able to react to the message once it
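
The first check above (watching NATS for the CC's message) could be sketched roughly as follows, using only the plain-text NATS client protocol over a socket. This is an assumption-laden sketch, not something from cf-release: the `dea.` subject prefix, the function names and the credentials parameters are illustrative and should be adjusted to your deployment.

```python
import json
import socket


def parse_msg_header(line):
    """Parse a NATS 'MSG <subject> <sid> [reply-to] <#bytes>' control line."""
    parts = line.split()
    if len(parts) not in (4, 5) or parts[0].upper() != "MSG":
        return None
    reply_to = parts[3] if len(parts) == 5 else None
    return parts[1], reply_to, int(parts[-1])


def watch(host, port=4222, prefix="dea.", user=None, password=None):
    """Print every message whose subject starts with `prefix` (assumed value)."""
    options = {"verbose": False, "pedantic": False}
    if user is not None:
        options.update({"user": user, "pass": password})
    with socket.create_connection((host, port)) as sock:
        stream = sock.makefile("rb")
        stream.readline()  # the server greets each client with an INFO line
        sock.sendall(b"CONNECT " + json.dumps(options).encode() + b"\r\n")
        sock.sendall(b"SUB > 1\r\n")  # ">" is the wildcard for all subjects
        while True:
            raw = stream.readline()
            if not raw:
                break  # server closed the connection
            line = raw.decode("utf-8", "replace").strip()
            if line.upper().startswith("PING"):
                sock.sendall(b"PONG\r\n")  # answer keep-alives to stay connected
                continue
            header = parse_msg_header(line)
            if header is None:
                continue  # +OK, -ERR, INFO and similar control lines
            subject, reply_to, size = header
            payload = stream.read(size + 2)[:size]  # payload is followed by CRLF
            if subject.startswith(prefix):
                print(subject, payload.decode("utf-8", "replace"))
```

Calling `watch()` with the NATS host and credentials from your deployment manifest while running a push should show whether a start request is published at all, and for which DEA.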
we have had issues in the past where NATS issues on client/server
communication were addressed with restarting clients and servers, but it's
been quite a while. letting us know which cf-release you are using could
On Mon, Jun 29, 2015 at 7:20 AM, John Wong <gokoproject(a)gmail.com> wrote:
Hi James
Thanks for the info. I and my team greatly appreciate your time here. I
believe we are running on v153 (or close to that), which is very old.
I will have a look at those components more closely. A symptom we observe
is that sometimes, after an app deploys successfully, it crashes within a few
minutes even without activity.
What we see is a socket closed on read error (which IMO indicates the
container was killed and the logger could not contact it).
On Mon, Jun 29, 2015 at 1:35 PM, James Bayer <jbayer(a)pivotal.io> wrote: