Re: App deployment hangs in legacy CF installation


John Wong
 

Hi James

Thanks for the info. My team and I greatly appreciate your time here. I
believe we are running on v153 (or close to that), which is very old.

I will have a closer look at those components. One symptom we observe is
that sometimes an app deploys successfully and then crashes within a few
minutes, even without any activity.

What we see is a socket closed on read error (which IMO indicates that the
container was killed and the logger could not contact it).

John

On Mon, Jun 29, 2015 at 1:35 PM, James Bayer <jbayer(a)pivotal.io> wrote:

once you get to this line where you make the app started [1], then the
next step is that the cloud controller should be sending a NATS message
targeted at a particular DEA selected to run the app.

so you could monitor:
* NATS to see if you see the CC sending the NATS message
* the DEA logs to see if it receives the message
* the DEA logs to see if it is able to react to the message once it
receives it
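
A minimal sketch of the first check, assuming plain TCP access to the NATS
server and using only the Python standard library. It speaks the NATS text
protocol directly (INFO, CONNECT, SUB, MSG, PING/PONG), subscribes to the
full wildcard ">" and prints messages whose subject looks like a DEA start
request. The subject shape "dea.<dea-id>.start" is an assumption about this
cf-release, and the function names, host, port and credentials are
placeholders, not values taken from this thread.

```python
import json
import socket


def parse_msg_header(line):
    """Parse a NATS 'MSG <subject> <sid> [reply-to] <#bytes>' control line.

    Returns (subject, payload_size), or None if the line is not a MSG line.
    """
    parts = line.strip().split()
    if len(parts) not in (4, 5) or parts[0].upper() != "MSG":
        return None
    return parts[1], int(parts[-1])


def is_dea_start(subject):
    # ASSUMPTION: start requests use subjects shaped like "dea.<dea-id>.start".
    tokens = subject.split(".")
    return len(tokens) == 3 and tokens[0] == "dea" and tokens[2] == "start"


def watch(host, port, user=None, password=None):
    """Subscribe to every subject and print DEA start requests as they pass."""
    with socket.create_connection((host, port)) as sock:
        stream = sock.makefile("rwb")
        stream.readline()  # server greeting: INFO {...}
        options = {"verbose": False, "pedantic": False}
        if user is not None:
            options.update({"user": user, "pass": password})
        stream.write(b"CONNECT " + json.dumps(options).encode() + b"\r\n")
        stream.write(b"SUB > 1\r\n")  # ">" is the NATS full wildcard
        stream.flush()
        while True:
            raw = stream.readline()
            if not raw:
                break  # server closed the connection
            line = raw.decode("utf-8", "replace")
            if line.startswith("PING"):
                # Answer keepalives, otherwise the server drops the client.
                stream.write(b"PONG\r\n")
                stream.flush()
                continue
            header = parse_msg_header(line)
            if header is None:
                continue  # +OK, -ERR, INFO and other control lines
            subject, size = header
            payload = stream.read(size + 2)[:size]  # payload plus trailing CRLF
            if is_dea_start(subject):
                print(subject, payload.decode("utf-8", "replace"))
```

Calling watch() with the NATS address and credentials from the deployment
manifest while running a push should show whether the CC ever publishes a
start request; if nothing is printed, the problem is upstream of the DEA.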

we have had issues in the past where NATS client/server communication
problems were addressed by restarting clients and servers, but it's
been quite a while. letting us know which cf-release you are using could
help.

[1]
https://gist.github.com/yeukhon/666fa1936ef3473c6de6#file-gistfile1-txt-L534

On Mon, Jun 29, 2015 at 7:20 AM, John Wong <gokoproject(a)gmail.com> wrote:

Hi.

We are running an extremely old version of CF (we are in the process
of building one based on the latest), so I know there is very little the
community may be able to help with.

But regardless... let me give it a try.


In my debug session, I tried to deploy a hello world app, and the
deployment stopped at "STARTED" and eventually timed out.

The full log:
https://gist.githubusercontent.com/yeukhon/666fa1936ef3473c6de6/raw/1f662b86e806ab1fff230f5558f4942d9785c584/gistfile1.txt


I can easily reproduce this when I do two concurrent pushes. Sometimes
they go through, sometimes they don't.

We have looked at every log in CF and we don't have any leads. I ran bosh
restart JOB, hoping it was caused by a slow process, but that did not help.
I found ntp was not installed on some of the components (we installed ntp
on all of the DEAs), and I found the clocks were not synced, so I synced
them, and that still did not help.

Any idea where I should look? I thought about our EC2 instance health,
but all of them seem to be healthy. I am considering relaunching (bosh
recreate) one component at a time.

The one thing I did notice is that I am constantly deploying to the same
couple of DEAs. I will look into them but I am not sure...
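
One way to put a number on that skew is to tally start requests per DEA. A
rough sketch, assuming the start-request subjects have already been
collected (for example from NATS or the CC logs) and that they look like
"dea.<dea-id>.start"; both the subject shape and the function name are
assumptions for illustration.

```python
from collections import Counter


def count_starts_per_dea(subjects):
    """Tally how many start requests each DEA id received."""
    counts = Counter()
    for subject in subjects:
        tokens = subject.split(".")
        # ASSUMPTION: start requests look like "dea.<dea-id>.start".
        if len(tokens) == 3 and tokens[0] == "dea" and tokens[2] == "start":
            counts[tokens[1]] += 1
    return counts
```

If counts.most_common() shows nearly all requests landing on one or two
ids, those DEAs are the first ones worth inspecting.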


Any ideas will be appreciated. Thanks.

John

_______________________________________________
cf-dev mailing list
cf-dev(a)lists.cloudfoundry.org
https://lists.cloudfoundry.org/mailman/listinfo/cf-dev


--
Thank you,

James Bayer
