Thanks for the input, that's a good call. A colleague of mine (who is
currently on vacation) did look at that... Sadly he's not around to ask
what he tested.
On Wed, Aug 19, 2015 at 7:06 AM, Guillaume Berche <bercheg(a)gmail.com> wrote:
Some other "state" on the DEA host, such as a shortage of entropy on /dev/random (one that
went away with VM reconstruction but not with a DEA job restart)?
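For anyone wanting to test the entropy theory on a suspect DEA, here is a minimal sketch in Java. It assumes a Linux host that exposes the kernel's entropy estimate at `/proc/sys/kernel/random/entropy_avail`; the class and method names are made up for illustration, and what counts as "too low" is left open because nobody in this thread measured it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class EntropyCheck {
    // Parses the single integer the kernel writes into entropy_avail.
    static int parseEntropy(String content) {
        return Integer.parseInt(content.trim());
    }

    // Reads and parses the entropy estimate from the given file.
    static int readEntropy(Path path) throws IOException {
        byte[] raw = Files.readAllBytes(path);
        return parseEntropy(new String(raw, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: Linux procfs layout; this path is absent on other systems.
        Path path = Paths.get("/proc/sys/kernel/random/entropy_avail");
        if (Files.isReadable(path)) {
            System.out.println("entropy_avail = " + readEntropy(path));
        } else {
            System.out.println("entropy_avail is not readable on this host");
        }
    }
}
```

Running it on a failing DEA and on a freshly recreated one would show whether the two hosts differ in this respect at all.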
On 12 Aug 2015 at 14:14, "Daniel Mikusa" <dmikusa(a)pivotal.io> wrote:
It seems like you were pretty thorough. I can't think of anything that
would be different or that could cause symptoms like this, although I could
be overlooking something as well. Without logs / app to try and replicate
I'm not sure I can help much more. Sorry.
Perhaps someone else on the list has some thoughts?
On Wed, Aug 12, 2015 at 3:25 AM, Daniel Jones <
Thanks for taking the time to reply.
I didn't include too much in the way of detail, as I was thinking that
there must be a moving part in the equation I'm blind to, in which case
that's a gap in my knowledge that I ought to fill in.
As we did `bosh recreate` on all the VMs, which fixed it, I can't go
back and fetch logs unfortunately. There's no chance of being able to
create a test case as I'm on client's time, so consider this a thought
The app was Spring Boot 1.2.3, pulling in Spring Boot JDBC and Spring
LDAP. Root FS was cflinuxfs2, and the Java buildpack logged the same for
both. On some failing DEAs there were no other apps, on others there were -
it didn't seem to be a factor. All DEAs had plenty of disk space.
I was wondering if there was a race condition, but I assumed Spring
contexts start single-threadedly. Do you know if that's a correct
Do you know if there are any *things* that could have been different
between the DEAs that I didn't account for? I.e. another moving part that's
*not* release, job, stemcell, droplet, root FS or app environment?
On Tue, Aug 11, 2015 at 12:32 PM, Daniel Mikusa <dmikusa(a)pivotal.io>
On Tue, Aug 11, 2015 at 5:15 AM, Daniel Jones <
I've witnessed behaviour caused by the combination of a DEA and a
Spring application that I can't explain. If you like a good mystery or you
happen to know a lot about Java proxies and DEA transient state, please
A particular Spring app
Version of Spring? What parts of Spring are you pulling into the app?
was crashing only on specific DEAs in a Cloud Foundry.
Ever try bumping up the log level for Spring when you were getting the
problem? If so, did the problem still occur? Were you able to capture the
lucid64 or cflinuxfs2? or didn't matter?
All DEAs were from the same CF release (PCF ERT 1.5.2)
All DEAs were up-to-date according to BOSH (ie no outstanding changes
waiting to be applied)
All DEAs were deployed with identical BOSH job config
All Warden containers were using the same root FS
The droplet was the same across all DEAs
The droplet version was the same
The droplet tarballs all had the same MD5 checksum
What was the output of the Java build pack when the droplet was
created? Or better yet, run `cf files <app> app/.java-buildpack.log` and
include the output.
Warden was providing the exact same env and start command to all
I saw the same behaviour repeat itself across 5 completely separate
Cloud Foundry installations
The crash was Spring not being able to autowire a bean, where it was
referenced by implementation rather than interface (yes, I know, but it was
not my code!).
Any chance you could include logs from the crash? Was there an
exception / stacktrace generated? Alternatively, have you been able to
create a simple test app that replicates the behavior?
There was some Javassist/CGLIB action going on, creating proxies for
the sake of transaction management.
Rebooting the troublesome DEAs did not fix the problem.
Doing a `bosh recreate` did reliably fix the problem.
Alternatively, changing the Spring code to wire by interface also
reliably fixed the problem.
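To illustrate why wiring by interface is the more robust choice once proxies are involved, here is a minimal sketch using only the JDK's `java.lang.reflect.Proxy`, with no Spring on the classpath. The names `AccountService`, `JdbcAccountService` and `injectableAs` are invented for the example. The thread does not establish which kind of proxy was actually created on the failing DEAs, so this shows only the general mechanism: an interface-based proxy is an instance of the interface but not of the implementation class, so an injection point typed by the implementation cannot accept it.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class ProxyWiringDemo {
    // Hypothetical service interface, invented for illustration.
    interface AccountService {
        String find(String id);
    }

    // Hypothetical implementation, standing in for the bean wired by class.
    static class JdbcAccountService implements AccountService {
        public String find(String id) {
            return "account-" + id;
        }
    }

    // Wraps the target in a JDK dynamic proxy that simply delegates each call.
    // A transactional proxy would add begin/commit logic around the delegation.
    static AccountService proxyFor(AccountService target) {
        InvocationHandler handler = (proxy, method, args) -> method.invoke(target, args);
        return (AccountService) Proxy.newProxyInstance(
                AccountService.class.getClassLoader(),
                new Class<?>[] { AccountService.class },
                handler);
    }

    // True if the bean could be assigned to an injection point of the given type.
    static boolean injectableAs(Object bean, Class<?> requiredType) {
        return requiredType.isInstance(bean);
    }

    public static void main(String[] args) {
        AccountService bean = proxyFor(new JdbcAccountService());
        System.out.println("by interface: " + injectableAs(bean, AccountService.class));
        System.out.println("by implementation: " + injectableAs(bean, JdbcAccountService.class));
        System.out.println(bean.find("42"));
    }
}
```

A subclass-based proxy (the CGLIB style) would pass both checks, which is why the proxying strategy actually chosen at startup matters for code that wires by implementation. None of this explains why otherwise identical DEAs would end up with different behaviour.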
I can't understand why different DEA instances, from the same BOSH
release, with the same config, on the same stemcell, running the same
version of Warden, with the same droplet, and the same root FS, and the
same env, and the same start command, yielded different behaviour. I'm even
further confused as to why a `bosh recreate` changed that behaviour. What
could possibly have changed? Something on ephemeral disk? But what else is
there on ephemeral disk that could have mattered and was likely to have
How much was on the disk? Was it getting full? How many other apps
were running on that DEA (before vs after)?
Do CGLIB/Javassist have some native dependencies that weren't in sync
Wild guess, race condition in the code somewhere?
Anyone with a convincing explanation (that does not involve voodoo)
will receive one free beer and a high-five at the next CF Summit!