Here is some guidance on how to check for available entropy on a Linux host [1]. I'm not sure whether the BOSH agent, DEA, or Diego cell captures this metric, but we should certainly look into it. When you're inside a container, you can check available entropy with the `cf ssh` command that is now supported with Diego (or the app itself could log it before startup). See an example of this command running on Pivotal's hosted Diego [2]; values lower than 200 while you're trying to do operations that need entropy can cause a problem.

[1] https://major.io/2007/07/01/check-available-entropy-in-linux/

[2]
$ cf ssh MYAPP
vcap(a)uqj9t0vqu9l:~$ cat /proc/sys/kernel/random/entropy_avail; date;
855
Wed Aug 19 13:32:05 UTC 2015
vcap(a)uqj9t0vqu9l:~$ cat /proc/sys/kernel/random/entropy_avail; date;
866
Wed Aug 19 13:32:07 UTC 2015
vcap(a)uqj9t0vqu9l:~$ cat /proc/sys/kernel/random/entropy_avail; date;
876
Wed Aug 19 13:32:08 UTC 2015
-- Thank you,
James Bayer
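Since the app itself could log available entropy before startup, here is a minimal sketch of that idea in Java. The /proc path is Linux-specific, and the class name is made up for illustration:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: log the kernel's available entropy before the app
// does any crypto-heavy work. Linux-only; elsewhere the file won't exist.
public class EntropyLogger {
    public static void main(String[] args) throws Exception {
        Path entropyFile = Paths.get("/proc/sys/kernel/random/entropy_avail");
        if (Files.isReadable(entropyFile)) {
            String available = new String(Files.readAllBytes(entropyFile)).trim();
            System.out.println("entropy_avail at startup: " + available);
        } else {
            System.out.println("entropy_avail not readable; not a Linux host?");
        }
    }
}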
I've seen this happen to a good number of apps running on PWS, so it's something you can encounter when running CF on AWS as well.
What usually happens is that the application takes significantly longer to start, sometimes to the point where it fails to start quickly enough and CF marks it as crashed. I haven't seen it cause any NPEs, though. My understanding is that the JVM will just block until it gets the entropy it needs.
Dan
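To make the blocking visible, a small sketch that times seed generation; this assumes a Linux JVM whose default SecureRandom draws seed material from /dev/random, and the class name is hypothetical:

import java.security.SecureRandom;

// generateSeed() typically reads the blocking /dev/random pool on Linux, so
// on an entropy-starved host this call can stall for seconds or minutes;
// long enough for CF health checks to mark the app as crashed.
public class SeedTiming {
    public static void main(String[] args) {
        SecureRandom random = new SecureRandom();
        long start = System.nanoTime();
        byte[] seed = random.generateSeed(32); // may block waiting for entropy
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("generateSeed(32) returned " + seed.length
                + " bytes after " + elapsedMs + " ms");
    }
}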
Johannes Hiemer <jvhiemer@...>
Go for it and let's see if we can document this issue afterwards with some logs for other people.
-- Kind regards, Johannes Hiemer
Ooh, that's interesting. Coupled with what Guillaume suggested, I can imagine that being a problem. We did get a NullPointerException logged by some Spring Security component where we couldn't figure out what could possibly be null, so it's conceivable that some nested call to java.util.Random failed and returned null.
Sadly I don't have the logs any more, but this narrative is convincing enough to make me think it might have been the problem :)
-- Regards,
Daniel Jones EngineerBetter.com
Johannes Hiemer <jvhiemer@...>
Daniel, I have had a problem with the deployment of Spring applications on OpenStack recently as well. I am also not sure, without seeing the logs, what could be the reason, but did you try: http://www.evoila.de/vsphere/java-applications-not-starting-on-openstack-based-cloud-foundry-deployment/?lang=en

Regards,
Johannes
-- Kind regards, Johannes Hiemer
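A common workaround for this symptom (and, I assume, what the linked article recommends, though I haven't verified its contents) is starting the JVM with -Djava.security.egd=file:/dev/./urandom so SecureRandom seeds from the non-blocking pool. A quick sketch for checking what a JVM is actually configured with; the class name is hypothetical:

import java.security.Security;

// Print which entropy source this JVM would use. securerandom.source comes
// from the java.security file; java.security.egd is the system-property
// override, e.g. java -Djava.security.egd=file:/dev/./urandom EntropySourceCheck
public class EntropySourceCheck {
    public static void main(String[] args) {
        System.out.println("securerandom.source = "
                + Security.getProperty("securerandom.source"));
        System.out.println("java.security.egd   = "
                + System.getProperty("java.security.egd"));
    }
}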
Thanks for the input, that's a good call. A colleague of mine (who is currently on vacation) did look at that... Sadly he's not around to ask what he tested.
On Wed, Aug 19, 2015 at 7:06 AM, Guillaume Berche <bercheg(a)gmail.com> wrote:

> Some other "state" on the DEA host, such as a shortage on /dev/random (one that went away with VM recreation but not with a DEA job restart)?
>
> Guillaume
-- Regards,
Daniel Jones EngineerBetter.com
It seems like you were pretty thorough. I can't think of anything that would be different or that could cause symptoms like this, although I could be overlooking something as well. Without logs or an app to try and replicate it with, I'm not sure I can help much more. Sorry.

Perhaps someone else on the list has some thoughts?

Dan
Hi Dan,
Thanks for taking the time to reply.
I didn't include too much in the way of detail, as I was thinking that there must be a moving part in the equation I'm blind to, in which case that's a gap in my knowledge that I ought to fill in.
As we did a `bosh recreate` on all the VMs, which fixed it, I can't go back and fetch logs, unfortunately. There's no chance of being able to create a test case as I'm on a client's time, so consider this a thought exercise :)

The app was Spring Boot 1.2.3, pulling in Spring Boot JDBC and Spring LDAP. The root FS was cflinuxfs2, and the Java buildpack logged the same output on working and failing DEAs. On some failing DEAs there were no other apps, on others there were; it didn't seem to be a factor. All DEAs had plenty of disk space.

I was wondering if there was a race condition, but I assumed Spring contexts start up single-threaded. Do you know if that's a correct assumption?

Do you know of any *things* that could have been different between the DEAs that I didn't account for? I.e. another moving part that's *not* the release, job, stemcell, droplet, root FS, or app environment?
-- Regards,
Daniel Jones EngineerBetter.com
On Tue, Aug 11, 2015 at 5:15 AM, Daniel Jones <daniel.jones(a)engineerbetter.com> wrote:

> Hi all,
>
> I've witnessed behaviour caused by the combination of a DEA and a Spring application that I can't explain. If you like a good mystery or you happen to know a lot about Java proxies and DEA transient state, please read on!
>
> A particular Spring app

Version of Spring? What parts of Spring are you pulling into the app?

> was crashing only on specific DEAs in a Cloud Foundry.

Ever try bumping up the log level for Spring when you were getting the problem? If so, did the problem still occur? Were you able to capture the logs?

> All DEAs were from the same CF release (PCF ERT 1.5.2)
> All DEAs were up-to-date according to BOSH (i.e. no outstanding changes waiting to be applied)
> All DEAs were deployed with identical BOSH job config
> All Warden containers were using the same root FS

lucid64 or cflinuxfs2? Or didn't it matter?

> The droplet was the same across all DEAs
> The droplet version was the same
> The droplet tarballs all had the same MD5 checksum

What was the output of the Java buildpack when the droplet was created? Or better yet, run `cf files <app> app/.java-buildpack.log` and include the output.

> Warden was providing the exact same env and start command to all containers
> I saw the same behaviour repeat itself across 5 completely separate Cloud Foundry installations
>
> The crash was Spring not being able to autowire a bean, where it was referenced by implementation rather than interface (yes, I know, but it was not my code!).

Any chance you could include logs from the crash? Was there an exception / stack trace generated? Alternatively, have you been able to create a simple test app that replicates the behavior?

> There was some Javassist/CGLIB action going on, creating proxies for the sake of transaction management.
>
> Rebooting the troublesome DEAs did not fix the problem.
>
> Doing a `bosh recreate` did reliably fix the problem.
>
> Alternatively, changing the Spring code to wire by interface also reliably fixed the problem.
>
> I can't understand why different DEA instances, from the same BOSH release, with the same config, on the same stemcell, running the same version of Warden, with the same droplet, and the same root FS, and the same env, and the same start command, yielded different behaviour. I'm even further confused as to why a `bosh recreate` changed that behaviour. What could possibly have changed? Something on ephemeral disk? But what else is there on ephemeral disk that could have mattered and was likely to have changed?

How much was on the disk? Was it getting full? How many other apps were running on that DEA (before vs after)?

> Do CGLIB/Javassist have some native dependencies that weren't in sync between DEAs?
>
> Anyone with a convincing explanation (that does not involve voodoo) will receive one free beer and a high-five at the next CF Summit!

Wild guess, race condition in the code somewhere?

Dan
Argh - apologies for the poor formatting. Using the web UI after getting "connection refused" bouncebacks. Am I doing something wrong?
This is the mail system at host smtp1.linuxfoundation.org.
[snip] The mail system
<cf-dev(a)lists.cloudfoundry.org>: connect to 172.17.197.36[172.17.197.36]:25: Connection refused
Hi all,
I've witnessed behaviour caused by the combination of a DEA and a Spring application that I can't explain. If you like a good mystery or you happen to know a lot about Java proxies and DEA transient state, please read on!
A particular Spring app was crashing only on specific DEAs in a Cloud Foundry.
- All DEAs were from the same CF release (PCF ERT 1.5.2)
- All DEAs were up-to-date according to BOSH (i.e. no outstanding changes waiting to be applied)
- All DEAs were deployed with identical BOSH job config
- All Warden containers were using the same root FS
- The droplet was the same across all DEAs
- The droplet version was the same
- The droplet tarballs all had the same MD5 checksum
- Warden was providing the exact same env and start command to all containers
- I saw the same behaviour repeat itself across 5 completely separate Cloud Foundry installations
The crash was Spring not being able to autowire a bean, where it was referenced by implementation rather than interface (yes, I know, but it was not my code!). There was some Javassist/CGLIB action going on, creating proxies for the sake of transaction management.
Rebooting the troublesome DEAs did not fix the problem.
Doing a `bosh recreate` did reliably fix the problem.
Alternatively, changing the Spring code to wire by interface also reliably fixed the problem.
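For what it's worth, the by-interface fix matches a known failure mode of interface-based proxying: when Spring wraps a bean in a JDK dynamic proxy (rather than a CGLIB subclass) to apply @Transactional, the proxy implements the bean's interfaces but is not an instance of the concrete class, so injection by implementation type cannot be satisfied. A sketch with hypothetical names; it illustrates that mechanism, not why only some DEAs exhibited it:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

public interface AccountService {
    void credit(long accountId, int amount);
}

@Service
class AccountServiceImpl implements AccountService {
    @Transactional
    public void credit(long accountId, int amount) {
        // JDBC work elided
    }
}

@Service
class BillingJob {
    // A JDK dynamic proxy implements AccountService but is NOT an
    // AccountServiceImpl, so this injection fails at context startup
    // when interface-based proxying is in effect:
    @Autowired
    AccountServiceImpl byImplementation;

    // Wiring by the interface matches both JDK and CGLIB proxies:
    @Autowired
    AccountService byInterface;
}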
I can't understand why different DEA instances, from the same BOSH release, with the same config, on the same stemcell, running the same version of Warden, with the same droplet, and the same root FS, and the same env, and the same start command, yielded different behaviour. I'm even further confused as to why a `bosh recreate` changed that behaviour. What could possibly have changed? Something on ephemeral disk? But what else is there on ephemeral disk that could have mattered and was likely to have changed? Do CGLIB/Javassist have some native dependencies that weren't in sync between DEAs?
Anyone with a convincing explanation (that does not involve voodoo) will receive one free beer and a high-five at the next CF Summit!
Regards,
Daniel Jones EngineerBetter.com