Re: Update on Mailman 3 launch
We have an open upstream bug about the issue that is causing delays to mail (including messages you post via the web interface sometimes not showing up): https://gitlab.com/mailman/mailman/issues/138. I hope this will have a proper resolution by the end of the week; in the meantime we will be monitoring for this and "unsticking" any messages that get stuck, so if you don't see an error, your message *will* get posted. We have a workaround in place for the above issue (a hack to disable connection pooling), and it seems to have addressed the problem until we can get a better fix.

We've also just filed a new bug related to digest delivery mode that affects 3 or 4 list members, although we haven't yet determined the exact nature of the impact: https://gitlab.com/mailman/mailman/issues/141. Hopefully it will be a quick fix once we learn more about the bug.

Eric
|
|
Re: Running Docker private images on CF
Thanks for the details. I deployed diego-docker-cache-release and can now run private Docker images. One note, however: I had to modify property-overrides.yml to add the IP:<port> of the docker-cache/0 job to the insecure_docker_registry_list for it to work. Without that change it fails with:

{"timestamp":"1439701925.514369965","source":"garden-linux","message":"garden-linux.pool.umojd9q7s54.provide-rootfs-failed","log_level":2,"data":{"error":"repository_fetcher: ProvideRegistry: could not fetch image f93137f1-.. from registry 10.250.21.80:8080: Registry 10.250.21.80:8080 is missing from -insecureDockerRegistryList ([docker-registry.service.cf.internal:8080])","session":"2.13"}}

I suspect Consul discovery is at fault; if not, please suggest what else to check.

Another observation, on the Docker registry URI while running private Docker images (not a Diego issue, I guess): by default (when I don't specify docker_login_server), the images are pulled using the V1 API:

$ cf start myapp
Starting app myapp in org myorg / space default as user...
Creating container
Successfully created container
Staging...
Docker daemon running
Staging process started ...
Caching docker image ...
Logging to https://index.docker.io/v1/ ...
WARNING: login credentials saved in /root/.dockercfg.
Login Succeeded
Logged in.
Pulling docker image <dockerid>/<image>:latest ...
latest: Pulling from <dockerid>/image
511136ea3c5a: Pulling fs layer
30d39e59ffe2: Pulling fs layer
c90d655b99b2: Pulling fs layer
...

When I explicitly specify the V2 URI, which is registry.hub.docker.com (correct me if I am wrong), pulling the image fails:

$ cf start myapp
Starting app myapp in org myorg / space default as user...
Creating container
Successfully created container
Staging...
Docker daemon running
Staging process started ...
Caching docker image ...
Logging to https://registry.hub.docker.com/ ...
WARNING: login credentials saved in /root/.dockercfg.
Login Succeeded
Logged in.
Pulling docker image <dockerid>/<image>:latest ...
time="2015-08-19T19:59:44Z" level=error msg="Error from V2 registry: Authentication is required."
Pulling repository <dockerid>/<image>
Error: image <dockerid>/<image>:latest ... not found

Thanks
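A hedged sketch of the override described above (the exact nesting depends on your manifest-generation stubs; the flattened property names are the ones given in the reply quoted below, and the IP:port is just an example address for the docker-cache/0 job):

diego.garden-linux.insecure_docker_registry_list:
  - docker-registry.service.cf.internal:8080
  - 10.250.21.80:8080   # example only: the IP:<port> of the docker-cache/0 job
diego.stager.insecure_docker_registry: true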
On Tue, Aug 11, 2015 at 6:45 PM, Eric Malm <emalm(a)pivotal.io> wrote: Hi, Dharmi,
In order to run private docker images (that is, ones that require user/password/email authentication with the registry), you'll have to stage them into the optional diego-docker-cache deployed alongside Diego. The BOSH release is located at https://github.com/cloudfoundry-incubator/diego-docker-cache-release. If you've already deployed Diego using the spiff-based manifest-generation templates in diego-release, the deployment for this release should be similar. If you deploy the caching registry release without TLS, or with TLS enabled but using a self-signed certificate, Diego should then be configured with the URL "docker-registry.service.cf.internal:8080" supplied in the diego.garden-linux.insecure_docker_registry_list property, and with diego.stager.insecure_docker_registry set to 'true', as you can see in https://github.com/cloudfoundry-incubator/diego-docker-cache-release/blob/develop/stubs-for-diego-release/bosh-lite-property-overrides.yml.
Once that release is deployed, you can follow the instructions at https://github.com/cloudfoundry-incubator/diego-docker-cache-release#caching-docker-image-with-diego to stage your image into the cache, which should be as simple as setting the DIEGO_DOCKER_CACHE env var to 'true' on your app before staging it. When you start the app, Diego will then instruct Garden to pull the image from the internal caching registry rather than from the remote registry you staged it from. This has the added benefit of ensuring that you're always running exactly the Docker image you staged, rather than something that may have changed in the remote registry.
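In cf CLI terms that is roughly the following (a sketch only; the app name is a placeholder):

$ cf set-env myapp DIEGO_DOCKER_CACHE true
$ cf restage myapp    # re-stages the image, this time into the internal caching registry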
Thanks, Eric, CF Runtime Diego PM
On Tue, Aug 11, 2015 at 9:32 AM, dharmi <dharmi(a)gmail.com> wrote:
We have CF v214 with Diego deployed on AWS.
I am able to successfully create apps from Docker public repo, as per the apidocs <http://apidocs.cloudfoundry.org/214/apps/creating_an_app.html> ,
but, while creating apps from the Docker private repos, I see the below error from 'cf logs' when starting the app.
[API/0] OUT Updated app with guid bcb8f363-xyz ({"route"=>"5af6948b-xyz"})
[API/0] OUT Updated app with guid bcb8f363-xyz ({"state"=>"STARTED"})
[STG/0] OUT Creating container
[STG/0] OUT Successfully created container
[STG/0] OUT Staging...
[STG/0] OUT Staging process started ...
[STG/0] ERR Staging process failed: Exit trace for group:
[STG/0] ERR builder exited with error: failed to fetch metadata from [:dockerid/go-app] with tag [latest] and insecure registries [] due to HTTP code: 404
[STG/0] OUT Exit status 2
[STG/0] ERR Staging Failed: Exited with status 2
[API/0] ERR Failed to stage application: staging failed
cf curl command for reference.
cf curl /v2/apps -X POST -H "Content-Type: application/json" -H "Authorization: bearer *accessToken*" -d ' {"name": "myapp", "space_guid": "71b22eba-xyz", "docker_image": ":dockerid/go-app", "diego": true, "docker_credentials_json": {"docker_login_server": "https://index.docker.io/v1/", "docker_user": ":dockerid", "docker_password": ":dockerpwd", "docker_email": ":email" } }'
Looking at the apidocs, the 'Example value' for 'docker_credentials_json' indicates a Hash value (#<RspecApiDocumentation::Views::HtmlExample:0x0000000bb883e0>), but looking inside the code, we found the below JSON format.
let(:docker_credentials) do
  {
    docker_login_server: login_server,
    docker_user: user,
    docker_password: password,
    docker_email: email
  }
end
Pls correct me if I am missing something.
Thanks, Dharmi
-- View this message in context: http://cf-dev.70369.x6.nabble.com/Running-Docker-private-images-on-CF-tp1148.html Sent from the CF Dev mailing list archive at Nabble.com.
-- "Wise people learn when they can. Fools learn when they must." - The Duke of Ellington
|
|
Re: I'm getting different x_forwarded_for in my Gorouter access logs depending on what browser/cli-tool I use.
|
|
Re: Security Question --- Securely wipe data on warden container removal / destruction???
Will Pragnell <wpragnell@...>
In the Docker image case, the filesystem layer specific to the container is also deleted immediately when the container stops running (this is the same for buildpack based apps on Diego/Garden). Lower layers in the image (i.e. the pre-existing docker image, as pulled from the registry) are not currently removed, even if not used in any other running containers.
In the coming weeks, we'll define and implement a strategy to remove unused images, but the details aren't decided yet.
On 19 August 2015 at 14:57, James Bayer <jbayer(a)pivotal.io> wrote: warden/DEAs keeps container file systems for a configured amount of time, something like 1 hr before removing the containers, i believe with standard removal tools.
diego cells and garden removes container file system immediately after they are stopped by the user or the system. when using docker images, the container images are cached in the garden graph directory and i'm not quite sure of their cleanup / garbage collection life cycle.
On Wed, Aug 19, 2015 at 1:08 AM, Chris K <christopherkugler2(a)yahoo.de> wrote:
Hi,
I have a few questions regarding the way data is removed when an application is removed and its corresponding warden container is destroyed. As the Cloud Foundry instance my company is using may be shared with multiple tenants, this is a very critical question for us to be answered. From Cloud Foundry's GitHub repository I gathered the following information regarding the destruction process:
"When a container is destroyed -- either per user request, or automatically after being idle -- Warden first kills all unprivileged processes running inside the container. These processes first receive a TERM signal followed by a KILL if they haven't exited after a couple of seconds. When these processes have terminated, the root of the container's process tree is sent a KILL . Once all resources the container used have been released, its files are removed and it is considered destroyed." (Quote: https://github.com/cloudfoundry/warden/tree/master/warden)
According to this quote all files of the file system are removed before the resources can be used again. But how are they removed? Are they securely wiped, meaning all blocks are set to zero (or randomized)? And how is data removed from the RAM before it can be assigned to a new warden (i.e. new application).
In case the data is not being securely wiped, how much access does an application have towards the available memory? Is it for example possible to create files of arbitrary size and read / access them?
I'd be thankful for any kind of hints on this topic.
With Regards, Chris
-- Thank you,
James Bayer
|
|
Re: I'm getting different x_forwarded_for in my Gorouter access logs depending on what browser/cli-tool I use.
Simon Johansson <simon@...>
|
|
I'm getting different x_forwarded_for in my Gorouter access logs depending on what browser/cli-tool I use.
Simon Johansson <simon@...>
Hiya! I'm looking into an issue where x_forwarded_for has different values depending on what you use to hit the app. With curl, w3m, and lynx, x_forwarded_for gets set to "sourceIP, a-gateway-ip", whereas with Firefox, Chrome, Opera, and wget, x_forwarded_for is simply set to "sourceIP". This is causing confusion, but most of all it makes me tear my hair out because I cannot figure out what is going on; from what I can see the issue is not in Gorouter directly but in the Golang stdlib. I have made two patches to figure out what is going on.

In the gorouter:

--- a/proxy/proxy.go
+++ b/proxy/proxy.go
@@ -2,18 +2,18 @@ package proxy

 import (
 	"errors"
+	"fmt"
 	"io"
 )

 const (
@@ -117,6 +117,8 @@ func (p *proxy) lookup(request *http.Request) *route.Pool {
 }

 func (p *proxy) ServeHTTP(responseWriter http.ResponseWriter, request *http.Request) {
+	fmt.Println("In proxy.ServeHTTP")
+	fmt.Println("Request: ", request)
 	startedAt := time.Now()

 	accessLog := access_log.AccessLogRecord{
@@ -207,7 +209,9 @@ func (p *proxy) ServeHTTP(responseWriter http.ResponseWriter, request *http.Requ
 		},
 	}

+	fmt.Println("X-Forwarded-For before newReverseProxy.ServeHTTP: ", request.Header.Get("X-Forwarded-For"))
 	p.newReverseProxy(roundTripper, request).ServeHTTP(proxyWriter, request)
+	fmt.Println("X-Forwarded-For after newReverseProxy.ServeHTTP: ", request.Header.Get("X-Forwarded-For"))

 	accessLog.FinishedAt = time.Now()
 	accessLog.BodyBytesSent = proxyWriter.Size()

And in golang/src/net/http/httputil/reverseproxy.go:

--- a/src/net/http/httputil/reverseproxy.go
+++ b/src/net/http/httputil/reverseproxy.go
@@ -7,6 +7,7 @@ package httputil

 import (
+	"fmt"
 	"io"
 	"log"
 	"net"
@@ -101,6 +102,7 @@ var hopHeaders = []string{
 }

 func (p *ReverseProxy) ServeHTTP(rw http.ResponseWriter, req *http.Request) {
+	fmt.Println("In net/http/httputil/reverseproxy.go: ServeHTTP")
 	transport := p.Transport
 	if transport == nil {
 		transport = http.DefaultTransport
@@ -132,6 +134,7 @@ func (p *ReverseProxy) ServeHTTP(rw http.ResponseWriter, req *http.Request) {
 		}
 	}

+	fmt.Println("X-Forwarded-For in req before 'If we aren't the first proxy retain prior': ", req.Header.Get("X-Forwarded-For"))
 	if clientIP, _, err := net.SplitHostPort(req.RemoteAddr); err == nil {
 		// If we aren't the first proxy retain prior
 		// X-Forwarded-For information as a comma+space
@@ -140,7 +143,9 @@ func (p *ReverseProxy) ServeHTTP(rw http.ResponseWriter, req *http.Request) {
 			clientIP = strings.Join(prior, ", ") + ", " + clientIP
 		}
 		outreq.Header.Set("X-Forwarded-For", clientIP)
+		fmt.Println("X-Forwarded-For in outreq: ", outreq.Header.Get("X-Forwarded-For"))
 	}
+	fmt.Println("X-Forwarded-For in req after 'If we aren't the first proxy retain prior': ", req.Header.Get("X-Forwarded-For"))

 	res, err := transport.RoundTrip(outreq)
 	if err != nil {
@@ -158,6 +163,7 @@ func (p *ReverseProxy) ServeHTTP(rw http.ResponseWriter, req *http.Request) {
 	rw.WriteHeader(res.StatusCode)

 	p.copyResponse(rw, res.Body)
+	fmt.Println("Done in net/http/httputil/reverseproxy.go: ServeHTTP")
 }

So basically debug printing. This is the difference I see (first example with 2 IPs in the header, second example with 1 IP in the header), both requests from the same machine to the same Gorouter.
Logs from gorouter:

In proxy.ServeHTTP
Request: &{GET / HTTP/1.1 1 1 map[Accept:[*/*] Forwarded:[for=172.21.27.221; proto=http] X-Forwarded-Proto:[http] X-Forwarded-For:[172.21.27.221] True_client_ip:[172.21.27.221] X-Cf-Requestid:[874ed33a-c361-4946-74c3-c18b693ade7e] User-Agent:[curl/7.43.0]] 0xbb2730 0 [] false cf-env.test.cf.springer-sbm.com map[] map[] <nil> map[] 10.230.31.8:43597 / <nil>}
X-Forwarded-For before newReverseProxy.ServeHTTP: 172.21.27.221
In net/http/httputil/reverseproxy.go: ServeHTTP
X-Forwarded-For in req before 'If we aren't the first proxy retain prior': 172.21.27.221
X-Forwarded-For in outreq: 172.21.27.221, 10.230.31.8
X-Forwarded-For in req after 'If we aren't the first proxy retain prior': 172.21.27.221, 10.230.31.8
Done in net/http/httputil/reverseproxy.go: ServeHTTP
X-Forwarded-For after newReverseProxy.ServeHTTP: 172.21.27.221, 10.230.31.8

$ cf logs cf-env
2015-08-19T15:11:36.22+0200 [RTR/0] OUT cf-env.test.cf.springer-sbm.com - [19/08/2015:13:11:36 +0000] "GET / HTTP/1.1" 200 0 4155 "-" "curl/7.43.0" 10.230.31.8:43597 x_forwarded_for:"172.21.27.221, 10.230.31.8" vcap_request_id:c2ce64d5-5951-46fb-7b0a-9ea233c34823 response_time:0.011068200 app_id:9011c83a-9407-4e0a-ae91-0c66ff3d6e92

Logs from gorouter:

In proxy.ServeHTTP
Request: &{GET / HTTP/1.1 1 1 map[True_client_ip:[172.21.27.221] X-Cf-Requestid:[d7629a05-cbae-4f13-63d2-43acf5bfe8c9] User-Agent:[Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0] Accept-Language:[en-US,en;q=0.5] Accept-Encoding:[gzip, deflate] Forwarded:[for=172.21.27.221; proto=http] X-Forwarded-Proto:[http] X-Forwarded-For:[172.21.27.221] Accept:[text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8] Cookie:[wt3_eid=%3B846182373400841%7C2141346926300579527%3B532695141032829%7C2142289007300951926%3B929408468507536%7C2142323421600457694%3B987963572816400%7C2142323579400274413%3B595753140200997%7C2142364572200667778%3B754741632944378%7C2142539934000605010] Connection:[keep-alive]] 0xbb2730 0 [] false cf-env.test.cf.springer-sbm.com map[] map[] <nil> map[] 10.230.31.8:43601 / <nil>}
X-Forwarded-For before newReverseProxy.ServeHTTP: 172.21.27.221
In net/http/httputil/reverseproxy.go: ServeHTTP
X-Forwarded-For in req before 'If we aren't the first proxy retain prior': 172.21.27.221
X-Forwarded-For in outreq: 172.21.27.221, 10.230.31.8
X-Forwarded-For in req after 'If we aren't the first proxy retain prior': 172.21.27.221
Done in net/http/httputil/reverseproxy.go: ServeHTTP
X-Forwarded-For after newReverseProxy.ServeHTTP: 172.21.27.221

$ cf logs cf-env
2015-08-19T15:12:16.28+0200 [RTR/0] OUT cf-env.test.cf.springer-sbm.com - [19/08/2015:13:12:16 +0000] "GET / HTTP/1.1" 200 0 4700 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0" 10.230.31.8:43601 x_forwarded_for:"172.21.27.221" vcap_request_id:f807d34a-9191-4a29-5831-cb78eece72e1 response_time:0.008955044 app_id:9011c83a-9407-4e0a-ae91-0c66ff3d6e92

I use Gorouter 212 source and Golang 1.4.2 source. What confuses me is that here (https://github.com/golang/go/blob/release-branch.go1.4/src/net/http/httputil/reverseproxy.go#L110) we set outreq to req, and here (https://github.com/golang/go/blob/release-branch.go1.4/src/net/http/httputil/reverseproxy.go#L142) we set outreq's X-Forwarded-For header, which affects req's X-Forwarded-For header when using curl/w3m/lynx but not with Firefox/Opera/Chrome/wget.

Anyone ever seen anything similar, or anything obvious I have missed?
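For illustration, a minimal, self-contained Go sketch (not gorouter code) of the aliasing involved: the *outreq = *req shallow copy in ReverseProxy shares the Header map with the original request, and the proxy only replaces that map when it has to strip a hop-by-hop header. One possible explanation — an assumption to verify, not a confirmed diagnosis — is that Firefox/Chrome/Opera/wget send Connection: keep-alive (visible in the second request dump above) while curl/w3m/lynx do not, so only in the curl case does the X-Forwarded-For write land in the shared map that the access log later reads back from req.

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Case 1: client sends no hop-by-hop headers (curl, w3m, lynx).
	req1, _ := http.NewRequest("GET", "http://cf-env.example.com/", nil)
	req1.Header.Set("X-Forwarded-For", "172.21.27.221")

	out1 := new(http.Request)
	*out1 = *req1 // shallow copy: out1.Header and req1.Header are the same map
	out1.Header.Set("X-Forwarded-For", "172.21.27.221, 10.230.31.8")
	fmt.Println("no hop-by-hop headers, req sees:", req1.Header.Get("X-Forwarded-For"))
	// prints both IPs: the write through out1 is visible through req1

	// Case 2: client sends "Connection: keep-alive" (Firefox, Chrome, Opera, wget).
	// ReverseProxy has to delete that hop-by-hop header, so it first replaces the
	// outgoing request's Header with a fresh copy; later writes no longer reach req.
	req2, _ := http.NewRequest("GET", "http://cf-env.example.com/", nil)
	req2.Header.Set("X-Forwarded-For", "172.21.27.221")
	req2.Header.Set("Connection", "keep-alive")

	out2 := new(http.Request)
	*out2 = *req2
	out2.Header = make(http.Header) // the "copiedHeaders" branch in reverseproxy.go
	for k, v := range req2.Header {
		out2.Header[k] = append([]string(nil), v...)
	}
	out2.Header.Del("Connection")
	out2.Header.Set("X-Forwarded-For", "172.21.27.221, 10.230.31.8")
	fmt.Println("with Connection header, req sees:", req2.Header.Get("X-Forwarded-For"))
	// prints only the original client IP, matching the single-IP access log line
}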
|
|
Re: Security Question --- Securely wipe data on warden container removal / destruction???
warden/DEAs keeps container file systems for a configured amount of time, something like 1 hr before removing the containers, i believe with standard removal tools. diego cells and garden removes container file system immediately after they are stopped by the user or the system. when using docker images, the container images are cached in the garden graph directory and i'm not quite sure of their cleanup / garbage collection life cycle. On Wed, Aug 19, 2015 at 1:08 AM, Chris K <christopherkugler2(a)yahoo.de> wrote: Hi,
I have a few questions regarding the way data is removed when an application is removed and its corresponding warden container is destroyed. As the Cloud Foundry instance my company is using may be shared with multiple tenants, this is a very critical question for us to be answered. From Cloud Foundry's GitHub repository I gathered the following information regarding the destruction process:
"When a container is destroyed -- either per user request, or automatically after being idle -- Warden first kills all unprivileged processes running inside the container. These processes first receive a TERM signal followed by a KILL if they haven't exited after a couple of seconds. When these processes have terminated, the root of the container's process tree is sent a KILL . Once all resources the container used have been released, its files are removed and it is considered destroyed." (Quote: https://github.com/cloudfoundry/warden/tree/master/warden)
According to this quote all files of the file system are removed before the resources can be used again. But how are they removed? Are they securely wiped, meaning all blocks are set to zero (or randomized)? And how is data removed from the RAM before it can be assigned to a new warden (i.e. new application).
In case the data is not being securely wiped, how much access does an application have towards the available memory? Is it for example possible to create files of arbitrary size and read / access them?
I'd be thankful for any kind of hints on this topic.
With Regards, Chris
-- Thank you, James Bayer
|
|
Re: More reliable way to collect user application logs
|
|
Re: CF integration with logging and monitoring tool
what is your logging and metrics tool? does it have an api?
the loggregator nozzles are the go-forward approach to tapping into logs and metrics in the entire cf system and sending them somewhere.
On Wed, Aug 12, 2015 at 4:14 AM, Swatz bosh <swatzron(a)gmail.com> wrote: I would like to know how to integrate CF with 3rd-party application logging and monitoring tools like Graphite, Nagios, etc., and what the recommended approach is. I found a few articles mentioning that the firehose is the better option, whereas I remember that having the collector (stats_z1/z2) job point at such a monitoring server works well. So what steps do I need to follow to integrate such an application monitoring tool using the firehose? Do I need to write nozzles for my monitoring tool like CloudCredo did for Graphite?

https://github.com/CloudCredo/graphite-nozzle
http://www.cloudcredo.com/how-to-integrate-graphite-with-cloud-foundry/

So if I have to integrate with Nagios, Wily, Splunk, etc., would I need a nozzle for each of them? If not, what changes do I need to make in my configuration and in the buildpack (not sure)?
I also found the NOAA client, https://github.com/cloudfoundry/noaa, which I think consumes all logs from Doppler and shows them on the console? How this NOAA client uses the firehose is not very clear from the documentation.
-- Thank you,
James Bayer
|
|
Re: Running Docker private images on CF
On Mon, Aug 10, 2015 at 3:34 PM, Dharmendra Sarkar <dharmi(a)gmail.com> wrote: We have CF v214 with Diego deployed on AWS.
I am able to successfully create apps from Docker public repo, as per the apidocs, but, while creating apps from the Docker private repos, I see the below error from 'cf logs' when starting the app. 'appreciate any pointers.
[API/0] OUT Updated app with guid bcb8f363-xyz ({"route"=>"5af6948b-xyz"})
[API/0] OUT Updated app with guid bcb8f363-xyz ({"state"=>"STARTED"})
[STG/0] OUT Creating container
[STG/0] OUT Successfully created container
[STG/0] OUT Staging...
[STG/0] OUT Staging process started ...
[STG/0] ERR Staging process failed: Exit trace for group:
[STG/0] ERR builder exited with error: failed to fetch metadata from [adobecloud/go-app] with tag [latest] and insecure registries [] due to HTTP code: 404
[STG/0] OUT Exit status 2
[STG/0] ERR Staging Failed: Exited with status 2
[API/0] ERR Failed to stage application: staging failed
cf curl command for reference.
cf curl /v2/apps -X POST -H "Content-Type: application/json" -H "Authorization: bearer *accessToken*" -d ' {"name": "myapp", "space_guid": "71b22eba-xyz", "docker_image": "adobecloud/go-app", "diego": true, "docker_credentials_json": {"docker_login_server": "https://index.docker.io/v1/", "docker_user": ":dockerid", "docker_password": ":dockerpwd", "docker_email": ":email" } }'
Looking at the apidocs, the 'Example value' for 'docker_credentials_json' indicates a Hash value (#<RspecApiDocumentation::Views::HtmlExample:0x0000000bb883e0>), but looking inside the code, 'found the below JSON format.
let(:docker_credentials) do
  {
    docker_login_server: login_server,
    docker_user: user,
    docker_password: password,
    docker_email: email
  }
end
Pls correct me if I am missing something.
Thanks, Dharmi
-- Thank you,
James Bayer
|
|
Re: Notifications for service provisioning
there is no standard notification mechanism built into cf today for this kind of information. as an administrator, you could potentially build and deploy a loggregator firehose nozzle that looks for this type of information and creates notifications based on that content.

the service lifecycle events are tracked and stored in the cloud controller database and accessible via the api and cli. for example:

$ cf curl /v2/events?q=type:audit.service_instance.create

see this and similar lifecycle events for services: http://apidocs.cloudfoundry.org/215/events/list_service_instance_create_events.html

there have been recent threads on the list about having a more proactive notification system component, but that is still in the formative stages of discussion.

On Wed, Aug 12, 2015 at 11:25 AM, Vineet Banga <vineetbanga1(a)gmail.com> wrote: Is there any notification mechanism available in CF to listen on service broker create/update/delete calls? We are implementing multiple services exposed via Service Broker in the marketplace and would like to take certain common actions when services are being provisioned.
Vineet
-- Thank you, James Bayer
|
|
Re: Bizarre DEA + Spring Behaviour
here is some guidance on how to check for available entropy on a linux host [1]. i'm not sure if the bosh agent, DEA or diego cell captures this metric or not, but we should certainly look into it. when you're inside a container, you can check for available entropy with the "cf ssh" command that is supported now with diego (or the app itself could log it before startup). see an example of this command running on pivotal's hosted diego [2]; values lower than 200 while you're trying to do operations that need entropy can cause a problem.

[1] https://major.io/2007/07/01/check-available-entropy-in-linux/

[2]
$ cf ssh MYAPP
vcap(a)uqj9t0vqu9l:~$ cat /proc/sys/kernel/random/entropy_avail; date;
855
Wed Aug 19 13:32:05 UTC 2015
vcap(a)uqj9t0vqu9l:~$ cat /proc/sys/kernel/random/entropy_avail; date;
866
Wed Aug 19 13:32:07 UTC 2015
vcap(a)uqj9t0vqu9l:~$ cat /proc/sys/kernel/random/entropy_avail; date;
876
Wed Aug 19 13:32:08 UTC 2015
On Wed, Aug 19, 2015 at 5:06 AM, Daniel Mikusa <dmikusa(a)pivotal.io> wrote: I've seen this happen to a good number of apps running on PWS, so it's something you can encounter when running CF on AWS as well.
What usually happens is that the application takes significantly longer to start, sometimes to the point where it fails to start quick enough and CF marks it as crashed. I haven't seen it cause any NPE's though. My understanding is that the JVM will just block until it gets the entropy it needs.
Dan
On Wed, Aug 19, 2015 at 3:32 AM, Johannes Hiemer <jvhiemer(a)gmail.com> wrote:
Daniel, I have had a problem with the deployment of Spring applications on Openstack recently as well. I am also not sure, without seeing the logs, what could be the reason, but did you try: http://www.evoila.de/vsphere/java-applications-not-starting-on-openstack-based-cloud-foundry-deployment/?lang=en
Regards, Johannes
On Wed, Aug 19, 2015 at 9:28 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote:
Thanks for the input, that's a good call. A colleague of mine (who is currently on vacation) did look at that... Sadly he's not around to ask what he tested.
On Wed, Aug 19, 2015 at 7:06 AM, Guillaume Berche <bercheg(a)gmail.com> wrote:
Some other "state" on the dea host such has shorteage on /dev/random (that went away with vm reconstruction but not with dea job restart) ?
Guillaume. Le 12 août 2015 14:14, "Daniel Mikusa" <dmikusa(a)pivotal.io> a écrit :
It seems like you were pretty thorough. I can't think of anything that would be different or that could cause symptoms like this, although I could be overlooking something as well. Without logs / app to try and replicate I'm not sure I can help much more. Sorry.
Perhaps someone else on the list has some thoughts?
Dan
On Wed, Aug 12, 2015 at 3:25 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote:
Hi Dan,
Thanks for taking the time to reply.
I didn't include too much in the way of detail, as I was thinking that there must be a moving part in the equation I'm blind to, in which case that's a gap in my knowledge that I ought to fill in.
As we did `bosh recreate` on all the VMs, which fixed it` I can't go back and fetch logs unfortunately. There's no chance of being able to create a test case as I'm on client's time, so consider this a thought exercise :)
The app was Spring Boot 1.2.3, pulling in Spring Boot JDBC and Spring LDAP. Root FS was cflinuxfs2, and the Java buildpack logged the same for both. On some failing DEAs there were no other apps, on others there were - it didn't seem to be a factor. All DEAs had plenty of disk space.
I was wondering if there was a race condition, but I assumed Spring contexts start single-threadedly. Do you know if that's a correct assumption?
Do you know if there any *things* that could have been different between the DEAs that I didn't account for? Ie another moving part that's *not* either release, job, stemcell, droplet, root FS, app environment?
On Tue, Aug 11, 2015 at 12:32 PM, Daniel Mikusa <dmikusa(a)pivotal.io> wrote:
On Tue, Aug 11, 2015 at 5:15 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote:
Hi all,
I've witnessed behaviour caused by the combination of a DEA and a Spring application that I can't explain. If you like a good mystery or you happen to know a lot about Java proxies and DEA transient state, please read on!
A particular Spring app Version of Spring? What parts of Spring are you pulling into the app?
was crashing only on specific DEAs in a Cloud Foundry.
Ever try bumping up the log level for Spring when you were getting the problem? If so, did the problem still occur? Were you able to capture the logs?
All DEAs were from the same CF release (PCF ERT 1.5.2)
All DEAs were up-to-date according to BOSH (ie no outstanding changes waiting to be applied)
All DEAs were deployed with identical BOSH job config
All Warden containers were using the same root FS
lucid64 or cflinuxfs2? or didn't matter?
The droplet was the same across all DEAs
The droplet version was the same
The droplet tarballs all had the same MD5 checksum
What was the output of the Java build pack when the droplet was created? or better yet, run `cf files <app> app/.java-buildpack.log` and include the output.
Warden was providing the exact same env and start command to all containers
I saw the same behaviour repeat itself across 5 completely separate Cloud Foundry installations
The crash was Spring not being able to autowire a bean, where it was referenced by implementation rather than interface (yes, I know, but it was not my code!). Any chance you could include logs from the crash? Was there an exception / stacktrace generated? Alternatively, have you been able to create a simple test app that replicates the behavior?
There was some Javassist/CGLIB action going on, creating proxies for the sake of transaction management.
Rebooting the troublesome DEAs did not fix the problem.
Doing a `bosh recreate` did reliably fix the problem.
Alternatively, changing the Spring code to wire by interface also reliably fixed the problem.
I can't understand why different DEA instances, from the same BOSH release, with the same config, on the same stemcell, running the same version of Warden, with the same droplet, and the same root FS, and the same env, and the same start command, yielded different behaviour. I'm even further confused as to why a `bosh recreate` changed that behaviour. What could possibly have changed? Something on ephemeral disk? But what else is there on ephemeral disk that could have mattered and was likely to have changed? How much was on the disk? Was it getting full? How many other apps were running on that DEA (before vs after)?
Do CGLIB/Javassist have some native dependencies that weren't in sync between DEAs?
Anyone with a convincing explanation (that does not involve voodoo) will receive one free beer and a high-five at the next CF Summit!
Wild guess, race condition in the code somewhere?
Dan
-- Regards,
Daniel Jones EngineerBetter.com
-- Regards,
Daniel Jones EngineerBetter.com
-- Mit freundlichen Grüßen
Johannes Hiemer
-- Thank you,
James Bayer
|
|
Re: Bizarre DEA + Spring Behaviour
I've seen this happen to a good number of apps running on PWS, so it's something you can encounter when running CF on AWS as well.
What usually happens is that the application takes significantly longer to start, sometimes to the point where it fails to start quick enough and CF marks it as crashed. I haven't seen it cause any NPE's though. My understanding is that the JVM will just block until it gets the entropy it needs.
Dan
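If slow starts from blocking on /dev/random do turn out to be the cause, a commonly used JVM-side workaround is to point the JVM at the non-blocking pool. A hedged sketch only (not something confirmed as the fix in this thread, and whether JAVA_OPTS is picked up this way depends on how the Java buildpack is configured):

$ cf set-env myapp JAVA_OPTS "-Djava.security.egd=file:/dev/./urandom"
$ cf restage myapp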
On Wed, Aug 19, 2015 at 3:32 AM, Johannes Hiemer <jvhiemer(a)gmail.com> wrote: Daniel, I have had a problem with the deployment of Spring applications on Openstack recently as well. I am also not sure, without seeing the logs, what could be the reason, but did you try: http://www.evoila.de/vsphere/java-applications-not-starting-on-openstack-based-cloud-foundry-deployment/?lang=en
Regards, Johannes
On Wed, Aug 19, 2015 at 9:28 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote:
Thanks for the input, that's a good call. A colleague of mine (who is currently on vacation) did look at that... Sadly he's not around to ask what he tested.
On Wed, Aug 19, 2015 at 7:06 AM, Guillaume Berche <bercheg(a)gmail.com> wrote:
Some other "state" on the dea host such has shorteage on /dev/random (that went away with vm reconstruction but not with dea job restart) ?
Guillaume. Le 12 août 2015 14:14, "Daniel Mikusa" <dmikusa(a)pivotal.io> a écrit :
It seems like you were pretty thorough. I can't think of anything that would be different or that could cause symptoms like this, although I could be overlooking something as well. Without logs / app to try and replicate I'm not sure I can help much more. Sorry.
Perhaps someone else on the list has some thoughts?
Dan
On Wed, Aug 12, 2015 at 3:25 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote:
Hi Dan,
Thanks for taking the time to reply.
I didn't include too much in the way of detail, as I was thinking that there must be a moving part in the equation I'm blind to, in which case that's a gap in my knowledge that I ought to fill in.
As we did `bosh recreate` on all the VMs, which fixed it` I can't go back and fetch logs unfortunately. There's no chance of being able to create a test case as I'm on client's time, so consider this a thought exercise :)
The app was Spring Boot 1.2.3, pulling in Spring Boot JDBC and Spring LDAP. Root FS was cflinuxfs2, and the Java buildpack logged the same for both. On some failing DEAs there were no other apps, on others there were - it didn't seem to be a factor. All DEAs had plenty of disk space.
I was wondering if there was a race condition, but I assumed Spring contexts start single-threadedly. Do you know if that's a correct assumption?
Do you know if there any *things* that could have been different between the DEAs that I didn't account for? Ie another moving part that's *not* either release, job, stemcell, droplet, root FS, app environment?
On Tue, Aug 11, 2015 at 12:32 PM, Daniel Mikusa <dmikusa(a)pivotal.io> wrote:
On Tue, Aug 11, 2015 at 5:15 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote:
Hi all,
I've witnessed behaviour caused by the combination of a DEA and a Spring application that I can't explain. If you like a good mystery or you happen to know a lot about Java proxies and DEA transient state, please read on!
A particular Spring app Version of Spring? What parts of Spring are you pulling into the app?
was crashing only on specific DEAs in a Cloud Foundry.
Ever try bumping up the log level for Spring when you were getting the problem? If so, did the problem still occur? Were you able to capture the logs?
All DEAs were from the same CF release (PCF ERT 1.5.2) All DEAs were up-to-date according to BOSH (ie no outstanding changes waiting to be applied) All DEAs were deployed with identical BOSH job config All Warden containers were using the same root FS
lucid64 or cflinuxfs2? or didn't matter?
The droplet was the same across all DEAs The droplet version was the same The droplet tarballs all had the same MD5 checksum
What was the output of the Java build pack when the droplet was created? or better yet, run `cf files <app> app/.java-buildpack.log` and include the output.
Warden was providing the exact same env and start command to all containers I saw the same behaviour repeat itself across 5 completely separate Cloud Foundry installations
The crash was Spring not being able to autowire a bean, where it was referenced by implementation rather than interface (yes, I know, but it was not my code!). Any chance you could include logs from the crash? Was there an exception / stacktrace generated? Alternatively, have you been able to create a simple test app that replicates the behavior?
There was some Javassist/CGLIB action going on, creating proxies for the sake of transaction management.
Rebooting the troublesome DEAs did not fix the problem.
Doing a `bosh recreate` did reliably fix the problem.
Alternatively, changing the Spring code to wire by interface also reliably fixed the problem.
I can't understand why different DEA instances, from the same BOSH release, with the same config, on the same stemcell, running the same version of Warden, with the same droplet, and the same root FS, and the same env, and the same start command, yielded different behaviour. I'm even further confused as to why a `bosh recreate` changed that behaviour. What could possibly have changed? Something on ephemeral disk? But what else is there on ephemeral disk that could have mattered and was likely to have changed? How much was on the disk? Was it getting full? How many other apps were running on that DEA (before vs after)?
Do CGLIB/Javassist have some native dependencies that weren't in sync between DEAs?
Anyone with a convincing explanation (that does not involve voodoo) will receive one free beer and a high-five at the next CF Summit!
Wild guess, race condition in the code somewhere?
Dan
-- Regards,
Daniel Jones EngineerBetter.com
-- Regards,
Daniel Jones EngineerBetter.com
-- Mit freundlichen Grüßen
Johannes Hiemer
|
|
Re: Security group rules to allow HTTP communication between 2 apps deployed on CF
On Sat, Aug 8, 2015 at 2:33 AM, Ahmad Ferdous Bin Alam < ahmadferdous(a)gmail.com> wrote: Hi,
I have deployed two node.js (express) applications - App1 and App2 - on a CF local instance. App2 consumes a service exposed (REST API) by App1. When App2 receives a request, it needs to communicate with App1. It worked all good when I tested. Once they are deployed on CF, it didn't work.
It turned out that App2 got error 'connect ECONNREFUSED'. How are you trying to connect to App1 from App2? If you access App2's URL, it should work? i.e. app-2.your-cf-domain.com I thought it might be a security group rule issue that prevented outbound traffic to App1. So I added a security group allowing all outgoing traffic. But it didn't help. Now I think it may have to do with inbound traffic rule. For inbound traffic, the restriction is HTTP, HTTPS & WebSockets. I don't believe there are any further restrictions. I searched for documentation as to how inbound traffic rules can be added but couldn't find.
My questions are: 1) Is it possible at all to have 2 apps deployed on CF communication with each other over HTTP?
Yes. If you deploy App2 and have it send a request to App1, that should work as long as you use the URL for App1. 2) Is the security group given below correct? Its purpose is to allow all outgoing traffic.
This is the group I've used to allow everything. What you've entered looks OK too. [ { "destination": "0.0.0.0-255.255.255.255", "protocol": "all" } ] Don't forget to bind the security group to your space or to the running / staging groups. Also, I think you need to restart or restage your app so it's container gets recreated with the new rules. 3) Is there any way we can add inbound traffic 'allow' rules? Shouldn't be necessary. Dan Please help.
Additional info:
- I have CF locally installed as a Vagrant devbox (host Ubuntu 14.04). I used the NISE installer: https://github.com/yudai/cf_nise_installer
- I added the following security group to allow all outgoing traffic. I bound it to both staging and running security groups and finally restarted the apps so that the rules get applied.

[ { "protocol":"tcp", "destination":"0.0.0.0/0", "ports":"1-65535" }, { "protocol":"udp", "destination":"0.0.0.0/0", "ports":"1-65535" } ]
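For completeness, the binding and restart steps mentioned above in cf CLI form (a rough sketch; the group and app names are placeholders):

$ cf create-security-group all-open rules.json
$ cf bind-running-security-group all-open      # or bind to one space: cf bind-security-group all-open ORG SPACE
$ cf restart app2                              # recreate the container so the new rules take effect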
|
|
Re: no more stdout in app files since upgrade to 214
|
|
Security Question --- Securely wipe data on warden container removal / destruction???
Hi,

I have a few questions regarding the way data is removed when an application is removed and its corresponding warden container is destroyed. As the Cloud Foundry instance my company is using may be shared with multiple tenants, this is a very critical question for us to be answered. From Cloud Foundry's GitHub repository I gathered the following information regarding the destruction process:

"When a container is destroyed -- either per user request, or automatically after being idle -- Warden first kills all unprivileged processes running inside the container. These processes first receive a TERM signal followed by a KILL if they haven't exited after a couple of seconds. When these processes have terminated, the root of the container's process tree is sent a KILL. Once all resources the container used have been released, its files are removed and it is considered destroyed." (Quote: https://github.com/cloudfoundry/warden/tree/master/warden)

According to this quote all files of the file system are removed before the resources can be used again. But how are they removed? Are they securely wiped, meaning all blocks are set to zero (or randomized)? And how is data removed from the RAM before it can be assigned to a new warden (i.e. new application).

In case the data is not being securely wiped, how much access does an application have towards the available memory? Is it for example possible to create files of arbitrary size and read / access them?

I'd be thankful for any kind of hints on this topic.

With Regards, Chris
|
|
Re: More reliable way to collect user application logs
|
|
Re: Bizarre DEA + Spring Behaviour
Johannes Hiemer <jvhiemer@...>
Go for it and let's see if we can document this issue afterwards with some logs for other people. On Wed, Aug 19, 2015 at 9:55 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote: Ooh, that's interesting. Coupled with what Guillaume suggested, I can imagine that being a problem. We did get a NullPointerException logged by some Spring Security component where we couldn't figure out what could possibly be null, so it's conceivable that some nested call to java.util.Random failed and returned null.
Sadly I don't have the logs any more, but this narrative is convincing enough to make me think it might have been the problem :)
On Wed, Aug 19, 2015 at 8:32 AM, Johannes Hiemer <jvhiemer(a)gmail.com> wrote:
Daniel, I have had a problem with the deployment of Spring applications on Openstack recently as well. I am also not sure, without seeing the logs, what could be the reason, but did you try: http://www.evoila.de/vsphere/java-applications-not-starting-on-openstack-based-cloud-foundry-deployment/?lang=en
Regards, Johannes
On Wed, Aug 19, 2015 at 9:28 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote:
Thanks for the input, that's a good call. A colleague of mine (who is currently on vacation) did look at that... Sadly he's not around to ask what he tested.
On Wed, Aug 19, 2015 at 7:06 AM, Guillaume Berche <bercheg(a)gmail.com> wrote:
Some other "state" on the dea host such has shorteage on /dev/random (that went away with vm reconstruction but not with dea job restart) ?
Guillaume. Le 12 août 2015 14:14, "Daniel Mikusa" <dmikusa(a)pivotal.io> a écrit :
It seems like you were pretty thorough. I can't think of anything that would be different or that could cause symptoms like this, although I could be overlooking something as well. Without logs / app to try and replicate I'm not sure I can help much more. Sorry.
Perhaps someone else on the list has some thoughts?
Dan
On Wed, Aug 12, 2015 at 3:25 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote:
Hi Dan,
Thanks for taking the time to reply.
I didn't include too much in the way of detail, as I was thinking that there must be a moving part in the equation I'm blind to, in which case that's a gap in my knowledge that I ought to fill in.
As we did `bosh recreate` on all the VMs, which fixed it` I can't go back and fetch logs unfortunately. There's no chance of being able to create a test case as I'm on client's time, so consider this a thought exercise :)
The app was Spring Boot 1.2.3, pulling in Spring Boot JDBC and Spring LDAP. Root FS was cflinuxfs2, and the Java buildpack logged the same for both. On some failing DEAs there were no other apps, on others there were - it didn't seem to be a factor. All DEAs had plenty of disk space.
I was wondering if there was a race condition, but I assumed Spring contexts start single-threadedly. Do you know if that's a correct assumption?
Do you know if there any *things* that could have been different between the DEAs that I didn't account for? Ie another moving part that's *not* either release, job, stemcell, droplet, root FS, app environment?
On Tue, Aug 11, 2015 at 12:32 PM, Daniel Mikusa <dmikusa(a)pivotal.io> wrote:
On Tue, Aug 11, 2015 at 5:15 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote:
Hi all,
I've witnessed behaviour caused by the combination of a DEA and a Spring application that I can't explain. If you like a good mystery or you happen to know a lot about Java proxies and DEA transient state, please read on!
A particular Spring app Version of Spring? What parts of Spring are you pulling into the app?
was crashing only on specific DEAs in a Cloud Foundry.
Ever try bumping up the log level for Spring when you were getting the problem? If so, did the problem still occur? Were you able to capture the logs?
All DEAs were from the same CF release (PCF ERT 1.5.2) All DEAs were up-to-date according to BOSH (ie no outstanding changes waiting to be applied) All DEAs were deployed with identical BOSH job config All Warden containers were using the same root FS
lucid64 or cflinuxfs2? or didn't matter?
The droplet was the same across all DEAs The droplet version was the same The droplet tarballs all had the same MD5 checksum
What was the output of the Java build pack when the droplet was created? or better yet, run `cf files <app> app/.java-buildpack.log` and include the output.
Warden was providing the exact same env and start command to all containers I saw the same behaviour repeat itself across 5 completely separate Cloud Foundry installations
The crash was Spring not being able to autowire a bean, where it was referenced by implementation rather than interface (yes, I know, but it was not my code!). Any chance you could include logs from the crash? Was there an exception / stacktrace generated? Alternatively, have you been able to create a simple test app that replicates the behavior?
There was some Javassist/CGLIB action going on, creating proxies for the sake of transaction management.
Rebooting the troublesome DEAs did not fix the problem.
Doing a `bosh recreate` did reliably fix the problem.
Alternatively, changing the Spring code to wire by interface also reliably fixed the problem.
I can't understand why different DEA instances, from the same BOSH release, with the same config, on the same stemcell, running the same version of Warden, with the same droplet, and the same root FS, and the same env, and the same start command, yielded different behaviour. I'm even further confused as to why a `bosh recreate` changed that behaviour. What could possibly have changed? Something on ephemeral disk? But what else is there on ephemeral disk that could have mattered and was likely to have changed? How much was on the disk? Was it getting full? How many other apps were running on that DEA (before vs after)?
Do CGLIB/Javassist have some native dependencies that weren't in sync between DEAs?
Anyone with a convincing explanation (that does not involve voodoo) will receive one free beer and a high-five at the next CF Summit!
Wild guess, race condition in the code somewhere?
Dan
-- Regards,
Daniel Jones EngineerBetter.com
-- Regards,
Daniel Jones EngineerBetter.com
-- Mit freundlichen Grüßen
Johannes Hiemer
-- Regards,
Daniel Jones EngineerBetter.com
-- Mit freundlichen Grüßen Johannes Hiemer
|
|
Re: Bizarre DEA + Spring Behaviour
Ooh, that's interesting. Coupled with what Guillaume suggested, I can imagine that being a problem. We did get a NullPointerException logged by some Spring Security component where we couldn't figure out what could possibly be null, so it's conceivable that some nested call to java.util.Random failed and returned null.
Sadly I don't have the logs any more, but this narrative is convincing enough to make me think it might have been the problem :)
On Wed, Aug 19, 2015 at 8:32 AM, Johannes Hiemer <jvhiemer(a)gmail.com> wrote: Daniel, I have had a problem with the deployment of Spring applications on Openstack recently as well. I am also not sure, without seeing the logs, what could be the reason, but did you try: http://www.evoila.de/vsphere/java-applications-not-starting-on-openstack-based-cloud-foundry-deployment/?lang=en
Regards, Johannes
On Wed, Aug 19, 2015 at 9:28 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote:
Thanks for the input, that's a good call. A colleague of mine (who is currently on vacation) did look at that... Sadly he's not around to ask what he tested.
On Wed, Aug 19, 2015 at 7:06 AM, Guillaume Berche <bercheg(a)gmail.com> wrote:
Some other "state" on the dea host such has shorteage on /dev/random (that went away with vm reconstruction but not with dea job restart) ?
Guillaume. Le 12 août 2015 14:14, "Daniel Mikusa" <dmikusa(a)pivotal.io> a écrit :
It seems like you were pretty thorough. I can't think of anything that would be different or that could cause symptoms like this, although I could be overlooking something as well. Without logs / app to try and replicate I'm not sure I can help much more. Sorry.
Perhaps someone else on the list has some thoughts?
Dan
On Wed, Aug 12, 2015 at 3:25 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote:
Hi Dan,
Thanks for taking the time to reply.
I didn't include too much in the way of detail, as I was thinking that there must be a moving part in the equation I'm blind to, in which case that's a gap in my knowledge that I ought to fill in.
As we did `bosh recreate` on all the VMs, which fixed it` I can't go back and fetch logs unfortunately. There's no chance of being able to create a test case as I'm on client's time, so consider this a thought exercise :)
The app was Spring Boot 1.2.3, pulling in Spring Boot JDBC and Spring LDAP. Root FS was cflinuxfs2, and the Java buildpack logged the same for both. On some failing DEAs there were no other apps, on others there were - it didn't seem to be a factor. All DEAs had plenty of disk space.
I was wondering if there was a race condition, but I assumed Spring contexts start single-threadedly. Do you know if that's a correct assumption?
Do you know if there any *things* that could have been different between the DEAs that I didn't account for? Ie another moving part that's *not* either release, job, stemcell, droplet, root FS, app environment?
On Tue, Aug 11, 2015 at 12:32 PM, Daniel Mikusa <dmikusa(a)pivotal.io> wrote:
On Tue, Aug 11, 2015 at 5:15 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote:
Hi all,
I've witnessed behaviour caused by the combination of a DEA and a Spring application that I can't explain. If you like a good mystery or you happen to know a lot about Java proxies and DEA transient state, please read on!
A particular Spring app Version of Spring? What parts of Spring are you pulling into the app?
was crashing only on specific DEAs in a Cloud Foundry.
Ever try bumping up the log level for Spring when you were getting the problem? If so, did the problem still occur? Were you able to capture the logs?
All DEAs were from the same CF release (PCF ERT 1.5.2) All DEAs were up-to-date according to BOSH (ie no outstanding changes waiting to be applied) All DEAs were deployed with identical BOSH job config All Warden containers were using the same root FS
lucid64 or cflinuxfs2? or didn't matter?
The droplet was the same across all DEAs The droplet version was the same The droplet tarballs all had the same MD5 checksum
What was the output of the Java build pack when the droplet was created? or better yet, run `cf files <app> app/.java-buildpack.log` and include the output.
Warden was providing the exact same env and start command to all containers I saw the same behaviour repeat itself across 5 completely separate Cloud Foundry installations
The crash was Spring not being able to autowire a bean, where it was referenced by implementation rather than interface (yes, I know, but it was not my code!). Any chance you could include logs from the crash? Was there an exception / stacktrace generated? Alternatively, have you been able to create a simple test app that replicates the behavior?
There was some Javassist/CGLIB action going on, creating proxies for the sake of transaction management.
Rebooting the troublesome DEAs did not fix the problem.
Doing a `bosh recreate` did reliably fix the problem.
Alternatively, changing the Spring code to wire by interface also reliably fixed the problem.
I can't understand why different DEA instances, from the same BOSH release, with the same config, on the same stemcell, running the same version of Warden, with the same droplet, and the same root FS, and the same env, and the same start command, yielded different behaviour. I'm even further confused as to why a `bosh recreate` changed that behaviour. What could possibly have changed? Something on ephemeral disk? But what else is there on ephemeral disk that could have mattered and was likely to have changed? How much was on the disk? Was it getting full? How many other apps were running on that DEA (before vs after)?
Do CGLIB/Javassist have some native dependencies that weren't in sync between DEAs?
Anyone with a convincing explanation (that does not involve voodoo) will receive one free beer and a high-five at the next CF Summit!
Wild guess, race condition in the code somewhere?
Dan
-- Regards,
Daniel Jones EngineerBetter.com
-- Regards,
Daniel Jones EngineerBetter.com
-- Mit freundlichen Grüßen
Johannes Hiemer
-- Regards,
Daniel Jones EngineerBetter.com
|
|
Re: Bizarre DEA + Spring Behaviour
Johannes Hiemer <jvhiemer@...>
Daniel, I have had a problem with the deployment of Spring applications on Openstack recently as well. I am also not sure, without seeing the logs, what could be the reason, but did you try: http://www.evoila.de/vsphere/java-applications-not-starting-on-openstack-based-cloud-foundry-deployment/?lang=enRegards, Johannes On Wed, Aug 19, 2015 at 9:28 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote: Thanks for the input, that's a good call. A colleague of mine (who is currently on vacation) did look at that... Sadly he's not around to ask what he tested.
On Wed, Aug 19, 2015 at 7:06 AM, Guillaume Berche <bercheg(a)gmail.com> wrote:
Some other "state" on the dea host such has shorteage on /dev/random (that went away with vm reconstruction but not with dea job restart) ?
Guillaume. Le 12 août 2015 14:14, "Daniel Mikusa" <dmikusa(a)pivotal.io> a écrit :
It seems like you were pretty thorough. I can't think of anything that would be different or that could cause symptoms like this, although I could be overlooking something as well. Without logs / app to try and replicate I'm not sure I can help much more. Sorry.
Perhaps someone else on the list has some thoughts?
Dan
On Wed, Aug 12, 2015 at 3:25 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote:
Hi Dan,
Thanks for taking the time to reply.
I didn't include too much in the way of detail, as I was thinking that there must be a moving part in the equation I'm blind to, in which case that's a gap in my knowledge that I ought to fill in.
As we did `bosh recreate` on all the VMs, which fixed it` I can't go back and fetch logs unfortunately. There's no chance of being able to create a test case as I'm on client's time, so consider this a thought exercise :)
The app was Spring Boot 1.2.3, pulling in Spring Boot JDBC and Spring LDAP. Root FS was cflinuxfs2, and the Java buildpack logged the same for both. On some failing DEAs there were no other apps, on others there were - it didn't seem to be a factor. All DEAs had plenty of disk space.
I was wondering if there was a race condition, but I assumed Spring contexts start single-threadedly. Do you know if that's a correct assumption?
Do you know if there any *things* that could have been different between the DEAs that I didn't account for? Ie another moving part that's *not* either release, job, stemcell, droplet, root FS, app environment?
On Tue, Aug 11, 2015 at 12:32 PM, Daniel Mikusa <dmikusa(a)pivotal.io> wrote:
On Tue, Aug 11, 2015 at 5:15 AM, Daniel Jones < daniel.jones(a)engineerbetter.com> wrote:
Hi all,
I've witnessed behaviour caused by the combination of a DEA and a Spring application that I can't explain. If you like a good mystery or you happen to know a lot about Java proxies and DEA transient state, please read on!
A particular Spring app Version of Spring? What parts of Spring are you pulling into the app?
was crashing only on specific DEAs in a Cloud Foundry.
Ever try bumping up the log level for Spring when you were getting the problem? If so, did the problem still occur? Were you able to capture the logs?
All DEAs were from the same CF release (PCF ERT 1.5.2) All DEAs were up-to-date according to BOSH (ie no outstanding changes waiting to be applied) All DEAs were deployed with identical BOSH job config All Warden containers were using the same root FS
lucid64 or cflinuxfs2? or didn't matter?
The droplet was the same across all DEAs The droplet version was the same The droplet tarballs all had the same MD5 checksum
What was the output of the Java build pack when the droplet was created? or better yet, run `cf files <app> app/.java-buildpack.log` and include the output.
Warden was providing the exact same env and start command to all containers I saw the same behaviour repeat itself across 5 completely separate Cloud Foundry installations
The crash was Spring not being able to autowire a bean, where it was referenced by implementation rather than interface (yes, I know, but it was not my code!). Any chance you could include logs from the crash? Was there an exception / stacktrace generated? Alternatively, have you been able to create a simple test app that replicates the behavior?
There was some Javassist/CGLIB action going on, creating proxies for the sake of transaction management.
Rebooting the troublesome DEAs did not fix the problem.
Doing a `bosh recreate` did reliably fix the problem.
Alternatively, changing the Spring code to wire by interface also reliably fixed the problem.
I can't understand why different DEA instances, from the same BOSH release, with the same config, on the same stemcell, running the same version of Warden, with the same droplet, and the same root FS, and the same env, and the same start command, yielded different behaviour. I'm even further confused as to why a `bosh recreate` changed that behaviour. What could possibly have changed? Something on ephemeral disk? But what else is there on ephemeral disk that could have mattered and was likely to have changed? How much was on the disk? Was it getting full? How many other apps were running on that DEA (before vs after)?
Do CGLIB/Javassist have some native dependencies that weren't in sync between DEAs?
Anyone with a convincing explanation (that does not involve voodoo) will receive one free beer and a high-five at the next CF Summit!
Wild guess, race condition in the code somewhere?
Dan
-- Regards,
Daniel Jones EngineerBetter.com
-- Regards,
Daniel Jones EngineerBetter.com
-- Mit freundlichen Grüßen Johannes Hiemer
|
|