api and api_worker jobs fail to bosh update, but monit start OK

Guillaume Berche
Hi, I'm experiencing a weird situation where the api and api_worker jobs fail to update through bosh and end up being reported as "not running". However, after manually running "monit start cloud_controller_ng" (or rebooting the VM), the faulty job starts fine and the bosh deployment proceeds without errors. Looking at the monit logs, it seems that there is an extra monit stop request for the cc_ng. Below are detailed traces illustrating the issue.

$ bosh deploy
[..]
Started updating job ha_proxy_z1 > ha_proxy_z1/0 (canary). Done (00:00:39)
Started updating job api_z1 > api_z1/0 (canary). Failed: `api_z1/0' is not running after update (00:10:44)

When instructing bosh to update the job (in this case only a config change), we indeed see the bosh agent asking monit to stop jobs, restart monit itself, and start jobs, and then we see the extra stop (at 12:33:26) before the bosh director ends up timing out and declaring the canary failed.

$ less /var/vcap/monit/monit.log
[UTC May 22 12:33:17] info : Awakened by User defined signal 1
[UTC May 22 12:33:17] info : Awakened by the SIGHUP signal
[UTC May 22 12:33:17] info : Reinitializing monit - Control file '/var/vcap/bosh/etc/monitrc'
[UTC May 22 12:33:17] info : Shutting down monit HTTP server
[UTC May 22 12:33:18] info : monit HTTP server stopped
[UTC May 22 12:33:18] info : Starting monit HTTP server at [127.0.0.1:2822]
[UTC May 22 12:33:18] info : monit HTTP server started
[UTC May 22 12:33:18] info : 'system_897cdb8d-f9f7-4bfa-a748-512489b676e0' Monit reloaded
[UTC May 22 12:33:23] info : start service 'consul_agent' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : Awakened by User defined signal 1
[UTC May 22 12:33:23] info : 'consul_agent' start: /var/vcap/jobs/consul_agent/bin/agent_ctl
[UTC May 22 12:33:23] info : start service 'nfs_mounter' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : start service 'metron_agent' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : start service 'cloud_controller_worker_1' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:24] info : 'consul_agent' start action done
[UTC May 22 12:33:24] info : 'nfs_mounter' start: /var/vcap/jobs/nfs_mounter/bin/nfs_mounter_ctl
[UTC May 22 12:33:24] info : 'cloud_controller_worker_1' start: /var/vcap/jobs/cloud_controller_worker/bin/cloud_controller_worker_ctl
[UTC May 22 12:33:25] info : 'cloud_controller_worker_1' start action done
[UTC May 22 12:33:25] info : 'metron_agent' start: /var/vcap/jobs/metron_agent/bin/metron_agent_ctl
[UTC May 22 12:33:26] info : 'metron_agent' start action done
[UTC May 22 12:33:26] info : 'cloud_controller_worker_1' stop: /var/vcap/jobs/cloud_controller_worker/bin/cloud_controller_worker_ctl   <== the extra stop
[UTC May 22 12:33:27] info : 'nfs_mounter' start action done
[UTC May 22 12:33:27] info : Awakened by User defined signal 1

There are no associated traces of the bosh agent asking for this extra stop:

$ less /var/vcap/bosh/log/current
2015-05-22_12:33:23.73606 [monitJobSupervisor] 2015/05/22 12:33:23 DEBUG - Starting service cloud_controller_worker_1
2015-05-22_12:33:23.73608 [http-client] 2015/05/22 12:33:23 DEBUG - Monit request: url='http://127.0.0.1:2822/cloud_controller_worker_1' body='action=start'
2015-05-22_12:33:23.73608 [attemptRetryStrategy] 2015/05/22 12:33:23 DEBUG - Making attempt #0
2015-05-22_12:33:23.73609 [clientRetryable] 2015/05/22 12:33:23 DEBUG - [requestID=52ede4f0-427d-4e65-6da1-d3b5c4b5cafd] Requesting (attempt=1): Request{ Method: 'POST', URL: 'http://127.0.0.1:2822/cloud_controller_worker_1' }
2015-05-22_12:33:23.73647 [clientRetryable] 2015/05/22 12:33:23 DEBUG - [requestID=52ede4f0-427d-4e65-6da1-d3b5c4b5cafd] Request succeeded (attempts=1), response: Response{ StatusCode: 200, Status: '200 OK'}
2015-05-22_12:33:23.73648 [MBus Handler] 2015/05/22 12:33:23 INFO - Responding
2015-05-22_12:33:23.73650 [MBus Handler] 2015/05/22 12:33:23 DEBUG - Payload
2015-05-22_12:33:23.73650 ********************
2015-05-22_12:33:23.73651 {"value":"started"}
2015-05-22_12:33:23.73651 ********************
2015-05-22_12:33:36.69397 [NATS Handler] 2015/05/22 12:33:36 DEBUG - Message Payload
2015-05-22_12:33:36.69397 ********************
2015-05-22_12:33:36.69397 {"job":"api_worker_z1","index":0,"job_state":"failing","vitals":{"cpu":{"sys":"6.5","user":"14.4","wait":"0.4"},"disk":{"ephemeral":{"inode_percent":"10","percent":"14"},"persistent":{"inode_percent":"36","percent":"48"},"system":{"inode_percent":"36","percent":"48"}},"load":["0.19","0.06","0.06"],"mem":{"kb":"81272","percent":"8"},"swap":{"kb":"0","percent":"0"}}}

This reproduces systematically on our setup using bosh release 152 with stemcell bosh-vcloud-esxi-ubuntu-trusty-go_agent version 2889, and cf release 207 running stemcell 2889. Enabling monit verbose logs ruled out the theory of monit restarting the cc_ng jobs because of too much RAM usage or a failed HTTP health check (along with the short time window in which the extra stop is requested: ~15s). I also ruled out the possibility of multiple monit instances, or a pid inconsistency with the cc_ng process. I'm now suspecting either the bosh agent sending an extra stop request, or something in the cc_ng ctl scripts.

As a side question, can someone explain how the cc_ng ctl script works? I'm surprised by the following process tree, where ruby seems to call the ctl script. Is the cc spawning itself?

$ ps auxf --cols=2000 | less
[...]
vcap  8011 0.6 7.4 793864 299852 ? S<l May26 6:01 ruby /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller -m -c /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml
root  8014 0.0 0.0  19596   1436 ? S<  May26 0:00  \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root  8023 0.0 0.0   5924   1828 ? S<  May26 0:00  |   \_ tee -a /dev/fd/63
root  8037 0.0 0.0  19600   1696 ? S<  May26 0:00  |   |   \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root  8061 0.0 0.0   5916   1924 ? S<  May26 0:00  |   |       \_ logger -p user.info -t vcap.cloud_controller_ng_ctl.stdout
root  8024 0.0 0.0   7552   1788 ? S<  May26 0:00  |   \_ awk -W Interactive {lineWithDate="echo [`date +\"%Y-%m-%d %H:%M:%S%z\"`] \"" $0 "\""; system(lineWithDate) }
root  8015 0.0 0.0  19600   1440 ? S<  May26 0:00  \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root  8021 0.0 0.0   5924   1832 ? S<  May26 0:00      \_ tee -a /dev/fd/63
root  8033 0.0 0.0  19600   1696 ? S<  May26 0:00      |   \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root  8060 0.0 0.0   5912   1920 ? S<  May26 0:00      |       \_ logger -p user.error -t vcap.cloud_controller_ng_ctl.stderr
root  8022 0.0 0.0   7552   1748 ? S<  May26 0:00      \_ awk -W Interactive {lineWithDate="echo [`date +\"%Y-%m-%d %H:%M:%S%z\"`] \"" $0 "\""; system(lineWithDate) }

I was wondering whether this could come from our setup running CF with a more recent stemcell version (2922) than what the cf release notes mention as the "tested configuration". Are the latest stemcells tested against the latest CF release? Is there any way to see what stemcell version the runtime team's pipelines are using? [1] seemed to accept env vars and [2] required logging in. I scanned through the bosh agent commit logs to spot something related, but without luck so far.

Thanks in advance for your help,

Guillaume.

[1] https://github.com/cloudfoundry/bosh-lite/blob/master/ci/ci-stemcell-bats.sh
[2] https://concourse.diego-ci.cf-app.com/
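For anyone trying to reproduce this kind of investigation, here is a rough way to line up what monit did against what the bosh agent actually asked for (a sketch only; the paths are the stemcell defaults):

# every start/stop monit performed for the cc jobs around the failing update
grep -E "'cloud_controller_(ng|worker_1)' (start|stop)" /var/vcap/monit/monit.log

# every action the agent POSTed to monit's HTTP API in the same window
grep "Monit request: url=" /var/vcap/bosh/log/current | grep cloud_controller

# monit's current view of the jobs
/var/vcap/bosh/bin/monit summary

A stop that shows up in monit.log with no matching POST in the agent log would point away from the agent and towards another monit client or the ctl scripts, which is what the traces above suggest.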
|
|
Does the DEA have any limitation on the network resources for warden containers?
I have a small environment running an old version (v183).
It only has one DEA with a couple of apps running on it. It suddenly stopped working when deploying (staging) apps, with no obvious error and empty logs.
I took a look at the DEA logs and found this (stack trace below):
I added one more DEA, and then I was able to deploy more apps.
{"timestamp":1432671377.9456468,"message":"instance.start.failed with error Could not acquire network","log_level":"warn","source":"Dea::Instance","data":{"attributes":{"prod":false,"executableFile":"deprecated","limits":{"mem":256,"disk":1024,"fds":16384},"cc_partition":"default","console":false,"debug":null,"start_command":null,"health_check_timeout":180,"vcap_application":{"limits":{"mem":256,"disk":1024,"fds":16384},"application_version":"f6a8cfc7-ae71-4cab-90f4-67c2a21a3e8a","application_name":"NewrelicServiceBroker-v1","application_uris":[" newrelic-broker.pcf.inbcu.com "],"version":"f6a8cfc7-ae71-4cab-90f4-67c2a21a3e8a","name":"NewrelicServiceBroker-v1","space_name":"NewrelicServiceBroker-service-space","space_id":"7c372fd8-9e72-4e0a-b38c-a40024e88b29","uris":[" newrelic-broker.pcf.inbcu.com "],"users":null},"egress_network_rules":[{"protocol":"all","destination":" 0.0.0.0-255.255.255.255 "}],"instance_index":0,"application_version":"f6a8cfc7-ae71-4cab-90f4-67c2a21a3e8a","application_name":"NewrelicServiceBroker-v1","application_uris":[" newrelic-broker.pcf.inbcu.com"],"application_id":"b7ebe668-1f3f-46c2-88d3-8377824a7dd8","droplet_sha1":"658859369d03874604d0131812f8e6cf9811265a","instance_id":"4abe2c22600449d8aa7beff84c5776fc","private_instance_id":"33991dacb5e94e5dbae9542ff3f218b8f648742aea50432ca1ddb2d1ae4328f4","state":"CRASHED","state_timestamp":1432671377.9454725,"state_born_timestamp":1432671377.6084838,"state_starting_timestamp":1432671377.609722,"state_crashed_timestamp":1432671377.9454768},"duration":0.335908487,"error":"Could not acquire network","backtrace":["/var/vcap/packages/dea_next/vendor/cache/warden-dd32a459c99d/em-warden-client/lib/em/warden/client/connection.rb:27:in `get'","/var/vcap/packages/dea_next/vendor/cache/warden-dd32a459c99d/em-warden-client/lib/em/warden/client.rb:43:in `call'","/var/vcap/packages/dea_next/lib/container/container.rb:192:in `call'","/var/vcap/packages/dea_next/lib/container/container.rb:153:in `block in new_container_with_bind_mounts'","/var/vcap/packages/dea_next/lib/container/container.rb:229:in `call'","/var/vcap/packages/dea_next/lib/container/container.rb:229:in `with_em'","/var/vcap/packages/dea_next/lib/container/container.rb:137:in `new_container_with_bind_mounts'","/var/vcap/packages/dea_next/lib/container/container.rb:120:in `block in create_container'","/var/vcap/packages/dea_next/lib/container/container.rb:229:in `call'","/var/vcap/packages/dea_next/lib/container/container.rb:229:in `with_em'","/var/vcap/packages/dea_next/lib/container/container.rb:119:in `create_container'","/var/vcap/packages/dea_next/lib/dea/starting/instance.rb:520:in `block in promise_container'","/var/vcap/packages/dea_next/lib/dea/promise.rb:92:in `call'","/var/vcap/packages/dea_next/lib/dea/promise.rb:92:in `block in run'"]},"thread_id":4874360,"fiber_id":23565220,"process_id":28699,"file":"/var/vcap/packages/dea_next/lib/dea/task.rb","lineno":97,"method":"block in resolve_and_log"}
{"timestamp":1432671655.6220152,"message":"nats.message.received","log_level":"debug","source":"Dea::Nats","data":{"subject":"dea.stop","data":{"droplet":"890cfbde-0957-444e-aa0c-249c0fef42ca"}},"thread_id":4874360,"fiber_id":13477840,"process_id":28699,"file":"/var/vcap/packages/dea_next/lib/dea/nats.rb","lineno":148,"method":"handle_incoming_message"}
|
|
Re: Multiple Availability Zone
Try telling BOSH to ignore what AZ the server is in when provisioning disks: https://github.com/cloudfoundry/bosh/blob/master/release/jobs/director/spec#L395
It will default to Cinder's default AZ for the storage that you have configured.

John

From: cf-dev-bounces(a)lists.cloudfoundry.org [mailto:cf-dev-bounces(a)lists.cloudfoundry.org] On Behalf Of Guangcai Wang
Sent: 27 May 2015 08:23
To: cf-dev(a)lists.cloudfoundry.org
Subject: [cf-dev] Multiple Availability Zone

Hi, I am trying to deploy cf into OpenStack with multiple compute nodes.

Compute node 1: has all OpenStack services running, including the Cinder service (az1)
Compute node 2: has the compute service only (az2)

When I deployed cf, the job VMs were provisioned into the two availability zones evenly. When BOSH started to update a job VM (etcd was provisioned in az2) and create a disk, I got the error "Availability zone 'az2' is invalid".

My question is: how do I specify the availability zone for VMs and their persistent disks?

Thanks.
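For reference, the flag referred to above would end up in the director's manifest roughly like this (a sketch only; the property name is taken from the linked spec, so verify the exact nesting for your BOSH version):

properties:
  openstack:
    # do not force persistent disks into the server's AZ;
    # let Cinder fall back to its configured default AZ
    ignore_server_availability_zone: true

After updating the director deployment with that property, retry the cf deploy.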
|
|
Multiple Availability Zone
Hi,
I am trying to deploy cf into Openstack with multiple computing nodes.
Compute node 1: has all OpenStack services running, including the Cinder service (az1). Compute node 2: has the compute service only (az2).
When I deployed cf, the job VMs were provisioned into the two availability zones evenly. When BOSH started to update a job VM (etcd was provisioned in az2) and create a disk, I got the error "Availability zone 'az2' is invalid".
My question is: how do I specify the availability zone for VMs and their persistent disks?
Thanks.
|
|
Re: Release Notes for v210
|
|
Re: metron_agent.deployment
Diego Lapiduz <diego@...>
Thanks Ivan! That is exactly what I was looking for.
On Tue, May 26, 2015 at 10:47 PM, Ivan Sim <ivans(a)activestate.com> wrote:
For all the loggregator processes, you will be able to find their configuration properties in their respective spec and ERB files in the loggregator repository [1]. In your case, the metron_agent.deployment property is seen here [2].
[1] https://github.com/cloudfoundry/loggregator/tree/develop/bosh/jobs
[2] https://github.com/cloudfoundry/loggregator/blob/27490d3387566f42fb71bab3dc760ca1b5c1be6d/bosh/jobs/metron_agent/spec#L47
On Tue, May 26, 2015 at 7:40 PM, Diego Lapiduz <diego(a)lapiduz.com> wrote:
Hi all,
I've been trying to figure out an issue here while upgrading to 210 from 208.
It seems that a requirement has been added to the deployment manifests for a "metron_agent.deployment" property but I can't find it anywhere in the cf-release manifests.
From what I can tell the only manifest with that setting is https://github.com/cloudfoundry/cf-release/blob/master/example_manifests/minimal-aws.yml#L327 .
Is there a place to look for canonical manifests other than cf-release? Should I just rely on the release notes?
I just added that property to cf-properties.yml and seems to work fine.
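For anyone else hitting this, the addition is a one-line property. A sketch of what it can look like in the properties stub (the value is just a label identifying the deployment, so use whatever name fits yours):

properties:
  metron_agent:
    deployment: my-cf-deployment   # hypothetical label; identifies this CF deployment on outgoing metrics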
Thanks for understanding as we are going through our first couple of big upgrades.
Cheers, Diego
-- Ivan Sim
|
|
Re: metron_agent.deployment
On Tue, May 26, 2015 at 7:40 PM, Diego Lapiduz <diego(a)lapiduz.com> wrote: Hi all,
I've been trying to figure out an issue here while upgrading to 210 from 208.
It seems that a requirement has been added to the deployment manifests for a "metron_agent.deployment" property but I can't find it anywhere in the cf-release manifests.
From what I can tell the only manifest with that setting is https://github.com/cloudfoundry/cf-release/blob/master/example_manifests/minimal-aws.yml#L327 .
Is there a place to look for canonical manifests other than cf-release? Should I just rely on the release notes?
I just added that property to cf-properties.yml and seems to work fine.
Thanks for understanding as we are going through our first couple of big upgrades.
Cheers, Diego
-- Ivan Sim
|
|
Diego Lapiduz <diego@...>
Hi all, I've been trying to figure out an issue here while upgrading to 210 from 208. It seems that a requirement has been added to the deployment manifests for a "metron_agent.deployment" property but I can't find it anywhere in the cf-release manifests. From what I can tell the only manifest with that setting is https://github.com/cloudfoundry/cf-release/blob/master/example_manifests/minimal-aws.yml#L327. Is there a place to look for canonical manifests other than cf-release? Should I just rely on the release notes? I just added that property to cf-properties.yml and seems to work fine. Thanks for understanding as we are going through our first couple of big upgrades. Cheers, Diego
|
|
Re: [vcap-dev] Java OOM debugging
On 15-05-14 10:23 AM, Daniel Jones wrote: Hi Lari,
Thanks again for your input. Have you seen this problem with versions of Tomcat before 8.0.20?
David and I think we've narrowed down the issue to a change from using Tomcat 8.0.18 to 8.0.21. We're running more tests and collaborating with Pivotal support. We also noticed that non-prod versions of our apps were taking longer to crash, so it would seem to be activity-related at least.
Do you know how Tomcat's APR/NIO memory gets allocated? Is there a way of telling from pmap whether pages are being used for NIO buffers or by the APR?
I wonder if the other folks that have reported CF out of memory errors with later versions of Tomcat are seeing slow creeps in native memory consumption?
On Mon, May 11, 2015 at 2:19 PM, Lari Hotari <Lari(a)hotari.net <mailto:Lari(a)hotari.net>> wrote:
fyi. Tomcat 8.0.20 might be consuming more memory than 8.0.18: https://github.com/cloudfoundry/java-buildpack/issues/166#issuecomment-94517568
Other things we’ve tried:
- We set verbose garbage collection to verify there was no memory size issues within the JVM. There wasn’t.
- We tried setting minimum memory for native, it had no effect. The container still gets killed
- We tried adjusting the ‘memory heuristics’ so that they added up to 80 rather than 100. This had the effect of causing a delay in the container being killed. However it still was killed.
I think adjusting memory heuristics so that they add up to 80 doesn't make a difference because the values aren't percentages. The values are proportional weighting values used in the memory calculation: https://github.com/grails-samples/java-buildpack/blob/b4abf89/docs/jre-oracle_jre.md#memory-calculation
I found out that the only way to reserve "unused" memory is to set a high value for the native memory lower bound in the memory_sizes.native setting of config/open_jdk_jre.yml . Example: https://github.com/grails-samples/java-buildpack/blob/22e0f6a/config/open_jdk_jre.yml#L25
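To make that concrete, a sketch of what the relevant part of a forked config/open_jdk_jre.yml looks like with a hard native floor (values taken from the example linked above; the keys and range syntax are those of that buildpack fork, so check them against the version you actually run):

memory_sizes:
  # "330m.." means at least 330M with no upper bound, reserving that much
  # of the container for native (non-heap, non-metaspace) allocations
  native: 330m..
# memory_heuristics left at the buildpack defaults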
This seems like classic memory leak behaviour to me.
In my case it wasn't a classical Java memory leak, since the Java application wasn't leaking memory. I was able to confirm this by getting some heap dumps with the HeapDumpServlet (https://github.com/lhotari/java-buildpack-diagnostics-app/blob/master/src/main/groovy/io/github/lhotari/jbpdiagnostics/HeapDumpServlet.groovy) and analyzing them.
In my case the JVM's RSS memory size is slowly growing. It probably is some kind of memory leak since one process I've been monitoring now is very close to the memory limit. The uptime is now almost 3 weeks.
Here is the latest diff of the meminfo report. https://gist.github.com/lhotari/ee77decc2585f56cf3ad#file-meminfo_diff_example2-txt
From a Java perspective this isn't classical. The JVM heap isn't filling up. The problem is that the RSS size is slowly growing and will eventually cause the Java process to cross the memory boundary, so that the process gets killed by the Linux kernel's cgroups OOM killer.
RSS size might be growing because of many reasons. I have been able to slow down the growth by doing the various MALLOC_ and JVM parameter tuning (-XX:MinMetaspaceExpansion=1M -XX:CodeCacheExpansionSize=1M). I'm able to get a longer uptime, but the problem isn't solved.
Lari
On 15-05-11 06:41 AM, Head-Rapson, David wrote:
Thanks for the continued advice.
We’ve hit on a key discovery after yet another a soak test this weekend.
- When we deploy using Tomcat 8.0.18 we don’t see the issue
- When we deploy using Tomcat 8.0.20 (same app version, same CF space, same services bound, same JBP code version, same JRE version, running at the same time), we see the crashes occurring after just a couple of hours.
Ideally we’d go ahead with the memory calculations you mentioned however we’re stuck on lucid64 because we’re using Pivotal CF 1.3.x & we’re having upgrade issues to 1.4.x.
So we’re not able to adjust MALLOC_ARENA_MAX, nor are we able to view RSS in pmap as you describe
Other things we’ve tried:
- We set verbose garbage collection to verify there was no memory size issues within the JVM. There wasn’t.
- We tried setting minimum memory for native, it had no effect. The container still gets killed
- We tried adjusting the ‘memory heuristics’ so that they added up to 80 rather than 100. This had the effect of causing a delay in the container being killed. However it still was killed.
This seems like classic memory leak behaviour to me.
From: Lari Hotari [mailto:lari.hotari(a)sagire.fi] On Behalf Of Lari Hotari
Sent: 08 May 2015 16:25
To: Daniel Jones; Head-Rapson, David
Cc: cf-dev(a)lists.cloudfoundry.org
Subject: Re: [Cf-dev] [vcap-dev] Java OOM debugging
For my case, it turned out to be essential to reserve enough memory for "native" in the JBP. For the 2GB total memory, I set the minimum to 330M. With that setting I have been able to get over 2 weeks up time by now.
I mentioned this in my previous email:
The workaround for that in my case was to add a native key under memory_sizes in open_jdk_jre.yml and set the minimum to 330M (that is for a 2GB total memory). see example https://github.com/grails-samples/java-buildpack/blob/22e0f6a/config/open_jdk_jre.yml#L25 that was how I got the app I'm running on CF to stay within the memory bounds. I'm sure there is now also a way to get the keys without forking the buildpack. I could have also adjusted the percentage portions, but I wanted to set a hard minimum for this case.
I've been trying to get some insight by diffing the reports gathered from the meminfo servlet: https://github.com/lhotari/java-buildpack-diagnostics-app/blob/master/src/main/groovy/io/github/lhotari/jbpdiagnostics/MemoryInfoServlet.groovy
Here is such an example of a diff: https://gist.github.com/lhotari/ee77decc2585f56cf3ad#file-meminfo_diff_example-txt
meminfo has pmap output included to get the report of the memory map of the process. I have just noticed that most of the memory has already been mmap:ed from the OS and it's just growing in RSS size. For example:
< 00000000a7600000 1471488 1469556 1469556 rw--- [ anon ]
> 00000000a7600000 1471744 1470444 1470444 rw--- [ anon ]
The pmap output from lucid64 didn't include the RSS size, so you have to use cflinuxfs2 for this. It's also better because of other reasons. The glibc in lucid64 is old and has some bugs around the MALLOC_ARENA_MAX.
I was manually able to estimate the maximum size of the RSS size of what the Java process will consume by simply picking the large anon-blocks from the pmap report and calculating those blocks by the allocated virtual size (VSS). Based on this calculation, I picked the minimum of 330M for "native" in open_jdk_jre.yml as I mentioned before.
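A rough way to automate that estimate (a sketch only; it simply sums the virtual size of large anonymous mappings reported by pmap, with an arbitrary 100 MB threshold, and inherits all the assumptions above):

PID=$(pgrep -f 'bin/java' | head -1)   # assumption: a single Java process in the container
pmap -x "$PID" | awk '/\[ anon \]/ && $2 > 100000 { sum += $2 } END { printf "approx %d MB of large anon VSS\n", sum/1024 }'

The total (discarding the CompressedClassSpaceSize block) approximates the worst case those mappings can grow to in RSS.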
It looks like these rows are for the heap size:
< 00000000a7600000 1471488 1469556 1469556 rw--- [ anon ]
> 00000000a7600000 1471744 1470444 1470444 rw--- [ anon ]
It looks like the JVM doesn't fully allocate that block in RSS initially and most of the growth of RSS size comes from that in my case. In your case, it might be something different.
I also added a servlet for getting glibc malloc_info statistics in XML format (). I haven't really analysed that information because of time constraints and because I don't have a pressing problem any more. btw. The malloc_info XML report is missing some key elements, that has been added in later glibc versions (https://github.com/bminor/glibc/commit/4d653a59ffeae0f46f76a40230e2cfa9587b7e7e).
If killjava.sh never fires and the app crashed with Warden out of memory errors, then I believe it's the kernel's cgroups OOM killer that has killed the container processes. I have found this location where Warden oom notifier gets the OOM notification event: https://github.com/cloudfoundry/warden/blob/ad18bff/warden/lib/warden/container/features/mem_limit.rb#L70 This is the oom.c source code: https://github.com/cloudfoundry/warden/blob/ad18bff7dc56acbc55ff10bcc6045ebdf0b20c97/warden/src/oom/oom.c . It reads the cgroups control files and receives events from the kernel that way.
I'd suggest that you use pmap for the Java process after it has started and estimate the maximum RSS size by adding up the VSS size of the large anon blocks (rather than their current RSS) that the Java process has reserved for its different memory areas. You should discard the VSS of the CompressedClassSpaceSize block. After this calculation, add enough memory to the "native" parameter in the JBP until the RSS size calculated this way stays under the limit. That's the only "method" I have come up with so far.
It might be required to have some RSS space allocated for any zip/jar files read by the Java process. I think that Java uses mmap files for zip file reading by default and that might go on top of all other limits. To test this theory, I'd suggest testing by adding -Dsun.zip.disableMemoryMapping=true system property setting to JAVA_OPTS. That disables the native mmap for zip/jar file reading. I haven't had time to test this assumption.
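If someone wants to try that quickly, an environment change plus restart should be enough (assuming your buildpack honours a JAVA_OPTS environment variable; the buildpacks discussed here are forked, so adjust to however you pass JVM options):

cf set-env myapp JAVA_OPTS '-Dsun.zip.disableMemoryMapping=true'
cf restart myapp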
I guess the only way to understand how Java allocates memory is to look at the source code. From http://openjdk.java.net/projects/jdk8u/ , the instructions to get the source code of JDK 8 are:
hg clone http://hg.openjdk.java.net/jdk8u/jdk8u; cd jdk8u; sh get_source.sh
This tool is really good for grepping and searching the source code: http://geoff.greer.fm/ag/
On Ubuntu it's in the silversearcher-ag package ("apt-get install silversearcher-ag") and on Mac OS X with brew it's "brew install the_silver_searcher". This alias is pretty useful:
alias codegrep='ag --color --group --pager less -C 5'
Then you just search for the right location in the code by starting with the tokens you know about:
codegrep MaxMetaspaceSize
This gives pretty good starting points for looking at how the JDK allocates memory.
So the JDK source code is only a few commands away.
It would be interesting to hear more about this if someone has the time to dig in to this. This is about how far I got and I hope sharing this information helps someone continue. :)
Lari github/twitter: lhotari
On 15-05-08 10:02 AM, Daniel Jones wrote:
Hi Lari et al,
Thanks for your help Lari.
David and I are pairing on this issue, and we're yet to resolve it. We're in the process of creating a repeatable test case (our most crashy app makes calls to external services that need mocking), but in the meantime, here's what we've seen.
Between Java Buildpack commit e89e546 and 17162df, we see apps crashing with Warden out of memory errors. killjava.sh never fires, and this has led us to believe that the kernel is shooting a cgroup process in the head after the cgroup oversteps its memory limit. We cannot find any evidence of the OOM killer firing in any logs, but we may not be looking in the right place.
The JBP is setting heap to be 70%, metaspace to be 15% (with max set to the same as initial), 5% for "stack", 5% for "normalised stack" and 10% for "native". We do not understand why this adds up to 105%, but haven't looked into the JBP algorithm yet. Any pointers on what "normalised stack" is would be much appreciated, as this doesn't appear in the list of heuristics supplied via app env.
Other team members tried applying the same settings that you suggested - thanks for this. Apps still crash with these settings, albeit less frequently.
After reading the blog you linked to (http://java.dzone.com/articles/java-8-permgen-metaspace) we wondered whether the increased *reserved* metaspace claimed after metaspace GC might be causing a problem; however we reused the test code to create a metaspace leak in a CF app and saw metaspace GCs occur correctly, and memory usage never grew over MaxMetaspaceSize. This figures, as the committed metaspace is still less than MaxMetaspaceSize, and the reserved appears to be whatever RAM is free across the whole DEA.
We noted that an Oracle blog (https://blogs.oracle.com/poonam/entry/about_g1_garbage_collector_permanent) mentions that the metaspace size parameters are approximate. We're currently wondering if native allocations by Tomcat (APR, NIO) are taking up more container memory, and so when the metaspace fills, it's creeping slightly over the limit and triggering the kernel's OOM killer.
Any suggestions would be much appreciated. We've tried to resist tweaking heuristics blindly, but are running out of options as we're struggling to figure out how the Java process is using *committed* memory. pmap seems to show virtual memory, and so it's hard to see if things like the metaspace or NIO ByteBuffers are grabbing too much and triggering the kernel's OOM killer.
Thanks for all your help,
Daniel Jones & David Head-Rapson
On Wed, Apr 29, 2015 at 8:07 PM, Lari Hotari <Lari(a)hotari.net <mailto:Lari(a)hotari.net>> wrote:
Hi,
I created a few tools to debug OOM problems since the application I was responsible for running on CF was failing constantly because of OOM problems. The problems I had, turned out not to be actual memory leaks in the Java application.
In the "cf events appname" log I would get entries like this: 2015-xx-xxTxx:xx:xx.00-0400 app.crash appname index: 1, reason: CRASHED, exit_description: out of memory, exit_status: 255
These types of entries are produced when the container goes over its memory resource limits. It doesn't mean that there is a memory leak in the Java application. The container gets killed by the Linux kernel OOM killer (https://github.com/cloudfoundry/warden/blob/master/warden/README.md#limit-handle-mem-value) based on the resource limits set on the warden container.
The memory limit is specified in number of bytes. It is enforced using the control group associated with the container. When a container exceeds this limit, one or more of its processes will be killed by the kernel. Additionally, the Warden will be notified that an OOM happened and it subsequently tears down the container.
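A quick way to confirm it was the cgroup limit (rather than anything inside the JVM) is to look at the memory cgroup warden created for the container. A sketch, with the cgroup location treated as an assumption since it depends on how warden is configured:

# on the DEA; searches for warden-created memory cgroups rather than hard-coding a mount point
find / -path '*warden*' -name memory.failcnt 2>/dev/null | while read f; do
  d=$(dirname "$f")
  echo "== $d"
  cat "$d/memory.limit_in_bytes" "$d/memory.max_usage_in_bytes" "$f"
done

A non-zero failcnt, or a max usage sitting at the limit, means the kernel limit was hit, which is exactly the path through mem_limit.rb / oom.c described above.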
In my case it never got killed by the killjava.sh script that gets called in the java-buildpack when an OOM happens in Java.
This is the tool I built to debug the problems: https://github.com/lhotari/java-buildpack-diagnostics-app I deployed that app as part of the forked buildpack I'm using. Please read the readme about what it's limitations are. It worked for me, but it might not work for you. It's opensource and you can fork it. :)
There is a solution in my toolcase for creating a heapdump and uploading that to S3: https://github.com/lhotari/java-buildpack-diagnostics-app/blob/master/src/main/groovy/io/github/lhotari/jbpdiagnostics/HeapDumpServlet.groovy The readme explains how to setup Amazon S3 keys for this: https://github.com/lhotari/java-buildpack-diagnostics-app#amazon-s3-setup Once you get a dump, you can then analyse the dump in a java profiler tool like YourKit.
I also have a solution that forks the java-buildpack modifies killjava.sh and adds a script that uploads the heapdump to S3 in the case of OOM: https://github.com/lhotari/java-buildpack/commit/2d654b80f3bf1a0e0f1bae4f29cb85f56f5f8c46
In java-buildpack-diagnostics-app I have also other tools for getting Linux operation system specific memory information, for example:
https://github.com/lhotari/java-buildpack-diagnostics-app/blob/master/src/main/groovy/io/github/lhotari/jbpdiagnostics/MemoryInfoServlet.groovy https://github.com/lhotari/java-buildpack-diagnostics-app/blob/master/src/main/groovy/io/github/lhotari/jbpdiagnostics/MemorySmapServlet.groovy https://github.com/lhotari/java-buildpack-diagnostics-app/blob/master/src/main/groovy/io/github/lhotari/jbpdiagnostics/MallocInfoServlet.groovy
These tools are handy for looking at details of the Java process RSS memory usage growth.
There is also a solution for getting ssh shell access inside your application with tmate.io: https://github.com/lhotari/java-buildpack-diagnostics-app/blob/master/src/main/groovy/io/github/lhotari/jbpdiagnostics/TmateSshServlet.groovy (this version is only compatible with the new "cflinuxfs2" stack)
It looks like there are serious problems on Cloud Foundry with the memory sizing calculation. An application that doesn't have an OOM problem will get killed by the OOM killer because the Java process will go over the memory limits. I filed this issue: https://github.com/cloudfoundry/java-buildpack/issues/157 , but that might not cover everything.
The workaround for that in my case was to add a native key under memory_sizes in open_jdk_jre.yml and set the minimum to 330M (that is for a 2GB total memory). see example https://github.com/grails-samples/java-buildpack/blob/22e0f6a/config/open_jdk_jre.yml#L25 that was how I got the app I'm running on CF to stay within the memory bounds. I'm sure there is now also a way to get the keys without forking the buildpack. I could have also adjusted the percentage portions, but I wanted to set a hard minimum for this case.
It was also required to do some other tuning.
I added this to JAVA_OPTS: -XX:CompressedClassSpaceSize=256M -XX:InitialCodeCacheSize=64M -XX:CodeCacheExpansionSize=1M -XX:CodeCacheMinimumFreeSpace=1M -XX:ReservedCodeCacheSize=200M -XX:MinMetaspaceExpansion=1M -XX:MaxMetaspaceExpansion=8M -XX:MaxDirectMemorySize=96M while trying to keep the Java process from growing in RSS memory size.
The memory overhead of a 64 bit Java process on Linux can be reduced by specifying these environment variables:
stack: cflinuxfs2
...
env:
  MALLOC_ARENA_MAX: 2
  MALLOC_MMAP_THRESHOLD_: 131072
  MALLOC_TRIM_THRESHOLD_: 131072
  MALLOC_TOP_PAD_: 131072
  MALLOC_MMAP_MAX_: 65536
MALLOC_ARENA_MAX works only on cflinuxfs2 stack (the lucid64 stack has a buggy version of glibc).
explanation about MALLOC_ARENA_MAX from Heroku: https://devcenter.heroku.com/articles/tuning-glibc-memory-behavior some measurement data how it reduces memory consumption: https://devcenter.heroku.com/articles/testing-cedar-14-memory-use
I have created a PR to add this to CF java-buildpack: https://github.com/cloudfoundry/java-buildpack/pull/160
I also created https://github.com/cloudfoundry/java-buildpack/issues/163 and https://github.com/cloudfoundry/java-buildpack/pull/159 .
I hope this information helps others struggling with OOM problems in CF. I'm not saying that this is a ready made solution just for you. YMMV. It worked for me.
-Lari
On 15-04-29 10:53 AM, Head-Rapson, David wrote:
Hi,
I’m after some guidance on how to profile Java apps in CF, in order to get to the bottom of memory issues.
We have an app that’s crashing every few hours with OOM error, most likely it’s a memory leak.
I’d like to profile the JVM and work out what’s eating memory; however, tools like YourKit require connectivity INTO the JVM server (i.e. the warden container), either via host/port or via SSH.
Since warden containers cannot be connected to on ports other than for HTTP and cannot be SSHd to, neither of these works for me.
I tried installing a standalone JDK onto the warden container; however, as soon as I ran ‘jmap’ to invoke the dump, warden cleaned up the container – most likely for memory over-consumption.
I had previously found a hack in the Weblogic buildpack (https://github.com/pivotal-cf/weblogic-buildpack/blob/master/docs/container-wls-monitoring.md) for modifying the start script which, when used with –XX:HeapDumpOnOutOfMemoryError, should copy any heapdump files to a file share somewhere. I have my own custom buildpack so I could use something similar.
Has anyone got a better solution than this?
We would love to use newrelic / app dynamics for this however we’re not allowed. And I’m not 100% certain they could help with this either.
Dave
--
Regards,
Daniel Jones
EngineerBetter.com
-- Regards,
Daniel Jones EngineerBetter.com
|
|
[IMPORTANT] lucid64 stack removal planned with next final cf-release.
|
|
john mcteague <john.mcteague@...>
I've only had a brief look; my graphite server does not seem to have the same set of stats for each DEA and doppler node, but where I can draw a comparison is that the dopplers in z2 and z3 are receiving 10x more logs than zone 4.
The DEA stat for z4 has a receivedMessageCount that is lower than the z4 dopplers' receivedMessageCount. I'm not convinced my stats are being sent correctly; I will investigate and provide further info tomorrow.
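For reference, one way to pull those receivedMessageCount series out of graphite for a side-by-side comparison (the host and metric prefixes below are assumptions to adapt to however the collector feeds graphite):

GRAPHITE=http://graphite.example.com   # hypothetical graphite endpoint
curl -s "$GRAPHITE/render?target=*.MetronAgent.dropsondeAgentListener.receivedMessageCount&from=-1h&format=json"
curl -s "$GRAPHITE/render?target=*.DopplerServer.dropsondeListener.receivedMessageCount&from=-1h&format=json"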
Thanks for your help.
John
On Tue, May 26, 2015 at 10:35 PM, John Tuley <jtuley(a)pivotal.io> wrote: I also don't expect that to be the source of your CPU imbalance.
Well, the bad news is that the easy stuff checks out, so I have no idea what's actually wrong. I'll keep suggesting diagnostics, but I don't have a silver bullet for you.
Do you have a collector wired up in your deployment? If so, I'd take a look at the metrics `MetronAgent.dropsondeAgentListener.receivedMessageCount` across each of the runners, and `DopplerServer.dropsondeListener.receivedMessageCount` across each of the dopplers. That should give you a better idea of the number of log messages that *should* be sent to each Doppler (the first metric) and that *are* received and processed (the second metric).
If the metron numbers are high (as you expect), but the doppler numbers are low, then there's probably something wrong with those doppler instances. If the metron numbers are low, then there might be something wrong with metron on the runners, or with the DEA logging agent. Or, maybe the app instances in those zones just aren't logging much (which seems the least likely explanation so far).
– John Tuley
On Tue, May 26, 2015 at 3:24 PM, john mcteague <john.mcteague(a)gmail.com> wrote:
- From etcd I see 5 unique entries; all 5 doppler hosts are listed with the correct zone
- All metron_agent.json files list the correct zone name
- All doppler.json files also contain the correct zone name
All 5 doppler servers contain the following two errors, in varying amounts.
{"timestamp":1432671780.232883453,"process_id":1422," source":"doppler","log_level":"error","message":"AppStoreWatcher: Got error while waiting for ETCD events: store request timed out","data":null,"file":"/var/vcap/data/compile/doppler/loggregator/src/ github.com/cloudfoundry/loggregatorlib/store/app_service_store_watcher.go ","line":78,"method":" github.com/cloudfoundry/loggregatorlib/store.(*AppServiceStoreWatcher).Run "}
{"timestamp":1432649819.481923819,"process_id":1441," source":"doppler","log_level":"warn","message":"TB: Output channel too full. Dropped 100 messages for app f744c900-d82d-4efc-bbe4- 004e94ffdfec.","data":null,"file":"/var/vcap/data/compile/ doppler/loggregator/src/doppler/truncatingbuffer/ truncating_buffer.go","line":65,"method":"doppler/truncatingbuffer.(* TruncatingBuffer).Run"}
For the latter, given the high log rate of the test app, it suggests I need to tune the buffer of doppler, but I don't expect this to be the cause of my CPU imbalance.
On Tue, May 26, 2015 at 5:08 PM, John Tuley <jtuley(a)pivotal.io> wrote:
John,
Can you verify (on, say one runner in each of your zones) that Metron's local configuration has the correct zone? (Look in /var/vcap/jobs/metron_agent/config/metron.json.)
Can you also verify the same for the Doppler servers (/var/vcap/jobs/doppler/config/doppler.json)?
And then can you please verify that etcd is being updated correctly? (curl *$ETCD_URL*/api/v2/keys/healthstatus/doppler/?recursive=true with the correct ETCD_URL - the output should contain entries with the correct IP address of each of your dopplers, under the correct zone.)
If all of those check out, then please send me the logs from the affected Doppler servers and I'll take a look.
– John Tuley
On Tue, May 26, 2015 at 9:26 AM, <cf-dev-request(a)lists.cloudfoundry.org> wrote:
Message: 2 Date: Tue, 26 May 2015 16:26:30 +0100 From: john mcteague <john.mcteague(a)gmail.com> To: Erik Jasiak <ejasiak(a)pivotal.io> Cc: cf-dev <cf-dev(a)lists.cloudfoundry.org> Subject: Re: [cf-dev] Doppler zoning query Message-ID: <CAEduAK4WmMfrhdhxWDfpR= Ot0eM+yspsswqx4hG36Mte0bS9kg(a)mail.gmail.com> Content-Type: text/plain; charset="utf-8"
We are using cf v204 and all loggregators are the same size and config (other than zone).
The distribution of requests across app instances is fairly even as far as I can see.
John. On 26 May 2015 06:21, "Erik Jasiak" <ejasiak(a)pivotal.io> wrote:
Hi John,
I'll be working on this with engineering in the morning; thanks for
the details thus far.
This is puzzling: Metrons do not route traffic to dopplers outside their zone today. If all your app instances are spread evenly, and all are
serving an equal amount of requests, then I would expect no major variability in Doppler load either.
For completeness, what version of CF are you running? I assume your
configurations for all dopplers are roughly the same? All app instances per
AZ are serving an equal number of requests?
Thanks, Erik Jasiak
On Monday, May 25, 2015, john mcteague <john.mcteague(a)gmail.com> wrote:
Correct, thanks.
On Mon, May 25, 2015 at 12:01 AM, James Bayer <jbayer(a)pivotal.io> wrote:
ok thanks for the extra detail.
to confirm, during the load test, the http traffic is being routed through zones 4 and 5 app instances on DEAs in a balanced way.
however the
dopplers associated with zone 4 / 5 are getting a very small amount
of load
sent their way. is that right?
On Sun, May 24, 2015 at 3:45 PM, john mcteague <
john.mcteague(a)gmail.com>
wrote:
I am seeing logs from zone 4 and 5 when tailing the logs (*cf logs hello-world | grep App | awk '{ print $2 }'*), I see a relatively
even
balance between all app instances, yet doppler on zones 1-3
consume far
greater cpu resources (15x in some cases) than zones 4 and 5.
Generally
zones 4 and 5 barely get above 1% utilization.
Running *cf curl /v2/apps/guid/stats | grep host | sort *shows 30
instances, 6 in each zone, a perfect balance.
Each loggregator is running with 8GB RAM and 4vcpus.
John
On Sat, May 23, 2015 at 11:56 PM, James Bayer <jbayer(a)pivotal.io> wrote:
john,
can you say more about "receiving no load at all"? for example, if you restart one of the app instances in zone 4 or zone 5 do you
see logs
with "cf logs"? you can target a single app instance index to get
restarted
with using a "cf curl" command for terminating an app index [1].
you can
find the details with json output from "cf stats" that should
show you the
private IPs for the DEAs hosting your app, which should help you
figure out
which zone each app index is in.
http://apidocs.cloudfoundry.org/209/apps/terminate_the_running_app_instance_at_the_given_index.html
if you are seeing logs from zone 4 and zone 5, then what might be happening is that for some reason DEAs in zone 4 or zone 5 are
not routable
somewhere along the path. reasons for that could be: * DEAs in Zone 4 / Zone 5 not getting apps that are hosted there listed in the routing table * The routing table may be correct, but for some reason the
routers
cannot reach DEAs in zone 4 or zone 5 with outbound traffic and
routers
fails over to instances in DEAs 1-3 that it can reach * some other mystery
On Fri, May 22, 2015 at 2:06 PM, john mcteague < john.mcteague(a)gmail.com> wrote:
We map our DEAs, dopplers and traffic controllers into 5 logical zones using the various zone properties of doppler, metron_agent and traffic_controller. This aligns to our physical failure domains in OpenStack.
During a recent load test we discovered that zones 4 and 5 were receiving no load at all, all traffic went to zones 1-3.
What would cause this unbalanced distribution? I have a single
app
running 30 instances and have verified it is evenly balanced
across all 5
zones (6 instances in each). I have additionally verified that
each logical
zone in the bosh yml contains 1 dea, doppler server and traffic
controller.
Thanks, John
-- Thank you,
James Bayer
-- Thank you,
James Bayer
|
|
I also don't expect that to be the source of your CPU imbalance.

Well, the bad news is that the easy stuff checks out, so I have no idea what's actually wrong. I'll keep suggesting diagnostics, but I don't have a silver bullet for you.

Do you have a collector wired up in your deployment? If so, I'd take a look at the metrics `MetronAgent.dropsondeAgentListener.receivedMessageCount` across each of the runners, and `DopplerServer.dropsondeListener.receivedMessageCount` across each of the dopplers. That should give you a better idea of the number of log messages that *should* be sent to each Doppler (the first metric) and that *are* received and processed (the second metric).

If the metron numbers are high (as you expect), but the doppler numbers are low, then there's probably something wrong with those doppler instances. If the metron numbers are low, then there might be something wrong with metron on the runners, or with the DEA logging agent. Or, maybe the app instances in those zones just aren't logging much (which seems the least likely explanation so far).

– John Tuley

On Tue, May 26, 2015 at 3:24 PM, john mcteague <john.mcteague(a)gmail.com> wrote:
- From etcd I see 5 unique entries; all 5 doppler hosts are listed with the correct zone
- All metron_agent.json files list the correct zone name
- All doppler.json files also contain the correct zone name
All 5 doppler servers contain the following two errors, in varying amounts.
{"timestamp":1432671780.232883453,"process_id":1422," source":"doppler","log_level":"error","message":"AppStoreWatcher: Got error while waiting for ETCD events: store request timed out","data":null,"file":"/var/vcap/data/compile/doppler/loggregator/src/ github.com/cloudfoundry/loggregatorlib/store/app_service_store_watcher.go ","line":78,"method":" github.com/cloudfoundry/loggregatorlib/store.(*AppServiceStoreWatcher).Run "}
{"timestamp":1432649819.481923819,"process_id":1441," source":"doppler","log_level":"warn","message":"TB: Output channel too full. Dropped 100 messages for app f744c900-d82d-4efc-bbe4- 004e94ffdfec.","data":null,"file":"/var/vcap/data/compile/ doppler/loggregator/src/doppler/truncatingbuffer/ truncating_buffer.go","line":65,"method":"doppler/truncatingbuffer.(* TruncatingBuffer).Run"}
For the latter, given the high log rate of the test app, it suggests I need to tune the buffer of doppler, but I dont expect this to be the cause of my cpu imbalance.
On Tue, May 26, 2015 at 5:08 PM, John Tuley <jtuley(a)pivotal.io> wrote:
John,
Can you verify (on, say one runner in each of your zones) that Metron's local configuration has the correct zone? (Look in /var/vcap/jobs/metron_agent/config/metron.json.)
Can you also verify the same for the Doppler servers (/var/vcap/jobs/doppler/config/doppler.json)?
And then can you please verify that etcd is being updated correctly? (curl *$ETCD_URL*/api/v2/keys/healthstatus/doppler/?recursive=true with the correct ETCD_URL - the output should contain entries with the correct IP address of each of your dopplers, under the correct zone.)
If all of those check out, then please send me the logs from the affected Doppler servers and I'll take a look.
– John Tuley
On Tue, May 26, 2015 at 9:26 AM, <cf-dev-request(a)lists.cloudfoundry.org> wrote:
Message: 2 Date: Tue, 26 May 2015 16:26:30 +0100 From: john mcteague <john.mcteague(a)gmail.com> To: Erik Jasiak <ejasiak(a)pivotal.io> Cc: cf-dev <cf-dev(a)lists.cloudfoundry.org> Subject: Re: [cf-dev] Doppler zoning query Message-ID: <CAEduAK4WmMfrhdhxWDfpR= Ot0eM+yspsswqx4hG36Mte0bS9kg(a)mail.gmail.com> Content-Type: text/plain; charset="utf-8"
We are using cf v204 and all loggregators are the same size and config (other than zone).
The distribution of requests across app instances is fairly even as far as I can see.
John. On 26 May 2015 06:21, "Erik Jasiak" <ejasiak(a)pivotal.io> wrote:
Hi John,
I'll be working on this with engineering in the morning; thanks for the details thus far.
This is puzzling: Metrons do not route traffic to dopplers outside their zone today. If all your app instances are spread evenly, and all are
serving an equal amount of requests, then I would expect no major variability in Doppler load either.
For completeness, what version of CF are you running? I assume your
configurations for all dopplers are roughly the same? All app instances per
AZ are serving an equal number of requests?
Thanks, Erik Jasiak
On Monday, May 25, 2015, john mcteague <john.mcteague(a)gmail.com> wrote:
Correct, thanks.
On Mon, May 25, 2015 at 12:01 AM, James Bayer <jbayer(a)pivotal.io> wrote:
ok thanks for the extra detail.
to confirm, during the load test, the http traffic is being routed through zones 4 and 5 app instances on DEAs in a balanced way.
however the
dopplers associated with zone 4 / 5 are getting a very small amount
of load
sent their way. is that right?
On Sun, May 24, 2015 at 3:45 PM, john mcteague <
john.mcteague(a)gmail.com>
wrote:
I am seeing logs from zone 4 and 5 when tailing the logs (*cf logs hello-world | grep App | awk '{ print $2 }'*), I see a relatively
even
balance between all app instances, yet doppler on zones 1-3 consume
far
greater cpu resources (15x in some cases) than zones 4 and 5.
Generally
zones 4 and 5 barely get above 1% utilization.
Running *cf curl /v2/apps/guid/stats | grep host | sort *shows 30
instances, 6 in each zone, a perfect balance.
Each loggregator is running with 8GB RAM and 4vcpus.
John
On Sat, May 23, 2015 at 11:56 PM, James Bayer <jbayer(a)pivotal.io> wrote:
john,
can you say more about "receiving no load at all"? for example, if you restart one of the app instances in zone 4 or zone 5 do you
see logs
with "cf logs"? you can target a single app instance index to get
restarted
with using a "cf curl" command for terminating an app index [1].
you can
find the details with json output from "cf stats" that should show
you the
private IPs for the DEAs hosting your app, which should help you
figure out
which zone each app index is in.
http://apidocs.cloudfoundry.org/209/apps/terminate_the_running_app_instance_at_the_given_index.html
if you are seeing logs from zone 4 and zone 5, then what might be happening is that for some reason DEAs in zone 4 or zone 5 are not
routable
somewhere along the path. reasons for that could be: * DEAs in Zone 4 / Zone 5 not getting apps that are hosted there listed in the routing table * The routing table may be correct, but for some reason the routers cannot reach DEAs in zone 4 or zone 5 with outbound traffic and
routers
fails over to instances in DEAs 1-3 that it can reach * some other mystery
On Fri, May 22, 2015 at 2:06 PM, john mcteague < john.mcteague(a)gmail.com> wrote:
We map our DEAs, dopplers and traffic controllers into 5 logical zones using the various zone properties of doppler, metron_agent and traffic_controller. This aligns to our physical failure domains in OpenStack.
During a recent load test we discovered that zones 4 and 5 were receiving no load at all, all traffic went to zones 1-3.
What would cause this unbalanced distribution? I have a single app running 30 instances and have verified it is evenly balanced
across all 5
zones (6 instances in each). I have additionally verified that
each logical
zone in the bosh yml contains 1 dea, doppler server and traffic
controller.
Thanks, John
-- Thank you,
James Bayer
-- Thank you,
James Bayer
|
|
john mcteague <john.mcteague@...>
- From etcd I see 5 unique entries; all 5 doppler hosts are listed with the correct zone
- All metron_agent.json files list the correct zone name
- All doppler.json files also contain the correct zone name
All 5 doppler servers contain the following two errors, in varying amounts.
{"timestamp":1432671780.232883453,"process_id":1422," source":"doppler","log_level":"error","message":"AppStoreWatcher: Got error while waiting for ETCD events: store request timed out","data":null,"file":"/var/vcap/data/compile/doppler/loggregator/src/ github.com/cloudfoundry/loggregatorlib/store/app_service_store_watcher.go ","line":78,"method":" github.com/cloudfoundry/loggregatorlib/store.(*AppServiceStoreWatcher).Run"}
{"timestamp":1432649819.481923819,"process_id":1441," source":"doppler","log_level":"warn","message":"TB: Output channel too full. Dropped 100 messages for app f744c900-d82d-4efc-bbe4- 004e94ffdfec.","data":null,"file":"/var/vcap/data/compile/ doppler/loggregator/src/doppler/truncatingbuffer/ truncating_buffer.go","line":65,"method":"doppler/truncatingbuffer.(* TruncatingBuffer).Run"}
For the latter, given the high log rate of the test app, it suggests I need to tune Doppler's buffer, but I don't expect this to be the cause of my CPU imbalance.
On Tue, May 26, 2015 at 5:08 PM, John Tuley <jtuley(a)pivotal.io> wrote:
John,
Can you verify (on, say one runner in each of your zones) that Metron's local configuration has the correct zone? (Look in /var/vcap/jobs/metron_agent/config/metron.json.)
Can you also verify the same for the Doppler servers (/var/vcap/jobs/doppler/config/doppler.json)?
And then can you please verify that etcd is being updated correctly? (curl $ETCD_URL/api/v2/keys/healthstatus/doppler/?recursive=true with the correct ETCD_URL - the output should contain entries with the correct IP address of each of your dopplers, under the correct zone.)
If all of those check out, then please send me the logs from the affected Doppler servers and I'll take a look.
– John Tuley
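A quick way to run those checks from the shell - a sketch only, assuming you can ssh (e.g. bosh ssh) onto a runner and a Doppler VM and that $ETCD_URL points at the etcd cluster (the key path is taken verbatim from the instructions above):

# on a runner VM: zone configured for metron
$ grep -i zone /var/vcap/jobs/metron_agent/config/metron.json
# on a doppler VM: zone configured for doppler
$ grep -i zone /var/vcap/jobs/doppler/config/doppler.json
# doppler registrations in etcd, per zone
$ curl -s "$ETCD_URL/api/v2/keys/healthstatus/doppler/?recursive=true" | python -m json.tool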
john mcteague <john.mcteague@...>
We are using cf v204 and all loggregators are the same size and config (other than zone).
The distribution of requests across app instances is fairly even as far as I can see.
John.
On 26 May 2015 06:21, "Erik Jasiak" <ejasiak(a)pivotal.io> wrote:
Hi John,
I'll be working on this with engineering in the morning; thanks for the details thus far.
This is puzzling: Metrons do not route traffic to dopplers outside their zone today. If all your app instances are spread evenly, and all are serving an equal amount of requests, then I would expect no major variability in Doppler load either.
For completeness, what version of CF are you running? I assume your configurations for all dopplers are roughly the same? All app instances per AZ are serving an equal number of requests?
Thanks, Erik Jasiak
On Monday, May 25, 2015, john mcteague <john.mcteague(a)gmail.com> wrote:
Correct, thanks.
On Mon, May 25, 2015 at 12:01 AM, James Bayer <jbayer(a)pivotal.io> wrote:
ok thanks for the extra detail.
to confirm, during the load test, the http traffic is being routed through zones 4 and 5 app instances on DEAs in a balanced way. however the dopplers associated with zone 4 / 5 are getting a very small amount of load sent their way. is that right?
On Sun, May 24, 2015 at 3:45 PM, john mcteague <john.mcteague(a)gmail.com> wrote:
I am seeing logs from zone 4 and 5 when tailing the logs (cf logs hello-world | grep App | awk '{ print $2 }'); I see a relatively even balance between all app instances, yet the dopplers in zones 1-3 consume far greater CPU resources (15x in some cases) than zones 4 and 5. Generally zones 4 and 5 barely get above 1% utilization.
Running cf curl /v2/apps/guid/stats | grep host | sort shows 30 instances, 6 in each zone, a perfect balance.
Each loggregator is running with 8GB RAM and 4vcpus.
John
Onsi Fakhouri <ofakhouri@...>
Diego is very much usable at this point and we're encouraging beta testers to start putting workloads on it. Check out github.com/cloudfoundry-incubator/diego for all things Diego.
Diego supports one-off tasks. It's up to the consumer (e.g. Cloud Controller) to submit the tasks when they want them run. We'd like to bubble this functionality up to the CC, but it's not a very high priority at the moment.
Onsi
Sent from my iPad
On May 26, 2015, at 8:21 AM, Corentin Dupont <corentin.dupont(a)create-net.org> wrote:
Another question: what is the status of Diego? Is there an expected date for its release? Is it usable already? If I understand correctly, Diego doesn't support cron-like jobs, but will facilitate them?
Corentin Dupont <corentin.dupont@...>
Another question: what is the status of Diego? Is there an expected date for its release? Is it usable already? If I understand correctly, Diego doesn't support cron-like jobs, but will facilitate them?
those are exciting use cases, thank you for sharing the background!
On Tue, May 26, 2015 at 2:37 AM, Corentin Dupont <cdupont(a)create-net.org> wrote:
Hi James, thanks for the answer! We are interested in implementing a job scheduler for CF. Do you think this could be interesting to have?
We are working on a project called DC4Cities (http://www.dc4cities.eu) where the objective is to make data centres use more renewable energy. We want to use PaaS frameworks such as Cloud Foundry to achieve this goal. The idea is to schedule some PaaS tasks at the moments when more renewable energy is available (when the sun is shining).
That's why I had the idea of implementing a job scheduler for batch jobs in CF. For example, one could state "I need this task to run for 2 hours per day" and the scheduler could choose when to run it.
Another possibility is to have application-oriented SLAs implemented at the CF level. For example, if some KPIs of the application get too low, CF would spawn a new container. If the SLA is defined with some flexibility, it could also be used to schedule around renewable energy. For example, in our trial scenarios we have an application that converts images. Its SLA says it needs to convert 1000 images per day, but it is free to produce them whenever it wants, i.e. when renewable energy is available...
On Mon, May 25, 2015 at 7:29 PM, James Bayer <jbayer(a)pivotal.io> wrote:
there is ongoing work to support process types using buildpacks, so that the same application codebase could be used for multiple different types of processes (web, worker, etc).
once process types and diego tasks are fully available, we expect to implement a user-facing api for running batch jobs as application processes.
what people do today is run a long-running process application which uses something like quartz scheduler [1] or ruby clock with a worker system like resque [2]
[1] http://quartz-scheduler.org/ [2] https://github.com/resque/resque-scheduler
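As a rough illustration of the long-running-worker pattern described above (a sketch only; the app name, the run-batch.sh script and the one-hour interval are placeholders, and the app bits are assumed to contain that script):

$ cf push image-batch --no-route -c 'while true; do ./run-batch.sh; sleep 3600; done'

A scheduler library such as quartz or resque-scheduler plays the same role inside the worker process, with finer control over when jobs fire.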
On Mon, May 25, 2015 at 6:19 AM, Corentin Dupont <cdupont(a)create-net.org> wrote:
To complete my request, I'm thinking of something like this in the manifest.yml:
applications:
- name: virusscan
  memory: 512M
  instances: 1
  schedule:
  - startFrom: a date
    endBefore: a date
    walltime: a duration
    precedence: other application name
    moldable: true/false
What do you think?
On Mon, May 25, 2015 at 11:25 AM, Corentin Dupont < cdupont(a)create-net.org> wrote:
---------- Forwarded message ----------
From: Corentin Dupont <corentin.dupont(a)create-net.org>
Date: Mon, May 25, 2015 at 11:21 AM
Subject: scheduler
To: cf-dev(a)lists.cloudfoundry.org
Hi guys, just wondering: is there a project to add a job scheduler in Cloud Foundry? I'm thinking of something like the Heroku scheduler (https://devcenter.heroku.com/articles/scheduler). It would be very neat to have regular tasks triggered... Thanks, Corentin
--
Corentin Dupont
Researcher @ Create-Net
www.corentindupont.info
-- Thank you,
James Bayer
Re: List Reply-To behavior
Chip Childers <cchilders@...>
I've asked the admin team to make this adjustment. Thanks for pointing this out!
Chip Childers | Technology Chief of Staff | Cloud Foundry Foundation
On Fri, May 22, 2015 at 10:06 AM, James Bayer <jbayer(a)pivotal.io> wrote:
yes, this has affected me
On Fri, May 22, 2015 at 4:33 AM, Daniel Mikusa <dmikusa(a)pivotal.io> wrote:
On Fri, May 22, 2015 at 6:22 AM, Matthew Sykes <matthew.sykes(a)gmail.com> wrote:
The vcap-dev list used to use a Reply-To header pointing back to the list such that replying to a post would automatically go back to the list. The current mailman configuration for cf-dev does not set a Reply-To header and the default behavior is to reply to the author.
While I understand the pros and cons of setting the Reply-To header, this new behavior has bitten me several times and I've found myself re-posting a response to the list instead of just the author.
I'm interested in knowing if anyone else has been bitten by this behavior and would like a Reply-To header added back...
+1 and +1
Dan
Thanks.
-- Matthew Sykes matthew.sykes(a)gmail.com
-- Thank you,
James Bayer
Re: CVE-2015-1834 CC Path Traversal vulnerability