Re: [vcap-dev] Java OOM debugging
Lari Hotari <Lari@...>
This Java native memory leak debugging war story from a Twitter engineer
is very interesting: http://www.evanjones.ca/java-native-leak-bug.html . The tweet is https://twitter.com/epcjones/status/603295445067014144 .

It seems very important to check all locations where GZIPInputStream (and other InflaterInputStream implementations) and GZIPOutputStream (and other DeflaterOutputStream implementations) are used. I assume that an InputStream opened from a resource URL originating from a jar file could also leak native memory (ClassLoader.getResource(...).openStream()). This makes it a very common source of native memory problems in Java.

It could be a coincidence, but Tomcat 8.0.20 seems to have changes in this area: https://github.com/apache/tomcat/commit/6e5420c67fbad81973d888ad3701a392fac4fc71 (I linked to that commit in my email on May 14).

Lari
On 15-05-14 10:23 AM, Daniel Jones wrote:
Hi Lari,
metron_agent.deployment
Diego Lapiduz <diego@...>
Hi all,
I've been trying to figure out an issue here while upgrading to 210 from 208. It seems that a requirement has been added to the deployment manifests for a "metron_agent.deployment" property, but I can't find it anywhere in the cf-release manifests.

From what I can tell, the only manifest with that setting is https://github.com/cloudfoundry/cf-release/blob/master/example_manifests/minimal-aws.yml#L327 .

Is there a place to look for canonical manifests other than cf-release? Should I just rely on the release notes?

I just added that property to cf-properties.yml and it seems to work fine.

Thanks for understanding as we are going through our first couple of big upgrades.

Cheers,
Diego
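For reference, a minimal sketch of what that addition can look like in a deployment manifest's properties block, assuming the property is simply the deployment's name used to tag emitted metrics (see the spec linked in the reply below); the value here is a placeholder:

```
# Sketch of the manifest addition; "cf-prod" is a placeholder deployment name.
properties:
  metron_agent:
    deployment: cf-prod   # assumed: any string identifying this deployment
```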
Re: metron_agent.deployment
Ivan Sim <ivans@...>
For all the loggregator processes, you will be able to find their
configuration properties in their respective spec and ERB files in the loggregator repository [1]. In your case, the metron_agent.deployment property is seen here [2].

[1] https://github.com/cloudfoundry/loggregator/tree/develop/bosh/jobs
[2] https://github.com/cloudfoundry/loggregator/blob/27490d3387566f42fb71bab3dc760ca1b5c1be6d/bosh/jobs/metron_agent/spec#L47
On Tue, May 26, 2015 at 7:40 PM, Diego Lapiduz <diego(a)lapiduz.com> wrote:
Hi all, --
Ivan Sim
Re: metron_agent.deployment
Diego Lapiduz <diego@...>
Thanks Ivan! That is exactly what I was looking for.
On Tue, May 26, 2015 at 10:47 PM, Ivan Sim <ivans(a)activestate.com> wrote:
For all the loggregator processes, you will be able to find their
Re: Release Notes for v210
Dieu Cao <dcao@...>
The cf-release v210 was released on May 23rd, 2015
Runtime
- Addressed USN-2617-1 <http://www.ubuntu.com/usn/usn-2617-1/> CVE-2015-3202 <http://people.canonical.com/~ubuntu-security/cve/2015/CVE-2015-3202.html> FUSE vulnerabilities
  - Removed fuse binaries from lucid64 rootfs. Apps running on the lucid64 stack requiring fuse should switch to cflinuxfs2. details <https://www.pivotaltracker.com/story/show/95186578>
  - fuse binaries updated on cflinuxfs2 rootfs. details <https://www.pivotaltracker.com/story/show/95177810>
- [Experimental] Work continues on support for Asynchronous Service Instance Operations. details <https://www.pivotaltracker.com/epic/show/1561148>
  - Support for configurable max polling duration
- [Experimental] Work continues on /v3 and Application Process Types. details <https://www.pivotaltracker.com/epic/show/1334418>
- [Experimental] Work continues on Route API. details <https://www.pivotaltracker.com/epic/show/1590160>
- [Experimental] Work continues on Context Path Routes. details <https://www.pivotaltracker.com/epic/show/1808212>
- Work continues on support for Service Keys. details <https://www.pivotaltracker.com/epic/show/1743366>
- Upgrade etcd server to 2.0.1. details <https://www.pivotaltracker.com/story/show/91070214>
  - Should be run as 1 node (for small deployments) or 3 nodes spread across zones (for HA)
  - Also upgrades hm9k dependencies. LAMB client to be upgraded in a subsequent release. Older client is compatible.
- cloudfoundry/cf-release #670 <https://github.com/cloudfoundry/cf-release/pull/670>: Be able to specify timeouts for acceptance tests without defaults in the spec. details <https://www.pivotaltracker.com/story/show/93914198>
- Fix bug where ssl enabled routers were not draining properly. details <https://www.pivotaltracker.com/story/show/94718480>
- cloudfoundry/cloud_controller_ng #378 <https://github.com/cloudfoundry/cf-release/pull/378>: current usage against the org quota. details <https://www.pivotaltracker.com/story/show/94171010>

UAA
- Bumped to UAA 2.3.0. details <https://github.com/cloudfoundry/uaa/releases/tag/2.3.0>

Used Configuration
- BOSH Version: 152
- Stemcell Version: 2889
- CC Api Version: 2.27.0

Commit summary <http://htmlpreview.github.io/?https://github.com/cloudfoundry-community/cf-docs-contrib/blob/master/release_notes/cf-210-whats-in-the-deploy.html>

Compatible Diego Version
- final release 0.1247.0 commit <https://github.com/cloudfoundry-incubator/diego-release/commit/a122a78eeb344bbfc90b7bcd0fa987d08ef1a5d1>

Manifest and Job Spec Changes
- properties.acceptance_tests.skip_regex added
- properties.app_ssh.host_key_fingerprint added
- properties.app_ssh.port defaults to 2222
- properties.uaa.newrelic added
- properties.login.logout.redirect.parameter.whitelist
On Sat, May 23, 2015 at 9:50 PM, James Bayer <jbayer(a)pivotal.io> wrote:
CVE-2015-3202 details:
Multiple Availability Zone
iamflying
Hi,
I am trying to deploy CF into OpenStack with multiple compute nodes.

Compute node 1: runs all OpenStack services, including the Cinder service (az1).
Compute node 2: runs the compute service only (az2).

When I deployed CF, the job VMs were provisioned evenly across the two availability zones. When BOSH started to update a job VM (etcd was provisioned in az2) and tried to create a disk, I got the error "Availability zone 'az2' is invalid".

My question is: how do I specify the availability zone for VMs and their persistent disks? Thanks.
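For context, a minimal sketch of how VM zone placement is typically expressed with the OpenStack CPI, via per-resource-pool cloud_properties; the pool, flavor, and zone names are placeholders, and persistent-disk placement is a separate setting (see the reply below):

```
# Hypothetical manifest excerpt: one resource pool per availability zone,
# with each job assigned to the pool in the zone it should run in.
resource_pools:
- name: small_az1
  cloud_properties:
    instance_type: m1.small   # placeholder flavor
    availability_zone: az1    # placeholder zone name
- name: small_az2
  cloud_properties:
    instance_type: m1.small
    availability_zone: az2
```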
Re: Multiple Availability Zone
John McTeague
Try telling BOSH to ignore what AZ the server is in when provisioning disks:
https://github.com/cloudfoundry/bosh/blob/master/release/jobs/director/spec#L395

It will default to Cinder's default AZ for the storage you have configured.

John

From: cf-dev-bounces(a)lists.cloudfoundry.org [mailto:cf-dev-bounces(a)lists.cloudfoundry.org] On Behalf Of Guangcai Wang
Sent: 27 May 2015 08:23
To: cf-dev(a)lists.cloudfoundry.org
Subject: [cf-dev] Multiple Availability Zone

Hi, I am trying to deploy cf into Openstack with multiple computing nodes. [...]

This email is confidential and subject to important disclaimers and conditions including on offers for the purchase or sale of securities, accuracy and completeness of information, viruses, confidentiality, legal privilege, and legal entity disclaimers, available at http://www.jpmorgan.com/pages/disclosures/email
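Concretely, that setting goes in the director's deployment (or bosh-init) manifest; a minimal sketch, assuming the OpenStack CPI settings live under an openstack properties block as in the follow-up message further down this thread:

```
# Sketch of the relevant director property only.
properties:
  openstack:
    ignore_server_availability_zone: true   # create persistent disks in Cinder's
                                            # default AZ rather than the server's AZ
```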
Does DEA have any limitation on the network resources for warden containers on a DEA
Shaozhen Ding
I have a small environment running an old version (183).
It only has one DEA, having couple apps running on that, it suddenly stop working when deploying apps (staging) without any obvious error and empty logs. I took a look at the DEA logs and found this (stackstrace below): I added one more DEA, then I am able to deploy more apps. {"timestamp":1432671377.9456468,"message":"instance.start.failed with error Could not acquire network","log_level":"warn","source":"Dea::Instance","data":{"attributes":{"prod":false,"executableFile":"deprecated","limits":{"mem":256,"disk":1024,"fds":16384},"cc_partition":"default","console":false,"debug":null,"start_command":null,"health_check_timeout":180,"vcap_application":{"limits":{"mem":256,"disk":1024,"fds":16384},"application_version":"f6a8cfc7-ae71-4cab-90f4-67c2a21a3e8a","application_name":"NewrelicServiceBroker-v1","application_uris":[" newrelic-broker.pcf.inbcu.com "],"version":"f6a8cfc7-ae71-4cab-90f4-67c2a21a3e8a","name":"NewrelicServiceBroker-v1","space_name":"NewrelicServiceBroker-service-space","space_id":"7c372fd8-9e72-4e0a-b38c-a40024e88b29","uris":[" newrelic-broker.pcf.inbcu.com "],"users":null},"egress_network_rules":[{"protocol":"all","destination":" 0.0.0.0-255.255.255.255 "}],"instance_index":0,"application_version":"f6a8cfc7-ae71-4cab-90f4-67c2a21a3e8a","application_name":"NewrelicServiceBroker-v1","application_uris":[" newrelic-broker.pcf.inbcu.com"],"application_id":"b7ebe668-1f3f-46c2-88d3-8377824a7dd8","droplet_sha1":"658859369d03874604d0131812f8e6cf9811265a","instance_id":"4abe2c22600449d8aa7beff84c5776fc","private_instance_id":"33991dacb5e94e5dbae9542ff3f218b8f648742aea50432ca1ddb2d1ae4328f4","state":"CRASHED","state_timestamp":1432671377.9454725,"state_born_timestamp":1432671377.6084838,"state_starting_timestamp":1432671377.609722,"state_crashed_timestamp":1432671377.9454768},"duration":0.335908487,"error":"Could not acquire network","backtrace":["/var/vcap/packages/dea_next/vendor/cache/warden-dd32a459c99d/em-warden-client/lib/em/warden/client/connection.rb:27:in `get'","/var/vcap/packages/dea_next/vendor/cache/warden-dd32a459c99d/em-warden-client/lib/em/warden/client.rb:43:in `call'","/var/vcap/packages/dea_next/lib/container/container.rb:192:in `call'","/var/vcap/packages/dea_next/lib/container/container.rb:153:in `block in new_container_with_bind_mounts'","/var/vcap/packages/dea_next/lib/container/container.rb:229:in `call'","/var/vcap/packages/dea_next/lib/container/container.rb:229:in `with_em'","/var/vcap/packages/dea_next/lib/container/container.rb:137:in `new_container_with_bind_mounts'","/var/vcap/packages/dea_next/lib/container/container.rb:120:in `block in create_container'","/var/vcap/packages/dea_next/lib/container/container.rb:229:in `call'","/var/vcap/packages/dea_next/lib/container/container.rb:229:in `with_em'","/var/vcap/packages/dea_next/lib/container/container.rb:119:in `create_container'","/var/vcap/packages/dea_next/lib/dea/starting/instance.rb:520:in `block in promise_container'","/var/vcap/packages/dea_next/lib/dea/promise.rb:92:in `call'","/var/vcap/packages/dea_next/lib/dea/promise.rb:92:in `block in run'"]},"thread_id":4874360,"fiber_id":23565220,"process_id":28699,"file":"/var/vcap/packages/dea_next/lib/dea/task.rb","lineno":97,"method":"block in resolve_and_log"} 
{"timestamp":1432671655.6220152,"message":"nats.message.received","log_level":"debug","source":"Dea::Nats","data":{"subject":"dea.stop","data":{"droplet":"890cfbde-0957-444e-aa0c-249c0fef42ca"}},"thread_id":4874360,"fiber_id":13477840,"process_id":28699,"file":"/var/vcap/packages/dea_next/lib/dea/nats.rb","lineno":148,"method":"handle_incoming_message"}
api and api_worker jobs fail to bosh update, but monit start OK
Guillaume Berche <bercheg@...>
Hi,
I'm experiencing a weird situation where the api and api_worker jobs fail to update through bosh and end up being reported as "not running". However, when manually running "monit start cloud_controller_ng" (or rebooting the VM), the faulty job starts fine, and the bosh deployment proceeds without errors. Looking at the monit logs, it seems that there is an extra monit stop request for the cc_ng. Below are detailed traces illustrating the issue.

$ bosh deploy
[..]
Started updating job ha_proxy_z1 > ha_proxy_z1/0 (canary). Done (00:00:39)
Started updating job api_z1 > api_z1/0 (canary). Failed: `api_z1/0' is not running after update (00:10:44)

When instructing bosh to update the job (in this case only a config change), we indeed see the bosh agent asking monit to stop jobs, restart monit itself, and start jobs, and then we see the extra stop (at 12:33:26) before the bosh director ends up timing out and calling the canary failed.

$ less /var/vcap/monit/monit.log
[UTC May 22 12:33:17] info : Awakened by User defined signal 1
[UTC May 22 12:33:17] info : Awakened by the SIGHUP signal
[UTC May 22 12:33:17] info : Reinitializing monit - Control file '/var/vcap/bosh/etc/monitrc'
[UTC May 22 12:33:17] info : Shutting down monit HTTP server
[UTC May 22 12:33:18] info : monit HTTP server stopped
[UTC May 22 12:33:18] info : Starting monit HTTP server at [127.0.0.1:2822]
[UTC May 22 12:33:18] info : monit HTTP server started
[UTC May 22 12:33:18] info : 'system_897cdb8d-f9f7-4bfa-a748-512489b676e0' Monit reloaded
[UTC May 22 12:33:23] info : start service 'consul_agent' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : Awakened by User defined signal 1
[UTC May 22 12:33:23] info : 'consul_agent' start: /var/vcap/jobs/consul_agent/bin/agent_ctl
[UTC May 22 12:33:23] info : start service 'nfs_mounter' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : start service 'metron_agent' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:23] info : start service 'cloud_controller_worker_1' on user request
[UTC May 22 12:33:23] info : monit daemon at 1050 awakened
[UTC May 22 12:33:24] info : 'consul_agent' start action done
[UTC May 22 12:33:24] info : 'nfs_mounter' start: /var/vcap/jobs/nfs_mounter/bin/nfs_mounter_ctl
[UTC May 22 12:33:24] info : 'cloud_controller_worker_1' start: /var/vcap/jobs/cloud_controller_worker/bin/cloud_controller_worker_ctl
[UTC May 22 12:33:25] info : 'cloud_controller_worker_1' start action done
[UTC May 22 12:33:25] info : 'metron_agent' start: /var/vcap/jobs/metron_agent/bin/metron_agent_ctl
[UTC May 22 12:33:26] info : 'metron_agent' start action done
[UTC May 22 12:33:26] info : 'cloud_controller_worker_1' stop: /var/vcap/jobs/cloud_controller_worker/bin/cloud_controller_worker_ctl
[UTC May 22 12:33:27] info : 'nfs_mounter' start action done
[UTC May 22 12:33:27] info : Awakened by User defined signal 1

There are no associated traces of the bosh agent asking for this extra stop:

$ less /var/vcap/bosh/log/current
2015-05-22_12:33:23.73606 [monitJobSupervisor] 2015/05/22 12:33:23 DEBUG - Starting service cloud_controller_worker_1
2015-05-22_12:33:23.73608 [http-client] 2015/05/22 12:33:23 DEBUG - Monit request: url='http://127.0.0.1:2822/cloud_controller_worker_1' body='action=start'
2015-05-22_12:33:23.73608 [attemptRetryStrategy] 2015/05/22 12:33:23 DEBUG - Making attempt #0
2015-05-22_12:33:23.73609 [clientRetryable] 2015/05/22 12:33:23 DEBUG - [requestID=52ede4f0-427d-4e65-6da1-d3b5c4b5cafd] Requesting (attempt=1): Request{ Method: 'POST', URL: 'http://127.0.0.1:2822/cloud_controller_worker_1' }
2015-05-22_12:33:23.73647 [clientRetryable] 2015/05/22 12:33:23 DEBUG - [requestID=52ede4f0-427d-4e65-6da1-d3b5c4b5cafd] Request succeeded (attempts=1), response: Response{ StatusCode: 200, Status: '200 OK'}
2015-05-22_12:33:23.73648 [MBus Handler] 2015/05/22 12:33:23 INFO - Responding
2015-05-22_12:33:23.73650 [MBus Handler] 2015/05/22 12:33:23 DEBUG - Payload
2015-05-22_12:33:23.73650 ********************
2015-05-22_12:33:23.73651 {"value":"started"}
2015-05-22_12:33:23.73651 ********************
2015-05-22_12:33:36.69397 [NATS Handler] 2015/05/22 12:33:36 DEBUG - Message Payload
2015-05-22_12:33:36.69397 ********************
2015-05-22_12:33:36.69397 {"job":"api_worker_z1","index":0,"job_state":"failing","vitals":{"cpu":{"sys":"6.5","user":"14.4","wait":"0.4"},"disk":{"ephemeral":{"inode_percent":"10","percent":"14"},"persistent":{"inode_percent":"36","percent":"48"},"system":{"inode_percent":"36","percent":"48"}},"load":["0.19","0.06","0.06"],"mem":{"kb":"81272","percent":"8"},"swap":{"kb":"0","percent":"0"}}}

This is reproducing systematically on our setup, using bosh release 152 with stemcell bosh-vcloud-esxi-ubuntu-trusty-go_agent version 2889, and cf release 207 running stemcell 2889. Enabling monit verbose logs discarded the theory of monit restarting cc_ng jobs because of too much RAM usage or a failed HTTP health check (along with the short time window in which the extra stop is requested: ~15 s). I also discarded the possibility of multiple monit instances, or a pid inconsistency with the cc_ng process. I'm now suspecting either the bosh agent sending an extra stop request, or something in the cc_ng ctl scripts.

As a side question, can someone explain how the cc_ng ctl script works? I'm surprised by the following process tree, where ruby seems to call the ctl script. Is the cc spawning itself?

$ ps auxf --cols=2000 | less
[...]
vcap 8011 0.6 7.4 793864 299852 ? S<l May26 6:01 ruby /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller -m -c /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml
root 8014 0.0 0.0 19596 1436 ? S< May26 0:00 \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root 8023 0.0 0.0 5924 1828 ? S< May26 0:00 | \_ tee -a /dev/fd/63
root 8037 0.0 0.0 19600 1696 ? S< May26 0:00 | | \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root 8061 0.0 0.0 5916 1924 ? S< May26 0:00 | | \_ logger -p user.info -t vcap.cloud_controller_ng_ctl.stdout
root 8024 0.0 0.0 7552 1788 ? S< May26 0:00 | \_ awk -W Interactive {lineWithDate="echo [`date +\"%Y-%m-%d %H:%M:%S%z\"`] \"" $0 "\""; system(lineWithDate) }
root 8015 0.0 0.0 19600 1440 ? S< May26 0:00 \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root 8021 0.0 0.0 5924 1832 ? S< May26 0:00 \_ tee -a /dev/fd/63
root 8033 0.0 0.0 19600 1696 ? S< May26 0:00 | \_ /bin/bash /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_ctl start
root 8060 0.0 0.0 5912 1920 ? S< May26 0:00 | \_ logger -p user.error -t vcap.cloud_controller_ng_ctl.stderr
root 8022 0.0 0.0 7552 1748 ? S< May26 0:00 \_ awk -W Interactive {lineWithDate="echo [`date +\"%Y-%m-%d %H:%M:%S%z\"`] \"" $0 "\""; system(lineWithDate) }

I was wondering whether this could come from our setup running CF with a more recent stemcell version (2922) than what the cf release notes mention as the "tested configuration". Are the latest stemcells tested against the latest CF release? Is there any way to see what stemcell version the runtime team's pipelines are using? [1] seemed to accept env vars and [2] required logging in. I scanned through the bosh agent commit logs to spot something related, but without luck so far.

Thanks in advance for your help,

Guillaume.

[1] https://github.com/cloudfoundry/bosh-lite/blob/master/ci/ci-stemcell-bats.sh
[2] https://concourse.diego-ci.cf-app.com/
Cloud Foundry Warden Mechanism
Kenneth Ham <kenneth.ham@...>
I need some help here. I have been working on this for a week now and have
researched the entire web, but I couldn't find any relevant resource on this.

1. lib/warden/container/linux.rb: I am trying to create a callback mechanism during do_create, do_destroy, etc., and publish my callback to a web API. How can I best achieve this?
2. Using a unix socket, I tried to read /tmp/warden.sock and intercept messages. I can't seem to get this to work; any advice on what I have done wrong?

Please advise on the best way I can approach this. Thank you.

/kennetham

Important: This email and any attachments are confidential and may also be privileged. If you are not the intended addressee, please delete this email and any attachments from your system and notify the sender immediately; you should not copy, distribute, circulate or in any other way use or take actions in reliance on the information contained in this email or any attachments for any purpose, nor disclose its contents to any other person. Thank you.
Re: scheduler
Corentin Dupont <corentin.dupont@...>
Some other questions:
- Is there a consolidation mechanism? From what I can see in the videos, Diego only does load balancing when allocating an application to a DEA. What is more important to us is to consolidate: we want to minimize the number of DEAs. Is there an extensibility mechanism for the scheduler?
- Is there an auto-scaling mechanism? I'm thinking of auto-scaling at two levels. At the application level, it would be nice to have auto-scaling in the manifest.yml: if some KPI goes up, launch more instances. At the DEA level, a bit like bosh-scaler: if the DEAs are full, launch a new one.

Thanks!!
Corentin
On Tue, May 26, 2015 at 5:25 PM, Onsi Fakhouri <ofakhouri(a)pivotal.io> wrote:
Diego is very much usable at this point and we're encouraging beta testers --
Corentin Dupont
Researcher @ Create-Net
www.corentindupont.info
Re: [vcap-dev] bosh create release --force
Filip Hanik
The script that is executing at the time is:
https://github.com/cloudfoundry/cf-release/blob/master/packages/uaa/pre_packaging#L36

My suggestion to test whether this works:

1. cd src/uaa
2. ensure that you have a JDK 7 installed
3. run the command './gradlew assemble --info'

This will tell us if the build process works on your machine. We're looking for the output:

BUILD SUCCESSFUL

Total time: 40.509 secs

Task timings:
579ms :cloudfoundry-identity-common:jar
7056ms :cloudfoundry-identity-common:javadoc
1981ms :cloudfoundry-identity-scim:compileJava
747ms :cloudfoundry-identity-login:compileJava
3800ms :cloudfoundry-identity-scim:javadoc
3141ms :cloudfoundry-identity-login:javadoc
3055ms :cloudfoundry-identity-uaa:war
1379ms :cloudfoundry-identity-samples:cloudfoundry-identity-api:javadoc
2176ms :cloudfoundry-identity-samples:cloudfoundry-identity-api:war
1443ms :cloudfoundry-identity-samples:cloudfoundry-identity-app:javadoc
2178ms :cloudfoundry-identity-samples:cloudfoundry-identity-app:war

On Wed, May 27, 2015 at 7:22 AM, Dhilip Kumar S <dhilip.kumar.s(a)huawei.com> wrote:
Hi All,
Re: scheduler
Eric Malm <emalm@...>
Hi, Corentin,
Diego, like the DEAs, supports evacuation of LRP instances during controlled shutdown of a cell VM (the analog of a single DEA in Diego's architecture). If you're using BOSH to deploy your Diego cluster and you redeploy to scale down the number of cell VMs, BOSH will trigger evacuation via the `drain` script in the rep job template. This will cause that cell's rep process to signal to the rest of the system via the BBS that its instances should be started on the other cells. Once those instances are all placed elsewhere, or the drain timeout is reached, the cell will finish shutting down. If you're not using BOSH to deploy your cluster, the drain script template in diego-release should show you how to trigger the rep to evacuate manually.

If you're reducing the size of your deployment, you should of course ensure that you have sufficient capacity in the scaled-down cluster to run all your application instances, with some headroom for staging tasks and placement of high-memory app instances. Diego's placement algorithm currently prefers an even distribution of instances across availability zones and cell VMs, so its ideal placement results in roughly the same amount of capacity free on each cell.

Diego itself does not include an autoscaling mechanism for long-running processes, although it does now report instance CPU/disk/memory usage metrics through the loggregator system. One could use that to build an autoscaler for CF apps via the CC API; if existing autoscalers use those fields from the 'stats' endpoint on the CC API, they should continue to function with the Diego backend. Likewise, Diego has no knowledge of its provisioner (BOSH or otherwise), so it can't scale its own deployment automatically, but one could automate monitoring Diego's capacity metrics (also emitted through the loggregator system) and scaling up or down the cell deployment in response to certain capacity thresholds.

Thanks,
Eric, CF Runtime Diego PM

On Wed, May 27, 2015 at 5:22 AM, Corentin Dupont <corentin.dupont(a)create-net.org> wrote:
Some other questions:
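For the BOSH-deployed case, the scale-down that triggers this evacuation is just an instance-count change on the cell job; a rough sketch, with a hypothetical job name and counts:

```
# Hypothetical Diego manifest excerpt: reducing `instances` and redeploying
# makes BOSH drain (and so evacuate) the removed cell VMs before deleting them.
jobs:
- name: cell_z1
  instances: 3   # previously 5; the two removed cells evacuate their LRPs first
```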
Re: api and api_worker jobs fail to bosh update, but monit start OK
Mike Youngstrom
We recently experienced a similar issue. Not sure if it is the same. But
it was caused when we moved the nfs_mounter job template to the first item in the list of templates for the CC job. We moved nfs_mounter to the last job template in the list and we haven't had a problem since. It was really strange, because you'd think you'd want nfs_mounter first. Anyway, something to try.

Mike
On Wed, May 27, 2015 at 4:51 AM, Guillaume Berche <bercheg(a)gmail.com> wrote:
Hi,
Re: api and api_worker jobs fail to bosh update, but monit start OK
Dieu Cao <dcao@...>
We have environments on stemcell 2977 that are running well.
We have an environment using NFS that ran into that same issue, and we have this bug open [1]. Specifying the nfs_mounter job last should work in the meantime until we get the order switched. This was apparently introduced when we added consul_agent to the cloud controller jobs. I'll update the release notes for the affected releases.

-Dieu
CF Runtime PM

[1] https://www.pivotaltracker.com/story/show/94152506
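In manifest terms, the workaround is just reordering the colocated templates on the Cloud Controller jobs so nfs_mounter comes last; a rough sketch (template names and the exact job layout depend on your manifest):

```
# Hypothetical excerpt for the api job; apply the same ordering to api_worker.
jobs:
- name: api_z1
  templates:
  - name: cloud_controller_ng
  - name: consul_agent
  - name: metron_agent
  - name: nfs_mounter   # listed last as a workaround for the ordering bug above
```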
On Wed, May 27, 2015 at 10:09 AM, Mike Youngstrom <youngm(a)gmail.com> wrote:
We recently experienced a similar issue. Not sure if it is the same. But
Diego Question
Daniel Mikusa
I was testing an app on Diego today and part of the test was for the app to
fail. I simulated this by putting some garbage into the `-c` argument of `cf push`. This had the right effect and my app failed. At the same time, I was tailing the logs in another window. While I got my logs, I also got hundreds of lines like this...

```
2015-05-27T16:46:01.64-0400 [HEALTH/0] OUT healthcheck failed
2015-05-27T16:46:01.65-0400 [HEALTH/0] OUT Exit status 1
2015-05-27T16:46:02.19-0400 [HEALTH/0] OUT healthcheck failed
2015-05-27T16:46:02.19-0400 [HEALTH/0] OUT Exit status 1
2015-05-27T16:46:02.74-0400 [HEALTH/0] OUT healthcheck failed
2015-05-27T16:46:02.74-0400 [HEALTH/0] OUT Exit status 1
...
```

Is that expected? It seems to add a lot of noise. Sorry, I don't know the exact version of Diego; I was testing on PWS.

Thanks,
Dan
Re: Diego Question
Karen Wang <khwang@...>
Dan,
We announce PWS's CF version on status.run.pivotal.io, under "About This Site":

"If you encounter any issues please contact support(a)run.pivotal.io. Pivotal Web Services is the latest public release of the OSS Cloud Foundry Project. The current release of Cloud Foundry deployed on PWS is v210 on 23 May 2015. Details about this release can be found at the Cloud Foundry community wiki which is located at: https://github.com/cloudfoundry-community/cf-docs-contrib/wiki/All-CF-Releases"

When you click on the link and go to the release notes for v210, you'll see:

Compatible Diego Version
- final release 0.1247.0 commit <https://github.com/cloudfoundry-incubator/diego-release/commit/a122a78eeb344bbfc90b7bcd0fa987d08ef1a5d1>

And this is the version of Diego deployed alongside that specific CF release.

Karen
On Wed, May 27, 2015 at 1:53 PM, Daniel Mikusa <dmikusa(a)pivotal.io> wrote:
I was testing an app on Diego today and part of the test was for the app
Re: Multiple Availability Zone
iamflying
I updated my bosh (using bosh-init) with enabling
ignore_server_availability_zone, but it still failed when I deployed my cf. Any suggestions?

openstack: &openstack
  auth_url: http://137.172.74.78:5000/v2.0 # <--- Replace with OpenStack Identity API endpoint
  tenant: cf # <--- Replace with OpenStack tenant name
  username: cf-admin # <--- Replace with OpenStack username
  api_key: passw0rd # <--- Replace with OpenStack password
  default_key_name: cf-keypair
  default_security_groups: [default,bosh]
  ignore_server_availability_zone: true

Error message from the deployment of cf:

Started updating job etcd_z1 > etcd_z1/0 (canary). Failed: OpenStack API Bad Request (Invalid input received: Availability zone 'cloud-cf-az2' is invalid). Check task debug log for details. (00:00:19)
Error 100: OpenStack API Bad Request (Invalid input received: Availability zone 'cloud-cf-az2' is invalid). Check task debug log for details.

I checked the API request on the first compute node (/var/log/cinder/api.log):

2015-05-27 16:28:40.652 32174 DEBUG cinder.api.v1.volumes [req-4df6ac85-e986-438a-a953-5a2190ec5f62 8b0d5a75bd9c4539ba7fa64e5669c6c8 48a0898a9c4944f1b321da699ca4c37a - - -] Create volume request body: {u'volume': {'scheduler_hints': {}, u'availability_zone': u'cloud-cf-az2', u'display_name': u'volume-36f9a2eb-8bc9-4f27-9530-34c9d24fa881', u'display_description': u'', u'size': 10}} create /usr/lib/python2.6/site-packages/cinder/api/v1/volumes.py:316

Attached my cf deployment file for reference: cf-deployment-single-az.yml <http://cf-dev.70369.x6.nabble.com/file/n206/cf-deployment-single-az.yml>

--
View this message in context: http://cf-dev.70369.x6.nabble.com/cf-dev-Multiple-Availability-Zone-tp192p206.html
Sent from the CF Dev mailing list archive at Nabble.com.
Custom Login Server with UAA 2.0+
Matt Cholick
Prior to the consolidation of uaa and the login server in uaa release 2.0,
we were running our own login server to handle auth to our platform. We simply reduced the instances of the bundled CF login server to 0 and put our own in place, which snagged the login subdomain. This worked just fine; our solution implemented all the needed endpoints for login.

We're now upgrading to a newer release with UAA 2.0+ and having difficulties. The uaa registrar hardcodes grabbing the login subdomains:

...
- login.<%= properties.domain %>
- '*.login.<%= properties.domain %>'
...

See: https://github.com/cloudfoundry/cf-release/blob/master/jobs/uaa/templates/cf-registrar.config.yml.erb

This prevents us from taking over login. We locally removed those list items and our custom login server does continue to work. We have some questions about the right approach going forward, though. Are uaa and the login server going to continue to merge, to the point where we can no longer take over the login subdomain? Will this strategy no longer be feasible? What's the right answer for non-ldap/saml environments, if the uaa project's roadmap makes this replacement impossible?

If our current solution will continue to work for the foreseeable future, would the uaa team be amenable to a pull request making the uri values configurable, so we can continue to take over the login subdomain?

-Matt Cholick
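To make the ask concrete, a rough sketch of what a configurable list could look like; the property name and structure are purely hypothetical, not an existing cf-release option:

```
# Hypothetical property; defaults could stay as the currently hardcoded values.
properties:
  uaa:
    registrar:
      uris:
      - login.example.com       # placeholder for login.<system domain>
      - '*.login.example.com'
```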