Re: Remarks about the “confab” wrapper for consul
Orchestrating a raft cluster in a way that requires no manual intervention is incredibly difficult. We write the PID file late for a specific reason: https://www.pivotaltracker.com/story/show/112018069
For dealing with wedged states like the one you encountered, we have some recommendations in the documentation: https://github.com/cloudfoundry-incubator/consul-release/#disaster-recovery
We have acceptance tests in CI that exercise rolling a 3-node cluster, so if you hit a failure, any logs you can share would be useful.
Cheers, Amit
On Mon, Apr 11, 2016 at 9:38 AM, Benjamin Gandon <benjamin(a)gandon.org> wrote: Actually, doing some further tests, I realize a mere 'join' is definitely not enough.
Instead, you need to restore the raft/peers.json on each one of the 3 consul server nodes:
monit stop consul_agent
echo '["10.244.0.58:8300","10.244.2.54:8300","10.244.0.54:8300"]' > /var/vcap/store/consul_agent/raft/peers.json
And make sure you start them at roughly the same time with “monit start consul_agent”.
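The restore steps above can be sketched as a small script. This is only an illustration under stated assumptions: ssh access as vcap to each consul server node, and the example IPs from this thread (substitute your own).

```shell
# Emit a raft peers.json array ("IP:8300",...) for the given server IPs.
build_peers() {
  printf '['
  sep=''
  for ip in "$@"; do
    printf '%s"%s:8300"' "$sep" "$ip"
    sep=','
  done
  printf ']'
}

# Stop every agent, rewrite peers.json everywhere, then start the agents
# close together so they can re-form a quorum.
restore_cluster() {
  peers=$(build_peers "$@")
  for n in "$@"; do ssh "vcap@$n" 'monit stop consul_agent'; done
  for n in "$@"; do
    ssh "vcap@$n" "echo '$peers' > /var/vcap/store/consul_agent/raft/peers.json"
  done
  for n in "$@"; do ssh "vcap@$n" 'monit start consul_agent' & done
  wait
}

# Example invocation (commented out; needs real ssh access):
# restore_cluster 10.244.0.58 10.244.2.54 10.244.0.54
```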
So this argues strongly for setting *skip_leave_on_interrupt=true* and *leave_on_terminate=false* in confab, because losing the peers.json is really something we don't want in our CF deployments!
/Benjamin
Le 11 avr. 2016 à 18:15, Benjamin Gandon <benjamin(a)gandon.org> a écrit :
Hi cf devs,
I’m running a CF deployment with redundancy, and I just experienced my consul servers not being able to elect any leader. That’s a VERY frustrating situation that keeps the whole CF deployment down, until you get a deeper understanding of consul, and figure out they just need a silly manual 'join' so that they get back together.
But that was definitely not easy to nail down because at first look, I could just see monit restarting the “agent_ctl” every 60 seconds because confab was not writing the damn PID file.
More specifically, the 3 consul servers (i.e. consul_z1/0, consul_z1/1 and consul_z2/0) had properly left one another upon a graceful shutdown. This state was persisted in /var/vcap/store/raft/peers.json being “null” on each one of them, so they would not get back together on restart. A manual 'join' was necessary. But it took me hours to get there because I’m no expert with consul.
And until the 'join' was made, VerifySynced() was failing in confab, and monit was constantly starting and stopping it every 60 seconds. But once you step back, you realize confab was actually waiting for the new leader to be elected before writing the PID file. Which is questionable.
So, I’m asking 3 questions here:
1. Does writing the PID file in confab *that* late really make sense?
2. Could someone please write some minimal documentation about confab, at least to tell what it is supposed to do?
3. Wouldn’t it be wiser that whenever any of the consul servers is down, the cluster gets marked unhealthy?
With this 3rd question, I mean that even on a graceful TERM or INT, no consul server should perform any graceful 'leave'. With this different approach, they would properly come back up even when performing a complete graceful restart of the cluster.
This can be done with those extra configs from the “confab” wrapper:
{ "skip_leave_on_interrupt": true, "leave_on_terminate": false }
What do you guys think of it?
/Benjamin
Re: cf stop not sending SIGTERM
Thanks! This was happening in JBP v3.3.1; I just tested with JBP v3.6 and it's working.
Re: Doppler/Firehose - Multiline Log Entry
Mike Youngstrom <youngm@...>
Finally got around to testing this. Preliminary testing shows that "\u2028" doesn't function as a newline character in bash and causes the eclipse console to wig out. I don't think "\u2028" is a viable long-term solution. I hope you make progress on a metric format available to an app in a container. I too would like a tracker link to such a feature if there is one.
Thanks, Mike
On Mon, Mar 14, 2016 at 2:28 PM, Mike Youngstrom <youngm(a)gmail.com> wrote: Hi Jim,
So, to be clear, what we're basically doing is using a unicode newline character to fool loggregator (which looks for \n) into not treating it as a new log event, right? Does \u2028 work as a newline character when tailing logs in the CLI? Has anyone tried this unicode newline character in various consoles (IDE, xterm, etc.)? I'm wondering if developers will need different config for development.
Mike
On Mon, Mar 14, 2016 at 12:17 PM, Jim CF Campbell <jcampbell(a)pivotal.io> wrote:
Hi Mike and Alex,
Two things - for Java, we are working toward defining an enhanced metric format that will support transport of Multi Lines.
The second is this workaround that David Laing suggested for Logstash. Think you could use it for Splunk?
With the Java Logback library you can do this by adding "%replace(%xException){'\n','\u2028'}%nopex" to your logging config[1] , and then use the following logstash conf.[2] Replace the unicode newline character \u2028 with \n, which Kibana will display as a new line.
mutate {
  gsub => [ "[@message]", '\u2028', "
" ]  # Passing a string with an actual newline in it seems to be the only way to make gsub emit one
}
to replace the token with a regular newline again so it displays "properly" in Kibana.
[1] github.com/dpin...ication.yml#L12 <https://github.com/dpinto-pivotal/cf-SpringBootTrader-config/blob/master/application.yml#L12>
[2] github.com/logs...se.conf#L60-L64 <https://github.com/logsearch/logsearch-for-cloudfoundry/blob/master/src/logsearch-config/src/logstash-filters/snippets/firehose.conf#L60-L64>
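The round trip can be checked without a CF deployment. A small sketch, assuming bash and GNU sed (the exception text below is invented for illustration):

```shell
# U+2028 (LINE SEPARATOR) is UTF-8 bytes E2 80 A8 and contains no 0x0A byte,
# so \n-based log framing treats the whole message as a single event.
msg=$(printf 'Exception in thread "main"\xe2\x80\xa8\tat com.example.App.main')

echo "$msg" | wc -l    # counts 1: a line-oriented pipeline sees one line

# What the logstash gsub above does: turn U+2028 back into a real newline.
echo "$msg" | sed $'s/\xe2\x80\xa8/\\n/g'
```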
On Mon, Mar 14, 2016 at 11:11 AM, Mike Youngstrom <youngm(a)gmail.com> wrote:
I'll let the Loggregator team respond formally. But, in my conversations with the Loggregator team, I think we're basically stuck, not sure what the right thing to do is on the client side. How does the client signal to loggregator that this is a multi-line log message, or what is the right way for loggregator to detect that the client is trying to send one? Any ideas?
Mike
On Mon, Mar 14, 2016 at 10:25 AM, Aliaksandr Prysmakou < prysmakou(a)gmail.com> wrote:
Hi guys, Are there any updates on the "Multiline Log Entry" issue? What is the correct way to deal with stack traces? Any links to the tracker to read? ---- Alex Prysmakou / Altoros Tel: (617) 841-2121 ext. 5161 | Toll free: 855-ALTOROS Skype: aliaksandr.prysmakou www.altoros.com | blog.altoros.com | twitter.com/altoros
-- Jim Campbell | Product Manager | Cloud Foundry | Pivotal.io | 303.618.0963
[PROPOSAL]: Removing ability to specify npm version
Re: Staging and Runtime Hooks Feature Narrative
Mike Youngstrom <youngm@...>
An interesting proposal. Any thoughts about this proposal in relation to multi-buildpacks [0]? How many of the use cases for this feature go away in lieu of multi-buildpack support? I think it would be interesting to be able to apply hooks without checking scripts into the application (like multi-buildpack). This feature also appears to be somewhat related to [1]. I hope that someone is overseeing all these newly proposed buildpack features to help ensure they are coherent. Mike [0] https://lists.cloudfoundry.org/archives/list/cf-dev(a)lists.cloudfoundry.org/message/H64GGU6Z75CZDXNWC7CKUX64JNPARU6Y/ [1] https://lists.cloudfoundry.org/archives/list/cf-dev(a)lists.cloudfoundry.org/thread/GRKFQ2UOQL7APRN6OTGET5HTOJZ7DHRQ/#SEA2RWDCAURSVPIMBXXJMWN7JYFQICL3
On Fri, Apr 8, 2016 at 4:16 PM, Troy Topnik <troy.topnik(a)hpe.com> wrote: This feature gives developers more control over the staging and deployment of their application code, without having to fork existing buildpacks or create their own.
https://docs.google.com/document/d/1PnTtTLwXOTG7f70ilWGlbTbi1LAXZu9zYnrUVvjr31I/edit
Hooks give developers the ability to optionally:
* run scripts in the staging container before and/or after the bin/compile scripts executed by the buildpack, and
* run scripts in each app container before the app starts (via .profile, as per the Heroku buildpack API)
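As an illustration of the second kind of hook, here is a hypothetical .profile placed at the app root. The variable and file names are invented; per the Heroku buildpack API referenced above, the script is sourced in the app container just before the start command runs.

```shell
# Hypothetical .profile: sourced before the app's start command, so it can
# adjust the environment without forking a buildpack.
export JAVA_OPTS="${JAVA_OPTS:-} -Duser.timezone=UTC"  # example tweak

# Optional per-app pre-start hook, if the app ships one (invented path):
if [ -f "$HOME/config/pre_start.sh" ]; then
  . "$HOME/config/pre_start.sh"
fi
```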
A similar feature has been available and used extensively in Stackato for a few years, and we'd like to contribute this functionality back to Cloud Foundry.
A proof-of-concept of this feature has already been submitted as a pull request, and the Feature Narrative addresses many of the questions raised in the PR discussion:
https://github.com/cloudfoundry-incubator/buildpack_app_lifecycle/pull/13
Please weigh in with comments in the document itself or in this thread.
Thanks,
TT
Re: Request for Multibuildpack Use Cases
Mike Youngstrom <youngm@...>
On Sun, Apr 10, 2016 at 6:15 PM, Danny Rosen <drosen(a)pivotal.io> wrote: Hi there,
The CF Buildpacks team is considering taking on a line of work to provide more formal support for multibuildpacks. Before we start, we would be interested in learning if any community users have compelling use cases they could share with us.
For more information on multibuildpacks, see Heroku's documentation [1]
[1] - https://devcenter.heroku.com/articles/using-multiple-buildpacks-for-an-app
Re: Remarks about the “confab” wrapper for consul
Benjamin Gandon
Actually, doing some further tests, I realize a mere 'join' is definitely not enough.
Instead, you need to restore the raft/peers.json on each one of the 3 consul server nodes:
monit stop consul_agent
echo '["10.244.0.58:8300","10.244.2.54:8300","10.244.0.54:8300"]' > /var/vcap/store/consul_agent/raft/peers.json
And make sure you start them at roughly the same time with “monit start consul_agent”.
So this argues strongly for setting skip_leave_on_interrupt=true and leave_on_terminate=false in confab, because losing the peers.json is really something we don't want in our CF deployments!
/Benjamin
Le 11 avr. 2016 à 18:15, Benjamin Gandon <benjamin(a)gandon.org> a écrit :
Hi cf devs,
I’m running a CF deployment with redundancy, and I just experienced my consul servers not being able to elect any leader. That’s a VERY frustrating situation that keeps the whole CF deployment down, until you get a deeper understanding of consul, and figure out they just need a silly manual 'join' so that they get back together.
But that was definitely not easy to nail down because at first look, I could just see monit restarting the “agent_ctl” every 60 seconds because confab was not writing the damn PID file.
More specifically, the 3 consul servers (i.e. consul_z1/0, consul_z1/1 and consul_z2/0) had properly left one another upon a graceful shutdown. This state was persisted in /var/vcap/store/raft/peers.json being “null” on each one of them, so they would not get back together on restart. A manual 'join' was necessary. But it took me hours to get there because I’m no expert with consul.
And until the 'join' was made, VerifySynced() was failing in confab, and monit was constantly starting and stopping it every 60 seconds. But once you step back, you realize confab was actually waiting for the new leader to be elected before writing the PID file. Which is questionable.
So, I’m asking 3 questions here:
1. Does writing the PID file in confab that late really make sense?
2. Could someone please write some minimal documentation about confab, at least to tell what it is supposed to do?
3. Wouldn’t it be wiser that whenever any of the consul servers is down, the cluster gets marked unhealthy?
With this 3rd question, I mean that even on a graceful TERM or INT, no consul server should perform any graceful 'leave'. With this different approach, they would properly come back up even when performing a complete graceful restart of the cluster.
This can be done with those extra configs from the “confab” wrapper:
{ "skip_leave_on_interrupt": true, "leave_on_terminate": false }
What do you guys think of it?
/Benjamin
Re: AUFS bug in Linux kernel
Benjamin Gandon
Very neat! Thanks a lot Eric.
Le 11 avr. 2016 à 17:46, Eric Malm <emalm(a)pivotal.io> a écrit :
Hi, Benjamin,
Yes, the BOSH-Lite boxes with kernel 3.19.0-40 through 3.19.0-50 are all susceptible to the AUFS bug. Kernel versions 3.19.0-51 and later will be fine, and I believe the earliest BOSH-Lite Vagrant box with one of those kernel versions is 9000.102.0. The 3.19.0-49 kernel that went into 3192 was a one-off build that Canonical supplied in advance of the release of the official kernel package with the fix (https://launchpad.net/ubuntu/+source/linux-lts-vivid/3.19.0-51.57~14.04.1 <https://launchpad.net/ubuntu/+source/linux-lts-vivid/3.19.0-51.57~14.04.1>), and the 'official' package with kernel 3.19.0-49 still has the AUFS bug.
Thanks, Eric
On Mon, Apr 11, 2016 at 8:36 AM, Benjamin Gandon <benjamin(a)gandon.org <mailto:benjamin(a)gandon.org>> wrote: Hi,
Sorry for the late follow-up, but would this hit bosh-lite too? After it has run for a while, I’m experiencing similarly severe issues with the 53 garden containers I use in Bosh-Lite.
Config:
- Bosh-lite v9000.91.0 (i.e. bosh v250 + warden-cpi v29 + garden-linux v0.331.0); the kernel is 3.19.0-47.53~14.04.1 (I might have upgraded it)
- Deployment: cf v231 + Diego v0.1434.0 + Garden-linux v0.333.0 + Etcd v36 + cf-mysql v26 + other
Will the linux-image-3.19.0-49-generic fix the issue, as it was done in this 2016-02-08 commit <https://github.com/cloudfoundry/bosh/commit/750c5e7ed70b1d7753500ca725590c1c0eac1262> for stemcell 3192 ?
As a safety measure, I decided to upgrade to kernel 3.19.0-58-generic and I would be happy to get a confirmation that (1) my bosh-lite deployment was hit by the AUFS bug, and that (2) the new kernel I installed will get me off this operational nightmare.
Thanks!
Le 28 janv. 2016 à 02:06, Eric Malm <emalm(a)pivotal.io <mailto:emalm(a)pivotal.io>> a écrit :
Hi, Mike,
Warden also uses aufs for its containers' overlay filesystems, so we expect the same issue to affect the DEAs on these stemcell versions. I'm not aware of a deliberate attempt to reproduce it on the DEAs, though.
Thanks, Eric
On Wed, Jan 27, 2016 at 4:08 PM, Mike Youngstrom <youngm(a)gmail.com <mailto:youngm(a)gmail.com>> wrote: Thanks Will. Does anyone know if this bug could also impact Warden?
Mike
On Wed, Jan 27, 2016 at 9:50 AM, Will Pragnell <wpragnell(a)pivotal.io <mailto:wpragnell(a)pivotal.io>> wrote: A bug with AUFS [1] was introduced in version 3.19.0-40 of the linux kernel. This bug can cause containers to end up with unkillable zombie processes with high CPU usage. This can happen any time a container is supposed to be destroyed.
This affects both Garden-Linux and Warden (and Docker). If you see significant slowdown or increased CPU usage on DEAs or Diego cells, it might well be this. It will probably build slowly up over time, so you may not notice anything for a while depending on the rate of app instance churn on your deployment.
The bad version of the kernel is present in stemcell 3160 and later. I can't recommend using older stemcells because the bad kernel versions also include fixes for several high severity security vulnerabilities (at least [2-5], there may be others I've missed). Were it not for these, rolling back to stemcell 3157 would be the fix.
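As a quick way to tell whether a given VM's kernel falls in the affected window, here is a sketch (not an official tool). The range used, 3.19.0-40 through 3.19.0-50, with 3.19.0-51 and later carrying the fix, is taken from Eric's note elsewhere in this thread.

```shell
# Return success if the given kernel release string is in the affected range.
aufs_bug_affected() {
  patch=$(printf '%s' "$1" | sed -n 's/^3\.19\.0-\([0-9][0-9]*\).*/\1/p')
  [ -n "$patch" ] && [ "$patch" -ge 40 ] && [ "$patch" -le 50 ]
}

if aufs_bug_affected "$(uname -r)"; then
  echo "this kernel is in the AUFS zombie-process range"
fi
```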
We're waiting for a fix to make its way into the kernel, and the BOSH team will produce a stemcell with the fix as soon as possible. In the meantime, I'd suggest simply keeping a closer eye than usual on your DEAs and Diego cells.
If this issue occurs, the only solution is to recreate that machine. While we've not had any actual reports of this issue occurring for Cloud Foundry deployments in the wild yet, we're confident that it is occurring. The Diego team have seen it in testing, and several teams have encountered the issue with their Concourse workers, which also use Garden-Linux.
As always, please get in touch if you have any questions.
Will - Garden PM
[1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043 <https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043> [2]: http://www.ubuntu.com/usn/usn-2857-1/ <http://www.ubuntu.com/usn/usn-2857-1/> [3]: http://www.ubuntu.com/usn/usn-2868-1/ <http://www.ubuntu.com/usn/usn-2868-1/> [4]: http://www.ubuntu.com/usn/usn-2869-1/ <http://www.ubuntu.com/usn/usn-2869-1/> [5]: http://www.ubuntu.com/usn/usn-2871-2/ <http://www.ubuntu.com/usn/usn-2871-2/>
Remarks about the “confab” wrapper for consul
Benjamin Gandon
Hi cf devs,
I’m running a CF deployment with redundancy, and I just experienced my consul servers not being able to elect any leader. That’s a VERY frustrating situation that keeps the whole CF deployment down, until you get a deeper understanding of consul, and figure out they just need a silly manual 'join' so that they get back together.
But that was definitely not easy to nail down because at first look, I could just see monit restarting the “agent_ctl” every 60 seconds because confab was not writing the damn PID file.
More specifically, the 3 consul servers (i.e. consul_z1/0, consul_z1/1 and consul_z2/0) had properly left one another upon a graceful shutdown. This state was persisted in /var/vcap/store/raft/peers.json being “null” on each one of them, so they would not get back together on restart. A manual 'join' was necessary. But it took me hours to get there because I’m no expert with consul.
And until the 'join' was made, VerifySynced() was failing in confab, and monit was constantly starting and stopping it every 60 seconds. But once you step back, you realize confab was actually waiting for the new leader to be elected before writing the PID file. Which is questionable.
So, I’m asking 3 questions here:
1. Does writing the PID file in confab that late really make sense?
2. Could someone please write some minimal documentation about confab, at least to tell what it is supposed to do?
3. Wouldn’t it be wiser that whenever any of the consul servers is down, the cluster gets marked unhealthy?
With this 3rd question, I mean that even on a graceful TERM or INT, no consul server should perform any graceful 'leave'. With this different approach, they would properly come back up even when performing a complete graceful restart of the cluster.
This can be done with those extra configs from the “confab” wrapper:
{ "skip_leave_on_interrupt": true, "leave_on_terminate": false }
What do you guys think of it?
/Benjamin
Re: AUFS bug in Linux kernel
Hi, Benjamin,
Yes, the BOSH-Lite boxes with kernel 3.19.0-40 through 3.19.0-50 are all susceptible to the AUFS bug. Kernel versions 3.19.0-51 and later will be fine, and I believe the earliest BOSH-Lite Vagrant box with one of those kernel versions is 9000.102.0. The 3.19.0-49 kernel that went into 3192 was a one-off build that Canonical supplied in advance of the release of the official kernel package with the fix (https://launchpad.net/ubuntu/+source/linux-lts-vivid/3.19.0-51.57~14.04.1), and the 'official' package with kernel 3.19.0-49 still has the AUFS bug.
Thanks, Eric
On Mon, Apr 11, 2016 at 8:36 AM, Benjamin Gandon <benjamin(a)gandon.org> wrote: Hi,
Sorry for the late follow-up, but would this hit bosh-lite too? After it has run for a while, I’m experiencing similarly severe issues with the 53 garden containers I use in Bosh-Lite.
Config:
- Bosh-lite v9000.91.0 (i.e. bosh v250 + warden-cpi v29 + garden-linux v0.331.0); the kernel is 3.19.0-47.53~14.04.1 (I *might* have upgraded it)
- Deployment: cf v231 + Diego v0.1434.0 + Garden-linux v0.333.0 + Etcd v36 + cf-mysql v26 + other
Will the linux-image-3.19.0-49-generic fix the issue, as it was done in this 2016-02-08 commit <https://github.com/cloudfoundry/bosh/commit/750c5e7ed70b1d7753500ca725590c1c0eac1262> for stemcell 3192 ?
As a safety measure, I decided to upgrade to kernel 3.19.0-58-generic and I would be happy to get a confirmation that (1) my bosh-lite deployment was hit by the AUFS bug, and that (2) the new kernel I installed will get me off this operational nightmare.
Thanks!
Le 28 janv. 2016 à 02:06, Eric Malm <emalm(a)pivotal.io> a écrit :
Hi, Mike,
Warden also uses aufs for its containers' overlay filesystems, so we expect the same issue to affect the DEAs on these stemcell versions. I'm not aware of a deliberate attempt to reproduce it on the DEAs, though.
Thanks, Eric
On Wed, Jan 27, 2016 at 4:08 PM, Mike Youngstrom <youngm(a)gmail.com> wrote:
Thanks Will. Does anyone know if this bug could also impact Warden?
Mike
On Wed, Jan 27, 2016 at 9:50 AM, Will Pragnell <wpragnell(a)pivotal.io> wrote:
A bug with AUFS [1] was introduced in version 3.19.0-40 of the linux kernel. This bug can cause containers to end up with unkillable zombie processes with high CPU usage. This can happen any time a container is supposed to be destroyed.
This affects both Garden-Linux and Warden (and Docker). If you see significant slowdown or increased CPU usage on DEAs or Diego cells, it might well be this. It will probably build slowly up over time, so you may not notice anything for a while depending on the rate of app instance churn on your deployment.
The bad version of the kernel is present in stemcell 3160 and later. I can't recommend using older stemcells because the bad kernel versions also include fixes for several high severity security vulnerabilities (at least [2-5], there may be others I've missed). Were it not for these, rolling back to stemcell 3157 would be the fix.
We're waiting for a fix to make its way into the kernel, and the BOSH team will produce a stemcell with the fix as soon as possible. In the meantime, I'd suggest simply keeping a closer eye than usual on your DEAs and Diego cells.
If this issue occurs, the only solution is to recreate that machine. While we've not had any actual reports of this issue occurring for Cloud Foundry deployments in the wild yet, we're confident that it is occurring. The Diego team have seen it in testing, and several teams have encountered the issue with their Concourse workers, which also use Garden-Linux.
As always, please get in touch if you have any questions.
Will - Garden PM
[1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043 [2]: http://www.ubuntu.com/usn/usn-2857-1/ [3]: http://www.ubuntu.com/usn/usn-2868-1/ [4]: http://www.ubuntu.com/usn/usn-2869-1/ [5]: http://www.ubuntu.com/usn/usn-2871-2/
Re: AUFS bug in Linux kernel
Benjamin Gandon
Hi,
Sorry for the late follow-up, but would this hit bosh-lite too? After it has run for a while, I’m experiencing similarly severe issues with the 53 garden containers I use in Bosh-Lite.
Config:
- Bosh-lite v9000.91.0 (i.e. bosh v250 + warden-cpi v29 + garden-linux v0.331.0); the kernel is 3.19.0-47.53~14.04.1 (I might have upgraded it)
- Deployment: cf v231 + Diego v0.1434.0 + Garden-linux v0.333.0 + Etcd v36 + cf-mysql v26 + other
Will linux-image-3.19.0-49-generic fix the issue, as it was done in this 2016-02-08 commit <https://github.com/cloudfoundry/bosh/commit/750c5e7ed70b1d7753500ca725590c1c0eac1262> for stemcell 3192?
As a safety measure, I decided to upgrade to kernel 3.19.0-58-generic, and I would be happy to get a confirmation that (1) my bosh-lite deployment was hit by the AUFS bug, and that (2) the new kernel I installed will get me off this operational nightmare. Thanks!
Le 28 janv. 2016 à 02:06, Eric Malm <emalm(a)pivotal.io> a écrit :
Hi, Mike,
Warden also uses aufs for its containers' overlay filesystems, so we expect the same issue to affect the DEAs on these stemcell versions. I'm not aware of a deliberate attempt to reproduce it on the DEAs, though.
Thanks, Eric
On Wed, Jan 27, 2016 at 4:08 PM, Mike Youngstrom <youngm(a)gmail.com <mailto:youngm(a)gmail.com>> wrote: Thanks Will. Does anyone know if this bug could also impact Warden?
Mike
On Wed, Jan 27, 2016 at 9:50 AM, Will Pragnell <wpragnell(a)pivotal.io <mailto:wpragnell(a)pivotal.io>> wrote: A bug with AUFS [1] was introduced in version 3.19.0-40 of the linux kernel. This bug can cause containers to end up with unkillable zombie processes with high CPU usage. This can happen any time a container is supposed to be destroyed.
This affects both Garden-Linux and Warden (and Docker). If you see significant slowdown or increased CPU usage on DEAs or Diego cells, it might well be this. It will probably build slowly up over time, so you may not notice anything for a while depending on the rate of app instance churn on your deployment.
The bad version of the kernel is present in stemcell 3160 and later. I can't recommend using older stemcells because the bad kernel versions also include fixes for several high severity security vulnerabilities (at least [2-5], there may be others I've missed). Were it not for these, rolling back to stemcell 3157 would be the fix.
We're waiting for a fix to make its way into the kernel, and the BOSH team will produce a stemcell with the fix as soon as possible. In the meantime, I'd suggest simply keeping a closer eye than usual on your DEAs and Diego cells.
If this issue occurs, the only solution is to recreate that machine. While we've not had any actual reports of this issue occurring for Cloud Foundry deployments in the wild yet, we're confident that it is occurring. The Diego team have seen it in testing, and several teams have encountered the issue with their Concourse workers, which also use Garden-Linux.
As always, please get in touch if you have any questions.
Will - Garden PM
[1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043 <https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043> [2]: http://www.ubuntu.com/usn/usn-2857-1/ <http://www.ubuntu.com/usn/usn-2857-1/> [3]: http://www.ubuntu.com/usn/usn-2868-1/ <http://www.ubuntu.com/usn/usn-2868-1/> [4]: http://www.ubuntu.com/usn/usn-2869-1/ <http://www.ubuntu.com/usn/usn-2869-1/> [5]: http://www.ubuntu.com/usn/usn-2871-2/ <http://www.ubuntu.com/usn/usn-2871-2/>
CPU weight of application
Re: Go buildpack, cloud native and 12 factor
All buildpacks except the binary buildpack perform the build at push time. Both "buildpacks" and "12-factor" came out of Heroku. That said, whether to use the Go buildpack vs the binary buildpack is an interesting question.
One thing that's highly desirable is to build one artifact, which you can then promote from test, to staging, to production. To that end, a CI pipeline that builds a binary and promotes it between jobs that deploy to various environments would be the best way to achieve this. On the other hand, as Rash points out, this requires a leaked abstraction: your build pipeline has to know the target platform to compile for. In theory, test, staging, and prod might be using different stacks, though you're probably always safe assuming 64-bit linux, so I'd say the risk of having to cross-compile is fairly low. That said, for small projects, I definitely just use the Go buildpack for its convenience.
Amit
On Sun, Apr 10, 2016 at 4:05 PM, Rasheed Abdul-Aziz <rabdulaziz(a)pivotal.io> wrote: Buildpacks are still environmentally aware of the target build environment. They mean you don't need to worry about cross-platform support.
On Sun, Apr 10, 2016 at 6:57 PM, john mcteague <john.mcteague(a)gmail.com> wrote:
On a lazy sunday evening experimenting with the Go and Binary buildpacks, a thought came to my head regarding cloud native patterns, and in particular 12 factors' Build, Release, Run.
To me, the Go buildpack is somewhat of an outlier amongst most of the other buildpacks: it performs compilation (build) at push time and violates 12-factor.
Now this doesn't make it wrong; I'm sure many people are using Cloud Foundry for apps that may not be "cloud native" and violate one or two of the 12 factors. But I'm curious how people approach Go-based apps in large-scale production environments: do they allow the Go buildpack or push people to the binary buildpack? What do people see as the main reasons for one over the other?
John.
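The build-once-and-promote flow Amit describes can be sketched as a pair of pipeline commands. This is only an illustration: the app name and paths are hypothetical, and it assumes the go and cf CLIs are available.

```shell
# Cross-compile once for the platform CF cells run on (64-bit Linux).
GOOS=linux GOARCH=amd64 go build -o bin/myapp .

# Push the same bin/ artifact to each environment with the binary buildpack;
# a later pipeline job would push the identical directory to production.
cf push myapp-staging -b binary_buildpack -p bin/ -c './myapp'
```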
Request for Multibuildpack Use Cases
Hi there, The CF Buildpacks team is considering taking on a line of work to provide more formal support for multibuildpacks. Before we start, we would be interested in learning if any community users have compelling use cases they could share with us. For more information on multibuildpacks, see Heroku's documentation [1] [1] - https://devcenter.heroku.com/articles/using-multiple-buildpacks-for-an-app
Re: Go buildpack, cloud native and 12 factor
Buildpacks are still environmentally aware of the target build environment. They mean you don't need to worry about cross-platform support. On Sun, Apr 10, 2016 at 6:57 PM, john mcteague <john.mcteague(a)gmail.com> wrote: On a lazy sunday evening experimenting with the Go and Binary buildpacks, a thought came to my head regarding cloud native patterns, and in particular 12 factors' Build, Release, Run.
To me, the Go buildpack is somewhat of an outlier amongst most of the other buildpacks: it performs compilation (build) at push time and violates 12-factor.
Now this doesn't make it wrong; I'm sure many people are using Cloud Foundry for apps that may not be "cloud native" and violate one or two of the 12 factors. But I'm curious how people approach Go-based apps in large-scale production environments: do they allow the Go buildpack or push people to the binary buildpack? What do people see as the main reasons for one over the other?
John.
Go buildpack, cloud native and 12 factor
john mcteague <john.mcteague@...>
On a lazy sunday evening experimenting with the Go and Binary buildpacks, a thought came to my head regarding cloud native patterns, and in particular 12 factors' Build, Release, Run.
To me, the Go buildpack is somewhat of an outlier amongst most of the other buildpacks: it performs compilation (build) at push time and violates 12-factor.
Now this doesn't make it wrong; I'm sure many people are using Cloud Foundry for apps that may not be "cloud native" and violate one or two of the 12 factors. But I'm curious how people approach Go-based apps in large-scale production environments: do they allow the Go buildpack or push people to the binary buildpack? What do people see as the main reasons for one over the other?
John.
Re: App running even after delete. Pointers on finding it and debugging?
Tom Sherrod <tom.sherrod@...>
Thank you, Eric. The query into the auth username and password prompted me to review the manifest. Those were not correct, and there was a typo in the cc_uploader cc base_url. I've made the corrections. I likely got out of sync between versions of Diego and CF. I use generate_manifest occasionally; I will need to use it again to get the versions back in sync.
Thanks, Tom
On Fri, Apr 8, 2016 at 9:29 PM, Eric Malm <emalm(a)pivotal.io> wrote: Thanks, Tom. The errors about the 401 response code make me suspect that the nsync-bulker doesn't have the correct basic-auth credentials for the internal app-enumeration endpoint it queries on CC. Could you check whether the diego.nsync.cc.basic_auth_username and diego.nsync.cc.basic_auth_password properties in your Diego manifest are the same as the cc.internal_api_user and cc.internal_api_password properties in your CF manifest? There was also a previous pair of CF/Diego release versions where those properties had different defaults for the user names in the job specs, but I believe they match in CF v230 and Diego v0.1450.0.
Best, Eric
On Fri, Apr 8, 2016 at 5:33 PM, Tom Sherrod <tom.sherrod(a)gmail.com> wrote:
Yes, the logs are there. I grepped the logs for error. I see a lot of:
{"timestamp":"1460161846.254659176","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6713"}}
{"timestamp":"1460161876.286883593","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6714"}}
{"timestamp":"1460161906.315121412","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6715"}}
{"timestamp":"1460161936.352133274","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6716"}}
{"timestamp":"1460161966.383990765","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6717"}}
Let me know if there's something specific you wish to find.
Tom
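The key field in those JSON log lines is the error string. A sketch of pulling it out with standard tools, using one of the sample lines above inlined (in practice you would grep the nsync-bulker log file itself):

```shell
# One of the nsync-bulker log lines quoted above, inlined as sample data.
line='{"timestamp":"1460161846.254659176","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6713"}}'
# Extract the error message; a 401 here points at bad basic-auth credentials.
echo "$line" | grep -o 'invalid response code [0-9]*'
```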
On Fri, Apr 8, 2016 at 11:47 AM, Eric Malm <emalm(a)pivotal.io> wrote:
Thanks, Tom, glad you were able to use veritas to find and remove the stray apps. I'd like to know how they remained present in the first place. Do you have logs from the nsync-bulker jobs on the cc_bridge VMs in your deployment? That BOSH job has the responsibility of updating the Diego DesiredLRPs to match the current set of CF apps, so if there are synchronization errors they should be present in those logs.
Thanks, Eric, CF Runtime Diego PM
On Fri, Apr 8, 2016 at 8:09 AM, Kris Hicks <khicks(a)pivotal.io> wrote:
It would be nice to figure out the root cause here.
Does having two running and two crashed apps have some significance as to why the delete failed, even though it appeared successful?
On Friday, April 8, 2016, Tom Sherrod <tom.sherrod(a)gmail.com> wrote:
Thank you.
Veritas is quite informative. I found 2 apps running and 2 crashed. I deleted them and all appears well.
On Mon, Apr 4, 2016 at 7:03 PM, Amit Gupta <agupta(a)pivotal.io> wrote:
Ok, I would use veritas to look at the Diego BBS, and confirm that it still thinks the app is there. You can also go onto the router and query its HTTP endpoint to confirm that the route you're seeing is also still there: https://github.com/cloudfoundry/gorouter#instrumentation. Lastly I would connect to the CCDB and confirm that the app and route are *not* there. This will reduce the problem to figuring out why Diego isn't being updated to know that the non-existing app is no longer desired.
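The gorouter instrumentation README linked above documents a status endpoint that dumps the routing table. A sketch of checking whether a route is still registered; the curl line is shown commented out because the status port, IP, and credentials are assumptions (they come from the router's own config), and the JSON below is a trimmed, hypothetical example of the response shape:

```shell
# On the router VM (credentials/port/IP are assumptions from typical defaults):
#   curl "http://router_user:router_pass@10.244.0.22:8080/routes"
# Sample (made-up) response: a map of route -> registered backends.
response='{"myapp.example.com":[{"address":"10.244.16.4:60012","ttl":120}]}'
if echo "$response" | grep -q '"myapp.example.com"'; then
  echo "route still registered with gorouter"
else
  echo "route gone from gorouter"
fi
```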
On Mon, Apr 4, 2016 at 3:47 PM, Tom Sherrod <tom.sherrod(a)gmail.com> wrote:
The route still exists. I was reluctant to delete it and have the "app" still running. I wanted some way to track it down, not that it has helped, other than let me know it is still running.
Pushed the app, with a different name/host, with no problems and it runs as it should.
On Mon, Apr 4, 2016 at 6:17 PM, Amit Gupta <agupta(a)pivotal.io> wrote:
Tom,
So you're saying that none of the orgs/spaces shows the app or the route, but the app continues to run and be routable?
I could imagine this happening if some CC Bridge components are not able to talk to either CC or the Diego BBS, leaving the data in the Diego BBS stale. With stale info, Diego may not know that the LRP is no longer desired, so it will do the safe thing of keeping it around and emitting its route to the gorouter, which just does what it's told (it doesn't check whether CC knows about the route or not).
Are you able to push new apps or delete other apps with the Diego backend?
Amit
On Fri, Apr 1, 2016 at 1:00 PM, Tom Sherrod <tom.sherrod(a)gmail.com> wrote:
JT,
Thanks for responding.
This is a test runtime and small. I checked all orgs and spaces. No routes matching the app.
Found the route information and the result:
{
"total_results": 0,
"total_pages": 1,
"prev_url": null,
"next_url": null,
"resources": []
}
To learn what the output may look like, I checked existing routes with apps and without. The output appears the same as if the app had been deleted.
Even now, the app url still returns a page from the app, even though it is deleted.
Thanks,
Tom
On Fri, Apr 1, 2016 at 1:52 PM, JT Archie <jarchie(a)pivotal.io> wrote:
Tom,
Are you sure the route isn't bound to another application in another org/space?
When you do `cf routes` it only shows routes for the current space. You can hit specific API endpoints, though, to get all the apps for a route.
For example, `cf curl /v2/routes/89fc2a5e-3a9b-4a88-a360-e405cdbd6f87/apps` will show all the apps for a particular route. Obviously replacing the route ID with the correct ID. To find that, I recommend going through `CF_TRACE=true cf routes` and grabbing the ID.
Let's see if you can hunt it down that way.
Kind Regards,
JT
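JT's two steps (find the route GUID, then query its apps) can be scripted. A sketch in which the API response is a trimmed, hypothetical stand-in for what `cf curl /v2/routes` returns, and the grep/cut extraction is a rough substitute for a proper JSON parser:

```shell
# Hypothetical, trimmed /v2/routes response containing one route resource.
routes='{"resources":[{"metadata":{"guid":"89fc2a5e-3a9b-4a88-a360-e405cdbd6f87"},"entity":{"host":"myapp"}}]}'
# Pull the first route GUID out of the JSON (rough extraction, not a parser).
guid=$(echo "$routes" | grep -o '"guid":"[^"]*"' | head -1 | cut -d'"' -f4)
# Build the follow-up query JT describes.
echo "cf curl /v2/routes/$guid/apps"
```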
On Fri, Apr 1, 2016 at 8:51 AM, Tom Sherrod <tom.sherrod(a)gmail.com> wrote:
cf 230, diego 0.1450.0, etcd 27, garden-linux 0.330.0 Default to diego true.
A developer deployed a Java application, then deleted it with cf delete <app>. No errors. The app still responds; the only thing left is the route. I've not encountered this before. Delete has always meant delete: even if the route remains, "404 Not Found: Requested route ('<hostname.domain>') does not exist." is returned.
Pointers on tracking this down appreciated.
Tom
Re: App running even after delete. Pointers on finding it and debugging?
Thanks, Tom. The errors about the 401 response code make me suspect that the nsync-bulker doesn't have the correct basic-auth credentials for the internal app-enumeration endpoint it queries on CC. Could you check whether the diego.nsync.cc.basic_auth_username and diego.nsync.cc.basic_auth_password properties in your Diego manifest are the same as the cc.internal_api_user and cc.internal_api_password properties in your CF manifest? There was also a previous pair of CF/Diego release versions where those properties had different defaults for the user names in the job specs, but I believe they match in CF v230 and Diego v0.1450.0.
Best, Eric
Re: App running even after delete. Pointers on finding it and debugging?
Tom Sherrod <tom.sherrod@...>
Yes, the logs are there. I grepped the logs for error. I see a lot of:
{"timestamp":"1460161846.254659176","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6713"}}
{"timestamp":"1460161876.286883593","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6714"}}
{"timestamp":"1460161906.315121412","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6715"}}
{"timestamp":"1460161936.352133274","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6716"}}
{"timestamp":"1460161966.383990765","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6717"}}
Let me know if there's something specific you wish to find.
Tom
Re: cf v233 api_z1/api_z2 failing
Hi Kara & Peter,
Thanks a lot for your help. That fixed the issue.
On Thu, Apr 7, 2016 at 9:04 PM, Ranga Rajagopalan <ranga.rajagopalan(a)gmail.com> wrote:
Hi Kara,
Thanks. Let me try with a valid app_domain.
On Thu, Apr 7, 2016 at 9:01 PM, Kara Alexandra <ardnaxelarak(a)gmail.com> wrote:
Hi Ranga,
The only reason we were using bosh-lite.com for our app_domains was because we were testing to reproduce on our local bosh-lite. Using 'cfapp' I managed to reproduce the issue locally. My guess is that this is because 'cfapp' is not a valid domain (it doesn't end with a valid top-level domain), and I'm guessing that fixing that will at least fix part of the problem if not all of it.
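Kara's diagnosis is that 'cfapp' has no dot and no valid top-level domain. A rough sanity check one could run against a candidate apps_domain value; the regex below is a deliberate simplification, not Cloud Controller's actual domain validation:

```shell
# Rough domain-shape check (simplified; real DNS/CC rules are stricter).
check_domain() {
  # require at least one dotted label followed by an alphabetic TLD
  echo "$1" | grep -Eq '^([a-z0-9-]+\.)+[a-z]{2,}$' \
    && echo "ok: $1" \
    || echo "invalid: $1"
}
check_domain cfapp              # the problematic value from the manifest
check_domain cfapp.example.com  # a hypothetical valid replacement
```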
Thanks,
Kara
On Thu, Apr 7, 2016 at 8:27 PM, Ranga Rajagopalan <ranga.rajagopalan(a)gmail.com> wrote:
Here's my deployment manifest. app_domains is set to cfapp. I can't find bosh-lite anywhere in the file at all.
On Thu, Apr 7, 2016 at 4:27 PM, Peter Goetz <peter.gtz(a)gmail.com> wrote:
Hi Ranga,
Looking at your logs we found an error that could possibly cause this and it is related to the properties.apps_domain in the deployment manifest. By setting it to 'b%()osh-lite.com' (using special characters), we could reproduce the following error which we also found in your log file:
Encountered error: name format
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/sequel-4.29.0/lib/sequel/model/base.rb:1543:in `save'
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/app/models/runtime/shared_domain.rb:35:in `block in find_or_create'
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/sequel-4.29.0/lib/sequel/database/transactions.rb:134:in `_transaction'
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/sequel-4.29.0/lib/sequel/database/transactions.rb:108:in `block in transaction'
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/sequel-4.29.0/lib/sequel/database/connecting.rb:249:in `block in synchronize'
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/sequel-4.29.0/lib/sequel/connection_pool/threaded.rb:103:in `hold'
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/sequel-4.29.0/lib/sequel/database/connecting.rb:249:in `synchronize'
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/sequel-4.29.0/lib/sequel/database/transactions.rb:97:in `transaction'
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/app/models/runtime/shared_domain.rb:27:in `find_or_create'
/var/vcap/data/packages/cloud_controller_ng/da452b34d79be56a0784c7e88d6b9c0e1811a9d8.1-f934c136e6019cf54ab7aa04a3c153657226e729/cloud_controller_ng/lib/cloud_controller/seeds.rb:57:in `block in create_seed_domains'
/var/vcap/data/packages/cloud_controller_ng/da452b34d79be56a0784c7e88d6b9c0e1811a9d8.1-f934c136e6019cf54ab7aa04a3c153657226e729/cloud_controller_ng/lib/cloud_controller/seeds.rb:56:in `each'
/var/vcap/data/packages/cloud_controller_ng/da452b34d79be56a0784c7e88d6b9c0e1811a9d8.1-f934c136e6019cf54ab7aa04a3c153657226e729/cloud_controller_ng/lib/cloud_controller/seeds.rb:56:in `create_seed_domains'
/var/vcap/data/packages/cloud_controller_ng/da452b34d79be56a0784c7e88d6b9c0e1811a9d8.1-f934c136e6019cf54ab7aa04a3c153657226e729/cloud_controller_ng/lib/cloud_controller/seeds.rb:9:in `write_seed_data'
/var/vcap/data/packages/cloud_controller_ng/da452b34d79be56a0784c7e88d6b9c0e1811a9d8.1-f934c136e6019cf54ab7aa04a3c153657226e729/cloud_controller_ng/lib/cloud_controller/runner.rb:93:in `block in run!'
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/eventmachine-1.0.9.1/lib/eventmachine.rb:193:in `call'
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/eventmachine-1.0.9.1/lib/eventmachine.rb:193:in `run_machine'
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/eventmachine-1.0.9.1/lib/eventmachine.rb:193:in `run'
/var/vcap/data/packages/cloud_controller_ng/da452b34d79be56a0784c7e88d6b9c0e1811a9d8.1-f934c136e6019cf54ab7aa04a3c153657226e729/cloud_controller_ng/lib/cloud_controller/runner.rb:87:in `run!'
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller:8:in `<main>'
Can you check the apps_domain property and see if there is anything suspicious with it?
Thanks, Peter & Kara
On Thu, Apr 7, 2016 at 2:32 PM Ranga Rajagopalan <ranga.rajagopalan(a)gmail.com> wrote:
Hi Peter,
Attaching /var/vcap/sys/log/cloud_controller_worker/cloud_controller_worker.log.gz and /var/vcap/sys/log/cloud_controller_worker_ctl.log.gz. There isn't a /var/vcap/sys/log/cloud_controller_worker/ directory on this node.
vcap(a)572afc33-0735-4727-8ff3-9dc6d7fa8af0:~$ ls /var/vcap/sys/log/
agent_ctl.err.log
agent_ctl.log
cloud_controller_migration/
cloud_controller_migration_ctl.err.log
cloud_controller_migration_ctl.log
cloud_controller_ng/
cloud_controller_ng_ctl.err.log
cloud_controller_ng_ctl.log
cloud_controller_worker_ctl.err.log
cloud_controller_worker_ctl.log
consul_agent/
metron_agent/
metron_agent_ctl.err.log
metron_agent_ctl.log
monit/
nfs_mounter/
nfs_mounter_ctl.err.log
nfs_mounter_ctl.log
nginx_cc/
nginx_ctl.err.log
nginx_ctl.log
route_registrar/
route_registrar_ctl.err.log
route_registrar_ctl.log
statsd-injector/
statsd-injector-ctl.err.log
statsd-injector-ctl.log
On Thu, Apr 7, 2016 at 12:21 PM, Peter Goetz <peter.gtz(a)gmail.com> wrote:
Hi Ranga,
To trouble-shoot this issue could you also provide the contents of /var/vcap/sys/log/cloud_controller_ng/cloud_controller_ng.log and /var/vcap/sys/log/cloud_controller_worker/cloud_controller_worker.log? This should give us more details about what's going on. The ctl script logs do not provide enough details.
Thanks, Peter
On Wed, Apr 6, 2016 at 6:12 PM Ranga Rajagopalan <ranga.rajagopalan(a)gmail.com> wrote:
I tried v231. Unfortunately, same issue.
-- Thanks,
Ranga