Re: Remarks about the “confab” wrapper for consul

Amit Kumar Gupta
 

Orchestrating a raft cluster in a way that requires no manual intervention
is incredibly difficult. We write the PID file late for a specific reason:

https://www.pivotaltracker.com/story/show/112018069

For dealing with wedged states like the one you encountered, we have some
recommendations in the documentation:

https://github.com/cloudfoundry-incubator/consul-release/#disaster-recovery

We have acceptance tests we run in CI that exercise rolling a 3 node
cluster, so if you hit a failure it would be useful to get logs if you have
any.

Cheers,
Amit

On Mon, Apr 11, 2016 at 9:38 AM, Benjamin Gandon <benjamin(a)gandon.org>
wrote:

Actually, doing some further tests, I realize a mere 'join' is definitely
not enough.

Instead, you need to restore the raft/peers.json on each one of the 3
consul server nodes:

monit stop consul_agent
echo '["10.244.0.58:8300","10.244.2.54:8300","10.244.0.54:8300"]' >
/var/vcap/store/consul_agent/raft/peers.json


And make sure you start them at roughly the same time with “monit start
consul_agent”.

So this argues strongly for setting *skip_leave_on_interrupt=true*
and *leave_on_terminate=false* in confab, because losing the peers.json
is really something we don't want in our CF deployments!

/Benjamin


On Apr 11, 2016, at 18:15, Benjamin Gandon <benjamin(a)gandon.org> wrote:

Hi cf devs,


I’m running a CF deployment with redundancy, and I just experienced my
consul servers not being able to elect any leader.
That’s a VERY frustrating situation that keeps the whole CF deployment
down, until you get a deeper understanding of consul, and figure out they
just need a silly manual 'join' so that they get back together.

But that was definitely not easy to nail down because at first look, I
could just see monit restarting the “agent_ctl” every 60 seconds because
confab was not writing the damn PID file.


More specifically, the 3 consul servers (i.e. consul_z1/0, consul_z1/1 and
consul_z2/0) had properly left one another upon a graceful shutdown. This
state was persisted in /var/vcap/store/raft/peers.json being “null” on each
one of them, so they would not get back together on restart. A manual
'join' was necessary. But it took me hours to get there because I’m no
expert with consul.

And until the 'join' is made, VerifySynced() was negative in confab, and
monit was constantly starting and stopping it every 60 seconds. But once
you step back, you realize confab was actually waiting for the new leader
to be elected before it writes the PID file. Which is questionable.

So, I’m asking 3 questions here:

1. Does writing the PID file in confab *that* late really make sense?
2. Could someone please write some minimal documentation about confab, at
least to tell what it is supposed to do?
3. Wouldn’t it be wiser that whenever any of the consul servers is not
here, then the cluster gets unhealthy?

With this 3rd question, I mean that even on a graceful TERM or INT, no
consul server should perform a graceful 'leave'. With this different
approach, they would properly come back up even when performing a
complete graceful restart of the cluster.

This can be done with those extra configs from the “confab” wrapper:

{
"skip_leave_on_interrupt": true,
"leave_on_terminate": false
}

What do you guys think of it?


/Benjamin



Re: cf stop not sending SIGTERM

Will Tran
 

Thanks! This was happening in JBP v3.3.1. I just tested with JBP v3.6 and it's working.


Re: Doppler/Firehose - Multiline Log Entry

Mike Youngstrom <youngm@...>
 

Finally got around to testing this. Preliminary testing shows that "\u2028"
doesn't function as a new line character in bash and causes eclipse console
to wig out. I don't think "\u2028" is a viable long term solution. Hope
you make progress on a metric format available to an app in a container. I
too would like a tracker link to such a feature if there is one.

Thanks,
Mike

On Mon, Mar 14, 2016 at 2:28 PM, Mike Youngstrom <youngm(a)gmail.com> wrote:

Hi Jim,

So, to be clear, what we're basically doing is using a unicode newline
character to fool loggregator (which is looking for \n) into thinking that
it isn't a new log event, right? Does \u2028 work as a new line character
when tailing logs in the CLI? Anyone tried this unicode new line character
in various consoles? IDE, xterm, etc? I'm wondering if developers will
need to have different config for development.
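The trick being discussed can be sketched as a simple round trip. This is only an illustration in Python, not Logback or Logstash configuration, and the function names are made up for the example: the app side swaps each "\n" in a multi-line message for U+2028 so that a line-oriented transport sees one event, and the consumer side swaps it back before display.

```python
LINE_SEPARATOR = "\u2028"  # Unicode LINE SEPARATOR, used as a stand-in for "\n"


def encode_multiline(message: str) -> str:
    """Collapse a multi-line message into one physical line before logging."""
    return message.replace("\n", LINE_SEPARATOR)


def decode_multiline(event: str) -> str:
    """Restore real newlines on the consumer side."""
    return event.replace(LINE_SEPARATOR, "\n")


if __name__ == "__main__":
    trace = "java.lang.IllegalStateException: boom\n\tat com.example.Foo.bar(Foo.java:42)"
    event = encode_multiline(trace)
    print("\n" not in event)                 # the event is a single physical line
    print(decode_multiline(event) == trace)  # the original text is recovered
```

Anything that reads the stream between those two steps (a terminal, an IDE console) sees the raw U+2028 character, which is why console behaviour matters here.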

Mike

On Mon, Mar 14, 2016 at 12:17 PM, Jim CF Campbell <jcampbell(a)pivotal.io>
wrote:

Hi Mike and Alex,

Two things - for Java, we are working toward defining an enhanced metric
format that will support transport of Multi Lines.

The second is this workaround that David Laing suggested for Logstash.
Think you could use it for Splunk?

With the Java Logback library you can do this by adding
"%replace(%xException){'\n','\u2028'}%nopex" to your logging config[1] ,
and then use the following logstash conf.[2]
Replace the unicode newline character \u2028 with \n, which Kibana will
display as a new line.

mutate {
gsub => [ "[@message]", '\u2028', "
"]
# ^^^ Seems that passing a string with an actual newline in it is the only
# way to make gsub work
}

to replace the token with a regular newline again so it displays
"properly" in Kibana.

[1]
https://github.com/dpinto-pivotal/cf-SpringBootTrader-config/blob/master/application.yml#L12

[2]
https://github.com/logsearch/logsearch-for-cloudfoundry/blob/master/src/logsearch-config/src/logstash-filters/snippets/firehose.conf#L60-L64


On Mon, Mar 14, 2016 at 11:11 AM, Mike Youngstrom <youngm(a)gmail.com>
wrote:

I'll let the Loggregator team respond formally. But, from my
conversations with the Loggregator team, I think we're basically stuck, not
sure what the right thing to do is on the client side. How does the client
trigger in loggregator that this is a multi line log message or what is the
right way for loggregator to detect that the client is trying to send a
multi line log message? Any ideas?

Mike

On Mon, Mar 14, 2016 at 10:25 AM, Aliaksandr Prysmakou <
prysmakou(a)gmail.com> wrote:

Hi guys,
Are there any updates on the "Multiline Log Entry" issue? What is the
correct way to deal with stack traces?
Are there tracker links to read?
----
Alex Prysmakou / Altoros
Tel: (617) 841-2121 ext. 5161 | Toll free: 855-ALTOROS
Skype: aliaksandr.prysmakou
www.altoros.com | blog.altoros.com | twitter.com/altoros


--
Jim Campbell | Product Manager | Cloud Foundry | Pivotal.io |
303.618.0963


[PROPOSAL]: Removing ability to specify npm version

John Shahid
 

Hi all,

The buildpacks team would like to propose a change to the nodejs buildpack.
It was recently brought to our attention in this issue
<https://github.com/cloudfoundry/nodejs-buildpack/issues/54>, that the
nodejs buildpack will try to download npm if the version specified in
package.json didn’t match the version shipped with nodejs. According to
heroku
<https://devcenter.heroku.com/articles/nodejs-support#specifying-an-npm-version>
this is a feature that exists for historical reasons that do not apply
anymore.

We would like to ask if anyone relies on this feature or has an objection
to this change.
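For anyone unsure whether an app depends on this, here is a rough check. It is a sketch that assumes the npm version is requested through the "engines" section of package.json, as described in the Heroku article linked above; the function name and sample values are made up for the example.

```python
import json


def pinned_npm_version(package_json_text: str):
    """Return the npm version requested under "engines", or None if absent."""
    manifest = json.loads(package_json_text)
    engines = manifest.get("engines") or {}
    return engines.get("npm")


if __name__ == "__main__":
    sample = '{"name": "demo", "engines": {"node": "4.4.x", "npm": "2.15.x"}}'
    print(pinned_npm_version(sample))  # 2.15.x
```

A result of None means the app does not ask for a specific npm version in this way.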

Thanks,

The Buildpacks Team


Re: Staging and Runtime Hooks Feature Narrative

Mike Youngstrom <youngm@...>
 

An interesting proposal. Any thoughts about this proposal in relation to
multi-buildpacks [0]? How many of the use cases for this feature go away
in lieu of multi-buildpack support? I think it would be interesting to be
able to apply hooks without checking scripts into the application (like
multi-buildpack).

This feature also appears to be somewhat related to [1]. I hope that
someone is overseeing all these newly proposed buildpack features to help
ensure they are coherent.

Mike


[0]
https://lists.cloudfoundry.org/archives/list/cf-dev(a)lists.cloudfoundry.org/message/H64GGU6Z75CZDXNWC7CKUX64JNPARU6Y/
[1]
https://lists.cloudfoundry.org/archives/list/cf-dev(a)lists.cloudfoundry.org/thread/GRKFQ2UOQL7APRN6OTGET5HTOJZ7DHRQ/#SEA2RWDCAURSVPIMBXXJMWN7JYFQICL3

On Fri, Apr 8, 2016 at 4:16 PM, Troy Topnik <troy.topnik(a)hpe.com> wrote:

This feature allows developers more control of the staging and deployment
of their application code, without them having to fork existing buildpacks
or create their own.


https://docs.google.com/document/d/1PnTtTLwXOTG7f70ilWGlbTbi1LAXZu9zYnrUVvjr31I/edit

Hooks give developers the ability to optionally:
* run scripts in the staging container before and/or after the
bin/compile scripts executed by the buildpack, and
* run scripts in each app container before the app starts (via .profile
as per the Heroku buildpack API)

A similar feature has been available and used extensively in Stackato for
a few years, and we'd like to contribute this functionality back to Cloud
Foundry.

A proof-of-concept of this feature has already been submitted as a pull
request, and the Feature Narrative addresses many of the questions raised
in the PR discussion:

https://github.com/cloudfoundry-incubator/buildpack_app_lifecycle/pull/13

Please weigh in with comments in the document itself or in this thread.

Thanks,

TT


Re: Request for Multibuildpack Use Cases

Mike Youngstrom <youngm@...>
 

This seems to be yet another way to extend buildpacks without forking to
go along with [0] and [1]. My only hope is that all these newly proposed
extension mechanisms come together in a simple, coherent, and extensible
way.

Mike

[0]
https://github.com/cloudfoundry-incubator/buildpack_app_lifecycle/pull/13
[1]
https://docs.google.com/document/d/145aOpNoq7BpuB3VOzUIDh-HBx0l3v4NHLYfW8xt2zK0/edit#

On Sun, Apr 10, 2016 at 6:15 PM, Danny Rosen <drosen(a)pivotal.io> wrote:

Hi there,

The CF Buildpacks team is considering taking on a line of work to provide
more formal support for multibuildpacks. Before we start, we would be
interested in learning if any community users have compelling use cases
they could share with us.

For more information on multibuildpacks, see Heroku's documentation [1]

[1] -
https://devcenter.heroku.com/articles/using-multiple-buildpacks-for-an-app


Re: Remarks about the “confab” wrapper for consul

Benjamin Gandon
 

Actually, doing some further tests, I realize a mere 'join' is definitely not enough.

Instead, you need to restore the raft/peers.json on each one of the 3 consul server nodes:

monit stop consul_agent
echo '["10.244.0.58:8300","10.244.2.54:8300","10.244.0.54:8300"]' > /var/vcap/store/consul_agent/raft/peers.json

And make sure you start them at roughly the same time with “monit start consul_agent”.
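To avoid hand-typing the array on each node, the file content can be generated from the list of server IPs. This is a minimal sketch: the helper name is made up, and the IPs and port 8300 are simply the ones from the example above, so adjust them to your own deployment.

```python
import json

RAFT_PORT = 8300  # port used in the peers.json example above


def peers_json(server_ips, port=RAFT_PORT):
    """Render the JSON array that goes into raft/peers.json."""
    peers = [f"{ip}:{port}" for ip in server_ips]
    return json.dumps(peers, separators=(",", ":"))


if __name__ == "__main__":
    ips = ["10.244.0.58", "10.244.2.54", "10.244.0.54"]
    print(peers_json(ips))
```

The printed string is what gets redirected into peers.json on each stopped server node.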

So this argues strongly for setting skip_leave_on_interrupt=true and leave_on_terminate=false in confab, because losing the peers.json is really something we don't want in our CF deployments!

/Benjamin

On Apr 11, 2016, at 18:15, Benjamin Gandon <benjamin(a)gandon.org> wrote:

Hi cf devs,


I’m running a CF deployment with redundancy, and I just experienced my consul servers not being able to elect any leader.
That’s a VERY frustrating situation that keeps the whole CF deployment down, until you get a deeper understanding of consul, and figure out they just need a silly manual 'join' so that they get back together.

But that was definitely not easy to nail down because at first look, I could just see monit restarting the “agent_ctl” every 60 seconds because confab was not writing the damn PID file.


More specifically, the 3 consul servers (i.e. consul_z1/0, consul_z1/1 and consul_z2/0) had properly left one another upon a graceful shutdown. This state was persisted in /var/vcap/store/raft/peers.json being “null” on each one of them, so they would not get back together on restart. A manual 'join' was necessary. But it took me hours to get there because I’m no expert with consul.

And until the 'join' is made, VerifySynced() was negative in confab, and monit was constantly starting and stopping it every 60 seconds. But once you step back, you realize confab was actually waiting for the new leader to be elected before it writes the PID file. Which is questionable.

So, I’m asking 3 questions here:

1. Does writing the PID file in confab that late really make sense?
2. Could someone please write some minimal documentation about confab, at least to tell what it is supposed to do?
3. Wouldn’t it be wiser that whenever any of the consul servers is not here, then the cluster gets unhealthy?

With this 3rd question, I mean that even on a graceful TERM or INT, no consul server should perform a graceful 'leave'. With this different approach, they would properly come back up even when performing a complete graceful restart of the cluster.

This can be done with those extra configs from the “confab” wrapper:

{
"skip_leave_on_interrupt": true,
"leave_on_terminate": false
}

What do you guys think of it?


/Benjamin


Re: AUFS bug in Linux kernel

Benjamin Gandon
 

Very neat!
Thanks a lot Eric.

On Apr 11, 2016, at 17:46, Eric Malm <emalm(a)pivotal.io> wrote:

Hi, Benjamin,

Yes, the BOSH-Lite boxes with kernel 3.19.0-40 through 3.19.0-50 are all susceptible to the AUFS bug. Kernel versions 3.19.0-51 and later will be fine, and I believe the earliest BOSH-Lite Vagrant box with one of those kernel versions is 9000.102.0. The 3.19.0-49 kernel that went into 3192 was a one-off build that Canonical supplied in advance of the release of the official kernel package with the fix (https://launchpad.net/ubuntu/+source/linux-lts-vivid/3.19.0-51.57~14.04.1), and the 'official' package with kernel 3.19.0-49 still has the AUFS bug.

Thanks,
Eric

On Mon, Apr 11, 2016 at 8:36 AM, Benjamin Gandon <benjamin(a)gandon.org> wrote:
Hi,

Sorry for the late follow-up, but would this hit bosh-lite too?
Because after it has run for a while, I’m experiencing similar severe issues with the 53 garden containers I use in Bosh-Lite.

Config :
- Bosh-lite v9000.91.0 (i.e. bosh v250 + warden-cpi v29 + garden-linux v0.331.0) and the kernel is 3.19.0-47.53~14.04.1 (I might have upgraded it)
- Deployment: cf v231 + Diego v0.1434.0 + Garden-linux v0.333.0 + Etcd v36 + cf-mysql v26 + other

Will the linux-image-3.19.0-49-generic fix the issue, as it was done in this 2016-02-08 commit <https://github.com/cloudfoundry/bosh/commit/750c5e7ed70b1d7753500ca725590c1c0eac1262> for stemcell 3192 ?

As a safety measure, I decided to upgrade to kernel 3.19.0-58-generic and I would be happy to get a confirmation that (1) my bosh-lite deployment was hit by the AUFS bug, and that (2) the new kernel I installed will get me off this operational nightmare.

Thanks!


On Jan 28, 2016, at 02:06, Eric Malm <emalm(a)pivotal.io> wrote:

Hi, Mike,

Warden also uses aufs for its containers' overlay filesystems, so we expect the same issue to affect the DEAs on these stemcell versions. I'm not aware of a deliberate attempt to reproduce it on the DEAs, though.

Thanks,
Eric

On Wed, Jan 27, 2016 at 4:08 PM, Mike Youngstrom <youngm(a)gmail.com> wrote:
Thanks Will. Does anyone know if this bug could also impact Warden?

Mike

On Wed, Jan 27, 2016 at 9:50 AM, Will Pragnell <wpragnell(a)pivotal.io> wrote:
A bug with AUFS [1] was introduced in version 3.19.0-40 of the linux kernel. This bug can cause containers to end up with unkillable zombie processes with high CPU usage. This can happen any time a container is supposed to be destroyed.

This affects both Garden-Linux and Warden (and Docker). If you see significant slowdown or increased CPU usage on DEAs or Diego cells, it might well be this. It will probably build slowly up over time, so you may not notice anything for a while depending on the rate of app instance churn on your deployment.

The bad version of the kernel is present in stemcell 3160 and later. I can't recommend using older stemcells because the bad kernel versions also include fixes for several high severity security vulnerabilities (at least [2-5], there may be others I've missed). Were it not for these, rolling back to stemcell 3157 would be the fix.

We're waiting for a fix to make its way into the kernel, and the BOSH team will produce a stemcell with the fix as soon as possible. In the meantime, I'd suggest simply keeping a closer eye than usual on your DEAs and Diego cells.

If this issue occurs, the only solution is to recreate that machine. While we've not had any actual reports of this issue occurring for Cloud Foundry deployments in the wild yet, we're confident that the issue will be occurring. The Diego team have seen it in testing, and several teams have encountered the issue with their Concourse workers, which also use Garden-Linux.

As always, please get in touch if you have any questions.

Will - Garden PM

[1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043
[2]: http://www.ubuntu.com/usn/usn-2857-1/
[3]: http://www.ubuntu.com/usn/usn-2868-1/
[4]: http://www.ubuntu.com/usn/usn-2869-1/
[5]: http://www.ubuntu.com/usn/usn-2871-2/


Remarks about the “confab” wrapper for consul

Benjamin Gandon
 

Hi cf devs,


I’m running a CF deployment with redundancy, and I just experienced my consul servers not being able to elect any leader.
That’s a VERY frustrating situation that keeps the whole CF deployment down, until you get a deeper understanding of consul, and figure out they just need a silly manual 'join' so that they get back together.

But that was definitely not easy to nail down because at first look, I could just see monit restarting the “agent_ctl” every 60 seconds because confab was not writing the damn PID file.


More specifically, the 3 consul servers (i.e. consul_z1/0, consul_z1/1 and consul_z2/0) had properly left one another upon a graceful shutdown. This state was persisted in /var/vcap/store/raft/peers.json being “null” on each one of them, so they would not get back together on restart. A manual 'join' was necessary. But it took me hours to get there because I’m no expert with consul.

And until the 'join' is made, VerifySynced() was negative in confab, and monit was constantly starting and stopping it every 60 seconds. But once you step back, you realize confab was actually waiting for the new leader to be elected before it writes the PID file. Which is questionable.

So, I’m asking 3 questions here:

1. Does writing the PID file in confab that late really make sense?
2. Could someone please write some minimal documentation about confab, at least to tell what it is supposed to do?
3. Wouldn’t it be wiser that whenever any of the consul servers is not here, then the cluster gets unhealthy?

With this 3rd question, I mean that even on a graceful TERM or INT, no consul server should perform a graceful 'leave'. With this different approach, they would properly come back up even when performing a complete graceful restart of the cluster.

This can be done with those extra configs from the “confab” wrapper:

{
"skip_leave_on_interrupt": true,
"leave_on_terminate": false
}
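As a sketch of what applying these two settings amounts to, the snippet below works on a plain dict standing in for a consul agent config. How confab actually merges extra configuration is not shown in this thread, so the helper names and the merge step are assumptions made for illustration only.

```python
import json

# The two settings proposed above, with the suggested values.
NO_LEAVE_SETTINGS = {"skip_leave_on_interrupt": True, "leave_on_terminate": False}


def with_no_leave(config: dict) -> dict:
    """Return a copy of an agent config with the two settings applied."""
    merged = dict(config)
    merged.update(NO_LEAVE_SETTINGS)
    return merged


def keys_not_as_proposed(config: dict) -> list:
    """List the settings that are missing or differ from the proposed values."""
    return [key for key, value in NO_LEAVE_SETTINGS.items() if config.get(key) != value]


if __name__ == "__main__":
    current = {"server": True}  # hypothetical existing config
    print(keys_not_as_proposed(current))
    print(json.dumps(with_no_leave(current), indent=2))
```

Running the check against the rendered config on each server node is a quick way to confirm whether a deployment already carries both settings.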

What do you guys think of it?


/Benjamin


Re: AUFS bug in Linux kernel

Eric Malm <emalm@...>
 

Hi, Benjamin,

Yes, the BOSH-Lite boxes with kernel 3.19.0-40 through 3.19.0-50 are all
susceptible to the AUFS bug. Kernel versions 3.19.0-51 and later will be
fine, and I believe the earliest BOSH-Lite Vagrant box with one of those
kernel versions is 9000.102.0. The 3.19.0-49 kernel that went into 3192 was
a one-off build that Canonical supplied in advance of the release of the
official kernel package with the fix (
https://launchpad.net/ubuntu/+source/linux-lts-vivid/3.19.0-51.57~14.04.1),
and the 'official' package with kernel 3.19.0-49 still has the AUFS bug.
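The version range above can be turned into a quick check against the output of `uname -r`. This is a sketch only: the function name is made up, it only understands the 3.19.0 series discussed here, and it cannot tell the one-off fixed 3.19.0-49 build in stemcell 3192 apart from the official 3.19.0-49 package, since both carry the same number.

```python
import re

FIRST_BAD_ABI = 40   # the bug was introduced in 3.19.0-40
FIRST_GOOD_ABI = 51  # 3.19.0-51 and later carry the fix


def is_susceptible(kernel_release: str) -> bool:
    """Check a 3.19.0-NN kernel release string against the affected range."""
    match = re.match(r"3\.19\.0-(\d+)", kernel_release)
    if match is None:
        raise ValueError(f"not a 3.19.0 series kernel: {kernel_release!r}")
    abi = int(match.group(1))
    # Note: the one-off 3.19.0-49 build shipped in stemcell 3192 is reported
    # as affected here even though it contained the fix.
    return FIRST_BAD_ABI <= abi < FIRST_GOOD_ABI


if __name__ == "__main__":
    print(is_susceptible("3.19.0-47.53~14.04.1"))  # True
    print(is_susceptible("3.19.0-58-generic"))     # False
```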

Thanks,
Eric

On Mon, Apr 11, 2016 at 8:36 AM, Benjamin Gandon <benjamin(a)gandon.org>
wrote:

Hi,

Sorry for the late follow-up, but would this hit bosh-lite too?
Because after it has run for a while, I’m experiencing similar severe
issues with the 53 garden containers I use in Bosh-Lite.

Config :
- Bosh-lite v9000.91.0 (i.e. bosh v250 + warden-cpi v29 + garden-linux
v0.331.0) and the kernel is 3.19.0-47.53~14.04.1 (I *might* have upgraded
it)
- Deployment: cf v231 + Diego v0.1434.0 + Garden-linux v0.333.0 + Etcd
v36 + cf-mysql v26 + other

Will the linux-image-3.19.0-49-generic fix the issue, as it was done in
this 2016-02-08 commit
<https://github.com/cloudfoundry/bosh/commit/750c5e7ed70b1d7753500ca725590c1c0eac1262> for
stemcell 3192 ?

As a safety measure, I decided to upgrade to kernel 3.19.0-58-generic and
I would be happy to get a confirmation that (1) my bosh-lite deployment was
hit by the AUFS bug, and that (2) the new kernel I installed will get me
off this operational nightmare.

Thanks!


On Jan 28, 2016, at 02:06, Eric Malm <emalm(a)pivotal.io> wrote:

Hi, Mike,

Warden also uses aufs for its containers' overlay filesystems, so we
expect the same issue to affect the DEAs on these stemcell versions. I'm
not aware of a deliberate attempt to reproduce it on the DEAs, though.

Thanks,
Eric

On Wed, Jan 27, 2016 at 4:08 PM, Mike Youngstrom <youngm(a)gmail.com> wrote:

Thanks Will. Does anyone know if this bug could also impact Warden?

Mike

On Wed, Jan 27, 2016 at 9:50 AM, Will Pragnell <wpragnell(a)pivotal.io>
wrote:

A bug with AUFS [1] was introduced in version 3.19.0-40 of the linux
kernel. This bug can cause containers to end up with unkillable zombie
processes with high CPU usage. This can happen any time a container is
supposed to be destroyed.

This affects both Garden-Linux and Warden (and Docker). If you see
significant slowdown or increased CPU usage on DEAs or Diego cells, it
might well be this. It will probably build slowly up over time, so you may
not notice anything for a while depending on the rate of app instance churn
on your deployment.

The bad version of the kernel is present in stemcell 3160 and later. I
can't recommend using older stemcells because the bad kernel versions also
include fixes for several high severity security vulnerabilities (at least
[2-5], there may be others I've missed). Were it not for these, rolling
back to stemcell 3157 would be the fix.

We're waiting for a fix to make its way into the kernel, and the BOSH
team will produce a stemcell with the fix as soon as possible. In the
meantime, I'd suggest simply keeping a closer eye than usual on your DEAs
and Diego cells.

If this issue occurs, the only solution is to recreate that machine.
While we've not had any actual reports of this issue occurring for Cloud
Foundry deployments in the wild yet, we're confident that the issue will be
occurring. The Diego team have seen it in testing, and several teams have
encountered the issue with their Concourse workers, which also use
Garden-Linux.

As always, please get in touch if you have any questions.

Will - Garden PM

[1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043
[2]: http://www.ubuntu.com/usn/usn-2857-1/
[3]: http://www.ubuntu.com/usn/usn-2868-1/
[4]: http://www.ubuntu.com/usn/usn-2869-1/
[5]: http://www.ubuntu.com/usn/usn-2871-2/


Re: AUFS bug in Linux kernel

Benjamin Gandon
 

Hi,

Sorry for the late follow-up, but would this hit bosh-lite too?
Because after it has run for a while, I’m experiencing similar severe issues with the 53 garden containers I use in Bosh-Lite.

Config :
- Bosh-lite v9000.91.0 (i.e. bosh v250 + warden-cpi v29 + garden-linux v0.331.0) and the kernel is 3.19.0-47.53~14.04.1 (I might have upgraded it)
- Deployment: cf v231 + Diego v0.1434.0 + Garden-linux v0.333.0 + Etcd v36 + cf-mysql v26 + other

Will the linux-image-3.19.0-49-generic fix the issue, as it was done in this 2016-02-08 commit <https://github.com/cloudfoundry/bosh/commit/750c5e7ed70b1d7753500ca725590c1c0eac1262> for stemcell 3192 ?

As a safety measure, I decided to upgrade to kernel 3.19.0-58-generic and I would be happy to get a confirmation that (1) my bosh-lite deployment was hit by the AUFS bug, and that (2) the new kernel I installed will get me off this operational nightmare.

Thanks!

On Jan 28, 2016, at 02:06, Eric Malm <emalm(a)pivotal.io> wrote:

Hi, Mike,

Warden also uses aufs for its containers' overlay filesystems, so we expect the same issue to affect the DEAs on these stemcell versions. I'm not aware of a deliberate attempt to reproduce it on the DEAs, though.

Thanks,
Eric

On Wed, Jan 27, 2016 at 4:08 PM, Mike Youngstrom <youngm(a)gmail.com> wrote:
Thanks Will. Does anyone know if this bug could also impact Warden?

Mike

On Wed, Jan 27, 2016 at 9:50 AM, Will Pragnell <wpragnell(a)pivotal.io> wrote:
A bug with AUFS [1] was introduced in version 3.19.0-40 of the linux kernel. This bug can cause containers to end up with unkillable zombie processes with high CPU usage. This can happen any time a container is supposed to be destroyed.

This affects both Garden-Linux and Warden (and Docker). If you see significant slowdown or increased CPU usage on DEAs or Diego cells, it might well be this. It will probably build slowly up over time, so you may not notice anything for a while depending on the rate of app instance churn on your deployment.

The bad version of the kernel is present in stemcell 3160 and later. I can't recommend using older stemcells because the bad kernel versions also include fixes for several high severity security vulnerabilities (at least [2-5], there may be others I've missed). Were it not for these, rolling back to stemcell 3157 would be the fix.

We're waiting for a fix to make its way into the kernel, and the BOSH team will produce a stemcell with the fix as soon as possible. In the meantime, I'd suggest simply keeping a closer eye than usual on your DEAs and Diego cells.

If this issue occurs, the only solution is to recreate that machine. While we've not had any actual reports of this issue occurring for Cloud Foundry deployments in the wild yet, we're confident that the issue will be occurring. The Diego team have seen it in testing, and several teams have encountered the issue with their Concourse workers, which also use Garden-Linux.

As always, please get in touch if you have any questions.

Will - Garden PM

[1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043
[2]: http://www.ubuntu.com/usn/usn-2857-1/
[3]: http://www.ubuntu.com/usn/usn-2868-1/
[4]: http://www.ubuntu.com/usn/usn-2869-1/
[5]: http://www.ubuntu.com/usn/usn-2871-2/


CPU weight of application

Sam Dai
 

Hello,
The code at
https://github.com/cloudfoundry-incubator/nsync/blob/01a624d23cb683f35c88c4160205c4ad880faaf0/recipebuilder/recipe_builder.go#L72-L84
shows that Diego apps scale the number of allocated CPU shares linearly with
the amount of memory when allocated memory is > 256MB and < 8192MB. Is there
a way to allocate extra CPU to an app that happens to need less memory?
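To make the scaling concrete, here is a rough sketch of the behaviour described above. It is not a copy of the linked nsync code: the 0 to 100 weight scale, the clamping outside the 256MB to 8192MB range, and the integer rounding are assumptions made for illustration, and the names are made up.

```python
MIN_CPU_PROXY_MB = 256    # below this, memory no longer lowers the weight (assumed)
MAX_CPU_PROXY_MB = 8192   # above this, memory no longer raises the weight (assumed)


def cpu_weight(memory_mb: int) -> int:
    """Map an app's memory limit to a relative CPU weight on a 0-100 scale."""
    clamped = max(MIN_CPU_PROXY_MB, min(memory_mb, MAX_CPU_PROXY_MB))
    return 100 * clamped // MAX_CPU_PROXY_MB


if __name__ == "__main__":
    for memory_mb in (128, 256, 1024, 4096, 8192, 16384):
        print(memory_mb, cpu_weight(memory_mb))
```

Under these assumptions a 1024MB app gets a weight of 12 while a 4096MB app gets 50, so the only lever for more CPU is asking for more memory, which is the limitation the question is about.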

Thanks,
Sam


Re: Go buildpack, cloud native and 12 factor

Amit Kumar Gupta
 

All buildpacks except the binary buildpack perform build at push time.
Both "buildpacks" and "12-factor" came out of Heroku.

That said, whether to use the Go buildpack vs binary buildpack is an
interesting question. One thing that's highly desirable is to build one
thing, which you can then promote from test, to staging, to production. To
that end, a CI pipeline that builds a binary, and promotes it between jobs
that deploy to various environments would be the best way to achieve this.
On the other hand, as Rash points out, this requires a leaked abstraction
of your build pipeline having to know the target platform to compile for.
In theory, test, staging, and prod might be using different stacks, though
you're probably always safe assuming 64-bit linux, so I'd say the risk of
having to cross-compile is fairly low.

That said, for small projects, I definitely just use the Go buildpack for
its convenience.

Amit

On Sun, Apr 10, 2016 at 4:05 PM, Rasheed Abdul-Aziz <rabdulaziz(a)pivotal.io>
wrote:

Buildpacks are still environmentally aware of the target build
environment. They mean you don't need to worry about cross platform
support.

On Sun, Apr 10, 2016 at 6:57 PM, john mcteague <john.mcteague(a)gmail.com>
wrote:

On a lazy sunday evening experimenting with the Go and Binary buildpacks,
a thought came to my head regarding cloud native patterns, and in
particular 12 factors' Build, Release, Run.

To me, the Go buildpack is somewhat of an outlier amongst most of the
other buildpacks, it performs compilation (build) at push time and violates
12 factor.

Now this doesn't make it wrong, I'm sure many people are using Cloud
Foundry for apps that may not be "cloud native" and violate one or two of the
12 factors, but I'm curious how people approach Go based apps in large
scale production environments? Do they allow the Go buildpack or push
people to the binary buildpack? What do people see as the main reasons for
one over the other?

John.


Request for Multibuildpack Use Cases

Danny Rosen
 

Hi there,

The CF Buildpacks team is considering taking on a line of work to provide
more formal support for multibuildpacks. Before we start, we would be
interested in learning if any community users have compelling use cases
they could share with us.

For more information on multibuildpacks, see Heroku's documentation [1]

[1] -
https://devcenter.heroku.com/articles/using-multiple-buildpacks-for-an-app


Re: Go buildpack, cloud native and 12 factor

Rasheed Abdul-Aziz
 

Buildpacks are still environmentally aware of the target build environment.
They mean you don't need to worry about cross platform support.

On Sun, Apr 10, 2016 at 6:57 PM, john mcteague <john.mcteague(a)gmail.com>
wrote:

On a lazy sunday evening experimenting with the Go and Binary buildpacks,
a thought came to my head regarding cloud native patterns, and in
particular 12 factors' Build, Release, Run.

To me, the Go buildpack is somewhat of an outlier amongst most of the
other buildpacks, it performs compilation (build) at push time and violates
12 factor.

Now this doesn't make it wrong, I'm sure many people are using Cloud
Foundry for apps that may not be "cloud native" and violate one or two of the
12 factors, but I'm curious how people approach Go based apps in large
scale production environments? Do they allow the Go buildpack or push
people to the binary buildpack? What do people see as the main reasons for
one over the other?

John.


Go buildpack, cloud native and 12 factor

john mcteague <john.mcteague@...>
 

On a lazy sunday evening experimenting with the Go and Binary buildpacks, a
thought came to my head regarding cloud native patterns, and in particular
12 factors' Build, Release, Run.

To me, the Go buildpack is somewhat of an outlier amongst most of the other
buildpacks, it performs compilation (build) at push time and violates 12
factor.

Now this doesn't make it wrong, I'm sure many people are using Cloud
Foundry for apps that may not be "cloud native" and violate one or two of the
12 factors, but I'm curious how people approach Go based apps in large
scale production environments? Do they allow the Go buildpack or push
people to the binary buildpack? What do people see as the main reasons for
one over the other?

John.


Re: App running even after delete. Pointers on finding it and debugging?

Tom Sherrod <tom.sherrod@...>
 

Thank you, Eric.
The query about auth_username and password prompted me to review the
manifest. Those were not correct, and there was also a typo in the
cc_uploader, cc, base_url. I've made the corrections.
I likely got out of sync between versions of diego and cf. I use
generate_manifest occasionally. Will need to use it again to get the
versions back in sync.

Thanks,
Tom

On Fri, Apr 8, 2016 at 9:29 PM, Eric Malm <emalm(a)pivotal.io> wrote:

Thanks, Tom. The errors about the 401 response code make me suspect that
the nsync-bulker doesn't have the correct basic-auth credentials for the
internal app-enumeration endpoint it queries on CC. Could you check whether
the diego.nsync.cc.basic_auth_username
and diego.nsync.cc.basic_auth_password properties in your Diego manifest
are the same as the cc.internal_api_user and cc.internal_api_password
properties in your CF manifest? There was also a previous pair of CF/Diego
release versions where those properties had different defaults for the user
names in the job specs, but I believe they match in CF v230 and Diego
v0.1450.0.

Best,
Eric

On Fri, Apr 8, 2016 at 5:33 PM, Tom Sherrod <tom.sherrod(a)gmail.com> wrote:

Yes, the logs are there.
I grepped the logs for error. I see a lot of:


{"timestamp":"1460161846.254659176","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6713"}}

{"timestamp":"1460161876.286883593","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6714"}}

{"timestamp":"1460161906.315121412","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6715"}}

{"timestamp":"1460161936.352133274","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6716"}}

{"timestamp":"1460161966.383990765","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6717"}}


Let me know if there's something specific you wish to find.


Tom

On Fri, Apr 8, 2016 at 11:47 AM, Eric Malm <emalm(a)pivotal.io> wrote:

Thanks, Tom, glad you were able to use veritas to find and remove the
stray apps. I'd like to know how they remained present in the first place.
Do you have logs from the nsync-bulker jobs on the cc_bridge VMs in your
deployment? That BOSH job has the responsibility of updating the Diego
DesiredLRPs to match the current set of CF apps, so if there are
synchronization errors they should be present in those logs.
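To illustrate the kind of reconciliation described above, here is a minimal Python sketch. It is not the nsync-bulker implementation: the function name `reconcile` and the use of plain collections of app GUIDs are assumptions made purely for illustration.

```python
def reconcile(cc_app_guids, bbs_lrp_guids):
    """Return (to_create, to_delete) so that the BBS would match CC.

    Hypothetical helper: both arguments are iterables of app GUID strings.
    """
    cc = set(cc_app_guids)
    bbs = set(bbs_lrp_guids)
    to_create = sorted(cc - bbs)   # desired by CC but missing from the BBS
    to_delete = sorted(bbs - cc)   # present in the BBS but no longer in CC
    return to_create, to_delete


if __name__ == "__main__":
    create, delete = reconcile(["app-a", "app-b"], ["app-b", "app-c"])
    print("create:", create)
    print("delete:", delete)
```

If the list of CF apps cannot be fetched at all (for instance because the request is rejected), a sync of this shape has nothing reliable to compare against, so stale entries would plausibly stay in place rather than be deleted.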

Thanks,
Eric, CF Runtime Diego PM

On Fri, Apr 8, 2016 at 8:09 AM, Kris Hicks <khicks(a)pivotal.io> wrote:

It would be nice to figure out the root cause here.

Does having two crashed and two running apps have some significance as to
why the delete failed, even though it appeared successful?

On Friday, April 8, 2016, Tom Sherrod <tom.sherrod(a)gmail.com> wrote:

Thank you.

Veritas is quite informative. I found 2 apps running and 2 crashed.
I deleted them and all appears well.


On Mon, Apr 4, 2016 at 7:03 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Ok, I would use veritas to look at the Diego BBS, and confirm that it
still thinks the app is there. You can also go onto the router and query
its HTTP endpoint to confirm that the route you're seeing is also still
there: https://github.com/cloudfoundry/gorouter#instrumentation.
Lastly I would connect to the CCDB and confirm that the app and route are
*not* there. This will reduce the problem to figuring out why Diego isn't
being updated to know that the non-existing app is no longer desired.

On Mon, Apr 4, 2016 at 3:47 PM, Tom Sherrod <tom.sherrod(a)gmail.com>
wrote:

The route still exists. I was reluctant to delete it and have the
"app" still running. I wanted some way to track it down, not that it has
helped, other than let me know it is still running.

Pushed the app, with a different name/host, with no problems and it
runs as it should.

On Mon, Apr 4, 2016 at 6:17 PM, Amit Gupta <agupta(a)pivotal.io>
wrote:

Tom,

So you're saying that none of the org/spaces shows the app or the
route, but the app continues to run and be routeable?

I could imagine this happening if some CC Bridge components are not
able to talk to either CC or Diego BBS, leaving the data in the Diego BBS
stale. In the case of stale info, Diego may not know that the LRP is no
longer desired, and it will do the safe thing of keeping it around, and
emitting its route to the gorouter, which just does what it's told (it
doesn't check whether CC knows about the route or not).

Are you able to push new apps or delete other apps with the Diego
backend?

Amit

On Fri, Apr 1, 2016 at 1:00 PM, Tom Sherrod <tom.sherrod(a)gmail.com>
wrote:

JT,

Thanks for responding.

This is a test runtime and small. I checked all orgs and spaces.
No routes matching the app.

Found the route information and the result:

{

"total_results": 0,

"total_pages": 1,

"prev_url": null,

"next_url": null,

"resources": []

}

To learn what the output may look like, I checked existing routes
with apps and without. The output appears to be the same as if the app has
been deleted.
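As a small aid for reading responses like the one above, here is a hedged Python sketch that lists the app names found in a `/v2/routes/<route-guid>/apps` response. The helper name `apps_bound_to_route` is hypothetical, and it assumes each entry in `resources` carries the app name under `entity.name`.

```python
import json


def apps_bound_to_route(response_text):
    """Return the app names listed in a route's apps response.

    Assumption: each resource looks like {"entity": {"name": ...}}.
    Resources without a name are skipped.
    """
    body = json.loads(response_text)
    names = []
    for resource in body.get("resources", []):
        name = resource.get("entity", {}).get("name")
        if name is not None:
            names.append(name)
    return names


if __name__ == "__main__":
    empty = '{"total_results": 0, "total_pages": 1, "resources": []}'
    print(apps_bound_to_route(empty))
```

An empty list here only says that Cloud Controller knows of no app on that route. It says nothing about what is still running on the Diego side.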

Even now, the app url still returns a page from the app, even
though it is deleted.

Thanks,

Tom

On Fri, Apr 1, 2016 at 1:52 PM, JT Archie <jarchie(a)pivotal.io>
wrote:

Tom,

Are you sure the route isn't bound to another application in
another org/space?

When you do `cf routes` it only shows routes for the current
space. You can hit specific API endpoints, though, to get all the apps for
a route.

For example,
`cf curl /v2/routes/89fc2a5e-3a9b-4a88-a360-e405cdbd6f87/apps` will show
all the apps for a particular route (replace the route ID with the
correct ID). To find that, I recommend going through `CF_TRACE=true cf
routes` and grabbing the ID.

Let's see if you can hunt it down that way.

Kind Regards,

JT

On Fri, Apr 1, 2016 at 8:51 AM, Tom Sherrod <
tom.sherrod(a)gmail.com> wrote:

cf 230, diego 0.1450.0, etcd 27, garden-linux 0.330.0
Default to diego true.

A developer deployed a Java application, then deleted it with
cf delete <app>. No errors.
The app still responds. The only thing left is the route.
I've not encountered this before. Until now a delete has meant a delete,
and even if the route remains, "404 Not Found: Requested route
('<hostname.domain>') does not exist." is returned.

Pointers on tracking this down appreciated.

Tom


Re: App running even after delete. Pointers on finding it and debugging?

Eric Malm <emalm@...>
 

Thanks, Tom. The errors about the 401 response code make me suspect that
the nsync-bulker doesn't have the correct basic-auth credentials for the
internal app-enumeration endpoint it queries on CC. Could you check whether
the diego.nsync.cc.basic_auth_username
and diego.nsync.cc.basic_auth_password properties in your Diego manifest
are the same as the cc.internal_api_user and cc.internal_api_password
properties in your CF manifest? There was also a previous pair of CF/Diego
release versions where those properties had different defaults for the user
names in the job specs, but I believe they match in CF v230 and Diego
v0.1450.0.
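As a rough way to automate the comparison described above, here is a Python sketch. It assumes both manifests' properties have already been loaded into nested dictionaries (loading the YAML is left out), and the helper names `lookup` and `mismatched_credentials` are hypothetical.

```python
# Property pairs that are expected to hold the same values.
PAIRS = [
    ("diego.nsync.cc.basic_auth_username", "cc.internal_api_user"),
    ("diego.nsync.cc.basic_auth_password", "cc.internal_api_password"),
]


def lookup(props, dotted_key):
    """Walk a nested dict along a dotted key; return None if any part is missing."""
    current = props
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def mismatched_credentials(diego_props, cf_props):
    """Return the (diego_key, cf_key) pairs that are missing or differ."""
    problems = []
    for diego_key, cf_key in PAIRS:
        diego_value = lookup(diego_props, diego_key)
        cf_value = lookup(cf_props, cf_key)
        if diego_value is None or diego_value != cf_value:
            problems.append((diego_key, cf_key))
    return problems
```

An empty result means both pairs are present and equal. Any returned pair points at the properties worth checking by hand.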

Best,
Eric

On Fri, Apr 8, 2016 at 5:33 PM, Tom Sherrod <tom.sherrod(a)gmail.com> wrote:

Yes, the logs are there.
I grepped the logs for error. I see a lot of:


{"timestamp":"1460161846.254659176","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6713"}}

{"timestamp":"1460161876.286883593","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6714"}}

{"timestamp":"1460161906.315121412","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6715"}}

{"timestamp":"1460161936.352133274","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6716"}}

{"timestamp":"1460161966.383990765","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6717"}}


Let me know if there's something specific you wish to find.


Tom

On Fri, Apr 8, 2016 at 11:47 AM, Eric Malm <emalm(a)pivotal.io> wrote:

Thanks, Tom, glad you were able to use veritas to find and remove the
stray apps. I'd like to know how they remained present in the first place.
Do you have logs from the nsync-bulker jobs on the cc_bridge VMs in your
deployment? That BOSH job has the responsibility of updating the Diego
DesiredLRPs to match the current set of CF apps, so if there are
synchronization errors they should be present in those logs.

Thanks,
Eric, CF Runtime Diego PM

On Fri, Apr 8, 2016 at 8:09 AM, Kris Hicks <khicks(a)pivotal.io> wrote:

It would be nice to figure out the root cause here.

Does having two crashed and two running apps have some significance as to
why the delete failed, even though it appeared successful?

On Friday, April 8, 2016, Tom Sherrod <tom.sherrod(a)gmail.com> wrote:

Thank you.

Veritas is quite informative. I found 2 apps running and 2 crashed.
I deleted them and all appears well.


On Mon, Apr 4, 2016 at 7:03 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Ok, I would use veritas to look at the Diego BBS, and confirm that it
still thinks the app is there. You can also go onto the router and query
its HTTP endpoint to confirm that the route you're seeing is also still
there: https://github.com/cloudfoundry/gorouter#instrumentation.
Lastly I would connect to the CCDB and confirm that the app and route are
*not* there. This will reduce the problem to figuring out why Diego isn't
being updated to know that the non-existing app is no longer desired.

On Mon, Apr 4, 2016 at 3:47 PM, Tom Sherrod <tom.sherrod(a)gmail.com>
wrote:

The route still exists. I was reluctant to delete it and have the
"app" still running. I wanted some way to track it down, not that it has
helped, other than let me know it is still running.

Pushed the app, with a different name/host, with no problems and it
runs as it should.

On Mon, Apr 4, 2016 at 6:17 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Tom,

So you're saying that none of the org/spaces shows the app or the
route, but the app continues to run and be routeable?

I could imagine this happening if some CC Bridge components are not
able to talk to either CC or Diego BBS, leaving the data in the Diego BBS
stale. In the case of stale info, Diego may not know that the LRP is no
longer desired, and it will do the safe thing of keeping it around, and
emitting its route to the gorouter, which just does what it's told (it
doesn't check whether CC knows about the route or not).

Are you able to push new apps or delete other apps with the Diego
backend?

Amit

On Fri, Apr 1, 2016 at 1:00 PM, Tom Sherrod <tom.sherrod(a)gmail.com>
wrote:

JT,

Thanks for responding.

This is a test runtime and small. I checked all orgs and spaces. No
routes matching the app.

Found the route information and the result:

{

"total_results": 0,

"total_pages": 1,

"prev_url": null,

"next_url": null,

"resources": []

}

To learn what the output may look like, I checked existing routes
with apps and without. The output appears to be the same as if the app has
been deleted.

Even now, the app url still returns a page from the app, even
though it is deleted.

Thanks,

Tom

On Fri, Apr 1, 2016 at 1:52 PM, JT Archie <jarchie(a)pivotal.io>
wrote:

Tom,

Are you sure the route isn't bound to another application in
another org/space?

When you do `cf routes` it only shows routes for the current space.
You can hit specific API endpoints, though, to get all the apps for a route.

For example,
`cf curl /v2/routes/89fc2a5e-3a9b-4a88-a360-e405cdbd6f87/apps` will show
all the apps for a particular route (replace the route ID with the
correct ID). To find that, I recommend going through `CF_TRACE=true cf
routes` and grabbing the ID.

Let's see if you can hunt it down that way.

Kind Regards,

JT

On Fri, Apr 1, 2016 at 8:51 AM, Tom Sherrod <tom.sherrod(a)gmail.com>
wrote:

cf 230, diego 0.1450.0, etcd 27, garden-linux 0.330.0
Default to diego true.

A developer deployed a Java application, then deleted it with
cf delete <app>. No errors.
The app still responds. The only thing left is the route.
I've not encountered this before. Until now a delete has meant a delete,
and even if the route remains, "404 Not Found: Requested route
('<hostname.domain>') does not exist." is returned.

Pointers on tracking this down appreciated.

Tom


Re: App running even after delete. Pointers on finding it and debugging?

Tom Sherrod <tom.sherrod@...>
 

Yes, the logs are there.
I grepped the logs for error. I see a lot of:

{"timestamp":"1460161846.254659176","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6713"}}

{"timestamp":"1460161876.286883593","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6714"}}

{"timestamp":"1460161906.315121412","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6715"}}

{"timestamp":"1460161936.352133274","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6716"}}

{"timestamp":"1460161966.383990765","source":"nsync-bulker","message":"nsync-bulker.sync.not-bumping-freshness-because-of","log_level":2,"data":{"error":"invalid response code 401","session":"6717"}}


Let me know if there's something specific you wish to find.
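For anyone who wants a summary of log entries like these rather than a raw grep, here is a small Python sketch. It assumes each entry is a single JSON object on one line, as in the original log file, and the function name `count_errors` is hypothetical.

```python
import json
from collections import Counter


def count_errors(lines, source="nsync-bulker"):
    """Count entries per data.error value for one log source.

    Lines that are blank, not valid JSON, or from another source are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict) or entry.get("source") != source:
            continue
        data = entry.get("data")
        error = data.get("error") if isinstance(data, dict) else None
        if error:
            counts[error] += 1
    return counts
```

Fed the five entries above, this would report a single error text, "invalid response code 401", with a count of 5.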


Tom

On Fri, Apr 8, 2016 at 11:47 AM, Eric Malm <emalm(a)pivotal.io> wrote:

Thanks, Tom, glad you were able to use veritas to find and remove the
stray apps. I'd like to know how they remained present in the first place.
Do you have logs from the nsync-bulker jobs on the cc_bridge VMs in your
deployment? That BOSH job has the responsibility of updating the Diego
DesiredLRPs to match the current set of CF apps, so if there are
synchronization errors they should be present in those logs.

Thanks,
Eric, CF Runtime Diego PM

On Fri, Apr 8, 2016 at 8:09 AM, Kris Hicks <khicks(a)pivotal.io> wrote:

It would be nice to figure out the root cause here.

Does having two crashed and two running apps have some significance as to
why the delete failed, even though it appeared successful?

On Friday, April 8, 2016, Tom Sherrod <tom.sherrod(a)gmail.com> wrote:

Thank you.

Veritas is quite informative. I found 2 apps running and 2 crashed.
I deleted them and all appears well.


On Mon, Apr 4, 2016 at 7:03 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Ok, I would use veritas to look at the Diego BBS, and confirm that it
still thinks the app is there. You can also go onto the router and query
its HTTP endpoint to confirm that the route you're seeing is also still
there: https://github.com/cloudfoundry/gorouter#instrumentation.
Lastly I would connect to the CCDB and confirm that the app and route are
*not* there. This will reduce the problem to figuring out why Diego isn't
being updated to know that the non-existing app is no longer desired.

On Mon, Apr 4, 2016 at 3:47 PM, Tom Sherrod <tom.sherrod(a)gmail.com>
wrote:

The route still exists. I was reluctant to delete it and have the
"app" still running. I wanted some way to track it down, not that it has
helped, other than let me know it is still running.

Pushed the app, with a different name/host, with no problems and it
runs as it should.

On Mon, Apr 4, 2016 at 6:17 PM, Amit Gupta <agupta(a)pivotal.io> wrote:

Tom,

So you're saying that none of the org/spaces shows the app or the
route, but the app continues to run and be routeable?

I could imagine this happening if some CC Bridge components are not able
to talk to either CC or Diego BBS, leaving the data in the Diego BBS
stale. In the case of stale info, Diego may not know that the LRP is no
longer desired, and it will do the safe thing of keeping it around, and
emitting its route to the gorouter, which just does what it's told (it
doesn't check whether CC knows about the route or not).

Are you able to push new apps or delete other apps with the Diego
backend?

Amit

On Fri, Apr 1, 2016 at 1:00 PM, Tom Sherrod <tom.sherrod(a)gmail.com>
wrote:

JT,

Thanks for responding.

This is a test runtime and small. I checked all orgs and spaces. No
routes matching the app.

Found the route information and the result:

{

"total_results": 0,

"total_pages": 1,

"prev_url": null,

"next_url": null,

"resources": []

}

To learn what the output may look like, I checked existing routes with
apps and without. The output appears to be the same as if the app has been
deleted.

Even now, the app url still returns a page from the app, even though
it is deleted.

Thanks,

Tom

On Fri, Apr 1, 2016 at 1:52 PM, JT Archie <jarchie(a)pivotal.io>
wrote:

Tom,

Are you sure the route isn't bound to another application in
another org/space?

When you do `cf routes` it only shows routes for the current space.
You can hit specific API endpoints, though, to get all the apps for a route.

For example,
`cf curl /v2/routes/89fc2a5e-3a9b-4a88-a360-e405cdbd6f87/apps` will show
all the apps for a particular route (replace the route ID with the
correct ID). To find that, I recommend going through `CF_TRACE=true cf
routes` and grabbing the ID.

Let's see if you can hunt it down that way.

Kind Regards,

JT

On Fri, Apr 1, 2016 at 8:51 AM, Tom Sherrod <tom.sherrod(a)gmail.com>
wrote:

cf 230, diego 0.1450.0, etcd 27, garden-linux 0.330.0
Default to diego true.

A developer deployed a Java application, then deleted it with
cf delete <app>. No errors.
The app still responds. The only thing left is the route.
I've not encountered this before. Until now a delete has meant a delete,
and even if the route remains, "404 Not Found: Requested route
('<hostname.domain>') does not exist." is returned.

Pointers on tracking this down appreciated.

Tom


Re: cf v233 api_z1/api_z2 failing

Ranga Rajagopalan
 

Hi Kara & Peter,

Thanks a lot for your help. That fixed the issue.

On Thu, Apr 7, 2016 at 9:04 PM, Ranga Rajagopalan <
ranga.rajagopalan(a)gmail.com> wrote:

Hi Kara,

Thanks. Let me try with a valid app_domain.

On Thu, Apr 7, 2016 at 9:01 PM, Kara Alexandra <ardnaxelarak(a)gmail.com>
wrote:

Hi Ranga,

The only reason we were using bosh-lite.com for our app_domains was
because we were testing to reproduce on our local bosh-lite.
Using 'cfapp' I managed to reproduce the issue locally. My guess is that
this is because 'cfapp' is not a valid domain (it doesn't end with a valid
top-level domain), and I'm guessing that fixing that will at least fix part
of the problem if not all of it.
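To make the guess above concrete, here is a rough Python sketch of a format check that rejects names like 'cfapp'. This is not the Cloud Controller's validation: the regular expression and the function name `looks_like_domain` are assumptions for illustration only, and the check does not consult any list of real top-level domains.

```python
import re

# One or more dot-terminated labels followed by an alphabetic final label.
_DOMAIN_RE = re.compile(
    r"(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}",
    re.IGNORECASE,
)


def looks_like_domain(name):
    """Return True if name has at least two labels and only valid characters."""
    return len(name) <= 253 and _DOMAIN_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["cfapp", "b%()osh-lite.com", "bosh-lite.com"]:
        print(candidate, looks_like_domain(candidate))
```

Under this sketch a single-label name such as 'cfapp' fails because it has no dot-separated final label, and a name with special characters fails on the character class.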

Thanks,

Kara

On Thu, Apr 7, 2016 at 8:27 PM, Ranga Rajagopalan <
ranga.rajagopalan(a)gmail.com> wrote:

Here's my deployment manifest. app_domains is set to cfapp. I can't find
bosh-lite anywhere in the file at all.

On Thu, Apr 7, 2016 at 4:27 PM, Peter Goetz <peter.gtz(a)gmail.com> wrote:

Hi Ranga,

Looking at your logs we found an error that could possibly cause this
and it is related to the properties.apps_domain in the deployment manifest.
By setting it to 'b%()osh-lite.com' (using special characters), we
could reproduce the following error which we also found in your log file:

Encountered error: name
format\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/sequel-4.29.0/lib/sequel/model/base.rb:1543:in
`save'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/app/models/runtime/shared_domain.rb:35:in
`block in
find_or_create'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/sequel-4.29.0/lib/sequel/database/transactions.rb:134:in
`_transaction'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/sequel-4.29.0/lib/sequel/database/transactions.rb:108:in
`block in
transaction'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/sequel-4.29.0/lib/sequel/database/connecting.rb:249:in
`block in
synchronize'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/sequel-4.29.0/lib/sequel/connection_pool/threaded.rb:103:in
`hold'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/sequel-4.29.0/lib/sequel/database/connecting.rb:249:in
`synchronize'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/sequel-4.29.0/lib/sequel/database/transactions.rb:97:in
`transaction'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/app/models/runtime/shared_domain.rb:27:in
`find_or_create'\n/var/vcap/data/packages/cloud_controller_ng/da452b34d79be56a0784c7e88d6b9c0e1811a9d8.1-f934c136e6019cf54ab7aa04a3c153657226e729/cloud_controller_ng/lib/cloud_controller/seeds.rb:57:in
`block in
create_seed_domains'\n/var/vcap/data/packages/cloud_controller_ng/da452b34d79be56a0784c7e88d6b9c0e1811a9d8.1-f934c136e6019cf54ab7aa04a3c153657226e729/cloud_controller_ng/lib/cloud_controller/seeds.rb:56:in
`each'\n/var/vcap/data/packages/cloud_controller_ng/da452b34d79be56a0784c7e88d6b9c0e1811a9d8.1-f934c136e6019cf54ab7aa04a3c153657226e729/cloud_controller_ng/lib/cloud_controller/seeds.rb:56:in
`create_seed_domains'\n/var/vcap/data/packages/cloud_controller_ng/da452b34d79be56a0784c7e88d6b9c0e1811a9d8.1-f934c136e6019cf54ab7aa04a3c153657226e729/cloud_controller_ng/lib/cloud_controller/seeds.rb:9:in
`write_seed_data'\n/var/vcap/data/packages/cloud_controller_ng/da452b34d79be56a0784c7e88d6b9c0e1811a9d8.1-f934c136e6019cf54ab7aa04a3c153657226e729/cloud_controller_ng/lib/cloud_controller/runner.rb:93:in
`block in
run!'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/eventmachine-1.0.9.1/lib/eventmachine.rb:193:in
`call'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/eventmachine-1.0.9.1/lib/eventmachine.rb:193:in
`run_machine'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.2.0/gems/eventmachine-1.0.9.1/lib/eventmachine.rb:193:in
`run'\n/var/vcap/data/packages/cloud_controller_ng/da452b34d79be56a0784c7e88d6b9c0e1811a9d8.1-f934c136e6019cf54ab7aa04a3c153657226e729/cloud_controller_ng/lib/cloud_controller/runner.rb:87:in
`run!'\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/bin/cloud_controller:8:in
`<main>'

Can you check the apps_domain property and see if there is anything
suspicious with it?

Thanks,
Peter & Kara

On Thu, Apr 7, 2016 at 2:32 PM Ranga Rajagopalan <
ranga.rajagopalan(a)gmail.com> wrote:

Hi Peter,

Attaching /var/vcap/sys/log/cloud_controller_worker/cloud_controller_worker.log.gz
and /var/vcap/sys/log/cloud_controller_worker_ctl.log.gz. There isn't a
/var/vcap/sys/log/cloud_controller_worker/ directory on this node.

vcap(a)572afc33-0735-4727-8ff3-9dc6d7fa8af0:~$ ls /var/vcap/sys/log/
agent_ctl.err.log                       cloud_controller_worker_ctl.log  nginx_cc/
agent_ctl.log                           consul_agent/                    nginx_ctl.err.log
cloud_controller_migration/             metron_agent/                    nginx_ctl.log
cloud_controller_migration_ctl.err.log  metron_agent_ctl.err.log         route_registrar/
cloud_controller_migration_ctl.log      metron_agent_ctl.log             route_registrar_ctl.err.log
cloud_controller_ng/                    monit/                           route_registrar_ctl.log
cloud_controller_ng_ctl.err.log         nfs_mounter/                     statsd-injector/
cloud_controller_ng_ctl.log             nfs_mounter_ctl.err.log          statsd-injector-ctl.err.log
cloud_controller_worker_ctl.err.log     nfs_mounter_ctl.log              statsd-injector-ctl.log



On Thu, Apr 7, 2016 at 12:21 PM, Peter Goetz <peter.gtz(a)gmail.com>
wrote:

Hi Ranga,

To trouble-shoot this issue could you also provide the contents of
/var/vcap/sys/log/cloud_controller_ng/cloud_controller_ng.log
and /var/vcap/sys/log/cloud_controller_worker/cloud_controller_worker.log?
This should give us more details about what's going on. The ctl script logs
do not provide enough details.

Thanks,
Peter

On Wed, Apr 6, 2016 at 6:12 PM Ranga Rajagopalan <
ranga.rajagopalan(a)gmail.com> wrote:

I tried v231. Unfortunately, same issue.

--
Thanks,

Ranga
