
New feature discussion: letting users deploy an app to a specific zone with CF

Liangbiao
 

Hi,
Currently, a DEA can be assigned to a "zone", and the Cloud Controller can schedule app instances according to zone (https://github.com/cloudfoundry/cloud_controller_ng/blob/965dbc4bdf65df89f382329aef39f86a916b3f05/lib/cloud_controller/dea/pool.rb#L47).
I think we could push this further: for example, let app developers specify which zone to deploy their app to.

Regards,
Rexxar


Re: bosh-lite diego "found no compatible cell"

Ted Young
 

"found no compatible cell" is the error you will get when all diego Cells
have failed to deploy. Start by double checking that the cell is up and
running via `bosh vms`.

-Ted

On Tue, Nov 24, 2015 at 6:43 PM, Eric Malm <emalm(a)pivotal.io> wrote:

Hi, Christian,

Thanks for asking. From what you've described, it sounds like Cloud
Controller is able to tell Diego to run the app, but the app instance isn't
getting placed on a cell. It's probably worth looking at the logs of the
auctioneer, in /var/vcap/sys/log/auctioneer/auctioneer.stdout.log on the
brain_z1/0 Diego VM, and the cell rep, in
/var/vcap/sys/log/rep/rep.stdout.log on the cell_z1/0 Diego VM (assuming a
standard BOSH-Lite deployment). It might even be useful to follow those
logs in real-time with `tail -f` while you stop and start the CF app.

Also, what versions of CF, Diego, Garden-Linux, and the BOSH-Lite stemcell
do you have deployed?

Thanks,
Eric, CF Runtime Diego PM

On Mon, Nov 23, 2015 at 2:19 PM, Christian Stocker <chregu(a)liip.ch> wrote:

Hi

I installed bosh-lite on my mac according to the docs on
https://github.com/cloudfoundry-incubator/diego-release

That worked all fine, I can deploy apps without diego enabled and they
run as expected. But when I enable diego for an app and then restart it,
I get

0 of 1 instances running, 1 starting (found no compatible cell)

Any idea where to look at?

Greetings

christian


Re: bosh-lite diego "found no compatible cell"

Eric Malm <emalm@...>
 

Hi, Christian,

Thanks for asking. From what you've described, it sounds like Cloud
Controller is able to tell Diego to run the app, but the app instance isn't
getting placed on a cell. It's probably worth looking at the logs of the
auctioneer, in /var/vcap/sys/log/auctioneer/auctioneer.stdout.log on the
brain_z1/0 Diego VM, and the cell rep, in
/var/vcap/sys/log/rep/rep.stdout.log on the cell_z1/0 Diego VM (assuming a
standard BOSH-Lite deployment). It might even be useful to follow those
logs in real-time with `tail -f` while you stop and start the CF app.
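
For example, something along these lines (a rough sketch assuming the
standard BOSH-Lite job names; adjust the job/index to match your manifest):

  bosh ssh brain_z1/0
  sudo tail -f /var/vcap/sys/log/auctioneer/auctioneer.stdout.log

  bosh ssh cell_z1/0
  sudo tail -f /var/vcap/sys/log/rep/rep.stdout.log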

Also, what versions of CF, Diego, Garden-Linux, and the BOSH-Lite stemcell
do you have deployed?

Thanks,
Eric, CF Runtime Diego PM

On Mon, Nov 23, 2015 at 2:19 PM, Christian Stocker <chregu(a)liip.ch> wrote:

Hi

I installed bosh-lite on my mac according to the docs on
https://github.com/cloudfoundry-incubator/diego-release

That worked all fine, I can deploy apps without diego enabled and they
run as expected. But when I enable diego for an app and then restart it,
I get

0 of 1 instances running, 1 starting (found no compatible cell)

Any idea where to look at?

Greetings

christian


Warden stemcell 3126 not usable for CF+Diego deployments to BOSH-Lite

Eric Malm <emalm@...>
 

Hi, all,

If you've tried using the 3126 Warden stemcell for your CF and Diego
deployments to BOSH-Lite, you will likely have discovered that Diego
doesn't deploy correctly. As it turns out, there is a change to the
resolvconf configuration in that version of that stemcell that prevents the
consul agent from providing DNS to CF and Diego components. Consequently,
some Diego components are unable even to start correctly, and Cloud
Controller will be unable to communicate with the Diego deployment.

The BOSH team is working on fixing the configuration issue in
https://www.pivotaltracker.com/n/projects/956238/stories/107958688, and
there are more details about the problem available at
https://github.com/cloudfoundry/bosh-lite/issues/315. In the meantime, we
recommend using the previous BOSH-Lite Warden stemcell version, 2776,
instead. I've already amended the BOSH-Lite instructions in the
diego-release README to obtain this specific version of the stemcell from
bosh.io, rather than the latest version.
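
For reference, pinning that version typically looks something like this with
the BOSH CLI (the exact stemcell name/URL is my assumption here; please
confirm it against bosh.io before using it):

  bosh upload stemcell https://bosh.io/d/stemcells/bosh-warden-boshlite-ubuntu-trusty-go_agent?v=2776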

Also, please note that this issue affects only the BOSH-Lite Warden
stemcell, and stemcells for non-BOSH-Lite IaaSes are correctly compatible
with consul agent-based DNS.
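
If you want a quick, informal check of whether a deployed VM is affected,
the resolver configuration should show the consul agent's local entry first
on a healthy VM (this is just a rough heuristic):

  head -n 5 /etc/resolv.conf   # expect "nameserver 127.0.0.1" near the top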

Thanks,
Eric Malm, CF Runtime Diego PM


connection draining for TCP Router in cf-routing-release

Shannon Coen
 

On the CAB call Dr. Nic asked about support in the routing tier for
connection draining. I asked him out-of-band to elaborate, then realized
this was a topic the community might be interested in. Nic explained that
he's looking for a TCP router to route requests from apps on CF to a
clustered service, and wants to allow graceful draining of requests before
a backend is moved.

When a backend for a route is removed from the routing table, the TCP
Router will prevent new requests for the route from being routed to that
backend, and will reject requests for the route when all associated
backends are removed. The routing table is updated via the Routing API; the
TCP router fetches its configuration by subscribing to the API via SSE, as
well as a periodic bulk fetch. When backends are removed for a route,
existing connections remain up until closed by either the client or
backend. We don't currently sever open connections after a timeout.

In CF, when Diego removes an app instance it sends a TERM to the process in
the container which has 10s to drain active connections before the
container is torn down and all the processes killed. In parallel the
backend will be removed from the route, preventing new connections.
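
For app authors who want to use that 10s window, a minimal illustrative
drain wrapper in shell could look like the following; start_server and
stop_accepting_new_work are placeholders for whatever your app actually does:

  #!/bin/bash
  # Hypothetical wrapper: drain in-flight work when Diego sends TERM.
  trap 'stop_accepting_new_work; wait; exit 0' TERM
  start_server &   # placeholder for your long-running process
  wait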

Nic:

Does the existing behavior described above meet your needs, or would you
require a timeout and proactive connection severing by the router? I recall
we found this difficult using HAProxy last year, leading us to build the
Switchboard proxy for cf-mysql-release. Have you considered Switchboard?

In your use case could the IPs of your cluster nodes change at any time, or
only on a deploy? In either case, you could use the Routing API to
configure the router with the node addresses (similar to the way clients
must currently register routes via NATS).

Would you expect other clients to register routes with the same deployment
of the API, or would you isolate it to the deployment of your service? The
Routing API, like NATS, doesn't support multi-tenant isolation yet, so
multiple clients could potentially add unrelated backends for the same
route.

Finally, are you only interested in TCP routing? If so, I imagine you would
deploy the routing-release with only the API and TCP router jobs.

Shannon Coen
Product Manager, Cloud Foundry
Pivotal, Inc.


Re: [abacus] Refactor Aggregated Usage and Aggregated Rated Usage data model

Jean-Sebastien Delfino
 

Hi all,

Here's an update on this topic and the design discussions Assk, Ben and I
had in the last few days:

I'll start with a description of the problem we're trying to solve here:

Abacus currently computes and stores the aggregated usage at various levels
within an org in real time. Each time new usage for resource instances gets
submitted we compute your latest aggregated usage at the org, space, app,
resource and plan level, and store that in a new document keyed by the org
id and the current time.

We effectively write a history of your org's aggregated usage in the Abacus
database, and that design allows us to efficiently report your latest
usage, your usage history, or trigger usage limit alerts in real time for
example, simply because we always have your latest usage for a given time
in hand in a single doc, as opposed to having to run complex database
queries pulling all your usage data into an aggregation when it's needed.

So, that design is all good until somebody creates a thousand (or even a
hundred) apps in the org. With many apps, our aggregated usage (JSON) docs
get pretty big as we're keeping track of the aggregated usage for each app,
JSON is not very space-efficient at representing all that data (that's a
euphemism), and since we're writing a new doc for each new submitted usage,
we eventually overload our Couch database with these big JSON docs.
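
To give a rough sense of why these docs grow, here is a hypothetical,
heavily simplified sketch of the shape of such a doc (field names are
illustrative only; see the sample report gist referenced later in this
thread for the real format):

  // hypothetical, simplified structure -- names are illustrative only
  {
    "organization_id": "org-guid",
    "start": 1448323200000,
    "resources": [ { "resource_id": "object-storage", "plans": [ ... ] } ],
    "spaces": [
      { "space_id": "space-guid",
        "consumers": [
          { "consumer_id": "app-guid-1", "resources": [ ... ] },
          { "consumer_id": "app-guid-2", "resources": [ ... ] } ] } ]
  }

With hundreds of consumers, those per-app entries quickly dominate the size
of every new doc we write.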

Long story short... this discussion is about trying to optimize our data
model for aggregated usage to fix that problem. It's also an example of the
typical tension in systems that need to stream a lot of data, compute some
aggregates, and make quick decisions based on them: (a) do you pro-actively
compute and store the aggregated values in real time as you're consuming
your stream of input data? or (b) do you just write the input data and then
run a mix of pseudo-real time and batch queries over and over on that data
to compute the aggregates later? Our current design is along the lines of
(a), but we're starting to also poke at ideas from the (b) camp to mitigate
some of the issues of the (a) camp.

The initial proposal described by Assk earlier in this thread was to split
the single org level doc containing all the usage aggregations within the
org into smaller docs: one doc per app for example (aka consumer in Abacus
as we support usage from other things than pure apps). That's what he was
calling 'normalized' usage, since the exercise of coming up with that new
structure would be similar to a 'normalization' of the data in the
relational database sense, as opposed to the 'denormalization' we went
through to design the structure of our current aggregated usage doc (a JSON
hierarchical structure including some data duplication).

Now, while that data 'normalization' would help reduce the size of the docs
and the amount of data written to record the history of your org's
aggregated usage, in the last few days we've also started to realize that
it would on the other hand increase the amount of data we'd have to read,
to retrieve all the little docs representing the current aggregated usage
and 'join' them into a complete view of the org's aggregated usage before
adding new usage to it...

Like I said before, a tension between two approaches, (a) writes a lot of
data, is cheap on reads, (b) writes the minimum, requires a lot of reads...
nothing's easy or perfect :) So the next step here is going to be an
evaluation of some of the trade-offs between:

a) write all the aggregated usage data for an org in one doc like we do now
but simplify and refactor a bit the JSON format we use to represent it, in
an attempt to make that JSON representation much smaller;

b) split the aggregated usage in separate docs, one per app, linked
together by a parent doc per org containing their ids, and optimize (with
caching for example) the reads and 'joins' of all the docs forming the
aggregated usage for the org;

c) a middle-ground approach where we'll store the aggregated usage per app
in separate docs, but maintain the aggregated usage at the upper levels
(org, space, resource, plan) in the parent doc linking the app usage docs
together, and explore what constraints or limitations that would impose on
our ability to trigger real time usage limit alerts at any org, space,
resource, plan, app etc level.

This is a rather complex subject, so please feel free to ask questions or
send any thoughts here, or in the tracker and Github issues referenced by
Assk earlier if that's easier. Thanks!

- Jean-Sebastien

On Fri, Nov 20, 2015 at 11:09 AM, Saravanakumar A Srinivasan <
sasrin(a)us.ibm.com> wrote:

Started to look into two user stories([1] and [2]) titled "Organize the
usage report data model for better querying and DB utilization"

Current state of Abacus processing pipeline starting from Usage
Accumulator:

a) Usage Accumulator processes metered usage for a resource instance,
accumulates the usage at resource instance scope and then forwards
accumulated usage for a resource instance to Usage Aggregator.
b) Usage Aggregator processes accumulated usage for a resource
instance, aggregates the usage at following scopes:
organization.resources,
organization.resources.plans,
organization.spaces.resources,
organization.spaces.resources.plans,
organization.spaces.consumers.resources and
organization.spaces.consumers.resources.plans, and then forwards
aggregated usage for an organization to Usage Rating Service.
c) Usage Rating Service processes aggregated usage for an
organization, rates the aggregated usage at following scopes:
organization.resources.plans,
organization.spaces.resources.plans, and
organization.spaces.consumers.resources.plans.
d) Usage Reporting Service processes rated usage for an organization
and summarizes usage and charge at all aggregation scopes. See [3] for a
sample Abacus usage report.


Initial thoughts on the changes needed to optimize steps b, c, and d:

b) Usage Aggregator processes accumulated usage for a resource
instance and aggregates and rates the usage at a consumer scope -
equivalent to the scopes of organization.spaces.consumers.resources and
organization.spaces.consumers.resources.plans and then maintains a
normalized aggregated usage for an organization that contains references to
all consumer scoped documents that belong to the organization.
c) Eliminate Usage Rating Service and split the current rating step
across Usage Aggregator and Usage Reporting Service.
d) Usage Reporting Service processes a normalized aggregated usage for
an organization, uses references to get all consumer scoped documents that
belong to the organization, aggregates and rates consumer scoped usage at
all other scopes, and then summarizes usage and charge at all aggregation
scopes.

Any comments?


[1] https://www.pivotaltracker.com/story/show/107598654
[2] https://www.pivotaltracker.com/story/show/107598652
[3] https://gist.github.com/sasrin/697437b33d38bdddf825#file-report-json

Thanks,
Saravanakumar Srinivasan (Assk),

Bay Area Lab, 1001, E Hillsdale Blvd, Ste 400, Foster City, CA - 94404.
E-mail: sasrin(a)us.ibm.com
Phone: 650 645 8251 (T/L 367-8251)


Re: Unable to deploy application

Deepak Arn <arn.deepak1@...>
 

Hi,

I tried with lower as well as higher limits, but the command just hangs and no packets are received.

ubuntu(a)test:~$ ping github.com -M dont -s 1400
PING github.com (192.30.252.131) 1400(1428) bytes of data.
^C
--- github.com ping statistics ---
282 packets transmitted, 0 received, 100% packet loss, time 281921ms

ubuntu(a)test:~$ ping github.com -M dont -s 1420
PING github.com (192.30.252.128) 1420(1448) bytes of data.
^C
--- github.com ping statistics ---
136 packets transmitted, 0 received, 100% packet loss, time 135493ms

ubuntu(a)test:~$ ping github.com -M dont -s 1430
PING github.com (192.30.252.129) 1430(1458) bytes of data.
^C
--- github.com ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4024ms

ubuntu(a)test:~$ ping github.com -M dont -s 1434
PING github.com (192.30.252.129) 1434(1462) bytes of data.

^C
--- github.com ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9047ms

ubuntu(a)test:~$ ping github.com -M dont -s 1450
PING github.com (192.30.252.128) 1450(1478) bytes of data.
^C
--- github.com ping statistics ---
11 packets transmitted, 0 received, 100% packet loss, time 10027ms

ubuntu(a)test:~$ ping github.com -M dont -s 1500
PING github.com (192.30.252.128) 1500(1528) bytes of data.
^C
--- github.com ping statistics ---
6 packets transmitted, 0 received, 100% packet loss, time 5031ms

ubuntu(a)test:~$ ping github.com -M dont -s 1462
PING github.com (192.30.252.128) 1462(1490) bytes of data.
^C
--- github.com ping statistics ---
6 packets transmitted, 0 received, 100% packet loss, time 5027ms


Re: Unable to deploy application

CF Runtime
 

Hey Deepak,

I found that you provided some more information about your problem on
Github: https://github.com/cloudfoundry/cf-release/issues/823

Was there any message from the ping about why packets weren't being
received? Have you tried a smaller limit than 1426?
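
For example, something like this walks the payload size down with the
don't-fragment flag set, which helps narrow down the usable path MTU (a
generic sketch; adjust the sizes for your network):

  ping -c 3 -M do -s 1400 github.com
  ping -c 3 -M do -s 1300 github.com
  ping -c 3 -M do -s 1200 github.com
  # largest -s that gets replies + 28 bytes of headers ~= path MTU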

Natalie & Mikhail
OSS Release & Integration

On Fri, Nov 20, 2015 at 9:30 AM, Deepak Arn <arn.deepak1(a)gmail.com> wrote:

Hi,

I'm using Nova for compute resource


Re: REGARDING_api_z1/0_CANARY_UPDATE

CF Runtime
 

Have you checked the control script logs in the `/var/vcap/sys/log/`
folder? If the jobs are failing to start, that's a good place to look. If
you send them to us, we can tell you more.
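
For example, on the api_z1/0 VM something like this is usually a reasonable
starting point (exact directory and file names vary by release version):

  bosh ssh api_z1/0
  ls /var/vcap/sys/log/
  sudo tail -n 100 /var/vcap/sys/log/cloud_controller_ng/*.log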

Also, what infrastructure are you deploying cloud foundry to, and can you
send us the manifest you're using to deploy it?

Natalie & Mikhail
OSS Integration & Runtime

On Thu, Nov 19, 2015 at 1:19 AM, Parthiban A <senjiparthi(a)gmail.com> wrote:

Hello All,
Since I have been facing the following issue for a very long time,
I have opened it as a separate thread. The problem I am currently facing is

Error 400007: `api_z1/0' is not running after update

I have SSHed into the api_z1/0 VM and did a monit summary. It shows that

root(a)5c446a3d-3070-4d24-9f2e-1cff18218c07:/var/vcap/sys/log# monit summary
The Monit daemon 5.2.4 uptime: 20m

Process 'cloud_controller_ng' initializing
Process 'cloud_controller_worker_local_1' not monitored
Process 'cloud_controller_worker_local_2' not monitored
Process 'nginx_cc' initializing
Process 'metron_agent' running
File 'nfs_mounter' Does not exist
System 'system_5c446a3d-3070-4d24-9f2e-1cff18218c07' running

Could anyone help on this issue? Thanks.


Re: Staging Error while deploying application on OpenStack

D vidzz
 

Hi Daniel,

I tried curl -vv https://github.com/cloudfoundry/java-buildpack/ from the instance (on OpenStack) where CF is installed, and that works.

Regarding offline buildpacks, CF already has offline buildpacks and I also added a new buildpack (java-custom), below is the output of cf buildpacks command:

buildpack          position   enabled   locked   filename
java-custom        1          true      false    java-buildpack-master.zip
java_buildpack     2          true      false    java-buildpack-v3.3.zip
ruby_buildpack     3          true      false    ruby_buildpack-cached-v1.6.7.zip
nodejs_buildpack   4          true      false    nodejs_buildpack-cached-v1.5.0.zip

I pushed my app with `cf push Web2291 -b java-custom` and also with the existing buildpack via `cf push Web2291 -b java_buildpack`.

Both times it gets stuck; see the log below:

Updating app Web2291 in org DevBox / space Applications as admin...
OK

Uploading Web2291...
Uploading app files from: C:\Users\umroot\workspaceKeplerJee\Web2291
Uploading 15.3K, 29 files
Done uploading
OK

Stopping app Web2291 in org DevBox / space Applications as admin...
OK

Starting app Web2291 in org DevBox / space Applications as admin...


-----> Downloaded app package (1.1M)

and in the logs it's the same as before:

2015-11-24T16:39:25.25-0500 [DEA/0] OUT Got staging request for app with id c4b8522c-5157-4fa4-bb73-814f63603b23
2015-11-24T16:39:25.29-0500 [STG/0] OUT
2015-11-24T16:39:25.29-0500 [STG/0] ERR
2015-11-24T16:39:27.06-0500 [STG/0] OUT -----> Downloaded app package (1.1M)
2015-11-24T16:46:23.28-0500 [DEA/0] OUT Got staging request for app with id c4b8522c-5157-4fa4-bb73-814f63603b23
2015-11-24T16:46:23.32-0500 [STG/0] OUT
2015-11-24T16:46:23.32-0500 [STG/0] ERR
2015-11-24T16:46:25.66-0500 [STG/0] OUT -----> Downloaded app package (1.1M)
2015-11-24T16:46:47.06-0500 [DEA/0] OUT Got staging request for app with id c4b8522c-5157-4fa4-bb73-814f63603b23
2015-11-24T16:46:47.09-0500 [STG/0] OUT
2015-11-24T16:46:47.09-0500 [STG/0] ERR
2015-11-24T16:46:48.80-0500 [STG/0] OUT -----> Downloaded app package (1.1M)
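
For anyone debugging the same kind of hang, pulling the recent app logs and
re-pushing with CLI tracing enabled can sometimes surface more detail (a
generic suggestion, not specific to this buildpack):

  cf logs Web2291 --recent
  CF_TRACE=true cf push Web2291 -b java_buildpack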


Thanks,


Re: diego: disk filling up over time

Tom Sherrod <tom.sherrod@...>
 

Hi Eric,

Thank you.

I am responding below with what I have available. Unfortunately, when the
problem presents itself, developers are blocked, so the current resolution is
to recreate the cells. Looking at one cell below that is 98% full, an
opportunity for additional details may arise soon.
Answers inline below.

- What are the exact errors you're seeing when CF users are trying to make
containers? The errors from CF CLI logs or rep/garden logs would be great
to see.
Did not capture detailed logs; FAILED StagingError was all that was
captured. I've asked to get more information on the next failure, which may
be coming up soon since I'm looking at a cell that is 98% full. No issue
reported yet, though of course there are 8 cells to choose from.


- What's the total amount of disk space available on the volume attached
to /var/vcap/data? You should be able to see this from `df` command output.
/dev/vda3    22025756  20278880   604964  98%  /var/vcap/data
tmpfs            1024        16     1008   2%  /var/vcap/data/sys/run
/dev/loop0     122835      1552   117352   2%  /tmp
/dev/loop1   20480000  17923904  1914816  91%  /var/vcap/data/garden-linux/btrfs_graph
cgroup        8216468         0  8216468   0%  /tmp/garden-/cgroup
- How much space is the rep configured to allocate for its executor cache?
Is it the default 10GB provided by the rep's job spec in
https://github.com/cloudfoundry-incubator/diego-release/blob/v0.1398.0/jobs/rep/spec#L70-L72?
How much disk is actually used in /var/vcap/data/executor_cache (based on
reporting from `du`, say)?

Default (not listed in the manifest)

root(a)a0acd863-07e5-4964-8758-fcdf295d119d:/var/vcap/data/executor_cache# du

42876 .

- How much space have you directed garden-linux to allocate for its btrfs
store? This is provided via the diego.garden-linux.btrfs_store_size_mb BOSH
property, and with Diego 0.1398.0 I believe it has to be specified
explicitly. Also, how much space is actually used in the btrfs filesystem?
You should be able to inspect this with the btrfs tools available on the
cell VM in '/var/vcap/packages/btrfs-tools/bin'. I think running
`/var/vcap/packages/btrfs-tools/bin/btrfs filesystem usage
/var/vcap/data/garden-linux/btrfs_graph` should be a good starting point.
btrfs_store_size_mb: 20000

root(a)a0acd863-07e5-4964-8758-fcdf295d119d:/var/vcap/packages/btrfs-progs/bin#
./btrfs filesystem usage /var/vcap/data/garden-linux/btrfs_graph

Overall:
    Device size:           19.53GiB
    Device allocated:      17.79GiB
    Device unallocated:     1.75GiB
    Device missing:           0.00B
    Used:                  16.78GiB
    Free (estimated):       1.83GiB  (min: 976.89MiB)
    Data ratio:                1.00
    Metadata ratio:            2.00
    Global reserve:       320.00MiB  (used: 0.00B)

Data,single: Size:12.01GiB, Used:11.93GiB
    /dev/loop1  12.01GiB
Metadata,single: Size:8.00MiB, Used:0.00B
    /dev/loop1   8.00MiB
Metadata,DUP: Size:2.88GiB, Used:2.43GiB
    /dev/loop1   5.75GiB
System,single: Size:4.00MiB, Used:0.00B
    /dev/loop1   4.00MiB
System,DUP: Size:8.00MiB, Used:16.00KiB
    /dev/loop1  16.00MiB
Unallocated:
    /dev/loop1   1.75GiB




You may also find some useful information in the cf-dev thread from August
about overcommitting disk on Diego cells:
https://lists.cloudfoundry.org/archives/list/cf-dev(a)lists.cloudfoundry.org/thread/VBDM2TMHQSOFILSHRCV4G2CCPRBP5WKA/#VBDM2TMHQSOFILSHRCV4G2CCPRBP5WKA

Thanks,
Eric



On Wed, Nov 18, 2015 at 6:52 AM, Tom Sherrod <tom.sherrod(a)gmail.com>
wrote:

diego release 0.1398.0

After a couple of weeks of dev, the cells end up filling their disks. Did
I miss a clean up job somewhere?
Currently, once pushes start failing, I get bosh to recreate the machine.

Other options?

Thanks,
Tom


Re: Garden Port Assignment Story

Mike Youngstrom
 

Yes Will, that summary is essentially correct. But, for even more clarity,
let me restate the complete story and the reason I want 92085170 to work
across stemcell upgrades. :)

Today, if NATS goes down, after 2 minutes the routers will drop their routing
tables and my entire CF deployment goes down. The routers behave this way
because of an experience Dieu had [0]. I don't like this; I would prefer
that routers not drop their routing tables when they cannot connect to NATS.
Therefore, the routing team is adding 'prune_on_config_unavailable'. I
plan to set this to false to make my deployment less sensitive to NATS
failure. In doing so I am incurring more risk of misrouting due to stale
routes. I am hoping that 92085170 will help reduce some of that risk. Since
one of the times I personally experienced stale-route misrouting was during
a deploy, I hope that Garden will consider a port selection technique that
helps ensure uniqueness across stemcell upgrades, something we frequently
do as part of a deploy.

Accordingly, a stateless solution like random assignment or a consistent
hash would work across stemcell upgrades.
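
To make the stateless idea concrete, here is a rough illustration (purely
for discussion, not a proposed implementation) of deriving a host port from
the app guid so it stays stable across cell rebuilds:

  # Hypothetical: map an app guid into a 5000-port range starting at 61001.
  app_guid="f3b8a2c0-0000-0000-0000-000000000000"   # placeholder guid
  hash=$(( 16#$(echo -n "$app_guid" | md5sum | cut -c1-8) ))
  port=$(( 61001 + hash % 5000 ))
  echo "$port"
  # Collisions would still need handling, e.g. by probing upward from here.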

Thanks,
Mike

[0]
https://groups.google.com/a/cloudfoundry.org/d/msg/vcap-dev/yuVYCZkMLG8/7t8FHnFzWEsJ

On Tue, Nov 24, 2015 at 3:44 AM, Will Pragnell <wpragnell(a)pivotal.io> wrote:

Hi Mike,

What I think you're saying is that once the new
`prune_on_config_unavailable` property is available in the router, and if
it's set to `false`, there's a case when NATs is not reachable from the
router in which potentially stale routes will continue to exist until the
router can reach NATs again. Is that correct?

(Sorry to repeat you back at yourself, just want to make sure I've
understood you correctly.)

Will

On 23 November 2015 at 19:02, Mike Youngstrom <youngm(a)gmail.com> wrote:

Hi Will,

Though I see the main reason for the issue assuming a healthy running
environment, I've also experienced a deploy-related issue that more unique
port assignment could help defend against. During one of our deploys the
routers finished deploying before the DEAs. When the DEAs started rolling,
for some reason some of our routers stopped getting route updates from
NATS. This caused their route tables to go stale, and as apps started
rolling, new apps started getting assigned ports previously held by other
apps, which caused a number of our hosts to be misrouted.

Though the root cause was probably some bug in the NATS client in
GoRouter, the runtime team had apparently experienced a similar issue in the
past [0], which caused them to implement code that would delete stale routes
even when a router couldn't connect to NATS. The Router team is now
planning to optionally remove this failsafe [1]. I'm hoping that with the
removal of this failsafe (which I'm planning to take advantage of) this
tracker story will help keep the problem we experienced before from
happening again.

If the ports simply reset on a stemcell upgrade this issue provides no
defense for the problem we had before.

Does that make sense Will?

Mike

[0]
https://groups.google.com/a/cloudfoundry.org/d/msg/vcap-dev/yuVYCZkMLG8/7t8FHnFzWEsJ
[1] https://www.pivotaltracker.com/story/show/108659764

On Mon, Nov 23, 2015 at 11:11 AM, Will Pragnell <wpragnell(a)pivotal.io>
wrote:

Hi Mike,

What's the motivation for wanting rolling port assignment to persist
across e.g. stemcell upgrade? The motivation for this story is to prevent
stale routes from sending traffic to the wrong containers. Our assumption
is that stale routes won't ever exist for anything close to the amount of
time it takes BOSH to destroy and recreate a VM. Have we missed something
in making that assumption?

On your second point, I see your concern. We've talked about the
possibility of implementing FIFO semantics on free ports (when a port that
was in use becomes free, it goes to the end of the queue of available
ports) to decrease the chances of traffic reaching the wrong container as
far as possible. It's possible that the rolling ports approach is "good
enough" though. We're still trying to understand whether that's actually
the case.

The consistent hashing idea is interesting, but a few folks have
suggested that with a relatively small range of available ports (5000 by
default) the chances of collision are actually higher than we'd want.
I'll see if someone wants to lay down some maths to give that idea some
credence.

Cheers,
Will

On 23 November 2015 at 08:47, Mike Youngstrom <youngm(a)gmail.com> wrote:

Since I cannot comment in tracker I'm starting this thread to discuss
story:
https://www.pivotaltracker.com/n/projects/1158420/stories/92085170

Some comments I have:

* Although I can see how a rolling port assignment could be maintained
across garden/diego restarts I'd also like the story to ensure that the
rolling port assignments get maintained across a Stemcell upgrade without
the need for persistent disks on each cell. Perhaps etcd?

* Another thing to keep in mind: although a rolling port value may not
duplicate ports 100% of the time for a short-lived container, for a long-lived
container it seems to me that a rolling port assignment becomes no more
successful than a random port assignment if the container lives long enough
for the port assignment loop to wrap around a few times.

* Has there been any consideration of using an incremental consistent
hash of the app_guid to assign ports? A consistent hash would have the
benefit of being stateless. It would also have the benefit of increasing the
likelihood that if a request is sent to a stale route it may go to the
correct app anyway.

Thoughts?

Mike


SSO Kerberos with Spring

Leumas Yajiv
 

Hi.

I am trying to integrate application (Java/Spring) authentication with an SSO through Kerberos. Has anyone done this before? I am using Tomcat 8 and OpenJDK 7 with java-buildpack version 3.0, and I am using the r170 release of CF.

I have gone through this documentation, https://spring.io/blog/2009/09/28/spring-security-kerberos-spnego-extension, but I fail to understand how that will work on CF.

ooo Leuma


Re: Garden Port Assignment Story

Will Pragnell <wpragnell@...>
 

Hi Mike,

What I think you're saying is that once the new
`prune_on_config_unavailable` property is available in the router, and if
it's set to `false`, there's a case when NATs is not reachable from the
router in which potentially stale routes will continue to exist until the
router can reach NATs again. Is that correct?

(Sorry to repeat you back at yourself, just want to make sure I've
understood you correctly.)

Will

On 23 November 2015 at 19:02, Mike Youngstrom <youngm(a)gmail.com> wrote:

Hi Will,

Though I see the main reason for the issue assuming a healthy running
environment, I've also experienced a deploy-related issue that more unique
port assignment could help defend against. During one of our deploys the
routers finished deploying before the DEAs. When the DEAs started rolling,
for some reason some of our routers stopped getting route updates from
NATS. This caused their route tables to go stale, and as apps started
rolling, new apps started getting assigned ports previously held by other
apps, which caused a number of our hosts to be misrouted.

Though the root cause was probably some bug in the NATS client in GoRouter,
the runtime team had apparently experienced a similar issue in the past [0],
which caused them to implement code that would delete stale routes even
when a router couldn't connect to NATS. The Router team is now planning to
optionally remove this failsafe [1]. I'm hoping that with the removal of
this failsafe (which I'm planning to take advantage of) this tracker story
will help keep the problem we experienced before from happening again.

If the ports simply reset on a stemcell upgrade this issue provides no
defense for the problem we had before.

Does that make sense Will?

Mike

[0]
https://groups.google.com/a/cloudfoundry.org/d/msg/vcap-dev/yuVYCZkMLG8/7t8FHnFzWEsJ
[1] https://www.pivotaltracker.com/story/show/108659764

On Mon, Nov 23, 2015 at 11:11 AM, Will Pragnell <wpragnell(a)pivotal.io>
wrote:

Hi Mike,

What's the motivation for wanting rolling port assignment to persist
across e.g. stemcell upgrade? The motivation for this story is to prevent
stale routes from sending traffic to the wrong containers. Our assumption
is that stale routes won't ever exist for anything close to the amount of
time it takes BOSH to destroy and recreate a VM. Have we missed something
in making that assumption?

On your second point, I see your concern. We've talked about the
possibility of implementing FIFO semantics on free ports (when a port that
was in use becomes free, it goes to the end of the queue of available
ports) to decrease the chances of traffic reaching the wrong container as
far as possible. It's possible that the rolling ports approach is "good
enough" though. We're still trying to understand whether that's actually
the case.

The consistent hashing idea is interesting, but a few folks have
suggested that with a relatively small range of available ports (5000 by
default) the chances of collision are actually higher than we'd want.
I'll see if someone wants to lay down some maths to give that idea some
credence.

Cheers,
Will

On 23 November 2015 at 08:47, Mike Youngstrom <youngm(a)gmail.com> wrote:

Since I cannot comment in tracker I'm starting this thread to discuss
story:
https://www.pivotaltracker.com/n/projects/1158420/stories/92085170

Some comments I have:

* Although I can see how a rolling port assignment could be maintained
across garden/diego restarts I'd also like the story to ensure that the
rolling port assignments get maintained across a Stemcell upgrade without
the need for persistent disks on each cell. Perhaps etcd?

* Another thing to keep in mind: although a rolling port value may not
duplicate ports 100% of the time for a short-lived container, for a long-lived
container it seems to me that a rolling port assignment becomes no more
successful than a random port assignment if the container lives long enough
for the port assignment loop to wrap around a few times.

* Has there been any consideration of using an incremental consistent
hash of the app_guid to assign ports? A consistent hash would have the
benefit of being stateless. It would also have the benefit of increasing the
likelihood that if a request is sent to a stale route it may go to the
correct app anyway.

Thoughts?

Mike


Re: CF-RELEASE v202 UPLOAD ERROR

Parthiban Annadurai <senjiparthi@...>
 

Okay.. Let me try with it.. Thanks..

On 24 November 2015 at 14:02, ronak banka <ronakbanka.cse(a)gmail.com> wrote:

Subnet ranges on which your other components are provisioned.

allow_from_entries:
- 192.168.33.0/24




On Tue, Nov 24, 2015 at 5:16 PM, Parthiban Annadurai <
senjiparthi(a)gmail.com> wrote:

Hello Ronak,
Actually, I had previously given values for ALLOW_FROM_ENTRIES; only after
seeing some mailing list threads did I change it to NULL. Could you tell me
which IP I need to give there, or something else?
Thanks..

On 24 November 2015 at 13:23, ronak banka <ronakbanka.cse(a)gmail.com>
wrote:

Hi Parthiban,

In your manifest, there is a global property block:

nfs_server:
  address: 192.168.33.53
  allow_from_entries:
  - null
  - null
  share: null

allow_from_entries is provided for the cc individual property and not for the actual debian nfs server; that is a possible reason cc is not able to write to NFS.


https://github.com/cloudfoundry/cf-release/blob/master/jobs/debian_nfs_server/spec#L20

Thanks
Ronak



On Tue, Nov 24, 2015 at 3:42 PM, Parthiban Annadurai <
senjiparthi(a)gmail.com> wrote:

Thanks Amit for your quick reply. FYI, I have shared my deployment
manifest too. I have been stuck on this issue for the past couple of weeks. Thanks..

On 24 November 2015 at 12:00, Amit Gupta <agupta(a)pivotal.io> wrote:

Hi Parthiban,

Sorry to hear your deployment is still getting stuck. As Warren
points out, based on your log output, it looks like an issue with NFS
configuration. I will ask the CAPI team, who are experts on cloud
controller and NFS server, to take a look at your question.

Best,
Amit

On Thu, Nov 19, 2015 at 8:11 PM, Parthiban Annadurai <
senjiparthi(a)gmail.com> wrote:

Thanks for your suggestions Warren. I am attaching the manifest file I
used for the deployment. I also suspect that the problem is with the NFS
server configuration.

On 19 November 2015 at 22:32, Warren Fernandes <wfernandes(a)pivotal.io>
wrote:
Hey Parthiban,

It seems that there may be a misconfiguration in your manifest.
Did you configure the nfs_server properties?


https://github.com/cloudfoundry/cf-release/blob/master/templates/cf-jobs.yml#L19-L22

The api_z1 pulls the above properties in here.
https://github.com/cloudfoundry/cf-release/blob/master/templates/cf-jobs.yml#L368
.

Is it possible to share your manifest with us via a gist or
attachment? Please remove any sensitive information like passwords, certs
and keys.

Thanks.


Re: CF-RELEASE v202 UPLOAD ERROR

Ronak Banka
 

Subnet ranges on which your other components are provisioned.

allow_from_entries:
- 192.168.33.0/24
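
Putting it together, the relevant block in the manifest would end up looking
roughly like this (assuming your components are on 192.168.33.0/24; adjust
the subnet and share for your environment):

  nfs_server:
    address: 192.168.33.53
    allow_from_entries:
    - 192.168.33.0/24
    share: null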




On Tue, Nov 24, 2015 at 5:16 PM, Parthiban Annadurai <senjiparthi(a)gmail.com>
wrote:

Hello Ronak,
Actually, I had previously given values for ALLOW_FROM_ENTRIES; only after
seeing some mailing list threads did I change it to NULL. Could you tell me
which IP I need to give there, or something else?
Thanks..

On 24 November 2015 at 13:23, ronak banka <ronakbanka.cse(a)gmail.com>
wrote:

Hi Parthiban,

In your manifest, there is a global property block:

nfs_server:
  address: 192.168.33.53
  allow_from_entries:
  - null
  - null
  share: null

allow_from_entries is provided for the cc individual property and not for the actual debian nfs server; that is a possible reason cc is not able to write to NFS.


https://github.com/cloudfoundry/cf-release/blob/master/jobs/debian_nfs_server/spec#L20

Thanks
Ronak



On Tue, Nov 24, 2015 at 3:42 PM, Parthiban Annadurai <
senjiparthi(a)gmail.com> wrote:

Thanks Amit for your quick reply. FYI, I have shared my deployment
manifest too. I have been stuck on this issue for the past couple of weeks. Thanks..

On 24 November 2015 at 12:00, Amit Gupta <agupta(a)pivotal.io> wrote:

Hi Parthiban,

Sorry to hear your deployment is still getting stuck. As Warren points
out, based on your log output, it looks like an issue with NFS
configuration. I will ask the CAPI team, who are experts on cloud
controller and NFS server, to take a look at your question.

Best,
Amit

On Thu, Nov 19, 2015 at 8:11 PM, Parthiban Annadurai <
senjiparthi(a)gmail.com> wrote:

Thanks for your suggestions Warren. I am attaching the manifest file I
used for the deployment. I also suspect that the problem is with the NFS
server configuration.

On 19 November 2015 at 22:32, Warren Fernandes <wfernandes(a)pivotal.io>
wrote:

Hey Parthiban,

It seems that there may be a misconfiguration in your manifest.
Did you configure the nfs_server properties?


https://github.com/cloudfoundry/cf-release/blob/master/templates/cf-jobs.yml#L19-L22

The api_z1 pulls the above properties in here.
https://github.com/cloudfoundry/cf-release/blob/master/templates/cf-jobs.yml#L368
.

Is it possible to share your manifest with us via a gist or
attachment? Please remove any sensitive information like passwords, certs
and keys.

Thanks.


Re: CF-RELEASE v202 UPLOAD ERROR

Parthiban Annadurai <senjiparthi@...>
 

Hello Ronak,
Actually, I had previously given values for ALLOW_FROM_ENTRIES; only after
seeing some mailing list threads did I change it to NULL. Could you tell me
which IP I need to give there, or something else?
Thanks..

On 24 November 2015 at 13:23, ronak banka <ronakbanka.cse(a)gmail.com> wrote:

Hi Parthiban,

In your manifest, there is a global property block:

nfs_server:
  address: 192.168.33.53
  allow_from_entries:
  - null
  - null
  share: null

allow_from_entries is provided for the cc individual property and not for the actual debian nfs server; that is a possible reason cc is not able to write to NFS.


https://github.com/cloudfoundry/cf-release/blob/master/jobs/debian_nfs_server/spec#L20

Thanks
Ronak



On Tue, Nov 24, 2015 at 3:42 PM, Parthiban Annadurai <
senjiparthi(a)gmail.com> wrote:

Thanks Amit for your quick reply. FYI, I have shared my deployment
manifest too. I have been stuck on this issue for the past couple of weeks. Thanks..

On 24 November 2015 at 12:00, Amit Gupta <agupta(a)pivotal.io> wrote:

Hi Parthiban,

Sorry to hear your deployment is still getting stuck. As Warren points
out, based on your log output, it looks like an issue with NFS
configuration. I will ask the CAPI team, who are experts on cloud
controller and NFS server, to take a look at your question.

Best,
Amit

On Thu, Nov 19, 2015 at 8:11 PM, Parthiban Annadurai <
senjiparthi(a)gmail.com> wrote:

Thanks for your suggestions Warren. I am attaching the manifest file I
used for the deployment. I also suspect that the problem is with the NFS
server configuration.

On 19 November 2015 at 22:32, Warren Fernandes <wfernandes(a)pivotal.io>
wrote:

Hey Parthiban,

It seems that there may be a misconfiguration in your manifest.
Did you configure the nfs_server properties?


https://github.com/cloudfoundry/cf-release/blob/master/templates/cf-jobs.yml#L19-L22

The api_z1 pulls the above properties in here.
https://github.com/cloudfoundry/cf-release/blob/master/templates/cf-jobs.yml#L368
.

Is it possible to share your manifest with us via a gist or
attachment? Please remove any sensitive information like passwords, certs
and keys.

Thanks.


Re: CF-RELEASE v202 UPLOAD ERROR

Ronak Banka
 

Hi Parthiban,

In your manifest, there is a global property block:

nfs_server:
  address: 192.168.33.53
  allow_from_entries:
  - null
  - null
  share: null

allow_from_entries is provided for the cc individual property and not for
the actual debian nfs server; that is a possible reason cc is not able to
write to NFS.

https://github.com/cloudfoundry/cf-release/blob/master/jobs/debian_nfs_server/spec#L20

Thanks
Ronak



On Tue, Nov 24, 2015 at 3:42 PM, Parthiban Annadurai <senjiparthi(a)gmail.com>
wrote:

Thanks Amit for your quick reply. FYI, I have shared my deployment
manifest too. I have been stuck on this issue for the past couple of weeks. Thanks..

On 24 November 2015 at 12:00, Amit Gupta <agupta(a)pivotal.io> wrote:

Hi Parthiban,

Sorry to hear your deployment is still getting stuck. As Warren points
out, based on your log output, it looks like an issue with NFS
configuration. I will ask the CAPI team, who are experts on cloud
controller and NFS server, to take a look at your question.

Best,
Amit

On Thu, Nov 19, 2015 at 8:11 PM, Parthiban Annadurai <
senjiparthi(a)gmail.com> wrote:

Thanks for your suggestions Warren. I am attaching the manifest file I
used for the deployment. I also suspect that the problem is with the NFS
server configuration.

On 19 November 2015 at 22:32, Warren Fernandes <wfernandes(a)pivotal.io>
wrote:

Hey Parthiban,

It seems that there may be a misconfiguration in your manifest.
Did you configure the nfs_server properties?


https://github.com/cloudfoundry/cf-release/blob/master/templates/cf-jobs.yml#L19-L22

The api_z1 pulls the above properties in here.
https://github.com/cloudfoundry/cf-release/blob/master/templates/cf-jobs.yml#L368
.

Is it possible to share your manifest with us via a gist or attachment?
Please remove any sensitive information like passwords, certs and keys.

Thanks.


Re: CF-RELEASE v202 UPLOAD ERROR

Parthiban Annadurai <senjiparthi@...>
 

Thanks Amit for your quick reply. FYI, I have shared my deployment
manifest too. I have been stuck on this issue for the past couple of weeks. Thanks..

On 24 November 2015 at 12:00, Amit Gupta <agupta(a)pivotal.io> wrote:

Hi Parthiban,

Sorry to hear your deployment is still getting stuck. As Warren points
out, based on your log output, it looks like an issue with NFS
configuration. I will ask the CAPI team, who are experts on cloud
controller and NFS server, to take a look at your question.

Best,
Amit

On Thu, Nov 19, 2015 at 8:11 PM, Parthiban Annadurai <
senjiparthi(a)gmail.com> wrote:

Thanks for your suggestions Warren. I am attaching the manifest file I
used for the deployment. I also suspect that the problem is with the NFS
server configuration.

On 19 November 2015 at 22:32, Warren Fernandes <wfernandes(a)pivotal.io>
wrote:

Hey Parthiban,

It seems that there may be a misconfiguration in your manifest.
Did you configure the nfs_server properties?


https://github.com/cloudfoundry/cf-release/blob/master/templates/cf-jobs.yml#L19-L22

The api_z1 pulls the above properties in here.
https://github.com/cloudfoundry/cf-release/blob/master/templates/cf-jobs.yml#L368
.

Is it possible to share your manifest with us via a gist or attachment?
Please remove any sensitive information like passwords, certs and keys.

Thanks.


Re: CF-RELEASE v202 UPLOAD ERROR

Amit Kumar Gupta
 

Hi Parthiban,

Sorry to hear your deployment is still getting stuck. As Warren points
out, based on your log output, it looks like an issue with NFS
configuration. I will ask the CAPI team, who are experts on cloud
controller and NFS server, to take a look at your question.

Best,
Amit

On Thu, Nov 19, 2015 at 8:11 PM, Parthiban Annadurai <senjiparthi(a)gmail.com>
wrote:

Thanks for your suggestions Warren. I am attaching the manifest file I used
for the deployment. I also suspect that the problem is with the NFS server
configuration.

On 19 November 2015 at 22:32, Warren Fernandes <wfernandes(a)pivotal.io>
wrote:

Hey Parthiban,

It seems that there may be a misconfiguration in your manifest.
Did you configure the nfs_server properties?


https://github.com/cloudfoundry/cf-release/blob/master/templates/cf-jobs.yml#L19-L22

The api_z1 pulls the above properties in here.
https://github.com/cloudfoundry/cf-release/blob/master/templates/cf-jobs.yml#L368
.

Is it possible to share your manifest with us via a gist or attachment?
Please remove any sensitive information like passwords, certs and keys.

Thanks.
