Re: A hanged etcd used by hm9000 makes an impact on the delayed detection time of crashed application instances

Gwenn Etourneau
 

Just one question: what about moving to Diego to get rid of HM9000 / DEA?

On Tue, Dec 15, 2015 at 9:44 PM, Masumi Ito <msmi10f(a)gmail.com> wrote:

Hi,

I found that a hung etcd VM delayed the detection of crashed
application instances, resulting in a slow recovery time. Although this
depended on which hm9000 processes were connected to which etcd VM, it
took up to approximately 15 min to recover, which I think is too long.

Does anyone know how to calculate the time it takes for hm9000 to detect a
hung etcd VM and switch to a healthy etcd? I have encountered two different
scenarios, as follows.

1. The hm9000 analyzer was connected to the hung etcd, while the hm9000
listener was connected to a healthy etcd. (About 8 min for the analyzer to
recover; the other hm9000 analyzer took over instead.)
The analyzer seemed to hang just after the etcd it was connected to hung,
because "Analyzer completed succesfully" was not found in the log.
After approximately 8 min, the other hm9000 analyzer acquired the
lock and started to work instead. It then identified the crashed instance
and enqueued a start message. The crashed app was relaunched within ten min
of the detection.

2. The hm9000 analyzer was connected to a healthy etcd, while the hm9000
listener was connected to the hung etcd. (About 15 min for the listener to
recover; the same hm9000 listener seemed to recover somehow.)
The listener started failing to sync heartbeats just after the etcd it was
connected to hung. After 15 min, "Save took too long. Not bumping freshness."
appeared in the listener's log, and then the analyzer also complained about
the stale actual state: "Analyzer failed with error - Error:Actual state is not
fresh" and stopped analyzing. After 10 sec the hm9000 listener
recovered somehow and started to bump freshness periodically; the analyzer
then resumed analyzing the actual state and desired state and raised the
request to start the crashed instance.

Regards,
Masumi



--
View this message in context:
http://cf-dev.70369.x6.nabble.com/A-hanged-etcd-used-by-hm9000-makes-an-impact-on-the-delayed-detection-time-of-crashed-application-ins-tp3096.html
Sent from the CF Dev mailing list archive at Nabble.com.


Re: [abacus] Accommodating for plans in a resource config

Benjamin Cheng
 

Not sure about that. AIUI with that refined design plans can now use
different metrics so usage gets aggregated at the plan level rather than
the resource level (as it wouldn't make sense to aggregate usage from
different plans metered using different metrics). That means that the
aggregation, summary and charge functions only apply to the plan level
rather than the resource level.


Assuming that my above statement that 'aggregation, summary and charge
functions only apply to the plan level' is correct, there's no 'common
section' anymore, so no problem with processing usage in that non-existent
common section anymore :) Makes sense?

Yes, I agree that it makes sense. We wouldn't want to deal with metrics/measures that exist in specific plans being mixed and matched with each other unless a real need pops up.


Re: [cf-bosh] PLEASE READ: BOSH-Lite stemcell broken

Dmitriy Kalinin <dkalinin@...>
 

We believe 3146 addresses the problem.

On Dec 15, 2015, at 1:52 PM, Aristoteles Neto <dds.neto(a)gmail.com> wrote:

I see there is a new version (3146).

Does that address the issues relating to 3126? Or should we stick with 2776 until advised otherwise?

Aristoteles Neto
dds.neto(a)gmail.com



On 1/12/2015, at 8:43, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey all,

TL;DR: Please DO NOT use BOSH-Lite stemcell 3126 for local development or CI; use the previous version, 2776, instead.

This issue has been announced multiple times before, but people are still hitting it. Unfortunately it's a hard issue to diagnose, and by the time it happens you might not remember this email, but several different users and core CF development teams have sunk a fair bit of time tripping over this so apologies for the alarmist subject line.

BOSH-Lite stemcell 3126 is not compatible with consul-release. This only affects certain use cases, but it's recommended to err on the side of caution and not use this stemcell. Use 2776 for local development, and if you're deploying to BOSH-Lite in your CI pipelines, and you're pulling in the latest stemcell, make sure you don't pull in 3126. If you're using Concourse for CI, it's very simple to disable a particular version of a resource.

If you'd like a more nuanced explanation of what the issue is, and whether or not it's likely to affect your use case, please feel free to ask.

Best,
Amit


Re: How to estimate reconnection / failover time between gorouter and nats

Christopher Piraino <cpiraino@...>
 

Hi Masumi,

The sequence and estimate that you describe sound accurate to us. Ideally,
we should configure the NATS reconnection logic to initiate a
reconnect before the stale_threshold value is reached. We have put a story in our
icebox <https://www.pivotaltracker.com/story/show/110199022> for our PM to
prioritize.

We also have some upcoming work around being able to configure the router
not to prune routes when NATS is down. See this issue
<https://github.com/cloudfoundry/gorouter/issues/102> on the GoRouter for
related discussion.

Chris and Shash - CF Routing Team

On Mon, Dec 7, 2015 at 8:28 AM, Masumi Ito <msmi10f(a)gmail.com> wrote:

Hi,

Can anyone explain the expected reconnection / failover time for
gorouter when one of the NATS VMs hangs unexpectedly?

The background to this question is that the gorouter temporarily
returned "404 Not found Err" for app requests when
one of the clustered NATS servers was unresponsive. This started after about
2 min and recovered after another 2-3 min. I understand it is mainly due to
the pruning of stale routes and the time the gorouter takes to reconnect /
fail over to a healthy NATS. The first 2 min can be explained by the
droplet_stale_threshold value. However, I am wondering what exactly happened
in the following 2-3 min.

Note that the BOSH health monitor detected the unresponsive NATS and
eventually recreated it; however, the gorouter had received "router.register"
from the DEAs before the recreation was complete. Therefore I think this
indicates a failover to the other NATS rather than a reconnection to the
recreated NATS that was previously down.

I believe some connection parameters in the yagnats and apcera/nats clients
are key to this.

- Timeout: timeout for creating a new connection
- ReconnectWait: wait time before a reconnect is attempted
- MaxReconnect: maximum number of reconnect attempts (unlimited if this
value is -1)
- PingInterval: interval between pings that check whether a connection is
healthy
- MaxPingOut: number of unanswered pings before a reconnection is deemed
necessary

1. When one of the NATS servers hangs, the connection might still exist until
the TCP timeout has been reached.

2. The ping timer periodically sends pings to check whether the connection is
stale; after (PingInterval * MaxPingOut) in total it concludes that it is
necessary to reconnect to the next NATS server.

3. Before reconnecting, the gorouter waits for ReconnectWait.

4. A new connection to the next NATS server is created within Timeout.

5. After that, the gorouter starts to register app routes from the DEAs
through the connected NATS.

Therefore my rough estimate is:
PingInterval (2 min) * MaxPingOut (2) + ReconnectWait (500 ms) +
Timeout (2 sec)

I would appreciate it if someone could correct this rough explanation or give
some more details.
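
The rough estimate above can be written out as a short calculation. This is
only a sketch in Python: the function name is hypothetical, and the values are
the ones quoted in this message, not verified gorouter or NATS client defaults.

```python
from datetime import timedelta


def estimate_failover(ping_interval, max_ping_out, reconnect_wait, timeout):
    """Rough upper bound: detect the stale connection, wait, then connect."""
    detection = ping_interval * max_ping_out  # pings that must go unanswered
    return detection + reconnect_wait + timeout


estimate = estimate_failover(
    ping_interval=timedelta(minutes=2),
    max_ping_out=2,
    reconnect_wait=timedelta(milliseconds=500),
    timeout=timedelta(seconds=2),
)
print(estimate.total_seconds())  # 242.5 seconds, i.e. roughly 4 min
```

With these inputs the ping-based detection term dominates; the reconnect wait
and connection timeout contribute only 2.5 seconds.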

Regards,
Masumi



--
View this message in context:
http://cf-dev.70369.x6.nabble.com/How-to-estimate-reconnection-failover-time-between-gorouter-and-nats-tp2980.html
Sent from the CF Dev mailing list archive at Nabble.com.


Re: [cf-bosh] PLEASE READ: BOSH-Lite stemcell broken

Aristoteles Neto
 

I see there is a new version (3146).

Does that address the issues relating to 3126? Or should we stick with 2776 until advised otherwise?

Aristoteles Neto
dds.neto(a)gmail.com

On 1/12/2015, at 8:43, Amit Gupta <agupta(a)pivotal.io> wrote:

Hey all,

TL;DR: Please DO NOT use BOSH-Lite stemcell 3126 for local development or CI; use the previous version, 2776, instead.

This issue has been announced multiple times before, but people are still hitting it. Unfortunately it's a hard issue to diagnose, and by the time it happens you might not remember this email, but several different users and core CF development teams have sunk a fair bit of time tripping over this so apologies for the alarmist subject line.

BOSH-Lite stemcell 3126 is not compatible with consul-release. This only affects certain use cases, but it's recommended to err on the side of caution and not use this stemcell. Use 2776 for local development, and if you're deploying to BOSH-Lite in your CI pipelines, and you're pulling in the latest stemcell, make sure you don't pull in 3126. If you're using Concourse for CI, it's very simple to disable a particular version of a resource.

If you'd like a more nuanced explanation of what the issue is, and whether or not it's likely to affect your use case, please feel free to ask.

Best,
Amit


[ANNOUNCE] CVE-2015-5350: Garden Nstar vulnerability

Chip Childers <cchilders@...>
 

CVE-2015-5350: Garden Nstar vulnerability

Severity:

High

Vendor:

Cloud Foundry Foundation

Versions Affected:

Garden versions 0.22.0-0.329.0

Description:

A vulnerability has been discovered in the garden-linux nstar executable
that allows access to files on the host system. By staging an application
on Cloud Foundry using Diego and Garden installations with a malicious
custom buildpack an end user could read files on the host system that the
BOSH-created vcap user has permissions to read and then package them into
their app droplet.

Affected Cloud Foundry Products and Versions:

- All Garden versions prior to v0.330.0

Mitigation:

- The Cloud Foundry project recommends that Cloud Foundry deployments
using Diego and Garden upgrade to Garden Linux Release v0.330.0 or higher.
Diego release v0.1444.0 includes Garden Linux v0.330.0.

Credit:

Julian Friedman
Will Pragnell
Eric Malm

References:

Cloud Foundry:

* Garden-Linux-Release
<https://github.com/cloudfoundry-incubator/garden-linux-release>

* Diego-Release <https://github.com/cloudfoundry-incubator/diego-release>


Failing to push standalone java app

Rahul Gupta
 

Hi,

I am trying to push a standalone Java app that has a 'public static void main(..)' and uses other dependencies. I tried setting the classpath in the jar's MANIFEST.MF, and I also created a new jar that contains the dependent jars in its root and pushed that, but neither helped: 'cf push -p xxxxxxx.jar' fails while resolving runtime dependencies,

e.g. ERR Exception in thread "main" java.lang.NoClassDefFoundError: com/XXX/client/AbcXyz


Here is the content of manifest.mf:
Manifest-Version: 1.0
Archiver-Version: Plexus Archiver
Built-By: smokingfly
Class-Path: XXX-123.jar AAA.789.jar
Created-By: Apache Maven 3.2.3
Build-Jdk: 1.8.0_40
Main-Class: com.cf.samples.TestClient

TestClient is the class with main method.

I could not find any documentation that could help me with this. Could someone please help?

Many thanks.


Re: [abacus] Accommodating for plans in a resource config

Jean-Sebastien Delfino
 

On Fri, Dec 11, 2015 at 4:47 PM, Benjamin Cheng <bscheng(a)us.ibm.com> wrote:

Abacus will want to support plans in its resource config (as mentioned in
issue #153 https://github.com/cloudfoundry-incubator/cf-abacus/issues/153)

Starting with a basic approach, there would be a plans property (an array)
added to the top level of a resource config. The current metrics and
measures properties would be moved under that plans property. This will
allow them to be scoped to a plan.

+1 that makes sense to me as different plans may want to use different
measures, metrics, and metering, accumulation and aggregation functions.
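
As an illustration of the proposed move, here is a minimal Python sketch that
takes a config with top-level measures and metrics and nests them under a
plans array. The field names (resource_id, plan_id) and the sample values are
assumptions made for the example, not the actual Abacus schema.

```python
import copy


def scope_to_plans(resource_config, plan_id="basic"):
    """Return a copy with top-level measures/metrics moved under 'plans'."""
    config = copy.deepcopy(resource_config)  # leave the input untouched
    plan = {
        "plan_id": plan_id,  # hypothetical identifier for the single plan
        "measures": config.pop("measures", []),
        "metrics": config.pop("metrics", []),
    }
    config["plans"] = [plan]
    return config


# Hypothetical resource config in the current (pre-plans) shape.
old_config = {
    "resource_id": "object-storage",
    "measures": [{"name": "storage", "unit": "BYTE"}],
    "metrics": [{"name": "storage", "unit": "GIGABYTE"}],
}
new_config = scope_to_plans(old_config)
print(new_config)
```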


Despite moving metrics and measures under plans, there will be a need for a
common set of measures/metrics for plans to fall back on. This comes into
play in the report, for example, when summary/charge functions are running on
aggregated usage across all plans.

Not sure about that. AIUI with that refined design plans can now use
different metrics so usage gets aggregated at the plan level rather than
the resource level (as it wouldn't make sense to aggregate usage from
different plans metered using different metrics). That means that the
aggregation, summary and charge functions only apply to the plan level
rather than the resource level.



In terms of the common section, there's a choice between leaving
measures/metrics on the top level as the common/default or putting those
under a different property name.

I think there are a couple of things to consider here:
- Defaulting to the common section for a plan if there is no formula
defined. This may require the plan to point to the common section, or logic
that would automatically default to the common section (and subsequently
the absolute resource config defaults that are already in place).
- If there's no plan id passed (for example in some of the charge/summary
calls), they would need to go to this common section.

Assuming that my above statement that 'aggregation, summary and charge
functions only apply to the plan level' is correct, there's no 'common
section' anymore, so no problem with processing usage in that non-existent
common section anymore :) Makes sense?


Thoughts/Concerns/Suggestions?
- Jean-Sebastien


Re: Organization quota definition-questions

Juan Antonio Breña Moral <bren at juanantonio.info...>
 

Sorry, I missed some of the questions in my earlier reply.

1. I didn't test it. In my tests, I defined a quota at the org level, but I will test it.
2. I answered with the pseudocode.
3. The space acquired the limits defined in the quota for the organization.

Juan Antonio


Re: Organization quota definition-questions

Juan Antonio Breña Moral <bren at juanantonio.info...>
 

Hi,

You are right. Disk quota is a parameter defined at the app level only.
http://apidocs.cloudfoundry.org/213/apps/creating_an_app.html

When you create a new app, you define a set of parameters, and one of them is the disk quota; but when you define an org quota, disk_quota is not defined at that level.
http://apidocs.cloudfoundry.org/213/organization_quota_definitions/creating_a_organization_quota_definition.html

I am not sure whether someone from Pivotal could confirm this, but I think that the CC API doesn't have that feature at the org/space level.

Anyway, at the moment it is possible to do the same thing using the API, though not in a direct way:

IDEA (pseudocode):

// API references:
// http://apidocs.cloudfoundry.org/226/apps/get_app_summary.html
// http://apidocs.cloudfoundry.org/226/apps/get_detailed_stats_for_a_started_app.html
long org_used_disk = 0;
spaces = getSpacesFromOrg(org_guid)
for each (space in spaces) {
    apps = getAppsFromSpace(space.guid)
    for each (app in apps) {
        // either getAppSummary(app.guid) or getAppStats(app.guid)
        app_stat = getAppSummary(app.guid)
        // add this app's disk quota to the running total for the org
        org_used_disk += app_stat.getDiskQuota();
    }
}
System.out.println("Disk quota for current org: " + org_used_disk);
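
The same idea as a small runnable Python sketch. The three lookup functions
are passed in as parameters and backed here by in-memory sample data, so
nothing below calls the real CC API; the `disk_quota` field name and all
GUIDs and values are assumptions for illustration.

```python
def org_disk_quota(org_guid, list_spaces, list_apps, get_app_summary):
    """Sum the per-app disk quota over every app in every space of an org."""
    total = 0
    for space_guid in list_spaces(org_guid):
        for app_guid in list_apps(space_guid):
            summary = get_app_summary(app_guid)
            total += summary["disk_quota"]  # assumed field name
    return total


# Hypothetical sample data standing in for API responses.
spaces = {"org-1": ["space-a", "space-b"]}
apps = {"space-a": ["app-1", "app-2"], "space-b": ["app-3"]}
summaries = {
    "app-1": {"disk_quota": 1024},
    "app-2": {"disk_quota": 512},
    "app-3": {"disk_quota": 2048},
}

total = org_disk_quota(
    "org-1", spaces.__getitem__, apps.__getitem__, summaries.__getitem__
)
print("Disk quota for current org:", total)  # 3584
```

Passing the lookups as functions keeps the summing logic separate from
whichever HTTP client is used to reach the API.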

Juan Antonio


A hanged etcd used by hm9000 makes an impact on the delayed detection time of crashed application instances

Masumi Ito
 

Hi,

I found that a hung etcd VM delayed the detection of crashed
application instances, resulting in a slow recovery time. Although this
depended on which hm9000 processes were connected to which etcd VM, it
took up to approximately 15 min to recover, which I think is too long.

Does anyone know how to calculate the time it takes for hm9000 to detect a
hung etcd VM and switch to a healthy etcd? I have encountered two different
scenarios, as follows.

1. The hm9000 analyzer was connected to the hung etcd, while the hm9000
listener was connected to a healthy etcd. (About 8 min for the analyzer to
recover; the other hm9000 analyzer took over instead.)
The analyzer seemed to hang just after the etcd it was connected to hung,
because "Analyzer completed succesfully" was not found in the log.
After approximately 8 min, the other hm9000 analyzer acquired the
lock and started to work instead. It then identified the crashed instance
and enqueued a start message. The crashed app was relaunched within ten min
of the detection.

2. The hm9000 analyzer was connected to a healthy etcd, while the hm9000
listener was connected to the hung etcd. (About 15 min for the listener to
recover; the same hm9000 listener seemed to recover somehow.)
The listener started failing to sync heartbeats just after the etcd it was
connected to hung. After 15 min, "Save took too long. Not bumping freshness."
appeared in the listener's log, and then the analyzer also complained about
the stale actual state: "Analyzer failed with error - Error:Actual state is not
fresh" and stopped analyzing. After 10 sec the hm9000 listener
recovered somehow and started to bump freshness periodically; the analyzer
then resumed analyzing the actual state and desired state and raised the
request to start the crashed instance.

Regards,
Masumi





Re: Organization quota definition-questions

Ponraj E
 

Hi Juan Antonio,

Thanks for the reply.

The API that you mentioned gives me the memory usage of the org, not the disk quota/usage of the org, which is the information I need. In addition, I have added a couple more questions in my latest reply.

1. Sometimes the sum of the space quota definitions exceeds the org quota definition. Is this a valid use case or a bug?
2. Currently there is no API to display the disk quota limit/usage at the org level; it's only available at the application level. How do we approach this?
3. Also, at the space level, it is possible for a space not to be associated with a space quota definition. In that case, how do we get the total resources available (like memory, services, routes) for this space?


Regards,
Ponraj


Re: Organization quota definition-questions

Juan Antonio Breña Moral <bren at juanantonio.info...>
 

Good morning,

Yes, it is possible.

If you look at the PWS or Bluemix dashboard, you can see that information.

Every organization is bound to an organization quota definition, and this definition applies to every application deployed in any space bound to that organization.

The REST method used to get the definition is:

http://apidocs.cloudfoundry.org/213/organization_quota_definitions/retrieve_a_particular_organization_quota_definition.html

The method to read the memory used is:

http://apidocs.cloudfoundry.org/222/organizations/retrieving_organization_memory_usage.html

You have an example here:
https://github.com/prosociallearnEU/cf-nodejs-dashboard/blob/master/services/HomeService.js#L69-L79

Remember that the memory used is the active memory. You can have many applications staged but stopped; memory is added to the counter only when you start an application, adding it to the set of applications running in a space.

Juan Antonio


Reply: Re: about consul_agent's cert

于长江 <yuchangjiang at cmss.chinamobile.com...>
 

It works, thank you.




于长江
15101057694


Original message
From: Gwenn Etourneau <getourneau(a)pivotal.io>
To: Discussions about Cloud Foundry projects and the system overall. <cf-dev(a)lists.cloudfoundry.org>
Sent: Monday, December 14, 2015, 12:48
Subject: [cf-dev] Re: about consul_agent's cert


Please read the documentation: http://docs.cloudfoundry.org/deploying/common/consul-security.html

On Mon, Dec 14, 2015 at 11:35 AM, 于长江 <yuchangjiang(a)cmss.chinamobile.com> wrote:

Hi,
When I deployed cf-release, the consul agent job failed to start. I found this error log on the VM:


== Starting Consul agent...
== Error starting agent: Failed to start Consul server: Failed to parse any CA certificates
--------------------------------------------
Then I found that the configuration in the CF manifest file is not correct. It looks like this:


consul:
  encrypt_keys:
  - CONSUL_ENCRYPT_KEY
  ca_cert: CONSUL_CA_CERT
  server_cert: CONSUL_SERVER_CERT
  server_key: CONSUL_SERVER_KEY
  agent_cert: CONSUL_AGENT_CERT
  agent_key: CONSUL_AGENT_KEY


I have no idea how to fill in these fields. Can someone give me an example? Thanks.
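
The error above is presumably what one would expect while the manifest still
holds placeholder strings rather than PEM-encoded certificates. A minimal
Python sketch of a pre-deploy check follows; the function name and the dummy
certificate body are made up for illustration, and the check only looks for a
PEM block, it does not validate the certificate itself.

```python
import re

# Matches a PEM certificate block: header, base64 body, footer.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----\s+[A-Za-z0-9+/=\s]+-----END CERTIFICATE-----"
)


def unfilled_cert_fields(consul_props,
                         fields=("ca_cert", "server_cert", "agent_cert")):
    """Return the fields whose value does not contain a PEM certificate."""
    return [
        name for name in fields
        if not PEM_CERT.search(str(consul_props.get(name, "")))
    ]


# Hypothetical manifest properties: two placeholders and one dummy PEM block.
consul = {
    "ca_cert": "CONSUL_CA_CERT",
    "server_cert": "-----BEGIN CERTIFICATE-----\nMIIB\n-----END CERTIFICATE-----",
    "agent_cert": "CONSUL_AGENT_CERT",
}
print(unfilled_cert_fields(consul))  # ['ca_cert', 'agent_cert']
```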


于长江
15101057694


Re: Organization quota definition-questions

Ponraj E
 

Hi,



Since the documentation for the quota definition is quite unclear at the moment, I have more questions regarding the same.

I want to display the resource consumption (memory, disk usage, etc.) at the org and space level.

1. Sometimes the sum of the space quota definitions exceeds the org quota definition. Is this a valid use case or a bug?
2. Currently there is no API to display the disk quota limit/usage at the org level; it's only available at the application level. How do we approach this?
3. Also, at the space level, it is possible for a space not to be associated with a space quota definition. In that case, how do we get the total resources available (like memory, services, routes) for this space?

Regards,
Ponraj


Organization quota definition-questions

Ponraj E
 

Hi,

Is it possible to get the disk quota at the organization level? As far as I can see, the quota definition API doesn't return the disk quota (upper limit) info.

I want to calculate the used disk quota / total disk quota for an organization.


Regards,
Ponraj


Re: Certificate management for non-Java applications

Daniel Mikusa
 

I think it depends on the language / runtime / library and where it
looks for the default set of certs. It's easy with Java because it uses
its own cert store and that file is owned by the vcap user. If a language
/ runtime / library is looking at `/etc/ssl/certs`, you can't change that
as a user (at least not from staging / runtime).

The best thing, just like with Java, is if your application provides you with
the facilities to configure its usage of certs. I'd imagine that all
languages / runtimes / libraries provide you with a way to override the
defaults and use your own certs.

Besides that, you might be able to set various environment variables to
point the default cert store to a different location. I'm not aware of a
standard one though, so it's likely going to depend on the specific
language / runtime / library and what it supports.
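
As one concrete example of a runtime-specific mechanism, here is a small
Python sketch. Python's ssl module reports which environment variables its
underlying OpenSSL consults for the default store (typically SSL_CERT_FILE and
SSL_CERT_DIR), and an application can also load an extra CA bundle itself. The
APP_CA_BUNDLE variable name is hypothetical; other languages will differ.

```python
import os
import ssl

paths = ssl.get_default_verify_paths()
# Environment variables OpenSSL checks for the default CA file / directory.
print(paths.openssl_cafile_env, paths.openssl_capath_env)


def make_context(ca_bundle=None):
    """Build a TLS client context, optionally trusting an extra CA bundle."""
    context = ssl.create_default_context()
    if ca_bundle and os.path.exists(ca_bundle):
        # Adds the bundle's CAs on top of the system defaults.
        context.load_verify_locations(cafile=ca_bundle)
    return context


# APP_CA_BUNDLE is a made-up variable an app could read at startup.
context = make_context(os.environ.get("APP_CA_BUNDLE"))
```

The second approach stays within files the app user owns, so it does not
depend on being able to write to `/etc/ssl/certs`.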

Dan


On Mon, Dec 14, 2015 at 6:01 PM, john mcteague <john.mcteague(a)gmail.com>
wrote:

Previous threads have focused on adding a trusted CA to the JDK's trust
store at application startup, a pattern that I have employed also.

We are facing increased demand from our non-Java developers to have the
same functionality. Whether it be custom CAs, certs for authentication
(against something like MQ for example) or for our internal LDAP server
which requires ldaps, we need a way to add user-defined certificates at app
deploy time based on user requirements.

My work with Java buildpacks has resulted in a certificate-as-a-service
style function; declare which cert from a certificate store should be
injected into the app at runtime. What I lack for non-Java runtimes is a
reliable way to get those certs into the correct Linux container directory
either during staging or at app startup.

Have others been able to establish a pattern around this? Without this
ability we go from a polyglot platform to simply Java only.

Thanks,
John


[Abacus] Tagging usage

KRuelY <kevinyudhiswara@...>
 

Hi,

I'm looking for a way to classify usage. My use case is this:

I have 2 types of usage:
1. Usage of type A.
2. Usage of type B.

The initial thought for doing this with the current Abacus is to create 2
separate plans (say plan a and plan b). This way, usage of type A would go
under 'plan a' and usage of type B would go under 'plan b'.

Due to past design decisions, I am unable to do so, and hence I need to
separate usage under the same plan into 2 types: type A and type B.

A more detailed use case is when pricing changes in the middle of the
month. Here, I need to separate my usage into usage of type A, which would use
the old pricing, and usage of type B, which would use the new pricing.

The main reason for splitting them up is that I would like to have two
different buckets for the two types of usage, because keeping them in the same
bucket is not going to work (it will give inaccurate results) due to the
following:

Use cases:
1. The price changes on the 15th. Get the usage quantity from the beginning
of the month (A), and get the usage quantity at the end of the month (B). The
charge would be: A * old price + (B - A) * new price. This would work if the
accumulation formula is a plain sum, but if the formula is max / average this
would not work.

2. Query from the beginning of the month to the middle of the month, and
query from the middle of the month to the end of the month. Querying from x to
y is not supported, and with the accumulating and retracting dataflow model
that we have this is not possible.

3. One way that would work is to utilize the flexibility of the meter,
accumulate, and aggregate functions in the resource config to separate usage
of type A and type B. I don't think this is a good design, but the point is
that we can utilize the flexibility of the functions in the resource config
to perhaps generate a new idea.
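
To make the point in item 1 concrete, here is a small Python sketch comparing
the A * old price + (B - A) * new price formula with charging two separately
tracked buckets. The usage numbers and prices are made up, and pricing each
bucket by its own accumulated value is just one plausible reading of what
separate buckets would do.

```python
def split_charge(total_at_change, total_at_end, old_price, new_price):
    """A * old price + (B - A) * new price, from two accumulated readings."""
    return (total_at_change * old_price
            + (total_at_end - total_at_change) * new_price)


first_half = [10, 20]   # usage submitted before the price change
second_half = [5, 15]   # usage submitted after the price change
old_price, new_price = 1.0, 2.0

# Accumulation by sum: A = 30, B = 50.
sum_formula = split_charge(
    sum(first_half), sum(first_half + second_half), old_price, new_price
)
sum_buckets = sum(first_half) * old_price + sum(second_half) * new_price

# Accumulation by max: A = 20, B = 20, so (B - A) hides the second half.
max_formula = split_charge(
    max(first_half), max(first_half + second_half), old_price, new_price
)
max_buckets = max(first_half) * old_price + max(second_half) * new_price

print(sum_formula, sum_buckets)  # 70.0 70.0
print(max_formula, max_buckets)  # 20.0 50.0
```

With a plain sum the two approaches agree; with max the difference B - A no
longer represents the usage in the second period, so the results diverge.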

Would tagging usage be a good idea? For example, under a plan we would have:

plan:
  bucket A:
    aggregated_usage:
  bucket B:
    aggregated_usage:

My concern with this is that we would have to add another (key, value) pair to
create another landmark window, and the submitted usage would also need
another field to serve as the 'tag'.

Any thoughts or ideas on how to implement this? Other than creating 2 separate
plans or tagging, the ideas mentioned above are not really viable.

Thanks!







--
View this message in context: http://cf-dev.70369.x6.nabble.com/Abacus-Tagging-usage-tp3089.html
Sent from the CF Dev mailing list archive at Nabble.com.


Certificate management for non-Java applications

john mcteague <john.mcteague@...>
 

Previous threads have focused on adding a trusted CA to the JDK's trust
store at application startup, a pattern that I have employed also.

We are facing increased demand from our non-Java developers to have the
same functionality. Whether it be custom CAs, certs for authentication
(against something like MQ for example) or for our internal LDAP server
which requires ldaps, we need a way to add user-defined certificates at app
deploy time based on user requirements.

My work with Java buildpacks has resulted in a certificate-as-a-service
style function; declare which cert from a certificate store should be
injected into the app at runtime. What I lack for non-Java runtimes is a
reliable way to get those certs into the correct Linux container directory
either during staging or at app startup.

Have others been able to establish a pattern around this? Without this
ability we go from a polyglot platform to simply Java only.

Thanks,
John


Re: Web sockets + Cloud Foundry

Matthew Sykes <matthew.sykes@...>
 

James Bayer posted about a fun experiment with web sockets a while back.
Might be a good starting point:

http://www.iamjambay.com/2013/12/send-interactive-commands-to-cloud.html

You need to make sure that the url uses the correct protocol and port for
your CF target. You can reference `doppler_logging_endpoint` from /v2/info
as a template.
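
One way to apply that advice is sketched below in Python: take the scheme and
port from `doppler_logging_endpoint` and reuse them for the app's own route.
The endpoint value, host name and path are made-up examples, and the helper
name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def websocket_url(doppler_logging_endpoint, app_host, path="/"):
    """Reuse the scheme and port of the doppler endpoint for an app route."""
    template = urlsplit(doppler_logging_endpoint)
    if template.port is None:
        netloc = app_host
    else:
        netloc = f"{app_host}:{template.port}"
    return urlunsplit((template.scheme, netloc, path, "", ""))


# Hypothetical excerpt of a /v2/info response.
info = {"doppler_logging_endpoint": "wss://doppler.example.com:4443"}
url = websocket_url(info["doppler_logging_endpoint"], "myapp.example.com", "/ws")
print(url)  # wss://myapp.example.com:4443/ws
```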

On Mon, Dec 14, 2015 at 4:53 PM, Lakshman Mukkamalla (lmukkama) <
lmukkama(a)cisco.com> wrote:

Hi CF Dev team,
I am just starting to look into how a web sockets based app can run in a
Cloud Foundry environment. Any reference links or wikis that walk through
getting web sockets working would help me here. I understand from the Cloud
Foundry docs that there is some level of support for web sockets, and I will
try a sample app in the meantime.

Thanks.


--
Matthew Sykes
matthew.sykes(a)gmail.com