Notice: Known stability issue running Diego cells on BOSH stemcell 3541.2 and later; advise use of 3468.26 instead

Eric Malm <emalm@...>
 

Hi, all,

FYI, the core CF dev teams have observed an issue running Diego cells on BOSH stemcell 3541.2 and later, and at the moment advise you to roll back to the 3468 stemcell line for those VMs. The next cf-deployment release will also revert to that stemcell line. The BOSH team is addressing the issue in https://www.pivotaltracker.com/story/show/155716654, and we expect the fix to be incorporated into the next stemcell in the 3541 line.

Problem: Diego cell VMs affected by this issue will suddenly be unable to make new Garden containers, although existing containers will continue to run. Within at most 10 minutes, the cell also will cease to accept new CF app instances or tasks, preventing execution failures but reducing available capacity for placement. This failure to make containers can cause `cf push` to fail or app instances not to be restarted, especially if this issue incapacitates enough cells in the environment.

Symptoms: App developers may observe that `cf push` or `cf restart` fails either with a container-creation error that includes "permission denied" or because of insufficient resources. CF operators may observe a persistently elevated number of Diego cell reps reporting an UnhealthyCell metric of 1 or reporting failures to create Garden containers, or increased rates of Task or LRP auction failures from the Diego auctioneer component.

Mitigations: In addition to rolling back to the 3468 stemcell line, restarting or recreating the affected cell VM with the BOSH CLI, running `sudo monit restart all` on it, or running `sudo chown vcap:vcap /var/vcap/data/garden /var/vcap/data/rep` and waiting at most 10 minutes will all suffice to mitigate the immediate effects of the issue. The last option, while the most manual, does have the benefit of not disrupting app instances and tasks that are already running on the Diego cell VM.

More details: Because of the delicate filesystem dance that Garden must perform to create container root filesystems, the Garden and Diego rep jobs on the Diego cell rely on certain directories on the /var/vcap/data volume to have very particular user ownership and access permissions.

In the 3541.2 stemcell, the bosh-agent added new behavior to change the ownership on those directories to root:vcap when it restarts, which is incompatible with what the garden and rep jobs require. Normally this is not a problem, as the garden and rep ctl scripts override that ownership when BOSH invokes them. If the bosh-agent is restarted for any reason, though, it will reset the ownership to root:vcap without restarting those BOSH jobs, and Garden will then be unable to create the root filesystems for new containers.
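
As a rough illustration of the ownership problem described above, the following Python sketch reports which directories are not owned by the expected user and group. The helper names `owner_of` and `find_misowned` are hypothetical, and the vcap:vcap default is taken from the `chown` mitigation above; this is not how Garden or the rep actually verify their directories.

```python
import grp
import os
import pwd


def owner_of(path):
    """Return (user, group) names for path, falling back to numeric ids."""
    st = os.stat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        user = str(st.st_uid)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)
    return user, group


def find_misowned(paths, expected_user="vcap", expected_group="vcap"):
    """List the paths whose ownership differs from the expected pair."""
    expected = (expected_user, expected_group)
    return [p for p in paths if owner_of(p) != expected]


# Hypothetical usage on a Diego cell (not executed here):
# find_misowned(["/var/vcap/data/garden", "/var/vcap/data/rep"])
```

A non-empty result on a cell would suggest that the bosh-agent has reset the ownership and that the `chown` mitigation applies.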

Those of you who use PWS may have noticed an incident last Friday evening when `cf push` was unavailable for about half an hour. This incident was the result of an extended update to its BOSH director, which caused all of the bosh-agent processes on the VMs to restart and then to apply the permission change above on all of the Diego cells simultaneously. We resolved it by applying the `chown` mitigation above in the short term and then rolling back to the latest 3468 stemcell. A+ to the batch `bosh ssh` capability available in the v2 BOSH CLI, by the way.

Please let me know if you have any questions, and we apologize for any inconvenience this issue may have caused you.

Thanks,
Eric Malm, CF Diego PM


Re: Automating Security Groups and Brokered Services

Matt Cholick
 

That's great Guillaume, thanks!

That's exactly what I was after Sabha, thanks for sharing.

Looping in Matt McNeeney for the OSBAPI perspective:
Is this concept generic enough to fit into the OSBAPI spec? Is something like Sabha's proposal something you've thought about or have on your radar?


Re: Automating Security Groups and Brokered Services

Sabha Parameswaran <sabhap@...>
 

I proposed an enhancement to let the service broker API return the service endpoint info as part of the bind call, so that Cloud Controller (or whatever is running the platform) can create the ASG or its equivalent to let the app communicate with that endpoint. This lets the service broker stay agnostic of where it is running, which platform it serves, and that platform's credentials, while the platform manages the required policy/security handling.

sorry for the long content:

Problem Statement

As more services get consumed by the applications (same application or across multiple apps) in a given space or boundary, it becomes a painful exercise to figure out the endpoints and create a customized ASG or its equivalent (which can change as more services get added/consumed). Platforms like CF do not allow application developers to change or administer App Security Groups and require an admin to open things up. With the proliferation of Service Brokers and associated Services in the Services Marketplace, this becomes a burden every app developer has to bear.

Other Solutions

It is possible to build yet another service broker that acts as an intermediary to create and manage the Application Security Groups on behalf of a set of services used by the application. But this requires the Service Broker to be actively coupled with the Platform and to hold administrative privileges and endpoint information about the platform, while also requiring the Service Broker to be tightly tied to the underlying services (to know which to allow, what endpoints/ports etc.). This just makes it more complex and inefficient, and breaks the model of a Service Broker not needing to know anything about the consuming Platform (CF or Kubernetes).

Proposed Solution


Enhance the Service Broker to return additional metadata on `bind` service call about the service (DNS name or IP, Subnet, Port ranges and any additional metadata). This would allow automatic creation of ASG or its equivalent for the various Platforms based on the services bound to a given application or consumed in an associated space or boundary on a bind call and update/delete the ASG on unbind of the service.


Reason for the Service Broker enhancement

Even if an ASG can be created automatically based purely on a service bind information returned by the Service Broker to the Platform, it is harder in some cases to lock down or derive the endpoint information. For instance, in case of APM service brokers, the bind information might contain only a license key without information on the remote service endpoint. The APM agents bundled via buildpacks or application might have some well known endpoints to reach out to.


Service Broker Enhancement


The service broker would return an additional JSON element that contains endpoint metadata as part of the ‘bind’ call. This metadata portion can contain details about the endpoint (IP, DNS names or CIDR), ports (list of ports, port ranges), protocol, the service instance GUID and anything else that helps the controlling/consuming platform create the appropriate network security policy.




{
  "credentials": {
    "uri": "mycustomservice.instance-name",
    "username": "myuser",
    "password": "pass",
    "host": "myhost",
    "database": "dbname"
  },
  "service_endpoints": [ {
    "endpoint": "service.test.domain.com",
    "port": 3306,
    "protocol": "all",
    "sid": "653ba025-aa43-4e66-941a-4fea786d3755"
    ...
  } ],
  ....
}
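
To make the proposed metadata more concrete, here is a minimal Python sketch of how a platform might turn `service_endpoints` entries into ASG-style rules. The function name `endpoints_to_rules` and the field mapping are assumptions for illustration; the rule shape is only loosely modeled on CF security group rules, and the sample uses an IP address because resolving DNS endpoints is not handled here.

```python
import json

# Hypothetical bind response following the proposed shape above.
BIND_RESPONSE = """
{
  "credentials": {"uri": "mycustomservice.instance-name"},
  "service_endpoints": [
    {"endpoint": "10.0.11.5", "port": 3306, "protocol": "tcp",
     "sid": "653ba025-aa43-4e66-941a-4fea786d3755"}
  ]
}
"""


def endpoints_to_rules(bind_response):
    """Map each proposed service endpoint to an ASG-style rule dict."""
    rules = []
    for ep in bind_response.get("service_endpoints", []):
        protocol = ep.get("protocol", "tcp")
        rule = {"protocol": protocol, "destination": ep["endpoint"]}
        # Assumption: only tcp/udp rules carry a ports field.
        if "port" in ep and protocol in ("tcp", "udp"):
            rule["ports"] = str(ep["port"])
        rule["description"] = "service instance " + ep.get("sid", "unknown")
        rules.append(rule)
    return rules


rules = endpoints_to_rules(json.loads(BIND_RESPONSE))
```

A response without `service_endpoints` simply yields no rules, so brokers that do not adopt the extension would be unaffected in this sketch.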





Workflow

  1. Admin or user with requisite privileges registers the Service Broker with the Platform

  2. Admin user provides an allowed set of network ranges, ports or dns names to be used as the permitted superset of network endpoints to be associated with the Service Broker to the Platform.

    1. This would be specific to the Platform and not part of any service broker contract. This allows the administrator to lock down allowed vs. disallowed services across the platform per service broker.

  3. The Platform saves the information in its Policy Engine against the Service.

  4. User creates an instance of a service from the Service Broker and binds to it.

  5. Service Broker provides the endpoint metadata during the bind call to the Platform

  6. Platform verifies if the provided endpoint matches or falls within the permitted set of endpoints associated with the service broker to decide to allow or not using its Policy Engine.

  7. Proceed with auto creation of the ASG or its equivalent based on the outcome for the app container.

  8. Allow or disallow the application container from communicating with the service.

  9. On unbind, update or delete the ASG for the associated app.

  10. Block outbound communication with removal of the ASG.
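
Step 6 of the workflow above could look roughly like the following Python sketch, which checks a broker-supplied endpoint against an admin-supplied allow list. The function name and parameters are hypothetical, and the sketch assumes IPv4 addresses or CIDR blocks only; DNS names would need to be resolved or matched separately.

```python
import ipaddress


def endpoint_permitted(endpoint, port, allowed_networks, allowed_ports):
    """Return True if the endpoint and port fall within the permitted set.

    endpoint: IPv4 address or CIDR string returned by the broker.
    allowed_networks: CIDR strings registered by the admin (step 2).
    allowed_ports: any container of permitted port numbers.
    """
    target = ipaddress.ip_network(endpoint, strict=False)
    in_network = any(
        target.subnet_of(ipaddress.ip_network(net)) for net in allowed_networks
    )
    return in_network and port in allowed_ports


# Example policy: one internal range, MySQL-like ports only.
ALLOWED_NETWORKS = ["10.0.0.0/16"]
ALLOWED_PORTS = range(3300, 3400)

permitted = endpoint_permitted("10.0.11.5", 3306, ALLOWED_NETWORKS, ALLOWED_PORTS)
```

Only when such a check passes would the platform go on to create the ASG in step 7; a failed check leaves the app container without access.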




Workflow for App Security Creation


Can share the doc if there is more interest.

-Sabha




On Thu, Mar 8, 2018 at 10:21 AM, Guillaume Berche <bercheg@...> wrote:
Hi Matt,

We developed at Orange the following broker which matches your description,
https://github.com/orange-cloudfoundry/sec-group-broker-filter

Feedback is welcome.

Regards,

Guillaume.

On Thu, Mar 8, 2018 at 3:10 PM, Matt Cholick <cholick@...> wrote:
In a default deny situation, where the operator doesn't want to open up a foundation-wide security group at service installation time, it would be useful to create and bind a security group on the fly (that allows communication to the service deployment) at service instance creation time.

1. Developer creates service
2. Broker checks if the ASG allowing communication exists
3. If not, broker binds the ASG to the app's space
4. Rest of flow works as normal

The developer would need to restart the app after service bind anyway, so the security group would get applied as part of that flow.

Has anyone built something like this as an open source library? I have run across some folks who are interested in this as a cross-cutting broker behavior, to keep their traffic rules as restrictive as possible.

-Matt Cholick





--
Sabha Parameswaran
Platform Engineering, Cloud Foundry
Pivotal, Inc.


Re: Automating Security Groups and Brokered Services

Guillaume Berche
 

Hi Matt,

We developed at Orange the following broker which matches your description,
https://github.com/orange-cloudfoundry/sec-group-broker-filter

Feedback is welcome.

Regards,

Guillaume.

On Thu, Mar 8, 2018 at 3:10 PM, Matt Cholick <cholick@...> wrote:
In a default deny situation, where the operator doesn't want to open up a foundation-wide security group at service installation time, it would be useful to create and bind a security group on the fly (that allows communication to the service deployment) at service instance creation time.

1. Developer creates service
2. Broker checks if the ASG allowing communication exists
3. If not, broker binds the ASG to the app's space
4. Rest of flow works as normal

The developer would need to restart the app after service bind anyway, so the security group would get applied as part of that flow.

Has anyone built something like this as an open source library? I have run across some folks who are interested in this as a cross-cutting broker behavior, to keep their traffic rules as restrictive as possible.

-Matt Cholick



Automating Security Groups and Brokered Services

Matt Cholick
 

In a default deny situation, where the operator doesn't want to open up a foundation-wide security group at service installation time, it would be useful to create and bind a security group on the fly (that allows communication to the service deployment) at service instance creation time.

1. Developer creates service
2. Broker checks if the ASG allowing communication exists
3. If not, broker binds the ASG to the app's space
4. Rest of flow works as normal

The developer would need to restart the app after service bind anyway, so the security group would get applied as part of that flow.
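
The flow above amounts to an idempotent "ensure the ASG exists and is bound" step. A minimal Python sketch, using a hypothetical in-memory stand-in rather than any real CF client API:

```python
class InMemoryPlatform:
    """Hypothetical stand-in for a platform API client; not a real CF client."""

    def __init__(self):
        self.security_groups = {}  # ASG name -> list of rules
        self.bindings = set()      # (ASG name, space) pairs


def ensure_asg_bound(platform, name, rules, space):
    """Create the ASG if it is missing and bind it to the space.

    Returns True if the ASG had to be created, False if it already existed.
    Safe to call on every service creation.
    """
    created = name not in platform.security_groups
    if created:
        platform.security_groups[name] = rules
    platform.bindings.add((name, space))
    return created


platform = InMemoryPlatform()
first = ensure_asg_bound(platform, "svc-asg", [{"protocol": "tcp"}], "dev")
second = ensure_asg_bound(platform, "svc-asg", [{"protocol": "tcp"}], "dev")
```

Because the second call changes nothing, the broker could run this step unconditionally instead of tracking which spaces it has already handled.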

Has anyone built something like this as an open source library? I have run across some folks who are interested in this as a cross-cutting broker behavior, to keep their traffic rules as restrictive as possible.

-Matt Cholick


Removing the (experimental) btrfs driver from Garden/Grootfs

Julz Friedman
 

Hi cf-dev, I wanted to write a quick email to make sure the community is aware that we are planning to remove the btrfs driver from Garden/Grootfs in the next Garden release (and to explain why).


Exposition:


- Grootfs is the rootfs management library used in Cloud Foundry. Initially Grootfs planned to use btrfs (a filesystem supporting fast snapshotting) as the main underlying filesystem for managing container layers.  


Conflict!


- Unfortunately as we attempted to run the btrfs-based Grootfs at scale we saw performance and reliability issues that we weren't able to overcome to our satisfaction. 

  

Rising Action:


- Based on our inability to successfully run the btrfs-based driver at production scale we instead moved to an Overlay-based implementation.  (We'd like to underline that these may not have been fundamental issues with btrfs and may rather have simply reflected our team's lack of skill in it-- but the situation was still that the team was not able to gain confidence in our ability to run btrfs at scale). 

- We kept the btrfs driver in the code with the intention of removing it before merging Groot back in to Garden / creating Grootfs 1.0. This did not happen: the option of opting in to btrfs survived. 

  

Climax:


- Grootfs has now been running in large environments for several months using the Overlay driver and we've been happy with its performance and stability.  

- The btrfs driver is still in the code, but we have very few tests around it, and feel uncomfortable with people running it, since we don't believe our team can support it. Functionally, we believe all users should be able to use the overlay driver instead (although we’re absolutely aware some users prefer btrfs as they have more knowledge of it in-house). 

- The team is [planning to adopt Containerd][0] for creating and managing containers inside Garden. This should give consumers a better way to support filesystems the Garden team doesn’t have the bandwidth to directly test and support, potentially by letting them use the wide variety of upstream community-supported drivers. 

  

Falling Action:


- We still have the btrfs driver in the code even though there’s a better solution and we don’t feel we can support it. This is storing up problems and technical debt and incurring a cost on all stories which touch the Grootfs code. We’ve decided given the team’s limited bandwidth, we need to stop doing this.  

  

Resolution:


- We intend to remove the btrfs driver in the next garden version (it should never have survived to a 1.0+ release, and we apologise for the confusion of removing it after this happened). 

- As far as we know, all consumers of the btrfs driver should be able to transition to the Overlay driver.

- Since garden's image plugin API is stable and backwards-compatible, users who wish to continue using btrfs - for example because they have skills in btrfs and prefer to support a btrfs solution rather than an Overlay solution - can use the previous version of the Grootfs release (or their own image plugin), but will not get updates/fixes.

- We've recently begun work on supporting using Containerd [0] to create and manage containers in Garden. Our hope is that when this work is complete users will be able to use any of the drivers the Containerd community provides, including btrfs (though we will likely still recommend, test and support only a specific default configuration). 


Please please please let us know if there is a reason the Overlay driver (or the ability to plug in your own image plugin or an old Grootfs version) does not work for you, or if you have other questions/concerns and we will try our best to find a good solution. We're on slack in the #garden channel, or reply to this email / email me directly.


[0]: https://lists.cloudfoundry.org/g/cf-dev/message/7699?p=,,,20,0,0,0::Created,,containerd,20,2,0,9499071


Thanks!

Julz




Re: Updates from the world of .NET! 🎉

Dr Nic Williams <drnicwilliams@...>
 

Thanks William & Zach. Your answers are very helpful.

Nic


From: cf-dev@... <cf-dev@...> on behalf of A William Martin <amartin@...>
Sent: Thursday, March 8, 2018 12:20:02 PM
To: cf-dev@...
Subject: Re: [cf-dev] Updates from the world of .NET! 🎉
 
I'm sorry to belabor this point, but just to be clear, there *is* a UI, it’s just different... I've included a screenshot taken while RDPing into a Windows 2016 cell deployed on Cloud Foundry on GCP. You can see there is a UI with a terminal window with PowerShell, the task manager and config window spawned, and a few hints of the container technology with Get-ComputeProcess run and winc, our Garden runtime OCI-compliant container plugin, shown. Those are the app instances deployed to this cf-deployment.

Thus, Dr. Nic, you can indeed still use RDP and get a physical UI, but it only contains a single command terminal. And Zach is right, you need to be more command-line savvy to navigate to the UIs you're accustomed to, and not all of them are available. But as some of our colleagues have noted before, you can do anything in Powershell. :-)

As a bonus, we also provide a windows-utilities-release to enable RDP and perform other Windows-specific operations (like KMS activation). The stemcells are quite secure by default.

I don't want to detract from Ash's call for feedback, though. The consumption of .NET on CF is entering a golden era, and we're eager to hear what the community needs!






On Wed, Mar 7, 2018 at 7:41 PM Zach Brown <zbrown@...> wrote:
It's true, no UI, but...

You've got `bosh ssh`, `cf ssh`, remote server debugging with Visual Studio, and Windows Event Log forwarding. 

Not sure if you're comfortable at a command line, Dr. Nic, but it's what all the cool kids are doing these days. (Well, that and emoji.)

On Thu, Mar 8, 2018 at 7:49 AM, A William Martin <amartin@...> wrote:
The 2016 stemcells are based on Windows Server Core, so not a "full" UI like you'd usually expect. The UI in Server Core does exist, but it's rather stripped down and starts with a single command prompt. To open new windows you need to spawn them manually. There's no iconic desktop.

William


On Wed, Mar 7, 2018 at 5:46 PM, Dr Nic Williams <drnicwilliams@...> wrote:
So many emoji!

Question re 2016 stemcell - will it have any Windows UI components? When I was working with the 2012 stemcell, and we needed to use RDP to get access to the VM, we needed the Windows UI. But I think the 2016 stemcell back then did not have Windows UI. Am I wrong about that; and/or does 2016 stemcell now have Windows UI?

Nic





--

Zach Brown | Product Marketing and Strategy

650-954-0427 - mobile

zbrown@...



Re: Updates from the world of .NET! 🎉

A William Martin
 

I'm sorry to belabor this point, but just to be clear, there *is* a UI, it’s just different... I've included a screenshot taken while RDPing into a Windows 2016 cell deployed on Cloud Foundry on GCP. You can see there is a UI with a terminal window with PowerShell, the task manager and config window spawned, and a few hints of the container technology with Get-ComputeProcess run and winc, our Garden runtime OCI-compliant container plugin, shown. Those are the app instances deployed to this cf-deployment.

Thus, Dr. Nic, you can indeed still use RDP and get a physical UI, but it only contains a single command terminal. And Zach is right, you need to be more command-line savvy to navigate to the UIs you're accustomed to, and not all of them are available. But as some of our colleagues have noted before, you can do anything in Powershell. :-)

As a bonus, we also provide a windows-utilities-release to enable RDP and perform other Windows-specific operations (like KMS activation). The stemcells are quite secure by default.

I don't want to detract from Ash's call for feedback, though. The consumption of .NET on CF is entering a golden era, and we're eager to hear what the community needs!






On Wed, Mar 7, 2018 at 7:41 PM Zach Brown <zbrown@...> wrote:
It's true, no UI, but...

You've got `bosh ssh`, `cf ssh`, remote server debugging with Visual Studio, and Windows Event Log forwarding. 

Not sure if you're comfortable at a command line, Dr. Nic, but it's what all the cool kids are doing these days. (Well, that and emoji.)

On Thu, Mar 8, 2018 at 7:49 AM, A William Martin <amartin@...> wrote:
The 2016 stemcells are based on Windows Server Core, so not a "full" UI like you'd usually expect. The UI in Server Core does exist, but it's rather stripped down and starts with a single command prompt. To open new windows you need to spawn them manually. There's no iconic desktop.

William


On Wed, Mar 7, 2018 at 5:46 PM, Dr Nic Williams <drnicwilliams@...> wrote:
So many emoji!

Question re 2016 stemcell - will it have any Windows UI components? When I was working with the 2012 stemcell, and we needed to use RDP to get access to the VM, we needed the Windows UI. But I think the 2016 stemcell back then did not have Windows UI. Am I wrong about that; and/or does 2016 stemcell now have Windows UI?

Nic





--

Zach Brown | Product Marketing and Strategy

650-954-0427 - mobile

zbrown@...



Re: Updates from the world of .NET! 🎉

Zach Brown
 

It's true, no UI, but...

You've got `bosh ssh`, `cf ssh`, remote server debugging with Visual Studio, and Windows Event Log forwarding. 

Not sure if you're comfortable at a command line, Dr. Nic, but it's what all the cool kids are doing these days. (Well, that and emoji.)

On Thu, Mar 8, 2018 at 7:49 AM, A William Martin <amartin@...> wrote:
The 2016 stemcells are based on Windows Server Core, so not a "full" UI like you'd usually expect. The UI in Server Core does exist, but it's rather stripped down and starts with a single command prompt. To open new windows you need to spawn them manually. There's no iconic desktop.

William


On Wed, Mar 7, 2018 at 5:46 PM, Dr Nic Williams <drnicwilliams@...> wrote:
So many emoji!

Question re 2016 stemcell - will it have any Windows UI components? When I was working with the 2012 stemcell, and we needed to use RDP to get access to the VM, we needed the Windows UI. But I think the 2016 stemcell back then did not have Windows UI. Am I wrong about that; and/or does 2016 stemcell now have Windows UI?

Nic





--

Zach Brown | Product Marketing and Strategy

650-954-0427 - mobile

zbrown@...



Re: Updates from the world of .NET! 🎉

A William Martin
 

The 2016 stemcells are based on Windows Server Core, so not a "full" UI like you'd usually expect. The UI in Server Core does exist, but it's rather stripped down and starts with a single command prompt. To open new windows you need to spawn them manually. There's no iconic desktop.

William


On Wed, Mar 7, 2018 at 5:46 PM, Dr Nic Williams <drnicwilliams@...> wrote:
So many emoji!

Question re 2016 stemcell - will it have any Windows UI components? When I was working with the 2012 stemcell, and we needed to use RDP to get access to the VM, we needed the Windows UI. But I think the 2016 stemcell back then did not have Windows UI. Am I wrong about that; and/or does 2016 stemcell now have Windows UI?

Nic



Re: Updates from the world of .NET! 🎉

Dr Nic Williams <drnicwilliams@...>
 

So many emoji!

Question re 2016 stemcell - will it have any Windows UI components? When I was working with the 2012 stemcell, and we needed to use RDP to get access to the VM, we needed the Windows UI. But I think the 2016 stemcell back then did not have Windows UI. Am I wrong about that; and/or does 2016 stemcell now have Windows UI?

Nic


Updates from the world of .NET! 🎉

Ashley Hathaway
 

Why hello there!


In an effort to solicit more feedback and understand actual user problems to help define our roadmap & backlog I wanted to reach out and say howdy (howdy!) and share a bit about the exciting times in Windows CF land.


The Windows runtime team here at Pivotal in NYC have been super productive streamlining and improving a lot of our processes as well as adding new features to Cloud Foundry’s Windows runtime. Here are the highlights:


  • 😎The Windows2016 stack/runtime has been hanging out in experimental for a while. It’s soon (like within this month) to be supported in OSS. Kind of sort of a huge deal. This means Windows containers for realsies.

  • 🔐Another pretty epic update is cf ssh capabilities. With the cf CLI you can now securely log into Windows remote host VMs running CF app instances.

  • 🐈One of the newer improvements has been around acceptance suites. The standard version is CATS (CF Acceptance Test Suite) but we’ve always had a second version for Windows (aka WATS). But no more! We aim to have one test suite to rule them all. The team has worked to identify any overlapping processes and are just one PR away from integrating the suites! Yay simplicity! Feel free to check out the epic here.

  • 🔈We’re working on Volume Services with the Diego Persistence team and hope to ship sometime soon.

  • 🚙We’ve started work to bring Envoy and route integrity to Windows.

  • 🎁The ContainerD runtime initiative with Windows will begin in April w/ the Garden team. Hooray simplicity.

  • 🎊Finally, the Cloud Foundry Summit! Next month! Are you going? Give us a shout! And be sure to check out our sessions on Remote Debugging with two of our engineers and a short demo with two of our PMs.


And if you haven’t already, head over to the CF Docs to get started w/ the Windows CLI.


So dear developer, what are we missing!? Are docs the best they can be? What can we improve? Feel free to reach out with any feedback or questions or ping us on the #garden-windows CF Slack!


Til next time,

Ashley


@Ash_Hathaway



Re: Rotating cf-deployment certificates

Iryna Shustava
 

Hey Mike,

Check out this doc regarding CA rotation with CredHub: https://github.com/pivotal-cf/credhub-release/blob/master/docs/ca-rotation.md.

Cheers,
Iryna


On Tue, Mar 6, 2018 at 2:44 PM, Aaron Huber <aaron.m.huber@...> wrote:
This one-liner will grab all the certs out of the vars files used by the bosh-cli and print out the expiration dates which is useful for a quick check:

openssl crl2pkcs7 -nocrl -certfile <(sed -n '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' *vars.yml | sed -e 's/^[ \t]*//') | openssl pkcs7 -print_certs -text -noout | sed -e 's/^[ \t]*//' | grep -E "Issuer:|Subject:|Not\ After\ :" | awk '{ if ((NR % 3) == 1) printf("\n*******\n\n"); print; }'

Aaron



Proposal for modest readability improvements to Diego component logs

Eric Malm <emalm@...>
 

Hi, all,

The Diego team is planning to make some modest improvements to the readability of the Diego component logs. Primarily, we'd like to make each log-line's timestamp ISO 8601/RFC 3339 compliant (that is, of the form "2018-03-06T12:34:56.789012345Z") and its log level a human-readable string. Details are in the proposal document at https://docs.google.com/document/d/1D3GK2IUGQz_3fCuuLPNWz7Yb8eiKKZ9k-78LwNHWtuU/edit, on which we would certainly appreciate your feedback.
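
For a sense of the proposed format, this Python sketch converts an epoch timestamp string with nanosecond precision into the RFC 3339 form quoted above and maps a numeric log level to a name. The epoch-string input format and the level mapping are assumptions for illustration only; the proposal document is the authority on both.

```python
from datetime import datetime, timezone

# Assumed mapping of numeric levels to names, for illustration only.
LEVEL_NAMES = {0: "debug", 1: "info", 2: "error", 3: "fatal"}


def rfc3339_nanos(epoch):
    """Convert 'seconds.nanoseconds' (a string) to an RFC 3339 UTC timestamp.

    The fractional part is handled as text so that no precision is lost
    to floating-point rounding.
    """
    secs, _, frac = epoch.partition(".")
    nanos = (frac + "000000000")[:9]  # right-pad to nine digits
    stamp = datetime.fromtimestamp(int(secs), tz=timezone.utc)
    return stamp.strftime("%Y-%m-%dT%H:%M:%S") + "." + nanos + "Z"


example = rfc3339_nanos("1520339696.789012345")
```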

Thanks,
Eric Malm, CF Diego PM


Re: Rotating cf-deployment certificates

Aaron Huber
 

This one-liner will grab all the certs out of the vars files used by the bosh-cli and print out the expiration dates which is useful for a quick check:

openssl crl2pkcs7 -nocrl -certfile <(sed -n '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' *vars.yml | sed -e 's/^[ \t]*//') | openssl pkcs7 -print_certs -text -noout | sed -e 's/^[ \t]*//' | grep -E "Issuer:|Subject:|Not\ After\ :" | awk '{ if ((NR % 3) == 1) printf("\n*******\n\n"); print; }'
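
For readers who prefer to script this, the first stage of the pipeline (pulling the PEM blocks out of the vars file and stripping their indentation) can be sketched in Python as follows. It assumes the certificates are stored as indented multi-line blocks, as in a BOSH CLI vars store; the function name is hypothetical, and decoding the certificates is still left to `openssl`.

```python
import re
import textwrap

# Matches one PEM certificate block, including the indentation of its
# first line so that the whole block can be dedented uniformly.
PEM_RE = re.compile(
    r"^[ \t]*-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL | re.MULTILINE,
)


def extract_pem_blocks(text):
    """Return every certificate block in text with indentation removed."""
    return [textwrap.dedent(m.group(0)) for m in PEM_RE.finditer(text)]


SAMPLE_VARS = (
    "service_cf_internal_ca:\n"
    "  ca: |\n"
    "    -----BEGIN CERTIFICATE-----\n"
    "    AAAA\n"
    "    -----END CERTIFICATE-----\n"
    "  private_key: redacted\n"
)

blocks = extract_pem_blocks(SAMPLE_VARS)
```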

Aaron


Re: Rotating cf-deployment certificates

Mike Youngstrom
 

So, reading through the document you provided, Iryna, and re-reading David's rotation steps, everything now makes sense when using bosh-cli generated certs.

Are there steps to do the same with credhub managed certificates?

Thanks,
Mike

On Tue, Mar 6, 2018 at 1:22 PM, Mike Youngstrom <youngm@...> wrote:
Thanks for the clarification Iryna!  I'll do some more studying and respond if I have further questions.

Mike

On Tue, Mar 6, 2018 at 12:07 PM, Iryna Shustava <ishustava@...> wrote:
Hey Mike,

Although all applications may remain up while re-deploying I imagine things like loggregator will stop working mid deploy when doppler and metron certs no longer match.  Perhaps reps will be unable to properly drain when their certs don't match?  Does that sound correct?

We expect no app routability or log availability downtime during the 3 step CA/cert rotation. That is because during Step 1 we make all components trust both CAs - the old one and the new one. During step 2, when we roll out new leaf certificates, all components should trust both CAs, so the certificate switch will happen without downtime. You will, however, see some cf push downtime, as David mentioned.
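
The reasoning above can be checked with a toy model: a stage is safe for TLS as long as every leaf certificate in use was issued by a CA that components currently trust. The Python sketch below is a simplification with hypothetical names; it models trust only and does not capture the cause of the `cf push` downtime.

```python
def stage_ok(trusted_cas, leaf_issuers):
    """True if every leaf certificate in use is signed by a trusted CA."""
    return set(leaf_issuers) <= set(trusted_cas)


# (description, CAs trusted by components, issuers of leaf certs in use)
ROTATION = [
    ("step 1: trust both CAs", {"old", "new"}, {"old"}),
    ("step 2: roll out new leaf certs", {"old", "new"}, {"old", "new"}),
    ("step 3: drop the old CA", {"new"}, {"new"}),
]

results = {name: stage_ok(trusted, leaves) for name, trusted, leaves in ROTATION}

# Skipping step 1 and switching straight to the new CA mid-deploy would
# leave old leaf certificates untrusted while instances are still rolling.
naive_switch_ok = stage_ok({"new"}, {"old", "new"})
```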

Is the expiration default the same for certificates created by credhub?  Are you aware of any way to increase the default expiration date for credhub or bosh-cli?

The default expiration is the same for CredHub and BOSH CLI. BOSH CLI does not allow you to change certificate expiration period, but CredHub does. You can do so by adding the duration property measured in days to your certificate or CA variable in the manifest.

Long term are core teams working towards zero downtime cert rotation capabilities?  Or do you foresee the need to rotate with some service impact an issue long term?

If you're interested, this doc describes reasons behind cf push downtime during CA rotation.

Thanks!
Iryna, CF Release Integration Team


On Tue, Mar 6, 2018 at 9:21 AM, Mike Youngstrom <youngm@...> wrote:
Thanks for the heads up David.  I have questions about the rotation process.

Although all applications may remain up while re-deploying I imagine things like loggregator will stop working mid deploy when doppler and metron certs no longer match.  Perhaps reps will be unable to properly drain when their certs don't match?  Does that sound correct?

Is the expiration default the same for certificates created by credhub?  Are you aware of any way to increase the default expiration date for credhub or bosh-cli?

Long term are core teams working towards zero downtime cert rotation capabilities?  Or do you foresee the need to rotate with some service impact an issue long term?

Thanks,
Mike

On Fri, Mar 2, 2018 at 11:32 AM, David Sabeti <dsabeti@...> wrote:
Hey cf-dev,

The Release Integration team has had a few reports from other CF engineering teams that their long-running environments have had their internal TLS certificates expire. Since certificates generated by the BOSH CLI get a one-year expiration date, and it's been about a year since early adopters started using cf-deployment, we suspect that some older environments in the CF community are fast approaching this issue as well. We hope to provide enough of a warning that folks in the community can address this.

Check your certificate expiration dates
This is pretty simple to do. You can copy a certificate -- service_cf_internal_ca is a good one to try -- and paste it into the form on this site: https://www.sslshopper.com/certificate-decoder.html. You'll find the expiration date in the "Valid To" section. If your certificates are going to expire soon, continue to the process below.
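
If you would rather compute the remaining lifetime locally, the expiration string printed by `openssl x509 -noout -enddate` can be turned into a number of days with Python's standard library. This is a minimal sketch; the function name and the idea of comparing the result against some warning threshold are illustrative assumptions.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days between now and a certificate's notAfter time.

    not_after: a string such as 'Mar  2 19:32:00 2019 GMT'.
    now: seconds since the epoch; defaults to the current time.
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


# Fixed "now" so the example is reproducible: 30 days before expiry.
EXPIRY = "Jan  1 00:00:00 2019 GMT"
remaining = days_until_expiry(EXPIRY, now=1546300800 - 30 * 86400)
```

A negative result means the certificate has already expired.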

How to rotate certificates
This is not an easy process, but it's doable. I'll warn you right now that, during the transition, your CF will experience `cf push` downtime, but apps should remain available. Also, if you're deploying with the windows-cell.yml or secure-service-credentials.yml ops-files, the process will be a bit more complicated, so please reach out to the RelInt team for help.
  1. Deploy with concatenated CA certificates
    1. Generate new certs by running
       bosh int cf-deployment.yml [-o ... ] --vars-store new-vars.yml -v system_domain=$SYSTEM_DOMAIN
    2. For each new CA cert, concatenate the new CA certificate to both the `ca` and `certificate` fields.
    3. Deploy
  2. Deploy with new leaf certificates
    1. For each leaf certificate in your vars-store, replace with the corresponding certificate from new-vars.yml. These leaf certificates are signed by the new CA's.
    2. Deploy. When the api instances roll, users will no longer be able to push apps, until you remove the old CA certificates.
  3. Deploy without the old CA certificates.
    1. For each CA certificate in your vars-store, remove the first certificate in the `ca` and `certificate` fields. The result should be that only the new CA certificates created in step 1.1 should be included in your vars-store.
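The PEM handling behind the concatenation in step 1 and the removal in step 3 could be sketched roughly as below. This is an illustration only: the function names are invented, the sketch works on standalone PEM files, and copying the results into the `ca` and `certificate` fields of the vars-store is still a manual (or YAML-tool) step.

```shell
#!/usr/bin/env bash
# Step 1: print the old CA certificate followed by the new one, so that
# components trust both CAs during the transition.
# Usage: concat_ca OLD_CA.pem NEW_CA.pem
concat_ca() {
  # "awk 1" copies every line and adds a final newline if a file lacks one,
  # so the END line of the old cert never runs into the BEGIN line of the new.
  awk 1 "$1" "$2"
}

# Step 3: drop the first certificate from a PEM bundle read from stdin,
# leaving only the certificate(s) that follow it.
drop_first_cert() {
  awk '/-----BEGIN CERTIFICATE-----/ { n++ } n >= 2 { print }'
}

# Usage (file names are hypothetical):
#   concat_ca old-ca.pem new-ca.pem > both-cas.pem
#   drop_first_cert < both-cas.pem > new-ca-only.pem
```

Keeping the old certificate first and the new one second is what lets step 3 be described as "remove the first certificate".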
The RelInt team has also worked through a process for rotating certificates that have already expired. If you have any questions or concerns, jump into the #release-integration channel in the Cloud Foundry slack and feel free to get a hold of the team there.

Thanks!
CF Release Integration







Re: Rotating cf-deployment certificates

Mike Youngstrom
 

Thanks for the clarification Iryna!  I'll do some more studying and respond if I have further questions.

Mike







Re: Rotating cf-deployment certificates

Iryna Shustava
 

Hey Mike,

Although all applications may remain up while re-deploying I imagine things like loggregator will stop working mid deploy when doppler and metron certs no longer match.  Perhaps reps will be unable to properly drain when their certs don't match?  Does that sound correct?

We expect no app routability or log availability downtime during the three-step CA/cert rotation. That is because in step 1 we make all components trust both CAs, the old one and the new one. By step 2, when we roll out the new leaf certificates, every component already trusts both CAs, so the certificate switch happens without downtime. You will, however, see some `cf push` downtime, as David mentioned.

Is the expiration default the same for certificates created by credhub?  Are you aware of any way to increase the default expiration date for credhub or bosh-cli?

The default expiration is the same for CredHub and the BOSH CLI. The BOSH CLI does not allow you to change the certificate expiration period, but CredHub does. You can do so by adding the `duration` property, measured in days, to your certificate or CA variable in the manifest.
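One way to add that property without editing the manifest by hand might be a small ops-file. The sketch below only generates such a file; the function name, the output file name, and the value 730 are illustrative assumptions, and it presumes the variable is declared in the manifest's `variables` section and generated by CredHub.

```shell
#!/usr/bin/env bash
# Print a BOSH ops-file that sets the `duration` option (in days) on one
# certificate variable. write_duration_opsfile is a hypothetical helper.
# Usage: write_duration_opsfile VARIABLE_NAME DAYS
write_duration_opsfile() {
  cat <<EOF
- type: replace
  path: /variables/name=$1/options/duration?
  value: $2
EOF
}

# Example (file name and the 730-day value are arbitrary choices):
#   write_duration_opsfile service_cf_internal_ca 730 > set-duration.yml
#   bosh int cf-deployment.yml -o set-duration.yml   # inspect the result
```

The trailing `?` in the path tells BOSH to create the `duration` key if it does not exist yet.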

Long term are core teams working towards zero downtime cert rotation capabilities?  Or do you foresee the need to rotate with some service impact an issue long term?

If you're interested, this doc describes the reasons behind `cf push` downtime during CA rotation.

Thanks!
Iryna, CF Release Integration Team







Re: Rotating cf-deployment certificates

Mike Youngstrom
 

Thanks for the heads up David.  I have questions about the rotation process.

Although all applications may remain up while re-deploying I imagine things like loggregator will stop working mid deploy when doppler and metron certs no longer match.  Perhaps reps will be unable to properly drain when their certs don't match?  Does that sound correct?

Is the expiration default the same for certificates created by credhub?  Are you aware of any way to increase the default expiration date for credhub or bosh-cli?

Long term are core teams working towards zero downtime cert rotation capabilities?  Or do you foresee the need to rotate with some service impact an issue long term?

Thanks,
Mike





Re: Rotating cf-deployment certificates

Benjamin Gandon
 

Hello Carlo,

I'm definitely interested in your step that checks whether any of the certs are close to their expiration date!
If you can share this on GitHub somewhere, it would be perfect!

Regards,
/Benjamin GANDON (from my iPhone)

On Mar 6, 2018, at 03:05, Carlo Alberto Ferraris <carlo.ferraris@...> wrote:

Just a couple of random notes about this:
- Since we have a lot of certificates in our deployment manifest (not just the CF/Diego ones), we actually have a step in our deployment process that automatically checks whether any of them is close to its expiration date (or invalid for other reasons). If anybody is interested, we can publish it somewhere.
- It would be nice to have the cert generation scripts prompt for the desired validity of the certificates (to avoid surprises).
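
A check of the kind described in the first note could look something like the sketch below. To be clear, this is not the script mentioned above, just an illustration; it assumes `openssl` is available and that certificates appear in the file as ordinary multi-line PEM blocks (possibly indented, as in a YAML vars-store). The name `check_certs` and the 30-day default are arbitrary.

```shell
#!/usr/bin/env bash
# Report every PEM certificate in FILE that expires within DAYS days
# (or that openssl cannot parse). Returns non-zero if any was reported.
# Usage: check_certs FILE [DAYS]
check_certs() {
  local file=$1 days=${2:-30} dir i=1 bad=0
  dir=$(mktemp -d) || return 2
  # Split FILE into one temporary file per certificate; leading whitespace
  # is stripped so certificates embedded in indented YAML are handled too.
  sed 's/^[[:space:]]*//' "$file" | awk -v dir="$dir" '
    /-----BEGIN CERTIFICATE-----/ { n++; out = dir "/cert" n ".pem" }
    out                           { print > out }
    /-----END CERTIFICATE-----/   { close(out); out = "" }'
  while [ -e "$dir/cert$i.pem" ]; do
    # -checkend N exits non-zero if the certificate expires within N seconds.
    if ! openssl x509 -noout -checkend $((days * 86400)) \
         -in "$dir/cert$i.pem" >/dev/null 2>&1; then
      echo "certificate #$i in $file: expires within $days days or is invalid"
      bad=1
    fi
    i=$((i + 1))
  done
  rm -rf "$dir"
  return $bad
}

# Usage (vars.yml is a hypothetical file name):
#   check_certs vars.yml 60 || echo "time to rotate"
```

Because the function returns a non-zero status when anything is reported, it can act as a gate in a deployment pipeline.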
