cf-deployment 3.0
How long will 1.x and 2.x cf-deployments be maintained with security patches? Without that, it sounds like there’s potential for a lot of organizations to be faced with breaking changes and instability every time they upgrade (if upgrade cycles internally take a month or two, and major versions are coming out as often or more), not to mention the difficulties of jumping multiple major versions at once.
From: <cf-dev@...> on behalf of Josh Collins <jcollins@...>
Reply-To: "cf-dev@..." <cf-dev@...>
Date: Tuesday, July 3, 2018 at 6:16 PM
To: cf-eng <cf-eng@...>, cf-pm <cf-pm@...>, "cf-dev@..." <cf-dev@...>, CF Dev <cf-dev-eng@...>
Subject: [External] [cf-dev] cf-deployment 3.0
Hey Y'all,
Cf-deployment 3.0 is around the corner.
We're going to go 3.0 in 2-3 weeks.
We released cf-deployment 2.0 on June 18th and included 'breaking' changes.
Breaking changes in the context of cf-d are changes which would require special attention from operators for the deployment to succeed. Executing the same bosh deploy command and arguments used in the previous deployment may fail, depending on which ops files and features operators had deployed with in the past.
Going forward, we'd like to introduce a more regular (~monthly) cadence to major point releases of cf-deployment.
The goal is two-fold and in-order-of-importance:
- provide a reliable mechanism for cf component teams to integrate and release major changes
- mitigate fear of major point releases in the minds of operators/cf-consumers
As of today, we've got one PR that includes breaking changes and I'm putting out a call to y'all.
If you've got what you'd consider to be a breaking change that warrants going out in a major point release of cf-deployment, please submit your PRs and reach out to the RelInt team as soon as you're able to so we can come up to speed and support you!
Cheers,
Josh
Dear Josh,
You are correct, in the past the RelInt team hasn't provided security releases. Instead, the credo was to go forward with the regular releases to also get the newest security fixes. This, however, was only easily possible because *the newer version did not introduce breaking changes with potentially big impact at the same time*.
I understand your mission of helping other teams increase their velocity. Maintaining multiple branches with fixes is certainly not fun, and I agree that it makes sense to try to avoid this if possible. I'm not sure I get the container networking 2.0 reference, though. Could you elaborate a bit more on this and how it is related to the current discussion?
Thanks and warm regards
Marco
From: <cf-dev@...> on behalf of Josh Collins <jcollins@...>
Reply-To: "cf-dev@..." <cf-dev@...>
Date: Wednesday, 11. July 2018 at 20:43
To: "cf-dev@..." <cf-dev@...>
Subject: Re: [cf-dev] cf-deployment 3.0
The Release Integration team hasn't provided security releases in the past -- for neither cf-release nor cf-deployment -- and doing so would be burdensome and impede the evolution of cf-deployment. Therefore, we're not currently planning to start providing security patches. But we appreciate the feedback and will keep an eye on the problem.
Because the RelInt team's primary goal is to support the CF Foundation engineering teams and their ability to validate their commits in CI, we need to focus more on keeping up-to-date with their changes. We want to set a release cadence that's aligned with, and ideally increases, the velocity of the teams. Take a look at what happened with container networking when they wanted to ship 2.0...
Thanks for reaching out Geoff!
I'm happy to provide more context on the container networking 2.0 reference.
The container networking team submitted a PR to cf-deployment with changes required for them to ship v2.0.
RelInt deferred the container networking team's PR for a few weeks due to competing priorities, including multiple CVE fixes.
During the deferral time, a few other PRs were submitted which included breaking changes.
These additional changes took much more time to integrate and validate than anticipated, and in the end, the container networking team's 2.0 release was published in cf-d about 5 weeks after it was ready to go.
The introduction of a regular cadence aims to mitigate this type of delay in the future. Had we had one at the time, the networking team would have timed its PR to align and we would have been poised to accept and publish it quickly.
We believe this will help teams confidently plan for, communicate about, and negotiate integrating their releases into cf-deployment.
And hopefully enable the RelInt team to integrate and ship major releases more seamlessly.
This is an evolving process so we'll see how things roll in the coming months and make adjustments where it makes sense to do so.
I appreciate and welcome any and all feedback along the way.
Thanks very much,
Josh
I’m going to agree with Marco’s concerns here. Making life harder and less stable for the end users of CF has a real potential to alienate and push away the CF userbase altogether, even if it’s just in appearance (seeing monthly major releases of a product may cause new organizations to hesitate to migrate, until the release process appears more stable).
From: <cf-dev@...> on behalf of Marco Voelz <marco.voelz@...>
Reply-To: "cf-dev@..." <cf-dev@...>
Date: Monday, July 16, 2018 at 1:34 AM
To: "cf-dev@..." <cf-dev@...>
Subject: [External] Re: [cf-dev] cf-deployment 3.0
Dear Josh,
Thanks for the context, I wasn't aware of what happened before the release of networking 2.0. To stick with your example, though: From what you are saying I have understood that you would rather have done it this way – please correct me here if I'm wrong:
- integrate networking release 2.0 into cf-deployment,
- integrate other PRs with breaking changes
- bumping cf-deployment to a new major version, given above changes
- merging the CVE fixes only into the new major version of cf-deployment
With this process, you would have achieved the following:
- the development teams are happy, because they shipped as soon as they were ready to
- operators are grumpy, because they have to bump networking to a new major version and adapt to other breaking changes in order to fix CVEs
I'm not saying you have to turn this tradeoff the other way around, but in my opinion this doesn't seem very consumer friendly.
In your team's mission, you have clearly stated that your goal is to enable development teams to maintain a high velocity. I'd like to stress that we shouldn't leave the operators and users out of the picture here. In the end, you're developing for them, not for yourself.
I'm not sure if the consumer/operator persona is a thing for RelInt, but if that's the case, here's something I'd like to hold true for whatever change RelInt makes to their process:
"As an operator of CF, I'd like to consume CVE fixes with as little changes to my existing installation as possible, such that I close known vulnerabilities as soon as possible"
Does that sound reasonable?
Warm regards
Marco
From: cf-dev@... <cf-dev@...> on behalf of Josh Collins <jcollins@...>
Sent: Friday, July 13, 2018 11:39:30 PM
To: cf-dev@...
Subject: Re: [cf-dev] cf-deployment 3.0
Hi Marco,
I'm happy to provide more context on the container networking 2.0 reference.
The container networking team submitted a PR to cf-deployment with changes required for them to ship v2.0.
RelInt deferred the container networking team's PR for a few weeks due to competing priorities, including multiple CVE fixes.
During the deferral time, a few other PRs were submitted which included breaking changes.
These additional changes took much more time to integrate and validate than anticipated, and in the end, the container networking team's 2.0 release was published in cf-d about 5 weeks after it was ready to go.
The introduction of a regular cadence aims to mitigate this type of delay in the future. Had we had one at the time, the networking team would have timed its PR to align and we would have been poised to accept and publish it quickly.
We believe this will help teams confidently plan for, communicate about, and negotiate integrating their releases into cf-deployment.
And hopefully enable the RelInt team to integrate and ship major releases more seamlessly.
This is an evolving process so we'll see how things roll in the coming months and make adjustments where it makes sense to do so.
I appreciate and welcome any and all feedback along the way.
Thanks very much,
Josh
Food for thought: One of the challenges here is that maintaining patches for past coordinated releases is expensive (both in time and CI costs). In the CF ecosystem, this has traditionally been the responsibility of the downstream commercial distributions.
This isn't to say that there isn't a solution that can help all downstream users (including non-commercial users AND the distros), yet not burden the Rel Int team too much. I'm not sure what that solution is though...
--
Chip Childers
CTO, Cloud Foundry Foundation
1.267.250.0815
I was about to mention that I indeed enjoyed the existing CF model of releases which roughly translated to “you better run fast” for consumers.
The thing I found needed some tweaking in the existing model was the approach to including fixes for very-high-priority CVEs. Oftentimes, in our quest to run fast and keep systems secure as quickly as possible, we ended up pulling in a bunch of features which required additional validation and essentially slowed us down in our effort to roll things out to production.
I felt that the better approach to support people who can keep up the speed would have been to always provide fixes for very-high-priority CVEs as cherry-picks based on the latest released version (and then of course also include those fixes in the next “regular” release, too).
Based on the comments so far, it sounds like for consumers “you better run fast” will actually be harder with the newly proposed approach. But maybe I’m not fully understanding the concepts, so it would be great to get some more details on the plans.
Regards,
Bernd
From: <cf-dev@...> on behalf of Chip Childers <cchilders@...>
Reply-To: "cf-dev@..." <cf-dev@...>
Date: Wednesday, 18. July 2018 at 19:38
To: "cf-dev@..." <cf-dev@...>
Subject: Re: [cf-dev] cf-deployment 3.0
Food for thought: One of the challenges here is that maintaining patches for past coordinated releases is expensive (both in time and CI costs). In the CF ecosystem, this has traditionally been the responsibility of the downstream commercial distributions.
This isn't to say that there isn't a solution that can help all downstream users (including non-commercial users AND the distros), yet not burden the Rel Int team too much. I'm not sure what that solution is though...
As the previous project lead for RelInt, I want to speak to Marco's concerns directly. We _definitely_ considered the operator as an important persona during any decision-making; if anything, we were overcommitted to that persona, evidenced by the fact that we became at times an obstacle to CFF dev teams out of fear of making a breaking changes for operators.There's clearly some concern that operators won't be able to keep up with breaking changes. However, one impact of making breaking changes more frequently -- and, even better, on a schedule -- is to reduce the difficulty of adapting to them. To build a bit on what Josh said earlier in his example about cf-networking 2.0, as we pushed off releasing a major version of cf-deployment, more backwards-incompatible updates were stockpiled in the backlog. In the end, cf-deployment 2.0 included **seven** breaking changes instead of merely one or two.To link this back to Marco's story -- "As an operator of CF, I'd like to consume CVE fixes with as little changes to my existing installation as possible, such that I close known vulnerabilities as soon as possible" -- this is already a problem with cf-deployment. As others have mentioned, there's no back-porting of cf-deployment after major version bumps, so operators already have to accommodate breaking changes in order to get CVE fixes. I understand that the proposal means that this happens more often, but it also means that major version bumps will be more predictable and less risky.[0]I wasn't sure if it was worth rehashing the days of cf-release or not, but since Jesse broached the subject, I'd give his comments a +1 all around. One of the ways I understood Josh's proposal was as an important course correction. If cf-release was too free-wheeling in making breaking changes, cf-deployment has been too conservative. The proposal for a regular cadence of breaking changes seems like a balance between those two. 
Similarly, this is a re-balancing with regards to the personas as well: based on experience, the RelInt team has learned that it should be more willing to release breaking changes for operators in order to empower the CFF dev teams.SabetiAlso _formerly_ of the RelInt team[0] Bernd has an interesting point about providing patch updates only to the latest release of cf-deployment, as a way to provide operators with a CVE-fix-only release. Providing such releases is also non-trivial work that I'm not sure the RelInt team would prioritize. Also, RelInt ships minor releases twice per week, so the changesets are typically small. Still, it seems a bit more palatable than any kind of LTS because it assists operators in living up to the "you better run fast."On Wed, Jul 18, 2018 at 10:59 AM Krannich, Bernd <bernd.krannich@...> wrote:I was about to mention that I indeed enjoyed the existing CF model of releases which roughly translated to “you better run fast” for consumers.
The thing I found needed some tweaking in the existing model was the approach to including fixes for very-high-priority CVEs. Oftentimes, in our quest to run fast and secure systems as quickly as possible, we ended up pulling in a bunch of features which required additional validation and essentially slowed us down in our effort of rolling things out to production.
I felt that the better approach to support people who can keep up the speed would have been to always provide fixes for very-high-priority CVEs as cherry-picks based on the latest released version (and then of course also include those fixes in the next “regular” release, too).
Based on the comments so far, it sounds like for consumers “you better run fast” will actually be harder with the newly proposed approach. But maybe I’m not fully understanding the concepts, so it would be great to get some more details on the plans.
Regards,
Bernd
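Bernd's cherry-pick idea above can be made concrete with a small model. This is only a minimal sketch, not how RelInt builds releases: the function names and change labels (`cve_only_release`, `regular_release`, `feature-c`, and so on) are hypothetical, and a release is simplified to an ordered list of change names. It shows how much an operator has to validate beyond the latest release in each case.

```python
from typing import List


def cve_only_release(latest_release: List[str], cve_fixes: List[str]) -> List[str]:
    """Hypothetical patch release: the latest release plus only the CVE fixes."""
    return latest_release + [fix for fix in cve_fixes if fix not in latest_release]


def regular_release(
    latest_release: List[str], pending_changes: List[str], cve_fixes: List[str]
) -> List[str]:
    """Hypothetical regular release: all pending changes, plus the same CVE fixes."""
    changes = latest_release + pending_changes
    return changes + [fix for fix in cve_fixes if fix not in changes]


# Illustrative change names only; none of these come from the thread.
latest = ["feature-a", "feature-b"]
pending = ["feature-c", "breaking-change-d"]
fixes = ["cve-fix-1"]

patch = cve_only_release(latest, fixes)
regular = regular_release(latest, pending, fixes)

# What an operator must validate on top of the latest release in each case.
extra_patch = [c for c in patch if c not in latest]
extra_regular = [c for c in regular if c not in latest]

print(extra_patch)
print(extra_regular)
```

In this model the CVE-only release adds a single change on top of what the operator already runs, while the regular release carries the same fix together with every pending change, including the breaking one.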
From: <cf-dev@...> on behalf of Chip Childers <cchilders@...>
Reply-To: "cf-dev@..." <cf-dev@...>
Date: Wednesday, 18. July 2018 at 19:38
To: "cf-dev@..." <cf-dev@...>
Subject: Re: [cf-dev] cf-deployment 3.0
Food for thought: One of the challenges here is that maintaining patches for past coordinated releases is expensive (both in time and CI costs). In the CF ecosystem, this has traditionally been the responsibility of the downstream commercial distributions.
This isn't to say that there isn't a solution that can help all downstream users (including non-commercial users AND the distros), yet not burden the Rel Int team too much. I'm not sure what that solution is though...
On Mon, Jul 16, 2018 at 9:47 AM Franks, Geoff <geoff.franks@...> wrote:
I’m going to agree with Marco’s concerns here. Making life harder and less stable for the end users of CF has a real potential to alienate and push away the CF userbase altogether, even if it’s just in appearance (seeing monthly major releases of a product may cause new organizations to hesitate to migrate until the release process appears more stable).
From: <cf-dev@...> on behalf of Marco Voelz <marco.voelz@...>
Reply-To: "cf-dev@..." <cf-dev@...>
Date: Monday, July 16, 2018 at 1:34 AM
To: "cf-dev@..." <cf-dev@...>
Subject: [External] Re: [cf-dev] cf-deployment 3.0
Dear Josh,
Thanks for the context, I wasn't aware of what happened before the release of networking 2.0. To stick with your example, though: From what you are saying I have understood that you would rather have done it this way – please correct me here if I'm wrong:
- integrate networking release 2.0 into cf-deployment,
- integrate other PRs with breaking changes
- bump cf-deployment to a new major version, given the above changes
- merge the CVE fixes only into the new major version of cf-deployment
With this process, you would have achieved the following:
- the development teams are happy, because they shipped as soon as they were ready to
- operators are grumpy, because they have to bump networking to a new major version and adapt to other breaking changes in order to fix CVEs
I'm not saying you have to turn this tradeoff the other way around, but in my opinion this doesn't seem very consumer friendly.
In your team's mission, you have clearly stated that your goal is to enable development teams to maintain a high velocity. I'd like to stress that we shouldn't leave the operators and users out of the picture here. In the end, you're developing for them, not for yourself.
I'm not sure if the consumer/operator persona is a thing for RelInt, but if that's the case, here's something I'd like to hold true for whatever change RelInt makes to their process:
"As an operator of CF, I'd like to consume CVE fixes with as little changes to my existing installation as possible, such that I close known vulnerabilities as soon as possible"
Does that sound reasonable?
Warm regards
Marco
From: cf-dev@... <cf-dev@...> on behalf of Josh Collins <jcollins@...>
Sent: Friday, July 13, 2018 11:39:30 PM
To: cf-dev@...
Subject: Re: [cf-dev] cf-deployment 3.0
Hi Marco,
I'm happy to provide more context on the container networking 2.0 reference.
The container networking team submitted a PR to cf-deployment with changes required for them to ship v2.0.
RelInt deferred the container networking team's PR for a few weeks due to competing priorities, including multiple CVE fixes.
During the deferral time, a few other PRs were submitted which included breaking changes.
These additional changes took much more time to integrate and validate than anticipated and in the end, the container networking team's 2.0 release was published in cf-d about 5 weeks after it was ready to go.
The introduction of a regular cadence aims to mitigate this type of delay in the future. Had we had one at the time, the networking team would have timed its PR to align and we would have been poised to accept and publish it quickly.
We believe this will help teams confidently plan for, communicate about, and negotiate integrating their releases into cf-deployment.
And hopefully enable the RelInt team to integrate and ship major releases more seamlessly.
This is an evolving process so we'll see how things roll in the coming months and make adjustments where it makes sense to do so.
I appreciate and welcome any and all feedback along the way.
Thanks very much,
Josh
--
Chip Childers
CTO, Cloud Foundry Foundation
1.267.250.0815
Thanks Geoff, Marco, Chip, Jesse, Bernd, and David for sharing your feedback and thoughts. You’ve expressed valid concerns and provided valuable context that I take to heart. I really appreciate the time and effort required for meaningful dialogue about the impacts of the proposed release cadence.
While the RelInt team's primary goal remains supporting the CF Foundation engineering teams and their ability to validate their commits in CI, your points underscore a tension we’re acutely aware of.
We’re trying to meet the needs of both the CFF Contributor and Operator and the ‘trick’ is to find a sustainable balance between the two. However, on occasions where we must prioritize one over the other we’re going to favor the CFF Contributor.
I mentioned this earlier, but it’s worth restating that the RelInt team doesn’t have any plans to provide LTS support and, as Chip and Jesse pointed out, that has traditionally been a value-added service provided by commercial vendors.
In the spirit of iteration, I’d like to propose we proceed with the release cadence I originally outlined and see how it goes.
Again, thank you for providing such valuable feedback.
Cheers,
Josh Collins
Dear Josh, dear David,
Thanks David for sharing your past experiences in the RelInt team. I can sympathize with the stories you shared and understand the motivation for the planned changes better.
Now that cf-deployment 3.0 is there, let me tell you "how it went": It now means you have to switch to bosh-dns to receive security updates.
There are a number of reasons why we didn't introduce bosh-dns yet in our production system:
- This ~200 lines of .yml just for aliasing DNS names [1], as the story making this obsolete isn't done yet [2]
- This needs to be replicated e.g. in the ops-file to rename the network [3] which makes it even more terrible to maintain
- There were open issues [4] that are important for larger-scale deployments. I give you that this is fixed now with dns-release 1.8.0 – but this came after you released cf-deployment 3.0
- Parts of the above issue try a fix by introducing an experimental flag to get feedback from teams. Given this actually *is* an issue, I'd want to wait and see what comes out of this.
- Other teams are still surprised from time to time by bosh-dns behavior and are looking into whether this might have implications they need to deal with [5]
- adopt bosh-dns *right now* although we don't feel good about it,
- try to bring back consul for a while (not even sure that's possible) and otherwise follow cf-deployment 3.0
- backport security fixes only to a cf-deployment 2.x based production env
Sent: Wednesday, July 18, 2018 11:54:06 PM
To: cf-dev@...
Subject: Re: [cf-dev] cf-deployment 3.0