As the previous project lead for RelInt, I want to speak to Marco's concerns directly. We _definitely_ considered the operator as an important persona during any decision-making; if anything, we were overcommitted to that persona, evidenced by the fact that we became at times an obstacle to CFF dev teams out of fear of making breaking changes for operators.
There's clearly some concern that operators won't be able to keep up with breaking changes. However, one impact of making breaking changes more frequently -- and, even better, on a schedule -- is to reduce the difficulty of adapting to them. To build a bit on what Josh said earlier in his example about cf-networking 2.0, as we pushed off releasing a major version of cf-deployment, more backwards-incompatible updates were stockpiled in the backlog. In the end, cf-deployment 2.0 included **seven** breaking changes instead of merely one or two.
To link this back to Marco's story -- "As an operator of CF, I'd like to consume CVE fixes with as little changes to my existing installation as possible, such that I close known vulnerabilities as soon as possible" -- this is already a problem with cf-deployment. As others have mentioned, there's no back-porting of cf-deployment after major version bumps, so operators already have to accommodate breaking changes in order to get CVE fixes. I understand that the proposal means that this happens more often, but it also means that major version bumps will be more predictable and less risky.
I wasn't sure if it was worth rehashing the days of cf-release or not, but since Jesse broached the subject, I'd give his comments a +1 all around. One of the ways I understood Josh's proposal was as an important course correction. If cf-release was too free-wheeling in making breaking changes, cf-deployment has been too conservative. The proposal for a regular cadence of breaking changes seems like a balance between those two. Similarly, this is a re-balancing with regard to the personas as well: based on experience, the RelInt team has learned that it should be more willing to release breaking changes for operators in order to empower the CFF dev teams.
Also _formerly_ of the RelInt team
Bernd has an interesting point about providing patch updates only to the latest release of cf-deployment, as a way to provide operators with a CVE-fix-only release. Providing such releases is also non-trivial work that I'm not sure the RelInt team would prioritize. Also, RelInt ships minor releases twice per week, so the changesets are typically small. Still, it seems a bit more palatable than any kind of LTS because it helps operators live up to the "you better run fast" model.
I was about to mention that I indeed enjoyed the existing CF model of releases which roughly translated to “you better run fast” for consumers.
What I found needed some tweaking in the existing model was the approach to including fixes for very-high-priority CVEs. Oftentimes, in our quest to run fast and secure systems as quickly as possible, we ended up pulling in a bunch of features that required additional validation and essentially slowed down our rollout to production.
I felt that the better approach to support people who can keep up the speed would have been to always provide fixes for very-high-priority CVEs as cherry-picks based on the latest released version (and then, of course, also include those fixes in the next “regular” release).
Based on the comments so far, it sounds like for consumers “you better run fast” will actually be harder with the newly proposed approach. But maybe I’m not fully understanding the concepts, so it would be great to get some more details on the plans.
Subject: Re: [cf-dev] cf-deployment 3.0
Food for thought: One of the challenges here is that maintaining patches for past coordinated releases is expensive (both in time and CI costs). In the CF ecosystem, this has traditionally been the responsibility of the downstream commercial
This isn't to say that there isn't a solution that can help all downstream users (including non-commercial users AND the distros), yet not burden the RelInt team too much. I'm not sure what that solution is though...
I’m going to agree with Marco’s concerns here. Making life harder and less stable for the end users of CF has a real potential to alienate and push away the CF userbase altogether, even if it’s just in appearance (seeing monthly major releases of a product may cause new organizations to hesitate to migrate until the release process appears more stable).
Thanks for the context, I wasn't aware of what happened before the release of networking 2.0. To stick with your example, though: From what you are saying I have understood that you would rather have done it this way
– please correct me here if I'm wrong:
1. integrate networking release 2.0 into cf-deployment
2. integrate other PRs with breaking changes
3. bump cf-deployment to a new major version, given the above changes
4. merge the CVE fixes only into the new major version of cf-deployment
With this process, you would have achieved the following:
- the development teams are happy, because they shipped as soon as they were ready to
- operators are grumpy, because they have to bump networking to a new major version and adapt to other breaking changes in order to fix CVEs
I'm not saying you have to turn this tradeoff the other way around, but in my opinion this doesn't seem very consumer friendly.
In your team's mission, you have clearly stated that your goal is to enable development teams to maintain a high velocity. I'd like to stress
that we shouldn't leave the operators and users out of the picture here. In the end, you're developing for them, not for yourself.
I'm not sure if the consumer/operator persona is a thing for RelInt, but if that's the case, here's something I'd like to hold true for whatever
change RelInt makes to their process:
"As an operator of CF, I'd like to consume CVE fixes with as little changes to my existing installation as possible, such that I close known
vulnerabilities as soon as possible"
Does that sound reasonable?
I'm happy to provide more context on the container networking 2.0 reference.
The container networking team submitted a PR to cf-deployment with changes required for them to ship v2.0.
RelInt deferred the container networking team's PR for a few weeks due to competing priorities, including multiple CVE fixes.
During the deferral time, a few other PRs were submitted which included breaking changes.
These additional changes took much more time to integrate and validate than anticipated and in the end, the container networking team's 2.0 release was published in cf-d about 5 weeks after it was ready to go.
The introduction of a regular cadence aims to mitigate this type of delay in the future. Had we had one at the time, the networking team would have timed its PR to align and we would have been poised to accept and publish it quickly.
We believe this will help teams confidently plan for, communicate about, and negotiate integrating their releases into cf-deployment.
And hopefully enable the RelInt team to integrate and ship major releases more seamlessly.
This is an evolving process so we'll see how things roll in the coming months and make adjustments where it makes sense to do so.
I appreciate and welcome any and all feedback along the way.
Thanks very much,