Proposal: Improving Security for HTTP Ingress to CFAR Application Containers


Eric Malm <emalm@...>
 

Hi, everyone,

Building on the features and technologies the CF Diego and Routing teams have introduced into the CF App Runtime to improve application routing consistency, security, and stability (https://lists.cloudfoundry.org/g/cf-dev/topic/11900235#7744, which we have often called "route integrity"), the Diego team intends to make it possible for platform operators to opt into improving the security of how traffic ingresses into application containers. In particular, operators would be able to opt into ensuring that only CF system components, or even only the gorouter HTTP routers, would be able to connect to application containers from the infrastructure-provided network.

The full proposal document is available at https://docs.google.com/document/d/1DjapCLbdgGBmpuWt2P2PV-qm_vUwI_9IZHae9TbN_Pw/edit, and we welcome your comments and questions on the document or on this mailing list thread.

Some areas on which we would particularly like community feedback:

- This secured configuration would initially be incompatible with CF SSH, and would likely never be compatible with TCP routing, as the Routing team has focused its efforts on replacing both the Gorouter and the TCP routers with Istio-configured gateway Envoy proxies. Would those incompatibilities prohibit you as platform operators from opting into this improved security in environments where you would particularly like to enforce it?

- As part of enforcing this more secure configuration, the Diego cell components no longer map ports on their host VM directly to application ports inside the container. Each app instance currently receives the value of its host-side port in its CF_INSTANCE_PORT environment variable, though, and it is also exposed in the response from CC's app stats endpoint. For a variety of reasons (primarily the general availability of container networking and default app-security-group policies), we expect that these values are no longer useful for applications, and so we would like to deprecate them as part of this work and not to supply them in this optional, more secure configuration. Before we do so, we would like to know whether your applications, libraries, or other CF-related tools currently use this information and, if so, to what end.

Thanks,
Eric Malm, for the CF Diego team


Mike Youngstrom <youngm@...>
 

- This secured configuration would initially be incompatible with CF SSH, and would likely never be compatible with TCP routing, as the Routing team has focused its efforts on replacing both the Gorouter and the TCP routers with Istio-configured gateway Envoy proxies. Would those incompatibilities prohibit you as platform operators from opting into this improved security in environments where you would particularly like to enforce it?

We would love to have greater route integrity and security for our HTTP clients.  If configuring this would provide those improvements for HTTP but not for TCP and SSH, we would still deploy it just to get the benefits for HTTP.  Is that what you're asking?
 
- As part of enforcing this more secure configuration, the Diego cell components no longer map ports on their host VM directly to application ports inside the container. Each app instance currently receives the value of its host-side port in its CF_INSTANCE_PORT environment variable, though, and it is also exposed in the response from CC's app stats endpoint. For a variety of reasons (primarily the general availability of container networking and default app-security-group policies), we expect that these values are no longer useful for applications, and so we would like to deprecate them as part of this work and not to supply them in this optional, more secure configuration. Before we do so, we would like to know whether your applications, libraries, or other CF-related tools currently use this information and, if so, to what end.
 
We don't use these values as part of an API or client library.  However, we do find it useful, on occasion, to know the real network IP address of the cell an application is running on, usually for firewall or other network debugging activities.  We don't ever use the port information.  I imagine we can find this information other ways, but the CC API is currently the simplest way for our application developers to find this data on a self-service basis.  I don't think this is a big issue.  Just noting that my team (and perhaps others) will need to devise a different way to provide this information to app developers.
 
Thanks,
Mike


Eric Malm <emalm@...>
 

Hey, Mike,

On Wed, Aug 15, 2018 at 10:36 AM, Mike Youngstrom <youngm@...> wrote:
- This secured configuration would initially be incompatible with CF SSH, and would likely never be compatible with TCP routing, as the Routing team has focused its efforts on replacing both the Gorouter and the TCP routers with Istio-configured gateway Envoy proxies. Would those incompatibilities prohibit you as platform operators from opting into this improved security in environments where you would particularly like to enforce it?

We would love to have greater route integrity and security for our HTTP clients.  If configuring this would provide those improvements for HTTP but not for TCP and SSH, we would still deploy it just to get the benefits for HTTP.  Is that what you're asking?

Not quite: in the initial form of this more secured configuration, neither CF SSH nor TCP routing will work at all, as their gateway/front/edge routers would not have the network pathway into the container that they currently recognize. The Diego team is actually done at this point with the diego-release features required to enable that initial secured configuration, and will soon contribute to cf-deployment an experimental operations file to opt into it, and then focus on our approach to making CF SSH work again in this secure mode. We don't expect the current TCP routing tier ever to work with this configuration, though, as the Routing team is instead focused on the Istio integration effort as a longer-term plan to replace both the HTTP and TCP routing tiers. So I'm interested in knowing whether you'd be able to enable this extra security in any of your CF environments if either (a) CF SSH doesn't work as a result (short-term obstacle, resolved in a month or so) or (b) TCP routing doesn't work (longer-term obstacle, resolved only with Istio integration).
 
We don't use these values as part of an API or client library.  However, we do find it useful, on occasion, to know the real network IP address of the cell an application is running on, usually for firewall or other network debugging activities.  We don't ever use the port information.  I imagine we can find this information other ways, but the CC API is currently the simplest way for our application developers to find this data on a self-service basis.  I don't think this is a big issue.  Just noting that my team (and perhaps others) will need to devise a different way to provide this information to app developers.

Great, thanks for the feedback! The CF_INSTANCE_IP environment variable will continue to contain the cell VM's IP inside the container environment, and it'll likewise still be present in the response from the CC stats endpoint, so it sounds like those network-debugging activities would be unaffected. It'd of course be great to hear the specifics of how having that cell VM IP has been useful to your developers or to you as the platform operators in resolving those network-related issues, though.
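
For any app or library that does read these variables today, a minimal Python sketch of tolerating the change might look like the following. It assumes only what is described above: CF_INSTANCE_IP stays set, while CF_INSTANCE_PORT may be absent in the optional secured configuration. The helper name instance_address is hypothetical and not part of any CF library.

```python
import os
from typing import Mapping, Optional, Tuple


def instance_address(
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], Optional[int]]:
    """Return (cell IP, host-side port) from the container environment.

    Hypothetical helper: the port comes back as None when
    CF_INSTANCE_PORT is missing or is not a plain integer, which is the
    assumed behaviour of the secured configuration.
    """
    ip = env.get("CF_INSTANCE_IP")
    raw_port = env.get("CF_INSTANCE_PORT")
    port = int(raw_port) if raw_port and raw_port.isdigit() else None
    return ip, port
```

Calling instance_address() with no arguments reads the real process environment; passing a plain dict makes it easy to check how code behaves when the port is no longer supplied.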

Thanks again,
Eric


Mike Youngstrom <youngm@...>
 

Got it.
 
Not quite: in the initial form of this more secured configuration, neither CF SSH nor TCP routing will work at all, as their gateway/front/edge routers would not have the network pathway into the container that they currently recognize. The Diego team is actually done at this point with the diego-release features required to enable that initial secured configuration, and will soon contribute to cf-deployment an experimental operations file to opt into it, and then focus on our approach to making CF SSH work again in this secure mode. We don't expect the current TCP routing tier ever to work with this configuration, though, as the Routing team is instead focused on the Istio integration effort as a longer-term plan to replace both the HTTP and TCP routing tiers. So I'm interested in knowing whether you'd be able to enable this extra security in any of your CF environments if either (a) CF SSH doesn't work as a result (short-term obstacle, resolved in a month or so) or (b) TCP routing doesn't work (longer-term obstacle, resolved only with Istio integration).

We would probably not deploy the improved HTTP ingress until TCP and SSH are both working.  Our priority for the improvements in HTTP ingress has been more on the reliability side and less on the security side.  We haven't run into any NATS request mis-routing issues for a while, and we do have TCP customers.  So, we would probably prefer to keep riding our current NATS good-luck streak rather than disrupt our TCP customers.
 
Great, thanks for the feedback! The CF_INSTANCE_IP environment variable will continue to contain the cell VM's IP inside the container environment, and it'll likewise still be present in the response from the CC stats endpoint, so it sounds like those network-debugging activities would be unaffected. It'd of course be great to hear the specifics of how having that cell VM IP has been useful to your developers or to you as the platform operators in resolving those network-related issues, though.

That's perfect.  We don't use the port.

Some examples where the cell IP address has come in handy:
* Mostly firewall debugging.  Our firewall situation is a mess.  There is lots of manual work and there are issues to debug where knowing the specific cell IP address can sometimes help.
* Occasionally customers want to do tcpdumps from a destination server.  The IP of the cell hosting the source app instance helps reduce the tcpdump scope.  Unfortunately, in tcpdump situations the CF operations team usually gets involved anyway to grab a tcpdump from the cell side, since last time I checked we couldn't take tcpdumps from inside a container.  So, these scenarios are usually not very self-service anyway.

Hope that helps.

Mike


Eric Malm <emalm@...>
 

We would probably not deploy the improved HTTP ingress until TCP and SSH are both working.  Our priority for the improvements in HTTP ingress has been more on the reliability side and less on the security side.  We haven't run into any NATS request mis-routing issues for a while, and we do have TCP customers.  So, we would probably prefer to keep riding our current NATS good-luck streak rather than disrupt our TCP customers.

Great, thanks for letting me know. Just to clarify, route integrity on its own (enabled via https://github.com/cloudfoundry/cf-deployment/blob/master/operations/experimental/enable-routing-integrity.yml) is intended primarily to improve HTTP routing reliability in the face of NATS and other component failures, and is still compatible with both CF SSH and TCP routing. This proposed more secure mode is an opt-in configuration that depends on but is separate from that "basic" route integrity.

Also, we on the Diego team think that we've finally finished ironing out a minor edge case around making sure incoming requests are handled correctly when app instances are being shut down gracefully, so we expect to work with Release Integration in the next month or so to promote that route-integrity ops file to be a stable one and then later to be the default configuration in cf-deployment.
 
That's perfect.  We don't use the port.

Some examples where the cell IP address has come in handy:
* Mostly firewall debugging.  Our firewall situation is a mess.  There is lots of manual work and there are issues to debug where knowing the specific cell IP address can sometimes help.
* Occasionally customers want to do tcpdumps from a destination server.  The IP of the cell hosting the source app instance helps reduce the tcpdump scope.  Unfortunately, in tcpdump situations the CF operations team usually gets involved anyway to grab a tcpdump from the cell side, since last time I checked we couldn't take tcpdumps from inside a container.  So, these scenarios are usually not very self-service anyway.

Hope that helps.

Certainly, thanks for the details!

Best,
Eric


Mike Youngstrom <youngm@...>
 

We would probably not deploy the improved HTTP ingress until TCP and SSH are both working.  Our priority for the improvements in HTTP ingress has been more on the reliability side and less on the security side.  We haven't run into any NATS request mis-routing issues for a while, and we do have TCP customers.  So, we would probably prefer to keep riding our current NATS good-luck streak rather than disrupt our TCP customers.

Great, thanks for letting me know. Just to clarify, route integrity on its own (enabled via https://github.com/cloudfoundry/cf-deployment/blob/master/operations/experimental/enable-routing-integrity.yml) is intended primarily to improve HTTP routing reliability in the face of NATS and other component failures, and is still compatible with both CF SSH and TCP routing. This proposed more secure mode is an opt-in configuration that depends on but is separate from that "basic" route integrity.

Also, we on the Diego team think that we've finally finished ironing out a minor edge case around making sure incoming requests are handled correctly when app instances are being shut down gracefully, so we expect to work with Release Integration in the next month or so to promote that route-integrity ops file to be a stable one and then later to be the default configuration in cf-deployment.

Oh man, after re-reading your email it now makes sense.  To be honest, I didn't actually read the document you provided since it wasn't readable by everyone, so I just assumed what was in it instead.  Sorry.

Typically in our environments we use network firewalls to ensure that ingress into the network zones holding CF instances only happens through enterprise load balancers, and only then to specific components (e.g. gorouter, ssh-proxy, TCP router), and we use security groups to stop apps from talking directly to other containers.  I imagine that in the future we may deploy to environments with less strict network firewall setups, though.  In such an environment this configuration option would be very useful, and we probably would use it even without TCP routing support.  But we don't have such an environment currently.

Thanks for helping me through this email. :)

Mike


Eric Malm <emalm@...>
 

On Thu, Aug 16, 2018 at 9:16 PM, Mike Youngstrom <youngm@...> wrote:
Oh man, after re-reading your email it now makes sense.  To be honest, I didn't actually read the document you provided since it wasn't readable by everyone, so I just assumed what was in it instead.  Sorry.

Huh, something weird has been going on with the permissions on that document: this is the third time now that I've had to change them back to allowing global comments. If for some reason that reverts back to a more restricted mode (on this proposal document or any others I've posted here) please let me know via an access request or via email or Slack and I'll correct it again.
 
Typically in our environments we use network firewalls to ensure that ingress into the network zones holding CF instances only happens through enterprise load balancers, and only then to specific components (e.g. gorouter, ssh-proxy, TCP router), and we use security groups to stop apps from talking directly to other containers.  I imagine that in the future we may deploy to environments with less strict network firewall setups, though.  In such an environment this configuration option would be very useful, and we probably would use it even without TCP routing support.  But we don't have such an environment currently.

Thanks for helping me through this email. :)

Sure, thanks for the extra feedback, and for hanging in there!

Best,
Eric