container restart on logout


DHR
 

Hi,

Last year, when ‘cf ssh’ functionality was being discussed, I’m pretty sure the concept of automatically restarting containers after an SSH session came up.
It was intended to protect against app containers becoming snowflakes.

I’m fairly sure that protection hasn’t been introduced yet: I tested cf ssh-ing into a PCF Dev container today, wrote a file, and was able to log back into the container later and see that it was still present.

Is this feature, or any other app container snowflake protection, still planned?
I couldn’t see anything in the Diego backlog (https://www.pivotaltracker.com/n/projects/1003146).

Thanks
Dave


Jon Price
 

This is something that has been on our wishlist as well but I haven't seen any discussion about it in quite some time. Here is one of the original discussions about it: https://lists.cloudfoundry.org/archives/list/cf-dev(a)lists.cloudfoundry.org/thread/GCFOOYRUT5ARBMUHDGINID46KFNORNYM/

It would go a long way with our security team if we could have some sort of recycling policy for containers in some of our more secure environments.

Jon Price
Intel Corporation


DHR
 

Thanks Jon. The financial services clients I have worked with would also like the ability to turn on ‘cf ssh’ support in production, safe in the knowledge that app teams won’t abuse it by creating app snowflakes.

I see that the audit trail mentioned in the thread you posted has been implemented in ‘cf events’. For example:

time                          event                      actor   description
2016-12-19T16:20:36.00+0000   audit.app.ssh-authorized   user    index: 0
2016-12-19T15:30:33.00+0000   audit.app.ssh-authorized   user    index: 0
2016-12-19T12:00:53.00+0000   audit.app.ssh-authorized   user    index: 0


That said, I still think the container recycle functionality, available behind, say, a feature flag, would be really appreciated by the large enterprise community.



Daniel Jones
 

Plus one!

An implementation whereby the recycling behaviour can be feature-flagged by
space or globally would be nice, so you could turn it off whilst debugging
in a space, and then re-enable it when you've finished debugging via a
series of short-lived SSH sessions.

Regards,
Daniel Jones - CTO
+44 (0)79 8000 9153
@DanielJonesEB <https://twitter.com/DanielJonesEB>
*EngineerBetter* Ltd <http://www.engineerbetter.com> - UK Cloud Foundry
Specialists



David Illsley <davidillsley@...>
 

I have no idea why the idea hasn't been implemented, but pondering it, it
seems hard to do because of the cases you mention. Some people need a
policy that 'app teams won’t abuse it by creating app snowflakes', and in
some (most?) cases you need the flexibility to do debugging, as you
mentioned.

I think it's possible to combine the SSH-authorized events with the
instance uptime details from the API to build an audit capability:
identify instances which have been SSH'd into and not recycled within some
time period (e.g. 1 hour). You could then either have an escalation
process to get a human to do something about it (in case there's a reason
an hour wasn't enough), or, more brutally, give the audit code the ability
to restart the instance.
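
A minimal sketch of that check in Python. The event and uptime inputs here
are plain tuples and dicts standing in for whatever you'd pull from the
`audit.app.ssh-authorized` events and the per-instance stats API, so the
data shapes are assumptions, not the real CC API schemas:

```python
from datetime import datetime, timedelta

def find_unrecycled(ssh_events, instance_started_at,
                    grace=timedelta(hours=1), now=None):
    """Return (app_guid, index) pairs that were SSH'd into but whose
    instance has not been restarted within the grace period.

    ssh_events:          list of (app_guid, index, event_time) tuples,
                         taken from audit.app.ssh-authorized events
    instance_started_at: dict mapping (app_guid, index) -> instance
                         start time, taken from the stats API
    """
    now = now or datetime.utcnow()
    flagged = []
    for app_guid, index, event_time in ssh_events:
        started = instance_started_at.get((app_guid, index))
        # An instance recycled after the SSH session has a later start time.
        recycled = started is not None and started > event_time
        if not recycled and now - event_time > grace:
            flagged.append((app_guid, index))
    return flagged
```

Feeding it real data would mean paging through the events and stats
endpoints; that wiring is omitted here. The output could drive either the
escalation process or the automatic restart.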





Daniel Jones
 

Hmm, here's an idea that I haven't thought through and so is probably rubbish...

How about an immutability enforcer? Recursively checksum the expanded
contents of a droplet, and kill-with-fire anything that doesn't match.
It'd need to be optional for folks storing ephemeral data on their
ephemeral disk, and a non-invasive (i.e. no changes to CF components)
implementation would *depend* on `cf ssh` or a chained buildpack, but maybe
that's a nice compromise that could be quicker to develop than waiting for
mainline code changes to CF?
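
A rough sketch of the checksum side in Python, assuming the droplet has
been expanded into a directory; this is a hypothetical enforcer, not an
existing CF component:

```python
import hashlib
import os

def tree_digest(root):
    """Map each file under root to its SHA-256, keyed by relative path."""
    digests = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digests[os.path.relpath(path, root)] = \
                    hashlib.sha256(f.read()).hexdigest()
    return digests

def drifted(baseline, current):
    """Paths added, removed, or modified since the baseline snapshot."""
    changed = {p for p in set(baseline) & set(current)
               if baseline[p] != current[p]}
    return sorted((set(baseline) ^ set(current)) | changed)
```

Record the baseline once at staging time, re-run `tree_digest` inside the
container periodically, and anything `drifted` reports could trigger the
kill-with-fire restart (subject to the opt-out for apps that legitimately
write to ephemeral disk).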

Regards,
Daniel Jones - CTO
+44 (0)79 8000 9153
@DanielJonesEB <https://twitter.com/DanielJonesEB>
*EngineerBetter* Ltd <http://www.engineerbetter.com> - UK Cloud Foundry
Specialists


Graham Bleach
 

An idea we've been kicking around is to ensure that app instance
containers never live longer than a certain time (e.g. 3, 6, 12 or 24
hours).

This would ensure that we'd catch cases where apps weren't able to cope
with being rescheduled to different cells. It'd also strongly discourage
manual tweaks via SSH. It'd probably be useful for people deploying apps
to be able to run an aggressive version of this behaviour in their testing
pipelines, prior to production deployment, to catch regressions that
reintroduce state into app instances.

There's a naive implementation in my head that would work fine on smaller
installations: loop through the app instances returned by the API and
restart them.
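
That naive loop might look something like this in Python, shelling out to
the cf CLI's `restart-app-instance` command; the app list is illustrative,
and the command building is kept separate so it can be checked without a
live foundation:

```python
import subprocess

def restart_commands(apps):
    """Build one `cf restart-app-instance` invocation per instance.

    apps: list of (app_name, instance_count) pairs, e.g. scraped
          from `cf apps` output or the API.
    """
    return [["cf", "restart-app-instance", name, str(index)]
            for name, count in apps
            for index in range(count)]

def recycle(apps, run=subprocess.check_call):
    # Restart serially; a less naive version would wait for each
    # restarted instance to report healthy before moving on.
    for cmd in restart_commands(apps):
        run(cmd)
```

Running it on a schedule, filtered to instances older than the chosen
lifetime, would give the "never live longer than N hours" behaviour on a
small installation.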

Cheers,
Graham


Stefan Mayr
 

How would this cope with the following issues?

Temporary data: some software still uses sessions, file uploads or caches
which are buffered or written to disk (Java/Tomcat, PHP, ...). While it is
okay to lose this data when a container is restarted (after you have had
some time to work with it), it becomes a problem when every write can
cause the container to be recreated. How should an upload form work if
every upload can kill the container? I'm only referring to the processing
of the upload, not permanently storing it.

Single instances: recreating app containers when there are more than two
should not cause too many issues. But if there is only one instance you
have two choices:
- kill the running container and start a new one -> short downtime
- start a second instance and kill the first one afterwards -> a problem
  if the application is only allowed to run with one instance (a singleton)

One-shot tasks: a slight variation of the single-instance problem, and the
question of whether you are allowed to restart a one-shot task.

Happy holidays,

Stefan


Graham Bleach
 

Hi Stefan,

On 23 December 2016 at 13:52, Stefan Mayr <stefan(a)mayr-stefan.de> wrote:
> Temporary data: some software still uses sessions, file uploads or
> caches which are buffered or written to disk (Java/Tomcat, PHP, ...).
> While it is okay to lose this data when a container is restarted (after
> you have had some time to work with it), it becomes a problem when every
> write can cause the container to be recreated. How should an upload form
> work if every upload can kill the container? I'm only referring to the
> processing of the upload, not permanently storing it.

I think this was in response to Dan's immutability enforcement
proposal, so I'll let him respond :)

> Single instances: recreating app containers when there are more than two
> should not cause too many issues. But if there is only one instance you
> have two choices:
> - kill the running container and start a new one -> short downtime
> - start a second instance and kill the first one afterwards -> a problem
>   if the application is only allowed to run with one instance (a singleton)

App instances go away when cells get replaced (e.g. during a stemcell
update) or fail, so apps need to be able to cope with that. If you're not
comfortable with downtime then the app probably shouldn't be single
instance.

For my naive "loop through all the app instances" script, I'd be inclined
to check that the restarted instance was healthy again before moving on to
the next one.

> One-shot tasks: a slight variation of the single-instance problem, and
> the question of whether you are allowed to restart a one-shot task.

Tasks feel less safe to interrupt than app instances. I'm unclear what
happens to a running task when its cell gets destroyed, and therefore
whether there's a reasonable upper bound on how long a task should take to
complete.

--
Technical Architect
Government Digital Service