Strict CPU quotas proposal


Carlo Alberto Ferraris
 

It is our understanding that currently application instances get CPU quotas assigned via cgroup CPU shares on the container running them [1]. This effectively sets a "minimum quota" of CPU time each container is guaranteed to have available, but leaves the maximum amount of CPU time unbounded.

This may be fine in the average case, but can have some pretty annoying effects in certain edge cases.

For the sake of discussion, let's consider two Diego cells, each running the same number N of containers, with the same amounts of CPU shares, memory and disk assigned. From the perspective of Diego, these two cells are perfectly balanced. Now let's assume that (because we're very unlucky):
- one instance of the same app is running on each of the two cells
- the other N-1 instances on cell 1 use all available CPU
- the other N-1 instances on cell 2 use no CPU
In this case the net effect is that the instance of the app on cell 2 will get N times as much CPU time as the instance on cell 1.
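The imbalance can be sketched with a toy model (hypothetical helper name; it assumes equal shares, a work-conserving scheduler, and that an instance's CPU fraction is its share of the shares held by *active* instances):

```go
package main

import "fmt"

// cpuFraction returns the fraction of a cell's total CPU an instance can use
// under shares-only scheduling: shares are relative weights, so an instance
// gets its weight divided by the total weight of all instances contending
// for the CPU at that moment.
func cpuFraction(myShares int, activeOtherShares []int) float64 {
	total := myShares
	for _, s := range activeOtherShares {
		total += s
	}
	return float64(myShares) / float64(total)
}

func main() {
	const n = 8 // N containers per cell, equal shares of 100 each
	busy := make([]int, n-1)
	for i := range busy {
		busy[i] = 100
	}
	cell1 := cpuFraction(100, busy) // N-1 neighbours saturating the CPU
	cell2 := cpuFraction(100, nil)  // neighbours idle: no contention
	fmt.Printf("cell 1: %.3f of CPU, cell 2: %.3f of CPU (ratio %.0fx)\n",
		cell1, cell2, cell2/cell1)
}
```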

It would be desirable, from our perspective, to be able to control such "performance swings", because they may lead users to overestimate their available processor resources, potentially leaving certain instances unable to serve their share of traffic effectively.

What we propose is to add an (opt-in) ability for CF operators to control the upper bound on how much CPU time a container can use. Specifically, we suggest teaching garden (runc) to also set cpu.cfs_quota_us and cpu.cfs_period_us (see CpuQuota and CpuPeriod in [3], and [2] for details). To be absolutely explicit: this would be an operator-controlled feature, not a user-controlled one.

What follows is a draft of how this functionality could be provided. It is intended as a basis for discussion more than as a fully fleshed-out proposal.

Concretely, this would likely require exposing two additional tunables for Diego reps (names are placeholders):
- cpu_quota_period_us
- cpu_quota_burst_ratio

cpu_quota_period_us would be used as is as the value of cpu.cfs_period_us. This is the time window used for allocating CPU time (see [2] for details). Setting it to 0 (the default) disables strict CPU quotas. Values in the range 1000 <= cpu_quota_period_us <= 1000000 are valid and enable strict CPU quotas; any other value is illegal. It is also illegal to set cpu_quota_period_us to 0 if cpu_quota_burst_ratio is not 0.

cpu_quota_burst_ratio would instead be used to compute the value of cpu.cfs_quota_us from the number of cores in the host (n_cores), the number of shares assigned to the container (cpu_shares), and the maximum number of shares that can be assigned to running containers (cpu_shares_max), as follows:

cpu.cfs_quota_us = cpu_quota_period_us * n_cores * cpu_quota_burst_ratio * cpu_shares / cpu_shares_max
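As a sketch (hypothetical helper name; integer microseconds, matching the cgroup interface):

```go
package main

import "fmt"

// cfsQuotaUS applies the formula above:
//   cpu.cfs_quota_us = cpu_quota_period_us * n_cores * cpu_quota_burst_ratio
//                      * cpu_shares / cpu_shares_max
func cfsQuotaUS(periodUS, nCores int64, burstRatio float64, cpuShares, cpuSharesMax int64) int64 {
	return int64(float64(periodUS*nCores) * burstRatio * float64(cpuShares) / float64(cpuSharesMax))
}

func main() {
	// A 4-core cell with a 100ms period and a container holding 512 of
	// 4096 total shares:
	fmt.Println(cfsQuotaUS(100000, 4, 1.0, 512, 4096)) // 50000, i.e. half a core
	fmt.Println(cfsQuotaUS(100000, 4, 1.5, 512, 4096)) // 75000
}
```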

cpu_shares_max can be calculated, if we ignore instance_min_cpu_share_limit, instance_max_cpu_share_limit and other limits, as:

cpu_shares_max = memory_mb * memory_overcommit_factor / instance_memory_to_cpu_share_ratio
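For instance (illustrative numbers only, with a hypothetical helper name):

```go
package main

import "fmt"

// cpuSharesMax applies the approximation above, ignoring the per-instance
// min/max share limits.
func cpuSharesMax(memoryMB, overcommitFactor, memToShareRatio int64) int64 {
	return memoryMB * overcommitFactor / memToShareRatio
}

func main() {
	// A cell with 32 GiB of memory, 2x memory overcommit, and 8 MB of
	// memory per CPU share:
	fmt.Println(cpuSharesMax(32768, 2, 8)) // 8192
}
```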

Setting cpu_quota_burst_ratio to 0 (the default) disables strict CPU quotas. Values cpu_quota_burst_ratio >= 1 are valid and enable strict CPU quotas; any other value is illegal. It is also illegal to set cpu_quota_burst_ratio to 0 if cpu_quota_period_us is not 0.
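The validation rules for the two tunables could be sketched as follows (hypothetical function, mirroring the constraints stated above):

```go
package main

import (
	"errors"
	"fmt"
)

// validateQuotaConfig enforces the proposed rules: both tunables must be
// zero (strict quotas disabled, the default) or both non-zero and within
// their valid ranges.
func validateQuotaConfig(periodUS int64, burstRatio float64) error {
	if periodUS == 0 && burstRatio == 0 {
		return nil // strict CPU quotas disabled
	}
	if periodUS == 0 || burstRatio == 0 {
		return errors.New("cpu_quota_period_us and cpu_quota_burst_ratio must be set together")
	}
	if periodUS < 1000 || periodUS > 1000000 {
		return errors.New("cpu_quota_period_us must be in [1000, 1000000]")
	}
	if burstRatio < 1 {
		return errors.New("cpu_quota_burst_ratio must be >= 1")
	}
	return nil
}

func main() {
	fmt.Println(validateQuotaConfig(0, 0))        // disabled: no error
	fmt.Println(validateQuotaConfig(100000, 1.5)) // valid: no error
	fmt.Println(validateQuotaConfig(0, 1.5))      // illegal combination
}
```

Such a check would run early (at rep startup), per the note on early errors below.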

For example:
- setting it to 1 ensures that each container can use both at least and at most (cpu_shares / cpu_shares_max) of the total processor time every cpu_quota_period_us, thereby virtually eliminating performance fluctuations across instances
- setting it to 1.5 ensures that each container can use at least (cpu_shares / cpu_shares_max) and at most (1.5 * cpu_shares / cpu_shares_max) of the total processor time every cpu_quota_period_us, thereby allowing applications to use up to 50% more than their "CPU shares quota"

Notes:
- the above would only be exact if no processes were running outside of the application containers; this is obviously not the case, but limiting the resource usage of system components is largely outside the scope of this proposal
- cpu_quota_period_us should be exposed because it allows operators to control the latency/throughput trade-off (see [2] for why)
- 0 < cpu_quota_burst_ratio < 1 is defined as illegal in the current proposal because we can't come up with a good scenario where such values would make sense
- illegal values for cpu_quota_burst_ratio and cpu_quota_period_us should cause early errors (at bosh deploy and rep startup)
- we haven't looked into the equivalent change for Garden Windows, but we're aware that similar functionality exists in the Windows APIs [4]

[1]: https://github.com/cloudfoundry/guardian/blob/e9346b9849a33cddb037628c79057c4c80d4e3d8/rundmc/bundlerules/limits.go#L15-L16
[2]: https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
[3]: https://godoc.org/github.com/opencontainers/runc/libcontainer/configs#Resources
[4]: https://msdn.microsoft.com/en-us/library/windows/desktop/hh448384(v=vs.85).aspx


Chip Childers <cchilders@...>
 

+Will and Julz

Thoughts gents?

On Tue, Oct 25, 2016 at 10:23 PM Carlo Alberto Ferraris <
carlo.ferraris(a)rakuten.com> wrote:

--
Chip Childers
VP Technology, Cloud Foundry Foundation
1.267.250.0815


Julz Friedman
 

The use case seems reasonable to me. Even if it's technically free (no
marginal cost) to let processes consume the whole machine when it's
available, I can easily see a commercial and technical case for
constraining that.

I think the simplest approach is to have it apply statically to all containers, in which case it could be added to the garden startup flags and configured via bosh. There's maybe an argument that we should have it in the ContainerSpec for future extensibility, in case we ever wanted the user to be able to configure it, but that seems to complicate the math a lot (not to mention adding a lot more moving parts), and nothing blocks us from going down that path later if we start with a startup property applied to all containers and find it's not enough. I'd like other folks, especially Eric and Will, to weigh in on that though.

In terms of making it happen, garden would happily accept a PR for this; I think it's a pretty simple change. Otherwise I'll put a story in our backlog somewhere.

On Mon, 31 Oct 2016 at 14:35, Chip Childers <cchilders(a)cloudfoundry.org>
wrote:
