Strict CPU quotas proposal
Carlo Alberto Ferraris
It is our understanding that currently application instances get CPU quotas assigned via cgroup CPU shares on the container running them. This effectively sets a "minimum quota" of CPU time each container is guaranteed to have available, but leaves the maximum amount of CPU time unbounded.
This may be fine in the average case, but can have some pretty annoying effects in certain edge cases.
For the sake of discussion, let's consider 2 Diego cells having the same number N of containers running, with the same amount of cpu shares, memory and disk assigned. From the perspective of Diego, these two cells are perfectly balanced. Now let's assume that (because we're very unlucky):
- we have one instance of the same app running on each one of the 2 cells
- the other N-1 instances on cell 1 use all available CPU
- the other N-1 instances on cell 2 are using no CPU
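In this case, assuming the app itself is CPU-bound, the net effect is that the instance in cell 2 can get up to N times the CPU time of the instance in cell 1: the instance in cell 1 is held to its 1/N share, while the instance in cell 2 can use the whole cell.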
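To make the arithmetic concrete, here is a minimal sketch of the scenario above. It assumes an idealised model in which every busy container gets CPU time in proportion to its shares and idle containers consume nothing; the function name and the values of N and the per-container shares are made up for illustration.

```python
def cpu_fraction_under_shares(own_shares, busy_neighbor_shares):
    """Fraction of a cell's CPU a busy container gets when only CPU shares apply.

    Idealised model: CPU time is split among busy containers in proportion
    to their shares; idle containers do not count.
    """
    total_busy_shares = own_shares + sum(busy_neighbor_shares)
    return own_shares / total_busy_shares


N = 10          # containers per cell (illustrative)
SHARES = 100    # shares per container (illustrative)

# Cell 1: the other N-1 containers use all the CPU they can get.
cell_1 = cpu_fraction_under_shares(SHARES, [SHARES] * (N - 1))
# Cell 2: the other N-1 containers are idle.
cell_2 = cpu_fraction_under_shares(SHARES, [])

swing = cell_2 / cell_1
print(f"cell 1: {cell_1:.0%}, cell 2: {cell_2:.0%}, swing: {swing:.0f}x")
```

With these numbers the same app gets 10% of one cell and 100% of the other, i.e. a swing of N.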
It would be desirable, from our perspective, to be able to control such "performance swings": they may lead users to overestimate the processor resources available to them, and later leave some instances unable to serve their share of traffic.
What we propose is to add the (opt-in) ability for CF operators to control the upper bound of how much CPU time a container can use. Specifically, we suggest teaching garden (runc) to also set cpu.cfs_quota_us and cpu.cfs_period_us (see CpuQuota and CpuPeriod in  and  for details). To be absolutely explicit: this would be an operator-controlled feature, not a user-controlled one.
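As an illustration of what "setting the quota and period" could look like at the runtime level, the sketch below builds the CPU section of an OCI runtime spec (the config.json consumed by runc), where linux.resources.cpu carries shares, quota and period. How garden would actually plumb these values is not something this proposal defines; the helper name and the numbers are hypothetical.

```python
import json


def oci_cpu_resources(cpu_shares, cfs_quota_us, cfs_period_us):
    """Build the linux.resources.cpu fragment of an OCI runtime spec.

    Hypothetical helper: 'quota' and 'period' correspond to
    cpu.cfs_quota_us and cpu.cfs_period_us, 'shares' to cpu.shares.
    """
    return {
        "linux": {
            "resources": {
                "cpu": {
                    "shares": cpu_shares,
                    "quota": cfs_quota_us,
                    "period": cfs_period_us,
                }
            }
        }
    }


# Illustrative values: at most 25 ms of CPU time per 100 ms window.
fragment = oci_cpu_resources(cpu_shares=128, cfs_quota_us=25000, cfs_period_us=100000)
print(json.dumps(fragment, indent=2))
```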
What follows is a draft of how this functionality could be provided. It is intended as a basis for discussion more than as a fully fleshed-out proposal.
Concretely this would likely require exposing two additional tunables for Diego reps (names are placeholders):
cpu_quota_period_us would be used as-is as the value of cpu.cfs_period_us. This is the time window used for allocating CPU time (see  for details). Setting it to 0 (default) would disable strict CPU quotas. Values 1000<=cpu_quota_period_us<=1000000 are valid and would enable strict CPU quotas. Any other value is illegal. It is illegal to set cpu_quota_period_us to 0 if cpu_quota_burst_ratio is not 0.
cpu_quota_burst_ratio would instead be used to compute the value of cpu.cfs_quota_us based on the number of cores in the host (n_cores), the number of shares assigned to the container (cpu_shares) and the maximum number of shares that can be assigned to running containers (cpu_shares_max) as follows:
cpu.cfs_quota_us = cpu_quota_period_us * n_cores * cpu_quota_burst_ratio * cpu_shares / cpu_shares_max
cpu_shares_max can be calculated, if we ignore instance_min_cpu_share_limit, instance_max_cpu_share_limit and other limits, as:
cpu_shares_max = memory_mb * memory_overcommit_factor / instance_memory_to_cpu_share_ratio
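The two formulas above can be sketched as follows. This is only an illustration: the cell size, core count and share values are invented, the limits mentioned above (instance_min_cpu_share_limit etc.) are ignored, and truncating the result to an integer is our assumption, since the cgroup file takes an integer number of microseconds.

```python
def cpu_shares_max(memory_mb, memory_overcommit_factor, instance_memory_to_cpu_share_ratio):
    """Maximum number of CPU shares assignable to running containers on a cell."""
    return memory_mb * memory_overcommit_factor / instance_memory_to_cpu_share_ratio


def cfs_quota_us(cpu_quota_period_us, n_cores, cpu_quota_burst_ratio, cpu_shares, shares_max):
    """Value for cpu.cfs_quota_us, truncated to whole microseconds (assumption)."""
    return int(cpu_quota_period_us * n_cores * cpu_quota_burst_ratio * cpu_shares / shares_max)


# Illustrative cell: 32 GB of memory, no overcommit, 8 MB per CPU share, 8 cores.
shares_max = cpu_shares_max(32768, 1, 8)  # 4096 shares
# Illustrative container: 1 GB of memory, hence 128 shares.
quota_strict = cfs_quota_us(100000, 8, 1, 128, shares_max)
quota_burst = cfs_quota_us(100000, 8, 1.5, 128, shares_max)

print(shares_max, quota_strict, quota_burst)
```

In this example the container holds 128/4096 = 1/32 of the shares, so with a 100 ms period on 8 cores it gets 25 ms of CPU time per period with a ratio of 1, and 37.5 ms with a ratio of 1.5.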
Setting cpu_quota_burst_ratio to 0 (default) would disable strict CPU quotas. Values of cpu_quota_burst_ratio>=1 are valid and enable strict CPU quotas. Any other value is illegal. It is illegal to set cpu_quota_burst_ratio to 0 if cpu_quota_period_us is not 0.
- setting it to 1 will ensure that each container can use at least and at most (cpu_shares / cpu_shares_max) of the total processor time every cpu_quota_period_us, thereby virtually eliminating performance fluctuations across instances
- setting it to 1.5 will ensure that each container can use at least (cpu_shares / cpu_shares_max) and at most (1.5 * cpu_shares / cpu_shares_max) of the total processor time every cpu_quota_period_us, thereby allowing applications to use up to 50% over their "CPU shares quota"
- the above would be correct if no other processes were running outside of the application containers: this is obviously not true but limiting resource usage of the system components is largely outside of the scope of this proposal
- cpu_quota_period_us should be exposed because it allows operators to control the latency/throughput trade-off (see  for why)
- 0 < cpu_quota_burst_ratio < 1 is defined as illegal in the current proposal because we can't come up with a good scenario where such values may make sense
- illegal values for cpu_quota_burst_ratio and cpu_quota_period_us should cause early errors (bosh deploy and rep startup)
- we haven't looked into the equivalent change for garden windows, but we're aware that similar functionality exists in the windows APIs
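The validity rules for the two tunables can be summarised in a small check like the one below. It is a sketch of the rules as stated in this draft, not of any existing rep code: the function name is a placeholder, and where exactly such a check would run (bosh deploy, rep startup) is left open.

```python
MIN_PERIOD_US = 1000
MAX_PERIOD_US = 1000000


def validate_cpu_quota_settings(cpu_quota_period_us, cpu_quota_burst_ratio):
    """Return True if strict CPU quotas are enabled, False if disabled.

    Raises ValueError for any combination this draft defines as illegal.
    """
    if cpu_quota_period_us == 0 and cpu_quota_burst_ratio == 0:
        return False  # both at their defaults: strict quotas disabled
    if cpu_quota_period_us == 0 or cpu_quota_burst_ratio == 0:
        raise ValueError(
            "cpu_quota_period_us and cpu_quota_burst_ratio must be both 0 or both non-zero"
        )
    if not MIN_PERIOD_US <= cpu_quota_period_us <= MAX_PERIOD_US:
        raise ValueError("cpu_quota_period_us must be 0 or between 1000 and 1000000")
    if cpu_quota_burst_ratio < 1:
        raise ValueError("cpu_quota_burst_ratio must be 0 or >= 1")
    return True


print(validate_cpu_quota_settings(0, 0))         # disabled
print(validate_cpu_quota_settings(100000, 1.5))  # enabled
```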