Yes, the BOSH-Lite boxes with kernel 3.19.0-40 through 3.19.0-50 are all
susceptible to the AUFS bug. Kernel versions 3.19.0-51 and later will be
fine, and I believe the earliest BOSH-Lite Vagrant box with one of those
kernel versions is 9000.102.0. The 3.19.0-49 kernel that went into 3192 was
a one-off build that Canonical supplied in advance of the release of the
official kernel package with the fix (https://launchpad.net/ubuntu/+source/linux-lts-vivid/3.19.0-51.57~14.04.1),
and the 'official' package with kernel 3.19.0-49 still has the AUFS bug.
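
For anyone who wants to check a box quickly, here is a rough Python sketch of the version range described above. The function name `aufs_bug_suspected` and the two constants are made up for illustration. It only looks at the release string (what `uname -r` reports), so it cannot tell the one-off 3.19.0-49 build shipped in stemcell 3192 apart from the official 3.19.0-49 package; treat a positive result as a hint, not a diagnosis.

```python
import platform
import re

# Range taken from the message above: 3.19.0-40 through 3.19.0-50 are
# affected, 3.19.0-51 and later are not. Names are illustrative only.
FIRST_AFFECTED_ABI = 40
FIRST_FIXED_ABI = 51


def aufs_bug_suspected(release):
    """Classify a kernel release string such as '3.19.0-47-generic'.

    Returns True if the release falls in the affected 3.19.0 range,
    False if it is a 3.19.0 kernel outside that range, and None if it
    is not a 3.19.0-N kernel at all (this sketch says nothing about those).
    """
    match = re.match(r"^3\.19\.0-(\d+)", release)
    if match is None:
        return None
    abi = int(match.group(1))
    return FIRST_AFFECTED_ABI <= abi < FIRST_FIXED_ABI


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r` on Linux.
    release = platform.release()
    print(release, aufs_bug_suspected(release))
```

Run inside the BOSH-Lite VM, this prints the running kernel release followed by True, False, or None.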
On Mon, Apr 11, 2016 at 8:36 AM, Benjamin Gandon <benjamin(a)gandon.org> wrote:
Sorry for the late follow-up, but would this hit bosh-lite too?
Because after it has run for a while, I’m experiencing similar severe
issues with the 53 garden containers I use in Bosh-Lite.
- Bosh-lite v9000.91.0 (i.e. bosh v250 + warden-cpi v29 + garden-linux
v0.331.0) and the kernel is 3.19.0-47.53~14.04.1 (I *might* have upgraded
- Deployment: cf v231 + Diego v0.1434.0 + Garden-linux v0.333.0 + Etcd
v36 + cf-mysql v26 + other
Will the linux-image-3.19.0-49-generic fix the issue, as was done in
this 2016-02-08 commit for stemcell 3192?
As a safety measure, I decided to upgrade to kernel 3.19.0-58-generic and
I would be happy to get a confirmation that (1) my bosh-lite deployment was
hit by the AUFS bug, and that (2) the new kernel I installed will get me
out of this operational nightmare.
On Jan 28, 2016, at 02:06, Eric Malm <emalm(a)pivotal.io> wrote:
Warden also uses aufs for its containers' overlay filesystems, so we
expect the same issue to affect the DEAs on these stemcell versions. I'm
not aware of a deliberate attempt to reproduce it on the DEAs, though.
On Wed, Jan 27, 2016 at 4:08 PM, Mike Youngstrom <youngm(a)gmail.com> wrote:
Thanks Will. Does anyone know if this bug could also impact Warden?
On Wed, Jan 27, 2016 at 9:50 AM, Will Pragnell <wpragnell(a)pivotal.io> wrote:
A bug with AUFS was introduced in version 3.19.0-40 of the Linux
kernel. This bug can cause containers to end up with unkillable zombie
processes with high CPU usage. This can happen any time a container is
supposed to be destroyed.
This affects both Garden-Linux and Warden (and Docker). If you see
significant slowdown or increased CPU usage on DEAs or Diego cells, it
might well be this. It will probably build up slowly over time, so you may
not notice anything for a while depending on the rate of app instance churn
on your deployment.
The bad version of the kernel is present in stemcell 3160 and later. I
can't recommend using older stemcells because the bad kernel versions also
include fixes for several high severity security vulnerabilities (at least
[2-5], there may be others I've missed). Were it not for these, rolling
back to stemcell 3157 would be the fix.
We're waiting for a fix to make its way into the kernel, and the BOSH
team will produce a stemcell with the fix as soon as possible. In the
meantime, I'd suggest simply keeping a closer eye than usual on your DEAs
and Diego cells.
If this issue occurs, the only solution is to recreate that machine.
While we've not had any actual reports of this issue occurring for Cloud
Foundry deployments in the wild yet, we're confident that the issue is
occurring. The Diego team have seen it in testing, and several teams have
encountered the issue with their Concourse workers, which also use
As always, please get in touch if you have any questions.
Will - Garden PM