Re: AUFS bug in Linux kernel


Benjamin Gandon
 

Very neat!
Thanks a lot Eric.

On Apr 11, 2016, at 17:46, Eric Malm <emalm(a)pivotal.io> wrote:

Hi, Benjamin,

Yes, the BOSH-Lite boxes with kernel 3.19.0-40 through 3.19.0-50 are all susceptible to the AUFS bug. Kernel versions 3.19.0-51 and later will be fine, and I believe the earliest BOSH-Lite Vagrant box with one of those kernel versions is 9000.102.0. The 3.19.0-49 kernel that went into stemcell 3192 was a one-off build that Canonical supplied in advance of the release of the official kernel package with the fix (https://launchpad.net/ubuntu/+source/linux-lts-vivid/3.19.0-51.57~14.04.1), and the 'official' package with kernel 3.19.0-49 still has the AUFS bug.
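For illustration only, the version range above can be turned into a quick check against the output of `uname -r`. This is a minimal sketch, not an official tool: the function name `aufs_bug_suspected` and the constants are made up for this example, and the check looks only at the version string, so it cannot tell the one-off patched 3.19.0-49 build apart from the official, still-affected 3.19.0-49 package.

```python
import platform
import re

# Range taken from this thread: 3.19.0-40 through 3.19.0-50 are affected,
# 3.19.0-51 and later carry the fix. These names are hypothetical.
FIRST_BAD_ABI = 40
FIRST_FIXED_ABI = 51


def aufs_bug_suspected(release):
    """Classify a kernel release string such as '3.19.0-47-generic'.

    Returns True if the release falls in the affected range, False if it is
    a 3.19.0 kernel outside that range, and None for any other kernel
    series, which this thread does not cover.
    Caveat: the one-off patched 3.19.0-49 build is reported as affected,
    because the version string alone cannot distinguish it.
    """
    match = re.match(r"^3\.19\.0-(\d+)", release)
    if match is None:
        return None
    abi = int(match.group(1))
    return FIRST_BAD_ABI <= abi < FIRST_FIXED_ABI


if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` prints
    print(release, aufs_bug_suspected(release))
```

A result of None means only that the kernel is not in the 3.19.0 series discussed here; it says nothing about whether that kernel has the bug.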

Thanks,
Eric

On Mon, Apr 11, 2016 at 8:36 AM, Benjamin Gandon <benjamin(a)gandon.org> wrote:
Hi,

Sorry for the late follow-up, but would this hit BOSH-Lite too?
Because after it has run for a while, I’m experiencing similar, severe issues with the 53 Garden containers I use in BOSH-Lite.

Config:
- Bosh-lite v9000.91.0 (i.e. bosh v250 + warden-cpi v29 + garden-linux v0.331.0) and the kernel is 3.19.0-47.53~14.04.1 (I might have upgraded it)
- Deployment: cf v231 + Diego v0.1434.0 + Garden-linux v0.333.0 + Etcd v36 + cf-mysql v26 + other

Will the linux-image-3.19.0-49-generic package fix the issue, as was done in this 2016-02-08 commit <https://github.com/cloudfoundry/bosh/commit/750c5e7ed70b1d7753500ca725590c1c0eac1262> for stemcell 3192?

As a safety measure, I decided to upgrade to kernel 3.19.0-58-generic, and I would be happy to get confirmation that (1) my BOSH-Lite deployment was hit by the AUFS bug, and that (2) the new kernel I installed will get me out of this operational nightmare.

Thanks!


On Jan 28, 2016, at 02:06, Eric Malm <emalm(a)pivotal.io> wrote:

Hi, Mike,

Warden also uses aufs for its containers' overlay filesystems, so we expect the same issue to affect the DEAs on these stemcell versions. I'm not aware of a deliberate attempt to reproduce it on the DEAs, though.

Thanks,
Eric

On Wed, Jan 27, 2016 at 4:08 PM, Mike Youngstrom <youngm(a)gmail.com> wrote:
Thanks Will. Does anyone know if this bug could also impact Warden?

Mike

On Wed, Jan 27, 2016 at 9:50 AM, Will Pragnell <wpragnell(a)pivotal.io> wrote:
A bug with AUFS [1] was introduced in version 3.19.0-40 of the linux kernel. This bug can cause containers to end up with unkillable zombie processes with high CPU usage. This can happen any time a container is supposed to be destroyed.

This affects both Garden-Linux and Warden (and Docker). If you see significant slowdown or increased CPU usage on DEAs or Diego cells, it might well be this. It will probably build up slowly over time, so you may not notice anything for a while, depending on the rate of app instance churn on your deployment.

The bad version of the kernel is present in stemcell 3160 and later. I can't recommend using older stemcells because the bad kernel versions also include fixes for several high severity security vulnerabilities (at least [2-5], there may be others I've missed). Were it not for these, rolling back to stemcell 3157 would be the fix.

We're waiting for a fix to make its way into the kernel, and the BOSH team will produce a stemcell with the fix as soon as possible. In the meantime, I'd suggest simply keeping a closer eye than usual on your DEAs and Diego cells.

If this issue occurs, the only solution is to recreate that machine. While we've not had any actual reports of this issue occurring for Cloud Foundry deployments in the wild yet, we're confident that the issue will be occurring. The Diego team have seen it in testing, and several teams have encountered the issue with their Concourse workers, which also use Garden-Linux.

As always, please get in touch if you have any questions.

Will - Garden PM

[1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043
[2]: http://www.ubuntu.com/usn/usn-2857-1/
[3]: http://www.ubuntu.com/usn/usn-2868-1/
[4]: http://www.ubuntu.com/usn/usn-2869-1/
[5]: http://www.ubuntu.com/usn/usn-2871-2/
