Sorry for the late follow-up, but would this hit bosh-lite too?
I ask because, after it has run for a while, I'm experiencing severe and similar issues with the 53 garden containers I use in Bosh-Lite.
- Bosh-lite v9000.91.0 (i.e. bosh v250 + warden-cpi v29 + garden-linux v0.331.0) and the kernel is 3.19.0-47.53~14.04.1 (I might have upgraded it)
- Deployment: cf v231 + Diego v0.1434.0 + Garden-linux v0.333.0 + Etcd v36 + cf-mysql v26 + other
Will linux-image-3.19.0-49-generic fix the issue, as was done in this 2016-02-08 commit <https://github.com/cloudfoundry/bosh/commit/750c5e7ed70b1d7753500ca725590c1c0eac1262> for stemcell 3192?
As a safety measure, I decided to upgrade to kernel 3.19.0-58-generic and I would be happy to get a confirmation that (1) my bosh-lite deployment was hit by the AUFS bug, and that (2) the new kernel I installed will get me off this operational nightmare.
On 28 Jan 2016, at 02:06, Eric Malm <emalm(a)pivotal.io> wrote:
Warden also uses aufs for its containers' overlay filesystems, so we expect the same issue to affect the DEAs on these stemcell versions. I'm not aware of a deliberate attempt to reproduce it on the DEAs, though.
On Wed, Jan 27, 2016 at 4:08 PM, Mike Youngstrom <youngm(a)gmail.com> wrote:
Thanks Will. Does anyone know if this bug could also impact Warden?
On Wed, Jan 27, 2016 at 9:50 AM, Will Pragnell <wpragnell(a)pivotal.io> wrote:
A bug with AUFS [1] was introduced in version 3.19.0-40 of the Linux kernel. This bug can leave containers with unkillable zombie processes that consume a lot of CPU. It can happen any time a container is supposed to be destroyed.
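(Editor's sketch, not part of the original message.) One way to check whether a machine's kernel falls in the range described above is to parse the release string and compare it with 3.19.0-40. The function names are hypothetical, and `first_fixed` is a placeholder: the thread does not confirm which kernel build contains the fix, so it defaults to "unknown".

```python
import platform
import re

# First Ubuntu kernel build reported as affected in this thread.
FIRST_BAD = (3, 19, 0, 40)


def parse_kernel_release(release):
    """Turn '3.19.0-47-generic' into (3, 19, 0, 47); None if unparseable."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())


def in_reported_range(release, first_fixed=None):
    """True if `release` is a 3.19.0 build at or after 3.19.0-40.

    `first_fixed` is an ASSUMPTION supplied by the caller, e.g. (3, 19, 0, 49);
    the thread does not state which build carries the fix.
    False only means "not in the range this thread talks about".
    """
    version = parse_kernel_release(release)
    if version is None or version[:3] != FIRST_BAD[:3]:
        return False
    if version < FIRST_BAD:
        return False
    if first_fixed is not None and version >= first_fixed:
        return False
    return True


if __name__ == "__main__":
    running = platform.release()  # same string as `uname -r`
    print(running, in_reported_range(running))
```

Run directly, it prints the running kernel release and whether it is in the reported range. It says nothing about whether AUFS is actually in use on that machine.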
This affects both Garden-Linux and Warden (and Docker). If you see significant slowdown or increased CPU usage on DEAs or Diego cells, it might well be this. It will probably build up slowly over time, so you may not notice anything for a while, depending on the rate of app instance churn on your deployment.
The bad version of the kernel is present in stemcell 3160 and later. I can't recommend using older stemcells because the bad kernel versions also include fixes for several high severity security vulnerabilities (at least [2-5], there may be others I've missed). Were it not for these, rolling back to stemcell 3157 would be the fix.
We're waiting for a fix to make its way into the kernel, and the BOSH team will produce a stemcell with the fix as soon as possible. In the meantime, I'd suggest simply keeping a closer eye than usual on your DEAs and Diego cells.
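(Editor's sketch, not part of the original message.) "Keeping a closer eye" on a cell could include periodically listing processes stuck in the zombie state. The sketch below reads `/proc/<pid>/stat` on Linux. It assumes the stuck processes show up in state `Z`; the thread does not say which state they are in, so `states` is a parameter. The function names are hypothetical.

```python
import glob
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def find_processes_in_state(states=("Z",), proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes whose state is in `states`."""
    found = []
    for path in glob.glob(os.path.join(proc_root, "[0-9]*", "stat")):
        try:
            with open(path) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were reading, or unreadable
        if state in states:
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in find_processes_in_state():
        print(pid, comm)
```

A zombie count that keeps growing between runs on a cell would be a reason to look more closely. On its own it does not prove the AUFS bug is the cause.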
If this issue occurs, the only solution is to recreate that machine. While we've not had any actual reports of this issue occurring for Cloud Foundry deployments in the wild yet, we're confident that the issue will be occurring. The Diego team have seen it in testing, and several teams have encountered the issue with their Concourse workers, which also use Garden-Linux.
As always, please get in touch if you have any questions.
Will - Garden PM
[1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043
[2]: http://www.ubuntu.com/usn/usn-2857-1/
[3]: http://www.ubuntu.com/usn/usn-2868-1/
[4]: http://www.ubuntu.com/usn/usn-2869-1/
[5]: http://www.ubuntu.com/usn/usn-2871-2/