Thanks for all the diagnostic data! It looks like the actual size of the
btrfs volume (the 'Device allocated: 17.79GiB' line from the btrfs tool
output) is quite close to the total size of the ~21 GiB volume mounted on
/var/vcap/data. Since that volume also contains other files (such as the
BOSH-deployed jobs and packages and component log files) that on the cell
VM can add up to several GB, I think your cells eventually reach the point
where the sparse btrfs volume has expanded to fill all the remaining space
on the volume.
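As a rough check of that reasoning, the arithmetic can be sketched from the
figures quoted further down in this thread. This assumes `df` was reporting
1K blocks (its header line was not included) and that the sparse backing
file occupies about as much space on disk as btrfs reports as allocated.
The variable names are only illustrative.

```python
# Figures copied from the `df` and `btrfs filesystem usage` output in this
# thread. Assumption: df was reporting 1K blocks (the header was not shown).
KIB_PER_GIB = 1024 * 1024


def kib_to_gib(kib):
    """Convert a count of 1K blocks to GiB."""
    return kib / KIB_PER_GIB


volume_total_gib = kib_to_gib(22025756)  # /dev/vda3 total size
volume_used_gib = kib_to_gib(20278880)   # /dev/vda3 used
volume_free_gib = kib_to_gib(604964)     # /dev/vda3 available

btrfs_allocated_gib = 17.79              # 'Device allocated' line
btrfs_max_gib = 20000 / 1024             # btrfs_store_size_mb: 20000

# Assumption: everything on the volume that is not the btrfs backing file
# is jobs, packages, logs, and the executor cache.
other_files_gib = volume_used_gib - btrfs_allocated_gib

# How much further the btrfs store is still allowed to grow.
btrfs_growth_left_gib = btrfs_max_gib - btrfs_allocated_gib

print(f"volume total:        {volume_total_gib:.2f} GiB")
print(f"other files (est.):  {other_files_gib:.2f} GiB")
print(f"btrfs may still grow {btrfs_growth_left_gib:.2f} GiB")
print(f"volume has free      {volume_free_gib:.2f} GiB")
```

Under those assumptions, the store is still permitted to grow by more than
the space left on the underlying volume, which matches the failure mode
described above.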
Also, from the small amount of data in the executor cache, it looks like
the cells are taking on only Docker-image workloads, rather than
buildpack-based apps. Unfortunately, with Diego 0.1398.0 and the
accompanying garden-linux version, there are a few deficiencies with disk
management for Docker-image apps. The main one that's likely exacerbating
this situation is that garden-linux doesn't clean up the Docker image
layers after a Garden container based on them is destroyed. Consequently,
as you pull in more and more Docker image layers over time, they consume
more and more space in the btrfs volume, and that space is never
recovered. As of
version 0.307.0, garden-linux-release does clean up those unused layers
correctly, and someone from the Garden team might recall whether that
happens in an earlier release too (maybe 0.306.0, but not earlier that
If you're currently tied to Diego 0.1398.0 for compatibility with your
deployed CF version, the best way to manage these issues might be to take
three steps. First, increase the size of the ephemeral disk attached to
your VMs, if you're able to. Second, set the garden.btrfs_store_size_mb
property so that the maximum size of the btrfs volume is 3-4 GB less than
the size of the ephemeral disk mounted at /var/vcap/data. Third, monitor
the disk usage on that volume and recreate cells when they use more than,
say, 90% of it. Recreating a cell should cause it to evacuate its
instances to other cells in the deployment, so you wouldn't incur downtime
for the apps.
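A minimal sketch of the sizing and monitoring arithmetic, in Python. The
function names, the 4 GB default headroom, and the 90% threshold are
illustrative choices based on the numbers suggested above, not anything
provided by Diego, Garden, or BOSH. The usage fraction is computed the way
`df` computes its Use% column, so it should line up with what you see on
the cell.

```python
import shutil

HEADROOM_MB = 4 * 1024     # upper end of the 3-4 GB margin suggested above
RECREATE_THRESHOLD = 0.90  # the "say, 90%" figure suggested above


def suggested_btrfs_store_size_mb(ephemeral_disk_mb, headroom_mb=HEADROOM_MB):
    """Return a btrfs store size that leaves headroom on the ephemeral disk."""
    if ephemeral_disk_mb <= headroom_mb:
        raise ValueError("ephemeral disk is smaller than the requested headroom")
    return ephemeral_disk_mb - headroom_mb


def used_fraction(path):
    """Return used / (used + available) for the filesystem holding `path`.

    This mirrors how df derives Use%, which ignores blocks reserved for root.
    """
    usage = shutil.disk_usage(path)
    return usage.used / (usage.used + usage.free)


def should_recreate(fraction, threshold=RECREATE_THRESHOLD):
    """Return True when disk usage has passed the recreate threshold."""
    return fraction > threshold
```

On a cell, `should_recreate(used_fraction("/var/vcap/data"))` would flag
the VM for recreation. How that result is reported and acted on (an alert,
a scheduled job, a manual `bosh` recreate) is left open here.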
On Tue, Nov 24, 2015 at 7:32 AM, Tom Sherrod <tom.sherrod(a)gmail.com> wrote:
I am responding below with what I have available. Unfortunately, when the
problem presents, developers are down, so the current resolution is to
recreate cells. I'm looking at one cell below that is 98% full, so an
opportunity to gather additional details may arise soon.
Answers below inline
- What are the exact errors you're seeing when CF users are trying to make
containers? The errors from CF CLI logs or rep/garden logs would be great.
Did not capture detailed logs. FAILED StagingError was all that was
captured. I've asked to get more information on the next failure, which may
be coming up soon; I'm looking at a cell that is 98% full. No issue reported
as of yet, but of course there are 8 cells to choose from.
- What's the total amount of disk space available on the volume attached
to /var/vcap/data? You should be able to see this from `df` command output.
/dev/vda3 22025756 20278880 604964 98% /var/vcap/data
tmpfs 1024 16 1008 2% /var/vcap/data/sys/run
/dev/loop0 122835 1552 117352 2% /tmp
/dev/loop1 20480000 17923904 1914816 91%
cgroup 8216468 0 8216468 0% /tmp/garden-/cgroup
- How much space is the rep configured to allocate for its executor
cache? Is it the default 10GB provided by the rep's job spec in
How much disk is actually used in /var/vcap/data/executor_cache (based on
reporting from `du`, say)?
Default (not listed in the manifest)
- How much space have you directed garden-linux to allocate for its btrfs
store? This is provided via the diego.garden-linux.btrfs_store_size_mb BOSH
property, and with Diego 0.1398.0 I believe it has to be specified
explicitly. Also, how much space is actually used in the btrfs filesystem?
You should be able to inspect this with the btrfs tools available on the
cell VM in '/var/vcap/packages/btrfs-tools/bin'. I think running
`/var/vcap/packages/btrfs-tools/bin/btrfs filesystem usage
/var/vcap/data/garden-linux/btrfs_graph` should be a good starting point.
btrfs_store_size_mb: 20000
./btrfs filesystem usage /var/vcap/data/garden-linux/btrfs_graph
Device size: 19.53GiB
Device allocated: 17.79GiB
Device unallocated: 1.75GiB
Device missing: 0.00B
Free (estimated): 1.83GiB (min: 976.89MiB)
Data ratio: 1.00
Metadata ratio: 2.00
Global reserve: 320.00MiB (used: 0.00B)
Data,single: Size:12.01GiB, Used:11.93GiB
Metadata,single: Size:8.00MiB, Used:0.00B
Metadata,DUP: Size:2.88GiB, Used:2.43GiB
System,single: Size:4.00MiB, Used:0.00B
System,DUP: Size:8.00MiB, Used:16.00KiB
You may also find some useful information in the cf-dev thread from
August about overcommitting disk on Diego cells:
On Wed, Nov 18, 2015 at 6:52 AM, Tom Sherrod <tom.sherrod(a)gmail.com>
diego release 0.1398.0
After a couple of weeks of dev, the cells end up filling their disks.
Did I miss a clean up job somewhere?
Currently, once pushes start failing, I get bosh to recreate the machine.