Re: diego: disk filling up over time


Tom Sherrod <tom.sherrod@...>
 

Hi Eric,

Thank you.

I am responding below with what I have available. Unfortunately, when the
problem occurs, developers are blocked, so the current resolution is to
recreate the cells. I'm watching one cell that is 98% full, so there may be
an opportunity to gather more details soon.
Answers inline below.

- What are the exact errors you're seeing when CF users are trying to make
containers? The errors from CF CLI logs or rep/garden logs would be great
to see.
We did not capture detailed logs; "FAILED StagingError" was all that was
recorded. I've asked for more information on the next failure, which may
come soon: one cell is already 98% full. No issues have been reported yet,
but of course there are 8 cells to choose from.


- What's the total amount of disk space available on the volume attached
to /var/vcap/data? You should be able to see this from `df` command output.
/dev/vda3   22025756 20278880  604964 98% /var/vcap/data
tmpfs           1024       16    1008  2% /var/vcap/data/sys/run
/dev/loop0    122835     1552  117352  2% /tmp
/dev/loop1  20480000 17923904 1914816 91% /var/vcap/data/garden-linux/btrfs_graph
cgroup       8216468        0 8216468  0% /tmp/garden-/cgroup

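
As a rough cross-check of the df figures above, here is a small sketch that parses the pasted lines and flags filesystems at or above a usage threshold. It assumes the output is in df's default 1K-block units (the numbers are consistent with that), and the 90% cutoff and the helper names parse_df/over_threshold are illustrative, not part of any Diego tooling. It also shows that the btrfs loop device is 20480000 KiB, exactly 20000 MiB, which is about 93% of the size of the /var/vcap/data volume.

```python
# Parse the `df` output pasted above (assumed to be in 1K blocks, the df default)
# and flag filesystems whose Use% reaches a threshold.
DF_OUTPUT = """\
/dev/vda3 22025756 20278880 604964 98% /var/vcap/data
tmpfs 1024 16 1008 2% /var/vcap/data/sys/run
/dev/loop0 122835 1552 117352 2% /tmp
/dev/loop1 20480000 17923904 1914816 91% /var/vcap/data/garden-linux/btrfs_graph
cgroup 8216468 0 8216468 0% /tmp/garden-/cgroup
"""


def parse_df(text):
    """Return {mount_point: (size_kib, used_kib, avail_kib, use_pct)}."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip header lines or lines wrapped by the mailer
        _fs, size, used, avail, pct, mount = fields
        rows[mount] = (int(size), int(used), int(avail), int(pct.rstrip("%")))
    return rows


def over_threshold(rows, threshold_pct=90):
    """Mount points whose Use% is at or above threshold_pct (hypothetical cutoff)."""
    return sorted(m for m, (_, _, _, pct) in rows.items() if pct >= threshold_pct)


rows = parse_df(DF_OUTPUT)
print(over_threshold(rows))

# Size of the btrfs loop device relative to the whole /var/vcap/data volume.
loop_kib = rows["/var/vcap/data/garden-linux/btrfs_graph"][0]
data_kib = rows["/var/vcap/data"][0]
print(f"{100 * loop_kib / data_kib:.1f}%")
```

On these numbers it flags /var/vcap/data and the btrfs_graph mount, and reports the loop device at 93.0% of the data volume's size.
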
- How much space is the rep configured to allocate for its executor cache?
Is it the default 10GB provided by the rep's job spec in
https://github.com/cloudfoundry-incubator/diego-release/blob/v0.1398.0/jobs/rep/spec#L70-L72?
How much disk is actually used in /var/vcap/data/executor_cache (based on
reporting from `du`, say)?

It's the default (not set in the manifest).

root@a0acd863-07e5-4964-8758-fcdf295d119d:/var/vcap/data/executor_cache# du
42876 .
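
To put the du figure in context, here is a short sketch comparing it with the cache limit. It assumes du reported 1 KiB blocks (the GNU coreutils default) and that the "default 10GB" from the rep job spec means 10 * 10**9 bytes; the helper name cache_usage is illustrative only. On those assumptions the executor cache is using well under 1% of its budget, which suggests it is not what is filling the disk.

```python
# Compare the executor cache's on-disk size (from `du`) with its configured limit.
# Assumptions: `du` reported 1 KiB blocks, and "10GB" means 10 * 10**9 bytes.
def cache_usage(du_kib, limit_bytes):
    """Return (used MiB, percent of the limit used)."""
    used_bytes = du_kib * 1024
    return used_bytes / 2**20, 100 * used_bytes / limit_bytes


used_mib, used_pct = cache_usage(42876, 10 * 10**9)
print(f"cache uses {used_mib:.1f} MiB ({used_pct:.2f}% of the limit)")
```

This prints roughly 41.9 MiB, about 0.44% of a 10GB limit.
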

- How much space have you directed garden-linux to allocate for its btrfs
store? This is provided via the diego.garden-linux.btrfs_store_size_mb BOSH
property, and with Diego 0.1398.0 I believe it has to be specified
explicitly. Also, how much space is actually used in the btrfs filesystem?
You should be able to inspect this with the btrfs tools available on the
cell VM in '/var/vcap/packages/btrfs-progs/bin'. I think running
`/var/vcap/packages/btrfs-progs/bin/btrfs filesystem usage /var/vcap/data/garden-linux/btrfs_graph`
should be a good starting point.
btrfs_store_size_mb: 20000

root@a0acd863-07e5-4964-8758-fcdf295d119d:/var/vcap/packages/btrfs-progs/bin# ./btrfs filesystem usage /var/vcap/data/garden-linux/btrfs_graph

Overall:
    Device size:           19.53GiB
    Device allocated:      17.79GiB
    Device unallocated:     1.75GiB
    Device missing:           0.00B
    Used:                  16.78GiB
    Free (estimated):       1.83GiB  (min: 976.89MiB)
    Data ratio:                1.00
    Metadata ratio:            2.00
    Global reserve:       320.00MiB  (used: 0.00B)

Data,single: Size:12.01GiB, Used:11.93GiB
   /dev/loop1   12.01GiB

Metadata,single: Size:8.00MiB, Used:0.00B
   /dev/loop1    8.00MiB

Metadata,DUP: Size:2.88GiB, Used:2.43GiB
   /dev/loop1    5.75GiB

System,single: Size:4.00MiB, Used:0.00B
   /dev/loop1    4.00MiB

System,DUP: Size:8.00MiB, Used:16.00KiB
   /dev/loop1   16.00MiB

Unallocated:
   /dev/loop1    1.75GiB
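
The btrfs numbers hang together with the configured store size: btrfs_store_size_mb: 20000 is 20000 / 1024 = 19.53GiB, which matches "Device size". The sketch below sums the per-profile on-device allocations (the /dev/loop1 lines, where the DUP metadata's logical 2.88GiB occupies about 5.75GiB because two copies are kept) to reproduce "Device allocated". The values are copied by hand from the rounded output, so totals can differ from btrfs's own by about 0.01GiB; the helper name is illustrative. Since the data chunks are already about 99% used (11.93 of 12.01GiB), the roughly 1.75GiB of unallocated space appears to be most of what remains for new container data.

```python
# Cross-check the btrfs "Overall" totals from the per-profile figures pasted above.
# Values are rounded in the source output, so small (~0.01 GiB) mismatches are expected.
MIB_PER_GIB = 1024


def btrfs_allocation_summary(store_size_mb, on_device_gib):
    """Return (device_gib, allocated_gib, unallocated_gib, allocated_pct)."""
    device_gib = store_size_mb / MIB_PER_GIB
    allocated_gib = sum(on_device_gib.values())
    unallocated_gib = device_gib - allocated_gib
    return device_gib, allocated_gib, unallocated_gib, 100 * allocated_gib / device_gib


# On-device space per block-group profile, taken from the "/dev/loop1" lines.
on_device = {
    "Data,single": 12.01,
    "Metadata,single": 8.00 / MIB_PER_GIB,
    "Metadata,DUP": 5.75,  # two copies of 2.88GiB logical metadata
    "System,single": 4.00 / MIB_PER_GIB,
    "System,DUP": 16.00 / MIB_PER_GIB,
}

device, allocated, unallocated, pct = btrfs_allocation_summary(20000, on_device)
print(f"device {device:.2f} GiB, allocated {allocated:.2f} GiB "
      f"({pct:.0f}%), unallocated {unallocated:.2f} GiB")
```

With these inputs it reports a 19.53GiB device with 17.79GiB (91%) allocated; the computed unallocated figure comes out at 1.74GiB rather than 1.75GiB because of the rounding in the pasted values.
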




You may also find some useful information in the cf-dev thread from August
about overcommitting disk on Diego cells:
https://lists.cloudfoundry.org/archives/list/cf-dev@lists.cloudfoundry.org/thread/VBDM2TMHQSOFILSHRCV4G2CCPRBP5WKA/#VBDM2TMHQSOFILSHRCV4G2CCPRBP5WKA

Thanks,
Eric



On Wed, Nov 18, 2015 at 6:52 AM, Tom Sherrod <tom.sherrod(a)gmail.com>
wrote:

Diego release 0.1398.0

After a couple of weeks of development use, the cells' disks fill up. Did
I miss a cleanup job somewhere?
Currently, once pushes start failing, I have BOSH recreate the machine.

Are there other options?

Thanks,
Tom
