rep fd count keeps increasing until 'too many open files' and the cell goes into a bad state

Qiu Jie QJ Li <liqiujie@...>

Hi CF developers,
We ran into a problem where the number of open file descriptors (fds) held by rep keeps increasing until it fails with 'too many open files'.

Our Cloud Foundry environment is built on a Kubernetes cluster running on 3 VMs: 1 for the diego-cell (4 cores, 16 GB) and 2 for everything else. In our stress test, 10+ threads continuously push/start/stop/../delete apps, with a 10 s think time between steps. The test starts with 0 errors but always ends, hours later, with the cell in a bad state. App staging fails with 'can't communicate with compatible cells', and 'too many open files' appears in rep.stdout.log.

Prompted by the 'too many open files' message, we began monitoring the number of entries under /proc/<rep-pid>/fd. The count was steady at first; then, from some point on, it kept increasing, even after the push test had been stopped completely. The growing fd count appears to be the cause of 'too many open files' and would most likely make the node (VM) unreachable in the end.
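
For anyone who wants to reproduce the measurement, here is a minimal sketch of how the fd count can be sampled over time. It assumes a Linux host with /proc mounted and enough privileges to read the target process's fd directory. The function names are illustrative only and are not part of rep or Diego.

```python
import os
import time


def count_open_fds(pid):
    """Return the number of entries under /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def sample_fd_counts(pid, samples, interval_seconds):
    """Collect (unix_timestamp, fd_count) pairs for the given process."""
    readings = []
    for i in range(samples):
        readings.append((time.time(), count_open_fds(pid)))
        # Sleep between samples, but not after the last one.
        if i < samples - 1:
            time.sleep(interval_seconds)
    return readings


# Hypothetical usage, where rep_pid is the pid of the rep process:
#   for ts, n in sample_fd_counts(rep_pid, samples=60, interval_seconds=10):
#       print(int(ts), n)
```

A count that keeps climbing while no apps are being pushed is what we observed. The per-process limit it eventually hits can be read from /proc/<rep-pid>/limits (the "Max open files" row).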

Why would the fd count keep increasing? Is there a leak, or something that cannot be released?
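
One way to narrow this down is to group the open fds by what they point to, since a growing number of sockets suggests a different cause than a growing number of regular files or pipes. The sketch below assumes Linux, where each entry under /proc/<pid>/fd is a symlink whose target looks like `socket:[inode]`, `pipe:[inode]`, `anon_inode:...`, or a file path. The function name and the category labels are illustrative.

```python
import os
from collections import Counter


def classify_fds(pid):
    """Count the open fds of a process by kind, based on /proc/<pid>/fd symlink targets."""
    counts = Counter()
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The fd was closed between listdir() and readlink(); skip it.
            continue
        if target.startswith("socket:"):
            kind = "socket"
        elif target.startswith("pipe:"):
            kind = "pipe"
        elif target.startswith("anon_inode:"):
            kind = "anon_inode"
        else:
            kind = "file"
        counts[kind] += 1
    return counts
```

Comparing two snapshots taken some minutes apart shows which kind is growing.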

I have opened an issue in the rep repository with more details. Please let us know what additional information you need.

Thanks a lot.

Qiu Jie (Sophy) Li
