cf v231: Issue with new webdav blobstore job


Rich Wohlstadter
 

Hi,

We recently upgraded to cf v231 and switched from NFS to the new WebDAV nginx service. One of our environments has a very large blobstore. The monit startup script for the blobstore job includes a recursive chown of the blobstore disk (chown -R vcap:vcap $RUN_DIR $LOG_DIR $DATA). Depending on the speed of our storage, that chown can take long enough that monit runs into trouble and tries to start the job again. The first start eventually finishes, but because of the delay monit launches a second one, the logs start showing errors binding to port 80, and monit eventually gives up saying execution failed. Does that recursive chown need to be there? I compared the blobstore job to the old debian nfs job, and the nfs job only did a chown on the top-level /var/vcap/store/shared directory. This causes us problems in this environment whenever we need to update or restart that VM.
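
To make the comparison concrete, here is a minimal sketch of a top-level-only chown in the style of the old nfs job. The helper name chown_toplevel is hypothetical and is not taken from cf-release; only the variables $RUN_DIR, $LOG_DIR and $DATA and the vcap:vcap owner come from the startup script quoted above.

```shell
#!/bin/bash
# Sketch only: chown_toplevel is a hypothetical helper, not code from cf-release.

# Change ownership of each named directory itself, without descending into it.
# The cost is one chown call per directory, regardless of how many blobs it holds.
chown_toplevel() {
  owner="$1"
  shift
  for dir in "$@"; do
    chown "$owner" "$dir" || return 1
  done
}

# In a startup script this might stand in for the recursive call, for example:
#   chown_toplevel vcap:vcap "$RUN_DIR" "$LOG_DIR" "$DATA"
```

Whether this is sufficient depends on whether anything other than the vcap user ever writes files below those directories, which is the open question here.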

Rich


Marco Nicosia
 

On Tue, Mar 22, 2016 at 2:07 PM, Rich Wohlstadter <lethwin(a)gmail.com> wrote:

> Hi,
>
> We recently upgraded to cf v231 and switched over from using nfs to the
> new webdav nginx service. We have one environment where the blobstore is
> very large. The monit startup script for the blobstore job includes a
> recursive chown of the blobstore disk (chown -R vcap:vcap $RUN_DIR $LOG_DIR
> $DATA) which depending on the speed of our storage can sometimes take a
> long enough time for monit to have issues and try and start it again. The
> first one will finish, but monit will try and start another one due to the
> delay and logging will start showing errors binding to port 80 and monit
> will eventually give up saying execution failed. Does that recursive chown
> need to be there? I compared the blobstore job to the old debian nfs job
> and the nfs job just did a chown on the toplevel /var/vcap/store/shared
> directory. This is causing us issues in this environment whenever we need
> to update/restart that vm.

We solved a similar problem in MySQL by doing that work in a pre-start
<https://bosh.io/docs/pre-start.html> script. Timeouts don't apply to pre-
and post-start phases, so you can do lengthy transformations there.

Would that be a reasonable solution for the WebDav transformation?

The downside is that those phases have no timeouts at all, so the release
author is responsible for any failure that results in an infinite hang.
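
As a rough illustration of the pre-start idea, a script along these lines could do the ownership fix before monit is involved. This is a sketch under assumptions, not code from cf-release or the MySQL release: the DATA_DIR default, the OWNER variable and the fix_ownership helper are all hypothetical names, and the path is the one mentioned for the old nfs job.

```shell
#!/bin/bash
# Hypothetical pre-start script for a blobstore job (illustrative sketch only).
# DATA_DIR, OWNER and fix_ownership are assumed names, not part of any release.

DATA_DIR="${DATA_DIR:-/var/vcap/store/shared}"
OWNER="${OWNER:-vcap:vcap}"

# Chown only the entries that are not already owned by the target user.
# The tree is still walked, but an already-correct store triggers no chown
# calls. Only the user is checked here; group ownership is not.
fix_ownership() {
  dir="$1"
  owner="$2"
  user="${owner%%:*}"
  find "$dir" ! -user "$user" -exec chown "$owner" {} +
}

# Do nothing if the store is not mounted yet; fail the pre-start on errors.
if [ -d "$DATA_DIR" ]; then
  fix_ownership "$DATA_DIR" "$OWNER" || exit 1
fi
```

Since nothing will interrupt this phase, a script like this needs its own guard against hanging, for example on unresponsive storage.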

--
Marco Nicosia
Product Manager
Pivotal Software, Inc.





Rich Wohlstadter
 

That would probably work. I'm wondering whether the recursive chown is even necessary. The nfs service didn't do that, so why would the webdav replacement? If it does need to be done, I think it should happen outside the monit startup, or in a way that does not delay webdav from starting. Otherwise it's going to be a continuing issue, depending on customer blobstore sizes.

-Rich


Nicholas Calugar
 

Hi Rich,

We investigated and determined the recursive chown was not necessary. The
commit is still working its way through CI, but will hopefully make it into
the next CF-Release:

https://www.pivotaltracker.com/story/show/116176983

Thanks,

Nick

On Wed, Mar 23, 2016 at 6:02 AM Rich Wohlstadter <lethwin(a)gmail.com> wrote:

> That would probably work. I'm wondering if the recursive is even
> necessary. The nfs service didn't do that. Why would the webdav
> replacement? If it does indeed need to be done, then I'm thinking it
> should be done outside the monit startup or in a way that does not delay
> webdav from starting, otherwise it's going to be a continuing issue depending
> on customer blobstore sizes.
>
> -Rich

--
Nicholas Calugar
CAPI Product Manager
Pivotal Software, Inc.


Rich Wohlstadter
 

Excellent. Thanks for your investigation on this.

-Rich