Re: [vcap-dev] Java OOM debugging



Hi Lari et al,

Thanks for your help, Lari.

David and I are pairing on this issue, and we're yet to resolve it. We're
in the process of creating a repeatable test case (our most crashy app
makes calls to external services that need mocking), but in the meantime,
here's what we've seen.

Between Java Buildpack commits e89e546 and 17162df, we see apps crashing
with Warden out of memory errors. killjava.sh never fires, and this has led
us to believe that the kernel is shooting a cgroup process in the head
after the cgroup oversteps its memory limit. We cannot find any evidence of
the OOM killer firing in any logs, but we may not be looking in the right
place.

The JBP sets heap to 70%, metaspace to 15% (with max set equal to
initial), 5% for "stack", 5% for "normalised stack" and 10% for
"native". We do not understand why this adds up to 105%, but haven't looked
into the JBP algorithm yet. Any pointers on what "normalised stack" is
would be much appreciated, as it doesn't appear in the list of heuristics
supplied via the app env.
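As a guess (we haven't read the JBP memory heuristics code yet, so this is
only our assumption): if those figures are weightings rather than absolute
percentages, the buildpack may be normalising them by their sum, which would
make the 105 total harmless. For a 2GB container that would give roughly:

heap:             70 / 105 * 2048MB ~= 1365MB
metaspace:        15 / 105 * 2048MB ~=  293MB
stack:             5 / 105 * 2048MB ~=   98MB
normalised stack:  5 / 105 * 2048MB ~=   98MB
native:           10 / 105 * 2048MB ~=  195MB

If that reading is wrong, please correct us.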

Other team members tried applying the same settings that you suggested -
thanks for this. Apps still crash with these settings, albeit less
frequently.

After reading the blog you linked to (
http://java.dzone.com/articles/java-8-permgen-metaspace) we wondered
whether the increased *reserved* metaspace claimed after a metaspace GC might
be causing a problem; however, we reused the test code to create a metaspace
leak in a CF app and saw metaspace GCs occur correctly, with memory usage
never growing over MaxMetaspaceSize. This figures, as the committed metaspace
is still less than MaxMetaspaceSize, and the reserved appears to be
whatever RAM is free across the whole DEA.

We noted that an Oracle blog (
https://blogs.oracle.com/poonam/entry/about_g1_garbage_collector_permanent)
mentions that the metaspace size parameters are approximate. We're
currently wondering if native allocations by Tomcat (APR, NIO) are taking
up more container memory, and so when the metaspace fills, it's creeping
slightly over the limit and triggering the kernel's OOM killer.

Any suggestions would be much appreciated. We've tried to resist tweaking
heuristics blindly, but are running out of options as we're struggling to
figure out how the Java process is using *committed* memory. pmap seems to
show virtual memory, so it's hard to see whether things like the metaspace
or NIO ByteBuffers are nabbing too much and triggering the kernel's OOM killer.
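(For reference, the commands we know of for looking at resident rather than
virtual usage are roughly the following - a sketch only: the PID is a
placeholder, and the jcmd report needs the JVM started with
-XX:NativeMemoryTracking=summary.)

# per-mapping resident set size (RSS column) instead of virtual size
pmap -x <java-pid>

# total resident memory as the kernel sees it, summed from smaps
grep '^Rss:' /proc/<java-pid>/smaps | awk '{sum += $2} END {print sum " kB"}'

# JVM-side breakdown of reserved vs committed native memory
jcmd <java-pid> VM.native_memory summary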

Thanks for all your help,

Daniel Jones & David Head-Rapson

On Wed, Apr 29, 2015 at 8:07 PM, Lari Hotari <Lari(a)hotari.net> wrote:

Hi,

I created a few tools to debug OOM problems, since the application I was
responsible for running on CF was failing constantly because of OOM
problems. The problems I had turned out not to be actual memory leaks in
the Java application.

In the "cf events appname" log I would get entries like this:
2015-xx-xxTxx:xx:xx.00-0400 app.crash appname index: 1,
reason: CRASHED, exit_description: out of memory, exit_status: 255

This type of entry is produced when the container goes over its
memory resource limits. It doesn't mean that there is a memory leak in the
Java application. The container gets killed by the Linux kernel OOM killer (
https://github.com/cloudfoundry/warden/blob/master/warden/README.md#limit-handle-mem-value)
based on the resource limits set on the warden container.

The memory limit is specified in number of bytes. It is enforced using the
control group associated with the container. When a container exceeds this
limit, one or more of its processes will be killed by the kernel.
Additionally, the Warden will be notified that an OOM happened and it
subsequently tears down the container.
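If you want to confirm that this is what's happening, you can look at the
container's memory cgroup and the kernel log on the DEA host (a sketch - the
exact cgroup path depends on how Warden set up the container):

# find the memory cgroup of the Java process
grep memory /proc/<java-pid>/cgroup

# then, under /sys/fs/cgroup/memory/<that-path>/ :
cat memory.limit_in_bytes      # the enforced limit
cat memory.max_usage_in_bytes  # high-water mark of usage
cat memory.failcnt             # how many times the limit was hit
cat memory.oom_control         # oom_kill_disable / under_oom flags

# evidence of the kernel OOM killer firing
dmesg | grep -iE 'killed process|out of memory'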

In my case the process never got killed by the killjava.sh script that the
java-buildpack calls when an OOM happens in Java.

This is the tool I built to debug the problems:
https://github.com/lhotari/java-buildpack-diagnostics-app
I deployed that app as part of the forked buildpack I'm using.
Please read the readme about what its limitations are. It worked for me,
but it might not work for you. It's open source and you can fork it. :)

There is a solution in my toolbox for creating a heap dump and uploading
it to S3:

https://github.com/lhotari/java-buildpack-diagnostics-app/blob/master/src/main/groovy/io/github/lhotari/jbpdiagnostics/HeapDumpServlet.groovy
The readme explains how to set up Amazon S3 keys for this:
https://github.com/lhotari/java-buildpack-diagnostics-app#amazon-s3-setup
Once you get a dump, you can analyse it in a Java profiler tool
like YourKit.

I also have a solution that forks the java-buildpack, modifies killjava.sh
and adds a script that uploads the heap dump to S3 in the case of an OOM:

https://github.com/lhotari/java-buildpack/commit/2d654b80f3bf1a0e0f1bae4f29cb85f56f5f8c46

In java-buildpack-diagnostics-app I also have other tools for getting
Linux operating-system-specific memory information, for example:


https://github.com/lhotari/java-buildpack-diagnostics-app/blob/master/src/main/groovy/io/github/lhotari/jbpdiagnostics/MemoryInfoServlet.groovy

https://github.com/lhotari/java-buildpack-diagnostics-app/blob/master/src/main/groovy/io/github/lhotari/jbpdiagnostics/MemorySmapServlet.groovy

https://github.com/lhotari/java-buildpack-diagnostics-app/blob/master/src/main/groovy/io/github/lhotari/jbpdiagnostics/MallocInfoServlet.groovy

These tools are handy for looking at the details of the Java process's RSS
memory usage growth.
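If you have shell access in the container, the same numbers are also
available straight from /proc, e.g. something like this to watch for growth
(the PID is a placeholder):

while true; do grep VmRSS /proc/<java-pid>/status; sleep 60; done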

There is also a solution for getting ssh shell access inside your
application with tmate.io:

https://github.com/lhotari/java-buildpack-diagnostics-app/blob/master/src/main/groovy/io/github/lhotari/jbpdiagnostics/TmateSshServlet.groovy
(this version is only compatible with the new "cflinuxfs2" stack)

It looks like there are serious problems on CloudFoundry with the memory
sizing calculation. An application that doesn't have an OOM problem will get
killed by the OOM killer because the Java process goes over the memory
limits.
I filed this issue:
https://github.com/cloudfoundry/java-buildpack/issues/157 , but that
might not cover everything.

The workaround for that in my case was to add a native key under
memory_sizes in open_jdk_jre.yml and set the minimum to 330M (that is for
2GB total memory); see the example at
https://github.com/grails-samples/java-buildpack/blob/22e0f6a/config/open_jdk_jre.yml#L25
That was how I got the app I'm running on CF to stay within the memory
bounds. I'm sure there is now also a way to set the keys without forking
the buildpack. I could have also adjusted the percentage portions, but I
wanted to set a hard minimum for this case.
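For reference, the fragment in my forked open_jdk_jre.yml looks roughly like
this (a sketch - the "330m.." range means a minimum of 330M with no upper
bound; check the linked file for the exact syntax used by your buildpack
version):

memory_sizes:
  native: 330m..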

Some other tuning was also required.

While trying to keep the Java process from growing in RSS memory size, I
added this to JAVA_OPTS:
-XX:CompressedClassSpaceSize=256M -XX:InitialCodeCacheSize=64M
-XX:CodeCacheExpansionSize=1M -XX:CodeCacheMinimumFreeSpace=1M
-XX:ReservedCodeCacheSize=200M -XX:MinMetaspaceExpansion=1M
-XX:MaxMetaspaceExpansion=8M -XX:MaxDirectMemorySize=96M

The memory overhead of a 64-bit Java process on Linux can be reduced by
specifying these environment variables in the application's manifest.yml:

stack: cflinuxfs2
...
env:
  MALLOC_ARENA_MAX: 2
  MALLOC_MMAP_THRESHOLD_: 131072
  MALLOC_TRIM_THRESHOLD_: 131072
  MALLOC_TOP_PAD_: 131072
  MALLOC_MMAP_MAX_: 65536

MALLOC_ARENA_MAX works only on cflinuxfs2 stack (the lucid64 stack has a
buggy version of glibc).

Heroku's explanation of MALLOC_ARENA_MAX:
https://devcenter.heroku.com/articles/tuning-glibc-memory-behavior
and some measurement data on how it reduces memory consumption:
https://devcenter.heroku.com/articles/testing-cedar-14-memory-use

I have created a PR to add this to CF java-buildpack:
https://github.com/cloudfoundry/java-buildpack/pull/160

I also created issue
https://github.com/cloudfoundry/java-buildpack/issues/163 and PR
https://github.com/cloudfoundry/java-buildpack/pull/159 .

I hope this information helps others struggling with OOM problems in CF.
I'm not saying that this is a ready-made solution just for you. YMMV. It
worked for me.

-Lari




On 15-04-29 10:53 AM, Head-Rapson, David wrote:

Hi,

I’m after some guidance on how to profile Java apps in CF, in order to
get to the bottom of memory issues.

We have an app that’s crashing every few hours with an OOM error; most
likely it’s a memory leak.

I’d like to profile the JVM and work out what’s eating memory; however,
tools like YourKit require connectivity INTO the JVM server (i.e. the
warden container), either via host/port or via SSH.

Since warden containers cannot be connected to on ports other than HTTP
and cannot be SSH’d into, neither of these works for me.



I tried installing a standalone JDK onto the warden container, however as
soon as I ran ‘jmap’ to invoke the dump, warden cleaned up the container -
most likely for memory over-consumption.



I had previously found a hack in the Weblogic buildpack (
https://github.com/pivotal-cf/weblogic-buildpack/blob/master/docs/container-wls-monitoring.md)
for modifying the start script which, when used with
-XX:+HeapDumpOnOutOfMemoryError, should copy any heapdump files to a file
share somewhere. I have my own custom buildpack, so I could use something
similar.
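For reference, the flags I have in mind are along these lines (a sketch -
the dump path is illustrative and copy_dump.sh is a hypothetical script that
would push the file to a share; %p is replaced by the JVM with the PID):

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/vcap/tmp
-XX:OnOutOfMemoryError="/home/vcap/app/copy_dump.sh %p"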

Has anyone got a better solution than this?



We would love to use New Relic / AppDynamics for this, however we’re not
allowed. And I’m not 100% certain they could help with this either.



Dave







--
Regards,

Daniel Jones
EngineerBetter.com
