User story #5617
closedDetecting and restarting Rudder on OOM (Out Of Memory Exception)
Description
Hi,
I've a lot of nodes (5000 "test" nodes). The policy generation following their acceptation has been ongoing since yesterday 21:28 (for 12 hours now). This is not normal, it should not have been longer than 4 hours.
In the webapp log, I can see the following exception:
[2014-10-07 21:26:35] INFO com.normation.rudder.services.policies.DeploymentServiceImpl - Start policy generation, checking updated rules [2014-10-07 21:28:28] WARN application - [Store Agent Run Times] Task frequency is set too low! Last task took 74577 ms but tasks are scheduled every 5000 ms. Adjust rudder.batch.storeAgentRunTimes.updateInterval if this problem persists. [2014-10-07 21:28:29] ERROR net.liftweb.actor.ActorLogger - Actor threw an exception java.lang.OutOfMemoryError: Java heap space at com.unboundid.util.StaticUtils.toLowerCase(StaticUtils.java:440) ~[unboundid-ldapsdk-2.3.4.jar:2.3.4] Exception in thread "pool-3-thread-8" java.lang.OutOfMemoryError: Java heap space at com.unboundid.util.StaticUtils.toLowerCase(StaticUtils.java:440) at com.unboundid.ldap.sdk.Entry.<init>(Entry.java:309) at com.unboundid.ldap.sdk.Entry.<init>(Entry.java:284) at com.normation.ldap.sdk.LDAPEntry$.apply(LDAPEntry.scala:291) at com.normation.ldap.sdk.LDAPEntry$.apply(LDAPEntry.scala:293) at com.normation.ldap.sdk.RoLDAPConnection$$anonfun$search$1.apply(LDAPConnection.scala:303) at com.normation.ldap.sdk.RoLDAPConnection$$anonfun$search$1.apply(LDAPConnection.scala:303) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at com.normation.ldap.sdk.RoLDAPConnection.search(LDAPConnection.scala:303) at com.normation.ldap.sdk.ReadOnlyEntryLDAPConnection$class.search(LDAPConnection.scala:82) at com.normation.ldap.sdk.RoLDAPConnection.search(LDAPConnection.scala:283) at com.normation.ldap.sdk.ReadOnlyEntryLDAPConnection$class.searchOne(LDAPConnection.scala:142) at com.normation.ldap.sdk.RoLDAPConnection.searchOne(LDAPConnection.scala:283) at com.normation.rudder.services.nodes.NodeInfoServiceImpl$$anonfun$getAll$1.apply(NodeInfoService.scala:193) at com.normation.rudder.services.nodes.NodeInfoServiceImpl$$anonfun$getAll$1.apply(NodeInfoService.scala:189) at com.normation.ldap.sdk.LDAPConnectionProvider$$anonfun$map$1.apply(LDAPConnectionProvider.scala:94) at com.normation.ldap.sdk.LDAPConnectionProvider$$anonfun$map$1.apply(LDAPConnectionProvider.scala:93) at com.normation.ldap.sdk.LDAPConnectionProvider$class.withCon(LDAPConnectionProvider.scala:154) at com.normation.ldap.sdk.ROPooledSimpleAuthConnectionProvider.withCon(LDAPConnectionProvider.scala:369) at com.normation.ldap.sdk.LDAPConnectionProvider$class.map(LDAPConnectionProvider.scala:93) at com.normation.ldap.sdk.ROPooledSimpleAuthConnectionProvider.map(LDAPConnectionProvider.scala:369) at com.normation.rudder.services.nodes.NodeInfoServiceImpl.getAll(NodeInfoService.scala:189) at com.normation.rudder.services.policies.DeploymentService_findDependantRules_bruteForce$class.getAllNodeInfos(DeploymentService.scala:322) at com.normation.rudder.services.policies.DeploymentServiceImpl.getAllNodeInfos(DeploymentService.scala:276) at com.normation.rudder.services.policies.DeploymentService$$anonfun$2.apply(DeploymentService.scala:90) [2014-10-07 21:28:30] INFO com.normation.rudder.batch.AsyncDeploymentAgent - One automatic policy update process is already pending, ignoring new policy update request
Looks like it actually failed two minutes after the beginning, but rudder is stuck thinking it's still ongoing:
Updating policies (started at 2014-10-07 21:26). Another update is pending since 2014-10-07 21:26
Thanks.
Updated by François ARMAND about 10 years ago
Ah, the kind of things making you love java....
From an user experience point of view, I think it be better to kill the JVM on such scenario, so that at least we don't have to deal with a zombie app (and so, our agent can detect it's no more running and start it again).
Of course, it would be even better to not have such error, I'm talking about a remission scenario here.
Updated by François ARMAND about 10 years ago
- Category set to System integration
- Status changed from New to Discussion
- Assignee set to François ARMAND
- Target version set to 2.6.19
I propose to add a JVM option to kill the server and restart it when out of memory errors happen
Updated by François ARMAND about 10 years ago
- Target version changed from 2.6.19 to 140
More information about what we want to do:
Use the "OnOutOfMemoryError" in something like: -XX:OnOutOfMemoryError="kill -9 %p ; RESTART_RUDDER"
Example of use:
% cat boom.scala object Booom { def main(args: Array[String]) : Unit = { val a = Array.fill(1000000)(Array.fill(100000)(1)) } } % scala boom.scala java.lang.OutOfMemoryError: Java heap space at scala.collection.mutable.ArrayBuilder$ofInt.mkArray(ArrayBuilder.scala:323) at scala.collection.mutable.ArrayBuilder$ofInt.resize(ArrayBuilder.scala:329) ..... % scala -J-XX:OnOutOfMemoryError="kill -9 %p && echo 'such memory, much killness, wow'" boom.scala # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError="kill -9 %p && echo 'such memory, much killness, wow'" # Executing /bin/sh -c "kill -9 10251 && echo 'such memory, much killness, wow'"... such memory, much killness, wow /home/fanf/java/scala/bin/scala: line 21: 10251 Killed "$@"
There is one big caveat to that: the command(s) passed to the option use fork() on linux, meaning that they need as much memory as the parent JVM for the fork to succeed, modulo (on Linux) the memory_overcommit profil used on the server. Of course, most of the time, if you get a OutOfMemory Error, that means that you don't have any more memory available for the JVM, and if your JVM is configured to use most of the server memory available, the fork won't be possible under default overcommit configuration on Linux (mode 0, 50% allowed).
Workaround: configure overcommit in mode 1 or with 100%.
See:
- https://bugs.openjdk.java.net/browse/JDK-8027434 for the bug report on that problem.
- https://www.kernel.org/doc/Documentation/vm/overcommit-accounting for explanation of overcommit modes
- https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-captun.html for example about how to configure overcommit on RedHat (that boils donw to: "echo 1 > /proc/sys/vm/overcommit_memory" for temp configuration, use sysctl config for permanent one)
Updated by Matthieu CERDA about 10 years ago
- Target version changed from 140 to 3.0.0~beta1
Updated by Jonathan CLARKE almost 10 years ago
- Target version changed from 3.0.0~beta1 to 3.0.0~beta2
Updated by François ARMAND almost 10 years ago
- Target version changed from 3.0.0~beta2 to 3.0.0~rc1
Updated by Vincent MEMBRÉ almost 10 years ago
- Target version changed from 3.0.0~rc1 to 3.0.0
Updated by François ARMAND almost 10 years ago
- Target version changed from 3.0.0 to 2.10.9
I have no idea of what we could do for now. All in all, the target is not correct: the bug exists since java exists (or something alike).
Perhaps we should add a system directive looking for occurence of "java.lang.OutOfMemoryError: Java heap space" (or other oom errors), and restarting Rudder in that case ?
Updated by Vincent MEMBRÉ almost 10 years ago
- Target version changed from 2.10.9 to 2.10.10
Updated by Vincent MEMBRÉ almost 10 years ago
- Target version changed from 2.10.10 to 2.10.11
Updated by Vincent MEMBRÉ over 9 years ago
- Target version changed from 2.10.11 to 2.10.12
Updated by Vincent MEMBRÉ over 9 years ago
- Target version changed from 2.10.12 to 2.10.13
Updated by Vincent MEMBRÉ over 9 years ago
- Target version changed from 2.10.13 to 2.10.14
Updated by Vincent MEMBRÉ over 9 years ago
- Target version changed from 2.10.14 to 2.10.15
Updated by Vincent MEMBRÉ over 9 years ago
- Target version changed from 2.10.15 to 2.10.16
Updated by Vincent MEMBRÉ over 9 years ago
- Target version changed from 2.10.16 to 2.10.17
Updated by Vincent MEMBRÉ about 9 years ago
- Target version changed from 2.10.17 to 2.10.18
Updated by Vincent MEMBRÉ about 9 years ago
- Target version changed from 2.10.18 to 2.10.19
Updated by Vincent MEMBRÉ about 9 years ago
- Target version changed from 2.10.19 to 2.10.20
Updated by Vincent MEMBRÉ almost 9 years ago
- Target version changed from 2.10.20 to 277
Updated by Vincent MEMBRÉ almost 9 years ago
- Target version changed from 277 to 2.11.18
Updated by Vincent MEMBRÉ almost 9 years ago
- Target version changed from 2.11.18 to 2.11.19
Updated by Vincent MEMBRÉ over 8 years ago
- Target version changed from 2.11.19 to 2.11.20
Updated by Vincent MEMBRÉ over 8 years ago
- Target version changed from 2.11.20 to 2.11.21
Updated by François ARMAND over 8 years ago
- Subject changed from Aborted policy generation due to out of memory exception is not detected to Detecting and restarting Rudder on OOM (Out Of Memory Exception)
Updated by François ARMAND over 8 years ago
Updated by François ARMAND over 8 years ago
- Related to Bug #2843: Rudder can fail to generate promises when Java is lacking memory added
Updated by François ARMAND over 8 years ago
- Related to Bug #7524: Java OOM during Java's log migration added
Updated by François ARMAND over 8 years ago
- Related to Bug #7735: OOM in Rudder when there are too many repaired reports added
Updated by François ARMAND over 8 years ago
We could also make the OOM detection be done in a cron, looking in the last log, and restarting Rudder when Needed.
Updated by François ARMAND over 8 years ago
To better explain why the -XX:OnOutOfMemoryError is not a very good solution:
On a VM with 2,5G of RAM, default Centos6, no space, where Oom is a java program that is just allocating big arrays:
At 1G heap space, everything works as expected:
root@test:~/testOOM# java6 -Xms1000m -Xmx1000m -XX:OnOutOfMemoryError="kill -9 %p && echo 'such memory, much killness, wow'" Oom start assigning array num. 0 assigning array num. 1000 assigning array num. 2000 # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError="kill -9 %p && echo 'such memory, much killness, wow'" # Executing /bin/sh -c "kill -9 3613 && echo 'such memory, much killness, wow'"... such memory, much killness, wow Killed
At 1,3Go heap space, the forked children get killed before having a chance to kick off, NOT doing it's job:
root@test:~/testOOM# java6 -Xms1300m -Xmx1300m -XX:OnOutOfMemoryError="kill -9 %p && echo 'such memory, much killness, wow'" Oom start assigning array num. 0 assigning array num. 1000 assigning array num. 2000 assigning array num. 3000 assigning array num. 4000 # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError="kill -9 %p && echo 'such memory, much killness, wow'" # Executing /bin/sh -c "kill -9 3601 && echo 'such memory, much killness, wow'"... Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at Oom.main(Oom.java:18)
(actually, it's even befoare 1,3Go, because heap space is just a little party of the java RAM process, and Linux has some reserved memory, and with Xms we ask for a preallocation of the whole memory segment. But you get the basic idea: around 50% of RAM allocated to the heap, -XX:OnOutOfMemoryError stop being usefull)
And at higher level of RAM, linux is killing directly the process, not letting OOM appear (this case is out of scope, it's for information):
root@test:~/testOOM# java6 -Xms2300m -Xmx2300m -XX:OnOutOfMemoryError="kill -9 %p && echo 'such memory, much killness, wow'" Oom start assigning array num. 0 assigning array num. 1000 assigning array num. 2000 assigning array num. 3000 assigning array num. 4000 Killed
Updated by François ARMAND over 8 years ago
- Assignee changed from François ARMAND to Benoît PECCATTE
Benoit, could you please find what it would take to have a cron script (or promise) looking in the log and checking for OOM and restarting rudder if so ?
The main difficulities is to be able to be sure that we don't kill Rudder in loop. That should be ok, because Jetty is creating a new log file at each start of Rudder, but perhaps there is other things to take care of?
Updated by Florian Heigl over 8 years ago
One thing to take care of is the presence of the various heap and crash dumps the JVM will litter around, one set per OOM.
Those can be pretty troublesome as they fill up the filesystem.
In our case we got really lucky in that we had already separated the postgresql and ldap directories.
Others may not and might face OOM-related corruption on the other components.
Updated by François ARMAND over 8 years ago
Note: in JVM 8u92, two jvm options were introduced: http://www.oracle.com/technetwork/java/javase/8u92-relnotes-2949471.html
- ExitOnOutOfMemoryError
- CrashOnOutOfMemoryError
It seems to be the behavior that we want.
We should enable one or the other when we detect that the JVM version is sufficiently hight to support them.
Updated by Vincent MEMBRÉ over 8 years ago
- Target version changed from 2.11.21 to 2.11.22
Updated by Vincent MEMBRÉ over 8 years ago
- Target version changed from 2.11.22 to 2.11.23
Updated by Vincent MEMBRÉ over 8 years ago
- Target version changed from 2.11.23 to 2.11.24
Updated by Vincent MEMBRÉ about 8 years ago
- Target version changed from 2.11.24 to 308
Updated by François ARMAND about 8 years ago
- Related to Architecture #8923: Requires Java8 (jdk8) for Rudder 4.0 added
Updated by François ARMAND about 8 years ago
Some more precision on JVM supporting these flags:
- hotspot and openjdk JDK 8u92 (1.8.0_92-b14) and up
- openjdk
- jrockit, any java 7/8
- not on IBM JVM
We have to option, I'm not sure which one we prefer. I think crash, so that we have a chance to know what part of the code OOMed.
Crash example:
% scala -J-XX:+CrashOnOutOfMemoryError scala> val a = new Array[Array[Int]](10000) a: Array[Array[Int]] = Array(null, null, null, null, null, ... scala> for( i <- 0 until 10000) { a(i) = new Array(100000) } Aborting due to java.lang.OutOfMemoryError: Java heap space # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (debug.cpp:308), pid=2125, tid=0x00007f81e7ffa700 # fatal error: OutOfMemory encountered: Java heap space # # JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops) # Core dump written. Default location: /home/fanf/core or core.2125 # # An error report file with more information is saved as: # /home/fanf/hs_err_pid2125.log # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp # /home/fanf/java/scala/bin/scala : ligne 21 : 2125 Abandon (core dumped)"$@"
Or simply exit:
== [fanf@luhman16] ~ (master●) == % scala -J-XX:+ExitOnOutOfMemoryError Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101). Type in expressions for evaluation. Or try :help. scala> val a = new Array[Array[Int]](10000) a: Array[Array[Int]] = Array(null, null, null, null, null, ... scala> for( i <- 0 until 10000) { a(i) = new Array(100000) } Terminating due to java.lang.OutOfMemoryError: Java heap space
Updated by Vincent MEMBRÉ about 8 years ago
- Target version changed from 308 to 3.1.14
Updated by Vincent MEMBRÉ about 8 years ago
- Target version changed from 3.1.14 to 3.1.15
Updated by Vincent MEMBRÉ about 8 years ago
- Target version changed from 3.1.15 to 3.1.16
Updated by Vincent MEMBRÉ about 8 years ago
- Target version changed from 3.1.16 to 3.1.17
Updated by Vincent MEMBRÉ almost 8 years ago
- Target version changed from 3.1.17 to 3.1.18
Updated by Vincent MEMBRÉ almost 8 years ago
- Target version changed from 3.1.18 to 3.1.19
Updated by Benoît PECCATTE over 7 years ago
- Tracker changed from Bug to User story
Updated by François ARMAND over 7 years ago
- Related to Bug #8165: rudder-init fails to report memory errors from jetty start added
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 3.1.19 to 3.1.20
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 3.1.20 to 3.1.21
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 3.1.21 to 3.1.22
Updated by Vincent MEMBRÉ over 7 years ago
- Target version changed from 3.1.22 to 3.1.23
Updated by Vincent MEMBRÉ about 7 years ago
- Target version changed from 3.1.23 to 3.1.24
Updated by Vincent MEMBRÉ about 7 years ago
- Target version changed from 3.1.24 to 3.1.25
Updated by Benoît PECCATTE about 7 years ago
- Target version changed from 3.1.25 to 4.1.9
Updated by Vincent MEMBRÉ almost 7 years ago
- Target version changed from 4.1.9 to 4.1.10
Updated by Vincent MEMBRÉ almost 7 years ago
- Target version changed from 4.1.10 to 4.1.11
Updated by Vincent MEMBRÉ over 6 years ago
- Target version changed from 4.1.11 to 4.1.12
Updated by Vincent MEMBRÉ over 6 years ago
- Target version changed from 4.1.12 to 4.1.13
Updated by Vincent MEMBRÉ over 6 years ago
- Target version changed from 4.1.13 to 4.1.14
Updated by Benoît PECCATTE over 6 years ago
- Target version changed from 4.1.14 to 4.1.15
Updated by Vincent MEMBRÉ about 6 years ago
- Target version changed from 4.1.15 to 4.1.16
Updated by Vincent MEMBRÉ about 6 years ago
- Target version changed from 4.1.16 to 4.1.17
Updated by Vincent MEMBRÉ almost 6 years ago
- Target version changed from 4.1.17 to 4.1.18
Updated by Vincent MEMBRÉ almost 6 years ago
- Target version changed from 4.1.18 to 4.1.19
Updated by Alexis Mousset almost 6 years ago
- Target version changed from 4.1.19 to 4.1.20
Updated by Alexis Mousset over 5 years ago
- Target version changed from 4.1.20 to 588
Updated by Alexis Mousset over 5 years ago
- Target version changed from 588 to 5.0.13
Updated by Vincent MEMBRÉ about 5 years ago
- Target version changed from 5.0.13 to 5.0.14
Updated by Vincent MEMBRÉ about 5 years ago
- Target version changed from 5.0.14 to 5.0.15
Updated by Vincent MEMBRÉ almost 5 years ago
- Target version changed from 5.0.15 to 5.0.16
Updated by Alexis Mousset almost 5 years ago
- Target version changed from 5.0.16 to 5.0.17
Updated by Vincent MEMBRÉ over 4 years ago
- Target version changed from 5.0.17 to 5.0.18
Updated by Benoît PECCATTE over 4 years ago
- Target version changed from 5.0.18 to 6.2.0~beta1
Updated by François ARMAND over 4 years ago
- Status changed from Discussion to Resolved
- Assignee deleted (
Benoît PECCATTE) - Target version changed from 6.2.0~beta1 to 5.0.18
We had several iteration on that one since rudder 4.1, and as of rudder 5.0, the solution is pretty robust. It works with OOM and all other "non expected", fatal exceptions.