Bug #7735
closedOOM in Rudder when there are too many repaired reports
Description
If there are too many repaired reports in the database, the Rudder web interface requires a lot more memory, and can lead to OOM
There can be a lot of repaired reports (for instance if you use a lot of command_execution in ncf technique editor), and witha lot of nodes, it can quicly add up to 2 millions entries (runs every 5 minutes, 10 repairs per run, 300 nodes -> 2.5 millions repaired entries)
an output of the OOM is the following, but it may be really anything at all
ERROR net.liftweb.actor.ActorLogger - Actor threw an exception java.lang.OutOfMemoryError: Java heap space Exception in thread "Connection reader for connection 1 to localhost:389" java.lang.OutOfMemoryError: Java heap space at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read(BufferedInputStream.java:254) at com.unboundid.asn1.ASN1StreamReader.read(ASN1StreamReader.java:978) at com.unboundid.asn1.ASN1StreamReader.readType(ASN1StreamReader.java:327) at com.unboundid.asn1.ASN1StreamReader.beginSequence(ASN1StreamReader.java:900) at com.unboundid.ldap.protocol.LDAPMessage.readLDAPResponseFrom(LDAPMessage.java:1146) at com.unboundid.ldap.sdk.LDAPConnectionReader.run(LDAPConnectionReader.java:257) Exception in thread "pool-2-thread-5" java.lang.OutOfMemoryError: Java heap space at org.postgresql.jdbc2.TimestampUtils.loadCalendar(TimestampUtils.java:101) at org.postgresql.jdbc2.TimestampUtils.toTimestamp(TimestampUtils.java:333) at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getTimestamp(AbstractJdbc2ResultSet.java:540) at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getTimestamp(AbstractJdbc2ResultSet.java:2629) at com.normation.rudder.repository.jdbc.ReportsMapper$.mapRow(ReportsJdbcRepository.scala:448) at com.normation.rudder.repository.jdbc.ReportsMapper$.mapRow(ReportsJdbcRepository.scala:438) at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:92) at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:60) at org.springframework.jdbc.core.JdbcTemplate$1QueryStatementCallback.doInStatement(JdbcTemplate.java:446) at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:396) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:456) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:464) at com.normation.rudder.repository.jdbc.ReportsJdbcRepository.getErrorReportsBeetween(ReportsJdbcRepository.scala:428) at com.normation.rudder.batch.AutomaticReportLogger$LAAutomaticReportLogger$$anonfun$messageHandler$1.applyOrElse(AutomaticReportLogger.scala:130) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at net.liftweb.actor.LiftActor$class.execTranslate(LiftActor.scala:440) at com.normation.rudder.batch.AutomaticReportLogger$LAAutomaticReportLogger.execTranslate(AutomaticReportLogger.scala:84) at net.liftweb.actor.SpecializedLiftActor$class.liftedTree2$1(LiftActor.scala:288) at net.liftweb.actor.SpecializedLiftActor$class.net$liftweb$actor$SpecializedLiftActor$$proc2(LiftActor.scala:287) at net.liftweb.actor.SpecializedLiftActor$$anonfun$net$liftweb$actor$SpecializedLiftActor$$processMailbox$1.apply$mcV$sp(LiftActor.scala:210) at net.liftweb.actor.SpecializedLiftActor$$anonfun$net$liftweb$actor$SpecializedLiftActor$$processMailbox$1.apply(LiftActor.scala:210) at net.liftweb.actor.SpecializedLiftActor$$anonfun$net$liftweb$actor$SpecializedLiftActor$$processMailbox$1.apply(LiftActor.scala:210) at net.liftweb.actor.SpecializedLiftActor$class.around(LiftActor.scala:224) at com.normation.rudder.batch.AutomaticReportLogger$LAAutomaticReportLogger.around(AutomaticReportLogger.scala:84) at net.liftweb.actor.SpecializedLiftActor$class.net$liftweb$actor$SpecializedLiftActor$$processMailbox(LiftActor.scala:209) at net.liftweb.actor.SpecializedLiftActor$$anonfun$2$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiftActor.scala:173) at net.liftweb.actor.LAScheduler$$anonfun$9$$anon$2$$anon$3.run(LiftActor.scala:64) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
Workarounds
A workaround may be to add more ram to the Rudder web app, as explained here: http://www.rudder-project.org/doc-3.2/_performance_tuning.html#_java_out_of_memory_error
It may not be sufficient in case where there is really a lot of event, so you may need a more radical workaround, as explain in comments, which is to delete the surnumerous events:
delete from ruddersysevents where eventtype = 'result_repaired' and executiontimestamp < now()-'1 hour'::interval ;
Of course, it's a rather irreversible workaround, so you may want to know what where the problems before:
select nodeid,directiveid,ruleid,component,keyvalue,msg from ruddersysevents where eventtype = 'result_repaired' and executiontimestamp < now()-'1 hour'::interval group by nodeid,directiveid,ruleid, component, keyvalue,msg ;