Bug #7735 (closed): OOM in Rudder when there are too many repaired reports
Description
If there are too many repaired reports in the database, the Rudder web interface requires a lot more memory, which can lead to an OOM.
There can be a lot of repaired reports (for instance if you use a lot of command_execution in the ncf technique editor), and with a lot of nodes it quickly adds up to millions of entries (with runs every 5 minutes, 10 repairs per run and 300 nodes, that is 300 × 10 × 288 ≈ 864,000 repaired entries per day, so roughly 2.5 million over 3 days).
An example output of the OOM is the following, but it may really be anything at all:
ERROR net.liftweb.actor.ActorLogger - Actor threw an exception
java.lang.OutOfMemoryError: Java heap space
Exception in thread "Connection reader for connection 1 to localhost:389" java.lang.OutOfMemoryError: Java heap space
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:152)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
    at com.unboundid.asn1.ASN1StreamReader.read(ASN1StreamReader.java:978)
    at com.unboundid.asn1.ASN1StreamReader.readType(ASN1StreamReader.java:327)
    at com.unboundid.asn1.ASN1StreamReader.beginSequence(ASN1StreamReader.java:900)
    at com.unboundid.ldap.protocol.LDAPMessage.readLDAPResponseFrom(LDAPMessage.java:1146)
    at com.unboundid.ldap.sdk.LDAPConnectionReader.run(LDAPConnectionReader.java:257)
Exception in thread "pool-2-thread-5" java.lang.OutOfMemoryError: Java heap space
    at org.postgresql.jdbc2.TimestampUtils.loadCalendar(TimestampUtils.java:101)
    at org.postgresql.jdbc2.TimestampUtils.toTimestamp(TimestampUtils.java:333)
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getTimestamp(AbstractJdbc2ResultSet.java:540)
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getTimestamp(AbstractJdbc2ResultSet.java:2629)
    at com.normation.rudder.repository.jdbc.ReportsMapper$.mapRow(ReportsJdbcRepository.scala:448)
    at com.normation.rudder.repository.jdbc.ReportsMapper$.mapRow(ReportsJdbcRepository.scala:438)
    at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:92)
    at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:60)
    at org.springframework.jdbc.core.JdbcTemplate$1QueryStatementCallback.doInStatement(JdbcTemplate.java:446)
    at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:396)
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:456)
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:464)
    at com.normation.rudder.repository.jdbc.ReportsJdbcRepository.getErrorReportsBeetween(ReportsJdbcRepository.scala:428)
    at com.normation.rudder.batch.AutomaticReportLogger$LAAutomaticReportLogger$$anonfun$messageHandler$1.applyOrElse(AutomaticReportLogger.scala:130)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at net.liftweb.actor.LiftActor$class.execTranslate(LiftActor.scala:440)
    at com.normation.rudder.batch.AutomaticReportLogger$LAAutomaticReportLogger.execTranslate(AutomaticReportLogger.scala:84)
    at net.liftweb.actor.SpecializedLiftActor$class.liftedTree2$1(LiftActor.scala:288)
    at net.liftweb.actor.SpecializedLiftActor$class.net$liftweb$actor$SpecializedLiftActor$$proc2(LiftActor.scala:287)
    at net.liftweb.actor.SpecializedLiftActor$$anonfun$net$liftweb$actor$SpecializedLiftActor$$processMailbox$1.apply$mcV$sp(LiftActor.scala:210)
    at net.liftweb.actor.SpecializedLiftActor$$anonfun$net$liftweb$actor$SpecializedLiftActor$$processMailbox$1.apply(LiftActor.scala:210)
    at net.liftweb.actor.SpecializedLiftActor$$anonfun$net$liftweb$actor$SpecializedLiftActor$$processMailbox$1.apply(LiftActor.scala:210)
    at net.liftweb.actor.SpecializedLiftActor$class.around(LiftActor.scala:224)
    at com.normation.rudder.batch.AutomaticReportLogger$LAAutomaticReportLogger.around(AutomaticReportLogger.scala:84)
    at net.liftweb.actor.SpecializedLiftActor$class.net$liftweb$actor$SpecializedLiftActor$$processMailbox(LiftActor.scala:209)
    at net.liftweb.actor.SpecializedLiftActor$$anonfun$2$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiftActor.scala:173)
    at net.liftweb.actor.LAScheduler$$anonfun$9$$anon$2$$anon$3.run(LiftActor.scala:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Workarounds
A workaround may be to give more RAM to the Rudder web application, as explained here: http://www.rudder-project.org/doc-3.2/_performance_tuning.html#_java_out_of_memory_error
It may not be sufficient when there are really a lot of events, so you may need a more radical workaround, as explained in the comments, which is to delete the excess events:
delete from ruddersysevents where eventtype = 'result_repaired' and executiontimestamp < now()-'1 hour'::interval ;
Of course, this is a rather irreversible workaround, so you may want to check first what the problems were:
select nodeid,directiveid,ruleid,component,keyvalue,msg from ruddersysevents where eventtype = 'result_repaired' and executiontimestamp < now()-'1 hour'::interval group by nodeid,directiveid,ruleid, component, keyvalue,msg ;
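To see how many entries the delete above would actually remove (same WHERE clause, but only counting), you can first run:
select count(*) from ruddersysevents where eventtype = 'result_repaired' and executiontimestamp < now()-'1 hour'::interval ;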
Updated by Nicolas CHARLES almost 9 years ago
A workaround for this issue is to run the command:
delete from ruddersysevents where eventtype = 'result_repaired' and executiontimestamp < now()-'1 hour'::interval ;
Updated by Vincent MEMBRÉ almost 9 years ago
- Target version changed from 3.0.13 to 3.0.14
Updated by Vincent MEMBRÉ almost 9 years ago
- Target version changed from 3.0.14 to 3.0.15
Updated by François ARMAND over 8 years ago
- Status changed from New to In progress
Updated by Vincent MEMBRÉ over 8 years ago
- Target version changed from 3.0.15 to 3.0.16
Updated by François ARMAND over 8 years ago
- Tags set to Sponsored
Updated by François ARMAND over 8 years ago
- Related to User story #5617: Detecting and restarting Rudder on OOM (Out Of Memory Exception) added
Updated by François ARMAND over 8 years ago
By the way, we should not leave Rudder in such a sorry state; we should simply find a solution for #5617.
Updated by François ARMAND over 8 years ago
To reproduce, go to the Rudder server and connect to PostgreSQL.
Then get an example of event log entries:
select * from ruddersysevents where eventtype = 'result_repaired' and executiontimestamp < now()-'1 hour'::interval and nodeid != 'root' limit 10;

id                 | 153512
executiondate      | 2015-08-06 14:11:54+02
nodeid             | e58d1658-7cae-4b10-aafa-7147b8d02095
directiveid        | dd4a5ee8-080d-452b-bfa8-9e46b459a804
ruleid             | eff9e40d-5ccf-450a-801a-d4432a7592e4
serial             | 0
component          | File
keyvalue           | /tmp/simple-val1
executiontimestamp | 2015-08-06 16:11:53+02
eventtype          | result_repaired
policy             | File
msg                | The file /tmp/simple-val1 was successfully updated
And then, execute the following plpgsql command:
rudder=> DO $do$
BEGIN
  FOR i IN 1..100000 LOOP
    INSERT INTO ruddersysevents (executiondate, nodeid, directiveid, ruleid, serial, component, keyvalue, executiontimestamp, eventtype, policy, msg)
    VALUES (now(), 'e58d1658-7cae-4b10-aafa-7147b8d02095', 'dd4a5ee8-080d-452b-bfa8-9e46b459a804', 'eff9e40d-5ccf-450a-801a-d4432a7592e4', 0, 'File', '/tmp/simple-val1', now(), 'result_repaired', 'File', 'The file /tmp/simple-val1 was successfully updated');
  END LOOP;
END;
$do$;
adjusting the values as needed (the ones above are taken from the example row returned by the previous query).
The executiontimestamp value must be within the last 3 days for the reports to be picked up, and you may have to run it several times to insert enough events to reach the OOM.
Setting a lower value for the -Xmx parameter (like 1G) should help trigger it without having to repeat the operation.
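To check how many repaired reports now sit in the window the batch logger reads (the 3-day window is an assumption based on the note above), a simple count is enough:
select count(*) from ruddersysevents where eventtype = 'result_repaired' and executiontimestamp > now() - '3 days'::interval;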
Updated by François ARMAND over 8 years ago
The problem is not only with "CachedNodeChangesServiceImpl" but also with getErrorReportsBeetween, which issues an even worse request in case of a sudden burst of repairs, because it also fetches the result_error reports.
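To get an idea of how many rows such a request pulls in, one can count both event types over the time window it covers (the 3-day window here is an assumption taken from the reproduction notes above):
select eventtype, count(*) from ruddersysevents where eventtype in ('result_repaired', 'result_error') and executiontimestamp > now() - '3 days'::interval group by eventtype;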
Updated by Nicolas CHARLES over 8 years ago
There is something odd there: a line shouldn't be more than 2 KB (we are limited by the syslog format), so even 100,000 of them should not take more than 200,000 KB (about 200 MB).
Aren't we duplicating data, or copying the collections? Would using collection views help in this use case?
Updated by François ARMAND over 8 years ago
Well, there are a lot of collection transformations, and so a lot of memory copies. That's the downside of Scala.
All in all, we shouldn't just happily load 1 million lines like that, without any consideration for boundaries, especially for a batch process that absolutely doesn't need to have all the lines in one go.
I'm opening a side bug for that case, because there were a LOT of other problems with the logic in the logger, and so it was much, much less efficient than it should have been.
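As an illustration of what bounded retrieval could look like at the SQL level, here is a hypothetical keyset-pagination sketch over the ruddersysevents columns shown above (this is not the actual change from the pull request; 153512 stands for the last id seen in the previous batch):
select id, executiondate, nodeid, directiveid, ruleid, serial, component, keyvalue, executiontimestamp, eventtype, policy, msg from ruddersysevents where eventtype in ('result_repaired', 'result_error') and executiontimestamp > now() - '3 days'::interval and id > 153512 order by id limit 1000;
The caller would process each batch of at most 1000 rows and loop with the last id it saw, keeping memory usage bounded no matter how many reports pile up.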
Updated by François ARMAND over 8 years ago
- Status changed from In progress to Pending technical review
- Pull Request set to https://github.com/Normation/rudder/pull/1084
Updated by Jonathan CLARKE over 8 years ago
- Tags changed from Sponsored to Sponsored, Next minor release
Updated by François ARMAND over 8 years ago
- Status changed from Pending technical review to Pending release
Applied in changeset rudder|f4db618e07e20fee4a18b8e19fa35d7f03dfeaea.
Updated by Vincent MEMBRÉ over 8 years ago
- Status changed from Pending release to Released
This bug has been fixed in Rudder 3.0.16, 3.1.10 and 3.2.3 which were released on 2016-06-01, but not announced.
- 2.11: Changelog
- 3.0: Changelog
- 3.1: Changelog
- 3.2: Changelog
- Download: https://www.rudder-project.org/site/get-rudder/downloads/