Bug #7735

closed

OOM in Rudder when there are too many repaired reports

Added by Nicolas CHARLES almost 9 years ago. Updated over 2 years ago.

Status:
Released
Priority:
1 (highest)
Category:
Web - Compliance & node report
Target version:
3.0.16
Severity:
UX impact:
User visibility:
Effort required:
Priority:
0
Name check:
Fix check:
Regression:

Description

If there are too many repaired reports in the database, the Rudder web interface requires a lot more memory, which can lead to an OOM.

There can be a lot of repaired reports (for instance if you use a lot of command_execution in the ncf technique editor), and with a lot of nodes it can quickly add up to millions of entries (runs every 5 minutes, 10 repairs per run, 300 nodes -> 288 runs/day × 10 × 300 ≈ 864,000 entries per day, i.e. about 2.5 million repaired entries over three days).

An example of the OOM output follows, but it may really surface as anything at all:

ERROR net.liftweb.actor.ActorLogger - Actor threw an exception
java.lang.OutOfMemoryError: Java heap space
Exception in thread "Connection reader for connection 1 to localhost:389" java.lang.OutOfMemoryError: Java heap space
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:152)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
    at com.unboundid.asn1.ASN1StreamReader.read(ASN1StreamReader.java:978)
    at com.unboundid.asn1.ASN1StreamReader.readType(ASN1StreamReader.java:327)
    at com.unboundid.asn1.ASN1StreamReader.beginSequence(ASN1StreamReader.java:900)
    at com.unboundid.ldap.protocol.LDAPMessage.readLDAPResponseFrom(LDAPMessage.java:1146)
    at com.unboundid.ldap.sdk.LDAPConnectionReader.run(LDAPConnectionReader.java:257)
Exception in thread "pool-2-thread-5" java.lang.OutOfMemoryError: Java heap space
    at org.postgresql.jdbc2.TimestampUtils.loadCalendar(TimestampUtils.java:101)
    at org.postgresql.jdbc2.TimestampUtils.toTimestamp(TimestampUtils.java:333)
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getTimestamp(AbstractJdbc2ResultSet.java:540)
    at org.postgresql.jdbc2.AbstractJdbc2ResultSet.getTimestamp(AbstractJdbc2ResultSet.java:2629)
    at com.normation.rudder.repository.jdbc.ReportsMapper$.mapRow(ReportsJdbcRepository.scala:448)
    at com.normation.rudder.repository.jdbc.ReportsMapper$.mapRow(ReportsJdbcRepository.scala:438)
    at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:92)
    at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:60)
    at org.springframework.jdbc.core.JdbcTemplate$1QueryStatementCallback.doInStatement(JdbcTemplate.java:446)
    at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:396)
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:456)
    at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:464)
    at com.normation.rudder.repository.jdbc.ReportsJdbcRepository.getErrorReportsBeetween(ReportsJdbcRepository.scala:428)
    at com.normation.rudder.batch.AutomaticReportLogger$LAAutomaticReportLogger$$anonfun$messageHandler$1.applyOrElse(AutomaticReportLogger.scala:130)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at net.liftweb.actor.LiftActor$class.execTranslate(LiftActor.scala:440)
    at com.normation.rudder.batch.AutomaticReportLogger$LAAutomaticReportLogger.execTranslate(AutomaticReportLogger.scala:84)
    at net.liftweb.actor.SpecializedLiftActor$class.liftedTree2$1(LiftActor.scala:288)
    at net.liftweb.actor.SpecializedLiftActor$class.net$liftweb$actor$SpecializedLiftActor$$proc2(LiftActor.scala:287)
    at net.liftweb.actor.SpecializedLiftActor$$anonfun$net$liftweb$actor$SpecializedLiftActor$$processMailbox$1.apply$mcV$sp(LiftActor.scala:210)
    at net.liftweb.actor.SpecializedLiftActor$$anonfun$net$liftweb$actor$SpecializedLiftActor$$processMailbox$1.apply(LiftActor.scala:210)
    at net.liftweb.actor.SpecializedLiftActor$$anonfun$net$liftweb$actor$SpecializedLiftActor$$processMailbox$1.apply(LiftActor.scala:210)
    at net.liftweb.actor.SpecializedLiftActor$class.around(LiftActor.scala:224)
    at com.normation.rudder.batch.AutomaticReportLogger$LAAutomaticReportLogger.around(AutomaticReportLogger.scala:84)
    at net.liftweb.actor.SpecializedLiftActor$class.net$liftweb$actor$SpecializedLiftActor$$processMailbox(LiftActor.scala:209)
    at net.liftweb.actor.SpecializedLiftActor$$anonfun$2$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiftActor.scala:173)
    at net.liftweb.actor.LAScheduler$$anonfun$9$$anon$2$$anon$3.run(LiftActor.scala:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Workarounds

A workaround may be to add more RAM to the Rudder web app, as explained here: http://www.rudder-project.org/doc-3.2/_performance_tuning.html#_java_out_of_memory_error

It may not be sufficient when there are really a lot of events, so you may need a more radical workaround, as explained in the comments, which is to delete the excess events:

delete from ruddersysevents where eventtype = 'result_repaired' and executiontimestamp < now()-'1 hour'::interval ;

Of course, this is a rather irreversible workaround, so you may first want to know what the problems were:

select nodeid,directiveid,ruleid,component,keyvalue,msg
  from ruddersysevents
  where eventtype = 'result_repaired' and executiontimestamp < now()-'1 hour'::interval
  group by nodeid,directiveid,ruleid, component, keyvalue,msg
;
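
To also see how many repairs each problem produced, a variant of the query above with a count (a sketch, not from the original ticket; same predicate as the delete):

select nodeid,directiveid,ruleid,component,keyvalue,msg,count(*) as repairs
  from ruddersysevents
  where eventtype = 'result_repaired' and executiontimestamp < now()-'1 hour'::interval
  group by nodeid,directiveid,ruleid, component, keyvalue,msg
  order by repairs desc
;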

Subtasks 1 (0 open, 1 closed)

Bug #8195: Correct memory profile and efficiency of non-compliant-reports logger - Released - Vincent MEMBRÉ - 2016-04-19

Related issues 1 (0 open, 1 closed)

Related to Rudder - User story #5617: Detecting and restarting Rudder on OOM (Out Of Memory Exception) - Resolved
Actions #1

Updated by Nicolas CHARLES almost 9 years ago

A workaround for this issue is to run the command

delete from ruddersysevents where eventtype = 'result_repaired' and executiontimestamp < now()-'1 hour'::interval ;
Actions #2

Updated by Vincent MEMBRÉ almost 9 years ago

  • Target version changed from 3.0.13 to 3.0.14
Actions #3

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 3.0.14 to 3.0.15
Actions #4

Updated by Nicolas CHARLES over 8 years ago

  • Assignee set to François ARMAND
Actions #5

Updated by François ARMAND over 8 years ago

  • Status changed from New to In progress
Actions #6

Updated by Vincent MEMBRÉ over 8 years ago

  • Target version changed from 3.0.15 to 3.0.16
Actions #7

Updated by François ARMAND over 8 years ago

  • Tag list set to Sponsored
Actions #8

Updated by François ARMAND over 8 years ago

  • Description updated (diff)
Actions #9

Updated by François ARMAND over 8 years ago

  • Related to User story #5617: Detecting and restarting Rudder on OOM (Out Of Memory Exception) added
Actions #10

Updated by François ARMAND over 8 years ago

By the way, we should not leave Rudder in such a sorry state, and simply find a solution for #5617

Actions #11

Updated by François ARMAND over 8 years ago

To reproduce, go to the Rudder server and connect to Postgres.

Then get an example of event logs:

select * from ruddersysevents where  eventtype = 'result_repaired' and executiontimestamp < now()-'1 hour'::interval and nodeid != 'root' limit 10;
   id   |     executiondate      |                nodeid                |             directiveid              |                ruleid                | serial |      component      |         keyvalue          |   executiontimestamp   |    eventtype    |       policy        |                             msg
--------+------------------------+--------------------------------------+--------------------------------------+--------------------------------------+--------+---------------------+---------------------------+------------------------+-----------------+---------------------+-------------------------------------------------------------
 153512 | 2015-08-06 14:11:54+02 | e58d1658-7cae-4b10-aafa-7147b8d02095 | dd4a5ee8-080d-452b-bfa8-9e46b459a804 | eff9e40d-5ccf-450a-801a-d4432a7592e4 |      0 | File                | /tmp/simple-val1          | 2015-08-06 16:11:53+02 | result_repaired | File                | The file /tmp/simple-val1 was successfully updated

Then execute this PL/pgSQL command:

rudder=> DO
$do$
BEGIN
  FOR i IN 1..100000 LOOP
    -- values copied from the example row above; both timestamps set to now()
    INSERT INTO ruddersysevents (executiondate, nodeid, directiveid, ruleid, serial, component, keyvalue, executiontimestamp, eventtype, policy, msg)
    VALUES (now(), 'e58d1658-7cae-4b10-aafa-7147b8d02095', 'dd4a5ee8-080d-452b-bfa8-9e46b459a804', 'eff9e40d-5ccf-450a-801a-d4432a7592e4', 0, 'File', '/tmp/simple-val1', now(), 'result_repaired', 'File', 'The file /tmp/simple-val1 was successfully updated');
  END LOOP;
END;
$do$;

Substitute values from your own SELECT output above.
The executiontimestamp value must be within the last 3 days for this to work, and you may have to run it several times to accumulate enough events to reach the OOM.
Setting a lower value for your -Xmx parameter (like 1G) should help you avoid having to do it more than once.
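
To check how many events have accumulated between runs, a simple count (illustrative, not part of the original reproduction notes):

select count(*) from ruddersysevents where eventtype = 'result_repaired';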

Actions #12

Updated by François ARMAND over 8 years ago

The problem is not only in "CachedNodeChangesServiceImpl" but also in getErrorReportsBeetween, which issues an even worse request in case of a sudden burst of repairs, because it also fetches the result_error reports.

Actions #13

Updated by Nicolas CHARLES over 8 years ago

There is something odd there - a line shouldn't be more than 2KB (we are limited by the syslog format), so even 100,000 of them should not be more than 200,000 KB, i.e. about 200 MB.
Aren't we duplicating data, or copying the collections? Would using collection views help in this use case?

Actions #14

Updated by François ARMAND over 8 years ago

Well, there is a lot of collection transformation, and so there is a lot of memory copying. That's a pitfall of Scala.

All in all, we shouldn't just happily load 1 million lines like that, with no consideration of bounds, especially for a batch process that absolutely doesn't need all the lines in one go.

I'm opening a side bug for that case, because there were a LOT of other problems with the logic in the logger, and so it was much, much less efficient than it should have been.
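
What we need instead is bounded fetching. A sketch of the idea as keyset pagination over the reports table (the page size and the last_seen_id placeholder are illustrative, not the actual fix in the PR):

select id, executiondate, nodeid, directiveid, ruleid, component, keyvalue, msg
  from ruddersysevents
  where eventtype in ('result_repaired', 'result_error')
    and id > :last_seen_id   -- resume after the last row processed in the previous page
  order by id
  limit 1000;

Processing one page at a time bounds memory to at most 1000 rows, whatever the total volume.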

Actions #15

Updated by François ARMAND over 8 years ago

  • Status changed from In progress to Pending technical review
  • Pull Request set to https://github.com/Normation/rudder/pull/1084
Actions #16

Updated by Jonathan CLARKE over 8 years ago

  • Tag list changed from Sponsored to Sponsored, Next minor release
Actions #17

Updated by François ARMAND over 8 years ago

  • Status changed from Pending technical review to Pending release
Actions #18

Updated by Vincent MEMBRÉ over 8 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 3.0.16, 3.1.10 and 3.2.3 which were released on 2016-06-01, but not announced.

Actions #19

Updated by Alexis Mousset over 2 years ago

  • Priority set to 0