Project

General

Profile

Bug #11594

JVM crash due to thread leak in CachedFindRuleNodeStatusReports

Added by François ARMAND over 2 years ago. Updated about 2 months ago.

Status:
New
Priority:
N/A
Assignee:
-
Category:
Performance and scalability
Target version:
Pull Request:
Severity:
Major - prevents use of part of Rudder | no simple workaround
User visibility:
Operational - other Techniques | Rudder settings | Plugins
Effort required:
Medium
Priority:
29

Description

We have a thread leak, perhaps due to a deadlock+unbounded thread pool. Some user are reported crashes with javacore filled with thousands of following stack trace:

3XMTHREADINFO      "ForkJoinPool-1-worker-46345" J9VMThread:0x0000000035564D00, j9thread_t:0x00007F1019500010, java/lang/Thread:0x00000003EE1D81C0, state:B, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1C6C3, isDaemon:true)
3XMTHREADINFO1            (native thread ID:0x4DD4, native priority:0x5, native policy:UNKNOWN, vmstate:B, vm thread flags:0x01000201)
3XMTHREADINFO2            (native stack address range from:0x00007F1018FA0000, to:0x00007F1018FE1000, size:0x41000)
3XMCPUTIME               CPU usage total: 0.332868435 secs, current category="Application" 
3XMTHREADBLOCK     Blocked on: com/normation/rudder/services/reports/CachedReportingServiceImpl@0x00000003D9B602D8 Owned by: "ForkJoinPool-1-worker-47359" (J9VMThread:0x000000003D862600, java/lang/Thread:0x0000000707287D28)
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at com/normation/rudder/services/reports/CachedFindRuleNodeStatusReports$$anonfun$invalidate$1.apply(ReportingServiceImpl.scala:152(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/services/reports/CachedFindRuleNodeStatusReports$$anonfun$invalidate$1.apply(ReportingServiceImpl.scala:152(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/ExecutionContextImpl$DefaultThreadFactory$$anon$2$$anon$4.block(ExecutionContextImpl.scala:48(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinPool.managedBlock(ForkJoinPool.java:3640(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/ExecutionContextImpl$DefaultThreadFactory$$anon$2.blockOn(ExecutionContextImpl.scala:45(Compiled Code))
4XESTACKTRACE                at scala/concurrent/package$.blocking(package.scala:123(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/services/reports/CachedFindRuleNodeStatusReports$class.invalidate(ReportingServiceImpl.scala:152(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/services/reports/CachedReportingServiceImpl.invalidate(ReportingServiceImpl.scala:76(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/reports/execution/ReportsExecutionService$$anonfun$hook$2.apply$mcV$sp(ReportsExecutionService.scala:209(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/reports/execution/ReportsExecutionService$$anonfun$hook$2.apply(ReportsExecutionService.scala:209(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/reports/execution/ReportsExecutionService$$anonfun$hook$2.apply(ReportsExecutionService.scala:209(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/Future$PromiseCompletingRunnable.run(Future.scala:24(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinTask.doExec(ForkJoinTask.java:260(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinPool.runWorker(ForkJoinPool.java:1979(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107(Compiled Code))
3XMTHREADINFO3           No native callstack available for this thread

Related issues

Related to Rudder - Bug #14465: Generation fails with "Task FutureTask rejected from ThreadPoolExecutor" due to timeout in JS computationReleasedActions

History

#1

Updated by François ARMAND over 2 years ago

The problem is there:

  /**
   * Invalidate some keys in the cache. That won't charge them again
   * immediately
   *
   * Add a "blocking" signal the Future's thread pool to give more thread to other
   * because this one is taken for a long time.
   */
  def invalidate(nodeIds: Set[NodeId]): Box[Map[NodeId, (RunAndConfigInfo, Set[RuleNodeStatusReport])]] = scala.concurrent.blocking { this.synchronized {
    logger.debug(s"Compliance cache: invalidate cache for nodes: [${nodeIds.map { _.value }.mkString(",")}]")
    cache = cache -- nodeIds
    //preload new results
    checkAndUpdateCache(nodeIds)
  } }

Most likelly, for some reason we enter in a state where there is a lot of update done, and suddenly we have thread pilling up too fast to unstack. If the modification rate is sufficient to require a full new checkAndUpdateCache, we reach thread apocalypse at some point.

#2

Updated by François ARMAND over 2 years ago

As it was reported during night (only ?), we can thing it happens when a loads of servers are sending there inventory. And for some reason, updating the cache take a while, perhaps because some blocking semaphor below, and we never get out of the maze.

Most likelly, we need to rewrite a the node repository to better manage LDAP thread r/w connection and remove blocking logic.

#4

Updated by François ARMAND over 2 years ago

@nicolas: I don't think so, the blocking seems to be above. But well, who knows exactly ;)

#5

Updated by Janos Matya over 2 years ago

For long term it might be a good idea to outsource building the policy, either to the relays or dedicated host, like a distcc setup (would most probably require a major rewrite, but could solve quite a some scalability issues you might also face in the future)

#6

Updated by Benoît PECCATTE over 2 years ago

  • Effort required set to Medium
  • Priority changed from 52 to 35
#7

Updated by Vincent MEMBRÉ over 2 years ago

  • Target version changed from 3.1.24 to 3.1.25
#8

Updated by Vincent MEMBRÉ about 2 years ago

  • Target version changed from 3.1.25 to 387
  • Priority changed from 35 to 34
#9

Updated by Vincent MEMBRÉ about 2 years ago

  • Target version changed from 387 to 4.1.10
#10

Updated by Vincent MEMBRÉ almost 2 years ago

  • Target version changed from 4.1.10 to 4.1.11
  • Priority changed from 34 to 33
#11

Updated by Vincent MEMBRÉ almost 2 years ago

  • Target version changed from 4.1.11 to 4.1.12
  • Priority changed from 33 to 32
#12

Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 4.1.12 to 4.1.13
  • Priority changed from 32 to 31
#13

Updated by Benoît PECCATTE over 1 year ago

  • Target version changed from 4.1.13 to 411
#14

Updated by Benoît PECCATTE over 1 year ago

  • Target version changed from 411 to 4.1.13
#15

Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 4.1.13 to 4.1.14
#16

Updated by Benoît PECCATTE over 1 year ago

  • Target version changed from 4.1.14 to 4.1.15
  • Priority changed from 31 to 30
#17

Updated by Vincent MEMBRÉ over 1 year ago

  • Target version changed from 4.1.15 to 4.1.16
  • Priority changed from 30 to 29
#18

Updated by Vincent MEMBRÉ about 1 year ago

  • Target version changed from 4.1.16 to 4.1.17
#19

Updated by Vincent MEMBRÉ about 1 year ago

  • Target version changed from 4.1.17 to 4.1.18
  • Priority changed from 29 to 0
#20

Updated by Vincent MEMBRÉ about 1 year ago

  • Target version changed from 4.1.18 to 4.1.19
#21

Updated by Alexis MOUSSET about 1 year ago

  • Target version changed from 4.1.19 to 4.1.20
#22

Updated by François ARMAND 11 months ago

  • Target version changed from 4.1.20 to 4.1.21
#23

Updated by Vincent MEMBRÉ 10 months ago

  • Target version changed from 4.1.21 to 4.1.22
#24

Updated by Nicolas CHARLES 10 months ago

  • Related to Bug #14465: Generation fails with "Task FutureTask rejected from ThreadPoolExecutor" due to timeout in JS computation added
#25

Updated by François ARMAND 10 months ago

  • User visibility changed from Operational - other Techniques | Technique editor | Rudder settings to Operational - other Techniques | Rudder settings | Plugins
  • Priority changed from 0 to 29
#26

Updated by Vincent MEMBRÉ 9 months ago

  • Target version changed from 4.1.22 to 4.1.23
#27

Updated by Vincent MEMBRÉ 8 months ago

  • Target version changed from 4.1.23 to 4.1.24
#28

Updated by Vincent MEMBRÉ 7 months ago

  • Target version changed from 4.1.24 to 588
#29

Updated by Benoît PECCATTE 7 months ago

  • Target version changed from 588 to 5.0.13

Updated to 5.0 but I need a second thought, @François ?

#30

Updated by Vincent MEMBRÉ 5 months ago

  • Target version changed from 5.0.13 to 5.0.14
#31

Updated by Vincent MEMBRÉ 4 months ago

  • Target version changed from 5.0.14 to 5.0.15
#32

Updated by Vincent MEMBRÉ about 2 months ago

  • Target version changed from 5.0.15 to 5.0.16

Also available in: Atom PDF