Bug #11594
JVM crash due to thread leak in CachedFindRuleNodeStatusReports
Status: Closed
Description
We have a thread leak, probably due to a deadlock combined with an unbounded thread pool. Some users have reported crashes, with javacores filled with thousands of copies of the following stack trace:
3XMTHREADINFO      "ForkJoinPool-1-worker-46345" J9VMThread:0x0000000035564D00, j9thread_t:0x00007F1019500010, java/lang/Thread:0x00000003EE1D81C0, state:B, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1C6C3, isDaemon:true)
3XMTHREADINFO1            (native thread ID:0x4DD4, native priority:0x5, native policy:UNKNOWN, vmstate:B, vm thread flags:0x01000201)
3XMTHREADINFO2            (native stack address range from:0x00007F1018FA0000, to:0x00007F1018FE1000, size:0x41000)
3XMCPUTIME               CPU usage total: 0.332868435 secs, current category="Application"
3XMTHREADBLOCK     Blocked on: com/normation/rudder/services/reports/CachedReportingServiceImpl@0x00000003D9B602D8 Owned by: "ForkJoinPool-1-worker-47359" (J9VMThread:0x000000003D862600, java/lang/Thread:0x0000000707287D28)
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at com/normation/rudder/services/reports/CachedFindRuleNodeStatusReports$$anonfun$invalidate$1.apply(ReportingServiceImpl.scala:152(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/services/reports/CachedFindRuleNodeStatusReports$$anonfun$invalidate$1.apply(ReportingServiceImpl.scala:152(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/ExecutionContextImpl$DefaultThreadFactory$$anon$2$$anon$4.block(ExecutionContextImpl.scala:48(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinPool.managedBlock(ForkJoinPool.java:3640(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/ExecutionContextImpl$DefaultThreadFactory$$anon$2.blockOn(ExecutionContextImpl.scala:45(Compiled Code))
4XESTACKTRACE                at scala/concurrent/package$.blocking(package.scala:123(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/services/reports/CachedFindRuleNodeStatusReports$class.invalidate(ReportingServiceImpl.scala:152(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/services/reports/CachedReportingServiceImpl.invalidate(ReportingServiceImpl.scala:76(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/reports/execution/ReportsExecutionService$$anonfun$hook$2.apply$mcV$sp(ReportsExecutionService.scala:209(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/reports/execution/ReportsExecutionService$$anonfun$hook$2.apply(ReportsExecutionService.scala:209(Compiled Code))
4XESTACKTRACE                at com/normation/rudder/reports/execution/ReportsExecutionService$$anonfun$hook$2.apply(ReportsExecutionService.scala:209(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/Future$PromiseCompletingRunnable.run(Future.scala:24(Compiled Code))
4XESTACKTRACE                at scala/concurrent/impl/ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinTask.doExec(ForkJoinTask.java:260(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinPool.runWorker(ForkJoinPool.java:1979(Compiled Code))
4XESTACKTRACE                at scala/concurrent/forkjoin/ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107(Compiled Code))
3XMTHREADINFO3           No native callstack available for this thread
Updated by François ARMAND about 7 years ago
The problem is there:
/**
 * Invalidate some keys in the cache. That won't load them again
 * immediately.
 *
 * Add a "blocking" signal to the Future's thread pool so that it gives
 * more threads to others, because this one is taken for a long time.
 */
def invalidate(nodeIds: Set[NodeId]): Box[Map[NodeId, (RunAndConfigInfo, Set[RuleNodeStatusReport])]] = scala.concurrent.blocking {
  this.synchronized {
    logger.debug(s"Compliance cache: invalidate cache for nodes: [${nodeIds.map { _.value }.mkString(",")}]")
    cache = cache -- nodeIds
    //preload new results
    checkAndUpdateCache(nodeIds)
  }
}
Most likely, for some reason we enter a state where a lot of updates are done, and suddenly threads pile up faster than they can unstack. If the modification rate is high enough to keep requiring a full new checkAndUpdateCache, we reach thread apocalypse at some point.
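The pile-up mechanism can be reproduced in isolation. In this minimal sketch (all names and numbers are illustrative, not taken from Rudder), every future wraps a contended `synchronized` block in `scala.concurrent.blocking`, exactly like `invalidate` above: each blocked worker signals the ForkJoinPool that it is stalled, the pool spawns a compensation thread, that thread picks up the next queued task and blocks on the same monitor in turn, and so the worker count grows with the number of queued invalidations:

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.{Await, ExecutionContext, Future, blocking}
import scala.concurrent.duration._

object BlockingPileUp {
  // Runs `tasks` futures that all fight over one monitor inside
  // `blocking`, and returns how many distinct pool threads ran them.
  def distinctWorkerCount(tasks: Int): Int = {
    implicit val ec: ExecutionContext = ExecutionContext.global
    val lock    = new Object
    val workers = ConcurrentHashMap.newKeySet[String]()

    val fs = (1 to tasks).map { _ =>
      Future {
        // `blocking` tells the pool this thread will stall, so the pool
        // spawns a compensation thread, which takes the next task and
        // blocks on the same monitor -- threads pile up instead of queueing.
        blocking {
          lock.synchronized {
            workers.add(Thread.currentThread.getName)
            Thread.sleep(20)
          }
        }
      }
    }
    Await.result(Future.sequence(fs), 1.minute)
    workers.size
  }

  def main(args: Array[String]): Unit =
    println(s"distinct pool threads used: ${distinctWorkerCount(16)}")
}
```

On a typical machine this uses close to 16 distinct threads even though only one of them can hold the lock at a time; with thousands of queued invalidations, the same dynamic yields thousands of "ForkJoinPool-1-worker-*" threads as seen in the javacore.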
Updated by François ARMAND about 7 years ago
As it was reported during the night (only?), we can think it happens when a load of servers are sending their inventories. For some reason, updating the cache takes a while, perhaps because of some blocking semaphore below, and we never get out of the maze.
Most likely, we need to rewrite the node repository to better manage LDAP read/write connections and remove the blocking logic.
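One way to go in that direction, sketched here as a hypothetical pattern rather than actual Rudder code (the names `InvalidationQueue` and `reload` are invented for illustration): route cache maintenance through one dedicated thread. Invalidations are then naturally serialized without a JVM monitor, and callers on the shared ForkJoinPool are never parked on a lock:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object InvalidationQueue {
  // A single dedicated thread: invalidations run one after another,
  // so no `synchronized` is needed and no caller thread ever blocks.
  private val maintenanceEc = ExecutionContext.fromExecutorService(
    Executors.newSingleThreadExecutor { r =>
      val t = new Thread(r, "cache-maintenance")
      t.setDaemon(true)
      t
    }
  )

  @volatile private var cache = Map.empty[String, Int]

  def get(key: String): Option[Int] = cache.get(key)

  // Returns immediately; the potentially slow (e.g. LDAP-bound) reload
  // runs later, on the dedicated maintenance thread.
  def invalidate(keys: Set[String], reload: String => Int): Future[Unit] =
    Future {
      cache = cache -- keys
      // preload new results, still on the dedicated thread
      cache = cache ++ keys.map(k => k -> reload(k))
    }(maintenanceEc)
}
```

The trade-off is that reads can briefly observe stale entries while a refresh is in flight, but the hook that triggers invalidation no longer ties up a pool worker for the duration of the reload.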
Updated by Nicolas CHARLES about 7 years ago
could it be related to https://www.rudder-project.org/redmine/issues/10646 ?
Updated by François ARMAND about 7 years ago
@nicolas: I don't think so, the blocking seems to be above. But well, who knows exactly ;)
Updated by Janos Matya about 7 years ago
In the long term it might be a good idea to outsource policy generation, either to the relays or to a dedicated host, like a distcc setup. It would most probably require a major rewrite, but it could solve quite a few scalability issues you might also face in the future.
Updated by Benoît PECCATTE about 7 years ago
- Effort required set to Medium
- Priority changed from 52 to 35
Updated by Vincent MEMBRÉ about 7 years ago
- Target version changed from 3.1.24 to 3.1.25
Updated by Vincent MEMBRÉ almost 7 years ago
- Target version changed from 3.1.25 to 387
- Priority changed from 35 to 34
Updated by Vincent MEMBRÉ almost 7 years ago
- Target version changed from 387 to 4.1.10
Updated by Vincent MEMBRÉ almost 7 years ago
- Target version changed from 4.1.10 to 4.1.11
- Priority changed from 34 to 33
Updated by Vincent MEMBRÉ over 6 years ago
- Target version changed from 4.1.11 to 4.1.12
- Priority changed from 33 to 32
Updated by Vincent MEMBRÉ over 6 years ago
- Target version changed from 4.1.12 to 4.1.13
- Priority changed from 32 to 31
Updated by Benoît PECCATTE over 6 years ago
- Target version changed from 4.1.13 to 411
Updated by Benoît PECCATTE over 6 years ago
- Target version changed from 411 to 4.1.13
Updated by Vincent MEMBRÉ over 6 years ago
- Target version changed from 4.1.13 to 4.1.14
Updated by Benoît PECCATTE over 6 years ago
- Target version changed from 4.1.14 to 4.1.15
- Priority changed from 31 to 30
Updated by Vincent MEMBRÉ about 6 years ago
- Target version changed from 4.1.15 to 4.1.16
- Priority changed from 30 to 29
Updated by Vincent MEMBRÉ about 6 years ago
- Target version changed from 4.1.16 to 4.1.17
Updated by Vincent MEMBRÉ almost 6 years ago
- Target version changed from 4.1.17 to 4.1.18
- Priority changed from 29 to 0
Updated by Vincent MEMBRÉ almost 6 years ago
- Target version changed from 4.1.18 to 4.1.19
Updated by Alexis Mousset almost 6 years ago
- Target version changed from 4.1.19 to 4.1.20
Updated by François ARMAND over 5 years ago
- Target version changed from 4.1.20 to 4.1.21
Updated by Vincent MEMBRÉ over 5 years ago
- Target version changed from 4.1.21 to 4.1.22
Updated by Nicolas CHARLES over 5 years ago
- Related to Bug #14465: Generation fails with "Task FutureTask rejected from ThreadPoolExecutor" due to timeout in JS computation added
Updated by François ARMAND over 5 years ago
- User visibility changed from Operational - other Techniques | Technique editor | Rudder settings to Operational - other Techniques | Rudder settings | Plugins
- Priority changed from 0 to 29
Updated by Vincent MEMBRÉ over 5 years ago
- Target version changed from 4.1.22 to 4.1.23
Updated by Vincent MEMBRÉ over 5 years ago
- Target version changed from 4.1.23 to 4.1.24
Updated by Vincent MEMBRÉ over 5 years ago
- Target version changed from 4.1.24 to 588
Updated by Benoît PECCATTE over 5 years ago
- Target version changed from 588 to 5.0.13
Updated to 5.0, but I need a second opinion, @François?
Updated by Vincent MEMBRÉ about 5 years ago
- Target version changed from 5.0.13 to 5.0.14
Updated by Vincent MEMBRÉ about 5 years ago
- Target version changed from 5.0.14 to 5.0.15
Updated by Vincent MEMBRÉ almost 5 years ago
- Target version changed from 5.0.15 to 5.0.16
Updated by Alexis Mousset almost 5 years ago
- Target version changed from 5.0.16 to 5.0.17
Updated by Vincent MEMBRÉ over 4 years ago
- Target version changed from 5.0.17 to 5.0.18
Updated by Vincent MEMBRÉ over 4 years ago
- Target version changed from 5.0.18 to 5.0.19
Updated by Vincent MEMBRÉ about 4 years ago
- Target version changed from 5.0.19 to 5.0.20
Updated by Vincent MEMBRÉ about 4 years ago
- Target version changed from 5.0.20 to 797
Updated by Benoît PECCATTE over 3 years ago
- Target version changed from 797 to 6.1.14
- Priority changed from 29 to 31
Updated by Vincent MEMBRÉ over 3 years ago
- Target version changed from 6.1.14 to 6.1.15
Updated by Vincent MEMBRÉ over 3 years ago
- Target version changed from 6.1.15 to 6.1.16
Updated by Vincent MEMBRÉ about 3 years ago
- Target version changed from 6.1.16 to 6.1.17
- Priority changed from 31 to 32
Updated by Vincent MEMBRÉ about 3 years ago
- Target version changed from 6.1.17 to 6.1.18
Updated by Vincent MEMBRÉ almost 3 years ago
- Target version changed from 6.1.18 to 6.1.19
- Priority changed from 32 to 33
Updated by François ARMAND almost 3 years ago
- Status changed from New to Resolved
This is corrected by the switch to ZIO in Rudder 6.1.
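For context on why that helps: ZIO runs blocking effects on a separate, dedicated blocking executor and coordinates with asynchronous primitives (such as its Semaphore) instead of JVM monitors, so compute-pool fibers are never parked on a lock. A rough plain-Scala analogue of that separation, illustrative only and not the actual Rudder fix:

```scala
import java.util.concurrent.{Executors, Semaphore}
import scala.concurrent.{ExecutionContext, Future}

object BlockingIsolation {
  // A cached pool reserved for blocking work, analogous to ZIO's
  // blocking executor: it may grow, but it is isolated from the
  // shared compute pool, so compute workers never stall on it.
  private val blockingEc = ExecutionContext.fromExecutorService(
    Executors.newCachedThreadPool { r =>
      val t = new Thread(r, "blocking-io")
      t.setDaemon(true)
      t
    }
  )

  // One permit: at most one cache refresh at a time.
  private val onePermit = new Semaphore(1)

  // Runs `refresh` on the blocking pool, serialized by the semaphore.
  // Callers on the compute pool just receive a Future and move on.
  def refreshCache[A](refresh: => A): Future[A] = Future {
    onePermit.acquire()
    try refresh
    finally onePermit.release()
  }(blockingEc)
}
```

Any waiting still happens, but only on threads explicitly dedicated to blocking work, which is the property that prevents the original compensation-thread explosion.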