Project

General

Profile

Actions

Bug #16291

closed

Deadlock when application starts

Added by François ARMAND almost 5 years ago. Updated almost 5 years ago.

Status:
Released
Priority:
N/A
Category:
Architecture - Internal libs
Target version:
Severity:
Critical - prevents main use of Rudder | no workaround | data loss | security
UX impact:
User visibility:
Getting started - demo | first install | Technique editor and level 1 Techniques
Effort required:
Priority:
94
Name check:
To do
Fix check:
To do
Regression:

Description

We have a deadlock at boot.
It seems that the problem from #15982 happens again. Attached thread dump when the deadlock happens.

Problematic interaction is only if a REST request is done in parallel to start


Files

2019-11-28-deadlock.tdump (86 KB) 2019-11-28-deadlock.tdump François ARMAND, 2019-11-28 18:05
2019-11-28-deadlock-2nd-boot.tdump (58.3 KB) 2019-11-28-deadlock-2nd-boot.tdump François ARMAND, 2019-11-28 18:16

Related issues 3 (0 open3 closed)

Related to Rudder - Bug #15982: Webapp does not start when only one proc is available (deadlock)ReleasedNicolas CHARLESActions
Has duplicate Rudder - Bug #16290: Rudder-jetty can not start on debian9ResolvedActions
Has duplicate Rudder - Bug #16126: Webapp on 6.0 on Ubutun16 doesn't start/is stuckResolvedFrançois ARMANDActions
Actions #1

Updated by François ARMAND almost 5 years ago

  • Related to Bug #15982: Webapp does not start when only one proc is available (deadlock) added
Actions #2

Updated by François ARMAND almost 5 years ago

More info: on the second run, it happens even whithout a rest request incoming. See attached thread dump.

Actions #3

Updated by François ARMAND almost 5 years ago

It seems that in both cases, we have these two blocked thread:

- one in GitArchiverUtils:

zio-default-async-5-1389774257
priority:5 - threadId:0x00007f2c7c038000 - nativeId:0x147d - nativeId (decimal):5245 - state:WAITING
stackTrace:
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at zio.internal.OneShot.get(OneShot.scala:79)
- locked <0x00000000c4b75500> (a zio.internal.OneShot)
at zio.Runtime.unsafeRunSync(Runtime.scala:70)
at zio.Runtime.unsafeRunSync$(Runtime.scala:65)
at com.normation.zio$ZioRuntime$$anon$2.unsafeRunSync(ZioCommons.scala:313)
at com.normation.zio$ZioRuntime$.runNow(ZioCommons.scala:327)
at com.normation.rudder.repository.xml.GitArchiverUtils.getRootDirectory(GitArchiverUtils.scala:108)
at com.normation.rudder.repository.xml.GitArchiverUtils.getRootDirectory$(GitArchiverUtils.scala:106)
at com.normation.rudder.repository.xml.GitRuleArchiverImpl.getRootDirectory$lzycompute(GitArchiverImpl.scala:68)
- locked <0x00000000c6434380> (a com.normation.rudder.repository.xml.GitRuleArchiverImpl)
at com.normation.rudder.repository.xml.GitRuleArchiverImpl.getRootDirectory(GitArchiverImpl.scala:68)
at com.normation.rudder.repository.xml.ItemArchiveManagerImpl.$anonfun$exportRules$3(ItemArchiveManagerImpl.scala:158)
at com.normation.rudder.repository.xml.ItemArchiveManagerImpl$$Lambda$2718/1520790608.apply$mcV$sp(Unknown Source)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at zio.blocking.Blocking$Service.$anonfun$effectBlocking$5(Blocking.scala:134)
at zio.blocking.Blocking$Service$$Lambda$234/468393264.apply(Unknown Source)
at zio.internal.FiberContext.evaluateNow(FiberContext.scala:386)
at zio.internal.FiberContext.$anonfun$fork$2(FiberContext.scala:655)
at zio.internal.FiberContext$$Lambda$242/1947181350.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

- one in CheckInitXmlExport:

zio-default-async-4-1389774257
priority:5 - threadId:0x00007f2c74202800 - nativeId:0x1476 - nativeId (decimal):5238 - state:WAITING
stackTrace:
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at zio.internal.OneShot.get(OneShot.scala:79)
- locked <0x00000000c4b778d0> (a zio.internal.OneShot)
at zio.Runtime.unsafeRunSync(Runtime.scala:70)
at zio.Runtime.unsafeRunSync$(Runtime.scala:65)
at com.normation.zio$ZioRuntime$$anon$2.unsafeRunSync(ZioCommons.scala:313)
at com.normation.zio$ZioRuntime$.runNow(ZioCommons.scala:327)
at com.normation.box$IOToBox.toBox(ZioCommons.scala:376)
at bootstrap.liftweb.checks.CheckInitXmlExport.checks(CheckInitXmlExport.scala:72)
at bootstrap.liftweb.SequentialImmediateBootStrapChecks.$anonfun$checks$1(BootstrapChecks.scala:99)
at bootstrap.liftweb.SequentialImmediateBootStrapChecks.$anonfun$checks$1$adapted(BootstrapChecks.scala:91)
at bootstrap.liftweb.SequentialImmediateBootStrapChecks$$Lambda$1698/1357529035.apply(Unknown Source)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at bootstrap.liftweb.SequentialImmediateBootStrapChecks.checks(BootstrapChecks.scala:91)
at bootstrap.liftweb.RudderConfig$.$anonfun$init$2(RudderConfig.scala:1074)
at bootstrap.liftweb.RudderConfig$$$Lambda$1632/1651251663.apply$mcV$sp(Unknown Source)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at zio.blocking.Blocking$Service.$anonfun$effectBlocking$5(Blocking.scala:134)
at zio.blocking.Blocking$Service$$Lambda$234/468393264.apply(Unknown Source)
at zio.internal.FiberContext.evaluateNow(FiberContext.scala:386)
at zio.internal.FiberContext.$anonfun$fork$2(FiberContext.scala:655)
at zio.internal.FiberContext$$Lambda$242/1947181350.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

And one other from Lift thread pool from lift actor, also waiting on a RunSync, like:

pool-4-thread-2
priority:5 - threadId:0x000055c3a10ce800 - nativeId:0x38c9 - nativeId (decimal):14537 - state:WAITING
stackTrace:
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000c5f89d08> (a zio.internal.OneShot)
at java.lang.Object.wait(Object.java:502)
at zio.internal.OneShot.get(OneShot.scala:79)
- locked <0x00000000c5f89d08> (a zio.internal.OneShot)
at zio.Runtime.unsafeRunSync(Runtime.scala:70)
at zio.Runtime.unsafeRunSync$(Runtime.scala:65)
at com.normation.zio$ZioRuntime$$anon$2.unsafeRunSync(ZioCommons.scala:313)
at com.normation.zio$ZioRuntime$.runNow(ZioCommons.scala:327)
at com.normation.box$IOToBox.toBox(ZioCommons.scala:376)
at com.normation.rudder.services.nodes.NodeInfoServiceCached.getLDAPNodeInfo(NodeInfoService.scala:546)
at com.normation.rudder.services.nodes.NodeInfoServiceCached.getLDAPNodeInfo$(NodeInfoService.scala:528)
at com.normation.rudder.services.nodes.NodeInfoServiceCachedImpl.getLDAPNodeInfo(NodeInfoService.scala:601)
at com.normation.rudder.services.queries.AcceptedNodesLDAPQueryProcessor.$anonfun$queryAndChekNodeId$3(LdapQueryProcessor.scala:179)
at com.normation.rudder.services.queries.AcceptedNodesLDAPQueryProcessor$$Lambda$1507/1678631731.apply(Unknown Source)
at net.liftweb.common.Full.flatMap(Box.scala:856)
at com.normation.rudder.services.queries.AcceptedNodesLDAPQueryProcessor.queryAndChekNodeId(LdapQueryProcessor.scala:176)
at com.normation.rudder.services.queries.AcceptedNodesLDAPQueryProcessor.processOnlyId(LdapQueryProcessor.scala:231)
at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.$anonfun$computeDynGroup$4(DynGroupUpdaterService.scala:100)
at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl$$Lambda$1349/362963019.apply(Unknown Source)
at net.liftweb.common.Full.flatMap(Box.scala:856)
at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.$anonfun$computeDynGroup$2(DynGroupUpdaterService.scala:99)
at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl$$Lambda$1347/1802870526.apply(Unknown Source)
at net.liftweb.common.Full.flatMap(Box.scala:856)
at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.computeDynGroup(DynGroupUpdaterService.scala:97)
at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.$anonfun$update$2(DynGroupUpdaterService.scala:132)
at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl$$Lambda$1345/254443580.apply(Unknown Source)
at net.liftweb.common.Full.flatMap(Box.scala:856)
at net.liftweb.common.Box$WithFilter.flatMap(Box.scala:484)
at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.update(DynGroupUpdaterService.scala:131)
at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2.$anonfun$applyOrElse$14(UpdateDynamicGroups.scala:257)
at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2.$anonfun$applyOrElse$14$adapted(UpdateDynamicGroups.scala:255)
at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2$$Lambda$1323/1191968593.apply(Unknown Source)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.TraversableLike$$Lambda$42/1594199808.apply(Unknown Source)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.immutable.List.map(List.scala:298)
at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2.applyOrElse(UpdateDynamicGroups.scala:255)
at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2.applyOrElse(UpdateDynamicGroups.scala:246)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at net.liftweb.actor.SpecializedLiftActor.execTranslate(LiftActor.scala:338)
at net.liftweb.actor.SpecializedLiftActor.execTranslate$(LiftActor.scala:338)
at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$.execTranslate(UpdateDynamicGroups.scala:244)
at net.liftweb.actor.SpecializedLiftActor.proc2(LiftActor.scala:304)
at net.liftweb.actor.SpecializedLiftActor.$anonfun$processMailbox$1(LiftActor.scala:226)
at net.liftweb.actor.SpecializedLiftActor$$Lambda$1206/566531775.apply$mcV$sp(Unknown Source)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at net.liftweb.actor.SpecializedLiftActor.around(LiftActor.scala:240)
at net.liftweb.actor.SpecializedLiftActor.around$(LiftActor.scala:239)
at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$.around(UpdateDynamicGroups.scala:244)
at net.liftweb.actor.SpecializedLiftActor.processMailbox(LiftActor.scala:226)
at net.liftweb.actor.SpecializedLiftActor.$anonfun$$bang$3(LiftActor.scala:189)
at net.liftweb.actor.SpecializedLiftActor$$Lambda$1205/307036850.apply$mcV$sp(Unknown Source)
at net.liftweb.actor.LAScheduler$$anon$2$$anon$3.run(LiftActor.scala:76)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Or like:

pool-4-thread-14
priority:5 - threadId:0x00007f2c6c11f800 - nativeId:0x158e - nativeId (decimal):5518 - state:WAITING
stackTrace:
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at zio.internal.OneShot.get(OneShot.scala:79)
- locked <0x00000000c4b735d0> (a zio.internal.OneShot)
at zio.Runtime.unsafeRunSync(Runtime.scala:70)
at zio.Runtime.unsafeRunSync$(Runtime.scala:65)
at com.normation.zio$ZioRuntime$$anon$2.unsafeRunSync(ZioCommons.scala:313)
at com.normation.zio$ZioRuntime$.runNow(ZioCommons.scala:327)
at com.normation.zio$UnsafeRun.runNow(ZioCommons.scala:352)
at com.normation.rudder.hooks.RunHooks$.syncRun(RunHooks.scala:292)
at com.normation.rudder.services.policies.PromiseGeneration_Hooks.$anonfun$runPreHooks$1(DeploymentService.scala:1491)
at com.normation.rudder.services.policies.PromiseGeneration_Hooks$$Lambda$2707/704991484.apply(Unknown Source)
at net.liftweb.common.Full.flatMap(Box.scala:856)
at com.normation.rudder.services.policies.PromiseGeneration_Hooks.runPreHooks(DeploymentService.scala:1490)
at com.normation.rudder.services.policies.PromiseGeneration_Hooks.runPreHooks$(DeploymentService.scala:1487)
at com.normation.rudder.services.policies.PromiseGenerationServiceImpl.runPreHooks(DeploymentService.scala:628)
at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$14(DeploymentService.scala:218)
at com.normation.rudder.services.policies.PromiseGenerationService$$Lambda$2702/484666814.apply(Unknown Source)
at net.liftweb.common.Full.flatMap(Box.scala:856)
at com.normation.rudder.services.policies.PromiseGenerationService.deploy(DeploymentService.scala:209)
at com.normation.rudder.services.policies.PromiseGenerationService.deploy$(DeploymentService.scala:145)
at com.normation.rudder.services.policies.PromiseGenerationServiceImpl.deploy(DeploymentService.scala:628)
at com.normation.rudder.batch.AsyncDeploymentActor$DeployerAgent$$anonfun$messageHandler$1.applyOrElse(AsyncDeploymentActor.scala:381)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at net.liftweb.actor.LiftActor.execTranslate(LiftActor.scala:503)
at net.liftweb.actor.LiftActor.execTranslate$(LiftActor.scala:495)
at com.normation.rudder.batch.AsyncDeploymentActor$DeployerAgent$.execTranslate(AsyncDeploymentActor.scala:371)
at net.liftweb.actor.SpecializedLiftActor.proc2(LiftActor.scala:304)
at net.liftweb.actor.SpecializedLiftActor.$anonfun$processMailbox$1(LiftActor.scala:226)
at net.liftweb.actor.SpecializedLiftActor$$Lambda$1230/2075478193.apply$mcV$sp(Unknown Source)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at net.liftweb.actor.SpecializedLiftActor.around(LiftActor.scala:240)
at net.liftweb.actor.SpecializedLiftActor.around$(LiftActor.scala:239)
at com.normation.rudder.batch.AsyncDeploymentActor$DeployerAgent$.around(AsyncDeploymentActor.scala:371)
at net.liftweb.actor.SpecializedLiftActor.processMailbox(LiftActor.scala:226)
at net.liftweb.actor.SpecializedLiftActor.$anonfun$$bang$3(LiftActor.scala:189)
at net.liftweb.actor.SpecializedLiftActor$$Lambda$1229/1135936028.apply$mcV$sp(Unknown Source)
at net.liftweb.actor.LAScheduler$$anon$2$$anon$3.run(LiftActor.scala:76)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Actions #4

Updated by François ARMAND almost 5 years ago

  • Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
  • User visibility set to Getting started - demo | first install | Technique editor and level 1 Techniques
  • Priority changed from 0 to 94

I'm setting it to critical even if having 2 cpu solves the problem, because people are likely to test in a maching with just one CPU.

Actions #5

Updated by François ARMAND almost 5 years ago

The problem may be that we should never have a runSync run from somewhere else than the zio threadpool. ==> look at how we could do that?

Actions #6

Updated by François ARMAND almost 5 years ago

OK, so the problem is likelly that with still have some synchronize, that LAScheduler use synchronized, and that is seems that openjdk 8 has some problem to deals with the way they interact (but not openjdk 11) => remove our synchronize

Actions #7

Updated by François ARMAND almost 5 years ago

Lazy val are also synchronized in disguise

Actions #8

Updated by François ARMAND almost 5 years ago

  • Status changed from New to In progress
Actions #10

Updated by François ARMAND almost 5 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/2643
Actions #11

Updated by François ARMAND almost 5 years ago

  • Status changed from Pending technical review to Pending release
Actions #12

Updated by François ARMAND almost 5 years ago

  • Has duplicate Bug #16290: Rudder-jetty can not start on debian9 added
Actions #13

Updated by François ARMAND almost 5 years ago

  • Has duplicate Bug #16126: Webapp on 6.0 on Ubutun16 doesn't start/is stuck added
Actions #14

Updated by Alexis Mousset almost 5 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 6.0.0 which was released today.

Actions

Also available in: Atom PDF