Project

General

Profile

Actions

Bug #15982

closed

Webapp does not start when only one proc is available (deadlock)

Added by François ARMAND about 5 years ago. Updated about 5 years ago.

Status:
Released
Priority:
N/A
Category:
Performance and scalability
Target version:
Severity:
UX impact:
User visibility:
Effort required:
Priority:
0
Name check:
To do
Fix check:
To do
Regression:

Description

With the current 6.0.0-beta, the webapp doesn't start if there is only one proc available for the JVM.

This is due to a thread deadlock in the main thread:

   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(java.base@11.0.4/Native Method)
        - waiting on <0x00000000eba139e8> (a zio.internal.OneShot)
        at java.lang.Object.wait(java.base@11.0.4/Object.java:328)
        at zio.internal.OneShot.get(OneShot.scala:79)
        - waiting to re-lock in wait() <0x00000000eba139e8> (a zio.internal.OneShot)
        at zio.Runtime.unsafeRunSync(Runtime.scala:70)
        at zio.Runtime.unsafeRunSync$(Runtime.scala:65)
        at com.normation.zio$ZioRuntime$$anon$2.unsafeRunSync(ZioCommons.scala:312)
        at com.normation.zio$ZioRuntime$.runNow(ZioCommons.scala:336)
        at com.normation.zio$UnsafeRun.runNow(ZioCommons.scala:361)
        at bootstrap.liftweb.Boot.initPlugins(Boot.scala:531)
        at bootstrap.liftweb.Boot.boot(Boot.scala:248)
   ...

Several other thread are also in an unsafeRunSync part.


Related issues 1 (0 open1 closed)

Related to Rudder - Bug #16291: Deadlock when application startsReleasedNicolas CHARLESActions
Actions #1

Updated by François ARMAND about 5 years ago

  • Status changed from New to In progress
Actions #2

Updated by François ARMAND about 5 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/2535
Actions #3

Updated by François ARMAND about 5 years ago

So, the problem seems to be that cat and zio don't collaborate well when cats is not managed through zio.interop.

To explain the problem: with the current configuration, we only have one thread pool for the whole Future/cats/zio runtime because we told ZIO runtime to also use scala global scheduler and we have a config with only one proc => if the thread block the proc for the threadpool, all other threads are blocked.

If we only have zio things, everything goes well as long as we mark "blocking" things that can block (fair enough).
But if we also have one cat things, we get a deadlock - I don't understand why. I understand that the cats operation get executed on scala global scheduler, and that it use the thread and so no other zio operation can get executed. But I don't understand why it doesn't finish and we get a dead lock.

You can test by replacing the executionTask of FindNewReportsExecution by:

  val executeTask: Long => Box[Unit]  = (processId:Long) => net.liftweb.common.Full(1) // tryo{reportsExecutionService.findAndSaveExecutions(processId)}

=> boot succeeds

I wasn't able to reproduce the problem on a toy sample: if a cats operation get the thread pool time, other zio op are well stopped (as expected) but the does not forbid the cats op to terminate, and then to execute zio operations.

So, the proposed correction split appart zio threadpool and cats one. So we don't reach the strange case above.

Last thought: perhaps the problem is with the lift `net.liftweb.actor.LAScheduler`. When we have the deadlock, we have both a cats and a zio operation:

"pool-6-thread-2" #106 prio=5 os_prio=0 cpu=98.86ms elapsed=14.54s tid=0x00007fda48009800 nid=0x5a7ff in Object.wait()  [0x00007fd9ea4f9000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(java.base@11.0.4/Native Method)
        - waiting on <0x00000000ec51b008> (a zio.internal.OneShot)
        at java.lang.Object.wait(java.base@11.0.4/Object.java:328)
        at zio.internal.OneShot.get(OneShot.scala:79)
        - waiting to re-lock in wait() <0x00000000ec51b008> (a zio.internal.OneShot)
        at zio.Runtime.unsafeRunSync(Runtime.scala:70)
        at zio.Runtime.unsafeRunSync$(Runtime.scala:65)
        at com.normation.zio$ZioRuntime$$anon$2.unsafeRunSync(ZioCommons.scala:319)
        at com.normation.zio$ZioRuntime$.runNow(ZioCommons.scala:343)
        at com.normation.box$IOToBox.toBox(ZioCommons.scala:392)
        at com.normation.rudder.services.nodes.NodeInfoServiceCached.getLDAPNodeInfo(NodeInfoService.scala:546)
        at com.normation.rudder.services.nodes.NodeInfoServiceCached.getLDAPNodeInfo$(NodeInfoService.scala:528)
        at com.normation.rudder.services.nodes.NodeInfoServiceCachedImpl.getLDAPNodeInfo(NodeInfoService.scala:601)
        at com.normation.rudder.services.queries.AcceptedNodesLDAPQueryProcessor.$anonfun$queryAndChekNodeId$3(LdapQueryProcessor.scala:179)
        at com.normation.rudder.services.queries.AcceptedNodesLDAPQueryProcessor$$Lambda$1682/0x0000000100e9b040.apply(Unknown Source)
        at net.liftweb.common.Full.flatMap(Box.scala:856)
        at com.normation.rudder.services.queries.AcceptedNodesLDAPQueryProcessor.queryAndChekNodeId(LdapQueryProcessor.scala:176)
        at com.normation.rudder.services.queries.AcceptedNodesLDAPQueryProcessor.processOnlyId(LdapQueryProcessor.scala:231)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.$anonfun$computeDynGroup$4(DynGroupUpdaterService.scala:100)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl$$Lambda$1519/0x0000000100e0d840.apply(Unknown Source)
        at net.liftweb.common.Full.flatMap(Box.scala:856)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.$anonfun$computeDynGroup$2(DynGroupUpdaterService.scala:99)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl$$Lambda$1517/0x0000000100e0c840.apply(Unknown Source)
        at net.liftweb.common.Full.flatMap(Box.scala:856)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.computeDynGroup(DynGroupUpdaterService.scala:97)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.$anonfun$update$2(DynGroupUpdaterService.scala:132)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl$$Lambda$1515/0x0000000100e0a040.apply(Unknown Source)
        at net.liftweb.common.Full.flatMap(Box.scala:856)
        at net.liftweb.common.Box$WithFilter.flatMap(Box.scala:484)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.update(DynGroupUpdaterService.scala:131)
        at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2.$anonfun$applyOrElse$14(UpdateDynamicGroups.scala:257)
        at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2.$anonfun$applyOrElse$14$adapted(UpdateDynamicGroups.scala:255)
        at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2$$Lambda$1499/0x0000000100df8040.apply(Unknown Source)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
        at scala.collection.TraversableLike$$Lambda$175/0x000000010053d040.apply(Unknown Source)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.TraversableLike.map(TraversableLike.scala:238)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
        at scala.collection.immutable.List.map(List.scala:298)
        at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2.applyOrElse(UpdateDynamicGroups.scala:255)
        at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2.applyOrElse(UpdateDynamicGroups.scala:246)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
        at net.liftweb.actor.SpecializedLiftActor.execTranslate(LiftActor.scala:338)
        at net.liftweb.actor.SpecializedLiftActor.execTranslate$(LiftActor.scala:338)
        at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$.execTranslate(UpdateDynamicGroups.scala:244)
        at net.liftweb.actor.SpecializedLiftActor.proc2(LiftActor.scala:304)
        at net.liftweb.actor.SpecializedLiftActor.$anonfun$processMailbox$1(LiftActor.scala:226)
        at net.liftweb.actor.SpecializedLiftActor$$Lambda$1380/0x0000000100d8f040.apply$mcV$sp(Unknown Source)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at net.liftweb.actor.SpecializedLiftActor.around(LiftActor.scala:240)
        at net.liftweb.actor.SpecializedLiftActor.around$(LiftActor.scala:239)
        at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$.around(UpdateDynamicGroups.scala:244)
        at net.liftweb.actor.SpecializedLiftActor.processMailbox(LiftActor.scala:226)
        at net.liftweb.actor.SpecializedLiftActor.$anonfun$$bang$3(LiftActor.scala:189)
        at net.liftweb.actor.SpecializedLiftActor$$Lambda$1379/0x0000000100d8e840.apply$mcV$sp(Unknown Source)
        at net.liftweb.actor.LAScheduler$$anon$2$$anon$3.run(LiftActor.scala:76)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.4/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.4/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.4/Thread.java:834)

   Locked ownable synchronizers:
        - <0x00000000c486ade8> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"pool-6-thread-10" #206 prio=5 os_prio=0 cpu=1.38ms elapsed=4.79s tid=0x00007fd9e4114000 nid=0x5a863 waiting on condition  [0x00007fda21bf1000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.4/Native Method)
        - parking to wait for  <0x00000000ec677758> (a cats.effect.internals.IOPlatform$OneShotLatch)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.4/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.4/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.4/AbstractQueuedSynchronizer.java:1039)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.4/AbstractQueuedSynchronizer.java:1345)
        at cats.effect.internals.IOPlatform$.$anonfun$unsafeResync$2(IOPlatform.scala:51)
        at cats.effect.internals.IOPlatform$$$Lambda$1634/0x0000000100e75040.apply$mcV$sp(Unknown Source)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:57)
        at scala.concurrent.package$.blocking(package.scala:146)
        at cats.effect.internals.IOPlatform$.unsafeResync(IOPlatform.scala:51)
        at cats.effect.IO.unsafeRunTimed(IO.scala:324)
        at cats.effect.IO.unsafeRunSync(IO.scala:239)
        at com.normation.rudder.db.Doobie.transactRunBox(Doobie.scala:100)
        at com.normation.rudder.repository.jdbc.ReportsJdbcRepository.getReportsfromId(ReportsJdbcRepository.scala:431)
        at com.normation.rudder.reports.execution.ReportsExecutionService.findAndSaveExecutions(ReportsExecutionService.scala:125)
        at com.normation.rudder.batch.FindNewReportsExecution.$anonfun$executeTask$2(FetchNodeExecutions.scala:62)
        at com.normation.rudder.batch.FindNewReportsExecution$$Lambda$1830/0x0000000100f12c40.apply$mcV$sp(Unknown Source)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at net.liftweb.common.Tryo.tryo(Tryo.scala:22)
        at net.liftweb.common.Tryo.tryo$(Tryo.scala:20)
        at net.liftweb.util.Helpers$.tryo(Helpers.scala:34)
        at net.liftweb.common.Tryo.tryo(Tryo.scala:60)
        at net.liftweb.common.Tryo.tryo$(Tryo.scala:60)
        at net.liftweb.util.Helpers$.tryo(Helpers.scala:34)
        at com.normation.rudder.batch.FindNewReportsExecution.$anonfun$executeTask$1(FetchNodeExecutions.scala:62)
        at com.normation.rudder.batch.FindNewReportsExecution.$anonfun$executeTask$1$adapted(FetchNodeExecutions.scala:62)
        at com.normation.rudder.batch.FindNewReportsExecution$$Lambda$1806/0x0000000100efd040.apply(Unknown Source)
        at com.normation.rudder.batch.AbstractScheduler$StatusManager$TaskProcessor$$anonfun$messageHandler$2.applyOrElse(AbstractScheduler.scala:198)
        at com.normation.rudder.batch.AbstractScheduler$StatusManager$TaskProcessor$$anonfun$messageHandler$2.applyOrElse(AbstractScheduler.scala:191)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
        at net.liftweb.actor.SpecializedLiftActor.execTranslate(LiftActor.scala:338)
        at net.liftweb.actor.SpecializedLiftActor.execTranslate$(LiftActor.scala:338)
        at com.normation.rudder.batch.AbstractScheduler$StatusManager$TaskProcessor$.execTranslate(AbstractScheduler.scala:189)
        at net.liftweb.actor.SpecializedLiftActor.proc2(LiftActor.scala:304)
        at net.liftweb.actor.SpecializedLiftActor.$anonfun$processMailbox$1(LiftActor.scala:226)
        at net.liftweb.actor.SpecializedLiftActor$$Lambda$1380/0x0000000100d8f040.apply$mcV$sp(Unknown Source)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at net.liftweb.actor.SpecializedLiftActor.around(LiftActor.scala:240)
        at net.liftweb.actor.SpecializedLiftActor.around$(LiftActor.scala:239)
        at com.normation.rudder.batch.AbstractScheduler$StatusManager$TaskProcessor$.around(AbstractScheduler.scala:189)
        at net.liftweb.actor.SpecializedLiftActor.processMailbox(LiftActor.scala:226)
        at net.liftweb.actor.SpecializedLiftActor.$anonfun$$bang$3(LiftActor.scala:189)
        at net.liftweb.actor.SpecializedLiftActor$$Lambda$1379/0x0000000100d8e840.apply$mcV$sp(Unknown Source)
        at net.liftweb.actor.LAScheduler$$anon$2$$anon$3.run(LiftActor.scala:76)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.4/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.4/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.4/Thread.java:834)

   Locked ownable synchronizers:
        - <0x00000000ec677358> (a java.util.concurrent.ThreadPoolExecutor$Worker)

But that doesn't tell more why.

Actions #4

Updated by François ARMAND about 5 years ago

  • Status changed from Pending technical review to Pending release
Actions #5

Updated by Vincent MEMBRÉ about 5 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 6.0.0~beta1 which was released today.

Actions #6

Updated by François ARMAND almost 5 years ago

  • Related to Bug #16291: Deadlock when application starts added
Actions

Also available in: Atom PDF