Project

General

Profile

Bug #15982

Webapp does not start when only one proc is available (deadlock)

Added by François ARMAND about 1 month ago. Updated 15 days ago.

Status:
Released
Priority:
N/A
Category:
Performance and scalability
Target version:
Severity:
User visibility:
Effort required:
Priority:
0

Description

With the current 6.0.0-beta, the webapp doesn't start if there is only one proc available for the JVM.

This is due to a thread deadlock in the main thread:

   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(java.base@11.0.4/Native Method)
        - waiting on <0x00000000eba139e8> (a zio.internal.OneShot)
        at java.lang.Object.wait(java.base@11.0.4/Object.java:328)
        at zio.internal.OneShot.get(OneShot.scala:79)
        - waiting to re-lock in wait() <0x00000000eba139e8> (a zio.internal.OneShot)
        at zio.Runtime.unsafeRunSync(Runtime.scala:70)
        at zio.Runtime.unsafeRunSync$(Runtime.scala:65)
        at com.normation.zio$ZioRuntime$$anon$2.unsafeRunSync(ZioCommons.scala:312)
        at com.normation.zio$ZioRuntime$.runNow(ZioCommons.scala:336)
        at com.normation.zio$UnsafeRun.runNow(ZioCommons.scala:361)
        at bootstrap.liftweb.Boot.initPlugins(Boot.scala:531)
        at bootstrap.liftweb.Boot.boot(Boot.scala:248)
   ...

Several other thread are also in an unsafeRunSync part.

Associated revisions

Revision 5b463273 (diff)
Added by François ARMAND about 1 month ago

Fixes #15982: Webapp does not start when only one proc is available (deadlock)

History

#1

Updated by François ARMAND about 1 month ago

  • Status changed from New to In progress
#2

Updated by François ARMAND about 1 month ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/2535
#3

Updated by François ARMAND about 1 month ago

So, the problem seems to be that cat and zio don't collaborate well when cats is not managed through zio.interop.

To explain the problem: with the current configuration, we only have one thread pool for the whole Future/cats/zio runtime because we told ZIO runtime to also use scala global scheduler and we have a config with only one proc => if the thread block the proc for the threadpool, all other threads are blocked.

If we only have zio things, everything goes well as long as we mark "blocking" things that can block (fair enough).
But if we also have one cat things, we get a deadlock - I don't understand why. I understand that the cats operation get executed on scala global scheduler, and that it use the thread and so no other zio operation can get executed. But I don't understand why it doesn't finish and we get a dead lock.

You can test by replacing the executionTask of FindNewReportsExecution by:

  val executeTask: Long => Box[Unit]  = (processId:Long) => net.liftweb.common.Full(1) // tryo{reportsExecutionService.findAndSaveExecutions(processId)}

=> boot succeeds

I wasn't able to reproduce the problem on a toy sample: if a cats operation get the thread pool time, other zio op are well stopped (as expected) but the does not forbid the cats op to terminate, and then to execute zio operations.

So, the proposed correction split appart zio threadpool and cats one. So we don't reach the strange case above.

Last thought: perhaps the problem is with the lift `net.liftweb.actor.LAScheduler`. When we have the deadlock, we have both a cats and a zio operation:

"pool-6-thread-2" #106 prio=5 os_prio=0 cpu=98.86ms elapsed=14.54s tid=0x00007fda48009800 nid=0x5a7ff in Object.wait()  [0x00007fd9ea4f9000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(java.base@11.0.4/Native Method)
        - waiting on <0x00000000ec51b008> (a zio.internal.OneShot)
        at java.lang.Object.wait(java.base@11.0.4/Object.java:328)
        at zio.internal.OneShot.get(OneShot.scala:79)
        - waiting to re-lock in wait() <0x00000000ec51b008> (a zio.internal.OneShot)
        at zio.Runtime.unsafeRunSync(Runtime.scala:70)
        at zio.Runtime.unsafeRunSync$(Runtime.scala:65)
        at com.normation.zio$ZioRuntime$$anon$2.unsafeRunSync(ZioCommons.scala:319)
        at com.normation.zio$ZioRuntime$.runNow(ZioCommons.scala:343)
        at com.normation.box$IOToBox.toBox(ZioCommons.scala:392)
        at com.normation.rudder.services.nodes.NodeInfoServiceCached.getLDAPNodeInfo(NodeInfoService.scala:546)
        at com.normation.rudder.services.nodes.NodeInfoServiceCached.getLDAPNodeInfo$(NodeInfoService.scala:528)
        at com.normation.rudder.services.nodes.NodeInfoServiceCachedImpl.getLDAPNodeInfo(NodeInfoService.scala:601)
        at com.normation.rudder.services.queries.AcceptedNodesLDAPQueryProcessor.$anonfun$queryAndChekNodeId$3(LdapQueryProcessor.scala:179)
        at com.normation.rudder.services.queries.AcceptedNodesLDAPQueryProcessor$$Lambda$1682/0x0000000100e9b040.apply(Unknown Source)
        at net.liftweb.common.Full.flatMap(Box.scala:856)
        at com.normation.rudder.services.queries.AcceptedNodesLDAPQueryProcessor.queryAndChekNodeId(LdapQueryProcessor.scala:176)
        at com.normation.rudder.services.queries.AcceptedNodesLDAPQueryProcessor.processOnlyId(LdapQueryProcessor.scala:231)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.$anonfun$computeDynGroup$4(DynGroupUpdaterService.scala:100)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl$$Lambda$1519/0x0000000100e0d840.apply(Unknown Source)
        at net.liftweb.common.Full.flatMap(Box.scala:856)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.$anonfun$computeDynGroup$2(DynGroupUpdaterService.scala:99)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl$$Lambda$1517/0x0000000100e0c840.apply(Unknown Source)
        at net.liftweb.common.Full.flatMap(Box.scala:856)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.computeDynGroup(DynGroupUpdaterService.scala:97)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.$anonfun$update$2(DynGroupUpdaterService.scala:132)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl$$Lambda$1515/0x0000000100e0a040.apply(Unknown Source)
        at net.liftweb.common.Full.flatMap(Box.scala:856)
        at net.liftweb.common.Box$WithFilter.flatMap(Box.scala:484)
        at com.normation.rudder.services.queries.DynGroupUpdaterServiceImpl.update(DynGroupUpdaterService.scala:131)
        at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2.$anonfun$applyOrElse$14(UpdateDynamicGroups.scala:257)
        at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2.$anonfun$applyOrElse$14$adapted(UpdateDynamicGroups.scala:255)
        at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2$$Lambda$1499/0x0000000100df8040.apply(Unknown Source)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
        at scala.collection.TraversableLike$$Lambda$175/0x000000010053d040.apply(Unknown Source)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.TraversableLike.map(TraversableLike.scala:238)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
        at scala.collection.immutable.List.map(List.scala:298)
        at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2.applyOrElse(UpdateDynamicGroups.scala:255)
        at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$$anonfun$messageHandler$2.applyOrElse(UpdateDynamicGroups.scala:246)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
        at net.liftweb.actor.SpecializedLiftActor.execTranslate(LiftActor.scala:338)
        at net.liftweb.actor.SpecializedLiftActor.execTranslate$(LiftActor.scala:338)
        at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$.execTranslate(UpdateDynamicGroups.scala:244)
        at net.liftweb.actor.SpecializedLiftActor.proc2(LiftActor.scala:304)
        at net.liftweb.actor.SpecializedLiftActor.$anonfun$processMailbox$1(LiftActor.scala:226)
        at net.liftweb.actor.SpecializedLiftActor$$Lambda$1380/0x0000000100d8f040.apply$mcV$sp(Unknown Source)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at net.liftweb.actor.SpecializedLiftActor.around(LiftActor.scala:240)
        at net.liftweb.actor.SpecializedLiftActor.around$(LiftActor.scala:239)
        at com.normation.rudder.batch.UpdateDynamicGroups$LAUpdateDyngroupManager$LAUpdateDyngroup$.around(UpdateDynamicGroups.scala:244)
        at net.liftweb.actor.SpecializedLiftActor.processMailbox(LiftActor.scala:226)
        at net.liftweb.actor.SpecializedLiftActor.$anonfun$$bang$3(LiftActor.scala:189)
        at net.liftweb.actor.SpecializedLiftActor$$Lambda$1379/0x0000000100d8e840.apply$mcV$sp(Unknown Source)
        at net.liftweb.actor.LAScheduler$$anon$2$$anon$3.run(LiftActor.scala:76)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.4/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.4/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.4/Thread.java:834)

   Locked ownable synchronizers:
        - <0x00000000c486ade8> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"pool-6-thread-10" #206 prio=5 os_prio=0 cpu=1.38ms elapsed=4.79s tid=0x00007fd9e4114000 nid=0x5a863 waiting on condition  [0x00007fda21bf1000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.4/Native Method)
        - parking to wait for  <0x00000000ec677758> (a cats.effect.internals.IOPlatform$OneShotLatch)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.4/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.4/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.4/AbstractQueuedSynchronizer.java:1039)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.4/AbstractQueuedSynchronizer.java:1345)
        at cats.effect.internals.IOPlatform$.$anonfun$unsafeResync$2(IOPlatform.scala:51)
        at cats.effect.internals.IOPlatform$$$Lambda$1634/0x0000000100e75040.apply$mcV$sp(Unknown Source)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:57)
        at scala.concurrent.package$.blocking(package.scala:146)
        at cats.effect.internals.IOPlatform$.unsafeResync(IOPlatform.scala:51)
        at cats.effect.IO.unsafeRunTimed(IO.scala:324)
        at cats.effect.IO.unsafeRunSync(IO.scala:239)
        at com.normation.rudder.db.Doobie.transactRunBox(Doobie.scala:100)
        at com.normation.rudder.repository.jdbc.ReportsJdbcRepository.getReportsfromId(ReportsJdbcRepository.scala:431)
        at com.normation.rudder.reports.execution.ReportsExecutionService.findAndSaveExecutions(ReportsExecutionService.scala:125)
        at com.normation.rudder.batch.FindNewReportsExecution.$anonfun$executeTask$2(FetchNodeExecutions.scala:62)
        at com.normation.rudder.batch.FindNewReportsExecution$$Lambda$1830/0x0000000100f12c40.apply$mcV$sp(Unknown Source)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at net.liftweb.common.Tryo.tryo(Tryo.scala:22)
        at net.liftweb.common.Tryo.tryo$(Tryo.scala:20)
        at net.liftweb.util.Helpers$.tryo(Helpers.scala:34)
        at net.liftweb.common.Tryo.tryo(Tryo.scala:60)
        at net.liftweb.common.Tryo.tryo$(Tryo.scala:60)
        at net.liftweb.util.Helpers$.tryo(Helpers.scala:34)
        at com.normation.rudder.batch.FindNewReportsExecution.$anonfun$executeTask$1(FetchNodeExecutions.scala:62)
        at com.normation.rudder.batch.FindNewReportsExecution.$anonfun$executeTask$1$adapted(FetchNodeExecutions.scala:62)
        at com.normation.rudder.batch.FindNewReportsExecution$$Lambda$1806/0x0000000100efd040.apply(Unknown Source)
        at com.normation.rudder.batch.AbstractScheduler$StatusManager$TaskProcessor$$anonfun$messageHandler$2.applyOrElse(AbstractScheduler.scala:198)
        at com.normation.rudder.batch.AbstractScheduler$StatusManager$TaskProcessor$$anonfun$messageHandler$2.applyOrElse(AbstractScheduler.scala:191)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
        at net.liftweb.actor.SpecializedLiftActor.execTranslate(LiftActor.scala:338)
        at net.liftweb.actor.SpecializedLiftActor.execTranslate$(LiftActor.scala:338)
        at com.normation.rudder.batch.AbstractScheduler$StatusManager$TaskProcessor$.execTranslate(AbstractScheduler.scala:189)
        at net.liftweb.actor.SpecializedLiftActor.proc2(LiftActor.scala:304)
        at net.liftweb.actor.SpecializedLiftActor.$anonfun$processMailbox$1(LiftActor.scala:226)
        at net.liftweb.actor.SpecializedLiftActor$$Lambda$1380/0x0000000100d8f040.apply$mcV$sp(Unknown Source)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at net.liftweb.actor.SpecializedLiftActor.around(LiftActor.scala:240)
        at net.liftweb.actor.SpecializedLiftActor.around$(LiftActor.scala:239)
        at com.normation.rudder.batch.AbstractScheduler$StatusManager$TaskProcessor$.around(AbstractScheduler.scala:189)
        at net.liftweb.actor.SpecializedLiftActor.processMailbox(LiftActor.scala:226)
        at net.liftweb.actor.SpecializedLiftActor.$anonfun$$bang$3(LiftActor.scala:189)
        at net.liftweb.actor.SpecializedLiftActor$$Lambda$1379/0x0000000100d8e840.apply$mcV$sp(Unknown Source)
        at net.liftweb.actor.LAScheduler$$anon$2$$anon$3.run(LiftActor.scala:76)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.4/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.4/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.4/Thread.java:834)

   Locked ownable synchronizers:
        - <0x00000000ec677358> (a java.util.concurrent.ThreadPoolExecutor$Worker)

But that doesn't tell more why.

#4

Updated by François ARMAND about 1 month ago

  • Status changed from Pending technical review to Pending release
#5

Updated by Vincent MEMBRÉ 15 days ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 6.0.0~beta1 which was released today.

Also available in: Atom PDF