Project

General

Profile

Actions

Bug #15324

closed

Webapp fails to boot (several time)

Added by François ARMAND over 4 years ago. Updated over 4 years ago.

Status:
Released
Priority:
N/A
Category:
System integration
Target version:
Severity:
Critical - prevents main use of Rudder | no workaround | data loss | security
UX impact:
User visibility:
Getting started - demo | first install | Technique editor and level 1 Techniques
Effort required:
Priority:
90
Name check:
To do
Fix check:
To do
Regression:

Description

When we install a new rudder server, it need to restart jetty 7 times before being successfull.

Here are the log:

Boot 1: suicide

First boot fails because LDAP is not ready and rudder kill itself. This seems legit.

[2019-07-26 09:49:23] ERROR scheduledJob - report logger could not fetch latest 100 non compliant reports, retry on next run: Could not get node information from database
[2019-07-26 09:49:23] INFO  bootchecks - LDAP and PostgreSQL connection are OK
[2019-07-26 09:49:23] INFO  bootchecks - Check mandatory DIT entries
[2019-07-26 09:49:23] ERROR bootchecks - There is some required entries missing in the LDAP directory: ou=Pending Inventories, ou=Inventories, cn=rudder-configuration | ou=Inventories, cn=rudder-configuration | ou=Accepted Inventories, ou=Inventories, cn=rudder-configuration | ou=Inventories, cn=rudder-configuration | ou=Removed Inventories, ou=Inventories, cn=rudder-configuration | ou=Inventories, cn=rudder-configuration | ou=Rudder, cn=rudder-configuration
[2019-07-26 09:49:23] ERROR application - Fatal error during boot, Rudder will stop now
javax.servlet.UnavailableException: There is some required entries missing in the LDAP directory: ou=Pending Inventories, ou=Inventories, cn=rudder-configuration | ou=Inventories, cn=rudder-configuration | ou=Accepted Inventories, ou=Inventories, cn=rudder-configuration | ou=Inventories, cn=rudder-configuration | ou=Removed Inventories, ou=Inventories, cn=rudder-configuration | ou=Inventories, cn=rudder-configuration | ou=Rudder, cn=rudder-configuration
        at bootstrap.liftweb.checks.CheckDIT.FAIL$1(CheckDIT.scala:77)
        at bootstrap.liftweb.checks.CheckDIT.checks(CheckDIT.scala:106)
        at bootstrap.liftweb.SequentialImmediateBootStrapChecks.$anonfun$checks$1(BootstrapChecks.scala:99)
        at bootstrap.liftweb.SequentialImmediateBootStrapChecks.$anonfun$checks$1$adapted(BootstrapChecks.scala:91)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:58)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:51)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at bootstrap.liftweb.SequentialImmediateBootStrapChecks.checks(BootstrapChecks.scala:91)
        at bootstrap.liftweb.RudderConfig$.init(RudderConfig.scala:860)
        at bootstrap.liftweb.LiftInitContextListener.contextInitialized(LiftInitContextListener.scala:109)
        at org.eclipse.jetty.server.handler.ContextHandler.callContextInitialized(ContextHandler.java:953)
        at org.eclipse.jetty.servlet.ServletContextHandler.callContextInitialized(ServletContextHandler.java:558)
        at org.eclipse.jetty.server.handler.ContextHandler.startContext(ContextHandler.java:918)
        at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:370)
        at org.eclipse.jetty.webapp.WebAppContext.startWebapp(WebAppContext.java:1497)
        at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1459)

Boot 2: stall

In second boot, rudder fails to start after 1 minute and systemd kills it:

Jul 26 09:49:33 server systemd[1]: Starting Jetty Web Application Server...
.....
Jul 26 09:50:34 server rudder-jetty.sh[12515]: Starting Jetty: . . . . . . . . . . . . . . . FAILED Fri Jul 26 09:50:34 UTC 2019
Jul 26 09:50:34 server systemd[1]: rudder-jetty.service: Control process exited, code=exited status=1
Jul 26 09:50:34 server systemd[1]: Failed to start Jetty Web Application Server.
Jul 26 09:50:34 server systemd[1]: rudder-jetty.service: Unit entered failed state.
Jul 26 09:50:34 server systemd[1]: rudder-jetty.service: Failed with result 'exit-code'.

In webapp logs, the last lines are:

[2019-07-26 09:49:51] WARN  scheduledJob - There is no reports to log
[2019-07-26 09:49:51] WARN  com.normation.rudder.repository.jdbc.RudderPropertiesRepositoryImpl - last id not present in database, create it with value 0

Boot 3 - stall

Same than previsous, but elsewhere. Rudder fails to start in 1 minutes:

Jul 26 09:50:35 server systemd[1]: Starting Jetty Web Application Server...
...
Jul 26 09:51:35 server rudder-jetty.sh[12961]: Starting Jetty: . . . . . . . . . . . . . . . FAILED Fri Jul 26 09:51:35 UTC 2019
Jul 26 09:51:35 server systemd[1]: Failed to start Jetty Web Application Server.

But app seems to block elsewhere (we get at least one generation, and errors on bad inventories):

[2019-07-26 09:50:57] ERROR com.normation.rudder.services.policies.WriteNodeCertificatesPemImpl - SystemError: An error occured; cause was: /var/rudder/lib/ssl/allnodecerts.pem.new
[2019-07-26 09:50:58] ERROR com.normation.rudder.domain.nodes.NodeInfo - Error when trying to get the CFEngine-MD5 digest of CFEngine public key for node 'server.rudder.local' (root) <- unable to decode base64 string: invalid characters encountered in base64 data
[2019-07-26 09:50:58] ERROR com.normation.rudder.domain.nodes.NodeInfo - Error when trying to get the sha-256 digest of CFEngine public key for node 'server.rudder.local' (root) <- unable to decode base64 string: invalid characters encountered in base64 data
...
[2019-07-26 09:51:01] INFO  policy.generation - Successful policy update '1' [started 2019-07-26 09:50:56 - ended 2019-07-26 09:51:01]

Boot 4 - stall

Again, app killed after 1min

Jul 26 09:51:36 server systemd[1]: Starting Jetty Web Application Server...
...

Jul 26 09:52:36 server rudder-jetty.sh[14178]: Starting Jetty: . . . . . . . . . . . . . . . FAILED Fri Jul 26 09:52:36 UTC 2019
Jul 26 09:52:36 server systemd[1]: rudder-jetty.service: Control process exited, code=exited status=1
Jul 26 09:52:37 server systemd[1]: Failed to start Jetty Web Application Server.
Jul 26 09:52:37 server systemd[1]: rudder-jetty.service: Unit entered failed state.

The last log are:

[2019-07-26 09:51:53] INFO  bootchecks - No full archive of configuration-repository items seems to have been done, initialising the system with one
[2019-07-26 09:51:54] INFO  migration - No migration detected by migration script (table 'migrationeventlog' is empty or does not exist)
[2019-07-26 09:51:55] INFO  com.normation.rudder.rule.category.GitRuleCategoryArchiverImpl - Create a new archive: GitPath(archives/configurations-rules/2019-07-26_09-51-55.058)

Boot 5,6,7 - stalling again, again, and finally works

Here, we have 3 minutes before the next restart:

Jul 26 09:55:41 server systemd[1]: Starting Jetty Web Application Server...

Jul 26 09:56:41 server rudder-jetty.sh[15847]: Starting Jetty: . . . . . . . . . . . . . . . FAILED Fri Jul 26 09:56:41 UTC 2019
Jul 26 09:56:41 server systemd[1]: rudder-jetty.service: Control process exited, code=exited status=1
Jul 26 09:56:42 server systemd[1]: Failed to start Jetty Web Application Server.
Jul 26 09:56:42 server systemd[1]: rudder-jetty.service: Unit entered failed state.
Jul 26 09:56:42 server systemd[1]: rudder-jetty.service: Failed with result 'exit-code'.

Things happen, but the last bootbecks is again about archive:

[2019-07-26 09:55:58] INFO  bootchecks - No full archive of configuration-repository items seems to have been done, initialising the system with one

The next one is exactly the same.

And the last one finally succeed with the archive.

[2019-07-26 10:00:58] INFO  bootchecks - No full archive of configuration-repository items seems to have been done, initialising the system with one
[2019-07-26 10:00:58] DEBUG com.normation.rudder.rule.category.GitRuleCategoryArchiverImpl - Archived Rule category: /var/rudder/configuration-repository/ruleCategories/category.xml
[2019-07-26 10:00:59] INFO  com.normation.rudder.rule.category.GitRuleCategoryArchiverImpl - Create a new archive: GitPath(archives/configurations-rules/2019-07-26_10-00-59.360)
[2019-07-26 10:00:59] DEBUG com.normation.rudder.repository.xml.GitRuleArchiverImpl - Archived rule: /var/rudder/configuration-repository/rules/32377fd7-02fd-43d0-aab7-28460a91347b.xml
[2019-07-26 10:00:59] INFO  com.normation.rudder.repository.xml.GitRuleArchiverImpl - Create a new archive: GitPath(archives/configurations-rules/2019-07-26_10-00-59.768)
[2019-07-26 10:01:01] INFO  com.normation.rudder.repository.xml.GitActiveTechniqueCategoryArchiverImpl - Create a new archive: GitPath(archives/directives/2019-07-26_10-01-01.400)
.....
[2019-07-26 10:01:01] DEBUG com.normation.rudder.repository.xml.GitNodeGroupArchiverImpl - Archived node group category: /var/rudder/configuration-repository/groups/category.xml
[2019-07-26 10:01:01] INFO  com.normation.rudder.repository.xml.GitNodeGroupArchiverImpl - Create a new archive: GitPath(archives/groups/2019-07-26_10-01-01.660)
[2019-07-26 10:01:01] DEBUG com.normation.rudder.repository.xml.GitParameterArchiverImpl - Archived parameter: /var/rudder/configuration-repository/parameters/rudder_file_edit_header.xml
[2019-07-26 10:01:01] INFO  com.normation.rudder.repository.xml.GitParameterArchiverImpl - Create a new archive: GitPath(archives/parameters/2019-07-26_10-01-01.794)
[2019-07-26 10:01:02] INFO  com.normation.rudder.repository.xml.ItemArchiveManagerImpl - Create a new archive: GitPath(archives/full/2019-07-26_10-01-02.108)
[2019-07-26 10:01:02] INFO  bootchecks - First full archive of configuration-repository items done
...

Subtasks 3 (0 open3 closed)

Bug #15376: Forget to fork backend batchReleasedNicolas CHARLESActions
Bug #15381: Add more info on LDAP backend errorReleasedVincent MEMBRÉActions
Bug #15383: Unhandled error if the LDAP connection is unavailableReleasedVincent MEMBRÉActions
Actions #1

Updated by François ARMAND over 4 years ago

  • Status changed from New to In progress
Actions #2

Updated by François ARMAND over 4 years ago

So, it seems that the blocking point is that boocheck:

[2019-07-31 13:40:21] INFO  bootchecks - No full archive of configuration-repository items seems to have been done, initialising the system with one

Afterward, threads accumulate and one of these two seems to be faulty (actually the second should have been called by the first):

"zio-default-blocking-10-2118413714" #87 daemon prio=5 os_prio=0 tid=0x00007f4b341c3800 nid=0x2c11 in Object.wait() [0x00007f4b27b7b000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at zio.internal.OneShot.get(OneShot.scala:79)
        - locked <0x00000000c7b93c08> (a zio.internal.OneShot)
        at zio.Runtime.unsafeRunSync(Runtime.scala:70)
        at zio.Runtime.unsafeRunSync$(Runtime.scala:65)
        at com.normation.zio$ZioRuntime$$anon$2.unsafeRunSync(ZioCommons.scala:291)
        at com.normation.zio$ZioRuntime$.runNow(ZioCommons.scala:305)
        at com.normation.box$IOToBox.toBox(ZioCommons.scala:344)
        at bootstrap.liftweb.checks.CheckInitXmlExport.checks(CheckInitXmlExport.scala:72)
        at bootstrap.liftweb.SequentialImmediateBootStrapChecks.$anonfun$checks$1(BootstrapChecks.scala:99)
        at bootstrap.liftweb.SequentialImmediateBootStrapChecks.$anonfun$checks$1$adapted(BootstrapChecks.scala:91)
        at bootstrap.liftweb.SequentialImmediateBootStrapChecks$$Lambda$2109/108232476.apply(Unknown Source)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at bootstrap.liftweb.SequentialImmediateBootStrapChecks.checks(BootstrapChecks.scala:91)
        at bootstrap.liftweb.RudderConfig$.$anonfun$init$2(RudderConfig.scala:864)
        at bootstrap.liftweb.RudderConfig$$$Lambda$2069/546271589.apply$mcV$sp(Unknown Source)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at zio.blocking.Blocking$Service.$anonfun$effectBlocking$5(Blocking.scala:133)
        at zio.blocking.Blocking$Service$$Lambda$427/800470253.apply(Unknown Source)
        at zio.internal.FiberContext.evaluateNow(FiberContext.scala:332)
        at zio.internal.FiberContext.$anonfun$evaluateLater$1(FiberContext.scala:592)
        at zio.internal.FiberContext$$Lambda$423/1775525066.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
"zio-default-blocking-16-2118413714" #309 daemon prio=5 os_prio=0 tid=0x00007f4b620e5000 nid=0x2cf9 in Object.wait() [0x00007f4b25469000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at zio.internal.OneShot.get(OneShot.scala:79)
        - locked <0x00000000c7b93380> (a zio.internal.OneShot)
        at zio.Runtime.unsafeRunSync(Runtime.scala:70)
        at zio.Runtime.unsafeRunSync$(Runtime.scala:65)
        at com.normation.zio$ZioRuntime$$anon$2.unsafeRunSync(ZioCommons.scala:291)
        at com.normation.zio$ZioRuntime$.runNow(ZioCommons.scala:305)
        at com.normation.rudder.repository.xml.GitArchiverUtils.getRootDirectory(GitArchiverUtils.scala:107)
        at com.normation.rudder.repository.xml.GitArchiverUtils.getRootDirectory$(GitArchiverUtils.scala:105)
        at com.normation.rudder.rule.category.GitRuleCategoryArchiverImpl.getRootDirectory$lzycompute(GitRuleCategoryArchiver.scala:120)
        - locked <0x00000000c7099da0> (a com.normation.rudder.rule.category.GitRuleCategoryArchiverImpl)
        at com.normation.rudder.rule.category.GitRuleCategoryArchiverImpl.getRootDirectory(GitRuleCategoryArchiver.scala:120)
        at com.normation.rudder.repository.xml.ItemArchiveManagerImpl.$anonfun$exportRuleCategories$2(ItemArchiveManagerImpl.scala:140)
        at com.normation.rudder.repository.xml.ItemArchiveManagerImpl$$Lambda$2656/502585367.apply(Unknown Source)
        at zio.internal.FiberContext.evaluateNow(FiberContext.scala:698)
        at zio.internal.FiberContext.$anonfun$evaluateLater$1(FiberContext.scala:592)
        at zio.internal.FiberContext$$Lambda$423/1775525066.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Actions #3

Updated by François ARMAND over 4 years ago

Reproduced with:

object GitArchiveUtilMain {
  val GIT_ROOT = "/tmp/rudder-test/gitroot" 
  def main(args: Array[String]): Unit = {
    println("Start")
    val util = new GitArchiverUtils {
      override def gitRepo: GitRepositoryProvider = new GitRepositoryProviderImpl(GIT_ROOT)
      override def gitRootDirectory: File = new File(GIT_ROOT)
      override def relativePath: String = "/" 
      override def xmlPrettyPrinter: RudderPrettyPrinter = null
      override def encoding: String = null
      override def gitModificationRepository: GitModificationRepository = new GitModificationRepository {
        override def getCommits(modificationId: ModificationId): IOResult[Option[GitCommitId]] = None.succeed
        override def addCommit(commit: GitCommitId, modId: ModificationId): IOResult[DB.GitCommitJoin] = DB.GitCommitJoin(commit, modId).succeed
      }
      override def loggerName: String = "my-logger-logger" 
    }
    println("util created")
    println("getRootDirectory: " + util.getRootDirectory.getAbsolutePath)
    // never reach that.
    println("End")
  }
}

Actions #4

Updated by François ARMAND over 4 years ago

So, the problem is with the lazy val for getRootDirectory.
The lock on object may cause something strange with ZIO

Actions #6

Updated by François ARMAND over 4 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/2355
Actions #7

Updated by Nicolas CHARLES over 4 years ago

  • Assignee changed from Nicolas CHARLES to François ARMAND
Actions #8

Updated by François ARMAND over 4 years ago

  • Status changed from Pending technical review to Pending release
Actions #9

Updated by Vincent MEMBRÉ over 4 years ago

  • Status changed from Pending release to Released
  • Priority changed from 47 to 90

This bug has been fixed in Rudder 6.0.0~beta1 which was released today.

Actions

Also available in: Atom PDF