Project

General

Profile

Bug #15324

Webapp fails to boot (several time)

Added by François ARMAND 4 months ago. Updated 15 days ago.

Status:
Released
Priority:
N/A
Category:
System integration
Target version:
Severity:
Critical - prevents main use of Rudder | no workaround | data loss | security
User visibility:
Getting started - demo | first install | Technique editor and level 1 Techniques
Effort required:
Priority:
90

Description

When we install a new rudder server, it need to restart jetty 7 times before being successfull.

Here are the log:

Boot 1: suicide

First boot fails because LDAP is not ready and rudder kill itself. This seems legit.

[2019-07-26 09:49:23] ERROR scheduledJob - report logger could not fetch latest 100 non compliant reports, retry on next run: Could not get node information from database
[2019-07-26 09:49:23] INFO  bootchecks - LDAP and PostgreSQL connection are OK
[2019-07-26 09:49:23] INFO  bootchecks - Check mandatory DIT entries
[2019-07-26 09:49:23] ERROR bootchecks - There is some required entries missing in the LDAP directory: ou=Pending Inventories, ou=Inventories, cn=rudder-configuration | ou=Inventories, cn=rudder-configuration | ou=Accepted Inventories, ou=Inventories, cn=rudder-configuration | ou=Inventories, cn=rudder-configuration | ou=Removed Inventories, ou=Inventories, cn=rudder-configuration | ou=Inventories, cn=rudder-configuration | ou=Rudder, cn=rudder-configuration
[2019-07-26 09:49:23] ERROR application - Fatal error during boot, Rudder will stop now
javax.servlet.UnavailableException: There is some required entries missing in the LDAP directory: ou=Pending Inventories, ou=Inventories, cn=rudder-configuration | ou=Inventories, cn=rudder-configuration | ou=Accepted Inventories, ou=Inventories, cn=rudder-configuration | ou=Inventories, cn=rudder-configuration | ou=Removed Inventories, ou=Inventories, cn=rudder-configuration | ou=Inventories, cn=rudder-configuration | ou=Rudder, cn=rudder-configuration
        at bootstrap.liftweb.checks.CheckDIT.FAIL$1(CheckDIT.scala:77)
        at bootstrap.liftweb.checks.CheckDIT.checks(CheckDIT.scala:106)
        at bootstrap.liftweb.SequentialImmediateBootStrapChecks.$anonfun$checks$1(BootstrapChecks.scala:99)
        at bootstrap.liftweb.SequentialImmediateBootStrapChecks.$anonfun$checks$1$adapted(BootstrapChecks.scala:91)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:58)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:51)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at bootstrap.liftweb.SequentialImmediateBootStrapChecks.checks(BootstrapChecks.scala:91)
        at bootstrap.liftweb.RudderConfig$.init(RudderConfig.scala:860)
        at bootstrap.liftweb.LiftInitContextListener.contextInitialized(LiftInitContextListener.scala:109)
        at org.eclipse.jetty.server.handler.ContextHandler.callContextInitialized(ContextHandler.java:953)
        at org.eclipse.jetty.servlet.ServletContextHandler.callContextInitialized(ServletContextHandler.java:558)
        at org.eclipse.jetty.server.handler.ContextHandler.startContext(ContextHandler.java:918)
        at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:370)
        at org.eclipse.jetty.webapp.WebAppContext.startWebapp(WebAppContext.java:1497)
        at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1459)

Boot 2: stall

In second boot, rudder fails to start after 1 minute and systemd kills it:

Jul 26 09:49:33 server systemd[1]: Starting Jetty Web Application Server...
.....
Jul 26 09:50:34 server rudder-jetty.sh[12515]: Starting Jetty: . . . . . . . . . . . . . . . FAILED Fri Jul 26 09:50:34 UTC 2019
Jul 26 09:50:34 server systemd[1]: rudder-jetty.service: Control process exited, code=exited status=1
Jul 26 09:50:34 server systemd[1]: Failed to start Jetty Web Application Server.
Jul 26 09:50:34 server systemd[1]: rudder-jetty.service: Unit entered failed state.
Jul 26 09:50:34 server systemd[1]: rudder-jetty.service: Failed with result 'exit-code'.

In webapp logs, the last lines are:

[2019-07-26 09:49:51] WARN  scheduledJob - There is no reports to log
[2019-07-26 09:49:51] WARN  com.normation.rudder.repository.jdbc.RudderPropertiesRepositoryImpl - last id not present in database, create it with value 0

Boot 3 - stall

Same than previsous, but elsewhere. Rudder fails to start in 1 minutes:

Jul 26 09:50:35 server systemd[1]: Starting Jetty Web Application Server...
...
Jul 26 09:51:35 server rudder-jetty.sh[12961]: Starting Jetty: . . . . . . . . . . . . . . . FAILED Fri Jul 26 09:51:35 UTC 2019
Jul 26 09:51:35 server systemd[1]: Failed to start Jetty Web Application Server.

But app seems to block elsewhere (we get at least one generation, and errors on bad inventories):

[2019-07-26 09:50:57] ERROR com.normation.rudder.services.policies.WriteNodeCertificatesPemImpl - SystemError: An error occured; cause was: /var/rudder/lib/ssl/allnodecerts.pem.new
[2019-07-26 09:50:58] ERROR com.normation.rudder.domain.nodes.NodeInfo - Error when trying to get the CFEngine-MD5 digest of CFEngine public key for node 'server.rudder.local' (root) <- unable to decode base64 string: invalid characters encountered in base64 data
[2019-07-26 09:50:58] ERROR com.normation.rudder.domain.nodes.NodeInfo - Error when trying to get the sha-256 digest of CFEngine public key for node 'server.rudder.local' (root) <- unable to decode base64 string: invalid characters encountered in base64 data
...
[2019-07-26 09:51:01] INFO  policy.generation - Successful policy update '1' [started 2019-07-26 09:50:56 - ended 2019-07-26 09:51:01]

Boot 4 - stall

Again, app killed after 1min

Jul 26 09:51:36 server systemd[1]: Starting Jetty Web Application Server...
...

Jul 26 09:52:36 server rudder-jetty.sh[14178]: Starting Jetty: . . . . . . . . . . . . . . . FAILED Fri Jul 26 09:52:36 UTC 2019
Jul 26 09:52:36 server systemd[1]: rudder-jetty.service: Control process exited, code=exited status=1
Jul 26 09:52:37 server systemd[1]: Failed to start Jetty Web Application Server.
Jul 26 09:52:37 server systemd[1]: rudder-jetty.service: Unit entered failed state.

The last log are:

[2019-07-26 09:51:53] INFO  bootchecks - No full archive of configuration-repository items seems to have been done, initialising the system with one
[2019-07-26 09:51:54] INFO  migration - No migration detected by migration script (table 'migrationeventlog' is empty or does not exist)
[2019-07-26 09:51:55] INFO  com.normation.rudder.rule.category.GitRuleCategoryArchiverImpl - Create a new archive: GitPath(archives/configurations-rules/2019-07-26_09-51-55.058)

Boot 5,6,7 - stalling again, again, and finally works

Here, we have 3 minutes before the next restart:

Jul 26 09:55:41 server systemd[1]: Starting Jetty Web Application Server...

Jul 26 09:56:41 server rudder-jetty.sh[15847]: Starting Jetty: . . . . . . . . . . . . . . . FAILED Fri Jul 26 09:56:41 UTC 2019
Jul 26 09:56:41 server systemd[1]: rudder-jetty.service: Control process exited, code=exited status=1
Jul 26 09:56:42 server systemd[1]: Failed to start Jetty Web Application Server.
Jul 26 09:56:42 server systemd[1]: rudder-jetty.service: Unit entered failed state.
Jul 26 09:56:42 server systemd[1]: rudder-jetty.service: Failed with result 'exit-code'.

Things happen, but the last bootbecks is again about archive:

[2019-07-26 09:55:58] INFO  bootchecks - No full archive of configuration-repository items seems to have been done, initialising the system with one

The next one is exactly the same.

And the last one finally succeed with the archive.

[2019-07-26 10:00:58] INFO  bootchecks - No full archive of configuration-repository items seems to have been done, initialising the system with one
[2019-07-26 10:00:58] DEBUG com.normation.rudder.rule.category.GitRuleCategoryArchiverImpl - Archived Rule category: /var/rudder/configuration-repository/ruleCategories/category.xml
[2019-07-26 10:00:59] INFO  com.normation.rudder.rule.category.GitRuleCategoryArchiverImpl - Create a new archive: GitPath(archives/configurations-rules/2019-07-26_10-00-59.360)
[2019-07-26 10:00:59] DEBUG com.normation.rudder.repository.xml.GitRuleArchiverImpl - Archived rule: /var/rudder/configuration-repository/rules/32377fd7-02fd-43d0-aab7-28460a91347b.xml
[2019-07-26 10:00:59] INFO  com.normation.rudder.repository.xml.GitRuleArchiverImpl - Create a new archive: GitPath(archives/configurations-rules/2019-07-26_10-00-59.768)
[2019-07-26 10:01:01] INFO  com.normation.rudder.repository.xml.GitActiveTechniqueCategoryArchiverImpl - Create a new archive: GitPath(archives/directives/2019-07-26_10-01-01.400)
.....
[2019-07-26 10:01:01] DEBUG com.normation.rudder.repository.xml.GitNodeGroupArchiverImpl - Archived node group category: /var/rudder/configuration-repository/groups/category.xml
[2019-07-26 10:01:01] INFO  com.normation.rudder.repository.xml.GitNodeGroupArchiverImpl - Create a new archive: GitPath(archives/groups/2019-07-26_10-01-01.660)
[2019-07-26 10:01:01] DEBUG com.normation.rudder.repository.xml.GitParameterArchiverImpl - Archived parameter: /var/rudder/configuration-repository/parameters/rudder_file_edit_header.xml
[2019-07-26 10:01:01] INFO  com.normation.rudder.repository.xml.GitParameterArchiverImpl - Create a new archive: GitPath(archives/parameters/2019-07-26_10-01-01.794)
[2019-07-26 10:01:02] INFO  com.normation.rudder.repository.xml.ItemArchiveManagerImpl - Create a new archive: GitPath(archives/full/2019-07-26_10-01-02.108)
[2019-07-26 10:01:02] INFO  bootchecks - First full archive of configuration-repository items done
...

Subtasks

Bug #15376: Forget to fork backend batchReleasedNicolas CHARLESActions
Bug #15381: Add more info on LDAP backend errorReleasedVincent MEMBRÉActions
Bug #15383: Unhandled error if the LDAP connection is unavailableReleasedVincent MEMBRÉActions

Associated revisions

Revision 10d84335 (diff)
Added by François ARMAND 4 months ago

Fixes #15324: Webapp fails to boot (several time)

History

#1

Updated by François ARMAND 4 months ago

  • Status changed from New to In progress
#2

Updated by François ARMAND 4 months ago

So, it seems that the blocking point is that boocheck:

[2019-07-31 13:40:21] INFO  bootchecks - No full archive of configuration-repository items seems to have been done, initialising the system with one

Afterward, threads accumulate and one of these two seems to be faulty (actually the second should have been called by the first):

"zio-default-blocking-10-2118413714" #87 daemon prio=5 os_prio=0 tid=0x00007f4b341c3800 nid=0x2c11 in Object.wait() [0x00007f4b27b7b000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at zio.internal.OneShot.get(OneShot.scala:79)
        - locked <0x00000000c7b93c08> (a zio.internal.OneShot)
        at zio.Runtime.unsafeRunSync(Runtime.scala:70)
        at zio.Runtime.unsafeRunSync$(Runtime.scala:65)
        at com.normation.zio$ZioRuntime$$anon$2.unsafeRunSync(ZioCommons.scala:291)
        at com.normation.zio$ZioRuntime$.runNow(ZioCommons.scala:305)
        at com.normation.box$IOToBox.toBox(ZioCommons.scala:344)
        at bootstrap.liftweb.checks.CheckInitXmlExport.checks(CheckInitXmlExport.scala:72)
        at bootstrap.liftweb.SequentialImmediateBootStrapChecks.$anonfun$checks$1(BootstrapChecks.scala:99)
        at bootstrap.liftweb.SequentialImmediateBootStrapChecks.$anonfun$checks$1$adapted(BootstrapChecks.scala:91)
        at bootstrap.liftweb.SequentialImmediateBootStrapChecks$$Lambda$2109/108232476.apply(Unknown Source)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at bootstrap.liftweb.SequentialImmediateBootStrapChecks.checks(BootstrapChecks.scala:91)
        at bootstrap.liftweb.RudderConfig$.$anonfun$init$2(RudderConfig.scala:864)
        at bootstrap.liftweb.RudderConfig$$$Lambda$2069/546271589.apply$mcV$sp(Unknown Source)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at zio.blocking.Blocking$Service.$anonfun$effectBlocking$5(Blocking.scala:133)
        at zio.blocking.Blocking$Service$$Lambda$427/800470253.apply(Unknown Source)
        at zio.internal.FiberContext.evaluateNow(FiberContext.scala:332)
        at zio.internal.FiberContext.$anonfun$evaluateLater$1(FiberContext.scala:592)
        at zio.internal.FiberContext$$Lambda$423/1775525066.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
"zio-default-blocking-16-2118413714" #309 daemon prio=5 os_prio=0 tid=0x00007f4b620e5000 nid=0x2cf9 in Object.wait() [0x00007f4b25469000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at zio.internal.OneShot.get(OneShot.scala:79)
        - locked <0x00000000c7b93380> (a zio.internal.OneShot)
        at zio.Runtime.unsafeRunSync(Runtime.scala:70)
        at zio.Runtime.unsafeRunSync$(Runtime.scala:65)
        at com.normation.zio$ZioRuntime$$anon$2.unsafeRunSync(ZioCommons.scala:291)
        at com.normation.zio$ZioRuntime$.runNow(ZioCommons.scala:305)
        at com.normation.rudder.repository.xml.GitArchiverUtils.getRootDirectory(GitArchiverUtils.scala:107)
        at com.normation.rudder.repository.xml.GitArchiverUtils.getRootDirectory$(GitArchiverUtils.scala:105)
        at com.normation.rudder.rule.category.GitRuleCategoryArchiverImpl.getRootDirectory$lzycompute(GitRuleCategoryArchiver.scala:120)
        - locked <0x00000000c7099da0> (a com.normation.rudder.rule.category.GitRuleCategoryArchiverImpl)
        at com.normation.rudder.rule.category.GitRuleCategoryArchiverImpl.getRootDirectory(GitRuleCategoryArchiver.scala:120)
        at com.normation.rudder.repository.xml.ItemArchiveManagerImpl.$anonfun$exportRuleCategories$2(ItemArchiveManagerImpl.scala:140)
        at com.normation.rudder.repository.xml.ItemArchiveManagerImpl$$Lambda$2656/502585367.apply(Unknown Source)
        at zio.internal.FiberContext.evaluateNow(FiberContext.scala:698)
        at zio.internal.FiberContext.$anonfun$evaluateLater$1(FiberContext.scala:592)
        at zio.internal.FiberContext$$Lambda$423/1775525066.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
#3

Updated by François ARMAND 4 months ago

Reproduced with:

object GitArchiveUtilMain {
  val GIT_ROOT = "/tmp/rudder-test/gitroot" 
  def main(args: Array[String]): Unit = {
    println("Start")
    val util = new GitArchiverUtils {
      override def gitRepo: GitRepositoryProvider = new GitRepositoryProviderImpl(GIT_ROOT)
      override def gitRootDirectory: File = new File(GIT_ROOT)
      override def relativePath: String = "/" 
      override def xmlPrettyPrinter: RudderPrettyPrinter = null
      override def encoding: String = null
      override def gitModificationRepository: GitModificationRepository = new GitModificationRepository {
        override def getCommits(modificationId: ModificationId): IOResult[Option[GitCommitId]] = None.succeed
        override def addCommit(commit: GitCommitId, modId: ModificationId): IOResult[DB.GitCommitJoin] = DB.GitCommitJoin(commit, modId).succeed
      }
      override def loggerName: String = "my-logger-logger" 
    }
    println("util created")
    println("getRootDirectory: " + util.getRootDirectory.getAbsolutePath)
    // never reach that.
    println("End")
  }
}

#4

Updated by François ARMAND 4 months ago

So, the problem is with the lazy val for getRootDirectory.
The lock on object may cause something strange with ZIO

#6

Updated by François ARMAND 4 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/2355
#7

Updated by Nicolas CHARLES 4 months ago

  • Assignee changed from Nicolas CHARLES to François ARMAND
#8

Updated by François ARMAND 4 months ago

  • Status changed from Pending technical review to Pending release
#9

Updated by Vincent MEMBRÉ 15 days ago

  • Status changed from Pending release to Released
  • Priority changed from 47 to 90

This bug has been fixed in Rudder 6.0.0~beta1 which was released today.

Also available in: Atom PDF