Project

General

Profile

Bug #13517

Policy generation fails definitively when postgresql is shut down during policy generation

Added by Nicolas CHARLES 3 months ago. Updated about 2 months ago.

Status:
Released
Priority:
N/A
Category:
Performance and scalability
Target version:
Severity:
Critical - prevents main use of Rudder | no workaround | data loss | security
User visibility:
Getting started - demo | first install | level 1 Techniques
Effort required:
Very Small
Priority:
122

Description

If Postgresql is shut down during a policy generation for more than 30 seconds, the policy generation fails, and cannot recover unless the webapp is restarted

To reproduce:
  1. edit hook /opt/rudder/etc/hooks.d/policy-generation-node-ready/10-cf-promise-check and add a "sleep 10" there
  2. trigger a full regeneration
  3. shut down postgresql during the generation
  4. wait 30 seconds

You'll get the following stacktrace

java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30011ms.
        at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:591)
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:194)
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:146)
        at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:85)
        at doobie.util.transactor$DataSourceTransactor$DataSourceTransactorCtor$$anon$1.$anonfun$connect$2(transactor.scala:122)
        at scalaz.concurrent.Task$.$anonfun$delay$1(Task.scala:339)
        at scalaz.concurrent.Task$.$anonfun$suspend$2(Task.scala:347)
        at scalaz.concurrent.Task$.Try(Task.scala:457)
        at scalaz.concurrent.Task$.$anonfun$suspend$1(Task.scala:347)
        at scalaz.concurrent.Future.step(Future.scala:119)
        at scalaz.concurrent.Future.unsafePerformListen(Future.scala:75)
        at scalaz.concurrent.Future.unsafePerformAsync(Future.scala:159)
        at scalaz.concurrent.Future.unsafePerformSync(Future.scala:184)
        at scalaz.concurrent.Task.unsafePerformSync(Task.scala:99)
        at com.normation.rudder.repository.jdbc.RudderPropertiesRepositoryImpl.getReportLoggerLastId(RudderPropertiesRepository.scala:63)
        at com.normation.rudder.batch.AutomaticReportLogger$LAAutomaticReportLogger$$anonfun$messageHandler$1.applyOrElse(AutomaticReportLogger.scala:105)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:34)
        at net.liftweb.actor.LiftActor.execTranslate(LiftActor.scala:504)
        at net.liftweb.actor.LiftActor.execTranslate$(LiftActor.scala:496)
        at com.normation.rudder.batch.AutomaticReportLogger$LAAutomaticReportLogger.execTranslate(AutomaticReportLogger.scala:88)
        at net.liftweb.actor.SpecializedLiftActor.proc2(LiftActor.scala:305)
        at net.liftweb.actor.SpecializedLiftActor.$anonfun$processMailbox$1(LiftActor.scala:227)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
        at net.liftweb.actor.SpecializedLiftActor.around(LiftActor.scala:241)
        at net.liftweb.actor.SpecializedLiftActor.around$(LiftActor.scala:240)
        at com.normation.rudder.batch.AutomaticReportLogger$LAAutomaticReportLogger.around(AutomaticReportLogger.scala:88)
        at net.liftweb.actor.SpecializedLiftActor.processMailbox(LiftActor.scala:227)
        at net.liftweb.actor.SpecializedLiftActor.$anonfun$$bang$3(LiftActor.scala:190)
        at net.liftweb.actor.LAScheduler$$anon$2$$anon$3.run(LiftActor.scala:76)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:262)
        at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:52)
        at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:216)
        at org.postgresql.Driver.makeConnection(Driver.java:404)
        at org.postgresql.Driver.connect(Driver.java:272)
        at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:95)
        at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:101)
        at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:346)
        at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:196)
        at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:442)
        at com.zaxxer.hikari.pool.HikariPool.access$200(HikariPool.java:73)
        at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:620)
        at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:606)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 common frames omitted
Caused by: java.net.ConnectException: Connection refused (Connection refused)
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.postgresql.core.PGStream.<init>(PGStream.java:61)
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:144)
        ... 16 common frames omitted
[2018-09-18 15:19:43] ERROR com.normation.rudder.batch.AsyncDeploymentAgent - Policy update error for process '7' at 2018-09-18 15:19:43: Exception caught during policy update process: HikariPool-1 - Connection is not available, request timed out after 30002ms.
java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30002ms.
        at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:591)
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:194)
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:146)
        at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:85)
        at doobie.util.transactor$DataSourceTransactor$DataSourceTransactorCtor$$anon$1.$anonfun$connect$2(transactor.scala:122)
        at scalaz.concurrent.Task$.$anonfun$delay$1(Task.scala:339)
        at scalaz.concurrent.Task$.$anonfun$suspend$2(Task.scala:347)
        at scalaz.concurrent.Task$.Try(Task.scala:457)
        at scalaz.concurrent.Task$.$anonfun$suspend$1(Task.scala:347)
        at scalaz.concurrent.Future.step(Future.scala:119)
        at scalaz.concurrent.Future.unsafePerformListen(Future.scala:75)
        at scalaz.concurrent.Future.unsafePerformAsync(Future.scala:159)
        at scalaz.concurrent.Future.unsafePerformSync(Future.scala:184)
        at scalaz.concurrent.Task.unsafePerformSync(Task.scala:99)
        at com.normation.rudder.repository.jdbc.UpdateExpectedReportsJdbcRepository.saveNodeExpectedReports(ExpectedReportsJdbcRepository.scala:231)
        at com.normation.rudder.services.policies.ExpectedReportsUpdateImpl.updateExpectedReports(ExpectedReportsUpdate.scala:205)
        at com.normation.rudder.services.policies.PromiseGeneration_setExpectedReports.setExpectedReports(DeploymentService.scala:1099)
        at com.normation.rudder.services.policies.PromiseGeneration_setExpectedReports.setExpectedReports$(DeploymentService.scala:1079)
        at com.normation.rudder.services.policies.PromiseGenerationServiceImpl.setExpectedReports(DeploymentService.scala:478)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$75(DeploymentService.scala:235)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$71(DeploymentService.scala:228)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$65(DeploymentService.scala:217)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
       at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$43(DeploymentService.scala:184)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$38(DeploymentService.scala:170)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$36(DeploymentService.scala:169)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$34(DeploymentService.scala:168)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$33(DeploymentService.scala:167)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$31(DeploymentService.scala:166)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$28(DeploymentService.scala:163)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$24(DeploymentService.scala:160)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$20(DeploymentService.scala:157)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$16(DeploymentService.scala:154)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$12(DeploymentService.scala:151)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$8(DeploymentService.scala:148)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$5(DeploymentService.scala:143)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.$anonfun$deploy$2(DeploymentService.scala:138)
        at net.liftweb.common.Full.flatMap(Box.scala:808)
        at com.normation.rudder.services.policies.PromiseGenerationService.deploy(DeploymentService.scala:137)
        at com.normation.rudder.services.policies.PromiseGenerationService.deploy$(DeploymentService.scala:121)
        at com.normation.rudder.services.policies.PromiseGenerationServiceImpl.deploy(DeploymentService.scala:478)
        at com.normation.rudder.batch.AsyncDeploymentAgent$DeployerAgent$$anonfun$messageHandler$1.applyOrElse(AsyncDeploymentAgent.scala:350)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:34)
        at net.liftweb.actor.LiftActor.execTranslate(LiftActor.scala:504)
        at net.liftweb.actor.LiftActor.execTranslate$(LiftActor.scala:496)
...
[2018-09-18 15:20:11] ERROR application - Error when executing [Store Agent Run Times] scheduler task started at 2018/09/18 15:19:41, ended at 2018/09/18 15:20:11. <- HikariPool-1 - Connection is not available, request timed out after 300
06ms.
[2018-09-18 15:20:13] ERROR net.liftweb.actor.ActorLogger - Actor threw an exception
java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30001ms.
        at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:591)
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:194)
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:146)
        at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:85)
        at doobie.util.transactor$DataSourceTransactor$DataSourceTransactorCtor$$anon$1.$anonfun$connect$2(transactor.scala:122)
        at scalaz.concurrent.Task$.$anonfun$delay$1(Task.scala:339)
        at scalaz.concurrent.Task$.$anonfun$suspend$2(Task.scala:347)
        at scalaz.concurrent.Task$.Try(Task.scala:457)
        at scalaz.concurrent.Task$.$anonfun$suspend$1(Task.scala:347)
        at scalaz.concurrent.Future.step(Future.scala:119)
        at scalaz.concurrent.Future.unsafePerformListen(Future.scala:75)
        at scalaz.concurrent.Future.unsafePerformAsync(Future.scala:159)
        at scalaz.concurrent.Future.unsafePerformSync(Future.scala:184)
        at scalaz.concurrent.Task.unsafePerformSync(Task.scala:99)
        at com.normation.rudder.repository.jdbc.EventLogJdbcRepository.saveEventLog(EventLogJdbcRepository.scala:88)
        at com.normation.rudder.batch.AsyncDeploymentAgent$$anonfun$lowPriority$1.applyOrElse(AsyncDeploymentAgent.scala:292)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:34)
        at net.liftweb.http.ListenerManager$$anonfun$listenerService$1.applyOrElse(CometActor.scala:214)

        at scala.PartialFunction$OrElse.apply(PartialFunction.scala:168)
        at scala.collection.MapLike.$anonfun$applyOrElse$1(MapLike.scala:163)
        at scala.collection.MapLike.getOrElse(MapLike.scala:128)
        at scala.collection.MapLike.getOrElse$(MapLike.scala:126)
        at scala.collection.AbstractMap.getOrElse(Map.scala:59)
        at scala.collection.MapLike.applyOrElse(MapLike.scala:163)
        at scala.collection.MapLike.applyOrElse$(MapLike.scala:162)
        at scala.collection.AbstractMap.applyOrElse(Map.scala:59)
        at scala.PartialFunction$OrElse.apply(PartialFunction.scala:168)
        at scala.collection.MapLike.$anonfun$applyOrElse$1(MapLike.scala:163)
        at scala.collection.MapLike.getOrElse(MapLike.scala:128)
        at scala.collection.MapLike.getOrElse$(MapLike.scala:126)
        at scala.collection.AbstractMap.getOrElse(Map.scala:59)
        at scala.collection.MapLike.applyOrElse(MapLike.scala:163)
        at scala.collection.MapLike.applyOrElse$(MapLike.scala:162)
        at scala.collection.AbstractMap.applyOrElse(Map.scala:59)
        at scala.PartialFunction$OrElse.apply(PartialFunction.scala:168)
        at net.liftweb.actor.LiftActor.execTranslate(LiftActor.scala:504)
        at net.liftweb.actor.LiftActor.execTranslate$(LiftActor.scala:496)
        at com.normation.rudder.batch.AsyncDeploymentAgent.execTranslate(AsyncDeploymentAgent.scala:100)
        at net.liftweb.actor.SpecializedLiftActor.proc2(LiftActor.scala:305)
        at net.liftweb.actor.SpecializedLiftActor.$anonfun$processMailbox$1(LiftActor.scala:227)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
        at net.liftweb.actor.SpecializedLiftActor.around(LiftActor.scala:241)
        at net.liftweb.actor.SpecializedLiftActor.around$(LiftActor.scala:240)
        at com.normation.rudder.batch.AsyncDeploymentAgent.around(AsyncDeploymentAgent.scala:100)
        at net.liftweb.actor.SpecializedLiftActor.processMailbox(LiftActor.scala:227)
        at net.liftweb.actor.SpecializedLiftActor.$anonfun$$bang$3(LiftActor.scala:190)
        at net.liftweb.actor.LAScheduler$$anon$2$$anon$3.run(LiftActor.scala:76)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Caused by: org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:262)
        at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:52)
        at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:216)
        at org.postgresql.Driver.makeConnection(Driver.java:404)
        at org.postgresql.Driver.connect(Driver.java:272)
        at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:95)
        at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:101)
        at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:346)
        at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:196)
        at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:442)
        at com.zaxxer.hikari.pool.HikariPool.access$200(HikariPool.java:73)
        at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:620)
        at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:606)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 common frames omitted
Caused by: java.net.ConnectException: Connection refused (Connection refused)
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.postgresql.core.PGStream.<init>(PGStream.java:61)
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:144)
        ... 16 common frames omitted


Related issues

Related to Datasources - Bug #13588: Correct attempt.transact -> transact.attempt in SQL request (see #13517)Pending release

Associated revisions

Revision 76eb3db4 (diff)
Added by François ARMAND 3 months ago

Fixes #13517: Policy generation fails definitively when postgresql is shut down during policy generation

Revision 80b748e6 (diff)
Added by François ARMAND 3 months ago

Issue #13517: missing (new) transact.attempt in 4.3

History

#1 Updated by François ARMAND 3 months ago

  • Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
  • User visibility set to Getting started - demo | first install | level 1 Techniques
  • Effort required set to Very Small
  • Priority changed from 0 to 122

So, the problem is in `EventLogJdbcRepository.saveEventLog` and every other plance in the code where we use the code pattern:

sql"""....""".....attempt.transact(xa).unsafePerformSync

The problem here is that the "attempt" only catch errors in the SQL part, not when an error happens in the "transact" part.

We must use:

sql"""....""".....transact(xa).attempt.unsafePerformSync

#2 Updated by François ARMAND 3 months ago

  • Status changed from New to In progress
  • Assignee set to François ARMAND

#3 Updated by François ARMAND 3 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/2040

#4 Updated by Normation Quality Assistant 3 months ago

  • Assignee changed from Nicolas CHARLES to François ARMAND

#5 Updated by François ARMAND 3 months ago

  • Status changed from Pending technical review to Pending release

#6 Updated by François ARMAND 3 months ago

  • Related to Bug #13588: Correct attempt.transact -> transact.attempt in SQL request (see #13517) added

#7 Updated by Vincent MEMBRÉ about 2 months ago

  • Status changed from Pending release to Released
This bug has been fixed in Rudder 4.1.15, 4.3.5 and 5.0.1 which were released today.
Changelog
Changelog
Changelog

Also available in: Atom PDF