Bug #18227 (closed)

Fiber error when postgres is unavailable breaks generation forever

Added by François ARMAND about 4 years ago. Updated over 2 years ago.

Status: Released
Priority: N/A
Category: System integration
Target version:
Severity: Critical - prevents main use of Rudder | no workaround | data loss | security
UX impact:
User visibility: Operational - other Techniques | Rudder settings | Plugins
Effort required:
Priority: 90
Name check: To do
Fix check: Checked
Regression:

Description

One user had to shut down PostgreSQL after Rudder was up.
This leads to errors like the following:

[2020-09-18 10:18:57+0200] INFO  compliance - [metrics] global compliance (number of components): 5118 [p:0 s:2164 r:7 e:3 u:0 m:0 nr:0 na:172 rd:0 c:2342 ana:208 nc:222 ae:0 bpm:0]
[2020-09-18 10:19:09+0200] ERROR report - Could not get node executions reports from the RudderSysEvents table  cause is: Could not get node execution status <- Error when retrieving 'executionStatus' from db: HikariPool-1 - Connection is not available,
[2020-09-18 10:19:09+0200] WARN  application - [Store Agent Run Times] Task frequency is set too low! Last task took 30003 ms but tasks are scheduled every 5000 ms. Adjust rudder.batch.storeAgentRunTimes.updateInterval if this problem persists.
[2020-09-18 10:19:18+0200] WARN  scheduled.job - Error when trying to clean log reports from report table. <- An error occured while deleting log reports <- HikariPool-1 - Connection is not available, request timed out after 30001ms.
[2020-09-18 10:19:27+0200] ERROR report.cache - Error when updating compliance cache for nodes: [be8c60fb-2de5-44ad-8b7a-84acad91323d, 4d3d95a5-8e93-4021-9b98-e9daa97a5583, 31b367c9-3fd3-44b7-8a13-33a3b7a5c4fc, root, 7eabf174-21bc-4a16-8dc8-5aa168fa7316
 ->
[2020-09-18 10:19:31+0200] ERROR com.normation.rudder.services.ClearCacheServiceImpl - Error when logging cache event: SystemError: Error when persisting event log ClearCache; cause was: java.sql.SQLTransientConnectionException: HikariPool-1 - Connectio
 ->
[2020-09-18 10:19:44+0200] ERROR report - Could not get node executions reports from the RudderSysEvents table  cause is: Could not get node execution status <- Error when retrieving 'executionStatus' from db: HikariPool-1 - Connection is not available,
[2020-09-18 10:19:44+0200] WARN  application - [Store Agent Run Times] Task frequency is set too low! Last task took 30003 ms but tasks are scheduled every 5000 ms. Adjust rudder.batch.storeAgentRunTimes.updateInterval if this problem persists.
[2020-09-18 10:19:50+0200] ERROR scheduled.job - could not fetch last id, don't log anything, wait next run, cause is Error when parsing property 'reportLoggerLastId'
[2020-09-18 10:20:01+0200] INFO  policy.generation - Start policy generation, checking updated rules
[2020-09-18 10:20:02+0200] INFO  policy.generation - [metrics] Xmx:1.94 GB nodes:15 (cached:0) rules:34 (enabled:31) techniques:84 (enabled:81) directives:144 (enabled:130) groups:32 (dynamic:29) parameters:3
[2020-09-18 10:20:19+0200] ERROR report - Could not get node executions reports from the RudderSysEvents table  cause is: Could not get node execution status <- Error when retrieving 'executionStatus' from db: HikariPool-1 - Connection is not available,
[2020-09-18 10:20:19+0200] WARN  application - [Store Agent Run Times] Task frequency is set too low! Last task took 30000 ms but tasks are scheduled every 5000 ms. Adjust rudder.batch.storeAgentRunTimes.updateInterval if this problem persists.
[2020-09-18 10:20:32+0200] ERROR historization - Error with update nodes historization process <- Could not update the nodes historization information in base. <- Fiber failed.
A checked error was not handled.
java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30004ms.
        at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:689)
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:196)
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:161)
        at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:100)
        at doobie.util.transactor$Transactor$fromDataSource$FromDataSourceUnapplied.$anonfun$apply$14(transactor.scala:280)

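Reading the trace: the repeated ~30 s waits come from HikariCP's connectionTimeout (30 000 ms by default); while PostgreSQL is unreachable, every getConnection() call blocks for that long and then throws SQLTransientConnectionException, as shown in the stack trace above. The "Fiber failed. A checked error was not handled." part means that exception reached the end of the ZIO fiber without being mapped to anything, so the whole pretty-printed cause ends up in the error message. A minimal sketch of that pattern, assuming ZIO and a doobie-style Task; the names are illustrative, not Rudder code:

object DbErrorSketch {
  import java.sql.SQLTransientConnectionException
  import zio._

  // Stand-in for a doobie-style query run through the Hikari pool: while
  // PostgreSQL is down, getConnection() times out and the resulting
  // SQLTransientConnectionException lands in the Task's error channel.
  def updateHistorization: Task[Unit] =
    ZIO.fail(new SQLTransientConnectionException(
      "HikariPool-1 - Connection is not available, request timed out after 30004ms."))

  // Left unhandled, that Throwable fails the fiber and is rendered as
  // "Fiber failed. A checked error was not handled." plus the full stack trace.
  // Mapping it to a small domain error keeps the logged message readable.
  final case class DbUnavailable(message: String)

  val handled: IO[DbUnavailable, Unit] =
    updateHistorization.mapError(e => DbUnavailable(e.getMessage))
}
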
1/ The connection pool should have noticed that PostgreSQL was down; why does each call wait for the full timeout?
2/ We need to handle that stack trace and print only a relevant message to the user.
3/ PostgreSQL being down must not break Rudder forever: when the connection comes back, everything should work again (a sketch of one possible approach follows this list).
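For point 3 (and the log noise in point 2), one possible shape of the fix, assuming ZIO 1.x as used at the time; the helper name and retry policy are illustrative, not actual Rudder APIs:

object DbRecoverySketch {
  import java.sql.SQLTransientConnectionException
  import zio._
  import zio.clock.Clock
  import zio.duration._

  // Hypothetical wrapper around a DB-backed effect (e.g. a doobie query run
  // through the transactor): log one concise line when the pool cannot hand
  // out a connection, then retry with backoff instead of failing the fiber.
  def withDbRecovery[A](label: String)(effect: Task[A]): ZIO[Clock, Throwable, A] =
    effect
      .tapError {
        case e: SQLTransientConnectionException =>
          UIO(println(s"[$label] database unavailable: ${e.getMessage}"))
        case _ =>
          UIO.unit
      }
      // Exponential backoff, capped at 10 attempts for this sketch; a real
      // background job would likely retry until the database answers again.
      .retry(Schedule.exponential(1.second) && Schedule.recurs(10))
}

Wrapping background jobs like the historization updater this way keeps the failure as data, logs a single concise line instead of the full rendered cause, and lets the job resume on its own once PostgreSQL accepts connections again.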
