Project

General

Profile

Bug #16004

Unexpected crash of webapp on 6.0

Added by Nicolas CHARLES about 1 month ago. Updated 15 days ago.

Status:
Released
Priority:
N/A
Category:
-
Target version:
Severity:
Major - prevents use of part of Rudder | no simple workaround
User visibility:
Effort required:
Priority:
0

Description

I had a fresh 6.0 install, worked fined. I let it runs without doing anything, ~30min later, I came back and found my webapp stuck and not able to click on any buttons.
Webapp logs where displaying lot of messages like:

[2019-10-18 11:21:28+0200] INFO  policy.generation - Successful policy update '3' [started 2019-10-18 11:21:26 - ended 2019-10-18 11:21:28]
[2019-10-18 11:41:43+0200] ERROR report - Could not get node executions reports from the RudderSysEvents table  cause is: Could not get node execution status <- Error when retrieving 'executionStatus' from db: HikariPool-1 - Connection is not available, request timed out after 30000ms.
[2019-10-18 11:41:43+0200] WARN  application - [Store Agent Run Times] Task frequency is set too low! Last task took 30006 ms but tasks are scheduled every 5000 ms. Adjust rudder.batch.storeAgentRunTimes.updateInterval if this problem persists.
[2019-10-18 11:41:59+0200] WARN  scheduledJob - Error when trying to clean log reports from report table. <- An error occured while deleting log reports <- HikariPool-1 - Connection is not available, request timed out after 30000ms.
[2019-10-18 11:42:01+0200] ERROR scheduledJob - could not fetch last id, don't log anything, wait next run, cause is Error when parsing property 'reportLoggerLastId'
[2019-10-18 11:42:18+0200] ERROR report - Could not get node executions reports from the RudderSysEvents table  cause is: Could not get node execution status <- Error when retrieving 'executionStatus' from db: HikariPool-1 - Connection is not available, request timed out after 30000ms.
[2019-10-18 11:42:18+0200] WARN  application - [Store Agent Run Times] Task frequency is set too low! Last task took 30002 ms but tasks are scheduled every 5000 ms. Adjust rudder.batch.storeAgentRunTimes.updateInterval if this problem persists.
[2019-10-18 11:42:53+0200] ERROR report - Could not get node executions reports from the RudderSysEvents table  cause is: Could not get node execution status <- Error when retrieving 'executionStatus' from db: HikariPool-1 - Connection is not available, request timed out after 30000ms.
[2019-10-18 11:42:53+0200] WARN  application - [Store Agent Run Times] Task frequency is set too low! Last task took 30002 ms but tasks are scheduled every 5000 ms. Adjust rudder.batch.storeAgentRunTimes.updateInterval if this problem persists.
[2019-10-18 11:43:28+0200] ERROR report - Could not get node executions reports from the RudderSysEvents table  cause is: Could not get node execution status <- Error when retrieving 'executionStatus' from db: HikariPool-1 - Connection is not available, request timed out after 30000ms.
[2019-10-18 11:43:28+0200] WARN  application - [Store Agent Run Times] Task frequency is set too low! Last task took 30003 ms but tasks are scheduled every 5000 ms. Adjust rudder.batch.storeAgentRunTimes.updateInterval if this problem persists.

I had to reboot the VM, allow 2 more cpus and now eveything is working fine again. I have no idea if it was the reboot or the cpu allocation that fixed it.

Given the error messages, it looks like the database wasn't repsonding anymore. It may have been due to the system being under load or b/c of some thread leak (when threads accumulate, the eat all CPU just for switching between them, which is not great).

If the problem happen again, before stopping the JVM, can you do a threadump?

To get a threadump, you need a jdk (openjdk-development) to get jstack command, and then:

jstack -l  <jvm-pid> > <file-path>


Files

dump_log (290 KB) dump_log Nicolas CHARLES, 2019-10-21 12:22

Associated revisions

Revision c4b4fbbd (diff)
Added by Nicolas CHARLES about 1 month ago

Fixes #16004: Unexpected crash of webapp on 6.0

History

#1

Updated by Nicolas CHARLES about 1 month ago

#2

Updated by Nicolas CHARLES about 1 month ago

  • Status changed from New to In progress
  • Assignee set to Nicolas CHARLES
#3

Updated by Nicolas CHARLES about 1 month ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Nicolas CHARLES to François ARMAND
  • Pull Request set to https://github.com/Normation/rudder/pull/2540
#4

Updated by Nicolas CHARLES about 1 month ago

  • Status changed from Pending technical review to Pending release
#5

Updated by Nicolas CHARLES about 1 month ago

  • Description updated (diff)
#6

Updated by Vincent MEMBRÉ 15 days ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 6.0.0~beta1 which was released today.

Also available in: Atom PDF