Project

General

Profile

Actions

Bug #16004

closed

Unexpected crash of webapp on 6.0

Added by Nicolas CHARLES over 4 years ago. Updated over 4 years ago.

Status:
Released
Priority:
N/A
Category:
-
Target version:
Severity:
Major - prevents use of part of Rudder | no simple workaround
UX impact:
User visibility:
Effort required:
Priority:
0
Name check:
To do
Fix check:
To do
Regression:

Description

I had a fresh 6.0 install, worked fined. I let it runs without doing anything, ~30min later, I came back and found my webapp stuck and not able to click on any buttons.
Webapp logs where displaying lot of messages like:

[2019-10-18 11:21:28+0200] INFO  policy.generation - Successful policy update '3' [started 2019-10-18 11:21:26 - ended 2019-10-18 11:21:28]
[2019-10-18 11:41:43+0200] ERROR report - Could not get node executions reports from the RudderSysEvents table  cause is: Could not get node execution status <- Error when retrieving 'executionStatus' from db: HikariPool-1 - Connection is not available, request timed out after 30000ms.
[2019-10-18 11:41:43+0200] WARN  application - [Store Agent Run Times] Task frequency is set too low! Last task took 30006 ms but tasks are scheduled every 5000 ms. Adjust rudder.batch.storeAgentRunTimes.updateInterval if this problem persists.
[2019-10-18 11:41:59+0200] WARN  scheduledJob - Error when trying to clean log reports from report table. <- An error occured while deleting log reports <- HikariPool-1 - Connection is not available, request timed out after 30000ms.
[2019-10-18 11:42:01+0200] ERROR scheduledJob - could not fetch last id, don't log anything, wait next run, cause is Error when parsing property 'reportLoggerLastId'
[2019-10-18 11:42:18+0200] ERROR report - Could not get node executions reports from the RudderSysEvents table  cause is: Could not get node execution status <- Error when retrieving 'executionStatus' from db: HikariPool-1 - Connection is not available, request timed out after 30000ms.
[2019-10-18 11:42:18+0200] WARN  application - [Store Agent Run Times] Task frequency is set too low! Last task took 30002 ms but tasks are scheduled every 5000 ms. Adjust rudder.batch.storeAgentRunTimes.updateInterval if this problem persists.
[2019-10-18 11:42:53+0200] ERROR report - Could not get node executions reports from the RudderSysEvents table  cause is: Could not get node execution status <- Error when retrieving 'executionStatus' from db: HikariPool-1 - Connection is not available, request timed out after 30000ms.
[2019-10-18 11:42:53+0200] WARN  application - [Store Agent Run Times] Task frequency is set too low! Last task took 30002 ms but tasks are scheduled every 5000 ms. Adjust rudder.batch.storeAgentRunTimes.updateInterval if this problem persists.
[2019-10-18 11:43:28+0200] ERROR report - Could not get node executions reports from the RudderSysEvents table  cause is: Could not get node execution status <- Error when retrieving 'executionStatus' from db: HikariPool-1 - Connection is not available, request timed out after 30000ms.
[2019-10-18 11:43:28+0200] WARN  application - [Store Agent Run Times] Task frequency is set too low! Last task took 30003 ms but tasks are scheduled every 5000 ms. Adjust rudder.batch.storeAgentRunTimes.updateInterval if this problem persists.

I had to reboot the VM, allow 2 more cpus and now eveything is working fine again. I have no idea if it was the reboot or the cpu allocation that fixed it.

Given the error messages, it looks like the database wasn't repsonding anymore. It may have been due to the system being under load or b/c of some thread leak (when threads accumulate, the eat all CPU just for switching between them, which is not great).

If the problem happen again, before stopping the JVM, can you do a threadump?

To get a threadump, you need a jdk (openjdk-development) to get jstack command, and then:

jstack -l  <jvm-pid> > <file-path>


Files

dump_log (290 KB) dump_log Nicolas CHARLES, 2019-10-21 12:22
Actions

Also available in: Atom PDF