Bug #26464
Stack overflow in NodeStatusReports event computing
Description
Under load, we get a StackOverflowError in ComputeNodeStatusReportServiceImpl:
2025-03-04 06:40:42+0100 INFO policy.generation.timing - Policy generation succeeded in: 1 min 49 s
2025-03-04 06:40:42+0100 INFO policy.generation.manager - Successful policy update '740168' [started 2025-03-04 06:38:53 - ended 2025-03-04 06:40:42]
java.lang.StackOverflowError
    at java.base/java.lang.invoke.DirectMethodHandle.allocateInstance(DirectMethodHandle.java:520)
    at com.normation.rudder.services.reports.ComputeNodeStatusReportServiceImpl.$anonfun$groupQueueActionByType$1(ComputeNodeStatusReportService.scala:373)
    at scala.Option.map(Option.scala:242)
    at com.normation.rudder.services.reports.ComputeNodeStatusReportServiceImpl.groupQueueActionByType(ComputeNodeStatusReportService.scala:372)
    at com.normation.rudder.services.reports.ComputeNodeStatusReportServiceImpl.$anonfun$groupQueueActionByType$1(ComputeNodeStatusReportService.scala:373)
    at scala.Option.map(Option.scala:242)
    at com.normation.rudder.services.reports.ComputeNodeStatusReportServiceImpl.groupQueueActionByType(ComputeNodeStatusReportService.scala:372)
    at com.normation.rudder.services.reports.ComputeNodeStatusReportServiceImpl.$anonfun$groupQueueActionByType$1(ComputeNodeStatusReportService.scala:373)
    at scala.Option.map(Option.scala:242)
    at com.normation.rudder.services.reports.ComputeNodeStatusReportServiceImpl.groupQueueActionByType(ComputeNodeStatusReportService.scala:372)
    at com.normation.rudder.services.reports.ComputeNodeStatusReportServiceImpl.$anonfun$groupQueueActionByType$1(ComputeNodeStatusReportService.scala:373)
    at scala.Option.map(Option.scala:242)
(and so on, looping on the last 3 lines)
WORKAROUND
This can be worked around by increasing the stack size, which also points to real system contention rather than a logic bug:
=> add @-Xss64m@ to the GC parameters in @/etc/default/rudder-jetty@
Jetty may then refuse to start, because systemd kills it before it has fully processed the backlog of old items.
You may need to force stop jetty, and perhaps wait for the agent to repair things. It may also take a couple of generation/report processing rounds before compliance converges back to green.
It looks like a real StackOverflowError because the groupBy is not stack safe, but we need to investigate, understand the root cause, and correct it.
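To make the suspicion concrete, here is a minimal sketch of the pattern the stack trace suggests (a grouping function recursing through @Option.map@) next to a tail-recursive rewrite. This is not the actual Rudder code: @Action@, @Update@, @Delete@, @groupUnsafe@ and @groupSafe@ are hypothetical names, and the assumption that the real function groups consecutive actions of the same type has not been checked against the source.

```scala
import scala.annotation.tailrec

// Hypothetical stand-ins for the real queue action types (NOT Rudder's model).
sealed trait Action { def kind: String }
final case class Update(id: Int) extends Action { val kind = "update" }
final case class Delete(id: Int) extends Action { val kind = "delete" }

object GroupingSketch {
  // Assumed shape of the problem: the recursive call sits inside Option.map,
  // so it is not in tail position. Every group of consecutive actions adds
  // stack frames, and a long queue can exhaust the stack.
  def groupUnsafe(queue: List[Action]): List[List[Action]] =
    queue.headOption.map { head =>
      val (same, rest) = queue.span(_.kind == head.kind)
      same :: groupUnsafe(rest)
    }.getOrElse(Nil)

  // Same grouping with an accumulator. The recursive call is in tail
  // position, so @tailrec lets the compiler turn it into a loop.
  @tailrec
  def groupSafe(queue: List[Action], acc: List[List[Action]] = Nil): List[List[Action]] =
    queue match {
      case Nil => acc.reverse
      case head :: _ =>
        val (same, rest) = queue.span(_.kind == head.kind)
        groupSafe(rest, same :: acc)
    }
}
```

If the real function has this shape, a queue that grows under load would explain both the overflow and why a larger @-Xss@ only postpones it. The accumulator version uses constant stack depth regardless of queue length.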
The observed instance was Rudder 8.2.4, but nothing has changed here in more recent versions.