Bug #15011

Error at the end of a policy generation with too many nodes

Added by Nicolas CHARLES about 1 year ago. Updated about 1 year ago.

Status:
Released
Priority:
N/A
Category:
Performance and scalability
Target version:
Severity:
Major - prevents use of part of Rudder | no simple workaround
User visibility:
Infrequent - complex configurations | third party integrations
Effort required:
Priority:
73
Tags:

Description

I got the following error after doing a clear cache with 2500 nodes

[2019-06-01 12:59:22] WARN  explain_compliance.a04d8f89-27f4-4428-ad89-6ab4e45798a4a04d8f89-27f4-4428-ad89-6ab4e45798a4 - Received a run at 2019-06-01T12:57:11.000Z for node 'a04d8f89-27f4-4428-ad89-6ab4e45798a4a04d8f89-27f4-4428-ad89-6ab4e45798a4' with configId '20190529-153902-d76426ef' but that node should be sending reports for configId 20190531-224745-355b5fdd
Jun 01, 2019 12:59:22 PM com.zaxxer.nuprocess.linux.LinuxProcess start
WARNING: Failed to start process
java.io.IOException: error=7, Argument list too long
        at com.zaxxer.nuprocess.internal.LibJava8.Java_java_lang_UNIXProcess_forkAndExec(Native Method)
        at com.zaxxer.nuprocess.linux.LinuxProcess.start(LinuxProcess.java:109)
        at com.zaxxer.nuprocess.linux.LinProcessFactory.createProcess(LinProcessFactory.java:40)
        at com.zaxxer.nuprocess.NuProcessBuilder.start(NuProcessBuilder.java:266)
        at com.normation.rudder.hooks.RunNuCommand$.run(RunNuCommand.scala:153)
        at com.normation.rudder.hooks.RunHooks$.$anonfun$asyncRun$3(RunHooks.scala:186)
        at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:303)
        at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:37)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
        at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

[2019-06-01 12:59:22] INFO  policy.generation - Policy generation completed in:    2921610 ms
[2019-06-01 12:59:22] ERROR policy.generation - Error when updating policy, reason was: Exit code=-2147483648 for hook: '/opt/rudder/etc/hooks.d/policy-generation-finished/50-reload-policy-file-server'.
 stdout: 
 stderr: ''
[2019-06-01 12:59:22] INFO  policy.generation - Flag file '/opt/rudder/etc/policy-update-running' successfully removed
[2019-06-01 12:59:22] ERROR policy.generation - Policy update error for process '111' at 2019-06-01 12:59:22: Exit code=-2147483648 for hook: '/opt/rudder/etc/hooks.d/policy-generation-finished/50-reload-policy-file-server'.

Cause

As explained in the comments, the problem is that we build a single string containing all updated node IDs and pass it to the hook through the environment variable RUDDER_NODE_IDS.
To do so, the JVM forks and passes the value in the process parameters. When the number of nodes grows large enough, we hit the ARG_MAX limit, which is platform specific.
On Linux, it defaults to 1/4 of ulimit -s, and since ulimit -s is 8 MB by default, we have ARG_MAX = 2097152 bytes.
But it is less clear-cut than that: if we increase ulimit -s to, for example, 16 MB, the limit seen from the JVM is still 2097152 bytes, at least for OpenJDK 1.8.131.
So perhaps it is a hardcoded limit, or perhaps the interpretation differs from JVM to JVM.
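
The arithmetic behind that figure can be checked with a short shell sketch. The function name arg_max_from_stack_kb is made up for illustration; it only applies the 1/4 rule described above to a stack size given in kB (the unit ulimit -s reports), and says nothing about the extra limit observed from the JVM.

```shell
#!/bin/bash
# Hypothetical helper: applies the "1/4 of the stack limit" rule.
# $1 is the stack size in kB, as printed by `ulimit -s`.
arg_max_from_stack_kb() {
  echo $(( $1 * 1024 / 4 ))
}

arg_max_from_stack_kb 8192    # default 8 MB stack: prints 2097152
```

On a live system, getconf ARG_MAX reports the value actually in effect for the current shell.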

Solution

Given that increasing ulimit -s on Linux does not increase the size of the string we can pass to the child process, we need to change the way we pass the parameter.
For Rudder 5.0.12 and up (and all more recent branches), the RUDDER_NODE_IDS parameter is deprecated and no longer documented in the hook template. It is replaced by a new documented parameter: RUDDER_NODE_IDS_PATH. That parameter contains the path to a file that can be sourced and contains the list of nodes updated during that generation. Sourcing the file defines the variable RUDDER_NODE_IDS if needed.

To avoid breaking existing user hooks, we still define the undocumented RUDDER_NODE_IDS parameter, in the same format as in Rudder 5.0.11 and earlier, if:

- there are user hooks present and executable in /opt/rudder/etc/hooks.d/policy-generation-finished/ AND there are fewer than 3000 updated nodes
- OR the Rudder configuration file /opt/rudder/etc/rudder-web.properties contains the property rudder.hooks.policy-generation-finished.nodeids.compability=true
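
The two conditions above can be restated as a small shell sketch. This is only an illustration of the rule as written here, not Rudder's actual implementation; the function name legacy_node_ids_defined and its three arguments are hypothetical.

```shell
#!/bin/bash
# Illustrative only: restates the rule above, this is not Rudder's code.
# $1: "true" if executable user hooks are present
# $2: number of updated nodes
# $3: value of the compatibility property ("true" or anything else)
legacy_node_ids_defined() {
  local user_hooks="$1" updated_nodes="$2" compat="$3"
  # The compatibility property forces the legacy variable on.
  if [ "$compat" = "true" ]; then
    return 0
  fi
  # Otherwise it is only kept for user hooks below the 3000-node threshold.
  [ "$user_hooks" = "true" ] && [ "$updated_nodes" -lt 3000 ]
}
```

For example, `legacy_node_ids_defined true 2500 false` succeeds, while `legacy_node_ids_defined true 3000 false` fails unless the third argument is `true`.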

So in the general case, you don't have to do anything and everything will continue to work as before. You only need to act when a generation updates 3000 or more nodes and you have your own hooks in policy-generation-finished.

In the latter case, you only need to source the file given in the RUDDER_NODE_IDS_PATH parameter (by default: /var/rudder/policy-generation-info/updated-nodeids) and use any of the variables defined in that file:

- RUDDER_UPDATED_POLICY_SERVER_IDS: the array of policy servers updated during the generation, sorted from root to immediate relays to farther relays
- RUDDER_UPDATED_NODE_IDS: the array of nodes updated during the generation, sorted alphanumerically
- RUDDER_NODE_IDS: the array of all updated elements, starting with policy servers, then simple nodes.
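
As a minimal sketch of what a migrated user hook might look like, the function below sources the file and prints each updated node ID on its own line. It assumes the file defines the variables as bash arrays (the list above calls them arrays, but the exact syntax is not shown here); the function name print_updated_nodes is hypothetical.

```shell
#!/bin/bash
# Sketch of a user hook helper; assumes the sourced file defines bash arrays.
# $1: path to the file, normally the value of RUDDER_NODE_IDS_PATH.
print_updated_nodes() {
  local ids_file="$1"
  if [ ! -r "$ids_file" ]; then
    echo "cannot read $ids_file" >&2
    return 1
  fi
  # Sourcing defines RUDDER_UPDATED_POLICY_SERVER_IDS, RUDDER_UPDATED_NODE_IDS
  # and RUDDER_NODE_IDS in the current shell.
  . "$ids_file"
  local id
  for id in "${RUDDER_UPDATED_NODE_IDS[@]}"; do
    echo "$id"
  done
}

# In a real hook, the call would look like:
# print_updated_nodes "${RUDDER_NODE_IDS_PATH:-/var/rudder/policy-generation-info/updated-nodeids}"
```

Iterating over the array one ID at a time avoids rebuilding the single long string that triggered the original error.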


Related issues

Has duplicate Rudder - Bug #15137: Hook Exit code=-2147483648 on one generation (linked to FullGC?) - Rejected