Bug #15011

Error at the end of a policy generation with too many nodes

Added by Nicolas CHARLES 5 months ago. Updated 3 months ago.

Status:
Released
Priority:
N/A
Category:
Performance and scalability
Target version:
Severity:
Major - prevents use of part of Rudder | no simple workaround
User visibility:
Infrequent - complex configurations | third party integrations
Effort required:
Priority:
73
Tags:

Description

I got the following error after clearing the cache with 2500 nodes:

[2019-06-01 12:59:22] WARN  explain_compliance.a04d8f89-27f4-4428-ad89-6ab4e45798a4a04d8f89-27f4-4428-ad89-6ab4e45798a4 - Received a run at 2019
-06-01T12:57:11.000Z for node 'a04d8f89-27f4-4428-ad89-6ab4e45798a4a04d8f89-27f4-4428-ad89-6ab4e45798a4' with configId '20190529-153902-d76426ef
' but that node should be sending reports for configId 20190531-224745-355b5fdd
Jun 01, 2019 12:59:22 PM com.zaxxer.nuprocess.linux.LinuxProcess start
WARNING: Failed to start process
java.io.IOException: error=7, Argument list too long
        at com.zaxxer.nuprocess.internal.LibJava8.Java_java_lang_UNIXProcess_forkAndExec(Native Method)
        at com.zaxxer.nuprocess.linux.LinuxProcess.start(LinuxProcess.java:109)
        at com.zaxxer.nuprocess.linux.LinProcessFactory.createProcess(LinProcessFactory.java:40)
        at com.zaxxer.nuprocess.NuProcessBuilder.start(NuProcessBuilder.java:266)
        at com.normation.rudder.hooks.RunNuCommand$.run(RunNuCommand.scala:153)
        at com.normation.rudder.hooks.RunHooks$.$anonfun$asyncRun$3(RunHooks.scala:186)
        at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:303)
        at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:37)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
        at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

[2019-06-01 12:59:22] INFO  policy.generation - Policy generation completed in:    2921610 ms
[2019-06-01 12:59:22] ERROR policy.generation - Error when updating policy, reason was: Exit code=-2147483648 for hook: '/opt/rudder/etc/hooks.d/policy-generation-finished/50-reload-policy-file-server'.
 stdout: 
 stderr: ''
[2019-06-01 12:59:22] INFO  policy.generation - Flag file '/opt/rudder/etc/policy-update-running' successfully removed
[2019-06-01 12:59:22] ERROR policy.generation - Policy update error for process '111' at 2019-06-01 12:59:22: Exit code=-2147483648 for hook: '/opt/rudder/etc/hooks.d/policy-generation-finished/50-reload-policy-file-server'.

Cause

As explained in the comments, the problem is that we build a string of all updated node IDs and give it to hooks through the environment variable RUDDER_NODE_IDS.
To do so, the JVM forks and passes the value in the process parameters. When the number of nodes grows large enough, we hit the ARG_MAX limit, which is platform specific.
On Linux, it is defined by default as 1/4 of ulimit -s, and since ulimit -s defaults to 8 MB, we get ARG_MAX = 2097152 bytes.
But it's a little less clear-cut than that: if we increase ulimit -s to, for example, 16 MB, the limit seen from the JVM is still 2097152 bytes - at least for OpenJDK 1.8.131.
So, perhaps it's a hardcoded limit, or perhaps the interpretation differs from JVM to JVM.
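This can be reproduced outside the JVM with a few lines of shell (a sketch, Linux only; the variable name and the 3 MB size are arbitrary):

```shell
# Pass an oversized environment variable to an exec'd child: the kernel
# rejects the execve with E2BIG, which the JVM surfaces as
# "error=7, Argument list too long".
big=$(head -c 3000000 /dev/zero | tr '\0' x)   # ~3 MB string
if BIG="$big" /usr/bin/env true 2>/dev/null; then
  result="exec succeeded"
else
  result="exec failed (Argument list too long)"
fi
echo "$result"
```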

Solution

Given that increasing ulimit -s on Linux does not increase the size of the string we can pass to the child process, we need to change the way we pass the parameter.
For Rudder 5.0.12 and up (and all more recent branches), the RUDDER_NODE_IDS parameter is deprecated and no longer documented in the hook template. It is replaced by a new documented parameter: RUDDER_NODE_IDS_PATH. That parameter contains the path to a file that can be sourced and contains the list of nodes updated during that generation. Sourcing the file will define the RUDDER_NODE_IDS variable if needed.

To avoid breaking existing user hooks, we still define the undocumented RUDDER_NODE_IDS parameter with the same format as in Rudder 5.0.11 and earlier if:

- there are user hooks present and executable in /opt/rudder/etc/hooks.d/policy-generation-finished/ AND there are fewer than 3000 updated nodes,
- OR the Rudder configuration file /opt/rudder/etc/rudder-web.properties contains the property rudder.hooks.policy-generation-finished.nodeids.compability=true.

So in the general case, you don't have to do anything and everything will continue to work as before. You only need to act when you have more than 3000 nodes and personal hooks in policy-generation-finished.

In the latter case, you only need to source the file given by the RUDDER_NODE_IDS_PATH parameter (by default: /var/rudder/policy-generation-info/updated-nodeids) and use any of the variables defined in that file:

- RUDDER_UPDATED_POLICY_SERVER_IDS: the array of policy servers updated during the generation, sorted from root, to immediate relays, to farther relays
- RUDDER_UPDATED_NODE_IDS: the array of nodes updated during the generation, sorted alphanumerically
- RUDDER_NODE_IDS: the array of all updated elements, starting with policy servers, then simple nodes.
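For instance, a migrated user hook could look like the sketch below (the sample file stands in for the generated one, and the echo is a placeholder for the hook's real action):

```shell
#!/bin/bash
# Stand-in for /var/rudder/policy-generation-info/updated-nodeids, following
# the format described above (bash arrays); node IDs are made up.
sample=$(mktemp)
cat > "$sample" <<'EOF'
RUDDER_UPDATED_POLICY_SERVER_IDS=(root)
RUDDER_UPDATED_NODE_IDS=(node-a node-b)
RUDDER_NODE_IDS=(root node-a node-b)
EOF

# In a real hook, RUDDER_NODE_IDS_PATH is provided by the policy generation.
. "${RUDDER_NODE_IDS_PATH:-$sample}"

# Placeholder action: list every updated element.
for id in "${RUDDER_NODE_IDS[@]}"; do
  echo "updated: $id"
done
rm -f "$sample"
```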


Related issues

Has duplicate Rudder - Bug #15137: Hook Exit code=-2147483648 on one generation (linked to FullGC?) (Rejected)

Associated revisions

Revision e3fde9c1 (diff)
Added by François ARMAND 4 months ago

Fixes #15011: Error at the end of a policy generation with too many nodes

History

#1

Updated by Nicolas CHARLES 5 months ago

  • Description updated (diff)
#2

Updated by François ARMAND 5 months ago

This error is strange because we use an environment variable to transfer big parameters precisely to avoid that. So I'm wondering whether the reported error is a false one and the real problem is something else. As the system was under load testing, there was a lot of pressure on memory / GC / threads / timeouts.

#3

Updated by François ARMAND 5 months ago

  • Related to Bug #14465: Generation fails with "Task FutureTask rejected from ThreadPoolExecutor" due to timeout in JS computation added
#4

Updated by François ARMAND 4 months ago

  • Related to Bug #15137: Hook Exit code=-2147483648 on one generation (linked to FullGC?) added
#5

Updated by François ARMAND 4 months ago

I can reproduce it systematically by changing in DeploymentService.scala the list of nodes passed to /policy-generation-finished hooks:

...
  , ("RUDDER_NODE_IDS",  "fb7b6e08-b794-473b-8e47-d309fc09f9f0 " * 3600) //updatedNodeIds.mkString(" "))
...

This leads to:

Jun 27, 2019 6:17:39 PM com.zaxxer.nuprocess.linux.LinuxProcess start
WARNING: Failed to start process
java.io.IOException: error=7, Argument list too long
...
#6

Updated by François ARMAND 4 months ago

So, to be more precise, the "argument list too long" error happens when the sum of the env variable name size and the variable value size exceeds 131070 chars.
That corresponds to 3542 node UUIDs.
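As a sanity check on that figure (assuming 36-char UUIDs joined by single spaces, i.e. 37 chars per entry):

```shell
# 131070 chars divided by 37 chars per entry (36-char UUID + 1 separator)
echo $(( 131070 / 37 ))   # prints 3542
```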

#7

Updated by François ARMAND 4 months ago

So, the problem is that we hit the ARG_MAX value, which is given by getconf ARG_MAX. By default, on Linux, we have:

% getconf ARG_MAX
2097152

And when we reach 3542 nodes, we have a string of size 131070, and each char is 16 bits => 131070 * 16 = 2097120

So, we have the culprit. Normally, the good news is that on modern (>2.6 :) Linux kernels, ARG_MAX is defined as ARG_MAX = ulimit -s / 4, so we would just need to increase ulimit -s... But that doesn't work: Java seems to have hardcoded ARG_MAX somewhere to the above value.
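One likely explanation for the "hardcoded" behaviour: besides the total ARG_MAX budget, the Linux kernel also caps each single argument or environment string at MAX_ARG_STRLEN (32 pages, i.e. 131072 bytes with 4 kB pages), independently of ulimit -s - which matches the 131070-char threshold from comment #6. Both limits can be inspected from the shell:

```shell
# Total budget for argv + environ; on Linux this is the stack soft limit / 4.
getconf ARG_MAX
stack_kb=$(ulimit -s)                     # in kB, may be "unlimited"
if [ "$stack_kb" != "unlimited" ]; then
  echo "stack/4: $(( stack_kb * 1024 / 4 )) bytes"
fi

# Per-string cap (MAX_ARG_STRLEN = 32 pages), whatever ulimit -s says.
page=$(getconf PAGE_SIZE)
echo "per-string cap: $(( 32 * page )) bytes"
```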

So, we have the following choices:

- find out why Java (or NuProcess) doesn't honor a change of ulimit -s. It could be that the value is not correctly passed to the process, or that Java enforces a hardcoded value.
- remove the node ID list from the hook parameters (or make it optional somehow). It is provided but not used by the current default hooks.
- dump the list of updated nodes to a file, and pass the file path to the hook. This is the most future-proof solution, and it has the side effect of providing a log of updated nodes (which also means file rotation will be needed).

#8

Updated by François ARMAND 4 months ago

  • Tags set to Sponsored
  • Severity set to Major - prevents use of part of Rudder | no simple workaround
  • User visibility set to Infrequent - complex configurations | third party integrations
  • Priority changed from 0 to 73

The problem is not critical because even if the generation is marked "failed" because of the script error, node configuration and compliance still work. This is because the failing part is a post-generation hook: at that point everything is already generated and saved in the database.
Nonetheless, at least the hook /opt/rudder/etc/hooks.d/policy-generation-finished/50-reload-policy-file-server is not correctly executed, which means that cf-serverd is not reloaded, which can lead to node permission problems when they fetch their promises.

#9

Updated by François ARMAND 4 months ago

  • Related to deleted (Bug #15137: Hook Exit code=-2147483648 on one generation (linked to FullGC?))
#10

Updated by François ARMAND 4 months ago

  • Has duplicate Bug #15137: Hook Exit code=-2147483648 on one generation (linked to FullGC?) added
#11

Updated by François ARMAND 4 months ago

  • Status changed from New to In progress
  • Assignee set to François ARMAND
#12

Updated by François ARMAND 4 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/2287
#13

Updated by François ARMAND 4 months ago

To avoid any breaking change for people with fewer than 3000 nodes and personal post-generation hooks, we will:

- add a configuration parameter, "rudder.compatibility.post-policy-hook-nodeids", defaulting to "undefined", which means that RUDDER_NODE_IDS is still generated,
- if we reach an arbitrary limit (3000 nodes) and the installation has personal post-generation hooks, fail with a warning message referring to this ticket,
- if some platform has a limit far below 3000 nodes, people will likely ask us or find this ticket, with the workaround (to be added in the description) of setting the configuration directive to false,
- if a user finds a way to increase the limit on his platform, or the limit is much higher than 3000 nodes on his platform, and he doesn't want to change his personal hooks, he can force the configuration directive to "true".
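The resulting decision could be sketched as follows (hypothetical function and argument names; the real implementation lives in the Rudder Scala code base):

```shell
# Decide whether the legacy RUDDER_NODE_IDS variable is still defined.
# $1: value of the compatibility property (true / false / undefined)
# $2: "yes" if user hooks exist in policy-generation-finished/, else "no"
# $3: number of updated nodes
should_define_node_ids() {
  case "$1" in
    true)  return 0 ;;   # forced on
    false) return 1 ;;   # forced off
    *)     [ "$2" = "yes" ] && [ "$3" -lt 3000 ] ;;
  esac
}

should_define_node_ids undefined yes 2500 && echo "RUDDER_NODE_IDS defined"
should_define_node_ids undefined yes 3600 || echo "RUDDER_NODE_IDS skipped"
```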

#14

Updated by François ARMAND 4 months ago

  • Description updated (diff)
#15

Updated by François ARMAND 4 months ago

  • Related to deleted (Bug #14465: Generation fails with "Task FutureTask rejected from ThreadPoolExecutor" due to timeout in JS computation)
#16

Updated by Rudder Quality Assistant 4 months ago

  • Status changed from Pending technical review to Discussion
  • Assignee changed from Nicolas CHARLES to François ARMAND
#17

Updated by François ARMAND 4 months ago

  • Status changed from Discussion to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
#18

Updated by Rudder Quality Assistant 4 months ago

  • Assignee changed from Nicolas CHARLES to François ARMAND
#19

Updated by François ARMAND 4 months ago

  • Status changed from Pending technical review to Pending release
#22

Updated by Nicolas CHARLES 4 months ago

  • Description updated (diff)
#23

Updated by Vincent MEMBRÉ 3 months ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 5.0.12 which was released today.
