Bug #10457

closed

Hook failed with fork: retry: No child processes

Added by Janos Mattyasovszky over 7 years ago. Updated over 7 years ago.

Status: Released
Priority: N/A
Category: System integration
Target version:
Severity: Major - prevents use of part of Rudder | no simple workaround
UX impact:
User visibility: Infrequent - complex configurations | third party integrations
Effort required:
Priority: 25
Name check:
Fix check:
Regression:

Description

I got an error after I found #10456:

[2017-03-17 15:31:18] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Policy generation completed in 93326 ms
[2017-03-17 15:31:18] ERROR com.normation.rudder.batch.AsyncDeploymentAgent$DeployerAgent - Error when updating policy, reason Cannot write configuration node <- Exit code=1 for hook: '/opt/rudder/etc/hooks.d/policy-generation-node-ready/10-cf-promise-check' with environment variables: [PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin] [NLSPATH:/usr/dt/lib/nls/msg/%L/%N.cat] [OLDPWD:/] [XFILESEARCHPATH:/usr/dt/app-defaults/%L/Dt] [PWD:/opt/rudder/jetty7] [SHLVL:1] [_:/usr/bin/java] [RUDDER_GENERATION_DATETIME:2017-03-17T15:29:45.296+01:00] [RUDDER_NODEID:61053f9f-b3de-4290-9eda-bc4fe1567233] [RUDDER_NEXT_POLICIES_DIRECTORY:/var/rudder/share/61053f9f-b3de-4290-9eda-bc4fe1567233/rules.new/cfengine-community] [RUDDER_AGENT_TYPE:cfengine-community].
  Stdout: '   error: Can't stat file '/var/rudder/ncf//var/rudder/ncf/common/10_ncf_internals/list-compatible-inputs: fork: retry: No child processes' for parsing. (stat: No such file or directory)
'
  Stderr: ''
[2017-03-17 15:31:18] ERROR com.normation.rudder.batch.AsyncDeploymentAgent - Policy update error for process '12' at 2017-03-17 15:31:18: Cannot write configuration node

Could this be a resource limit (such as nofile) being hit, so that the process cannot fork?

Actions #1

Updated by Janos Mattyasovszky over 7 years ago

hah, found it:

[ 7219.731466] cgroup: fork rejected by pids controller in /system.slice/rudder-jetty.service
[12893.159767] cgroup: fork rejected by pids controller in /system.slice/rudder-jetty.service
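
For anyone hitting the same symptom, here is a minimal sketch of how the kernel messages above could be scanned to see which cgroup is being refused forks. The function name `rejected_cgroups` and the regular expression are illustrative assumptions based only on the two log lines quoted here; they are not part of Rudder.

```python
import re

# Pattern derived from the kernel message quoted above; it captures the cgroup path.
PIDS_REJECT = re.compile(
    r"cgroup: fork rejected by pids controller in (?P<cgroup>\S+)"
)


def rejected_cgroups(kernel_log_lines):
    """Count pids-controller fork rejections per cgroup path."""
    counts = {}
    for line in kernel_log_lines:
        match = PIDS_REJECT.search(line)
        if match:
            cgroup = match.group("cgroup")
            counts[cgroup] = counts.get(cgroup, 0) + 1
    return counts


# The two lines from this report, used as sample input.
sample = [
    "[ 7219.731466] cgroup: fork rejected by pids controller in /system.slice/rudder-jetty.service",
    "[12893.159767] cgroup: fork rejected by pids controller in /system.slice/rudder-jetty.service",
]
print(rejected_cgroups(sample))
```

In practice the input lines would come from the kernel log (for example the output of `dmesg`), read line by line.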
Actions #2

Updated by Janos Mattyasovszky over 7 years ago

According to https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#TasksMax=N, the fix would be to include this line in the unit file (which currently is auto-generated):

TasksMax=infinity

will test this.

Actions #3

Updated by Janos Mattyasovszky over 7 years ago

I copied the auto-generated unit file to /etc/systemd/system and added the missing line:

sles12# systemctl cat rudder-jetty
# /etc/systemd/system/rudder-jetty.service
[Unit]
SourcePath=/etc/init.d/rudder-jetty
After=remote-fs.target network-online.target
Wants=remote-fs.target network-online.target

[Service]
Type=forking
# The following line was added:
TasksMax=infinity
Restart=no
TimeoutSec=5min
IgnoreSIGPIPE=no
KillMode=process
GuessMainPID=no
RemainAfterExit=yes
ExecStart=/etc/init.d/rudder-jetty start
ExecStop=/etc/init.d/rudder-jetty stop
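
To double-check that the override is really in the effective unit file, something like the following could be used. It is a rough sketch, not a full systemd unit parser: the helper name `service_setting` is hypothetical, and it only handles plain `Key=Value` lines, section headers and comment lines.

```python
def service_setting(unit_text, key):
    """Return the last value assigned to `key` in the [Service] section, or None."""
    section = None
    value = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        # Track which section we are currently in.
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Service" and "=" in line:
            name, _, val = line.partition("=")
            if name.strip() == key:
                value = val.strip()
    return value


# Shortened copy of the unit file shown above.
unit = """\
# /etc/systemd/system/rudder-jetty.service
[Unit]
SourcePath=/etc/init.d/rudder-jetty

[Service]
Type=forking
TasksMax=infinity
ExecStart=/etc/init.d/rudder-jetty start
"""
print(service_setting(unit, "TasksMax"))
```

The text passed in could be the output of `systemctl cat rudder-jetty`, as in the listing above.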
Actions #4

Updated by Alexis Mousset over 7 years ago

  • Category set to System integration
  • Target version set to 4.1.0
  • Severity set to Major - prevents use of part of Rudder | no simple workaround
Actions #5

Updated by François ARMAND over 7 years ago

Well, perhaps it's better if we cap the number of parallel hooks to, say, 50 (or "number of CPUs + 1", or a configurable parameter)? That won't change the throughput, but it will certainly put less stress on the system and avoid these limits.

Actions #6

Updated by Janos Mattyasovszky over 7 years ago

I'd be happy with nproc --all, the only problem is, what if I scale my VM during operations up, and give it more cores? Would I have to restart jetty then? Could this maybe be checked at each time a policy generation is started?

Actions #7

Updated by François ARMAND over 7 years ago

Oh yes, the thread pool and manager logic is created each time. But I will make sure of that, thanks for pointing out that use case.
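
The idea discussed above (cap hook parallelism at "number of CPUs + 1" and re-read the CPU count at each generation) can be sketched as follows. This is only an illustration in Python, not the actual Rudder code; `hook_parallelism`, `run_hooks_bounded` and the `run_hook` callback are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def hook_parallelism():
    """Worker cap of 'number of CPUs + 1', read at call time."""
    return (os.cpu_count() or 1) + 1


def run_hooks_bounded(node_ids, run_hook):
    """Run `run_hook(node_id)` for every node with bounded parallelism.

    A new pool is created on every call, so a changed CPU count is picked
    up by the next call without restarting the process.
    """
    with ThreadPoolExecutor(max_workers=hook_parallelism()) as pool:
        # pool.map preserves input order and re-raises the first hook error.
        return list(pool.map(run_hook, node_ids))


if __name__ == "__main__":
    # Stand-in for launching a real hook process for each node.
    results = run_hooks_bounded(range(10), lambda node_id: f"node-{node_id}: ok")
    print(results)
```

Because at most `hook_parallelism()` hooks run at once, the number of simultaneously forked processes stays far below a per-service task limit, whatever the number of nodes.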

Actions #8

Updated by François ARMAND over 7 years ago

  • User visibility set to Infrequent - complex configurations | third party integrations
Actions #9

Updated by François ARMAND over 7 years ago

  • Status changed from New to In progress
  • Assignee set to François ARMAND
Actions #10

Updated by François ARMAND over 7 years ago

OK, so when using a real task manager, I get more consistent results, around 10% better. But performance is hard to measure, etc.

Before:

Write node configurations :      91750 ms
...
Write node configurations :      85166 ms
...
Write node configurations :      95879 ms

After:

Write node configurations :      79947 ms
...
Write node configurations :      79191 ms
...
Write node configurations :      75608 ms
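
For reference, the relative gain can be computed from the six timings above. This is only a quick sketch over three samples per side, so the result is indicative at best; the helper name `improvement_percent` is made up for the example.

```python
from statistics import mean

# "Write node configurations" durations copied from the runs above, in ms.
before_ms = [91750, 85166, 95879]
after_ms = [79947, 79191, 75608]


def improvement_percent(before, after):
    """Relative reduction of the mean duration, in percent."""
    return 100 * (1 - mean(after) / mean(before))


print(f"before: {mean(before_ms):.0f} ms, after: {mean(after_ms):.0f} ms")
print(f"improvement: {improvement_percent(before_ms, after_ms):.1f} %")
```

On these samples the mean reduction comes out a bit above the "around 10%" estimate, which fits given how much the individual runs vary.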

See pull requests for details.

Actions #11

Updated by François ARMAND over 7 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/1608
Actions #12

Updated by Nicolas CHARLES over 7 years ago

Without the PR, for 1602 nodes:
[2017-03-23 14:37:58] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 55878 ms
[2017-03-23 14:46:10] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 57303 ms
[2017-03-23 14:47:35] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 39640 ms
[2017-03-23 14:48:42] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 33611 ms
[2017-03-23 14:50:04] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 35395 ms

With this PR:
[2017-03-23 15:09:16] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 34423 ms
[2017-03-23 15:10:31] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 40921 ms
[2017-03-23 15:12:14] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 48889 ms
[2017-03-23 15:14:09] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 32235 ms

Note that this was measured on a laptop, so the numbers are not really reliable.

Actions #13

Updated by François ARMAND over 7 years ago

  • Status changed from Pending technical review to Pending release
Actions #14

Updated by Janos Mattyasovszky over 7 years ago

Without this PR, on 32 CPUs and 7000+ nodes:

Dunno, it never finished, and I stopped it after 9+ hours

With this PR (same system):

About 28 minutes in total (just the base policy, no rules/directives).

Actions #15

Updated by Benoît PECCATTE over 7 years ago

  • Priority set to 25
Actions #16

Updated by Benoît PECCATTE over 7 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 4.1.0 which was released today.
