Bug #10457
closed
Hook failed with fork: retry: No child processes
Added by Janos Mattyasovszky over 7 years ago.
Updated over 7 years ago.
Category: System integration
Severity: Major - prevents use of part of Rudder | no simple workaround
User visibility: Infrequent - complex configurations | third party integrations
Description
I got an error after I found #10456:
[2017-03-17 15:31:18] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Policy generation completed in 93326 ms
[2017-03-17 15:31:18] ERROR com.normation.rudder.batch.AsyncDeploymentAgent$DeployerAgent - Error when updating policy, reason Cannot write configuration node <- Exit code=1 for hook: '/opt/rudder/etc/hooks.d/policy-generation-node-ready/10-cf-promise-check' with environment variables: [PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin] [NLSPATH:/usr/dt/lib/nls/msg/%L/%N.cat] [OLDPWD:/] [XFILESEARCHPATH:/usr/dt/app-defaults/%L/Dt] [PWD:/opt/rudder/jetty7] [SHLVL:1] [_:/usr/bin/java] [RUDDER_GENERATION_DATETIME:2017-03-17T15:29:45.296+01:00] [RUDDER_NODEID:61053f9f-b3de-4290-9eda-bc4fe1567233] [RUDDER_NEXT_POLICIES_DIRECTORY:/var/rudder/share/61053f9f-b3de-4290-9eda-bc4fe1567233/rules.new/cfengine-community] [RUDDER_AGENT_TYPE:cfengine-community].
Stdout: ' error: Can't stat file '/var/rudder/ncf//var/rudder/ncf/common/10_ncf_internals/list-compatible-inputs: fork: retry: No child processes' for parsing. (stat: No such file or directory)
'
Stderr: ''
[2017-03-17 15:31:18] ERROR com.normation.rudder.batch.AsyncDeploymentAgent - Policy update error for process '12' at 2017-03-17 15:31:18: Cannot write configuration node
Not sure whether this is caused by a resource limit such as nofiles, so that it cannot fork?
hah, found it:
[ 7219.731466] cgroup: fork rejected by pids controller in /system.slice/rudder-jetty.service
[12893.159767] cgroup: fork rejected by pids controller in /system.slice/rudder-jetty.service
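A minimal sketch of how such kernel messages could be spotted automatically, assuming the log lines have already been collected (for example from dmesg output). The function name `rejected_cgroups` and the regular expression are illustrative assumptions, not part of Rudder:

```python
import re

# Matches the kernel message quoted above and captures the cgroup path.
PIDS_REJECT = re.compile(
    r"cgroup: fork rejected by pids controller in (?P<cgroup>\S+)"
)

def rejected_cgroups(kernel_log_lines):
    """Count 'fork rejected by pids controller' messages per cgroup path."""
    counts = {}
    for line in kernel_log_lines:
        match = PIDS_REJECT.search(line)
        if match:
            path = match.group("cgroup")
            counts[path] = counts.get(path, 0) + 1
    return counts

# The two lines quoted in this report.
sample = [
    "[ 7219.731466] cgroup: fork rejected by pids controller in /system.slice/rudder-jetty.service",
    "[12893.159767] cgroup: fork rejected by pids controller in /system.slice/rudder-jetty.service",
]
print(rejected_cgroups(sample))
```

Any service that shows up in the result is hitting its task limit, which is what points at the unit file of rudder-jetty here.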
Copied the auto-generated unit file to /etc, and added the missing line:
sles12# systemctl cat rudder-jetty
# /etc/systemd/system/rudder-jetty.service
[Unit]
SourcePath=/etc/init.d/rudder-jetty
After=remote-fs.target network-online.target
Wants=remote-fs.target network-online.target
[Service]
Type=forking
# Added this line:
TasksMax=infinity
Restart=no
TimeoutSec=5min
IgnoreSIGPIPE=no
KillMode=process
GuessMainPID=no
RemainAfterExit=yes
ExecStart=/etc/init.d/rudder-jetty start
ExecStop=/etc/init.d/rudder-jetty stop
- Category set to System integration
- Target version set to 4.1.0
- Severity set to Major - prevents use of part of Rudder | no simple workaround
Well, perhaps it's better if we cap the number of parallel hooks to, say, 50? (Or "number of CPUs + 1", or a configurable parameter.) That won't change the throughput, but it will certainly stress the system less and avoid these limits.
I'd be happy with nproc --all; the only problem is: what if I scale my VM up during operations and give it more cores? Would I have to restart jetty then? Could this maybe be checked each time a policy generation is started?
Oh yes, the thread pool and manager logic is created each time. But I will make sure of that, thanks for pointing out that use case.
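The idea discussed above (cap the number of hooks running in parallel at "number of CPUs + 1", and re-evaluate that number on every run) can be sketched as follows. This is only an illustration in Python under stated assumptions, not the code from the pull request; `hook_parallelism` and `run_hooks` are hypothetical names, and `os.cpu_count()` stands in for nproc --all:

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def hook_parallelism():
    """CPU count + 1, recomputed on every call so that CPUs added to a
    running VM are picked up without restarting the service."""
    return (os.cpu_count() or 1) + 1

def run_hooks(commands, max_workers=None):
    """Run each command (an argv list) with bounded parallelism.

    Returns the exit codes in the same order as the commands.
    """
    workers = max_workers or hook_parallelism()

    def run_one(argv):
        result = subprocess.run(
            argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
        return result.returncode

    # A new pool per run: at most `workers` child processes exist at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))

# Hypothetical usage: four trivial "hooks" that just exit successfully.
codes = run_hooks([[sys.executable, "-c", "pass"]] * 4)
print(codes)
```

Because the pool is bounded, the number of simultaneously forked children no longer grows with the number of nodes, which should keep the service under limits such as the pids controller.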
- User visibility set to Infrequent - complex configurations | third party integrations
- Status changed from New to In progress
- Assignee set to François ARMAND
OK, so when using a real task manager, I get more consistent results, around 10% better. But performance measurements are hard, etc.
Before:
Write node configurations : 91750 ms
...
Write node configurations : 85166 ms
...
Write node configurations : 95879 ms
After:
Write node configurations : 79947 ms
...
Write node configurations : 79191 ms
...
Write node configurations : 75608 ms
See pull requests for details.
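As a rough check of the figures above, the mean of the three "Before" and three "After" timings can be compared. This is a small sketch on the quoted numbers only; with three runs each it is indicative at best, and `improvement_pct` is an illustrative helper name:

```python
from statistics import mean

# "Write node configurations" timings quoted above, in milliseconds.
before_ms = [91750, 85166, 95879]
after_ms = [79947, 79191, 75608]

def improvement_pct(before, after):
    """Relative reduction of the mean duration, in percent."""
    return (mean(before) - mean(after)) / mean(before) * 100

print(f"{improvement_pct(before_ms, after_ms):.1f}%")  # prints "13.9%"
```

On these samples the mean drops by roughly 14%, and the spread between runs is also smaller after the change (about 4.3 s versus about 10.7 s).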
- Status changed from In progress to Pending technical review
- Assignee changed from François ARMAND to Nicolas CHARLES
- Pull Request set to https://github.com/Normation/rudder/pull/1608
Without the PR, for 1602 nodes:
[2017-03-23 14:37:58] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 55878 ms
[2017-03-23 14:46:10] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 57303 ms
[2017-03-23 14:47:35] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 39640 ms
[2017-03-23 14:48:42] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 33611 ms
[2017-03-23 14:50:04] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 35395 ms
With this PR:
[2017-03-23 15:09:16] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 34423 ms
[2017-03-23 15:10:31] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 40921 ms
[2017-03-23 15:12:14] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 48889 ms
[2017-03-23 15:14:09] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Write node configurations : 32235 ms
Note that this was measured on a laptop, so the numbers are not really reliable.
- Status changed from Pending technical review to Pending release
Without this PR on 32cpus and 7000+ nodes:
Dunno, it never finished, and I stopped it after 9+ hours
With this PR (same system):
Sum ~28 minutes (just base policy, no rules/directives).
- Status changed from Pending release to Released
This bug has been fixed in Rudder 4.1.0 which was released today.