Bug #12604
closed
Generation gets stuck when cf-serverd is not running
Description
Seen on SLES 11 with Rudder 4.1.7, but should happen everywhere.
When cf-serverd is not running and a policy generation is triggered, it gets stuck with:
root 13935 72.4 13.0 2810272 424520 pts/0 Sl 05:48 1:36 /usr/java/latest/bin/java -server -Xms1024m -Xmx1024m -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Drudder.configFile=/opt/rudder/etc/rudder-w
root 16307 0.0 0.0 0 0 pts/0 Z 05:50 0:00 \_ [rudder-reload-c] <defunct>
There is no way to reset the generation state, so we have to restart jetty to start a new generation.
Updated by Alexis Mousset over 6 years ago
- Description updated (diff)
- Target version set to 4.1.12
Updated by Vincent MEMBRÉ over 6 years ago
- Target version changed from 4.1.12 to 4.1.13
Updated by François ARMAND over 6 years ago
I can't reproduce it in Ubuntu16.04 (rudder 4.1.12).
When cf-execd is not running, what is the result of running the content of the hook `policy-generation-finished/50-reload-policy-file-server`, i.e.:
exec /opt/rudder/bin/rudder-reload-cf-serverd
On Ubuntu, we get a happy exit status 0 (I'm not sure that's what we want, but hey! No zombie at least!)
Updated by Benoît PECCATTE over 6 years ago
- Target version changed from 4.1.13 to 411
Updated by Benoît PECCATTE over 6 years ago
- Target version changed from 411 to 4.1.13
Updated by Benoît PECCATTE over 6 years ago
- Category set to System integration
- Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
- User visibility set to Infrequent - complex configurations | third party integrations
- Priority changed from 0 to 65
Updated by Félix DALLIDET over 6 years ago
I tried on a SLES 12 SP1 with rudder 4.1.12.
/opt/rudder/bin/rudder-reload-cf-serverd tries to kill cf-serverd by its PID, which does not exist, so the kill fails.
But it then restarts the rudder-agent service, which resolves the problem.
Promise generation gets stuck if triggered manually. It seems to resolve itself when a real generation (with policy changes) is triggered, or at the next rudder agent execution on the server.
Updated by Félix DALLIDET over 6 years ago
I reproduced it on a SLES11 SP4 with rudder 4.3.3
The generation gets stuck forever when cf-serverd is killed.
Updated by Félix DALLIDET over 6 years ago
- Status changed from New to In progress
Updated by Félix DALLIDET over 6 years ago
- Status changed from In progress to New
Some observations:
minor:
The file /opt/rudder/bin/rudder-reload-cf-serverd seems to fail when trying to kill an already killed PID. Fixing it by testing whether cf-serverd is running before trying to kill it does not resolve the generation issue.
other:
By redirecting all output to a file, I found that restarting the service fails when executing /var/rudder/cfengine-community/bin/cf-serverd, with the error:
error: Unable to kill expired process 17155 from lock lock.server_cfengine_bundle.server_cfengine.-server._var_rudder_cfengine_community_inputs_promises_cf_4657_MD5=ce9ed37b30768afd47469328cfbf1bdc (probably process not found or permission denied)
It seems to be related to https://tracker.mender.io/browse/CFE-2824
I was not able to find which process is being killed.
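The "minor" observation above (check that the PID is alive before trying to kill it) can be sketched as follows. This is only an illustration in Python, not the content of rudder-reload-cf-serverd; the function names `pid_is_running` and `kill_if_running` are hypothetical. It relies on the POSIX convention that sending signal 0 probes a process without affecting it.

```python
import os
import signal


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID exists.

    Signal 0 performs the existence and permission checks only;
    no signal is actually delivered.
    """
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # No such process: the PID is stale.
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def kill_if_running(pid: int, sig: int = signal.SIGTERM) -> bool:
    """Send sig only if the PID is alive; return whether a signal was sent."""
    if not pid_is_running(pid):
        return False
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        # The process exited between the check and the kill.
        return False
    return True
```

As noted above, a guard of this kind avoids the failed kill but does not by itself fix the stuck generation.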
Updated by Félix DALLIDET over 6 years ago
When the cf-serverd service is started, it inherits all the file descriptors from the webapp. Launching cf-serverd with "startproc" closes all the unwanted file descriptors.
As pointed out by François, it seems to be a known bug, recently fixed: https://github.com/brettwooldridge/NuProcess/issues/13
Here are the two lists of file descriptors used by cf-serverd, one for each case:
The startproc case:
dr-x------ 2 root root 0 15 juin 05:00 .
dr-xr-xr-x 8 root root 0 15 juin 05:00 ..
lrwx------ 1 root root 64 15 juin 05:00 0 -> /dev/null
lrwx------ 1 root root 64 15 juin 05:00 1 -> /dev/null
lrwx------ 1 root root 64 15 juin 05:00 2 -> /dev/null
lrwx------ 1 root root 64 15 juin 05:00 3 -> socket:[486961]
lrwx------ 1 root root 64 15 juin 05:00 4 -> socket:[486962]
lrwx------ 1 root root 64 15 juin 05:00 5 -> socket:[486964]
lrwx------ 1 root root 64 15 juin 05:00 6 -> socket:[486966]
The unwanted and actual case:
dr-x------ 2 root root 0 15 juin 04:43 .
dr-xr-xr-x 8 root root 0 15 juin 04:37 ..
lrwx------ 1 root root 64 15 juin 04:43 0 -> /dev/null
lrwx------ 1 root root 64 15 juin 04:43 1 -> /dev/null
lr-x------ 1 root root 64 15 juin 04:43 14 -> /dev/random
lr-x------ 1 root root 64 15 juin 04:43 15 -> /dev/urandom
lr-x------ 1 root root 64 15 juin 04:43 16 -> /dev/random
lr-x------ 1 root root 64 15 juin 04:43 17 -> /dev/random
lr-x------ 1 root root 64 15 juin 04:43 18 -> /dev/urandom
lrwx------ 1 root root 64 15 juin 04:43 180 -> socket:[464556]
lrwx------ 1 root root 64 15 juin 04:43 182 -> socket:[465993]
lrwx------ 1 root root 64 15 juin 04:43 183 -> socket:[9472]
lrwx------ 1 root root 64 15 juin 04:43 184 -> socket:[9477]
lrwx------ 1 root root 64 15 juin 04:43 185 -> socket:[9482]
lrwx------ 1 root root 64 15 juin 04:43 186 -> socket:[464941]
lrwx------ 1 root root 64 15 juin 04:43 187 -> socket:[465974]
lrwx------ 1 root root 64 15 juin 04:43 188 -> socket:[468963]
lrwx------ 1 root root 64 15 juin 04:43 189 -> socket:[470685]
lr-x------ 1 root root 64 15 juin 04:43 19 -> /dev/urandom
lrwx------ 1 root root 64 15 juin 04:43 190 -> socket:[465970]
lrwx------ 1 root root 64 15 juin 04:43 191 -> socket:[467080]
lrwx------ 1 root root 64 15 juin 04:43 192 -> socket:[466007]
lrwx------ 1 root root 64 15 juin 04:43 193 -> socket:[469007]
lrwx------ 1 root root 64 15 juin 04:43 194 -> socket:[470703]
lrwx------ 1 root root 64 15 juin 04:43 195 -> socket:[479609]
lrwx------ 1 root root 64 15 juin 04:43 196 -> socket:[464907]
lrwx------ 1 root root 64 15 juin 04:43 197 -> socket:[479694]
lrwx------ 1 root root 64 15 juin 04:43 198 -> socket:[466000]
lrwx------ 1 root root 64 15 juin 04:43 199 -> socket:[467825]
lrwx------ 1 root root 64 15 juin 04:43 2 -> /dev/null
lrwx------ 1 root root 64 15 juin 04:43 200 -> socket:[467796]
lrwx------ 1 root root 64 15 juin 04:43 201 -> socket:[464871]
lrwx------ 1 root root 64 15 juin 04:43 202 -> socket:[464897]
lrwx------ 1 root root 64 15 juin 04:43 203 -> socket:[467085]
lrwx------ 1 root root 64 15 juin 04:43 205 -> socket:[467817]
lrwx------ 1 root root 64 15 juin 04:43 206 -> socket:[464915]
lrwx------ 1 root root 64 15 juin 04:43 207 -> socket:[467822]
lrwx------ 1 root root 64 15 juin 04:43 208 -> socket:[455990]
lrwx------ 1 root root 64 15 juin 04:43 209 -> socket:[9989]
lrwx------ 1 root root 64 15 juin 04:43 210 -> socket:[9995]
lr-x------ 1 root root 64 15 juin 04:43 211 -> pipe:[9996]
l-wx------ 1 root root 64 15 juin 04:43 212 -> pipe:[9996]
lrwx------ 1 root root 64 15 juin 04:43 213 -> anon_inode:[eventpoll]
lr-x------ 1 root root 64 15 juin 04:43 214 -> pipe:[479719]
lrwx------ 1 root root 64 15 juin 04:43 215 -> socket:[465959]
lrwx------ 1 root root 64 15 juin 04:43 217 -> socket:[466004]
lrwx------ 1 root root 64 15 juin 04:43 219 -> anon_inode:[eventpoll]
lrwx------ 1 root root 64 15 juin 04:43 222 -> socket:[23019]
l-wx------ 1 root root 64 15 juin 04:43 224 -> pipe:[479720]
l-wx------ 1 root root 64 15 juin 04:43 226 -> pipe:[479721]
lrwx------ 1 root root 64 15 juin 04:43 28 -> socket:[465997]
lrwx------ 1 root root 64 15 juin 04:43 3 -> socket:[480055]
lrwx------ 1 root root 64 15 juin 04:43 4 -> socket:[480056]
l-wx------ 1 root root 64 15 juin 04:43 5 -> /var/log/rudder/core/rudder-webapp-2018-06-14.log66569034707068.tmp (deleted)
lrwx------ 1 root root 64 15 juin 04:43 6 -> socket:[9452]
lrwx------ 1 root root 64 15 juin 04:43 7 -> socket:[480058]
lrwx------ 1 root root 64 15 juin 04:43 8 -> socket:[480060]
l-wx------ 1 root root 64 15 juin 04:43 81 -> /var/log/rudder/compliance/non-compliant-reports-2018-06-14.log81880899162011.tmp (deleted)
lrwx------ 1 root root 64 15 juin 04:43 86 -> socket:[9466]
lrwx------ 1 root root 64 15 juin 04:43 87 -> socket:[9468]
lrwx------ 1 root root 64 15 juin 04:43 88 -> socket:[9470]
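To illustrate the file descriptor inheritance described in this comment, here is a minimal, Linux/POSIX-only sketch in Python. It is not the Rudder webapp code (which runs on the JVM) and not the change made in the pull request; the names `PROBE`, `fd_state_in_child` and `demo` are hypothetical. A pipe end stands in for one of the webapp's sockets or log files, and the child reports whether that descriptor number is still open after being spawned with and without descriptor closing.

```python
import fcntl
import os
import subprocess
import sys

# Child program: report whether the fd number given as argv[1] is open.
PROBE = (
    "import os, sys\n"
    "try:\n"
    "    os.fstat(int(sys.argv[1]))\n"
    "    print('open')\n"
    "except OSError:\n"
    "    print('closed')\n"
)


def fd_state_in_child(fd: int, close_fds: bool) -> str:
    """Spawn a child and return 'open' or 'closed' for fd as seen by the child."""
    result = subprocess.run(
        [sys.executable, "-c", PROBE, str(fd)],
        close_fds=close_fds,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def demo() -> tuple[str, str]:
    """Return the child's view of an inheritable fd: (not closed, closed)."""
    read_end, write_end = os.pipe()
    # Duplicate to a high fd number (>= 100) so it cannot collide with
    # descriptors the child opens for itself, then mark it inheritable.
    leaked = fcntl.fcntl(read_end, fcntl.F_DUPFD, 100)
    os.set_inheritable(leaked, True)
    try:
        return (
            fd_state_in_child(leaked, close_fds=False),
            fd_state_in_child(leaked, close_fds=True),
        )
    finally:
        for fd in (read_end, write_end, leaked):
            os.close(fd)
```

Calling `demo()` should show the descriptor open in the first child and closed in the second, which mirrors the difference between the two listings above: a daemon started without closing inherited descriptors keeps the parent's sockets, pipes and log files open.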
Updated by François ARMAND over 6 years ago
- Status changed from New to In progress
- Assignee changed from Félix DALLIDET to François ARMAND
Updated by François ARMAND over 6 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from François ARMAND to Nicolas CHARLES
- Pull Request set to https://github.com/Normation/rudder/pull/1971
Updated by François ARMAND over 6 years ago
- Assignee changed from Nicolas CHARLES to Félix DALLIDET
Updated by François ARMAND over 6 years ago
- Assignee changed from Félix DALLIDET to Nicolas CHARLES
- Priority changed from 65 to 64
Updated by Rudder Quality Assistant over 6 years ago
- Assignee changed from Nicolas CHARLES to François ARMAND
Updated by François ARMAND over 6 years ago
- Status changed from Pending technical review to Pending release
Applied in changeset rudder|457e5a25e65bd8aee3ece69c22aac771b3e4cb86.
Updated by Vincent MEMBRÉ over 6 years ago
- Status changed from Pending release to Released
- Priority changed from 64 to 63