Bug #12604

Generation gets stuck when cf-serverd is not running

Added by Alexis MOUSSET 7 months ago. Updated 5 months ago.

Status:
Released
Priority:
N/A
Category:
System integration
Target version:
Severity:
Critical - prevents main use of Rudder | no workaround | data loss | security
User visibility:
Infrequent - complex configurations | third party integrations
Effort required:
Priority:
63

Description

Seen on SLES 11 with Rudder 4.1.7, but should happen everywhere.

When cf-serverd is not running and a policy generation is triggered, it gets stuck with:

root     13935 72.4 13.0 2810272 424520 pts/0  Sl   05:48   1:36 /usr/java/latest/bin/java -server -Xms1024m -Xmx1024m -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Drudder.configFile=/opt/rudder/etc/rudder-w
root     16307  0.0  0.0      0     0 pts/0    Z    05:50   0:00  \_ [rudder-reload-c] <defunct>

And we have no way to reset generation state, and have to restart jetty to start a new generation.
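One way to keep a hung hook from blocking a generation indefinitely is to bound its execution time. The following is only a sketch of that idea in Python, not the fix applied in the associated revision; `run_hook` and the 30 second default are hypothetical names and values chosen for illustration. Output is sent to the null device on purpose, so the caller never waits on a pipe that a lingering child could keep open.

```python
import subprocess


def run_hook(argv, timeout_s=30.0):
    """Run a hook command; return its exit status, or None if it timed out.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    try:
        completed = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,   # no pipes, so nothing to block on
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,           # the direct child is killed on expiry
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

A caller could treat `None` as "hook failed" and still mark the generation as finished instead of waiting forever.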

Associated revisions

Revision 457e5a25 (diff)
Added by François ARMAND 6 months ago

Fixes #12604: Generation gets stuck when cf-serverd is not running

History

#1 Updated by Alexis MOUSSET 7 months ago

  • Description updated (diff)
  • Target version set to 4.1.12

#2 Updated by Vincent MEMBRÉ 7 months ago

  • Target version changed from 4.1.12 to 4.1.13

#3 Updated by François ARMAND 7 months ago

I can't reproduce it on Ubuntu 16.04 (Rudder 4.1.12).

When cf-execd is not running, what is the result of running the content of the hook `policy-generation-finished/50-reload-policy-file-server`, i.e.:

exec /opt/rudder/bin/rudder-reload-cf-serverd

On Ubuntu, we get a happy exit status 0 (I'm not sure that's what we want, but hey! No zombie at least!)

#4 Updated by Benoît PECCATTE 7 months ago

  • Target version changed from 4.1.13 to 411

#5 Updated by Benoît PECCATTE 7 months ago

  • Target version changed from 411 to 4.1.13

#6 Updated by Benoît PECCATTE 7 months ago

  • Category set to System integration
  • Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
  • User visibility set to Infrequent - complex configurations | third party integrations
  • Priority changed from 0 to 65

#7 Updated by Benoît PECCATTE 6 months ago

  • Assignee set to Félix DALLIDET

#8 Updated by Félix DALLIDET 6 months ago

I tried on SLES 12 SP1 with Rudder 4.1.12.
/opt/rudder/bin/rudder-reload-cf-serverd tries to kill cf-serverd by its PID, which does not exist, so the kill fails.
But it then restarts the rudder-agent service, which resolves the problem.

Promise generation gets stuck if triggered manually. It seems to resolve itself when a real generation (with policy changes) is triggered, or at the next rudder agent execution on the server.

#9 Updated by Félix DALLIDET 6 months ago

I reproduced it on SLES 11 SP4 with Rudder 4.3.3.

The generation gets stuck forever when cf-serverd is killed.

#10 Updated by Félix DALLIDET 6 months ago

  • Status changed from New to In progress

#11 Updated by Félix DALLIDET 6 months ago

  • Status changed from In progress to New

Some observations:

minor:
The script /opt/rudder/bin/rudder-reload-cf-serverd seems to fail when trying to kill a PID that is already dead. Fixing this by testing whether cf-serverd is running
before trying to kill it does not resolve the generation issue.
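For reference, the "test before kill" check mentioned above can be sketched as follows. This is an illustration in Python, not the content of rudder-reload-cf-serverd; `pid_is_running` and `signal_if_running` are hypothetical helper names. It relies on the POSIX convention that signal 0 only performs existence and permission checks.

```python
import os
import signal


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True


def signal_if_running(pid: int, sig: int = signal.SIGTERM) -> bool:
    """Send sig only if the PID is alive; return whether a signal was sent."""
    if not pid_is_running(pid):
        return False
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        return False     # the process exited between the check and the kill
    return True
```

Note that a zombie that has not been reaped yet still counts as existing for this check.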

other:
By redirecting all output to a file, I found that restarting the service fails when executing /var/rudder/cfengine-community/bin/cf-serverd, with the error:

error: Unable to kill expired process 17155 from lock lock.server_cfengine_bundle.server_cfengine.-server._var_rudder_cfengine_community_inputs_promises_cf_4657_MD5=ce9ed37b30768afd47469328cfbf1bdc (probably process not found or permission denied)

It seems to be related to https://tracker.mender.io/browse/CFE-2824
I was not able to find which process is being killed.

#12 Updated by Félix DALLIDET 6 months ago

When the cf-serverd service is started, it inherits all the file descriptors from the webapp. Launching cf-serverd with "startproc" closes all the unwanted file descriptors.
As François pointed out, this seems to be a known bug, recently fixed: https://github.com/brettwooldridge/NuProcess/issues/13
Here are two listings of the file descriptors used by cf-serverd, one for each case:

The startproc case:

dr-x------ 2 root root  0 15 juin  05:00 .
dr-xr-xr-x 8 root root  0 15 juin  05:00 ..
lrwx------ 1 root root 64 15 juin  05:00 0 -> /dev/null
lrwx------ 1 root root 64 15 juin  05:00 1 -> /dev/null
lrwx------ 1 root root 64 15 juin  05:00 2 -> /dev/null
lrwx------ 1 root root 64 15 juin  05:00 3 -> socket:[486961]
lrwx------ 1 root root 64 15 juin  05:00 4 -> socket:[486962]
lrwx------ 1 root root 64 15 juin  05:00 5 -> socket:[486964]
lrwx------ 1 root root 64 15 juin  05:00 6 -> socket:[486966]

The current, unwanted case:

dr-x------ 2 root root  0 15 juin  04:43 .
dr-xr-xr-x 8 root root  0 15 juin  04:37 ..
lrwx------ 1 root root 64 15 juin  04:43 0 -> /dev/null
lrwx------ 1 root root 64 15 juin  04:43 1 -> /dev/null
lr-x------ 1 root root 64 15 juin  04:43 14 -> /dev/random
lr-x------ 1 root root 64 15 juin  04:43 15 -> /dev/urandom
lr-x------ 1 root root 64 15 juin  04:43 16 -> /dev/random
lr-x------ 1 root root 64 15 juin  04:43 17 -> /dev/random
lr-x------ 1 root root 64 15 juin  04:43 18 -> /dev/urandom
lrwx------ 1 root root 64 15 juin  04:43 180 -> socket:[464556]
lrwx------ 1 root root 64 15 juin  04:43 182 -> socket:[465993]
lrwx------ 1 root root 64 15 juin  04:43 183 -> socket:[9472]
lrwx------ 1 root root 64 15 juin  04:43 184 -> socket:[9477]
lrwx------ 1 root root 64 15 juin  04:43 185 -> socket:[9482]
lrwx------ 1 root root 64 15 juin  04:43 186 -> socket:[464941]
lrwx------ 1 root root 64 15 juin  04:43 187 -> socket:[465974]
lrwx------ 1 root root 64 15 juin  04:43 188 -> socket:[468963]
lrwx------ 1 root root 64 15 juin  04:43 189 -> socket:[470685]
lr-x------ 1 root root 64 15 juin  04:43 19 -> /dev/urandom
lrwx------ 1 root root 64 15 juin  04:43 190 -> socket:[465970]
lrwx------ 1 root root 64 15 juin  04:43 191 -> socket:[467080]
lrwx------ 1 root root 64 15 juin  04:43 192 -> socket:[466007]
lrwx------ 1 root root 64 15 juin  04:43 193 -> socket:[469007]
lrwx------ 1 root root 64 15 juin  04:43 194 -> socket:[470703]
lrwx------ 1 root root 64 15 juin  04:43 195 -> socket:[479609]
lrwx------ 1 root root 64 15 juin  04:43 196 -> socket:[464907]
lrwx------ 1 root root 64 15 juin  04:43 197 -> socket:[479694]
lrwx------ 1 root root 64 15 juin  04:43 198 -> socket:[466000]
lrwx------ 1 root root 64 15 juin  04:43 199 -> socket:[467825]
lrwx------ 1 root root 64 15 juin  04:43 2 -> /dev/null
lrwx------ 1 root root 64 15 juin  04:43 200 -> socket:[467796]
lrwx------ 1 root root 64 15 juin  04:43 201 -> socket:[464871]
lrwx------ 1 root root 64 15 juin  04:43 202 -> socket:[464897]
lrwx------ 1 root root 64 15 juin  04:43 203 -> socket:[467085]
lrwx------ 1 root root 64 15 juin  04:43 205 -> socket:[467817]
lrwx------ 1 root root 64 15 juin  04:43 206 -> socket:[464915]
lrwx------ 1 root root 64 15 juin  04:43 207 -> socket:[467822]
lrwx------ 1 root root 64 15 juin  04:43 208 -> socket:[455990]
lrwx------ 1 root root 64 15 juin  04:43 209 -> socket:[9989]
lrwx------ 1 root root 64 15 juin  04:43 210 -> socket:[9995]
lr-x------ 1 root root 64 15 juin  04:43 211 -> pipe:[9996]
l-wx------ 1 root root 64 15 juin  04:43 212 -> pipe:[9996]
lrwx------ 1 root root 64 15 juin  04:43 213 -> anon_inode:[eventpoll]
lr-x------ 1 root root 64 15 juin  04:43 214 -> pipe:[479719]
lrwx------ 1 root root 64 15 juin  04:43 215 -> socket:[465959]
lrwx------ 1 root root 64 15 juin  04:43 217 -> socket:[466004]
lrwx------ 1 root root 64 15 juin  04:43 219 -> anon_inode:[eventpoll]
lrwx------ 1 root root 64 15 juin  04:43 222 -> socket:[23019]
l-wx------ 1 root root 64 15 juin  04:43 224 -> pipe:[479720]
l-wx------ 1 root root 64 15 juin  04:43 226 -> pipe:[479721]
lrwx------ 1 root root 64 15 juin  04:43 28 -> socket:[465997]
lrwx------ 1 root root 64 15 juin  04:43 3 -> socket:[480055]
lrwx------ 1 root root 64 15 juin  04:43 4 -> socket:[480056]
l-wx------ 1 root root 64 15 juin  04:43 5 -> /var/log/rudder/core/rudder-webapp-2018-06-14.log66569034707068.tmp (deleted)
lrwx------ 1 root root 64 15 juin  04:43 6 -> socket:[9452]
lrwx------ 1 root root 64 15 juin  04:43 7 -> socket:[480058]
lrwx------ 1 root root 64 15 juin  04:43 8 -> socket:[480060]
l-wx------ 1 root root 64 15 juin  04:43 81 -> /var/log/rudder/compliance/non-compliant-reports-2018-06-14.log81880899162011.tmp (deleted)
lrwx------ 1 root root 64 15 juin  04:43 86 -> socket:[9466]
lrwx------ 1 root root 64 15 juin  04:43 87 -> socket:[9468]
lrwx------ 1 root root 64 15 juin  04:43 88 -> socket:[9470]
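The difference between the two listings above can be reproduced in miniature. The sketch below is in Python and is only an illustration of descriptor inheritance on a POSIX system; it is not how the webapp or startproc are implemented, and `child_sees_fd` and `demo` are hypothetical names. The parent deliberately leaks the write end of a pipe, then starts one child that inherits it and one child started with unwanted descriptors closed.

```python
import fcntl
import os
import subprocess
import sys

# Child program: report whether the descriptor number given as argv[1] is open.
CHILD_CODE = """
import os, sys
try:
    os.fstat(int(sys.argv[1]))
    print("open")
except OSError:
    print("closed")
"""


def child_sees_fd(fd: int, close_fds: bool) -> str:
    """Spawn a child and return 'open' or 'closed' for fd as seen by the child."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(fd)],
        close_fds=close_fds,
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def demo():
    read_end, write_end = os.pipe()
    # Duplicate onto a high descriptor number so the child cannot reuse it by accident.
    leaky_fd = fcntl.fcntl(write_end, fcntl.F_DUPFD, 100)
    try:
        # Python marks its descriptors non-inheritable; opt in to mimic a leaky parent.
        os.set_inheritable(leaky_fd, True)
        inherited = child_sees_fd(leaky_fd, close_fds=False)  # the unwanted case
        cleaned = child_sees_fd(leaky_fd, close_fds=True)     # the startproc-like case
    finally:
        for fd in (read_end, write_end, leaky_fd):
            os.close(fd)
    return inherited, cleaned
```

Calling `demo()` returns the state of the leaked descriptor in each child. A long-lived daemon that keeps such a pipe end open can prevent the parent from ever seeing end-of-file on it, which is one plausible way for a caller to hang.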

#13 Updated by François ARMAND 6 months ago

  • Status changed from New to In progress
  • Assignee changed from Félix DALLIDET to François ARMAND

#14 Updated by François ARMAND 6 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/1971

#15 Updated by François ARMAND 6 months ago

  • Assignee changed from Nicolas CHARLES to Félix DALLIDET

#16 Updated by François ARMAND 6 months ago

  • Assignee changed from Félix DALLIDET to Nicolas CHARLES
  • Priority changed from 65 to 64

#17 Updated by Normation Quality Assistant 6 months ago

  • Assignee changed from Nicolas CHARLES to François ARMAND

#18 Updated by François ARMAND 6 months ago

  • Status changed from Pending technical review to Pending release

#19 Updated by Vincent MEMBRÉ 5 months ago

  • Status changed from Pending release to Released
  • Priority changed from 64 to 63

This bug has been fixed in Rudder 4.1.13, 4.2.7 and 4.3.3 which were released today.
