Bug #12604

closed

Generation gets stuck when cf-serverd is not running

Added by Alexis Mousset over 6 years ago. Updated over 6 years ago.

Status: Released
Priority: N/A
Category: System integration
Target version:
Severity: Critical - prevents main use of Rudder | no workaround | data loss | security
UX impact:
User visibility: Infrequent - complex configurations | third party integrations
Effort required:
Priority: 63
Name check:
Fix check:
Regression:

Description

Seen on SLES 11 with Rudder 4.1.7, but should happen everywhere.

When cf-serverd is not running and a policy generation is triggered, it gets stuck with:

root     13935 72.4 13.0 2810272 424520 pts/0  Sl   05:48   1:36 /usr/java/latest/bin/java -server -Xms1024m -Xmx1024m -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Drudder.configFile=/opt/rudder/etc/rudder-w
root     16307  0.0  0.0      0     0 pts/0    Z    05:50   0:00  \_ [rudder-reload-c] <defunct>

We have no way to reset the generation state, and have to restart Jetty to start a new generation.
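
The only workaround for now is to restart the webapp by hand, e.g. (a sketch; the service name below is the standard rudder-jetty one and may differ per platform or init system):

# Restart the webapp to unblock the stuck generation (workaround, not a fix)
service rudder-jetty restart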

Actions #1

Updated by Alexis Mousset over 6 years ago

  • Description updated (diff)
  • Target version set to 4.1.12
Actions #2

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.1.12 to 4.1.13
Actions #3

Updated by François ARMAND over 6 years ago

I can't reproduce it on Ubuntu 16.04 (Rudder 4.1.12).

When cf-serverd is not running, what is the result of running the content of the hook `policy-generation-finished/50-reload-policy-file-server`, i.e.:

exec /opt/rudder/bin/rudder-reload-cf-serverd

On Ubuntu, we get a happy exit status 0 (I'm not sure it's what we want, but hey! No zombie, at least!)
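
To compare across platforms, a quick manual check (a sketch, using the paths from this ticket):

# Run the hook content by hand and record its exit status
/opt/rudder/bin/rudder-reload-cf-serverd
echo "exit status: $?"

# Then check whether a defunct child was left behind under the webapp
ps aux | grep '[d]efunct'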

Actions #4

Updated by Benoît PECCATTE over 6 years ago

  • Target version changed from 4.1.13 to 411
Actions #5

Updated by Benoît PECCATTE over 6 years ago

  • Target version changed from 411 to 4.1.13
Actions #6

Updated by Benoît PECCATTE over 6 years ago

  • Category set to System integration
  • Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
  • User visibility set to Infrequent - complex configurations | third party integrations
  • Priority changed from 0 to 65
Actions #7

Updated by Benoît PECCATTE over 6 years ago

  • Assignee set to Félix DALLIDET
Actions #8

Updated by Félix DALLIDET over 6 years ago

I tried on a SLES 12 SP1 with Rudder 4.1.12.
/opt/rudder/bin/rudder-reload-cf-serverd tries to kill cf-serverd with its PID, which does not exist, and fails.
But it restarts the rudder-agent service, which resolves the problem.

Promise generation gets stuck if triggered manually. It seems to resolve itself when a real generation is triggered (with policy changes) or at the next rudder agent run on the server.

Actions #9

Updated by Félix DALLIDET over 6 years ago

I reproduced it on a SLES 11 SP4 with Rudder 4.3.3.

The generation gets stuck forever when cf-serverd is killed.

Actions #10

Updated by Félix DALLIDET over 6 years ago

  • Status changed from New to In progress
Actions #11

Updated by Félix DALLIDET over 6 years ago

  • Status changed from In progress to New

Some observations:

minor:
The script /opt/rudder/bin/rudder-reload-cf-serverd seems to fail when trying to kill an already killed PID. Fixing it by testing whether cf-serverd is running before trying to kill it (as sketched below) does not resolve the generation issue.
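
For reference, a minimal sketch of such a guard (hypothetical, not the shipped script; the agent restart command is an assumption and depends on the platform):

# Only signal cf-serverd if one is actually running, to avoid failing on a stale PID
if pgrep -x cf-serverd > /dev/null; then
    pkill -x cf-serverd
fi
# Then start it again through the agent service
/etc/init.d/rudder-agent restart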

other:
By redirecting all output to a file, I found that restarting the service fails when executing /var/rudder/cfengine-community/bin/cf-serverd, with the error:

error: Unable to kill expired process 17155 from lock lock.server_cfengine_bundle.server_cfengine.-server._var_rudder_cfengine_community_inputs_promises_cf_4657_MD5=ce9ed37b30768afd47469328cfbf1bdc (probably process not found or permission denied)

It seems to be related to https://tracker.mender.io/browse/CFE-2824
I was not able to find which process is being killed.
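
One way to investigate would be to check whether the PID from the lock error still exists at that point (a sketch):

# Check whether the PID referenced in the expired CFEngine lock still exists
ps -p 17155 -o pid,user,stat,cmd || echo "process 17155 no longer exists"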

Actions #12

Updated by Félix DALLIDET over 6 years ago

When the cf-serverd service is started this way, it inherits all the file descriptors of the webapp. Launching cf-serverd with "startproc" instead closes all the unwanted file descriptors.
As pointed out by François, it seems to be a known bug, recently fixed: https://github.com/brettwooldridge/NuProcess/issues/13
Here are the two lists of file descriptors used by cf-serverd in both cases:

The startproc case:

dr-x------ 2 root root  0 15 juin  05:00 .
dr-xr-xr-x 8 root root  0 15 juin  05:00 ..
lrwx------ 1 root root 64 15 juin  05:00 0 -> /dev/null
lrwx------ 1 root root 64 15 juin  05:00 1 -> /dev/null
lrwx------ 1 root root 64 15 juin  05:00 2 -> /dev/null
lrwx------ 1 root root 64 15 juin  05:00 3 -> socket:[486961]
lrwx------ 1 root root 64 15 juin  05:00 4 -> socket:[486962]
lrwx------ 1 root root 64 15 juin  05:00 5 -> socket:[486964]
lrwx------ 1 root root 64 15 juin  05:00 6 -> socket:[486966]

The current (unwanted) case:

dr-x------ 2 root root  0 15 juin  04:43 .
dr-xr-xr-x 8 root root  0 15 juin  04:37 ..
lrwx------ 1 root root 64 15 juin  04:43 0 -> /dev/null
lrwx------ 1 root root 64 15 juin  04:43 1 -> /dev/null
lr-x------ 1 root root 64 15 juin  04:43 14 -> /dev/random
lr-x------ 1 root root 64 15 juin  04:43 15 -> /dev/urandom
lr-x------ 1 root root 64 15 juin  04:43 16 -> /dev/random
lr-x------ 1 root root 64 15 juin  04:43 17 -> /dev/random
lr-x------ 1 root root 64 15 juin  04:43 18 -> /dev/urandom
lrwx------ 1 root root 64 15 juin  04:43 180 -> socket:[464556]
lrwx------ 1 root root 64 15 juin  04:43 182 -> socket:[465993]
lrwx------ 1 root root 64 15 juin  04:43 183 -> socket:[9472]
lrwx------ 1 root root 64 15 juin  04:43 184 -> socket:[9477]
lrwx------ 1 root root 64 15 juin  04:43 185 -> socket:[9482]
lrwx------ 1 root root 64 15 juin  04:43 186 -> socket:[464941]
lrwx------ 1 root root 64 15 juin  04:43 187 -> socket:[465974]
lrwx------ 1 root root 64 15 juin  04:43 188 -> socket:[468963]
lrwx------ 1 root root 64 15 juin  04:43 189 -> socket:[470685]
lr-x------ 1 root root 64 15 juin  04:43 19 -> /dev/urandom
lrwx------ 1 root root 64 15 juin  04:43 190 -> socket:[465970]
lrwx------ 1 root root 64 15 juin  04:43 191 -> socket:[467080]
lrwx------ 1 root root 64 15 juin  04:43 192 -> socket:[466007]
lrwx------ 1 root root 64 15 juin  04:43 193 -> socket:[469007]
lrwx------ 1 root root 64 15 juin  04:43 194 -> socket:[470703]
lrwx------ 1 root root 64 15 juin  04:43 195 -> socket:[479609]
lrwx------ 1 root root 64 15 juin  04:43 196 -> socket:[464907]
lrwx------ 1 root root 64 15 juin  04:43 197 -> socket:[479694]
lrwx------ 1 root root 64 15 juin  04:43 198 -> socket:[466000]
lrwx------ 1 root root 64 15 juin  04:43 199 -> socket:[467825]
lrwx------ 1 root root 64 15 juin  04:43 2 -> /dev/null
lrwx------ 1 root root 64 15 juin  04:43 200 -> socket:[467796]
lrwx------ 1 root root 64 15 juin  04:43 201 -> socket:[464871]
lrwx------ 1 root root 64 15 juin  04:43 202 -> socket:[464897]
lrwx------ 1 root root 64 15 juin  04:43 203 -> socket:[467085]
lrwx------ 1 root root 64 15 juin  04:43 205 -> socket:[467817]
lrwx------ 1 root root 64 15 juin  04:43 206 -> socket:[464915]
lrwx------ 1 root root 64 15 juin  04:43 207 -> socket:[467822]
lrwx------ 1 root root 64 15 juin  04:43 208 -> socket:[455990]
lrwx------ 1 root root 64 15 juin  04:43 209 -> socket:[9989]
lrwx------ 1 root root 64 15 juin  04:43 210 -> socket:[9995]
lr-x------ 1 root root 64 15 juin  04:43 211 -> pipe:[9996]
l-wx------ 1 root root 64 15 juin  04:43 212 -> pipe:[9996]
lrwx------ 1 root root 64 15 juin  04:43 213 -> anon_inode:[eventpoll]
lr-x------ 1 root root 64 15 juin  04:43 214 -> pipe:[479719]
lrwx------ 1 root root 64 15 juin  04:43 215 -> socket:[465959]
lrwx------ 1 root root 64 15 juin  04:43 217 -> socket:[466004]
lrwx------ 1 root root 64 15 juin  04:43 219 -> anon_inode:[eventpoll]
lrwx------ 1 root root 64 15 juin  04:43 222 -> socket:[23019]
l-wx------ 1 root root 64 15 juin  04:43 224 -> pipe:[479720]
l-wx------ 1 root root 64 15 juin  04:43 226 -> pipe:[479721]
lrwx------ 1 root root 64 15 juin  04:43 28 -> socket:[465997]
lrwx------ 1 root root 64 15 juin  04:43 3 -> socket:[480055]
lrwx------ 1 root root 64 15 juin  04:43 4 -> socket:[480056]
l-wx------ 1 root root 64 15 juin  04:43 5 -> /var/log/rudder/core/rudder-webapp-2018-06-14.log66569034707068.tmp (deleted)
lrwx------ 1 root root 64 15 juin  04:43 6 -> socket:[9452]
lrwx------ 1 root root 64 15 juin  04:43 7 -> socket:[480058]
lrwx------ 1 root root 64 15 juin  04:43 8 -> socket:[480060]
l-wx------ 1 root root 64 15 juin  04:43 81 -> /var/log/rudder/compliance/non-compliant-reports-2018-06-14.log81880899162011.tmp (deleted)
lrwx------ 1 root root 64 15 juin  04:43 86 -> socket:[9466]
lrwx------ 1 root root 64 15 juin  04:43 87 -> socket:[9468]
lrwx------ 1 root root 64 15 juin  04:43 88 -> socket:[9470]
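
For reference, listings like the two above can be captured along these lines (a sketch, assuming a single cf-serverd process is running):

# List the file descriptors currently held by cf-serverd
ls -la /proc/"$(pgrep -x cf-serverd | head -n1)"/fd

Launching through startproc (or the NuProcess fix, which avoids leaking the parent's descriptors into spawned children) keeps the webapp's sockets and log files out of cf-serverd, which appears to be what left the reload hook, and thus the generation, hanging.
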
Actions #13

Updated by François ARMAND over 6 years ago

  • Status changed from New to In progress
  • Assignee changed from Félix DALLIDET to François ARMAND
Actions #14

Updated by François ARMAND over 6 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/1971
Actions #15

Updated by François ARMAND over 6 years ago

  • Assignee changed from Nicolas CHARLES to Félix DALLIDET
Actions #16

Updated by François ARMAND over 6 years ago

  • Assignee changed from Félix DALLIDET to Nicolas CHARLES
  • Priority changed from 65 to 64
Actions #17

Updated by Rudder Quality Assistant over 6 years ago

  • Assignee changed from Nicolas CHARLES to François ARMAND
Actions #18

Updated by François ARMAND over 6 years ago

  • Status changed from Pending technical review to Pending release
Actions #19

Updated by Vincent MEMBRÉ over 6 years ago

  • Status changed from Pending release to Released
  • Priority changed from 64 to 63

This bug has been fixed in Rudder 4.1.13, 4.2.7 and 4.3.3 which were released today.
