Bug #4613

closed

When multiple cf-execd processes are running at the same time, the agent does not behave properly and the node is in NoAnswer state

Added by Dennis Cabooter over 10 years ago. Updated over 10 years ago.

Status: Released
Priority: 1 (highest)
Assignee: Jonathan CLARKE
Category: System integration

Description

Quite often, one or more statuses on a node are in the NoAnswer state. For example, all reports on node1 are green except for the Add User directive. However, in this example the status of the Password component is Success. That sounds weird, doesn't it? If Rudder doesn't know whether the user exists, how can it check that user's password? cf-agent is running and there is only one cf-execd process. After running cf-agent -KI manually, all statuses turn green again. The User Management technique seems to be where this happens most often.

Today exactly the same thing happened on the Rudder server with the Distribute Policy technique. However, the Check configuration-repository object was red and showed a nasty error message:

EMERGENCY: THE /var/rudder/configuration-repository DIRECTORY IS *ABSENT*. THIS ORCHESTRATOR WILL *NOT* OPERATE CORRECTLY.

However, after running cf-agent -KI manually everything went back to green.

This strange reporting doesn't happen on just one node, but it doesn't happen on all nodes either. Could it be that logs get lost somehow? I'm sorry I can't provide more information about the problem. Please ask if you need something from me. This problem has been there for a long time, but since everything goes back to green after manually running cf-agent -KI, and since I couldn't pin the problem down, I didn't create a ticket earlier.


Subtasks 1 (0 open, 1 closed)

Bug #4923: check-rudder-agent is broken on 2.9 nightly because of a bad merge (Released, Jonathan CLARKE, 2014-06-03)
Actions #1

Updated by Dennis Cabooter over 10 years ago

  • Assignee set to Jonathan CLARKE
  • Priority changed from N/A to 1 (highest)
Actions #2

Updated by Jonathan CLARKE over 10 years ago

  • Target version changed from 2.6.12 to 2.9.4

This is particularly weird because of the randomness with which the bug occurs.

One idea I have is that logs may be getting lost when all agents run at the same time, because of too high load/throughput on the server. This would explain why it always works when the agent is run manually, out of the standard schedule.

IIRC, you saw this in 2.9.3. Retargeting the bug until we know more, so we don't get confused.

Actions #3

Updated by Jonathan CLARKE over 10 years ago

  • Target version changed from 2.9.4 to 2.6.12

Dennis told me he already had this on 2.6.

Actions #4

Updated by Nicolas CHARLES over 10 years ago

Dennis,

could you check in the technical logs of the nodes whether the report for the directive in NoAnswer is there? If you can take a screenshot of the 100 latest messages, it would be very helpful for debugging.

Actions #5

Updated by Vincent MEMBRÉ over 10 years ago

  • Target version changed from 2.6.12 to 2.6.13
Actions #6

Updated by William Ott over 10 years ago

I experienced the same on version 2.10.
This seemed to occur after I restarted the Rudder root server. The non-compliant-reports.log shows the message from the issue description, but all.log indicates that every node is running the agent and that proper reports are returned from each of them. The same report messages are also displayed in every node's technical logs in the web UI.
Only the Reports tab shows 'NoReport' for either a few or all techniques, and it also varies between rules.

Actions #7

Updated by Nicolas CHARLES over 10 years ago

I experience the bug too; it appears in a really random fashion.
When it happened to me, I had just run /etc/init.d/rudder-agent restart.
After running this command, two cf-execd processes were running: an old one, and one started by rudder-agent restart.

Then the agents had partial reporting, or none at all, or sometimes correct reporting; it was fairly random.
Killing the two cf-execd processes solved the issue.

My theory is that rudder-agent restart didn't stop the old cf-execd but started a new one nonetheless. Several can run at the same time; I can't understand why, since it used to be impossible, but plain CFEngine allows it too.
The two cf-execd processes are not aware of each other. They both start cf-agent at the same time, and a weird race condition occurs between them, each one overtaking the other in the promise execution, resulting in partial execution for both (some runs send reports, others don't, etc.), hence chaos.

It seems to me we should enforce a single running cf-execd in the check-rudder-agent script.
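
To illustrate the situation described above, here is a minimal Python sketch of how duplicate cf-execd processes could be spotted from a process listing. It is not the check-rudder-agent script itself: the function name find_cf_execd_pids and the sample listing are hypothetical, and the input is assumed to look like the output of `ps -eo pid,args`.

```python
from typing import List


def find_cf_execd_pids(ps_output: str) -> List[int]:
    """Return the PIDs of cf-execd processes in `ps -eo pid,args` style output."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip the header line and blank lines
        # Compare the executable's base name so "grep cf-execd" is not matched.
        if fields[1].rsplit("/", 1)[-1] == "cf-execd":
            pids.append(int(fields[0]))
    return pids


# Hypothetical listing: an old daemon plus one started by a restart.
SAMPLE = """\
  PID COMMAND
  812 /var/rudder/cfengine-community/bin/cf-execd
 4312 /var/rudder/cfengine-community/bin/cf-execd
 4400 grep cf-execd
"""

pids = find_cf_execd_pids(SAMPLE)
if len(pids) > 1:
    print(f"duplicate cf-execd processes: {pids}")
```

Matching on the executable's base name rather than on a substring of the whole line avoids counting unrelated commands that merely mention cf-execd.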

Actions #8

Updated by Nicolas CHARLES over 10 years ago

OK, I reproduced it on debian-7-64 by simply running /var/rudder/cfengine-community/bin/cf-execd

After that, the reports become random.

Actions #9

Updated by Nicolas CHARLES over 10 years ago

  • Status changed from New to In progress
  • Assignee changed from Jonathan CLARKE to Nicolas CHARLES
Actions #10

Updated by Nicolas CHARLES over 10 years ago

A check in check-rudder-agent will be necessary to prevent duplicate execution of cf-execd.
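
As a rough illustration of what such a check might decide, here is a Python sketch that turns a list of cf-execd PIDs into the commands needed to end up with a single daemon. The actual change lives in the pull request linked in this ticket and is not reproduced here. The function name plan_cf_execd_actions and the policy of killing every instance and then starting a fresh one are assumptions, as is launching the daemon by calling its binary with no arguments.

```python
from typing import List

# Path to the daemon as quoted earlier in this ticket.
CF_EXECD = "/var/rudder/cfengine-community/bin/cf-execd"


def plan_cf_execd_actions(pids: List[int]) -> List[List[str]]:
    """Return the commands (as argv lists) needed to end up with one cf-execd.

    Hypothetical policy: exactly one daemon is left alone; zero daemons means
    start one; several daemons means kill them all, then start a fresh one.
    """
    if len(pids) == 1:
        return []  # healthy: nothing to do
    commands = [["kill", "-9", str(pid)] for pid in sorted(pids)]
    commands.append([CF_EXECD])  # start a single fresh daemon
    return commands


# Print the plan for two running daemons instead of executing it.
for argv in plan_cf_execd_actions([4312, 812]):
    print(" ".join(argv))
```

Returning the commands instead of running them keeps the decision logic easy to test; a caller could pass each argv list to subprocess.run once the plan looks right.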

Actions #11

Updated by Nicolas CHARLES over 10 years ago

  • Subject changed from NoAnswer reports to When multiple cf-execd processes are running at the same time, the agent does not behave properly and the node is in NoAnswer state
Actions #12

Updated by Nicolas CHARLES over 10 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Nicolas CHARLES to Jonathan CLARKE
  • Pull Request set to https://github.com/Normation/rudder-packages/pull/331
Actions #13

Updated by Jonathan CLARKE over 10 years ago

  • Status changed from Pending technical review to Discussion
  • Assignee changed from Jonathan CLARKE to Nicolas CHARLES
Actions #14

Updated by Nicolas CHARLES over 10 years ago

  • Assignee changed from Nicolas CHARLES to Jonathan CLARKE

The PR has been updated with a comment explaining why -9 is used.

Actions #15

Updated by Nicolas CHARLES over 10 years ago

  • Status changed from Discussion to Pending release
  • % Done changed from 0 to 100

Applied in changeset packages:commit:af024e9e74a5c81f38f3354576d040c9ac307018.

Actions #16

Updated by Vincent MEMBRÉ over 10 years ago

  • Category changed from Web - Compliance & node report to System integration
Actions #17

Updated by Vincent MEMBRÉ over 10 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 2.6.13 (announcement, changelog), 2.9.5 (announcement, changelog) and 2.10.1 (announcement, changelog), which were released today.
