Bug #2607

closed

Launching Cfengine Community cf-serverd... make an error

Added by Francois BAYART over 12 years ago. Updated about 12 years ago.

Status:
Released
Priority:
1 (highest)
Assignee:
Jonathan CLARKE
Category:
System techniques

Description

Context:
- Debian VM in KVM
- Installation from an internal repository by a script that performs these actions:
  * apt-get -y --force-yes install ntpdate
  * ntpdate ntp.bedata.net
  * apt-get -y --force-yes install rudder-agent
  * echo cmdb.bedata.net > /var/rudder/cfengine-community/policy_server.dat
  * /etc/init.d/rudder-agent start

This installation procedure works on my other nodes.

After this installation, rudder-agent doesn't work correctly: if I run "/etc/init.d/rudder-agent start", I get errors like:

/etc/init.d/rudder-agent start
rudder-agent[10797]: [INFO] Using /etc/default/rudder-agent for configuration
rudder-agent[10800]: [INFO] Using /var/rudder/cfengine-community for Cfengine workdir
rudder-agent[10801]: [INFO] Launching Cfengine Community cf-serverd...

Can't stat file "/var/rudder/cfengine-community/inputs/checkGenericFileContent/3.0/checkGenericFileContent.cf" for parsing !!! System error for stat: "No such file or directory" 

I don't understand why: the host itself works without problems, and my other new nodes don't have this problem.

ncharles gave me some directions:
  1. /opt/rudder/sbin/cf-agent -KI -b update
    Can't stat file "/var/rudder/cfengine-community/inputs/checkGenericFileContent/3.0/checkGenericFileContent.cf" for parsing !!! System error for stat: "No such file or directory"
  2. Check the content of /var/rudder/share/5bd89e6b-089d-4142-a895-df869396aa95 and of the subdirectories rules/cfengine-community & rules/cfengine-community/inputs (the last one didn't exist)
  3. ls /var/rudder/share/5bd89e6b-089d-4142-a895-df869396aa95
    rules
  4. Search for the string checkG in rules/cfengine-community/promises.cf to try to find "checkGenericFileContent/3.0/" calls

After all of this, I ran this command:
/opt/rudder/sbin/cf-agent -KI -f /var/rudder/cfengine-community/inputs/failsafe.cf

and the problem was solved


Files

patch (5.97 KB) patch Jonathan CLARKE, 2012-09-25 15:33
Actions #1

Updated by Nicolas CHARLES over 12 years ago

  • Description updated (diff)
  • Priority changed from N/A to 1 (highest)
  • Target version set to 2.3.8

This problem seems to have persisted for quite a long time (several days) without the failsafe auto-correcting the promises. So something must be preventing it from running (bad policies mean cf-execd is not running, which means CFEngine is never executed via cf-execd? But the cron job should have caught this).

There is definitely something fishy there.

Thank you for your bug report Francois, this is much appreciated.

Actions #2

Updated by Jonathan CLARKE over 12 years ago

  • Target version changed from 2.3.8 to 2.3.9
Actions #3

Updated by Nicolas CHARLES over 12 years ago

  • Status changed from New to In progress
Actions #4

Updated by Nicolas CHARLES over 12 years ago

Ha. I've been able to reproduce this bug.
It happens when the promises are invalid AND the CFEngine database (BDB) is corrupted, most likely because a CFEngine process was killed. That would also explain why files are missing: it was killed during a copy.

The failsafe doesn't start, and when running the agent manually, the output contains:

/var/rudder/cfengine-community/state/cf_lock.db page 14599 is on free list with type 5
PANIC: Invalid argument
rudder> BDB_WriteComplexKeyDB: Error trying to write database: DB_RUNRECOVERY: Fatal error, run database recovery
PANIC: fatal region error detected; run recovery
rudder> BDB_CloseDB: Unable to close database: DB_RUNRECOVERY: Fatal error, run database recovery
rudder> CloseDB: Could not close DB handle.
rudder> CloseDB: Trying to remove handle from open pool anyway.

Deleting the .db files solves the issue, but it's inconvenient.

I'm afraid we'll have to live with this until we move to a version of CFEngine with a correct database.
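
As a rough sketch of the manual workaround described in this comment (not the fix that was eventually applied), the two shell functions below look for the corruption markers quoted in the log and delete the .db files in a state directory. The function names are hypothetical, and treating DB_RUNRECOVERY or the "PANIC: fatal region error detected" line as a reliable corruption signal is an assumption drawn only from this one log excerpt.

```shell
#!/bin/sh
# Sketch only. Function names are hypothetical, and the markers below are
# taken from the single log excerpt in this ticket (an assumption, not a
# documented CFEngine or Berkeley DB contract).

# has_bdb_corruption: read text on stdin; exit 0 if it contains one of the
# corruption markers, non-zero otherwise.
has_bdb_corruption() {
    grep -q -e 'DB_RUNRECOVERY' -e 'PANIC: fatal region error detected'
}

# remove_state_dbs STATE_DIR: delete every *.db file directly inside
# STATE_DIR. The directory is a parameter so nothing is hard-coded.
remove_state_dbs() {
    state_dir=$1
    [ -d "$state_dir" ] || return 1
    # rm -f stays silent (and succeeds) when the glob matches nothing.
    rm -f -- "$state_dir"/*.db
}
```

Captured agent output could be piped into has_bdb_corruption, with remove_state_dbs /var/rudder/cfengine-community/state called only when a marker is found. Deleting every database is blunt; the proposal later in this thread removes cf_lock.db only.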

Actions #5

Updated by Nicolas CHARLES over 12 years ago

  • Status changed from In progress to Discussion
  • Assignee changed from Nicolas CHARLES to Jonathan CLARKE

Jon, I'm not really sure what to do, since there is no obvious solution...

Actions #6

Updated by Jonathan CLARKE about 12 years ago

Nicolas CHARLES wrote:

Jon, I'm not really sure what to do, since there is no obvious solution...

I think we should have an external test (i.e., in cron) that detects this case and kills the BDB database if necessary. This would avoid manual intervention to repair things, and is only a temporary workaround until we have a version of CFEngine that can use something other than BDB.

Can you please explain the steps to reproduce the situation, so that we can see what error code cf-agent returns and see if we can build something from that?

Actions #7

Updated by Nicolas CHARLES about 12 years ago

To reproduce it, you need to kill -9 a running cf-agent and all the other CFEngine components, then delete a promise file (like userManagement/1.0/userManagement.cf).
Do it several times if you are not sure.
Then cf-execd won't be able to restart:

/var/rudder/cfengine-community/bin/cf-execd
Can't stat file "/var/rudder/cfengine-community/inputs/userManagement/1.0/userManagement.cf" for parsing
 !!! System error for stat: "No such file or directory" 

The error code is 1
You can test this on debian-5-32.labo.normation.com

Thinking about it, the executor daemon is not supposed to fire the failsafe if something is wrong in the promises... Shouldn't we execute cf-agent if cf-execd is not running?

Actions #8

Updated by Jonathan CLARKE about 12 years ago

I've had a look at this, and cf-execd and cf-agent both have the decency to exit with error code 1 when this happens. Unless you run failsafe.cf, in which case they block...

But actually, cf-agent only blocks when it gets to the part about downloading files:

rudder> *****************************************************************
rudder> BUNDLE update
rudder> *****************************************************************
rudder> 
rudder> 
rudder>    =========================================================
rudder>    vars in bundle update (1)
rudder>    =========================================================
rudder> 
rudder> 
rudder> . . . . . . . . . . . . . . . . . . . . . . . . . . . .
rudder> Skipping whole next promise (server_inputs), as context nova_edition is not relevant
rudder> . . . . . . . . . . . . . . . . . . . . . . . . . . . .
rudder> 
rudder>      +  Private classes augmented:
rudder> 
rudder>      -  Private classes diminished:
rudder> 
rudder> 
rudder> 
rudder>    =========================================================
rudder>    files in bundle update (1)
rudder>    =========================================================
rudder> 
rudder> 
rudder>     .........................................................
rudder>     Promise handle: 
rudder>     Promise made by: /var/rudder/cfengine-community/inputs
rudder>     .........................................................
rudder> 

^Crudder> Received signal 2 (SIGINT) while doing []
rudder> Logical start time Mon Sep 24 16:28:20 2012
rudder> This sub-task started really at Mon Sep 24 16:28:21 2012
rudder> Trying to remove lock - try
rudder> Can't open lock-log file
rudder>  !!! System error for fopen: "No such file or directory" 

It actually executes the previous bundle just fine. So maybe we could implement a promise in another bundle that would be executed first, to somehow detect the corruption, and clean up the BerkeleyDB files... And then run "cf-agent -f failsafe.cf && cf-agent" via cron, instead of cf-execd.

What do you think?

Actions #9

Updated by Nicolas CHARLES about 12 years ago

If there is some way to detect the corruption and clean up the database files with the agent, that would be awesome.
And I totally agree that the cron job should not run the executor, but rather "cf-agent -f failsafe && cf-agent" if the executor is not running. The agent would then start the executor, which would in turn run the agent.
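
The control flow discussed here can be sketched as a small shell function. This is only an illustration: the function name is hypothetical, and the three commands are passed in by the caller, so the sketch makes no claim about how Rudder actually checks for cf-execd or where the binaries live.

```shell
#!/bin/sh
# Sketch only: run the failsafe, then the regular agent, but only when the
# executor is not already running. The function name and argument order
# are assumptions made for illustration.
#
# Usage: recover_if_executor_down CHECK_CMD FAILSAFE_CMD AGENT_CMD
#   CHECK_CMD    exits 0 when the executor is running
#   FAILSAFE_CMD runs the failsafe policy
#   AGENT_CMD    runs the regular agent
recover_if_executor_down() {
    check_cmd=$1
    failsafe_cmd=$2
    agent_cmd=$3

    if "$check_cmd"; then
        # Executor is alive: nothing to do.
        return 0
    fi

    # The agent runs only if the failsafe succeeded, mirroring
    # "cf-agent -f failsafe.cf && cf-agent".
    "$failsafe_cmd" && "$agent_cmd"
}
```

In a cron job, the three arguments could be small wrapper functions: one that tests for a cf-execd process, one that calls cf-agent -f failsafe.cf, and one that calls cf-agent. How the process test is done is deliberately left open.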

Actions #10

Updated by Jonathan CLARKE about 12 years ago

  • Status changed from Discussion to In progress

Here's an idea:
  • We change the cronjob that tests if cf-execd is running to launch "cf-agent -f failsafe.cf && cf-agent" instead of launching cf-execd. This gives the failsafe.cf the chance to repair any problems with the promises files that might stop cf-execd from running.
  • Every time we correctly update the promises, we touch a file in /var/rudder/cfengine-community
  • The first action in the failsafe would be to check the file above, and if it is older than, say, 1 hour, we remove cf_lock.db (and only this DB), to give CFEngine a chance to run properly again
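
The last two points above can be sketched in shell as follows. This is not the attached patch: the function names, the stamp file path, and treating a missing stamp file as stale are all assumptions made for illustration. Only the one-hour threshold and the choice to remove cf_lock.db alone come from the proposal.

```shell
#!/bin/sh
# Sketch only. Names are hypothetical; a missing stamp file is treated as
# stale, which is an assumption and not something the proposal states.

# mark_update_ok STAMP_FILE: record that the promises were updated correctly.
mark_update_ok() {
    touch "$1"
}

# stamp_is_stale STAMP_FILE MAX_AGE_MINUTES: exit 0 when the stamp file is
# missing or was last modified more than MAX_AGE_MINUTES ago.
stamp_is_stale() {
    stamp=$1
    max_age=$2
    [ -e "$stamp" ] || return 0
    # find prints the path only if its mtime is older than max_age minutes.
    # -mmin is not in POSIX but is supported by GNU, BSD and BusyBox find.
    [ -n "$(find "$stamp" -mmin +"$max_age" 2>/dev/null)" ]
}

# clean_lock_db_if_stale STAMP_FILE LOCK_DB MAX_AGE_MINUTES: remove LOCK_DB,
# and only that file, when the stamp is stale.
clean_lock_db_if_stale() {
    if stamp_is_stale "$1" "$3"; then
        rm -f -- "$2"
    fi
}
```

The failsafe's first step would then amount to something like clean_lock_db_if_stale STAMP /var/rudder/cfengine-community/state/cf_lock.db 60, with mark_update_ok STAMP called after each successful promise update. STAMP stands for whichever file under /var/rudder/cfengine-community is chosen.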
Actions #11

Updated by Jonathan CLARKE about 12 years ago

  • File patch patch added
  • Assignee changed from Jonathan CLARKE to Nicolas PERRON

Suggested patch attached.

Actions #12

Updated by Jonathan CLARKE about 12 years ago

  • Target version changed from 2.3.9 to 2.4.0~beta5

In my opinion this change has too much impact for 2.3, so it will only be fixed in 2.4.

Actions #13

Updated by Jonathan CLARKE about 12 years ago

  • Status changed from In progress to Pending technical review
  • % Done changed from 0 to 100

Applied in changeset commit:9986ba490b88eef9791a2ba29c763189c2249bdb.

Actions #14

Updated by Nicolas PERRON about 12 years ago

Jonathan CLARKE wrote:

Here's an idea:
  • We change the cronjob that tests if cf-execd is running to launch "cf-agent -f failsafe.cf && cf-agent" instead of launching cf-execd. This gives the failsafe.cf the chance to repair any problems with the promises files that might stop cf-execd from running.
  • Every time we correctly update the promises, we touch a file in /var/rudder/cfengine-community
  • The first action in the failsafe would be to check the file above, and if it is older than, say, 1 hour, we remove cf_lock.db (and only this DB), to give CFEngine a chance to run properly again

The logic has been implemented.

I could not reproduce the bug with the error message about cf_lock.db, but I did get a locked state while launching cf-agent at the "startExecution" step. Our logic unlocked the state and everything returned to normal.

I suppose we can consider the issue fixed?

Actions #15

Updated by Nicolas CHARLES about 12 years ago

  • Assignee changed from Nicolas PERRON to Jonathan CLARKE

Nicolas PERRON wrote:

The logic has been implemented.

I could not reproduce the bug with the error message about cf_lock.db, but I did get a locked state while launching cf-agent at the "startExecution" step. Our logic unlocked the state and everything returned to normal.

I suppose we can consider the issue fixed?

I corrected a small mistake (. has precedence over | in logical expressions), but other than that, it looks good to me. Jon, can you confirm?

Actions #16

Updated by Jonathan CLARKE about 12 years ago

  • Status changed from Pending technical review to Released

Nicolas CHARLES wrote:

I corrected a small mistake (. has precedence over | in logical expressions), but other than that, it looks good to me. Jon, can you confirm?

Yes, I agree!
