Bug #2607
closed
Launching Cfengine Community cf-serverd... make an error
Description
Context :
- Debian VM in KVM
- installation from an internal repository by a script performing these actions:
* apt-get -y --force-yes install ntpdate
* ntpdate ntp.bedata.net
* apt-get -y --force-yes install rudder-agent
* echo cmdb.bedata.net > /var/rudder/cfengine-community/policy_server.dat
* /etc/init.d/rudder-agent start
This installation procedure works on my other nodes.
After this installation the rudder-agent doesn't work correctly: if I run "/etc/init.d/rudder-agent start", I get errors like:
/etc/init.d/rudder-agent start
rudder-agent[10797]: [INFO] Using /etc/default/rudder-agent for configuration
rudder-agent[10800]: [INFO] Using /var/rudder/cfengine-community for Cfengine workdir
rudder-agent[10801]: [INFO] Launching Cfengine Community cf-serverd...
Can't stat file "/var/rudder/cfengine-community/inputs/checkGenericFileContent/3.0/checkGenericFileContent.cf" for parsing
!!! System error for stat: "No such file or directory"
I don't understand why: the host works without problems and my other new nodes don't have this problem. ncharles gave me some directions:
- /opt/rudder/sbin/cf-agent -KI -b update
Can't stat file "/var/rudder/cfengine-community/inputs/checkGenericFileContent/3.0/checkGenericFileContent.cf" for parsing
!!! System error for stat: "No such file or directory"
- check the content of /var/rudder/share/5bd89e6b-089d-4142-a895-df869396aa95 and of the subdirectories rules/cfengine-community & rules/cfengine-community/inputs (the last one didn't exist)
- ls /var/rudder/share/5bd89e6b-089d-4142-a895-df869396aa95
rules
- search for the string checkG in rules/cfengine-community/promises.cf to try to find "checkGenericFileContent/3.0/" calls
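For reference, the search step above could be done with a plain grep. This is only a minimal sketch: the helper name find_technique_calls is hypothetical, and the file to search is passed in as an argument rather than assumed.

```shell
#!/bin/sh
# Hypothetical helper: print the lines (with line numbers) of a promises file
# that mention a given technique path. The search string is taken literally.
# $1: string to look for
# $2: file to search
find_technique_calls() {
    grep -n -F -- "$1" "$2"
}

# Example, run from /var/rudder/share/<node uuid>:
# find_technique_calls "checkGenericFileContent/3.0/" rules/cfengine-community/promises.cf
```

grep exits with a non-zero status when nothing matches, so the helper can also be used in a test such as "if find_technique_calls ... ; then ...".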
After all of this I ran this command:
/opt/rudder/sbin/cf-agent -KI -f /var/rudder/cfengine-community/inputs/failsafe.cf
and the problem was solved.
Updated by Nicolas CHARLES over 12 years ago
- Description updated (diff)
- Priority changed from N/A to 1 (highest)
- Target version set to 2.3.8
This problem seems to persist for quite a long time (several days), without the failsafe autocorrecting the promises. So there must be something preventing it from running (bad policies implying cf-execd not running, implying no execution of cfengine via cf-execd? But the cron should have saved this).
There is definitely something fishy there.
Thank you for your bug report Francois, this is much appreciated.
Updated by Jonathan CLARKE over 12 years ago
- Target version changed from 2.3.8 to 2.3.9
Updated by Nicolas CHARLES over 12 years ago
- Status changed from New to In progress
Updated by Nicolas CHARLES over 12 years ago
Ha. I've been able to reproduce this bug.
It happens when the promises are invalid AND the cfengine database (BDB) is corrupted (most likely because cfengine was killed, which would also explain why files are missing: it was killed during a copy).
The failsafe doesn't start, and when running the agent manually, the output contains:
/var/rudder/cfengine-community/state/cf_lock.db page 14599 is on free list with type 5
PANIC: Invalid argument
rudder> BDB_WriteComplexKeyDB: Error trying to write database: DB_RUNRECOVERY: Fatal error, run database recovery
PANIC: fatal region error detected; run recovery
rudder> BDB_CloseDB: Unable to close database: DB_RUNRECOVERY: Fatal error, run database recovery
rudder> CloseDB: Could not close DB handle.
rudder> CloseDB: Trying to remove handle from open pool anyway.
Deleting the .db files solves the issue, but it's inconvenient.
I'm afraid we'll have to live with this until we move to a version of CFEngine with a correct database.
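As an illustration of the manual workaround, here is a minimal sketch that moves the .db files aside instead of deleting them, so they can still be inspected afterwards. The function name set_aside_db_files and the backup directory naming are hypothetical; the state directory is assumed to be /var/rudder/cfengine-community/state, as in the output above. It is probably safest to run it while no cfengine process is running.

```shell
#!/bin/sh
# Hypothetical helper: move every *.db file out of a state directory into a
# timestamped backup directory, and print the backup directory name.
# $1: state directory (assumed: /var/rudder/cfengine-community/state)
set_aside_db_files() {
    state_dir="$1"
    backup_dir="${state_dir}/db-backup-$(date +%Y%m%d%H%M%S)"
    found=0
    for db in "${state_dir}"/*.db; do
        [ -e "${db}" ] || continue          # the glob matched nothing
        if [ "${found}" -eq 0 ]; then
            mkdir -p "${backup_dir}" || return 1
            found=1
        fi
        mv -- "${db}" "${backup_dir}/" || return 1
    done
    if [ "${found}" -eq 1 ]; then
        echo "${backup_dir}"
    fi
    return 0
}

# Example:
# set_aside_db_files /var/rudder/cfengine-community/state
```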
Updated by Nicolas CHARLES over 12 years ago
- Status changed from In progress to Discussion
- Assignee changed from Nicolas CHARLES to Jonathan CLARKE
Jon, I'm not really sure what to do, since there is no obvious solution...
Updated by Jonathan CLARKE about 12 years ago
Nicolas CHARLES wrote:
Jon, I'm not really sure what to do, since there is no obvious solution...
I think we should have an external test (i.e., in cron) that detects this case and removes the BDB database if necessary. This would avoid the need for manual intervention to repair things, and is only a temporary workaround until we have a version of CFEngine that can use something other than BDB.
Can you please explain the steps to reproduce the situation, so that we can see what error code cf-agent returns and see if we can build something from that?
Updated by Nicolas CHARLES about 12 years ago
To reproduce it, you need to kill -9 a cf-agent while it is running, as well as all the other cfengine components, and then delete a promise file (like userManagement/1.0/userManagement.cf).
Do it several times if you are not sure.
Then cf-execd won't be able to restart:
/var/rudder/cfengine-community/bin/cf-execd
Can't stat file "/var/rudder/cfengine-community/inputs/userManagement/1.0/userManagement.cf" for parsing
!!! System error for stat: "No such file or directory"
The exit code is 1.
You can test this on debian-5-32.labo.normation.com
Thinking about it, the executor daemon is not supposed to fire the failsafe if something is wrong in the promises... shouldn't we execute cf-agent if cf-execd is not running?
Updated by Jonathan CLARKE about 12 years ago
I've had a look at this, and cf-execd and cf-agent both have the decency to exit with error code 1 when this happens. Unless you run failsafe.cf, in which case they block...
But actually, cf-agent only blocks when it gets to the part about downloading files:
rudder> *****************************************************************
rudder> BUNDLE update
rudder> *****************************************************************
rudder>
rudder>
rudder> =========================================================
rudder> vars in bundle update (1)
rudder> =========================================================
rudder>
rudder>
rudder> . . . . . . . . . . . . . . . . . . . . . . . . . . . .
rudder> Skipping whole next promise (server_inputs), as context nova_edition is not relevant
rudder> . . . . . . . . . . . . . . . . . . . . . . . . . . . .
rudder>
rudder> + Private classes augmented:
rudder>
rudder> - Private classes diminished:
rudder>
rudder>
rudder>
rudder> =========================================================
rudder> files in bundle update (1)
rudder> =========================================================
rudder>
rudder>
rudder> .........................................................
rudder> Promise handle:
rudder> Promise made by: /var/rudder/cfengine-community/inputs
rudder> .........................................................
rudder>
^Crudder> Received signal 2 (SIGINT) while doing []
rudder> Logical start time Mon Sep 24 16:28:20 2012
rudder> This sub-task started really at Mon Sep 24 16:28:21 2012
rudder> Trying to remove lock - try
rudder> Can't open lock-log file
rudder> !!! System error for fopen: "No such file or directory"
It actually executes the previous bundle just fine. So maybe we could implement a promise in another bundle that would be executed first, to somehow detect the corruption, and clean up the BerkeleyDB files... And then run "cf-agent -f failsafe.cf && cf-agent" via cron, instead of cf-execd.
What do you think?
Updated by Nicolas CHARLES about 12 years ago
If there is some way to detect the corruption and clean it up with the agent, that would be awesome.
And I totally agree that the cron should not run the executor, but rather "cf-agent -f failsafe && cf-agent" if the executor is not running; the agent would then run the executor, which would in turn run the agent.
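The cron logic discussed here could be sketched roughly as follows. This is not the actual patch, only an illustration: the function names are hypothetical, the default paths are assumptions taken from the commands quoted earlier in this ticket, and pgrep is assumed to be available on the node.

```shell
#!/bin/sh
# Sketch of a cron check: if cf-execd is not running, run the failsafe and then
# the normal agent, rather than starting cf-execd directly.
# Both defaults below are assumptions based on paths quoted in this ticket.
CF_AGENT="${CF_AGENT:-/opt/rudder/sbin/cf-agent}"
FAILSAFE="${FAILSAFE:-/var/rudder/cfengine-community/inputs/failsafe.cf}"

# Returns 0 if a process named exactly cf-execd exists.
execd_is_running() {
    pgrep -x cf-execd >/dev/null 2>&1
}

# Does nothing when cf-execd is alive; otherwise runs the failsafe first and
# the regular agent only if the failsafe succeeded.
check_and_repair() {
    if execd_is_running; then
        return 0
    fi
    "${CF_AGENT}" -f "${FAILSAFE}" && "${CF_AGENT}"
}

# A cron job would source this file and call:
# check_and_repair
```

Running the failsafe before the regular agent gives it a chance to restore missing promise files, which is exactly the case where cf-execd refuses to start.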
Updated by Jonathan CLARKE about 12 years ago
- Status changed from Discussion to In progress
Here's an idea:
- We change the cronjob that tests if cf-execd is running to launch "cf-agent -f failsafe.cf && cf-agent" instead of launching cf-execd. This gives the failsafe.cf the chance to repair any problems with the promises files that might stop cf-execd from running.
- Every time we correctly update the promises, we touch a file in /var/rudder/cfengine-community
- The first action in the failsafe would be to check the file above, and if it is older than, say, 1 hour, we remove cf_lock.db (and only this DB), to give CFEngine a chance to run properly again
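The last two points can be sketched in shell as follows. This only illustrates the idea and is not the attached patch: the marker file name last_update_ok and the function names are hypothetical, the 60 minute threshold comes from the "say, 1 hour" above, and find -mmin is assumed to be supported (it is in GNU and BSD find, but is not POSIX).

```shell
#!/bin/sh
# Sketch: remove cf_lock.db when the "last successful update" marker is stale.
# WORKDIR comes from this ticket; the marker name is a hypothetical example.
WORKDIR="${WORKDIR:-/var/rudder/cfengine-community}"
STAMP="${STAMP:-${WORKDIR}/last_update_ok}"
MAX_AGE_MIN="${MAX_AGE_MIN:-60}"

# To be called after each successful promise update.
mark_update_ok() {
    touch "${STAMP}"
}

# To be called as the first action of the failsafe.
remove_stale_lock_db() {
    # find prints the marker only if it was last modified more than
    # MAX_AGE_MIN minutes ago; a missing marker prints nothing, so it is
    # treated as "not stale" and nothing is removed.
    if [ -n "$(find "${STAMP}" -mmin +"${MAX_AGE_MIN}" 2>/dev/null)" ]; then
        rm -f "${WORKDIR}/state/cf_lock.db"
    fi
}
```

Only cf_lock.db is removed, as proposed; the other databases in the state directory are left alone.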
Updated by Jonathan CLARKE about 12 years ago
Suggested patch attached.
Updated by Jonathan CLARKE about 12 years ago
- Target version changed from 2.3.9 to 2.4.0~beta5
This change is too impactful for 2.3 in my opinion, so it will only be fixed in 2.4.
Updated by Jonathan CLARKE about 12 years ago
- Status changed from In progress to Pending technical review
- % Done changed from 0 to 100
Applied in changeset commit:9986ba490b88eef9791a2ba29c763189c2249bdb.
Updated by Nicolas PERRON about 12 years ago
Jonathan CLARKE wrote:
Here's an idea:
- We change the cronjob that tests if cf-execd is running to launch "cf-agent -f failsafe.cf && cf-agent" instead of launching cf-execd. This gives the failsafe.cf the chance to repair any problems with the promises files that might stop cf-execd from running.
- Every time we correctly update the promises, we touch a file in /var/rudder/cfengine-community
- The first action in the failsafe would be to check the file above, and if it is older than, say, 1 hour, we remove cf_lock.db (and only this DB), to give CFEngine a chance to run properly again
The logic has been implemented.
I couldn't reproduce the bug with the error message about cf_lock.db, but I did get a locked state while launching cf-agent at the "startExecution" step. Our logic unlocked the state and everything returned to normal.
I suppose we can consider the issue fixed?
Updated by Nicolas CHARLES about 12 years ago
- Assignee changed from Nicolas PERRON to Jonathan CLARKE
Nicolas PERRON wrote:
Jonathan CLARKE wrote:
Here's an idea:
- We change the cronjob that tests if cf-execd is running to launch "cf-agent -f failsafe.cf && cf-agent" instead of launching cf-execd. This gives the failsafe.cf the chance to repair any problems with the promises files that might stop cf-execd from running.
- Every time we correctly update the promises, we touch a file in /var/rudder/cfengine-community
- The first action in the failsafe would be to check the file above, and if it is older than, say, 1 hour, we remove cf_lock.db (and only this DB), to give CFEngine a chance to run properly again
The logic has been implemented.
I couldn't reproduce the bug with the error message about cf_lock.db, but I did get a locked state while launching cf-agent at the "startExecution" step. Our logic unlocked the state and everything returned to normal.
I suppose we can consider the issue fixed?
I corrected a small typo (. has precedence over | in logical expressions) but other than that, it looks good to me. Jon, can you confirm?
Updated by Jonathan CLARKE about 12 years ago
- Status changed from Pending technical review to Released
Nicolas CHARLES wrote:
I corrected a small typo (. has precedence over | in logical expressions) but other than that, it looks good to me. Jon, can you confirm?
Yes, I agree!