Bug #3928
Closed
Sometimes CFEngine gets stuck because of locks on TokyoCabinet
Added by Nicolas CHARLES about 11 years ago. Updated over 9 years ago.
Description
Sometimes CFEngine gets stuck and doesn't do anything because of a deadlock on the DB.
We currently kill the database, but it's not always easy to detect the problem that way, and it's a bit heavy-handed.
Neil Watson has a neat implementation for killing CFEngine processes using the locks.
Updated by Nicolas PERRON about 11 years ago
- Target version changed from 2.8.0~beta1 to 2.6.9
This bug is about TokyoCabinet, so it should be targeted at Rudder 2.6.
Updated by Nicolas PERRON about 11 years ago
- Target version changed from 2.6.9 to 2.6.10
Updated by Vincent MEMBRÉ almost 11 years ago
- Priority changed from 3 to 2
CFEngine related bug is here: https://cfengine.com/dev/issues/3430
Updated by Jonathan CLARKE almost 11 years ago
- Target version changed from 2.6.10 to 2.8.2
A more up to date version of Neil's fix is here: https://github.com/evolvethinking/evolve_cfengine_freelib/blob/master/masterfiles/lib/evolve_freelib.cf#L2076 (supports different paths to lsof on redhat and !redhat).
His code is under GPL3 so can be integrated to Rudder.
Nicolas, I agree TokyoCabinet was present in Rudder 2.6, but this issue seems to be with the way locks are handled in CFEngine 3.5, which was introduced in Rudder 2.8. Since we're not 100% sure about this, I'd rather not introduce a potentially dangerous fix in a stable version, so let's focus on 2.8 for now.
Updated by Nicolas CHARLES almost 11 years ago
I seem to recall that this corruption often happens on RedHat/CentOS.
The cause could also be a bad init script, as explained in https://cfengine.com/dev/issues/3686
Updated by Nicolas CHARLES almost 11 years ago
- Status changed from New to Pending technical review
- Assignee set to Jonathan CLARKE
- Pull Request set to https://github.com/Normation/rudder-techniques/pull/268/files
PR is there
https://github.com/Normation/rudder-techniques/pull/268/files
I implemented Neil's method, with some modifications/corrections.
I'm not sure it will solve the problem (I didn't manage to get more than one lock on the systems I tried, even when launching dozens of agents at the same time).
Updated by Jonathan CLARKE almost 11 years ago
Nicolas CHARLES wrote:
I seem to recall that this corruption often happens on RedHat/CentOS.
The cause could also be a bad init script, as explained in https://cfengine.com/dev/issues/3686
This seems unlikely, because we don't use the init script provided by CFEngine (it's only in their RPM/deb packages, IIRC), we use one of our own. But we don't do anything with /var/lock/subsys... Do you think we should?
Updated by Jonathan CLARKE almost 11 years ago
- Status changed from Pending technical review to Discussion
- Assignee changed from Jonathan CLARKE to Nicolas CHARLES
Nicolas CHARLES wrote:
PR is there
https://github.com/Normation/rudder-techniques/pull/268/files
I implemented Neil's method, with some modifications/corrections.
I'm not sure it will solve the problem (I didn't manage to get more than one lock on the systems I tried, even when launching dozens of agents at the same time).
While doing the technical review, I noticed that the "lsof" command isn't available by default on RHEL/CentOS. Unfortunately, I think that this means we cannot adopt the approach in Neil's method.
However, this doesn't mean we can't salvage a lot of his code. I think the "test" just needs to be changed. Maybe we could simply look at the number of cf-* processes running? If I understand correctly, lsof just gets the number of processes that are reading cf_lock.tcdb. And they must be cf-* (or maybe even cf-agent) processes. So why not count them using CFEngine's built in processes: promises?
Alternatively, some other ideas I came up with:
- Kill the DB file when it grows over a certain size
- Measure the DB file size regularly, and kill it when it keeps growing and goes over a certain size
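The size-based ideas above could be sketched in shell roughly as follows. The function name `db_too_big`, the threshold and the DB path in the usage comment are assumptions for illustration only; none of them come from this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when FILE exists and is larger than MAX_BYTES.
db_too_big() {
    file=$1
    max_bytes=$2
    [ -f "$file" ] || return 1
    # Arithmetic expansion strips any padding that wc may print.
    size=$(( $(wc -c < "$file") ))
    [ "$size" -gt "$max_bytes" ]
}

# Example use, commented out because it deletes a file.
# Both the path and the 10 MB limit are assumptions:
# db=/var/rudder/cfengine-community/state/cf_lock.tcdb
# db_too_big "$db" 10485760 && rm -f "$db"
```

A check like this only looks at the current size; detecting steady growth would need the previous measurement to be stored somewhere between runs.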
Last but not least, I feel that it could make sense to include a test/rm in a non-CFEngine file since, IIRC, this can block the agent completely. The prime candidate for this would be https://github.com/Normation/rudder-packages/blob/master/rudder-agent/SOURCES/check-rudder-agent.
What do you think?
Updated by Vincent MEMBRÉ almost 11 years ago
- Target version changed from 2.8.2 to 2.8.3
Updated by Nicolas CHARLES almost 11 years ago
There is also a solution that has been proposed on the CFEngine ML:
https://groups.google.com/forum/#!topic/help-cfengine/H1gn-Hun_GU
that could clearly go in a cron job :)
Updated by Nicolas CHARLES almost 11 years ago
OK, I tried the following approach:
bundle agent tcdb_fix
{
  vars:
    linux::
      "db" slist => splitstring(
        execresult("/usr/bin/find /var/cfengine -name '*.tcdb' 2>/dev/null", "useshell"),
        "\n",
        "1000");

  classes:
    # NOTE: assumes that CFEngine is set to run every 5 minutes
    "hourly_class" expression => splayclass("$(sys.host)$(sys.ipv4)", "hourly");

    hourly_class.linux::
      "detected_invalid_record_$(db)"
        expression => returnszero("/var/cfengine/bin/tchmgr optimize $(db) 2>&1 | grep -q 'invalid record header'", "useshell");

  commands:
    "/bin/rm"
      args => "-f $(db)",
      ifvarclass => canonify("detected_invalid_record_$(db)"),
      classes => tcdb_fix_scoped_classes_generic("bundle", "absent_$(db)"),
      handle => "fix_tcdb_commands_detected_invalid_record_rm_$(db)",
      comment => "Invalid record headers indicate that the database corruption is beyond repair. It will be automatically re-created.";

  reports:
    "$(this.bundle) $(sys.fqhost): Detected invalid record header in $(db) - tried to repair"
      ifvarclass => canonify("detected_invalid_record_$(db)");
    "$(this.bundle) $(sys.fqhost): Repair failed, removed corrupt database: $(db)"
      ifvarclass => canonify("absent_$(db)_repaired");
}
but tchmgr is not always installed, so I cannot try this solution.
This solution has two goals:
- garbage collect the database, so it doesn't get too big
- detect errors, and destroy the DB
It could go either in cron or in CFEngine; I wasn't able to get to a machine with a large enough tcdb AND tchmgr to check the compatibility of both solutions :(
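For the cron variant, the same check could look roughly like this in plain shell. It reuses the `tchmgr optimize` call and the 'invalid record header' pattern from the bundle above; the function name `tcdb_has_invalid_header` and the `TCHMGR` override variable are hypothetical names introduced for this sketch.

```shell
#!/bin/sh
# Path taken from the bundle above; overridable so the check can be tried
# on systems where tchmgr lives elsewhere (assumption).
TCHMGR=${TCHMGR:-/var/cfengine/bin/tchmgr}

# Succeed when "tchmgr optimize" reports an invalid record header for the DB.
# Running optimize also compacts the database as a side effect.
tcdb_has_invalid_header() {
    "$TCHMGR" optimize "$1" 2>&1 | grep -q 'invalid record header'
}

# Example cron-style loop, commented out because it deletes files:
# find /var/cfengine -name '*.tcdb' 2>/dev/null | while read -r db; do
#     tcdb_has_invalid_header "$db" && rm -f "$db"
# done
```

This only catches corruption that produces that specific message, so it is a preventative measure rather than a complete detector.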
Updated by Nicolas CHARLES almost 11 years ago
OK, there was an ldd issue; I can now fully test on the system.
Updated by Jonathan CLARKE almost 11 years ago
I had a nice simple idea to address this, at least partly. We know that cf-agent locks up when the DBs are corrupt, and this ends up in multiple cf-agent processes piling up. We already have a policy in our system/common technique to detect this and kill cf-agent if there are more than 8 instances running.
So, let's add a simple count of cf-agent processes to our crontab, and kill (or even kill -9) them and rm the DB if that happens!
That would be a very safe test, IMHO, because it will only ever kick in if CFEngine itself can't do the killing, and it will protect us from the worst case scenario (which is having hundreds of agents pile up). What do you think?
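A minimal sketch of such a crontab-side check is shown below. The threshold of 8 mirrors the existing policy mentioned above; the default value of `CFE_BIN_DIR`, the function names and the DB path in the usage comment are assumptions, not taken from the actual check-rudder-agent script.

```shell
#!/bin/sh
# Assumed defaults; adjust to the real installation.
CFE_BIN_DIR=${CFE_BIN_DIR:-/var/rudder/cfengine-community/bin}
MAX_AGENTS=${MAX_AGENTS:-8}

# Count cf-agent lines in a process listing read from stdin.
count_agents() {
    # Drop the grep processes themselves, then count the matches.
    # grep -c prints 0 but exits 1 when nothing matches, hence "|| true".
    grep -v grep | grep -c -- "${CFE_BIN_DIR}/cf-agent" || true
}

# Succeed when the given count exceeds MAX_AGENTS.
agents_piled_up() {
    [ "$1" -gt "$MAX_AGENTS" ]
}

# Example wiring, commented out because it kills processes and removes a file
# (the cf_lock.tcdb location is an assumption):
# n=$(ps -efww | count_agents)
# if agents_piled_up "$n"; then
#     pkill -9 -f "${CFE_BIN_DIR}/cf-agent"
#     rm -f /var/rudder/cfengine-community/state/cf_lock.tcdb
# fi
```

Keeping the counting and the decision in separate functions makes it possible to exercise the logic on a saved process listing before enabling the destructive part.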
Also, your promise to check for "invalid record header" could be a nice preventative measure too. However, we have noticed with some users that the corrupt DBs don't always contain that message, so it's not 100% reliable.
Updated by Nicolas CHARLES almost 11 years ago
Yeah, I think both approaches are relevant.
By running optimize on our test system, we were able to save up to 70% space...
but indeed, sometimes agents pile up without corruption.
I'm using my bash-fu to create proper scripts for this :)
Updated by Nicolas CHARLES almost 11 years ago
- Status changed from Discussion to Pending technical review
- Assignee changed from Nicolas CHARLES to Jonathan CLARKE
- Pull Request changed from https://github.com/Normation/rudder-techniques/pull/268/files to https://github.com/Normation/rudder-packages/pull/211
My bash-fu created this PR
https://github.com/Normation/rudder-packages/pull/211
Updated by Jonathan CLARKE almost 11 years ago
- Status changed from Pending technical review to Discussion
- Assignee changed from Jonathan CLARKE to Nicolas CHARLES
Nice bash-fu! I've made some comments, because this is a potentially dangerous change, so I want to be doubly-careful. Please take them into consideration and re-submit for review.
Updated by Nicolas CHARLES almost 11 years ago
- Status changed from Discussion to Pending technical review
- Assignee changed from Nicolas CHARLES to Jonathan CLARKE
Thank you.
PR has been updated
Updated by Jonathan CLARKE almost 11 years ago
- Status changed from Pending technical review to Discussion
- Assignee changed from Jonathan CLARKE to Nicolas CHARLES
Updated by Nicolas CHARLES almost 11 years ago
- Status changed from Discussion to Pending technical review
- Assignee changed from Nicolas CHARLES to Jonathan CLARKE
Thank you Jon, the PR has been updated.
Updated by Jonathan CLARKE almost 11 years ago
- Status changed from Pending technical review to Pending release
Updated by Vincent MEMBRÉ almost 11 years ago
- Status changed from Pending release to Discussion
- Assignee changed from Jonathan CLARKE to Vincent MEMBRÉ
The spec file is not correct!
install -m 755 %{SOURCE7} %{buildroot}/opt/rudder/bin/vzps.py
SOURCE7 file is check-rudder-agent script
vzps.py should be included in the specfile as SOURCE8
Reopening the issue
Updated by Vincent MEMBRÉ almost 11 years ago
- Status changed from Discussion to In progress
Updated by Vincent MEMBRÉ almost 11 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from Vincent MEMBRÉ to Jonathan CLARKE
- Pull Request changed from https://github.com/Normation/rudder-packages/pull/211 to https://github.com/Normation/rudder-packages/pull/212
Pull request here: https://github.com/Normation/rudder-packages/pull/212
Updated by Vincent MEMBRÉ almost 11 years ago
- Status changed from Pending technical review to Pending release
- % Done changed from 0 to 100
Applied in changeset packages:commit:0eb04d822f30f8bd80478795b8734f2219eb8b21.
Updated by Jonathan CLARKE almost 11 years ago
Applied in changeset packages:commit:3d5a8946f8b89eaa5daafe1a29791a10acb866ea.
Updated by Vincent MEMBRÉ almost 11 years ago
- Status changed from Pending release to Discussion
- Assignee changed from Jonathan CLARKE to Vincent MEMBRÉ
Nightlies can't be installed: an error occurs at the end of the postinst. Here is an extract of the install log:
17:54:06 Making sure that the permissions on the CFEngine key directory are correct...
17:54:06 CFEngine binaries copied to workdir
17:54:07 ********************************************************************************
17:54:07 rudder-agent has been installed (not started). This host can be a Rudder node.
17:54:07 To get started, configure your Rudder server's hostname and launch the agent:
17:54:07 # echo 'rudder.server' > /var/rudder/cfengine-community/policy_server.dat
17:54:07 # /etc/init.d/rudder-agent start
17:54:07 This node will then appear in the Rudder web interface under 'Accept new nodes'.
17:54:07 ********************************************************************************
17:54:07 INFO: Creating keys for CFEngine agent... Done.
17:54:07 INFO: Creating a new UUID for Rudder as the existing one is invalid... Done.
17:54:07 Setting up rudder-inventory-ldap (2.8.3~rc1~git201401251719-squeeze0) ...
17:54:07 dpkg: error processing rudder-agent (--configure):
17:54:07 subprocess installed post-installation script returned error exit status 1
17:54:07 configured to not write apport reports
This is due to a change in the check-rudder-agent script, from this line:
if [ ! -e ${CFE_DISABLE_FILE} -a `ps -efww | grep -E "(cf-execd|cf-agent)" | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep | wc -l` -eq 0 -a -f ${CFE_DIR}/policy_server.dat ]; then
into:
# List the CFEngine processes running
CF_PROCESS_RUNNING=`${PS} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep`
# Count the number of processes running
NB_CF_PROCESS_RUNNING=`echo "${CF_PROCESS_RUNNING}" | wc -l`
if [ ! -e ${CFE_DISABLE_FILE} -a ${NB_CF_PROCESS_RUNNING} -eq 0 -a -f ${CFE_DIR}/policy_server.dat ]; then
In particular, this line returns an error (from the last grep, which exits non-zero when it selects no lines):
CF_PROCESS_RUNNING=`${PS} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep`
Replacing:
# List the CFEngine processes running
CF_PROCESS_RUNNING=`${PS} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep`
# Count the number of processes running
NB_CF_PROCESS_RUNNING=`echo "${CF_PROCESS_RUNNING}" | wc -l`
by
# Count the number of CFEngine processes running
NB_CF_PROCESS_RUNNING=`${PS} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep | wc -l`
should work!
Updated by Jonathan CLARKE almost 11 years ago
Vincent MEMBRÉ wrote:
Especially this line which returns an error ( on last grep ):
[...]
Replacing:
# List the CFEngine processes running
CF_PROCESS_RUNNING=`${PS} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep`
# Count the number of processes running
NB_CF_PROCESS_RUNNING=`echo "${CF_PROCESS_RUNNING}" | wc -l`
by
# Count the number of CFEngine processes running
NB_CF_PROCESS_RUNNING=`${PS} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep | wc -l`
should work!
Yes, but we also need the content of CF_PROCESS_RUNNING (see below in the script). So don't remove it.
A simpler fix would be to keep the last command in the pipeline from returning a non-zero status:
CF_PROCESS_RUNNING=`${PS} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep | cat`
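The effect of the trailing `| cat` can be reproduced in isolation. The sketch below uses two hypothetical helper functions that feed a fixed string through the same filter and print the exit status of the assignment. A non-zero status matters, for example, when the calling script runs under `set -e`; whether that was the case for the postinst is not stated in this thread.

```shell
#!/bin/sh
# Print the exit status of the assignment when the pipeline ends in grep.
status_plain() {
    if out=$(printf '%s\n' "$1" | grep -v grep); then echo 0; else echo $?; fi
}

# Same pipeline with a trailing cat, whose status replaces grep's.
status_with_cat() {
    if out=$(printf '%s\n' "$1" | grep -v grep | cat); then echo 0; else echo $?; fi
}

# Simulate a process listing where only the grep itself matched.
status_plain 'grep -E cf-agent'      # prints 1
status_with_cat 'grep -E cf-agent'   # prints 0
```

The status of a pipeline is that of its last command, so appending `cat` hides grep's "no lines selected" status without changing the captured output.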
Updated by Vincent MEMBRÉ almost 11 years ago
Yes, I have just seen that too...
I tried with '{grep -v grep || true}', but then the variable is empty and echo still prints one empty line, which 'wc -l' counts, giving 1 instead of 0!
Trying with ' | cat' to see if it works!
Thank you Jon!
Updated by Vincent MEMBRÉ almost 11 years ago
Not working with '| cat' alone either (the result of 'wc -l' is still 1).
Replacing the other line, to filter out empty lines:
NB_CF_PROCESS_RUNNING=`echo "${CF_PROCESS_RUNNING}" | wc -l`
by:
NB_CF_PROCESS_RUNNING=`echo "${CF_PROCESS_RUNNING}" | grep -v ^$ | wc -l`
would work
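The off-by-one can be checked with two small functions, named `count_naive` and `count_nonempty` here purely for illustration. They apply the two counting variants above to a string passed as an argument instead of to `${CF_PROCESS_RUNNING}`.

```shell
#!/bin/sh
# Original counting: echo always emits a newline, even for an empty string.
count_naive() {
    echo "$1" | wc -l
}

# Fixed counting: drop empty lines before counting.
count_nonempty() {
    echo "$1" | grep -v '^$' | wc -l
}

count_naive ""      # prints 1, although no process was listed
count_nonempty ""   # prints 0
```

With an empty process list the naive count is 1, so a test such as `-eq 0` never succeeds; filtering empty lines restores the expected 0.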
Updated by Vincent MEMBRÉ almost 11 years ago
- Status changed from Discussion to Pending technical review
- Assignee changed from Vincent MEMBRÉ to Jonathan CLARKE
- Pull Request changed from https://github.com/Normation/rudder-packages/pull/212 to https://github.com/Normation/rudder-packages/pull/214
Updated by Vincent MEMBRÉ almost 11 years ago
- Status changed from Pending technical review to Pending release
Applied in changeset packages:commit:1be5b55f274aa56b95c36295985b9e942775f9d7.
Updated by Jonathan CLARKE almost 11 years ago
Applied in changeset packages:commit:a05df68163747ad462d7b45161eb515896f3ea1c.
Updated by Vincent MEMBRÉ over 10 years ago
- Project changed from 24 to Rudder
- Category changed from Techniques to 14
Updated by Vincent MEMBRÉ over 10 years ago
- Status changed from Pending release to Released
This bug has been fixed in Rudder 2.8.3, which was released today.
Check out:
- The release announcement: http://www.rudder-project.org/pipermail/rudder-announce/2014-March/000079.html
- The full ChangeLog: http://www.rudder-project.org/foswiki/bin/view/System/Documentation:ChangeLog28
- Download information: https://www.rudder-project.org/site/get-rudder/downloads/
Updated by Benoît PECCATTE over 9 years ago
- Category changed from 14 to Web - Config management