Project

General

Profile

Actions

Bug #3928

closed

Sometimes CFEngine get stuck because of locks on TokyoCabinet

Added by Nicolas CHARLES over 9 years ago. Updated almost 8 years ago.

Status:
Released
Priority:
2
Category:
Web - Config management
Target version:
Severity:
UX impact:
User visibility:
Effort required:
Priority:
Regression:

Description

Sometimes CFEngine get stuck, and doesn't do anything because of deadlock on DB
We currently kill the database, but it's not always easy to detect the problem like that, and it's a bit strict
Neil Watson has a neat implementation of for killing cfengine processes using the locks

https://gist.github.com/neilhwatson/6551521


Related issues 4 (0 open4 closed)

Related to Rudder - Bug #4494: Accumulation of cf-agent processes due to locking on CFEngine tcdb lock fileReleasedJonathan CLARKEActions
Related to Rudder - Bug #4752: cf_lock.tcdb is not cleaned by check-rudder-agent script when update file is older than 10 minutesReleasedJonathan CLARKE2014-04-11Actions
Related to Rudder - Bug #4769: rudder-agent may be stucked by tokyo cabinet database bloatingReleasedJonathan CLARKE2014-04-23Actions
Has duplicate Rudder - Bug #4408: Sometimes there are too many cf-agent processes runningRejectedNicolas CHARLES2014-01-27Actions
Actions #1

Updated by Nicolas PERRON over 9 years ago

  • Target version changed from 2.8.0~beta1 to 2.6.9

This is a bug about tokyocabinet, then it should be targeted in Rudder 2.6

Actions #2

Updated by Nicolas PERRON about 9 years ago

  • Target version changed from 2.6.9 to 2.6.10
Actions #3

Updated by Vincent MEMBRÉ about 9 years ago

  • Priority changed from 3 to 2

CFEngine related bug is here: https://cfengine.com/dev/issues/3430

Actions #4

Updated by Jonathan CLARKE about 9 years ago

  • Target version changed from 2.6.10 to 2.8.2

A more up to date version of Neil's fix is here: https://github.com/evolvethinking/evolve_cfengine_freelib/blob/master/masterfiles/lib/evolve_freelib.cf#L2076 (supports different paths to lsof on redhat and !redhat).

His code is under GPL3 so can be integrated to Rudder.

Nicolas, I agree TokyoCabinet was present in Rudder 2.6, but this issue seems to be with the way locks are handled in CFEngine 3.5, which was introduced in Rudder 2.8. Since we're not 100% sure about this, I'd rather not introduce a potentially dangerous fix in a stable version, so let's focus on 2.8 for now.

Actions #5

Updated by Nicolas CHARLES about 9 years ago

I seem to recall that this corruption often happen on RedHat/Centos

The cause could also be bad init script as explained in https://cfengine.com/dev/issues/3686

Actions #6

Updated by Nicolas CHARLES about 9 years ago

  • Status changed from New to Pending technical review
  • Assignee set to Jonathan CLARKE
  • Pull Request set to https://github.com/Normation/rudder-techniques/pull/268/files

PR is there
https://github.com/Normation/rudder-techniques/pull/268/files

I implemented Neil's method, with some modifications/corrections.
I'm not sure it will solve the pb (I didn't manage to get more than one lock on the systems I tried, even when launching dozens of agent at the same time)

Actions #7

Updated by Jonathan CLARKE about 9 years ago

Nicolas CHARLES wrote:

I seem to recall that this corruption often happen on RedHat/Centos

The cause could also be bad init script as explained in https://cfengine.com/dev/issues/3686

This seems unlikely, because we don't use the init script provided by CFEngine (it's only in their RPM/deb packages, IIRC), we use one of our own. But we don't do anything with /var/lock/subsys... Do you think we should?

Actions #8

Updated by Jonathan CLARKE about 9 years ago

  • Status changed from Pending technical review to Discussion
  • Assignee changed from Jonathan CLARKE to Nicolas CHARLES

Nicolas CHARLES wrote:

PR is there
https://github.com/Normation/rudder-techniques/pull/268/files

I implemented Neil's method, with some modifications/corrections.
I'm not sure it will solve the pb (I didn't manage to get more than one lock on the systems I tried, even when launching dozens of agent at the same time)

While doing the technical review, I noticed that the "lsof" command isn't available by default on RHEL/CentOS. Unfortunately, I think that this means we cannot adopt the approach in Neil's method.

However, this doesn't mean we can't salvage a lot of his code. I think the "test" just needs to be changed. Maybe we could simply look at the number of cf-* processes running? If I understand correctly, lsof just gets the number of processes that are reading cf_lock.tcdb. And they must be cf-* (or maybe even cf-agent) processes. So why not count them using CFEngine's built in processes: promises?

Alternatively, some other ideas I came up with:
  • Kill the DB file when it grows over a certain size
  • Measure the DB file size regularly and when it grows regularly and goes over a certain size, then kill it

Last but not least, I feel that it could make sense to include a test/rm in a non-CFEngine file since, IIRC, this can block the agent completely. The prime candidate for this would be https://github.com/Normation/rudder-packages/blob/master/rudder-agent/SOURCES/check-rudder-agent.

What do you think?

Actions #9

Updated by Vincent MEMBRÉ about 9 years ago

  • Target version changed from 2.8.2 to 2.8.3
Actions #10

Updated by Nicolas CHARLES about 9 years ago

There is also a solution that as been proposed on the CFEngine ML:
https://groups.google.com/forum/#!topic/help-cfengine/H1gn-Hun_GU

that could clearly go in a cron job :)

Actions #11

Updated by Nicolas CHARLES about 9 years ago

Ok, i tried the following approach

bundle agent tcdb_fix
{
vars:
linux::
  "db" slist => splitstring( execresult("/usr/bin/find /var/cfengine -name '*.tcdb' 2>/dev/null", "useshell"), "\n", "1000");

classes:

# NOTE: assumes that CFEngine is set to run every 5 minutes
"hourly_class" expression => splayclass("$(sys.host)$(sys.ipv4)", "hourly");

hourly_class.linux::
  "detected_invalid_record_$(db)" expression =>
returnszero("/var/cfengine/bin/tchmgr optimize $(db) 2>&1 | grep -q 'invalid record header'", "useshell");

commands:
"/bin/rm" 
  args => "-f $(db)",
  ifvarclass => canonify("detected_invalid_record_$(db)"),
  classes => tcdb_fix_scoped_classes_generic("bundle", "absent_$(db)"),
  handle => "fix_tcdb_commands_detected_invalid_record_rm_$(db)",
  comment => "Invalid record headers indicate that the database corruption is beyond repair. It will be automatically re-created.";

reports:
  "$(this.bundle) $(sys.fqhost): Detected invalid record header in $(db) - tried to repair" 
    ifvarclass => canonify("detected_invalid_record_$(db)");

  "$(this.bundle) $(sys.fqhost): Repair failed, removed corrupt database: $(db)" 
    ifvarclass => canonify("absent_$(db)_repaired");
}

but tchmgr is not always installed, so i cannot try this solution.
The goal of this solution is double:
  1. garbage collect the database, so it doesn't get too big
  2. detect errors, and destroy db

It could go either to cron or to cfengine; I wasn't able to get to a machine with a large enough tcdb AND tchmgr to check compatibility of both solution :(

Actions #12

Updated by Nicolas CHARLES about 9 years ago

ok, there was some ldd issue, I can fully test now on the system.

Actions #13

Updated by Jonathan CLARKE about 9 years ago

I had a nice simple idea to address this, at least partly. We know that cf-agent locks up when the DBs are corrupt, and this ends up in multiple cf-agent processes piling up. We already have a policy in our system/common technique to detect this and kill cf-agent if there are more then 8 instances running.

So, let's add a simple count of cf-agent processes to our crontab, and kill (or even kill -9) them and rm the DB if that happens!

That would be a very safe test, IMHO, because it will only ever kick in if CFEngine itself can't do the killing, and it will protect us from the worst case scenario (which is having hundreds of agents pile up). What do you think?

Also, your promise to check for "invalid record header" could be a nice preventative measure too. However, we have noticed with some users that the corrupt DBs don't always contain that message, so it's not 100% reliable.

Actions #14

Updated by Nicolas CHARLES about 9 years ago

Yeah, I think both approach are relevant.
By running optimize on our test system, we were able to save up to 70% space...
but indeed, sometimes agent piles up, without corruption.

I'm using my bash-fu to create proper scripts for this :)

Actions #15

Updated by Nicolas CHARLES about 9 years ago

  • Status changed from Discussion to Pending technical review
  • Assignee changed from Nicolas CHARLES to Jonathan CLARKE
  • Pull Request changed from https://github.com/Normation/rudder-techniques/pull/268/files to https://github.com/Normation/rudder-packages/pull/211
Actions #16

Updated by Jonathan CLARKE about 9 years ago

  • Status changed from Pending technical review to Discussion
  • Assignee changed from Jonathan CLARKE to Nicolas CHARLES

Nice bash-fu! I've made some comments, because this is a potentially dangerous change, so I want to be doubly-careful. Please take them into consideration and re-submit for review.

Actions #17

Updated by Nicolas CHARLES about 9 years ago

  • Status changed from Discussion to Pending technical review
  • Assignee changed from Nicolas CHARLES to Jonathan CLARKE

Thank you.
PR has been updated

Actions #18

Updated by Jonathan CLARKE about 9 years ago

  • Status changed from Pending technical review to Discussion
  • Assignee changed from Jonathan CLARKE to Nicolas CHARLES
Actions #19

Updated by Nicolas CHARLES about 9 years ago

  • Status changed from Discussion to Pending technical review
  • Assignee changed from Nicolas CHARLES to Jonathan CLARKE

thank you Jon, PR has been updated

Actions #20

Updated by Jonathan CLARKE about 9 years ago

  • Status changed from Pending technical review to Pending release
Actions #21

Updated by Vincent MEMBRÉ about 9 years ago

  • Status changed from Pending release to Discussion
  • Assignee changed from Jonathan CLARKE to Vincent MEMBRÉ

Spec file is not correct !

install -m 755 %{SOURCE7} %{buildroot}/opt/rudder/bin/vzps.py

SOURCE7 file is check-rudder-agent script

vzps.py should be included in the specfile as SOURCE8

Reopening the issue

Actions #22

Updated by Vincent MEMBRÉ about 9 years ago

  • Status changed from Discussion to In progress
Actions #23

Updated by Vincent MEMBRÉ about 9 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Vincent MEMBRÉ to Jonathan CLARKE
  • Pull Request changed from https://github.com/Normation/rudder-packages/pull/211 to https://github.com/Normation/rudder-packages/pull/212
Actions #24

Updated by Vincent MEMBRÉ about 9 years ago

  • Status changed from Pending technical review to Pending release
  • % Done changed from 0 to 100

Applied in changeset packages:commit:0eb04d822f30f8bd80478795b8734f2219eb8b21.

Actions #25

Updated by Jonathan CLARKE about 9 years ago

Applied in changeset packages:commit:3d5a8946f8b89eaa5daafe1a29791a10acb866ea.

Actions #26

Updated by Vincent MEMBRÉ about 9 years ago

  • Status changed from Pending release to Discussion
  • Assignee changed from Jonathan CLARKE to Vincent MEMBRÉ

Nightlies can't be installed an error occurs at the end of the postinst... here is an extract of the install log

17:54:06 Making sure that the permissions on the CFEngine key directory are correct...
17:54:06 CFEngine binaries copied to workdir
17:54:07 ********************************************************************************
17:54:07 rudder-agent has been installed (not started). This host can be a Rudder node.
17:54:07 To get started, configure your Rudder server's hostname and launch the agent:
17:54:07 # echo 'rudder.server' > /var/rudder/cfengine-community/policy_server.dat
17:54:07 # /etc/init.d/rudder-agent start
17:54:07 This node will then appear in the Rudder web interface under 'Accept new nodes'.
17:54:07 ********************************************************************************
17:54:07 INFO: Creating keys for CFEngine agent... Done.
17:54:07 INFO: Creating a new UUID for Rudder as the existing one is invalid... Done.
17:54:07 Setting up rudder-inventory-ldap (2.8.3~rc1~git201401251719-squeeze0) ...
17:54:07 dpkg: error processing rudder-agent (--configure):
17:54:07  subprocess installed post-installation script returned error exit status 1
17:54:07 configured to not write apport reports

This is due to a change in check-rudder-agent script of line:

if [ ! -e ${CFE_DISABLE_FILE} -a `ps -efww | grep -E "(cf-execd|cf-agent)" | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep | wc -l` -eq 0 -a -f ${CFE_DIR}/policy_server.dat ]; then

into:

# List the CFEngine processes running
CF_PROCESS_RUNNING=`${PS} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep`
# Count the number of processes running
NB_CF_PROCESS_RUNNING=`echo "${CF_PROCESS_RUNNING}" | wc -l`

if [ ! -e ${CFE_DISABLE_FILE} -a ${NB_CF_PROCESS_RUNNING} -eq 0 -a -f ${CFE_DIR}/policy_server.dat ]; then

Especially this line which returns an error ( on last grep ):

 CF_PROCESS_RUNNING=`${PS} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep`

Replacing:

# List the CFEngine processes running
CF_PROCESS_RUNNING=`${PS} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep`
# Count the number of processes running
NB_CF_PROCESS_RUNNING=`echo "${CF_PROCESS_RUNNING}" | wc -l`

by

# Count the number of CFEngine processes running
NB_CF_PROCESS_RUNNING=`${PS} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep | wc -l`

should work!

Actions #27

Updated by Jonathan CLARKE about 9 years ago

Vincent MEMBRÉ wrote:

Especially this line which returns an error ( on last grep ):
[...]

Replacing:

  1. List the CFEngine processes running
    CF_PROCESS_RUNNING=`${PS} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep`
  2. Count the number of processes running
    NB_CF_PROCESS_RUNNING=`echo "${CF_PROCESS_RUNNING}" | wc -l`

by

  1. Count the number of CFEngine processes running
    NB_CF_PROCESS_RUNNING=`${PS} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep | wc -l`
    should work!

Yes, but we also need the content of CF_PROCESS_RUNNING (see below in the script). So don't remove it.

A simpler fix would be to avoid the last command in the pipeline returning 0:

CF_PROCESS_RUNNING=`${PS} -efww | grep -E "${CFE_BIN_DIR}/(cf-execd|cf-agent)" | grep -v grep | cat`
Actions #28

Updated by Vincent MEMBRÉ about 9 years ago

Yes, I have just seen that too...

I tried with '{grep -v grep || true}' but then the value of the variable is an empty line which is counted by the 'wc -l' giving 1 instead of zero!!

Trying with ' | cat' to see if it fits well!

Thank you Jon!

Actions #29

Updated by Vincent MEMBRÉ about 9 years ago

Not working either with '| cat' only (result of 'wc -l' will be 1)

Modifying other line by filtering empty lines

NB_CF_PROCESS_RUNNING=`echo "${CF_PROCESS_RUNNING}" | wc -l`

by:

NB_CF_PROCESS_RUNNING=`echo "${CF_PROCESS_RUNNING}" | grep -v ^$ | wc -l`

would work

Actions #30

Updated by Vincent MEMBRÉ about 9 years ago

  • Status changed from Discussion to Pending technical review
  • Assignee changed from Vincent MEMBRÉ to Jonathan CLARKE
  • Pull Request changed from https://github.com/Normation/rudder-packages/pull/212 to https://github.com/Normation/rudder-packages/pull/214
Actions #31

Updated by Vincent MEMBRÉ about 9 years ago

  • Status changed from Pending technical review to Pending release

Applied in changeset packages:commit:1be5b55f274aa56b95c36295985b9e942775f9d7.

Actions #32

Updated by Jonathan CLARKE about 9 years ago

Applied in changeset packages:commit:a05df68163747ad462d7b45161eb515896f3ea1c.

Actions #33

Updated by Vincent MEMBRÉ almost 9 years ago

  • Project changed from 24 to Rudder
  • Category changed from Techniques to 14
Actions #34

Updated by Vincent MEMBRÉ almost 9 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 2.8.3, which was released today.
Check out:

Actions #35

Updated by Benoît PECCATTE almost 8 years ago

  • Category changed from 14 to Web - Config management
Actions

Also available in: Atom PDF