Bug #4769

rudder-agent may get stuck because of Tokyo Cabinet database bloat

Added by Vincent MEMBRÉ almost 6 years ago. Updated over 5 years ago.

Status: Released
Priority: 1
Category: System integration

Description

We have issues with tcdb on rudder-agent. The file /var/rudder/cfengine-community/state/cf_lock.tcdb grows very large, which leads to corruption and slows down or blocks rudder-agent.

We first added a check in check-rudder-agent that looks whether 8 or more cf-agent processes are running (#3928).

Then we added a check that looks whether promises were updated in the last 10 minutes ( /var/rudder/cfengine-community/last_successful_inputs_update ) (#4494).
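
The two checks above could be sketched roughly as follows. This is only an illustration, not the actual check-rudder-agent code: the function names are hypothetical, and it assumes a find that supports -mmin (GNU or BSD) and a system that provides pgrep.

```shell
#!/bin/sh
# Illustrative sketch only; function names and defaults are assumptions.

# Succeeds (exit 0) when FILE is missing or was last modified more than
# MINUTES minutes ago (default: 10).
inputs_update_is_stale() {
    file=$1
    minutes=${2:-10}
    [ -e "$file" ] || return 0
    # find prints the path only when its mtime is older than the limit.
    [ -n "$(find "$file" -mmin "+$minutes" 2>/dev/null)" ]
}

# Succeeds when at least LIMIT processes named exactly NAME are running
# (default limit: 8).
too_many_processes() {
    name=$1
    limit=${2:-8}
    # The arithmetic expansion strips any padding printed by wc.
    count=$(( $(pgrep -x "$name" 2>/dev/null | wc -l) ))
    [ "$count" -ge "$limit" ]
}
```

For example, `inputs_update_is_stale /var/rudder/cfengine-community/last_successful_inputs_update 10 || too_many_processes cf-agent 8` would succeed when either condition is met.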

After various fixes to those scripts (typos, missing files ... #4582, #4604, #4686, #4752), we can see that those conditions are not sufficient.

Even with these fixes, the reporting on nodes is impacted by this bug, since the agent runs slower and slower at each execution until a run lasts as long as the agent interval.
The other thing to note is that the agent is much more impacted when it is launched by cf-execd; the results of a manual launch are far better.

The effects of the big tcdb databases are:
  • The agent gets slow
  • The agent uses a lot of resources for a longer period
  • The reporting in Rudder is broken

I think a solution would be to have a size-based check.


Related issues

Related to Rudder - Bug #3928: Sometimes CFEngine get stuck because of locks on TokyoCabinet (Released, 2013-09-13, Jonathan CLARKE)
Related to Rudder - Bug #4494: Accumulation of cf-agent processes due to locking on CFEngine tcdb lock file (Released, Jonathan CLARKE)
Related to Rudder - Bug #4582: Last update detection is broken, causing cron remove cf_lock database and flood with emails every 5 minutes (Released, 2014-03-11, Jonathan CLARKE)
Related to Rudder - Bug #4604: Typo in the deletion of lock file if the promises are not updated (Released, 2014-03-12, Jonathan CLARKE)
Related to Rudder - Bug #4686: Typo in /opt/rudder/bin/check-rudder-agent, prevent cleaning of cf-lock and floods with cron mails (Released, 2014-03-28, Jonathan CLARKE)
Related to Rudder - Bug #4752: cf_lock.tcdb is not cleaned by check-rudder-agent script when update file is older than 10 minutes (Released, 2014-04-11, Jonathan CLARKE)
Related to Rudder - Bug #4841: Job Scheduler Technique should not use ifelapsed to avoid running several time same job (Released, 2014-05-11, Jonathan CLARKE)
#1

Updated by Vincent MEMBRÉ almost 6 years ago

I have one example:

  1. My promises on a new agent (cf_lock ~ 1 MB) take 10 seconds to run, with no problems on reporting or anything else.
  2. With time, cf_lock grows big (~ 140 MB) and the agent:
    • when run by cf-execd, now takes 4 minutes,
    • reporting was OK only during the last minute before the next agent run,
    • when run manually, stays quite fast (~ 20 seconds).
  3. After waiting a little more (in my case it was at ~ 155 MB):
    • when cf-agent is run by cf-execd, it now takes 11 minutes,
    • reporting was never OK, there were always some missing reports,
    • a manual run takes 4 minutes.

In all cases, the last_successful_update file is updated correctly, even with the 11-minute runs.

Agents do not pile up in the first two cases; only the last one causes them to pile up (and there were only 4 at the same time, so quite far from 8 agents).

In all cases, cf-agent uses 100% CPU for the whole duration of the execution. In the last case this means four CPUs in use, which impacts my other applications (maybe that is why the manually run agent cannot work correctly...).

#2

Updated by Vincent MEMBRÉ almost 6 years ago

As said in #4752, coredumb and dnns are using a cron job that checks whether the size of that file is over 10 MB and, if so, cleans (rm) the whole state directory.

They have not had any problems since, and they have not met any downside using it.

Maybe the size and the cleaning method are not the right ones, but I clearly think this is the idea we should add.

About the size: according to coredumb, the agent starts to slow down at 10 MB, and at 100 MB it is always slow.

I don't know if there is a strict rule, or if it gets slower randomly ... What I have seen is: the bigger the file, the slower it can be.

Some data points, which may need to be confirmed:

  • < 1 MB: 10 seconds
  • 35 MB: ~ 1 minute
  • 140 MB: 4 minutes
  • 150 MB: 11 minutes (maybe some corruption occurred here?)
  • 230 MB: 10 minutes
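
To collect more data points like these, a small helper along the following lines might be enough. It is only a sketch: the function name is hypothetical, the size is read before the command runs, and the duration has one-second resolution.

```shell
#!/bin/sh
# Illustrative sketch only; measure_run is a hypothetical helper.
# Prints "<size in bytes> <duration in seconds>" for one run of a command.
# Usage: measure_run /path/to/cf_lock.tcdb command [args...]
measure_run() {
    db=$1
    shift
    if [ -f "$db" ]; then
        # Size of the database before the run.
        size=$(( $(wc -c < "$db") ))
    else
        size=0
    fi
    start=$(date +%s)
    # The command's own output is discarded; only the timing matters here.
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo "$size $((end - start))"
}
```

A call such as `measure_run /var/rudder/cfengine-community/state/cf_lock.tcdb /opt/rudder/bin/cf-agent -KI` would print one line per run, which can be appended to a log to see how the run time evolves with the file size.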

About the cleaning methods:

  • I don't think we should delete the whole state directory.
  • I tried using 'tchmgr optimize', but that changed nothing at all (same size, same run time), and neither did the -df option.
  • I would only delete the cf_lock.tcdb files (that file and the tcdb.lock file).
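
A minimal sketch of that last option, under stated assumptions: the function name is hypothetical, the companion lock file is assumed to be named cf_lock.tcdb.lock, and the default threshold is the 10 MB figure quoted above (10485760 bytes).

```shell
#!/bin/sh
# Illustrative sketch only; clean_cf_lock_if_large is a hypothetical helper.
# Removes the lock database and its companion lock file when the database
# is larger than a size limit. Returns 0 if files were removed, 1 otherwise.
clean_cf_lock_if_large() {
    state_dir=${1:-/var/rudder/cfengine-community/state}
    max_bytes=${2:-10485760}
    db="$state_dir/cf_lock.tcdb"
    [ -f "$db" ] || return 1
    size=$(( $(wc -c < "$db") ))
    if [ "$size" -gt "$max_bytes" ]; then
        # Only the cf_lock files are removed, not the whole state directory.
        # Assumption: the lock file is named cf_lock.tcdb.lock.
        rm -f "$db" "$db.lock"
        return 0
    fi
    return 1
}
```

Because the function only reports whether it cleaned something, the caller can decide what to do next, for example trigger an agent run only after a cleanup.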
#3

Updated by Vincent MEMBRÉ almost 6 years ago

Another lead would be to let CFEngine optimize its tcdb itself, using the TCDB_OPTIMIZE_PERCENT variable.

That variable makes CFEngine check whether it has to optimize the tcdb; see https://github.com/cfengine/core/blob/master/libpromises/dbm_tokyocab.c#L128
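
One way to try this could look like the sketch below. It assumes that TCDB_OPTIMIZE_PERCENT is read from the environment of the agent process, which should be checked against the linked source; the wrapper name is hypothetical and the value 25 in the usage note is only an example.

```shell
#!/bin/sh
# Illustrative sketch only; run_with_tcdb_optimize is a hypothetical wrapper.
# Assumption: the agent reads TCDB_OPTIMIZE_PERCENT from its environment.
# Usage: run_with_tcdb_optimize PERCENT command [args...]
run_with_tcdb_optimize() {
    percent=$1
    shift
    case $percent in
        ''|*[!0-9]*)
            echo "percent must be a non-negative integer" >&2
            return 2
            ;;
    esac
    # The variable is set for this command only, not for the calling shell.
    env TCDB_OPTIMIZE_PERCENT="$percent" "$@"
}
```

For a manual test this would be called as `run_with_tcdb_optimize 25 /opt/rudder/bin/cf-agent -KI`. Under the same assumption, agents launched by cf-execd would need the variable in the environment of cf-execd itself.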

#4

Updated by Vincent MEMBRÉ almost 6 years ago

To use a cron job that deletes cf_lock when it is over 10 MB, put the following line in the file /etc/cron.d/clean-rudder-locks:

*/5 * * * * root if [ -f /var/rudder/cfengine-community/state/cf_lock.tcdb ] && [ "$(wc -c < /var/rudder/cfengine-community/state/cf_lock.tcdb)" -gt 10485760 ]; then rm -f /var/rudder/cfengine-community/state/cf_lock.tcdb* && /opt/rudder/bin/cf-agent -KI > /dev/null 2>&1; fi
#5

Updated by Vincent MEMBRÉ almost 6 years ago

The rudder-clean works correctly on servers.

I have some nodes with TCDB_OPTIMIZE_PERCENT set to 25.

My tcdb files were reduced from 40 MB to 1 MB (execution time dropped from 1 minute to 10 seconds!) and the tcdb is now growing, but really slowly (from 1 MB to 1.9 MB in a few hours; it was 0.15 MB per run before).

So I still need to look at the behavior in the long run.

#6

Updated by Vincent MEMBRÉ almost 6 years ago

The variable does not prevent the error with cf_lock growing big. The growth rate is slower (1 MB in 45 minutes instead of 30 minutes), but when the file gets big (over 10 MB) we get some errors too.

The check on the size seems to me to be the only solution, maybe helped by the variable.

#7

Updated by Nicolas CHARLES almost 6 years ago

The impact of deleting the cf_lock database is that all locks on promises are removed, meaning everything that uses ifelapsed may fail.
It has no impact on persistent classes or on the package list.

#8

Updated by Nicolas CHARLES almost 6 years ago

  • Status changed from New to Pending technical review
  • Assignee changed from Vincent MEMBRÉ to Jonathan CLARKE
  • Pull Request set to https://github.com/Normation/rudder-packages/pull/317
#9

Updated by Jonathan CLARKE almost 6 years ago

  • Status changed from Pending technical review to Discussion
  • Assignee changed from Jonathan CLARKE to Nicolas CHARLES
#10

Updated by Nicolas CHARLES almost 6 years ago

  • Status changed from Discussion to Pending technical review
  • Assignee changed from Nicolas CHARLES to Jonathan CLARKE

PR updated

#11

Updated by Nicolas CHARLES almost 6 years ago

  • Status changed from Pending technical review to Pending release
  • % Done changed from 0 to 100

Applied in changeset packages:commit:838eace354d6e2be06c09536268fd596086fdb9d.

#12

Updated by Jonathan CLARKE almost 6 years ago

Applied in changeset packages:commit:0d10ba0561ec64aae16a359102b0c1d498926887.

#13

Updated by Vincent MEMBRÉ over 5 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 2.9.5 (announcement, changelog) and 2.10.1 (announcement, changelog), which were released today.
