Bug #4769 (closed)

rudder-agent may be stuck because of Tokyo Cabinet database bloating

Added by Vincent MEMBRÉ almost 10 years ago. Updated almost 10 years ago.

Status: Released
Priority: 1
Category: System integration

Description

We have issues with tcdb on rudder-agent: the file /var/rudder/cfengine-community/state/cf_lock.tcdb keeps growing, leading to corruption and to a slow or blocked rudder-agent.

We first added a check in check-rudder-agent that looks for 8 or more cf-agent processes running (#3928).

Then we added a check that verifies the promises were updated within the last 10 minutes (via /var/rudder/cfengine-community/last_successful_inputs_update) (#4494).
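For illustration, here is a minimal sketch of what those two safeguards amount to. This is a hypothetical simplification, not the actual check-rudder-agent code; the paths and thresholds are the ones quoted in this ticket, and GNU find is assumed for -mmin.

#!/bin/sh
# Hedged sketch of the two safeguards described above (not the real script)
STATE_DIR=/var/rudder/cfengine-community/state
LAST_UPDATE=/var/rudder/cfengine-community/last_successful_inputs_update

# Safeguard from #3928: 8 or more cf-agent processes running
if [ "$(pgrep -x cf-agent | wc -l)" -ge 8 ]; then
    rm -f "${STATE_DIR}"/cf_lock.tcdb*
fi

# Safeguard from #4494: promises not updated within the last 10 minutes
if [ -z "$(find "${LAST_UPDATE}" -mmin -10 2>/dev/null)" ]; then
    rm -f "${STATE_DIR}"/cf_lock.tcdb*
fi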

After various fixes to those scripts (typos, missing files ... #4582, #4604, #4686, #4752), we can see these conditions are not sufficient.

Even with these fixes, reporting on nodes is still impacted by this bug, since the agent runs slower and slower at each execution, eventually reaching the length of the agent run interval.
The other thing to note is that the agent is much more impacted when it is launched by cf-execd; when launched manually, results are far better.

The effects of the big tcdb databases are:
  • The agent gets slow
  • The agent uses a lot of resources over a longer period
  • The reporting in Rudder gets broken

I think a solution would be to add a size-based check.


Related issues (7: 0 open, 7 closed)

Related to Rudder - Bug #3928: Sometimes CFEngine get stuck because of locks on TokyoCabinet (Released, Jonathan CLARKE, 2013-09-13)
Related to Rudder - Bug #4494: Accumulation of cf-agent processes due to locking on CFEngine tcdb lock file (Released, Jonathan CLARKE)
Related to Rudder - Bug #4582: Last update detection is broken, causing cron remove cf_lock database and flood with emails every 5 minutes (Released, Jonathan CLARKE, 2014-03-11)
Related to Rudder - Bug #4604: Typo in the deletion of lock file if the promises are not updated (Released, Jonathan CLARKE, 2014-03-12)
Related to Rudder - Bug #4686: Typo in /opt/rudder/bin/check-rudder-agent, prevent cleaning of cf-lock and floods with cron mails (Released, Jonathan CLARKE, 2014-03-28)
Related to Rudder - Bug #4752: cf_lock.tcdb is not cleaned by check-rudder-agent script when update file is older than 10 minutes (Released, Jonathan CLARKE, 2014-04-11)
Related to Rudder - Bug #4841: Job Scheduler Technique should not use ifelapsed to avoid running several time same job (Released, Jonathan CLARKE, 2014-05-11)
Actions #1

Updated by Vincent MEMBRÉ almost 10 years ago

I have one example:

  1. My promises on a new agent (cf_lock ~1 MB) take 10 seconds to run, with no problem with reporting or anything else
  2. Over time, cf_lock grows big (~140 MB) and the agent
    • now takes 4 minutes when run by cf-execd,
    • reporting was only OK during the last minute before the next agent run,
    • manually it stays quite fast (~20 seconds).
  3. Waiting a little more (in my case the file was at ~155 MB),
    • when cf-agent is run by cf-execd it now takes 11 minutes,
    • reporting was never OK, there were always some missing reports,
    • and a manual run takes 4 minutes.

In all cases, the last_successful_update file is updated correctly, even with the 11-minute run.

I don't see agents piling up in the first two cases; only the last one causes piling up (and there were only 4 at the same time, so quite far from 8 agents).

In all cases, cf-agent uses 100% CPU for the whole duration of the execution, which in the last case means four CPUs in use, impacting my other applications (maybe that's why the manually run agent cannot work correctly...).

Actions #2

Updated by Vincent MEMBRÉ almost 10 years ago

As said in #4752, coredumb and dnns are using a cron job that checks whether the size of that file is over 10 MB and, if it is, cleans (rm) the whole state directory.

They haven't had any problems since, and they haven't seen any downside to using it.

Maybe the size threshold and the cleaning method are not the right ones, but I clearly think this is the idea we should adopt.

About the size: according to coredumb, the agent starts to slow down at 10 MB, and at 100 MB it is always slow.

I don't know if there is a strict rule, or if it gets slower randomly... What I have seen is: the bigger the file, the slower the agent can be.

Some data points, which may need to be confirmed:

  • < 1 MB: 10 seconds
  • 35 MB: ~1 minute
  • 140 MB: 4 minutes
  • 150 MB: 11 minutes (maybe some corruption occurred here?)
  • 230 MB: 10 minutes

About the cleaning methods:

  • I don't think we should delete the whole state directory.
  • I tried 'tchmgr optimize', but that changed nothing at all (same size, same run time), and the same goes for the -df option (see the command sketch after this list).
  • I would only delete the cf_lock.tcdb files (that file and the tcdb.lock file).
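For the record, the optimize attempt mentioned in the list above was roughly the following. This is a sketch only: tchmgr is the Tokyo Cabinet hash database manager, the path is the one from this ticket, and the -df variant is the option mentioned above.

# Optimization attempt (did not help here): rebuild the hash database in place
tchmgr optimize /var/rudder/cfengine-community/state/cf_lock.tcdb
# Variant with the -df option mentioned above
tchmgr optimize -df /var/rudder/cfengine-community/state/cf_lock.tcdb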
Actions #3

Updated by Vincent MEMBRÉ almost 10 years ago

Another lead would be to let CFEngine optimize the tcdb itself, using the TCDB_OPTIMIZE_PERCENT variable.

That variable makes CFEngine check whether it has to optimize the tcdb: https://github.com/cfengine/core/blob/master/libpromises/dbm_tokyocab.c#L128
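If I read that code correctly, the value is taken from the agent's process environment, so a way to try it by hand could be the following sketch (the cf-agent path is the one used elsewhere in this ticket; whether cf-execd picks the variable up depends on how its service environment is set up on each distribution):

# Sketch: ask CFEngine to consider optimizing its tcdb files
# (the percentage appears to control how often optimization is triggered)
export TCDB_OPTIMIZE_PERCENT=25
/opt/rudder/bin/cf-agent -KI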

Actions #4

Updated by Vincent MEMBRÉ almost 10 years ago

To use a cron job that deletes cf_lock when it is over 10 MB,

put the following in the file /etc/cron.d/clean-rudder-locks:

*/5 * * * * root if [ `ls -l /var/rudder/cfengine-community/state/cf_lock.tcdb | cut -f 5 -d " "` -gt "10485760" ]; then rm -rf /var/rudder/cfengine-community/state/cf_lock.tcdb* && /opt/rudder/bin/cf-agent -KI > /dev/null 2>&1; fi
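Parsing the output of ls -l with cut is somewhat fragile (column alignment can shift the field positions), so a variant of the same idea, assuming GNU find and its -size +10M syntax, could be:

*/5 * * * * root if [ -n "$(find /var/rudder/cfengine-community/state/cf_lock.tcdb -size +10M 2>/dev/null)" ]; then rm -f /var/rudder/cfengine-community/state/cf_lock.tcdb* && /opt/rudder/bin/cf-agent -KI > /dev/null 2>&1; fi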
Actions #5

Updated by Vincent MEMBRÉ almost 10 years ago

The rudder-clean works correctly on servers.

I have some nodes with TCDB_OPTIMIZE_PERCENT set to 25.

My tcdb files were reduced from 40 MB to 1 MB (execution time dropped from 1 minute to 10 seconds!), and the tcdb is now growing again, but really slowly (from 1 MB to 1.9 MB in a few hours, whereas it grew by 0.15 MB per run before).

So I still need to watch the behavior in the long run.

Actions #6

Updated by Vincent MEMBRÉ almost 10 years ago

The variable does not prevent the problem of cf_lock growing big... The growth rate is slower (1 MB in 45 minutes instead of 30 minutes), and when the file gets big (over 10 MB) we get some errors too.

The check on the size still seems like the only solution to me, maybe helped by the variable.

Actions #7

Updated by Nicolas CHARLES almost 10 years ago

The impact of killing the cf_lock database is that all locks on promises are removed, meaning everything that uses ifelapsed may fail.
It does not have any impact on persistent classes, nor on the package list.

Actions #8

Updated by Nicolas CHARLES almost 10 years ago

  • Status changed from New to Pending technical review
  • Assignee changed from Vincent MEMBRÉ to Jonathan CLARKE
  • Pull Request set to https://github.com/Normation/rudder-packages/pull/317
Actions #9

Updated by Jonathan CLARKE almost 10 years ago

  • Status changed from Pending technical review to Discussion
  • Assignee changed from Jonathan CLARKE to Nicolas CHARLES
Actions #10

Updated by Nicolas CHARLES almost 10 years ago

  • Status changed from Discussion to Pending technical review
  • Assignee changed from Nicolas CHARLES to Jonathan CLARKE

PR updated

Actions #11

Updated by Nicolas CHARLES almost 10 years ago

  • Status changed from Pending technical review to Pending release
  • % Done changed from 0 to 100

Applied in changeset packages:commit:838eace354d6e2be06c09536268fd596086fdb9d.

Actions #12

Updated by Jonathan CLARKE almost 10 years ago

Applied in changeset packages:commit:0d10ba0561ec64aae16a359102b0c1d498926887.

Actions #13

Updated by Vincent MEMBRÉ almost 10 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 2.9.5 (announcement, changelog) and 2.10.1 (announcement, changelog), which were released today.
