Bug #4769
closed
rudder-agent may get stuck because of Tokyo Cabinet database bloat
Description
We have issues with tcdb on rudder-agent: the file /var/rudder/cfengine-community/state/cf_lock.tcdb grows large, which leads to corruption and slows down or blocks rudder-agent.
We first added a check in check-rudder-agent that looks at whether 8 or more cf-agent processes are running (#3928).
Then we added a check that looks at whether promises were updated in the last 10 minutes (/var/rudder/cfengine-community/last_successful_inputs_update) (#4494).
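The two checks described above can be sketched roughly as follows. This is only an illustration, not the actual check-rudder-agent code: the function names are hypothetical, and the thresholds (8 processes, 10 minutes) are passed in as arguments.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not the real check-rudder-agent code.

# Succeeds when at least $1 processes named cf-agent are running.
too_many_agents() {
    max="$1"
    # The arithmetic expansion strips any padding that wc may add.
    count=$(($(pgrep -x cf-agent | wc -l)))
    [ "$count" -ge "$max" ]
}

# Succeeds when file $1 is missing or was modified more than $2 minutes ago.
inputs_are_stale() {
    file="$1"
    minutes="$2"
    if [ ! -e "$file" ]; then
        return 0
    fi
    # find prints the path only when the modification time is old enough.
    [ -n "$(find "$file" -mmin +"$minutes")" ]
}
```

For example, `too_many_agents 8` and `inputs_are_stale /var/rudder/cfengine-community/last_successful_inputs_update 10` correspond to the two conditions.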
After various fixes on those scripts (typos, missing files ... #4582, #4604, #4686, #4752), we can see those conditions are not sufficient.
Even with these fixes, the reporting on nodes is impacted by this bug, since the agent runs slower and slower at each execution until a run takes as long as the agent interval.
The other thing to note here is that the agent is much more impacted when it's launched by cf-execd; when launched manually, results are far better.
- The agents are getting slow
- The agent uses a lot of resources for a longer period
- The reporting in Rudder will be broken
I think a solution would be to have a size-based check.
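A minimal sketch of what such a size-based check could look like. The function name is hypothetical and the limit is whatever threshold ends up being chosen; nothing here is taken from the actual fix.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when file $1 exists and is larger than
# $2 bytes. A missing file is treated as "not too big".
file_larger_than() {
    file="$1"
    limit_bytes="$2"
    [ -f "$file" ] || return 1
    # wc -c reads the byte count; arithmetic expansion strips any padding.
    size=$(($(wc -c < "$file")))
    [ "$size" -gt "$limit_bytes" ]
}
```

It would be called with the path of cf_lock.tcdb and a limit in bytes, and the caller would decide what to do when it succeeds.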
Updated by Vincent MEMBRÉ over 10 years ago
I have one example:
- My promises on a new agent (cf_lock ~ 1 MB) take 10 seconds to run, with no problems on reporting or anything else
- With time, cf_lock grows big (~ 140 MB) and the agent:
  - when run by cf-execd, now takes 4 minutes
  - reporting was OK only during the last minute before the next agent run
  - manually, it stays quite fast (~ 20 seconds)
- Waiting a little more (in my case it was at ~ 155 MB):
  - when cf-agent is run by cf-execd, it now takes 11 minutes
  - reporting was never OK, there were always some missing reports
  - and a manual run takes 4 minutes
In all cases, the last_successful_update file is updated correctly, even with the 11-minute runs.
I don't have agents piling up in the first two cases; only the last one causes piling up (and there were only 4 at the same time, so quite far from 8 agents).
In all cases, cf-agent uses 100% CPU during the whole execution, which in the last case means four CPUs in use, impacting my other applications (maybe that's why the manually run agent cannot work correctly...).
Updated by Vincent MEMBRÉ over 10 years ago
As said in #4752, coredumb and dnns are using a cron job that checks whether the size of that file is over 10 MB and, if so, cleans (rm) the whole state directory.
They haven't had any problems since, and they haven't met any downside using it.
Maybe the size and the cleaning method are not the right ones, but I clearly think this is the idea we should adopt.
About the size: according to coredumb, the agent starts to slow down at 10 MB, and at 100 MB it's always slow.
I don't know if there is a strict rule, or if it gets slower randomly... What I have seen is: the bigger the file, the slower it can be.
Some data, which may need to be confirmed:
- < 1 MB: 10 seconds
- 35 MB: ~ 1 minute
- 140 MB: 4 minutes
- 150 MB: 11 minutes (maybe some corruption occurred here?)
- 230 MB: 10 minutes
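To confirm figures like these, the size of the database and the duration of a run can be recorded together. The helper below is a hypothetical sketch (the name and output format are made up for illustration): it runs a command, then prints the file size in bytes and the elapsed time in whole seconds.

```shell
#!/bin/sh
# Hypothetical helper: runs the command given after $1, then prints
# "<size of $1 in bytes> <elapsed seconds>". The size is read after the run.
measure_run() {
    db="$1"
    shift
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    if [ -f "$db" ]; then
        size=$(($(wc -c < "$db")))
    else
        size=0
    fi
    echo "$size $((end - start))"
}
```

For a manual run this could be invoked as `measure_run /var/rudder/cfengine-community/state/cf_lock.tcdb /opt/rudder/bin/cf-agent -KI`; it does not reproduce the cf-execd case, which is the slower one.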
About the cleaning methods:
- I don't think we should delete the whole state dir.
- I tried using 'tchmgr optimize', but that changed nothing at all (same size, same run time); same with the -df option.
- I would only delete the cf_lock.tcdb files (that file and the tcdb.lock file).
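A sketch of the narrower cleanup suggested in the last point, which removes only the lock database and leaves the rest of the state directory alone. The function name is hypothetical, and the companion file name cf_lock.tcdb.lock is an assumption based on the "tcdb.lock file" mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: removes only the lock database and its companion
# lock file from state directory $1; all other files are left in place.
# The name cf_lock.tcdb.lock is an assumption, not confirmed by the thread.
remove_lock_db() {
    state_dir="$1"
    rm -f "$state_dir/cf_lock.tcdb" "$state_dir/cf_lock.tcdb.lock"
}
```

It would be called as `remove_lock_db /var/rudder/cfengine-community/state`.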
Updated by Vincent MEMBRÉ over 10 years ago
Another lead would be to let CFEngine optimize the tcdb itself, using the TCDB_OPTIMIZE_PERCENT variable.
That variable makes CFEngine check whether it has to optimize the tcdb: https://github.com/cfengine/core/blob/master/libpromises/dbm_tokyocab.c#L128
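A minimal sketch of trying the variable on a single run, assuming it is read from the process environment (which is how the linked source appears to use it, but this is not confirmed here). The wrapper name is hypothetical, and how to set the variable for agents started by cf-execd is not covered.

```shell
#!/bin/sh
# Hypothetical wrapper: runs the command given after $1 with
# TCDB_OPTIMIZE_PERCENT set to $1 in its environment.
# Assumes the variable is picked up from the environment.
run_with_tcdb_optimize() {
    percent="$1"
    shift
    env TCDB_OPTIMIZE_PERCENT="$percent" "$@"
}
```

For example: `run_with_tcdb_optimize 25 /opt/rudder/bin/cf-agent -KI`.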
Updated by Vincent MEMBRÉ over 10 years ago
To use a cron job to delete cf_lock when it is over 10 MB, put the following in the file /etc/cron.d/clean-rudder-locks:
*/5 * * * * root if [ `ls -l /var/rudder/cfengine-community/state/cf_lock.tcdb | cut -f 5 -d " "` -gt "10485760" ]; then rm -rf /var/rudder/cfengine-community/state/cf_lock.tcdb* && /opt/rudder/bin/cf-agent -KI > /dev/null 2>&1; fi
Updated by Vincent MEMBRÉ over 10 years ago
The rudder-clean works correctly on servers.
I have some nodes with TCDB_OPTIMIZE_PERCENT set to 25.
My tcdb files were reduced from 40 MB to 1 MB (execution time dropped from 1 minute to 10 seconds!) and now the tcdb is still growing, but really slowly (from 1 MB to 1.9 MB in a few hours; it was growing by 0.15 MB per run before).
So I still need to look at the behavior in the long run.
Updated by Vincent MEMBRÉ over 10 years ago
The variable does not prevent the problem of cf_lock growing big... The growth rate is slower (1 MB in 45 minutes instead of 30 minutes), and when the file gets big (over 10 MB) we get some errors too.
A check on the size seems to be the only solution to me, maybe helped by the variable.
Updated by Nicolas CHARLES over 10 years ago
The impact of killing the cf_lock database is that all locks on promises are removed, meaning everything that uses ifelapsed may fail.
It does not have any impact on persistent classes or on the package list.
Updated by Nicolas CHARLES over 10 years ago
- Status changed from New to Pending technical review
- Assignee changed from Vincent MEMBRÉ to Jonathan CLARKE
- Pull Request set to https://github.com/Normation/rudder-packages/pull/317
Updated by Jonathan CLARKE over 10 years ago
- Status changed from Pending technical review to Discussion
- Assignee changed from Jonathan CLARKE to Nicolas CHARLES
Updated by Nicolas CHARLES over 10 years ago
- Status changed from Discussion to Pending technical review
- Assignee changed from Nicolas CHARLES to Jonathan CLARKE
PR updated
Updated by Nicolas CHARLES over 10 years ago
- Status changed from Pending technical review to Pending release
- % Done changed from 0 to 100
Applied in changeset packages:commit:838eace354d6e2be06c09536268fd596086fdb9d.
Updated by Jonathan CLARKE over 10 years ago
Applied in changeset packages:commit:0d10ba0561ec64aae16a359102b0c1d498926887.
Updated by Vincent MEMBRÉ over 10 years ago
- Status changed from Pending release to Released
This bug has been fixed in Rudder 2.9.5 (announcement, changelog) and 2.10.1 (announcement, changelog), which were released today.
- Download information: https://www.rudder-project.org/site/get-rudder/downloads/