Bug #4752: cf_lock.tcdb is not cleaned by check-rudder-agent script when update file is older than 10 minutes - Rudder - Issue Tracker

Bug #4752

closed

cf_lock.tcdb is not cleaned by check-rudder-agent script when update file is older than 10 minutes

Added by Vincent MEMBRÉ over 11 years ago. Updated about 11 years ago.

Status: Released
Priority: 1 (highest)
Assignee: Jonathan CLARKE
Category: System integration
Target version: 2.9.5
Pull Request: https://github.com/Normation/rudder-p...

Description

Despite all our fixes to prevent the agent from being stuck by Tokyo Cabinet databases, in some cases they still can't prevent agents from getting stuck...

Our solution is to check every 5 minutes whether agents are piling up and, if there are more than 8 agents, clean the databases.

However, Dennis Cabooter reported on IRC that, even with the latest fixes, his agents were blocked again, and that our fixes lead to lots of other problems (particularly cron sending tons of mails...).

It turns out that Dennis put a 128 MB tmpfs in place, which gets full due to cf_lock.tcdb, making agents fail immediately instead of piling up.

To fix the problem, Dennis put in place a cron job designed by Olivier Mauras that removes the tcdb files when cf_lock.tcdb grows over 10 MB:

if [ `ls -l /var/rudder/cfengine-community/state/cf_lock.tcdb | cut -f 5 -d " "` -gt "10485760" ]; then rm -rf /var/rudder/cfengine-community/state/* && /opt/rudder/bin/cf-agent -KI > /dev/null 2>&1; fi

With it, he never had any more problems with the databases.
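
The one-liner above reads the size by cutting the fifth space-separated field out of `ls -l`, which can shift when `ls` pads its columns with extra spaces. Below is a minimal sketch of the same size-based check, assuming `wc -c` is used to read the byte count instead. The function names `file_size_bytes` and `clean_state_if_bloated` are hypothetical and are not part of check-rudder-agent; the 10 MB default is the value from the cron job above.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not taken from check-rudder-agent.

# Print the size of a file in bytes (0 if it does not exist).
file_size_bytes() {
    if [ -f "$1" ]; then
        # Reading from stdin keeps the file name out of wc's output;
        # tr strips the padding some wc implementations add.
        wc -c < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# Remove everything in the state directory when cf_lock.tcdb exceeds the limit.
# $1: state directory, $2: size limit in bytes (defaults to 10 MB).
# Returns 0 if a cleanup was done, 1 otherwise.
clean_state_if_bloated() {
    state_dir=$1
    limit=${2:-10485760}
    size=$(file_size_bytes "$state_dir/cf_lock.tcdb")
    if [ "$size" -gt "$limit" ]; then
        # ${state_dir:?} aborts instead of expanding to "/*" if the variable is empty.
        rm -rf "${state_dir:?}"/*
        return 0
    fi
    return 1
}

# Example (not run here), mirroring the cron job above:
# clean_state_if_bloated /var/rudder/cfengine-community/state \
#     && /opt/rudder/bin/cf-agent -KI > /dev/null 2>&1
```

Passing the directory and the limit as arguments makes it easy to try other thresholds on a scratch directory before touching the real state directory.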

A solution could be to replace our check with this size-based one in our check-rudder-agent script.

But 10 MB seems quite arbitrary to me, and we need some feedback on the size of the files:

That file can grow to more than 100 MB, and I don't know at which size it starts to block the agent.

What I'd like to know is the growth rate of those files, to determine which value would be a good threshold.

So I ask one thing of you, dear community: can you please post here the results of the following command from several nodes?

ls -lh /var/rudder/cfengine-community/state

What I'd like to know most is the state of those files (particularly cf_lock.tcdb) a few hours after they were cleaned.

Related issues 7 (0 open — 7 closed)

Related to Rudder - Bug #4686: Typo in /opt/rudder/bin/check-rudder-agent, prevent cleaning of cf-lock and floods with cron mails	Released	Jonathan CLARKE	2014-03-28	Actions
Related to Rudder - Bug #4604: Typo in the deletion of lock file if the promises are not updated	Released	Jonathan CLARKE	2014-03-12	Actions
Related to Rudder - Bug #4582: Last update detection is broken, causing cron remove cf_lock database and flood with emails every 5 minutes	Released	Jonathan CLARKE	2014-03-11	Actions
Related to Rudder - Bug #4494: Accumulation of cf-agent processes due to locking on CFEngine tcdb lock file	Released	Jonathan CLARKE		Actions
Related to Rudder - Bug #4408: Sometimes there are too many cf-agent processes running	Rejected	Nicolas CHARLES	2014-01-27	Actions
Related to Rudder - Bug #3928: Sometimes CFEngine get stuck because of locks on TokyoCabinet	Released	Jonathan CLARKE	2013-09-13	Actions
Related to Rudder - Bug #4769: rudder-agent may be stucked by tokyo cabinet database bloating	Released	Jonathan CLARKE	2014-04-23	Actions

Updated by Vincent MEMBRÉ over 11 years ago

Description updated (diff)

Feedback from Olivier on why 10 MB is a good value:

11:27 < Vince_Mcbuche> THe other thing that could be very important is the size that start the agent stucking :)
11:35 < coredumb> Vince_Mcbuche: 10MB
11:35 < coredumb> initially thought it was around 100MB
11:36 < coredumb> but by reducing, finally found that starting at 10MB agent starts to feel pain

I also added more information about Dennis's issue:

It turns out that Dennis put a 128 MB tmpfs in place, which gets full due to cf_lock.tcdb, making agents fail immediately instead of piling up.

Another script from Olivier to monitor the size of the db:

for i in $(seq 0 60); do ls -lh /var/rudder/cfengine-community/state >> /tmp/cflock_res; echo "sleeping 60s" >> /tmp/cflock_res; sleep 60; done

Thank you!

Updated by Vincent MEMBRÉ over 11 years ago

Another question is whether we need to clean all files in the state directory or only cf_lock.

I think we can clean only cf_lock.tcdb, but it may be safer to clean everything... Nicolas, Jon, do you have any more ideas about that?

Updated by Dennis Cabooter over 11 years ago

Another script from Olivier to monitor the size of the db:

[...]

I assume lh is an alias for "ls -h" ? :)

Updated by Vincent MEMBRÉ over 11 years ago

Thank you Dennis, I updated the command!

Updated by Vincent MEMBRÉ over 11 years ago

First results from Olivier seem to show that the database is growing by ~0.8 MB per hour:

6.8M Apr 11 11:36 cf_lock.tcdb
7.4M Apr 11 12:25 cf_lock.tcdb
8.3M Apr 11 13:41 cf_lock.tcdb
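
As a rough cross-check of that estimate, the growth rate can be computed from the first and last samples listed above. This is only a sketch: the helper name `growth_rate_mb_per_hour` is hypothetical, and it assumes both samples were taken on the same day and that the `M` sizes reported by `ls -lh` are precise enough for an estimate.

```shell
#!/bin/sh
# Sketch only: growth_rate_mb_per_hour is an illustrative helper name.

# Print the growth rate in MB per hour between two samples.
# $1: first size (MB), $2: first time (HH:MM)
# $3: last size (MB),  $4: last time (HH:MM)
# Assumes both samples were taken on the same day.
growth_rate_mb_per_hour() {
    awk -v s1="$1" -v t1="$2" -v s2="$3" -v t2="$4" 'BEGIN {
        split(t1, a, ":"); split(t2, b, ":")
        # Elapsed time between the two samples, in minutes.
        minutes = (b[1] * 60 + b[2]) - (a[1] * 60 + a[2])
        printf "%.2f\n", (s2 - s1) / (minutes / 60)
    }'
}

# First and last samples from the listing above.
growth_rate_mb_per_hour 6.8 11:36 8.3 13:41
```

For these two samples (1.5 MB in 125 minutes) the call prints 0.72, so the measured rate is in the same range as the estimate.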

Updated by Olivier Mauras over 11 years ago

9.9M Apr 11 15:55 cf_lock.tcdb

Updated by Vincent MEMBRÉ over 11 years ago

One of our machines has the same kind of problem in our internal production... so we can analyse it properly.

Our growth rate is about 0.7 MB per hour... I still don't know what could cause it to grow...

For now, 10 MB seems to me a good threshold for cleaning the file.

Updated by Vincent MEMBRÉ over 11 years ago

We have a 128 MB tmpfs that gets full in ~12 days.

20% is filled during the first 3 days => 8 MB per day (6%)

The rest is filled during the following days => 12.8 MB per day (10%)

I created a script based on Olivier's command to collect data about the cf_lock file: https://github.com/Normation/rudder-tools/blob/master/scripts/rudder-info/cf_lock-size.sh

Updated by Vincent MEMBRÉ over 11 years ago

Hmmm... In check-rudder-agent we only delete the cf_lock.tcdb.lock file, and not cf_lock.tcdb, when the promises update file is older than 10 minutes...

Deleting that file too should fix the issue.
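
A minimal sketch of that idea, not the actual check-rudder-agent code: the function name `clean_lock_if_stale` and its parameters are hypothetical, the path of the update file is left to the caller, and `find` is assumed to support `-mmin` (GNU and BSD versions do).

```shell
#!/bin/sh
# Sketch only: clean_lock_if_stale and its parameters are illustrative.

# Remove both the lock file and the lock database when the update file
# has not been modified for more than the given number of minutes.
# $1: update file, $2: state directory, $3: max age in minutes (default 10)
# Returns 0 if the files were removed, 1 otherwise.
clean_lock_if_stale() {
    update_file=$1
    state_dir=$2
    max_age=${3:-10}
    # find prints the file name only if its mtime is older than max_age minutes.
    # A missing update file yields no output and is treated here as "not stale".
    if [ -n "$(find "$update_file" -mmin +"$max_age" 2>/dev/null)" ]; then
        rm -f "$state_dir/cf_lock.tcdb.lock" "$state_dir/cf_lock.tcdb"
        return 0
    fi
    return 1
}

# Example (not run here); the update file path is a placeholder:
# clean_lock_if_stale /path/to/update_file /var/rudder/cfengine-community/state
```

Treating a missing update file as "not stale" is a choice made for this sketch; the real script may handle that case differently.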

#10

Updated by Vincent MEMBRÉ over 11 years ago

Subject changed from rudder-agent may be stucked by tokyo cabinet database bloating to cf_lock.tcdb is not cleaned by check-rudder-agent script when update file is older than 10 minutes
Status changed from New to Pending technical review
Assignee set to Jonathan CLARKE
Pull Request set to https://github.com/Normation/rudder-packages/pull/305

PR here: https://github.com/Normation/rudder-packages/pull/305

#11

Updated by Vincent MEMBRÉ over 11 years ago

Status changed from Pending technical review to Pending release
% Done changed from 0 to 100

Applied in changeset packages:commit:2136a9f44d1702ee17b7773d67141aa215bafbdf.

#12

Updated by Jonathan CLARKE over 11 years ago

Applied in changeset packages:commit:ba6366695f932e5d6e032fbc2cc5d9663ed1de52.

#13

Updated by Vincent MEMBRÉ over 11 years ago

We have seen that a huge tcdb (cf_lock > 100 MB) increases the agent execution time a LOT:

Empty db (~1 MB) => 10 seconds per agent execution
Huge db (140 MB) => 4 minutes

At that size, cf-agent may not be blocked, so the "inputs_update" file is still updated and agents are not piling up. This results in bad reporting on nodes (reports keep coming in for 4 minutes, then the agent runs again right after, putting reporting back to "no answer"...)

The inputs_update file is definitely not a good criterion for checking the tcdb status.

#14

Updated by Vincent MEMBRÉ about 11 years ago

Status changed from Pending release to Released

This bug has been fixed in Rudder 2.9.5 (announcement, changelog) and 2.10.1 (announcement, changelog), which were released today.

Download information: https://www.rudder-project.org/site/get-rudder/downloads/
