Bug #18832
closed
Rudder Agent consumes complete Memory because of fdisk
Added by Lars Koenen almost 4 years ago.
Updated over 3 years ago.
Severity:
Critical - prevents main use of Rudder | no workaround | data loss | security
User visibility:
Getting started - demo | first install | Technique editor and level 1 Techniques
Description
We have observed a bug on several of our systems where the rudder agent or a part of rudder uses up the entire memory of the system after a rather short time.
I have narrowed it down to the fact that on these systems the calls to fdisk -l never finish.
The rudder agent does not seem to handle this edge case, which internally seems to lead to a chain of problems.
The reason for fdisk -l hanging on our systems is that the systems have a floppy disk controller, but no floppy drives are attached to them.
Nevertheless, I think that rudder should not blindly trust fdisk.
Interestingly, this behavior only occurs on systems with rudder agent version 6.2.0. Other systems with rudder agent version 6.1.6 can handle the problem.
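For anyone checking whether a system is affected, a minimal test is to bound the call with GNU coreutils timeout (assuming it is installed); an exit code of 124 means fdisk -l was still hanging when the limit was reached:
# fdisk -l normally returns almost instantly; 124 means it was killed after 15s
timeout 15 fdisk -l > /dev/null 2>&1
echo "exit code: $?"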
- Severity changed from Major - prevents use of part of Rudder | no simple workaround to Critical - prevents main use of Rudder | no workaround | data loss | security
- User visibility set to Operational - other Techniques | Rudder settings | Plugins
- Priority changed from 0 to 76
Does it happen when running the agent (rudder agent run) or during an inventory (rudder agent inventory)?
Do you have a list of the processes (with ps auxf or similar) running when the problem occurs? It would help find the source of the problem.
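For example, a minimal way to capture it (the output file path is just an assumption):
# Forest view of all processes, saved so it can be attached to the ticket
ps auxf > /tmp/rudder-process-tree.txt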
- User visibility changed from Operational - other Techniques | Rudder settings | Plugins to Getting started - demo | first install | Technique editor and level 1 Techniques
- Priority changed from 76 to 94
Thank you for reporting.
The fact that it is a regression is especially bad. You are of course right that whatever the fdisk problem is, it shouldn't lead to a bug in the rudder agent. We are looking into it.
We actually have two problems here:
- Memory leak/exhaustion
- Lack of timeout for the fdisk -l call
Tested with a simple sleep:
root 5085 0.0 0.2 12832 604 pts/0 S+ 13:56 0:00 | \_ /bin/sh /opt/rudder/share/commands/agent-inventory
root 5125 0.0 0.2 12832 604 pts/0 S+ 13:56 0:00 | \_ /bin/sh /opt/rudder/share/commands/agent-run -N -D force_inventory
root 5224 5.5 4.2 120716 9788 ? Ss 13:56 0:00 | \_ /opt/rudder/bin/cf-agent -I -D info -Cnever -K -b doInventory
root 5273 0.0 0.4 12832 1000 ? S 13:56 0:00 | | \_ /bin/sh /opt/rudder/bin/run-inventory --local=/var/rudder/
root 5319 4.5 24.9 154868 57652 ? S 13:56 0:00 | | \_ fusioninventory-agent: running task Inventory
root 5360 0.0 0.5 9772 1184 ? S 13:56 0:00 | | \_ /bin/sh /sbin/fdisk -v
root 5361 0.0 0.3 4356 720 ? S 13:56 0:00 | | \_ sleep 500
It seems to only affect the inventory.
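For reference, the stub behind the process tree above is just a script that sleeps instead of the real fdisk; a minimal sketch (paths and exact commands are assumptions, back up the original binary first):
# Hypothetical stub: replace fdisk with a shell script that hangs in a sleep
mv /sbin/fdisk /sbin/fdisk.orig
printf '#!/bin/sh\nsleep 500\n' > /sbin/fdisk
chmod +x /sbin/fdisk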
We can confirm that it's specific to 6.2.x
Actually no, it also occurs on 6.1.8-nightly at least.
Memory exhaustion could come from inventory processes piling up.
Alexis MOUSSET wrote in #note-5:
Tested with a simple sleep:
[...]
It seems to only affect the inventory.
Yes, I can confirm. We had exactly this situation: many stranded cf-agent processes that piled up.
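A quick way to check for this pile-up, assuming the process names seen in the ps output above:
# Count stranded inventory-related processes
pgrep -fc fusioninventory-agent
pgrep -fc 'cf-agent.*doInventory'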
- Target version changed from 6.2.1 to 6.2.2
Tested on 6.1.7, 6.1.8-rc, and 6.2.0:
- replace fdisk with sleep 1000,
- start a bunch of rudder agent inventory runs.
In all cases, after some time, rudder agent inventory returns with the following message:
info Rudder agent was run on a subset of policies - not all policies were checked
But the background blocking fdisk is still running and not killed (as shown in message #15).
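For reference, a minimal sketch of that reproduction, assuming fdisk has already been replaced by a sleep as described above:
# Start several inventories; each one leaves a blocked fdisk behind
for i in 1 2 3 4 5; do
    rudder agent inventory &
done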
- Related to Bug #11102: Parent fix does not work: Fusioninventory is not tracked by check-rudder-health added
- Related to Bug #14190: Inventory may never finish if there is a disk issue or invalid mountpoint added
- Target version changed from 6.2.2 to 6.2.3
There are several problems here:
- We allow several inventories to pile up, but the inventory process uses a lot of memory, so this can lead to DoSing the node.
- We allow inventories to last forever. Using CFEngine's timeout is not enough, as it leaves zombie processes and may fail.
What we could do (a rough sketch follows below):
- Prevent running a new inventory if another one is already running. If done reliably, this prevents the DoS.
- Add a watchdog for the inventory process in rudder agent check, to kill inventories lasting more than X minutes, along with all of their child processes.
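A rough sketch of both ideas, with hypothetical paths and a 10-minute limit (this is not necessarily how the actual fix will be implemented):
#!/bin/sh
# Idea 1 (in the inventory wrapper): refuse to start if another inventory holds the lock.
# The lock file path is an assumption; flock comes from util-linux.
exec 9> /var/rudder/tmp/inventory.lock
flock -n 9 || { echo "An inventory is already running, skipping"; exit 0; }

# Idea 2 (in "rudder agent check"): kill inventory processes running for more than
# 10 minutes, targeting their process group so child processes go with them.
for pid in $(pgrep -f fusioninventory-agent); do
    elapsed=$(ps -o etimes= -p "$pid" | tr -d ' ')
    if [ -n "$elapsed" ] && [ "$elapsed" -gt 600 ]; then
        kill -TERM -- "-$(ps -o pgid= -p "$pid" | tr -d ' ')"
    fi
done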
- Target version changed from 6.2.3 to 6.1.10
- Status changed from New to In progress
- Assignee set to Alexis Mousset
- Status changed from In progress to Pending technical review
- Assignee changed from Alexis Mousset to Benoît PECCATTE
- Pull Request set to https://github.com/Normation/rudder-techniques/pull/1649
- Status changed from Pending technical review to Pending release
- Priority changed from 94 to 93
- Fix check changed from To do to Checked
- Status changed from Pending release to Released
- Priority changed from 93 to 92
This bug has been fixed in Rudder 6.1.10 and 6.2.3 which were released today.