Bug #18832
closed
Rudder Agent consumes complete Memory because of fdisk
Description
We have observed a bug on several of our systems where the rudder agent or a part of rudder uses up the entire memory of the system after a rather short time.
I have narrowed it down to the fact that on these systems the calls to fdisk -l never finish.
The rudder agent does not seem to handle this edge case, which internally seems to lead to a chain of problems.
The reason for fdisk -l hanging on our systems is that the systems have a floppy disk controller, but no floppy drives are attached to them.
Nevertheless, I think that rudder should not blindly trust fdisk.
Interestingly, this behavior only occurs on systems with rudder agent version 6.2.0. Other systems with rudder agent version 6.1.6 can handle the problem.
Updated by Alexis Mousset about 4 years ago
- Severity changed from Major - prevents use of part of Rudder | no simple workaround to Critical - prevents main use of Rudder | no workaround | data loss | security
- User visibility set to Operational - other Techniques | Rudder settings | Plugins
- Priority changed from 0 to 76
Updated by Alexis Mousset about 4 years ago
Does it happen when running the agent (rudder agent run) or during an inventory (rudder agent inventory)?
Do you have a list of the processes (with ps auxf or similar) opened when the problem occurs? It would help find the source of the problem.
Updated by François ARMAND about 4 years ago
- User visibility changed from Operational - other Techniques | Rudder settings | Plugins to Getting started - demo | first install | Technique editor and level 1 Techniques
- Priority changed from 76 to 94
Thank you for reporting.
The fact that it is a regression is especially bad. You are of course right that whatever fdisk's problems are, they shouldn't lead to a bug in the rudder agent. We are looking into it.
Updated by Alexis Mousset about 4 years ago
We actually have two problems here:
- Memory leak/exhaustion
- Lack of a timeout for the fdisk -l call
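As an illustration of the second point, here is a minimal sketch of running an external command with a timeout and killing its whole process group when the timeout expires. It is written in Python for brevity and is not how the Rudder agent or fusioninventory-agent actually invokes fdisk; the function name run_with_timeout and the timeout values are assumptions made for this example.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (returncode, stdout).

    If cmd runs longer than timeout_s seconds, its whole process group is
    killed and (None, "") is returned instead.
    """
    # start_new_session=True gives the child its own session and process
    # group, so anything it spawns (a shell wrapper's `sleep`, for example)
    # can be killed together with it.
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
        start_new_session=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        try:
            # The child is the group leader, so its PID is also the group ID.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited on its own in the meantime
        proc.communicate()  # reap the child so that no zombie is left behind
        return None, ""
```

A caller could then use something like run_with_timeout(["fdisk", "-l"], 30) and treat a None return code as "disk listing unavailable" rather than waiting forever. The 30-second value is arbitrary.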
Updated by Alexis Mousset about 4 years ago
Tested with a simple sleep:
root 5085 0.0 0.2 12832 604 pts/0 S+ 13:56 0:00 | \_ /bin/sh /opt/rudder/share/commands/agent-inventory
root 5125 0.0 0.2 12832 604 pts/0 S+ 13:56 0:00 | \_ /bin/sh /opt/rudder/share/commands/agent-run -N -D force_inventory
root 5224 5.5 4.2 120716 9788 ? Ss 13:56 0:00 | \_ /opt/rudder/bin/cf-agent -I -D info -Cnever -K -b doInventory
root 5273 0.0 0.4 12832 1000 ? S 13:56 0:00 | | \_ /bin/sh /opt/rudder/bin/run-inventory --local=/var/rudder/
root 5319 4.5 24.9 154868 57652 ? S 13:56 0:00 | | \_ fusioninventory-agent: running task Inventory
root 5360 0.0 0.5 9772 1184 ? S 13:56 0:00 | | \_ /bin/sh /sbin/fdisk -v
root 5361 0.0 0.3 4356 720 ? S 13:56 0:00 | | \_ sleep 500
It seems to only affect the inventory.
Updated by François ARMAND about 4 years ago
We can confirm that it's specific to 6.2.x
Actually no, it is also on 6.1.8-nightly at least
Updated by Alexis Mousset about 4 years ago
Memory exhaustion could come from inventory processes piling up.
Updated by Lars Koenen about 4 years ago
Alexis MOUSSET wrote in #note-5:
Tested with a simple sleep: [...]
It seems to only affect the inventory.
Yes, I can confirm. We had exactly this situation: many stranded cf-agent processes that piled up.
Updated by Vincent MEMBRÉ about 4 years ago
- Target version changed from 6.2.1 to 6.2.2
Updated by François ARMAND about 4 years ago
Tested on 6.1.7, 6.1.8-rc, and 6.2.0:
- replace fdisk with sleep 1000,
- start a bunch of rudder agent inventory runs.
In all cases, after some time, rudder agent inventory returns with the following message:
info Rudder agent was run on a subset of policies - not all policies were checked
But the blocking fdisk is still running in the background and is not killed (as shown in message #15).
Updated by François ARMAND about 4 years ago
- Related to Bug #11102: Parent fix does not work: Fusioninventory is not tracked by check-rudder-health added
Updated by Nicolas CHARLES almost 4 years ago
- Related to Bug #14190: Inventory may never finish if there is a disk issue or invalid mountpoint added
Updated by Vincent MEMBRÉ almost 4 years ago
- Target version changed from 6.2.2 to 6.2.3
Updated by Alexis Mousset almost 4 years ago
There are several problems here:
- We allow several inventories to pile up, but the inventory process uses a lot of memory, so this can lead to DoSing the node.
- We allow inventories to last forever. Using CFEngine's timeout is not enough, as it leaves zombie processes and may fail.
What we could do:
- Prevent running a new inventory if another one is already running. If done reliably, it prevents the DoS.
- Adding a watchdog for inventory process in rudder agent check, to kill inventories lasting for more than X minutes and all of their child processes.
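The two ideas above can be sketched as follows. This is only an illustration in Python under stated assumptions, not the change that was merged for this ticket: the helper names try_acquire_lock and kill_if_overdue are hypothetical, and the sketch assumes the inventory was started in its own process group and that its start time was recorded somewhere by the caller.

```python
import fcntl
import os
import signal
import time


def try_acquire_lock(path):
    """Try to take an exclusive, non-blocking lock on path.

    Returns an open file descriptor holding the lock, or None if another
    holder already has it. The kernel drops the lock when the descriptor is
    closed or the process dies, so a crashed inventory leaves no stale lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def kill_if_overdue(pgid, started_at, max_age_s, now=None):
    """Kill process group pgid if it has been running longer than max_age_s.

    started_at and now are Unix timestamps in seconds. Returns True if a
    kill signal was sent, False otherwise.
    """
    now = time.time() if now is None else now
    if now - started_at <= max_age_s:
        return False
    try:
        # Signal the whole group so that child processes die too.
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        return False  # the group has already exited
    return True
```

With this shape, an inventory wrapper would skip the run when try_acquire_lock returns None, and a periodic check would call kill_if_overdue with whatever limit is chosen for "X minutes".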
Updated by Alexis Mousset almost 4 years ago
- Target version changed from 6.2.3 to 6.1.10
Updated by Alexis Mousset almost 4 years ago
- Status changed from New to In progress
- Assignee set to Alexis Mousset
Updated by Alexis Mousset almost 4 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from Alexis Mousset to Benoît PECCATTE
- Pull Request set to https://github.com/Normation/rudder-techniques/pull/1649
Updated by Alexis Mousset almost 4 years ago
- Status changed from Pending technical review to Pending release
Applied in changeset rudder-techniques|dc524c125346c091f807d8f9813867d939a4120b.
Updated by Nicolas CHARLES almost 4 years ago
- Priority changed from 94 to 93
- Fix check changed from To do to Checked
Updated by Vincent MEMBRÉ almost 4 years ago
- Status changed from Pending release to Released
- Priority changed from 93 to 92
This bug has been fixed in Rudder 6.1.10 and 6.2.3 which were released today.