Bug #18832
Closed: Rudder Agent consumes complete Memory because of fdisk
Added by Lars Koenen almost 5 years ago. Updated over 4 years ago.
Description
We have observed a bug on several of our systems where the rudder agent or a part of rudder uses up the entire memory of the system after a rather short time.
I have narrowed it down to the fact that on these systems the calls to fdisk -l never finish.
The rudder agent does not seem to handle this edge case, which internally seems to lead to a chain of problems.
The reason for fdisk -l hanging on our systems is that the systems have a floppy disk controller, but no floppy drives are attached to them.
Nevertheless, I think that rudder should not blindly trust fdisk.
Interestingly, this behavior only occurs on systems with rudder agent version 6.2.0. Other systems with rudder agent version 6.1.6 can handle the problem.
Updated by Alexis Mousset almost 5 years ago
#1
- Severity changed from Major - prevents use of part of Rudder | no simple workaround to Critical - prevents main use of Rudder | no workaround | data loss | security
- User visibility set to Operational - other Techniques | Rudder settings | Plugins
- Priority changed from 0 to 76
Updated by Alexis Mousset almost 5 years ago
#2
Does it happen when running the agent (rudder agent run) or during an inventory (rudder agent inventory)?
Do you have a list of the processes (with ps auxf or similar) opened when the problem occurs? It would help find the source of the problem.
Updated by François ARMAND almost 5 years ago
#3
- User visibility changed from Operational - other Techniques | Rudder settings | Plugins to Getting started - demo | first install | Technique editor and level 1 Techniques
- Priority changed from 76 to 94
Thank you for reporting.
The fact that it is a regression is especially bad. You are of course right that whatever the fdisk problems are, they shouldn't lead to a bug in the rudder agent. We are looking into that.
Updated by Alexis Mousset almost 5 years ago
#4
We actually have two problems here:
- Memory leak/exhaustion
- Lack of timeout for the fdisk -l call
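To illustrate the second point, here is a minimal sketch of calling an external command with a timeout. It is written in Python for brevity and is not the fix that was applied in Rudder; the helper name `run_with_timeout` is hypothetical, and `sleep` stands in for a hanging `fdisk -l`.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its stdout, or None if it did not finish in time."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # subprocess.run kills the child when this expires
            check=False,
        )
    except subprocess.TimeoutExpired:
        # The command hung: report "no data" instead of blocking forever.
        return None
    return result.stdout.decode(errors="replace")


if __name__ == "__main__":
    # "sleep 5" stands in for an "fdisk -l" call that never returns.
    print(run_with_timeout(["sleep", "5"], timeout_s=0.5))
```

Note that this only kills the direct child. If the command spawns its own children, they would need to be handled separately, for example by running the command in its own process group.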
Updated by Alexis Mousset almost 5 years ago
#5
Tested with a simple sleep:
root 5085 0.0 0.2 12832 604 pts/0 S+ 13:56 0:00 | \_ /bin/sh /opt/rudder/share/commands/agent-inventory
root 5125 0.0 0.2 12832 604 pts/0 S+ 13:56 0:00 | \_ /bin/sh /opt/rudder/share/commands/agent-run -N -D force_inventory
root 5224 5.5 4.2 120716 9788 ? Ss 13:56 0:00 | \_ /opt/rudder/bin/cf-agent -I -D info -Cnever -K -b doInventory
root 5273 0.0 0.4 12832 1000 ? S 13:56 0:00 | | \_ /bin/sh /opt/rudder/bin/run-inventory --local=/var/rudder/
root 5319 4.5 24.9 154868 57652 ? S 13:56 0:00 | | \_ fusioninventory-agent: running task Inventory
root 5360 0.0 0.5 9772 1184 ? S 13:56 0:00 | | \_ /bin/sh /sbin/fdisk -v
root 5361 0.0 0.3 4356 720 ? S 13:56 0:00 | | \_ sleep 500
It seems to only affect the inventory.
Updated by François ARMAND almost 5 years ago
#6
We can confirm that it's specific to 6.2.x
Actually no, it is also on 6.1.8-nightly at least
Updated by Alexis Mousset almost 5 years ago
#7
Memory exhaustion could come from inventory processes piling up.
Updated by Lars Koenen almost 5 years ago
#8
Alexis MOUSSET wrote in #note-5:
Tested with a simple sleep: [...]
It seems to only affect the inventory.
Yes, I can confirm. We had exactly this situation: many stranded cf-agent processes that piled up.
Updated by Vincent MEMBRÉ almost 5 years ago
#9
- Target version changed from 6.2.1 to 6.2.2
Updated by François ARMAND almost 5 years ago
#10
Tested on 6.1.7, 6.1.8-rc, and 6.2.0:
- switch fdisk with sleep 1000,
- start a bunch of rudder agent inventory.
In all cases, after some time, rudder agent inventory returns with the following message:
info Rudder agent was run on a subset of policies - not all policies were checked
But the blocking fdisk is still running in the background and is not killed (as shown in message #15).
Updated by François ARMAND almost 5 years ago
#11
- Related to Bug #11102: Parent fix does not work: Fusioninventory is not tracked by check-rudder-health added
Updated by Nicolas CHARLES almost 5 years ago
#12
- Related to Bug #14190: Inventory may never finish if there is a disk issue or invalid mountpoint added
Updated by Vincent MEMBRÉ almost 5 years ago
#13
- Target version changed from 6.2.2 to 6.2.3
Updated by Alexis Mousset almost 5 years ago
#14
There are several problems here:
- We allow several inventories to pile up, but the inventory process uses a lot of memory, so this can lead to DoSing the node.
- We allow inventories to last forever. Using CFEngine's timeout is not enough, as it leaves zombie processes and may fail.
What we could do:
- Prevent running a new inventory if another one is already running. If done reliably, it prevents the DoS.
- Add a watchdog for the inventory process in rudder agent check, to kill inventories lasting for more than X minutes along with all of their child processes.
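The two ideas above can be sketched together as follows. This is an illustrative Python sketch under stated assumptions, not the change that was merged: the function name `run_exclusive` and the lock path are hypothetical, and the deadline is enforced by the launcher itself rather than by a separate periodic check such as rudder agent check. A non-blocking file lock refuses a second concurrent run, and the command is started in its own session so that the whole process group, including children such as a stuck fdisk, can be killed at once.

```python
import fcntl
import os
import signal
import subprocess

LOCK_PATH = "/tmp/inventory-sketch.lock"  # hypothetical path, for illustration only


def run_exclusive(cmd, max_seconds, lock_path=LOCK_PATH):
    """Run cmd unless another run holds the lock; kill its process group on timeout.

    Returns "skipped", "killed", or the exit code of cmd.
    """
    lock_file = open(lock_path, "w")
    try:
        try:
            # Non-blocking exclusive lock: fails at once if another run holds it.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return "skipped"
        # start_new_session=True makes the child the leader of a new process
        # group, so its process group id equals its pid.
        proc = subprocess.Popen(cmd, start_new_session=True)
        try:
            return proc.wait(timeout=max_seconds)
        except subprocess.TimeoutExpired:
            # Kill the command and every child it spawned in that group.
            os.killpg(proc.pid, signal.SIGKILL)
            proc.wait()  # reap the child so it does not linger as a zombie
            return "killed"
    finally:
        lock_file.close()  # closing the file releases the lock
```

For example, `run_exclusive(["sh", "-c", "sleep 1000"], max_seconds=600)` would return "killed" after ten minutes, and a second call made while the first is still running would return "skipped" instead of piling up.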
Updated by Alexis Mousset almost 5 years ago
#15
- Target version changed from 6.2.3 to 6.1.10
Updated by Alexis Mousset almost 5 years ago
#16
- Status changed from New to In progress
- Assignee set to Alexis Mousset
Updated by Alexis Mousset almost 5 years ago
#17
- Status changed from In progress to Pending technical review
- Assignee changed from Alexis Mousset to Benoît PECCATTE
- Pull Request set to https://github.com/Normation/rudder-techniques/pull/1649
Updated by Alexis Mousset almost 5 years ago
#18
- Status changed from Pending technical review to Pending release
Applied in changeset rudder-techniques|dc524c125346c091f807d8f9813867d939a4120b.
Updated by Nicolas CHARLES over 4 years ago
#19
- Priority changed from 94 to 93
- Fix check changed from To do to Checked
Updated by Vincent MEMBRÉ over 4 years ago
#20
- Status changed from Pending release to Released
- Priority changed from 93 to 92
This bug has been fixed in Rudder 6.1.10 and 6.2.3 which were released today.