Bug #18832

closed

Rudder Agent consumes complete Memory because of fdisk

Added by Lars Koenen almost 4 years ago. Updated almost 4 years ago.

Status: Released
Priority: N/A
Category: Agent
Target version:
Severity: Critical - prevents main use of Rudder | no workaround | data loss | security
UX impact:
User visibility: Getting started - demo | first install | Technique editor and level 1 Techniques
Effort required:
Priority: 92
Name check: To do
Fix check: Checked
Regression:

Description

We have observed a bug on several of our systems where the rudder agent or a part of rudder uses up the entire memory of the system after a rather short time.
I have narrowed it down to the fact that on these systems the calls to fdisk -l never finish.
The rudder agent does not seem to handle this edge case, which internally seems to lead to a chain of problems.

The reason for fdisk -l hanging on our systems is that the systems have a floppy disk controller, but no floppy drives are attached to them.
Nevertheless, I think that rudder should not blindly trust fdisk.

Interestingly, this behavior only occurs on systems with rudder agent version 6.2.0. Other systems with rudder agent version 6.1.6 can handle the problem.


Related issues 2 (0 open, 2 closed)

Related to Rudder - Bug #11102: Parent fix does not work: Fusioninventory is not tracked by check-rudder-health (Released, Alexis Mousset)
Related to Rudder - Bug #14190: Inventory may never finish if there is a disk issue or invalid mountpoint (Released, Benoît PECCATTE)
Actions #1

Updated by Alexis Mousset almost 4 years ago

  • Severity changed from Major - prevents use of part of Rudder | no simple workaround to Critical - prevents main use of Rudder | no workaround | data loss | security
  • User visibility set to Operational - other Techniques | Rudder settings | Plugins
  • Priority changed from 0 to 76
Actions #2

Updated by Alexis Mousset almost 4 years ago

Does it happen when running the agent (rudder agent run) or during an inventory (rudder agent inventory)?

Do you have a list of the processes (with ps auxf or similar) opened when the problem occurs? It would help find the source of the problem.

Actions #3

Updated by François ARMAND almost 4 years ago

  • User visibility changed from Operational - other Techniques | Rudder settings | Plugins to Getting started - demo | first install | Technique editor and level 1 Techniques
  • Priority changed from 76 to 94

Thank you for reporting.
The fact that it is a regression is especially bad. You are of course right that whatever the fdisk problem is, it shouldn't lead to a bug in the rudder agent. We are looking into it.

Actions #4

Updated by Alexis Mousset almost 4 years ago

We actually have two problems here:

  • Memory leak/exhaustion
  • Lack of timeout for fdisk -l call
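
To illustrate the second point, here is a minimal Python sketch of calling an external command with a timeout and killing its whole process group when the deadline passes. It is not how the Rudder agent or fusioninventory invokes fdisk; the function name `run_with_timeout` and the timeout value in the usage note are assumptions made for the example.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return (returncode, stdout), or (None, "") if it timed out.

    The child is started in its own session so that, on timeout, the whole
    process group (the command and anything it spawned) can be killed.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
        start_new_session=True,  # child becomes leader of a new process group
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        try:
            # Kill the whole group, not just the direct child.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group already exited on its own
        proc.communicate()  # reap the child so no zombie is left behind
        return None, ""
```

A caller could then use something like `run_with_timeout(["fdisk", "-l"], 30)` and treat a `None` return code as "disk listing unavailable" instead of blocking forever. The 30-second value is hypothetical.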
Actions #5

Updated by Alexis Mousset almost 4 years ago

Tested with a simple sleep:

root      5085  0.0  0.2  12832   604 pts/0    S+   13:56   0:00  |                   \_ /bin/sh /opt/rudder/share/commands/agent-inventory
root      5125  0.0  0.2  12832   604 pts/0    S+   13:56   0:00  |                       \_ /bin/sh /opt/rudder/share/commands/agent-run -N -D force_inventory
root      5224  5.5  4.2 120716  9788 ?        Ss   13:56   0:00  |                           \_ /opt/rudder/bin/cf-agent -I -D info -Cnever -K -b doInventory 
root      5273  0.0  0.4  12832  1000 ?        S    13:56   0:00  |                           |   \_ /bin/sh /opt/rudder/bin/run-inventory --local=/var/rudder/
root      5319  4.5 24.9 154868 57652 ?        S    13:56   0:00  |                           |       \_ fusioninventory-agent: running task Inventory         
root      5360  0.0  0.5   9772  1184 ?        S    13:56   0:00  |                           |           \_ /bin/sh /sbin/fdisk -v
root      5361  0.0  0.3   4356   720 ?        S    13:56   0:00  |                           |               \_ sleep 500

It seems to only affect the inventory.

Actions #6

Updated by François ARMAND almost 4 years ago

We can confirm that it's specific to 6.2.x.
Actually no, it also occurs on 6.1.8-nightly at least.

Actions #7

Updated by Alexis Mousset almost 4 years ago

Memory exhaustion could come from inventory processes piling up.

Actions #8

Updated by Lars Koenen almost 4 years ago

Alexis MOUSSET wrote in #note-5:

Tested with a simple sleep:

[...]

It seems to only affect the inventory.

Yes, I can confirm. We had exactly this situation: many stranded cf-agent processes that piled up.

Actions #9

Updated by Vincent MEMBRÉ almost 4 years ago

  • Target version changed from 6.2.1 to 6.2.2
Actions #10

Updated by François ARMAND almost 4 years ago

Tested on 6.1.7, 6.1.8-rc, and 6.2.0:

- switch fdisk with sleep 1000,
- start a bunch of rudder agent inventory.

In all cases, after some time, rudder agent inventory returns with the following message:

info     Rudder agent was run on a subset of policies - not all policies were checked

But the blocking fdisk is still running in the background and is not killed (as shown in message #15).

Actions #11

Updated by François ARMAND almost 4 years ago

  • Related to Bug #11102: Parent fix does not work: Fusioninventory is not tracked by check-rudder-health added
Actions #12

Updated by Nicolas CHARLES almost 4 years ago

  • Related to Bug #14190: Inventory may never finish if there is a disk issue or invalid mountpoint added
Actions #13

Updated by Vincent MEMBRÉ almost 4 years ago

  • Target version changed from 6.2.2 to 6.2.3
Actions #14

Updated by Alexis Mousset almost 4 years ago

There are several problems here:

  • We allow several inventories to pile up, but the inventory process uses a lot of memory, so this can lead to DoSing the node.
  • We allow inventories to last forever. Using CFEngine's timeout is not enough, as it leaves zombie processes and may fail.

What we could do:

  • Prevent running a new inventory if another one is already running. If done reliably, it prevents the DoS.
  • Add a watchdog for the inventory process in rudder agent check, to kill inventories lasting more than X minutes, along with all of their child processes.
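
The first idea (refusing to start an inventory while another one is running) can be sketched with an advisory file lock. This is only an illustration in Python, not the fix that was merged; the helper names `try_lock` and `release_lock` and any lock file path are assumptions.

```python
import fcntl
import os


def try_lock(path):
    """Return an open fd holding an exclusive lock on path, or None if busy."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release_lock(fd):
    """Release the lock by closing the descriptor."""
    os.close(fd)
```

A wrapper would call `try_lock` on a lock file of its choosing before launching the inventory and skip the run when it returns `None`. Because the kernel drops an flock lock when the holding process exits, a crashed inventory does not leave a stale lock behind.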
Actions #15

Updated by Alexis Mousset almost 4 years ago

  • Target version changed from 6.2.3 to 6.1.10
Actions #16

Updated by Alexis Mousset almost 4 years ago

  • Status changed from New to In progress
  • Assignee set to Alexis Mousset
Actions #17

Updated by Alexis Mousset almost 4 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Alexis Mousset to Benoît PECCATTE
  • Pull Request set to https://github.com/Normation/rudder-techniques/pull/1649
Actions #18

Updated by Alexis Mousset almost 4 years ago

  • Status changed from Pending technical review to Pending release
Actions #19

Updated by Nicolas CHARLES almost 4 years ago

  • Priority changed from 94 to 93
  • Fix check changed from To do to Checked
Actions #20

Updated by Vincent MEMBRÉ almost 4 years ago

  • Status changed from Pending release to Released
  • Priority changed from 93 to 92

This bug has been fixed in Rudder 6.1.10 and 6.2.3 which were released today.
