Bug #7295: slapd core dumps on 1TB RAM - Rudder - Issue Tracker

Bug #7295

closed

slapd core dumps on 1TB RAM

Added by Janos Mattyasovszky over 9 years ago. Updated over 9 years ago.

Status: Released
Priority: N/A
Assignee: Benoît PECCATTE
Category: Server components
Target version: 2.11.17
Pull Request: https://github.com/Normation/rudder-p...

Description

When starting slapd on a system with 1TB of RAM, it writes a coredump due to a memory allocation issue.

OS: SLES11 SP3
Package: rudder-inventory-ldap-2.11.15.release-1.SLES.11
Memory: 1033740 Megabytes

++ sed -n '/^MemTotal/s/[^0-9]//gp' /proc/meminfo
+ MEMSIZE=1058549824
+ CACHESIZE=135490000
+ sed -ie '/^[ \t]*\(cachesize\|idlcachesize\)/d' /opt/rudder/etc/openldap/slapd.conf
+ sed -ie 's/^\([ \t]*suffix[ \t]\+"cn=rudder-configuration".*\)/\1\ncachesize 135490000/' /opt/rudder/etc/openldap/slapd.conf
+ sed -ie 's/^\([ \t]*suffix[ \t]\+"cn=rudder-configuration".*\)/\1\nidlcachesize 406470000/' /opt/rudder/etc/openldap/slapd.conf
[...]
56262153 ch_calloc of 1 elems of 18446744010302859784 bytes failed
slapd: ch_malloc.c:107: ch_calloc: Assertion `0' failed.
/etc/init.d/rudder-slapd: line 359: 58757 Aborted                 (core dumped) $SLAPD_BIN -h "$SLAPD_SERVICES" $SLAPD_PARAMS -d 1
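
For readers less used to the sed one-liner at the top of this trace, here is a minimal Python sketch of the same extraction. It assumes the usual /proc/meminfo layout, where MemTotal is reported in kB; the sample text and the function name mem_total_kb are hypothetical and only serve the illustration.

```python
import re

def mem_total_kb(meminfo_text):
    """Return the MemTotal value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal"):
            # Same idea as the sed expression: drop every non-digit character.
            return int(re.sub(r"[^0-9]", "", line))
    raise ValueError("no MemTotal line found")

# Hypothetical sample; only the MemTotal value comes from the trace above.
sample = "MemTotal:       1058549824 kB\nMemFree:        1000000000 kB\n"
kb = mem_total_kb(sample)
mb = kb // 1024  # whole megabytes
print(kb, mb)
```

Read this way, MEMSIZE=1058549824 is a kB figure, which agrees with the 1033740 Megabytes quoted in the report.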

Here is the backtrace from the core:

gdb -batch -ex bt /opt/rudder/libexec/slapd core.slapd.58757 bt
[...]
Core was generated by `/opt/rudder/libexec/slapd -h ldap://127.0.0.1:389 -n rudder-slapd -f /opt/rudde'.
Program terminated with signal 6, Aborted.
#0  0x00007fb98f48a885 in raise () from /lib64/libc.so.6
#0  0x00007fb98f48a885 in raise () from /lib64/libc.so.6
#1  0x00007fb98f48be61 in abort () from /lib64/libc.so.6
#2  0x00007fb98f483740 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000044fc2c in ch_calloc ()
#4  0x000000000043dfa6 in attr_prealloc ()
#5  0x00007fb98cf551df in hdb_db_open (be=0x893400, cr=0x7ffd055011f0) at init.c:536
#6  0x0000000000443f7f in backend_startup_one ()
#7  0x000000000044428b in backend_startup ()
#8  0x000000000041a850 in main ()

When slapd is started with RUDDER_CACHESIZE="noauto" and the cachesize tunings are removed from the slapd config, it starts.

An strace of slapd WITH memory tuning:

21745 mmap(NULL, 35921920, PROT_READ|PROT_WRITE, MAP_SHARED, 9, 0) = 0x7f1a4673b000
21745 mmap(NULL, 32768, PROT_READ|PROT_WRITE, MAP_SHARED, 9, 0) = 0x7f1a46733000
21745 mmap(NULL, 3145728, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1a46433000
21745 mmap(NULL, 10839203840, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f17c0621000
21745 mmap(NULL, 18446744010302861312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
21745 mmap(NULL, 18446744010302996480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
21745 mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f17b8621000
21745 mmap(NULL, 18446744010302861312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
21745 --- SIGABRT (Aborted) @ 0 (0) ---
21745 +++ killed by SIGABRT (core dumped) +++

With all memory tuning removed, strace shows smaller values:

29764 mmap(NULL, 35921920, PROT_READ|PROT_WRITE, MAP_SHARED, 9, 0) = 0x7f4a511a3000
29764 mmap(NULL, 32768, PROT_READ|PROT_WRITE, MAP_SHARED, 9, 0) = 0x7f4a57813000
29764 mmap(NULL, 24576, PROT_READ|PROT_WRITE, MAP_SHARED, 10, 0) = 0x7f4a578d8000
29764 mmap(NULL, 66019328, PROT_READ|PROT_WRITE, MAP_SHARED, 9, 0) = 0x7f4a4f4ef000
29764 mmap(NULL, 1073741824, PROT_READ|PROT_WRITE, MAP_SHARED, 9, 0) = 0x7f4a0f4ef000
29764 mmap(NULL, 2359296, PROT_READ|PROT_WRITE, MAP_SHARED, 9, 0) = 0x7f4a0f2af000
29764 mmap(NULL, 35921920, PROT_READ|PROT_WRITE, MAP_SHARED, 9, 0) = 0x7f4a0d06d000
29764 mmap(NULL, 32768, PROT_READ|PROT_WRITE, MAP_SHARED, 9, 0) = 0x7f4a0d065000
29764 mmap(NULL, 3145728, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4a0cd65000
29764 mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4a0c864000
29765 mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f4a0c063000

Somehow the calculated memory size does not fit slapd on a system this big, as it tries to allocate 18.4 exabytes (about 1.8 × 10^19 bytes) of RAM :-D :-D :-D

Updated by Janos Mattyasovszky over 9 years ago

Subject changed from slapd core dumps due malloc issue to slapd core dumps on 1TB RAM
Description updated (diff)

Updated by Vincent MEMBRÉ over 9 years ago

Assignee set to Benoît PECCATTE

Wow! I don't know whether it's a bug in slapd or not!

What do you think, Benoît?

Updated by Jonathan CLARKE over 9 years ago

The fact that it works OK with "noauto" indicates this is a bug in our memory calculation. We do this to dynamically adjust the database size for openldap, and obviously if we tell it to use 18.4 exabytes (!!) it will fail.

Now all I need to fix this is a machine with 1 TB of ram :D

Updated by Benoît PECCATTE over 9 years ago

Well, we do not tell it to use 18.4 exabytes; we tell it to use 135490000 cache entries.
With a mean cache entry size of 800 bytes, this makes 135490000*800/1024/1024/1024 = 100 GB of RAM.
That's a lot (probably too much), but it is not the size LDAP tries to allocate (126 GB per cache entry).

This looks like an integer overflow in OpenLDAP (18*10^18 ~ 2^64).

Since 18446744010302859784 - 2^64 = -60G, I'd say that it overflows at around 40 GB of cache.
So I suggest limiting the LDAP cache to 32 GB to work around this.
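
The arithmetic in this comment can be checked directly. The sketch below recomputes the intended cache size, reinterprets the failed allocation as a signed 64-bit value, and then tries one possible explanation for the exact number: a signed 32-bit wraparound of an entry count. The constants PREALLOC_FACTOR, ELEM_BYTES and HEADER_BYTES are assumptions chosen because they reproduce the logged size; they are not stated anywhere in this thread and have not been checked here against the OpenLDAP source.

```python
CACHESIZE = 135_490_000                     # entries, from the init script trace
MEAN_ENTRY_BYTES = 800                      # mean entry size quoted in the comment
FAILED_ALLOC = 18_446_744_010_302_859_784   # bytes, from the ch_calloc error

# Intended cache size, and the failed request divided by the entry count (GiB).
intended_gib = CACHESIZE * MEAN_ENTRY_BYTES / 1024**3
per_entry_gib = FAILED_ALLOC / CACHESIZE / 1024**3

# The failed size read as a signed 64-bit number: a negative value that wrapped.
as_signed = FAILED_ALLOC - 2**64

def wrap_int32(n):
    """Reduce n to the range of a signed 32-bit integer (two's complement)."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n

# ASSUMPTION: the entry count is multiplied by 20 in a 32-bit int, and each
# element takes 40 bytes after an 8-byte header. Hypothetical values.
PREALLOC_FACTOR = 20
ELEM_BYTES = 40
HEADER_BYTES = 8

count = wrap_int32(CACHESIZE * PREALLOC_FACTOR)
reproduced = (HEADER_BYTES + count * ELEM_BYTES) % 2**64

print(round(intended_gib, 1), round(per_entry_gib, 1), as_signed)
print(count, reproduced == FAILED_ALLOC)
```

Under these assumed constants the reproduced size matches the ch_calloc message byte for byte. If the assumption holds, the wrap would begin once the entry count times 20 exceeds 2^31 - 1, that is above roughly 107 million entries. This is an inference from the numbers only, not something confirmed in this thread.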

Updated by Janos Mattyasovszky over 9 years ago

If you have a patch, I can test it if you'd like.

However, if the limit will be max 32G, it should start on an arbitrary big system having about 64G+ of RAM.

Updated by Jonathan CLARKE over 9 years ago

Janos Mattyasovszky wrote:

If you have a patch, I can test it if you'd like.

However, if the limit will be max 32G, it should start on an arbitrary big system having about 64G+ of RAM.

The calculation is already proportional to the RAM size, and aims to use about 10%, thus the 100 GB.

We can set a limit well before that, though, because that would be a huge cache (about 40M entries!)
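
A minimal sketch of a proportional calculation with an upper bound, using the figures mentioned in this thread (about 10% of RAM, 800 bytes per entry, a 32 GB cap). The function name cache_entries and its parameters are hypothetical; this is not the code of the Rudder init script, whose exact rounding is not shown here.

```python
def cache_entries(mem_total_kb, percent=10, mean_entry_bytes=800,
                  cap_bytes=32 * 1024**3):
    """Return a cache size in entries for a machine with mem_total_kb of RAM.

    The budget is `percent` of total RAM, optionally limited to cap_bytes.
    Integer arithmetic only, so the result cannot pick up rounding noise.
    """
    budget = mem_total_kb * 1024 * percent // 100  # bytes
    if cap_bytes is not None:
        budget = min(budget, cap_bytes)
    return budget // mean_entry_bytes

uncapped = cache_entries(1058549824, cap_bytes=None)  # the 1TB machine, no cap
capped = cache_entries(1058549824)                    # same machine, 32 GiB cap
small = cache_entries(64 * 1024 * 1024)               # a 64 GiB machine
print(uncapped, capped, small)
```

Without the cap, the result for the 1TB machine is close to the 135490000 seen in the trace. With the cap it falls to roughly 43 million entries, in line with the "about 40M entries" estimate, while a 64 GiB machine stays well under the cap.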
