User story #4115
closedpostgresql database corruption
Description
Hello
We are facing an issue with our rudder database which seems to be corrupted . Each time we try to access "node configuration" tab we see this in the error log :
Oct 29 14:22:59 rudder postgres[16432]: [16-1] user=rudder,db=rudder ERROR: could not read block 109354 of relation base/288068/288178: read only 5440 of 8192 bytes Oct 29 14:22:59 rudder postgres[16432]: [16-2] user=rudder,db=rudder STATEMENT: select executiondate, nodeid, ruleId, directiveid, serial, component, keyValue, executionTimeStamp, eventtype, policy, msg from RudderSysEvents join (select nodeid as Node, max(executiontimestamp) as Time from ruddersysevents where ruleId = 'hasPolicyServer-root' and component = 'common' and keyValue = 'EndRun' and executionTimeStamp > (now() - interval '15 minutes') group by nodeid ) as Ordering on Ordering.Node = nodeid and executionTimeStamp = Ordering.Time where 1=1 and ruleId = $1 and serial = $2 and executionTimeStamp > (now() - interval '15 minutes')
or:
Oct 29 14:23:39 rudder postgres[27246]: [2-1] user=[unknown],db=[unknown] LOG: connection received: host=::1 port=51251 Oct 29 14:23:39 rudder postgres[27246]: [3-1] user=rudder,db=rudder LOG: connection authorized: user=rudder database=rudder Oct 29 14:23:39 rudder postgres[27245]: [5-1] user=rudder,db=rudder LOG: disconnection: session time: 0:00:00.006 user=rudder database=rudder host=::1 port=51250 Oct 29 14:23:39 rudder postgres[27246]: [4-1] user=rudder,db=rudder ERROR: could not read block 1 of relation base/288068/2605: Bad address Oct 29 14:23:39 rudder postgres[27246]: [4-2] user=rudder,db=rudder STATEMENT: insert into RudderSysEvents (executionDate, nodeId, ruleId, directiveId, serial, Component, KeyValue, executionTimeStamp, eventType, msg, Policy) values ('2013-10-29T14:23:29.033367+00:00','root', 'hasPolicyServer-root' , 'common-root', '82', 'common', 'StartRun', '2013-10-29 14:23:25+00:00', 'log_info', 'Start execution', 'common' )
I tried various ways to fix this:
- restore from backup (no use, probably the corruption occurred before our oldest backup)
- cleared data from RudderSysEvents . This fixes the issue for a while, but after more data is added to that table the problem reoccurs.
- recreated indexes.
- made a backup of the database with pg_dump and reimported it.
- made a dump, deleted the rudder database , created a new rudder database, reimported the dump. No use.
I searched the web for tips on how to fix the corruption but I can't find anything useful. I suspect that the corruption may be somewhere in postgres system tables or data files.
Do you have any ideas on how to fix this?
Do you think reinstalling postgres and reimporting the data from a pg_dump file may help?
Any help will be greatly appreciated.
Regards
Daniel Stan
Updated by Matthieu CERDA about 11 years ago
- Status changed from New to Discussion
- Assignee set to Daniel Stan
- Priority changed from N/A to 4
Hello Daniel,
First, let's try to dismiss the most basic causes possible for this kind of issue:- Is your filesystem full ? (Can be seen using "df -h")
- Do your filesystem has any explicit corruption ? ("dmesg", Search for frightening messages about EXT-3/EXT-4 at the end of the output)
Can you please also give us your Rudder version and PostgreSQL version ? (Depends on your OS version, using Yum or dpkg)
Thanks in advance.
Updated by Daniel Stan about 11 years ago
Hello
Is your filesystem full ? (Can be seen using "df -h")
There is enough disk space now on that server
Do your filesystem has any explicit corruption ? ("dmesg", Search for frightening messages about EXT-3/EXT-4 at the end of the output)
The server is a virtual machine and the limits were mistakenly set to low , which, I think , caused the database corruption. When this happened the server ran out of memory and postgresql crashed a few times. Also at some point it ran out of disk space( I know this not the best combination!!!).
Can you please also give us your Rudder version and PostgreSQL version ? (Depends on your OS version, using Yum or dpkg)
bash-4.1# psql -V psql (PostgreSQL) 8.4.13 contains support for command-line editing bash-4.1# bash-4.1# rpm -qa | grep postgresql postgresql-libs-8.4.13-1.el6_3.x86_64 postgresql-8.4.13-1.el6_3.x86_64 postgresql-server-8.4.13-1.el6_3.x86_64 bash-4.1# bash-4.1# rpm -qa | grep rudder rudder-inventory-ldap-2.7.1.release-1.EL.6.x86_64 rudder-inventory-endpoint-2.7.1.release-1.EL.6.noarch rudder-techniques-2.7.1.release-1.EL.6.noarch rudder-agent-2.7.1.release-1.EL.6.x86_64 rudder-server-root-2.7.1.release-1.EL.6.noarch rudder-jetty-2.7.1.release-1.EL.6.noarch rudder-reports-2.7.1.release-1.EL.6.noarch rudder-webapp-2.7.1.release-1.EL.6.noarch bash-4.1#
Updated by Matthieu CERDA about 11 years ago
Oh, ok. So, are things working properly after the VM limits has been raised ?
Updated by Daniel Stan about 11 years ago
Unfortunately not. After those events, the database got corrupted and we are in the situation explained in my first post. I was thinking that you may have encountered a similar situation where the postgres database becomes corrupted and you may know a way to recover from it.
Updated by Nicolas CHARLES about 11 years ago
Daniel,
What is surprising is that dumping, purging and restoring doesn't solve the issue.
It clearly should, as you start on fresh bases in this scenario.
Are you using OpenVZ ? If so, it may be related to the problem described there
http://comments.gmane.org/gmane.comp.db.postgresql.bugs/27033
Updated by Daniel Stan about 11 years ago
I noticed that discussion too and indeed we are using virtuozzo , the commercial version of openvz. After playing with postgresql settings I discovered that the problem was caused by the shared_buffers setting. If I set it to a value equal or higher than 1GB then I start to receive that error. Any value less than 1G(even 999M) doesn't cause any issues. So there may be some OS value that causes this problem. I tried tweaking the sysctl parameters but I didn't manage to find a setting that fixes this issue. I also opened a ticket with virtuozzo to see if they can help. A similar problem is discussed here:
Hopefully this will help others facing these errors.
Updated by Benoît PECCATTE over 9 years ago
- Category set to Server components
- Assignee deleted (
Daniel Stan)
Updated by Benoît PECCATTE over 8 years ago
- Target version set to Ideas (not version specific)
Updated by Benoît PECCATTE almost 8 years ago
- Status changed from Discussion to Rejected
Thank you for the report.