User story #4115: postgresql database corruption - Rudder - Issue Tracker

Actions

Copy link

User story #4115

closed

postgresql database corruption

Added by Daniel Stan almost 12 years ago. Updated over 8 years ago.

Status:

Rejected

Priority:

Assignee:

Category:

Server components

Target version:

Ideas (not version specific)

Pull Request:

UX impact:

Suggestion strength:

User visibility:

Effort required:

Name check:

Fix check:

Regression:

Description

Hello

We are facing an issue with our rudder database which seems to be corrupted . Each time we try to access "node configuration" tab we see this in the error log :

Oct 29 14:22:59 rudder postgres[16432]: [16-1] user=rudder,db=rudder ERROR:  could not read block 109354 of relation base/288068/288178: read only 5440 of 8192 bytes
Oct 29 14:22:59 rudder postgres[16432]: [16-2] user=rudder,db=rudder STATEMENT:  select executiondate, nodeid, ruleId, directiveid, serial, component, keyValue, executionTimeStamp, eventtype, policy, msg from RudderSysEvents join (select nodeid as Node, max(executiontimestamp) as Time from ruddersysevents where ruleId = 'hasPolicyServer-root' and component = 'common' and keyValue = 'EndRun' and executionTimeStamp > (now() - interval '15 minutes') group by nodeid ) as Ordering on Ordering.Node = nodeid and executionTimeStamp = Ordering.Time where 1=1 and ruleId = $1 and serial = $2 and executionTimeStamp > (now() - interval '15 minutes')

or:

Oct 29 14:23:39 rudder postgres[27246]: [2-1] user=[unknown],db=[unknown] LOG:  connection received: host=::1 port=51251
Oct 29 14:23:39 rudder postgres[27246]: [3-1] user=rudder,db=rudder LOG:  connection authorized: user=rudder database=rudder
Oct 29 14:23:39 rudder postgres[27245]: [5-1] user=rudder,db=rudder LOG:  disconnection: session time: 0:00:00.006 user=rudder database=rudder host=::1 port=51250
Oct 29 14:23:39 rudder postgres[27246]: [4-1] user=rudder,db=rudder ERROR:  could not read block 1 of relation base/288068/2605: Bad address
Oct 29 14:23:39 rudder postgres[27246]: [4-2] user=rudder,db=rudder STATEMENT:  insert into RudderSysEvents (executionDate, nodeId, ruleId, directiveId, serial, Component, KeyValue, executionTimeStamp, eventType, msg, Policy) values ('2013-10-29T14:23:29.033367+00:00','root', 'hasPolicyServer-root' , 'common-root', '82', 'common', 'StartRun', '2013-10-29 14:23:25+00:00', 'log_info', 'Start execution', 'common' )

I tried various ways to fix this:

- restore from backup (no use, probably the corruption occurred before our oldest backup)
 - cleared data from RudderSysEvents . This fixes the issue for a while, but after more data is added to that table the problem reoccurs.
 - recreated indexes.
 - made a backup of the database with pg_dump and reimported it. 
 - made a dump, deleted the rudder database , created a new rudder database, reimported the dump. No use.

I searched the web for tips on how to fix the corruption but I can't find anything useful. I suspect that the corruption may be somewhere in postgres system tables or data files.

Do you have any ideas on how to fix this?
Do you think reinstalling postgres and reimporting the data from a pg_dump file may help?

Any help will be greatly appreciated.

Regards

Daniel Stan

Actions

Copy link

Updated by Matthieu CERDA almost 12 years ago

Status changed from New to Discussion
Assignee set to Daniel Stan
Priority changed from N/A to 4

Hello Daniel,

First, let's try to dismiss the most basic causes possible for this kind of issue:

Is your filesystem full ? (Can be seen using "df -h")
Do your filesystem has any explicit corruption ? ("dmesg", Search for frightening messages about EXT-3/EXT-4 at the end of the output)

Can you please also give us your Rudder version and PostgreSQL version ? (Depends on your OS version, using Yum or dpkg)

Thanks in advance.

Actions

Copy link

Updated by Daniel Stan almost 12 years ago

Hello

Is your filesystem full ? (Can be seen using "df -h")

There is enough disk space now on that server

Do your filesystem has any explicit corruption ? ("dmesg", Search for frightening messages about EXT-3/EXT-4 at the end of the output)

The server is a virtual machine and the limits were mistakenly set to low , which, I think , caused the database corruption. When this happened the server ran out of memory and postgresql crashed a few times. Also at some point it ran out of disk space( I know this not the best combination!!!).

Can you please also give us your Rudder version and PostgreSQL version ? (Depends on your OS version, using Yum or dpkg)

bash-4.1# psql -V
psql (PostgreSQL) 8.4.13
contains support for command-line editing
bash-4.1#
bash-4.1# rpm -qa | grep postgresql
postgresql-libs-8.4.13-1.el6_3.x86_64
postgresql-8.4.13-1.el6_3.x86_64
postgresql-server-8.4.13-1.el6_3.x86_64
bash-4.1#
bash-4.1# rpm -qa | grep rudder
rudder-inventory-ldap-2.7.1.release-1.EL.6.x86_64
rudder-inventory-endpoint-2.7.1.release-1.EL.6.noarch
rudder-techniques-2.7.1.release-1.EL.6.noarch
rudder-agent-2.7.1.release-1.EL.6.x86_64
rudder-server-root-2.7.1.release-1.EL.6.noarch
rudder-jetty-2.7.1.release-1.EL.6.noarch
rudder-reports-2.7.1.release-1.EL.6.noarch
rudder-webapp-2.7.1.release-1.EL.6.noarch
bash-4.1#

Actions

Copy link

Updated by Matthieu CERDA almost 12 years ago

Oh, ok. So, are things working properly after the VM limits has been raised ?

Actions

Copy link

Updated by Daniel Stan almost 12 years ago

Unfortunately not. After those events, the database got corrupted and we are in the situation explained in my first post. I was thinking that you may have encountered a similar situation where the postgres database becomes corrupted and you may know a way to recover from it.

Actions

Copy link

Updated by Nicolas CHARLES almost 12 years ago

Daniel,

What is surprising is that dumping, purging and restoring doesn't solve the issue.
It clearly should, as you start on fresh bases in this scenario.

Are you using OpenVZ ? If so, it may be related to the problem described there
http://comments.gmane.org/gmane.comp.db.postgresql.bugs/27033

Actions

Copy link

Updated by Daniel Stan almost 12 years ago

I noticed that discussion too and indeed we are using virtuozzo , the commercial version of openvz. After playing with postgresql settings I discovered that the problem was caused by the shared_buffers setting. If I set it to a value equal or higher than 1GB then I start to receive that error. Any value less than 1G(even 999M) doesn't cause any issues. So there may be some OS value that causes this problem. I tried tweaking the sysctl parameters but I didn't manage to find a setting that fixes this issue. I also opened a ticket with virtuozzo to see if they can help. A similar problem is discussed here:

http://postgresql.1045698.n5.nabble.com/Reproducible-quot-Bus-error-quot-in-9-2-3-during-database-dump-restoration-Ubuntu-Server-12-04-LTS-td5747425.html

Hopefully this will help others facing these errors.

Actions

Copy link