Bug #14789

closed

Postgres fills up and can't be vacuumed

Added by Tobias Ell almost 5 years ago. Updated almost 5 years ago.

Status:
Released
Priority:
N/A
Category:
Web - Maintenance
Target version:
5.0.10
Severity:
Major - prevents use of part of Rudder | no simple workaround
UX impact:
User visibility:
Operational - other Techniques | Rudder settings | Plugins
Effort required:
Priority:
52
Name check:
Fix check:
Regression:

Description

Hi,
in the last two weeks I have had two occasions where the disk for my Postgres DB filled up.
I vacuum the Rudder DB every night, but lately that doesn't seem to be enough.
Checking the log of my housekeeping job, I noticed that many tables have
"dead row versions cannot be removed yet" entries.
That seems to point to a problem with uncommitted transactions.
I checked pg_stat_activity and found this entry:

16385 | rudder | 8071 | 16384 | rudder | PostgreSQL JDBC Driver | 127.0.0.1 | | 48393 | 2019-04-29 08:53:51.989498+02 |
2019-04-29 08:53:56.704253+02 | 2019-04-29 08:53:56.704567+02 | 2019-04-29 08:53:56.704589+02 | f | idle in transaction | SELECT 1

The client port 48393 belongs to a connection established by the Jetty process.
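
For reference, a query like the following (just a sketch, using the PostgreSQL 9.2 column names) lists such stuck sessions:

SELECT pid, usename, client_addr, client_port, state, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction';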

My server is running on CentOS 7,
all Rudder components are version 5.0.9,
and Postgres is 9.2.24.

Has anyone encountered this problem before?
I am not too familiar with Postgres, so it could be a misconfiguration.
Any help is appreciated.


Files

vacuum_rudder_full.log (11.6 KB) - Tobias Ell, 2019-04-30 15:29

Related issues 1 (0 open, 1 closed)

Related to Rudder - Bug #15413: Autovacuum may never finish, or take more than one day, on busy Rudder instance, because of resources attrition (Released, Alexis Mousset)
Actions #1

Updated by François ARMAND almost 5 years ago

Hello,

Thanks for reporting that problem. I don't think the problem you are describing is a configuration one, more likely a bug. Unfortunately, our Postgres expert is on holidays this week.
I see here that the problem is with "SELECT 1", which is the default query used by the pool manager to test connection availability. I will start looking into that.

Are there other requests in error? Or perhaps do you have more information on the tables impacted? It would help a lot to understand why some transactions get stalled (and I'm sorry I can't help more for now).

Also: did the problem start concurrently with some special event? (I would understand if you have no idea about that.)

Actions #2

Updated by François ARMAND almost 5 years ago

Thinking a bit more about it, the SELECT 1 may be normal. As it is the query used by the pool to test connection health, it's normal to see some. I'm not sure about the status, though.

So, do you have other requests? Where are the "dead row versions cannot be removed yet" entries reported if you try a VACUUM FULL VERBOSE ANALYZE?
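
For example (a sketch; archivedruddersysevents is just one of the Rudder tables), a per-table run would be:

VACUUM FULL VERBOSE ANALYZE archivedruddersysevents;
-- the VERBOSE output then contains lines like:
-- DETAIL:  N dead row versions cannot be removed yet.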

Actions #3

Updated by Tobias Ell almost 5 years ago

Hi François,

I see the dead rows on many tables - I have attached the vacuum log file.
The problem with the "SELECT 1" connection is the state "idle in transaction".
This seems to block vacuuming on all tables. If the query is only meant to test connection
availability, it should not start a transaction.
As it is, vacuum only works while Jetty is stopped.

Actions #4

Updated by François ARMAND almost 5 years ago

  • Category set to Web - Maintenance
  • Target version set to 5.0.10

OK, it looks like there are a bunch of dead rows in a lot of tables.

You are right that there is no need for a transaction for the test, but I believe that's how it's done by default (and we didn't change it). We could change that, or we could even update to a JDBC4 driver, which has a built-in function for that. But it should not lock all these tables; the SELECT isn't even done on a specific table. I will need to find a way to reproduce the vacuum problem to be sure the change actually solves it.

What happens if you kill, at the OS level, the connection which is in "idle in transaction" and then run the VACUUM again? I.e., in pg_stat_activity, get the "pid" column (8071 in your example), and in a console:
kill 8071
(Actually, you can find the "idle in transaction" process and its pid directly in a console: ps aux | grep "idle in transaction".)
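
Alternatively (a sketch from the PostgreSQL side rather than the OS), the stuck backend can be terminated directly in psql:

-- find the stuck backend, then terminate it from within PostgreSQL
SELECT pid, state, query FROM pg_stat_activity WHERE state = 'idle in transaction';
SELECT pg_terminate_backend(8071);  -- use the pid returned above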

If the VACUUM then works, it would be good evidence that the problem is with the SELECT 1.

Actions #5

Updated by François ARMAND almost 5 years ago

  • Assignee set to François ARMAND
Actions #6

Updated by Tobias Ell almost 5 years ago

Oh, the problem is indeed with the "SELECT 1" - I have tested by ending the query with pg_terminate_backend(8071) and
cross-checked by stopping Jetty (which owns the connection from local port 48393).
As soon as the "SELECT 1" is gone, vacuum works.

Actions #7

Updated by François ARMAND almost 5 years ago

Just to be sure I understand: does VACUUM work before you kill Jetty? Because it could be another request done by Rudder, also cleared when you kill Jetty, that blocks things.

OK, so I'm going to see if we can use something else (or at least not run the test in a transaction). Thanks for both reporting and helping with the diagnosis!

Actions #8

Updated by Tobias Ell almost 5 years ago

VACUUM works if the "SELECT 1" with state "idle in transaction" is gone.
I can do this by
a) ending the Postgres process
or
b) stopping Jetty

With an "idle in transaction" query present, vacuum reports dead row versions and does not free disk space.

Is the "SELECT 1" only done at Jetty start?
I ended the query an hour ago and it hasn't come back yet.

Actions #9

Updated by François ARMAND almost 5 years ago

OK, thank you very much for the clarification, there's no doubt it is the culprit.

"SELECT 1" is used by the connection pool (HikariCP, for the record) to test if a connection is alive (and to spawn the correct number of connections based on the pool size). It should do so at regular intervals.

So maybe there is a set of conditions needed to get a SELECT 1 stuck. It could be linked to something that happens only at startup (for example, something is not fully ready and it breaks).

Now that the culprit is clearly identified, we know what to change. Thank you!

Actions #10

Updated by François ARMAND almost 5 years ago

  • Tracker changed from Question to Bug
  • Severity set to Major - prevents use of part of Rudder | no simple workaround
  • User visibility set to Operational - other Techniques | Rudder settings | Plugins
  • Priority set to 52
Actions #11

Updated by François ARMAND almost 5 years ago

  • Status changed from New to In progress
Actions #12

Updated by François ARMAND almost 5 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Vincent MEMBRÉ
  • Pull Request set to https://github.com/Normation/rudder/pull/2211
Actions #13

Updated by François ARMAND almost 5 years ago

  • Status changed from Pending technical review to Pending release
Actions #14

Updated by François ARMAND almost 5 years ago

We have committed a patch that avoids using SELECT 1, as documented here: https://github.com/brettwooldridge/HikariCP (search for `connectionTestQuery`). You should be able to test with tomorrow's nightly whether it corrects the problem for you.

Actions #15

Updated by Tobias Ell almost 5 years ago

Thank you - I will test the patch next week.

Actions #16

Updated by Nicolas CHARLES almost 5 years ago

"Idle in transaction" means that a transaction began and never ended.
So here a query started, and the connection pool then checked the validity of the connection with SELECT 1 (hiding the previous query).

Removing the SELECT 1 may or may not fix the issue, but at least it will let us see the actual query being performed.
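
A quick sketch of a query that makes this visible (an xact_start much older than the query_start suggests that the displayed SELECT 1 is only the latest statement on a transaction opened earlier):

SELECT pid, xact_start, query_start, state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction';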

Actions #17

Updated by Nicolas CHARLES almost 5 years ago

Tobias, I have one last question for you:
do you happen to have messages like

2019-04-19 00:01:31 CEST [11185]: [1-1] LOG:  sending cancel to blocking autovacuum PID 14822
2019-04-19 00:01:31 CEST [11185]: [2-1] DETAIL:  Process 11185 waits for AccessExclusiveLock on relation 16512 of database 16386.
2019-04-19 00:01:31 CEST [11185]: [3-1] STATEMENT:  vacuum full archivedruddersysevents
2019-04-19 00:01:31 CEST [14822]: [1-1] ERROR:  canceling autovacuum task
2019-04-19 00:01:31 CEST [14822]: [2-1] CONTEXT:  automatic analyze of table "rudder.public.archivedruddersysevents" 
in your PostgreSQL log files? (Most probably in /var/log/postgresql.)

Actions #18

Updated by Tobias Ell almost 5 years ago

I do have
LOG: checkpoints are occurring too frequently (18 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "rudder.public.archivedruddersysevents"

although my housekeeping script does not report errors anymore (in production).

I am setting up a test environment to check the nightly.

Actions #19

Updated by Tobias Ell almost 5 years ago

I installed the latest nightly on the test server, and since the restart I haven't found
a "SELECT 1" or "idle in transaction" query. Looks good.

Actions #20

Updated by François ARMAND almost 5 years ago

Hello, would you mind checking whether you still have the problem now? As we don't know exactly what triggered it, we need to check again after some time.
So (a quick sketch of both checks follows the list):

- do you have a query in "idle in transaction" (other than SELECT 1, since we replaced it)?
- does the VACUUM report that it can clean up everything, without dead rows left behind?
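
A minimal sketch of both checks (assuming PostgreSQL 9.2):

-- 1. stuck sessions: this count should now be 0
SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction';

-- 2. the VERBOSE output of the maintenance vacuum should no longer contain
--    "dead row versions cannot be removed yet"
VACUUM FULL VERBOSE ANALYZE;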

Thanks !

Actions #21

Updated by Tobias Ell almost 5 years ago

Hello,

there are no "idle in transaction" queries (and "SELECT 1" hasn't come up again either).
All vacuum operations have finished without warnings about "dead rows".

The fix seems to work - thank you.

Actions #22

Updated by François ARMAND almost 5 years ago

That's great, thank you for letting us know. We are going to cut a release in the next couple of days with that fix. Thank you again!

Actions #23

Updated by Vincent MEMBRÉ almost 5 years ago

  • Subject changed from postgres fills up and can't be vacuumed to Postgres fills up and can't be vacuumed
Actions #24

Updated by Vincent MEMBRÉ almost 5 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 5.0.10, which was released today.

Actions #25

Updated by Nicolas CHARLES over 4 years ago

  • Related to Bug #15413: Autovacuum may never finish, or take more than one day, on busy Rudder instance, because of resources attrition added
