Bug #14789

closed

Postgres fills up and can't be vacuumed

Added by Tobias Ell almost 5 years ago. Updated almost 5 years ago.

Status:
Released
Priority:
N/A
Category:
Web - Maintenance
Target version:
5.0.10
Severity:
Major - prevents use of part of Rudder | no simple workaround
UX impact:
User visibility:
Operational - other Techniques | Rudder settings | Plugins
Effort required:
Priority:
52
Name check:
Fix check:
Regression:

Description

Hi,
in the last two weeks I have had two occasions where the disk for my Postgres DB filled up.
I vacuum the Rudder DB every night, but lately that doesn't seem to be enough.
Checking the log of my housekeeping job, I noticed that many tables have
"dead row versions cannot be removed yet" entries.
That seems to point to a problem with uncommitted transactions.
I checked pg_stat_activity and found this entry:

16385 | rudder | 8071 | 16384 | rudder | PostgreSQL JDBC Driver | 127.0.0.1 | | 48393 | 2019-04-29 08:53:51.989498+02 |
2019-04-29 08:53:56.704253+02 | 2019-04-29 08:53:56.704567+02 | 2019-04-29 08:53:56.704589+02 | f | idle in transaction | SELECT 1

The client port 48393 belongs to a connection established by the Jetty process.
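
For reference, a query like the following (just a sketch, using the PostgreSQL 9.2 column names) lists such stuck sessions:

SELECT pid, usename, client_addr, client_port, state, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction';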

My server is running on CentOS 7,
all Rudder components are version 5.0.9,
and Postgres is 9.2.24.

Has anyone encountered this problem before?
I am not too familiar with Postgres, so it could be a misconfiguration.
Any help is appreciated.


Files

vacuum_rudder_full.log (11.6 KB) - Tobias Ell, 2019-04-30 15:29

Related issues 1 (0 open, 1 closed)

Related to Rudder - Bug #15413: Autovacuum may never finish, or take more than one day, on busy Rudder instance, because of resources attrition (Released, Alexis Mousset)
Actions #1

Updated by François ARMAND almost 5 years ago

Hello,

Thanks for reporting that problem. I don't think the problem you are describing is a configuration one, more likely a bug. Unfortunately, our Postgres expert is on holidays this week.
I see here that the problem is with "SELECT 1", which is the default query used by the pool manager to test connection availability. I will start looking into that.

Are there other requests in error? Or perhaps do you have more information on the tables impacted? It would help a lot to understand why some transactions get stalled (and I'm sorry I can't help more for now).

Also: did the problem start concurrently with some special event? (I would understand if you have no idea about that.)

Actions #2

Updated by François ARMAND almost 5 years ago

Thinking a bit more about it, the SELECT 1 may be normal. As it is the query used by the pool to test connection health, it's normal to see some. I'm not sure about the status, though.

So, do you have other requests? Where are the "dead row versions cannot be removed yet" entries reported if you try a VACUUM FULL VERBOSE ANALYZE?
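
For example (a sketch; archivedruddersysevents is just one of the Rudder tables), a per-table run would be:

VACUUM FULL VERBOSE ANALYZE archivedruddersysevents;
-- the VERBOSE output then contains lines like:
-- DETAIL:  N dead row versions cannot be removed yet.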

Actions #3

Updated by Tobias Ell almost 5 years ago

Hi François,

I see the dead rows on many tables - I have attached the vacuum log file.
The problem with the "SELECT 1" connection is the state "idle in transaction".
This seems to block vacuuming on all tables. If the query is only meant to test connection
availability, it should not start a transaction.
As it is, vacuum only works while Jetty is stopped.

Actions #4

Updated by François ARMAND almost 5 years ago

  • Category set to Web - Maintenance
  • Target version set to 5.0.10

OK, it looks like there are a bunch of dead rows in a lot of tables.

You are right that there is no need for a transaction for the test, but I believe that's how it's done by default (and we didn't change it). We could change that, or we could even update to a JDBC4 driver, which has a built-in function for that. But it should not lock all these tables; the SELECT isn't even done on a specific table. I will need to find a way to reproduce the vacuum problem to be sure the change actually solves it.

What happens if you kill, at the OS level, the connection which is in "idle in transaction" and then run the VACUUM again? I.e., in pg_stat_activity, get the "pid" column (8071 in your example), and in a console:
kill 8071
(Actually, you can find the "idle in transaction" process and its pid directly in a console: ps aux | grep "idle in transaction".)
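
Alternatively (a sketch from the PostgreSQL side rather than the OS), the stuck backend can be terminated directly in psql:

-- find the stuck backend, then terminate it from within PostgreSQL
SELECT pid, state, query FROM pg_stat_activity WHERE state = 'idle in transaction';
SELECT pg_terminate_backend(8071);  -- use the pid returned above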

If the VACUUM then works, it would be good evidence that the problem is with the SELECT 1.

Actions #5

Updated by François ARMAND almost 5 years ago

  • Assignee set to François ARMAND
Actions #6

Updated by Tobias Ell almost 5 years ago

Oh, the problem is indeed with the "SELECT 1" - I have tested by ending the query with pg_terminate_backend(8071) and
cross-checked by stopping Jetty (which owns the connection from local port 48393).
As soon as the "SELECT 1" is gone, vacuum works.

Actions #7

Updated by François ARMAND almost 5 years ago

Just to be sure I understand: does VACUUM work before you kill Jetty? Because it could be another request done by Rudder, also cleared when you kill Jetty, that blocks things.

OK, so I'm going to see if we can use something else (or at least not run the test in a transaction). Thanks for both reporting and helping with the diagnosis!

Actions #8

Updated by Tobias Ell almost 5 years ago

VACUUM works if the "SELECT 1" with state "idle in transaction" is gone.
I can do this by
a) ending the Postgres process
or
b) stopping Jetty

With an "idle in transaction" query present, vacuum reports dead row versions and does not free disk space.

Is the "SELECT 1" only done at Jetty start?
I ended the query an hour ago and it hasn't come back yet.

Actions #9

Updated by François ARMAND almost 5 years ago

OK, thank you very much for the clarification, there's no doubt it is the culprit.

"SELECT 1" is used by the connection pool (HikariCP, for the record) to test if a connection is alive (and to spawn the correct number of connections based on the pool size). It should do so at regular intervals.

So maybe there is a set of conditions needed to get a SELECT 1 stuck. It could be linked to something that happens only at startup (for example, something is not fully ready and it breaks).

Now that the culprit is clearly identified, we know what to change. Thank you!

Actions #10

Updated by François ARMAND almost 5 years ago

  • Tracker changed from Question to Bug
  • Severity set to Major - prevents use of part of Rudder | no simple workaround
  • User visibility set to Operational - other Techniques | Rudder settings | Plugins
  • Priority set to 52
Actions #11

Updated by François ARMAND almost 5 years ago

  • Status changed from New to In progress
Actions #12

Updated by François ARMAND almost 5 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Vincent MEMBRÉ
  • Pull Request set to https://github.com/Normation/rudder/pull/2211
Actions #13

Updated by François ARMAND almost 5 years ago

  • Status changed from Pending technical review to Pending release
Actions #14

Updated by François ARMAND almost 5 years ago

We have committed a patch that avoids using SELECT 1, as documented here: https://github.com/brettwooldridge/HikariCP (search for `connectionTestQuery`). You should be able to test with tomorrow's nightly whether it corrects the problem for you.

Actions #15

Updated by Tobias Ell almost 5 years ago

Thank you - I will test the patch next week.

Actions #16

Updated by Nicolas CHARLES almost 5 years ago

"Idle in transaction" means that a transaction began and never ended.
So here a query started, and the connection pool then checked the validity of the connection with SELECT 1 (hiding the previous query).

Removing the SELECT 1 may or may not fix the issue, but at least it will let us see the actual query being performed.
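
A quick sketch of a query that makes this visible (an xact_start much older than the query_start suggests that the displayed SELECT 1 is only the latest statement on a transaction opened earlier):

SELECT pid, xact_start, query_start, state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction';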

Actions #17

Updated by Nicolas CHARLES almost 5 years ago

Tobias, I have one last question for you:
do you happen to have messages like

2019-04-19 00:01:31 CEST [11185]: [1-1] LOG:  sending cancel to blocking autovacuum PID 14822
2019-04-19 00:01:31 CEST [11185]: [2-1] DETAIL:  Process 11185 waits for AccessExclusiveLock on relation 16512 of database 16386.
2019-04-19 00:01:31 CEST [11185]: [3-1] STATEMENT:  vacuum full archivedruddersysevents
2019-04-19 00:01:31 CEST [14822]: [1-1] ERROR:  canceling autovacuum task
2019-04-19 00:01:31 CEST [14822]: [2-1] CONTEXT:  automatic analyze of table "rudder.public.archivedruddersysevents" 
in your PostgreSQL log files? (Most probably in /var/log/postgresql.)

Actions #18

Updated by Tobias Ell almost 5 years ago

I do have
LOG: checkpoints are occurring too frequently (18 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
ERROR: canceling autovacuum task
CONTEXT: automatic vacuum of table "rudder.public.archivedruddersysevents"

although my housekeeping script does not report errors anymore (in production).

I am setting up a test environment to check the nightly.

Actions #19

Updated by Tobias Ell almost 5 years ago

I installed the latest nightly on the test server, and since the restart I haven't found
a "SELECT 1" or "idle in transaction" query. Looks good.

Actions #20

Updated by François ARMAND almost 5 years ago

Hello, would you mind checking whether you still have the problem now? As we don't know exactly what triggered it, we need to check again after some time.
So (a quick sketch of both checks follows the list):

- do you have a query in "idle in transaction" (other than SELECT 1, since we replaced it)?
- does the VACUUM report that it can clean up everything, without dead rows left behind?
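
A minimal sketch of both checks (assuming PostgreSQL 9.2):

-- 1. stuck sessions: this count should now be 0
SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction';

-- 2. the VERBOSE output of the maintenance vacuum should no longer contain
--    "dead row versions cannot be removed yet"
VACUUM FULL VERBOSE ANALYZE;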

Thanks !

Actions #21

Updated by Tobias Ell almost 5 years ago

Hello,

there are no "idle in transaction" queries (and "SELECT 1" hasn't come up again either).
All vacuum operations have finished without warnings about "dead rows".

The fix seems to work - thank you.

Actions #22

Updated by François ARMAND almost 5 years ago

That's great, thank you for letting us know. We are going to cut a release in the next couple of days with that fix. Thank you again!

Actions #23

Updated by Vincent MEMBRÉ almost 5 years ago

  • Subject changed from postgres fills up and can't be vacuumed to Postgres fills up and can't be vacuumed
Actions #24

Updated by Vincent MEMBRÉ almost 5 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 5.0.10, which was released today.

Actions #25

Updated by Nicolas CHARLES over 4 years ago

  • Related to Bug #15413: Autovacuum may never finish, or take more than one day, on busy Rudder instance, because of resources attrition added
