Architecture #11147
openPartition archive tables in postgres, as well as reports table
Description
Archived table have a very high wolumetry, and very few read/write.
However we do once per day a bulk INSERT and a bulk DELETE
This works as expected, but indexes in these tables are larger than necessary, and DELETE/VACUUMing are fairly slow
Partitionning these tables ( cf https://www.postgresql.org/docs/9.2/static/ddl-partitioning.html ) would allow to have one table per day/week/period (select the right value), that could be dropped when archived reports are deleted, reclaiming instantaneously space on disk - or if it's not per day, at least we could vacuum on a much smaller set of data, and reclaiming space faster
Extra advantage: partition can be stored on other device, so a user could choose to keep years of reports
However, implementation can be tricky - by needing to create every day a new partition - however partition manager make it easy (see https://github.com/hipay/postgresql-partition-manager, https://github.com/dimitri/PartMgr , etc)
We could also create two partitions on ruddersysevents, one for all the repaired reports (that we query for list of changes), and one for all the control reports (start and end runs) to improve perfs on reportings
Final though:
The real change would probably be to have only ruddersysevents, with partitions, and skip completely the archived tables, and a clever partition scheme (to be defined), but out of my head, i would keep last 2 days in one partition, then one partition per day (drawback: don't know if it's easy to move data from one partition to another)
Why two days ? so that we don't have reportings issues at midnight
Updated by Nicolas CHARLES over 4 years ago
some other resources
https://read.acloud.guru/how-to-partition-dynamically-in-postgresql-ce3acbaef66c
https://github.com/derkan/pg_party
We will still be stuck on old version of postgresql on some system so we can't be using https://www.postgresql.org/docs/11/runtime-config-query.html#GUC-ENABLE-PARTITION-PRUNING nor all the goodness in https://www.postgresql.org/docs/11/ddl-partitioning.html
Updated by Nicolas CHARLES over 4 years ago
First try with dynamic on a postgresql 9.2
CREATE OR REPLACE FUNCTION reports_insert_function() RETURNS TRIGGER AS $$ DECLARE partition_date TEXT; partition_name TEXT; start_of_day TEXT; start_of_next_day TEXT; BEGIN partition_date := to_char(NEW.executiontimestamp,'YYYYMMDD'); partition_name := 'ruddersysevents_' || partition_date; start_of_day := to_char((NEW.executiontimestamp),'YYYY-MM-DD'); start_of_next_day := to_char((NEW.executiontimestamp + interval '1 day'),'YYYY-MM-DD'); IF NOT EXISTS (SELECT 1 FROM information_schema.tables WHERE table_name = partition_name) THEN -- RAISE NOTICE 'A partition has been created %', partition_name; EXECUTE format(E'CREATE TABLE %I (CHECK ( executiontimestamp >= ''%s'' AND executiontimestamp < ''%s'')) INHERITS (public.ruddersysevents)', partition_name, start_of_day,start_of_next_day); -- EXECUTE format('GRANT SELECT ON TABLE %I TO readonly', partition_name); -- use this if you use role based permission END IF; EXECUTE format('INSERT INTO %I (id, executionDate, nodeid, directiveId, ruleId, serial, component, keyValue, executionTimeStamp, eventType, policy, msg) VALUES($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12)', partition_name) using NEW.id, NEW.executionDate, NEW.nodeid, NEW.directiveId, NEW.ruleId, NEW.serial, NEW.component, NEW.keyValue, NEW.executionTimeStamp, NEW.eventType, NEW.policy, NEW.msg; RETURN NULL; END $$ LANGUAGE plpgsql; CREATE TRIGGER insert_reports_trigger BEFORE INSERT ON public.ruddersysevents FOR EACH ROW EXECUTE PROCEDURE public.reports_insert_function();
as write is threaded, i have error in logs
Jul 07 07:44:39 server rudder-relayd[20734]: ERROR relayd::processing::reporting: output error: database error: relation "ruddersyseven Jul 07 07:44:39 server rudder-relayd[20734]: INFO relayd::processing::reporting: transient error, skipping Jul 07 07:44:39 server rudder-relayd[20734]: ERROR relayd::processing::reporting: output error: database error: relation "ruddersyseven Jul 07 07:44:39 server rudder-relayd[20734]: INFO relayd::processing::reporting: transient error, skipping
i'd feel much more happy have a cron that would create the tables in advance, and have the trigger check for the table existance, and if it doesn't, do nothing (return the row itself rather than null, so that it fallback to the standard behaviour)
Updated by Nicolas CHARLES over 4 years ago
inherits does not include index
like can https://www.postgresql.org/docs/9.2/sql-createtable.html
but only inherits allows to have inheritant (like detach it)
so we need to create the index as well on table
Updated by Nicolas CHARLES over 4 years ago
table usage size after 1h of test
oid | table_schema | table_name | row_estimate | total_bytes | index_bytes | toast_bytes | table_bytes | total | index | toast | table
----------+--------------------+----------------------------+--------------+--------------+--------------+-------------+--------------+------------+------------+------------+---------
16388 | public | ruddersysevents | 3.71755e+08 | 400169771008 | 171776524288 | 8192 | 228393238528 | 373 GB | 160 GB | 8192 bytes | 213 GB
16665 | public | nodes | 8.2769e+07 | 36455940096 | 26726203392 | 8192 | 9729728512 | 34 GB | 25 GB | 8192 bytes | 9279 MB
16440 | public | nodeconfigurations | 644204 | 11121590272 | 474628096 | 9655656448 | 991305728 | 10 GB | 453 MB | 9208 MB | 945 MB
222217 | public | changesysevents | 1.29979e+07 | 4904722432 | 583958528 | 8192 | 4320755712 | 4678 MB | 557 MB | 8192 bytes | 4121 MB
16620 | public | directives | 5.94859e+06 | 3925737472 | 1911627776 | 8192 | 2014101504 | 3744 MB | 1823 MB | 8192 bytes | 1921 MB
16595 | public | groups | 5.77633e+06 | 2540552192 | 1851146240 | 8192 | 689397760 | 2423 MB | 1765 MB | 8192 bytes | 657 MB
36809013 | public | ruddersysevents_20200707 | 4.04202e+06 | 2513993728 | 890609664 | 8192 | 1623375872 | 2398 MB | 849 MB | 8192 bytes | 1548 MB
16420 | public | reportsexecution | 1.08461e+06 | 946028544 | 790405120 | 8192 | 155615232 | 902 MB | 754 MB | 8192 bytes | 148 MB
16522 | public | eventlog | 126962 | 640442368 | 16293888 | 583458816 | 40689664 | 611 MB | 16 MB | 556 MB | 39 MB
222258 | public | jsonconfigurations | 9098 | 100868096 | 2416640 | 88391680 | 10059776 | 96 MB | 2360 kB | 84 MB | 9824 kB
16431 | public | nodes_info | 9190 | 59572224 | 802816 | 30826496 | 27942912 | 57 MB | 784 kB | 29 MB | 27 MB
16634 | public | rules | 18776 | 14524416 | 6348800 | 8192 | 8167424 | 14 MB | 6200 kB | 8192 bytes | 7976 kB
16453 | public | archivednodeconfigurations | 0 | 4603904 | 2433024 | 2154496 | 16384 | 4496 kB | 2376 kB | 2104 kB | 16 kB
16491 | public | nodecompliancelevels | 236 | 3563520 | 3465216 | 8192 | 90112 | 3480 kB | 3384 kB | 8192 bytes | 88 kB
2608 | pg_catalog | pg_depend | 4385 | 1859584 | 1220608 | | 638976 | 1816 kB | 1192 kB | | 624 kB
16554 | public | gitcommit | 8850 | 1826816 | 794624 | 8192 | 1024000 | 1784 kB | 776 kB | 8192 bytes | 1000 kB
1249 | pg_catalog | pg_attribute | 3161 | 1589248 | 786432 | | 802816 | 1552 kB | 768 kB | | 784 kB
16645 | public | rulesgroupjoin | 17787 | 1384448 | 450560 | 8192 | 925696 | 1352 kB | 440 kB | 8192 bytes | 904 kB
2619 | pg_catalog | pg_statistic | 514 | 942080 | 40960 | 507904 | 393216 | 920 kB | 40 kB | 496 kB | 384 kB
2606 | pg_catalog | pg_constraint | 72 | 901120 | 327680 | 8192 | 565248 | 880 kB | 320 kB | 8192 bytes | 552 kB
1255 | pg_catalog | pg_proc | 2490 | 819200 | 270336 | 8192 | 540672 | 800 kB | 264 kB | 8192 bytes | 528 kB
2618 | pg_catalog | pg_rewrite | 106 | 540672 | 32768 | 376832 | 131072 | 528 kB | 32 kB | 368 kB | 128 kB
2609 | pg_catalog | pg_description | 3452 | 417792 | 131072 | 8192 | 278528 | 408 kB | 128 kB | 8192 bytes | 272 kB
3456 | pg_catalog | pg_collation | 925 | 368640 | 98304 | | 270336 | 360 kB | 96 kB | | 264 kB
1259 | pg_catalog | pg_class | 438 | 319488 | 172032 | | 147456 | 312 kB | 168 kB | | 144 kB
1247 | pg_catalog | pg_type | 423 | 303104 | 106496 | | 196608 | 296 kB | 104 kB | | 192 kB
2617 | pg_catalog | pg_operator | 728 | 221184 | 73728 | | 147456 | 216 kB | 72 kB | | 144 kB
16652 | public | rulesdirectivesjoin | 850 | 180224 | 73728 | 8192 | 98304 | 176 kB | 72 kB | 8192 bytes | 96 kB
2602 | pg_catalog | pg_amop | 393 | 163840 | 98304 | | 65536 | 160 kB | 96 kB | | 64 kB
2610 | pg_catalog | pg_index | 195 | 139264 | 32768 | | 106496 | 136 kB | 32 kB | | 104 kB
(partition don't show up)
Updated by Nicolas CHARLES over 4 years ago
used
CREATE TABLE ruddersysevents_20200707 ( CHECK ( executiontimestamp >= DATE '2020-07-07' AND executiontimestamp < DATE '2020-07-08' ) ) INHERITS (ruddersysevents); CREATE INDEX executionTimeStamp_idx_20200707 ON ruddersysevents_20200707 (executionTimeStamp); CREATE INDEX composite_node_execution_idx_20200707 ON ruddersysevents_20200707 (nodeId, executionTimeStamp); CREATE INDEX keyValue_idx_20200707 ON ruddersysevents_20200707 (keyValue); CREATE INDEX ruleId_idx_20200707 ON ruddersysevents_20200707 (ruleId); CREATE INDEX endRun_control_idx_20200707 ON ruddersysevents_20200707 (id) WHERE eventType = 'control' and component = 'end'; CREATE INDEX changes_executionTimeStamp_idx_20200707 ON ruddersysevents_20200707 (executionTimeStamp) WHERE eventType = 'result_repaired'; CREATE TABLE ruddersysevents_20200708 ( CHECK ( executiontimestamp >= DATE '2020-07-08' AND executiontimestamp < DATE '2020-07-09' ) ) INHERITS (ruddersysevents); CREATE INDEX executionTimeStamp_idx_20200708 ON ruddersysevents_20200708 (executionTimeStamp); CREATE INDEX composite_node_execution_idx_20200708 ON ruddersysevents_20200708 (nodeId, executionTimeStamp); CREATE INDEX keyValue_idx_20200708 ON ruddersysevents_20200708 (keyValue); CREATE INDEX ruleId_idx_20200708 ON ruddersysevents_20200708 (ruleId); CREATE INDEX endRun_control_idx_20200708 ON ruddersysevents_20200708 (id) WHERE eventType = 'control' and component = 'end'; CREATE INDEX changes_executionTimeStamp_idx_20200708 ON ruddersysevents_20200708 (executionTimeStamp) WHERE eventType = 'result_repaired';
retrospectively, that may not be so clever, as parent table has some report from the same day
Updated by Nicolas CHARLES over 4 years ago
ok, so first analysis show that on postgres 9.2, the planner doesn't take into account the information of partitionning to effectivey search
explain analyze select max(id) as id from RudderSysEvents where executionTimeStamp < '2020-07-07 12:58:48.042+00' and id > 27903878469; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1558204.06..1558204.07 rows=1 width=8) (actual time=14838.828..14838.828 rows=1 loops=1) -> Append (cost=0.00..1558197.38 rows=2672 width=8) (actual time=14838.825..14838.825 rows=0 loops=1) -> Index Scan using ruddersysevents_pkey on ruddersysevents (cost=0.00..22.70 rows=1 width=8) (actual time=0.007..0.007 rows=0 loops=1) Index Cond: (id > 27903878469::bigint) Filter: (executiontimestamp < '2020-07-07 12:58:48.042+00'::timestamp with time zone) -> Seq Scan on ruddersysevents_20200707 ruddersysevents (cost=0.00..1558134.73 rows=2584 width=8) (actual time=14838.781..14838.781 rows=0 loops=1) Filter: ((executiontimestamp < '2020-07-07 12:58:48.042+00'::timestamp with time zone) AND (id > 27903878469::bigint)) Rows Removed by Filter: 25898202 -> Bitmap Heap Scan on ruddersysevents_20200708 ruddersysevents (cost=2.01..13.31 rows=29 width=8) (actual time=0.010..0.010 rows=0 loops=1) Recheck Cond: (executiontimestamp < '2020-07-07 12:58:48.042+00'::timestamp with time zone) Filter: (id > 27903878469::bigint) -> Bitmap Index Scan on executiontimestamp_idx_20200708 (cost=0.00..2.00 rows=87 width=0) (actual time=0.007..0.007 rows=0 loops=1) Index Cond: (executiontimestamp < '2020-07-07 12:58:48.042+00'::timestamp with time zone) -> Bitmap Heap Scan on ruddersysevents_20200709 ruddersysevents (cost=2.01..13.31 rows=29 width=8) (actual time=0.020..0.020 rows=0 loops=1) Recheck Cond: (executiontimestamp < '2020-07-07 12:58:48.042+00'::timestamp with time zone) Filter: (id > 27903878469::bigint) -> Bitmap Index Scan on executiontimestamp_idx_20200709 (cost=0.00..2.00 rows=87 width=0) (actual time=0.019..0.019 rows=0 loops=1) Index Cond: (executiontimestamp < '2020-07-07 12:58:48.042+00'::timestamp with time zone) -> Bitmap Heap Scan on ruddersysevents_20200710 ruddersysevents (cost=2.01..13.31 rows=29 width=8) (actual time=0.003..0.003 rows=0 loops=1) Recheck Cond: (executiontimestamp < '2020-07-07 12:58:48.042+00'::timestamp with time zone) Filter: (id > 27903878469::bigint) -> Bitmap Index Scan on executiontimestamp_idx_20200710 (cost=0.00..2.00 rows=87 width=0) (actual time=0.003..0.003 rows=0 loops=1) Index Cond: (executiontimestamp < '2020-07-07 12:58:48.042+00'::timestamp with time zone) Total runtime: 14839.029 ms (24 rows)
and index on partitionned table doesn't work as expected
rudder=> explain analyze select max(id) as id from RudderSysEvents_20200707 where executionTimeStamp < '2020-07-07 12:58:48.042+00' and id > 27903878469; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1598880.98..1598880.99 rows=1 width=8) (actual time=17963.577..17963.577 rows=1 loops=1) -> Seq Scan on ruddersysevents_20200707 (cost=0.00..1598874.36 rows=2644 width=8) (actual time=17963.573..17963.573 rows=0 loops=1) Filter: ((executiontimestamp < '2020-07-07 12:58:48.042+00'::timestamp with time zone) AND (id > 27903878469::bigint)) Rows Removed by Filter: 26575993 Total runtime: 17963.617 ms (5 rows)
Updated by Nicolas CHARLES over 4 years ago
ok, restricting the query to a smaller time (between the executiontimestamp of the id considered and now) makes it much faster
explain analyze select max(id) as id from RudderSysEvents where executionTimeStamp >= '2020-07-07 12:55:52+00' and executionTimeStamp < '2020-07-07 12:58:48.042+00' and id > 27903878469; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=8214.69..8214.70 rows=1 width=8) (actual time=12.081..12.081 rows=1 loops=1) -> Append (cost=0.00..8193.43 rows=8502 width=8) (actual time=12.077..12.077 rows=0 loops=1) -> Index Scan using ruddersysevents_pkey on ruddersysevents (cost=0.00..22.70 rows=1 width=8) (actual time=0.007..0.007 rows=0 loops=1) Index Cond: (id > 27903878469::bigint) Filter: ((executiontimestamp >= '2020-07-07 12:55:52+00'::timestamp with time zone) AND (executiontimestamp < '2020-07-07 12:58:48.042+00'::timestamp with time zone)) -> Index Scan using executiontimestamp_idx_20200707 on ruddersysevents_20200707 ruddersysevents (cost=0.00..8163.31 rows=8498 width=8) (actual time=12.030..12.030 rows=0 loops=1) Index Cond: ((executiontimestamp >= '2020-07-07 12:55:52+00'::timestamp with time zone) AND (executiontimestamp < '2020-07-07 12:58:48.042+00'::timestamp with time zone)) Filter: (id > 27903878469::bigint) Rows Removed by Filter: 2566 -> Index Scan using executiontimestamp_idx_20200708 on ruddersysevents_20200708 ruddersysevents (cost=0.00..2.47 rows=1 width=8) (actual time=0.028..0.028 rows=0 loops=1) Index Cond: ((executiontimestamp >= '2020-07-07 12:55:52+00'::timestamp with time zone) AND (executiontimestamp < '2020-07-07 12:58:48.042+00'::timestamp with time zone)) Filter: (id > 27903878469::bigint) -> Index Scan using executiontimestamp_idx_20200709 on ruddersysevents_20200709 ruddersysevents (cost=0.00..2.47 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=1) Index Cond: ((executiontimestamp >= '2020-07-07 12:55:52+00'::timestamp with time zone) AND (executiontimestamp < '2020-07-07 12:58:48.042+00'::timestamp with time zone)) Filter: (id > 27903878469::bigint) -> Index Scan using executiontimestamp_idx_20200710 on ruddersysevents_20200710 ruddersysevents (cost=0.00..2.47 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=1) Index Cond: ((executiontimestamp >= '2020-07-07 12:55:52+00'::timestamp with time zone) AND (executiontimestamp < '2020-07-07 12:58:48.042+00'::timestamp with time zone)) Filter: (id > 27903878469::bigint) Total runtime: 12.173 ms (19 rows)
the question is: is it legit to limit like that, knowing that agent run can be out of order?
Updated by Nicolas CHARLES over 4 years ago
we could have some heuristics:
- take the lowest time from the previous batch considered as the min time where to look from (so we need to store that)
- if the lowest is in the future from current time, take current time -x seconds (minutes?)
- and even always ensure that there is a minimum from now to the min
Updated by Nicolas CHARLES over 4 years ago
partitioning by date doesn't seem to be so great, at least on pg9.2
see the query for fetching agent runs
explain analyze select distinct T.nodeid, T.executiontimestamp, coalesce(C.keyvalue, '') as nodeconfigid, coalesce(C.iscomplete, false) as complete, T.insertionid from (select nodeid, executiontimestamp, min(id) as insertionid from ruddersysevents where id > 27935500596 and id <= 27935626386 group by nodeid, executiontimestamp) as T left join (select true as iscomplete, nodeid, executiontimestamp, keyvalue from ruddersysevents where id >27935500596 and id <= 27935626386 and eventtype = 'control' and component = 'end' ) as C on T.nodeid = C.nodeid and T.executiontimestamp = C.executiontimestamp HashAggregate (cost=3492751.00..3492751.05 rows=5 width=75) (actual time=59539.567..59539.979 rows=356 loops=1) -> Hash Right Join (cost=3492738.45..3492750.93 rows=5 width=75) (actual time=59532.804..59539.127 rows=357 loops=1) Hash Cond: ((public.ruddersysevents.nodeid = public.ruddersysevents.nodeid) AND (public.ruddersysevents.executiontimestamp = public.ruddersysevents .executiontimestamp)) -> Append (cost=0.00..12.44 rows=5 width=67) (actual time=17.510..23.017 rows=357 loops=1) -> Index Scan using endrun_control_idx on ruddersysevents (cost=0.00..2.54 rows=1 width=58) (actual time=8.301..8.301 rows=0 loops=1) Index Cond: ((id > 27935500596::bigint) AND (id <= 27935626386::bigint)) -> Index Scan using endrun_control_idx_20200707 on ruddersysevents_20200707 ruddersysevents (cost=0.00..2.48 rows=1 width=57) (actual time= 9.206..14.601 rows=357 loops=1) Index Cond: ((id > 27935500596::bigint) AND (id <= 27935626386::bigint)) -> Index Scan using endrun_control_idx_20200708 on ruddersysevents_20200708 ruddersysevents (cost=0.00..2.48 rows=1 width=73) (actual time= 0.025..0.025 rows=0 loops=1) Index Cond: ((id > 27935500596::bigint) AND (id <= 27935626386::bigint)) -> Index Scan using endrun_control_idx_20200709 on ruddersysevents_20200709 ruddersysevents (cost=0.00..2.48 rows=1 width=73) (actual time= 0.008..0.008 rows=0 loops=1) Index Cond: ((id > 27935500596::bigint) AND (id <= 27935626386::bigint)) -> Index Scan using endrun_control_idx_20200710 on ruddersysevents_20200710 ruddersysevents (cost=0.00..2.48 rows=1 width=73) (actual time= 0.008..0.008 rows=0 loops=1) Index Cond: ((id > 27935500596::bigint) AND (id <= 27935626386::bigint)) -> Hash (cost=3492738.37..3492738.37 rows=5 width=50) (actual time=59515.225..59515.225 rows=356 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 31kB -> HashAggregate (cost=3492738.27..3492738.32 rows=5 width=50) (actual time=59514.973..59515.060 rows=356 loops=1) -> Append (cost=0.00..3492738.23 rows=5 width=50) (actual time=1027.755..59425.110 rows=125790 loops=1) -> Index Scan using ruddersysevents_pkey on ruddersysevents (cost=0.00..23.45 rows=1 width=53) (actual time=0.006..0.006 rows=0 loops=1) Index Cond: ((id > 27935500596::bigint) AND (id <= 27935626386::bigint)) -> Seq Scan on ruddersysevents_20200707 ruddersysevents (cost=0.00..3492673.08 rows=1 width=53) (actual time=1027.746..59405.20 1 rows=125790 loops=1) Filter: ((id > 27935500596::bigint) AND (id <= 27935626386::bigint)) Rows Removed by Filter: 57930398 -> Seq Scan on ruddersysevents_20200708 ruddersysevents (cost=0.00..13.90 rows=1 width=48) (actual time=0.001..0.001 rows=0 loo ps=1) Filter: ((id > 27935500596::bigint) AND (id <= 27935626386::bigint)) -> Seq Scan on ruddersysevents_20200709 ruddersysevents (cost=0.00..13.90 rows=1 width=48) (actual time=0.001..0.001 rows=0 loo ps=1) Filter: ((id > 27935500596::bigint) AND (id <= 27935626386::bigint)) -> Seq Scan on ruddersysevents_20200710 ruddersysevents (cost=0.00..13.90 rows=1 width=48) (actual time=0.001..0.001 rows=0 loo ps=1) Filter: ((id > 27935500596::bigint) AND (id <= 27935626386::bigint)) Total runtime: 59541.049 ms (30 rows)
Updated by Nicolas CHARLES over 4 years ago
with partitionned table, the cleaning of report didn't work correctly (well, it did run, but did not reclaim space)
So it cannot be simply plug'n'play, we'd need also to truncate the tables
it seems that the trigger also have a massive perf penalty (on 9.2)