Bug #22879
closed
Too many "Policy Update Started" in event logs
Description
On large instances, we can be overwhelmed by "Policy Update Started" event logs, especially those with alreadyPending="true" in their details.
The alreadyPending=true entries don't bring any meaningful information there, and ought to be a simple debug-level log in the webapp logs.
As a result, it's nearly impossible to search in the event log, and the index is really huge.
Workaround: removing existing useless AutomaticStart event logs
1/ delete the useless event logs:
delete from eventlog where eventtype = 'AutomaticStartDeployement' and data::text like '<entry><addPending %';
2/ reclaim space and clean up the index. Be careful: a VACUUM FULL locks the table, which means that no new event log can be added during the clean-up. You should do this with Rudder stopped if your event log table is big*.
VACUUM FULL eventlog;
* To get metrics on table sizes in Rudder, you can run the following SQL query:
WITH RECURSIVE pg_inherit(inhrelid, inhparent) AS (
    SELECT inhrelid, inhparent
    FROM pg_inherits
    UNION
    SELECT child.inhrelid, parent.inhparent
    FROM pg_inherit child, pg_inherits parent
    WHERE child.inhparent = parent.inhrelid
),
pg_inherit_short AS (
    SELECT * FROM pg_inherit
    WHERE inhparent NOT IN (SELECT inhrelid FROM pg_inherit)
)
SELECT table_schema
     , TABLE_NAME
     , row_estimate
     , pg_size_pretty(total_bytes) AS total
     , pg_size_pretty(index_bytes) AS INDEX
     , pg_size_pretty(toast_bytes) AS toast
     , pg_size_pretty(table_bytes) AS TABLE
FROM (
    SELECT *, total_bytes - index_bytes - COALESCE(toast_bytes, 0) AS table_bytes
    FROM (
        SELECT c.oid
             , nspname AS table_schema
             , relname AS TABLE_NAME
             , SUM(c.reltuples) OVER (PARTITION BY parent) AS row_estimate
             , SUM(pg_total_relation_size(c.oid)) OVER (PARTITION BY parent) AS total_bytes
             , SUM(pg_indexes_size(c.oid)) OVER (PARTITION BY parent) AS index_bytes
             , SUM(pg_total_relation_size(reltoastrelid)) OVER (PARTITION BY parent) AS toast_bytes
             , parent
        FROM (
            SELECT pg_class.oid
                 , reltuples
                 , relname
                 , relnamespace
                 , pg_class.reltoastrelid
                 , COALESCE(inhparent, pg_class.oid) parent
            FROM pg_class
            LEFT JOIN pg_inherit_short ON inhrelid = oid
            WHERE relkind IN ('r', 'p')
        ) c
        LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
    ) a
    WHERE oid = parent
) a
ORDER BY total_bytes DESC;
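Before running the delete, it can help to check how many rows it would remove and how large the table currently is. The following is a minimal sketch, assuming the same eventlog table and the same filter as in step 1; the column aliases (rows_to_delete, total_size, index_size) are only illustrative names.

```sql
-- Preview: count the rows that the DELETE in step 1 would remove (same WHERE clause).
SELECT count(*) AS rows_to_delete
FROM eventlog
WHERE eventtype = 'AutomaticStartDeployement'
  AND data::text LIKE '<entry><addPending %';

-- Size of the eventlog table (including indexes and TOAST) and of its indexes alone.
-- Run once before the DELETE and again after the VACUUM FULL to compare.
SELECT pg_size_pretty(pg_total_relation_size('eventlog')) AS total_size
     , pg_size_pretty(pg_indexes_size('eventlog'))        AS index_size;
```

Neither query modifies data, so both should be safe to run while Rudder is up.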
Files
Updated by François ARMAND over 1 year ago
It seems that yes, the "not started because one is already running/pending" event log is useless: it adds nothing to the traceability of changes in production.
The place to log application behavior is the webapp logs.
The level should be debug, because it is only useful when we try to understand whether something is broken in the queuing logic, not during normal operation. It's the same level of information as "I queued the inventory processing and others are already in the queue": only ops and devs care.
Updated by François ARMAND over 1 year ago
- Status changed from New to In progress
- Assignee set to François ARMAND
Updated by François ARMAND over 1 year ago
More precisely, we have 4 kinds of events related to policy generation:
- ManualStartDeployment (real start, queued with 1 running, not queued because one is already queued)
- AutomaticStartDeployment (real start, queued with 1 running, not queued because one is already queued)
- SuccessfulDeployment
- FailedDeployment
I think we need to keep:
- Automatic deployment only when it's a real start of policy generation, so that we have a trace of every actual policy generation start point, and none of the other cases
- Manual deployment in all cases, because it's a rare enough event, and in the case where it's dropped because there are already plenty of generations running and queued, the user will likely want to know
- both successful and failed events, because we have to trace when production is actually changed.
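To see which event types dominate an existing installation, a per-type count can be used. This is only a sketch: it assumes nothing beyond the eventlog table and eventtype column used in the other queries of this ticket, and nb_events is an illustrative alias.

```sql
-- Number of stored event logs per event type, largest first.
SELECT eventtype, count(*) AS nb_events
FROM eventlog
GROUP BY eventtype
ORDER BY nb_events DESC;
```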
Updated by François ARMAND over 1 year ago
The entries that we no longer want can be manually deleted with:
delete from eventlog where eventtype = 'AutomaticStartDeployement' and data::text like '<entry><addPending %';
(A reindex is then needed so that postgres knows about it: nope, vacuum does it.)
A vacuum is then needed to reclaim space:
VACUUM FULL eventlog;
Updated by François ARMAND over 1 year ago
- Description updated (diff)
OK, so the risk of automatically cleaning up (an uncontrolled VACUUM FULL) in the general case seems to outweigh the benefits in the main use case.
So:
- this ticket will just remove the creation of AutomaticStart event logs when they are useless, and will NOT clean up the existing eventlog table,
- for new users, the goal of limiting useless event logs is reached,
- for existing users who didn't see a problem, they will just get a cleaner eventlog page,
- for existing users who do see the problem, they can apply the procedure from the workaround in the description to correct the problem and reclaim the space used by the table and its indexes.
Updated by François ARMAND over 1 year ago
- Status changed from In progress to Pending technical review
- Assignee changed from François ARMAND to Nicolas CHARLES
- Pull Request set to https://github.com/Normation/rudder/pull/4832
Updated by Anonymous over 1 year ago
- Status changed from Pending technical review to Pending release
Applied in changeset rudder|27b985aa54ddd7a56873af1472562ef10cc204df.
Updated by François ARMAND over 1 year ago
- File clipboard-202306291457-mn14d.png added
- Fix check changed from To do to Checked
LGTM
Updated by Vincent MEMBRÉ over 1 year ago
- Status changed from Pending release to Released
This bug has been fixed in Rudder 7.2.8 and 7.3.3 which were released today.
Updated by François ARMAND over 1 year ago
- Related to Bug #23074: Generation not queued when one already started added