Bug #22879

closed

Too many "Policy Update Started" in event logs

Added by Nicolas CHARLES 11 months ago. Updated 10 months ago.

Status: Released
Priority: N/A
Category: Web - Maintenance
Target version:
Severity: Minor - inconvenience | misleading | easy workaround
UX impact: I dislike using that feature
User visibility: Operational - other Techniques | Rudder settings | Plugins
Effort required:
Priority: 89
Name check: To do
Fix check: Checked
Regression: No

Description

On large instances, we can be overwhelmed by the "Policy Update Started" event logs, especially those with alreadyPending="true" details.
The alreadyPending="true" entries don't bring meaningful information there, and ought to be a simple debug-level log in the webapp logs.

As a result, it's nearly impossible to search in the event logs, and the index is really huge.

Workaround: remove the existing useless AutomaticStart event logs

1/ Delete the useless event logs:

delete from eventlog where eventtype = 'AutomaticStartDeployement' and data::text like '<entry><addPending %';
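
Before running the delete above, it may be worth checking how many rows it would remove. The query below is a minimal sketch that reuses exactly the same filter as a read-only count; the column alias rows_to_delete is only illustrative.

```sql
-- Preview only: counts the rows the workaround delete would remove.
-- Same filter as the delete statement; nothing is modified.
SELECT count(*) AS rows_to_delete
  FROM eventlog
 WHERE eventtype = 'AutomaticStartDeployement'
   AND data::text LIKE '<entry><addPending %';
```

A large count suggests a long delete and a long vacuum afterwards, which helps decide whether to stop Rudder first.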

2/ Reclaim space and clean up the index. Be careful: a full vacuum locks the table, which means that during the clean-up no new event log can be added. You should do that with Rudder stopped if your event log table is big*.

vacuum full eventlog;
  • To get metrics regarding table sizes in Rudder, you can execute this SQL query:
WITH RECURSIVE pg_inherit(inhrelid, inhparent) AS
    (select inhrelid, inhparent
    FROM pg_inherits
    UNION
    SELECT child.inhrelid, parent.inhparent
    FROM pg_inherit child, pg_inherits parent
    WHERE child.inhparent = parent.inhrelid),
pg_inherit_short AS (SELECT * FROM pg_inherit WHERE inhparent NOT IN (SELECT inhrelid FROM pg_inherit))
SELECT table_schema
    , TABLE_NAME
    , row_estimate
    , pg_size_pretty(total_bytes) AS total
    , pg_size_pretty(index_bytes) AS INDEX
    , pg_size_pretty(toast_bytes) AS toast
    , pg_size_pretty(table_bytes) AS TABLE
  FROM (
    SELECT *, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes
    FROM (
         SELECT c.oid
              , nspname AS table_schema
              , relname AS TABLE_NAME
              , SUM(c.reltuples) OVER (partition BY parent) AS row_estimate
              , SUM(pg_total_relation_size(c.oid)) OVER (partition BY parent) AS total_bytes
              , SUM(pg_indexes_size(c.oid)) OVER (partition BY parent) AS index_bytes
              , SUM(pg_total_relation_size(reltoastrelid)) OVER (partition BY parent) AS toast_bytes
              , parent
          FROM (
                SELECT pg_class.oid
                    , reltuples
                    , relname
                    , relnamespace
                    , pg_class.reltoastrelid
                    , COALESCE(inhparent, pg_class.oid) parent
                FROM pg_class
                    LEFT JOIN pg_inherit_short ON inhrelid = oid
                WHERE relkind IN ('r', 'p')
             ) c
             LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
  ) a
  WHERE oid = parent
) a
ORDER BY total_bytes DESC;
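
If only the eventlog table is of interest, a shorter check is possible. This sketch relies on the standard PostgreSQL size functions and assumes the eventlog table is reachable through the current search_path; the column aliases are illustrative.

```sql
-- Size of the eventlog table alone, before or after the vacuum.
SELECT pg_size_pretty(pg_total_relation_size('eventlog')) AS total      -- table + indexes + TOAST
     , pg_size_pretty(pg_relation_size('eventlog'))       AS heap_only  -- main table data only
     , pg_size_pretty(pg_indexes_size('eventlog'))        AS indexes;   -- all indexes on the table
```

Running it once before the delete and once after the full vacuum gives a rough idea of the space that was reclaimed.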

Files

clipboard-202306291457-mn14d.png (174 KB) François ARMAND, 2023-06-29 14:57

Related issues 1 (0 open, 1 closed)

Related to Rudder - Bug #23074: Generation not queued when one already started (Released, Vincent MEMBRÉ)
Actions #1

Updated by François ARMAND 11 months ago

It seems that yes, the event log "not started because one already running/pending" is useless: it adds nothing to the traceability of changes in prod.
The place to log app behavior is the webapp logs.

The level seems to be debug, because it is only useful when we try to understand if something is broken in the queuing logic, not really in a normal run - it's the same level of information as "I queued the inventory processing and others are already in the queue" - only ops and devs care.

Actions #2

Updated by François ARMAND 11 months ago

  • Status changed from New to In progress
  • Assignee set to François ARMAND
Actions #3

Updated by François ARMAND 11 months ago

More precisely, we have 4 kinds of events related to policy generation:

- ManualStartDeployment (real start, queued with one running, not queued because one is already queued)
- AutomaticStartDeployment (real start, queued with one running, not queued because one is already queued)
- SuccessfulDeployment
- FailedDeployment

I think we need to keep:
- Automatic deployment events only when it's a real start of policy generation, so that we have a trace of every actual policy generation start point, and none of the others
- Manual deployment events in all cases, because it's a rare enough event, and in the case where it's dropped because there are already plenty of generations running and queued, the user will likely want to know
- both successful and failed events, because we have to trace when the prod is actually changed.
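
To see how the kinds of events listed above are distributed on a given instance, a count per event type can help. This is only a sketch: it assumes that every policy generation event type stored in the eventtype column contains the substring "Deploy", which holds for the names used in this ticket (including the 'AutomaticStartDeployement' spelling used in the delete statement) but is not otherwise confirmed here. The alias nb is illustrative.

```sql
-- Number of event logs per policy-generation event type, largest first.
-- The LIKE pattern is an assumption; adjust it if the stored names differ.
SELECT eventtype, count(*) AS nb
  FROM eventlog
 WHERE eventtype LIKE '%Deploy%'
 GROUP BY eventtype
 ORDER BY nb DESC;
```

On an affected instance, the automatic start type would be expected to dominate the result by a wide margin.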

Actions #4

Updated by François ARMAND 11 months ago

The entries that we don't want anymore can be manually deleted with:

delete from eventlog where eventtype = 'AutomaticStartDeployement' and data::text like '<entry><addPending %';

(No separate reindex is needed to make PostgreSQL aware of the deletion: the vacuum does it.)

A vacuum is then needed to reclaim space:

vacuum full eventlog;

Actions #5

Updated by François ARMAND 11 months ago

  • Description updated (diff)

OK, so the risk of automatically cleaning up (an uncontrolled vacuum full) in the general case seems to outweigh the benefits in the main use case.
So:

- this ticket will just remove the creation of AutomaticStart event logs when they are useless, and will NOT clean up the existing eventlog table,
- for new users, the goal of limiting useless event logs is reached,
- for existing users who didn't see a problem, they will just get a cleaner event log page,
- for existing users who do see the problem, they can apply the procedure given in the workaround in the description to correct the problem and reclaim space and indexes.

Actions #6

Updated by François ARMAND 11 months ago

  • Description updated (diff)
Actions #7

Updated by François ARMAND 11 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder/pull/4832
Actions #8

Updated by Anonymous 11 months ago

  • Status changed from Pending technical review to Pending release
Actions #9

Updated by François ARMAND 10 months ago

LGTM

Actions #10

Updated by Vincent MEMBRÉ 10 months ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 7.2.8 and 7.3.3 which were released today.

Actions #11

Updated by François ARMAND 10 months ago

  • Related to Bug #23074: Generation not queued when one already started added