Bug #15062

closed

Allow only catching up with recent runs in agent report processing batch

Added by Nicolas CHARLES almost 5 years ago. Updated almost 5 years ago.

Status: Released
Priority: N/A
Category: Performance and scalability
Target version:
Severity:
UX impact:
User visibility:
Effort required:
Priority: 0
Name check: Reviewed
Fix check: Checked
Regression:

Description

At web interface startup, we catch up with all reports sent while the web interface was stopped, with a granularity of one day.
On very heavily loaded systems, this is simply infeasible: there is not enough RAM to fetch everything, and PostgreSQL chokes on the query (in the log below it ran for 997 seconds, about 16.6 minutes):

2019-06-11 20:21:29 UTC LOG:  duration: 997284.994 ms  execute <unnamed>: select distinct
          T.nodeid, T.executiontimestamp, coalesce(C.keyvalue, '') as nodeconfigid, coalesce(C.iscomplete, false) as complete, T.insertionid
        from
          (select nodeid, executiontimestamp, min(id) as insertionid from ruddersysevents where id > $1 and id <= $2 group by nodeid, executiontimestamp) as T
        left join
          (select
            true as iscomplete, nodeid, executiontimestamp, keyvalue
          from
            ruddersysevents where id > $3 and id <= $4 and
            eventtype = 'control' and
            component = 'end'
          ) as C
        on T.nodeid = C.nodeid and T.executiontimestamp = C.executiontimestamp
2019-06-11 20:21:29 UTC DETAIL:  parameters: $1 = '1428065585', $2 = '1437233953', $3 = '1428065585', $4 = '1437233953'
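The logged parameters span ids 1428065585 to 1437233953, i.e. 9,168,368 row ids handled by a single statement. A minimal sketch of cutting such a range into fixed-size chunks follows. Each chunk matches the query's `id > $1 and id <= $2` predicate (and, since $3/$4 equal $1/$2 in the log, would be bound to both pairs). The helper name `id_batches` and the batch size of 1,000,000 ids are illustrative assumptions, not Rudder's implementation.

```python
from typing import Iterator, Tuple


def id_batches(low: int, high: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield consecutive (lo, hi] id intervals covering (low, high].

    Hypothetical helper: each pair matches `id > lo and id <= hi`,
    so no id is fetched twice and none is skipped.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    lo = low
    while lo < high:
        hi = min(lo + batch_size, high)
        yield lo, hi
        lo = hi


# Interval taken from the logged $1 and $2 parameters.
LOW, HIGH = 1428065585, 1437233953
batches = list(id_batches(LOW, HIGH, 1_000_000))  # batch size is an arbitrary example
print(HIGH - LOW)    # 9168368
print(len(batches))  # 10
print(batches[0])    # (1428065585, 1429065585)
print(batches[-1])   # (1437065585, 1437233953)
```

Chunking by id keeps each statement's working set bounded regardless of how long the web interface was down; the right chunk size would have to be measured on a real database.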

We should:
  1. be able to turn this feature off (or, for instance, only catch up on the last xx minutes to avoid a gray compliance at startup);
  2. be able to catch up on everything (when using the advanced reporting plugin), but process it in batches of yy minutes;
  3. improve indexes.
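Points 1 and 2 can be sketched as one planning step: an optional cap on how far back to look (a cap of zero turns catch-up off) and a fixed batch length. This is a hypothetical Python sketch that assumes the timestamp of the last processed run is known; `plan_catch_up` and its parameters are illustrative names, and mapping the time windows to the id ranges used by the actual query is left out.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# A (start, end] window of run timestamps to process in one batch.
Window = Tuple[datetime, datetime]


def plan_catch_up(last_processed: datetime, now: datetime,
                  max_catch_up: Optional[timedelta],
                  batch: timedelta) -> List[Window]:
    """Split the period between last_processed and now into batches.

    Hypothetical parameters:
      max_catch_up: None catches up with everything (point 2),
                    timedelta(0) turns catch-up off (point 1),
                    any other value only looks back that far ("xx minutes").
      batch:        length of each window ("yy minutes").
    """
    if batch <= timedelta(0):
        raise ValueError("batch must be positive")
    start = last_processed
    if max_catch_up is not None:
        # Never look further back than the configured cap.
        start = max(start, now - max_catch_up)
    windows: List[Window] = []
    while start < now:
        end = min(start + batch, now)
        windows.append((start, end))
        start = end
    return windows


# Sample values: the web interface was stopped for two days.
now = datetime(2019, 6, 11, 12, 0)
last = now - timedelta(days=2)
full = plan_catch_up(last, now, None, timedelta(minutes=30))
recent = plan_catch_up(last, now, timedelta(minutes=10), timedelta(minutes=30))
off = plan_catch_up(last, now, timedelta(0), timedelta(minutes=30))
print(len(full))  # 96
print(recent)     # a single window covering 11:50 to 12:00
print(off)        # []
```

Treating "off" as a zero cap keeps a single code path for both options; whether Rudder exposes them as one setting or two is not stated here.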

The subquery on its own:

select
  true as iscomplete, nodeid, executiontimestamp, keyvalue
from ruddersysevents
where id > 1428065585 and id <= 1437233953
  and eventtype = 'control'
  and component = 'end';

takes 36 seconds (see https://explain.depesz.com/s/T07l) when there is no load on the database.

Only the index on component is used.

The index on component is only used for this query, so we should use this index instead:
CREATE INDEX endRun_control_idx ON RudderSysEvents (id) WHERE eventType = 'control' and component = 'end';
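A partial index can only serve a query whose WHERE clause implies the index predicate; PostgreSQL documents this requirement, and SQLite applies a stricter, term-matching version of the same rule. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption: planners, costs and timings differ), with a reduced ruddersysevents table and synthetic rows, to check that the planner picks endRun_control_idx for the query above and cannot use it once a predicate term is dropped.

```python
import sqlite3

# In-memory SQLite stand-in for the PostgreSQL table; only the columns
# used by the query are modeled and their types are simplified.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ruddersysevents (
        id INTEGER NOT NULL,
        nodeid TEXT,
        executiontimestamp TEXT,
        eventtype TEXT,
        component TEXT,
        keyvalue TEXT
    );
    -- Partial index from the ticket: only end-of-run control rows are indexed.
    CREATE INDEX endRun_control_idx ON ruddersysevents (id)
        WHERE eventtype = 'control' AND component = 'end';
""")

# Synthetic rows: one end-of-run marker (even id) and one other report (odd id) per run.
rows = []
for i in range(1, 1001):
    rows.append((2 * i, f"node{i % 10}", f"ts{i}", "control", "end", f"cfg{i}"))
    rows.append((2 * i + 1, f"node{i % 10}", f"ts{i}", "result_success", "file", None))
conn.executemany("INSERT INTO ruddersysevents VALUES (?, ?, ?, ?, ?, ?)", rows)

# 1 stands in for PostgreSQL's `true`.
QUERY = """
    SELECT 1 AS iscomplete, nodeid, executiontimestamp, keyvalue
    FROM ruddersysevents
    WHERE id > ? AND id <= ?
      AND eventtype = 'control' AND component = 'end'
"""

# The query repeats the index predicate, so the partial index is usable.
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (100, 200)).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)

# Without `component = 'end'` the query no longer implies the index predicate.
QUERY_NO_COMPONENT = QUERY.replace("AND component = 'end'", "")
plan2 = conn.execute("EXPLAIN QUERY PLAN " + QUERY_NO_COMPONENT, (100, 200)).fetchall()
plan2_text = " ".join(row[-1] for row in plan2)
print(plan2_text)

result = conn.execute(QUERY, (100, 200)).fetchall()
print(len(result))  # 50
```

The exact plan text depends on the SQLite version, which is why only the index name is worth checking.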

With this index, the query takes 2.5 s on an extremely heavily loaded system (load of 4 on a 2-CPU system).


Subtasks: 4 (0 open, 4 closed)

Bug #15063: Change index on ruddersysevents to remove inefficient component index and replace it by a composite index (Released, François ARMAND)
Bug #15064: Add an entry in rudder-upgrade to run index migration script during upgrade (Released, François ARMAND)
Bug #15142: Missing migration script at upgrade from 4.1 to 5.0 on sles12 (Released, Vincent MEMBRÉ)
Bug #15076: typo in query from parent ticket (Released, Nicolas CHARLES)

Related issues: 1 (0 open, 1 closed)

Related to Rudder - Bug #14959: Batch Store Run Agent can be limited only in days for catching up old report (Rejected)
