Project

General

Profile

Actions

Bug #3674

closed

Missing index on DB for "reports by nodes" leads to timeout for node list page

Added by Dennis Cabooter over 10 years ago. Updated over 10 years ago.

Status:
Released
Priority:
1
Category:
Web - Compliance & node report
Target version:
Severity:
UX impact:
User visibility:
Effort required:
Priority:
Name check:
Fix check:
Regression:

Description

After upgrading rudder-server-root from 2.5 to 2.6 now and then searching is not working within the webapp. The connection times out and a popup says "The server cannot be contacted at this time". I had this issue before and we managed to solve it with help from Nicolas Charles. We emptied the database and Rudder was fast again. Since then the problem occured again several times. The webapp is not responsive and I can only search for certain groups (the small ones). Please let's find a definitive solution. Do you recommend me to reinstall Rudder from scratch?

Summary of solution: we need two indexes, 1/ one on RudderSysEvent by execution timestamps, and 2/ one on RudderSysEvent by (execution timestamps and by node)

With only 1/ (and not 2/), we get behaviour described here.
With only 2/ (and not 1/), we get #3926.

(1) CREATE INDEX executionTimeStamp_idx on RudderSysEvents (executionTimeStamp);
(2) CREATE INDEX composite_node_execution_idx on RudderSysEvents (nodeId, executionTimeStamp); 

Related issues 5 (0 open5 closed)

Related to Rudder - Architecture #3918: Migration script to change indexes on rudder sys eventsReleasedJonathan CLARKE2013-09-09Actions
Related to Rudder - Architecture #3988: Change rudder sysevents indexes on PostgreSQLReleased2013-09-09Actions
Has duplicate Rudder - Bug #3683: Rudder returns "Server cannot be contacted" if a group criteria is based on a wrong regexpReleasedFrançois ARMAND2013-07-03Actions
Has duplicate Rudder - Bug #3902: Postgres performance problem when loading nodeManager > nodesRejected2013-09-03Actions
Has duplicate Rudder - Bug #3926: Rule page display time increasing lineary with the number of reports in DBRejectedFrançois ARMAND2013-09-12Actions
Actions #1

Updated by Nicolas CHARLES over 10 years ago

Hi Dennis,

I'm sorry to heard this problem happens again ...
So, let's start with the basic. What is you OS and Postgres version ?
I recall you are using a VM. Was the VM killed at some point ? or was the postgres killed ?
Postgres doesn't like to be killed (as most DB actually), and it can lead to data corruption; and in your case I believe indexes corruption

Actions #2

Updated by Nicolas CHARLES over 10 years ago

And just to rule out a OOM or something, do you have any line in your syslogs on the server saying :

The Rudder Webapp configuration files could not be updated to set the SQL password

or
The postgresql process was not running and has been restarted

or
The postgresql process has been restarted

or
The postgresql process couldn't be restarted

Actions #3

Updated by Dennis Cabooter over 10 years ago

Hi Nicolas,

# lsb_release -a
No LSB modules are available.
Distributor ID:    Ubuntu
Description:    Ubuntu 12.04.2 LTS
Release:    12.04
Codename:    precise
# dpkg -l | grep postgres | head -1
ii  postgresql                       9.1+129ubuntu1               object-relational SQL database (supported version)

I restarted the server last week without manually stopping rudder and/or postgres. However, this should not be a problem, since the OS usually stops the daemons before it reboots.

Actions #4

Updated by Dennis Cabooter over 10 years ago

I don't see any of the messages you stated above in syslog.

Actions #5

Updated by Nicolas CHARLES over 10 years ago

And do you have a

database system was interrupted

in any of your postgres log ? if so, could you copy and paste, with most of the lines around it ?

Actions #6

Updated by Nicolas CHARLES over 10 years ago

My best shot on this issue is that the package used was updated with 06/09/2012 code last ( http://changelogs.ubuntu.com/changelogs/pool/main/p/postgresql/postgresql_9.1+129ubuntu1/changelog ), but there's been a fix on 24/09/2012 ( http://wiki.postgresql.org/wiki/20120924updaterelease )

Actions #7

Updated by Nicolas CHARLES over 10 years ago

Hum, reindexing didn't solve the issue

Could you run

SELECT
    t.tablename,
    indexname,
    c.reltuples AS num_rows,
    pg_size_pretty(pg_relation_size(quote_ident(t.tablename)::text)) AS table_size,
    pg_size_pretty(pg_relation_size(quote_ident(indexrelname)::text)) AS index_size,
    CASE WHEN x.is_unique = 1  THEN 'Y'
       ELSE 'N'
    END AS UNIQUE,
    idx_scan AS number_of_scans,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched
FROM pg_tables t
LEFT OUTER JOIN pg_class c ON t.tablename=c.relname
LEFT OUTER JOIN
       (SELECT indrelid,
           max(CAST(indisunique AS integer)) AS is_unique
       FROM pg_index
       GROUP BY indrelid) x
       ON c.oid = x.indrelid
LEFT OUTER JOIN
    ( SELECT c.relname AS ctablename, ipg.relname AS indexname, x.indnatts AS number_of_columns, idx_scan, idx_tup_read, idx_tup_fetch,indexrelname FROM pg_index x
           JOIN pg_class c ON c.oid = x.indrelid
           JOIN pg_class ipg ON ipg.oid = x.indexrelid
           JOIN pg_stat_all_indexes psai ON x.indexrelid = psai.indexrelid )
    AS foo
    ON t.tablename = foo.ctablename
WHERE t.schemaname='public'
ORDER BY 1,2;

and paste the result ?

Actions #8

Updated by Nicolas CHARLES over 10 years ago

Finally, since the version used on ubuntu precise pangolin suffers from a data corruption bug happening in case of a kill -9, fixed in later version of postgres ( https://github.com/postgres/postgres/commit/25560fdeb26f75b50793a56f69f64daab4ad586e ) , and the postgres init script does a kill -9 if anything else didn't work, to stop postgres (line 262 of /usr/bin/pg_ctlcluster ), I guess that's the reason of the problem

Actions #9

Updated by Nicolas CHARLES over 10 years ago

Ha, I checked on a ubuntu 12.04, and the version of the code is from Tue, 02 Apr 2013 11:59:41 +0200
So my theory is broken ...

Does this problem happens only when shuting down your system ?

Actions #10

Updated by Nicolas CHARLES over 10 years ago

Dennis Cabooter wrote:

# dpkg -l | grep postgres | head -1
ii  postgresql                       9.1+129ubuntu1               object-relational SQL database (supported version)

Could you do the same, but without the head -1 ?

Thanks !

Actions #11

Updated by Dennis Cabooter over 10 years ago

        tablename        |            indexname            |  num_rows   | table_size | index_size | unique | number_of_scans | tuples_read | tuples_fetched 
-------------------------+---------------------------------+-------------+------------+------------+--------+-----------------+-------------+----------------
 archivedruddersysevents | archivedruddersysevents_pkey    | 6.11278e+06 | 1816 MB    | 131 MB     | Y      |               0 |           0 |              0
 archivedruddersysevents | executiontimestamp_archived_idx | 6.11278e+06 | 1816 MB    | 131 MB     | Y      |             101 |     3590279 |        3590279
 changerequest           | changerequest_pkey              |           0 | 0 bytes    | 8192 bytes | Y      |               0 |           0 |              0
 directives              | directive_end                   |         192 | 72 kB      | 16 kB      | Y      |               0 |           0 |              0
 directives              | directive_id_start              |         192 | 72 kB      | 32 kB      | Y      |              40 |          71 |              7
 directives              | directives_pkey                 |         192 | 72 kB      | 16 kB      | Y      |               0 |           0 |              0
 eventlog                | creationdate_idx                |        1729 | 800 kB     | 56 kB      | Y      |             104 |         480 |            480
 eventlog                | eventlog_pkey                   |        1729 | 800 kB     | 56 kB      | Y      |            1019 |       20740 |          20740
 eventlog                | eventtype_idx                   |        1729 | 800 kB     | 88 kB      | Y      |             264 |        6477 |              0
 expectedreports         | expectedreports_pkey            |       18219 | 4280 kB    | 416 kB     | Y      |               0 |           0 |              0
 expectedreports         | expectedreports_serialid        |       18219 | 4280 kB    | 1136 kB    | Y      |            4373 |     2836364 |            536
 expectedreports         | expectedreports_versionid       |       18219 | 4280 kB    | 416 kB     | Y      |            8291 |       54472 |          53754
 expectedreportsnodes    | expectedreportsnodes_pkey       |       13410 | 1000 kB    | 888 kB     | Y      |               0 |           0 |              0
 expectedreportsnodes    | expectedreportsnodes_versionid  |       13410 | 1000 kB    | 312 kB     | Y      |            8006 |      140214 |         140214
 gitcommit               | gitcommit_pkey                  |         762 | 120 kB     | 72 kB      | Y      |               0 |           0 |              0
 groups                  | groups_end                      |         236 | 40 kB      | 16 kB      | Y      |               0 |           0 |              0
 groups                  | groups_id_start                 |         236 | 40 kB      | 32 kB      | Y      |               2 |           4 |              3
 groups                  | groups_pkey                     |         236 | 40 kB      | 16 kB      | Y      |               0 |           0 |              0
 groupsnodesjoin         | groupsnodesjoin_pkey            |        2410 | 208 kB     | 176 kB     | Y      |             723 |       13821 |              0
 migrationeventlog       | migrationeventlog_pkey          |          18 | 8192 bytes | 16 kB      | Y      |              64 |          64 |             64
 nodes                   | nodes_end                       |          82 | 16 kB      | 16 kB      | Y      |               0 |           0 |              0
 nodes                   | nodes_id_start                  |          82 | 16 kB      | 16 kB      | Y      |               0 |           0 |              0
 nodes                   | nodes_pkey                      |          82 | 16 kB      | 16 kB      | Y      |               0 |           0 |              0
 rudderproperties        | rudderproperties_pkey           |           1 | 8192 bytes | 16 kB      | Y      |               0 |           0 |              0
 ruddersysevents         | component_idx                   | 9.30168e+06 | 2598 MB    | 326 MB     | Y      |               8 |       14800 |              0
 ruddersysevents         | executiondate_idx               | 9.30168e+06 | 2598 MB    | 204 MB     | Y      |               0 |           0 |              0
 ruddersysevents         | executiontimestamp_idx          | 9.30168e+06 | 2598 MB    | 208 MB     | Y      |           17851 |   238193641 |      237995230
 ruddersysevents         | keyvalue_idx                    | 9.30168e+06 | 2598 MB    | 296 MB     | Y      |              40 |     6367776 |              0
 ruddersysevents         | nodeid_idx                      | 9.30168e+06 | 2598 MB    | 528 MB     | Y      |             171 |       89841 |          89841
 ruddersysevents         | ruddersysevents_pkey            | 9.30168e+06 | 2598 MB    | 204 MB     | Y      |          266959 |    36779591 |       36538294
 ruddersysevents         | ruleid_idx                      | 9.30168e+06 | 2598 MB    | 501 MB     | Y      |               0 |           0 |              0
 rules                   | rule_end                        |         105 | 24 kB      | 16 kB      | Y      |               0 |           0 |              0
 rules                   | rule_id_start                   |         105 | 24 kB      | 16 kB      | Y      |               0 |           0 |              0
 rules                   | rules_pkey                      |         105 | 24 kB      | 16 kB      | Y      |               0 |           0 |              0
 rulesdirectivesjoin     | rulesdirectivesjoin_pkey        |         385 | 40 kB      | 48 kB      | Y      |               0 |           0 |              0
 rulesgroupjoin          | rulesgroupjoin_pkey             |         158 | 24 kB      | 32 kB      | Y      |               0 |           0 |              0
 workflow                |                                 |           0 | 0 bytes    |            | N      |                 |             |               
(37 rows)
Actions #12

Updated by Dennis Cabooter over 10 years ago

# dpkg -l | grep postgres
ii  postgresql                       9.1+129ubuntu1               object-relational SQL database (supported version)
ii  postgresql-9.1                   9.1.9-0ubuntu12.04           object-relational SQL database, version 9.1 server
ii  postgresql-client-9.1            9.1.9-0ubuntu12.04           front-end programs for PostgreSQL 9.1
ii  postgresql-client-common         129ubuntu1                   manager for multiple PostgreSQL client versions
ii  postgresql-common                129ubuntu1                   PostgreSQL database-cluster manager
Actions #13

Updated by Dennis Cabooter over 10 years ago

As far as I can see the problem only occurs in case of a system reboot/shutdown.

Actions #14

Updated by Nicolas CHARLES over 10 years ago

Dennis,

We want to make sure that the postgres is the real culprit. I've made a little script that simulates the query made on the list of nodes page. Could you run it into postgres, and make a pastie of the exit ?
It will output a list of : time when the query was made, nodeid, last inventory date

CREATE LANGUAGE plpgsql;

CREATE FUNCTION scanDb() RETURNS TEXT 
LANGUAGE plpgsql
AS $$DECLARE
  DECLARE

    output TEXT;
    cur_nodeid varchar;
    lasttime timestamp;

  BEGIN
    FOR cur_nodeid in EXECUTE '
SELECT distinct(nodeid) FROM nodes
'
    LOOP

    SELECT into lasttime executionTimeStamp from RudderSysEvents where 1=1 and nodeid = cur_nodeid order by executionTimeStamp desc limit 1;
    RAISE NOTICE 'At %, performed select on %, last inventory time was % ', timeofday() , cur_nodeid, lasttime;
    END LOOP;

    RETURN output;
END$$;

To run the script, you just have to execute

SELECT scanDb();

It may be quite long

Actions #15

Updated by Nicolas CHARLES over 10 years ago

Dennis, could you run also this command

/usr/lib/postgresql/9.1/bin/pg_controldata /var/lib/postgresql/9.1/main/

and this one
pg_dumpall > /dev/null 2> dump.err

and past the output of the first command, and the content of the dump.err file ?

Thanks !

Actions #16

Updated by Dennis Cabooter over 10 years ago

# /usr/lib/postgresql/9.1/bin/pg_controldata /var/lib/postgresql/9.1/main/
pg_control version number:            903
Catalog version number:               201105231
Database system identifier:           5867460532166847588
Database cluster state:               in production
pg_control last modified:             Tue 02 Jul 2013 01:01:45 PM CEST
Latest checkpoint location:           11/47506F80
Prior checkpoint location:            11/47112978
Latest checkpoint's REDO location:    11/473A0A98
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/32986218
Latest checkpoint's NextOID:          37224
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        670
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Time of latest checkpoint:            Tue 02 Jul 2013 01:01:13 PM CEST
Minimum recovery ending location:     0/0
Backup start location:                0/0
Current wal_level setting:            minimal
Current max_connections setting:      100
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Actions #17

Updated by Dennis Cabooter over 10 years ago

This is probably not what you expect:

# cat dump.err 
pg_dumpall: could not connect to database "template1": FATAL:  role "root" does not exist
Actions #18

Updated by Nicolas CHARLES over 10 years ago

Dennis Cabooter wrote:

This is probably not what you expect:

[...]

indeed; could you run it as the user postgres ?

Actions #19

Updated by Dennis Cabooter over 10 years ago

The dump.err file is empty.

Actions #20

Updated by Dennis Cabooter over 10 years ago

# psql -q -U rudder -h localhost
rudder=> select nodeid,max(executiontimestamp) from ruddersysevents group by nodeid;
rudder=> CREATE LANGUAGE plpgsql;
ERROR:  language "plpgsql" already exists
rudder=> 
rudder=> CREATE FUNCTION scanDb() RETURNS TEXT 
rudder-> LANGUAGE plpgsql
rudder-> AS $$DECLARE
rudder$>   DECLARE
rudder$> 
rudder$>     output TEXT;
rudder$>     cur_nodeid varchar;
rudder$>     lasttime timestamp;
rudder$> 
rudder$>   BEGIN
rudder$>     FOR cur_nodeid in EXECUTE '
rudder$> SELECT distinct(nodeid) FROM nodes
rudder$> '
rudder$>     LOOP
rudder$> 
rudder$>     SELECT into lasttime executionTimeStamp from RudderSysEvents where 1=1 and nodeid = cur_nodeid order by executionTimeStamp desc limit 1;
rudder$>     RAISE NOTICE 'At %, performed select on %, last inventory time was % ', timeofday() , cur_nodeid, lasttime;
rudder$>     END LOOP;
rudder$> 
rudder$>     RETURN output;
rudder$> END$$;
rudder=> SELECT scanDb();
NOTICE:  At Tue Jul 02 11:38:05.101472 2013 CEST, performed select on f538a113-eba7-421e-9750-6abce6c50252, last inventory time was 2013-07-02 11:35:33 
NOTICE:  At Tue Jul 02 11:38:05.102514 2013 CEST, performed select on c821ed95-698b-4368-b510-7b9b5c47e224, last inventory time was 2013-07-02 11:35:49 
NOTICE:  At Tue Jul 02 12:08:42.269042 2013 CEST, performed select on # The content of this file should be replaced with a valid U, last inventory time was <NULL> 
NOTICE:  At Tue Jul 02 12:08:42.270656 2013 CEST, performed select on c8ef418b-8dfd-48a0-8e44-945e46d4b93b, last inventory time was 2013-07-02 12:05:56 
NOTICE:  At Tue Jul 02 12:08:42.272259 2013 CEST, performed select on 09a115e7-8322-4089-9d7d-00d647ece18f, last inventory time was 2013-07-02 12:05:22 
NOTICE:  At Tue Jul 02 12:08:42.272602 2013 CEST, performed select on 281d598a-3d4f-4cdb-acc1-6a19087b4ea1, last inventory time was 2013-07-02 12:06:50 
NOTICE:  At Tue Jul 02 12:08:42.273626 2013 CEST, performed select on b977ee9b-91e2-49f9-92ae-ff8c287eff78, last inventory time was 2013-07-02 12:05:49 
NOTICE:  At Tue Jul 02 12:08:42.274354 2013 CEST, performed select on 31957caa-520b-4db5-b8a7-66a1548de1b2, last inventory time was 2013-07-02 12:06:02 
NOTICE:  At Tue Jul 02 12:08:42.275071 2013 CEST, performed select on e8c59d61-09a2-437f-ad69-7398f13b7dec, last inventory time was 2013-07-02 12:05:59 
NOTICE:  At Tue Jul 02 12:08:42.276161 2013 CEST, performed select on 2500fd48-2905-42ba-bfdf-e5c170b9a176, last inventory time was 2013-07-02 12:05:31 
NOTICE:  At Tue Jul 02 12:08:42.277094 2013 CEST, performed select on fed5f844-de90-47d1-83fd-aa205f192633, last inventory time was 2013-07-02 12:05:46 
NOTICE:  At Tue Jul 02 12:08:42.278567 2013 CEST, performed select on 16cfe090-fecf-4433-8465-aeff1e5ef5b6, last inventory time was 2013-07-02 12:05:15 
NOTICE:  At Tue Jul 02 12:08:42.279118 2013 CEST, performed select on 80cd9420-65af-404d-bfdc-926fb0fd6364, last inventory time was 2013-07-02 12:06:22 
NOTICE:  At Tue Jul 02 12:08:42.279914 2013 CEST, performed select on 8587523a-f9e3-4384-b89d-b0c0a697078e, last inventory time was 2013-07-02 12:05:55 
NOTICE:  At Tue Jul 02 12:08:42.280485 2013 CEST, performed select on 4d35582b-98b2-4538-95fb-77410305225b, last inventory time was 2013-07-02 12:06:06 
NOTICE:  At Tue Jul 02 12:34:26.663505 2013 CEST, performed select on ba09ede2-4abf-40f4-97e8-5cbabcc5746c, last inventory time was <NULL> 
NOTICE:  At Tue Jul 02 12:58:31.153248 2013 CEST, performed select on e97b7188-131c-4f0d-9c82-76cf2d12c8ff, last inventory time was <NULL> 
NOTICE:  At Tue Jul 02 12:58:31.156417 2013 CEST, performed select on ea321759-5892-4239-a954-4473a8daabee, last inventory time was 2013-07-02 12:55:21 
NOTICE:  At Tue Jul 02 12:58:31.157089 2013 CEST, performed select on 51f46035-169e-41cf-9c5f-c9a736554740, last inventory time was 2013-07-02 12:56:10 
NOTICE:  At Tue Jul 02 12:58:31.158340 2013 CEST, performed select on 2286a209-5535-461f-bfe7-1df01071c183, last inventory time was 2013-07-02 12:55:38 
NOTICE:  At Tue Jul 02 13:23:36.257280 2013 CEST, performed select on 82a32926-0aaa-49cb-bc5f-0c79e61a5b6e, last inventory time was <NULL> 
NOTICE:  At Tue Jul 02 13:23:36.258265 2013 CEST, performed select on b748549d-5229-49a0-8c07-9efc17de79c5, last inventory time was 2013-07-02 13:21:26 
NOTICE:  At Tue Jul 02 13:23:36.260109 2013 CEST, performed select on 593604d8-f885-44a7-b186-ae6fe885415c, last inventory time was 2013-07-02 13:20:18 
NOTICE:  At Tue Jul 02 13:23:36.261480 2013 CEST, performed select on ac91fc05-0043-4238-8de8-c1289ec00d67, last inventory time was 2013-07-02 13:20:27 
NOTICE:  At Tue Jul 02 13:23:36.262462 2013 CEST, performed select on 009e0af1-1a3c-4ee0-a91d-10b68a008f5d, last inventory time was 2013-07-02 13:20:43 
NOTICE:  At Tue Jul 02 13:47:38.688213 2013 CEST, performed select on 868a518f-93b3-4d00-af57-2253b529e8c7, last inventory time was <NULL> 
NOTICE:  At Tue Jul 02 14:09:53.889301 2013 CEST, performed select on 32a37cb6-6e56-475f-989e-ddb6e644f1f1, last inventory time was <NULL> 
NOTICE:  At Tue Jul 02 14:09:53.890917 2013 CEST, performed select on 073239d7-c633-4b60-8477-57131839fd34, last inventory time was 2013-07-02 14:05:40 
NOTICE:  At Tue Jul 02 14:09:53.891753 2013 CEST, performed select on a99e9d1d-3030-44b4-90e3-859d23e3c16b, last inventory time was 2013-07-02 14:05:42 
NOTICE:  At Tue Jul 02 14:09:53.892109 2013 CEST, performed select on 987801cc-0bd5-4029-8cb9-21b9504fe177, last inventory time was 2013-07-02 14:06:20 
NOTICE:  At Tue Jul 02 14:09:53.892580 2013 CEST, performed select on 303d5379-b768-40fb-9fa0-8d05722405fb, last inventory time was 2013-07-02 14:06:11 
NOTICE:  At Tue Jul 02 14:09:53.893072 2013 CEST, performed select on 906a1318-b935-4862-b779-1b427a0b757a, last inventory time was 2013-07-02 14:06:07 
NOTICE:  At Tue Jul 02 14:32:03.690490 2013 CEST, performed select on 0d719356-3373-4dd3-b8f0-5ef54c8b363c, last inventory time was <NULL> 
NOTICE:  At Tue Jul 02 14:32:03.692021 2013 CEST, performed select on f26bac0e-3e6d-453e-81fa-0511e9291e79, last inventory time was 2013-07-02 14:30:22 
NOTICE:  At Tue Jul 02 14:32:03.692240 2013 CEST, performed select on 37cc2d4e-d5f2-401e-af72-a7c44eca00ae, last inventory time was 2013-07-02 14:31:35 
NOTICE:  At Tue Jul 02 14:32:03.693037 2013 CEST, performed select on 972e9124-e59b-4e9b-a857-6d0c0b4db9b1, last inventory time was 2013-07-02 14:30:41 
NOTICE:  At Tue Jul 02 14:53:43.150966 2013 CEST, performed select on bafcbc77-e7f6-4d0b-bc9d-ad4513ee24d3, last inventory time was <NULL> 
NOTICE:  At Tue Jul 02 14:53:43.152075 2013 CEST, performed select on 120dab37-1922-469e-8c76-b77250742190, last inventory time was 2013-07-02 14:51:19 
NOTICE:  At Tue Jul 02 15:14:38.262659 2013 CEST, performed select on a6f072f2-3410-4dec-a7d4-e431fdb28526, last inventory time was <NULL> 
NOTICE:  At Tue Jul 02 15:14:38.264771 2013 CEST, performed select on 6bee8a86-b2a0-4751-86a9-41c2d14506e5, last inventory time was 2013-07-02 15:10:42 
NOTICE:  At Tue Jul 02 15:14:38.265928 2013 CEST, performed select on f84d4283-f6e8-4e54-bc3a-8509299add6e, last inventory time was 2013-07-02 15:10:50 
NOTICE:  At Tue Jul 02 15:38:01.378342 2013 CEST, performed select on 902eecb3-8d31-494e-acc4-e79d3739549d, last inventory time was <NULL> 
NOTICE:  At Tue Jul 02 15:38:01.380795 2013 CEST, performed select on 9073b71d-823b-4caa-9e4a-3925369c779a, last inventory time was 2013-07-02 15:35:28 
NOTICE:  At Tue Jul 02 15:38:01.381281 2013 CEST, performed select on 57a6e9a5-64dc-4109-9824-4177e533901f, last inventory time was 2013-07-02 15:36:44 
NOTICE:  At Tue Jul 02 15:45:32.933143 2013 CEST, performed select on 693018e3-f652-4e28-9503-f3094082af0d, last inventory time was 2013-06-24 13:21:04 
NOTICE:  At Tue Jul 02 15:45:32.933663 2013 CEST, performed select on b52adaac-eea1-4b76-a30c-5ba7ed5d836e, last inventory time was 2013-07-02 15:41:25 
NOTICE:  At Tue Jul 02 15:57:19.107623 2013 CEST, performed select on 0fa41cb8-029a-4055-a6ba-00a825a87b5c, last inventory time was 2013-06-21 11:30:53 
NOTICE:  At Tue Jul 02 15:57:19.115473 2013 CEST, performed select on 44345f44-a515-442e-8b9e-31f3f2270a5f, last inventory time was 2013-07-02 15:55:35 
 scandb 
--------

(1 row)

rudder=>
Actions #21

Updated by Nicolas CHARLES over 10 years ago

Ok, if the dump.err is empty, it probably means that there is no corruption.
Rebuilding the indexes didn't help either.

Maybe it's simply a performance issue. A you said, you have only 558 MB used out of 2 GB available. So we'll boost the postgres config, to see if there are any improvement
In the file /etc/postgresql/9.1/main/postgresql.conf , could you set the folloxing parameters :

work_mem = 8MB
max_stack_depth = 4MB
effective_cache_size = 256MB

This should help a lot.
Then restart postgres : /etc/init.d/postgresql restart

I you can, try also try to set

shared_buffers=54MB

(you might find http://www.postgresql.org/docs/9.1/static/kernel-resources.html#SYSVIPC handy for this)

Actions #22

Updated by Dennis Cabooter over 10 years ago

Done.

select * from expectedreportsnodes where nodejoinkey = (select max(nodejoinkey) from expectedreports where ruleid = 'hasPolicyServer-root');
 nodejoinkey |                nodeid                
-------------+--------------------------------------
        1534 | 2500fd48-2905-42ba-bfdf-e5c170b9a176
        1534 | 0fa41cb8-029a-4055-a6ba-00a825a87b5c
        1534 | c821ed95-698b-4368-b510-7b9b5c47e224
        1534 | 073239d7-c633-4b60-8477-57131839fd34
        1534 | 009e0af1-1a3c-4ee0-a91d-10b68a008f5d
        1534 | 987801cc-0bd5-4029-8cb9-21b9504fe177
        1534 | b977ee9b-91e2-49f9-92ae-ff8c287eff78
        1534 | f84d4283-f6e8-4e54-bc3a-8509299add6e
        1534 | c8ef418b-8dfd-48a0-8e44-945e46d4b93b
        1534 | 57a6e9a5-64dc-4109-9824-4177e533901f
        1534 | 281d598a-3d4f-4cdb-acc1-6a19087b4ea1
        1534 | fed5f844-de90-47d1-83fd-aa205f192633
        1534 | a99e9d1d-3030-44b4-90e3-859d23e3c16b
        1534 | 303d5379-b768-40fb-9fa0-8d05722405fb
        1534 | e8c59d61-09a2-437f-ad69-7398f13b7dec
        1534 | 593604d8-f885-44a7-b186-ae6fe885415c
        1534 | 9073b71d-823b-4caa-9e4a-3925369c779a
        1534 | 2286a209-5535-461f-bfe7-1df01071c183
        1534 | b52adaac-eea1-4b76-a30c-5ba7ed5d836e
        1534 | 120dab37-1922-469e-8c76-b77250742190
        1534 | 6bee8a86-b2a0-4751-86a9-41c2d14506e5
        1534 | 16cfe090-fecf-4433-8465-aeff1e5ef5b6
        1534 | 37cc2d4e-d5f2-401e-af72-a7c44eca00ae
        1534 | ea321759-5892-4239-a954-4473a8daabee
        1534 | 51f46035-169e-41cf-9c5f-c9a736554740
        1534 | b748549d-5229-49a0-8c07-9efc17de79c5
        1534 | 8587523a-f9e3-4384-b89d-b0c0a697078e
        1534 | 906a1318-b935-4862-b779-1b427a0b757a
        1534 | f26bac0e-3e6d-453e-81fa-0511e9291e79
        1534 | 972e9124-e59b-4e9b-a857-6d0c0b4db9b1
        1534 | ac91fc05-0043-4238-8de8-c1289ec00d67
        1534 | root
        1534 | 31957caa-520b-4db5-b8a7-66a1548de1b2
        1534 | 80cd9420-65af-404d-bfdc-926fb0fd6364
        1534 | f538a113-eba7-421e-9750-6abce6c50252
        1534 | 4d35582b-98b2-4538-95fb-77410305225b
        1534 | 44345f44-a515-442e-8b9e-31f3f2270a5f
        1534 | 09a115e7-8322-4089-9d7d-00d647ece18f
(38 rows)
Actions #23

Updated by Dennis Cabooter over 10 years ago

The settings mentioned above do not solve the webapp time out problem.

Actions #24

Updated by Nicolas CHARLES over 10 years ago

There is something odd with the node 0fa41cb8-029a-4055-a6ba-00a825a87b5c
Last time it sent messages to the server was 21/06/2013, and it was quite slow to fetch its reports from the database

Actions #25

Updated by Dennis Cabooter over 10 years ago

Finally we found the bugger (thanks ncharles :))

Let me summarize what happened:
- the webapp did not respond
- the database was ok (checked)
- one rudder-agent on a node wasn't correctly installed (that should not be a problem)
- In the webapp all statuses of that node are in No Answer, the technical logs tab timed out
- I did "apt-get install rudder-agent" on the node
- All webapp timeouts are gone

Actions #26

Updated by Dennis Cabooter over 10 years ago

rudder=> explain analyse select * from ruddersysevents where nodeid = '0fa41cb8-029a-4055-a6ba-00a825a87b5c' order by executiontimestamp desc limit 1000;
                                                                                QUERY PLAN                                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..5035.81 rows=1000 width=237) (actual time=0.440..477240.725 rows=1000 loops=1)
   ->  Index Scan Backward using executiontimestamp_idx on ruddersysevents  (cost=0.00..618951.93 rows=122910 width=237) (actual time=0.435..477240.483 rows=1000 loops=1)
         Filter: (nodeid = '0fa41cb8-029a-4055-a6ba-00a825a87b5c'::text)
 Total runtime: 477263.399 ms
(4 rows)
Actions #27

Updated by Nicolas CHARLES over 10 years ago

The agent somehow didn't work on node 0fa41cb8-029a-4055-a6ba-00a825a87b5c, and didn't send any reports to the server from 21/06/2013

Looking in the webapp for this node shows everything in no answer, and the technical log timed out
Reinstalling agent made it send reports to the server, and solved the timeout for the search, but not the technical logs

My guess (i might be wrong) is that postgresql is aggressively optimizing index for recent reports, as they are the most used; looking for old data make it more difficult to look at

Actions #28

Updated by Dennis Cabooter over 10 years ago

rudder=> drop index executiontimestamp_idx; CREATE INDEX executionTimeStamp_idx on RudderSysEvents (executionTimeStamp);
DROP INDEX
CREATE INDEX
rudder=> explain analyse select * from ruddersysevents where nodeid = '0fa41cb8-029a-4055-a6ba-00a825a87b5c' order by executiontimestamp desc limit 1000;
                                                                                QUERY PLAN                                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..5014.16 rows=1000 width=237) (actual time=0.425..442442.124 rows=1000 loops=1)
   ->  Index Scan Backward using executiontimestamp_idx on ruddersysevents  (cost=0.00..616836.88 rows=123019 width=237) (actual time=0.424..442441.988 rows=1000 loops=1)
         Filter: (nodeid = '0fa41cb8-029a-4055-a6ba-00a825a87b5c'::text)
 Total runtime: 442442.230 ms
(4 rows)
Actions #29

Updated by Dennis Cabooter over 10 years ago

rudder=> CREATE INDEX  composite_node_execution_idx on RudderSysEvents(nodeid, executionTimeStamp);
CREATE INDEX
rudder=> explain analyse select * from ruddersysevents where nodeid = '0fa41cb8-029a-4055-a6ba-00a825a87b5c' order by executiontimestamp desc limit 1000;
                                                                                  QUERY PLAN                                                                                  
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3770.62 rows=1000 width=237) (actual time=0.067..603.994 rows=1000 loops=1)
   ->  Index Scan Backward using composite_node_execution_idx on ruddersysevents  (cost=0.00..464445.55 rows=123175 width=237) (actual time=0.065..603.838 rows=1000 loops=1)
         Index Cond: (nodeid = '0fa41cb8-029a-4055-a6ba-00a825a87b5c'::text)
 Total runtime: 618.914 ms
(4 rows)
Actions #30

Updated by Nicolas CHARLES over 10 years ago

That's much better !

Based on this, we'll make some extensive study on our use of indexes, I guess there are some changes that are due to be done, since we changed many things since when we created them (and we learnt a lot since then too)

Actions #31

Updated by Nicolas PERRON over 10 years ago

  • Target version changed from 2.6.3 to 2.6.4
Actions #32

Updated by Nicolas PERRON over 10 years ago

  • Target version changed from 2.6.4 to 2.6.5
Actions #33

Updated by Andrew Cranson over 10 years ago

I'm also seeing the same problem (as reported in #3902) while running Rudder 2.7.1 (most agents 2.7.0), around 400 nodes, CentOS 6 x86_64 with postgres 8.4.13 (CentOS version).

If we can help test any proposed fixes for this please feel free to contact me as this is having a significant negative effect on our usage of Rudder so far.

Thanks!

Actions #34

Updated by Nicolas CHARLES over 10 years ago

The fix suggested above should work: in postgres, on the rudder database, you run the following command

drop index executiontimestamp_idx; CREATE INDEX executionTimeStamp_idx on RudderSysEvents (executionTimeStamp);

We didn't have any side effect with this, and should greatly improve perfs

Actions #35

Updated by Nicolas CHARLES over 10 years ago

  • Project changed from 24 to Rudder
Actions #36

Updated by Andrew Cranson over 10 years ago

Sorry, I didn't see your update (i forgot to 'watch' this). I was just coming back to say that the fix suggested is working great. This has fixed the issue with loading the node list and regex searches on software versions ("The server cannot be contacted at this time"). MUCH better performance. Thanks!

Actions #37

Updated by Nicolas CHARLES over 10 years ago

I'm glad it helped !

Actions #38

Updated by Nicolas PERRON over 10 years ago

  • Target version changed from 2.6.5 to 2.6.6
Actions #39

Updated by François ARMAND over 10 years ago

  • Category set to Web - Compliance & node report
  • Status changed from New to 8
  • Assignee set to Nicolas CHARLES

Given the feedback, I believe we should go that way. What do you thing Nico ?

Actions #40

Updated by Nicolas CHARLES over 10 years ago

The correct fix is:

drop index executiontimestamp_idx;  CREATE INDEX  composite_node_execution_idx on RudderSysEvents(nodeid, executionTimeStamp);

Actions #41

Updated by Nicolas CHARLES over 10 years ago

  • Status changed from 8 to In progress

It seems clear that it's reliable and ought to be merged

Actions #42

Updated by Nicolas CHARLES over 10 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Nicolas CHARLES to François ARMAND
  • Pull Request set to https://github.com/Normation/rudder/pull/313
Actions #43

Updated by François ARMAND over 10 years ago

Why not do that in 2.4 ? Even if information used in node list are not used yet, isn't there any other place where these indexes may have an interest (like in reports or something ?)

Actions #44

Updated by Nicolas CHARLES over 10 years ago

2.4 shouldn't benefit from it, as it is very useful for the list of nodes, and it seems to have little to none advantages for typical reports

But I can be wrong.

Actions #45

Updated by François ARMAND over 10 years ago

OK, let's do like that. User in 2.4 would have reported since the time...

Actions #46

Updated by Nicolas CHARLES over 10 years ago

  • Status changed from Pending technical review to Pending release
  • % Done changed from 0 to 100
Actions #47

Updated by Anonymous over 10 years ago

Actions #48

Updated by Nicolas CHARLES over 10 years ago

  • Status changed from Pending release to In progress
  • Assignee changed from François ARMAND to Nicolas CHARLES

Ha, there's something to correct there probably, removing the index on executionTimeStamp only did break stuff for one user (slowing drastically the rules page)

Actions #49

Updated by Nicolas CHARLES over 10 years ago

So, this modification introduced the following regressions for coredumb:
  1. rules pages slower (about 50% slower)
  2. reports archive page very very slow (> 1minutes against instanteneous before)

Having both indexes seems to be the correct solution.
I'm reverting this commit and creating the new one

Actions #50

Updated by Nicolas CHARLES over 10 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Nicolas CHARLES to François ARMAND
  • Pull Request changed from https://github.com/Normation/rudder/pull/313 to https://github.com/Normation/rudder/pull/314
Actions #51

Updated by François ARMAND over 10 years ago

  • Subject changed from The server cannot be contacted at this time to Missing index on DB for "reports by nodes" leads to timeout for node list page
Actions #52

Updated by Nicolas CHARLES over 10 years ago

  • Status changed from Pending technical review to Pending release
  • % Done changed from 0 to 100
Actions #53

Updated by François ARMAND over 10 years ago

The behaviour leading to readding the index on node only is describe in details in #3926.

Actions #54

Updated by François ARMAND over 10 years ago

  • Description updated (diff)
Actions #55

Updated by Nicolas PERRON over 10 years ago

  • Status changed from Pending release to Released
Actions

Also available in: Atom PDF