Bug #3674
closedMissing index on DB for "reports by nodes" leads to timeout for node list page
Description
After upgrading rudder-server-root from 2.5 to 2.6 now and then searching is not working within the webapp. The connection times out and a popup says "The server cannot be contacted at this time". I had this issue before and we managed to solve it with help from Nicolas Charles. We emptied the database and Rudder was fast again. Since then the problem occured again several times. The webapp is not responsive and I can only search for certain groups (the small ones). Please let's find a definitive solution. Do you recommend me to reinstall Rudder from scratch?
Summary of solution: we need two indexes, 1/ one on RudderSysEvent by execution timestamps, and 2/ one on RudderSysEvent by (execution timestamps and by node)
With only 1/ (and not 2/), we get behaviour described here.
With only 2/ (and not 1/), we get #3926.
(1) CREATE INDEX executionTimeStamp_idx on RudderSysEvents (executionTimeStamp); (2) CREATE INDEX composite_node_execution_idx on RudderSysEvents (nodeId, executionTimeStamp);
Updated by Nicolas CHARLES over 11 years ago
Hi Dennis,
I'm sorry to heard this problem happens again ...
So, let's start with the basic. What is you OS and Postgres version ?
I recall you are using a VM. Was the VM killed at some point ? or was the postgres killed ?
Postgres doesn't like to be killed (as most DB actually), and it can lead to data corruption; and in your case I believe indexes corruption
Updated by Nicolas CHARLES over 11 years ago
And just to rule out a OOM or something, do you have any line in your syslogs on the server saying :
The Rudder Webapp configuration files could not be updated to set the SQL password
or
The postgresql process was not running and has been restarted
or
The postgresql process has been restarted
or
The postgresql process couldn't be restarted
Updated by Dennis Cabooter over 11 years ago
Hi Nicolas,
# lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 12.04.2 LTS Release: 12.04 Codename: precise
# dpkg -l | grep postgres | head -1 ii postgresql 9.1+129ubuntu1 object-relational SQL database (supported version)
I restarted the server last week without manually stopping rudder and/or postgres. However, this should not be a problem, since the OS usually stops the daemons before it reboots.
Updated by Dennis Cabooter over 11 years ago
I don't see any of the messages you stated above in syslog.
Updated by Nicolas CHARLES over 11 years ago
And do you have a
database system was interrupted
in any of your postgres log ? if so, could you copy and paste, with most of the lines around it ?
Updated by Nicolas CHARLES over 11 years ago
My best shot on this issue is that the package used was updated with 06/09/2012 code last ( http://changelogs.ubuntu.com/changelogs/pool/main/p/postgresql/postgresql_9.1+129ubuntu1/changelog ), but there's been a fix on 24/09/2012 ( http://wiki.postgresql.org/wiki/20120924updaterelease )
Updated by Nicolas CHARLES over 11 years ago
Hum, reindexing didn't solve the issue
Could you run
SELECT t.tablename, indexname, c.reltuples AS num_rows, pg_size_pretty(pg_relation_size(quote_ident(t.tablename)::text)) AS table_size, pg_size_pretty(pg_relation_size(quote_ident(indexrelname)::text)) AS index_size, CASE WHEN x.is_unique = 1 THEN 'Y' ELSE 'N' END AS UNIQUE, idx_scan AS number_of_scans, idx_tup_read AS tuples_read, idx_tup_fetch AS tuples_fetched FROM pg_tables t LEFT OUTER JOIN pg_class c ON t.tablename=c.relname LEFT OUTER JOIN (SELECT indrelid, max(CAST(indisunique AS integer)) AS is_unique FROM pg_index GROUP BY indrelid) x ON c.oid = x.indrelid LEFT OUTER JOIN ( SELECT c.relname AS ctablename, ipg.relname AS indexname, x.indnatts AS number_of_columns, idx_scan, idx_tup_read, idx_tup_fetch,indexrelname FROM pg_index x JOIN pg_class c ON c.oid = x.indrelid JOIN pg_class ipg ON ipg.oid = x.indexrelid JOIN pg_stat_all_indexes psai ON x.indexrelid = psai.indexrelid ) AS foo ON t.tablename = foo.ctablename WHERE t.schemaname='public' ORDER BY 1,2;
and paste the result ?
Updated by Nicolas CHARLES over 11 years ago
Finally, since the version used on ubuntu precise pangolin suffers from a data corruption bug happening in case of a kill -9, fixed in later version of postgres ( https://github.com/postgres/postgres/commit/25560fdeb26f75b50793a56f69f64daab4ad586e ) , and the postgres init script does a kill -9 if anything else didn't work, to stop postgres (line 262 of /usr/bin/pg_ctlcluster ), I guess that's the reason of the problem
Updated by Nicolas CHARLES over 11 years ago
Ha, I checked on a ubuntu 12.04, and the version of the code is from Tue, 02 Apr 2013 11:59:41 +0200
So my theory is broken ...
Does this problem happens only when shuting down your system ?
Updated by Nicolas CHARLES over 11 years ago
Dennis Cabooter wrote:
# dpkg -l | grep postgres | head -1 ii postgresql 9.1+129ubuntu1 object-relational SQL database (supported version)
Could you do the same, but without the head -1 ?
Thanks !
Updated by Dennis Cabooter over 11 years ago
tablename | indexname | num_rows | table_size | index_size | unique | number_of_scans | tuples_read | tuples_fetched -------------------------+---------------------------------+-------------+------------+------------+--------+-----------------+-------------+---------------- archivedruddersysevents | archivedruddersysevents_pkey | 6.11278e+06 | 1816 MB | 131 MB | Y | 0 | 0 | 0 archivedruddersysevents | executiontimestamp_archived_idx | 6.11278e+06 | 1816 MB | 131 MB | Y | 101 | 3590279 | 3590279 changerequest | changerequest_pkey | 0 | 0 bytes | 8192 bytes | Y | 0 | 0 | 0 directives | directive_end | 192 | 72 kB | 16 kB | Y | 0 | 0 | 0 directives | directive_id_start | 192 | 72 kB | 32 kB | Y | 40 | 71 | 7 directives | directives_pkey | 192 | 72 kB | 16 kB | Y | 0 | 0 | 0 eventlog | creationdate_idx | 1729 | 800 kB | 56 kB | Y | 104 | 480 | 480 eventlog | eventlog_pkey | 1729 | 800 kB | 56 kB | Y | 1019 | 20740 | 20740 eventlog | eventtype_idx | 1729 | 800 kB | 88 kB | Y | 264 | 6477 | 0 expectedreports | expectedreports_pkey | 18219 | 4280 kB | 416 kB | Y | 0 | 0 | 0 expectedreports | expectedreports_serialid | 18219 | 4280 kB | 1136 kB | Y | 4373 | 2836364 | 536 expectedreports | expectedreports_versionid | 18219 | 4280 kB | 416 kB | Y | 8291 | 54472 | 53754 expectedreportsnodes | expectedreportsnodes_pkey | 13410 | 1000 kB | 888 kB | Y | 0 | 0 | 0 expectedreportsnodes | expectedreportsnodes_versionid | 13410 | 1000 kB | 312 kB | Y | 8006 | 140214 | 140214 gitcommit | gitcommit_pkey | 762 | 120 kB | 72 kB | Y | 0 | 0 | 0 groups | groups_end | 236 | 40 kB | 16 kB | Y | 0 | 0 | 0 groups | groups_id_start | 236 | 40 kB | 32 kB | Y | 2 | 4 | 3 groups | groups_pkey | 236 | 40 kB | 16 kB | Y | 0 | 0 | 0 groupsnodesjoin | groupsnodesjoin_pkey | 2410 | 208 kB | 176 kB | Y | 723 | 13821 | 0 migrationeventlog | migrationeventlog_pkey | 18 | 8192 bytes | 16 kB | Y | 64 | 64 | 64 nodes | nodes_end | 82 | 16 kB | 16 kB | Y | 0 | 0 | 0 nodes | nodes_id_start | 82 | 16 kB | 16 kB | Y | 0 | 0 | 0 nodes | nodes_pkey | 82 | 16 kB | 16 kB | Y | 0 | 0 | 0 rudderproperties | rudderproperties_pkey | 1 | 8192 bytes | 16 kB | Y | 0 | 0 | 0 ruddersysevents | component_idx | 9.30168e+06 | 2598 MB | 326 MB | Y | 8 | 14800 | 0 ruddersysevents | executiondate_idx | 9.30168e+06 | 2598 MB | 204 MB | Y | 0 | 0 | 0 ruddersysevents | executiontimestamp_idx | 9.30168e+06 | 2598 MB | 208 MB | Y | 17851 | 238193641 | 237995230 ruddersysevents | keyvalue_idx | 9.30168e+06 | 2598 MB | 296 MB | Y | 40 | 6367776 | 0 ruddersysevents | nodeid_idx | 9.30168e+06 | 2598 MB | 528 MB | Y | 171 | 89841 | 89841 ruddersysevents | ruddersysevents_pkey | 9.30168e+06 | 2598 MB | 204 MB | Y | 266959 | 36779591 | 36538294 ruddersysevents | ruleid_idx | 9.30168e+06 | 2598 MB | 501 MB | Y | 0 | 0 | 0 rules | rule_end | 105 | 24 kB | 16 kB | Y | 0 | 0 | 0 rules | rule_id_start | 105 | 24 kB | 16 kB | Y | 0 | 0 | 0 rules | rules_pkey | 105 | 24 kB | 16 kB | Y | 0 | 0 | 0 rulesdirectivesjoin | rulesdirectivesjoin_pkey | 385 | 40 kB | 48 kB | Y | 0 | 0 | 0 rulesgroupjoin | rulesgroupjoin_pkey | 158 | 24 kB | 32 kB | Y | 0 | 0 | 0 workflow | | 0 | 0 bytes | | N | | | (37 rows)
Updated by Dennis Cabooter over 11 years ago
# dpkg -l | grep postgres ii postgresql 9.1+129ubuntu1 object-relational SQL database (supported version) ii postgresql-9.1 9.1.9-0ubuntu12.04 object-relational SQL database, version 9.1 server ii postgresql-client-9.1 9.1.9-0ubuntu12.04 front-end programs for PostgreSQL 9.1 ii postgresql-client-common 129ubuntu1 manager for multiple PostgreSQL client versions ii postgresql-common 129ubuntu1 PostgreSQL database-cluster manager
Updated by Dennis Cabooter over 11 years ago
As far as I can see the problem only occurs in case of a system reboot/shutdown.
Updated by Nicolas CHARLES over 11 years ago
Dennis,
We want to make sure that the postgres is the real culprit. I've made a little script that simulates the query made on the list of nodes page. Could you run it into postgres, and make a pastie of the exit ?
It will output a list of : time when the query was made, nodeid, last inventory date
CREATE LANGUAGE plpgsql; CREATE FUNCTION scanDb() RETURNS TEXT LANGUAGE plpgsql AS $$DECLARE DECLARE output TEXT; cur_nodeid varchar; lasttime timestamp; BEGIN FOR cur_nodeid in EXECUTE ' SELECT distinct(nodeid) FROM nodes ' LOOP SELECT into lasttime executionTimeStamp from RudderSysEvents where 1=1 and nodeid = cur_nodeid order by executionTimeStamp desc limit 1; RAISE NOTICE 'At %, performed select on %, last inventory time was % ', timeofday() , cur_nodeid, lasttime; END LOOP; RETURN output; END$$;
To run the script, you just have to execute
SELECT scanDb();
It may be quite long
Updated by Nicolas CHARLES over 11 years ago
Dennis, could you run also this command
/usr/lib/postgresql/9.1/bin/pg_controldata /var/lib/postgresql/9.1/main/
and this one
pg_dumpall > /dev/null 2> dump.err
and past the output of the first command, and the content of the dump.err file ?
Thanks !
Updated by Dennis Cabooter over 11 years ago
# /usr/lib/postgresql/9.1/bin/pg_controldata /var/lib/postgresql/9.1/main/ pg_control version number: 903 Catalog version number: 201105231 Database system identifier: 5867460532166847588 Database cluster state: in production pg_control last modified: Tue 02 Jul 2013 01:01:45 PM CEST Latest checkpoint location: 11/47506F80 Prior checkpoint location: 11/47112978 Latest checkpoint's REDO location: 11/473A0A98 Latest checkpoint's TimeLineID: 1 Latest checkpoint's NextXID: 0/32986218 Latest checkpoint's NextOID: 37224 Latest checkpoint's NextMultiXactId: 1 Latest checkpoint's NextMultiOffset: 0 Latest checkpoint's oldestXID: 670 Latest checkpoint's oldestXID's DB: 1 Latest checkpoint's oldestActiveXID: 0 Time of latest checkpoint: Tue 02 Jul 2013 01:01:13 PM CEST Minimum recovery ending location: 0/0 Backup start location: 0/0 Current wal_level setting: minimal Current max_connections setting: 100 Current max_prepared_xacts setting: 0 Current max_locks_per_xact setting: 64 Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Date/time type storage: 64-bit integers Float4 argument passing: by value Float8 argument passing: by value
Updated by Dennis Cabooter over 11 years ago
This is probably not what you expect:
# cat dump.err pg_dumpall: could not connect to database "template1": FATAL: role "root" does not exist
Updated by Nicolas CHARLES over 11 years ago
Dennis Cabooter wrote:
This is probably not what you expect:
[...]
indeed; could you run it as the user postgres ?
Updated by Dennis Cabooter over 11 years ago
# psql -q -U rudder -h localhost rudder=> select nodeid,max(executiontimestamp) from ruddersysevents group by nodeid; rudder=> CREATE LANGUAGE plpgsql; ERROR: language "plpgsql" already exists rudder=> rudder=> CREATE FUNCTION scanDb() RETURNS TEXT rudder-> LANGUAGE plpgsql rudder-> AS $$DECLARE rudder$> DECLARE rudder$> rudder$> output TEXT; rudder$> cur_nodeid varchar; rudder$> lasttime timestamp; rudder$> rudder$> BEGIN rudder$> FOR cur_nodeid in EXECUTE ' rudder$> SELECT distinct(nodeid) FROM nodes rudder$> ' rudder$> LOOP rudder$> rudder$> SELECT into lasttime executionTimeStamp from RudderSysEvents where 1=1 and nodeid = cur_nodeid order by executionTimeStamp desc limit 1; rudder$> RAISE NOTICE 'At %, performed select on %, last inventory time was % ', timeofday() , cur_nodeid, lasttime; rudder$> END LOOP; rudder$> rudder$> RETURN output; rudder$> END$$; rudder=> SELECT scanDb(); NOTICE: At Tue Jul 02 11:38:05.101472 2013 CEST, performed select on f538a113-eba7-421e-9750-6abce6c50252, last inventory time was 2013-07-02 11:35:33 NOTICE: At Tue Jul 02 11:38:05.102514 2013 CEST, performed select on c821ed95-698b-4368-b510-7b9b5c47e224, last inventory time was 2013-07-02 11:35:49 NOTICE: At Tue Jul 02 12:08:42.269042 2013 CEST, performed select on # The content of this file should be replaced with a valid U, last inventory time was <NULL> NOTICE: At Tue Jul 02 12:08:42.270656 2013 CEST, performed select on c8ef418b-8dfd-48a0-8e44-945e46d4b93b, last inventory time was 2013-07-02 12:05:56 NOTICE: At Tue Jul 02 12:08:42.272259 2013 CEST, performed select on 09a115e7-8322-4089-9d7d-00d647ece18f, last inventory time was 2013-07-02 12:05:22 NOTICE: At Tue Jul 02 12:08:42.272602 2013 CEST, performed select on 281d598a-3d4f-4cdb-acc1-6a19087b4ea1, last inventory time was 2013-07-02 12:06:50 NOTICE: At Tue Jul 02 12:08:42.273626 2013 CEST, performed select on b977ee9b-91e2-49f9-92ae-ff8c287eff78, last inventory time was 2013-07-02 12:05:49 NOTICE: At Tue Jul 02 12:08:42.274354 2013 CEST, performed select on 31957caa-520b-4db5-b8a7-66a1548de1b2, last inventory time was 2013-07-02 12:06:02 NOTICE: At Tue Jul 02 12:08:42.275071 2013 CEST, performed select on e8c59d61-09a2-437f-ad69-7398f13b7dec, last inventory time was 2013-07-02 12:05:59 NOTICE: At Tue Jul 02 12:08:42.276161 2013 CEST, performed select on 2500fd48-2905-42ba-bfdf-e5c170b9a176, last inventory time was 2013-07-02 12:05:31 NOTICE: At Tue Jul 02 12:08:42.277094 2013 CEST, performed select on fed5f844-de90-47d1-83fd-aa205f192633, last inventory time was 2013-07-02 12:05:46 NOTICE: At Tue Jul 02 12:08:42.278567 2013 CEST, performed select on 16cfe090-fecf-4433-8465-aeff1e5ef5b6, last inventory time was 2013-07-02 12:05:15 NOTICE: At Tue Jul 02 12:08:42.279118 2013 CEST, performed select on 80cd9420-65af-404d-bfdc-926fb0fd6364, last inventory time was 2013-07-02 12:06:22 NOTICE: At Tue Jul 02 12:08:42.279914 2013 CEST, performed select on 8587523a-f9e3-4384-b89d-b0c0a697078e, last inventory time was 2013-07-02 12:05:55 NOTICE: At Tue Jul 02 12:08:42.280485 2013 CEST, performed select on 4d35582b-98b2-4538-95fb-77410305225b, last inventory time was 2013-07-02 12:06:06 NOTICE: At Tue Jul 02 12:34:26.663505 2013 CEST, performed select on ba09ede2-4abf-40f4-97e8-5cbabcc5746c, last inventory time was <NULL> NOTICE: At Tue Jul 02 12:58:31.153248 2013 CEST, performed select on e97b7188-131c-4f0d-9c82-76cf2d12c8ff, last inventory time was <NULL> NOTICE: At Tue Jul 02 12:58:31.156417 2013 CEST, performed select on ea321759-5892-4239-a954-4473a8daabee, last inventory time was 2013-07-02 12:55:21 NOTICE: At Tue Jul 02 12:58:31.157089 2013 CEST, performed select on 51f46035-169e-41cf-9c5f-c9a736554740, last inventory time was 2013-07-02 12:56:10 NOTICE: At Tue Jul 02 12:58:31.158340 2013 CEST, performed select on 2286a209-5535-461f-bfe7-1df01071c183, last inventory time was 2013-07-02 12:55:38 NOTICE: At Tue Jul 02 13:23:36.257280 2013 CEST, performed select on 82a32926-0aaa-49cb-bc5f-0c79e61a5b6e, last inventory time was <NULL> NOTICE: At Tue Jul 02 13:23:36.258265 2013 CEST, performed select on b748549d-5229-49a0-8c07-9efc17de79c5, last inventory time was 2013-07-02 13:21:26 NOTICE: At Tue Jul 02 13:23:36.260109 2013 CEST, performed select on 593604d8-f885-44a7-b186-ae6fe885415c, last inventory time was 2013-07-02 13:20:18 NOTICE: At Tue Jul 02 13:23:36.261480 2013 CEST, performed select on ac91fc05-0043-4238-8de8-c1289ec00d67, last inventory time was 2013-07-02 13:20:27 NOTICE: At Tue Jul 02 13:23:36.262462 2013 CEST, performed select on 009e0af1-1a3c-4ee0-a91d-10b68a008f5d, last inventory time was 2013-07-02 13:20:43 NOTICE: At Tue Jul 02 13:47:38.688213 2013 CEST, performed select on 868a518f-93b3-4d00-af57-2253b529e8c7, last inventory time was <NULL> NOTICE: At Tue Jul 02 14:09:53.889301 2013 CEST, performed select on 32a37cb6-6e56-475f-989e-ddb6e644f1f1, last inventory time was <NULL> NOTICE: At Tue Jul 02 14:09:53.890917 2013 CEST, performed select on 073239d7-c633-4b60-8477-57131839fd34, last inventory time was 2013-07-02 14:05:40 NOTICE: At Tue Jul 02 14:09:53.891753 2013 CEST, performed select on a99e9d1d-3030-44b4-90e3-859d23e3c16b, last inventory time was 2013-07-02 14:05:42 NOTICE: At Tue Jul 02 14:09:53.892109 2013 CEST, performed select on 987801cc-0bd5-4029-8cb9-21b9504fe177, last inventory time was 2013-07-02 14:06:20 NOTICE: At Tue Jul 02 14:09:53.892580 2013 CEST, performed select on 303d5379-b768-40fb-9fa0-8d05722405fb, last inventory time was 2013-07-02 14:06:11 NOTICE: At Tue Jul 02 14:09:53.893072 2013 CEST, performed select on 906a1318-b935-4862-b779-1b427a0b757a, last inventory time was 2013-07-02 14:06:07 NOTICE: At Tue Jul 02 14:32:03.690490 2013 CEST, performed select on 0d719356-3373-4dd3-b8f0-5ef54c8b363c, last inventory time was <NULL> NOTICE: At Tue Jul 02 14:32:03.692021 2013 CEST, performed select on f26bac0e-3e6d-453e-81fa-0511e9291e79, last inventory time was 2013-07-02 14:30:22 NOTICE: At Tue Jul 02 14:32:03.692240 2013 CEST, performed select on 37cc2d4e-d5f2-401e-af72-a7c44eca00ae, last inventory time was 2013-07-02 14:31:35 NOTICE: At Tue Jul 02 14:32:03.693037 2013 CEST, performed select on 972e9124-e59b-4e9b-a857-6d0c0b4db9b1, last inventory time was 2013-07-02 14:30:41 NOTICE: At Tue Jul 02 14:53:43.150966 2013 CEST, performed select on bafcbc77-e7f6-4d0b-bc9d-ad4513ee24d3, last inventory time was <NULL> NOTICE: At Tue Jul 02 14:53:43.152075 2013 CEST, performed select on 120dab37-1922-469e-8c76-b77250742190, last inventory time was 2013-07-02 14:51:19 NOTICE: At Tue Jul 02 15:14:38.262659 2013 CEST, performed select on a6f072f2-3410-4dec-a7d4-e431fdb28526, last inventory time was <NULL> NOTICE: At Tue Jul 02 15:14:38.264771 2013 CEST, performed select on 6bee8a86-b2a0-4751-86a9-41c2d14506e5, last inventory time was 2013-07-02 15:10:42 NOTICE: At Tue Jul 02 15:14:38.265928 2013 CEST, performed select on f84d4283-f6e8-4e54-bc3a-8509299add6e, last inventory time was 2013-07-02 15:10:50 NOTICE: At Tue Jul 02 15:38:01.378342 2013 CEST, performed select on 902eecb3-8d31-494e-acc4-e79d3739549d, last inventory time was <NULL> NOTICE: At Tue Jul 02 15:38:01.380795 2013 CEST, performed select on 9073b71d-823b-4caa-9e4a-3925369c779a, last inventory time was 2013-07-02 15:35:28 NOTICE: At Tue Jul 02 15:38:01.381281 2013 CEST, performed select on 57a6e9a5-64dc-4109-9824-4177e533901f, last inventory time was 2013-07-02 15:36:44 NOTICE: At Tue Jul 02 15:45:32.933143 2013 CEST, performed select on 693018e3-f652-4e28-9503-f3094082af0d, last inventory time was 2013-06-24 13:21:04 NOTICE: At Tue Jul 02 15:45:32.933663 2013 CEST, performed select on b52adaac-eea1-4b76-a30c-5ba7ed5d836e, last inventory time was 2013-07-02 15:41:25 NOTICE: At Tue Jul 02 15:57:19.107623 2013 CEST, performed select on 0fa41cb8-029a-4055-a6ba-00a825a87b5c, last inventory time was 2013-06-21 11:30:53 NOTICE: At Tue Jul 02 15:57:19.115473 2013 CEST, performed select on 44345f44-a515-442e-8b9e-31f3f2270a5f, last inventory time was 2013-07-02 15:55:35 scandb -------- (1 row) rudder=>
Updated by Nicolas CHARLES over 11 years ago
Ok, if the dump.err is empty, it probably means that there is no corruption.
Rebuilding the indexes didn't help either.
Maybe it's simply a performance issue. A you said, you have only 558 MB used out of 2 GB available. So we'll boost the postgres config, to see if there are any improvement
In the file /etc/postgresql/9.1/main/postgresql.conf , could you set the folloxing parameters :
work_mem = 8MB max_stack_depth = 4MB effective_cache_size = 256MB
This should help a lot.
Then restart postgres : /etc/init.d/postgresql restart
I you can, try also try to set
shared_buffers=54MB
(you might find http://www.postgresql.org/docs/9.1/static/kernel-resources.html#SYSVIPC handy for this)
Updated by Dennis Cabooter over 11 years ago
Done.
select * from expectedreportsnodes where nodejoinkey = (select max(nodejoinkey) from expectedreports where ruleid = 'hasPolicyServer-root');
nodejoinkey | nodeid -------------+-------------------------------------- 1534 | 2500fd48-2905-42ba-bfdf-e5c170b9a176 1534 | 0fa41cb8-029a-4055-a6ba-00a825a87b5c 1534 | c821ed95-698b-4368-b510-7b9b5c47e224 1534 | 073239d7-c633-4b60-8477-57131839fd34 1534 | 009e0af1-1a3c-4ee0-a91d-10b68a008f5d 1534 | 987801cc-0bd5-4029-8cb9-21b9504fe177 1534 | b977ee9b-91e2-49f9-92ae-ff8c287eff78 1534 | f84d4283-f6e8-4e54-bc3a-8509299add6e 1534 | c8ef418b-8dfd-48a0-8e44-945e46d4b93b 1534 | 57a6e9a5-64dc-4109-9824-4177e533901f 1534 | 281d598a-3d4f-4cdb-acc1-6a19087b4ea1 1534 | fed5f844-de90-47d1-83fd-aa205f192633 1534 | a99e9d1d-3030-44b4-90e3-859d23e3c16b 1534 | 303d5379-b768-40fb-9fa0-8d05722405fb 1534 | e8c59d61-09a2-437f-ad69-7398f13b7dec 1534 | 593604d8-f885-44a7-b186-ae6fe885415c 1534 | 9073b71d-823b-4caa-9e4a-3925369c779a 1534 | 2286a209-5535-461f-bfe7-1df01071c183 1534 | b52adaac-eea1-4b76-a30c-5ba7ed5d836e 1534 | 120dab37-1922-469e-8c76-b77250742190 1534 | 6bee8a86-b2a0-4751-86a9-41c2d14506e5 1534 | 16cfe090-fecf-4433-8465-aeff1e5ef5b6 1534 | 37cc2d4e-d5f2-401e-af72-a7c44eca00ae 1534 | ea321759-5892-4239-a954-4473a8daabee 1534 | 51f46035-169e-41cf-9c5f-c9a736554740 1534 | b748549d-5229-49a0-8c07-9efc17de79c5 1534 | 8587523a-f9e3-4384-b89d-b0c0a697078e 1534 | 906a1318-b935-4862-b779-1b427a0b757a 1534 | f26bac0e-3e6d-453e-81fa-0511e9291e79 1534 | 972e9124-e59b-4e9b-a857-6d0c0b4db9b1 1534 | ac91fc05-0043-4238-8de8-c1289ec00d67 1534 | root 1534 | 31957caa-520b-4db5-b8a7-66a1548de1b2 1534 | 80cd9420-65af-404d-bfdc-926fb0fd6364 1534 | f538a113-eba7-421e-9750-6abce6c50252 1534 | 4d35582b-98b2-4538-95fb-77410305225b 1534 | 44345f44-a515-442e-8b9e-31f3f2270a5f 1534 | 09a115e7-8322-4089-9d7d-00d647ece18f (38 rows)
Updated by Dennis Cabooter over 11 years ago
The settings mentioned above do not solve the webapp time out problem.
Updated by Nicolas CHARLES over 11 years ago
There is something odd with the node 0fa41cb8-029a-4055-a6ba-00a825a87b5c
Last time it sent messages to the server was 21/06/2013, and it was quite slow to fetch its reports from the database
Updated by Dennis Cabooter over 11 years ago
Finally we found the bugger (thanks ncharles :))
Let me summarize what happened:
- the webapp did not respond
- the database was ok (checked)
- one rudder-agent on a node wasn't correctly installed (that should not be a problem)
- In the webapp all statuses of that node are in No Answer, the technical logs tab timed out
- I did "apt-get install rudder-agent" on the node
- All webapp timeouts are gone
Updated by Dennis Cabooter over 11 years ago
rudder=> explain analyse select * from ruddersysevents where nodeid = '0fa41cb8-029a-4055-a6ba-00a825a87b5c' order by executiontimestamp desc limit 1000;
QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..5035.81 rows=1000 width=237) (actual time=0.440..477240.725 rows=1000 loops=1) -> Index Scan Backward using executiontimestamp_idx on ruddersysevents (cost=0.00..618951.93 rows=122910 width=237) (actual time=0.435..477240.483 rows=1000 loops=1) Filter: (nodeid = '0fa41cb8-029a-4055-a6ba-00a825a87b5c'::text) Total runtime: 477263.399 ms (4 rows)
Updated by Nicolas CHARLES over 11 years ago
The agent somehow didn't work on node 0fa41cb8-029a-4055-a6ba-00a825a87b5c, and didn't send any reports to the server from 21/06/2013
Looking in the webapp for this node shows everything in no answer, and the technical log timed out
Reinstalling agent made it send reports to the server, and solved the timeout for the search, but not the technical logs
My guess (i might be wrong) is that postgresql is aggressively optimizing index for recent reports, as they are the most used; looking for old data make it more difficult to look at
Updated by Dennis Cabooter over 11 years ago
rudder=> drop index executiontimestamp_idx; CREATE INDEX executionTimeStamp_idx on RudderSysEvents (executionTimeStamp); DROP INDEX CREATE INDEX
rudder=> explain analyse select * from ruddersysevents where nodeid = '0fa41cb8-029a-4055-a6ba-00a825a87b5c' order by executiontimestamp desc limit 1000;
QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..5014.16 rows=1000 width=237) (actual time=0.425..442442.124 rows=1000 loops=1) -> Index Scan Backward using executiontimestamp_idx on ruddersysevents (cost=0.00..616836.88 rows=123019 width=237) (actual time=0.424..442441.988 rows=1000 loops=1) Filter: (nodeid = '0fa41cb8-029a-4055-a6ba-00a825a87b5c'::text) Total runtime: 442442.230 ms (4 rows)
Updated by Dennis Cabooter over 11 years ago
rudder=> CREATE INDEX composite_node_execution_idx on RudderSysEvents(nodeid, executionTimeStamp); CREATE INDEX
rudder=> explain analyse select * from ruddersysevents where nodeid = '0fa41cb8-029a-4055-a6ba-00a825a87b5c' order by executiontimestamp desc limit 1000;
QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Limit (cost=0.00..3770.62 rows=1000 width=237) (actual time=0.067..603.994 rows=1000 loops=1) -> Index Scan Backward using composite_node_execution_idx on ruddersysevents (cost=0.00..464445.55 rows=123175 width=237) (actual time=0.065..603.838 rows=1000 loops=1) Index Cond: (nodeid = '0fa41cb8-029a-4055-a6ba-00a825a87b5c'::text) Total runtime: 618.914 ms (4 rows)
Updated by Nicolas CHARLES over 11 years ago
That's much better !
Based on this, we'll make some extensive study on our use of indexes, I guess there are some changes that are due to be done, since we changed many things since when we created them (and we learnt a lot since then too)
Updated by Nicolas PERRON over 11 years ago
- Target version changed from 2.6.3 to 2.6.4
Updated by Nicolas PERRON over 11 years ago
- Target version changed from 2.6.4 to 2.6.5
Updated by Andrew Cranson over 11 years ago
I'm also seeing the same problem (as reported in #3902) while running Rudder 2.7.1 (most agents 2.7.0), around 400 nodes, CentOS 6 x86_64 with postgres 8.4.13 (CentOS version).
If we can help test any proposed fixes for this please feel free to contact me as this is having a significant negative effect on our usage of Rudder so far.
Thanks!
Updated by Nicolas CHARLES over 11 years ago
The fix suggested above should work: in postgres, on the rudder database, you run the following command
drop index executiontimestamp_idx; CREATE INDEX executionTimeStamp_idx on RudderSysEvents (executionTimeStamp);
We didn't have any side effect with this, and should greatly improve perfs
Updated by Andrew Cranson over 11 years ago
Sorry, I didn't see your update (i forgot to 'watch' this). I was just coming back to say that the fix suggested is working great. This has fixed the issue with loading the node list and regex searches on software versions ("The server cannot be contacted at this time"). MUCH better performance. Thanks!
Updated by Nicolas PERRON over 11 years ago
- Target version changed from 2.6.5 to 2.6.6
Updated by François ARMAND over 11 years ago
- Category set to Web - Compliance & node report
- Status changed from New to 8
- Assignee set to Nicolas CHARLES
Given the feedback, I believe we should go that way. What do you thing Nico ?
Updated by Nicolas CHARLES over 11 years ago
The correct fix is:
drop index executiontimestamp_idx; CREATE INDEX composite_node_execution_idx on RudderSysEvents(nodeid, executionTimeStamp);
Updated by Nicolas CHARLES over 11 years ago
- Status changed from 8 to In progress
It seems clear that it's reliable and ought to be merged
Updated by Nicolas CHARLES over 11 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from Nicolas CHARLES to François ARMAND
- Pull Request set to https://github.com/Normation/rudder/pull/313
PR is there: https://github.com/Normation/rudder/pull/313
Updated by François ARMAND over 11 years ago
Why not do that in 2.4 ? Even if information used in node list are not used yet, isn't there any other place where these indexes may have an interest (like in reports or something ?)
Updated by Nicolas CHARLES over 11 years ago
2.4 shouldn't benefit from it, as it is very useful for the list of nodes, and it seems to have little to none advantages for typical reports
But I can be wrong.
Updated by François ARMAND over 11 years ago
OK, let's do like that. User in 2.4 would have reported since the time...
Updated by Nicolas CHARLES over 11 years ago
- Status changed from Pending technical review to Pending release
- % Done changed from 0 to 100
Applied in changeset 40ed6ddbee92b6a42eaf667b8077b10005c85f84.
Updated by Anonymous over 11 years ago
Applied in changeset 27e2c8c6f51cba1d728b5cee638ee9537f8e802d.
Updated by Nicolas CHARLES over 11 years ago
- Status changed from Pending release to In progress
- Assignee changed from François ARMAND to Nicolas CHARLES
Ha, there's something to correct there probably, removing the index on executionTimeStamp only did break stuff for one user (slowing drastically the rules page)
Updated by Nicolas CHARLES over 11 years ago
- rules pages slower (about 50% slower)
- reports archive page very very slow (> 1minutes against instanteneous before)
Having both indexes seems to be the correct solution.
I'm reverting this commit and creating the new one
Updated by Nicolas CHARLES over 11 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from Nicolas CHARLES to François ARMAND
- Pull Request changed from https://github.com/Normation/rudder/pull/313 to https://github.com/Normation/rudder/pull/314
Updated by François ARMAND over 11 years ago
- Subject changed from The server cannot be contacted at this time to Missing index on DB for "reports by nodes" leads to timeout for node list page
Updated by Nicolas CHARLES over 11 years ago
- Status changed from Pending technical review to Pending release
- % Done changed from 0 to 100
Applied in changeset 3ed9266032095a4a6ecba29ea5c7789f3b51a519.
Updated by François ARMAND over 11 years ago
The behaviour leading to readding the index on node only is describe in details in #3926.
Updated by Nicolas PERRON over 11 years ago
- Status changed from Pending release to Released