Project

General

Profile

Actions

Bug #7793

closed

Random wrong or partial copy of promises from node to server

Added by Nicolas CHARLES over 5 years ago. Updated over 5 years ago.

Status:
Released
Priority:
N/A
Category:
Agent
Target version:
Severity:
User visibility:
Effort required:
Priority:

Description

I had a case where I scratched my head for a few minutes: reporting told me I had unexpected reports for one rule, but they looked correct. This messages first appeared witha missing "Update" reports

The messages for unexpected never disapeared.
However, drilling down in the issue, I noticed the serial did not match between generated promises and actual promises (but the config ID did match)
This does strongly sounds like a partial copy (or invalid copy) of promises

This is random, and happens quite frequently (10 times out of 84 runs on one of my platform)
(happened on 3.2.0, but may be on any version)


Files

debug (12.2 KB) debug debug output from server Nicolas CHARLES, 2016-01-17 21:41

Subtasks 2 (0 open2 closed)

User story #7797: Warn when the servers answers an empty string to a (S)OPENDIRReleasedJonathan CLARKE2016-01-18Actions
User story #7798: Remove the patch to warn when the servers answers an empty string to a (S)OPENDIR from 3.2ReleasedJonathan CLARKE2016-01-18Actions

Related issues

Related to Rudder - Bug #7796: Rudder agent does not detect when promise copy failedReleasedNicolas CHARLES2016-01-18Actions
Actions #1

Updated by Nicolas CHARLES over 5 years ago

logs show this message
error: Empty first server packet when listing directory '/var/rudder/share/c45e0e9f-5d9b-4685-b26c-35a3dcf35954/rules/cfengine-community/clockConfiguration/3.0'!

Actions #2

Updated by Nicolas CHARLES over 5 years ago

  • Assignee set to Alexis MOUSSET
Actions #3

Updated by Nicolas CHARLES over 5 years ago

server did not crash

root      6639  0.2  0.4 458236  7320 ?        Ss   15:28   0:17 /var/rudder/cfengine-community/bin/cf-serverd

Actions #4

Updated by Nicolas CHARLES over 5 years ago

promise generation was at

2016-01-15 17:09:51.786+00

but the failing copy at
2016-01-15 17:11:37+00:00

so this should not be related

Actions #5

Updated by Nicolas CHARLES over 5 years ago

This seems to be quite common on my relay

cf_relay_rudder_local__1452876910_Fri_Jan_15_16_55_10_2016_0x7fda26b9e700:   error: Empty first server packet when listing directory '/var/rudder/configuration-repository/shared-files'!
cf_relay_rudder_local__1452877517_Fri_Jan_15_17_05_17_2016_0x7fda26b9e700:   error: Empty first server packet when listing directory '/usr/share/ncf/tree/30_generic_methods'!
cf_relay_rudder_local__1452879326_Fri_Jan_15_17_35_26_2016_0x7fda26b9e700:   error: Empty first server packet when listing directory '/usr/share/ncf/tree/60_services/baseline/stable'!

(note that the failing node was directly connected to the server)

Actions #6

Updated by Nicolas CHARLES over 5 years ago

On the relay, it did detect that the copy was invalid, but not on the node, (or a wring condition is used on the node)
so the file rudder_promises_generated remains, and it is never solved (until next regeneration)

Actions #7

Updated by Nicolas CHARLES over 5 years ago

This also happens on CFEngine Enterprise 3.7.2

Actions #8

Updated by Nicolas CHARLES over 5 years ago

More details:
On se server side:

rudder  verbose: 192.168.41.3> Host relay.rudder.local granted access to /var/rudder/configuration-repository/ncf/ncf_hash_file
rudder  verbose: 192.168.41.3> Filename /var/rudder/configuration-repository/ncf/ncf_hash_file is resolved to /var/rudder/configuration-repository/ncf/ncf_hash_file
rudder  verbose: 192.168.41.3> Found a matching rule in access list (/var/rudder/configuration-repository/ncf/ncf_hash_file in /var/rudder/configuration-repository/ncf)
rudder  verbose: 192.168.41.3> Mapping root privileges to access non-root files
rudder  verbose: 192.168.41.3> Host relay.rudder.local granted access to /var/rudder/configuration-repository/ncf/ncf_hash_file
rudder  verbose: Obtained IP address of '192.168.41.3' on socket 9 from accept
rudder  verbose: New connection (from 192.168.41.3, sd 9), spawning new thread...
rudder     info: 192.168.41.3> Accepting connection
rudder  verbose: 192.168.41.3> Setting socket timeout to 600 seconds.
rudder  verbose: 192.168.41.3> Peeked CAUTH in TCP stream, considering the protocol as Classic
rudder  verbose: 192.168.41.3> Peer's identity is: MD5=00600d9ee1389841e95ad24fe88c1d68
rudder  verbose: 192.168.41.3> A public key was already known from relay.rudder.local/192.168.41.3 - no trust required
rudder  verbose: 192.168.41.3> The public key identity was confirmed as root@relay.rudder.local
rudder  verbose: 192.168.41.3> Authentication of client relay.rudder.local/192.168.41.3 achieved
rudder  verbose: 192.168.41.3> Filename /usr/share/ncf/tree is resolved to /usr/share/ncf/tree
rudder  verbose: 192.168.41.3> Found a matching rule in access list (/usr/share/ncf/tree in /usr/share/ncf/tree)
rudder  verbose: 192.168.41.3> Mapping root privileges to access non-root files
rudder  verbose: 192.168.41.3> Host relay.rudder.local granted access to /usr/share/ncf/tree
rudder  verbose: 192.168.41.3> Filename /usr/share/ncf/tree is resolved to /usr/share/ncf/tree
rudder  verbose: 192.168.41.3> Found a matching rule in access list (/usr/share/ncf/tree in /usr/share/ncf/tree)
rudder  verbose: 192.168.41.3> Mapping root privileges to access non-root files
rudder  verbose: 192.168.41.3> Host relay.rudder.local granted access to /usr/share/ncf/tree
rudder  verbose: Obtained IP address of '192.168.41.3' on socket 10 from accept
rudder  verbose: New connection (from 192.168.41.3, sd 10), spawning new thread...
rudder     info: 192.168.41.3> Accepting connection
rudder  verbose: 192.168.41.3> Setting socket timeout to 600 seconds.
rudder  verbose: 192.168.41.3> Peeked CAUTH in TCP stream, considering the protocol as Classic
rudder  verbose: 192.168.41.3> Peer's identity is: MD5=00600d9ee1389841e95ad24fe88c1d68
rudder  verbose: 192.168.41.3> A public key was already known from relay.rudder.local/192.168.41.3 - no trust required
rudder  verbose: 192.168.41.3> The public key identity was confirmed as root@relay.rudder.local
rudder  verbose: 192.168.41.3> Authentication of client relay.rudder.local/192.168.41.3 achieved

on the client side

R: @@Common@@result_success@@hasPolicyServer-root@@common-root@@16@@Binaries update@@None@@2016-01-17 20:14:04+00:00##5f852b2b-3120-4a69-8f44-f4e403b6c9ad@#The CFengine binaries in /var/rudder/cfengine-community/bin are up to date
   error: Empty first server packet when listing directory '/usr/share/ncf/tree'!
R: @@DistributePolicy@@result_na@@5f852b2b-3120-4a69-8f44-f4e403b6c9ad-DP@@5f852b2b-3120-4a69-8f44-f4e403b6c9ad-distributePolicy@@13@@Configure ncf@@None@@2016-01-17 20:14:04+00:00##5f852b2b-3120-4a69-8f44-f4e403b6c9ad@#Configuration of ncf is not necessary on non root servers
R: @@DistributePolicy@@result_error@@5f852b2b-3120-4a69-8f44-f4e403b6c9ad-DP@@5f852b2b-3120-a69-8f44-f4e403b6c9ad-distributePolicy@@13@@Propagate promises@@None@@2016-01-17 20:14:04+00:00##5f852b2b-3120-4a69-8f44-f4e403b6c9ad@#Cannot propagate common ncf promises

It looks as if the connection is reinitialized, or agent tries to connect twice.

Actions #9

Updated by Nicolas CHARLES over 5 years ago

  • File debug debug added
  • Priority changed from N/A to 1

Attached, debug output from the server.
The client failed to fetch error: Empty first server packet when listing directory '/usr/share/ncf/tree/60_services/baseline/stable'!

I'm wondering if it is not the agent that fails ..

Actions #10

Updated by Nicolas CHARLES over 5 years ago

  • Description updated (diff)
Actions #11

Updated by Jonathan CLARKE over 5 years ago

This is indeed worrisome.

I found the origin of that message in cf-agent code:

libcfnet/client_code.c:354:
293 /* Returning NULL (an empty list) does not mean empty directory but ERROR,
294  * since every directory has to contain at least . and .. */
295 Item *RemoteDirList(const char *dirname, bool encrypt, AgentConnection *conn)
296 {
...
351         if (recvbuffer[0] == '\0')
352         {
353             Log(LOG_LEVEL_ERR,
354                 "Empty%s server packet when listing directory '%s'!",
355                 (start == NULL) ? " first" : "",
356                 dirname);
357             goto err;
358         }                                                                                                                                                                                                                                 
...
405   err:                                                         /* free list */
406     for (Item *ip = start; ip != NULL; ip = start)
407     {
408         start = ip->next;
409         free(ip->name);
410         free(ip);
411     }

So there is indeed an error printed out, but as far as I can tell the connection is not marked as bad or closed.

The cause for the error is a packet from the server containing exactly nothing (just the final '\0' byte). This basically means the error is on the server side, because it is either sending an empty packet when it shouldn't, or possibly sending the right packet that happens to have a "\0" as it's first byte.

Nico, what exact versions of the agent and the server are you using? Have you tried varying server versions or just agent versions in above tests?

Actions #12

Updated by Jonathan CLARKE over 5 years ago

If you can reproduce this pretty often, we can build a patched version of cf-serverd (and cf-agent, if needed) and print debug information to figure out what is really going on. Or just run cf-serverd under gdb, but that could be very tiresome if we don't know whether the bug is occuring or not in a given run, since don't yet have a way of detecting that on the server side.

Actions #13

Updated by Nicolas CHARLES over 5 years ago

so far, i reproduiced it with any combinaision of 3.7.x <-> 3.7.x
I'm building a 3.6 to validate there

Actions #14

Updated by Jonathan CLARKE over 5 years ago

Could you run a find on the directories that cause this "Empty first server packet when listing directory" on the server please? That is, from your logs above:

  • /usr/share/ncf/tree/60_services/baseline/stable
  • /usr/share/ncf/tree

at least, and attach them to this bug?

One option I see from reading the source code is a filename longer than 256 chars:

int CfOpenDirectory(ServerConnectionState *conn, char *sendbuffer, char *oldDirname)
...
        /* TODO fix copying names greater than 256. */
Actions #15

Updated by Alexis MOUSSET over 5 years ago

  • Status changed from New to In progress

The problem is that the check for the NULL character is done before decrypting the server response (https://github.com/cfengine/core/blob/3.7.x/libcfnet/client_code.c#L351). Preparing a fix.

Actions #16

Updated by Alexis MOUSSET over 5 years ago

Upstream PR: https://github.com/cfengine/core/pull/2475. Preparing a fix for Rudder 3.2.

Actions #17

Updated by Alexis MOUSSET over 5 years ago

  • Target version set to 3.2.0
Actions #18

Updated by Alexis MOUSSET over 5 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Alexis MOUSSET to Jonathan CLARKE
  • Pull Request set to https://github.com/Normation/rudder-packages/pull/862
Actions #19

Updated by Alexis MOUSSET over 5 years ago

  • Related to Bug #7796: Rudder agent does not detect when promise copy failed added
Actions #20

Updated by Alexis MOUSSET over 5 years ago

  • Status changed from Pending technical review to Pending release
  • % Done changed from 0 to 100
Actions #21

Updated by Nicolas CHARLES over 5 years ago

\o/

Actions #22

Updated by Nicolas CHARLES over 5 years ago

Actions #23

Updated by Vincent MEMBRÉ over 5 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 3.2.0 which were released today.

Actions

Also available in: Atom PDF