Bug #7793
closedRandom wrong or partial copy of promises from node to server
Description
I had a case where I scratched my head for a few minutes: reporting told me I had unexpected reports for one rule, but they looked correct. This messages first appeared witha missing "Update" reports
The messages for unexpected never disapeared.
However, drilling down in the issue, I noticed the serial did not match between generated promises and actual promises (but the config ID did match)
This does strongly sounds like a partial copy (or invalid copy) of promises
This is random, and happens quite frequently (10 times out of 84 runs on one of my platform)
(happened on 3.2.0, but may be on any version)
Files
Updated by Nicolas CHARLES about 9 years ago
logs show this message
error: Empty first server packet when listing directory '/var/rudder/share/c45e0e9f-5d9b-4685-b26c-35a3dcf35954/rules/cfengine-community/clockConfiguration/3.0'!
Updated by Nicolas CHARLES about 9 years ago
server did not crash
root 6639 0.2 0.4 458236 7320 ? Ss 15:28 0:17 /var/rudder/cfengine-community/bin/cf-serverd
Updated by Nicolas CHARLES about 9 years ago
promise generation was at
2016-01-15 17:09:51.786+00
but the failing copy at
2016-01-15 17:11:37+00:00
so this should not be related
Updated by Nicolas CHARLES about 9 years ago
This seems to be quite common on my relay
cf_relay_rudder_local__1452876910_Fri_Jan_15_16_55_10_2016_0x7fda26b9e700: error: Empty first server packet when listing directory '/var/rudder/configuration-repository/shared-files'! cf_relay_rudder_local__1452877517_Fri_Jan_15_17_05_17_2016_0x7fda26b9e700: error: Empty first server packet when listing directory '/usr/share/ncf/tree/30_generic_methods'! cf_relay_rudder_local__1452879326_Fri_Jan_15_17_35_26_2016_0x7fda26b9e700: error: Empty first server packet when listing directory '/usr/share/ncf/tree/60_services/baseline/stable'!
(note that the failing node was directly connected to the server)
Updated by Nicolas CHARLES about 9 years ago
On the relay, it did detect that the copy was invalid, but not on the node, (or a wring condition is used on the node)
so the file rudder_promises_generated remains, and it is never solved (until next regeneration)
Updated by Nicolas CHARLES about 9 years ago
This also happens on CFEngine Enterprise 3.7.2
Updated by Nicolas CHARLES about 9 years ago
More details:
On se server side:
rudder verbose: 192.168.41.3> Host relay.rudder.local granted access to /var/rudder/configuration-repository/ncf/ncf_hash_file rudder verbose: 192.168.41.3> Filename /var/rudder/configuration-repository/ncf/ncf_hash_file is resolved to /var/rudder/configuration-repository/ncf/ncf_hash_file rudder verbose: 192.168.41.3> Found a matching rule in access list (/var/rudder/configuration-repository/ncf/ncf_hash_file in /var/rudder/configuration-repository/ncf) rudder verbose: 192.168.41.3> Mapping root privileges to access non-root files rudder verbose: 192.168.41.3> Host relay.rudder.local granted access to /var/rudder/configuration-repository/ncf/ncf_hash_file rudder verbose: Obtained IP address of '192.168.41.3' on socket 9 from accept rudder verbose: New connection (from 192.168.41.3, sd 9), spawning new thread... rudder info: 192.168.41.3> Accepting connection rudder verbose: 192.168.41.3> Setting socket timeout to 600 seconds. rudder verbose: 192.168.41.3> Peeked CAUTH in TCP stream, considering the protocol as Classic rudder verbose: 192.168.41.3> Peer's identity is: MD5=00600d9ee1389841e95ad24fe88c1d68 rudder verbose: 192.168.41.3> A public key was already known from relay.rudder.local/192.168.41.3 - no trust required rudder verbose: 192.168.41.3> The public key identity was confirmed as root@relay.rudder.local rudder verbose: 192.168.41.3> Authentication of client relay.rudder.local/192.168.41.3 achieved rudder verbose: 192.168.41.3> Filename /usr/share/ncf/tree is resolved to /usr/share/ncf/tree rudder verbose: 192.168.41.3> Found a matching rule in access list (/usr/share/ncf/tree in /usr/share/ncf/tree) rudder verbose: 192.168.41.3> Mapping root privileges to access non-root files rudder verbose: 192.168.41.3> Host relay.rudder.local granted access to /usr/share/ncf/tree rudder verbose: 192.168.41.3> Filename /usr/share/ncf/tree is resolved to /usr/share/ncf/tree rudder verbose: 192.168.41.3> Found a matching rule in access list (/usr/share/ncf/tree in /usr/share/ncf/tree) rudder verbose: 192.168.41.3> Mapping root privileges to access non-root files rudder verbose: 192.168.41.3> Host relay.rudder.local granted access to /usr/share/ncf/tree rudder verbose: Obtained IP address of '192.168.41.3' on socket 10 from accept rudder verbose: New connection (from 192.168.41.3, sd 10), spawning new thread... rudder info: 192.168.41.3> Accepting connection rudder verbose: 192.168.41.3> Setting socket timeout to 600 seconds. rudder verbose: 192.168.41.3> Peeked CAUTH in TCP stream, considering the protocol as Classic rudder verbose: 192.168.41.3> Peer's identity is: MD5=00600d9ee1389841e95ad24fe88c1d68 rudder verbose: 192.168.41.3> A public key was already known from relay.rudder.local/192.168.41.3 - no trust required rudder verbose: 192.168.41.3> The public key identity was confirmed as root@relay.rudder.local rudder verbose: 192.168.41.3> Authentication of client relay.rudder.local/192.168.41.3 achieved
on the client side
R: @@Common@@result_success@@hasPolicyServer-root@@common-root@@16@@Binaries update@@None@@2016-01-17 20:14:04+00:00##5f852b2b-3120-4a69-8f44-f4e403b6c9ad@#The CFengine binaries in /var/rudder/cfengine-community/bin are up to date error: Empty first server packet when listing directory '/usr/share/ncf/tree'! R: @@DistributePolicy@@result_na@@5f852b2b-3120-4a69-8f44-f4e403b6c9ad-DP@@5f852b2b-3120-4a69-8f44-f4e403b6c9ad-distributePolicy@@13@@Configure ncf@@None@@2016-01-17 20:14:04+00:00##5f852b2b-3120-4a69-8f44-f4e403b6c9ad@#Configuration of ncf is not necessary on non root servers R: @@DistributePolicy@@result_error@@5f852b2b-3120-4a69-8f44-f4e403b6c9ad-DP@@5f852b2b-3120-a69-8f44-f4e403b6c9ad-distributePolicy@@13@@Propagate promises@@None@@2016-01-17 20:14:04+00:00##5f852b2b-3120-4a69-8f44-f4e403b6c9ad@#Cannot propagate common ncf promises
It looks as if the connection is reinitialized, or agent tries to connect twice.
Updated by Nicolas CHARLES about 9 years ago
Attached, debug output from the server.
The client failed to fetch error: Empty first server packet when listing directory '/usr/share/ncf/tree/60_services/baseline/stable'!
I'm wondering if it is not the agent that fails ..
Updated by Jonathan CLARKE about 9 years ago
This is indeed worrisome.
I found the origin of that message in cf-agent code:
libcfnet/client_code.c:354: 293 /* Returning NULL (an empty list) does not mean empty directory but ERROR, 294 * since every directory has to contain at least . and .. */ 295 Item *RemoteDirList(const char *dirname, bool encrypt, AgentConnection *conn) 296 { ... 351 if (recvbuffer[0] == '\0') 352 { 353 Log(LOG_LEVEL_ERR, 354 "Empty%s server packet when listing directory '%s'!", 355 (start == NULL) ? " first" : "", 356 dirname); 357 goto err; 358 } ... 405 err: /* free list */ 406 for (Item *ip = start; ip != NULL; ip = start) 407 { 408 start = ip->next; 409 free(ip->name); 410 free(ip); 411 }
So there is indeed an error printed out, but as far as I can tell the connection is not marked as bad or closed.
The cause for the error is a packet from the server containing exactly nothing (just the final '\0' byte). This basically means the error is on the server side, because it is either sending an empty packet when it shouldn't, or possibly sending the right packet that happens to have a "\0" as it's first byte.
Nico, what exact versions of the agent and the server are you using? Have you tried varying server versions or just agent versions in above tests?
Updated by Jonathan CLARKE about 9 years ago
If you can reproduce this pretty often, we can build a patched version of cf-serverd (and cf-agent, if needed) and print debug information to figure out what is really going on. Or just run cf-serverd under gdb, but that could be very tiresome if we don't know whether the bug is occuring or not in a given run, since don't yet have a way of detecting that on the server side.
Updated by Nicolas CHARLES about 9 years ago
so far, i reproduiced it with any combinaision of 3.7.x <-> 3.7.x
I'm building a 3.6 to validate there
Updated by Jonathan CLARKE about 9 years ago
Could you run a find on the directories that cause this "Empty first server packet when listing directory" on the server please? That is, from your logs above:
- /usr/share/ncf/tree/60_services/baseline/stable
- /usr/share/ncf/tree
at least, and attach them to this bug?
One option I see from reading the source code is a filename longer than 256 chars:
int CfOpenDirectory(ServerConnectionState *conn, char *sendbuffer, char *oldDirname) ... /* TODO fix copying names greater than 256. */
Updated by Alexis Mousset about 9 years ago
- Status changed from New to In progress
The problem is that the check for the NULL character is done before decrypting the server response (https://github.com/cfengine/core/blob/3.7.x/libcfnet/client_code.c#L351). Preparing a fix.
Updated by Alexis Mousset about 9 years ago
Upstream PR: https://github.com/cfengine/core/pull/2475. Preparing a fix for Rudder 3.2.
Updated by Alexis Mousset about 9 years ago
- Status changed from In progress to Pending technical review
- Assignee changed from Alexis Mousset to Jonathan CLARKE
- Pull Request set to https://github.com/Normation/rudder-packages/pull/862
Updated by Alexis Mousset about 9 years ago
- Related to Bug #7796: Rudder agent does not detect when promise copy failed added
Updated by Alexis Mousset about 9 years ago
- Status changed from Pending technical review to Pending release
- % Done changed from 0 to 100
Applied in changeset rudder-packages|c367f984ebb50f3252ec5d27bcb1cfdd6aef71bc.
Updated by Nicolas CHARLES about 9 years ago
related cfengine ticket
https://dev.cfengine.com/issues/7908
Updated by Vincent MEMBRÉ almost 9 years ago
- Status changed from Pending release to Released
This bug has been fixed in Rudder 3.2.0 which were released today.