Bug #6375
Closed - Missing node in Rules Manager
Added by Clem Def over 9 years ago. Updated over 9 years ago.
Description
Hi everyone,
I have created a dynamic group with 25 nodes, but when I apply a rule to this group, only 24 nodes show up in the list.
However, on the node which is not listed, the rule is applied but I get no report.
This problem appeared just after an upgrade to Rudder 3.0.
Thanks for your help :-)
Updated by François ARMAND over 9 years ago
- Description updated (diff)
Some questions:
- the group being dynamic, is there any chance the node lost the properties that made it a member of the group? (Of course, if the criterion is something like the distribution, that's unlikely... but an environment variable, a software version, etc. may change.)
- you say that on the missing node the rule is applied: how do you know that?
- are there other rules applied to that node? If so, do you get their reports (i.e. are there any reports coming from that node? You can see them in the "Technical logs" tab of the node details screen).
Thanks,
Updated by François ARMAND over 9 years ago
- Project changed from 41 to Rudder
- Category set to Web - Config management
- Target version set to 3.0.3
Updated by Clem Def over 9 years ago
Hi,
Thanks for your answer.
What is surprising is that the node belongs to the group but does not appear in the 'Compliance detail - Node' list of the rule.
The rule is to download a file from a shared space and the file is present on the node.
And in the 'technical log' of the node, I see the execution of the rule.
I have another rule with the same target group, and I do not see the node there either.
Updated by François ARMAND over 9 years ago
The problem seems to be with the reporting, since you get technical logs and the rules are applied.
OK, so we are in weird territory.
So, some weird questions:
- do you have any compliance reports for that node, in the node details => "Reports" tab? If so, do you see a pattern in the missing rules? Do the rules that are present have all the expected directives?
- did you clone that node from another one? (Usually the symptoms are not these ones, but still...)
- does clearing the caches (Administration -> Settings -> Clear caches, at the bottom of the page) change anything?
Then, depending on your motivation to debug, you could enable the compliance logs: edit the file /opt/rudder/etc/logback.xml; there is an "explain_compliance" line. Change its level to "trace" and restart Jetty (service rudder-jetty restart). Then go to the node details (without going to the rule or directive pages), on the "Reports" tab.
There should be a lot of logs in /var/log/rudder/webapp/${TODAY}.stderrout.log (you can watch them appear with tail -f /var/log/rudder/webapp/${TODAY}.stderrout.log before going to the node compliance reports).
There should be information about what Rudder understands for that node.
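To sum that up as commands, a minimal sketch on the root server could be (logger name, log path and the ${TODAY} placeholder are the ones mentioned above; adjust to your installation):
# 1. Set the "explain_compliance" logger level to "trace"
vi /opt/rudder/etc/logback.xml
# 2. Restart the webapp so the new log level is taken into account
service rudder-jetty restart
# 3. Watch the webapp log while opening the node details > "Reports" tab in the web UI
tail -f /var/log/rudder/webapp/${TODAY}.stderrout.log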
Thanks,
Updated by Clem Def over 9 years ago
In the Reports tab, I have an error: "Could not fetch reports information", but in the technical logs, I see the rule.
I did not clone this node. It is a virtual machine which had been cloned, but I did a re-installation of Rudder (I deleted the backup folder when I removed Rudder).
And I have cleared the cache.
But now something is wrong on the node: I can't send an inventory or update the node.
Output of 'rudder agent update':
2015-03-10T09:56:48+0100 error: /default/update/methods/'update'[0]: Method 'update_action' failed in some repairs
R: *************************************************************************
* rudder-agent could not get an updated configuration from the policy server. *
* This can be caused by a network issue, an unavailable server, or if this *
* node was deleted from the Rudder root server. *
* Any existing configuration policy will continue to be applied without change. *
*************************************************************************
Thanks for your time...
Updated by François ARMAND over 9 years ago
Clem Def wrote:
In the Reports tab, I have an error: "Could not fetch reports information", but in the technical logs, I see the rule.
OK, this is strange. It may happen if the last logs are rather old (and so PostgreSQL takes too long to fetch the information, because it indexes recent entries better). In the technical logs, do you see recent (i.e. < 10 min old) lines?
I did not clone this node. It is a virtual machine which had been cloned, but I did a re-installation of Rudder (I deleted the backup folder when I removed Rudder).
And I have cleared the cache.
Perfect :)
In Rudder 3.0, you can also use the command "rudder agent reinit" to completely reinitialize the agent (new UUID, new cf-key) - typically after a VM clone :)
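For instance, on a freshly cloned VM, a minimal sketch would simply be (run on the node itself):
# Reset the agent identity (new UUID, new cf-key), as described above
rudder agent reinit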
In /var/log/rudder/webapp/... I have this error: "[2015-03-10 09:48:00] ERROR com.normation.rudder.web.services.ReportDisplayer - Failure(Found bad number of reports for node NodeId(509ddc2c-b968-4525-8874-301b41d029e9),Empty,Empty)". Seems to be 'normal'....
It is not. It's saying that the node sent too many or too few reports compared to what was expected from Rudder's point of view. That may happen if the node didn't succeed in getting its new promises, and so it is still sending reports for the old ones.
But now something is wrong on the node: I can't send an inventory or update the node.
Output of 'rudder agent update':
2015-03-10T09:56:48+0100 error: /default/update/methods/'update'[0]: Method 'update_action' failed in some repairs
R: *************************************************************************
* rudder-agent could not get an updated configuration from the policy server. *
* This can be caused by a network issue, an unavailable server, or if this *
* node was deleted from the Rudder root server. *
* Any existing configuration policy will continue to be applied without change. *
*************************************************************************
OK, so that's a problem (less weird).
That is something we are able to debug. First, does the node reach cf-serverd, and if so, why does cf-serverd kick it off?
On the server, could you please execute the command: rudder server debug IP_OF_NODE
It will start a dedicated cf-serverd in debug mode and print a lot of things on the standard output.
When it's ready, run on the node: rudder agent inventory
And look for things like "connection refused because host name not known" or similar sentences.
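For reference, the two-terminal procedure sketched above would look like this (IP_OF_NODE is a placeholder for the node's real IP address):
# Terminal 1, on the Rudder root server: start a dedicated cf-serverd in debug mode
rudder server debug IP_OF_NODE
# Terminal 2, on the node: trigger an inventory so the connection shows up in the debug output
rudder agent inventory
# Then read the server-side output for anything mentioning a refused connection
# or a host name that could not be resolved.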
Thanks for your time...
No problem, the bug is on our side ;)
Updated by Clem Def over 9 years ago
Hi,
Thanks for your answer,
When I execute rudder server debug and launch rudder agent inventory, I do not notice anything wrong:
2015-03-11T10:28:14+0100 verbose: Obtained IP address of 'NODE' on socket 7 from accept
2015-03-11T10:28:14+0100 verbose: New connection (from , sd 7), spawning new thread...
2015-03-11T10:28:14+0100 info: > Accepting connection
2015-03-11T10:28:14+0100 verbose: > Setting socket timeout to 600 seconds.
2015-03-11T10:28:14+0100 verbose: > Peeked CAUTH in TCP stream, considering the protocol as Classic
2015-03-11T10:28:14+0100 verbose: > Peer's identity is: MD5=ce1bf86a4b7700931bf049e296dff38d
2015-03-11T10:28:14+0100 verbose: > A public key was already known from NODE - no trust required
2015-03-11T10:28:14+0100 verbose: > The public key identity was confirmed as root@NODE
2015-03-11T10:28:14+0100 verbose: > Authentication of client NODE achieved
2015-03-11T10:28:14+0100 verbose: > Filename /var/rudder/configuration-repository/shared-files/check_sauvegardes.sh is resolved to /var/rudder/configuration-repository/shared-files/check_sauvegardes.sh
2015-03-11T10:28:14+0100 verbose: > Found a matching rule in access list (/var/rudder/configuration-repository/shared-files/check_sauvegardes.sh in /var/rudder/configuration-repository/shared-files)
2015-03-11T10:28:14+0100 verbose: > Mapping root privileges to access non-root files
2015-03-11T10:28:14+0100 verbose: > Host granted access to /var/rudder/configuration-repository/shared-files/check_sauvegardes.sh
2015-03-11T10:28:15+0100 info: > Closed connection, terminating thread
There is only one rule in the log, but I have another rule which is not logged on the root server and which is logged on the node:
rudder agent inventory
R: @copyFile@result_success@3f323322-1af7-439d-889f-3347bbd6fa84@62c72623-3bf4-47f8-87f0-82252bfe93e1@19@Copy file@check_sauvegardes.sh@2015-03-11 09:20:37+00:00##509ddc2c-b968-4525-8874-301b41d029e9#The content of the file(s) is valid
# The other rule, which is not logged on the root server:
R: @checkGenericFileContent@result_success@39e816c4-9a53-463c-9d93-aa4cf81e21a1@5caf863f-b1b1-4c83-8416-7f87da9fce6f@49@File@/etc/nagios/nrpe.cfg@2015-03-11 09:20:37+00:00##509ddc2c-b968-4525-8874-301b41d029e9@#The file /etc/nagios/nrpe.cfg was already in accordance with the policy@
Any ideas? :-p
Updated by François ARMAND over 9 years ago
- Reproduced set to No
Well, it works in debug mode and not otherwise. Nice.
Some random ideas:
- did you try to run "rudder agent update" after "rudder agent inventory"? Did it work?
- debug starts a new cf-serverd, so perhaps something is wrong with the one currently running. Could you try to kill cf-agent, cf-execd and cf-serverd on the server and then run "rudder agent run" on the server, so that everything is restarted, and then try again to run "rudder agent update" on the node? (See the sketch just below.)
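A minimal sketch of that restart sequence (process names as listed above; the exact kill options may differ on your system):
# On the Rudder root server: stop the running CFEngine processes
pkill -x cf-agent
pkill -x cf-execd
pkill -x cf-serverd
# Relaunch everything through a full agent run on the server
rudder agent run
# Then, on the node, check that policy updates work again
rudder agent update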
Updated by Clem Def over 9 years ago
After killing every process one more time and executing rudder agent update, then inventory...
It's working :-)
Some weird problems. Unfortunately, I think it's impossible to reproduce, so... I hope I will be the first and the last to have these problems :-p
Thanks for your help, and to the Rudder team for this awesome project!
Updated by François ARMAND over 9 years ago
Oh, that's great, thanks!
I think we can build the following scenario:
- there was a DNS problem, making the node not correctly update its promises, but in a non-obvious way
- during that period, any attempt to solve the problem led nowhere
- when working on #6377, something changed with the DNS, and the node's problem with promise updates became obvious
- the problem was located in cf-serverd; something was wrong with it - certainly the previous DNS problem, or its correction, triggered a bug
- so using the debugging server led to a good result,
- and at that point, restarting the Rudder services made things OK (because DNS (or something that was induced before the DNS correction) was now OK).
It's full of hypotheses and nice stories, but the scenario may be right.
All in all, I'm glad it's now working, and I'm glad to see that the new "rudder server debug" command is great for that kind of debugging nightmare :)
I think I'm going to close it as "rejected", because I'm not sure what else can be done. Clement, is this OK with you?
Updated by François ARMAND over 9 years ago
- Status changed from New to Rejected
Rejecting: cannot be reproduced; see the previous comments.