User story #4234
closed
Add online|offline check before calculating status
Description
Would it be an idea to check whether a node is online before calculating its status? When managing desktops, machines will be offline some of the time, so I have no way to tell whether a node is offline or whether rudder-agent is malfunctioning. I'm going to manage 20+ desktops with Rudder, and they will be offline several times. I can assume a machine in the "No Answer" state is offline, but it could also be that rudder-agent is broken on that node.
Some thoughts on IRC:
16:30 < jooooooon> dnns: it's not always possible to check if a node is offline, because of firewalls rules, network topology etc
16:30 < jooooooon> :/
16:34 < dnns> jooooooon: it's maybe not always possible to see if a node's online by pinging. but maybe the node could send a message with curl to the rudder server to say hi' i'm online
16:34 < jooooooon> that's kinda the logic we already apply with reports tho, no?
16:35 < dnns> jooooooon: how can i see the difference between a node which is offline and a node with a disfunctional rudder-agent/rsyslog?
16:36 < jooooooon> dnns: ahhh, I see what you mean
16:37 < jooooooon> any ideas on how to display that differently?
16:38 < dnns> jooooooon: Succes | Repaired | Error | No Answer | Offline
16:38 < dnns> ?
16:38 < jooooooon> I like it :)
16:39 < jooooooon> I just worry that we can't really *know* a node is offline
16:39 < jooooooon> but I suppose a node that doesn't contact the Rudder server is pretty much offline
16:39 < jooooooon> maybe "No answer" should be renamed too?
16:40 < ncharles> maybe we could have some kind of snmp probe ?
16:40 < Kegeruneku> Like uh, a ping probe ?
16:41 < Kegeruneku> instead
16:41 < dnns> well, no anwer can also mean that the node is up but doesn't send out logs
16:44 < jooooooon> but we can't really differentiate between that scenario and "offline"
16:45 < Kegeruneku> Well, Off line = Off the line = No connection between two peers
16:45 < Kegeruneku> It's not really wrong :
Updated by Erwin Vrolijk about 11 years ago
There is really no way to differentiate a non-functioning node from an offline node if the ping (curl or whatever, from node to server) is sent by the main rudder agent.
This can be sidestepped by having a different process, such as cron, drive the ping. This is a bit ugly, but cron is already a requirement for the rudder agent.
My proposal would be to use the bundled curl to regularly send a ping to the rudder server via HTTP POST. This process must not have any dependencies on the rudder agent, cfengine or rsyslog, and must be controlled via cron. The cron entry is added during the installation of the rudder agent.
The HTTP POST could simply contain only the node's rudderid and a NOOP or Keepalive message.
A node's status can become offline when no ping is received for 2x the ping interval configured in cron.
The HTTP POST messages can be turned into a technical log by the rudder server and appended to the node's log. This allows debugging of the ping itself, for instance when the rudder agent is working fine but the ping is not.
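A minimal sketch of how the server side could apply the "2x the ping interval" rule alongside regular reports. This is not Rudder code: the function name `classify_node`, the `report_interval` parameter, and the "Reporting" label are assumptions made for illustration; "No Answer" and "Offline" are the labels suggested on IRC above.

```python
from datetime import datetime, timedelta
from typing import Optional


def classify_node(
    now: datetime,
    last_ping: Optional[datetime],
    last_report: Optional[datetime],
    ping_interval: timedelta,
    report_interval: timedelta,
) -> str:
    """Classify a node from its last keepalive ping and its last agent report.

    Hypothetical helper: the names and the "Reporting" label are illustrative.
    """
    # Recent reports mean the agent works, even if the ping itself is broken.
    if last_report is not None and now - last_report <= report_interval:
        return "Reporting"
    # No recent report, but the cron-driven ping still arrives:
    # the machine is up, so the agent or its reporting is likely broken.
    if last_ping is not None and now - last_ping <= 2 * ping_interval:
        return "No Answer"
    # Neither reports nor pings within 2x the ping interval.
    return "Offline"
```

Checking reports before pings keeps a node from being shown as offline when only the ping mechanism has failed, which is the debugging case mentioned above.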
Updated by Vincent MEMBRÉ about 11 years ago
- Status changed from New to Discussion
- Target version set to Ideas (not version specific)
Thanks to both of you for your ideas and proposal.
It would definitely be a good thing to be able to determine whether a node is shut down or whether it has issues sending reports.
And I like your idea, Erwin, of sending a "ping" from each agent that would be transformed into a report from the node (with a dedicated API on the server).
However, this is very tricky: if the node cannot send reports, it may not be able to send that signal either, leading to a false "offline" instead of "no answer".
I have no idea what the best solution would be here, or what should be done.
Everyone, what do you think about this feature? Do you have any problem with it, or any more ideas to add?
Updated by Jonathan CLARKE about 11 years ago
- Status changed from Discussion to New
- Target version deleted (Ideas (not version specific))
I like the idea. Sure, Vincent, you're right that if network conditions are adverse, the "ping by curl" won't work any more than sending reports, BUT there are many cases where syslog reports and/or rudder-agent can fail to send, but a simple HTTP ping could get through. This wouldn't be foolproof, but could be nice to have.
Updated by Olivier Mauras about 11 years ago
Please make it an option and not a requirement :)
Updated by Benoît PECCATTE over 9 years ago
- Category set to Web - Compliance & node report
- Target version set to Ideas (not version specific)
Updated by François ARMAND 7 months ago
- Related to Architecture #24963: Persist compliance in base to know last state for a long time added
Updated by François ARMAND 4 months ago
- Status changed from New to Resolved
- Regression set to No
We can't really check whether the node is online, and we want to avoid direct calls to nodes from a relay.
So we added the possibility for the user to control how long a node's last compliance should be kept before the node is considered unavailable.
With #24963, people can ask Rudder to report a problem (grey node) after 10 minutes without reports for a server, but say it's OK to wait 4 days for a laptop.
I'm closing this one, since we won't do more on the subject without new expressed needs.
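As a rough illustration of the per-node delay described above, the sketch below marks a node as unavailable once its last report is older than a configurable grace period. The names `GRACE_PERIODS` and `is_unavailable` and the "server"/"laptop" keys are hypothetical and are not Rudder settings; only the 10-minute and 4-day values come from the comment above.

```python
from datetime import datetime, timedelta

# Hypothetical per-kind grace periods, using the values quoted in the comment.
GRACE_PERIODS = {
    "server": timedelta(minutes=10),
    "laptop": timedelta(days=4),
}


def is_unavailable(now: datetime, last_report: datetime, node_kind: str) -> bool:
    """Return True when the last report is older than the node's grace period."""
    return now - last_report > GRACE_PERIODS[node_kind]
```

With these values, a server silent for 11 minutes would be flagged, while a laptop silent for 3 days would still show its last known compliance.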