Project

General

Profile

Bug #10019

send-clean.sh blocks on inventory parsing error

Added by Janos Mattyasovszky almost 2 years ago. Updated 10 months ago.

Status:
Released
Priority:
N/A
Category:
System integration
Target version:
Severity:
Major - prevents use of part of Rudder | no simple workaround
User visibility:
Infrequent - complex configurations | third party integrations
Effort required:
Small
Priority:
47

Description

When there is a bug in the inventory parsing, and the a Java Exception is thrown by the endpoint, and the POST request returns a 500 error, but this is handled by send-clean.sh as if the endpoint would throw a 503 (too busy):

See: https://github.com/Normation/rudder-techniques/blob/master/tools/send-clean.sh#L88

It would make sense to only try to upload it "n" times or add somekind of queueing, so that the agent does not always try to re-upload the same (possibly) broken inventory, because that could cause the upload process of always fail on the same inventory, rendering the complete inventory-system dead...

Associated revisions

Revision 508a8a66 (diff)
Added by Benoît PECCATTE 10 months ago

Fixes #10019: send-clean.sh blocks on inventory parsing error

History

#1 Updated by Vincent MEMBRÉ almost 2 years ago

  • Category set to System integration
  • Status changed from New to Discussion
  • Assignee set to Benoît PECCATTE

send-clean.sh is stateless and do not keep track of what happened, so it may be quite complicated to change it directly.

Maybe inventory should be moved to a "retry" folder that would either: try every 30 minutes to send them / Try one more time then send to fail inventories ?

Benoit do you have more ideas ?

#2 Updated by Janos Mattyasovszky almost 2 years ago

Well, it does have some minimal state, because it checks the inventory to be older than X minutes, to ensure the signature was also uploaded, it one was sent on it's way. You could say it is only retried if it is not older than X minutes/hours, if yes, fail it. The downside would be, that if the inventory endpoint is saturated, you'd throw away inventories, so there should be a better check than the is it 5xx or not, so you can handle a plain 500 different from a 503, since 500 could mean the inventory breaks the parsing and should not be tried more than X times, and a 503 means that it has nothing to do with the inventory, but the saturation of the endpoint.

#3 Updated by Benoît PECCATTE over 1 year ago

  • Severity set to Minor - inconvenience | misleading | easy workaround
  • User visibility set to Infrequent - complex configurations | third party integrations

#5 Updated by Jonathan CLARKE over 1 year ago

  • Effort required set to Small

#6 Updated by Benoît PECCATTE over 1 year ago

  • Priority set to 14

#7 Updated by Janos Mattyasovszky over 1 year ago

Well, it's not completely stateless, since it only sends the inventories, if (IIRC) 2mins have elapsed, so any possible delayed signatures can also be uploaded if the sender supports signing inventories.

The same idea could also be used to not try to send inventories that are older than (pi/random) 4 hours.

This could ensure that you don't saturate your sending queue of (default) 50 with broken inventories and never process anything never, and they just keep piling up.

#8 Updated by Jonathan CLARKE over 1 year ago

  • Status changed from Discussion to New

#9 Updated by Jonathan CLARKE over 1 year ago

  • Assignee deleted (Benoît PECCATTE)

#10 Updated by Benoît PECCATTE over 1 year ago

  • Priority changed from 14 to 24

#11 Updated by Benoît PECCATTE about 1 year ago

  • Priority changed from 24 to 32

#12 Updated by Janos Mattyasovszky 12 months ago

  • Priority changed from 32 to 31

Florian found this as well: As soon you have 50 failed inventories, you get no inventory uploaded, because they get retried until the queue is saturated and send-clean exists with 1...

A valid idea would be to put all failed to a failed folder, and always process those after all the "fresh" ones, so you make sure you always try to process those which have not failed before

#13 Updated by Florian Heigl 12 months ago

  • Severity changed from Minor - inconvenience | misleading | easy workaround to Major - prevents use of part of Rudder | no simple workaround
  • Priority changed from 31 to 48

I found something, not exactly the same.

The inventory processing aborts once an exit 1 is propagated from send_clean.
This happens for any bad inventory that is moved to "failed"

The bad inventories are normally distributed among all inventories.
That means each inventory run can hit one.
At that point this run aborts, so it will process not 50 inventories, but some number between 1 and 50.
Without doing too much math there is a point where the remaining runs are not sufficient to process the non-broken inventories.
This, of course keeps repeating forever.

Simply put, if you inject 200 bad inventories over the name space of all inventories you will NEVER AGAIN reach a point where you'd have working inventories.

Please note to fix that the caller (policy that aborts the loop) should be fixed, not send_clean itself. Alternatively, both could be fixed to have a special return code for this case.
There are surely some cases where it's valid to just abort.

It is not valid to abort "upload inventories" because of this though, since uploading DOES work.
(unless all failed, which would be a good reason to abort and be verbosely sad)

I'm changing this to SEV: major, and would suggest to consider turning it into critical. It could even cause data loss if there's some very very bad coincidence. (Accepted nodes not getting updates, not for new nodes since they will probably not even get in)
Visibility is infrequent, can only happen on setups > 288 hosts though.

#14 Updated by Janos Matya 12 months ago

The problem is that the api returns 5xx on a Java issue (the ugly way heapdumping) and also on queue full.

#15 Updated by Benoît PECCATTE 10 months ago

  • Target version set to 4.1.10
  • Priority changed from 48 to 47

#16 Updated by Benoît PECCATTE 10 months ago

  • Status changed from New to In progress
  • Assignee set to Benoît PECCATTE

#17 Updated by Benoît PECCATTE 10 months ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Benoît PECCATTE to Alexis MOUSSET
  • Pull Request set to https://github.com/Normation/rudder-techniques/pull/1249

#18 Updated by Normation Quality Assistant 10 months ago

  • Assignee changed from Alexis MOUSSET to Benoît PECCATTE

#19 Updated by Benoît PECCATTE 10 months ago

  • Status changed from Pending technical review to Pending release

#20 Updated by Vincent MEMBRÉ 10 months ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 4.1.10 and 4.2.4 which were released today.

Also available in: Atom PDF