Please mark as complete - had a discussion with Sehul and it looks like the current SQL Tool will allow this
Pls mark ticket as completed
This was resolved by Yogesh and Sehul. This behaviour is by design: when the OnError variable is filled in a workflow, it suppresses the error and the workflow continues as normal. Pls mark ticket as completed
Resolved by Yogesh in version 2.4.10 - Pls close ticket
Resolution: Populate the OnError variable of the Main SubWorkflow on the consume workflow. This will force a 200 OK response, even if an error (like an API failure) happens in the sub-workflow.
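To illustrate the idea behind this resolution, here is a minimal Python sketch. It is only an analogy, not Warewolf's implementation: `run_sub_workflow`, `should_ack` and the `on_error` list are hypothetical names, and the 400 status is taken from the example in the original report. The point it shows is that once an error sink is supplied, a failure inside the sub-workflow is captured and the caller still sees a 200, so the message can be acknowledged.

```python
def run_sub_workflow(record, sub_workflow, on_error=None):
    """Run a sub-workflow and return (status_code, result).

    Hypothetical model: `on_error` stands in for a populated OnError
    variable. When it is None, a failure surfaces as a non-200 status.
    When it is a list, the error text is stored there and 200 is returned.
    """
    try:
        return 200, sub_workflow(record)
    except Exception as exc:
        if on_error is None:
            # Error is not captured, so the consumer sees a failure.
            return 400, str(exc)
        # Error is captured in the sink; execution continues as normal.
        on_error.append(str(exc))
        return 200, None


def should_ack(status_code):
    """Assumption for this sketch: only a 200 leads to an ack."""
    return status_code == 200


def failing_api_call(record):
    # Stand-in for an API failure inside the sub-workflow.
    raise RuntimeError("504 Gateway Timeout")


if __name__ == "__main__":
    print(run_sub_workflow({"id": 1}, failing_api_call))
    errors = []
    print(run_sub_workflow({"id": 1}, failing_api_call, on_error=errors), errors)
```

In this model the error detail is not lost: it ends up in `errors`, where the workflow can still inspect it and pack up the record.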
This fix has been implemented in Journeys and no queues have stopped in the past 3 weeks.
This fix was later rolled out to other MicroServices such as Collections, BusinessRules, etc.
I believe this resolves one of the main issues that made us switch architecture back to VMs.
Context: Morning Everyone
We have a big problem in Journey Team’s Production environment, and we please need your assistance / guidance as to how we can resolve it.
When we read a record from RabbitMQ and that record gets consumed in a workflow, an error such as an API error can occur in the workflow. We handle the error inside the workflow (pack up with a status of lost). However, even though we handled the error, the response sent back to Rabbit is not a 200 OK but some other response, like 400 Bad Request. Rabbit then receives this back as an unack.
If we have 7 consumers, each with a prefetch of 5, we have a read capacity of 35 messages (7 x 5). Over a period of a day or so the unacks build up (due to API errors in the workflows) until we have 35 unacks on the queue. Because our read capacity of 35 is then fully taken up by the 35 unacks, the queue is not processed any further. What we used to do is restart Warewolf, which would take the 35 unacked messages and put them back in Ready, and the queue would proceed with processing.
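The capacity arithmetic above can be sketched in a few lines of Python. This is a simplified model, not RabbitMQ or Warewolf code: the function names and the `fail_every` parameter are assumptions made for illustration, and it treats the 35 slots as one shared pool rather than tracking each consumer separately.

```python
def read_capacity(consumers, prefetch):
    """Maximum number of messages that can be delivered but not yet acked."""
    return consumers * prefetch


def free_slots(consumers, prefetch, stuck_unacked):
    """Delivery slots left once some messages are stuck as unacked."""
    return max(0, read_capacity(consumers, prefetch) - stuck_unacked)


def simulate(consumers, prefetch, messages, fail_every):
    """Feed messages through the queue until it stalls or runs out.

    Every `fail_every`-th message is never acked (hypothetical failure
    rate). Returns (processed, stuck_unacked).
    """
    capacity = read_capacity(consumers, prefetch)
    processed = 0
    stuck = 0
    for i in range(1, messages + 1):
        if stuck >= capacity:
            break  # every slot is held by an unacked message: queue is stuck
        if i % fail_every == 0:
            stuck += 1  # message stays unacked and keeps its slot
        else:
            processed += 1  # message is acked and its slot is freed
    return processed, stuck


if __name__ == "__main__":
    print(read_capacity(7, 5))
    print(simulate(7, 5, 1000, 10))
```

With 7 consumers, a prefetch of 5 and one failure in every ten messages, the model stalls after 350 deliveries, leaving the remaining messages unread.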
Now, here is where the problem comes in – we recently upgraded to Warewolf 18.104.22.168, as it contained some additional logic for RabbitMQ hoping that it would bring some kind of relief for the issue. However, we now see that with this version, even if we restart Warewolf, the unack messages on the queue still remain at 35 (it doesn’t put it in Ready like the previous versions), thus we do not have any capacity to read from the queue and the queue is stuck. This is currently happening on a number of queues.
We have a legal risk to the business here – as we are not serving our customers. How can we resolve this?
A downgrade was mentioned as an option to bring immediate relief – do you agree? If yes, to what version? And how do we remedy this for the future?
Finally – this kind of issue became more apparent after a change was made in Warewolf on how to deal with errors – I think the change was introduced at 22.214.171.124
When Output is blank in Warewolf 126.96.36.199, the variable maps, for example:
I've logged the following ticket - https://community.warewolf.io/communities/1/topics/1510-can-we-please-confirm-that-the-condition-if-there-is-no-error-on-the-decision-tool-recognizes-and
With regards to ASYNC processing, we removed it after we experienced a drop-off in records during execution. I'm aware that you have reported it and DEV2 have fixed it, but we (Journey Team) haven't enabled or tested it since the fix was made. Something for us to look into.
This is a nice breakdown, thanks Khonzi.
I've been thinking about the issue, and from my experience monitoring journeys and the particular issue, I have the following comments:
1. If an API fails, the current condition in the Decision Tool, "If there is No error", appears not to work for certain errors like a 504 error. What then happens is that the customer record gets stuck in the Journey, because it doesn't pack up, due to the logic in the decision failing. Because the record is stuck in the journey, it doesn't send an acknowledgement back to the relevant RabbitMQ queue. As a result, the next record in the RabbitMQ queue is not read and processed. We basically have a lock that only a restart seems to resolve at this stage.
Now knowing what we know, what can we do differently?
2. We can fix the condition "If there is No error" in the Decision Tool to recognize failures such as 504.
From the Journey Team - We can alter the condition on the Decision Tool to check whether any records are returned from the API, instead of "If there is No error". If no records are returned then pack up, else continue.
3. We can also implement ASYNC processing on the journeys - this will enable other records to be processed without having to wait for one to finish before the next one is picked up.
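The Journey Team's alternative condition from point 2 can be sketched as follows. This is illustrative Python, not Decision Tool configuration: `next_step` and the `records` key are hypothetical, and the response is modelled as a plain dictionary. The decision looks only at whether records came back, so a failure such as a 504 that returns no records still leads to a pack up.

```python
def next_step(api_response):
    """Decide the next step from the records returned, not from an error flag.

    Returns "continue" when at least one record was returned,
    otherwise "pack_up".
    """
    records = (api_response or {}).get("records") or []
    return "continue" if records else "pack_up"


if __name__ == "__main__":
    print(next_step({"status": 504}))             # no records returned
    print(next_step({"records": [{"id": 1}]}))    # records returned
```

A missing response, an empty record list and an error response without records are all treated the same way, which avoids relying on the error condition being detected.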