How to handle unsent data in microservices

I have two services A and B. A receives a request, does some processing and sends the processed data to B.
What should I do with the data in the following scenario:
1. A receives data.
2. Processes it successfully.
3. Crashes before sending the data to B.
4. Comes back online.

I would either use some sort of persistent log to handle the communication between the microservices (e.g. Kafka) or some sort of retry mechanism.
In either case, the data that A received and processed must not disappear until the entire chain of execution completes successfully or, at the very least, until A has successfully completed its work and passed its payload to the next service. And this payload must exist until the next service processes it, and so on.
Generally, the steps should continue as follows:
A comes back online and sees that there is work to be done: the item it processed at step #2 (since its processing is not yet done as far as the overall system is concerned). Unless there are some weird side effects, it shouldn't matter that it processes the item again.
The data is sent to B (although this step should, conceptually, be part of "processing" the data).
If A crashes again, it probably means that the data it processes matches nicely with a bug in A, and the whole chain of starting up, reprocessing, and crashing will continue forever. This is a Denial of Service, malicious or not, and you should have some procedure in place to handle it: perhaps don't reprocess the same data more than a given number of times, and log such cases to be analyzed with top priority.
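Below is a minimal sketch of this persist-then-send idea, assuming a local SQLite table as the durable store; process and send_to_b are illustrative stand-ins for A's real work and the call to B:

import json
import sqlite3

db = sqlite3.connect("outbox.db")
db.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, payload TEXT, status TEXT)")

def process(data):
    return data  # stand-in for A's real processing

def send_to_b(payload):
    pass  # stand-in for the actual call to service B

def handle_request(data):
    # Persist the work item durably *before* acknowledging the request.
    cur = db.execute("INSERT INTO outbox (payload, status) VALUES (?, 'PENDING')",
                     (json.dumps(data),))
    db.commit()
    deliver(cur.lastrowid)

def deliver(item_id):
    (payload,) = db.execute("SELECT payload FROM outbox WHERE id = ?",
                            (item_id,)).fetchone()
    send_to_b(process(json.loads(payload)))
    # Mark the item done only after B has the data; a crash before this
    # line leaves the row PENDING, so it is retried on startup.
    db.execute("UPDATE outbox SET status = 'SENT' WHERE id = ?", (item_id,))
    db.commit()

def recover_on_startup():
    rows = db.execute("SELECT id FROM outbox WHERE status = 'PENDING'").fetchall()
    for (item_id,) in rows:
        deliver(item_id)  # reprocessing must therefore be idempotent

A persistent log such as Kafka gives the same guarantee with less bookkeeping: A's write is considered done only once the broker acknowledges it, and B consumes the record at its own pace, so nothing is lost if either side crashes.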


Is this Redis Race Condition Scenario Possible?

I'm debugging an issue in an application and I'm running into a scenario where I'm out of ideas, but I suspect a race condition might be in play.
Essentially, I have two API routes - let's call them A and B. Route A generates some data and Route B is used to poll for that data.
Route A first creates an entry in the redis cache under a given key, then starts a background process to generate some data. The route immediately returns a polling ID to the caller, while the background data thread continues to run. When the background data is fully generated, we write it to the cache using the same cache key. Essentially, an overwrite.
Route B is a polling route. We simply query the cache using that same cache key - we expect one of 3 scenarios in this case:
1. The object is in the cache but contains no data - this indicates that the data is still being generated by the background thread and isn't ready yet.
2. The object is in the cache and contains data - this means that the process has finished and we can return the result.
3. The object is not in the cache - we assume that this means you are trying to poll for an ID that never existed in the first place.
For the most part, this works as intended. However, every now and then we see scenario 3 being hit, where an error is being thrown because the object wasn't in the cache. Because we add the placeholder object to the cache before the creation route ever returns, we should be able to safely assume this scenario is impossible. But that's clearly not the case.
Is it possible that there is some delay between when a Redis write operation returns and when the data is actually available for querying? That is, is it possible that even though the call to add the cache entry has completed, the data would briefly not be returned by queries? It seems to be the only thing that can explain the behavior we are seeing.
If that is a possibility, how can I avoid this scenario? Is there some way to force Redis to wait until the data is available for query before returning?
Is it possible that there is some delay between when a Redis write operation returns and when the data is actually available for querying?
Yes, and it may depend on your Redis topology and on your network configuration. Only a standalone Redis server provides strong consistency, albeit with some considerations - see below.
Redis replication
When using replication in Redis, the writes which happen on a master need some time to propagate to its replica(s), and the whole process is asynchronous. Your client may happen to issue read-only commands to replicas, a common approach used to distribute load among the available nodes of your topology. If that is the case, you may want to lower the chance of an inconsistent read by:
directing your read queries to the master node; and/or,
issuing a WAIT command right after the write operation and ensuring all the replicas acknowledged it: while this makes the replication process synchronous from the client's standpoint, it should be used only if absolutely needed because of its performance cost.
There would still be a (tiny) possibility of an inconsistent read if, during a failover, a replica which did not receive the write operation is promoted to master.
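For example, a rough sketch with the redis-py client (the master address, key name, and replica count are illustrative assumptions):

import redis

r = redis.Redis(host="redis-master")  # assumed to point at the master
r.set("poll:1234", "")  # the placeholder entry written by route A

# WAIT blocks until the given number of replicas have acknowledged all
# previous writes, or the timeout (in milliseconds) expires, and returns
# how many replicas actually acknowledged.
acked = r.execute_command("WAIT", 1, 100)
if acked < 1:
    # The write may not be visible on the replica yet: retry, or fall
    # back to reading from the master for this key.
    pass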
Standalone Redis server
With a standalone Redis server, there is no need to synchronize data with replicas and, on top of that, your read-only commands will always be handled by the same server that processed the write commands. This is the only strongly consistent option, provided you are also persisting your data accordingly: otherwise, a server restart between your write and your read could lose the entry.
Persistence
Redis supports several different persistence options; in your scenario, you may want to configure your server so that it
logs every write operation to disk (AOF) and
fsyncs after every write.
Of course, every configuration setting is a trade-off between performance and durability.
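In redis.conf terms, that corresponds to the following two directives (a sketch showing only the relevant settings):

appendonly yes        # log every write command to the append-only file
appendfsync always    # fsync the AOF after every write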

Question about implementing Raft's Client interaction

I'm currently learning MIT 6.824,
https://www.youtube.com/channel/UC_7WrbZTCODu1o_kfUMq88g,
and trying to implement its labs.
There's a paragraph in Raft's paper describing client semantics:
Our goal for Raft is to implement linearizable semantics (each operation appears to execute instantaneously, exactly once, at some point between its invocation and its response). However, as described so far Raft can execute a command multiple times: for example, if the leader crashes after committing the log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time. The solution is for clients to assign unique serial numbers to every command. Then, the state machine tracks the latest serial number processed for each client, along with the associated response. If it receives a command whose serial number has already been executed, it responds immediately without re-executing the request.
Now I have passed MIT lab 3A, but I have responses map[string]string in kvserver, which is a map from a client's request ID to its response. The problem is that the map will keep growing if clients keep sending requests, which is problematic in a real project. How does Raft handle this in a real project? Also, MIT lab 3 says one client will execute one command at a time, so I can probably optimize by deleting the client's previous request's response. But how does Raft handle this in a real project where clients' behavior is less constrained?
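The usual fix is exactly what the paper describes: keep only the latest serial number and response per client, which bounds the table by the number of clients rather than the number of requests. A minimal sketch of that idea (in Python rather than the lab's Go; all names are illustrative):

class DedupTable:
    def __init__(self):
        self.latest = {}  # client_id -> (serial, response)

    def apply(self, client_id, serial, execute):
        seen = self.latest.get(client_id)
        if seen is not None and seen[0] == serial:
            return seen[1]  # duplicate: answer from the cache, don't re-execute
        response = execute()  # run the state-machine operation
        # One slot per client: overwriting the previous entry bounds the
        # table by the number of distinct clients, not the request count.
        self.latest[client_id] = (serial, response)
        return response

For clients whose behavior is freer, production systems typically introduce explicit client sessions: a client registers to obtain an ID, the server keeps one dedup slot per live session, and sessions that have been silent for longer than some lease/TTL are evicted, after which stale retries from that client are rejected rather than deduplicated.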

Camunda: Receive multiple, different messages at once

I am currently developing a rather complex workflow with Camunda. The goal of this workflow is to orchestrate the execution of different external business processes, which includes starting, monitoring, and synchronizing these workflows. Everything besides the synchronization works as expected.
Example:
My example has one main workflow which starts multiple sub workflows. The main workflow has to be aware when all sub workflows are finished. Every sub workflow is triggered by a message and sends a message back to the main workflow at the end of execution. Therefore, all sub workflows should be synchronized in the main workflow.
The XML can be accessed on this site: https://pastebin.com/2aj4z0zU
Unfortunately, this leads to numerous message correlation exceptions at the choke point in the main workflow (1st lane, after the first parallel gateway). I am using the following code to correlate the messages:
this.runtimeService.createMessageCorrelation(messageName)
.processInstanceId(processInstanceId)
.setVariables(payload)
.correlate();
The whole workflow is executable and runs without errors, but only if one example_workflow at a time is executed. Starting multiple example_workflows quickly one after another results in this type of exception randomly for every message type:
ENGINE-16004 Exception while closing command context: Cannot correlate message 'PROCESS_B_FINISHED': No process definition or execution matches the parameters
org.camunda.bpm.engine.MismatchingMessageCorrelationException: Cannot correlate message 'PROCESS_B_FINISHED': No process definition or execution matches the parameters
at org.camunda.bpm.engine.impl.cmd.CorrelateMessageCmd.execute(CorrelateMessageCmd.java:88) ~[camunda-engine-7.14.0.jar!/:7.14.0]
Currently, the correlation exceptions occur when a PostgreSQL database is used. The same workflow runs much better, though not perfectly, when we use an H2 file-based database. None of the receive tasks are configured asynchronously; only the send tasks are (async before + exclusive).
Questions:
Is this already the best practice to synchronize multiple messages in one workflow?
What could be the reason for the correlation exceptions while using a postgresql database?
Used software:
spring boot application [Version:2.3.4]
camunda [Version:7.14.0]
h2 [Version:1.4.200]
postgresql [Version:42.2.22]
The process model seems to contain sequences where it can run into a deadlock (what if blue is followed directly by green? Or yellow?) or where you have race conditions. If the process has not reached a state where it is waiting to receive the message, then the message delivery will fail (as indicated in the error message you shared).
(The reason you are observing the correlation exception more frequently on PostgreSQL is the race condition. With this external database some operations take slightly more time, increasing the chance of the race condition occurring.)
The process engine needs to be able to match a message to a unique receiver. If there are multiple potential receivers for the same message name, and no other correlation criteria creating a unique match are provided, then the delivery will also fail. You either need to use unique message names per instance or, better, use a businessKey or a process variable which is unique per instance as an additional correlation criterion. This is why it does not work when you run multiple process instances.
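For instance, here is a rough sketch of such a correlation through Camunda's REST API, using the business key as the unique criterion (the endpoint URL and key values are illustrative; in Java the equivalent is adding .processInstanceBusinessKey(...) to the correlation builder):

import requests

# POST /message correlates one message; the businessKey restricts it to
# the single process instance that was started with that key.
resp = requests.post(
    "http://localhost:8080/engine-rest/message",
    json={
        "messageName": "PROCESS_B_FINISHED",
        "businessKey": "item-4711",  # unique per process instance
        "processVariables": {
            "itemId": {"value": "4711", "type": "String"}
        },
    },
)
resp.raise_for_status()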
Modelling a workflow with this parallel message bottleneck leads to a race condition, as mentioned in @rob2universe's post.
To solve this problem, I first had to correlate the messages directly. I did this by adding a unique identifier to every message, which was not a big deal since an item ID was already defined within the payload of every message. Secondly, I had to remove all asynchronous and exclusive markers for every receive task and the connected gateways. And thirdly, I had to reset the job executor properties to their default values. Limiting the pool size and the jobs per acquisition did not benefit the workflow execution.
After all these changes, my workflow now runs as expected with no errors. Unfortunately, due to the described bottleneck, optimistic locking exceptions are common, but the workflow engine handles these exceptions without further errors.

Phones won't stop ringing with Twilio Taskrouter

I've been trying to implement a call centre type system using Taskrouter using this guide as a base:
https://www.twilio.com/docs/tutorials/walkthrough/dynamic-call-center/ruby/rails
Project location is Australia, if that affects call details.
This system dials multiple numbers (workers), and I have run into an issue where phones will continue to ring even after the call has been accepted or cancelled.
I.e., if Taskrouter calls Workers A and B and A picks up first, A is connected to the customer, but B will continue to ring. If B then picks up the phone, they are greeted by a hangup tone. Ringing can continue for at least several minutes until B picks up (I haven't checked whether it ever times out).
Similar occurs if no one picks up and the call simply times out and is redirected to voicemail. As you can imagine, an endlessly ringing phone is pretty annoying, especially when there's no one on the other end.
I was able to replicate this issue using the above guide without modification (other than the minimum changes to set it up locally). Note that it doesn't dial workers simultaneously, rather it dials the first in line for a few seconds before moving to the next.
My interpretation of what is occurring is that Taskrouter is dialling workers, but not updating them when dialling should end, and simply moving on to the next stage of the workflow. It does update Worker status, so it knows if they've timed out for instance, but that doesn't update the actual call.
I have looked for any solutions to this and haven't found much, except the following:
How to make Twilio stop dialing numbers when hangup() is fired?
https://www.twilio.com/docs/api/rest/change-call-state
These don't specifically apply to Taskrouter, but suggest that a call that needs to be ended can be updated and completed.
I am not too sure if I can implement this, however, as Taskrouter seems to use the same CallSid for all calls being dialled within a Workflow, which makes it hard/impossible to separate each call, and it would end the active call as well.
It also just seems wrong that Taskrouter wouldn't be doing this automatically, so I wanted to ask about this before I tinker too much and break things.
Has anyone run into this issue before, or is able/unable to replicate it using the tutorial code?
When testing I've noticed the problem much more on landline numbers, which may only be because mobiles have their own timeout/redirects. VOIPs seem to immediately answer calls, so they behave a bit differently.
Any help/suggestions appreciated, thanks!
The current suggested workaround is to not issue the Dequeue instruction immediately, but rather to issue a Call instruction on the REST API when the Worker wishes to accept the Inbound Call.
This will create an Outbound Call to bridge the two calls together, so you won't have many outbound calls for the same inbound caller at once.
Your implementation will depend on the behavior that you want to achieve:
1. Do you want to simul-dial both Workers?
2. Do you want to send the task to both Workers, so that whoever clicks to Accept the Task first will have the call routed to them?
If it's #2, this is a scenario where you're saying that the Worker should accept the Reservation (reservation.accepted) before issuing the Call.
If it's #1, you can either issue a Call Instruction or a Dequeue Instruction. The key is that you provide a DequeueStatusCallbackUrl or CallStatusCallbackUrl to receive call progress events. Once one of the outbound calls is connected, you will need to complete the other associated call. So, unfortunately, you will have to track which outbound calls are tied to which Reservation, using AssignmentCallbacks or EventCallbacks, to make that determination within your app.
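As a rough sketch of that cleanup step with the twilio-python helper library (the credentials are placeholders, and tracking which call SIDs belong to a Reservation is left to your app; the tutorial itself uses Ruby):

from twilio.rest import Client

client = Client("ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "your_auth_token")

def complete_sibling_calls(connected_sid, reservation_call_sids):
    # Once one Worker's call is bridged, hang up the other calls that
    # are still ringing for the same Reservation.
    for sid in reservation_call_sids:
        if sid != connected_sid:
            # Updating a call's status to 'completed' ends it, whether
            # it is still ringing or already in progress.
            client.calls(sid).update(status="completed")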

Batching generation of http responses

I'm trying to find an architecture for the following scenario. I'm building a REST service that performs some computation that can be quickly batch computed. Let's say that computing 1 "item" takes 50ms, and computing 100 "items" takes 60ms.
However, the nature of the client is that only 1 item needs to be processed at a time. So if I have 100 simultaneous clients, and I write the typical request handler that sends one item and generates a response, I'll end up using 5000ms, but I know I could compute the same in 60ms.
I'm trying to find an architecture that works well in this scenario. I.e., I would like to have something that merges data from many independent requests, processes that batch, and generates the equivalent responses for each individual client.
If you're curious, the service in question is python+django+DRF based, but I'm curious about what kind of architectural solutions/patterns apply here and if anything solving this is already available.
At first you could think of a reverse proxy detecting all pattern-specific queries, collecting all these queries, and sending them to your application in an HTTP/1.1 pipeline (pipelining is a way to send a large number of queries one after another and receive all the HTTP responses in the same order at the end, without waiting for a response after each query).
But:
pipelining is very hard to do well;
you would have to code the reverse proxy yourself, as I do not know of one that does this;
one slow response in the pipeline blocks all the other responses;
you need an HTTP server able to hand several queries at once to your application, something which practically never happens unless the HTTP server is coded directly into your application, because HTTP servers usually dispatch one query at a time (e.g., you never receive 2 queries together in a PHP environment: you receive the first one, send the response, and then receive the next one, even if the connection contains 2 queries).
So the better idea would be to do this on the application side. You could identify matching queries and wait a small amount of time (10 ms?) to see whether other similar queries are incoming. You will need a way to communicate between several parallel workers here (say you have 50 application workers and 10 of them have received queries that could be treated in the same batch). This means of communication could be a database (a very fast one) or some shared memory, depending on the technology used.
Then, when too much time has been spent waiting (10 ms?) or when a large number of queries has been received, one of the workers could collect all the queries, run the batch, and tell every other worker that a result is there (here again you need a central point of communication, like LISTEN/NOTIFY in PostgreSQL, a shared-memory structure, a message queue service, etc.).
Finally, every worker is responsible for sending the right HTTP response.
The key here is having a system where the time you lose coordinating the shared request handling is smaller than the time saved by batching several queries together, and under low traffic this overhead should stay reasonable (since there you will always lose time waiting for nothing). And of course you are also adding complexity to the system, making it harder to maintain, etc.
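As a single-process sketch of that coalescing idea (batch_compute, the 10 ms window, and the size threshold are illustrative; with multiple Django workers you would still need the cross-process coordination described above):

import threading
from concurrent.futures import Future

class Batcher:
    def __init__(self, batch_compute, max_delay=0.010, max_items=100):
        self.batch_compute = batch_compute  # computes a list of items at once
        self.max_delay = max_delay
        self.max_items = max_items
        self.lock = threading.Lock()
        self.pending = []  # list of (item, Future) pairs
        self.timer = None

    def submit(self, item):
        # Called from each request handler; blocks until this item's result is ready.
        fut = Future()
        flush_now = False
        with self.lock:
            self.pending.append((item, fut))
            if len(self.pending) >= self.max_items:
                flush_now = True
            elif self.timer is None:
                # First item of a new batch: start the collection window.
                self.timer = threading.Timer(self.max_delay, self._flush)
                self.timer.start()
        if flush_now:
            self._flush()
        return fut.result()

    def _flush(self):
        with self.lock:
            batch, self.pending = self.pending, []
            if self.timer is not None:
                self.timer.cancel()
                self.timer = None
        if not batch:
            return
        # One batched computation answers every waiting caller.
        try:
            results = self.batch_compute([item for item, _ in batch])
        except Exception as exc:
            for _, fut in batch:
                fut.set_exception(exc)
            return
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)

In a DRF view, each request handler would simply call batcher.submit(item): concurrent requests arriving within the window share one batch_compute call, and each caller gets back only its own result.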
