What data structure does Erlang use in its inboxes? - data-structures

Erlang uses message passing to communicate between processes. How does it handle concurrent incoming messages? What data structure is used?

The process inbox is made of two lists.
The main one is a FIFO queue where all incoming messages are stored, waiting for the process to examine them in the exact order they were received. The second one is a stack that holds the messages that did not match any clause of the current receive statement.
When the process executes a receive statement, it tries to pattern match the first message against each clause of the receive, in the order the clauses are declared, until the first match occurs.
- If no match is found, the message is removed from the FIFO and pushed onto the second list, and the process moves on to the next message. (Process execution may be suspended in the meantime, either because the FIFO is empty or because the process has used up its "reduction quota".)
- If a match is found, the message is removed from the FIFO, and the stacked messages are restored to the FIFO in their original order.
Note that pattern matching also copies the interesting parts of the message into the process variables. For example, if {request,write,Value,_} -> ... succeeds, the examined message is a 4-element tuple whose first and second elements are the atoms request and write, and whose third element was successfully matched against the variable Value: either Value was unbound and is now bound to that element, or Value was already bound and equals that element. The fourth element is discarded. Once this operation completes, there is no way to retrieve the original message.
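The two-list behaviour described above can be sketched as a toy model in Python. This is only an illustration of the description, not ERTS code: the Mailbox class, its method names, and the (predicate, handler) representation of receive clauses are all assumptions made for the example, and where a real process would suspend on an empty FIFO this model simply returns None.

```python
from collections import deque


class Mailbox:
    """Toy model of the two-list inbox described above (not ERTS code)."""

    def __init__(self):
        self.fifo = deque()   # incoming messages, oldest first
        self.deferred = []    # stack of messages that matched no clause

    def send(self, message):
        self.fifo.append(message)

    def receive(self, clauses):
        """clauses is a list of (matches, handler) pairs, tried in order.

        Returns the handler result for the oldest message that matches a
        clause, or None if nothing matches (a real process would suspend).
        """
        try:
            while self.fifo:
                message = self.fifo.popleft()
                for matches, handler in clauses:
                    if matches(message):
                        return handler(message)
                # No clause matched: set the message aside and keep scanning.
                self.deferred.append(message)
            return None
        finally:
            # Put the skipped messages back at the front, in original order.
            while self.deferred:
                self.fifo.appendleft(self.deferred.pop())


def is_write(message):
    # Rough equivalent of the pattern {request, write, Value, _}
    return len(message) == 4 and message[:2] == ("request", "write")


box = Mailbox()
box.send(("log", "starting"))
box.send(("request", "read", "key"))
box.send(("request", "write", 42, "ignored"))

# The handler keeps the third element (Value) and drops the fourth.
value = box.receive([(is_write, lambda message: message[2])])
print(value)           # 42
print(list(box.fifo))  # the two skipped messages, still in arrival order
```

The two messages that did not match stay in the inbox in their original order, so a later receive with different clauses can still pick them up.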

You may get some info out of checking out the erl_message primitive, erl_message.c, and its declaration file, erl_message.h.
You may also find these threads helpful (one, two), although I think your question is more about the data structures in play.
ERTS Structures
The Erlang runtime system (ERTS) allocates a fragmented (linked) heap for the scheduling of message passing (see source). The ErlHeapFragment structure can be found here.
However, each process also has a fairly simple FIFO queue structure to which it copies messages from the heap in order to consume them. Underlying the queue is a linked list, and there are mechanisms to bypass the heap and use the process queue directly. See here for more info on that queue.
Finally, each process also has a stack (also implemented as a list) where messages that don't match any pattern in receive are placed. This acts as a way to store messages that might be important, but that the process has no way of handling (matching) until another, different message is received. This is part of how Erlang has such powerful "hot-swapping" mechanisms.
Concurrent Message Passing Semantics
At a high level, the erts receives a message and places it in the heap (unless explicitly told not to), and each process is responsible for selecting messages to copy into its own process queue. From what I have read, the messages currently in the queue will be processed before copying from the heap again, but there is likely more nuance.

Related

How to handle unsent data in microservices

I have two services A and B. A receives a request, does some processing and sends the processed data to B.
What should I do with the data in the following scenario:
1. A receives data.
2. Processes it successfully.
3. Crashes before sending the data to B.
4. Comes back online.
I would either use some sort of persistent log to handle the communication between the microservices (e.g. Kafka) or some sort of retry mechanism.
In either case, the data that A received and processed must not disappear until the entire chain of execution completes successfully or, at the very least, until A has successfully completed its work and passed its payload to the next service. And this payload must exist until the next service processes it, and so on.
Generally, the steps should continue as follows:
5. A comes back online and sees that there is work to be done: the item it processed at step #2 (since its processing is not yet done as far as the overall system is concerned). Unless there are some weird side effects, it shouldn't matter that it processes it again.
6. The data is sent to B (although this step should, conceptually, be part of "processing" the data).
If A crashes again, it probably means that the data matches a bug in A, and the whole chain of starting up, reprocessing and crashing will continue forever. This is a denial of service, malicious or not, and you should have some procedure in place to handle it: for example, don't reprocess the same data more than a given number of times, and log the failure so it can be analyzed with top priority.
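A minimal Python sketch of the retry-with-a-cap idea follows. It makes several assumptions for illustration: the pending dict stands in for a durable log or table that survives a crash of A, MAX_ATTEMPTS and the drain, process and quarantine names are hypothetical, and a crash is simulated by an exception rather than by the process actually dying.

```python
MAX_ATTEMPTS = 3  # hypothetical cap on reprocessing the same item


def drain(pending, process, send, quarantine):
    """Run once each time A starts: retry everything not yet handed to B.

    pending maps item id -> {"payload": ..., "attempts": n} and stands in
    for durable storage that survives a crash of A.
    """
    for item_id in list(pending):
        entry = pending[item_id]
        if entry["attempts"] >= MAX_ATTEMPTS:
            # Probably a poison message: set it aside for analysis.
            quarantine[item_id] = pending.pop(item_id)
            continue
        entry["attempts"] += 1  # recorded before the risky work starts
        try:
            send(process(entry["payload"]))
        except Exception:
            continue            # stays in pending for the next run
        del pending[item_id]    # removed only after B has accepted it


def process(payload):
    if payload == "poison":
        raise ValueError("bug in A triggered by this payload")
    return payload * 10


pending = {
    1: {"payload": 2, "attempts": 0},
    2: {"payload": "poison", "attempts": 0},
}
sent, quarantine = [], {}

for _ in range(4):  # four simulated restarts of A
    drain(pending, process, sent.append, quarantine)

print(sent)              # [20]
print(list(quarantine))  # [2]
```

The important ordering is that an item is deleted only after the send succeeds, so a crash at any point leaves it available for reprocessing. B may therefore see the same payload twice and should tolerate duplicates.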

Which guarantees does Kafka Stream provide when using a RocksDb state store with changelog?

I'm building a Kafka Streams application that generates change events by comparing every new calculated object with the last known object.
So for every message on the input topic, I update an object in a state store and every once in a while (using punctuate), I apply a calculation on this object and compare the result with the previous calculation result (coming from another state store).
To make sure this operation is consistent, I do the following after the punctuate triggers:
1. write a tuple to the state store
2. compare the two values, create change events and context.forward them, so the events go to the results topic
3. replace the tuple with the new_value and write it to the state store
I use this tuple for scenarios where the application crashes or rebalances, so I can always send out the correct set of events before continuing.
Now, I noticed the resulting events are not always consistent, especially if the application frequently rebalances. It looks like in rare cases the Kafka Streams application emits events to the results topic, but the changelog topic is not up to date yet. In other words, I produced something to the results topic, but my changelog topic is not at the same state yet.
So, when I do a stateStore.put() and the method call returns successfully, are there any guarantees when it will be on the changelog topic?
Can I enforce a changelog flush? When I do context.commit(), when will that flush+commit happen?
To get complete consistency, you will need to enable processing.guarantee="exactly_once". Otherwise, if an error occurs, you might get inconsistent results.
If you want to stay with "at_least_once", you might want to use a single store, and update the store after processing is done (i.e., after calling forward()). This minimizes the time window in which inconsistencies can arise.
And yes, if you call context.commit(), before input topic offsets are committed, all stores will be flushed to disk, and all pending producer writes will also be flushed.

Track causality in message queue

I'm working on a system where I have to process multiple messages but keep partial order of these messages. I have RabbitMQ queue with messages of two types: item create and item update.
Consider we have queue with 6 messages:
CREATE1 CREATE2 UPDATE1 UPDATE2 UPDATE1 UPDATE1
- If I process them one by one, it's completely fine, but it's very slow, because I have lots of messages.
- If I read them into some buffer, I can process them in parallel, but then I may process UPDATE1 for the first item before it has been created. Worse, the last update may be processed before the previous one and thus overwrite the latest item state.
I can create some extra field in the message or put it in the queue with some extra header, e.g. MESSAGE_ID:10 to make sure that all messages for one item have the same MESSAGE_ID. The problem is that I don't know what to do with it.
How can I read from the queue multiple items at once without breaking causality between messages?
The pseudocode that I imagine for this task could be:
const prefetchItemsCount = 20
let buffer = new List<Message>()
let set = new Set()
foreach item in queue
    // take only the first pending message for each item
    if !set.Contains(item.MessageId)
        set.Add(item.MessageId)
        buffer.Add(item)
    // stop once the buffer is full
    if buffer.Count == prefetchItemsCount
        break
return buffer
So in our example it will return following sequences of items:
CREATE1 CREATE2
UPDATE1 UPDATE2
UPDATE1
UPDATE1
Which makes it almost twice as fast.
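The batching idea in the pseudocode above can be sketched in Python against an in-memory queue. The assumptions here are that messages are (kind, item_id) tuples, that the consumer is allowed to leave skipped messages in the queue in their original order, and that a batch is fully processed before the next one is taken. The next_batch function is a hypothetical helper, not a RabbitMQ API, and a real broker does not let a consumer skip over messages this freely.

```python
from collections import deque


def next_batch(queue, prefetch_count=20):
    """Remove and return up to prefetch_count messages with distinct ids.

    Only the oldest pending message of each item is taken; everything else
    stays in the queue in its original order.
    """
    seen = set()
    batch = []
    remaining = deque()
    while queue:
        message = queue.popleft()
        _kind, item_id = message
        if item_id not in seen and len(batch) < prefetch_count:
            seen.add(item_id)
            batch.append(message)
        else:
            remaining.append(message)
    queue.extend(remaining)
    return batch


queue = deque([
    ("CREATE", 1), ("CREATE", 2), ("UPDATE", 1),
    ("UPDATE", 2), ("UPDATE", 1), ("UPDATE", 1),
])

batches = []
while queue:
    # Messages inside one batch touch different items, so they can be
    # processed in parallel; batches themselves must run one after another.
    batches.append(next_batch(queue))

for batch in batches:
    print(batch)
```

For the six example messages this produces four batches, matching the sequences listed above. Per-item order is preserved only because each batch finishes before the next one starts.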
How can I read from the queue multiple items at once without breaking causality between messages?
Nice case, indeed.
If indeed performed in the desired manner, the TimeDOMAIN singularity of "at once" goes principally against a hidden morphology of what was expressed as "causality".
Given together with a QUEUE-ingress side, which is by definition a pure-[SERIAL] ( nothing may happen at once, just a pure one-goes-after-another, even if a "just"-[CONCURRENT] scheduling may get exposed to external agents, the internal QUEUE-management conforms to a pure sequential ordering of the messages' internal flow and delivery ( also ref. to time-stamping, persistence and other artefacts thereof ) ).
Causality also means some cause -> effect ordering of events, both in the abstracted causality sense of the relation and also in the flow of real time of how things indeed do happen, so practically an anti-pattern to the "at once".
Last, but not least, the Causality also has to handle an additional paradigm, the latency between the cause -> side and the -> effect side of the ( often Finite-State-Automata, typically having much richer state-space than a just { 0 -> CREATE -> UPDATE [ -> UPDATE [...] ] -> } ) series of events.
Result?
While one may "read" using some degree of [CONCURRENT]-scheduling of processes, the FSA / Causality conditions principally avoid moving anywhere out of the principal pure-[SERIAL] post-processing of event-messages delivered.
More requirements on this come if the messaging framework is broker-less and without guaranteed robustness against lost messages / message ordering / message authenticity / message content.
There the Devils start to dance against your attempts to build a consistent, distributed transaction processing robust, distributed FSA :o)

Collector Node Issue (IIB)

Collector node issue: I am currently using collector node to group messages (XML's). My requirement is to collect messages till the last message is received. (Reading from file input)
Control terminal: I'm sending a control message to stop collection and propagate to next node. But this doesn't work. As it still waits for timeout/quantity condition to be satisfied.
MY QUESTION: What condition can I use to collect messages till the last message received?
Add a separate input terminal on the Collector node that is used to complete a collection. Once you send a message to the second terminal, the collection is complete and propagated.
The Control terminal can be used to signal the Collector node when complete collections are propagated, not to determine when a collection is complete.
A collection is complete when either the set number of messages is received or the timeout is exhausted for all input terminals.
So if you don't know in advance how many messages you want to include in a collection, you have 3 options:
1. Set message quantity to 0 and set an appropriate timeout for input terminals.
   This way the node will include in the collection all messages received between the first message and the timeout value.
2. Set a large number as message quantity and use collection expiry.
   With collection expiry, incomplete collections can be propagated to the expiry terminal, but this will work essentially the same as the previous method.
3. Develop your own collector flow.
   You can develop a flow for combining messages using MQ Input, Get and Output nodes, keeping intermediate combined messages in MQ queues. Use this flow to combine your inputs and send the complete message onto the input queue of your processing flow.

MPI buffered send/receive order

I'm using MPI (with fortran but the question is more specific to the MPI standard than any given language), and specifically using the buffered send/receive functions isend and irecv. Now if we imagine the following scenario:
Process 0:
    isend(stuff1, ...)
    isend(stuff2, ...)
Process 1:
    wait 10 seconds
    irecv(in1, ...)
    irecv(in2, ...)
Are the messages delivered to Process 1 in the order they were sent, i.e. can I be sure that in1 == stuff1 and in2 == stuff2 if the tag used is the same in all cases?
Yes, the messages are received in the order they are sent. This is described by the standard as non-overtaking messages. See this MPI Standard section for more details, here's an excerpt:
Order Messages are non-overtaking: If a sender sends two messages in succession to the same destination, and both match the same receive, then this operation cannot receive the second message if the first one is still pending. If a receiver posts two receives in succession, and both match the same message, then the second receive operation cannot be satisfied by this message, if the first one is still pending. This requirement facilitates matching of sends to receives. It guarantees that message-passing code is deterministic, if processes are single-threaded and the wildcard MPI_ANY_SOURCE is not used in receives. (Some of the calls described later, such as MPI_CANCEL or MPI_WAITANY, are additional sources of nondeterminism.)
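The matching rule in the excerpt can be illustrated with a small pure-Python model. This is not MPI code and does not call any MPI library: match_receives is a hypothetical function that only mimics the rule that each posted receive takes the oldest pending message with the same source and tag.

```python
def match_receives(sent, posted):
    """Toy model of non-overtaking matching (no MPI involved).

    sent:   list of (source, tag, payload) in the order they were sent
    posted: list of (source, tag, buffer_name) in the order receives were posted
    Returns a dict mapping buffer_name -> payload.
    """
    pending = list(sent)
    result = {}
    for source, tag, buffer_name in posted:
        for index, (msg_source, msg_tag, payload) in enumerate(pending):
            if msg_source == source and msg_tag == tag:
                # The oldest matching message wins; later ones cannot overtake it.
                result[buffer_name] = payload
                del pending[index]
                break
    return result


# Same sender and same tag, as in the question.
same_tag = match_receives(
    sent=[(0, 7, "stuff1"), (0, 7, "stuff2")],
    posted=[(0, 7, "in1"), (0, 7, "in2")],
)
print(same_tag)  # {'in1': 'stuff1', 'in2': 'stuff2'}

# With different tags, a receive may match a message that was sent later.
mixed_tags = match_receives(
    sent=[(0, 1, "a"), (0, 2, "b")],
    posted=[(0, 2, "x"), (0, 1, "y")],
)
print(mixed_tags)  # {'x': 'b', 'y': 'a'}
```

The second call shows why the guarantee is stated in terms of messages that match the same receive: ordering is only fixed among messages with the same source, tag and communicator.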
Yes and no.
"can I be sure that in1 == stuff1 and in2 == stuff2 if the tag used is the same in all cases?"
Yes. There is a deterministic 1:1 correlation between send's and recv's that will get the correct input into the correct recv buffer. This behavior is guaranteed by the standard, and is enforced by all MPI implementations.
No. The exact order of internal message progression and the exact order in which buffers on the receiver side are populated is somewhat of a black box, especially when RDMA-style message transfers with multiple in-flight buffers are being used (e.g. InfiniBand).
If your code is using multiple threads, and inspecting the buffer to determine completeness (e.g. waiting on a bit to be toggled) rather than using MPI_Test or MPI_Wait, then it is possible that the messages can arrive out of order (but in the correct buffer).
If your code depends on in1 = stuff1 being populated BEFORE in2 = stuff2 is populated on the receiver side, and there is a single sending rank for both messages, then using MPI_Issend (non-blocking, synchronous send) will guarantee the messages are received in order. If you need to guarantee the buffer population order of multiple receives from multiple sending ranks, then some kind of blocking call is required between each recv (e.g. MPI_Recv, MPI_Barrier, MPI_Wait, etc.).

Resources