Apache Storm Message Passing Implementation (MPI) - zeromq

According to Storm's message-passing implementation, the workers manage connections to other workers and maintain a mapping from task to task. In addition, the transfer step takes a task id and a tuple, serializes the tuple, and puts it onto a "transfer queue".
The question is whether there is a way to organise scheduling such that certain tasks of an operator communicate with only certain tasks of the following operator at a given time, according to the application’s topology (could ZeroMQ possibly do something like this?).

Q : "If there is a way to organise scheduling, such that certain tasks of an operator communicate to only certain tasks of the following operator at a given time according to the application’s topology ( could ZeroMQ possibly do something like this? )."
Yes, it could. ZeroMQ allows smart and flexible creation of signalling/messaging meta-plane infrastructures for distributed computing, and it has been improving at this for the last 12+ years.
The URL in the comment attached by @HristoIlliev shows that Apache Storm itself reports already using a ZeroMQ layer for its own services (as of version 0.8.0; unfortunately almost all of the source-code links there are already dead):
The implementation for distributed mode uses ZeroMQ [code]
The implementation for local mode uses in-memory Java queues (so that it's easy to use Storm locally without needing to get ZeroMQ installed) [code]
...
Tasks listen on an in-memory ZeroMQ port for messages from the virtual port [code]
So the topology-related part of your question comes down to a decision that has already been made in the "outer" Apache Storm architecture:
Tasks are responsible for message routing. A tuple is emitted either to a direct stream (where the task id is specified) or a regular stream. In direct streams, the message is only sent if that bolt subscribes to that direct stream. In regular streams, the stream grouping functions are used to determine the task ids to send the tuple to.
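The routing rule quoted above can be sketched roughly as follows. This is a minimal Python illustration, not Storm's actual code: the names `route_tuple`, `shuffle_grouping` and `fields_grouping`, as well as the shape of the `subscribers` records, are hypothetical and only meant to show how a grouping function can restrict which downstream tasks receive a tuple.

```python
import random
import zlib


def shuffle_grouping(target_tasks, values, rng=random):
    """Pick one target task at random (hypothetical stand-in for a shuffle grouping)."""
    return [rng.choice(target_tasks)]


def fields_grouping(key_index):
    """Build a grouping that always maps the same key value to the same task."""
    def grouping(target_tasks, values):
        key = repr(values[key_index]).encode("utf-8")
        # crc32 is stable across runs, unlike Python's salted hash() for strings.
        return [target_tasks[zlib.crc32(key) % len(target_tasks)]]
    return grouping


def route_tuple(values, subscribers, direct_task=None):
    """Return the task ids that should receive this tuple.

    subscribers: list of dicts {"tasks": [...], "direct": bool, "grouping": callable}
    direct_task: a task id for a direct-stream emit, None for a regular stream.
    """
    targets = []
    for sub in subscribers:
        if direct_task is not None:
            # Direct stream: deliver only if this subscriber declared a direct
            # subscription and actually owns the named task.
            if sub["direct"] and direct_task in sub["tasks"]:
                targets.append(direct_task)
        elif not sub["direct"]:
            # Regular stream: the grouping function chooses the task ids.
            targets.extend(sub["grouping"](sub["tasks"], values))
    return targets
```

Under these assumptions, "certain tasks talk only to certain tasks" becomes a matter of which grouping function is plugged in, rather than something the transport layer has to decide.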
MPI has done the same for the HPC-focused computing ecosystem ever since FORTRAN jobs started to run on the first distributed HPC infrastructures. Because most HPC problems were "simply" scaled onto larger hardware footprints, MPI focused on the efficiency of such uniform scaling. It did not explore the opposite corner: adaptive, almost ad-hoc setups of message-passing infrastructure built from layered topologies of specialised ZeroMQ Scalable Formal Communication Archetype patterns. Each of the tools therefore focuses on different factors.
If you want to read a bit more about ZeroMQ, this answer might help you quickly understand the core underlying concepts.

Related

Regarding Akka message transfer performance: many small messages or less large messages?

For a data-mining algorithm I am currently developing using Akka, I was wondering if Akka implements performance optimizations of the messages that are sent.
For instance, if I have an Actor that emits a very large number of messages to the same other Actor, is it good to encapsulate a set of messages into another large message? Or does Akka have some sort of buffer itself, so that not one message but many messages are transferred over the network at once?
I am asking this question because the algorithm is supposed to be executed remotely on a cluster where transfer performance is important and I currently have no option to just do benchmarks myself.
For messages passed in Akka on the same machine, I don't think it matters much whether you use small messages or an aggregation of messages as a single message. I think the additional overhead of many calls, versus having to loop while processing the aggregation, is minimal.
I would prefer using small messages because it keeps the system simpler.
However, when sending messages over the network, Akka remoting adds per-message serialization and transport overhead. Therefore you might choose to aggregate some messages into a single message here.
However, this also depends on your use case. Buffering implies waiting for more messages until there are enough (or until a timeout has occurred). If you cannot wait, e.g. because you need fast responses, then you still need to send each message individually.
I don't think there is a standard Akka actor available which does some aggregation of messages. Maybe a special kind of routing could be applied which does the buffering.
Or you might have a look at Akka Streams. That does support buffering of messages.
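The "enough messages or a timeout" trade-off described above can be sketched in a few lines. This is a plain-Python illustration of the idea, not Akka code: the `MessageBatcher` class, its parameters and the `send` callback are all hypothetical names chosen for this example.

```python
import time


class MessageBatcher:
    """Collect messages and flush them as one list when full or too old."""

    def __init__(self, send, max_size=100, max_delay=0.05, clock=time.monotonic):
        self._send = send            # callable that transmits one batch (a list)
        self._max_size = max_size    # flush as soon as this many messages wait
        self._max_delay = max_delay  # seconds the oldest message may wait
        self._clock = clock          # injectable clock, handy for testing
        self._buffer = []
        self._first_at = None

    def add(self, message):
        if not self._buffer:
            # Remember when the oldest buffered message arrived.
            self._first_at = self._clock()
        self._buffer.append(message)
        if len(self._buffer) >= self._max_size:
            self.flush()

    def tick(self):
        """Call periodically; flushes if the oldest message has waited too long."""
        if self._buffer and self._clock() - self._first_at >= self._max_delay:
            self.flush()

    def flush(self):
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)
```

A small `max_delay` keeps latency bounded for callers that need fast responses, while a larger `max_size` amortises the per-message network cost; which to favour depends on the use case, as noted above.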

akka actor model vs java usage in following scenario

I want to know the applicability of the Akka Actor model.
I know it is useful when a huge number of Actor instances are created and destroyed, e.g. a call server where every incoming call creates an actor instance that communicates with a few other actors and gets killed after the call is over.
Is it also useful in the following scenario:
A server has a few processing elements (10~50) implemented as Actors. The lifetime of these processing elements is infinite. Some of them do not maintain state and a few do. The processing elements process each message and pass it on to other actors in a fixed manner. The system receives a huge number of messages from outside; these pass through the processing elements and then leave the system.
My gut feeling is that we cannot gain any advantage from using the Akka Actor model, or even from implementing this server in Scala, because the use case for which Akka is designed does not apply here. If scaling up meant that the number of processing elements had to increase dynamically, then it would apply.
For fixed topologies, I think implementing it in Java is going to be more beneficial in terms of raw performance. The 'immutability' feature of Scala leads to more copies and so reduces performance. So I believe I had better stick to Java.
Is my understanding correct? In a nutshell, I want to know why I should leave Java and use Scala/Akka for the application scenario above. My target is to process 1 million messages per second.
If this question is still relevant...
Scala vs. Java
Scala gives productivity to developers.
Immutability reduces debugging effort to almost zero.
The GC copes well with discarded immutable objects.
Akka Actors vs. other means
Akka has a dispatcher that distributes all tasks across a fixed thread pool. This allows the available resources to be consumed evenly. This approach is much better than fixed worker threads: the processing resources are given to the tasks, not to the DataFlow nodes.
DataFlow implementation
There is a SynapseGrid library that is built on top of Akka Actors and allows easy construction of DataFlow systems distributed over fixed immortal Actors. It can even draw the DataFlow diagram (in .dot format) of the whole system.
(The library is more convenient to use from Scala.)

Storm as a replacement for Multi-threaded Consumer/Producer approach to process high volumes?

We have an existing setup where upstream systems send messages to us on a Message Queue and we process these messages. The content is XML and we simply unmarshal it. This unmarshalling step is followed by a write to the db (to put the relevant values into the relevant columns).
The system is set to interface with many more upstream systems, and our volumes are going to increase to a peak of 40 million messages per day.
Our current way of processing is to have listeners on the queues and then multiple producer and consumer threads that do the unmarshalling and the subsequent db write.
My question: can this process fit into the Storm use case scenario?
I mean, can the MQ be my spout, with two bolts: one to unmarshal, whose output then feeds the next bolt, which does the write to the db?
If yes, what benefit can I derive? Is it goodbye to the cumbersome multi-threaded producer/worker pattern of code?
If it's as simple as the above, then where and why would one want to resort to the conventional multi-threaded approach to the producer/consumer scenario?
My point being: is there a data volume or frequency at which Storm starts to shine compared to the conventional approach?
PS: I'm very new to this, trying to get the hang of it, and want to ascertain whether this line of thinking is right.
Regards,
CVM
This scenario can definitely fit into a Storm topology. The spouts can pull from the MQ and the bolts can handle the unmarshalling and subsequent processing.
The major benefit over the conventional multi-threaded pattern is the ability to add more worker nodes as the load increases. This is not so easy with traditional producer/consumer patterns.
A specific data volume at which Storm pays off is hard to state, since it depends on a large number of factors such as hardware.

Storm vs. Trident: When not to use Trident?

I'm working with Storm and it is fine for a lot of use cases. Recently I had a look at Trident, which is a high-level abstraction of Storm. It supports exactly-once processing and makes stateful processing easier.
But now I'm wondering.. Why can't I always use Trident instead of Storm?
What I read so far:
Trident processes messages in batches, so throughput time could be longer.
Trident is not yet able to process loops in topologies.
Are there any other disadvantages when using Trident instead of Storm? Because right now, I think the disadvantages I listed above are marginal.
What use cases cannot be implemented with Trident?
Aftermath:
Since I asked the question, my company decided to go for Trident first. We will only use pure Storm when there are performance problems. Sadly this wasn't an active decision; it just became the default behavior (I wasn't around at that time).
Their assumption was that in most use cases we need state or only-once processing, or will need it in the near future. I understand their reasoning, because moving from Storm to Trident or back isn't an easy transformation. In my personal opinion, though, the concept of stream processing without state wasn't understood by everyone, and that was the main reason to use Trident.
To answer your question: when shouldn't you use Trident? Whenever you can afford not to.
Trident adds complexity to a Storm topology, lowers performance and generates state. Ask yourself the question: do you need the "exactly once" processing semantics of Trident, or can you live with the "at least once" processing semantics of Storm? For exactly once, use Trident; otherwise don't.
I would also just like to highlight the fact that Storm guarantees that all messages will be processed. Some messages might just be processed more than once.
If the lowest possible latency is your goal and you don't need exactly-once processing, then using Storm is better than Trident.
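To make the difference between the two guarantees concrete, here is a small plain-Python sketch. It is not Storm or Trident code and does not show how Trident is implemented; the `CountingState` class and its method names are hypothetical. It only illustrates why a replayed message is harmless for some updates and harmful for others unless extra state is kept.

```python
class CountingState:
    """Toy running total that can be updated with or without replay protection."""

    def __init__(self):
        self.total = 0
        self._seen_ids = set()  # extra state needed to reject replays

    def apply_at_least_once(self, msg_id, amount):
        # No replay protection: a redelivered message is counted again.
        self.total += amount

    def apply_exactly_once(self, msg_id, amount):
        # Remember which message ids were applied and skip repeats.
        if msg_id in self._seen_ids:
            return False
        self._seen_ids.add(msg_id)
        self.total += amount
        return True


# The same three deliveries, where message "m2" is replayed after a failure.
deliveries = [("m1", 5), ("m2", 7), ("m2", 7)]

plain = CountingState()
for msg_id, amount in deliveries:
    plain.apply_at_least_once(msg_id, amount)

guarded = CountingState()
for msg_id, amount in deliveries:
    guarded.apply_exactly_once(msg_id, amount)
```

Here `plain.total` ends up including the replayed amount while `guarded.total` does not. The price for the second behaviour is the additional bookkeeping, which is the kind of state and overhead the answers above refer to.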
Trident is a high-level abstraction for doing realtime computing on top of Twitter Storm, available in Storm 0.8.x. Storm is a stateless stream processing framework, and Trident provides stateful stream processing.
Chris, since both are open-source technologies, Trident is essentially just one implementation of a usage scenario on top of Storm, and of course that brings a performance overhead. If Trident cannot meet your requirements, you can create your own state implementation on top of Storm. Over time, Trident has also given rise to higher-level projects such as Trident-ML.
Assume we want to filter tuples and add a field to each tuple.
With plain Storm we would usually use two bolts, one for filtering and one for adding the field, so the tuple has to be sent on to a new bolt, perhaps using global grouping. Network bandwidth may then become the bottleneck.
With Trident we can do the above on a single machine, so no regrouping is needed in this case.
Use cases like this, in addition to the "exactly once" vs. "at least once" question, can help decide which one to use.
Trident is a kind of logical grouping.

How to dedupe parallel event streams

Is there a standard approach for deduping parallel event streams? Before I attempt to reinvent the wheel, I want to know if this problem has some known approaches.
My client component will be communicating with two servers. Each one is providing a near real-time event stream (~1 second). The events may occasionally be out of order. Assume I can uniquely identify the events. I need to send a single stream of events to the consuming code at the same near real-time performance.
A lot has been written about this kind of problem. Here's a foundational paper, by Leslie Lamport:
http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks
The Wikipedia article on Operational Transformation theory is a perfectly good starting point for further research:
http://en.wikipedia.org/wiki/Operational_transformation
As for your problem, you'll have to choose some arbitrary weight to measure the cost of delay against the cost of dropped events. You can maintain two time-ordered priority queues into which the incoming events go. You'd do a merge on the heads of the two queues with some delay (to allow for out-of-order events), and throw away events that happened "before" the timestamp of whatever event you last sent. If that's no better than what you had in mind already, well, at least you get to read that cool Lamport paper!
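A rough sketch of that merge-with-delay idea, in Python. It makes several assumptions that are not in the answer: both servers' events are pushed into one shared heap instead of two separate queues (the merge result is the same), every event carries a comparable timestamp and a unique, comparable id, and the class name `DedupingMerger` and its methods are hypothetical.

```python
import heapq


class DedupingMerger:
    """Merge events from several sources, drop duplicates, emit in time order."""

    def __init__(self, delay):
        self._delay = delay                # how long to hold events back
        self._heap = []                    # entries: (timestamp, event_id, payload)
        self._seen = set()                 # ids already queued or emitted
        self._last_sent = float("-inf")    # timestamp of the last emitted event

    def push(self, timestamp, event_id, payload):
        if event_id in self._seen:
            return  # the other server already delivered this event
        if timestamp < self._last_sent:
            return  # arrived too late to be emitted in order: drop it
        self._seen.add(event_id)
        heapq.heappush(self._heap, (timestamp, event_id, payload))

    def pop_ready(self, now):
        """Return events whose hold-back delay has expired, oldest first."""
        ready = []
        while self._heap and self._heap[0][0] <= now - self._delay:
            event = heapq.heappop(self._heap)
            self._last_sent = event[0]
            ready.append(event)
        return ready
```

The `delay` value is the "arbitrary weight" mentioned above: a longer delay tolerates more reordering but adds latency, and a shorter one drops more late events. In a long-running process the `_seen` set would also need pruning, for example by discarding ids older than the last emitted timestamp.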
I think that the optimization might be OS-specific. From the task as you described it, I picture two threads consuming the incoming data and appending it to a common stream, with access guarded by mutexes. Both Linux and Win32 have mutex-like primitives, but they may perform slowly if your data rate is very high. In that case I'd operate on blocks of data, so that the mutexes are taken less often. Of course there is also a main thread that consumes the data, and it accesses the stream through a mutex as well.

Resources