I'm working with Storm and it is fine for a lot of use cases. Recently I had a look at Trident, which is a high-level abstraction of Storm. It supports exactly-once processing and makes stateful processing easier.
But now I'm wondering: why can't I always use Trident instead of Storm?
What I read so far:
Trident processes messages in batches, so per-message latency can be higher.
Trident is not yet able to process loops in topologies.
Are there any other disadvantages when using Trident instead of Storm? Because right now, I think the disadvantages I listed above are marginal.
What use cases cannot be implemented with Trident?
Aftermath:
Since I asked the question, my company decided to go for Trident first. We will only use pure Storm when there are performance problems. Sadly, this wasn't an active decision; it just became the default behavior (I wasn't around at that time).
Their assumption was that in most use cases we need state or exactly-once processing, or will need it in the near future. I understand their reasoning, because moving from Storm to Trident (or back) isn't an easy transformation, but in my personal opinion the concept of stateless stream processing wasn't understood by everyone, and that was the main reason for using Trident.
To answer your question: when shouldn't you use Trident? Whenever you can afford not to.
Trident adds complexity to a Storm topology, lowers performance, and generates state. Ask yourself: do you need the "exactly once" processing semantics of Trident, or can you live with the "at least once" processing semantics of Storm? For exactly once, use Trident; otherwise, don't.
I would also just like to highlight the fact that Storm guarantees that all messages will be processed. Some messages might just be processed more than once.
If the lowest possible latency is your goal and you don't need exactly-once processing, then using Storm is better than Trident.
Trident is a high-level abstraction for doing realtime computing on top of Twitter Storm, available in Storm 0.8.x. Storm is a stateless stream-processing framework; Trident provides stateful stream processing on top of it.
Chris, since both of these are open-source technologies, Trident is simply an implementation of a scenario layered on top of Storm, which of course brings a performance overhead. If Trident cannot meet your requirements, you can create your own state implementation on top of Storm. Over time, Trident has yielded higher-level projects such as Trident-ML.
Assume we want to do filtering plus the addition of a field to a tuple.
If we use plain Storm, we usually use two bolts: one for the filtering and one for the field addition. So we need to send the tuple to the second bolt, perhaps using a global grouping, and here network bandwidth may become a bottleneck.
By using Trident we can do the above on a single machine, so no regrouping is needed in this case.
Such use cases, in addition to the "exactly once" / "at least once" semantics, can differentiate which one to use.
Trident essentially performs this kind of logical grouping of operations.
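As a concrete illustration, here is a minimal sketch of that pipeline in Trident's Java API (the spout, the "text" field, and the filter/function logic are hypothetical placeholders; package names are from the pre-Apache Storm 0.8.x era mentioned above):

```java
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFilter;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.spout.IBatchSpout;
import storm.trident.tuple.TridentTuple;

public class FilterAndEnrich {

    // Keep only tuples whose "text" field is non-empty (hypothetical filter logic).
    public static class NonEmptyFilter extends BaseFilter {
        @Override
        public boolean isKeep(TridentTuple tuple) {
            String text = tuple.getString(0);
            return text != null && !text.isEmpty();
        }
    }

    // Append a derived "len" field to each tuple (hypothetical function logic).
    public static class AddLengthField extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            collector.emit(new Values(tuple.getString(0).length()));
        }
    }

    public static TridentTopology build(IBatchSpout spout) {
        TridentTopology topology = new TridentTopology();
        topology.newStream("events", spout)
                .each(new Fields("text"), new NonEmptyFilter())                     // filter ...
                .each(new Fields("text"), new AddLengthField(), new Fields("len")); // ... then enrich
        return topology;
    }
}
```

Trident's planner fuses consecutive each() calls into the same bolt, so no network hop separates the filter from the enrichment.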
Related
We have a Spark Streaming micro-batch process which consumes data from a Kafka topic with 20 partitions. The data in the partitions is independent and can be processed independently. The current problem is that the micro-batch waits for processing to complete in all 20 partitions before starting the next micro-batch. So if one partition completes processing in 10 seconds and another partition takes 2 minutes, the first partition will have to wait 110 seconds before consuming its next offset.
I am looking for a streaming solution where we can process the 20 partitions independently, without having to wait for the other partitions to complete. The streaming solution should consume data from each partition and progress offsets at its own rate, independent of other partitions.
Does anyone have a suggestion on which streaming architecture would allow me to achieve my goal?
Any of Flink (AFAIK), KStreams, and Akka Streams will be able to progress through the partitions independently: none of them does Spark-style batching unless you explicitly opt in.
Flink is similar to Spark in that it has a job server model; KStreams and Akka are both libraries that you just integrate into your project and deploy like any other JVM application (e.g. you can build a container and run on a scheduler like kubernetes). I personally prefer the latter approach: it generally means less infrastructure to worry about and less of an impedance mismatch to integrate with observability tooling used elsewhere.
Flink is an especially good choice when it comes to time-window based processing and joins.
KStreams fundamentally models everything as a transformation from one kafka topic to another: the topic topology is managed by KStreams, but there can be some gotchas there (especially if you're dealing with anything time-seriesy).
Akka is the most general and (in some senses) the least opinionated of the toolkits: you will have to make more decisions with less handholding (I'm saying this as someone who could probably fairly be called an Akka cheerleader). As a pure stream-processing library it may not be the ideal choice, though being able to manage backpressure more explicitly (basically, what happens when data comes in faster than it can be processed) may make it more efficient than the alternatives in terms of resource consumption. I'd probably only choose it if you were also going to take advantage of cluster-sharded (and almost certainly event-sourced) actors: the benefit of doing that is that you can completely decouple your processing parallelism from the number of input Kafka partitions (e.g. you may be able to deploy 40 processing instances and have each work on half of one partition's worth of data from Kafka).
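To make the KStreams option concrete, here is a minimal Java sketch (topic names and the application id are hypothetical). Kafka Streams turns each of the 20 input partitions into its own stream task with its own committed offset, so a slow partition never blocks a fast one:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class IndependentPartitions {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "partition-independent-app"); // hypothetical id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Each thread works through its assigned tasks (= partitions) at its own pace;
        // offsets are committed per task, not per global micro-batch.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic"); // hypothetical topic
        source.mapValues(value -> expensiveProcessing(value))           // per-record, no batch barrier
              .to("output-topic");                                      // hypothetical topic

        new KafkaStreams(builder.build(), props).start();
    }

    private static String expensiveProcessing(String value) {
        return value; // placeholder for the real per-record work
    }
}
```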
According to the message-passing implementation of Storm, the workers manage connections to other workers and maintain a task-to-task mapping. Also, transferring takes a task id and a tuple; it serializes the tuple and puts it onto a "transfer queue".
The question is whether there is a way to organise scheduling such that certain tasks of an operator communicate with only certain tasks of the following operator at a given time, according to the application's topology (could ZeroMQ possibly do something like this?).
Q : "If there is a way to organise scheduling, such that certain tasks of an operator communicate to only certain tasks of the following operator at a given time according to the application’s topology ( could ZeroMQ possibly do something like this? )."
Obviously it could; ZeroMQ allows smart & flexible creation of signalling/messaging meta-plane infrastructure(s) for distributed computing, and has been improving at doing this for the last 12+ years.
The URL attached in #HristoIlliev's comment details that Apache Storm itself reports already using a ZeroMQ layer for its own services [in ver. 0.8.0; unfortunately, almost all of the implementation (source-code) links there are already dead]:
The implementation for distributed mode uses ZeroMQ code
The implementation for local mode uses in-memory Java queues (so that it's easy to use Storm locally without needing to get ZeroMQ installed) code
...
Tasks listen on an in-memory ZeroMQ port for messages from the virtual port code
So the topology-related part of your question touches a decision that has already been made on this subject in the "outer" Apache Storm architecture.
Tasks are responsible for message routing. A tuple is emitted either to a direct stream (where the task id is specified) or a regular stream. In direct streams, the message is only sent if that bolt subscribes to that direct stream. In regular streams, the stream grouping functions are used to determine the task ids to send the tuple to.
MPI does the same for the HPC-focused computing ecosphere, ever since FORTRAN jobs started to run on the first distributed HPC infrastructures. Because most HPC-computing problems could be "simply" scaled onto larger footprints of computing hardware, the MPI focus was more on the efficiency of such uniform scaling; it never visited the opposite corner of adaptive, almost ad-hoc setups of message-passing infrastructure with layered topologies of specialised ZeroMQ Scalable Formal Communication Patterns. Each of the tools thus focuses on different factors.
If you feel you want to read a bit more on ZeroMQ, this answer might help to fast understand the core underlying concepts.
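As a small illustration of that kind of selective wiring, here is a minimal Java sketch using the JeroMQ bindings (the inproc endpoint name is hypothetical; a real topology would create one endpoint per edge, so a task only ever talks to the specific downstream tasks the topology assigns to it):

```java
import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class DirectedEdge {
    public static void main(String[] args) {
        try (ZContext ctx = new ZContext()) {
            // Upstream task: a PUSH socket bound to the endpoint for this one edge.
            ZMQ.Socket upstream = ctx.createSocket(SocketType.PUSH);
            upstream.bind("inproc://taskA-to-taskB"); // hypothetical endpoint name

            // Downstream task: a PULL socket connected only to the upstream task
            // the topology wires it to; it can receive from nothing else.
            ZMQ.Socket downstream = ctx.createSocket(SocketType.PULL);
            downstream.connect("inproc://taskA-to-taskB");

            upstream.send("tuple-payload");
            System.out.println(downstream.recvStr()); // prints "tuple-payload"
        }
    }
}
```

Swap inproc:// for tcp:// endpoints and the same wiring works across worker processes; PUSH/PULL also load-balances automatically if several downstream tasks connect to one upstream socket.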
I want to know the applicability of the Akka Actor model.
I know it is useful when a huge number of actor instances are created and destroyed, e.g. a call server, where every incoming call creates an actor instance, communicates with a few other actors, and gets killed after the call is over.
Is it also useful in the following scenario:
A server has a few processing elements (10–50) implemented as actors. The lifetime of these processing elements is infinite. Some of them do not maintain state, and a few do. The processing elements process a message and pass it to other actors in a fixed manner. The system receives a huge number of messages from outside; they pass through the processing elements and leave the system.
My gut feeling is that we cannot get any advantage from using the Akka actor model, or even from implementing this server in Scala, because the use case Akka is designed for does not apply here. If scaling up meant that processing elements were added dynamically, then it would apply.
For fixed topologies, I think an implementation in Java is going to be more beneficial in terms of raw performance. The 'immutability' feature of Scala leads to more copies and so reduces performance. So I believe I had better stick to Java.
Is my understanding correct? In a nutshell, I want to know why I should leave Java and use Scala/Akka for the application scenario above, given that my target is to process 1 million messages per second.
If this question is still relevant...
Scala vs. Java
Scala boosts developer productivity.
Immutability reduces debugging effort almost to zero.
The GC copes perfectly well with discarded immutable objects.
Akka Actors vs. other means
Akka has a dispatcher that distributes all tasks across a fixed thread pool. This allows available resources to be consumed evenly. This approach is much better than fixed worker threads: the processing resources are provided to the tasks, not to DataFlow nodes.
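For illustration, here is a minimal sketch of such a fixed, immortal pipeline using Akka's classic Java actor API (the stage names, the message type, and the pipeline shape are all illustrative assumptions). All three actors share the default dispatcher's thread pool, so threads serve whichever actor currently has mail:

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class FixedPipeline {

    // One long-lived processing element; forwards each message to a fixed next stage.
    static class Stage extends AbstractActor {
        private final ActorRef next; // null for the final stage

        Stage(ActorRef next) { this.next = next; }

        static Props props(ActorRef next) {
            return Props.create(Stage.class, () -> new Stage(next));
        }

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(String.class, msg -> {
                        // ... per-message work happens here ...
                        if (next != null) next.tell(msg, getSelf());
                    })
                    .build();
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("pipeline");
        // Fixed topology, built once and never torn down: stage1 -> stage2 -> sink.
        ActorRef sink   = system.actorOf(Stage.props(null), "sink");
        ActorRef stage2 = system.actorOf(Stage.props(sink), "stage2");
        ActorRef stage1 = system.actorOf(Stage.props(stage2), "stage1");

        stage1.tell("message-1", ActorRef.noSender());
    }
}
```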
DataFlow implementation
There is a SynapseGrid library that is built on top of Akka Actors and allows easy construction of DataFlow systems distributed over fixed immortal Actors. It can even draw the DataFlow diagram (in .dot format) of the whole system.
(The library is more convenient to be used with Scala.)
We have an existing setup where upstream systems send messages to us on a message queue and we process these messages. The content is XML and we simply unmarshal it. This unmarshalling step is followed by a write to a DB (to put relevant values into relevant columns).
The system is set to interface with many more upstream systems, and our volumes are going to increase to a peak of 40 million messages per day.
Our current way of processing is to have listeners on the queues, and then multiple threads of producers and consumers which do the unmarshalling and the subsequent DB write.
My question: can this process fit into the Storm use case?
I mean, can the MQ be my spout, with two bolts: one to unmarshal, whose output then feeds the next bolt, which does the write to the DB?
If yes, what benefit can I derive? Is it goodbye to the cumbersome multithreaded producer/consumer pattern of code?
If it's as simple as the above, then where/why would one want to resort to the conventional multithreaded producer/consumer approach?
My point being: is there a data volume/frequency at which Storm starts to shine compared to the conventional approach?
PS: I'm very new to this, trying to get the hang of it, and want to ascertain whether my line of thinking is right.
This scenario can definitely fit into a Storm topology. The spouts can pull from the MQ, and the bolts can handle the unmarshalling and subsequent processing.
The major benefit over the conventional multithreaded pattern is the ability to add more worker nodes as the load increases. This is not so easy with traditional producer/consumer patterns.
A specific data-volume number is a very broad question, since it depends on a large number of factors such as hardware.
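To make the shape concrete, here is a minimal sketch of such a topology in Java (mqSpout, unmarshalBolt, dbWriterBolt, and the "recordId" field are hypothetical placeholders for your queue listener, XML unmarshaller, and DB writer; package names are from the pre-Apache Storm era discussed earlier):

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class MqEtlTopology {
    // mqSpout wraps your queue listener; unmarshalBolt and dbWriterBolt hold the
    // existing unmarshalling and DB-write logic (all three are your own classes).
    public static void submit(IRichSpout mqSpout, IRichBolt unmarshalBolt, IRichBolt dbWriterBolt)
            throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("mq", mqSpout, 2);                          // pulls messages off the queue
        builder.setBolt("unmarshal", unmarshalBolt, 8)               // XML -> tuple fields
               .shuffleGrouping("mq");
        builder.setBolt("db-write", dbWriterBolt, 4)                 // tuple fields -> DB columns
               .fieldsGrouping("unmarshal", new Fields("recordId")); // hypothetical field name

        Config conf = new Config();
        conf.setNumWorkers(4); // the knob you turn as volume grows
        StormSubmitter.submitTopology("mq-etl", conf, builder.createTopology());
    }
}
```

The parallelism hints and worker count replace the hand-rolled thread pools: scaling becomes a configuration change rather than new producer/consumer code.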
Is there a standard approach for deduping parallel event streams? Before I attempt to reinvent the wheel, I want to know whether this problem has any known approaches.
My client component will be communicating with two servers. Each one is providing a near real-time event stream (~1 second). The events may occasionally be out of order. Assume I can uniquely identify the events. I need to send a single stream of events to the consuming code at the same near real-time performance.
A lot has been written about this kind of problem. Here's a foundational paper, by Leslie Lamport:
http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks
The Wikipedia article on Operational Transformation theory is a perfectly good starting point for further research:
http://en.wikipedia.org/wiki/Operational_transformation
As for your problem, you'll have to choose some arbitrary weight to measure the cost of delay versus the cost of dropped events. You can maintain two time-ordered priority queues into which incoming events go. You'd merge from the heads of the two queues with some delay (to allow for out-of-order events), and throw away events that happened "before" the timestamp of whatever event you last sent. If that's no better than what you had in mind already, well, at least you get to read that cool Lamport paper!
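Here is a minimal Java sketch of that merge, under stated assumptions: events carry a unique id and a timestamp, the reorder delay is an arbitrary two seconds, and the seen-id set is left unbounded (you'd prune it in practice):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.function.Consumer;

// Two time-ordered queues, a small reorder delay, and suppression of
// duplicate or stale events, as described above.
public class StreamMerger {
    record Event(String id, Instant timestamp, String payload) {}

    private final PriorityQueue<Event> a = new PriorityQueue<>(Comparator.comparing(Event::timestamp));
    private final PriorityQueue<Event> b = new PriorityQueue<>(Comparator.comparing(Event::timestamp));
    private final Set<String> seenIds = new HashSet<>();          // unbounded here; prune in practice
    private final Duration reorderDelay = Duration.ofSeconds(2);  // the arbitrary delay-vs-drop weight
    private Instant lastEmitted = Instant.MIN;

    public void offerA(Event e) { a.add(e); }
    public void offerB(Event e) { b.add(e); }

    /** Emit, in timestamp order, every event older than now - reorderDelay. */
    public void drainTo(Consumer<Event> downstream, Instant now) {
        Instant watermark = now.minus(reorderDelay);
        while (true) {
            PriorityQueue<Event> src = pickOlderHead();
            if (src == null || src.peek().timestamp().isAfter(watermark)) return;
            Event e = src.poll();
            // Drop duplicates and events "before" the last one sent.
            if (seenIds.add(e.id()) && !e.timestamp().isBefore(lastEmitted)) {
                lastEmitted = e.timestamp();
                downstream.accept(e);
            }
        }
    }

    private PriorityQueue<Event> pickOlderHead() {
        if (a.isEmpty()) return b.isEmpty() ? null : b;
        if (b.isEmpty()) return a;
        return a.peek().timestamp().isBefore(b.peek().timestamp()) ? a : b;
    }
}
```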
I think the optimization might be OS-specific. From the task as you described it, I'd think of two threads consuming the incoming data and appending it to the common stream, with access guarded by mutexes. Both Linux and Win32 have mutex-like primitives, but they may perform poorly if the data rate is really high. In that case I'd operate on blocks of data, which allows mutexes to be taken less often. There is of course a main thread that consumes the data, and it also accesses the stream under a mutex.
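A minimal Java sketch of that block-based variant (block and queue sizes are arbitrary): producers buffer events locally and touch the shared, lock-guarded queue once per block instead of once per event:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BlockedMerge {
    private static final int BLOCK_SIZE = 256; // arbitrary block size
    // The queue's internal lock plays the role of the mutex described above.
    private final BlockingQueue<List<String>> shared = new ArrayBlockingQueue<>(64);

    // Called on each producer thread with that thread's own local buffer.
    void produce(List<String> localBuffer, String event) throws InterruptedException {
        localBuffer.add(event);
        if (localBuffer.size() >= BLOCK_SIZE) {
            shared.put(new ArrayList<>(localBuffer)); // one contended operation per block
            localBuffer.clear();
        }
    }

    // Main consumer thread: drains whole blocks into the common stream.
    void consume() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            for (String event : shared.take()) {
                // ... append event to the common stream ...
            }
        }
    }
}
```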