How to write a trident topology without aggregations? - apache-storm

I would like to process tuples in batches, and I am thinking of using the Trident API for that. However, I do not perform any batch operations here: every tuple is processed individually. All I need is exactly-once semantics, so that every tuple is processed only once, and this is the only reason to use Trident.
I want to record which tuples have been processed so that, when a batch is replayed, a tuple that has already been processed is not executed again.
The topology contains a persistentAggregate() method, but it requires an aggregation operation, and I have none to perform on a set of tuples because every tuple is processed individually.
The functions applied to each tuple are too small to be worth executing one tuple at a time, so I want to process them in batches to save computing resources and time.
Now, how to write a topology which consumes tuples as batches but still doesn't perform any batch operations (like word count)?

Looks like what you need is partitionPersist.
It should be provided with a state (or a state factory), fields to persist and an updater.
For development purposes check MemoryMapState - it's basically an in-memory hashmap.
For production you can use, say, Cassandra; see the examples at https://github.com/hmsonline/storm-cassandra-cql
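The idea of a per-tuple updater that skips already-processed tuples on replay can be sketched in plain Java. This is not the Storm API: `ProcessedOnceState` and `updateBatch` are hypothetical names standing in for a Trident state and its updater, and the in-memory set stands in for whatever store (MemoryMapState, Cassandra) backs the state.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical stand-in for a Trident state that remembers handled tuple ids.
class ProcessedOnceState {
    private final Set<String> processedIds = new HashSet<>();

    // Receives a whole batch, but runs the work on each tuple individually.
    // Returns how many tuples were actually executed (replayed ones are skipped).
    int updateBatch(List<String> tupleIds, Consumer<String> perTupleWork) {
        int executed = 0;
        for (String id : tupleIds) {
            if (processedIds.contains(id)) {
                continue; // already processed in an earlier attempt of this batch
            }
            perTupleWork.accept(id);
            processedIds.add(id); // record only after the work succeeded
            executed++;
        }
        return executed;
    }
}
```

Calling `updateBatch` twice with the same ids runs the work only the first time, which is the behaviour the question asks for when a batch is replayed.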

Related

Delay Kafka processor reading from source topic

I have a topology with two source topics that are read and processed by two different processors in a Kafka Streams app. Processor A reads its corresponding topic and creates a persistent local store, which is shared with the other processor, B, in the topology.
My issue is that I need somehow after a restart to pause processor B processing for a very small amount of time and give processor A the time to read some events from its topic updating its local store before processor B starts with its processing.
Since both processors belong to the same sub-topology I can't use Thread.sleep in init() for example because this will cause the whole app to stall.
So is there a way to make processor B in topology wait/stall for a very small amount of time when restarting the application before starting reading from the source topic and begin processing events?
Processing order is based on record timestamps. Hence, if the timestamps of the records processed by A are smaller than the timestamps of the records processed by B, those "A records" will be processed first.
Explicitly pausing one side does not make sense, as it may violate the processing order. Just make sure that your input data is properly timestamped and you don't have to worry about manual pausing.
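As a rough illustration of timestamp-based ordering, here is a plain-Java sketch that does not use the Kafka Streams API. `TimestampOrderedMerge` and `nextSource` are hypothetical names, each queue holds only the record timestamps of one input, and picking the non-empty side when the other is empty is a simplifying assumption.

```java
import java.util.Deque;

// Hypothetical model: each deque holds the pending record timestamps of one input topic.
class TimestampOrderedMerge {
    // Returns "A" or "B" for the input whose head record has the smaller timestamp,
    // or null when both inputs are empty.
    static String nextSource(Deque<Long> a, Deque<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return null;
        }
        if (b.isEmpty()) {
            return "A"; // simplification: take whatever is available
        }
        if (a.isEmpty()) {
            return "B";
        }
        // Ties go to A in this sketch.
        return a.peekFirst() <= b.peekFirst() ? "A" : "B";
    }
}
```

With properly timestamped input, the store-building records of A are picked before the later records of B without any explicit pause.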

In Nifi, what is the difference between FirstInFirstOutPrioritizer and OldestFlowFileFirstPrioritizer

The user guide at https://nifi.apache.org/docs/nifi-docs/html/user-guide.html has the details below on prioritizers. Could you please help me understand how these differ, and provide a real-world example?
FirstInFirstOutPrioritizer: Given two FlowFiles, the one that reached the connection first will be processed first.
OldestFlowFileFirstPrioritizer: Given two FlowFiles, the one that is oldest in the dataflow will be processed first. 'This is the default scheme that is used if no prioritizers are selected.'
Imagine two processors A and B that are both connected to a funnel, and then the funnel connects to processor C.
Scenario 1 - The connection between the funnel and processor C has first-in-first-out prioritizer.
In this case, the flow files in the queue between the funnel and processor C will be processed strictly in the order they reached the queue.
Scenario 2 - The connection between the funnel and processor C has oldest-flow-file-first prioritizer.
In this case, there could already be flow files in the queue between the funnel and processor C, but if one of the processors transfers a flow file to that queue that is older than all the flow files already in it, that flow file will jump to the front.
You could imagine that some flow files come from a different portion of the flow that takes way longer to process than other flow files, but they both end up funneled into the same queue, so these flow files from the longer processing part are considered older.
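The two scenarios can be modelled as two comparators over the same queue. This is a plain-Java sketch, not NiFi code: `QueuedFlowFile`, its two timestamp fields, and the `Prioritizers` helper are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a flow file waiting in a connection's queue.
class QueuedFlowFile {
    final String name;
    final long enteredFlowAt;  // when the flow file first entered the dataflow
    final long enteredQueueAt; // when it reached this particular connection

    QueuedFlowFile(String name, long enteredFlowAt, long enteredQueueAt) {
        this.name = name;
        this.enteredFlowAt = enteredFlowAt;
        this.enteredQueueAt = enteredQueueAt;
    }
}

class Prioritizers {
    // Scenario 1: order by arrival at this connection.
    static final Comparator<QueuedFlowFile> FIRST_IN_FIRST_OUT =
            Comparator.comparingLong(f -> f.enteredQueueAt);

    // Scenario 2: order by age in the whole dataflow.
    static final Comparator<QueuedFlowFile> OLDEST_FLOW_FILE_FIRST =
            Comparator.comparingLong(f -> f.enteredFlowAt);

    // Returns the flow file names in the order the given prioritizer would release them.
    static List<String> order(List<QueuedFlowFile> queue, Comparator<QueuedFlowFile> prioritizer) {
        List<QueuedFlowFile> sorted = new ArrayList<>(queue);
        sorted.sort(prioritizer);
        List<String> names = new ArrayList<>();
        for (QueuedFlowFile f : sorted) {
            names.add(f.name);
        }
        return names;
    }
}
```

A flow file that entered the dataflow early but spent a long time in a slow branch reaches the queue late: it is released last under the first comparator and first under the second.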
Apache NiFi handles data from many disparate sources and can route it through a number of different processors. Let's use the following example (ignore the processor types, just focus on the titles):
First, the relative rate of incoming data can be different depending on the source/ingestion point. In this case, the database poll is being done once per minute, while the HTTP poll is every 5 seconds, and the file tailing is every second. So even if a database record is 59 seconds "older" than another, if they are captured in the same execution of the processor, they will enter NiFi at the same time and the flowfile(s) (depending on splitting) will have the same origin time.
If some data coming into the system "is dirty", it gets routed to a processor which "cleans" it. This processor takes 3 seconds to execute.
If both the clean relationship and the success relationship from "Clean Data" went directly to "Process Data", you wouldn't be able to control the order that those flowfiles were processed. However, because there is a funnel that merges those queues, you can choose a prioritizer on the queued queue, and control that order. Do you want the first flowfile to enter that queue processed first, or do you want flowfiles that entered NiFi earlier to be processed first, even if they entered this specific queue after a newer flowfile?
This is a contrived example, but you can apply this to disaster recovery situations where some data was missed for a time window and is now being recovered, or a flow that processes time-sensitive data and the insights aren't valid after a certain period of time has passed. If using backpressure or getting data in large (slow) batches, you can see how in some cases, oldest first is less valuable and vice versa.

Apache Storm Tuple Replay Count

I'm using Apache Storm for parallel processing. I'd like to detect when the tuple is on its last replay count so that if it fails again then the tuple can be moved to a dead letter queue.
Is there a way to find the replay count from within the Bolt? I'm not able to find such a field within the tuple.
The reason I'm looking for the last replay count is to iron out our topology so that it is more resilient to failures caused by bugs and downstream service outages. Once the bug or downstream issue has been resolved, the tuples can be reprocessed from the dead letter queue. However, I'd like to place a tuple on the dead letter queue only on its last and final replay.
There are multiple possible answers to this question:
Do you use low level Java API to define your topology? If yes, see here: Storm: Is it possible to limit the number of replays on fail (Anchoring)?
You can also use transactional topologies. The documentation is here: https://storm.apache.org/documentation/Transactional-topologies.html
Limiting the number of replays implies counting them, and that is a requirement to get this done. However, Storm does not support a dead letter queue or similar natively. You would need to use a reliable external distributed storage system (maybe Kafka) and put the tuple there if the replay count exceeds your threshold. In your spout, you then need to check periodically for tuples in this external storage. If they have been stored there "long enough" (whatever that means in your application), the spout can try re-processing.
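The counting itself could look roughly like the plain-Java sketch below. It is not Storm code: `ReplayTracker`, `onFail`, and `onAck` are hypothetical names for bookkeeping a spout might keep per message id, and the `deadLetters` list is only a stand-in for the external storage mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-spout bookkeeping of how often each message id has failed.
class ReplayTracker {
    private final int maxReplays;
    private final Map<String, Integer> failCounts = new HashMap<>();
    final List<String> deadLetters = new ArrayList<>(); // stand-in for an external store

    ReplayTracker(int maxReplays) {
        this.maxReplays = maxReplays;
    }

    // Called when a tuple tree fails. Returns true if the message should be replayed,
    // false if it exceeded the limit and was moved to the dead letter list.
    boolean onFail(String msgId) {
        int failures = failCounts.merge(msgId, 1, Integer::sum);
        if (failures > maxReplays) {
            failCounts.remove(msgId);
            deadLetters.add(msgId);
            return false;
        }
        return true;
    }

    // Called when a tuple tree is fully acked; the counter is no longer needed.
    void onAck(String msgId) {
        failCounts.remove(msgId);
    }
}
```

With `maxReplays` set to 2, the first two failures of a message lead to a replay and the third one moves it to the dead letter list.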

Storm Trident and Spark Streaming: distributed batch locking

After doing lots of reading and building a POC we are still unsure as to whether Storm Trident or Spark Streaming can handle our use case:
We have an inbound stream of sensor data for millions of devices (which have unique identifiers).
We need to perform aggregation of this stream on a per device level. The aggregation will read data that has already been processed (and persisted) in previous batches.
Key point: When we process data for a particular device we need to ensure that no other processes are processing data for that particular device. This is because the outcome of our processing will affect the downstream processing for that device. Effectively we need a distributed lock.
In addition the event device data needs to be processed in the order that the events occurred.
Essentially we can’t have two batches for the same device being processed at the same time.
Can trident/spark streaming handle our use case?
Any advice appreciated.
Since you have unique IDs, can you partition on them? For example, take each ID modulo 10 and send the device's data to a different processing box depending on the remainder. This should also take care of making sure each device's events are processed in order, as they will all be sent to the same box. I believe Storm/Trident allows you to guarantee in-order processing. I'm not sure about Spark, but I would be surprised if it doesn't.
Pretty awesome problem to solve, I have to say though.
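The remainder-based routing suggested above can be sketched in a few lines of plain Java. `DevicePartitioner` and `workerFor` are hypothetical names, and numeric device ids are an assumption.

```java
// Hypothetical router: maps a numeric device id to one of workerCount processing boxes.
class DevicePartitioner {
    static int workerFor(long deviceId, int workerCount) {
        // Math.floorMod keeps the result in [0, workerCount) even for negative ids.
        return (int) Math.floorMod(deviceId, (long) workerCount);
    }
}
```

Because the mapping depends only on the id, every event of a given device lands on the same worker, so no two workers handle that device at the same time.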

How do I implement this topology in Storm?

I'm new to Storm, so be gentle :-)
I want to implement a topology that is similar to the RollingTopWords topology in the Storm examples. The idea is to count the frequency of words emitted. Basically, the spouts emit words at random, the first level bolts count the frequency and pass them on. The twist is that I want the bolts to pass on the frequency of a word only if its frequency in one of the bolts exceeded a threshold. So, for example, if the word "Nathan" passed the threshold of 5 occurrences within a time window on one bolt then all bolts would start passing "Nathan"'s frequency onwards.
What I thought of doing is having another layer of bolts which would have the list of words which have passed a threshold. They would then receive the words and frequencies from the previous layer of bolts and pass them on only if they appear in the list. Obviously, this list would have to be synchronized across the whole layer of bolts.
Is this a good idea? What would be the best way of implementing it?
Update: What I'm hoping to achieve is a situation where communication is minimized, i.e. each node in my use case is simulated by a spout and an attached bolt which does the local counting. I'd like that bolt to emit only words that have passed a threshold, either in the bolt itself or in another one. So every bolt will have to have a list of words that have passed the threshold. There will be a central repository that will hold the list of words over the threshold and will communicate with the bolts to pass that information.
What would be the best way of implementing that?
That shouldn't be too complicated. Just don't emit the words until you reach the threshold and in the meantime keep them stored in a HashMap. That is just one if-else statement.
About the synchronization - I don't think you need it, because with this kind of problem (counting words) you want one and only one task to receive a specific word. The one task that receives the word (e.g. "Nathan") will be the only one emitting its frequency. For that you should use fields grouping.
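A minimal plain-Java sketch of both points, outside the Storm API: `ThresholdCounter`, `observe`, and `taskFor` are hypothetical names. `observe` is the "one if-else" over a HashMap, and `taskFor` only imitates the effect of fields grouping by hashing the word, which is an assumption about how one might model it rather than Storm's actual routing code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical counting logic of a single bolt task.
class ThresholdCounter {
    private final int threshold;
    private final Map<String, Integer> counts = new HashMap<>();

    ThresholdCounter(int threshold) {
        this.threshold = threshold;
    }

    // Counts one occurrence. Returns the current frequency once the word has
    // reached the threshold, or -1 while nothing should be emitted yet.
    int observe(String word) {
        int count = counts.merge(word, 1, Integer::sum);
        if (count >= threshold) {
            return count;
        } else {
            return -1;
        }
    }

    // Stand-in for fields grouping: the same word always maps to the same task index.
    static int taskFor(String word, int taskCount) {
        return Math.floorMod(word.hashCode(), taskCount);
    }
}
```

Since a given word always reaches the same task, that task holds the complete count for it and no shared list has to be synchronized across bolts.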
