Are storm trident batches simultaneously processed? - apache-storm

I would like to know whether Trident batches are executed in parallel, i.e. can multiple batches run at a time?
Apart from this I have a few questions which are too small to post individually. If they turn out to be large enough, feel free to comment and I will post them separately.
What if processing fails for only one particular tuple in a batch?
Will the whole batch be replayed, resulting in the reprocessing of tuples that were already processed successfully? Take word count, where every tuple contains a word but only some tuples were successfully counted: if there are three occurrences of the word "man" and the count shows only 2, that means one tuple failed during processing.
In the Trident tutorial, only the previous txid is stored. What about earlier transaction ids?
For example, say there are batches 1, 2, 3, 4. After batches #1 and #2 have executed, batch #1 is replayed. The stored txid will be 2, since the most recently processed batch is batch #2, and there is no way to recognize whether batch #1 was processed before or not.
If so, then the batches must be executed in order, meaning batch #2 cannot be executed until batch #1 has finished successfully. If that is the case, where is the parallelism in executing batches?
What if only one particular function in a topology is not executed properly for a batch?
For example, I have two functions: one persists the message to a database and the other produces to a Kafka queue. Persisting to the database succeeds, but pushing to the Kafka queue fails (say, due to node failures). In that case I would want only the function that pushes to the Kafka queue to be executed again for that particular batch. Is there a way to do this in Trident? I would need to store not only the txid but also the list of functions still to be processed for that txid. How could that be done?

As best I understand:
Any failure is considered a failure for the whole batch, and the batch will be replayed by the spout. The transactional state stores the value and the transaction id from the last update. If counting "man" failed, its stored txid would be less than the current txid, so the state should add this batch's data to the stored value. Otherwise it can ignore the replay, because it knows the data from this batch has already been counted for this key.
State updates are processed in strict txid order, but only by the stateful components. Functions can already execute on tuples from upcoming transactions.
It sounds like you want States instead of Functions. The state will remember that it has already completed the batch and ignore it when replayed.
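As a minimal sketch of that txid-based update rule (plain Java, not the actual Trident MapState/IBackingMap API; the class and method names here are purely illustrative):

    import java.util.HashMap;
    import java.util.Map;

    // Word-count state that stores, per key, the count plus the txid of the
    // batch that last updated it, so a replayed batch is applied at most once.
    class TxWordCountState {
        static class Entry {
            long count;
            long txid;
            Entry(long count, long txid) { this.count = count; this.txid = txid; }
        }

        private final Map<String, Entry> store = new HashMap<>();

        /** Apply one batch's partial count for a word, exactly once per txid. */
        void applyBatch(String word, long batchCount, long currentTxid) {
            Entry e = store.get(word);
            if (e == null) {
                store.put(word, new Entry(batchCount, currentTxid));
            } else if (e.txid < currentTxid) {
                // First time this txid touches the key: add the batch's contribution.
                e.count += batchCount;
                e.txid = currentTxid;
            }
            // else e.txid == currentTxid: this batch was already applied, ignore the replay.
        }
    }

Note that this exactly-once bookkeeping only holds if batches commit their state updates in txid order, which is why the stateful part of the topology is ordered while plain Functions are free to run ahead.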

Related

Distribute processing of records of scheduler job

I am working on a use case where I have a cron job scheduled (via Quartz) which reads certain entries from the db and processes them.
In each schedule I can get thousands of records which need to be processed. Processing each record takes time (seconds to minutes). Currently all those records are processed on a single node (the node elected by Quartz). My challenge is to parallelize the processing of these records. Please help me with the concerns below:
How can I distribute these records/tasks across a cluster of machines?
If any machine fails after processing a few records, the remaining records should be processed by the healthy nodes in the cluster.
How do I get a signal that all record processing is finished?
Create cron jobs to run separately on each host at the desired frequency. You will need some form of lock on each record, or some form of range lock on the record set, to ensure that the servers process mutually exclusive sets of records.
e.g.: You can add the following new fields to all records:
Locked by server
Lock expiration time (or lock duration)
On each run, each cron picks a set of records whose locks are empty or expired, and then acquires the lock on a small set of records by setting these two fields. Then it proceeds to process them. If it crashes or gets stuck, the lock expires; otherwise it is released on completion.
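A hedged sketch of that per-record lease in plain JDBC; the table and column names (records, id, locked_by, lock_expires_at) are assumptions for illustration, and the SQL should be adapted to your schema and database:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Timestamp;
    import java.time.Instant;

    class RecordLeaser {
        /**
         * Claim one record for this server with a compare-and-set style UPDATE:
         * the row is taken only if it is unlocked or its lease has expired.
         * Returns true if this server won the record.
         */
        boolean tryClaim(Connection conn, long recordId, String serverId,
                         long leaseSeconds) throws SQLException {
            String sql = "UPDATE records SET locked_by = ?, lock_expires_at = ? "
                       + "WHERE id = ? AND (locked_by IS NULL OR lock_expires_at < ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                Instant now = Instant.now();
                ps.setString(1, serverId);
                ps.setTimestamp(2, Timestamp.from(now.plusSeconds(leaseSeconds)));
                ps.setLong(3, recordId);
                ps.setTimestamp(4, Timestamp.from(now));
                return ps.executeUpdate() == 1;   // 0 means another server holds the lock
            }
        }
    }

Each cron instance would first select candidate ids (locked_by IS NULL OR lock_expires_at < now) and then call tryClaim per id; because the single UPDATE is atomic, two servers can never both win the same record.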

Kafka Streams Custom processing

There is a requirement for me to process huge files; there could be multiple files that we end up processing in parallel.
Each row in a specific file would be processed with a rule specific to that file.
Once the processing is complete, we would generate an output file based on the processed records.
One option that I have thought of is that each message pushed to the broker will have: the row data + the rule to be applied + a correlation ID (which would act as an identifier for that particular file).
I plan to use Kafka Streams and create a topology with a processor which will get the rule along with the message, process it, and sink it.
However (I am new to Kafka Streams, hence I may be wrong):
The order in which the messages will be processed will not be sequential, as we are processing multiple files in tandem (which is fine because there isn't a requirement to process them in order; moreover, I want to keep it decoupled). But then how would I bring it to logical closure, i.e. in my processor how would I come to know that all the records of a file have been processed?
Do I need to maintain the record counts (correlation ID, number of records, etc.) in something like Ignite? I am unsure about that, though.
I guess you can set aside a key/value record that is sent to the topic at the end of the file, which would signify the closure of that file.
Say the record has a unique key such as -1, which signifies the EOF.
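A rough sketch of that idea with the Kafka Streams Processor API (assuming Kafka Streams 2.7+; the "EOF" sentinel value, using the correlation ID as the record key, and the in-memory map are illustrative assumptions; a state store would be needed for fault tolerance):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.streams.processor.api.Processor;
    import org.apache.kafka.streams.processor.api.ProcessorContext;
    import org.apache.kafka.streams.processor.api.Record;

    // Counts rows per correlation ID and emits a closure message when the
    // producer's end-of-file sentinel for that file arrives.
    class FileClosureProcessor implements Processor<String, String, String, String> {
        private ProcessorContext<String, String> context;
        private final Map<String, Long> rowsSeen = new HashMap<>();

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
        }

        @Override
        public void process(Record<String, String> record) {
            String correlationId = record.key();       // identifies the file
            if ("EOF".equals(record.value())) {        // sentinel sent after the last row
                Long rows = rowsSeen.remove(correlationId);
                context.forward(record.withValue(
                        "file " + correlationId + " closed, rows=" + rows));
                return;
            }
            rowsSeen.merge(correlationId, 1L, Long::sum);
            // ...apply the file-specific rule to record.value() here...
            context.forward(record);
        }
    }

Keep in mind the sentinel only guarantees closure per partition; if one file's rows are spread over several partitions, the per-partition signals still have to be aggregated before the file can be declared done.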

Apache Nifi processor that acts like a barrier to synchronize multiple flow files

I'm evaluating NiFi for our ETL process.
I want to build the following flow:
Fetch a lot of data from a SQL database -> Split into chunks of 1000 records each -> Count error records in each chunk -> Count the total number of error records -> If it exceeds a threshold, fail the process -> else save each chunk to the database.
The problem I can't resolve is how to wait until all chunks are validated. If, for example, I have 5 validation tasks working concurrently, I need some kind of barrier to wait until all chunks are processed and only after that run the error count processor, because I don't want to save invalid data and then have to delete it if the threshold is reached.
The other question I have is whether there is any possibility to run this validation processor on multiple nodes in parallel and still be able to wait until they are all completed.
One solution is to use the ExecuteScript processor as a "relief valve" that holds a simple count in memory, triggered off the first receipt of a flowfile with a specific attribute value (stored in local/cluster state, essentially a Map from the key attribute value to a count). Once that count reaches the threshold, you can generate a new flowfile, containing the attribute value that has finished, and route it to the success relationship. Meanwhile, send the other results (the flowfiles that need to be batched) to a MergeContent processor with the minimum batch size set to whatever you like. The follow-on processor to the valve should have its Scheduling Strategy set to Event Driven so it only runs when it receives a flowfile from the valve.
Updating a count in the DistributedMapCache is not the correct way to do this: fetch and update are separate operations and cannot be made atomic by a processor that just increments counts.
http://apache-nifi-users-list.2361937.n4.nabble.com/How-do-I-atomically-increment-a-variable-in-NiFi-td1084.html
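To illustrate the counting logic the valve needs (plain Java, not NiFi's scripting API; in ExecuteScript this map would live in the processor's local/cluster state rather than in memory):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Barrier that releases once the expected number of chunks for a key arrives.
    class FragmentBarrier {
        private final Map<String, Integer> counts = new ConcurrentHashMap<>();

        /** Returns true exactly once per key, when the expected count is reached. */
        boolean arrive(String fragmentKey, int expectedCount) {
            int seen = counts.merge(fragmentKey, 1, Integer::sum);
            if (seen == expectedCount) {
                counts.remove(fragmentKey);   // all chunks for this key have arrived
                return true;                  // emit the "release" flowfile here
            }
            return false;
        }
    }

The release path then triggers the total error count check, while the data flowfiles wait in MergeContent as described above.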

Storm bolt doesn't guarantee to process the records in order they receive?

I have a Storm topology that reads records from Kafka, extracts the timestamp present in the record, does a lookup on an HBase table, applies business logic, and then updates the HBase table with the latest values from the current record.
I have written a custom HBase bolt extending BaseRichBolt, where the code does a lookup on the HBase table, applies some business logic to the message read from Kafka, and then updates the HBase table with the latest data.
The problem I am seeing is that sometimes the bolt receives/processes the records in a jumbled order, due to which my application thinks that a particular record has already been processed and ignores it. The application is failing to process a serious number of records because of this.
For example:
Suppose two records are read from Kafka, one belonging to the 10th hour and the second to the 11th hour.
My custom HBase bolt processes the 11th hour record first and reads/processes the 10th hour record later. Because the 11th hour record is processed first, the application assumes the 10th hour record has already been processed and skips it.
Can someone please help me understand why my custom HBase bolt is not processing the records in the order it receives them?
Should I set any additional properties to ensure the bolt processes the records in the order it receives them? What possible alternatives can I try to fix this?
FYI, I am using fields grouping for the HBase bolt, through which I want to ensure that all the records of a particular user go to the same task. Nevertheless, thinking the fields grouping might be causing the issue, I reduced the number of tasks for my custom HBase bolt to 1, but I still see the same issue.
I am wondering why the HBase bolt is not reading/processing the records in the order it receives them. Please help me with your thoughts.
Thanks a lot.
Kafka doesn't guarantee the order of messages across multiple partitions.
So there is no ordering when you read messages. To get strict ordering you would need to create the Kafka topic with a single partition, but then you lose the advantage of parallelism.
Kafka guarantees ordering per partition, not per topic. Partitioning really serves two purposes in Kafka:
It balances data and request load over brokers.
It serves as a way to divvy up processing among consumer processes while allowing local state and preserving order within the partition.
For your use case you may only care about #2. Please consider using a Partitioner as part of your Producer via ProducerConfig.PARTITIONER_CLASS_CONFIG. The default Java producer in 0.9 will try to spread messages evenly across all available partitions. https://github.com/apache/kafka/blob/6eacc0de303e4d29e083b89c1f53615c1dfa291e/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
You can create your own with something like this:
return hash(key)%num_partitions
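A hedged sketch of such a custom Partitioner (KeyHashPartitioner is an illustrative name; it keeps every record with the same key, e.g. the same user, in the same partition so per-key order is preserved):

    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.utils.Utils;

    public class KeyHashPartitioner implements Partitioner {

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            // Records without a key would need a fallback (e.g. round-robin);
            // this sketch assumes every record carries a user key.
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }

        @Override
        public void close() { }

        @Override
        public void configure(Map<String, ?> configs) { }
    }

Register it on the producer with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, KeyHashPartitioner.class.getName()). Even with per-key ordering in Kafka, a replay of failed tuples in the topology can still deliver an older record after a newer one, so the bolt's "already processed" check should tolerate that.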

Perform actions before end of the micro-batch in Spark Streaming

Is there a possibility to perform some action at the end of each micro-batch inside the DStream in Spark Streaming? My aim is to compute the number of events processed by Spark. Spark Streaming gives me some numbers, but the average also seems to sum up zero values (as some micro-batches are empty).
For example, I collect some statistics data and want to send it to my server, but the object that collects the data only exists during a certain batch and is initialized from scratch for the next batch. I would love to be able to call my "finish" method before the batch is done and the object is gone. Otherwise I lose the data that has not been sent to my server.
Maybe you can use StreamingListener:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener
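A minimal sketch of such a listener in Java, counting only non-empty micro-batches. This assumes a Spark build on Scala 2.12+, where the StreamingListener trait's empty default methods let a Java class override just onBatchCompleted; on older builds it is easier to write the listener directly in Scala:

    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.spark.streaming.scheduler.StreamingListener;
    import org.apache.spark.streaming.scheduler.StreamingListenerBatchCompleted;

    // Invoked on the driver when each micro-batch finishes.
    class EventCountListener implements StreamingListener {
        private final AtomicLong totalRecords = new AtomicLong();

        @Override
        public void onBatchCompleted(StreamingListenerBatchCompleted batchCompleted) {
            long numRecords = batchCompleted.batchInfo().numRecords();
            if (numRecords > 0) {                    // skip empty micro-batches
                totalRecords.addAndGet(numRecords);
                // ...send this batch's statistics to your server here...
            }
        }

        long total() { return totalRecords.get(); }
    }

    // Registration, e.g. from a JavaStreamingContext jssc:
    //   jssc.ssc().addStreamingListener(new EventCountListener());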
