Apache Storm Tuple Replay Count

I'm using Apache Storm for parallel processing. I'd like to detect when a tuple is on its last replay so that, if it fails again, it can be moved to a dead letter queue.
Is there a way to find the replay count from within a bolt? I'm not able to find such a field within the tuple.
The reason I'm looking for the last replay count is to make our topology more resilient to failures caused by bugs and downstream service outages. Once the bug or downstream issue has been resolved, the tuples can be reprocessed from the dead letter queue. However, I'd like to place a tuple on the dead letter queue only on its last and final replay.

There are multiple possible answers to this question:
Do you use the low-level Java API to define your topology? If yes, see here: Storm: Is it possible to limit the number of replays on fail (Anchoring)?
You can also use transactional topologies. The documentation is here: https://storm.apache.org/documentation/Transactional-topologies.html
Limiting the number of replays implies counting the number of replays, and that counting is the prerequisite for getting this done. However, Storm does not natively support a dead letter queue or anything similar. You would need a reliable external distributed storage system (maybe Kafka) and put the tuple there if the replay count exceeds your threshold. Your spout would then periodically check for tuples in this external storage; if they have been stored there "long enough" (whatever that means in your application), the spout can try reprocessing them.
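To illustrate the counting approach, here is a minimal sketch of a spout that tracks replay counts in its fail() method and hands a tuple to a dead letter store once the limit is reached. It uses the pre-1.0 backtype.storm API to match the linked documentation; MAX_REPLAYS, the DeadLetterStore wrapper, and pollSource() are assumptions for illustration, not Storm APIs:

```java
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class ReplayCountingSpout extends BaseRichSpout {
    private static final int MAX_REPLAYS = 3;                      // assumed threshold
    private SpoutOutputCollector collector;
    private final Map<Object, Values> pending = new HashMap<>();   // tuples awaiting ack
    private final Map<Object, Integer> failures = new HashMap<>(); // replay counts per msgId

    /** Hypothetical stand-in for a reliable external store, e.g. a Kafka producer. */
    static class DeadLetterStore {
        void put(Values tuple) { /* produce the tuple to a DLQ topic */ }
    }

    private final DeadLetterStore deadLetters = new DeadLetterStore();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String payload = pollSource();               // hypothetical source read
        if (payload == null) return;
        String msgId = UUID.randomUUID().toString();
        Values tuple = new Values(payload);
        pending.put(msgId, tuple);
        collector.emit(tuple, msgId);                // anchored emit enables ack/fail
    }

    @Override
    public void ack(Object msgId) {                  // fully processed: forget it
        pending.remove(msgId);
        failures.remove(msgId);
    }

    @Override
    public void fail(Object msgId) {
        int count = failures.merge(msgId, 1, Integer::sum);
        if (count >= MAX_REPLAYS) {
            deadLetters.put(pending.remove(msgId));  // final replay failed: off to the DLQ
            failures.remove(msgId);
        } else {
            collector.emit(pending.get(msgId), msgId); // replay the stored tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("payload"));
    }

    private String pollSource() { return null; }     // placeholder
}
```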

Related

Spring Kafka Accumulator Use Case

I am developing a Spring Boot application which consumes events from a Kafka input topic (broker version 2.6) and produces an event into an output topic.
In order to respect some business constraints, the component should wait until it has at least X messages (a batch size) or until a timeout expires. In short, it should act like an accumulator.
Further, another mandatory requirement is to respect exactly-once semantics.
The first solution I tried was to keep events in memory until the constraints are satisfied and then publish the output messages. To implement at-least-once semantics, I used manual_immediate ack mode, stored the latest ack for each partition in memory, and acknowledged after processing ended (this may cause duplicates in race conditions, but that is acceptable).
To increase reliability, I enforced upstream transactionality and set read_committed mode on the listener.
I was wondering whether this is a correct approach or whether there is a more suitable solution, such as a batch_mode listener.
At first glance the latter looks wonderful, but it seems to allow accumulating by data size in bytes rather than by number of records.
Thanks in advance,
G.
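For reference, a minimal sketch of the accumulate-then-flush listener described in the question, assuming a container factory configured with AckMode.MANUAL_IMMEDIATE and @EnableScheduling enabled elsewhere; the batch size, topic names, flush interval, and join-based aggregation are assumptions:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

@Component
public class AccumulatingListener {
    private static final int BATCH_SIZE = 100;  // the "at least X messages" constraint

    private final KafkaTemplate<String, String> template;
    private final List<String> buffer = new ArrayList<>();
    // latest ack per partition, as described in the question
    private final Map<Integer, Acknowledgment> lastAckByPartition = new HashMap<>();

    public AccumulatingListener(KafkaTemplate<String, String> template) {
        this.template = template;
    }

    @KafkaListener(topics = "input-topic")
    public synchronized void onRecord(ConsumerRecord<String, String> record, Acknowledgment ack) {
        buffer.add(record.value());
        lastAckByPartition.put(record.partition(), ack);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    // Timeout path: flush whatever has accumulated so far.
    // Caveat: with MANUAL_IMMEDIATE, acknowledging from the scheduler thread is a
    // simplification; production code should route the commit back to the consumer thread.
    @Scheduled(fixedDelay = 5000)
    public synchronized void flushOnTimeout() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        // publish the accumulated batch (the aggregation format is an assumption)
        template.send("output-topic", String.join(",", buffer));
        buffer.clear();
        lastAckByPartition.values().forEach(Acknowledgment::acknowledge);
        lastAckByPartition.clear();
    }
}
```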

Achieve concurrency in Kafka consumers

We are working on parallelising our Kafka consumer to process more records and handle peak load. One thing we are already doing is spinning up as many consumers as there are partitions within the same consumer group.
Our consumer makes an API call which is synchronous as of now. We felt that making this API call asynchronous would let our consumer handle more load, so we are trying to make the call asynchronous and advance the offset in its response callback. However, we are seeing an issue with this:
By making the API call asynchronous, we may get the response for the last record first, while the API calls for earlier records have not yet completed (or even started). If we commit the offset as soon as we receive the response for the last record, the offset moves to that last record. If the consumer then restarts or a partition rebalances, we will never re-receive the records before the committed offset, and we will miss the unprocessed records.
As of now we already have 25 partitions. We would like to understand whether anyone has achieved parallelism without increasing the partitions, or whether increasing the partitions is the only way (so as to avoid the offset issues).
First, you need to decouple (if only at first) the reading of the messages from the processing of those messages. Next, look at how many concurrent calls you can make to your API, as it doesn't make any sense to call it more frequently than the server can handle, asynchronously or not. If the number of concurrent API calls is roughly equal to the number of partitions you have in your topic, then it doesn't make sense to call the API asynchronously.
If the number of partitions is significantly less than the maximum number of possible concurrent API calls, then you have a few choices. You could try to make the maximum number of concurrent API calls with fewer threads (one per consumer) by calling the API asynchronously as you suggest, or you can create more threads and make your calls synchronously. Of course, then you get into the problem of how your consumers can hand their work off to a greater number of shared threads, but that's exactly what streaming execution platforms like Flink or Storm do for you. Streaming platforms (like Flink) that offer checkpoint processing can also handle your problem of how to handle offset commits when messages are processed out of order. You could roll your own checkpoint processing and your own shared thread management, but you'd have to really want to avoid using a streaming execution platform.
Finally, you might have more consumers than max possible concurrent API calls, but then I'd suggest that you just have fewer consumers and share partitions, not API calling threads.
And, of course, you can always change the number of your topic partitions to make your preferred option above more feasible.
Either way, to answer your specific question, you want to look at how Flink does checkpoint processing with Kafka offset commits. To oversimplify (because I don't think you want to roll your own), the Kafka consumers have to remember not only the offsets they just committed, but also the previously committed offsets, and that defines a block of messages flowing through your application. Either that block of messages is processed all the way through in its entirety, or you need to roll back the processing state of each thread to the point where the last message in the previous block was processed. Again, that's a major oversimplification, but that's kinda how it's done. A rough sketch of the offset-tracking idea follows.
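As a rough illustration of that idea, here is a sketch of a consumer that fans records out to a thread pool and commits only the highest contiguous processed offset per partition, so an out-of-order completion can never cause unprocessed records to be skipped. It assumes gap-free offsets (transactional markers and compaction would need extra care); the pool size and callApi() are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContiguousCommitConsumer {
    private final ExecutorService pool = Executors.newFixedThreadPool(16); // assumed pool size
    // per partition: offsets whose processing has finished but is not yet committed
    private final Map<TopicPartition, NavigableSet<Long>> done = new ConcurrentHashMap<>();
    private final Map<TopicPartition, Long> nextToCommit = new ConcurrentHashMap<>();

    public void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> r : records) {
                TopicPartition tp = new TopicPartition(r.topic(), r.partition());
                nextToCommit.putIfAbsent(tp, r.offset());
                pool.submit(() -> {
                    callApi(r); // the downstream API call, possibly slow
                    done.computeIfAbsent(tp, k -> new ConcurrentSkipListSet<>()).add(r.offset());
                });
            }
            commitContiguous(consumer);
        }
    }

    private void commitContiguous(KafkaConsumer<String, String> consumer) {
        Map<TopicPartition, OffsetAndMetadata> commits = new HashMap<>();
        for (Map.Entry<TopicPartition, NavigableSet<Long>> e : done.entrySet()) {
            long next = nextToCommit.get(e.getKey());
            while (e.getValue().remove(next)) { // advance through the finished prefix only
                next++;
            }
            nextToCommit.put(e.getKey(), next);
            commits.put(e.getKey(), new OffsetAndMetadata(next)); // next offset to read
        }
        if (!commits.isEmpty()) {
            consumer.commitSync(commits);
        }
    }

    private void callApi(ConsumerRecord<String, String> r) { /* blocking call placeholder */ }
}
```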
You should look at Kafka batch processing. In a nutshell: you can set up a huge batch size with a small number of partitions (or even a single one). Once the whole batch of messages has been consumed on the consumer side (i.e., it sits in memory), you can parallelize processing of those messages in any way you want.
I would like to share links, but there are too many of them scattered around the web.
UPDATE
In terms of committing offsets: you can commit once for the whole batch.
In general, Kafka doesn't hit performance targets by inflating the partition count, but rather by relying on batch processing.
I have already seen a lot of projects suffering from partition over-scaling (you may see issues later, during rebalancing for example). The rule of thumb: look at every available batch setting first. A sketch follows.
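To make the batch-first approach concrete: on the consumer side, the knobs analogous to the producer's batch.size are max.poll.records and fetch.min.bytes. A minimal sketch, with illustrative (not recommended) values:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class BatchFirstConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "batch-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "5000");   // big batches per poll
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1048576"); // wait for ~1 MB to accumulate
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");   // ...but at most 500 ms
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("input-topic"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                if (batch.isEmpty()) {
                    continue;
                }
                // The whole batch is now in memory: parallelize it any way you want.
                batch.partitions().parallelStream()
                        .flatMap(tp -> batch.records(tp).stream())
                        .forEach(r -> process(r.value()));
                consumer.commitSync(); // one commit for the whole batch
            }
        }
    }

    private static void process(String value) { /* per-record work placeholder */ }
}
```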

Apache Storm: what happens to a tuple when no bolts are available to consume it?

If a tuple is bound for another bolt, but no instances of that bolt are available for a while, how long will it hang around? Indefinitely? Long enough?
What about when many tuples are waiting in line for the next available bolt? Will they merge? Will bad things happen if too many get backed up?
By default, tuples time out 30 seconds after being emitted. You can change this value (topology.message.timeout.secs), but unless you know what you are doing, don't.
Failed and timed-out tuples will be replayed by the spout if the spout is reading from a reliable data source (e.g. Kafka); that is, Storm has guaranteed message processing. If you are coding your own spouts, you might want to dig deep into this.
You can see whether tuples are timing out in the Storm UI: tuples failing on the spout but not on the bolts is the telltale sign.
You don't want tuples to time out inside your topology (for example, there is a performance penalty on Kafka for not reading sequentially). You should adjust your topology's capacity to process tuples (that is, tweak the bolt parallelism by changing the number of executors) and set the parameter topology.max.spout.pending to a reasonably conservative value.
Increasing the topology.message.timeout.secs parameter is no real solution, because sooner or later, if the capacity of your topology is not enough, tuples will start to fail.
topology.max.spout.pending is the maximum number of tuples that can be waiting. The spout will emit more tuples as long as the number of tuples not yet fully processed is less than the given value. Note that topology.max.spout.pending is per spout (each spout has its own internal counter and keeps track of the tuples that are not fully processed).
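A small configuration sketch of the knobs discussed above, using the pre-1.0 Storm API; the numbers are illustrative assumptions only, and the spout/bolt implementations are left to the caller:

```java
import backtype.storm.Config;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.TopologyBuilder;

public class CapacityTuning {

    public static Config conservativeConfig() {
        Config conf = new Config();
        conf.setMessageTimeoutSecs(30); // topology.message.timeout.secs (the default)
        conf.setMaxSpoutPending(500);   // topology.max.spout.pending, tracked per spout
        return conf;
    }

    public static TopologyBuilder wire(IRichSpout spout, IRichBolt bolt) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("source", spout, 2);
        // Raising the executor count here is how you add processing capacity.
        builder.setBolt("worker", bolt, 8).shuffleGrouping("source");
        return builder;
    }
}
```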
There is a deserialize queue for buffering incoming tuples. If things hang long enough, the queue fills up, and tuples will be lost unless you use the ack mechanism to make sure they are resent.
Storm just drops tuples that are not fully processed before the timeout (the default is 30 seconds).
After that, Storm calls the fail(Object msgId) method of the spout. If you want to replay the failed tuples, you should implement this method, and you need to keep the tuples in memory, or in another reliable storage system such as Kafka, in order to replay them.
If you do not implement the fail(Object msgId) method, Storm just drops them.
Reference: https://storm.apache.org/documentation/Guaranteeing-message-processing.html

Storm Trident and Spark Streaming: distributed batch locking

After a lot of reading and building a POC, we are still unsure whether Storm Trident or Spark Streaming can handle our use case:
We have an inbound stream of sensor data for millions of devices (which have unique identifiers).
We need to perform aggregation of this stream on a per device level. The aggregation will read data that has already been processed (and persisted) in previous batches.
Key point: When we process data for a particular device we need to ensure that no other processes are processing data for that particular device. This is because the outcome of our processing will affect the downstream processing for that device. Effectively we need a distributed lock.
In addition, each device's event data needs to be processed in the order in which the events occurred.
Essentially we can’t have two batches for the same device being processed at the same time.
Can Trident/Spark Streaming handle our use case?
Any advice appreciated.
Since you have unique ids, can you divide them up? Simply divide the id by 10, for example, and depending on the remainder send them to different processing boxes. This also takes care of making sure each device's events are processed in order, as they will all be sent to the same box. I believe Storm/Trident allows you to guarantee in-order processing. I'm not sure about Spark, but I would be surprised if it doesn't.
Pretty awesome problem to solve, I have to say though.
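In Storm, the remainder-based routing suggested above is essentially what a fields grouping gives you: all tuples with the same device id hash to the same bolt executor. A minimal sketch, assuming the spout declares a deviceId output field; component names and parallelism hints are assumptions:

```java
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class PerDeviceTopology {
    public static TopologyBuilder build(IRichSpout sensorSpout, IRichBolt aggregatorBolt) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sensors", sensorSpout, 4);
        // All tuples with the same deviceId hash to the same executor, giving
        // per-device serialization and ordering without an explicit distributed lock.
        builder.setBolt("aggregate", aggregatorBolt, 16)
               .fieldsGrouping("sensors", new Fields("deviceId"));
        return builder;
    }
}
```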

Storm as a replacement for Multi-threaded Consumer/Producer approach to process high volumes?

We have an existing setup where upstream systems send messages to us on a message queue and we process these messages. The content is XML and we simply unmarshal it. This unmarshalling step is followed by a write to the DB (to put the relevant values into the relevant columns).
The system is set to interface with many more upstream systems, and our volumes are going to increase to a peak of 40mm (40 million) messages per day.
Our current way of processing is to have listeners on the queues and then multiple threads of producers and consumers which do the unmarshalling and the subsequent DB write.
My question: can this process fit into the Storm use case scenario?
I mean, can the MQ be my spout, with one bolt to unmarshal whose output stream then feeds a second bolt that does the write to the DB?
If yes, what is the benefit I can derive? Is it goodbye to the cumbersome multi-threaded producer/worker pattern of code?
If it's as simple as the above, where and why would one still resort to the conventional multi-threaded producer/consumer approach?
My point being: is there a data volume/frequency at which Storm starts to shine compared to the conventional approach?
PS: I'm very new to this, trying to get the hang of it, and want to ascertain whether this line of thinking is right.
Regards,
CVM
Definitely, this scenario can fit into a Storm topology. The spouts can pull from the MQ, and the bolts can handle the unmarshalling and subsequent processing.
The major benefit over the conventional multi-threaded pattern is the ability to add more worker nodes as the load increases. This is not so easy with traditional producer/consumer patterns.
A specific data-volume number is a very broad question, since it depends on a large number of factors (hardware, etc.). A wiring sketch follows.
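For concreteness, a sketch of the wiring the question describes: an MQ spout feeding an unmarshalling bolt whose output stream feeds a DB-writer bolt. Component implementations, names, and parallelism hints are placeholders:

```java
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.TopologyBuilder;

public class MqToDbTopology {
    public static TopologyBuilder build(IRichSpout mqSpout, IRichBolt unmarshalBolt,
                                        IRichBolt dbWriterBolt) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("mq", mqSpout, 2);
        builder.setBolt("unmarshal", unmarshalBolt, 8).shuffleGrouping("mq");
        builder.setBolt("db-writer", dbWriterBolt, 8).shuffleGrouping("unmarshal");
        // Scaling out means raising these parallelism hints and adding worker
        // nodes, rather than hand-managing producer/consumer thread pools.
        return builder;
    }
}
```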
