Emitting Tuple to different Bolts - apache-storm

I am working on a scenario in which a Spout reads data from a Message Broker and emits each message as a tuple to a Bolt for processing.
After processing, the Bolt splits it into separate messages, and each sub-message has to be sent to a different Broker, which can be hosted on a different machine.
Assume I have a finite set of recipients (in my case there are 3 Message Brokers for output).
So Bolt1, after processing, can either deliver the messages directly to these 3 Message Brokers or emit them to separate Bolts.
Now, if I use a single Bolt that delivers the messages to these three brokers by itself, and let's say one of them fails (due to unavailability etc.), I call the collector's fail method.
Once fail is called in the bolt, the fail method of my Spout gets invoked.
Here, I believe I will have to process the entire message again (I have to make sure every message gets processed), even though 2 out of 3 sub-messages were delivered successfully.
Alternatively, even if I emit these 3 sub-messages to different bolts, I think the Spout will still have to process the entire message again.
This is because I attach a unique GUID to the message when emitting it for the first time in the spout's nextTuple() method.
Is there a way to ensure that only the failed sub-message is processed again and not the entire message?
Thanks

Storm (low-level Java API) provides only "at-least-once" processing guarantees, i.e., there is no support for avoiding duplicate processing in case of failure.
If you need exactly-once processing, you can use Trident on top of Storm. However, even Trident cannot give you exactly-once if you emit data to an external system (unless the external system can detect and delete duplicates). This is not a Storm-specific problem but a general one. Other systems like Apache Flink, Apache Spark Streaming, or S-Store (a recent research prototype from Stonebraker's group at MIT) "suffer" from the exact same problem.
Maybe the best approach would be to try out Trident and evaluate whether it can meet your requirements.
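Since the replay itself cannot be avoided, one way to limit its effect is to make the delivery step remember what already went out. Below is a minimal sketch in plain Java (no Storm classes), assuming the bolt can record each successful (GUID, broker) pair somewhere. DeliveryLedger, deliver and the sender function are hypothetical names for illustration, and the in-memory set stands in for a durable store shared by all bolt instances.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;

/** Remembers which (messageId, broker) deliveries succeeded so a replay can skip them. */
final class DeliveryLedger {
    // Assumption: a real topology would keep this in a durable, shared store.
    private final Set<String> delivered = new HashSet<>();

    /**
     * Sends every sub-message that has not been delivered yet.
     *
     * @param messageId   the GUID attached to the original message by the spout
     * @param subMessages broker name -> payload for that broker
     * @param sender      hypothetical send function, returns true if the broker accepted
     * @return brokers that still failed; an empty list means the tuple can be acked
     */
    List<String> deliver(String messageId, Map<String, String> subMessages,
                         BiPredicate<String, String> sender) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, String> entry : subMessages.entrySet()) {
            String key = messageId + "|" + entry.getKey();
            if (delivered.contains(key)) {
                continue; // delivered in an earlier attempt, do not send again
            }
            if (sender.test(entry.getKey(), entry.getValue())) {
                delivered.add(key);
            } else {
                failed.add(entry.getKey());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        DeliveryLedger ledger = new DeliveryLedger();
        Map<String, String> subs = new LinkedHashMap<>();
        subs.put("brokerA", "part-1");
        subs.put("brokerB", "part-2");
        subs.put("brokerC", "part-3");

        // First attempt: brokerC is unavailable, so the bolt would fail the tuple.
        List<String> firstTry = ledger.deliver("guid-1", subs,
                (broker, payload) -> !broker.equals("brokerC"));
        System.out.println("failed on first attempt: " + firstTry);

        // Replay of the same GUID: only the missing sub-message is sent.
        List<String> secondTry = ledger.deliver("guid-1", subs, (broker, payload) -> {
            System.out.println("sending to " + broker);
            return true;
        });
        System.out.println("failed on replay: " + secondTry);
    }
}
```

Note that this only narrows the window for duplicates. If the bolt crashes after a broker accepted a message but before the ledger was updated, that sub-message is still sent twice, so the receiving side has to tolerate duplicates anyway.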

Related

ActiveMQ - Competing Consumers with Selector - messages starve in the queue

ActiveMQ 5.15.13
Context: I have a single queue with multiple Consumers. I want to stop some consumers from processing certain messages. This has to be dynamic, I don't want to create separate queues for this. This works without any problems. e.g. Consumer1 ignores Stocks -> Consumer1 can process all invoices and Consumer2 can process all Stocks
But if there is a large number of messages already in the Queue (of one type, e.g. stocks) and I send a message of another type (e.g. invoices), Consumer1 won't process the message of type invoices. It will instead be idle until Consumer2 has processed all Stocks messages. It does not happen every time, but quite often.
Is there any option to change the order of the new messages coming into the queue, such that an idle consumer with matching selector picks up the new message?
Things I've already tried:
using a PendingMessageLimitStrategy -> it seems like it does not work for queues
increasing the maxPageSize and maxBrowsePageSize in the hope that once all Messages are in RAM, the Consumers will search for their messages.
Exclusive Consumers aren't an option since I want to be able to use more than one Consumer per message type.
I'm pretty sure that there is some configuration which allows this type of usage. I'm aware that there are better solutions for this issue, but sadly I can't use them easily due to other constraints.
Thanks a lot in advance!
EDIT: I noticed that when I'm refreshing on the localhost queue browser, the stuck messages get executed immediately. It seems like this action performs some sort of queue refresh where the messages get filtered based on their selector again. So I just need this action whenever a new message enters the queue...
This is a 'window' problem where the next set of 'stocks' data needs to be processed before the 'invoicing' data can be processed.
The gotcha with window problems like this is that you need to account for the fact that some messages may never come through, or a consumer may never come back online either. Also, eventually you will be asked 'how many invoices or stocks are left to be processed'-- aka observability.
ActiveMQ has you covered-- check out wild-card destinations and consumers.
Produce 'stocks' to:
queue://data.stocks.input
Produce 'invoices' to:
queue://data.invoices.input
You then set up consumers to connect to:
queue://data.*.input
Note the wildcard '*'.
ActiveMQ will match queues based on the wildcard pattern, and then process data accordingly. As a bonus, you can still use a selector.
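To make the matching rule concrete, here is a small plain-Java sketch of how such a pattern lines up with queue names. It is a simplified model for illustration, not the broker's implementation, and WildcardMatch is a hypothetical name. It assumes the documented ActiveMQ wildcard syntax: '.' separates name segments, '*' matches exactly one segment, and '>' matches everything from that point on.

```java
/** Simplified model of ActiveMQ-style destination wildcards. */
final class WildcardMatch {

    /** Returns true if the dot-separated destination fits the wildcard pattern. */
    static boolean matches(String pattern, String destination) {
        String[] p = pattern.split("\\.");
        String[] d = destination.split("\\.");
        for (int i = 0; i < p.length; i++) {
            if (p[i].equals(">")) {
                return true; // '>' accepts the rest of the name
            }
            if (i >= d.length) {
                return false; // destination has fewer segments than the pattern
            }
            if (!p[i].equals("*") && !p[i].equals(d[i])) {
                return false; // literal segment differs
            }
        }
        return p.length == d.length; // no unmatched trailing segments
    }

    public static void main(String[] args) {
        System.out.println(matches("data.*.input", "data.stocks.input"));
        System.out.println(matches("data.*.input", "data.invoices.input"));
        System.out.println(matches("data.*.input", "data.stocks.output"));
    }
}
```

With separate queues per type, a backlog of stocks sits in its own queue and no longer sits in front of a newly arrived invoice.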

Why does Trident not call ack() or fail() in this minimal example?

I tried to create a small example in Trident. The goal was to see how tuples are replayed in case of failures. Below is the topology definition:
Random rand = new Random();
Config config = new Config();
config.setDebug(true);
config.setNumWorkers(1);
TridentTopology topology = new TridentTopology();
topology.newStream("spout", new RandomIntegerSpout())
    .map((MapFunction) tridentTuple -> {
        if ((tridentTuple.getLongByField("msgid") % 50 == 0) &&
                (rand.nextInt(2) == 1)) {
            System.out.println(String.format("Failed to process tuple %d", tridentTuple.getLongByField("msgid")));
            throw new ReportedFailedException("Divisible by 50");
        }
        return new Values(tridentTuple.toArray());
    })
    .peek((Consumer) tridentTuple -> System.out.println(tridentTuple.getValues()));
I use the RandomIntegerSpout from storm-starter, which extends BaseRichSpout and just generates random numbers. I then apply a MapFunction that, for every tuple whose msgid is divisible by 50, draws a random number and fails the tuple about half of the time.
The problem is that I do not get any acks or fails.
I played around with the spout and ran it in debug mode, tried same sample output, tried it with standard storm bolts. The anchoring is working fine, it just does not get called by trident.
I reproduced this problem with LocalCluster and StormSubmitter, in v1.2.3 and v2.0.0.
Below is a screenshot of the Storm UI:
The bolts corresponding to the map ack and fail the tuples as expected, but these are never propagated back to the spout.
I thought the trident mastercoord might expect some kind of persistence in a state to realize the topology is done, but replacing peek by some persistentAggregate did not help. I also ruled out a bug in map by doing the same with each.
Since the code is almost trivial by inspection, I probably misunderstand something fundamental about Trident / Storm. Am I wrong to expect Trident to call the spout's ack method when a batch is done? I also realized there is no fail method in IBatchSpout. How does Trident handle replaying of batches?
Trident spouts don't ack or fail tuples at the individual tuple level. Instead, tuples are acked as a batch.
Trident spouts will often look something like this interface.
M emitPartitionBatch(TransactionAttempt tx, TridentCollector collector, PartitionT partition, M lastPartitionMeta);
The idea is that Trident keeps track of the acks/fails of the tuples in a batch. If the batch fails, it asks the spout to repeat the batch, and if not, it simply won't.
Note how this is different from a standard Storm spout. With a normal spout, the framework basically tells the spout "Hey, emit something. Up to you what you emit.", and then the ack and fail methods are used to tell the spout whether it should emit a particular tuple again.
With Trident, the spout is instead told "Hey, (re)emit batch number x", and it is then up to the spout to know which tuples were in that batch. With this model there's no need for a fail method. Some Trident spouts will have an ack/succeed method though, to allow the spout to drop any state it may have related to a particular in-progress batch.
For wrapped IRichSpouts, there's some bridging code that wraps them into the Trident API. Basically, the wrapper calls nextTuple until it has a full batch, then it stores the ids in a cache. If the wrapper is asked to reemit a batch, it calls fail on the spout. Otherwise, it calls ack once the batch has succeeded.
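The bridging behaviour described above can be sketched in plain Java. This is a simplified model that follows the description, not Storm's actual wrapper code. SimpleSpout is a hypothetical stand-in for a tuple-at-a-time spout (it is not IRichSpout), and BatchingWrapper, emitBatch and success are names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a tuple-at-a-time spout. */
interface SimpleSpout {
    Long nextTuple();      // message id of the emitted tuple, or null if nothing to emit
    void ack(long msgId);
    void fail(long msgId);
}

/** Groups tuples of a SimpleSpout into batches and maps batch outcomes to ack/fail. */
final class BatchingWrapper {
    private final SimpleSpout spout;
    private final int batchSize;
    // Cache of the message ids that went into each batch that is still in progress.
    private final Map<Long, List<Long>> idsByBatch = new HashMap<>();

    BatchingWrapper(SimpleSpout spout, int batchSize) {
        this.spout = spout;
        this.batchSize = batchSize;
    }

    /** Emits batch txId; a repeated request for the same txId means the earlier attempt failed. */
    void emitBatch(long txId) {
        List<Long> previous = idsByBatch.remove(txId);
        if (previous != null) {
            previous.forEach(spout::fail); // let the spout know these tuples need replaying
        }
        List<Long> ids = new ArrayList<>();
        while (ids.size() < batchSize) {
            Long id = spout.nextTuple();
            if (id == null) {
                break; // spout has nothing more to emit right now
            }
            ids.add(id);
        }
        idsByBatch.put(txId, ids);
    }

    /** Called once the whole batch has been processed successfully. */
    void success(long txId) {
        List<Long> ids = idsByBatch.remove(txId);
        if (ids != null) {
            ids.forEach(spout::ack);
        }
    }

    public static void main(String[] args) {
        SimpleSpout counting = new SimpleSpout() {
            private long next = 0;
            public Long nextTuple() { return next++; }
            public void ack(long msgId) { System.out.println("ack " + msgId); }
            public void fail(long msgId) { System.out.println("fail " + msgId); }
        };
        BatchingWrapper wrapper = new BatchingWrapper(counting, 3);
        wrapper.emitBatch(1); // first attempt of batch 1
        wrapper.emitBatch(1); // batch 1 is requested again, so the first attempt is failed
        wrapper.success(1);   // second attempt succeeded, its tuples are acked
    }
}
```

In this model ack and fail only ever fire when a whole batch completes or is requested again, never for a single tuple in isolation.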
I think the reason you're not seeing anything in Storm UI related to this is that the IRichSpout isn't actually represented there. Instead it's wrapped, so the ack/fail calls are happening "under the hood" inside the spout-spout component. If you want to know for sure whether ack/fail is being called, try adding some logging to the ack/fail methods of your IRichSpout.

How to create unique messages to rabbitmq queue - spring-amqp

I am putting a message containing string data onto a RabbitMQ queue.
Message publishing is called as part of a service, and the service can be called with the same data (which goes to the queue) multiple times, so duplicated data in the queue is very likely.
This causes issues for us because the consumer code inserts this data into a table where it is the primary key. The consumer runs on 4 different nodes simultaneously, so consumers may end up processing the same data (from different messages) at the same time.
I want to know if RabbitMQ publishing has any way to avoid message duplication.
I read that "define a property "x-unique-message-code" to compare them is an easy and simple way", but I don't know how to do it.
I am using spring-amqp
Any help is highly appreciated.
Thank you
There is a good article from RabbitMQ about reliability: https://www.rabbitmq.com/reliability.html
There is a note like:
In the event of network failure (or a node crashing), messages can be duplicated, and consumers must be prepared to handle them. If possible, the simplest way to handle this is to ensure that your consumers handle messages in an idempotent way rather than explicitly deal with deduplication.
For this purpose the message to produce can be supplied with a messageId property.
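As a rough illustration of the idempotent-consumer idea, here is a plain-Java sketch without any Spring or RabbitMQ classes. It assumes the publisher derives the messageId from the payload, so that identical data always carries the same id. IdempotentConsumer, messageIdFor and handle are hypothetical names, and the in-memory set only works inside one JVM.

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Skips messages whose id has already been handled. */
final class IdempotentConsumer {
    // Assumption: one JVM only. Several consumer nodes need a shared store instead.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Derives a stable id from the payload, so equal data yields an equal messageId. */
    static String messageIdFor(String payload) {
        return UUID.nameUUIDFromBytes(payload.getBytes(StandardCharsets.UTF_8)).toString();
    }

    /** Returns true if the message was processed, false if it was a duplicate. */
    boolean handle(String messageId, String payload) {
        if (!seen.add(messageId)) {
            return false; // id already seen, skip the insert
        }
        // The insert of the payload into the table would happen here.
        return true;
    }

    public static void main(String[] args) {
        IdempotentConsumer consumer = new IdempotentConsumer();
        String data = "order-42";
        String id = messageIdFor(data);
        System.out.println("first delivery processed: " + consumer.handle(id, data));
        System.out.println("second delivery processed: " + consumer.handle(id, data));
    }
}
```

With 4 consumer nodes, the in-memory set would have to be replaced by something all nodes share. Since the data is already the primary key of the table, letting the insert itself reject duplicates (and treating that rejection as success) is probably the simplest shared check.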

How to handle ACKing in storm with multiple bolts reading from the same spout

My topology looks like this :
Data_Enrichment_Persistence_Topology
Basically, the problem I am trying to solve is that every time an issue occurs in the Stop or Load service bolts and a tuple fails, it is replayed and the spout re-emits it. This makes the Cassandra bolt reprocess the tuple and rewrite the data.
I cannot make the tuples in the Load and Stop bolts unanchored, as I need them to be replayed in case of any failure. However, I only want the upper workflow to be replayed.
I am using a KafkaSpout to emit data (it emits on the "default" stream). I am not sure how to duplicate the streams at the KafkaSpout's emit level.
If I can duplicate the streams, a replay on either of the two will only re-emit the message on that particular stream at the spout level, leaving the other stream untouched, right?
TIA!
You need to use two output streams in your Spout, one for each downstream path. Furthermore, you emit each tuple to both streams (using a different message ID for each).
Thus, if one fails, you can replay this tuple to just that stream.
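The bookkeeping behind this can be sketched in plain Java (no Storm classes). It is a model of the idea under stated assumptions, not a working spout: TwoStreamSpoutModel, the stream names "cassandra" and "services", and the "stream#offset" id format are all hypothetical. In an actual spout the corresponding calls would be OutputFieldsDeclarer.declareStream(streamId, fields) and SpoutOutputCollector.emit(streamId, tuple, messageId).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Models a spout that emits every message on two streams with one message id per stream. */
final class TwoStreamSpoutModel {
    // Hypothetical stream names, one per downstream path.
    static final List<String> STREAMS = List.of("cassandra", "services");

    // Message id ("stream#offset") -> payload, kept until the tuple tree is acked.
    private final Map<String, String> pending = new HashMap<>();
    // Log of emissions as "stream:payload", standing in for collector.emit calls.
    final List<String> emitted = new ArrayList<>();

    /** Emits the payload once per stream, each time under its own message id. */
    void emit(long offset, String payload) {
        for (String stream : STREAMS) {
            pending.put(stream + "#" + offset, payload);
            emitted.add(stream + ":" + payload);
        }
    }

    /** The path behind this message id is done, so its payload can be forgotten. */
    void ack(String msgId) {
        pending.remove(msgId);
    }

    /** Replays the payload only on the stream encoded in the failed message id. */
    void fail(String msgId) {
        String payload = pending.get(msgId);
        if (payload == null) {
            return; // unknown or already acked id
        }
        String stream = msgId.substring(0, msgId.indexOf('#'));
        emitted.add(stream + ":" + payload);
    }

    public static void main(String[] args) {
        TwoStreamSpoutModel spout = new TwoStreamSpoutModel();
        spout.emit(7, "event");
        spout.ack("cassandra#7");  // the Cassandra path succeeded
        spout.fail("services#7");  // the Load/Stop path failed
        System.out.println(spout.emitted);
    }
}
```

Because the failed id names its stream, the replay never reaches the bolts on the other stream, so the Cassandra write is not repeated.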

storm uncontrolled tuple multiplication

I am trying to put Kafka data through Storm into HDFS and Hive. I am working with Hortonworks. I therefore have the following structure, which I have seen (slightly modified) in many tutorials (http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/):
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout", kafkaSpout);
builder.setBolt("hdfs-bolt", hdfsBolt).globalGrouping("kafka-spout");
builder.setBolt("parse-bolt", new ParseBolt()).globalGrouping("kafka-spout");
builder.setBolt("hive-bolt", hiveBolt).globalGrouping("parse-bolt");
I send the kafka-spout data directly to hdfs-bolt, which works when I only use hdfs-bolt. When I add the parse-bolt to parse the Kafka data and emit it to hive-bolt, the complete system goes crazy. Even when I send just one single message over Kafka, this message is duplicated by the kafka-spout endlessly and is written to HDFS over and over.
If there is an error in the parse-bolt, shouldn't the hdfs-bolt still work normally? I am new to the topic. Can someone spot a simple beginner's mistake? I am grateful for any advice.
Are you acking the messages at the end of both bolts' execute methods?
When both bolts read from the same stream of your kafka-spout, their tuples are anchored to the same spout tuple. So even if only your parse-bolt's tuple fails, it will get replayed at the spout, since it is anchored there. This results in another tuple with a different messageId but the same content being emitted to all the bolts subscribed to that stream, in your case the parse-bolt and the hdfs-bolt.
Remember that the replay happens at the Spout, and hence everything subscribed to that stream from the spout will get redundant messages.
