Confusion of Storm acker and guaranteed message processing - apache-storm

Now I am learning Storm's Guaranteeing Message Processing and am confused by some concepts in this part.
To guarantee that a message emitted by a spout is fully processed, Storm uses acker tasks. Each time a spout emits a tuple, an acker tracks an "ack val" for it, initialized to 0, which stores the status of the tuple tree. Each time a downstream bolt emits a new tuple anchored to this tree or acks an "old" tuple, that tuple's ID is XORed into the "ack val". The acker then only needs to check whether the "ack val" is 0 to know whether the tuple has been fully processed. Let's look at the code below:
public class WordReader implements IRichSpout {
    ... ...
    while((str = reader.readLine()) != null){
        this.collector.emit(new Values(str), str);
        ... ...
}
The code piece above is a spout from the word-count program in the "Getting Started with Storm" tutorial. In the emit method, the second parameter, "str", is the messageId. I am confused by this parameter:
1) As I understand it, each time a tuple (i.e., a message) is emitted, whether in a spout or in a bolt, it should be Storm's responsibility to assign a 64-bit messageId to that message. Is that correct? Or is "str" here just a human-readable alias for this message?
2) Whatever the answer to 1) is, "str" here would be the same word in two different messages, because a text file contains many duplicate words. If that is true, how does Storm differentiate the messages? And what is the meaning of this parameter?
3) In some code pieces, I have seen spouts use the following code to set the message ID in the emit method:
public class RandomIntegerSpout extends BaseRichSpout {
    private long msgId = 0;
    collector.emit(new Values(..., ++msgId), msgId);
}
This is much closer to what I think it should be: the message ID should be different across different messages. But this code piece raises another confusion: what happens to the private field "msgId" across different executors? Each executor has its own msgId initialized to 0, so messages in different executors will be numbered 0, 1, 2, and so on. How does Storm then differentiate these messages?
I am a novice to Storm, so maybe these questions are naive. I hope someone can help me figure them out. Thanks!

About the message ID in general: internally it might be a 64-bit value, but this 64-bit value is computed as a hash of the msgId object provided to emit() in the spout. So you can hand over any object as the message ID (the probability that two objects hash to the same value is close to zero).
About using str: I think that in this example str contains a line (not a word), and it is very unlikely that a document contains the exact same line twice (unless there are empty lines, of which there might be many).
About the counter as message ID: you are absolutely right in your observation -- if multiple spouts run in parallel, this would give message ID conflicts and would break fault tolerance.
If you want to "fix" the counter approach, each counter should be initialized differently (best, from 1...#SpoutTasks). You can use the task ID for this (which is unique and can be accessed via the TopologyContext provided in Spout.open()). Basically, you get all task IDs of all parallel spout tasks, sort them, and assign each spout task its ordinal number. Furthermore, you need to increment by "number of parallel spouts" instead of 1.
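A minimal sketch of that idea, assuming recent org.apache.storm package names (older versions use backtype.storm) and a hypothetical UniqueMsgIdSpout; everything besides the TopologyContext calls is illustrative:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class UniqueMsgIdSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private long msgId;      // starts at this task's ordinal among all spout tasks
    private int parallelism; // number of parallel spout tasks

    @Override
    public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        // All task IDs of this spout component, sorted, define a stable ordering.
        List<Integer> taskIds = new ArrayList<>(context.getComponentTasks(context.getThisComponentId()));
        Collections.sort(taskIds);
        this.parallelism = taskIds.size();
        this.msgId = taskIds.indexOf(context.getThisTaskId()); // ordinal of this task
    }

    @Override
    public void nextTuple() {
        // ... produce some value here ...
        collector.emit(new Values("some-value"), msgId);
        msgId += parallelism; // step by the number of spout tasks so IDs never collide
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("value"));
    }
}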

Related

Kafka Streams TopologyTestDriver input-output topic

I have a Kafka Streams unit test based on the really great, reliable and convenient TopologyTestDriver:
try (TopologyTestDriver testDriver = new TopologyTestDriver(builder.build(),
        streamsConfig(Serdes.String().getClass(), SpecificAvroSerde.class))) {
    TestInputTopic<String, Event> inputTopic = testDriver.createInputTopic(inputTopicName,
            Serdes.String().serializer(), eventSerde.serializer());
    TestOutputTopic<String, Frame> outputWindowTopic = testDriver.createOutputTopic(
            outputTopicName, Serdes.String().deserializer(), frameSerde.deserializer());
    ...
}
I'd like to test a bit more complex setup where an "output" topic is an "input" topic for another topology.
I can define several input and output topics within the same topology. But as soon as I use the same topic as both an input and an output topic within the same topology, I get the following exception:
org.apache.kafka.streams.errors.TopologyException: Invalid topology: Topic events has already been registered by another source.
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.validateTopicNotAlreadyRegistered(InternalTopologyBuilder.java:578)
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addSource(InternalTopologyBuilder.java:378)
at org.apache.kafka.streams.kstream.internals.graph.StreamSourceNode.writeToTopology(StreamSourceNode.java:94)
at org.apache.kafka.streams.kstream.internals.InternalStreamsBuilder.buildAndOptimizeTopology(InternalStreamsBuilder.java:303)
at org.apache.kafka.streams.StreamsBuilder.build(StreamsBuilder.java:558)
at org.apache.kafka.streams.StreamsBuilder.build(StreamsBuilder.java:547)
It looks like the TopologyTestDriver doesn't provide the possibility to define input-output topics, is that right?
Update
To better illustrate what I'm trying to achieve:
builder.stream("input-topic, ...)..to("intermediate-topic",...);
builder.stream("intermediate-topic", ...)..to("output-topic",...);
and I want to be able to verify (assert) the contents of the "intermediate-topic" in my unit test. Btw, I cannot "reuse" the result of the ".to()" call when building the next part of the topology, since that method returns void.
But I only have testDriver.createInputTopic() and testDriver.createOutputTopic() and no way of defining something like testDriver.createInputOutputTopic().
Using the same topic as an input and an output topic should work. However, you cannot use the same topic as an input topic multiple times (the stack trace indicates that you are trying to do this).
If you want to use the same input topic twice, you would just add it once and "fan it out":
KStream stream = builder.stream(...);
stream.map(...); // first usage
stream.filter(...); // second usage
Using the same KStream object twice is basically a "fan out" (or "broadcast") that sends the input data to both operators.
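Applied to the topology from the question, a minimal sketch of that fan-out could look as follows (the mapValues/filter steps and the String types are placeholders for whatever processing the real topology does):
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();

// Read "input-topic" exactly once and keep the KStream object around.
KStream<String, String> input = builder.stream("input-topic");

// First part of the pipeline; also materialize it so the test can inspect it.
KStream<String, String> intermediate = input.mapValues(v -> v);   // placeholder processing
intermediate.to("intermediate-topic");

// Second part of the pipeline continues from the same KStream object
// instead of re-reading "intermediate-topic" as a second source.
intermediate.filter((k, v) -> v != null)                          // placeholder processing
            .to("output-topic");
With this wiring, "intermediate-topic" is only a sink of the topology, so its contents can still be asserted in the unit test via testDriver.createOutputTopic("intermediate-topic", ...).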

IBM Integration Bus, best practices for calling multiple services

So I have this requirement that takes in one document and, from that, needs to create one or more documents in the output.
During the course of this, it needs to determine whether the document is already there, because there are different operations to apply in the create and update scenarios.
In straight code, this would be simple (conceptually):
InputData in = <something>;
if (getItemFromExternalSystem(in.key1) == null) {
    createItemSpecificToKey1InExternalSystem(in.key1);
}
if (getItemFromExternalSystem(in.key2) == null) {
    createItemSpecificToKey2InExternalSystem(in.key1, in.key2);
}
createItemFromInput(in.key1, in.key2, in.moreData);
In effect, a kind of "ensure this data is present".
However, how would I go about achieving this in IIB? If I used a subflow for the get/create cycle, the output of the subflow, i.e. whatever the result of the last operation is, would be returned from the subflow as the new "message" of the flow. But I don't really care about the value from the "ensure data present" subflow; I need to keep working on my original message, while still waiting for the different subflows to finish before I can run my final "createItem".
You can use Aggregation nodes: for example, use three flows:
the first would propagate your original message to the third;
the second would invoke the operations createItemSpecificToKey1InExternalSystem and createItemSpecificToKey2InExternalSystem;
the third would aggregate the results of the first and second and invoke createItemFromInput.
Have you considered using the Collector node? It will collect your records into N 'collections', and then you can iterate over the collections and output one document per collection.

Apache Storm SleepSpoutWaitStrategy Behaviour

I am running an Apache Storm benchmark on a local machine.
However, I am seeing a weird behavior. One of the benchmarks, the SOL (Speed of Light) test, uses a RandomMessageSpout to generate random tuples as the source. Here is the nextTuple() code for that spout:
public void nextTuple() {
    final String message = messages[rand.nextInt(messages.length)];
    if(ackEnabled) {
        collector.emit(new Values(message), messageCount);
        messageCount++;
    } else {
        collector.emit(new Values(message));
    }
}
When I run this benchmark and profile it with a Java profiler (YourKit in my case), the spout thread shows sleep intervals in accordance with SleepSpoutWaitStrategy.emptyEmit(). As I understand it, this function is called when nextTuple() has no tuples to emit, and the spout thread then sleeps for a configurable amount of time, as shown in the screenshot.
I do not understand why this function would be called given this particular nextTuple() implementation, which always emits a tuple. What am I misunderstanding here?
Empty emit is also called in the following situations:
if the number of unacked messages reaches the max spout pending (see the sketch below);
if the executor's send queue as well as the spout's overflow buffer is full.
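For reference, max spout pending is a per-topology setting; a minimal sketch of how it can be configured (the value 1000 is just an illustration, and the package is backtype.storm in older Storm versions):
import org.apache.storm.Config;

Config conf = new Config();
// Cap the number of unacked tuples per spout task; once this many tuples are
// in flight, nextTuple() is not called and the configured wait strategy kicks in.
conf.setMaxSpoutPending(1000);
// Equivalent to setting the raw key:
// conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1000);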

Distributed caching in storm

How do I store temporary data in Apache Storm?
In a Storm topology, a bolt needs to access previously processed data.
E.g.: if the bolt processes variable1 with result 20 at 10:00 AM,
and variable1 is received again as 50 at 10:15 AM, then the result should be 30 (50-20);
later, if variable1 receives 70 at 10:30, the result should be 20 (70-50).
How can I achieve this functionality?
In short, you want to do micro-batching calculations within Storm's running tuples.
First, you need to define/find the key in the tuple set.
Do fields grouping (don't use shuffle grouping) between the bolts using that key; see the sketch below. This guarantees that related tuples for the same key are always sent to the same task of the downstream bolt.
Define a class-level collection (List/Map) to maintain the old values and add the new value to it for the calculation. Don't worry about thread safety: each executor gets its own instance of the bolt.
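A minimal wiring sketch of that fields grouping, assuming a spout named "variable-spout" that emits a "variable" field and a hypothetical DiffBolt (all names are placeholders; packages may be backtype.storm in older Storm versions):
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("variable-spout", new VariableSpout());          // placeholder spout
builder.setBolt("diff-bolt", new DiffBolt(), 4)                   // 4 parallel tasks
       // group by the "variable" field so all updates for the same variable
       // land on the same task, which keeps its own last-known value
       .fieldsGrouping("variable-spout", new Fields("variable"));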
I'm afraid there is no such built-in functionality as of today.
But you can use any kind of distributed cache, like memcached or Redis. Those caching solutions are really easy to use.
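For example, a minimal sketch of a bolt backed by Redis via the Jedis client, assuming Storm 2.x package names and a local Redis instance; the field names and diff logic mirror the question and are purely illustrative:
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import redis.clients.jedis.Jedis;

public class DiffWithRedisBolt extends BaseRichBolt {
    private OutputCollector collector;
    private Jedis jedis;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.jedis = new Jedis("localhost", 6379); // external cache shared by all bolt tasks
    }

    @Override
    public void execute(Tuple tuple) {
        String variable = tuple.getStringByField("variable");
        int current = tuple.getIntegerByField("value");

        String previous = jedis.get(variable);          // last known value, or null
        int diff = previous == null ? current : current - Integer.parseInt(previous);

        jedis.set(variable, Integer.toString(current)); // remember the current raw value
        collector.emit(tuple, new Values(variable, diff));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("variable", "diff"));
    }
}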
There are a couple of approaches to this, but it depends on your system requirements, your team's skills and your infrastructure.
You could use Apache Cassandra to store your events and pass the row's key in the tuple so that the next bolt can retrieve it.
If your data is time-series in nature, then maybe you would like to have a look at OpenTSDB or InfluxDB.
You could of course fall back to something like Software Transactional Memory, but I think that would need a good amount of crafting.
You can use Guava's CacheBuilder to remember your data within your extended BaseRichBolt (put this in the prepare method):
// init your cache.
this.cache = CacheBuilder.newBuilder()
        .maximumSize(maximumCacheSize)
        .expireAfterWrite(expireAfterWrite, TimeUnit.SECONDS)
        .build();
Then, in execute, you can use the cache to see whether you have already seen that key entry or not. From there you can add your business logic:
// if we haven't seen it before, we can emit it.
if (this.cache.getIfPresent(key) == null) {
    cache.put(key, nearlyEmptyList);
    this.collector.emit(input, input.getValues());
}
this.collector.ack(input);
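For completeness, a sketch of the field and imports the snippet above assumes; maximumCacheSize, expireAfterWrite and nearlyEmptyList are the placeholders already used in the snippet:
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// field on the bolt, initialized in prepare() as shown above
private Cache<String, List<Object>> cache;

// placeholder values referenced in the snippet
private final long maximumCacheSize = 10_000;
private final long expireAfterWrite = 300; // seconds
private final List<Object> nearlyEmptyList = Collections.emptyList();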
This question is a good candidate for demonstrating Apache Spark's in-memory computation over micro-batches. However, your use case is trivial to implement in Storm.
Make sure the bolt uses fields grouping. It will consistently hash incoming tuples for the same key to the same bolt task, so we do not lose any tuples.
Maintain a Map<String, Integer> in the bolt's local cache. This map will keep the last known value of a "variable".
class CumulativeDiffBolt extends InstrumentedBolt {
    Map<String, Integer> lastKnownVariableValue;

    @Override
    public void prepare() {
        this.lastKnownVariableValue = new HashMap<>();
        ....
    }

    @Override
    public void instrumentedNextTuple(Tuple tuple, Collector collector) {
        .... // extract variable from tuple
        .... // extract current value from tuple
        Integer lastValue = lastKnownVariableValue.getOrDefault(variable, 0);
        Integer newValue = currValue - lastValue;
        // remember the current raw value (not the diff) so the next difference is correct
        lastKnownVariableValue.put(variable, currValue);
        collector.emit(new Values(variable, newValue));
        ...
    }
}

Storm topology and bolts needed to join multiple bolts output on a single bolt

I'm new to Storm and I'm having trouble developing something that seems simple.
My scenario is as follows:
DRPC listens to a user input, e.g.: "name"
BoltA listens to the output of the DRPC. BoltA receives the name, searches for it in a DB, gets the full strings that contain the name, and emits them one by one
BoltB listens to the output of BoltA. BoltB receives the full string which contains the name, does process B and emits the resulting string
BoltC listens to the output of BoltA. BoltC receives the full string, does process C and emits the resulting string
BoltD listens to the output of BoltB and BoltC, joins the results and emits a final result to be sent back to the user who queried the DRPC
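To make the intended wiring concrete, here is a rough sketch of how I picture the topology being declared (bolt class names, the "search" DRPC function name and the "id" join field are placeholders):
import org.apache.storm.drpc.DRPCSpout;
import org.apache.storm.drpc.ReturnResults;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("drpc-spout", new DRPCSpout("search"));            // receives the user's "name"
builder.setBolt("boltA", new BoltA()).shuffleGrouping("drpc-spout");
builder.setBolt("boltB", new BoltB()).shuffleGrouping("boltA");
builder.setBolt("boltC", new BoltC()).shuffleGrouping("boltA");
// BoltD subscribes to both BoltB and BoltC and groups by the join key
builder.setBolt("boltD", new BoltD())
       .fieldsGrouping("boltB", new Fields("id"))
       .fieldsGrouping("boltC", new Fields("id"));
// sends the joined result back to the DRPC client; the DRPC return info
// has to be passed along the bolt chain for this to work
builder.setBolt("return", new ReturnResults()).shuffleGrouping("boltD");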
[EDITED]
I solved part of the problem and posted all the example code in this pastebin. In this example I'm seeing unexpected behaviour: if I comment out the line following the TODO comment in BoltD, then the second query to the DRPC never returns a result.
Am I defining the topology incorrectly? I thought it would be enough to fieldsGrouping by the criteria I want to join on.
