fieldsGrouping on a particular stream in Storm - apache-storm

I can see that shuffleGrouping is available for a particular stream in Storm, as described here: How would I split a stream in Apache Storm?
builder.setBolt("myBolt1", new MyBolt1()).shuffleGrouping("SpoutWithStreams", "stream1");
But I have a use case where I want fieldsGrouping on a particular stream emitted by a spout.
For example, SpoutWithStreams emits stream1 with random words. I want myBolt1 to subscribe to this stream, but I also want a particular instance of myBolt1 to receive the same words, i.e., I want fieldsGrouping on stream1.
So what I want is something like this:
builder.setBolt("myBolt1", new MyBolt1()).fieldsGrouping("boltWithStreams", "stream1","field");
I don't want to have an extra bolt just for fieldsGrouping. Any other alternatives?

Since the question is not well defined, I will take a guess as to what you mean and try to answer.
I am guessing you want to receive two streams in your bolt, where one stream is a shuffleGrouping from another bolt and the other stream is a fieldsGrouping from a spout.
If this is the case, you can do something like this:
builder.setBolt("myBolt1", new MyBolt1()).shuffleGrouping("boltWithStreams", "stream1").fieldsGrouping("spout", "stream2", new Fields("field"));
and then in your bolt you can distinguish if the tuple belongs to one stream or the other using:
if (tuple.getSourceStreamId().equals("stream1")) {
    // do something
} else if (tuple.getSourceStreamId().equals("stream2")) {
    // do something else
}
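If, on the other hand, you literally want fieldsGrouping on a stream emitted by a spout, note that fieldsGrouping has an overload that takes a stream ID, so no extra bolt is needed. A minimal sketch using the names from the question:
// Tuples on "stream1" with equal values of "field" always reach the same myBolt1 task.
builder.setBolt("myBolt1", new MyBolt1())
       .fieldsGrouping("SpoutWithStreams", "stream1", new Fields("field"));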

Related

Kafka Streams TopologyTestDriver input-output topic

I have a Kafka Streams unit test based on the really great, reliable, and convenient TopologyTestDriver:
try (TopologyTestDriver testDriver = new TopologyTestDriver(builder.build(),
        streamsConfig(Serdes.String().getClass(), SpecificAvroSerde.class))) {
    TestInputTopic<String, Event> inputTopic = testDriver.createInputTopic(inputTopicName,
            Serdes.String().serializer(), eventSerde.serializer());
    TestOutputTopic<String, Frame> outputWindowTopic = testDriver.createOutputTopic(
            outputTopicName, Serdes.String().deserializer(), frameSerde.deserializer());
    ...
}
I'd like to test a slightly more complex setup where an "output" topic is the "input" topic for another part of the topology.
I can define several input and output topics inside the same topology. But as soon as I use the same topic as both an input and an output topic within the same topology, I get the following exception:
org.apache.kafka.streams.errors.TopologyException: Invalid topology: Topic events has already been registered by another source.
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.validateTopicNotAlreadyRegistered(InternalTopologyBuilder.java:578)
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addSource(InternalTopologyBuilder.java:378)
at org.apache.kafka.streams.kstream.internals.graph.StreamSourceNode.writeToTopology(StreamSourceNode.java:94)
at org.apache.kafka.streams.kstream.internals.InternalStreamsBuilder.buildAndOptimizeTopology(InternalStreamsBuilder.java:303)
at org.apache.kafka.streams.StreamsBuilder.build(StreamsBuilder.java:558)
at org.apache.kafka.streams.StreamsBuilder.build(StreamsBuilder.java:547)
It looks like the TopologyTestDriver doesn't provide a way to define input-output topics. Is that right?
Update
To better illustrate what I'm trying to achieve:
builder.stream("input-topic, ...)..to("intermediate-topic",...);
builder.stream("intermediate-topic", ...)..to("output-topic",...);
and I want to be able to verify (assert) the contents of the "intermeidate-topic" in my unit test. Btw. I cannot "reuse" the result of the call ".to()" in building the next topology part, since that method returns void.
But I only have testDriver.createInputTopic() and testDriver.createOutputTopic() and no way of defining something like testDriver.createInputOutputTopic().
Using the same topic as an input and an output topic should work. However, you cannot use the same topic as an input topic multiple times (the stack trace indicates that you are trying to do this).
If you want to use the same input topic twice, you would just add it once, and "fan it out":
KStream stream = builder.stream(...);
stream.map(...); // first usage
stream.filter(...); // second usage
Using the same KStream object twice is basically a "fan out" (or "broadcast") that will send the input data to both operators.
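Given that, asserting the intermediate topic from your update should work by registering it as an output topic in the test driver: records the topology writes to it are fed back to its consumer and should still be readable for assertions. A sketch, assuming String serdes and a props config for brevity:
try (TopologyTestDriver testDriver = new TopologyTestDriver(builder.build(), props)) {
    TestInputTopic<String, String> input = testDriver.createInputTopic(
            "input-topic", Serdes.String().serializer(), Serdes.String().serializer());
    TestOutputTopic<String, String> intermediate = testDriver.createOutputTopic(
            "intermediate-topic", Serdes.String().deserializer(), Serdes.String().deserializer());
    TestOutputTopic<String, String> output = testDriver.createOutputTopic(
            "output-topic", Serdes.String().deserializer(), Serdes.String().deserializer());

    input.pipeInput("key", "value");
    // What the first sub-topology wrote to the intermediate topic ...
    assertEquals("value", intermediate.readValue());
    // ... and what the second sub-topology produced downstream.
    assertEquals("value", output.readValue());
}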

InvalidTopologyException(msg: Component: [x] subscribes from non-existent stream [y])

I'm trying to read data from Kafka and insert it into Cassandra using Storm. I've configured the topology as well, but I'm getting an issue and I have no clue why it is happening.
Here is my submitter piece.
TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("spout", new KafkaSpout(spoutConfig));
topologyBuilder.setBolt("checkingbolt", new CheckingBolt("cassandraBoltStream")).shuffleGrouping("spout");
topologyBuilder.setBolt("cassandrabolt", new CassandraInsertBolt()).shuffleGrouping("checkingbolt");
If I comment out the last line, I don't see any exceptions. With the last line in place, I get the error below:
InvalidTopologyException(msg:Component: [cassandrabolt] subscribes from non-existent stream: [default] of component [checkingbolt])
Can someone please help me figure out what is wrong here?
Here is the declareOutputFields method in CheckingBolt:
public void declareOutputFields(OutputFieldsDeclarer ofd) {
    ofd.declareStream(cassandraBoltStream, new Fields(new String[]{"jsonFields"}));
}
I don't have anything in the declareOutputFields method of CassandraInsertBolt, as that bolt doesn't emit any values.
TIA
The problem here is that you're mixing up stream names and component (i.e. spout/bolt) names. Component names are used for referring to different bolts, while stream names are used to refer to different streams coming out of the same bolt. For example, if you have a bolt named "evenOrOddBolt", it might emit two streams, an "even" stream and an "odd" stream. In many cases though, you only have one stream coming out of a bolt, which is why Storm has some convenience methods that use a default stream name.
When you do .shuffleGrouping("checkingbolt"), you are using one of these convenience methods, effectively saying "I want this bolt to consume the default stream coming out of the checkingbolt". There is an overloaded version of this method you can use if you want to explicitly name the stream, but it's only useful if you have multiple streams coming out of the same bolt.
When you do ofd.declareStream(cassandraBoltStream, new Fields(new String[]{"jsonFields"}));, you are saying the bolt will emit on a stream named "cassandraBoltStream". This is probably not what you want to do; you want to declare that it will emit on the default stream, which you do by using the ofd.declare method instead.
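For instance, a minimal sketch of the corrected declaration, keeping the field name from the question:
public void declareOutputFields(OutputFieldsDeclarer ofd) {
    // Declare the fields on the default stream, so that
    // .shuffleGrouping("checkingbolt") can subscribe to it.
    ofd.declare(new Fields("jsonFields"));
}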
Refer to the documentation for more details.

Build a Kafka Stream that returns the list of distinct ids in a time interval

I have a Kafka stream of event objects:
KStream<String, VehicleEventTO> stream = builder.stream("mytopic", Consumed.with(Serdes.String(), new JsonSerde<>(MyObjectEvent.class)));
Each event has a property idType (a Long). I need to build a stream that returns the distinct idTypes within a time interval (for example, 10 minutes).
Is this possible using the Kafka Streams DSL? I can't find a solution.
Based on your use case, you are looking for a windowed aggregation. The Kafka Streams DSL has TimeWindowedKStream or SessionWindowedKStream, which should be able to solve your problem.
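A minimal sketch of the windowed approach, assuming the VehicleEventTO class from the question and a getIdType() getter: re-key each event by its idType, window by 10 minutes, and count. The windowed keys that appear are exactly the distinct idTypes seen in each interval.
KStream<String, VehicleEventTO> stream = builder.stream("mytopic",
        Consumed.with(Serdes.String(), new JsonSerde<>(VehicleEventTO.class)));

KTable<Windowed<Long>, Long> distinctIdsPerWindow = stream
        // re-key by idType so identical ids land in the same group
        .groupBy((key, event) -> event.getIdType(),
                 Grouped.with(Serdes.Long(), new JsonSerde<>(VehicleEventTO.class)))
        .windowedBy(TimeWindows.of(Duration.ofMinutes(10)))
        .count();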
I don't quite know Kafka Streams' API, but in a general streaming API you'd have a method that buffers messages over time (like buffer, groupedWithin, or something similar) where you can specify a time window (and/or a maximum number of messages).
Then your stream would be something like:
KStream stream = builder.stream("mytopic", Consumed.with(Serdes.String(), new JsonSerde<>(MyObjectEvent.class)))
    .map(record -> record.value().getId())  // assuming you get a stream of records; I don't know the KafkaStreams API
    .groupedWithin(Duration.ofMinutes(10)); // <-- pseudocode, search for the correct method
Then you'd get a stream that contains the ids over time.

Confusion of Storm acker and guaranteed message processing

I am currently learning about Storm's guaranteed message processing and am confused by some concepts in this area.
To guarantee that a message emitted by a spout is fully processed, Storm uses an acker. Each time a spout emits a tuple, the acker assigns it an "ack val" initialized to 0 to store the status of the tuple tree. Each time a downstream bolt emits a new tuple anchored to this one, or acks an "old" tuple, the tuple ID is XORed with the "ack val". The acker only needs to check whether the "ack val" is 0 to know that the tuple has been fully processed. Let's look at the code below:
public class WordReader implements IRichSpout {
    ... ...
    while ((str = reader.readLine()) != null) {
        this.collector.emit(new Values(str), str);
        ... ...
    }
The code piece above is a spout from the word count program in the "Getting Started with Storm" tutorial. In the emit method, the second parameter "str" is the messageId. I am confused by this parameter:
1) As I understand it, each time a tuple (i.e., a message) is emitted, whether in a spout or in a bolt, it should be Storm's responsibility to assign a 64-bit messageId to that message. Is that correct? Or is "str" here just a human-readable alias for the message?
2) Regardless of the answer to 1), "str" here would be the same word in two different messages, because a text file will contain many duplicate words. If that is true, how does Storm differentiate between messages? And what is the meaning of this parameter?
3) In some code, I see spouts use the following pattern to set the message ID in the emit method:
public class RandomIntegerSpout extends BaseRichSpout {
    private long msgId = 0;
    // inside nextTuple():
    collector.emit(new Values(..., ++msgId), msgId);
}
This is much closer to what I think it should be: the message ID is different for every message. But for this code, another confusion arises: what happens to the private field "msgId" across different executors? Since each executor has its own msgId initialized to 0, messages in different executors will be numbered 0, 1, 2, and so on. How does Storm differentiate these messages then?
I am a novice to Storm, so maybe these questions are naive. I hope someone can help me figure them out. Thanks!
About the message ID in general: internally it might be a 64-bit value, but this value is computed as a hash of the msgId object provided in emit() within the spout. So you can hand over any object as the message ID (the probability that two objects hash to the same value is close to zero).
About using str: I think in this example str contains a whole line (not a single word), and it is very unlikely that a document contains the exact same line twice (unless there are empty lines, of which there might be many).
About the counter as message ID: you are absolutely right in your observation -- if multiple spouts run in parallel, this would produce message ID conflicts and would break fault tolerance.
If you want to "fix" the counter approach, each counter should be initialized differently (best, from 1...#SpoutTasks). You can use the taskID for this (which is unique and can be accessed via TopologyContext provided in Spout.open()). Basically, you get all taskIDs for all parallel spout tasks, sort them, and assign each spout task its ordering number. Furthermore, you need to increment by "number of parallel spouts" instead of 1.

How would I split a stream in Apache Storm?

I don't understand how I would split a stream in Apache Storm. For example, I have bolt A that after some computation has somevalue1, somevalue2, and somevalue3. It needs to send somevalue1 to bolt B, somevalue2 to bolt C, and both somevalue1 and somevalue2 to bolt D. How would I do this in Storm? What grouping would I use, and what would my topology look like? Thank you in advance for your help.
You can use different streams if your case needs that. It is not really splitting, but it gives you a lot of flexibility; you could use it for content-based routing from a bolt, for instance.
You declare the stream in the bolt:
@Override
public void declareOutputFields(final OutputFieldsDeclarer outputFieldsDeclarer) {
    outputFieldsDeclarer.declareStream("stream1", new Fields("field1"));
    outputFieldsDeclarer.declareStream("stream2", new Fields("field1"));
}
You emit from the bolt on the chosen stream:
collector.emit("stream1", new Values("field1Value"));
You subscribe to the correct stream when building the topology:
builder.setBolt("myBolt1", new MyBolt1()).shuffleGrouping("boltWithStreams", "stream1");
builder.setBolt("myBolt2", new MyBolt2()).shuffleGrouping("boltWithStreams", "stream2");
You have two options here: stream groupings and "direct grouping". Depending on your requirements, one of them will serve you.
Have a look at the WordCountTopology sample project to see whether that is what you are looking for. Otherwise, "direct grouping" is going to be a better alternative.
But again, picking a grouping strategy depends on your requirements.
