How would I split a stream in Apache Storm? - apache-storm

I don't understand how I would split a stream in Apache Storm. For example, I have bolt A that, after some computation, has somevalue1, somevalue2, and somevalue3. It wants to send somevalue1 to bolt B, somevalue2 to bolt C, and somevalue1, somevalue2 to bolt D. How would I do this in Storm? What grouping would I use and what would my topology look like? Thank you in advance for your help.

You can use different streams if your case needs that. It is not really splitting, but it gives you a lot of flexibility; you could use it for content-based routing from a bolt, for instance:
You declare the stream in the bolt:
@Override
public void declareOutputFields(final OutputFieldsDeclarer outputFieldsDeclarer) {
    outputFieldsDeclarer.declareStream("stream1", new Fields("field1"));
    outputFieldsDeclarer.declareStream("stream2", new Fields("field1"));
}
You emit from the bolt on the chosen stream:
collector.emit("stream1", new Values("field1Value"));
You listen to the correct stream when wiring the topology:
builder.setBolt("myBolt1", new MyBolt1()).shuffleGrouping("boltWithStreams", "stream1");
builder.setBolt("myBolt2", new MyBolt2()).shuffleGrouping("boltWithStreams", "stream2");

You have two options here: stream groupings and "direct grouping". Depending on your requirements, one of them will serve you.
Have a look at the WordCountTopology sample project to see whether that is what you are looking for. Otherwise, "direct grouping" may be a better alternative.
But again, picking a grouping strategy depends on your requirements.
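For completeness, a minimal direct-grouping sketch (task ids, stream and component names are illustrative): the emitter declares the stream as direct, emits with emitDirect to a specific task id, and the consumer subscribes with directGrouping:
// In the emitting bolt: declare a direct stream
declarer.declareStream("directStream", true, new Fields("field1"));

// In execute(): emit to a specific task id, e.g. one of context.getComponentTasks("myBolt")
collector.emitDirect(targetTaskId, "directStream", new Values("field1Value"));

// In the topology: subscribe with a direct grouping
builder.setBolt("myBolt", new MyBolt()).directGrouping("emitterBolt", "directStream");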

Related

Kafka Streams TopologyTestDriver input-output topic

I have a Kafka Streams unit test based on the really great, reliable and convenient TopologyTestDriver:
try (TopologyTestDriver testDriver = new TopologyTestDriver(builder.build(),
        streamsConfig(Serdes.String().getClass(), SpecificAvroSerde.class))) {
    TestInputTopic<String, Event> inputTopic = testDriver.createInputTopic(inputTopicName,
            Serdes.String().serializer(), eventSerde.serializer());
    TestOutputTopic<String, Frame> outputWindowTopic = testDriver.createOutputTopic(
            outputTopicName, Serdes.String().deserializer(), frameSerde.deserializer());
    ...
}
I'd like to test a bit more complex setup where an "output" topic is an "input" topic for another topology.
I can define several input and output topics inside the same topology. But as soon as I use the same topic as both an input and an output topic within the same topology, I get the following exception:
org.apache.kafka.streams.errors.TopologyException: Invalid topology: Topic events has already been registered by another source.
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.validateTopicNotAlreadyRegistered(InternalTopologyBuilder.java:578)
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addSource(InternalTopologyBuilder.java:378)
at org.apache.kafka.streams.kstream.internals.graph.StreamSourceNode.writeToTopology(StreamSourceNode.java:94)
at org.apache.kafka.streams.kstream.internals.InternalStreamsBuilder.buildAndOptimizeTopology(InternalStreamsBuilder.java:303)
at org.apache.kafka.streams.StreamsBuilder.build(StreamsBuilder.java:558)
at org.apache.kafka.streams.StreamsBuilder.build(StreamsBuilder.java:547)
It looks like the TopologyTestDriver doesn't provide a way to define input-output topics, is that right?
Update
To better illustrate what I'm trying to achieve:
builder.stream("input-topic, ...)..to("intermediate-topic",...);
builder.stream("intermediate-topic", ...)..to("output-topic",...);
and I want to be able to verify (assert) the contents of the "intermediate-topic" in my unit test. By the way, I cannot "reuse" the result of the ".to()" call to build the next part of the topology, since that method returns void.
But I only have testDriver.createInputTopic() and testDriver.createOutputTopic() and no way of defining something like testDriver.createInputOutputTopic().
Using the same topic as both an input and an output topic should work. However, you cannot use the same topic as an input topic multiple times (the stack trace indicates that you try to do this).
If you want to use the same input topic twice, you would just add it once, and "fan it out":
KStream stream = builder.stream(...);
stream.map(...); // first usage
stream.filter(...); // second usage
Using the same KStream object twice is basically a "fan out" (or "broadcast") that sends the input data to both operators.
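A minimal sketch of how the intermediate topic could then be checked in the test (topic names match the example above; the String serdes are stand-ins, and it assumes the test driver both captures records written to "intermediate-topic" and pipes them on to the downstream sub-topology, which is worth verifying against your Kafka version):
TestInputTopic<String, String> input = testDriver.createInputTopic(
        "input-topic", Serdes.String().serializer(), Serdes.String().serializer());
TestOutputTopic<String, String> intermediate = testDriver.createOutputTopic(
        "intermediate-topic", Serdes.String().deserializer(), Serdes.String().deserializer());
TestOutputTopic<String, String> output = testDriver.createOutputTopic(
        "output-topic", Serdes.String().deserializer(), Serdes.String().deserializer());

input.pipeInput("key", "value");
// assert on the intermediate topic as well as the final output
assertEquals("value", intermediate.readValue());
assertEquals("value", output.readValue());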

InvalidTopologyException(msg:Component: [x] subscribes from non-existent stream [y]

I'm trying to read data from Kafka and insert it into Cassandra using Storm. I've configured the topology as well; however, I'm getting an issue and I have no clue why it is happening.
Here is my submitter piece.
TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("spout", new KafkaSpout(spoutConfig));
topologyBuilder.setBolt("checkingbolt", new CheckingBolt("cassandraBoltStream")).shuffleGrouping("spout");
topologyBuilder.setBolt("cassandrabolt", new CassandraInsertBolt()).shuffleGrouping("checkingbolt");
Here, if I comment the last line, I don't see any exceptions. With the last line, I'm getting the below error:
InvalidTopologyException(msg:Component: [cassandrabolt] subscribes from non-existent stream: [default] of component [checkingbolt])
Can someone please help me, what is wrong here?
Here is the declareOutputFields method in CheckingBolt:
public void declareOutputFields(OutputFieldsDeclarer ofd) {
    ofd.declareStream(cassandraBoltStream, new Fields(new String[]{"jsonFields"}));
}
I don't have anything in declareOutputFields method for CassandraInsertBolt as that bolt doesn't emit any values.
TIA
The problem here is that you're mixing up stream names and component (i.e. spout/bolt) names. Component names are used for referring to different bolts, while stream names are used to refer to different streams coming out of the same bolt. For example, if you have a bolt named "evenOrOddBolt", it might emit two streams, an "even" stream and an "odd" stream. In many cases, though, you only have one stream coming out of a bolt, which is why Storm has some convenience methods that use a default stream name.
When you do .shuffleGrouping("checkingbolt"), you are using one of these convenience methods, effectively saying "I want this bolt to consume the default stream coming out of the checkingbolt". There is an overloaded version of this method you can use if you want to explicitly name the stream, but it's only useful if you have multiple streams coming out of the same bolt.
When you do ofd.declareStream(cassandraBoltStream, new Fields(new String[]{"jsonFields"}));, you are saying the bolt will emit on a stream named "cassandraBoltStream". This is probably not what you want to do; you want to declare that it will emit on the default stream. You do this by using the ofd.declare method instead.
Refer to the documentation for more details.
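A minimal sketch of that fix, assuming CheckingBolt only needs a single output stream (the emitted json value is a stand-in for whatever the bolt actually produces):
// In CheckingBolt: declare the fields on the default stream
@Override
public void declareOutputFields(OutputFieldsDeclarer ofd) {
    ofd.declare(new Fields("jsonFields"));
}

// In execute(): emit without a stream id, i.e. on the default stream
collector.emit(tuple, new Values(json));

// The existing subscription then resolves against the default stream:
topologyBuilder.setBolt("cassandrabolt", new CassandraInsertBolt()).shuffleGrouping("checkingbolt");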

Kafka Streams API: KStream to KTable

I have a Kafka topic where I send location events (key=user_id, value=user_location). I am able to read and process it as a KStream:
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Location> locations = builder
        .stream("location_topic")
        .map((k, v) -> {
            // some processing here, omitted for clarity
            Location location = new Location(lat, lon);
            return new KeyValue<>(k, location);
        });
That works well, but I'd like to have a KTable with the last known position of each user. How could I do it?
I am able to do it writing to and reading from an intermediate topic:
// write to intermediate topic
locations.to(Serdes.String(), new LocationSerde(), "location_topic_aux");
// build KTable from intermediate topic
KTable<String, Location> table = builder.table("location_topic_aux", "store");
Is there a simple way to obtain a KTable from a KStream? This is my first app using Kafka Streams, so I'm probably missing something obvious.
Update:
In Kafka 2.5, a new method KStream#toTable() will be added that provides a convenient way to transform a KStream into a KTable. For details see: https://cwiki.apache.org/confluence/display/KAFKA/KIP-523%3A+Add+KStream%23toTable+to+the+Streams+DSL
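With that API, a minimal sketch for the example above would be (assuming Kafka Streams 2.5+):
// An optional Materialized parameter can be passed to name and configure the backing store
KTable<String, Location> table = locations.toTable();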
Original Answer:
There is no straightforward way to do this at the moment. Your approach is absolutely valid, as discussed in the Confluent FAQ: http://docs.confluent.io/current/streams/faq.html#how-can-i-convert-a-kstream-to-a-ktable-without-an-aggregation-step
This is the simplest approach with regard to the code. However, it has the disadvantages that (a) you need to manage an additional topic and that (b) it results in additional network traffic because data is written to and re-read from Kafka.
There is one alternative, using a "dummy-reduce":
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Long> stream = ...; // some computation that creates the derived KStream
KTable<String, Long> table = stream.groupByKey().reduce(
    new Reducer<Long>() {
        @Override
        public Long apply(Long aggValue, Long newValue) {
            return newValue;
        }
    },
    "dummy-aggregation-store");
This approach is somewhat more complex with regard to the code compared to option 1 but has the advantage that (a) no manual topic management is required and (b) re-reading the data from Kafka is not necessary.
Overall, you need to decide for yourself which approach you like better:
In option 2, Kafka Streams will create an internal changelog topic to back up the KTable for fault tolerance. Thus, both approaches require some additional storage in Kafka and result in additional network traffic. Overall, it’s a trade-off between slightly more complex code in option 2 versus manual topic management in option 1.

fieldsGrouping on a particular stream in Storm

I can see we've shuffleGrouping available for a particular stream in Storm as described here: How would I split a stream in Apache Storm?
builder.setBolt("myBolt1", new MyBolt1()).shuffleGrouping("SpoutWithStreams", "stream1");
But I have a use case where I want fieldsGrouping on a particular stream emitted by a spout.
For example, SpoutWithStreams emits stream1 with random words; I want myBolt1 to subscribe to this stream, but I also want a particular instance of myBolt1 to receive the same words, i.e. I want fieldsGrouping on stream1.
So what I want is something like this:
builder.setBolt("myBolt1", new MyBolt1()).fieldsGrouping("boltWithStreams", "stream1","field");
I don't want to have an extra bolt just for fieldsGrouping. Any other alternatives?
Since the question is not well defined, I will take a guess at what you mean and try to answer.
I am guessing you want to receive two streams in your bolt, where one stream is a shuffleGrouping from another bolt and the other stream is a fieldsGrouping from a spout.
If this is the case, you can do something like that:
builder.setBolt("myBolt1", new MyBolt1()).shuffleGrouping("boltWithStreams", "stream1").fieldsGrouping("spout", "stream2", new Fields("field"));
and then in your bolt you can distinguish if the tuple belongs to one stream or the other using:
if (tuple.getSourceStreamId().equals("stream1")) {
    // do something
} else if (tuple.getSourceStreamId().equals("stream2")) {
    // do something else
}

Distributed caching in storm

How do I store temporary data in Apache Storm?
In a Storm topology, a bolt needs to access previously processed data.
E.g.: if the bolt processes variable1 with value 20 at 10:00 AM,
and variable1 is received again as 50 at 10:15 AM, then the result should be 30 (50-20);
later, if variable1 receives 70 at 10:30, the result should be 20 (70-50).
How do I achieve this functionality?
In short, you want to do micro-batching calculations within Storm's running tuples.
First you need to define/find the key in the tuple set.
Use fields grouping (don't use shuffle grouping) between bolts on that key. This guarantees that related tuples with the same key are always sent to the same task of the downstream bolt.
Define a class-level collection (List/Map) to maintain the old values and add new values to it for the calculation; don't worry about thread safety, each executor gets its own instance of the bolt, so the collection is not shared between threads.
I'm afraid there is no such built-in functionality as of today.
But you can use any kind of distributed cache, like memcached or Redis. Those caching solutions are really easy to use.
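For example, a minimal sketch with Redis via the Jedis client (host, port, key prefix and the string encoding of values are assumptions), applied to the diff use case:
// In prepare(): one client per executor (a Jedis instance is not thread safe)
this.jedis = new Jedis("localhost", 6379);

// In execute(): read the previous value, compute the diff, store the current value
String previous = jedis.get("var:" + variable);
int last = (previous == null) ? 0 : Integer.parseInt(previous);
int diff = currentValue - last;
jedis.set("var:" + variable, String.valueOf(currentValue));
collector.emit(new Values(variable, diff));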
There are a couple of approaches to do that, but it depends on your system requirements, your team's skills and your infrastructure.
You could use Apache Cassandra to store your events and pass the row's key in the tuple so the next bolt can retrieve it.
If your data is time series in nature, then maybe you would like to have a look at OpenTSDB or InfluxDB.
You could of course fall back to something like software transactional memory, but I think that would need a good amount of crafting.
You can use Guava's CacheBuilder to remember your data within your extended BaseRichBolt (put this in the prepare method):
// init your cache.
this.cache = CacheBuilder.newBuilder()
        .maximumSize(maximumCacheSize)
        .expireAfterWrite(expireAfterWrite, TimeUnit.SECONDS)
        .build();
Then in execute, you can use the cache to see whether you have already seen that key or not; from there you can add your business logic:
// if we haven't seen it before, we can emit it.
if (this.cache.getIfPresent(key) == null) {
    cache.put(key, nearlyEmptyList);
    this.collector.emit(input, input.getValues());
}
this.collector.ack(input);
This question is a good candidate for demonstrating Apache Spark's in-memory computation over micro-batches. However, your use case is trivial to implement in Storm.
Make sure the bolt uses fields grouping. It will consistently hash tuples with the same key to the same bolt task, so we do not lose out on any tuple.
Maintain a Map<String, Integer> in the bolt's local cache. This map will keep the last known value of a "variable".
class CumulativeDiffBolt extends InstrumentedBolt {
    Map<String, Integer> lastKnownVariableValue;

    @Override
    public void prepare() {
        this.lastKnownVariableValue = new HashMap<>();
        ....
    }

    @Override
    public void instrumentedNextTuple(Tuple tuple, Collector collector) {
        .... extract variable from tuple
        .... extract current value from tuple
        Integer lastValue = lastKnownVariableValue.getOrDefault(variable, 0);
        Integer newValue = currValue - lastValue;
        // store the current value (not the diff) so the next tuple diffs against it
        lastKnownVariableValue.put(variable, currValue);
        emit(new Values(variable, newValue));
        ...
    }
}
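The wiring that goes with this sketch (spout, component and field names are illustrative) uses fields grouping on the variable name, so all readings of the same variable land on the same task:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("readings-spout", new ReadingsSpout());
// group by the variable name so each task keeps a consistent last-known value
builder.setBolt("diff-bolt", new CumulativeDiffBolt())
       .fieldsGrouping("readings-spout", new Fields("variable"));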
