I'm trying to read data from Kafka and insert it into Cassandra using Storm. I've configured the topology as well, but I'm getting an issue and I don't have a clue why it is happening.
Here is my submitter piece.
TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("spout", new KafkaSpout(spoutConfig));
topologyBuilder.setBolt("checkingbolt", new CheckingBolt("cassandraBoltStream")).shuffleGrouping("spout");
topologyBuilder.setBolt("cassandrabolt", new CassandraInsertBolt()).shuffleGrouping("checkingbolt");
Here, if I comment out the last line, I don't see any exceptions. With the last line, I'm getting the following error:
InvalidTopologyException(msg:Component: [cassandrabolt] subscribes from non-existent stream: [default] of component [checkingbolt])
Can someone please help me figure out what is wrong here?
Here is the declareOutputFields method in CheckingBolt:
public void declareOutputFields(OutputFieldsDeclarer ofd) {
    ofd.declareStream(cassandraBoltStream, new Fields(new String[]{"jsonFields"}));
}
I don't have anything in declareOutputFields method for CassandraInsertBolt as that bolt doesn't emit any values.
TIA
The problem here is that you're mixing up stream names and component (i.e. spout/bolt) names. Component names are used to refer to different bolts, while stream names are used to refer to different streams coming out of the same bolt. For example, a bolt named "evenOrOddBolt" might emit two streams, an "even" stream and an "odd" stream. In many cases, though, you only have one stream coming out of a bolt, which is why Storm has some convenience methods that use a default stream name.
When you do .shuffleGrouping("checkingbolt"), you are using one of these convenience methods, effectively saying "I want this bolt to consume the default stream coming out of the checkingbolt". There is an overloaded version of this method you can use if you want to explicitly name the stream, but it's only useful if you have multiple streams coming out of the same bolt.
When you do ofd.declareStream(cassandraBoltStream, new Fields(new String[]{"jsonFields"}));, you are saying the bolt will emit on a stream named "cassandraBoltStream". That is probably not what you want; you want to declare that it will emit on the default stream. You do this by using the ofd.declare method instead, as sketched below.
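As an illustration, a minimal sketch of the corrected method, assuming the field stays the same as in the question:
public void declareOutputFields(OutputFieldsDeclarer ofd) {
    // Declare "jsonFields" on the default stream so that a plain
    // shuffleGrouping("checkingbolt") downstream can find it.
    ofd.declare(new Fields("jsonFields"));
}
Alternatively, you could keep the named stream and subscribe to it explicitly in the topology with .shuffleGrouping("checkingbolt", cassandraBoltStream).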
Refer to the documentation for more details.
I have a Kafka Streams unit test based on the really great, reliable and convenient TopologyTestDriver:
try (TopologyTestDriver testDriver = new TopologyTestDriver(builder.build(),
streamsConfig(Serdes.String().getClass(), SpecificAvroSerde.class))) {
TestInputTopic<String, Event> inputTopic = testDriver.createInputTopic(inputTopicName,
Serdes.String().serializer(), eventSerde.serializer());
TestOutputTopic<String, Frame> outputWindowTopic = testDriver.createOutputTopic(
outputTopicName, Serdes.String().deserializer(), frameSerde.deserializer());
...
}
I'd like to test a bit more complex setup where an "output" topic is an "input" topic for another topology.
I can define several input and output topics within the same topology. But as soon as I use the same topic as both an input and an output topic within the same topology, I get the following exception:
org.apache.kafka.streams.errors.TopologyException: Invalid topology: Topic events has already been registered by another source.
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.validateTopicNotAlreadyRegistered(InternalTopologyBuilder.java:578)
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addSource(InternalTopologyBuilder.java:378)
at org.apache.kafka.streams.kstream.internals.graph.StreamSourceNode.writeToTopology(StreamSourceNode.java:94)
at org.apache.kafka.streams.kstream.internals.InternalStreamsBuilder.buildAndOptimizeTopology(InternalStreamsBuilder.java:303)
at org.apache.kafka.streams.StreamsBuilder.build(StreamsBuilder.java:558)
at org.apache.kafka.streams.StreamsBuilder.build(StreamsBuilder.java:547)
It looks like the TopologyTestDriver doesn't provide a way to define input-output topics, is that right?
Update
To better illustrate what I'm trying to achieve:
builder.stream("input-topic, ...)..to("intermediate-topic",...);
builder.stream("intermediate-topic", ...)..to("output-topic",...);
and I want to be able to verify (assert) the contents of the "intermediate-topic" in my unit test. By the way, I cannot "reuse" the result of the ".to()" call when building the next part of the topology, since that method returns void.
But I only have testDriver.createInputTopic() and testDriver.createOutputTopic() and no way of defining something like testDriver.createInputOutputTopic().
Using the same topic as an input and an output topic should work. However, you cannot use the same topic as an input topic multiple times (the stack trace indicates that you are trying to do this).
If you want to use the same input topic twice, you would just add it once, and "fan it out":
KStream stream = builder.stream(...);
stream.map(...); // first usage
stream.filter(...); // second usage
Using the same KStream object twice is basically a "fan out" (or "broadcast") that will send the input data to both operators.
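As a sketch of how the intermediate topic from the update could then be asserted (assuming a recent Kafka Streams version, where TopologyTestDriver both captures every record written to a topic and feeds it into sub-topologies that consume that topic; String serdes are used here only to keep the example self-contained, while the question uses Avro):
TestInputTopic<String, String> input = testDriver.createInputTopic(
        "input-topic", Serdes.String().serializer(), Serdes.String().serializer());
TestOutputTopic<String, String> intermediate = testDriver.createOutputTopic(
        "intermediate-topic", Serdes.String().deserializer(), Serdes.String().deserializer());
TestOutputTopic<String, String> output = testDriver.createOutputTopic(
        "output-topic", Serdes.String().deserializer(), Serdes.String().deserializer());
input.pipeInput("key", "some event");
// Records written to "intermediate-topic" by the first sub-topology can be read here...
assertFalse(intermediate.isEmpty());
// ...while the second sub-topology still receives them and produces to "output-topic".
assertFalse(output.isEmpty());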
I am using Spring Cloud Stream to consume messages. I am using something like:
@StreamListener(target = "CONSTANT_CHANNEL_NAME")
public void readingData(String input){
    System.out.println("consumed info is " + input);
}
But I want the channel name to vary per environment and be picked up from a property file, whereas according to Spring the channel name should be a constant.
Is there any workaround for this problem?
Edit 1:
Here is the actual situation:
I am using multiple queues and DLQ queues, and the binding is done with RabbitMQ.
I want to change my channel names and queue names per environment.
I want to do all of this on the same AMQP host.
My Sink code:
public interface ProcessorSink extends Sink {
    @Input(CONSTANT_CHANNEL_NAME)
    SubscribableChannel channel();
    @Input(CONSTANT_CHANNEL_NAME_1)
    SubscribableChannel channel2();
    @Input(CONSTANT_CHANNEL_NAME_2)
    SubscribableChannel channel3();
}
You can pick the target value from the property file as below:
@StreamListener(target = "${streamListener.target}")
public void readingData(String input){
    System.out.println("consumed info is " + input);
}
application.yml
streamListener:
  target: CONSTANT_CHANNEL_NAME
While there are many ways to do that, I wonder why you even care. In fact, if anything, you do want to make the channel name constant so it is always the same, and then, through configuration properties, map it to different remote destinations (e.g. Kafka, Rabbit etc.). For example, spring.cloud.stream.bindings.input.destination=myKafkaTopic states that the channel named input will be mapped to (bridged with) the Kafka topic named myKafkaTopic.
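To sketch that idea (the topic names and the dev/prod profile split below are made up for illustration, they are not from the question): keep the channel name fixed in code and vary only the bound destination per environment, e.g. in application-dev.yml versus application-prod.yml:
# application-dev.yml (hypothetical)
spring:
  cloud:
    stream:
      bindings:
        input:
          destination: orders-dev
# application-prod.yml (hypothetical)
spring:
  cloud:
    stream:
      bindings:
        input:
          destination: orders-prod
The code always refers to the input channel; the configuration alone decides which queue or topic it is bridged to.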
In fact, to further prove my point, we completely abstracted away channels altogether for users who use the spring-cloud-function programming model, but that is a whole different discussion.
My point is that I believe you are actually creating a problem rather than solving one, since by externalising the channel name you open the possibility that, due to misconfiguration, the actual bound channel and the channel you mention in your properties will not be the same.
For example, if I want to build a WebSocket server, what should I put in the initChannel method? I found the WebSocket example in Netty's source code, in which I need to do the following:
public void initChannel(final SocketChannel ch) throws Exception {
    ch.pipeline().addLast(
            new HttpRequestDecoder(),
            new HttpObjectAggregator(65536),
            new HttpResponseEncoder(),
            new WebSocketServerProtocolHandler("/websocket"),
            new CustomTextFrameHandler());
}
But I have no idea why I need to put the objects in such an order. In the description of HttpObjectAggregator I found something like this:
Be aware that you need to have the {@link HttpResponseEncoder} or {@link HttpRequestEncoder} before the {@link HttpObjectAggregator} in the {@link ChannelPipeline}.
But in the above code HttpObjectAggregator object is before the HttpResponseEncoder object. I am confused. How do I know I am putting those objects in a correct order?
TL;DR: You should put an HttpServerCodec into your init method, to keep things simple. Put it before the HttpObjectAggregator if you choose to use the aggregator.
I'm pretty sure the advice about putting encoders before the HttpObjectAggregator is a typo. The encoders are outbound only handlers, while the HttpObjectAggregator is an inbound only handler, which means an event will never interact with both of them; so it makes no sense that their relative order would matter.
The caveat here is that the HttpObjectAggregator will write HttpObjects out (mainly a 100 CONTINUE) in certain cases, and for that HttpObject to be converted to a byte[] that can be sent on the wire it needs an HttpResponseEncoder before it in the pipeline. On the outbound path the pipeline is traversed in reverse, so an encoder before the aggregator will receive a message sent by the aggregator, but an encoder after it won't. The sample code you posted has a bug that will only be hit if a 100 CONTINUE needs to be sent. It looks like that bug was fixed by replacing the encoder/decoder pair with an HttpServerCodec before the aggregator.
A decoder like HttpRequestDecoder or HttpResponseDecoder is an inbound-only handler, and it needs to be before the HttpObjectAggregator for the aggregator to function properly. That's because those two decoders transform a byte[] into an HttpObject, while the HttpObjectAggregator is really a message-to-message decoder that transforms HttpObjects into a FullHttpMessage.
Netty introduced the HttpServerCodec, which combines HttpRequestDecoder and HttpResponseEncoder in one class. If you put that before your aggregator you'll save yourself a line of code and make sure you have the proper encoder and decoder for your server.
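Applied to the snippet from the question, the pipeline would then look roughly like this (a sketch only; CustomTextFrameHandler is the questioner's own class):
public void initChannel(final SocketChannel ch) throws Exception {
    ch.pipeline().addLast(
            new HttpServerCodec(),                            // HttpRequestDecoder + HttpResponseEncoder in one
            new HttpObjectAggregator(65536),                  // aggregates HttpObjects into FullHttpMessages
            new WebSocketServerProtocolHandler("/websocket"), // handles the WebSocket handshake and control frames
            new CustomTextFrameHandler());                    // the application handler from the question
}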
Good reference on understanding how a message works in the pipeline for inbound vs outbound handlers: https://netty.io/4.0/api/io/netty/channel/ChannelPipeline.html
Issue where this wording was first introduced (notice no mention of encoding, only decoding): https://github.com/netty/netty/issues/2401
Issue where this wording is pointed out as a typo/bug: https://github.com/netty/netty/issues/2471
I can see we have shuffleGrouping available for a particular stream in Storm, as described here: How would I split a stream in Apache Storm?
builder.setBolt("myBolt1", new MyBolt1()).shuffleGrouping("SpoutWithStreams", "stream1");
But I have a use case where I want to have fieldsGrouping on a particular stream emitted by a spout.
For example, SpoutWithStreams emits stream1 with random words. I want myBolt1 to subscribe to this stream, but I also want a particular instance of myBolt1 to always receive the same words, i.e. I want fieldsGrouping on stream1.
So what I want is something like this:
builder.setBolt("myBolt1", new MyBolt1()).fieldsGrouping("boltWithStreams", "stream1","field");
I don't want to have an extra bolt just for fieldsGrouping. Any other alternatives?
Since the question is not well defined, I will take a guess as to what you mean and try to answer.
I am guessing you want to receive two streams in your bolt, where one stream is a shuffleGrouping from another bolt and the other stream is a fieldsGrouping from a spout.
If this is the case, you can do something like that:
builder.setBolt("myBolt1", new MyBolt1()).shuffleGrouping("boltWithStreams", "stream1").fieldsGrouping("spout", "stream2", new Fields("field"));
and then in your bolt you can distinguish if the tuple belongs to one stream or the other using:
if (tuple.getSourceStreamId().equals("stream1")) {
    // do something
} else if (tuple.getSourceStreamId().equals("stream2")) {
    // do something else
}
I don't understand how I would split a stream in Apache Storm. For example, I have bolt A that, after some computation, has somevalue1, somevalue2, and somevalue3. It wants to send somevalue1 to bolt B, somevalue2 to bolt C, and somevalue1 and somevalue2 to bolt D. How would I do this in Storm? What grouping would I use and what would my topology look like? Thank you in advance for your help.
You can use different streams if your case needs that. It is not really splitting, but it gives you a lot of flexibility; you could use it for content-based routing from a bolt, for instance.
You declare the stream in the bolt:
@Override
public void declareOutputFields(final OutputFieldsDeclarer outputFieldsDeclarer) {
    outputFieldsDeclarer.declareStream("stream1", new Fields("field1"));
    outputFieldsDeclarer.declareStream("stream2", new Fields("field1"));
}
You emit from the bolt on the chosen stream:
collector.emit("stream1", new Values("field1Value"));
You listen to the correct stream through the topology:
builder.setBolt("myBolt1", new MyBolt1()).shuffleGrouping("boltWithStreams", "stream1");
builder.setBolt("myBolt2", new MyBolt2()).shuffleGrouping("boltWithStreams", "stream2");
You have two options here: stream groupings and "direct grouping". Depending on your requirements, one of them is going to serve your needs.
Have a look at WordCountTopology sample project to see whether that is what you are looking for. Otherwise, "Direct Grouping" is going to be a better alternative.
But again, picking a grouping strategy depends on your requirements.
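As a rough illustration of choosing a grouping, this is approximately the wiring from storm-starter's WordCountTopology mentioned above (class names come from that sample project):
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
// shuffleGrouping: sentences are distributed randomly across the splitter tasks.
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
// fieldsGrouping: every occurrence of the same word goes to the same counter task,
// which is what makes a per-word running count possible.
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));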