Kafka Streams TopologyTestDriver input-output topic


I have a Kafka Streams unit test based on the really great, reliable and convenient TopologyTestDriver:
try (TopologyTestDriver testDriver = new TopologyTestDriver(builder.build(),
        streamsConfig(Serdes.String().getClass(), SpecificAvroSerde.class))) {
    TestInputTopic<String, Event> inputTopic = testDriver.createInputTopic(inputTopicName,
            Serdes.String().serializer(), eventSerde.serializer());
    TestOutputTopic<String, Frame> outputWindowTopic = testDriver.createOutputTopic(
            outputTopicName, Serdes.String().deserializer(), frameSerde.deserializer());
    ...
}
I'd like to test a slightly more complex setup where an "output" topic is also an "input" topic for another topology.
I can define several input and output topics inside the same topology. But as soon as I use the same topic as both an input and an output topic within the same topology, I get the following exception:
org.apache.kafka.streams.errors.TopologyException: Invalid topology: Topic events has already been registered by another source.
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.validateTopicNotAlreadyRegistered(InternalTopologyBuilder.java:578)
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addSource(InternalTopologyBuilder.java:378)
at org.apache.kafka.streams.kstream.internals.graph.StreamSourceNode.writeToTopology(StreamSourceNode.java:94)
at org.apache.kafka.streams.kstream.internals.InternalStreamsBuilder.buildAndOptimizeTopology(InternalStreamsBuilder.java:303)
at org.apache.kafka.streams.StreamsBuilder.build(StreamsBuilder.java:558)
at org.apache.kafka.streams.StreamsBuilder.build(StreamsBuilder.java:547)
It looks like the TopologyTestDriver doesn't provide a way to define input-output topics. Is that right?
Update
To better illustrate what I'm trying to achieve:
builder.stream("input-topic", ...)..to("intermediate-topic",...);
builder.stream("intermediate-topic", ...)..to("output-topic",...);
and I want to be able to verify (assert) the contents of the "intermediate-topic" in my unit test. By the way, I cannot "reuse" the result of the call to ".to()" when building the next part of the topology, since that method returns void.
But I only have testDriver.createInputTopic() and testDriver.createOutputTopic() and no way of defining something like testDriver.createInputOutputTopic().

Using the same topic as both an input and an output topic should work. However, you cannot use the same topic as an input topic multiple times (the stack trace indicates that you are trying to do this).
If you want to use the same input topic twice, you just add it once and "fan it out":
KStream stream = builder.stream(...);
stream.map(...); // first usage
stream.filter(...); // second usage
Using the same KStream object twice is basically a "fan out" (or "broadcast") that sends the input data to both operators.

Related

Spring Integration can’t use multiple Outbound Channel Adapters

I want to write to a channel adapter only if the write to the previous channel adapter succeeded. I'm trying to do this with:
@Bean
public IntegrationFlow buildFlow() {
    return IntegrationFlows.from(someChannelAdapter)
            .handle(outboundChannelAdapter1)
            .handle(outboundChannelAdapter2)
            .get();
}
But I'm getting the following exception: The 'currentComponent' (…ReactiveMessageHandlerAdapter) is a one-way 'MessageHandler' and it isn't appropriate to configure 'outputChannel'. This is the end of the integration flow.
How can I achieve this?

If your handler implementation is one-way (fire-and-forget), then indeed there is no justification to continue the flow. The flow can continue only if the current handler is reply-producing, so that there is something from which to build a message to send to the next channel.
In your case .handle(outboundChannelAdapter1) is just void, so the next .handle(outboundChannelAdapter2) has nothing with which to continue the flow. So the framework gives you a hint that such a configuration is wrong. It is called a flow for a reason: the result of the current endpoint becomes the input for the next one. If there is no result, there is no continuation. How else could it work, in your opinion?
The point is that there needs to be something to write to your channel adapter. One solution is a PublishSubscribeChannel, which distributes the same input message to all its subscribers. If that fits your expectations, take a look at its support in the Java DSL: https://docs.spring.io/spring-integration/docs/current/reference/html/dsl.html#java-dsl-subflows.
Another way is the RecipientListRouter pattern: https://docs.spring.io/spring-integration/docs/current/reference/html/message-routing.html#router-implementations-recipientlistrouter.
You may achieve the same with a WireTap as well, but that depends on the business logic of your solution: https://docs.spring.io/spring-integration/docs/current/reference/html/core.html#channel-wiretap.
But anyway: you need to understand that the second handler can be called only if there is an input message for its channel. In all the cases I showed, it is exactly the same message you send to the first handler. If your expectations are different, please elaborate on what kind of message you'd like the second handler to receive if the first does not return anything.
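To make the publish-subscribe option a bit more concrete, here is a minimal plain-Java sketch of the delivery semantics. It is not Spring Integration code: SequentialPubSub is a hypothetical class invented for illustration. It assumes subscribers are called one after another on the sender's thread, so an exception thrown by the first subscriber prevents the second from being called. Whether a real PublishSubscribeChannel behaves this way depends on how it is configured (for example, whether an executor is set).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical model of a publish-subscribe channel; not a Spring class.
final class SequentialPubSub<T> {
    private final List<Consumer<T>> subscribers = new ArrayList<>();

    void subscribe(Consumer<T> subscriber) {
        subscribers.add(subscriber);
    }

    // Delivers the same message to every subscriber in subscription order.
    // An exception from one subscriber propagates and stops further delivery.
    void send(T message) {
        for (Consumer<T> subscriber : subscribers) {
            subscriber.accept(message);
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        SequentialPubSub<String> channel = new SequentialPubSub<>();
        channel.subscribe(m -> log.add("adapter1 wrote " + m));
        channel.subscribe(m -> log.add("adapter2 wrote " + m));
        channel.send("payload");
        System.out.println(log);
    }
}
```

Both subscribers receive the same message, which matches the point above: the second handler gets exactly the message that was sent to the first.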

InvalidTopologyException(msg:Component: [x] subscribes from non-existent stream [y]

I'm trying to read data from Kafka and insert it into Cassandra using Storm. I've configured the topology, but I'm getting an error and I have no clue why it is happening.
Here is my submitter piece.
TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("spout", new KafkaSpout(spoutConfig));
topologyBuilder.setBolt("checkingbolt", new CheckingBolt("cassandraBoltStream")).shuffleGrouping("spout");
topologyBuilder.setBolt("cassandrabolt", new CassandraInsertBolt()).shuffleGrouping("checkingbolt");
If I comment out the last line, I don't see any exceptions. With the last line, I get the error below:
InvalidTopologyException(msg:Component: [cassandrabolt] subscribes from non-existent stream: [default] of component [checkingbolt])
Can someone please help me, what is wrong here?
Here is the outputFieldDeclarer in CheckingBolt
public void declareOutputFields(OutputFieldsDeclarer ofd) {
ofd.declareStream(cassandraBoltStream, new Fields(new String[]{"jsonFields"}));
}
I don't have anything in declareOutputFields method for CassandraInsertBolt as that bolt doesn't emit any values.
TIA

The problem here is that you're mixing up stream names and component (i.e. spout/bolt) names. Component names are used to refer to different bolts, while stream names are used to refer to different streams coming out of the same bolt. For example, if you have a bolt named "evenOrOddBolt", it might emit two streams, an "even" stream and an "odd" stream. In many cases though, you only have one stream coming out of a bolt, which is why Storm has some convenience methods for using a default stream name.
When you do .shuffleGrouping("checkingbolt"), you are using one of these convenience methods, effectively saying "I want this bolt to consume the default stream coming out of the checkingbolt". There is an overloaded version of this method you can use if you want to explicitly name the stream, but it's only useful if you have multiple streams coming out of the same bolt.
When you do ofd.declareStream(cassandraBoltStream, new Fields(new String[]{"jsonFields"}));, you are saying the bolt will emit on a stream named "cassandraBoltStream". This is probably not what you want. Instead, you want to declare that it will emit on the default stream, which you do by using the ofd.declare method.
Refer to the documentation for more details.

Build a Kafka Stream that returns the list of distinct ids within a time interval

I have a Kafka stream of object events:
KStream<String, MyObjectEvent> stream = builder.stream("mytopic", Consumed.with(Serdes.String(), new JsonSerde<>(MyObjectEvent.class)));
Each MyObjectEvent has a property idType (Long). I need to build a stream that returns the distinct idTypes within a time interval (for example, 10 minutes).
Is this possible using the Kafka Streams DSL? I can't find a solution.

Based on your use case, you are looking for a windowed aggregation. The Kafka Streams DSL has TimeWindowedKStream and SessionWindowedKStream, which should be able to solve your problem.
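As a rough illustration of what such a windowed aggregation computes, here is a plain-Java sketch of a 10-minute tumbling window that collects the distinct idType values per window. This is not Kafka Streams code: DistinctIdsPerWindow and Event are hypothetical names, and the window alignment (window start = timestamp rounded down to a multiple of the window size) is an assumption of the sketch. In the DSL, the equivalent would typically be built from groupBy, windowedBy and aggregate.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class DistinctIdsPerWindow {

    // Hypothetical event: a timestamp in epoch milliseconds plus the idType from the question.
    static final class Event {
        final long timestampMs;
        final long idType;

        Event(long timestampMs, long idType) {
            this.timestampMs = timestampMs;
            this.idType = idType;
        }
    }

    // Groups events into fixed, non-overlapping windows keyed by window start
    // and keeps the distinct idTypes seen in each window.
    static Map<Long, Set<Long>> distinctIds(List<Event> events, long windowSizeMs) {
        Map<Long, Set<Long>> idsByWindowStart = new TreeMap<>();
        for (Event event : events) {
            long windowStart = event.timestampMs - Math.floorMod(event.timestampMs, windowSizeMs);
            idsByWindowStart.computeIfAbsent(windowStart, k -> new TreeSet<>()).add(event.idType);
        }
        return idsByWindowStart;
    }

    public static void main(String[] args) {
        long tenMinutesMs = 10 * 60 * 1000L;
        List<Event> events = List.of(
                new Event(0L, 7L),
                new Event(60_000L, 7L),              // duplicate id in the first window
                new Event(120_000L, 9L),
                new Event(tenMinutesMs + 1, 7L));    // falls into the second window
        System.out.println(distinctIds(events, tenMinutesMs));
        // prints {0=[7, 9], 600000=[7]}
    }
}
```

A set per window is the simplest way to get distinctness; in a real Kafka Streams aggregation the set would also need a serde so it can be stored in the state store.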

I don't really know the Kafka Streams API, but in a general streaming API
you'd have a method that buffers messages over time (like buffer, groupedWithin, or something similar) where you can specify a time span (and/or a maximum number of messages).
Then your stream would be something like:
KStream stream = builder.stream("mytopic", Consumed.with(Serdes.String(), new JsonSerde<>(MyObjectEvent.class)))
.map(record -> record.value().getId()) // assuming you get a stream of records, I don't know the KafkaStreams api
.groupedWithin(Duration.ofMinutes(10)) // <-- pseudocode, search for correct method
Then you'd get a stream that contains the ids over time.

With Akka Streams, how do I know when a source has completed?

I have an Alpakka Elasticsearch Sink that I'm keeping around between requests. When I get a request, I create a Source from an HTTP request and turn that into a Source of Elasticsearch WriteMessages, then run that with mySource.runWith(theElasticseachSink).
How do I get notified when the source has completed? Nothing useful seems to be materialized.
Will completion of the source be passed to the sink, meaning I have to create a new one each time?
If yes to the above, would decoupling them somehow with Flow.fromSourceAndSink help?
My goal is to know when the HTTP download has completed (including the vias it goes through) and to be able to reuse the sink.

You can pass around the individual parts of a flow as you wish; you can even pass around the whole executable graph (these are immutable). The run() call materializes the flow, but does not change your graph or its parts.
1) Since you want to know when the HTTP download has passed through the flow, why not use the full graph's Future[Done]? Assuming your call to Elasticsearch is asynchronous, this should be equivalent, since your sink just fires the call and does not wait.
You could also use Source.queue (https://doc.akka.io/docs/akka/2.5/stream/operators/Source/queue.html) and just add your messages to the queue, which then reuses the defined graph so you can add new messages whenever processing is needed. This one also materializes a SourceQueueWithComplete, allowing you to stop the stream.
Apart from this, reuse the sink wherever needed, without having to wait for another stream that is using it.
2) As described above: no, you do not need to instantiate a sink multiple times.
Best Regards,
Andi

It turns out that Alpakka's Elasticsearch library also supports flow shapes, so I can have my source go via that and run it via any sink that materializes a future. Sink.foreach works fine here for testing purposes, for example, as in https://github.com/danellis/akka-es-test.
Flow fromFunction { product: Product =>
WriteMessage.createUpsertMessage(product.id, product.attributes)
} via ElasticsearchFlow.create[Map[String, String]](index, "_doc")
to define es.flow and then
val graph = response.entity.withSizeLimit(MaxFeedSize).dataBytes
.via(scanner)
.via(CsvToMap.toMap(Utf8))
.map(attrs => Product(attrs("id").decodeString(Utf8), attrs.mapValues(_.decodeString(Utf8))))
.via(es.flow)
val futureDone = graph.runWith(Sink.foreach(println))
futureDone onComplete {
case Success(_) => println("Done")
case Failure(e) => println(e)
}

fieldsGrouping on a particular stream in Storm

I can see that shuffleGrouping is available for a particular stream in Storm, as described here: How would I split a stream in Apache Storm?
builder.setBolt("myBolt1", new MyBolt1()).shuffleGrouping("SpoutWithStreams", "stream1");
But I have a use case where I want fieldsGrouping on a particular stream emitted by a spout.
For example, SpoutWithStreams is emitting stream1 with random words. I want myBolt1 to subscribe to this stream, but I also want the same instance of myBolt1 to always receive the same words, i.e. I want fieldsGrouping on stream1.
So what I want is something like this:
builder.setBolt("myBolt1", new MyBolt1()).fieldsGrouping("boltWithStreams", "stream1","field");
I don't want to have an extra bolt just for fieldsGrouping. Any other alternatives?

Since the question is not well defined, I will take a guess at what you mean and try to answer.
I am guessing you want to receive two streams in your bolt, where one stream is a shuffleGrouping from another bolt and the other stream is a fieldsGrouping from a spout.
If this is the case, you can do something like this:
builder.setBolt("myBolt1", new MyBolt1()).shuffleGrouping("boltWithStreams", "stream1").fieldsGrouping("spout", "stream2", new Fields("field"));
and then in your bolt you can distinguish if the tuple belongs to one stream or the other using:
if (tuple.getSourceStreamId().equals("stream1")) {
    // do something
} else if (tuple.getSourceStreamId().equals("stream2")) {
    // do something else
}
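Regarding the requirement from the question that equal words always reach the same instance of myBolt1: the fieldsGrouping overload used above takes a component id, a stream id and a Fields object, so it can be applied to a single named stream. The plain-Java sketch below only models the idea behind a fields grouping (pick the task from a hash of the grouping field). FieldsGroupingSketch and chooseTask are hypothetical names, and Storm's real task selection may differ in detail.

```java
import java.util.List;

// Hypothetical model of fields grouping; not Storm code.
final class FieldsGroupingSketch {

    // Picks a task index from the grouping field's value.
    // Equal values always map to the same index; floorMod keeps the result non-negative.
    static int chooseTask(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4; // assumed parallelism of myBolt1
        for (String word : List.of("apple", "banana", "apple", "cherry", "banana")) {
            System.out.println(word + " -> task " + chooseTask(word, numTasks));
        }
    }
}
```

Running it shows that repeated words are always assigned the same task index, while different words may or may not share one.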
