Kafka Streams API: KStream to KTable - apache-kafka-streams

I have a Kafka topic where I send location events (key=user_id, value=user_location). I am able to read and process it as a KStream:
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Location> locations = builder
    .stream("location_topic")
    .map((k, v) -> {
        // some processing here, omitted for clarity
        Location location = new Location(lat, lon);
        return new KeyValue<>(k, location);
    });
That works well, but I'd like to have a KTable with the last known position of each user. How could I do it?
I am able to do it writing to and reading from an intermediate topic:
// write to intermediate topic
locations.to(Serdes.String(), new LocationSerde(), "location_topic_aux");
// build KTable from intermediate topic
KTable<String, Location> table = builder.table("location_topic_aux", "store");
Is there a simple way to obtain a KTable from a KStream? This is my first app using Kafka Streams, so I'm probably missing something obvious.

Update:
In Kafka 2.5, a new method KStream#toTable() was added that provides a convenient way to transform a KStream into a KTable. For details see: https://cwiki.apache.org/confluence/display/KAFKA/KIP-523%3A+Add+KStream%23toTable+to+the+Streams+DSL
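A minimal sketch of the new API (assuming Kafka 2.5 or later; the store name is illustrative):
StreamsBuilder builder = new StreamsBuilder();
// toTable() upserts the stream into a table, keeping the latest value per key;
// "location-store" is an illustrative store name, not required by the API.
KTable<String, Location> table = builder
    .<String, Location>stream("location_topic")
    .toTable(Materialized.as("location-store"));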
Original Answer:
There is no straightforward way to do this at the moment. Your approach is absolutely valid, as discussed in the Confluent FAQ: http://docs.confluent.io/current/streams/faq.html#how-can-i-convert-a-kstream-to-a-ktable-without-an-aggregation-step
This is the simplest approach with regard to the code. However, it has the disadvantages that (a) you need to manage an additional topic and that (b) it results in additional network traffic because data is written to and re-read from Kafka.
There is one alternative, using a "dummy-reduce":
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Long> stream = ...; // some computation that creates the derived KStream
KTable<String, Long> table = stream.groupByKey().reduce(
    new Reducer<Long>() {
        @Override
        public Long apply(Long aggValue, Long newValue) {
            // keep only the latest value per key
            return newValue;
        }
    },
    "dummy-aggregation-store");
This approach is somewhat more complex with regard to the code compared to option 1 but has the advantage that (a) no manual topic management is required and (b) re-reading the data from Kafka is not necessary.
Overall, you need to decide for yourself which approach you like better:
In option 2, Kafka Streams will create an internal changelog topic to back up the KTable for fault tolerance. Thus, both approaches require some additional storage in Kafka and result in additional network traffic. Overall, it’s a trade-off between slightly more complex code in option 2 versus manual topic management in option 1.

Related

Message processing guarantees with spring-cloud-stream-binder-kafka functional binding

Given default configuration and this binding
@Bean
public Function<Flux<Message<Input>>, Flux<Message<Output>>> process() {
    return input -> input
        .map(message -> {
            // simplified
            return MessageBuilder.build();
        });
}
Is there any guarantee that the input message offset is committed after the output is written to Kafka? I don't need full transactions, and I can live with at-least-once delivery and possible duplicates, but I cannot lose an output message. I was unable to find this exact scenario in the docs, and I believe the previous channel-based binding worked as I need it to, since it was blocking by nature, but I am not sure about the functional one.

Kafka Streams TopologyTestDriver input-output topic

I have a Kafka Streams unit test based on the really great, reliable and convenient TopologyTestDriver:
try (TopologyTestDriver testDriver = new TopologyTestDriver(builder.build(),
        streamsConfig(Serdes.String().getClass(), SpecificAvroSerde.class))) {
    TestInputTopic<String, Event> inputTopic = testDriver.createInputTopic(inputTopicName,
        Serdes.String().serializer(), eventSerde.serializer());
    TestOutputTopic<String, Frame> outputWindowTopic = testDriver.createOutputTopic(
        outputTopicName, Serdes.String().deserializer(), frameSerde.deserializer());
    ...
}
I'd like to test a bit more complex setup where an "output" topic is an "input" topic for another topology.
I can define several input and output topics inside of the same topology. But as soon as I am using the same topic as an input and output topic within the same topology, I'm getting the following exception:
org.apache.kafka.streams.errors.TopologyException: Invalid topology: Topic events has already been registered by another source.
    at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.validateTopicNotAlreadyRegistered(InternalTopologyBuilder.java:578)
    at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addSource(InternalTopologyBuilder.java:378)
    at org.apache.kafka.streams.kstream.internals.graph.StreamSourceNode.writeToTopology(StreamSourceNode.java:94)
    at org.apache.kafka.streams.kstream.internals.InternalStreamsBuilder.buildAndOptimizeTopology(InternalStreamsBuilder.java:303)
    at org.apache.kafka.streams.StreamsBuilder.build(StreamsBuilder.java:558)
    at org.apache.kafka.streams.StreamsBuilder.build(StreamsBuilder.java:547)
It looks like the TopologyTestDriver doesn't provide the possibility to define input-output topics, is that right?
Update
To better illustrate what I'm trying to achieve:
builder.stream("input-topic, ...)..to("intermediate-topic",...);
builder.stream("intermediate-topic", ...)..to("output-topic",...);
and I want to be able to verify (assert) the contents of the "intermeidate-topic" in my unit test. Btw. I cannot "reuse" the result of the call ".to()" in building the next topology part, since that method returns void.
But I only have testDriver.createInputTopic() and testDriver.createOutputTopic() and no way of defining something like testDriver.createInputOutputTopic().
Using the same topic as an input and an output topic should work. However, you cannot use the same topic as an input topic multiple times (the stack trace indicates that you are trying to do this).
If you want to use the same input topic twice, you would just add it once, and "fan it out":
KStream stream = builder.stream(...);
stream.map(...); // first usage
stream.filter(...); // second usage
Using the same KStream object twice is basically a "fan out" (or "broadcast") that sends the input data to both operators.
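For the intermediate-topic scenario above, a hedged sketch (topic names follow the question; String serdes are assumed, and the driver's behavior may vary slightly by client version):
// Create a TestOutputTopic for the intermediate topic to assert its contents;
// the driver still feeds those records into the downstream source node.
TestInputTopic<String, String> input = testDriver.createInputTopic(
    "input-topic", Serdes.String().serializer(), Serdes.String().serializer());
TestOutputTopic<String, String> intermediate = testDriver.createOutputTopic(
    "intermediate-topic", Serdes.String().deserializer(), Serdes.String().deserializer());
TestOutputTopic<String, String> output = testDriver.createOutputTopic(
    "output-topic", Serdes.String().deserializer(), Serdes.String().deserializer());

input.pipeInput("user-1", "some-value");
// what the first sub-topology wrote to the intermediate topic
assertEquals("some-value", intermediate.readValue());
// what the second sub-topology forwarded to the final output
assertEquals("some-value", output.readValue());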

Build a Kafka Stream that returns the list of distinct ids within a time interval

I have a Kafka stream of object events:
KStream<String, VehicleEventTO> stream = builder.stream("mytopic", Consumed.with(Serdes.String(), new JsonSerde<>(MyObjectEvent.class)));
Each ObjectEvent has a property idType (Long). I need to build a stream that returns the distinct idTypes within a time interval (for example: 10 minutes).
Is this possible using the Kafka Streams DSL? I can't find a solution.
Based on your use case, you are looking for a windowed aggregation. The Kafka Streams DSL has TimeWindowedKStream and SessionWindowedKStream, which should be able to solve your problem.
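For example, a hedged sketch of a tumbling-window aggregation that collects the distinct idTypes per 10-minute window (the constant grouping key and setSerde are illustrative assumptions, not something the DSL prescribes):
// Group all events under one key and collect the distinct idTypes per window.
KTable<Windowed<String>, Set<Long>> distinctIds = stream
    .groupBy((key, event) -> "all",
        Grouped.with(Serdes.String(), new JsonSerde<>(VehicleEventTO.class)))
    .windowedBy(TimeWindows.of(Duration.ofMinutes(10)))
    .aggregate(
        HashSet::new,
        (key, event, ids) -> { ids.add(event.getIdType()); return ids; },
        Materialized.with(Serdes.String(), setSerde)); // setSerde: a Serde<Set<Long>> you provide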
I don't quite know Kafka Streams' API, but in general streaming APIs you'd have a method that buffers messages over time (like buffer, groupedWithin, or something similar) where you can specify a time window (and/or a maximum number of messages).
Then your stream would be something like:
KStream stream = builder.stream("mytopic", Consumed.with(Serdes.String(), new JsonSerde<>(MyObjectEvent.class)))
    .map(record -> record.value().getId()) // assuming you get a stream of records, I don't know the KafkaStreams api
    .groupedWithin(Duration.ofMinutes(10)); // <-- pseudocode, search for correct method
Then you'd get a stream that contains the ids over time.

Kafka Streams: How to use persistentKeyValueStore to reload existing messages from disk?

My code is currently using an InMemoryKeyValueStore, which avoids any persistence to disk or to Kafka.
I want to use rocksdb (Stores.persistentKeyValueStore) so that the app will reload state from disk. I'm trying to implement this, and I'm very new to Kafka and the streams API. Would appreciate help on how I might make changes, while I still try to understand stuff as I go.
I tried to create the state store here:
StoreBuilder<KeyValueStore<String, LinkedList<StoreItem>>> store =
    Stores.<String, LinkedList<StoreItem>>keyValueStoreBuilder(
        Stores.persistentKeyValueStore(storeKey), Serdes.String(), valueSerde);
How do I register it with the streams builder?
Existing code which uses the inMemoryKeyValueStore:
static StoreBuilder<KeyValueStore<String, LinkedList<StoreItem>>> makeStoreBuilder(
        final String storeKey,
        final Serde<LinkedList<StoreItem>> valueSerde,
        final boolean loggingDisabled) {
    final StoreBuilder<KeyValueStore<String, LinkedList<StoreItem>>> storeBuilder =
        Stores.keyValueStoreBuilder(Stores.inMemoryKeyValueStore(storeKey), Serdes.String(), valueSerde);
    return storeBuilder;
}
I need to ensure that the streams app will not end up missing existing messages in the log topic each time it restarts.
How do I register it with the streams builder?
By calling StreamsBuilder#addStateStore().
https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/StreamsBuilder.html#addStateStore-org.apache.kafka.streams.state.StoreBuilder-
See StateStoresInTheDSLIntegrationTest at https://github.com/confluentinc/kafka-streams-examples for an end-to-end demo application.
You use a persistent store the exact same way as an in-memory store. The store takes care of the rest, and you don't need to worry about loading data etc. You just use it.
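A minimal sketch of that wiring, assuming a 2.x-era API (topic names and the transformer body are placeholders):
StreamsBuilder builder = new StreamsBuilder();
// register the store (with Stores.persistentKeyValueStore swapped in
// inside makeStoreBuilder) with the builder...
builder.addStateStore(makeStoreBuilder(storeKey, valueSerde, false));
// ...and attach it to an operator by name so the operator can access it
builder.<String, String>stream("log-topic")
    .transformValues(() -> new ValueTransformer<String, String>() {
        private KeyValueStore<String, LinkedList<StoreItem>> store;

        @SuppressWarnings("unchecked")
        @Override
        public void init(final ProcessorContext context) {
            store = (KeyValueStore<String, LinkedList<StoreItem>>) context.getStateStore(storeKey);
        }

        @Override
        public String transform(final String value) {
            // read/update the store here; RocksDB plus the changelog topic
            // restore its contents after a restart
            return value;
        }

        @Override
        public void close() { }
    }, storeKey)
    .to("output-topic");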

Distributed caching in storm

How do I store temporary data in Apache Storm?
In a Storm topology, a bolt needs to access previously processed data.
E.g.: if the bolt processes variable1 with result 20 at 10:00 AM,
and variable1 is received again as 50 at 10:15 AM, then the result should be 30 (50-20);
later, if variable1 receives 70, the result should be 20 (70-50) at 10:30.
How can I achieve this functionality?
In short, you want to do micro-batching calculations within Storm's running tuples.
First, you need to define/find the key in the tuple set.
Do fields grouping (don't use shuffle grouping) between bolts using that key. This guarantees that related tuples for the same key are always sent to the same task of the downstream bolt.
Define a class-level collection (List/Map) to maintain old values and add new values to it for the calculation; don't worry, these are thread safe between different executor instances of the same bolt.
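A minimal sketch of the wiring (component names, parallelism, and the spout are placeholders):
// Fields grouping on "variable" routes every tuple for the same variable
// to the same task of the downstream bolt.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("variable-spout", new VariableSpout(), 1);
builder.setBolt("diff-bolt", new CumulativeDiffBolt(), 2)
       .fieldsGrouping("variable-spout", new Fields("variable"));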
I'm afraid there is no such built-in functionality as of today.
But you can use any kind of distributed cache, like memcached or Redis. Those caching solutions are really easy to use.
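For example, a hedged sketch with the Jedis client (host, port, and key naming are assumptions; any Redis client would do):
// Keep the last seen reading per variable in Redis so any bolt task,
// on any worker, can compute the diff against the previous value.
try (Jedis jedis = new Jedis("localhost", 6379)) {
    String previous = jedis.get("variable1");              // last processed value, if any
    int lastValue = (previous == null) ? 0 : Integer.parseInt(previous);
    int diff = currentValue - lastValue;                   // e.g. 50 - 20 = 30
    jedis.set("variable1", String.valueOf(currentValue));  // remember for the next tuple
}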
There are a couple of approaches to do that, but it depends on your system requirements, your team's skills and your infrastructure.
You could use Apache Cassandra to store your events and pass the row's key in the tuple so the next bolt can retrieve it.
If your data is time series in nature, then maybe you would like to have a look at OpenTSDB or InfluxDB.
You could of course fall back to something like Software Transactional Memory, but I think that would need a good amount of crafting.
You can use Guava's CacheBuilder to remember your data within your extended BaseRichBolt (put this in the prepare method):
// init your cache.
this.cache = CacheBuilder.newBuilder()
    .maximumSize(maximumCacheSize)
    .expireAfterWrite(expireAfterWrite, TimeUnit.SECONDS)
    .build();
Then in execute(), you can use the cache to see whether you have already seen that key or not. From there you can add your business logic:
// if we haven't seen it before, we can emit it.
if (this.cache.getIfPresent(key) == null) {
    cache.put(key, nearlyEmptyList);
    this.collector.emit(input, input.getValues());
}
this.collector.ack(input);
This question is a good candidate to demonstrate Apache Spark's in-memory computation over micro-batches. However, your use case is trivial to implement in Storm.
Make sure the bolt uses fields grouping. It consistently hashes incoming tuples with the same key to the same bolt task, so we do not lose any tuples.
Maintain a Map<String, Integer> in the bolt's local cache. This map will keep the last known value of a "variable".
class CumulativeDiffBolt extends InstrumentedBolt {
    Map<String, Integer> lastKnownVariableValue;

    @Override
    public void prepare() {
        this.lastKnownVariableValue = new HashMap<>();
        ....
    }

    @Override
    public void instrumentedNextTuple(Tuple tuple, Collector collector) {
        .... extract variable from tuple
        .... extract current value from tuple
        Integer lastValue = lastKnownVariableValue.getOrDefault(variable, 0);
        Integer newValue = currValue - lastValue;
        // remember the current reading (not the diff), so the next tuple
        // for this variable is compared against the latest value
        lastKnownVariableValue.put(variable, currValue);
        emit(new Fields(variable, newValue));
        ...
    }
}
