Best practices for transforming a batch of records using KStream - apache-kafka-streams

I am new to KStream and would like to know best practices or guidance on how to optimally process batches of records of size n using KStream. I have working code as shown below, but it only handles single messages at a time.
KStream<String, String> sourceStream =
        builder.stream("upstream-kafka-topic",
                Consumed.with(Serdes.String(), Serdes.String()));

// transform sourceStream using an implementation of ValueTransformer<String, String>
sourceStream.transformValues(() -> new MyValueTransformer())
        .to("downstream-kafka-topic",
                Produced.with(Serdes.String(), Serdes.String()));
The above code works with single records, since MyValueTransformer (which implements ValueTransformer) transforms a single String value. How do I make the above code work for a Collection of String values?

You would need to somehow "buffer / aggregate" the messages. For example, you could add a state store to your transformer and store N messages inside the store. As long as the store contains fewer than N messages, you don't do any processing and don't emit any output (you might want to use flatTransformValues, which allows you to emit zero or more results).
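A minimal sketch of that buffering idea, assuming a key-value store named "batch-store" has been registered on the builder (e.g. via builder.addStateStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore("batch-store"), Serdes.String(), Serdes.String()))) and a hypothetical transformBatch() standing in for whatever you want to do with the whole collection:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class BufferingValueTransformer implements ValueTransformer<String, Iterable<String>> {

    private static final int BATCH_SIZE = 100; // "N", an assumption; tune as needed

    private KeyValueStore<String, String> store;
    private int count = 0;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        store = (KeyValueStore<String, String>) context.getStateStore("batch-store");
    }

    @Override
    public Iterable<String> transform(final String value) {
        store.put(String.valueOf(count++), value);
        if (count < BATCH_SIZE) {
            return Collections.emptyList(); // keep buffering, emit nothing yet
        }
        // batch complete: drain the store, transform the whole collection, emit the results
        final List<String> buffered = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            buffered.add(store.get(String.valueOf(i)));
            store.delete(String.valueOf(i));
        }
        count = 0;
        return transformBatch(buffered);
    }

    private List<String> transformBatch(final List<String> values) {
        return values; // placeholder: replace with your real batch transformation
    }

    @Override
    public void close() { }
}

You would wire it in with sourceStream.flatTransformValues(() -> new BufferingValueTransformer(), "batch-store") and write the results to the downstream topic as before. Note that the in-memory counter is lost on restart, so a production version would have to derive it from the store contents (and decide what to do with an incomplete batch).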

I'm not sure what you're trying to achieve. Kafka Streams is, by design, built to process one record at a time. If you want to process a collection or batch of messages, you have a few options.
You might not actually need Kafka Streams at all: since your example doesn't do much with each message, you can use a plain consumer, which lets you process records in batches; see the Spring Kafka documentation on receiving messages: https://docs.spring.io/spring-kafka/docs/current/reference/html/#receiving-messages (Kafka fetches batches at the network layer; you normally process one record at a time, but a standard client can process whole batches, as sketched below).
Alternatively, you could model your value type so that each record carries an embedded collection of messages, which Kafka Streams can then process as a single record; see the Avro array type: https://avro.apache.org/docs/current/spec.html#Arrays
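For the plain-consumer option, a rough sketch (the topic name is taken from the question; the bootstrap servers, group id, poll settings, and transformBatch() are assumptions):

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "batch-transformer");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500"); // upper bound on the batch size

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("upstream-kafka-topic"));
    while (true) {
        // poll() returns as many records as are available, up to max.poll.records
        ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
        if (batch.isEmpty()) {
            continue;
        }
        List<String> values = new ArrayList<>();
        batch.forEach(record -> values.add(record.value()));
        transformBatch(values);  // hypothetical method that processes the whole collection
        consumer.commitSync();   // commit only after the batch has been handled
    }
}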
Check this part of the documentation to better understand the Kafka Streams concepts: https://kafka.apache.org/31/documentation/streams/core-concepts

Related

Can I join the KTable produced by a Kafka Streams #aggregate call before the aggregation runs?

I have a number of IoT devices that report events via messages to a Kafka topic, and I have defined an aggregator to update the device state from those events.
What I'd like to do is be able to join the input stream to the KTable that the aggregator outputs before the aggregation updates the state; that is, I want to, say, compare an event to the current state, and if they match a certain predicate, do some processing, and then update the state.
I've tried creating the state store with StreamsBuilder#addStateStore first, but that method returns a StreamsBuilder, and doesn't seem to provide me a way to turn it into a KTable.
I've tried joining the input stream against the KTable produced by StreamsBuilder#aggregate, but that doesn't do what I want, because it only gives me the value in the KTable after the aggregation has run, and I'd like it to run before the aggregation.
// this is fine, but it returns a StreamsBuilder and I don't see how to get a KTable out of it
streamsBuilder.addStateStore(
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore(deviceStateAggregator),
                Serdes.String(),
                Serdes.String()
        )
);
// this doesn't work because I only get doThingsBeforeStateUpdate called after the state
// is updated by the DeviceStateAggregator
KTable<String, DeviceState> deviceTable = deviceEventKStream
        .groupByKey(Serialized.with(Serdes.String(), new deviceEventSerde()))
        .aggregate(
                () -> null,
                new DeviceStateAggregator(),
                Materialized.<String, DeviceState>as(stateStoreSupplier)
                        .withValueSerde(deviceStateSerde)
        );

deviceEventKStream.join(deviceTable, (event, state) -> doThingsBeforeStateUpdate(event, state));
I was hoping to be able to exploit the Streams DSL to check some preconditions before the state is updated by the aggregator, but it doesn't seem possible. I'm currently exploring the idea of using a Processor, or perhaps just extending my DeviceStateAggregator to do all the pre-aggregation processing as well, but that feels awkward to me, as it forces the aggregation to care about concerns that don't seem reasonable to do as part of the aggregation.
that is, I want to, say, compare an event to the current state, and if they match a certain predicate, do some processing, and then update the state.
If I understand your question, and in particular this quote, correctly, then I'd follow your idea to use the Processor API to implement this. You will need to implement a Transformer (as you want it to output data, not just read it).
As a starting point, I'd recommend looking at the MixAndMatch DSL + Processor API and the CustomStreamTableJoin examples at https://github.com/confluentinc/kafka-streams-examples. The second example shows, albeit for a different use case, how to do custom "if this then that" logic when working with state in the Processor API, and it also covers join functionality, which you want as well.
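As a rough sketch of that approach (mine, not code from those examples): assuming a state store named "device-state-store" has been registered via streamsBuilder.addStateStore(...), that DeviceEvent and DeviceState are your existing types, and that the private helpers stand in for your own predicate and aggregation logic, the Transformer could look like this:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class DeviceStateTransformer
        implements Transformer<String, DeviceEvent, KeyValue<String, DeviceState>> {

    private KeyValueStore<String, DeviceState> stateStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        stateStore = (KeyValueStore<String, DeviceState>) context.getStateStore("device-state-store");
    }

    @Override
    public KeyValue<String, DeviceState> transform(final String deviceId, final DeviceEvent event) {
        final DeviceState current = stateStore.get(deviceId);  // state *before* this event is applied
        if (matchesPredicate(event, current)) {                 // hypothetical precondition check
            doThingsBeforeStateUpdate(event, current);          // the processing from the question
        }
        final DeviceState updated = aggregate(current, event);  // hypothetical aggregation step
        stateStore.put(deviceId, updated);
        return KeyValue.pair(deviceId, updated);                // downstream sees the updated state
    }

    @Override
    public void close() { }

    // stand-ins for your own logic:
    private boolean matchesPredicate(final DeviceEvent event, final DeviceState state) { return true; }
    private void doThingsBeforeStateUpdate(final DeviceEvent event, final DeviceState state) { }
    private DeviceState aggregate(final DeviceState state, final DeviceEvent event) { return state; }
}

You would then replace the groupByKey().aggregate() call with deviceEventKStream.transform(() -> new DeviceStateTransformer(), "device-state-store"), which gives you full control over what happens before the state is updated.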
Hope this helps!

Adding a global store for a transformer to consume

Is there a way to add a global store for a Transformer to use? In the docs for transformer it says:
"Transform each record of the input stream into zero or more records in the output stream (both key and value type can be altered arbitrarily). A Transformer (provided by the given TransformerSupplier) is applied to each input record and computes zero or more output records. In order to assign a state, the state must be created and registered beforehand via stores added via addStateStore or addGlobalStore before they can be connected to the Transformer"
yet the API for addGlobalStore takes a ProcessorSupplier:
addGlobalStore(storeBuilder: StoreBuilder[_ <: StateStore],
topic: String,
consumed: Consumed[_, _],
stateUpdateSupplier: ProcessorSupplier[_, _])
My end goal is to use the Kafka Streams DSL with a transformer, since I need a flatMap and to transform both keys and values to my output topic. I do not have a processor in my topology, though.
I would expect something like this:
addGlobalStore(storeBuilder: StoreBuilder[_ <: StateStore], topic: String, consumed: Consumed[_, _], stateUpdateSupplier: TransformerSupplier[_, _])
The Processor that is passed into addGlobalStore() is used to maintain (i.e., write) the store. Note that it is currently expected that this Processor copies the data as-is into the store (cf. https://issues.apache.org/jira/browse/KAFKA-7663).
After you have added a global store, you can also add a Transformer, and the Transformer can access the store. Note that it's not required to connect a global store to make it available (only "regular" stores need to be connected). Also note that a Transformer only gets read access to global stores.
Use a Processor instead of a Transformer for all the transformations you want to perform on the input topic whenever there is a use case of looking up data from a GlobalStateStore. Use context.forward(key, value, childName) to send the data to the downstream nodes; it may be called multiple times within process() and punctuate() to send multiple records to a downstream node. If there is a requirement to update the GlobalStateStore, do this only in the Processor passed to addGlobalStore(..), because there is a GlobalStreamThread associated with the GlobalStateStore which keeps the state of the store consistent across all the running KafkaStreams instances.
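A rough sketch of that pattern, assuming a global store named "metadata-global-store" added via addGlobalStore(), a downstream node named "sink-node" in the topology, and simple string concatenation standing in for your real lookup/enrichment logic:

import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.To;
import org.apache.kafka.streams.state.KeyValueStore;

public class LookupProcessor implements Processor<String, String> {

    private ProcessorContext context;
    private KeyValueStore<String, String> globalStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        this.context = context;
        // read-only use here; writes belong in the processor passed to addGlobalStore()
        this.globalStore = (KeyValueStore<String, String>) context.getStateStore("metadata-global-store");
    }

    @Override
    public void process(final String key, final String value) {
        final String metadata = globalStore.get(key); // lookup against the global store
        if (metadata != null) {
            // forward() may be called as many times as needed to emit multiple records;
            // To.child(...) is the newer form of the forward(key, value, childName) overload
            context.forward(key, value + "|" + metadata, To.child("sink-node"));
        }
    }

    @Override
    public void close() { }
}

This assumes the processor is wired into a Topology with addSource(), addProcessor(), and addSink() (so that "sink-node" exists as a named child), while the global store itself is registered with addGlobalStore() together with its own update processor.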

How to avoid timestamps extracted from a changelog topic to trigger Processor::punctuate

I'm writing a KafkaStreams (v2.1.0) application which reads messages from a regular Kafka topic, left-joins these messages with metadata from a compacted changelog topic, and performs some stateful processing on the result using a Processor which also runs a punctuate call at regular event-time intervals. The messages in the data topic have a timestamp, and a custom TimestampExtractor is defined (which defines the event time I'd like to use in the punctuate call). The messages in the metadata topic, however, don't have a timestamp, but it seems KafkaStreams requires a TimestampExtractor to be defined anyhow. Now, if I use the embedded record metadata (ExtractRecordMetadataTimestamp) or just a WallclockTimestampExtractor, it breaks the logic of my application, because the timestamps of the metadata topic also trigger the punctuate call in my processor, while the event time in my data topic can be hours behind the wall-clock time and I only want punctuate to be triggered by the data topic's event time.
My question is how I can avoid the punctuate call being triggered by timestamps extracted from this compacted metadata topic.
One trick that seems to work at first sight is to always return 0 as the timestamp of these metadata messages, but I'm not sure that it won't have unwanted side effects. Another workaround is not to rely on punctuate at all and implement my own tracking of the event time, but I'd prefer to use the standard KafkaStreams approach. So maybe there is another way to solve this?
This is the structure of my application:
KStream<String, Data> input =
        streamsBuilder.stream(dataTopic,
                Consumed.with(Serdes.String(),
                        new DataSerde(),
                        new DataTimestampExtractor(),
                        Topology.AutoOffsetReset.LATEST));

KTable<String, Metadata> metadataTable =
        streamsBuilder.table(metadataTopic,
                Consumed.with(Serdes.String(),
                        new MetadataSerde(),
                        new WhichTimestampExtractorToUse(),
                        Topology.AutoOffsetReset.EARLIEST));

input.leftJoin(metadataTable, this::joiner)
        .process(new ProcessorUsingPunctuateOnEventTime());
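For reference, a minimal sketch of the "always return 0" workaround mentioned above (the class name is made up, and whether it has unwanted side effects is exactly the open question):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class ZeroTimestampExtractor implements TimestampExtractor {

    @Override
    public long extract(final ConsumerRecord<Object, Object> record, final long previousTimestamp) {
        // metadata records never advance stream time, so they never trigger event-time punctuation on their own
        return 0L;
    }
}

It would be plugged in as the WhichTimestampExtractorToUse in the Consumed for the metadata topic.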

Kafka Streams: How to avoid forwarding downstream twice when repartitioning

In my application I have KafkaStreams instances with a very simple topology: there is one processor, with a key-value store, and each incoming message gets written to the store and is then forwarded downstream to a sink.
I would like to increase the number of partitions I have for my source topic, and then reprocess the data, so that each store will contain only keys relevant to its partition. (I understand this is done using the Application Reset Tool.) However, while reprocessing the data, I don't want to forward anything downstream; I want only new data to be forwarded. (Otherwise, consumers of the result topic would handle old values again.) My question: is there an easy way to achieve this? Any built-in mechanism that can assist me in telling reprocessed data and new data apart, maybe?
Thank you in advance
There is no built-in mechanism. But you might be able to just remove the sink operation that writes to the result topic while you reprocess your data: when reprocessing is done, you stop the application, add the sink again, and restart. Not sure if this works for you.
Another possible solution is to use transform() and implement an offset-based filter. For each input topic partition, you get the offset of the first new message (this is something you need to determine manually before you write the Transformer). You then use this information to implement a filter as a custom Transformer: for each input record, you check the record's partition and offset, and drop it if the record's offset is smaller than the offset of the first new message of that partition.
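A rough sketch of such an offset-based filter, assuming the per-partition offsets of the first new messages have already been collected into a map that is passed to the transformer:

import java.util.Map;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

public class OffsetFilterTransformer implements Transformer<String, String, KeyValue<String, String>> {

    // partition -> offset of the first "new" record, determined manually before the reset
    private final Map<Integer, Long> firstNewOffsets;
    private ProcessorContext context;

    public OffsetFilterTransformer(final Map<Integer, Long> firstNewOffsets) {
        this.firstNewOffsets = firstNewOffsets;
    }

    @Override
    public void init(final ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<String, String> transform(final String key, final String value) {
        final long firstNewOffset = firstNewOffsets.getOrDefault(context.partition(), 0L);
        if (context.offset() < firstNewOffset) {
            return null;                   // reprocessed record: drop it, nothing goes downstream
        }
        return KeyValue.pair(key, value);  // new record: forward unchanged
    }

    @Override
    public void close() { }
}

You would insert it right before the sink, e.g. stream.transform(() -> new OffsetFilterTransformer(firstNewOffsets)).to(resultTopic), and remove it again once reprocessing is finished.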

#Storm: how to setup various metrics for the same data source

I'm trying to setup Storm to aggregate a stream, but with various (DRPC available) metrics on the same stream.
E.g. the stream consists of messages that have a sender, a recipient, the channel through which the message arrived, and a gateway through which it was delivered. I'm having trouble deciding how to organize one or more topologies that could give me, e.g., the total count of messages by gateway and/or by channel. And besides the totals, counts per minute would be nice too.
The basic idea is to have a spout that will accept messaging events and, from there, aggregate the data as needed. Currently I'm playing around with Trident and DRPC and I've come up with two possible topologies that solve the problem at this stage. I can't decide which approach is better, if either.
The entire source is available at this gist.
It has three classes:
RandomMessageSpout
used to emit the messaging data
simulates the real data source
SeparateTopology
creates a separate DRPC stream for each metric needed
also a separate query state is created for each metric
they all use the same spout instance
CombinedTopology
creates a single DRPC stream with all the metrics needed
creates a separate query state for each metric
each query state extracts the desired metric and groups results for it
Now, for the problems and questions:
SeparateTopology
is it necessary to use the same spout instance or can I just say new RandomMessageSpout() each time?
I like the idea that I don't need to persist grouped data by all the metrics, but just the groupings we need to extract later
is the spout-emitted data actually processed by all the state/query combinations, and not just, e.g., by the first one that comes?
would this also later enable dynamic addition of new state/query combinations at runtime?
CombinedTopology
I don't really like the idea that I need to persist data grouped by all the metrics since I don't need all the combinations
it came as a surprise that all the metrics always return the same data
e.g. channel and gateway inquiries return status metrics data
I found that this was always the data grouped by the first field in state definition
this topic explains the reasoning behind this behaviour
but I'm wondering if this is a good way of doing things in the first place (and will find a way around this issue if need be)
SnapshotGet vs TupleCollectionGet in stateQuery
with SnapshotGet things tended to work, but not always; only TupleCollectionGet solved the issue
any pointers as to what the correct way of doing this is?
I guess this is a longish question / topic, but any help is really appreciated!
Also, if I missed the architecture entirely, suggestions on how to accomplish this would be most welcome.
Thanks in advance :-)
You can't actually split a stream in SeparateTopology by invoking newStream() using the same spout instance, since that would create new instances of the same RandomMessageSpout spout, which would result in duplicate values being emitted to your topology by multiple, separate spout instances. (Spout parallelization is only possible in Storm with partitioned spouts, where each spout instance processes a partition of the whole dataset -- a Kafka partition, for example).
The correct approach here is to modify the CombinedTopology to split the stream into multiple streams as needed for each metric you need (see below), and then do a groupBy() by that metric's field and persistentAggregate() on each newly branched stream.
From the Trident FAQ,
"each" returns a Stream object, which you can store in a variable. You can then run multiple eaches on the same Stream to split it, e.g.:
Stream s = topology.each(...).groupBy(...).aggregate(...)
Stream branch1 = s.each(...)
Stream branch2 = s.each(...)
See this thread on Storm's mailing list, and this one for more information.
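To make the suggested branching concrete, a minimal sketch (package names are for a recent Storm release; older releases use storm.trident.* and backtype.storm.*, and RandomMessageSpout is assumed to emit tuples with "gateway" and "channel" fields):

import org.apache.storm.trident.Stream;
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

// one spout, one stream, then one branch per metric: group by that metric's
// field and keep a persistent count that DRPC queries can read later
TridentTopology topology = new TridentTopology();
Stream messages = topology.newStream("messages", new RandomMessageSpout());

TridentState countsByGateway = messages
        .groupBy(new Fields("gateway"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

TridentState countsByChannel = messages
        .groupBy(new Fields("channel"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

Each branch persists only the grouping it needs, and a DRPC stream with a stateQuery() against countsByGateway or countsByChannel can then serve the corresponding queries.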
