Kafka Streams toTable - logging the old value - apache-kafka-streams

I have a reasonably simple topology in Kafka Streams looking to build up a few strings and output them with a key to a new topic. This will end up being treated as a KTable with each key having a set of values and the possibility of being overwritten.
I would like to have a log message generated when an overwrite occurs, noting the old values and the new values.
I was thinking I'd be able to do something like this:
// ... multiple KTable-KTable joins needed to build up the strings
.toStream()
.peek((key, value) -> {
    // some code to compare value with the value in the KeyValueStore here and log.info() the result
})
.mapValues(v -> new ValueWithCalculatedStrings(v))
.toTable(Materialized.as(/* the KVS mentioned above */))
.toStream()
.to("outputTopic");
possibly by declaring my store above this block and using the reference in both the peek and toTable sections. Unfortunately the Materialized instance doesn't give me access to the actual store, so this is a non-starter.
Alternatively, if there is an event I can hook into on the toTable to indicate an old record then that would also work.
Is there any way to do this other than needing to mess with grouping?
I also note there is an enableLogging option on Materialized, but the little documentation I can find suggests this is for internal fault tolerance rather than for external observation.
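Since the Materialized instance can't hand back the store, one possible sketch of the compare-and-log idea from the peek() comment above is to register a store manually and do the comparison in a ValueTransformerWithKey just before toTable(). Everything below is hypothetical: the store name "calculated-strings-store", the assumption that ValueWithCalculatedStrings has a usable equals(), and the logging itself.

import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch: compare the incoming value with the previously stored one,
// log overwrites, then store the new value and pass it on unchanged.
public class OverwriteLoggingTransformer
        implements ValueTransformerWithKey<String, ValueWithCalculatedStrings, ValueWithCalculatedStrings> {

    private static final Logger log = LoggerFactory.getLogger(OverwriteLoggingTransformer.class);
    private KeyValueStore<String, ValueWithCalculatedStrings> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        store = (KeyValueStore<String, ValueWithCalculatedStrings>) context.getStateStore("calculated-strings-store");
    }

    @Override
    public ValueWithCalculatedStrings transform(String key, ValueWithCalculatedStrings newValue) {
        ValueWithCalculatedStrings oldValue = store.get(key);
        if (oldValue != null && !oldValue.equals(newValue)) {
            log.info("Overwriting key {}: old={}, new={}", key, oldValue, newValue);
        }
        store.put(key, newValue);
        return newValue;
    }

    @Override
    public void close() {}
}

The store would be registered with StreamsBuilder#addStateStore and the transformer attached via transformValues(OverwriteLoggingTransformer::new, "calculated-strings-store") just before the toTable() call; whether this beats a grouping-based approach depends on the rest of the topology.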

Related

Apache Flink relating/caching data options

This is a very broad question: I'm new to Flink and am looking into the possibility of using it as a replacement for a current analytics engine.
The scenario is: data is collected from various equipment and received as a JSON-encoded string with the format {"location.attribute": value, "TimeStamp": value}.
For example, a unitary traceability code is received for a location, after which various process parameters are received in a real-time stream. The analysis is to be run over the process parameters, but the output needs to include a relation to a traceability code, for example {"location.alarm": value, "location.traceability": value, "TimeStamp": value}.
What method does Flink use for caching values, in this case the current traceability code, whilst running analysis over other parameters received at a later time?
I'm mainly just looking for the area to research, as so far I've been unable to find any examples of this kind of scenario. Perhaps it's not the kind of process that Flink can handle.
A natural way to do this sort of thing with Flink would be to key the stream by the location, and then use keyed state in a ProcessFunction (or RichFlatMapFunction) to store the partial results until ready to emit the output.
With a keyed stream, you are guaranteed that every event with the same key will be processed by the same instance. You can then use keyed state, which is effectively a sharded key/value store, to store per-key information.
The Apache Flink training includes some explanatory material on keyed streams and working with keyed state, as well as an exercise or two that explore how to use these mechanisms to do roughly what you need.
Alternatively, you could do this with the Table or SQL API, and implement this as a join of the stream with itself.
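As a rough illustration of the keyed-state approach described above (not taken from the question, so all type and field names are hypothetical placeholders), a KeyedProcessFunction keyed by location could cache the latest traceability code in ValueState and attach it to later parameter readings:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical sketch: the stream is keyed by location, and the latest traceability
// code per location is cached in keyed state and attached to later parameter readings.
public class TraceabilityEnricher
        extends KeyedProcessFunction<String, TraceabilityEnricher.Reading, TraceabilityEnricher.EnrichedReading> {

    // Minimal placeholder types standing in for the deserialized JSON.
    public static class Reading {
        public String location;
        public String attribute;   // e.g. "traceability" or "alarm"
        public String value;
        public long timeStamp;
    }

    public static class EnrichedReading {
        public Reading reading;
        public String traceabilityCode;
        public EnrichedReading(Reading reading, String traceabilityCode) {
            this.reading = reading;
            this.traceabilityCode = traceabilityCode;
        }
    }

    // Keyed state: one current traceability code per location.
    private transient ValueState<String> traceabilityCode;

    @Override
    public void open(Configuration parameters) {
        traceabilityCode = getRuntimeContext().getState(
                new ValueStateDescriptor<>("traceability-code", String.class));
    }

    @Override
    public void processElement(Reading reading, Context ctx, Collector<EnrichedReading> out)
            throws Exception {
        if ("traceability".equals(reading.attribute)) {
            // Remember the latest code for this location.
            traceabilityCode.update(reading.value);
        } else {
            // Attach the cached code to every other process-parameter reading.
            out.collect(new EnrichedReading(reading, traceabilityCode.value()));
        }
    }
}

The stream would be keyed first, e.g. readings.keyBy(r -> r.location).process(new TraceabilityEnricher()), so each location's code lives in its own slot of the keyed state.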

Can I join the KTable produced by a Kafka Streams #aggregate call before the aggregation runs?

I have a number of IoT devices that report events via messages to a Kafka topic, and I have defined an aggregator to update the device state from those events.
What I'd like to do is be able to join the input stream to the KTable that the aggregator outputs before the aggregation updates the state-- that is, I want to, say, compare an event to the current state, and if they match a certain predicate, do some processing, and then update the state.
I've tried creating the state store with StreamsBuilder#addStateStore first, but that method returns a StreamsBuilder, and doesn't seem to provide me a way to turn it into a KTable.
I've tried joining the input stream against the KTable produced by StreamsBuilder#aggregate, but that doesn't do what I want, because it only gives me the value in the KTable after the aggregation has run, and I'd like it to run before the aggregation.
// this is fine, but it returns a StreamsBuilder and I don't see how to get a KTable out of it
streamsBuilder.addStateStore(
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore(deviceStateAggregator),
        Serdes.String(),
        Serdes.String()
    )
);
// this doesn't work because I only get doThingsBeforeStateUpdate called after the state is updated by the DeviceStateAggregator
KTable<String, DeviceState> deviceTable = deviceEventKStream
    .groupByKey(Serialized.with(Serdes.String(), new deviceEventSerde()))
    .aggregate(
        () -> null,
        new DeviceStateAggregator(),
        Materialized.<String, DeviceState>as(stateStoreSupplier)
            .withValueSerde(deviceStateSerde)
    );
deviceEventKStream.join(deviceTable, (event, state) -> doThingsBeforeStateUpdate(event, state));
I was hoping to be able to exploit the Streams DSL to check some preconditions before the state is updated by the aggregator, but it doesn't seem possible. I'm currently exploring the idea of using a Processor, or perhaps just extending my DeviceStateAggregator to do all the pre-aggregation processing as well, but that feels awkward to me, as it forces the aggregation to care about concerns that don't seem reasonable to do as part of the aggregation.
that is, I want to, say, compare an event to the current state, and if they match a certain predicate, do some processing, and then update the state.
If I understand your question and notably this quote correctly, then I'd follow your idea to use the Processor API to implement this. You will need to implement a Transformer (as you want it to output data, not just read it).
As an example application that you could use as a starting point, I'd recommend looking at the MixAndMatch DSL + Processor API and the CustomStreamTableJoin examples at https://github.com/confluentinc/kafka-streams-examples. The second example shows, though for a different use case, how to do custom "if this then that" logic when working with state in the Processor API, and it also covers join functionality, which is something you want to do, too.
Hope this helps!
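Along the lines of that suggestion, here is a rough, hypothetical sketch of a Transformer that owns the device-state store, so the predicate check can see the state before the event is applied. DeviceEvent and DeviceState are the question's types; the store name and the three helper methods are placeholders for the question's own logic.

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Rough sketch: check the current state *before* applying the event, then update the store.
public class PreUpdateCheckTransformer
        implements Transformer<String, DeviceEvent, KeyValue<String, DeviceState>> {

    private KeyValueStore<String, DeviceState> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        store = (KeyValueStore<String, DeviceState>) context.getStateStore("device-state-store");
    }

    @Override
    public KeyValue<String, DeviceState> transform(String deviceId, DeviceEvent event) {
        DeviceState current = store.get(deviceId);            // state before the update
        if (current != null && matchesPredicate(event, current)) {
            doThingsBeforeStateUpdate(event, current);         // the pre-aggregation processing
        }
        DeviceState updated = applyEvent(current, event);      // the aggregation logic itself
        store.put(deviceId, updated);
        return KeyValue.pair(deviceId, updated);               // downstream sees the new state
    }

    @Override
    public void close() {}

    // Placeholders for the question's own logic.
    private boolean matchesPredicate(DeviceEvent event, DeviceState state) { return true; }
    private void doThingsBeforeStateUpdate(DeviceEvent event, DeviceState state) {}
    private DeviceState applyEvent(DeviceState current, DeviceEvent event) { return current; }
}

The store would be registered with StreamsBuilder#addStateStore and connected via deviceEventKStream.transform(PreUpdateCheckTransformer::new, "device-state-store"), replacing the groupByKey().aggregate() step.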

Which guarantees does Kafka Stream provide when using a RocksDb state store with changelog?

I'm building a Kafka Streams application that generates change events by comparing every new calculated object with the last known object.
So for every message on the input topic, I update an object in a state store and every once in a while (using punctuate), I apply a calculation on this object and compare the result with the previous calculation result (coming from another state store).
To make sure this operation is consistent, I do the following after the punctuate triggers:
1. write a tuple to the state store
2. compare the two values, create change events and context.forward() them, so the events go to the results topic
3. swap the tuple for the new_value and write it to the state store
I use this tuple for scenarios where the application crashes or rebalances, so I can always send out the correct set of events before continuing.
Now, I noticed the resulting events are not always consistent, especially if the application frequently rebalances. It looks like in rare cases the Kafka Streams application emits events to the results topic, but the changelog topic is not up to date yet. In other words, I produced something to the results topic, but my changelog topic is not in the same state yet.
So, when I do a stateStore.put() and the method call returns successfully, are there any guarantees when it will be on the changelog topic?
Can I enforce a changelog flush? When I do context.commit(), when will that flush+commit happen?
To get complete consistency, you will need to enable processing.guarantee="exactly_once"; otherwise, with a potential error, you might get inconsistent results.
If you want to stay with "at_least_once", you might want to use a single store, and update the store after processing is done (i.e., after calling forward()). This minimizes the time window in which inconsistencies can occur.
And yes, when you call context.commit(), all stores will be flushed to disk and all pending producer writes will be flushed before the input topic offsets are committed.
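For reference, enabling the exactly-once guarantee is a one-line config change; the sketch below uses placeholder application id and bootstrap servers.

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceConfig {
    public static Properties streamsConfig() {
        // Application id and bootstrap servers are placeholder values for this sketch.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "change-event-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // With exactly-once enabled, changelog writes, output records and offset
        // commits are committed atomically, so the results topic and the changelog
        // cannot diverge after a crash or rebalance.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        return props;
    }
}

The returned Properties would then be passed to the KafkaStreams constructor as usual.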

Kafka Streams: How to avoid forwarding downstream twice when repartitioning

In my application I have KafkaStreams instances with a very simple topology: there is one processor, with a key-value store, and each incoming message gets written to the store and is then forwarded downstream to a sink.
I would like to increase the number of partitions I have for my source topic, and then reprocess the data, so that each store will contain only keys relevant to its partition. (I understand this is done using the Application Reset Tool). However, while reprocessing the data, I don't want to forward anything downstream; I want only new data to be forwarded. (Otherwise, consumers of the result topic will handle old values again). My question: is there an easy way to achieve this? Any built-in mechanism that can assist me in telling reprocessed data and new data apart, maybe?
Thank you in advance
There is no built-in mechanism. But you might be able to just remove the sink operation that writes to the result topic while you reprocess your data: when reprocessing is done, you stop the application, add the sink again, and restart. Not sure if this works for you.
Another possible solution might be to use transform() and implement an offset-based filter. For each input topic partition, you get the offset of the first new message (this is something you need to do manually before you write the Transformer). You use this information to implement a filter as a custom Transformer: for each input record, you check the record's partition and offset, and drop it if its offset is smaller than the offset of the first new message for that partition.
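A minimal sketch of such an offset-based filter follows; all names are hypothetical, and the per-partition "first new offset" map is assumed to have been determined manually beforehand.

import java.util.Map;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

// Drops every record whose offset is older than the first "new" message of its partition.
public class OffsetFilterTransformer<K, V> implements Transformer<K, V, KeyValue<K, V>> {

    private final Map<Integer, Long> firstNewOffsetPerPartition;
    private ProcessorContext context;

    public OffsetFilterTransformer(Map<Integer, Long> firstNewOffsetPerPartition) {
        this.firstNewOffsetPerPartition = firstNewOffsetPerPartition;
    }

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<K, V> transform(K key, V value) {
        long firstNewOffset = firstNewOffsetPerPartition.getOrDefault(context.partition(), 0L);
        if (context.offset() < firstNewOffset) {
            return null; // returning null forwards nothing downstream
        }
        return KeyValue.pair(key, value);
    }

    @Override
    public void close() {}
}

It would be wired in as stream.transform(() -> new OffsetFilterTransformer<>(firstNewOffsets)).to("result-topic") while reprocessing, and removed again afterwards.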

Is there a way to get offset for each message consumed in kafka streams?

In order to avoid re-reading messages that were processed but whose offsets were not committed when a Kafka Streams application is killed, I want to get the offset for each message along with the key and value, so that I can store it somewhere and use it to avoid reprocessing already-processed messages.
Yes, this is possible. See the FAQ entry at http://docs.confluent.io/current/streams/faq.html#accessing-record-metadata-such-as-topic-partition-and-offset-information.
I'll copy-paste the key information below:
Accessing record metadata such as topic, partition, and offset information?
Record metadata is accessible through the Processor API. It is also accessible indirectly through the DSL thanks to its Processor API integration.
With the Processor API, you can access record metadata through a ProcessorContext. You can store a reference to the context in an instance field of your processor during Processor#init(), and then query the processor context within Processor#process(), for example (same for Transformer). The context is updated automatically to match the record that is currently being processed, which means that methods such as ProcessorContext#partition() always return the current record’s metadata. Some caveats apply when calling the processor context within punctuate(), see the Javadocs for details.
If you use the DSL combined with a custom Transformer, for example, you could transform an input record’s value to also include partition and offset metadata, and subsequent DSL operations such as map or filter could then leverage this information.
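To make the Transformer route concrete, here is a small hypothetical sketch that rewrites each value into a string carrying its partition and offset; in a real application you would more likely enrich a proper value object.

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

// Enriches each record's value with the partition and offset it was read from.
public class OffsetEnrichingTransformer<K, V> implements Transformer<K, V, KeyValue<K, String>> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        // The context is updated automatically to point at the record being processed.
        this.context = context;
    }

    @Override
    public KeyValue<K, String> transform(K key, V value) {
        String enriched = String.format("partition=%d offset=%d value=%s",
                context.partition(), context.offset(), value);
        return KeyValue.pair(key, enriched);
    }

    @Override
    public void close() {}
}

It would be used as stream.transform(OffsetEnrichingTransformer::new), after which map or filter steps can see the metadata embedded in the value.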

Resources