Kafka Streams Transform Method and stream timestamp - apache-kafka-streams

The ProcessorContext.timestamp() Javadoc suggests it returns the stream time only when triggered from a punctuate call, not when handling a message in the transform method:
Returns the current timestamp. If it is triggered while processing a
record streamed from the source processor, timestamp is defined as the
timestamp of the current input record; the timestamp is extracted from
ConsumerRecord by TimestampExtractor. If it is triggered while
processing a record generated not from the source processor (for
example, if this method is invoked from the punctuate call), timestamp
is defined as the current task's stream time, which is defined as the
smallest among all its input stream partition timestamps.
So is there a way for the transform method to be aware of the stream time? I need to be able to detect the lateness of messages compared to the stream time and handle them appropriately.
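One workaround I can think of is to approximate stream time myself by tracking the largest record timestamp the transformer has seen so far (a rough sketch only; the class and field names are made up, and the per-task maximum is only an approximation of the actual stream time), but I'd prefer a built-in way:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

public class LatenessAwareTransformer<K, V> implements Transformer<K, V, KeyValue<K, V>> {

    private ProcessorContext context;
    private long maxObservedTimestamp = Long.MIN_VALUE; // approximation of stream time

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<K, V> transform(K key, V value) {
        long recordTimestamp = context.timestamp(); // timestamp of the current input record
        maxObservedTimestamp = Math.max(maxObservedTimestamp, recordTimestamp);
        long lateness = maxObservedTimestamp - recordTimestamp; // 0 for in-order records
        if (lateness > 0) {
            // handle the late record, e.g. tag it, route it elsewhere, or drop it
        }
        return KeyValue.pair(key, value);
    }

    @Override
    public void close() {
    }
}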
Many thanks

Related

How to avoid timestamps extracted from a changelog topic to trigger Processor::punctuate

I'm writing a KafkaStreams (v2.1.0) application which reads messages from a regular Kafka topic, left-joins these messages with metadata from a compacted changelog topic, and performs some stateful processing on the result using a Processor which also runs a punctuate call on regular event-time intervals. The messages in the data topic have a timestamp, and a custom TimestampExtractor is defined (which extracts the event-time I'd like to use in the punctuate call). The messages in the metadata topic, however, don't have a timestamp, but it seems KafkaStreams requires a TimestampExtractor to be defined anyhow. Now, if I use the embedded record metadata (ExtractRecordMetadataTimestamp) or just a WallclockTimestampExtractor, it breaks the logic of my application, because the timestamps of the metadata topic also seem to trigger the punctuate call in my processor; the event-time in my data topic can be hours behind the wall-clock time, and I only want the punctuate call to trigger on the data-topic timestamps.
My question is: how can I prevent the punctuate call from being triggered by timestamps extracted from this compacted metadata topic?
One trick that seems to work at first sight is to always return 0 as the timestamp of these metadata messages, but I'm not sure it won't have unwanted side-effects. Another workaround is to not rely on punctuate at all and implement my own tracking of the event time, but I'd prefer to use the standard KafkaStreams approach. So maybe there is another way to solve this?
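The "always return 0" trick would look roughly like this (a sketch only; the class name is made up):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class MetadataTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        // Always return 0 so metadata records never advance stream time
        // and therefore never trigger event-time punctuation on their own.
        return 0L;
    }
}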
This is the structure of my application:
KStream<String, Data> input =
    streamsBuilder.stream(dataTopic,
        Consumed.with(Serdes.String(),
                      new DataSerde(),
                      new DataTimestampExtractor(),
                      Topology.AutoOffsetReset.LATEST));

KTable<String, Metadata> metadataTable =
    streamsBuilder.table(metadataTopic,
        Consumed.with(Serdes.String(),
                      new MetadataSerde(),
                      new WhichTimestampExtractorToUse(),
                      Topology.AutoOffsetReset.EARLIEST));

input.leftJoin(metadataTable, this::joiner)
     .process(new ProcessorUsingPunctuateOnEventTime());
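Inside ProcessorUsingPunctuateOnEventTime, the event-time punctuation is registered roughly like this (a sketch only; the interval is a placeholder):

import java.time.Duration;
import org.apache.kafka.streams.processor.PunctuationType;

// inside Processor#init(ProcessorContext context):
context.schedule(Duration.ofMinutes(5), PunctuationType.STREAM_TIME, timestamp -> {
    // "timestamp" is the current stream time, i.e. it only advances when the
    // timestamps extracted from the input records advance -- which is why the
    // extractor chosen for the metadata topic matters here.
});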

Which guarantees does Kafka Stream provide when using a RocksDb state store with changelog?

I'm building a Kafka Streams application that generates change events by comparing every new calculated object with the last known object.
So for every message on the input topic, I update an object in a state store and every once in a while (using punctuate), I apply a calculation on this object and compare the result with the previous calculation result (coming from another state store).
To make sure this operation is consistent, I do the following after the punctuate triggers:
write a tuple to the state store
compare the two values, create change events and context.forward() them, so the events go to the results topic
replace the tuple with the new value and write it to the state store
I use this tuple for scenarios where the application crashes or rebalances, so I can always send out the correct set of events before continuing.
Now, I noticed the resulting events are not always consistent, especially if the application frequently rebalances. It looks like in rare cases the Kafka Streams application emits events to the results topic, but the changelog topic is not up to date yet. In other words, I produced something to the results topic, but my changelog topic is not in the same state yet.
So, when I do a stateStore.put() and the method call returns successfully, are there any guarantees when it will be on the changelog topic?
Can I enforce a changelog flush? When I do context.commit(), when will that flush+commit happen?
To get complete consistency, you will need to enable processing.guarantee="exactly_once" -- otherwise, in case of a failure, you might get inconsistent results.
If you want to stay with "at_least_once", you might want to use a single store, and update the store after processing is done (i.e., after calling forward()). This minimizes the time window in which inconsistencies can occur.
And yes, when you call context.commit(), all stores will be flushed to disk and all pending producer writes will be flushed as well before the input topic offsets are committed.
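For illustration, a minimal configuration sketch (the application id and bootstrap servers are placeholders):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "change-event-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// enable exactly-once processing so forwarded results, state/changelog updates,
// and offset commits are committed atomically
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

With exactly_once enabled, writes to the results topic, the changelog topics, and the input offset commits are all part of a single transaction, so a failure cannot leave them in different states.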

How to process a message at a specific time?

I'm using spring-integration to develop a service bus. I need to process some messages from the message store at a specific time. For example, if there is an executionTimestamp parameter in the payload of the message, it will be executed at the specified time; otherwise it will be executed as soon as the message is received.
What kind of channel and taskExecutor do I have to use?
Do I have to implement a custom Trigger, or is there a conventional way to implement the message processing strategy?
Sincerely
See the Delayer.
The delay handler supports expression evaluation results that represent an interval in milliseconds (any Object whose toString() method produces a value that can be parsed into a Long) as well as java.util.Date instances representing an absolute time. In the first case, the milliseconds will be counted from the current time (e.g. a value of 5000 would delay the Message for at least 5 seconds from the time it is received by the Delayer). With a Date instance, the Message will not be released until the time represented by that Date object. In either case, a value that equates to a non-positive delay, or a Date in the past, will not result in any delay. Instead, it will be sent directly to the output channel on the original sender’s Thread. If the expression evaluation result is not a Date, and can not be parsed as a Long, the default delay (if any) will be applied.
You can add a MessageStore to hold the message if you don't want to lose messages that are currently delayed when the server crashes.
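With the Java DSL, this could look roughly like the following (a sketch only, assuming the payload exposes a getExecutionTimestamp() getter returning a java.util.Date or epoch milliseconds; channel names and the group id are placeholders):

import org.springframework.context.annotation.Bean;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;

@Bean
public IntegrationFlow delayedProcessingFlow() {
    return IntegrationFlows.from("inputChannel")
            .delay("delayerGroup", d -> d
                    // release immediately when the value is absent, non-positive, or in the past
                    .defaultDelay(0)
                    // absolute Date or millisecond value taken from the message payload
                    .delayExpression("payload.executionTimestamp"))
            .channel("processingChannel")
            .get();
}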

Is there a way to get offset for each message consumed in kafka streams?

In order to avoid re-reading messages that were processed but not yet committed when a Kafka Streams application is killed, I want to get the offset of each message along with the key and value, so that I can store it somewhere and use it to avoid reprocessing already processed messages.
Yes, this is possible. See the FAQ entry at http://docs.confluent.io/current/streams/faq.html#accessing-record-metadata-such-as-topic-partition-and-offset-information.
I'll copy-paste the key information below:
Accessing record metadata such as topic, partition, and offset information?
Record metadata is accessible through the Processor API.
It is also accessible indirectly through the DSL thanks to its
Processor API integration.
With the Processor API, you can access record metadata through a
ProcessorContext. You can store a reference to the context in an
instance field of your processor during Processor#init(), and then
query the processor context within Processor#process(), for example
(same for Transformer). The context is updated automatically to match
the record that is currently being processed, which means that methods
such as ProcessorContext#partition() always return the current
record’s metadata. Some caveats apply when calling the processor
context within punctuate(), see the Javadocs for details.
If you use the DSL combined with a custom Transformer, for example,
you could transform an input record’s value to also include partition
and offset metadata, and subsequent DSL operations such as map or
filter could then leverage this information.
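A minimal sketch of that Transformer approach (the ValueWithOffset wrapper is made up for illustration; it would be wired in with stream.transform(() -> new OffsetEnrichingTransformer<>())):

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

// Small value wrapper used only for this illustration.
class ValueWithOffset<V> {
    final V value;
    final String topic;
    final int partition;
    final long offset;

    ValueWithOffset(V value, String topic, int partition, long offset) {
        this.value = value;
        this.topic = topic;
        this.partition = partition;
        this.offset = offset;
    }
}

public class OffsetEnrichingTransformer<K, V> implements Transformer<K, V, KeyValue<K, ValueWithOffset<V>>> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        // Keep a reference; it is updated automatically for every record processed.
        this.context = context;
    }

    @Override
    public KeyValue<K, ValueWithOffset<V>> transform(K key, V value) {
        // topic/partition/offset of the record currently being processed
        return KeyValue.pair(key,
                new ValueWithOffset<>(value, context.topic(), context.partition(), context.offset()));
    }

    @Override
    public void close() {
    }
}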

Access to queue attributes?

I have a number of GenerateTableFetch processors that send FlowFiles to a downstream UpdateAttributes processor. From the UpdateAttributes, the FlowFile is passed to an ExecuteSQL processor.
Is there any way to add an attribute to a FlowFile coming off a queue with the position of that FlowFile in the queue? For example, after I reset/clear the state for a GenerateTableFetch, I would like to know if this is the first batch of FlowFiles coming from GenerateTableFetch. I can see the position of the FlowFile in the queue, but it would be nice if there were a way to add that as an attribute that is passed downstream. Is this possible?
This is not an available feature in Apache NiFi. The position of a flowfile in a queue is dynamic, and will change as flowfiles are removed from the queue, either by downstream processing or by flowfile expiration.
If you are simply trying to determine if the queue was empty before a specific flowfile was added, your best solution at this time is probably to use an ExecuteScript processor to get the desired connection via the REST API, then use FlowFileQueue#isActiveQueueEmpty() to determine if the specified queue is currently empty, and add a boolean attribute to the flowfile indicating it is the "first of a batch" or whatever logic you want to apply.
"Batches" aren't really a NiFi concept. Is there a specific action you want to take with the "first" flowfile? Perhaps there is other logic (i.e. the ExecuteSQL processor hasn't operated on a flowfile in x seconds, etc.) that could trigger your desired behavior.
