Adding a global store for a transformer to consume - apache-kafka-streams

Is there a way to add a global store for a Transformer to use? In the docs for transformer it says:
"Transform each record of the input stream into zero or more records in the output stream (both key and value type can be altered arbitrarily). A Transformer (provided by the given TransformerSupplier) is applied to each input record and computes zero or more output records. In order to assign a state, the state must be created and registered beforehand via stores added via addStateStore or addGlobalStore before they can be connected to the Transformer"
Yet the API for addGlobalStore only takes a ProcessorSupplier:
addGlobalStore(storeBuilder: StoreBuilder[_ <: StateStore],
topic: String,
consumed: Consumed[_, _],
stateUpdateSupplier: ProcessorSupplier[_, _])
My end goal is to use the Kafka Streams DSL with a Transformer, since I need a flatMap and to transform both keys and values to my output topic. I do not have a Processor in my topology, though.
I would expect something like this:
addGlobalStore(storeBuilder: StoreBuilder[_ <: StateStore], topic: String, consumed: Consumed[_, _], stateUpdateSupplier: TransformerSupplier[_, _])

The Processor that is passed into addGlobalStore() is used to maintain (ie, write) the store. Note that it's currently expected that this Processor copies the data as-is into the store (cf https://issues.apache.org/jira/browse/KAFKA-7663).
After you have added a global store, you can also add a Transformer and the Transformer can access the store. Note that it is not required to connect a global store to make it available (only "regular" stores need to be connected explicitly). Also note that a Transformer only gets read access to global stores.
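As an illustration, here is a minimal Java sketch of this setup (store, topic, and node names such as "global-lookup-store" and "lookup-topic" are made up for the example, and the usual Kafka Streams imports are assumed). The Processor given to addGlobalStore() only copies records into the store, and the Transformer reads the store without it being connected explicitly:
StreamsBuilder builder = new StreamsBuilder();

// global store; logging must be disabled for global stores
builder.addGlobalStore(
        Stores.keyValueStoreBuilder(
                Stores.inMemoryKeyValueStore("global-lookup-store"),
                Serdes.String(), Serdes.String()).withLoggingDisabled(),
        "lookup-topic",
        Consumed.with(Serdes.String(), Serdes.String()),
        () -> new Processor<String, String>() {
            private KeyValueStore<String, String> store;
            @Override
            public void init(ProcessorContext context) {
                store = (KeyValueStore<String, String>) context.getStateStore("global-lookup-store");
            }
            @Override
            public void process(String key, String value) {
                store.put(key, value);   // copy the record into the store as-is
            }
            @Override
            public void close() {}
        });

// the Transformer can read the global store without passing a store name to transform()
builder.<String, String>stream("input-topic")
        .transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
            private KeyValueStore<String, String> lookup;
            @Override
            public void init(ProcessorContext context) {
                lookup = (KeyValueStore<String, String>) context.getStateStore("global-lookup-store");
            }
            @Override
            public KeyValue<String, String> transform(String key, String value) {
                return KeyValue.pair(key, value + "|" + lookup.get(key));   // read-only lookup
            }
            @Override
            public void close() {}
        })
        .to("output-topic");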

Whenever there is a use case of looking up data from a GlobalStateStore, use a Processor instead of a Transformer for all the transformations you want to perform on the input topic. Use context.forward(key, value, childName) to send the data to the downstream nodes. context.forward(key, value, childName) may be called multiple times within process() and punctuate(), so as to send multiple records to a downstream node. If there is a requirement to update the GlobalStateStore, do this only in the Processor passed to addGlobalStore(..), because there is a GlobalStreamThread associated with the GlobalStateStore which keeps the state of the store consistent across all running KStream instances.
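A rough Processor API sketch of this pattern (node, topic, and store names are illustrative, usual imports assumed, and a global store named "global-lookup-store" is assumed to have been added to the topology via Topology#addGlobalStore as above):
Topology topology = new Topology();
topology.addSource("input", "input-topic");
topology.addProcessor("lookup-processor", () -> new Processor<String, String>() {
    private ProcessorContext context;
    private KeyValueStore<String, String> globalStore;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        globalStore = (KeyValueStore<String, String>) context.getStateStore("global-lookup-store");
    }

    @Override
    public void process(String key, String value) {
        // forward() may be called multiple times per record; To.child(...) targets a named
        // child node (older versions also offer forward(key, value, childName) directly)
        context.forward(key, value + "|" + globalStore.get(key), To.child("enriched-sink"));
        context.forward(key, value, To.child("raw-sink"));
    }

    @Override
    public void close() {}
}, "input");
topology.addSink("enriched-sink", "enriched-topic", "lookup-processor");
topology.addSink("raw-sink", "raw-topic", "lookup-processor");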

Related

Best practices for transforming a batch of records using KStream

I am new to KStream and would like to know best practices or guidance on how to optimally process a batch of records of size n using KStream. I have working code as shown below, but it only works for single messages at a time.
KStream<String, String> sourceStream = builder.stream("upstream-kafka-topic",
        Consumed.with(Serdes.String(), Serdes.String()));
// transform sourceStream using an implementation of ValueTransformer<String, String>
sourceStream.transformValues(() -> new MyValueTransformer())
        .to("downstream-kafka-topic",
                Produced.with(Serdes.String(), Serdes.String()));
The above code works with single records, as MyValueTransformer (which implements ValueTransformer) transforms a single String value. How do I make the above code work for a Collection of String values?
You would need to somehow "buffer / aggregate" the messages. For example, you could add a state store to your transformer and store N messages inside the store. As long as the store contains fewer than N messages you don't do any processing and also don't emit any output (you might want to use flatTransformValues which allows you to emit zero results).
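A rough sketch of that buffering idea, reworking the pipeline from the question (the store name "buffer-store", the batch size, and the per-record batch logic are placeholders; usual Kafka Streams imports assumed):
StoreBuilder<KeyValueStore<String, String>> bufferStore =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("buffer-store"),
                Serdes.String(), Serdes.String());
builder.addStateStore(bufferStore);

sourceStream
        .flatTransformValues(() -> new ValueTransformer<String, Iterable<String>>() {
            private final int batchSize = 100;                   // illustrative batch size
            private KeyValueStore<String, String> store;

            @Override
            public void init(ProcessorContext context) {
                store = (KeyValueStore<String, String>) context.getStateStore("buffer-store");
            }

            @Override
            public Iterable<String> transform(String value) {
                store.put(UUID.randomUUID().toString(), value);  // buffer the incoming record
                List<KeyValue<String, String>> buffered = new ArrayList<>();
                try (KeyValueIterator<String, String> it = store.all()) {   // scan is illustrative, not optimized
                    it.forEachRemaining(buffered::add);
                }
                if (buffered.size() < batchSize) {
                    return Collections.emptyList();              // emit nothing until enough records are buffered
                }
                List<String> batchResult = new ArrayList<>();
                for (KeyValue<String, String> kv : buffered) {
                    batchResult.add(kv.value.toUpperCase());     // stand-in for the real batch logic
                    store.delete(kv.key);                        // clear the buffer
                }
                return batchResult;
            }

            @Override
            public void close() {}
        }, "buffer-store")
        .to("downstream-kafka-topic", Produced.with(Serdes.String(), Serdes.String()));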
I'm not sure what you're trying to achieve. Kafka Streams is by design built to process one record at a time. If you want to process a collection or batch of messages, you have a few options.
You might not actually need Kafka Streams, as the example you mentioned doesn't do much with the message; in that case you can leverage a plain Consumer, which will enable you to process in batches. Check the Spring Kafka implementation of this here -> https://docs.spring.io/spring-kafka/docs/current/reference/html/#receiving-messages (Kafka fetches batches on the network layer, but normally you would process one record at a time; with a standard client it's possible to process batches). Alternatively, you might model your value to contain an array of messages, so each record carries an embedded collection that Kafka Streams can then process; check the array type for Avro -> https://avro.apache.org/docs/current/spec.html#Arrays
Check this part of the documentation to better understand the Kafka Streams concepts -> https://kafka.apache.org/31/documentation/streams/core-concepts
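For the plain-consumer route, a minimal sketch with the standard Java consumer rather than Spring Kafka (topic, group id, and bootstrap servers are placeholders): each poll() hands back a whole batch of records that can be processed together.
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "batch-consumer-group");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);          // upper bound on the batch size

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("upstream-kafka-topic"));
    while (true) {
        // poll() returns a batch of records
        ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
        List<String> values = new ArrayList<>();
        batch.forEach(record -> values.add(record.value()));
        // process `values` as one collection here, then commit
        consumer.commitSync();
    }
}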

Which guarantees does Kafka Stream provide when using a RocksDb state store with changelog?

I'm building a Kafka Streams application that generates change events by comparing every new calculated object with the last known object.
So for every message on the input topic, I update an object in a state store and every once in a while (using punctuate), I apply a calculation on this object and compare the result with the previous calculation result (coming from another state store).
To make sure this operation is consistent, I do the following after the punctuate triggers:
1. write a tuple to the state store
2. compare the two values, create change events and context.forward() them, so the events go to the results topic
3. swap the tuple for the new_value and write it to the state store
I use this tuple for scenarios where the application crashes or rebalances, so I can always send out the correct set of events before continuing.
Now, I noticed the resulting events are not always consistent, especially if the application frequently rebalances. It looks like in rare cases the Kafka Streams application emits events to the results topic, but the changelog topic is not up to date yet. In other words, I produced something to the results topic, but my changelog topic is not at the same state yet.
So, when I do a stateStore.put() and the method call returns successfully, are there any guarantees when it will be on the changelog topic?
Can I enforce a changelog flush? When I do context.commit(), when will that flush+commit happen?
To get complete consistency, you will need to enable processing.guarantee="exactly_once" -- otherwise, in case of a failure, you might get inconsistent results.
If you want to stay with "at_least_once", you might want to use a single store and update the store after processing is done (ie, after calling forward()). This minimizes the time window in which inconsistencies can occur.
And yes, if you call context.commit(), all stores will be flushed to disk and all pending producer writes will be flushed before the input topic offsets are committed.
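For reference, a sketch of the relevant config (application id and bootstrap servers are placeholders; imports assumed):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "change-event-app");       // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // placeholder
// "exactly_once" as discussed above; newer versions offer "exactly_once_v2" as its successor
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);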

Kafka Streams: How to avoid forwarding downstream twice when repartitioning

In my application I have KafkaStreams instances with a very simple topology: there is one processor, with a key-value store, and each incoming message gets written to the store and is then forwarded downstream to a sink.
I would like to increase the number of partitions I have for my source topic, and then reprocess the data, so that each store will contain only keys relevant to its partition. (I understand this is done using the Application Reset Tool). However, while reprocessing the data, I don't want to forward anything downstream; I want only new data to be forwarded. (Otherwise, consumers of the result topic will handle old values again). My question: is there an easy way to achieve this? Any built-in mechanism that can assist me in telling reprocessed data and new data apart, maybe?
Thank you in advance
There is no built-in mechanism. But you might be able to just remove the sink operation that is writing to the result topic while you reprocess your data -- when reprocessing is done, you stop the application, add the sink again, and restart. Not sure if this works for you.
Another possible solution might be to use transform() and implement an offset-based filter. For each input topic partition, you get the offset of the first new message (this is something you need to do manually before you write the Transformer). You use this information to implement a filter as a custom Transformer: for each input record, you check the record's partition and offset and drop it if the record's offset is smaller than the offset of the first new message of this partition.
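A sketch of such a filter (class and variable names are made up; the cutoff offsets per partition are assumed to be looked up manually beforehand, as described above):
import java.util.Map;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

public class OffsetFilterTransformer<K, V> implements Transformer<K, V, KeyValue<K, V>> {
    private final Map<Integer, Long> firstNewOffsetPerPartition;
    private ProcessorContext context;

    public OffsetFilterTransformer(Map<Integer, Long> firstNewOffsetPerPartition) {
        this.firstNewOffsetPerPartition = firstNewOffsetPerPartition;
    }

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<K, V> transform(K key, V value) {
        long cutoff = firstNewOffsetPerPartition.getOrDefault(context.partition(), 0L);
        // drop everything written before the first "new" offset of this partition
        return context.offset() < cutoff ? null : KeyValue.pair(key, value);   // null = emit nothing
    }

    @Override
    public void close() {}
}
It would be plugged in before the sink via something like sourceStream.transform(() -> new OffsetFilterTransformer<>(cutoffOffsets)).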

Is there a way to get offset for each message consumed in kafka streams?

In order to avoid re-reading messages that were processed but whose offsets were not committed when a Kafka Streams application is killed, I want to get the offset for each message along with the key and value, so that I can store it somewhere and use it to avoid reprocessing already-processed messages.
Yes, this is possible. See the FAQ entry at http://docs.confluent.io/current/streams/faq.html#accessing-record-metadata-such-as-topic-partition-and-offset-information.
I'll copy-paste the key information below:
Accessing record metadata such as topic, partition, and offset information?
Record metadata is accessible through the Processor API. It is also accessible indirectly through the DSL thanks to its Processor API integration.
With the Processor API, you can access record metadata through a ProcessorContext. You can store a reference to the context in an instance field of your processor during Processor#init(), and then query the processor context within Processor#process(), for example (same for Transformer). The context is updated automatically to match the record that is currently being processed, which means that methods such as ProcessorContext#partition() always return the current record's metadata. Some caveats apply when calling the processor context within punctuate(), see the Javadocs for details.
If you use the DSL combined with a custom Transformer, for example, you could transform an input record's value to also include partition and offset metadata, and subsequent DSL operations such as map or filter could then leverage this information.
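A minimal sketch of that DSL pattern (topic names and the output format are made up; usual imports assumed):
builder.<String, String>stream("input-topic")
        .transformValues(() -> new ValueTransformer<String, String>() {
            private ProcessorContext context;

            @Override
            public void init(ProcessorContext context) {
                this.context = context;   // always reflects the record currently being processed
            }

            @Override
            public String transform(String value) {
                return value + "|partition=" + context.partition() + "|offset=" + context.offset();
            }

            @Override
            public void close() {}
        })
        .to("enriched-topic");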

Significance of boolean direct in OutputFieldsDeclarer.declare(boolean direct, Fields fields)

Looking at the OutputFieldsDeclarer class, I see there is an overloaded method for declare(...) with a boolean flag direct.
If I use the method declare(Fields fields), it sets this boolean flag to false.
I am not sure how Storm interprets this boolean flag internally while processing with Spouts and Bolts.
Can somebody explain the significance of this flag to me?
If you declare a direct stream (ie, setting the flag to true), you need to emit tuples via the
collector.emitDirect(...)
methods (collector.emit(...) is not allowed for direct streams). Those methods require you to specify the consumer task ID that should receive the tuple.
Furthermore, when connecting a consumer to a direct stream, you need to specify
builder.setBolt(....).directGrouping("direct-emitting-bolt", "direct-stream-Id");
All other connection patterns are not allowed on direct streams.
Direct streams have the advantage that you have fine-grained control over the data distribution from producer to consumer: you can implement any imaginable distribution pattern. Of course, direct streams are also more difficult to handle. For example, you need to know the task IDs of the subscribed consumers (those can be looked up in the TopologyContext provided via Bolt.prepare()).
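For illustration, a rough sketch of a direct-emitting bolt (component, stream, and field names such as "consumer-bolt" and "direct-stream-id" are made up; Storm 2.x package names and method signatures are assumed):
import java.util.List;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class DirectEmittingBolt extends BaseRichBolt {
    private OutputCollector collector;
    private List<Integer> consumerTaskIds;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // task IDs of the subscribed consumer bolt, looked up from the TopologyContext
        this.consumerTaskIds = context.getComponentTasks("consumer-bolt");
    }

    @Override
    public void execute(Tuple input) {
        // choose the target task explicitly -- any custom distribution pattern is possible
        int target = consumerTaskIds.get(
                Math.abs(input.getString(0).hashCode()) % consumerTaskIds.size());
        collector.emitDirect(target, "direct-stream-id", new Values(input.getString(0)));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declareStream("direct-stream-id", true, new Fields("word"));   // direct = true
    }
}
The consumer bolt would then subscribe with directGrouping("direct-emitting-bolt", "direct-stream-id") as shown above.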
