As part of our application logic, we use a Kafka Streams state store for range lookups. The data is loaded from a Kafka topic using the builder.table() method.
The problem is that the source topic's key is serialised as JSON, which doesn't suit the binary key comparisons used internally by the RocksDB-based state store.
We were hoping to use a separate serde for the keys by passing it to Materialized.as(). However, it looks like the Streams implementation resets whatever is passed there back to the original serdes used to read from the table topic.
This is what I can see in the StreamsBuilder internals:
public synchronized <K, V> KTable<K, V> table(final String topic,
                                              final Consumed<K, V> consumed,
                                              final Materialized<K, V, KeyValueStore<Bytes, byte[]>> materialized) {
    Objects.requireNonNull(topic, "topic can't be null");
    Objects.requireNonNull(consumed, "consumed can't be null");
    Objects.requireNonNull(materialized, "materialized can't be null");
    materialized.withKeySerde(consumed.keySerde).withValueSerde(consumed.valueSerde);
    return internalStreamsBuilder.table(topic,
                                        new ConsumedInternal<>(consumed),
                                        new MaterializedInternal<>(materialized, internalStreamsBuilder, topic + "-"));
}
Does anybody know why it's done this way, and whether it's possible to use a different serde for a DSL state store?
Please don't propose using the Processor API; that route is well explored. I would like to avoid writing a processor and a custom state store every time I need to massage data before saving it into a state store.
After some digging through the Streams sources, I found out that I can pass a custom Materialized.as() to a filter() with an always-true predicate, but it smells a bit hackish.
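For reference, a rough sketch of that workaround (untested, reusing the serde and store names from my code below): the always-true filter() materializes a second store whose serdes are not overwritten from Consumed.

Serde<Value> valueSerde = new JSONValueSerde();

KTable<Key, Value> table = builder
    .table(tableTopic, Consumed.with(new JSONKeySerde(), valueSerde))
    .filter((key, value) -> true,
            Materialized.<Key, Value, KeyValueStore<Bytes, byte[]>>as(cacheStoreName)
                .withKeySerde(new BinaryComparisonsCompatibleKeySerde())
                .withValueSerde(valueSerde));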
For comparison, this is my original code, which unfortunately doesn't work as we hoped, because of the serde reset described above.
Serde<Value> valueSerde = new JSONValueSerde();
KTable<Key, Value> table = builder.table(
    tableTopic,
    Consumed.with(new JSONKeySerde(), valueSerde),
    Materialized.as(cacheStoreName)
        .withKeySerde(new BinaryComparisonsCompatibleKeySerde())
        .withValueSerde(valueSerde)
);
The code works by design. From a Streams point of view, there is no reason to use a different Serde for the store than for reading the data from the topic, because it is known to be the same data. Thus, if one does not use the default Serdes from the StreamsConfig, it's sufficient to specify the Serde once (in Consumed), and it's not required to specify it in Materialized again.
For your special case, you could read the topic as a stream and do a "dummy aggregation" that just returns the latest value per record (instead of computing an actual aggregate). This allows you to specify a different Serde for the result type.
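A minimal sketch of that approach (untested, reusing the serde and store names from the question, and assuming a Streams version that has Grouped): read the topic as a KStream, group it with the store-friendly key Serde, and reduce to the latest value.

Serde<Value> valueSerde = new JSONValueSerde();
Serde<Key> storeKeySerde = new BinaryComparisonsCompatibleKeySerde();

KTable<Key, Value> table = builder
    .stream(tableTopic, Consumed.with(new JSONKeySerde(), valueSerde))
    .groupByKey(Grouped.with(storeKeySerde, valueSerde))
    .reduce(
        (oldValue, newValue) -> newValue, // "dummy aggregation": just keep the latest value
        Materialized.<Key, Value, KeyValueStore<Bytes, byte[]>>as(cacheStoreName)
            .withKeySerde(storeKeySerde)
            .withValueSerde(valueSerde));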
Related
I'm using the DSL API, and I have a use case where I need to check a condition and then, if true, send an additional message to a separate topic from the happy path. My question is: how can I attach child processors to parents in the DSL API? Is it as simple as caching a stream variable, using it in two subsequent places, and naming those stream processors? Here's some brief code that explains what I'm trying to do. I am using the DSL API because I need the foreignKeyJoin.
var myStream = stream.process(myProcessorSupplier); // 3.3 returns a stream

myStream.to("happyThingTopic"); // Q: will the forward ever land here?
myStream.map(myKvMapper, Named.as("what-is-this")).to("myOtherTopic"); // or will the forward land here?

public KeyValue<String, Object> process(Object key, Object value) {
    if (value.hasFlag) {
        processorContext.forward(key, new OtherThing(), "what-is-this");
    }
    return new KeyValue<>(key, new HappyThing(value));
}
I am new to Camel, and I am reading a data file (flat text file) using multiple threads.
I am able to read the data and store it into a POJO, but I am getting an issue in validation.
While reading line by line, I store the records in a map in the context. The validation has to work across lines: for example, some records contain ids that should have been defined in an earlier record. Here is my route:
get("readerService","fileReader")
.log(...)
.process(e->readerService.init(...))
.split(...)
.streaming()
.parallelProcessing(...)
.threads(...)
.process(e -> readerService.read(e, ctx)
// tranforms read record to pojo and stores into map in ctx
.process(e -> recTransformer(e, ctx)
.process(e -> myValidator(e, ctx)
.end()
The problem is that the reader service reads the records in an arbitrary order, and the transformer creates the POJO and stores it into the map in the context. But the validator also gets called for each record, before the complete file has been read.
In some cases it tries to validate a record while a previous record has not yet been stored into the map, due to the parallel processing.
I would like to know how to call .process(e -> myValidator(e, ctx)) only after the entire file is read and the map is complete. That would resolve my issue. Any suggestion would be great.
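One possible arrangement (just a sketch, reusing the service names from the route above; the from endpoint and the line tokenizer are made up): move the validator after the split's .end(), because the route only continues past .end() once all split parts, including the parallel ones, have completed, so the map is complete at that point.

from("direct:fileReader")
    .process(e -> readerService.init(e))
    .split(body().tokenize("\n")).streaming().parallelProcessing()
        .process(e -> readerService.read(e, ctx))   // read one record
        .process(e -> recTransformer(e, ctx))       // build the POJO, store it in the map
    .end()                                          // continues only after all parts are done
    .process(e -> myValidator(e, ctx));             // runs once, with the complete map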
I am using a Processor API (PAPI) topology.
Is it possible to access a KTable (or GlobalKTable) created with the DSL from within the Processor API (even if read-only)?
I.e., using:
val builder = new StreamsBuilder()
val ktable = builder.table("topicname")
I get a KTable, but the Topology only allows you to use addStateStore with a StoreBuilder, not the KTable itself.
.addStateStore(myStoreBuilder, MY_PROCESSOR_NAME)
So I could build one by doing this:
def keyValueStoreBuilder[K, V](storeName: String, keySerde: Serde[K], valueSerde: Serde[V]): StoreBuilder[KeyValueStore[K, V]] = {
Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore(storeName),
keySerde,
valueSerde)
}
But how do I cleanly obtain the storeName in this case?
When you create a KTable, it will automatically create a store internally with a generated name. (You can get the name via Topology#describe().) You can also assign a name to the store via the table() method using the Materialized parameter.
It's a little unclear to me what you mean by "access a KTable within the Processor API", though. If you mean "access the KTable's store within a Processor", you can use Topology#connectProcessorAndStateStores() to give the processor access to the store. Note that the processor should never write into the KTable's store, as the table() operator is responsible for maintaining the table's state. If you do write into the store, there are no guarantees and you might lose data in case of a failure.
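A rough sketch of that wiring in Java (the topic, store, and processor names, and the MyProcessor class, are made up for illustration): name the table's store via Materialized, then connect a PAPI processor to it.

StreamsBuilder builder = new StreamsBuilder();
// assign an explicit store name instead of relying on the generated one
builder.table("table-topic", Materialized.as("my-table-store"));

Topology topology = builder.build();
topology.addSource("my-source", "input-topic");
topology.addProcessor("my-processor", MyProcessor::new, "my-source");
// read-only access to the store that the table() operator maintains
topology.connectProcessorAndStateStores("my-processor", "my-table-store");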
RocksDBStore<K,V> stores keys and values as byte[] on disk. It converts to and from K- and V-typed objects using the Serdes provided while constructing the RocksDBStore<K,V> object.
Given this, please help me understand the purpose of the following code in RocksDbKeyValueBytesStoreSupplier:
return new RocksDBStore<>(name,
Serdes.Bytes(),
Serdes.ByteArray());
Providing Serdes.Bytes() and Serdes.ByteArray() looks redundant.
RocksDbKeyValueBytesStoreSupplier was introduced in KAFKA-5650 (Kafka Streams 1.0.0) as part of KIP-182: Reduce Streams DSL overloads and allow easier use of custom storage engines.
In KIP-182, there is the following sentence:
The new Interface BytesStoreSupplier supersedes the existing StateStoreSupplier (which will remain untouched). This so we can provide a convenient way for users creating custom state stores to wrap them with caching/logging etc if they chose. In order to do this we need to force the inner most store, i.e, the custom store, to be a store of type <Bytes, byte[]>.
Please help me understand why we need to force custom stores to be of type <Bytes, byte[]>?
Another place (KAFKA-5749) where I found a similar sentence:
In order to support bytes store we need to create a MeteredSessionStore and ChangeloggingSessionStore. We then need to refactor the current SessionStore implementations to use this. All inner stores should by of type < Bytes, byte[] >
Why?
Your observation is correct -- the PR implementing KIP-182 missed removing the Serdes from RocksDBStore that are no longer required. This was already fixed in the 1.1 release.
I have multiple MessagePostProcessors in Spring AMQP, which I set using the SimpleMessageListenerContainer.setAfterReceivePostProcessors API. My question is: are these MessagePostProcessors called in the order I have listed them?
Pseudo code:
SimpleMessageListenerContainer container = // api returning SimpleMessageListenerContainer object
container.setAfterReceivePostProcessors(new MessagePostProcessor[] {
        messagePostProcessors1, messagePostProcessors2 });
So does Spring AMQP call messagePostProcessors1 followed by messagePostProcessors2 in sequence, or does it pick them in an arbitrary order?
If the order is arbitrary, is there any way to enforce it, i.e. so that messagePostProcessors2 always gets called after messagePostProcessors1?
Akshat, the order is based on the order that is set on the processors. Quoting the documentation below. When I look at the concrete implementations of the processors, I find there is a setOrder method (from the Ordered interface, I think). Maybe setting that on your message post processors will do the trick.
public void setAfterReceivePostProcessors(MessagePostProcessor... afterReceivePostProcessors)

Set a MessagePostProcessor that will be invoked immediately after a Channel#basicGet() and before any message conversion is performed. May be used for operations such as decompression. Processors are invoked in order, depending on PriorityOrder, Order and finally unordered.
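For example, a post-processor can control its position by implementing Ordered; this is just an illustrative sketch (the class name and order value are made up):

import org.springframework.amqp.AmqpException;
import org.springframework.amqp.core.Message;
import org.springframework.amqp.core.MessagePostProcessor;
import org.springframework.core.Ordered;

public class DecompressingPostProcessor implements MessagePostProcessor, Ordered {

    @Override
    public Message postProcessMessage(Message message) throws AmqpException {
        // ... modify the received message here ...
        return message;
    }

    @Override
    public int getOrder() {
        return 1; // lower values run before post-processors with higher order values
    }
}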