StateStoreSupplier to store a sequence in KafkaStreams - apache-kafka-streams

I need to re-sequence the data that comes from two topics (merged using an outer join). Is it good practice to use a StateStore to keep the latest sequence and to modify the downstream value with the re-sequenced message?
Simplified problem:
(seq from topic A, seq from topic B) -> new seq to output (keeping the current sequence in the StateStore)
(10,100) -> 1
(11,101) -> 2
(12,102) -> 3
(...,...) -> ...
The new sequence would be stored as the value for the key "currentSeq" in the state store. The sequence would be incremented on each message and stored back to the state store.

You should use the Processor API with a registered (maybe custom) state store.
You can also mix and match the Processor API with the DSL using process(), transform(), or transformValues() and reference a state store by name (see the sketch after the links below).
See
How to add a custom StateStore to the Kafka Streams DSL processor?
How to filter keys and value with a Processor using Kafka Stream DSL
How to create a state store with HashMap as value in Kafka streams?
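Below is a minimal sketch of that approach using transformValues() with a registered key-value store. The store name "seq-store", the String value types, and the method wiring are assumptions for illustration only:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class ResequenceExample {

    // Replaces each joined record's value with a monotonically increasing sequence
    // number kept in a key-value store under the key "currentSeq".
    public static KStream<String, Long> resequence(final StreamsBuilder builder,
                                                   final KStream<String, String> joined) {
        final StoreBuilder<KeyValueStore<String, Long>> seqStore = Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("seq-store"),
                Serdes.String(),
                Serdes.Long());
        builder.addStateStore(seqStore);

        return joined.transformValues(() -> new ValueTransformerWithKey<String, String, Long>() {
            private KeyValueStore<String, Long> store;

            @Override
            @SuppressWarnings("unchecked")
            public void init(final ProcessorContext context) {
                store = (KeyValueStore<String, Long>) context.getStateStore("seq-store");
            }

            @Override
            public Long transform(final String key, final String value) {
                // Read the current sequence, increment it, and write it back.
                final Long current = store.get("currentSeq");
                final long next = (current == null) ? 1L : current + 1L;
                store.put("currentSeq", next);
                return next;
            }

            @Override
            public void close() { }
        }, "seq-store");
    }
}

Note that a single global sequence like this only works if all records flow through one partition/task; otherwise each task keeps its own counter.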

Related

Why Kafka streams creates topics for aggregation and joins

I recently created my first Kafka Streams application for learning. I used spring-cloud-stream-kafka-binding. This is a simple eCommerce system, in which I am reading a topic called products, which receives a product entry whenever new stock of a product comes in. I am aggregating the quantity to get the total quantity of a product.
I had two choices:
Send the aggregate details (KTable) to another kafka topic called aggregated-products
Materialize the aggregated data
I opted for the second option, and I found that the application created a Kafka topic by itself; when I consumed messages from that topic, I got the aggregated messages.
.peek((k, v) -> LOGGER.info("Received product with key [{}] and value [{}]", k, v))
.groupByKey()
.aggregate(Product::new,
        (key, value, aggregate) -> aggregate.process(value),
        Materialized.<String, Product, KeyValueStore<Bytes, byte[]>>as(PRODUCT_AGGREGATE_STATE_STORE)
                .withValueSerde(productEventSerde)
                // .withKeySerde(keySerde) omitted because keySerde is configured in application.properties
);
Using InteractiveQueryService, I am able to access this state store in my application to find out the total quantity available for a product.
Now I have a few questions:
Why did the application create a new Kafka topic?
If the answer is 'to store the aggregated data', how is this different from option 1, in which I could have sent the aggregated data myself?
Where does RocksDB come into the picture?
The code of my application (which does more than what I explained here) can be accessed at this link:
https://github.com/prashantbhardwaj/kafka-stream-example/blob/master/src/main/java/com/appcloid/kafka/stream/example/config/SpringStreamBinderTopologyBuilderConfig.java
The internal topics are called changelog topics and are used for fault tolerance. The state of the aggregation is stored both locally on disk using RocksDB and on the Kafka brokers in the form of a changelog topic, which is essentially a "backup". If a task is moved to a new machine, or the local state is lost for some other reason, Kafka Streams can restore the local state by reading all changes to the original state from the changelog topic and applying them to a new RocksDB instance. After restoration has finished (the whole changelog topic was processed), the same state should be available on the new machine, and the new machine can continue processing where the old one stopped. There are a lot of intricate details to this (e.g., with the default settings, the state can be updated twice for the same input record when failures happen).
See also https://developer.confluent.io/learn-kafka/kafka-streams/stateful-fault-tolerance/
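As an illustration of how the changelog relates to the code above, the Materialized from the question can also carry explicit changelog (logging) configuration. The helper method and the extra topic config below are made up for this sketch; by default the changelog topic is named <application.id>-<store name>-changelog:

import java.util.Map;

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class ChangelogConfigExample {

    // Builds a Materialized like the one used in the aggregation above. Kafka Streams
    // backs the store with a changelog topic named "<application.id>-<store name>-changelog".
    static <V> Materialized<String, V, KeyValueStore<Bytes, byte[]>> materializedFor(
            final String storeName, final Serde<V> valueSerde) {
        return Materialized.<String, V, KeyValueStore<Bytes, byte[]>>as(storeName)
                .withValueSerde(valueSerde)
                // Changelogging is enabled by default; extra topic configs (this one is
                // only an example) can be passed through to the changelog topic.
                .withLoggingEnabled(Map.of("min.insync.replicas", "2"));
    }
}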

kstream topology with in-memory state store data not committed

I need to aggregate client information and push it to an output topic every hour.
I have a topology with:
input-topic
processor
sink topic
Data arrives in the input topic with a String key that contains a client ID concatenated with a date in YYYYMMDDHH format.
In my processor I use a simple InMemoryKeyValueStore (with caching disabled) to merge/aggregate data with specific rules (data is sometimes not aggregated, depending on business logic).
In a punctuator, every hour the program scans the state store to get all the messages, transforms them, and forwards them to the sink topic, after which I delete the processed messages from the state store.
After the punctuation, I check the size of the store, which is effectively empty (via .all() and approximateNumEntries()); everything is OK.
But when I restart the application, the state store is restored with all the elements that should have been deleted.
When I manually read the changelog topic of the state store (with a simple KafkaConsumer), I see that there are two records for each key:
The first record is committed, and the message contains my aggregation.
The second record is a deletion message (a record with a null value), but it is not committed (visible only with read_uncommitted), which is dangerous in my case because the next punctuation will forward the aggregate again.
I have played with committing in the punctuator that forwards, and I have created another punctuator that commits the context periodically (every 3 seconds), but after the restart my data is still restored in the store (which is expected, since my delete messages are not committed).
I have a classic Kafka Streams configuration:
acks=all
enable.idempotence=true
processing.guarantee=exactly_once_v2
commit.interval.ms=100
isolation.level=read_committed
with the latest version of the kafka-streams library (3.2.2) and a cluster on 2.6.
Any help to get my state store records committed is welcome. I don't use TimeWindowedKStream, which does not exactly match my need (sometimes I don't aggregate but forward directly).
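For reference, here is a rough sketch of the processor/punctuator pattern the question describes. The store name "client-agg-store", the String value types, and the trivial "last value wins" merge rule are all placeholders, and the in-memory store must be registered on the topology separately:

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class HourlyForwardProcessor implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;
    private KeyValueStore<String, String> store;

    @Override
    public void init(final ProcessorContext<String, String> context) {
        this.context = context;
        this.store = context.getStateStore("client-agg-store");
        // Every hour, forward everything in the store to the sink and then delete it.
        context.schedule(Duration.ofHours(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            final List<String> forwarded = new ArrayList<>();
            try (KeyValueIterator<String, String> it = store.all()) {
                while (it.hasNext()) {
                    final KeyValue<String, String> entry = it.next();
                    context.forward(new Record<>(entry.key, entry.value, timestamp));
                    forwarded.add(entry.key);
                }
            }
            // Remove the forwarded entries after closing the iterator.
            forwarded.forEach(store::delete);
        });
    }

    @Override
    public void process(final Record<String, String> record) {
        // Merge/aggregate according to business rules (simplified to "last value wins" here).
        store.put(record.key(), record.value());
    }
}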

Execute code when two previous events have been processed (Apache Kafka)

I'm new to Apache Kafka and Spring Boot. I'm trying to create a Spring Boot listener that generates a new event only when two specific messages (sent through Apache Kafka) have been received for a given resource.
The obvious solution is to use the database to change the status of the resource when the first event comes, and execute the code when the second event comes (if the customer is in the correct status in the database). In this case, I'm worried about both events arriving at the same time.
Is there a way to aggregate both messages in Spring Boot/Apache Kafka instead of doing this manually?
Thanks.
You can do it with Kafka Streams. Example topology (a DSL sketch follows the documentation link below):
input stream (key/value from input topic A)
filter (filter by event type for example)
groupBy (group events by key or some field)
aggregate (aggregate events into new data structure)
filter (verify if the aggregate is complete)
map (generate new output event with aggregate values)
output stream (key/value to topic B)
Check details in official doc: https://kafka.apache.org/24/documentation/streams/developer-guide/dsl-api.html#creating-source-streams-from-kafka
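A minimal DSL sketch of that topology follows; the topic names, the event-type strings, and the string-based aggregate are illustrative assumptions only (a real application would use a proper value type and serde):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class TwoEventTopology {

    public static void build(final StreamsBuilder builder) {
        final KStream<String, String> input =
                builder.stream("topic-A", Consumed.with(Serdes.String(), Serdes.String()));

        input
                // filter: keep only the two event types we care about
                .filter((resourceId, eventType) ->
                        "EVENT_1".equals(eventType) || "EVENT_2".equals(eventType))
                // groupBy: group events by resource id (the record key)
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                // aggregate: remember which event types have been seen per resource
                .aggregate(
                        () -> "",
                        (resourceId, eventType, seen) ->
                                seen.contains(eventType) ? seen : seen + eventType + ";",
                        Materialized.with(Serdes.String(), Serdes.String()))
                // filter: only keep resources for which both events have arrived
                .filter((resourceId, seen) ->
                        seen.contains("EVENT_1") && seen.contains("EVENT_2"))
                // map: generate the new output event
                .mapValues(seen -> "BOTH_EVENTS_RECEIVED")
                // output stream: write to topic B
                .toStream()
                .to("topic-B", Produced.with(Serdes.String(), Serdes.String()));
    }
}

Since the aggregation result is a KTable, downstream consumers see an update per input event; depending on the use case you may want to deduplicate or suppress intermediate updates.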

How to run more than 1 application instances of ktable-ktable joins kafka streams application on single partitioned kafka topics?

KTable<Key1, GenericRecord> primaryTable = createKTable(key1, kstream, "statestore-name");
KTable<Key2, GenericRecord> childTable1 = createKTable(key1, kstream, "statestore-name");
KTable<Key3, GenericRecord> childTable2 = createKTable(key1, kstream, "statestore-name");
primaryTable.leftJoin(childTable1, (primary, child1) -> compositeObject)
        .leftJoin(childTable2, (compositeObject, child2) -> compositeObject,
                Materialized.as("compositeobject-statestore"))
        .toStream().to("composite-topics");
For my application I am using KTable-KTable joins, so that whenever data is received on the primary or a child stream, it can be set on a compositeObject that has setters and getters for all three tables. The three incoming streams have different keys, but while creating the KTables I make the keys the same for all three.
All my topics have a single partition. When I run the application on a single instance, everything runs fine; I can see compositeObject populated with data from all three tables.
All interactive queries also run fine, passing the record ID and the local state store name.
But when I run two instances of the same application, I see compositeObject with primary and child1 data, while child2 remains empty. Even if I try to call the state store using an interactive query, it doesn't return anything.
I am using the spring-cloud-stream-kafka-streams libraries for writing the code.
Please suggest why it is not being set and what the right solution to handle this would be.
Kafka Streams' scaling model is coupled to the number of input topic partitions: if your input topics have only a single partition, you cannot scale out. The number of input topic partitions determines your maximum parallelism.
You would therefore need to create new topics with more partitions.
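For illustration, new input topics with more partitions could be created with the AdminClient. Topic names, partition count, and replication factor here are placeholders; note that topics joined via KTable-KTable joins must stay co-partitioned, i.e. have the same partition count:

import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePartitionedTopics {

    public static void main(final String[] args) throws Exception {
        final Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            final short replicationFactor = 1;
            // Create co-partitioned input topics with 4 partitions each so that
            // up to 4 instances/threads can share the work.
            admin.createTopics(List.of(
                    new NewTopic("primary-input-4p", 4, replicationFactor),
                    new NewTopic("child1-input-4p", 4, replicationFactor),
                    new NewTopic("child2-input-4p", 4, replicationFactor)))
                .all()
                .get();
        }
    }
}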

store kafka-streams table in data store

I create a KTable<Integer, CustomObject>, and now I want to store the data from this KTable in a MySQL DB.
Is it possible to save a KTable in a DB? I checked the Materialized class, but I do not see an appropriate method for it.
final KTable<Integer, Result> result =
users_table.join(photos_table, (a, b) -> Result.from(a, b));
Or is it only possible with the Consumer API, when I read from the "my-results" topic?
Materialized is for configuring/setting the store used by Kafka Streams -- if you don't have a good reason to change it, it's recommended to use the default setting.
If you want to put the data into an external DB, you should write the KTable into a topic via KTable#toStream()#to("topic") and use Kafka Connect to load the data from the topic into the DB.
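A small sketch of the Kafka Streams side; the topic name "my-results" comes from the question, while the method name and the value serde parameter are placeholders (the DB loading itself would be handled by a Kafka Connect sink connector such as the JDBC sink):

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class ExportResults {

    // Writes the joined KTable as a changelog stream to the "my-results" topic,
    // which a Kafka Connect sink connector can then load into MySQL.
    static <V> void export(final KTable<Integer, V> result, final Serde<V> valueSerde) {
        result.toStream()
              .to("my-results", Produced.with(Serdes.Integer(), valueSerde));
    }
}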

Resources