Kafka Streams work with JoinWindows for data replay - apache-kafka-streams

I have 2 streams of data and I want to be able to join them for a window of, let's say, 1 month. When I have live data, everything is fun and super easy with KStream and join. I did something like this:
KStream<String, GenericRecord> stream1 =
    builder.stream(Serdes.String(), new CustomizeAvroSerde<>(this.getSchemaRegistryClient(), this.getKafkaPropsMap()), getKafkaConsumerTopic1());
KStream<String, GenericRecord> stream2 =
    builder.stream(Serdes.String(), new CustomizeAvroSerde<>(this.getSchemaRegistryClient(), this.getKafkaPropsMap()), getKafkaConsumerTopic2());

long joinWindowSizeMs = 30L * 24L * 60L * 60L * 1000L; // 30 days

KStream<String, GenericRecord> joinStream = stream1.join(stream2,
    new ValueJoiner<GenericRecord, GenericRecord, GenericRecord>() {
        @Override
        public GenericRecord apply(GenericRecord genericRecord, GenericRecord genericRecord2) {
            final GenericRecord joinedRecord = new GenericData.Record(joinedRecordSchema);
            ....
            ....
            ....
            return joinedRecord;
        }
    }, JoinWindows.of(joinWindowSizeMs));
The problem appears when I want to do a data replay. Let's say I want to re-do these joins for the data I have for the past 6 months. Since I am running the pipeline for all the data at once, Kafka Streams joins all the joinable data and does not take the time difference into consideration (it should only join records that fall within one month of each other). I am assuming the JoinWindows time is the time we insert data into the Kafka topic, am I right?
And how can I change and manipulate this time so I can run my data replay correctly? I mean, when re-inserting these past 6 months of data, it should take a window of one month for each respective record and join based on that.
This question is not a duplicate of How to manage Kafka KStream to Kstream windowed join?; there I asked how I can join based on a window of time. Here I am talking about data replay. From my understanding, during the join Kafka takes the time the data was inserted into the topic as the time for the JoinWindows, so if you do a data replay and re-insert data from 6 months ago, Kafka takes it as new data inserted today and will join it with other data that is actually from today, which it shouldn't.

Kafka's Streams API uses the timestamps returned by the TimestampExtractor to compute joins. By default, this is the record's embedded metadata timestamp (cf. http://docs.confluent.io/current/streams/concepts.html#time).
By default, KafkaProducer sets this timestamp to the current system time on write. (As an alternative, you can configure brokers on a per-topic basis to overwrite the producer-provided timestamps of records with the broker's system time at the time the broker stores the record -- this provides "ingestion time" semantics.)
Thus, it is not a Kafka Streams issue per se.
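To illustrate the mechanism: a custom TimestampExtractor can return the event time from the record payload instead of the metadata timestamp. Below is a minimal sketch, assuming a hypothetical eventTime long field in the Avro value; it would be registered through the timestamp-extractor config of your Streams version (e.g. default.timestamp.extractor).

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class EventTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        final Object value = record.value();
        if (value instanceof GenericRecord) {
            final Object eventTime = ((GenericRecord) value).get("eventTime"); // assumed payload field
            if (eventTime instanceof Long) {
                return (Long) eventTime;
            }
        }
        return record.timestamp(); // fall back to the embedded metadata timestamp
    }
}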
There are multiple options to tackle the problem:
If your data is already in a topic, you can simply reset your Streams application to reprocess the old data. For this, you can use the application reset tool (bin/kafka-streams-application-reset.sh). You also need to set the auto.offset.reset policy to earliest in your Streams app (see the config sketch below). Check out the docs -- it is also recommended to read the blog post.
http://docs.confluent.io/current/streams/developer-guide.html#application-reset-tool
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
This is the best approach, as you do not need to write data to the topic again.
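For the auto.offset.reset part, a minimal sketch of the relevant Streams properties (the application id and bootstrap servers are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

final Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-join-app");        // placeholder application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder bootstrap servers
// read the input topics from the beginning after the reset
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");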
If your data is not in a topic and you need to write the data, you can set the record timestamp explicitly at the application level, by providing a timestamp for each record:
KafkaProducer<K, V> producer = new KafkaProducer<>(...);
// constructor: ProducerRecord(String topic, Integer partition, Long timestamp, K key, V value)
producer.send(new ProducerRecord<>(topic, partition, timestamp, key, value));
Thus, if you ingest old data you can set the timestamp explicitly and Kafka Streams will pick it up and compute the join accordingly.

Related

Why Kafka streams creates topics for aggregation and joins

I recently created my first Kafka Streams application for learning. I used spring-cloud-stream-kafka-binding. This is a simple eCommerce system, in which I am reading a topic called products, which receives a product entry whenever new stock of a product comes in. I am aggregating the quantity to get the total quantity of a product.
I had two choices -
Send the aggregate details (KTable) to another kafka topic called aggregated-products
Materialize the aggregated data
I opted for the second option, and what I found out is that the application created a Kafka topic by itself, and when I consumed messages from that topic I got the aggregated messages.
.peek((k, v) -> LOGGER.info("Received product with key [{}] and value [{}]", k, v))
.groupByKey()
.aggregate(Product::new,
    (key, value, aggregate) -> aggregate.process(value),
    Materialized.<String, Product, KeyValueStore<Bytes, byte[]>>as(PRODUCT_AGGREGATE_STATE_STORE)
        .withValueSerde(productEventSerde)
        //.withKeySerde(keySerde) -- not needed because keySerde is configured in application.properties
);
Using InteractiveQueryService, I am able to access this state store in my application to find out the total quantity available for a product.
Now I have a few questions -
Why did the application create a new Kafka topic?
If the answer is 'to store aggregated data', then how is this different from option 1, in which I could have sent the aggregated data myself?
Where does RocksDB come into the picture?
Code of my application (which does more than what I explained here) can be accessed from this link -
https://github.com/prashantbhardwaj/kafka-stream-example/blob/master/src/main/java/com/appcloid/kafka/stream/example/config/SpringStreamBinderTopologyBuilderConfig.java
The internal topics are called changelog topics and are used for fault tolerance. The state of the aggregation is stored both locally on disk using RocksDB and on the Kafka broker in the form of a changelog topic - which is essentially a "backup". If a task is moved to a new machine or the local state is lost for another reason, the local state can be restored by Kafka Streams by reading all changes to the original state from the changelog topic and applying them to a new RocksDB instance. After restoration has finished (the whole changelog topic has been processed), the same state should be on the new machine, and the new machine can continue processing where the old one stopped. There are a lot of intricate details to this (e.g., in the default setting, it can happen that the state is updated twice for the same input record when failures happen).
See also https://developer.confluent.io/learn-kafka/kafka-streams/stateful-fault-tolerance/
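To connect this to the code in the question: the changelog behaviour hangs off the Materialized used in the aggregate. A hedged sketch, reusing PRODUCT_AGGREGATE_STATE_STORE and productEventSerde from the question (the topic-level override is only an example):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

// Extra topic-level settings for the changelog topic that backs the store.
final Map<String, String> changelogConfig = new HashMap<>();
changelogConfig.put("min.insync.replicas", "2"); // example override

final Materialized<String, Product, KeyValueStore<Bytes, byte[]>> materialized =
    Materialized.<String, Product, KeyValueStore<Bytes, byte[]>>as(PRODUCT_AGGREGATE_STATE_STORE)
        .withValueSerde(productEventSerde)
        .withLoggingEnabled(changelogConfig); // or .withLoggingDisabled() to drop the "backup" entirely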

KStream topology with in-memory state store data not committed

I need to aggregate client information and push it to an output topic every hour.
I have a topology with:
input-topic
processor
sink topic
Data arrives in input-topic with a String key which contains a clientID concatenated with a date in YYYYMMDDHH format.
In my processor I use a simple InMemoryKeyValueStore (withCachingDisabled) to merge/aggregate data with specific rules (data is sometimes not aggregated, according to business logic).
In a punctuator, every hour the program iterates over the state store to get all the messages, transforms them and forwards them to the sink topic, after which I clean the state store for all the messages processed.
After the punctuation, I check the size of the store, which is effectively empty (via .all() and approximateNumEntries); everything is OK.
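In code, the punctuation is roughly the following sketch (ClientAggregate and transform() are hypothetical stand-ins; store and context are fields of the processor):

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;

// Registered in the processor's init().
context.schedule(Duration.ofHours(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
    try (KeyValueIterator<String, ClientAggregate> it = store.all()) {
        while (it.hasNext()) {
            final KeyValue<String, ClientAggregate> entry = it.next();
            // transform the aggregate and forward it towards the sink topic
            context.forward(new Record<>(entry.key, transform(entry.value), timestamp));
            // delete the processed entry; this writes a tombstone to the changelog topic
            store.delete(entry.key);
        }
    }
});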
But when I restart the application, the state store is restored with all the elements that should have been deleted.
When I manually read the changelog topic of the state store in Kafka (with a simple KafkaConsumer), I see two records for each key:
The first record is committed and the message contains my aggregation.
The second record is a deletion message (a message with a null value) but it is not committed (visible only with read_uncommitted), which is dangerous in my case because the next punctuation will forward the aggregate again.
I have played with committing in the punctuator that forwards, and I have created another punctuator which commits the context periodically (every 3 seconds), but after the restart I still have my data restored in the store (which is normal, since my delete message is not committed).
I have a classic Kafka Streams configuration:
acks=all
enable.idempotence=true
processing.guarantee=exactly_once_v2
commit.interval.ms=100
isolation.level=read_committed
with the latest version of the library, kafka-streams 3.2.2, and a cluster on 2.6.
Any help is welcome to get my records in the state store committed. I don't use TimeWindowedKStream, which is not exactly what I need (sometimes I don't aggregate but forward directly).

How to run more than 1 application instance of a KTable-KTable join Kafka Streams application on single-partitioned Kafka topics?

KTable<Key1, GenericRecord> primaryTable = createKTable(key1, kstream, "statestore-name");
KTable<Key2, GenericRecord> childTable1 = createKTable(key1, kstream, "statestore-name");
KTable<Key3, GenericRecord> childTable2 = createKTable(key1, kstream, "statestore-name");

primaryTable.leftJoin(childTable1, (primary, child1) -> compositeObject)
    .leftJoin(childTable2, (compositeObject, child2) -> compositeObject, Materialized.as("compositeobject-statestore"))
    .toStream().to("composite-topics");
For my application, I am using KTable-KTable joins, so that whenever data is received on the primary or a child stream, it can populate compositeObject via setters and getters for all three tables. These three incoming streams have different keys, but while creating the KTables, I make the keys the same for all three.
I have all topics with a single partition. When I run the application on a single instance, everything runs fine; I can see compositeObject populated with data from all three tables.
All interactive queries also run fine when passing the recordID and local state store name.
But when I run two instances of the same application, I see compositeObject with primary and child1 data, but child2 remains empty. Even if I try to query the state store using an interactive query, it doesn't return anything.
I am using spring-cloud-stream-kafka-streams libraries for writing code.
Please suggest what the reason is that it is not being set, and what the right solution would be to handle this.
Kafka Streams' scaling model is coupled to the number of input topic partitions. Thus, if your input topics are single-partitioned you cannot scale out. The number of input topic partitions determines your maximum parallelism.
Hence, you would need to create new topics with more partitions.
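For completeness, a hedged sketch of creating such topics with the AdminClient (the topic name, partition count, and replication factor are placeholders; note that all topics participating in the join must stay co-partitioned, i.e. have the same number of partitions):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

static void createScalableTopic() throws Exception {
    final Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
    try (AdminClient admin = AdminClient.create(props)) {
        // placeholder name; 4 partitions allow up to 4 parallel tasks/instances
        final NewTopic topic = new NewTopic("primary-topic-4p", 4, (short) 3);
        admin.createTopics(Collections.singleton(topic)).all().get();
    }
}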

How to send messages in order from Apache Spark to Kafka topic

I have a use case where event information about sensors is inserted continuously into MySQL. We need to send this information, with some processing, to a Kafka topic every 1 or 2 minutes.
I am using Spark to send this information to the Kafka topic and to maintain CDC in a Phoenix table. I am using a cron job to run the Spark job every minute.
The issue I am currently facing is message ordering: I need to send these messages in ascending timestamp order to the end-system Kafka topic (which has 1 partition). But most of the message ordering is lost because more than one Spark DataFrame partition sends information to the Kafka topic concurrently.
Currently, as a workaround, I am repartitioning my DataFrame to 1 partition in order to maintain the message ordering, but this is not a long-term solution as I lose Spark's distributed computing.
If you have a better solution or design for this, please suggest it.
I am able to achieve message ordering by ascending timestamp to some extent by repartitioning my data by key and applying sorting within each partition.
val pairJdbcDF = jdbcTable.map(row => ((row.getInt(0), row.getString(4)),
    s"${row.getInt(0)},${row.getString(1)},${row.getLong(2)},${row. /*getDecimal*/ getString(3)},${row.getString(4)}"))
  .toDF("Asset", "Message")

val repartitionedDF = pairJdbcDF.repartition(getPartitionCount, $"Asset")
  .select($"Message")
  .select(expr("(split(Message, ','))[0]").cast("Int").as("Col1"),
    expr("(split(Message, ','))[1]").cast("String").as("TS"),
    expr("(split(Message, ','))[2]").cast("Long").as("Col3"),
    expr("(split(Message, ','))[3]").cast("String").as("Col4"),
    expr("(split(Message, ','))[4]").cast("String").as("Value"))
  .sortWithinPartitions($"TS", $"Value")

Storm bolt doesn't guarantee to process the records in the order they are received?

I had a Storm topology that reads records from Kafka, extracts the timestamp present in the record, does a lookup on an HBase table, applies business logic, and then updates the HBase table with the latest values from the current record.
I have written a custom HBase bolt extending BaseRichBolt, where the code does a lookup on the HBase table, applies some business logic on the message that has been read from Kafka, and then updates the HBase table with the latest data.
The problem I am seeing is that sometimes the bolt receives/processes the records in a jumbled order, due to which my application thinks that a particular record has already been processed and ignores it. The application fails to process a significant number of records because of this.
For example:
Suppose there are two records that are read from Kafka, one record belongs to the 10th hour and the second record belongs to the 11th hour...
My custom HBase bolt processes the 11th-hour record first, then reads/processes the 10th-hour record later. Because the 11th-hour record is processed first, the application assumes the 10th-hour record has already been processed and ignores it.
Can someone please help me understand why my custom HBase bolt is not processing the records in the order it receives them?
Do I have to set any additional properties to ensure the bolt processes the records in the order it receives them? What possible alternatives can I try to fix this?
FYI, I am using fields grouping for the HBase bolt, through which I want to ensure that all the records of a particular user go to the same task. Nevertheless, thinking the fields grouping might be causing the issue, I reduced the number of tasks for my custom HBase bolt to 1 task; still the same issue.
I am wondering why the HBase bolt is not reading/processing records in the order it receives them. Please help me with your thoughts.
Thanks a lot.
Kafka doesn't guarantee the order of messages across multiple partitions.
So there is no ordering when you read messages. To avoid that, you need to create the Kafka topic with a single partition, but you will lose the parallelism advantage.
Kafka guarantees ordering per partition, not per topic. Partitioning really serves two purposes in Kafka:
It balances data and request load over brokers
It serves as a way to divvy up processing among consumer processes while allowing local state and preserving order within the partition.
For your use case you may care only about #2. Please consider using a Partitioner as part of your Producer via ProducerConfig.PARTITIONER_CLASS_CONFIG. The default Java producer in 0.9 will try to spread messages across all available partitions. https://github.com/apache/kafka/blob/6eacc0de303e4d29e083b89c1f53615c1dfa291e/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
You can create your own with something like this:
return hash(key)%num_partitions
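As a rough illustration (not tied to any particular Kafka version), such a key-hash partitioner could look like the sketch below; it would be registered via ProducerConfig.PARTITIONER_CLASS_CONFIG with the class name.

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        final int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // keyless records: pick any fixed or round-robin strategy here
        }
        // hash(key) % num_partitions, using the same murmur2 hash as the default partitioner
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}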
