Kafka Stream timestamp synchronization in KStream/KTable join with repartitioning - apache-kafka-streams

I need a solution for a KStream-KTable join where records arrive out of order because of pre-processing in the topology. I am using kafka-streams:3.0.1.
I have topic1 and topic2. If they are keyed on the same key, I get the expected result from the KStream-KTable join.
T1, T2, ..., Tn are timestamps.
topic1 records: [{K1,V1}(T2), {K2,V2}(T4)]
topic2 records: [{K1,A}(T1), {K2,B}(T3)]
join result: [{K1,A}, {K2,B}]
KStream stream = builder.stream("topic1");
KTable table = builder.table("topic2");
stream.leftJoin(table, (v1, v2) -> v2)
      .to("enrichedTopic");
But if there is pre-processing in the topology that results in a repartition, the result will not be the same.
topic1 records: [{K1,V1}(T2), {K2,V2}(T4)]
topic2 records: [{*K1,A}(T1), {*K2,B}(T3)]
KStream stream = builder.stream("topic1");
KTable table = builder.stream("topic2")
        .selectKey((k, v) -> k.replace("*", "")) // remove '*' from the key
        .repartition()
        .toTable();
stream.leftJoin(table, (v1, v2) -> v2)
      .to("enrichedTopic");
Below are two solutions with their disadvantages. I want to know if there is any better way to achieve this.
Solution 1: Increasing max.task.idle.ms
This gives the right result, but I am afraid it will slow down processing if topic2 has slow incoming traffic, since the join will wait for data on the topic2 partitions before processing topic1 records.
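A minimal sketch of how this could be configured (the application id, broker address, and the 5-second value are assumptions for illustration; tune the idle time to your topic2 traffic):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-app");    // assumed application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
// Let each task wait up to 5 seconds for data on all of its input partitions
// before it picks the next record by timestamp (the default, 0, disables the wait).
props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 5000L);
KafkaStreams streams = new KafkaStreams(builder.build(), props);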
Solution 2: Foreign key join
Not completely sure about the result, as it also involves repartitioning. It will also produce an increased number of results if there are frequent updates on topic2 for the same key, and one record from topic2 maps to multiple records of topic1 (topic2:topic1 is 1:n).
This would also need logic to drop the records produced by updates on topic2, since we are only interested in the value of topic2 at the moment there is an event on topic1, based on timestamp synchronization. A sketch of this approach is shown below.
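For illustration, a minimal sketch of Solution 2 as a KTable-KTable foreign-key join (available since Kafka Streams 2.4), assuming the topic1 value carries the topic2 key in a field (the Function-based extractor only sees the value, so Value1 and getTopic2Key() are hypothetical):

KTable<String, Value1> left = builder.table("topic1");   // the n-side, keyed K1, K2, ...
KTable<String, String> right = builder.table("topic2");  // keyed *K1, *K2, ...

KTable<String, String> enriched = left.leftJoin(
        right,
        v1 -> "*" + v1.getTopic2Key(),   // hypothetical accessor returning the topic2 key
        (v1, v2) -> v2);                 // keep only the topic2 value, as in the original join

enriched.toStream().to("enrichedTopic");

As noted above, the result is re-emitted whenever the topic2 side changes for that key, so the de-duplication concern still applies.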
Please correct me if anything in my understanding is incorrect.
Questions:
Let me know if there is any better approach to solve this.
Will solution 2 guarantee the correct result if we only consider the first resultant record produced by the foreign-key join (in order to ignore the updates from topic2)?

Related

How to understand streaming table in Flink?

It's hard for me to understand the streaming table in Flink. I can understand Hive, which maps a fixed, static data file to a "table", but how can a table be built on streaming data?
For example, every second, 5 events with the same structure are sent to a Kafka stream:
{"num":1, "value": "a"}
{"num":2, "value": "b"}
....
What does the dynamic table built on them look like? Does Flink consume them all, store them somewhere (memory, local file, HDFS, etc.) and then map them to a table? Once the "transformer" finishes processing these 5 events, does it clear the data and refill the "table" with 5 new events?
Any help is appreciated...
These dynamic tables don't necessarily exist anywhere -- they are simply an abstraction that may, or may not, be materialized, depending on the needs of the query being performed. For example, a query that is doing a simple projection
SELECT a, b FROM events
can be executed by simply streaming each record through a stateless Flink pipeline.
Also, Flink doesn't operate on mini-batches -- it processes each event one at a time. So there's no physical "table", or partial table, anywhere.
But some queries do require some state, perhaps very little, such as
SELECT count(*) FROM events
which needs nothing more than a single counter, while something like
SELECT key, count(*) FROM events GROUP BY key
will use Flink's key-partitioned state (a sharded key-value store) to persist the current counter for each key. Different nodes in the cluster will be responsible for handling events for different keys.
Just as "normal" SQL takes one or more tables as input, and produces a table as output, stream SQL takes one or streams as input, and produces a stream as output. For example, the SELECT count(*) FROM events will produce the stream 1 2 3 4 5 ... as its result.
There are some good introductions to Flink SQL on YouTube: https://www.google.com/search?q=flink+sql+hueske+walther, and there are training materials on github with slides and exercises: https://github.com/ververica/sql-training.

How to run more than 1 application instances of ktable-ktable joins kafka streams application on single partitioned kafka topics?

KTable<Key1, GenericRecord> primaryTable = createKTable(key1, kstream, statestore-name);
KTable<Key2, GenericRecord> childTable1 = createKTable(key1, kstream, statestore-name);
KTable<Key3, GenericRecord> childTable2 = createKTable(key1, kstream, statestore-name);
primaryTable.leftJoin(childTable1, (primary, child1) -> compositeObject)
    .leftJoin(childTable2, (compositeObject, child2) -> compositeObject, Materialized.as("compositeobject-statestore"))
    .toStream().to("composite-topics");
For my application, I am using KTable-KTable joins so that whenever data is received on the primary or a child stream, it can populate compositeObject, which has setters and getters for all three tables. The three incoming streams have different keys, but while creating the KTables I make the keys the same for all three.
All topics have a single partition. When I run the application on a single instance, everything runs fine; I can see compositeObject populated with data from all three tables.
All interactive queries also run fine, passing the recordID and the local state store name.
But when I run two instances of the same application, I see compositeObject with primary and child1 data, while child2 remains empty. Even if I try to call the state store using an interactive query, it doesn't return anything.
I am using the spring-cloud-stream-kafka-streams libraries for writing the code.
Please suggest what the reason could be and what the right solution is to handle this.
Kafka Streams' scaling model is coupled to the number of input topic partitions. Thus, if your input topics are single partitioned you cannot scale out; the number of input topic partitions determines your maximum parallelism.
Thus, you would need to create new topics with a higher partition count.
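A minimal sketch of creating such a topic with the AdminClient (topic name, partition count, replication factor, and broker address are assumptions); with four partitions, two application instances can then share the four tasks:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateScaledTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Four partitions -> up to four stream tasks, so two instances can split the load.
            admin.createTopics(Collections.singletonList(new NewTopic("primary-topic-v2", 4, (short) 1)))
                 .all()
                 .get();
        }
    }
}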

Kafka Stream work with JoinWindow for data replay

I have 2 streams of data and I want to be able to join them for a window of, let's say, 1 month. When I have live data, everything is fun and super easy with KStream and join. I did something like this:
KStream<String, GenericRecord> stream1 =
    builder.stream(Serdes.String(), new CustomizeAvroSerde<>(this.getSchemaRegistryClient(), this.getKafkaPropsMap()), getKafkaConsumerTopic1());
KStream<String, GenericRecord> stream2 =
    builder.stream(Serdes.String(), new CustomizeAvroSerde<>(this.getSchemaRegistryClient(), this.getKafkaPropsMap()), getKafkaConsumerTopic2());
long joinWindowSizeMs = 30L * 24L * 60L * 60L * 1000L; // 30 days
KStream<String, GenericRecord> joinStream = stream1.join(stream2,
    new ValueJoiner<GenericRecord, GenericRecord, GenericRecord>() {
        @Override
        public GenericRecord apply(GenericRecord genericRecord, GenericRecord genericRecord2) {
            final GenericRecord joinedRecord = new GenericData.Record(joinedRecordSchema);
            ....
            ....
            ....
            return joinedRecord;
        }
    }, JoinWindows.of(joinWindowSizeMs));
The problem appears when I want to do a data replay. Let's say I want to redo these joins for the data I have from the past 6 months. Since I am running the pipeline on all the data at once, Kafka Streams will join all the joinable data and doesn't take the time difference into consideration (it should only join the past one month of data). I am assuming the JoinWindow time is the time we insert data into the Kafka topic, am I right?
And how can I change and manipulate this time so I can run my data replay correctly? I mean, when re-inserting these past 6 months of data, it should take a window of one month for each respective record and join based on that.
This question is not a duplicate of How to manage Kafka KStream to Kstream windowed join?; there I asked how I can join based on a window of time. Here I am talking about data replay. From my understanding, during a join Kafka takes the time that data was inserted into the topic as the time for the JoinWindow, so if you do a data replay and re-insert the data from 6 months ago, Kafka takes it as new data inserted today and will join it with other data that is actually from today, which it shouldn't.
Kafka's Streams API uses timestamps returned by TimestampExtractor to compute joins. By default, this is the record's embedded metadata timestamp. (c.f. http://docs.confluent.io/current/streams/concepts.html#time)
Per default, KafkaProducer sets this timestamp to current system time on write. (As an alternative, you can configure brokers on a per-topic basis to overwrite producer-provided timestamps of records with the broker's system time at the time the broker stored the record -- this provides "ingestion time" semantics.)
Thus, it is not a Kafka Streams issue per se.
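As a reference point for the mechanism mentioned above, this is roughly what a custom TimestampExtractor looks like if the event time lives in the record payload (the eventTime field and class name are assumptions, and this is not one of the options listed below); in newer versions it is registered via StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG:

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Hypothetical extractor reading event time from an assumed "eventTime" long field in the Avro value.
public class PayloadTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Object value = record.value();
        if (value instanceof GenericRecord) {
            Object ts = ((GenericRecord) value).get("eventTime");
            if (ts instanceof Long) {
                return (Long) ts;
            }
        }
        // Fall back to the record's embedded metadata timestamp.
        return record.timestamp();
    }
}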
There are multiple options to tackle the problem:
If your data is already in a topic, you can simply reset your Streams application to reprocess old data. For this, you can use the application reset tool (bin/kafka-streams-application-reset.sh). You also need to set the auto.offset.reset policy to earliest in your Streams app. Check out the docs -- it's also recommended to read the blog post.
http://docs.confluent.io/current/streams/developer-guide.html#application-reset-tool
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
This is the best approach, as you do not need to write data to the topic again.
If your data is not in a topic and you need to write the data, you can set the record timestamp explicitly at the application level, by providing a timestamp for each record:
KafkaProducer<K, V> producer = new KafkaProducer<>(...);
// ProducerRecord(String topic, Integer partition, Long timestamp, K key, V value)
producer.send(new ProducerRecord<>(topic, partition, timestamp, key, value));
Thus, if you ingest old data you can set the timestamp explicitly and Kafka Streams will pick it up and compute the join accordingly.

NiFi-1.0.0 - ExecuteSQL, Event Driven

I have a NiFi flow which inserts some data into some tables. After I insert some data into a table, I send a signal, and then ExecuteSQL runs an aggregation query on that table. The table names are based on the file names.
The thing is that when ExecuteSQL runs the query, I only get a subset of the result. If I run the same query in the database's console, I get a different number of rows returned.
Could this be a problem that has to do with the Event Driven scheduling strategy?
If ExecuteSQL is stopped, the flowfile (the signal) is waiting in ExecuteSQL's queue, and I then start ExecuteSQL manually, I get back the expected result.
If you are running multiple inserts (using PutSQL for example) and you wish to run ExecuteSQL only after all of them are finished, and the order in which they finish is not deterministic, you might try one of these two approaches:
MergeContent - use a MergeContent processor after PutSQL, setting the Minimum Number of Entries and/or Max Bin Age to trigger when the inserts are finished. You can route the merged relationship to ExecuteSQL.
MonitorActivity - use a MonitorActivity processor to monitor the flow of output from PutSQL and trigger an inactive alert after a configured time period. You would route the inactive relationship to ExecuteSQL to run the aggregate query.

Storm bolt doesn't guarantee to process the records in the order they are received?

I have a Storm topology that reads records from Kafka, extracts the timestamp present in the record, does a lookup on an HBase table, applies business logic, and then updates the HBase table with the latest values from the current record.
I have written a custom HBase bolt extending BaseRichBolt, where the code does a lookup on the HBase table, applies some business logic to the message that was read from Kafka, and then updates the HBase table with the latest data.
The problem I am seeing is that sometimes the bolt receives/processes the records in a jumbled order, due to which my application thinks that a particular record has already been processed and ignores it. The application fails to process a serious number of records because of this.
For example:
Suppose there are two records read from Kafka, one record belonging to the 10th hour and the second belonging to the 11th hour.
My custom HBase bolt processes the 11th-hour record first, then reads/processes the 10th-hour record later. Because the 11th-hour record is processed first, the application assumes the 10th-hour record has already been processed and ignores it.
Can someone please help me understand why my custom HBase bolt is not processing the records in the order it receives them?
Should I set any additional properties to ensure the bolt processes the records in the order it receives them? What possible alternatives can I try to fix this?
FYI, I am using fields grouping for the HBase bolt, through which I want to ensure that all the records of a particular user go to the same task. Thinking that fields grouping might be causing the issue, I reduced the number of tasks for my custom HBase bolt to 1, but the issue remains.
I am wondering why the HBase bolt is not reading/processing records in the order it receives them. Please help me with your thoughts.
Thanks a lot.
Kafka doesn't guarantee the order of messages across multiple partitions.
So there is no ordering when you read messages from more than one partition. To avoid that, you would need to create the Kafka topic with a single partition, but you would lose the parallelism advantage.
Kafka guarantees ordering by partition not by topic. Partitioning really serves two purposes in Kafka:
It balances data and request load over brokers
It serves as a way to divvy up processing among consumer processes while allowing local state and preserving order within the partition.
For a given use case you may care only about #2. Please consider using a Partitioner as part of your Producer, via ProducerConfig.PARTITIONER_CLASS_CONFIG. The default Java producer in 0.9 will try to spread messages across all available partitions. https://github.com/apache/kafka/blob/6eacc0de303e4d29e083b89c1f53615c1dfa291e/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
You can create your own with something like this:
return hash(key)%num_partitions
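A minimal sketch of such a custom Partitioner (the class name is an assumption; the hashing mirrors what the default partitioner does for keyed records), plugged in with producerProps.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, KeyHashPartitioner.class.getName()):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Sends every record with the same key to the same partition, so per-key order is preserved.
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            // No key: fall back to partition 0 (the default partitioner spreads these instead).
            return 0;
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}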
