How to block a KStream-KTable join until the KTable gets updated in Kafka Streams

I want to perform a left join on an updated KTable.
I have a KStream with input Topic A and a materialized KTable with input Topic B. I perform a KStream-KTable left join and write the join result back to Topic B so that my KTable gets updated.
The problem is that when multiple messages come in, the left join is performed against the old table before it has been updated, so data is lost. Below is a diagram of my topology.
Topic A --> kStream \
                     left join --> write to Topic B --+
Topic B --> kTable  /                                 |
    ^                                                 |
    |_________________________________________________|
Expected:
stream_message1 with state A --> get new state B --> update kTable
stream_message2 with state B --> get new state C --> update kTable
Actual:
stream_message1 with state A --> get new state B
stream_message2 with state A --> get totally different state D
Is there any way to make all of this synchronous?
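One direction that is often suggested for this kind of feedback loop is to avoid the round trip through Topic B for the read: keep the state in a state store that the same processing step both reads and updates, so the second message always sees what the first one wrote. The following is only a plain-Java sketch of that idea, not Kafka Streams code. A HashMap stands in for the state store, and the class name, the process method, and the transition function are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

/** Illustration only: a HashMap stands in for a Kafka Streams state store. */
final class ReadUpdateInOneStep {
    private final Map<String, String> store = new HashMap<>();
    // Hypothetical transition function: (currentState, message) -> newState
    private final BinaryOperator<String> nextState;

    ReadUpdateInOneStep(BinaryOperator<String> nextState) {
        this.nextState = nextState;
    }

    /** Reads the current state for the key, derives the new state, and stores it before returning. */
    String process(String key, String message) {
        String current = store.get(key); // null for the first message, like a left join miss
        String updated = nextState.apply(current, message);
        store.put(key, updated);         // the next message for this key sees the update
        return updated;
    }

    public static void main(String[] args) {
        // Assumed transition for the demo: append the message to the previous state.
        ReadUpdateInOneStep step = new ReadUpdateInOneStep(
            (state, msg) -> state == null ? msg : state + ">" + msg);
        System.out.println(step.process("k", "m1")); // m1
        System.out.println(step.process("k", "m2")); // m1>m2
    }
}
```

Because the read and the write happen in the same call, there is no window in which a second message can observe the stale state. The result could still be written to Topic B for downstream consumers.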

Related

Delete data from KTable that has a custom stream StreamPartitioner

We have a Kafka topic product-update-events which contains data about product updates and their variations.
We aggregate these events into a products KTable using the Kafka Streams 'aggregate' function.
For every product in this products KTable we then want to calculate the 'best' variation (i.e., one of the product's variations, chosen by some criteria).
These 'best' variations are then written to another KTable and to a Kafka topic.
We only want to emit a best-variation update when the best variation has actually changed because of a product update. Therefore we use a custom transformer which checks the current best variation in its state store.
The product events and the product table have the 'productId' as key and are partitioned by it. The best-variation records have the 'variationId' as key. We use a custom StreamPartitioner to also partition these records by productId, so that each Kafka Streams application instance has the matching product and best-variation data:
{ _, _, variation, numPartitions -> Utils.toPositive(Utils.murmur2(StringSerializer().serialize("", variation.productId))) % numPartitions }
Now we come to the actual question :)
We want to delete the best-variation when we receive a 'delete' product-update event. Therefore we need to set the payload of the best-variation record to 'null'. But now we don't have any information about the productId this record belongs to for our custom partitioner.
Do you have any suggestion on how to solve this?
Our topology is as follows:
Topologies:
  Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [product-update-events])
      --> KSTREAM-AGGREGATE-0000000002
    Processor: KSTREAM-AGGREGATE-0000000002 (stores: [KSTREAM-AGGREGATE-STATE-STORE-0000000001])
      --> KTABLE-TOSTREAM-0000000003
      <-- KSTREAM-SOURCE-0000000000
    Processor: KTABLE-TOSTREAM-0000000003 (stores: [])
      --> KSTREAM-SINK-0000000004
      <-- KSTREAM-AGGREGATE-0000000002
    Sink: KSTREAM-SINK-0000000004 (topic: products)
      <-- KTABLE-TOSTREAM-0000000003
  Sub-topology: 1
    Source: KSTREAM-SOURCE-0000000007 (topics: [products])
      --> KSTREAM-TRANSFORM-0000000008
    Source: KSTREAM-SOURCE-0000000005 (topics: [best-variation-per-article])
      --> KTABLE-SOURCE-0000000006
    Processor: KSTREAM-TRANSFORM-0000000008 (stores: [best-variation-per-article])
      --> KSTREAM-SINK-0000000009
      <-- KSTREAM-SOURCE-0000000007
    Sink: KSTREAM-SINK-0000000009 (topic: best-variation-per-article)
      <-- KSTREAM-TRANSFORM-0000000008
    Processor: KTABLE-SOURCE-0000000006 (stores: [best-variation-per-article])
      --> none
      <-- KSTREAM-SOURCE-0000000005
You will want to use a tombstone: a record with the same key and a null value. This causes the store to drop the entry with that key.
this is a pretty decent example that includes deletion
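To make the tombstone semantics concrete, here is a minimal plain-Java sketch (not Kafka Streams code). A HashMap stands in for the store behind the table, and the class and method names are made up for illustration: a record with a null value removes the key, any other record upserts it.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: a HashMap stands in for the store behind a KTable. */
final class TombstoneDemo {
    private final Map<String, String> store = new HashMap<>();

    /** Applies one record: a null value is a tombstone (delete), anything else is an upsert. */
    void apply(String key, String value) {
        if (value == null) {
            store.remove(key);
        } else {
            store.put(key, value);
        }
    }

    boolean contains(String key) {
        return store.containsKey(key);
    }

    public static void main(String[] args) {
        TombstoneDemo table = new TombstoneDemo();
        table.apply("variation-1", "best variation payload");
        System.out.println(table.contains("variation-1")); // true
        table.apply("variation-1", null);                  // tombstone for the same key
        System.out.println(table.contains("variation-1")); // false
    }
}
```

Note that the tombstone must carry the same key as the record it deletes, which is why it also has to end up in the same partition as that record.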

Kafka Streams: Add Sequence to each message within a group of messages

Set Up
Kafka 2.5
Apache KStreams 2.4
Deployment to OpenShift (containerized)
Objective
Group a set of messages from a topic using a set of value attributes & assign a unique group identifier
-- This can be achieved by using selectKey and groupByKey
originalStreamFromTopic
    .selectKey((k, v) -> String.join("|", v.attribute1, v.attribute2))
    .groupByKey()
groupedStream.mapValues((k, v) ->
{
    v.setGroupKey(k);
    return v;
});
For each message within a specific group, create a new message with an itemCount number as one of its attributes.
E.g., a group with key "keypart1|keyPart2" can have 10 messages, and each message should have an incremental ID from 1 through 10.
aggregate?
count and some additional StateStore-based implementation.
One of the options (that I listed above) can make use of a couple of state stores:
state store 1-> Mapping of each groupId and individual Item (KTable)
state store 2 -> Count per groupId (KTable)
A join of these 2 tables to stamp a sequence on the message as they get published to the final topic.
Other statistics:
The average number of messages per group would be in the 1000s, except for an outlier case where it can go up to 500k.
In general the candidates for a group should be made available on the source within a span of 15 mins max.
The following points are of concern from the perspective of an optimal solution.
I am still not clear how I would be able to stamp a sequence number on the messages unless some kind of state store is used to keep track of the messages published within a group.
Use of KTables and state stores (either explicitly or implicitly through the use of a KTable) would add considerably to the state store size.
Given that the problem involves some kind of stateful processing, the state store can't be avoided, but any possible optimizations would be useful.
Any thoughts or references to similar patterns would be helpful.
You can use one state store in which you maintain the ID for each composite key. When you get a message, you select a new composite key and then look up the next ID for that composite key in the state store. You stamp the message with the ID that you just looked up. Finally, you increase the ID and write it back to the state store.
Code-wise, it would be something like:
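import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

// create state store (key: composite key, value: next ID for that key)
StoreBuilder<KeyValueStore<String, Long>> keyValueStoreBuilder = Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("idMaintainer"),
    Serdes.String(),
    Serdes.Long()
);
// add store
builder.addStateStore(keyValueStoreBuilder);

originalStreamFromTopic
    .selectKey((k, v) -> String.join("|", v.attribute1, v.attribute2))
    .repartition()
    .transformValues(() -> new ValueTransformer<V, NewValueType>() {
        private KeyValueStore<String, Long> state;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            state = (KeyValueStore<String, Long>) context.getStateStore("idMaintainer");
        }

        @Override
        public NewValueType transform(V value) {
            // your logic to:
            // - get the ID for the new composite key,
            // - stamp the record
            // - increase the ID
            // - write the ID back to the state store
            // - return the stamped record
        }

        @Override
        public void close() {
        }
    }, "idMaintainer")
    .to("output-topic");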
You do not need to worry about concurrent access to the state store because in Kafka Streams same keys are processed by one single task and tasks do not share state stores. That means, your new composite keys with the same value will be processed by one single task that exclusively maintains the IDs for the composite keys in its state store.
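The logic left as comments in transform above (look up the next ID, stamp, increase, write back) could look roughly like the following plain-Java sketch. It deliberately does not use the Kafka Streams API: a HashMap stands in for the "idMaintainer" store, and the class name, method names, and the stamped output format are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: the HashMap stands in for the "idMaintainer" key-value store. */
final class SequenceStamper {
    // composite key -> next ID to hand out for that key
    private final Map<String, Long> idStore = new HashMap<>();

    /** Looks up the next ID for the composite key (starting at 1), then writes back ID + 1. */
    long nextId(String compositeKey) {
        long id = idStore.getOrDefault(compositeKey, 1L);
        idStore.put(compositeKey, id + 1);
        return id;
    }

    /** Hypothetical stamped output format: "<compositeKey>#<sequence>:<payload>". */
    String stamp(String attribute1, String attribute2, String payload) {
        String compositeKey = String.join("|", attribute1, attribute2);
        return compositeKey + "#" + nextId(compositeKey) + ":" + payload;
    }

    public static void main(String[] args) {
        SequenceStamper stamper = new SequenceStamper();
        System.out.println(stamper.stamp("keyPart1", "keyPart2", "first"));  // keyPart1|keyPart2#1:first
        System.out.println(stamper.stamp("keyPart1", "keyPart2", "second")); // keyPart1|keyPart2#2:second
        System.out.println(stamper.stamp("other", "group", "first"));        // other|group#1:first
    }
}
```

The store holds a single Long per group regardless of how many messages the group contains, so even the 500k-message outlier adds only one entry.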

Could using changelogs cause a bottleneck for the app itself?

I have a Spring Cloud Kafka Streams application that rekeys incoming data to be able to join two topics, using selectKey, mapValues, and aggregations. Over time the consumer lag seems to increase, and scaling out by adding more instances of the app doesn't help at all. With every instance, the consumer lag still seems to be increasing.
I scaled the number of instances up and down between 1 and 18, but no big difference is noticeable. The number of messages it lags behind keeps increasing every 5 seconds, independent of the number of instances.
KStream<String, MappedOriginalSensorData> flattenedOriginalData = originalData
    .flatMap(flattenOriginalData())
    .through("atl-mapped-original-sensor-data-repartition", Produced.with(Serdes.String(), new MappedOriginalSensorDataSerde()));

//#2. Save modelid and algorithm parts of the key of the errorscore topic and reduce the key
//    to installationId:assetId:tagName
// Repartition ahead of time avoiding multiple repartition topics and thereby duplicating data
KStream<String, MappedErrorScoreData> enrichedErrorData = errorScoreData
    .map(enrichWithModelAndAlgorithmAndReduceKey())
    .through("atl-mapped-error-score-data-repartition", Produced.with(Serdes.String(), new MappedErrorScoreDataSerde()));

return enrichedErrorData
    //#3. Join
    .join(flattenedOriginalData, join(),
        JoinWindows.of(
                // allow messages within one second to be joined together based on their timestamp
                Duration.ofMillis(1000).toMillis())
            // configure the retention period of the local state store involved in this join
            .until(Long.parseLong(retention)),
        Joined.with(
            Serdes.String(),
            new MappedErrorScoreDataSerde(),
            new MappedOriginalSensorDataSerde()))
    //#4. Set installation:assetid:modelinstance:algorithm::tag key back
    .selectKey((k, v) -> v.getOriginalKey())
    //#5. Map to ErrorScore (basically removing the originalKey field)
    .mapValues(removeOriginalKeyField())
    .through("atl-joined-data-repartition");
Then the aggregation part:
    Materialized<String, ErrorScore, WindowStore<Bytes, byte[]>> materialized = Materialized
        .as(localStore.getStoreName());
    // Set retention of changelog topic
    materialized.withLoggingEnabled(topicConfig);
    // Configure how windows looks like and how long data will be retained in local stores
    TimeWindows configuredTimeWindows = getConfiguredTimeWindows(
        localStore.getTimeUnit(), Long.parseLong(topicConfig.get(RETENTION_MS)));
    // Processing description:
    // 2. With the groupByKey we group the data on the new key
    // 3. With windowedBy we split up the data in time intervals depending on the provided LocalStore enum
    // 4. With reduce we determine the maximum value in the time window
    // 5. Materialized will make it stored in a table
    stream.groupByKey()
        .windowedBy(configuredTimeWindows)
        .reduce((aggValue, newValue) -> getMaxErrorScore(aggValue, newValue), materialized);
}

private TimeWindows getConfiguredTimeWindows(long windowSizeMs, long retentionMs) {
    TimeWindows timeWindows = TimeWindows.of(windowSizeMs);
    timeWindows.until(retentionMs);
    return timeWindows;
}
I would expect that increasing the number of instances would decrease the consumer lag tremendously.
So in this setup there are multiple topics involved, such as:
* original-sensor-data
* error-score
* kstream-joinother
* kstream-jointhis
* atl-mapped-original-sensor-data-repartition
* atl-mapped-error-score-data-repartition
* atl-joined-data-repartition
The idea is to join original-sensor-data with error-score. The rekeying requires the atl-mapped-* topics. The join then uses the kstream-* topics, and in the end, as a result of the join, atl-joined-data-repartition is filled. After that the aggregation also creates topics, but I leave this out of scope for now.
original-sensor-data
        \
         \
          \   atl-mapped-original-sensor-data-repartition -- kstream-jointhis  -\
          /   atl-mapped-error-score-data-repartition     -- kstream-joinother -\
         /                                                                       \
error-score                                                        atl-joined-data-repartition
Since increasing the number of instances doesn't seem to have much effect anymore now that I have introduced the join and the atl-mapped topics, I'm wondering whether this topology has become its own bottleneck. From the consumer lag it seems that the original-sensor-data and error-score topics have a much smaller consumer lag compared to, for instance, the atl-mapped-* topics. Is there a way to cope with this by removing these changelogs, or would that result in not being able to scale?

Elasticsearch storing hierarchical data and querying it

Let me break down the problem; it will take some time.
Consider that you have the entities A, B, and C in your system.
A is the parent of everything.
B is the child of A.
C can be a child of A or B. Please note that there are more entities, like D, E, and F, which behave the same as C, so let's consider only C for the time being.
So basically it is a tree-like structure:
```
      A
     / \
    /   \
   B     C (there are similar elements like D, E, F)
   |
   |
   C
```
We are using Elasticsearch as a secondary DB to store this. In the database the structure is completely different: since A, B, and C have dynamic fields, they are different tables and we join them to get the data. But from a business perspective, this is the design.
Now we try to flatten it and store it in ES for the following data set.
We have an entity A1 that has 2 children, C1 and B1; B1 in turn has a child C2.
   A    B     C
1  A1   null  null
2  A1   null  C1
3  A1   B1    null
4  A1   B1    C2
Now, what you can query:
A user says he wants all columns of A, B, and C where the value of column A is A1. By adding some null-removal rules, we can give him rows 2, 3, and 4.
Now the problem: the user says he wants all As where the value of A is A1. We will basically return all rows (1, 2, 3, 4) or rows 2, 3, 4, so he will see values like
A
A1
A1
A1
But logically he should see only a single A1, since that is the only unique value, and ES doesn't have the ability to group by things.
So how did we solve this?
We solved this problem by creating multiple indices and one nested index.
When we need to group by, we go to the nested index; the other indices work as flat indices.
So we have different indices, like an index for A and B, and one for A or B and C. But we have more elements, so this led to the creation of 5 indices.
As the data started growing, it is becoming difficult to maintain 5 indices, and indexing them from scratch takes too much time.
To solve this, we started to look for other options, and we are testing CrateDB. But in the first place we are still trying to figure out whether there is any way to do this in ES, since we need to use many features of ES such as percolation, Watcher, etc. Any clues on that?
Please also note that we need to apply pagination as well. That's why a single nested index will not work.

Hadoop Buffering vs Streaming

Could someone please explain to me the difference between Hadoop streaming and buffering?
Here is the context I have read in Hive:
In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest tables appear last in the sequence. e.g. in:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
In a reduce-side join, the values from multiple tables are often tagged so that the reducer stage can identify which table they come from.
Consider a case of two tables:
On the reduce call, the mixed values associated with both tables are iterated.
During iteration, the values for one of the tags/tables are stored locally in an ArrayList. (This is buffering.)
While the rest of the values are streamed through and values for the other tag/table are detected, the values of the first tag are fetched from the saved ArrayList. The two tagged values are joined and written to the output collector.
Contrast this with the case where the larger table's values are kept in the ArrayList: it could result in an OOM if the ArrayList outgrows the memory of the container's JVM.
void reduce(TextPair key, Iterator<TextPair> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // buffer for table1
    ArrayList<Text> table1Values = new ArrayList<Text>();
    // table1 tag
    Text table1Tag = key.getSecond();
    TextPair value = null;
    while (values.hasNext()) {
        value = values.next();
        if (value.getSecond().equals(table1Tag)) {
            // buffering: keep table1's values in memory
            // (copy the Text, since Hadoop may reuse the object between iterations)
            table1Values.add(new Text(value.getFirst()));
        } else {
            // streaming: join each value of the other table against the buffered values
            for (Text val : table1Values) {
                output.collect(key.getFirst(), new Text(val.toString() + "\t" + value.getFirst().toString()));
            }
        }
    }
}
You can use the below hint to specify which of the joined tables would be streamed on reduce side:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
Hadoop Streaming in general refers to using custom made python or shell scripts to perform your map-reduce logic. ( For example, using the Hive TRANSFORM keyword.)
Hadoop buffering, in this context, refers to the phase in a map-reduce job of a Hive query with a join when records are read into the reducers, after having been sorted and grouped coming out of the mappers. The author is explaining why you should order the join clauses in a Hive query so that the largest tables are last: it helps optimize the implementation of joins in Hive.
They are completely different concepts.
In response to your comments:
In Hive's join implementation, it must take records from multiple tables, sort them by the join key, and then collate them together in the proper order. The reducer reads them grouped by table, so it has to see the groups from the different tables and can only start processing once all tables have been seen. The groups from the first tables need to be buffered (kept in memory) because they cannot be processed until the last table is seen. The last table can be streamed (each row processed as it is read), since the other tables' groups are already in memory and the join can start.
