Kafka Streams Join after aggregation not working with multiple partitions - apache-kafka-streams

Problem Statement:
Topic 1: "key = empId, value = empname, deptName, ..."
Topic 2: "key = deptName, value = deptName"
I need the data from Topic 1 where the deptName (a value attribute in Topic 1) is equal to the key of Topic 2.
Steps:
Create a stream from Topic 1, group it by deptName, and do an aggregation.
This returns a KTable (key = deptName, value = "empId1,empId2,empId3, ...").
Create a stream from Topic 2 (key = "deptName", value = "deptName").
Perform a left join on the KTable (Step 1) and the KStream (Step 2). (KStream-KTable)
The join returns the desired result.
Everything works as expected with a single partition; however, after switching to multiple partitions, the join doesn't return any data.
Step 1:
KGroupedStream<String, Object> groupedStream =
        adStream.groupBy((key, value) -> value.getOrganizationId().toString());
groupedStream
        .aggregate(() -> "",
                (aggKey, newValue, aggValue) -> addCurrentValue(aggValue,
                        String.valueOf(newValue.getOriginId())),
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("aggregated-stream-store")
                        .withKeySerde(strSerde)
                        .withValueSerde(strSerde))
        .toStream().to(Constant.AD_AGGREGATED_DATA, Produced.with(strSerde, strSerde));
Step 2:
KStream<String, String> swgOrgStream = builder.stream(Constant.SWG_ORG_TOPIC,Consumed.with(strSerde, strSerde));
Step 3:
KStream<String, String> filteredOrgStream = swgOrgStream.leftJoin(aggregatedTable,
(leftValue, rightValue) -> rightValue);
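No answer is reproduced here, but the likely culprit worth checking is Kafka Streams' co-partitioning requirement: the two topics feeding a KStream-KTable join must have the same number of partitions and be written with the same key and partitioner. A minimal sketch of the table side of the join, assuming aggregatedTable in Step 3 is built from the aggregated topic:
// Assumed construction of the table side (not shown in the question).
// For the leftJoin above to match with multiple partitions,
// Constant.AD_AGGREGATED_DATA and Constant.SWG_ORG_TOPIC must be created
// with the same partition count, and both must be keyed by deptName.
KTable<String, String> aggregatedTable =
        builder.table(Constant.AD_AGGREGATED_DATA, Consumed.with(strSerde, strSerde));
With a single partition, co-partitioning holds trivially, which would explain why the join only breaks after switching to multiple partitions.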

Related

Kafka Streams Reduce vs Suppress

While reading up on the suppress() documentation, I saw that the time window will not advance unless records are being published to the topic, because it is based on event time. Right now my code outputs the final value for each key, because traffic on the topic is constant, but there are downtimes when that system is brought down, causing existing records in the state store to be "frozen". I was wondering what the difference is between having just reduce() instead of reduce().suppress(). Does reduce() act like suppress() in that both are event-time driven? My understanding is that both do the same thing: aggregate the keys within a certain time window.
My topology is the following:
final Map<String, String> serdeConfig = Collections.singletonMap("schema.registry.url", schemaRegistryUrl);
final Serde<EligibilityKey> keySpecificAvroSerde = new SpecificAvroSerde<EligibilityKey>();
keySpecificAvroSerde.configure(serdeConfig, true);
final Serde<Eligibility> valueSpecificAvroSerde = new SpecificAvroSerde<Eligibility>();
valueSpecificAvroSerde.configure(serdeConfig, false);
// KStream<EligibilityKey, Eligibility>
KStream<EligibilityKey, Eligibility> kStreamInput = builder.stream(input,
Consumed.with(keySpecificAvroSerde, valueSpecificAvroSerde));
// KStream<EligibilityKey, String>
KStream<EligibilityKey, String> kStreamMapValues = kStreamInput
.mapValues((key, value) -> Processor.process(key, value));
// WindowBytesStoreSupplier
WindowBytesStoreSupplier windowBytesStoreSupplier = Stores.inMemoryWindowStore("in-mem",
Duration.ofSeconds(retentionPeriod), Duration.ofSeconds(windowSize), false);
// Materialized
Materialized<EligibilityKey, String, WindowStore<Bytes, byte[]>> materialized = Materialized
        .<EligibilityKey, String>as(windowBytesStoreSupplier)
        .withKeySerde(keySpecificAvroSerde)
        .withValueSerde(Serdes.String());
// TimeWindows
TimeWindows timeWindows = TimeWindows.of(Duration.ofSeconds(size)).advanceBy(Duration.ofSeconds(advance))
.grace(Duration.ofSeconds(afterWindowEnd));
// KTable<Windowed<EligibilityKey>, String>
KTable<Windowed<EligibilityKey>, String> kTable = kStreamMapValues
.groupByKey(Grouped.with(keySpecificAvroSerde, Serdes.String())).windowedBy(timeWindows)
.reduce((a, b) -> b, materialized.withLoggingDisabled().withRetention(Duration.ofSeconds(retention)))
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded().withLoggingDisabled()));
// KStream<Windowed<EligibilityKey>, String>
KStream<Windowed<EligibilityKey>, String> kStreamOutput = kTable.toStream();
By using reduce() without suppress(), the result of the aggregation is updated continuously; i.e., updates to the KTable that holds the results of the reduce() are sent downstream even before all records of a window have been processed.
Assume a reduce() that just sums up the values in a window of duration 3 with grace 0, and the following input records (key, value, timestamp) to reduce():
input record (A, 1, 1) of W1 -> output record ((W1,A), 1) is sent downstream
input record (A, 2, 2) of W1 -> output record ((W1,A), 3) is sent downstream
input record (A, 3, 3) of W1 -> output record ((W1,A), 6) is sent downstream
input record (A, 4, 4) of W2 -> output record ((W2,A), 4) is sent downstream
With reduce().suppress(), the results are buffered until the window closes. The result would be:
input record (A, 1, 1) of W1 -> no output
input record (A, 2, 2) of W1 -> no output
input record (A, 3, 3) of W1 -> no output
input record (A, 4, 4) of W2 -> output record ((W1,A), 6) is sent downstream
Note that for the case without suppress() I assumed that the cache is switched off with cache.max.bytes.buffering = 0. With cache.max.bytes.buffering > 0 (the default is 10MB), the cache buffers output records of a KTable, and once the cache is full, it outputs the record with the key that was least recently updated.
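For reference, this is how the cache can be disabled in the application configuration (a minimal sketch; the application id and bootstrap address are placeholders):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "suppress-demo");     // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
// disable the record cache so every KTable update is forwarded downstream
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
KafkaStreams streams = new KafkaStreams(builder.build(), props);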

Should TimeWindows be the same when joining two KTables derived from TimeWindows

I use two different retention times for the two KTables, and it works with RocksDB state stores and changelog Kafka topics.
Each KTable is generated from a KStream via groupByKey() and then windowedBy().
I believe that when joining KStreams with windowing, the TimeWindows are the same. I'm wondering whether there is any benefit or drawback if the TimeWindows parameters differ when joining two KTables windowed by TimeWindows.
code snippet:
final KStream<Integer, String> eventStream = builder.stream("events",
Consumed.with(Serdes.Integer(), Serdes.String())
.withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST));
final KTable<Windowed<Integer>, String> eventWindowTable = eventStream.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofSeconds(60)).until(Duration.ofSeconds(100).toMillis()))
.reduce((oldValue, newValue) -> newValue);
final KStream<Integer, String> clickStream = builder.stream("clicks",
Consumed.with(Serdes.Integer(), Serdes.String())
.withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST));
final KTable<Windowed<Integer>, String> clickWindowTable = clickStream.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofSeconds(30)).until(Duration.ofSeconds(70).toMillis()))
.reduce((oldValue, newValue) -> newValue);
final KTable<Windowed<Integer>, String> join = eventWindowTable.leftJoin(clickWindowTable,
(event, click) -> event + " ; " + click + " ; " + Instant.now()
);
Initially I thought joining two KTables with different TimeWindows parameters would not work, because the join relies on the time-windowed key, i.e., a key for the time slot. But after testing, it works as well.
The join is executed because the type of both keys is the same: Windowed<Integer>. The join will of course only produce a result if the keys are equal. Assume you have the following windows (note that only the window start timestamp is stored for TimeWindows):
eventWindowTable: <A,0> <A,60>
clickWindowTable: <A,0> <A,30> <A,60> <A,90>
For this case, only <A,0> and <A,60> would join. Hence, having different windows does impact your result, because the window start timestamp is part of the key, and some windows will never join (e.g., <A,30> and <A,90> in our example).
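If the intent is for every window on one side to find a join partner on the other, a minimal sketch is to share a single TimeWindows definition between both tables, so the window start timestamps, and therefore the Windowed<Integer> keys, stay aligned (reusing the streams from the snippet above):
// one shared window definition for both sides of the join
final TimeWindows sharedWindows = TimeWindows.of(Duration.ofSeconds(60));
final KTable<Windowed<Integer>, String> eventWindowTable = eventStream.groupByKey()
        .windowedBy(sharedWindows)
        .reduce((oldValue, newValue) -> newValue);
final KTable<Windowed<Integer>, String> clickWindowTable = clickStream.groupByKey()
        .windowedBy(sharedWindows)
        .reduce((oldValue, newValue) -> newValue);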

Is it possible for a Kafka Streams application to write multiple outputs from a single input?

I'm unsure if kafka-streams is the correct solution for a problem I'm trying to solve. I'd like to be able to use it because of the parallelism and fault tolerance it provides, but I'm struggling to come up with a way to achieve a desired processing pipeline.
The pipeline is something like this:
A record of some type arrives on an input topic
Information in this record is used to perform a database query, which returns many results
I'd like to be able to write out each result as an individual record, with its own key, rather than as a collection of results in a single record.
Ignoring the single output record per result requirement for a moment, I have code that looks like this:
Serde<String> stringSerde = Serdes.String();
JsonSerde<MyInput> inputSerde = new JsonSerde<>();
JsonSerde<List<MyOutput>> outputSerde = new JsonSerde<>();
Consumed<String, MyInput> consumer = Consumed.with(stringSerde, inputSerde);
KStream<String, MyInput> receiver = builder.stream("input-topic", consumer);
KStream<String, List<MyOutput>> outputs = receiver.mapValues(this::mapInputToManyOutputs);
outputs.to("output-topic", Produced.with(stringSerde, outputSerde));
This is simple enough: one message in, one message (albeit a collection) out.
What I'd like to be able to do is something like:
Serde<String> stringSerde = Serdes.String();
JsonSerde<MyInput> inputSerde = new JsonSerde<>();
JsonSerde<MyOutput> outputSerde = new JsonSerde<>();
Consumed<String, MyInput> consumer = Consumed.with(stringSerde, inputSerde);
KStream<String, MyInput> receiver = builder.stream("input-topic", consumer);
KStream<String, List<MyOutput>> outputs = receiver.mapValues(this::mapInputToManyOutputs);
KStream<String, MyOutput> sink = outputs.???
sink.to("output-topic", Produced.with(stringSerde, outputSerde));
I cannot come up with anything sensible for an operation or operations to perform on the outputs stream.
Any suggestions? Or is kafka-streams maybe not the right solution to a problem like this?
Yes, it's possible. For that you need to use the KStream flatMap transformation: flatMap transforms each record of the input stream into zero or more records in the output stream (both the key and the value type can be altered arbitrarily).
kStream = kStream.flatMap(
        (key, value) -> {
            List<KeyValue<String, MyOutput>> result = new ArrayList<>();
            // do your logic here
            return result;
        });
kStream.to("output-topic", Produced.with(stringSerde, outputSerde));
Thanks, Vasiliy, flatMap was indeed what I needed. I looked at it earlier and thought it was the right operation, but then got confused and mistakenly discarded it.
Combining what I had before with your suggestion, the following works, assuming MyOutput implements a method called getKey():
Serde<String> stringSerde = Serdes.String();
JsonSerde<MyInput> inputSerde = new JsonSerde<>();
JsonSerde<MyOutput> outputSerde = new JsonSerde<>();
Consumed<String, MyInput> consumer = Consumed.with(stringSerde, inputSerde);
KStream<String, MyInput> receiver = builder.stream("input-topic", consumer);
KStream<String, List<MyOutput>> outputs = receiver.mapValues(this::mapInputToManyOutputs);
KStream<String, MyOutput> sink = outputs.flatMap((key, value) ->
        value.stream().map(o -> new KeyValue<>(o.getKey(), o)).collect(Collectors.toList()));
sink.to("output-topic", Produced.with(stringSerde, outputSerde));
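As a follow-up, the intermediate list-valued stream (and the serde for List<MyOutput>) can be skipped entirely. A sketch, assuming mapInputToManyOutputs takes a MyInput and returns a List<MyOutput> as in the question:
// one-step variant: flatMap directly from the input record
KStream<String, MyOutput> sinkDirect = receiver.flatMap((key, input) ->
        mapInputToManyOutputs(input).stream()
                .map(o -> new KeyValue<>(o.getKey(), o))
                .collect(Collectors.toList()));
sinkDirect.to("output-topic", Produced.with(stringSerde, outputSerde));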

Can KStream.to() and StreamsBuilder.table() use the same topic in the same StreamsBuilder in Kafka Streams?

As the title says, the Java code is as follows:
StreamsBuilder builder = new StreamsBuilder();
KStream<String, City> citesStream = builder.stream("cities"
, Consumed.with(Serdes.String(), SerdesFactory.serdesFrom(City.class)));
citesStream.filter((name, city) -> city.getParentId() != 0).to("citiesExcludeProvince"
, Produced.with(Serdes.String(), SerdesFactory.serdesFrom(City.class)));
KTable<String, City> allCityTable = builder.table("citiesExcludeProvince"
, Consumed.with(Serdes.String(), SerdesFactory.serdesFrom(City.class)));
I want to filter some cities, save them to another Kafka topic, and then read that topic back as a KTable for a join, as follows:
KStream<String, City> provinceStream = citesStream
.filter((name, city) -> city.getParentId() == 0);
provinceStream.leftJoin(allCityTable, (province, city) -> {
System.out.println(JsonUtil.objectToJson(province));
System.out.println(JsonUtil.objectToJson(city));
if (province != null && city != null) {
if (city.getParentId() == province.getId()) {
if (province.getChildren() == null) {
province.setChildren(Lists.newArrayList());
}
province.getChildren().add(city);
}
}
return province;
}).to("provinceWithCity", Produced.with(Serdes.String(), SerdesFactory.serdesFrom(City.class)));
But the citiesExcludeProvince topic is always empty. Where is the error?
Can KStream.to() and StreamsBuilder.table() use the same topic in the same StreamsBuilder in Kafka Streams?
Yes, you can use an input topic for StreamsBuilder.table() that is an output topic from KStream.to(). StreamsBuilder doesn't allow certain types of cycles, but cycles like these, which run through a topic, are allowed. In this regard, I don't think there's anything wrong with your code.
I want to filter some cities and save them to another Kafka topic and then read that topic back as a KTable for a join ... But the citiesExcludeProvince topic is always empty. Where is the error?
There are several problems with your code:
Cities that are arriving at the join are not keyed by province ID, so the join will never happen.
Even if the cities were keyed by province ID, every city that arrived at the right-side table would overwrite any previous city with the same key. This is because a table is a changelog of values by key. If multiple cities belonging to one province arrive in a stream keyed by province ID, the table will only hold the last one to arrive.
The right-side table doesn't trigger computation. This is a KStream-KTable join, and the semantics of such a join are that only events on the left cause processing; events on the right are merely stored in the table. (On a related note, you can't really use KStream-KTable joins to process historical data. When you start your Kafka Streams application, its consumer reads all your input topics. If it reads the topic that creates provinceStream before the contents of allCityTable, your provinces won't find anything in allCityTable because it will still be empty.)
The left side will never be null, so you don't need that check.
I think this is what you are looking for:
// Step 1: aggregate all cities by province ID into a list
// (Grouped/Materialized serde configuration omitted for brevity)
KTable<Integer, ArrayList<City>> citiesByProvince = citesStream
        .filter((name, city) -> city.getParentId() != 0)
        .groupBy((k, v) -> v.getParentId())
        .aggregate(ArrayList::new,
                (k, v, a) -> {
                    a.add(v);
                    return a;
                });
// Step 2: re-key the provinces by province ID and join
provinceStream
        .groupBy((k, v) -> v.getId())
        .reduce((a, b) -> a)
        .join(citiesByProvince, (province, cities) -> {
            province.setChildren(cities);
            return province;
        });
Step 1: aggregate all cities by province ID into a list. The resulting table is keyed by province ID.
Step 2: turn the provinces into a table keyed by province ID (you could do this equivalently by writing the contents of provinceStream to a topic and reading it back with StreamsBuilder.table(), but groupBy() -> reduce() does the same thing here) and then perform the join.
Unlike your KStream-KTable join, the KTable-KTable join is not sensitive to the order in which records arrive from the underlying consumer, so you'll get deterministic results.
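As an aside (not part of the original answer): if the right-hand side really must be fully loaded before the left side is processed, for example when replaying historical data, a GlobalKTable is the usual tool, since it is bootstrapped completely on startup and also lifts the co-partitioning requirement. A sketch, reusing the serdes from the question:
// read the filtered-cities topic as a GlobalKTable instead of a KTable
GlobalKTable<String, City> allCityGlobal = builder.globalTable("citiesExcludeProvince",
        Consumed.with(Serdes.String(), SerdesFactory.serdesFrom(City.class)));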

Using Java 8 streams for aggregating list objects

We are using three lists, listA, listB, and listC, to keep the marks of 10 students in three subjects (A, B, C).
Subjects B and C are optional, so only a few of the 10 students have marks in those subjects.
class Student {
    String studentName;
    int marks;
}
listA has records for 10 students, listB for 5, and listC for 3 (which is also the size of each list).
I want to know how we can sum up the marks of the students across their subjects using Java 8 streams.
I tried the following
List<Integer> list = IntStream.range(0, listA.size() - 1)
        .mapToObj(i -> listA.get(i).getMarks() +
                listB.get(i).getMarks() +
                listC.get(i).getMarks())
        .collect(Collectors.toList());
There are two issues with this:
a) It will throw an IndexOutOfBoundsException, as listB and listC don't have 10 elements.
b) The returned list is of type Integer, and I want it to be of type Student.
Any inputs will be very helpful.
You can make a stream of the 3 lists and then call flatMap to put all the lists' elements into a single stream. That stream will contain one element per student per mark, so you will have to aggregate the result by student name. Something along the lines of:
Map<String, Integer> studentMap = Stream.of(listA, listB, listC)
        .flatMap(Collection::stream)
        .collect(groupingBy(student -> student.studentName, summingInt(student -> student.marks)));
Alternatively, if your Student class has getters for its fields, you can change the last line to make it more readable:
Map<String, Integer> studentMap = Stream.of(listA, listB, listC)
        .flatMap(Collection::stream)
        .collect(groupingBy(Student::getName, summingInt(Student::getMarks)));
Then check the result by printing out the studentMap:
studentMap.forEach((key, value) -> System.out.println(key + " - " + value));
If you want to create a list of Student objects instead, you can take the result of the first map and create a new stream from its entries (this particular example assumes your Student class has an all-args constructor so you can one-line it):
List<Student> studentList = Stream.of(listA, listB, listC)
        .flatMap(Collection::stream)
        .collect(groupingBy(Student::getName, summingInt(Student::getMarks)))
        .entrySet().stream()
        .map(mapEntry -> new Student(mapEntry.getKey(), mapEntry.getValue()))
        .collect(toList());
I would do it as follows:
Map<String, Student> result = Stream.of(listA, listB, listC)
        .flatMap(List::stream)
        .collect(Collectors.toMap(
                Student::getName,                            // key: student's name
                s -> new Student(s.getName(), s.getMarks()), // value: new Student
                (s1, s2) -> {                                // merge students with same name: sum marks
                    s1.setMarks(s1.getMarks() + s2.getMarks());
                    return s1;
                }));
Here I've used Collectors.toMap to create the map (I've also assumed you have a constructor for Student that receives a name and marks).
This version of Collectors.toMap expects three arguments:
A function that returns the key for each element (here it's Student::getName)
A function that returns the value for each element (I've created a new Student instance that is a copy of the original element, so as not to modify instances from the original stream)
A merge function that is used when there are elements with the same key, i.e., for students with the same name (I've summed the marks here).
If you could add the following copy constructor and method to your Student class:
public Student(Student another) {
    this.studentName = another.studentName;
    this.marks = another.marks;
}

public Student merge(Student another) {
    this.marks += another.marks;
    return this;
}
Then you could rewrite the code above in this way:
Map<String, Student> result = Stream.of(listA, listB, listC)
        .flatMap(List::stream)
        .collect(Collectors.toMap(
                Student::getName,
                Student::new,
                Student::merge));
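A quick way to check the result, assuming the getters used above:
// print each merged Student's name and total marks
result.values().forEach(s -> System.out.println(s.getName() + " - " + s.getMarks()));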
