Should TimeWindows be the same when joining two KTables derived from TimeWindows - apache-kafka-streams

I use two different retention times for two different KTables, and this works with RocksDB state stores and changelog Kafka topics.
Each KTable is generated from a KStream by grouping and then calling windowedBy.
I believe that when joining KStreams with windowing, the TimeWindows are the same. I'm wondering whether there is any benefit or drawback if the TimeWindows parameters differ when joining two KTables windowed by TimeWindows.
code snippet:
final KStream<Integer, String> eventStream = builder.stream("events",
        Consumed.with(Serdes.Integer(), Serdes.String())
                .withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST));
final KTable<Windowed<Integer>, String> eventWindowTable = eventStream.groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofSeconds(60)).until(Duration.ofSeconds(100).toMillis()))
        .reduce((oldValue, newValue) -> newValue);
final KStream<Integer, String> clickStream = builder.stream("clicks",
        Consumed.with(Serdes.Integer(), Serdes.String())
                .withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST));
final KTable<Windowed<Integer>, String> clickWindowTable = clickStream.groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofSeconds(30)).until(Duration.ofSeconds(70).toMillis()))
        .reduce((oldValue, newValue) -> newValue);
final KTable<Windowed<Integer>, String> join = eventWindowTable.leftJoin(clickWindowTable,
        (event, click) -> event + " ; " + click + " ; " + Instant.now()
);
Initially I thought that joining two KTables with different TimeWindows parameters would not work, because the join relies on TimeWindowedKey, a key for the time slot. But after testing, it works as well.

The join is executed because the type of both keys is the same: Windowed<Integer>. The join will of course only produce a result if the keys are the same. Assume you have the following windows (note that only the window start timestamp is stored for TimeWindows):
eventWindowTable: <A,0> <A,60>
clickWindowTable: <A,0> <A,30> <A,60> <A,90>
For this case, only <A,0> and <A,60> would join. Hence, having different windows does impact your result: the window start timestamp is part of the key, so some windows will never join (e.g., <A,30> and <A,90> in this example).
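The key overlap described above can be checked with a small plain-Java sketch. It does not use the Kafka Streams API; the class and method names are hypothetical, and it assumes tumbling windows aligned to timestamp 0, with sizes given in seconds.

```java
import java.util.ArrayList;
import java.util.List;

final class WindowKeyOverlap {

    // Start timestamps of tumbling windows of the given size within [0, horizon).
    static List<Long> windowStarts(long size, long horizon) {
        List<Long> starts = new ArrayList<>();
        for (long start = 0; start < horizon; start += size) {
            starts.add(start);
        }
        return starts;
    }

    // Window starts present on both sides, i.e. the only keys that can match in the join.
    static List<Long> joinableStarts(long leftSize, long rightSize, long horizon) {
        List<Long> joinable = new ArrayList<>(windowStarts(leftSize, horizon));
        joinable.retainAll(windowStarts(rightSize, horizon));
        return joinable;
    }

    public static void main(String[] args) {
        // 60-second windows on the left, 30-second windows on the right, over 120 seconds.
        System.out.println(joinableStarts(60, 30, 120));
    }
}
```

With these sizes the right side produces the starts 0, 30, 60 and 90, but only 0 and 60 also exist on the left, which matches the example above.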

Related

Is it possible for a Kafka Streams application to write multiple outputs from a single input?

I'm unsure if kafka-streams is the correct solution for a problem I'm trying to solve. I'd like to be able to use it because of the parallelism and fault tolerance it provides, but I'm struggling to come up with a way to achieve a desired processing pipeline.
The pipeline is something like this:
A record of some type arrives on an input topic
Information in this record is used to perform a database query, which returns many results
I'd like to be able to write out each result as an individual record, with its own key, rather than as a collection of results in a single record.
Ignoring the single output record per result requirement for a moment, I have code that looks like this:
Serde<String> stringSerde = Serdes.String();
JsonSerde<MyInput> inputSerde = new JsonSerde<>();
JsonSerde<List<MyOutput>> outputSerde = new JsonSerde<>();
Consumed<String, MyInput> consumer = Consumed.with(stringSerde, inputSerde);
KStream<String, MyInput> receiver = builder.stream("input-topic", consumer);
KStream<String, List<MyOutput>> outputs = receiver.mapValues(this::mapInputToManyOutputs);
outputs.to("output-topic", Produced.with(stringSerde, outputSerde));
This is simple enough, 1 message in, 1 message (albeit a collection) out.
What I'd like to be able to do is something like:
Serde<String> stringSerde = Serdes.String();
JsonSerde<MyInput> inputSerde = new JsonSerde<>();
JsonSerde<MyOutput> outputSerde = new JsonSerde<>();
Consumed<String, MyInput> consumer = Consumed.with(stringSerde, inputSerde);
KStream<String, MyInput> receiver = builder.stream("input-topic", consumer);
KStream<String, List<MyOutput>> outputs = receiver.mapValues(this::mapInputToManyOutputs);
KStream<String, MyOutput> sink = outputs.???
sink.to("output-topic", Produced.with(stringSerde, outputSerde));
I cannot come up with anything sensible for an operation or operations to perform on the outputs stream.
Any suggestions? Or is kafka-streams maybe not the right solution to a problem like this?
Yes, it's possible. For that you need to use the KStream flatMap transformation. flatMap transforms each record of the input stream into zero or more records in the output stream (both key and value type can be altered arbitrarily).
kStream = kStream.flatMap(
(key, value) -> {
List<KeyValue<String, MyOutput>> result = new ArrayList<>();
// do your logic here
return result;
});
kStream.to("output-topic", Produced.with(stringSerde, outputSerde));
Thanks, Vasiliy, flatMap was indeed what I needed. I looked at it earlier, thought it was the right operation but then got confused and mistakenly discarded it.
Combining what I had before with your suggestion, the following works, assuming MyOutput implements a method called getKey():
Serde<String> stringSerde = Serdes.String();
JsonSerde<MyInput> inputSerde = new JsonSerde<>();
JsonSerde<MyOutput> outputSerde = new JsonSerde<>();
Consumed<String, MyInput> consumer = Consumed.with(stringSerde, inputSerde);
KStream<String, MyInput> receiver = builder.stream("input-topic", consumer);
KStream<String, List<MyOutput>> outputs = receiver.mapValues(this::mapInputToManyOutputs);
KStream<String, MyOutput> sink = outputs.flatMap(((key, value) ->
value.stream().map(o -> new KeyValue<>(o.getKey(), o)).collect(Collectors.toList())));
sink.to("output-topic", Produced.with(stringSerde, outputSerde));

Caching Java 8 stream

Suppose I have a list which I perform multiple stream operations on.
bobs = myList.stream()
.filter(person -> person.getName().equals("Bob"))
.collect(Collectors.toList())
...
and
tonies = myList.stream()
.filter(person -> person.getName().equals("tony"))
.collect(Collectors.toList())
Can I not just do:
Stream<Person> stream = myList.stream();
which then means I can do:
bobs = stream.filter(person -> person.getName().equals("Bob"))
.collect(Collectors.toList())
tonies = stream.filter(person -> person.getName().equals("tony"))
.collect(Collectors.toList())
No, you can't. A Stream can be used only once. If you try to reuse it, it throws the following error:
java.lang.IllegalStateException: stream has already been operated upon or closed
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:229)
As per Java Docs:
A stream should be operated on (invoking an intermediate or terminal stream operation) only once.
But a neat solution is to use a Stream Supplier. It looks like this:
Supplier<Stream<Person>> streamSupplier = myList::stream;
bobs = streamSupplier.get().filter(person -> person.getName().equals("Bob"))
        .collect(Collectors.toList());
tonies = streamSupplier.get().filter(person -> person.getName().equals("tony"))
        .collect(Collectors.toList());
But again, every get() call returns a new stream.
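Both behaviours can be reproduced in a minimal sketch. The class and method names below are made up for illustration, and plain strings stand in for Person objects.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class StreamReuseDemo {

    // Returns true if operating on the same Stream a second time is rejected.
    static boolean reuseFails(List<String> names) {
        Stream<String> stream = names.stream();
        stream.filter(n -> n.equals("Bob")).collect(Collectors.toList());
        try {
            // The stream was already consumed above, so this is expected to throw.
            stream.filter(n -> n.equals("tony")).collect(Collectors.toList());
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Each get() call builds a fresh Stream over the same list, so it can be called repeatedly.
    static long countWithSupplier(List<String> names, String wanted) {
        Supplier<Stream<String>> supplier = names::stream;
        return supplier.get().filter(n -> n.equals(wanted)).count();
    }
}
```

The supplier does not cache anything: it only defers the call to names.stream(), so each pipeline still iterates the list from the start.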
No you can't, doc says:
A stream should be operated on (invoking an intermediate or terminal
stream operation) only once.
But you can use a single stream by filtering all elements you want once and then group them the way you need:
Set<String> names = ...; // construct a set containing bob, tony, etc.
Map<String, List<Person>> r = myList.stream()
        .filter(p -> names.contains(p.getName()))
        .collect(Collectors.groupingBy(Person::getName));
List<Person> tonies = r.get("tony");
List<Person> bobs = r.get("bob");
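A self-contained version of this single-pass grouping might look like the sketch below. The Person class here is a hypothetical minimal stand-in with only a name and an age.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class GroupOnceSketch {

    // Hypothetical minimal Person; the real class is not shown in the question.
    static final class Person {
        private final String name;
        private final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        int getAge() { return age; }
    }

    // Filter once, then group the remaining people by name in the same pass.
    static Map<String, List<Person>> groupByName(List<Person> people, Set<String> names) {
        return people.stream()
                .filter(p -> names.contains(p.getName()))
                .collect(Collectors.groupingBy(Person::getName));
    }
}
```

Note that a name with no matching person is simply absent from the map, so get returns null for it; getOrDefault with an empty list avoids a null check.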
Well, what you can do in your case is generate stream pipelines dynamically, assuming that the only variable in your pipeline is the name of the person you filter by.
We can represent this as a Function<String, Stream<Person>>, as in the following:
final Function<String, Stream<Person>> pipelineGenerator = name -> persons.stream().filter(person -> Objects.equals(person.getName(), name));
final List<Person> bobs = pipelineGenerator.apply("bob").collect(Collectors.toList());
final List<Person> tonies = pipelineGenerator.apply("tony").collect(Collectors.toList());
As already mentioned a given stream should be operated upon only once.
I can understand the "idea" of caching a reference to an object if you're going to refer to it more than once, or to simply avoid creating more objects than necessary.
However, you should not be concerned when invoking myList.stream() every time you need to query again as creating a stream, in general, is a cheap operation.

Fastest way to convert key value pairs to grouped by key objects map using java 8 stream

Model:
public class AgencyMapping {
private Integer agencyId;
private String scoreKey;
}
public class AgencyInfo {
private Integer agencyId;
private Set<String> scoreKeys;
}
My code:
List<AgencyMapping> agencyMappings;
Map<Integer, AgencyInfo> agencyInfoByAgencyId = agencyMappings.stream()
.collect(groupingBy(AgencyMapping::getAgencyId,
collectingAndThen(toSet(), e -> e.stream().map(AgencyMapping::getScoreKey).collect(toSet()))))
.entrySet().stream().map(e -> new AgencyInfo(e.getKey(), e.getValue()))
.collect(Collectors.toMap(AgencyInfo::getAgencyId, identity()));
Is there a way to get the same result with simpler and faster code?
You can replace the call to collectingAndThen(toSet(), e -> e.stream().map(AgencyMapping::getScoreKey).collect(toSet())) with a call to mapping(AgencyMapping::getScoreKey, toSet()).
Map<Integer, AgencyInfo> resultSet = agencyMappings.stream()
.collect(groupingBy(AgencyMapping::getAgencyId,
mapping(AgencyMapping::getScoreKey, toSet())))
.entrySet()
.stream()
.map(e -> new AgencyInfo(e.getKey(), e.getValue()))
.collect(toMap(AgencyInfo::getAgencyId, identity()));
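To see the mapping downstream collector in isolation, here is a minimal sketch that stops at the intermediate Map<Integer, Set<String>>. It assumes AgencyMapping has a two-argument constructor and getters, which the question's model does not show.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class MappingCollectorSketch {

    // Assumed shape of AgencyMapping: constructor and getters are not in the original model.
    static final class AgencyMapping {
        private final Integer agencyId;
        private final String scoreKey;

        AgencyMapping(Integer agencyId, String scoreKey) {
            this.agencyId = agencyId;
            this.scoreKey = scoreKey;
        }

        Integer getAgencyId() { return agencyId; }
        String getScoreKey() { return scoreKey; }
    }

    // Group by agency ID and collect only the score keys of each group into a set.
    static Map<Integer, Set<String>> scoreKeysByAgency(List<AgencyMapping> mappings) {
        return mappings.stream()
                .collect(Collectors.groupingBy(
                        AgencyMapping::getAgencyId,
                        Collectors.mapping(AgencyMapping::getScoreKey, Collectors.toSet())));
    }
}
```

Because mapping extracts the score key before the elements reach toSet, no intermediate Set<AgencyMapping> is built per group.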
A different way to see it using a toMap collector:
Map<Integer, AgencyInfo> resultSet = agencyMappings.stream()
.collect(toMap(AgencyMapping::getAgencyId, // key extractor
e -> new HashSet<>(singleton(e.getScoreKey())), // value extractor
(left, right) -> { // a merge function, used to resolve collisions between values associated with the same key
left.addAll(right);
return left;
}))
.entrySet()
.stream()
.map(e -> new AgencyInfo(e.getKey(), e.getValue()))
.collect(toMap(AgencyInfo::getAgencyId, identity()));
The latter example is arguably more complicated than the former. Nevertheless, your approach is pretty much the way to go apart from using mapping as opposed to collectingAndThen as mentioned above.
Apart from that, I don't see anything else you can simplify with the code shown.
As for faster code, if you're suggesting that your current approach is slow in performance then you may want to read the answers here that speak about when you should consider going parallel.
You are collecting to an intermediate map, then streaming the entries of this map to create AgencyInfo instances, which are finally collected to another map.
Instead of all this, you could use Collectors.toMap to collect directly to a map, mapping each AgencyMapping object to the desired AgencyInfo and merging the scoreKeys as needed:
Map<Integer, AgencyInfo> agencyInfoByAgencyId = agencyMappings.stream()
.collect(Collectors.toMap(
AgencyMapping::getAgencyId,
mapping -> new AgencyInfo(
mapping.getAgencyId(),
new HashSet<>(Set.of(mapping.getScoreKey()))),
(left, right) -> {
left.getScoreKeys().addAll(right.getScoreKeys());
return left;
}));
This works by grouping the AgencyMapping elements of the stream by AgencyMapping::getAgencyId, but storing AgencyInfo objects in the map instead. We get these AgencyInfo instances from manually mapping each original AgencyMapping object. Finally, we're merging AgencyInfo instances that are already in the map by means of a merge function that folds left scoreKeys from one AgencyInfo to another.
I'm using Java 9's Set.of to create a singleton set. If you don't have Java 9, you can replace it with Collections.singleton.
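For reference, here is a self-contained sketch of the same direct toMap approach, written with Collections.singleton so it also compiles on Java 8. The constructors and getters on AgencyMapping and AgencyInfo are assumptions, since the question's model shows only the fields.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class DirectToMapSketch {

    // Assumed constructor and getters; the original model lists only the fields.
    static final class AgencyMapping {
        private final Integer agencyId;
        private final String scoreKey;

        AgencyMapping(Integer agencyId, String scoreKey) {
            this.agencyId = agencyId;
            this.scoreKey = scoreKey;
        }

        Integer getAgencyId() { return agencyId; }
        String getScoreKey() { return scoreKey; }
    }

    static final class AgencyInfo {
        private final Integer agencyId;
        private final Set<String> scoreKeys;

        AgencyInfo(Integer agencyId, Set<String> scoreKeys) {
            this.agencyId = agencyId;
            this.scoreKeys = scoreKeys;
        }

        Integer getAgencyId() { return agencyId; }
        Set<String> getScoreKeys() { return scoreKeys; }
    }

    // Collect straight into the final map; the merge function combines score keys on key collisions.
    static Map<Integer, AgencyInfo> infoByAgency(List<AgencyMapping> mappings) {
        return mappings.stream().collect(Collectors.toMap(
                AgencyMapping::getAgencyId,
                // Wrap the immutable singleton in a HashSet so the merge step can add to it.
                m -> new AgencyInfo(m.getAgencyId(),
                        new HashSet<>(Collections.singleton(m.getScoreKey()))),
                (left, right) -> {
                    left.getScoreKeys().addAll(right.getScoreKeys());
                    return left;
                }));
    }
}
```

The HashSet wrapper matters in both variants: Set.of and Collections.singleton return immutable sets, so calling addAll on them directly in the merge function would throw UnsupportedOperationException.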

Can KStream.to() and StreamsBuilder.table() use the same topic in the same StreamsBuilder in Kafka Streams?

As the title says, my Kafka Streams code looks like the following:
StreamsBuilder builder = new StreamsBuilder();
KStream<String, City> citesStream = builder.stream("cities"
, Consumed.with(Serdes.String(), SerdesFactory.serdesFrom(City.class)));
citesStream.filter((name, city) -> city.getParentId() != 0).to("citiesExcludeProvince"
, Produced.with(Serdes.String(), SerdesFactory.serdesFrom(City.class)));
KTable<String, City> allCityTable = builder.table("citiesExcludeProvince"
, Consumed.with(Serdes.String(), SerdesFactory.serdesFrom(City.class)));
I want to filter some cities, save them to another Kafka topic, and then read that topic as a KTable for a join, as follows:
KStream<String, City> provinceStream = citesStream
.filter((name, city) -> city.getParentId() == 0);
provinceStream.leftJoin(allCityTable, (province, city) -> {
System.out.println(JsonUtil.objectToJson(province));
System.out.println(JsonUtil.objectToJson(city));
if (province != null && city != null) {
if (city.getParentId() == province.getId()) {
if (province.getChildren() == null) {
province.setChildren(Lists.newArrayList());
}
province.getChildren().add(city);
}
}
return province;
}).to("provinceWithCity", Produced.with(Serdes.String(), SerdesFactory.serdesFrom(City.class)));
But the citiesExcludeProvince topic is always empty. Where is the error?
Can KStream.to() and StreamsBuilder.table() use the same topic in the same StreamsBuilder in Kafka Streams?
Yes, you can use an input topic for StreamsBuilder.table() that is an output topic from KStream.to(). StreamsBuilder doesn't allow certain types of cycles, but cycles like this one, which run through a topic, are allowed. In this regard, I don't think there's anything wrong with your code.
I want to filter some cities, save them to another Kafka topic, and then read that topic as a KTable for a join ... But the citiesExcludeProvince topic is always empty. Where is the error?
There are several problems with your code:
Cities that are arriving into the join are not keyed by province ID. So the join will never happen.
If the cities were keyed by province ID, every city that arrived to the table on the right would override any previous city that arrived on the right. This is because a table is a changelog of values by key. If there are multiple cities that belong to a province in a stream keyed by province ID, in the table you will only see the last one to arrive.
The right-side table doesn't trigger computation. This is a KStream-KTable join, and the semantics of such a join are that only events on the left cause processing. Events on the right are merely stored in the table. (On a related note, you can't really use KStream-KTable joins to process historical data. When you turn on your Kafka Streams application, it has a consumer that reads all your input topics. If it reads the topic that creates provinceStream before the contents of allCityTable, then your provinces won't find anything in allCityTable because it will still be empty.)
The left-side will never be null (you don't have to do that check).
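The second point, that a table keeps only the latest value per key, can be illustrated without Kafka. The sketch below is a plain-Java simulation with hypothetical names: a HashMap stands in for the table, and City is reduced to a parent ID and a name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TableSemanticsSketch {

    // Hypothetical reduced City: only the fields needed for the illustration.
    static final class City {
        final int parentId;
        final String name;

        City(int parentId, String name) {
            this.parentId = parentId;
            this.name = name;
        }
    }

    // Table-like view keyed by province ID: a later city replaces the earlier one.
    static Map<Integer, String> latestCityPerProvince(List<City> cities) {
        Map<Integer, String> table = new HashMap<>();
        for (City city : cities) {
            table.put(city.parentId, city.name);
        }
        return table;
    }

    // Aggregated view keyed by province ID: every city is appended to the province's list.
    static Map<Integer, List<String>> allCitiesPerProvince(List<City> cities) {
        Map<Integer, List<String>> aggregate = new HashMap<>();
        for (City city : cities) {
            aggregate.computeIfAbsent(city.parentId, id -> new ArrayList<>()).add(city.name);
        }
        return aggregate;
    }
}
```

The first method mirrors why a plain table lookup would return at most one city per province; the second mirrors the idea of aggregating cities into a list before joining.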
I think this is what you are looking for:
// Step 1
KTable<String, ArrayList<City>> citiesByProvince = citesStream
        .filter((name, city) -> city.getParentId() != 0)
        .groupBy((k, v) -> v.getParentId())
        .aggregate(ArrayList::new,
                (k, v, a) -> {
                    a.add(v);
                    return a;
                });
// Step 2
provinceStream
        .groupBy((k, v) -> v.getId())
        .reduce((a, b) -> a)
        .join(citiesByProvince, (province, cities) -> {
            province.setChildren(cities);
            return province;
        });
Step 1: aggregate all cities by province ID into a list. The resulting list is keyed by province ID.
Step 2: turn the provinces into a table keyed by province ID (you could do this equivalently by writing the contents of provinceStream to a topic and then using StreamsBuilder.table(), but groupBy()->reduce() does the same thing here) and then perform the join.
Unlike your KStream-KTable join, the KTable-KTable join is not sensitive to the order in which records arrive from the underlying consumer, so you'll get deterministic results.

KStream-KTable join writing to the KTable: How to sync the join with the ktable write?

I'm having some issue with how the following topology behaves:
String topic = config.topic();
KTable<UUID, MyData> myTable = topology.builder().table(UUIDSerdes.get(), GsonSerdes.get(MyData.class), topic);
// Receive a stream of various events
topology.eventsStream()
// Only process events that are implementing MyEvent
.filter((k, v) -> v instanceof MyEvent)
// Cast to ease the code
.mapValues(v -> (MyEvent) v)
// rekey by data id
.selectKey((k, v) -> v.data.id)
.peek((k, v) -> L.info("Event:"+v.action))
// join the event with the according entry in the KTable and apply the state mutation
.leftJoin(myTable, eventHandler::handleEvent, UUIDSerdes.get(), EventSerdes.get())
.peek((k, v) -> L.info("Updated:" + v.id + "-" + v.id2))
// write the updated state to the KTable.
.to(UUIDSerdes.get(), GsonSerdes.get(MyData.class), topic);
My issue happens when I receive different events at the same time. My state mutation is done by the leftJoin and then written by the to method, so the following can occur if event 1 and event 2 are received at the same time with the same key:
event1 joins with state A => state A mutated to state X
event2 joins with state A => state A mutated to state Y
state X written to the KTable topic
state Y written to the KTable topic
Because of that, state Y doesn't have the changes from event1, so I lost data.
Here's in terms of logs what I see (the Processing:... part is logged from inside the value joiner):
Event:Event1
Event:Event2
Processing:Event1, State:none
Updated:1-null
Processing:Event2, State:none
java.lang.IllegalStateException: Event2 event received but we don't have data for id 1
Event1 can be considered the creation event: it creates the entry in the KTable, so it doesn't matter if the state is empty. Event2, though, needs to apply its changes to an existing state, but it doesn't find any because the first state mutation hasn't been written to the KTable yet (it hasn't been processed by the to method).
Is there any way to make sure that my leftJoin and my writes into the KTable are done atomically?
Thanks
Update & current solution
Thanks to the response from @Matthias, I was able to find a solution using a Transformer.
Here's what the code looks like.
This is the transformer:
public class KStreamStateLeftJoin<K, V1, V2> implements Transformer<K, V1, KeyValue<K, V2>> {
    private final String stateName;
    private final ValueJoiner<V1, V2, V2> joiner;
    private final boolean updateState;
    private KeyValueStore<K, V2> state;

    public KStreamStateLeftJoin(String stateName, ValueJoiner<V1, V2, V2> joiner, boolean updateState) {
        this.stateName = stateName;
        this.joiner = joiner;
        this.updateState = updateState;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.state = (KeyValueStore<K, V2>) context.getStateStore(stateName);
    }

    @Override
    public KeyValue<K, V2> transform(K key, V1 value) {
        V2 stateValue = this.state.get(key); // Get current state
        V2 updatedValue = joiner.apply(value, stateValue); // Apply join
        if (updateState) {
            this.state.put(key, updatedValue); // write new state
        }
        return new KeyValue<>(key, updatedValue);
    }

    @Override
    public KeyValue<K, V2> punctuate(long timestamp) {
        return null;
    }

    @Override
    public void close() {}
}
And here's the adapted topology:
String topic = config.topic();
String store = topic + "-store";
KTable<UUID, MyData> myTable = topology.builder().table(UUIDSerdes.get(), GsonSerdes.get(MyData.class), topic, store);
// Receive a stream of various events
topology.eventsStream()
// Only process events that are implementing MyEvent
.filter((k, v) -> v instanceof MyEvent)
// Cast to ease the code
.mapValues(v -> (MyEvent) v)
// rekey by data id
.selectKey((k, v) -> v.data.id)
// join the event with the according entry in the KTable and apply the state mutation
.transform(() -> new KStreamStateLeftJoin<UUID, MyEvent, MyData>(store, eventHandler::handleEvent, true), store)
// write the updated state to the KTable.
.to(UUIDSerdes.get(), GsonSerdes.get(MyData.class), topic);
As we're using the KTable's KV StateStore and applying changes directly to it through the put method, events should always pick up the updated state.
One thing I'm still wondering: what if I have a continuous high throughput of events?
Could there still be a race condition between the puts we do on the KTable's KV store and the writes that are done to the KTable's topic?
A KTable is sharded into multiple physical stores, and each store is only updated by a single thread. Thus, the scenario you describe cannot happen. If you have 2 records with the same timestamp that both update the same shard, they will be processed one after another (in offset order). Thus, the second update will see the state after the first update.
So maybe you did not describe your scenario correctly?
Update
You cannot mutate the state when doing a join. Thus, the expectation that
event1 joins with state A => state A mutated to state X
is wrong. Independent of any processing order, when event1 joins with state A, it will access state A in read only mode and state A will not be modified.
Thus, when event2 joins, it will see the same state as event1. For stream-table join, the table state is only updated when new data is read from the table-input-topic.
If you want to have a shared state that is updated from both inputs, you would need to build a custom solution using transform():
builder.addStore(..., "store-name");
builder.stream("table-topic").transform(..., "store-name"); // will not emit anything downstream
KStream result = builder.stream("stream-topic").transform(..., "store-name");
This will create one store that is shared by both processors and both can read/write as they wish. Thus, for the table-input you can just update the state without sending anything downstream, while for the stream-input you can do the join, update the state, and send a result downstream.
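The difference between the read-only join and the shared-store approach can be simulated in plain Java. This is only an illustration of the semantics, not Kafka Streams code: a HashMap stands in for the state store, and all names are hypothetical.

```java
import java.util.Map;

final class SharedStoreSketch {

    // Hypothetical state mutation: the first event creates the state, later events append to it.
    static String apply(String state, String event) {
        return state == null ? event : state + "+" + event;
    }

    // Read-only join: every event sees the table as it was; results are not written back here.
    static String[] joinReadOnly(Map<String, String> table, String key, String... events) {
        String[] results = new String[events.length];
        for (int i = 0; i < events.length; i++) {
            results[i] = apply(table.get(key), events[i]);
        }
        return results;
    }

    // Shared-store variant: each result is written back before the next event is processed.
    static String[] joinAndUpdate(Map<String, String> store, String key, String... events) {
        String[] results = new String[events.length];
        for (int i = 0; i < events.length; i++) {
            String updated = apply(store.get(key), events[i]);
            store.put(key, updated);
            results[i] = updated;
        }
        return results;
    }
}
```

Starting from an empty map, the read-only variant lets the second event see no state at all, which corresponds to the lost update in the question, while the shared-store variant lets it see the result of the first event.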
Update 2
With regard to the solution, there will be no race condition between the updates the Transformer applies to the state and records the Transformer processes after the state update. This part will be executed in a single thread, and records will be processed in offset-order from the input topic. Thus, it's ensured that a state update will be available to later records.
