Is it possible to get top 10 from KTable/KStream? - apache-kafka-streams

I have a topic with a String key, which is a signal type, and a Signal value, which is a class like this:
public class Signal {
public final int deviceId;
public final int value;
...
}
Each device can send signal values that rise or fall over time without any pattern.
Is it possible to get the top 10 devices with the maximum signal value over the whole period of time, per type (the key of the topic), as a KTable<String, Signal>? Would it help if all signal values were always rising?
Topic structure can be changed if needed.

It is possible with Kafka Streams, for example for the case when values are always rising. You need to create your own Top10 aggregate, which stores the top 10 and updates it on each add call:
final var builder = new StreamsBuilder();
final var topTable = builder
        .table(
                SignalChange.TOPIC_NAME,
                Consumed.with(Serdes.String(), new SignalChange.Serde())
        ).toStream()
        .groupByKey()
        .aggregate(
                () -> new Top10(),
                (k, v, top10) -> top10.add(v),
                Materialized.with(Serdes.String(), new Top10.Serde())
        );
topTable can then be joined with any stream that requests the top.
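The answer assumes a hand-written Top10 class with an add method and a matching Serde, neither of which is shown. A minimal sketch of what such a class could look like, keeping the best value seen per device so the current top 10 can be read off at any time (the Serde is omitted; Signal is the class from the question):
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class Top10 {
    // Best (maximum) value seen so far for each device.
    private final Map<Integer, Signal> bestPerDevice = new HashMap<>();

    // Called by the aggregator for every incoming signal; returns this so the
    // instance itself can be used as the aggregate result.
    public Top10 add(Signal signal) {
        Signal current = bestPerDevice.get(signal.deviceId);
        if (current == null || signal.value > current.value) {
            bestPerDevice.put(signal.deviceId, signal);
        }
        return this;
    }

    // The (up to) ten devices with the highest values seen so far.
    public List<Signal> top() {
        return bestPerDevice.values().stream()
                .sorted(Comparator.comparingInt((Signal s) -> s.value).reversed())
                .limit(10)
                .collect(Collectors.toList());
    }
}
Because this sketch keeps the maximum per device, it also works when values fall again, not only when they are always rising.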

Related

How to get size of stream and return the original stream in java8?

I am trying to do something like below:
Stream<Student> allStudent = studentRepo.findAll();
long count = allStudent.count();
then
return allStudent;
But the problem: count() is a terminal operation, and after that I am not able to return the stream.
The reason for doing this is to stream all student records over Kafka and at the same time send the record count to the consumer.
Well, if the stream has the SIZED characteristic, you can get the size from the spliterator object:
Spliterator<Integer> spliterator = stream.spliterator();
long count = spliterator.getExactSizeIfKnown();
...
return StreamSupport.stream(spliterator, stream.isParallel());
But if getExactSizeIfKnown returns -1, either save the stream to an intermediate collection, get its size, and then use the stream() method to return the data, or think of something else.
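As a self-contained sketch of the spliterator approach (the helper name is made up, not from the answer):
import java.util.Spliterator;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Reports the count (exact only when the stream is SIZED, otherwise -1) and
// hands back an equivalent stream built from the same spliterator.
static <T> Stream<T> countAndReturn(Stream<T> stream) {
    Spliterator<T> spliterator = stream.spliterator();
    long count = spliterator.getExactSizeIfKnown();
    System.out.println("count = " + count); // e.g. send this to the consumer
    return StreamSupport.stream(spliterator, stream.isParallel());
}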
Count using peek() into an AtomicInteger, collect to a list, then stream that:
AtomicInteger count = new AtomicInteger();
allStudent = allStudent
        .peek(o -> count.incrementAndGet())
        .collect(toList())
        .stream();
// do something with count
return allStudent;
Or more mundane:
List<Student> students = allStudent.collect(toList());
long count = students.size();
return students.stream();

Assign random UUID on a key's first occurrence in a stream

I'm looking for a solution on how to assign a random UUID to a key only on its first occurrence in a stream.
Example:
time   key   value   assigned uuid
 |      1      A     fff17a1e-9943-11eb-a8b3-0242ac130003
 |      2      B     f01d2c42-9943-11eb-a8b3-0242ac130003
 |      3      C     f8f1e880-9943-11eb-a8b3-0242ac130003
 |      1      X     fff17a1e-9943-11eb-a8b3-0242ac130003 (same as above)
 v      1      Y     fff17a1e-9943-11eb-a8b3-0242ac130003 (same as above)
As you can see, fff17a1e-9943-11eb-a8b3-0242ac130003 is assigned to key "1" on its first occurrence. This UUID is subsequently reused on its second and third occurrences. The order doesn't matter, though. There is no seed for the generated UUID either.
My idea was to use a leftJoin() with a KStream and a KTable with key/uuid mappings. If the right side of the leftJoin is null I have to create a new UUID and add it to the mapping table. However, I think this does not work when there are several new entries with the same key in a short period of time. I guess this will create several UUIDs for the same key.
Is there an easy solution for this or is this simply not possible with streaming?
I don't think you need a join in your use case, because joins are for merging two different streams that arrive with equal IDs. You said that you receive just one stream of events. So, your use case is an aggregation over one stream.
What I understood from your question is that you receive events: A, B, C, ... and then you want to assign some ID. You say that the ID is random, which makes this very uncertain: if it is truly random, how would you know that A -> fff17a1e-9943-11eb-a8b3-0242ac130003 and X -> fff17a1e-9943-11eb-a8b3-0242ac130003 (the same)? I suppose that you might have a seed to generate this UUID, and that you then create a key based on this seed as well.
I suggest you start with this word-count sample. Then, on the first map:
.map((key, value) -> new KeyValue<>(value, value))
you replace it with your map function. Something like this:
.map((k, v) -> {
    if (v.equalsIgnoreCase("A")) {
        return new KeyValue<String, ValueWithUUID>("1", new ValueWithUUID(v));
    } else if (v.equalsIgnoreCase("B")) {
        return new KeyValue<String, ValueWithUUID>("2", new ValueWithUUID(v));
    } else {
        return new KeyValue<String, ValueWithUUID>("0", new ValueWithUUID(v));
    }
})
...
class ValueWithUUID {
    String value;
    String uuid;

    public ValueWithUUID(String value) {
        this.value = value;
        // generate your UUID based on the value. It is random, but as you show in your question it might have a seed.
        this.uuid = generateRandomUUIDWithSeed();
    }

    public String generateRandomUUIDWithSeed() {
        return "fff17a1e-9943-11eb-a8b3-0242ac130003";
    }
}
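If the "seed" is simply the value itself, one way to make generateRandomUUIDWithSeed deterministic is a name-based (type 3) UUID; this is a sketch, not part of the original answer:
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Deterministic: the same input value always maps to the same UUID.
public static String generateUUIDFromValue(String value) {
    return UUID.nameUUIDFromBytes(value.getBytes(StandardCharsets.UTF_8)).toString();
}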
Then you decide whether you want a windowed aggregation, every 30 seconds for instance, or a non-windowed aggregation that updates the results for every event that arrives. Here is one nice example.
You can aggregate the raw stream into a KTable; in the aggregation, generate or reuse the UUID, then use the stream of the KTable.
final KStream<String, String> streamWithoutUUID = builder.stream("topic_name");
KTable<String, String> tableWithUUID = streamWithoutUUID.groupByKey().aggregate(
        () -> "",
        (k, v, t) -> {
            if (!t.startsWith("uuid:")) {
                return "uuid:" + "call your buildUUID function here" + ";value:" + v;
            } else {
                return t.split(";", 2)[0] + ";value:" + v;
            }
        },
        Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("state_name")
                .withKeySerde(Serdes.String()).withValueSerde(Serdes.String()));
final KStream<String, String> streamWithUUID = tableWithUUID.toStream();

Should TimeWindows be the same when joining two KTables derived from TimeWindows?

I use two different retention times for two different KTables, and it works with RocksDB state stores and changelog Kafka topics.
Each KTable is generated from a KStream via groupBy and then windowedBy.
I believe that when joining KStreams with windowing, the TimeWindows are the same. I'm wondering whether there is a benefit or drawback if the TimeWindows parameters are different when joining two different KTables windowed by TimeWindows.
code snippet:
final KStream<Integer, String> eventStream = builder.stream("events",
        Consumed.with(Serdes.Integer(), Serdes.String())
                .withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST));

final KTable<Windowed<Integer>, String> eventWindowTable = eventStream.groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofSeconds(60)).until(Duration.ofSeconds(100).toMillis()))
        .reduce((oldValue, newValue) -> newValue);

final KStream<Integer, String> clickStream = builder.stream("clicks",
        Consumed.with(Serdes.Integer(), Serdes.String())
                .withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST));

final KTable<Windowed<Integer>, String> clickWindowTable = clickStream.groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofSeconds(30)).until(Duration.ofSeconds(70).toMillis()))
        .reduce((oldValue, newValue) -> newValue);

final KTable<Windowed<Integer>, String> join = eventWindowTable.leftJoin(clickWindowTable,
        (event, click) -> event + " ; " + click + " ; " + Instant.now());
Initially I thought joining two different KTables with different TimeWindows parameters would not work, because the join relies on the time-windowed key, a key for the time slot. But after testing, it works as well.
The join is executed because the type of both keys is the same: Windowed<Integer>. The join will of course only produce a result if the keys are the same. Assume you have the following windows (note that only the window start timestamp is stored for TimeWindows):
eventWindowTable: <A,0> <A,60>
clickWindowTable: <A,0> <A,30> <A,60> <A,90>
For this case, only <A,0> and <A,60> would join. Hence, having different windows does impact your result, because the window start timestamp is part of the key and some windows will never join (e.g., <A,30> and <A,90> in our example).
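If you want every window of one table to be able to find a join partner in the other, a straightforward option is to window both tables with the same TimeWindows spec; a sketch reusing the snippet above:
// One shared window definition, so both tables produce identical Windowed<Integer> keys.
final TimeWindows sharedWindows =
        TimeWindows.of(Duration.ofSeconds(60)).until(Duration.ofSeconds(100).toMillis());

final KTable<Windowed<Integer>, String> eventWindowTable = eventStream.groupByKey()
        .windowedBy(sharedWindows)
        .reduce((oldValue, newValue) -> newValue);

final KTable<Windowed<Integer>, String> clickWindowTable = clickStream.groupByKey()
        .windowedBy(sharedWindows)
        .reduce((oldValue, newValue) -> newValue);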

Is it possible for a Kafka Streams application to write multiple outputs from a single input?

I'm unsure if kafka-streams is the correct solution for a problem I'm trying to solve. I'd like to be able to use it because of the parallelism and fault tolerance it provides, but I'm struggling to come up with a way to achieve a desired processing pipeline.
The pipeline is something like this:
A record of some type arrives on an input topic
Information in this record is used to perform a database query, which returns many results
I'd like to be able to write out each result as an individual record, with its own key, rather than as a collection of results in a single record.
Ignoring the single output record per result requirement for a moment, I have code that looks like this:
Serde<String> stringSerde = Serdes.String();
JsonSerde<MyInput> inputSerde = new JsonSerde<>();
JsonSerde<List<MyOutput>> outputSerde = new JsonSerde<>();
Consumed<String, MyInput> consumer = Consumed.with(stringSerde, inputSerde);
KStream<String, MyInput> receiver = builder.stream("input-topic", consumer);
KStream<String, List<MyOutput>> outputs = receiver.mapValues(this::mapInputToManyOutputs);
outputs.to("output-topic", Produced.with(stringSerde, outputSerde));
This is simple enough, 1 message in, 1 message (albeit a collection) out.
What I'd like to be able to do is something like:
Serde<String> stringSerde = Serdes.String();
JsonSerde<MyInput> inputSerde = new JsonSerde<>();
JsonSerde<MyOutput> outputSerde = new JsonSerde<>();
Consumed<String, MyInput> consumer = Consumed.with(stringSerde, inputSerde);
KStream<String, MyInput> receiver = builder.stream("input-topic", consumer);
KStream<String, List<MyOutput>> outputs = receiver.mapValues(this::mapInputToManyOutputs);
KStream<String, MyOutput> sink = outputs.???
sink.to("output-topic", Produced.with(stringSerde, outputSerde));
I cannot come up with anything sensible for an operation or operations to perform on the outputs stream.
Any suggestions? Or is kafka-streams maybe not the right solution to a problem like this?
Yes, it's possible; for that you need to use the KStream flatMap transformation. flatMap transforms each record of the input stream into zero or more records in the output stream (both key and value types can be altered arbitrarily):
kStream = kStream.flatMap(
        (key, value) -> {
            List<KeyValue<String, MyOutput>> result = new ArrayList<>();
            // do your logic here
            return result;
        });
kStream.to("output-topic", Produced.with(stringSerde, outputSerde));
Thanks, Vasiliy, flatMap was indeed what I needed. I looked at it earlier, thought it was the right operation but then got confused and mistakenly discarded it.
Combining what I had before with your suggestion, the following works, assuming MyOutput implements a method called getKey():
Serde<String> stringSerde = Serdes.String();
JsonSerde<MyInput> inputSerde = new JsonSerde<>();
JsonSerde<MyOutput> outputSerde = new JsonSerde<>();
Consumed<String, MyInput> consumer = Consumed.with(stringSerde, inputSerde);
KStream<String, MyInput> receiver = builder.stream("input-topic", consumer);
KStream<String, List<MyOutput>> outputs = receiver.mapValues(this::mapInputToManyOutputs);
KStream<String, MyOutput> sink = outputs.flatMap((key, value) ->
        value.stream().map(o -> new KeyValue<>(o.getKey(), o)).collect(Collectors.toList()));
sink.to("output-topic", Produced.with(stringSerde, outputSerde));
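A note on the design: because each output record gets its own key, flatMap (rather than flatMapValues, which keeps the key) is the right operation here. The intermediate List-valued stream can also be folded away by calling flatMap directly on the receiver; a sketch under the same assumptions:
KStream<String, MyOutput> sink = receiver.flatMap((key, value) ->
        mapInputToManyOutputs(value).stream()
                .map(o -> new KeyValue<>(o.getKey(), o))
                .collect(Collectors.toList()));
sink.to("output-topic", Produced.with(stringSerde, outputSerde));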

KStream-KTable join writing to the KTable: How to sync the join with the KTable write?

I'm having some issue with how the following topology behaves:
String topic = config.topic();

KTable<UUID, MyData> myTable = topology.builder().table(UUIDSerdes.get(), GsonSerdes.get(MyData.class), topic);

// Receive a stream of various events
topology.eventsStream()
        // Only process events that are implementing MyEvent
        .filter((k, v) -> v instanceof MyEvent)
        // Cast to ease the code
        .mapValues(v -> (MyEvent) v)
        // rekey by data id
        .selectKey((k, v) -> v.data.id)
        .peek((k, v) -> L.info("Event:" + v.action))
        // join the event with the according entry in the KTable and apply the state mutation
        .leftJoin(myTable, eventHandler::handleEvent, UUIDSerdes.get(), EventSerdes.get())
        .peek((k, v) -> L.info("Updated:" + v.id + "-" + v.id2))
        // write the updated state to the KTable.
        .to(UUIDSerdes.get(), GsonSerdes.get(MyData.class), topic);
My issue happens when I receive different events at the same time. As my state mutation is done by the leftJoin and then written by the to method, I can have the following occurring if events 1 and 2 are received at the same time with the same key:
event1 joins with state A => state A mutated to state X
event2 joins with state A => state A mutated to state Y
state X written to the KTable topic
state Y written to the KTable topic
Because of that, state Y doesn't have the changes from event1, so I lost data.
Here's what I see in terms of logs (the Processing:... part is logged from inside the value joiner):
Event:Event1
Event:Event2
Processing:Event1, State:none
Updated:1-null
Processing:Event2, State:none
java.lang.IllegalStateException: Event2 event received but we don't have data for id 1
Event1 can be considered as the creation event: it will create the entry in the KTable, so it doesn't matter if the state is empty. Event2, though, needs to apply its changes to an existing state, but it doesn't find any because the first state mutation still hasn't been written to the KTable (it still hasn't been processed by the to method).
Is there any way to make sure that my leftJoin and my writes into the KTable are done atomically?
Thanks
Update & current solution
Thanks to Matthias's response, I was able to find a solution using a Transformer.
Here's what the code looks like.
This is the transformer:
public class KStreamStateLeftJoin<K, V1, V2> implements Transformer<K, V1, KeyValue<K, V2>> {

    private final String stateName;
    private final ValueJoiner<V1, V2, V2> joiner;
    private final boolean updateState;

    private KeyValueStore<K, V2> state;

    public KStreamStateLeftJoin(String stateName, ValueJoiner<V1, V2, V2> joiner, boolean updateState) {
        this.stateName = stateName;
        this.joiner = joiner;
        this.updateState = updateState;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.state = (KeyValueStore<K, V2>) context.getStateStore(stateName);
    }

    @Override
    public KeyValue<K, V2> transform(K key, V1 value) {
        V2 stateValue = this.state.get(key); // Get current state
        V2 updatedValue = joiner.apply(value, stateValue); // Apply join
        if (updateState) {
            this.state.put(key, updatedValue); // write new state
        }
        return new KeyValue<>(key, updatedValue);
    }

    @Override
    public KeyValue<K, V2> punctuate(long timestamp) {
        return null;
    }

    @Override
    public void close() {}
}
And here's the adapted topology:
String topic = config.topic();
String store = topic + "-store";

KTable<UUID, MyData> myTable = topology.builder().table(UUIDSerdes.get(), GsonSerdes.get(MyData.class), topic, store);

// Receive a stream of various events
topology.eventsStream()
        // Only process events that are implementing MyEvent
        .filter((k, v) -> v instanceof MyEvent)
        // Cast to ease the code
        .mapValues(v -> (MyEvent) v)
        // rekey by data id
        .selectKey((k, v) -> v.data.id)
        // join the event with the according entry in the KTable and apply the state mutation
        .transform(() -> new KStreamStateLeftJoin<UUID, MyEvent, MyData>(store, eventHandler::handleEvent, true), store)
        // write the updated state to the KTable.
        .to(UUIDSerdes.get(), GsonSerdes.get(MyData.class), topic);
As we're using the KTable's key-value StateStore and applying changes directly to it through the put method, events should always pick up the updated state.
One thing I'm still wondering: what if I have a continuously high throughput of events?
Could there still be a race condition between the puts we do on the KTable's KV store and the writes that are done to the KTable's topic?
A KTable is sharded into multiple physical stores, and each store is only updated by a single thread. Thus, the scenario you describe cannot happen. If you have 2 records with the same timestamp that both update the same shard, they will be processed one after the other (in offset order). Thus, the second update will see the state after the first update.
So maybe you just didn't describe your scenario correctly?
Update
You cannot mutate the state when doing a join. Thus, the expectation that
event1 joins with state A => state A mutated to state X
is wrong. Independent of any processing order, when event1 joins with state A, it will access state A in read only mode and state A will not be modified.
Thus, when event2 joins, it will see the same state as event1. For stream-table join, the table state is only updated when new data is read from the table-input-topic.
If you want to have a shared state that is updated from both inputs, you would need to build a custom solution using transform():
builder.addStore(..., "store-name");
builder.stream("table-topic").transform(..., "store-name"); // will not emit anything downstream
KStream result = builder.stream("stream-topic").transform(..., "store-name");
This will create one store that is shared by both processors and both can read/write as they wish. Thus, for the table-input you can just update the state without sending anything downstream, while for the stream-input you can do the join, update the state, and send a result downstream.
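A slightly fleshed-out sketch of that idea with the StreamsBuilder API; the store name and topic names follow the snippet above, the serdes are the ones from the question, and the two transformer classes are placeholders, not actual code from the answer:
// One key-value store shared by both transformers.
StoreBuilder<KeyValueStore<UUID, MyData>> storeBuilder = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("store-name"),
        UUIDSerdes.get(),
        GsonSerdes.get(MyData.class));
builder.addStateStore(storeBuilder);

// Table-input side: only updates the store, emits nothing downstream.
builder.stream("table-topic", Consumed.with(UUIDSerdes.get(), GsonSerdes.get(MyData.class)))
        .transform(TableUpdateTransformer::new, "store-name");

// Stream-input side: reads the store, applies the join/mutation, updates the
// store, and emits the joined result downstream.
KStream<UUID, MyData> result = builder
        .stream("stream-topic", Consumed.with(UUIDSerdes.get(), EventSerdes.get()))
        .transform(JoinAndUpdateTransformer::new, "store-name");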
Update 2
With regard to the solution, there will be no race condition between the updates the Transformer applies to the state and records the Transformer processes after the state update. This part will be executed in a single thread, and records will be processed in offset-order from the input topic. Thus, it's ensured that a state update will be available to later records.
