How to send a record to a topic when a window closes in Kafka Streams

I have been struggling with this for a couple of days. I am consuming records from 4 topics and need to aggregate them over a time window. When the window's time is up, I want to send either an approved message or a not-approved message to a sink topic. Is this possible with Kafka Streams?
It seems that every record is sent to the new topic even though the window is still open, which is not what I want.
Here is the simple code:
builder.stream(getTopicList(), Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
    .flatMap(new ExceptionSafeKeyValueMapper<String, FooTriggerMessage>("", Serdes.String(), fooTriggerSerde))
    .filter((key, value) -> value.getTriggerEventId() != null)
    .groupBy((key, value) -> value.getTriggerEventId().toString(),
        Serialized.with(Serdes.String(), fooTriggerSerde))
    .windowedBy(TimeWindows.of(TimeUnit.SECONDS.toMillis(30))
        .advanceBy(TimeUnit.SECONDS.toMillis(30)))
    .aggregate(() -> new BarApprovalMessage(), /* initializer */
        (key, value, aggValue) -> getApproval(key, value, aggValue), /* adder */
        Materialized
            .<String, BarApprovalMessage, WindowStore<Bytes, byte[]>>as(storeName) /* state store name */
            .withValueSerde(barApprovalSerde))
    .toStream()
    .to(appProperties.getBarApprovalEngineOutgoing(),
        Produced.with(windowedSerde, barApprovalSerde));
As of now, every record is sent to the outgoing topic. I only want one message to be sent when the window closes.
Is this possible?

I am answering my own question in case anyone else needs it. In the transform stage, I used the context to schedule a punctuation. context.schedule() takes three parameters: the punctuation interval, the time to use (wall-clock time or stream time), and a callback to invoke when the interval elapses. I used wall-clock time and started a new schedule for each unique window key. Each incoming message is added to a KeyValue store, and the transform method returns null so that nothing is emitted yet. Then, in the callback that runs every 30 seconds, I check that the window is closed, iterate over the messages in the store, aggregate them, and call context.forward() and context.commit(). Voilà! 4 messages received in a 30-second window, one message produced.
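The buffering logic of the approach described above can be sketched in plain Java, without any Kafka dependency. This is only a model under assumptions: the class and method names are hypothetical, the HashMap stands in for the KeyValue store, and flush() plays the role of the scheduled callback; the aggregation is a simple sum for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Plain-Java model of "buffer per window, emit once after the window has closed".
// Hypothetical names; in a real Transformer the map would be a state store
// and flush() would run inside the scheduled punctuation callback.
class WindowFlushBuffer {
    private final long windowSizeMs;
    // window start timestamp -> values buffered for that window
    private final Map<Long, List<Integer>> buffer = new HashMap<>();

    WindowFlushBuffer(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    // Called per record: store the value and emit nothing yet.
    void add(long timestampMs, int value) {
        long windowStart = timestampMs - (timestampMs % windowSizeMs);
        buffer.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(value);
    }

    // Called periodically: aggregate and remove every window that ended at or before nowMs.
    Map<Long, Integer> flush(long nowMs) {
        Map<Long, Integer> emitted = new HashMap<>();
        Iterator<Map.Entry<Long, List<Integer>>> it = buffer.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, List<Integer>> entry = it.next();
            if (entry.getKey() + windowSizeMs <= nowMs) {
                int sum = 0;
                for (int v : entry.getValue()) {
                    sum += v;
                }
                emitted.put(entry.getKey(), sum);
                it.remove(); // each window is emitted exactly once
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        WindowFlushBuffer buffer = new WindowFlushBuffer(30_000L);
        buffer.add(1_000L, 1);
        buffer.add(20_000L, 1);
        System.out.println(buffer.flush(30_000L));
    }
}
```

The key design point is that add() never emits anything; only the periodic flush does, and it removes what it emits so that a window cannot produce a second message.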

You can use the suppress() operator, which holds back intermediate results and emits only the final result once a window closes.
From the official Kafka guide:
https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html#window-final-results
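To make "window closes" concrete, here is a small plain-Java model of when a tumbling window counts as closed, as I understand the semantics: once stream time (the highest record timestamp seen so far) reaches the window end plus the grace period. This is not Kafka code, and the class and method names are made up for illustration.

```java
// Simplified model of window-close timing; hypothetical names, not Kafka internals.
class WindowCloseModel {
    // Start of the tumbling window that contains the given timestamp.
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    // A window [start, start + size) is treated as closed once stream time
    // has reached its end plus the grace period.
    static boolean isClosed(long windowStartMs, long windowSizeMs, long graceMs, long streamTimeMs) {
        return windowStartMs + windowSizeMs + graceMs <= streamTimeMs;
    }

    public static void main(String[] args) {
        long start = windowStart(45_000L, 30_000L);
        System.out.println(start);
        System.out.println(isClosed(start, 30_000L, 0L, 59_999L));
    }
}
```

Because stream time only advances when new records arrive, a suppressed final result is emitted only after a later record moves stream time past the window's close time.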

I faced the same issue and solved it by adding grace(0) to the fixed window and using the Suppressed API:
public void process(KStream<SensorKeyDTO, SensorDataDTO> stream) {
    buildAggregateMetricsBySensor(stream)
        .to(outputTopic, Produced.with(String(), new SensorAggregateMetricsSerde()));
}

private KStream<String, SensorAggregateMetricsDTO> buildAggregateMetricsBySensor(KStream<SensorKeyDTO, SensorDataDTO> stream) {
    return stream
        .map((key, val) -> new KeyValue<>(val.getId(), val))
        .groupByKey(Grouped.with(String(), new SensorDataSerde()))
        .windowedBy(TimeWindows.of(Duration.ofMinutes(WINDOW_SIZE_IN_MINUTES)).grace(Duration.ofMillis(0)))
        .aggregate(SensorAggregateMetricsDTO::new,
            (String k, SensorDataDTO v, SensorAggregateMetricsDTO va) -> aggregateData(v, va),
            buildWindowPersistentStore())
        .suppress(Suppressed.untilWindowCloses(unbounded()))
        .toStream()
        .map((key, value) -> KeyValue.pair(key.key(), value));
}

private Materialized<String, SensorAggregateMetricsDTO, WindowStore<Bytes, byte[]>> buildWindowPersistentStore() {
    return Materialized
        .<String, SensorAggregateMetricsDTO, WindowStore<Bytes, byte[]>>as(WINDOW_STORE_NAME)
        .withKeySerde(String())
        .withValueSerde(new SensorAggregateMetricsSerde());
}

Related

Stop a KafkaListener (Spring Kafka consumer) after it has read all messages up to a specific time

I am trying to schedule consumption from a single-partition topic. I can start it using endpointlistenerregistry.start(), but I want to stop it after I have consumed all the messages in the partition, i.e. when I reach the last offset. Nothing is produced to the topic until I have finished consuming and stopped the consumer. How can I make sure that I have read all the messages that existed when the scheduler started, and then stop my consumer? I am using @KafkaListener for the consumer.
Set the idleEventInterval container property and add an @EventListener method that listens for ListenerContainerIdleEvents.
Then stop the container.
To read up to the last offset, you can simply poll until you get empty records. Note that an empty poll does not strictly prove that the end of the partition was reached; comparing kafkaConsumer.position() with kafkaConsumer.endOffsets() is a more reliable check.
You can invoke kafkaConsumer.pause() at the end of consumption. On the next scheduled run, you need to invoke kafkaConsumer.resume().
From the Javadoc of pause(): "Suspend fetching from the requested partitions. Future calls to poll(Duration) will not return any records from these partitions until they have been resumed using resume(Collection). Note that this method does not affect partition subscription. In particular, it does not cause a group rebalance when automatic assignment is used."
Something like this:
List<TopicPartition> topicPartitions = new ArrayList<>();

void scheduleProcess() {
    topicPartitions = ... // assign partition info for this
    kafkaConsumer.resume(topicPartitions);
    while (true) {
        ConsumerRecords<String, Object> events = kafkaConsumer.poll(Duration.ofMillis(1000));
        if (!events.isEmpty()) {
            // processing logic
        } else {
            // nothing left to read: pause until the next scheduled run
            kafkaConsumer.pause(topicPartitions);
            break;
        }
    }
}

Time semantics between KStream and KTable

I am trying to build the following topology:
Using Debezium connectors, I am pulling 2 tables (let's call them tables A and DA). As per Debezium, the topics where the table rows are stored have the structure { before: "...", after: "..." }.
First steps in my topology are to create "clean" KStreams off these two "table" topics. The sub-topology there looks roughly like this:
private static KStream<String, TABLE_A.Value> getTableARowByIdStream(
        StreamsBuilder builder, Properties streamsConfig) {
    return builder
        .stream("TABLE_A", Consumed.withTimestampExtractor(Application::getRowDate))
        .filter((key, envelope) -> [ some filtering condition ] )
        .map((key, envelope) -> [ maps to TABLE_A.Value ] )
        .through(tableRowByIdTopicName);
}
Notice that I am assigning the record time explicitly because the table rows will be CDC'ed "years" after they were originally published. What the function does at the moment is fake the time, starting at 2010-01-01 and, using an AtomicInteger, adding 1 millisecond for each consumed entity. It does this for table A but not for DA (I will explain why later).
Phase 2 of the topology is to build 1 KTable based on the "cleaned" topic for table A, like this:
private static KTable<String, EntityInfoList> getEntityInfoListById(
        KStream<String, TABLE_A.Value> tableAByIdStream) {
    return tableAByIdStream
        .map((key, value) -> [ some mapping ] )
        .groupByKey()
        .aggregate(() -> [ builds up a EntityInfoList object ] );
}
Finally, with the KTable ready, I'm joining it with the KStream over DA like so:
private static KStream<String, OutputTopicEntity> getOutputTopicEntityStream(
        KStream<String, Table_DA.Value> tableDAStream,
        KTable<String, EntityInfoList> tableA_KTable) {
    KStream<String, Table_DA.Value>[] branches = tableDAStream.branch(
        (key, value) -> [ some logic ],
        (key, value) -> true);
    KStream<String, OutputTopicEntity> internalAccountRefStream = branches[0]
        .join(
            tableA_KTable,
            (streamValue, tableValue) -> [ some logic to build a list of OutputTopicEntity ])
        .flatMap((key, listValue) -> [ some logic to flatten it ]);
    [ similar logic with branch[1] ]
}
My problem is this: although I am "faking" the time for records coming from the Table_A topic (I've verified with kafkacat that they reference 2010/01/01) and entries in Table_DA (the stream side of the join) have timestamps around today (2019/08/14), Kafka Streams does not seem to hold back reading entries from the Table_DA KStream until it has ingested all records from Table_A into the KTable.
As a result, I don't get all the "join hits" that I was expecting, and the output is nondeterministic. My understanding, based on this sentence from "What are the differences between KTable vs GlobalKTable and leftJoin() vs outerJoin()?", was the opposite:
For stream-table join, Kafka Stream align record processing ordered based on record timestamps. Thus, the update to the table are aligned with the records of you stream.
My experience so far is this is not happening. I can also easily see how my application continues churning through the Table_A topic way after it has consumed all entries in Table_DA stream (it happens to be 10 times smaller).
Am I doing something wrong?
Timestamp synchronization was best-effort before the 2.1.0 release (cf. https://issues.apache.org/jira/browse/KAFKA-3514).
As of 2.1.0, timestamps are synchronized strictly. However, if one input does not have any data, Kafka Streams will "enforce" processing, as described in KIP-353, to avoid blocking forever. If you have bursty inputs and want to "block" processing for some time when one input has no data, you can increase the configuration parameter max.task.idle.ms (default is 0), introduced in 2.1.0 via KIP-353.
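As a minimal sketch of that setting, the configuration could be built like this. The application id, broker address, class name, and the 5-second value are placeholders chosen for illustration; only the max.task.idle.ms key comes from the answer above.

```java
import java.util.Properties;

// Hypothetical helper that builds Kafka Streams properties with an idle wait.
class IdleConfigExample {
    static Properties streamsProps(long maxTaskIdleMs) {
        Properties props = new Properties();
        // Placeholder values; replace with your own application id and brokers.
        props.put("application.id", "table-join-app");
        props.put("bootstrap.servers", "localhost:9092");
        // Same key as StreamsConfig.MAX_TASK_IDLE_MS_CONFIG (default 0):
        // how long a task waits for data on an empty input before processing anyway.
        props.put("max.task.idle.ms", Long.toString(maxTaskIdleMs));
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsProps(5_000L).getProperty("max.task.idle.ms"));
    }
}
```

The resulting Properties object would then be passed to the KafkaStreams constructor together with the topology.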

How to delegate tombstone event to a KTable using selectKey

I have the following kafka stream configuration.
// builder is assumed to be a StreamsBuilder
KTable<Long, TestObject> table = builder
    .stream("TopicA", Consumed.with(Serdes.String(), new SpecificAvroSerde<TestObject>()))
    .filter((key, value) -> value != null)
    .selectKey((key, value) -> value.getSomeProperty())
    .groupByKey(Grouped.with(Serdes.Long(), new SpecificAvroSerde<TestObject>()))
    .reduce((oldValue, newValue) -> newValue, Materialized.as("someStore"));
This works as I expect, but I can't figure out how to deal with tombstone messages for TestObject. Even if I remove
.filter((key, value) -> value != null)
selectKey() remains a problem: when the value arrives as null, I can't derive the key for the tombstone with value.getSomeProperty(), because value is null.
How would you deal with this problem?
You can use transform() instead of selectKey() and store the old <key,value> pair in a state store. This way, when <key,null> is processed, you can get the previous value from the store, and get the previously extracted new key and send a corresponding tombstone.
However, reduce() cannot process any record with null key or null value (those would be dropped). Thus, you will need to use a surrogate value instead of null to get the record into the Reduce function. If the surrogate is received, Reduce can return null.
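The re-keying idea can be modeled in plain Java as follows. This is a sketch under assumptions, not Kafka code: the class name is hypothetical, the HashMap stands in for the state store, and the returned entry stands in for the record that transform() would forward. It does not handle the case where a later non-null value maps to a different derived key than before.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of re-keying with tombstone support; hypothetical names.
class TombstoneRekeyer<K, V, NK> {
    // original key -> key most recently derived from its value (the "state store")
    private final Map<K, NK> lastDerivedKey = new HashMap<>();
    private final Function<V, NK> keyExtractor;

    TombstoneRekeyer(Function<V, NK> keyExtractor) {
        this.keyExtractor = keyExtractor;
    }

    // Returns the re-keyed record, or null if a tombstone arrives for an unknown key.
    Map.Entry<NK, V> process(K key, V value) {
        if (value == null) {
            // Tombstone: look up (and forget) the key derived from the previous value.
            NK previous = lastDerivedKey.remove(key);
            if (previous == null) {
                return null;
            }
            return new SimpleEntry<NK, V>(previous, null);
        }
        NK derived = keyExtractor.apply(value);
        lastDerivedKey.put(key, derived);
        return new SimpleEntry<NK, V>(derived, value);
    }

    public static void main(String[] args) {
        TombstoneRekeyer<String, String, Integer> rekeyer = new TombstoneRekeyer<>(String::length);
        System.out.println(rekeyer.process("a", "hello"));
        System.out.println(rekeyer.process("a", null));
    }
}
```

In a real topology the tombstone produced here would still need the surrogate-value treatment described above before it reaches reduce().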

Tombstone messages not removing record from KTable state store?

I am creating a KTable by processing data from a KStream. But when I send a tombstone message (a key with a null payload), it does not remove the record from the KTable.
Sample:
public KStream<String, GenericRecord> processRecord(@Input(Channel.TEST) KStream<GenericRecord, GenericRecord> testStream,
    // (rest of the signature omitted in the original)
    KTable<String, GenericRecord> table = testStream
        .map((genericRecord, genericRecord2) -> KeyValue.pair(genericRecord.get("field1") + "", genericRecord2))
        .groupByKey()
        .reduce((genericRecord, v1) -> v1, Materialized.as("test-store"));

// Producing the tombstone (key with a null value):
GenericRecord genericRecord = new GenericData.Record(getAvroSchema(keySchema));
genericRecord.put("field1", Long.parseLong(test.getField1()));
ProducerRecord record = new ProducerRecord(Channel.TEST, genericRecord, null);
kafkaTemplate.send(record);
When I send a message with a null value, I can see it in the debugger in the map function of testStream with a null payload, but it doesn't remove the record from the KTable changelog "test-store". It looks like it doesn't even reach the reduce method; I'm not sure what I am missing here.
Appreciate any help on this!
Thanks.
As documented in the JavaDocs of reduce():
Records with {@code null} key or value are ignored.
Because the <key,null> record is dropped, (genericRecord, v1) -> v1 is never executed, and no tombstone is written to the store or changelog topic.
For the use case you have in mind, you need to use a surrogate value that indicates "delete", for example a boolean flag within your Avro record. Your reduce function needs to check for the flag and return null if the flag is set; otherwise, it must process the record regularly.
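A reducer with such a surrogate flag might look like the following plain-Java sketch. The Event type, its deleted field, and the apply() helper are hypothetical; the helper is only a stand-in for the table, and it stores the first value per key without calling the reducer, which is how I understand reduce() to behave.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Plain-Java model of a "delete flag" reducer; hypothetical names, not Kafka code.
class SurrogateDeleteExample {
    // 'deleted' is the surrogate flag that stands in for a null value.
    static final class Event {
        final String payload;
        final boolean deleted;

        Event(String payload, boolean deleted) {
            this.payload = payload;
            this.deleted = deleted;
        }
    }

    // Returning null deletes the key; otherwise the newest value wins.
    static final BinaryOperator<Event> REDUCER =
        (oldValue, newValue) -> newValue.deleted ? null : newValue;

    // Minimal stand-in for the table: the first value per key bypasses the reducer.
    static void apply(Map<String, Event> table, String key, Event value) {
        Event current = table.get(key);
        if (current == null) {
            table.put(key, value);
            return;
        }
        Event reduced = REDUCER.apply(current, value);
        if (reduced == null) {
            table.remove(key);
        } else {
            table.put(key, reduced);
        }
    }

    public static void main(String[] args) {
        Map<String, Event> table = new HashMap<>();
        apply(table, "k", new Event("v1", false));
        apply(table, "k", new Event(null, true));
        System.out.println(table.containsKey("k"));
    }
}
```

Because the first value for a key is stored as-is, a flagged record that arrives first would remain in the table, so downstream consumers may still need to check the flag.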
Update:
Apache Kafka 2.6 adds the KStream#toTable() operator (via KIP-523), which allows you to transform a KStream into a KTable.
An addition to the above answer by Matthias:
reduce() is not invoked for the first record of each key, so the mapped and grouped value is stored as-is in the KTable without ever passing through the reduce method for tombstoning. This means that it is not possible to simply join another stream on that table; the value itself also needs to be evaluated.
I hope KIP-523 solves this.

Kafka: Efficiently join windowed aggregates to events

I'm prototyping a fraud application. We'll frequently have metrics like "total amount of cash transactions in the last 5 days" that we need to compare against some threshold to determine if we raise an alert.
We're looking to use Kafka Streams to create and maintain the aggregates and then create an enhanced version of the incoming transaction that has the original transaction fields plus the aggregates. This enhanced record gets processed by a downstream rules system.
I'm wondering the best way to approach this. I've prototyped creating the aggregates with code like this:
TimeWindows twoDayHopping = TimeWindows.of(TimeUnit.DAYS.toMillis(2))
    .advanceBy(TimeUnit.DAYS.toMillis(1));
KStream<String, AdditiveStatistics> aggrStream = transactions
    .filter((key, value) -> {
        return value.getAccountTypeDesc().equals("P") &&
            value.getPrimaryMediumDesc().equals("CASH");
    })
    .groupByKey()
    .aggregate(AdditiveStatistics::new,
        (key, value, accumulator) -> {
            return AdditiveStatsUtil
                .advance(value.getCurrencyAmount(), accumulator);
        },
        twoDayHopping,
        metricsSerde,
        "sas10005_store")
    .toStream()
    .map((key, value) -> {
        value.setTransDate(key.window().start());
        return new KeyValue<String, AdditiveStatistics>(key.key(), value);
    })
    .through(Serdes.String(), metricsSerde, datedAggrTopic);
This creates a store-backed stream that has one record per key per window. I then join the original transactions stream to this stream to produce the final output to a topic:
JoinWindows joinWindow = JoinWindows.of(TimeUnit.DAYS.toMillis(1))
    .before(TimeUnit.DAYS.toMillis(1))
    .after(-1)
    .until(TimeUnit.DAYS.toMillis(2) + 1);
KStream<String, Transactions10KEnhanced> enhancedTrans = transactions.join(aggrStream,
    (left, right) -> {
        Transactions10KEnhanced out = new Transactions10KEnhanced();
        out.setAccountNumber(left.getAccountNumber());
        out.setAccountTypeDesc(left.getAccountTypeDesc());
        out.setPartyNumber(left.getPartyNumber());
        out.setPrimaryMediumDesc(left.getPrimaryMediumDesc());
        out.setSecondaryMediumDesc(left.getSecondaryMediumDesc());
        out.setTransactionKey(left.getTransactionKey());
        out.setCurrencyAmount(left.getCurrencyAmount());
        out.setTransDate(left.getTransDate());
        if (right != null) {
            out.setSum2d(right.getSum());
        }
        return out;
    },
    joinWindow);
This produces the correct results, but it seems to run for quite a while, even with a low number of records. I'm wondering if there's a more efficient way to achieve the same result.
It's a config issue; cf. http://docs.confluent.io/current/streams/developer-guide.html#memory-management
Disabling caching by setting the cache size to zero (parameter cache.max.bytes.buffering in StreamsConfig) will resolve the "delayed" delivery to the output topic.
You might also read this blog post for some background information about Streams design: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
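For reference, a minimal sketch of a configuration with caching disabled could look like this. The application id, broker address, and class name are placeholders for illustration; only the cache.max.bytes.buffering key and the value 0 come from the answer above.

```java
import java.util.Properties;

// Hypothetical helper that builds Kafka Streams properties with record caching disabled.
class NoCacheConfigExample {
    static Properties streamsProps() {
        Properties props = new Properties();
        // Placeholder values; replace with your own application id and brokers.
        props.put("application.id", "fraud-aggregates");
        props.put("bootstrap.servers", "localhost:9092");
        // Same key as StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG; 0 disables the record cache,
        // so every aggregate update is forwarded downstream immediately.
        props.put("cache.max.bytes.buffering", "0");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsProps().getProperty("cache.max.bytes.buffering"));
    }
}
```

The trade-off is more downstream traffic, since updates that the cache would have coalesced are now each written out.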

Resources