Time semantics between KStream and KTable - apache-kafka-streams

I am trying to build the following topology:
Using Debezium connectors, I am pulling 2 tables (let's call them tables A and DA). As per DBZ, the topics where the table rows are stored have the structure { before: "...", after: "..." }.
First steps in my topology are to create "clean" KStreams off these two "table" topics. The sub-topology there looks roughly like this:
private static KStream<String, TABLE_A.Value> getTableARowByIdStream(
        StreamsBuilder builder, Properties streamsConfig) {
    return builder
            .stream("TABLE_A", Consumed.withTimestampExtractor(Application::getRowDate))
            .filter((key, envelope) -> [ some filtering condition ] )
            .map((key, envelope) -> [ maps to TABLE_A.Value ] )
            .through(tableRowByIdTopicName);
}
Notice that I am assigning the record time explicitly because the table rows will be CDC'ed "years" after they were originally published. What the function does at the moment is fake the time, starting at 2010-01-01 and, using an AtomicInteger, adding 1 millisecond per consumed entity. It does this for table A but not for DA (I will explain why later).
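For reference, a minimal sketch of what such an extractor might look like (getRowDate and the 2010-01-01 start date come from the description above; the rest is an assumption, and an AtomicLong is used here since epoch milliseconds don't fit in an int):
import java.time.Instant;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Fakes record time: starts at 2010-01-01T00:00:00Z and advances 1 ms per record,
// so the record timestamp is decoupled from the (years later) CDC ingestion time.
private static final AtomicLong FAKE_TIME =
        new AtomicLong(Instant.parse("2010-01-01T00:00:00Z").toEpochMilli());

private static long getRowDate(ConsumerRecord<Object, Object> record, long previousTimestamp) {
    return FAKE_TIME.getAndIncrement();
}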
Phase 2 of the topology is to build 1 KTable based on the "cleaned" topic for table A, like this:
private static KTable<String, EntityInfoList> getEntityInfoListById(
        KStream<String, TABLE_A.Value> tableAByIdStream) {
    return tableAByIdStream
            .map((key, value) -> [ some mapping ] )
            .groupByKey()
            .aggregate(() -> [ builds up an EntityInfoList object ] );
}
Finally, with the KTable ready, I'm joining it with the KStream over DA like so:
private static KStream<String, OutputTopicEntity> getOutputTopicEntityStream(
        KStream<String, Table_DA.Value> tableDAStream,
        KTable<String, EntityInfoList> tableA_KTable) {
    KStream<String, Table_DA.Value>[] branches = tableDAStream.branch(
            (key, value) -> [ some logic ],
            (key, value) -> true);
    KStream<String, OutputTopicEntity> internalAccountRefStream = branches[0]
            .join(
                    tableA_KTable,
                    (streamValue, tableValue) -> [ some logic to build a list of OutputTopicEntity ])
            .flatMap((key, listValue) -> [ some logic to flatten it ]);
    [ similar logic with branches[1] ]
}
My problem is that, despite the fact that I am "faking" the time for records coming from the Table_A topic (I've verified with kafkacat that they reference 2010/01/01) while entries in Table_DA (the stream side of the join) have timestamps around today (2019/08/14), Kafka Streams doesn't seem to hold off reading entries from the Table_DA KStream until it has ingested all records from Table_A into the KTable.
As a result, I don't get all the "join hits" that I was expecting, and the result is also nondeterministic. My understanding, based on this sentence from What are the differences between KTable vs GlobalKTable and leftJoin() vs outerJoin()?, was the opposite:
For stream-table join, Kafka Streams aligns record processing based on record timestamps. Thus, the updates to the table are aligned with the records of your stream.
My experience so far is that this is not happening. I can also easily see my application continuing to churn through the Table_A topic well after it has consumed all entries in the Table_DA stream (which happens to be 10 times smaller).
Am I doing something wrong?

Timestamp synchronization is best effort before the 2.1.0 release (cf. https://issues.apache.org/jira/browse/KAFKA-3514).
As of 2.1.0, timestamps are synchronized strictly. However, if one input does not have any data, Kafka Streams will "enforce" processing as described in KIP-353 to avoid blocking forever. If you have bursty inputs and want to "block" processing for some time when one input has no data, you can increase the configuration parameter max.task.idle.ms (default is 0), introduced in 2.1.0 via KIP-353.
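For example, a minimal sketch of raising the idle time (the 5-second value is only an illustration):
Properties streamsConfig = new Properties();
// Wait up to 5 seconds for data on all input partitions of a task before
// processing out of timestamp order (default is 0, i.e. no waiting).
streamsConfig.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 5000L);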

Related

KStreams Grouping by multiple fields to get count

So I have a bunch of records in a topic like the one below. I can create the GroupBy in ksqlDB with no problem, as that is more SQL than anything else. But I have been tasked with moving it over to Java KStreams and am failing miserably.
Can someone guide me on the topology for grouping first by user_id, then by object_id, then by day? I don't ask this lightly; I have tried over and over with state stores and many examples, but I am just chasing my tail. Basically, I would like to know how many times a user looked at a specific object on a given day.
Anything on how to accomplish this would be greatly appreciated.
{
"entrytimestamp": "2020-05-04T15:21:01.897",
"user_id": "080db36a-f205-4e32-a324-cc375b75d167",
"object_id": "fdb084f7-5367-4776-a5ae-a10d6e898d22"
}
You can create a composite key and then group by that key, like:
KStream<String, Message> stream = builder.stream(MESSAGES, Consumed.with(Serdes.String(), jsonSerde));
KStream<String, Message> newKeyStream = stream.selectKey((key, message) ->
        String.format("%s-%s-%s",
                message.userId(),
                message.objectId(),
                LocalDate.ofInstant(Instant.ofEpochMilli(message.timestamp()), ZoneId.systemDefault())));
KGroupedStream<String, Message> groupedBy = newKeyStream.groupByKey();
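From there (a hedged continuation, not part of the original answer; the sink topic name is made up), counting the grouped stream gives the per-user, per-object, per-day number the question asks for:
// Count occurrences of each composed "userId-objectId-date" key.
KTable<String, Long> viewsPerUserObjectDay = groupedBy.count();
// Optionally publish the counts to a downstream topic.
viewsPerUserObjectDay.toStream()
        .to("views-per-user-object-day", Produced.with(Serdes.String(), Serdes.Long()));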

Spring Kafka Stream doesn't get written

I'm writing a Spring Boot (2.1.4) app trying to use Spring Cloud Streams for Kafka.
What I'm trying to do is maintain a list of sensors on one topic ("sensors"). OTOH, I have incoming data on another topic ("data"). What I'm trying to achieve is that when I get data for a sensor I don't already have, I want to add it to the sensor list.
To do that, I create a KTable<String, Sensor> from the sensors topic, map the data topic to the pure sensor data (in this case, its name), and do an outer join with a ValueJoiner that retains the existing sensor if present and otherwise uses the reading's sensor. Then I write the result back to the sensors topic.
KTable<String, Sensor> sensorTable = ...;
KStream<String, SensorData> sensorDataStream = ...;
// get sensors providing measurements
KTable<String, Sensor> sensorsFromData =
        sensorDataStream.groupByKey()
                .aggregate(
                        Sensor::new,
                        (k, v, s) -> {
                            s.setName(k);
                            return s;
                        },
                        Materialized.with(Serdes.String(), SensorSerde.SERDE));
// join both sensor tables, preferring the existing ones
KTable<String, Sensor> joinedSensorTable =
        sensorTable.outerJoin(
                sensorsFromData,
                // only use sensors from measurements if sensor not already present
                (ex, ft) -> (ex != null) ? ex : ft,
                Materialized.<String, Sensor, KeyValueStore<Bytes, byte[]>>as(SENSORS_TABLE)
                        .withKeySerde(Serdes.String()).withValueSerde(SensorSerde.SERDE));
// write to new topic for downstream services
joinedSensorTable.toStream();
This works fine if I create this using a StreamsBuilder, i.e. if sensorTable and sensorDataStream come from something like builder.table("sensors", Consumed.with(Serdes.String(), SensorSerde.SERDE)).
However, I'm trying to use Spring Cloud Stream binding for this, i.e. the above code is wrapped in
@Configuration
@EnableBinding(SensorTableBinding.class)
class StreamConfiguration {
    static final String SENSORS_TABLE = "sensors-table";
    @StreamListener
    @SendTo("sensorsOut")
    private KStream<String, Sensor> getDataFromData(
            @Input("sensors") KTable<String, Sensor> sensorTable,
            @Input("data") KStream<String, SensorData> sensorDataStream) {
        // ...
        return joinedSensorTable.toStream();
    }
}
with a
interface SensorTableBinding {
    @Input("sensors")
    KTable<String, Sensor> sensorStream();
    @Output("sensorsOut")
    KStream<String, Sensor> sensorOutput();
    @Input("data")
    KStream<String, SensorData> sensorDataStream();
}
Here is the spring stream section of the application.properties:
spring.cloud.stream.kafka.streams.binder.configuration.default.key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
spring.cloud.stream.kafka.streams.binder.configuration.default.value.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
spring.cloud.stream.kafka.binder.brokers: ${spring.kafka.bootstrap-servers}
spring.cloud.stream.kafka.binder.configuration.auto.offset.reset: latest
spring.cloud.stream.kafka.binder.bindings.sensors.group: sensor-service
spring.cloud.stream.kafka.binder.bindings.sensors.destination: sensors
spring.cloud.stream.kafka.binder.bindings.sensorsOut.destination: sensors
spring.cloud.stream.kafka.binder.data.group: sensor-service
spring.cloud.stream.kafka.binder.data.destination: data
The stream gets initialized fine and the join is performed (the key-value store is filled properly); however, the resulting stream is never written to the "sensors" topic.
Why? Am I missing something?
Also: I'm sure there's a better way to de/serialize my objects from/to JSON using an existing Serde, rather than having to declare classes of my own to add to the processing (SensorSerde/SensorDataSerde are thin delegation wrappers around an ObjectMapper)?
Turns out the data was written after all, but to the wrong topic, namely sensorsOut.
The reason was the configuration. Instead of
spring.cloud.stream.kafka.binder.bindings.sensors.destination: sensors
spring.cloud.stream.kafka.binder.bindings.sensorsOut.destination: sensors
the topics are configured with this:
spring.cloud.stream.bindings.sensors.destination: sensors
spring.cloud.stream.bindings.sensorsOut.destination: sensors
For the sensors and data topics that didn't matter, because the binding's name was the same as the topic; but since Spring couldn't find a proper destination for the output, it used the binding's name sensorsOut and wrote the data there.
As a note, the whole configuration setup around these is very confusing. The individual items are documented, but it's hard to tell for each one which configuration prefix it belongs to. Looking into the source code doesn't help either, because at that level what's passed around are Maps with the prefix stripped from the keys at runtime, so it's really hard to tell where the data is coming from and what it will contain.
IMO it would really help to have actual @ConfigurationProperties-like data classes passed around, which would make it so much easier to understand.

Tombstone messages not removing record from KTable state store?

I am creating a KTable by processing data from a KStream. But when I trigger a tombstone message with a key and null payload, it does not remove the message from the KTable.
sample -
public KStream<String, GenericRecord> processRecord(@Input(Channel.TEST) KStream<GenericRecord, GenericRecord> testStream) {
    KTable<String, GenericRecord> table = testStream
            .map((genericRecord, genericRecord2) -> KeyValue.pair(genericRecord.get("field1") + "", genericRecord2))
            .groupByKey()
            .reduce((genericRecord, v1) -> v1, Materialized.as("test-store"));

GenericRecord genericRecord = new GenericData.Record(getAvroSchema(keySchema));
genericRecord.put("field1", Long.parseLong(test.getField1()));
ProducerRecord record = new ProducerRecord(Channel.TEST, genericRecord, null);
kafkaTemplate.send(record);
Upon triggering a message with a null value, I can debug into the testStream map function with the null payload, but it doesn't remove the record from the KTable changelog "test-store". It looks like the message doesn't even reach the reduce method; not sure what I am missing here.
Appreciate any help on this!
Thanks.
As documented in the JavaDocs of reduce():
Records with {@code null} key or value are ignored.
Because the <key, null> record is dropped, and thus (genericRecord, v1) -> v1 is never executed, no tombstone is written to the store or changelog topic.
For the use case you have in mind, you need to use a surrogate value that indicates "delete", for example a boolean flag within your Avro record. Your reduce function needs to check for the flag and return null if the flag is set; otherwise, it must process the record regularly.
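A minimal sketch of that idea (the "deleted" flag is a hypothetical field; it assumes such a boolean exists in your Avro value schema):
.groupByKey()
.reduce((oldValue, newValue) ->
        // Returning null here writes a tombstone to the store and changelog.
        Boolean.TRUE.equals(newValue.get("deleted")) ? null : newValue,
        Materialized.as("test-store"));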
Update:
Apache Kafka 2.6 adds the KStream#toTable() operator (via KIP-523), which allows transforming a KStream into a KTable directly.
An addition to the above answer by Matthias:
reduce() stores the first record for a key as-is, so the mapped and grouped value lands in the KTable without ever passing through the reduce method where the tombstoning would happen. This means it is not possible to simply join another stream against that table; the value itself also needs to be evaluated.
I hope KIP-523 solves this.
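For reference, a rough sketch of the KIP-523 route (assumes Kafka 2.6+; with toTable() a <key, null> record is interpreted as a delete for the resulting KTable, so no surrogate flag is needed):
KTable<String, GenericRecord> table = testStream
        .map((k, v) -> KeyValue.pair(k.get("field1") + "", v))
        .toTable(Materialized.as("test-store"));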

Kafka: Efficiently join windowed aggregates to events

I'm prototyping a fraud application. We'll frequently have metrics like "total amount of cash transactions in the last 5 days" that we need to compare against some threshold to determine if we raise an alert.
We're looking to use Kafka Streams to create and maintain the aggregates and then create an enhanced version of the incoming transaction that has the original transaction fields plus the aggregates. This enhanced record gets processed by a downstream rules system.
I'm wondering about the best way to approach this. I've prototyped creating the aggregates with code like this:
TimeWindows twoDayHopping = TimeWindows.of(TimeUnit.DAYS.toMillis(2))
        .advanceBy(TimeUnit.DAYS.toMillis(1));
KStream<String, AdditiveStatistics> aggrStream = transactions
        .filter((key, value) -> {
            return value.getAccountTypeDesc().equals("P") &&
                   value.getPrimaryMediumDesc().equals("CASH");
        })
        .groupByKey()
        .aggregate(AdditiveStatistics::new,
                (key, value, accumulator) -> AdditiveStatsUtil
                        .advance(value.getCurrencyAmount(), accumulator),
                twoDayHopping,
                metricsSerde,
                "sas10005_store")
        .toStream()
        .map((key, value) -> {
            value.setTransDate(key.window().start());
            return new KeyValue<String, AdditiveStatistics>(key.key(), value);
        })
        .through(Serdes.String(), metricsSerde, datedAggrTopic);
This creates a store-backed stream that has one record per key per window. I then join the original transactions stream to this windowed stream to produce the final output to a topic:
JoinWindows joinWindow = JoinWindows.of(TimeUnit.DAYS.toMillis(1))
        .before(TimeUnit.DAYS.toMillis(1))
        .after(-1)
        .until(TimeUnit.DAYS.toMillis(2) + 1);
KStream<String, Transactions10KEnhanced> enhancedTrans = transactions.join(aggrStream,
        (left, right) -> {
            Transactions10KEnhanced out = new Transactions10KEnhanced();
            out.setAccountNumber(left.getAccountNumber());
            out.setAccountTypeDesc(left.getAccountTypeDesc());
            out.setPartyNumber(left.getPartyNumber());
            out.setPrimaryMediumDesc(left.getPrimaryMediumDesc());
            out.setSecondaryMediumDesc(left.getSecondaryMediumDesc());
            out.setTransactionKey(left.getTransactionKey());
            out.setCurrencyAmount(left.getCurrencyAmount());
            out.setTransDate(left.getTransDate());
            if (right != null) {
                out.setSum2d(right.getSum());
            }
            return out;
        },
        joinWindow);
This produces the correct results, but it seems to run for quite a while, even with a low number of records. I'm wondering if there's a more efficient way to achieve the same result.
It's a config issue: cf. http://docs.confluent.io/current/streams/developer-guide.html#memory-management
Disabling caching by setting the cache size to zero (parameter cache.max.bytes.buffering in StreamsConfig) will resolve the "delayed" delivery to the output topic.
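For example (a minimal sketch of that setting):
Properties streamsConfig = new Properties();
// A cache size of 0 forwards every aggregate update downstream immediately
// instead of buffering it until commit time or cache eviction.
streamsConfig.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);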
You might also read this blog post for some background information about Streams design: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/

Kafka Streams API: I am joining two KStreams of empmodel

final KStream<String, EmpModel> empModelStream = getMapOperator(empoutStream);
final KStream<String, EmpModel> empModelinput = getMapOperator(inputStream);
// empModelinput.print();
// empModelStream.print();
empModelStream.join(empModelinput, new ValueJoiner<EmpModel, EmpModel, Object>() {
    @Override
    public Object apply(EmpModel paramV1, EmpModel paramV2) {
        System.out.println("Model1 " + paramV1.getKey());
        System.out.println("Model2 " + paramV2.getKey());
        return paramV1;
    }
}, JoinWindows.of("2000L"));
I get error:
Invalid topology building: KSTREAM-MAP-0000000003 and KSTREAM-MAP-0000000004 are not joinable
If you want to join two KStreams, you must ensure that both have the same number of partitions (cf. the "Note" box in http://docs.confluent.io/current/streams/developer-guide.html#joining-streams).
If you use Kafka v0.10.1+, repartitioning will happen automatically (cf. http://docs.confluent.io/current/streams/upgrade-guide.html#auto-repartitioning).
For Kafka v0.10.0.x you have two options:
ensure that the original input topics do have the same number of partitions
or, add a call to .through("my-repartitioning-topic") to one of the KStreams before the join. You need to create the topic "my-repartitioning-topic" with the right number of partitions (i.e., the same number of partitions as the other KStream's original input topic) before you start your Streams application; a rough sketch follows below.
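A minimal sketch of the second option (identifiers reused from the question; the topic name is an example and must be created with the matching partition count before the application starts):
// Route one join input through a pre-created repartitioning topic so that
// both sides of the join have the same number of partitions.
KStream<String, EmpModel> repartitionedStream =
        empModelStream.through("my-repartitioning-topic");
// then join repartitionedStream with empModelinput as before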
