I'm writing a Spring Boot (2.1.4) app trying to use Spring Cloud Streams for Kafka.
What I'm trying to do is maintain a list of sensors on one topic ("sensors"), while incoming measurements arrive on another topic ("data"). Whenever I receive data for a sensor I don't know yet, I want to add it to the sensor list.
To do that, I create a KTable<String, Sensor> from the sensors topic, map the data topic down to just the sensor information (in this case, its name), and do an outer join with a ValueJoiner that keeps the existing sensor if present and otherwise uses the sensor derived from the reading. Then I write the result back to the sensors topic.
KTable<String, Sensor> sensorTable = ...;
KStream<String, SensorData> sensorDataStream = ...;

// get sensors providing measurements
KTable<String, Sensor> sensorsFromData =
    sensorDataStream.groupByKey()
        .aggregate(
            Sensor::new,
            (k, v, s) -> {
                s.setName(k);
                return s;
            },
            Materialized.with(Serdes.String(), SensorSerde.SERDE));

// join both sensor tables, preferring the existing ones
KTable<String, Sensor> joinedSensorTable =
    sensorTable.outerJoin(
        sensorsFromData,
        // only use sensors from measurements if sensor not already present
        (ex, ft) -> (ex != null) ? ex : ft,
        Materialized.<String, Sensor, KeyValueStore<Bytes, byte[]>>as(SENSORS_TABLE)
            .withKeySerde(Serdes.String())
            .withValueSerde(SensorSerde.SERDE));

// write to new topic for downstream services
joinedSensorTable.toStream();
This works fine if I build the topology with a plain StreamsBuilder, i.e. if sensorTable and sensorDataStream come from something like builder.table("sensors", Consumed.with(Serdes.String(), SensorSerde.SERDE)).
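For completeness, a rough sketch of that plain-StreamsBuilder variant (SensorDataSerde.SERDE is assumed to exist analogously to SensorSerde.SERDE):

StreamsBuilder builder = new StreamsBuilder();
// sensor list, keyed by sensor name
KTable<String, Sensor> sensorTable =
    builder.table("sensors", Consumed.with(Serdes.String(), SensorSerde.SERDE));
// incoming measurements
KStream<String, SensorData> sensorDataStream =
    builder.stream("data", Consumed.with(Serdes.String(), SensorDataSerde.SERDE));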
However, I'm trying to use Spring Cloud Stream's binding support for this, i.e. the above code is wrapped in
@Configuration
@EnableBinding(SensorTableBinding.class)
class StreamConfiguration {
    static final String SENSORS_TABLE = "sensors-table";

    @StreamListener
    @SendTo("sensorsOut")
    private KStream<String, Sensor> getDataFromData(
            @Input("sensors") KTable<String, Sensor> sensorTable,
            @Input("data") KStream<String, SensorData> sensorDataStream) {
        // ...
        return joinedSensorTable.toStream();
    }
}
with a
interface SensorTableBinding {
    @Input("sensors")
    KTable<String, Sensor> sensorStream();

    @Output("sensorsOut")
    KStream<String, Sensor> sensorOutput();

    @Input("data")
    KStream<String, SensorData> sensorDataStream();
}
Here is the Spring Cloud Stream section of application.properties:
spring.cloud.stream.kafka.streams.binder.configuration.default.key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
spring.cloud.stream.kafka.streams.binder.configuration.default.value.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
spring.cloud.stream.kafka.binder.brokers: ${spring.kafka.bootstrap-servers}
spring.cloud.stream.kafka.binder.configuration.auto.offset.reset: latest
spring.cloud.stream.kafka.binder.bindings.sensors.group: sensor-service
spring.cloud.stream.kafka.binder.bindings.sensors.destination: sensors
spring.cloud.stream.kafka.binder.bindings.sensorsOut.destination: sensors
spring.cloud.stream.kafka.binder.data.group: sensor-service
spring.cloud.stream.kafka.binder.data.destination: data
The stream gets initialized fine, and the join is performed (the key-value store is filled properly); however, the resulting stream is never written to the "sensors" topic.
Why? Am I missing something?
Also: I'm sure there must be a better way to de/serialize my objects from/to JSON using an existing Serde, rather than having to declare classes of my own just for the processing (SensorSerde/SensorDataSerde are thin delegation wrappers around an ObjectMapper)?
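For reference, such a wrapper looks roughly like this (a simplified sketch, not the exact class; the structure and naming are illustrative):

public class SensorSerde implements Serde<Sensor> {

    public static final Serde<Sensor> SERDE = new SensorSerde();

    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    public Serializer<Sensor> serializer() {
        return new Serializer<Sensor>() {
            @Override
            public void configure(Map<String, ?> configs, boolean isKey) { }

            @Override
            public byte[] serialize(String topic, Sensor sensor) {
                try {
                    return sensor == null ? null : MAPPER.writeValueAsBytes(sensor);
                } catch (JsonProcessingException e) {
                    throw new SerializationException(e);
                }
            }

            @Override
            public void close() { }
        };
    }

    @Override
    public Deserializer<Sensor> deserializer() {
        return new Deserializer<Sensor>() {
            @Override
            public void configure(Map<String, ?> configs, boolean isKey) { }

            @Override
            public Sensor deserialize(String topic, byte[] bytes) {
                try {
                    return bytes == null ? null : MAPPER.readValue(bytes, Sensor.class);
                } catch (IOException e) {
                    throw new SerializationException(e);
                }
            }

            @Override
            public void close() { }
        };
    }

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) { }

    @Override
    public void close() { }
}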
Turns out the data was written after all, but to the wrong topic, namely sensorsOut.
The reason was the configuration. Instead of
spring.cloud.stream.kafka.binder.bindings.sensors.destination: sensors
spring.cloud.stream.kafka.binder.bindings.sensorsOut.destination: sensors
the topics have to be configured like this:
spring.cloud.stream.bindings.sensors.destination: sensors
spring.cloud.stream.bindings.sensorsOut.destination: sensors
For the sensors and data bindings that didn't matter, because the binding name was the same as the topic name; but since Spring couldn't find a proper destination for the output, it fell back to the binding's name sensorsOut and wrote the data there.
As a note, the whole configuration setup around this is very confusing. The individual items are documented, but for each of them it's hard to tell which configuration prefix it belongs under. Looking into the source code doesn't help either, because at that level what gets passed around are Maps whose keys have already been stripped of their prefix at runtime, so it's really hard to tell where the data comes from and what it will contain.
IMO it would really help to have actual @ConfigurationProperties-like data classes passed around, which would make this so much easier to understand.
Related
How can I add an incoming topic and change the outgoing topic while the application is running? Depending on which incoming topic is currently being processed, the outgoing topic should change:
in_topic1 -> filter OK -> out_topic1;
in_topic2 -> filter OK -> out_topic2.
final Serde<byte[]> byteArraySerde = Serdes.ByteArray();
final Serde<String> stringSerde = Serdes.String();
final StreamsBuilder builder = new StreamsBuilder();

final KStream<byte[], String> textLines = builder
    .stream(prop.getProperty("kafka.topic.in"), Consumed.with(byteArraySerde, stringSerde));

final KStream<byte[], String> processed = textLines
    .filter(MetaModelProcessor.filter())
    .mapValues(MetaModelProcessor.getMetaModel());

processed.to(prop.getProperty("kafka.topic.out"));

final KafkaStreams streams = new KafkaStreams(
    builder.build(),
    new KafkaStreamsConfig(
        prop.getProperty("kafka.app.id.config"),
        prop.getProperty("kafka.client.id.config"),
        prop.getProperty("kafka.server")).getStreamsConfiguration());
streams.cleanUp();
streams.start();

Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
A Kafka Streams application is basically a wrapper over a producer and a consumer with higher-order transformation functions. When you create a Streams application, you initialize a topology that interacts with the broker. Adding ingress and egress topics dynamically is not a trivial operation.
What would happen to the intermediately processed results of a v1 topology that consumed a message from topic I1 and was just about to write to topic T1 when a dynamic event switches the output topic to T2? Worse, what if there's a state store being maintained?
This seems to be a strange requirement. If you find yourself in this place, it probably means the use case and the design need to be revisited thoroughly.
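That said, if the set of input and output topics is known up front, the routing sketched in the question doesn't require a dynamic topology at all: both input topics can be subscribed together and each record routed via a TopicNameExtractor (available since Kafka 2.0). A minimal sketch, with illustrative topic names:

final StreamsBuilder builder = new StreamsBuilder();

// subscribe to both known input topics at once
final KStream<byte[], String> lines = builder.stream(
    Arrays.asList("in_topic1", "in_topic2"),
    Consumed.with(Serdes.ByteArray(), Serdes.String()));

lines
    .filter(MetaModelProcessor.filter())
    .mapValues(MetaModelProcessor.getMetaModel())
    // choose the output topic per record, based on which topic the record came from
    .to((key, value, recordContext) ->
        "in_topic1".equals(recordContext.topic()) ? "out_topic1" : "out_topic2");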
I am trying to build the following topology:
Using Debezium connectors, I am pulling 2 tables (let's call them tables A and DA). As per DBZ, the topics where the table rows are stored have the structure { before: "...", after: "..." }.
The first step in my topology is to create "clean" KStreams off these two "table" topics. The sub-topology there looks roughly like this:
private static KStream<String, TABLE_A.Value> getTableARowByIdStream(
        StreamsBuilder builder, Properties streamsConfig) {
    return builder
        .stream("TABLE_A", Consumed.withTimestampExtractor(Application::getRowDate))
        .filter((key, envelope) -> [ some filtering condition ])
        .map((key, envelope) -> [ maps to TABLE_A.Value ])
        .through(tableRowByIdTopicName);
}
Notice that I am assigning the record time explicitly, because the table rows will be CDC'ed "years" after they were originally published. What the function does at the moment is fake the time, starting at 2010-01-01 and, using an AtomicInteger, adding 1 millisecond for each consumed entity. It does this for table A but not for DA (I will explain why later).
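Roughly, the extractor referenced as Application::getRowDate does something like this (a simplified sketch; names are illustrative and the counter here uses an AtomicLong so it can hold epoch milliseconds):

// Fakes event time: starts at 2010-01-01 and advances by 1 ms per consumed record.
private static final AtomicLong FAKE_TIME = new AtomicLong(
    LocalDate.of(2010, 1, 1).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli());

// Matches the TimestampExtractor functional-interface signature,
// so it can be plugged in as a method reference.
private static long getRowDate(ConsumerRecord<Object, Object> record, long previousTimestamp) {
    return FAKE_TIME.getAndIncrement();
}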
Phase 2 of the topology is to build a KTable based on the "cleaned" topic for table A, like this:
private static KTable<String, EntityInfoList> getEntityInfoListById(
        KStream<String, TABLE_A.Value> tableAByIdStream) {
    return tableAByIdStream
        .map((key, value) -> [ some mapping ])
        .groupByKey()
        .aggregate(() -> [ builds up an EntityInfoList object ]);
}
Finally, with the KTable ready, I'm joining it with the KStream over DA like so:
private static KStream<String, OutputTopicEntity> getOutputTopicEntityStream(
        KStream<String, Table_DA.Value> tableDAStream,
        KTable<String, EntityInfoList> tableA_KTable) {

    KStream<String, Table_DA.Value>[] branches = tableDAStream.branch(
        (key, value) -> [ some logic ],
        (key, value) -> true);

    KStream<String, OutputTopicEntity> internalAccountRefStream = branches[0]
        .join(
            tableA_KTable,
            (streamValue, tableValue) -> [ some logic to build a list of OutputTopicEntity ])
        .flatMap((key, listValue) -> [ some logic to flatten it ]);

    [ similar logic with branches[1] ]
}
My problem is that, despite the fact that I am "faking" the time for records coming from the Table_A topic (I've verified with kafkacat that they reference 2010/01/01) while entries in Table_DA (the stream side of the join) have timestamps around today (2019/08/14), Kafka Streams doesn't seem to hold back reading entries from the Table_DA KStream until it has ingested all records from Table_A into the KTable.
As a result, I don't get all the "join hits" that I was expecting, and the outcome is also nondeterministic. My understanding, based on this sentence from What are the differences between KTable vs GlobalKTable and leftJoin() vs outerJoin()?, was the opposite:
For stream-table joins, Kafka Streams aligns record processing based on record timestamps. Thus, the updates to the table are aligned with the records of your stream.
My experience so far is that this is not happening. I can also easily see how my application keeps churning through the Table_A topic well after it has consumed all entries in the Table_DA stream (which happens to be 10 times smaller).
Am I doing something wrong?
Timestamp synchronization is best effort before the 2.1.0 release (cf. https://issues.apache.org/jira/browse/KAFKA-3514).
As of 2.1.0, timestamps are synchronized strictly. However, if one input does not have any data, Kafka Streams will "enforce" processing as described in KIP-353 to avoid blocking forever. If you have bursty inputs and want to "block" processing for some time when one input has no data, you can increase the configuration parameter max.task.idle.ms (default is 0), introduced in 2.1.0 via KIP-353.
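For example (a sketch; the 5-second value is arbitrary and only illustrates the idea):

Properties props = new Properties();
// let a task wait up to 5 seconds for data on all of its input partitions
// before enforced processing kicks in (default is 0)
props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 5000L);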
I have a topic with multiple partitions. In my stream processor I just want to consume from one partition, and I could not figure out how to configure this:
spring.cloud.stream.kafka.streams.bindings.input.consumer.application-id=s-processor
spring.cloud.stream.bindings.input.destination=uinput
spring.cloud.stream.bindings.input.group=r-processor
spring.cloud.stream.bindings.input.contentType=application/java-serialized-object
spring.cloud.stream.bindings.input.consumer.header-mode=raw
spring.cloud.stream.bindings.input.consumer.use-native-decoding=true
spring.cloud.stream.bindings.input.consumer.partitioned=true
@StreamListener(target = "input")
// @SendTo(value = { "uoutput" })
public void process(KStream<UUID, AModel> ustream) {
I want only one partition's data to be processed by this processor; there will be other processors for the other partition(s).
So far my finding is that it has something to do with https://kafka.apache.org/20/javadoc/org/apache/kafka/streams/StreamsConfig.html#PARTITION_GROUPER_CLASS_CONFIG, but I could not find out how to set this property in Spring's application.properties.
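For what it's worth, arbitrary Kafka Streams properties can be passed through the binder's configuration prefix (the same mechanism used for the default serdes earlier); the class name below is purely illustrative, and as the answers point out, the partition grouper is probably not the right tool here anyway:

spring.cloud.stream.kafka.streams.binder.configuration.partition.grouper=com.example.MyPartitionGrouper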
I think the partition grouper is meant to group partitions into tasks within a single processor instance. If you want to ensure that only a single partition is processed by each processor, you need to provide at least as many processor instances as there are topic partitions. E.g. if your topic has 4 partitions, you need to run 4 instances of the stream application to ensure that each instance processes only a single partition.
Kafka Streams does not allow you to read a single partition. If you subscribe to a topic, all partitions are consumed and distributed over the available instances. Thus, you can't know in advance which partition is assigned to which instance, and all instances execute the same code.
But each partition carries a different kind of data, so each requires a different processor application.
For this case, the processor (or transformer) must be able to process data for all partitions. Kafka Streams exposes the partition number via the ProcessorContext object that is handed to a processor via init(): https://kafka.apache.org/20/javadoc/org/apache/kafka/streams/kstream/Transformer.html#init-org.apache.kafka.streams.processor.ProcessorContext-
Thus, you need to "branch" within your transformer to apply different processing logic based on the partition:
ustream.transform(() -> new MyTransformer());

class MyTransformer<K, V, R> implements Transformer<K, V, R> {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context; // keep the context to query the partition later
    }

    // other methods omitted

    @Override
    public R transform(K key, V value) {
        switch (context.partition()) {
            case 0:
                // your processing logic
                break;
            case 1:
                // your processing logic
                break;
            // ...
        }
        return null; // replace with the result of your partition-specific logic
    }
}
I'm prototyping a fraud application. We'll frequently have metrics like "total amount of cash transactions in the last 5 days" that we need to compare against some threshold to determine if we raise an alert.
We're looking to use Kafka Streams to create and maintain the aggregates and then create an enhanced version of the incoming transaction that has the original transaction fields plus the aggregates. This enhanced record gets processed by a downstream rules system.
I'm wondering about the best way to approach this. I've prototyped creating the aggregates with code like this:
TimeWindows twoDayHopping = TimeWindows.of(TimeUnit.DAYS.toMillis(2))
                                       .advanceBy(TimeUnit.DAYS.toMillis(1));

KStream<String, AdditiveStatistics> aggrStream = transactions
    .filter((key, value) -> {
        return value.getAccountTypeDesc().equals("P") &&
               value.getPrimaryMediumDesc().equals("CASH");
    })
    .groupByKey()
    .aggregate(AdditiveStatistics::new,
               (key, value, accumulator) ->
                   AdditiveStatsUtil.advance(value.getCurrencyAmount(), accumulator),
               twoDayHopping,
               metricsSerde,
               "sas10005_store")
    .toStream()
    .map((key, value) -> {
        value.setTransDate(key.window().start());
        return new KeyValue<String, AdditiveStatistics>(key.key(), value);
    })
    .through(Serdes.String(), metricsSerde, datedAggrTopic);
This creates a store-backed stream that has one record per key per window. I then join the original transactions stream to this windowed stream to produce the final output to a topic:
JoinWindows joinWindow = JoinWindows.of(TimeUnit.DAYS.toMillis(1))
                                    .before(TimeUnit.DAYS.toMillis(1))
                                    .after(-1)
                                    .until(TimeUnit.DAYS.toMillis(2) + 1);

KStream<String, Transactions10KEnhanced> enhancedTrans = transactions.join(
    aggrStream,
    (left, right) -> {
        Transactions10KEnhanced out = new Transactions10KEnhanced();
        out.setAccountNumber(left.getAccountNumber());
        out.setAccountTypeDesc(left.getAccountTypeDesc());
        out.setPartyNumber(left.getPartyNumber());
        out.setPrimaryMediumDesc(left.getPrimaryMediumDesc());
        out.setSecondaryMediumDesc(left.getSecondaryMediumDesc());
        out.setTransactionKey(left.getTransactionKey());
        out.setCurrencyAmount(left.getCurrencyAmount());
        out.setTransDate(left.getTransDate());
        if (right != null) {
            out.setSum2d(right.getSum());
        }
        return out;
    },
    joinWindow);
This produces the correct results, but it seems to run for quite a while, even with a low number of records. I'm wondering if there's a more efficient way to achieve the same result.
It's a config issue: cf. http://docs.confluent.io/current/streams/developer-guide.html#memory-management
Disabling caching by setting the cache size to zero (parameter cache.max.bytes.buffering in StreamsConfig) will resolve the "delayed" delivery to the output topic.
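A minimal sketch of that setting, applied to the Properties used to build the KafkaStreams instance:

Properties props = new Properties();
// disable record caching so aggregation updates are forwarded downstream immediately
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);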
You might also read this blog post for some background information about Streams design: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
final KStream<String, EmpModel> empModelStream = getMapOperator(empoutStream);
final KStream<String, EmpModel> empModelinput = getMapOperator(inputStream);

// empModelinput.print();
// empModelStream.print();

empModelStream.join(empModelinput, new ValueJoiner<EmpModel, EmpModel, Object>() {
    @Override
    public Object apply(EmpModel paramV1, EmpModel paramV2) {
        System.out.println("Model1 " + paramV1.getKey());
        System.out.println("Model2 " + paramV2.getKey());
        return paramV1;
    }
}, JoinWindows.of("2000L"));
I get this error:
Invalid topology building: KSTREAM-MAP-0000000003 and KSTREAM-MAP-0000000004 are not joinable
If you want to join two KStreams, you must ensure that both have the same number of partitions (cf. the "Note" box in http://docs.confluent.io/current/streams/developer-guide.html#joining-streams).
If you use Kafka v0.10.1+, repartitioning happens automatically (cf. http://docs.confluent.io/current/streams/upgrade-guide.html#auto-repartitioning).
For Kafka v0.10.0.x you have two options:
ensure that the original input topics have the same number of partitions,
or add a call to .through("my-repartitioning-topic") to one of the KStreams before the join (see the sketch below). You need to create the topic "my-repartitioning-topic" with the right number of partitions (i.e. the same number of partitions as the other KStream's original input topic) before you start your Streams application.
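A minimal sketch of that second option, assuming the repartitioning topic has already been created with the matching partition count (stream names and the joiner are taken from the question; the topic name is just an example, and you may need the through(keySerde, valueSerde, topic) overload if the default serdes don't match EmpModel):

// route one side through a pre-created topic so both join inputs
// end up with the same number of partitions
KStream<String, EmpModel> repartitioned = empModelStream.through("my-repartitioning-topic");

repartitioned.join(empModelinput, new ValueJoiner<EmpModel, EmpModel, Object>() {
    @Override
    public Object apply(EmpModel paramV1, EmpModel paramV2) {
        return paramV1;
    }
}, JoinWindows.of("2000L"));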