How to use a KeyValueStore state store in DSL? - apache-kafka-streams

KeyValueStore<String, Long> kvStore = (KeyValueStore<String, Long>)
    Stores.create("InterWindowStore1")
          .withKeys(Serdes.String())
          .withValues(Serdes.Long())
          .persistent()
          .build()
          .get();
I created the state store as shown above and tried to insert with kvStore.put(key, value); but it throws an NPE:
Caused by: java.lang.NullPointerException
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.put(MeteredKeyValueStore.java:117)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:82)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:202)
at org.apache.kafka.streams.kstream.internals.ForwardingCacheFlushListener.apply(ForwardingCacheFlushListener.java:42)
at org.apache.kafka.streams.state.internals.CachingWindowStore.maybeForward(CachingWindowStore.java:103)
at org.apache.kafka.streams.state.internals.CachingWindowStore.access$200(CachingWindowStore.java:34)
at org.apache.kafka.streams.state.internals.CachingWindowStore$1.apply(CachingWindowStore.java:86)
at org.apache.kafka.streams.state.internals.NamedCache.flush(NamedCache.java:131)
at org.apache.kafka.streams.state.internals.NamedCache.flush(NamedCache.java:95)

As you describe in your comments, you are basically doing a windowed aggregation:
KStream stream = ...
KTable table = stream.groupByKey().aggregate(..., TimeWindows.of(...));
As your KTable changelog stream might contain updates for your window aggregation, you want to modify this stream. For this, you can use a stateful transformer or value transformer:
StateStoreSupplier myState = Stores.create("nameOfMyState")....;
KStream result = table.toStream().transform(..., "nameOfMyState");
Finally, you can write your result to the output topic:
result.to("output-topic");
The Transformer you provide to transform() can retrieve the state store via the given context in init() and use it within transform() each time a window result is generated or updated.
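For illustration, a minimal sketch of such a Transformer, assuming a newer Kafka Streams API (2.x) and illustrative String/Long types; the store also has to be registered with the topology (e.g. via addStateStore) under the same name that is passed to transform():
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Illustrative stateful Transformer; the class name and key/value types are assumptions,
// not taken from the original post.
public class MyStatefulTransformer implements Transformer<String, Long, KeyValue<String, Long>> {

    private KeyValueStore<String, Long> state;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        // Look up the store by the name it was registered under ("nameOfMyState").
        this.state = (KeyValueStore<String, Long>) context.getStateStore("nameOfMyState");
    }

    @Override
    public KeyValue<String, Long> transform(final String key, final Long value) {
        // Example logic: remember the latest value per key and forward the record downstream.
        state.put(key, value);
        return KeyValue.pair(key, value);
    }

    @Override
    public void close() {
        // nothing to clean up
    }
}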

Related

Spring Kafka Stream doesn't get written

I'm writing a Spring Boot (2.1.4) app trying to use Spring Cloud Streams for Kafka.
What I'm trying to do is maintain a list of sensors on one topic ("sensors"). OTOH, I have incoming data on the other topic ("data"). What I'm trying to achieve is that when I get data for a sensor I don't already have, I want to add it to the sensor list.
To do that, I create a KTable<String, Sensor> from the sensors topic, map the temperature topic to just the sensor data (in this case, its name), and do an outer join with a ValueJoiner that retains the existing sensor if present and otherwise uses the reading's sensor. Then I write the result back to the sensors topic.
KTable<String, Sensor> sensorTable = ...;
KStream<String, SensorData> sensorDataStream = ...;

// get sensors providing measurements
KTable<String, Sensor> sensorsFromData =
    sensorDataStream.groupByKey()
        .aggregate(
            Sensor::new,
            (k, v, s) -> {
                s.setName(k);
                return s;
            },
            Materialized.with(Serdes.String(), SensorSerde.SERDE));

// join both sensor tables, preferring the existing ones
KTable<String, Sensor> joinedSensorTable =
    sensorTable.outerJoin(
        sensorsFromData,
        // only use sensors from measurements if sensor not already present
        (ex, ft) -> (ex != null) ? ex : ft,
        Materialized.<String, Sensor, KeyValueStore<Bytes, byte[]>>as(SENSORS_TABLE)
            .withKeySerde(Serdes.String())
            .withValueSerde(SensorSerde.SERDE));

// write to new topic for downstream services
joinedSensorTable.toStream();
This works fine if I create this using a StreamsBuilder - i.e. if sensorTable and sensorDataStream come from something like builder.table("sensors", Consumed.with(Serdes.String(), SensorSerde.SERDE)).
However, I'm trying to use Spring Cloud Stream binding for this, i.e. the above code is wrapped in
@Configuration
@EnableBinding(SensorTableBinding.class)
class StreamConfiguration {
    static final String SENSORS_TABLE = "sensors-table";

    @StreamListener
    @SendTo("sensorsOut")
    private KStream<String, Sensor> getDataFromData(
            @Input("sensors") KTable<String, Sensor> sensorTable,
            @Input("data") KStream<String, SensorData> sensorDataStream) {
        // ...
        return joinedSensorTable.toStream();
    }
}
with a
interface SensorTableBinding {

    @Input("sensors")
    KTable<String, Sensor> sensorStream();

    @Output("sensorsOut")
    KStream<String, Sensor> sensorOutput();

    @Input("data")
    KStream<String, SensorData> sensorDataStream();
}
Here is the spring stream section of the application.properties:
spring.cloud.stream.kafka.streams.binder.configuration.default.key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
spring.cloud.stream.kafka.streams.binder.configuration.default.value.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
spring.cloud.stream.kafka.binder.brokers: ${spring.kafka.bootstrap-servers}
spring.cloud.stream.kafka.binder.configuration.auto.offset.reset: latest
spring.cloud.stream.kafka.binder.bindings.sensors.group: sensor-service
spring.cloud.stream.kafka.binder.bindings.sensors.destination: sensors
spring.cloud.stream.kafka.binder.bindings.sensorsOut.destination: sensors
spring.cloud.stream.kafka.binder.data.group: sensor-service
spring.cloud.stream.kafka.binder.data.destination: data
The stream gets initialized fine and the join is performed (the key-value store is filled properly); however, the resulting stream is never written to the "sensors" topic.
Why? Am I missing something?
Also: I'm sure there's a better way to de/serialize my objects from/to JSON using an existing Serde, rather than having to declare classes of my own to add to the processing (SensorSerde/SensorDataSerde are thin delegation wrappers around an ObjectMapper)?
Turns out the data was written after all, but to the wrong topic, namely sensorsOut.
The reason was the configuration. Instead of
spring.cloud.stream.kafka.binder.bindings.sensors.destination: sensors
spring.cloud.stream.kafka.binder.bindings.sensorsOut.destination: sensors
the topics are configured with this:
spring.cloud.stream.bindings.sensors.destination: sensors
spring.cloud.stream.bindings.sensorsOut.destination: sensors
For the sensors and data topics, that didn't matter, because the binding's name was the same as the topic; but since Spring couldn't find a proper destination for the output, it used the binding's name sensorsOut and wrote the data there.
As a note, the whole configuration setup around these is very confusing. The individual items are documented, but it's hard to tell which configuration prefix each of them belongs to. Looking into the source code doesn't help either, because at that level what's passed around are Maps with the keys stripped of their prefixes at runtime, so it's really hard to tell where the data is coming from and what it will contain.
IMO it would really help to have actual @ConfigurationProperties-like data classes passed around, which would make this so much easier to understand.

kafka streams DSL: add an option parameter to disable repartition when using `map` `selectKey` `groupBy`

According to the documentation, a stream is marked for repartitioning after map, selectKey, or groupBy is applied, even if the new key is already partitioned appropriately. Is it possible to add an optional parameter to disable repartitioning?
Here is my use case:
There is a topic that has been partitioned by user_id.
# topic 'user', format '%key,%value'
partition-1:
user1,{'user_id':'user1', 'device_id':'device1'}
user1,{'user_id':'user1', 'device_id':'device1'}
user1,{'user_id':'user1', 'device_id':'device2'}
partition-2:
user2,{'user_id':'user2', 'device_id':'device3'}
user2,{'user_id':'user2', 'device_id':'device4'}
I want to count user_id-device_id pairs using the DSL as follows:
stream
    .groupBy((user_id, value) -> {
        JSONObject event = new JSONObject(value);
        String userId = event.getString("user_id");
        String deviceId = event.getString("device_id");
        return String.format("%s&%s", userId, deviceId);
    })
    .count();
Actually, the new key is already partitioned indirectly, so there is no need to repartition again.
If you use groupBy(), it always causes data re-partitioning. If possible, use groupByKey() instead, which re-partitions data only if required.
In your case, you are changing the keys anyway, so a re-partition topic will be created.
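For illustration, a minimal sketch of the difference, assuming stream is a KStream<String, String> keyed by user_id as in the example above:
// Counting by the existing key (user_id): no re-keying, so no repartition topic
// is created (assuming no earlier key-changing operation marked the stream).
KTable<String, Long> countsByUser = stream
    .groupByKey()
    .count();

// Counting by a derived key (user_id&device_id): the key changes, so Kafka Streams
// must repartition to guarantee all records for the new key end up on the same task.
KTable<String, Long> countsByUserDevice = stream
    .groupBy((userId, value) -> userId + "&" + new JSONObject(value).getString("device_id"))
    .count();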

spring statemachine set multiple initial state

I have a series of order states like
public enum orderStateEnum {
    STATE_UNUSED("UNUSED"),
    STATE_ORDERED("ORDERED"),
    STATE_ASSIGNED("ASSIGNED"),
    STATE_ASSIGN_EXCEPTION("ASSIGN_EXCEPTION"),
    STATE_PACKED("PACKED");
    // and so on

    private final String value;

    orderStateEnum(String value) {
        this.value = value;
    }
}
and I want to use spring.statemachine (or another state machine implementation) to manage transitions like STATE_UNUSED to STATE_ORDERED, STATE_ORDERED to STATE_ASSIGNED, STATE_ORDERED to STATE_ASSIGN_EXCEPTION, and STATE_ASSIGNED to STATE_PACKED. However, all the order data is stored in a database, so in my case, if I have an order in the STATE_ASSIGNED state, I fetch the order state from the database, but with spring.statemachine I have to do something like:
StateMachine stateMachine = new StateMachine();
stateMachine.createEvent(Event_take_order);
When I create a new instance of the state machine, its initial state is STATE_UNUSED; however, I want the initial state to be the one I fetched from the database, which is STATE_ASSIGNED. How can I achieve that? I've read https://docs.spring.io/spring-statemachine/docs/1.0.0.BUILD-SNAPSHOT/reference/htmlsingle/ but I can't find a solution in it.
When you create a new StateMachine you can get StateMachineAccessor using stateMachine.getStateMachineAccessor()
StateMachineAccessor is:-
Functional interface for StateMachine to allow more programmatic access to underlying functionality. Functions prefixed "doWith" will expose StateMachineAccess via StateMachineFunction for better functional access with jdk7. Functions prefixed "with" are better suitable for lambdas. (From the Java Docs)
StateMachineAccessor has a method called doWithAllRegions where you can provide implementation of StateMachineFunction (interface) and doWithAllRegions will execute given StateMachineFunction with all recursive regions.
So, to achieve what you are trying to do the code will look like this:-
StateMachine<orderStateEnum, Events> stateMachine = smFactory.getStateMachine();
stateMachine.getStateMachineAccessor().doWithAllRegions(access ->
        access.resetStateMachine(new DefaultStateMachineContext<>(STATE_ASSIGNED, null, null, null)));
I have provided the implementation of the interfaces using lambdas.
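A minimal sketch of how this could be wired together; orderRepository and orderId are hypothetical, and the stop/reset/start sequence is the commonly used pattern for rehydrating a machine from persisted state:
// Fetch the persisted state for the order from the database (hypothetical repository).
orderStateEnum persistedState = orderRepository.findStateByOrderId(orderId);

StateMachine<orderStateEnum, Events> stateMachine = smFactory.getStateMachine();

// Stop the machine, rewind every region to the persisted state, then start it again.
stateMachine.stop();
stateMachine.getStateMachineAccessor().doWithAllRegions(access ->
        access.resetStateMachine(new DefaultStateMachineContext<>(persistedState, null, null, null)));
stateMachine.start();

// The machine now continues from the database state (e.g. STATE_ASSIGNED)
// instead of the configured initial state.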

Tombstone messages not removing record from KTable state store?

I am creating a KTable by processing data from a KStream. But when I send a tombstone message with a key and a null payload, it does not remove the record from the KTable.
sample -
public KStream<String, GenericRecord> processRecord(@Input(Channel.TEST) KStream<GenericRecord, GenericRecord> testStream) {
    KTable<String, GenericRecord> table = testStream
        .map((genericRecord, genericRecord2) -> KeyValue.pair(genericRecord.get("field1") + "", genericRecord2))
        .groupByKey()
        .reduce((genericRecord, v1) -> v1, Materialized.as("test-store"));
    // ...
}

GenericRecord genericRecord = new GenericData.Record(getAvroSchema(keySchema));
genericRecord.put("field1", Long.parseLong(test.getField1()));
ProducerRecord record = new ProducerRecord(Channel.TEST, genericRecord, null);
kafkaTemplate.send(record);
When I send a message with a null value, I can debug into the map function on testStream and see the null payload, but the record is not removed from the KTable changelog "test-store". It looks like the message doesn't even reach the reduce method; I'm not sure what I am missing here.
Appreciate any help on this!
Thanks.
As documented in the JavaDocs of reduce()
Records with {#code null} key or value are ignored.
Because the <key, null> record is dropped (and thus (genericRecord, v1) -> v1 is never executed), no tombstone is written to the store or the changelog topic.
For the use case you have in mind, you need to use a surrogate value that indicates "delete", for example a boolean flag within your Avro record. Your reduce function needs to check for the flag and return null if the flag is set; otherwise, it must process the record regularly.
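A minimal sketch of that approach, assuming a hypothetical boolean field named "deleted" in the Avro value schema; returning null from the reducer, as described above, removes the key from the KTable and produces a tombstone in the changelog:
KTable<String, GenericRecord> table = testStream
    .map((keyRecord, valueRecord) -> KeyValue.pair(keyRecord.get("field1") + "", valueRecord))
    .groupByKey()
    .reduce(
        (currentValue, newValue) ->
            // "deleted" is a hypothetical flag in the Avro schema marking a logical delete
            Boolean.TRUE.equals(newValue.get("deleted")) ? null : newValue,
        Materialized.as("test-store"));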
Update:
Apache Kafka 2.6 adds the KStream#toTable() operator (via KIP-523), which allows you to transform a KStream into a KTable.
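With that operator, null-valued records in the stream are interpreted as deletes for the resulting table, so no surrogate value is needed; a minimal sketch, assuming the same mapping and store name as above:
KTable<String, GenericRecord> table = testStream
    .map((keyRecord, valueRecord) -> KeyValue.pair(keyRecord.get("field1") + "", valueRecord))
    .toTable(Materialized.as("test-store"));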
An addition to the above answer by Matthias:
Reduce ignores the first record on the stream, so the mapped and grouped value will be stored as-is in the KTable, never passing through the reduce method for tombstoning. This means it is not possible to simply join another stream against that table; the value itself also needs to be evaluated.
I hope KIP-523 solves this.

Kafka Streams not triggering output for joined streams?

I have raw streams from 3 MySQL tables: one primary and two child tables. I tried to join the three raw streams and transform them into a single output stream. It works if there is an update on the parent stream, but no output is triggered when anything changes on a child stream.
@StreamListener
public KStream<Long, Output> handleStreams(@Input KStream<Long, Parent> parentStream,
                                           @Input KStream<Long, Child1> child1Stream,
                                           @Input KStream<Long, Child2> child2Stream) {
    KTable<Long, Parent> parentTable = convertParent(parentStream);
    KTable<Long, ArrayList<Child1>> child1Table = convertChild1(parentStream);
    KTable<Long, ArrayList<Child2>> child2Table = convertChild2(parentStream);

    return parentTable
        .leftJoin(child1Table, (parent, child1List) -> new Output(parent, child1List))
        .leftJoin(child2Table, (output, child2List) -> output.setChild2List(child2List))
        .toStream();
}
Any new add or update on the parent stream is picked up by the processor, joined with the other KTables, and returned on the output stream. But an add or update on child1Stream or child2Stream does not trigger the output stream.
I thought that by making all input streams into KTables they would always store the changes, and since all of them have the same key, any update on the parent or child tables would be picked up by the joins. But that is not happening; can anyone suggest what I am missing here?
I already tried KStream-KStream, KStream-KTable, and KTable-KTable joins; none of them worked in the case of child updates.
-Thanks
Can you show where you have the EnableBinding and the processor interface that you are binding to?
This doesn't look right to me:
@StreamListener
public KStream<Long, Output> handleStreams(@Input KStream<Long, Parent> parentStream,
                                           @Input KStream<Long, Child1> child1Stream,
                                           @Input KStream<Long, Child2> child2Stream) {
You are not specifying a binding on the inputs. You need to have something like this when you have multiple inputs:
@StreamListener
public KStream<Long, Output> handleStreams(@Input("input1") KStream<Long, Parent> parentStream,
                                           @Input("input2") KStream<Long, Child1> child1Stream,
                                           @Input("input3") KStream<Long, Child2> child2Stream) {
Each of those inputs needs to be defined in the processor interface. See here for an example: https://github.com/spring-cloud/spring-cloud-stream-samples/blob/master/kafka-streams-samples/kafka-streams-table-join/src/main/java/kafka/streams/table/join/KafkaStreamsTableJoin.java#L46
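For reference, a minimal sketch of what such a processor interface could look like; the binding names input1, input2, input3, and output are illustrative and have to match your application configuration (the interface would be referenced from @EnableBinding, and the output binding from @SendTo):
public interface ParentAndChildrenProcessor {

    @Input("input1")
    KStream<Long, Parent> parentStream();

    @Input("input2")
    KStream<Long, Child1> child1Stream();

    @Input("input3")
    KStream<Long, Child2> child2Stream();

    @Output("output")
    KStream<Long, Output> outputStream();
}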
Notice how your child tables are created from the same stream as the parentTable:
KTable<Long, ArrayList<Child1>> child1Table = convertChild1(parentStream);
KTable<Long, ArrayList<Child2>> child2Table = convertChild2(parentStream);
Not sure what the convertChild1 and convertChild2 methods do, but shouldn't they be given child1Stream and child2Stream as arguments, respectively?
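If that is the intent, the fix would presumably be:
KTable<Long, ArrayList<Child1>> child1Table = convertChild1(child1Stream);
KTable<Long, ArrayList<Child2>> child2Table = convertChild2(child2Stream);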
