Kafka Streams not triggering output for joined streams? - apache-kafka-streams

I have raw streams from three MySQL tables: one parent and two child tables. I tried to join the three raw streams and transform them into a single output stream. It works if there is an update on the parent stream, but no output is triggered when anything changes on a child stream.
@StreamListener
public KStream<Long, Output> handleStreams(@Input KStream<Long, Parent> parentStream,
        @Input KStream<Long, Child1> child1Stream,
        @Input KStream<Long, Child2> child2Stream) {
    KTable<Long, Parent> parentTable = convertParent(parentStream);
    KTable<Long, ArrayList<Child1>> child1Table = convertChild1(parentStream);
    KTable<Long, ArrayList<Child2>> child2Table = convertChild2(parentStream);
    return parentTable
        .leftJoin(child1Table, (parent, child1List) -> new Output(parent, child1List))
        .leftJoin(child2Table, (output, child2List) -> output.setChild2List(child2List))
        .toStream();
}
Any new add or update on the parent stream is picked up by the processor, joined with the other KTables, and returned on the output stream. But an add or update on child1Stream or child2Stream doesn't trigger any output.
I thought that by making all inputs KTables, they would always store the changes (all of them have the same key), and any update on the parent or child tables would be picked up by the joins. But that is not happening. Can anyone suggest what I am missing here?
I already tried KStream-KStream, KStream-KTable, and KTable-KTable joins; none of them worked for child updates.
-Thanks

Can you show where you have the @EnableBinding annotation and the processor interface that you are binding to?
This doesn't look right to me:
@StreamListener
public KStream<Long, Output> handleStreams(@Input KStream<Long, Parent> parentStream,
        @Input KStream<Long, Child1> child1Stream,
        @Input KStream<Long, Child2> child2Stream) {
You are not specifying a binding on the inputs. You need to have something like this when you have multiple inputs:
@StreamListener
public KStream<Long, Output> handleStreams(@Input("input1") KStream<Long, Parent> parentStream,
        @Input("input2") KStream<Long, Child1> child1Stream,
        @Input("input3") KStream<Long, Child2> child2Stream) {
Each of those inputs needs to be defined in the processor interface. See here for an example: https://github.com/spring-cloud/spring-cloud-stream-samples/blob/master/kafka-streams-samples/kafka-streams-table-join/src/main/java/kafka/streams/table/join/KafkaStreamsTableJoin.java#L46
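For illustration, a matching processor interface might look like this (the interface name is a placeholder, not from the original post):
interface ThreeInputProcessor {

    @Input("input1")
    KStream<Long, Parent> input1();

    @Input("input2")
    KStream<Long, Child1> input2();

    @Input("input3")
    KStream<Long, Child2> input3();

    @Output("output")
    KStream<Long, Output> output();
}
Each binding name then needs its own destination (topic) configured under spring.cloud.stream.bindings.* in the application properties.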

Notice how your child tables are created from the same stream as the parentTable:
KTable<Long, ArrayList<Child1>> child1Table = convertChild1(parentStream);
KTable<Long, ArrayList<Child2>> child2Table = convertChild2(parentStream);
Not sure what the convertChild1 and convertChild2 methods do, but shouldn't they be given child1Stream and child2Stream as arguments, respectively?
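If so, the fix would presumably be just (with the convert methods' parameter types adjusted accordingly):
KTable<Long, ArrayList<Child1>> child1Table = convertChild1(child1Stream);
KTable<Long, ArrayList<Child2>> child2Table = convertChild2(child2Stream);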

Related

KStreams Grouping by multiple fields to get count

So I have a bunch of records in a topic like the one below. I can create the GroupBy in ksqlDB with no problem, as that is more SQL than anything else. But I have been tasked with moving it over to Java Kafka Streams and am failing miserably.
Can someone guide me on the topology for grouping first by user_id, then by object_id, then by day? I don't ask this lightly: I have tried over and over with state stores and many examples, but I am just chasing my tail. Basically, I would like to know how many times a user looked at a specific object on a given day.
Anything on how to accomplish this would be greatly appreciated.
{
    "entrytimestamp": "2020-05-04T15:21:01.897",
    "user_id": "080db36a-f205-4e32-a324-cc375b75d167",
    "object_id": "fdb084f7-5367-4776-a5ae-a10d6e898d22"
}
You can create a composite key and then group by that key, like:
KStream<String, Message> stream = builder.stream(MESSAGES, Consumed.with(Serdes.String(), jsonSerde));
KStream<String, Message> newKeyStream = stream.selectKey((key, message) ->
        String.format("%s-%s-%s",
                message.userId(),
                message.objectId(),
                LocalDate.ofInstant(Instant.ofEpochMilli(message.timestamp()), ZoneId.systemDefault())));
KGroupedStream<String, Message> groupedBy = newKeyStream.groupByKey();
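From there, to get the per-user/object/day count the question asks for, you would typically just count the grouped stream; a sketch (the output topic name is illustrative):
// one counter per composite key, i.e. per user/object/day
KTable<String, Long> counts = groupedBy.count();
counts.toStream().to("user-object-day-counts", Produced.with(Serdes.String(), Serdes.Long()));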

Spring Kafka Stream doesn't get written

I'm writing a Spring Boot (2.1.4) app trying to use Spring Cloud Stream for Kafka.
What I'm trying to do is maintain a list of sensors on one topic ("sensors"), while incoming data arrives on another topic ("data"). When I get data for a sensor I don't already have, I want to add it to the sensor list.
To do that, I create a KTable<String, Sensor> from the sensors topic, map the data topic down to the bare sensor information (in this case, its name), and do an outer join with a ValueJoiner that retains the existing sensor if present and otherwise uses the reading's sensor. Then I write the result back to the sensors topic.
KTable<String, Sensor> sensorTable = ...;
KStream<String, SensorData> sensorDataStream = ...;

// get sensors providing measurements
KTable<String, Sensor> sensorsFromData =
        sensorDataStream.groupByKey()
                .aggregate(
                        Sensor::new,
                        (k, v, s) -> {
                            s.setName(k);
                            return s;
                        },
                        Materialized.with(Serdes.String(), SensorSerde.SERDE));

// join both sensor tables, preferring the existing ones
KTable<String, Sensor> joinedSensorTable =
        sensorTable.outerJoin(
                sensorsFromData,
                // only use sensors from measurements if sensor not already present
                (ex, ft) -> (ex != null) ? ex : ft,
                Materialized.<String, Sensor, KeyValueStore<Bytes, byte[]>>as(SENSORS_TABLE)
                        .withKeySerde(Serdes.String()).withValueSerde(SensorSerde.SERDE));

// write to new topic for downstream services
joinedSensorTable.toStream();
This works fine if I create this using a StreamsBuilder, i.e. if sensorTable and sensorDataStream come from something like builder.table("sensors", Consumed.with(Serdes.String(), SensorSerde.SERDE)).
However, I'm trying to use Spring Cloud Stream binding for this, i.e. the above code is wrapped in:
@Configuration
@EnableBinding(SensorTableBinding.class)
class StreamConfiguration {
    static final String SENSORS_TABLE = "sensors-table";

    @StreamListener
    @SendTo("sensorsOut")
    private KStream<String, Sensor> getDataFromData
            (@Input("sensors") KTable<String, Sensor> sensorTable,
             @Input("data") KStream<String, SensorData> sensorDataStream) {
        // ...
        return joinedSensorTable.toStream();
    }
}
with a
interface SensorTableBinding {
    @Input("sensors")
    KTable<String, Sensor> sensorStream();

    @Output("sensorsOut")
    KStream<String, Sensor> sensorOutput();

    @Input("data")
    KStream<String, SensorData> sensorDataStream();
}
Here is the Spring Cloud Stream section of the application.properties:
spring.cloud.stream.kafka.streams.binder.configuration.default.key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
spring.cloud.stream.kafka.streams.binder.configuration.default.value.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
spring.cloud.stream.kafka.binder.brokers: ${spring.kafka.bootstrap-servers}
spring.cloud.stream.kafka.binder.configuration.auto.offset.reset: latest
spring.cloud.stream.kafka.binder.bindings.sensors.group: sensor-service
spring.cloud.stream.kafka.binder.bindings.sensors.destination: sensors
spring.cloud.stream.kafka.binder.bindings.sensorsOut.destination: sensors
spring.cloud.stream.kafka.binder.bindings.data.group: sensor-service
spring.cloud.stream.kafka.binder.bindings.data.destination: data
The stream gets initialized fine and the join is performed (the key-value store is filled properly); however, the resulting stream is never written to the "sensors" topic.
Why? Am I missing something?
Also: I'm sure there's a better way to de/serialize my objects from/to JSON using an existing Serde, rather than having to declare classes of my own to add to the processing (SensorSerde/SensorDataSerde are thin delegation wrappers around an ObjectMapper)?
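(As an aside on that last question: a sketch of one common option, assuming Spring Kafka's generic JSON Serde fits your needs, would be to replace the hand-written wrappers entirely:
import org.apache.kafka.common.serialization.Serde;
import org.springframework.kafka.support.serializer.JsonSerde;

// generic Jackson-backed Serdes; would stand in for SensorSerde/SensorDataSerde
Serde<Sensor> sensorSerde = new JsonSerde<>(Sensor.class);
Serde<SensorData> sensorDataSerde = new JsonSerde<>(SensorData.class);
)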
Turns out the data was written after all, but to the wrong topic, namely sensorsOut.
The reason was the configuration. Instead of
spring.cloud.stream.kafka.binder.bindings.sensors.destination: sensors
spring.cloud.stream.kafka.binder.bindings.sensorsOut.destination: sensors
the topics are configured with this:
spring.cloud.stream.bindings.sensors.destination: sensors
spring.cloud.stream.bindings.sensorsOut.destination: sensors
For the sensors and data topics, that didn't matter, because the binding's name was the same as the topic; but since Spring couldn't find a proper destination for the output, it used the binding's name sensorsOut and wrote the data there.
As a note, the whole configuration setup around these is very confusing. The individual items are documented, but it's hard to tell for each one which configuration prefix it belongs to. Looking into the source code doesn't help either, because at that level what's passed around are Maps, with the keys stripped of their prefix at runtime, so it's really hard to tell where the data is coming from and what it will contain.
IMO it would really help to have actual @ConfigurationProperties-like data classes passed around, which would make all of this so much easier to understand.

sending input from single spout to multiple bolts with Fields grouping in Apache Storm

builder.setSpout("spout", new TweetSpout());
builder.setBolt("bolt", new TweetCounter(), 2).fieldsGrouping("spout",
new Fields("field1"));
I have an input field "field1" added in the fields grouping. By definition of fields grouping, all tweets with the same "field1" should go to a single task of TweetCounter. The number of executors set for the TweetCounter bolt is 2.
However, if "field1" is the same in all tuples of the incoming stream, does this mean that even though I specified 2 executors for TweetCounter, the stream would only be sent to one of them while the other instance remains idle?
To go further with my particular use case, how can I use a single spout and send data to different bolts based on a particular value of an input field (field1)?
It seems one way to solve this problem is to use direct grouping, where the source decides which component will receive the tuple:
This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods. A bolt can get the task ids of its consumers either by using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
You can see an example of its use here:
collector.emitDirect(getWordCountIndex(word), new Values(word));
where getWordCountIndex returns the id of the task where this tuple will be processed.
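getWordCountIndex is not spelled out in that example; a hypothetical implementation could map each word to one of the consumer's task ids captured in prepare(), e.g.:
// e.g. captured in prepare() via context.getComponentTasks("bolt")
private List<Integer> boltTasks;

private int getWordCountIndex(String word) {
    // stable word-to-task mapping: equal words always go to the same task
    return boltTasks.get(Math.abs(word.hashCode()) % boltTasks.size());
}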
An alternative to using emitDirect as described in this answer is to implement your own stream grouping. The complexity is about the same, but it allows you to reuse grouping logic across multiple bolts.
For example, the shuffle grouping in Storm is implemented as a CustomStreamGrouping as follows:
public class ShuffleGrouping implements CustomStreamGrouping, Serializable {
    private ArrayList<List<Integer>> choices;
    private AtomicInteger current;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) {
        choices = new ArrayList<List<Integer>>(targetTasks.size());
        for (Integer i : targetTasks) {
            choices.add(Arrays.asList(i));
        }
        current = new AtomicInteger(0);
        Collections.shuffle(choices, new Random());
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        int rightNow;
        int size = choices.size();
        while (true) {
            rightNow = current.incrementAndGet();
            if (rightNow < size) {
                return choices.get(rightNow);
            } else if (rightNow == size) {
                current.set(0);
                return choices.get(0);
            }
            // race condition with another thread, and we lost; try again
        }
    }
}
Storm will call prepare to tell you the task ids your grouping is responsible for, as well as some context on the topology. When Storm emits a tuple from a bolt/spout where you're using this grouping, Storm will call chooseTasks which lets you define which tasks the tuple should go to. You would then use the grouping when building your topology as shown:
TopologyBuilder tp = new TopologyBuilder();
tp.setSpout("spout", new MySpout(), 1);
tp.setBolt("bolt", new MyBolt())
        .customGrouping("spout", new ShuffleGrouping());
Be aware that groupings need to be Serializable and thread safe.

How to use a KeyValueStore state store in DSL?

KeyValueStore<String, Long> kvStore = (KeyValueStore<String, Long>)
        Stores.create("InterWindowStore1").withKeys(Serdes.String())
                .withValues(Serdes.Long())
                .persistent()
                .build().get();
I have created a state store as shown in the code above and tried to insert entries with kvStore.put(key, value); but it throws an NPE:
Caused by: java.lang.NullPointerException
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.put(MeteredKeyValueStore.java:117)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:82)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:202)
at org.apache.kafka.streams.kstream.internals.ForwardingCacheFlushListener.apply(ForwardingCacheFlushListener.java:42)
at org.apache.kafka.streams.state.internals.CachingWindowStore.maybeForward(CachingWindowStore.java:103)
at org.apache.kafka.streams.state.internals.CachingWindowStore.access$200(CachingWindowStore.java:34)
at org.apache.kafka.streams.state.internals.CachingWindowStore$1.apply(CachingWindowStore.java:86)
at org.apache.kafka.streams.state.internals.NamedCache.flush(NamedCache.java:131)
at org.apache.kafka.streams.state.internals.NamedCache.flush(NamedCache.java:95)
As you describe in your comments, you are basically doing a windowed aggregation:
KStream stream = ...
KTable table = stream.groupByKey().aggregate(..., TimeWindows.of(...));
As your KTable changelog stream might contain updates for your window aggregation, you want to modify this stream. For this, you can use a stateful transformer or value transformer:
StateStoreSupplier myState = Stores.create("nameOfMyState")....;
KStream result = table.toStream().transform(..., "nameOfMyState");
Finally, you can write your result to the output topic:
result.to("output-topic");
The Transformer that you provide to transform() can get the state store via the given context in init() and use it within transform() each time a window output is generated or updated.
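A minimal sketch of such a Transformer, assuming the store from above was also registered with the topology via builder.addStateStore(myState) (names are illustrative; very old Transformer versions additionally require a punctuate() method):
KStream<String, Long> result = table.toStream().transform(
        () -> new Transformer<Windowed<String>, Long, KeyValue<String, Long>>() {
            private KeyValueStore<String, Long> state;

            @SuppressWarnings("unchecked")
            @Override
            public void init(ProcessorContext context) {
                // the store registered as "nameOfMyState" is handed to us by name
                state = (KeyValueStore<String, Long>) context.getStateStore("nameOfMyState");
            }

            @Override
            public KeyValue<String, Long> transform(Windowed<String> key, Long value) {
                // e.g. remember the latest aggregate per window and forward it
                state.put(key.toString(), value);
                return KeyValue.pair(key.key(), value);
            }

            @Override
            public void close() {}
        },
        "nameOfMyState");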

Kafka Streams API: I am joining two KStreams of empmodel

final KStream<String, EmpModel> empModelStream = getMapOperator(empoutStream);
final KStream<String, EmpModel> empModelinput = getMapOperator(inputStream);
// empModelinput.print();
// empModelStream.print();
empModelStream.join(empModelinput, new ValueJoiner<EmpModel, EmpModel, Object>() {
    @Override
    public Object apply(EmpModel paramV1, EmpModel paramV2) {
        System.out.println("Model1 " + paramV1.getKey());
        System.out.println("Model2 " + paramV2.getKey());
        return paramV1;
    }
}, JoinWindows.of("2000L"));
I get the error:
Invalid topology building: KSTREAM-MAP-0000000003 and KSTREAM-MAP-0000000004 are not joinable
If you want to join two KStreams you must ensure that both have the same number of partitions. (cf. "Note" box in http://docs.confluent.io/current/streams/developer-guide.html#joining-streams)
If you use Kafka v0.10.1+, repartitioning will happen automatically (cf. http://docs.confluent.io/current/streams/upgrade-guide.html#auto-repartitioning).
For Kafka v0.10.0.x you have two options:
ensure that the original input topics do have the same number of partitions
or, add a call to .through("my-repartitioning-topic") to one of the KStreams before the join (see the sketch below). You need to create the topic "my-repartitioning-topic" with the right number of partitions (i.e., the same number of partitions as the second KStream's original input topic) before you start your Streams application.
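A sketch of that second option, reusing the joiner and window from your code:
// route one side through the pre-created repartition topic, then join as before
KStream<String, EmpModel> repartitioned = empModelStream.through("my-repartitioning-topic");
repartitioned.join(empModelinput, valueJoiner, joinWindows); // valueJoiner/joinWindows as in the question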
