Spring @StreamListener process(KStream<?,?> stream) Partition - apache-kafka-streams

I have a topic with multiple partitions. In my stream processor I want to consume from just one partition, and I could not figure out how to configure this:
spring.cloud.stream.kafka.streams.bindings.input.consumer.application-id=s-processor
spring.cloud.stream.bindings.input.destination=uinput
spring.cloud.stream.bindings.input.group=r-processor
spring.cloud.stream.bindings.input.contentType=application/java-serialized-object
spring.cloud.stream.bindings.input.consumer.header-mode=raw
spring.cloud.stream.bindings.input.consumer.use-native-decoding=true
spring.cloud.stream.bindings.input.consumer.partitioned=true
@StreamListener(target = "input")
// @SendTo(value = { "uoutput" })
public void process(KStream<UUID, AModel> ustream) {
I want only one partition's data to be processed by this processor; there will be other processors for the other partition(s).
So far my finding is that this has something to do with https://kafka.apache.org/20/javadoc/org/apache/kafka/streams/StreamsConfig.html#PARTITION_GROUPER_CLASS_CONFIG, but I could not find how to set this property in the Spring application.properties.

I think the partition grouper is meant to group partitions into tasks within a single processor. If you want to ensure that each processor handles only a single partition, you need to run at least as many processor instances as the topic has partitions. For example, if your topic has 4 partitions, you need 4 instances of the stream application so that each instance processes a single partition.
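As a rough illustration of that sizing rule, the plain-Java sketch below spreads partition numbers over instances round-robin. It is a toy model: the class PartitionSpread and its spread method are hypothetical names, and Kafka's actual assignment logic is more involved, so which partition lands on which instance should not be relied upon.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: not Kafka's real assignor, just the counting argument.
class PartitionSpread {
    /** Distributes partitions 0..numPartitions-1 round-robin over numInstances instances. */
    static List<List<Integer>> spread(int numPartitions, int numInstances) {
        List<List<Integer>> assignment = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            assignment.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(p % numInstances).add(p);
        }
        return assignment;
    }
}
```

With 4 partitions and 4 instances every instance ends up with exactly one partition; with only 2 instances each one has to handle two.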

Kafka Streams does not allow reading a single partition. If you subscribe to a topic, all partitions are consumed and distributed over the available instances. Thus, you can't know in advance which partition is assigned to which instance, and all instances execute the same code.
But each partition linked to a processor holds a different kind of data and hence requires a different processor application.
For this case, the processor (or transformer) must be able to process data for all partitions. Kafka Streams exposes the partition number via the ProcessorContext object that is handed to a processor via the init() method: https://kafka.apache.org/20/javadoc/org/apache/kafka/streams/kstream/Transformer.html#init-org.apache.kafka.streams.processor.ProcessorContext-
Thus, you need to "branch" within your transformer to apply different processing logic based on the partition:
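ustream.transform(() -> new MyTransformer());
class MyTransformer implements Transformer<K, V, R> {
    private ProcessorContext context;
    @Override
    public void init(ProcessorContext context) {
        this.context = context; // keep the context so transform() can read the partition
    }
    // other methods omitted
    @Override
    public R transform(K key, V value) {
        switch (context.partition()) {
            case 0:
                // your processing logic
                break;
            case 1:
                // your processing logic
                break;
            // ...
        }
        return null; // placeholder: return the transformed result (null emits nothing)
    }
}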
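If the number of partitions grows, the switch can be replaced by a lookup table. The following is a minimal plain-Java sketch of that idea, not part of Kafka Streams: PartitionDispatcher, on and dispatch are hypothetical names, and the partition argument is assumed to come from context.partition() inside transform().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: routes a value to the logic registered for its partition.
class PartitionDispatcher<V, R> {
    private final Map<Integer, Function<V, R>> handlers = new HashMap<>();
    private final Function<V, R> fallback;

    PartitionDispatcher(Function<V, R> fallback) {
        this.fallback = fallback;
    }

    /** Registers the processing logic for one partition number. */
    PartitionDispatcher<V, R> on(int partition, Function<V, R> handler) {
        handlers.put(partition, handler);
        return this;
    }

    /** Applies the handler for the given partition, or the fallback if none is registered. */
    R dispatch(int partition, V value) {
        return handlers.getOrDefault(partition, fallback).apply(value);
    }
}
```

Inside transform() this would reduce to something like dispatcher.dispatch(context.partition(), value), with the per-partition handlers registered once in init().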

Related

Utilize a single processor to process data from multiple sources of different Key and Value "Serdes"

Is it possible to utilize a single processor to process data from multiple sources of different Key and Value "Serdes"?
Below is my topology
topology.addSource("MarketData", Serdes.String().deserializer(), marketDataSerde.deserializer(),"market.data")
.addSource("EventData", Serdes.String().deserializer(), eventDataSerde.deserializer(),"event.data")
.addProcessor("StrategyTwo", new StrategyTwoProcessorSupplier(), "MarketData", "EventData")
.addSink("StrategyTwoSignal", "signal.data", Serdes.String().serializer(), signalSerde.serializer(),"StrategyTwo");
Below is the process method from the processor.
public void process(Record<String, MarketData> record) {
MarketData marketData = record.value();
}
Is it possible to have a generic record in the process method that can be processed differently depending on the type of record?
If the above solution is not feasible, is it possible to have multiple sources and processors without creating intermediate topics? Example:
topology.addSource("MarketData", Serdes.String().deserializer(), marketDataSerde.deserializer(),"market.data")
.addProcessor("StrategyTwoMarketData", new StrategyTwoMarketDataProcessorSupplier(), "MarketData")
.addSource("EventData", Serdes.String().deserializer(), eventDataSerde.deserializer(),"event.data")
.addProcessor("StrategyTwoEventData", new StrategyTwoEventDataProcessorSupplier(), "EventData")
.addProcessor("StrategyTwo", new StrategyTwoProcessorSupplier(), "EventData")
.addSink("StrategyTwoSignal", "signal.data", Serdes.String().serializer(), signalSerde.serializer(),"StrategyTwo");

sending input from single spout to multiple bolts with Fields grouping in Apache Storm

builder.setSpout("spout", new TweetSpout());
builder.setBolt("bolt", new TweetCounter(), 2).fieldsGrouping("spout",
new Fields("field1"));
I have an input field "field1" added in fields grouping. By definition of fields grouping, all tweets with the same "field1" should go to a single task of TweetCounter. The number of executors set for the TweetCounter bolt is 2.
However, if "field1" is the same in all the tuples of incoming stream, does this mean that even though I specified 2 executors for TweetCounter, the stream would only be sent to one of them and the other instance remains empty?
To go further with my particular use case, how can I use a single spout and send data to different bolts based on a particular value of an input field (field1)?
It seems one way to solve this problem is to use direct grouping, where the source decides which component will receive the tuple:
This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods of OutputCollector. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
You can see an example of its use here:
collector.emitDirect(getWordCountIndex(word),new Values(word));
where getWordCountIndex returns the index of the component where this tuple will be processed.
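The body of getWordCountIndex is not shown. One plausible shape, sketched here in plain Java under the assumption that the tuple should be routed by the value of a field, is a hash of that value reduced to the number of consumer tasks. DirectRouting and taskIndexFor are hypothetical names; the returned index would still have to be mapped to a real task id obtained from the TopologyContext.

```java
// Hypothetical stand-in for getWordCountIndex; the real helper is not shown in the answer.
class DirectRouting {
    /** Maps a field value to an index in [0, numTasks). */
    static int taskIndexFor(String fieldValue, int numTasks) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```

Because the index depends only on the field value, identical values always map to the same index. That is also why, with a hash-based scheme, a stream in which every tuple carries the same "field1" would feed only one of the two TweetCounter tasks.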
An alternative to using emitDirect as described in this answer is to implement your own stream grouping. The complexity is about the same, but it allows you to reuse grouping logic across multiple bolts.
For example, the shuffle grouping in Storm is implemented as a CustomStreamGrouping as follows:
public class ShuffleGrouping implements CustomStreamGrouping, Serializable {
private ArrayList<List<Integer>> choices;
private AtomicInteger current;
@Override
public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) {
choices = new ArrayList<List<Integer>>(targetTasks.size());
for (Integer i : targetTasks) {
choices.add(Arrays.asList(i));
}
current = new AtomicInteger(0);
Collections.shuffle(choices, new Random());
}
@Override
public List<Integer> chooseTasks(int taskId, List<Object> values) {
int rightNow;
int size = choices.size();
while (true) {
rightNow = current.incrementAndGet();
if (rightNow < size) {
return choices.get(rightNow);
} else if (rightNow == size) {
current.set(0);
return choices.get(0);
}
} // race condition with another thread, and we lost. try again
}
}
Storm will call prepare to tell you the task ids your grouping is responsible for, as well as some context on the topology. When Storm emits a tuple from a bolt/spout where you're using this grouping, Storm will call chooseTasks which lets you define which tasks the tuple should go to. You would then use the grouping when building your topology as shown:
TopologyBuilder tp = new TopologyBuilder();
tp.setSpout("spout", new MySpout(), 1);
tp.setBolt("bolt", new MyBolt())
.customGrouping("spout", new ShuffleGrouping());
Be aware that groupings need to be Serializable and thread safe.
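To make the thread-safety requirement concrete, here is a simplified sketch of the task-selection part on its own, independent of Storm. RoundRobinChooser is a hypothetical class, not Storm code; it only shows one way to advance a shared counter atomically so that concurrent callers never read an out-of-range index.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified illustration: picks target task ids in round-robin order.
class RoundRobinChooser implements Serializable {
    private static final long serialVersionUID = 1L;

    private final ArrayList<Integer> targetTasks;
    private final AtomicInteger next = new AtomicInteger(0);

    RoundRobinChooser(List<Integer> targetTasks) {
        this.targetTasks = new ArrayList<>(targetTasks); // defensive copy
    }

    int choose() {
        // getAndUpdate performs the wrap-around as one atomic step,
        // so no explicit retry loop is needed.
        int index = next.getAndUpdate(i -> (i + 1) % targetTasks.size());
        return targetTasks.get(index);
    }
}
```

In a real CustomStreamGrouping, the list would be filled in prepare and choose would back chooseTasks, wrapped in a single-element list.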

Kafka: Efficiently join windowed aggregates to events

I'm prototyping a fraud application. We'll frequently have metrics like "total amount of cash transactions in the last 5 days" that we need to compare against some threshold to determine if we raise an alert.
We're looking to use Kafka Streams to create and maintain the aggregates and then create an enhanced version of the incoming transaction that has the original transaction fields plus the aggregates. This enhanced record gets processed by a downstream rules system.
I'm wondering the best way to approach this. I've prototyped creating the aggregates with code like this:
TimeWindows twoDayHopping = TimeWindows.of(TimeUnit.DAYS.toMillis(2))
    .advanceBy(TimeUnit.DAYS.toMillis(1));
KStream<String, AdditiveStatistics> aggrStream = transactions
    .filter((key,value)->{
        return value.getAccountTypeDesc().equals("P") &&
            value.getPrimaryMediumDesc().equals("CASH");
    })
    .groupByKey()
    .aggregate(AdditiveStatistics::new,
        (key,value,accumulator)-> {
            return AdditiveStatsUtil
                .advance(value.getCurrencyAmount(),accumulator);
        },
        twoDayHopping,
        metricsSerde,
        "sas10005_store")
    .toStream()
    .map((key,value)-> {
        value.setTransDate(key.window().start());
        return new KeyValue<String, AdditiveStatistics>(key.key(),value);
    })
    .through(Serdes.String(),metricsSerde,datedAggrTopic);
This creates a store-backed stream that has one record per key per window. I then join the original transactions stream to this windowed stream to produce the final output to a topic:
JoinWindows joinWindow = JoinWindows.of(TimeUnit.DAYS.toMillis(1))
.before(TimeUnit.DAYS.toMillis(1))
.after(-1)
.until(TimeUnit.DAYS.toMillis(2)+1);
KStream<String,Transactions10KEnhanced> enhancedTrans = transactions.join(aggrStream,
(left,right)->{
Transactions10KEnhanced out = new Transactions10KEnhanced();
out.setAccountNumber(left.getAccountNumber());
out.setAccountTypeDesc(left.getAccountTypeDesc());
out.setPartyNumber(left.getPartyNumber());
out.setPrimaryMediumDesc(left.getPrimaryMediumDesc());
out.setSecondaryMediumDesc(left.getSecondaryMediumDesc());
out.setTransactionKey(left.getTransactionKey());
out.setCurrencyAmount(left.getCurrencyAmount());
out.setTransDate(left.getTransDate());
if(right != null) {
out.setSum2d(right.getSum());
}
return out;
},
joinWindow);
This produces the correct results, but it seems to run for quite a while, even with a low number of records. I'm wondering if there's a more efficient way to achieve the same result.
It's a config issue; cf. http://docs.confluent.io/current/streams/developer-guide.html#memory-management
Disabling caching by setting the cache size to zero (parameter cache.max.bytes.buffering in StreamsConfig) will resolve the "delayed" delivery to the output topic.
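A minimal sketch of that setting, using plain java.util.Properties so it stays self-contained. The helper name StreamsCacheConfig.withCachingDisabled is hypothetical and the application id and broker address passed in are placeholders; in a real application the resulting properties would be handed to the Kafka Streams constructor, and the parameter name should be checked against the StreamsConfig documentation for your Kafka version.

```java
import java.util.Properties;

// Hypothetical helper that assembles Kafka Streams properties with caching switched off.
class StreamsCacheConfig {
    static Properties withCachingDisabled(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);
        // A cache size of 0 disables record caching, so aggregate updates
        // are no longer held back in the cache before being forwarded.
        props.put("cache.max.bytes.buffering", 0);
        return props;
    }
}
```

For example, StreamsCacheConfig.withCachingDisabled("fraud-prototype", "localhost:9092") would produce the configuration for a local test run (both values are made up for illustration).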
You might also read this blog post for some background information about Streams design: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
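As background on the window definition in the question: a two-day hopping window that advances by one day places most records into two overlapping windows, so the aggregate stream carries up to two records per key for any given transaction time. The plain-Java sketch below illustrates that assignment under the assumption that windows are aligned to the epoch; HoppingWindows and windowStartsFor are hypothetical names, not Kafka Streams API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of hopping-window assignment; not Kafka Streams code.
class HoppingWindows {
    /** Returns the start times of all windows [start, start + sizeMs) that contain timestamp. */
    static List<Long> windowStartsFor(long timestamp, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        // Earliest epoch-aligned window start that can still contain the timestamp.
        long start = Math.max(0, timestamp - sizeMs + advanceMs) / advanceMs * advanceMs;
        while (start <= timestamp) {
            starts.add(start);
            start += advanceMs;
        }
        return starts;
    }
}
```

A record timestamped halfway through day 2 (counting from 0) falls into the windows starting at day 1 and day 2.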

Kafka Streams API: I am joining two KStreams of empmodel

final KStream<String, EmpModel> empModelStream = getMapOperator(empoutStream);
final KStream<String, EmpModel> empModelinput = getMapOperator(inputStream);
// empModelinput.print();
// empModelStream.print();
empModelStream.join(empModelinput, new ValueJoiner<EmpModel, EmpModel, Object>() {
@Override
public Object apply(EmpModel paramV1, EmpModel paramV2) {
System.out.println("Model1 "+paramV1.getKey());
System.out.println("Model2 "+paramV2.getKey());
return paramV1;
}
},JoinWindows.of("2000L"));
I get error:
Invalid topology building: KSTREAM-MAP-0000000003 and KSTREAM-MAP-0000000004 are not joinable
If you want to join two KStreams, you must ensure that both have the same number of partitions (cf. the "Note" box in http://docs.confluent.io/current/streams/developer-guide.html#joining-streams).
If you use Kafka v0.10.1+, repartitioning will happen automatically (cf. http://docs.confluent.io/current/streams/upgrade-guide.html#auto-repartitioning).
For Kafka v0.10.0.x you have two options:
- ensure that the original input topics have the same number of partitions, or
- add a call to .through("my-repartitioning-topic") to one of the KStreams before the join. You need to create the topic "my-repartitioning-topic" with the right number of partitions (i.e., the same number of partitions as the second KStream's original input topic) before you start your Streams application.
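The reason for the equal-partition requirement can be sketched in a few lines of plain Java. The stand-in below assigns a partition as hash(key) modulo the partition count; CoPartitioning and partitionFor are hypothetical names, and String.hashCode() is used only for illustration (Kafka's default partitioner hashes the serialized key with murmur2).

```java
// Stand-in partitioner for illustration; not Kafka's actual hashing.
class CoPartitioning {
    /** Returns the partition a key would land in for a topic with numPartitions partitions. */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

With this stand-in, the key "c" lands in partition 0 of a 3-partition topic but in partition 3 of a 4-partition topic. Since a join task reads the same partition number from both inputs, the two records for that key would never meet unless the partition counts match.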

Support method assignTimestampsAndWatermarks in KeyedStream of Flink DataStream

We are using Flink to process data from a Kafka topic. The data comes from a massive number of IoT devices, and we hash those devices into different partitions based on the hash code of the device ID, so each partition holds event data from many devices. Each event contains "timestamp", "value", "deviceID" and so on. We want to use Flink to stream-process this data, and our rule is to calculate the average value over 20 seconds for each device. But there is one problem: all devices share the same timestamp and watermark generator, and each device has clock skew, which may be very large, so this breaks the rule of strictly ascending event time. If assignTimestampsAndWatermarks were supported on KeyedStream, all of these problems could be resolved.
Sample code:
DataStream<Order> timestampStream = orderStream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator());
KeyedStream<Order, String> keyStream = timestampStream.keyBy(new KeySelector<Order,String>() {
public String getKey(Order value) throws Exception {
return value.getId();
}
});
keyStream.window(TumblingEventTimeWindows.of(Time.seconds(20)));
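To illustrate the skew problem described above, the plain-Java sketch below compares a single shared watermark with a watermark tracked per device, both derived as "highest timestamp seen minus an allowed out-of-orderness". It is not Flink API code: WatermarkSketch and its methods are hypothetical names, and the per-device variant only models the behaviour the question asks for.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration only; this is not Flink code.
class WatermarkSketch {
    private final long maxOutOfOrdernessMs;
    private long globalMaxTimestamp;
    private final Map<String, Long> maxTimestampPerDevice = new HashMap<>();

    WatermarkSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        // Start low enough that subtracting the bound cannot overflow.
        this.globalMaxTimestamp = Long.MIN_VALUE + maxOutOfOrdernessMs;
    }

    /** Records one event's timestamp for the given device. */
    void observe(String deviceId, long timestamp) {
        globalMaxTimestamp = Math.max(globalMaxTimestamp, timestamp);
        maxTimestampPerDevice.merge(deviceId, timestamp, (a, b) -> Math.max(a, b));
    }

    /** Shared watermark: driven by whichever device has the most advanced clock. */
    long sharedWatermark() {
        return globalMaxTimestamp - maxOutOfOrdernessMs;
    }

    /** Hypothetical per-device watermark: follows only that device's own clock. */
    long deviceWatermark(String deviceId) {
        long max = maxTimestampPerDevice.getOrDefault(deviceId, Long.MIN_VALUE + maxOutOfOrdernessMs);
        return max - maxOutOfOrdernessMs;
    }
}
```

If a device with a fast clock reports timestamp 100000 and a slow one reports 50000, with a bound of 1000 ms, the shared watermark is already at 99000, far ahead of the slow device's events, while the slow device's own watermark would sit at 49000.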
