Utilize a single processor to process data from multiple sources of different Key and Value "Serdes" - apache-kafka-streams

Is it possible to utilize a single processor to process data from multiple sources of different Key and Value "Serdes"?
Below is my topology
topology.addSource("MarketData", Serdes.String().deserializer(), marketDataSerde.deserializer(),"market.data")
.addSource("EventData", Serdes.String().deserializer(), eventDataSerde.deserializer(),"event.data")
.addProcessor("StrategyTwo", new StrategyTwoProcessorSupplier(), "MarketData", "EventData")
.addSink("StrategyTwoSignal", "signal.data", Serdes.String().serializer(), signalSerde.serializer(),"StrategyTwo");
Below is the process method from the processor.
public void process(Record<String, MarketData> record) {
    MarketData marketData = record.value();
}
Is it possible to have a generic record in the process method that can be processed differently depending on the type of record?
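For illustration only (a sketch, not an answer from this thread): one way to express "a generic record" is to declare the processor over Object values and dispatch on the runtime type. The StrategyTwoProcessor and Signal names below are assumptions; only MarketData, EventData, and signalSerde come from the question.

// Hypothetical sketch: a single processor declared over Object values, dispatching on the record type.
public class StrategyTwoProcessor implements Processor<String, Object, String, Signal> {

    private ProcessorContext<String, Signal> context;

    @Override
    public void init(final ProcessorContext<String, Signal> context) {
        this.context = context;
    }

    @Override
    public void process(final Record<String, Object> record) {
        final Object value = record.value();
        if (value instanceof MarketData) {
            // handle market data
        } else if (value instanceof EventData) {
            // handle event data
        }
        // context.forward(...) once a Signal has been produced
    }
}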
In the event that the above solution is not feasible, is it possible to have multiple sources and processors without intermediate topics as a result? Example:
topology.addSource("MarketData", Serdes.String().deserializer(), marketDataSerde.deserializer(),"market.data")
.addProcessor("StrategyTwoMarketData", new StrategyTwoMarketDataProcessorSupplier(), "MarketData")
.addSource("EventData", Serdes.String().deserializer(), eventDataSerde.deserializer(),"event.data")
.addProcessor("StrategyTwoEventData", new StrategyTwoEventDataProcessorSupplier(), "EventData")
.addProcessor("StrategyTwo", new StrategyTwoProcessorSupplier(), "EventData")
.addSink("StrategyTwoSignal", "signal.data", Serdes.String().serializer(), signalSerde.serializer(),"StrategyTwo");

Related

Spring @StreamListener process(KStream<?,?> stream) Partition

I have a topic with multiple partitions. In my stream processor I just want to consume from one partition, and I could not figure out how to configure this:
spring.cloud.stream.kafka.streams.bindings.input.consumer.application-id=s-processor
spring.cloud.stream.bindings.input.destination=uinput
spring.cloud.stream.bindings.input.group=r-processor
spring.cloud.stream.bindings.input.contentType=application/java-serialized-object
spring.cloud.stream.bindings.input.consumer.header-mode=raw
spring.cloud.stream.bindings.input.consumer.use-native-decoding=true
spring.cloud.stream.bindings.input.consumer.partitioned=true
@StreamListener(target = "input")
// @SendTo(value = { "uoutput" })
public void process(KStream<UUID, AModel> ustream) {
I want only one partition's data to be processed by this processor; there will be other processors for the other partition(s).
So far my finding is that it has something to do with https://kafka.apache.org/20/javadoc/org/apache/kafka/streams/StreamsConfig.html#PARTITION_GROUPER_CLASS_CONFIG, but I could not find how to set this property in the Spring application.properties.
I think the partition grouper is there to group partitions with tasks within a single processor. If you want to ensure that only a single partition is processed by a processor, then you need to provide at least as many processor instances as there are topic partitions. For example, if your topic has 4 partitions, then you need 4 instances of the streams application to ensure that each instance processes only a single partition.
Kafka Streams does not allow you to read only a single partition. If you subscribe to a topic, all partitions are consumed and distributed over the available instances. Thus, you can't know in advance which partition is assigned to which instance, and all instances execute the same code.
But each partition linked to the processor has a different kind of data, hence it requires a different processor application.
For this case, the processor (or transformer) must be able to process data for all partitions. Kafka Streams exposes the partition number via the ProcessorContext object that is handed to a processor via the init() method: https://kafka.apache.org/20/javadoc/org/apache/kafka/streams/kstream/Transformer.html#init-org.apache.kafka.streams.processor.ProcessorContext-
Thus, you need to "branch" with within your transformer to apply different processing logic based on the partition:
ustream.transform(() -> new MyTransformer());

class MyTransformer implements Transformer<K, V, R> {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context; // keep the context to query the partition later
    }

    // other methods omitted

    @Override
    public R transform(K key, V value) {
        switch (context.partition()) { // partition of the record currently being processed
            case 0:
                // your processing logic
                break;
            case 1:
                // your processing logic
                break;
            // ...
        }
        return null; // placeholder: return the transformed record
    }
}

sending input from single spout to multiple bolts with Fields grouping in Apache Storm

builder.setSpout("spout", new TweetSpout());
builder.setBolt("bolt", new TweetCounter(), 2).fieldsGrouping("spout",
new Fields("field1"));
I have an input field "field1" added in the fields grouping. By definition of fields grouping, all tweets with the same "field1" should go to a single task of TweetCounter. The number of executors set for the TweetCounter bolt is 2.
However, if "field1" is the same in all the tuples of the incoming stream, does this mean that even though I specified 2 executors for TweetCounter, the stream would only be sent to one of them and the other instance would remain idle?
To go further with my particular use case, how can I use a single spout and send data to different bolts based on a particular value of an input field (field1)?
It seems one way to solve this problem is to use direct grouping, where the source decides which task of the consumer will receive the tuple:
This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
You can see an example of its use here:
collector.emitDirect(getWordCountIndex(word),new Values(word));
where getWordCountIndex returns the index of the component where this tuple will be processed.
An alternative to using emitDirect as described in this answer is to implement your own stream grouping. The complexity is about the same, but it allows you to reuse grouping logic across multiple bolts.
For example, the shuffle grouping in Storm is implemented as a CustomStreamGrouping as follows:
public class ShuffleGrouping implements CustomStreamGrouping, Serializable {
    private ArrayList<List<Integer>> choices;
    private AtomicInteger current;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) {
        choices = new ArrayList<List<Integer>>(targetTasks.size());
        for (Integer i : targetTasks) {
            choices.add(Arrays.asList(i));
        }
        current = new AtomicInteger(0);
        Collections.shuffle(choices, new Random());
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        int rightNow;
        int size = choices.size();
        while (true) {
            rightNow = current.incrementAndGet();
            if (rightNow < size) {
                return choices.get(rightNow);
            } else if (rightNow == size) {
                current.set(0);
                return choices.get(0);
            }
            // race condition with another thread, and we lost. try again
        }
    }
}
Storm will call prepare to tell you the task ids your grouping is responsible for, as well as some context on the topology. When Storm emits a tuple from a bolt/spout where you're using this grouping, Storm will call chooseTasks which lets you define which tasks the tuple should go to. You would then use the grouping when building your topology as shown:
TopologyBuilder tp = new TopologyBuilder();
tp.setSpout("spout", new MySpout(), 1);
tp.setBolt("bolt", new MyBolt())
.customGrouping("spout", new ShuffleGrouping());
Be aware that groupings need to be Serializable and thread safe.
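If you want the routing itself to depend on a field's value (as in the question), a custom grouping along these lines is one possibility. This is a sketch, not code from the answer above; it assumes "field1" is the first value in the tuple and needs the same imports as the ShuffleGrouping example.

public class FieldValueGrouping implements CustomStreamGrouping, Serializable {
    private List<Integer> targetTasks;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) {
        this.targetTasks = targetTasks;
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        // Route on the grouped field's value (assumed to be the first value in the tuple),
        // so tuples with the same field1 always land on the same task.
        Object field1 = values.get(0);
        int index = Math.floorMod(field1 == null ? 0 : field1.hashCode(), targetTasks.size());
        return Collections.singletonList(targetTasks.get(index));
    }
}

You would attach it with .customGrouping("spout", new FieldValueGrouping()), exactly like the ShuffleGrouping above.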

Spring Batch - Loop reader, processor and writer for N times

In Spring Batch, how do I loop the reader, processor and writer N times?
My requirement is:
I have N customers/clients.
For each customer/client, I need to fetch the records from the database (reader), process all the records for that customer/client (processor), and then write the records into a file (writer).
How do I loop the Spring Batch job N times?
AFAIK I'm afraid there's no framework support for this scenario. At least not the way you want to solve it.
I'd suggest to solve the problem differently:
Option 1
Read/process/write all records from all customers at once. You can only do this if they are all in the same DB. I would not recommend it otherwise, because you'll have to configure JTA/XA transactions and it's not worth the trouble.
Option 2
Run your job once for each client (best option in my opinion). Save the necessary info for each client in different properties files (DB connection data, values to filter records by client, whatever other client-specific data you may need) and pass a parameter to the job telling it which client to use. This way you can control which client is processed, and when, using shell scripts and/or cron. If you use Spring Boot + Spring Batch, you can store the client configuration in profiles (application-clientX.properties) and run the process like:
$> java -Dspring.profiles.active="clientX" \
-jar "yourBatch-1.0.0-SNAPSHOT.jar" \
-next
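For example, a per-client profile file could carry whatever is specific to that client; the keys below are purely illustrative, not part of the answer:
# application-clientX.properties (hypothetical contents)
spring.datasource.url=jdbc:postgresql://db-host/clientx
batch.client.filter=CLIENT_X
batch.output.file=/data/out/clientx.csv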
Bonus - Option 3
If none of the above fits your needs, or you insist on solving the problem the way you presented it, then you can dynamically configure the job depending on parameters, creating one step for each client using Java config:
@Bean
public Job job() {
    JobBuilder jb = jobBuilders.get("job");
    for (Client c : clientsToProcess) {
        jb.flow(buildStepByClient(c));
    }
    return jb.build();
}
Again, I strongly advise you not to go this way: it's ugly, against the framework's philosophy, hard to maintain and debug, and you'll probably also have to use JTA/XA here, ...
I hope I've been of some help!
Local Partitioning will solve your problem.
In your partitioner, you will put all of your client IDs in a map as shown below (just pseudo code):
public class PartitionByClient implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> result = new HashMap<>();
        int partitionNumber = 1;
        for (String client : allClients) {
            ExecutionContext value = new ExecutionContext();
            value.putString("client", client);
            result.put("Client [" + client + "] : THREAD " + partitionNumber, value);
            partitionNumber++;
        }
        return result;
    }
}
This is just pseudo code. You have to look at the detailed documentation of partitioning.
You will have to mark your reader, processor and writer with @StepScope (i.e. whichever part needs the value of your client). The reader will use this client in the WHERE clause of its SQL. You will use @Value("#{stepExecutionContext[client]}") String client in the reader (etc.) definition to inject this value.
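As a rough illustration (not from the answer above), a step-scoped reader could look like the sketch below; the JdbcCursorItemReader type, the SQL, and the ClientRecord class are assumptions:

@Bean
@StepScope
public JdbcCursorItemReader<ClientRecord> reader(DataSource dataSource,
        @Value("#{stepExecutionContext[client]}") String client) {
    JdbcCursorItemReader<ClientRecord> reader = new JdbcCursorItemReader<>();
    reader.setDataSource(dataSource);
    // the injected client id restricts the query to a single client's records
    reader.setSql("SELECT * FROM records WHERE client_id = ?");
    reader.setPreparedStatementSetter(ps -> ps.setString(1, client));
    reader.setRowMapper(new BeanPropertyRowMapper<>(ClientRecord.class));
    return reader;
}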
Now the final piece: you will need a task executor, and a number of clients equal to concurrencyLimit will run in parallel, provided you set this task executor in your master partitioner step configuration.
@Bean
public TaskExecutor taskExecutor() {
    SimpleAsyncTaskExecutor simpleTaskExecutor = new SimpleAsyncTaskExecutor();
    simpleTaskExecutor.setConcurrencyLimit(concurrencyLimit);
    return simpleTaskExecutor;
}
concurrencyLimit will be 1 if you wish only one client running at a time.
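Wiring it together, a minimal sketch of the master step (assuming the StepBuilderFactory-style configuration and a slaveStep bean that runs the step-scoped reader/processor/writer; the bean names are illustrative):

@Bean
public Step masterStep(StepBuilderFactory stepBuilderFactory, Step slaveStep, TaskExecutor taskExecutor) {
    return stepBuilderFactory.get("masterStep")
            .partitioner("slaveStep", new PartitionByClient())
            .step(slaveStep)
            .taskExecutor(taskExecutor)
            .build();
}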

StateMap keys across different instances of the same processor

Nifi 1.2.0.
In a custom processor, an LSN is used to fetch data from a SQL Server db table.
Following are the snippets of the code used for:
Storing a key-value pair
final StateManager stateManager = context.getStateManager();
try {
    StateMap stateMap = stateManager.getState(Scope.CLUSTER);
    final Map<String, String> newStateMapProperties = new HashMap<>();
    String lsnUsedDuringLastLoadStr = Base64.getEncoder().encodeToString(lsnUsedDuringLastLoad);
    // Just a constant String used as key
    newStateMapProperties.put(ProcessorConstants.LAST_MAX_LSN, lsnUsedDuringLastLoadStr);
    if (stateMap.getVersion() == -1) {
        stateManager.setState(newStateMapProperties, Scope.CLUSTER);
    } else {
        stateManager.replace(stateMap, newStateMapProperties, Scope.CLUSTER);
    }
} catch (IOException e) {
    // error handling omitted in the original snippet
}
Retrieving the key-value pair
final StateManager stateManager = context.getStateManager();
final StateMap stateMap;
final Map<String, String> stateMapProperties;
byte[] lastMaxLSN = null;
try {
    stateMap = stateManager.getState(Scope.CLUSTER);
    stateMapProperties = new HashMap<>(stateMap.toMap());
    lastMaxLSN = (stateMapProperties.get(ProcessorConstants.LAST_MAX_LSN) == null
            || stateMapProperties.get(ProcessorConstants.LAST_MAX_LSN).isEmpty()) ? null
            : Base64.getDecoder()
                    .decode(stateMapProperties.get(ProcessorConstants.LAST_MAX_LSN).getBytes());
} catch (IOException e) {
    // error handling omitted in the original snippet
}
When a single instance of this processor is running, the LSN is stored and retrieved properly and the logic of fetching data from SQL Server tables works fine.
As per the NiFi documentation about state management:
Storing and Retrieving State
State is stored using the StateManager's getState, setState, replace, and clear methods. All of these methods require that a Scope be provided. It should be noted that the state that is stored with the Local scope is entirely different than state stored with a Cluster scope. If a Processor stores a value with the key of My Key using the Scope.CLUSTER scope, and then attempts to retrieve the value using the Scope.LOCAL scope, the value retrieved will be null (unless a value was also stored with the same key using the Scope.CLUSTER scope). Each Processor's state is stored in isolation from other Processors' state.
When two instances of this processor are running, only one is able to fetch the data. This has led to the following question:
Is the StateMap a 'global map' which must have unique keys across the instances of the same processor and also the instances of different processors? In simple words, whenever a processor puts a key in the StateMap, should the key be unique across all NiFi processors (and other services, if any, that use the State API)? If yes, can anyone suggest what unique key I should use in my case?
Note: I quickly glanced at the standard MySQL CDC processor class (CaptureChangeMySQL.java) and it has similar logic to store and retrieve the state, so am I overlooking something?
The StateMap for a processor is stored underneath the id of the component, so if you have two instances of the same type of processor (meaning you can see two processors on the canvas) you would have something like:
/components/1111-1111-1111-1111 -> serialized state map
/components/2222-2222-2222-2222 -> serialized state map
Assuming 1111-1111-1111-1111 was the UUID of processor 1 and 2222-2222-2222-2222 was the UUID of processor 2. So the keys in the StateMap don't have to be unique across all instances, because they are scoped per component id.
In a cluster, the component id of each component is the same on all nodes. So if you have a 3 node cluster and processor 1 has id 1111-1111-1111-1111, then there is a processor with that id on each node.
If that processor is scheduled to run on all nodes and stores cluster state, then all three instances of the processor are going to be updating the same StateMap in the clustered state provider (ZooKeeper).

Kafka Streams API: I am joining two KStreams of empmodel

final KStream<String, EmpModel> empModelStream = getMapOperator(empoutStream);
final KStream<String, EmpModel> empModelinput = getMapOperator(inputStream);
// empModelinput.print();
// empModelStream.print();
empModelStream.join(empModelinput, new ValueJoiner<EmpModel, EmpModel, Object>() {
    @Override
    public Object apply(EmpModel paramV1, EmpModel paramV2) {
        System.out.println("Model1 " + paramV1.getKey());
        System.out.println("Model2 " + paramV2.getKey());
        return paramV1;
    }
}, JoinWindows.of("2000L"));
I get the error:
Invalid topology building: KSTREAM-MAP-0000000003 and KSTREAM-MAP-0000000004 are not joinable
If you want to join two KStreams you must ensure that both have the same number of partitions. (cf. "Note" box in http://docs.confluent.io/current/streams/developer-guide.html#joining-streams)
If you use Kafka v0.10.1+, repartitioning will happen automatically (cf. http://docs.confluent.io/current/streams/upgrade-guide.html#auto-repartitioning).
For Kafka v0.10.0.x you have two options:
ensure that the original input topics do have the same number of partitions
or, add a call to .through("my-repartitioning-topic") to one of the KStreams before the join (see the sketch below). You need to create the topic "my-repartitioning-topic" with the right number of partitions (i.e., the same number of partitions as the second KStream's original input topic) before you start your Streams application.
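A rough sketch of that second option, reusing the stream names from the question (the join arguments themselves are elided):

// "my-repartitioning-topic" must be created beforehand with the matching partition count
final KStream<String, EmpModel> repartitioned = empModelinput.through("my-repartitioning-topic");
// then call empModelStream.join(repartitioned, ...) as before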
