Spring Batch: One Reader, two processors and two KafkaItemWriters

My scenario is
                        +---> ItemProcessor#1 ---> KafkaItemWriter#1
                        |
ItemReader ---> item ---+
                        |
                        +---> ItemProcessor#2 ---> KafkaItemWriter#2
The ItemReader reads data from the DB and returns a Java object, 'Product'. The Product object is sent to ItemProcessor1, which returns an AggregatedPrice, and the same Product object is also sent to ItemProcessor2, which returns AggregatedProductDetails. Both processed objects then need to be sent to a KafkaItemWriter, to be pushed to two different topics. I am thinking of doing this in one job execution, because otherwise the same data would be read from the database twice by the reader. Please provide a hint on the best approach to proceed here. I believe my scenario is not a case for CompositeItemProcessor, since the result of one processor does not need to be passed to the other: both processors are independent and return totally different objects, but they take the same input.
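One possible way to wire this up (a sketch only, not the only option): run both processors inside a single delegating ItemProcessor that returns a wrapper holding both results, and pair it with a delegating ItemWriter that routes each part to its own KafkaItemWriter. ProductFanOut, FanOutProcessor and FanOutWriter below are illustrative names, not Spring Batch classes; Product, AggregatedPrice and AggregatedProductDetails are the domain types from the question, and the Spring Batch 4-style ItemWriter#write(List) signature is assumed.

import java.util.List;
import java.util.stream.Collectors;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.kafka.KafkaItemWriter;

class ProductFanOut {
    private final AggregatedPrice price;
    private final AggregatedProductDetails details;

    ProductFanOut(AggregatedPrice price, AggregatedProductDetails details) {
        this.price = price;
        this.details = details;
    }
    AggregatedPrice getPrice() { return price; }
    AggregatedProductDetails getDetails() { return details; }
}

class FanOutProcessor implements ItemProcessor<Product, ProductFanOut> {
    private final ItemProcessor<Product, AggregatedPrice> priceProcessor;
    private final ItemProcessor<Product, AggregatedProductDetails> detailsProcessor;

    FanOutProcessor(ItemProcessor<Product, AggregatedPrice> priceProcessor,
                    ItemProcessor<Product, AggregatedProductDetails> detailsProcessor) {
        this.priceProcessor = priceProcessor;
        this.detailsProcessor = detailsProcessor;
    }

    @Override
    public ProductFanOut process(Product item) throws Exception {
        // both processors independently receive the same input item
        return new ProductFanOut(priceProcessor.process(item), detailsProcessor.process(item));
    }
}

class FanOutWriter implements ItemWriter<ProductFanOut> {
    private final KafkaItemWriter<String, AggregatedPrice> priceWriter;            // topic #1
    private final KafkaItemWriter<String, AggregatedProductDetails> detailsWriter; // topic #2

    FanOutWriter(KafkaItemWriter<String, AggregatedPrice> priceWriter,
                 KafkaItemWriter<String, AggregatedProductDetails> detailsWriter) {
        this.priceWriter = priceWriter;
        this.detailsWriter = detailsWriter;
    }

    @Override
    public void write(List<? extends ProductFanOut> items) throws Exception {
        // split each chunk and hand each half to the writer for its topic
        priceWriter.write(items.stream().map(ProductFanOut::getPrice).collect(Collectors.toList()));
        detailsWriter.write(items.stream().map(ProductFanOut::getDetails).collect(Collectors.toList()));
    }
}

The step stays a normal chunk-oriented step (reader -> FanOutProcessor -> FanOutWriter), so the database is read only once per item.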

Related

Spring Batch Remote Chunking Chunk Response

I have implemented Spring Batch remote chunking with Kafka, including both the manager and worker configuration. I want to send some DTO or object in the ChunkResponse from the worker side to the manager and do some processing once I receive the response. Is there any way to achieve this? I also want to know the count of records processed after each chunk is processed on the worker side, as I have to update the database frequently with the count.
I want to send some DTO or object in the ChunkResponse from the worker side to the manager and do some processing once I receive the response. Is there any way to achieve this?
I'm not sure the remote chunking feature was designed to send items from the manager to workers and back again. The ChunkResponse is what the manager expects from workers, and I see no way you can send processed items in it (except perhaps by serializing the item in the ChunkResponse#message field, or storing it in the execution context, neither of which is a good idea).
I want to know the count of records processed after each chunk is processed on the worker side, as I have to update the database frequently with the count.
The StepContribution is what you are looking for here. It holds all the counts (read count, write count, etc.). You can get the step contribution from the ChunkResponse on the manager side and do whatever is required with the result.
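As a rough sketch of that idea: assuming the replies come back to the manager on a channel bean named replies (the bean name and the logging are assumptions; ChunkResponse#getStepContribution() and the StepContribution counters are real API), a ChannelInterceptor inside the manager's @Configuration class can observe each ChunkResponse without disturbing the ChunkMessageChannelItemWriter:

@Bean
public QueueChannel replies() {
    QueueChannel repliesChannel = new QueueChannel();
    repliesChannel.addInterceptor(new ChannelInterceptor() {
        @Override
        public Message<?> preSend(Message<?> message, MessageChannel channel) {
            if (message.getPayload() instanceof ChunkResponse) {
                StepContribution contribution =
                        ((ChunkResponse) message.getPayload()).getStepContribution();
                // per-chunk counts from the worker; update the database here instead of logging
                System.out.println("read=" + contribution.getReadCount()
                        + ", written=" + contribution.getWriteCount());
            }
            return message;
        }
    });
    return repliesChannel;
}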

Best practices for transforming a batch of records using KStream

I am new to KStream and would like to know best practices or guidance on how to optimally process batches of n records using KStream. I have working code, shown below, but it only handles single messages at a time.
KStream<String, String> sourceStream = builder.stream("upstream-kafka-topic",
        Consumed.with(Serdes.String(), Serdes.String()));

// transform sourceStream using an implementation of ValueTransformer<String, String>
sourceStream.transformValues(() -> new MyValueTransformer())
        .to("downstream-kafka-topic",
                Produced.with(Serdes.String(), Serdes.String()));
The above code works on single records, because MyValueTransformer (which implements ValueTransformer) transforms a single String value. How do I make the above code work for a collection of String values?
You would need to somehow "buffer / aggregate" the messages. For example, you could add a state store to your transformer and store N messages inside the store. As long as the store contains fewer than N messages you don't do any processing and also don't emit any output (you might want to use flatTransformValues which allows you to emit zero results).
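A minimal sketch of that idea, assuming a batch size of 3 and a state store named "buffer-store" (both illustrative), and keeping the in-memory counter simple rather than fault-tolerant:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class BufferingTransformer implements ValueTransformer<String, Iterable<String>> {

    private static final int BATCH_SIZE = 3;   // assumed batch size N
    private KeyValueStore<Long, String> store;
    private long buffered = 0;                 // simplification: not restored after a restart

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        store = (KeyValueStore<Long, String>) context.getStateStore("buffer-store");
    }

    @Override
    public Iterable<String> transform(String value) {
        store.put(buffered++, value);            // buffer the incoming record
        if (buffered < BATCH_SIZE) {
            return Collections.emptyList();      // batch not full yet: emit nothing
        }
        List<String> batch = new ArrayList<>();  // batch full: drain the store
        try (KeyValueIterator<Long, String> iterator = store.all()) {
            iterator.forEachRemaining(entry -> batch.add(entry.value));
        }
        for (long i = 0; i < BATCH_SIZE; i++) {
            store.delete(i);
        }
        buffered = 0;
        // transform the whole batch here, then emit the results downstream
        return batch;
    }

    @Override
    public void close() { }
}

Register the store with builder.addStateStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore("buffer-store"), Serdes.Long(), Serdes.String())) and replace transformValues in the topology above with flatTransformValues(BufferingTransformer::new, "buffer-store").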
I'm not sure what you're trying to achieve. Kafka Streams is, by design, meant to process one record at a time. If you want to process a collection or batch of messages, you have a few options.
You might not actually need Kafka Streams, as the example you mentioned doesn't do much with the message; in this case you can leverage a normal consumer, which will enable you to process in batches. Check the Spring Kafka implementation of this here -> https://docs.spring.io/spring-kafka/docs/current/reference/html/#receiving-messages (Kafka fetches batches at the network layer, but normally you would process one record at a time; processing batches is possible with a standard client). A sketch of this option follows at the end of this answer.
Alternatively, you might model your value object to contain an array of messages, so that each record you receive is an object with an embedded collection, which you could then process with Kafka Streams; check the array type for Avro -> https://avro.apache.org/docs/current/spec.html#Arrays
Check this part of the documentation to better understand the Kafka Streams concepts -> https://kafka.apache.org/31/documentation/streams/core-concepts
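A minimal sketch of the plain-consumer option using Spring Kafka's batch listener; the factory bean name batchFactory, the group id and the topic name are assumptions, and the methods would live in Spring-managed beans:

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> batchFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    factory.setBatchListener(true); // hand the listener everything returned by one poll()
    return factory;
}

@KafkaListener(topics = "upstream-kafka-topic", groupId = "batch-group",
        containerFactory = "batchFactory")
public void onBatch(List<String> values) {
    // process the whole collection of String values in one go
}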

Access ProcessorContext::forward from multiple user threads

Given: a DSL topology with KStream::transform. As part of the Transformer::transform execution, multiple messages are generated from the single input message (it could be thousands of output messages from one input message).
The new messages are generated based on data retrieved from the database. To speed up the process I would like to create multiple user threads that access the data in the DB in parallel. Upon generating a new message, each thread would call ProcessorContext::forward to send the message downstream.
Is it safe to call ProcessorContext::forward from different threads?
It is not safe and not allowed to call ProcessorContext#forward() from a different thread. If you try it, an exception will be thrown.
As a workaround, you could let all threads "buffer" their result data, and collect all data in the next call to process(). As an alternative, you could also schedule a punctuation that collects and forwards the data from the different threads.
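A minimal sketch of the punctuation-based alternative, where worker threads only enqueue results and the stream thread alone calls forward(); the thread pool size, the 100 ms interval and lookupInDb() are illustrative assumptions:

import java.time.Duration;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;

public class FanOutTransformer implements Transformer<String, String, KeyValue<String, String>> {

    private final Queue<KeyValue<String, String>> pending = new ConcurrentLinkedQueue<>();
    private final ExecutorService workerPool = Executors.newFixedThreadPool(4);
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        // Only the stream thread may call forward(); this wall-clock punctuation
        // drains whatever the worker threads have buffered so far.
        context.schedule(Duration.ofMillis(100), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            KeyValue<String, String> result;
            while ((result = pending.poll()) != null) {
                this.context.forward(result.key, result.value);
            }
        });
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        // fan out the DB lookups to worker threads; they only enqueue, never forward
        workerPool.submit(() -> pending.add(KeyValue.pair(key, lookupInDb(value))));
        return null; // nothing is forwarded directly from transform()
    }

    private String lookupInDb(String value) {
        return value; // placeholder for the real database access
    }

    @Override
    public void close() {
        workerPool.shutdown();
    }
}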

Clustering the Batch Job & distributing the data load

I have a batch processing project that I want to cluster across 5 machines.
Suppose the input source is a database containing 1000 records.
I want to split these records equally, i.e. 200 records per instance of the batch job.
How can we distribute the workload?
Given below is a workflow that you may want to follow.
Assumptions:
You have the necessary domain objects corresponding to the DB table.
You have a batch flow configured with a reader/writer/tasklet mechanism.
You have a messaging system (message queues are a great way to make distributed applications talk to each other).
The input object is a message on the queue that contains the set of input records, split as per the required size.
The result object is a message on the queue that contains the processed records or the result value (if scalar).
The chunkSize is configured in a property file; here it is 200.
Design:
In the application,
Configure a queueReader to read from a queue
Configure a queueWriter to write to a queue
If using the task/tasklet mechanism, configure different queues to carry the input/result objects.
Configure a DB reader which reads from a DB
Logic in the DBReader
Read records from the DB one by one, maintaining a count of the records read. If (count % chunkSize == 0), write all the buffered records to an inputMessage object and send that object to the queue (a sketch of this loop appears at the end of this Design section).
Logic in queueReader
Read the messages one by one
For each message, do the necessary processing.
Create a resultObject
Logic in the queueWriter
Read the resultObject (usually batch frameworks provide a way to ensure that writers are able to read the output from readers).
If any applicable processing or downstream interaction is needed, add it here.
Write the result object to the outputQueue.
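A rough sketch of the DBReader loop described above, assuming plain JDBC plus hypothetical InputRecord, InputMessage, queueTemplate and mapRow helpers for the messaging side (chunkSize being the 200 from the property file):

List<InputRecord> buffer = new ArrayList<>();
int count = 0;
try (Connection connection = dataSource.getConnection();
     Statement statement = connection.createStatement();
     ResultSet rs = statement.executeQuery("SELECT * FROM input_table")) {
    while (rs.next()) {
        buffer.add(mapRow(rs));           // map the current row to a domain object
        count++;
        if (count % chunkSize == 0) {     // a full chunk has been read
            queueTemplate.send(new InputMessage(new ArrayList<>(buffer)));
            buffer.clear();
        }
    }
}
if (!buffer.isEmpty()) {                  // flush the final, partial chunk
    queueTemplate.send(new InputMessage(buffer));
}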
Deployment
Package once, deploy multiple instances. For better performance, ensure that the chunkSize is small enough to enable fast processing. The queues are managed by the messaging system (the available systems in the market provide ways to monitor the queues), where you will be able to see the message flow.

Chunk processing in Spring Batch effectively limits you to one step?

I am writing a Spring Batch application to do the following: There is an input table (PostgreSQL DB) to which someone continually adds rows - that is basically work items being added. For each of these rows, I need to fetch more data from another DB, do some processing, and then do an output transaction which can be multiple SQL queries touching multiple tables (this needs to be one transaction for consistency reasons).
Now, the part between the input and output should be modular - it already has 3-4 logically separated things, and in the future there will be more. This flow need not be linear - what processing is done next can depend on the result of the previous step. In short, this is basically like the flow you can set up using steps inside a job.
My main problem is this: Normally a single chunk processing step has both ItemReader and ItemWriter, i.e., input to output in a single step. So, should I include all the processing steps as part of a single ItemProcessor? How would I make a single ItemProcessor a stateful workflow in itself?
The other option is to make each step a Tasklet implementation, and write two tasklets myself to behave as ItemReader and ItemWriter.
Any suggestions?
Found an answer - yes you are effectively limited to a single step. But:
1) For linear workflows, you can "chain" ItemProcessors - that is, create a composite ItemProcessor to which you provide, through applicationContext.xml, all the ItemProcessors that do the actual work. The composite ItemProcessor just runs them one by one (see the sketch below). This is what I'm doing right now.
2) You can always create the internal subflow as a separate Spring Batch workflow and call it through code in an ItemProcessor, similar to the composite ItemProcessor above. I might move to this in the future.
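For option 1), a minimal sketch in Java config (the answer uses applicationContext.xml, but the wiring is the same idea); WorkItem, stepOneProcessor and stepTwoProcessor are illustrative names:

@Bean
public CompositeItemProcessor<WorkItem, WorkItem> compositeProcessor(
        ItemProcessor<WorkItem, WorkItem> stepOneProcessor,
        ItemProcessor<WorkItem, WorkItem> stepTwoProcessor) {
    CompositeItemProcessor<WorkItem, WorkItem> composite = new CompositeItemProcessor<>();
    // the delegates run one by one; each receives the previous delegate's output
    composite.setDelegates(Arrays.asList(stepOneProcessor, stepTwoProcessor));
    return composite;
}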
