I am in the process of implementing a Spring Batch job for our file upload process. The requirement is to read a flat file, apply business logic, store the result in the DB, and then post a Kafka message.
I have a single chunk-based step that uses a custom reader, processor, and writer. The process works fine but takes a long time to process a big file.
It takes 15 minutes to process a file with 60K records, and I need to get that under 5 minutes, as we will be consuming much bigger files than this.
As per https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html, I understand that making the step multi-threaded would give a performance boost, at the cost of restartability. However, I am using FlatFileItemReader, ItemProcessor, and ItemWriter, and none of these implementations is thread-safe.
Any suggestions on how to improve performance here?
Here is the writer code:
@Override
public void write(List<? extends Message> items) {
    items.forEach(this::process);
}

private void process(Message message) {
    if (message == null) {
        return;
    }
    try {
        // message is a DTO that carries info about success or failure;
        // assuming here that it exposes that flag via isSuccess()
        if (message.isSuccess()) {
            // post Kafka message using Spring Cloud Stream
            // insert record in DB using a Spring Data JpaRepository
        } else {
            // insert record in DB using a Spring Data JpaRepository
        }
    } catch (Exception e) {
        // throw exception
    }
}
Best regards,
Preeti
Please refer to the SO threads below and the linked GitHub source code for parallel processing; a sketch of one option follows them:
Spring Batch multiple process for heavy load with multiple thread under every process
Spring batch to process huge data
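On the thread-safety point specifically: Spring Batch ships a SynchronizedItemStreamReader that can wrap the FlatFileItemReader, which is the usual way to run a multi-threaded step safely (still at the cost of restartability, as the question notes). A minimal sketch, assuming the Spring Batch 4.x builder API; the bean wiring, chunk size, and pool size are illustrative, not from the original post:

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.support.SynchronizedItemStreamReader;
import org.springframework.context.annotation.Bean;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Bean
public Step fileStep(StepBuilderFactory steps,
                     FlatFileItemReader<Message> reader,
                     ItemProcessor<Message, Message> processor,
                     ItemWriter<Message> writer) {
    // FlatFileItemReader is not thread-safe, so synchronize access to it
    // before sharing it across the worker threads.
    SynchronizedItemStreamReader<Message> syncReader = new SynchronizedItemStreamReader<>();
    syncReader.setDelegate(reader);

    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(8);
    executor.setMaxPoolSize(8);
    executor.initialize();

    return steps.get("fileStep")
            .<Message, Message>chunk(1000)
            .reader(syncReader)
            .processor(processor)
            .writer(writer)
            .taskExecutor(executor)   // makes the step multi-threaded
            .throttleLimit(8)         // caps concurrent chunks in flight
            .build();
}

With the reader synchronized, worker threads process and write chunks concurrently; batching the JPA inserts inside the writer would compound the gain.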
I'm using Spring Batch for my app. In one of the batch jobs, I need to process multiple data items. Each item requires several database updates, and I need one transaction per item: if an exception is thrown while processing one item, the database updates for that item are rolled back, and processing continues with the next item.
I've put all the database updates in one method in the service layer. In my Spring Batch tasklet, I call that method for each item, like this:
for (RequestViewForBatch request : requestList) {
    orderService.processEachRequest(request);
}
In the service class, the method looks like this:
@Transactional(propagation = Propagation.NESTED, timeout = 100, rollbackFor = Exception.class)
public void processEachRequest(RequestViewForBatch request) {
    // update database
}
When executing the task, it gives me this error message:
org.springframework.transaction.NestedTransactionNotSupportedException: Transaction manager does not allow nested transactions by default - specify 'nestedTransactionAllowed' property with value 'true'
but I don't know how to solve this error.
Any suggestion would be appreciated. Thanks in advance.
The tasklet step will be executed in a transaction driven by Spring Batch, so you need to remove the @Transactional annotation from your processEachRequest method.
For the per-item rollback-and-continue requirement, you would need a fault-tolerant, chunk-oriented step configured with a skip policy; in this case, only the faulty items are skipped. Please refer to the Configuring Skip Logic section of the documentation. You can find an example here.
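A minimal sketch of such a step, assuming the Spring Batch builder API; the step name, the chunk size of 1 (one item per transaction, matching the requirement above), and the skip limit are illustrative:

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;

@Bean
public Step requestStep(StepBuilderFactory steps,
                        ItemReader<RequestViewForBatch> reader,
                        ItemWriter<RequestViewForBatch> writer) {
    return steps.get("requestStep")
            // chunk size 1 => one item per transaction
            .<RequestViewForBatch, RequestViewForBatch>chunk(1)
            .reader(reader)
            .writer(writer)
            .faultTolerant()
            .skip(Exception.class) // roll back and skip the faulty item
            .skipLimit(100)        // fail the step after 100 skipped items
            .build();
}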
I am working on a Spring Webflux project.
I want the following behaviour: when a client makes an API call, I want to send a success response to the client immediately and perform the large file operation in the background, so the client does not have to wait until the entire file is processed.
To try this out, I wrote the sample code below.
REST controller:
@GetMapping(value = "/{jobId}/process")
@ApiOperation("Start import job")
public Mono<Integer> process(@PathVariable("jobId") long jobId) {
    return service.process(jobId);
}
File processing service:
public Mono<Integer> process(Integer jobId) {
    return repository
            .findById(jobId)
            .map(job -> {
                File file = new File("read.csv");
                return processFile(file);
            });
}
Following is my stack:
Spring Webflux 2.2.2.RELEASE
I tried making this call using WebClient, but I do not get a response until the entire file has been processed.
As one of the options, you can run the processing on a different thread.
For example:
Create an event listener
Enable @Async and @EnableAsync
Or use different types of Executors from the Java concurrency package
Or manually run a thread
Also, for Kotlin you can use coroutines
You can use the subscribe method to start a job with its own scope in the background.
Mono.delay(Duration.ofSeconds(10)).subscribeOn(Schedulers.newElastic("myBackgroundTask")).subscribe(System.out::println);
As long as you do not tie this to your response publisher using one of the zip/merge or similar operators, your job will run in the background on its own scheduler pool.
The subscribe() method returns a Disposable instance, which can later be used to cancel the background job by calling its dispose() method.
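Tying this back to the controller in the question, a hedged sketch: Schedulers.boundedElastic() assumes the Reactor version shipped with Boot 2.2.x, and the response text is illustrative.

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

@GetMapping("/{jobId}/process")
public Mono<String> process(@PathVariable("jobId") long jobId) {
    service.process((int) jobId)                    // the service in the question takes an Integer
           .subscribeOn(Schedulers.boundedElastic()) // run the file work off the event loop
           .subscribe();                             // fire and forget; keep the Disposable to allow cancelling
    return Mono.just("Job " + jobId + " started");   // returned before processing finishes
}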
I have a scenario where we need to keep polling a database table for all active users and perform an API call to fetch any unread emails from each user's inbox. My approach is to use two verticles: one for polling and another for fetching the emails of a user. When the first verticle finds a user, it sends a message (the userId) to the second verticle through the event bus to fetch the emails. That way, I can increase the number of instances of the second verticle when there are lots of users.
I found the following two ways to poll the database for active users and then perform an API call for each user:
vertx.setPeriodic
vertx.executeBlocking
But the manual mentions that for long-running/polling tasks it is better to create an application-managed thread to handle the task.
Is my approach to the problem correct, or is there a better way to solve the problem at hand?
If an application-managed thread is the way to go, can you please illustrate it with an example?
Thanks.
You can create a dedicated worker thread pool for that, and run your periodic tasks on it:
import io.vertx.core.AbstractVerticle;
import io.vertx.core.DeploymentOptions;
import io.vertx.core.Vertx;

public class PeriodicWorkerExample {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        vertx.deployVerticle(new MyPeriodicWorker(), new DeploymentOptions()
                .setWorker(true)
                .setWorkerPoolSize(1)
                .setWorkerPoolName("periodic"));
    }
}

class MyPeriodicWorker extends AbstractVerticle {
    @Override
    public void start() {
        vertx.setPeriodic(1000, timerId -> {
            // runs every second on the dedicated "periodic" worker pool
            System.out.println(Thread.currentThread().getName());
        });
    }
}
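To connect this back to the two-verticle design from the question, the periodic worker can publish each active user id on the event bus, and a second verticle (deployed with as many instances as needed) consumes those ids and fetches the emails. A sketch under that assumption; the address name and the lookup/fetch helpers are illustrative:

import io.vertx.core.AbstractVerticle;
import java.util.Collections;
import java.util.List;

class PollingWorker extends AbstractVerticle {
    @Override
    public void start() {
        vertx.setPeriodic(1000, timerId -> {
            // Blocking DB access is acceptable here because this verticle
            // is deployed on its own worker pool, as shown above.
            for (String userId : findActiveUsers()) {
                vertx.eventBus().send("fetch.emails", userId);
            }
        });
    }

    private List<String> findActiveUsers() {
        return Collections.emptyList(); // illustrative: query the users table
    }
}

class EmailFetcher extends AbstractVerticle {
    @Override
    public void start() {
        // Deploy several instances of this verticle to scale out fetching.
        vertx.eventBus().<String>consumer("fetch.emails",
                msg -> fetchUnreadEmails(msg.body()));
    }

    private void fetchUnreadEmails(String userId) {
        // illustrative: call the mail API for this user
    }
}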
I have a requirement to add a sleep statement if I am unable to consume a message, so that I can retry after 5s. Do I need to set any configuration properties to do this?
rdd.foreachPartition(new VoidFunction<Iterator<ConsumerRecord<String, Object>>>() {
    @Override
    public void call(Iterator<ConsumerRecord<String, Object>> records)
            throws Exception {
        while (records.hasNext()) {
            ConsumerRecord<String, Object> consumerRecord = records.next();
            boolean processed = false;
            // retry the same record until it succeeds, sleeping between attempts
            while (!processed) {
                processed = processmessage(consumerRecord.value());
                if (!processed) {
                    Thread.sleep(5000); // 5s delay, as described above
                }
            }
        }
    }
});
Currently, I am unable to run my job.
You can use sleep in a Spark Streaming application.
But wait:
A Spark Streaming job runs micro batches, and we define the stream interval time, which is usually a few seconds (1s, 2s, etc.). If you use sleep in your Spark Streaming code, each micro batch will take additional time to finish running. This might hurt performance if data is coming in very frequently.
Whether sleep causes a performance problem or delay depends entirely on the application's requirements; it might have no impact at all if data arrives only at long intervals.
Hope this helps.
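If the sleep/retry stays, it is worth bounding it so one bad record cannot stall a micro batch indefinitely. A sketch of a helper that could replace the inner retry loop shown in the question; the attempt limit is illustrative, and processmessage is the question's own method:

// Bounded retry: give up after maxAttempts instead of blocking the micro batch forever.
private boolean processWithRetry(Object value, int maxAttempts) throws InterruptedException {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        if (processmessage(value)) {
            return true;            // processed successfully
        }
        if (attempt < maxAttempts) {
            Thread.sleep(5000);     // wait 5s before the next attempt
        }
    }
    return false;                   // caller can log or route to a dead-letter topic
}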
I am doing some ETL that processes input CSV files and loads them into Neo4j using Spring Data Neo4j.
I have two routes: one that takes the input CSV, splits it by line, and sends each line to the second route, which does the load line by line in transactional mode.
The following is the first route:
@Override
void configure() throws Exception {
    from(endpoint)
        .id('CSV_ROUTE')
        .unmarshal(buildCsvDataFormat())
        .split(body())
            .streaming()
            .parallelProcessing()
        .recipientList(header('IMPORTER_ROUTE'))
}
And the following is the second route:
@Override
void configure() throws Exception {
    from(endpoint)
        .transacted()
        .id(routeId)
        .bean(importer)
}
How can I make the transaction commit in batches of, for example, 10 lines instead of committing after every line?
Thank you,
Luis Oscar
You cannot do this: transactions are per message in Camel.
Also, mind that a transaction is not some magic fairy dust you can turn on so that anything you touch becomes transactional.
In Java, transactions generally only work with transactional resources such as JDBC and JMS.
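That said, since transactions are per message, one workaround is to make each message carry a batch: aggregate the split lines into groups before they reach the transacted route, so one message (and therefore one transaction) covers, say, 10 lines. A hedged sketch in Java; the direct endpoint name, completion size, and timeout are illustrative, and the importer bean would then have to accept a List of lines:

// First route: split the CSV, then hand each line to an aggregating route.
from(endpoint)
    .unmarshal(buildCsvDataFormat())
    .split(body()).streaming()
    .to("direct:batch");

// Batch route: collect 10 line bodies into one List-valued message.
from("direct:batch")
    .aggregate(constant(true), (oldExchange, newExchange) -> {
        if (oldExchange == null) {
            List<Object> batch = new ArrayList<>();
            batch.add(newExchange.getIn().getBody());
            newExchange.getIn().setBody(batch);
            return newExchange;
        }
        List<Object> batch = oldExchange.getIn().getBody(List.class);
        batch.add(newExchange.getIn().getBody());
        return oldExchange;
    })
    .completionSize(10)
    .completionTimeout(1000) // flush a trailing partial batch after 1s
    .recipientList(header("IMPORTER_ROUTE"));

Each aggregated message then flows through the transacted importer route once, giving one commit per group of lines instead of one per line.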