Spring Batch job with multi-level chunk - spring-boot

I need to write a batch job that reads a list of users from the DB, then for each user fetches data from an external system (which returns data in batches of 50 to 500 records), and inserts that data into the DB.
I have written the steps as below:
Tasklet loads all users and passes them to the reader (step 1).
Reader takes all users and passes them one by one to the processor (step 2).
Processor calls the external system and gets the full data for one user.
Writer gets this data for each user and inserts it into the DB.
Using SimpleAsyncTaskExecutor I am running this in parallel, but I want the steps to be as below:
Tasklet loads all users and passes them to the reader (step 1, stepTasklet).
Reader reads the list of all users (step 2, parallelStep).
Another reader/processor gets one chunk of data (50 records) for that user.
Writer gets those 50 records for each user and inserts them into the DB.
This step needs to run in parallel for multiple users, and the chunking should be at the data level, not at the user level as it is currently.
Can someone help?
@Bean
public Job getParallelJob() {
    JobParametersIncrementer jobParametersIncrementer = new JobParametersIncrementer() {
        long count = new Date().getTime();

        @Override
        public JobParameters getNext(JobParameters parameters) {
            return new JobParameters(Collections.singletonMap("count", new JobParameter(count)));
        }
    };
    return this.jobBuilderFactory.get("paralleljob").incrementer(jobParametersIncrementer)
            .start(stepTasklet())
            .next(parallelStep(taskExecutor()))
            .build();
}
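One way to get chunking at the data level while still running users in parallel is a partitioned step: a Partitioner creates one partition per user, and the worker step pages the external data for that user with a commit interval of 50. The following is a minimal sketch, not the poster's code; UserRepository, ExternalRecord, ExternalSystemPagingReader and the injected writer are assumed names, and the partitioner replaces the "load all users" tasklet.

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;

@Configuration
public class ParallelUserChunkConfig {

    private final StepBuilderFactory stepBuilderFactory;
    private final UserRepository userRepository;          // assumption: loads the user ids from the DB
    private final ItemWriter<ExternalRecord> jdbcWriter;  // assumption: your existing DB writer

    public ParallelUserChunkConfig(StepBuilderFactory stepBuilderFactory,
                                   UserRepository userRepository,
                                   ItemWriter<ExternalRecord> jdbcWriter) {
        this.stepBuilderFactory = stepBuilderFactory;
        this.userRepository = userRepository;
        this.jdbcWriter = jdbcWriter;
    }

    @Bean
    public Step parallelStep(TaskExecutor taskExecutor) {
        return stepBuilderFactory.get("parallelStep")
                .partitioner("userWorkerStep", userPartitioner()) // one partition per user
                .step(userWorkerStep())
                .taskExecutor(taskExecutor)                       // partitions (users) run in parallel
                .gridSize(10)
                .build();
    }

    @Bean
    public Partitioner userPartitioner() {
        return gridSize -> {
            Map<String, ExecutionContext> partitions = new HashMap<>();
            for (Long userId : userRepository.findAllUserIds()) { // replaces the "load all users" tasklet
                ExecutionContext ctx = new ExecutionContext();
                ctx.putLong("userId", userId);
                partitions.put("user-" + userId, ctx);
            }
            return partitions;
        };
    }

    @Bean
    public Step userWorkerStep() {
        return stepBuilderFactory.get("userWorkerStep")
                .<ExternalRecord, ExternalRecord>chunk(50)        // commit interval at the data level, not per user
                .reader(externalDataReader(null))                 // null is replaced by the step-scoped proxy at runtime
                .writer(jdbcWriter)
                .build();
    }

    @Bean
    @StepScope
    public ItemReader<ExternalRecord> externalDataReader(
            @Value("#{stepExecutionContext['userId']}") Long userId) {
        // hypothetical reader that pages the external system 50 records at a time for one user
        return new ExternalSystemPagingReader(userId, 50);
    }
}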

Related

Reading a CSV file with a huge number of records and a bulk insert using Spring Boot

I am not looking for Spring Batch.
I just want to use Spring Boot and read a CSV file with a huge number of records using BufferedReader. How can I do a bulk insert if I go with this approach?
Storing all student records in a List, partitioning the list into chunks and calling saveAll is what I was planning to do, but that may not be a good approach.
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileSystemResource("/tmp/test.csv").getInputStream(), DEFAULT_CHARSET))) {
    reader.lines()
          .skip(1)                     // skip the header row
          .map(s -> {
              // ... build a Student from the CSV line (mapping elided in the post)
              return student;
          })
          .forEach(s -> studentRepository.save(s));
}
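If you stay with plain Spring Boot, one option is to avoid saving row by row and instead buffer the parsed lines and flush them in batches with JdbcTemplate.batchUpdate, so only one batch is held in memory at a time. Below is a minimal sketch, not from the post: the student table layout (id, name), the batch size and the CsvBulkLoader name are assumptions. With JPA and saveAll you would typically get a similar effect by setting hibernate.jdbc.batch_size and flushing/clearing the EntityManager per chunk.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;

public class CsvBulkLoader {

    private static final int BATCH_SIZE = 1_000;
    private final JdbcTemplate jdbcTemplate;

    public CsvBulkLoader(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void load(String path) throws IOException {
        String sql = "INSERT INTO student (id, name) VALUES (?, ?)"; // assumed columns
        List<Object[]> batch = new ArrayList<>(BATCH_SIZE);
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line = reader.readLine(); // skip the header row
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");
                batch.add(new Object[] { Long.valueOf(cols[0]), cols[1] });
                if (batch.size() == BATCH_SIZE) {
                    jdbcTemplate.batchUpdate(sql, batch); // one round trip per chunk
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                jdbcTemplate.batchUpdate(sql, batch);     // flush the remaining tail
            }
        }
    }
}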

Updating Apache Camel JPA object in database triggers deadlock

I have an Apache Camel route that reads Data elements from a JPA endpoint, converts them to DataConverted elements and stores them in a different database via a second JPA endpoint. Both endpoints are Oracle databases.
Now I want to set a flag on the original Data element indicating that it was copied successfully. What is the best way to achieve that?
I tried saving the ID in the context, then reading it and calling a DAO method in .onCompletion().onCompleteOnly():
from("jpa://Data")
.onCompletion().onCompleteOnly().process(ex -> {
var id = Long.valueOf(getContext().getGlobalOption("id"));
myDao().setFlag(id);
}).end()
.process(ex -> {
Data data = ex.getIn().getBody(Data.class);
DataConverted dataConverted = convertData(data);
ex.getMessage().setBody(data);
var globalOptions = getContext().getGlobalOptions();
globalOptions.put("id", data.getId().toString());
getContext().setGlobalOptions(globalOptions);
})
.to("jpa://DataConverted").end();
However, this seems to trigger a deadlock: the DAO method stalls on the commit of the update. The only explanation I can see is that the Data object is locked by Camel and is still locked in the .onCompletion().onCompleteOnly() part of the route, so it can't be updated there.
Is there a better way to do it?
Have you tried the recipient list EIP, where the first destination is the jpa:DataConverted endpoint and the second destination is an endpoint that sets the flag? That way both get the same message and are executed sequentially.
https://camel.apache.org/components/3.17.x/eips/recipientList-eip.html
from("jpa://Data")
.process(ex -> {
Data data = ex.getIn().getBody(Data.class);
DataConverted dataConverted = convertData(data);
ex.getIn().setBody(data);
})
.recipientList(constant("direct:DataConverted","direct:updateFlag"))
.end();
from("direct:DataConverted")
.to("jpa://DataConverted")
.end();
from("direct:updateFlag")
.process(ex -> {
var id = ((MessageConverted) ex.getIn().getBody()).getId();
myDao().setFlag(id);
})
.end();
Keep in mind that you might want to make the route transactional by adding .transacted():
https://camel.apache.org/components/3.17.x/eips/transactional-client.html
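For completeness, a minimal sketch of where .transacted() could go, assuming a Spring transaction manager / transaction policy is configured for the Camel context (this is illustrative, not the answerer's code; the recipient list here uses a comma-separated constant):

from("jpa://Data")
    .transacted()   // spans the processing and both recipients in one transaction
    .process(ex -> ex.getIn().setBody(convertData(ex.getIn().getBody(Data.class))))
    .recipientList(constant("direct:DataConverted,direct:updateFlag"))
    .end();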

Parsing a multi-format, multi-line data file in a Spring Batch job

I am writing a Spring Batch job to process the data file described below and write it into a DB.
The sample data file has the following format: there are multiple headers, and each header has a bunch of rows associated with it. There can be millions of records for each header, and n headers in the flat file I am processing. My requirement is to pick only the few headers I am concerned with.
For each picked header I need to pick up all of its data rows. Each header and its data format are different. My processor can receive either of these record types and needs to write them into my DB.
HDR01
A|41|57|Data1|S|62|Data2|9|N|2017-02-01 18:01:05|2017-02-01 00:00:00
A|41|57|Data1|S|62|Data2|9|N|2017-02-01 18:01:05|2017-02-01 00:00:00
HDR02
A|41|57|Data1|S|62|Data2|9|N|
A|41|57|Data1|S|62|Data2|9|N|
I tried exploring PatternMatchingCompositeLineMapper, where I can map each header pattern to a tokenizer and a corresponding FieldSetMapper, but here I need to read the body rows, not just the header.
There is no footer either, so I can't create an end-of-record policy of my own.
I also tried AggregateItemReader, but I don't want to club all the records of a header together before processing them.
The rows belonging to a header should be processed in parallel.
@Bean
public LineMapper myLineMapper() {
    PatternMatchingCompositeLineMapper<Domain> mapper = new PatternMatchingCompositeLineMapper<>();

    final Map<String, LineTokenizer> tokenizers = new HashMap<String, LineTokenizer>();
    tokenizers.put("* HDR01*", new DelimitedLineTokenizer());
    tokenizers.put("*HDR02*", new DelimitedLineTokenizer());
    tokenizers.put("*", new DelimitedLineTokenizer("|"));
    mapper.setTokenizers(tokenizers);

    Map<String, FieldSetMapper<VMSFeedStyleInfo>> mappers = new HashMap<String, FieldSetMapper<VMSFeedStyleInfo>>();
    try {
        mappers.put("* HDR01*", customMapper());
        mappers.put("*HDR02*", customMapper());
        mappers.put("*", customMapper());
    } catch (Exception e) {
        e.printStackTrace();
    }
    mapper.setFieldSetMappers(mappers);

    return mapper;
}
Can somebody give me some inputs on how I should achieve this?
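One approach, shown here as a minimal sketch rather than a definitive solution: read the file as raw lines with a FlatFileItemReader configured with a PassThroughLineMapper, wrap it in a reader that remembers the most recent HDR* line it has seen, and tokenize every data row according to that header's layout. Domain, its constructor and the per-header mapping logic are assumed placeholders.

import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamReader;
import org.springframework.batch.item.file.FlatFileItemReader;

public class HeaderAwareReader implements ItemStreamReader<Domain> {

    private final FlatFileItemReader<String> delegate; // configured with a PassThroughLineMapper
    private String currentHeader;

    public HeaderAwareReader(FlatFileItemReader<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public Domain read() throws Exception {
        String line;
        while ((line = delegate.read()) != null) {
            if (line.startsWith("HDR")) {           // header row: remember it and keep reading
                currentHeader = line.trim();
                continue;
            }
            return mapDataRow(currentHeader, line); // data row: one item per row
        }
        return null;                                // end of file
    }

    private Domain mapDataRow(String header, String line) {
        String[] fields = line.split("\\|");
        // pick the field layout for this header (HDR01 vs HDR02) and build the domain object
        return new Domain(header, fields);
    }

    @Override public void open(ExecutionContext ctx) { delegate.open(ctx); }
    @Override public void update(ExecutionContext ctx) { delegate.update(ctx); }
    @Override public void close() { delegate.close(); }
}

Because the wrapper emits one Domain per data row, the step's chunk size and task executor (or a partitioned step) control processing at the row level rather than the header level.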

Best strategy to handle large data in Apache Camel

I am using Apache Camel to generate monthly reports. I have a MySQL query which, when run against my DB, returns around 5 million records (20 columns each). The query itself takes approximately 70 minutes to execute.
To speed up the process, I created 5 seda (worker) routes and used multicast().parallelProcessing(), which query the DB in parallel for different time ranges and then merge the results using an aggregator.
Now I can see 5 million records in my exchange body (in the form of List<HashMap<String, Object>>). When I try to format this with Camel Bindy to generate a CSV file out of the data, I get a GC overhead exception. I tried increasing the Java heap size, but the transformation takes forever.
Is there any other way to convert this raw data into a well-formatted CSV file? Can Java 8 streams be useful?
Code
from("direct://logs/testLogs")
.routeId("Test_Logs_Route")
.setProperty("Report", simple("TestLogs-${date:now:yyyyMMddHHmm}"))
.bean(Logs.class, "buildLogsQuery") // bean that generates the logs query
.multicast()
.parallelProcessing()
.to("seda:worker1?waitForTaskToComplete=Always&timeout=0", // worker routes
"seda:worker2?waitForTaskToComplete=Always&timeout=0",
"seda:worker3?waitForTaskToComplete=Always&timeout=0",
"seda:worker4?waitForTaskToComplete=Always&timeout=0",
"seda:worker5?waitForTaskToComplete=Always&timeout=0");
All my worker routes look like this
from("seda:worker4?waitForTaskToComplete=Always")
.routeId("ParallelProcessingWorker4")
.log(LoggingLevel.INFO, "Parallel Processing Worker 4 Flow Started")
.setHeader("WorkerId", constant(4))
.bean(Logs.class, "testBean") // appends time-clause to the query based in WorkerID
.to("jdbc:oss-ro-ds")
.to("seda:resultAggregator?waitForTaskToComplete=Always&timeout=0");
Aggregator
from("seda:resultAggregator?waitForTaskToComplete=Always&timeout=0")
.routeId("Aggregator_ParallelProcessing")
.log(LoggingLevel.INFO, "Aggregation triggered for processor ${header.WorkerId}")
.aggregate(header("Report"), new ParallelProcessingAggregationStrategy())
.completionSize(5)
.to("direct://logs/processResultSet")
from("direct://logs/processResultSet")
.routeId("Process_Result_Set")
.bean(Test.class, "buildLogReport");
.marshal(myLogBindy)
.to("direct://deliver/ooma");
Method buildLogReport
public void buildLogReport(List<HashMap<String, Object>> resultEntries, Exchange exchange) throws Exception {
    Map<String, Object> headerMap = exchange.getIn().getHeaders();
    ArrayList<MyLogEntry> reportList = new ArrayList<>();
    while (resultEntries != null && !resultEntries.isEmpty()) { // loop until every row has been consumed
        HashMap<String, Object> resultEntry = resultEntries.get(0);
        MyLogEntry logEntry = new MyLogEntry();
        logEntry.setA((String) resultEntry.get("A"));
        logEntry.setB((String) resultEntry.get("B"));
        logEntry.setC(((BigDecimal) resultEntry.get("C")).toString());
        if (null != resultEntry.get("D"))
            logEntry.setD(((BigInteger) resultEntry.get("D")).toString());
        logEntry.setE((String) resultEntry.get("E"));
        logEntry.setF((String) resultEntry.get("F"));
        logEntry.setG(((BigDecimal) resultEntry.get("G")).toString());
        logEntry.setH((String) resultEntry.get("H"));
        logEntry.setI(((Long) resultEntry.get("I")).toString());
        logEntry.setJ((String) resultEntry.get("J"));
        logEntry.setK(TimeUtils.convertDBToTZ((Date) resultEntry.get("K"), (String) headerMap.get("TZ")));
        logEntry.setL(((BigDecimal) resultEntry.get("L")).toString());
        logEntry.setM((String) resultEntry.get("M"));
        logEntry.setN((String) resultEntry.get("State"));
        logEntry.setO((String) resultEntry.get("Zip"));
        logEntry.setP("\"" + (String) resultEntry.get("Type") + "\"");
        logEntry.setQ((String) resultEntry.get("Gate"));
        reportList.add(logEntry);
        resultEntries.remove(resultEntry); // removing from the head of an ArrayList is O(n) per call
    }
    // Transform the exchange message
    exchange.getIn().setBody(reportList);
}
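One way to avoid holding all 5 million MyLogEntry objects plus the marshalled CSV in memory at once is to append each aggregated batch to the report file as it arrives, so only the current batch is live on the heap. A minimal sketch, not from the post: the file location, the column subset and the CsvAppender name are assumptions, and in the route this would replace the buildLogReport/marshal pair.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.List;
import org.apache.camel.Exchange;

public class CsvAppender {

    // Append one aggregated batch to the CSV file and release it, instead of accumulating
    // every row in an ArrayList<MyLogEntry> and marshalling the whole list at the end.
    public void appendBatchToCsv(List<HashMap<String, Object>> resultEntries, Exchange exchange) throws IOException {
        Path csv = Paths.get("/tmp/" + exchange.getProperty("Report", String.class) + ".csv");
        try (BufferedWriter writer = Files.newBufferedWriter(csv, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (HashMap<String, Object> row : resultEntries) {
                writer.write(String.join(",",
                        String.valueOf(row.get("A")),
                        String.valueOf(row.get("B")),
                        String.valueOf(row.get("C")))); // remaining columns follow the same pattern
                writer.newLine();
            }
        }
        resultEntries.clear(); // let this batch be garbage-collected before the next one arrives
    }
}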

Ehcache: refresh cache conditionally, not periodically

I am using Ehcache with a Spring architecture.
Right now I am refreshing the cache from the database at a FIXED interval of 15 minutes.
@Cacheable(cacheName = "fpodcache", refreshInterval = 60000, decoratedCacheType = DecoratedCacheType.REFRESHING_SELF_POPULATING_CACHE)
public List<Account> getAccount(String key) {
    // Running a database query to fetch the data.
}
Instead of time-based cache refresh, I want CONDITION-BASED cache refresh, for two reasons: 1. the database doesn't update very frequently (about 15 times a day, but NOT at a fixed interval), and 2. the data fetched and cached is huge.
So I decided to maintain two variables: a version in the database (version_db) and a version in the cache (version_cache). I want a condition such that the cache is refreshed only if version_db > version_cache, and not otherwise. Something like:
@Cacheable(cacheName = "fpodcache", conditionforrefresh = version_db > version_cache, decoratedCacheType = DecoratedCacheType.REFRESHING_SELF_POPULATING_CACHE)
public List<Account> getAccount(String key) {
    // Running a database query to fetch the data.
}
What is the right syntax for conditionforrefresh = version_db>version_cache in the above code ?
How do I achieve this?
You can add a check at the beginning of your refresh logic: if the condition is true, load from the DB; otherwise return the data that is already loaded.
@Cacheable(cacheName = "fpodcache", refreshInterval = 60000, decoratedCacheType = DecoratedCacheType.REFRESHING_SELF_POPULATING_CACHE)
public List<Account> getAccount(String key) {
    // Fetch version_db
    // Fetch version_cache
    // Check if version_db > version_cache
    // If true --> run a database query to fetch the data
    // Else    --> return existing data
}
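Fleshing out those comments, a minimal sketch of what the check could look like. VersionDao, AccountDao and the field names are assumptions, not part of any cache API; the refreshing decorator still invokes the method on its interval, but the heavy query only runs when the database version has moved past the version cached last time.

public class AccountService {

    private final VersionDao versionDao;   // assumption: reads a version counter maintained in the DB
    private final AccountDao accountDao;   // assumption: runs the expensive account query

    private volatile List<Account> cachedAccounts;
    private volatile long versionCache = -1;

    public AccountService(VersionDao versionDao, AccountDao accountDao) {
        this.versionDao = versionDao;
        this.accountDao = accountDao;
    }

    @Cacheable(cacheName = "fpodcache", refreshInterval = 60000, decoratedCacheType = DecoratedCacheType.REFRESHING_SELF_POPULATING_CACHE)
    public List<Account> getAccount(String key) {
        long versionDb = versionDao.currentVersion();          // cheap query: just the version counter
        if (cachedAccounts == null || versionDb > versionCache) {
            cachedAccounts = accountDao.loadAccounts(key);     // heavy query, only on a version change
            versionCache = versionDb;
        }
        return cachedAccounts;                                 // otherwise hand back the data already loaded
    }
}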
