spring batch object creation - spring

I am very new to Spring Batch. I need to develop a Spring Batch application which reads 100,000 records from a CSV file. I have developed one Spring Batch application like below.
<job id="hellojob" xmlns="http://www.springframework.org/schema/batch">
<step id="orderprocessor">
<tasklet allow-start-if-complete="true">
<chunk reader="reader" writer="writer" commit-interval="500" skip-limit="2">
<skippable-exception-classes>
<include class="org.springframework.batch.item.file.FlatFileParseException" />
</skippable-exception-classes>
</chunk>
</tasklet>
</step>
</job>
I also have a field set mapper class:
public class OrderDataMapper implements FieldSetMapper<Product> {

    @Override
    public Product mapFieldSet(FieldSet fieldSet) throws BindException {
        Product product = new Product();
        product.setCustId(fieldSet.readString(0));
        product.setOrderNum(fieldSet.readString(1));
        product.setCountry(fieldSet.readString(2));
        return product;
    }
}
As per my understanding, the above field set mapper class is called for every record, and each time it creates one new object. So for 100,000 records it will create 100,000 objects.
I feel this is a large number of objects for the JVM to handle, and they are not available for garbage collection as everything runs on a single thread.
Please let me know: is there any way I can create fewer objects while iterating over 100,000 records?

ItemReader and ItemProcessor handle items one at a time. ItemWriter, however, works in chunks, whose size is defined by the commit interval.
So if the commit interval is set to 500, Spring Batch buffers the read and/or processed items and invokes the ItemWriter once it reaches that number.
From that moment on, those objects become eligible for garbage collection.
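For reference, a minimal Java-config sketch of an equivalent chunk-oriented step (the bean and type names are illustrative, not taken from the question). It shows the same point: at most commit-interval (here 500) items are buffered per transaction, and once a chunk has been written those objects become unreachable and eligible for GC.

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileParseException;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class OrderStepConfig {

    // Equivalent of the XML step above: chunks of 500, skipping up to 2 parse errors.
    @Bean
    public Step orderProcessor(StepBuilderFactory stepBuilderFactory,
                               ItemReader<Product> reader,
                               ItemWriter<Product> writer) {
        return stepBuilderFactory.get("orderprocessor")
                .<Product, Product>chunk(500)   // at most 500 items are held before the writer runs
                .reader(reader)
                .writer(writer)
                .faultTolerant()
                .skip(FlatFileParseException.class)
                .skipLimit(2)
                .allowStartIfComplete(true)
                .build();
    }
}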

Related

Spring Batch CompositeItemProcessor get value from other delegates

I have a compositeItemProcessor as below
<bean id="compositeItemProcessor" class="org.springframework.batch.item.support.CompositeItemProcessor">
<property name="delegates">
<list>
<bean class="com.example.itemProcessor1"/>
<bean class="com.example.itemProcessor2"/>
<bean class="com.example.itemProcessor3"/>
<bean class="com.example.itemProcessor4"/>
</list>
</property>
</bean>
The issue I have is that within itemProcessor4 I require values from both itemProcessor1 and itemProcessor3.
I have looked at using the step execution context, but this does not work as this is all within one step. I have also looked at using @AfterProcess within ItemProcessor1, but this does not work as it isn't called until after ItemProcessor4.
What is the correct way to share data between delegates in a CompositeItemProcessor?
Would a util:map that is updated in itemProcessor1 and read in itemProcessor4 be a solution, given that the commit-interval is set to 1?
Using the step execution context won't work, as it is persisted at chunk boundaries, so it can't be shared between processors within the same chunk.
@AfterProcess is called after the registered item processor, which is the composite processor in your case (so after ItemProcessor4). This won't work either.
The only option left is to use some data holder object that you share between the item processors.
Hope this helps.
This page seems to state that there are two types of ExecutionContext, one at step level and one at job level:
https://docs.spring.io/spring-batch/trunk/reference/html/patterns.html#passingDataToFutureSteps
You should be able to get the job context from the step context and set keys on it.
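A rough sketch of that idea (the class, key and item types are illustrative; note that a delegate inside a CompositeItemProcessor typically has to be registered explicitly as a step listener for @BeforeStep to be invoked): the first delegate grabs the job-level ExecutionContext and stores a value that a later delegate can read back.

import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemProcessor;

public class ItemProcessor1 implements ItemProcessor<Object, Object> {

    private ExecutionContext jobContext;

    @BeforeStep
    public void saveStepExecution(StepExecution stepExecution) {
        // the job-level context, shared across all steps (and all delegates) of the job
        this.jobContext = stepExecution.getJobExecution().getExecutionContext();
    }

    @Override
    public Object process(Object item) {
        jobContext.put("valueFromProcessor1", "computed in processor 1"); // ItemProcessor4 reads this key back
        return item;
    }
}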
I had a similar requirement in my application too. I went with creating a data transfer object, ItemProcessorDto, which is shared by all the ItemProcessors. You can store data in this DTO in the first processor, and all the remaining processors can get the information out of it. In addition, any ItemProcessor can update or retrieve data from the DTO.
Below is a code snippet:
@Bean
public ItemProcessor1<ItemProcessorDto> itemProcessor1() {
    log.info("Generating ItemProcessor1");
    return new ItemProcessor1();
}

@Bean
public ItemProcessor2<ItemProcessorDto> itemProcessor2() {
    log.info("Generating ItemProcessor2");
    return new ItemProcessor2();
}

@Bean
public ItemProcessor3<ItemProcessorDto> itemProcessor3() {
    log.info("Generating ItemProcessor3");
    return new ItemProcessor3();
}

@Bean
public ItemProcessor4<ItemProcessorDto> itemProcessor4() {
    log.info("Generating ItemProcessor4");
    return new ItemProcessor4();
}

@Bean
@StepScope
public CompositeItemProcessor<ItemProcessorDto> compositeItemProcessor() {
    log.info("Generating CompositeItemProcessor");
    CompositeItemProcessor<ItemProcessorDto> compositeItemProcessor = new CompositeItemProcessor<>();
    compositeItemProcessor.setDelegates(Arrays.asList(itemProcessor1(), itemProcessor2(), itemProcessor3(), itemProcessor4()));
    return compositeItemProcessor;
}

@Data
public class ItemProcessorDto {

    private List<String> sharedData_1;
    private Map<String, String> sharedData_2;
}
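A sketch of how two of the delegates might use the shared DTO (the method bodies are illustrative, not part of the answer): the first processor stores data on the DTO and a later processor reads it back, since the same instance flows through every delegate.

import java.util.Arrays;
import java.util.List;

import org.springframework.batch.item.ItemProcessor;

// (each processor would live in its own file)
public class ItemProcessor1 implements ItemProcessor<ItemProcessorDto, ItemProcessorDto> {

    @Override
    public ItemProcessorDto process(ItemProcessorDto item) {
        item.setSharedData_1(Arrays.asList("value computed in processor 1"));
        return item; // the same DTO instance is handed to the next delegate
    }
}

public class ItemProcessor4 implements ItemProcessor<ItemProcessorDto, ItemProcessorDto> {

    @Override
    public ItemProcessorDto process(ItemProcessorDto item) {
        List<String> fromProcessor1 = item.getSharedData_1(); // data stored by ItemProcessor1
        // ... use fromProcessor1 to build the final result ...
        return item;
    }
}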

How does Spring Batch share data between jobs

I have one question about Spring Batch jobs.
I want to share data from one job with another job in the same execution context. Is it possible? If so, how?
My requirement is caching. I have a file where some data is stored. My job runs daily and needs the data from that file. I don't want my job to read the file daily; instead, I want to store the file's data in a cache (HashMap), so that when the same job runs the next day it uses the data from the cache. Is this possible in Spring Batch?
Your suggestions are welcome.
You can use a Spring initializing bean which initializes your cache at startup.
Add the initializing bean to your application context:
<bean id="yourCacheBean" class="yourpackage.YourCacheBean" init-method="initialize">
</bean>
YourCacheBean looks like this:
public class YourCacheBean {

    private Map<Object, Object> yourCache;

    public void initialize() {
        // TODO: initialize your cache (e.g. read the file once and fill the map)
    }
}
Pass the initializing bean to the itemReader, itemProcessor or itemWriter in job.xml:
<bean id="exampleProcessor" class="yourpackage.ExampleProcessor" scope="step">
<property name="cacheBean" ref="yourCacheBean" />
</bean>
ExampleProcessor looks like this:
public class ExampleProcessor implements ItemProcessor<String, String> {

    private YourCacheBean cacheBean;

    public String process(String arg0) {
        // look up whatever you need from cacheBean here instead of re-reading the file
        return "";
    }

    public void setCacheBean(YourCacheBean cacheBean) {
        this.cacheBean = cacheBean;
    }
}
Create a job to import the file into a database; other jobs will then use the data from the database as a cache.
Another way may be to read the file into a Map<> and serialize the object to a file, then deserialize it when needed (but I still prefer a database as the cache).
Spring has a cache annotation that may help in this kind of case, and it is really easy to use. The first call to a method will be executed; afterwards, if you call the same method with exactly the same arguments, the value is returned from the cache.
Here you have a little tutorial: http://www.baeldung.com/spring-cache-tutorial
In your case, if the call that reads the file is always made with the same arguments, it will work as you want. Just take care of the TTL.
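A minimal sketch of that approach, assuming a plain Spring setup (the class, cache and method names are illustrative, not from the question): the file is read on the first call, and later calls with the same path are served from the cache.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.concurrent.ConcurrentMapCacheManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Service;

@Configuration
@EnableCaching
class CacheConfig {

    // Simple in-memory cache; note it has no TTL or eviction of its own.
    @Bean
    public CacheManager cacheManager() {
        return new ConcurrentMapCacheManager("fileData");
    }
}

@Service
class FileDataService {

    // First call with a given path reads the file; later calls return the cached lines.
    @Cacheable("fileData")
    public List<String> loadLines(String path) throws Exception {
        return Files.readAllLines(Paths.get(path));
    }
}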

Parametrized BeforeSaveEvent<T> is triggered for every registered listener irrespective of T

I'm trying to generate unique incremented IDs for entities. For that purpose I have registered multiple ApplicationListeners, one for every entity type:
public class Neo4JCustomerSaveListener implements ApplicationListener<BeforeSaveEvent<Customer>> {

    @Override
    public void onApplicationEvent(BeforeSaveEvent<Customer> event) {
        ...
    }
}
The bean definitions look like:
<!-- Listeners -->
<bean id="customerSaveListener" class="c.b.listener.Neo4JCustomerSaveListener" />
<bean id="employeeSaveListener" class="c.b.listener.Neo4JEmployeeSaveListener" />
<bean id="messageSaveListener" class="c.b.listener.Neo4JMessageSaveListener" />
As can be seen, BeforeSaveEvent is parametrized with the Customer type. But when a save event is about to happen for a Customer, the other listeners are also triggered, although their BeforeSaveEvent is parametrized with different types, such as Employee or Message (all inheriting from the same Entity parent class).
Is this to be expected? And what should my approach to this problem be? The first thing that comes to mind is to use only one listener and differentiate inside with instanceof checks, but this seems very ugly.
I'm using Neo4j 1.9.M03, spring-data-neo4j 2.1.0.RC4 and Spring 3.1.3.
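For illustration, a rough sketch of the single-listener workaround mentioned above (the listener name and the ID-assignment logic are placeholders, and it assumes BeforeSaveEvent exposes the saved entity via getEntity()):

import org.springframework.context.ApplicationListener;

// BeforeSaveEvent, Customer, Employee and Message are imported as in the question's own listeners.
public class Neo4JEntitySaveListener implements ApplicationListener<BeforeSaveEvent<Object>> {

    @Override
    public void onApplicationEvent(BeforeSaveEvent<Object> event) {
        Object entity = event.getEntity(); // assumed accessor for the entity being saved
        if (entity instanceof Customer) {
            // assign the next customer id
        } else if (entity instanceof Employee) {
            // assign the next employee id
        } else if (entity instanceof Message) {
            // assign the next message id
        }
    }
}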

ItemProcessor needs to both modify the working entity but also create new referenced ones

Using Spring Batch and JPA (provided by Hibernate).
I have a step that does the following:
reads all clients from the DB (Client entity)
enhances them with data from a 3rd party. The ItemProcessor goes to the 3rd party data source, fetches some data that it stores in the Client entity itself (its fields), but also brings back more data that is stored as different entities (ClientSale); Client has a List property for these, mapped by ManyToOne.
The modified entity (Client) and the new ones (ClientSale) need to be stored in the DB.
The reader part is straightforward, and for the writer I used JPAItemWriter. At the processing stage I tried to update the fields, create the new entities, add them to the client's list and return the client, hoping that the writer would write both the referenced objects and the client itself to the DB.
Instead, I got an error saying that a ClientSale with id #123213213 doesn't exist in the DB.
How do I overcome this? Should I return a list of objects (of different types) from my processor (the client plus all the ClientSale entities)? Can JPAItemWriter handle a list of objects? Another problem with that is that I'd have to manually update the client_id in the ClientSale entities instead of adding them to the list and letting Hibernate understand the relation between them and who points where.
What's the best practice here?
Thanks!
OK, here is what I did in the end, based on all of this:
I created a MultiEntityItemWriter that can receive a list as an item (in that case it unwraps it and writes all elements to the delegate ItemWriter).
Code:
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.springframework.batch.item.ItemWriter;

public class MultiEntityItemWriter implements ItemWriter<Object> {

    private ItemWriter<Object> delegate;

    @Override
    public void write(List<? extends Object> items) throws Exception {
        List<Object> toWrite = new ArrayList<>();
        for (Object item : items) {
            if (item instanceof Collection<?>) {
                // flatten: an item that is itself a collection is unwrapped into its elements
                toWrite.addAll((Collection<?>) item);
            } else {
                toWrite.add(item);
            }
        }
        delegate.write(toWrite);
    }

    public ItemWriter<Object> getDelegate() {
        return delegate;
    }

    public void setDelegate(ItemWriter<Object> delegate) {
        this.delegate = delegate;
    }
}
Now my ItemProcessor can output a list with all the entities to be written, and I don't need to rely on JPA to understand that there are more entities to be committed to the DB.
Hope it helps...
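For completeness, a sketch of how this writer might be wired around a JpaItemWriter delegate (Java config; the bean names are illustrative and not part of the answer):

import javax.persistence.EntityManagerFactory;

import org.springframework.batch.item.database.JpaItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class WriterConfig {

    @Bean
    public MultiEntityItemWriter multiEntityItemWriter(EntityManagerFactory entityManagerFactory) {
        // the JPA writer does the actual persisting; MultiEntityItemWriter just flattens lists first
        JpaItemWriter<Object> jpaItemWriter = new JpaItemWriter<>();
        jpaItemWriter.setEntityManagerFactory(entityManagerFactory);

        MultiEntityItemWriter writer = new MultiEntityItemWriter();
        writer.setDelegate(jpaItemWriter);
        return writer;
    }
}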
I think you are trying to accommodate multiple steps in a single step. Try finding a way to define your job as a two-step process instead of one:
<batch:job id="MyJob" incrementer="incrementer" job-repository="jobRepository">
<batch:step id="step1" next="step2">
<tasklet >
<chunk reader="reader1"
writer="writer1"
processor="processor1"
commit-interval="10" />
</tasklet>
</batch:step>
<batch:step id="step2">
<tasklet >
<chunk reader="reader2"
writer="writer2"
processor="processor2"
commit-interval="10" />
</tasklet>
</batch:step>
</batch:job>
If required, use appropriate caching for optimal performance.
EDIT:
In your item writer, please make sure you are using the entityManager/session of the first data source. Also use merge to persist your entities in place of persist.
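A rough sketch of that suggestion as a custom writer (the class name and entity manager wiring are illustrative; note that Spring Batch's own JpaItemWriter already merges each item internally):

import java.util.List;

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

import org.springframework.batch.item.ItemWriter;

public class MergingItemWriter implements ItemWriter<Object> {

    // transaction-bound entity manager injected by Spring
    @PersistenceContext
    private EntityManager entityManager;

    @Override
    public void write(List<? extends Object> items) {
        for (Object item : items) {
            // merge copes with both new and detached instances, unlike persist
            entityManager.merge(item);
        }
    }
}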

Spring Batch - Repeat step for each item in a data list

This is a tough one, but I am sure it is not unheard of.
I have two datasets, Countries and Demographics. The Countries dataset contains the name of a country and an ID pointing to its demographic data.
The Demographics dataset is a hierarchical dataset starting from the country down to the suburb.
Both of these datasets are pulled from a 3rd party on a weekly basis.
I need to split the demographics out into files, one for each country.
So far the steps that I have are:
1) Pull Countries
2) Pull Demographics
3) (this is needed) Loop over the country dataset, calling a "Write Country Demographics to File" step
Is it possible to somehow repeat a step, passing it the current country ID?
EDIT: Added link to a sample of PartitionHandler.
Thanks JBristow. The link below shows overriding the PartitionHandler to pass parameters using addArgument on a JavaTask object, but it looks like a lot of heavy lifting by the developer and not very "business problem specific", which is the goal of Spring Batch.
http://www.activeeon.com/blog/all/integration/distribute-a-spring-batch-job-on-the-proactive-scheduler
I also saw, in your original link, section 7.4.3 "Binding Input Data to Steps" in the context of 7.4.2 "Partitioner"; this looks very exciting:
<bean id="itemReader" scope="step"
class="org.spr...MultiResourceItemReader">
<property name="resource" value="#{stepExecutionContext[fileName]}/*"/>
</bean>
I don't suppose that anyone has some sample XML config of this in play?
Partitioner
Passing dynamic values to steps within the partition
Thanks in advance.
Yes, check out the partitioning feature of Spring Batch! http://static.springsource.org/spring-batch/reference/html-single/index.html#partitioning
Basically, it allows you to use a "partitioner" to create new execution contexts that are passed to a handler, which then does something with that information.
While partitioning was made for parallelization, its default concurrency is 1, so you can start small and ratchet it up to match the hardware at your disposal. Since I assume each country's data is not dependent on the others (at least in the download-demographics step), your job could make use of basic parallelization.
EDIT: Adding an example.
Here's what I do (more or less):
First, the XML:
<beans>
    <batch:job id="jobName">
        <batch:step id="innerStep.master">
            <batch:partition partitioner="myPartitioner" step="innerStep"/>
        </batch:step>
    </batch:job>
    <bean id="myPartitioner" class="org.lapseda.MyPartitioner" scope="step">
        <property name="jdbcTemplate" ref="jdbcTemplate"/>
        <property name="runDate" value="#{jobExecutionContext['runDate']}"/>
        <property name="recurrenceId" value="D"/>
    </bean>
    <batch:step id="innerStep">
        <batch:tasklet>
            <batch:chunk reader="someReader" processor="someProcessor" writer="someWriter" commit-interval="10"/>
        </batch:tasklet>
    </batch:step>
</beans>
And now some Java:
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class MyPartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        List<String> list = getValuesToRunOver();
        /* I use a TreeMap because my partitions are ordered; a HashMap should work if order isn't important */
        Map<String, ExecutionContext> out = new TreeMap<String, ExecutionContext>();
        for (String item : list) {
            ExecutionContext context = new ExecutionContext();
            context.put("key", "value"); // add your own stuff!
            out.put("innerStep" + item, context);
        }
        return out;
    }
}
Then you just read from the context like you would from a normal step or job context inside your step.
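For example, a minimal sketch of a step-scoped bean that reads the "key" entry the partitioner above put into each partition's context (Java config with an illustrative ListItemReader; the XML equivalent is the scope="step" late binding shown earlier in the question):

import java.util.Collections;

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.support.ListItemReader;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class PartitionedStepConfig {

    @Bean
    @StepScope
    public ListItemReader<String> someReader(@Value("#{stepExecutionContext['key']}") String key) {
        // 'key' is whatever this partition's ExecutionContext was populated with by the partitioner
        return new ListItemReader<>(Collections.singletonList(key));
    }
}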
