Issues with Spring Batch

Hi, I have been working with Spring Batch recently and need some help.
1) I want to run my job using multiple threads, so I have used a TaskExecutor as below:
@Bean
public TaskExecutor taskExecutor() {
    SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
    taskExecutor.setConcurrencyLimit(4);
    return taskExecutor;
}

@Bean
public Step myStep() {
    return stepBuilderFactory.get("myStep")
            .<MyEntity, AnotherEntity>chunk(1)
            .reader(reader())
            .processor(processor())
            .writer(writer())
            .taskExecutor(taskExecutor())
            .throttleLimit(4)
            .build();
}
but while executing I can see the line below in the console:
o.s.b.c.l.support.SimpleJobLauncher : No TaskExecutor has been set, defaulting to synchronous executor.
What does this mean? While debugging, however, I can see four SimpleAsyncTaskExecutor threads running. Can someone shed some light on this?
2) I don't want to run my batch application with the metadata tables that Spring Batch creates. I have tried adding spring.batch.initialize-schema=never, but it didn't work. I also saw a way to do this by using ResourcelessTransactionManager and MapJobRepositoryFactoryBean, but I have to make some database transactions in my job. So will it be alright if I use this?
Also I was able to do this by extending DefaultBatchConfigurer and overriding:
@Override
public void setDataSource(DataSource dataSource) {
    // Override to avoid setting the datasource even if one exists.
    // Initialization will then use a Map-based JobRepository (instead of the database).
}
Please guide me further. Thanks.
Update:
My full configuration class is here:
@EnableBatchProcessing
@EnableScheduling
@Configuration
public class MyBatchConfiguration {

    @Autowired
    public JobBuilderFactory jobBuilderFactory;

    @Autowired
    public StepBuilderFactory stepBuilderFactory;

    @Autowired
    public DataSource dataSource;

    /* @Override
    public void setDataSource(DataSource dataSource) {
        // Override to avoid setting the datasource even if one exists.
        // Initialization will then use a Map-based JobRepository (instead of the database).
    } */

    @Bean
    public Step myStep() {
        return stepBuilderFactory.get("myStep")
                .<MyEntity, AnotherEntity>chunk(1)
                .reader(reader())
                .processor(processor())
                .writer(writer())
                .taskExecutor(taskExecutor())
                .throttleLimit(4)
                .build();
    }

    @Bean
    public Job myJob() {
        return jobBuilderFactory.get("myJob")
                .incrementer(new RunIdIncrementer())
                .listener(myJobListener())
                .flow(myStep())
                .end()
                .build();
    }

    @Bean
    public MyJobListener myJobListener() {
        return new MyJobListener();
    }

    @Bean
    public ItemReader<MyEntity> reader() {
        return new MyReader();
    }

    @Bean
    public ItemWriter<? super AnotherEntity> writer() {
        return new MyWriter();
    }

    @Bean
    public ItemProcessor<MyEntity, AnotherEntity> processor() {
        return new MyProcessor();
    }

    @Bean
    public TaskExecutor taskExecutor() {
        SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
        taskExecutor.setConcurrencyLimit(4);
        return taskExecutor;
    }
}

In the future, please break this up into two independent questions. That being said, let me shed some light on both questions.
SimpleJobLauncher : No TaskExecutor has been set, defaulting to synchronous executor.
Your configuration configures myStep to use your TaskExecutor. What that does is cause Spring Batch to execute each chunk in its own thread (based on the parameters of the TaskExecutor). The log message you are seeing has nothing to do with that behavior. It has to do with launching your job. By default, the SimpleJobLauncher will launch the job on the same thread it is running on, thereby blocking that thread. You can inject a TaskExecutor into the SimpleJobLauncher, which will cause the job to be executed on a different thread from the JobLauncher itself. These are two separate uses of multiple threads by the framework.
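For illustration, a minimal sketch of that second use case (the log message goes away once the launcher has its own TaskExecutor). This assumes you declare the JobLauncher yourself, for example in your own BatchConfigurer or as an extra @Bean that you then use to launch the job:

@Bean
public JobLauncher asyncJobLauncher(JobRepository jobRepository) throws Exception {
    SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
    jobLauncher.setJobRepository(jobRepository);
    // Without this call, SimpleJobLauncher logs "No TaskExecutor has been set,
    // defaulting to synchronous executor." and runs the job on the calling thread.
    jobLauncher.setTaskExecutor(new SimpleAsyncTaskExecutor());
    jobLauncher.afterPropertiesSet();
    return jobLauncher;
}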
I don't want to run my Batch application with the metadata tables that spring batch creates
The short answer here is to just use an in-memory database like HSQLDB or H2 for your metadata tables. This provides a production-grade data store (so that concurrency is handled correctly) without actually persisting the data. If you use the ResourcelessTransactionManager, you are effectively turning transactions off (a bad idea if you're using a database in any capacity) because that TransactionManager doesn't actually do anything (it's a no-op implementation).
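As an illustration of that suggestion, assuming Spring Boot 2.x with H2 on the classpath and that your business data goes through a separately configured datasource (the property names below are the standard Boot ones, not something from your project), the metadata tables can live in a throwaway in-memory database:

# In-memory H2 database used only for the Spring Batch metadata tables
spring.datasource.url=jdbc:h2:mem:batch-metadata;DB_CLOSE_DELAY=-1
spring.datasource.driver-class-name=org.h2.Driver
spring.batch.initialize-schema=embedded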

Related

Spring Batch/Data JPA application not persisting/saving data to Postgres database when calling JPA repository (save, saveAll) methods

I am nearly at wits' end. I have read/googled endlessly and tried the solutions from all the Google/Stack Overflow posts that describe this same issue (there are quite a few). Some seemed promising, but nothing has worked for me yet, though I have made some progress and I believe I am on the right track (at this point I suspect it is something with the transaction manager and a possible conflict between Spring Batch and Spring Data JPA).
References:
Spring boot repository does not save to the DB if called from scheduled job
JpaItemWriter: no transaction is in progress
Similar to the aforementioned posts, I have a Spring Boot application that uses Spring Batch and Spring Data JPA. It reads comma-delimited data from a .csv file, does some processing/transformation, and attempts to persist/save to the database using the JPA repository methods, specifically .saveAll() (I also tried the .save() method and it did the same thing), since I'm saving a List<MyUserDefinedDataType> of a user-defined data type (batch insert).
Now, my code was working fine on Spring Boot starter 1.5.9.RELEASE, but I recently attempted to upgrade to 2.x.x and found, after countless hours of debugging, that only version 2.2.0.RELEASE would persist/save data to the database. So an upgrade to >= 2.2.1.RELEASE breaks persistence. Everything is read fine from the .csv; it's just that the first time the code flow hits a JPA repository method like .save() or .saveAll(), the application keeps running but nothing gets persisted. I also noticed the Hikari pool logs "active=1 idle=4", whereas the same log on version 1.5.9.RELEASE says active=0 idle=5 immediately after persisting the data, so the application is definitely hanging. I went into the debugger and saw that after jumping into the repository calls, it goes into an almost infinite cycle through the Spring AOP libraries and such (all third party) and, I believe, never comes back to the real application/business logic that I wrote.
3c22fb53ed64 2021-05-20 23:53:43.909 DEBUG
[HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool - HikariPool-1 - Pool stats (total=5, active=1, idle=4, waiting=0)
Anyway, I tried the most common solutions that worked for other people, which were:
Defining a JpaTransactionManager @Bean and injecting it into the step, while keeping the JobRepository on the PlatformTransactionManager. This did not work. I then also tried using the JpaTransactionManager in the JobRepository @Bean; this did not work either.
Defining a @RestController endpoint in my application to manually trigger the job, instead of triggering it from my main Application.java class (I talk about this more below). Per one of the posts linked above, the data then persisted correctly to the database even on Spring >= 2.2.1, which makes me suspect even more that something with the Spring Batch persistence/entity/transaction managers is messed up.
The code is basically this:
BatchConfiguration.java
@Configuration
@EnableBatchProcessing
@Import({DatabaseConfiguration.class})
public class BatchConfiguration {

    // Datasource is a Postgres DB defined in a separate IntelliJ project that I add to my pom.xml
    DataSource dataSource;

    @Autowired
    public BatchConfiguration(@Qualifier("dataSource") DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Bean
    @Primary
    public JpaTransactionManager jpaTransactionManager() {
        final JpaTransactionManager tm = new JpaTransactionManager();
        tm.setDataSource(dataSource);
        return tm;
    }

    @Bean
    public JobRepository jobRepository(PlatformTransactionManager transactionManager) throws Exception {
        JobRepositoryFactoryBean jobRepositoryFactoryBean = new JobRepositoryFactoryBean();
        jobRepositoryFactoryBean.setDataSource(dataSource);
        jobRepositoryFactoryBean.setTransactionManager(transactionManager);
        jobRepositoryFactoryBean.setDatabaseType("POSTGRES");
        return jobRepositoryFactoryBean.getObject();
    }

    @Bean
    public JobLauncher jobLauncher(JobRepository jobRepository) {
        SimpleJobLauncher simpleJobLauncher = new SimpleJobLauncher();
        simpleJobLauncher.setJobRepository(jobRepository);
        return simpleJobLauncher;
    }

    @Bean(name = "jobToLoadTheData")
    public Job jobToLoadTheData() {
        return jobBuilderFactory.get("jobToLoadTheData")
                .start(stepToLoadData())
                .listener(new CustomJobListener())
                .build();
    }

    @Bean
    @StepScope
    public TaskExecutor taskExecutor() {
        ThreadPoolTaskExecutor threadPoolTaskExecutor = new ThreadPoolTaskExecutor();
        threadPoolTaskExecutor.setCorePoolSize(maxThreads);
        threadPoolTaskExecutor.setThreadGroupName("taskExecutor-batch");
        return threadPoolTaskExecutor;
    }

    @Bean(name = "stepToLoadData")
    public Step stepToLoadData() {
        TaskletStep step = stepBuilderFactory.get("stepToLoadData")
                .transactionManager(jpaTransactionManager())
                .<List<FieldSet>, List<myCustomPayloadRecord>>chunk(chunkSize)
                .reader(myCustomFileItemReader(OVERRIDDEN_BY_EXPRESSION))
                .processor(myCustomPayloadRecordItemProcessor())
                .writer(myCustomerWriter())
                .faultTolerant()
                .skipPolicy(new AlwaysSkipItemSkipPolicy())
                .skip(DataValidationException.class)
                .listener(new CustomReaderListener())
                .listener(new CustomProcessListener())
                .listener(new CustomWriteListener())
                .listener(new CustomSkipListener())
                .taskExecutor(taskExecutor())
                .throttleLimit(maxThreads)
                .build();
        step.registerStepExecutionListener(stepExecutionListener());
        step.registerChunkListener(new CustomChunkListener());
        return step;
    }
My main method:
Application.java
@Autowired
@Qualifier("jobToLoadTheData")
private Job loadTheData;

@Autowired
private JobLauncher jobLauncher;

@PostConstruct
public void launchJob() throws JobParametersInvalidException, JobExecutionAlreadyRunningException, JobRestartException, JobInstanceAlreadyCompleteException {
    JobParameters parameters = new JobParametersBuilder().addDate("random", new Date()).toJobParameters();
    jobLauncher.run(loadTheData, parameters);
}

public static void main(String[] args) {
    SpringApplication.run(Application.class, args);
}
Now, normally I read this .csv from an Amazon S3 bucket, but since I'm testing locally, I am just placing the .csv in the project directory and reading it directly by triggering the job from the Application.java main class (as you can see above). Also, I do have some other beans defined in this BatchConfiguration class, but I don't want to over-complicate this post more than it already is, and from the googling I've done, the problem is most likely in the methods I posted (hopefully).
Also, I would like to point out that, similar to one of the other posts on Google/Stack Overflow from a user with a similar problem, I created a @RestController endpoint that simply calls the .run() method of the JobLauncher, passing in the jobToLoadTheData bean, and it triggers the batch insert. Guess what? The data persists to the database just fine, even on Spring >= 2.2.1.
What is going on here? Is this a clue? Is something funky going on with some kind of entity or transaction manager? I'll take any advice or tips! I can provide any more information that you may need, so please just ask.
You are defining a bean of type JobRepository and expecting it to be picked up by Spring Batch. This is not correct. You need to provide a BatchConfigurer and override getJobRepository. This is explained in the reference documentation:
You can customize any of these beans by creating a custom implementation of the
BatchConfigurer interface. Typically, extending the DefaultBatchConfigurer
(which is provided if a BatchConfigurer is not found) and overriding the required
getter is sufficient.
This is also documented in the Javadoc of @EnableBatchProcessing. So in your case, you need to define a bean of type BatchConfigurer and override getJobRepository and getTransactionManager, something like:
@Bean
public BatchConfigurer batchConfigurer(EntityManagerFactory entityManagerFactory, DataSource dataSource) {
    return new DefaultBatchConfigurer(dataSource) {
        @Override
        public PlatformTransactionManager getTransactionManager() {
            return new JpaTransactionManager(entityManagerFactory);
        }

        @Override
        public JobRepository getJobRepository() {
            JobRepositoryFactoryBean jobRepositoryFactoryBean = new JobRepositoryFactoryBean();
            jobRepositoryFactoryBean.setDataSource(dataSource);
            jobRepositoryFactoryBean.setTransactionManager(getTransactionManager());
            // set other properties as needed
            try {
                jobRepositoryFactoryBean.afterPropertiesSet();
                return jobRepositoryFactoryBean.getObject();
            } catch (Exception e) {
                throw new IllegalStateException("Unable to create JobRepository", e);
            }
        }
    };
}
In a Spring Boot context, you could also override the createTransactionManager and createJobRepository methods of org.springframework.boot.autoconfigure.batch.JpaBatchConfigurer if needed.

How to add tasklet to run after each partition step completion in Spring Batch

I am new to Spring Batch and am implementing a Spring Batch job that has to pull a huge data set from the DB and write it to a file. Below is the sample job config, which is working as expected for me.
@Bean
public Job customDBReaderFileWriterJob() throws Exception {
    return jobBuilderFactory.get(MY_JOB)
            .incrementer(new RunIdIncrementer())
            .flow(partitionGenerationStep())
            .next(cleanupStep())
            .end()
            .build();
}

@Bean
public Step partitionGenerationStep() throws Exception {
    return stepBuilderFactory
            .get("partitionGenerationStep")
            .partitioner("Partitioner", partitioner())
            .step(multiOperationStep())
            .gridSize(50)
            .taskExecutor(taskExecutor())
            .build();
}

@Bean
public Step multiOperationStep() throws Exception {
    return stepBuilderFactory
            .get("MultiOperationStep")
            .<Input, Output>chunk(100)
            .reader(reader())
            .processor(processor())
            .writer(writer())
            .build();
}

@Bean
@StepScope
public DBPartitioner partitioner() {
    DBPartitioner dbPartitioner = new DBPartitioner();
    dbPartitioner.setColumn(ID);
    dbPartitioner.setDataSource(dataSource);
    dbPartitioner.setTable(TABLE);
    return dbPartitioner;
}

@Bean
@StepScope
public Reader reader() {
    return new Reader();
}

@Bean
@StepScope
public Processor processor() {
    return new Processor();
}

@Bean
@StepScope
public Writer writer() {
    return new Writer();
}

@Bean
public Step cleanupStep() {
    return stepBuilderFactory.get("cleanupStep")
            .tasklet(cleanupTasklet())
            .build();
}

@Bean
@StepScope
public CleanupTasklet cleanupTasklet() {
    return new CleanupTasklet();
}

@Bean
public TaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(10);
    executor.setMaxPoolSize(10);
    executor.setQueueCapacity(10);
    executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
    executor.setThreadNamePrefix("MultiThreaded-");
    return executor;
}
As the data set is huge, I have configured the thread pool value for the task executor as 10 and the grid size as 50. With this setup, 10 threads are writing to 10 files at a time, and the reader is reading in chunks, so the reader-processor-writer flow iterates multiple times (for a group of 10, before moving to the next partition).
Now, I would like to add a tasklet where I can compress the files once all iterations (read, process, write) for one thread are completed, i.e. after completion of each partition.
I do have a cleanup tasklet that runs at the end, but putting the compression logic there would mean first collecting all the files generated by every partition and only then performing the compression. Please suggest.
You can change your worker step multiOperationStep to be a FlowStep of a chunk-oriented step followed by a simple tasklet step where you do the compression. In other words, the worker step is actually two steps combined into one FlowStep.
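A minimal sketch of that idea, reusing the bean names from the question; the names chunkStep, compressionStep and CompressionTasklet are placeholders introduced here, and the existing reader/processor/writer move into the inner chunk step:

@Bean
public Step multiOperationStep() throws Exception {
    // Worker step used by the partitioner: chunk processing followed by compression.
    Flow flow = new FlowBuilder<SimpleFlow>("multiOperationFlow")
            .start(chunkStep())
            .next(compressionStep())
            .build();
    return stepBuilderFactory.get("multiOperationStep")
            .flow(flow)
            .build();
}

@Bean
public Step chunkStep() throws Exception {
    return stepBuilderFactory
            .get("chunkStep")
            .<Input, Output>chunk(100)
            .reader(reader())
            .processor(processor())
            .writer(writer())
            .build();
}

@Bean
public Step compressionStep() {
    return stepBuilderFactory.get("compressionStep")
            .tasklet(compressionTasklet())
            .build();
}

@Bean
@StepScope
public CompressionTasklet compressionTasklet() {
    // Hypothetical tasklet that zips the file produced by the current partition.
    return new CompressionTasklet();
}

The partitioned step keeps calling .step(multiOperationStep()), so the compression tasklet runs once per partition, right after that partition's chunk processing has finished.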

Define an in-memory JobRepository

I'm testing Spring Batch using Spring Boot. My need is to define jobs working on an Oracle database, but I don't want to save job and step state inside this DB.
I've read in the documentation that I can use an in-memory repository with the MapJobRepositoryFactoryBean.
So I've implemented this bean:
@Bean
public JobRepository jobRepository() {
    MapJobRepositoryFactoryBean factoryBean = new MapJobRepositoryFactoryBean(new ResourcelessTransactionManager());
    try {
        JobRepository jobRepository = factoryBean.getObject();
        return jobRepository;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
But when my job starts, the first thing Spring Batch does is create the tables in the Oracle DB, and it continues to use the Oracle datasource. It's as if my JobRepository definition isn't taken into account.
What did I miss?
EDIT: I'm using Spring Boot 1.5.3 and Spring Batch 3.0.7
With Spring Boot 2.x, the solution is simpler.
You have to extend the DefaultBatchConfigurer class like this:
@Component
public class NoPersistenceBatchConfigurer extends DefaultBatchConfigurer {

    @Override
    public void setDataSource(DataSource dataSource) {
    }
}
Without a datasource, the framework automatically switches to using the MapJobRepository.
A few things here:
If you have a DataSource configured in your ApplicationContext, by default Spring Batch will try to use it.
In order to not use a DataSource when one is available within the ApplicationContext, you'll need to create your own BatchConfigurer. You can do that by extending the DefaultBatchConfigurer.
Don't use the MapJobRepository except for testing purposes. It has a number of issues (thread safety, etc.) and is not recommended for production use. Use an in-memory database like HSQLDB instead (you'll still need to create your own BatchConfigurer to do so, as sketched below).
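A minimal sketch of that last point, assuming the Spring Batch schema scripts from spring-batch-core and HSQLDB are on the classpath, and that the Oracle datasource is used only for business data; the class name is made up for the example:

@Configuration
public class EmbeddedMetadataBatchConfigurer extends DefaultBatchConfigurer {

    public EmbeddedMetadataBatchConfigurer() {
        // Embedded HSQLDB that holds only the Spring Batch metadata;
        // the schema script ships inside spring-batch-core.
        super(new EmbeddedDatabaseBuilder()
                .setType(EmbeddedDatabaseType.HSQL)
                .addScript("/org/springframework/batch/core/schema-hsqldb.sql")
                .build());
    }

    @Override
    public void setDataSource(DataSource dataSource) {
        // Deliberately empty: keep the embedded metadata datasource even though
        // the application also defines an Oracle datasource for business data.
    }
}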
Thanks to pvpkiran's comment, I found my problem: it's necessary to define a JobLauncher bean.
Below is an example:
@Bean
public JobRepository jobRepository() {
    MapJobRepositoryFactoryBean factoryBean = new MapJobRepositoryFactoryBean(new ResourcelessTransactionManager());
    try {
        JobRepository jobRepository = factoryBean.getObject();
        return jobRepository;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}

@Bean
public JobLauncher jobLauncher(JobRepository jobRepository) {
    SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
    jobLauncher.setJobRepository(jobRepository);
    return jobLauncher;
}
If you are using Spring Boot and @EnableBatchProcessing, you would extend DefaultBatchConfigurer and override the createJobRepository method. Create a ResourcelessTransactionManager and a JobRepository using MapJobRepositoryFactoryBean; the rest of the beans will be auto-created by Spring Boot.
@Configuration
public class InMemoryBatchContextConfigurer extends DefaultBatchConfigurer {

    @Bean
    public ResourcelessTransactionManager resourcelessTransactionManager() {
        return new ResourcelessTransactionManager();
    }

    @Override
    protected JobRepository createJobRepository() throws Exception {
        MapJobRepositoryFactoryBean factoryBean = new MapJobRepositoryFactoryBean();
        factoryBean.setTransactionManager(resourcelessTransactionManager());
        return factoryBean.getObject();
    }
}
Extend the DefaultBatchConfigurer class and override the createJobRepository method, just like below.
@Configuration
public class InMemoryBatchConfigurer extends DefaultBatchConfigurer {

    @Override
    protected JobRepository createJobRepository() throws Exception {
        return new MapJobRepositoryFactoryBean().getObject();
    }
}

Spring Batch dynamic Flow/Job construction

I'm currently using Spring Batch to run a job that processes a file, does some stuff on each line and writes the output to another file.
This was developed in a 'core' product but now (as always) we have some client-specific requirements that mandate the inclusion of some extra steps in the job.
I've been able to do a proof of concept where I use common Spring features to 'replace' the job with another one that has the extra steps, either by using distinct names for the jobs (if we define them in the same Configuration class) or by creating a completely separate Configuration class and loading that as the Spring context.
What I'm asking, and I'm 'almost' there, is whether it's possible to easily define a base job (maybe with an initial step, or not) and then only add the steps that make sense for a specific 'client'.
I'm using standard class inheritance to do this, but it doesn't work properly with standard Spring facilities since Spring won't know which implementation of the getSteps method to use (code below).
abstract class JobConfig {

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    protected StepBuilderFactory stepBuilderFactory;

    @Bean
    Job job() {
        List<Step> steps = getSteps();
        final JobBuilder jobBuilder = jobBuilderFactory.get("job")
                .incrementer(new RunIdIncrementer());
        SimpleJobBuilder builder = jobBuilder.start(steps.remove(0));
        for (Step s : steps) {
            builder = builder.next(s);
        }
        return builder.build();
    }

    protected abstract List<Step> getSteps();
}

@Configuration
@Import(BaseConfig.class)
public class Client1JobConfig extends JobConfig {

    @Override
    protected List<Step> getSteps() {
        List<Step> steps = new ArrayList<>();
        steps.add(step1());
        return steps;
    }

    Step step1() {
        return stepBuilderFactory.get("step1")
                .<Integer, Integer>chunk(1)
                .reader(dummyReader())
                .processor(processor1())
                .writer(dummyWriter())
                .build();
    }
}

@Configuration
@Import(BaseConfig.class)
public class Client2JobConfig extends JobConfig {

    @Override
    protected List<Step> getSteps() {
        List<Step> steps = new ArrayList<>();
        steps.add(step1());
        steps.add(step2());
        return steps;
    }

    Step step1() {
        return stepBuilderFactory.get("step1")
                .<Integer, Integer>chunk(1)
                .reader(dummyReader())
                .processor(processor1())
                .writer(dummyWriter())
                .build();
    }

    Step step2() {
        return stepBuilderFactory.get("step2")
                .<Integer, Integer>chunk(1)
                .reader(dummyReader())
                .processor(processor2())
                .writer(dummyWriter())
                .build();
    }
}
I can make it work if I load just one Configuration class into the Spring context, but if all the Configuration classes are loaded (either by component scanning or by manually adding them to the context), of course it doesn't work because there's no way to select one implementation over the other.
I can also make it work by having differently-named jobs like "client1" and "client2", but let's say I can't change the calling code and the job is autowired. How can I have the 'same' job but with different steps?
Is there a better way to accomplish this?
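One possible approach, not from the original post, is to guard each client configuration with a Spring profile so that only one of them is loaded per deployment; the single autowired Job then keeps the same bean name but is built from client-specific steps. A minimal sketch, reusing the classes from the question:

@Profile("client1")
@Configuration
@Import(BaseConfig.class)
public class Client1JobConfig extends JobConfig {
    // getSteps() exactly as in the question
}

@Profile("client2")
@Configuration
@Import(BaseConfig.class)
public class Client2JobConfig extends JobConfig {
    // getSteps() exactly as in the question
}

The application is then started with, for example, --spring.profiles.active=client1, so only one job() bean ends up in the context and the calling code that autowires Job stays unchanged.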

run spring batch job from the controller

I am trying to run my batch job from a controller. It will be fired either by a cron job or by accessing a specific link.
I am using Spring Boot, no XML, just annotations.
In my current setup I have a service that contains the following beans:
@EnableBatchProcessing
@PersistenceContext
public class batchService {

    @Bean
    public ItemReader<Somemodel> reader() {
        ...
    }

    @Bean
    public ItemProcessor<Somemodel, Somemodel> processor() {
        return new SomemodelProcessor();
    }

    @Bean
    public ItemWriter writer() {
        return new CustomItemWriter();
    }

    @Bean
    public Job importUserJob(JobBuilderFactory jobs, Step step1) {
        return jobs.get("importUserJob")
                .incrementer(new RunIdIncrementer())
                .flow(step1)
                .end()
                .build();
    }

    @Bean
    public Step step1(StepBuilderFactory stepBuilderFactory,
                      ItemReader<Somemodel> reader,
                      ItemWriter<Somemodel> writer,
                      ItemProcessor<Somemodel, Somemodel> processor) {
        return stepBuilderFactory.get("step1")
                .<Somemodel, Somemodel>chunk(100)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}
As soon as I put the @Configuration annotation on top of my batchService class, the job starts as soon as I run the application. It finishes successfully, and everything is fine. Now I am trying to remove the @Configuration annotation and run the job whenever I want. Is there a way to fire it from the controller?
Thanks!
You need to create an application.yml file in src/main/resources and add the following configuration:
spring.batch.job.enabled: false
With this change, the batch job will not execute automatically when Spring Boot starts. The batch job will instead be triggered only when the specific link is accessed.
Check out my sample code here:
https://github.com/pauldeng/aws-elastic-beanstalk-worker-spring-boot-spring-batch-template
You can launch a batch job programmatically using JobLauncher which can be injected into your controller. See the Spring Batch documentation for more details, including this example controller:
@Controller
public class JobLauncherController {

    @Autowired
    JobLauncher jobLauncher;

    @Autowired
    Job job;

    @RequestMapping("/jobLauncher.html")
    public void handle() throws Exception {
        jobLauncher.run(job, new JobParameters());
    }
}
Since you're using Spring Boot, you should leave the @Configuration annotation in there and instead configure your application.properties to not launch the jobs on startup. You can read more about the autoconfiguration options for running jobs at startup (or not) in the Spring Boot documentation here: http://docs.spring.io/spring-boot/docs/current-SNAPSHOT/reference/htmlsingle/#howto-execute-spring-batch-jobs-on-startup
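For reference, the same setting shown in application.yml form in the previous answer looks like this in application.properties (the property is the standard Spring Boot one):

spring.batch.job.enabled=false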
