Spring Batch Slow Write and Read

I have a batch job that reads records from SQL Server and writes them into MariaDB. Even though I have implemented partitioning in the batch process, the process is very slow.
Below is the DataSource configuration for the source and target systems.
#Bean(name = "sourceSqlServerDataSource")
public DataSource mysqlDataSource() {
HikariDataSource hikariDataSource = new HikariDataSource();
hikariDataSource.setMaximumPoolSize(100);
hikariDataSource.setUsername(username);
hikariDataSource.setPassword(password);
hikariDataSource.setJdbcUrl(jdbcUrl);
hikariDataSource.setDriverClassName(driverClassName);
hikariDataSource.setPoolName("Source-SQL-Server");
return hikariDataSource;
}
#Bean(name = "targetMySqlDataSource")
#Primary
public DataSource mysqlDataSource() {
HikariDataSource hikariDataSource = new HikariDataSource();
hikariDataSource.setMaximumPoolSize(100);
hikariDataSource.setUsername(username);
hikariDataSource.setPassword(password);
hikariDataSource.setJdbcUrl(jdbcUrl);
hikariDataSource.setDriverClassName(driverClassName);
hikariDataSource.setPoolName("Target-Myql-Server");
return hikariDataSource;
}
Below is my bean configuration for the thread pool task executor.
#Bean(name = "myBatchJobsThreadPollTaskExecutor")
public ThreadPoolTaskExecutor initializeThreadPoolTaskExecutor() {
ThreadPoolTaskExecutor threadPoolTaskExecutor = new ThreadPoolTaskExecutor();
threadPoolTaskExecutor.setCorePoolSize(100);
threadPoolTaskExecutor.setMaxPoolSize(200);
threadPoolTaskExecutor.setThreadNamePrefix("My-Batch-Jobs-TaskExecutor ");
threadPoolTaskExecutor.setWaitForTasksToCompleteOnShutdown(Boolean.TRUE);
threadPoolTaskExecutor.initialize();
log.info("Thread Pool Initialized with min {} and Max {} Pool Size",threadPoolTaskExecutor.getCorePoolSize(),threadPoolTaskExecutor.getMaxPoolSize() );
return threadPoolTaskExecutor;
}
Here are the main step and partition step configurations.
#Bean(name = "myMainStep")
public Step myMainStep() throws Exception{
return stepBuilderFactory.get("myMainStep").chunk(500)
.reader(myJdbcReader(null,null))
.writer(myJpaWriter()).listener(chunkListener)
.build();
}
#Bean
public Step myPartitionStep() throws Exception {
return stepBuilderFactory.get("myPartitionStep").listener(myStepListener)
.partitioner(myMainStep()).partitioner("myPartition",myPartition)
.gridSize(50).taskExecutor(asyncTaskExecutor).build();
}
Update: here are the reader and writer.
#Bean(name = "myJdbcReader")
#StepScope
public JdbcPagingItemReader myJdbcReader(#Value("#{stepExecutionContext[parameter1]}") Integer parameter1, #Value("#{stepExecutionContext[parameter2]}") Integer parameter2) throws Exception{
JdbcPagingItemReader jdbcPagingItemReader = new JdbcPagingItemReader();
jdbcPagingItemReader.setDataSource(myTargetDataSource);
jdbcPagingItemReader.setPageSize(500);
jdbcPagingItemReader.setRowMapper(myRowMapper());
Map<String,Object> paramaterMap=new HashMap<>();
paramaterMap.put("parameter1",parameter1);
paramaterMap.put("parameter2",parameter2);
jdbcPagingItemReader.setQueryProvider(myQueryProvider());
jdbcPagingItemReader.setParameterValues(paramaterMap);
return jdbcPagingItemReader;
}
#Bean(name = "myJpaWriter")
public ItemWriter myJpaWriter(){
JpaItemWriter<MyTargetTable> targetJpaWriter = new JpaItemWriter<>();
targetJpaWriter.setEntityManagerFactory(localContainerEntityManagerFactoryBean.getObject());
return targetJpaWriter;
}
Can someone shed some light on how to improve the read/write performance using Spring Batch?

Improving the performance of such an application depends on multiple parameters (grid size, chunk size, page size, thread pool size, db connection pool size, latency between the db servers and your JVM, etc). So I can't give you a precise answer to your question, but I will try to provide some guidelines:
Before starting to improve performance, you need to clearly define a baseline + target. Saying "it is slow" makes no sense. Get yourself ready with at least a JVM profiler and a SQL client with a query execution plan analyser. Those are required to find the performance bottleneck, either in your JVM or in your database.
Setting the grid size to 50 and using a thread pool with core size = 100 means 50 threads will be created but not used. Make sure you are using the thread pool task executor in .taskExecutor(asyncTaskExecutor) and not a SimpleAsyncTaskExecutor, which does not reuse threads.
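For illustration, here is a minimal sketch of the partition step wired explicitly to the pooled executor defined above; the @Qualifier-based injection is my assumption, the other names come from the question:

@Bean
public Step myPartitionStep(@Qualifier("myBatchJobsThreadPollTaskExecutor") ThreadPoolTaskExecutor poolExecutor) throws Exception {
    return stepBuilderFactory.get("myPartitionStep").listener(myStepListener)
            .partitioner(myMainStep()).partitioner("myPartition", myPartition)
            .gridSize(50)
            // a pooled executor reuses its threads across partitions,
            // unlike SimpleAsyncTaskExecutor which creates a new thread per task
            .taskExecutor(poolExecutor)
            .build();
}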
50 partitions for 250k records seems a lot to me. You will have 5000 records per partition, and each partition will yield 10 transactions (since chunkSize = 500). So you will have 10 transactions x 50 partitions = 500 transactions between the two database servers and your JVM. This can be a performance issue. I would recommend starting with fewer partitions, 5 or 10 for example. Increasing concurrency does not necessarily mean increasing performance. There is always a break-even point where your app will spend more time on context switching and dealing with concurrency than on doing its business logic. Finding that point is an empirical process.
I would run any SQL query outside of any Spring Batch job first to see if there is a performance issue with the query itself (a query grabbing too many columns, too many records, etc) or with the db schema (a missing index, for example).
I would not use JPA/Hibernate for such an ETL job. Mapping data to domain objects can be expensive, especially if the O/R mapping is not optimized. Raw JDBC is usually faster in these cases.
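As a rough sketch of what that could look like here, a JdbcBatchItemWriter writing to the target DataSource might replace the JpaItemWriter; the table name, column names and the :colA/:colB properties of MyTargetTable below are hypothetical placeholders:

@Bean(name = "myJdbcWriter")
public JdbcBatchItemWriter<MyTargetTable> myJdbcWriter(@Qualifier("targetMySqlDataSource") DataSource targetDataSource) {
    // writes each chunk as one batched JDBC insert, with no O/R mapping involved
    return new JdbcBatchItemWriterBuilder<MyTargetTable>()
            .dataSource(targetDataSource)
            .sql("INSERT INTO my_target_table (col_a, col_b) VALUES (:colA, :colB)")
            .beanMapped() // :colA and :colB are resolved from MyTargetTable getters
            .build();
}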
There are a lot of other tricks, like estimating the size of an item in memory and making sure the total chunk size in memory is < heap size to avoid unnecessary GC within a chunk, choosing the right GC algorithm for batch apps, etc, but those are somewhat advanced. The list of guidelines above is a good starting point IMO.
Hope this helps!

Related

Spring Batch working only with chunk size set to 1, otherwise resulting in OptimisticLockingFailureException?

I am implementing Spring Batch using the chunk model: an ItemReader to read from a list, an ItemProcessor to perform business logic, and an ItemWriter to finally write to the database.
Batch processing works fine with a chunk size of 1, but I run into an OptimisticLockingFailureException when increasing the chunk size: org.springframework.dao.OptimisticLockingFailureException: Attempt to update step execution id=XXXXXX with wrong version (X), where current version is Y.
Here is my Spring batch configuration.
@Bean(BatchJob.BATCH_JOB)
public Job importJob(@Qualifier(BatchJob.READER) ItemReader<Details> reader,
                     @Qualifier(BatchJob.WRITER) ItemWriter<Details> writer,
                     @Qualifier(BatchJob.PROCESSOR) ItemProcessor<Details, Details> processor,
                     @Qualifier(BatchJob.TASK_EXECUTOR) TaskExecutor taskExecutor) {
    final Step writeToDatabase = stepBuilderFactory.get(BatchJob.BATCH_STEP)
            .<Details, Details>chunk(chunkSize)
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .transactionManager(transactionManager)
            .taskExecutor(taskExecutor)
            .throttleLimit(throttleLimit)
            .build();
    return jobBuilderFactory.get(BatchJob.JOB_BUILDER_FACTORY)
            .incrementer(new RunIdIncrementer())
            .start(writeToDatabase)
            .build();
}
I suspect the ItemProcessor, as an empty no-op ItemProcessor works fine even with a higher chunk size. @Transactional(propagation = Propagation.REQUIRES_NEW) is used in our business logic.
Could this be caused by a clash between Spring Batch's transaction management and the one used in our business logic, or are there other possible reasons?
Is there any advantage to using Spring Batch with a chunk size of 1 over performing the same business logic in a normal loop, one item at a time?

Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member

I am using spring-kafka with a recordFilterStrategy.
#Bean("manualImmediateListenerContainerFactory")
public KafkaListenerContainerFactory<ConcurrentMessageListenerContainer<Object, Object>> manualImmediateListenerContainerFactory(
ConsumerFactory<Object, Object> consumerFactory) {
ConcurrentKafkaListenerContainerFactory<Object, Object> factory = new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(consumerFactory);
factory.getContainerProperties().setPollTimeout(9999999);
factory.setBatchListener(false);
//配置手动提交offset
factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
factory.setAckDiscarded(true);
factory.setRecordFilterStrategy(new RecordFilterStrategy<Object, Object>() {
#Override
public boolean filter(ConsumerRecord<Object, Object> consumerRecord) {
Shipment shipment = (Shipment) consumerRecord.value();
return shipment.getType().contains("YAW");
}
});
return factory;
}
Here I have set factory.setAckDiscarded(true). When the consumer receives a message that should be discarded, it tries to ack the discarded message and then gets an exception like the one below.
I have already increased max.poll.interval.ms and decreased the maximum size of the batches returned by poll().
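Roughly, this is where those two settings live in the consumer configuration (a sketch; the concrete values are only placeholders):

@Bean
public ConsumerFactory<Object, Object> consumerFactory(KafkaProperties kafkaProperties) {
    Map<String, Object> props = new HashMap<>(kafkaProperties.buildConsumerProperties());
    // give the consumer more time between poll() calls before it is kicked out of the group
    props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000);
    // return fewer records per poll() so each batch is processed well within that window
    props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);
    return new DefaultKafkaConsumerFactory<>(props);
}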
Any hints will be highly appreciated!
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
I noticed in the Kafka console that the group was continuously preparing to rebalance. Basically, I think the issue is caused by the Kafka broker being unstable, unless the Spring application code itself has the issue.

How to use JedisConfig pool efficiently without increasing number of connections more than maxtotal?

The Jedis pool is not working as expected. I have set the maximum number of active connections to 10, but it is allowing more than 10 connections.
I have overridden the getConnection() method from RedisConnectionFactory. This method has been called almost 30 times to get a connection.
I have configured the Jedis pool as shown below.
Can someone please help me understand why it is creating more connections than maxTotal? And can someone please also help me with closing the Jedis connection pool?
@Configuration
public class RedisConfiguration {

    @Bean
    public RedisTenantDataFactory redisTenantDataFactory() {
        JedisPoolConfig poolConfig = new JedisPoolConfig();
        poolConfig.setMaxIdle(1);
        poolConfig.setMaxTotal(10);
        poolConfig.setBlockWhenExhausted(true);
        poolConfig.setMaxWaitMillis(10);
        JedisConnectionFactory jedisConnectionFactory = new JedisConnectionFactory(poolConfig);
        jedisConnectionFactory.setHostName(redisHost);
        jedisConnectionFactory.setUsePool(true);
        jedisConnectionFactory.setPort(Integer.valueOf(redisPort));
    }
#####
    @Bean
    public RedisTemplate<String, Object> redisTemplate(@Autowired RedisConnectionFactory redisConnectionFactory) {
        RedisTemplate<String, Object> template = new RedisTemplate<>();
        template.setConnectionFactory(redisConnectionFactory);
        template.afterPropertiesSet();
        return template;
    }
}
I have overridden the getConnection() method from RedisConnectionFactory. This method has been called almost 30 times to get a connection.
This is probably a misunderstanding of the connection pool behaviour; without the details of how you are using the pool in your application, I can only guess.
So your pool is configured as follows:
...
poolConfig.setMaxIdle(1);
poolConfig.setMaxTotal(10);
poolConfig.setBlockWhenExhausted(true)
...
This means, as you expect, that you will not have more than 10 active connections from this specific pool to Redis.
You can check the number of clients (open connections) from Redis itself using RedisInsight or the CLIENT LIST command; you will see that you do not have more than 10 connections coming from this JVM.
The fact that you see many calls to getConnection() is just because your application calls it each time a connection is needed.
This does NOT mean "open a new connection"; it means "give me a connection from the pool", and your configuration defines the behaviour, as follows:
poolConfig.setMaxIdle(1) => the pool will keep at most 1 idle connection open and available for your application. It is important to choose a good number here, since creating a new connection takes time and resources (1 is probably too low for a normal application).
poolConfig.setMaxTotal(10) => this means that the pool will not have more than 10 connections open at the same time. So you MUST define what happens when all 10 are in use and your app needs one. This is where the next setting comes in:
poolConfig.setBlockWhenExhausted(true) => this means that if your application is already using all 10 "active" connections and it calls getConnection(), the call will "block" until one of the 10 connections is returned to the pool.
So "blocking" is probably not a very good idea... (but once again, it depends on your application)
Maybe you are wondering why your application calls getConnection() 30 times, and why it does not stop/block at 10.
Because your code is good ;). What I mean by that is that your application does the following:
1- Jedis jedis = pool.getResource(); (this takes one active connection from the pool)
2- you use the jedis connection as much as needed
3- you close the connection with jedis.close() (this does not necessarily close the real connection; it returns the connection to the pool, and the pool can reuse it or close it depending on the application/configuration)
Does it make sense?
Usually you will work with the following code
// Jedis implements Closeable, so the jedis instance will be auto-closed after the last statement.
try (Jedis jedis = pool.getResource()) {
    // ... do stuff here ... for example
    jedis.set("foo", "bar");
    String foobar = jedis.get("foo");
    jedis.zadd("sose", 0, "car");
    jedis.zadd("sose", 0, "bike");
    Set<String> sose = jedis.zrange("sose", 0, -1);
}
// ... when closing your application:
pool.close();
You can find more information about JedisPool and Apache Commons Pool here:
Getting-started
Apache Commons Pool

How to stop consuming messages from Kafka when an error occurs, and restart consuming again after some time, in Spring Boot

This is the first time I am using Kafka. I have a Spring Boot application that consumes messages from Kafka topics and stores them in a DB. I have a requirement to handle DB failover: if the DB is down, the message should not be committed and consumption should be suspended for some time, after which the listener can start consuming messages again. What is the best approach to do this?
I am using spring-kafka:2.2.8.RELEASE, which internally uses Kafka 2.0.1.
Configure a ContainerStoppingErrorHandler and throw an exception from your listener.
https://docs.spring.io/spring-kafka/docs/2.2.13.RELEASE/reference/html/#container-stopping-error-handlers
You can restart the container later when you have detected that your DB is back online.
https://docs.spring.io/spring-kafka/docs/2.2.13.RELEASE/reference/html/#kafkalistener-lifecycle
EDIT
@SpringBootApplication
public class So62125817Application {

    public static void main(String[] args) {
        SpringApplication.run(So62125817Application.class, args);
    }

    @Bean
    TaskScheduler scheduler() {
        return new ThreadPoolTaskScheduler();
    }

    @Bean
    public NewTopic topic() {
        return TopicBuilder.name("so62125817").partitions(1).replicas(1).build();
    }

}

@Component
class Listener {

    private final TaskScheduler scheduler;

    private final KafkaListenerEndpointRegistry registry;

    public Listener(TaskScheduler scheduler, KafkaListenerEndpointRegistry registry,
            AbstractKafkaListenerContainerFactory<?, ?, ?> factory) {
        this.scheduler = scheduler;
        this.registry = registry;
        factory.setErrorHandler(new ContainerStoppingErrorHandler());
    }

    @KafkaListener(id = "so62125817.id", topics = "so62125817")
    public void listen(String in) {
        System.out.println(in);
        // run this code if you want to stop the container and restart it in 60 seconds
        this.scheduler.schedule(() -> {
            this.registry.getListenerContainer("so62125817.id").start();
        }, new Date(System.currentTimeMillis() + 60_000));
        throw new RuntimeException("test restart");
    }

}
There are two approaches I can think of for doing this:
First approach: Leave the auto-commit option for consuming messages set to true. The configuration for this is enable.auto.commit. By default, this is true, so you do not need to change anything. Whenever your DB operation fails, you can put the message on a different topic, say a topic named failed_events. When you do this, you can have the same application (which populates the DB) run, say, once a day to consume the messages from the failed_events topic and populate the DB again. This way you can keep track of how many times the DB write has failed. One small thing to note: what if the DB is also down during that run? You have to decide what to do in this case: probably discard the message if it is OK to do so, or do a certain number of retries.
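A minimal sketch of that first approach (the topic name failed_events comes from above; the source topic name, listener signature, and the repository and kafkaTemplate fields are assumptions):

@KafkaListener(topics = "events")
public void onEvent(ConsumerRecord<String, String> record) {
    try {
        repository.save(record.value()); // the DB write that may fail
    } catch (DataAccessException e) {
        // park the failed record on the retry topic instead of failing the listener
        kafkaTemplate.send("failed_events", record.key(), record.value());
    }
}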
Second approach: If it is deterministic to know how long the DB will be down, and if that period is very small, then it is better to sleep on a DB write failure. Say the application sleeps for 10 minutes before it retries again. You will not have to create a separate topic in this case.
The advantage of this approach is that you don't have to run a separate instance of the same application to fetch from a different topic. You can do everything in one single application, which makes maintenance relatively easier.
The disadvantage of this approach is that if the DB is down for a very long period, say 1 day, then you will end up losing the message.

Which one to choose between Spring scheduler and JMS? And what is the difference between them?

I am using Spring scheduler and JMS. Which one would be the better approach for scheduling?
@Service
public class ScheduledProcessor implements Processor {

    private final AtomicInteger counter = new AtomicInteger();

    @Autowired
    private Worker worker;

    @Scheduled(fixedDelay = 30000)
    public void process() {
        System.out.println("processing next 10 at " + new Date());
        for (int i = 0; i < 10; i++) {
            worker.work(counter.incrementAndGet());
        }
    }
}
These solutions are fundamentally different.
Scheduled services are kicked off every n milliseconds after the last run and process whatever is available. They are not guaranteed to process in a timely manner and might not scale if the amount of data to process grows (and if the processing has a certain level of complexity).
I tend to lean towards JMS. First off, messages are processed as they come in, pushed to the listener, rather than being polled as in a scheduled service. Second, if necessary you can scale the message processing both horizontally and vertically, giving you more knobs to make sure that the actual processing doesn't overwhelm the application.
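For contrast with the @Scheduled example above, a push-based JMS consumer is just an annotated listener method; a minimal sketch, where the destination name workQueue and the concurrency range are assumptions:

@Component
public class WorkListener {

    @Autowired
    private Worker worker;

    // invoked by the listener container as messages arrive; no polling loop involved
    @JmsListener(destination = "workQueue", concurrency = "1-5")
    public void onMessage(Integer item) {
        worker.work(item);
    }
}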
Basic question might be: what are your requirements?
