Why is Spring Boot @Async dropping items in my List argument? - spring-boot

I am experiencing some kind of threading issue with the @Async method annotation: one argument is a List of enums, and items are being dropped from it. The list is very small, just 2 items. The dropping of items is not immediate; it sometimes takes hours or days to appear.
This is the general flow of our program:
A Controller generates the said List in its @RequestMapping method and passes the list to a Service class, which makes a call to a database for batching and triggers an event for each item from the database, passing the list along. This list eventually gets passed into an @Async method, which then drops either the first item or both items.
Controller.methodA()
-> Creates list with two items in it
-> Calls void Service.methodX(list)
-> Load batch from database
-> Iterate over batch
-> Print items from list --- list intact
-> Calls void AsyncService.asyncMethod(list)
-> Print items from list --- eventually drops items here: always the first item, sometimes both.
Code configuration and bare-bones sample:
We configured it to have 2 threads:
@Configuration
@EnableAsync
public class AsyncConfig implements AsyncConfigurer {

    @Override
    public Executor getAsyncExecutor() {
        ThreadPoolTaskExecutor threadPoolTaskExecutor = new ThreadPoolTaskExecutor();
        threadPoolTaskExecutor.setMaxPoolSize(5); // Never actually creates 5 threads
        threadPoolTaskExecutor.setCorePoolSize(2); // Only 2 threads are ever created
        threadPoolTaskExecutor.initialize();
        return threadPoolTaskExecutor;
    }
}
This is a local replica to try to trigger the core issue, but no luck:
@RestController
public class ThreadingController {

    private final ThreadingService threadingService;

    public ThreadingController(ThreadingService threadingService) {
        this.threadingService = threadingService;
    }

    @GetMapping("/test")
    public void testThreads() {
        List<SomeEnum> list = new ArrayList<>();
        list.add(SomeEnum.FIRST_ENUM);
        list.add(SomeEnum.SECOND_ENUM);
        for (int i = 0; i < 1000; i++) {
            this.threadingService.workSomeThreads(i, list);
        }
    }
}
public enum SomeEnum {
    FIRST_ENUM("FIRST_ENUM"),
    SECOND_ENUM("SECOND_ENUM");

    @Getter
    private String name;

    SomeEnum(String name) {
        this.name = name;
    }
}
@Slf4j
@Service
public class ThreadingService {

    @Async
    public void workSomeThreads(int i, List<SomeEnum> list) {
        try {
            Thread.sleep(100L); // Add some delay to slow things down to trigger GC or other tests during processing
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        log.info("Count {} ; Here are the list items: {}", i, list.toString());
        assert(list.size() == 2);
    }
}
If we look through this, I have one controller simulating both the Controller and Service mentioned earlier. It spins through a batch of data, sending the same list over and over. There's an async method in another class to test that the list is the same. I was not able to replicate the issue locally, but this is the core problem.
To my knowledge, Java passes object references by value, so every variable passed into a method gets its own copy of the pointer to that object in memory, but I don't think that would cause us to run out of memory. We are running in PCF and don't see any memory spikes or anything during this time; memory is constant at around 50%. I also tried using a CopyOnWriteArrayList (thread safe) instead of ArrayList and the problem still exists.
Questions:
Any idea why the @Async method would drop items in the method argument? The list is never modified after construction, so why would items disappear? Why would the first item always disappear? Why not the second item? Why would both disappear?
Edit: So this question had little to do with @Async in the end. I found deeply nested code that removed items from the list, causing items to go missing.

To be precise, Java passes object references by value, so your @Async method receives a reference to the very same list object. The change in your list must therefore be due to some other code modifying it while the threads are executing; there is no other way the object would change its contents.
You must investigate the code after the section below to identify whether something is modifying the list:
-> Print items from list --- eventually drops items here: always the first item, sometimes both.
-> code following this might be changing the list.
Since the AsyncService executes its code asynchronously, some other code can modify the list in the meantime.
You may as well make the method parameters final (note that this only prevents reassigning the parameter, not mutating the list it points to).
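A more defensive option, assuming the async method does not need to share the caller's list (and assuming Java 10+ for List.copyOf), is to hand it an immutable snapshot so that later mutations by the caller or by deeply nested code, as in the edit above, cannot affect it. A minimal sketch:
// Caller side: pass a defensive, immutable copy into the @Async method.
// List.copyOf returns an unmodifiable list, so any later attempt to mutate
// it fails fast with UnsupportedOperationException instead of silently
// losing items.
List<SomeEnum> snapshot = List.copyOf(list);
this.threadingService.workSomeThreads(i, snapshot);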

Related

Safe processing data coming from KafkaListener

I'm implementing a Spring Boot app which reads some data from Kafka and provides it to all requesting clients. Let's say I have the following class:
@Component
public class DataProvider {

    private Prices prices;

    public DataProvider() {
        this.prices = Prices.of();
    }

    public Prices getPrices() {
        return prices;
    }
}
Each client may perform GET /api/prices to get info about the newest prices. Live price updates are consumed from Kafka. Since an update comes only every 5 seconds, which is not very often, the topic has only one partition.
I tried the very basic option using a Kafka listener:
@Component
public class DataProvider {

    private Prices prices;

    public DataProvider() {
        this.prices = Prices.of();
    }

    public Prices getPrices() {
        return prices;
    }

    @KafkaListener(topics = "test-topic")
    public void consume(String message) {
        Prices prices = Prices.of(message);
        this.prices = prices;
    }
}
Is this approach safe?
The prices field must be volatile, so that the reference written by the Kafka consumer thread is visible to the HTTP request threads. But again: you need to be sure it is acceptable for the prices data to be dispersed. One HTTP request may return one value while another, concurrent request returns a different one, simply because the Kafka consumer has just updated it.
You may also make consume() and getPrices() synchronized, so everyone gets the current data at the same moment. However, calls will then not run in parallel, since synchronized allows only one thread to access the object at a time.
Another way to get consistency is a ReadWriteLock: getPrices() calls can run in parallel, but while consume() holds the write lock, all readers are blocked until it is done.
So, with volatile in place your code is technically safe; the only question is whether it is safe from a business perspective.
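A minimal sketch of the volatile variant described above, reusing the class from the question:
@Component
public class DataProvider {

    // volatile guarantees that the reference written by the Kafka listener
    // thread is immediately visible to the HTTP request threads.
    private volatile Prices prices = Prices.of();

    public Prices getPrices() {
        return prices;
    }

    @KafkaListener(topics = "test-topic")
    public void consume(String message) {
        // Build the new snapshot first, then publish it with a single
        // reference assignment (reference writes are atomic in Java).
        this.prices = Prices.of(message);
    }
}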

Multiple writers for different types in the same Spring Batch step

I am writing a Spring Batch application with the following workflow:
Read some items of type A (using a FlatFileItemReader<A>).
Process an item, transforming it from A to B.
Write the processed items of type B (using a JdbcBatchItemWriter<B>)
Eventually, I should call an external service (a RESTful API, but it could be a SimpleMailMessageItemWriter<A>) using data from the source type A.
How can I configure such a workflow?
So far, I have found the following workaround:
Configuring a CompositeItemWriter<B> which delegates to:
The actual ItemWriter<B>
A custom ItemWriter<B> implementation which converts B back to A and then writes an A
But this is a cumbersome solution because it forces me to either:
Duplicate processing logic: from A to B and back again.
Sneakily hide some attributes from the source object A inside B, polluting the domain model.
Note: since my custom item writer for A needs to invoke an external service, I would like to perform this operation after B has been successfully written.
Here are the relevant parts of the batch configuration code.
@Bean
public Step step(StepBuilderFactory steps, ItemReader<A> reader, ItemProcessor<A, B> processor, CompositeItemWriter<B> writer) {
    return steps.get("step")
            .<A, B>chunk(10)
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .build();
}

@Bean
public CompositeItemWriter<B> writer(JdbcBatchItemWriter<B> jdbcBatchItemWriter, CustomItemWriter<B, A> customItemWriter) {
    return new CompositeItemWriterBuilder<B>()
            .delegates(jdbcBatchItemWriter, customItemWriter)
            .build();
}
For your use case, I would encapsulate A and B in a wrapper type, such as AB:
class AB {
    private A originalItem;
    private B transformedItem;
}
With that, you would have: ItemReader<A>, ItemProcessor<A, AB> and ItemWriter<AB>. The processor creates instances of AB in which it keeps a reference to the original item. The writer can then get access to both types and delegate to the JdbcBatchItemWriter<B> and SimpleMailMessageItemWriter<A> as needed, something like:
class ABItemWriter implements ItemWriter<AB> {

    private JdbcBatchItemWriter<B> jdbcBatchItemWriter;
    private SimpleMailMessageItemWriter mailMessageItemWriter;

    // constructor with delegates

    @Override
    public void write(List<? extends AB> items) throws Exception {
        jdbcBatchItemWriter.write(getBs(items));
        mailMessageItemWriter.write(getAs(items)); // this would not be called if the jdbc writer fails
    }
}
The methods getAs and getBs would extract items of type A/B from AB. Encapsulation for the win! BTW, a Java record is a good option for type AB.
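A minimal sketch of what those pieces might look like (getAs/getBs are the helpers named above; the record form of AB is just that suggestion spelled out, assuming Java 16+):
record AB(A originalItem, B transformedItem) {
}

// Inside ABItemWriter: split the wrapped items back into the two views
// that the delegate writers expect.
private List<B> getBs(List<? extends AB> items) {
    return items.stream().map(AB::transformedItem).collect(Collectors.toList());
}

private List<A> getAs(List<? extends AB> items) {
    return items.stream().map(AB::originalItem).collect(Collectors.toList());
}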

spring boot how to handle fault tolerance in async method?

Suppose I have a caller to distribute work to multiple async tasks:
public class Caller {

    @Autowired
    Worker worker;

    public boolean run() {
        for (int i = 0; i < 100; i++) {
            worker.asyncFindOrCreate(entities[i]);
        }
        return true;
    }
}

public class Worker {

    @Autowired
    Dao dao;

    @Async
    public User asyncFindOrCreate(User entity) {
        return dao.findByName(entity.getName()).orElseGet(() -> dao.save(entity));
    }
}
If we have 2 identical entities:
with a synchronous method, the first one will be created and then the second one will be retrieved as the existing entity;
with async, the second entity might pass the findByName check and go on to save because the first entity hasn't been saved yet, which causes the save of the second entity to throw a unique-constraint error.
Is there a way to add some fault-tolerance mechanism, with features like retry and skip-after-retry, in particular for database operations?
In this special case you should convert your array to a map, using the name property as the key, so there will be no duplicate entries.
However, if this method can also be called by multiple threads (i.e. it's in a web server) or there are multiple instances running, it's still not fail-safe.
In general, you should let the DB enforce uniqueness; there is no safer/easier way to do that. Put the save call inside a try-catch block and check/handle the unique-constraint exception.
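A minimal sketch of that catch-and-recover approach (the exception type assumes Spring's DataIntegrityViolationException is what wraps the unique-constraint violation; the Dao methods are the ones from the question):
@Async
public User asyncFindOrCreate(User entity) {
    return dao.findByName(entity.getName()).orElseGet(() -> {
        try {
            return dao.save(entity);
        } catch (DataIntegrityViolationException e) {
            // Another thread inserted the same name between our lookup and
            // save, so the unique constraint fired; re-read the winner.
            return dao.findByName(entity.getName()).orElseThrow(() -> e);
        }
    });
}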

Spring Data Solr @Transaction Commits

I currently have a setup where data is inserted into a database as well as indexed into Solr. These two steps are wrapped in a Spring-managed transaction via the @Transactional annotation. What I've noticed is that spring-data-solr issues an update with the following parameters whenever the transaction is closed: params{commit=true&softCommit=false&waitSearcher=true}
@Transactional
public void save(Object toSave){
    dbRepository.save(toSave);
    solrRepository.save(toSave);
}
The rate of commits into Solr is fairly high, so ideally I'd like to send data to the Solr index and have Solr auto-commit at regular intervals. I have autoCommit (and autoSoftCommit) set in my solrconfig.xml, but since spring-data-solr is sending those commit parameters, it does a hard commit every time.
I'm aware that I can drop down to the SolrTemplate API and issue commits manually, but I would like to keep the solrRepository.save call within a Spring-managed transaction if possible. Is there a way to modify the parameters that are sent to Solr on commit?
After putting an IDE debug breakpoint in org.springframework.data.solr.repository.support.SimpleSolrRepository here:
private void commitIfTransactionSynchronisationIsInactive() {
    if (!TransactionSynchronizationManager.isSynchronizationActive()) {
        this.solrOperations.commit(solrCollectionName);
    }
}
I discovered that wrapping my code as @Transactional (plus the other details needed to actually let the framework begin/end my code as a transaction) doesn't achieve what we expect with Spring Data for Apache Solr. The stack trace shows the proxy and transaction interceptor classes for my code's transactional scope, but then it also shows the framework starting its own nested transaction with another proxy and transaction interceptor of its own. When the framework exits the CrudRepository.save() method my code calls, the commit to Solr is performed by the framework's nested transaction. It happens before our outer transaction exits. So the attempt to batch-process many saves with one commit at the end, instead of one commit for every save, is futile. It seems that, for this area in my code, I'll have to use SolrJ to save (update) my entities to Solr and then follow "my" transaction's exit with a commit.
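A minimal sketch of that idea (class and collection names are illustrative, not from the original code, and it assumes Spring 5.3+ where TransactionSynchronization has default methods): save through SolrJ inside the transaction without committing, and register a synchronization so the Solr commit only runs after the outer transaction commits.
@Service
public class SolrBatchSaver {

    private final SolrClient solrClient;

    public SolrBatchSaver(SolrClient solrClient) {
        this.solrClient = solrClient;
    }

    @Transactional
    public void saveAll(List<Object> entities) throws Exception {
        for (Object entity : entities) {
            solrClient.addBean("myCollection", entity); // buffered in Solr, not committed yet
        }
        // Defer the Solr commit until the surrounding transaction commits.
        TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronization() {
            @Override
            public void afterCommit() {
                try {
                    solrClient.commit("myCollection");
                } catch (Exception e) {
                    throw new IllegalStateException("Solr commit failed", e);
                }
            }
        });
    }
}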
If you are using Spring Data Solr, I found that the SolrTemplate bean allows you to 'batch' updates when adding data to the Solr index. Using the SolrTemplate bean, you can call the saveBeans method, which adds an entire collection to the index and does not commit until the end of the transaction. In my case, I started out using solrClient.add() and it took up to 4 hours for my collection to get saved to the index by iterating over it, as it commits after every single save. By using solrTemplate.saveBeans(Collection<?>), it finishes in just over 1 second, as the commit is on the entire collection. Here is a code snippet:
@Resource
SolrTemplate solrTemplate;

public void doReindexing(List<Image> images) {
    if (images != null) {
        /* CMSSolrImage is a class with @SolrDocument mappings.
         * the List<Image> images is a collection pulled from my database
         * I want indexed in Solr.
         */
        List<CMSSolrImage> sImages = new ArrayList<CMSSolrImage>();
        for (Image image : images) {
            CMSSolrImage sImage = new CMSSolrImage(image);
            sImages.add(sImage);
        }
        solrTemplate.saveBeans(sImages);
    }
}
The way I've done something similar is to create a custom repository implementation of the save methods.
Interface for the repository:
public interface FooRepository extends SolrCrudRepository<Foo, String>, FooRepositoryCustom {
}
Interface for the custom overrides:
public interface FooRepositoryCustom {
    public Foo save(Foo entity);
    public Iterable<Foo> save(Iterable<Foo> entities);
}
Implementation of the custom overrides:
public class FooRepositoryImpl implements FooRepositoryCustom {

    private SolrOperations solrOperations;

    public FooRepositoryImpl(SolrOperations fooSolrOperations) {
        this.solrOperations = fooSolrOperations;
    }

    @Override
    public Foo save(Foo entity) {
        Assert.notNull(entity, "Cannot save 'null' entity.");
        registerTransactionSynchronisationIfSynchronisationActive();
        this.solrOperations.saveBean(entity, 1000);
        commitIfTransactionSynchronisationIsInactive();
        return entity;
    }

    @Override
    public Iterable<Foo> save(Iterable<Foo> entities) {
        Assert.notNull(entities, "Cannot insert 'null' as a List.");
        if (!(entities instanceof Collection<?>)) {
            throw new InvalidDataAccessApiUsageException("Entities have to be inside a collection");
        }
        registerTransactionSynchronisationIfSynchronisationActive();
        this.solrOperations.saveBeans((Collection<? extends Foo>) entities, 1000);
        commitIfTransactionSynchronisationIsInactive();
        return entities;
    }

    private void registerTransactionSynchronisationIfSynchronisationActive() {
        if (TransactionSynchronizationManager.isSynchronizationActive()) {
            registerTransactionSynchronisationAdapter();
        }
    }

    private void registerTransactionSynchronisationAdapter() {
        TransactionSynchronizationManager.registerSynchronization(SolrTransactionSynchronizationAdapterBuilder
                .forOperations(this.solrOperations).withDefaultBehaviour());
    }

    private void commitIfTransactionSynchronisationIsInactive() {
        if (!TransactionSynchronizationManager.isSynchronizationActive()) {
            this.solrOperations.commit();
        }
    }
}
and you also need to provide a SolrOperations bean for the right solr core:
@Configuration
public class FooSolrConfig {

    @Bean
    public SolrOperations getFooSolrOperations(SolrClient solrClient) {
        return new SolrTemplate(solrClient, "foo");
    }
}
Footnote: auto commit is (to my mind) conceptually incompatible with a transaction. An auto commit is a promise from Solr that it will try to start writing the data to disk within a certain time limit. Many things might stop that from actually happening, however: an untimely power or hardware failure, errors between the document and the schema, etc. But the client won't know that Solr failed to keep its promise, and the transaction will see a success when it actually failed.

Reset state before each Spring scheduled (#Scheduled) run

I have a Spring Boot Batch application that needs to run daily. It reads a daily file, does some processing on its data, and writes the processed data to a database. Along the way, the application holds some state such as the file to be read (stored in the FlatFileItemReader and JobParameters), the current date and time of the run, some file data for comparison between read items, etc.
One option for scheduling is to use Spring's #Scheduled such as:
@Scheduled(cron = "${schedule}")
public void runJob() throws Exception {
    jobRunner.runJob(); //runs the batch job by calling jobLauncher.run(job, jobParameters);
}
The problem here is that the state is maintained between runs. So I have to update the file to be read, the current date and time of the run, clear the cached file data, etc.
Another option is to run the application via a unix cron job. This would obviously meet the need to clear state between runs, but I prefer to tie the job scheduling to the application rather than the OS (and prefer it to be OS agnostic). Can the application state be reset between @Scheduled runs?
You could always move the code that performs your task (and more importantly, keeps your state) into a prototype-scoped bean. Then you can retrieve a fresh instance of that bean from the application context every time your scheduled method is run.
Example
I created a GitHub repository which contains a working example of what I'm talking about, but the gist of it is in these two classes:
ScheduledTask.java
Notice the @Scope annotation. It specifies that this component should not be a singleton. The randomNumber field represents the state that we want to reset with every invocation. "Reset" in this case means that a new random number is generated, just to show that it does change.
@Component
@Scope(ConfigurableBeanFactory.SCOPE_PROTOTYPE)
class ScheduledTask {

    private double randomNumber = Math.random();

    void execute() {
        System.out.printf(
                "Executing task from %s. Random number is %f%n",
                this,
                randomNumber
        );
    }
}
TaskScheduler.java
By autowiring the ApplicationContext, you can use it inside the scheduleTask method to retrieve a new instance of ScheduledTask.
@Component
public class TaskScheduler {

    @Autowired
    private ApplicationContext applicationContext;

    @Scheduled(cron = "0/5 * * * * *")
    public void scheduleTask() {
        ScheduledTask task = applicationContext.getBean(ScheduledTask.class);
        task.execute();
    }
}
Output
When running the code, here's an example of what it looks like:
Executing task from com.thomaskasene.example.schedule.reset.ScheduledTask@329c8d3d. Random number is 0.007027
Executing task from com.thomaskasene.example.schedule.reset.ScheduledTask@3c5b751e. Random number is 0.145520
Executing task from com.thomaskasene.example.schedule.reset.ScheduledTask@3864e64d. Random number is 0.268644
Thomas' approach seems to be a reasonable solution, which is why I upvoted it. What is missing is how this can be applied in the case of a Spring Batch job. Therefore I adapted his example a little bit:
@Component
public class JobCreatorComponent {

    @Bean
    @Scope(ConfigurableBeanFactory.SCOPE_PROTOTYPE)
    public Job createJob() {
        // use the jobBuilderFactory to create your job as usual
        return jobBuilderFactory.get() ...
    }
}
Your component with the launch method:
@Component
public class ScheduledLauncher {

    @Autowired
    private ... jobRunner;

    @Autowired
    private JobCreatorComponent creator;

    @Scheduled(cron = "${schedule}")
    public void runJob() throws Exception {
        // it would probably make sense to check the applicationContext and
        // remove any existing job
        creator.createJob(); // this should create a complete new instance of
                             // the Job
        jobRunner.runJob(); //runs the batch job by calling jobLauncher.run(job, jobParameters);
    }
}
I haven't tried out the code, but this is the approach I would try.
When constructing the job, it is important to ensure that all readers, processors and writers used in this job are completely new instances as well. This means that, unless they are instantiated as pure Java objects (not as Spring beans) or as Spring beans with scope "step", you must ensure that a new instance is always used, for example with a step-scoped reader like the sketch below.
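A minimal sketch of such a step-scoped reader (the inputFile job parameter and PassThroughLineMapper are illustrative; the point is that the bean is rebuilt for every step execution, so no state leaks between runs):
@Bean
@StepScope
public FlatFileItemReader<String> reader(@Value("#{jobParameters['inputFile']}") String inputFile) {
    // A new reader instance is created for each step execution because of @StepScope.
    return new FlatFileItemReaderBuilder<String>()
            .name("dailyFileReader")
            .resource(new FileSystemResource(inputFile))
            .lineMapper(new PassThroughLineMapper())
            .build();
}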
Edited:
How to handle singleton beans
Sometimes singleton beans cannot be avoided; in these cases there must be a way to "reset" them.
A simple approach would be to define an interface "ResetableBean" with a reset method that is implemented by such beans. @Autowired can then be used to collect a list of all such beans.
@Component
public class ScheduledLauncher {

    @Autowired
    private List<ResetableBean> resetables;
    ...

    @Scheduled(cron = "${schedule}")
    public void runJob() throws Exception {
        // reset all the singletons
        resetables.forEach(bean -> bean.reset());
        ...
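A minimal sketch of the ResetableBean interface and one implementing singleton (FileDataCache is a hypothetical example, not from the question):
public interface ResetableBean {
    // Clear any state accumulated during the previous run.
    void reset();
}

@Component
public class FileDataCache implements ResetableBean {

    private final Map<String, String> cache = new HashMap<>();

    public void put(String key, String value) {
        cache.put(key, value);
    }

    @Override
    public void reset() {
        cache.clear();
    }
}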
