Spring batch reader - How to avoid returning a list of objects - spring

So I have a spring batch app that I have getting a list of ids that it then uses 'read()' on to get 1 to many results back. The issue is, I have no control over how many results I get back for each id meaning that my chunking is spotty at best. Is there a suggested way to avoid spikes in memory/cpu? An example is below:
#Before
public void getIds() {
*getListOfIds* //Usually around 10,000 or so
}
#Override
public AccountObject read() {
if(list of ids havent all been used) {
List<AccountObject> myAccounts = myService.getAccounts(id);
return myAccounts; //This could be anywhere from 1 result to 100,000 results.
} else {
return null;
}
}
So the myAccounts object above could be small or huge. This causes chunking to basically be useless because at the moment I am chunking by List. I'd really rather chunk by straight AccountObject but don't see an easy way to do this.
Is there a class, strategy, etc. that I am missing here?

Related

Spring Batch - use JpaPagingItemReader to read lists instead of individual items

Spring Batch is designed to read and process one item at a time, then write the list of all items processed in a chunk. I want my item to be a List<T> as well, to be thus read and processed, and then write a List<List<T>>. My data source is a standard Spring JpaRepository<T, ID>.
My question is whether there are some standard solutions for this "aggregated" approach. I see that there are some, but they don't read from a JpaRepository, like:
https://github.com/spring-projects/spring-batch/blob/main/spring-batch-samples/src/main/java/org/springframework/batch/sample/domain/multiline/AggregateItemReader.java
Spring Batch - Item Reader and ItemProcessor with a list
Spring Batch- how to pass list of multiple items from input to ItemReader, ItemProcessor and ItemWriter
Update:
I'm looking for a solution that would work for a rapidly changing dataset and in a multithreading environment.
I want my item to be a List as well, to be thus read and processed, and then write a List<List>.
Spring Batch does not (and should not) be aware of what an "item" is. It is up to you do design what an "item" is and how it is implemented (a single value, a list, a stream , etc). In your case, you can encapsulate the List<T> in a type that could be used as an item, and process data as needed. You would need a custom item reader though.
The solution we found is to use a custom aggregate reader as suggested here, which accumulates the read data into a list of a given size then passes it along. For our specific use case, we read data using a JpaPagingItemReader. The relevant part is:
public List<T> read() throws Exception {
ResultHolder holder = new ResultHolder();
// read until no more results available or aggregated size is reached
while (!itemReaderExhausted && holder.getResults().size() < aggregationSize) {
process(itemReader.read(), holder);
}
if (CollectionUtils.isEmpty(holder.getResults())) {
return null;
}
return holder.getResults();
}
private void process(T readValue, ResultHolder resultHolder) {
if (readValue == null) {
itemReaderExhausted = true;
return;
}
resultHolder.addResult(readValue);
}
In order to account for the volatility of the dataset, we extended the JPA reader and overwritten the getPage() method to always return 0, and controlled the dataset through the processor and writer to have the next fresh data to be fetched always on the first page. The hint was given here and in some other SO answers.
public int getPage() {
return 0;
}

Spring Webflux throws a "block()/blockFirst()/blockLast() are blocking, which is not supported in thread reactor-http-nio-2"

I have a small issue with doing a blocking operation in Spring Webflux. I retrieve a list of article documents and from the list of article documents, i would like to update another object.
When i execute the below, sometimes it works and sometimes it throws a "block()/blockFirst()/blockLast() are blocking, which is not supported in thread reactor-http-nio-2". Could you please suggest how to fix. I dont really want to make it blocking but not sure how to proceed. There are similar threads in stackoverflow but not with respective to my requirement.
It would be really nice if someone could suggest a way to work around ?
private OrderInfo setPrices(final OrderInfo orderInfo) {
final List<ArticleDocument> articleDocuments = getArticleDocuments(orderInfo).block(); // Problematic line
for (ArticleDocument article : articleDocuments) {
//Update orderInfo based on one of the article price and few more condition.
}
return orderInfo;
}
private Mono<List<ArticleDocument>> getArticleDocuments(final OrderInfo orderInfo) {
return this.articleRepository.findByArticleName(orderInfo.getArticleName()).collectList();
}
It has to be something like this. Please take note that I have not tested it on my IDE. To modify anything please comment and figure it out together.
private Mono<OrderInfo> setPrices(final OrderInfo orderInfo) {
getArticleDocuments(orderInfo)
.map(articleDocuments -> {
articleDocuments.forEach(article -> // UPDATE AS YOU NEED);
return orderInfo;
});
private Mono<List<ArticleDocument>> getArticleDocuments(final OrderInfo orderInfo) {
return this.articleRepository.findByArticleName(orderInfo.getArticleName()).collectList();
}
Remember, you have to put everything under chaining. that's why you have to return Mono<OrderInfo> instead of OrderInfo from setPrices method. If you find my suggested code is tough to adapt to your current coding structure, you can show me the full code. Let's find out we can build a good chain or not.
BTW, you were using getArticleDocuments(orderInfo).block();. See? you were using .block()? Don't do that in a chain. don't ever block anything in a request to the response chain process. you will return mono or flux from the controller and everything will be handled by webflux

Performance in microservice-to-microservice data transfer

I have controller like this:
#RestController
#RequestMapping("/stats")
public class StatisticsController {
#Autowired
private LeadFeignClient lfc;
private List<Lead> list;
#GetMapping("/leads")
private int getCount(#RequestParam(value = "count", defaultValue = "1") int countType) {
list = lfc.getLeads(AccessToken.getToken());
if (countType == 1) {
return MainEngine.getCount(list);
} else if (countType == 2) {
return MainEngine.getCountRejected(list);
} else if (countType == 3) {
return MainEngine.getCountPortfolio(list);
} else if (countType == 4) {
return MainEngine.getCountInProgress(list);
} else if (countType == 5) {
return MainEngine.getCountForgotten(list);
} else if (countType == 6) {
return MainEngine.getCountAddedInThisMonth(list);
} else if (countType == 7) {
return MainEngine.getCountAddedInThisYear(list);
} else {
throw new RuntimeException("Wrong mapping param");
}
}
#GetMapping("/trends")
private boolean getTrend() {
return MainEngine.tendencyRising(list);
}
It is basically a microservice that will handle statistics basing on list of 'Business Leads'. FeignClient is GETting list of trimmed to the required data leads. Everything is working properly.
My only concern is about performance - all of this statistics (countTypes) are going to be presented on the landing page of webapp. If i will call them one by one, does every call will retrieve lead list again and again? Or list will be somehow stored in temporary memory? I can imagine that if list become longer, it could take a while to load them.
I've tried to call them outside this method, by #PostConstruct, to populate list at the start of service, but this solution has two major problems: authentication cannot be handled by oauth token, retrieved list will be insensitive to adding/deleting leads, cause it is loaded at the beginning only.
The list = lfc.getLeads(AccessToken.getToken()); will be called with each GET request. Either take a look at caching the responses which might be useful when you need to obtain a large volume of data often.
I'd start here: Baeldung's: Spring cache tutorial which gives you an idea about the caching. Then you can take a look at the EhCache implementation or implement own interceptor putting/getting from/to external storage such as Redis.
The caching is the only way I see to resolve this: Since the Feign client is called with a different request (based on the token) the data are not static and need to be cached.
You need to implement a caching layer to improve performance. What you can do is, you can have cache preloaded immediately after application starts. This way you will have the response ready in the cache. I would suggest to go with Redis cache. But any cache will do the job.
Also, it will be better if you can move the logic of getCount() to some service class.

Spring data Page/Pageable returns duplicates on large data sets?

When operating on large data sets, Spring Data presents two abstractions: Stream and Page. We've been using Stream for awhile and had no issues, but recently I wanted to try a paginated approach and ran into a reliability issue.
Consider the following:
#Entity
public class MyData {
}
public interface MyDataRepository extends JpaRepository<MyData, UUID> {
}
#Component
public class MyDataService {
private MyDataRepository repository;
// Bridge between a Reactive service and a transactional / non-reactive database call
#Transactional
public void getAllMyData(final FluxSink<MyData> sink) {
final Pageable firstPage = PageRequest.of(0, 500);
Page<MyData> page = repository.findAll(firstPage);
while (page != null && page.hasContent()) {
page.getContent().forEach(sink::next);
if (page.hasNext()) {
page = repository.findAll(page.nextPageable());
}
else {
page = null;
}
}
sink.complete();
}
}
Using two Postgres 9.5 databases, the source database had close to 100,000 rows while the destination was empty. The example code was then used to copy from the source to the destination. At the end I would find that my destination database had far smaller row count than the source.
Run as a springboot app
The flux doing the copy was using 4-6 threads in parallel (for speed)
Total run time of at least an hour (max was 2 hours)
As it turns out, I was eventually processing the same rows multiple times (and missing other rows as a result). This lead me to discovering a fix that others had already ran into, where you should provide a Sort.by("") argument.
After changing the service to use:
// Make our pages sorted by the PKEY
final Pageable firstPage = PageRequest.of(0, 500, Sort.by("id"));
I found that while it GREATLY helped, I would still process multiple rows (from losing about half the rows to only seeing ~12 duplicates). When I use a Stream instead, I have no issues.
Does anyone have any explanation for what is going on? I don't seem to have any duplicates come through until the test has been running for at least 10-15min, which almost leads me to believe that there is some kind of session or other timeout happening (either in the client, or on the database) that causes the hiccups. But I'm really far out of my knowledge area for troubleshooting it further heh.

Thread safe caching

I am trying to analyze what problem i might be having with unsafe threading in my code.
In my mvc3 webapplication i try to the following:
// Caching code
public static class CacheExtensions
{
public static T GetOrStore<T>(this Cache cache, string key, Func<T> generator)
{
var result = cache[key];
if(result == null)
{
result = generator();
lock(sync) {
cache[key] = result;
}
}
return (T)result;
}
}
Using the caching like this:
// Using the cached stuff
public class SectionViewData
{
public IEnumerable<Product> Products {get;set;}
public IEnumerable<SomethingElse> SomethingElse {get;set;}
}
private void Testing()
{
var cachedSection = HttpContext.Current.Cache.GetOrStore("Some Key", 0 => GetSectionViewData());
// Threading problem?
foreach(var product in cachedSection.Products)
{
DosomestuffwithProduct...
}
}
private SectionViewData GetSectionViewData()
{
SectionViewData viewData = new SectionViewData();
viewData.Products = CreateProductList();
viewData.SomethingElse = CreateSomethingElse();
return viewData;
}
Could i run inte problem with the IEnumerable? I dont have much experience with threading problems. The cachedSection would not get touched if some other thread adds a new value to cache right? To me this would work!
Should i cache Products and SomethingElse indivually? Would that be better than caching the whole SectionViewData??
Threading is hard;
In your GetOrStore method, the get/generator sequence is entirely unsynchronized, so any nymber of threads can get null from the cache and run the generator function at the same time. This may - or may not - be a problem.
Your lock statement only locks the setter of cache[string], which is already thread safe and doesn't need to be "extra locked".
The variation of double-checked locking in the cache is suspect, I'd try to get rid of it. Since the thread that never enters the lock() section can get result without a memory barrier, result may not be entirely constructed by the time the thread gets it.
Enumerating the cached IEnumrators is safe as long as nothing modifies them at the same time. If GetSectionViewData() returns an object with immutable (as in non changing) collections, you're safe.
Your code is missing parts like how would Products be populated? Only in GetSectionViewData?
If so, then I don't see a major problem with your code.
There is however a chance that two threads generate the same data(CachedSection) for the same key, it shouldn't create a threading problem except that you are doing the work twice, so if this was an expensive operation I would change the code so it only generates it once per key. If it is not expensive, it works fine as is.
IEnumerable for Products is not touched (assuming you create it separately per thread, but the enumerator on the cache is modified for each insert operation, hence it is not thread safe. So if you are using this I would be careful about that.

Resources