When I insert data through a DAO that references a MyBatis mapper, multiple tables are affected:
public void insertStuff(Collection<Stuff> data) {
    for (Stuff item : data) {
        mapper.insertT1(item.getT1Stuff());
        mapper.insertT2(item.getT2Stuff());
        Collection<MainStuff> mainData = item.getMainStuff();
        for (MainStuff mainItem : mainData) {
            mapper.insertMainData(mainItem);
        }
    }
}
I'm using MyBatis' BATCH executor type, but I'm quickly reaching Oracle's MAX_CURSOR limit because a new PreparedStatement (and a new open cursor) is created for each of the three mapper statements on every iteration of the main loop. I can avoid this by iterating over the collection multiple times:
public void insertStuff(Collection<Stuff> data) {
    for (Stuff item : data) {
        mapper.insertT1(item.getT1Stuff());
    }
    for (Stuff item : data) {
        mapper.insertT2(item.getT2Stuff());
    }
    for (Stuff item : data) {
        Collection<MainStuff> mainData = item.getMainStuff();
        for (MainStuff mainItem : mainData) {
            mapper.insertMainData(mainItem);
        }
    }
}
However, the latter code is less readable, carries a small performance cost, and breaks modularity.
Is there a better way to do this? Do I need to use the SqlSession directly and flush statements after a certain number are queued?
If you want to use batches, you should use the second approach. In the first version you don't actually get any batches: a real batch consists of N identical statements. If you execute three different statements and wrap them in a batch, the JDBC driver will split them into three batches of one statement each. In the second version there are three real batches, which is the fastest option when you have a lot of data.
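As for the follow-up about using the SqlSession directly: a minimal sketch of that approach is below, assuming a SqlSessionFactory is available and a StuffMapper interface holds the three statements (both names are illustrative, and the flush interval of 500 is arbitrary). It keeps the original loop shape and periodically flushes so that queued statements and their open cursors are released.

import java.util.Collection;

import org.apache.ibatis.session.ExecutorType;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

public class StuffDao {

    private final SqlSessionFactory sessionFactory;

    public StuffDao(SqlSessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    public void insertStuff(Collection<Stuff> data) {
        try (SqlSession session = sessionFactory.openSession(ExecutorType.BATCH)) {
            StuffMapper mapper = session.getMapper(StuffMapper.class);
            int count = 0;
            for (Stuff item : data) {
                mapper.insertT1(item.getT1Stuff());
                mapper.insertT2(item.getT2Stuff());
                for (MainStuff mainItem : item.getMainStuff()) {
                    mapper.insertMainData(mainItem);
                }
                // flush periodically so queued statements (and their cursors) are released
                if (++count % 500 == 0) {
                    session.flushStatements();
                }
            }
            session.flushStatements();
            session.commit();
        }
    }
}

Note that, per the answer above, grouping identical statements into their own loops still batches best; the periodic flush only caps the number of open cursors.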
Spring Batch is designed to read and process one item at a time, then write the list of all items processed in a chunk. I want my item to be a List<T> as well, to be thus read and processed, and then write a List<List<T>>. My data source is a standard Spring JpaRepository<T, ID>.
My question is whether there are some standard solutions for this "aggregated" approach. I see that there are some, but they don't read from a JpaRepository, like:
https://github.com/spring-projects/spring-batch/blob/main/spring-batch-samples/src/main/java/org/springframework/batch/sample/domain/multiline/AggregateItemReader.java
Spring Batch - Item Reader and ItemProcessor with a list
Spring Batch- how to pass list of multiple items from input to ItemReader, ItemProcessor and ItemWriter
Update:
I'm looking for a solution that works for a rapidly changing dataset and in a multithreaded environment.
I want my item to be a List<T> as well, to be thus read and processed, and then write a List<List<T>>.
Spring Batch is not (and should not be) aware of what an "item" is. It is up to you to design what an "item" is and how it is implemented (a single value, a list, a stream, etc.). In your case, you can encapsulate the List<T> in a type that can be used as an item and process data as needed. You would need a custom item reader, though.
The solution we found is to use a custom aggregate reader, as suggested here, which accumulates the read data into a list of a given size and then passes it along. For our specific use case, we read the data using a JpaPagingItemReader. The relevant part is:
public List<T> read() throws Exception {
    ResultHolder holder = new ResultHolder();
    // read until no more results are available or the aggregated size is reached
    while (!itemReaderExhausted && holder.getResults().size() < aggregationSize) {
        process(itemReader.read(), holder);
    }
    if (CollectionUtils.isEmpty(holder.getResults())) {
        return null;
    }
    return holder.getResults();
}

private void process(T readValue, ResultHolder resultHolder) {
    if (readValue == null) {
        itemReaderExhausted = true;
        return;
    }
    resultHolder.addResult(readValue);
}
In order to account for the volatility of the dataset, we extended the JPA reader and overrode the getPage() method to always return 0, and we controlled the dataset through the processor and writer so that the next batch of fresh data is always fetched from the first page. The hint was given here and in some other SO answers.
@Override
public int getPage() {
    return 0;
}
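Put together, the extended reader might look roughly like this (a sketch under the assumption that the reader subclasses JpaPagingItemReader; the class name is illustrative):

import org.springframework.batch.item.database.JpaPagingItemReader;

// Always re-reads the first page; the processor/writer are responsible for
// removing or marking processed rows so that "page 0" keeps yielding fresh data.
public class FirstPageJpaItemReader<T> extends JpaPagingItemReader<T> {

    @Override
    public int getPage() {
        return 0;
    }
}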
When operating on large data sets, Spring Data offers two abstractions: Stream and Page. We've been using Stream for a while with no issues, but recently I wanted to try a paginated approach and ran into a reliability issue.
Consider the following:
@Entity
public class MyData {
}

public interface MyDataRepository extends JpaRepository<MyData, UUID> {
}

@Component
public class MyDataService {

    private MyDataRepository repository;

    // Bridge between a Reactive service and a transactional / non-reactive database call
    @Transactional
    public void getAllMyData(final FluxSink<MyData> sink) {
        final Pageable firstPage = PageRequest.of(0, 500);
        Page<MyData> page = repository.findAll(firstPage);
        while (page != null && page.hasContent()) {
            page.getContent().forEach(sink::next);
            if (page.hasNext()) {
                page = repository.findAll(page.nextPageable());
            } else {
                page = null;
            }
        }
        sink.complete();
    }
}
The test used two Postgres 9.5 databases: the source had close to 100,000 rows while the destination was empty. The example code was then used to copy from the source to the destination. At the end I would find that my destination database had a far smaller row count than the source.
Run as a Spring Boot app
The flux doing the copy used 4-6 threads in parallel (for speed)
Total run time was at least an hour (max was 2 hours)
As it turns out, I was processing the same rows multiple times (and missing other rows as a result). This led me to a fix that others had already run into: you should provide a Sort.by argument.
After changing the service to use:
// Make our pages sorted by the PKEY
final Pageable firstPage = PageRequest.of(0, 500, Sort.by("id"));
I found that while it greatly helped, I would still process some rows multiple times (going from losing about half the rows to seeing only ~12 duplicates). When I use a Stream instead, I have no issues.
Does anyone have an explanation for what is going on? I don't see any duplicates until the test has been running for at least 10-15 minutes, which leads me to believe there is some kind of session or other timeout (either in the client or on the database) that causes the hiccups. But I'm well outside my area of expertise for troubleshooting it further.
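For reference, the Stream-based variant mentioned above might look roughly like this (a sketch assuming Spring Data JPA's streaming query support; the derived method name and the fetch-size hint value are illustrative):

import java.util.UUID;
import java.util.stream.Stream;

import javax.persistence.QueryHint;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.QueryHints;
import org.springframework.transaction.annotation.Transactional;

public interface MyDataRepository extends JpaRepository<MyData, UUID> {

    // Streams rows instead of paging; must be consumed inside an open transaction,
    // and the Stream should be closed when done (e.g. try-with-resources).
    @QueryHints(@QueryHint(name = "org.hibernate.fetchSize", value = "500"))
    Stream<MyData> findAllBy();
}

// In MyDataService
@Transactional(readOnly = true)
public void streamAllMyData(final FluxSink<MyData> sink) {
    try (Stream<MyData> stream = repository.findAllBy()) {
        stream.forEach(sink::next);
    }
    sink.complete();
}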
I have a Spring Batch app that gets a list of IDs and then calls read() to fetch one-to-many results for each ID. The issue is that I have no control over how many results come back for each ID, which means my chunking is spotty at best. Is there a suggested way to avoid spikes in memory/CPU? An example is below:
@Before
public void getIds() {
    *getListOfIds* // Usually around 10,000 or so
}

@Override
public List<AccountObject> read() {
    if (list of ids haven't all been used) {
        List<AccountObject> myAccounts = myService.getAccounts(id);
        return myAccounts; // This could be anywhere from 1 result to 100,000 results.
    } else {
        return null;
    }
}
So the myAccounts object above could be small or huge. This makes chunking basically useless, because at the moment I am chunking by List. I'd really rather chunk by individual AccountObject but don't see an easy way to do this.
Is there a class, strategy, etc. that I am missing here?
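One common way to get per-AccountObject chunking (offered here as a sketch, not a definitive answer) is to keep a buffer inside the reader: when the buffer is empty, fetch all accounts for the next ID and then hand them out one at a time, so the chunk size counts AccountObjects rather than lists. The class name, the AccountService type, and the Long ID type below are illustrative assumptions:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

import org.springframework.batch.item.ItemReader;

public class AccountItemReader implements ItemReader<AccountObject> {

    private final Iterator<Long> idIterator;   // the ~10,000 IDs read up front
    private final AccountService myService;    // illustrative stand-in for the service in the question
    private final Deque<AccountObject> buffer = new ArrayDeque<>();

    public AccountItemReader(List<Long> ids, AccountService myService) {
        this.idIterator = ids.iterator();
        this.myService = myService;
    }

    @Override
    public AccountObject read() {
        // refill the buffer from the next ID(s) until there is something to return
        while (buffer.isEmpty() && idIterator.hasNext()) {
            buffer.addAll(myService.getAccounts(idIterator.next()));
        }
        return buffer.poll(); // null once buffer and IDs are exhausted, signalling end of data
    }
}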
Use case:
A one-time read of data set X (from database) into a Collection C. [Collection size could be say 5000]
Use Collection C to process/enrich items in a Spring Batch Step (say enrichStep)
If C is much greater than what can be passed via ExecutionContext, how can we make it available in the ItemProcessor of the enrichStep?
In your enrichStep, add a StepExecutionListener.beforeStep and load your huge collection into a HugeCollectionBeanHolder bean.
This way the collection is loaded only once (when the step starts or restarts) and without persisting it into the execution context.
In your enrich processor, wire in the HugeCollectionBeanHolder to access the huge collection.
class HugeCollectionBeanHolder {

    private Collection<Item> hugeCollection;

    void setHugeCollection(Collection<Item> c) { this.hugeCollection = c; }
    Collection<Item> getHugeCollection() { return this.hugeCollection; }
}

class MyProcessor implements ItemProcessor<Input, Output> {

    private HugeCollectionBeanHolder hcbh;

    void setHugeCollectionBeanHolder(HugeCollectionBeanHolder bean) { this.hcbh = bean; }

    // other methods...
}
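A rough sketch of the beforeStep wiring described above, assuming some DAO (here called ItemDao with a loadDataSetX() method, both illustrative names) performs the one-time read:

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;

class LoadHugeCollectionListener implements StepExecutionListener {

    private final HugeCollectionBeanHolder holder;
    private final ItemDao itemDao; // illustrative DAO for the one-time read of data set X

    LoadHugeCollectionListener(HugeCollectionBeanHolder holder, ItemDao itemDao) {
        this.holder = holder;
        this.itemDao = itemDao;
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // loaded once per step (re)start, kept in memory only, never in the execution context
        holder.setHugeCollection(itemDao.loadDataSetX());
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        return stepExecution.getExitStatus();
    }
}

The listener is then registered on the enrichStep (e.g. via the step builder's listener(...) method).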
You can also look at Spring Batch: what is the best way to use, the data retrieved in one TaskletStep, in the processing of another step
I am trying to analyze what problems I might be having with unsafe threading in my code.
In my MVC3 web application I do the following:
// Caching code
public static class CacheExtensions
{
    private static readonly object sync = new object();

    public static T GetOrStore<T>(this Cache cache, string key, Func<T> generator)
    {
        var result = cache[key];
        if (result == null)
        {
            result = generator();
            lock (sync)
            {
                cache[key] = result;
            }
        }
        return (T)result;
    }
}
Using the caching like this:
// Using the cached stuff
public class SectionViewData
{
    public IEnumerable<Product> Products { get; set; }
    public IEnumerable<SomethingElse> SomethingElse { get; set; }
}

private void Testing()
{
    var cachedSection = HttpContext.Current.Cache.GetOrStore("Some Key", () => GetSectionViewData());
    // Threading problem?
    foreach (var product in cachedSection.Products)
    {
        DosomestuffwithProduct...
    }
}

private SectionViewData GetSectionViewData()
{
    SectionViewData viewData = new SectionViewData();
    viewData.Products = CreateProductList();
    viewData.SomethingElse = CreateSomethingElse();
    return viewData;
}
Could I run into problems with the IEnumerable? I don't have much experience with threading problems. The cachedSection would not get touched if some other thread adds a new value to the cache, right? To me this looks like it should work.
Should I cache Products and SomethingElse individually? Would that be better than caching the whole SectionViewData?
Threading is hard.
In your GetOrStore method, the get/generate sequence is entirely unsynchronized, so any number of threads can get null from the cache and run the generator function at the same time. This may or may not be a problem.
Your lock statement only locks the setter of cache[string], which is already thread safe and doesn't need to be "extra locked".
The variation of double-checked locking in the cache is suspect; I'd try to get rid of it. A thread that never enters the lock() section can read result without a memory barrier, so result may not be fully constructed by the time that thread sees it.
Enumerating the cached IEnumerables is safe as long as nothing modifies them at the same time. If GetSectionViewData() returns an object with immutable (as in non-changing) collections, you're safe.
Your code is missing some parts, such as how Products gets populated. Only in GetSectionViewData?
If so, then I don't see a major problem with your code.
There is, however, a chance that two threads generate the same data (the SectionViewData) for the same key. That shouldn't create a threading problem beyond doing the work twice, so if this is an expensive operation I would change the code so it only generates the data once per key. If it is not expensive, it works fine as is.
The IEnumerable for Products is not touched (assuming you create it separately per thread), but the enumerator over the cache itself is modified by each insert operation, so it is not thread safe. If you are enumerating the cache itself, I would be careful about that.