I'd like to preface this with the fact that this issue stems from PostgreSQL and its known problem with counting rows.
With only tens of millions of rows, a call to localhost:8080/myObject takes a significant amount of time to execute because of the extra call to count all rows.
Given that, is there any way to disable the count call from the base collection resource in Spring Data REST / JPA, without writing custom Repository implementations that use List / Iterable / Slice return types for the Pageable methods?
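For context, one workaround I'm aware of (rather than disabling the count outright) is to override the count query with PostgreSQL's planner estimate; the repository, query and table names below are made up and only meant as a sketch:

public interface MyObjectRepository extends PagingAndSortingRepository<MyObject, Long> {

    // Assumed workaround: use PostgreSQL's reltuples estimate instead of an exact count(*).
    // 'my_object' is an assumed table name; adjust to the real mapping.
    @Query(value = "select * from my_object",
           countQuery = "select reltuples::bigint from pg_class where relname = 'my_object'",
           nativeQuery = true)
    Page<MyObject> findAllEstimated(Pageable pageable);
}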
Related
I have the following scenario. In a database A I have a table with a huge amount of records (several million); these records increase very rapidly day by day (as many as 100,000 records per day).
I need to fetch these records, check whether they are valid, and import them into my own database. On the first run I should take all the stored records; after that I can take only the newly saved records. I have a timestamp column I can use for this filter, but I can't figure out how to create a JpaPagingItemReader or a JdbcPagingItemReader and pass it a dynamic filter based on the date (e.g. select all records where the timestamp is greater than the job's last execution date).
I'm using Spring Boot, Spring Data JPA and Spring Batch. I'm configuring the Job instance in chunks with size 1000. I can also use a paging query (is it useful if I use chunks?).
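For what it's worth, here is a minimal sketch of how I imagine the reader could look, assuming a Record entity with a timestamp field and a job parameter carrying the last execution date (all names here are hypothetical):

@Bean
@StepScope
public JpaPagingItemReader<Record> recordReader(
        EntityManagerFactory entityManagerFactory,
        @Value("#{jobParameters['lastExecutionDate']}") Date lastExecutionDate) {
    // The timestamp filter is bound as a named parameter, so the same step
    // works for the first full load and for the incremental daily runs.
    Map<String, Object> parameters = new HashMap<>();
    parameters.put("lastExecutionDate", lastExecutionDate);
    return new JpaPagingItemReaderBuilder<Record>()
            .name("recordReader")
            .entityManagerFactory(entityManagerFactory)
            .queryString("select r from Record r where r.timestamp > :lastExecutionDate order by r.timestamp asc")
            .parameterValues(parameters)
            .pageSize(1000)
            .build();
}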
I have a microservice (let's call it MSA) with all the business logic needed to check whether records are valid and to insert the valid ones.
I have another service on a separate server. This service contains all the batch operations (let's call this MSB).
I'm wondering what is the best approach to the batch. I was thinking of these solutions:
in MSB I duplicate all the entities, repositories and services I use in MSA. Then in MSB I can run all the needed queries
in MSA I create all the REST APIs needed. The ItemProcessor of MSB will call these REST APIs to perform checks on the items to be processed, and finally in the ItemWriter I'll call the REST API for saving the data
The first solution would avoid the HTTP calls, but it forces me to duplicate all repositories and services between the two microservices. Sadly I can't use a common project in which to place all the shared objects.
The second solution, on the other hand, would avoid the code duplication, but it would imply a lot of HTTP calls (above all in the ItemProcessor, to check whether an item is valid or not).
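Just to make the second option concrete, I imagine the ItemProcessor would look roughly like this (the endpoint path and DTO name are made up):

// Sketch of the second option: validation is delegated to MSA over HTTP
public class RemoteValidationProcessor implements ItemProcessor<RecordDto, RecordDto> {

    private final RestTemplate restTemplate;

    public RemoteValidationProcessor(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    @Override
    public RecordDto process(RecordDto item) {
        // One HTTP call per item; returning null filters the item out of the chunk
        Boolean valid = restTemplate.postForObject(
                "http://msa-service/api/records/validate", item, Boolean.class);
        return Boolean.TRUE.equals(valid) ? item : null;
    }
}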
Do you have any other suggestion? Is there a better approach?
Thank you
Angelo
In Spring Data, I can easily perform queries such as:
Page<MyClass> findByX(String x, Pageable pageable);
In reactive Spring data (MongoDB), I couldn't find a valid way to paginate the result.
Mono<Page<MyClass>> findByX(String x, Pageable pageable);
seems like a good candidate, but it fails with an error that suggests using Flux.buffer(size, skip).
If there is no valid way to do it, is there a way to get the query's total count without actually performing the query once more without Page?
Reactive Spring Data MongoDB repositories do not provide paging in the sense in which it is designed for imperative repositories. Imperative paging requires additional details while fetching a page. In particular:
The number of returned records for the paging query
Optionally, the total count of records the query yields (if the number of returned records is zero or matches the page size), to calculate the overall number of pages
Neither aspect fits the notion of efficient, non-blocking resource usage. Waiting until all records are received (to determine the first chunk of paging details) would remove a huge part of the benefits you get from reactive data access. Additionally, executing a count query is rather expensive and increases the lag until you're able to process data.
You can still fetch chunks of data yourself by passing a Pageable (PageRequest) to repository query methods:
Flux<MyClass> findByX(String x, Pageable pageable);
Spring Data will apply pagination to the query by translating Pageable to LIMIT and OFFSET.
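A usage sketch of that query method, with a made-up argument value and page request (it fetches the second page of 20 items; no count query is involved):

// Fetch the second page of 20 items without any count(*) query
Flux<MyClass> secondPage = repository.findByX("someValue", PageRequest.of(1, 20));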
As noted in the Spring documentation, you can try the code below; I believe it should work for you to get the count:
Mono<Long> countByXxxx(String x);
This is how I did it:
public Mono<Page<Role>> findByTenantId(TenantId id, Pageable page) {
    // Run the count query and the page query, then combine both into a Page
    return roleRepository.countByTenantId(id)
            .zipWith(roleRepository.findByTenantId(id, page).collectList())
            .map(countAndItems -> new PageImpl<Role>(countAndItems.getT2(), page, countAndItems.getT1()));
}
It is true that this way the streaming of result items only starts after the complete page has been loaded, and we need to count the total results first.
I have a use case where I want to record data in rows and display it to the user.
Multiple users can add these records and they have to be displayed in order of insertion AND - MOST IMPORTANTLY - with a sequence number starting from 1.
I have a Spring Boot microservice architecture at the backend, which obviously means I cannot hold state in my Boot application, as I'm going to have multiple running instances.
Another method was to fetch all existing records in the DB, count them, increment the count by 1 and use that as my sequence. I need to do this every time I am doing an insert.
But the problem with the second approach is with parallel requests, which could result in the same sequence number being given to 2 records.
The third approach is to configure the counter in a DB, but since I am using Cosmos DB, apparently that is also not an option.
Any suggestions as to how I can implement a static, shared counter?
When findAll(List<UUID> ids) in Spring Data JPA is called with a list of n UUIDs and SQL logging is enabled, I see n SQL statements being logged, which I assume means the DB is being called n times (the size of the list) to fetch the records.
Is it possible to call the DB only once to fetch all the records at once so that performance can be improved?
Spring Data uses an IN clause for this method, except when your entity has a composite key. That is already just one query.
So the multiple queries you see are most likely your JPA implementation deciding to return a proxy with just the id and then lazy loading the attributes on demand. See the documentation of the implementation you are using for how to prevent/control that.
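For illustration, a call like the one below should translate into a single select ... where id in (...) statement when the key is not composite; the repository and list names are placeholders, and findAllById is the current Spring Data name for this method:

// One round trip: all ids are bound into a single IN clause
List<MyEntity> entities = myEntityRepository.findAllById(uuidList);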
I am looking to retrieve a large dataset with a JpaRepository, backed by an Oracle table. The choices are to return a collection (List) or a Page of the entity and then step through the results. Please note - I have to consume every record in this set, exactly once. This is not a "look-for-the-first-one-from-a-large-dataset-and-return" operation.
While the paging idea is appealing, the performance will be horrible (n^2) because for each page queried, Oracle will have to pull up the previous n-1 pages, making the performance progressively worse as I get deeper into the result set.
My understanding of the List alternative is that the entire result set will be loaded in memory. For Oracle, Spring Data JPA does not have a backing result set.
So here are my questions
Is my understanding of the way List works with Spring Data correct? If it's not then I will just use List.
If I am correct, is there an alternative that streams Oracle/JPA result-sets?
Is there a third way that I am not aware of?
Pageable methods in Spring Data JPA issue an additional select count(*) from ... on every request. I think this is the reason for the problem.
To avoid it you can use Slice instead of Page as the return type, for example:
Slice<User> getAllBy(Pageable pageable);
Or you can even use a List of entities with pagination:
List<User> getAllBy(Pageable pageable);
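A small usage sketch of the Slice variant, walking the whole data set chunk by chunk (the repository reference and the process call are placeholders, not part of the original answer):

// No count(*) is executed for a Slice; hasNext() is typically determined by fetching one extra row
Pageable pageable = PageRequest.of(0, 1000);
Slice<User> slice;
do {
    slice = userRepository.getAllBy(pageable);
    slice.getContent().forEach(this::process); // hypothetical per-record handling
    pageable = slice.nextPageable();
} while (slice.hasNext());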
Additional info