In Spring Data, I can easily perform queries such as:
Page<MyClass> findByX(String x, Pageable pageable);
In reactive Spring Data (MongoDB), I couldn't find a valid way to paginate the result.
Mono<Page<MyClass>> findByX(String x, Pageable pageable);
seems like a good candidate, but it fails with an error that requires the use of Flux.buffer(size, skip).
If there is no valid way to do it, is there a way to get the total count for the query without having to run the full query once more without Page?
Reactive Spring Data MongoDB repositories do not provide paging in the sense in which it is designed for imperative repositories. Imperative paging requires additional details while fetching a page, in particular:
The number of returned records for a paging query
Optionally, the total count of records the query yields, if the number of returned records is zero or matches the page size, to calculate the overall number of pages
Neither aspect fits the notion of efficient, non-blocking resource usage. Waiting until all records are received (to determine the first chunk of paging details) would remove a huge part of the benefits you get from reactive data access. Additionally, executing a count query is rather expensive and increases the lag until you're able to process data.
You can still fetch chunks of data yourself by passing a Pageable (PageRequest) to repository query methods:
Flux<MyClass> findByX(String x, Pageable pageable);
Spring Data will apply pagination to the query by translating Pageable to LIMIT and OFFSET.
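To make the translation concrete, here is a minimal plain-Java sketch of what a page request boils down to: an offset (page number times page size) and a limit (page size). OffsetPaging and page are hypothetical names, and an in-memory list stands in for the query result; this is an illustration of the idea, not Spring Data's actual implementation.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: emulates OFFSET/LIMIT paging over an in-memory list.
final class OffsetPaging {

    // pageNumber is zero-based, as in Spring Data's PageRequest.
    static <T> List<T> page(List<T> rows, int pageNumber, int pageSize) {
        long offset = (long) pageNumber * pageSize; // number of rows to skip
        return rows.stream()
                .skip(offset)     // plays the role of OFFSET
                .limit(pageSize)  // plays the role of LIMIT
                .collect(Collectors.toList());
    }
}
```

Requesting a page past the end simply yields an empty list, which is how a caller without a total count can tell that it has run out of data.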
Spring Documentation says:
You can try the code below; I believe it should work for getting the count:
Mono<Long> countByXxxx(String x);
This is how I did it:
public Mono<Page<Role>> findByTenantId(TenantId id, Pageable page) {
    return roleRepository.countByTenantId(id)
            .zipWith(roleRepository.findByTenantId(id, page).collectList())
            .map(countAndItems -> new PageImpl<Role>(countAndItems.getT2(),
                    page, countAndItems.getT1()));
}
It is true that with this approach the result items only start streaming after the complete page has been loaded, and that we need to count the total results as well.
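For readers without Reactor at hand, the same "combine the count with the page content" idea can be sketched with CompletableFuture from the JDK. SimplePage and PageAssembler are hypothetical names; SimplePage is only a stand-in for PageImpl and holds just what the sketch needs.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for Spring Data's PageImpl.
final class SimplePage<T> {
    final List<T> content;
    final int pageSize;
    final long totalElements;

    SimplePage(List<T> content, int pageSize, long totalElements) {
        this.content = content;
        this.pageSize = pageSize;
        this.totalElements = totalElements;
    }

    // Number of pages, rounded up so a partial last page still counts.
    long totalPages() {
        return (totalElements + pageSize - 1) / pageSize;
    }
}

final class PageAssembler {

    // Waits for both the count and the page content, then merges them into one page.
    static <T> CompletableFuture<SimplePage<T>> assemble(
            CompletableFuture<Long> count,
            CompletableFuture<List<T>> items,
            int pageSize) {
        return count.thenCombine(items,
                (total, content) -> new SimplePage<T>(content, pageSize, total));
    }
}
```

As in the reactive version, nothing is delivered until both inputs have completed, so the trade-off described above applies here too.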
I have a Spring Boot project where I would like to execute a specific query against a database from x different threads while preventing different threads from reading the same database entries. So far I have been able to run the query in multiple threads, but I have had no luck finding a way to "split" the read load. My code so far is as follows:
@Async
@Transactional
public CompletableFuture<List<Book>> scanDatabase() {
    final List<Book> books = booksRepository.findAllBooks();
    return CompletableFuture.completedFuture(books);
}
Any ideas on how I should approach this?
There are plenty of ways to do that.
If you have a numeric field in the data that is somewhat random, you can add a condition to your where clause like ... and some_value % :N = :i, with :N being a parameter for the number of threads and :i being the index of the specific thread (0-based).
If you don't have a numeric field, you can create one by applying a hash function to some other field in order to turn it into something numeric. See your database-specific documentation for available hash functions.
You could use an analytic function like ROW_NUMBER() to create a numeric value to be used in the condition.
You could query the number of rows in a first query and then query the right Slice using Spring Data's pagination feature.
And many more variants.
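Three of the variants above (modulo on a numeric field, hashing a non-numeric field, and counting first to hand out ranges) can be sketched in plain Java as follows. ReadPartitioning and its method names are hypothetical, and CRC32 is only an example of a deterministic hash; in a real query the condition would live in the where clause and use whatever your database offers.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical helper showing three ways to decide which worker reads which rows.
final class ReadPartitioning {

    // Variant 1: numeric field, mirrors "some_value % :N = :i".
    // floorMod keeps the result in [0, workers) even for negative values.
    static boolean ownedByModulo(long someValue, int workers, int index) {
        return Math.floorMod(someValue, (long) workers) == index;
    }

    // Variant 2: non-numeric field, hashed first to obtain a number.
    static boolean ownedByHash(String key, int workers, int index) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue() % workers == index; // getValue() is never negative
    }

    // Variant 3: count first, then give each worker one {offset, limit} range.
    // The first (totalRows % workers) ranges receive one extra row.
    static List<long[]> ranges(long totalRows, int workers) {
        List<long[]> result = new ArrayList<>();
        long base = totalRows / workers;
        long remainder = totalRows % workers;
        long offset = 0;
        for (int i = 0; i < workers; i++) {
            long limit = base + (i < remainder ? 1 : 0);
            result.add(new long[] {offset, limit});
            offset += limit;
        }
        return result;
    }
}
```

With 10 rows and 3 workers, ranges yields the offset/limit pairs (0, 4), (4, 3) and (7, 3), so every row belongs to exactly one worker.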
They all have in common that the complete set of rows must not change during the processing, otherwise you may get rows queried multiple times or not at all.
If you can't guarantee that, you need to mark the records to be processed by a thread before actually selecting them, for example by marking them in an extra field or by using a FOR UPDATE clause in your query.
And finally, there is the question of whether this is really what you need.
Querying the data in multiple threads probably doesn't make the querying part faster since it makes the query more complex and doesn't speed up those parts that typically limit the throughput: network between application and database and I/O in the database.
So it might be a better approach to select the data with one query and iterate through it, passing it on to a pool of threads for processing.
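A minimal sketch of that approach, assuming the rows have already been loaded by a single query: only the processing step is spread over a fixed thread pool. PooledProcessor and processAll are hypothetical names, and the work function stands in for whatever you do with each row.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper: rows come from one query; only the processing runs in parallel.
final class PooledProcessor {

    static <T, R> List<R> processAll(List<T> rows, Function<T, R> work, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per row; the pool decides which thread runs it.
            List<CompletableFuture<R>> futures = rows.stream()
                    .map(row -> CompletableFuture.supplyAsync(() -> work.apply(row), pool))
                    .collect(Collectors.toList());
            // join() waits for each task; results keep the order of the input rows.
            return futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every row is handed out exactly once by the submitting thread, no two workers can end up with the same entry, and no extra conditions are needed in the query.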
You also might want to take a look at Spring Batch which might be helpful with processing large amounts of data.
findAll(List<UUID> ..) in Spring Data JPA is called with a list of UUIDs of size n. With SQL logging enabled, I found n SQL statements being logged, so I assume the DB is being called n times (once per list entry) to fetch the records.
Is it possible to call the DB only once and fetch all the records in a single query, so that performance can be improved?
Spring Data uses an IN clause for this method, except when your entity has a composite key. That is already just one query.
So the multiple queries you see are most likely your JPA implementation deciding to return a proxy with just the id and then lazy loading the attributes on demand. See the documentation of the implementation you are using for how to prevent or control that.
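To illustrate why a list of n ids still needs only one statement, here is a small sketch that builds a single parameterized IN query. InClauseBuilder is a hypothetical name, and the table book and column id are placeholders; this is not the SQL Spring Data or your JPA provider actually generates.

```java
import java.util.Collections;
import java.util.List;
import java.util.UUID;

// Hypothetical helper: builds one parameterized query for any number of ids.
final class InClauseBuilder {

    // The table name "book" and the column "id" are placeholders for illustration.
    static String selectByIds(List<UUID> ids) {
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("ids must not be empty");
        }
        // One "?" placeholder per id, all inside a single IN (...) clause.
        String placeholders = String.join(", ", Collections.nCopies(ids.size(), "?"));
        return "SELECT * FROM book WHERE id IN (" + placeholders + ")";
    }
}
```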
I am looking to retrieve a large dataset with a JpaRepository, backed by an Oracle table. The choices are to return a collection (List) or a Page of the entity and then step through the results. Please note: I have to consume every record in this set, exactly once. This is not a "look-for-the-first-one-from-a-large-dataset-and-return" operation.
While the paging idea is appealing, the performance will be horrible (n^2) because for each page queried, Oracle will have to pull up the previous n-1 pages, making the performance progressively worse as I get deeper into the result set.
My understanding of the List alternative is that the entire result set will be loaded into memory: with Oracle and JPA, Spring does not back the list with a result set.
So here are my questions
Is my understanding of the way List works with Spring Data correct? If it's not then I will just use List.
If I am correct, is there an alternative that streams Oracle/JPA result-sets?
Is there a third way that I am not aware of?
Pageable methods in SDJ (Spring Data JPA) issue an additional select count(*) from ... on every request. I think this is the reason for the problem.
To avoid it, you can use Slice instead of Page as the return type, for example:
Slice<User> getAllBy(Pageable pageable);
Or you can even use a List of entities with pagination:
List<User> getAllBy(Pageable pageable);
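A slice can tell you whether more data follows without a count query by reading one row more than the page size. The sketch below shows that general "fetch one extra row" idea in plain Java over an in-memory list; SimpleSlice is a hypothetical stand-in and not Spring Data's Slice implementation.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a slice: content plus a "has next" flag, no total count.
final class SimpleSlice<T> {
    final List<T> content;
    final boolean hasNext;

    SimpleSlice(List<T> content, boolean hasNext) {
        this.content = content;
        this.hasNext = hasNext;
    }

    // Reads pageSize + 1 rows: the extra row only signals that another slice exists.
    static <T> SimpleSlice<T> of(List<T> rows, int pageNumber, int pageSize) {
        List<T> fetched = rows.stream()
                .skip((long) pageNumber * pageSize)
                .limit(pageSize + 1L)
                .collect(Collectors.toList());
        boolean hasNext = fetched.size() > pageSize;
        // Drop the extra row before handing the content to the caller.
        List<T> content = hasNext ? fetched.subList(0, pageSize) : fetched;
        return new SimpleSlice<>(content, hasNext);
    }
}
```

The price is that a slice cannot report the total number of pages, which is exactly the information the count query would have provided.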
Additional info
I'd like to preface this with the fact that this issue is due to PostgreSQL and its known problem with row counting.
With only tens of millions of rows, a call to localhost:8080/myObject takes a significant amount of time to execute because of the extra call to count all rows.
Given that, is there any way to disable the count call from the base collection resource in Spring Data REST / JPA, without writing custom Repository implementations that use List / Iterable / Slice return types for the Pageable methods?
I'm using lucene.net to produce an index and search it. I'm actually using the API indirectly through the Examine project on CodePlex. I currently have everything working and the paging logic in place; however, the current logic pages the results after the search has been completed. I don't like this because it means the search will possibly return thousands of records, and only then does my code take the 10-20 records it needs and discard the rest, which is a major waste of resources. Even if each SearchResult item is just a tiny 3KB, the amount of memory needed to execute these searches will grow with time and become a huge memory hog. My shared host only guarantees 1GB of dedicated memory, so this is a big concern for my website.
So the question is: how do I limit the search results in a paged manner using the Lucene query language alone? I looked at the Apache Lucene project, which lucene.net is ported from, and I don't see any syntax that lets me do what I'm looking for. Basically, I want the equivalent of what SQL Server has to limit the rows at the query language level.
E.g. (this is how we would do paging in sql and it only returns 20 records not every record that matches the where clause)
Select * from (select Row_Number() OVER (ORDER BY OrderDate) as RowNum,
                      OrderID,
                      OrderDate
               FROM SalesOrders
               WHERE OrderCustomerName like 'Davis%') O
WHERE RowNum BETWEEN 1 and 20
I don't think that there is a major waste of resources, since a search is (put simply) nothing more than calculating the bit vector and the scores. What is costly is reading the docs from the index. Search results (except with the deprecated Hits class) don't read the docs; they just return the doc ids, so there isn't much overhead in skipping the first N results.
The exception to this is when you want to sort the results by some field. Then all docs in the search result list must be read from the index in order to return them in the correct order.
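The point about skipping cheap doc ids and reading only the docs you need can be sketched as follows. This is plain Java and not the Lucene or lucene.net API: HitPager is a hypothetical name, rankedDocIds stands in for the ids a search returns (best match first), and loadDoc stands in for the costly read of a stored document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper: pages over ranked document ids and loads only the requested page.
final class HitPager {

    static <D> List<D> loadPage(int[] rankedDocIds, int skip, int pageSize,
                                IntFunction<D> loadDoc) {
        List<D> page = new ArrayList<>();
        int end = Math.min(rankedDocIds.length, skip + pageSize);
        // Ids before `skip` are stepped over without touching the index.
        for (int i = skip; i < end; i++) {
            page.add(loadDoc.apply(rankedDocIds[i])); // only these documents are read
        }
        return page;
    }
}
```

For a page of 20 results, loadDoc is called at most 20 times no matter how many ids the search matched, which is why the skipped results add little overhead in the unsorted case.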