I am looking to retrieve a large dataset with a JpaRepository, backed by an Oracle table. The choices are to return a collection (List) or a Page of the entity and then step through the results. Please note: I have to consume every record in this set, exactly once. This is not a "look-for-the-first-one-from-a-large-dataset-and-return" operation.
While the paging idea is appealing, the performance will be horrible (O(n^2)) because for each page queried, Oracle will have to pull up the previous n-1 pages, making the performance progressively worse as I get deeper into the result set.
My understanding of the List alternative is that the entire result set will be loaded in memory; for Oracle, Spring's JPA support does not keep a backing result set.
So here are my questions:
Is my understanding of the way List works with Spring Data correct? If it's not, then I will just use List.
If I am correct, is there an alternative that streams Oracle/JPA result sets?
Is there a third way that I am not aware of?
Pageable methods in Spring Data JPA issue an additional select count(*) from ... query on every request. I think this is the reason for the problem.
To avoid it you can use Slice instead of Page as the return type, for example:
Slice<User> getAllBy(Pageable pageable);
Or you can even use a List of entities with pagination:
List<User> getAllBy(Pageable pageable);
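For example, here is a minimal sketch (repository, entity and process() are illustrative names, assuming Spring Data 2.x) of walking the whole result set with the Slice variant, so every record is consumed exactly once and no count query is issued:

import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Pageable;
import org.springframework.data.domain.Slice;

Pageable pageable = PageRequest.of(0, 1000);
Slice<User> slice;
do {
    slice = userRepository.getAllBy(pageable);
    slice.forEach(this::process);       // consume each record exactly once
    pageable = slice.nextPageable();
} while (slice.hasNext());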
In Spring Data, I can easily perform queries such as:
Page<MyClass> findByX(String x, Pageable pageable);
In reactive Spring Data (MongoDB), I couldn't find a valid way to paginate the result.
Mono<Page<MyClass>> findByX(String x, Pageable pageable);
seems like a good candidate, but fails with an error that requires the usage of Flux.buffer(size, skip).
If there is no valid way to do it, is there a way to get the query's total count without actually having to perform the query once without Page?
Reactive Spring Data MongoDB repositories do not provide paging in the sense in which it is designed for imperative repositories. Imperative paging requires additional details while fetching a page, in particular:
- The number of returned records for a paging query
- Optionally, the total count of records the query yields, if the number of returned records is zero or matches the page size, to calculate the overall number of pages
Neither aspect fits the notion of efficient, non-blocking resource usage. Waiting until all records are received (to determine the first chunk of paging details) would remove a huge part of the benefits you get from reactive data access. Additionally, executing a count query is rather expensive and increases the lag until you're able to process data.
You can still fetch chunks of data yourself by passing a Pageable (PageRequest) to repository query methods:
Flux<MyClass> findByX(String x, Pageable pageable);
Spring Data will apply pagination to the query by translating Pageable to LIMIT and OFFSET.
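For example (a sketch assuming Spring Data 2.x, reusing the names from the question):

import org.springframework.data.domain.PageRequest;
import reactor.core.publisher.Flux;

// Request only the first 50 matching documents as one chunk.
Flux<MyClass> firstChunk = repository.findByX("someValue", PageRequest.of(0, 50));
firstChunk.subscribe(System.out::println);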
You can try the code below; I believe it should work for you to get the count:
Mono<Long> countByXxxx(String x);
This is how I did it:
public Mono<Page<Role>> findByTenantId(TenantId id, Pageable page) {
    return roleRepository.countByTenantId(id)
            .zipWith(roleRepository.findByTenantId(id, page).collectList())
            .map(countAndItems -> new PageImpl<Role>(countAndItems.getT2(),
                    page, countAndItems.getT1()));
}
It is true that this way, streaming of the result items only starts after the complete page has been loaded, and we need to count the total results first.
I have a Spring Boot project where I would like to execute a specific query in a database from x different threads while preventing different threads from reading the same database entries. So far I have been able to run the query in multiple threads, but I have had no luck finding a way to "split" the read load. My code so far is as follows:
@Async
@Transactional
public CompletableFuture<List<Book>> scanDatabase() {
    final List<Book> books = booksRepository.findAllBooks();
    return CompletableFuture.completedFuture(books);
}
Any ideas on how I should approach this?
There are plenty of ways to do that.
If you have a numeric field in the data that is somewhat random, you can add a condition to your where clause like ... and some_value % :N = :i, with :N being a parameter for the number of threads and :i being the index of the specific thread (0-based); see the sketch after this list.
If you don't have a numeric field you can create one by using a hash function and apply it on some other field in order to turn it into something numeric. See your database specific documentation for available hash functions.
You could use an analytic function like ROW_NUMBER() to create a numeric value to be used in the condition.
You could query the number of rows in a first query and then query the right Slice using Spring Data's pagination feature.
And many more variants.
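As a sketch of the first variant (entity, table and column names are illustrative; mod() is used instead of % for portability), each worker i of N only ever sees rows whose id falls into its own partition:

import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface BookRepository extends JpaRepository<Book, Long> {

    // Each worker calls this with its own workerIndex (0-based) and the shared workerCount.
    @Query(value = "select * from book where mod(id, :workerCount) = :workerIndex",
           nativeQuery = true)
    List<Book> findPartition(@Param("workerCount") int workerCount,
                             @Param("workerIndex") int workerIndex);
}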
They all have in common that the complete set of rows must not change during the processing, otherwise you may get rows queried multiple times or not at all.
If you can't guarantee that, you need to mark the records to be processed by a thread before actually selecting them, for example by marking them in an extra field or by using a FOR UPDATE clause in your query.
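A sketch of the FOR UPDATE variant, as a method added to the same illustrative repository (the relevant imports are javax.persistence.LockModeType and org.springframework.data.jpa.repository.Lock; the processed flag is hypothetical):

// A pessimistic write lock makes Spring Data issue a SELECT ... FOR UPDATE, so rows
// claimed by one worker's transaction are not handed out to another worker until that
// transaction completes.
@Lock(LockModeType.PESSIMISTIC_WRITE)
@Query("select b from Book b where b.processed = false")
List<Book> lockUnprocessed();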
And finally there is the question if this is really what you need.
Querying the data in multiple threads probably doesn't make the querying part faster since it makes the query more complex and doesn't speed up those parts that typically limit the throughput: network between application and database and I/O in the database.
So it might be a better approach to select the data with one query and iterate through it, passing it on to a pool of threads for processing.
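A sketch of that alternative (Book, booksRepository and process() are illustrative names): one thread runs the query, a fixed pool of worker threads does the processing.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService pool = Executors.newFixedThreadPool(4);
for (Book book : booksRepository.findAll()) {
    pool.submit(() -> process(book));   // hand each row to the worker pool
}
pool.shutdown();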
You also might want to take a look at Spring Batch which might be helpful with processing large amounts of data.
findAll(List<UUID> ids) in Spring Data JPA is called by passing a list of UUIDs of size n. When SQL logs are enabled, I find n SQL statements getting logged, from which I assume the DB is being called n times (the size of the list) to fetch the records.
Is it possible to call the DB only once to fetch all the records at once, so that performance can be improved?
Spring Data uses an IN clause for this method, except when your entity has a composite key. That is already just one query.
So the multiple queries you see are most likely your JPA implementation deciding to return proxies with just the id and then lazy loading the attributes on demand. See the documentation of the implementation you are using for how to prevent/control that.
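If the extra statements come from lazily loaded associations and you are on Hibernate, one hedged option (assuming Spring Data 2.x, where the method is called findAllById; entity and association names are illustrative) is to override the method with an entity graph so the association is fetched in the same IN query:

import java.util.List;
import java.util.UUID;
import org.springframework.data.jpa.repository.EntityGraph;
import org.springframework.data.jpa.repository.JpaRepository;

public interface CustomerRepository extends JpaRepository<Customer, UUID> {

    @Override
    @EntityGraph(attributePaths = "orders")   // "orders" is a hypothetical lazy association
    List<Customer> findAllById(Iterable<UUID> ids);
}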
Question: How can I process (read in) batches of records 1000 at a time and ensure that only the current batch of 1000 records is in memory? Assume my primary key is called 'ID' and my table is called Customer.
Background: This is not for user pagination, it is for compiling statistics about my table. I have limited memory available, therefore I want to read my records in batches of 1000 records at a time. I am only reading in records, they will not be modified. I have read that StatelessSession is good for this kind of thing and I've heard about people using ScrollableResults.
What I have tried: Currently I am working on a custom made solution where I implemented Iterable and basically did the pagination by using setFirstResult and setMaxResults. This seems to be very slow for me but it allows me to get 1000 records at a time. I would like to know how I can do this more efficiently, perhaps with something like ScrollableResults. I'm not yet sure why my current method is so slow; I'm ordering by ID but ID is the primary key so the table should already be indexed that way.
As you might be able to tell, I keep reading bits and pieces about how to do this. If anyone can provide me a complete way to do this it would be greatly appreciated. I do know that you have to set FORWARD_ONLY on ScrollableResults and that calling evict(entity) will take an entity out of memory (unless you're doing second level caching, which I do not yet know how to check if I am or not). However I don't see any methods in the JavaDoc to read in say, 1000 records at a time. I want a balance between my lack of available memory and my slow network performance, so sending records over the network one at a time really isn't an option here. I am using Criteria API where possible. Thanks for any detailed replies.
Maybe using the ROWNUM feature of Oracle will help you.
Let's say we need to fetch 1000 rows (pageSize) of the Customer table and we need to fetch the second page (pageNumber).
Creating and calling a query like this may be the answer:
select *
from (
    select rownum row_number, c.*
    from (select * from Customer order by ID) c
    where rownum <= pageSize * pageNumber
)
where row_number > pageSize * (pageNumber - 1)
Load entities as read-only.
For HQL
Query.setReadOnly( true );
For Criteria
Criteria.setReadOnly( true );
http://docs.jboss.org/hibernate/orm/3.6/reference/en-US/html/readonly.html#readonly-api-querycriteria
A stateless session is quite different from a stateful session.
Operations performed using a stateless session never cascade to associated instances. Collections are ignored by a stateless session
http://docs.jboss.org/hibernate/orm/3.3/reference/en/html/batch.html#batch-statelesssession
Use flush() and clear() to clean up the session cache.
session.flush();
session.clear();
Question about Hibernate session.flush()
ScrollableResults should work the way you expect.
Do not forget that each item you load takes memory space unless you evict or clear it, and you need to check that this really works well.
ScrollableResults in the MySQL Connector/J is fake scrolling; it loads the entire result set, but I think the Oracle connector works fine.
Using Hibernate's ScrollableResults to slowly read 90 million records
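A sketch combining the suggestions above (assuming Hibernate 3.x-style APIs as in the linked docs; Customer, sessionFactory and collectStatistics() are illustrative names): a stateless session with a forward-only, read-only cursor and a JDBC fetch size of 1000, so only the rows currently being transferred are held in memory.

import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.StatelessSession;

StatelessSession session = sessionFactory.openStatelessSession();
try {
    ScrollableResults results = session
            .createQuery("from Customer c order by c.id")
            .setReadOnly(true)
            .setFetchSize(1000)                  // hint to the driver: pull 1000 rows per round trip
            .scroll(ScrollMode.FORWARD_ONLY);
    while (results.next()) {
        Customer customer = (Customer) results.get(0);
        collectStatistics(customer);             // illustrative processing step
    }
    results.close();
} finally {
    session.close();
}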
If you are looking for alternatives, you may consider this approach (see the sketch after the list):
1. Select the primary key of every row that you will process
2. Chop them into PK chunks
3. Iterate:
   select rows by PK chunk (using an IN query)
   process them however you want
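A sketch of that approach in plain JPA (Customer, entityManager and collectStatistics() are illustrative names, assuming a numeric id): load only the primary keys first, then fetch and process the rows chunk by chunk with an IN query, clearing the persistence context after each chunk so only the current batch stays in memory.

import java.util.List;

List<Long> ids = entityManager
        .createQuery("select c.id from Customer c order by c.id", Long.class)
        .getResultList();

int chunkSize = 1000;
for (int from = 0; from < ids.size(); from += chunkSize) {
    List<Long> chunk = ids.subList(from, Math.min(from + chunkSize, ids.size()));
    List<Customer> batch = entityManager
            .createQuery("select c from Customer c where c.id in :ids", Customer.class)
            .setParameter("ids", chunk)
            .getResultList();
    batch.forEach(c -> collectStatistics(c));    // illustrative processing step
    entityManager.clear();                       // detach the processed batch
}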
I use JBoss EJB 3.0 implementation (JBoss 4.2.3 server)
At the beginning I created the native query every time, using a construct like:
Query query = entityManager.createNativeQuery("select * from _table_");
Of course it is not that efficient; I performed some tests and found out that it really takes a lot of time... Then I found a better way to deal with it: use an annotation to define named native queries:
#NamedNativeQuery( name = "fetchData", value = "select * from _table_", resultClass=Entity.class )
and then just use it
Query query = entityManager.createNamedQuery("fetchData");
The performance of the code line above is two times better than where I started from, but still not as good as I expected... Then I found that I can switch to the Hibernate annotation for NamedNativeQuery (anyway, JBoss's implementation of EJB is based on Hibernate) and add one more thing:
#NamedNativeQuery( name = "fetchData2", value = "select * from _table_", resultClass=Entity.class, readOnly=true)
readOnly marks whether the results are fetched in read-only mode or not. It sounds good, because at least in my case I don't need to update the data; I just want to fetch it for a report. When I started the server to measure performance, I noticed that the query without readOnly=true (by default it is false) returns results faster and faster with each iteration, while the other one (fetchData2) stays "stable"; the time difference between them got shorter and shorter, and after 5 iterations the speed of both was almost the same...
The questions are:
1) Is there any other way to speed up query usage? It seems that named queries should be prepared once, but I can't tell... In fact, creating the query once and then just reusing it would be better from a performance point of view, but it is problematic to cache this object, because after creating the query I can set parameters (when I use ":variable" in the query), and that changes the query object (doesn't it?). Well, is there any way to cache them? Or is a named query the best option I can use?
2) Are there any other approaches to make result retrieval faster? I mean, for instance, I don't need those entities to be attached; I won't update them, all I need is just to fetch a collection of data. Maybe readOnly is the only available way, so I can't speed it up, but who knows :)
P.S. I am not asking about DB performance; all I need now is how to avoid creating the query object all the time, to use it efficiently, and to "allow" EJB to do less work for the same returned data.
Added 15.03.2010:
By query I mean the query object (so: how to cache this object for reuse); caching query results is not a solution for me, because the where clause in the query can be almost unique for each execution due to the floating-point parameters in it. A cache just will not understand that "a > 50.0001" and "a > 50.00101" can give the same result, but also might not.
You could use the second level cache and the query cache to avoid hitting the database (this works especially well with read-only objects). The second level cache is supported by Hibernate (with a third-party cache provider) but is an extension to JPA 1.0.
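If (unlike the floating-point case described in the update above) the same query does repeat with identical parameters, here is a minimal sketch of marking a named query as cacheable, assuming Hibernate as the persistence provider with hibernate.cache.use_query_cache=true and a configured second level cache:

import java.util.List;
import javax.persistence.Query;

Query query = entityManager.createNamedQuery("fetchData");
query.setHint("org.hibernate.cacheable", Boolean.TRUE);  // allow repeats to be served from the query cache
List<?> results = query.getResultList();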