I need to query a big dataset from the DB. I plan to use pagination parameters (limit and offset) to avoid loading the whole dataset into the heap. For that purpose I'm trying to fetch rows with the RowCallbackHandler interface, because the docs say "An interface used by JdbcTemplate for processing rows of a ResultSet on a per-row basis." and I've also read advice to use that interface to deal with rows one by one.
But something goes wrong every time I try to fetch the data. Here is my code, along with a screenshot from VisualVM whose heap-space graph indicates that all rows were loaded into memory. The query I'm trying to execute returns around 1.5M rows from the DB.
// just the SQL query, a map of query parameters, and a pretty simple RowCallbackHandler
jdbcTemplate.query(queryForExecute, params, new RowCallbackHandler() {
    @Override
    public void processRow(ResultSet rs) throws SQLException {
        while (rs.next()) {
            System.out.println("test");
        }
    }
});
Heap via VisualVM:
Update: I made a mistake by calling rs.next() inside processRow, but removing that loop didn't change the situation with rows being loaded into memory at all.
The main problem was with understanding the documentation. The doc says:
An interface used by JdbcTemplate for processing rows of a ResultSet on a per-row basis.
Actually my code behaves as documented: the query returns a ResultSet that contains all rows (because no LIMIT is defined), and the handler is simply called once per row. I wasn't confident that adding LIMIT to an arbitrary SQL query would work well, so I decided to implement the limiting via RowCallbackHandler instead, and that was a bad idea: LIMIT works fine with all kinds of SQL queries, complex and simple.
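For reference, here is a minimal sketch of that LIMIT/OFFSET approach, assuming a NamedParameterJdbcTemplate (as the Map-based params in the question suggest) and a page size that fits comfortably in the heap; the SQL text, table and column names are placeholders:

int pageSize = 10_000;
int offset = 0;
int rowsInPage;

do {
    Map<String, Object> params = new HashMap<>();
    params.put("limit", pageSize);
    params.put("offset", offset);

    final int[] counter = {0};
    jdbcTemplate.query(
            "select id, payload from big_table order by id limit :limit offset :offset",
            params,
            (RowCallbackHandler) rs -> {
                // processRow is invoked once per row; no rs.next() loop is needed here
                counter[0]++;
            });

    rowsInPage = counter[0];
    offset += pageSize;
} while (rowsInPage == pageSize);

Each iteration only ever asks the database for one page of rows, and the handler still sees them one at a time.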
I created a scheduler to delete older records from MongoDB, which runs once a day, but I'm getting an OOM issue while deleting the records from the DB.
There are more than 50K records to be deleted.
The method looks like this:
@Override
public void purgeLteTimeBlock(Long timeBlock) {
    Query query = new Query();
    query.addCriteria(Criteria.where(Constants.TIME_BLOCK).lte(timeBlock));
    mongoTemplate.findAllAndRemove(query, abcEntity.class);
}
From our observation, mongoTemplate provides 3 findAllAndRemove methods, and each of them returns the list of objects that were deleted.
So we thought this might be the reason for the OOM (out of memory) issue, because it hands back more than 50K records to the code.
So is there any way to handle this kind of delete operation in MongoDB?
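A hedged sketch of one possible alternative: MongoTemplate.remove(Query, Class) performs the delete without returning the removed documents, so the 50K+ entities never need to be materialised in the application:

@Override
public void purgeLteTimeBlock(Long timeBlock) {
    Query query = new Query();
    query.addCriteria(Criteria.where(Constants.TIME_BLOCK).lte(timeBlock));
    // remove(...) returns a DeleteResult (deleted count), not the list of deleted documents
    mongoTemplate.remove(query, abcEntity.class);
}

If the deleted documents are genuinely needed afterwards (for auditing, say), deleting in smaller query batches would be another option.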
I have a Spring Boot project where I would like to execute a specific database query from x different threads while preventing different threads from reading the same database entries. So far I have been able to run the query in multiple threads, but I've had no luck finding a way to "split" the read load. My code so far is as follows:
@Async
@Transactional
public CompletableFuture<List<Book>> scanDatabase() {
    final List<Book> books = booksRepository.findAllBooks();
    return CompletableFuture.completedFuture(books);
}
Any ideas on how I should approach this?
There are plenty of ways to do that.
If you have a numeric field in the data that is somewhat random, you can add a condition to your where clause like ... and some_value % :N = :i, with :N being a parameter for the number of threads and :i being the index of the specific thread (0-based); see the sketch after this list.
If you don't have a numeric field you can create one by using a hash function and applying it to some other field in order to turn it into something numeric. See your database-specific documentation for available hash functions.
You could use an analytic function like ROW_NUMBER() to create a numeric value to be used in the condition.
You could query the number of rows in a first query and then query the right Slice using Spring Data's pagination feature.
And many more variants.
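As a rough illustration of the first option, here is a hypothetical Spring Data JPA repository method; the Book entity, its numeric id, and the method name are assumptions taken from the question, not a definitive implementation:

import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface BookRepository extends JpaRepository<Book, Long> {

    // Worker i of n selects only the rows where id % n == i,
    // so every row is read by exactly one thread.
    @Query("select b from Book b where mod(b.id, :n) = :i")
    List<Book> findPartition(@Param("n") long n, @Param("i") long i);
}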
They all have in common that the complete set of rows must not change during the processing, otherwise you may get rows queried multiple times or not at all.
If you can't guarantee that, you need to mark the records to be processed by a thread before actually selecting them, for example by flagging them in an extra field or by using a FOR UPDATE clause in your query.
And finally, there is the question of whether this is really what you need.
Querying the data in multiple threads probably doesn't make the querying part faster since it makes the query more complex and doesn't speed up those parts that typically limit the throughput: network between application and database and I/O in the database.
So it might be a better approach to select the data with one query and iterate through it, passing it on to a pool of threads for processing.
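A minimal sketch of that single-query variant, reusing booksRepository from the question; the pool size is arbitrary and process(...) stands in for whatever per-record work is needed:

ExecutorService pool = Executors.newFixedThreadPool(4);
for (Book book : booksRepository.findAllBooks()) {
    pool.submit(() -> process(book));   // process(...) is a placeholder for the real work
}
pool.shutdown();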
You also might want to take a look at Spring Batch which might be helpful with processing large amounts of data.
I am looking to retrieve a large dataset with a JpaRepository backed by an Oracle table. The choices are to return a collection (List) or a Page of the entity and then step through the results. Please note: I have to consume every record in this set, exactly once. This is not a "look-for-the-first-one-from-a-large-dataset-and-return" operation.
While the paging idea is appealing, the performance will be horrible (n^2), because for each page queried Oracle will have to read through the previous n-1 pages, making performance progressively worse as I get deeper into the result set.
My understanding of the List alternative is that the entire result set will be loaded into memory, since for Oracle Spring Data JPA does not keep a lazy backing result set.
So here are my questions:
1. Is my understanding of the way List works with Spring Data correct? If it's not, then I will just use List.
2. If I am correct, is there an alternative that streams Oracle/JPA result sets?
3. Is there a third way that I am not aware of?
Pageable methods in Spring Data JPA issue an additional select count(*) from ... on every request. I think this is the reason for the problem.
To avoid it, you can use Slice instead of Page as the return type, for example:
Slice<User> getAllBy(Pageable pageable);
Or you can even use a List of entities with pagination:
List<User> getAllBy(Pageable pageable);
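For either return type, walking the whole table could then look roughly like this; a sketch reusing the Slice-returning getAllBy(...) method above, where the repository name, entity, and page size are assumptions (PageRequest.of requires Spring Data 2.x):

Pageable pageable = PageRequest.of(0, 1000);
Slice<User> slice = userRepository.getAllBy(pageable);
while (true) {
    for (User user : slice.getContent()) {
        // consume each record exactly once
    }
    if (!slice.hasNext()) {
        break;
    }
    slice = userRepository.getAllBy(slice.nextPageable());
}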
Question: How can I process (read in) batches of records 1000 at a time and ensure that only the current batch of 1000 records is in memory? Assume my primary key is called 'ID' and my table is called Customer.
Background: This is not for user pagination, it is for compiling statistics about my table. I have limited memory available, therefore I want to read my records in batches of 1000 records at a time. I am only reading in records, they will not be modified. I have read that StatelessSession is good for this kind of thing and I've heard about people using ScrollableResults.
What I have tried: Currently I am working on a custom made solution where I implemented Iterable and basically did the pagination by using setFirstResult and setMaxResults. This seems to be very slow for me but it allows me to get 1000 records at a time. I would like to know how I can do this more efficiently, perhaps with something like ScrollableResults. I'm not yet sure why my current method is so slow; I'm ordering by ID but ID is the primary key so the table should already be indexed that way.
As you might be able to tell, I keep reading bits and pieces about how to do this. If anyone can provide me a complete way to do this it would be greatly appreciated. I do know that you have to set FORWARD_ONLY on ScrollableResults and that calling evict(entity) will take an entity out of memory (unless you're doing second level caching, which I do not yet know how to check if I am or not). However I don't see any methods in the JavaDoc to read in say, 1000 records at a time. I want a balance between my lack of available memory and my slow network performance, so sending records over the network one at a time really isn't an option here. I am using Criteria API where possible. Thanks for any detailed replies.
Using the ROWNUM feature of Oracle may help you.
Let's say we need to fetch 1000 rows (pageSize) of the Customer table and we need the second page (pageNumber).
Creating and calling a query like this may be the answer:
select * from
    (select numbered.*, rownum row_number
     from (select * from Customer order by ID) numbered
     where rownum <= pageSize * pageNumber)
where row_number > pageSize * (pageNumber - 1)
Load entities as read-only.
For HQL
Query.setReadOnly( true );
For Criteria
Criteria.setReadOnly( true );
http://docs.jboss.org/hibernate/orm/3.6/reference/en-US/html/readonly.html#readonly-api-querycriteria
A StatelessSession is quite different from a stateful Session.
Operations performed using a stateless session never cascade to associated instances. Collections are ignored by a stateless session
http://docs.jboss.org/hibernate/orm/3.3/reference/en/html/batch.html#batch-statelesssession
Use flush() and clear() to clean up the session cache.
session.flush();
session.clear();
Question about Hibernate session.flush()
ScrollableResults should work the way you expect.
Do not forget that each loaded item takes up memory until you evict or clear it, so you need to check that this really works as intended.
ScrollableResults in MySQL Connector/J is effectively fake: it loads the entire result set into memory anyway, but I think the Oracle connector works fine.
Using Hibernate's ScrollableResults to slowly read 90 million records
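A minimal sketch of the ScrollableResults route, assuming a Hibernate 3.x-style StatelessSession and the Customer entity from the question; the fetch size and query text are placeholders:

StatelessSession session = sessionFactory.openStatelessSession();
ScrollableResults results = session.createQuery("from Customer c order by c.id")
        .setReadOnly(true)
        .setFetchSize(1000)                 // hint to the JDBC driver, not a hard batch size
        .scroll(ScrollMode.FORWARD_ONLY);
try {
    while (results.next()) {
        Customer customer = (Customer) results.get(0);
        // compile statistics for this row; nothing is cached, so it can be GC'd afterwards
    }
} finally {
    results.close();
    session.close();
}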
If you are looking for alternatives, you may consider this approach (sketched below):
1. Select the primary key of every row that you will process.
2. Chop the keys into chunks.
3. Iterate: select the rows for each PK chunk (using an IN query) and process them however you want.
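A rough sketch of that PK-chunking idea, using the Hibernate Criteria API mentioned in the question; the chunk size and the "id" property name are assumptions:

// 1. load only the primary keys
List<Long> ids = session.createCriteria(Customer.class)
        .setProjection(Projections.id())
        .list();

// 2. + 3. walk the keys in chunks and fetch each chunk with an IN query
int chunkSize = 1000;
for (int from = 0; from < ids.size(); from += chunkSize) {
    List<Long> chunk = ids.subList(from, Math.min(from + chunkSize, ids.size()));
    List<Customer> customers = session.createCriteria(Customer.class)
            .add(Restrictions.in("id", chunk))
            .list();
    // process the chunk, then clear the session so the entities can be garbage collected
    session.clear();
}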
Using LINQ, is there a more efficient way to do this?
var seats = new List<int>();

using (IDataReader reader = qSeats.ExecuteReader())
{
    while (reader.Read())
    {
        seats.Add(Convert.ToInt32(reader.GetInt32(0)));
    }
}
I saw: How do I load data to a list from a dataReader?
However this is the same code I have above, and it seems like there could be faster ways.
Like using LINQ, or seats.AddRange(), or some kind of ToList().
A DataReader is a read-only, forward-only stream of data from a database. The way you are doing it is probably the fastest.
Using the DataReader can increase application performance both by retrieving data as soon as it is available, rather than waiting for the entire results of the query to be returned, and (by default) storing only one row at a time in memory, reducing system overhead.
Retrieving Data Using the DataReader