I'm trying to develop a Scala microservice for data management for an Oracle database. I'm using JDBC drivers to connect to it.
Reading the answers to the performance questions comparing the JDBC driver with the .NET one, I understood that one of the more effective ways to tune JDBC read performance is to set the fetch size through the method ResultSet.setFetchSize.
I tried connecting to an Oracle database to fetch real data for a real business case, with a fixed number of records returned by the DB, and the elapsed time I measured showed an exponential behavior. In particular, fetching 10,000 rows from the database without setting the fetch size took a ridiculously long time, but specifying a fetch size larger than 1,000 gained only a little (roughly 100 ms out of 1 s).
Here are my questions on this topic:
I suppose that increasing the fetch size too much would waste resources for little gain, so is there a method, even a rough one, to estimate the size of the ResultSet before actually fetching it? I've read about the following technique:
result.last();
result.getRow();
but this would mean scrolling through the entire ResultSet, and I was wondering whether there is any technique, even a roughly accurate one, to evaluate the count;
I've estimated that a good fetch size would be 1/10th of the number of records selected, but is there a documented rule for automatically estimating the correct fetch size in the largest number of cases?
Please do not set the fetch size too large unless you have a network bottleneck between the application and the database. The larger the fetch size, the more memory is consumed.
In my experience, 1024 - 2048 will lead to the best performance most of the time. See
https://docs.oracle.com/javase/tutorial/jdbc/basics/retrieving.html, which discusses some details, but the default setting is usually best.
Do not try to get the total number of rows in the result set; it is not a best practice.
And finally, I want to point out that, based on a great many rounds of JVM and JIT optimization, the bottleneck never seems to be the JDBC fetch size once you have set it to 1000-2000, but rather SQL performance, the application, resource limits, and so on.
I have been searching for an answer to this today, and it seems the best approach divides opinion somewhat.
I have 150,000 records that I need to retrieve from an Oracle database using JDBC. Is it better to retrieve the data with one select query, letting the JDBC driver take care of transferring the records from the database using an Oracle cursor and the default fetchSize, or to split the query into batches using LIMIT / OFFSET?
With the LIMIT / OFFSET option, I think the pros are that you can take control over the number of results you return in each chunk. The cons are that the query is executed multiple times, and you also need to run a COUNT(*) up front using the same query to calculate the number of iterations required.
The pros of retrieving all at once are that you rely on the JDBC driver to manage the retrieval of data from the database. The cons are that the setFetchSize() hint can sometimes be ignored, meaning that we could end up with a huge ResultSet containing all 150,000 records at once!
Would be great to hear some real life experiences solving similar issues, and recommendations would be much appreciated.
The native way in Oracle JDBC is to call prepareStatement for the query, run executeQuery, and fetch the results in a loop with a defined fetchSize.
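A minimal sketch of that loop, using only the standard java.sql API. The class name FetchLoop and both method names are made up for illustration, and estimatedRoundTrips is a back-of-the-envelope helper that assumes the driver honours the hint exactly, which it is not obliged to do.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper class; not part of any driver or library.
final class FetchLoop {
    private FetchLoop() {}

    /** Runs the query, asks the driver for fetchSize rows per round trip, and counts the rows read. */
    static long streamRows(Connection conn, String sql, int fetchSize) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setFetchSize(fetchSize); // only a hint: the driver may ignore or adjust it
            try (ResultSet rs = ps.executeQuery()) {
                long rows = 0;
                while (rs.next()) {
                    rows++; // real code would read the columns here
                }
                return rows;
            }
        }
    }

    /** Rough number of fetch round trips if the hint is honoured exactly (an assumption). */
    static long estimatedRoundTrips(long rowCount, int fetchSize) {
        if (fetchSize <= 0) {
            throw new IllegalArgumentException("fetchSize must be positive");
        }
        return (rowCount + fetchSize - 1) / fetchSize; // ceiling division
    }
}
```

For 150,000 rows this estimate gives 15,000 round trips at a fetch size of 10 and 150 at a fetch size of 1,000.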
Yes, of course the details depend on the Oracle Database and JDBC driver versions, and in some cases the requested fetchSize can be ignored. But the typical problem is that the requested fetch size is reset to fetchSize = 1 and you effectively make a round trip for each record (not that you get all records at once).
Your alternative with LIMIT seems meaningful at first sight. But if you investigate the implementation, you will probably decide not to use it.
Say you divide the result set into 15 chunks of 10K each:
You open 15 queries, each of them consuming on average half the resources of the original query (OFFSET selects the data and then skips it).
So the only thing you will achieve is that the processing takes approximately 7.5x more time.
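One way to sanity-check that estimate is to count the rows each chunk query has to touch. The sketch below rests on a simplifying assumption of mine, not something stated in the answer: the cost of a query is proportional to the rows it reads, including the rows that OFFSET skips. The class and method names are hypothetical.

```java
// Hypothetical cost model: cost is proportional to rows read, skipped rows included.
final class OffsetPagingCost {
    private OffsetPagingCost() {}

    /**
     * Cost of reading a result in equal chunks via OFFSET, relative to one full scan.
     * Chunk i (counting from 1) reads i chunk-lengths of rows, so the total is
     * n(n+1)/2 chunk-lengths against n chunk-lengths for a single pass.
     */
    static double relativeCost(int chunks) {
        if (chunks <= 0) {
            throw new IllegalArgumentException("chunks must be positive");
        }
        long chunkLengthsRead = (long) chunks * (chunks + 1) / 2;
        return (double) chunkLengthsRead / chunks;
    }
}
```

Under this model, relativeCost(15) is 8.0, in the same ballpark as the roughly 7.5x quoted above.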
Best Practice
Take your query, write a simple script with a JDBC fetch, and use a 10046 trace to see the fetch size effectively used.
Test with a range of fetch sizes and observe the performance; choose the optimal one.
My preference is to maintain a safe execution time with the ability to continue if interrupted. I prefer this approach because it is future-proof and respects memory and execution time limits. Remember, you're not planning for today, you're planning for six months down the road. What may be 150,000 today may be 1.5 million in six months.
I use a length + 1 recipe to know if there is more to fetch (request one row more than the page length; if it comes back, another page exists), although the count query will let you show a progress bar in % if that is important.
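The length + 1 recipe can be sketched as follows. This is my reading of it rather than the answerer's code: the query is issued with a limit of pageLength + 1, and the helper below (a hypothetical class named PageWindow) trims the extra row and reports whether it was present.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: wraps rows that were fetched with a limit of pageLength + 1.
final class PageWindow<T> {
    final List<T> items;   // at most pageLength rows
    final boolean hasMore; // true when the extra (pageLength + 1)-th row came back

    private PageWindow(List<T> items, boolean hasMore) {
        this.items = items;
        this.hasMore = hasMore;
    }

    /** fetched must be the result of a query limited to pageLength + 1 rows. */
    static <T> PageWindow<T> of(List<T> fetched, int pageLength) {
        boolean hasMore = fetched.size() > pageLength;
        List<T> kept = hasMore ? fetched.subList(0, pageLength) : fetched;
        return new PageWindow<>(Collections.unmodifiableList(new ArrayList<>(kept)), hasMore);
    }
}
```

The caller keeps requesting pages while hasMore is true, so no COUNT(*) is needed unless a progress indicator is wanted.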
When considering a 150,000-record result set, this is a memory pressure question. It will depend on the average size of each row. If it is a row with three integers, that's small. If it is a row with a bunch of text elements storing user profile details, then that's potentially very large. So be prudent about which fields you're pulling.
You also need to ask whether you need to pull all the records all the time. It may be useful to apply a sync pattern: only pull records with an updated date newer than your last pull.
According to Postgres pg_stat_statements documentation:
The module requires additional shared memory proportional to
pg_stat_statements.max. Note that this memory is consumed whenever the
module is loaded, even if pg_stat_statements.track is set to none.
and also:
The representative query texts are kept in an external disk file, and
do not consume shared memory. Therefore, even very lengthy query texts
can be stored successfully. However, if many long query texts are
accumulated, the external file might grow unmanageably large.
From these it is unclear what the actual memory cost of a high pg_stat_statements.max would be, say at 100k or 500k (default is 5k). Is it safe to set the level that high, and what could be the negative repercussions of such high levels? Would aggregating statistics into an external database via logstash/fluentd be a preferred approach above certain sizes?
1.
From what I have read, it hashes the query and keeps the hash in the DB, saving the text to the FS. So the next concern is more to be expected than overloaded shared memory:
if many long query texts are accumulated, the external file might grow
unmanageably large
The hash of the text is so much smaller than the text that I think you should not worry about the extension's memory consumption when comparing long queries. Especially knowing that the extension uses the query analyser (which will run for EVERY query ANYWAY):
the queryid hash value is computed on the post-parse-analysis
representation of the queries
Setting pg_stat_statements.max 10 times bigger should take 10 times more shared memory, I believe. The growth should be linear. The documentation does not say so, but logically it should be.
There is no answer as to whether it is safe to set the setting to a particular value, because there is no data on your other configuration values and hardware. But as growth should be linear, consider this answer: "if you set it to 5K and query runtime has grown by almost nothing, then setting it to 50K will prolong it by almost nothing times ten". BTW, my question: who is going to dig through 50000 slow statements? :)
2.
This extension already does a pre-aggregation for "dis-valued" statements. You can select it straight from the DB, so moving the data to another DB and selecting it there will only give you the benefit of unloading the original DB and loading another. In other words, you save 50MB for a query on the original but spend the same on another. Does it make sense? For me, yes. This is what I do myself. But I also save execution plans for statements (which is not part of the pg_stat_statements extension). I believe it depends on what you have. There is definitely no need for that just because of the number of queries. Again, unless you have such a big file that the extension may do the following:
As a recovery method if that happens, pg_stat_statements may choose to
discard the query texts, whereupon all existing entries in the
pg_stat_statements view will show null query fields
In JDBC the default fetch size is 10, but I guess that's not the best fetch size when I have a million rows. I understand that a fetch size that is too low reduces performance, but so does one that is too high.
How can I find the optimal size? And does this have an impact on the DB side, does it chew up a lot of memory?
If your rows are large then keep in mind that all the rows you fetch at once will have to be stored in the Java heap in the driver's internal buffers. In 12c, Oracle has VARCHAR(32k) columns, if you have 50 of those and they're full, that's 1,600,000 characters per row. Each character is 2 bytes in Java. So each row can take up to 3.2MB. If you're fetching rows 100 by 100 then you'll need 320MB of heap to store the data and that's just for one Statement. So you should only increase the row prefetch size for queries that fetch reasonably small rows (small in data size).
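The arithmetic above can be written as a small worst-case estimator. This is only a sketch: the class name is hypothetical, 32k is taken as 32,000 characters to match the figures in the answer, and the model ignores any per-row overhead inside the driver.

```java
// Hypothetical worst-case estimate of the driver's row buffer for character columns.
final class RowBufferEstimate {
    private RowBufferEstimate() {}

    /**
     * columns           number of character columns in the select list
     * maxCharsPerColumn declared maximum length of each column, in characters
     * bytesPerChar      2 for Java's UTF-16 chars
     * fetchSize         rows buffered per round trip
     */
    static long worstCaseBytes(int columns, long maxCharsPerColumn, int bytesPerChar, int fetchSize) {
        return (long) columns * maxCharsPerColumn * bytesPerChar * fetchSize;
    }
}
```

With 50 columns of 32,000 characters, worstCaseBytes(50, 32_000, 2, 1) is 3,200,000 bytes per row and worstCaseBytes(50, 32_000, 2, 100) is 320,000,000 bytes, matching the 3.2MB and 320MB figures above.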
As with (almost) anything, the way to find the optimal size for a particular parameter is to benchmark the workload you're trying to optimize with different values of the parameter. In this case, you'd need to run your code with different fetch size settings, evaluate the results, and pick the optimal setting.
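A rough harness for such a comparison might look like the sketch below. Everything in it is hypothetical scaffolding: the workload callback is expected to run the real query with the given fetch size (wrapping any SQLException), and the best of several repetitions is kept to dampen random variation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntConsumer;

// Hypothetical timing harness; the workload receives the fetch size to try.
final class FetchSizeBenchmark {
    private FetchSizeBenchmark() {}

    /** Returns, for each fetch size, the best wall-clock time in nanoseconds over the repetitions. */
    static Map<Integer, Long> time(int[] fetchSizes, int repetitions, IntConsumer workload) {
        if (repetitions < 1) {
            throw new IllegalArgumentException("repetitions must be at least 1");
        }
        Map<Integer, Long> bestNanos = new LinkedHashMap<>(); // keeps the order of fetchSizes
        for (int size : fetchSizes) {
            long best = Long.MAX_VALUE;
            for (int i = 0; i < repetitions; i++) {
                long start = System.nanoTime();
                workload.accept(size);
                best = Math.min(best, System.nanoTime() - start);
            }
            bestNanos.put(size, best);
        }
        return bestNanos;
    }
}
```

A realistic run would pass sizes such as 10, 100, 1000 and a workload that executes the production query against representative data; timings from a warm JVM and a warm database cache are the ones worth comparing.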
In the vast majority of cases, people pick a fetch size of 100 or 1000 and that turns out to be a reasonably optimal setting. The performance difference among values at that point is generally pretty minimal; you would expect most of the performance difference between runs to be the result of normal random variation rather than of changes in the fetch size. If you're trying to get the last iota of performance for a particular workload in a particular configuration, you can certainly do that analysis. For most folks, though, 100 or 1000 is good enough.
The default value of the JDBC fetch size property is driver-specific, and for the Oracle driver it is indeed 10.
For some queries fetch size should be larger, for some smaller.
I think a good idea is to set a global fetch size for the whole project and override it for individual queries where it should be bigger.
Look at this article:
http://makejavafaster.blogspot.com/2015/06/jdbc-fetch-size-performance.html
It describes how to set the fetch size globally and override it for carefully selected queries using different approaches: Hibernate, JPA, Spring JDBC templates, or the core JDBC API. It also includes a simple benchmark for an Oracle database.
As a rule of thumb you can:
set fetchsize to 50 - 100 as global setting
set fetchsize to 100 - 500 (or even more) for individual queries
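With plain JDBC against Oracle, the global part of this rule of thumb can be sketched through the driver's defaultRowPrefetch connection property. The helper class and method names below are hypothetical, and the property should be checked against the documentation of the driver version in use.

```java
import java.util.Properties;

// Hypothetical helper that builds connection properties with a global prefetch default.
final class PrefetchConfig {
    private PrefetchConfig() {}

    static Properties withDefaultPrefetch(String user, String password, int rows) {
        if (rows <= 0) {
            throw new IllegalArgumentException("rows must be positive");
        }
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // Oracle JDBC connection property for the default number of rows prefetched per round trip.
        props.setProperty("defaultRowPrefetch", Integer.toString(rows));
        return props;
    }
}
```

The properties would be passed to DriverManager.getConnection(url, props), and an individual statement that needs a larger value can still call setFetchSize(500) before executing.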
The Oracle JDBC driver does have a default prefetch size of 10. Check out
OracleConnection.getDefaultRowPrefetch in the JDBC Javadoc.
tl;dr
How to figure out the optimal fetch size for the select query
Evaluate some maximal amount of memory (bytesInMemory)
4 MB, 8 MB or 16 MB are good starting points.
Evaluate the maximal size of each column in the query and sum up those sizes (bytesPerRow)
...
Use this formula: fetch_size = bytesInMemory / bytesPerRow
You may adjust the formula result to have predictable values.
Last words: test with different bytesInMemory values and/or different queries to assess the results in your application.
The above response was inspired by the Apache MetaModel project (in the Apache Attic as of this writing). They found an answer to this exact question. To do so, they built a class for calculating a fetch size given a maximal memory amount. This class is based on an Oracle whitepaper explaining how Oracle JDBC drivers manage memory.
Basically, the class is constructed with a maximal memory amount (bytesInMemory). Later, it is asked for a fetch size for a Query (an Apache MetaModel class). The Query class helps find the number of bytes (bytesPerRow) that a typical row of the query results would have. The fetch size is then calculated with the formula below:
fetch_size = bytesInMemory / bytesPerRow
The fetch size is also adjusted to stay in this range: [1, 25000]. Other adjustments are made along the way during the calculation of bytesPerRow, but that's too much detail for here.
This class is named FetchSizeCalculator. The link leads to the full source code.
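The formula and the clamping range can be sketched in a few lines. This is not the MetaModel source: the class name SimpleFetchSize is made up, and the per-column adjustments the real class performs when computing bytesPerRow are left out.

```java
// Hypothetical, simplified take on the formula; not the Apache MetaModel implementation.
final class SimpleFetchSize {
    static final int MIN_FETCH_SIZE = 1;
    static final int MAX_FETCH_SIZE = 25_000;

    private SimpleFetchSize() {}

    /** fetch_size = bytesInMemory / bytesPerRow, clamped to [1, 25000]. */
    static int forBudget(long bytesInMemory, long bytesPerRow) {
        if (bytesPerRow <= 0) {
            throw new IllegalArgumentException("bytesPerRow must be positive");
        }
        long raw = bytesInMemory / bytesPerRow; // integer division rounds down
        return (int) Math.max(MIN_FETCH_SIZE, Math.min(MAX_FETCH_SIZE, raw));
    }
}
```

With an 8 MB budget (8 * 1024 * 1024 bytes), rows of 1,000 bytes give a fetch size of 8388, while rows of 3,200,000 bytes give 2; very narrow rows are capped at 25000.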
I read about Bulk Collect and wrote some code using it (not deployed yet). The total number of rows returned is in the vicinity of 80.000. I limited the number of rows returned in one batch to 10.000, but there is no basis for this number; I simply improvised.
What would be a good method for determining how to limit the Bulk Collect?
As with anything, the best approach would be to benchmark the different options.
Realistically, though, in the vast majority of cases, there isn't any appreciable benefit to a limit much higher than 100. With a limit of 100, you're eliminating 99% of the context shifts. It's relatively unlikely that the remaining 1% of the context shifts account for a meaningful fraction of the execution time of your code. Reducing the context shifts further probably does nothing for performance and just causes you to use more valuable PGA memory.
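The 99% figure follows from simple counting, shown here in Java to stay consistent with the other sketches in this thread collection. The model is an assumption: one context shift per fetch call and nothing else. The class and method names are hypothetical.

```java
// Hypothetical model: one context shift per BULK COLLECT fetch of up to `limit` rows.
final class BulkCollectLimit {
    private BulkCollectLimit() {}

    /** Number of fetch calls needed to read `rows` rows with the given LIMIT. */
    static long fetchCalls(long rows, int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        return (rows + limit - 1) / limit; // ceiling division
    }

    /** Share of context shifts avoided compared with fetching one row at a time. */
    static double shareEliminated(int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        return 1.0 - 1.0 / limit;
    }
}
```

For 80,000 rows, a limit of 100 means 800 fetch calls instead of 80,000, and a limit of 10,000 means 8; going from 100 to 10,000 only moves the share eliminated from 99% to 99.99%.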
Could you please tell me whether the cost of a query depends on the amount of data in the database at that time?
That is, does the cost vary with the amount of data?
Thanks,
Savitha
The answer is yes: the data size will influence the query execution plan, which is why you must test your queries with real amounts of data (and, if possible, realistic data, as the distribution of the data is also important and will influence the query cost).
Every database management system is different in some respects, and what works well for Oracle, MS SQL, or PostgreSQL may not work well for MySQL, and the other way around. Even storage engines have very important differences which can affect performance dramatically.
Of course, a large amount of data will slow down the process. When you fire a query, it needs to traverse and search the database, and with more data that takes more time. The three main issues you should be concerned with if you're dealing with very large data sets are buffers, indexes, and joins.