JDBC: update millions of records on an Oracle database

I have a design issue. I have a database with millions of records which I need to update.
We will use JDBC because we have to do some processing to calculate the new field values.
It is a one-off job, and I will not need it any more, so I was thinking about something simple. I wanted to create new tables and delete the old ones, but the DBAs do not want that, because the extra storage required would be huge.
I will have to process about 80 million rows, and update 3 fields on each row.
Would a simple JDBC approach, with a setFetchSize(1000) for example, work?
I mean select a, b, c from mutable for update;
then the update ...
Would a JDBC program be able to support the workload?
I was also thinking about using Spring Batch or Easy Batch. But I am wondering if it is worth investing time in them for a one-off job (with some very short timelines).
What is your experience with this?

I think you can do this in JDBC. I would suggest something like the following:
Create two or three threads. Each thread does the following
Create a connection.
Create a prepared statement that retrieves a disjoint subset of the rows
Set the fetch size to 100 or so. Definitely less than 1000.
Create an update statement
Execute the query
Iterate over the result set
For each row, add a batch entry to update that row
After fetch-size rows, execute the batch
Let's assume the fetch size is 100. The first execute does a round trip, which takes time; while that is happening, another thread can run. Once the execute returns, processing the next 100 rows does not require a database round trip: the rows have already been fetched and the updates are being batched locally. After 100 rows, executing the batch does a round trip, so the threads switch again, and fetching the next 100 rows switches them once more. I'm not sure whether two or three threads would be optimal, but if I had to guess I'd try three.
But the above assumes the machine only has a single hardware thread which is not true. Most CPUs support 12 or more hardware threads so I would actually use 30 or so threads depending on what the hardware can support. Even with multiple CPUs you probably don't want more than 50 or so threads as that will start to introduce contention in the database.
The above assumes the external service is fast, much faster than the database. If not, then processing each row is going to wait for the external service; in that case, use more threads. Since the updates will hit the database more slowly, thread contention in the database is less of a concern.
One way to partition the query results into disjoint subsets is as follows:
SELECT c1, c2, etc, rn
FROM (SELECT c1, c2, etc, ROWNUM AS rn FROM ...)
WHERE MOD(rn, number_of_partitions) = ?
Then set the query parameter from 0 to number_of_partitions - 1, one value per thread. You have to compute ROWNUM in a subquery to get this to work right: applied directly in the outer WHERE clause, ROWNUM is only incremented for rows that pass the filter, so the predicate would not produce disjoint subsets.
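Putting those pieces together, here is a minimal per-thread sketch. It assumes a hypothetical table mytable with a key column id and the three columns a, b, c; the computeNewA/B/C methods stand in for whatever per-row calculation you need, the connection details are placeholders, and committing every batch is just one reasonable choice:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** One worker per disjoint ROWNUM partition; adjust names for your schema. */
public class PartitionUpdater implements Runnable {

    private static final String URL  = "jdbc:oracle:thin:@//dbhost:1521/ORCL"; // placeholder
    private static final String USER = "scott";                                // placeholder
    private static final String PASS = "tiger";                                // placeholder
    private static final int PARTITIONS = 30;   // roughly one per worker thread
    private static final int BATCH_SIZE = 100;  // fetch size == update batch size

    private final int partition;

    PartitionUpdater(int partition) {
        this.partition = partition;
    }

    @Override
    public void run() {
        String select = "SELECT id, a, b, c FROM "
                      + "(SELECT id, a, b, c, ROWNUM AS rn FROM mytable) "
                      + "WHERE MOD(rn, ?) = ?";
        String update = "UPDATE mytable SET a = ?, b = ?, c = ? WHERE id = ?";

        try (Connection conn = DriverManager.getConnection(URL, USER, PASS);
             PreparedStatement sel = conn.prepareStatement(select);
             PreparedStatement upd = conn.prepareStatement(update)) {

            conn.setAutoCommit(false);
            sel.setFetchSize(BATCH_SIZE);      // rows pulled per fetch round trip
            sel.setInt(1, PARTITIONS);
            sel.setInt(2, partition);

            int pending = 0;
            try (ResultSet rs = sel.executeQuery()) {
                while (rs.next()) {
                    long id = rs.getLong("id");
                    // computeNewA/B/C stand in for the per-row processing
                    upd.setLong(1, computeNewA(rs.getLong("a")));
                    upd.setLong(2, computeNewB(rs.getLong("b")));
                    upd.setLong(3, computeNewC(rs.getLong("c")));
                    upd.setLong(4, id);
                    upd.addBatch();
                    if (++pending == BATCH_SIZE) {  // one update round trip per batch
                        upd.executeBatch();
                        conn.commit();
                        pending = 0;
                    }
                }
            }
            if (pending > 0) {
                upd.executeBatch();
                conn.commit();
            }
        } catch (SQLException e) {
            throw new RuntimeException("partition " + partition + " failed", e);
        }
    }

    private static long computeNewA(long a) { return a; } // placeholder logic
    private static long computeNewB(long b) { return b; } // placeholder logic
    private static long computeNewC(long c) { return c; } // placeholder logic

    public static void main(String[] args) {
        for (int p = 0; p < PARTITIONS; p++) {
            new Thread(new PartitionUpdater(p)).start();
        }
    }
}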
Do not use updatable result sets. The performance will be abysmal, guaranteed.

Related

Best way to retrieve 150,000 records from Oracle with JDBC

I have been searching for an answer to this today, and it seems the best approach divides opinion somewhat.
I have 150,000 records that I need to retrieve from an Oracle database using JDBC. Is it better to retrieve the data using one select query and allow the JDBC driver to take care of transferring the records from the database using the Oracle cursor and default fetchSize, OR to split the query into batches using LIMIT / OFFSET?
With the LIMIT / OFFSET option, I think the pros are that you can take control over the number of results you return in each chunk. The cons are that the query is executed multiple times, and you also need to run a COUNT(*) up front using the same query to calculate the number of iterations required.
The pros of retrieving all at once are that you rely on the JDBC driver to manage the retrieval of data from the database. The cons are that the setFetchSize() hint can sometimes be ignored meaning that we could end up with a huge resultSet containing all 150,000 records at once!!
Would be great to hear some real life experiences solving similar issues, and recommendations would be much appreciated.
The native way in Oracle JDBC is to prepare the statement for the query, execute it, and fetch the results in a loop with a defined fetchSize.
Yes, of course the details are Oracle Database and JDBC driver version dependent, and in some cases the requested fetchSize can be ignored. But the typical problem is not that you get all records at once; it is that the fetch size gets reset to fetchSize = 1, so you effectively make a round trip for each record.
Your alternative with LIMIT seems meaningful at first glance, but if you investigate the implementation you will probably decide not to use it.
Say you divide the result set into 15 chunks of 10K each:
You open 15 queries, each of them consuming on average half the resources of the original query (OFFSET selects the data and then skips it).
So the only thing you achieve is that the processing takes approximately 7.5x more time.
Best Practice
Take your query, write a simple script with a JDBC fetch, and use a 10046 trace to see the fetch size actually used.
Test with a range of fetch sizes, observe the performance, and choose the optimal one.
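A minimal sketch of that native fetch loop (connection details, table and column names are placeholders; the fetch size of 500 is only a starting value to check against the 10046 trace):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FetchAll {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, col1, col2 FROM mytable")) {

            ps.setFetchSize(500);   // candidate value; verify the real fetch size with a 10046 trace
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    process(rs.getLong("id"), rs.getString("col1"), rs.getString("col2"));
                }
            }
        }
    }

    private static void process(long id, String c1, String c2) {
        // per-row processing goes here
    }
}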
My preference is to maintain a safe execution time with the ability to continue if interrupted. I prefer this approach because it is future-proof and respects memory and execution-time limits. Remember, you're not planning for today, you're planning for six months down the road: what may be 150,000 rows today may be 1.5 million in six months.
I use a "length + 1" recipe to know whether there is more to fetch (see the sketch below), although the count query will let you show a progress bar in % if that is important.
When considering a 150,000-record result set, this is a memory-pressure question, and it depends on the average size of each row. If it is a row with three integers, that's small; if it is a row with a bunch of text fields storing user-profile details, that's potentially very large. So be prudent about which fields you pull.
You also need to ask whether you really need to pull all the records every time. It may be useful to apply a sync pattern and only pull records with an updated date newer than your last pull.
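A sketch of that "length + 1" recipe, assuming Oracle 12c or later for the OFFSET/FETCH syntax and a hypothetical stable sort key id (on older versions the same idea works with a ROWNUM wrapper):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class PagedFetch {
    private static final int PAGE_SIZE = 1000;

    /** Returns one page of ids; asks for PAGE_SIZE + 1 rows to learn whether more remain. */
    static List<Long> fetchPage(Connection conn, long offset, boolean[] hasMore) throws Exception {
        String sql = "SELECT id FROM mytable ORDER BY id "
                   + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";
        List<Long> page = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, offset);
            ps.setInt(2, PAGE_SIZE + 1);        // the "+ 1" sentinel row
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    page.add(rs.getLong(1));
                }
            }
        }
        hasMore[0] = page.size() > PAGE_SIZE;   // sentinel present => another page exists
        if (hasMore[0]) {
            page.remove(page.size() - 1);       // drop the sentinel before returning
        }
        return page;
    }
}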

Optimize ResultSet iteration

I have a SQL query that returns 92,000 rows, and iterating over it with while (resultSet.next()) takes a lot of time. I found that the source of the problem is the ResultSet.next() iteration itself. Do you have any ideas on how I can improve performance and reduce the time spent?
ResultSet.next() actually does networking underneath and communicates with the server to bring more data once you have iterated over the previously fetched rows.
So two tips:
Increase the fetch size on the query
Create indexes in the database. This will improve database performance, which will make the query itself return rows faster.
Have a look also at these two links that deal with your issue and fetch size:
http://www.precisejava.com/javaperf/j2ee/JDBC.htm#JDBC112
http://www.eclipse.org/eclipselink/api/2.3/org/eclipse/persistence/config/QueryHints.html#JDBC_FETCH_SIZE
Increasing the fetch size means fewer trips to the database. The ResultSet acts like a buffer that fetches X rows and refetches once you have iterated over them.
Finally, you could try splitting your one query into 4-5 queries that run concurrently in separate threads, as sketched below.
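If you do try the threaded split, a hedged sketch is to partition on MOD of a numeric key and scan each slice over its own connection (names and connection details are placeholders; measure whether this actually beats a single well-tuned fetch size before committing to it):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelScan {
    private static final int SLICES = 4;
    private static final String URL = "jdbc:oracle:thin:@//dbhost:1521/ORCL"; // placeholder

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(SLICES);
        for (int slice = 0; slice < SLICES; slice++) {
            final int s = slice;
            pool.submit(() -> scanSlice(s));   // each slice reads a disjoint subset
        }
        pool.shutdown();
    }

    private static void scanSlice(int slice) {
        String sql = "SELECT id, col1 FROM mytable WHERE MOD(id, ?) = ?";
        try (Connection conn = DriverManager.getConnection(URL, "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setFetchSize(500);      // still tune the fetch size per slice
            ps.setInt(1, SLICES);
            ps.setInt(2, slice);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process the row
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}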

How do I find out the byte size of the results returned by all distinct SQL queries executed in an Oracle db?

For some sizing project, I need to find out the byte size of the results returned by all distinct SQL queries executed in an Oracle db. Any suggestions?
What constitutes a "distinct SQL query"? If I have a SQL statement
SELECT ename
FROM emp
WHERE empno = :1
which is executed 1000 times with 1000 different bind variable values, does that count as 1 distinct SQL query? Or 1000? If that same query was executed 200 times with the same bind variable value, does that count as 1 distinct SQL query? Or 200?
Are you counting recursive SQL? SQL generated by background processes and jobs?
What are you attempting to size that would depend on the total size of the results returned by all SQL statements? That doesn't seem like a sensible metric. Since we're talking about a database hosting company, are you sure they're not looking for reasonable metrics like how much CPU you're using, how much I/O you're doing, how much RAM you're consuming, how much data you're transferring over the network, and so on?
I cannot fathom, for example, how you would possibly handle capacity planning for a database knowing that query 1 returns 10 bytes of data on average per execution. It makes far more sense to ask how much work that query has to do to produce its results, how much of the data has to be sent over the network, etc. And it makes far more sense to get aggregate data (say, CPU usage during the peak snapshot interval) rather than trying to get data for each SQL statement.
If you really wanted to, and assuming you're licensed to use the AWR, I suppose you could query DBA_HIST_SQLSTAT to compute the average number of fetches per execution for a given SQL_ID, and get the statement text from DBA_HIST_SQLTEXT. You could then write a bit of code using DBMS_SQL to parse the query and describe the results to determine the maximum size of a single row, and multiply that by the average number of rows fetched per execution to get the number of bytes fetched per execution for each distinct SQL statement. That's a pretty hefty amount of effort for some metrics that don't seem particularly useful, but it's definitely possible.
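If you really do go down that road, a rough starting point might look like the following. It assumes you are licensed for AWR and uses ROWS_PROCESSED_DELTA as a stand-in for rows fetched; the view and column names are written from memory, so verify them against the reference for your database version before relying on this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AvgRowsPerExec {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"; // placeholder
        // Column names (ROWS_PROCESSED_DELTA, EXECUTIONS_DELTA) are an assumption;
        // check DBA_HIST_SQLSTAT in your version's documentation.
        String sql = "SELECT sql_id, "
                   + "       SUM(rows_processed_delta) AS rows_processed, "
                   + "       SUM(executions_delta)     AS executions "
                   + "  FROM dba_hist_sqlstat "
                   + " GROUP BY sql_id "
                   + "HAVING SUM(executions_delta) > 0";
        try (Connection conn = DriverManager.getConnection(url, "perfstat", "password");
             PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                double avgRows = rs.getLong("rows_processed") / (double) rs.getLong("executions");
                // Next steps per the answer above: look up the text in DBA_HIST_SQLTEXT by sql_id,
                // describe the statement with DBMS_SQL to estimate a per-row byte size,
                // then multiply by avgRows to estimate bytes fetched per execution.
                System.out.printf("%s  avg rows/exec = %.1f%n", rs.getString("sql_id"), avgRows);
            }
        }
    }
}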

Optimizing massive insert performance...?

Given: SQL Server 2008 R2. Quite some speedy data discs; log discs lagging.
Required: LOTS LOTS LOTS of inserts. Like 10,000 to 30,000 rows per second into a simple table with two indices. The inserts have an intrinsic order and will not repeat, so the order of inserts need not be maintained in the short term (i.e. multiple parallel inserts are OK).
So far: accumulating data into a queue. Regularly (async thread pool) emptying up to 1024 entries into a work item that gets queued. The thread pool (custom class) has 32 possible threads and opens 32 connections.
Problem: performance is off by a factor of 300... only about 100 to 150 rows are inserted per second. Log wait time is up to 40%-45% of processing time (ms per second) in SQL Server. Server CPU load is low (4% to 5% or so).
Not usable: bulk insert. The data must be written to disc as close to real time as possible. This is pretty much an archival process for data running through the system, but there are queries that need access to the data regularly. I could try dumping it to disc and using bulk upload 1-2 times per second... I will give this a try.
Anyone have a smart idea? My next step is moving the log to a fast disc set (128 GB modern SSD) and seeing what happens then. That will probably give a significant performance boost, but even then... the question is whether / what is feasible.
So, please fire away with the smart ideas.
OK, answering myself: I am going to give SqlBulkCopy a try, batching up to 65,536 entries and flushing them out every second in an async fashion. I will report on the gains.
I'm going through the exact same issue here, so I'll go through the steps I'm taking to improve my performance.
Separate the log and the dbf file onto different spindle sets
Use the simple recovery model
You didn't mention any indexing requirements other than the fact that the order of inserts isn't important - in this case clustered indexes on anything other than an identity column shouldn't be used.
Start your scaling of concurrency again from 1 and stop when your performance flattens out; anything over this will likely hurt performance.
Rather than dropping to disk to bcp, and as you are using SQL Server 2008, consider inserting multiple rows at a time; this statement inserts three rows in a single SQL call:
INSERT INTO table VALUES ( 1,2,3 ), ( 4,5,6 ), ( 7,8,9 )
I was topping out at ~500 distinct inserts per second from a single thread. After ruling out the network and CPU (0 on both client and server), I assumed that disk I/O on the server was to blame; however, inserting in batches of three got me 1,500 inserts per second, which rules out disk I/O.
It's clear that the MS client library has an upper limit baked into it (and a dive into Reflector shows some hairy async completion code).
Batching in this way, waiting for x events to be received before calling insert, has me now inserting at ~2,700 inserts per second from a single thread, which appears to be the upper limit for my configuration.
Note: if you don't have a constant stream of events arriving at all times, you might consider adding a timer that flushes your inserts after a certain period (so that you see the last event of the day!)
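A hedged sketch of that size-or-timer flush (written against JDBC rather than the .NET client the question uses, with invented connection details, table, and columns; the one-second interval and 512-row batch are just starting values):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BatchingWriter implements Runnable {
    private static final int BATCH_SIZE = 512;
    private static final long FLUSH_MS = 1000;   // flush at least once a second

    private final BlockingQueue<long[]> queue = new ArrayBlockingQueue<>(100_000);

    /** Producers call this with one row of values (a, b, c). */
    public void submit(long[] row) throws InterruptedException {
        queue.put(row);
    }

    @Override
    public void run() {
        String url = "jdbc:sqlserver://dbhost;databaseName=archive"; // placeholder
        String sql = "INSERT INTO events (a, b, c) VALUES (?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            conn.setAutoCommit(false);
            int pending = 0;
            long deadline = System.currentTimeMillis() + FLUSH_MS;
            while (!Thread.currentThread().isInterrupted()) {
                long wait = Math.max(1, deadline - System.currentTimeMillis());
                long[] row = queue.poll(wait, TimeUnit.MILLISECONDS);
                if (row != null) {
                    ps.setLong(1, row[0]);
                    ps.setLong(2, row[1]);
                    ps.setLong(3, row[2]);
                    ps.addBatch();
                    pending++;
                }
                // flush on size or on the timer so the last events of the day get written too
                if (pending >= BATCH_SIZE
                        || (pending > 0 && System.currentTimeMillis() >= deadline)) {
                    ps.executeBatch();
                    conn.commit();
                    pending = 0;
                }
                if (System.currentTimeMillis() >= deadline) {
                    deadline = System.currentTimeMillis() + FLUSH_MS;
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}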
Some suggestions for increasing insert performance:
Increase ADO.NET BatchSize
Choose the target table's clustered index wisely, so that inserts won't lead to clustered index node splits (e.g. autoinc column)
Insert into a temporary heap table first, then issue one big "insert-by-select" statement to push all that staging-table data into the actual target table (see the sketch after this list)
Apply SqlBulkCopy
Choose "Bulk Logged" recovery model instad of "Full" recovery model
Place a table lock before inserting (if your business scenario allows for it)
Taken from Tips For Lightning-Fast Insert Performance On SqlServer
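For the staging-table suggestion above, a hedged sketch might look like this (JDBC rather than ADO.NET, invented table names, and a single writer assumed so nothing lands in the staging heap between the copy and the truncate):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class StagingFlush {
    /** Batch rows into an index-free heap staging table, then move them in one set-based statement. */
    static void flush(Connection conn, long[][] rows) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO staging_events (a, b, c) VALUES (?, ?, ?)")) {
            for (long[] r : rows) {
                ps.setLong(1, r[0]);
                ps.setLong(2, r[1]);
                ps.setLong(3, r[2]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
        try (Statement st = conn.createStatement()) {
            // one set-based push into the real (indexed) table, then empty the heap
            st.executeUpdate("INSERT INTO events (a, b, c) SELECT a, b, c FROM staging_events");
            st.executeUpdate("TRUNCATE TABLE staging_events");
        }
        conn.commit();
    }
}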

SQL Server - Merging large tables without locking the data

I have a very large set of data (~3 million records) which needs to be merged with updates and new records on a daily schedule. I have a stored procedure that actually breaks up the record set into 1000 record chunks and uses the MERGE command with temp tables in an attempt to avoid locking the live table while the data is updating. The problem is that it doesn't exactly help. The table still "locks up" and our website that uses the data receives timeouts when attempting to access the data. I even tried splitting it up into 100 record chunks and even tried a WAITFOR DELAY '000:00:5' to see if it would help to pause between merging the chunks. It's still rather sluggish.
I'm looking for any suggestions, best practices, or examples on how to merge large sets of data without locking the tables.
Thanks
Change your front end to use NOLOCK or READ UNCOMMITTED when doing the selects.
You can't NOLOCK a MERGE, INSERT, or UPDATE, as the records must be locked in order to perform the update. However, you can NOLOCK the SELECTs.
Note that you should use this with caution. If dirty reads are okay, then go ahead. However, if the reads require the updated data then you need to go down a different path and figure out exactly why merging 3M records is causing an issue.
I'd be willing to bet that most of the time is spent reading data from the disk during the merge command and/or working around low-memory situations. You might be better off simply stuffing more RAM into your database server.
An ideal amount would be enough RAM to pull the whole database into memory as needed. For example, if you have a 4 GB database, make sure you have 8 GB of RAM... in an x64 server, of course.
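For the front-end reads, that hint would look something like the following sketch (invented table and column names; only appropriate if dirty reads really are acceptable):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

class DirtyRead {
    // Dirty read of the live table while the merge runs; acceptable only if
    // slightly stale or uncommitted rows are fine for the website.
    static String lookup(Connection conn, long id) throws Exception {
        String sql = "SELECT col1 FROM dbo.LiveTable WITH (NOLOCK) WHERE id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}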
I'm afraid I've had quite the opposite experience. We were performing updates and insertions where the source table had only a fraction of the number of rows of the target table, which was in the millions.
When we combined the source table records across the entire operational window and then performed the MERGE just once, we saw a 500% increase in performance. My explanation for this is that you are paying for the up front analysis of the MERGE command just once instead of over and over again in a tight loop.
Furthermore, I am certain that merging 1.6 million rows (source) into 7 million rows (target), as opposed to 400 rows into 7 million rows over 4000 distinct operations (in our case) leverages the capabilities of the SQL server engine much better. Again, a fair amount of the work is in the analysis of the two data sets and this is done only once.
Another question I have to ask as well is whether you are aware that the MERGE command performs much better with indexes on both the source and target tables? I would like to refer you to the following link:
http://msdn.microsoft.com/en-us/library/cc879317(v=SQL.100).aspx
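In code, the single-MERGE approach amounts to staging the whole window's changes and issuing one statement (a sketch with invented table and column names; note the terminating semicolon that MERGE requires):

import java.sql.Connection;
import java.sql.Statement;

class DailyMerge {
    /** One MERGE over the whole window's staged changes instead of thousands of small chunks. */
    static int mergeAll(Connection conn) throws Exception {
        String merge =
            "MERGE INTO dbo.Target AS t " +
            "USING dbo.DailyChanges AS s " +
            "   ON t.Id = s.Id " +
            "WHEN MATCHED THEN " +
            "   UPDATE SET t.Col1 = s.Col1, t.Col2 = s.Col2 " +
            "WHEN NOT MATCHED BY TARGET THEN " +
            "   INSERT (Id, Col1, Col2) VALUES (s.Id, s.Col1, s.Col2);";
        try (Statement st = conn.createStatement()) {
            return st.executeUpdate(merge);   // rows affected across updates and inserts
        }
    }
}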
From personal experience, the main problem with MERGE is that since it takes page locks, it precludes any concurrency in your INSERTs directed at a table. So if you go down this road, it is fundamental that you batch all updates that will hit a table through a single writer.
For example: we had a table on which INSERT took a crazy 0.2 seconds per entry, most of that time seemingly wasted on transaction latching. We switched it over to MERGE, and some quick tests showed that it allowed us to insert 256 entries in 0.4 seconds, or even 512 in 0.5 seconds. We tested this with load generators and all seemed fine, until it hit production and everything blocked to hell on the page locks, resulting in a much lower total throughput than with the individual INSERTs.
The solution was not only to batch the entries from a single producer into one MERGE operation, but also to batch the batches from all producers going to an individual DB into a single MERGE operation through an additional level of queueing (previously also a single connection per DB, but using MARS to interleave all the producers' calls to the stored procedure doing the actual MERGE transaction). This way we were then able to handle many thousands of INSERTs per second without problems.
Having the NOLOCK hints on all of your front-end reads is an absolute must, always.
