Oracle SQL*Loader running in direct path mode is much slower than conventional path load - performance

In the past few days I've been playing around with Oracle's SQL*Loader in an attempt to bulk load data into Oracle. After trying out different combinations of options, I was surprised to find that a conventional path load runs much quicker than a direct path load.
A few facts about the problem:
The number of records to load is 60K.
The number of records in the target table, before the load, is 700 million.
The Oracle version is 11g R2.
The data file contains dates, characters (ASCII, no conversion required), integers, and floats. No BLOB/CLOB columns.
The table is hash partitioned. The hash key is the same as the PK.
The table's parallel degree is set to 4, while the server has 16 CPUs.
The index is locally partitioned. The index's parallel degree (from ALL_INDEXES) is 1.
There is only 1 PK and 1 index on the target table. The PK constraint is enforced using the index.
A check on the index partitions revealed that record distribution among partitions is pretty even.
The data file is delimited.
The APPEND option is used.
Selecting and deleting the loaded data through SQL is pretty fast, with almost instant response.
With conventional path, loading completes in around 6 seconds.
With direct path load, loading takes around 20 minutes. The worst run took 1.5 hours to complete, yet the server was not busy at all.
If skip_index_maintenance is enabled, the direct path load completes in 2-3 seconds.
I've tried quite a number of options - UNRECOVERABLE, SORTED INDEXES, MULTITHREADING (I am running SQL*Loader on a multi-CPU server) - but none of them gives a noticeable improvement.
Here's the wait event I kept seeing while SQL*Loader was running in direct path mode:
Event: db file sequential read
P1/2/3: file#, block#, blocks (verified from dba_extents that these are index blocks)
Wait class: User I/O
Does anyone have any idea what has gone wrong with the direct path load? Or is there anything I can check further to really dig into the root cause of the problem? Thanks in advance.
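For reference, a minimal sketch of the kind of invocation and control file involved - the table, column, and file names here are hypothetical, not from the post (direct=false gives the conventional path run, direct=true the direct path run):

sqlldr userid=scott/tiger control=load.ctl log=load.log direct=true

-- load.ctl (hypothetical)
LOAD DATA
INFILE 'data.dat'
APPEND
INTO TABLE big_table
FIELDS TERMINATED BY ','
(load_date DATE "YYYY-MM-DD",
 cust_name CHAR,
 qty       INTEGER EXTERNAL,
 price     FLOAT EXTERNAL)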

I guess you are falling foul of this:
"When loading a relatively small number of rows into a large indexed table
During a direct path load, the existing index is copied when it is merged with the new index keys. If the existing index is very large and the number of new keys is very small, then the index copy time can offset the time saved by a direct path load."
from When to Use a Conventional Path Load in: http://download.oracle.com/docs/cd/B14117_01/server.101/b10825/ldr_modes.htm

Related

Most Efficient Way to Create an Index in Postgres

Is it more efficient to create an index after loading data is complete or before, or does it not matter?
For example, say I have 500 files to load into a Postgres 8.4 DB. Here are the two index creation scenarios I could use:
Create the index when the table is created, then load each file into the table; or
Create the index after all files have been loaded into the table.
The table data itself is about 45 Gigabytes. The index is about 12 Gigabytes. I'm using a standard index. It is created like this:
CREATE INDEX idx_name ON table_name (column_name);
My data loading uses COPY FROM.
Once all the files are loaded, no updates, deletes or additional loads will occur on the table (it's a day's worth of data that will not change). So I wanted to ask which scenario would be most efficient? Initial testing seems to indicate that loading all the files and then creating the index (scenario 2) is faster, but I have not done a scientific comparison of the two approaches.
Your observation is correct - it is much more efficient to load the data first and only then create the index. The reason is that index updates during inserts are expensive; if you create the index after all the data is there, it is much faster.
It goes even further: if you need to import a large amount of data into an existing indexed table, it is often more efficient to drop the existing index first, import the data, and then re-create the index.
One downside of creating the index after importing is that the table must be locked, and that may take a long time (it will not be locked in the opposite scenario). But in PostgreSQL 8.2 and later, you can use CREATE INDEX CONCURRENTLY, which does not lock the table during indexing (with some caveats).
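A minimal sketch of the load-then-index pattern, assuming a hypothetical events table and file path, and reusing the index definition from the question:

-- load all files first (repeat for each of the 500 files)
COPY events FROM '/data/file_001.csv' WITH CSV;
-- then build the index once; CONCURRENTLY avoids locking the table against writes
CREATE INDEX CONCURRENTLY idx_name ON events (column_name);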

Deletes Slow on a BIG Oracle Table

I have a table which has around 180 million records and 40 indexes. A nightly program loads data into this table, but due to certain business conditions we can only delete and then load data into it. The nightly program brings new records, or updates to existing records, from the source system. We have a limited window, about 6 hours, to complete the extract from the source system, perform the business transformations, and finally load the data into this target table so it is ready for users to consume in the morning. The issue we are facing is that the delete from this table takes a lot of time, mainly due to the 40 indexes on the table (an average of 70,000 deletes per hour). I did some digging on the internet and found the options below:
a) Drop or disable indexes before the delete and then rebuild them: After the delete and load, the program needs to perform quite a few updates for which the indexes are critical, and rebuilding 1 index takes almost 1.5 hours due to the enormous amount of data in the table. So this approach is not feasible given the time it takes to rebuild the indexes and the limited time we have to get the data ready for the users.
b) Use bulk delete: Currently the program deletes based on rowid, deleting records one by one as below:
DELETE
FROM <table>
WHERE rowid = g_wpk_tab(ln_i);
g_wpk_tab is the collection which holds the rowids to be deleted; it is processed with FORALL, and I do an intermediate commit every 50,000 row deletes.
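For context, a minimal sketch of what such a FORALL delete looks like - the type, table, and variable names are illustrative, based on the snippet above:

DECLARE
  TYPE t_rowid_tab IS TABLE OF ROWID INDEX BY PLS_INTEGER;
  g_wpk_tab t_rowid_tab;
BEGIN
  -- ... g_wpk_tab is filled densely (1..n) with the rowids to be deleted ...
  FORALL ln_i IN 1 .. g_wpk_tab.COUNT
    DELETE FROM target_table
    WHERE rowid = g_wpk_tab(ln_i);
  COMMIT;
END;
/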
Tom of AskTom says in this discussion that a bulk delete and a row-by-row delete will take almost the same amount of time:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:5033906925164
So this won't be a feasible option either.
c) Regular delete: Tom of AskTom suggests using a regular delete, but even that takes a long time, probably due to the number of indexes on this table.
d) CTAS: This approach is out of the question because the program would need to recreate the table, create the 40 indexes, and then proceed with the updates, and as I mentioned above an index takes at least 1.5 hours to create.
If you could provide me any other suggestions I would really appreciate it.
UPDATE: As of now we have decided to go with the approach suggested by https://stackoverflow.com/users/409172/jonearles and archive instead of delete. The approach is to add a flag to the table to mark the records to be deleted as DELETE, and then have a post-delete program run during the day to remove those records. This ensures that the data is available for users at the right time. Since users consume the data via OBIEE, we plan to set a content-level filter on the table so it ignores the archival column, so users needn't know what to select and what to ignore.
Parallel DML: alter session enable parallel dml;, then delete /*+ parallel */ ...;, then commit;. Sometimes it's that easy.
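A minimal sketch, assuming a hypothetical table name and delete predicate:

ALTER SESSION ENABLE PARALLEL DML;

DELETE /*+ PARALLEL(t, 8) */ FROM target_table t
WHERE  t.delete_flag = 'Y';  -- hypothetical predicate

COMMIT;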
Parallel DDL: alter index your_index rebuild nologging compress parallel;. NOLOGGING reduces the amount of redo generated during the index rebuild. COMPRESS can significantly reduce the size of a non-unique index, which significantly reduces the rebuild time. PARALLEL can also make a huge difference in rebuild time if you have more than one CPU or more than one disk. If you're not already using these options, I wouldn't be surprised if using all of them together improved index rebuilds by an order of magnitude. And then 1.5 hours * 40 indexes / 10 = 6 hours.
Re-evaluate your indexes: Do you really need 40 indexes? It's entirely possible, but many indexes are only created because "indexes are magic". Make sure there's a legitimate reason behind each index. This can be very difficult to verify, since very few people document the reason for an index. Before you ask around, you may want to gather some information: turn on index monitoring to see which indexes are really being used (see the sketch below). And even if an index is used, see how it is used, perhaps through v$sql_plan. It's possible that an index is used for a specific statement where another index would have worked just as well.
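A minimal sketch of index monitoring - the index name is hypothetical, and note that v$object_usage only reports indexes owned by the current user:

ALTER INDEX my_index MONITORING USAGE;

-- after a representative workload has run:
SELECT index_name, used, start_monitoring
FROM   v$object_usage;

ALTER INDEX my_index NOMONITORING USAGE;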
Archive instead of delete: Instead of deleting, just set a flag to mark a row as archived, invalid, deleted, etc. This avoids the immediate overhead of index maintenance. Ignore the rows temporarily and let some other job delete them later. The large downside to this is that it affects every query on the table.
Upgrading is probably out of the question, but 12c has an interesting new feature called in-database archiving. It's a more transparent way of accomplishing the same thing.

Deleting large number of rows of an Oracle table

I have a 250 GB data table from my company with 35 columns. I need to delete around 215 GB of data, which is obviously a large number of rows to delete from the table. This table has no primary key.
What could be the fastest method to delete data from this table? Are there any tools in Oracle for such large deletion processes?
Please suggest the fastest way to do this using Oracle.
As said in the answer above, it's better to move the rows to be retained into a separate table and truncate the original, because of a thing called the HIGH WATERMARK. More details can be found here: http://sysdba.wordpress.com/2006/04/28/how-to-adjust-the-high-watermark-in-oracle-10g-alter-table-shrink/ . A plain delete operation would also overwhelm your UNDO tablespace.
The "recovery model" term applies to MSSQL rather than Oracle, I believe :).
Hope this clarifies the matter a bit.
Thanks.
Do you know which records need to be retained? How will you identify each record?
A solution might be to move the records to be retained to a temporary table, and then truncate the big table. Afterwards, move the retained records back.
Beware that the transaction log file might become very big because of this (but that depends on your recovery model).
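A minimal sketch of that approach in Oracle, assuming a hypothetical table name and retention predicate:

CREATE TABLE rows_to_keep AS
  SELECT * FROM big_table
  WHERE  load_date >= DATE '2014-01-01';  -- hypothetical retention rule

TRUNCATE TABLE big_table;

INSERT /*+ APPEND */ INTO big_table
  SELECT * FROM rows_to_keep;
COMMIT;

DROP TABLE rows_to_keep;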
We had a similar problem a long time ago: a table with 1 billion rows in it, from which we had to remove a very large proportion of the data based on certain rules. We solved it by writing a Pro*C job to extract the data we wanted to keep, apply the rules, and sprintf the data to be kept to a CSV file.
Then we created a sqlldr control file to upload the data using direct path, which won't create undo/redo (and if you need to recover the table, you have the CSV files until you do your next backup anyway).
The sequence was:
Run the Pro*C job to create CSV files of the data
Generate DDL for the indexes
Drop the indexes
Run SQL*Loader using the CSV files
Recreate the indexes using a parallel hint
Analyse the table using degree(8) (the last two steps are sketched below)
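A minimal sketch of those last two steps, with hypothetical index, table, and column names:

CREATE INDEX big_table_ix ON big_table (key_col)
  PARALLEL 8 NOLOGGING;
ALTER INDEX big_table_ix NOPARALLEL;  -- reset the degree after the build

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(user, 'BIG_TABLE', degree => 8);
END;
/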
The amount of parallelism depends on the CPUs and memory of the DB server - we had 16 CPUs and a few gigs of RAM to play with, so it was not a problem.
The extraction of the correct data was the longest part of this.
After a few trial runs, SQL*Loader was able to load the full 1 billion rows (that's a US billion, or 1000 million rows) in under an hour.

SSIS File Load WAY TOO SLOW in Large Destination Table

This is my first question. I've searched a lot of info from different sites, but none of it was conclusive.
Problem:
Daily I'm loading a flat file with an SSIS package executed in a scheduled job in SQL Server 2005, but it's taking TOO MUCH TIME (like 2 1/2 hours), even though the file just has like 300 rows and is approximately 50 MB. This is driving me crazy, because it is affecting the performance of my server.
This is the Scenario:
-My package is just a Data Flow Task that has a Flat File Source and an OLE DB Destination, that's all!
-The Data Access Mode is set to FAST LOAD.
-The table has just 3 indexes, and they are nonclustered.
-My destination table has 366,964,096 records so far and 32 columns.
-I haven't set FastParse on any of the output columns yet (I want to try something else first).
So I've just started to make some tests:
-Rebuilt/reorganized the indexes in the destination table (they were way too fragmented), but this didn't help me much.
-Created another table with the same structure but without all the indexes, and executed the job with the SSIS package loading into this new table: IT JUST TOOK LIKE 1 MINUTE!
So I'm confused - is there something I'm missing?
-Is the SSIS package writing the whole large table into a buffer and then writing it to disk? If not, why the BIG difference in time?
-Is the index maintenance affecting the insertion time that much?
-Should I load the file into this new table as a temporary table and then do a BULK INSERT into the destination table with the records ordered? I thought that the Data Flow Task was much faster than BULK INSERT, but at this point I don't know.
Thanks in advance.
One thing I might look at is whether the large table has any triggers which are causing it to be slower on insert. Also, if the clustered index is on a field that will require a good bit of rearranging of the data during the load, that could cause issues as well.
In SSIS packages, using a merge join (which requires sorting) can cause slowness, but from your description it doesn't appear you did that. I mention it only in case you were doing that and didn't mention it.
If it works fine without the indexes, perhaps you should look into those. What are the data types? How many are there? Maybe you could post their definitions?
You could also take a look at the fill factor of your indexes - especially the clustered index. Having a high fill factor could cause excessive IO on your inserts.
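A minimal sketch of rebuilding with a lower fill factor, using hypothetical index and table names (SQL Server 2005 syntax):

ALTER INDEX IX_big_table ON dbo.BigTable
REBUILD WITH (FILLFACTOR = 80);  -- leave 20% free space per page for future inserts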
Well, I rebuilt the indexes with another fill factor (80%) like Sam told me, and the time dropped significantly. It took 30 minutes instead of almost 3 hours!
I will keep testing to fine-tune the DB. I also didn't have to create a clustered index; I guess with a clustered index the time would drop a lot more.
Thanks to all - I hope this helps someone in the same situation.

How does the Oracle db writer decide whether or not to do multiblock / sequential writes

We have a test system which matches our production system like for like. 6 months ago we did some testing on new hardware, and found the performance limit of our system.
However, now we are re-doing the testing with a view to adding further hardware, and we have found the system doesn't perform as it used to.
The reason for this is that on one specific volume we are now doing random I/O, which used to be sequential. Furthermore, it has turned out that the activity on this volume by Oracle, which is 100% writes, is actually in 8k blocks, whereas before it was in writes of up to 128k.
So something has caused the Oracle db writer to stop batching up its writes.
We've extensively checked our config and cannot see any difference between our test and production systems. We've also opened a call with Oracle, but at this stage information is slow in coming.
So, ultimately, these are 2 related questions:
Can you rely on oracle multiblock writes? Is that a safe thing to engineer/tune your system for?
Why would oracle change its behaviour?
We're not at this stage necessarily blaming Oracle - it may well be reacting to something in the environment - but what?
The OS/arch is solaris/sparc.
Oh, I forgot to mention: the insert table has no indexes and only a couple of foreign keys - it's designed as a bucket for as fast an insert as possible. It's also partitioned on the key field.
Thanks for any tips!
More description of the workload would allow some hypotheses.
If you are updating random blocks, then the DBWR process(es) are going to have little choice but to do single-block writes. Indexes especially are likely to have writes all over the place. If you have an index of character values and need to insert a new 'M' record where there isn't room, it will get a new block for the index and split the current block. You'll have some of the 'M' records in the original block and some in the new block (which will be the last [used] block in the last extent).
I suspect you are most likely to get multi-block writes when bulk inserting into tables, as new blocks will be allocated and written to. Potentially, you initially had (say) 1 GB of extents allocated and were writing into that space. Now you might have reached the limit of that and be creating new extents (say 50 MB), which may be getting allocated from scattered file locations (e.g. space freed by other tables that have been dropped).
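One rough way to compare the two systems, assuming these v$sysstat statistic names exist on your version (they do on 10g/11g, as far as I know) - a low ratio of multi-block to total write requests would support the random-write hypothesis:

SELECT name, value
FROM   v$sysstat
WHERE  name IN ('physical write total IO requests',
                'physical write total multi block requests');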
