Deleting large number of rows of an Oracle table - oracle

I have a data table from company which is of 250Gb having 35 columns. I need to delete around 215Gb of data which
is obviously large number of rows to delete from the table. This table has no primary key.
What could be the fastest method to delete data from this table? Are there any tools in Oracle for such large deletion processes?
Please suggest me the fastest way to do this with using Oracle.

As it is said in the answer above it's better to move the rows to be retained into a separate table and truncate the table because there's a thing called HIGH WATERMARK. More details can be found here http://sysdba.wordpress.com/2006/04/28/how-to-adjust-the-high-watermark-in-oracle-10g-alter-table-shrink/ . The delete operation will overwhelm your UNDO TABLESPACE it's called.
The recovery model term is rather applicable for mssql I believe :).
hope it clarifies the matter abit.
thanks.

Dou you know which records need to be retained ? How will you identify each record ?
A solution might be to move the records to be retained to a temp db, and then truncate the big table. Afterwards, move the retained records back.
Beware that the transaction log file might become very big because of this (but depends on your recovery model).

We had a similar problem a long time ago. Had a table with 1 billion rows in it but had to remove a very large proportion of the data based on certain rules. We solved it by writing a Pro*C job to extract the data that we wanted to keep and apply the rules, and sprintf the data to be kept to a csv file.
Then created a sqlldr control file to upload the data using direct path (which wont create undo/redo (but if you need to recover the table, you have the CSV file until you do your next backup anyway).
The sequence was
Run the Pro*C to create CSV files of data
generate DDL for the indexes
drop the indexes
run the sql*load using the CSV files
recreate indexes using parallel hint
analyse the table using degree(8)
The amount of parellelism depends on the CPUs and memory of the DB server - we had 16CPUs and a few gig of RAM to play with so not a problem.
The extract of the correct data was the longest part of this.
After a few trial runs, the SQL Loader was able to load the full 1 billion rows (thats a US Billion or 1000 million rows) in under an hour.

Related

Data cleanup in Oracle DB is taking long time for 300 billion records

Problem statement:
There is address table in Oracle which is having relationship with multiple tables like subscriber, member etc.
Currently design is in such a way that when there is any change in associated tables, it increments record version throughout all tables.
So new record is added in address table even if same address is already present, resulting into large number of duplicate copies.
We need to identify and remove duplicate records, and update foreign keys in associated tables while making sure it doesn't impact the running application.
Tried solution:
We have written a script for cleanup logic, where unique hash is generated for every address. If calculated hash is already present then it means address is duplicate, where we merge into single address record and update foreign keys in associated tables.
But the problem is there are around 300 billion records in address table, so this cleanup process is taking lot of time, and it will take several days to complete.
We have tried to have index for hash column, but process is still taking time.
Also we have updated the insertion/query logic to use addresses as per new structure (using hash, and without version), in order to take care of incoming requests in production.
We are planning to do processing in chunks, but it will be very long an on-going activity.
Questions:
Would like to if any further improvement can be made in above approach
Will distributed processing will help here? (may be using Hadoop Spark/hive/MR etc.)
Is there any some sort of tool that can be used here?
Suggestion 1
Use built-in delete parallel
delete /*+ parallel(t 8) */ mytable t where ...
Suggestion 2
Use distributed processing (Hadoop Spark/hive) - watch out for potential contention on indexes or table blocks. It is recommended to have each process to work on a logical isolated subset, e.g.
process 1 - delete mytable t where id between 1000 and 1999
process 2 - delete mytable t where id between 2000 and 2999
...
Suggestion 3
If more than ~30% of the table need to be deleted - the fastest way would be to create an empty table, copy there all required rows, drop original table, rename new, create all indexes+constraints. Of course it requires downtime and it greatly depends on number of indexes - the more you have the longer it will take
P.S. There are no "magic" tools to do it. In the end they all run the same sql commands as you can.
It's possible use oracle merge instruction to insert data if you use clean sql.

Most Efficient Way to Create an Index in Postgres

Is it more efficient to create an index after loading data is complete or before, or does it not matter?
For example, say I have 500 files to load into a Postgres 8.4 DB. Here are the two index creation scenarios I could use:
Create index when table is created, then load each file into table; or
Create index after all files have been loaded into the table.
The table data itself is about 45 Gigabytes. The index is about 12 Gigabytes. I'm using a standard index. It is created like this:
CREATE INDEX idx_name ON table_name (column_name);
My data loading uses COPY FROM.
Once all the files are loaded, no updates, deletes or additional loads will occur on the table (it's a day's worth of data that will not change). So I wanted to ask which scenario would be most efficient? Initial testing seems to indicate that loading all the files and then creating the index (scenario 2) is faster, but I have not done a scientific comparison of the two approaches.
Your observation is correct - it is much more efficient to load data first and only then create index. Reason for this is that index updates during insert are expensive. If you create index after all data is there, it is much faster.
It goes even further - if you need to import large amount of data into existing indexed table, it is often more efficient to drop existing index first, import the data, and then re-create index again.
One downside of creating index after importing is that table must be locked, and that may take long time (it will not be locked in opposite scenario). But, in PostgreSQL 8.2 and later, you can use CREATE INDEX CONCURRENTLY, which does not lock table during indexing (with some caveats).

Deletes Slow on a Oracle BIG Table

I have a table which has around 180 million records and 40 indexes. A nightly program, loads data into this table but due to certain business conditions we can only delete and load data into this table. The nightly program will bring new records or updates to existing records in the table from the source system.We have limited window i.e about 6 hours to complete the extract from the source system, perform business transformations and finally load the data into this target table and be ready for users to consume the data in the morning. The issue which we are facing is that the delete from this table takes a lot of time mainly due to the 40 indexes on the table(an average of 70000 deletes per hour). I did some digging on the internet and see the below options
a) Drop or disable indexes before delete and then rebuild indexes: The program which loads data into the target table after delete and loading the data needs to perform quite a few updates for which the indexes are critical. And to rebuild 1 index it takes almost 1.5 hours due to the enormous amount of data in the table. So this approach is not feasible due to the time it takes to rebuild indexes and due to the limited time we have to get the data ready for the users
b) Use bulk delete: Currently the program deletes based on rowid and deletes records one by one as below
DELETE
FROM <table>
WHERE rowid = g_wpk_tab(ln_i);
g_wpk_tab is the collection which holds rowids to be deleted which is read by looping via FOR ALL and I do an intermediate commit every 50000 row deletes.
Tom of AskTom says in this discussion over here says that the bulk delete and row by row delete will take almost the same amount of time
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:5033906925164
So this wont be a feasible option as well
c)Regular Delete: Tom of AskTom suggests to use the regular delete and even that takes a long time probably due to the number of indexes on this table
d)CTAS: This approach is out of question because the program needs to recreate the table , create the 40 indexes and then proceed with the updates and I mentioned above an index will take atleast 1.5 hrs to create
If you could provide me any other suggestions I would really appreciate it.
UPDATE: As of now we have decided to go with the approach suggested by https://stackoverflow.com/users/409172/jonearles to archive instead of delete. Approach is to add a flag to the table to mark the records to be deleted as DELETE and then have a post delete program run during the day to delete off the records. This will ensure that the data is available for users at the right time. Since users consume via OBIEE we are planning to set content level filter on the table to not look at the archival column so that users needn't know about what to select and what to ignore.
Parallel DML alter session enable parallel dml;, delete /*+ parallel */ ...;, commit;. Sometimes it's that easy.
Parallel DDL alter index your_index rebuild nologging compress parallel;. NOLOGGING to reduce the amount of redo generated during the index rebuild. COMPRESS can significantly reduce the size of a non-unique index, which significantly reduces the rebuild time. PARALLEL can also make a huge difference in rebuild time if you have more than one CPU or more than one disk. If you're not already using these options, I wouldn't be surprised if using all of them together improves index rebuilds by an order of magnitude. And then 1.5 * 40 / 10 = 6 hours.
Re-evaluate your indexes Do you really need 40 indexes? It's entirely possible, but many indexes are only created because "indexes are magic". Make sure there's a legitimate reason behind each index. This can be very difficult to do, very few people document the reason for an index. Before you ask around, you may want to gather some information. Turn on index monitoring to see which indexes are really being used. And even if the index is used, see how it is used, perhaps through v$sql_plan. It's possible that an index is used for a specific statement but another index would have worked just as well.
Archive instead of delete Instead of deleting, just set a flag to mark a row as archived, invalid, deleted, etc. This will avoid the immediate overhead of index maintenance. Ignore the rows temporarily and let some other job delete them later. The large downside to this is that it affects any query on the table.
Upgrading is probably out of the question, but 12c has an interesting new feature called in-database archiving. It's a more transparent way of accomplishing the same thing.

SSIS File Load WAY TOO SLOW in Large Destination Table

this is my first question, I've searched a lot of info from different sites but none of them where conslusive.
Problem:
Daily I'm loading a flat file with an SSIS Package executed in a scheduled job in SQL Server 2005 but it's taking TOO MUCH TIME(like 2 1/2 hours) and the file just has like 300 rows and its a 50 MB file aprox. This is driving me crazy, because is affecting the performance of my server.
This is the Scenario:
-My package is just a Data Flow Task that has a Flat File Source and an OLE DB Destination, thats all!!!
-The Data Access Mode is set to FAST LOAD.
-Just have 3 indexes in the table and are nonclustered.
-My destination table has 366,964,096 records so far and 32 columns
-I haven't set FastParse in any of the Output columns yet.(want to try something else first)
So I've just started to make some tests:
-Rebuild/Reorganize the indexes in the destination table(they where way too fragmented), but this didn't help me much
-Created another table with the same structure but whitout all the indexes and executed the Job with the SSIS package loading to this new table and IT JUST TOOK LIKE 1 MINUTE !!!
So I'm confused, is there something I'm Missing???
-Is the SSIS package writing all the large table in a Buffer and the writing it on Disk? Or why the BIG difference in time ?
-Is the index affecting the insertion time?
-Should I load the file to this new table as a temporary table and then do a BULK INSERT to the destination table with the records ordered? 'Cause I though that the Data FLow Task was much faster than BULK INSERT, but at this point I don't know now.
Greetings in advance.
One thing I might look at is if the large table has any triggers which are causing it to be slower on insert. Also if the clustered index is on a field that will require a good bit of rearranging of the data during the load, that could cause an issues as well.
In SSIS packages, using a merge join (which requires sorting) can cause slownesss, but from your description it doesn't appear you did that. I mention it only in case you were doing that and didn't mention it.
If it works fine without the indexes, perhaps you should look into those. What are the data types? How many are there? Maybe you could post their definitions?
You could also take a look at the fill factor of your indexes - especially the clustered index. Having a high fill factor could cause excessive IO on your inserts.
Well I Rebuild the indexes with another fill factor (80%) like Sam told me, and the time droped down significantly. It took 30 minutes instead of almost 3hours!!!
I will keep with the tests to fine tune the DB. Also I didnt have to create a clustered index,I guess with the clustered the time will drop a lot more.
Thanks to all, wish that this helps to someone in the same situation.

overcoming 'log file sync' by design?

Advice/suggestions needed for a bit of application design.
I have an application which uses 2 tables, one is a staging table, which many separate processes write to, once a 'group' of processes has finished, another job comes along a aggregates the results together into a final table, then deletes that 'group' from the staging table.
The problem that I'm having is that when the staging table is being cleared out, lots of redo is generated and I'm seeing a lot of 'log file sync' waits in the database. This is a shared database with many other applications and this is causing some issues.
When applying the aggregate, the rows are reduced to about 1 row in the final table for every 20 rows in the staging table.
I'm thinking of getting around this by rather than having a single 'staging' table, I will create a table for each 'group'. Once done, this table can just be dropped, which should result in much less redo.
I only have SE, so partitioned tables isn't an option. Also faster disks for the redo probably isn't an option in the short term either.
Is this a bad idea? Any better solutions to be offered?
Thanks.
Would it be possible to solve the problem by having your process do a logical delete (i.e. set a DELETE_FLAG column in the table to 'Y') and then having a nightly process that truncates the table (potentially writing any non-deleted rows to a separate table before the truncate and then copy them back after the table is truncated)?
Are you certain that the source of the log file sync waits is that your disks can't keep up with the I/O? That's certainly possible, of course, but there are other possible causes of excessive log file sync waits including excessive commits. There is an excellent article on tuning log file sync events on the Pythian blog.
The most common cause of excessive log file syncs is too frequent commits, which are often deliberately coded in a mistaken attempt to reduce system load due to locking. You should commit only when your business transaction is complete.
Loading each group into a separate table sounds like a fine plan to reduce redo. You can truncate individual group table following each aggregation.
Another (but I think probably worse) option is to create a new staging table with the groups that haven't been aggregated then drop the original and rename the new table to replace the staging table.
I prefer Justin's suggestion ("logical delete"), but another option to consider might be a partitioned table, if you have the EE licence. The aggregation process could drop a partition instead of deleting the rows.

Resources