I have to duplicate values from one table to another (the tables have identical schemas). Which is better for performance:
Drop table1 and create as select * from table2
Delete all rows from table1 and insert all rows from table2
Update:
I've made a small test on a table with almost 3k rows.
Drop-and-create takes about 60 ms vs. delete-and-insert at about 30 ms.
I see four useful ways to replace the contents of a table. None of them is "obviously right"; which one is best depends on your requirements.
(In a single transaction) DELETE FROM foo; INSERT INTO foo SELECT ...
Pro: Best concurrency: doesn't lock out other transactions accessing the table, as it leverages Postgres's MVCC.
Con: Probably the slowest if you measure the insert-speed alone. Causes autovacuum to clean up dead rows, thus creating a higher I/O load.
TRUNCATE foo; INSERT INTO foo SELECT ...
Pro: Fastest for smaller tables. Causes less write I/O than #1
Con: Excludes all other readers -- other transactions reading from the table will have to wait.
TRUNCATE foo, DROP all indexes on table, INSERT INTO foo SELECT ..., re-create all indexes.
Pro: Fastest for large tables, because creating indexes with CREATE INDEX is faster than updating them incrementally.
Con: Same as #2
The switcheroo. Create two identical tables foo and foo_tmp
TRUNCATE foo_tmp;
INSERT INTO foo_tmp SELECT ...;
ALTER TABLE foo RENAME TO foo_tmp1;
ALTER TABLE foo_tmp RENAME TO foo;
ALTER TABLE foo_tmp1 RENAME TO foo_tmp;
Thanks to PostgreSQL's transactional DDL capabilities, if this is done in a transaction, the rename is performed without other transactions noticing. You can also combine this with #3 and drop/create indexes.
Pro: Less I/O performed, like #2, and without locking out other readers (locks taken only during the rename part).
Con: The most complicated. Also you cannot have foreign keys or views pointing to the table, as they would point to the wrong table after renaming it.
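Here is a minimal sketch of the switcheroo wrapped in a single transaction (source_table is a placeholder for whatever produces the new data):
BEGIN;
TRUNCATE foo_tmp;
INSERT INTO foo_tmp SELECT * FROM source_table;  -- source_table is a placeholder
ALTER TABLE foo RENAME TO foo_tmp1;
ALTER TABLE foo_tmp RENAME TO foo;
ALTER TABLE foo_tmp1 RENAME TO foo_tmp;
COMMIT;
Readers of foo only block for the brief moment the renames hold their locks.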
Use TRUNCATE instead of DROP TABLE or DELETE when you have to get rid of all records in a table. With TRUNCATE you can still use triggers in PostgreSQL and permissions are easier to set and maintain.
Like a DROP, TRUNCATE also needs a table lock.
In case you are talking about executing the INSERTs manually, one by one, then DROP/CREATE will be much faster. Also, when using CREATE TABLE AS, it will only copy the column definitions. Indexes and other constraints will not be copied. This will speed up the copy process enormously. But you'll have to remember to re-create these on the new copy once you're finished.
The same goes for SELECT INTO. They are functionally identical. They just have different names.
In any case, when copying large tables, always disable triggers, indexes, and constraints to gain performance.
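As a hedged sketch of the drop/create-as-select path in PostgreSQL (assuming table2 has one indexed column called some_column; the names are made up):
DROP TABLE IF EXISTS table1;
CREATE TABLE table1 AS SELECT * FROM table2;  -- copies data and column definitions only
CREATE INDEX table1_some_column_ix ON table1 (some_column);  -- indexes and constraints must be re-created by hand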
Here are comparative timings for intgr's answer (see the code below):
delete/insert - 36 sec.
truncate/insert - 19 sec.
drop index/truncate/insert/create index - 13 sec.
-- preparations
drop table if exists temp_refresh_experiment;

-- million random strings
create table temp_refresh_experiment as
select upper(substr(md5(random()::text), 0, 25)) as some_column
from generate_series(1, 1000000) i;

-- create index
create index temp_refresh_experiment_ix on temp_refresh_experiment(some_column);

-- 1. delete/insert
delete from temp_refresh_experiment;
insert into temp_refresh_experiment(some_column)
select upper(substr(md5(random()::text), 0, 25)) as some_column
from generate_series(1, 1000000) i;
-- 36 sec

-- 2. truncate/insert
truncate temp_refresh_experiment;
insert into temp_refresh_experiment(some_column)
select upper(substr(md5(random()::text), 0, 25)) as some_column
from generate_series(1, 1000000) i;
-- 19 sec

-- 3. drop index/truncate/insert/create index
drop index if exists temp_refresh_experiment_ix;
truncate temp_refresh_experiment;
insert into temp_refresh_experiment(some_column)
select upper(substr(md5(random()::text), 0, 25)) as some_column
from generate_series(1, 1000000) i;
create index temp_refresh_experiment_ix on temp_refresh_experiment(some_column);
-- 13 sec
Related
I have a simple delete statement like this:
DELETE FROM MY_TABLE WHERE ATTR_NAME='Something'.
This has to delete around 600,000 rows, and it is taking more than half an hour.
The table has three columns; the combination of ID and ATTR_NAME is the primary key, and the third column is of CLOB type. The table contains around 21 million records. There are no separate indexes on any column, and there are no triggers and no foreign key references.
This is not a one-time process; I need to do it at regular intervals.
I suspect this is because of the primary key, which in turn creates an index and thus adds to the time; please correct me if I'm wrong. Should I try removing the PK, or disabling the index? I've heard that indexes should be disabled while inserting and deleting. I can't simply test this, because it is a production machine and I need to ask permission to remove anything. Please share your suggestions.
And in general, do indexes affect all DML statements?
If your index is id,attr_name, then that index cannot be used for your where clause, and the delete query has to do a full-table scan.
Index fields are used in left->right ordering, so your id,attr_name index would be used in these cases:
WHERE id = foo AND attr_name = bar
WHERE id = foo
WHERE attr_name = foo AND id = bar // ordering within the where doesn't matter, but USAGE does
but not
WHERE attr_name = bar
because id is not present in that where.
You'll have to add a dedicated index on attr_name, or re-arrange your index so it's defined as attr_name, id. And of course, if the id field is your primary key, it should ALREADY have a PK index on it, making id, attr_name redundant.
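For instance, a hedged one-liner (the index name is made up):
CREATE INDEX my_table_attr_name_ix ON my_table (attr_name);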
DBMS_PARALLEL_EXECUTE is an easy way to significantly improve performance without altering any objects or significantly changing the process.
Sample Schema
--Create sample table.
create table my_table(id number, attr_name varchar2(100), a_clob clob);
--Insert 1 million rows. Takes 31 seconds on my PC.
begin
  for i in 1 .. 10 loop
    insert /*+ append */ into my_table
    select level + i*100000, mod(level, 3), rpad('0', 100, '0')
    from dual
    connect by level <= 100000;
    commit;
  end loop;
end;
/
--Add primary key.
alter table my_table add constraint my_table_pk primary key (id, attr_name);
Simple DELETE
Deleting 1/3rd of the data with this simple method takes 86 seconds on my PC.
--Flush the cache.
alter system flush buffer_cache;
--Delete 1/3rd of the table.
delete from my_table where attr_name = 0;
rollback;
DBMS_PARALLEL_EXECUTE
The parallel method ran only slightly faster on my machine. Hopefully on a server with multiple CPUs and disks the difference will be larger. This code is based on the example from the manual.
--Flush the cache.
alter system flush buffer_cache;
--Delete 1/3rd of the table. Finished in 80 seconds.
begin
  --Create the TASK.
  dbms_parallel_execute.create_task('mytask');

  --Chunk the table by ROWID.
  dbms_parallel_execute.create_chunks_by_rowid(
    task_name   => 'mytask',
    table_owner => user,
    table_name  => 'MY_TABLE',
    by_row      => true,
    chunk_size  => 1000);

  --Execute the DML in parallel.
  dbms_parallel_execute.run_task(
    task_name      => 'mytask',
    sql_stmt       =>
      'delete /*+ rowid(my_table) */ from my_table
       where rowid BETWEEN :start_id AND :end_id
       and attr_name = 0',
    language_flag  => DBMS_SQL.NATIVE,
    parallel_level => 16);

  --Get the status.
  dbms_output.put_line('Status: '||dbms_parallel_execute.task_status('mytask'));

  --Done with processing; drop the task.
  dbms_parallel_execute.drop_task('mytask');
end;
/
Pros and Cons
This method requires a bit more code to do a simple DELETE, but it avoids these issues with other approaches:
An index access path almost certainly won't help if the DELETE affects 29% of the data.
Dropping and re-creating a primary key takes time, locks the table, and it's not always trivial to get accurate DDL.
Regular parallel DML will not work because of the CLOB column.
Partitioning or soft-deletes require changing the table structure. (Although if possible these are probably the fastest methods.)
You have plenty of options to tune the statements.
Partition table
If the ATTR_NAME column has only a handful of distinct values (which I gather from your statement), you can consider partitioning the table (including the CLOB, assuming the CLOB is not stored inline) and then simply drop the relevant partition. You would probably have to reorganize the indexes into local indexes.
Disable Index and rebuild after DELETE
I suspect this really won't help: yes, there is overhead in maintaining the index, but 600K rows is not a lot. Dropping and re-creating the index should be avoided.
CTAS + Parallelism + DROP/RENAME + RECREATE INDEX
The above would work if you have a window to take the DB offline for a short period.
I would also want to try updating the CLOB column to NULL for those records and issuing the delete afterwards. This is purely to measure whether the CLOB column is hogging the execution.
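A hedged sketch of that measurement, reusing the sample schema from the answer above (my_table / a_clob / attr_name; the filter value is a placeholder):
update my_table set a_clob = null where attr_name = 'Something';
commit;
-- now time the delete on its own
delete from my_table where attr_name = 'Something';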
Objective
Verify whether it is true that inserting records without a PK/index and creating them later is faster than inserting with the PK/index in place.
Note
The point here is not that index maintenance takes time (that is obvious), but that the total cost of (insert without index + create index) turned out to be higher than (insert with index). I was taught to insert without the index and create the index later because it should be faster.
Environment
Windows 7 64 bit on DELL Latitude core i7 2.8GHz 8G memory & SSD HDD
Oracle 11G R2 64 bit
Background
I was taught that inserting records without the PK/index and creating them after the insert would be faster than inserting with the PK/index in place.
However, in the experiments below, inserting 1 million records with the PK/index was actually faster than creating the PK/index later: approx. 4.5 seconds vs. 6 seconds. Increasing the record count to 3 million (999000 -> 2999000) gave the same result.
Conditions
The table DDL is below. One bigfile tablespace is used for both data and index.
(I tested a separate index tablespace with the same result and inferior overall performance.)
Flush the buffer cache and shared pool before each run.
Run each experiment 3 times and make sure the results are similar.
SQL to flush:
ALTER SYSTEM CHECKPOINT;
ALTER SYSTEM FLUSH SHARED_POOL;
ALTER SYSTEM FLUSH BUFFER_CACHE;
Question
Would it actually be true that "insert without PK/Index + PK/Index creation later" is faster than "insert with PK/Index"?
Did I make mistakes or miss some conditions in the experiment?
Insert records with PK/Index
TRUNCATE TABLE TBL2;
ALTER TABLE TBL2 DROP CONSTRAINT PK_TBL2_COL1 CASCADE;
ALTER TABLE TBL2 ADD CONSTRAINT PK_TBL2_COL1 PRIMARY KEY(COL1) ;
SET timing ON
INSERT INTO TBL2
SELECT i+j, rpad(TO_CHAR(i+j),100,'A')
FROM (
       WITH DATA2(j) AS (
         SELECT 0 j FROM DUAL
         UNION ALL
         SELECT j+1000 FROM DATA2 WHERE j < 999000
       )
       SELECT j FROM DATA2
     ),
     (
       WITH DATA1(i) AS (
         SELECT 1 i FROM DUAL
         UNION ALL
         SELECT i+1 FROM DATA1 WHERE i < 1000
       )
       SELECT i FROM DATA1
     );
commit;
1,000,000 rows inserted.
Elapsed: 00:00:04.328 <----- Insert records with PK/Index
Insert records without PK/Index and create them after
TRUNCATE TABLE TBL2;
ALTER TABLE TBL2 DROP CONSTRAINT PK_TBL2_COL1 CASCADE;
SET TIMING ON
INSERT INTO TBL2
SELECT i+j, rpad(TO_CHAR(i+j),100,'A')
FROM (
       WITH DATA2(j) AS (
         SELECT 0 j FROM DUAL
         UNION ALL
         SELECT j+1000 FROM DATA2 WHERE j < 999000
       )
       SELECT j FROM DATA2
     ),
     (
       WITH DATA1(i) AS (
         SELECT 1 i FROM DUAL
         UNION ALL
         SELECT i+1 FROM DATA1 WHERE i < 1000
       )
       SELECT i FROM DATA1
     );
commit;
ALTER TABLE TBL2 ADD CONSTRAINT PK_TBL2_COL1 PRIMARY KEY(COL1) ;
1,000,000 rows inserted.
Elapsed: 00:00:03.454 <---- Insert without PK/Index
table TBL2 altered.
Elapsed: 00:00:02.544 <---- Create PK/Index
Table DDL
CREATE TABLE TBL2 (
  "COL1" NUMBER,
  "COL2" VARCHAR2(100 BYTE),
  CONSTRAINT "PK_TBL2_COL1" PRIMARY KEY ("COL1")
) TABLESPACE "TBS_BIG";
The current test case is probably good enough for you to overrule the "best practices". There are too many variables involved to make a blanket statement that "it's always best to leave the indexes enabled". But you're probably close enough to say it's true for your environment.
Below are some considerations for the test case. I've made this a community wiki in the hopes that others will add to the list.
Direct-path inserts. Direct-path writes use different mechanisms and may work completely differently. Direct-path inserts can often be significantly faster than regular inserts, although they have some complicated restrictions (for example, triggers must be disabled) and disadvantages (the data is not immediately backed up). One particular way this affects the scenario here is that NOLOGGING for indexes only applies during index creation. So even if a direct-path insert is used, an enabled index will always generate REDO and UNDO. (See the sketch after this list.)
Parallelism. Large insert statements often benefit from parallel DML. Usually it's not worth worrying about the performance of bulk loads until it takes more than several seconds, which is when parallelism starts to be useful.
Bitmap indexes are not meant for large DML. Inserts or updates to a table with a bitmap index can lock the whole table and lead to disastrous performance. It might be helpful to limit the test case to b-tree indexes.
Add alter system switch logfile;? Log file switches can sometimes cause performance issues. The tests would be somewhat more consistent if they all started with empty logfiles.
Move data generation logic into a separate step. Hierarchical queries are useful for generating data but they can have their own performance issues. It might be better to create an intermediate table to hold the results, and then only test inserting the intermediate table into the final table.
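To illustrate the direct-path point above, a hedged sketch (the table matches the test case's TBL2, but this is not the poster's exact script):
ALTER TABLE TBL2 NOLOGGING;
INSERT /*+ APPEND */ INTO TBL2
SELECT level, rpad(TO_CHAR(level), 100, 'A')
FROM dual CONNECT BY level <= 1000000;
COMMIT;  -- after a direct-path insert the table cannot be read in the same session until you commit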
It's true that it is faster to modify a table if you do not also have to modify one or more indexes and possibly perform constraint checking as well, but it is also largely irrelevant if you then have to add those indexes. You have to consider the complete change to the system that you wish to effect, not just a single part of it.
Obviously if you are adding a single row into a table that already contains millions of rows then it would be foolish to drop and rebuild indexes.
However, even if you have a completely empty table into which you are going to add several million rows it can still be slower to defer the indexing until afterwards.
The reason for this is that such an insert is best performed with the direct path mechanism, and when you use direct path inserts into a table with indexes on it, temporary segments are built that contain the data required to build the indexes (data plus rowids). If those temporary segments are much smaller than the table you have just loaded then they will also be faster to scan and to build the indexes from.
The alternative, if you have five indexes on the table, is to incur five full table scans after you have loaded it in order to build the indexes.
Obviously there are huge grey areas involved here, but well done for:
Questioning authority and general rules of thumb, and
Running actual tests to determine the facts in your own case.
Edit:
Further considerations: suppose you run a backup while the indexes are dropped. Now, following an emergency restore, you have to have a script that verifies that all indexes are in place, while the business is breathing down your neck to get the system back up.
Also, if you are absolutely determined not to maintain indexes during a bulk load, do not drop the indexes -- disable them instead. This preserves the metadata for the indexes' existence and definition, and allows a simpler rebuild process. Just be careful that you do not accidentally re-enable the indexes by truncating the table, as this will render disabled indexes enabled again.
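In Oracle terms, "disabling" a regular b-tree index usually means marking it unusable; a hedged sketch (the index name is made up, and note that an unusable unique index still blocks inserts unless its constraint is disabled too):
ALTER INDEX my_big_ix UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;  -- the default since 10g
-- ... perform the bulk load ...
ALTER INDEX my_big_ix REBUILD;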
Oracle has to do more work while inserting data into table having an index. In general, inserting without index is faster than inserting with index.
Think of it this way:
Inserting rows into a regular heap-organized table with no particular row order is simple: find a table block with enough free space and put the rows there wherever they fit.
But when there are indexes on the table, there is much more work to do. Adding a new entry to the index is not that simple: the database has to traverse the index blocks to find the specific leaf node, because the new entry cannot go into just any block. Once the correct leaf node is found, it checks for enough free space and then makes the new entry. If there is not enough space, it has to split the node and distribute the entries between the old and new nodes. All of this work is overhead and consumes more time overall.
Let's see a small example,
Database version :
SQL> SELECT banner FROM v$version where ROWNUM =1;
BANNER
--------------------------------------------------------------------------------
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
OS : Windows 7, 8GB RAM
With Index
SQL> CREATE TABLE t(A NUMBER, CONSTRAINT PK_a PRIMARY KEY (A));
Table created.
SQL> SET timing ON
SQL> INSERT INTO t SELECT LEVEL FROM dual CONNECT BY LEVEL <=1000000;
1000000 rows created.
Elapsed: 00:00:02.26
So, it took 00:00:02.26. Index details:
SQL> column index_name format a10
SQL> column table_name format a10
SQL> column uniqueness format a10
SQL> SELECT index_name, table_name, uniqueness FROM user_indexes WHERE table_name = 'T';
INDEX_NAME TABLE_NAME UNIQUENESS
---------- ---------- ----------
PK_A T UNIQUE
Without Index
SQL> DROP TABLE t PURGE;
Table dropped.
SQL> CREATE TABLE t(A NUMBER);
Table created.
SQL> SET timing ON
SQL> INSERT INTO t SELECT LEVEL FROM dual CONNECT BY LEVEL <=1000000;
1000000 rows created.
Elapsed: 00:00:00.60
So, it took only 00:00:00.60 which is faster compared to 00:00:02.26.
I have two tables, say STOCK and ITEM. We have a query to delete some records from the ITEM table:
delete from ITEM where item_id not in(select itemId from STOCK)
And now I have more than 1,500,000 records to delete, and the query takes a long time to complete.
When I searched, I found some efficient ways to do this action.
One way:
CREATE TABLE ITEM_TEMP AS
SELECT * FROM ITEM WHERE item_id in(select itemId from STOCK) ;
TRUNCATE TABLE ITEM;
INSERT /*+ APPEND */ INTO ITEM SELECT * FROM ITEM_TEMP;
DROP TABLE ITEM_TEMP;
Secondly, instead of truncating, just drop ITEM and then rename ITEM_TEMP to ITEM. But in this case I have to re-create all the indexes.
Can anyone please suggest which one of the above is more efficient, as I cannot check this in production?
I think the correct approach depends on your environment, here.
If you have privileges on the table that must not be affected, or at least must be restored if you drop the table, then the INSERT /*+ APPEND */ may simply be more reliable. Triggers, similarly, or foreign keys, or any objects that will be automatically dropped when the base table is dropped (foreign keys complicate the truncate, of course).
I would usually go for the truncate-and-insert method based on that. Don't worry about the presence of indexes on the table -- a direct-path insert is very efficient at building them.
However, if you have a simple table without dependent objects then there's nothing wrong with the drop-and-rename approach.
I also would not rule out just running multiple deletes of a limited number of rows, especially if this is in a production environment.
The best way in terms of used space (and the high-water mark) and performance is to drop the table and then rename the ITEM_TEMP table. But, as you mentioned, after that you need to re-create the indexes (also grants, triggers, constraints). Also, all dependent objects will be invalidated.
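A hedged sketch of that drop-and-rename variant (the index and role names are placeholders):
DROP TABLE item;
ALTER TABLE item_temp RENAME TO item;
-- re-create whatever existed on the old table, e.g.:
CREATE INDEX item_ix ON item (item_id);
GRANT SELECT ON item TO some_role;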
Sometimes I delete in portions:
begin
  loop
    delete from ITEM where item_id not in (select itemId from STOCK) and rownum < 10000;
    exit when SQL%ROWCOUNT = 0;
    commit;
  end loop;
end;
Since you have a very high number of rows, it is better to use a partitioned table, maybe list-partitioned on itemId. Then you can easily drop a partition.
Your application may also run faster. This needs a design change, but it will give benefits in the long run.
I have a big table with a lot of data, partitioned into multiple partitions. I want to keep a few partitions as they are but delete the rest of the data from the table. I tried searching for a similar question and couldn't find it on Stack Overflow. What is the best way to write a query in Oracle to achieve this?
It is easy to delete data from a specific partition: this statement clears down all the data for February 2012:
delete from t23 partition (feb2012);
A quicker method is to truncate the partition:
alter table t23 truncate partition feb2012;
There are two potential snags here:
Oracle won't let us truncate partitions if we have foreign keys referencing the table.
The operation invalidates any partitioned Indexes so we need to rebuild them afterwards.
Also, it's DDL, so no rollback.
If we never again want to store data for that month we can drop the partition:
alter table t23 drop partition feb2012;
The problem arises when we want to zap multiple partitions and we don't fancy all that typing. We cannot parameterise the partition name, because it's an object name, not a variable (no quotes). So that leaves only dynamic SQL.
As you want to remove most of the data but retain the partition structure, truncating the partitions is the best option. Remember to disable any referencing integrity constraints (and to reinstate them afterwards).
declare
  stmt varchar2(32767);
begin
  for lrec in ( select partition_name
                from user_tab_partitions
                where table_name = 'T23'
                and partition_name like '%2012' )
  loop
    stmt := 'alter table t23 truncate partition ' || lrec.partition_name;
    dbms_output.put_line(stmt);
    execute immediate stmt;
  end loop;
end;
/
You should definitely run the loop first with the execute immediate call commented out, so you can see which partitions your WHERE clause is selecting. Obviously you have a back-up and can recover data you didn't mean to remove. But the quickest way to undertake a restore is not to need one.
Afterwards run this query to see which partitions you should rebuild:
select ip.index_name, ip.partition_name, ip.status
from user_indexes i
join user_ind_partitions ip
on ip.index_name = i.index_name
where i.table_name = 'T23'
and ip.status = 'UNUSABLE';
You can automate the rebuild statements in a similar fashion.
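A hedged sketch of that automation, mirroring the truncate loop above:
begin
  for lrec in ( select ip.index_name, ip.partition_name
                from user_indexes i
                join user_ind_partitions ip
                on ip.index_name = i.index_name
                where i.table_name = 'T23'
                and ip.status = 'UNUSABLE' )
  loop
    execute immediate 'alter index ' || lrec.index_name
                      || ' rebuild partition ' || lrec.partition_name;
  end loop;
end;
/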
" I am thinking of copying the data of partitions I need into a temp
table and truncate the original table and copy back the data from temp
table to original table. "
That's another way of doing things. With exchange partition it might be quite quick. It might also be slower. It also depends on things like foreign keys and indexes, and the ratio of zapped partitions to retained ones. If performance is important and/or you need to undertake this operation regularly, then you should benchmark the various options and see what works best for you.
You must be very careful when dropping a partition from a partitioned table. Partitioned tables are usually used for big tables, and if (and only if) you have a global index on the table, dropping a partition makes your global index invalid and you then have to rebuild a global index on a big table, which is a disaster.
To minimise the side effects on queries against the table in this scenario, I first delete the records in the partition to make it an empty partition, and then with
ALTER TABLE table_name DROP PARTITION partition_name UPDATE GLOBAL INDEXES;
I drop the empty partition without making my global index invalid.
I have a very large table (5 million records). I'm trying to obfuscate the table's VARCHAR2 columns with random alphanumerics for every record in the table. My procedure executes successfully on smaller datasets, but it will eventually be used on a remote DB whose settings I can't control, so I'd like to execute the UPDATE statement in batches to avoid running out of undo space.
Is there some kind of option I can enable, or a standard way to do the update in chunks?
I'll add that there won't be any distinguishing features of the records that haven't been obfuscated so my one thought of using rownum in a loop won't work (I think).
If you are going to update every row in a table, you are better off doing a Create Table As Select, then drop/truncate the original table and re-append with the new data. If you've got the partitioning option, you can create your new table as a table with a single partition and simply swap it with EXCHANGE PARTITION.
Inserts require a LOT less undo, and a direct-path insert with nologging (the /*+ APPEND */ hint) won't generate much redo either.
With either mechanism, there would probably still be 'forensic' evidence of the old values (e.g. preserved in undo or in "available" space allocated to the table due to row movement).
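A hedged sketch of the CTAS route (my_table, my_column, and random_function are the placeholder names used elsewhere in this thread; id stands in for the remaining columns):
CREATE TABLE my_table_new NOLOGGING AS
SELECT id, random_function(my_column) AS my_column
FROM my_table;

DROP TABLE my_table;
ALTER TABLE my_table_new RENAME TO my_table;
-- then re-create indexes, grants, and constraints on the renamed table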
The following is untested, but should work:
declare
  l_fetchsize number := 10000;

  cursor cur_getrows is
    select rowid, random_function(my_column)
    from my_table;

  type rowid_tbl_type     is table of urowid;
  type my_column_tbl_type is table of my_table.my_column%type;

  rowid_tbl     rowid_tbl_type;
  my_column_tbl my_column_tbl_type;
begin
  open cur_getrows;
  loop
    fetch cur_getrows bulk collect
      into rowid_tbl, my_column_tbl
      limit l_fetchsize;

    exit when rowid_tbl.count = 0;

    forall i in rowid_tbl.first..rowid_tbl.last
      update my_table
      set my_column = my_column_tbl(i)
      where rowid = rowid_tbl(i);

    commit;
  end loop;
  close cur_getrows;
end;
/
This isn't optimally efficient -- a single update would be -- but it'll do smaller, user-tunable batches, using ROWID.
I do this by mapping the primary key to an integer (mod n), and then performing the update for each x, where 0 <= x < n.
For example, maybe you are unlucky and the primary key is a string. You can hash it with your favorite hash function, and break it into three partitions:
UPDATE myTable SET a=doMyUpdate(a) WHERE MOD(ORA_HASH(ID), 3)=0
UPDATE myTable SET a=doMyUpdate(a) WHERE MOD(ORA_HASH(ID), 3)=1
UPDATE myTable SET a=doMyUpdate(a) WHERE MOD(ORA_HASH(ID), 3)=2
You may have more partitions, and may want to put this into a loop (with some commits).
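A hedged sketch of that loop (the bucket count and doMyUpdate are placeholders from the example above):
declare
  n constant pls_integer := 8;  -- hypothetical number of buckets
begin
  for x in 0 .. n - 1 loop
    update myTable set a = doMyUpdate(a) where mod(ora_hash(id), n) = x;
    commit;
  end loop;
end;
/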
If I had to update millions of records I would probably opt NOT to update.
I would more likely create a temp table and then insert the data from the old table, since an insert doesn't take up a lot of redo space and takes less undo.
CREATE TABLE new_table as select <do the update "here"> from old_table;
index new_table
grant on new_table
add constraints on new_table
etc on new_table
drop table old_table
rename new_table to old_table;
You can do that using parallel query, with nologging on most operations generating very little redo and no undo at all -- in a fraction of the time it would take to update the data.