create index before adding columns vs. create index after adding columns - does it matter? - oracle

In Oracle 10g, does it matter what order create index and alter table comes in?
Say i have a query Q with a where clause on column C in table T. Now i perform one of the following scenarios:
I create index I(C) and then add columns X,Y,Z.
Add columns X,Y,Z then create index I(C).
Q is 'select * from T where C = whatever'
Between 1 and 2 will there be a significant difference in performance of Q on table T when T contains a very large number of rows?
I personally make it a practice to do #2 but others seem to have a different opinion.
thanks

It makes no difference if you add columns to a table before or after creating an index. The optimizer should pick the same plan for the query and the execution time should be unchanged.
Depending on the physical storage parameters of the table, it is possible that adding the additional columns and populating them with data may force quite a bit of row migration to take place. That row migration will generate changes to the indexes on the table. If the index exists when you are populating the three new columns with data, it is possible that populating the data in X, Y, and Z will take a bit longer because of the additional index maintenance.

If you add columns without populating them, then it is pretty quick as it is just a metadata change. Adding an index does require the table to be read (or potentially another index) so that can be very time consuming and of much greater impact than the simple metadata change of recording the new index details.
If the new columns are going to be populated as part of the ALTER TABLE, it is a different matter.
The database may undergo an unplanned shutdown during the course of adding that data to every row of the table data
The server memory may not have room to record every row changed in that table
Therefore those row changes may be written to datafiles before commit, and are therefore written as dirty blocks
The next read of those blocks, after the ALTER table has successfully completed will do a delayed block cleanout (ie record the fact that the change has been committed)
If you add the columns (with data) first, then the create index will (probably) read the table and do the added work of the delayed block cleanout.
If you create the index first then add the columns, the create index may be faster but the delayed block cleanout won't happen and that housekeeping will be picked up by the application later (potentially by the select * from T where C = whatever)

Related

Change schema in an Impala/Hive table with a very large amount of data?

We have a Hive table stored on HDFS with 800+ columns and >65 billion rows (and growing) and need to:
Remove a column with a complex type (small array)
Add a column with a complex type (small array)
Possibly add a handful of other columns (simple type, e.g. string or int)
Modify the contents of 3 columns for every row in the database (effectively read it in, make a simple change, write it back out to the same column and row that it came from). I realise this is probably a separate operation to the other three requirements above.
We could set up a new empty table with the new schema and copy the data over (using CREATE TABLE xxxxx FROM SELECT ... or INSERT INTO xxxx SELECT ...) but tests suggest it would take 1 - 3 weeks running non stop. And it's possible we may need to make further minor similar modifications in future.
Is there an efficient, sensible alternative to copying the whole table? Would ALTER TABLE work (at least for the structural changes, items 1 - 3 above)? What are the pros and cons of either option(s)?
Table is going to be queried using Impala, in a Zeppelin-based interface.
Thanks for any advice.

MERGE INTO Performance as table grows

This is a general question about the Oracle MERGE INTO statement with a particular scenario, on Oracle RDBMS 12c.
Daily data will be loaded to StagingTableA - about 10m rows.
This will be MERGEd INTO TableA.
TableA will vary between 0 to 10m rows (matcing StagingTableA).
There may be times when TableA will be pruned/emptied and left with 0 rows.
Clearly, when TableA is empty, a straight INSERT will do the job, but the procedure has been written to use a MERGE INTO method to handle all scenarios.
The MERGE .. MATCH is on a indexed column.
My question is an uncertainty about how the MERGE handles the MATCH in circumstances where TableA will start empty, and then grow hugely during the MERGE execution. The MATCH on indexed columns will use a FTS as the stats will show the table has 0 rows.
At some point during the MERGE transaction, this will become inefficient.
Is the MERGE statement clever enough to detect this and change the execution plan, and start using the index instead of the FTS?
If this was done the old way with CURSOR, UPDATE and INSERT then we could potentially introduce a ANALYZE at a appropriate point (say after 50,000 processed) on the TableA to switch to a optimal plan.
I haven't been able to find any documentation dealing with this specific question.
Hopefully you've got a UNIQUE index on that table, which is based on the incoming data. If I was you, rather than using a simple MERGE I'd:
Mark all indexes on the table as UNUSABLE, except for the unique index.
INSERT all records
Catch the DUPLICATE VALUE ON INDEX exception at the time of INSERT and issue the appropriate UPDATE.
DELETE processed rows from the input record.
Commit every N records (1000? 10000? 100000? Your choice...), calling DBMS_STATS.GATHER_TABLE_STATS for the table you've inserted into after each COMMIT.
Best of luck.

Oracle 12c - refreshing the data in my tables based on the data from warehouse tables

I need to update the some tables in my application from some other warehouse tables which would be updating weekly or biweekly. I should update my tables based on those. And these are having foreign keys in another tables. So I cannot just truncate the table and reinsert the whole data every time. So I have to take the delta and update accordingly based on few primary key columns which doesn't change. Need some inputs on how to implement this approach.
My approach:
Check the last updated time of those tables, views.
If it is most recent then compare each row based on the primary key in my table and warehouse table.
update each column if it is different.
Do nothing if there is no change in columns.
insert if there is a new record.
My Question:
How do I implement this? Writing a PL/SQL code is it a good and efficient way? as the expected number of records are around 800K.
Please provide any sample code or links.
I would go for Pl/Sql and bulk collect forall method. You can use minus in your cursor in order to reduce data size and calculating difference.
You can check this site for more information about bulk collect, forall and engines: http://www.oracle.com/technetwork/issue-archive/2012/12-sep/o52plsql-1709862.html
There are many parts to your question above and I will answer as best I can:
While it is possible to disable referencing foreign keys, truncate the table, repopulate the table with the updated data then reenable the foreign keys, given your requirements described above I don't believe truncating the table each time to be optimal
Yes, in principle PL/SQL is a good way to achieve what you are wanting to
achieve as this is too complex to deal with in native SQL and PL/SQL is an efficient alternative
Conceptually, the approach I would take is something like as follows:
Initial set up:
create a sequence called activity_seq
Add an "activity_id" column of type number to your source tables with a unique constraint
Add a trigger to the source table/s setting activity_id = activity_seq.nextval for each insert / update of a table row
create some kind of master table to hold the "last processed activity id" value
Then bi/weekly:
retrieve the value of "last processed activity id" from the master
table
select all rows in the source table/s having activity_id value > "last processed activity id" value
iterate through the selected source rows and update the target if a match is found based on whatever your match criterion is, or if
no match is found then insert a new row into the target (I assume
there is no delete as you do not mention it)
on completion, update the master table "last processed activity id" to the greatest value of activity_id for the source rows
processed in step 3 above.
(please note that, depending on your environment and the number of rows processed, the above process may need to be split and repeated over a number of transactions)
I hope this proves helpful

How and when are indexes used in INSERT and UPDATE operations?

Consider this Oracle docs about indexes, this about speed of insert and this question on StackOverflow lead me to conclusion that:
Indexes helps us locate information faster
Primary and Unique Keys are indexed automatically
Inserting with indexes can cause worse performance
However every time indexes are discussed there are only SELECT operations shown as examples.
My question is: are indexes used in INSERT and UPDATE operations? When and how?
My suggestions are:
UPDATE can use index in WHERE clause (if the column in the clause has index)
INSERT can use index when uses SELECT (but in this case, index is from another table)
or probably when checking integrity constraints
but I don't have such deep knowledge of using indexes.
For UPDATE statements, index can be used by the optimiser if it deems the index can speed it up. The index would be used to locate the rows to be updated. The index is also a table in a manner of speaking, so if the indexed column is getting updated, it obviously needs to UPDATE the index as well. On the other hand if you're running an update without a WHERE clause the optimiser may choose not to use an index as it has to access the whole table, a full table scan may be more efficient (but may still have to update the index). The optimiser makes those decisions at runtime based on several parameters such as if there are valid stats against the tables and indexes in question, how much data is affected, what type of hardware, etc.
For INSERT statements though the INSERT itself does not need the index, the index will also need to be 'inserted into', so will need to be accessed by oracle. Another case where INSERT can cause the index to be used is an INSERT like this:
INSERT INTO mytable (mycolmn)
SELECT mycolumn + 10 FROM mytable;
Insert statement has no direct benefit for index. But more index on a table cause slower insert operation. Think about a table that has no index on it and if you want to add a row on it, it will find table block that has enough free space and store that row. But if that table has indexes on it database must make sure that these new rows also found via indexes, So to add new rows on a table that has indexes, also need to entry in indexes too. That multiplies the insert operation. So more index you have, more time you need to insert new rows.
For update it depends on whether you update indexed column or not. If you are not updating indexed column then performance should not be affected. Index can also speed up a update statements if the where conditions can make use of indexes.

Can I use one Trigger and one sequence (as auto_increment of pk) for several tables?

And for what is option CACHE?
CREATE SEQUENCE Race_SEQ
START WITH 1
INCREMENT BY 1
CACHE 10;
CREATE OR REPLACE TRIGGER RACE_INC
BEFORE INSERT ON RACE
FOR EACH ROW
BEGIN
:NEW.RACE_ID := RACE_SEQ.NEXTVAL;
END;
You can use the same sequence to generate the primary key for many tables. It generally doesn't make a lot of sense to do so, but you can. It doesn't cost you anything to have many different sequences and it generally makes for a more sensible application design when race_seq is used to populate the race table and foo_seq is used to populate the foo table rather than having all the tables use the same sequence.
You would need one trigger per table. A trigger can only be defined on a single table.
The cache attribute specifies how many values for the sequence should be cached (per node if you happen to be using RAC). That makes generating new values more efficient at the expense of increasing the number of gaps that are created. In general, you'd use larger caches for sequences the more frequently rows are inserted.

Resources