Am workin in oracle 11g. Does ALTER INDEX REBUILD ONLINE render indexes invalid when executed parallely?
A new index is built in parallel and while the rebuild is in progress the old index is still available. However, when the new index is swapped in for the old index there will be a period (hopefully short) when neither index is available.
From http://www.oracle-base.com/articles/9i/HighAvailabilityEnhancements9i.php#OnlineIndexRebuilds:
When the ONLINE keyword is used as
part of the CREATE or ALTER syntax
the current index is left intact while
a new copy of the index is built,
allowing DML to access the old index.
Any alterations to the old index are
recorded in a Index Organized Table
known as a "journal table". Once the
rebuild is complete the alterations
from the journal table are merged into
the new index. This may take several
passes depending on the frequency of
alterations to the index. The process
will skip any locked rows and commit
every 20 rows. Once the merge
operation is complete the data
dictionary is updated and the old
index is dropped. DML access is only
blocked during the data dictionary
updates, which complete very quickly.
I watched an online rebuild with toad and executed a query that used the specified index.
Don't see it becoming invalid.
Maybe for a very short time at the precise moment the 'new' index is activated.
Haven't found any docs about it yet though.
Related
For existing table i have added the index to check the performance. Table has 1.5 million records. The existing cost is "58645". Once created the index the cost is reduced to "365". So that often time I have made the index as "unusable". Then I alter and rebuild the index to check. For yesterday known the index is being used by explain plan in oracle. But today when I unusable the index and rebuild, in explain plan the index scan was not working. But performance remains fast than older. I have dropped and created again. But still the issue is remaining. Fetching is fast. But the explain plan showing that the index is not being used and the cost is showing "58645". Am stuck with this.
Many times when you create the new index or rebuild it from scratch it doesn't show up in explain plan and sometime is not used for a while as well. To correct the explain plan the stats should be gathered on index.
EXEC DBMS_STATS.GATHER_INDEX_STATS should be used or use DBMS_STATS.GATHER_TABLE_STATS with cascade option.
Blocks of data are cached in the BUFFER_POOL, which will affect your results such that:
Run Query;
Change Index;
Run Query; - buffered data from 1 will skew the preformance
Flush buffer pool
Run Query - now you get a truer measure of how "fast" the query is.
Did you flush the buffer?
ALTER SYSTEM FLUSH BUFFER_POOL;
I have a table(in oracle) size about 860 million records (850gb) on top we are getting about 2 -3 million records as source (flatfile).
we are doing a lookup on target if record already exist it will update if it is a new record it will insert(scd1).
The transformations we using are unconnectedlookup, sorter, filter and router, update strategy transformations, it was fine all this time, but as the table is huge and growing huge, it is taking for ever to insert and update, last night it took 19 hrs to 2.4 million records (2.1 millions were new so inserted and the rest are updates).
Today I got about 1.9 millions to go through i am not sure how long it will take any suggestions or help how can we handle this ?
1) Use just a connected lookup to oracle table, after SQ matching on primary key and filter out nulls (records missing in Oracle table) or not null (updates). Dont check for other columns for update. Skip sorter and filter. Just use update strategy.
2) Or use joiner and make flat file pipeline as master. Then check for nulls to find insert or updates.
3) Check if your target table dont have any trigger etc on it. If yes then check its logic and implement it in ETL.
Since you are dealing with 850mil data, you have two major bottlenecks - target lookup and writing into target.
You can think of this strategy -
Mapping 1 - Create a new mapping to load flat file data into a temp table TMP1.
Mapping 2 - Modify existing mapping. Just modify lookup query and join TMP1 and target (860mil)table in SQL Override. This will reduce time, I/O, lookup cache.
Also, please make sure you have an index on key columns in target. And you drop-create all other index while loading. Skipping sorter will help but adding joiner will not help much.
Regards,
Koushik
How many inserts vs updates do you have?
With just a few updates, try using Update else Insert target
property.
If there are many updates and few inserts, perform update
just if a key is found, without checking if anything has changed
If there are many source rows matching what you already have (i.e. an update that doesn't change anything) try to eliminate them. But don't compare all columns - use a hash instead. Just create an additional computed column that will contain a MD5 calculated on all columns. Then all you need to do is compare one column instead of all to detect a change.
1) Try using a merge statement if source and targets are in same database.
2) We can also use sql loader connection to improve the performance.
Clearly the bottleneck is in the target lookup and target load (update to be specific).
Try the following to tune the existing code:
1) Try to remove any unwanted lookup ports if you have in the lookup transformation. Keep only the fields that are used in the lookup condition as you are using it just to check if the record exists.
2) Try adding an index to the target table for the fields you are using for the update
3) Increase the commit interval of the session to a higher value.
4) Partial Pushdown optimization:
You can pushdown some of the processing to database which might be faster instead of doing it in Informatica
Create a staging table to hold the incoming data for that load.
Create a mapping to load the incoming file to the staging table. Truncate it before the start of the load to clear the records of the previous run.
In the SQL override of the existing mapping do a left join between the staging table and target table to find insert/updates. This will be faster than the Informatica lookup and eliminates the time taken to build the Informatica lookup cache.
5) Using MD5 to eliminate unwanted updates
For using MD5 you need to add a new field in the target table and do a mapping to update the existing records one time.
Then in your existing mapping add a step to compute MD5 for the incoming column.
If the record is identified for update then check if the MD5 computed for the incoming column is same as that of the target column. If the checksum also matches then don't update the record. Only if the check sum is different update the record. By this way you will filter out the unwanted updates. If there is no lookup match then insert the record.
Advantages: You are reducing the unwanted updates.
Disadvantages: You have to do an one time process to populate MD5 values for the existing records in the table.
If none of this works check with your database administrator to see if there is any issue in the database side that might slow down the load.
Is it more efficient to create an index after loading data is complete or before, or does it not matter?
For example, say I have 500 files to load into a Postgres 8.4 DB. Here are the two index creation scenarios I could use:
Create index when table is created, then load each file into table; or
Create index after all files have been loaded into the table.
The table data itself is about 45 Gigabytes. The index is about 12 Gigabytes. I'm using a standard index. It is created like this:
CREATE INDEX idx_name ON table_name (column_name);
My data loading uses COPY FROM.
Once all the files are loaded, no updates, deletes or additional loads will occur on the table (it's a day's worth of data that will not change). So I wanted to ask which scenario would be most efficient? Initial testing seems to indicate that loading all the files and then creating the index (scenario 2) is faster, but I have not done a scientific comparison of the two approaches.
Your observation is correct - it is much more efficient to load data first and only then create index. Reason for this is that index updates during insert are expensive. If you create index after all data is there, it is much faster.
It goes even further - if you need to import large amount of data into existing indexed table, it is often more efficient to drop existing index first, import the data, and then re-create index again.
One downside of creating index after importing is that table must be locked, and that may take long time (it will not be locked in opposite scenario). But, in PostgreSQL 8.2 and later, you can use CREATE INDEX CONCURRENTLY, which does not lock table during indexing (with some caveats).
I have a table which has around 180 million records and 40 indexes. A nightly program, loads data into this table but due to certain business conditions we can only delete and load data into this table. The nightly program will bring new records or updates to existing records in the table from the source system.We have limited window i.e about 6 hours to complete the extract from the source system, perform business transformations and finally load the data into this target table and be ready for users to consume the data in the morning. The issue which we are facing is that the delete from this table takes a lot of time mainly due to the 40 indexes on the table(an average of 70000 deletes per hour). I did some digging on the internet and see the below options
a) Drop or disable indexes before delete and then rebuild indexes: The program which loads data into the target table after delete and loading the data needs to perform quite a few updates for which the indexes are critical. And to rebuild 1 index it takes almost 1.5 hours due to the enormous amount of data in the table. So this approach is not feasible due to the time it takes to rebuild indexes and due to the limited time we have to get the data ready for the users
b) Use bulk delete: Currently the program deletes based on rowid and deletes records one by one as below
DELETE
FROM <table>
WHERE rowid = g_wpk_tab(ln_i);
g_wpk_tab is the collection which holds rowids to be deleted which is read by looping via FOR ALL and I do an intermediate commit every 50000 row deletes.
Tom of AskTom says in this discussion over here says that the bulk delete and row by row delete will take almost the same amount of time
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:5033906925164
So this wont be a feasible option as well
c)Regular Delete: Tom of AskTom suggests to use the regular delete and even that takes a long time probably due to the number of indexes on this table
d)CTAS: This approach is out of question because the program needs to recreate the table , create the 40 indexes and then proceed with the updates and I mentioned above an index will take atleast 1.5 hrs to create
If you could provide me any other suggestions I would really appreciate it.
UPDATE: As of now we have decided to go with the approach suggested by https://stackoverflow.com/users/409172/jonearles to archive instead of delete. Approach is to add a flag to the table to mark the records to be deleted as DELETE and then have a post delete program run during the day to delete off the records. This will ensure that the data is available for users at the right time. Since users consume via OBIEE we are planning to set content level filter on the table to not look at the archival column so that users needn't know about what to select and what to ignore.
Parallel DML alter session enable parallel dml;, delete /*+ parallel */ ...;, commit;. Sometimes it's that easy.
Parallel DDL alter index your_index rebuild nologging compress parallel;. NOLOGGING to reduce the amount of redo generated during the index rebuild. COMPRESS can significantly reduce the size of a non-unique index, which significantly reduces the rebuild time. PARALLEL can also make a huge difference in rebuild time if you have more than one CPU or more than one disk. If you're not already using these options, I wouldn't be surprised if using all of them together improves index rebuilds by an order of magnitude. And then 1.5 * 40 / 10 = 6 hours.
Re-evaluate your indexes Do you really need 40 indexes? It's entirely possible, but many indexes are only created because "indexes are magic". Make sure there's a legitimate reason behind each index. This can be very difficult to do, very few people document the reason for an index. Before you ask around, you may want to gather some information. Turn on index monitoring to see which indexes are really being used. And even if the index is used, see how it is used, perhaps through v$sql_plan. It's possible that an index is used for a specific statement but another index would have worked just as well.
Archive instead of delete Instead of deleting, just set a flag to mark a row as archived, invalid, deleted, etc. This will avoid the immediate overhead of index maintenance. Ignore the rows temporarily and let some other job delete them later. The large downside to this is that it affects any query on the table.
Upgrading is probably out of the question, but 12c has an interesting new feature called in-database archiving. It's a more transparent way of accomplishing the same thing.
Sorry if this is a dumb question but do i need to reindex my table every time i insert rows, or does the new row get indexed when added?
From the manual
Once an index is created, no further intervention is required: the system will update the index when the table is modified
http://postgresguide.com/performance/indexes.html
I think when you insert rows, the index does get updated. It maintains the sort on the index table as you insert data. Hence there are performance issues or downtimes on a table, if you try adding large number of rows at once.
On top of the other answers: PostgreSQL is a top notch Relational Database. I'm not aware of any Relational Database system where indices are not updated automatically.
It seems to depend on the type of index. For example, according to https://www.postgresql.org/docs/9.5/brin-intro.html, for BRIN indexes:
When a new page is created that does not fall within the last summarized range, that range does not automatically acquire a summary tuple; those tuples remain unsummarized until a summarization run is invoked later, creating initial summaries. This process can be invoked manually using the brin_summarize_new_values(regclass) function, or automatically when VACUUM processes the table.
Although this seems to have changed in version 10.