Most Efficient Way to Create an Index in Postgres

Is it more efficient to create an index after loading data is complete or before, or does it not matter?
For example, say I have 500 files to load into a Postgres 8.4 DB. Here are the two index creation scenarios I could use:
Create index when table is created, then load each file into table; or
Create index after all files have been loaded into the table.
The table data itself is about 45 Gigabytes. The index is about 12 Gigabytes. I'm using a standard index. It is created like this:
CREATE INDEX idx_name ON table_name (column_name);
My data loading uses COPY FROM.
Once all the files are loaded, no updates, deletes or additional loads will occur on the table (it's a day's worth of data that will not change). So I wanted to ask which scenario would be most efficient? Initial testing seems to indicate that loading all the files and then creating the index (scenario 2) is faster, but I have not done a scientific comparison of the two approaches.

Your observation is correct: it is much more efficient to load the data first and only then create the index. The reason is that index updates during inserts are expensive; if you build the index after all the data is in place, the build is much faster.
It goes even further: if you need to import a large amount of data into an existing indexed table, it is often more efficient to drop the existing index first, import the data, and then re-create the index.
One downside of creating the index after importing is that the table must be locked against writes while the index builds, and that may take a long time (the load itself needs no such lock in the opposite scenario). But in PostgreSQL 8.2 and later you can use CREATE INDEX CONCURRENTLY, which does not lock the table against writes during indexing (with some caveats).
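A minimal sketch of the load-then-index approach (scenario 2), assuming a single-column table and placeholder file paths:

-- Scenario 2: create the bare table, bulk-load every file, build the index once at the end.
CREATE TABLE table_name (column_name text);

COPY table_name FROM '/path/to/file_001';
COPY table_name FROM '/path/to/file_002';
-- ...repeat for the remaining files...

-- Plain CREATE INDEX blocks writes (but not reads) while it builds:
CREATE INDEX idx_name ON table_name (column_name);

-- On 8.2+ the index can instead be built without blocking writes, at the cost of a slower build:
-- CREATE INDEX CONCURRENTLY idx_name ON table_name (column_name);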

Related

Importing a large amount of data into Elasticsearch every time by dropping existing data

Currently, there's a denormalized table inside a MySQL database that contains hundreds of columns and millions of records.
The original source of the data does not have any way to track the changes so the entire table is dropped and rebuilt every day by a CRON job.
Now, I would like to import this data into Elasticsearch. What is the best way to approach this? Should I use Logstash to connect directly to the table and import it, or is there a better way?
Exporting the data into JSON or similar is an expensive process since we're talking about gigabytes of data every time.
Also, should I drop the index in elastic as well or is there a way to make it recognize the changes?
In any case - I'd recommend using index templates to simplify index creation.
Now for the ingestion strategy, I see two possible options:
Rework your ETL process to do a merge instead of dropping and recreating the entire table. This would definitely be slower but would allow shipping only deltas to ES or any other data source.
As you've imagined yourself, you should probably be fine with Logstash using daily jobs. Create a daily index and drop the old one during the daily migration.
You could introduce a buffer such as Kafka into your infrastructure, but I feel that might be overkill for your current use case.

Is there an advantage to using a local index on a partitioned table in Oracle?

I assume the answer is "no" in this scenario, but I figured I'd ask and see if there was something I was missing:
I have an Oracle table which is partitioned for ease of data loading -- data is loaded into six separate tables and then partition-switched into the main table. The only thing differentiating these loading tables is the source of the data, so each one has a unique datasource column which is used to partition the main table. We occasionally have some ad hoc queries which look at this datasource in the main table, but the standard reports querying this table ignore this column entirely. Nothing insert/update/deletes individual records from this table, so there's no concern about updating any indexes.
In this case, is there any reason to use local indexes instead of global ones?
A local index makes a lot of sense if you use partitioning for performance reasons.
If your queries always contain the partition key, then Oracle will only scan that specific partition (this is known as "partition pruning").
If you then have additional conditions that would benefit from an index lookup, the database only needs to check the local index, which is much smaller than a global index, so the lookup will be faster.
In your case, if you never (or almost never) include the partition key in the queries, you are right that the local index wouldn't be helpful.
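For illustration, a hypothetical version of such a table, list-partitioned by datasource (names, columns and two of the six sources are made up):

-- List-partitioned by the datasource column used for partition switching.
CREATE TABLE main_table (
  datasource VARCHAR2(10),
  load_date  DATE,
  amount     NUMBER
)
PARTITION BY LIST (datasource) (
  PARTITION p_src1 VALUES ('SRC1'),
  PARTITION p_src2 VALUES ('SRC2')
);

-- Local index: one index segment per partition; it only pays off when the
-- optimizer can prune to a partition, i.e. when datasource is in the query.
CREATE INDEX ix_main_local ON main_table (load_date) LOCAL;

-- Alternatively, a global index: a single structure spanning all partitions,
-- usually the better fit when queries ignore the partition key.
-- CREATE INDEX ix_main_global ON main_table (load_date);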

Informatica 9.5.1, huge table (scd1)

I have a table (in Oracle) of about 860 million records (850 GB), and on top of that we are getting about 2-3 million records as source (flat file).
We are doing a lookup on the target: if the record already exists it will update, and if it is a new record it will insert (SCD1).
The transformations we are using are unconnected lookup, sorter, filter, router and update strategy. It was fine all this time, but as the table is huge and still growing, it is taking forever to insert and update; last night it took 19 hours for 2.4 million records (2.1 million were new and inserted, the rest were updates).
Today I have about 1.9 million to go through and I am not sure how long it will take. Any suggestions or help on how we can handle this?
1) Use just a connected lookup to the Oracle table after the Source Qualifier, matching on the primary key, and filter on nulls (records missing in the Oracle table, i.e. inserts) versus not nulls (updates). Don't check other columns for the update. Skip the sorter and filter; just use the update strategy.
2) Or use a joiner and make the flat-file pipeline the master. Then check for nulls to distinguish inserts from updates.
3) Check whether your target table has any triggers etc. on it. If it does, review their logic and implement it in the ETL instead.
Since you are dealing with ~850 million rows, you have two major bottlenecks: the target lookup and writing into the target.
You can think of this strategy:
Mapping 1: create a new mapping to load the flat-file data into a temp table TMP1.
Mapping 2: modify the existing mapping. Just change the lookup query to join TMP1 and the target (860 million row) table in the SQL override. This will reduce time, I/O and the lookup cache.
Also, please make sure you have an index on the key columns in the target, and that you drop and re-create all other indexes around the load. Skipping the sorter will help, but adding a joiner will not help much.
Regards,
Koushik
How many inserts vs updates do you have?
With just a few updates, try using the Update else Insert target property.
If there are many updates and few inserts, perform the update just if a key is found, without checking whether anything has changed.
If there are many source rows matching what you already have (i.e. an update that doesn't change anything), try to eliminate them. But don't compare all columns - use a hash instead. Just create an additional computed column that will contain an MD5 calculated over all columns. Then all you need to do is compare one column instead of all of them to detect a change.
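A rough sketch of that hash-column idea, assuming Oracle 12c's STANDARD_HASH is available (DBMS_CRYPTO.HASH can be used on older releases); column names are placeholders:

-- One-time: add a stored MD5 over the non-key columns of the target.
ALTER TABLE target_table ADD (row_md5 RAW(16));

UPDATE target_table
SET    row_md5 = STANDARD_HASH(col1 || '|' || col2 || '|' || TO_CHAR(col3), 'MD5');
COMMIT;

-- During the load, compute the same expression for the incoming row and
-- update only when the stored hash differs.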
1) Try using a MERGE statement (sketched below) if the source and target are in the same database.
2) We can also use a SQL*Loader connection to improve the performance.
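A hedged sketch of that MERGE approach, assuming the flat file has first been staged in the database (table and column names are made up):

MERGE INTO target_table t
USING stage_table s
ON (t.business_key = s.business_key)
WHEN MATCHED THEN
  UPDATE SET t.col1 = s.col1,
             t.col2 = s.col2
WHEN NOT MATCHED THEN
  INSERT (business_key, col1, col2)
  VALUES (s.business_key, s.col1, s.col2);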
Clearly the bottleneck is in the target lookup and target load (update to be specific).
Try the following to tune the existing code:
1) Try to remove any unwanted ports from the lookup transformation. Keep only the fields that are used in the lookup condition, since you are using it just to check whether the record exists.
2) Try adding an index to the target table on the fields you are using for the update.
3) Increase the commit interval of the session to a higher value.
4) Partial Pushdown optimization:
You can push down some of the processing to the database, which might be faster than doing it in Informatica.
Create a staging table to hold the incoming data for that load.
Create a mapping to load the incoming file to the staging table. Truncate it before the start of the load to clear the records of the previous run.
In the SQL override of the existing mapping, do a left join between the staging table and the target table to find inserts/updates (see the sketch after this answer). This will be faster than the Informatica lookup and eliminates the time taken to build the Informatica lookup cache.
5) Using MD5 to eliminate unwanted updates
To use MD5 you need to add a new field to the target table and run a one-time mapping to populate it for the existing records.
Then, in your existing mapping, add a step to compute the MD5 of the incoming columns.
If the record is identified for update, check whether the MD5 computed for the incoming columns is the same as that of the target column. If the checksum matches, don't update the record; only update it if the checksum differs. This way you filter out the unwanted updates. If there is no lookup match, insert the record.
Advantages: you reduce the unwanted updates.
Disadvantages: you have to run a one-time process to populate MD5 values for the existing records in the table.
If none of this works, check with your database administrator to see if there is any issue on the database side that might be slowing down the load.
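As a rough illustration of the left-join override from step 4 (table and key names are placeholders), one query can classify each incoming row in a single pass, and the MD5 check from step 5 can be bolted on as an extra condition:

SELECT s.*,
       CASE WHEN t.business_key IS NULL THEN 'INSERT' ELSE 'UPDATE' END AS load_flag
FROM   stage_table s
LEFT JOIN target_table t
       ON t.business_key = s.business_key;
-- Optional, per step 5: filter out no-op updates with something like
-- WHERE t.business_key IS NULL OR t.row_md5 <> s.row_md5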

Deletes Slow on an Oracle BIG Table

I have a table which has around 180 million records and 40 indexes. A nightly program loads data into this table, but due to certain business conditions we can only delete and load data into this table. The nightly program brings new records or updates to existing records in the table from the source system. We have a limited window, i.e. about 6 hours, to complete the extract from the source system, perform business transformations, load the data into this target table and be ready for users to consume the data in the morning.
The issue we are facing is that the delete from this table takes a lot of time, mainly due to the 40 indexes on the table (an average of 70,000 deletes per hour). I did some digging on the internet and see the options below:
a) Drop or disable indexes before the delete and then rebuild them: after the delete and load, the program needs to perform quite a few updates for which the indexes are critical, and rebuilding one index takes almost 1.5 hours due to the enormous amount of data in the table. So this approach is not feasible given the time it takes to rebuild the indexes and the limited window we have to get the data ready for the users.
b) Use bulk delete: currently the program deletes based on rowid and deletes records one by one, as below:
DELETE
FROM <table>
WHERE rowid = g_wpk_tab(ln_i);
g_wpk_tab is the collection that holds the rowids to be deleted, processed in a loop via FORALL, and I do an intermediate commit every 50,000 row deletes.
Tom of AskTom says in this discussion over here that the bulk delete and the row-by-row delete will take almost the same amount of time:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:5033906925164
So this won't be a feasible option either.
c) Regular delete: Tom of AskTom suggests using a regular delete, and even that takes a long time, probably due to the number of indexes on this table.
d) CTAS: this approach is out of the question because the program would need to recreate the table, create the 40 indexes and then proceed with the updates, and as I mentioned above an index takes at least 1.5 hours to create.
If you could provide me any other suggestions I would really appreciate it.
UPDATE: As of now we have decided to go with the approach suggested by https://stackoverflow.com/users/409172/jonearles to archive instead of delete. The approach is to add a flag to the table to mark the records to be deleted as DELETE, and then have a post-delete program run during the day to remove those records. This ensures that the data is available for users at the right time. Since users consume the data via OBIEE, we are planning to set a content-level filter on the table to exclude the archival column so that users needn't know what to select and what to ignore.
Parallel DML: alter session enable parallel dml;, then delete /*+ parallel */ ...;, then commit;. Sometimes it's that easy.
Parallel DDL: alter index your_index rebuild nologging compress parallel;. NOLOGGING reduces the amount of redo generated during the index rebuild. COMPRESS can significantly reduce the size of a non-unique index, which significantly reduces the rebuild time. PARALLEL can also make a huge difference in rebuild time if you have more than one CPU or more than one disk. If you're not already using these options, I wouldn't be surprised if using all of them together improves index rebuilds by an order of magnitude. And then 1.5 hours * 40 indexes / 10 = 6 hours.
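Roughly, with placeholder names and the delete condition left as an assumption (your process deletes by rowid, so it's the hint and session setting that matter here):

ALTER SESSION ENABLE PARALLEL DML;

DELETE /*+ PARALLEL */ FROM big_table
WHERE  <condition identifying the rows to remove>;

COMMIT;

ALTER INDEX your_index REBUILD NOLOGGING COMPRESS PARALLEL;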
Re-evaluate your indexes: do you really need 40 indexes? It's entirely possible, but many indexes are only created because "indexes are magic". Make sure there's a legitimate reason behind each index. This can be very difficult to do; very few people document the reason for an index. Before you ask around, you may want to gather some information. Turn on index monitoring to see which indexes are really being used. And even if an index is used, see how it is used, perhaps through v$sql_plan. It's possible that an index is used for a specific statement but another index would have worked just as well.
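For reference, a sketch of the index-monitoring step (index name is a placeholder; V$OBJECT_USAGE only shows indexes owned by the current user, and newer releases also expose DBA_INDEX_USAGE):

ALTER INDEX some_index MONITORING USAGE;

-- ...let a representative workload run, then:
SELECT index_name, used, start_monitoring
FROM   v$object_usage
WHERE  index_name = 'SOME_INDEX';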
Archive instead of delete: instead of deleting, just set a flag to mark a row as archived, invalid, deleted, etc. This avoids the immediate overhead of index maintenance. Ignore the rows temporarily and let some other job delete them later. The large downside is that it affects every query on the table.
Upgrading is probably out of the question, but 12c has an interesting new feature called in-database archiving. It's a more transparent way of accomplishing the same thing.
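Two hedged sketches of the archive idea, a plain flag column and the 12c In-Database Archiving variant (table name and predicate are placeholders):

-- Manual flag: mark instead of delete, clean up later in a background job.
ALTER TABLE big_table ADD (archive_flag CHAR(1));
UPDATE big_table SET archive_flag = 'Y' WHERE <rows to remove>;

-- 12c In-Database Archiving: archived rows stay in the table but disappear
-- from normal queries via the hidden ORA_ARCHIVE_STATE column.
ALTER TABLE big_table ROW ARCHIVAL;
UPDATE big_table SET ora_archive_state = '1' WHERE <rows to remove>;
-- Sessions that still need to see archived rows can opt in:
ALTER SESSION SET ROW ARCHIVAL VISIBILITY = ALL;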

Deleting large number of rows of an Oracle table

I have a data table from the company which is 250 GB with 35 columns. I need to delete around 215 GB of data, which is obviously a large number of rows to delete from the table. This table has no primary key.
What could be the fastest method to delete data from this table? Are there any tools in Oracle for such large deletion processes?
Please suggest the fastest way to do this using Oracle.
As said in the other answer, it's better to move the rows to be retained into a separate table and truncate the original, because of a thing called the HIGH WATERMARK (a plain DELETE does not lower it). More details can be found here: http://sysdba.wordpress.com/2006/04/28/how-to-adjust-the-high-watermark-in-oracle-10g-alter-table-shrink/ . A delete of this size will also overwhelm your UNDO tablespace, as it's called.
The "recovery model" term is rather applicable to MSSQL, I believe :).
Hope this clarifies the matter a bit.
Thanks.
Do you know which records need to be retained? How will you identify each record?
A solution might be to move the records to be retained to a temp db, and then truncate the big table. Afterwards, move the retained records back.
Beware that the transaction log file might become very big because of this (but depends on your recovery model).
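A hedged sketch of that keep-and-truncate idea in Oracle terms (names and the keep-condition are placeholders; NOLOGGING and the direct-path insert are optional speed-ups):

-- Copy out the ~35 GB you want to keep.
CREATE TABLE keep_rows NOLOGGING AS
SELECT * FROM big_table
WHERE  <condition identifying the rows to retain>;

-- Truncate resets the high water mark and generates almost no undo/redo.
TRUNCATE TABLE big_table;

-- Move the retained rows back with a direct-path insert.
INSERT /*+ APPEND */ INTO big_table
SELECT * FROM keep_rows;
COMMIT;

DROP TABLE keep_rows;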
We had a similar problem a long time ago: a table with 1 billion rows in it from which we had to remove a very large proportion of the data based on certain rules. We solved it by writing a Pro*C job that applied the rules, extracted the data we wanted to keep, and sprintf'd it to a CSV file.
Then we created a sqlldr control file to load the data using the direct path, which won't create undo/redo (and if you need to recover the table before your next backup, you still have the CSV file anyway).
The sequence was:
Run the Pro*C job to create CSV files of the data to keep
Generate the DDL for the indexes
Drop the indexes
Run SQL*Loader using the CSV files
Recreate the indexes using a parallel hint
Analyse the table using degree(8)
The amount of parallelism depends on the CPUs and memory of the DB server; we had 16 CPUs and a few gigs of RAM to play with, so it was not a problem.
The extract of the correct data was the longest part of this.
After a few trial runs, SQL*Loader was able to load the full 1 billion rows (that's a US billion, or 1000 million rows) in under an hour.
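The post-load steps might look roughly like this today (index, table and column names plus the degree are assumptions; DBMS_STATS is the modern replacement for ANALYZE):

-- Recreate each dropped index in parallel after the direct-path load.
CREATE INDEX big_table_ix1 ON big_table (key_col)
  NOLOGGING PARALLEL 8;

-- Gather table and index statistics with a matching degree.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,
    tabname => 'BIG_TABLE',
    degree  => 8,
    cascade => TRUE);
END;
/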
