TSQL Merge Performance

Scenario:
I have a table with roughly 24 million records. The table holds pricing history for individual customers and is recomputed daily; there are on average 6 million records for each day. Every morning the price list is generated and a MERGE statement is run to reflect the changes in pricing.
The process begins with the previous day's data being inserted into a table variable, and that table is then merged into the actual table. The main problem is that the MERGE statement takes quite a long time.
My real question centers on the performance of using a table variable vs. a physical table vs. a temp table. What is the best practice for large merges like this?

Thoughts
I'd consider a temp table: temp tables have statistics, which will help, whereas a table variable is always assumed to contain one row. Also, the IO can be shunted onto separate drives (assuming tempdb is on separate storage).
If a single transaction is not required, I'd also split the MERGE into a DELETE, UPDATE, INSERT sequence to reduce the amount of work needed in each action (which reduces the amount of rollback information needed, the amount of locking, etc.).
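A minimal sketch of that split, assuming a hypothetical target table dbo.CustomerPrice keyed on (CustomerID, PriceDate), a staged source #DailyPrices, and @PriceDate holding the day being loaded:

-- Sketch only: three smaller statements instead of one big MERGE.
-- 1. Remove rows for the day that no longer appear in the feed
DELETE t
FROM dbo.CustomerPrice AS t
WHERE t.PriceDate = @PriceDate
  AND NOT EXISTS (SELECT 1 FROM #DailyPrices AS s
                  WHERE s.CustomerID = t.CustomerID
                    AND s.PriceDate  = t.PriceDate);

-- 2. Update rows whose price changed
UPDATE t
SET    t.Price = s.Price
FROM   dbo.CustomerPrice AS t
JOIN   #DailyPrices      AS s
       ON  s.CustomerID = t.CustomerID
       AND s.PriceDate  = t.PriceDate
WHERE  t.Price <> s.Price;

-- 3. Insert rows that are new
INSERT dbo.CustomerPrice (CustomerID, PriceDate, Price)
SELECT s.CustomerID, s.PriceDate, s.Price
FROM   #DailyPrices AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.CustomerPrice AS t
                   WHERE t.CustomerID = s.CustomerID
                     AND t.PriceDate  = s.PriceDate);

Each statement can be committed separately, which keeps the transaction log and lock footprint smaller than a single 6-million-row MERGE.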

Temp tables often perform better than table variables for large data sets. Additionally, you can put the data into the temp table and then index it.
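For example, a rough sketch of staging into an indexed temp table before the MERGE (all names are hypothetical):

-- Sketch only: the temp table gets real statistics, and the clustered
-- index on the join keys helps the MERGE plan.
CREATE TABLE #DailyPrices
(
    CustomerID int           NOT NULL,
    PriceDate  date          NOT NULL,
    Price      decimal(18,4) NOT NULL
);

INSERT #DailyPrices (CustomerID, PriceDate, Price)
SELECT CustomerID, PriceDate, Price
FROM   dbo.PriceFeed;    -- hypothetical source of the morning price list

CREATE CLUSTERED INDEX IX_DailyPrices ON #DailyPrices (CustomerID, PriceDate);

MERGE dbo.CustomerPrice AS t
USING #DailyPrices      AS s
   ON  t.CustomerID = s.CustomerID
   AND t.PriceDate  = s.PriceDate
WHEN MATCHED AND t.Price <> s.Price THEN
    UPDATE SET t.Price = s.Price
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, PriceDate, Price)
    VALUES (s.CustomerID, s.PriceDate, s.Price);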

Check whether you have indexes on the tables. Indexes are updated every time you add or delete records in the table.
Try removing the indexes before merging the records and then re-creating them after the merge.
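A rough illustration of that pattern, assuming a hypothetical nonclustered index IX_CustomerPrice_Customer on the target table:

-- Sketch only: drop secondary indexes, run the merge, then rebuild them.
-- Whether this wins depends on how much of the table the merge touches.
DROP INDEX IX_CustomerPrice_Customer ON dbo.CustomerPrice;

-- ... run the MERGE (or the DELETE/UPDATE/INSERT sequence) here ...

CREATE NONCLUSTERED INDEX IX_CustomerPrice_Customer
    ON dbo.CustomerPrice (CustomerID)
    INCLUDE (Price);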

Related

Can we reduce the high water mark by using gather stats in Oracle?

I am very new to Oracle and I have the below query.
I have one table which has almost 6 lakh (600,000) records.
In a daily batch I need to delete almost 5.7 lakh records and insert them again from another table. Note that I cannot use TRUNCATE TABLE because 30,000 records are constant ones which I should not delete.
The issue here is that if I delete 5.67 lakh records daily, it may cause a high water mark issue.
So my query is: can gather stats help to reduce the HWM?
I can run Oracle gather stats daily.
You can use the SHRINK command to recover the space and reset the high water mark:
alter table your_table shrink space;
However, you should only do this if you need to. In your case it seems likely you would only need to do so if you are inserting your 567,000 records using the /*+ APPEND */ hint; this hint tells Oracle to insert the records above the HWM, which in your scenario would cause the table to keep growing, with vast amounts of empty space. Shrinking would definitely be useful there.
If you're just inserting the records without the hint, they will mostly reuse the empty space vacated by the prior deletion, so you don't need to concern yourself with the HWM.
Incidentally, deleting and re-inserting 5.67 lakh records every day sounds like rather poor practice. There are probably better solutions (such as MERGE), depending on the underlying business rules you're trying to satisfy.
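As an illustration, a minimal sequence for the two insert styles and the shrink, reusing the your_table placeholder from above (other_table is also hypothetical):

-- Direct-path insert: writes above the HWM, so deleted space is not reused.
INSERT /*+ APPEND */ INTO your_table SELECT * FROM other_table;

-- Conventional insert: reuses free space below the HWM left by the deletes.
INSERT INTO your_table SELECT * FROM other_table;

-- Shrinking requires row movement to be enabled first.
ALTER TABLE your_table ENABLE ROW MOVEMENT;
ALTER TABLE your_table SHRINK SPACE;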
If you reinsert the same amount of data, the table should be roughly the same size. Therefore, I would not worry about the high water mark too much.
Having said that, it is good practice to gather statistics if the data has changed substantially, so I would recommend that.
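For reference, a minimal sketch of the daily statistics gathering with DBMS_STATS (schema and table names are placeholders):

-- Sketch only: refreshes optimizer statistics after the daily reload;
-- it does not move the high water mark.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'YOUR_SCHEMA',
    tabname => 'YOUR_TABLE',
    cascade => TRUE);   -- also gather statistics on the indexes
END;
/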

How to append data to an existing Hive table without partitions

I have created a Hive table which contains historical stock data for the past 10 years. From now on I have to append data on a daily basis.
I thought of creating partitions based on date, but that leads to many partitions, approximately 3000, plus a new partition for every new date; I think this is not feasible.
Can anyone suggest the best approach to store all the historical data in the table and append the new data as it comes?
As for every partitioned table, the decision on how to partition your table depends primarily on how you are going to query the table.
Another consideration is how much data you're going to have per partition, as partitions should not be too small. As an absolute minimum, each one should be at least as big as one HDFS block, since otherwise you end up with too many directories.
That said, I don't think 3000 partitions would be a problem. At a previous job we had a huge table with one partition per hour; each hour was about 20 GB, we had 6 months of data, so about 4000 partitions, and it worked just fine.
In our case, most people cared most about the last week and the last day.
I suggest that as a first step you research how the table is going to be used, that is, will all 10 years be used, or mostly just the most recent data?
As a second step, study how big the data is, consider whether it may grow in size with the new loads, and see how big each partition is going to be.
Once you've determined these two points, you can make a decision: you could just use daily partitions (which could be fine; 3000 partitions is not bad), or you could go weekly or monthly.
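As an illustration of the daily-partition option, a minimal sketch in HiveQL (table and column names are hypothetical):

-- Sketch only: a date-partitioned history table plus a dynamic-partition
-- insert that appends each new day's data from a staging table.
CREATE TABLE IF NOT EXISTS stock_history (
  symbol      STRING,
  close_price DOUBLE,
  volume      BIGINT
)
PARTITIONED BY (trade_date STRING)
STORED AS ORC;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE stock_history PARTITION (trade_date)
SELECT symbol, close_price, volume, trade_date
FROM   stock_staging;   -- hypothetical staging table holding the daily feed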
You can use this command:
LOAD DATA LOCAL INPATH '<FILE_PATH>' INTO TABLE <TABLE_NAME>;
It will create new files under the HDFS directory mapped to the table name. Even though you won't have too many partitions with this approach, you will still run into a "too many files" problem.
Periodically, you need to do this:
1. Create a stage table.
2. Move the data by running a LOAD command from the target table into the stage table.
3. Run an insert into the target table selecting from the stage table; the data is now written with the number of files equal to the number of reducers.
4. Drop the stage table.
You can run this process at regular intervals (probably once a month).
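A rough sketch of that periodic compaction, assuming an unpartitioned target table stock_history, a stage table with the same layout, and a hypothetical warehouse path:

-- Sketch only: rewrite the accumulated small files into fewer, larger ones.
CREATE TABLE stock_history_stage LIKE stock_history;

-- Move the target table's existing files into the stage table.
LOAD DATA INPATH '/user/hive/warehouse/stock_history' INTO TABLE stock_history_stage;

-- Write the data back; the output file count follows the number of reducers.
INSERT INTO TABLE stock_history
SELECT * FROM stock_history_stage;

DROP TABLE stock_history_stage;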

When is the right time to create Indexes in Oracle?

A brand new application with Oracle as the data store is going to be pushed to production. The database uses the CBO and I have identified some columns for indexing. I expect the total number of records in a particular table to be 4 million after 6 months. After that, very few records will be added and there will not be any updates to the indexed columns; most of the updates will be on non-indexed columns.
Is it advisable to create the indexes now, or should I wait a couple of months?
If the table requires indexes, you will incur a lot of poor performance (full table scans + actual I/O) once the number of rows in the table goes beyond what might reasonably be kept in the cache. Assume that is 20,000 rows; we'll call it the magic number. You will hit 20,000 rows within a week of production. After that, queries and updates on the table will grow progressively slower, on average, as more rows are added.
You are probably worried about the overhead of inserting new rows with indexed fields. That is a one-time hit. You are trading it against dozens of slow queries and updates for as long as you delay adding the indexes.
The trade-off is largely in favour of adding the indexes right now, especially since we do not know what that magic number (20,000?) really is. It could be larger, or smaller.
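For completeness, a minimal sketch of creating such indexes up front (table and column names are hypothetical):

-- Sketch only: add the indexes now, so queries never pass through a slow
-- full-table-scan phase as the table grows towards 4 million rows.
CREATE INDEX orders_customer_idx ON orders (customer_id);
CREATE INDEX orders_status_idx   ON orders (status, order_date);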

Populating a star-schema from a single staging table.

What is the best way of populating a star schema from a single staging table?
The data load is in the millions of rows and the star schema is one fact table with 10 associated dimension tables.
Scenario 1: doing sequential inserts into the dimensions first and afterwards one big insert into the fact table, where I join the staging table with the updated dimension tables. My biggest concern here is the locking that might occur due to the concurrent inserts into the dimension/fact tables, given the huge amount of data.
Scenario 2: splitting the data load into smaller batches (10k rows) and looping through the entire staging table, inserting the batches in the same manner as described in Scenario 1. The problem I see here is looping through a big table with cursors. Plus, if one of the batches fails to insert its data, I would need to roll back the changes for all of the inserts done previously.
Scenario 3: writing one big INSERT ALL statement and locking the star schema for the whole duration of the insert. In addition to the locking problems, I would have a complex insert statement that has to hold all of the business logic for the inserts (a nightmare to debug and maintain).
You can try DBMS_PARALLEL_EXECUTE in 11g Release 2(!)
http://docs.oracle.com/cd/E11882_01/appdev.112/e25788/d_parallel_ex.htm#ARPLS233
It works fine for splitting a big table into smaller chunks, and it lets you define the degree of parallelism very easily. Don't use parallel hints or INSERT /*+ APPEND */ inside the processing of the chunks, though.
Your assumption that you can load the dimension tables without problems seems too optimistic to me. In my experience you must cater for the situation where not all the information needed for the dimension data is valid at load time.
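A minimal sketch of the DBMS_PARALLEL_EXECUTE approach mentioned above, with hypothetical staging, dimension and fact table names:

-- Sketch only: chunk the staging table by ROWID and run the fact insert
-- per chunk with a chosen degree of parallelism.
DECLARE
  l_task VARCHAR2(30) := 'load_fact_from_staging';
  l_sql  CLOB;
BEGIN
  DBMS_PARALLEL_EXECUTE.CREATE_TASK(task_name => l_task);

  DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID(
    task_name   => l_task,
    table_owner => USER,
    table_name  => 'STAGING_SALES',   -- hypothetical staging table
    by_row      => TRUE,
    chunk_size  => 10000);

  l_sql := 'INSERT INTO fact_sales (customer_key, product_key, amount)
            SELECT d1.customer_key, d2.product_key, s.amount
            FROM   staging_sales s
            JOIN   dim_customer  d1 ON d1.customer_id = s.customer_id
            JOIN   dim_product   d2 ON d2.product_id  = s.product_id
            WHERE  s.rowid BETWEEN :start_id AND :end_id';

  DBMS_PARALLEL_EXECUTE.RUN_TASK(
    task_name      => l_task,
    sql_stmt       => l_sql,
    language_flag  => DBMS_SQL.NATIVE,
    parallel_level => 4);

  DBMS_PARALLEL_EXECUTE.DROP_TASK(l_task);
END;
/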

Oracle: does a table with unused columns impact performance?

I have a table in my Oracle DB with 100 columns. 50 of the columns in this table are not used by the program accessing it (i.e. the SELECT queries select only the relevant columns and do NOT use '*').
My question is this:
If I recreate the same table with only the columns I need, will it improve query performance for the same queries I used against the original table (remember that only the relevant columns are selected)?
It is well worth mentioning that the program runs these queries a fair number of times per second!
P.S.:
This is an existing project I am working on and the table design was made a long time ago for other products as well (that's why we have unused columns now).
So the effect of this will be that the average row is smaller, assuming the extra columns contain data that will no longer be in the table. Therefore the table can be smaller, and not only will it use less space on disk, it will use less memory in the SGA, and caching will be more efficient.
Therefore, if you access the table via a full table scan it will be faster to read the segment, but if you use index-based access paths then the only performance improvement is likely to come from an improved chance of finding the block in the cache.
[Edited]
This SO thread suggests "it always pulls a tuple...". Hence, you are likely to see some performance improvement, though whether it is major or minor is unclear, as already mentioned.
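If you do decide to rebuild the table with only the columns the program uses, a rough sketch (all names hypothetical) is a simple CTAS followed by a rename:

-- Sketch only: copy just the used columns into a narrower table.
CREATE TABLE customer_slim AS
SELECT customer_id, cust_name, status, created_date   -- only the columns the program uses
FROM   customer;

-- After validating, swap the names; indexes, constraints and grants
-- must be recreated on the new table separately.
ALTER TABLE customer RENAME TO customer_old;
ALTER TABLE customer_slim RENAME TO customer;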
