Update delta records in hive table - hadoop

I have a table with history data which is more than a TB size and I would be receiving delta (updated info) records on daily basis which will be in GB size and stored in delta table. Now I want to compare the delta records with the history records and update the History table with the latest data from Delta table.
What is the best approach to do this in Hive since I would be dealing with millions of rows. I have searched the web and found the below approach.
http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive
But I don't think it would a be best approach in the aspect of performance.

In Latest hive (0.14), you can do updates. You need to keep the table in ORC format and bucket by the searching key.
Oh, and I need to add this link for more information:
Hive Transactions
In addition:
Do you have a good partitioning key so that the updates will only have to work on latest partitions? it can be good to do the following:
get data from required partitions to a temp table (T1)
let's say T2 is the new table with update records. need to be partitioned the same way as T1
Join T1 and T2 with key(s) and take the ones only present in T1 and not in T2. Let's say this table is T3
Union T2 and T3 to create table T4
Drop the previously taken partitions from T1
Insert T4 into T1
Remember, the operations may not be atomic and during the time step 5 and 6 happens, any query running on T1 can have intermediate results.

Related

indexed view vs temp table to improve performance of a seldom executed query

i have a slow query whose structure is
select
fields
from
table
join
manytables
join (select fields from tables) as V1 on V1 on V1.field = ....
join (select fields1 from othertables) as V2
join (select fields2 from moretables) as V3
The select subqueries in the last 3 joins are relatively simple but joins agains the, take time. If they were phisical tables it would be much better.
So i found out that i could turn the subqueries to indexed views or to temp tables.
By temp table i do not mean a table who is written hourly like explained here,
but a temp table who is created before the query execution
Now my doubt comes from the fact that indexed views are ok in datawarehouses since they impact the performance. This db is not a datawarehouse but a production db of a non data intense application.
But in my case the above query is executed not often, even if the underlaying tables (the tables whose data would become part of the indexed view) are used more often.
In this case is it ok to use indexed views? Or shuold i favor temp table?
Also table variable with primary key keyword is an alternative.

Oracle Partition pruning not happening

I have a fact table with millions of records. The table is range partitioned on a date column.
FACT_AUM (ACCOUNT_ID VARCHAR2(30),MARKET_VALUE NUMBER(20,6), POSTING_DATE DATE);
I have another temp table
ACCOUNT_TMP (ACCOUNT_ID VARCHAR2(30), POSTING_DATE DATE);
When I run this query by hard coding the date I see partition pruning happens and the results come back quickly
SELECT A.ACCOUNT_ID, SUM(A.MARKET_VALUE) FROM
FACT_AUM A JOIN ACCOUNT_TMP B ON A.ACCOUNT_ID = B.ACCOUNT_ID
AND A.POSTING_DATE=TO_DATE('30-DEC-2016',DD-MON-YYYY') GROUP BY
A.ACCOUNT_ID;
when I run the following, I don't see partition pruning and the query keeps spinning
SELECT A.ACCOUNT_ID, SUM(A.MARKET_VALUE) FROM
FACT_AUM A JOIN ACCOUNT_TMP B ON A.ACCOUNT_ID = B.ACCOUNT_ID
AND A.POSTING_DATE = B.POSTING_DATE GROUP BY
A.ACCOUNT_ID;
Any insights on this would be helpful.
Oracle used partition pruning while you hard coded the value, because Oracle felt it would get benefit of doing the partition pruning there.
When you joined the fact table with your temporary ( i would reword it to staging) table, Oracle wouldn't be able to guess which all partitions would it have to hit for computing the answer. Please note Oracle will assess what would be the range of values available in the staging table.
But unless you provide stats of the tables involved, i couldn't dwell into more important topics of the table ordering and tables joins. For quick fix use an Order hint or nested loop hint.

Delete rows from partition table - Best way

I want to delete around 1 million records from a table which is partitioned and table size is around 10-13 millions , As of now only 2 partition exist in the table containining July month data and august month data, and i want to delete from July month.Can you please let me know if a simple delete from table paritition (0715) is ok to do ? Possibilities of fragmentation ? or any best way out?
Thank you
DELETE is rather costly operation on large partitioned tables (but 10M is not realy large). Typically you try to avoid it and remove the data partition-wise using drop partition.
The simplest schema is rolling window, where you define a range partitioning schema by dropping the oldest partitian after the retention interval.
If you need more controll you may use CTAS and exchange back approach.
Instead of deleting a large part of a partition create a copy of it
create table TMP as
select * from TAB PARTITION (ppp)
where <predicate to filter out records to be ommited for partition ppp>
Create indexes on the TMP table in the same structure as the LOCAL indexes of the partitioned table.
Than exchange the temporary table with the partition
ALTER TABLE TAB
EXCHANGE PARTITION ppp WITH TABLE TMP including indexes
WITHOUT VALIDATION
Note no fragmenatation as a result, in contrary you may use it to reorganize the partition data (e.g. with ORDER BY in CTAS or with COMPRESS etc.)
You can delete truncate the partition from the given table. Delete also you can perform if you want to delete few rows from the partition. Plz share your table structure along with the partition details so that it will be easy for people here to assist you.

Optimizing a delete... where query with rownum

I'm working with an application that has a large amount of outdated data clogging up a table in my databank. Ideally, I'd want to delete all entries in the table whose reference date is too old:
delete outdatedTable where referenceDate < :deletionCutoffDate
If this statement were to be run, it would take ages to complete, so I'd rather break it up into chunks with the following:
delete outdatedTable where referenceData < :deletionCutoffDate and rownum <= 10000
In testing, this works suprisingly slowly. The following query, however, runs dramatically faster:
delete outdatedTable where rownum <= 10000
I've been reading through multiple blogs and similar questions on StackOverflow, but I haven't yet found a straightforward description of how/whether using rownum affects the Oracle optimizer when there are other Where clauses in the query. In my case, it seems to me as if Oracle checks
referenceData < :deletionCutoffDate
on every single row, executes a massive Select on all matching rows, and only then filters out the top 10000 rows to return. Is this in fact the case? If so, is there any clever way to make Oracle stop checking the Where clause as soon as it's found enough matching rows?
How about a different approach without so much DML on the table. As a permanent solution for future you could go for table partitioning.
Create a new table with required partition(s).
Move ONLY the required rows from your existing table to the new partitioned table.
Once the new table is populated, add the required constraints and indexes.
Drop the old table.
In future, you would just need to DROP the old partitions.
CTAS(create table as select) is another way, however, if you want to have a new table with partition, you would have to go for exchange partition concept.
First of all, you should read about SQL statement's execution plan and learn how to explain in. It will help you to find answers on such questions.
Generally, one single delete is more effective than several chunked. It's main disadvantage is extremal using of undo tablespace.
If you wish to delete most rows of table, much faster way usially a trick:
create table new_table as select * from old_table where date >= :date_limit;
drop table old_table;
rename table new_table to old_table;
... recreate indexes and other stuff ...
If you wish to do it more than once, partitioning is a much better way. If table partitioned by date, you can select actual date quickly and you can drop partion with outdated data in milliseconds.
At last, paritioning if a way to dismiss 'deleting outdated records' at all. Sometimes we need old data, and it's sad if we delete it by own hands. With paritioning you can archive outdated partitions outside of the database, but connects them when you need to access old data.
This is an old request, but I'd like to show another approach (also using partitions).
Depending on what you consider old, you could create corresponding partitions (optimally exactly two; one current, one old; but you could just as well make more), e.g.:
PARTITION BY LIST ( mod(referenceDate,2) )
(
PARTITION year_odd VALUES (1),
PARTITION year_even VALUES (0)
);
This could as well be months (Jan, Feb, ... Dec), decades (XX0X, XX1X, ... XX9X), half years (first_half, second_half), etc. Anything circular.
Then whenever you want to get rid of old data, truncate:
ALTER TABLE mytable TRUNCATE PARTITION year_even;
delete from your_table
where PK not in
(select PK from your_table where rounum<=...) -- these records you want to leave

Wrong index is chosen by Oracle

I have a problem in indexing in Oracle. Will try to explain my problem with an instance as follows.
I have a table TABLE1 with columns A,B,C,D
another table TABLE2 with columns A,B,C,E,F,H
I have created Indexes for TABLE1
IX_1 A
IX_2 A,B
IX_3 A,C
IX_4 A,B,C
I have created Indexes for TABLE1
IY_1 A,B,C
IY_2 A
when i gave query similar to this
SELECT * FROM TABLE1 T1,TABLE2 T2
WHERE T1.A=T2.A
When i give Explain Plan i got its not getting IX_1 nor IY_2
Its taking IX_4 nor IY_1
why this is not picking right index?
EDITED:
Can anyone help me to know difference between INDEX RANGE SCAN,INDEX UNIQUE SCAN, INDEX SKIP SCAN
I guess SKIP SCAN means when a column is skipped in Composite Index by Oracle
what about others i dont have idea!
The best benefit of indexes is that you can select a few rows from a table without scanning the entire table.
If you ask for too many rows(let's say 30% - depends of many things) the engine will prefer to scan the entire table for those rows.
That's because reading a row using an index is gets an overhead : reading some index blocks, and after that reading table blocks.
In your case, in order to join tables T1 and T2, Oracle needs all the rows from those table. Reading(full) the index will be an unsefull operation, adding unnecesary cost.
UPDATE: A step forward: if you run:
SELECT T1.B, T2.B FROM TABLE1 T1,TABLE2 T2
WHERE T1.A=T2.A
Oracle probably will use the indexes(IX2, IY2), because it does not need to read anything from table, because the values T1.B, T2.B, are in indexes.

Resources