Hive: Fast concatenate two tables into one? - hadoop

I have two Hive tables with the same structure (schema). What would be an efficient query to concatenate them into a single table with the same structure?
Update, this works quite fast in my case:
CREATE TABLE xy AS SELECT *
FROM (
SELECT *
FROM x
UNION ALL
SELECT *
FROM y
) tmp;

If you are trying to merge table_A and table_b into a single one, the easiest way is to use the UNION ALL operator. You can find the syntax and use cases here - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union

UNION ALL is a correct solution, but it can be expensive in time and resources. I'd recommend creating a table with two partitions, one for table A and another for table B. That way there is no need to merge (or UNION ALL): the merged table is available as soon as both partitions are populated.
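The partition idea above could be sketched roughly as follows. The column list (id, val), the partition column name src, and its values are assumptions made for illustration; the real columns of x and y should be substituted.

```sql
-- Hypothetical schema: replace (id BIGINT, val STRING) with the real columns of x and y.
CREATE TABLE xy (id BIGINT, val STRING)
PARTITIONED BY (src STRING);

-- Each source table fills its own partition; no UNION ALL is involved.
INSERT OVERWRITE TABLE xy PARTITION (src = 'x') SELECT id, val FROM x;
INSERT OVERWRITE TABLE xy PARTITION (src = 'y') SELECT id, val FROM y;

-- Queries against xy read both partitions as one table.
SELECT COUNT(*) FROM xy;
```

Note that the partition column shows up as an extra column in xy, so the result is not byte-for-byte the same structure as x and y.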

Related

inserting records from two different tables into a single table in oracle

I want to insert data from two different tables (say table A and table B) into a third table (table C) in Oracle.
I have written two different cursors for fetching data from tables A and B separately, and populated two collections based on these two tables.
Now I want to insert the data in those two collections into the third table (table C). How can I get this done?
There are two common columns present in both tables, for example ID and YEARMONTH; these two columns exist in all three tables (A, B and C).
I have tried doing a merge based on these two fields,
but I am looking for a more efficient and convenient way to do this.
You didn't provide code you wrote, so I'll guess: cursors mean PL/SQL. If you're doing it in a loop, row-by-row, it'll be slow-by-slow.
As there are common columns in both tables (A and B), I'd suggest doing it in pure SQL: join those two tables and insert the result into C. Something like
insert into c (id, yearmonth, ...)
select a.id, a.yearmonth, ...
from a join b on a.id = b.id and a.yearmonth = b.yearmonth;
Make sure that indexes exist on columns you use to join tables. Or, even better, compare explain plans in both cases (with and without indexes) and choose an option which seems to be the best.
insert into tableC
select * from tableA where ...
union
select * from tableB where ...

Combine Multiple Hive Tables as single table in Hadoop

Hi, I have multiple Hive tables, around 15-20. All the tables share a common schema. I need to combine them into a single table. The single table will be queried from a reporting tool, so performance also needs to be taken care of.
I tried this:
create table new as
select * from table_a
union all
select * from table_b
Is there a more efficient way to combine all the tables? Any help will be appreciated.
Hive runs the independent stages of a query in parallel if you set "hive.exec.parallel" to true. With "hive.exec.parallel.thread.number" you can specify the maximum number of stages that run in parallel. This can improve overall efficiency.
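As a rough sketch, the two settings mentioned above can be applied per session before running the combining query. The thread count of 8 and the target table name combined are only example values, not recommendations.

```sql
-- Allow independent stages of one query to run concurrently.
SET hive.exec.parallel=true;
-- Upper bound on concurrently running stages (8 is an example value).
SET hive.exec.parallel.thread.number=8;

-- "combined" is a hypothetical target name; add one UNION ALL branch per source table.
CREATE TABLE combined AS
SELECT *
FROM (
  SELECT * FROM table_a
  UNION ALL
  SELECT * FROM table_b
) tmp;
```

Whether this helps depends on the cluster having spare capacity to run the branches at the same time.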
If you are trying to merge table_A and table_b into a single one, the easiest way is to use the UNION ALL operator. You can find the syntax and use cases here - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union

How to use Oracle Materialized View in a Dimensional Model

I have a dimensional model with a large fact table (millions of rows), which is range partitioned by date, and smaller dimension tables that are not partitioned. I came across materialized views, which are often used in these scenarios to improve query performance.
Now I want to know which of the following two ways of using materialized views is better for producing aggregated reports.
A. Create one by joining the whole fact table with each of the required dimension tables.
create materialized view my_mview build immediate enable query rewrite as
select
fact.col1, dim1.col2, dim2.col3, sum(fact.col4)
from
my_fact fact
inner join
my_dim1 dim1
on fact.dim1_key = dim1.dim1_key
inner join
my_dim2 dim2
on fact.dim2_key = dim2.dim2_key group by fact.col1, dim1.col2, dim2.col3
This seems like the most basic way of using them. But it seems
rather limiting, and I would need a new materialized view for each
variation of the query I want to create.
B. Create it over the aggregation of the fact table and utilize the query rewrite when doing a dimensional join back.
create materialized view my_mview build immediate enable query rewrite as
select
col1, dim1.dim2_key, dim2.dim_key, sum(fact.col4)
from
my_fact fact
Then do the join as in case A, which will use this aggregated materialized view for the join instead of the whole fact table.
Can anyone tell me when I would use each case or the other?
Your first example works exactly as you described.
For the second example the query should be:
create materialized view my_mview build immediate enable query rewrite as
select
col1, fact.dim2_key, fact.dim_key, sum(fact.col4)
from
my_fact fact
group by
col1, fact.dim2_key, fact.dim_key
This will automatically speed up aggregates such as
select sum(fact.col4)
from my_fact fact;
select fact.dim_key,sum(fact.col4)
from my_fact fact
group by fact.dim_key;
select fact.dim2_key,sum(fact.col4)
from my_fact fact
group by fact.dim2_key;
I don't think Oracle will rewrite your first type of query to this MV automatically, because in the MV the join columns are already grouped (they should also be grouped in your second example). It never happened for us. This, however, may also depend on whether relationships are defined between the dimension and fact tables and on the value of the QUERY_REWRITE_INTEGRITY parameter, so there is still some room for testing here.
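One way to run that test is to adjust the rewrite settings for the session and inspect the execution plan. The sketch below assumes the aggregated MV from above exists and uses TRUSTED only as an example value; the parameter also accepts ENFORCED and STALE_TOLERATED.

```sql
-- Session-level settings that influence query rewrite.
ALTER SESSION SET query_rewrite_enabled = TRUE;
ALTER SESSION SET query_rewrite_integrity = TRUSTED;

-- Ask the optimizer how it would run one of the aggregate queries.
EXPLAIN PLAN FOR
SELECT fact.dim2_key, SUM(fact.col4)
FROM my_fact fact
GROUP BY fact.dim2_key;

-- Show the plan that was just generated.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

If the plan contains a MAT_VIEW REWRITE ACCESS operation on my_mview, the query was rewritten to use the materialized view; if it shows an access on my_fact, it was not.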
You may still get a performance gain by writing a query in a specific way
WITH preaggr as (
select
col1, fact.dim2_key, fact.dim_key, sum(fact.col4) as col4
from
my_fact fact
group by
col1, fact.dim2_key, fact.dim_key
)
select
dim2.col1,
sum(preaggr.col4)
from
preaggr
join
my_dim2 dim2
on
preaggr.dim2_key = dim2.dim2_key
group by
dim2.col1

How to select row data as column in Oracle

I have two tables, as shown in the figures below.
I need to select records as shown in the figure below. AH_ID needs to be joined to the second table, ATT_ID should become the column header, and ATT_DTL_STR_VALUE should be the value of that column.
Required output
Sounds like you have an Entity-Attribute-Value data model, which relational databases aren't the best at modeling. You may want to look into a key-value store.
However, as Justin suggested, if you're using 11g you can use the PIVOT clause as follows:
SELECT *
FROM (
SELECT T1.AH_ID, T1.AH_DESCRIPTION, T2.ATT_ID, T2.ATT_DTL_STR_VALUE
FROM T1
LEFT OUTER JOIN T2 ON T1.AH_ID = T2.AH_ID
)
PIVOT (MAX(ATT_DTL_STR_VALUE) FOR (ATT_ID) IN (1));
This statement requires you to hard-code the ATT_ID values; however, there are ways to do it dynamically. More info can be found here.
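For more than one attribute, the hard-coded list can simply be extended, and each value can be given a column alias. In the sketch below, the ATT_ID values 2 and 3 and the aliases ATT_1 to ATT_3 are hypothetical; the real IDs from T2 would go there.

```sql
SELECT *
FROM (
  SELECT T1.AH_ID, T1.AH_DESCRIPTION, T2.ATT_ID, T2.ATT_DTL_STR_VALUE
  FROM T1
  LEFT OUTER JOIN T2 ON T1.AH_ID = T2.AH_ID
)
PIVOT (
  MAX(ATT_DTL_STR_VALUE)
  -- One output column per listed ATT_ID; the aliases become the column headers.
  FOR ATT_ID IN (1 AS ATT_1, 2 AS ATT_2, 3 AS ATT_3)
);
```

Rows of T1 with no matching attribute still appear, with NULL in the pivoted columns, because of the LEFT OUTER JOIN.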

Wrong index is chosen by Oracle

I have a problem with indexing in Oracle. I will try to explain it with an example.
I have a table TABLE1 with columns A,B,C,D
another table TABLE2 with columns A,B,C,E,F,H
I have created Indexes for TABLE1
IX_1 A
IX_2 A,B
IX_3 A,C
IX_4 A,B,C
I have created indexes for TABLE2
IY_1 A,B,C
IY_2 A
When I ran a query similar to this:
SELECT * FROM TABLE1 T1,TABLE2 T2
WHERE T1.A=T2.A
the explain plan shows that it uses neither IX_1 nor IY_2.
It picks IX_4 and IY_1 instead.
Why is it not picking the right index?
EDITED:
Can anyone help me understand the difference between INDEX RANGE SCAN, INDEX UNIQUE SCAN, and INDEX SKIP SCAN?
I guess SKIP SCAN means that Oracle skips a column of a composite index.
I have no idea about the others.
The main benefit of an index is that you can select a few rows from a table without scanning the entire table.
If you ask for too many rows (say 30%, though it depends on many things), the engine will prefer to scan the entire table.
That's because reading a row through an index incurs overhead: some index blocks are read first, and then the table blocks.
In your case, in order to join tables T1 and T2, Oracle needs all the rows from both tables. Reading the full index would be a useless operation that only adds unnecessary cost.
UPDATE: A step forward: if you run:
SELECT T1.B, T2.B FROM TABLE1 T1,TABLE2 T2
WHERE T1.A=T2.A
Oracle will probably use the indexes (IX_2 and IY_1), because it does not need to read anything from the tables: the values T1.B and T2.B are already in those indexes.
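Whether the optimizer actually switches to index-only access can be checked by generating the plan for the narrower query. This is only a sketch; the outcome depends on table statistics and the Oracle version.

```sql
-- Generate the plan for the query that needs only indexed columns.
EXPLAIN PLAN FOR
SELECT T1.B, T2.B
FROM TABLE1 T1, TABLE2 T2
WHERE T1.A = T2.A;

-- Display it; look at which index operations appear and
-- whether any TABLE ACCESS steps remain.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

Running the same two statements for the original SELECT * query gives a plan to compare against.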
