Combine Multiple Hive Tables as single table in Hadoop - hadoop

Hi I have multiple Hive tables around 15-20 tables. All the tables will be common schema . I Need to combine all the tables as single table.The single table should be queried from reporting tool, So performance is also needs to be care..
I tried like this..
create table new as
select * from table_a
union all
select * from table_b
Is there any other way to combine all the tables more efficient. Any help will be appreciated.

Hive would be processing in parallel if you set "hive.exec.parallel" as true. With "hive.exec.parallel.thread.number" you can specify the number of parallel threads. This would increase the overall efficiency.

If you are trying to merge table_A and table_b into a single one, the easiest way is to use the UNION ALL operator. You can find the syntax and use cases here - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union

Related

Oracle `partition_by` in select clauses, does it create these partitions permantly?

I only have a superficial understanding on partitions in Oracle, but, I know you can create persistent partitions on Oracle, for example within a create table statement, but, when using partition by clauses within a select statement? Will Oracle create a persistent partition, for caching reasons or whatever, or will the partition be "temporary" in some sense (e.g., it will be removed at the end of the session, the query, or after some time...)?
For example, for a query like
SELECT col1, first_value(col2)
over (partition by col3 order by col2 nulls last) as colx
FROM tbl
If I execute that query, will Oracle create a partition to speed up the execution if I execute it again, tomorrow or three months later? I'm worry about that because I don't know if it could cause memory exhaustion if I abuse that feature.
partition by is used in the query(windows function) to fetch the aggregated result using the windows function which is grouped by the columns mentioned in the partition by. It behaves like group by but has ability to provide grouped result for each row without actually grouping the final outcome.
It has nothing to do with table/index partition.
scope of this partition by is just this query and have no impact on table structure.

How to populate columns of a new hive table from multiple existing tables?

I have created a new table in hive (T1) with columns c1,c2,c3,c4. I want to populate data into this table by querying from other existing tables(T2,T3).
E.g c1 and c2 come from a query run on T2 & the other columns c3 and c4 come from a query run on T3.
Is this possible in hive ? I have done immense research but still am unable to find a solution to this
Didn't something like this work?
create table T1 as
select t2.c1, t2.c2, t3.c3, t3.c4 from (some query against T2) t2 JOIN (some query against T3) t3
Obviously replace JOIN with whatever is needed. I assume some join between T2 and T3 is possible or else you wouldn't be putting their columns alongside each other in T1.
According to the hive documentation, you can use the following syntax to insert data:
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Be careful that:
Values must be provided for every column in the table. The standard SQL syntax that allows the user to insert values into only some columns is not yet supported. To mimic the standard SQL, nulls can be provided for columns the user does not wish to assign a value to.
So, I would make a JOIN between the two existing table, and then insert only the needed values in the target table playing around with SELECT. Or maybe creating a temporary table would allow you to have more control over the data. Just remember to handle the problem with NULL, as stated in the official documentation. This is just an idea, I guess there are other ways to achieve what you need, but could be a good place to start from.

How to use Oracle Materialzed View in a Dimensional Model

I have a dimensional model with a large fact table (millions of rows) which is range partitioned by date and smaller dimensional tables that are not partitioned. I came across materialized views which is often used in these scenarios to improve query performance.
Now, I want to know which way is better of the following two to utilize these materialized views to get aggregated reports.
A. Create one with the by joining the whole fact table with each of the dimension tables required.
create materialized view my_mview execute immediate query rewrite
select
fact.col1, dim1.col2, dim2.col3, sum(fact.col4)
from
my_fact fact
inner join
my_dim1 dim1
on fact.dim1_key = dim1.dim1_key
inner join
my_dim2 dim2
on fact.dim2_key = dim2.dim2_key group by fact.col1, dim1.col2, dim2.col3
This seems like the most basic way of using them. But it seems
rather limiting and I would require a new materialzed view for each
variation of the query I want to create.
B. Create it over the aggregation of the fact table and utilize the query rewrite when doing a dimensional join back.
create materialized view my_mview execute immediate query rewrite
select
col1, dim1.dim2_key, dim2.dim_key, sum(fact.col4)
from
my_fact fact
And do the join as above in case A, which will use this aggregated materialzed view for the join and not the whole fact table.
Can anyone tell me when I would use each case or the other?
Your first example works exactly as you described.
For the second example the query should be:
create materialized view my_mview execute immediate query rewrite
select
col1, fact.dim2_key, fact.dim_key, sum(fact.col4)
from
my_fact fact
group by
col1, fact.dim2_key, fact.dim_key
This will automatically speed up aggregates such as
select sum(fact.col4)
from fact
select fact.dim_key,sum(fact.col4)
from fact
group by fact.dim_key
select fact.dim2_key,sum(fact.col4)
from fact
group by fact.dim2_key
I don't think Oracle will rewrite your first type of query to this MV automatically because in the MV the join columns are already grouped by (They also should be grouped in your second example). It never happened for us. This however may also depend on if there are relationships defined between dim and fact table and the value of QUERY_REWRITE_INTEGRITY parameter, so there is still some room for testing here.
You may still get a performance gain by writing a query in a specific way
WITH preaggr as (
select
col1, fact.dim2_key, fact.dim_key, sum(fact.col4)
from
my_fact fact
group by
col1, fact.dim2_key, fact.dim_key
)
select
dim2.col1,
sum(preaggr.col4)
from
preaggr
join
dim2
on
preaggr.dim2_key = fact.dim2_key
group by
dim2.col1

Hive: Fast concatenate two tables into one?

I have two Hive tables of the same structure (schema). What would be an efficient SQL request to concatenate them into a single table with the same structure?
Update, this works quite fast in my case:
CREATE TABLE xy AS SELECT *
FROM (
SELECT *
FROM x
UNION ALL
SELECT *
FROM y
) tmp;
If you are trying to merge table_A and table_b into a single one, the easiest way is to use the UNION ALL operator. You can find the syntax and use cases here - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union
"union all" is a right solution but might be expensive, resource/time wise. I'd recommend creating a table with two partitions, one for table A and another for Table B. This way, no need to merge (or union all). The merged table is available as soon as both partitions get populated.

Querying multiple Oracle databases performance issue

I have over million records in these tables in both the databases.
I am trying to figure out data in both the tables acros databases.
SELECT COUNT(*) FROM DB1.MYTABLE WHERE SEQ_NO NOT IN(SELECT SEQ_NO FROM DB2.MYTABLE) AND FILENAME NOT LIKE '%{%'
and PT_TYPE NOT IN(15,24,268,284,285,286,12,17,9,290,214,73) AND STTS=1
The query is taking ages. Is there any way I can make it fast?
Appreciate your help in advance
Do you actually mean different databases? Or do you mean different schemas? You talk about different databases but the syntax appears to be using tables in two different schemas, not two different databases. I don't see any references to a database link which would be needed if there were two different databases but perhaps DB2.MYTABLE is supposed to be a synonym for MYTABLE#DB2.
It would be helpful if you could post the query plan that is generated. It would also be useful to indicate what indexes exist and how selective each of these predicates is. My guess is that modifying the query to be
SELECT count(*)
FROM schema1.mytable a
WHERE NOT EXISTS (
SELECT 1
FROM schema2.mytable b
WHERE a.seq_no = b.seq_no )
AND a.filename NOT LIKE '%{%'
AND a.pt_type NOT IN (15,24,268,284,285,286,12,17,9,290,214,73)
AND a.stts = 1
might be more efficient if most of the rows in SCHEMA1.MYTABLE are eliminated because the SEQ_NO exists in SCHEMA2.MYTABLE.

Resources