Hive Joins query - hadoop

I have two tables in hive:
Table 1:
1,Nail,maher,24,6.2
2,finn,egan,23,5.9
3,Hadm,Sha,28,6.0
4,bob,hope,55,7.2
Table 2 :
1,Nail,maher,24,6.2
2,finn,egan,23,5.9
3,Hadm,Sha,28,6.0
4,bob,hope,55,7.2
5,john,hill,22,5.5
6,todger,hommy,11,2.2
7,jim,cnt,99,9.9
8,will,hats,43,11.2
Is there any way in Hive to retrieve the new data in table 2 that doesn't exist in table 1??
In other Databases tools, you would use a inner left/right. But inner left/right doesn't exist in Hive and suggestions how this could be achieved?

If you are using Hive version >= 0.13 you can use this query:
SELECT * FROM A WHERE A.firstname, A.lastname ... IN (SELECT B.firstname, B.lastname ... FROM B);
But I'm not sure if Hive supports multiple coloumns in the IN clause.
If not something like this could work:
SELECT * FROM A WHERE A.firstname IN (SELECT B.firstname FROM B) AND A.lastname IN (SELECT b.lastname FROM B) ...;

It might be wiser to concatenate the fields together before testing for NOT IN:
SELECT *
FROM t2
WHERE CONCAT(t2.firstname, t2.lastname, CAST(t2.val1 as STRING), CAST(t2.val2 as STRING)) NOT IN
(SELECT CONCAT(t2.firstname, t2.lastname, CAST(t2.val1 as STRING), CAST(t2.val2 as STRING))
FROM t1)
Performing sequential NOT IN sub-queries may give you erroneous results.
From the above example, a new record with the values ('nail','egan',28, 7.2) would not show up as new with sequential NOT IN statements.

Related

How can I solve this hive sql problem? (Like join in Hive)

I know Hive only provide equi join. For example, below sql statement.
select *
from A join B
on A.c1 = B.c2
where 1=1;
But I want to execute Like join Query in Hive. For example, below sql statement.
select *
from A join B
on A.c1 like B.c2
where 1=1;
Please let me know if you know the solution in Hive.
How about -
ON A.c1 LIKE concat('%',B.c2,'%')
concat will concatenate % to the c2 data so like operator will work properly.
whole sql will be like -
select * from A join B on A.c1 LIKE concat('%',B.c2,'%') where 1=1;
version - hive 2.1.1

Oracle Performance issues on using subquery in an "In" orperator

I have two query that looks close to the same but Oracle have very different performance.
Query A
Create Table T1 as Select * from FinalView1 where CustomerID in ('A0000001','A000002')
Query B
Create Table T1 as Select * from FinalView1 where CustomerID in (select distinct CustomerID from CriteriaTable)
The CriteriaTable have 800 rows but all belongs to Customer ID 'A0000001' and 'A000002'.
This means the subquery: "select distinct CustomerID from CriteriaTable" also only returns the same two elements('A0000001','A000002') as manually entered in query A
Following is the query under the FinalView1
create or replace view FinalView1_20200716 as
select
Customer_ID,
<Some columns>
from
Table1_20200716 T1
INNER join Table2_20200716 T2 on
T1.Invoice_number = T2.Invoice_number
and
T1.line_id = T2.line_id
left join Table3_20200716 T3 on
T3.id = T1.Customer_ID
left join Table4_20200716 T4 on
T4.Shipping_ID = T1.Shipping_ID
left join Table5_20200716 Table5 on
Table5.Invoice_ID = T1.Invoice_ID
left join Table6_20200716 T6 on
T6.Shipping_ID = T4.Shipping_ID
left join First_Order first on
first.Shipping_ID = T1.Shipping_ID
;
Table1_20200716,Table2_20200716,Table3_20200716,Table4_20200716,Table5_20200716,Table6_20200716 are views to the corresponding table with temporal validity feature. For example
The query under Table1_20200716
Create or replace view Table1_20200716 as
select
*
from Table1 as for period of to_date('20200716,'yyyymmdd')
However table "First_Order" is just a normal table as
Following is the performance for both queries (According to explain plan):
Query A:
Cardinality: 102
Cost : 204
Total Runtime: 5 secs max
Query B:
Cardinality:27921981
Cost: 14846
Total Runtime:20 mins until user cancelled
All tables are indexed using those columns that used to join against other tables in the FinalView1. According to the explain plan, they have all been used except for the FirstOrder table.
Query A used uniquue index on the FirstOrder Table while Query B performed a full scan.
For query B, I was expecting the Oracle will firstly query the sub-query get the result into the in operator, before executing the main query and therefore should only have minor impact to the performance.
Thanks in advance!
As mentioned from my comment 2 days ago. Someone have actually posted the solution and then have it removed while the answer actually work. After waiting for 2 days the So I designed to post that solution.
That solution suggested that the performance was slow down by the "in" operator. and suggested me to replace it with an inner join
Create Table T1 as
Select
FV.*
from
FinalView1 FV
inner join (
select distinct
CustomerID
from
CriteriaTable
) CT on CT.customerid = FV.customerID;
Result from explain plan was worse then before:
Cardinality:28364465 (from 27921981)
Cost: 15060 (from 14846)
However, it only takes 17 secs. Which is very good!

Hash Join with Partition restriction from third table

my current problem is in 11g, but I am also interested in how this might be solved smarter in later versions.
I want to join two tables. Table A has 10 million rows, Table B is huge and has a billion of records across about a thousand partitions. One partition has around 10 million records. I am not joining on the partition key. For most rows of Table A, one or more rows in Table B will be found.
Example:
select * from table_a a
inner join table_b b on a.ref = b.ref
The above will return about 50 million rows, whereas the results come from about 30 partitions of table b. I am assuming a hash join is the correct join here, hashing table a and FTSing/index-scanning table b.
So, 970 partitions were scanned for no reason. And, I have a third query that could tell oracle which 30 partitions to check for the join.
Example of third query:
select partition_id from table_c
This query gives exactly the 30 partitions for the query above.
To my question:
In PL/SQL one can solve this by
select the 30 partition_ids into a variable (be it just a select listagg(partition_id,',') ... into v_partitions from table_c
Execute my query like so:
execute immediate 'select * from table_a a
inner join table_b b on a.ref = b.ref
where b.partition_id in ('||v_partitions||')' into ...
Let's say this completes in 10 minutes.
Now, how can I do this in the same amount of time with pure SQL?
Just simply writing
select * from table_a a
inner join table_b b on a.ref = b.ref
where b.partition_id in (select partition_id from table_c)
does not do the trick it seems, or I might be aiming at the wrong plan.
The plan I think I want is
hash join
table a
nested loop
table c
partition pruning here
table b
But, this does not come back in 10 minutes.
So, how to do this in SQL and what execution plan to aim at? One variation I have not tried yet that might be the solution is
nested loop
table c
hash join
table a
partition pruning here (pushed predicate from the join to c)
table b
Another feeling I have is that the solution might lie in joining table a to table c (not sure on what though) and then joining this result to table b.
I am not asking you to type everything out for me. Just a general concept of how to do this (getting partition restriction from a query) in SQL - what plan should I aim at?
thank you very much! Peter
I'm not an expert at this, but I think Oracle generally does the joins first, then applies the where conditions. So you might get the plan you want by moving the partition pruning up into a join condition:
select * from table_a a
inner join table_b b on a.ref = b.ref
and b.partition_id in (select partition_id from table_c);
I've also seen people try to do this sort of thing with an inline view:
select * from table_a a
inner join (select * from table_b
where partition_id in (select partition_id from table_c)) b
on a.ref = b.ref;
thank you all for your discussions with me on this one. In my case this was solved (not by me) by adding a join-path between table_c and table_a and by overloading the join conditions as below. In my case this was possible by adding column partition_id to table_a:
select * from
table_c c
JOIN table_a a ON (a.partition_id = c.partition_id)
JOIN table_b b ON (b.partition_id = c.partition_id and b.partition_id = a.partition_id and b.ref = a.ref)
And this is the plan you want:
leading(c,b,a) use_nl(c,b) swap_join_inputs(a) use_hash(a)
So you get:
hash join
table a
nested loop
table c
partition list iterator
table b

Oracle join select result

I 've got this problem:
I have a select statement, which is rather time consuming.
I have to join the result with itself.
I want to do something like this:
Select table1.*, table2.Consumption
from (heavy select statement) table1 left outer join
(same heavy statement) table2
on table1."id" = table2."id" and table1."Year" -1 = table2."Year"
I don't want to catch the same data 2 times. I would rather like to do something like table1 table2. Is this possible?
I need this for an application, which executes querys but isn't able to use create or something like this, otherwise i would store the data in a table.
You can use a common table expression (CTE) and materialize the results of the heavy select statement:
WITH heavy AS ( SELECT /*+ MATERIALIZE */ ... (heavy select statemenet) )
Select table1.*, table2.Consumption
from heavy table1 left outer join
heavy table2
on table1."id" = table2."id" and table1."Year" -1 = table2."Year"

Forcing an index in oracle

I have a query joining lots of fields. For some strange reason the index for one table is not being used at all( I use the index key clearly), instead it is doing a FULL table scan. I would like to force the index. We used to do optimizer hints in sybase. Is there a similar hint available in oracle?
For example, in sybase to join tables a, b, c and use myindex in table a, I would do :
SELECT a.*
FROM a(INDEX myindex),
b,
c
WHERE a.field1 = b.field1
AND b.field1 = c.field1
Question is how do I do this in oracle.
Thanks
Saro
Yes, there is a hint like that in Oracle. It looks something like this:
select /*+ index(a my_index) */ from a

Resources