Hive - OR condition with left outer join

I have reviewed the existing questions on SO for similar cases. Although the error may be common, I am looking for a solution to this specific case. Please refrain from marking the question as a duplicate unless it is exactly the same scenario with an accepted solution.
I have two tables.
Main table:
c1  c2  c3  c4  c5
1   2   3   4   A
Other table:
c1  c2  c3  c4  c5
1   8   5   6   B
8   2   8   9   C
8   7   3   9   C
8   7   9   4   C
5   6   7   8   D
From the other table, I should pick only the records that are unique across all columns, e.g. only the last row (5, 6, 7, 8, D).
Row 1 of the other table is rejected because its c1 value (1) is the same as the c1 value (1) in the main table; row 2 is rejected because its c2 value matches the main table's c2, and so on.
In a nutshell, no column of an output row from the other table may have the same value as the corresponding column in the main table.
I tried the query below:
select t1.* from otherTable t1
LEFT OUTER JOIN mainTable t2
ON ( t1.c1 = t2.c1 OR t1.c2 = t2.c2 OR t1.c3 = t2.c3 OR t1.c4 = t2.c4 )
Where t2.c5 is null;
However, Hive throws the following exception:
OR not supported in JOIN currently
I understand the Hive limitation, and many times I have used UNION (ALL | DISTINCT) with an inner join to work around it, but I am not able to apply the same strategy here.
Please help.
EDIT 1: I have a Hive version restriction. I can only use version 1.2.0.

You can do a Cartesian product join (an inner join with no conditions):
select t1.* from otherTable t1
,mainTable t2
WHERE t1.c1 != t2.c1 AND t1.c2 != t2.c2
AND t1.c3 != t2.c3 AND t1.c4 != t2.c4 AND t1.c5 != t2.c5;
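Assuming you have one row in mainTable, this query should be as efficient as one that uses an OUTER JOIN. With more rows in mainTable the result changes: an otherTable row is returned once for every mainTable row it differs from in all columns, not only when it differs from all of them.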
Another option is to break your proposed query into 5 different LEFT OUTER JOIN sub-queries:
select t1.* from (
  select t1.* from (
    select t1.* from (
      select t1.* from (
        select t1.* from otherTable t1
        LEFT OUTER JOIN (select distinct c1 from mainTable) t2
        ON ( t1.c1 = t2.c1 ) Where t2.c1 is null ) t1
      LEFT OUTER JOIN (select distinct c2 from mainTable) t2
      ON ( t1.c2 = t2.c2 ) Where t2.c2 is null ) t1
    LEFT OUTER JOIN (select distinct c3 from mainTable) t2
    ON ( t1.c3 = t2.c3 ) Where t2.c3 is null ) t1
  LEFT OUTER JOIN (select distinct c4 from mainTable) t2
  ON ( t1.c4 = t2.c4 ) Where t2.c4 is null ) t1
LEFT OUTER JOIN (select distinct c5 from mainTable) t2
ON ( t1.c5 = t2.c5 ) Where t2.c5 is null
;
Here, for each column, I first get the distinct values of that column from mainTable and join them with what is left of otherTable. The downside is that this makes 5 passes over mainTable, one per column. If the values in mainTable are already unique, you can remove the distinct from the subqueries.
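To check the chained anti-join idea against the sample data from the question, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the example is runnable; Hive 1.2.0 itself is not involved), and the helper name anti_join_chain is hypothetical.

```python
import sqlite3

# SQLite stands in for Hive here (assumption, for a runnable sketch only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mainTable  (c1 INT, c2 INT, c3 INT, c4 INT, c5 TEXT);
    CREATE TABLE otherTable (c1 INT, c2 INT, c3 INT, c4 INT, c5 TEXT);
    INSERT INTO mainTable  VALUES (1, 2, 3, 4, 'A');
    INSERT INTO otherTable VALUES
        (1, 8, 5, 6, 'B'), (8, 2, 8, 9, 'C'), (8, 7, 3, 9, 'C'),
        (8, 7, 9, 4, 'C'), (5, 6, 7, 8, 'D');
""")


def anti_join_chain(columns):
    """Wrap otherTable in one LEFT OUTER JOIN ... IS NULL filter per column.

    Hypothetical helper: it builds the same nesting as the hand-written query,
    with a single equality condition per join and no OR.
    """
    query = "SELECT * FROM otherTable"
    for col in columns:
        query = (
            f"SELECT t1.* FROM ({query}) t1 "
            f"LEFT OUTER JOIN (SELECT DISTINCT {col} FROM mainTable) t2 "
            f"ON t1.{col} = t2.{col} WHERE t2.{col} IS NULL"
        )
    return query


rows = conn.execute(anti_join_chain(["c1", "c2", "c3", "c4", "c5"])).fetchall()
print(rows)  # [(5, 6, 7, 8, 'D')]
```

Each level removes the rows that match mainTable on one column, so only the row that matches on no column survives.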

Related

USING multiple 'OR' conditions in JOIN component in Oracle data Integrator 12c

I want to use multiple 'OR' conditions in the JOIN component in Oracle Data Integrator 12c.
The conditions to take care of are:
Given tables T1 and T2, I need a left outer join on T1 (i.e. I need all the records from T1 for the multiple join conditions specified in the JOIN component in ODI 12c).
For example:
a. For tables T1 and T2, say the conditions are c1, c2, c3, and T1 is left outer joined to T2.
b. I want to get the data into a table, say T3, containing all records from T1 plus all records from T2 for which the conditions (c1, c2, c3) are satisfied.
Sample query:
select T1.*
from T1 LEFT OUTER JOIN T2
ON (C1 OR C2 OR C3);
Kindly help me on this at the earliest.
Thanks in advance!

You can try either query; both return all the rows from T1, whether they matched T2 on one of the columns or had no match in T2 at all.
Using UNION
SELECT DISTINCT *
FROM (
SELECT T1.*
FROM T1
LEFT OUTER JOIN T2 ON T1.day = T2.day
UNION
SELECT T1.*
FROM T1
LEFT OUTER JOIN T2 ON T1.month = T2.month
UNION
SELECT T1.*
FROM T1
LEFT OUTER JOIN T2 ON T1.yearly = T2.yearly
) as T3;
Using OR (NOTE: displaying T2 columns just to show that LEFT JOIN is working on each condition)
SELECT T1.*, T2.*
FROM T1
LEFT OUTER JOIN T2 ON
(T1.day = T2.day OR T1.month = T2.month OR T1.yearly = T2.yearly)
Sample Run
I have 4 records in T1 and 3 records in T2. The records in T1 are such that 3 rows each match exactly 1 column in T2 and the 4th row doesn't match any record in T2.
The output of both queries gives you what you need.
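The sample data itself is not shown, so the following is only a sketch with made-up rows that fit the description (4 rows in T1, 3 in T2, three single-column matches and one row with no match). It uses Python's sqlite3 module in place of Oracle, and the id column is an assumption added to make the rows easy to tell apart.

```python
import sqlite3

# Hypothetical data: SQLite stands in for Oracle, and "id" is an added column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (id INT, day INT, month INT, yearly INT);
    CREATE TABLE T2 (day INT, month INT, yearly INT);
    INSERT INTO T1 VALUES (1, 10, 0, 0), (2, 0, 20, 0), (3, 0, 0, 30), (4, 0, 0, 0);
    INSERT INTO T2 VALUES (10, 91, 92), (93, 20, 94), (95, 96, 30);
""")

# Variant 1: a single left outer join with OR in the join condition.
or_rows = conn.execute("""
    SELECT T1.* FROM T1
    LEFT OUTER JOIN T2
      ON (T1.day = T2.day OR T1.month = T2.month OR T1.yearly = T2.yearly)
    ORDER BY T1.id
""").fetchall()

# Variant 2: one left outer join per condition, combined with UNION.
union_rows = conn.execute("""
    SELECT T1.* FROM T1 LEFT OUTER JOIN T2 ON T1.day = T2.day
    UNION
    SELECT T1.* FROM T1 LEFT OUTER JOIN T2 ON T1.month = T2.month
    UNION
    SELECT T1.* FROM T1 LEFT OUTER JOIN T2 ON T1.yearly = T2.yearly
    ORDER BY id
""").fetchall()

print(or_rows == union_rows, len(or_rows))
```

On this data both variants return the same four T1 rows. Note that UNION already removes duplicates, so the outer SELECT DISTINCT in the UNION query is redundant, and that the OR form can return a T1 row more than once when it matches several T2 rows.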

Re-writing a join query

I have a question concerning Hive. Let me explain the scenario:
I am using a Hive action on Oozie; I have a query that performs successive LEFT JOINs on different tables.
The total number of rows to be inserted is about 35 million.
At first the job crashed due to lack of memory, so I set "set hive.auto.convert.join=false". The query then executed successfully, but it took 4 hours to complete.
I tried reordering the LEFT JOINs, putting the large tables at the end, but the result was the same: about 4 hours to execute.
Here is what the query looks like:
INSERT OVERWRITE TABLE final_table
SELECT
T1.Id,
T1.some_field_name,
T1.another_filed_name,
T2.also_another_filed_name,
FROM table1 T1
LEFT JOIN table2 T2 ON ( T2.Id = T1.Id ) -- T2 is the smallest table
LEFT JOIN table3 T3 ON ( T3.Id = T1.Id )
LEFT JOIN table4 T4 ON ( T4.Id = T1.Id ) -- T4 is the biggest table
So, knowing the structure of the query, is there a way to rewrite it so that I can avoid too many JOINs?
Thanks in advance.
PS: Even vectorization gave me the same timing.

Too long for a comment, will be deleted later.
(1) Your current query won't compile.
(2) You are not selecting anything from T3 and T4, which makes no sense.
(3) Changing the order of the tables is not likely to have any impact with the cost-based optimizer.
(4) Basically, I would suggest collecting statistics on the tables, specifically on the id columns, but in your case I have a feeling that id is not unique in more than one table.
Add the result of the following query to your post:
select *
, case when cnt_1 = 0 then 1 else cnt_1 end
* case when cnt_2 = 0 then 1 else cnt_2 end
* case when cnt_3 = 0 then 1 else cnt_3 end
* case when cnt_4 = 0 then 1 else cnt_4 end as product
from (select id
,count(case when tab = 1 then 1 end) as cnt_1
,count(case when tab = 2 then 1 end) as cnt_2
,count(case when tab = 3 then 1 end) as cnt_3
,count(case when tab = 4 then 1 end) as cnt_4
from ( select 1 as tab,id from table1
union all select 2 as tab,id from table2
union all select 3 as tab,id from table3
union all select 4 as tab,id from table4
) t
group by id
having greatest (cnt_1,cnt_2,cnt_3,cnt_4) >= 10
) t
order by product desc
limit 10
;
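To see why non-unique ids matter, here is a small sketch with made-up tables, run through Python's sqlite3 module instead of Hive (an assumption for the sake of a runnable example). It drops the HAVING and LIMIT thresholds of the diagnostic query and compares the per-id product of counts with the number of rows the chain of LEFT JOINs actually produces.

```python
import sqlite3

# Hypothetical tables: id 1 is duplicated in table2 and table3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INT);
    CREATE TABLE table2 (id INT);
    CREATE TABLE table3 (id INT);
    CREATE TABLE table4 (id INT);
    INSERT INTO table1 VALUES (1), (2);
    INSERT INTO table2 VALUES (1), (1);
    INSERT INTO table3 VALUES (1), (1), (1);
    INSERT INTO table4 VALUES (2);
""")

# Rows per id actually produced by the successive LEFT JOINs.
joined = conn.execute("""
    SELECT T1.id, COUNT(*)
    FROM table1 T1
    LEFT JOIN table2 T2 ON T2.id = T1.id
    LEFT JOIN table3 T3 ON T3.id = T1.id
    LEFT JOIN table4 T4 ON T4.id = T1.id
    GROUP BY T1.id
    ORDER BY T1.id
""").fetchall()

# Rows per id predicted by the product of the per-table counts (0 counts as 1).
predicted = conn.execute("""
    SELECT id,
           (CASE WHEN cnt_1 = 0 THEN 1 ELSE cnt_1 END)
         * (CASE WHEN cnt_2 = 0 THEN 1 ELSE cnt_2 END)
         * (CASE WHEN cnt_3 = 0 THEN 1 ELSE cnt_3 END)
         * (CASE WHEN cnt_4 = 0 THEN 1 ELSE cnt_4 END) AS product
    FROM (SELECT id,
                 COUNT(CASE WHEN tab = 1 THEN 1 END) AS cnt_1,
                 COUNT(CASE WHEN tab = 2 THEN 1 END) AS cnt_2,
                 COUNT(CASE WHEN tab = 3 THEN 1 END) AS cnt_3,
                 COUNT(CASE WHEN tab = 4 THEN 1 END) AS cnt_4
          FROM (SELECT 1 AS tab, id FROM table1
                UNION ALL SELECT 2 AS tab, id FROM table2
                UNION ALL SELECT 3 AS tab, id FROM table3
                UNION ALL SELECT 4 AS tab, id FROM table4) t
          GROUP BY id) t
    ORDER BY id
""").fetchall()

print(joined, predicted)  # [(1, 6), (2, 1)] [(1, 6), (2, 1)]
```

Here id 1 appears twice in table2 and three times in table3, so a single table1 row fans out to 2 * 3 = 6 output rows. The product only predicts the join size for ids that exist in table1, since table1 drives the LEFT JOINs.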

Oracle: Select two different rows from one table and select value from another table if any of the entry does not exist

These are the two tables. For each TABLE2ID, I want to select the created time from TABLE1 for type = 'PENDINGTIMESTAMP' and for type = 'DISTRIBUTEDTIMESTAMP'.
TABLE1
+------+--------+--------------------+-------------------+
|ID |TABLE2ID|TYPE |CREATED |
+------+--------+--------------------+-------------------+
|156174|849118 |PENDINGTIMESTAMP |2016-09-09 03:33:11|
|156175|849118 |DISTRIBUTEDTIMESTAMP|2016-09-09 03:33:11|
|156176|849118 |PROCESSTIME |2016-09-09 03:33:11|
|156177|849119 |DISTRIBUTEDTIMESTAMP|2016-09-09 03:33:11|
|156178|849119 |PROCESSTIME |2016-09-09 03:33:11|
+------+--------+--------------------+-------------------+
TABLE2
+------+-------------------+
|ID |CREATED |
+------+-------------------+
|849118|2016-09-09 05:00:00|
|849119|2016-09-09 06:00:00|
+------+-------------------+
If either entry does not exist in TABLE1 for a TABLE2ID, I want to select TABLE2.CREATED for the matching TABLE2.ID instead.
The final result would be:
+--------+-------------------+-------------------+
|TABLE2ID|TIME1 |TIME2 |
+--------+-------------------+-------------------+
|849118 |2016-09-09 03:33:11|2016-09-09 03:33:11|
|849119 |2016-09-09 06:00:00|2016-09-09 03:33:11|
+--------+-------------------+-------------------+
TIME1 in the second row is taken from TABLE2, because no PENDINGTIMESTAMP entry exists in TABLE1 for 849119.
I tried something like the query below. It produces a Cartesian product and returns too many rows:
select
table2.id table2id,
case when t2.logtype = 'PENDINGTIMESTAMP' then t2.created else table2.created end as time1,
case when t1.logtype = 'NEWTIMESTAMP' then t1.created else table2.created end as time2
from
table2,
table1 t1,
table1 t2
where
table2.id(+) = t1.table2id
and table2.id(+) = t2.table2id

I assume that table2 contains every possible table2id.
So I would create 2 outer joins from table2 to table1, one for the pending and one for the distributed timestamps.
Finally, in the select list we can use the NVL function to fall back to table2's created timestamp.
SELECT m.id AS table2id,
NVL(p.created, m.created) AS time1,
NVL(d.created, m.created) AS time2
FROM table2 m
LEFT OUTER JOIN table1 p ON (p.table2id = m.id AND p.type = 'PENDINGTIMESTAMP')
LEFT OUTER JOIN table1 d ON (d.table2id = m.id AND d.type = 'DISTRIBUTEDTIMESTAMP')
Or, with Oracle outer join syntax (I'm not sure whether an IS NULL check is really necessary to compensate for missing rows):
SELECT m.id AS table2id,
NVL(p.created, m.created) AS time1,
NVL(d.created, m.created) AS time2
FROM table2 m,
table1 p,
table1 d
WHERE m.id = p.table2id(+)
AND p.type(+) = 'PENDINGTIMESTAMP'
AND m.id = d.table2id(+)
AND d.type(+) = 'DISTRIBUTEDTIMESTAMP'
Please note: I do not have an Oracle system at hand to test the statement, and I haven't used Oracle SQL syntax for about 3 years now, so please excuse any syntax errors.
But I hope you get the idea.
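Since the statements above were not tested, here is a minimal sketch of the ANSI variant run against the question's sample data. It uses Python's sqlite3 module rather than Oracle, so COALESCE takes the place of NVL (an assumption made for portability; the join structure is unchanged).

```python
import sqlite3

# SQLite stands in for Oracle; COALESCE replaces NVL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INT, table2id INT, type TEXT, created TEXT);
    CREATE TABLE table2 (id INT, created TEXT);
    INSERT INTO table1 VALUES
        (156174, 849118, 'PENDINGTIMESTAMP',     '2016-09-09 03:33:11'),
        (156175, 849118, 'DISTRIBUTEDTIMESTAMP', '2016-09-09 03:33:11'),
        (156176, 849118, 'PROCESSTIME',          '2016-09-09 03:33:11'),
        (156177, 849119, 'DISTRIBUTEDTIMESTAMP', '2016-09-09 03:33:11'),
        (156178, 849119, 'PROCESSTIME',          '2016-09-09 03:33:11');
    INSERT INTO table2 VALUES
        (849118, '2016-09-09 05:00:00'),
        (849119, '2016-09-09 06:00:00');
""")

# One outer join per timestamp type; the type filter stays in the ON clause
# so that a missing entry yields NULL instead of removing the table2 row.
result = conn.execute("""
    SELECT m.id AS table2id,
           COALESCE(p.created, m.created) AS time1,
           COALESCE(d.created, m.created) AS time2
    FROM table2 m
    LEFT OUTER JOIN table1 p
      ON p.table2id = m.id AND p.type = 'PENDINGTIMESTAMP'
    LEFT OUTER JOIN table1 d
      ON d.table2id = m.id AND d.type = 'DISTRIBUTEDTIMESTAMP'
    ORDER BY m.id
""").fetchall()

for row in result:
    print(row)
```

For 849119 there is no PENDINGTIMESTAMP row, so time1 falls back to table2's created value (2016-09-09 06:00:00), while time2 comes from table1.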

Return non-null value from two tables in Oracle

I have two tables, T1 and T2, with the same set of columns. I need a query that returns each column's value from whichever table has it non-null. If the column is null in both tables, the query should return null for that column.
The columns are c1, c2, c3, cond1.
I issued the following query. The problem is that if one subquery fails, the whole query fails. Can somebody please help? There is probably a simpler way.
SELECT NVL(T1.c1, T2.c1) c1,NVL(T1.c2, T2.c2) c2,NVL(T1.c3, T2.c3) c3
FROM (SELECT c1,c2,c3
FROM T1
WHERE cond1 = 'T10') T1
,(SELECT c1,c2,c3
FROM T2
WHERE cond1 = 'T200') T2 ;

You need something like this:
SELECT NVL((SELECT T1.c1
FROM T1
WHERE T1.c2 = 'T10'),
(SELECT T2.c1
FROM T2
WHERE T2.c2 = 'T200')) AS c1
FROM dual
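Or you may prefer a full outer join. The filters go inside the subqueries; in a WHERE clause after the join they would discard the rows where one side is missing:
SELECT NVL(T1.c1, T2.c1) AS c1
FROM (SELECT c1 FROM T1 WHERE c2 = 'T10') T1
FULL OUTER JOIN (SELECT c1 FROM T2 WHERE c2 = 'T200') T2 ON 1=1
Your result is logical. If either subquery returns no rows, the comma (cross) join in your query has no combinations to return.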
EDIT. After some new requirements, we can use a hack to get the row. Let's take all three possibilities (the T1 row, the T2 row, or a row of all nulls) and select the first one:
SELECT *
FROM ( (SELECT T1.*
FROM T1
WHERE T1.c2 = 'T10')
UNION ALL
(SELECT T2.*
FROM T2
WHERE T2.c2 = 'T200')
UNION ALL
(SELECT T1.*
FROM dual
LEFT JOIN T1 ON 1 = 0 ) )
WHERE ROWNUM = 1
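The "first available row" idea can be sketched as follows. This is a variant, not the query above: it runs on Python's sqlite3 module, which has neither dual nor ROWNUM, so it uses an explicit priority column with ORDER BY and LIMIT 1 instead. The priority column, the sample rows and the helper name first_available are all assumptions, and the filter is on cond1 as in the question.

```python
import sqlite3

# Hypothetical sample rows; SQLite stands in for Oracle.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (c1 TEXT, c2 TEXT, c3 TEXT, cond1 TEXT);
    CREATE TABLE T2 (c1 TEXT, c2 TEXT, c3 TEXT, cond1 TEXT);
    INSERT INTO T1 VALUES ('a1', 'a2', 'a3', 'T10');
    INSERT INTO T2 VALUES ('b1', 'b2', 'b3', 'T200');
""")


def first_available(t1_cond, t2_cond):
    """Return the T1 row if it exists, else the T2 row, else all NULLs."""
    return conn.execute("""
        SELECT c1, c2, c3 FROM (
            SELECT 1 AS priority, c1, c2, c3 FROM T1 WHERE cond1 = ?
            UNION ALL
            SELECT 2, c1, c2, c3 FROM T2 WHERE cond1 = ?
            UNION ALL
            SELECT 3, NULL, NULL, NULL      -- fallback row of NULLs
        )
        ORDER BY priority
        LIMIT 1
    """, (t1_cond, t2_cond)).fetchone()


print(first_available('T10', 'T200'))        # ('a1', 'a2', 'a3')
print(first_available('missing', 'T200'))    # ('b1', 'b2', 'b3')
print(first_available('missing', 'missing')) # (None, None, None)
```

Ordering by an explicit priority avoids relying on the order in which the UNION ALL branches happen to be returned.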

left outer join on nullable field with between in join condition (Oracle)

I have two tables: table1 with fields c1 and dt (nullable), and table2 with fields start_dt, end_dt and wk_id. I need to perform a left outer join between table1 and table2 to get the wk_id for which dt falls between start_dt and end_dt. I applied the following condition, but some wk_id values that shouldn't be NULL come back NULL and some rows are repeated.
where nvl(t1.dt,'x') between nvl(t2.start_dt(+), 'x') and nvl(t2.end_dt(+), 'x');
What is wrong with the condition?

select *
from table1 t1
left join table2 t2
on t1.dt between t2.start_dt and t2.end_dt
I recommend you try the new ANSI join syntax.
Also, are you just using 'x' as an example? Or are the dt columns really stored as strings?
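As a quick check of how the ANSI form treats a NULL dt, here is a minimal sketch using Python's sqlite3 module instead of Oracle. The rows are made up, and the dates are stored as ISO-8601 strings so that BETWEEN compares them correctly as text (both assumptions).

```python
import sqlite3

# Hypothetical rows; dates are ISO-8601 text, SQLite stands in for Oracle.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (c1 TEXT, dt TEXT);
    CREATE TABLE table2 (start_dt TEXT, end_dt TEXT, wk_id INT);
    INSERT INTO table1 VALUES
        ('a', '2020-01-03'),   -- falls in week 1
        ('b', NULL),           -- no date at all
        ('c', '2020-02-15');   -- outside every range
    INSERT INTO table2 VALUES
        ('2020-01-01', '2020-01-07', 1),
        ('2020-01-08', '2020-01-14', 2);
""")

# The range test lives in the ON clause, so unmatched table1 rows are kept.
weeks = conn.execute("""
    SELECT t1.c1, t2.wk_id
    FROM table1 t1
    LEFT JOIN table2 t2
      ON t1.dt BETWEEN t2.start_dt AND t2.end_dt
    ORDER BY t1.c1
""").fetchall()

print(weeks)  # [('a', 1), ('b', None), ('c', None)]
```

A NULL dt never satisfies BETWEEN, so row 'b' is kept exactly once with a NULL wk_id and no nvl wrapper is needed. Repeated rows would appear only if the ranges in table2 overlap.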

It seems you are missing the part "table1 left outer join table2 on table1.some_field = table2.some_field"
Something like this:
select t1.c1, t1.dt, t2.start_dt, t2.end_dt, t2.wk_id
from table1 t1 left outer join table2 t2
on t1.some_field1 = t2.some_field1
where nvl(t1.dt,'x')
between nvl(t2.start_dt, 'x') and
nvl(t2.end_dt, 'x')
