How to implement left join on data range in hive - hadoop

I want to convert the Oracle logic below to Hive.
Logic:
Select a.id,a.name,b.desc from table a left join table b on
a.num between b.min_num and b.max_num;
Could anyone help me achieve the above logic in Hive?

With this solution you control the performance.
Each range in b is split into sub-ranges of width x, as small as you want.
Too large an x practically results in a CROSS JOIN.
Too small an x might generate a huge row set from b (x=1 generates every value of every b range).
set hivevar:x=100;
select a.id
,a.name
,b.desc
from table_a as a
left join
(select a.id
,b.desc
from table_a as a
inner join
(select b.min_num div ${hivevar:x} + pe.pos as sub_range_id
,b.*
from table_b as b
lateral view
posexplode(split(space(cast (b.max_num div ${hivevar:x} - b.min_num div ${hivevar:x} as int)),' ')) pe
) as b
on a.num div ${hivevar:x} =
b.sub_range_id
where a.num between b.min_num and b.max_num
) b
on b.id =
a.id
;
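To make the sub-range idea in the query above concrete, here is a minimal Python sketch of the same bucketing logic outside Hive. The function name, tuple layouts, and sample rows are hypothetical, and it assumes non-negative integers so that Python's `//` gives the same buckets as `div` in the query.

```python
def bucketed_range_left_join(a_rows, b_rows, x=100):
    """Left join a to b on a.num between b.min_num and b.max_num.

    a_rows: list of (id, name, num)        -- hypothetical layout of table_a
    b_rows: list of (min_num, max_num, desc) -- hypothetical layout of table_b
    x:      sub-range width, same role as ${hivevar:x}
    """
    # Explode every b range into the sub-range ids (buckets) it touches.
    buckets = {}
    for min_num, max_num, desc in b_rows:
        for sub_range_id in range(min_num // x, max_num // x + 1):
            buckets.setdefault(sub_range_id, []).append((min_num, max_num, desc))

    result = []
    for id_, name, num in a_rows:
        # Equality match on the bucket, then the exact range check.
        candidates = buckets.get(num // x, [])
        matches = [desc for lo, hi, desc in candidates if lo <= num <= hi]
        if matches:
            result.extend((id_, name, desc) for desc in matches)
        else:
            # Left join: keep the a row even when no range matches.
            result.append((id_, name, None))
    return result


a_rows = [(1, "x", 150), (2, "y", 999), (3, "z", 250)]
b_rows = [(100, 199, "first"), (200, 450, "second")]
result = bucketed_range_left_join(a_rows, b_rows, x=100)
print(result)
```

A smaller x means more buckets per b range (a larger exploded set) but fewer candidates to check per a row; a larger x means the opposite. The final result does not depend on x.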

select a.id
,a.name
,b.desc
from table_a as a
left join (select a.id
,b.desc
from table_a as a
cross join table_b as b
where a.num between b.min_num and b.max_num
) b
on b.id =
a.id
;

select a.id
,a.name
,b.desc
from table_a as a
left join (select b.min_num + pe.pos as num
,b.desc
from table_b as b
lateral view
posexplode(split(space(b.max_num-b.min_num),' ')) pe
) b
on b.num =
a.num
;

Related

What is the purpose of (+) operator in a where clause, other than outer joins, in Oracle SQL?

I have some very old Oracle SQL code to review (see below), and I am trying to understand what the (+) operator does in the where clause after its first use:
select *
from table_a a,
table_b b
where
a.id = b.id (+)
and b.seq_nb (+) = 1
and b.type_cd (+) = 'DOLLR'
I thought (+) was an outer join equivalent, so
from table_a a,
table_b b
where
a.id = b.id (+)
would be the same as
from table_a a left outer join table_b b on a.id = b.id
So how can you have outer joins to hard-coded values as below?
b.seq_nb (+) = 1
and b.type_cd (+) = 'DOLLR'
Any help would be greatly appreciated, thank you!
It's the same as:
select *
from table_a a
left outer join table_b b
on a.id = b.id
and b.type_cd = 'DOLLR'
and b.seq_nb = 1
Sometimes also referred to as a "filtered outer join".
It is equivalent to an outer join with a derived table:
select *
from table_a a
left outer join (
select *
from table_b
where type_cd = 'DOLLR'
and seq_nb = 1
) b on a.id = b.id
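The difference between filtering inside the ON clause (the "filtered outer join" above) and filtering in WHERE can be checked with a small sketch. It uses Python's built-in sqlite3 module rather than Oracle, with ANSI join syntax instead of (+), and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table_a (id integer);
    create table table_b (id integer, seq_nb integer, type_cd text);
    insert into table_a values (1), (2);
    -- id 2 has a b row, but it does not satisfy seq_nb = 1
    insert into table_b values (1, 1, 'DOLLR'), (2, 2, 'DOLLR');
""")

# Filters in the ON clause: every table_a row survives.
filter_in_on = conn.execute("""
    select a.id, b.seq_nb
    from table_a a
    left outer join table_b b
      on a.id = b.id
     and b.type_cd = 'DOLLR'
     and b.seq_nb = 1
    order by a.id
""").fetchall()

# Filters in WHERE: rows whose b side is NULL are removed,
# so the outer join behaves like an inner join.
filter_in_where = conn.execute("""
    select a.id, b.seq_nb
    from table_a a
    left outer join table_b b
      on a.id = b.id
    where b.type_cd = 'DOLLR'
      and b.seq_nb = 1
    order by a.id
""").fetchall()

print(filter_in_on)
print(filter_in_where)
```

In the first query the row for id 2 is kept with a NULL on the b side; in the second it disappears.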

I am trying to combine 3 tables to get a distinct combination, as below:

SELECT TYPE_DETAILS(a.column1,c.column2,c.column3) BULK COLLECT INTO OUT_DETAILS
FROM TABLE1 a
INNER JOIN TABLE2 b ON a.column2 = b.column2
INNER JOIN TABLE3 c ON a.column3 = c.column3;
I only want combinations for distinct values of a.column1. If I apply DISTINCT as below, I get an error:
SELECT TYPE_DETAILS(DISTINCT a.column1,c.column2,c.column3) BULK COLLECT INTO OUT_DETAILS
FROM TABLE1 a
INNER JOIN TABLE2 b ON a.column2 = b.column2
INNER JOIN TABLE3 c ON a.column3 = c.column3;
Why not use a subquery:
SELECT TYPE_DETAILS(column1,column2,column3)
BULK COLLECT INTO OUT_DETAILS FROM
(SELECT DISTINCT a.column1,c.column2,c.column3
FROM TABLE1 a
INNER JOIN TABLE2 b ON a.column2 = b.column2
INNER JOIN TABLE3 c ON a.column3 = c.column3);

Running a native SQL query in JPA gives a different result than running the same query directly in a SQL tool

I have a native query like this:
WITH SOURCE_A AS (
SELECT a.ID FROM A a
WHERE a.SOME_THING > 1000
),
SOURCE_B AS (
SELECT b.ID FROM B b
INNER JOIN A a ON a.B_ID = b.ID
WHERE a.ID IN (SELECT * FROM SOURCE_A)
AND call_to_a_procedure(b.SOME_THING) = 1
),
SOURCE_C AS (
SELECT c.ID FROM C c
INNER JOIN B b ON b.C_ID = c.ID
WHERE b.ID IN (SELECT * FROM SOURCE_B)
AND call_to_a_procedure(c.SOME_THING) = 1
)
SELECT re.CODE FROM RESULT re
INNER JOIN A a ON a.ID = re.ID_A
WHERE a.ID IN (SELECT * FROM SOURCE_A)
UNION
SELECT re.CODE FROM RESULT re
INNER JOIN B b ON b.ID = re.ID_B
WHERE b.ID IN (SELECT * FROM SOURCE_B)
UNION
SELECT re.CODE FROM RESULT re
INNER JOIN C c ON c.ID = re.ID_C
WHERE c.ID IN (SELECT * FROM SOURCE_C)
When I run this query with query.getResultList(), only the result of:
SELECT re.CODE FROM RESULT re
INNER JOIN A a ON a.ID = re.ID_A
WHERE a.ID IN (SELECT * FROM SOURCE_A)
is returned. The two UNION branches are ignored. But if I run the query directly in a SQL tool like Oracle SQL Developer or DBeaver, I get the full UNION result.
JPA just silently ignores the UNION parts, with no error or exception.
UPDATE: It may be because the call to call_to_a_procedure doesn't work in JPA: if I remove the call to the procedure, I get the expected result.

How to count table C rows for each row of table A

I have three tables: A, B, and C.
A table:
id,name
B table:
id,a_id,date
C table:
id,b_id,type(value is 0/1)
I want to print A.id, A.name, and C.countingdata for every row of A, where C.countingdata counts the C rows with C.type=1 that are linked to A through table B (which holds the A table id).
The result should look like this:
A.id A.name C.countingdata
1 abc 4
2 vfd 2
3 fdg 0
You can first inner join B and C, group by B.id, and compute countingData with count(). Join that subquery back to B to bring a_id
into the result set.
Then inner join A with the subquery above to get your results. Note that this returns one row per B row and omits A rows with no matching C rows.
SQL:
select A.id, A.name, derived.countingData
from A
inner join (
select B.id as b_id,B.a_id,sub_data.countingData
from B
inner join (
select B.id,count(B.id) as countingData
from B
inner join C
on B.id = C.b_id
where C.type=1
group by B.id
) sub_data
on B.id = sub_data.id
) derived
on A.id = derived.a_id
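You can use the query below. The LEFT JOINs keep A rows that have no matching C rows, so they are reported with a count of 0:
Select
A.id
,A.name
,COUNT(C.id) AS countingdata
FROM A
LEFT JOIN B ON A.id = B.a_id
LEFT JOIN C ON B.id = C.b_id AND C.type = 1
GROUP BY
A.id
,A.name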
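As a quick check of how the join type affects the expected output (including the row with a count of 0), here is a small sketch using Python's built-in sqlite3 module. The sample rows are invented to reproduce the result table from the question, and only the columns needed for the joins are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table A (id integer, name text);
    create table B (id integer, a_id integer);
    create table C (id integer, b_id integer, type integer);
    insert into A values (1, 'abc'), (2, 'vfd'), (3, 'fdg');
    insert into B values (10, 1), (11, 1), (20, 2), (30, 3);
    insert into C values
        (1, 10, 1), (2, 10, 1), (3, 10, 1),
        (4, 11, 1), (5, 11, 0),
        (6, 20, 1), (7, 20, 1),
        (8, 30, 0);
""")

# LEFT JOINs with the type filter in the ON clause: A rows without
# any type=1 C rows are kept and COUNT(C.id) yields 0 for them.
left_join_counts = conn.execute("""
    select A.id, A.name, count(C.id) as countingdata
    from A
    left join B on A.id = B.a_id
    left join C on B.id = C.b_id and C.type = 1
    group by A.id, A.name
    order by A.id
""").fetchall()

# Plain (inner) JOINs: A rows without matching C rows are dropped.
inner_join_counts = conn.execute("""
    select A.id, A.name, count(C.id) as countingdata
    from A
    join B on A.id = B.a_id
    join C on B.id = C.b_id and C.type = 1
    group by A.id, A.name
    order by A.id
""").fetchall()

print(left_join_counts)
print(inner_join_counts)
```

COUNT(C.id) counts only non-NULL values, which is why the unmatched row shows 0 rather than 1.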

Hive - how to reuse a sub-query in hive with optimal performance

What is the best way to structure/write a query in Hive when I have a complex sub-query that is repeated multiple times throughout the select statement?
I originally created a temporary table for the sub-query which was refreshed before each run. Then I began to use a CTE as part of the original query (discarding the temp table) for readability and noticed degraded performance. This made me curious about which implementation methods are best with respect to performance when needing to reuse sub-queries.
The data I am working with contains upwards of 10 million records. Below is an example of the query I wrote that made use of a CTE.
with temp as (
select
a.id,
x.type,
y.response
from sandbox.tbl_form a
left outer join sandbox.tbl_formStatus b
on a.id = b.id
left outer join sandbox.tbl_formResponse y
on b.id = y.id
left outer join sandbox.tbl_formType x
on y.id = x.typeId
where b.status = 'Completed'
)
select
a.id,
q.response as user,
r.response as system,
s.response as agent,
t.response as owner
from sandbox.tbl_form a
left outer join (
select * from temp x
where x.type= 'User'
) q
on a.id = q.id
left outer join (
select * from temp x
where x.type= 'System'
) r
on a.id = r.id
left outer join (
select * from temp x
where x.type= 'Agent'
) s
on a.id = s.id
left outer join (
select * from temp x
where x.type= 'Owner'
) t
on a.id = t.id;
There are issues in your query.
1) In the CTE you have three left joins without an ON clause. This may cause serious performance problems, because joins without an ON clause are CROSS JOINs.
2) Also, the where b.status = 'Completed' clause turns the LEFT join with table b into an inner join, although without an ON clause it still multiplies all records from a by all records from b that pass the where filter.
3) Most probably you do not need the CTE at all. Just join correctly with ON clauses, use case when type='User' then response end, and aggregate with min() or max() grouped by id:
select a.id,
max(case when x.type='User' then y.response end) as user,
max(case when x.type='System' then y.response end) as system,
...
from sandbox.tbl_form a
left outer join sandbox.tbl_formStatus b
on a.id = b.id
left outer join sandbox.tbl_formResponse y
on b.id = y.id
left outer join sandbox.tbl_formType x
on y.id = x.typeId
where b.status = 'Completed' --if you want LEFT JOIN add --or b.status is null
group by a.id
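The conditional aggregation pattern used above (max(case when ... end) grouped by id) can be tried on a small data set. This sketch uses Python's built-in sqlite3 module instead of Hive, and a single hypothetical table named responses that stands in for the already-joined form, response, and type tables; the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for the joined tbl_form / tbl_formResponse / tbl_formType rows
    create table responses (id integer, type text, response text);
    insert into responses values
        (1, 'User',   'alice'),
        (1, 'System', 'sysA'),
        (1, 'Agent',  'agent7'),
        (2, 'User',   'bob'),
        (2, 'Owner',  'carol');
""")

# One pass over the data: each CASE picks the response for one type,
# and max() collapses the rows of an id into a single output row.
pivoted = conn.execute("""
    select id,
           max(case when type = 'User'   then response end) as user_response,
           max(case when type = 'System' then response end) as system_response,
           max(case when type = 'Agent'  then response end) as agent_response,
           max(case when type = 'Owner'  then response end) as owner_response
    from responses
    group by id
    order by id
""").fetchall()

print(pivoted)
```

Compared with joining the same subquery once per type, this reads the source rows once; a type with no response for an id simply comes out as NULL.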

Resources