Rewrite Hive IN clause - hadoop

I am trying to execute this subquery in HIVE,but i am getting error that subquery is not supported in my HIVE version, unfortunately yes we are using the old version of HIVE.
select col1,col2 from t1 where col1 in (select x from t2 where y = 0)
Then I have rewritten the subquery using left semi join like this,
select a.col1,a.col2
FROM t1 a LEFT SEMI JOIN t2 b on (a.col1 =b.x)
WHERE b.y = 0
This query is running fine if i don't give the where condition, but its not recognising the table b when I try to use b.any column in where condition or use b.any column in select clause. Throwing this error -
Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 3:6 Invalid table alias or column reference 'b': (possible column names
Any help is much appreciated.

select a.col1,a.col2
FROM t2 b RIGHT OUTER JOIN t1 a on (b.x = a.col1)
WHERE b.y = 0
-- When you use LEFT SEMI JOIN, where condition is not work on right side table column. Please change your script to above condition.

Instead of t1 a LEFT SEMI JOIN t2 b, you can do something like this: t1 a LEFT SEMI JOIN (select * from t2 where y = 0) b.
select a.col1,a.col2
FROM t1 a LEFT SEMI JOIN (select * from t2 where y = 0) b on (a.col1 =b.x);
Please see below example.
Department table:
+--------------------+----------------------+--+
| department.deptid | department.deptname |
+--------------------+----------------------+--+
| D101 | sales |
| D102 | finance |
| D103 | HR |
| D104 | IT |
| D105 | staff |
+--------------------+----------------------+--+
Employee tabe:
+-----------------+------------------+------------------+--+
| employee.empid | employee.salary | employee.deptid |
+-----------------+------------------+------------------+--+
| 1001 | 1000 | D101 |
| 1002 | 2000 | D101 |
| 1003 | 3000 | D102 |
| 1004 | 4000 | D104 |
| 1005 | 5000 | D104 |
+-----------------+------------------+------------------+--+
hive> SELECT
dept.deptid, dept.deptname
FROM
department dept
LEFT SEMI JOIN
(SELECT * FROM employee WHERE salary > 3000) emp
ON (dept.deptid = emp.deptid);
+--------------+----------------+--+
| dept.deptid | dept.deptname |
+--------------+----------------+--+
| D104 | IT |
+--------------+----------------+--+

Related

Insert data to table from another table containing null values and replace null values with the original table 1 values

I want to match first column of both table and insert table 2 values to table 1 . But if Table 2 values are null leave table 1 vlaues as it is .I am using Hive to dothis .Please help.
You need to use coalesce to get non null value to populate b column and case statement to make decision to populate c column.
Example:
hive> select t1.a,
coalesce(t2.y,t1.b)b,
case when t2.y is null then t1.c
else t2.z
end as c
from table1 t1 left join table2 t2 on t1.a=t2.x;
+----+-----+----+--+
| a | b | c |
+----+-----+----+--+
| a | xx | 5 |
| b | bb | 2 |
| c | zz | 7 |
| d | dd | 4 |
+----+-----+----+--+

Is an anti-join more efficient than a left outer join?

A comment on this answer notes that anti-joins may have been optimized to be more efficient that outer joins in Oracle. I'd be interested to see what explanations/evidence might support or disprove this claim.
When you use "not exists" or "not in" in your SQL query, you let Oracle to choose merge anti-join or hash anti-join access paths.
Quick Explanation
For example, given join betwen table A and B (from A join B on A.x = B.x) Oracle will fetch all relevant data from table A, and try to match them with corresponding rows in table B, so it's strictly dependent on selectivity of table A predicate.
When using anti-join optimization, Oracle can choose the table with higher selectivity and match it with the other one, which may result in much faster code.
It can't do that with regular join or subquery, because it can't assume that one match between tables A and B is enough to return that row.
Related hints: HASH_AJ, MERGE_AJ.
More:
This looks like a nice and detailed article on the subject.
Here is another, more dencet article.
If Oracle can transform left join + where is null into ANTI join then it's exactly the same.
create table ttt1 as select mod(rownum,10) id from dual connect by level <= 50000;
insert into ttt1 select 10 from dual;
create table ttt2 as select mod(rownum,10) id from dual connect by level <= 50000;
select ttt1.id
from ttt1
left join ttt2
on ttt1.id = ttt2.id
where ttt2.id is null;
select * from ttt1 where id not in (select id from ttt2);
If you have a look at
Final query after transformations:******* UNPARSED QUERY IS *******
in trace for event 10053 then you'll find two exactly the same queries (you can see "=" in predicate in the trace file because there is no special sign for ANTI join)
SELECT "TTT1"."ID" "ID" FROM "TTT2" "TTT2","TTT1" "TTT1" WHERE "TTT1"."ID"="TTT2"."ID"
And they have exactly the same plans
-----------------------------------
| Id | Operation | Name |
-----------------------------------
| 0 | SELECT STATEMENT | |
| 1 | HASH JOIN ANTI | |
| 2 | TABLE ACCESS FULL| TTT1 |
| 3 | TABLE ACCESS FULL| TTT2 |
-----------------------------------
If you, however, put a hint to disable transformations then plan will be
select --+ no_query_transformation
ttt1.id
from ttt1, ttt2
where ttt1.id = ttt2.id(+) and ttt2.id is null;
------------------------------------
| Id | Operation | Name |
------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | FILTER | |
| 2 | HASH JOIN OUTER | |
| 3 | TABLE ACCESS FULL| TTT1 |
| 4 | TABLE ACCESS FULL| TTT2 |
------------------------------------
and performance will degrade significantly.
If you will use ANSI join syntax with deisabled transformation it will be even worse.
select --+ no_query_transformation
ttt1.id
from ttt1
left join ttt2
on ttt1.id = ttt2.id
where ttt2.id is null;
select * from table(dbms_xplan.display_cursor(format => 'BASIC'));
--------------------------------------------------
| Id | Operation | Name |
--------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | VIEW | |
| 2 | FILTER | |
| 3 | MERGE JOIN OUTER | |
| 4 | TABLE ACCESS FULL | TTT1 |
| 5 | BUFFER SORT | |
| 6 | VIEW | VW_LAT_2131DCCF |
| 7 | TABLE ACCESS FULL| TTT2 |
--------------------------------------------------
So, in a nutshell, if Oracle can apply transformation to ANTI join then performance is exactly the same otherwise it can be worse. You can also use hint "--+ rule" to disable CBO transformations and see what happenes.
PS. On a separate note, SEMI join may be in some cases much better than inner join + distinct even with enabled CBO transformations.

Update between tables

I have two tables (ORACLE11) :
TABLE1: (ID_T1 is TABLE1 primary key)
| ID_T1 | NAME | DATEBEGIN | DATEEND |
| 10 | test | 01/01/2017 | 01/06/2017 |
| 11 | test | 01/01/2017 | null |
| 12 | test1 | 01/01/2017 | 01/06/2017 |
| 13 | test1 | 01/01/2017 | null |
TABLE2: (ID_T2 is TABLE2 primary key and ID_T1 is TABLE2 foreign key on TABLE1)
| ID_T2 | ID_T1 |
| 1 | 10 |
| 2 | 11 |
| 3 | 11 |
| 4 | 12 |
| 5 | 13 |
I need to delete all rows from TABLE1 where TABLE1.DATEEND = 'null'
But first I must update TABLE2 to modify TABLE2.ID_T1 to the remaining record in TABLE1 for the same NAME :
TABLE2:
| ID_T2 | ID_T1 |
| 1 | 10 |
| 2 | 10 |
| 3 | 10 |
| 4 | 12 |
| 5 | 12 |
I tried this:
UPDATE TABLE2
SET TABLE2.ID_T1 = (
SELECT TABLE1.ID_T1
FROM TABLE1
WHERE TABLE1.DATEBEGIN = '01/01/2017'
AND TABLE1.DATEEND IS NOT NULL
)
WHERE TABLE2.ID_T1 = (
SELECT TABLE1.ID_T1
FROM TABLE1
WHERE TABLE1.DATEBEGIN = '01/01/2017'
AND TABLE1.DATEEND IS NULL
);
But I don't know how to join on TABLE1.NAME and do it for all rows of TABLE2. Thanks in advance for your help.
First create a temp table to find out which id is to be retained for which name. In the case od multiple possible values, I have selected one by ascending order or id_t1.
create table table_2_update as
select id_t1, name from (select id_t1, name row_number() over(partition by
name order by id_t1) from table1 where name is not null) where rn=1;
Create next table to know which id of table2 connects to which name of table1.
create table which_to_what as
select t2.id_t2, t2.id_t1, t1.name from table1 t1 inner join table2 t2 on
t1.id_t1 = t2.id_t2 group by t2.id_t2, t2.id_t1, t1.name;
Since this newly created table now contains id and name of table1, and id of table2, merge into it to retain one to one case of id and name of table1.
merge into which_to_what a
using table_2_update b
on (a.name=b.name)
when matched then update set
a.id_t1=b.id_t1;
Now finally we have a table which contains the final correct values, you can either rename this tale to table2 or merge original table2 on the basis of id of new table and original table2.
merge into table2 a
using which_to_what a
on (a.id_t2=b.id_t2)
when matched then update set
a.id_t1=b.id_t1;
Finally delete null values from table1.
delete from table1 where dateend is null;
You can do this by joining table1 to itself, joining on the name column. Use the first table (a) to link to table2.id_t1 and the second table (b) to get the t1_id where dateend is not null.
UPDATE table2
SET table2.id_t1 = (
select b.id_t1
from table1 a, table1 b
where a.name = b.name
and b.dateend is not null
and a.id_t1 = table2.id_t1
)
WHERE EXISTS (
select b.id_t1
from table1 a, table1 b
where a.name = b.name
and b.dateend is not null
and a.id_t1 = table2.id_t1
);
This assumes that there will only be one table1 record where dateend is not null.

join order step by step in oracle

I have a trouble with below query.I want to join columns in step.
For example a.code first b.name1 if not match then join name2 etc.But i cant handle the query.
SELECT * FROM TABLE A
LEFT JOIN TABLE B
ON A.CODE = B.NAME1
OR A.CODE = B.NAME2
OR UPPER(B.NAME3) = UPPER(A.NAME)
Thanks
edit
with sample below,
if TABLEB.CODE = TABLEA.NAME1 match then dont want to look OR TABLEB.CODE = TABLEA.NAME2.
And if not matching any TABLEB.CODE = TABLEA.NAME1 then step by step match the columns on that order.
WITH TABLEA AS (SELECT 13445 AS ID,'A' AS TYPE,'DFSF' AS NAME1 , 'PCK' AS NAME2 FROM DUAL
UNION ALL
SELECT 13445 AS ID,'A' AS TYPE,'PCK' AS NAME1 , 'PCK' AS NAME2 FROM DUAL),
TABLEB AS (SELECT 56544 AS ID, 'PCK' AS CODE, 'PCK' AS FRST_NM FROM DUAL)
SELECT * FROM TABLEA
LEFT JOIN TABLEB
ON TABLEB.CODE = TABLEA.NAME1
OR TABLEB.CODE = TABLEA.NAME2
OR UPPER(TABLEB.FRST_NM) = UPPER(TABLEA.NAME2)
I believe your query will do what you want. There is no step by step in join conditions if you have already mentioned OR. So if either of the condition match, the rows will be matched. So if A.CODE <> B.NAME1 but A.CODE = B.NAME2, they will match. If both are true, then also they will match. I create a CTE with sample data to see the output. It will return 3 rows for 3 matches.
WITH TBL(SEQ,CODE,NAME,NAME1,NAME2,NAME3) AS
(
SELECT 1,'A1','B','A1','D','E' FROM DUAL UNION ALL -- as A.CODE(A1)=B.NAME1, it will match
SELECT 2, 'B2','B','D','B2','F' FROM DUAL UNION ALL --as A.CODE (B2) <> B.NAME1 , it will match for B.NAME2
SELECT 3, 'G','H1','D','A','H1' FROM DUAL UNION ALL --match for UPPER(B.NAME3) = UPPER(A.NAME)
SELECT 4,'P4','Q4','R4','S4','T4' FROM DUAL
)
SELECT * FROM TBL A
LEFT JOIN TBL B
ON (A.CODE = B.NAME1
OR A.CODE = B.NAME2
OR UPPER(B.NAME3) = UPPER(A.NAME)
)
Column SEQ is just to see what rows are returning.
Output
+-----+------+------+-------+-------+-------+-------+--------+--------+---------+---------+---------+
| SEQ | CODE | NAME | NAME1 | NAME2 | NAME3 | SEQ_1 | CODE_1 | NAME_1 | NAME1_1 | NAME2_1 | NAME3_1 |
+-----+------+------+-------+-------+-------+-------+--------+--------+---------+---------+---------+
| 1 | A1 | B | A1 | D | E | 1 | A1 | B | A1 | D | E |
| 2 | B2 | B | D | B2 | F | 2 | B2 | B | D | B2 | F |
| 3 | G | H1 | D | A | H1 | 3 | G | H1 | D | A | H1 |
| 4 | P4 | Q4 | R4 | S4 | T4 | | | | | | |
+-----+------+------+-------+-------+-------+-------+--------+--------+---------+---------+---------+

Get records from multiple Hive tables without join

I have 2 tables :
Table1 desc:
count int
Table2 desc:
count_val int
I get the fields count, count_val from the above tables and insert into the another Audit table(table3) .
Table3 desc:
count int
count_val int
I am trying to log the record count of these 2 tables into audit table for each job run.
Any of your suggestions are appreciated.Thanks!
If you want just aggregations (like sums), the solution comes with the use of UNION
INSERT INTO TABLE audit
SELECT
SUM(count),
SUM(count_val)
FROM (
SELECT
t1.count,
0 as count_val
FROM table1 t1
UNION ALL
SELECT
0 as count,
t2.count_val
FROM table2 t2
) unioned;
Otherwise join is required, because you should somehow match your lines, it's how relational algebra (the theory behind SQL) works.
==table1==
| count|
|------|
| 12 |
| 751 |
| 167 |
===table2===
| count_val|
|----------|
| 1991 |
| 321 |
| 489 |
| 7201 |
| 3906 |
===audit===
| count | count_val|
|-------|----------|
| ??? | ??? |

Resources