Hive Joining multiple tables to create a horizontal layout - hadoop

We have six Hive tables, each holding millions of merchant records, with a sample structure like this:
Table1
MerchntId ,field1, field2
Table2
MerchantId, field3,field4
Table3
MerchantId, field5,field6,field7
Table4
MerchantId, field8,field9,field10
Table5
MerchantId, field11, field12, field13
and so on
The requirement is to create a horizontal layout containing every unique merchant for which at least one field has a value.
A merchantId may or may not be present in the other tables.
Final Table
MerchntId, field1, field2, field3, field4,
field5,field6,field7,field8,field9,field10,field11, field12, field13
After joining, the output should look like this:
i) 101 abc def ghi
ii) 102 ghj fert hyu ioj khhh jjh ddd aas fff kkk fff vvv ff
for case (i) only three fields have values
for case (ii) all fields have values
For this we do a FULL OUTER JOIN on merchantId, two tables at a time, and then create the final table.
Is there a better approach?
For example, my current approach:
SELECT distinct
(case when a.MerchntId IS NOT NULL then a.MerchntId else (case when
b.MerchntId IS NOT NULL
then b.MerchntId else '' end ) end ) as MerchntId,
(case when a.field1 IS NOT NULL then a.field1 else '' end ) as field1,
(case when a.field2 IS NOT NULL then a.field2 else '' end ) as field2,
(case when b.field3 IS NOT NULL then b.field3 else '' end ) as field3,
(case when b.field4 IS NOT NULL then b.field4 else '' end ) as field4
from Table1 a
full outer join Table2 b
ON a.MerchntId = b.MerchntId;
Then a full outer join of Table3 and Table4 in the same way,
and then a full outer join of these two results to create the final table.

I don't see any other option since your requirements explicitly translate to a full outer join. However, your query can be improved by using COALESCE and NVL:
SELECT
COALESCE(a.MerchntId, b.MerchntId) as MerchntId,
NVL(a.field1, '') as field1,
NVL(a.field2, '') as field2,
NVL(b.field3, '') as field3,
NVL(b.field4, '') as field4
from Table1 a
full outer join Table2 b
ON a.MerchntId = b.MerchntId;
Also, I'm not sure why you use distinct in your query.

UNION ALL the six tables, substituting nulls for the missing fields, then aggregate by MerchantId using min or max:
select MerchantId, max(field1) field1, max(field2) field2...max(field13) field13 from
(
select MerchantId, field1, field2, null field3, null field4... null field13 from Table1
union all
select MerchantId, null field1, null field2, field3, field4... null field13 from Table2
union all
...
select MerchantId, null field1, null field2... field11, field12, field13
from table6
)s group by MerchantId
After this you can replace the nulls with '' if necessary.
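The UNION ALL plus aggregate approach can be tried on a small scale without a Hive cluster. A minimal sketch using Python's built-in sqlite3 module, assuming only two of the tables and made-up sample rows; the SQL shape is the same one described above, with COALESCE applied after the aggregation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two small stand-ins for the real tables (sample rows are illustrative only).
conn.executescript("""
CREATE TABLE Table1 (MerchantId TEXT, field1 TEXT, field2 TEXT);
CREATE TABLE Table2 (MerchantId TEXT, field3 TEXT, field4 TEXT);
INSERT INTO Table1 VALUES ('101', 'abc', 'def');
INSERT INTO Table1 VALUES ('102', 'hyu', 'ioj');
INSERT INTO Table2 VALUES ('102', 'ghj', 'fert');
""")

# Pad each table to the full column list, stack the rows, then collapse
# to one row per merchant. MAX ignores NULLs, so the padded columns drop out.
rows = conn.execute("""
SELECT MerchantId,
       COALESCE(MAX(field1), '') AS field1,
       COALESCE(MAX(field2), '') AS field2,
       COALESCE(MAX(field3), '') AS field3,
       COALESCE(MAX(field4), '') AS field4
FROM (
    SELECT MerchantId, field1, field2, NULL AS field3, NULL AS field4 FROM Table1
    UNION ALL
    SELECT MerchantId, NULL, NULL, field3, field4 FROM Table2
) s
GROUP BY MerchantId
ORDER BY MerchantId
""").fetchall()

for row in rows:
    print(row)
```

Merchant 101 exists only in Table1, so its field3 and field4 come back as empty strings, while merchant 102 gets values from both tables on a single row.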

Related

Oracle How to make SELECT INSIDE A SELECT work?

Just wondering why the following select isn't working:
SELECT
A.FIELD1
, (SELECT PCN FROM (select B.PRIORITY, B.PCN
from
TABLE2 B
WHERE B.CUST= A.CUST
ORDER BY B.PRIORITY)
WHERE ROWNUM = 1) AS PCN
FROM TABLE1 A;
ERROR at line 2: ORA-00904: "A"."CUST": invalid identifier
Important to mention:
TABLE1 has as fields FIELD1, CUST.
TABLE2 has as fields PCN, PRIORITY, CUST.
Thanks in advance.
Your query shouldn't give you that error message; it would only happen if you removed the outer query.
CREATE TABLE TABLE1 (FIELD1 int, CUST int)
INSERT INTO TABLE1 VALUES(1,1)
1 rows affected
CREATE TABLE TABLE2 (PCN int, PRIORITY int, CUST int)
INSERT INTO TABLE2 VALUES (1,1,1)
1 rows affected
SELECT
A.FIELD1
, (SELECT PCN FROM (select B.PRIORITY, B.PCN
from
TABLE2 B
WHERE B.CUST= A.CUST
ORDER BY B.PRIORITY)
WHERE ROWNUM = 1) AS PCN
FROM TABLE1 A;
FIELD1 PCN
1 1
In older Oracle versions, a correlated subquery can only reference columns of the block that directly encloses it. Because your query on TABLE2 is nested two levels deep, it cannot see the columns of TABLE1.
Try this:
SELECT a.field1,
pcn.pcn
FROM table1 a,
(SELECT b.cust,
b.priority,
b.pcn,
ROW_NUMBER() OVER (PARTITION BY b.cust ORDER BY b.priority DESC) seq
FROM table2 b) pcn
WHERE a.cust = pcn.cust(+)
AND pcn.seq(+) = 1
That will work well for report queries. If you end up adding a filter on a specific customer, then you would be better off using OUTER APPLY if you have a recent-enough version of Oracle that supports that.
You could try this:
SELECT
A.FIELD1
, (SELECT B.PCN
from
TABLE2 B
WHERE B.CUST= A.CUST
ORDER BY B.PRIORITY
FETCH FIRST 1 ROWS ONLY) AS PCN
FROM TABLE1 A;
FETCH FIRST 1 ROWS ONLY returns the first row in the given order. It works on 12c and up, supports nesting, and needs no second subquery.
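The single-level correlated subquery can be tried without an Oracle instance. A minimal sketch using Python's built-in sqlite3 module, assuming SQLite's LIMIT 1 as a stand-in for Oracle's FETCH FIRST 1 ROWS ONLY; the sample rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (field1 INTEGER, cust INTEGER);
CREATE TABLE table2 (pcn INTEGER, priority INTEGER, cust INTEGER);
INSERT INTO table1 VALUES (1, 100), (2, 200), (3, 300);
INSERT INTO table2 VALUES (10, 1, 100), (20, 2, 100), (30, 1, 200);
""")

# One level of nesting only: the subquery references a.cust directly and
# keeps the row with the lowest priority value for that customer.
rows = conn.execute("""
SELECT a.field1,
       (SELECT b.pcn
        FROM table2 b
        WHERE b.cust = a.cust
        ORDER BY b.priority
        LIMIT 1) AS pcn
FROM table1 a
ORDER BY a.field1
""").fetchall()

for row in rows:
    print(row)
```

Customer 300 has no rows in table2, so the scalar subquery yields NULL (None in Python) for it rather than dropping the row.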
Yet another option might be a CTE.
Sample data:
SQL> with
2 table1 (field1, cust) as
3 (select 1, 100 from dual union all
4 select 2, 200 from dual
5 ),
6 table2 (pcn, priority, cust) as
7 (select 10, 1, 100 from dual union all
8 select 20, 2, 100 from dual union all
9 select 30, 1, 200 from dual
10 ),
Query begins here. Rank rows by priority, and then fetch the ones that rank as the highest (line #20):
11 temp as
12 (select a.field1,
13 b.pcn,
14 rank() over (partition by a.field1 order by b.priority desc) rnk
15 from table1 a join table2 b on a.cust = b.cust
16 )
17 select field1,
18 pcn
19 from temp
20 where rnk = 1;
FIELD1 PCN
---------- ----------
1 20
2 30
SQL>
You can use the FIRST aggregate function (KEEP (DENSE_RANK FIRST ...)) to get the same result without a nested subquery, assuming the ORDER BY is completely deterministic:
select
a.field1
, (
select max(b.pcn) keep(dense_rank first order by b.priority)
from table2 b
where b.cust = a.cust
) as pcn
from table1 a
which for this sample data
insert into table1 values(1,1);
insert into table1 values(2,2);
insert into table2 values(1,1,1);
insert into table2 values(2,2,1);
returns
FIELD1 PCN
1 1
2 (null)

Oracle join to get max data and a non-grouped column

Consider this part of my query:
SELECT field1, field2, field3, ...
LEFT JOIN (
SELECT field1, field2, MAX(field3) field3
FROM table
WHERE field2 IN ('1','2','3','4')
AND field4 > SYSDATE - 365
GROUP BY field1, field2) jointable ON other.fk= jointable.field1
So field4 is a date. I need the date from table. If I add it to the select list I must add it to the group by and as such it will no longer be grouped in a way to pull the MAX(field3).
I could join table again on their primary keys but that doesn't seem ideal. Is there a way to accomplish this?
You could use the aggregate KEEP DENSE_RANK syntax to get the date associated with the maximum field3 value for each field1/field2 combination:
SELECT field1, field2, field3, ...
LEFT JOIN (
SELECT field1, field2, MAX(field3) field3,
MAX(field4) KEEP (DENSE_RANK LAST ORDER BY field3) field4
FROM table
WHERE field2 IN ('1','2','3','4')
AND field4 > SYSDATE - 365
GROUP BY field1, field2) jointable ON other.fk= jointable.field1
Quick demo of just the subquery, with a CTE for some simple data, where the highest field3 is not on the latest field4 date:
with your_table (field1, field2, field3, field4) as (
select 'A', '1', 1, date '2016-11-01' from dual
union all select 'A', '1', 2, date '2016-09-30' from dual
)
SELECT field1, field2, MAX(field3) field3,
MAX(field4) KEEP (DENSE_RANK LAST ORDER BY field3) field4
FROM your_table
WHERE field2 IN ('1','2','3','4')
AND field4 > SYSDATE - 365
GROUP BY field1, field2
/
F F FIELD3 FIELD4
- - ---------- ----------
A 1 2 2016-09-30
Seems like a window function would work well here...
SELECT field1, field2, field3, ...
LEFT JOIN (
SELECT field1, field2, MAX(MAX(field3)) over (partition by field1, field2) field3, field4
FROM table
WHERE field2 IN ('1','2','3','4')
AND field4 > SYSDATE - 365
GROUP BY field1, field2, field4) jointable ON other.fk= jointable.field1
The max of field3 is now independent of field4 but still dependent on field1 and field2.

return null if no rows found oracle query with IN clause

I have a table with three columns.
I query that table with IN clause.
select column1 from table1 where column1 in (1,2,3) order by column2, column3
table1 contains only the values 1 and 2 in column1. I also want the unavailable value to appear in my result, sorted at the bottom.
Example data:
column1 column2 column3
1 100 11
2 101 50
Expected output, with the unavailable values last:
column1 column2 column3
1 100 11
2 101 50
3 null null
I tried a subquery with NVL, like select nvl((select.. in(1,2,3)),null) from dual, but because of the IN clause I get the "single-row subquery returns more than one row" error, which is expected.
I also tried a union, but nothing works. Any help would be great. Thanks.
I think you can do it with a union all:
select column1 from table1 where column1 in (1,2,3) order by column2, column3
union all
select null from table1 where column1 not in (1,2,3) order by column2, column3
If you can't take the values 1, 2, 3 from another table, you can try:
with t1 as (
select col1,col2,col3
from tab1
where col1 in ('1','2','3')),
t2 as (
select '1' as col1
from dual
union
select '2'
from dual
union
select '3'
from dual)
select t2.col1,t1.col2,t1.col3
from t2
left outer join t1
on t1.col1= t2.col1
order by t1.col2,t1.col3
It's better if you can store 1,2,3 values in a second table, then use left outer join.
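The left outer join against a list of wanted values can be checked with the sample data from the question. A minimal sketch using Python's built-in sqlite3 module; the CTE name wanted is hypothetical, and because SQLite sorts NULLs first by default, an explicit IS NULL sort key is assumed here to push the missing value to the bottom:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (column1 INTEGER, column2 INTEGER, column3 INTEGER);
INSERT INTO table1 VALUES (1, 100, 11), (2, 101, 50);
""")

# Drive the query from the list of wanted values, so a value with no
# matching row still produces one output row padded with NULLs.
rows = conn.execute("""
WITH wanted(column1) AS (VALUES (1), (2), (3))
SELECT w.column1, t.column2, t.column3
FROM wanted w
LEFT JOIN table1 t ON t.column1 = w.column1
ORDER BY t.column1 IS NULL, t.column2, t.column3
""").fetchall()

for row in rows:
    print(row)
```

The unmatched value 3 comes back with NULL (None in Python) for column2 and column3 and sorts after the matched rows.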

remove successive rows in hive

What is an efficient way to remove successive rows with duplicate values in specific fields in Hive? For example:
Input:
ID field1 field2 date
1 a b 2015-01-01
1 a b 2015-01-02
2 e d 2015-01-03
output:
ID field1 field2 date
1 a b 2015-01-01
2 e d 2015-01-03
Thanks in advance
One way to remove successive duplicates is to use lag to check the previous id and only keep rows where the previous id is different:
select * from (
select * ,
lag(id) over (order by date) previous_id
from mytable
) t where t.previous_id <> t.id
or t.previous_id is null -- accounts for the 1st row
If you also need to check field1 and field2, add a separate lag expression for each field and keep a row when any of them differs from the previous row:
select * from (
select * ,
lag(id) over (order by date) previous_id,
lag(field1) over (order by date) previous_field1,
lag(field2) over (order by date) previous_field2
from mytable
) t where t.previous_id is null -- accounts for the 1st row
or t.previous_id <> t.id
or t.previous_field1 <> t.field1
or t.previous_field2 <> t.field2
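The lag-based filter can be tried locally on the sample input. A minimal sketch using Python's built-in sqlite3 module, assuming a SQLite build with window functions (3.25 or later); the fourth row is an added illustration showing that a repeat which is not directly successive is kept:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id INTEGER, field1 TEXT, field2 TEXT, date TEXT);
INSERT INTO mytable VALUES
    (1, 'a', 'b', '2015-01-01'),
    (1, 'a', 'b', '2015-01-02'),
    (2, 'e', 'd', '2015-01-03'),
    (1, 'a', 'b', '2015-01-04');
""")

# Compare each row with the row just before it (ordered by date) and keep
# it when it is the first row or when any compared field has changed.
rows = conn.execute("""
SELECT id, field1, field2, date FROM (
    SELECT *,
           LAG(id)     OVER (ORDER BY date) AS previous_id,
           LAG(field1) OVER (ORDER BY date) AS previous_field1,
           LAG(field2) OVER (ORDER BY date) AS previous_field2
    FROM mytable
) t
WHERE t.previous_id IS NULL
   OR t.previous_id <> t.id
   OR t.previous_field1 <> t.field1
   OR t.previous_field2 <> t.field2
ORDER BY date
""").fetchall()

for row in rows:
    print(row)
```

Only the 2015-01-02 row is removed, because it is the only one whose id, field1 and field2 all match the row immediately before it.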

Self Join Oracle

I have a table, table1; the data looks like this.
Column1 is a foreign key to another table.
Column1 Column2 Column3
1 A 06/MAY/14
1 A 05/MAY/14
1 B 06/MAY/14
1 B 01/JAN/00
1 A 01/JAN/00
Now I want to find the distinct column1 values that meet the following conditions:
1. At least one record where column2 is A and column3 is (sysdate - 1)
AND
2. At least one record where column2 is B and column3 is (sysdate - 1)
That is, at least one A record and one B record should have column3 populated with (sysdate - 1).
I have written the query below; please tell me if I'm doing anything wrong.
I also want to know whether this is the right way of joining. The table contains around 50K records, so performance should be fine, I guess.
SELECT DISTINCT COLUMN1 FROM
TABLE1 A
JOIN
TABLE1 B ON (A.COLUMN1 = B.COLUMN1)
WHERE
((TRUNC(A.COLUMN3) - TRUNC(A.COLUMN3) = 0)
AND TRUNC(A.COLUMN3) = TRUNC(SYSDATE - 1)
AND TRUNC(B.COLUMN3) = TRUNC(SYSDATE - 1)
AND A.COLUMN2 = 'A'
AND B.COLUMN2 = 'B'
AND TO_CHAR(A.COLUMN3, 'DD-MON-YY') != '01-JAN-00'
AND TO_CHAR(B.COLUMN3, 'DD-MON-YY') != '01-JAN-00'
);
For a performance comparison, here is one with subselects and GROUP BY:
SELECT COLUMN1 FROM (
SELECT
COLUMN1,
COUNT(COLUMN2) CNT
FROM (
SELECT DISTINCT
COLUMN1,
COLUMN2
FROM TABLE1
WHERE TRUNC(COLUMN3) = TRUNC(SYSDATE - 1) AND
(COLUMN2 = 'A' OR COLUMN2 = 'B'))
GROUP BY COLUMN1)
WHERE CNT = 2
This should work:
SELECT DISTINCT A.column1 -- Obtain distinct from A
FROM table1 A -- TableA
join table1 B -- TableB
ON A.column1 = B.column1 -- Joining them on Column1
WHERE TRUNC(A.column3) = TRUNC(SYSDATE - 1) -- Yesterday's data on Table A
AND TRUNC(B.column3) = TRUNC(SYSDATE - 1) -- Yesterday's data on Table B
AND A.column2 = 'A' -- A values
AND B.column2 = 'B'; -- B Values
Note: No distinctness in your test case. So try with a unique key.
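The same condition can also be expressed with a single scan and a HAVING clause instead of a self join. A minimal sketch using Python's built-in sqlite3 module, with made-up rows for two keys and SQLite's date() assumed as a stand-in for Oracle's TRUNC; yesterday's date is computed once in Python and passed as a parameter:

```python
import sqlite3
from datetime import date, timedelta

yesterday = (date.today() - timedelta(days=1)).isoformat()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column1 INTEGER, column2 TEXT, column3 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        (1, "A", yesterday),     # key 1 has both A and B dated yesterday
        (1, "B", yesterday),
        (1, "A", "2000-01-01"),
        (2, "A", yesterday),     # key 2 has only A dated yesterday
        (2, "B", "2000-01-01"),
    ],
)

# Keep yesterday's A/B rows, then require both letters to be present per key.
rows = conn.execute(
    """
    SELECT column1
    FROM table1
    WHERE column2 IN ('A', 'B')
      AND date(column3) = ?
    GROUP BY column1
    HAVING COUNT(DISTINCT column2) = 2
    ORDER BY column1
    """,
    (yesterday,),
).fetchall()

print(rows)
```

Only key 1 qualifies, since key 2 has no B record dated yesterday.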
