Hive - Split delimited columns over multiple rows, select based on position - hadoop

I'm looking for a way to split columns that hold comma-delimited data. Below is my dataset:
id col1 col2
1 5,6 7,8
I want to get the result
id col1 col2
1 5 7
1 6 8
The position of the index should match because I need to fetch results accordingly.
I tried the below query but it returns the cartesian product.
Query:
SELECT col3, col4
FROM test ext
lateral VIEW explode(split(col1,'\002')) col1 AS col3
lateral VIEW explode(split(col2,'\002')) col2 AS col4
Result:
id col1 col2
1 5 7
1 5 8
1 6 7
1 6 8

You can use posexplode() to create position index columns for your split arrays. Then, select only those rows where the position indices are equal.
SELECT id, col3, col4
FROM test
lateral VIEW posexplode(split(col1,'\002')) col1 AS pos3, col3
lateral VIEW posexplode(split(col2,'\002')) col2 AS pos4, col4
WHERE pos3 = pos4;
Output:
id col3 col4
1 5 7
1 6 8
Reference: Hive language manual - posexplode()
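To see why filtering on equal positions removes the cartesian product, the same logic can be modelled outside Hive. The Python sketch below is only an illustration: the helper names `posexplode` and `split_aligned` are made up here, and a comma is assumed as the delimiter, matching the sample data rather than the '\002' used in the queries.

```python
from itertools import product


def posexplode(values):
    """Mimic Hive's posexplode(): return (position, value) pairs, positions start at 0."""
    return list(enumerate(values))


def split_aligned(row_id, col1, col2, sep=","):
    """Cross join the two exploded columns, then keep rows whose positions match."""
    left = posexplode(col1.split(sep))
    right = posexplode(col2.split(sep))
    return [
        (row_id, v3, v4)
        for (pos3, v3), (pos4, v4) in product(left, right)
        if pos3 == pos4  # plays the role of WHERE pos3 = pos4
    ]


rows = split_aligned(1, "5,6", "7,8")
print(rows)  # [(1, '5', '7'), (1, '6', '8')]
```

If the two lists have different lengths, the elements without a partner at the same position are dropped, which is also what the WHERE clause does in the Hive query.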

Related

Retrieving based on specific condition

I have a little complex requirement on couple of tables which I am finding hard to crack.
There are 2 tables. TableA and TableB
TableA has a structure like:
-------------------------------------
ID COL1 COL2 CAT
-------------------------------------
1 RecAA RecAB 3
2 RecBA RecBB 3
3 RecCA RecCB 2
4 RecDA RecDB 2
5 RecEA RecEB 1
-------------------------------------
TableB has a structure like:
-----------------
COL3 TYPE
-----------------
RecAA 10
RecAA 11
RecAA 12
RecAB 10
RecAB 11
RecAB 12
RecAB 13
RecAB 14
RecBA 10
RecBA 11
RecBA 14
RecBA 15
RecBB 10
-----------------
Requirements:
Records in TableA should have CAT = 3.
Either COL1 or COL2 of TableA should be available in COL3 of TableB.
COL3 must have all of the TYPEs 10, 11, 12 and no other TYPE.
That is, as per the above requirements:
Of the records available in TableA, records with ID 1 and 2 have CAT = 3.
Both records have at least one value in COL3 of TableB. (The record with ID 1 in TableA has both COL1 and COL2 in TableB, and the record with ID 2 in TableA has COL1 in TableB.)
RecAA has TYPE 10, 11, 12 and only 10, 11, 12, so it doesn't matter whether RecAB has 10, 11, 12 or not. But neither RecBA nor RecBB has exactly the TYPEs 10, 11, 12.
Therefore the result should be:
-------------------------------------
ID COL1 COL2 CAT
-------------------------------------
1 RecAA RecAB 3
-------------------------------------
What I tried:
WITH TEMP AS (SELECT COL3 FROM TableB GROUP BY COL3 HAVING SUM(CASE WHEN TYPE IN ('10','11','12') THEN 1 ELSE 0 END) = 0)
SELECT S.ID, S.COL1, S.COL2, S.CAT FROM TableA S
INNER JOIN TEMP T ON S.COL1 = T.COL3
WHERE S.CAT = 3;
Can someone please help on achieving this?
I think you're almost there, it's just your row selection in the CTE that seems problematic, and I think you need an OR:
WITH TEMP AS (
SELECT COL3
FROM TableB
GROUP BY COL3
HAVING SUM(POWER(2, TYPE - 10)) = 7 AND COUNT(*) = 3
)
SELECT
S.ID, S.COL1, S.COL2, S.CAT
FROM
TableA S
INNER JOIN TEMP T ON S.COL1 = T.COL3 OR S.COL2 = T.COL3
WHERE
S.CAT = 3;
I've subtracted 10 from each of your TYPEs to turn 10, 11, 12 into 0, 1, 2, and then used POWER to turn those into 1, 2 and 4, which sum to 7 (in other words, 10, 11, 12 become 2^(10-10), 2^(11-10) and 2^(12-10), which are 1, 2 and 4).
I also require a count of 3; the only way to reach 7 with three powers of 2 is 1+2+4, which guarantees that 10, 11 and 12 are all present. If anything were missing, extra or repeated, it wouldn't be three numbers that sum to 7.
RecAB is excluded because, even though it has 10, 11, 12, it also has 13 and 14.
You also seemed to be saying that COL3 should be present in either COL1 or COL2 of TableA, hence the OR in the join.
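The arithmetic behind the HAVING clause can be checked on the sample data. The following Python sketch is an illustration only: the function name `exactly_10_11_12` is made up here, and TYPE is assumed to be numeric.

```python
from collections import defaultdict

# (COL3, TYPE) rows copied from the TableB sample in the question.
TABLE_B = [
    ("RecAA", 10), ("RecAA", 11), ("RecAA", 12),
    ("RecAB", 10), ("RecAB", 11), ("RecAB", 12), ("RecAB", 13), ("RecAB", 14),
    ("RecBA", 10), ("RecBA", 11), ("RecBA", 14), ("RecBA", 15),
    ("RecBB", 10),
]


def exactly_10_11_12(rows):
    """Return the COL3 values whose TYPEs are exactly 10, 11 and 12."""
    total = defaultdict(int)  # SUM(POWER(2, TYPE - 10)) per COL3
    count = defaultdict(int)  # COUNT(*) per COL3
    for col3, type_ in rows:
        total[col3] += 2 ** (type_ - 10)
        count[col3] += 1
    return sorted(c for c in total if total[c] == 7 and count[c] == 3)


print(exactly_10_11_12(TABLE_B))  # ['RecAA']
```

On this data the sums are 7 for RecAA, 31 for RecAB, 51 for RecBA and 1 for RecBB, so only RecAA passes both conditions.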
You can use the analytic version of listagg to turn the TYPE column into a type_in_list column, like below:
With temp_TableB (COL3, type_in_list) as (
SELECT distinct COL3, listagg(TYPE, ',') within group (order by TYPE)over(partition by COL3)
FROM TableB
)
select tA.*
--, tb.*
from tableA tA
INNER JOIN temp_TableB tB on (tA.COL1 = tB.COL3 or tA.COL2 = tB.COL3)
Where tA.CAT = 3
AND tB.TYPE_IN_LIST = '10,11,12'
;
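As a rough cross-check of the listagg approach, the same steps can be modelled in Python on the sample tables. The names `type_lists` and `matching_rows` are invented for this sketch, and TYPE is assumed to be numeric so that sorting matches the order 10, 11, 12.

```python
# Sample rows copied from the question: (ID, COL1, COL2, CAT) and (COL3, TYPE).
TABLE_A = [
    (1, "RecAA", "RecAB", 3),
    (2, "RecBA", "RecBB", 3),
    (3, "RecCA", "RecCB", 2),
    (4, "RecDA", "RecDB", 2),
    (5, "RecEA", "RecEB", 1),
]
TABLE_B = [
    ("RecAA", 10), ("RecAA", 11), ("RecAA", 12),
    ("RecAB", 10), ("RecAB", 11), ("RecAB", 12), ("RecAB", 13), ("RecAB", 14),
    ("RecBA", 10), ("RecBA", 11), ("RecBA", 14), ("RecBA", 15),
    ("RecBB", 10),
]


def type_lists(rows):
    """Build one sorted, comma-separated TYPE list per COL3, like listagg."""
    grouped = {}
    for col3, type_ in rows:
        grouped.setdefault(col3, []).append(type_)
    return {c: ",".join(str(t) for t in sorted(ts)) for c, ts in grouped.items()}


def matching_rows(table_a, table_b, wanted="10,11,12", cat=3):
    """Keep TableA rows with the given CAT whose COL1 or COL2 has exactly the wanted list."""
    good = {c for c, joined in type_lists(table_b).items() if joined == wanted}
    return [r for r in table_a if r[3] == cat and (r[1] in good or r[2] in good)]


print(matching_rows(TABLE_A, TABLE_B))  # [(1, 'RecAA', 'RecAB', 3)]
```

One difference from the SQL: a join with OR can return a TableA row twice when both COL1 and COL2 qualify, whereas this sketch returns each row at most once.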

PL/SQL Oracle :- Dynamically UNPIVOT ORACLE TABLE on passing a value

I have a table as below with data:-
Item COL1 COL2 COL3 COL4 COL5 COL6 .... COL 30
A 1 1 2 3 4 2 5 2
B 2 6 4 3 5 2 5 1
C 3 4 5 2 2 2 4 2
D 4 5 2 23 45 3 3 3
F 5 3 1 11 23 34 34 1
and I need to UNPIVOT it depending on the value I give. If I give 4, the table is unpivoted up to COL4; if I give 7, it is unpivoted up to COL7, making it dynamic. I have written a simple SQL query but can't find a way to make it dynamic:
SELECT * FROM (
WITH
WIDE AS (
SELECT
/*+ PARALLEL(128) */
ITEM, COL1, COL2, COL3, COL4, COL5, COL6, COL7
FROM TAB
WHERE ITEM = 'A'
)
SELECT
/*+ PARALLEL(128) */
ITEM
FROM WIDE
UNPIVOT INCLUDE NULLS
(QTY FOR SCOL IN
(COL1, COL2, COL3, COL4, COL5, COL6,
COL7
)
)
);
Why don't you unpivot all possible columns and then restrict the dataset with the where clause:
SELECT ITEM, SCOL, QTY
FROM WIDE
UNPIVOT INCLUDE NULLS
(QTY FOR SCOL IN (COL1, ..., COL 30))
WHERE TO_NUMBER(SUBSTR(SCOL,4)) <= 7 -- 7 Should be replaced with your parameter
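The idea of unpivoting everything and then filtering on the number embedded in the column name can be sketched in Python. This is an illustration only: the function name `unpivot` is made up, and only COL1 to COL6 of item A from the sample table are used.

```python
def unpivot(row, max_col):
    """Turn one wide row into (ITEM, SCOL, QTY) tuples for COL1..COL<max_col>."""
    out = []
    for name, qty in row.items():
        if name == "ITEM":
            continue
        # Same test as TO_NUMBER(SUBSTR(SCOL, 4)) <= max_col in the query.
        if int(name[3:]) <= max_col:
            out.append((row["ITEM"], name, qty))
    return out


row_a = {"ITEM": "A", "COL1": 1, "COL2": 1, "COL3": 2, "COL4": 3, "COL5": 4, "COL6": 2}
print(unpivot(row_a, 4))
# [('A', 'COL1', 1), ('A', 'COL2', 1), ('A', 'COL3', 2), ('A', 'COL4', 3)]
```

Because the limit is an ordinary parameter, the list of unpivoted columns never has to change.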

How to filter 1 column on 'unique', and second column on 'most occurring' in Google Spreadsheets

I want to filter 2 large columns in Google Spreadsheets.
The outcome from column 1 must show only the unique values.
The outcome from column 2 must show the most frequently occurring value for each unique value from column 1.
Example dataset:
NL 1
NL 1
NL 2
NL 3
BE 2
BE 2
BE 4
BE 2
USA 6
USA 5
USA 6
USA 6
FR 5
FR 4
FR 2
FR 3
FR 1
FR 2
LUX 2
the outcome would be:
NL 1
BE 2
USA 6
FR 2
LUX 2
The formula is:
=ArrayFormula(VLOOKUP(UNIQUE(FILTER(A:A,A:A<>"")),QUERY({A:B,A:A},"select Col1, Col2, count(Col3) where Col1 <> '' group by Col1, Col2 order by count(Col3) desc"),{1,2},0))
Sample file:
https://docs.google.com/spreadsheets/d/1LwRiKACY4Myp_1NtkFkvtTy1xRvNPx0u5e-Pi0SuzGI/edit?usp=sharing
If your locale settings differ from the US, use this formula:
=ArrayFormula(VLOOKUP(UNIQUE(FILTER(A:A;A:A<>""));QUERY({A:B\A:A};"select Col1, Col2, count(Col3) where Col1 <> '' group by Col1, Col2 order by count(Col3) desc");{1\2};0))
The number of times some values occur
This part of the formula counts all the data and puts the most frequent values at the top:
QUERY({A:B,A:A},"select Col1, Col2, count(Col3) where Col1 <> '' group by Col1, Col2 order by count(Col3) desc")
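The combination of counting, sorting by count and then looking up the first match per key can be modelled in Python. This is a sketch under assumptions: the name `most_frequent_per_key` is made up, and ties are broken by first appearance here, which is a choice of this sketch rather than a statement about how the spreadsheet breaks ties.

```python
from collections import Counter

# (column A, column B) pairs copied from the example dataset.
DATA = [
    ("NL", 1), ("NL", 1), ("NL", 2), ("NL", 3),
    ("BE", 2), ("BE", 2), ("BE", 4), ("BE", 2),
    ("USA", 6), ("USA", 5), ("USA", 6), ("USA", 6),
    ("FR", 5), ("FR", 4), ("FR", 2), ("FR", 3), ("FR", 1), ("FR", 2),
    ("LUX", 2),
]


def most_frequent_per_key(pairs):
    """Return {key: most frequent value}, keys in order of first appearance."""
    counts = Counter(pairs)  # group by Col1, Col2 and count
    # order by count desc; sorted() is stable, so ties keep their first-seen order
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    best = {}
    for (key, value), _ in ranked:
        best.setdefault(key, value)  # first hit per key wins, like an exact-match VLOOKUP
    unique_keys = dict.fromkeys(key for key, _ in pairs)  # like UNIQUE(A:A)
    return {key: best[key] for key in unique_keys}


print(most_frequent_per_key(DATA))
# {'NL': 1, 'BE': 2, 'USA': 6, 'FR': 2, 'LUX': 2}
```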

how to use Posexplode function in hive

I am using posexplode to split a single record into multiple records in Hive.
Along with the multiple output records, I need to generate a sequence number for each row.
col1, col2, col3 and col4 are defined as string because we occasionally get alphabetic data as well.
col1 | col2| col3 | col4
---------------------------
7 | 9 | A | 3
5 | 6 | 9
Seq | Col
----------
1 | 7
2 | 9
3 | A
4 | 3
1 | 5
2 | 6
3 | 9
I am using the query below, but I am getting this error:
-bash: syntax error near unexpected token (
My query is :
SELECT
seq, col
FROM
(SELECT array( col1, col2 , col3,col4) as arr_r FROM srctable ) arrayrec
LATERAL VIEW posexplode(arrayrec) EXPLODED_rec as seq, col
How can this be resolved?
I am able to run this query successfully:
SELECT col FROM
(SELECT array( col1, col2 , col3,col4)
as arr_r FROM srctable ) arrayrec
LATERAL VIEW explode(arrayrec) EXPLODED_rec as col
which produces the output below:
Col
-----
7
9
A
3
5
6
9
I have checked the link : How to get first n elements in an array in Hive
Try
SELECT Seq, col FROM
(SELECT array( col1, col2 , col3,col4)
as arr_r FROM srctable ) arrayrec
LATERAL VIEW posexplode(arrayrec.arr_r) EXPLODED_rec as Seq, col;
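Also check your Hive version. posexplode() is available as of Hive 0.13.0. Note that the positions returned by posexplode() start at 0, so select Seq + 1 if the sequence should start at 1.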
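The numbering that posexplode() produces can be illustrated with Python's `enumerate`, which is also 0-based. The function name `explode_with_seq` is made up for this sketch, and the second sample row is modelled as a three-element tuple because the question shows only three values for it.

```python
def explode_with_seq(rows):
    """Explode each row into (Seq, Col) pairs, restarting Seq at 1 for every row."""
    out = []
    for row in rows:
        for pos, col in enumerate(row):  # pos starts at 0, like posexplode()
            out.append((pos + 1, col))   # shift to the 1-based Seq the question asks for
    return out


SRC = [("7", "9", "A", "3"), ("5", "6", "9")]
print(explode_with_seq(SRC))
# [(1, '7'), (2, '9'), (3, 'A'), (4, '3'), (1, '5'), (2, '6'), (3, '9')]
```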

Need to transform the rows into columns for the similar ID's in oracle

I need to transform rows into columns for similar IDs in Oracle,
e.g.
The following is the result I get if I query my database:
Col1 Col2 Col3
---- ---- ----
1 ABC Yes
1 XYZ NO
2 ABC NO
I need to transform this into
Col1 Col2 Col3 Col4 Col5
---- ---- ---- ---- ----
1 ABC Yes XYZ No
2 ABC NO NULL NULL
Someone please help me in solving this issue
Thanks,
Siv
Based on AskTom:
select Col1,
max( decode( rn, 1, Col2 ) ) Col_1,
max( decode( rn, 1, Col3 ) ) Col_2,
max( decode( rn, 2, Col2 ) ) Col_3,
max( decode( rn, 2, Col3 ) ) Col_4
from (
select Col1,
Col2,
Col3,
row_number() over (partition by Col1 order by Col2 desc nulls last) rn
from MyTable
)
group by Col1;
I don't have access to an Oracle db to test it, but I think that will work. If there could be more than two records per ID, then you could just add more lines to the select clause with the corresponding row number.
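Since the query above is untested, its logic can at least be traced on the sample rows with a small Python model. This is only a sketch: the name `pivot_two` is made up, it handles at most two rows per Col1 like the query, and it assumes Col2 is never NULL.

```python
from itertools import groupby

# (Col1, Col2, Col3) rows copied from the question.
ROWS = [(1, "ABC", "Yes"), (1, "XYZ", "NO"), (2, "ABC", "NO")]


def pivot_two(rows):
    """Pivot up to two rows per Col1 into one row of five columns."""
    out = []
    ordered = sorted(rows, key=lambda r: r[0])
    for col1, group in groupby(ordered, key=lambda r: r[0]):
        # row_number() over (partition by Col1 order by Col2 desc)
        ranked = sorted(group, key=lambda r: r[1], reverse=True)
        first = ranked[0]                                        # rn = 1
        second = ranked[1] if len(ranked) > 1 else (col1, None, None)  # rn = 2
        out.append((col1, first[1], first[2], second[1], second[2]))
    return out


print(pivot_two(ROWS))
# [(1, 'XYZ', 'NO', 'ABC', 'Yes'), (2, 'ABC', 'NO', None, None)]
```

Because the ranking is by Col2 descending, the XYZ row lands in the first pair of columns. Ordering by Col2 ascending instead would put ABC first, as in the layout the question asks for.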
One solution is to use the 10g MODEL clause:
select col1
, col2
, col3
, col4
, col5
from t23
model
return updated rows
partition by ( col1 )
dimension by ( row_number() over ( partition by col1
order by col2 desc nulls last) rnk
)
measures (col2, col3, lpad(' ',4) col4, lpad(' ',4) col5)
rules upsert
(
col2 [0] = col2 [1]
, col3 [0] = col3 [1]
, col4 [0] = col2 [2]
, col5 [0] = col3 [2]
)
/
COL1 COL2 COL3 COL4 COL5
---------- ---- ---- ---- ----
1 ABC Yes ABC NO
2 XYZ NO
It is an unfortunate truth about such solutions that we need to specify the number of columns in the query. That is, regular SQL has no mechanism for determining that the table contains three rows where COL1 = 1 and that we therefore need seven columns, which is not unreasonable. For situations in which the number of pivot values is unknown at the time of coding, there is always dynamic SQL.
