How to select minimum values for duplicate ids using hive

How to select minimum values for duplicate ids using hive - hadoop

Can someone please help me on this.
I have data like this
**id,age,name**
10,25,abc
10,35,def
20,45,ghi
20,55,jkl
20,65,mno
30,40,pqr
30,50,stu
30,70,vwr
40,20,yza
40,25,fdf
40,25,dgh
40,20,sfs
Now I want to get the final result as below
+------+------+
| id | age |
+------+------+
| 10 | 25 |
| 20 | 45 |
| 30 | 40 |
| 40 | 20 |
| 40 | 20 |
+------+------+
I am able to do this in mysql but as hive do not support multiple arguments in sub query so I am not able to get desired result in hive.
I tried doing this using hive join but no success.
Thanks in advance for help!!

select id
,age
from (select id
,age
,rank () over
(
partition by id
order by age
) as rnk
from mytable
) t
where t.rnk = 1
+----+-----+
| id | age |
+----+-----+
| 10 | 25 |
| 20 | 45 |
| 30 | 40 |
| 40 | 20 |
| 40 | 20 |
+----+-----+

Other way to implement expected output.
SELECT id,
age
FROM
(SELECT id,
age
FROM tblname) a LEFT SEMI
JOIN
(SELECT id,
MIN(age) age
FROM tblName
GROUP BY id) b ON a.id=b.id
AND a.age=b.age

Related

Join 3 or more tables with Comma separated values in Oracle

I've 3 tables (One parent and 2 childs) as below
Student(Parent)
StudentID | Name | Age
1 | AA | 23
2 | BB | 25
3 | CC | 27
Book(child 1)
BookID | SID | BookName | BookPrice
1 | 1 | ABC | 20
2 | 1 | XYZ | 15
3 | 3 | LMN | 34
4 | 3 | DDD | 90
Pen(child 2)
PenID | SID | PenBrandName | PenPrice
1 | 2 | LML | 20
2 | 1 | PARKER | 15
3 | 2 | CELLO | 34
4 | 3 | LML | 90
I need to join the tables and get an output as Below
StudentID | Name | Age | BookNames | TotalBookPrice | PenBrands | TotalPenPrice
1 | AA | 23 | ABC, XYZ | 35 | PARKER | 15
2 | BB | 25 | null | 00 | LML, CELLO | 54
3 | CC | 27 | LMN, DDD | 124 | LML | 90
This is the code i tried :
Select s.studentID as "StudentID", s.name as "Name", s.age as "AGE",
LISTAGG(b.bookName, ',') within group (order by b.bookID) as "BookNames",
SUM(b.bookPrice) as "TotalBookPrice",
LISTAGG(p.penBrandName, ',') within group (order by p.penID) as "PenBrands",
SUM(p.penPrice) as "TotalPenPrice"
FROM Student s
LEFT JOIN BOOK b ON b.SID = s.StudentID
LEFT JOIN PEN p ON p.SID = s.StudentID
GROUP BY s.studentID, s.name, s.age
The result i get has multiple values of Book and Pen (cross product result in multiple values)
StudentID | Name | Age | BookNames | TotalBookPrice | PenBrands | TotalPenPrice
1 | AA | 23 | ABC,ABC,XYZ,XYZ | 35 | PARKER,PARKER | 15
Please let me know how to fix this.

Instead of Joining the tables and doing aggregation, You have to aggregate first and then join your tables -
Select s.studentID as "StudentID", s.name as "Name", s.age as "AGE",
"BookNames",
"TotalBookPrice",
"PenBrands",
"TotalPenPrice"
FROM Student s
LEFT JOIN (SELECT SID, LISTAGG(b.bookName, ',') within group (order by b.bookID) as "BookNames",
SUM(b.bookPrice) as "TotalBookPrice"
FROM BOOK
GROUP BY SID) b ON b.SID = s.StudentID
LEFT JOIN (SELECT SID, LISTAGG(p.penBrandName, ',') within group (order by p.penID) as "PenBrands",
SUM(p.penPrice) as "TotalPenPrice"
FROM PEN
GROUP BY SID) p ON p.SID = s.StudentID;

How to remove duplicate values from SQL inner join tables?

I have two tables:
Table 1:
+-----------+-----------+------------------+
| ID | Value | other |
+-----------+-----------+------------------+
| 123456 | 5 | 12 |
| 987654 | 7 | 15 |
| 456789 | 6 | 22 |
+-----------+-----------+------------------+
Table 2:
+-----------+-----------+------------------+
| ID | Type | other |
+-----------+-----------+------------------+
| 123456 | 00 | 2 |
| 123456 | 01 | 6 |
| 123456 | 02 | 4 |
| 987654 | 00 | 7 |
| 987654 | 01 | 8 |
| 456789 | 00 | 6 |
| 456789 | 01 | 16 |
+-----------+-----------+------------------+
Now I perform the inner join:
SELECT
table1.ID, table2.TYPE, table1.value, table2.other
FROM
table1 INNER JOIN table2 ON table1.ID = table2.ID
Here the SQLfiddle
Result Table:
+-----------+-----------+---------+------------------+
| ID | Type | Value | other |
+-----------+-----------+---------+------------------+
| 123456 | 00 | 5 | 2 |
| 123456 | 01 | 5 | 6 |
| 123456 | 02 | 5 | 4 |
| 987654 | 00 | 7 | 7 |
| 987654 | 01 | 7 | 8 |
| 456789 | 00 | 6 | 6 |
| 456789 | 01 | 6 | 16 |
+-----------+-----------+---------+------------------+
This is totally what I expected but not what I need.
Because if I now want to get the Value per ID the Value gets doubled or tripled for the first cause.
Desired Table:
+-----------+-----------+---------+------------------+
| ID | Type | Value | other |
+-----------+-----------+---------+------------------+
| 123456 | 00 | 5 | 2 |
| 123456 | 01 | - | 6 |
| 123456 | 02 | - | 4 |
| 987654 | 00 | 7 | 7 |
| 987654 | 01 | - | 8 |
| 456789 | 00 | 6 | 6 |
| 456789 | 01 | - | 16 |
+-----------+-----------+---------+------------------+
I tried to achieve a similar output by counting the rows per id and dividing the sum of Value by that count but it did not seem to work and is not the desired output.
Also, I tried grouping but this did not seem to achieve the desired output.
One thing to mention is that the DB I am working with is an ORACLE SQL DB.

How about this:
select table1.id
, table2.type
, case
when row_number() over (partition by table1.id order by table2.type) = 1
then table1.value
end as "VALUE"
, table2.other
from table1
join table2 on table1.id = table2.id
order by 1, 2;
(This is Oracle SQL syntax. Your SQL Fiddle (thanks!) was set to MySQL, which as far as I know doesn't have analytic functions like row_number().)

A way to get the result.
select t1.ID,
t2.type,
t1.value,
t2.other
from table1 t1 inner join table2 t2
ON t1.ID = t2.ID
inner join (select ID, min(type) mv
from table2
group by id) m
on t2.id = m.id
and t2.type = m.mv
union all
select t1.ID,
t2type,
null,
t2.other
from table1 t1 inner join table2 t2
ON t1.ID = t2.ID
and not exists (
select 1 from (
select ID, min(type) mv
from table2
group by id) m
where t2.id = m.id
and t2.type = m.mv
)
order by id,type

You can use a CASE block to display NULL for which it is NOT equal to MIN value of type
SELECT table1.ID,
table2.TYPE,
CASE
WHEN table2.TYPE =
MIN (table2.TYPE)
OVER (PARTITION BY table1.id ORDER BY table2.TYPE)
THEN
Table1.VALUE
END
VALUE,
table2.other
FROM table1 INNER JOIN table2 ON table1.ID = table2.ID;

Hive - same sequence of records in array

I have a table with data at hour level. I want to find the count of hours and the values for col1 and col2 for all hours in an array. Input Table
+-----+-----+-----+
| hour| col1| col2|
+-----+-----+-----+
| 00 | 0.0 | a |
| 04 | 0.1 | b |
| 08 | 0.2 | c |
| 12 | 0.0 | d |
+-----+-----+-----+
I am using the below query to get the column values in an array
Query:
select count(hr), map_values(str_to_map(concat_ws(',',collect_set(concat_ws(':',reflect('java.util.UUID','randomUUID'),cast(col1 as string)))))) as col1_arr, map_values(str_to_map(concat_ws(',',collect_set(concat_ws(':',reflect('java.util.UUID','randomUUID'),cast(col2 as string)))))) as col2_arr from table;
Output that i am getting, values in col2_arr are not in the same sequence with col1_arr. Please suggest how can i get the values in array/list for different columns in same sequence.
+----------+-----------------+----------+
| count(hr)| col1_arr | col2_arr |
+----------+-----------------+----------+
| 4 | 0.0,0.1,0.2,0.0 | b,a,c,d |
+----------+----------------+-----------+
Required output:
+----------+-----------------+----------+
| count(hr)| col1_arr | col2_arr |
+----------+-----------------+----------+
| 4 | 0.0,0.1,0.2,0.0 | a,b,c,d |
+----------+----------------+-----------+
Thanks

select count(*) as cnt
,concat_ws(',',sort_array(collect_list(hour))) as hour
,regexp_replace(concat_ws(',',sort_array(collect_list(concat_ws(':',hour,cast(col1 as string))))),'..:','') as col1
,regexp_replace(concat_ws(',',sort_array(collect_list(concat_ws(':',hour,col2)))),'..:','') as col2
from mytable
;
+-----+-------------+-------------+---------+
| cnt | hour | col1 | col2 |
+-----+-------------+-------------+---------+
| 4 | 00,04,08,12 | 0,0.1,0.2,0 | a,b,c,d |
+-----+-------------+-------------+---------+

Hive Query for ROlling total based on 2 fields

I have a table a show below
Date | Customer | Count | Daily_Count | ITD_Count
d1 | A | 3 | 3 |
d2 | B | 4 | 4 |
d3 | A | 7 | 16 |
d3 | B | 9 | 16 |
d4 | A | 8 | 9 |
d4 | B | 1 | 9 |
Descrption of Fields:
Date : date
customer : name of customer
Count : # of customers
daily_Count : # of customers on daily basis calculated as
SUM(count) OVER (partition BY date )as Daily_Count
Question :
How do I calculate the Running Total or Rolling Total in the ITD_Count ?
The output should look like
Date | Customer | Count | Daily_Count | ITD_Count
d1 | A | 3 | 3 | 3
d2 | B | 4 | 4 | 7
d3 | A | 7 | 16 | 23
d3 | B | 9 | 16 | 23
d4 | A | 8 | 9 | 31
d4 | B | 1 | 9 | 31
I have tried several variations of using the Window functionality.. But hit a road-block in all my attempts.
Attempt 1 ;
SUM(daily_COunt) OVER (partition BY date order by date rows between unbounded preceding and current row ) as ITD_account_linking
Attempt 2 :
SUM(daily_COunt) OVER (partition BY date, daily_count order by date rows between unbounded preceding and current row ) as ITD_account_linking
and several more attempts following this. :(
Any possible suggestions to guide me in the right direction are welcome.
Please let me know if you need more details.

Use Hive Windowing and Analytics functions.
SELECT Date, Customer, Count, Daily_Count,
SUM(Daily_Count) OVER (ORDER BY Date ROWS UNBOUNDED PRECEDING) AS ITD_Count
FROM table;

Oracle select two (or more) adjacent rows having the same value for a given column

How do I do the following in Oracle:
I have a (simplified) table:
+-----+-----+-----+
| a | b | ... |
+-----+-----+-----+
| 1 | 7 | ... |
| 2 | 5 | ... |
| 1 | 7 | ... |
+-----+-----+-----+
Where a functions as a unique identifier for a person, and b is the field I am interested in matching across rows. How do I construct a query that basically says "give me the person-ID's where the person has multiple b values (i.e., duplicates)"?
So far I have tried:
SELECT a FROM mytable GROUP BY a HAVING COUNT(DISTINCT b) > 1;
This feels close except it just gives me the user IDs where the user has multiple unique b's, which I suspect is coming from the DISTINCT part, but I'm not sure how to change the query to achieve what I want.

Try
group by a,b having count(b) > 1
Yours would count 7,5,7 as 2 (one 7, one 5). This one one will count total Bs in any grouping, so you'll get 1,7 - > 2 and 1,5 -> 1

SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE mytable ( a, b ) AS
SELECT LEVEL, LEVEL FROM DUAL CONNECT BY LEVEL <= 2000
UNION ALL
SELECT LEVEL *2, LEVEL * 2 FROM DUAL CONNECT BY LEVEL <= 1000;
Query 1:
WITH data AS (
SELECT a
FROM mytable
GROUP BY a
HAVING COUNT(b) > COUNT( DISTINCT b )
ORDER BY a
),
numbered AS (
SELECT a,
ROWNUM AS rn
FROM data
)
SELECT a
FROM numbered
WHERE rn <= 20
Results:
| A |
|----|
| 2 |
| 4 |
| 6 |
| 8 |
| 10 |
| 12 |
| 14 |
| 16 |
| 18 |
| 20 |
| 22 |
| 24 |
| 26 |
| 28 |
| 30 |
| 32 |
| 34 |
| 36 |
| 38 |
| 40 |

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

How to select minimum values for duplicate ids using hive - hadoop

select id ,age from (select id ,age ,rank () over ( partition by id order by age ) as rnk from mytable ) t where t.rnk = 1 +----+-----+ | id | age | +----+-----+ | 10 | 25 | | 20 | 45 | | 30 | 40 | | 40 | 20 | | 40 | 20 | +----+-----+

Other way to implement expected output. SELECT id, age FROM (SELECT id, age FROM tblname) a LEFT SEMI JOIN (SELECT id, MIN(age) age FROM tblName GROUP BY id) b ON a.id=b.id AND a.age=b.age

Related

Join 3 or more tables with Comma separated values in Oracle

How to remove duplicate values from SQL inner join tables?

Hive - same sequence of records in array

Hive Query for ROlling total based on 2 fields

Oracle select two (or more) adjacent rows having the same value for a given column

Categories

Resources