Only keep distinct rows when doing collect_set over a moving window function in Hive - hadoop

Let's say I have a Hive table that has 3 columns: merchant_id, week_id, acc_id. My goal is to collect the unique customers in the previous 4 weeks for each week, and I am using a moving window to do this.
My code:
Create a test table:
CREATE TABLE table_test_test (merchant_id INT, week_id INT, acc_id INT);
INSERT INTO TABLE table_test_test VALUES
(1,0,8),
(1,0,9),
(1,0,10),
(1,2,1),
(1,2,2),
(1,2,4),
(1,4,1),
(1,4,3),
(1,4,4),
(1,5,1),
(1,5,3),
(1,5,5),
(1,6,1),
(1,6,5),
(1,6,6);
Then do the collect:
select
merchant_id,
week_id,
collect_set(acc_id) over (partition by merchant_id ORDER BY week_id RANGE BETWEEN 4 preceding AND 0 preceding) as uniq_accs_prev_4_weeks
from
table_test_test
The result table is:
merchant_id week_id uniq_accs_prev_4_weeks
1 0 []
1 0 []
1 0 []
1 2 [9,8,10]
1 2 [9,8,10]
1 2 [9,8,10]
1 4 [9,8,10,1,2,4]
1 4 [9,8,10,1,2,4]
1 4 [9,8,10,1,2,4]
1 5 [1,2,4,3]
1 5 [1,2,4,3]
1 5 [1,2,4,3]
1 6 [1,2,4,3,5]
1 6 [1,2,4,3,5]
1 6 [1,2,4,3,5]
As you can see, there are redundant rows in the table. This is just an example; in my actual case the table is huge, and the redundancy causes memory problems.
I have tried using distinct and group by, but neither works.
Is there a good way to do this? Thanks a lot.

Distinct works fine:
select distinct merchant_id, week_id, uniq_accs_prev_4_weeks
from
(
select
merchant_id,
week_id,
collect_set(acc_id) over (partition by merchant_id ORDER BY week_id RANGE BETWEEN 4 preceding AND current row) as uniq_accs_prev_4_weeks
from
table_test_test
)s;
Result:
OK
1 0 [9,8,10]
1 2 [9,8,10,1,2,4]
1 4 [9,8,10,1,2,4,3]
1 5 [1,2,4,3,5]
1 6 [1,2,4,3,5,6]
Time taken: 98.088 seconds, Fetched: 5 row(s)
My Hive does not accept 0 preceding, so I replaced it with current row. This seems to be a known bug; my Hive version is 1.2. Yours should work fine with distinct added in the outer query.
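If the input table is really huge, it may also help to deduplicate the rows before the window is applied, so the window function has fewer rows to scan; collect_set ignores duplicates anyway. A sketch on the test table above (not benchmarked; whether it helps depends on your data):
select distinct merchant_id, week_id, uniq_accs_prev_4_weeks
from
(
select
merchant_id,
week_id,
collect_set(acc_id) over (partition by merchant_id ORDER BY week_id RANGE BETWEEN 4 preceding AND current row) as uniq_accs_prev_4_weeks
from
-- pre-deduplicate (merchant_id, week_id, acc_id) to shrink the window input
(select distinct merchant_id, week_id, acc_id from table_test_test) d
)s;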

Related

How to convert column values to multiple insert rows - Oracle cursor

I am trying to copy values from our old DB to a new DB where there is a change in the table structure.
Below is the structure of the old table:
Table1
Table1ID, WheelCount, BlindCount, OtherCount
For example, the values of Table1 look like this:
Table1ID, 1, 2, 5
Table1 has now changed to TableNew with the structure below:
TableNewID, DisableID, Quantity
So the values should be:
TableNewID1, 1, 1 (here the quantity 1 = WheelCount from Table1)
TableNewID2, 2, 2 (here 2 = BlindCount)
TableNewID3, 3, 5 (here 5 = OtherCount)
How do I write a cursor to transform the Table1 values into the new TableNew structure?
Table1
Table1ID WheelCount BlindCount OtherCount
1 1 2 5
2 8 10 15
A master table is defined to map DisableID:
DisableID Type
1 wheelCount
2 blindcount
3 otherCount
Expected structure:
ID Table1ID DISABLEID QUANTITY
1 1 1 1
2 1 2 2
3 1 3 5
4 2 1 8
5 2 2 10
6 2 3 15
The simplest is a UNION ALL for each column you want to turn into a row.
insert into tablenew
select table1id,1,wheelcount from table1
union all
select table1id,2,blindcount from table1
union all
select table1id,3,othercount from table1
There are other, sleeker methods for avoiding multiple passes on the first table, in case it's huge.
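For example, on Oracle 11g and later, UNPIVOT reads table1 only once; a sketch (the literal after each AS becomes the disableid value):
insert into tablenew
select table1id, disableid, quantity
from table1
-- turn the three count columns into (disableid, quantity) pairs in one scan
unpivot (quantity for disableid in (wheelcount as 1, blindcount as 2, othercount as 3));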
This is how I understood it.
Current table contents:
SQL> SELECT * FROM table1;
TABLE1ID WHEELCOUNT BLINDCOUNT OTHERCOUNT
---------- ---------- ---------- ----------
1 1 2 5
2 8 10 15
Prepare new table:
SQL> CREATE TABLE tablenew
2 (
3 id NUMBER,
4 table1id NUMBER,
5 disableid NUMBER,
6 quantity NUMBER
7 );
Table created.
Sequence (will be used to populate tablenew.id column):
SQL> CREATE SEQUENCE seq_dis;
Sequence created.
Trigger (which actually populates tablenew.id):
SQL> CREATE OR REPLACE TRIGGER trg_bi_tn
2 BEFORE INSERT
3 ON tablenew
4 FOR EACH ROW
5 BEGIN
6 :new.id := seq_dis.NEXTVAL;
7 END;
8 /
Trigger created.
Copy data:
SQL> INSERT INTO tablenew (table1id, disableid, quantity)
2 SELECT table1id, 1 disableid, wheelcount AS quantity FROM table1
3 UNION ALL
4 SELECT table1id, 2 disableid, blindcount AS quantity FROM table1
5 UNION ALL
6 SELECT table1id, 3 disableid, othercount AS quantity FROM table1;
6 rows created.
Result:
SQL> SELECT *
2 FROM tablenew
3 ORDER BY table1id, disableid;
ID TABLE1ID DISABLEID QUANTITY
---------- ---------- ---------- ----------
1 1 1 1
3 1 2 2
5 1 3 5
2 2 1 8
4 2 2 10
6 2 3 15
6 rows selected.
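On Oracle 12c and later, an identity column could replace the sequence and trigger pair; a sketch:
CREATE TABLE tablenew
(
  id        NUMBER GENERATED ALWAYS AS IDENTITY,
  table1id  NUMBER,
  disableid NUMBER,
  quantity  NUMBER
);
The INSERT that copies the data stays the same, since it never mentions the id column.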

Full Date Range

First time posting here, so sorry if I messed something up.
I'm trying to figure out how many units per day are being used across multiple products, given a date range.
So if I had a table like this:
Product_id Start_date End_date Units
1 07/07/2021 07/09/2021 2
2 07/08/2021 07/10/2021 4
3 07/12/2021 07/12/2021 7
The output should be something like:
Date Units
07/07/2021 2
07/08/2021 6
07/09/2021 6
07/10/2021 4
07/11/2021 0
07/12/2021 7
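For reference, the sample rows can be loaded like this (a sketch; it assumes a table named test with DATE columns, which is what the answer below queries):
CREATE TABLE test (product_id NUMBER, start_date DATE, end_date DATE, units NUMBER);
INSERT INTO test VALUES (1, DATE '2021-07-07', DATE '2021-07-09', 2);
INSERT INTO test VALUES (2, DATE '2021-07-08', DATE '2021-07-10', 4);
INSERT INTO test VALUES (3, DATE '2021-07-12', DATE '2021-07-12', 7);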
Here's one option; read comments within code.
SQL> with
2 calendar as
3 -- all dates between the first START_DATE and the last END_DATE.
4 -- You need it for outer join with data you have.
5 (select mindat + level - 1 as datum
6 from (select min(start_date) mindat,
7 max(end_date) maxdat
8 from test
9 )
10 connect by level <= maxdat - mindat + 1
11 )
12 -- final query
13 select c.datum,
14 nvl(sum(t.units), 0) units
15 from calendar c left join test t on c.datum between t.start_date and t.end_date
16 group by c.datum
17 order by c.datum;
DATUM UNITS
---------- ----------
07/07/2021 2
07/08/2021 6
07/09/2021 6
07/10/2021 4
07/11/2021 0
07/12/2021 7
6 rows selected.
SQL>

How do I add an alias based on the values of a table

As in my title: for example, I have a table A that has values from 1 to 10.
I want to select values 1 and 2 under a column named "First", and values 3 and 4 under a column named "Second", etc.
It should look like this:
First Second
1 3
2 4
1 4
Thanks!
Using CASE, perhaps?
SQL> with test as
2 (select level val from dual
3 connect by level <= 5
4 )
5 select case when val <= 2 then val end first,
6 case when val > 2 then val end second
7 from test;
FIRST SECOND
---------- ----------
1
2
3
4
5
SQL>
It would help, though, if you provided sample data and explained what to do with values that aren't contained in (1, 2, 3, 4).
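If the intent is to pair the n-th "low" value with the n-th "high" value on the same row, row_number can line the two groups up; a sketch that assumes only values 1 through 4 matter:
with test as
  (select level val from dual
   connect by level <= 4
  ),
-- number the low values (1, 2) and high values (3, 4) independently
lows  as (select val, row_number() over (order by val) rn from test where val <= 2),
highs as (select val, row_number() over (order by val) rn from test where val > 2)
select l.val as first, h.val as second
from lows l join highs h on l.rn = h.rn;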

How do you shift values down in a column in an Oracle table?

Given the following Oracle database table:
group revision comment
1 1 1
1 2 2
1 null null
2 1 1
2 2 2
2 3 3
2 4 4
2 null null
3 1 1
3 2 2
3 3 3
3 null null
I want to shift the comment column one step down in relation to revision, within its group, so that I get the following table:
group revision comment
1 1 null
1 2 1
1 null 2
2 1 null
2 2 1
2 3 2
2 4 3
2 null 4
3 1 null
3 2 1
3 3 2
3 null 3
I have the following query:
MERGE INTO example_table t1
USING example_table t2
ON (
(t1.revision = t2.revision+1 OR
(t2.revision = (
SELECT MAX(t3.revision)
FROM example_table t3
WHERE t3.group = t1.group
) AND t1.revision IS NULL)
)
AND t1.group = t2.group)
WHEN MATCHED THEN UPDATE SET t1.comment = t2.comment;
That does most of this (still need a separate query to cover revision = 1), but it is very slow.
So my question is, how do I use Max here as efficiently as possible to pull out the highest revision for each group?
I would use lag, not max.
create table example_table(group_id number, revision number, comments varchar2(40));
insert into example_table values (1,1,1);
insert into example_table values (1,2,2);
insert into example_table values (1,3,null);
insert into example_table values (2,1,1);
insert into example_table values (2,2,2);
insert into example_table values (2,3,3);
insert into example_table values (2,4,null);
select * from example_table;
-- lag(comments, 1) picks up the previous row's comment within each group;
-- "nulls last" sorts a null revision after the highest one, and nvl()
-- makes the join null-safe for the question's null revisions.
merge into example_table e
using (select group_id, revision, comments,
              lag(comments, 1) over (partition by group_id order by revision nulls last) comments1
       from example_table) u
on (u.group_id = e.group_id and nvl(u.revision, 0) = nvl(e.revision, 0))
when matched then update set comments = u.comments1;
select * from example_table;
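If I traced the lag correctly, the final select should show each comment shifted down one revision within its group (derived by hand, not captured from a session): group 1 becomes (1, 1, null), (1, 2, 1), (1, 3, 2), and group 2 becomes (2, 1, null), (2, 2, 1), (2, 3, 2), (2, 4, 3).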

How to find the nearest neighbor in Hive? Any windowing function?

Given a table
$cat data.csv
ID,State,City,Price,Flag
1,CA,A,95,0
2,CA,A,96,1
3,CA,A,195,1
4,NY,B,124,0
5,NY,B,128,1
6,NY,C,24,0
7,NY,C,27,1
8,NY,C,29,0
9,NY,C,39,1
Expected Result:
ID0, ID1
1,2
4,5
6,7
8,7
For each ID with Flag=0 above, we want to find another ID with Flag=1, with the same "State" and "City", and the nearest Price.
I have two rough stupid ideas:
Method 1.
Use a left outer join with the table itself on
(a.State=b.State and a.City=b.city and a.Flag=0 and b.Flag=1),
where a.Flag=0 and b.Flag=1,
and then use RANK() over (partition by a.State, a.City order by abs(a.Price - b.Price)) as rank
where rank=1
Method 2.
Use a left outer join with the table itself,
on
(a.State=b.State and a.City=b.city and a.Flag=0 and b.Flag=1),
where a.Flag=0 and b.Flag=1,
and then Use Distribute by a.State,a.City Sort by Price_Diff ASC limit 1
What's the best way to find the nearest neighbor in Hive?
Any valuable tips will be greatly appreciated!
select a.id, b.id, min(abs(b.price-a.price)) as delta
from data as a
inner join data as b
on a.state=b.state and
a.flag=0 and b.flag=1 and
a.city=b.city
group by a.id, b.id
order by delta asc;
This returns
1 2 1 <---
8 7 2 <---
6 7 3 <---
4 5 4 <---
8 9 10
6 9 15
1 3 100
The problem is that the last 3 rows reuse ids that already appear in the first 4.
select a.id as id0, b.id as id1, abs(b.price-a.price) as delta,
rank() over ( partition by a.state, a.city order by abs(b.price-a.price) )
from data as a
inner join data as b
on a.state=b.state and
a.flag=0 and b.flag=1 and
a.city=b.city;
This will return
id0 id1 prc rank
1 2 1 1 <---
1 3 100 2
4 5 4 1 <---
8 7 2 1 <---
6 7 3 2
8 9 10 3
6 9 15 4
We are missing 6,7, and this is somewhat expected.
6,NY,C,24,0
7,NY,C,27,1
8,NY,C,29,0
9,NY,C,39,1
The lowest price difference among (6,7), (6,9), (8,7), (8,9) is (8,7), so within the NY/C partition only (8,7) gets rank 1. (The join is ambiguous.)
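If you want exactly one nearest Flag=1 match per Flag=0 row, so that both (6,7) and (8,7) survive, you can rank per Flag=0 id instead of per city and keep only rank 1; a sketch:
select id0, id1
from (
  -- rank candidate matches per Flag=0 id by absolute price difference
  select a.id as id0, b.id as id1,
         rank() over (partition by a.id order by abs(b.price - a.price)) as rnk
  from data as a
  inner join data as b
  on a.state=b.state and
  a.flag=0 and b.flag=1 and
  a.city=b.city
) t
where rnk = 1;
Note that rank() would still return two rows on an exact price tie; row_number() would force a single one.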
I think you will love this video on this topic: Big Data Analytics Using Window Functions.
