How to find the nearest neighbor in Hive? Any windowing function? - hadoop

Given a table
$cat data.csv
ID,State,City,Price,Flag
1,CA,A,95,0
2,CA,A,96,1
3,CA,A,195,1
4,NY,B,124,0
5,NY,B,128,1
6,NY,C,24,0
7,NY,C,27,1
8,NY,C,29,0
9,NY,C,39,1
Expected Result:
ID0, ID1
1,2
4,5
6,7
8,7
For each ID with Flag=0 above, we want to find another ID with Flag=1 that has the same "State" and "City" and the nearest Price.
I have two rough stupid ideas:
Method 1.
Use a left outer join with the table itself on
(a.State=b.State and a.City=b.city and a.Flag=0 and b.Flag=1),
where a.Flag=0 and b.Flag=1,
and then use RANK() over (partitioned by a.State,a.City order by a.Price - b.Price) as rank
where rank=1
Method 2.
Use a left outer join with the table itself,
on
(a.State=b.State and a.City=b.city and a.Flag=0 and b.Flag=1),
where a.Flag=0 and b.Flag=1,
and then Use Distribute by a.State,a.City Sort by Price_Diff ASC limit 1
What's the best way to find the nearest neighbor in Hive?
Any valuable tips will be greatly appreciated!

select a.id, b.id, min(abs(b.price-a.price)) as delta
from data as a
inner join data as b
on a.state=b.state and
a.flag=0 and b.flag=1 and
a.city=b.city
group by a.id, b.id
order by delta asc;
This returns
1 2 1 <---
8 7 2 <---
6 7 3 <---
4 5 4 <---
8 9 10
6 9 15
1 3 100
The problem is that the last 3 rows reuse IDs that already appear in the first 4.
select a.id as id0, b.id as id1, abs(b.price-a.price) as delta,
rank() over ( partition by a.state, a.city order by abs(b.price-a.price) )
from data as a
inner join data as b
on a.state=b.state and
a.flag=0 and b.flag=1 and
a.city=b.city;
This will return
id0 id1 delta rank
1 2 1 1 <---
1 3 100 2
4 5 4 1 <---
8 7 2 1 <---
6 7 3 2
8 9 10 3
6 9 15 4
The pair (6,7) is missing from the rank-1 rows, and given how the rank is partitioned this is expected.
6,NY,C,24,0
7,NY,C,27,1
8,NY,C,29,0
9,NY,C,39,1
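The rank is computed per (state, city), so the pairs (6,7), (6,9), (8,7) and (8,9) all compete in the same partition. The lowest price difference among them is in (8,7), so only that pair gets rank 1 (ambiguous join).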
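One way to get a match for every Flag=0 row is to rank per a.id instead of per (state, city). The following is a minimal sketch, not the answer's own method: it runs the SQL through Python's sqlite3 module as a stand-in for Hive (Hive also has ROW_NUMBER() OVER), and the tie-break on b.id is an assumption, since the question does not say how ties should be resolved.

```python
import sqlite3

# Sample rows from the question: (id, state, city, price, flag)
ROWS = [
    (1, "CA", "A", 95, 0), (2, "CA", "A", 96, 1), (3, "CA", "A", 195, 1),
    (4, "NY", "B", 124, 0), (5, "NY", "B", 128, 1),
    (6, "NY", "C", 24, 0), (7, "NY", "C", 27, 1),
    (8, "NY", "C", 29, 0), (9, "NY", "C", 39, 1),
]

# Rank the Flag=1 candidates separately for each Flag=0 row (PARTITION BY a.id),
# ordered by absolute price difference; b.id breaks ties (an assumption).
QUERY = """
SELECT id0, id1
FROM (
    SELECT a.id AS id0,
           b.id AS id1,
           ROW_NUMBER() OVER (
               PARTITION BY a.id
               ORDER BY ABS(b.price - a.price), b.id
           ) AS rn
    FROM data AS a
    JOIN data AS b
      ON a.state = b.state AND a.city = b.city
    WHERE a.flag = 0 AND b.flag = 1
) AS ranked
WHERE rn = 1
ORDER BY id0
"""


def nearest_neighbors():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data (id INTEGER, state TEXT, city TEXT, price INTEGER, flag INTEGER)"
    )
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(nearest_neighbors())
```

On the sample data this pairs 1 with 2, 4 with 5, 6 with 7 and 8 with 7, which is the expected result from the question.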
I think you will love this video on the topic: Big Data Analytics Using Window Functions

Related

I need a Select for Max(Version)

I have two Tables with a foreign key from t1.ID to T2.T_ID
T1:
ID PR_ID Version
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 4 1
T2:
ID T_ID ab_nr
1 1 56
2 2 3
3 3 76
4 4 4
5 5 87
6 6 64
I need a select that returns all T2.IDs belonging to the highest T1.Version of each PR_ID. For example, PR_ID 2 and PR_ID 3 each appear with two different versions, so the end result should contain only the T1.IDs 1, 3, 5 and 6.
I tried it with:
SELECT * FROM T2
JOIN T1 ON T1.ID = T2.T_ID
WHERE T1.Version IN (SELECT MAX(VERSION) FROM T1);
but this doesn't work because it only returns the rows with Version 2 and nothing else.
There are always many ways to skin a SQL cat, but here's a simple one.
SELECT t2.*
FROM t1
INNER JOIN t2 ON t2.t_id = t1.id
WHERE NOT EXISTS ( SELECT 'higher version for the same PR_ID'
FROM t1 t1x
WHERE t1x.pr_id = t1.pr_id
AND t1x.version > t1.version )
That is, add a NOT EXISTS condition to filter out any results that are for old versions.
The way you tried to do it was on the right track, but you just needed to correlate your MAX(VERSION) subquery so that it got the max version for the current PR_ID. Like this:
SELECT * FROM T2
JOIN T1 ON T1.ID = T2.T_ID
WHERE T1.Version IN (SELECT MAX(T1X.VERSION) FROM T1 T1X
-- You missed this part, below
WHERE T1X.PR_ID = T1.PR_ID
);
Anyway, try either of these. If performance is not good, we can start looking at more efficient ways of doing it (e.g., MAX ... KEEP)
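To check the correlated-subquery idea against the sample tables, here is a small sketch that runs it in an in-memory SQLite database via Python's sqlite3 module. SQLite is only a stand-in for whatever database the question uses, and the function name is made up for illustration.

```python
import sqlite3


def latest_version_t2_ids():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE t1 (id INTEGER PRIMARY KEY, pr_id INTEGER, version INTEGER);
        CREATE TABLE t2 (id INTEGER PRIMARY KEY, t_id INTEGER, ab_nr INTEGER);
        INSERT INTO t1 VALUES (1,1,1),(2,2,1),(3,2,2),(4,3,1),(5,3,2),(6,4,1);
        INSERT INTO t2 VALUES (1,1,56),(2,2,3),(3,3,76),(4,4,4),(5,5,87),(6,6,64);
    """)
    # The subquery is correlated on pr_id, so each t1 row is compared with
    # the highest version of its own PR_ID, not with the global maximum.
    rows = conn.execute("""
        SELECT t2.id
        FROM t2
        JOIN t1 ON t1.id = t2.t_id
        WHERE t1.version = (SELECT MAX(t1x.version)
                            FROM t1 AS t1x
                            WHERE t1x.pr_id = t1.pr_id)
        ORDER BY t2.id
    """).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(latest_version_t2_ids())
```

With the sample data this returns the IDs 1, 3, 5 and 6, as requested in the question.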

only keep distinct rows when doing collect_set over a moving windowing function in hive

Let's say I have a Hive table that has 3 columns: merchant_id, week_id, acc_id. My goal is to collect the unique customers in the previous 4 weeks for each week, and I am using a moving window to do this.
My code:
create a test table:
CREATE TABLE table_test_test (merchant_id INT, week_id INT, acc_id INT);
INSERT INTO TABLE table_test_test VALUES
(1,0,8),
(1,0,9),
(1,0,10),
(1,2,1),
(1,2,2),
(1,2,4),
(1,4,1),
(1,4,3),
(1,4,4),
(1,5,1),
(1,5,3),
(1,5,5),
(1,6,1),
(1,6,5),
(1,6,6)
Then do the collect:
select
merchant_id,
week_id,
collect_set(acc_id) over (partition by merchant_id ORDER BY week_id RANGE BETWEEN 4 preceding AND 0 preceding) as uniq_accs_prev_4_weeks
from
table_test_test
The result table is :
merchant_id week_id uniq_accs_prev_4_weeks
1 1 0 []
2 1 0 []
3 1 0 []
4 1 2 [9,8,10]
5 1 2 [9,8,10]
6 1 2 [9,8,10]
7 1 4 [9,8,10,1,2,4]
8 1 4 [9,8,10,1,2,4]
9 1 4 [9,8,10,1,2,4]
10 1 5 [1,2,4,3]
11 1 5 [1,2,4,3]
12 1 5 [1,2,4,3]
13 1 6 [1,2,4,3,5]
14 1 6 [1,2,4,3,5]
15 1 6 [1,2,4,3,5]
As you can see, there are redundant rows in the table. This is just an example; in my actual case the table is huge and the redundancy causes memory problems.
I have tried using distinct and group by but neither of these works.
Is there a good way to do it? Thanks a lot.
Distinct works well:
select distinct merchant_id, week_id, uniq_accs_prev_4_weeks
from
(
select
merchant_id,
week_id,
collect_set(acc_id) over (partition by merchant_id ORDER BY week_id RANGE BETWEEN 4 preceding AND current row) as uniq_accs_prev_4_weeks
from
table_test_test
)s;
Result:
OK
1 0 [9,8,10]
1 2 [9,8,10,1,2,4]
1 4 [9,8,10,1,2,4,3]
1 5 [1,2,4,3,5]
1 6 [1,2,4,3,5,6]
Time taken: 98.088 seconds, Fetched: 5 row(s)
My Hive does not accept 0 preceding, so I replaced it with current row. It looks like a known Hive bug; my Hive version is 1.2. Yours should work fine with distinct added in the outer query.
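To make the intended semantics concrete, here is a plain-Python sketch of the same computation: one set per (merchant_id, week_id), covering that week and the 4 weeks before it. It only illustrates what the deduplicated result should contain; it is not how Hive evaluates the window, and the function and parameter names are made up.

```python
from collections import defaultdict

# Sample rows from the question: (merchant_id, week_id, acc_id)
ROWS = [
    (1, 0, 8), (1, 0, 9), (1, 0, 10),
    (1, 2, 1), (1, 2, 2), (1, 2, 4),
    (1, 4, 1), (1, 4, 3), (1, 4, 4),
    (1, 5, 1), (1, 5, 3), (1, 5, 5),
    (1, 6, 1), (1, 6, 5), (1, 6, 6),
]


def uniq_accs_prev_weeks(rows, lookback=4):
    # First collapse the input to one set of accounts per (merchant, week).
    by_key = defaultdict(set)
    for merchant_id, week_id, acc_id in rows:
        by_key[(merchant_id, week_id)].add(acc_id)

    # Then, for each (merchant, week), union the sets of all weeks that fall
    # in the range [week - lookback, week] for the same merchant.
    result = {}
    for merchant_id, week_id in by_key:
        window = set()
        for (other_merchant, other_week), accs in by_key.items():
            if other_merchant == merchant_id and week_id - lookback <= other_week <= week_id:
                window |= accs
        result[(merchant_id, week_id)] = window
    return result


for key, accs in sorted(uniq_accs_prev_weeks(ROWS).items()):
    print(key, sorted(accs))
```

The output has one entry per (merchant_id, week_id), which is the shape the select distinct query above produces.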

Can I update a particular attribute of a tuple with the same attribute of another tuple of same table? If possible what should be the algorithm?

Suppose I have a table with 10 records/tuples. I want to update an attribute of the 6th record with the same attribute of the 1st record, the 7th with that of the 2nd, the 8th with the 3rd, the 9th with the 4th, and the 10th with the 5th, all in one go, i.e. without using a cursor or loop. Any number of temporary tables may be used. What is the strategy to do so?
PostgreSQL (and probably other RDBMSes) let you use self-joins in UPDATE statements just as you can in SELECT statements:
UPDATE tbl
SET attr = t2.attr
FROM tbl t2
WHERE tbl.id = t2.id + 5
AND tbl.id >= 6
This would be easy with an update-with-join but Oracle doesn't do that and the closest substitute can be very tricky to get to work. Here is the easiest way. It involves a subquery to get the new value and a correlated subquery in the where clause. It looks complicated but the set subquery should be self-explanatory.
The where subquery really only has one purpose: it connects the two tables, much as the on clause would do if we could do a join. Except that the field used from the main table (the one being updated) must be a key field. As it turns out, with the self "join" being performed below, they are both the same field, but it is required.
Add to the where clause other restraining criteria, as shown.
update Tuples t1
set t1.Attr =(
select t2.Attr
from Tuples t2
where t2.Attr = t1.Attr - 5 )
where exists(
select t2.KeyVal
from Tuples t2
where t1.KeyVal = t2.KeyVal)
and t1.Attr > 5;
SqlFiddle is pulling a hissy fit right now, so here is the data used:
create table Tuples(
KeyVal int not null primary key,
Attr int
);
insert into Tuples
select 1, 1 from dual union all
select 2, 2 from dual union all
select 3, 3 from dual union all
select 4, 4 from dual union all
select 5, 5 from dual union all
select 6, 6 from dual union all
select 7, 7 from dual union all
select 8, 8 from dual union all
select 9, 9 from dual union all
select 10, 10 from dual;
The table starts out looking like this:
KEYVAL ATTR
------ ----
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
with this result:
KEYVAL ATTR
------ ----
1 1
2 2
3 3
4 4
5 5
6 1
7 2
8 3
9 4
10 5
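Note that the update above matches rows through Attr, which works here because Attr equals KeyVal in the sample data. The sketch below shows the same correlated-subquery pattern matched on the key instead, so it does not depend on the attribute values. It runs in SQLite through Python's sqlite3 module as a stand-in for Oracle, and the attribute values (key times 10) are invented so that keys and attributes differ.

```python
import sqlite3


def shift_attr_by_five():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tuples (keyval INTEGER PRIMARY KEY, attr INTEGER)")
    # Hypothetical data: attr = keyval * 10, so attr never equals keyval.
    conn.executemany(
        "INSERT INTO tuples VALUES (?, ?)", [(k, k * 10) for k in range(1, 11)]
    )
    # Row k (k > 5) takes the attr of row k - 5. The EXISTS guard keeps rows
    # without a source row from being overwritten with NULL.
    conn.execute("""
        UPDATE tuples
        SET attr = (SELECT t2.attr
                    FROM tuples AS t2
                    WHERE t2.keyval = tuples.keyval - 5)
        WHERE keyval > 5
          AND EXISTS (SELECT 1
                      FROM tuples AS t2
                      WHERE t2.keyval = tuples.keyval - 5)
    """)
    rows = conn.execute("SELECT keyval, attr FROM tuples ORDER BY keyval").fetchall()
    conn.close()
    return rows


print(shift_attr_by_five())
```

Rows 1 to 5 are only read, never written, so the order in which the database applies the update does not affect the outcome.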

Oracle convert DECODE to PIVOT or force use of index

I have a very complex SQL view definition that has been inherited and requires altering to improve performance. It takes a list of records based on a foreign key and displays the rows returned as columns.
Thus :-
Data from select using RANK
ID RANK DKEY RECORD1 RECORD2 RECORD3
1 1 1 003 Rob Emmerry
1 2 2 004 Sue Emmerry
Returns
ID REC11 REC12 REC13 REC21 REC22 REC23
1 003 Rob Emmerry 004 Sue Emmerry
There are 37 columns of data repeated for each returned row, up to a max of 5.
Using
SELECT ID,
MIN(DECODE(ranking,1,RECORD1, NULL)) AS REC11,
MIN(DECODE(ranking,1,RECORD2, NULL)) AS REC12,
MIN(DECODE(ranking,1,RECORD3, NULL)) AS REC13,
MIN(DECODE(ranking,1,RECORD4, NULL)) AS REC14,
MIN(DECODE(ranking,1,RECORD5, NULL)) AS REC15,
MIN(DECODE(ranking,1,RECORD6, NULL)) AS REC16,
MIN(DECODE(ranking,2,RECORD1, NULL)) AS REC21,
MIN(DECODE(ranking,2,RECORD2, NULL)) AS REC22,
MIN(DECODE(ranking,2,RECORD3, NULL)) AS REC23,
MIN(DECODE(ranking,2,RECORD4, NULL)) AS REC24,
MIN(DECODE(ranking,2,RECORD5, NULL)) AS REC25,
MIN(DECODE(ranking,2,RECORD6, NULL)) AS REC26
FROM
(
SELECT ID, RANK () OVER (PARTITION BY id ORDER BY dkey) ranking,
RECORD1,
RECORD2,
RECORD3,
RECORD4,
RECORD5,
RECORD6
FROM TABLEA
JOIN
(SELECT ID, DKEY, RECORD4, RECORD5, RECORD6
FROM TABLEB
) ON TABLEB.DKEY = TABLEA.DKEY AND TABLEB.ID = TABLEA.ID
)
GROUP BY ID;
When I look at the explain plan while filtering on the DKEY field, which has an index, the index is ignored, presumably because of the min/decode statements.
So I thought about rewriting this using PIVOT but don't know how to start.
Any thoughts as to how I can
a) Get the query to use the index
b) Rewrite using PIVOT
First option is obviously preferable.
Thanks
Craig
UPDATE
Here is some sample data showing how my tables are.
Table 1
DKEY PID RECORD1 RECORD2 RECORD3
1 1 3 Rob Emmerry
2 1 4 Sue Emmerry
3 1 4 Jan Morris
4 1 4 Sue Pye
5 1 4 Jane Taylor
Table 2
CID DKEY RECORD10
1 3 A
2 3 D
3 3 G
4 3 J
5 4 A
6 5 A
7 5 D
8 6 A
9 6 D
10 6 G
11 7 A
12 7 D
13 7 G
14 7 J
15 7 M
Table 3
QID DKEY RECORD3
1 3 C
2 6 C
3 6 F
4 7 C
5 7 F
So tables 2 & 3 link to table 1 with DKEY
If we took the DKEY=3 as an example I would want to see this:-
PID DKEY REC1 REC2 REC3 REC4 REC5 REC6 REC7 REC8 REC9 REC10 REC11 REC12 REC13
1 3 4 Jan Morris A D G J NULL C NULL NULL NULL NULL
There could be up to 5 rows in each of tables 2 & 3. Fields PID, DKEY, REC1-REC3 from table 1, REC4-REC8 come from table 2 and the rest from table 3. The other records from table 1 would simply continue on the row so after REC13, DKEY=4 etc etc.
Hope this makes sense.
SELECT
ID,
MIN(DECODE(ranking,1,RECORD1, NULL)) AS REC11,
MIN(DECODE(ranking,1,RECORD2, NULL)) AS REC12,
MIN(DECODE(ranking,1,RECORD3, NULL)) AS REC13,
MIN(DECODE(ranking,1,RECORD4, NULL)) AS REC14,
MIN(DECODE(ranking,1,RECORD5, NULL)) AS REC15,
MIN(DECODE(ranking,1,RECORD6, NULL)) AS REC16,
MIN(DECODE(ranking,2,RECORD1, NULL)) AS REC21,
MIN(DECODE(ranking,2,RECORD2, NULL)) AS REC22,
MIN(DECODE(ranking,2,RECORD3, NULL)) AS REC23,
MIN(DECODE(ranking,2,RECORD4, NULL)) AS REC24,
MIN(DECODE(ranking,2,RECORD5, NULL)) AS REC25,
MIN(DECODE(ranking,2,RECORD6, NULL)) AS REC26
FROM
(
SELECT /*+ INDEX(tablea tablea_index) */
ID,
RANK () OVER (PARTITION BY id ORDER BY dkey) ranking,
RECORD1,
RECORD2,
RECORD3,
RECORD4,
RECORD5,
RECORD6
FROM TABLEA
JOIN TABLEB
-- was: ON TABB.DKEY = TABLEA.DKEY AND TABB ON TABB.ID = TABLEA.ID
ON TABLEB.DKEY = TABLEA.DKEY
AND TABLEB.ID = TABLEA.ID
)
GROUP BY ID;
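For readers unfamiliar with the MIN(DECODE(...)) idiom, here is a small sketch of the same rows-to-columns pivot written with CASE, which behaves like DECODE(ranking, n, col, NULL) for this purpose. It uses SQLite via Python's sqlite3 module as a stand-in for Oracle, only the three record columns from the first sample in the question, and a single table. It says nothing about the index question, which depends on Oracle's optimizer.

```python
import sqlite3

# Rank the rows of each id by dkey, then spread rank 1 and rank 2 into
# separate column groups. MIN ignores the NULLs produced by the CASE.
PIVOT_SQL = """
SELECT id,
       MIN(CASE WHEN ranking = 1 THEN record1 END) AS rec11,
       MIN(CASE WHEN ranking = 1 THEN record2 END) AS rec12,
       MIN(CASE WHEN ranking = 1 THEN record3 END) AS rec13,
       MIN(CASE WHEN ranking = 2 THEN record1 END) AS rec21,
       MIN(CASE WHEN ranking = 2 THEN record2 END) AS rec22,
       MIN(CASE WHEN ranking = 2 THEN record3 END) AS rec23
FROM (
    SELECT id,
           RANK() OVER (PARTITION BY id ORDER BY dkey) AS ranking,
           record1, record2, record3
    FROM tablea
) AS ranked
GROUP BY id
"""


def pivot_rows():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tablea (id INTEGER, dkey INTEGER, "
        "record1 TEXT, record2 TEXT, record3 TEXT)"
    )
    conn.executemany(
        "INSERT INTO tablea VALUES (?, ?, ?, ?, ?)",
        [(1, 1, "003", "Rob", "Emmerry"), (1, 2, "004", "Sue", "Emmerry")],
    )
    rows = conn.execute(PIVOT_SQL).fetchall()
    conn.close()
    return rows


print(pivot_rows())
```

The two input rows for ID 1 come back as a single row with the rank-1 values followed by the rank-2 values, matching the "Returns" example in the question.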

select attribute mysql

I have three MySQL tables:
**product (id,name)**
1 Samsung
2 Toshiba
3 Sony
**attribute (id,name,parentid)**
1 Size 0
2 19" 1
3 17" 1
4 15" 1
5 Color 0
6 White 5
7 Black 5
8 Price 0
9 <$100 8
10 $100-$300 8
11 >$300 8
**attribute2product (id,productid,attributeid)**
1 1 2
2 1 6
3 2 2
4 2 7
5 3 3
6 3 7
7 1 9
8 2 9
9 3 10
And listed them like:
**Size**
-- 19" (2)
-- 17" (1)
-- 15" (0)
**Color**
-- White (1)
-- Black (2)
**Price**
-- <$100 (1)
-- $100-$300 (1)
-- >$300 (1)
Please help me with a MySQL query that lists each attribute name and counts the number of products that have this attribute. E.g., when Size 19" (attribute.id 2) is selected:
**Size**
-- 19"
**Color**
-- White (1)
-- Black (1)
**Price**
-- <$100 (1)
-- $100-$300 (1)
That is: query attribute2product, select the matching productid values, then select the other attributes of those products and display each attribute name with the number of products that now have it (like Magento).
Thanks,
I've modified the query. This should be what you need, based on your updates:
SELECT attribute.name AS attributename, COUNT(*) AS numofproducts FROM product
INNER JOIN attribute2product ON attribute2product.productid = product.id
INNER JOIN attribute ON attribute.id = attribute2product.attributeid
WHERE product.id IN
(
SELECT p.id FROM product AS p
INNER JOIN attribute2product AS a2p ON a2p.productid = p.id
WHERE a2p.attributeid = 2
)
GROUP BY attribute.id, attribute.name;
Based on your above data I got:
attributename numofproducts
19" 2
White 1
Black 1
<$100 2
For multiple attributes (based on a blog article by Quassnoi, a more knowledgeable expert):
I've removed product table since it's not needed here
SELECT attribute.name AS attributename, COUNT(*) AS numofproducts
FROM attribute2product
INNER JOIN attribute ON attribute.id = attribute2product.attributeid
WHERE attribute2product.productid IN (
SELECT o.productid
FROM (
SELECT productid
FROM (
SELECT 2 AS att
UNION ALL
SELECT 6 AS att
) v
JOIN attribute2product ON attributeid >= att AND attributeid <= att
) o
GROUP BY o.productid
HAVING COUNT(*) = 2
)
GROUP BY attribute.id, attribute.name
2, 6 refer to 19" and White, respectively. COUNT(*) = 2 is to match 2 attributes. More attributes can be added by appending the following to the nested derived table:
UNION ALL
SELECT <attributeid> AS att
As expected the result from the query:
attributename numofproducts
19" 1
White 1
<$100 1
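The same counts can be reproduced with a parameterised query. The sketch below is a simplified variant, not the query from the answer: it uses IN plus HAVING COUNT(DISTINCT ...) rather than the derived-table join, and it runs in SQLite through Python's sqlite3 module as a stand-in for MySQL. The helper name facet_counts is made up.

```python
import sqlite3

# (id, name, parentid)
ATTRIBUTES = [
    (1, "Size", 0), (2, '19"', 1), (3, '17"', 1), (4, '15"', 1),
    (5, "Color", 0), (6, "White", 5), (7, "Black", 5),
    (8, "Price", 0), (9, "<$100", 8), (10, "$100-$300", 8), (11, ">$300", 8),
]
# (id, productid, attributeid)
ATTRIBUTE2PRODUCT = [
    (1, 1, 2), (2, 1, 6), (3, 2, 2), (4, 2, 7), (5, 3, 3),
    (6, 3, 7), (7, 1, 9), (8, 2, 9), (9, 3, 10),
]


def facet_counts(selected):
    """Count products per attribute among products having ALL selected attributes."""
    selected = sorted(set(selected))
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE attribute (id INTEGER, name TEXT, parentid INTEGER)")
    conn.execute(
        "CREATE TABLE attribute2product (id INTEGER, productid INTEGER, attributeid INTEGER)"
    )
    conn.executemany("INSERT INTO attribute VALUES (?, ?, ?)", ATTRIBUTES)
    conn.executemany("INSERT INTO attribute2product VALUES (?, ?, ?)", ATTRIBUTE2PRODUCT)

    placeholders = ",".join("?" * len(selected))
    # Inner query: products that carry every selected attribute.
    # Outer query: count those products per attribute.
    query = f"""
        SELECT a.name, COUNT(*) AS numofproducts
        FROM attribute2product AS a2p
        JOIN attribute AS a ON a.id = a2p.attributeid
        WHERE a2p.productid IN (
            SELECT productid
            FROM attribute2product
            WHERE attributeid IN ({placeholders})
            GROUP BY productid
            HAVING COUNT(DISTINCT attributeid) = ?
        )
        GROUP BY a.id, a.name
        ORDER BY a.id
    """
    rows = conn.execute(query, [*selected, len(selected)]).fetchall()
    conn.close()
    return rows


print(facet_counts([2]))
print(facet_counts([2, 6]))
```

Selecting attribute 2 alone reproduces the first result table above, and selecting 2 and 6 together reproduces the second.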
