What is the best logic to find max version in HiveQL? - max

I have the data in Hive table as below for Mobile Number and App Version number, and I wanted to get Max App Version Number installed in that mobile:
select * from app_ver_test;
1112223333 1.1.1.10
1112223333 1.1.1.8
1112224444 1.1.1.15
1112224444 2.1.1.0
1112225555 5.1.1.8
1112225555 5.1.1.20
1112226666 5.1.1.20
when you select simple max(App_ver) with group by mobile number, then I am getting wrong results for 1112223333 as it is showing 1.1.1.8 instead of 1.1.1.10.
So Please help me here.
Thanks
Sateesh
select ctn,max(CONCAT_WS('0',(split(app_ver, '\\.')))) from localytics.app_ver_test group by ctn;
select ctn,max(cast(CONCAT_WS('',(split(app_ver, '\\.'))) as Int) from localytics.app_ver_test group by ctn;
select ctn,(cast(CONCAT_WS('',(split(app_ver, '\\.'))) as float)) from localytics.app_ver_test order by ctn

You can try this:
select a.app_ver,ctn
FROM temp.test_table a
INNER JOIN (
select app_ver, (cast(max(cast(regexp_replace(ctn, "\\.", "")as int )) as string)) as max_value
from temp.test_table
group by app_ver) b
ON a.app_ver=b.app_ver
and regexp_replace(a.ctn, "\\.", "")=b.max_value ;
Result:
+-------------+-----------+--+
| a.app_ver | ctn |
+-------------+-----------+--+
| 1112223333 | 1.1.1.10 |
| 1112224444 | 1.1.1.15 |
| 1112225555 | 5.1.1.20 |
| 1112226666 | 5.1.1.20 |
+-------------+-----------+--+

Related

Clickhouse running diff with grouping

General Task
A table consists of three columns (time, key, value). The task is to calculate a running difference for each key.
So, from input
---------------
| time | key | value |
---------------
| 1 | A | 4 |
| 2 | B | 1 |
| 3 | A | 6 |
| 4 | A | 7 |
| 5 | B | 3 |
| 6 | B | 7 |
it is desired to get
----------------------
| key | value | delta |
----------------------
| A | 4 | 0 |
| B | 1 | 0 |
| A | 6 | 2 |
| A | 7 | 1 |
| B | 3 | 2 |
| B | 7 | 4 |
Approaches
runningDifference function. Works, if the key is fixed. So we can
select *, runningDifference(value) from
(SELECT key, value from table where key = 'A' order by time)
Note that subquery is necessary here. This solution suffers when you want to get this for different keys
groupArray.
select key, groupArray(value) from
(SELECT key, value from table order by time)
group by key
So, now we get a key and an array of elements with this key. Good.
But how to calculate a sliding difference? If we could do that, then ARRAY JOIN would lead us to a result.
Or we can even zip the array with itself and then apply lambda (we have arrayMap for that) but... we don't have any zip alternative.
Any ideas?
Thanks in advance.
Solution with arrays:
WITH
groupArray(value) as time_sorted_vals,
arrayEnumerate(time_sorted_vals) as indexes,
arrayMap( i -> time_sorted_vals[i] - time_sorted_vals[i-1], indexes) as running_diffs
SELECT
key,
running_diffs
FROM
(SELECT key, value from table order by time)
GROUP by key
Other option (doing sort inside each group separately, which is more optimal in a lot of cases)
WITH
groupArray( tuple(value,time) ) as val_time_tuples,
arraySort( x -> x.2, val_time_tuples ) as val_time_tuples_sorted,
arrayMap( t -> t.1, indexes) as time_sorted_vals,
arrayEnumerate(time_sorted_vals) as indexes,
arrayMap( i -> time_sorted_vals[i] - time_sorted_vals[i-1], indexes) as running_diffs
SELECT
key,
running_diffs
FROM
time
GROUP by key
and you can apply ARRAY JOIN on the result afterward.
Lately I've also encountered the problem and Clickhouse offers function arrayDifference.
WITH
groupArray(value) as vals
arrayDifference(vals) as running_diffs
SELECT
key,
running_diffs
FROM
(SELECT key, value from table order by time)
GROUP by key
This question was posted years ago, for today, Sep 29th, 2021, we can use arrayDifference instead of arrayMap. And we can ARRAY JOIN so that we can get a tabulated result instead of a nested array.
SELECT key, sorted_time, time_sorted_vals, running_diffs
FROM (
WITH
groupArray( tuple(value,time) ) as val_time_tuples,
arraySort( x -> x.2, val_time_tuples ) as val_time_tuples_sorted,
arrayMap( t -> t.1, val_time_tuples_sorted) as time_sorted_vals,
arrayMap( t -> t.2, val_time_tuples_sorted) as sorted_time,
arrayDifference(time_sorted_vals) as running_diffs
SELECT
key,
sorted_time,
time_sorted_vals,
running_diffs
FROM
table_name
GROUP by key)
ARRAY JOIN sorted_time, time_sorted_vals, running_diffs
The only restriction is that the value column should not be of nullable types.

Left Outer Join via a link table, using min() to restrict join to one row

I am trying to write an Oracle SQL query to join two tables that are linked via a link table (by that I mean a table with 2 columns, each a foreign key to the primary tables). A min() function is to be used to limit the results from the left outer join to a single row.
My model consists of "parents" and "nephews". Parents can have 0 or more nephews. Parents can be enabled or disabled. Each nephew has a birthday date. The goal of my query is:
Print a single row for each enabled parent, listing that parent's oldest nephew (ie the one with the min(birthday)).
My problem is illustrated here at sqlfiddle: http://sqlfiddle.com/#!4/9a3be0d/1
I can form a query that lists all of the nephews for the enabled parents, but that is not good enough- I just want one row per parent which includes just the oldest nephew. Forming the where clause to the outer table seems to be my stumbling block.
My tables and sample data:
create table parent (parent_id number primary key, parent_name varchar2(50), enabled int);
create table nephew (nephew_id number primary key, birthday date, nephew_name varchar2(50));
create table parent_nephew_link (parent_id number not null, nephew_id number not null);
parent table:
+----+-------------+---------+
| id | parent_name | enabled |
+----+-------------+---------+
| 1 | Donald | 1 |
+----+-------------+---------+
| 2 | Minnie | 0 |
+----+-------------+---------+
| 3 | Mickey | 1 |
+----+-------------+---------+
nephew table:
+-----------+------------+-------------+
| nephew_id | birthday | nephew_name |
+-----------+------------+-------------+
| 100 | 01/01/2017 | Huey |
+-----------+------------+-------------+
| 101 | 01/01/2016 | Dewey |
+-----------+------------+-------------+
| 102 | 01/01/2015 | Louie |
+-----------+------------+-------------+
| 103 | 01/01/2014 | Morty |
+-----------+------------+-------------+
| 104 | 01/01/2013 | Ferdie |
+-----------+------------+-------------+
parent_nephew_link table:
+-----------+-----------+
| parent_id | nephew_id |
+-----------+-----------+
| 1 | 100 |
+-----------+-----------+
| 1 | 101 |
+-----------+-----------+
| 1 | 102 |
+-----------+-----------+
| 3 | 103 |
+-----------+-----------+
| 3 | 104 |
+-----------+-----------+
My (not correct) query:
-- This query is not right, it returns a row for each nephew
select parent_name, nephew_name
from parent p
left outer join parent_nephew_link pnl
on p.parent_id = pnl.parent_id
left outer join nephew n
on n.nephew_id = pnl.nephew_id
where enabled = 1
-- I wish I could add this clause to restrict the result to the oldest
-- nephew but p.parent_id is not available in sub-selects.
-- You get an ORA-00904 error if you try this:
-- and n.birthday = (select min(birthday) from nephew nested where nested.parent_id = p.parent_id)
My desired output would be:
+-------------+-------------+
| parent_name | nephew_name |
+-------------+-------------+
| Donald | Louie |
+-------------+-------------+
| Mickey | Ferdie |
+-------------+-------------+
Thanks for any advice!
John
markaaronky's suggestion
I tried using markaaronky's suggestion but this sql is also flawed.
-- This query is not right either, it returns the correct data but only for one parent
select * from (
select parent_name, n.nephew_name, n.birthday
from parent p
left outer join parent_nephew_link pnl
on p.parent_id = pnl.parent_id
left outer join nephew n
on n.nephew_id = pnl.nephew_id
where enabled = 1
order by parent_name, n.birthday asc
) where rownum <= 1
Why not:
(1) include the n.birthday from the nephews table in your SELECT statement
(2) add an ORDER BY n.birthday ASC to your query
(3) also modify your select so that it only takes the top row?
I tried to write this out in sqlfiddle for you but it doesn't seem to like table aliases (e.g. it throws an error when I write n.birthday), but I'm sure that's legal in Oracle, even though I'm a SQL Server guy.
Also, if I recall correctly, Oracle doesn't have a SELECT TOP like SQL Server does... you have to do something like "WHERE ROWNUM = 1" instead? Same concept... you're just ordering your results so the oldest nephew is the first row, and you're only taking the first row.
Perhaps an undesired side effect is you WOULD get the birthday along with the names in your results. If that's unacceptable, my apologies. It looked like your question has been sitting unanswered for a while and this solution should at least give you a start.
Lastly, since you don't have a NOT NULL constraint on your birthday column and are doing left outer joins, you might make the query safer by adding AND n.birthday IS NOT NULL
Use:
select parent_name, nephew_name
from parent p
left outer join
(
SELECT pnl.parent_id, n.nephew_name
FROM parent_nephew_link pnl
join nephew n
on n.nephew_id = pnl.nephew_id
AND n.BIRTHDAY = (
SELECT min( BIRTHDAY )
FROM nephew n1
JOIN parent_nephew_link pnl1
ON pnl1.NEPHEW_ID = n1.NEPHEW_ID
WHERE pnl1.PARENT_ID = pnl.PARENT_ID
)
) ppp
on p.parent_id = ppp.parent_id
where p.enabled = 1
Demo: http://sqlfiddle.com/#!4/98758/23
| PARENT_NAME | NEPHEW_NAME |
|-------------|-------------|
| Mickey | Louie |
| Donald | Ferdie |

Insert value based on min value greater than value in another row

It's difficult to explain the question well in the title.
I am inserting 6 values from (or based on values in) one row.
I also need to insert a value from a second row where:
The values in one column (ID) must be equal
The values in column (CODE) in the main source row must be IN (100,200), whereas the other row must have value of 300 or 400
The value in another column (OBJID) in the secondary row must be the lowest value above that in the primary row.
Source Table looks like:
OBJID | CODE | ENTRY_TIME | INFO | ID | USER
---------------------------------------------
1 | 100 | x timestamp| .... | 10 | X
2 | 100 | y timestamp| .... | 11 | Y
3 | 300 | z timestamp| .... | 10 | F
4 | 100 | h timestamp| .... | 10 | X
5 | 300 | g timestamp| .... | 10 | G
So to provide an example..
In my second table I want to insert OBJID, OBJID2, CODE, ENTRY_TIME, substr(INFO(...)), ID, USER
i.e. from my example a line inserted in the second table would look like:
OBJID | OBJID2 | CODE | ENTRY_TIME | INFO | ID | USER
-----------------------------------------------------------
1 | 3 | 100 | x timestamp| substring | 10 | X
4 | 5 | 100 | h timestamp| substring2| 10 | X
My insert for everything that just comes from one row works fine.
INSERT INTO TABLE2
(ID, OBJID, INFO, USER, ENTRY_TIME)
SELECT ID, OBJID, DECODE(CODE, 100, (SUBSTR(INFO, 12,
LENGTH(INFO)-27)),
600,'CREATE') INFO, USER, ENTRY_TIME
FROM TABLE1
WHERE CODE IN (100,200);
I'm aware that I'll need to use an alias on TABLE1, but I don't know how to get the rest to work, particularly in an efficient way. There are 2 million rows right now, but there will be closer to 20 million once I start using production data.
You could try this:
select primary.* ,
(select min(objid)
from table1 secondary
where primary.objid < secondary.objid
and secondary.code in (300,400)
and primary.id = secondary.id
) objid2
from table1 primary
where primary.code in (100,200);
Ok, I've come up with:
select OBJID,
min(case when code in (300,400) then objid end)
over (partition by id order by objid
range between 1 following and unbounded following
) objid2,
CODE, ENTRY_TIME, INFO, ID, USER1
from table1;
So, you need a insert select the above query with a where objid2 is not null and code in (100,200);

Build adjacency matrix from list of weighted edges in BigQuery

Related issue:
How to create dummy variable columns for thousands of categories in Google BigQuery
I have a table of list of weighted edges which is a list of user-item rating, it looks like this:
| userId | itemId | rating
| 001 | 001 | 5.0
| 001 | 002 | 4.0
| 002 | 001 | 4.5
| 002 | 002 | 3.0
I want to convert this weighted edge list into a adjacency matrix:
| userId | item001 | item002
| 001 | 5.0 | 4.0
| 002 | 4.5 | 3.0
According to this post, we can do it in two steps, the first step is to extract the matrix entry's value to generate a query, and second step is to run the query which is generated from 1st step.
But my question is how to extract the rating value and use the rating value in the IF() statement? My intuition is to put a nested query inside the IF() statement such like:
IF(itemId = blah,
(select rating
from mytable
where
userId = blahblah
and itemId = blah),
0)
But this query looks too expensive, can someone give me an example?
Thanks
Unless I am missing something - it is quite similar to the post you referenced
Step 1 - generate query
SELECT 'SELECT userID, ' +
GROUP_CONCAT_UNQUOTED(
'SUM(IF(itemId = "' + STRING(itemId) + '", rating, 0)) AS item' + STRING(itemId)
)
+ ' FROM YourTable GROUP BY userId'
FROM (
SELECT itemId
FROM YourTable
GROUP BY itemId
)
Step 2 - run generated query
SELECT
userID,
SUM(IF(itemId = "001", rating, 0)) AS item001,
SUM(IF(itemId = "002", rating, 0)) AS item002
FROM YourTable
GROUP BY userId
Result as expected
userID item001 item002
001 5.0 4.0
002 4.5 3.0

How to transpose/pivot data in hive?

I know there's no direct way to transpose data in hive. I followed this question: Is there a way to transpose data in Hive? , but as there is no final answer there, could not get all the way.
This is the table I have:
| ID | Code | Proc1 | Proc2 |
| 1 | A | p | e |
| 2 | B | q | f |
| 3 | B | p | f |
| 3 | B | q | h |
| 3 | B | r | j |
| 3 | C | t | k |
Here Proc1 can have any number of values. ID, Code & Proc1 together form a unique key for this table. I want to Pivot/ transpose this table so that each unique value in Proc1 becomes a new column, and corresponding value from Proc2 is the value in that column for the corresponding row. In essense, I'm trying to get something like:
| ID | Code | p | q | r | t |
| 1 | A | e | | | |
| 2 | B | | f | | |
| 3 | B | f | h | j | |
| 3 | C | | | | k |
In the new transformed table, ID and code are the only primary key. From the ticket I mentioned above, I could get this far using the to_map UDAF. (Disclaimer - this may not be a step in the right direction, but just mentioning here, if it is)
| ID | Code | Map_Aggregation |
| 1 | A | {p:e} |
| 2 | B | {q:f} |
| 3 | B | {p:f, q:h, r:j } |
| 3 | C | {t:k} |
But don't know how to get from this step to the pivot/transposed table I want.
Any help on how to proceed will be great!
Thanks.
Here is the approach i used to solved this problem using hive's internal UDF function, "map":
select
b.id,
b.code,
concat_ws('',b.p) as p,
concat_ws('',b.q) as q,
concat_ws('',b.r) as r,
concat_ws('',b.t) as t
from
(
select id, code,
collect_list(a.group_map['p']) as p,
collect_list(a.group_map['q']) as q,
collect_list(a.group_map['r']) as r,
collect_list(a.group_map['t']) as t
from (
select
id,
code,
map(proc1,proc2) as group_map
from
test_sample
) a
group by
a.id,
a.code
) b;
"concat_ws" and "map" are hive udf and "collect_list" is a hive udaf.
Here is the solution I ended up using:
add jar brickhouse-0.7.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';
select
id,
code,
group_map['p'] as p,
group_map['q'] as q,
group_map['r'] as r,
group_map['t'] as t
from ( select
id, code,
collect(proc1,proc2) as group_map
from test_sample
group by id, code
) gm;
The to_map UDF was used from the brickhouse repo: https://github.com/klout/brickhouse
Yet another solution.
Pivot using Hivemall to_map function.
SELECT
uid,
kv['c1'] AS c1,
kv['c2'] AS c2,
kv['c3'] AS c3
FROM (
SELECT uid, to_map(key, value) kv
FROM vtable
GROUP BY uid
) t
uid c1 c2 c3
101 11 12 13
102 21 22 23
Unpivot
SELECT t1.uid, t2.key, t2.value
FROM htable t1
LATERAL VIEW explode (map(
'c1', c1,
'c2', c2,
'c3', c3
)) t2 as key, value
uid key value
101 c1 11
101 c2 12
101 c3 13
102 c1 21
102 c2 22
102 c3 23
I have not written this code, but I think you can use some of the UDFs provided by klouts brickhouse: https://github.com/klout/brickhouse
Specifically, you could do something like use their collect as mentioned here: http://brickhouseconfessions.wordpress.com/2013/03/05/use-collect-to-avoid-the-self-join/
and then explode the arrays (they will be of differing length) using the methods detailed in this post http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_ra
I have created one dummy table called hive using below query-
create table hive (id Int,Code String, Proc1 String, Proc2 String);
Loaded all the data in the table-
insert into hive values('1','A','p','e');
insert into hive values('2','B','q','f');
insert into hive values('3','B','p','f');
insert into hive values('3','B','q','h');
insert into hive values('3','B','r','j');
insert into hive values('3','C','t','k');
Now use the below query to achieve the output.
select id,code,
case when collect_list(p)[0] is null then '' else collect_list(p)[0] end as p,
case when collect_list(q)[0] is null then '' else collect_list(q)[0] end as q,
case when collect_list(r)[0] is null then '' else collect_list(r)[0] end as r,
case when collect_list(t)[0] is null then '' else collect_list(t)[0] end as t
from(
select id, code,
case when proc1 ='p' then proc2 end as p,
case when proc1 ='q' then proc2 end as q,
case when proc1 ='r' then proc2 end as r,
case when proc1 ='t' then proc2 end as t
from hive
) dummy group by id,code;
In case of numeric value you can use below hive query:
Sample data
ID cust_freq Var1 Var2 frequency
220444 1 16443 87128 72.10140547
312554 6 984 7339 0.342452643
220444 3 6201 87128 9.258396518
220444 6 47779 87128 2.831972441
312554 1 6055 7339 82.15209213
312554 3 12868 7339 4.478333954
220444 2 6705 87128 15.80822558
312554 2 37432 7339 13.02712127
select id, sum(a.group_map[1]) as One, sum(a.group_map[2]) as Two, sum(a.group_map[3]) as Three, sum(a.group_map[6]) as Six from
( select id,
map(cust_freq,frequency) as group_map
from table
) a group by a.id having id in
( '220444',
'312554');
ID one two three six
220444 72.10140547 15.80822558 9.258396518 2.831972441
312554 82.15209213 13.02712127 4.478333954 0.342452643
In above example I have't used any custom udf. It is only using in-built hive functions.
Note :For string value in key write the vale as sum(a.group_map['1']) as One.
For Unpivot, we can simply use below logic.
SELECT Cost.Code, Cost.Product, Cost.Size
, Cost.State_code, Cost.Promo_date, Cost.Cost, Sales.Price
FROM
(Select Code, Product, Size, State_code, Promo_date, Price as Cost
FROM Product
Where Description = 'Cost') Cost
JOIN
(Select Code, Product, Size, State_code, Promo_date, Price as Price
FROM Product
Where Description = 'Sales') Sales
on (Cost.Code = Sales.Code
and Cost.Promo_date = Sales.Promo_date);
Below is also a way for Pivot
SELECT TM1_Code, Product, Size, State_code, Description
, Promo_date
, Price
FROM (
SELECT TM1_Code, Product, Size, State_code, Description
, MAP('FY2018Jan', FY2018Jan, 'FY2018Feb', FY2018Feb, 'FY2018Mar', FY2018Mar, 'FY2018Apr', FY2018Apr
,'FY2018May', FY2018May, 'FY2018Jun', FY2018Jun, 'FY2018Jul', FY2018Jul, 'FY2018Aug', FY2018Aug
,'FY2018Sep', FY2018Sep, 'FY2018Oct', FY2018Oct, 'FY2018Nov', FY2018Nov, 'FY2018Dec', FY2018Dec) AS tmp_column
FROM CS_ME_Spirits_30012018) TmpTbl
LATERAL VIEW EXPLODE(tmp_column) exptbl AS Promo_date, Price;
You can use case statements and some help from collect_set to achieve this. You can check this out. You can check detail answer at - http://www.analyticshut.com/big-data/hive/pivot-rows-to-columns-in-hive/
Here is the query for reference,
SELECT resource_id,
CASE WHEN COLLECT_SET(quarter_1)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_1)[0] END AS quarter_1_spends,
CASE WHEN COLLECT_SET(quarter_2)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_2)[0] END AS quarter_2_spends,
CASE WHEN COLLECT_SET(quarter_3)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_3)[0] END AS quarter_3_spends,
CASE WHEN COLLECT_SET(quarter_4)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_4)[0] END AS quarter_4_spends
FROM (
SELECT resource_id,
CASE WHEN quarter='Q1' THEN amount END AS quarter_1,
CASE WHEN quarter='Q2' THEN amount END AS quarter_2,
CASE WHEN quarter='Q3' THEN amount END AS quarter_3,
CASE WHEN quarter='Q4' THEN amount END AS quarter_4
FROM billing_info)tbl1
GROUP BY resource_id;

Resources