Hive getting top n records in group by query - user-defined-functions

I have the following table in Hive:
user-id, user-name, user-address,clicks,impressions,page-id,page-name
I need to find the top 5 users [user-id, user-name, user-address] by clicks for each page [page-id, page-name].
I understand that we first need to group by [page-id, page-name], then within each group order by [clicks, impressions] desc, and then emit only the top 5 users [user-id, user-name, user-address] for each page, but I am finding it difficult to construct the query.
How can we do this using a Hive UDF?

As of Hive 0.11, you can do this with Hive's built-in rank() function and the simpler semantics of Hive's built-in analytics and windowing functions. Sadly, I couldn't find as many examples of these as I would have liked, but they are really, really useful. With those, both rank() and the WhereWithRankCond logic are built in, so you can just do:
SELECT page-id, user-id, clicks
FROM (
SELECT page-id, user-id, rank()
over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks
FROM mytable
) ranked_mytable
WHERE ranked_mytable.rank <= 5
ORDER BY page-id, rank
No UDF required, and only one subquery! Also, all of the rank logic is localized.
You can find some more (though not enough for my liking) examples of these functions in this Jira and on this guy's blog.
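Since a Hive cluster isn't always at hand, the windowed rank() pattern can be tried end to end with a small Python sketch against sqlite3 (SQLite 3.25+ supports the same RANK() OVER semantics). The table name and the underscore column names below are simplified stand-ins for the schema above, not the original one:

```python
import sqlite3

# Stand-in for the Hive table; page_id/user_id replace page-id/user-id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (page_id TEXT, user_id TEXT, clicks INT)")
rows = [("p1", "u%d" % i, 10 - i) for i in range(8)] + \
       [("p2", "u%d" % i, 20 - i) for i in range(3)]
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)

# Same shape as the Hive query: rank within each page, keep rank <= 5.
top5 = conn.execute("""
    SELECT page_id, user_id, clicks FROM (
        SELECT page_id, user_id, clicks,
               RANK() OVER (PARTITION BY page_id ORDER BY clicks DESC) AS rnk
        FROM mytable
    ) ranked
    WHERE rnk <= 5
    ORDER BY page_id, rnk
""").fetchall()
for row in top5:
    print(row)
```

Page p1 has 8 users so only its top 5 survive; page p2 has just 3, all of which are kept.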

Revised answer, fixing the bug mentioned by Himanshu Gahlot:
SELECT page-id, user-id, clicks
FROM (
SELECT page-id, user-id, rank(page-id) as rank, clicks FROM (
SELECT page-id, user-id, clicks FROM mytable
DISTRIBUTE BY page-id
SORT BY page-id, clicks desc
) a ) b
WHERE rank < 5
ORDER BY page-id, rank
Note that the rank() UDAF is applied to the page-id column; a change in its value resets the rank counter, while a repeated value increments it (i.e. the counter restarts for each page-id partition).
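That reset-counter behaviour can be sketched in a few lines of Python, assuming (as the inner query's DISTRIBUTE BY/SORT BY guarantees) that rows arrive grouped by page-id and sorted by clicks descending; the function name here is made up for illustration:

```python
def streaming_rank(rows):
    """Mimic the custom rank(key) UDF: rows must arrive sorted by key,
    then by clicks DESC; the counter resets whenever the key changes."""
    prev_key, counter = None, 0
    for key, user, clicks in rows:
        counter = 1 if key != prev_key else counter + 1
        prev_key = key
        yield key, user, counter, clicks

rows = [("page1", "u1", 10), ("page1", "u2", 9),
        ("page2", "u3", 20), ("page2", "u4", 19)]
for page, user, rank, clicks in streaming_rank(rows):
    print(page, user, rank, clicks)
```

Because the counter depends only on the previous row, the UDF never needs to buffer a whole partition.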

You can do it with a rank() UDF described here: http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
SELECT page-id, user-id, clicks
FROM (
SELECT page-id, user-id, rank(user-id) as rank, clicks
FROM mytable
DISTRIBUTE BY page-id, user-id
SORT BY page-id, user-id, clicks desc
) a
WHERE rank < 5
ORDER BY page-id, rank

You can use the each_top_k function of Hivemall for an efficient top-k computation on Apache Hive.
select
page-id,
user-id,
clicks
from (
select
each_top_k(5, page-id, clicks, page-id, user-id)
as (rank, clicks, page-id, user-id)
from (
select
page-id, user-id, clicks
from
mytable
DISTRIBUTE BY page-id SORT BY page-id
) t1
) t2
order by page-id ASC, clicks DESC
The each_top_k UDTF is very fast compared to other ways of running top-k queries in Hive (e.g., distribute by/rank) because it does not hold the whole ranking in the intermediate result.
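The reason a bounded top-k is cheaper than a full ranking can be illustrated with a per-group min-heap of size k; this is just the idea, not Hivemall's actual implementation:

```python
import heapq

def top_k_per_group(rows, k):
    """Keep only a size-k min-heap per group instead of a full ranking;
    an illustration of each_top_k's memory bound, not Hivemall's code."""
    heaps = {}
    for group, score, payload in rows:
        h = heaps.setdefault(group, [])
        if len(h) < k:
            heapq.heappush(h, (score, payload))
        elif score > h[0][0]:
            # Evict the smallest of the current k when a better row arrives.
            heapq.heapreplace(h, (score, payload))
    return {g: sorted(h, reverse=True) for g, h in heaps.items()}

rows = [("p1", 10, "u1"), ("p1", 9, "u2"), ("p1", 8, "u3"), ("p2", 20, "u4")]
print(top_k_per_group(rows, 2))
```

Memory use is O(groups x k) regardless of how many rows each group has.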

Let us say your data looks like the following:
page-id user-id clicks
page1 user1 10
page1 user2 10
page1 user3 9
page1 user4 8
page1 user5 7
page1 user6 7
page1 user7 6
page1 user8 5
page2 user1 20
page2 user2 19
page2 user3 18
The query below will give you:
SELECT page-id, user-id, clicks, rank
FROM (
SELECT page-id, user-id, rank()
over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks
FROM your_table
) ranked_table
WHERE ranked_table.rank <= 5
Result :
page-id user-id clicks rank
page1 user1 10 1
page1 user2 10 1
page1 user3 9 3
page1 user4 8 4
page1 user5 7 5
page1 user6 7 5
page2 user1 20 1
page2 user2 19 2
page2 user3 18 3
So for page1 you get 6 users, because users with the same number of clicks receive the same rank.
But if you want exactly 5 users, picking arbitrarily when multiple users tie on the same rank, you can use the query below with row_number() instead:
SELECT page-id, user-id, clicks, rank
FROM (
SELECT page-id, user-id, row_number()
over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks
FROM your_table
) ranked_table
WHERE ranked_table.rank <= 5
Result :
page-id user-id clicks rank
page1 user1 10 1
page1 user2 10 2
page1 user3 9 3
page1 user4 8 4
page1 user5 7 5
page2 user1 20 1
page2 user2 19 2
page2 user3 18 3
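The difference between those two result sets is just how RANK() and ROW_NUMBER() treat ties; a quick sqlite3 check (SQLite 3.25+, simplified column names) of the page1 tie makes it concrete:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (page TEXT, user_id TEXT, clicks INT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [("page1", "user1", 10), ("page1", "user2", 10),
                  ("page1", "user3", 9)])
# RANK() gives tied rows the same rank and then skips ahead;
# ROW_NUMBER() breaks ties arbitrarily but stays dense.
res = conn.execute("""
    SELECT user_id,
           RANK()       OVER (ORDER BY clicks DESC) AS rnk,
           ROW_NUMBER() OVER (ORDER BY clicks DESC) AS rn
    FROM t
    ORDER BY rn
""").fetchall()
print(res)
```

The two tied users both get rank 1, so the next rank is 3; their row numbers are 1 and 2, so a row_number filter always returns exactly k rows.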

select *
from (
select user-id, user-name, user-address, page-id, clicks,
row_number() over (partition by page-id order by clicks desc) as row_num
from mytable
) a
where a.row_num <= 5
The column names may need adjusting to your schema, but the logic is the same.

Related

How to join data from two views in oracle?

I have two views, view1 and view2, and I want to join the data from both views. Example data:
view1:
old_NUmbers | counts
123         | 2
324         | 3
4454        | 13
343433      | 20
View2 data:
numbers | counts
343344  | 10
24344   | 15
So the desired result which I want is the following:
old_NUmbers | counts | numbers | counts
123         | 2      | 343344  | 10
324         | 3      | 24344   | 15
4454        | 13     | 343433  | 20
If you're combining the results and want to align data from the two views in counts order, you can generate a nominal ordinal value for each row in each view, for example with the row_number() function:
select v.old_numbers, v.counts,
row_number() over (order by v.counts, v.old_numbers)
from view1 v
and something similar for the other view; then use those as inline views or CTEs, and perform a full outer join based on that ordinal value:
with v1 (old_numbers, counts, rn) as (
select v.old_numbers, v.counts,
row_number() over (order by v.counts, v.old_numbers)
from view1 v
),
v2 (numbers, counts, rn) as (
select v.numbers, v.counts,
row_number() over (order by v.counts, v.numbers)
from view2 v
)
select v1.old_numbers, v1.counts, v2.numbers, v2.counts
from v1
full outer join v2 on v2.rn = v1.rn
order by coalesce(v1.rn, v2.rn)
OLD_NUMBERS | COUNTS | NUMBERS | COUNTS
123         | 2      | 343344  | 10
324         | 3      | 24344   | 15
4454        | 13     | null    | null
343433      | 20     | null    | null
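The ordinal-alignment idea (pair row n of one view with row n of the other, padding the shorter side with nulls) can also be sketched outside the database, assuming both sides are already in counts order:

```python
from itertools import zip_longest

view1 = [(123, 2), (324, 3), (4454, 13), (343433, 20)]  # sorted by counts
view2 = [(343344, 10), (24344, 15)]

# zip_longest plays the role of the FULL OUTER JOIN on row_number():
# row n of each side is paired, and the shorter side is padded with None.
aligned = [(a, b) for a, b in zip_longest(view1, view2)]
for old, new in aligned:
    print(old, new)
```

As with the SQL version, rows beyond the shorter view come out with nulls on one side.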

Tracking data change on Oracle 9 without timestamps or indexing

We're building a data warehouse on BigQuery, which includes data from an old Oracle 9 transactional database (still active) that has no indexing or timestamps.
Using Standard SQL, I would like to analyse changes in some tables imported from this database.
Simplifying the situation, imagine we have two versions of the same table, before and after, as follows:
with before as (
select
'U123' as user, 'Gum' as product, '3' as quantity
union all
select
'U456', 'Tissue', '20'
union all
select
'U123', 'Cream', '1'
)
and
with after as (
select
'U123' as user, 'Gum' as product, '3' as quantity
union all
select
'U456', 'Tissue', '20'
union all
select
'U123', 'Cream', '3'
union all
select
'U456', 'Tomato', '5'
)
So row 4 was added and row 3 was modified.
What is the correct approach to compare data and locate changes given there is no indexing nor timestamps?
So the comparative method should output:
user | product | quantity
U123 | Cream | 3
U456 | Tomato | 5
I don't even know where to start.
Below is for BigQuery Standard SQL
#standardSQL
SELECT user, product, IFNULL(a.quantity, 0) - IFNULL(b.quantity, 0) AS quantity
FROM after a
FULL OUTER JOIN before b
USING(user, product)
WHERE IFNULL(a.quantity, 0) != IFNULL(b.quantity, 0)
When applied to sample data from your question as in below example
#standardSQL
WITH before AS (
SELECT 'U123' AS user, 'Gum' AS product, 3 AS quantity UNION ALL
SELECT 'U456', 'Tissue', 20 UNION ALL
SELECT 'U123', 'Cream', 1
), after AS (
SELECT 'U123' AS user, 'Gum' AS product, 3 AS quantity UNION ALL
SELECT 'U456', 'Tissue', 20 UNION ALL
SELECT 'U123', 'Cream', 3 UNION ALL
SELECT 'U456', 'Tomato', 5
)
SELECT user, product, IFNULL(a.quantity, 0) - IFNULL(b.quantity, 0) AS quantity
FROM after a
FULL OUTER JOIN before b
USING(user, product)
WHERE IFNULL(a.quantity, 0) != IFNULL(b.quantity, 0)
output is
Row user product quantity
1 U123 Cream 2
2 U456 Tomato 5
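The same before/after comparison can be sketched in Python with dictionaries keyed on (user, product), mirroring the FULL OUTER JOIN ... USING(user, product) above:

```python
before = {("U123", "Gum"): 3, ("U456", "Tissue"): 20, ("U123", "Cream"): 1}
after  = {("U123", "Gum"): 3, ("U456", "Tissue"): 20,
          ("U123", "Cream"): 3, ("U456", "Tomato"): 5}

# Union of keys = full outer join; a missing row counts as quantity 0,
# like the IFNULL(..., 0) in the BigQuery query.
changes = {}
for key in before.keys() | after.keys():
    diff = after.get(key, 0) - before.get(key, 0)
    if diff != 0:
        changes[key] = diff
print(changes)
```

Only the modified Cream row (delta 2) and the new Tomato row (delta 5) survive the filter, matching the BigQuery output.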
Oracle 9 keeps track of data changes at row level with the help of the SCN (System Change Number). As a result, any change performed through DML (INSERT/UPDATE) is internally captured with a timestamp.
How does it work?
Create the table with the ROWDEPENDENCIES option
Use the SCN_TO_TIMESTAMP(ORA_ROWSCN) function to get the TIMESTAMP of row changes
Example:
-- Create Table
CREATE TABLE SCNTEST(USER NUMBER, PRODUCT NUMBER, QUANTITY NUMBER) ROWDEPENDENCIES;
-- Insert Data
INSERT ...
-- Query Data
SELECT USER, PRODUCT, QUANTITY, SCN_TO_TIMESTAMP(ORA_ROWSCN) FROM SCNTEST;
You can group data on SCN_TO_TIMESTAMP(ORA_ROWSCN) value to get before and after records.

Order column inside PIVOT on the basis of max(count_date) for each value inside in condition

I have user names in the users column, and I want to display all users as columns, with the columns ordered in descending order of each user's summed data.
query:
select *
from (
select sum(tran_count) over (partition by schema) as table_name
from main_table
) pivot (sum(tran_count) for users in ('abc','lmn','pqr'));
current result:
schema table abc lmn pqr
pm sector 32 216 12
history trn 321 61 4
tap issuer 43 325 2
count: 396 602 18
so I want to represent the column abc,lmn and pqr in order of count of their data:
required answer:
schema table lmn abc pqr
pm sector 216 32 12
history trn 61 321 4
tap issuer 325 43 2
You cannot use a (sub)query in pivot's in clause. What you can do is rank the users according to their summed values and use those three rank values (1, 2, 3) in in. Then either use my inner query, which presents user names and sums in separate columns, or add a final union, where the names are listed in the first row and the sums, as strings, in the rows below.
with t as (
select *
from (
select dense_rank() over (order by smu desc, users) rn,
schema_, table_, users, smt
from (
select schema_, table_, users, sum(tran_count) smt,
sum(sum(tran_count)) over (partition by users) smu
from main_table
group by schema_, table_, users))
pivot (max(users) name, max(smt) smt for rn in (1 u1, 2 u2, 3 u3)))
select null schema_, null table_, u1_name u1, u2_name u2, u3_name u3
from t where rownum = 1 union all
select schema_, table_, to_char(u1_smt), to_char(u2_smt), to_char(u3_smt)
from t
dbfiddle demo
If you really need to put user names in headers then you have to use dynamic SQL or external code-writing-code technique.
I don't know if you really have columns named table or schema (these are reserved words), and you write tran_count in the query but count_date in the title, so I am somewhat confused. But in the linked dbfiddle you can see a working example with columns schema_, table_, users, tran_count.
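The rank-the-users-first idea can be illustrated outside SQL: total each user's values, order the users by those totals descending, and emit each row's values in that column order. A plain-Python sketch using the sample numbers above:

```python
rows = [("pm", "sector", {"abc": 32, "lmn": 216, "pqr": 12}),
        ("history", "trn", {"abc": 321, "lmn": 61, "pqr": 4}),
        ("tap", "issuer", {"abc": 43, "lmn": 325, "pqr": 2})]

# Sum per user, then sort users by total descending: this is what
# dense_rank() over (order by smu desc) establishes in the query above.
totals = {}
for _, _, vals in rows:
    for user, v in vals.items():
        totals[user] = totals.get(user, 0) + v
order = sorted(totals, key=totals.get, reverse=True)
print(order)  # the column order
for schema, table, vals in rows:
    print(schema, table, [vals[u] for u in order])
```

With totals abc=396, lmn=602, pqr=18, the columns come out in the order lmn, abc, pqr, as in the required answer.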

Using sum in Oracle SQL view to add column total in each row

I have a simple Oracle view with 3 columns, team, minors, adults, as follows:
create view main_summary ( team, minors, adults)
as
select all_teams.team_name as team ,
case when age<18 then count(id) as minors,
case when age>= 18 then count(id) as adults
from all_teams ;
A select statement returns rows like:
-----------------------------
team | minors | adults
-----------------------------
volleyball 2 4
football 6 3
tennis 4 8
-------------------------------
I want to add a total column at the end, which should make the view look like:
--------------------------------------
team | minors | adults| total
--------------------------------------
volleyball 2 4 6
football 6 3 9
tennis 4 8 12
-----------------------------------------
I tried the following, but nothing worked:
create view main_summary( team, minors, adults, sum(minors+adults)) ...
and
create view main_summary ( team, minors, adults, total)
as
select all_teams.team_name as team ,
case when age<18 then count(id) as minors,
case when age>= 18 then count(id) as adults
sum ( case when age<18 then count(id) +
case when age>= 18 then count(id) ) as total ...
The syntax may not be exactly right as I am not copying directly from my database; however, the pseudocode remains the same. Please guide me on how to achieve this.
I have no idea how your original view could work (the CASE expressions have no END and there is no GROUP BY). I think you want:
create view main_summary as
select t.team_name as team ,
sum(case when age < 18 then 1 else 0 end) as minors,
sum(case when age >= 18 then 1 else 0 end) as adults,
count(*) as total
from all_teams t
group by t.team_name;
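The conditional-aggregation pattern in that view (one SUM(CASE ...) per age bucket plus COUNT(*) for the total) can be sanity-checked with sqlite3; the team names and ages below are made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE all_teams (team_name TEXT, age INT)")
conn.executemany("INSERT INTO all_teams VALUES (?, ?)",
                 [("volleyball", 12), ("volleyball", 15),
                  ("volleyball", 20), ("football", 30)])
# Each SUM(CASE ...) counts one bucket; COUNT(*) is the row total,
# so minors + adults = total by construction.
rows = conn.execute("""
    SELECT team_name,
           SUM(CASE WHEN age < 18 THEN 1 ELSE 0 END) AS minors,
           SUM(CASE WHEN age >= 18 THEN 1 ELSE 0 END) AS adults,
           COUNT(*) AS total
    FROM all_teams
    GROUP BY team_name
    ORDER BY team_name
""").fetchall()
print(rows)
```

Since every row is either a minor or an adult, COUNT(*) is always the sum of the two buckets, which is why no separate addition is needed.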

Finding top 5 region and within region find top 5 customer by their price. (HIVE)

We have a requirement where we want to find the top N regions by their price sum, and then the top N customers within each of those regions.
Sample Data.
REGION_NAME,CUSTOMER_NAME,PRICE
RG1,Customer1,100
RG1,Customer2,200
RG1,Customer3,100
RG2,Customer4,100
RG2,Customer5,200
RG2,Customer6,400
RG3,Customer7,100
RG3,Customer8,200
RG3,Customer9,500
RG3,Customer9,200
Assume we want the top 2 regions, and the top 2 customers within each region, by summing the price:
Region_name,Region_sum,Customer_name,Customer_price (Sum)
RG3,1000,Customer9,700 (Sum of customer price)
RG3,1000,Customer8,200
RG2,700,Customer6,400
RG2,700,customer5,200
How do we write a Hive query for this? We are not able to see how to do it in Hive. Do we have to write MapReduce or Pig instead?
You can do this in Hive using analytics functions and a self-join:
select regions_ranked.region_name, regions_ranked.region_sum, customers_ranked.customer_name, customers_ranked.customer_sum from
(
select region_name, customer_name, customer_sum, rank() over (partition by region_name order by customer_sum desc) as customer_rank from (
select region_name, customer_name, sum(price) as customer_sum
from foo group by region_name, customer_name
) customers_sum
) customers_ranked
join
(
select region_name, region_sum, rank() over (order by region_sum desc) as region_rank from (
select region_name, sum(price) as region_sum
from foo group by region_name
) regions_sum
) regions_ranked
on customers_ranked.region_name = regions_ranked.region_name
where region_rank <= 2 and customer_rank <= 2;
This gives exactly the output you were looking for, although out of order. You can tack an ORDER BY clause onto the very end if you want that.
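For intuition, the two levels of that query (sum prices per region and per customer, rank the regions, rank the customers within each region, keep the top 2 of each) can be sketched in plain Python with the sample data; this only illustrates the logic, not a replacement for running it in Hive:

```python
from collections import defaultdict

data = [("RG1", "Customer1", 100), ("RG1", "Customer2", 200),
        ("RG1", "Customer3", 100), ("RG2", "Customer4", 100),
        ("RG2", "Customer5", 200), ("RG2", "Customer6", 400),
        ("RG3", "Customer7", 100), ("RG3", "Customer8", 200),
        ("RG3", "Customer9", 500), ("RG3", "Customer9", 200)]

# The two GROUP BY subqueries: sums per region and per (region, customer).
region_sum = defaultdict(int)
customer_sum = defaultdict(int)
for region, customer, price in data:
    region_sum[region] += price
    customer_sum[(region, customer)] += price

# rank() over (order by region_sum desc), keeping rank <= 2.
top_regions = sorted(region_sum, key=region_sum.get, reverse=True)[:2]
result = []
for region in top_regions:
    # rank() partitioned by region over customer sums, keeping rank <= 2.
    customers = [(c, s) for (r, c), s in customer_sum.items() if r == region]
    for customer, s in sorted(customers, key=lambda x: -x[1])[:2]:
        result.append((region, region_sum[region], customer, s))
print(result)
```

Region sums come out as RG3=1000, RG2=700, RG1=400, so RG3 and RG2 survive, and their top customers match the expected output above.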
