RANK OVER function in Hive - hadoop

I'm trying to run this query in Hive to return only the top 10 URLs that appear most often in the adimpression table.
select
ranked_mytable.url,
ranked_mytable.cnt
from
( select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) rnk
from
( select url, count(*) cnt
from store.adimpression ai
inner join zuppa.adgroupcreativesubscription agcs
on agcs.id = ai.adgroupcreativesubscriptionid
inner join zuppa.adgroup ag
on ag.id = agcs.adgroupid
where ai.datehour >= '2014-05-15 00:00:00'
and ag.siteid = 1240
group by url
) iq
) ranked_mytable
where
ranked_mytable.rnk <= 10
order by
ranked_mytable.url,
ranked_mytable.rnk desc
;
Unfortunately I get an error message stating:
FAILED: SemanticException [Error 10002]: Line 26:23 Invalid column reference 'rnk'
I've tried to debug it, and everything up to the ranked_mytable sub-query runs smoothly. I've tried commenting out the where ranked_mytable.rnk <= 10 clause, but the error message keeps appearing.

Hive is unable to order by a column that is not in the "output" of a select statement. To fix it, just include that column in the selected columns:
select
ranked_mytable.url,
ranked_mytable.cnt,
ranked_mytable.rnk
from
( select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) rnk
from
( select url, count(*) cnt
from store.adimpression ai
inner join zuppa.adgroupcreativesubscription agcs
on agcs.id = ai.adgroupcreativesubscriptionid
inner join zuppa.adgroup ag
on ag.id = agcs.adgroupid
where ai.datehour >= '2014-05-15 00:00:00'
and ag.siteid = 1240
group by url
) iq
) ranked_mytable
where
ranked_mytable.rnk <= 10
order by
ranked_mytable.url,
ranked_mytable.rnk desc
;
If you don't want that 'rnk' column in your final output, I expect you could wrap that whole thing in another inner-query and just select out the 'url' and 'cnt' fields.

RANK OVER is not the best function to achieve this goal.
A better solution is to combine SORT BY with LIMIT. On its own, LIMIT returns an arbitrary set of rows, but after SORT BY it returns the top rows, provided a single reducer is used (SORT BY only orders rows within each reducer). From the Apache Wiki:
-- Top k queries. The following query returns the top 5 sales records wrt amount.
SET mapred.reduce.tasks = 1;
SELECT * FROM sales SORT BY amount DESC LIMIT 5;
The query can be re-written in this way:
select
iq.url,
iq.cnt
from
( select url, count(*) cnt
from store.adimpression ai
inner join zuppa.adgroupcreativesubscription agcs
on agcs.id = ai.adgroupcreativesubscriptionid
inner join zuppa.adgroup ag
on ag.id = agcs.adgroupid
where ai.datehour >= '2014-05-15 00:00:00'
and ag.siteid = 1240
group by url ) iq
sort by
iq.cnt desc
limit
10
;

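Remove the partition by iq.url clause from rank() over (...) and re-run the query. The inner query already groups by url, so each partition holds a single row and every url gets rank 1.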
Thanks & Regards,
Kamleshkumar Gujarathi
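The point about partition by iq.url can be checked on a small example. This is only a sketch: it uses Python's built-in sqlite3 module (SQLite 3.25 or later for window functions) as a stand-in for Hive, and the table iq with its four rows is made up to mimic the output of the aggregated inner query.

```python
import sqlite3

# Hypothetical stand-in for the aggregated inner query "iq": one row per url.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iq (url TEXT, cnt INTEGER)")
conn.executemany(
    "INSERT INTO iq VALUES (?, ?)",
    [("a.com", 50), ("b.com", 30), ("c.com", 30), ("d.com", 5)],
)

# Partitioned by url: each partition holds exactly one row, so every rank is 1.
partitioned = conn.execute(
    "SELECT url, RANK() OVER (PARTITION BY url ORDER BY cnt DESC) AS rnk "
    "FROM iq ORDER BY url"
).fetchall()

# No partition: rows are ranked against each other across the whole result.
global_rank = conn.execute(
    "SELECT url, RANK() OVER (ORDER BY cnt DESC) AS rnk "
    "FROM iq ORDER BY rnk, url"
).fetchall()

print(partitioned)
print(global_rank)
```

With the partition in place, the rnk <= 10 filter keeps every url, which is why the original query would not return a top 10 even once the column reference error is fixed.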

Put the keyword as before the rnk alias (that is, write ... as rnk). It should work fine.

Related

ORA-00979 Not a Group function error for query with User defined function in select statement

I have this query where a user defined function is added in the select and group by statement.
The inner select query without the WITH clause runs fine and doesn't give any error. But after adding WITH clause it gives the following error -
ORA-00979: not a GROUP BY expression
00979. 00000 - "not a GROUP BY expression"
*Cause:
*Action: Error at Line: 3 Column: 29
I need the WITH clause to return only a subset of the entire result set based on input ranges.
Query is as follows:
WITH INFO AS (
SELECT
GET_EVAULATED_VALUE(T.C_IMP, T.IMP) AS IMPORTANCE,
count(*) AS NO_OF_PC_AFFECTED
FROM TABLE_NAME T
WHERE T.ACNT_REL_ID = 16
GROUP BY
(GET_EVAULATED_VALUE(T.C_IMP, T.IMP))
ORDER BY IMPORTANCE desc
)
SELECT * FROM
(
SELECT ROWNUM AS RN,
(SELECT COUNT(*) FROM INFO) COUNTS,
IMPORTANCE
FROM INFO
)
WHERE RN > 0 AND RN <= 10;
I am not sure how to use a CTE with GROUP BY on a user-defined function. However, I realized that I can rewrite the query to remove the sub-query and the CTE and make it simpler, as follows (and it works):
select * from (
select a.*, ROWNUM rnum from
(SELECT
count(*) over() as COUNTS,
GET_EVAULATED_VALUE(T.C_IMP, T.IMP) AS IMPORTANCE,
count(*) AS NO_OF_PC_AFFECTED
FROM TABLE_NAME T
WHERE T.ACNT_RELATION_ID = 16
GROUP BY
(GET_EVAULATED_VALUE(T.C_IMP, T.IMP))
ORDER BY importance desc) a
where ROWNUM <= 10 )
where rnum >= 0;
Same issue here, I created a table "TABLE_CTE" instead of using a CTE and it worked.
CREATE TABLE TABLE_CTE
AS
SELECT
USER_DEFINED_FUNCTION(date_1),
COUNT(*)
FROM
TABLE_NAME
GROUP BY
USER_DEFINED_FUNCTION(date_1)
;
SELECT * FROM TABLE_CTE

HiveSQLException: cannot recognize input near 'SELECT' 'MAX' '(' in expression specification

I'm trying to get the maximum value of a count. The code is as follows
SELECT coachID, COUNT(coachID)
FROM coaches_awards GROUP BY coachID
HAVING COUNT(coachID) =
(
SELECT MAX(t2.awards)
FROM (
SELECT coachID, count(coachID) as awards
FROM coaches_awards
GROUP BY coachID
) t2
);
Yet something keeps failing. The inner query works and gives the answer that I want and the outer query will work if the inner query is replaced by the number required. So I'm assuming I've made some syntax error.
Where am I going wrong?
If you are just looking for one row, why not do:
SELECT coachID, COUNT(coachID) as cnt
FROM coaches_awards
GROUP BY coachID
ORDER BY cnt DESC
LIMIT 1;
If you want ties, then use RANK() or DENSE_RANK():
SELECT ca.*
FROM (SELECT coachID, COUNT(*) as cnt,
RANK() OVER (ORDER BY COUNT(*) DESC) as seqnum
FROM coaches_awards
GROUP BY coachID
) ca
WHERE seqnum = 1;
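To see how the two suggestions differ when there are ties, here is a rough sketch using Python's sqlite3 module (SQLite 3.25 or later) rather than Hive. The award column and the sample rows are invented for illustration, and the counts are computed in an inner subquery before ranking, which is a slight restructuring of the query above chosen to keep the example portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "award" is a hypothetical column; only coachID matters for the counts.
conn.execute("CREATE TABLE coaches_awards (coachID TEXT, award TEXT)")
conn.executemany(
    "INSERT INTO coaches_awards VALUES (?, ?)",
    [("c1", "x"), ("c1", "y"), ("c2", "x"), ("c2", "z"), ("c3", "x")],
)

# ORDER BY ... LIMIT 1 returns a single row even though c1 and c2 are tied.
top_one = conn.execute(
    "SELECT coachID, COUNT(*) AS cnt FROM coaches_awards "
    "GROUP BY coachID ORDER BY cnt DESC, coachID LIMIT 1"
).fetchall()

# RANK() gives tied rows the same rank, so filtering on rank 1 keeps both.
top_with_ties = conn.execute(
    "SELECT coachID, cnt FROM ("
    "  SELECT coachID, cnt, RANK() OVER (ORDER BY cnt DESC) AS seqnum FROM ("
    "    SELECT coachID, COUNT(*) AS cnt FROM coaches_awards GROUP BY coachID"
    "  ) counts"
    ") ca WHERE seqnum = 1 ORDER BY coachID"
).fetchall()

print(top_one)
print(top_with_ties)
```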

How to left join with conditions in Toad Data Point Query Builder?

I'm trying to build a query in Toad Data Point. I have a subquery that has a row number to identify the records I'm interested in. This subquery needs to be left joined onto the main table only when the row number is 1. Here's the query I'm trying to visualize:
SELECT distinct E.EMPLID, E.ACAD_CAREER
FROM PS_STDNT_ENRL E
LEFT JOIN (
SELECT ACAD_CAREER, ROW_NUMBER() OVER (PARTITION BY ACAD_CAREER ORDER BY EFFDT DESC) as RN
FROM PS_ACAD_CAR_TBL
) T on T.ACAD_CAREER = E.ACAD_CAREER and RN = 1
When I try to replicate this, the row number condition is placed in the global WHERE clause. This is not the intended functionality because it removes any records that don't have a match in the subquery effectively making it an inner join.
Here is the query it's generating:
SELECT DISTINCT E.EMPLID, E.ACAD_CAREER, T.RN
FROM SYSADM.PS_STDNT_ENRL E
LEFT OUTER JOIN
(SELECT PS_ACAD_CAR_TBL.ACAD_CAREER,
ROW_NUMBER ()
OVER (PARTITION BY ACAD_CAREER ORDER BY EFFDT DESC)
AS RN
FROM SYSADM.PS_ACAD_CAR_TBL PS_ACAD_CAR_TBL) T
ON (E.ACAD_CAREER = T.ACAD_CAREER)
WHERE (T.RN = 1)
Is there a way to get the query builder to place that row number condition on the left join instead of the global WHERE clause?
I found a way to get this to work.
Add a calculated field to the main table with a value of 1.
Join the row number to this new calculated field.
Now the query has the filter in the join condition instead of the WHERE clause so that it joins as intended. Here is the query it made:
SELECT DISTINCT E.EMPLID, E.ACAD_CAREER, T.RN
FROM SYSADM.PS_STDNT_ENRL E
LEFT OUTER JOIN
(SELECT PS_ACAD_CAR_TBL.ACAD_CAREER,
ROW_NUMBER ()
OVER (PARTITION BY ACAD_CAREER ORDER BY EFFDT DESC)
AS RN
FROM SYSADM.PS_ACAD_CAR_TBL PS_ACAD_CAR_TBL) T
ON (E.ACAD_CAREER = T.ACAD_CAREER) AND (1 = T.RN)
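The difference between filtering in the WHERE clause and in the join condition can be reproduced on toy data. This is a sketch with Python's sqlite3 module (SQLite 3.25 or later), not Toad or Oracle; the tables enrl and car and their rows are made-up stand-ins for PS_STDNT_ENRL and PS_ACAD_CAR_TBL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrl (emplid TEXT, acad_career TEXT);
CREATE TABLE car (acad_career TEXT, effdt TEXT);
INSERT INTO enrl VALUES ('E1', 'UGRD'), ('E2', 'GRAD'), ('E3', 'LAW');
INSERT INTO car VALUES ('UGRD', '2020-01-01'), ('UGRD', '2021-01-01'),
                       ('GRAD', '2019-01-01');
""")

# Subquery that numbers the rows per career, newest effective date first.
ranked = (
    "(SELECT acad_career, ROW_NUMBER() OVER "
    "(PARTITION BY acad_career ORDER BY effdt DESC) AS rn FROM car)"
)

# Filter in WHERE: unmatched rows have rn = NULL and are dropped.
in_where = conn.execute(
    f"SELECT e.emplid, t.rn FROM enrl e LEFT JOIN {ranked} t "
    "ON e.acad_career = t.acad_career WHERE t.rn = 1 ORDER BY e.emplid"
).fetchall()

# Filter in ON: unmatched rows survive with NULL on the right-hand side.
in_on = conn.execute(
    f"SELECT e.emplid, t.rn FROM enrl e LEFT JOIN {ranked} t "
    "ON e.acad_career = t.acad_career AND t.rn = 1 ORDER BY e.emplid"
).fetchall()

print(in_where)
print(in_on)
```

E3 has no matching career row, so it only appears when the rn = 1 condition is part of the join.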

data not retrieved in the same order

I have a query which returns list of SIMs. Each SIM is linked to a Customer. The SIMs are in T_SIM table and Customers are in T_CUSTOMER table. There can be more than one SIM linked to a single Customer. When returning the SIMs it returns the Customer details also.
The T_SIM table has a foreign key to the T_CUSTOMER table.
The issue is:
1. Run the query requesting the top 100 records, ordered by CUSTOMER_CODE ascending.
2. Run the same query requesting the top 1000 records, ordered by CUSTOMER_CODE ascending.
In step 2, the first 100 of the 1000 records are not the same as the result of step 1. The records are shuffled; the order is not consistent.
To resolve this I have used ROWID along with order by CUSTOMER_CODE.
But the solution is not accepted by the client.
Could you please suggest an alternative way to resolve the issue? The data type of CUSTOMER_CODE is VARCHAR2.
Below is the query:
SELECT TT.SIM_ID,
TT.IMSI,
TT.MSISDN,
TT.SECONDARY_MSISDN,
TT.CUSTOMER_ID,
TT.SIM_STATE,
TCU.CUSTOMER_CODE
FROM T_SIM TT
LEFT OUTER JOIN T_CUSTOMER TCU
ON TT.CUSTOMER_ID = TCU.CUSTOMER_ID
WHERE 1 = 1
AND TT.SIM_ID IN
(SELECT SIM_ID
FROM
(SELECT *
FROM
(SELECT Z.*,
ROWNUM RNUM
FROM
(SELECT TT.SIM_ID
FROM T_SIM TT
LEFT OUTER JOIN T_CUSTOMER TCU
ON TT.CUSTOMER_ID = TCU.CUSTOMER_ID
WHERE 1 =1
ORDER BY TCU.CUSTOMER_CODE ASC
) Z
WHERE ROWNUM <= 1000
)
WHERE RNUM >= 0
)
)
ORDER BY TCU.CUSTOMER_CODE ASC
In both cases the result is ordered by CUSTOMER_CODE, but the SIMs belonging to each customer do not come back in the same order.
The problem is that you first limit the number of rows when selecting from t_sim (so these are selected randomly), and only then order your output.
What you should do is remove the ROWNUM <= 1000 filter from the inner query and
put it at the very top level, like this:
select * from
( SELECT TT.SIM_ID,
TT.IMSI,
TT.MSISDN,
TT.SECONDARY_MSISDN,
TT.CUSTOMER_ID,
TT.SIM_STATE,
TCU.CUSTOMER_CODE
FROM T_SIM TT
LEFT OUTER JOIN T_CUSTOMER TCU
ON TT.CUSTOMER_ID = TCU.CUSTOMER_ID
WHERE 1 = 1
AND TT.SIM_ID IN
(SELECT SIM_ID
FROM
(SELECT *
FROM
(SELECT Z.*,
ROWNUM RNUM
FROM
(SELECT TT.SIM_ID
FROM T_SIM TT
LEFT OUTER JOIN T_CUSTOMER TCU
ON TT.CUSTOMER_ID = TCU.CUSTOMER_ID
WHERE 1 =1
ORDER BY TCU.CUSTOMER_CODE ASC
) Z
)
WHERE RNUM >= 0
)
)
ORDER BY TCU.CUSTOMER_CODE ASC
) where rownum <= 1000
This is because you first want to build the complete ordered result set and only then display the top 1000 SIM records ordered by customer_code.
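The "order the complete set, then take the top N" idea can be sketched as follows. This uses Python's sqlite3 module instead of Oracle, with ORDER BY ... LIMIT standing in for the ROWNUM wrapper, and the table contents are invented. Note one assumption that is not part of the answer above: sim_id is added as a secondary sort key so that SIMs sharing a customer code have a defined order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_sim (sim_id INTEGER PRIMARY KEY, customer_code TEXT)"
)
# Hypothetical data: 50 SIMs spread over 5 customer codes, so customer_code
# alone does not define a unique order.
conn.executemany(
    "INSERT INTO t_sim VALUES (?, ?)",
    [(i, f"C{i % 5:02d}") for i in range(1, 51)],
)

def top_n(n):
    # Sort the full set first, then keep the first n rows.
    # sim_id is an assumed tie-breaker for rows with equal customer_code.
    return conn.execute(
        "SELECT sim_id, customer_code FROM t_sim "
        "ORDER BY customer_code, sim_id LIMIT ?",
        (n,),
    ).fetchall()

first_10 = top_n(10)
first_30 = top_n(30)

# The smaller page is a prefix of the larger one.
print(first_30[:10] == first_10)
```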

No Results returned for ROW_NUMBER() query

I am getting "no results returned" for the following query:
SELECT
Referer
FROM
(SELECT
ROW_NUMBER() OVER (ORDER BY CT.Referer ASC) AS RowNum,
CT.Referer, CT.LastModified
FROM
ClickTrack CT
JOIN
OrderTrack OT ON OT.ClickTrackID = CT.ClickTrackID
GROUP BY
CT.Referer, CT.LastModified
HAVING
LEN(CT.Referer) > 0) as num
WHERE
RowNum = 1
AND LastModified BETWEEN '07/06/2013' and '08/05/2013'
Curiously, when I leave off RowNum = 1, I get the full list of values. I need to get one at a time though to assign to a variable and drop into a temporary table.
The end query will be in a while loop using scalar variables in place of the date ranges and RowNum comparison.
Any help is appreciated. Thank you!
I suspect the row with RowNum = 1 does not have a LastModified date inside your range. Try moving the date filter inside the subquery, so that row numbers are assigned only to rows that match:
SELECT Referer
FROM (SELECT ROW_NUMBER() OVER (ORDER BY CT.Referer ASC)
AS RowNum, CT.Referer, CT.LastModified
FROM ClickTrack CT
JOIN OrderTrack OT ON OT.ClickTrackID = CT.ClickTrackID
WHERE CT.LastModified BETWEEN '07/06/2013' and '08/05/2013'
GROUP BY CT.Referer, CT.LastModified
HAVING LEN(CT.Referer) > 0) as num
WHERE RowNum = 1
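Here is a small sketch of why the position of the date filter matters. It runs on Python's sqlite3 module (SQLite 3.25 or later) rather than SQL Server; the clicks table, its rows, and the ISO date strings are assumptions made for the example, and the join and GROUP BY are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (referer TEXT, last_modified TEXT)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [
        ("a.example", "2013-01-15"),  # sorts first, but outside the date range
        ("b.example", "2013-07-10"),
        ("c.example", "2013-07-20"),
    ],
)

date_filter = "last_modified BETWEEN '2013-07-06' AND '2013-08-05'"

# Date filter outside: row 1 is a.example, which then fails the date test.
outside = conn.execute(
    "SELECT referer FROM ("
    "  SELECT ROW_NUMBER() OVER (ORDER BY referer) AS row_num,"
    "         referer, last_modified FROM clicks"
    f") num WHERE row_num = 1 AND {date_filter}"
).fetchall()

# Date filter inside: numbering starts at the first row that is in range.
inside = conn.execute(
    "SELECT referer FROM ("
    "  SELECT ROW_NUMBER() OVER (ORDER BY referer) AS row_num, referer"
    f"  FROM clicks WHERE {date_filter}"
    ") num WHERE row_num = 1"
).fetchall()

print(outside)
print(inside)
```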

Resources