Hive Getting error on group by column while using case statements and aggregations - hadoop

I am working on a query in Hive that uses aggregations such as sum, case statements, and a group by clause. I have changed the column and table names, but the logic is the same as in my project.
select
empname,
empsal,
emphike,
sum(empsal) as tot_sal,
sum(emphike) as tot_hike,
case when tot_sal > 1000 then exp(tot_hike)
else 0
end as manager
from employee
group by
empname,
empsal,
emphike
For the above query I get the error "Expression not in group by key '1000'".
So I modified the query slightly and tried again:
select
empname,
empsal,
emphike,
sum(empsal) as tot_sal,
sum(emphike) as tot_hike,
case when sum(empsal) > 1000 then exp(sum(emphike))
else 0
end as manager
from employee
group by
empname,
empsal,
emphike
For this query I get the error "Expression not in group by key 'Manager'".
When I add manager to the group by, it reports an invalid alias.
Please help me out here.

I see three issues in your query:
1.) Hive cannot refer to an alias defined in the select list within the same query block, so neither tot_sal inside the case expression nor manager in the group by clause is visible. You will probably need a subquery for that.
2.) In my experience, Hive tends to show errors when sum or count operations do not come last in the select list.
3.) Although I do not know what your goal is, I think your query will not deliver the desired result. If you group by empsal, every row in a group has the same empsal, so sum(empsal) is just empsal multiplied by the number of rows in the group. The same goes for emphike and sum(emphike).
I think the following query might solve these issues:
select
  a.empname,
  a.tot_sal,
  a.tot_hike,
  if(a.tot_sal > 1000, exp(a.tot_hike), 0) as manager
from
  (select
     empname,
     sum(empsal) as tot_sal,
     sum(emphike) as tot_hike
   from employee
   group by
     empname
  ) a
The if function is equivalent to your case expression; I just find it a bit easier to read.
In this example you do not need a group by after the subquery because the grouping is already done inside subquery a.

Related

ORACLE APEX-ORA 01722 Invalid Number ERROR

I don't understand why I am getting this error.
SELECT
to_char(view_date, 'Month') MONAT ,
COUNT(*) AS countx
FROM
AXY_TABLE
GROUP BY
to_char(view_date,'Month')
ORDER BY
to_char(view_date, 'Month'),
COUNT(*) desc;
When I run this query for an Interactive Report, it throws ORA-01722. The query runs correctly in SQL Developer and also as a Classic Report. When I change the type to Interactive Report, it throws the same error again.
What should I do?
Thanks a lot in advance.
Is the problem the ORDER BY clause? Try removing the aggregation from it:
SELECT
to_char(view_date, 'Month') MONAT ,
COUNT(*) AS countx
FROM
AXY_TABLE
GROUP BY
to_char(view_date,'Month')
ORDER BY
to_char(view_date, 'Month'),
2 desc;
This orders by the position of the second column of the projection. However, strictly speaking it is unnecessary, as your result set will contain only one row per month, so you only need to sort by that.
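As a sketch of that simplification, the query could sort on the month expression alone. Whether this also avoids the ORA-01722 in the Interactive Report is not confirmed here; it only shows the shorter ORDER BY.

```sql
-- Same grouping as above; one row per month name, so one sort key is enough.
SELECT
  to_char(view_date, 'Month') AS monat,
  COUNT(*) AS countx
FROM
  axy_table
GROUP BY
  to_char(view_date, 'Month')
ORDER BY
  to_char(view_date, 'Month');
```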

HIVE equivalent of FIRST and LAST

I have a table with five columns:
table1: ID, CODE, RESULT, RESULT2, RESULT3
I have this SAS code:
data table1;
set table1;
BY ID CODE;
IF FIRST.CODE and RESULT='A' THEN OUTPUT;
ELSE IF LAST.CODE and RESULT NE 'A' THEN OUTPUT;
RUN;
So we are grouping the data by ID and CODE, and then writing to the dataset if certain conditions are met. I want to write a hive query to replicate this. This is what I have:
proc sql;
create table temp as
select *, row_number() over (partition by ID, CODE) as rowNum
from table1;
create table temp2 as
select a.ID, a.CODE, a.RESULT, a.RESULT2, a.RESULT3
from temp a
inner join (select ID, CODE, max(rowNum) as maxRowNum
from temp
group by ID, CODE) b
on a.ID=b.ID and a.CODE=b.CODE
where (a.rowNum=1 and a.RESULT='A') or (a.rowNum=b.maxRowNum and a.RESULT NE 'A');
quit;
There are two issues I see with this.
1) Which row is first or last in each BY group depends entirely on the order of rows in table1 in SAS; we aren't ordering by anything. I don't think row order is preserved when translating to a Hive query.
2) The SAS code is taking the first row in each BY GROUP or the last, not both. I think that my HIVE query is taking both, resulting in more rows than I want.
Any suggestions or insight on how to improve my query is appreciated. Is it even possible to replicate this SAS code in HIVE?
The SAS code has a BY statement (BY ID CODE;), which tells SAS that the input dataset is sorted by those variables. So FIRST. and LAST. are not random selections.
That said, we can replicate this in HIVE by using the first_value and last_value window functions.
FIRST.CODE should replicate to
first_value(code) over (partition by Id order by code)fcode
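Similarly, LAST.CODE would be
last_value(code) over (partition by Id order by code rows between unbounded preceding and unbounded following)lcode
The explicit frame is needed because, with an order by, the default window frame ends at the current row, so last_value would otherwise just return the current row's code.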
Once you have the fcode and lcode columns, use case when statements for the result column criteria. Like,
case when (code=fcode and result='A') or (code=lcode and result<>'A')
then 1 else 0 end as op_flag
Then fetch the rows where op_flag = 1.
SAMPLE
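select id, code, result from (
  select *,
    first_value(code) over (partition by id order by code) fcode,
    -- full-partition frame, otherwise last_value stops at the current row
    last_value(code) over (partition by id order by code
                           rows between unbounded preceding and unbounded following) lcode
  from footab) f
where (code=fcode and result='A') or (code=lcode and result<>'A')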
Regarding point 1): BY-group processing requires the input data to be sorted or indexed on the BY variables, so although the code contains no ordering, the source data is processed in order. If the input data is not indexed or sorted, SAS will throw an error.
The remaining source of differences is rows with the same values of the BY variables, especially if their RESULT differs.
In SAS, I would pre-sort the data by ID, CODE, RESULT and then use BY ID CODE, so that the outcome is not influenced by the physical order of rows.
Regarding 2): FIRST and LAST can both be true for the same row in SAS. Since your conditions on RESULT for first and last are mutually exclusive, I guess this is not a source of differences.
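I guess you could add another field that numbers the rows in reverse order (this needs an order by; here RESULT, matching the pre-sort above):
row_number() over (partition by ID, CODE order by RESULT desc) as rowNumDesc
to detect the last row with rowNumDesc = 1 (so that you can skip the join).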
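Putting the pieces together, a minimal sketch of a join-free query might look like the following. Ordering by RESULT is an assumption borrowed from the pre-sort suggestion; replace it with whatever column(s) reproduce the row order SAS sees inside each BY group.

```sql
-- Sketch only: "order by RESULT" is an assumed ordering, not taken from the
-- original data. Rows that tie on the ordering column are numbered arbitrarily.
select ID, CODE, RESULT, RESULT2, RESULT3
from (
  select t.*,
         row_number() over (partition by ID, CODE order by RESULT)      as rowNum,
         row_number() over (partition by ID, CODE order by RESULT desc) as rowNumDesc
  from table1 t
) x
where (rowNum = 1 and RESULT = 'A')        -- FIRST.CODE and RESULT = 'A'
   or (rowNumDesc = 1 and RESULT <> 'A');  -- LAST.CODE and RESULT NE 'A'
```

Because the two conditions on RESULT exclude each other, a row can match at most one branch, which mirrors the IF / ELSE IF in the SAS step.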
EDIT:
I think both programs above select rows arbitrarily within groups that share the same values of ID and CODE, especially when RESULT is also the same. But you should get the same number of rows from both. If not, just debug it.
However, the arbitrary choice in the SAS code is based on the physical order of rows in storage, while ROW_NUMBER's ordering within a group depends on how the engine implements the function.

What is the result of an Oracle query with COUNT and no parameters

I'm comfortable with SQL Server, but not so much with Oracle.
I've got a query that looks something like the following:
SELECT umr.region, payee_name, COUNT, corporate_office_name
FROM payers, offices,
(
SELECT region, h.payee_name, COUNT, company_name FROM someTable h, someTable2
GROUP BY region, h.payee_name, company_name
) umr
WHERE ...
I know that the example isn't complete, but the key to the question is what is the COUNT in the SELECT statement telling Oracle to do?
I'd say it means there's a column called count in one of the tables.
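To illustrate why that reading is plausible: COUNT is a keyword in Oracle but not a reserved word, so it can be used as an ordinary column name. A small sketch, with a hypothetical table name demo_counts:

```sql
-- demo_counts is a made-up table used only to show that a column can be named COUNT.
CREATE TABLE demo_counts (
  region VARCHAR2(20),
  count  NUMBER
);

-- Without parentheses, COUNT here is a plain column reference, not the aggregate.
SELECT region, count
FROM demo_counts;
```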

Using rownum in subquery

In an algorithm the user passes a query, for instance:
SELECT o_orderdate, o_orderpriority FROM h_orders WHERE rownum <= 5
The query returns the following:
1996-01-02 5-LOW
1996-12-01 1-URGENT
1993-10-14 5-LOW
1995-10-11 5-LOW
1994-07-30 5-LOW
The algorithm needs the count for the select attributes (o_orderdate, o_orderpriority in the above example) and therefore it rewrites the query to:
SELECT o_orderdate, count(o_orderdate) FROM
(SELECT o_orderdate, o_orderpriority FROM h_orders WHERE rownum <= 5)
GROUP BY o_orderdate
This query returns the following:
1992-01-01 5
However the intended result is:
1996-12-01 1
1995-10-11 1
1994-07-30 1
1996-01-02 1
1993-10-14 1
Any idea how I could rewrite the parsing stage or how the user could pass a syntactically different query to receive the above results?
The rows returned by the inner query are essentially non-deterministic, as they depend on the order in which the optimiser identifies rows as part of the required data set. A change in execution plan due to modified predicates might change the order in which the rows come back, and new rows added to the table can also change which rows are included.
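One way to make the inner row selection repeatable is to sort before applying rownum. A sketch, assuming that "the five earliest order dates" is an acceptable definition of the rows wanted (the choice of o_orderdate as the sort key is an assumption, not something the question specifies):

```sql
-- rownum is assigned before ORDER BY in the same query block,
-- so the sort has to happen one level further in.
SELECT o_orderdate, COUNT(o_orderdate) AS order_count
FROM (
  SELECT o_orderdate
  FROM (SELECT o_orderdate FROM h_orders ORDER BY o_orderdate)
  WHERE rownum <= 5
)
GROUP BY o_orderdate;
```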
If you always want n rows, then either use distinct(o_orderdate) in the inner query, which will render the GROUP BY useless, or add another outer select with rownum to get n of the grouped rows, like this:
select o_orderdate, counter from
(
SELECT o_orderdate, count(o_orderdate) as counter FROM
(SELECT o_orderdate, o_orderpriority FROM h_orders)
GROUP BY o_orderdate
)
WHERE rownum <= 5
The results will most likely be useless, though, as they will be non-deterministic (as mentioned by David Aldridge).
As your outer query makes no use of "o_orderpriority", why not just get rid of the subquery and simply query like this:
SELECT o_orderdate, count(o_orderdate) AS order_count
FROM h_orders
WHERE rownum <= 5
GROUP BY o_orderdate

oracle query error: exact fetch return more than requested no of rows

I have two tables seatinfo(siid,seatno,classid,tsid) and booking (bookid,siid,date,status).
I have the input parameters bookDate, v_tsId, and v_clsId. I need exactly one row (bookid) to be returned. This query is not working and I don't know why. How can I fix it?
select bookid
into v_bookid
from booking
where (to_char(booking.bookdate,'dd-mon-yy'))=(to_char(bookDate,'dd-mon-yy'))
and status=0
and rownum <= 1
and siid in(select siid
from seatinfo
where tsid=v_tsId
and classid= v_clsId);
I also tried this:
select bookid
into v_bookid
from booking,
seatinfo
where booking.siid=seatinfo.siid
and (to_char(booking.bookdate,'dd-mon-yy'))=(to_char(bookDate,'dd-mon-yy'))
and booking.status=0
and rownum <= 1
and seatinfo.tsid=v_tsId
and seatinfo.classid= v_clsId;
Are you saying that you get an "ORA-01422: exact fetch returns more than requested number of rows" when you run both of those queries? That seems highly unlikely since you're including the predicate rownum <= 1. Can you cut and paste from a SQL*Plus session that runs just this query in a PL/SQL block and generates the error?
If you are not complaining about the error you mention in the title, and the problem is just that you're not getting the data you expect, the likely problem is that you apparently have a bookDate parameter that has the same name as a column in your table. That is not going to work. When you say
(to_char(booking.bookdate,'dd-mon-yy'))=(to_char(bookDate,'dd-mon-yy'))
you presumably mean to compare the bookDate column in the booking table against the bookDate parameter. But since column names take precedence over local variables, the unqualified bookDate on the right-hand side of your expression also resolves to the bookDate column in the booking table. So you're comparing a column to itself. It would make much more sense to change the name of the parameter (to, say, p_bookDate) and then write
booking.bookDate = p_bookDate
or, if you want to do the comparison ignoring the time component of the dates
trunc( booking.bookDate ) = trunc( p_bookDate )
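Applied to the first query from the question, the rename might look like this. It is a sketch that assumes the parameter has been renamed to p_bookDate as suggested and that v_bookid, v_tsId, and v_clsId are declared in the surrounding PL/SQL block.

```sql
-- Fragment of a PL/SQL block; p_bookDate is the renamed parameter.
SELECT bookid
  INTO v_bookid
  FROM booking
 WHERE TRUNC(booking.bookdate) = TRUNC(p_bookDate)  -- column vs. parameter, time ignored
   AND booking.status = 0
   AND booking.siid IN (SELECT siid
                          FROM seatinfo
                         WHERE tsid = v_tsId
                           AND classid = v_clsId)
   AND rownum <= 1;
```

Note that SELECT ... INTO raises NO_DATA_FOUND when no row matches, so the calling block may want to handle that case.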

Resources