Hive Query: How to use group by with rank?

I have a table like below
year int
month int
symbol string
company_name string
sector string
sub_industry string
state string
avg_open double
avg_close double
avg_low double
avg_high double
avg_volume double
The fields starting with avg_ hold the average value for a month of a given year. For each sector, I need to find the year in which the average of avg_close is the lowest.
I tried to do something like below
SELECT sector, year FROM
(
SELECT sector, year, RANK() OVER (ORDER BY s2.yearly_avg_close) AS RANK FROM
( SELECT year,sector, AVG(avg_close) AS yearly_avg_close FROM stock_summary GROUP BY sector, year) s2
) s1
WHERE
s1.RANK = 1;
But this is printing just one sector and year like below
Telecommunications Services 2010
I am new to hive and playing around with some toy schemas. Can someone let me know what should be the correct way of solving this?
Hive Version - 1.1.0

Add sector to the PARTITION BY clause of the RANK() function, so that the ranking restarts for each sector:
SELECT sector, year, RANK() OVER (partition by sector ORDER BY s2.yearly_avg_close) AS RANK
Add year to the partition as well if you need a rank per sector and year.
Read also this explanation how rank works: https://stackoverflow.com/a/55909947/2700344
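For reference, here is the whole query with the partition added, run against toy data. This is only a sketch: it uses Python's built-in sqlite3 module instead of Hive (an assumption; it needs SQLite 3.25 or later for window functions), the sample rows are invented, and the rank column is aliased rnk rather than RANK to avoid clashing with the function name.

```python
import sqlite3

# Toy rows (invented for illustration): sector, year, monthly avg_close.
sample_rows = [
    ("Energy", 2010, 50.0), ("Energy", 2010, 54.0),
    ("Energy", 2011, 40.0), ("Energy", 2011, 44.0),
    ("Telecom", 2010, 20.0), ("Telecom", 2010, 22.0),
    ("Telecom", 2011, 30.0), ("Telecom", 2011, 34.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock_summary (sector TEXT, year INTEGER, avg_close REAL)")
conn.executemany("INSERT INTO stock_summary VALUES (?, ?, ?)", sample_rows)

query = """
SELECT sector, year
FROM (
    -- PARTITION BY sector restarts the ranking for every sector
    SELECT sector, year,
           RANK() OVER (PARTITION BY sector ORDER BY yearly_avg_close) AS rnk
    FROM (
        SELECT sector, year, AVG(avg_close) AS yearly_avg_close
        FROM stock_summary
        GROUP BY sector, year
    ) s2
) s1
WHERE rnk = 1
ORDER BY sector
"""
lowest_year_per_sector = conn.execute(query).fetchall()
print(lowest_year_per_sector)  # [('Energy', 2011), ('Telecom', 2010)]
```

Without the PARTITION BY, all sector/year rows are ranked together, which is why the original attempt returned a single row.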

Related

How to use SUM and MAX in select statement more than one table

I have 2 Tables
table a mempur
memberno = member number
purdt = purchase date
amount = purchase amount
table b meminfo
memberno = member number
fname = first name
age = age
select a.memberno,b.fname,sum(a.amount),a.purdt,b.age from mempur a,(select max(purdt) as maxdate,memberno from mempur group by memberno) maxresult,meminfo b
where a.memberno=b.memberno
and a.purdt between '01-JAN-22' and '28-FEB-22'
and a.memberno=maxresult.memberno
and a.purdt=maxresult.maxdate
group by a.memberno,b.fname,a.purdt,b.age
order by a.memberno;
How can I get the total purchase amount and the latest purchase date per member from table mempur?
The query above returns a result, but the total amount for the date range is incorrect.
Any help is much appreciated.
my sample data
MEMBERNO PURDT AMOUNT
--------------- --------------- ---------
BBMY0004580 12-AUG-21 823.65
BBMY0004580 12-AUG-21 1709.1
BBMY0004580 26-AUG-21 1015.1
BBMY0004580 28-AUG-21 1105.1
My result only shows a total amount of 1105.1.

You can aggregate in mempur and then join to meminfo:
SELECT i.*, p.total_amount, p.maxdate
FROM meminfo i
INNER JOIN (
SELECT memberno, SUM(amount) total_amount, MAX(purdt) maxdate
FROM mempur
WHERE purdt BETWEEN '01-JAN-22' AND '28-FEB-22'
GROUP BY memberno
) p ON p.memberno = i.memberno;
You may use a LEFT join if there are members with no purchases which you want in the results.
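A quick way to check the aggregate-then-join idea is to run it on the sample rows from the question. The sketch below uses Python's sqlite3 module rather than Oracle (an assumption), stores dates as ISO text, invents the meminfo row, and uses August 2021 as the date range so that the sample rows fall inside it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meminfo (memberno TEXT, fname TEXT, age INTEGER);
CREATE TABLE mempur (memberno TEXT, purdt TEXT, amount REAL);
INSERT INTO meminfo VALUES ('BBMY0004580', 'Ann', 30);  -- name and age are invented
INSERT INTO mempur VALUES
  ('BBMY0004580', '2021-08-12', 823.65),
  ('BBMY0004580', '2021-08-12', 1709.1),
  ('BBMY0004580', '2021-08-26', 1015.1),
  ('BBMY0004580', '2021-08-28', 1105.1);
""")

query = """
SELECT i.memberno, i.fname, p.total_amount, p.maxdate
FROM meminfo i
INNER JOIN (
    -- one row per member: total and latest purchase inside the range
    SELECT memberno, SUM(amount) AS total_amount, MAX(purdt) AS maxdate
    FROM mempur
    WHERE purdt >= '2021-08-01' AND purdt < '2021-09-01'
    GROUP BY memberno
) p ON p.memberno = i.memberno
"""
member_totals = conn.execute(query).fetchall()
print(member_totals)
```

Because the sum is taken before the join and without filtering on the maximum date, all four purchases are counted instead of only the last one.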

Your query gets the maximum purdt and adds up the amounts for that date only. It also checks whether the maximum purdt is in January or February 2022. If it is, the row is shown; if it is not, the row is dropped. This is not the query you want.
Apart from that, the query looks rather ugly. You are using an ancient join syntax that is hard to read and prone to errors. We used that in the 1980s, but in 1992 explicit joins made it into the SQL standard. You should no longer use this old comma syntax. It is strange to see it still being used. Feels like going to a museum. Then, you are using table aliases. The idea of these is to make a query more readable, but yours reduce readability, because the alias names a and b are arbitrary. Use mnemonic names instead, e.g. mp for mempur and mi for meminfo. Then, you are comparing the date (I do hope purdt is a date!) with strings. Don't. Use date literals instead.
As to your tables: Are you really storing the age? You will have to update it daily to keep it up-to-date. Better store the date of birth and calculate the age from it in your queries.
Here is a query that gets you the maximum date and the total amount for the given date range:
select memberno, m.fname, p.sum_amount, p.max_purdt, m.age
from meminfo m
left outer join
(
select memberno, sum(amount) as sum_amount, max(purdt) as max_purdt
from mempur
where purdt >= date '2022-01-01' and purdt < date '2022-03-01'
group by memberno
) p using (memberno)
order by memberno;
And here is a query that gets you the maximum overall date along with the total amount for the given date range:
select memberno, m.fname, p.sum_amount, p.max_purdt, m.age
from meminfo m
left outer join
(
select
memberno,
sum(case when purdt >= date '2022-01-01' and purdt < date '2022-03-01'
then amount
end) as sum_amount,
max(purdt) as max_purdt
from mempur
group by memberno
) p using (memberno)
order by memberno;
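To see how the conditional sum behaves, here is a small sketch on invented rows. It runs in Python's sqlite3 module rather than Oracle (an assumption), with dates stored as ISO text; the member numbers M1 and M2 are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mempur (memberno TEXT, purdt TEXT, amount REAL);
INSERT INTO mempur VALUES
  ('M1', '2022-01-10', 100.0),
  ('M1', '2022-02-20', 50.0),
  ('M1', '2022-03-15', 999.0),   -- outside the range, but the latest purchase
  ('M2', '2021-12-31', 75.0);    -- no purchases inside the range at all
""")

query = """
SELECT memberno,
       -- CASE without ELSE yields NULL, which SUM ignores
       SUM(CASE WHEN purdt >= '2022-01-01' AND purdt < '2022-03-01'
                THEN amount END) AS sum_amount,
       MAX(purdt) AS max_purdt
FROM mempur
GROUP BY memberno
ORDER BY memberno
"""
totals = conn.execute(query).fetchall()
print(totals)  # [('M1', 150.0, '2022-03-15'), ('M2', None, '2021-12-31')]
```

Note that a member with purchases only outside the range gets a null sum rather than 0; wrap the sum in COALESCE if you prefer a zero.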

How to iterate over a hive table row by row and calculate metric when a specific condition is met?

I have a requirement as below:
I am trying to convert an MS Access table macro loop to work on a Hive table. The table, called trip_details, contains details about a specific trip taken by a truck. The truck can stop at multiple locations, and the type of stop is indicated by a flag called type_of_trip. This column contains values like arrival, departure, loading etc.
The ultimate aim is to calculate the dwell time of each truck (how much time the truck takes before setting off on another trip). To calculate this we have to iterate over the table row by row and check the trip type.
A typical example looks like this:
Do while end of file:
    Store the first row in a variable.
    Move to the second row.
    If the type_of_trip = Arrival:
        Move to the third row
        If the type_of_trip = End Trip:
            Store the third row
            Take the difference of timestamps to calculate dwell time
            Append the row into the output table
End
What is the best approach to tackle this problem in Hive?
I checked whether Hive has a keyword for loops but could not find one. I was thinking of doing this with a shell script, but I need guidance on how to approach it.
I cannot disclose the entire data but feel free to shoot any questions in the comments section.
Input
Trip ID  type_of_trip  timestamp        location
1        Departure     28/5/2019 15:00  Warehouse
1        Arrival       28/5/2019 16:00  Store
1        Live Unload   28/5/2019 16:30  Store
1        End Trip      28/5/2019 17:00  Store
Expected Output
Trip ID  Origin_location  Destination_location  Dwell_time
1        Warehouse        Store                 2 hours

You do not need a loop for this; a single SQL query can do it.
Convert your timestamps to seconds (using your format, 'dd/MM/yyyy HH:mm'), take the min and max per trip_id according to the trip type, subtract one from the other, and convert the difference in seconds to 'HH:mm' or any other format you prefer:
with trip_details as (--use your table instead of this subquery
select stack (4,
1,'Departure' ,'28/5/2019 15:00','Warehouse',
1,'Arrival' ,'28/5/2019 16:00','Store',
1,'Live Unload' ,'28/5/2019 16:30','Store',
1,'End Trip' ,'28/5/2019 17:00','Store'
) as (trip_id, type_of_trip, `timestamp`, location)
)
select trip_id, origin_location, destination_location,
from_unixtime(destination_time-origin_time,'HH:mm') dwell_time
from
(
select trip_id,
min(case when type_of_trip='Departure' then unix_timestamp(`timestamp`,'dd/MM/yyyy HH:mm') end) origin_time,
max(case when type_of_trip='End Trip' then unix_timestamp(`timestamp`,'dd/MM/yyyy HH:mm') end) destination_time,
max(case when type_of_trip='Departure' then location end) origin_location,
max(case when type_of_trip='End Trip' then location end) destination_location
from trip_details
group by trip_id
)s;
Result:
trip_id  origin_location  destination_location  dwell_time
1        Warehouse        Store                 02:00
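If it helps to see the same min/max-per-trip logic outside SQL, here is a rough Python sketch on the sample rows from the question. The function name dwell_times is made up for illustration. It keeps the difference as a duration instead of formatting it as a clock time; formatting a difference with from_unixtime treats it as a time of day, so it wraps after 24 hours and, as far as I know, can shift with the session time zone.

```python
from datetime import datetime

TS_FORMAT = "%d/%m/%Y %H:%M"  # same layout as 'dd/MM/yyyy HH:mm' in the query

trip_details = [
    (1, "Departure", "28/5/2019 15:00", "Warehouse"),
    (1, "Arrival", "28/5/2019 16:00", "Store"),
    (1, "Live Unload", "28/5/2019 16:30", "Store"),
    (1, "End Trip", "28/5/2019 17:00", "Store"),
]

def dwell_times(rows):
    """Return (trip_id, origin, destination, duration) for each complete trip."""
    trips = {}
    for trip_id, trip_type, stamp, location in rows:
        trip = trips.setdefault(trip_id, {})
        when = datetime.strptime(stamp, TS_FORMAT)
        if trip_type == "Departure":
            # earliest departure, like min(case when ... 'Departure')
            if "start" not in trip or when < trip["start"]:
                trip["start"], trip["origin"] = when, location
        elif trip_type == "End Trip":
            # latest end of trip, like max(case when ... 'End Trip')
            if "end" not in trip or when > trip["end"]:
                trip["end"], trip["destination"] = when, location
    return [
        (trip_id, t["origin"], t["destination"], t["end"] - t["start"])
        for trip_id, t in sorted(trips.items())
        if "start" in t and "end" in t
    ]

result = dwell_times(trip_details)
print(result[0][:3], result[0][3])  # (1, 'Warehouse', 'Store') 2:00:00
```

Trips that lack either a Departure or an End Trip row are skipped here, whereas the SQL version would return them with null columns.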

oracle's analytical function issue

Please let me know if the following is off topic, unclear, too specific, or too complex to understand. I think it is a challenge to describe, understand and solve.
CIF = cost, insurance, freight (basically it is the import value)
A simplified version of the input table (Import) looks like this:
[image: sample rows of the Import table]
So from January to June the value 1 is assigned to the SixMonthPeriod column, and the rest of the months are given the value 2.
I then want to calculate the unit price for a six-month period, so I use
select SixMonthPeriod, ProductDescrip, Sum(weight), Sum (CIF), (Sum (CIF))/(Sum(weight)) as UnitPrice
from Import
group by SixMonthPeriod, ProductDescrip;
This is fine, but I then want to calculate inflation for each product over a six-month period, for which I need lag (an Oracle analytic function). The six-month period has to be fixed. Thus, if the previous period for a particular product is missing, then its unit price should be zero. The calculation of inflation should start over for each product. The unit price and inflation equations look like the following, respectively:
unit price = (Sum(CIF) over a six-month period)/(Sum(weight) over a six-month period)
inflation = (current unit price - previous unit price)/(previous unit price)
I use the following SQL to calculate inflation for a six-month period for each product, where the calculation begins again for each product:
select Yr, SixMthPeriod, Product, UnitPrice, LagUnitPrice, ((UnitPrice -LagUnitPrice)/LagUnitPrice) as inflation
from (select Year as Yr, SixMonthPeriod as SixMthPeriod,
ProductDescrip as product, (Sum (CIF))/(Sum(weight)) as UnitPrice,
lag((Sum (CIF))/(Sum(weight)))
over (partition by ProductDescrip order by YEAR, SixMonthPeriod) as LagUnitPrice
From Import
group by Year, SixMonthPeriod, ProductDescrip)
The problem is that the inflation period is not fixed.
For example, I get the following result:
[image: query result]
The first two rows are fine. They should contain null values because they are the first rows for their products, so there is no LagUnitPrice and no inflation.
The third row has a problem: it takes 0.34 as the LagUnitPrice, but the value should be zero, because there is no row for barley in 2016 with SixMthPeriod=1. Oracle's analytic functions do not take missing rows into account.
How do I fix this problem (if you understand me)?
I have only 96 rows, so I could export the file to Excel and use Excel formulas to fix these exceptions.

You can generate the missing periods with a null price, join them to your data, and do the rest as you did:
select product, year, smp, price, prev_price, (price - prev_price) / prev_price inflation
from (
select product, year, smp, price,
lag(price) over (partition by product order by year, smp) prev_price
from (
select year, ProductDescrip product, SixMonthPeriod smp, sum(CIF)/sum(weight) price
from Import
group by year, SixMonthPeriod, ProductDescrip) a
full join (
select distinct year, productdescrip product, column_value smp
from import cross join table(sys.odcinumberlist(1, 2))) b
using (product, year, smp))
order by product, year, smp
Subquery b is responsible for generating all periods, you can run it separately to see what it produces.
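Here is a small runnable sketch of the same gap-filling idea on invented data. It runs in Python's sqlite3 module instead of Oracle (an assumption; it needs SQLite 3.25 or later), so the period list comes from a two-row CTE rather than sys.odcinumberlist, a LEFT JOIN from the generated grid replaces the FULL JOIN, and the table is called import_data here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE import_data (year INTEGER, smp INTEGER, product TEXT, cif REAL, weight REAL);
INSERT INTO import_data VALUES
  (2015, 1, 'barley', 10.0, 20.0),
  (2015, 2, 'barley', 30.0, 40.0),
  (2016, 2, 'barley', 60.0, 40.0);  -- 2016 period 1 is deliberately missing
""")

query = """
WITH periods AS (SELECT 1 AS smp UNION ALL SELECT 2),
grid AS (
    -- every (year, product) combined with both six-month periods
    SELECT DISTINCT d.year, d.product, p.smp
    FROM import_data d CROSS JOIN periods p
),
prices AS (
    SELECT year, product, smp, SUM(cif) / SUM(weight) AS price
    FROM import_data
    GROUP BY year, product, smp
)
SELECT g.product, g.year, g.smp, pr.price,
       LAG(pr.price) OVER (PARTITION BY g.product ORDER BY g.year, g.smp) AS prev_price
FROM grid g
LEFT JOIN prices pr
  ON pr.product = g.product AND pr.year = g.year AND pr.smp = g.smp
ORDER BY g.product, g.year, g.smp
"""
price_rows = conn.execute(query).fetchall()
for row in price_rows:
    print(row)
```

The generated 2016 period 1 row has a null price, so the 2016 period 2 row gets a null prev_price instead of silently picking up the 2015 value.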

using alias for cast in HIVE

I have a table called loan with loan amount, annual income, issue date (MMM-YY format) and member id. I am trying to find the highest loan amount in each period along with the annual income and member id details.
I tried to group the highest loan amount by period using this code:
select max(cast(loan_amt as int)),issue_d from loan group by issue_d;
Then I also wanted to fetch the member id and annual income, so I wrote the following code,
but it gives me an error message about using an alias for a column that is cast.
Code:
select a.loan_amt,a.member_id,a.annual_inc,a.issue_d
from
(select loan_amt,member_id,annual_inc,issue_d from loan) a
join
(select max(cast(loan_amt as int)) as ml,issue_d from loan group by issue_d) c
where ((a.issue_d=c.issue_d) and (a.loan_amt=a.ml));

What you want to do is rank the records by Amount within each Period, then keep only the top record for each Period.
Use one of the analytic functions that are designed exactly for that purpose -- Hive has pretty good support for the SQL standard on that topic.
Since you don't say what to do about ties (i.e. what if several loans have the same Amount???) I assume you want just one record, chosen arbitrarily...
select X, Y, Z, Period, Amount as TopAmount
from
(select X, Y, Z, Period, cast(StrAmt as double) as Amount,
row_number() over (partition by Period order by cast(StrAmt as double) desc) as TmpRank
from WTF
) TMPWTF
where TmpRank =1
If you want all the records with the top Amount, then replace row_number with rank or dense_rank (the "dense" stuff would make a difference for the top 2, but not for the top 1).
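To make the difference between row_number and rank concrete, here is a sketch using the column names from the question. It runs in Python's sqlite3 module rather than Hive (an assumption; SQLite 3.25 or later), the loan rows are invented, and top_loans is a made-up helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE loan (member_id INTEGER, loan_amt TEXT, annual_inc REAL, issue_d TEXT);
INSERT INTO loan VALUES
  (1, '5000',  40000, 'Dec-11'),
  (2, '12000', 65000, 'Dec-11'),
  (3, '12000', 58000, 'Dec-11'),   -- ties with member 2 for the top amount
  (4, '8000',  52000, 'Jan-12');
""")

def top_loans(ranking_function):
    """Top loan(s) per issue_d; pass only 'ROW_NUMBER' or 'RANK'."""
    query = f"""
    SELECT member_id, annual_inc, issue_d, amount
    FROM (
        SELECT member_id, annual_inc, issue_d,
               CAST(loan_amt AS REAL) AS amount,
               {ranking_function}() OVER (
                   PARTITION BY issue_d
                   ORDER BY CAST(loan_amt AS REAL) DESC
               ) AS rnk
        FROM loan
    ) ranked
    WHERE rnk = 1
    ORDER BY issue_d, member_id
    """
    return conn.execute(query).fetchall()

one_per_period = top_loans("ROW_NUMBER")  # exactly one row per issue_d
all_ties = top_loans("RANK")              # every row sharing the top amount
print(len(one_per_period), [row[0] for row in all_ties])
```

With ROW_NUMBER, which of the two tied Dec-11 loans is returned is not defined; add a tie-breaking column to the ORDER BY if that matters.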

Oracle query to fetch the previous value of a related row in same table

I have a table Student which has names and ratings by year.
Name  Year  Rating
Ram   2016  10
Sam   2016  9
Ram   2014  8
Sam   2012  7
I need to find the previous rating of each student, which could be from last year or from several years earlier.
The query should return the results below
Name  Cur_rating_year_2016  Prev_rating
Ram   10                    8
Sam   9                     7
Below is the script for insert and create
Create table Student (name varchar2(10), year number, rating number );
insert into student values('Ram' ,2016 ,10);
insert into student values('Sam' ,2016 ,9);
insert into student values('Sam' ,2012 ,7);
insert into student values('Ram' ,2014 ,8);
Is there a way to achieve this using select query?

Use the LAG analytic function: https://docs.oracle.com/database/122/SQLRF/LAG.htm#SQLRF00652
LAG is an analytic function. It provides access to more than one row
of a table at the same time without a self join. Given a series of
rows returned from a query and a position of the cursor, LAG provides
access to a row at a given physical offset prior to that position.
For the optional offset argument, specify an integer that is greater
than zero. If you do not specify offset, then its default is 1. The
optional default value is returned if the offset goes beyond the scope
of the window. If you do not specify default, then its default is
null.
SELECT name,
year,
rating,
LAG(rating, 1, NULL) OVER (PARTITION BY name ORDER BY year) AS prev_rating
FROM student
ORDER BY name, year;
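To get exactly the two-row output asked for, the LAG has to be computed over all years first and the 2016 filter applied afterwards. A sketch on the question's own data, run in Python's sqlite3 module rather than Oracle (an assumption; SQLite 3.25 or later):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (name TEXT, year INTEGER, rating INTEGER);
INSERT INTO student VALUES
  ('Ram', 2016, 10), ('Sam', 2016, 9), ('Sam', 2012, 7), ('Ram', 2014, 8);
""")

query = """
SELECT name, rating AS cur_rating, prev_rating
FROM (
    -- LAG must see every year, so it is computed before filtering
    SELECT name, year, rating,
           LAG(rating) OVER (PARTITION BY name ORDER BY year) AS prev_rating
    FROM student
) with_prev
WHERE year = 2016
ORDER BY name
"""
ratings_2016 = conn.execute(query).fetchall()
print(ratings_2016)  # [('Ram', 10, 8), ('Sam', 9, 7)]
```

Filtering on year = 2016 inside the inner query would leave LAG with nothing to look back at, and prev_rating would always be null.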

Try as:
SELECT A.NAME, A.RATING, B.RATING AS PREV_RATING FROM
STUDENT A INNER JOIN STUDENT B
ON A.NAME = B.NAME
WHERE A.YEAR = 2016 AND B.YEAR <> 2016
ORDER BY A.NAME ASC
