Sample Data
DATE WindDirection
1/1/2000 SW
1/2/2000 SW
1/3/2000 SW
1/4/2000 NW
1/5/2000 NW
Question below
Every day is unique, and wind direction is not, so now we are trying to get the COUNT of the most COMMON wind direction.
select w.wind_direction as most_common_wd
from (
select wind_direction, count(*) as cnt
from weather
group by wind_direction
order by cnt desc
) w
limit 1;
Currently this query runs, however it outputs the count of ALL rows. I am only interested in the count of each type of wind direction: it outputs south with a count of 170000, but the answer should only be 10,000.
I am aware something is wrong with the way count is being used; I think I have to specify an alias and count by specific wind_direction, but I can't translate that into syntax.
Looks like you want the most common wind direction from your data
select wind_direction, count(*) as cnt
from weather
group by wind_direction
order by cnt desc
limit 1;
If there are multiple wind directions with the same max count, then get the max count and use it in the HAVING clause to get the most common wind directions:
select wind_direction
from weather
group by wind_direction
having count(*) = (
    select max(c)
    from (
        select wind_direction, count(*) as c
        from weather
        group by wind_direction
    ) a
);
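On a database that supports window functions, the same tie-aware result can also be sketched with RANK() as an alternative to the HAVING approach above:

select wind_direction, cnt
from (
    select wind_direction,
           count(*) as cnt,
           -- rank() gives 1 to every direction sharing the highest count
           rank() over (order by count(*) desc) as rnk
    from weather
    group by wind_direction
) w
where rnk = 1;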
Related
I have 2 Tables
table a mempur
memberno = member number
purdt = purchase date
amount = purchase amount
table b meminfo
memberno = member number
fname = first name
age = age
select a.memberno, b.fname, sum(a.amount), a.purdt, b.age
from mempur a,
     (select max(purdt) as maxdate, memberno from mempur group by memberno) maxresult,
     meminfo b
where a.memberno=b.memberno
and a.purdt between '01-JAN-22' and '28-FEB-22'
and a.memberno=maxresult.memberno
and a.purdt=maxresult.maxdate
group by a.memberno,b.fname,a.purdt,b.age
order by a.memberno;
How do I get a result with the total purchase amount and the highest purchase date from table mempur?
I used this query and it shows a result, but the total amount is incorrect for the date range.
Any help is much appreciated.
my sample data
MEMBERNO PURDT AMOUNT
--------------- --------------- ---------
BBMY0004580 12-AUG-21 823.65
BBMY0004580 12-AUG-21 1709.1
BBMY0004580 26-AUG-21 1015.1
BBMY0004580 28-AUG-21 1105.1
My result only shows a total amount of 1105.1.
You can aggregate in mempur and then join to meminfo:
SELECT i.*, p.total_amount, p.maxdate
FROM meminfo i
INNER JOIN (
SELECT memberno, SUM(amount) total_amount, MAX(purdt) maxdate
FROM mempur
WHERE purdt BETWEEN '01-JAN-22' AND '28-FEB-22'
GROUP BY memberno
) p ON p.memberno = i.memberno;
You may use a LEFT join if there are members with no purchases whom you still want in the results.
Your query gets the maximum purdt and adds up the amount for this date. It also checks whether the maximum purdt is in January or February 2022. If it is, the result gets shown; if it is not, you don't show the result. This is not the query you want.
Apart from that, the query looks rather ugly. You are using an ancient join syntax that is hard to read and prone to errors. We used that in the 1980s, but in 1992 explicit joins made it into the SQL standard. You should no longer use this old comma syntax; it is strange to see it still being used. Feels like going to a museum.
Then, you are using table aliases. The idea of these is to make a query more readable, but yours even lessen readability, because the alias names a and b are arbitrary. Use mnemonic names instead, e.g. mp for mempur and mi for meminfo.
Then, you are comparing the date (I do hope purdt is a date!) with strings. Don't. Use date literals instead.
As to your tables: Are you really storing the age? You will have to update it daily to keep it up-to-date. Better store the date of birth and calculate the age from it in your queries.
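For example, with a hypothetical date_of_birth column in meminfo instead of age, the age can be derived on the fly (a sketch, assuming Oracle):

select memberno,
       fname,
       -- months_between/12, truncated, gives the age in whole years
       trunc(months_between(sysdate, date_of_birth) / 12) as age
from meminfo;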
Here is a query that gets you the maximum date and the total amount for the given date range:
select memberno, m.fname, p.sum_amount, p.max_purdt, m.age
from meminfo m
left outer join
(
select memberno, sum(amount) as sum_amount, max(purdt) as max_purdt
from mempur
where purdt >= date '2022-01-01' and purdt < date '2022-03-01'
group by memberno
) p using (memberno)
order by memberno;
And here is a query that gets you the maximum overall date along with the total amount for the given date range:
select memberno, m.fname, p.sum_amount, p.max_purdt, m.age
from meminfo m
left outer join
(
select
memberno,
sum(case when purdt >= date '2022-01-01' and purdt < date '2022-03-01'
then amount
end) as sum_amount,
max(purdt) as max_purdt
from mempur
group by memberno
) p using (memberno)
order by memberno;
I am trying to get a maximum value from a count, selecting the two labels srcip and the max, but every time I include srcip I have to use group by srcip at the end, and it gives me results as if the max wasn't even there.
When I write the query like this it gives me the correct max value but I want to select srcip as well.
Select max(count1) as maximum
from (SELECT srcip,count(srcip) as count1 from data group by srcip)t;
But when I do include srcip in the select, I get a result as if there was no max function:
Select srcip,max(count1) as maximum
from (SELECT srcip,count(srcip) as count1 from data group by srcip)t
group by srcip;
I would expect a single row from this, but I get multiple.
Anyone has any ideas?
You may do ORDER BY count DESC with LIMIT 1 to get the srcip with the MAX count.
SELECT srcip, count(srcip) as count1
from data group by srcip
ORDER BY count1 DESC LIMIT 1
Let's consider you have data like this:
Table
Let's see what happens to the data when you run the following query.
Query
SELECT srcip,count(srcip) as count1 from data group by srcip
Output: table1
Now let's see what happens when you run your outer query on the above table.
Select srcip,max(count1) as maximum from table1 group by srcip
Same Output
The reason is that your query says to select srcip and the maximum of count from each group of srcip, and we have 3 groups, so 3 rows.
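If you want exactly one row with both the srcip and its maximum count without using LIMIT, one option (a sketch, not taken from the answers here) is to compare the per-srcip counts against their overall maximum:

select t.srcip, t.count1 as maximum
from (select srcip, count(srcip) as count1 from data group by srcip) t
where t.count1 in (
    -- uncorrelated subquery: the single highest count across all groups
    select max(count1)
    from (select srcip, count(srcip) as count1 from data group by srcip) t2
);

Note that this returns more than one row if several srcip values tie for the maximum.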
The query below returns exactly one row having the max count and the associated srcip. This is the query based on the expected result; you should look more into SQL and the earlier comments first, then progress to Hive analytical queries.
Some people could argue that there is a better way to optimize this query for your expected result, but this should give you the motivation to look more into Hive analytical queries.
select srcip, count1 as maximum
from (
    select srcip, count1,
           row_number() over (order by count1 desc) as row_num
    from (select srcip, count(srcip) as count1 from data group by srcip) t
) q1
where row_num = 1;
I'm trying to calculate the average of FIRST_CONTACT_CAL_DAYS, but what I would like to do is create an indicator for the top and bottom 10% of values so I can exclude those (outliers) from my average calculation.
Not sure how to go about doing this; any thoughts?
SELECT DISTINCT
TO_CHAR(A.FIRST_ASSGN_DT,'DAY') AS DAY_NUMBER,
A.FIRST_ASSGN_DT,
A.FIRST_CONTACT_DT,
TO_CHAR(A.FIRST_CONTACT_DT,'DAY') AS DAY_NUMBER2,
A.FIRST_CONTACT_DT AS FIRST_PHONE_CONTACT,
A.ID,
ABS(TO_DATE(A.FIRST_CONTACT_DT, 'DD/MM/YYYY') - TO_DATE(A.FIRST_ASSGN_DT, 'DD/MM/YYYY')) AS FIRST_CONTACT_CAL_DAYS
FROM HIST A
LEFT JOIN CONTACTS D ON A.ID = D.ID
WHERE 1=1
You may be looking for something like this. Please adapt to your situation.
I assume you may have more than one "group" or "partition" and you need to compute the average for each group separately, after throwing out the outliers in each partition. (An alternative, which can be easily accommodated by adapting the query below, is to throw out the outliers at the global level, and only then to group and take the average for each group.)
If you don't have any groups, and everything is one big pile of data, it's even easier - you don't need GROUP BY and PARTITION BY.
Then: the function NTILE assigns a bucket number, in this example between 1 and 10, to each row, based on where they fall (first decile, i.e. first 10%, next decile, ... all the way to the last decile). I do this in a subquery. Then in the outer query just filter out the first and last buckets before you group by and compute the average.
For testing purposes I create three groups with 10,000 random numbers each in a WITH clause - no need to spend any time on that portion of the code, since it is not part of the solution (the SQL code to solve your problem) - it's just a dirty trick to create test data on the fly.
with
inputs ( grp, val ) as (
select ceil(level/10000), dbms_random.value(0, 150)
from dual
connect by level <= 30000
)
select grp, avg(val) as avg_val
from (
select grp, val, ntile(10) over (partition by grp order by val) as bkt
from inputs
)
where bkt between 2 and 9
group by grp
;
GRP AVG_VAL
--- -----------------------
1 75.021614866547043734458
2 74.286117923344418598032
3 75.437412573353736953791
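Adapted to the HIST table from the question, the same idea looks roughly like this (a sketch only; it assumes FIRST_ASSGN_DT and FIRST_CONTACT_DT are DATE columns and that no grouping is needed):

select avg(first_contact_cal_days) as avg_first_contact_cal_days
from (
    select abs(a.first_contact_dt - a.first_assgn_dt) as first_contact_cal_days,
           -- ntile(10) buckets the rows into deciles by the day difference
           ntile(10) over (order by abs(a.first_contact_dt - a.first_assgn_dt)) as bkt
    from hist a
)
where bkt between 2 and 9;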
Having some trouble with oracle pagination. Case:
Table with > 1 billion rows:
Measurement(Id Number, Classification VARCHAR, Value NUMBER)
Index:
ON Measurement(Value)
I need a query that gets the first match and the following 2000 matches ordered by Value. I also would like to use the index.
First idea:
SELECT * FROM Measurement WHERE Value >= 1234567890
AND ROWNUM <= 2000 ORDER BY Value ASC
Result:
The query just returns the first 2000 cases it can find in the table, starting from the top, where Value is higher or equal to 1234567890, and then orders that resultset ascending.
Second idea:
SELECT * FROM
(SELECT * FROM Measurement WHERE Value >= 1234567890 ORDER BY Value ASC)
WHERE ROWNUM <= 2000
Result:
Oracle does not understand that ROWNUM should limit the amount from the inner query, so Oracle decides to get all rows where Value is greater or equal to 1234567890 first, and then order that giant resultset before returning the first 2000 rows. Because Oracle is guessing that most of the data in the table will be returned, it ignores any use of the index as well.
None of these approaches are acceptable as the first one gives the wrong results, and the second one takes hours.
Is pagination supported at all in Oracle?
You can use the following
SELECT * FROM
(SELECT Id, Classification, Value, ROWNUM Rank FROM Measurement WHERE Value >= 1234567890)
WHERE Rank <= 2000
order by Rank
You do not need to order in the sub-query. Simply unnecessary.
The above is not pagination but the first page, I would suppose.
Not sure if you got the solution for your problem, but to put in my two cents:
The first query will not answer your requirements as it will fetch 2000 random records that satisfy your query and then do an order by.
Coming to the second query:
Oracle will first execute the inner query and only then move to the outer query, so the ROWNUM filter is applied only after the inner query has run.
You can try the approach below to do an INDEX FAST FULL SCAN. I have tested it on a table with 2.76 million rows and it has a lower cost than the other approach:
SELECT * from Measurement
where value in ( SELECT VALUE FROM
(SELECT Value FROM Measurement
WHERE Value >= 1234567890 ORDER BY Value ASC)
WHERE ROWNUM <= 2000)
Hope it Helps
Vishad
I think I have found a potential solution. However, it's not a query.
declare
cursor c is
SELECT * FROM Measurement WHERE Value >= 1234567890 ORDER BY Value ASC;
l_rec c%rowtype;
begin
open c;
for i in 1 .. 2000
loop
fetch c into l_rec;
exit when c%notfound;
end loop;
close c;
end;
/
Kindly experiment with more options
SELECT *
FROM( SELECT /*+ FIRST_ROWS(2000) */
Id,
Classification,
Value,
ROW_NUMBER() OVER (ORDER BY Value) AS rn
FROM Measurement
where Value > 1234567889
)
WHERE rn <=2000;
Update 1: Force the use of the index on Value. Here IDX_ON_VALUE is the name of the index on Value in Measurement.
SELECT * FROM
    (SELECT /*+ INDEX(a IDX_ON_VALUE) */ *
     FROM Measurement a
     WHERE a.Value >= 1234567890
     ORDER BY a.Value ASC)
WHERE ROWNUM <= 2000
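If you are on Oracle 12c or later, the row-limiting clause can express this kind of top-N fetch directly; a sketch against the same Measurement table:

SELECT *
FROM Measurement
WHERE Value >= 1234567890
ORDER BY Value
FETCH FIRST 2000 ROWS ONLY;

This removes the need for the nested ROWNUM subquery and typically lets the optimizer use the index on Value to stop early.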
I have a table created with one column named states and another column called land area. I am using Oracle 11g. I have looked at various questions on here and cannot find a solution. Here is what I have tried so far:
SELECT LandAreas, State
FROM ( SELECT LandAreas, State, DENSE_RANK() OVER (ORDER BY State DESC) sal_dense_rank
FROM Map )
WHERE sal_dense_rank >= 5;
This does not give me the top 5 land areas by their numeric value.
I have also tried this one but no go either:
SELECT * FROM (SELECT * FROM Map order by State desc)
where rownum < 5;
Anyone have any suggestions to get me on the right track??
Here is a sample of the table:
states land areas
michagan 15000
florida 25000
tennessee 10000
alabama 80000
new york 150000
california 20000
oregon 5000
texas 6000
utah 3000
nebraska 1000
Desired output from query:
States land area
new york 150000
alabama 80000
florida 25000
california 20000
Try:
Select * from
(SELECT State, LandAreas FROM Map ORDER BY LandAreas DESC)
where rownum < 6
Use a HAVING clause and count the number of states that are larger:
SELECT m.state, m.landArea
FROM Map m
LEFT JOIN Map m2 on m2.landArea > m.landArea
GROUP BY m.state, m.landArea
HAVING count(*) < 5
ORDER BY m.landArea DESC
This joins each state to every state whose area is greater, then uses a HAVING clause to return only those states where the number of larger states was less than 5.
Ties are all returned, leading to more than 5 rows in the case of a tie for 5th.
The left join is needed for the case of the largest state, which has no other larger state to join to.
The ORDER BY is optional.
Try something like this
select m.states,m.landarea
from map m
where (select count('x') from map m2 where m2.landarea > m.landarea) < 5
order by m.landarea desc
There are two bloomers in your posted code.
You need to use landarea in the DENSE_RANK() call. At the moment you're ordering the states in reverse alphabetical order.
Your filter in the outer query is the wrong way around: you're excluding the top four results.
Here is what you need ...
SELECT LandAreas, State
FROM ( SELECT LandAreas
            , State
            , DENSE_RANK() OVER (ORDER BY LandAreas DESC) as area_dr
       FROM Map )
WHERE area_dr <= 5
order by area_dr;
(I'm going with the statement in the question that you want the top 5 biggest states and ignoring the fact that your desired result set has only four rows. But adjust the outer filter as you will.)
There are three different functions for deriving top-N result sets: DENSE_RANK, RANK and ROW_NUMBER.
Using ROW_NUMBER will always guarantee you 5 rows in the result set, but you may get the wrong result if there are several states with the same land area (unlikely in this case, but other data sets will produce such clashes). So: 1,2,3,4,5
The difference between RANK and DENSE_RANK is how they handle ties. DENSE_RANK always produces a series of consecutive numbers, regardless of how many rows there are in each rank. So: 1,2,2,3,3,3,4,5
RANK on the other hand will produce a sparse series if a given rank has more than one hit. So: 1,2,2,4,4,4.
Note that each of the example result sets has a different number of rows. Which one is correct? It depends on the precise question you want to ask.
Using a sorted sub-query with the ROWNUM pseudo-column will work like the ROW_NUMBER function, but I prefer using ROW_NUMBER because it is more powerful and more error-proof.
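To see the difference on the question's own data, here is a quick side-by-side sketch (using the Map table and LandAreas column from the question):

SELECT State,
       LandAreas,
       ROW_NUMBER() OVER (ORDER BY LandAreas DESC) AS rn,
       RANK()       OVER (ORDER BY LandAreas DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY LandAreas DESC) AS dns_rnk
FROM Map
ORDER BY LandAreas DESC;

With the sample data there are no ties, so all three columns agree; introduce a tie and rn, rnk and dns_rnk diverge exactly as described above.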