Hive Grouping and calculating average by calculating distinct

Folks, we have one weird requirement in Hive and we are not able to write a query for it.
Basically, we have the following data:
CUSTOMER_NAME, PRODUCT_NAME, PRICE, OCCURANCE_ID
customer1, product1, 20, 1
customer1, product2, 30, 2
customer1, product1, 25, 3
customer1, product1, 20, 1
customer1, product2, 20, 2
What we have to do is list the average price for each (customer_name, product_name) pair, counting each occurrence only once.
E.g. for the combination (customer1, product1), the price for product1 is
(25 + 20) / 2 = 22.5, where 2 is the number of distinct occurrences for the customer (1 and 3). But since we also want to group by PRODUCT_NAME, we do not know how to calculate the distinct occurrences. I have marked the part of the query where we feel we need to change something with [] brackets.
The other aspect is the inner query: there we want to select the customers whose average price over distinct occurrences falls in the top 5. (This works properly, as the group by clause has only one attribute, CUSTOMER_NAME.)
select cprd.customer_name, cprd.product_name, [sum(cprd.price)/count(distinct cprd.occurance_id)]
from customer_prd cprd
join (select customer_name, sum(price)/count(distinct occurance_id) as avg_price
from customer_prd
group by customer_name
order by avg_price desc
limit 5) cprdd
on cprd.customer_name = cprdd.customer_name
group by cprd.customer_name, cprd.product_name
Expected output:
customer1, product1: (20 (avg for occurance ID 1) + 25 (avg for occurance ID 3)) / 2 = 22.5
customer1, product2: (30 + 20) / 2 = 25

If I understand correctly, the only trouble here is that you have duplicate rows. Once you remove the duplicate occurrences, it's a simple group by and average:
select customer_name, product_name, avg(price)
from (
select distinct customer_name, product_name, price, occurance_id from customer_prd
) t
group by customer_name, product_name
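To check the deduplicate-then-average idea against the sample rows from the question, here is a minimal sketch using Python's built-in sqlite3 module rather than Hive. The table and column names follow the question; the function name `average_per_product` is made up for illustration.

```python
import sqlite3

# Sample rows from the question: (customer_name, product_name, price, occurance_id).
ROWS = [
    ("customer1", "product1", 20, 1),
    ("customer1", "product2", 30, 2),
    ("customer1", "product1", 25, 3),
    ("customer1", "product1", 20, 1),  # exact duplicate of the first row
    ("customer1", "product2", 20, 2),
]

def average_per_product(rows):
    """Average price per (customer, product) after dropping duplicate rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_prd ("
        "customer_name TEXT, product_name TEXT, price REAL, occurance_id INTEGER)"
    )
    conn.executemany("INSERT INTO customer_prd VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT customer_name, product_name, AVG(price)
        FROM (
            SELECT DISTINCT customer_name, product_name, price, occurance_id
            FROM customer_prd
        ) t
        GROUP BY customer_name, product_name
        ORDER BY customer_name, product_name
        """
    ).fetchall()
    conn.close()
    return result

averages = average_per_product(ROWS)
print(averages)
```

On the sample data this gives 22.5 for product1 and 25.0 for product2, which matches the expected output in the question.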

Related

Max number of counts in a particular hour

I have a table called Orders. I want to get the hour with the maximum number of orders for each day. So far I have the following query:
SELECT
trunc(created,'HH') as dated,
count(*) as Counts
FROM
orders
WHERE
created > trunc(SYSDATE -2)
group by trunc(created,'HH') ORDER BY counts DESC
This returns the counts for all hours. The result looks good, but now I want only the row with the maximum count for each day,
e.g.
for 12/23/2019 max number of counts is 90 for "12/23/2019 4:00:00 PM",
for 12/22/2019 max number of counts is 25 for "12/22/2019 3:00:00 PM"
required dataset
1 12/23/2019 4:00:00 PM 90
2 12/24/2019 12:00:00 PM 76
3 12/22/2019 1:00:00 PM 25

This could be a solution, and in my opinion it is the simplest one.
Use the WITH clause to define a subquery, then keep only the rows for which no other row on the same date has a greater count.
WITH ORD AS (
SELECT
trunc(created,'HH') as dated,
count(*) as Counts
FROM
orders
WHERE
created > trunc(SYSDATE-2)
group by trunc(created,'HH')
)
SELECT *
FROM ORD ord
WHERE NOT EXISTS (
SELECT 'X'
FROM ORD ord1
WHERE trunc(ord1.dated) = trunc(ord.dated) AND ord1.Counts > ord.Counts
)
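One property of the NOT EXISTS form is that it keeps every hour that ties for the daily maximum. The sketch below shows this with Python's sqlite3 module; the timestamps are hypothetical sample data, and SQLite's `strftime` and `date` stand in for Oracle's `TRUNC(created, 'HH')` and `TRUNC(dated)`.

```python
import sqlite3

# Hypothetical order timestamps; 2019-12-24 has two hours tied at 2 orders each.
CREATED = [
    "2019-12-22 13:05:00", "2019-12-22 13:40:00", "2019-12-22 15:10:00",
    "2019-12-23 09:30:00", "2019-12-23 16:01:00", "2019-12-23 16:20:00",
    "2019-12-23 16:59:00",
    "2019-12-24 10:00:00", "2019-12-24 10:30:00",
    "2019-12-24 12:00:00", "2019-12-24 12:15:00",
]

QUERY = """
WITH ord AS (
    -- Bucket each order into its hour and count the orders per bucket.
    SELECT strftime('%Y-%m-%d %H:00', created) AS dated,
           COUNT(*) AS counts
    FROM orders
    GROUP BY strftime('%Y-%m-%d %H:00', created)
)
SELECT dated, counts
FROM ord o
WHERE NOT EXISTS (
    -- Reject an hour if another hour on the same day has a higher count.
    SELECT 1
    FROM ord o1
    WHERE date(o1.dated) = date(o.dated) AND o1.counts > o.counts
)
ORDER BY dated
"""

def busiest_hours(created):
    """Return (hour, count) rows holding each day's maximum, ties included."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (created TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?)", [(c,) for c in created])
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows

busiest = busiest_hours(CREATED)
print(busiest)
```

With this data, 2019-12-24 appears twice in the result (10:00 and 12:00), so a day can contribute more than one row when hours tie.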

Use ROW_NUMBER analytic function over your original query and filter the rows with number 1.
You need to partition on the day, i.e. TRUNC(dated) to get the correct result
with ord1 as (
SELECT
trunc(created,'HH') as dated,
count(*) as Counts
FROM
orders
WHERE
created > trunc(SYSDATE -2)
group by trunc(created,'HH')
),
ord2 as (
select dated, Counts,
row_number() over (partition by trunc(dated) order by Counts desc) as rn
from ord1)
select dated, Counts
from ord2
where rn = 1
The advantage of ROW_NUMBER is that it handles ties predictably, i.e. cases where several hours in a day share the same maximal count. The query returns only one record per day, and you can control which one with the ORDER BY, e.g. to show the first or the last such hour.
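A small sketch of that tie-breaking control, again with Python's sqlite3 (window functions need SQLite 3.25 or newer). The hourly counts are hypothetical and already aggregated, and the `tie_break` parameter is an illustrative name, not part of the original query.

```python
import sqlite3

# Hypothetical pre-aggregated hourly counts; 2019-12-24 has a tie at 2.
HOURLY = [
    ("2019-12-23 09:00", 1),
    ("2019-12-23 16:00", 3),
    ("2019-12-24 10:00", 2),
    ("2019-12-24 12:00", 2),
]

def top_hour_per_day(hourly, tie_break="ASC"):
    """Return exactly one (hour, count) row per day.

    tie_break="ASC" keeps the earliest tied hour, "DESC" the latest.
    """
    if tie_break not in ("ASC", "DESC"):
        raise ValueError("tie_break must be 'ASC' or 'DESC'")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ord1 (dated TEXT, counts INTEGER)")
    conn.executemany("INSERT INTO ord1 VALUES (?, ?)", hourly)
    # tie_break is validated above, so interpolating it here is safe.
    query = f"""
        SELECT dated, counts
        FROM (
            SELECT dated, counts,
                   ROW_NUMBER() OVER (
                       PARTITION BY date(dated)
                       ORDER BY counts DESC, dated {tie_break}
                   ) AS rn
            FROM ord1
        )
        WHERE rn = 1
        ORDER BY dated
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows

earliest = top_hour_per_day(HOURLY, "ASC")
latest = top_hour_per_day(HOURLY, "DESC")
print(earliest, latest)
```

Without the second ORDER BY key, which of the tied hours gets row number 1 is left to the database.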

You can use the analytical function ROW_NUMBER as following to get the desired result:
SELECT DATED, COUNTS
FROM (
SELECT
TRUNC(CREATED, 'HH') AS DATED,
COUNT(*) AS COUNTS,
ROW_NUMBER() OVER(
PARTITION BY TRUNC(CREATED)
ORDER BY COUNT(*) DESC NULLS LAST
) AS RN
FROM ORDERS
WHERE CREATED > TRUNC(SYSDATE - 2)
GROUP BY TRUNC(CREATED, 'HH'), TRUNC(CREATED)
)
WHERE RN = 1
Cheers!!

ORDER BY BASED ON COLUMN

I have two tables, PRODUCTS and LOOKUP. I want to order the rows of the PRODUCTS table by their KEY column, following the order of the values listed in the F_KEY column of the LOOKUP table.
CREATE TABLE PRODUCTS
(
ID INT,
KEY VARCHAR(50)
)
INSERT INTO PRODUCTS
VALUES (1, 'EGHS'), (2, 'PFE'), (3, 'EGHS'),
(4, 'PFE'), (5, 'ABC')
CREATE TABLE LOOKUP (F_KEY VARCHAR(50))
INSERT INTO LOOKUP VALUES('PFE,EGHS,ABC')
Now I want to order the records in PRODUCTS table based on KEY (PFE,EGHS,ABC) values in LOOKUP table.
Example output:
PRODUCTS
ID KEY
-----------
2 PFE
4 PFE
1 EGHS
3 EGHS
5 ABC
I use this query, but it is not working
SELECT *
FROM PRODUCTS
ORDER BY (SELECT F_KEY FROM LOOKUP)

You can split the string using XML. You first need to convert the string to XML and replace the comma with start and end XML tags.
Once done, you can assign an incrementing number using ROW_NUMBER() like following.
;WITH cte
AS (SELECT dt,
Row_number()
OVER(
ORDER BY (SELECT 1)) RN
FROM (SELECT Cast('<X>' + Replace(F.f_key, ',', '</X><X>')
+ '</X>' AS XML) AS xmlfilter
FROM [lookup] F)F1
CROSS apply (SELECT fdata.d.value('.', 'varchar(500)') AS DT
FROM f1.xmlfilter.nodes('X') AS fdata(d)) O)
SELECT P.*
FROM products P
LEFT JOIN cte C
ON C.dt = P.[key]
ORDER BY C.rn
Output:
ID KEY
-----------
2 PFE
4 PFE
1 EGHS
3 EGHS
5 ABC

You may do it like this:
SELECT ID, [KEY] FROM PRODUCTS
ORDER BY
CASE [KEY]
WHEN 'PFE' THEN 1
WHEN 'EGHS' THEN 2
WHEN 'ABC' THEN 3
END
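The CASE version hard-codes the order. As an alternative sketch that reads the order from the LOOKUP row instead, each key can be ranked by its position inside the comma-separated string. This uses Python's sqlite3 and SQLite's `instr`, not SQL Server (where `CHARINDEX` plays a similar role); the function name `order_by_lookup` is made up for illustration.

```python
import sqlite3

def order_by_lookup():
    """Order products by where their key appears in the lookup string."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE products (id INT, "key" VARCHAR(50));
        INSERT INTO products VALUES
            (1, 'EGHS'), (2, 'PFE'), (3, 'EGHS'), (4, 'PFE'), (5, 'ABC');
        CREATE TABLE lookup (f_key VARCHAR(50));
        INSERT INTO lookup VALUES ('PFE,EGHS,ABC');
        """
    )
    rows = conn.execute(
        """
        SELECT p.id, p."key"
        FROM products p
        CROSS JOIN lookup l
        -- Wrapping both sides in commas avoids partial matches (e.g. 'AB'
        -- inside 'ABC'). Keys missing from the list get position 0 and
        -- therefore sort first.
        ORDER BY instr(',' || l.f_key || ',', ',' || p."key" || ','), p.id
        """
    ).fetchall()
    conn.close()
    return rows

ordered = order_by_lookup()
print(ordered)
```

This assumes LOOKUP holds a single row, as in the question; with several rows the cross join would repeat each product.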

Oracle - Group by the most frequent entries

I have 2 tables on my Oracle DB
One with a product list
PRODUCT_ID - PRODUCT_NAME - PRODUCT_PRICE
1 P_1 50
2 P_2 60
3 P_3 70
4 P_4 80
And one with the orders
CLIENT_ID - PRODUCT_ID - ORDER_PRICE
1 1 50
2 3 60
3 2 70
4 2 70
I need a query that returns the product_list table ordered by how frequently each Product_id appears in the orders table. So in this case the product with ID 2 must be first on the list.
I have found some examples, but I can't find anything that works for this case.

You can use a subquery that aggregates the orders table to get the count for each product id, then left join it to the product_list table and use the calculated count for ordering.
select p.*
from product_list p
left join (
select product_id,
count(*) as cnt
from orders
group by product_id
) o on p.product_id = o.product_id
order by o.cnt desc nulls last;

LEFT JOIN is used since not all products may have orders, and we still need a count (possibly zero) for each product.
GROUP BY is used because we use the aggregate count() to find the number of orders for a given product.
ORDER BY ... DESC is used so the products are ordered from the highest order count to the lowest. However, when ties exist, the order of the tied rows is undefined unless a second ordering level is given, so Product_ID is added here to sort ties from low to high.
SELECT PL.Product_ID, PL.Product_Name, PL.Product_Price, count(O.Product_ID) cnt
FROM Product_List PL
LEFT JOIN Orders O
on O.Product_ID = PL.Product_ID
GROUP BY PL.Product_ID, PL.Product_Name, PL.Product_Price
ORDER BY cnt DESC, PL.Product_ID
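To see the left-join-and-count ordering on the question's sample data, here is a minimal sketch with Python's sqlite3 instead of Oracle. Table and column names follow the question; `products_by_popularity` is a hypothetical helper name, and ties are broken by product_id as an assumption about the desired order.

```python
import sqlite3

def products_by_popularity():
    """Return (product_id, order_count), most-ordered products first."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE product_list (
            product_id INTEGER, product_name TEXT, product_price INTEGER);
        INSERT INTO product_list VALUES
            (1, 'P_1', 50), (2, 'P_2', 60), (3, 'P_3', 70), (4, 'P_4', 80);
        CREATE TABLE orders (
            client_id INTEGER, product_id INTEGER, order_price INTEGER);
        INSERT INTO orders VALUES
            (1, 1, 50), (2, 3, 60), (3, 2, 70), (4, 2, 70);
        """
    )
    rows = conn.execute(
        """
        SELECT pl.product_id, COUNT(o.product_id) AS cnt
        FROM product_list pl
        LEFT JOIN orders o ON o.product_id = pl.product_id
        GROUP BY pl.product_id
        -- COUNT(o.product_id) is 0 for products without orders.
        ORDER BY cnt DESC, pl.product_id
        """
    ).fetchall()
    conn.close()
    return rows

popularity = products_by_popularity()
print(popularity)
```

Product 2 comes first with two orders, products 1 and 3 follow with one each, and product 4 is last with none.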

One column calculate multiple output

I have to show the total product sales on a YTD (Year to Date), QTD (Quarter to Date) or MTD (Month to Date) basis, but only one of them at a time. The user selects one option, like picking one of several radio buttons, and the output is generated for that selection. The input can be any of YTD, QTD or MTD. I don't know how to calculate an output column when the input can vary.
I have a Product table:
Product_ID Product_name Price
1 Mobile 200
2 T.V. 400
3 Mixer 300
I have a Sales table like this-
Product_ID Sales_Date Quantity
1 01-01-2015 30
2 03-01-2015 40
3 06-02-2015 10
1 22-03-2015 30
2 09-04-2015 10
3 21-05-2015 40
1 04-06-2015 40
2 29-07-2015 30
1 31-08-2015 30
3 14-09-2015 30
My output has 3 columns:
Product_id, Product_Name and Total_Amount.
The column Total_Amount (quantity * price) has to be calculated according to the input given by the user, i.e.
if it is YTD, it should be the total sales from the first day of the year (01-01-2015) to the current date (sysdate);
if it is QTD, from the start of the quarter the current date falls in, e.g. if the current month is September, from 1 July to the current date (sysdate);
if it is MTD, from the start of the month the current date falls in to the current date (sysdate).
Can anyone help? Thanks!

-- step 1: one view holding the per-product quantity for each period
create or replace view my_admin
as
select 'YTD' element, product_id, sum(quantity) sum_quantity
from sales
where Sales_date between trunc(sysdate,'Y') and sysdate
group by product_id
union all
select 'QTD', product_id, sum(quantity) sum_quantity
from sales
where Sales_date between trunc(sysdate,'Q') and sysdate
group by product_id
union all
select 'MTD', product_id, sum(quantity) sum_quantity
from sales
where Sales_date between trunc(sysdate,'MM') and sysdate
group by product_id;
-- step 2: pick the period chosen by the user (:input is 'YTD', 'QTD' or 'MTD')
select a.product_id, p.product_name, (a.sum_quantity * p.PRICE) total_amount
from my_admin a
inner join products p on a.product_id = p.product_id
where a.element = :input;
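The period boundaries described in the question can also be computed explicitly and passed to a single query. The sketch below does that with Python and sqlite3 on the question's sample data; `period_start`, `total_amount` and the fixed "as of" date 2015-09-20 are illustrative assumptions (in Oracle the date would come from sysdate).

```python
import sqlite3
from datetime import date

PRODUCTS = [(1, "Mobile", 200), (2, "T.V.", 400), (3, "Mixer", 300)]
# Sales from the question, with dates rewritten as ISO strings (YYYY-MM-DD)
# so that plain string comparison orders them correctly.
SALES = [
    (1, "2015-01-01", 30), (2, "2015-01-03", 40), (3, "2015-02-06", 10),
    (1, "2015-03-22", 30), (2, "2015-04-09", 10), (3, "2015-05-21", 40),
    (1, "2015-06-04", 40), (2, "2015-07-29", 30), (1, "2015-08-31", 30),
    (3, "2015-09-14", 30),
]

def period_start(today, period):
    """First day of the year, quarter or month that contains `today`."""
    if period == "YTD":
        return date(today.year, 1, 1)
    if period == "QTD":
        return date(today.year, 3 * ((today.month - 1) // 3) + 1, 1)
    if period == "MTD":
        return date(today.year, today.month, 1)
    raise ValueError("period must be 'YTD', 'QTD' or 'MTD'")

def total_amount(period, today):
    """Return (product_id, product_name, total_amount) for the period."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (product_id INT, product_name TEXT, price INT)")
    conn.execute("CREATE TABLE sales (product_id INT, sales_date TEXT, quantity INT)")
    conn.executemany("INSERT INTO product VALUES (?, ?, ?)", PRODUCTS)
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", SALES)
    rows = conn.execute(
        """
        SELECT p.product_id, p.product_name, SUM(s.quantity * p.price)
        FROM product p
        JOIN sales s ON s.product_id = p.product_id
        WHERE s.sales_date BETWEEN ? AND ?
        GROUP BY p.product_id, p.product_name
        ORDER BY p.product_id
        """,
        (period_start(today, period).isoformat(), today.isoformat()),
    ).fetchall()
    conn.close()
    return rows

as_of = date(2015, 9, 20)  # hypothetical "current date"
mtd = total_amount("MTD", as_of)
qtd = total_amount("QTD", as_of)
ytd = total_amount("YTD", as_of)
print(mtd, qtd, ytd)
```

Products with no sales in the chosen period are simply absent from the result, since this uses an inner join.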

My presumption is that you have 3 radio buttons (bind variables :YTD, :QTD and :MTD in my example), where the user can pick just one value at a time and the rest will be null.
You can use something like this to get what you want:
select a.PRODUCT_ID, a.PRODUCT_NAME, SUM(B.QUANTITY*a.PRICE) as TOTAL_AMOUNT
from PRODUCTS a
inner join SALES B on a.PRODUCT_ID=B.PRODUCT_ID
where
(:YTD is null or B.SALES_DATE between trunc(sysdate, 'YYYY') and sysdate)
and
(:QTD is null or TO_CHAR(B.SALES_DATE, 'YYYY-Q')=TO_CHAR(sysdate, 'YYYY-Q'))
and
(:MTD is null or TO_CHAR(B.SALES_DATE, 'YYYY-MM')=TO_CHAR(sysdate, 'YYYY-MM'))
group by a.PRODUCT_ID, a.PRODUCT_NAME;

SELECT only rows that aren't repeated

So I have a table like this. This is a standard Order header - Order Detail table:
order_id order_line
----------- -----------
100 1
100 2
100 3
101 1
102 1
103 1
103 2
104 1
105 1
Now, how can I make a SELECT that will only pick the orders that only have one line?
In this case I don't want orders 100 and 103.
Thanks!
Tiago

Counting lines using "group by order_id" is a good solution. However, if line numbers always start at 1 and are unique within an order, counting is not needed and the simpler max function works fine:
select order_id from orders
group by order_id
having max(order_line)=1;
If order_line values are consecutive within an order, a further "optimization" is possible, since it is enough to look at lines 1 and 2:
select order_id from orders
where order_line <= 2
group by order_id
having max(order_line)=1;

Group by order_id and keep only the groups that have exactly 1 record:
select order_id
from orders
group by order_id
having count(*) = 1
If you need the complete record then do
select t1.*
from orders t1
join
(
select order_id
from orders
group by order_id
having count(*) = 1
) t2 on t1.order_id = t2.order_id
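As a quick check of the count-based filter on the sample rows from the question, here is a sketch using Python's sqlite3. The table name `orders` follows the answers above; `single_line_orders` is an illustrative helper name.

```python
import sqlite3

# (order_id, order_line) rows from the question.
LINES = [
    (100, 1), (100, 2), (100, 3),
    (101, 1),
    (102, 1),
    (103, 1), (103, 2),
    (104, 1),
    (105, 1),
]

def single_line_orders(lines):
    """Return the ids of orders that have exactly one line."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, order_line INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", lines)
    rows = conn.execute(
        """
        SELECT order_id
        FROM orders
        GROUP BY order_id
        HAVING COUNT(*) = 1
        ORDER BY order_id
        """
    ).fetchall()
    conn.close()
    return [order_id for (order_id,) in rows]

singles = single_line_orders(LINES)
print(singles)
```

Orders 100 and 103 are excluded because they have more than one line.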

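You can also try the following query. Note that the grouping must be on order_id alone; grouping on order_line as well would put every row in its own group and return all orders:
select order_id, min(order_line) as order_line
from Order_Detail
group by order_id
having count(*) < 2;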
