Hive: Sum over a specified group (HiveQL)

Hive: Sum over a specified group (HiveQL) - hadoop

I have a table:
key product_code cost
1 UK 20
1 US 10
1 EU 5
2 UK 3
2 EU 6
I would like to find the sum of all products for each group of "key" and append to each row. For example for key = 1, find the sum of costs of all products (20+10+5=35) and then append result to all rows which correspond to the key = 1. So end result:
key product_code cost total_costs
1 UK 20 35
1 US 10 35
1 EU 5 35
2 UK 3 9
2 EU 6 9
I would prefer to do this without using a sub-join as this would be inefficient. My best idea would be to use the over function in conjunction with the sum function but I cant get it to work. My best try:
SELECT key, product_code, sum(costs) over(PARTITION BY key)
FROM test
GROUP BY key, product_code;
Iv had a look at the docs but there so cryptic I have no idea how to work out how to do it. Im using Hive v0.12.0, HDP v2.0.6, HortonWorks Hadoop distribution.

Similar to #VB_ answer, use the BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING statement.
The HiveQL query is therefore:
SELECT key, product_code,
SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test;

You could use BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW to achieve that without a self join.
Code as below:
SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM T;

The analytics function sum gives cumulative sums. For example, if you did:
select key, product_code, cost, sum(cost) over (partition by key) as total_costs from test
then you would get:
key product_code cost total_costs
1 UK 20 20
1 US 10 30
1 EU 5 35
2 UK 3 3
2 EU 6 9
which, it seems, is not what you want.
Instead, you should use the aggregation function sum, combined with a self join to accomplish this:
select test.key, test.product_code, test.cost, agg.total_cost
from (
select key, sum(cost) as total_cost
from test
group by key
) agg
join test
on agg.key = test.key;

This query gives me perfect result
select key, product_code, cost, sum(cost) over (partition by key) as total_costs from zone;

similar answer (if we use oracle emp table):
select deptno, ename, sal, sum(sal) over(partition by deptno) from emp;
output will be like below:
deptno ename sal sum_window_0
10 MILLER 1300 8750
10 KING 5000 8750
10 CLARK 2450 8750
20 SCOTT 3000 10875
20 FORD 3000 10875
20 ADAMS 1100 10875
20 JONES 2975 10875
20 SMITH 800 10875
30 BLAKE 2850 9400
30 MARTIN 1250 9400
30 ALLEN 1600 9400
30 WARD 1250 9400
30 TURNER 1500 9400
30 JAMES 950 9400

The table above looked like
key product_code cost
1 UK 20
1 US 10
1 EU 5
2 UK 3
2 EU 6
The user wanted a tabel with the total costs like the following
key product_code cost total_costs
1 UK 20 35
1 US 10 35
1 EU 5 35
2 UK 3 9
2 EU 6 9
Therefor we used the following query
SELECT key, product_code,
SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test;
So far so good.
I want a column more, counting the occurences of each country
key product_code cost total_costs occurences
1 UK 20 35 2
1 US 10 35 1
1 EU 5 35 2
2 UK 3 9 2
2 EU 6 9 2
Therefor I used the following query
SELECT key, product_code,
SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as total_costs
COUNT(product code) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as occurences
FROM test;
Sadly this is not working. I get an cryptic error. To exclude an error in my query I want to ask if I did something wrong.
Thanks

Related

PL SQL function that includes multiple tables

I'm new to PL SQL and have to write a function, which has customer_id as an input and has to output a product_name of the best selling product for that customer_id.
The schema looks like this:
I found a lot of simple examples where it includes two tables, but I can't seem to find one where you have to do multiple joins and use a function, while selecting only the best selling product.
I could paste a lot of very bad code here and how I tried to approach this, but this seems to be a bit over my head for current knowledge, since I've been learning PL SQL for less than 3 days now and got this task.

With some sample data (minimal column set):
SQL> select * from products order by product_id;
PRODUCT_ID PRODUCT_NAME
---------- ----------------
1 BMW
2 Audi
SQL> select * From order_items;
PRODUCT_ID CUSTOM QUANTITY UNIT_PRICE
---------- ------ ---------- ----------
1 Little 100 1
1 Little 200 2
2 Foot 300 3
If we check some totals:
SQL> select o.product_id,
2 o.customer_id,
3 sum(o.quantity * o.unit_price) total
4 from order_items o
5 group by o.product_id, o.customer_id;
PRODUCT_ID CUSTOM TOTAL
---------- ------ ----------
2 Little 400
1 Little 100
2 Foot 900
SQL>
It says that
for customer Little, product 2 was sold with total = 400 - that's our choice for Little
for customer Little, product 1 was sold with total = 100
for customer Foot, product 2 was sold with total = 900 - that's our choice for Foot
Query might then look like this:
temp CTE calculates totals per each customer
rank_them CTE ranks them in descending order per each customer; row_number so that you get only one product, even if there are ties
finally, select the one that ranks as the highest
SQL> with
2 temp as
3 (select o.product_id,
4 o.customer_id,
5 sum(o.quantity * o.unit_price) total
6 from order_items o
7 group by o.product_id, o.customer_id
8 ),
9 rank_them as
10 (select t.customer_id,
11 t.product_id,
12 row_number() over (partition by t.customer_id order by t.total desc) rn
13 from temp t
14 )
15 select * From rank_them;
CUSTOM PRODUCT_ID RN
------ ---------- ----------
Foot 2 1 --> for Foot, product 2 ranks as the highest
Little 2 1 --> for Little, product 1 ranks as the highest
Little 1 2
SQL>
Moved to a function:
SQL> create or replace function f_product (par_customer_id in order_items.customer_id%type)
2 return products.product_name%type
3 is
4 retval products.product_name%type;
5 begin
6 with
7 temp as
8 (select o.product_id,
9 o.customer_id,
10 sum(o.quantity * o.unit_price) total
11 from order_items o
12 group by o.product_id, o.customer_id
13 ),
14 rank_them as
15 (select t.customer_id,
16 t.product_id,
17 row_number() over (partition by t.customer_id order by t.total desc) rn
18 from temp t
19 )
20 select p.product_name
21 into retval
22 from rank_them r join products p on p.product_id = r.product_id
23 where r.customer_id = par_customer_id
24 and r.rn = 1;
25
26 return retval;
27 end;
28 /
Function created.
SQL>
Testing:
SQL> select f_product ('Little') result from dual;
RESULT
--------------------------------------------------------------------------------
Audi
SQL> select f_product ('Foot') result from dual;
RESULT
--------------------------------------------------------------------------------
Audi
SQL>
Now, you can improve it so that you'd care about no data found issue (when customer didn't buy anything), ties (but you'd then return a collection or a refcursor instead of a scalar value) etc.
[EDIT] I'm sorry, ORDERS table has to be included into the temp CTE; your data model is correct, you don't have to do anything about it - my query was wrong (small screen + late hours issue; not a real excuse, just saying).
So:
with
temp as
(select i.product_id,
o.customer_id,
sum(i.quantity * i.unit_price) total
from order_items i join orders o on o.order_id = i.order_id
group by i.product_id, o.customer_id
),
The rest of my code is - otherwise - unmodified.

HOW can I add a sort row column?

Im using view on oracle plsql. In the table showing my sales,
I want to have a column showing the sequence number next to the sales of top 50 products.
The best selling products should be listed and followed by the sequence number in the row.
how can i do that?
Thanks.
This is my related query
NVL (
(SELECT ROUND (
SUM (
CASE DOCUMENT_TYPE
WHEN 2
THEN
(CASE TRANSACTION_TYPE
WHEN 0 THEN 0 - AMOUNT
ELSE AMOUNT
END)
ELSE
(CASE TRANSACTION_TYPE
WHEN 1 THEN 0 - AMOUNT
ELSE AMOUNT
END)
END),
8)
FROM TBL_TRANSACTION_LINES
WHERE STORE_NO = tbl_location.locationno
AND (TRANSACTION_TYPE NOT IN (10, 30))
AND TRANSACTION_DATE >
TO_DATE ('2020-09-27 0:0:0',
'yyyy-mm-dd HH24:MI:SS')
AND TRANSACTION_DATE <=
TO_DATE ('2020-10-04 0:0:0',
'yyyy-mm-dd HH24:MI:SS')
AND (URUNID = TBL_URUNLER.URUNID)),
0)
AS RESULTS

I don't have your tables nor data, so - here's an example based on Scott's sample schema; ranking employees by salary within their departments. Apply that to your case.
SQL> with temp as
2 (select deptno, ename, sal,
3 rank() over (partition by deptno order by sal desc) rnk
4 from emp
5 )
6 select deptno, ename, sal, rnk
7 from temp
8 order by deptno, rnk;
DEPTNO ENAME SAL RNK
---------- ---------- ---------- ----------
10 KING 10000 1
10 CLARK 2450 2
10 MILLER 1300 3
20 SCOTT 3000 1
20 FORD 3000 1
20 JONES 2975 3
20 ADAMS 1100 4
20 SMITH 920 5
30 BLAKE 2850 1
30 ALLEN 1600 2
30 TURNER 1500 3
30 MARTIN 1250 4
30 WARD 1250 4
30 JAMES 950 6
14 rows selected.
SQL>

Query to find before and after values of a given value

If we have table Employees
EMP_ID ENAME SALARY DEPT_ID
1 abc 1000 10
2 bca 1050 10
3 dsa 2000 20
4 zxc 3000 30
5 bnm 5000 30
6 rty 5050 30
I want to get the rank of the salary with before 2 values and after 2 values including the given rank
Like if I give rank 4 it should give ranks 2,3,4,5,6 details.
output should be
5 bnm 5000 30
4 zxc 3000 30
3 dsa 2000 20
3 dsa 2000 20
2 bca 1050 10
1 abc 1000 10
I have a query
WITH dept_count AS (
SELECT
e.*,
dense_rank() over( ORDER BY salary DESC) AS rk
FROM employees e
)
SELECT
*
FROM dept_count dc
WHERE dc.rk BETWEEN (
SELECT
c.rk-2
FROM dept_count c
WHERE c.rk =4
)
AND (
SELECT
c.rk + 2
FROM dept_count c
WHERE c.rk = 4
)
but I need a query which can be simplified.
Could someone help me with this query?

You just need to use ROW_NUMBER() along with a substitution parameter :
WITH dept_count AS (
SELECT
e.*,
ROW_NUMBER() OVER( ORDER BY salary DESC) AS rk
FROM employees e
)
SELECT *
FROM dept_count
WHERE rk BETWEEN &prm - 2 AND &prm + 2

DENSE_RANK ORDER BY NULL always return 1

I find in PROD procedure code like:
select *
from (select "EMPNO",
"SAL",
"COMM",
"DEPTNO",
DENSE_RANK() OVER (PARTITION BY deptno ORDER BY null) AS drank
from SCOTT."EMP"
where deptno in (10,30))
where drank = 1
order by deptno
RESULT:
EMPNO SAL COMM DEPTNO DRANK
7934 1300 - 10 1
7839 5000 - 10 1
7782 2450 - 10 1
7844 1500 0 30 1
7900 950 - 30 1
7654 1250 1400 30 1
7499 1600 300 30 1
7698 2850 - 30 1
7521 1250 500 30 1
As result drank is always equal to 1. This is also true for:
DENSE_RANK() OVER (ORDER BY null) AS drank
DENSE_RANK() OVER (PARTITION BY comm ORDER BY null) AS drank
DENSE_RANK() OVER (PARTITION BY 1 ORDER BY null) AS drank
DENSE_RANK() OVER (PARTITION BY null ORDER BY null) AS drank
Is there any case when drank is not equal to 1 when there is ORDER BY null clause?
EDIT: I know dense_rank start with 1. Question is about values greater than 1.

The documentation says:
Use the order_by_clause to specify how data is ordered within a partition. For all analytic functions you can order the values in a partition on multiple keys, each defined by a value_expr and each qualified by an ordering sequence.
Within each function, you can specify multiple ordering expressions. Doing so is especially useful when using functions that rank values, because the second expression can resolve ties between identical values for the first expression.
Whenever the order_by_clause results in identical values for multiple rows, the function behaves as follows:
CUME_DIST, DENSE_RANK, NTILE, PERCENT_RANK, and RANK return the same result for each of the rows.
...
When you order by null, the order_by_clause results in identical values for multiple all rows (in the partition), so they all get the same result.
The documentation for dense_rank also says:
The ranks are consecutive integers beginning with 1.
So they get the same result, which has to be 1.

the number of same record until this row

Is it possible to retrieve current number of same record for each row by using oracle pl/sql?
For example,
I have class table which consists of id, name, age columns
I want to have the sequence of student with the same name and age entering the class, assuming that id is countering up without altering data structure.
Thanks.
Regards,
Jim

Not sure I entirely get what you're asking for; you have an odd turn of phrase. An example of input data and expected result is always useful.
Perhaps something like this:
select id, name, age
from your_table
where (name, age) in
( select name. age
from your_table
group by name, age
having count(id) > 1 )
order by name, age, id
/
You could solve this with analytics. However, you still need an outer query to filter out the records which aren't duplicated, so I'm not sure what you'd gain:
select * from (
select id, name, age
, count(id) over (partition by name, age) as dup_count
from your_table )
where dup_count > 1
order by name, age, id
/

I'm not sure either, but it sounds to me something related to analytic functions. Take this as an example, look at srlno column, calculated using analytic functions:
SELECT empno, deptno, hiredate,
ROW_NUMBER( ) OVER (PARTITION BY
deptno ORDER BY hiredate
NULLS LAST) SRLNO
FROM emp
WHERE deptno IN (10, 20)
ORDER BY deptno, SRLNO;
EMPNO DEPTNO HIREDATE SRLNO
------ ------- --------- ----------
7782 10 09-JUN-81 1
7839 10 17-NOV-81 2
7934 10 23-JAN-82 3
7369 20 17-DEC-80 1
7566 20 02-APR-81 2
7902 20 03-DEC-81 3
7788 20 09-DEC-82 4
7876 20 12-JAN-83 5
More on analyltic functions:
http://www.orafaq.com/node/55
Remember, is a better approach if you can achieve your goal with a SQL instead of PL/SQL.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Hive: Sum over a specified group (HiveQL) - hadoop

Similar to #VB_ answer, use the BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING statement. The HiveQL query is therefore: SELECT key, product_code, SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM test;

You could use BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW to achieve that without a self join. Code as below: SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM T;

This query gives me perfect result select key, product_code, cost, sum(cost) over (partition by key) as total_costs from zone;

Related

PL SQL function that includes multiple tables

HOW can I add a sort row column?

Query to find before and after values of a given value

DENSE_RANK ORDER BY NULL always return 1

the number of same record until this row

Categories

Resources