HIVE Pivot and Sum - hadoop

I have a table that I am trying to figure out how to pivot and sum based on the values in a second column.
Example input:
|own|pet|qty|
|---|---|---|
|bob|dog| 2 |
|bob|dog| 3 |
|bob|dog| 1 |
|bob|cat| 1 |
|jon|dog| 1 |
|jon|cat| 1 |
|jon|cat| 1 |
|jon|cow| 4 |
|sam|dog| 3 |
|sam|cow| 1 |
|sam|cow| 2 |
Example output:
|own|dog|cat|cow|
|---|---|---|---|
|bob| 6 | 1 | |
|jon| 1 | 2 | 4 |
|sam| 3 | | 3 |

Use CASE and sum():
select own, sum(case when pet='dog' then qty end) as dog,
sum(case when pet='cat' then qty end) as cat,
sum(case when pet='cow' then qty end) as cow
from your_table
group by own;
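If you would rather show 0 than a blank for a missing owner/pet combination, wrap each sum in coalesce:
select own,
       coalesce(sum(case when pet='dog' then qty end), 0) as dog,
       coalesce(sum(case when pet='cat' then qty end), 0) as cat,
       coalesce(sum(case when pet='cow' then qty end), 0) as cow
from your_table
group by own;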

For dynamic data you can use a MAP:
select own
,str_to_map(concat_ws(',',collect_list(concat(pet,':',cast(qty as string))))) as pet_qty
from (select own,pet
,sum(qty) qty
from mytable
group by own,pet
) t
group by own
;
+-----+---------------------------------+
| own | pet_qty                         |
+-----+---------------------------------+
| bob | {"cat":"1","dog":"6"}           |
| jon | {"cat":"2","cow":"4","dog":"1"} |
| sam | {"cow":"3","dog":"3"}           |
+-----+---------------------------------+
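Individual pets can then be read back out of the map by key; a small sketch built on the query above (values come back as strings, and missing keys return NULL):
select own,
       pet_qty['dog'] as dog,
       pet_qty['cat'] as cat,
       pet_qty['cow'] as cow
from (
      select own
            ,str_to_map(concat_ws(',',collect_list(concat(pet,':',cast(qty as string))))) as pet_qty
      from (select own, pet, sum(qty) qty
            from mytable
            group by own, pet
           ) t
      group by own
     ) m;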

Related

Calculating Rolling Weekly Spend in Hive using Window Functions

I need to develop a distribution of customers' week-long spend. Every time a customer makes a purchase, I want to know how much they've spent with us in the past week. I would like to do this in my Hive code.
My data set is somewhat similar to this:
Spend_Table
Cust_ID | Purch_Date | Purch_Amount
1       | 1/1/19     | $10
1       | 1/2/19     | $21
1       | 1/3/19     | $30
1       | 1/4/19     | $11
1       | 1/5/19     | $21
1       | 1/6/19     | $31
1       | 1/7/19     | $41
2       | 1/1/19     | $12
2       | 1/2/19     | $22
2       | 1/3/19     | $32
2       | 1/5/19     | $42
2       | 1/7/19     | $52
2       | 1/9/19     | $62
2       | 1/11/19    | $72
So far, I've tried code that looks similar to this:
Select Cust_ID,
Purch_Date,
Purch_Amount,
sum(Purch_Amount) over (partition by Cust_ID order by unix_timestamp(Purch_Date) range between 604800 preceding and current row) as Rolling_Spend
from Spend_Table
The output I'm looking for is:
Cust_ID | Purch_Date | Purch_Amount | Rolling_Spend
1       | 1/1/19     | $10          | $10
1       | 1/2/19     | $21          | $31
1       | 1/3/19     | $30          | $61
1       | 1/4/19     | $11          | $72
1       | 1/5/19     | $21          | $93
1       | 1/6/19     | $31          | $124
1       | 1/7/19     | $41          | $165
2       | 1/1/19     | $12          | $12
2       | 1/2/19     | $22          | $34
2       | 1/3/19     | $32          | $66
2       | 1/5/19     | $42          | $108
2       | 1/7/19     | $52          | $160
2       | 1/9/19     | $62          | $188
2       | 1/11/19    | $72          | $228
I believe the issue is with my range between clause, because it appears to be grabbing the preceding number of rows. I was expecting it to grab data within that many preceding seconds (note that 604800 seconds is 7 days; 6 days would be 518400).
Is what I'm trying to do feasible? I can't simply use the previous 6 rows, since not every customer makes a purchase every single day (customer 2, for example). Any help is greatly appreciated!
SELECT *, sum(Purch_Amount) OVER (
    PARTITION BY Cust_ID
    ORDER BY CAST(Purch_Date AS timestamp)
    RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS cumulativeSum FROM Spend_Table
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Moving the answer here from the question: I was able to get the original code to work by changing it to:
Select Cust_ID,
Purch_Date,
Purch_Amount,
sum(Purch_Amount) over (partition by Cust_ID order by unix_timestamp(Purch_Date, 'MM-dd-yyyy') range between 604800 preceding and
current row) as Rolling_Spend
from Spend_Table
The key was specifying the date format in the unix_timestamp function.
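One note on the frame size: the expected figures above (for example, customer 2's $188 on 1/9) correspond to the current day plus the previous six days, which is 518400 seconds rather than 604800. A minimal sketch of that variant, keeping the rest of the query the same (the date format must match however Purch_Date is actually stored):
Select Cust_ID,
       Purch_Date,
       Purch_Amount,
       sum(Purch_Amount) over (partition by Cust_ID
                               order by unix_timestamp(Purch_Date, 'MM-dd-yyyy')
                               range between 518400 preceding and current row) as Rolling_Spend
from Spend_Table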

Lag and Lead to next month

TABLE: HIST
CUSTOMER  MONTH  PLAN
1         1      A
1         2      B
1         2      C
1         3      D
If I query:
select h.*, lead(plan) over (partition by customer order by month) np from HIST h
I get:
CUSTOMER  MONTH  PLAN  np
1         1      A     B
1         2      B     C
1         2      C     D
1         3      D     (null)
But I wanted
CUSTOMER  MONTH  PLAN  np
1         1      A     B
1         2      B     D
1         2      C     D
1         3      D     (null)
The reason being: the next month after 2 is 3, with plan D. I'm guessing partition by customer order by month doesn't work the way I thought.
Is there a way to achieve this in Oracle 12c?
One way to do it is to use a RANGE window with the MIN analytic function. Like this:
select h.*,
min(plan) over
(partition by customer
order by month
range between 1 following and 1 following) np
from HIST h;
+----------+-------+------+----+
| CUSTOMER | MONTH | PLAN | NP |
+----------+-------+------+----+
| 1        | 1     | A    | B  |
| 1        | 2     | B    | D  |
| 1        | 2     | C    | D  |
| 1        | 3     | D    |    |
+----------+-------+------+----+
When you use RANGE, you are telling Oracle to build the window based on the values of the column you are ordering by, rather than on row positions.
So, e.g.,
ROWS BETWEEN 1 following and 1 following
... will make a window containing just the next row.
RANGE BETWEEN 1 following and 1 following
... will make a window containing all the rows whose month equals the current month + 1.
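To see the two frame types side by side on the HIST table above (a quick sketch): next_row_plan picks up the plan from the single next row, while next_month_plan takes the minimum plan across all rows with month + 1.
select h.*,
       min(plan) over (partition by customer order by month
                       rows between 1 following and 1 following) as next_row_plan,
       min(plan) over (partition by customer order by month
                       range between 1 following and 1 following) as next_month_plan
from HIST h;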
UPDATE
If it is possible that some values for MONTH might be skipped for a given customer, you can use this variant (the output shown below is for a version of HIST where month 2 is skipped):
select h.*,
first_value(plan) over
(partition by customer
order by month
range between 1 following and unbounded following) np
from HIST h;
+----------+-------+------+----+
| CUSTOMER | MONTH | PLAN | NP |
+----------+-------+------+----+
| 1        | 1     | A    | B  |
| 1        | 3     | B    | D  |
| 1        | 3     | C    | D  |
| 1        | 4     | D    |    |
+----------+-------+------+----+
You can use LAG/LEAD twice: the first time to check for duplicate months and set the value to NULL in those rows, and the second time with IGNORE NULLS to get the next month's value.
It has the additional benefit that if months are skipped it will still find the next value.
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE HIST ( CUSTOMER, MONTH, PLAN ) AS
SELECT 1, 1, 'A' FROM DUAL UNION ALL
SELECT 1, 2, 'B' FROM DUAL UNION ALL
SELECT 1, 2, 'C' FROM DUAL UNION ALL
SELECT 1, 3, 'D' FROM DUAL UNION ALL
SELECT 2, 1, 'E' FROM DUAL UNION ALL
SELECT 2, 1, 'F' FROM DUAL UNION ALL
SELECT 2, 3, 'G' FROM DUAL UNION ALL
SELECT 2, 5, 'H' FROM DUAL;
Query 1:
SELECT CUSTOMER,
MONTH,
PLAN,
LEAD( np ) IGNORE NULLS OVER ( PARTITION BY CUSTOMER ORDER BY MONTH, PLAN, ROWNUM ) AS np
FROM (
SELECT h.*,
CASE MONTH
WHEN LAG( MONTH ) OVER ( PARTITION BY CUSTOMER ORDER BY MONTH, PLAN, ROWNUM )
THEN NULL
ELSE PLAN
END AS np
FROM hist h
)
Results:
| CUSTOMER | MONTH | PLAN | NP     |
|----------|-------|------|--------|
| 1        | 1     | A    | B      |
| 1        | 2     | B    | D      |
| 1        | 2     | C    | D      |
| 1        | 3     | D    | (null) |
| 2        | 1     | E    | G      |
| 2        | 1     | F    | G      |
| 2        | 3     | G    | H      |
| 2        | 5     | H    | (null) |
Just so that it is listed here as an option for Oracle 12c (onward), you can use the APPLY operator for this style of problem:
select
h.customer, h.month, h.plan, oa.np
from hist h
outer apply (
select
h2.plan as np
from hist h2
where h2.customer = h.customer
and h2.month > h.month
order by h2.month
fetch first 1 rows only
) oa
order by
h.customer, h.month, h.plan
I don't know of any public Oracle 12c fiddles, so here is an example in SQL Server instead: http://sqlfiddle.com/#!18/cd95e/1
| customer | month | plan | np     |
|----------|-------|------|--------|
| 1        | 1     | A    | C      |
| 1        | 2     | B    | D      |
| 1        | 2     | C    | D      |
| 1        | 3     | D    | (null) |

MDX - filter empty outside of selected range

The cube is populated with data divided along a time dimension (Period), where each member represents a month.
Following query:
select non empty {[Measures].[a], [Measures].[b], [Measures].[c]} on columns,
{[Period].[Period].ALLMEMBERS} on rows
from MyCube
returns:
+--------+----+---+--------+
| Period | a  | b | c      |
+--------+----+---+--------+
| 2      | 3  | 2 | (null) |
| 3      | 5  | 3 | 1      |
| 5      | 23 | 2 | 2      |
+--------+----+---+--------+
Removing non empty
select {[Measures].[a], [Measures].[b], [Measures].[c]} on columns,
{[Period].[Period].ALLMEMBERS} on rows
from MyCube
Renders:
+--------+--------+--------+--------+
| Period | a      | b      | c      |
+--------+--------+--------+--------+
| 1      | (null) | (null) | (null) |
| 2      | 3      | 2      | (null) |
| 3      | 5      | 3      | 1      |
| 4      | (null) | (null) | (null) |
| 5      | 23     | 2      | 2      |
| 6      | (null) | (null) | (null) |
+--------+--------+--------+--------+
What I would like to get is all records from period 2 to period 5: the first occurrence of a value in measure "a" marks the start of the range, and the last occurrence marks the end.
This works, but I need the range to be calculated dynamically at runtime by the MDX:
select {[Measures].[a], [Measures].[b], [Measures].[c]} on columns,
{[Period].[Period].&[2] :[Period].[Period].&[5]} on rows
from MyCube
Desired output:
+--------+--------+--------+--------+
| Period | a      | b      | c      |
+--------+--------+--------+--------+
| 2      | 3      | 2      | (null) |
| 3      | 5      | 3      | 1      |
| 4      | (null) | (null) | (null) |
| 5      | 23     | 2      | 2      |
+--------+--------+--------+--------+
I tried looking at first/last value functions but just couldn't compose them into the query properly. Has anyone run into this before? It should be a pretty common need, since I want a continuous financial report without skipping months where nothing is going on. Thanks.
Maybe try playing with the NONEMPTY / HEAD / TAIL functions in a WITH clause:
WITH
SET [First] AS
{HEAD(NONEMPTY([Period].[Period].MEMBERS, [Measures].[a]))}
SET [Last] AS
{TAIL(NONEMPTY([Period].[Period].MEMBERS, [Measures].[a]))}
SELECT
{
[Measures].[a]
, [Measures].[b]
, [Measures].[c]
} on columns,
[First].ITEM(0).ITEM(0)
:[Last].ITEM(0).ITEM(0) on rows
FROM MyCube;
To debug a custom set and see which members it returns, you can do something like this:
WITH
SET [First] AS
{HEAD(NONEMPTY([Period].[Period].MEMBERS, [Measures].[a]))}
SELECT
{
[Measures].[a]
, [Measures].[b]
, [Measures].[c]
} on columns,
[First] on rows
FROM MyCube;
Reading your comment about Children, I think this also suggests an alternative: add an extra [Period] level to the member references:
WITH
SET [First] AS
{HEAD(NONEMPTY([Period].[Period].[Period].MEMBERS
, [Measures].[a]))}
SET [Last] AS
{TAIL(NONEMPTY([Period].[Period].[Period].MEMBERS
, [Measures].[a]))}
SELECT
{
[Measures].[a]
, [Measures].[b]
, [Measures].[c]
} on columns,
[First].ITEM(0).ITEM(0)
:[Last].ITEM(0).ITEM(0) on rows
FROM MyCube;

List customer ID, name and all of his/her accounts

customers:
+-----+------+
| cid | Name |
+-----+------+
| 1   | Bob  |
| 2   | John |
| 3   | Jane |
+-----+------+
accounts:
+-----+----------+
| aid | type     |
+-----+----------+
| 1   | Checking |
| 2   | Saving   |
| 3   | CD       |
+-----+----------+
transactions:
+-----+-----+-----+
| tid | cid | aid |
+-----+-----+-----+
| 1   | 1   | 1   |
| 2   | 2   | 1   |
| 3   | 1   | 2   |
| 4   | 2   | 3   |
| 5   | 3   | 1   |
+-----+-----+-----+
I am trying to write a plsql procedure that, given the customer id as a parameter, will display his/her id, name and all accounts. Displaying the id and name is simple enough. What I'm not sure about is how to get all the accounts that are linked to the customer id and how to retrieve more than a single account.
An idea can be:
select c.cid, c.name, a.type
from customers c
left join transactions t on (t.cid = c.cid)
left join accounts a on (a.aid = t.aid)
where c.cid = :customer_id
group by c.cid, c.name, a.type;
The GROUP BY is needed because there can be more than one transaction for the same account.
Further, if you want everything on one line:
select cid, name, LISTAGG(type, ',') WITHIN GROUP (ORDER BY type) as account_types
from(
select distinct c.cid, c.name, a.type
from customers c
left join transactions t on (t.cid = c.cid)
left join accounts a on (a.aid = t.aid)
where c.cid = :customer_id
)
group by cid, name;
Putting this into a stored procedure/function is simple enough, so I'll leave that to you.
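For completeness, a minimal sketch of such a procedure, assuming DBMS_OUTPUT is an acceptable way to "display" the rows (the procedure and parameter names are just illustrative):
create or replace procedure show_customer_accounts(p_cid in customers.cid%type) is
begin
  -- one row per distinct account the customer has transacted on
  for r in (select c.cid, c.name, a.type as acct_type
            from customers c
            left join transactions t on (t.cid = c.cid)
            left join accounts a on (a.aid = t.aid)
            where c.cid = p_cid
            group by c.cid, c.name, a.type)
  loop
    dbms_output.put_line(r.cid || ' ' || r.name || ' ' || r.acct_type);
  end loop;
end;
/
With SERVEROUTPUT enabled, exec show_customer_accounts(1); would print Bob's id, name and each of his account types on its own line.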

Sum of the grouped distinct values

This is a bit hard to explain in words ... I'm trying to calculate a sum of grouped distinct values in a matrix. Let's say I have the following data returned by a SQL query:
+-------+----------+---------+-----------------+
| Group | ParentID | ChildID | ParentProdCount |
+-------+----------+---------+-----------------+
| A     | 1        | 1       | 2               |
| A     | 1        | 2       | 2               |
| A     | 1        | 3       | 2               |
| A     | 1        | 4       | 2               |
| A     | 2        | 5       | 3               |
| A     | 2        | 6       | 3               |
| A     | 2        | 7       | 3               |
| A     | 2        | 8       | 3               |
| B     | 3        | 9       | 1               |
| B     | 3        | 10      | 1               |
| B     | 3        | 11      | 1               |
+-------+----------+---------+-----------------+
There's some other data in the query, but it's irrelevant. ParentProdCount is specific to the ParentID.
Now, I have a matrix in the MS Report Designer in which I'm trying to calculate a sum for ParentProdCount (grouped by "Group"). If I just add the expression
=Sum(Fields!ParentProdCount.Value)
I get a result 20 for Group A and 3 for Group B, which is incorrect. The correct values should be 5 for group A and 1 for group B. This wouldn't happen if there wasn't ChildID involved, but I have to use some other child-specific data in the same matrix.
I tried to nest FIRST() and SUM() aggregate functions but apparently it's not possible to have nested aggregation functions, even when they have scopes defined.
I'm pretty sure there is some way to calculate the grouped distinct sum without needing to create another SQL query. Anyone got an idea how to do that?
OK, I got this sorted out by adding a ROW_NUMBER() function to my SQL query:
SELECT Group, ParentID, ROW_NUMBER() OVER (PARTITION BY ParentID ORDER BY ChildID ASC) AS Position, ChildID, ParentProdCount FROM Table
and then I replaced the SSRS SUM function with
=SUM(IIF(Fields!Position.Value = 1, Fields!ParentProdCount.Value, 0))
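For reference, the same per-group totals can be sanity-checked directly in SQL (a sketch against the sample columns above; the brackets are SQL Server-style quoting, since Group is a reserved word):
select [Group],
       sum(ParentProdCount) as GroupProdCount
from (select distinct [Group], ParentID, ParentProdCount
      from [Table]) t
group by [Group];
This returns 5 for group A and 1 for group B, matching the expected values.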
Put a grouping on ParentID and use a summation over that group,
e.g.:
if the group on ParentID is named "ParentIDGroup"
then
the column sum of ParentProdCount = SUM(Fields!ParentProdCount.Value, "ParentIDGroup")
