Calculating Rolling Weekly Spend in Hive using Window Functions - hadoop

I need to build a distribution of customers' week-long spend. Every time a customer makes a purchase, I want to know how much they've spent with us in the past week. I would like to do this in Hive.
My data set looks similar to this:
Spend_Table
Cust_ID | Purch_Date | Purch_Amount
1 | 1/1/19 | $10
1 | 1/2/19 | $21
1 | 1/3/19 | $30
1 | 1/4/19 | $11
1 | 1/5/19 | $21
1 | 1/6/19 | $31
1 | 1/7/19 | $41
2 | 1/1/19 | $12
2 | 1/2/19 | $22
2 | 1/3/19 | $32
2 | 1/5/19 | $42
2 | 1/7/19 | $52
2 | 1/9/19 | $62
2 | 1/11/19 | $72
So far, I've tried code that looks similar to this:
Select Cust_ID,
Purch_Date,
Purch_Amount,
sum(Purch_Amount) over (partition by Cust_ID order by unix_timestamp(Purch_Date) range between 604800 preceding and current row) as Rolling_Spend
from Spend_Table
Cust_ID | Purch_Date | Purch_Amount | Rolling_Spend
1 | 1/1/19 | $10 | $10
1 | 1/2/19 | $21 | $31
1 | 1/3/19 | $30 | $61
1 | 1/4/19 | $11 | $72
1 | 1/5/19 | $21 | $93
1 | 1/6/19 | $31 | $124
1 | 1/7/19 | $41 | $165
2 | 1/1/19 | $12 | $12
2 | 1/2/19 | $22 | $34
2 | 1/3/19 | $32 | $66
2 | 1/5/19 | $42 | $108
2 | 1/7/19 | $52 | $160
2 | 1/9/19 | $62 | $188
2 | 1/11/19 | $72 | $228
I believe the issue is with my range between, because it appears to be grabbing the preceding number of rows. I was expecting it to grab data within the preceding number of seconds (604800 being 7 days in seconds).
Is what I'm trying to do feasible? I can't use the previous 6 rows, since not every customer makes a purchase every single day (customer 2, for example). Any help is greatly appreciated!

SELECT *, sum(Purch_Amount) OVER (
PARTITION BY Cust_ID
ORDER BY CAST(Purch_Date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS cumulativeSum FROM Spend_Table
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

Moving the answer here from the question:
I was able to get the original code to work by changing it to:
Select Cust_ID,
Purch_Date,
Purch_Amount,
sum(Purch_Amount) over (partition by Cust_ID order by unix_timestamp(Purch_Date, 'MM-dd-yyyy') range between 604800 preceding and current row) as Rolling_Spend
from Spend_Table
The key was specifying the date format in the unix_timestamp call. Without a format that matches the stored strings, unix_timestamp cannot parse the dates, so the ordering value is not a usable timestamp for the range.
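To make the intended window semantics concrete, here is a minimal Python sketch (not Hive) of a range-based rolling sum: for each purchase it adds up the same customer's purchases dated within the preceding 7 days, including the current row. The function name `rolling_spend` and the tuple layout are assumptions made for illustration; the data is customer 2 from the sample table.

```python
from datetime import date, timedelta


def rolling_spend(purchases, window_days=7):
    """Return (cust_id, day, amount, rolling_total) for each purchase.

    The rolling total covers the same customer's purchases dated from
    `window_days` before the purchase up to and including the purchase date,
    regardless of how many rows fall inside that span.
    """
    result = []
    for cust_id, day, amount in purchases:
        start = day - timedelta(days=window_days)
        total = sum(
            a for c, d, a in purchases
            if c == cust_id and start <= d <= day
        )
        result.append((cust_id, day, amount, total))
    return result


# Customer 2 from the sample table (hypothetical in-memory copy).
customer_2 = [
    (2, date(2019, 1, 1), 12),
    (2, date(2019, 1, 2), 22),
    (2, date(2019, 1, 3), 32),
    (2, date(2019, 1, 5), 42),
    (2, date(2019, 1, 7), 52),
    (2, date(2019, 1, 9), 62),
    (2, date(2019, 1, 11), 72),
]

rolling = rolling_spend(customer_2)
for row in rolling:
    print(row)
```

Because the window is defined by dates rather than by row count, the gaps in customer 2's purchase history do not change how far back each total reaches.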

Related

Concatenate two or more rows from result into single result on CI activerecord

I have a situation like this: I want to read comma-delimited values from more than one database row, selected by the month and year that I choose. The details are below.
My Schedule.sql :
+---+------------+-------------------------------------+
|ID |Activ_date | Do_skill |
+---+------------+-------------------------------------+
| 1 | 2020-10-01 | Accountant,Medical,Photograph |
| 2 | 2020-11-01 | Medical,Photograph,Doctor,Freelancer|
| 3 | 2020-12-01 | EO,Teach,Scientist |
| 4 | 2021-01-01 | Engineering, Freelancer |
+---+------------+-------------------------------------+
My skillqmount.sql :
+----+------------+------------+-------+
|ID |Date_skill |Skill |Price |
+----+------------+------------+-------+
| 1 | 2020-10-02 | Accountant | $ 5 |
| 2 | 2020-10-03 | Medical | $ 7 |
| 3 | 2020-10-11 | Photograph | $ 5 |
| 4 | 2020-10-12 | Doctor | $ 9 |
| 5 | 2020-10-01 | Freelancer | $ 7 |
| 6 | 2020-10-04 | EO | $ 4 |
| 7 | 2020-10-05 | Teach | $ 4 |
| 8 | 2020-11-02 | Accountant | $ 5 |
| 9 | 2020-11-03 | Medical | $ 7 |
| 10 | 2020-11-11 | Photograph | $ 5 |
| 11 | 2020-11-12 | Doctor | $ 9 |
| 12 | 2020-11-01 | Freelancer | $ 7 |
+----+------------+------------+-------+
On my website I want to make a calculation using those two tables. If the user asks for the period from 2020-10-01 until 2020-11-01, I want to show the total amount between those dates, like this:
Output example
+----+-----------+-----------+---------+
|No |Date Start |Date End |T.Amount |
+----+-----------+-----------+---------+
|1 |2020-10-01 |2020-11-01 |$ 45 | <= this amount came from $5+$7+$5+$7+$5+$9+$7
+----+-----------+-----------+---------+
Note :
Date Start : Input->post("A")
Date End : Input->post("B")
T.Amount : Total Amount based input A and B (on date)
I tried this code to get it :
<?php
$startd = $this->input->post('A');
$endd= $this->input->post('B');
$chck = $this->db->select('Do_skill')
->where('Activ_date >=',$startd)
->where('Activ_date <',$endd)
->get('Schedule')
->row('Do_skill');
$dcek = $this->Check_model->comma_separated_to_array($chck);
$t_amount = $this->db->select_sum('price')
->where('Date_skill >=',$startd)
->where('Date_skill <',$endd)
->where_in('Skill',$dcek)
->get('skillqmount')
->row('price');
echo $t_amount; ?>
Check_model :
public function comma_separated_to_array($chck, $separator = ',')
{
    // Explode on the separator, then trim whitespace from each value
    return array_map('trim', explode($separator, $chck));
}
My problem is that the result in $t_amount is not $45. I think I'm missing something in the code above. Any advice would be very much appreciated. Thank you.
Your first query only returns one row of data.
I think you can do something like this for the first query instead:
$query1 = $this->db->query(
    "SELECT Do_skill FROM Schedule WHERE Activ_date >= ? AND Activ_date < ?",
    array($startd, $endd)
);
$check = $query1->result_array();
$array = [];
foreach ($check as $ck) {
    // Split each row's comma-separated list and collect the trimmed values
    foreach (explode(',', $ck['Do_skill']) as $skill) {
        $array[] = trim($skill);
    }
}
You can then use the array in your next query :)
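The row-merging step can be sketched in Python as follows. This is only an illustration of the idea, not CodeIgniter code: the function name `collect_skills` and the list-of-dicts row shape are assumptions, and the de-duplication is an extra that the answer does not include.

```python
def collect_skills(rows, separator=","):
    """Merge the comma-separated Do_skill lists of several rows into one list.

    Values are trimmed (one sample row contains "Engineering, Freelancer"
    with a space after the comma) and duplicates are dropped, keeping the
    order of first appearance.
    """
    skills = []
    for row in rows:
        for skill in row["Do_skill"].split(separator):
            skill = skill.strip()
            if skill and skill not in skills:
                skills.append(skill)
    return skills


# Hypothetical result set containing the first two Schedule rows.
schedule_rows = [
    {"Do_skill": "Accountant,Medical,Photograph"},
    {"Do_skill": "Medical,Photograph,Doctor,Freelancer"},
]

skills = collect_skills(schedule_rows)
print(skills)
```

The merged list plays the role of the array passed to where_in() in the second query.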
The array $dcek has the values
Accountant,Medical,Photograph
The query from Codeigniter is
SELECT SUM(`price`) AS `price` FROM `skillqmount`
WHERE `Date_skill` >= '2020-10-01' AND
`Date_skill` < '2020-11-01' AND
`Skill` IN('Accountant', 'Medical', 'Photograph')
which returns 17 - this matches the first three entries in your data.
Your first query will only ever give one row, even if the date range would match multiple rows.

MDX - filter empty outside of selected range

Cube is populated with data divided into time dimension ( period ) which represents a month.
Following query:
select non empty {[Measures].[a], [Measures].[b], [Measures].[c]} on columns,
{[Period].[Period].ALLMEMBERS} on rows
from MyCube
returns:
+--------+----+---+--------+
| Period | a | b | c |
+--------+----+---+--------+
| 2 | 3 | 2 | (null) |
| 3 | 5 | 3 | 1 |
| 5 | 23 | 2 | 2 |
+--------+----+---+--------+
Removing non empty
select {[Measures].[a], [Measures].[b], [Measures].[c]} on columns,
{[Period].[Period].ALLMEMBERS} on rows
from MyCube
Renders:
+--------+--------+--------+--------+
| Period | a | b | c |
+--------+--------+--------+--------+
| 1 | (null) | (null) | (null) |
| 2 | 3 | 2 | (null) |
| 3 | 5 | 3 | 1 |
| 4 | (null) | (null) | (null) |
| 5 | 23 | 2 | 2 |
| 6 | (null) | (null) | (null) |
+--------+--------+--------+--------+
What I would like to get is all records from period 2 to period 5: the first occurrence of a value in measure "a" marks the start of the range, and the last occurrence marks the end.
This works, but I need the range to be calculated dynamically at runtime by MDX:
select non empty {[Measures].[a], [Measures].[b], [Measures].[c]} on columns,
{[Period].[Period].&[2] :[Period].[Period].&[5]} on rows
from MyCube
desired output:
+--------+--------+--------+--------+
| Period | a | b | c |
+--------+--------+--------+--------+
| 2 | 3 | 2 | (null) |
| 3 | 5 | 3 | 1 |
| 4 | (null) | (null) | (null) |
| 5 | 23 | 2 | 2 |
+--------+--------+--------+--------+
I tried looking for first/last values but couldn't compose them into the query properly. Has anyone had this issue before? It should be pretty common, since I want a continuous financial report that doesn't skip months where nothing is going on. Thanks.
Maybe try using the NonEmpty, Head and Tail functions in a WITH clause:
WITH
SET [First] AS
{HEAD(NONEMPTY([Period].[Period].MEMBERS, [Measures].[a]))}
SET [Last] AS
{TAIL(NONEMPTY([Period].[Period].MEMBERS, [Measures].[a]))}
SELECT
{
[Measures].[a]
, [Measures].[b]
, [Measures].[c]
} on columns,
[First].ITEM(0).ITEM(0)
:[Last].ITEM(0).ITEM(0) on rows
FROM MyCube;
To debug a custom set and see which members it returns, you can do something like this:
WITH
SET [First] AS
{HEAD(NONEMPTY([Period].[Period].MEMBERS, [Measures].[a]))}
SELECT
{
[Measures].[a]
, [Measures].[b]
, [Measures].[c]
} on columns,
[First] on rows
FROM MyCube;
I think reading your comment about Children means that this is also an alternative - to add an extra [Period]:
WITH
SET [First] AS
{HEAD(NONEMPTY([Period].[Period].[Period].MEMBERS
, [Measures].[a]))}
SET [Last] AS
{TAIL(NONEMPTY([Period].[Period].[Period].MEMBERS
, [Measures].[a]))}
SELECT
{
[Measures].[a]
, [Measures].[b]
, [Measures].[c]
} on columns,
[First].ITEM(0).ITEM(0)
:[Last].ITEM(0).ITEM(0) on rows
FROM MyCube;
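The logic behind the Head/Tail approach can be sketched outside MDX. The Python below is only an illustration: the function name `non_empty_range` and the dictionary of period-to-value pairs are assumptions, with values taken from measure "a" in the question.

```python
def non_empty_range(values_by_period):
    """Return every period from the first to the last non-empty value, inclusive.

    Periods inside the range are kept even when their own value is empty,
    which is what distinguishes this from a plain non-empty filter.
    """
    periods = sorted(values_by_period)
    non_empty = [p for p in periods if values_by_period[p] is not None]
    if not non_empty:
        return []
    first, last = non_empty[0], non_empty[-1]
    return [p for p in periods if first <= p <= last]


# Measure "a" per period, as in the unfiltered result from the question.
measure_a = {1: None, 2: 3, 3: 5, 4: None, 5: 23, 6: None}

selected = non_empty_range(measure_a)
print(selected)
```

Period 4 stays in the result although it is empty, while the empty periods 1 and 6 outside the range are dropped.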

HIVE Pivot and Sum

I have a table that I am trying to figure out how to pivot and sum based on the values in a second column.
Example input:
|own|pet|qty|
|---|---|---|
|bob|dog| 2 |
|bob|dog| 3 |
|bob|dog| 1 |
|bob|cat| 1 |
|jon|dog| 1 |
|jon|cat| 1 |
|jon|cat| 1 |
|jon|cow| 4 |
|sam|dog| 3 |
|sam|cow| 1 |
|sam|cow| 2 |
Example output:
|own|dog|cat|cow|
|---|---|---|---|
|bob| 6 | 1 | |
|jon| 1 | 2 | 4 |
|sam| 3 | | 3 |
Use case and sum():
select own, sum(case when pet='dog' then qty end) as dog,
sum(case when pet='cat' then qty end) as cat,
sum(case when pet='cow' then qty end) as cow
from your_table
group by own;
For dynamic data you can use MAP
select own
,str_to_map(concat_ws(',',collect_list(concat(pet,':',cast(qty as string))))) as pet_qty
from (select own,pet
,sum(qty) qty
from mytable
group by own,pet
) t
group by own
;
+-----+---------------------------------+
| own | pet_qty |
+-----+---------------------------------+
| bob | {"cat":"1","dog":"6"} |
| jon | {"cat":"2","cow":"4","dog":"1"} |
| sam | {"cow":"3","dog":"3"} |
+-----+---------------------------------+
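For readers who want to check the expected pivot by hand, here is a minimal Python sketch of the same conditional aggregation. The function name `pivot_sum` and the tuple layout are assumptions for illustration; the rows are the example input from the question.

```python
from collections import defaultdict


def pivot_sum(rows, pets):
    """Sum qty per (own, pet) and return one list of totals per owner.

    A pet the owner does not have yields None, mirroring the NULL that
    sum(case when ... end) returns when no row matches.
    """
    totals = defaultdict(dict)
    for own, pet, qty in rows:
        totals[own][pet] = totals[own].get(pet, 0) + qty
    return {own: [by_pet.get(pet) for pet in pets] for own, by_pet in totals.items()}


rows = [
    ("bob", "dog", 2), ("bob", "dog", 3), ("bob", "dog", 1), ("bob", "cat", 1),
    ("jon", "dog", 1), ("jon", "cat", 1), ("jon", "cat", 1), ("jon", "cow", 4),
    ("sam", "dog", 3), ("sam", "cow", 1), ("sam", "cow", 2),
]

pivot = pivot_sum(rows, ["dog", "cat", "cow"])
for own, values in pivot.items():
    print(own, values)
```

The column list is fixed up front, as in the case/sum query; the MAP variant avoids that by keeping the pet names as keys.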

Get average value for every N tuples in Apache Pig

Assuming I have a table with two columns CUSTTYPE and AMOUNT. I want to add a third column NTILE which I can then group on and use to get my averages, something like below:
CUSTTYPE | AMOUNT | NTILE
----------+---------+----------
RETAIL | 78.00 | 1
RETAIL | 234.00 | 1
RETAIL | 249.00 | 1
RETAIL | 278.00 | 2
RETAIL | 392.00 | 2
RETAIL | 498.00 | 2
RETAIL | 500.00 | 3
RETAIL | 738.00 | 3
RETAIL | 1250.00 | 3
RETAIL | 2029.00 | 4
RETAIL | 2393.00 | 4
RETAIL | 3933.00 | 4
Essentially, I am trying to take the average of every n terms (here, n=3):
CUSTTYPE | AMOUNT | NTILE
----------+---------+----------
RETAIL | 187.00 | 1
RETAIL | 389.33 | 2
RETAIL | 829.33 | 3
RETAIL | 2785.0 | 4
From the Pig reference here, it seems this could be achieved using Over() but I could not find an example of how this could be done. Thoughts?
You can rank every record of your data using RANK operator:
http://pig.apache.org/docs/r0.14.0/basic.html#rank
like this:
A = LOAD 'path' AS (schema);
B = RANK A;
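and then derive the bucket from each rank with integer division. RANK numbers rows from 1, so adding 2 before dividing by 3 maps ranks 1-3 to bucket 1, ranks 4-6 to bucket 2, and so on:
C = FOREACH B GENERATE ($0 + 2) / 3 AS NTILE, CUSTTYPE, AMOUNT;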
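The rank-to-bucket arithmetic can be checked with a short Python sketch. It is an illustration rather than Pig: the function name `bucket_averages` is an assumption, ranks are assumed to start at 1, and the amounts are assumed to arrive already sorted as in the question.

```python
def bucket_averages(amounts, n=3):
    """Average every n consecutive amounts, keyed by bucket number.

    With ranks starting at 1, (rank + n - 1) // n places ranks 1..n in
    bucket 1, ranks n+1..2n in bucket 2, and so on.
    """
    buckets = {}
    for rank, amount in enumerate(amounts, start=1):
        ntile = (rank + n - 1) // n
        buckets.setdefault(ntile, []).append(amount)
    return {
        ntile: round(sum(values) / len(values), 2)
        for ntile, values in buckets.items()
    }


amounts = [78.00, 234.00, 249.00, 278.00, 392.00, 498.00,
           500.00, 738.00, 1250.00, 2029.00, 2393.00, 3933.00]

averages = bucket_averages(amounts)
print(averages)
```

In Pig, the equivalent last step would be to group on the bucket column and average AMOUNT within each group.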

Display record count in listbox using multiple tables and fields

I need help with a query that I can't get to work correctly. I want a select box that displays the number of records associated with each theme. For some themes it works well, but for others it displays (0) when there are in fact 2 records. Any help would be greatly appreciated. My current query and table structure are below:
SELECT theme.id_theme, theme.theme, calender.start_date,
calender.id_theme1,calender.id_theme2, calender.id_theme3, COUNT(*) AS total
FROM theme, calender
WHERE (YEAR(calender.start_date) = YEAR(CURDATE())
AND MONTH(calender.start_date) > MONTH(CURDATE()) )
AND (theme.id_theme=calender.id_theme1)
OR (theme.id_theme=calender.id_theme2)
OR (theme.id_theme=calender.id_theme3)
GROUP BY theme.id_theme
ORDER BY theme.theme ASC
THEME table
|---------------------|
| id_theme | theme |
|----------|----------|
| 1 | Yoga |
| 2 | Music |
| 3 | Taichi |
| 4 | Dance |
| 5 | Coaching |
|---------------------|
CALENDAR table
|---------------------------------------------------------------------------|
| id_calender | id_theme1 | id_theme2 | id_theme3 | start_date | end_date |
|-------------|-----------|-----------|-----------|------------|------------|
| 1 | 2 | 4 | | 2015-07-24 | 2015-08-02 |
| 2 | 4 | 1 | 5 | 2015-08-06 | 2015-08-22 |
| 3 | 1 | 3 | 2 | 2014-10-11 | 2015-10-28 |
|---------------------------------------------------------------------------|
LISTBOX
|----------------|
| |
| Yoga (1) |
| Music (1) |
| Taichi (0) |
| Dance (2) |
| Coaching (1) |
|----------------|
Thanking you in advance
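I think the theme conditions should be wrapped in brackets. AND binds more tightly than OR, so without them the date filter only applies to the id_theme1 comparison:
((theme.id_theme=calender.id_theme1)
OR (theme.id_theme=calender.id_theme2)
OR (theme.id_theme=calender.id_theme3))
Hope this helps.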
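The effect of the missing brackets can be shown with a small Python sketch, since Python's `and` also binds more tightly than `or`. The function names and boolean parameters are assumptions for illustration: `date_ok` stands for the date filter and `t1` to `t3` for the three theme comparisons.

```python
def matches_unbracketed(date_ok, t1, t2, t3):
    # Parsed as (date_ok and t1) or t2 or t3: the date filter
    # only guards the first theme comparison.
    return date_ok and t1 or t2 or t3


def matches_bracketed(date_ok, t1, t2, t3):
    # The date filter applies to all three theme comparisons.
    return date_ok and (t1 or t2 or t3)


# A calendar row that fails the date filter but matches on id_theme2.
unbracketed = matches_unbracketed(False, False, True, False)
bracketed = matches_bracketed(False, False, True, False)
print(unbracketed, bracketed)
```

Without the brackets the row is still counted despite failing the date filter, which is how the per-theme totals end up wrong.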
