Get average value for every N tuples in Apache Pig - hadoop

Assume I have a table with two columns, CUSTTYPE and AMOUNT. I want to add a third column, NTILE, which I can then group on and use to get my averages, something like below:
CUSTTYPE | AMOUNT | NTILE
----------+---------+----------
RETAIL | 78.00 | 1
RETAIL | 234.00 | 1
RETAIL | 249.00 | 1
RETAIL | 278.00 | 2
RETAIL | 392.00 | 2
RETAIL | 498.00 | 2
RETAIL | 500.00 | 3
RETAIL | 738.00 | 3
RETAIL | 1250.00 | 3
RETAIL | 2029.00 | 4
RETAIL | 2393.00 | 4
RETAIL | 3933.00 | 4
Essentially, I am trying to take the average of every n terms (here, n=3):
CUSTTYPE | AMOUNT | NTILE
----------+---------+----------
RETAIL | 187.00 | 1
RETAIL | 389.33 | 2
RETAIL | 829.33 | 3
RETAIL | 2785.0 | 4
From the Pig reference here, it seems this could be achieved using Over() but I could not find an example of how this could be done. Thoughts?

You can rank every record of your data using the RANK operator:
http://pig.apache.org/docs/r0.14.0/basic.html#rank
like this:
A = LOAD 'path' AS (schema);
B = RANK A;
and then bucket each rank into groups of 3 (RANK produces a 1-based rank in the first field, and integer division of rank + 2 by 3 maps ranks 1-3 to 1, ranks 4-6 to 2, and so on):
C = FOREACH B GENERATE ($0 + 2) / 3 AS NTILE, CUSTTYPE, AMOUNT;
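To get the averages from the question, you can then group on CUSTTYPE and the new NTILE column and apply AVG; a minimal sketch, assuming AMOUNT was loaded as a numeric type:
D = GROUP C BY (CUSTTYPE, NTILE);
-- average the amounts within each (CUSTTYPE, NTILE) bucket
E = FOREACH D GENERATE FLATTEN(group) AS (CUSTTYPE, NTILE), AVG(C.AMOUNT) AS AMOUNT;
DUMP E;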

Related

Calculating Rolling Weekly Spend in Hive using Window Functions

I need to develop a distribution of customers' week-long spend. Every time a customer makes a purchase, I want to know how much they've spent with us in the past week. I would like to do this in Hive.
My data set is somewhat similar to this:
Spend_Table
Cust_ID | Purch_Date | Purch_Amount
1 | 1/1/19 | $10
1 | 1/2/19 | $21
1 | 1/3/19 | $30
1 | 1/4/19 | $11
1 | 1/5/19 | $21
1 | 1/6/19 | $31
1 | 1/7/19 | $41
2 | 1/1/19 | $12
2 | 1/2/19 | $22
2 | 1/3/19 | $32
2 | 1/5/19 | $42
2 | 1/7/19 | $52
2 | 1/9/19 | $62
2 | 1/11/19 | $72
So far, I've tried code that looks similar to this:
Select Cust_ID,
Purch_Date,
Purch_Amount,
sum(Purch_Amount) over (partition by Cust_ID order by unix_timestamp(Purch_Date) range between 604800 preceding and current row) as Rolling_Spend
from Spend_Table
Here is the output I was expecting:
Cust_ID | Purch_Date | Purch_Amount | Rolling_Spend
1 | 1/1/19 | $10 | $10
1 | 1/2/19 | $21 | $31
1 | 1/3/19 | $30 | $61
1 | 1/4/19 | $11 | $72
1 | 1/5/19 | $21 | $93
1 | 1/6/19 | $31 | $124
1 | 1/7/19 | $41 | $165
2 | 1/1/19 | $12 | $12
2 | 1/2/19 | $22 | $34
2 | 1/3/19 | $32 | $66
2 | 1/5/19 | $42 | $108
2 | 1/7/19 | $52 | $160
2 | 1/9/19 | $62 | $188
2 | 1/11/19 | $72 | $228
I believe the issue is with my range between clause, because it appears to be grabbing the preceding number of rows. I was expecting it to grab data within the preceding amount of seconds (604800 being 7 days in seconds).
Is what I'm trying to do feasible? I can't simply take the previous 6 rows, since not every customer makes a purchase every single day (like customer 2). Any help is greatly appreciated!
You can use an interval-based window frame instead of a fixed number of preceding rows:
SELECT *, sum(Purch_Amount) OVER (
    PARTITION BY Cust_ID
    ORDER BY CAST(Purch_Date AS timestamp)
    RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS cumulativeSum FROM Spend_Table
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
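Note that CAST(Purch_Date AS timestamp) assumes the column is stored in a timestamp-compatible format such as yyyy-MM-dd. With dates stored like the sample (1/1/19), you would need to parse them first; a sketch, assuming the 'M/d/yy' pattern actually matches the stored strings:
SELECT *, sum(Purch_Amount) OVER (
    PARTITION BY Cust_ID
    -- parse the string date, then cast its standard representation to a timestamp
    ORDER BY CAST(from_unixtime(unix_timestamp(Purch_Date, 'M/d/yy')) AS timestamp)
    RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS cumulativeSum FROM Spend_Table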
Moving the answer here from the question:
I was able to get the original code to work by changing it to:
Select Cust_ID,
Purch_Date,
Purch_Amount,
sum(Purch_Amount) over (partition by Cust_ID order by unix_timestamp(Purch_Date, 'MM-dd-yyyy') range between 604800 preceding and current row) as Rolling_Spend
from Spend_Table
The key was specifying the date format in the unix_timestamp function.
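If the rolling sums still look off, it is worth confirming that the pattern really matches how the dates are stored, since unix_timestamp returns NULL on a mismatch (for dates written like the sample, 1/1/19, a pattern such as 'M/d/yy' would be needed instead). A quick, hypothetical check:
-- rows where parsed_ts is NULL indicate the pattern does not match the stored date strings
SELECT Purch_Date,
       unix_timestamp(Purch_Date, 'MM-dd-yyyy') AS parsed_ts
FROM Spend_Table
LIMIT 20;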

Split a single row into multiple rows with grouping data check - Hive

I'm currently using the query below in Hive to split a row into multiple rows, but I also want to match each "Product" group to its corresponding "Category": the groups line up by position, with ";" separating the groups and "," separating the items within a group.
SELECT id, customer, product_split
FROM orders lateral view explode(split(product,';')) products AS product_split
Here is what my data looks like now:
| id | Customer| Category | Product |
+----+----------+---------------------------+-----------------------------------+
| 1 | John | Furniture; Technology | Bookcases, Chairs; Phones, Laptop |
| 2 | Bob | Office supplies; Furniture| Paper, Blinders; Tables |
| 3 | Dylan | Furniture | Tables, Chairs, Bookcases |
My desired result will look like this:
| id | Customer| Category | Product |
+----+----------+----------------+-----------+
| 1 | John | Furniture | Bookcases |
| 1 | John | Furniture | Chairs |
| 1 | John | Technology | Phones |
| 1 | John | Technology | Laptop |
| 2 | Bob | Office supplies| Paper |
| 2 | Bob | Office supplies| Blinders |
| 2 | Bob | Furniture | Tables |
| 3 | Dylan | Furniture | Tables |
| 3 | Dylan | Furniture | Chairs |
| 3 | Dylan | Furniture | Bookcases |
I have tried the query below and it works well; all credit goes to this question: Hive - Split delimited columns over multiple rows, select based on position
select id, customer, category_split as category, products
from
(
    SELECT id, customer, category_split, product_split
    FROM orders
    lateral VIEW posexplode(split(category, ';')) c AS pos_category, category_split
    lateral VIEW posexplode(split(product, ';')) p AS pos_product, product_split
    WHERE pos_category = pos_product
) a
lateral view explode(split(product_split, ',')) ps AS products
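One detail to watch: the sample data has spaces after the ";" and "," delimiters, so the exploded values can come out with leading spaces. Since Hive's split() takes a regular expression, a hedged variant is to make the surrounding whitespace part of the delimiter, replacing the corresponding lines above:
-- consume optional whitespace around the delimiters so the exploded values come out trimmed
lateral VIEW posexplode(split(category, '\\s*;\\s*')) c AS pos_category, category_split
lateral VIEW posexplode(split(product, '\\s*;\\s*')) p AS pos_product, product_split
lateral view explode(split(product_split, '\\s*,\\s*')) ps AS products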

How to implement Oracle's "func(...) keep (dense_rank ...)" In Hive

I have a table abcd in Oracle DB
+-------------+----------+
| abcd.speed | abcd.ab |
+-------------+----------+
| 4.0 | 2 |
| 4.0 | 2 |
| 7.0 | 2 |
| 7.0 | 2 |
| 8.0 | 1 |
+-------------+----------+
And I'm using a query like this:
select min(speed) keep (dense_rank last order by abcd.ab NULLS FIRST) MOD from abcd;
I'm trying to convert the code to Hive, but it looks like keep is not available in Hive.
Could you suggest an equivalent statement?
select -max(struct(ab,-speed)).col2 as mod
from abcd
;
+------+
| mod |
+------+
| 4.0 |
+------+
Let's start by explaining min(speed) keep (dense_rank last order by abcd.ab NULLS FIRST):
Find the row(s) with the max value of ab.
For this/those row(s), find the min value of speed.
We are using 2 tricks here.
The 1st is based on the ability to get the max value of a struct.
max(struct(c1,c2,c3,...)) returns the same result as if you had sorted the structs by c1, then by c2, then by c3, etc., and then chosen the last element.
The 2nd trick is to use -speed (which is the same as -1*speed).
Finding the max of -speed and then negating that value (which gives us speed back) is the same as finding the min of speed.
If we had ordered the structs, they would have looked like this (since 2 is bigger than 1 and -4 is bigger than -7):
+----+-------+
| ab | speed |
+----+-------+
| 1 | -8.0 |
| 2 | -7.0 |
| 2 | -7.0 |
| 2 | -4.0 |
| 2 | -4.0 |
+----+-------+
The last struct in this case is struct(2,-4.0), so that is the result of the max function.
The field names of a struct are col1, col2, col3, etc., so
struct(2,-4.0).col2 is -4.0, and preceding it with a minus (which is the same as multiplying it by -1), as in -struct(2,-4.0).col2, gives 4.0.
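If you prefer something closer to the original wording, an alternative (just a sketch, using a standard window function rather than the struct trick) is to rank the rows by ab and aggregate only the top-ranked ones; note that null ordering may differ slightly from Oracle's NULLS FIRST:
-- rows with the largest ab get rnk = 1; take the min speed among them
select min(speed) as mod
from (
    select speed,
           dense_rank() over (order by ab desc) as rnk
    from abcd
) t
where rnk = 1;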

Efficient way to join by levenshtein in Hive or Impala

I have two tables: one (NLIST) has about 17K records, while the other (FNAMES) has about 57K.
I would like to join the two by comparing the records using the Levenshtein distance.
Here is an example of the tables' contents:
Table NLIST:
+------+-------------+
| ID | S_NAME |
+------+-------------+
| 1 | Avi |
| 2 | Moshe |
| 3 | David |
....
Table FNAMES:
+------+-------------+
| ID | NICKNAMES |
+------+-------------+
| 1 | Avile |
| 2 | Dudi |
| 3 | Moshiko |
| 4 | Avi |
| 5 | DAVE |
....
The above tables are just examples. In the real case the names column can include more than one word.
The required result should be:
+------+-------------+--------+
| ID | NICKNAMES | S_NAME |
+------+-------------+--------+
| 1 | Avile | Avi |
| 2 | Dudi | David |
| 3 | Moshiko | Moshe |
| 4 | Avi | Avi |
| 5 | DAVE | David |
...
Here is the code I use:
select FNAMES.NICKNAMES, NLIST.S_NAME
from FNAMES
LEFT OUTER JOIN NLIST
ON (true)
WHERE levenshtein(FNAMES.NICKNAMES, NLIST.S_NAME) <= 4
The above code runs for a very long time, so I stopped it.
How can I make it run in a reasonable time?
In addition, I think the levenshtein distance depends on the length of the words. How can I find the optimal value for the distance (in this case I chose 4 arbitrarily)?
Hive table performance depends on various points:
Query engine
File format
Vectorization: set hive.vectorized.execution.enabled = true; set hive.vectorized.execution.reduce.enabled = true;
If you have a good server, you can try Impala, which is generally faster than Hive.
You can also fine-tune Impala, which will give you an edge in executing this query faster: Tuning Impala for Performance
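Independent of the engine, the main cost here is the unconstrained cross join (roughly 17K x 57K, close to a billion Levenshtein calls). One common trick, offered only as a sketch, is to add a cheap equality "blocking" condition to the join, for example matching on the lower-cased first letter, so the engine can use an equi-join and only compute the distance within each block; blocking can miss matches whose keys differ, so choose it to fit your data:
select f.ID, f.NICKNAMES, n.S_NAME
from FNAMES f
join NLIST n
  -- blocking key: first letters must match before the expensive distance is computed
  on substr(lower(f.NICKNAMES), 1, 1) = substr(lower(n.S_NAME), 1, 1)
where levenshtein(lower(f.NICKNAMES), lower(n.S_NAME)) <= 4;
As for the threshold, a fixed value like 4 penalizes short names; a common heuristic is to scale the allowed distance with the length of the longer string, e.g. levenshtein(a, b) <= 0.3 * greatest(length(a), length(b)).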

Sum of the grouped distinct values

This is a bit hard to explain in words ... I'm trying to calculate a sum of grouped distinct values in a matrix. Let's say I have the following data returned by a SQL query:
------------------------------------------------
| Group | ParentID | ChildID | ParentProdCount |
| A | 1 | 1 | 2 |
| A | 1 | 2 | 2 |
| A | 1 | 3 | 2 |
| A | 1 | 4 | 2 |
| A | 2 | 5 | 3 |
| A | 2 | 6 | 3 |
| A | 2 | 7 | 3 |
| A | 2 | 8 | 3 |
| B | 3 | 9 | 1 |
| B | 3 | 10 | 1 |
| B | 3 | 11 | 1 |
------------------------------------------------
There's some other data in the query, but it's irrelevant. ParentProdCount is specific to the ParentID.
Now, I have a matrix in the MS Report Designer in which I'm trying to calculate a sum for ParentProdCount (grouped by "Group"). If I just add the expression
=Sum(Fields!ParentProdCount.Value)
I get a result of 20 for Group A and 3 for Group B, which is incorrect. The correct values should be 5 for group A and 1 for group B. This wouldn't happen if ChildID weren't involved, but I have to use some other child-specific data in the same matrix.
I tried to nest FIRST() and SUM() aggregate functions but apparently it's not possible to have nested aggregation functions, even when they have scopes defined.
I'm pretty sure there is some way to calculate the grouped distinct sum without needing to create another SQL query. Anyone got an idea how to do that?
OK, I got this sorted out by adding a ROW_NUMBER() function to my SQL query:
SELECT Group, ParentID, ROW_NUMBER() OVER (PARTITION BY ParentID ORDER BY ChildID ASC) AS Position, ChildID, ParentProdCount FROM Table
and then I replaced the SSRS SUM function with
=SUM(IIF(Fields!Position.Value = 1, Fields!ParentProdCount.Value, 0))
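As a sanity check of the expected totals, the distinct per-parent values can also be summed directly in SQL (a sketch; Group is bracketed because it is a reserved word, and Table stands for the real table name, as in the query above):
-- collapse to one row per parent, then sum the per-parent counts within each group
SELECT [Group], SUM(ParentProdCount) AS GroupProdCount
FROM (SELECT DISTINCT [Group], ParentID, ParentProdCount FROM [Table]) t
GROUP BY [Group];
This should return 5 for group A and 1 for group B, matching the values expected in the report.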
Put a grouping over the ParentID and use a summation over that group,
e.g.:
if the group over ParentID is named "ParentIDGroup",
then
the column sum of ParentProdCount = SUM(Fields!ParentProdCount.Value, "ParentIDGroup")
