I have a table with rows in the following format:
user | purchase | time_of_purchase | quantity
Sample:
1234 | Bread | Jul 7 20:48 | 1
1234 | Shaving Cream | July 10 14:20 | 2
5678 | Milk | July 7 3:48 | 1
5678 | Bread | July 7 3:49 | 2
5678 | Bread | July 7 15:30 | 1
I want to create the purchase history of each user in the following format:
1234 | {[Bread, Jul 7 20:48, 1], [Shaving Cream, July 10 14:20, 2]}
5678 | {[Milk, July 7 3:48, 1], [Bread, July 7 3:49, 2], [Bread, July 7 15:30, 1]}
Is it possible to do this in a Hive or Pig script? I tried collect_list, but it does not keep the same order across columns, so the separate lists cannot be combined. I also tried brickhouse collect, but that behaves like collect_set and I lose part of the information.
Pig script:
File = LOAD 'file.txt' USING PigStorage(',') AS (user:int, purchase:chararray, timeofpurchase:chararray, quantity:int); -- adjust the delimiter if your file is pipe-separated
GRP_USER = GROUP File BY user;
-- keep only the purchase details in the bag: user, {(purchase, timeofpurchase, quantity), ...}
HISTORY = FOREACH GRP_USER GENERATE group AS user, File.(purchase, timeofpurchase, quantity) AS history;
DUMP HISTORY;
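If Hive is an option instead, a minimal sketch along the same lines (assuming the data sits in a Hive table named purchases with the four columns above; collect_list over a struct needs a reasonably recent Hive version):
-- Pack the three purchase fields into one struct so they stay together,
-- then collect the structs per user. collect_list keeps duplicates, unlike collect_set.
SELECT `user`,
       collect_list(named_struct('purchase', purchase,
                                 'time_of_purchase', time_of_purchase,
                                 'quantity', quantity)) AS history
FROM (
    -- pre-sorting per user gives the collected list a deterministic order;
    -- sort on a real timestamp column if you need true chronological order,
    -- since time_of_purchase here is just a string
    SELECT `user`, purchase, time_of_purchase, quantity
    FROM purchases
    DISTRIBUTE BY `user`
    SORT BY `user`, time_of_purchase
) sorted
GROUP BY `user`;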
You can refer to a few examples at http://ybhavesh.blogspot.com/
Hope it helps.
I want to create an analysis table in AWS QuickSight that shows the quantity sold in a given month and its subsequent months, based on users who made a purchase in the given month.
Let's say I have a dataset called user_orders with the following data:
+---------+----------+------------+
| user_id | quantity | order_date |
+---------+----------+------------+
| 1       | 2        | 2020-04-01 |
| 1       | 3        | 2020-04-02 |
| 1       | 1        | 2020-05-23 |
| 1       | 2        | 2020-06-02 |
| 2       | 1        | 2020-05-03 |
| 2       | 1        | 2020-05-04 |
| 3       | 2        | 2020-04-07 |
| 3       | 1        | 2020-04-10 |
| 3       | 1        | 2020-06-23 |
+---------+----------+------------+
For example, using the table above I want to show how many units were sold in April, May, June, and so on (up to 12 months) by users who made a purchase in April.
The resulting table should look like this:
+-----------+----------+
| month     | quantity |
+-----------+----------+
| 04-2020   | 8        |
| 05-2020   | 1        |
| 06-2020   | 3        |
+-----------+----------+
8 units were sold in April because user_id 1 bought 5 and user_id 3 bought 3, while user_id 2 did not make any purchase in April.
Only 1 item counts for May because only user_id 1, who also purchased in April, bought in May. user_id 2 also purchased in May but not in April, so those purchases are not counted.
I can build the table above fairly easily with PHP and MySQL using the following code:
# First get all the user ids who made a purchase in April
$user_ids = sql_query("SELECT DISTINCT user_id FROM user_orders WHERE order_date BETWEEN '2020-04-01' AND '2020-04-30'");
# Then get the quantity sold in each month by the users who made a purchase in April
$purchases = sql_query("SELECT MONTH(order_date), SUM(quantity) FROM user_orders WHERE user_id IN ({$user_ids}) AND order_date BETWEEN '2020-04-01' AND '2021-03-31' GROUP BY MONTH(order_date)");
(Obviously, April is just an example; I'd like to be able to change the starting month dynamically using a QuickSight control.)
As the example above shows, this takes two queries: the first gets the user_ids of the users who purchased in the starting month, and the second gets the quantity those users bought in each month.
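For reference, the same logic could presumably be collapsed into a single MySQL query with a subquery (a sketch against the same user_orders table, hard-coding April as the starting month):
SELECT DATE_FORMAT(order_date, '%m-%Y') AS month,
       SUM(quantity)                    AS quantity
FROM user_orders
WHERE user_id IN (
        -- users who purchased in the starting month
        SELECT user_id
        FROM user_orders
        WHERE order_date BETWEEN '2020-04-01' AND '2020-04-30'
      )
  AND order_date BETWEEN '2020-04-01' AND '2021-03-31'
GROUP BY DATE_FORMAT(order_date, '%m-%Y')
ORDER BY MIN(order_date);
But my question is how to express this in QuickSight rather than in SQL.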
I have been trying to achieve this in QuickSight for the last 3 days but haven't found a way yet.
I hope someone can point me in the right direction.
Thank you!
You can achieve this by creating a calculated field like this and filtering on it
distinctCountOver(ifelse(truncDate('MM', {order_Date}) = parseDate('2020-04-01'), 1, NULL), [{user_id}], PRE_AGG)
(Of course, you can change the parseDate portion to use your date parameter.)
Now, let's say the name of the above calculated field is SpecificMonthUser. You can add a filter sum(SpecificMonthUser) != 0.
Then create a pivot table visualization with order_date and user_id in the rows and sum(quantity) in the values. You should get the desired result.
I have the table below:
+----------+----+
|customerID|name|
+----------+----+
| 1| Ram|
+----------+----+
I want the output to contain all possible permutations of the name value:
+----------+----+
|customerID|name|
+----------+----+
| 1| Ram|
| 2| Arm|
| 3| Mar|
| .| ...|
| .| ...|
+----------+----+
Split the string, explode the array, and cross join it with itself to find all possible combinations:
with s as (select col
             from (select explode(split(lower('Ram'), '')) as col) s  -- one row per character
            where col <> ''                                           -- drop empty strings produced by the split
          )
select concat(upper(s1.col), s2.col, s3.col) as name,
       row_number() over() as customerId
  from s s1
 cross join s s2
 cross join s s3
 where s1.col <> s2.col and s2.col <> s3.col;
Result:
OK
name customerid
Mam 1
Mar 2
Mrm 3
Mra 4
Ama 5
Amr 6
Arm 7
Ara 8
Rma 9
Rmr 10
Ram 11
Rar 12
Time taken: 185.638 seconds, Fetched: 12 row(s)
Without the final WHERE s1.col<>s2.col and s2.col<>s3.col you would get all combinations, including Aaa, Arr, Rrr, etc.
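If you want strict permutations, where each original letter is used exactly once (assuming the letters in the name are distinct), you would presumably also require the first and third characters to differ:
where s1.col <> s2.col
  and s2.col <> s3.col
  and s1.col <> s3.col;   -- also excludes rows like Mam, Ara, Rar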
I'm searching for a smart Oracle SQL solution to distribute data into a number of buckets. The order of x is important. I know there are a lot of algorithms, but I'm pretty sure there must be a smart SQL (analytic function) solution, e.g. NTILE(3), but I can't get it to work.
x|quantity
1|7
2|4
3|9
4|2
5|10
6|3
8|7
9|7
10|4
11|9
12|2
13|10
16|3
17|7
The result should look something like this:
x_from|x_to|sum(quantity)
1|4|22
...and so on
Thanks in advance
Tim
This example divides the table into 4 buckets (ntile( 4 )):
SELECT min( "x" ) as "From",
max( "x" ) as "To",
sum("quantity")
FROM (
SELECT t.*,
ntile( 4 ) over (order by "x" ) as group_no
FROM table1 t
)
GROUP BY group_no
ORDER BY 1;
| From | To | SUM("QUANTITY") |
|------|----|-----------------|
| 1 | 4 | 22 |
| 5 | 9 | 27 |
| 10 | 12 | 15 |
| 13 | 17 | 20 |
I have a table with the following fields:
date value
10-02-1900 23
09-05-1901 22
10-03-1900 10
10-02-1901 24
....
I have to return the maximum value for each year,
i.e.,
1900 23
1901 24
I tried the query below but am getting the wrong answer.
SELECT YEAR(FROM_UNIXTIME(UNIX_TIMESTAMP(date,'dd-mm-yyyy'))) as date,MAX(value) FROM teb GROUP BY date;
Can anyone suggest a query to do this?
Your pattern 'dd-mm-yyyy' uses 'mm', which means minutes; months are 'MM'. You also need to group by the year expression (or its position), not by the full date.
Option 1
select year(from_unixtime(unix_timestamp(date,'dd-MM-yyyy'))) as year
,max(value) as max_value
from t
group by year(from_unixtime(unix_timestamp(date,'dd-MM-yyyy')))
;
Option 2
Pre Hive 2.2.0:
set hive.groupby.orderby.position.alias=true;
As of Hive 2.2.0:
set hive.groupby.position.alias=true;
select year(from_unixtime(unix_timestamp(date,'dd-MM-yyyy'))) as year
,max(value) as max_value
from t
group by 1
;
+------+-----------+
| year | max_value |
+------+-----------+
| 1900 | 23 |
| 1901 | 24 |
+------+-----------+
P.S. Another way to extract the year:
from_unixtime(unix_timestamp(date,'dd-MM-yyyy'),'yyyy')
I have a table called Log in which every row represents a single activity. The table structure is:
info:date, info:ip_address, info:action, info:info
An example of the data:
Column Family : info
date | ip_address | action | info
3 March 2014 | 191.2.2.2 | delete | blabla
4 March 2014 | 191.2.2.3 | view | blabla
5 March 2014 | 191.2.2.4 | create | blabla
3 March 2014 | 191.2.2.5 | delete | blabla
4 March 2014 | 191.2.2.5 | create | blabla
4 March 2014 | 191.2.2.6 | delete | blabla
What I want to do is calculate the average of the total activity per date. The first thing to do is compute the total activity per date:
time | total_activity
3 March 2014 | 2
4 March 2014 | 3
5 March 2014 | 1
Then I want to calculate the average of that total_activity, so the output would be:
(2 + 3 + 1) / 3 = 2
How can I do this in HBase using MapReduce? My thinking so far is that a single reducer is only capable of computing the total activity per date.
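To make the aggregation concrete, this is the logic I'm after, written as SQL purely for illustration (pretending the same columns were visible as a relational table called log); what I actually want is the equivalent as a MapReduce job over HBase:
-- per-date activity counts, then the average over those counts
SELECT AVG(total_activity) AS avg_activity
FROM (
    SELECT `date`, COUNT(*) AS total_activity
    FROM log
    GROUP BY `date`
) per_date;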
Thanks
I suggest you look into Scalding; it's one of the easiest and fastest ways to write production Hadoop jobs, and it ties in easily with HBase. Here is a project that helps with HBase & Scalding: https://github.com/ParallelAI/SpyGlass/blob/master/src/main/scala/parallelai/spyglass/hbase/example/SimpleHBaseSourceExample.scala
Then have a look at the Scalding API to work out how to do what you want:
https://github.com/twitter/scalding/wiki/Fields-based-API-Reference