I have a table called Log in which every row represents a single activity. The table structure looks like this:
info:date, info:ip_address, info:action, info:info
Example data looks like this:
Column Family : info
date | ip_address | action | info
3 March 2014 | 191.2.2.2 | delete | blabla
4 March 2014 | 191.2.2.3 | view | blabla
5 March 2014 | 191.2.2.4 | create | blabla
3 March 2014 | 191.2.2.5 | delete | blabla
4 March 2014 | 191.2.2.5 | create | blabla
4 March 2014 | 191.2.2.6 | delete | blabla
What I want to do is calculate the average of the total activity per date. The first step is to compute the total activity for each date:
time | total_activity
3 March 2014 | 2
4 March 2014 | 3
5 March 2014 | 1
Then I want to calculate the average of those total_activity values, so the output would be:
(2 + 3 + 1) / 3 = 2
How can I do this in HBase using MapReduce? My current thinking is that a single reducer is only capable of computing the total activity per date, not the overall average.
Thanks
I suggest you look into Scalding - it's one of the easiest and fastest ways to write production Hadoop jobs, and it ties in easily with HBase. Here is a project that helps with HBase and Scalding: https://github.com/ParallelAI/SpyGlass/blob/master/src/main/scala/parallelai/spyglass/hbase/example/SimpleHBaseSourceExample.scala
Then have a look at the Scalding API to work out how to do what you want:
https://github.com/twitter/scalding/wiki/Fields-based-API-Reference
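As a rough sketch of the two-step aggregation (count per date, then one global average), a minimal fields-based Scalding job might look like this. It assumes the log has already been extracted to a TSV file - an assumption on my part, not something your setup guarantees:

import com.twitter.scalding._

// Sketch only: assumes the HBase rows were dumped to TSV as (date, ip_address, action, info).
class ActivityAverageJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('date, 'ip_address, 'action, 'info))
    .groupBy('date) { _.size('total_activity) }       // total activity per date
    .groupAll { _.average('total_activity -> 'avg) }  // one global average (single reducer)
    .write(Tsv(args("output")))
}

The groupAll step is the "one reducer" you were thinking of, but it only has to average a handful of per-date totals rather than scan the raw log.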
I want to convert the source below to the expected target table.
Source:
========================
Course | Year | Earning
========================
.NET | 2012 | 10000
Java | 2012 | 20000
.NET | 2012 | 5000
.NET | 2013 | 48000
Java | 2013 | 30000
Expected Output:
=====================
Year | .NET | Java
=====================
2012 | 15000 | 20000
2013 | 48000 | 30000
You can treat this as a rows-to-columns problem. Here I am assuming you have only the .NET and Java courses; if you have more, you need to add more columns to the transformations below.
First, use an Expression transformation with the output ports below. They split the earning by course:
java = IIF(Course = 'Java', Earning, 0)
dotnet = IIF(Course = '.NET', Earning, 0)
Then use an Aggregator transformation to sum those columns:
Year -- input/output port, with group by enabled
out_java = SUM(java)
out_dotnet = SUM(dotnet)
Link Year, out_java, and out_dotnet to the corresponding target columns.
So the whole mapping should look like:
SQ --> EXP --> AGG --> Target
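If it helps to see the conditional-sum logic outside the Informatica designer, here is a small, purely illustrative Scala sketch of the same pivot (the names and sample values come from the question; nothing here is Informatica code):

// Illustration of the IIF-then-SUM pivot: split earning by course, then sum per year.
case class Row(course: String, year: Int, earning: Int)

val rows = List(
  Row(".NET", 2012, 10000), Row("Java", 2012, 20000), Row(".NET", 2012, 5000),
  Row(".NET", 2013, 48000), Row("Java", 2013, 30000))

val pivoted = rows.groupBy(_.year).toList.sortBy(_._1).map { case (year, rs) =>
  val dotnet = rs.filter(_.course == ".NET").map(_.earning).sum  // the out_dotnet port
  val java   = rs.filter(_.course == "Java").map(_.earning).sum  // the out_java port
  (year, dotnet, java)
}
pivoted.foreach(println)  // (2012,15000,20000) and (2013,48000,30000)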
I want to create an analysis table in AWS QuickSight that shows the quantity sold in a given month and in its subsequent months, based on the users who made a purchase in the given month.
Let's say I have a dataset called user_orders with the following data:
+---------+----------+------------+
| user_id | quantity | order_date |
+---------+----------+------------+
| 1 | 2 | 2020-04-01 |
+---------+----------+------------+
| 1 | 3 | 2020-04-02 |
+---------+----------+------------+
| 1 | 1 | 2020-05-23 |
+---------+----------+------------+
| 1 | 2 | 2020-06-02 |
+---------+----------+------------+
| 2 | 1 | 2020-05-03 |
+---------+----------+------------+
| 2 | 1 | 2020-05-04 |
+---------+----------+------------+
| 3 | 2 | 2020-04-07 |
+---------+----------+------------+
| 3 | 1 | 2020-04-10 |
+---------+----------+------------+
| 3 | 1 | 2020-06-23 |
+---------+----------+------------+
For example, using the table above, I want to show how many units were sold in April, May, June, and so on (up to 12 months) by users who made a purchase in April.
The resulting table should look like this:
+-----------+----------+
| | quantity |
+-----------+----------+
| 04-2020 | 8 |
+-----------+----------+
| 05-2020 | 1 |
+-----------+----------+
| 06-2020 | 3 |
+-----------+----------+
8 units were sold in April because user_id 1 bought 5 and user_id 3 bought 3 in that month, while user_id 2 did not make any April purchase.
Only 1 unit is counted for May because user_id 1, who also purchased in April, bought 1 item in May. user_id 2 purchased in May as well, but didn't purchase in April, so those orders are not counted.
I can build the table above fairly easily with PHP and MySQL using the following code:
# first get all the user ids who made a purchase in April
# (sql_query is our own helper; assume it returns the ids as a comma-separated string)
$user_ids = sql_query("SELECT DISTINCT user_id FROM user_orders WHERE order_date BETWEEN '2020-04-01' AND '2020-04-30'");
# then get the quantity sold per month by those users over the following 12 months
$purchases = sql_query("SELECT MONTH(order_date), SUM(quantity) FROM user_orders WHERE user_id IN ({$user_ids}) AND order_date BETWEEN '2020-04-01' AND '2021-03-31' GROUP BY MONTH(order_date);");
(Obviously, April is just an example; I'd like to be able to change the starting month dynamically with a QuickSight control.)
As the example above shows, this requires two queries: the first gets the user_ids of the April buyers, and the second gets the quantities those users bought per month.
I have been trying to achieve this in QuickSight for the last 3 days but haven't found a way yet.
I hope someone can point me in the right direction.
Thank you!
You can achieve this by creating a calculated field like the following and filtering on it:
distinctCountOver(ifelse(truncDate('MM', {order_date}) = parseDate('2020-04-01'), 1, NULL), [{user_id}], PRE_AGG)
(Of course, you can change the parseDate portion to use your date parameter.)
Now, let's say the calculated field above is named SpecificMonthUser. You can add a filter sum(SpecificMonthUser) != 0.
Then create a pivot table visualization with order_date and user_id in the rows and sum(quantity) in the values. You should get the desired result.
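To make the mechanics concrete, here is the logic that calculated field and filter implement, written as a plain-Scala sketch (purely illustrative - none of this is QuickSight syntax):

import java.time.{LocalDate, YearMonth}

// Illustration of the cohort logic: flag users who ordered in the target month,
// then sum quantity per month over that cohort's orders only.
case class Order(userId: Int, quantity: Int, orderDate: LocalDate)

val orders = List(
  Order(1, 2, LocalDate.parse("2020-04-01")), Order(1, 3, LocalDate.parse("2020-04-02")),
  Order(1, 1, LocalDate.parse("2020-05-23")), Order(1, 2, LocalDate.parse("2020-06-02")),
  Order(2, 1, LocalDate.parse("2020-05-03")), Order(2, 1, LocalDate.parse("2020-05-04")),
  Order(3, 2, LocalDate.parse("2020-04-07")), Order(3, 1, LocalDate.parse("2020-04-10")),
  Order(3, 1, LocalDate.parse("2020-06-23")))

val cohortMonth = YearMonth.of(2020, 4)
// Step 1: the distinctCountOver + filter, i.e. users with at least one April order.
val cohort = orders.filter(o => YearMonth.from(o.orderDate) == cohortMonth).map(_.userId).toSet
// Step 2: the pivot table, i.e. sum(quantity) per month over the cohort's rows.
val byMonth = orders.filter(o => cohort(o.userId))
  .groupBy(o => YearMonth.from(o.orderDate)).toList.sortBy(_._1.toString)
  .map { case (month, os) => (month, os.map(_.quantity).sum) }
byMonth.foreach(println)  // (2020-04,8), (2020-05,1), (2020-06,3)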
I want to create a table analysis in AWS QuickSight that shows the number of new users per day, and also the total number of users registered up until that day, for a specified month.
The following sample table is what I want to achieve in QuickSight.
It shows the daily register count for March:
+-----------+----------------------+----------------------+
| | Daily Register Count | Total Register Count |
+-----------+----------------------+----------------------+
| March 1st | 2 | 42 |
+-----------+----------------------+----------------------+
| March 2nd | 5 | 47 |
+-----------+----------------------+----------------------+
| March 3rd | 3 | 50 |
+-----------+----------------------+----------------------+
| March 4th | 8 | 58 |
+-----------+----------------------+----------------------+
| March 5th | 2 | 60 |
+-----------+----------------------+----------------------+
The "Total Register Count" column above should show the total count of users registered from the beginning up until March 1st, and then for each row it should be incremented with the value from "Daily Register Count"
I'm absolutely scratching my head trying to implement the "Total Register Count". I have found some form of success using runningSum function however I need to be able to filter my dataset by month, and the runningSum function won't count the number outside of the filtered date.
My dataset is very simple, it looks like this:
+----+-------------+---------------+
| id | email | registered_at |
+----+-------------+---------------+
| 1  | aaa@aaa.com | 2020-01-01    |
+----+-------------+---------------+
| 2  | bbb@aaa.com | 2020-01-01    |
+----+-------------+---------------+
| 3  | ccc@aaa.com | 2020-01-03    |
+----+-------------+---------------+
| 4  | abc@aaa.com | 2020-01-04    |
+----+-------------+---------------+
| 5  | def@bbb.com | 2020-02-01    |
+----+-------------+---------------+
I hope someone can help me with this.
Thank you!
I am new to QuickSight, but the way I was able to get the Total Register Count was by creating a calculated field called count and assigning it the fixed value 1.
Then I created a second calculated field "Total Register Count" with the following formula:
runningSum(sum(count), [{registered_at} ASC], [])
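In plain terms, that formula is just a cumulative sum of the daily counts in date order. A tiny, purely illustrative Scala sketch of the same idea (the 40 is a made-up count of users registered before March, matching the sample table above):

// Illustration of a running sum over daily register counts.
val dailyCounts = List(("2020-03-01", 2), ("2020-03-02", 5), ("2020-03-03", 3))
val registeredBeforeMonth = 40  // assumed carry-in from before March 1st

// scanLeft accumulates the total after each day: 42, 47, 50.
val running = dailyCounts.scanLeft(("start", registeredBeforeMonth)) {
  case ((_, total), (day, n)) => (day, total + n)
}.tail
running.foreach(println)  // (2020-03-01,42), (2020-03-02,47), (2020-03-03,50)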
It sounds as if the countOver function would work well for you. You'll need to partition your count by the day of the month (using the extract function). Here is a link for the countOver function:
https://docs.aws.amazon.com/quicksight/latest/user/countOver-function.html
This is called a Level Aware Aggregation in QuickSight. Here is additional information on that:
https://docs.aws.amazon.com/quicksight/latest/user/level-aware-aggregations.html
Here is information on the extract function:
https://docs.aws.amazon.com/quicksight/latest/user/extract-function.html
If I were to take a stab at your formula, it would look like this:
countOver(ID, [extract('DD', registered_at)], PRE_FILTER)
Your table would have the registered_at field as the date.
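For intuition, partitioning the count by day just means counting the IDs that fall within each registered_at day. A minimal, purely illustrative Scala sketch (not QuickSight syntax):

// Illustration of counting IDs per day-of-registration partition.
val regs = List((1, "2020-03-01"), (2, "2020-03-01"), (3, "2020-03-02"))
val perDay = regs.groupBy(_._2).toList.sortBy(_._1).map { case (day, rs) => (day, rs.size) }
perDay.foreach(println)  // (2020-03-01,2), (2020-03-02,1)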
I am trying to summarize a table in the OBIEE Analysis tool (11g) using the EVALUATE or EVALUATE_AGGR function. I have tried the traditional MAX and MIN without EVALUATE, but due to a bug with the UNION functionality I am not getting the desired result.
+------------------+------+-----------+----------+
| Loan ID | Year | Month | Balance |
+------------------+------+-----------+----------+
| L201618100000009 | 2021 | March | 232,000 |
| L201618100000009 | 2021 | June | 232,000 |
| L201618100000009 | 2021 | September | 232,000 |
| L201618100000009 | 2021 | December | 232,000 |
+------------------+------+-----------+----------+
EVALUATE_AGGR('MAX(%1 by %2, %3)', "Loan and Debt Interest"."Loan BOP Amount", "Time"."Year", "Loans"."Loan ID")
I am getting this error: [nQSError: 10058] A general error has occurred. [nQSError: 43113] Message returned from OBIS. [nQSError: 43119] Query Failed: Please have your System Administrator look at the log for more details on this error. (HY000)
Below is a table of what I am expecting, but because of the UNION the traditional MAX and MIN functions are not working (both return 928K).
+------------------+------+------------------+-------------------+
| Loan ID | Year | (MAX)BOP Balance | (MIN)EOP Balance |
+------------------+------+------------------+-------------------+
| L201618100000009 | 2021 | 232,000 | 232,000 |
+------------------+------+------------------+-------------------+
I'm a bit confused by the recent (re-)increase of questions like "I want to do this SQL in OBI". That's not how the tool works. That's not how it is designed.
a) If you are forced to do UNION requests, then your data model is poor to begin with.
b) You can easily create a level-based measure in the RPD which is tied to the year level of your time hierarchy and then set the aggregation rule to MAX. Same for MIN. That requires a proper data model though.
c) In the analysis you can also create a new calculated column using MAX("Balance" by "Loan ID", "Year") and it will also give you the same result.
I have a table with rows in the following format:
user | purchase | time_of_purchase | quantity
Sample:
1234 | Bread | Jul 7 20:48 | 1
1234 | Shaving Cream | July 10 14:20 | 2
5678 | Milk | July 7 3:48 | 1
5678 | Bread | July 7 3:49 | 2
5678 | Bread | July 7 15:30 | 1
I want to create each user's purchase history in the following format:
1234 | {[Bread, Jul 7 20:48, 1], [Shaving Cream, July 10 14:20, 2]}
5678 | {[Milk, July 7 3:48, 1], [Bread, July 7 3:49, 2], [Bread, July 7 15:30, 1]}
Is it possible to do this in a Hive or Pig script? I tried collect_list, but that does not keep the columns together in order. I also tried Brickhouse's collect, but that behaves like collect_set and I lose part of the information.
Pig script:
File = LOAD 'file.txt' USING PigStorage(',') AS (user:int, purchase:chararray, timeofpurchase:chararray, quantity:int);
GRP_USER = GROUP File BY user;
-- a nested ORDER inside FOREACH keeps each user's purchases in time order
-- (assuming timeofpurchase sorts correctly as a string)
HISTORY = FOREACH GRP_USER { ordered = ORDER File BY timeofpurchase; GENERATE group AS user, ordered.(purchase, timeofpurchase, quantity); };
DUMP HISTORY;
You can refer to a few examples at http://ybhavesh.blogspot.com/
Hope it helps.