Pig Script Based On Grouping - hadoop

I have a data-set like this.
cus_ID BRAND AMOUNT
1 5 10
2 4 20
3 5 15
1 5 20
1 4 30
2 3 15
I want to find the top 5 brands, and the top 10 customer IDs for each of those top 5 brands, using Pig.

For your first goal (find top 5 brands), here you go (code not tested):
mydata = LOAD ... <load your data from your file or other source>
grouped = GROUP mydata BY brand;
flattened = FOREACH grouped GENERATE
    FLATTEN(group) AS brand,
    SUM(mydata.amount) AS amount_per_brand;
topfivebrand = LIMIT (ORDER flattened BY amount_per_brand DESC) 5;
DUMP topfivebrand;
That should get you started! :)
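To check what a Pig script should produce for both goals, here is a minimal plain-Python sketch run on the sample rows from the question. It assumes "top" means the highest total AMOUNT, as in the SUM above; the function name and the tuple layout are illustrative and not part of any Pig API.

```python
from collections import defaultdict

# Sample rows from the question: (cus_id, brand, amount)
rows = [(1, 5, 10), (2, 4, 20), (3, 5, 15), (1, 5, 20), (1, 4, 30), (2, 3, 15)]


def top_brands_with_customers(rows, n_brands=5, n_customers=10):
    """Return [(brand, brand_total, [(cus_id, customer_total), ...]), ...]."""
    brand_totals = defaultdict(int)
    customer_totals = defaultdict(lambda: defaultdict(int))
    for cus_id, brand, amount in rows:
        brand_totals[brand] += amount
        customer_totals[brand][cus_id] += amount

    # Rank brands by total amount, highest first, and keep the top n_brands.
    top = sorted(brand_totals.items(), key=lambda kv: kv[1], reverse=True)[:n_brands]

    result = []
    for brand, total in top:
        # Within each top brand, rank customers by their own total amount.
        customers = sorted(
            customer_totals[brand].items(), key=lambda kv: kv[1], reverse=True
        )[:n_customers]
        result.append((brand, total, customers))
    return result


result = top_brands_with_customers(rows)
for brand, total, customers in result:
    print(brand, total, customers)
```

With only three brands in the sample, all of them are returned; brand 4 ranks first with a total of 50.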

Related

Distinct on two columns with same data type

In my game application I have a combats table:
id player_one_id player_two_id
---- --------------- ---------------
1 1 2
2 1 3
3 3 4
4 4 1
Now I need to know how many unique users have played the game. How can I apply a distinct count across both columns, player_one_id and player_two_id?
Many thanks.
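A UNION drops duplicate rows, so the union of the two columns gives the distinct player IDs, which you can then count:
$playerOne = DB::table("combats")
    ->select("combats.player_one_id");
$uniquePlayers = DB::table("combats")
    ->select("combats.player_two_id")
    ->union($playerOne)
    ->get()     // fetch the de-duplicated IDs
    ->count();  // count them on the resulting collection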
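The idea behind the union can be sketched in plain Python with a set, using the sample rows from the question. This only illustrates the logic and is not Laravel code; the function name is made up for the example.

```python
# Sample rows from the question: (id, player_one_id, player_two_id)
combats = [(1, 1, 2), (2, 1, 3), (3, 3, 4), (4, 4, 1)]


def count_unique_players(combats):
    """Count distinct player IDs appearing in either player column."""
    players = set()
    for _id, player_one, player_two in combats:
        # A set keeps each ID once, which is what UNION does for rows.
        players.add(player_one)
        players.add(player_two)
    return len(players)


unique_players = count_unique_players(combats)
print(unique_players)
```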

Select and sum multiple columns for statistic purposes with Laravel query

I have a table, scores, where I save users' scores. It looks like this:
table `scores`
id | points | user_id
1 5 1
2 2 1
3 4 1
4 1 3
5 10 2
I want to select each user, sum their points, and show the result as a ranking. For the data above, the result should be
user_id | points
1 11
2 10
3 1
The query with which I came up is
$sumPoints = Scores::select( \DB::raw("sum(points) as numberOfPoints"), \DB::raw("count(id) as numberId"))->groupBy("user_id")->first();
The problem is ->first(): it returns only one result, which is how it is supposed to work. If I use ->get() instead, I get an Undefined property error. How should I use this?
The query which is working in phpmyadmin
SELECT count(id) as numberId, sum(points) as numberOfPoints FROM `scores` GROUP BY `user_id`
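get() returns a collection of rows rather than a single model, so loop over the result and read the properties from each row. Selecting user_id and ordering by the sum gives you the ranking:
$sumPoints = Scores::select("user_id", \DB::raw("sum(points) as numberOfPoints"), \DB::raw("count(id) as numberId"))
    ->groupBy("user_id")
    ->orderBy("numberOfPoints", "desc")
    ->get();
foreach ($sumPoints as $point) {
    dd($point); // or dd($point->numberOfPoints)
}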
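For reference, the grouping and ranking the query is meant to perform can be sketched in plain Python on the sample rows. This is only an illustration of the expected result, not Laravel code, and the function name is hypothetical.

```python
from collections import defaultdict

# Sample rows from the question: (id, points, user_id)
scores = [(1, 5, 1), (2, 2, 1), (3, 4, 1), (4, 1, 3), (5, 10, 2)]


def ranking(scores):
    """Sum points per user and return (user_id, total) pairs, highest first."""
    totals = defaultdict(int)
    for _id, points, user_id in scores:
        totals[user_id] += points
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


ranked = ranking(scores)
for user_id, points in ranked:
    print(user_id, points)
```

This reproduces the ranking shown in the question: user 1 with 11 points, user 2 with 10, and user 3 with 1.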

Sum multiple columns using PIG

I have multiple files with same columns and I am trying to aggregate the values in two columns using SUM.
The column structure is below
ID first_count second_count name desc
1 10 10 A A_Desc
1 25 45 A A_Desc
1 30 25 A A_Desc
2 20 20 B B_Desc
2 40 10 B B_Desc
How can I sum the first_count and second_count?
ID first_count second_count name desc
1 65 80 A A_Desc
2 60 30 B B_Desc
Below is the script I wrote, but when I execute it I get the error "Could not infer matching function for SUM as multiple or none of them fit. Please use an explicit cast."
A = LOAD '/output/*/part*' AS (id:chararray,first_count:chararray,second_count:chararray,name:chararray,desc:chararray);
B = GROUP A BY id;
C = FOREACH B GENERATE group as id,
SUM(A.first_count) as first_count,
SUM(A.second_count) as second_count,
A.name as name,
A.desc as desc;
Your LOAD statement is the problem: first_count and second_count are loaded as chararray, and SUM cannot add strings. If you are sure these columns contain only numbers, load them as int. Try this:
A = LOAD '/output/*/part*' AS (id:chararray,first_count:int,second_count:int,name:chararray,desc:chararray);
It should work.
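The same point can be shown in a small Python sketch: the counts arrive as strings and have to be cast to integers before they can be summed. It uses the sample rows from the question and assumes name and desc are constant per ID, so they are used as part of the grouping key; the function name is illustrative.

```python
# Sample rows from the question, with every field read as a string:
# (id, first_count, second_count, name, desc)
rows = [
    ("1", "10", "10", "A", "A_Desc"),
    ("1", "25", "45", "A", "A_Desc"),
    ("1", "30", "25", "A", "A_Desc"),
    ("2", "20", "20", "B", "B_Desc"),
    ("2", "40", "10", "B", "B_Desc"),
]


def sum_counts(rows):
    """Group by (id, name, desc) and sum both count columns as integers."""
    totals = {}
    for id_, first, second, name, desc in rows:
        key = (id_, name, desc)
        first_sum, second_sum = totals.get(key, (0, 0))
        # The explicit int() cast is the step the chararray schema was missing.
        totals[key] = (first_sum + int(first), second_sum + int(second))
    return [
        (key[0], first_sum, second_sum, key[1], key[2])
        for key, (first_sum, second_sum) in sorted(totals.items())
    ]


summed = sum_counts(rows)
for row in summed:
    print(row)
```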

SAS Sorting within group

I would like to sort this data by descending number of events and then by latest date, keeping the rows for each ID together.
I have tried proc sql:
proc sql;
create table new as
select *
from old
group by ID
order by events desc, date desc;
quit;
The result I currently get is
ID Date Events
1 09/10/2015 3
1 27/06/2014 3
1 03/01/2014 3
2 09/11/2015 2
3 01/01/2015 2
2 16/10/2014 2
3 08/12/2013 2
4 08/10/2015 1
5 09/11/2014 1
6 02/02/2013 1
The dates and events are sorted in descending order, but the IDs with multiple events are no longer grouped together.
Would it be possible to achieve the below in fewer steps?
ID Date Events
1 09/10/2015 3
1 27/06/2014 3
1 03/01/2014 3
3 01/01/2015 2
3 08/12/2013 2
2 09/11/2015 2
2 16/10/2014 2
4 08/10/2015 1
5 09/11/2014 1
6 02/02/2013 1
Thanks
It looks to me like you're trying to sort by descending event, then by either the earliest or latest date (I can't tell which one from your explanation), also descending, and then by id. In your proc sql query, you could try calculating the min or max of the Date variable, grouped by event and id, and then sort the result by descending event, the descending min/max of the date, and id.
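One reading of this suggestion can be sketched in plain Python: compute the latest date per ID, then sort by descending events, descending latest date, ID, and finally descending date within each ID. This assumes the dates are dd/mm/yyyy and that the latest date is the intended tie-breaker. Under that assumption ID 2 is placed before ID 3, which differs from the listing in the question, so treat it as an illustration of the approach rather than a reproduction of that output.

```python
from datetime import datetime

# Sample rows from the question: (id, date as dd/mm/yyyy, events)
rows = [
    (1, "09/10/2015", 3), (1, "27/06/2014", 3), (1, "03/01/2014", 3),
    (2, "09/11/2015", 2), (3, "01/01/2015", 2), (2, "16/10/2014", 2),
    (3, "08/12/2013", 2), (4, "08/10/2015", 1), (5, "09/11/2014", 1),
    (6, "02/02/2013", 1),
]


def parse(date_text):
    return datetime.strptime(date_text, "%d/%m/%Y")


def sort_grouped(rows):
    """Sort by events desc, latest date per ID desc, ID, then date desc."""
    latest = {}
    for id_, date_text, _events in rows:
        date = parse(date_text)
        if id_ not in latest or date > latest[id_]:
            latest[id_] = date
    return sorted(
        rows,
        key=lambda r: (
            -r[2],                       # more events first
            -latest[r[0]].toordinal(),   # IDs with a later latest date first
            r[0],                        # keep rows of one ID together
            -parse(r[1]).toordinal(),    # newest row first within an ID
        ),
    )


sorted_rows = sort_grouped(rows)
for row in sorted_rows:
    print(row)
```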

How do I resolve ties when selecting ProductIds that have the same Occurences

My database table contains records like this:
ID ProductId Occurences
1 1 1
2 2 5
3 3 3
4 4 3
5 5 5
6 6 8
7 7 9
Now I want to get the top 4 ProductIds with the highest Occurences.
This is my Query:
var data = (from temp in context.Product
orderby temp.Occurences descending
select temp).Take(4).ToList();
ProductId 2 and 5 have the same Occurences, and so do ProductId 3 and 4, so I don't know how to decide which of the tied ProductIds to take.
I select these ProductIds in order to display the products on my website; whatever this query returns is what gets displayed.
Can anyone give me an idea of how to resolve this?
Expected Output:
ID ProductId Occurences
1 7 9
2 6 8
For the 3rd position, which ProductId should I select, given that ProductId 2 and 5 have the same Occurences? And for the 4th position, which should I select between 3 and 4, since they are tied as well?
I suggest grouping by Occurences and picking the first product from each group:
var data = context.Product
    .GroupBy(p => p.Occurences)
    .OrderByDescending(grp => grp.Key)
    .Select(grp => grp.FirstOrDefault())
    .Take(4)
    .ToList();
Sorry, I have not tested this, but it should do the job. The OrderByDescending on the group key keeps the groups with the highest Occurences first.
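Two ways of handling the ties can be sketched in plain Python on the sample rows. The first keeps one product per distinct Occurences value, as in the grouping idea above. The second is a different technique: it keeps every product and breaks ties with a secondary key. Picking the lower ProductId as the tie-breaker is an assumption made for illustration, and both function names are hypothetical.

```python
# Sample rows from the question: (id, product_id, occurences)
products = [
    (1, 1, 1), (2, 2, 5), (3, 3, 3), (4, 4, 3),
    (5, 5, 5), (6, 6, 8), (7, 7, 9),
]


def ordered_products(products):
    """Highest occurences first; ties broken by the lower product_id."""
    return sorted(products, key=lambda p: (-p[2], p[1]))


def top_one_per_count(products, n=4):
    """Keep only the first product for each distinct occurences value."""
    seen_counts = set()
    result = []
    for _id, product_id, occurences in ordered_products(products):
        if occurences not in seen_counts:
            seen_counts.add(occurences)
            result.append(product_id)
    return result[:n]


def top_with_tie_break(products, n=4):
    """Keep tied products and let the secondary sort key decide the order."""
    return [p[1] for p in ordered_products(products)[:n]]


one_per_count = top_one_per_count(products)
with_tie_break = top_with_tie_break(products)
print(one_per_count)
print(with_tie_break)
```

The first variant returns ProductIds 7, 6, 2 and 3; the second returns 7, 6, 2 and 5. Either way the result is deterministic, which is what the original query lacks when rows tie.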
