Pig: Aggregate with specific parameter dynamically

I have some data that I want to plot, but the aggregation parameter depends on the user. The data looks like this:
Date Country Browser Count
---- ------- ------- -----
2015-07-11,US,Chrome,18
2015-07-11,US,Opera Mini,10
2015-07-11,US,Firefox,21
2015-07-11,US,IE,11
2015-07-11,US,Safari,8
...
2015-07-11,UK,Chrome,102
2015-07-11,UK,IE,45
2015-07-11,UK,Mobile Safari,47
2015-07-11,UK,Firefox,40
...
2015-07-11,DE,Android browser,50
2015-07-11,DE,Chrome,3
2015-07-11,DE,IE,11
2015-07-11,DE,Firefox,20
The user will tell me what to aggregate by (country or browser), and I want to display the counts. For example, grouped by country (aggregated by browser) it would be -
2015-07-11,US,ALL,68
2015-07-11,UK,ALL,234
2015-07-11,DE,ALL,84
whereas grouped by browser (aggregated by country) might be -
2015-07-11,ALL,IE,67
2015-07-11,ALL,Chrome,123
2015-07-11,ALL,Firefox,81
Would I have to write separate scripts based on each column or are there more efficient ways of doing this?

It is better to write a macro that takes the alias name and the field name ('country' or 'browser') as parameters. The aggregation logic stays the same.
Ref : http://pig.apache.org/docs/r0.10.0/cont.html#macros
Example :
Pig script :
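DEFINE AGGR_DATA(alias, field_name) RETURNS aggr_alias {
-- group by date plus the requested field; dt and cnt are assumed field names
alias_grp = GROUP $alias BY (dt, $field_name);
-- sum the counts within each group
$aggr_alias = FOREACH alias_grp GENERATE FLATTEN(group), SUM($alias.cnt) AS total;
};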
Usage :
aggr_data = AGGR_DATA(browser_data_alias,'country');
Pass the required field name to macro.
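To sanity-check the totals the macro is expected to produce, the same group-and-sum logic can be sketched outside Pig. This is only an illustration in Python, not part of the Pig answer; the names ROWS, FIELDS and aggregate are hypothetical, and the rows are the sample rows shown in the question.

```python
from collections import defaultdict

# Sample rows from the question: (date, country, browser, count).
ROWS = [
    ("2015-07-11", "US", "Chrome", 18),
    ("2015-07-11", "US", "Opera Mini", 10),
    ("2015-07-11", "US", "Firefox", 21),
    ("2015-07-11", "US", "IE", 11),
    ("2015-07-11", "US", "Safari", 8),
    ("2015-07-11", "UK", "Chrome", 102),
    ("2015-07-11", "UK", "IE", 45),
    ("2015-07-11", "UK", "Mobile Safari", 47),
    ("2015-07-11", "UK", "Firefox", 40),
    ("2015-07-11", "DE", "Android browser", 50),
    ("2015-07-11", "DE", "Chrome", 3),
    ("2015-07-11", "DE", "IE", 11),
    ("2015-07-11", "DE", "Firefox", 20),
]

# Column position of each field the user may choose to keep (assumed layout).
FIELDS = {"country": 1, "browser": 2}


def aggregate(rows, field_name):
    """Sum the counts per (date, chosen field); the other field is collapsed."""
    idx = FIELDS[field_name]
    totals = defaultdict(int)
    for row in rows:
        totals[(row[0], row[idx])] += row[3]
    return dict(totals)


by_country = aggregate(ROWS, "country")
by_browser = aggregate(ROWS, "browser")
```

Only the field name changes between the two calls, which is the same idea as passing 'country' or 'browser' to the macro.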

Related

How to Select multiple related columns in add calculated fields in Quicksight parameter using ifelse?

I have a parameter 'type' in a table and it can have multiple values as follows -
human
chimpanzee
orangutan
I have 3 columns related to each type in the table -
human_avg_height, human_avg_weight, human_avg_lifespan
chimpanzee_avg_height, chimpanzee_avg_weight, chimpanzee_avg_lifespan
orangutan_avg_height, orangutan_avg_weight, orangutan_avg_lifespan
So if I select the type as human, the QuickSight dashboard should display only the three columns -
human_avg_height, human_avg_weight, human_avg_lifespan
and should not display the following columns -
chimpanzee_avg_height, chimpanzee_avg_weight, chimpanzee_avg_lifespan
orangutan_avg_height, orangutan_avg_weight, orangutan_avg_lifespan
I created the parameter type and in the add calculated fields I am trying to use ifelse to select the columns based on the parameter selected as follows -
ifelse(${type}='human',{human_avg_height}, {human_avg_weight}, {human_avg_lifespan},{function})
I also tried -
ifelse(${type}='human',{{human_avg_height}, {human_avg_weight}, {human_avg_lifespan},{function}})
And -
ifelse(${type}='human',{human_avg_height, human_avg_weight, human_avg_lifespan},{function}})
But none of these work. What am I doing wrong?

One way to do this would be to use three different calculated fields, one for all the heights, one for weights and one for lifespan. The heights one would look like this:
ifelse(
${type}='human', {human_avg_height}, ifelse(
${type}='chimpanzee', {chimpanzee_avg_height}, ifelse(
${type}='orangutan', {orangutan_avg_height},
NULL
)))
Make similar calculated fields for weights and lifespan, then add these calculated fields to your table and filter by type.
To make it clear to the viewer what data is present, edit the Title of the visual to include the type:
${type} Data

You have to create one calculated field for each measure, using ifelse with the type to choose the correct value, but it is not necessary to nest ifelse calls as skabo did. The ifelse syntax is ifelse(if, then [, if, then ...], else), so you can define the calculated fields as follows:
avg_height = ifelse(${type}='human', {human_avg_height}, ${type}='chimpanzee', {chimpanzee_avg_height},${type}='orangutan', {orangutan_avg_height}, NULL)
avg_weight = ifelse(${type}='human', {human_avg_weight}, ${type}='chimpanzee', {chimpanzee_avg_weight},${type}='orangutan', {orangutan_avg_weight}, NULL)
avg_lifespan = ifelse(${type}='human', {human_avg_lifespan}, ${type}='chimpanzee', {chimpanzee_avg_lifespan},${type}='orangutan', {orangutan_avg_lifespan}, NULL)
Then use those calculated fields in your visuals.
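The flat ifelse(if, then [, if, then ...], else) form can be read as "return the value of the first condition that holds, otherwise the final default". A minimal Python sketch of that reading follows; it is not QuickSight code, and the ifelse and avg_height helpers here are hypothetical stand-ins that only mirror the calculated field above.

```python
def ifelse(*args):
    """Take condition/value pairs followed by a default; return the first match."""
    if len(args) < 3 or len(args) % 2 == 0:
        raise ValueError("expected condition/value pairs followed by a default")
    *pairs, default = args
    for condition, value in zip(pairs[0::2], pairs[1::2]):
        if condition:
            return value
    return default


def avg_height(selected_type, row):
    """Mirror of the avg_height calculated field for one row of data."""
    return ifelse(
        selected_type == "human", row["human_avg_height"],
        selected_type == "chimpanzee", row["chimpanzee_avg_height"],
        selected_type == "orangutan", row["orangutan_avg_height"],
        None,
    )
```

One difference worth keeping in mind: Python evaluates every argument before the call, so this sketch only models which value is selected, not how QuickSight evaluates the expression.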

Pig - how to select only some values from the list (not just simple distinct)?

Let's say I have intput_file.txt (user_id, event_code, event_date):
1,a,1
1,b,2
2,a,3
2,b,4
2,b,5
2,b,6
2,c,7
2,b,8
As you can see, user_id = 2 has events like this: abbbcb
I'd like to have a result like this:
1,{(a,1),(b,2)}
2,{(a,2),(b,6),(c,7),(b,8)}
So when there are several consecutive events with the same code, I'd like to keep only the last one.
Can you please share any hints?
Regards
Pawel

The main thing you are describing is what GROUP BY does.
In this case:
B = GROUP A BY user_id;
This gathers your records by user_id. Your data will now look like this:
1,{(1,a,1),(1,b,2)}
2,{(2,a,3),(2,b,4),(2,b,5),(2,b,6),(2,c,7),(2,b,8)}
You say you only want the last one (I assume you mean the one with the greatest event_date). To do this, you can do a nested FOREACH with an ORDER BY to sort by date, and then take the first one with LIMIT. Note that this has arbitrary behavior when there are ties.
C = FOREACH B {
DA = ORDER A BY event_date DESC;
DB = LIMIT DA 1;
GENERATE FLATTEN(group), FLATTEN(DB.event_code), FLATTEN(DB.event_date);
}
Your data should now look like this:
1,b,2
2,b,8
Another option would be to use a UDF to write some custom behavior on the groups given by GROUP BY:
B = GROUP A BY user_id;
C = FOREACH B GENERATE YourUDFThatYouBuilt(group, A);
In that UDF you'd write whatever custom behavior you want (in this case, return the tuple with the greatest date).

It seems like you could use the DistinctBy UDF from Apache DataFu to achieve this. This UDF, given a bag, returns the first instance found for a given field. In your case the field you care about is event_code. But we have to reverse the order, as you actually want the last instance.
One clarification though. Correct me if I'm wrong, but I think the intended output is:
1,{(a,1),(b,2)}
2,{(a,3),(b,6),(c,7),(b,8)}
That is, the (a,3) event occurs for user 2; there is no (a,2) event in the input.
Here's how you can do it:
-- pass in 1 because we want distinct by event code (position 1)
define DistinctBy datafu.pig.bags.DistinctBy('1');
C = FOREACH (GROUP A BY user_id) {
-- reverse so we can take the last event code occurrence
A_reversed = ORDER A BY event_date DESC;
-- use DistinctBy to get the first tuple having an occurrence of a field value
A_distinct_by_code = DistinctBy(A_reversed);
-- put back in order again
A_ordered = ORDER A_distinct_by_code BY event_date ASC;
GENERATE group as user_id, A_ordered.(event_code,event_date);
}
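The expected output in the question keeps both (b,6) and (b,8) for user 2, which suggests the goal is to keep the last event of each consecutive run of the same code rather than one tuple per distinct code. A minimal Python sketch of that reading, assuming events are ordered by event_date within each user; EVENTS and last_of_each_run are hypothetical names, and the same logic could live in a custom UDF.

```python
from itertools import groupby
from operator import itemgetter

# Sample input from the question: (user_id, event_code, event_date).
EVENTS = [
    (1, "a", 1), (1, "b", 2),
    (2, "a", 3), (2, "b", 4), (2, "b", 5),
    (2, "b", 6), (2, "c", 7), (2, "b", 8),
]


def last_of_each_run(events):
    """Per user, keep the last event of every consecutive run of one code."""
    result = {}
    # Sort by user, then by date, so runs are consecutive in time.
    ordered = sorted(events, key=lambda e: (e[0], e[2]))
    for user_id, user_events in groupby(ordered, key=itemgetter(0)):
        # groupby on the code splits the user's events into consecutive runs.
        runs = groupby(user_events, key=itemgetter(1))
        result[user_id] = [list(run)[-1][1:] for _, run in runs]
    return result


collapsed = last_of_each_run(EVENTS)
```

For user 2 the runs are a, bbb, c, b, so four events survive.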

How do I use the Hive "test in(val1, val2)" built in function?

The Programming Hive book lists a test in built in function in Hive, but it is not obvious how to use it and I've been unable to find examples
Here is the information from Programming Hive:
Return type  Signature               Description
-----------  ---------               -----------
BOOLEAN      test in(val1, val2, …)  Return true if test equals one of the values in the list.
I want to know if it can be used to say whether a value is in a Hive array.
For example if I do the query:
hive > select id, mcn from patients limit 2;
id mcn
68900015 ["7382771"]
68900016 ["8847332","60015163","63605102","63251683"]
I'd like to be able to test whether one of those numbers, say "60015163" is in the mcn list for a given patient.
Not sure how to do it.
I've tried a number of variations, all of which fail to parse. Here are two examples that don't work:
select id, test in (mcn, "60015163") from patients where id = '68900016';
select id, mcn from patients where id = '68900016' and test mcn in('60015163');

The function is not test in but simply in. In Table 6-5, test is a column name.
So to find out whether a value is in a Hive array, you first need to use explode on your array.
Instead of exploding the array column, you can create a UDF, as explained here: http://souravgulati.webs.com/apps/forums/topics/show/8863080-hive-accessing-hive-array-custom-udf-

Play Framework: How to render a table structure from plain SQL table

I would be happy to get a good way to get the "table" structure from a plain SQL table.
In my specific case, I need to render JSON structure used by Google Visualization API "datatable" object:
http://code.google.com/apis/chart/interactive/docs/reference.html#DataTable
However, an example in HTML would help too.
My "source" is a plain SQL table of "DailySales": its columns are "Day" (date), "Product" and "DailySaleTotal" (daily sale for that product). Please recall that my "model" reflects the 3-column table above.
The table columns should be the products (suppose there is only a small number of them). Each row should represent a specific date, and the row data are the actual sales for that day.
Date Product1 Product2 Product3
01/01/2012 30 50 60
01/02/2012 35 3 15
I was trying to use nested #{list} tags in a template, but unfortunately I failed to find a natural way to provide a template with a "list" to represent the "row data".
Of course, I can build a "helper object" in Java that will build a list of the "sales data" items per date - but this looks very weird to me.
I would be thankful to anyone who can provide an elegant solution.
Max

When you load your model, order it by date and product name. Then, in your controller, build a map with the date as key and, as value, the list of model objects that have that date.
Then, in your template, a first list iteration over the map keys produces the rows and a second list iteration over each list value produces the columns.
Something like
[
#{list modelMap.keySet(), as: 'date'}
[${date},#{list modelMap.get(date), as: 'product'}${product.dailySaleTotal}#{ifnot product_isLast},#{/ifnot}#{/list}]#{ifnot date_isLast},#{/ifnot}
#{/list}
]
You can then adapt the JSON rendering to the exact structure you want. Here it is an array of arrays.
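The grouping step done in the controller can be sketched independently of the framework. The following Python sketch is only an illustration of the date-keyed map and the resulting array of arrays; pivot_daily_sales and SALES are hypothetical names, the rows are taken from the table in the question, and it assumes every date has a row for every product.

```python
# Rows as they would come from the DailySales table: (day, product, total).
SALES = [
    ("01/01/2012", "Product1", 30),
    ("01/01/2012", "Product2", 50),
    ("01/01/2012", "Product3", 60),
    ("01/02/2012", "Product1", 35),
    ("01/02/2012", "Product2", 3),
    ("01/02/2012", "Product3", 15),
]


def pivot_daily_sales(rows):
    """Group totals by day (ordered by day, then product) into one row per day."""
    by_day = {}
    for day, product, total in sorted(rows):
        by_day.setdefault(day, []).append(total)
    # One output row per day: the day followed by that day's totals.
    return [[day] + totals for day, totals in by_day.items()]


table = pivot_daily_sales(SALES)
```

If some product has no sale on a given day, the columns would shift, so a real implementation would need to fill the gaps per product.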

Instead of generating the JSON yourself, as Seb suggested, you can let Json.toJson serialize the query result:
private static Result queryToJsonResult(String sql) {
SqlQuery sqlQuery = Ebean.createSqlQuery(sql);
return ok(Json.toJson(sqlQuery.findList()));
}

Change Group Labels in AdvancedDataGrid

I'm trying to use an AdvancedDataGrid to display some grouped data. Normally Flex displays this in a "tree view" with a folder icon representing the group. I need to group the data based on an integer ID field in my object, but I'd like the label for the folder icon to display the groupName field in my object.
Here's a little example:
{groupName: group1, ID: 1234}
{groupName: group2, ID: 5678}
<mx:grouping>
<mx:Grouping label="Group"> <!-- the label of the whole column -->
<mx:GroupingField name="ID"/>
</mx:Grouping>
</mx:grouping>
Resulting output:
=== Group ===
+ 1234
- child
- child
+ 5678
...
But I'd really like to output:
=== Group ===
+ group1
- child
- child
+ group2
...
If anyone has any tips I'd appreciate it.
-- Dan

Have a look at GroupingField#groupingFunction. From the Adobe docs:
A function that determines the label for this group. By default, the group displays the text for the field in the data that matches the field specified by the name property. However, sometimes you want to group the items based on more than one field in the data, or group based on something that is not a simple String field. In such a case, you specify a callback function by using the groupingFunction property.
private function myGroupingFunction(value:Object, field:GroupingField):String
