Apache Pig: apply LIMIT inside FOREACH referencing a top-level field

I'm having trouble referencing a "parent" field within a foreach:
grunt> describe METRICS_SOURCE_WITH_CNT
METRICS_SOURCE_WITH_CNT:
{group: (hostname: chararray,site_guid: chararray,timestamp: long),
JOIN_FIELDS_ONLY: {(timestamp: long, unique_pageviews: long)},cnt: long}
Note that cnt is the total number of tuples in the JOIN_FIELDS_ONLY bag.
METRICS_SOURCE_TOP3 = foreach METRICS_SOURCE_WITH_CNT {
    SORTED = ORDER JOIN_FIELDS_ONLY by unique_pageviews DESC;
    TOPK = LIMIT SORTED 10;
    REVSORTED = ORDER JOIN_FIELDS_ONLY by unique_pageviews ASC;
    BOTTOMK = LIMIT REVSORTED cnt;
    generate TOPK, BOTTOMK;
}
But it seems that when I'm applying the second LIMIT, Pig thinks that the cnt field is within REVSORTED, but it is actually a "parent" field.
Invalid field projection. Projected field [cnt] does not exist in schema: timestamp:long,....
I've tried referencing fields by number ($x), but that doesn't work either; Pig always assumes the referenced field is within the relation being LIMIT'd.

You need to use Pig's dereference operator (.), which allows you to reference a field of the enclosing relation. For your example:
METRICS_SOURCE_TOP3 = foreach METRICS_SOURCE_WITH_CNT {
    SORTED = ORDER JOIN_FIELDS_ONLY by unique_pageviews DESC;
    TOPK = LIMIT SORTED 10;
    REVSORTED = ORDER JOIN_FIELDS_ONLY by unique_pageviews ASC;
    BOTTOMK = LIMIT REVSORTED METRICS_SOURCE_WITH_CNT.cnt;
    generate TOPK, BOTTOMK;
}
Also, it's interesting to note that before Pig 0.10, LIMIT did not accept scalar expressions, so this kind of statement would have failed.
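For context, here is a minimal sketch of how a relation shaped like METRICS_SOURCE_WITH_CNT could be produced; the input relation metrics, its load path, and the grouping key are assumptions for illustration, not taken from the original question:
-- Hypothetical input: one row per sample (assumed field names)
metrics = LOAD 'metrics.csv' USING PigStorage(',') AS
    (hostname:chararray, site_guid:chararray, timestamp:long, unique_pageviews:long);
grouped = GROUP metrics BY (hostname, site_guid, timestamp);
-- Project the fields of interest into a bag and attach the per-group count,
-- which the nested LIMIT above can then consume as a scalar
METRICS_SOURCE_WITH_CNT = FOREACH grouped GENERATE
    group,
    metrics.(timestamp, unique_pageviews) AS JOIN_FIELDS_ONLY,
    COUNT(metrics) AS cnt;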

Related

Order of Apache Pig Transformations

I am reading through Programming Pig by Alan Gates.
Consider the code:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
(userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS
(movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);
nameLookup = FOREACH metadata GENERATE
movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;
nameLookupYear = FOREACH nameLookup GENERATE
movieID, movieTitle, GetYear(releaseYear) AS finalYear;
filterMovies = FILTER nameLookupYear BY finalYear < 1982;
groupedMovies = GROUP filterMovies BY finalYear;
orderedMovies = FOREACH groupedMovies {
    sortOrder = ORDER metadata by finalYear DESC;
    GENERATE GROUP, finalYear;
};
DUMP orderedMovies;
It states that
"Sorting by maps, tuples or bags produces error".
I want to know how I can sort the grouped results.
Do the transformations need to follow a certain sequence for them to work?
Since you are trying to sort the grouped results, you do not need a nested foreach. You would use the nested foreach if you were trying to, for example, sort each movie within the year by title or release date. Try ordering as usual (refer to finalYear as group since you grouped by finalYear in the previous line):
orderedMovies = ORDER groupedMovies BY group ASC;
DUMP orderedMovies;
If you are looking to sort the values within each group, then you will have to use a nested FOREACH. This will sort the movies within each year group by title; note that the relation referenced in the nested block must be the one that was grouped (filterMovies), not metadata:
orderedMovies = FOREACH groupedMovies {
    sortOrder = ORDER filterMovies BY movieTitle ASC;
    GENERATE group AS finalYear, sortOrder;
};
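If you want one output row per movie while preserving that order, you can FLATTEN the sorted bag. This is a sketch following the corrected block above (the alias orderedMoviesFlat is hypothetical):
orderedMoviesFlat = FOREACH groupedMovies {
    sortOrder = ORDER filterMovies BY movieTitle ASC;
    GENERATE group AS finalYear, FLATTEN(sortOrder);
};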

How to divide numbers from different tables in pig

I am trying to join two tables and divide a number from one table by a number from the other. I have attempted to do it on the original tables and by generating new tables with the same values, but I get the same error both times, which is extra confusing to me.
--get the data
lines = LOAD '/historicaldata.csv' USING PigStorage(' ') AS (ticker:chararray, date:long, open:long, high:long, low:long, close:long, volume:long);
--limit it between the dates we want
specDates = FILTER lines BY (date<=20000103 and date>=19900101);
--sort by ticker symbol
companies = GROUP specDates BY ticker;
--sort DESC and get the top to get the ending date
sorted_end = FOREACH companies {
    sorted1 = ORDER specDates BY date DESC;
    endDate = LIMIT sorted1 1;
    GENERATE endDate.ticker AS ticker, endDate.open AS open, endDate.close AS close;
}
--sort ASC and get the top to get the starting date
sorted_begin = FOREACH companies {
    sorted2 = ORDER specDates BY date ASC;
    startDate = LIMIT sorted2 1;
    GENERATE startDate.ticker AS ticker, startDate.open AS open, startDate.close AS close;
}
joined = JOIN sorted_end BY ticker, sorted_begin BY ticker;
final = FOREACH joined GENERATE sorted_end::ticker as ticker, sorted_begin::open as open, sorted_end::close as close;
final2 = FOREACH final GENERATE ticker as ticker, (float)(close/open) as growth_factor;
The error I keep getting is:
(Name: Divide Type: null Uid: null)incompatible types in Divide Operator left hand side:bag :tuple(close:float) right hand side:bag :tuple(open:float)
Both are floats so I am not sure why they are "incompatible types" other than that they come from different bags, but adding them to "final" and trying to do it from there doesn't work.
The data is in the form:
AA,20140131,11.60,11.80,11.45,11.48,33014100
AA,20140130,12.05,12.07,11.83,11.92,23223500
AA,20140129,11.64,12.23,11.58,11.96,44433000
Every entry includes all columns, and all values are well-formatted, non-zero numbers.
Based on your query, I created dummy tables on my system and generated the result. I found no issue, and the division operation completed successfully. Please find below some sample queries I ran in Pig:
A = LOAD '/home/training/716391/pig/pigdata.csv' USING PigStorage(',') as (ID:INT, name:CHARARRAY, GPC:FLOAT);
B = LOAD '/home/training/716391/pig/pigdata2.csv' USING PigStorage(',') as (ID:INT, name:CHARARRAY, GPC:FLOAT);
C = join A by ID, B by ID;
D = FOREACH C generate A::ID as IDA, A::name as NAMEA, A::GPC as GPCA, B::ID as IDB, B::name as NAMEB, B::GPC as GPCB;
E = FOREACH D GENERATE IDA, (FLOAT)(GPCA/GPCB) AS VALUE;
Can you please confirm that the divisor values in your case contain no nulls or zeros?
Could you please share the load statements for sorted_end and sorted_begin?
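If the divisor can be null or zero, a bincond guard avoids a bad division; this is a sketch reusing the field names from the example above:
E = FOREACH D GENERATE IDA,
    ((GPCB IS NULL OR GPCB == 0.0F) ? NULL : GPCA/GPCB) AS VALUE;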

hadoop cascading how to get top N tuples

New to Cascading; I'm trying to find a way to get the top N tuples based on a sort/order. For example, I'd like to know the top 100 first names people are using.
Here's how I would do something similar in Teradata SQL:
select top 100 first_name, num_records
from
(select first_name, count(1) as num_records
from table_1
group by first_name) a
order by num_records DESC
Here's the equivalent in Hadoop Pig:
a = load 'table_1' as (first_name:chararray, last_name:chararray);
b = foreach (group a by first_name) generate group as first_name, COUNT(a) as num_records;
c = order b by num_records DESC;
d = limit c 100;
It seems very easy to do in SQL or Pig, but I'm having a hard time finding a way to do it in Cascading. Please advise!
Assuming you just need the Pipe set up on how to do this:
In Cascading 2.1.6,
Pipe firstNamePipe = new GroupBy("topFirstNames", InPipe,
    new Fields("first_name"));
firstNamePipe = new Every(firstNamePipe, new Fields("first_name"),
    new Count(new Fields("num_records")), Fields.All);
firstNamePipe = new GroupBy(firstNamePipe,
    new Fields("first_name"),
    new Fields("num_records"),
    true); // true = descending order
firstNamePipe = new Every(firstNamePipe, new Fields("first_name", "num_records"),
    new First(Fields.Args, 100), Fields.All);
Here InPipe is formed from your incoming tap, which holds the tuple data referenced above, namely "first_name"; "num_records" is created when new Count() is called.
If you have the "num_records" and "first_name" data in separate taps (tables or files) then you can set up two pipes that point to those two Tap sources and join them using CoGroup.
The definitions I used are from Cascading 2.1.6:
GroupBy(String groupName, Pipe pipe, Fields groupFields, Fields sortFields, boolean reverseOrder)
Count(Fields fieldDeclaration)
First(Fields fieldDeclaration, int firstN)
Method 1
Use a GroupBy and group on the required columns. You can make use of the secondary sorting that Cascading provides; by default it sorts in ascending order, and you can get descending order via the reverseOrder flag.
To get the top N tuples or rows:
It's quite simple: keep a static count variable in a Filter and increment it by 1 for each tuple, checking whether it has exceeded N. Return true (remove the tuple) when the count is greater than N, and false otherwise; this leaves the first N tuples in the output.
Method 2
Cascading provides a built-in Unique assembly, which uses a FirstNBuffer internally; see the link below:
http://docs.cascading.org/cascading/2.2/javadoc/cascading/pipe/assembly/Unique.html

Hadoop Pig GROUP by id, get owner_id?

In Hadoop I have many records that look like this:
(item_id,owner_id,counter) - there could be duplicates, but a given item_id ALWAYS has the same owner_id!
I want to get the SUM of the counter for each item_id, so I have the following script:
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id);
data = FOREACH group_by_item GENERATE group AS item_id, OWNER_ID_COLUMN_SOMEHOW, SUM(known_items.counter) AS items_count;
The problem is that in the FOREACH, if I take known_items.owner_id, I get a bag containing the owner_id of every grouped record. What would be the most efficient way to get just the first one of the owners?
The simplest solution gives you the right answer if your assumption that each item_id has the same owner_id is correct, and will let you know if it is not: include the owner_id as part of the group.
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id, owner_id);
data = FOREACH group_by_item GENERATE FLATTEN(group), SUM(known_items.counter) AS items_count;
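Alternatively, if you would rather keep the group key as item_id alone, a nested FOREACH can take one owner from the bag. This is a sketch, not part of the original answer, and it relies on the stated guarantee that every record for an item_id carries the same owner_id:
data = FOREACH group_by_item {
    -- any single record will do, since the owner_id is the same for all of them
    one_owner = LIMIT known_items 1;
    GENERATE group AS item_id,
        FLATTEN(one_owner.owner_id) AS owner_id,
        SUM(known_items.counter) AS items_count;
};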

How can I select record with minimum value in pig latin

I have timestamped samples and I'm processing them using Pig. I want to find, for each day, the minimum value of the sample and the time of that minimum. So I need to select the record that contains the sample with the minimum value.
In the following, for simplicity, I'll represent time in two fields: the first is the day and the second the "time" within the day.
1,1,4.5
1,2,3.4
1,5,5.6
To find the minimum the following works:
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as samp;
But then I've lost the exact time at which the minimum happened. I hoped I could use nested expressions. I tried the following:
dailyminima = FOREACH g {
    minsample = MIN(samples.samp);
    mintuple = FILTER samples BY samp == minsample;
    GENERATE group as day, mintuple.time, mintuple.samp;
};
But with that I receive the error message:
2012-11-12 12:08:40,458 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000:
<line 5, column 29> Invalid field reference. Referenced field [samp] does not exist in schema: .
Details at logfile: /home/hadoop/pig_1352722092997.log
If I set minsample to a constant, it doesn't complain:
dailyminima = FOREACH g {
    minsample = 3.4F;
    mintuple = FILTER samples BY samp == minsample;
    GENERATE group as day, mintuple.time, mintuple.samp;
};
And indeed produces a sensible result:
(1,{(2)},{(3.4)})
While writing this I thought of using a separate JOIN:
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as minsamp;
dailyminima = JOIN samples BY (day, samp), dailyminima BY (day, minsamp);
That works, but it results (in the real case) in a join over two large data sets instead of a search through a single day's values, which doesn't seem healthy.
In the real case I actually want to find max and min and associated times. I hoped that the nested expression approach would allow me to do both at once.
Suggestions of ways to approach this would be appreciated.
Thanks to alexeipab for the link to another SO question.
One working solution (finding both min and max and the associated time) is:
dailyminima = FOREACH g {
    minsamples = ORDER samples BY samp;
    minsample = LIMIT minsamples 1;
    maxsamples = ORDER samples BY samp DESC;
    maxsample = LIMIT maxsamples 1;
    GENERATE group as day, FLATTEN(minsample), FLATTEN(maxsample);
};
Another way to do it, which has the advantage that it doesn't sort the entire relation, and only keeps the (potential) min in memory, is to use the PiggyBank ExtremalTupleByNthField. This UDF implements Accumulator and Algebraic and is pretty efficient.
Your code would look something like this:
REGISTER '/path/to/piggybank.jar'; -- hypothetical path: point this at your PiggyBank jar
DEFINE TupleByNthField org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField('3', 'min');
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
bagged = FOREACH g GENERATE TupleByNthField(samples);
flattened = FOREACH bagged GENERATE FLATTEN($0);
min_result = FOREACH flattened GENERATE $1 .. ;
Keep in mind that the choice of samp as the sort field is made in the DEFINE statement, by passing '3' (the field's 1-based index) as the first parameter.
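To get both the min and the max in one pass, as the original question asks, the UDF can be defined twice with different orderings. This is a sketch under the assumption that the second constructor argument accepts 'max' as well as 'min'; the aliases MinBySamp and MaxBySamp are hypothetical:
DEFINE MinBySamp org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField('3', 'min');
DEFINE MaxBySamp org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField('3', 'max');
minmax = FOREACH g GENERATE group AS day,
    FLATTEN(MinBySamp(samples)) AS (min_day, min_time, min_samp),
    FLATTEN(MaxBySamp(samples)) AS (max_day, max_time, max_samp);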
