How to rename the fields in a relation - hadoop

Consider the following code:
ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;
The relation ranked has two fields: the line number and the text. The text is called line and can be referred to by this alias, but the line number generated by RANK has none. As a consequence, the only way I can refer to it is as $0.
How can I give $0 a name, so that I can refer to it more easily once it has been joined to another data set and is no longer $0?

What you want to do is define a schema for your data. The easiest way to do so is to use the AS keyword, just as you are doing with LOAD.
You can define a schema with three operators: LOAD, STREAM and FOREACH.
Here, the easiest way to do so would be the following:
ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;
renamed_ranked = FOREACH ranked GENERATE $0 AS line_number, $1 AS line;
You can find more information in the associated documentation.
It is also good to know that this operation won't add an iteration to your script. As @ArnonRotem-Gal-Oz said:
Pig doesn't perform the action in a serial manner i.e. it doesn't do all the ranking and then does another iteration on all the records. The pig optimizer will do the rename when it assigns the rank. You can see a similar behaviour explained in the pig cookbook.

You can name the field by adding a projection with FOREACH (note that * includes the rank column a second time):
named_ranked = FOREACH ranked GENERATE $0 as r,*;

Related

Join 2 tables in hadoop

I am very new to Hadoop, so please bear with me. Any help would be appreciated.
I need to join 2 tables.
Table 1 has pagename, pagerank,
e.g. (the actual dataset is huge but follows the same pattern):
pageA,0.13
pageB,0.14
pageC,0.53
Table 2 is a simple wordcount-style table with word, pagename,
e.g. (again, the actual dataset is huge but follows the same pattern):
test,pageA:pageB
sample,pageC
json,pageC:pageA:pageD
Now if user searches for any word from second table, I should give him the results of pages based on their pagerank from table 1.
Output when searched for test,
test = pageB,pageA
My approach was to load the first table into the distributed cache, read the second table in the map method to get the list of pages for the word, and sort that list using the pagerank info from the first table. This works for the dataset I am working with, but I wanted to know if there is a better way. I would also like to know how this join can be done with Pig or Hive.
A simple approach using a pig script:
PAGERANK = LOAD 'hdfs/pagerank/dataset/location' USING PigStorage(',')
AS (page:chararray, rank:float);
WORDS_TO_PAGES = LOAD 'hdfs/words/dataset/location' USING PigStorage(',')
AS (word:chararray, pages:chararray);
PAGES_MATCHING = FOREACH (FILTER WORDS_TO_PAGES BY word == '$query_word') GENERATE FLATTEN(TOKENIZE(pages, ':'));
RESULTS = FOREACH (JOIN PAGERANK BY page, PAGES_MATCHING BY $0) GENERATE page, rank;
SORTED_RESULTS = ORDER RESULTS BY rank DESC;
DUMP SORTED_RESULTS;
The script needs one parameter, which is the query word:
pig -f pagerank_join.pig -param query_word=test
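For comparison, the map-side approach described in the question (rank table held in memory, page list sorted per word) can be sketched in Python as below. This is only an illustration of the lookup-and-sort logic, not Hadoop code; the function name rank_pages and the in-memory dictionaries are assumptions made for the example, and the sample values are the ones from the question.

```python
def rank_pages(word, words_to_pages, pagerank):
    """Return the pages listed for `word`, highest pagerank first."""
    pages = words_to_pages.get(word, "").split(":")
    # Keep only pages that have a rank, mirroring the inner JOIN in the Pig script.
    ranked = [page for page in pages if page in pagerank]
    return sorted(ranked, key=lambda page: pagerank[page], reverse=True)


# Sample data from the question (Table 1 and Table 2).
pagerank = {"pageA": 0.13, "pageB": 0.14, "pageC": 0.53}
words_to_pages = {
    "test": "pageA:pageB",
    "sample": "pageC",
    "json": "pageC:pageA:pageD",
}

print(rank_pages("test", words_to_pages, pagerank))
```

As with the JOIN in the Pig script, a page without a pagerank entry (pageD here) is dropped from the result.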

Intersection of Intervals in Apache Pig

In Hadoop I have a collection of datapoints, each including a "startTime" and "endTime" in milliseconds. I want to group on one field then identify each place in the bag where one datapoint overlaps another in the sense of start/end time. For example, here's some data:
0,A,0,1000
1,A,1500,2000
2,A,1900,3000
3,B,500,2000
4,B,3000,4000
5,B,3500,5000
6,B,7000,8000
which I load and group as follows:
inputdata = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
grouped = GROUP inputdata BY where;
The ideal result here would be
(1,2)
(4,5)
I have written some bad code to generate an individual tuple for each second with some rounding, then do a set intersection, but this seems hideously inefficient, and in fact it still doesn't quite work. Rather than debug a bad approach, I want to work on a good approach.
How can I reasonably efficiently get tuples like (id1,id2) for the overlapping datapoints?
I am thoroughly comfortable writing a Java UDF to do the work for me, but it seems as though Pig should be able to do this without needing to resort to a custom UDF.
This is not an efficient solution, and I recommend writing a UDF to do this.
Self-join the dataset to get a cross product of all the combinations. In Pig, it's difficult to join something with itself, so you just act as if you are loading two separate datasets. After the cross product, you end up with data like
1,A,1500,2000,1,A,1500,2000
1,A,1500,2000,2,A,1900,3000
.....
At this point, you need to satisfy four conditions:
the "where" fields match
the two ids from the self join don't match (so you don't get back the same ID intersecting with itself)
the start time of the second datapoint is greater than or equal to the start time of the first
the start time of the second datapoint is less than or equal to the end time of the first
This code should work; it might have a syntax error somewhere, as I couldn't test it, but it should help you write what you need.
inputdataone = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
inputdatatwo = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
crossProduct = CROSS inputdataone, inputdatatwo;
crossProduct =
FOREACH crossProduct
GENERATE inputdataone::id as id_one,
inputdatatwo::id as id_two,
((inputdataone::id != inputdatatwo::id AND inputdataone::where == inputdatatwo::where AND inputdatatwo::start >= inputdataone::start AND inputdatatwo::start <= inputdataone::end) ? 1 : 0) as intersect;
find_intersect = FILTER crossProduct BY intersect==1;
final =
FOREACH find_intersect
GENERATE id_one,
id_two;
Crossing large sets inflates the data.
A naive solution without crossing would be to partition the intervals and check for intersections within each interval.
I am working on a similar problem and will provide a code sample when I am done.
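One way to avoid the cross product is to sort each group by start time and keep only the intervals that are still open. The Python sketch below shows that idea as the kind of logic a UDF applied to each grouped bag might implement; it is not the code promised above, and the function name overlapping_pairs and the tuple layout are assumptions. Intervals that merely touch at an endpoint are counted as overlapping, matching the inclusive comparison in the CROSS-based answer.

```python
from collections import defaultdict


def overlapping_pairs(rows):
    """rows: iterable of (id, where, start, end); returns sorted (id1, id2) pairs."""
    groups = defaultdict(list)
    for row_id, where, start, end in rows:
        groups[where].append((start, end, row_id))

    pairs = []
    for intervals in groups.values():
        intervals.sort()  # process each group in order of start time
        active = []       # (end, id) of intervals that may still overlap later ones
        for start, end, row_id in intervals:
            # Drop intervals that ended before this one starts.
            active = [(e, i) for e, i in active if e >= start]
            # Everything still active overlaps the current interval.
            pairs.extend((i, row_id) for _, i in active)
            active.append((end, row_id))
    return sorted(pairs)


# Sample data from the question.
rows = [
    (0, "A", 0, 1000), (1, "A", 1500, 2000), (2, "A", 1900, 3000),
    (3, "B", 500, 2000), (4, "B", 3000, 4000), (5, "B", 3500, 5000),
    (6, "B", 7000, 8000),
]
print(overlapping_pairs(rows))
```

When overlaps are sparse, each interval is compared only against the few intervals still open at its start time, rather than against every other row in the group.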

Storing data with big and dynamic groupings / paths

I currently have the following pig script (column list truncated for brevity):
REGISTER /usr/lib/pig/piggybank.jar;
inputData = LOAD '/data/$date*.{bz2,bz,gz}' USING PigStorage('\\x7F')
AS (
SITE_ID_COL :int,-- = Item Site ID
META_ID_COL :int,-- = Top Level (meta) category ID
EXTRACT_DATE_COL :chararray,-- = Date for the data points
...
);
SPLIT inputData INTO site0 IF (SITE_ID_COL == 0), site3 IF (SITE_ID_COL == 3), site15 IF (SITE_ID_COL == 15);
STORE site0 INTO 'pigsplit1/0/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/0/','2', 'bz2', '\\x7F');
STORE site3 INTO 'pigsplit1/3/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/3/','2', 'bz2', '\\x7F');
STORE site15 INTO 'pigsplit1/15/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/15/','2', 'bz2', '\\x7F');
And it works great for what I wanted it to do, but there are actually at least 22 possible site IDs and I'm not certain there aren't more. I'd like to dynamically create the splits and store into paths based on that column. Is the easiest way to do this going to be a two-step use of the MultiStorage UDF, first splitting by the site ID and then loading all those results and splitting by the date? That seems inefficient. Can I somehow do it through GROUP BYs? It seems like I should be able to GROUP BY the site ID, then flatten each row and run MultiStorage on that, but I'm not sure how to concatenate the group into the path.
The MultiStorage UDF is not set up to divide inputs on two different fields, but that's essentially what you're doing -- the use of SPLIT is just to emulate MultiStorage with two parameters. In that case, I'd recommend the following:
REGISTER /usr/lib/pig/piggybank.jar;
inputData = LOAD '/data/$date*.{bz2,bz,gz}' USING PigStorage('\\x7F')
AS (
SITE_ID_COL :int,-- = Item Site ID
META_ID_COL :int,-- = Top Level (meta) category ID
EXTRACT_DATE_COL :chararray,-- = Date for the data points
...
);
dataWithKey = FOREACH inputData GENERATE CONCAT(CONCAT((chararray)SITE_ID_COL, '-'), EXTRACT_DATE_COL), *;
STORE dataWithKey INTO 'tmpDir' USING org.apache.pig.piggybank.storage.MultiStorage('tmpDir', '0', 'bz2', '\\x7F');
Then go over your output with a simple script to list all the files in your output directories, extract the site and date IDs, and move them to appropriate locations with whatever structure you like.
Not the most elegant workaround, but it could work all right for you. One thing to watch out for is the separator you choose in your key might not be allowed (it might only be alphanumeric). Also, you'll be stuck with that extra field in your output data.
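The path-rewriting step of that clean-up script could look like the following Python sketch. It only covers turning a combined key back into a site/date path; listing and moving the files is left out. The function name target_path, the 'site/date' layout, and the date format in the example are all assumptions, as is the idea that each output directory is named after the key value.

```python
def target_path(key, sep="-"):
    """Split a 'SITE-DATE' key into a 'SITE/DATE' relative path."""
    # Split on the first separator only, so a date that itself
    # contains the separator stays in one piece.
    site_id, _, extract_date = key.partition(sep)
    if not site_id or not extract_date:
        raise ValueError("key does not look like SITE%sDATE: %r" % (sep, key))
    return "%s/%s" % (site_id, extract_date)


# Hypothetical key; the real date format depends on EXTRACT_DATE_COL.
print(target_path("15-2013-04-01"))
```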
I've actually submitted a patch to the MultiStorage module to allow splitting on multiple tuple fields rather than only one, resulting in a dynamic output tree.
https://issues.apache.org/jira/browse/PIG-3258
It hasn't gotten much attention yet, but I'm using it in production with no issues.

Store some fields from PIG to Hbase

I am trying to extract some part of string and store it to hbase in columns.
Files Content :
msgType1 Person xyz has opened Internet:www.google.com from IP:192.123.123.123 for duration 00:15:00
msgType2 Person xyz denied for opening Internet:202.x.x.x from IP:192.123.123.123 reason:unautheticated
msgType1 Person xyz has opened Internet:202.x.x.x from IP:192.123.123.123 for duration 00:15:00
The pattern of the messages for each msgType is fixed. I am trying to store the person name, destination, source, duration, etc. in HBase.
I am trying to write a Pig script to do this task.
But I am stuck at the extraction part (extracting the IP or website name from the 'Internet:202.x.x.x' token inside the string).
I tried a regular expression, but it is not working for me. The regex always throws this error:
ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.REGEX_EXTRACT as multiple or none of them fit. Please use an explicit cast.
Is there any other way to extract these values and store them in HBase, with Pig or with something other than Pig?
How are you using the REGEX_EXTRACT function? Have you seen the REGEX_EXTRACT_ALL function? According to the documentation (http://pig.apache.org/docs/r0.9.2/func.html#regex-extract-all), it should look like this:
test = LOAD 'test.csv' USING org.apache.pig.builtin.PigStorage(',') AS (key:chararray, value:chararray);
test = FOREACH test GENERATE FLATTEN(REGEX_EXTRACT_ALL (value, '(\\S+):(\\S+)')) as (match1:chararray, match2:chararray);
DUMP test;
My input file looks like this:
1,a:b
2,c:d
3,
I know it's easy to be lazy and not take the step, but you really should use a user-defined function here. Pig is good as a data flow language and not much else, so in order to get the full power out of it, you are going to need to use a lot of UDFs to go through text and do more complicated operations.
The UDF will take a single string as a parameter, then return a tuple that represents (person, destination, source, duration). To use it, you'll do:
A = LOAD ...
...
B = FOREACH A GENERATE MyParseUDF(logline);
...
STORE B INTO ...
You didn't mention what your HBase row key was, but be sure that's the first element in the relation before storing it.
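The parsing logic such a UDF would wrap might look like the Python sketch below, shown here for the msgType1 pattern only. The function name parse_msg_type1 and the regular expression are assumptions based on the three sample lines in the question; in particular, it assumes the person name is a single token without spaces.

```python
import re

# Pattern inferred from the msgType1 sample lines in the question.
MSG_TYPE1 = re.compile(
    r"^msgType1 Person (?P<person>\S+) has opened Internet:(?P<destination>\S+) "
    r"from IP:(?P<source>\S+) for duration (?P<duration>\S+)$"
)


def parse_msg_type1(logline):
    """Return (person, destination, source, duration), or None if no match."""
    match = MSG_TYPE1.match(logline.strip())
    if match is None:
        return None
    return (
        match.group("person"),
        match.group("destination"),
        match.group("source"),
        match.group("duration"),
    )


line = ("msgType1 Person xyz has opened Internet:www.google.com "
        "from IP:192.123.123.123 for duration 00:15:00")
print(parse_msg_type1(line))
```

A msgType2 line would need its own pattern, since it carries a reason instead of a duration.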

Pass a relation to a PIG UDF when using FOREACH on another relation?

We are using Pig 0.6 to process some data. One of the columns of our data is a space-separated list of ids (such as: 35 521 225). We are trying to map one of those ids to another file that contains 2 columns of mappings like the following (column 1 is our data, column 2 is a third party's data):
35 6009
521 21599
225 51991
12 6129
We wrote a UDF that takes in the column value (so: "35 521 225") and the mappings from the file. We would then split the column value and iterate over each and return the first mapped value from the passed in mappings (thinking that is how it would logically work).
We are loading the data in PIG like this:
data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);
mappings = LOAD 'mappings.txt' USING PigStorage() AS (ourId:chararray, theirId:chararray);
Then our generate is:
output = FOREACH data GENERATE title, com.example.ourudf.Mapper(category, mappings);
However the error we get is:
'there is an error during parsing: Invalid alias mappings in [data::title: chararray,data::category, chararray]'
It seems that Pig is trying to find a column called "mappings" in our original data, which of course isn't there. Is there any way to pass a loaded relation into a UDF?
Is there any way the "Map" type in PIG will help us here? Or do we need to somehow join the values?
EDIT: To be more specific - we don't want to map ALL of the category ids to the 3rd party ids. We just wanted to map the first. The UDF will iterate over the list of our category ids - and will return when it finds the first mapped value. So if the input looked like:
someProduct\t35 521 225
the output would be:
someProduct\t6009
I don't think you can do it this way in Pig.
A solution similar to what you wanted to do would be to load the mapping file in the UDF and then process each record in a FOREACH. An example is available in PiggyBank's LookupInFiles. It is recommended to use the DistributedCache instead of copying the file directly from the DFS.
DEFINE MAP_PRODUCT com.example.ourudf.Mapper('hdfs://../mappings.txt');
data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);
output = FOREACH data GENERATE title, MAP_PRODUCT(category);
This will work if your mapping file is not too big. If it does not fit in memory, you will have to either partition the mapping file and run the script several times, or tweak the mapping file's schema by adding a line number and use a native join with a nested FOREACH ORDER BY/LIMIT 1 for each product.
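The lookup such a UDF performs once the mapping file is in memory can be sketched in Python as follows. This illustrates only the logic, not a Pig 0.6 UDF (which would be written in Java); the function names load_mappings and first_mapped are assumptions, and the sample values come from the question.

```python
def load_mappings(lines):
    """Build an {ourId: theirId} dict from whitespace-separated lines."""
    mappings = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            mappings[parts[0]] = parts[1]
    return mappings


def first_mapped(category, mappings):
    """Return the mapped value of the first id in `category` that has one."""
    for our_id in category.split():
        if our_id in mappings:
            return mappings[our_id]
    return None  # no id in the list has a mapping


# Sample mapping file contents from the question.
mappings = load_mappings(["35 6009", "521 21599", "225 51991", "12 6129"])
print(first_mapped("35 521 225", mappings))
```

Loading the dictionary once per task and reusing it for every record is what keeps this approach cheap, which is also why it stops working when the mapping file no longer fits in memory.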
