Running a Pig script with a UDF on Hadoop

I'm new to Hadoop and Pig. How do I run a Pig script that internally calls a UDF method? The thing is, I don't see a "register blah.jar" statement like the one shown on the Pig UDF Manual site:
register myudfs.jar;
A = load 'student_data' as (name: chararray, age: int, gpa: float);
B = foreach A generate flatten(myudfs.Swap(name, age)), gpa;
C = foreach B generate $2;
D = limit B 20;
dump D;
But I do see a "jar" directory that contains "blah.jar". My coworker has already left, so I wonder what the trick was. Maybe I can add the jar file on the command line?
Thanks a lot!

If there is no REGISTER statement in the script (and the script is valid), then it does not call any UDFs other than Pig's built-in functions, which is probably why you don't see one in the script you have. If you would like to use a UDF, you will need a REGISTER statement.
Here is a good reference on writing UDFs. After you have written it, you will need to compile it into a jar file, being sure to also include any classes it depends on (such as EvalFunc). This is the jar you will REGISTER.
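For concreteness, here is a minimal sketch of such a UDF, modeled on the myudfs.Swap call in the script above (the package name and exact semantics are assumptions):
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Returns a new tuple with the first two input fields swapped,
// matching the myudfs.Swap(name, age) call in the script above.
public class Swap extends EvalFunc<Tuple> {
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2)
            return null;
        Tuple output = TupleFactory.getInstance().newTuple(2);
        output.set(0, input.get(1));
        output.set(1, input.get(0));
        return output;
    }
}
Compile it with pig.jar on the classpath and package it, e.g. javac -cp pig.jar myudfs/Swap.java followed by jar -cf myudfs.jar myudfs/; the resulting myudfs.jar is what the REGISTER statement points at.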

Related

How to handle repeating code in Pig (modularization)

I have Pig code which does this:
connect to db1 and do some processing, connect to db2 and do the same
union the outputs to produce the final output
Basically, how do I handle cases where the same code is needed in multiple places in a script (or scripts)?
You can define macros for the repeated operations and use them in your Pig scripts, like below:
DEFINE macroPerformUnion() RETURNS union_data {
    -- do your stuff here; for example (paths are placeholders):
    db1_data = LOAD 'db1_output' USING PigStorage(',');
    db2_data = LOAD 'db2_output' USING PigStorage(',');
    $union_data = UNION db1_data, db2_data;
};
(Note that inside the macro body the returned alias is referenced as $union_data.) Save the above in a file, for example macroPerformUnion.pig.
Now, to use the macro in your scripts, you need to import that Pig file:
IMPORT 'macroPerformUnion.pig';
and then you can call your macro with:
union_data_result = macroPerformUnion();

Pig: load files using a tuple's field

I need help with the following use case:
Initially we load some files and process those records (or, more technically, tuples). After this processing, we finally have tuples of the form:
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00001, some_field_3)
So basically, the tuples have file paths as field values. (We can obviously transform this into tuples with a single path-valued field, or into one tuple whose single field holds all the paths as a delimiter-separated, say comma-separated, string.)
Now I have to load these files in the Pig script, but I am not able to do so. Could you please suggest how to proceed? I thought of using the nested foreach operator and tried the following:
data = foreach tuples_with_file_info {
fileData = load $2 using PigStorage(',');
....
....
};
However, it's not working.
Edit:
For simplicity, let's assume I have a single tuple with one field holding the file name:
(hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000)
You can't use Pig out of the box to do it.
What I would do is use some other scripting language (bash, Python, Ruby...) to read the file from HDFS and concatenate the file names into a single string that you can then push as a parameter to a Pig script, for use in your LOAD statement. Pig supports globbing, so you can do the following:
a = LOAD '{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}' ...
So all that's left to do is read the file that contains those file names and concatenate them into a glob such as:
{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}
and pass that as a parameter to Pig so your script would start with:
a = LOAD '$input'
and your Pig call would look like this (note the quotes, which keep the shell from expanding the braces):
pig -f script.pig -param input='{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}'
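As a rough sketch of that glue step (the location of the file listing the paths is an assumption), in bash it could look like:
# Join the newline-separated paths into a comma-separated glob.
input=$(hadoop fs -cat /user/kailashgupta/file_list/part-* | paste -sd, -)
pig -f script.pig -param input="{$input}"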
First, store the tuples_with_file_info into some file:
STORE tuples_with_file_info INTO 'some_temporary_file';
then,
data = LOAD 'some_temporary_file' using MyCustomLoader();
where
MyCustomLoader is nothing but a Pig loader extending LoadFunc, which uses MyInputFormat as InputFormat.
MyInputFormat is an encapsulation over the actual InputFormat (e.g. TextInputFormat) which has to be used to read actual data from the files (e.g. in my case from file hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000).
In MyInputFormat, override the getSplits method: first read the actual file name(s) from some_temporary_file (you get this file's name from the Configuration's mapred.input.dir property), then update that same mapred.input.dir property with the retrieved file names, and finally return the result from the wrapped InputFormat (in my case TextInputFormat).
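A hedged sketch of that getSplits override (the class name follows the answer; the rest is an assumption about one way to wire it up), delegating to TextInputFormat once mapred.input.dir has been rewritten:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MyInputFormat extends TextInputFormat {
    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        Configuration conf = job.getConfiguration();
        // At this point mapred.input.dir points at some_temporary_file,
        // whose lines are the real input paths.
        Path listingDir = new Path(conf.get("mapred.input.dir"));
        FileSystem fs = listingDir.getFileSystem(conf);
        StringBuilder realPaths = new StringBuilder();
        for (FileStatus part : fs.listStatus(listingDir)) {
            if (!part.isFile()) continue;
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(part.getPath())));
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) continue;
                if (realPaths.length() > 0) realPaths.append(',');
                realPaths.append(line.trim());
            }
            in.close();
        }
        // Point the job at the real files and let TextInputFormat
        // compute the splits over them.
        conf.set("mapred.input.dir", realPaths.toString());
        return super.getSplits(job);
    }
}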
Notes: 1. You cannot use the setLocation API of the LoadFunc (or any similar API) to read the contents of some_temporary_file, as its contents are available only at run time.
2. You may wonder: what if the LOAD statement executes before the STORE? This cannot happen, because when a STORE and a LOAD use the same file in a script, Pig ensures the jobs are executed in the right sequence. For more detail, read the section on store-load sequences on the Pig wiki.

Hadoop Pig Latin Tuples: How to pass them to UDFs?

My goal is to pass every field in the input to a UDF as follows:
A = LOAD './input/file1' USING PigStorage(' ') AS (f1:chararray, f2:chararray);
B = FOREACH A GENERATE com.mycompany.udf.FAKEUDF(tuple(*));
NOTE: I am using Cloudera's version 0.12.0-cdh5.0.0.
The above FOREACH is just one of my many attempts. I have seen examples like
...FAKEUDF(*)
And so forth.
The main question is, what is the correct syntax? And has the syntax changed from earlier versions?
Here is a link which shows the lone asterisk syntax:
Chapter 10: Writing Evaluation & Filter Functions
It depends on how you are processing your requirement. The arguments can be the names of one or more columns, like FAKEUDF(column1, column2, ...); for all columns you can also specify *, as in FAKEUDF(*); or you can pass a relation name. In the UDF, you have to pull the column values out of the input tuple with tuple.get(index). Be careful about what you send as an argument, because the processing depends on it; an argument can even be a DataBag.
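As a hedged sketch (the class name comes from the question; the return type and body are assumptions), the UDF side of a FAKEUDF(f1, f2) call could look like:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class FAKEUDF extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2)
            return null;
        // Each argument of FAKEUDF(f1, f2) arrives as one field of the input tuple.
        String f1 = (String) input.get(0);
        String f2 = (String) input.get(1);
        return f1 + "|" + f2;
    }
}
With FAKEUDF(*), the same exec method receives all of the relation's fields in that one input tuple.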

Is there any conditional IF-like operator in Apache Pig?

Actually I am writing a Pig script and want to execute a set of statements if a condition is satisfied.
I have set one variable and am checking it against some value. Suppose:
if flag==0 then
A = LOAD 'file' using PigStorage() as (f1:int, ....);
B = ...;
C = ....;
else
again some Pig Latin statements
Can I do this in PIG Script? If yes, then how can I do this?
Thanks.
Yes, Pig does offer an if-then-else construction, but it is not used in the way you're asking.
Pig's if-then-else is an arithmetic operator invoked with the shorthand "condition ? true_value : false_value" as part of an expression, such as:
X = FOREACH A GENERATE f2, (f2==1?1:COUNT(B));
You have to have already loaded the table A to do this. To execute control flow around entire Pig statements you'll need something like Oozie, as suggested by Fakrudeen.
You can create a Python wrapper around your Pig script. See Embedded Pig in the docs.
Pig is a data-flow language, not a control-flow one.
The only construct which comes close is Pig's SPLIT, but it is very limited.
You can use Oozie and its decision construct with two Pig scripts.
Create a UDF (say, in Java) and then use it in your Pig script. You will need to REGISTER the jar file that you generate after writing the UDF.
For example, if your Java UDF class is UDFCondition and the generated jar file is PigUDFCondition.jar, then in your Pig code:
register PigUDFCondition.jar;
X = foreach A generate UDFCondition(..flag...);
There is a CASE statement available from Pig 0.12 onwards.
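As a sketch (relation and field names are assumed), the flag check from the question could then be written as an expression:
X = FOREACH A GENERATE (
    CASE flag
        WHEN 0 THEN 'zero'
        ELSE 'nonzero'
    END
);
Note that CASE, like the bincond above, is still an expression inside a statement, not statement-level control flow.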

Get XML values in Pig Latin

I am using Pig Latin on a large XML dump. I am trying to get the value of an XML node. The file is like:
<username>Shujaat</username>
I want to get the value Shujaat. I tried the piggybank XMLLoader, but it returns the XML tags along with their values. The code is:
register piggybank.jar;
A = load 'username.xml' using org.apache.pig.piggybank.storage.XMLLoader('username')
as (x: chararray);
B = foreach A generate x;
This code gives me the username tags as well as the values, but I only need the values. Any idea how to do that? I found out about regular expressions but don't know much about them.
Thanks
The example element you gave can be extracted with:
B = foreach A GENERATE REGEX_EXTRACT(x, '<username>(.*)</username>', 1)
AS name:chararray;
A nested element like this:
<user>
<id>456</id>
<username>Taylor</username>
</user>
can be extracted with something like this:
B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,
'<user>\\n\\s*<id>(.*)</id>\\n\\s*<username>(.*)</username>\\n\\s*</user>'))
as (id: int, name:chararray);
which yields:
(456,Taylor)
You will definitely need to define a more sophisticated regex that suits all of your needs, i.e. one that handles empty elements, attributes, etc. Another option is to write a custom UDF that uses Java libraries to parse the content of the XML, so that you can avoid writing (over)complicated, error-prone regular expressions.
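A minimal sketch of that UDF option, using the JDK's built-in DOM parser (the class name ExtractUsername is hypothetical):
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ExtractUsername extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Each XMLLoader record is one element chunk, e.g. <username>Shujaat</username>.
            Document doc = builder.parse(new ByteArrayInputStream(
                    ((String) input.get(0)).getBytes("UTF-8")));
            NodeList nodes = doc.getElementsByTagName("username");
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        } catch (Exception e) {
            return null; // malformed chunk: skip it rather than fail the job
        }
    }
}
REGISTER the jar containing it and call ExtractUsername(x) in the FOREACH instead of the regex.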
