Is there any conditional IF-like operator in Apache Pig?

Actually, I am writing a Pig script and want to execute a set of statements if one of the conditions is satisfied.
I have set a variable and am checking it against some value. Suppose:
if flag == 0 then
    A = LOAD 'file' USING PigStorage() AS (f1:int, ....);
    B = ...;
    C = ....;
else
    -- some other Pig Latin statements
Can I do this in a Pig script? If yes, how?
Thanks.

Yes, Pig does offer an if-then-else construct, but it is not used the way you're asking.
Pig's if-then-else is an arithmetic operator (the "bincond"), invoked with the shorthand "condition ? true_value : false_value" as part of an expression, such as:
X = FOREACH A GENERATE f2, (f2 == 1 ? 1 : COUNT(B));
You have to have already loaded the relation A to do this. To execute control flow around entire Pig statements you'll need something like Oozie, as suggested by Fakrudeen.

You can create a Python wrapper around your Pig script. See Embedded Pig in the docs.
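For reference, a minimal sketch of such a wrapper using Pig's embedding API, run with Jython via "pig wrapper.py"; the file names, schemas, and the way the flag is set are all hypothetical:
#!/usr/bin/python
# Run with: pig wrapper.py  (Pig executes this script with Jython)
from org.apache.pig.scripting import Pig

flag = 0  # hypothetical; in practice this might come from sys.argv or a previous job

if flag == 0:
    P = Pig.compile("""
        A = LOAD 'file' USING PigStorage() AS (f1:int);
        STORE A INTO 'out_a';
    """)
else:
    P = Pig.compile("""
        B = LOAD 'other_file' USING PigStorage() AS (f1:int);
        STORE B INTO 'out_b';
    """)

result = P.bind().runSingle()
if not result.isSuccessful():
    raise RuntimeError('Pig job failed')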

Pig is a data flow language, not a control flow language.
The only construct that comes close is Pig's SPLIT, but it is very limited.
You can use Oozie and its decision construct with two Pig scripts.
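For completeness, here is what SPLIT looks like; a minimal sketch in which the relation, file, and field names are hypothetical:
input_data = LOAD 'file' USING PigStorage() AS (flag:int, f2:chararray);
SPLIT input_data INTO when_zero IF flag == 0, when_nonzero IF flag != 0;
-- when_zero and when_nonzero can now feed two separate pipelines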

Create a UDF (say, in Java) and then embed it into your Pig script. You will need to REGISTER the jar file you generate after writing the UDF.
For example, if your Java UDF class is UDFCondition and the generated jar file is PigUDFCondition.jar, then in your Pig code:
REGISTER PigUDFCondition.jar;
X = FOREACH A GENERATE UDFCondition(..flag...);

A CASE statement is also available, from version 0.12 onwards.
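A minimal sketch of it, assuming a relation A with an int field flag:
X = FOREACH A GENERATE (
        CASE flag
            WHEN 0 THEN 'zero'
            WHEN 1 THEN 'one'
            ELSE 'other'
        END
    ) AS flag_label;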

Related

How to handle repeating code in pig ( modularization )

I have a Pig script which does this:
connect to db1 and do some processing, connect to db2 and do the same,
then union the outputs to produce the final output.
Basically, how do I handle cases where the same code is needed at multiple places in a script (or across scripts)?
You can define macros for the repeated operations and use these macros in your Pig scripts, like below:
DEFINE macroPerformUnion() RETURNS union_data {
    $union_data = -- do your stuff
};
Save the above in a file, say macroPerformUnion.pig.
To use the macro in your scripts, import that file:
IMPORT 'macroPerformUnion.pig';
Now you can call your macro with:
union_data_result = macroPerformUnion();
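For example, a parameterized version; a minimal sketch in which the paths, delimiter, and schema are hypothetical:
DEFINE macroPerformUnion(path1, path2) RETURNS union_data {
    a = LOAD '$path1' USING PigStorage(',');
    b = LOAD '$path2' USING PigStorage(',');
    $union_data = UNION a, b;
};

-- after IMPORT 'macroPerformUnion.pig':
union_data_result = macroPerformUnion('db1_extract', 'db2_extract');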

Pig load files using tuple's field

I need help for following use case:
Initially we load some files and process those records (or, more technically, tuples). After this processing, we finally have tuples of the form:
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00001, some_field_3)
So basically, the tuples have a file path as the value of a field (we could obviously transform this into tuples with a single file-path field, or into a single tuple whose one field holds a delimited, say comma-separated, string of paths).
Now I have to load these files in a Pig script, but I am not able to do so. Could you please suggest how to proceed? I thought of using the nested FOREACH operator and tried the following:
data = FOREACH tuples_with_file_info {
    fileData = LOAD $2 USING PigStorage(',');
    ....
    ....
};
However, it's not working.
Edit:
For simplicity, let's assume I have a single tuple with one field holding a file name:
(hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000)
You can't use Pig out of the box to do this.
What I would do is use some other scripting language (bash, Python, Ruby...) to read the file from HDFS and concatenate the file names it contains into a single string, which you can then pass as a parameter to a Pig script for use in its LOAD statement. Pig supports globbing, so you can do the following:
a = LOAD '{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}' ...
So all that's left to do is read the file that contains those file names and concatenate them into a glob such as:
{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}
and pass that as a parameter to Pig so your script would start with:
a = LOAD '$input'
and your Pig call would look like this (quote the glob so the shell doesn't expand the braces):
pig -f script.pig -param input='{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}'
First, store the tuples_with_file_info into some file:
STORE tuples_with_file_info INTO 'some_temporary_file';
then,
data = LOAD 'some_temporary_file' using MyCustomLoader();
where
MyCustomLoader is nothing but a Pig loader extending LoadFunc, which uses MyInputFormat as its InputFormat.
MyInputFormat is a wrapper over the actual InputFormat (e.g. TextInputFormat) that has to be used to read the actual data from the files (in my case, from hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000).
In MyInputFormat, override the getSplits method: first read the actual file name(s) from some_temporary_file (you get this file name from the Configuration's mapred.input.dir property), then update that same mapred.input.dir property with the retrieved file names, then return the result from the wrapped InputFormat (in my case, TextInputFormat).
Notes:
1. You cannot use the setLocation API from LoadFunc (or some other similar API) to read the contents of some_temporary_file, as its contents will be available only at run time.
2. You might wonder: what if the LOAD statement executes before the STORE? This will not happen, because if a STORE and a LOAD use the same file in a script, Pig ensures that the jobs are executed in the right sequence. For more detail, read the "Store-load sequences" section on the Pig wiki.

Hadoop Pig Latin Tuples: How to pass them to UDFs?

My goal is to pass every field in the input to a UDF as follows:
A = LOAD './input/file1' USING PigStorage(' ') AS (f1:chararray, f2:chararray);
B = FOREACH A GENERATE com.mycompany.udf.FAKEUDF(tuple(*));
NOTE: I am using Cloudera's version 0.12.0-cdh5.0.0.
The above FOREACH is just one of my many attempts. I have seen examples like
...FAKEUDF(*)
and so forth.
The main question is: what is the correct syntax? And has the syntax changed from earlier versions?
Here is a link which shows the lone asterisk syntax:
Chapter 10: Writing Evaluation & Filter Functions
It depends on your requirement. The argument can be the name of one or more columns, like FAKEUDF(column1, column2, ...); for all the columns you can pass *, as in FAKEUDF(*); or you can pass a relation name. In the UDF, you take the column values out of the tuple with tuple.get(index). Be careful about what you send as the argument, because that determines what the UDF receives; it can even be a DataBag.
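To illustrate the call styles in Pig Latin (a minimal sketch; FAKEUDF is the asker's UDF, and f1, f2 are the fields from their schema):
B1 = FOREACH A GENERATE com.mycompany.udf.FAKEUDF(f1, f2);       -- named columns
B2 = FOREACH A GENERATE com.mycompany.udf.FAKEUDF(*);            -- every column as a separate argument
B3 = FOREACH A GENERATE com.mycompany.udf.FAKEUDF(TOTUPLE(*));   -- every column wrapped in one tuple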

Apache Pig - Is it possible to serialize a variable?

Let's take the wordCount example:
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
bag_words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Is it possible to serialize the "bag_words" variable so that we don't have to rebuild the entire bag each time we want to execute the script?
Thanks.
STORE bag_words INTO 'some-output-directory';
Then read it back in later to skip the FOREACH ... GENERATE FLATTEN(TOKENIZE(...)) step.
You can output any alias in Pig using the STORE command. You could use standard formats (like CSV) or write your own PigLoader class to implement any specific behaviour. You can then LOAD this output in a separate script, thus bypassing the initial LOAD.
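Concretely, a minimal sketch (the output path is hypothetical):
-- first run: materialize the tokenized words
STORE bag_words INTO '/tmp/bag_words' USING PigStorage();

-- later runs: reload instead of re-tokenizing
bag_words = LOAD '/tmp/bag_words' USING PigStorage() AS (word:chararray);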

Running a Pig script with a UDF on Hadoop

I'm new to Hadoop and Pig. How do I run a Pig script that internally calls a UDF method? The thing is, I don't see the statement "register blah.jar" mentioned the way it is on the Pig UDF Manual site:
register myudfs.jar;
A = load 'student_data' as (name: chararray, age: int, gpa: float);
B = foreach A generate flatten(myudfs.Swap(name, age)), gpa;
C = foreach B generate $2;
D = limit B 20;
dump D;
But I do see a "jar" directory that contains "blah.jar". My coworker has already left, so I wonder what the trick was. Maybe I can add the jar file on the command line?
Thanks a lot!
If there is no REGISTER statement in the script (and the script is valid), then it does not call any UDFs except possibly Pig's builtin functions. If you want to use a UDF, you need a REGISTER statement; since REGISTER is unnecessary when no UDFs are called, that is probably why you don't see it in the script you have.
Here is a good reference on writing UDFs. After you have written it, you will need to compile it into a jar file, being sure to also include any classes it depends on (such as EvalFunc). This is the jar you will REGISTER.
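As for adding the jar on the command line: recent Pig versions (0.8 onwards) also accept jars via the pig.additional.jars property, which makes them available without a REGISTER statement in the script. A hypothetical invocation:
pig -Dpig.additional.jars=myudfs.jar script.pig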
