Store multiple files using the same Pig script - hadoop

The file has this data:
A 12345
B 32122
C 23232
Is there a way to run the Pig script only once and store the first record (A 12345) in one file, the second record (B 32122) in a second file, and the third (C 23232) in a third file? Right now, if we run the Pig script, it runs a job for each STORE. Please let me know the options.

Use the SPLIT operator to partition the contents of a relation into two or more relations based on some expression. Depending on the conditions stated in the expression:
A tuple may be assigned to more than one relation.
A tuple may not be assigned to any relation.
Example
In this example relation A is split into three relations, X, Y, and Z.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
DUMP Z;
(1,2,3)
(7,8,9)
Then STORE X, Y, and Z according to your filenames.
The aim is to read a file and write the records into different files based on criteria, so it should fit your problem.
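A sketch of the three stores (the output paths here are illustrative):
STORE X INTO 'output/x' USING PigStorage(',');
STORE Y INTO 'output/y' USING PigStorage(',');
STORE Z INTO 'output/z' USING PigStorage(',');
With Pig's multi-query execution, stores that share a plan like this are typically combined rather than each launching its own job.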

Actually, Pig is not made for this. But if you still want to do it, you will have to write a custom store function: a class that extends the StoreFunc class. Inside it you will have to use MultipleOutputs, since you want to store into 3 different files.
Refer to https://pig.apache.org/docs/r0.7.0/udf.html#Store+Functions for custom store functions.
Otherwise, in Pig, one STORE command stores only one alias, into only one output location.
For this kind of requirement it may be better to write a Java MapReduce job.
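A rough skeleton of such a store function, as a sketch only and not a working implementation (the class name and routing comments are illustrative; the real per-record file routing would live in a custom OutputFormat built around MultipleOutputs):
package com.mycompany.pig;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

public class MultiFileStorage extends StoreFunc {
    private RecordWriter writer;

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        // A real implementation would return a custom OutputFormat whose
        // RecordWriter uses MultipleOutputs to pick a file per record.
        return new TextOutputFormat();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    @Override
    public void prepareToWrite(RecordWriter writer) throws IOException {
        this.writer = writer;
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        // Here the first field (A, B or C in the question's data) would
        // decide which of the three files receives the record.
        try {
            writer.write(null, t.toString());
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}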

You can try the MultiStorage() option; it is available in the piggybank jar. You need to download pig-0.11.1.jar and set it in your classpath.
Example:
input.txt
A 12345
B 32122
C 23232
PigScript:
A = LOAD 'input.txt' USING PigStorage(' ') AS (f1,f2);
STORE A INTO 'output' USING org.apache.pig.piggybank.storage.MultiStorage('output', '0');
Now the output folder contains 3 directories (A, B, C), and the files inside them (A-0,000, B-0,000 and C-0,000) contain the actual values:
output$ ls
A B C _SUCCESS
output$ cat A/A-0,000
A 12345
output$ cat B/B-0,000
B 32122
output$ cat C/C-0,000
C 23232

Related

How to write a for loop to run a regression on multiple files contained in the same folder?

I need to develop a script to run a simple OLS on multiple csv files stored in the same folder.
All have the same column names and regression will always be based upon the same columns ("x_var" and "y_var").
The code below is used to read in the csvs and rename them.
## Read in files from folder
file.List <- list.files(pattern = "*.csv")
for (i in 1:length(file.List)) {
  assign(gsub(".csv", "", file.List[i]), read.csv(file.List[i]))
}
However, after this [very initial stage!] I've got a bit lost.
Each dataframe has 7 identical columns: a, b, c, d, x_var, e, y_var.
I need to run a simple OLS using lm(x_var ~ y_var, data = dataframe) and plot the result for each dataframe, and assumed a for loop would be the best option, but am not too sure how to do so.
After each regression is run I want to extract the coefficients/R2 etc. into a csv and save the plot separately.
I tried the below, but have gone very wrong [and it is not working at all]:
list <- list(gsub(" finalSIRTAnalysis.csv","", file.List))
for(i in length(file.List))
{
lm(x_var ~ y_var, data = [i])
}
I can't even make a start on this and need some advice, if anyone has any good ideas (such as creating an external function first).
I am not sure whether the function lm can compute the results using multiple variable sources. Try merging the data. I had a similar issue: I have 5k files and it is computationally impossible to merge them all. But maybe this answer can help you:
https://stackoverflow.com/a/63770065/14744492
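Beyond that link, here is a minimal sketch of the loop itself, assuming each csv really does contain the x_var and y_var columns described above (the output filenames are illustrative):
file.List <- list.files(pattern = "*.csv")
results <- data.frame(file = character(), intercept = numeric(),
                      slope = numeric(), r.squared = numeric(),
                      stringsAsFactors = FALSE)
for (f in file.List) {
  df  <- read.csv(f)
  fit <- lm(x_var ~ y_var, data = df)   # formula as written in the question
  s   <- summary(fit)
  results[nrow(results) + 1, ] <- list(f, coef(fit)[[1]], coef(fit)[[2]], s$r.squared)
  png(paste0(gsub(".csv", "", f), "_plot.png"))   # one plot file per csv
  plot(df$y_var, df$x_var, main = f)
  abline(fit)                                     # add the fitted line
  dev.off()
}
write.csv(results, "regression_results.csv", row.names = FALSE)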

SHELL Sorting output alphabetically

I have a script whose output is, for example: a c d txt iso e z. I need to sort it alphabetically. These are file extensions, so I can't join them into one word and then split it up.
Can anyone help me?
If your script is named foo and it writes to stdout a string such as a c d txt iso e z, you can get the sorted list by, for instance:
sorted_output=$(foo|xargs -n 1|sort)
Of course, depending on what you are going to do with the result, it might make more sense to store it into an array.
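For instance, a minimal sketch of the array variant, again assuming a script named foo:
sorted_array=($(foo | xargs -n 1 | sort))   # word-split the sorted output into an array
printf '%s\n' "${sorted_array[@]}"          # e.g. print one element per line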

PIG: how to separate data by positions in a single line

Normally, if there is a delimiter in a line, we do:
A = LOAD 'pigtest.txt' USING PigStorage(',') AS (year:int, temp:float);
Below is a sample weather record, a single line:
0029029070999991901010106004+64333+023450FM12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
I need to extract the year 1901 (16th position, 4 characters) and the temperature (89th position, 4 characters) so that I can define my key and value.
I also need to trim the leading zeroes from the temperature.
Thanks in advance
Yes, you can use the FixedWidthLoader UDF to extract specific positions from the input data. Download piggybank.jar and try the approach below.
input:
0029029070999991901010106004+64333+023450FM12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
PigScript:
REGISTER /tmp/piggybank.jar;
A = LOAD 'input' USING org.apache.pig.piggybank.storage.FixedWidthLoader('16-19,89-92') AS(year:int,temp:float);
DUMP A;
Output:
(1901,781.0)
Note that declaring temp as a float drops the leading zeroes when the extracted text is cast, which takes care of the trimming requirement.
Reference:
http://pig.apache.org/docs/r0.13.0/api/org/apache/pig/piggybank/storage/FixedWidthLoader.html

Hadoop Pig Latin Tuples: How to pass them to UDFs?

My goal is to pass every field in the input to a UDF as follows:
A = LOAD './input/file1' USING PigStorage(' ') AS (f1:chararray, f2:chararray);
B = FOREACH A GENERATE com.mycompany.udf.FAKEUDF(tuple(*));
NOTE: I am using Cloudera's version 0.12.0-cdh5.0.0.
The above FOREACH is just one of my many attempts. I have seen examples like
...FAKEUDF(*)
And so forth.
The main question is, what is the correct syntax? And has the syntax changed from earlier versions?
Here is a link which shows the lone asterisk syntax:
Chapter 10: Writing Evaluation & Filter Functions
It depends on how you are processing your requirement. The argument can be the name of one or more columns, like FAKEUDF(column1, column2, ...); for all columns you can also specify *, like FAKEUDF(*); or you can specify a relation name. In the UDF, you take the column values out of the tuple with tuple.get(index). You have to be careful about what you send as the argument, because the processing happens based on that. It can even be a DataBag.
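As a minimal sketch (the package and class name are taken from the question; the body is illustrative), a UDF that takes the column values out of the tuple by index could look like this:
package com.mycompany.udf;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class FAKEUDF extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        // With FAKEUDF(*) or FAKEUDF(f1, f2), each field arrives as one
        // position in the input tuple.
        String f1 = (String) input.get(0);
        String f2 = (String) input.get(1);
        return f1 + ":" + f2;
    }
}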

Apache Pig - Is it possible to serialize a variable?

Let's take the wordCount example:
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
bag_words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Is it possible to serialize the "bag_words" variable so that we don't have to rebuild the entire bag each time we want to execute the script ?
Thanks.
STORE bag_words INTO 'some-output-directory';
Then read it back in later to skip the FOREACH ... GENERATE, FLATTEN, and TOKENIZE steps.
You can output any alias in Pig using the STORE command: you could use standard formats (like CSV) or write your own PigLoader class to implement any specific behaviour. You can then LOAD this output in a separate script, thus bypassing the initial LOAD.
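A minimal sketch of that two-script approach (the stored path is illustrative):
-- script 1: build the bag once and persist it
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
bag_words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
STORE bag_words INTO '/tmp/bag_words';

-- script 2: reload the stored words, bypassing the tokenizing work
bag_words = LOAD '/tmp/bag_words' AS (word:chararray);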
