How do I get matching values in Pig without using a UDF?

Consider these as my input files,
Input 1: (File 1)
12,23,14,15,9
1,2,3,4,5
34,17,8
.
.
Input 2: (File 2)
12 Twelve
23 TwentyThree
34 ThirtyFour
.
.
I will be reading each line from the "Input 1" file in my Pig script, and I would like to get the results below, based on the "Input 2" file.
Output:
Twelve,TwentyThree,Fourteen,Fifteen,Nine
One,Two,Three,Four,Five
.
.
Is it possible to achieve this without a UDF? Please let me know your suggestions.
Thanks in advance!

This violates your criterion of 'no UDF', but the UDF is built-in, so I suspect it will suffice.
Query:
data1 = LOAD 'file1' AS (val:chararray);
data2 = LOAD 'file2' AS (num:chararray, desc:chararray);
A = RANK data1; /* creates a row number per input line */
B = FOREACH A GENERATE rank_data1, FLATTEN(TOKENIZE(val, ',')) AS num;
C = RANK B; /* used to keep tuple elements sorted in the bag */
D = JOIN C BY num, data2 BY num; /* look up each number's description */
E = FOREACH D GENERATE C::rank_data1 AS rank_1:long
, C::rank_B AS rank_2:long
, data2::desc AS description;
grpd = GROUP E BY rank_1; /* regroup by original input line */
F = FOREACH grpd {
sorted = ORDER E BY rank_2; /* restore the original order within each line */
GENERATE sorted;
};
X = FOREACH F GENERATE FLATTEN(BagToTuple(sorted.description));
DUMP X;
Output:
(Twelve,TwentyThree,Fourteen,Fifteen,Nine)
(One,Two,Three,Four,Five)
(ThirtyFour,Seventeen,Eight)
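If the output needs to be written as comma-separated lines exactly as shown in the question (rather than the tuple form that DUMP prints), storing the final relation with PigStorage should give that formatting; 'output_dir' here is a hypothetical path:
STORE X INTO 'output_dir' USING PigStorage(',');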

Here is a Hive solution:
--Load the data into Hive
CREATE TABLE file1 (
line array<string>
)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY ',';
LOAD DATA INPATH '/tmp/test2/file1' OVERWRITE INTO TABLE file1;
CREATE TABLE file2 (
name string,
value string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
LOAD DATA INPATH '/tmp/test2/file2' OVERWRITE INTO TABLE file2;
--explode the rows from the first table and create a newid to use for correlation
CREATE TABLE file1_exploded
AS
WITH tmp
AS
(SELECT RAND() newid, line from file1)
SELECT newid, item FROM tmp
LATERAL VIEW EXPLODE (line) a AS item;
--apply substitutions using the second table, then join lines back together
--(note: COLLECT_LIST does not guarantee element order, so word order within a line may vary)
SELECT CONCAT_WS(',', COLLECT_LIST(value))
FROM
file1_exploded
JOIN file2 ON item = name
GROUP BY newid;

Related

Filter records in Pig

Below is the data
col1,col2,col3,col4,col5
------------------------
10,20,30,40,dollar
20,30,40,50,dollar
20,30,10,50,dollar
61,62,63,64,dollar
61,62,63,64,pound
col1,col2,col3 will form the combination of unique keys. The use case is to filter the data based on col5.
For the unique key combination we need to filter the record where col5 value is "dollar", only if the same combination has "pound" value.
The expected output is
col1,col2,col3,col4,col5
------------------------
10,20,30,40,dollar
20,30,40,50,dollar
20,30,10,50,dollar
61,62,63,64,pound
How do I proceed, since there are no special operators in Pig like there are in Hive?
A = load 'test1.csv' using PigStorage(',') as (col1:int,col2:int,col3:int,col4:int,col5:chararray);
B = FILTER A BY col5 == 'pound';
Get all the records with 'pound', then get all the records with 'dollar' whose key combination does not match any 'pound' record. Finally, marry them off with a UNION.
B = FILTER A BY col5 == 'pound'; -- all 'pound' records
C = JOIN A BY (col1,col2,col3) LEFT OUTER, B BY (col1,col2,col3);
D = FILTER C BY (B::col1 is null); -- keys that have no 'pound' record
E = FOREACH D GENERATE A::col1, A::col2, A::col3, A::col4, A::col5;
F = UNION B, E;
DUMP F;
This produces the expected output shown above.
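An alternative sketch that avoids the self-join, using GROUP and a nested FOREACH (untested, assuming the same schema for A as above): group on the key, keep the 'pound' rows of each group when any exist, otherwise keep the whole group.
G = GROUP A BY (col1, col2, col3);
H = FOREACH G {
    pounds = FILTER A BY col5 == 'pound';
    -- keep the 'pound' rows when present, otherwise all rows of the group
    GENERATE FLATTEN((IsEmpty(pounds) ? A : pounds));
};
DUMP H;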

Creating a massive FILTER BY in Pig

I have this code.
large = load 'a super large file';
CC = FILTER large BY $19 == 'abc' OR $20 == 'abc'
OR $19 == 'def' OR $20 == 'def' ....;
The number of OR conditions can go up to 100s or even thousands.
Is there a better way to do this ?
Yes, put those conditions in another file. Load it into a relation and join the two relations on the column. If you have to filter on multiple columns, then create as many filter files as there are conditions. Below is an example for 2 columns.
large = load 'a super large file';
filter1 = load 'file with values needed to compare with $19';
filter2 = load 'file with values needed to compare with $20';
f1 = JOIN large BY $19, filter1 BY $0;
f2 = JOIN large BY $20, filter2 BY $0;
final = UNION f1, f2;
DUMP final;
You can probably use 1 filter file with multiple columns and join on those to get different filtered results and then just union the relations.
large = load 'a super large file';
filter_file = load 'file with values in different columns';
f1 = JOIN large BY $19, filter_file BY $0;
f2 = JOIN large BY $20, filter_file BY $1;
final = UNION f1, f2;
DUMP final;

PIG: How to remove '::' in the column name

I have a pig relation like below:
FINAL = {input_md5::type: chararray, input_md5::name: chararray, input_md5::id: long, input_md5::age: chararray, test_1::type: chararray, test_2::name: chararray}
I am trying to store all the columns of the input_md5 relation into a Hive table,
i.e. all of input_md5::type, input_md5::name, input_md5::id, and input_md5::age, but not test_1::type or test_2::name.
Is there any command in Pig which selects only the columns of input_md5? Something like below:
STORE = FOREACH FINAL GENERATE all input_md5::type .
I know that Pig has the FOREACH FINAL GENERATE input_md5::type AS type syntax, but I have many columns, so I cannot use AS in my code.
Because when I try:
STORE= FOREACH FINAL GENERATE input_md5::type .. bus_input_md5::name;
Pig throws an error:
org.apache.hive.hcatalog.common.HCatException : 2007 : Invalid column position in partition schema : Expected column <type> at position 1, found column <input_md5::type>
Thanks in advance,
Resolved this issue; below is the fix:
Create a relation with some filter condition, as below:
DUMMY_RELATION = FILTER SOURCE_TABLE BY type == ''; (I took a column named type; this can be filtered by any column in the table. All that matters is that we get its schema.)
FINAL_DATASET = UNION DUMMY_RELATION, SCHEMA_1, SCHEMA_2;
(This new DUMMY_RELATION should be placed first in the UNION.)
Now you no longer have the :: operator, and your column names will match the Hive table's column names, provided your source table (the one behind DUMMY_RELATION) and the target table have the same column order.
Thanks to myself :)
I implemented Neethu's example this way. It may have typos, but it shows how to implement this idea.
tableA = LOAD 'default.tableA' USING org.apache.hive.hcatalog.pig.HCatLoader();
tableB = LOAD 'default.tableB' USING org.apache.hive.hcatalog.pig.HCatLoader();
--load empty table
finalTable = LOAD 'default.finalTable' USING org.apache.hive.hcatalog.pig.HCatLoader();
--example operations that end up with '::' in column names
g = group tableB by (id);
j = JOIN tableA by id LEFT, g by group;
result = foreach j generate tableA::id, tableA::col2, g::tableB;
--union empty finalTable and result
result2 = union finalTable, result;
--bob's your uncle
STORE result2 INTO 'finalTable' USING org.apache.hive.hcatalog.pig.HCatStorer();
Thanks to Neethu!

Pig Latin programming

I have a table loaded into a variable in Pig whose schema looks like this:
What I want to accomplish with a Pig Latin script is to populate the values "JKL", "PQR", and so on in col4, which is blank for the rest of the rows. The blank rows must copy only the value from the previous non-blank cell in col4. Check the example below.
The target table should look like this:
If your requirement is to update the Col4 value to XYZ for all the records that have null or empty values, then you can use the following code snippet:
--Load input data
input_data = LOAD 'input.txt' USING PigStorage() AS (Col1:chararray, Col2:int, Col3:int, Col4:chararray);
--Perform operation on each record
input_data = FOREACH input_data GENERATE Col1, Col2, Col3, ((Col4 is null or TRIM(Col4) == '') ? 'XYZ' : Col4) as Col4;
Assuming you are holding your data in input_data: for each record, check whether the Col4 value is null or empty; if it is, replace it with the desired value (XYZ), otherwise keep the existing value.
Is Col1 the same for all the rows? If yes, then use two sets of filters; otherwise you have to find the unique pairing between col1 and col4, remove the NULL values, and then use the steps below.
Filter_one will capture col1 & col4 where col4 is not NULL.
Filter_two will capture col1, col2, col3. Join Filter_one & Filter_two, where Filter_two's columns are printed as the 1st, 2nd, and 3rd columns and Filter_one's 2nd column is printed at the 4th position.
Hope this helps.
The Pig script will look like this:
Filter_one = foreach Load_Data generate $0 as col1, $3 as col4;
Filter_one_temp = filter Filter_one by ($1 is not null); -- $1 here is col4
Filter_two = foreach Load_Data generate $0 as col1, $1 as col2, $2 as col3;
Join_filter = JOIN Filter_two by $0 LEFT, Filter_one_temp by $0;
generate_output = foreach Join_filter generate $0 as col1, $1 as col2, $2 as col3, $4 as col4;
store generate_output into 'dfs_path' using PigStorage(',');
As I am storing with the ',' delimiter, the output will look like:
(ABC,34,23,XYZ)
(ABC,12,78,XYZ)
(ABC,4,21,XYZ)
(ABC,22,54,XYZ)
(DEF,32,455,JKL)
(DEF,21,45,JKL)
(DEF,45,687,JKL)
(DEF,232,565,JKL)
(DEF,23,32,JKL)

Pig multiply a number from a table to all the values of another table

I have two tables:
A: (feature:chararray, value:float)
B: (multiplier:chararray, value:float)
where A is a table with thousands of rows and B has only one row.
What I want to do is take all the rows in A and multiply A.value by B.value.
e.g.
A:[('f1', 1.5) , ('f2', 2.3)]
B:[('mul', 2)]
I'd like to produce a table C
C: [('f1', 3), ('f2', 4.6)]
Is there an easy way to do so?
You can do a CROSS and a FOREACH ... GENERATE.
X = CROSS A, B;
Y = FOREACH X GENERATE A::feature, A::value * B::value;
The above code has not been tested.
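For reference, a fuller, self-contained version of the same idea (also untested; the file paths and the comma delimiter are assumptions):
A = LOAD 'A.txt' USING PigStorage(',') AS (feature:chararray, value:float);
B = LOAD 'B.txt' USING PigStorage(',') AS (multiplier:chararray, value:float);
X = CROSS A, B; -- pairs every row of A with the single row of B
Y = FOREACH X GENERATE A::feature AS feature, A::value * B::value AS value;
DUMP Y;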
If you are very sure that the 2nd table has only one row, then take the first column of the 2nd table and hardcode the same value as the last column in the 1st table; then do an inner join, and you can easily multiply.
Let's say the first file is plain.txt:
(f1,1.5)
(f2,2)
Here is the second file, multi.txt:
(mul,2)
A = load '/user/cloudera/inputfiles/plain.txt' USING PigStorage(',') AS (feature:chararray, value:double);
B = load '/user/cloudera/inputfiles/multi.txt' USING PigStorage(',') AS (operation:chararray, no:int);
C = foreach A generate feature, value, 'mul' as ope; -- hardcode the join key onto every row of A
D = join C by ope, B by operation;
E = foreach D generate feature, (value * no) as multiplied_value;
DUMP E;
