Pig cross join and replace - hadoop

I have two files. One file has the data below:
Ram,C,Bnglr
Shyam,A,Kolkata
The other file has the reference data:
C,Calicut
A,Ahmedabad
Now, using Pig, I want to search and replace the codes in the original file, creating a new file from these two:
Ram,Calicut,Bnglr
Shyam,Ahmedabad,Kolkata
Is this possible in Pig? I know how to do it in MapReduce but want to try it out in Pig.

Yes. Join the files on the code column, select the required columns, and write them to a new file:
A = LOAD 'file1.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray,a3:chararray);
B = LOAD 'file2.txt' USING PigStorage(',') AS (b1:chararray,b2:chararray);
C = JOIN A BY a2, B BY b1;
D = FOREACH C GENERATE A::a1, B::b2, A::a3;
STORE D INTO 'file3.txt' USING PigStorage(','); -- note: STORE writes a directory, not a single file

The above logic works, but if a record in file1 has no matching code in file2, it will be dropped from the output. Use a left outer join if you need to keep such records, as sketched below.
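A minimal sketch of that variant, reusing the aliases from the answer above; falling back to the original code when there is no match is an assumption about the desired output.

-- a LEFT OUTER join keeps every file1 record, with a null b2 when file2 has no match
C = JOIN A BY a2 LEFT OUTER, B BY b1;
-- fall back to the original code when there is no replacement (assumption)
D = FOREACH C GENERATE A::a1, (B::b2 IS NULL ? A::a2 : B::b2), A::a3;
STORE D INTO 'file3.txt' USING PigStorage(',');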

Related

Iterate on 2 Data Sources in PIG

I have 2 data sources
1) Params.txt which has the following content
item1
item2
item3
.
.
.
itemN
2) Data.txt which has the following content
The names (aliases) of relations A, B, and C are case sensitive.
The names (aliases) of fields f1, f2, and f3 are case sensitive.
Function names PigStorage and COUNT are case sensitive.
Keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE,
and DUMP are case insensitive. They can also be written as load, using, as, group, by, etc.
The task is to check whether each of the N items in the params file exists in each line of the data file.
This is the pseudocode:
FOREACH d IN data:
    FOREACH param IN params:
        IF param IN d:
            GENERATE param, 1
Is something of this sort possible in Pig scripting? If yes, could you please point me in that direction.
Thanks
This is possible in Pig, but Pig is perhaps an unusual choice of language for the problem!
I would approach the problem like this:
1) Load in Params.txt.
2) Load in Data.txt and tokenise each line (assuming you're happy to split the text on spaces - you might need to think about what to do with punctuation).
3) Flatten the bag from TOKENIZE to get one "word" per record in the relation.
4) Join the params and data relations. An inner join keeps only the words that appear in both.
5) Group the joined data and count the occurrences of each word.
params = LOAD 'Params.txt' USING PigStorage() AS (param_word:chararray);
data = LOAD 'Data.txt' USING PigStorage() AS (line:chararray);
token_data = FOREACH data GENERATE TOKENIZE(line) AS words:{(word:chararray)};
token_flat = FOREACH token_data GENERATE FLATTEN(words) AS (word);
joined = JOIN params BY param_word, token_flat BY word;
word_count = FOREACH (GROUP joined BY params::param_word) GENERATE
    group AS param_word,
    COUNT(joined) AS param_word_count;
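To inspect the result, dump the final relation. Note that because of the inner join, a param that never occurs in Data.txt is simply absent from the output rather than appearing with a count of 0.

DUMP word_count;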

How can I merge sparse tables in Hadoop?

I have a number of csv (pipe-delimited) files, each containing an ID column and a single column of values:
File1:
ID|V1
1111|101
4444|101
File2:
ID|V2
2222|102
4444|102
File3:
ID|V3
3333|103
4444|103
I want to combine these to get:
ID|V1|V2|V3
1111|101||
2222||102|
3333|||103
4444|101|102|103
There are many (100 million) rows, and about 100 columns/tables.
I've been trying to use Pig, but I'm a beginner, and am struggling.
For two files, I can do:
s1 = load 'file1.psv' using PigStorage('|') as (ID,V1);
s2 = load 'file2.psv' using PigStorage('|') as (ID,V2);
cg = cogroup s1 by ID, s2 by ID;
merged = foreach cg generate group, flatten((IsEmpty(s1) ? null : s1.V1)), flatten((IsEmpty(s2) ? null : s2.V2));
But I would like to do this with whatever files are present, up to 100 or so, and I don't think I can cogroup that many big files without running out of memory. I'd also rather get the column name from the header than hard-code it. In other words, this 2-file toy example doesn't scale.
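One hedged sketch of an alternative (assuming the header rows have been stripped; the aliases are hypothetical): tag each value with its column name, UNION the tagged relations, and group by ID once. The result is one record per ID holding a bag of (column, value) pairs - a sparse representation rather than a wide row, but it avoids an N-way COGROUP.

s1 = LOAD 'file1.psv' USING PigStorage('|') AS (ID:chararray, V:chararray);
t1 = FOREACH s1 GENERATE ID, 'V1' AS col, V;
s2 = LOAD 'file2.psv' USING PigStorage('|') AS (ID:chararray, V:chararray);
t2 = FOREACH s2 GENERATE ID, 'V2' AS col, V;
-- extend the UNION with t3 ... tN for the remaining files
all_vals = UNION t1, t2;
-- one group per ID, however many files contributed values
by_id = GROUP all_vals BY ID;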

Need to omit the data that matches in two tables in pig

I am trying to solve the problem below; please suggest an approach.
I have two tables and want to remove from Table 1 only the records that are matched in Table 2.
NOTE: even if the same key appears several times in Table 1, one record in Table 2 should remove only one Table 1 record with that key.
INPUT:
Table 1:
1,Sam,5000
1,Sam,5000
1,Sam,5000
2,Boo,3000
Table 2:
1,Sam,5000
2,Boo,3000
OUTPUT:
1,Sam,5000
1,Sam,5000
You need to get the set difference between the two relations.
Source: the DataFu documentation. You will have to download the jar file that provides these functions; it is distributed under the Apache License.
register datafu-pig-incubating-1.3.0.jar;
define SetDifference datafu.pig.sets.SetDifference();
A = LOAD 'test1.txt' USING PigStorage(',') AS (a1:int,a2:chararray,a3:chararray);
B = LOAD 'test2.txt' USING PigStorage(',') AS (b1:int,b2:chararray,b3:chararray);
-- SetDifference operates on bags, so gather each relation into a single bag first
grouped = COGROUP A ALL, B ALL;
diff = FOREACH grouped {
    -- both bags must be sorted on the same key before calling SetDifference
    sorted_a = ORDER A BY a1;
    sorted_b = ORDER B BY b1;
    GENERATE FLATTEN(SetDifference(sorted_a, sorted_b));
}
DUMP diff;

How to specify keys in Apache Spark while joining two datasets

I am loading two files as below -
f1 = sc.textFile("s3://testfolder1/file1")
f2 = sc.textFile("s3://testfolder2/file2")
This load operation gives me a list of tuples, with one tuple created per row.
Schema for file1 and file2 is as below -
f1 (a,b,c,d,e,f,g,h,i)
f2 (x,y,z,a,b,c,f,r,u)
I want to join these two datasets on fields a, b, and c. I did some research and found the method below, which might be useful.
rdd.keyBy(func)
However, I can't find an easy way to specify the keys and join the two datasets.
Can anyone demonstrate how to do it without using DataFrames? Use of SparkSQL is okay, but if it can be done without SparkSQL that would be best.
This load operation gives me a list of tuples
No, it will give you an RDD[String].
You can take the string, and convert it to anything you want.
For your use case, you can convert each line to ((a,b,c),(d,e,f,g,h,i))
val f1 = sc.textFile("s3://testfolder1/file1").map { line =>
  // bind the nine fields of file1; "other" absorbs anything after them
  val a :: b :: c :: d :: e :: f :: g :: h :: i :: other = line.split(YOUR_DELIMITER).toList
  ((a, b, c), (d, e, f, g, h, i))
}
val f2 = sc.textFile("s3://testfolder2/file2").map { line =>
  // file2's schema is (x,y,z,a,b,c,f,r,u): the join fields a, b, c sit in positions 4-6
  val x :: y :: z :: a :: b :: c :: f :: r :: u :: other = line.split(YOUR_DELIMITER).toList
  ((a, b, c), (x, y, z, f, r, u))
}
and then, f1.join(f2) should just work.
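Alternatively, a minimal sketch of the rdd.keyBy route the question mentions (the comma delimiter is an assumption for illustration): keep each record as its split array and let keyBy derive the composite key.

// hypothetical delimiter for illustration; substitute your real one
val raw1 = sc.textFile("s3://testfolder1/file1").map(_.split(","))
val raw2 = sc.textFile("s3://testfolder2/file2").map(_.split(","))
// file1 carries (a,b,c) in positions 0-2, file2 in positions 3-5
val k1 = raw1.keyBy(r => (r(0), r(1), r(2)))
val k2 = raw2.keyBy(r => (r(3), r(4), r(5)))
val joined = k1.join(k2) // RDD[((a,b,c), (Array[String], Array[String]))]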

Pig how to assign name to columns?

I have a csv file which has hundreds of columns. When I load the file into Pig, I don't want to name each column, like
A = load 'path/to/file' as (a,b,c,d,e......)
Since I'll filter a lot of them out in the second step:
B = foreach A generate $0,$2,....;
But here, can I assign a name and type to each column of B? Something like
B = foreach A generate $0,$2,... AS (a:int,b:int,c:float)
I tried the above code but it doesn't work.
Thanks.
You have to specify a name after each column rather than listing them all at the end:
B = foreach A generate $0 as a, $2 as b, ...
Note that AS only assigns a name; each field keeps the type it already has. To change a type, add an explicit cast, as sketched below.
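A minimal sketch of that cast-and-rename pattern (the column positions and types are illustrative):

-- the schema of B is then (a:int, b:float)
B = FOREACH A GENERATE (int)$0 AS a, (float)$2 AS b;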
