Need to omit the data that matches in two tables in pig - hadoop

I am trying to solve the problem below; please suggest an approach.
I have two tables and want to remove from Table 1 only those records that are matched in Table 2.
NOTE: even when a key is common to both tables, if Table 2 has one record with that key, then only
one record of Table 1 with the same key should be removed.
INPUT:
Table 1:
1,Sam,5000
1,Sam,5000
1,Sam,5000
2,Boo,3000
Table 2:
1,Sam,5000
2,Boo,3000
OUTPUT:
1,Sam,5000
1,Sam,5000

You need to get the Set Difference between the two relations.
Source: the DataFu documentation. You will have to download the DataFu jar that provides these functions; it is distributed under the Apache License.
register datafu-pig-incubating-1.3.0.jar
define SetDifference datafu.pig.sets.SetDifference();
A = LOAD 'test1.txt' USING PigStorage(',') AS (a1:int,a2:chararray,a3:chararray);
B = LOAD 'test2.txt' USING PigStorage(',') AS (b1:int,b2:chararray,b3:chararray);
-- SetDifference expects two sorted bags, so gather each relation into a single bag first
A_ALL = GROUP A ALL;
B_ALL = GROUP B ALL;
BOTH = CROSS A_ALL, B_ALL;
diff = FOREACH BOTH {
    sorted_a = ORDER A BY a1;
    sorted_b = ORDER B BY b1;
    GENERATE FLATTEN(SetDifference(sorted_a, sorted_b));
}
DUMP diff;
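One caveat worth checking: the question asks for a bag (multiset) difference, where each Table 2 row cancels at most one matching Table 1 row, rather than a pure set difference that would drop all duplicates. The intended semantics can be sketched in Python (illustrative only, separate from the Pig solution; `bag_difference` is a made-up helper name):

```python
from collections import Counter

def bag_difference(table1, table2):
    """Multiset difference: each table2 row removes at most one matching table1 row."""
    to_remove = Counter(table2)
    result = []
    for row in table1:
        if to_remove[row] > 0:
            to_remove[row] -= 1   # cancel exactly one matching occurrence
        else:
            result.append(row)    # keep rows with no remaining match
    return result

table1 = [("1", "Sam", "5000"), ("1", "Sam", "5000"),
          ("1", "Sam", "5000"), ("2", "Boo", "3000")]
table2 = [("1", "Sam", "5000"), ("2", "Boo", "3000")]

print(bag_difference(table1, table2))
# two of the three (1,Sam,5000) rows survive, matching the expected OUTPUT
```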

Related

How to specify keys in Apache Spark while joining two datasets

I am loading two files as below -
f1 = sc.textFile("s3://testfolder1/file1")
f2 = sc.textFile("s3://testfolder2/file2")
This load operation gives me a list of tuples; one tuple is created for each row.
Schema for file1 and file2 is as below -
f1 (a,b,c,d,e,f,g,h,i)
f2 (x,y,z,a,b,c,f,r,u)
I want to join these two datasets based on fields a,b,c. I did some research and found the method below, which might be useful.
rdd.keyBy(func)
However, I can't find an easy way to specify the keys and join the two datasets.
Can anyone demonstrate how to do it without using DataFrames? Using SparkSQL is okay, but doing it without SparkSQL would be best.
This load operation gives me a list of tuples
No, it will give you an RDD[String].
You can take the string, and convert it to anything you want.
For your use case, you can convert each line to ((a,b,c),(d,e,f,g,h,i))
f1 = sc.textFile("s3://testfolder1/file1").map { line =>
  val a :: b :: c :: d :: e :: f :: g :: h :: i :: other = line.split(YOUR_DELIMITER).toList
  ((a, b, c), (d, e, f, g, h, i))
}
f2 = sc.textFile("s3://testfolder2/file2").map { line =>
  // per the file2 schema, the join keys a,b,c are the 4th-6th fields
  val x :: y :: z :: a :: b :: c :: f :: r :: u :: other = line.split(YOUR_DELIMITER).toList
  ((a, b, c), (x, y, z, f, r, u))
}
and then, f1.join(f2) should just work.
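For illustration, the same keyed-join semantics can be sketched in plain Python (the comma delimiter and sample values are assumptions; note that per the schemas above the join keys a,b,c sit at positions 1-3 in file1 but 4-6 in file2):

```python
def key_by_abc_f1(line, delim=","):
    # file1 schema: a,b,c,d,e,f,g,h,i -> ((a,b,c), rest)
    a, b, c, d, e, f, g, h, i = line.split(delim)[:9]
    return ((a, b, c), (d, e, f, g, h, i))

def key_by_abc_f2(line, delim=","):
    # file2 schema: x,y,z,a,b,c,f,r,u -> the keys a,b,c are fields 4-6
    x, y, z, a, b, c, f, r, u = line.split(delim)[:9]
    return ((a, b, c), (x, y, z, f, r, u))

def join(pairs1, pairs2):
    """Inner join on key, like RDD.join: emit (key, (v1, v2)) per matching pair."""
    index = {}
    for k, v in pairs2:
        index.setdefault(k, []).append(v)
    return [(k, (v1, v2)) for k, v1 in pairs1 for v2 in index.get(k, [])]

f1 = [key_by_abc_f1("a1,b1,c1,d,e,f,g,h,i")]
f2 = [key_by_abc_f2("x,y,z,a1,b1,c1,f,r,u")]
print(join(f1, f2))
# both lines share key (a1,b1,c1), so one joined record is produced
```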

Pig cross join and replace

I have two files. One file has the data below:
Ram,C,Bnglr
Shyam,A,Kolkata
The other file holds reference data:
C,Calicut
A,Ahmedabad
Now, using Pig, I want to search for the code in the original file and replace it with the value from the reference file, producing a new file:
Ram,Calicut,Bnglr
Shyam,Ahmedabad,Kolkata
Is it possible in Pig? I know how to do this in MapReduce but want to try it in Pig.
Yes. Join the two files on the code column, select the required columns, and write the result to a new file:
A = LOAD 'file1.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray,a3:chararray);
B = LOAD 'file2.txt' USING PigStorage(',') AS (b1:chararray,b2:chararray);
C = JOIN A BY a2, B BY b1;
D = FOREACH C GENERATE A::a1, B::b2, A::a3;
STORE D INTO 'file3.txt' USING PigStorage(',');
The logic above works, but any record in file1 whose code has no match in file2 will be dropped; use a LEFT OUTER JOIN (JOIN A BY a2 LEFT OUTER, B BY b1) if you need to keep such records.
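The lookup-and-replace logic can be sketched in Python; the dict.get fallback mirrors what a left outer join gives you for unmatched codes (illustrative only; the "Tom" row is a made-up extra record added to show the unmatched case):

```python
def replace_codes(records, reference):
    lookup = dict(reference)  # code -> replacement value
    out = []
    for name, code, city in records:
        # keep the original code when there is no match (left-outer behavior)
        out.append((name, lookup.get(code, code), city))
    return out

records = [("Ram", "C", "Bnglr"), ("Shyam", "A", "Kolkata"),
           ("Tom", "Z", "Pune")]          # "Tom" / code "Z" is hypothetical
reference = [("C", "Calicut"), ("A", "Ahmedabad")]
print(replace_codes(records, reference))
# code "Z" has no reference entry, so it passes through unchanged
```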

How to count on two columns of group by items in pig

I have generated two columns (origin and destination) out of 'n' columns. Now I want to generate a count for each combination of these two columns, but I cannot get the result; I am getting the error: ERROR 1070: Could not resolve Count using imports:
Below is my script,
mydata = load '/Projects/Flightdata/1987/Rawdata' using PigStorage(',') as (year:int, month:int, dom:int, dow:int, deptime:long, crsdeptime:long, arrtime:long, crsarrtime:long, uniqcarcode:chararray, flightnum:long, tailnum:chararray, actelaptime:long, crselaptime:long, airtime:long, arrdeltime:long, depdeltime:long, origcode:chararray, destcode:chararray, dist:long, taxintime:long, taxiouttime:long, flightcancl:int, canclcode:chararray, diverted:int, carrierdel:long, weatherdel:long, nasdel:long, securitydel:long, lateaircraftdel:long);
Step2 = foreach mydata generate origcode, destcode;
grpby = group Step2 by (origcode, destcode) ;
step3 = foreach grpby generate group.origcode as source, group.destcode as destination, Count(step2);
here I want to generate count for each combination of origin and destination.
Any guidance will be helpful.
Please see the Pig documentation on case sensitivity:
The names of Pig Latin functions are case sensitive, so the call must be COUNT(Step2), not Count(step2). (Alias names are case sensitive too, so the alias must be spelled Step2 exactly as defined.)
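Once the case issue is fixed, the script counts each (origin, destination) combination. The aggregation itself amounts to the following Python sketch (the airport codes are made-up sample data):

```python
from collections import Counter

# each row is one (origin, destination) pair, as produced by the FOREACH step
rows = [("JFK", "LAX"), ("JFK", "LAX"), ("SFO", "ORD")]

counts = Counter(rows)   # one count per distinct (origin, destination) group
for (orig, dest), n in sorted(counts.items()):
    print(orig, dest, n)
```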

Pig latin join by field

I have a Pig latin related problem:
I have this data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
DUMP B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
And I would like to get those two dataset joined. I would get corresponding value from dataset B and place that value beside the value from dataset A. So expected output is below:
FITKA 0.124411, FINVA 0.454535 and so on ..
(They can also be like: FITKA, 0.124411, FINVA, 0.454535 and so on .. )
And then I would be able to multiply the values (0.124411 x 0.454535 .. and so on) because they are on the same row, which is what I want.
Of course I can join column by column, but then the joined value lands at the end of the row and I have to clean it up with another FOREACH ... GENERATE. I want a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (Sentence in one way..).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple containing a bag of (word, count) tuples.
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})
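Whichever representation you pick, the computation the question describes boils down to looking each word of A up in B and multiplying the matched values. A Python sketch using the sample data above (math.prod requires Python 3.8+; words of A with no entry in B are simply skipped):

```python
from math import prod

sentence = ("FITKA", "FINVA", "FINVU", "FEEVA", "FETKA", "FINVA")
values = {"FINVA": 0.454535, "FITKA": 0.124411, "FEEVA": 0.123133}

# pair each word with its value where one exists (the "join")
pairs = [(w, values[w]) for w in sentence if w in values]
product = prod(v for _, v in pairs)
print(pairs)
print(product)
```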

Pig Latin using two data sources in one FILTER statement

In my Pig script, I am reading data from more than 5 data sources (Hive tables), where one is the main data source and the rest are dimension tables. I am trying to filter the main relation (or alias) by a value in one of the dimension relations.
E.g.
-- main_data is main data source and dept_data is department data
filtered_data1 = FILTER main_data BY deptID == dept_data.departmentID;
filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7;
In my Pig script there are at least 20 places where I need to match a value between multiple data sources and produce a new relation, but I am getting an error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias filtered_data1.
Backend error : Scalar has more than one row in the output. 1st : ( ..... ) 2nd : ( .... )
Details at logfile: /root/pig_1403263965493.log
I tried the "relation::field" approach too, to no avail. Alternatively, I can join the two relations (data sources) to get the filtered data, but I suspect this will slow down execution and shuffle unnecessarily large amounts of data.
Please guide me on how to use two or more data sources in one FILTER statement, something like in SQL, so that I can avoid JOIN statements and do it all in the FILTER itself:
Where A.deptID = B.departmentID And A.sectionID = C.sectionID And A.cityID = D.cityID
If you want to match records from different tables by a single ID, you pretty much have to use a join; the equivalent condition after joining is referenced as:
Where A::deptID = B::departmentID And A::sectionID = C::sectionID And A::cityID = D::cityID
If you just want to keep the records whose IDs occur in all the other tables, you could probably compute the intersection of the ID lists and then use
FILTER main_data BY someID IN someIDList
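For context, the ERROR 1066 arises because Pig only allows dept_data.departmentID as a scalar when dept_data has exactly one row. The membership-style filter suggested above amounts to a set lookup, sketched here in Python (the field layout is made up for illustration):

```python
main_data = [(1, 10), (2, 20), (3, 30)]   # hypothetical (recordID, deptID) rows
dept_data = [(10, "HR"), (30, "Eng")]     # hypothetical (departmentID, name) rows

# build the list of valid IDs once, then filter by membership
dept_ids = {d[0] for d in dept_data}
filtered = [rec for rec in main_data if rec[1] in dept_ids]
print(filtered)
# only records whose deptID appears in dept_data survive
```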
