How to specify keys in Apache Spark while joining two datasets - hadoop

I am loading two files as below -
f1 = sc.textFile("s3://testfolder1/file1")
f2 = sc.textFile("s3://testfolder2/file2")
This load operation gives me list of tuples. For each row one tuple is created.
Schema for file1 and file2 is as below -
f1 (a,b,c,d,e,f,g,h,i)
f2 (x,y,z,a,b,c,f,r,u)
I want to join these two datasets based on fields a,b,c. I did some research and found that there below method that might be useful.
rdd.keyBy(func)
However, I can't find a easy way to specify keys and join two datasets.
Can anyone demonstrate how to do it without using DataFrames ? Use of SparkSQL is okay but if it can be done without SparkSQL that would be best.

This load operation gives me list of tuples
No, it will give you an RDD[String].
You can take the string, and convert it to anything you want.
For your use case, you can convert each line to ((a,b,c),(d,e,f,g,h,i))
f1 = sc.textFile("s3://testfolder1/file1").map { line =>
val a::b::c::d::e::d::f::g::h::i::other = line.split(YOUR_DELIMITER).toList
((a,b,c),(d,e,f,g,h,i))
}
f2 = sc.textFile("s3://testfolder1/file1").map { line =>
val a::b::c::d::e::d::f::g::h::i::other = line.split(YOUR_DELIMITER).toList
((a,b,c),(d,e,f,g,h,i))
}
and then, f1.join(f2) should just work.

Related

Iterate on 2 Data Sources in PIG

I have 2 data sources
1) Params.txt which has the following content
item1
item2
item2
.
.
.
itemN
2) Data.txt which which has following content
he names (aliases) of relations A, B, and C are case sensitive.
The names (aliases) of fields f1, f2, and f3 are case sensitive.
Function names PigStorage and COUNT are case sensitive.
Keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERAT
and DUMP are case insensitive. They can also be written
The task is to see if each of N items of param file exist in each line of data file.
this is the pseudocode for the same
FOREACH d IN data:
FOREACH PARAM IN PARAMS:
IF PARAM IN d:
GENERATE PARAM,1
Is something of this sort possible in PIG scripting, if yes could you please point me in that direction.
Thanks
This is possible in Pig, but Pig is perhaps an unusual language to solve the problem!
I would approach the problem like this:
Load in Params.txt
Load in Data.txt and tokenise each line (assuming you're happy to split the text on spaces - you might need to think about what to do with punctuation)
Flatten the bag from tokenise to get one "word" per record in the relation.
Join the Params and Data relations. An inner join would give you words that are only in both.
Group the data and then count the occurrence of each word.
params = LOAD 'Params.txt' USING PigStorage() AS (param_word:chararray);
data = LOAD 'Data.txt' USING PigStorage() AS (line:chararray);
token_data = FOREACH data GENERATE TOKENIZE(line) AS words:{(word:chrarray)};
token_flat = FOREACH token_data GENERATE FLATTEN(words) AS (word);
joined = JOIN params BY param_word, token_flat BY word;
word_count = FOREACH (GROUP joined BY params::param_word) GENERATE
group AS param_word,
COUNT(joined) AS param_word_count;

how can I merge sparse tables in hadoop?

I have a number of csv files containing a single column of values:
File1:
ID|V1
1111|101
4444|101
File2:
ID|V2
2222|102
4444|102
File3:
ID|V3
3333|103
4444|103
I want to combine these to get:
ID|V1|V2|V3
1111|101||
2222||102|
3333|||103
4444|101|102|103
There are many (100 million) rows, and about 100 columns/tables.
I've been trying to use Pig, but I'm a beginner, and am struggling.
For two files, I can do:
s1 = load 'file1.psv' using PigStorage('|') as (ID,V1);
s2 = load 'file2.psv' using PigStorage('|') as (ID,V2);
cg = cogroup s1 by ID, s2 by ID
merged = foreach cg generate group, flatten((IsEmpty(s1) ? null : s1.V1)), flatten((IsEmpty(s2) ? null : s2.V2));
But I would like to do this with whatever files are present, up to 100 or so, and I don't think I can cogroup that many big files without running out of memory. So I'd rather get the column name from the header than just hard-coding it. In other words, this 2-file toy example doesn't scale.

Pig latin join by field

I have a Pig latin related problem:
I have this data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
And I would like to get those two dataset joined. I would get corresponding value from dataset B and place that value beside the value from dataset A. So expected output is below:
FITKA 0.123133, FINVA 0.454535 and so on ..
(They can also be like: FITKA, 0.123133, FINVA, 0.454535 and so on .. )
And then I would be able to multiply values (0.123133 x 0.454535 .. and so on) because they are on the same row now and this is what I want.
Of course I can join column by column but then values appear "end of row" and then I can clean it by using another foreach generate. But, I want some simpler solution without too many joins which may cause performance issues.
Dataset A is text (Sentence in one way..).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple and contains a bag of tuples (word, count).
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})

How to get count along with rest of the fields in Pig?

I have following data set.
f1,f2,f3,f4,f5,f6
I am looking for count of f6 along with rest of the fields.
f1,f2,f3,f4,f5,5
f1,f2,f3,f4,f5,3
and so on.
I tried this code but it takes too long to execute
A = LOAD 'file a'
B = GROUP A BY f6
C = FOREACH B GENERATE FLATTEN (group) as f6, FLATTEN(f1), FLATTEN(f2),FLATTEN(f3),FLATTEN(f4),FLATTEN(f5),COUNT(f6)
Is there any better way to achieve what I am looking for ?
If I simply try to get count without flatten then fields end up in bag but I want final output as tuple.
So trying this gives me output as bag
C = FOREACH B GENERATE FLATTEN (group) as f6, A.f1,A.f2.A.f3,A.f4,A.f5, COUNT(f6)
All inputs are appreciated.
Cheers
It is also possible to flatten the projection which was grouped.
A = LOAD 'file a';
B = GROUP A BY f6;
C = FOREACH B GENERATE FLATTEN(A), COUNT(A) as f6_count;
EDIT 1:
The key is using FLATTEN(A) instead of FLATTEN(group).
FLATTEN(A) will produce a tuple with all of the columns from the original relation and will get rid of the bag, even those that were not used in the group by statement (f1, f2, f3, f4, f5, f6).
FLATTEN(group) will return only columns used in the group by, in this case f6.
Advantage of this approach is that it is very efficient and requires a single Map Reduce job to execute. Any solution that involves JOIN operation adds extra Map Reduce job.
As a general rule of thumb in pig, hive and MR, group by and join operations usually are executed as separate MR Jobs and reducing the number of MR jobs leads to improved performance.

Pig Latin using two data sources in one FILTER statement

In my pig script, am reading data from more than 5 data sources (Hive tables), where one is the main source data and rest were kind of dimension data tables. I am trying to filter the main data source relation (or alias) w.r.t some value in one of the dimension relation.
E.g.
-- main_data is main data source and dept_data is department data
filtered_data1 = FILTER main_data BY deptID == dept_data.departmentID;
filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7;
In my pig script there are minimum 20 instances where I need to match for some value between multiple data sources and produce a new relation. But am getting some error as
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias filtered_data1.
Backend error : Scalar has more than one row in the output. 1st : ( ..... ) 2nd : ( .... )
Details at logfile: /root/pig_1403263965493.log
I tried to use "relation::field" approach also, no use. Alternatively, am joining these two relations (data sources) to get filtered data, but I feel, this will slow down the execution process and unnecessirity huge data will be dumped.
Please guide me how two use two or more data sources in one FILTER statement, something like in SQL, so that I can avoid using JOIN statements and get it done from FILTER statement itself.
Where A.deptID = B.departmentID And A.sectionID = C.sectionID And A.cityID = D.cityID
If you want to match records from different tables by a single ID, you would pretty much have to use a join, as such:
Where A::deptID = B::departmentID And A::sectionID = C::sectionID And A::cityID = D::cityID
If you just want to keep the records that occur in all other tables, you could probably go for an INTERSECT and then a
FILTER BY someID IN someIDList

Resources