how can I merge sparse tables in hadoop? - hadoop

I have a number of csv files containing a single column of values:
File1:
ID|V1
1111|101
4444|101
File2:
ID|V2
2222|102
4444|102
File3:
ID|V3
3333|103
4444|103
I want to combine these to get:
ID|V1|V2|V3
1111|101||
2222||102|
3333|||103
4444|101|102|103
There are many (100 million) rows, and about 100 columns/tables.
I've been trying to use Pig, but I'm a beginner, and am struggling.
For two files, I can do:
s1 = load 'file1.psv' using PigStorage('|') as (ID,V1);
s2 = load 'file2.psv' using PigStorage('|') as (ID,V2);
cg = cogroup s1 by ID, s2 by ID
merged = foreach cg generate group, flatten((IsEmpty(s1) ? null : s1.V1)), flatten((IsEmpty(s2) ? null : s2.V2));
But I would like to do this with whatever files are present, up to 100 or so, and I don't think I can cogroup that many big files without running out of memory. So I'd rather get the column name from the header than just hard-coding it. In other words, this 2-file toy example doesn't scale.

Related

Iterate on 2 Data Sources in PIG

I have 2 data sources
1) Params.txt which has the following content
item1
item2
item2
.
.
.
itemN
2) Data.txt which which has following content
he names (aliases) of relations A, B, and C are case sensitive.
The names (aliases) of fields f1, f2, and f3 are case sensitive.
Function names PigStorage and COUNT are case sensitive.
Keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERAT
and DUMP are case insensitive. They can also be written
The task is to see if each of N items of param file exist in each line of data file.
this is the pseudocode for the same
FOREACH d IN data:
FOREACH PARAM IN PARAMS:
IF PARAM IN d:
GENERATE PARAM,1
Is something of this sort possible in PIG scripting, if yes could you please point me in that direction.
Thanks
This is possible in Pig, but Pig is perhaps an unusual language to solve the problem!
I would approach the problem like this:
Load in Params.txt
Load in Data.txt and tokenise each line (assuming you're happy to split the text on spaces - you might need to think about what to do with punctuation)
Flatten the bag from tokenise to get one "word" per record in the relation.
Join the Params and Data relations. An inner join would give you words that are only in both.
Group the data and then count the occurrence of each word.
params = LOAD 'Params.txt' USING PigStorage() AS (param_word:chararray);
data = LOAD 'Data.txt' USING PigStorage() AS (line:chararray);
token_data = FOREACH data GENERATE TOKENIZE(line) AS words:{(word:chrarray)};
token_flat = FOREACH token_data GENERATE FLATTEN(words) AS (word);
joined = JOIN params BY param_word, token_flat BY word;
word_count = FOREACH (GROUP joined BY params::param_word) GENERATE
group AS param_word,
COUNT(joined) AS param_word_count;

How to specify keys in Apache Spark while joining two datasets

I am loading two files as below -
f1 = sc.textFile("s3://testfolder1/file1")
f2 = sc.textFile("s3://testfolder2/file2")
This load operation gives me list of tuples. For each row one tuple is created.
Schema for file1 and file2 is as below -
f1 (a,b,c,d,e,f,g,h,i)
f2 (x,y,z,a,b,c,f,r,u)
I want to join these two datasets based on fields a,b,c. I did some research and found that there below method that might be useful.
rdd.keyBy(func)
However, I can't find a easy way to specify keys and join two datasets.
Can anyone demonstrate how to do it without using DataFrames ? Use of SparkSQL is okay but if it can be done without SparkSQL that would be best.
This load operation gives me list of tuples
No, it will give you an RDD[String].
You can take the string, and convert it to anything you want.
For your use case, you can convert each line to ((a,b,c),(d,e,f,g,h,i))
f1 = sc.textFile("s3://testfolder1/file1").map { line =>
val a::b::c::d::e::d::f::g::h::i::other = line.split(YOUR_DELIMITER).toList
((a,b,c),(d,e,f,g,h,i))
}
f2 = sc.textFile("s3://testfolder1/file1").map { line =>
val a::b::c::d::e::d::f::g::h::i::other = line.split(YOUR_DELIMITER).toList
((a,b,c),(d,e,f,g,h,i))
}
and then, f1.join(f2) should just work.

Pig cross join and replace

I have two files. One file having the data as below
Ram,C,Bnglr
Shyam,A,Kolkata
The another file is having a reference
C,Calicut
A,Ahmedabad
Now using pig, I want to search and replace the data in the original file to create a new file ,so that I can create a new file using these two files.
Ram,Class,Bnglr
Shyam,Ahmedabad,Kolkata
Is it possible in pig. I know how to do that in MR but want to try out in pig.
Yes.Join the files and select the required columns and write to the new file
A = LOAD 'file1.txt' AS (a1:chararray,a2:chararray,a3:chararray);
B = LOAD 'file2.txt' AS (b1:chararray,b2:chararray);
C = JOIN A BY a2, B BY b1;
D = FOREACH C GENERATE A::a1,B::b2,A::a3;
STORE D INTO 'file3.txt'
Above logic will work, but if you don't have matching records in second file in that case you will miss record from file1

Pig latin join by field

I have a Pig latin related problem:
I have this data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
And I would like to get those two dataset joined. I would get corresponding value from dataset B and place that value beside the value from dataset A. So expected output is below:
FITKA 0.123133, FINVA 0.454535 and so on ..
(They can also be like: FITKA, 0.123133, FINVA, 0.454535 and so on .. )
And then I would be able to multiply values (0.123133 x 0.454535 .. and so on) because they are on the same row now and this is what I want.
Of course I can join column by column but then values appear "end of row" and then I can clean it by using another foreach generate. But, I want some simpler solution without too many joins which may cause performance issues.
Dataset A is text (Sentence in one way..).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple and contains a bag of tuples (word, count).
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})

Hadoop, how to normalize multiple columns data?

I have a file .txt like this
1036177 19459.7356 17380.3761 18084.1440
1045709 19674.2457 17694.8674 18700.0120
1140443 19772.0645 17760.0904 19456.7521
where the first column represent the Key and the others are the values.
I would like to normalize (min-max) each column and after that sum up the columns.
Someone can give me some advice on how do that in MapReduce?
From an algorithmic perspective you'll need to:
Mapper
Parse / tokenize each input line by it's delimiter (space?)
Use a Text object to encapsulate the key field
Either create a custom value class to encapsulate the other fields or use an ArrayWritable wrapper
Output this Key / Value from your Mapper
Reducer
All values will be grouped by the same key, so here you'll just need to process each input value and calculate the min, max and sum for each column
Finally output your result
You might want to look at using Apache Pig which should make this task much easier (untested):
grunt> A = LOAD '/path/to/data.txt' USING PigStorage(' ')
AS (key, fld1:float, fld2:float, fld3:float);
grunt> GRP = GROUP A BY key;
grunt> B = FOREACH GRP GENERATE $0, MIN(fld1), MAX(fld1), SUM(fld1),
MIN(fld2), MAX(fld2), SUM(fld2),
MIN(fld3), MAX(fld3), SUM(fld3);
grunt> STORE B INTO '/path/to/output' USING PigStorage('\t', '-schema');

Resources