How to filter record values of files in Hadoop MapReduce?

I am working on a MapReduce program. I have two files, and I want to delete from file1 the information that exists in file2. Every line has an ID as its key and some numbers (separated by commas) as its value.
file1:
1 1,2,10
2 2,7,8,5
3 3,9,12
and
file2:
1 1
2 2,5
3 3,9
I want output like this:
output:
1 2,10
2 7,8
3 12
I want to delete the values in file1 that have the same key in file2. One way to do this is to use the two files as input files and, in the map step, emit (ID, line). Then in the reduce step I would filter the values. But my files are very, very large, so I can't do it this way.
Alternatively, would it be efficient to use file1 as the input file and, in the map, open file2, seek to the matching line, and compare the values? But since I have a million keys and must open file2 for every key, I think the I/O would be excessive.
What can I do?

You can make both file1 and file2 inputs to your mapper. In the mapper you'd add the source (file1 or file2) to each record. Then use secondary sort to make sure records from file2 always come first. The combined input for your reducer would then look like this:
1 file2,1
1 file1,1,2,10
2 file2,2,5
2 file1,2,7,8,5
3 file2,3,9
3 file1,3,9,12
You can take the design of the reducer from here.
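For reference, here is a minimal sketch of what the tagging mapper and the filtering reducer could look like, assuming whitespace-separated "ID values" lines, Text keys and values, and input file names starting with "file1"/"file2". The class names are illustrative, and the secondary-sort plumbing (composite key, partitioner, grouping comparator) that guarantees the file2 record reaches the reducer first is omitted:
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Tags every record with the name of the file it came from, so the reducer
// can tell file1 lines and file2 lines apart.
class TaggingMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text line, Context context)
            throws IOException, InterruptedException {
        String source = ((FileSplit) context.getInputSplit()).getPath().getName();
        String[] parts = line.toString().split("\\s+", 2); // "ID<whitespace>v1,v2,..."
        if (parts.length == 2) {
            context.write(new Text(parts[0]), new Text(source + "," + parts[1]));
        }
    }
}

// Assumes secondary sort delivers the file2 record first: its numbers are
// collected into a set and then subtracted from the file1 numbers.
class FilteringReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Set<String> toRemove = new HashSet<String>();
        for (Text value : values) {
            String[] parts = value.toString().split(",");
            if (parts[0].startsWith("file2")) {
                for (int i = 1; i < parts.length; i++) {
                    toRemove.add(parts[i]);
                }
            } else {
                StringBuilder kept = new StringBuilder();
                for (int i = 1; i < parts.length; i++) {
                    if (!toRemove.contains(parts[i])) {
                        if (kept.length() > 0) {
                            kept.append(",");
                        }
                        kept.append(parts[i]);
                    }
                }
                context.write(key, new Text(kept.toString()));
            }
        }
    }
}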

Related

How to parse a CSV file into multiple CSVs based on row spacing

I'm trying to build an Airflow DAG and need to split out 7 tables contained in one CSV into seven separate CSVs.
dataset1
header_a
header_b
header_c
One
Two
Three
One
Two
Three
<-Always two spaced rows between data sets
dataset N <-part of csv file giving details on data
header_d
header_e
header_f
header_g
One
Two
Three
Four
One
Two
Three
Four
out:
dataset1.csv
datasetn.csv
Based on my research, I think my solution might lie in awk searching for the two blank rows?
EDIT: In plain text as requested.
table1 details1,
table1 details2,
table1 details3,
header_a,header_b,header_c,
1,2,3
1,2,3
tableN details1,
tableN details2,
tableN details3,
header_a, header_b,header_c,header_N,
1,2,3,4
1,2,3,4
Always two spaced rows between data sets
If your CSV file contains blank lines, and your goal is to write out each chunk of records that is separated by those blank lines into individual files, then you could use awk with its record separator RS set to nothing, which then defaults to treating each "paragraph" as a record. Each of them can then be redirected to a file whose name is based on the record number NR:
awk -vRS= '{print $0 > ("output_" NR ".csv")}' input.csv
This reads from input.csv and writes the chunks to output_1.csv, output_2.csv, output_3.csv and so forth.
If my interpretation of your input file's structure (or your problem in general) is wrong, please provide more detail to clarify.

How to compare 2 schema files so that I can add the columns in the other file and fill with some default value?

I have two files that contain tabular data. In file A I have 5 columns and in file B I have 3 columns. I want to find the positions of the columns in file A and add the new columns to file B. I also need to fill all the rows in file B with a default value that I will hard-code.
I tried to run a loop to get the length of the files and then fetch the data, but it's failing.
I'm using PuTTY and the vim editor. I need to write a script to compare and add.
Without having a concrete example of fileA and fileB it's difficult to figure out the right solution, but you could use diff and patch, something like this:
diff fileA fileB | patch -R fileB

advanced concatenation of lines based on the specific number of compared columns in csv

This question is based on a previously solved problem.
I have the following type of .csv files (they aren't all sorted, but the column structure is the same):
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1
name3,address3,town3,zip3,,,,,,category3_2
name3,address3,town3,zip3,,,,,,category3_3
name4,address4,town4,zip4,,,,,,category4_1
name4,address4,town4,zip4,email4,,,,,category4_2
name4,address4,town4,zip4,email4,,,,,category4_3
name4,address4,town4,zip4,,,,,,category4_4
name5,address5,town5,zip5,,,,,,category5_1
name5,address5,town5,zip5,,web5,,,,category5_2
name6,address6,town6,zip6,,,,,,category6
The first 4 columns are always populated; the other columns are not always, except the last one, category.
An empty field between "," delimiters means there is no data for that particular line or name.
If nameX does not have addressX but addressY, it is a different record (not the same line) and should not be concatenated.
I need a script in sed or awk (maybe bash, but that would be a little slower on bigger files [hundreds of MB+]) that takes the first 4 columns (in this case), compares them and, if they match, merges every category with the ";" delimiter, keeping the structure and as much data as possible in the other columns of those matched lines of the .csv file:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,email4,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,web5,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
If that is not possible, a solution could be to retain the data from the first line of the duplicated records (the one with categoryX_1). Example:
name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6
Does the .csv have to be sorted before using the script?
Thank you again!
sed -n 's/.*/²&³/;H
$ { g
:cat
s/\(²\([^,]*,\)\{4\}\)\(\([^,]*,\)\{5\}\)\([^³]*\)³\(.*\)\n\1\(\([^,]*,\)\{5\}\)\([^³]*\)³/\1~\3~ ~\7~\5;\9³\6/
t fields
b clean
:fields
s/~\([^,]*\),\([^~]*~\) ~\1,\([^~]*~\)/\1,~\2 ~\3/
t fields
s/~\([^,]*\),\([^~]*~\) ~\([^,]*,\)\([^~]*~\)/\1\3~\2 ~\4/
t fields
s/~~ ~~//g
b cat
:clean
s/.//;s/[²³]//g
p
}' YourFile
POSIX version (so use --posix with GNU sed), and without sorting your file beforehand.
Two recursive loops after loading the full file into the buffer, adding markers for easier manipulation and a lot of fun with sed group substitution (it just about reaches the maximum number of groups available):
a loop to add the categories (one line after the other, needed for the next loop on each field) per line, building a big temporary sub-field structure (2 groups of fields from the 2 concatenated lines; fields 5 to 9 form one group);
ungroup the sub-fields back into their original places;
finally, remove the markers and the first newline.
This assumes the characters ², ³ and ~ do not occur in the data, because they are used as markers (you can use other markers and adapt the script accordingly).
Note:
For performance on files of hundreds of MB, I guess awk will be a lot more efficient.
Sorting the data beforehand would certainly help performance by reducing the amount of data to manipulate after each category loop.
I found that this particular problem is processed faster through a DB...
SQL - GROUP BY to combine/concat a column
DB: MySQL through WAMP

using awk to grab random lines and append to a new column?

So I have a document "1", which is one column. I have 3 files with one column each, and I want to append a randomly selected line from each of those columns onto each line of document 1.
So like
awk 'NR==10' moves.txt 'NR==1' propp_tasks.txt
prints out
10.Qg3 Bb4+
First function of the donor
when I want it to be:
10 Qg3 Bb4+ First function of the donor
Is there a good way to do this with awk? I had been trying to set up a bash script with a for loop, but I didn't know how to cycle the indices so that on line n of document 1, columns 2, 3 and 4 would be appended. I feel like this should be really, really simple...
paste 1 <(cat 2 3 4 | sort -R)
If the length of the first file and the length of the combination of the other 3 files are different, then some more work is required.

Hadoop Pig output is not split into multiple files with the PARALLEL operator

Looks like I'm missing something. The number of reducers I set creates that many files in HDFS, but my data is not split into multiple files. What I noticed is that if I do a GROUP BY on a key that is in sequential order, it works fine; for example, the data below is split nicely into two files based on the key:
1 hello
2 bla
1 hi
2 works
2 end
But this data doesn't split:
1 hello
3 bla
1 hi
3 works
3 end
The code I used, which works fine for the first input but not for the second, is:
InputData = LOAD 'above_data.txt';
GroupReq = GROUP InputData BY $0 PARALLEL 2;
FinalOutput = FOREACH GroupReq GENERATE flatten(InputData);
STORE FinalOutput INTO 'output/GroupReq' USING PigStorage();
The above code creates two output part files. For the first input it splits the data nicely and puts key 1 in part-r-00000 and key 2 in part-r-00001. But for the second input it also creates two part files, yet all the data ends up in part-r-00000. What am I missing, and what can I do to force the data to be split into multiple output files based on the unique keys?
Note: for the second input, if I use PARALLEL 3 (3 reducers), it creates three part files and adds all the data for key 1 to the part-0 file and all the data for key 3 to the part-3 file. I found this behavior strange. BTW, I'm using Cloudera CDH3B4.
That's because the reducer that a key goes to is determined as hash(key) % reducerCount. If the key is an integer, hash(key) == key. When you have more data, the keys will be distributed more or less evenly across the reducers, so you shouldn't worry about it.
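As a quick illustration, here is a small sketch (assuming integer keys, and mirroring Hadoop's default HashPartitioner formula (hash & Integer.MAX_VALUE) % numReducers) of why keys 1 and 3 land in the same part file with two reducers, while keys 1 and 2 do not; the class name is made up for the demo:
// Hypothetical demo class; the formula below mirrors Hadoop's default
// HashPartitioner, and IntWritable.hashCode() is simply the integer value itself.
public class PartitionDemo {
    static int partition(int keyHash, int numReducers) {
        return (keyHash & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        System.out.println(partition(1, 2)); // 1 -> key 1 goes to the second reducer
        System.out.println(partition(3, 2)); // 1 -> key 3 goes to the same reducer
        System.out.println(partition(2, 2)); // 0 -> keys 1 and 2 end up in different part files
    }
}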
