How to filter record values of files in Hadoop MapReduce?
I am working on a MapReduce program. I have two files, and I want to delete the information in file1 that also exists in file2. Every line has an ID as its key and some numbers (separated by commas) as its value.
file1:
1 1,2,10
2 2,7,8,5
3 3,9,12
and
file2:
1 1
2 2,5
3 3,9
I want output like this:
output:
1 2,10
2 7,8
3 12
I want to delete the values of file1 that have the same key in file2. One way to do this is to have both files as input and, in the map step, emit (ID, line); the values are then filtered in the reduce step. But my files are very, very large, so I can't do it this way.
Alternatively, would it be efficient if file1 were the only input file and, in the map, I opened file2, sought to the matching line, and compared the values? But since I have a million keys and would have to open file2 for every key, I think the I/O would be excessive.
What can I do?
You can make both file1 and file2 inputs to your mapper. In the mapper, tag each record with its source (file1 or file2). Then use a secondary sort to make sure records from file2 always come first, so the combined input for your reducer would look like this:
1 file2,1
1 file1,1,2,10
2 file2,2,5
2 file1,2,7,8,5
3 file2,3,9
3 file1,3,9,12
In the reducer, you then collect the file2 values for each key first, and emit only the file1 values that are not among them.
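The reduce-side logic can be sketched outside Hadoop. This is a minimal Python simulation (not Hadoop API code), assuming the secondary sort has already placed each key's file2 record before its file1 record:

```python
def filter_reducer(key, values):
    """Reducer logic for one key: 'values' is the sorted value list,
    with the file2 record (tagged 'file2') guaranteed to come first."""
    to_remove = set()
    kept = []
    for value in values:
        source, _, numbers = value.partition(",")
        if source == "file2":
            # Remember every number seen in file2 for this key
            to_remove.update(numbers.split(","))
        else:
            # file1 record: keep only numbers not present in file2
            kept = [n for n in numbers.split(",") if n not in to_remove]
    return key, ",".join(kept)

# Simulated grouped input, matching the combined reducer input above
print(filter_reducer("1", ["file2,1", "file1,1,2,10"]))     # ('1', '2,10')
print(filter_reducer("2", ["file2,2,5", "file1,2,7,8,5"]))  # ('2', '7,8')
print(filter_reducer("3", ["file2,3,9", "file1,3,9,12"]))   # ('3', '12')
```

In a real job this body would go inside a `Reducer`, with the source tag added by the mapper and a grouping comparator that groups on the ID alone.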
Related
How to parse csv file into multiple csv based on row spacing
I'm trying to build an Airflow DAG and need to split out 7 tables contained in one CSV into seven separate CSVs.

dataset1
header_a header_b header_c
One Two Three
One Two Three
                                  <- always two spaced rows between data sets
dataset N                         <- part of csv file giving details on data
header_d header_e header_f header_g
One Two Three Four
One Two Three Four

out:
dataset1.csv
datasetn.csv

Based on my research I think my solution might lie in awk, searching for the double spaces?

EDIT: In plain text as requested:

table1 details1,
table1 details2,
table1 details3,
header_a,header_b,header_c,
1,2,3
1,2,3

tableN details1,
tableN details2,
tableN details3,
header_a,header_b,header_c,header_N,
1,2,3,4
1,2,3,4
"Always two spaced rows between data sets"

If your CSV file contains blank lines, and your goal is to write out each chunk of records separated by those blank lines into an individual file, then you can use awk with its record separator RS set to nothing, which makes it treat each "paragraph" as a record. Each record can then be redirected to a file whose name is based on the record number NR:

awk -vRS= '{print $0 > ("output_" NR ".csv")}' input.csv

This reads from input.csv and writes the chunks to output_1.csv, output_2.csv, output_3.csv, and so forth. If my interpretation of your input file's structure (or of your problem in general) is wrong, please provide more detail to clarify.
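For comparison, the same paragraph-splitting idea can be sketched in Python; the sample string and the output naming below are made up to mirror the awk example:

```python
def split_on_blank_lines(text):
    """Split CSV text into chunks wherever one or more blank lines occur,
    mimicking awk's paragraph mode (RS="")."""
    chunks = []
    for block in text.split("\n\n"):
        if block.strip():              # skip the empty pieces between blank lines
            chunks.append(block.strip())
    return chunks

# Two tiny "tables" separated by two blank lines, as in the question
sample = "h_a,h_b\n1,2\n\n\nh_c,h_d\n3,4\n"

for i, chunk in enumerate(split_on_blank_lines(sample), start=1):
    print(f"output_{i}.csv:\n{chunk}")
```

Writing each chunk to a real `output_N.csv` file is then a one-line `open(...).write(...)` per chunk.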
How to compare 2 schema files so that I can add the columns in the other file and fill with some default value?
I have two files containing tabular data. File A has 5 columns and file B has 3. I want to find the position of each column in file A and add the missing columns to file B, filling every row with a default value that I will hard-code. I tried running a loop to get the length of the files and then fetch the data, but it's failing. I'm working over PuTTY with the vim editor and need to write a script to compare the files and add the columns.
Without a concrete example of fileA and fileB it's difficult to figure out the right solution, but you could use diff and patch, something like this:

diff fileA fileB | patch -R fileB
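If diff/patch doesn't fit, here's one possible sketch in Python. Since no concrete files were given, the headers, rows, and default value below are all hypothetical:

```python
def align_columns(header_a, rows_b, header_b, default="NA"):
    """For each column in file A's header that is missing from file B,
    insert it into B's rows at A's position, filled with a default value."""
    aligned = []
    for row in rows_b:
        by_name = dict(zip(header_b, row))        # column name -> value for this row
        aligned.append([by_name.get(col, default) for col in header_a])
    return aligned

header_a = ["id", "name", "city", "zip", "email"]   # file A: 5 columns (hypothetical)
header_b = ["id", "name", "email"]                  # file B: 3 columns (hypothetical)
rows_b = [["1", "alice", "a@x.com"], ["2", "bob", "b@x.com"]]

for row in align_columns(header_a, rows_b, header_b):
    print(",".join(row))
# 1,alice,NA,NA,a@x.com
# 2,bob,NA,NA,b@x.com
```

This only works when both files carry header rows; if the columns are identified by position only, you would hard-code the positions instead of matching names.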
advanced concatenation of lines based on the specific number of compared columns in csv
This is a question based on the previously solved problem. I have the following type of .csv files (they aren't all sorted, but the structure of the columns is the same):

name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1
name3,address3,town3,zip3,,,,,,category3_2
name3,address3,town3,zip3,,,,,,category3_3
name4,address4,town4,zip4,,,,,,category4_1
name4,address4,town4,zip4,email4,,,,,category4_2
name4,address4,town4,zip4,email4,,,,,category4_3
name4,address4,town4,zip4,,,,,,category4_4
name5,address5,town5,zip5,,,,,,category5_1
name5,address5,town5,zip5,,web5,,,,category5_2
name6,address6,town6,zip6,,,,,,category6

The first 4 columns are always populated; the other columns are not always, except the last one, category. An empty space between "," delimiters means there is no data for that field. If nameX doesn't go with addressX but with addressY, it is a different record (not the same line) and should not be concatenated.

I need a script in sed or awk (maybe bash, but that solution is a little slower on bigger files [hundreds of MB+]) that will take the first 4 columns (in this case), compare them and, if they match, merge every category with the ";" delimiter, keeping the structure and as much data as possible in the other columns of the matched lines of the .csv file:

name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,email4,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,web5,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6

If that is not possible, a fallback solution would be to retain the data from the first line of the duplicated group (the one with categoryX_1). Example:

name1,address1,town1,zip1,email1,web1,,,,category1
name2,address2,town2,zip2,email2,,,,,category2
name3,address3,town3,zip3,email3,,,,,category3_1;category3_2;category3_3
name4,address4,town4,zip4,,,,,,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,,,,,,category5_1;category5_2
name6,address6,town6,zip6,,,,,,category6

Does the .csv have to be sorted before using the script? Thank you again!
sed -n 's/.*/²&³/;H
 $ {
   g
   :cat
      s/\(²\([^,]*,\)\{4\}\)\(\([^,]*,\)\{5\}\)\([^³]*\)³\(.*\)\n\1\(\([^,]*,\)\{5\}\)\([^³]*\)³/\1~\3~ ~\7~\5;\9³\6/
      t fields
      b clean
   :fields
      s/~\([^,]*\),\([^~]*~\) ~\1,\([^~]*~\)/\1,~\2 ~\3/
      t fields
      s/~\([^,]*\),\([^~]*~\) ~\([^,]*,\)\([^~]*~\)/\1\3~\2 ~\4/
      t fields
      s/~~ ~~//g
      b cat
   :clean
      s/.//;s/[²³]//g
      p
   }' YourFile

POSIX version (so use --posix with GNU sed), and it works without sorting your file first.

Two recursive loops run after loading the full file into the buffer, with markers added for easier manipulation and a lot of fun with sed group substitution (this probably reaches the maximum number of groups available): one loop adds the categories (one line after the other, as needed for the next loop on each field) per line, building a big temporarily structured sub-field (2 groups of fields from the 2 concatenated lines; fields 5 to 9 form 1 group); the other loop ungroups the sub-fields back into their original places. Finally, the markers and the first newline are removed.

This assumes there is no ², ³ or ~ character in the data, because they are used as markers (you can use other markers and adapt the script to them).

Note: for performance on a hundred-MB file, I guess awk will be a lot more efficient. Sorting the data first may certainly help performance by reducing the amount of data to manipulate after each category loop.
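As the note above suggests, a hash-based tool is likely much faster than sed here. A Python sketch of the same merge, grouping on the first 4 columns (sample rows taken from the question; no pre-sorting required):

```python
import csv

def merge_by_key(lines):
    """Group rows on the first 4 columns; join the last column (category)
    with ';' and keep the first non-empty value seen in each middle column."""
    merged = {}                       # key tuple -> merged row, insertion-ordered
    for row in csv.reader(lines):
        key = tuple(row[:4])
        if key not in merged:
            merged[key] = row
        else:
            kept = merged[key]
            # Fill middle columns from later duplicates if still empty
            for i in range(4, len(row) - 1):
                if not kept[i] and row[i]:
                    kept[i] = row[i]
            kept[-1] += ";" + row[-1]
    return [",".join(r) for r in merged.values()]

data = [
    "name3,address3,town3,zip3,email3,,,,,category3_1",
    "name3,address3,town3,zip3,,,,,,category3_2",
    "name6,address6,town6,zip6,,,,,,category6",
]
print("\n".join(merge_by_key(data)))
# name3,address3,town3,zip3,email3,,,,,category3_1;category3_2
# name6,address6,town6,zip6,,,,,,category6
```

Because the grouping uses a dictionary, the input order doesn't matter, which answers the sorting question: no pre-sort is needed with this approach.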
I found that this particular problem is processed faster through a database: SQL - GROUP BY to combine/concat a column. DB: MySQL through WAMP.
using awk to grab random lines and append to a new column?
So I have a document "1", which is one column. I have 3 files with one column each, and I want to append a randomly selected line from each of those files onto the corresponding line of document 1. So something like

awk 'NR==10' moves.txt 'NR==1' propp_tasks.txt

prints out

10.Qg3 Bb4+
First function of the donor

when I want it to be:

10 Qg3 Bb4+ First function of the donor

Is there a good way to do this with awk? I had been trying to set up a bash script with a for loop, but I didn't know how to cycle the indices so that on line n of document 1, columns 2, 3 and 4 would be appended. I feel like this should be really, really simple...
paste 1 <(cat 2 3 4 | sort -R)

If the length of the first file and the combined length of the other 3 files differ, then some more work is required.
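If you want one random line drawn from each file per output line (rather than shuffling their concatenation), a Python sketch could look like this; the file contents below are made-up stand-ins:

```python
import random

def append_random(base_lines, *other_files_lines):
    """For each line of the base document, append one randomly chosen
    line from each of the other files, space-separated."""
    out = []
    for line in base_lines:
        picks = [random.choice(src) for src in other_files_lines]
        out.append(" ".join([line] + picks))
    return out

# Hypothetical stand-ins for document 1 and the three one-column files
doc1 = ["10 Qg3 Bb4+"]
moves = ["e4", "d4"]
tasks = ["First function of the donor"]

random.seed(0)   # fixed seed so the example is repeatable
print(append_random(doc1, moves, tasks)[0])
```

Reading the real files is then just `open(path).read().splitlines()` per file.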
Hadoop PIG output is not split into multiple files with the PARALLEL operator
Looks like I'm missing something. The number of reducers on my job creates that many files in HDFS, but my data is not split across them. What I noticed is that if I GROUP BY a key that is in sequential order it works fine; for example, the data below is split nicely into two files based on the key:

1 hello
2 bla
1 hi
2 works
2 end

But this data doesn't split:

1 hello
3 bla
1 hi
3 works
3 end

The code I used, which works fine for the first input and not for the second, is:

InputData = LOAD 'above_data.txt';
GroupReq = GROUP InputData BY $0 PARALLEL 2;
FinalOutput = FOREACH GroupReq GENERATE flatten(InputData);
STORE FinalOutput INTO 'output/GroupReq' USING PigStorage ();

The above code creates two output part files. For the first input it splits the data nicely, putting key 1 in part-r-00000 and key 2 in part-r-00001; but for the second input it creates two part files and all the data ends up in part-r-00000. What am I missing, and what can I do to force the data to split into multiple output files based on the unique keys?

Note: for the second input, if I use PARALLEL 3 (3 reducers), it creates three part files, with all the data for key 1 in part-0 and all the data for key 3 in part-3. I found this behavior strange. BTW, I'm using Cloudera CDH3B4.
That's because the reducer a key goes to is determined as hash(key) % reducersCount. If the key is an integer, hash(key) == key, so with 2 reducers the keys 1 and 3 both map to the same reducer (1 % 2 == 3 % 2 == 1), while 1 and 2 map to different ones. When you have more keys, they will be distributed more or less evenly across the reducers, so you shouldn't worry about it.
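The partitioning arithmetic can be shown with a quick sketch (plain Python standing in for Hadoop's HashPartitioner, which computes essentially the same hash-modulo; this is an illustration, not Hadoop code):

```python
def reducer_for(key, num_reducers):
    """Hash partitioning as described above: hash(key) % num_reducers.
    For small non-negative ints, Python's hash(key) == key."""
    return hash(key) % num_reducers

# Keys 1 and 2 with 2 reducers land on different reducers:
print(reducer_for(1, 2), reducer_for(2, 2))   # 1 0
# Keys 1 and 3 collide, so one part file gets all the data:
print(reducer_for(1, 2), reducer_for(3, 2))   # 1 1
```

With many distinct keys the modulo spreads them roughly evenly, which is why the skew only shows up on toy inputs with two keys.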