Hadoop - Properly sort by key and group by reducer

I have some data coming out of the reducer that looks like this:
9,2 3
5,7 2
2,3 0
1,5 3
6,3 0
4,2 2
7,1 1
I would like to sort it by the number in the second column, like this:
2,3 0
6,3 0
7,1 1
5,7 2
4,2 2
1,5 3
9,2 3
When I run my program locally, I use:
sort -k2,2n
But I don't know how to do the same thing on Hadoop. I've tried several options that did not work, such as:
-D mapreduce.partition.keycomparator.options=-k2,2n
Moreover, I would like all the records that have the same key to go to the same reducer.
So in this case:
2,3 0
and
6,3 0
should be processed by the same reducer.
Any idea which options I should pass to Hadoop?
Thank you in advance!

In the default job configuration, the first column is the key of the reducer output and the second is the value. To produce its result, a reducer processes all records with the same key. So in your case you need to run an additional MapReduce job: its map puts the second column as the key and the first column as the value. That job will group the data the way you want. And if the final result is small, you can simply configure the job to use a single reducer with -D mapred.reduce.tasks=1.
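A minimal sketch of that swap mapper for the extra job, assuming the reducer output lines are whitespace-separated as shown above (the class name is illustrative):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits the second column as the key and the first column as the value,
// so all records sharing the second-column number are sorted together
// and sent to the same reducer.
public class SwapColumnsMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // A line looks like "9,2 3": first column "9,2", second column "3".
        String[] columns = line.toString().trim().split("\\s+");
        if (columns.length == 2) {
            context.write(new Text(columns[1]), new Text(columns[0]));
        }
    }
}
Note that Text keys sort in byte (lexicographic) order by default, which matches numeric order for the single-digit counts shown here; for larger numbers you would need zero-padded keys or a numeric comparator.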

Related

Google Sheets: Data Validation - Unique row values across multiple columns

Good day,
I have seen here a solution for controlling duplicate entries in a single column. Data validation with that custom formula works well for one column.
I would like to achieve the same effect over multiple columns, i.e. unique row entries across multiple columns. Take for example the three columns A-C below. Only when the values {1,2,1} are entered for the second time should the input be rejected.
A B C
1 1 1
1 2 1
1 2 2
2 2 2
1 2 1 X Entry should be rejected.
Is there a quick way to do this using Data Validation - custom formulae?
Use a custom formula for data validation:
=INDEX(COUNTIF($A$1:$A&"×"&$B$1:$B&"×"&$C$1:$C, $A1&"×"&$B1&"×"&$C1)<2)
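This works because the formula concatenates each row's A, B and C values with a delimiter (here ×) that should not appear in the data, counts how many rows in the sheet produce the same combined string, and accepts the entry only while that count stays below 2, i.e. while the row combination is still unique.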

OpenOffice Calc move only unique values to new column

I looked around for a bit and didn't see any question quite like the one I have. I have a sheet with over 80k values in column A. What I need is to remove every occurrence of a duplicate. If the value 5 appears more than once, I don't want the value at all. For example, if I have something like this:
A
1
2
2
3
4
3
I ONLY want the values of 1 and 4, because they only appear once. I'd like every other value deleted, or to have only the values like 1 and 4 appear in another column.
Any help is greatly appreciated.
Work on a copy, as the following deletes records from the source data. In B1 (adjust 90000 to suit):
=COUNTIF(A$1:A$90000;A1)>1
and copy down to suit. Filter A:B, select 1 for Column B and delete the selected rows, then change the filter back to All.
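If you would rather leave the source untouched and just list the once-only values in another column, as the question also suggests, one sketch (using the same 90000-row range, adjust to suit) is to put this in B1 instead and copy down:
=IF(COUNTIF(A$1:A$90000;A1)=1;A1;"")
It shows the value only when it occurs exactly once and leaves the cell blank otherwise.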

Parallel running MapReduce on Hadoop

I'm trying to understand how parallel processing works with Hadoop & MapReduce.
I understand how the Map phase can run in parallel, but I don't understand how Reduce can. For example, if I want to find the average of the following list:
COMPUTER | YEAR | RUNS
--------------------------------
compA | 1989 | 20
compA | 1990 | 10
compB | 1991 | 300
Where compA & compB are two data nodes
If the average function in the Reduce is run on compA and compB separately, and the results from the two data nodes are then averaged, the result will be wrong.
Multiple reducer tasks are not spawned for every type of logic. For aggregation logic such as averaging, the data will go to a single reducer.
The mapper class gets data as key-value pairs and sends its output to the reducer class. Before reaching the reducer, the keys are sorted, and all the values corresponding to the same key reach the same reducer. In this way the results will be correct. If your requirement involves averaging the entire data set, then the output from all the map tasks will be sent to a single reducer.
E.g.: I want to find the average number of vehicles sold per month over a year.
I have data in 12 files; each file contains the sales details for one month.
I have to write the following logic.
mapper class
The mapper gets every record as input.
Parse the number of sales and the month from every record.
Write a constant as the key and the sales figure as the value (e.g. 2000 cars sold in January becomes [sales,2000]; similarly, 1000 sold in February becomes [sales,1000]). I used the constant key sales so that all the values reach the same reducer.
Sample Input data
jan.txt
cars-sold,2000
feb.txt
cars-sold,1000
.....
.....
dec.txt
cars-sold,5000
Mapper output
[sales,2000]
[sales,1000]
....
....
[sales,5000]
Reducer Input
{sales,[1000,2000,.....,5000]}
i.e. (key, list of values)
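A minimal sketch of that mapper and reducer in Java, assuming the monthly files are comma-separated exactly as in the sample above (class names are illustrative, and the two classes would normally live in separate files):
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits the constant key "sales" with the monthly count as the value.
public class SalesMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final Text SALES = new Text("sales");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // A line looks like "cars-sold,2000".
        String[] parts = line.toString().split(",");
        if (parts.length == 2) {
            context.write(SALES, new LongWritable(Long.parseLong(parts[1].trim())));
        }
    }
}

// Reducer: receives ("sales", [2000, 1000, ..., 5000]) and emits the average.
class SalesAverageReducer extends Reducer<Text, LongWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        long count = 0;
        for (LongWritable value : values) {
            sum += value.get();
            count++;
        }
        if (count > 0) {
            context.write(key, new DoubleWritable((double) sum / count));
        }
    }
}
Because every map task emits the same key, all twelve monthly figures end up in one reduce call, which is why the average comes out correct even though the map phase ran in parallel.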

How to filter Record values of files in hadoop mapreduce?

I am working on a MapReduce program. I have two files and I want to delete the information in file1 that also exists in file2. Every line has an ID as its key and some comma-separated numbers as its value.
file1:
1 1,2,10
2 2,7,8,5
3 3,9,12
and
file2:
1 1
2 2,5
3 3,9
I want output like this:
output:
1 2,10
2 7,8
3 12
I want to delete the values in file1 that have the same key in file2. One way to do this is to use the two files as input and, in the map step, produce (ID, line); then, in the reduce step, I filter the values. But my files are very large, so I can't do it this way.
Alternatively, would it be efficient to use file1 as the input file, and in the map open file2, seek to the matching line and compare the values? But since I have a million keys and would have to open file2 for every key, I think the I/O would be excessive.
What can I do?
You can make both file1 and file2 inputs of your mapper. In the mapper, you'd add the source (file1 or file2) to each record. Then use a secondary sort to make sure records from file2 always come first. So the combined input for your reducer would look like this:
1 file2,1
1 file1,1,2,10
2 file2,2,5
2 file1,2,7,8,5
3 file2,3,9
3 file1,3,9,12
You can take the design of the reducer from here.
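A minimal sketch of the tagging mapper, assuming TextInputFormat and tab-separated lines like the samples above; the partitioner and grouping comparator needed for the secondary sort are not shown:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Tags every record with the file it came from, so the reducer can tell
// file2 (the values to delete) apart from file1 (the data to filter).
public class SourceTaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // e.g. "1<TAB>1,2,10" in file1 or "1<TAB>1" in file2
        String[] parts = line.toString().split("\t", 2);
        if (parts.length == 2) {
            String source = ((FileSplit) context.getInputSplit()).getPath().getName();
            // Key = ID, value = "source,numbers"; the secondary sort should
            // order the file2 record ahead of the file1 record for each ID.
            context.write(new Text(parts[0]), new Text(source + "," + parts[1]));
        }
    }
}
The reducer for each ID then reads the file2 record first, remembers its numbers, and emits only those numbers from the file1 record that were not seen in file2.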

Hadoop PIG output is not split into multiple files with the PARALLEL operator

Looks like I'm missing something. The number of reducers I set creates that many files in HDFS, but my data is not split across them. What I noticed is that if I do a GROUP BY on a key that is in sequential order, it works fine; for example, the data below splits nicely into two files based on the key:
1 hello
2 bla
1 hi
2 works
2 end
But this data doesn't split:
1 hello
3 bla
1 hi
3 works
3 end
The code I used, which works fine for the first input but not for the second, is:
InputData = LOAD 'above_data.txt';
GroupReq = GROUP InputData BY $0 PARALLEL 2;
FinalOutput = FOREACH GroupReq GENERATE flatten(InputData);
STORE FinalOutput INTO 'output/GroupReq' USING PigStorage ();
The above code creates two output part files. For the first input it splits the data nicely, putting key 1 in part-r-00000 and key 2 in part-r-00001. But for the second input it creates two part files and all the data ends up in part-r-00000. What am I missing, and what can I do to force the data to be split into multiple output files based on the unique keys?
Note: for the second input, if I use PARALLEL 3 (3 reducers), it creates three part files, putting all the data for key 1 in part-0 and all the data for key 3 in part-3. I found this behavior strange. BTW, I'm using Cloudera CDH3B4.
That's because the reducer a key goes to is determined as hash(key) % reducersCount. If the key is an integer, hash(key) == key. Once you have more data, the keys will be distributed more or less evenly across the reducers, so you shouldn't worry about it.
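A quick worked check of that formula, using Hadoop's default hash-partitioning arithmetic (a sketch; Pig's actual key hashing depends on how the keys are loaded, but for small integer keys the idea is the same):
// Default-style hash partitioning: partition = (hash & Integer.MAX_VALUE) % numReducers.
public class PartitionCheck {
    static int partition(int hash, int numReducers) {
        return (hash & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // With PARALLEL 2, integer keys 1 and 3 collide: 1 % 2 == 1 and 3 % 2 == 1,
        // so both keys land on the same reducer and therefore in the same part file.
        System.out.println(partition(1, 2) + " " + partition(3, 2)); // 1 1
        // With PARALLEL 3 they separate: 1 % 3 == 1 and 3 % 3 == 0.
        System.out.println(partition(1, 3) + " " + partition(3, 3)); // 1 0
        // The sequential keys 1 and 2 from the first input split with 2 reducers:
        System.out.println(partition(1, 2) + " " + partition(2, 2)); // 1 0
    }
}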
