Why is MRJob sorting my keys?

I'm running a fairly big MRJob job (1,755,638 keys) and the keys are being written to the reducers in sorted order. This happens even if I specify that Hadoop should use the hash partitioner, with:
class SubClass(MRJob):
    PARTITIONER = "org.apache.hadoop.mapred.lib.HashPartitioner"
    ...
I don't understand why the keys are sorted, when I am not asking for them to be sorted.

The HashPartitioner is used by default when you don't specify any partitioner explicitly.

MapReduce sorts the key/value pairs by key so that it can guarantee that all values for a given key are passed to the reducer together. In fact, the Iterable passed into the reduce() method just reads that sorted stream until it finds a new key, then stops iterating. That's why the keys will always appear in sorted order.
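The behavior described above can be sketched in Python (a toy simulation, not Hadoop itself): the reduce-side Iterable is effectively a scan over the sorted stream that stops when the key changes, which is why keys appear in order.

```python
from itertools import groupby

# Simulated map output after the shuffle: Hadoop guarantees this stream
# is sorted by key within each partition. Data is invented for illustration.
shuffled = [("apple", 1), ("apple", 3), ("banana", 2), ("cherry", 5), ("cherry", 7)]

def reduce_calls(sorted_pairs):
    # Each reduce() call reads the sorted stream until the key changes.
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

print(list(reduce_calls(shuffled)))
# [('apple', [1, 3]), ('banana', [2]), ('cherry', [5, 7])]
```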

Keys are not globally sorted across reducers by default, but the HashPartitioner can give the appearance of sorted keys if the dataset is small. When I increased the size of the dataset from 50M to 10G, the keys stopped appearing sorted.

Related

Designing of the "mapper" and "reducer" functions' functionality for hadoop?

I am trying to design a mapper and reducer for Hadoop. I am new to Hadoop, and I'm a bit confused about how the mapper and reducer are supposed to work for my specific application.
The input to my mapper is a large directed graph's connectivity. It is a 2 column input where each row is an individual edge connectivity. The first column is the start node id and the second column is the end node id of each edge. I'm trying to output the number of neighbors for each start node id into a 2 column text file, where the first column is sorted in order of increasing start node id.
My questions are:
(1) The input is already set up such that each line is a key-value pair, where the key is the start node id, and the value is the end node id. Would the mapper simply just read in each line and write it out? That seems redundant.
(2) Does the sorting take place in between the mapper and reducer or could the sorting actually be done with the reducer itself?
If my understanding is correct, you want to count how many distinct values a key will have.
Simply emitting the input key-value pairs in the mapper, and then counting the distinct values per key (e.g., by adding them to a set and emitting the set size as the value of the reducer) in the reducer is one way of doing it, but a bit redundant, as you say.
In general, you want to reduce the network traffic, so you may want to do some more computations before the shuffling (yes, this is done by Hadoop).
Two easy ways to improve the efficiency are:
1) Use a combiner, which will output sets of values instead of single values. This way you will send fewer key-value pairs to the reducers, and some values may be skipped entirely, since they are already in the local value set for the same key.
2) Use map-side aggregation. Instead of emitting the input key-value pairs right away, store them locally in the mapper (in memory) in a data structure (e.g., a hashmap or multimap). The key can be the map input key and the value a set of the values seen so far for that key. Each time you meet a new value for a key, you append it to this structure. At the end of each mapper, you emit this structure (or convert the values to an array) from the close() method (if I remember the name correctly).
You can lookup both methods using the keywords "combiner" and "map-side aggregation".
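As a rough Python sketch of the map-side aggregation idea (the edge list and function names are invented for illustration; real Hadoop mappers process one record at a time and emit buffered state from close()/cleanup()):

```python
# Map-side aggregation sketch: instead of emitting each (start, end) edge
# immediately, the mapper buffers distinct neighbors per start node and
# emits one (node, neighbor_set) pair when the task finishes.
edges = [(1, 2), (1, 3), (1, 2), (2, 3), (2, 4)]  # hypothetical edge list

def mapper_with_aggregation(edge_stream):
    buffer = {}  # start node -> set of distinct neighbors seen so far
    for start, end in edge_stream:
        buffer.setdefault(start, set()).add(end)
    # Emitted once at the end of the map task (Hadoop's close()/cleanup()).
    for start, neighbors in buffer.items():
        yield start, sorted(neighbors)

def reducer(key, value_sets):
    # The reducer merges partial neighbor sets and counts distinct neighbors.
    merged = set().union(*map(set, value_sets))
    return key, len(merged)

partials = dict(mapper_with_aggregation(edges))
print(reducer(1, [partials[1]]))  # (1, 2): node 1 has neighbors {2, 3}
```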
A global sort on the key is a bit trickier. Again, there are two basic options, though neither is really good:
1) you use a single reducer, but then you don't gain anything from parallelism,
2) you use a total order partitioner, which needs some extra coding.
Other than that, you may want to move to Spark for a more intuitive and efficient solution.

ordering of list of values for each keys of reducer output

I am new to Hadoop and a little confused about it.
In a MapReduce job the reducer gets a list of values for each key. I want to know: what is the default ordering of values for each key? Is it the same order in which they were written out from the mapper? Can you change the ordering (e.g., asc or desc) of the values within each key?
Is it the same order as it was written out from the mapper? - Yes
That is true for a single mapper. But if your job has more than one mapper, you may not see the same order across two runs with the same input, as different mappers may finish at different times.
Can you change the ordering (e.g., asc or desc) of the values in each key? - Yes
It is done using a technique called 'secondary sort'(you may Google for more reading on this).
In MapReduce, a few properties affect how map output is emitted and ordered; controlling them for this purpose is referred to as a secondary sort. Namely, two factors are involved:
Partitioner, which divides the map output among the reducers. Each partition is processed by a reduce task, so the number of partitions is equal to the number of reduce tasks for the job.
Comparator, which determines the sort order of keys (and, with a composite key, of the values within the same natural key).
The default partitioner is the org.apache.hadoop.mapred.lib.HashPartitioner class, which hashes a record’s key to determine which partition the record belongs in.
Comparators differ by data type. If you want to control the sort order, override compare(WritableComparable, WritableComparable) of the WritableComparator class. See the documentation.
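A toy Python simulation of what overriding the sort comparator achieves (data and sort direction invented for illustration): sort on a composite (key, value) with a custom ordering, then group on the natural key only, so each reduce call sees its values in the chosen order.

```python
from itertools import groupby

# Toy map output; keys repeat and values arrive in arbitrary order.
map_output = [("a", 3), ("b", 1), ("a", 1), ("b", 2), ("a", 2)]

# Sort comparator: natural key ascending, value descending -- this is
# what overriding WritableComparator.compare() controls in Java.
sorted_pairs = sorted(map_output, key=lambda kv: (kv[0], -kv[1]))

# Group on the natural key only, so each key's values stay together.
grouped = [(k, [v for _, v in g])
           for k, g in groupby(sorted_pairs, key=lambda kv: kv[0])]
print(grouped)
# [('a', [3, 2, 1]), ('b', [2, 1])]
```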

What is the point of using a Partitioner for Secondary Sorting in MapReduce?

If you need the values sorted for a given key when passed to the reduce phase, such as for a moving average, or to mimic the LAG/LEAD analytic functions in SQL, you need to implement a secondary sort in MapReduce.
After searching around on Google, the common suggestion is to:
A) Emit a composite key, which includes the natural key plus the value to sort on, in the map phase
B) Create a "composite key comparator" class, the purpose of which is for the secondary sort, comparing the values to sort on after comparing the key, so that the Iterable passed to the reducer is sorted.
C) Create a "natural key grouping comparator" class, the purpose of which is for the primary sort, comparing only the key to sort on, so that the Iterable passed to the reducer contains all of the values belonging to a given key.
D) Create a "natural key partitioner class", the purpose of which I do not know and is the purpose of my question.
From here:
The natural key partitioner uses the natural key to partition the data to the reducer(s). Again, note that here, we only consider the “natural” key.
By natural key he of course means the actual key, not the composite key + value.
From here:
The default partitioner will calculate a hash over the entire key, resulting in different hashes and the potential that the records are sent to separate reducers. To ensure that both records are sent to the same reducer, let's implement a custom partitioner.
From here:
In a real Hadoop cluster, there are many reducers running in different nodes. If the data for the same zone and day don’t land in the same reducer after the map reduce shuffle, we are in trouble. The way to ensure that is taking charge of defining our own partitioning logic.
Every source I've presented, plus all the others I've seen, recommends that the partitioner class be written according to the following pseudocode:
naturalKey = compositeKey.getNaturalKey()
return naturalKey.hashCode() % NUMBER_OF_REDUCERS
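That pseudocode translates to something like the following Python sketch (NUMBER_OF_REDUCERS is a hypothetical setting; the Java hashCode() version masks the sign bit rather than taking an absolute value):

```python
NUMBER_OF_REDUCERS = 4  # hypothetical cluster setting

def natural_key_partitioner(composite_key):
    # Composite key modeled as (natural key, value to sort on); only the
    # natural key participates in partitioning.
    natural_key, _value = composite_key
    # Python's hash() can be negative, hence abs().
    return abs(hash(natural_key)) % NUMBER_OF_REDUCERS

# All composite keys sharing a natural key land in the same partition,
# no matter which value they carry:
assert natural_key_partitioner(("user42", 1)) == natural_key_partitioner(("user42", 99))
```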
Now, I was under the impression that Hadoop guarantees that for a given key, ALL the values corresponding to that key will be directed to the same reducer.
Is the reason we create a custom Partitioner the same reason we created the "natural key grouping comparator" class: to prevent MapReduce from partitioning on the composite key instead of the natural key?
The question is almost as good as an answer :). Everything you mentioned above is correct; I guess a different way of explaining the concept should help.
So let me give it a shot.
Let's assume that our secondary sort is on a composite key made out of Last Name and First Name.
With the composite key out of the way, let's now look at the secondary sorting mechanism.
The partitioner and the group comparator use only the natural key; the partitioner uses it to channel all records with the same natural key to a single reducer. This partitioning happens in the map phase. Data from the various map tasks is then received by the reducers, where it is grouped and sent to the reduce method. This grouping is where the group comparator comes into the picture: if we had not specified a custom group comparator, Hadoop would have used the default implementation, which considers the entire composite key and would have led to incorrect results.
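Here is a small Python simulation of the Last Name / First Name example (names invented for illustration), showing why the default grouping on the whole composite key gives the wrong reduce calls:

```python
from itertools import groupby

# Map output as (Last Name, First Name) composite keys, sorted by the
# full composite key, as the shuffle would deliver them.
records = sorted([("Smith", "Alice"), ("Smith", "Bob"), ("Jones", "Carol")])

# Default grouping compares the whole composite key -> one reduce call
# per (last, first) pair, so first names are never grouped together.
default_calls = [list(g) for _, g in groupby(records, key=lambda r: r)]

# A custom group comparator compares only the natural key (last name) ->
# one reduce call per last name, with first names arriving in sorted order.
natural_calls = [(last, [first for _, first in g])
                 for last, g in groupby(records, key=lambda r: r[0])]

print(len(default_calls), natural_calls)
# 3 [('Jones', ['Carol']), ('Smith', ['Alice', 'Bob'])]
```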

Hadoop map/reduce sort

I have a map-reduce job and I am using just the mapper because the output of each mapper will definitely have a unique key. My question is when this job is run and I get the output files, which are like part-m-00000, part-m-00001 ... Will they be sorted in order of key?
Or do I need to implement a reducer which does nothing but write them to files like part-r-00000, part-r-00001? And do these guarantee that the output is sorted in the order of the key?
If you want to sort the keys within each file and make sure that the keys in file i are less than the keys in file j when i is less than j, you not only need to use a reducer but also a partitioner. You might want to consider using something like Pig to do this, as it will be trivial. If you want to do it with MR, use the sorted field as your key and write a partitioner to make sure that your keys end up in the correct reducer.
When your map function outputs the keys, each pair goes to the partition function, and the framework then sorts the pairs by key within each partition. Therefore by default the keys arriving at each reducer will be in sorted order and you can use the identity reducer.
If you want to guarantee sorted order, you can simply use a single IdentityReducer.
If you want it to be more parallelizable, you can specify more reducers, but then the output will by default only be sorted within files, not across files. I.e., each file will be sorted, but part-r-00000 will not necessarily come before part-r-00001. If you DO want it to be sorted across files, you can use a custom partitioner that partitions based on the sort order, i.e., reducer 0 gets all of the lowest keys, then reducer 1, ..., and reducer N gets all of the highest keys.
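A toy Python sketch of that custom range partitioner (the split boundaries here are invented; Hadoop's TotalOrderPartitioner derives them by sampling the data):

```python
import bisect

# Split points dividing the key space into 3 partitions:
# [.."g"), ["g".."p"), ["p"..]. Hypothetical boundaries for illustration.
boundaries = ["g", "p"]

def range_partition(key):
    # Reducer 0 gets the lowest keys, reducer 2 the highest.
    return bisect.bisect_right(boundaries, key)

keys = ["zebra", "apple", "mango", "kiwi", "pear", "banana"]
parts = {}
for k in keys:
    parts.setdefault(range_partition(k), []).append(k)

# Each reducer sorts its own partition; concatenating part-r-00000,
# part-r-00001, ... in order then yields a globally sorted result.
output = [k for p in sorted(parts) for k in sorted(parts[p])]
print(output)
# ['apple', 'banana', 'kiwi', 'mango', 'pear', 'zebra']
```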

hadoop + one key to every reducer

Is there a way in Hadoop to ensure that every reducer gets only one key that is output by the mapper ?
This question is a bit unclear to me, but I think I have a pretty good idea of what you want.
First of all, if you do nothing special, every time reduce is called it gets only a single key with a set of one or more values (via an iterator).
My guess is that you want to ensure that every reducer gets exactly one 'key-value pair'.
There are essentially two ways of doing that:
Ensure in the mapper that all keys that are output are unique. So for each key there is only one value.
Force this in the reducer by supplying a group comparator that simply classifies all keys as different.
So if I understand your question correctly, you should implement a GroupComparator that simply states that all keys are different and should therefore be sent to a different reduce call.
Because of other answers in this question I'm adding a bit more detail:
There are 3 methods used for comparing keys (I pulled these code samples from a project I did using the 0.18.3 API):
Partitioner
conf.setPartitionerClass(KeyPartitioner.class);
The partitioner only ensures that "things that must go together end up in the same partition". If you have only one reducer there is only one partition, so this won't help much.
Key Comparator
conf.setOutputKeyComparatorClass(KeyComparator.class);
The key comparator is used to SORT the "key-value pairs" in a group by looking at the key ... which must be different somehow.
Group Comparator
conf.setOutputValueGroupingComparator(GroupComparator.class);
The group comparator is used to group keys that are different, yet must be sent to the same reduce call.
HTH
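A toy Python illustration of the group-comparator idea above (a simulation, not the Hadoop API): grouping by the whole record instead of by the key means each reduce call receives exactly one key-value pair, even when keys repeat.

```python
from itertools import groupby

# Simulated sorted map output with a repeated key.
map_output = sorted([("k1", "a"), ("k1", "b"), ("k2", "c")])

# Normal grouping: reduce is called once per distinct key.
normal = [(k, [v for _, v in g])
          for k, g in groupby(map_output, key=lambda r: r[0])]

# "All keys are different" grouping: reduce is called once per record,
# which is what a GroupComparator that never returns equal achieves.
per_record = [(k, [v]) for k, v in map_output]

print(normal)      # [('k1', ['a', 'b']), ('k2', ['c'])]
print(per_record)  # [('k1', ['a']), ('k1', ['b']), ('k2', ['c'])]
```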
You can get some control over which keys get sent to which reducers by implementing the Partitioner interface.
From the Hadoop API docs:
Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.
The following book does a great job of describing partitioning, key sorting strategies and tradeoffs along with other issues in map reduce algorithm design: http://www.umiacs.umd.edu/~jimmylin/book.html
Are you sure you want to do this? Can you elaborate on your problem, so that I can understand why you want to do this?
You have to do two things, as mentioned in earlier answers:
1) Write a partitioner such that each key gets associated with a unique reducer.
2) Ensure that the number of reducer slots in your cluster is greater than or equal to the number of unique keys you will have.
Pranab
My guess is the same as above: if possible, sort the keys and assign them to reducers based on your partitioning criteria; refer to the YouTube video "mapreduce ucb 61a lecture-34", where they talk about this.
