Hadoop map/reduce sort - hadoop

I have a map-reduce job and I am using just the mapper because the output of each mapper will definitely have a unique key. My question is: when this job is run and I get the output files (part-m-00000, part-m-00001, ...), will they be sorted in order of key?
Or do I need to implement a reducer which does nothing but write the records out to files like part-r-00000, part-r-00001? And does that guarantee that the output is sorted in order of the key?

If you want to sort the keys within each file and make sure that the keys in file i are less than the keys in file j whenever i is less than j, you not only need a reducer but also a partitioner. You might want to consider using something like Pig for this, where it is trivial. If you want to do it with MR, use the field you sort on as your key and write a partitioner to make sure your keys end up in the correct reducer.

When your map function emits its keys, each record passes through the partitioner and the framework then sorts the keys within each partition before the reduce phase. So by default the keys arrive at each reducer in sorted order and you can use the identity reducer.
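As a rough illustration of that default behaviour, here is a minimal driver sketch (the class name and the choice of input format are assumptions, not taken from the question) that runs the stock identity Mapper and Reducer so that each reducer's output file comes out sorted by key:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortByKeyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sort by key");
        job.setJarByClass(SortByKeyJob.class);
        // KeyValueTextInputFormat splits each line into (key, value) on a tab.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // The base Mapper and Reducer classes are identity implementations;
        // the framework's shuffle sorts the keys within each reduce partition.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With a single reduce task this gives one globally sorted part-r-00000; with several reduce tasks each output file is sorted on its own, as discussed below.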

If you want to guarantee sorted order, you can simply use a single IdentityReducer.
If you want it to be more parallelizable, you can specify more reducers, but then by default the output will only be sorted within files, not across files. I.e., each file will be sorted, but part-r-00000 will not necessarily come before part-r-00001. If you DO want it sorted across files, you can use a custom partitioner that partitions based on the sort order, i.e. reducer 0 gets all of the lowest keys, then reducer 1, and so on, with reducer N getting all of the highest keys.
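For example, here is a hedged sketch of such a range partitioner, assuming IntWritable keys with a known upper bound (the class name and the MAX_KEY value are made up for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes low keys to reducer 0, the next band to reducer 1, and so on,
// so that per-reducer sorting also yields a global order across output files.
public class RangePartitioner extends Partitioner<IntWritable, Text> {
    private static final int MAX_KEY = 1_000_000; // assumed known upper bound

    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        int bucket = (int) ((long) key.get() * numPartitions / (MAX_KEY + 1L));
        return Math.min(Math.max(bucket, 0), numPartitions - 1);
    }
}

The driver would register it with job.setPartitionerClass(RangePartitioner.class). The TotalOrderPartitioner shipped with Hadoop does the same job without hard-coding the key range, by sampling the input first.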

Related

Designing of the "mapper" and "reducer" functions' functionality for hadoop?

I am trying to design a mapper and reducer for Hadoop. I am new to Hadoop, and I'm a bit confused about how the mapper and reducer are supposed to work for my specific application.
The input to my mapper is a large directed graph's connectivity. It is a 2 column input where each row is an individual edge connectivity. The first column is the start node id and the second column is the end node id of each edge. I'm trying to output the number of neighbors for each start node id into a 2 column text file, where the first column is sorted in order of increasing start node id.
My questions are:
(1) The input is already set up such that each line is a key-value pair, where the key is the start node id, and the value is the end node id. Would the mapper simply just read in each line and write it out? That seems redundant.
(2) Does the sorting take place in between the mapper and reducer or could the sorting actually be done with the reducer itself?
If my understanding is correct, you want to count how many distinct values a key will have.
Simply emitting the input key-value pairs in the mapper, and then counting the distinct values per key (e.g., by adding them to a set and emitting the set size as the value of the reducer) in the reducer is one way of doing it, but a bit redundant, as you say.
In general, you want to reduce the network traffic, so you may want to do some more computations before the shuffling (yes, this is done by Hadoop).
Two easy ways to improve the efficiency are:
1) Use a combiner, which will output sets of values instead of single values. This way you will send fewer key-value pairs to the reducers, and some duplicate values can be dropped because they are already in the local value set for that key.
2) Use map-side aggregation. Instead of emitting the input key-value pairs right away, store them locally in the mapper (in memory) in a data structure (e.g., a hashmap or multimap). The key can be the map input key and the value a set of the values seen so far for this key. Each time you meet a new value for a key, you add it to this structure. At the end of each mapper you emit this structure (or convert the values to an array) from the close() method (cleanup() in the newer mapreduce API).
You can lookup both methods using the keywords "combiner" and "map-side aggregation".
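To make the second option concrete for the neighbour-count problem, here is a rough sketch of map-side aggregation, assuming whitespace-separated "startNode endNode" lines (class and method names are illustrative): the mapper deduplicates edges locally and emits each distinct (start, end) pair once from cleanup(), and the reducer counts the distinct neighbours per start node.

import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NeighborCount {

    // Map-side aggregation: keep a per-mapper map of start node -> set of end nodes
    // and emit each distinct (start, end) pair only once, from cleanup().
    public static class DedupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, Set<String>> seen = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            String[] parts = line.toString().trim().split("\\s+");
            if (parts.length == 2) {
                seen.computeIfAbsent(parts[0], k -> new HashSet<>()).add(parts[1]);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            for (Map.Entry<String, Set<String>> e : seen.entrySet()) {
                for (String endNode : e.getValue()) {
                    context.write(new Text(e.getKey()), new Text(endNode));
                }
            }
        }
    }

    // Counts the distinct end nodes that reach this reducer for each start node.
    public static class CountReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text startNode, Iterable<Text> endNodes, Context context)
                throws IOException, InterruptedException {
            Set<String> distinct = new HashSet<>();
            for (Text endNode : endNodes) {
                distinct.add(endNode.toString());
            }
            context.write(startNode, new IntWritable(distinct.size()));
        }
    }
}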
A global sort on the key is a bit trickier. Again, there are two basic options, though neither is really great:
1) you use a single reducer, but then you don't gain anything from parallelism,
2) you use a total order partitioner, which needs some extra coding.
Other than that, you may want to move to Spark for a more intuitive and efficient solution.
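For completeness, here is a sketch of how option 2 is usually wired up with Hadoop's TotalOrderPartitioner and InputSampler, assuming Text keys read from sequence files and an identity map phase; the sampling parameters below are arbitrary illustration values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class GlobalSortDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "global sort");
        job.setJarByClass(GlobalSortDriver.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setNumReduceTasks(4); // several reducers, output still globally ordered
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Sample the input to pick partition boundaries, then hand them to the partitioner.
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path(args[2]));
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10));
        job.setPartitionerClass(TotalOrderPartitioner.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}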

MapReduce: Given a file of numbers, output the amount of distinct / unique numbers

If the Input file is: 1,1,2,2,3,4,4,4,5,5,5,5,6,6,6, then the output of MapReduce should be 6 (i.e. the size of the set of unique integers {1,2,3,4,5,6}).
I need help with implementing the above. I know that we can filter out duplicates by emitting each number vs. a null value in map(), and then similarly output the key vs. a null value in reduce() to a resultant file / console.
But if I directly need to get the number of distinct numbers, how would I go about with this?
My current implementation is to build a Set, pass it as the output of the Mapper, and in the Reducer, combine all Sets passed to it, and return the count of that resultant Set. Do note that this is more of a design question than a library-specific (say, Hadoop) implementation question.
Use the mapper to build a HashSet; make the output types IntWritable and NullWritable (a rough sketch follows below).
Add all the input values to the set.
Write out the size of the HashSet.
Set number of Reduce Tasks to 0, since it's not needed.
If you must use a Reducer, output (null, value) from the mapper.
Do the same as above.
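A rough sketch of the map-only variant described above, assuming comma-separated integers and a single input split (with several mappers, each part-m file would only hold the count for its own split); all names are illustrative:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only distinct count: collect values in a HashSet and emit the set size once.
// The driver would call job.setNumReduceTasks(0), so this is written straight to part-m-00000.
public class DistinctCountMapper
        extends Mapper<LongWritable, Text, IntWritable, NullWritable> {

    private final Set<Integer> distinct = new HashSet<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        for (String token : line.toString().split(",")) {
            if (!token.trim().isEmpty()) {
                distinct.add(Integer.parseInt(token.trim()));
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        context.write(new IntWritable(distinct.size()), NullWritable.get());
    }
}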
Alternative (simpler) methods exist if you can use Hive, Pig, or Spark.

Why is MRJob sorting my keys?

I'm running a fairly big MRJob job (1,755,638 keys) and the keys are being written to the reducers in sorted order. This happens even if I specify that Hadoop should use the hash partitioner, with:
class SubClass(MRJob):
    PARTITIONER = "org.apache.hadoop.mapred.lib.HashPartitioner"
    ...
I don't understand why the keys are sorted, when I am not asking for them to be sorted.
The HashPartitioner is used by default when you don't specify any partitioner explicitly.
MR sorts the key/value pairs by key so that it can ensure that all values for a given key are passed to the reducer together. In fact, the Iterable passed into the reduce() method just reads that sorted list until it finds a new key and then it stops iterating. That's why the keys will always appear in order.
The output is not globally sorted by default; with a small dataset the HashPartitioner can give the appearance of sorted keys. When I increased the size of the dataset from 50M to 10G, the keys stopped coming out in sorted order.

ordering of list of values for each keys of reducer output

I am new to Hadoop and a little confused about it.
In a MapReduce job the reducer gets a list of values for each key. I want to know what the default ordering of the values for each key is. Is it the same order in which they were written out by the mapper? Can you change the ordering (e.g. ascending or descending) of the values for each key?
Is it the same order in which they were written out by the mapper? - Yes
That is true for a single mapper. But if your job has more than one mapper, you may not see the same order across two runs with the same input, since different mappers may finish at different times.
Can you change the ordering (e.g. ascending or descending) of the values for each key? - Yes
It is done using a technique called 'secondary sort' (you may Google for more reading on this).
In MapReduce, a few properties control how map output is partitioned and ordered, and they are what the secondary sort technique builds on. Namely, two factors affect this:
Partitioner, which divides the map output among the reducers. Each partition is processed by a reduce task, so the number of partitions is equal to the number of reduce tasks for the job.
Comparator, which compares values with the same key.
The default partitioner is the org.apache.hadoop.mapred.lib.HashPartitioner class, which hashes a record’s key to determine which partition the record belongs in.
Comparators differ by data type. If you want to control the sort order, override compare(WritableComparable, WritableComparable) in a WritableComparator subclass and register it with the job; see the WritableComparator documentation for details.
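As a small example, here is a hedged sketch of a WritableComparator that reverses the natural order of IntWritable keys; in a full secondary sort the same idea is applied to a composite key that carries the value you want ordered. The class name is made up:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorts IntWritable keys in descending order; register it in the driver with
// job.setSortComparatorClass(DescendingIntComparator.class).
public class DescendingIntComparator extends WritableComparator {

    public DescendingIntComparator() {
        super(IntWritable.class, true); // true = create key instances for deserialization
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        // Flip the sign of the natural comparison to reverse the order.
        return -((IntWritable) a).compareTo((IntWritable) b);
    }
}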

Hadoop and Cassandra processing rows in sorted order

I want to fill a Cassandra database with a list of strings that I then process using Hadoop. What I want to do is run through all the strings in order using a Hadoop cluster and record how much overlap there is between each string in order to find the Longest Common Substring.
My question is, will the InputFormat object allow me to read out the data in a sorted order or will my strings be read out "randomly" (according to how Cassandra decides to distribute them) throughout every machine in the cluster? Is the MapReduce process designed to process each row by itself w/out the intent of looking at two rows consecutively like I'm asking for?
First of all, the Mappers will read the data in whatever order they get it from the InputFormat. I'm not a Cassandra expert, but I don't expect that will be in sorted order.
If you want sorted order, you should use an identity mapper whose output key is the string itself. Then the strings will be sorted before being passed to the reduce step. But it gets a little more complicated since you can have more than one reducer. With only one reducer, everything is globally sorted. With more than one, each reducer's input is sorted, but the input across reducers might not be sorted. That is, adjacent strings might not go to the same reducer. You would need a custom partitioner to handle that.
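A minimal sketch of that kind of identity-style mapper, assuming one string per input line (names illustrative); combined with a range partitioner like the one sketched in the first answer, the strings would then come out in global order:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each input string as the map output key so the shuffle sorts the strings
// before they reach the reducer(s).
public class StringKeyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(line, NullWritable.get());
    }
}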
Lastly, you mentioned that you're doing longest common substring: are you looking for the longest substring among each pair of strings? Among consecutive pairs of strings? Among all strings? Each of these possibilities will affect how you need to structure your MapReduce job.
