hadoop + one key to every reducer - hadoop

Is there a way in Hadoop to ensure that every reducer gets only one key that is output by the mapper?

This question is a bit unclear to me, but I think I have a pretty good idea of what you want.
First of all, if you do nothing special, every time reduce is called it gets exactly one key with a set of one or more values (via an iterator).
My guess is that you want to ensure that every reducer gets exactly one 'key-value pair'.
There are essentially two ways of doing that:
Ensure in the mapper that all keys that are output are unique. So for each key there is only one value.
Force this on the reduce side by setting a group comparator that simply classifies all keys as different.
So, if I understand your question correctly, you should implement a GroupComparator that simply states that all keys are different and should therefore be sent to a different reducer call.
Because of other answers in this question I'm adding a bit more detail:
There are 3 methods used for comparing keys (I pulled these code samples from a project I did using the 0.18.3 API):
Partitioner
conf.setPartitionerClass(KeyPartitioner.class);
The partitioner only ensures that "things that must be the same end up on the same partition". There is one partition per reduce task, so if the job runs with a single reduce task there is only one partition and the partitioner won't help much.
Key Comparator
conf.setOutputKeyComparatorClass(KeyComparator.class);
The key comparator is used to SORT the "key-value pairs" in a group by looking at the key ... which must be different somehow.
Group Comparator
conf.setOutputValueGroupingComparator(GroupComparator.class);
The group comparator is used to group keys that are different, yet must be sent to the same reducer call.
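To make the effect of the group comparator concrete, here is a minimal plain-Java sketch (no Hadoop classes involved). It assumes, as I understand the framework, that a new reduce call starts whenever the group comparator reports two adjacent sorted keys as different. The class name GroupingSketch and the helper countReduceCalls are made up for this illustration.

```java
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration only: counts how many reduce() calls a sorted run
// of keys would produce under a given grouping comparator.
final class GroupingSketch {

    // Default behaviour: equal keys share one reduce() call.
    static final Comparator<String> NATURAL = Comparator.naturalOrder();

    // "All keys are different": never returns 0, so every record is its own group.
    // This deliberately ignores the usual Comparator contract and only makes
    // sense for adjacent-key grouping, never for sorting.
    static final Comparator<String> ALWAYS_DIFFERENT = (a, b) -> -1;

    // Hypothetical helper: a new group starts whenever the grouping comparator
    // returns a non-zero result for two adjacent keys.
    static <K> int countReduceCalls(List<K> sortedKeys, Comparator<? super K> grouping) {
        int calls = 0;
        K previous = null;
        for (K key : sortedKeys) {
            if (calls == 0 || grouping.compare(previous, key) != 0) {
                calls++;
            }
            previous = key;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "a", "b", "b", "b", "c");
        System.out.println(countReduceCalls(keys, NATURAL));          // 3
        System.out.println(countReduceCalls(keys, ALWAYS_DIFFERENT)); // 6
    }
}
```

In a real job the same idea would live in the class passed to setOutputValueGroupingComparator; the sketch only shows why "never equal" yields one key-value pair per reduce call.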
HTH

You can get some control over which keys get sent to which reducers by implementing the Partitioner interface.
From the Hadoop API docs:
Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.
The following book does a great job of describing partitioning, key sorting strategies and tradeoffs along with other issues in map reduce algorithm design: http://www.umiacs.umd.edu/~jimmylin/book.html

Are you sure you want to do this? Can you elaborate on your problem, so that I can understand why you want to do this?
You have to do two things, as mentioned in earlier answers:
Write a partitioner such that each key gets associated with a unique reducer.
Ensure that the number of reducer slots in your cluster is greater than or equal to the number of unique keys you will have.
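A minimal plain-Java sketch of that idea, assuming the set of distinct keys is known before the job runs (a big assumption in practice). The class OneKeyPerPartitionSketch and the helper buildAssignment are hypothetical names; a real partitioner would look the key up in such a table and return the stored index.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only: gives every known key its own partition index.
final class OneKeyPerPartitionSketch {

    // Hypothetical helper: fails fast when there are fewer reduce tasks than keys,
    // because then two keys would have to share a reducer.
    static Map<String, Integer> buildAssignment(List<String> distinctKeys, int numReduceTasks) {
        if (distinctKeys.size() > numReduceTasks) {
            throw new IllegalArgumentException("need at least one reduce task per distinct key");
        }
        Map<String, Integer> assignment = new HashMap<>();
        for (String key : distinctKeys) {
            // The next free index is the number of keys assigned so far.
            assignment.putIfAbsent(key, assignment.size());
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println(buildAssignment(List.of("x", "y", "z"), 3));
    }
}
```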
Pranab

My guess is the same as above: if possible, sort the keys and assign each one to a reducer based on your partitioning criteria. Search YouTube for "mapreduce ucb 61a lecture-34"; they talk about this stuff there.

Related

Designing of the "mapper" and "reducer" functions' functionality for hadoop?

I am trying to design a mapper and reducer for Hadoop. I am new to Hadoop, and I'm a bit confused about how the mapper and reducer are supposed to work for my specific application.
The input to my mapper is a large directed graph's connectivity. It is a 2 column input where each row is an individual edge connectivity. The first column is the start node id and the second column is the end node id of each edge. I'm trying to output the number of neighbors for each start node id into a 2 column text file, where the first column is sorted in order of increasing start node id.
My questions are:
(1) The input is already set up such that each line is a key-value pair, where the key is the start node id, and the value is the end node id. Would the mapper simply just read in each line and write it out? That seems redundant.
(2) Does the sorting take place in between the mapper and reducer or could the sorting actually be done with the reducer itself?
If my understanding is correct, you want to count how many distinct values a key will have.
Simply emitting the input key-value pairs in the mapper, and then counting the distinct values per key (e.g., by adding them to a set and emitting the set size as the value of the reducer) in the reducer is one way of doing it, but a bit redundant, as you say.
In general, you want to reduce the network traffic, so you may want to do more of the computation before the shuffle (and yes, to your second question: the sorting is done by Hadoop, between the map and reduce phases).
Two easy ways to improve the efficiency are:
1) Use a combiner, which will output sets of values instead of single values. This way you send fewer key-value pairs to the reducers, and some values may be skipped because they are already in the local value set of the same key.
2) Use map-side aggregation. Instead of emitting the input key-value pairs right away, store them locally in the mapper (in memory) in a data structure (e.g., a hashmap or multimap). The key can be the map input key and the value can be the set of values seen so far for this key. Each time you meet a new value for this key, you add it to this structure. At the end of each mapper, you emit this structure (or you convert the values to an array) from the close() method (if I remember the name; in the newer API it is cleanup()).
You can look up both methods using the keywords "combiner" and "map-side aggregation".
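As a rough plain-Java sketch of the map-side aggregation idea (no Hadoop classes; each list of edges stands in for one mapper's input split, and the names NeighborCountSketch, aggregateSplit and countNeighbors are invented for this example):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java illustration only: distinct-neighbor counting with local aggregation.
final class NeighborCountSketch {

    // "Map side": collect the distinct end nodes per start node for one split.
    // Each edge is an int[2] of {startNode, endNode}.
    static Map<Integer, Set<Integer>> aggregateSplit(List<int[]> edges) {
        Map<Integer, Set<Integer>> local = new HashMap<>();
        for (int[] edge : edges) {
            local.computeIfAbsent(edge[0], k -> new HashSet<>()).add(edge[1]);
        }
        return local;
    }

    // "Reduce side": union the per-split sets and emit the set sizes,
    // ordered by increasing start node id.
    static TreeMap<Integer, Integer> countNeighbors(List<Map<Integer, Set<Integer>>> partials) {
        Map<Integer, Set<Integer>> merged = new TreeMap<>();
        for (Map<Integer, Set<Integer>> partial : partials) {
            partial.forEach((node, ends) ->
                merged.computeIfAbsent(node, k -> new HashSet<>()).addAll(ends));
        }
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        merged.forEach((node, ends) -> counts.put(node, ends.size()));
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> s1 = aggregateSplit(List.of(new int[] {1, 2}, new int[] {1, 2}));
        Map<Integer, Set<Integer>> s2 = aggregateSplit(List.of(new int[] {1, 3}, new int[] {2, 1}));
        System.out.println(countNeighbors(List.of(s1, s2)));
    }
}
```

The duplicate edge in the first split is dropped locally, so it never has to cross the network. The in-memory map is only safe if one split's distinct edges fit in memory.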
A global sort on the key is a bit trickier. Again there are two basic options, though neither is really good:
1) you use a single reducer, but then you don't gain anything from parallelism,
2) you use a total order partitioner, which needs some extra coding.
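The idea behind option 2 is range partitioning: pick split points so that every key in partition i is smaller than every key in partition i+1, and the concatenated reducer outputs are then globally sorted. Below is a small plain-Java sketch of that idea only; it is not Hadoop's TotalOrderPartitioner, and the split points are assumed to be given (in practice they are usually obtained by sampling the input). RangePartitionSketch and partitionFor are hypothetical names.

```java
import java.util.Arrays;

// Plain-Java illustration only: range partitioning over integer keys.
final class RangePartitionSketch {

    // splitPoints must be sorted ascending and distinct;
    // n split points define n + 1 partitions.
    static int partitionFor(int key, int[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Found: the key equals a split point and goes to the partition on its right.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        int[] splits = {100, 200};
        System.out.println(partitionFor(50, splits));  // 0
        System.out.println(partitionFor(150, splits)); // 1
        System.out.println(partitionFor(250, splits)); // 2
    }
}
```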
Other than that, you may want to move to Spark for a more intuitive and efficient solution.

How can I uniformly distribute data to reducers using a MapReduce mapper?

I have only a high-level understanding of MapReduce but a specific question about what is allowed in implementations.
I want to know whether it's easy (or possible) for a Mapper to uniformly distribute the given key-value pairs among the reducers. It might be something like
(k,v) -> (proc_id, (k,v))
where proc_id is the unique identifier for a processor (assume that every key k is unique).
The central question is that if the number of reducers is not fixed (is determined dynamically depending on the size of the input; is this even how it's done in practice?), then how can a mapper produce sensible ids? One way could be for the mapper to know the total number of key-value pairs. Does MapReduce allow mappers to have this information? Another way would be to perform some small number of extra rounds of computation.
What is the appropriate way to do this?
The distribution of keys to reducers is done by a Partitioner. If you don't specify otherwise, the default partitioner uses a simple hashCode-based partitioning algorithm, which tends to distribute the keys very uniformly when every key is unique.
I'm assuming that what you actually want is to process random groups of records in parallel, and that the keys k have nothing to do with how the records should be grouped. That suggests that you should focus on doing the work on the map side instead. Hadoop is pretty good at cleanly splitting up the input into parallel chunks for processing by the mappers, so unless you are doing some kind of arbitrary aggregation I see no reason to reduce at all.
Often the procId technique you mention is used to take otherwise heavily-skewed groups and un-skew them (for example, when performing a join operation). In your case the key is all but meaningless.

What is the point of using a Partitioner for Secondary Sorting in MapReduce?

If you need to have the values sorted for a given key when passed to the reduce phase, such as for a moving average, or to mimic the LAG/LEAD analytic functions in SQL, you need to implement a Secondary Sort in MapReduce.
After searching around on Google, the common suggestion is to:
A) Emit a composite key, which includes the natural key and the value to sort on, in the map phase
B) Create a "composite key comparator" class, the purpose of which is for the secondary sort, comparing the values to sort on after comparing the key, so that the Iterable passed to the reducer is sorted.
C) Create a "natural key grouping comparator" class, the purpose of which is for the primary sort, comparing only the key to sort on, so that the Iterable passed to the reducer contains all of the values belonging to a given key.
D) Create a "natural key partitioner class", the purpose of which I do not know and is the purpose of my question.
From here:
The natural key partitioner uses the natural key to partition the data to the reducer(s). Again, note that here, we only consider the “natural” key.
By natural key he of course means the actual key, not the composite key + value.
From here:
The default partition will calculate a hash over the entire key resulting in different hashes and the potential that the records are sent to separate reducers. To ensure that both records are sent to the same reducer let's implement a customer partitioner.
From here:
In a real Hadoop cluster, there are many reducers running in different nodes. If the data for the same zone and day don’t land in the same reducer after the map reduce shuffle, we are in trouble. The way to ensure that is taking charge of defining our own partitioning logic.
Every source I've presented, plus all the others I've seen, recommends that the partitioner class be written according to the following pseudo code:
naturalKey = compositeKey.getNaturalKey()
return naturalKey.hashCode() % NUMBER_OF_REDUCERS
Now, I was under the impression that Hadoop guarantees that for a given key, ALL the values corresponding to that key will be directed to the same reducer.
Is the reason we create a custom Partitioner the same for which we created the "natural key grouping comparator" class, to prevent MapReduce from sending the composite key instead of the reducer key?
The question is almost as good as an answer :). Everything you mentioned above is correct; I guess a different way of explaining the concept might help.
So let me give it a shot.
Let's assume that our secondary sort is on a composite key made up of Last Name and First Name.
With the composite key out of the way, let's look at the secondary sorting mechanism.
The partitioner and the group comparator use only the natural key. The partitioner uses it to channel all records with the same natural key to a single reducer. This partitioning happens in the map phase; data from the various map tasks is received by the reducers, where it is grouped and then sent to the reduce method. This grouping is where the group comparator comes into the picture: if we had not specified a custom group comparator, Hadoop would have used the default implementation, which considers the entire composite key and would have led to incorrect results.
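Here is a minimal plain-Java sketch of the partitioning half, assuming Last Name is the natural key and First Name is the secondary sort field (the answer does not say which is which, so that is my assumption). NameKey and partitionByNaturalKey are made-up names, not Hadoop classes. Masking with Integer.MAX_VALUE, as Hadoop's default HashPartitioner does, keeps the result non-negative even when hashCode() is negative, which a bare hashCode() % n would not.

```java
// Plain-Java illustration only: partition on the natural key of a composite key.
final class NaturalKeyPartitionSketch {

    // Hypothetical composite key: lastName is the natural key,
    // firstName is only there to drive the secondary sort.
    static final class NameKey {
        final String lastName;
        final String firstName;

        NameKey(String lastName, String firstName) {
            this.lastName = lastName;
            this.firstName = firstName;
        }
    }

    // Only the natural key feeds the hash, so every record for one last name
    // lands in the same partition regardless of first name.
    static int partitionByNaturalKey(NameKey key, int numReduceTasks) {
        return (key.lastName.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int p1 = partitionByNaturalKey(new NameKey("Smith", "Anna"), 7);
        int p2 = partitionByNaturalKey(new NameKey("Smith", "Zed"), 7);
        System.out.println(p1 == p2); // true
    }
}
```

Hashing the whole composite key instead could send "Smith, Anna" and "Smith, Zed" to different reducers, at which point no group comparator can bring them back together.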

Is the input to a Hadoop reduce function complete with regards to its key?

I'm looking at solutions to a problem that involves reading keyed data from more than one file. In a single map step I need all the values for a particular key in the same place at the same time. I see in White's book the discussion about "the shuffle" and am tempted to wonder whether, once merging is done and the input to a reducer is sorted by key, all the data for a key is there... and whether you can count on that.
The bigger picture is that I want to do a poor-man's triple-store federation, and the triples I want to load into an in-memory store don't all come from the same file. It's a vertical (?) partition where the values for a particular key are in different files. Said another way, the columns for a complete record each come from different files. Does Hadoop re-assemble that? ...at least for a single key at a time.
In short: yes. In a Hadoop job, the partitioner chooses which reducer receives which (key, value) pairs. Quote from the Yahoo tutorial section on partitioning: "It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same". This is also necessary for many of the types of algorithms typically solved with map reduce (such as distributed sorting, which is what you're describing).

Sorting the values before they are sent to the reducer

I'm thinking about building a small testing application in hadoop to get the hang of the system.
The application I have in mind will be in the realm of doing statistics.
I want to have "The 10 worst values for each key" from my reducer function (where I must assume the possibility of a huge number of values for some keys).
What I have planned is that the values that go into my reducer will basically be the combination of "The actual value" and "The quality/relevance of the actual value".
Based on the relevance I "simply" want to take the 10 worst/best values and output them from the reducer.
How do I go about doing that (assuming a huge number of values for a specific key)?
Is there a way that I can sort all values BEFORE they are sent into the reducer (and simply stop reading the input when I have read the first 10) or must this be done differently?
Can someone here point me to a piece of example code I can have a look at?
Update: I found two interesting Jira issues HADOOP-485 and HADOOP-686.
Does anyone have a code fragment showing how to use this in the Hadoop 0.20 API?
Sounds definitely like a secondary sort problem. Take a look at "Hadoop: The Definitive Guide" from O'Reilly, if you like. You can also access it online. It describes a pretty good implementation.
I implemented it by myself too. Basically it works this way:
The partitioner takes care that all the key-value pairs with the same key go to one single reducer. Nothing special here.
But there is also the GroupingComparator, which forms the groupings. One group is passed as an iterator to one reduce() call, so a partition can contain multiple groupings, while the number of partitions equals the number of reducers. The order in which records arrive is decided by the sort comparator on the (composite) key, which implements a compare method.
With this method, you can make sure that the 10 best/worst/highest/lowest (whichever you need) keys reach the reducer first. After you have read these 10 keys, you can leave the reduce method without any further iterations.
Hope that was helpful :-)
It sounds like you want to use a Combiner, which defines what to do with the values you create on the Map side before they are sent to the Reducer, but after they are grouped by key.
The combiner is often set to just be the reducer class (so you reduce on the map side, and then again on the reduce side).
Take a look at how the wordCount example uses the combiner to pre-compute partial counts:
http://wiki.apache.org/hadoop/WordCount
Update
Here's what I have in mind for your problem; it's possible I misunderstood what you are trying to do, though.
Every mapper emits <key, {score, data}> pairs.
The combiner gets a partial set of these pairs: <key, [set of {score, data}]> and does a local sort (still on the mapper nodes), and outputs <key, [sorted set of top 10 local {score, data}]> pairs.
The reducer will get <key, [set of top-10-sets]> -- all it has to do is perform the merge step of sort-merge (no sorting needed) for each of the members of the value sets, and stop merging when the first 10 values are pulled.
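The combiner-then-merge scheme above can be sketched in plain Java roughly like this, for a single key. To keep it short I assume the values are bare integer scores where lower means "worse", and I drop the {score, data} pairing; TopKSketch, localTopK and mergeTopK are made-up names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Plain-Java illustration only: local top-k per mapper, then a k-way merge.
final class TopKSketch {

    // "Combiner" step: keep the k lowest scores seen by one mapper, sorted ascending.
    static List<Integer> localTopK(List<Integer> scores, int k) {
        List<Integer> sorted = new ArrayList<>(scores);
        sorted.sort(Comparator.naturalOrder());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    // "Reducer" step: merge the already-sorted lists and stop after k values.
    static List<Integer> mergeTopK(List<List<Integer>> sortedLists, int k) {
        // Each queue entry is {value, listIndex, positionInList}, ordered by value.
        PriorityQueue<int[]> heads =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int i = 0; i < sortedLists.size(); i++) {
            if (!sortedLists.get(i).isEmpty()) {
                heads.add(new int[] {sortedLists.get(i).get(0), i, 0});
            }
        }
        List<Integer> result = new ArrayList<>();
        while (result.size() < k && !heads.isEmpty()) {
            int[] head = heads.poll();
            result.add(head[0]);
            List<Integer> source = sortedLists.get(head[1]);
            int next = head[2] + 1;
            if (next < source.size()) {
                heads.add(new int[] {source.get(next), head[1], next});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> a = localTopK(List.of(9, 1, 5, 7), 2);
        List<Integer> b = localTopK(List.of(4, 2, 8), 2);
        System.out.println(mergeTopK(List.of(a, b), 3)); // [1, 2, 4]
    }
}
```

This only works when a value's rank can be decided locally, which is exactly the condition the second update below says does not hold for the asker.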
Update 2
So, now that we know that the rank is cumulative and that, as a result, you can't filter the data early by using combiners, the only thing to do is what you suggested: get a secondary sort going. You've found the right tickets; there is an example of how to do this in Hadoop 0.20 in src/examples/org/apache/hadoop/examples/SecondarySort.java (or, if you don't want to download the whole source tree, you can look at the example patch in https://issues.apache.org/jira/browse/HADOOP-4545 )
If I understand the question properly, you'll need to use a TotalOrderPartitioner.
