I am new to Hadoop and have been struggling to write a MapReduce algorithm for finding the top N values for each key. Any help or guide to a code implementation would be highly appreciated.
Input data
a,1
a,9
b,3
b,5
a,4
a,7
b,1
output
a 1,4,7,9
b 1,3,5
I believe we should write a Mapper that would read the line, split the values and allow it to be collected by reducer. And once in the reducer we have to do the sorting part.
If the number of values per key is small enough, the simple approach of just having the reducer read all values associated with a given key and output the top N is probably best.
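For the simple case, here is a minimal sketch in plain Java (no Hadoop classes, so it runs on its own). The class and method names are made up for illustration: groupByKey stands in for the map and shuffle phases, and topN is what the body of a reducer might look like, assuming "top" means largest. It keeps a min-heap of at most N values, so memory per key stays bounded even if a key has many values.

```java
import java.util.*;

public class TopNPerKey {
    // Stand-in for a reducer body: keeps only the n largest values of one key.
    static List<Integer> topN(Iterable<Integer> values, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (int v : values) {
            heap.add(v);
            if (heap.size() > n) {
                heap.poll(); // drop the smallest value seen so far
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Comparator.reverseOrder()); // largest first
        return result;
    }

    // Stand-in for map + shuffle: split "key,value" lines and group values by key.
    static Map<String, List<Integer>> groupByKey(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            grouped.computeIfAbsent(parts[0].trim(), k -> new ArrayList<>())
                   .add(Integer.parseInt(parts[1].trim()));
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a,1", "a,9", "b,3", "b,5", "a,4", "a,7", "b,1");
        // Prints "a [9, 7, 4]" and then "b [5, 3, 1]"
        groupByKey(input).forEach((k, vs) -> System.out.println(k + " " + topN(vs, 3)));
    }
}
```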
If the number of values per key is large enough that this would be a poor choice, then a composite key is going to work better, and a custom partitioner and comparator will be needed. You'd want to partition based on the natural key (here 'a' or 'b', so that these end up at the same reducer) but with a secondary sort on the value (so that the reducer will see the largest values first).
The secondary sort trick mentioned by cohoz seems to be what you're looking for.
There's a nice guide here, which even has a similar structure to your problem: in the example, the author is seeking to walk over each integer timestamp (1,2,3) in sorted order for each class (a,b,c). You'll simply need to modify the reducer in the example to just walk over the top n items and emit them, then stop.
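To make the moving parts of a secondary sort concrete, here is a rough plain-Java simulation (not the Hadoop API; CompositeKey, partition, SORT and reduceTopN are hypothetical names). It assumes the composite key is (natural key, value), partitions on the natural key only, sorts by natural key and then by value descending, and groups on the natural key when reducing.

```java
import java.util.*;

public class SecondarySortSketch {
    // Composite key: the natural key plus the value we want sorted.
    static final class CompositeKey {
        final String naturalKey;
        final int value;
        CompositeKey(String naturalKey, int value) {
            this.naturalKey = naturalKey;
            this.value = value;
        }
    }

    // Partitioner: looks at the natural key only, so all values of a key meet in one partition.
    static int partition(CompositeKey k, int numPartitions) {
        return (k.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Sort comparator: natural key ascending, then value descending (largest first).
    static final Comparator<CompositeKey> SORT = (x, y) -> {
        int c = x.naturalKey.compareTo(y.naturalKey);
        return c != 0 ? c : Integer.compare(y.value, x.value);
    };

    // Reducer side: sort, group on the natural key, keep only the first n values per group.
    static Map<String, List<Integer>> reduceTopN(List<CompositeKey> partition, int n) {
        List<CompositeKey> sorted = new ArrayList<>(partition);
        sorted.sort(SORT);
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (CompositeKey k : sorted) {
            List<Integer> top = out.computeIfAbsent(k.naturalKey, x -> new ArrayList<>());
            if (top.size() < n) {
                top.add(k.value); // values arrive largest first, so these are the top n
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = Arrays.asList(
            new CompositeKey("a", 1), new CompositeKey("a", 9), new CompositeKey("b", 3),
            new CompositeKey("b", 5), new CompositeKey("a", 4), new CompositeKey("a", 7),
            new CompositeKey("b", 1));
        System.out.println(reduceTopN(keys, 2)); // prints {a=[9, 7], b=[5, 3]}
    }
}
```

In a real job the sorting happens in the shuffle rather than in the reducer, which is the whole point: the reducer only has to read the first n records of each group.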
Related
I am trying to design a mapper and reducer for Hadoop. I am new to Hadoop, and I'm a bit confused about how the mapper and reducer are supposed to work for my specific application.
The input to my mapper is the connectivity of a large directed graph. It is a 2 column input where each row is a single edge. The first column is the start node id and the second column is the end node id of each edge. I'm trying to output the number of neighbors for each start node id into a 2 column text file, where the first column is sorted in order of increasing start node id.
My questions are:
(1) The input is already set up such that each line is a key-value pair, where the key is the start node id, and the value is the end node id. Would the mapper simply just read in each line and write it out? That seems redundant.
(2) Does the sorting take place in between the mapper and reducer or could the sorting actually be done with the reducer itself?
If my understanding is correct, you want to count how many distinct values a key will have.
Simply emitting the input key-value pairs in the mapper, and then counting the distinct values per key (e.g., by adding them to a set and emitting the set size as the value of the reducer) in the reducer is one way of doing it, but a bit redundant, as you say.
In general, you want to reduce the network traffic, so you may want to do some more computations before the shuffle (and yes, the shuffle, including the sort by key between mapper and reducer, is done by Hadoop).
Two easy ways to improve the efficiency are:
1) Use a combiner, which will output sets of values instead of single values. This way, you will send fewer key-value pairs to the reducers, and some values may be skipped, since they are already in the local value set of the same key.
2) Use map-side aggregation. Instead of emitting the input key-value pairs right away, store them locally in the mapper (in memory) in a data structure (e.g., hashmap or multimap). The key can be the map input key and the value can be a set of values seen so far for this key. Each time you meet a new value for this key, you add it to this structure. At the end of each mapper, you emit this structure (or you convert the values to an array) from the close() method (if I remember the name; in the newer org.apache.hadoop.mapreduce API it is called cleanup()).
You can lookup both methods using the keywords "combiner" and "map-side aggregation".
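As an illustration of map-side aggregation for the neighbor-count problem, here is a small sketch in plain Java. It does not use the Hadoop classes; InMapperAggregation and its map, close and reduce methods are hypothetical stand-ins for the corresponding hooks, and it assumes the input lines are whitespace-separated "start end" pairs.

```java
import java.util.*;

public class InMapperAggregation {
    // State the mapper keeps in memory across map() calls: start node -> distinct end nodes.
    private final Map<String, Set<String>> seen = new HashMap<>();

    // Called once per input line; nothing is emitted yet.
    void map(String line) {
        String[] parts = line.trim().split("\\s+");
        seen.computeIfAbsent(parts[0], k -> new HashSet<>()).add(parts[1]);
    }

    // Called once at the end of the mapper: hands over one (key, set) pair per key.
    Map<String, Set<String>> close() {
        return seen;
    }

    // Reducer: union the partial sets coming from all mappers and return the count.
    static int reduce(List<Set<String>> partialSets) {
        Set<String> all = new HashSet<>();
        for (Set<String> s : partialSets) {
            all.addAll(s);
        }
        return all.size();
    }

    public static void main(String[] args) {
        InMapperAggregation mapper = new InMapperAggregation();
        for (String line : Arrays.asList("1 2", "1 3", "1 2", "2 3")) {
            mapper.map(line);
        }
        // The duplicate edge "1 2" is dropped before anything crosses the network.
        System.out.println(mapper.close().get("1").size()); // prints 2
    }
}
```

The trade-off is memory: the map grows with the number of distinct keys and values a single mapper sees, so a real implementation would probably flush it when it gets too large.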
A global sort on the key is a bit trickier. Again, there are two basic options, though neither is really good:
1) you use a single reducer, but then you don't gain anything from parallelism,
2) you use a total order partitioner, which needs some extra coding.
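The idea behind a total order partitioner can be sketched in a few lines of plain Java. This is only the range-lookup part, with made-up names and integer keys as assumptions; in Hadoop the split points would normally come from sampling the input rather than being hard-coded.

```java
import java.util.Arrays;

public class RangePartitionSketch {
    // Returns the partition for a key, given sorted split points.
    // With split points {100, 200}: keys < 100 go to 0, 100..199 to 1, >= 200 to 2.
    static int partition(int key, int[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // A non-negative result is an exact hit on a split point; a negative one
        // encodes the insertion point as -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        int[] splitPoints = {100, 200};
        for (int key : new int[] {50, 100, 150, 250}) {
            System.out.println(key + " -> partition " + partition(key, splitPoints));
        }
    }
}
```

Because every key in partition i is smaller than every key in partition i+1, and each reducer sorts its own keys, concatenating the reducer outputs in partition order gives a globally sorted result.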
Other than that, you may want to move to Spark for a more intuitive and efficient solution.
I have a Pair RDD (K, V) with the key containing a time and an ID. I would like to get a Pair RDD of the form (K, Iterable<V>) where the keys are grouped by id and the iterable is ordered by time.
I'm currently using sortByKey().groupByKey() and my tests seem to prove it works, however I'm reading that it may not always be the case, as discussed in this question with diverging answers ( Does groupByKey in Spark preserve the original order? ).
Is it correct or not?
Thanks!
The answer from Matei, who I consider authoritative on this topic, is quite clear:
The order is not guaranteed actually, only which keys end up in each partition. Reducers may fetch data from map tasks in an arbitrary order, depending on which ones are available first. If you’d like a specific order, you should sort each partition. Here you might be getting it because each partition only ends up having one element, and collect() does return the partitions in order.
In that context, a better option would be to apply the sorting to the resulting collections per key:
rdd.groupByKey().mapValues(_.sorted)
The Spark Programming Guide offers three alternatives if one desires predictably ordered data following shuffle:
mapPartitions to sort each partition using, for example, .sorted
repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
sortBy to make a globally ordered RDD
As written in the Spark API, repartitionAndSortWithinPartitions is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
The sorting, however, is computed by looking only at the keys K of the tuples (K, V). The trick is to put all the relevant information in the first element of the tuple, like ((K, V), null), and to define a custom partitioner and a custom ordering. This article describes the technique pretty well.
I'm looking at solutions to a problem that involves reading keyed data from more than one file. In a single map step I need all the values for a particular key in the same place at the same time. I see in White's book the discussion about "the shuffle", and I wonder whether, once the merge is done and the input to a reducer is sorted by key, all the data for a key is there, and whether you can count on that.
The bigger picture is that I want to do a poor-man's triple-store federation and the triples I want to load into an in-memory store don't all come from the same file. It's a vertical (?) partition where the values for a particular key are in different files. Said another way, the columns for a complete record each come from different files. Does Hadoop re-assemble that? ...at least for a single key at a time.
In short: yes. In a Hadoop job, the partitioner chooses which reducer receives which (key, value) pairs. Quote from the Yahoo tutorial section on partitioning: "It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same". This is also necessary for many of the types of algorithms typically solved with map reduce (such as distributed sorting, which is what you're describing).
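A tiny plain-Java sketch of why this holds with hash partitioning: the partition is a pure function of the key, so it does not matter which mapper or which input file produced the record. The formula below mirrors the one used by Hadoop's default HashPartitioner; the class name and the sample key are made up for illustration.

```java
public class HashPartitionSketch {
    // Same key and same number of reduce tasks always give the same partition.
    static int partitionFor(String key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Imagine the same subject key turning up in two different input files.
        String keyFromFileA = "subject42";
        String keyFromFileB = new String("subject42");
        System.out.println(partitionFor(keyFromFileA, 8) == partitionFor(keyFromFileB, 8)); // prints true
    }
}
```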
Is there a way in Hadoop to ensure that every reducer gets only one key that is output by the mapper ?
This question is a bit unclear to me. But I think I have a pretty good idea what you want.
First of all, if you do nothing special, every time reduce is called it gets only one single key with a set of one or more values (via an iterator).
My guess is that you want to ensure that every reducer gets exactly one 'key-value pair'.
There are essentially two ways of doing that:
Ensure in the mapper that all keys that are output are unique. So for each key there is only one value.
Force the reducer to do this by forcing a group comparator that simply classifies all keys as different.
So if I understand your question correctly, you should implement a GroupComparator that simply states that all keys are different and should therefore be sent to a different reducer call.
Because of other answers in this question I'm adding a bit more detail:
There are 3 methods used for comparing keys (I pulled these code samples from a project I did using the 0.18.3 API):
Partitioner
conf.setPartitionerClass(KeyPartitioner.class);
The partitioner is only to ensure that "things that must be the same end up on the same partition". If you have 1 computer there is only one partition, so this won't help much.
Key Comparator
conf.setOutputKeyComparatorClass(KeyComparator.class);
The key comparator is used to SORT the "key-value pairs" in a group by looking at the key ... which must be different somehow.
Group Comparator
conf.setOutputValueGroupingComparator(GroupComparator.class);
The group comparator is used to group keys that are different, yet must be sent to the same reduce call.
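To show what the group comparator changes, here is a rough plain-Java simulation of how sorted keys get cut into reduce() calls. Nothing here is the Hadoop API: the class, the "a#9" key format and both comparators are assumptions for illustration, and the sketch relies on the grouping comparator only ever being applied to neighbouring keys in sorted order.

```java
import java.util.*;

public class GroupingSketch {
    // Cuts an already sorted key list into reduce() calls: a new call starts whenever
    // the grouping comparator says the current key differs from the previous one.
    static List<List<String>> groups(List<String> sortedKeys, Comparator<String> grouping) {
        List<List<String>> calls = new ArrayList<>();
        String previous = null;
        for (String key : sortedKeys) {
            if (previous == null || grouping.compare(previous, key) != 0) {
                calls.add(new ArrayList<>());
            }
            calls.get(calls.size() - 1).add(key);
            previous = key;
        }
        return calls;
    }

    // Groups composite keys such as "a#9" by the part before '#'.
    static final Comparator<String> BY_PREFIX =
        (x, y) -> x.substring(0, x.indexOf('#')).compareTo(y.substring(0, y.indexOf('#')));

    // Declares every key different from every other, so each record gets its own call.
    // (Not a well-behaved comparator for sorting; it is only used for grouping here.)
    static final Comparator<String> ALL_DIFFERENT = (x, y) -> -1;

    public static void main(String[] args) {
        List<String> sorted = Arrays.asList("a#9", "a#7", "b#5");
        System.out.println(groups(sorted, BY_PREFIX));     // prints [[a#9, a#7], [b#5]]
        System.out.println(groups(sorted, ALL_DIFFERENT)); // prints [[a#9], [a#7], [b#5]]
    }
}
```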
HTH
You can get some control over which keys get sent to which reducers by implementing the Partitioner interface.
From the Hadoop API docs:
Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.
The following book does a great job of describing partitioning, key sorting strategies and tradeoffs along with other issues in map reduce algorithm design: http://www.umiacs.umd.edu/~jimmylin/book.html
Are you sure you want to do this? Can you elaborate on your problem, so that I can understand why you want to do this?
You have to do two things, as mentioned in earlier answers:
Write a partitioner such that each key gets associated with a unique reducer.
Ensure that the number of reducer slots in your cluster is greater than or equal to the number of unique keys you will have.
Pranab
My guess is the same as above. You can also sort the keys, if possible, and assign them to reducers based on your partitioning criteria. See lecture 34 of the UCB 61A course on YouTube (search for "mapreduce ucb 61a lecture-34"); they talk about this stuff.
I'm thinking about building a small testing application in hadoop to get the hang of the system.
The application I have in mind will be in the realm of doing statistics.
I want to have "The 10 worst values for each key" from my reducer function (where I must assume the possibility of a huge number of values for some keys).
What I have planned is that the values that go into my reducer will basically be the combination of "The actual value" and "The quality/relevance of the actual value".
Based on the relevance I "simply" want to take the 10 worst/best values and output them from the reducer.
How do I go about doing that (assuming a huge number of values for a specific key)?
Is there a way that I can sort all values BEFORE they are sent into the reducer (and simply stop reading the input when I have read the first 10) or must this be done differently?
Can someone here point me to a piece of example code I can have a look at?
Update: I found two interesting Jira issues HADOOP-485 and HADOOP-686.
Does anyone have a code fragment on how to use this in the Hadoop 0.20 API?
Sounds definitely like a secondary sort problem. Take a look at "Hadoop: The Definitive Guide" from O'Reilly, if you like. You can also access it online. It describes a pretty good implementation.
I implemented it by myself too. Basically it works this way:
The partitioner takes care that all the key-value pairs with the same key go to one single reducer. Nothing special here.
But there is also the GroupingComparator, which forms groupings. One group is passed as an iterator to one reduce() call, so a partition can contain multiple groupings, while the number of partitions equals the number of reducers. The comparator classes also let you control the sort order, as each implements a compare method.
With this method, you can make sure that the 10 best/worst/highest/lowest (whatever you need) keys reach the reducer first. So after you have read these 10 keys, you can leave the reduce method without any further iterations.
Hope that was helpful :-)
It sounds like you want to use a Combiner, which defines what to do with the values you create on the Map side before they are sent to the Reducer, but after they are grouped by key.
The combiner is often set to just be the reducer class (so you reduce on the map side, and then again on the reduce side).
Take a look at how the wordCount example uses the combiner to pre-compute partial counts:
http://wiki.apache.org/hadoop/WordCount
Update
Here's what I have in mind for your problem; it's possible I misunderstood what you are trying to do, though.
Every mapper emits <key, {score, data}> pairs.
The combiner gets a partial set of these pairs: <key, [set of {score, data}]> and does a local sort (still on the mapper nodes), and outputs <key, [sorted set of top 10 local {score, data}]> pairs.
The reducer will get <key, [set of top-10-sets]> -- all it has to do is perform the merge step of sort-merge (no sorting needed) for each of the members of the value sets, and stop merging when the first 10 values are pulled.
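Here is roughly what those two steps could look like for a single key, as a plain-Java sketch rather than real Combiner/Reducer classes. The names are hypothetical, only the scores are kept (the data part is left out), and it assumes "top" means the lowest scores; flip the comparisons for the highest.

```java
import java.util.*;

public class LocalTopKMerge {
    // Combiner step: sort one mapper's scores for a key and keep the k lowest.
    static List<Integer> localTopK(List<Integer> scores, int k) {
        List<Integer> sorted = new ArrayList<>(scores);
        Collections.sort(sorted);
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    // Reducer step: k-way merge of the already sorted lists, stopping after k values.
    static List<Integer> mergeTopK(List<List<Integer>> sortedLists, int k) {
        // Heap entries are {value, listIndex, positionInList}, ordered by value.
        Comparator<int[]> byValue = (x, y) -> Integer.compare(x[0], y[0]);
        PriorityQueue<int[]> heap = new PriorityQueue<>(byValue);
        for (int i = 0; i < sortedLists.size(); i++) {
            if (!sortedLists.get(i).isEmpty()) {
                heap.add(new int[] {sortedLists.get(i).get(0), i, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (out.size() < k && !heap.isEmpty()) {
            int[] head = heap.poll();
            out.add(head[0]);
            List<Integer> source = sortedLists.get(head[1]);
            int next = head[2] + 1;
            if (next < source.size()) {
                heap.add(new int[] {source.get(next), head[1], next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> fromMapper1 = localTopK(Arrays.asList(5, 1, 3, 8), 2); // [1, 3]
        List<Integer> fromMapper2 = localTopK(Arrays.asList(2, 9, 4), 2);    // [2, 4]
        System.out.println(mergeTopK(Arrays.asList(fromMapper1, fromMapper2), 2)); // prints [1, 2]
    }
}
```

Note this only works when a record's score is final at map time, so that a record outside a local top 10 can never end up in the global top 10.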
update 2
So, now that we know that the rank is cumulative and that, as a result, you can't filter the data early by using combiners, the only thing is to do what you suggested -- get a secondary sort going. You've found the right tickets; there is an example of how to do this in Hadoop 0.20 in src/examples/org/apache/hadoop/examples/SecondarySort.java (or, if you don't want to download the whole source tree, you can look at the example patch in https://issues.apache.org/jira/browse/HADOOP-4545 )
If I understand the question properly, you'll need to use a TotalOrderPartitioner.