Why do we need the "map" part in MapReduce? - hadoop

The MapReduce programming model consists of two procedures, map and reduce. Why do we need the map part when we can simply do the mapping inside the reduce function?
Consider the following pseudocode:
result = my_list.map(my_mapper).reduce(my_reducer);
This could be shortened to
result = my_list.reduce(lambda x : my_reducer(my_mapper(x)));
Why would the first approach be preferred over the second one, given that the first approach requires one more pass through the data? Is my code example oversimplifying?

Well, if you refer to Hadoop-style MapReduce, it is actually map-shuffle-reduce, where the shuffle is the reason map and reduce are separated. At a slightly higher level you can think about data locality. Each key-value pair passed through map can generate zero or more key-value pairs. To be able to reduce these you have to ensure that all values for a given key are available to a single reducer, hence the shuffle. What is important is that pairs emitted from a single input pair can be processed by different reducers.
It is possible to use patterns like map-side aggregation or combiners, but at the end of the day it is still (map)-reduce-shuffle-reduce.
Assuming data locality is not an issue, higher-order functions like map and reduce provide an elegant abstraction layer. Finally, it is a declarative API: a simple expression like xs.map(f1).reduce(f2) describes only what to compute, not how. Depending on the language or context these operations can be evaluated eagerly or lazily, fused together, and in more complex scenarios reordered and optimized in many different ways.
Regarding your code: even if the signatures were correct, it wouldn't really reduce the number of times you pass over the data. Moreover, if you push map into the aggregation, the arguments passed to the aggregation function are no longer of the same type. That means either a sequential fold or much more complex merging logic.
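To make the type point concrete, here is a minimal Java streams sketch (plain Java, not Hadoop; the class name is just for illustration). With separate map and reduce, the reducing function always combines two values of the same type; once the mapping is pushed into the aggregation, you need a fold with an accumulator of a different type and, for parallel execution, an extra combiner function:

    import java.util.Arrays;
    import java.util.List;

    public class MapThenReduce {
        public static void main(String[] args) {
            List<String> words = Arrays.asList("map", "shuffle", "reduce");

            // 1) map then reduce: the reducing function only ever sees two Integers.
            int total = words.stream()
                    .map(String::length)              // String -> Integer
                    .reduce(0, Integer::sum);         // (Integer, Integer) -> Integer

            // 2) mapping pushed into the aggregation: the fold step mixes types
            //    (Integer accumulator, String element), so a third "combiner"
            //    function is needed to merge partial results in parallel runs.
            int total2 = words.stream()
                    .reduce(0,
                            (acc, word) -> acc + word.length(),  // fold step
                            Integer::sum);                       // merge partials

            System.out.println(total + " " + total2);            // 16 16
        }
    }

Note also that in a lazy or streaming implementation like this, map does not materialize an intermediate collection, so xs.map(f1).reduce(f2) is still effectively a single pass over the data.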

At a high level, MapReduce is all about processing in parallel. Even though the reducers work on map output, in practical terms each reducer gets only part of the data, and that is possible only with the first approach.
In your second approach, your reducer actually needs the entire output of the mapper, which defeats the idea of parallelism.

Related

In Hadoop, what is meant by the ability to preserve state across multiple inputs in mappers and reducers?

The question heading says what my question is.
I have been reading through multiple texts and answers, where I came across this line:
Through use of the combiner and by taking advantage of the ability to preserve state across multiple inputs, it is often possible to substantially reduce both the number and size of key-value pairs that need to be shuffled from the mappers to the reducers.
I am not able to understand this concept; an elaborate answer and explanation with an example would be really helpful. How can I develop an intuition for understanding concepts like this?
If you already feel comfortable with the reducer concept, the combiner concept will be easy. A combiner can be seen as a mini-reducer in the map phase. What do I mean by that? Let's see an example: suppose you are doing the classic word-count problem. You know that for every word a key-value pair is emitted by the mapper; the reducer then takes these key-value pairs as input and summarizes them.
Suppose that a mapper emits some key-value pairs like:
<key1,1>,
<key2,1>,
<key1,1>,
<key3,1>,
<key1,1>
If you are not using a combiner, these 5 key-value pairs will be sent to the reducer. But using a combiner we can perform a pre-reduce on the mapper side, so the output of the mapper will be:
<key1,3>,
<key2,1>,
<key3,1>
In this simple example, by using a combiner you reduced the total number of key-value pairs from 5 to 3, which means less network traffic and better performance in the shuffle phase.
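To sketch how this is wired up in Hadoop's classic word-count job (assuming the TokenizerMapper and IntSumReducer classes from the standard WordCount example; the driver fragment below is illustrative, not a complete program), the combiner is simply the reducer class registered one extra time:

    // Driver fragment only; TokenizerMapper and IntSumReducer are the
    // mapper/reducer from the standard WordCount example.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-reduce on each mapper's output
    job.setReducerClass(IntSumReducer.class);    // final reduce after the shuffle
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

This only works because summing counts is commutative and associative; Hadoop treats the combiner as an optimization it may run zero, one, or several times.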

How can I uniformly distribute data to reducers using a MapReduce mapper?

I have only a high-level understanding of MapReduce but a specific question about what is allowed in implementations.
I want to know whether it's easy (or possible) for a Mapper to uniformly distribute the given key-value pairs among the reducers. It might be something like
(k,v) -> (proc_id, (k,v))
where proc_id is the unique identifier for a processor (assume that every key k is unique).
The central question is: if the number of reducers is not fixed (it is determined dynamically depending on the size of the input; is this even how it's done in practice?), then how can a mapper produce sensible ids? One way could be for the mapper to know the total number of key-value pairs. Does MapReduce allow mappers to have this information? Another way would be to perform some small number of extra rounds of computation.
What is the appropriate way to do this?
The distribution of keys to reducers is done by a Partitioner. If you don't specify otherwise, the default partitioner uses a simple hashCode-based partitioning algorithm, which tends to distribute the keys very uniformly when every key is unique.
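For reference, the stock hashCode-based partitioner boils down to something like the following sketch (the class name here is made up; Hadoop ships an equivalent HashPartitioner):

    import org.apache.hadoop.mapreduce.Partitioner;

    // Essentially what Hadoop's built-in HashPartitioner does: mask off the
    // sign bit and take the hash modulo the number of reducers.
    public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

Note that the mapper itself never needs to know the number of reducers; the framework hands numReduceTasks to the partitioner at runtime, which also answers the question about producing sensible ids when the reducer count is chosen dynamically.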
I'm assuming that what you actually want is to process random groups of records in parallel, and that the keys k have nothing to do with how the records should be grouped. That suggests that you should focus on doing the work on the map side instead. Hadoop is pretty good at cleanly splitting up the input into parallel chunks for processing by the mappers, so unless you are doing some kind of arbitrary aggregation I see no reason to reduce at all.
Often the procId technique you mention is used to take otherwise heavily-skewed groups and un-skew them (for example, when performing a join operation). In your case the key is all but meaningless.
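As a rough illustration of that salting idea (all class names and the record format below are hypothetical), a mapper can append a random suffix to a hot key so its values spread over several reduce groups, at the cost of a second pass to merge the per-salt results:

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical "salting" mapper: spreads a heavily skewed key over
    // NUM_SALTS reducer groups by appending a random suffix. The results
    // for "hotkey#0" .. "hotkey#15" then have to be merged in a second job.
    public class SaltingMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final int NUM_SALTS = 16;
        private final Random random = new Random();
        private final Text saltedKey = new Text();

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // Assume the record is "key<TAB>value"; this parsing is illustrative only.
            String[] parts = record.toString().split("\t", 2);
            saltedKey.set(parts[0] + "#" + random.nextInt(NUM_SALTS));
            context.write(saltedKey, new Text(parts.length > 1 ? parts[1] : ""));
        }
    }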

Hadoop: reducing the result to a single value

I started learning Hadoop, and I am a bit confused by MapReduce. For tasks where the result is natively a list of key-value pairs everything seems clear. But I don't understand how I should solve tasks where the result is a single value (say, the sum of squared input decimals, or the centre of mass of the input points).
On the one hand I can put all the mapper results under the same key. But as far as I understand, in that case the single reducer will have to manage the whole data set (calculate the sum, or the mean coordinates). That doesn't look like a good solution.
Another approach I can imagine is to group the mapper results. Say, the mapper that processed examples 0-999 produces key 0, the one that processed 1000-1999 produces key 1, and so on. Since there will still be multiple reducer results, it will be necessary to build a chain of reducers (reducing is repeated until only one result remains). That looks much more computationally efficient, but a bit complicated.
I still hope that Hadoop has an off-the-shelf tool that runs such a cascade of reducers to maximise the efficiency of reducing the whole data set to a single value, although I have failed to find one.
What is the best practice for solving tasks where the result is a single value?
If you can reformulate your task in terms of a commutative reduce, you should look at Combiners. In any case you should take a look at them; they can significantly reduce the amount of data to shuffle.
From my point of view, you are tackling the problem from the wrong angle.
Take the problem where you need to sum the squares of your input, and let's assume you have many large text input files consisting of one number per line.
Then ideally you want to parallelize the sums in the mappers and then just sum up the sums in the reducer.
e.g.:
map: (input "x", temporary sum "s") -> s+=(x*x)
At the end of the map phase, you emit that temporary sum from every mapper under a single global key.
In the reduce stage you basically get all the sums from your mappers and sum them up. Note that this is fairly small (n single numbers, where n is the number of mappers) compared to your huge input files, so a single reducer is really not a scalability bottleneck here.
You want to cut down the communication cost between the mappers and the reducer, not proxy all your data to a single reducer and read through it there; that would not parallelize anything.
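A sketch of that pattern (often called in-mapper combining) for the sum-of-squares case, assuming plain text input with one number per line; the class and field names are made up for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumOfSquares {

        // Each mapper keeps one running sum and emits it once, in cleanup().
        public static class SquareSumMapper
                extends Mapper<LongWritable, Text, NullWritable, DoubleWritable> {
            private double partialSum = 0.0;

            @Override
            protected void map(LongWritable offset, Text line, Context context) {
                double x = Double.parseDouble(line.toString().trim());
                partialSum += x * x;
            }

            @Override
            protected void cleanup(Context context)
                    throws IOException, InterruptedException {
                // One record per mapper, all under the same (null) key.
                context.write(NullWritable.get(), new DoubleWritable(partialSum));
            }
        }

        // The single reducer only sees n partial sums (n = number of mappers).
        public static class SumReducer
                extends Reducer<NullWritable, DoubleWritable, NullWritable, DoubleWritable> {
            @Override
            protected void reduce(NullWritable key, Iterable<DoubleWritable> partials,
                                  Context context) throws IOException, InterruptedException {
                double total = 0.0;
                for (DoubleWritable p : partials) {
                    total += p.get();
                }
                context.write(key, new DoubleWritable(total));
            }
        }
    }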
I think your analysis of the specific use cases you bring up is spot on. These use cases still fall within the rather inclusive scope of what you can do with Hadoop, and there are certainly other things that Hadoop just wasn't designed to handle. If I had to solve the same problem, I would follow your first approach unless I knew the data was too big, in which case I'd follow your two-step approach.

Why is MapReduce in CouchDB called "incremental"?

I am reading the O'Reilly CouchDB book. I am puzzled by the reduce/rereduce/incremental-MapReduce part on page 64. Too much is left to rhetoric in the O'Reilly book with the sentence:
If you're interested in pushing the edge of CouchDB's incremental reduce functionality, have a look at Google's paper on Sawzall, ...
If I understand the word "incremental" correctly, it refers to some sort of addition operation on the B-tree data structure. I cannot yet see why it is somehow special compared to typical map-reduce, probably because I do not understand it yet. CouchDB mentions that the map function has no side effects - does that hold true for reduce too?
Why is MapReduce in CouchDB called "incremental"?
Helper questions
Explain the quote about incremental MapReduce with Sawzall.
Why two terms for the same thing i.e. reduction? Reduce and re-reduce?
References
A Google paper about Sawzall.
Introduction to CouchDB views in the CouchDB wiki and a lot of blurry blog references.
CouchDB O'Reilly book
This page that you linked explained it.
The view (which is the whole point of map reduce in CouchDB) can be updated by re-indexing only the documents that have changed since the last index update. That's the incremental part.
This can be achieved by requiring the reduce function to be referentially transparent, which means that it always returns the same output for a given input.
The reduce function also must be commutative and associative over the array of input values, which means that if you run the reducer on the output of that same reducer, you will receive the same result. On that wiki page it is expressed as:
f(Key, Values) == f(Key, [ f(Key, Values) ] )
Rereduce is where you take the output from several reducer calls and run it through the reducer again. This is sometimes required because CouchDB sends data through the reducer in batches, so not all the values for a key that needs to be reduced will necessarily be sent through in one shot.
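CouchDB reduce functions are normally written in JavaScript; the little Java sketch below is not CouchDB code, it just illustrates the batching property with sum as the reduce function, i.e. why f(Key, Values) == f(Key, [ f(Key, Values) ] ) has to hold:

    import java.util.Arrays;
    import java.util.List;

    public class RereduceSketch {
        // Hypothetical stand-in for a reduce function: sums its inputs.
        static long reduce(List<Long> values) {
            return values.stream().mapToLong(Long::longValue).sum();
        }

        public static void main(String[] args) {
            List<Long> all = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L);

            // One-shot reduce over all values.
            long direct = reduce(all);

            // Batched reduce followed by a rereduce over the partial results,
            // as happens when values arrive in separate B-tree chunks.
            long partial1 = reduce(all.subList(0, 3));
            long partial2 = reduce(all.subList(3, 6));
            long rereduced = reduce(Arrays.asList(partial1, partial2));

            // Because sum is commutative and associative, both paths agree.
            System.out.println(direct + " == " + rereduced);   // 21 == 21
        }
    }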
Just to add slightly to what user1087981 said, the reduce functionality is incremental because of the way the reduce process is performed by CouchDB.
CouchDB uses the B-tree that it creates from the view function, and in essence it performs the reduce calculations over clumps of values. The O'Reilly guide has a very simple mockup of such a B-tree, showing the leaf nodes for the example in the section you quoted from.
So, why is this incremental? Well, the final reduce is only performed at query time, and the intermediate reduce calculations are stored in the B-tree view index. So, let's say you add a new value to your DB that is another "fr" value. The calculations for the leaf nodes that don't contain "fr" don't need to be redone; the new "fr" value is added, and the reduce function is recalculated only for the leaf node it belongs to.
Then at query time the final (rereduce=true) calculation is performed on the indexed values, and the final value is returned. You can see that this incremental nature of reduce keeps the recalculation time proportional only to the new values being added, not to the size of the existing data set.
Having no side effects is another important part of this process. If, for example, your reduce function relied on some other state being maintained as you walked through all the values, that might work for the very first run, but when a new value is added and an incremental reduce calculation is made, that state would no longer be available - and so it would fail to produce the correct result. This is why reduce functions need to be side-effect free, or as user1087981 puts it, "referentially transparent".

Sorting the values before they are sent to the reducer

I'm thinking about building a small test application in Hadoop to get the hang of the system.
The application I have in mind will be in the realm of doing statistics.
I want my reducer function to return "the 10 worst values for each key" (where I must assume the possibility of a huge number of values for some keys).
What I have planned is that the values that go into my reducer will basically be the combination of "The actual value" and "The quality/relevance of the actual value".
Based on the relevance I "simply" want to take the 10 worst/best values and output them from the reducer.
How do I go about doing that (assuming a huge number of values for a specific key)?
Is there a way that I can sort all values BEFORE they are sent into the reducer (and simply stop reading the input when I have read the first 10) or must this be done differently?
Can someone here point me to a piece of example code I can have a look at?
Update: I found two interesting Jira issues HADOOP-485 and HADOOP-686.
Anyone has a code fragment on how to use this in the Hadoop 0.20 API?
This sounds definitely like a secondary-sort problem. Take a look at "Hadoop: The Definitive Guide" if you like; it's from O'Reilly and you can also access it online. They describe a pretty good implementation there.
I implemented it myself too. Basically it works this way:
The partitioner takes care that all key-value pairs with the same key go to one single reducer. Nothing special here.
But there is also the grouping comparator, which forms the groups. One group is actually passed as an iterator to one reduce() call, so a partition can contain multiple groups; the number of partitions, however, should equal the number of reducers. The grouping also allows you to do some sorting, since it implements a compareTo method.
With this method you can control that the 10 best/worst/highest/lowest values reach the reducer first, so after you have read those 10 values you can leave the reduce method without any further iterations.
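As a rough sketch of the driver wiring for such a secondary sort (CompositeKey and the comparator/partitioner classes below are placeholders you would implement along the lines of the book's example):

    // Driver fragment only; the referenced classes are placeholders.
    job.setMapOutputKeyClass(CompositeKey.class);          // (natural key, score)
    job.setPartitionerClass(NaturalKeyPartitioner.class);  // partition on the natural key only
    job.setSortComparatorClass(ScoreSortComparator.class); // order composite keys by score
    job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); // one reduce() per natural key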
Hope that was helpful :-)
It sounds like you want to use a Combiner, which defines what to do with the values you create on the map side before they are sent to the reducer, but after they are grouped by key.
The combiner is often set to just be the reducer class (so you reduce on the map side, and then again on the reduce side).
Take a look at how the wordCount example uses the combiner to pre-compute partial counts:
http://wiki.apache.org/hadoop/WordCount
Update
Here's what I have in mind for your problem; it's possible I misunderstood what you are trying to do, though.
Every mapper emits <key, {score, data}> pairs.
The combiner gets a partial set of these pairs, <key, [set of {score, data}]>, does a local sort (still on the mapper nodes), and outputs <key, [sorted set of top 10 local {score, data}]> pairs.
The reducer will get <key, [set of top-10 sets]> -- all it has to do is perform the merge step of a sort-merge (no sorting needed) over the members of the value sets, and stop merging once the first 10 values have been pulled.
update 2
So, now that we know that the rank is cumulative and, as a result, you can't filter the data early using combiners, the only thing to do is what you suggested -- get a secondary sort going. You've found the right tickets; there is an example of how to do this in Hadoop 0.20 in src/examples/org/apache/hadoop/examples/SecondarySort.java (or, if you don't want to download the whole source tree, you can look at the example patch in https://issues.apache.org/jira/browse/HADOOP-4545 )
If I understand the question properly, you'll need to use a TotalOrderPartitioner.
