I am reading the O'Reilly CouchDB book. I am puzzled by the reduce/re-reduce/incremental-MapReduce part on page 64. Too much is left to rhetoric in the O'Reilly book, with the sentence
If you're interested in pushing the edge of CouchDB's incremental reduce functionality, have a look at Google's paper on Sawzall, ...
If I understand the word "incremental" correctly, it refers to some sort of addition operation in the B-tree data structure. I cannot yet see why it is somehow special compared to typical MapReduce; probably I do not understand it yet. CouchDB mentions that the map function has no side effects - does that hold true for reduce too?
Why is MapReduce in CouchDB is called "incremental"?
Helper questions
Explain the quote about incremental MapReduce with Sawzall.
Why two terms for the same thing i.e. reduction? Reduce and re-reduce?
References
A Google paper about Sawzall.
Introduction to CouchDB views in the CouchDB wiki and a lot of blurry blog references.
CouchDB O'Reilly book
This page that you linked explained it.
The view (which is the whole point of map reduce in CouchDB) can be updated by re-indexing only the documents that have changed since the last index update. That's the incremental part.
This can be achieved by requiring the reduce function to be referentially transparent, which means that it always returns the same output for a given input.
The reduce function must also be commutative and associative for the array value input, which means that if you run the reducer on the output of that same reducer, you will receive the same result. The wiki page expresses it as:
f(Key, Values) == f(Key, [ f(Key, Values) ] )
Rereduce is where you take the output from several reducer calls and run that through the reducer again. This is sometimes required because CouchDB sends stuff through the reducer in batches, so sometimes not all keys that need to be reduced will be sent through in one shot.
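To make the property above concrete, here is a minimal Python sketch (illustrative only, not CouchDB's actual API) of a sum reducer that satisfies it: reducing everything at once gives the same answer as reducing in batches and then re-reducing the partial results.

```python
def reduce_fn(keys, values, rereduce=False):
    # A sum reducer: works the same whether it receives raw mapped
    # values (rereduce=False) or previously reduced partial results
    # (rereduce=True), because addition is commutative and associative.
    return sum(values)

values = [1, 2, 3, 4, 5, 6]

# Reduce everything in one shot.
full = reduce_fn(None, values)

# Reduce in two batches, then re-reduce the partial results --
# this mirrors what happens when values arrive in separate batches.
partial_a = reduce_fn(None, values[:3])
partial_b = reduce_fn(None, values[3:])
rereduced = reduce_fn(None, [partial_a, partial_b], rereduce=True)

assert full == rereduced == 21
```

A reducer like `values.length` would fail this test: re-running it on its own output gives 2, not 6, which is exactly why such functions break under rereduce.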
Just to add slightly to what user1087981 said, the reduce functionality is incremental because of the way the reduce process is performed by CouchDB.
CouchDB uses the B-Tree that it creates from the view function, and in essence it performs the reduce calculations in clumps of values. Here's a very simple mockup of a B-Tree from the O'Reilly Guide showing the leaf nodes for the example in the section you quoted from.
So, why is this incremental? Well, the final reduce is only performed at query time and all the reduce calculations are stored in the B-Tree view index. So, let's say that you add a new value to your DB that is another "fr" value. The calculations for the 1st, 2nd and 4th nodes above don't need to be redone. The new "fr" value is added, and the reduce function is re-calculated only for that 3rd leaf node.
Then at query time the final (rereduce=true) calculation is performed on the indexed values, and the final value is returned. You can see that this incremental nature of reduce makes the recalculation time proportional only to the new values being added, not to the size of the existing data set.
Having no side-effects is another important part of this process. If, for example, your reduce functions relied on some other state being maintained as you walked through all the values, then that might work for the very first run, but when a new value is added and an incremental reduce calculation is made, it wouldn't have that same state available to it - and so it would fail to produce the correct result. This is why reduce functions need to be side-effect free, or as user1087981 puts it, "referentially transparent".
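As a toy model of the incremental mechanics described above (not CouchDB's actual B-tree code), partial reductions can be cached per leaf, so adding a value recomputes only the affected leaf plus a cheap final pass over the cached partials:

```python
# Illustrative sketch: each "leaf" holds raw values, and its partial
# reduction is cached (as CouchDB stores reductions in inner B-tree nodes).
leaves = {0: [3, 1], 1: [4, 1], 2: [5, 9], 3: [2, 6]}
cache = {i: sum(vs) for i, vs in leaves.items()}  # stored partials

def query():
    # Final (rereduce=true) pass over the cached partials at query time.
    return sum(cache.values())

before = query()

# A new value lands in leaf 2: only that leaf's partial is recomputed;
# the other leaves' cached reductions are untouched.
leaves[2].append(7)
cache[2] = sum(leaves[2])

assert query() == before + 7
```

The cost of the update is bounded by the size of one leaf plus the number of cached partials, not by the total number of values, which is the "incremental" part.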
Related
The programming model MapReduce consists of 2 procedures, map and reduce. Why do we need the map part, when we can simply do the mapping inside reduce function.
Consider the following pseudocode:
result = my_list.map(my_mapper).reduce(my_reducer);
This could be shortened to
result = my_list.reduce(lambda x : my_reducer(my_mapper(x)));
How can the 1st approach be more preferred than the 2nd one, while the 1st approach requires one more pass through the data? Is my code example oversimplifying?
Well, if you refer to Hadoop-style MapReduce, it is actually map-shuffle-reduce, where the shuffle is the reason map and reduce are separated. At a slightly higher level you can think about data locality. Each key-value pair passed through map can generate zero or more key-value pairs. To be able to reduce these you have to ensure that all values for a given key are available to a single reducer, hence the shuffle. What is important is that pairs emitted from a single input pair can be processed by different reducers.
It is possible to use patterns like map-side aggregations or combiners, but at the end of the day it is still (map-)reduce-shuffle-reduce.
Assuming data locality is not an issue, higher-order functions like map and reduce provide an elegant abstraction layer. Finally, it is a declarative API: a simple expression like xs.map(f1).reduce(f2) describes only the what, not the how. Depending on the language or context these can be eagerly or lazily evaluated, operations can be squashed, and in more complex scenarios reordered and optimized in many different ways.
Regarding your code: even if the signatures were correct, it wouldn't really reduce the number of times you pass over the data. Moreover, if you push map into the aggregation, then the arguments passed to the aggregation function are no longer of the same type. That means either a sequential fold or much more complex merging logic.
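A small Python sketch of the point about signatures and types, using functools.reduce as a stand-in (the names my_mapper/my_reducer are taken from the question's pseudocode):

```python
from functools import reduce

my_list = [1, 2, 3, 4]
my_mapper = lambda x: x * x
my_reducer = lambda acc, x: acc + x  # takes TWO args: accumulator and element

# The map/reduce pipeline: still a single pass here, since map is lazy.
result = reduce(my_reducer, map(my_mapper, my_list))

# The "shortened" version from the question doesn't type-check as written:
# my_reducer takes two arguments, so the mapper can only be applied to the
# element argument -- and the accumulator must share a type with the mapped
# elements, which forces a sequential left-to-right fold.
result2 = reduce(lambda acc, x: my_reducer(acc, my_mapper(x)),
                 my_list[1:], my_mapper(my_list[0]))

assert result == result2 == 30
```

Once the fold is forced to be sequential like this, the parallel tree-shaped reduction that makes MapReduce scale is no longer available.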
At a high level, MapReduce is all about processing in parallel. Even though the reducers work on map output, in practical terms each reducer will get only part of the data, and that is possible only in the first approach.
In your second approach, your reducer actually needs the entire output of the mapper, which defeats the idea of parallelism.
This is related to Cassandra time series modeling when time can go backward, but I think I have a better scenario to explain why the topic is important.
Imagine I have a simple table
CREATE TABLE measures(
    key text,
    measure_time timestamp,
    value int,
    PRIMARY KEY (key, measure_time)
) WITH CLUSTERING ORDER BY (measure_time DESC);
The purpose of the clustering key is to have the data arranged in decreasing timestamp order. This leads to very efficient range-based queries that, for a given key, result in sequential disk reads (which are intrinsically fast).
Many times I have seen suggestions to use a generated timeuuid as the timestamp value (using now()), which is obviously intrinsically ordered. But you can't always do that. It seems to me a very common pattern; you can't use it if:
1) your user wants to query on the actual time when the measure was taken, not the time when the measure was written;
2) you use multiple writing threads.
So, I want to understand what happens if I write data in an unordered fashion (with respect to measure_time column).
I have personally tested that if I insert timestamp-unordered values, Cassandra indeed reports them to me in a timestamp-ordered fashion when I run a select.
But what happens "under the hood"? In my opinion, it is impossible that data are still ordered on disk. At some point in fact data need to be flushed on disk. Imagine you flush a data set in the time range [0,10]. What if the next data set to flush has measures with timestamp=9? Are data re-arranged on disk? At what cost?
Hope I was clear, I couldn't find any explanation about this on Datastax site but I admit I'm quite a novice on Cassandra. Any pointers appreciated
Sure. Once written, an SSTable file is immutable. Your timestamp=9 will end up in another SSTable, and C* will have to merge and sort data from both SSTables if you request both timestamp=10 and timestamp=9. That would be less efficient than reading from a single SSTable.
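A toy Python sketch of that read path (not Cassandra's actual code): each SSTable is a run of (measure_time, value) rows already sorted in the clustering order, and the read merge-sorts the runs, which is cheap but still more work than reading one file:

```python
import heapq

# Two immutable "SSTables" for the same partition key, each sorted by
# measure_time DESC per the clustering order. The second flush contains
# an out-of-order timestamp (9) relative to the first flush's [10, 8, 5].
sstable_1 = [(10, 'a'), (8, 'b'), (5, 'c')]   # (measure_time, value)
sstable_2 = [(9, 'd'), (2, 'e')]

# Read time: merge-sort the pre-sorted runs. Nothing is rearranged on
# disk; the merge happens per query (or later, during compaction).
merged = list(heapq.merge(sstable_1, sstable_2, key=lambda r: -r[0]))

assert [t for t, _ in merged] == [10, 9, 8, 5, 2]
```

This is also why compaction helps: merging the two files into one new SSTable pays the merge cost once instead of on every read.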
The Compaction process may merge those two SSTables into new single one. See http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
And try to avoid very wide rows/partitions, which will be the case if you have a lot of measurements (i.e. a lot of measure_time values) for a single key.
Rank amateur at both Map/Reduce and CouchDB here. I have a CouchDB populated with ~600,000 rows of data which indicate views of records. My desire is to produce a graph showing hits per record, over the entire data set.
I have implemented Map/Reduce functions to do the grouping, as so:
function(doc) {
emit(doc.id, doc);
}
and:
function(key, values) {
return values.length;
}
Now because there's still a fair amount of reduced values and we only want, say, 100 data points on the graph, this isn't very usable. Plus, it takes forever to run.
I could just retrieve every Xth row, but what would be ideal would be to pass these reduced results back into another reduce function which takes the mean of its values, so I eventually get a nice set of, say, 100 results, which are useful for throwing into a high-level overview graph to see the distribution of hits.
Is this possible? (and if so, what would the keys be?) Or have I just messed something up in my MapReduce code that's making it grossly non-performant, thus allowing me to do this in my application code? There are only 33,500 results returned.
Thanks,
Matt
To answer my own question:
According to this article, CouchDB doesn't support passing Map/Reduce output as input to another Map/Reduce function, although the article notes that other projects such as disco do support this.
Custom server-side processing can be performed by way of CouchDB lists - like, for example, sorting by value.
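Since the chaining isn't supported, the downsampling from the original question can be done in application code instead. A hedged sketch (illustrative Python; the function and row names are made up) that buckets the reduced rows into ~100 averaged points:

```python
def downsample(rows, points=100):
    """Bucket a list of (key, hit_count) rows into roughly `points`
    buckets and average each bucket -- done client-side, since CouchDB
    cannot feed one reduce's output into another reduce."""
    if len(rows) <= points:
        return rows
    bucket = len(rows) / points
    out = []
    for i in range(points):
        chunk = rows[int(i * bucket):int((i + 1) * bucket)]
        counts = [c for _, c in chunk]
        # Label each bucket with its first key; average the hit counts.
        out.append((chunk[0][0], sum(counts) / len(counts)))
    return out

rows = [(f"doc{i}", i % 7) for i in range(33500)]
assert len(downsample(rows)) == 100
```

With only ~33,500 reduced rows, pulling them once with group=true and bucketing locally like this is usually cheap enough for a graph.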
I'm thinking about building a small testing application in hadoop to get the hang of the system.
The application I have in mind will be in the realm of doing statistics.
I want to have "The 10 worst values for each key" from my reducer function (where I must assume the possibility a huge number of values for some keys).
What I have planned is that the values that go into my reducer will basically be the combination of "The actual value" and "The quality/relevance of the actual value".
Based on the relevance I "simply" want to take the 10 worst/best values and output them from the reducer.
How do I go about doing that (assuming a huge number of values for a specific key)?
Is there a way that I can sort all values BEFORE they are sent into the reducer (and simply stop reading the input when I have read the first 10) or must this be done differently?
Can someone here point me to a piece of example code I can have a look at?
Update: I found two interesting Jira issues HADOOP-485 and HADOOP-686.
Anyone has a code fragment on how to use this in the Hadoop 0.20 API?
Sounds definitely like a secondary sort problem. Take a look at "Hadoop: The Definitive Guide" (O'Reilly) if you like; you can also access it online. It describes a pretty good implementation.
I implemented it by myself too. Basically it works this way:
The partitioner will care for all the key-value-pairs with the same key going to one single reducer. Nothing special here.
But there is also the GroupingComparator, which forms groupings. One group is actually passed as an iterator to one reduce() call, so a partition can contain multiple groupings. The number of partitions should equal the number of reducers. The grouping also allows some sorting, as it implements a compareTo method.
With this method, you can ensure that the 10 best/worst/highest/lowest (however you define them) keys reach the reducer first. So after you have read these 10 keys, you can leave the reduce method without any further iterations.
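A minimal Python sketch of the idea, standing in for the Partitioner/GroupingComparator machinery (names and data are made up): sort mapper output by a composite (key, relevance) key, group by the natural key, and stop after reading 10 values:

```python
from itertools import groupby, islice

# Mapper output: (key, (relevance, value)) pairs, arriving in any order.
rels = [7, 2, 9, 1, 5, 3, 8, 4, 6, 0, 11, 10]
pairs = [('sensor', (rel, f'v{rel}')) for rel in rels]

# Secondary sort: the framework sorts by the composite (key, relevance)
# key before reduce, so each reduce() call sees its values worst-first.
pairs.sort(key=lambda kv: (kv[0], kv[1][0]))

for key, group in groupby(pairs, key=lambda kv: kv[0]):
    # Take the 10 worst values, then stop -- no need to read the rest.
    worst10 = list(islice((v for _, v in group), 10))
    assert [r for r, _ in worst10] == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In real Hadoop the sort is done by the framework during the shuffle rather than in memory, which is what makes this work even when a key has a huge number of values.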
Hope that was helpful :-)
It sounds like you want to use a Combiner, which defines what to do with the values you create on the map side before they are sent to the Reducer, but after they are grouped by key.
The combiner is often set to just be the reducer class (so you reduce on the map side, and then again on the reduce side).
Take a look at how the wordCount example uses the combiner to pre-compute partial counts:
http://wiki.apache.org/hadoop/WordCount
Update
Here's what I have in mind for your problem; it's possible I misunderstood what you are trying to do, though.
Every mapper emits <key, {score, data}> pairs.
The combiner gets a partial set of these pairs: <key, [set of {score, data}]>, does a local sort (still on the mapper nodes), and outputs <key, [sorted set of top 10 local {score, data}]> pairs.
The reducer will get <key, [set of top-10-sets]> -- all it has to do is perform the merge step of sort-merge (no sorting needed) for each of the members of the value sets, and stop merging when the first 10 values are pulled.
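Here's an illustrative Python sketch of that merge step (hypothetical data; higher score taken as better). Each combiner emits a locally sorted top-10 list, and the reducer performs only the merge step of merge-sort, stopping at 10:

```python
import heapq
from itertools import islice

# Each "combiner" has already produced a locally sorted top-10 list of
# (score, data) pairs for the same key, best (highest score) first.
combiner_outputs = [
    sorted(((s, f'm0-{s}') for s in range(0, 40, 3)), reverse=True)[:10],
    sorted(((s, f'm1-{s}') for s in range(1, 40, 4)), reverse=True)[:10],
    sorted(((s, f'm2-{s}') for s in range(2, 40, 5)), reverse=True)[:10],
]

# Reducer: merge the pre-sorted lists and stop as soon as the first 10
# values have been pulled. No global sort is ever needed.
top10 = list(islice(heapq.merge(*combiner_outputs, reverse=True), 10))

assert len(top10) == 10
assert all(top10[i][0] >= top10[i + 1][0] for i in range(9))
```

This scheme only works when a global top-10 can be assembled from per-mapper top-10s, i.e. when the score is computed per record rather than cumulatively across records.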
update 2
So, now that we know that the rank is cumulative and, as a result, you can't filter the data early using combiners, the only thing to do is what you suggested -- get a secondary sort going. You've found the right tickets; there is an example of how to do this in Hadoop 0.20 in src/examples/org/apache/hadoop/examples/SecondarySort.java (or, if you don't want to download the whole source tree, you can look at the example patch in https://issues.apache.org/jira/browse/HADOOP-4545 )
If I understand the question properly, you'll need to use a TotalOrderPartitioner.
One of the main examples that is used in demonstrating the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment.
To me sorting simply involves determining the relative position of an element in relationship to all other elements. So sorting involves comparing "everything" with "everything". Your average sorting algorithm (quick, bubble, ...) simply does this in a smart way.
In my mind splitting the dataset into many pieces means you can sort a single piece and then you still have to integrate these pieces into the 'complete' fully sorted dataset. Given the terabyte dataset distributed over thousands of systems I expect this to be a huge task.
So how is this really done? How does this MapReduce sorting algorithm work?
Thanks for helping me understand.
Here are some details on Hadoop's implementation for Terasort:
"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."
So their trick is in the way they determine the keys during the map phase. Essentially they ensure that every value in a single reducer is guaranteed to be 'pre-sorted' against all other reducers.
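An illustrative Python sketch of that partitioning trick (not Hadoop's actual TeraSort code), using sampled boundary keys and bisect:

```python
import bisect
import random

random.seed(0)
keys = [random.randrange(10_000) for _ in range(5_000)]

# Sample N-1 boundary keys so that reducer i receives all keys with
# sample[i-1] <= key < sample[i]; keys below the first boundary go to
# reducer 0, keys at or above the last go to the final reducer.
num_reducers = 4
boundaries = sorted(random.sample(keys, num_reducers - 1))

partitions = [[] for _ in range(num_reducers)]
for k in keys:
    partitions[bisect.bisect_right(boundaries, k)].append(k)

# Each reducer sorts only its own partition; concatenating the sorted
# partitions in reducer order yields the fully sorted dataset.
result = [k for part in partitions for k in sorted(part)]
assert result == sorted(keys)
```

Because every key in partition i is guaranteed to be less than every key in partition i+1, the final "integration" step is just file concatenation, which is why the approach scales to terabytes.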
I found the paper reference through James Hamilton's Blog Post.
Google Reference: MapReduce: Simplified Data Processing on Large Clusters
Appeared in:
OSDI'04: Sixth Symposium on Operating System Design and Implementation,
San Francisco, CA, December, 2004.
That link has a PDF and HTML-Slide reference.
There is also a Wikipedia page with description with implementation references.
Also criticism,
David DeWitt and Michael Stonebraker, pioneering experts in parallel databases and shared nothing architectures, have made some controversial assertions about the breadth of problems that MapReduce can be used for. They called its interface too low-level, and questioned whether it really represents the paradigm shift its proponents have claimed it is. They challenge the MapReduce proponents' claims of novelty, citing Teradata as an example of prior art that has existed for over two decades; they compared MapReduce programmers to Codasyl programmers, noting both are "writing in a low-level language performing low-level record manipulation". MapReduce's use of input files and lack of schema support prevents the performance improvements enabled by common database system features such as B-trees and hash partitioning, though projects such as PigLatin and Sawzall are starting to address these problems.
I had the same question while reading Google's MapReduce paper. #Yuval F 's answer pretty much solved my puzzle.
One thing I noticed while reading the paper is that the magic happens in the partitioning (after map, before reduce).
The paper uses hash(key) mod R as the partitioning example, but this is not the only way to partition intermediate data to different reduce tasks.
Just to add boundary conditions to #Yuval F 's answer to make it complete: suppose min(S) and max(S) are the minimum and maximum keys among the sampled keys; all keys < min(S) are partitioned to one reduce task, and likewise all keys >= max(S) are partitioned to one reduce task.
There is no hard limitation on the sampled keys, like min or max. It's just that the more evenly these R keys are distributed among all the keys, the more "parallel" this distributed system is, and the less likely a reduce operator is to run into a memory overflow issue.
Just guessing...
Given a huge set of data, you would partition the data into some chunks to be processed in parallel (perhaps by record number i.e. record 1 - 1000 = partition 1, and so on).
Assign / schedule each partition to a particular node in the cluster.
Each cluster node will further break (map) the partition into its own mini partition, perhaps by key alphabetical order. So, in partition 1, get me all the things that start with A and output them into mini partition A of x. Create a new A(x) if there is an A(x) already. Replace x with a sequential number (perhaps this is the scheduler's job to do), i.e. give me the next A(x) unique id.
Hand over (schedule) jobs completed by the mapper (previous step) to the "reduce" cluster nodes. The reduce node cluster will then further refine the sort of each A(x) part, which will only happen when all the mapper tasks are done (you can't actually start sorting all the words starting with A while there is still a possibility that another A mini partition is in the making). Output the result in the final sorted partition (i.e. Sorted-A, Sorted-B, etc.)
Once done, combine the sorted partition into a single dataset again. At this point it is just a simple concatenation of n files (where n could be 26 if you are only doing A - Z), etc.
There might be intermediate steps in between... I'm not sure :). I.e. further map and reduce after the initial reduce step.