CouchDB - passing MapReduce results into second MapReduce function - performance

Rank amateur at both Map/Reduce and CouchDB here. I have a CouchDB populated with ~600,000 rows of data which indicate views of records. My desire is to produce a graph showing hits per record, over the entire data set.
I have implemented Map/Reduce functions to do the grouping, as so:
function(doc) {
  emit(doc.id, doc);
}
and:
function(key, values) {
  return values.length;
}
Now, because there is still a fair number of reduced values and we only want, say, 100 data points on the graph, this isn't very usable. Plus, it takes forever to run.
I could just retrieve every Xth row, but what would be ideal would be to pass these reduced results back into another reduce function that takes the mean of its values, so I eventually end up with a nice set of, say, 100 results that are useful for throwing into a high-level overview graph to see the distribution of hits.
Is this possible? (And if so, what would the keys be?) Or have I just messed something up in my MapReduce code that's making it grossly non-performant, so that I could do this in my application code instead? There are only 33,500 results returned.
Thanks,
Matt

To answer my own question:
According to this article, CouchDB doesn't support passing MapReduce output as input to another MapReduce function, although the article notes that other projects such as Disco do support this.
Custom server-side processing can, however, be performed by way of CouchDB list functions, for example sorting by value.
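As a rough illustration of that approach (not from the original answer): a list function can read the reduced rows and do the downsampling server-side. The sketch below assumes a map/reduce view that yields one hit count per record id (e.g. the view in the question, or the built-in _count reduce), queried with group=true; the design-document and function names are hypothetical. It sorts records by hit count and averages them into roughly 100 data points for the overview graph.
// Hypothetical path: /db/_design/hits/_list/downsample/by_record?group=true
function(head, req) {
  // Pull every reduced row: {key: recordId, value: hitCount}
  var rows = [];
  var row;
  while ((row = getRow())) {
    rows.push(row);
  }
  // Sort by hit count, descending
  rows.sort(function(a, b) { return b.value - a.value; });
  // Average the sorted counts into ~100 buckets
  var buckets = 100;
  var size = Math.max(1, Math.ceil(rows.length / buckets));
  var points = [];
  for (var i = 0; i < rows.length; i += size) {
    var sum = 0, n = 0;
    for (var j = i; j < Math.min(i + size, rows.length); j++) {
      sum += rows[j].value;
      n++;
    }
    points.push(sum / n);
  }
  send(JSON.stringify(points));
}
Keep in mind that a list function runs on every request, so this trades view index space for per-request CPU over the ~33,500 grouped rows.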

Related

Spark: Getting N elements from an RDD, .take() or .filterByRange()?

I have an RDD called myRdd: RDD[(Long, String)] (the Long is an index obtained with zipWithIndex()) with a number of elements, but I need to cut it down to a specific number of elements for the final result.
I am wondering which is the better way to do this:
myRdd.take(num)
or
myRdd.filterByRange(0, num)
I don't care about the order of the selected elements, but I do care about the performance.
Any suggestions? Any other way to do this? Thank you!
take is an action, and filterByRange is a transformation. An action sends the results to the driver node, and a transformation does not get executed until an action is called.
The take method will take the first n elements of the RDD and send them back to the driver. filterByRange is a little more sophisticated, since it will take those elements whose keys are between the specified bounds.
I'd say that there are not many differences between them in terms of performance. If you just want to send the results to the driver without caring about the order, use the take method. However, if you want to benefit from the distributed computation and you don't need to send the results back to the driver, use the filterByRange method and then call the action.

Parse: limitations of count()

Anyone who's read the Parse documentation has stumbled upon this:
Caveat: Count queries are rate limited to a maximum of 160 requests per minute. They can also return inaccurate results for classes with more than 1,000 objects. Thus, it is preferable to architect your application to avoid this sort of count operation (by using counters, for example.)
Why is there such a limitation and inaccuracy?
To quote the Parse Engineering blog post Building Scalable Apps on Parse:
Suppose you are building a product catalog. You might want to display the count of products in each category on the top-level navigation screen. If you run a count query for each of these UI elements, they will not run efficiently on large data sets because MongoDB does not use counting B-trees. Instead, we recommend that you use a separate Parse Object to keep track of counts for each category. Whenever a product gets added or deleted, you can increment or decrement the counts in an afterSave or afterDelete Cloud Code handler.
To add on to this, here is another quote by Hector Ramos from the Parse Developers Google Group
Count queries have always been expensive once you throw some constraints in. If you only care about the total size of the collection, you can run a count query without any constraints and that one should be pretty fast, as getting the total number of records is a different problem than counting how many of these match an arbitrary list of constraints. This is just the reality of working with database systems.
The inaccuracy is not due to the 1000 request object limit. The count query will try to get the total number of records regardless of size, but since the operation may take a large amount of time to complete, it is possible that the database has changed during that window and the count value that is returned may no longer be valid.
The recommended way to handle counts is to essentially maintain your own index using before/after save hooks. However, this is also a non-ideal solution because save hooks can arbitrarily fail part way through and (worse) postSave hooks have no error propagation.
The limitation is simply there to stop people using counts too much; in effect, they are just as costly at runtime as full queries.
The inaccuracy is because queries are limited to 1000 result objects (100 by default) and counts have the same hard limit.
You can run a recursive query to build up a count, but it's a crappy option. Hence the only really good option at this point in time (and as far as we can see in the future) is to keep an index of the things you're interested in counting and update the counts when anything changes. You would usually do that with save hooks in cloud code.
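For illustration (this sketch is not from the answers above): a minimal afterSave hook in classic callback-style Parse Cloud Code that maintains such a counter, assuming hypothetical Product and CategoryCounter classes with category and count fields; an afterDelete hook would decrement in the same way.
Parse.Cloud.afterSave("Product", function(request) {
  // Only count newly created objects, not updates to existing ones
  if (request.object.existed()) {
    return;
  }
  var category = request.object.get("category");
  var query = new Parse.Query("CategoryCounter");
  query.equalTo("category", category);
  query.first().then(function(counter) {
    if (!counter) {
      // First product in this category: create its counter row
      counter = new Parse.Object("CategoryCounter");
      counter.set("category", category);
      counter.set("count", 0);
    }
    counter.increment("count"); // increment is applied atomically server-side
    return counter.save();
  });
});
As noted above, afterSave handlers have no error propagation, so a failed increment can silently drift the counter; that is exactly the weakness the answer points out.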

How to partition a file into smaller pieces for performing KNN in Hadoop MapReduce

In a KNN-like algorithm we need to load the model data into a cache for predicting the records.
Here is the example for KNN.
So if the model is a large file, say 1 or 2 GB, we will not be able to load it into the distributed cache.
Example: in order to predict one outcome, we need to find the distance between that single record and all the records in the model result, and find the minimum distance. So we need to have the model result in our hands, and if it is a large file it cannot be loaded into the distributed cache for finding the distance.
One way is to split/partition the model result into several files, perform the distance calculation for all records in each file, and then find the minimum distance and the most frequent class label to predict the outcome.
How can we partition the file and perform the operation on these partitions?
i.e. 1st record <Distance> file1, file2, ..., fileN
2nd record <Distance> file1, file2, ..., fileN
This is what came to my mind.
Is there any other way?
Any pointers would help me.
I think the way you partition the data mainly depends on the data itself.
Given that you have a model with a bunch of rows, and that you want to find the k closest ones to your input data, the trivial solution is to compare them one by one. This can be slow because it means going through 1-2 GB of data millions of times (I assume you have a large number of records that you want to classify, otherwise you don't need Hadoop).
That is why you need to prune your model efficiently (your partitioning) so that you can compare only those rows that are most likely to be the closest. This is a hard problem and requires knowledge of the data you operate on.
Additional tricks that you can use to squeeze out performance are:
Pre-sorting the input data so that the input items that will be compared against the same partition come together. Again, this depends on the data you operate on.
Using random-access indexed files (like Hadoop's MapFiles) to find the data faster and cache it.
In the end it may actually be easier to store your model in a Lucene index, so you can achieve the effect of partitioning by looking up the index. Pre-sorting the data is still helpful there.

Why is MapReduce in CouchDB called "incremental"?

I am reading the O'Reilly CouchDB book. I am puzzled by the reduce/re-reduce/incremental-MapReduce part on page 64. Too much is left to rhetoric in the O'Reilly book, with the sentence:
If you're interested in pushing the edge of CouchDB's incremental reduce functionality, have a look at Google's paper on Sawzall, ...
If I understand the word "incremental" correctly, it refers to some sort of addition operation on the B-tree data structure. I cannot yet see why it is special compared to typical MapReduce, probably because I don't yet understand it. CouchDB mentions that the map function has no side effects; does that hold true for reduce too?
Why is MapReduce in CouchDB called "incremental"?
Helper questions
Explain the quote about incremental MapReduce with Sawzall.
Why are there two terms for the same thing, i.e. reduction? Reduce and re-reduce?
References
A Google paper about Sawzall.
Introduction to CouchDB views in the CouchDB wiki and a lot of blurry blog references.
CouchDB O'Reilly book
The page that you linked explains it.
The view (which is the whole point of map reduce in CouchDB) can be updated by re-indexing only the documents that have changed since the last index update. That's the incremental part.
This can be achieved by requiring the reduce function to be referentially transparent, which means that it always returns the same output for a given input.
The reduce function also must be commutative and associative for the array value input, which means that if you run the reducer on the output of that same reducer, you will receive the same result. On that wiki page it is expressed as:
f(Key, Values) == f(Key, [ f(Key, Values) ] )
Rereduce is where you take the output from several reducer calls and run it through the reducer again. This is sometimes required because CouchDB sends values through the reducer in batches, so sometimes not all of the values that need to be reduced for a key will be sent through in one shot.
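As a small illustration (not from the wiki page itself), a counting reduce written for CouchDB's JavaScript view server has to handle both modes; the built-in _count reduce does the equivalent natively:
function(keys, values, rereduce) {
  if (rereduce) {
    // values are partial counts from earlier reduce calls: add them up
    return sum(values);
  }
  // values are the raw mapped values for this batch: count them
  return values.length;
}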
Just to add slightly to what user1087981 said, the reduce functionality is incremental because of the way the reduce process is performed by CouchDB.
CouchDB uses the B-Tree that it creates from the view function, and in essence it performs the reduce calculations in clumps of values. Here's a very simple mockup of a B-Tree from the O'Reilly Guide showing the leaf nodes for the example in the section you quoted from.
So, why is this incremental? Well, the final reduce is only performed at query time and all the reduce calculations are stored in the B-Tree view index. So, let's say that you add a new value to your DB that is another "fr" value. The calculations for the 1st, 2nd and 4th nodes above don't need to be redone. The new "fr" value is added, and the reduce function is re-calculated only for that 3rd leaf node.
Then at query time the final (rereduce=true) calculation is performed on the indexed values, and the final value is returned. You can see that this incremental nature of reduce keeps the recalculation time proportional to the new values being added, not to the size of the existing data set.
Having no side effects is another important part of this process. If, for example, your reduce function relied on some other state being maintained as you walked through all the values, then that might work for the very first run, but when a new value is added and an incremental reduce calculation is made, it wouldn't have that same state available to it, and so it would fail to produce the correct result. This is why reduce functions need to be side-effect free, or as user1087981 puts it, "referentially transparent".

Sorting the values before they are sent to the reducer

I'm thinking about building a small testing application in hadoop to get the hang of the system.
The application I have in mind will be in the realm of doing statistics.
I want to have "the 10 worst values for each key" from my reducer function (where I must assume the possibility of a huge number of values for some keys).
What I have planned is that the values that go into my reducer will basically be the combination of "The actual value" and "The quality/relevance of the actual value".
Based on the relevance I "simply" want to take the 10 worst/best values and output them from the reducer.
How do I go about doing that (assuming a huge number of values for a specific key)?
Is there a way that I can sort all values BEFORE they are sent into the reducer (and simply stop reading the input when I have read the first 10) or must this be done differently?
Can someone here point me to a piece of example code I can have a look at?
Update: I found two interesting Jira issues HADOOP-485 and HADOOP-686.
Does anyone have a code fragment showing how to use this with the Hadoop 0.20 API?
Sounds definitely like a secondary sort problem. Take a look at "Hadoop: The Definitive Guide" from O'Reilly if you like; you can also access it online. There they describe a pretty good implementation.
I implemented it by myself too. Basically it works this way:
The partitioner takes care that all the key-value pairs with the same key go to one single reducer. Nothing special here.
But there is also the GroupingComparator, which forms the groupings. One group is actually passed as an iterator to one reduce() call, so a partition can contain multiple groupings. The number of partitions should be equal to the number of reducers. The grouping also allows you to do some sorting, since it implements a compareTo method.
With this method you can control that the 10 best/worst/highest/lowest keys reach the reducer first. So after you have read these 10 keys, you can leave the reduce method without any further iterations.
Hope that was helpful :-)
It sounds like you want to use a Combiner, which defines what to do with the values you create on the map side before they are sent to the Reducer, but after they are grouped by key.
The combiner is often set to just be the reducer class (so you reduce on the map side, and then again on the reduce side).
Take a look at how the wordCount example uses the combiner to pre-compute partial counts:
http://wiki.apache.org/hadoop/WordCount
Update
Here's what I have in mind for your problem; it's possible I misunderstood what you are trying to do, though.
Every mapper emits <key, {score, data}> pairs.
The combiner gets a partial set of these pairs, <key, [set of {score, data}]>, does a local sort (still on the mapper nodes), and outputs <key, [sorted set of top 10 local {score, data}]> pairs.
The reducer will get <key, [set of top-10-sets]> -- all it has to do is perform the merge step of sort-merge (no sorting needed) for each of the members of the value sets, and stop merging when the first 10 values are pulled.
Update 2
So, now that we know that the rank is cumulative and, as a result, you can't filter the data early by using combiners, the only thing to do is what you suggested -- get a secondary sort going. You've found the right tickets; there is an example of how to do this in Hadoop 0.20 in src/examples/org/apache/hadoop/examples/SecondarySort.java (or, if you don't want to download the whole source tree, you can look at the example patch in https://issues.apache.org/jira/browse/HADOOP-4545).
If I understand the question properly, you'll need to use a TotalOrderPartitioner.
