Spark: Getting N elements from an RDD, .take() or .filterByRange()? - performance

I have an RDD called myRdd: RDD[(Long, String)] (the Long is an index obtained with zipWithIndex()) with a number of elements, but I need to cut it down to a specific number of elements for the final result.
I am wondering which is the better way to do this:
myRdd.take(num)
or
myRdd.filterByRange(0, num)
I don't care about the order of the selected elements, but I do care about the performance.
Any suggestions? Any other way to do this? Thank you!

take is an action, and filterByRange is a transformation. An action sends the results to the driver node, and a transformation does not get executed until an action is called.
The take method will take the first n elements of the RDD and send them back to the driver. filterByRange is a little more sophisticated, since it keeps those elements whose key is between the specified bounds.
I'd say there is not much difference in performance between them. If you just want to send the results to the driver, without caring about the order, use the take method. However, if you want to benefit from the distributed computation and you don't need to send the results back to the driver, use the filterByRange method and then call whatever action you need.
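A minimal sketch of both approaches, assuming an RDD shaped like the one in the question (the sample data and num are made up for illustration). Note that filterByRange bounds are inclusive:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("take-vs-filterByRange").setMaster("local[*]"))

// Same shape as in the question: a 0-based Long index as the key
val myRdd: RDD[(Long, String)] =
  sc.parallelize(Seq("a", "b", "c", "d", "e")).zipWithIndex().map(_.swap)

val num = 3

// take is an action: it runs only as many tasks as it needs and returns num elements to the driver
val taken: Array[(Long, String)] = myRdd.take(num)

// filterByRange is a transformation: it keeps pairs whose key falls in [0, num - 1]
// and stays distributed until some action (count, saveAsTextFile, ...) is called
val ranged: RDD[(Long, String)] = myRdd.filterByRange(0L, num - 1L)
println(ranged.count()) // 3

One caveat: filterByRange can skip whole partitions only when the RDD is range-partitioned (for example after sortByKey); an RDD fresh out of zipWithIndex is not, so there it still visits every partition and simply filters.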

Related

CouchDB - passing MapReduce results into second MapReduce function

Rank amateur at both Map/Reduce and CouchDB here. I have a CouchDB populated with ~600,000 rows of data which indicate views of records. My desire is to produce a graph showing hits per record, over the entire data set.
I have implemented Map/Reduce functions to do the grouping, like so:
function(doc) {
  emit(doc.id, doc);
}
and:
function(key, values) {
  return values.length;
}
Now because there's still a fair number of reduced values and we only want, say, 100 data points on the graph, this isn't very usable. Plus, it takes forever to run.
I could just retrieve every Xth row, but what would be ideal would be to pass these reduced results back into another reduce function that takes the mean of its values, so I eventually get a nice set of, say, 100 results, which are useful for throwing into a high-level overview graph to see the distribution of hits.
Is this possible? (and if so, what would the keys be?) Or have I just messed something up in my MapReduce code that's making it grossly non-performant, and fixing it would let me do this in my application code? There are only 33,500 results returned.
Thanks,
Matt
To answer my own question:
According to this article, CouchDB doesn't support passing Map/Reduce output as input to another Map/Reduce function, although the article notes that other projects such as disco do support this.
Custom server-side processing can be performed by way of CouchDB lists - like, for example, sorting by value.
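Since chained views aren't an option, one workaround (the one hinted at in the question) is to downsample the already-reduced per-record counts on the client into roughly 100 averaged buckets. A minimal sketch of that bucketing step, written in Scala purely for illustration; counts stands in for whatever the grouped view returns:

// counts: hits per record, as returned by the grouped reduce (about 33,500 values)
def downsample(counts: Seq[Int], buckets: Int = 100): Seq[Double] = {
  val bucketSize = math.max(1, math.ceil(counts.size.toDouble / buckets).toInt)
  counts
    .grouped(bucketSize)                           // split into roughly `buckets` chunks
    .map(chunk => chunk.sum.toDouble / chunk.size) // mean of each chunk
    .toSeq
}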

What's a simple way to design a memoization system with limited memory?

I am writing a manual computation memoization system (ugh, in Matlab). The straightforward parts are easy:
A way to put data into the memoization system after performing a computation.
A way to query and get data out of the memoization.
A way to query the system for all the 'keys'.
These parts are not so much in doubt. The problem is that my computer has a finite amount of memory, so sometimes the 'put' operation will have to dump some objects out of memory. I am worried about 'cache misses', so I would like some relatively simple system for dropping the memoized objects which are not used frequently and/or are not costly to recompute. How do I design that system? The parts I can imagine it having:
A way to tell the 'put' operation how costly (relatively speaking) the computation was.
A way to optionally hint to the 'put' operation how often the computation might be needed (going forward).
The 'get' operation would want to note how often a given object is queried.
A way to tell the whole system the maximum amount of memory to use (ok, that's simple).
The real guts of it would be in the 'put' operation when you hit the memory limit and it has to cull some objects, based on their memory footprint, their costliness, and their usefulness. How do I do that?
Sorry if this is too vague, or off-topic.
I'd do it by creating a subclass of DYNAMICPROPS that uses a cell array to store the data internally. This way, you can dynamically add more data to the object.
Here's the basic design idea:
The data is stored in a cell array. Each property gets its own row, with the first column being the property name (for convenience), the second column a function handle to calculate the data, the third column the data, the fourth column the time it took to generate the data, the fifth column an array of, say, length 100 storing the timestamps corresponding to when the property was accessed the last 100 times, and the sixth column contains the variable size.
There is a generic get method that takes as input the row number corresponding to the property (see below). The get method first checks whether column 3 is empty. If it isn't, it returns the value and stores the access timestamp. If it is, it performs the computation using the handle from column 2 inside a TIC/TOC statement to measure how expensive the computation is (the cost is stored in column 4, and the timestamp in column 5). Then it checks whether there is enough space to store the result. If there is, it stores the data; otherwise it looks at each entry's size, as well as the product of how often its data were accessed and how long they would take to regenerate, to decide what to cull.
In addition, there is an 'add' method that adds a row to the cell array, creates a dynamic property (using addprop) with the same name as the function handle, and sets its get method to myGetMethod(myPropertyIndex). If you need to pass parameters to the function, you can create an additional property, myDynamicPropertyName_parameters, with a set method that removes previously calculated data whenever the parameters change value.
Finally, you can add a few dependent properties that can tell you how many properties there are (the number of rows in the cell array), what they're called (the first column of the cell array), etc.
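To make the culling decision concrete, here is a sketch of one way to score entries by the product described above (cost to regenerate times how often they are used, per byte of footprint). Scala is used here only for illustration, and the exact weighting is an assumption:

// One cached entry, mirroring the bookkeeping columns described above
case class Entry[V](value: V, sizeBytes: Long, computeSeconds: Double, hits: Long)

// Evict the lowest-scoring entries until enough bytes are free.
// Score = hits * cost to regenerate / memory footprint (cheapest to lose first).
def evict[K, V](cache: collection.mutable.Map[K, Entry[V]], bytesToFree: Long): Unit = {
  def score(e: Entry[V]): Double =
    (math.max(e.hits, 1L) * e.computeSeconds) / math.max(e.sizeBytes, 1L)

  var freed = 0L
  for ((key, entry) <- cache.toSeq.sortBy { case (_, e) => score(e) } if freed < bytesToFree) {
    cache.remove(key)
    freed += entry.sizeBytes
  }
}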
Consider using Java, since MATLAB runs on top of it, and can access it. This would work if you have marshallable values (ints, doubles, strings, matrices, but not structs or cell arrays).
LRU containers are available for Java:
Easy, simple to use LRU cache in java
http://java-planet.blogspot.com/2005/08/how-to-set-up-simple-lru-cache-using.html
http://www.codeproject.com/KB/java/lru.aspx
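A common way to get an LRU container on the JVM is java.util.LinkedHashMap constructed in access order with removeEldestEntry overridden. A minimal sketch (Scala syntax here to keep the examples on this page consistent, but the class underneath is plain LinkedHashMap):

import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// A bounded LRU cache: in access order, LinkedHashMap evicts the least
// recently used entry once maxEntries is exceeded.
class LruCache[K, V](maxEntries: Int)
    extends JLinkedHashMap[K, V](16, 0.75f, /* accessOrder = */ true) {
  override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
    size() > maxEntries
}

// Usage: keys and values should be things that marshal cleanly to/from MATLAB
// (doubles, strings, numeric arrays), per the caveat above.
val cache = new LruCache[String, Array[Double]](1000)
cache.put("fib(40)", Array(102334155.0))
val hit = Option(cache.get("fib(40)"))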

Do I need to implement a b-tree search for this?

I have an array of integers, which could run into the hundreds of thousands (or more), sorted numerically ascending since that's how they were originally stacked.
I need to be able to query the array to get the index of its first occurrence of a number >= some input, as efficiently as possible. The only way I would know how to do this without even thinking about it would be to iterate through the array testing the condition until it returns true, at which point I'd stop iterating. However, this is the most expensive solution to this problem and I'm looking for the best algorithm to solve it.
I'm coding in Objective-C, but I'll give an example in JavaScript to broaden the audience of people who are able to respond.
// Sample set
var numbers = [1, 7, 23, 23, 23, 89, 1002, 1003];
var indexAfter100 = getIndexOfValueGreaterThan(100);
var indexAfter7 = getIndexOfValueGreaterThan(7);
// (indexAfter100 == 6) == true
// (indexAfter7 == 2) == true
Putting this data into a DB in order to perform this search will only be a last-resort solution since I'm keen to see some sort of algorithm to tackle this quickly in memory.
I do have the ability to change the data structure, or to store an additional data structure as I'm building the array, since my program has already pushed each number one by one onto this stack, so I'd just modify the code that's adding them to the stack. Searching for the index as they're being added to the stack isn't possible since the search operation will be repeated frequently with different values after the fact.
Right now I'm thinking "B-Tree" but to be honest, I would have no idea how to implement one and before I go off and start figuring that out, I wonder if there's a nice algorithm that fits this single use-case better?
You should use binary search. Objective-C may even have a built-in method for that (many languages I know do). A B-tree probably won't help much, unless you want to store the data on disk.
I don't know about Objective-C, but C (plain 'ol C) comes with a function called bsearch (besides, AFAIK, Obj-C can call C functions just fine):
http://www.cplusplus.com/reference/clibrary/cstdlib/bsearch/
That basically does a binary search which sounds like it's what you need.
A fast search algorithm should be able to handle an array of ints of that size without taking too long, I should think (and the array is sorted, so a binary search would probably be the way to go).
I think a btree is probably overkill...
Since they are sorted in ascending order and you only need the bigger ones, I would serialize that array, explode it by the int, keep the part of the serialized string that holds the bigger ints, then unserialize it and voilà.
Linear search, also referred to as sequential search, looks at each element in sequence from the start to see if the desired element is present in the data structure. When the amount of data is small, this search is fast. It's easy, but the work needed is in proportion to the amount of data to be searched. Doubling the number of elements will double the time to search if the desired element is not present.
Binary search is efficient for larger arrays. Here we check the middle element: if it is bigger than what we are looking for, we look in the first half; otherwise, we look in the second half. We repeat this until the desired item is found. The array must be sorted for binary search. It eliminates half the data at each iteration, so it is logarithmic.
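A sketch of the variant the question describes (a lower-bound search: the index of the first element >= the input, or the array length if there is none). Scala here for illustration; for the strictly-greater-than behaviour in the question's sample output, search for target + 1 instead:

// First index i such that numbers(i) >= target, or numbers.length if no such element.
// Classic lower-bound binary search: O(log n) on a sorted array.
def lowerBound(numbers: Array[Int], target: Int): Int = {
  var lo = 0
  var hi = numbers.length
  while (lo < hi) {
    val mid = lo + (hi - lo) / 2
    if (numbers(mid) < target) lo = mid + 1 // answer lies to the right of mid
    else hi = mid                           // mid might be the answer
  }
  lo
}

val numbers = Array(1, 7, 23, 23, 23, 89, 1002, 1003)
lowerBound(numbers, 100)   // 6  (1002 is the first value >= 100)
lowerBound(numbers, 7 + 1) // 2  (first value strictly greater than 7)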

What is the most efficient data structure to store a list of integers that need to be looked up a lot in .Net?

I need an efficient data structure to store a list of integers. The amount in the list could range from 1 to probably never more than 1000. The list will be queried about 20 times per request. What would be the most efficient collection type to store these in?
UPDATE
To give a little more insight, we'll take www.wikipediamaze.com (a little game I wrote) as an example (not the real scenario but close enough for conversation). For the list of puzzles on any given page, I am currently returning a list from the puzzles table joined to the table that stores which puzzles the current user has played. Instead I want to cache the list of puzzles agnostic to the user. So what I am doing is first loading and caching the list of puzzles from the database. Then I load and cache the list of puzzles the user has played. Then when I am iterating over the puzzles to display them, I want to do this:
protected BestDataStructure<long> PlayedPuzzles { get; set; } // Loaded from session

protected bool HasBeenPlayed(long puzzleId)
{
    return PlayedPuzzles.Contains(puzzleId);
}
Whenever they play a new puzzle I will save the record to the database and append it to the list stored in the session.
Thanks!
It depends on how you need to query them, but a simple array, or HashSet<int> springs to mind.
Indexing into the array is O(1), and HashSet<int>.Contains is also O(1).
In response to your comment on the question: use HashSet, since you need to check whether a specified integer exists. Use Contains() on the HashSet to do this; it will give you the best performance. If you need to store some other data associated with each value, use a Dictionary instead.
If you're looking for a structure that efficiently does an operation such as Contains(), then HashSet<int> is what you're looking for. HashSet<int> is O(1) (constant time) for Contains(), which is as good a complexity as you're going to get.
HashSet<int>.
If you only need to check for the presence of an element, then perhaps a bool[] with 1000 elements, set to true for the integers that are present?

Boost multi_index ordered_unique Median Value

I would like to quickly retrieve the median value from a boost multi_index container with an ordered_unique index; however, the index iterators aren't random access (I don't understand why they can't be, though this is consistent with std::set...).
Is there a faster/neater way to do this other than incrementing an iterator container.size() / 2 times?
Boost.MultiIndex provides random access indexes, but these indexes don't maintain any order themselves. You can, however, sort such an index using its sort member function after inserting a new element, so you will be able to get the median efficiently.
It might be worth making a feature request to Boost.MultiIndex so that insertion into a random access index can maintain an order directly, as this should be much more efficient.
I ran into the same problem in a different context. It seems that the STL and Boost don't provide an ordered container that has random access to make use of the ordering (e.g. for comparing).
My (not so pretty) solution was to use a class that performed the input and "filtered" it through a set. After the input operation was finished, it simply copied all the iterators of the set into a vector and used that for random access.
This solution only works in a very limited context: you perform input on the container once. If you add to the container again, all the iterators have to be copied again. It really was very clumsy to use, but it worked.
