My task is to efficiently implement a findRange(x,y) method that outputs all values greater than x and less than or equal to y.
How can I do this in an efficient way?
What does a findRange(x,y) method look like in B-Trees?
I don't want any code; I just want to get the idea behind this whole thing so I can understand it properly.
I have an RDD called myRdd: RDD[(Long, String)] (the Long is an index obtained using zipWithIndex()) with a number of elements, but I need to cut it down to a specific number of elements for the final result.
I am wondering which is the better way to do this:
myRdd.take(num)
or
myRdd.filterByRange(0, num)
I don't care about the order of the selected elements, but I do care about the performance.
Any suggestions? Any other way to do this? Thank you!
take is an action, and filterByRange is a transformation. An action sends the results to the driver node, and a transformation does not get executed until an action is called.
The take method will take the first n elements of the RDD and send them back to the driver. The filterByRange method is a little more sophisticated, since it keeps those elements whose keys fall between the specified bounds.
I'd say there isn't much difference in performance between them. If you just want to send the results to the driver without caring about order, use the take method. However, if you want to benefit from the distributed computation and you don't need to send results back to the driver, use the filterByRange method and then call an action.
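For illustration, here is a minimal, self-contained Scala sketch of the two paths (the sample data, num, and the final action are made up here; only take and filterByRange come from the question). Note that filterByRange is inclusive on both bounds, so filterByRange(0, num - 1) keeps num elements when the index starts at 0.

import org.apache.spark.sql.SparkSession

object TakeVsFilterByRange {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("take-vs-filterByRange").getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for the question's myRdd: RDD[(Long, String)] keyed by zipWithIndex()
    val myRdd = sc.parallelize(Seq("a", "b", "c", "d", "e"))
      .zipWithIndex()
      .map { case (value, idx) => (idx, value) }

    val num = 3

    // take: an action; the first `num` elements are collected on the driver
    val firstN: Array[(Long, String)] = myRdd.take(num)
    println(firstN.mkString(", "))

    // filterByRange: a transformation; nothing executes until an action runs,
    // and the selected elements stay distributed across the cluster until then
    val ranged = myRdd.filterByRange(0L, num - 1L)
    println(ranged.count()) // count() is the action that triggers the work

    spark.stop()
  }
}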
I was recently asked to implement a sampleStream() method that would choose each element with equal probability, but not use random(). I thought the interviewer was looking for reservoir sampling, but as I stumbled through it, he added that it was an approach called "stratified sampling". Admittedly, I may have been thrown off by that, because there is a statistical method called stratified sampling, and I was trying to think of how I could use that to sample elements from a stream without random. The inputs he specified were the number of items to sample, and a rate at which I should sample (something like 1000/100,000).
Anyway, I'm still stuck on this problem, even though I already didn't get the job for not answering it properly. Googling has failed me here. Can anyone help me understand it?
One way to implement stratified sampling is to sort the list by the keys used for stratification and then do a 1 in n sampling.
Technically, sorting isn't necessary if the keys are categories. In this (typical) case, a hashing method can be used. The idea is still the same: 1 in n sampling on an "ordered" list.
Perhaps this is what the interviewer was referring to.
EDIT:
You can implement stratified sampling on a stream: you would essentially read the stream and keep a "bucket" count for each group of similar key values. When a bucket's count reaches some chosen value, you output that record; when the bucket hits its limit (based on the overall frequencies), you reset the counter and repeat (or use modulo arithmetic). A sketch of this counting approach is below.
However, this doesn't give an equal probability of selecting each record. For that, I really do think you need some sort of randomization. An approach that comes close would be to store the records for each group in a bucket and then choose a random record when the bucket is full. You can emulate randomness by using a hash key on some other value (such as the time of insert) and then choosing the minimum or maximum hash key value. (And you can make this more efficient by just storing one record.)
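Here is a minimal Scala sketch of that counting idea, assuming a simple per-key 1-in-n rule (Record, StreamSampler, and offer are invented names, not the interviewer's actual interface):

import scala.collection.mutable

// One record of the stream; the key identifies the stratum (group)
final case class Record(key: String, payload: String)

// Emits every n-th record seen within each stratum, with no call to random()
class StreamSampler(n: Int) {
  private val counts = mutable.Map.empty[String, Long]

  // Returns Some(record) when the record is selected, None otherwise
  def offer(record: Record): Option[Record] = {
    val c = counts.getOrElse(record.key, 0L) + 1
    counts(record.key) = c
    if (c % n == 0) Some(record) else None // modulo arithmetic instead of resetting
  }
}

// Example: sample roughly 1 in 3 records from each stratum
object SamplerDemo extends App {
  val sampler = new StreamSampler(3)
  val stream = Iterator.tabulate(12) { i =>
    Record(key = if (i % 2 == 0) "even" else "odd", payload = s"item-$i")
  }
  stream.foreach(r => sampler.offer(r).foreach(println))
}

As the answer notes, this picks deterministic positions within each group rather than giving every record an equal chance of being selected.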
There is a particular class of algorithmic coding problems which requires us to evaluate multiple queries, which can be of two kinds:
Perform search over a range of data
Update the data over a given range
One example which I've recently been working on is this (though not the only one): Quadrant Queries
Now, to optimize my algorithm, I have had one idea:
I can use dynamic programming to keep the search results for a particular range, and generate data for other ranges as required.
For example, if I have to calculate the sum of numbers in an array from index 4 to 7, I can already keep the sum of elements up to 4 and the sum of elements up to 7, which is easy, and then I'll just need the difference of the two plus the 4th element, which is O(1); a sketch of this is below. But this raises another problem: during the update operation, I'll have to update my stored search data for all the elements following the updated element. This seems inefficient, though I did not try it practically.
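For concreteness, a minimal Scala sketch of that prefix-sum idea, and of why naive point updates hurt (the class and method names are mine, not from the problem):

// prefix(i) = sum of a(0 until i), so sum of a(l..r) = prefix(r + 1) - prefix(l)
final class PrefixSums(values: Array[Long]) {
  private val a = values.clone()
  private var prefix: Array[Long] = build()

  private def build(): Array[Long] = a.scanLeft(0L)(_ + _)

  // Range-sum query over the inclusive range [l, r]: O(1)
  def rangeSum(l: Int, r: Int): Long = prefix(r + 1) - prefix(l)

  // Point update: O(n), because every prefix after index i must change
  def update(i: Int, value: Long): Unit = { a(i) = value; prefix = build() }
}

// e.g. new PrefixSums(Array[Long](5, 1, 4, 2, 8, 3, 7, 6)).rangeSum(4, 7) == 8 + 3 + 7 + 6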
Someone suggested that I can combine subsequent update operations using some special data structure (I actually read it on some forum).
Question: Is there a known way to optimize these kinds of problems? Is there a special data structure that does it? As for the idea I mentioned: is it possible that it might be more efficient than the direct approach? Should I try it out?
It might help:
Segment Trees (Range-Range part)
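As a reference point, here is a small Scala sketch of a segment tree with point updates and range-sum queries, both O(log n); the implementation details are my own, not taken from the linked material:

// Iterative segment tree over an array of longs; leaves live in tree(n until 2n)
final class SegmentTree(values: Array[Long]) {
  private val n = values.length
  private val tree = new Array[Long](2 * n)

  for (i <- 0 until n) tree(n + i) = values(i)
  for (i <- n - 1 until 0 by -1) tree(i) = tree(2 * i) + tree(2 * i + 1)

  // Set element i to value: O(log n)
  def update(i: Int, value: Long): Unit = {
    var p = i + n
    tree(p) = value
    while (p > 1) { p /= 2; tree(p) = tree(2 * p) + tree(2 * p + 1) }
  }

  // Sum over the inclusive range [l, r]: O(log n)
  def rangeSum(l: Int, r: Int): Long = {
    var lo = l + n
    var hi = r + n + 1
    var sum = 0L
    while (lo < hi) {
      if ((lo & 1) == 1) { sum += tree(lo); lo += 1 }
      if ((hi & 1) == 1) { hi -= 1; sum += tree(hi) }
      lo /= 2; hi /= 2
    }
    sum
  }
}

Unlike the prefix-sum approach above, an update only touches O(log n) nodes, so interleaved range queries and point updates both stay cheap.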
I fail to understand why using listFindNoCase() and ListFind() is the preferred way of doing a series of OR and IS/EQ comparisons. Wouldn't the JVM be able to optimize that and produce efficient code, rather than making a function call that has to deal with tokenizing a string? Or is CF doing something much more inefficient?
Use listFindNoCase() or listFind() instead of the is and or operators
to compare one item to multiple items. They are much faster.
http://www.adobe.com/devnet/coldfusion/articles/coldfusion_performance.html
The answer is simple: type conversion. You can compare 2 EQ "2", or now() EQ "2011-01-01", or true EQ "YES". The cost of converting (to multiple types) and comparing is quite high.
ListFind() does not need to try multiple conversions, so it is much faster.
This is the price of dynamic typing.
I find this odd too. The only thing I can think of is that the list elements are added to a fast collection that checks whether an element exists based on some awesome hash of the elements it contains. This would in fact be faster for large or very large lists; smaller lists should show little or no speed boost.
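Purely as an analogy for that hashed-lookup guess (this is Scala, not CFML, and says nothing about what CF actually does internally): a chain of equality comparisons scans every candidate, while membership in a hashed collection is a single lookup.

// Hypothetical illustration: chained comparisons vs. a hashed set lookup
val candidates = Set("red", "green", "blue", "cyan", "magenta") // hash-based collection
val value = "cyan"

// "x IS a OR x IS b OR ..." style: one comparison per candidate, O(n)
val chained = value == "red" || value == "green" || value == "blue" ||
              value == "cyan" || value == "magenta"

// listFind-style lookup as guessed above: one hashed membership test, O(1) on average
val hashed = candidates.contains(value)

assert(chained == hashed)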
I have an array of integers, which could run into the hundreds of thousands (or more), sorted numerically ascending since that's how they were originally stacked.
I need to be able to query the array to get the index of its first occurrence of a number >= some input, as efficiently as possible. The only way I would know how to do this without even thinking about it would be to iterate through the array testing the condition until it returns true, at which point I'd stop iterating. However, this is the most expensive solution to this problem and I'm looking for the best algorithm to solve it.
I'm coding in Objective-C, but I'll give an example in JavaScript to broaden the audience of people who are able to respond.
// Sample set
var numbers = [1, 7, 23, 23, 23, 89, 1002, 1003];
var indexAfter100 = getIndexOfValueGreaterThan(100);
var indexAfter7 = getIndexOfValueGreaterThan(7);
// (indexAfter100 == 6) == true
// (indexAfter7 == 2) == true
Putting this data into a DB in order to perform this search will only be a last-resort solution since I'm keen to see some sort of algorithm to tackle this quickly in memory.
I do have the ability to change the data structure, or to store an additional data structure as I'm building the array, since my program has already pushed each number one by one onto this stack, so I'd just modify the code that's adding them to the stack. Searching for the index as they're being added to the stack isn't possible since the search operation will be repeated frequently with different values after the fact.
Right now I'm thinking "B-Tree" but to be honest, I would have no idea how to implement one and before I go off and start figuring that out, I wonder if there's a nice algorithm that fits this single use-case better?
You should use binary search. Objective-C may even have a built-in method for that (many languages I know do). A B-tree probably won't help much, unless you want to store the data on disk.
I don't know about Objective-C, but C (plain ol' C) comes with a function called bsearch (besides, AFAIK, Obj-C can call C functions just fine):
http://www.cplusplus.com/reference/clibrary/cstdlib/bsearch/
That basically does a binary search which sounds like it's what you need.
A fast search algorithm should be able to handle an array of ints of that size without taking too long, I should think (and the array is sorted, so a binary search would probably be the way to go).
I think a B-tree is probably overkill...
Since they are sorted in ascending order and you only need the bigger ones, I would serialize that array, explode it by the int, keep the part of the serialized string that holds the bigger ints, then unserialize it, and voilà.
Linear search, also referred to as sequential search, looks at each element in sequence from the start to see if the desired element is present in the data structure. When the amount of data is small, this search is fast. It's easy, but the work needed is proportional to the amount of data to be searched. Doubling the number of elements will double the time to search if the desired element is not present.
Binary search is efficient for larger arrays. Here we check the middle element: if the value is bigger than what we are looking for, then look in the first half; otherwise, look in the second half. Repeat this until the desired item is found. The table must be sorted for binary search. It eliminates half the data at each iteration. It's logarithmic.
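To make that concrete for the question above, here is a hedged Scala sketch (the question uses Objective-C and JavaScript) of a binary search for the first index whose value is strictly greater than x, which matches the expected outputs in the question's example; use >= instead of > for the "greater than or equal" variant.

// Returns the first index i with sorted(i) > x, or sorted.length if none exists: O(log n)
def indexOfFirstGreaterThan(sorted: Array[Int], x: Int): Int = {
  var lo = 0
  var hi = sorted.length           // search the half-open range [lo, hi)
  while (lo < hi) {
    val mid = lo + (hi - lo) / 2
    if (sorted(mid) > x) hi = mid  // answer is at mid or to its left
    else lo = mid + 1              // answer is strictly to the right of mid
  }
  lo
}

// Mirrors the question's sample set
val numbers = Array(1, 7, 23, 23, 23, 89, 1002, 1003)
assert(indexOfFirstGreaterThan(numbers, 100) == 6)
assert(indexOfFirstGreaterThan(numbers, 7) == 2)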