Parallel Binary Search - parallel-processing

I'm just starting to learn about parallel programming, and I'm looking at binary search.
This can't really be optimized by throwing more processors at it right? I know it's supposedly dividing and conquering, but you're really "decreasing and conquering" (from Wikipedia).
Or could you possibly parallelize the comparisons? (if X is less than array[mid], search from low to mid - 1; else if X is greater than array[mid] search from mid + 1 to high, else return mid, the index of X)
Or how about you give half of the array to one processor to do binary search on, and the other half to another? Wouldn't that be wasteful though? Because it's decreasing and conquering rather than simply dividing and conquering? Thoughts?

You can easily use parallelism.
With k processors (k less than n), split the array into k groups of n/k elements each and assign one processor to each group.
Run binary search on each group.
Now the time is O(log(n/k)).
There is also a CREW PRAM method that runs in O(log n / log(k+1)).
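A minimal sketch of the chunked approach described above, assuming a sorted int array and using Java parallel streams as a stand-in for "k processors". The class and method names (ChunkedBinarySearch, search) are made up for illustration.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class ChunkedBinarySearch {
    // Searches the sorted array `a` for `key` using up to k chunks,
    // one independent binary search per chunk.
    // Returns an index of `key`, or -1 if it is absent.
    static int search(int[] a, int key, int k) {
        int n = a.length;
        if (n == 0) return -1;
        int chunks = Math.max(1, Math.min(k, n));
        int size = (n + chunks - 1) / chunks;      // ceil(n / chunks)
        return IntStream.range(0, chunks)
                .parallel()                         // one task per chunk
                .map(c -> {
                    int from = c * size;
                    int to = Math.min(n, from + size);
                    if (from >= to) return -1;      // trailing chunk may be empty
                    int r = Arrays.binarySearch(a, from, to, key);
                    return r >= 0 ? r : -1;
                })
                .filter(r -> r >= 0)
                .findAny()
                .orElse(-1);
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 5, 7, 9, 11, 13, 15};
        System.out.println(search(a, 7, 4));
    }
}
```

At most one chunk can contain the key's range, so every other task does work that a single sequential comparison would have ruled out.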

I would think it certainly qualifies for parallelisation. At least, across two threads. Have one thread do a depth-first search, and the other do a breadth-first search. The winner is the algorithm that performs the fastest, which may be different from data-set to data-set.

I don't have much experience in parallel programming, but I doubt this is a good candidate for parallel processing. Each step of the algorithm depends on performing one comparison, and then proceeding down a set "path" based on this comparison (you either found your value, or now have to keep searching in a set "direction" based on the comparison). Two separate threads performing the same comparison won't get you anywhere any faster, and separate threads will both need to rely on the same comparison to decide what to do next, so they can't really do any useful, divided work on their own.
As far as your idea of splitting the array, I think you are just negating the benefit of binary search in this case. Your value (assuming it's in your array), will either be in the top or the bottom half of your array. The first comparison (at the midpoint) in a binary search is going to tell you which half you should be looking in. If you take that even further, consider breaking an array of N elements into N different binary searches (a naive attempt to parallel-ize). You are now doing N comparisons, when you don't need to. You are losing the power of binary search, in that each comparison will narrow down your search to the appropriate subset.
Hope that helps. Comments welcome.

Yes, in the classical sense of parallelization (multi-core), binary search and BST are not much better.
There are techniques like having multiple copies of the BST on L1 cache for each processor. Only one processor is active but the gains from having multiple L1 caches can be great (4 cycles for L1 vs 14 cycles for L2).
In real world problems you are often searching multiple keys at the same time.
Now, there is another kind of parallelization that can help: SIMD! Check out "Fast architecture sensitive tree search on modern CPUs and GPUs" by a team from Intel/UCSC/Oracle (SIGMOD 2010). It's very cool. BTW I'm basing my current research project on this very paper.

Parallel implementation can speed up a binary search, but the improvement is not particularly significant. Worst case, the time required for a binary search is log_2(n) where n is the number of elements in the list. A simple parallel implementation breaks the master list into k sub-lists to be bin-searched by parallel threads. The resulting worst-case time for the binary search is log_2(n/k) realizing a theoretical decrease in the search time.
Example:
A list of 1024 entries takes as many as 10 cycles to binary search using a single thread. Using 4 threads, each thread would take only 8 cycles to complete the search. And using 8 threads, each thread takes 7 cycles. Thus, an 8-threaded parallel binary search could be up to 30% faster than the single-threaded model.
However, this speed-up should not be confused with an improvement in efficiency: the 8-threaded model actually executes 8 * 7 = 56 comparisons to complete the search, compared to the 10 comparisons executed by the single-threaded binary search. It is up to the discretion of the programmer whether the marginal gain in speed of a parallel application of binary search is appropriate or advantageous for their application.
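The arithmetic in this example can be reproduced with a few lines. This is only a sketch of the same log_2(n/k) cost model used above, assuming n and k are powers of two with k no larger than n. The names SearchCost, stepsPerThread and totalComparisons are made up for illustration.

```java
class SearchCost {
    // Worst-case steps per thread under the example's model: log2(n / k).
    // Assumes n and k are powers of two and k <= n.
    static int stepsPerThread(int n, int k) {
        int chunk = n / k;
        return 31 - Integer.numberOfLeadingZeros(chunk);   // floor(log2(chunk))
    }

    // Total comparisons summed over all k threads.
    static int totalComparisons(int n, int k) {
        return k * stepsPerThread(n, k);
    }

    public static void main(String[] args) {
        for (int k : new int[] {1, 4, 8}) {
            System.out.println(k + " thread(s): " + stepsPerThread(1024, k)
                    + " steps each, " + totalComparisons(1024, k) + " comparisons in total");
        }
    }
}
```

For n = 1024 this gives 10, 8 and 7 steps per thread for 1, 4 and 8 threads, and 10, 32 and 56 total comparisons.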

I am pretty sure binary search can be sped up by a factor of log(M), where M is the number of processors. Simply giving each processor its own chunk only gets you log(n/M) = log(n) - log(M), which is greater than log(n)/log(M) for a constant M. I do not have a proof of a tight lower bound, but if M = n, the execution time is O(1), which cannot be improved on. An algorithm sketch follows.
Parallel_Binary_Search(sorted_arraylist)
Divide your sorted_arraylist into M chunks of size n/M.
Apply one step of comparison to the middle element of each chunk.
If a comparator signals equality, return the address and terminate.
Otherwise, identify both adjacent chunks where comparators signaled (>) and (<), respectively.
Form a new Chunk starting from the element following the one that signaled (>) and ending at the element preceding the one that signaled (<).
If they are the same element, return fail and terminate.
Otherwise, Parallel_Binary_Search(Chunk)
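Here is one way the sketch above could look in code, written iteratively instead of recursively. It is a sequential simulation: the inner loop performs the M probes of one round one after another, whereas on real hardware those probes would run on M processors at once. The names MWaySearch and search are made up for illustration.

```java
class MWaySearch {
    // Searches the sorted array `a` for `key`, probing the middle element of
    // each of up to m chunks per round. Returns the index of `key`, or -1.
    static int search(int[] a, int key, int m) {
        int lo = 0, hi = a.length - 1;
        while (lo <= hi) {
            int len = hi - lo + 1;
            int chunks = Math.min(m, len);          // never more chunks than elements
            int newLo = lo, newHi = hi;
            // The probes below do not depend on each other, so m processors
            // could perform them in a single parallel step.
            for (int c = 0; c < chunks; c++) {
                int start = lo + (int) ((long) c * len / chunks);
                int end = lo + (int) ((long) (c + 1) * len / chunks) - 1;
                int mid = start + (end - start) / 2;
                if (a[mid] == key) return mid;
                if (a[mid] < key) newLo = Math.max(newLo, mid + 1);   // key is to the right
                else newHi = Math.min(newHi, mid - 1);                // key is to the left
            }
            lo = newLo;
            hi = newHi;
        }
        return -1;
    }
}
```

With m = 1 this degenerates into ordinary binary search. With larger m each round shrinks the range to roughly one chunk, which is where the log(n)/log(M) behaviour comes from.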

Related

Most effective Algorithm to find maximum of double-precision values

What is the most effective way of finding a maximum value in a set of variables?
I have seen solutions, such as
private double findMax(double... vals) {
    double max = Double.NEGATIVE_INFINITY;
    for (double d : vals) {
        if (d > max) max = d;
    }
    return max;
}
But, what would be the most effective algorithm for doing this?
You can't reduce the complexity below O(n) if the list is unsorted... but you can improve the constant factor by a lot. Use SIMD. For example, in SSE you would use the packed MAXPS instruction (MAXPD for doubles; MAXSS is the scalar form) to perform several compare+select operations in a single instruction: four for single precision, two for double precision. Unroll the loop a bit to reduce the cost of loop control logic. And then outside the loop, find the max out of the values held in your SSE register.
This gives a benefit for any size list... also using multithreading makes sense for really large lists.
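Plain Java has no direct way to emit MAXPS/MAXPD, so the following is only a scalar sketch of the same structure: four independent running maxima (the "lanes"), an unrolled loop, and a final reduction outside the loop. Whether the JIT actually vectorizes it is not guaranteed. The name LaneMax is made up for illustration. Note that Math.max propagates NaN, unlike the `d > max` test in the question.

```java
class LaneMax {
    // Keeps four independent running maxima so consecutive iterations do not
    // depend on each other, then reduces the four lanes at the end.
    static double max(double... vals) {
        double m0 = Double.NEGATIVE_INFINITY;
        double m1 = m0, m2 = m0, m3 = m0;
        int i = 0;
        int limit = vals.length - 3;
        for (; i < limit; i += 4) {                // unrolled by four
            m0 = Math.max(m0, vals[i]);
            m1 = Math.max(m1, vals[i + 1]);
            m2 = Math.max(m2, vals[i + 2]);
            m3 = Math.max(m3, vals[i + 3]);
        }
        for (; i < vals.length; i++) {             // leftover elements
            m0 = Math.max(m0, vals[i]);
        }
        // Horizontal reduction of the four lanes.
        return Math.max(Math.max(m0, m1), Math.max(m2, m3));
    }
}
```

Like the original findMax, this returns Double.NEGATIVE_INFINITY for an empty input.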
Assuming the list does not have elements in any particular order, the algorithm you mentioned in your question is optimal. It must look at every element once, thus it takes time directly proportional to the size of the list, O(n).
There is no algorithm for finding the maximum that has a lower upper bound than O(n).
Proof: Suppose for a contradiction that there is an algorithm that finds the maximum of a list in less than O(n) time. Then there must be at least one element that it does not examine. If the algorithm selects this element as the maximum, an adversary may choose a value for the element such that it is smaller than one of the examined elements. If the algorithm selects any other element as the maximum, an adversary may choose a value for the element such that it is larger than the other elements. In either case, the algorithm will fail to find the maximum.
EDIT: This was my attempted answer, but please look at the comments, where @BenVoigt proposes a better way to optimize the expression.
You need to traverse the whole list at least once,
so it'd be a matter of finding a more efficient expression for if (d>max) max=d, if any.
Assuming we need the general case where the list is unsorted (if we keep it sorted we'd just pick the last item, as @IgnacioVazquez points out in the comments), and researching a little about branch prediction (Why is it faster to process a sorted array than an unsorted array?, see the 4th answer), it looks like
if (d>max) max=d;
can be more efficiently rewritten as
max=d>max?d:max;
The reason is that the first statement is normally translated into a branch (this is entirely compiler and language dependent, but it happens at least in C and C++, and even in a VM-based language like Java), while the second one is translated into a conditional move.
Modern processors pay a big penalty on a branch if the prediction goes wrong (the execution pipeline has to be flushed), while a conditional move is a single instruction with no change of control flow, so there is nothing to mispredict.
The random nature of the elements in the list (one can be greater or lesser than the current maximum with equal probability) will cause many branch predictions to go wrong.
Please refer to the linked question for a nice discussion of all this, together with benchmarks.

Is the linear formation the best sorting production?

Usually a sorting method produces a linearly ordered result (such as "1, 7, 8, 13, 109, ..."), which costs O(N) to query.
Why not sort into a non-linear order that takes O(log N) or so to find an element, by iteration, Newton's method, etc.? Is it expensive to build such a higher-order sorted structure?
In short, is it possible to sort results so that they can be accessed by finding the roots of ax^2 + bx + c = 0? (By contrast, the usual case is finding the root of ax + c = 0.) For example, we have x1 = 1, x2 = 2 as roots of a quadratic equation and just insert the following xi. Then it would be possible to use smarter ways to query.
I suppose difficulties can arise in these respects:
Predicting the data can be rather hard, so we cannot construct a general formula that describes the following numbers well (they may be hash values).
Because of the first difficulty, values outside a certain range can diverge. Example graphed by Google: the graph. The values outside [-1, 3] are really large, and evaluating the original formula quickly becomes harder.
This is actually equivalent to hashing, which creates a table that contains the values, with the formula as the production rule.
Executing a "smarter" query may be expensive because of the complexity of the algorithm itself.
Smarter schemes which take advantage of a known statistical distribution are typically faster by some constant. However, that still keeps them at O(log N), which is the same as a trivial binary search. The reason is that in each step, they typically narrow down the range of elements to search by a factor R > 2 , for simple binary search that's just R=2. But you need log(N)/log(R) steps to narrow it down to exactly one element.
Now whether this is a net win depends on log(R) versus the work needed at each step. A simple comparison (for binary search) takes a few cycles. As soon as you need anything more complex than +-*/ (say exp or log) to predict the location of the next element, the profit of needing fewer steps is gone.
So, in summary: binary search is used because each step is efficient, for many real-world distributions.
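One well-known scheme of this kind is interpolation search, which assumes the keys are roughly uniformly distributed and guesses the position by linear interpolation. The answer above does not name it, so take this as an illustrative sketch only. The name InterpolationSearch is made up. It needs only + - * / per step, which is about the limit of what the answer considers worthwhile.

```java
class InterpolationSearch {
    // Searches the sorted array `a` for `key`, guessing each probe position
    // from the key's value instead of always taking the midpoint.
    // Returns an index of `key`, or -1 if it is absent.
    static int search(int[] a, int key) {
        int lo = 0, hi = a.length - 1;
        while (lo <= hi && key >= a[lo] && key <= a[hi]) {
            if (a[hi] == a[lo]) {
                return a[lo] == key ? lo : -1;     // avoid division by zero
            }
            // Linear interpolation; long arithmetic avoids int overflow.
            long offset = ((long) key - a[lo]) * (hi - lo) / ((long) a[hi] - a[lo]);
            int pos = lo + (int) offset;
            if (a[pos] == key) return pos;
            if (a[pos] < key) lo = pos + 1;
            else hi = pos - 1;
        }
        return -1;
    }
}
```

On badly skewed data the guesses can be poor and the number of steps can approach N, so it is not a drop-in replacement for binary search.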

Finding the average of large list of numbers

Came across this interview question.
Write an algorithm to find the mean(average) of a large list. This
list could contain trillions or quadrillions of number. Each number is
manageable in hundreds, thousands or millions.
Googling it gave me all Median of Medians solutions. How should I approach this problem?
Is divide and conquer enough to deal with trillions of number?
How to deal with the list of the such a large size?
If the size of the list is computable, it's really just a matter of how much memory you have available, how long it's supposed to take and how simple the algorithm is supposed to be.
Basically, you can just add everything up and divide by the size.
If you don't have enough memory, dividing first might work (Note that you will probably lose some precision that way).
Another approach would be to recursively split the list into 2 halves and calculate the mean of the sublists' means. Your recursion termination condition is a list size of 1, in which case the mean is simply the only element of the list. If you encounter a list of odd size, make either the first or second sublist longer; this is pretty much arbitrary and doesn't even have to be consistent.
If, however, your list is so giant that its size can't be computed, there's no way to split it into 2 sublists. In that case, the recursive approach works pretty much the other way around. Instead of splitting into 2 lists with n/2 elements, you split into n/2 lists with 2 elements (or rather, calculate their mean immediately). So basically, you calculate the mean of elements 1 and 2, and that becomes your new element 1. The mean of 3 and 4 is your new second element, and so on. Then apply the same algorithm to the new list until only 1 element remains. If you encounter a list of odd size, either add an element at the end or ignore the last one. If you add one, you should try to get as close as possible to your expected mean.
While this won't calculate the mean mathematically exactly, for lists of that size, it will be sufficiently close. This is pretty much a mean of means approach. You could also go the median of medians route, in which case you select the median of sublists recursively. The same principles apply, but you will generally want to get an odd number.
You could even combine the approaches and calculate the mean if your list is of even size and the median if it's of odd size. Doing this over many recursion steps will generate a pretty accurate result.
First of all, this is an interview question. The problem as stated would not arise in practice. Also, the question as stated here is imprecise. That is probably deliberate. (They want to see how you deal with solving an imprecisely specified problem.)
Write an algorithm to find the mean(average) of a large list.
The word "find" is rubbery. It could mean calculate (to some precision) or it could mean estimate.
The phrase "large list" is rubbery. It could mean a list or array data structure in memory, or the "list" could be the result of a database query, or the contents of a file or files.
There is no mention of the hardware constraints on the system where this will be implemented.
So the first thing >>I<< would do would be to try to narrow the scope by asking some questions of the interviewer.
But assuming that you can't, then a complete answer would need to cover the following points:
The dataset probably won't fit in memory at the same time. (But if it does, then that is good.)
Calculating the average of N numbers is O(N) if you do it serially. For N this size, it could be an intractable problem.
An alternative is to split into sublists of equal size and calculate the averages, and then the average of the averages. In theory, this gives you O(N/P) where P is the number of partitions. The parallelism could be implemented with multiple threads, with multiple processes on the same machine, or distributed.
In practice, the limiting factors are going to be computational, memory and/or I/O bandwidth. A parallel solution will be effective if you can address these limits. For example, you need to balance the problem of each "worker" having uncontended access to its "sublist" versus the problem of making copies of the data so that that can happen.
If the list is represented in a way that allows sampling, then you can estimate the average without looking at the entire dataset. In fact, this could be O(C) depending on how you sample. But there is a risk that your sample will be unrepresentative, and the average will be too inaccurate.
In all cases doing calculations, you need to guard against (integer) overflow and (floating point) rounding errors. Especially while calculating the sums.
It would be worthwhile discussing how you would solve this with a "big data" platform (e.g. Hadoop) and the limitations of that approach (e.g. time taken to load up the data ...)
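To make the partitioning and overflow points concrete, here is a minimal sketch of a streaming mean that never holds the full sum, plus a merge step for combining results from separate workers. The class name RunningMean and its methods are made up for illustration. The merge weights each partial mean by its element count, which is what keeps the "average of averages" exact when partitions are not of equal size.

```java
class RunningMean {
    private long count = 0;
    private double mean = 0.0;

    // Incremental update: no running sum is kept, so the sum cannot overflow.
    void add(double x) {
        count++;
        mean += (x - mean) / count;
    }

    // Combines the result of another partition, weighted by its element count.
    void merge(RunningMean other) {
        long total = count + other.count;
        if (total == 0) return;
        mean += (other.mean - mean) * ((double) other.count / total);
        count = total;
    }

    double mean() { return mean; }

    long count() { return count; }

    public static void main(String[] args) {
        RunningMean p1 = new RunningMean();
        for (double x : new double[] {1, 2, 3}) p1.add(x);
        RunningMean p2 = new RunningMean();
        p2.add(10);
        p1.merge(p2);
        System.out.println(p1.mean() + " over " + p1.count() + " values");
    }
}
```

In the usage above the two partition means are 2 and 10. Averaging them naively would give 6, while the weighted merge gives the true mean of 1, 2, 3 and 10. Floating-point rounding still accumulates over trillions of updates, so this reduces the overflow risk but does not remove the precision concern.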

Could I use a faster data structure than a tree for this?

I have a binary decision tree. It takes inputs as an array of floats, and each branch node splits on an input index and value eventually taking me to a leaf.
I'm performing a massive number of lookups on this tree (about 17% of execution time according to performance analysis (Edit: Having optimised other areas it's now at almost 40%)), and am wondering if I could/should be using a different data structure to improve lookup speed.
Some kind of hash table can't be used, as inputs do not map directly to a leaf node, but I was wondering if anyone had any suggestions as to methods and data structures I could use in place of the tree (or as well as?) to improve lookup speeds.
Memory is a concern, but less of a concern than speed.
Code is currently written in C#, but obviously any method could be applied.
Edit:
There's a bit too much code to post, but I'll give more detail about the tree.
The tree is generated using information gain calculations, it's not always a 50/50 split, the split value could be any float value. A single input could also be split multiple times increasing the resolution on that input.
I posted a question about performance of the iterator here:
Micro optimisations iterating through a tree in C#
But I think I might need to look at the data structure itself to improve performance further.
I'm aiming for as much performance as possible here. I'm working on a new method of machine learning, and the tree grows itself using a feedback loop. For the process I'm working on, I estimate it'll be running for several months, so a few % saving here and there is massive. The ultimate goal is speed without using too much memory.
If I understand correctly, you have floating point ranges that have to be mapped to a decision. Something like this:
x <= 0.0 : Decision A
0.0 < x <= 0.5 : Decision B
0.5 < x <= 0.6 : Decision C
0.6 < x : Decision D
A binary tree is a pretty good way to handle that. As long as the tree is well balanced and the input values are evenly distributed across the ranges, you can expect O(log2 n) comparisons, where n is the number of possible decisions.
If the tree is not balanced, then you could be doing far more comparisons than necessary. In the worst case: O(n). So I would look at the trees and see how deep they are. If the same tree is used again and again, then the cost spent rebalancing once may be amortized over many lookups.
If the input values are not evenly distributed (and you know that ahead of time), then you might want to special-case the order of the comparisons so that the most common cases are detected early. You can do this by manipulating the tree or by adding special cases in the code before actually checking the tree.
If you've exhausted algorithmic improvements and you still need to optimize, you might look into a data structure with better locality than a general binary tree. For example, you could put the partition boundaries into a contiguous array and perform a binary search on it. (And, if the array isn't too long, you might even try a linear search on the array as it may be friendlier for the cache and the branch prediction.)
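A minimal sketch of the contiguous-array idea for the single-input case, using the four ranges from the example above. The question's code is C#, but the sketch is in Java. The class name RangeLookup and the string decisions are made up for illustration. A real decision tree that splits on several input indices would need one such table per input, or a different layout.

```java
import java.util.Arrays;

class RangeLookup {
    private final double[] bounds;      // sorted upper bounds, e.g. {0.0, 0.5, 0.6}
    private final String[] decisions;   // one more entry than bounds

    RangeLookup(double[] bounds, String[] decisions) {
        if (decisions.length != bounds.length + 1) {
            throw new IllegalArgumentException("need bounds.length + 1 decisions");
        }
        this.bounds = bounds.clone();
        this.decisions = decisions.clone();
    }

    // Maps x to the decision of the first range whose upper bound is >= x.
    String decide(double x) {
        int r = Arrays.binarySearch(bounds, x);
        // Found: x equals bounds[r], which belongs to range r (x <= bound).
        // Not found: r encodes the insertion point as -(point) - 1.
        int idx = r >= 0 ? r : -r - 1;
        return decisions[idx];
    }

    public static void main(String[] args) {
        RangeLookup lookup = new RangeLookup(
                new double[] {0.0, 0.5, 0.6},
                new String[] {"A", "B", "C", "D"});
        System.out.println(lookup.decide(0.55));
    }
}
```

The boundaries sit in one small array, so a lookup touches a few adjacent cache lines instead of chasing pointers between tree nodes.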
Lastly, I'd consider building a coarse index that gives us a headstart into the tree (or array). For example, use a few of the most significant bits of the input value as an index and see if that can cut off the first few layers of the tree. This may help more than you might imagine, as the skipped comparisons probably have a low chance of getting correct branch predictions.
Presuming decisions have a 50/50 chance:
Imagine that you had two binary decisions; possible paths are 00, 01, 10, 11
Imagine instead of tree you had an array with four outcomes; you could turn your array of floats into a binary number which would be index into this array.

parallelize bisection

Consider the bisection algorithm to find a square root. Every step depends on the previous one, so in my opinion it's not possible to parallelize it. Am I wrong?
Consider also similar algorithm like binary search.
edit
My problem is not bisection, but it is very similar. I have a monotonic function f(mu) and I need to find the mu where f(mu)<alpha. One core needs 2 minutes to compute f(mu), and I need very high precision. We have a farm of ~100 cores. My first attempt was to use only 1 core and scan the values of f with a dynamic step, depending on how close I am to alpha. Now I want to use the whole farm, but my only idea is to compute 100 values of f at equally spaced points.
It depends on what you mean by parallelize, and at what granularity. For example you could use instruction level parallelism (e.g. SIMD) to find square roots for a set of input values.
Binary search is trickier, because the control flow is data-dependent, as is the number of iterations, but you could still conceivably perform a number of binary searches in parallel so long as you allow for the maximum number of iterations (log2 N).
Even if these algorithms could be parallelized (and I'm not sure they can), there is very little point in doing so.
Generally speaking, there is very little point in attempting to parallelize algorithms that already have sub-linear time bounds (that is, T < O(n)). These algorithms are already so fast that extra hardware will have very little impact.
Furthermore, it is not true (in general) that all algorithms with data dependencies cannot be parallelized. In some cases, for example, it is possible to set up a pipeline where different functional units operate in parallel and feed data sequentially between them. Image processing algorithms, in particular, are frequently amenable to such arrangements.
Problems with no such data dependencies (and thus no need to communicate between processors) are referred to as "embarrassingly parallel". Those problems represent a small subset of the space of all problems that can be parallelized.
Many algorithms have several steps in which each step depends on the previous one. Some of these algorithms can be restructured so the steps run in parallel, and some cannot be parallelized at all. I think binary search is of the second type. You are not wrong, but you can parallelize binary search by running multiple searches at once.
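For the situation in the question's edit (an expensive monotonic f and many cores), the "100 equally spaced points" idea can be repeated round after round: evaluate f at k interior points in parallel, keep the one sub-interval where f crosses alpha, and continue. Below is a minimal sketch under stated assumptions: f is non-decreasing on [lo, hi] with f(lo) < alpha <= f(hi), tol is well above floating-point resolution, and a parallel stream stands in for the farm. The names ParallelSection and solve are made up for illustration.

```java
import java.util.function.DoubleUnaryOperator;
import java.util.stream.IntStream;

class ParallelSection {
    // Narrows [lo, hi] around the point where f first reaches alpha.
    // Assumes f is non-decreasing and f(lo) < alpha <= f(hi).
    static double solve(DoubleUnaryOperator f, double alpha,
                        double lo, double hi, int k, double tol) {
        // The round cap is only a safety net against a tolerance that is too small.
        for (int round = 0; hi - lo > tol && round < 200; round++) {
            final double a = lo;
            final double step = (hi - lo) / (k + 1);
            // k independent evaluations of f: this is the part that can be
            // spread across k cores.
            double[] values = IntStream.rangeClosed(1, k)
                    .parallel()
                    .mapToDouble(i -> f.applyAsDouble(a + i * step))
                    .toArray();
            // j = number of leading sample points still below alpha.
            int j = 0;
            while (j < k && values[j] < alpha) j++;
            lo = a + j * step;                      // last point known to be below alpha
            if (j < k) hi = a + (j + 1) * step;     // first point known to reach alpha
        }
        return 0.5 * (lo + hi);
    }

    public static void main(String[] args) {
        // Example: f(x) = x * x and alpha = 2, so the crossing is near sqrt(2).
        System.out.println(solve(x -> x * x, 2.0, 0.0, 2.0, 9, 1e-9));
    }
}
```

Each round shrinks the interval by a factor of k + 1 instead of 2, so with k = 100 one round of 2-minute evaluations replaces between six and seven bisection steps. The total work is much higher, but the wall-clock time is lower.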

Resources