parallelize bisection - algorithm

Consider the bisection algorithm for finding a square root. Every step depends on the previous one, so in my opinion it's not possible to parallelize it. Am I wrong?
Consider also similar algorithms like binary search.
My problem is not bisection, but it is very similar. I have a monotonic function f(mu) and I need to find the mu where f(mu)<alpha. One core needs 2 minutes to compute f(mu) and I need very high precision. We have a farm of ~100 cores. My first attempt was to use only 1 core and scan the values of f with a dynamic step, depending on how close I am to alpha. Now I want to use the whole farm, but my only idea is to compute 100 values of f at equally spaced points.

It depends on what you mean by parallelize, and at what granularity. For example you could use instruction level parallelism (e.g. SIMD) to find square roots for a set of input values.
Binary search is trickier, because the control flow is data-dependent, as is the number of iterations, but you could still conceivably perform a number of binary searches in parallel so long as you allow for the maximum number of iterations (log2 N).
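As a rough illustration of running several binary searches side by side with a fixed iteration count, here is a minimal sketch. The function name `batch_lower_bound` is made up for this example, and the inner loop over keys only stands in for SIMD lanes or threads; plain Python executes it sequentially.

```python
def batch_lower_bound(sorted_values, keys):
    """Run one binary search per key, all advancing in lockstep.

    Every search gets the same fixed number of rounds (about log2 N), so the
    outer loop has no data-dependent exit. Searches that finish early simply
    idle, which is how masked SIMD lanes would behave.
    """
    n = len(sorted_values)
    lo = [0] * len(keys)   # per-search lower bound (inclusive)
    hi = [n] * len(keys)   # per-search upper bound (exclusive)
    for _ in range(n.bit_length()):        # enough rounds to shrink n to 0
        for j, key in enumerate(keys):     # conceptually: one lane per key
            if lo[j] < hi[j]:              # lane still active
                mid = (lo[j] + hi[j]) // 2
                if sorted_values[mid] < key:
                    lo[j] = mid + 1
                else:
                    hi[j] = mid
    return lo                              # first index with value >= key
```

Each result is the insertion point of the corresponding key, the same value `bisect.bisect_left` would return.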

Even if these algorithms could be parallelized (and I'm not sure they can), there is very little point in doing so.
Generally speaking, there is very little point in attempting to parallelize algorithms that already have sub-linear time bounds (that is, T < O(n)). These algorithms are already so fast that extra hardware will have very little impact.
Furthermore, it is not true (in general) that all algorithms with data dependencies cannot be parallelized. In some cases, for example, it is possible to set up a pipeline where different functional units operate in parallel and feed data sequentially between them. Image processing algorithms, in particular, are frequently amenable to such arrangements.
Problems with no such data dependencies (and thus no need to communicate between processors) are referred to as "embarrassingly parallel". Those problems represent a small subset of the space of all problems that can be parallelized.

Many algorithms consist of steps that each depend on the previous one. Some of these can be restructured to run in parallel and some cannot. I think binary search is of the second type, so you are not wrong. You can, however, run multiple binary searches in parallel.
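For the monotonic f(mu) problem in the question, one way to use many cores is to generalise bisection to several probes per round. The following is a minimal sketch under stated assumptions: the name `multisection` and its parameters are invented here, f is assumed to be increasing with f(lo) < alpha <= f(hi), and the thread pool is only a stand-in for a real farm (a process pool or cluster scheduler would take its place for a 2-minute f).

```python
from concurrent.futures import ThreadPoolExecutor

def multisection(f, lo, hi, alpha, workers=8, tol=1e-9):
    """Locate the mu in [lo, hi] where an increasing f crosses alpha.

    Each round evaluates f at `workers` interior points concurrently and
    keeps the single sub-interval that brackets the crossing, so the
    bracket shrinks by a factor of workers + 1 per round instead of 2.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while hi - lo > tol:
            step = (hi - lo) / (workers + 1)
            points = [lo + step * (i + 1) for i in range(workers)]
            values = list(pool.map(f, points))   # the parallel part
            new_lo, new_hi = lo, hi
            for p, v in zip(points, values):
                if v < alpha:
                    new_lo = p        # crossing lies to the right of p
                else:
                    new_hi = p        # first probe at or above alpha
                    break
            lo, hi = new_lo, new_hi
    return 0.5 * (lo + hi)
```

With 100 workers each round shrinks the bracket by a factor of 101 rather than 2, so the number of rounds drops by a factor of about log2(101), roughly 6.7, while each round still costs one evaluation time. That is a real gain, but far from a 100x speed-up.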


Sorting in Computer Science vs. sorting in the 'real' world

I was thinking about sorting algorithms in software, and possible ways one could surmount the O(nlogn) roadblock. I don't think it IS possible to sort faster in a practical sense, so please don't think that I do.
With that said, it seems with almost all sorting algorithms, the software must know the position of each element. Which makes sense, otherwise, how would it know where to place each element according to some sorting criteria?
But when I crossed this thinking with the real world, a centrifuge has no idea what position each molecule is in when it 'sorts' the molecules by density. In fact, it doesn't care about the position of each molecule. However it can sort trillions upon trillions of items in a relatively short period of time, due to the fact that each molecule follows density and gravitational laws - which got me thinking.
Would it be possible with some overhead on each node (some value or method tacked on to each of the nodes) to 'force' the order of the list? Something like a centrifuge, where only each element cares about its relative position in space (in relation to other nodes). Or, does this violate some rule in computation?
I think one of the big points brought up here is the quantum mechanical effects of nature and how they apply in parallel to all particles simultaneously.
Perhaps classical computers inherently restrict sorting to the domain of O(nlogn), whereas quantum computers may be able to cross that threshold into O(logn) algorithms that act in parallel.
The point that a centrifuge is basically a parallel bubble sort, which has a time complexity of O(n), seems to be correct.
I guess the next thought is that if nature can sort in O(n), why can't computers?
EDIT: I had misunderstood the mechanism of a centrifuge and it appears that it does a comparison, a massively-parallel one at that. However there are physical processes that operate on a property of the entity being sorted rather than comparing two properties. This answer covers algorithms that are of that nature.
A centrifuge applies a sorting mechanism that doesn't really work by means of comparisons between elements, but by a property ('centrifugal force') acting on each individual element in isolation. Some sorting algorithms fall into this theme, especially Radix Sort. When this sorting algorithm is parallelized it should approach the example of a centrifuge.
Some other non-comparative sorting algorithms are Bucket sort and Counting Sort. You may find that Bucket sort also fits into the general idea of a centrifuge (the radius could correspond to a bin).
Another so-called 'sorting algorithm' where each element is considered in isolation is the Sleep Sort. Here time rather than the centrifugal force acts as the magnitude used for sorting.
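To make the non-comparative idea concrete, here is a minimal sketch of a least-significant-digit radix sort. It assumes non-negative integer keys; the function name and the `base` parameter are choices made for this example. Each key is dropped into a bucket according to its own digit, without being compared against its neighbours.

```python
def radix_sort(values, base=10):
    """LSD radix sort for non-negative integers.

    Keys are placed by their own digits, one digit position per pass;
    no two keys are compared with each other while bucketing.
    """
    result = list(values)
    if not result:
        return result
    largest = max(result)          # only used to know how many passes to run
    place = 1
    while place <= largest:
        buckets = [[] for _ in range(base)]
        for v in result:
            buckets[(v // place) % base].append(v)   # stable within a bucket
        result = [v for bucket in buckets for v in bucket]
        place *= base
    return result
```

The bucketing loop in each pass touches every key independently, which is the part that could in principle be spread over many processors.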
Computational complexity is always defined with respect to some computational model. For example, an algorithm that's O(n) on a typical computer might be O(2^n) if implemented in Brainfuck.
The centrifuge computational model has some interesting properties; for example:
it supports arbitrary parallelism; no matter how many particles are in the solution, they can all be sorted simultaneously.
it doesn't give a strict linear sort of particles by mass, but rather a very close (low-energy) approximation.
it's not feasible to examine the individual particles in the result.
it's not possible to sort particles by different properties; only mass is supported.
Given that we don't have the ability to implement something like this in general-purpose computing hardware, the model may not have practical relevance; but it can still be worth examining, to see if there's anything to be learned from it. Nondeterministic algorithms and quantum algorithms have both been active areas of research, for example, even though neither is actually implementable today.
The catch is that you only have a probability of sorting your list using a centrifuge. As with other real-world sorts [citation needed], you can change the probability that you have sorted your list, but never be certain without checking all the values (atoms).
Consider the question: "How long should you run your centrifuge for?"
If you only ran it for a picosecond, your sample may be less sorted than the initial state. If you ran it for a few days, it may be completely sorted. However, you wouldn't know without actually checking the contents.
A real world example of a computer based "ordering" would be autonomous drones that cooperatively work with each other, known as "drone swarms". The drones act and communicate both as individuals and as a group, and can track multiple targets. The drones collectively decide which drones will follow which targets and the obvious need to avoid collisions between drones. The early versions of this were drones that moved through way points while staying in formation, but the formation could change.
For a "sort", the drones could be programmed to form a line or pattern in a specific order, initially released in any permutation or shape, and collectively and in parallel they would quickly form the ordered line or pattern.
Getting back to a computer based sort, one issue is that there's one main memory bus, and there's no way for a large number of objects to move about in memory in parallel.
know the position of each element
In the case of a tape sort, the position of each element (record) is only "known" to the "tape", not to the computer. A tape based sort only needs to work with two elements at a time, and a way to denote run boundaries on a tape (file mark, or a record of different size).
IMHO, people overthink log(n). O(nlog(n)) IS practically O(n). And you need O(n) just to read the data.
Many algorithms such as quicksort do provide a very fast way to sort elements. You could implement variations of quicksort that would be very fast in practice.
Inherently all physical systems are infinitely parallel. You might have a buttload of atoms in a grain of sand, nature has enough computational power to figure out where each electron in each atom should be. So if you had enough computational resources (O(n) processors) you could sort n numbers in log(n) time.
From comments:
Given a physical processor that has k elements, it can achieve a parallelism of at most O(k). If you process n numbers arbitrarily, it would still process them at a rate related to k. Also, you could formulate this problem physically. You could create n steel balls with weights proportional to the numbers you want to encode, which could be sorted by a centrifuge in theory. But here the number of atoms you are using is proportional to n, whereas in the standard case you have a limited number of atoms in a processor.
Another way to think about this is, say you have a small processor attached to each number and each processor can communicate with its neighbors, you could sort all those numbers in O(log(n)) time.
I worked in an office summers after high school when I started college. I had studied in AP Computer Science, among other things, sorting and searching.
I applied this knowledge in several physical systems that I can recall:
Natural merge sort to start…
A system printed multipart forms including a file-card-sized tear off, which needed to be filed in a bank of drawers.
I started with a pile of them and sorted the pile to begin with. The first step is picking up 5 or so, few enough to be easily placed in order in your hand. Place the sorted packet down, criss-crossing each stack to keep them separate.
Then, merge each pair of stacks, producing a larger stack. Repeat until there is only one stack.
…Insertion sort to complete
It is easier to file the sorted cards, as each next one is a little farther down the same open drawer.
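The card procedure above maps fairly directly onto code. This is a minimal sketch, with the function name and the `packet_size` default chosen for the example: small packets are sorted "in the hand", then neighbouring stacks are merged pairwise until one stack remains.

```python
from heapq import merge

def packet_merge_sort(cards, packet_size=5):
    """Sort small packets first, then merge pairs of stacks repeatedly."""
    # Step 1: pick up a few cards at a time and order each packet.
    stacks = [sorted(cards[i:i + packet_size])
              for i in range(0, len(cards), packet_size)]
    # Step 2: merge neighbouring stacks until only one stack is left.
    while len(stacks) > 1:
        stacks = [
            list(merge(stacks[i], stacks[i + 1])) if i + 1 < len(stacks)
            else stacks[i]                     # odd stack out waits a round
            for i in range(0, len(stacks), 2)
        ]
    return stacks[0] if stacks else []
```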
Radix sort
This one nobody else understood how I did it so fast, despite repeated tries to teach it.
A large box of check stubs (the size of punch cards) needs to be sorted. It looks like playing solitaire on a large table—deal out, stack up, repeat.
In general
30 years ago, I did notice what you’re asking about: the ideas transfer to physical systems quite directly because there are relative costs of comparisons and handling records, and levels of caching.
Going beyond well-understood equivalents
I recall an essay about your topic, and it brought up the spaghetti sort. You trim a length of dried noodle to indicate the key value, and label it with the record ID. This is O(n), simply processing each item once.
Then you grab the bundle and tap one end on the table. They align on the bottom edges, and they are now sorted. You can trivially take off the longest one, and repeat. The read-out is also O(n).
There are two things going on here in the “real world” that don’t correspond to algorithms. First, aligning the edges is a parallel operation. Every data item is also a processor (the laws of physics apply to it). So, in general, you scale the available processing with n, essentially dividing your classic complexity by a factor of n.
Second, how does aligning the edges accomplish a sort? The real sorting is in the read-out which lets you find the longest in one step, even though you did compare all of them to find the longest. Again, divide by a factor of n, so finding the largest is now O(1).
Another example is using analog computing: a physical model solves the problem “instantly” and the prep work is O(n). In principle the computation is scaling with the number of interacting components, not the number of prepped items. So the computation scales with n². The example I'm thinking of is a weighted multi-factor computation, which was done by drilling holes in a map, hanging weights from strings passing through the holes, and gathering all the strings on a ring.
Sorting is still O(n) total time. That it is faster than that is because of Parallelization.
You could view a centrifuge as a Bucketsort of n atoms, parallelized over n cores (each atom acts as a processor).
You can make sorting faster by parallelization, but only by a constant factor because the number of processors is limited: O(n/C) is still O(n) (CPUs usually have < 10 cores and GPUs < 6000).
The centrifuge is not sorting the nodes; it applies a force to them and they react to it in parallel.
So if you were to implement a bubble sort where each node moves itself up or down in parallel based on its "density", you'd have a centrifuge implementation.
Keep in mind that in the real world you can run a very large number of parallel tasks, whereas in a computer the maximum number of truly parallel tasks is equal to the number of physical processing units.
In the end, you would also be limited with the access to the list of elements because it cannot be modified simultaneously by two nodes...
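A bubble sort in which every element acts at once is usually called odd-even transposition sort. The sketch below is an assumption-laden illustration rather than a centrifuge model: the function name is chosen here, and the inner loop, whose swaps touch disjoint pairs and could run simultaneously on n processors, is simulated sequentially.

```python
def odd_even_transposition_sort(values):
    """Parallel-style bubble sort: n rounds of independent neighbour swaps."""
    a = list(values)
    n = len(a)
    for round_no in range(n):              # n rounds are sufficient
        start = round_no % 2               # alternate even and odd pairs
        for i in range(start, n - 1, 2):   # disjoint pairs: parallelizable
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

With one processor per pair, each round takes constant time, giving O(n) time overall at the price of O(n) processors.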
Would it be possible with some overhead on each node (some value or
method tacked on to each of the nodes) to 'force' the order of the
When we sort using computer programs we select a property of the values being sorted. That's commonly magnitude of the number or the alphabetical order.
Something like a centrifuge, where only each element cares about its
relative position in space (in relation to other nodes)
This analogy aptly reminds me of simple bubble sort. How smaller numbers bubble up in each iteration. Like your centrifuge logic.
So to answer this, don't we actually do something of that sort in software based sorting?
First of all, you are comparing two different contexts. One is logic (the computer) and the other is physics. So far it has been shown that we can model some parts of physics with mathematical formulas, and we as programmers can use these formulas to simulate (some parts of) physics in the logic world (e.g. the physics engine in a game engine).
Second, we have some possibilities in the computer (logic) world that are nearly impossible in physics. For example, we can access memory and find the exact location of each entity at any time, but in physics that is a huge problem (Heisenberg's uncertainty principle).
Third, if you want to map a centrifuge and its operation in the real world to the computer world, it is as if someone (God) has given you a super-computer with all the rules of physics applied, and you are doing your small sort in it (using the centrifuge). By saying that your sorting problem was solved in O(n), you are ignoring the huge physics simulation going on in the background...
Consider: is "centrifuge sort" really scaling better? Think about what happens as you scale up.
The test tubes have to get longer and longer.
The heavy stuff has to travel further and further to get to the bottom.
The moment of inertia increases, requiring more power and longer times to accelerate up to sorting speed.
It's also worth considering other problems with centrifuge sort. For example, you can only operate on a narrow size scale. A computer sorting algorithm can handle integers from 1 to 2^1024 and beyond, no sweat. Put something that weighs 2^1024 times as much as a hydrogen atom into a centrifuge and, well, that's a black hole and the galaxy has been destroyed. The algorithm failed.
Of course the real answer here is that computational complexity is relative to some computational model, as mentioned in another answer. And "centrifuge sort" doesn't make sense in the context of common computational models, such as the RAM model or the IO model or multitape Turing machines.
Another perspective is that what you're describing with the centrifuge is analogous to what's been called the "spaghetti sort". Say you have a box of uncooked spaghetti rods of varying lengths. Hold them in your fist, and loosen your hand to lower them vertically so the ends are all resting on a horizontal table. Boom! They're sorted by height. O(constant) time. (Or O(n) if you include picking the rods out by height and putting them in a . . . spaghetti rack, I guess?)
You can note there that it's O(constant) in the number of pieces of spaghetti, but, due to the finite speed of sound in spaghetti, it's O(n) in the length of the longest strand. So nothing comes for free.

paralleling sequence of matrix multiplication for speed up

In my function, there is a lot of element wise matrix multiplication which are independent. Is there a way to calculate them in parallel ?
All of them are very simple operations, but 70% of my run time is for these parts of code because this function is invoked millions of times.
function [r1,r2,r3]=backward(A,B,C,D,E,F,r1,r2,r3)
for i=1:300
EDIT: After writing the answer, I observed that you are not multiplying all the input matrices by means of matrix multiplication. Some of them are elementwise multiplications. If this is what you intended, the following answer won't apply.
You are looking for an optimal algorithm for computing product of multiple matrices. People have studied this problem long ago and they have come up with a dynamic programming algorithm to decide the optimal order.
For example, if A is of size 10000 x 1, B is of size 1 x 10000 and C is of size 10000 x 1, it would be a lot more efficient if we computed A*B*C as A*(B*C), instead of (A*B)*C. The proof of correctness of this technique lies in the fact that matrix multiplication is associative. You can read more about this on Wikipedia.
If you want a good quality MATLAB implementation of this, you can find it here. It takes the matrices as input and gives out the product. It seems like this implementation does a decent job of finding the optimal way of computing "upto" 10 matrices.
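To make the ordering argument concrete, here is a minimal sketch of the classic dynamic programming recurrence for the matrix chain problem. It is written in Python rather than MATLAB, it only computes the cheapest cost rather than the product, and the name `matrix_chain_cost` is made up for this example. It is not the implementation linked above.

```python
def matrix_chain_cost(dims):
    """Minimum number of scalar multiplications for a matrix chain.

    Matrix i has dims[i] rows and dims[i + 1] columns, so a chain of
    m matrices is described by m + 1 numbers (m >= 1 is assumed).
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]     # cost[i][j]: best cost for i..j
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)       # split point between k and k + 1
            )
    return cost[0][n - 1]
```

For the sizes in the example, `matrix_chain_cost([10000, 1, 10000, 1])` gives 20,000 multiplications, the cost of A*(B*C), whereas (A*B)*C would need 200,000,000.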
First thing to note: the last 3 variables that you provide as input are not being used. I don't think this will matter much, but it would be better to clean it up.
Now the actual answer:
MATLAB is all about matrix operations, and this has been highly optimized. Even using C++ you will not expect a significant speedup (and be wary of a slowdown). As such, with the information that is provided in the question, the conclusion would be that you cannot do anything to speed up independent matrix calculations.
That being said: If you could reduce the number of sequential function calls, there may be something to gain.
It is hard to say how to do this in general, but two ideas:
If you call the function in a for loop, use a parfor loop instead (assuming you have the Parallel Computing Toolbox). Otherwise, manually break up the loop and open 4 MATLAB instances to parallelize it (this can be automated if needed).
See whether you really need this many function calls to small matrix operations. If you could improve your algorithm, that could offer a huge improvement, but otherwise you may still be able to combine multiple matrices (multiple versions of A with multiple versions of B, for instance) and do 1 big multiplication rather than 100 tiny ones.

A* Algorithm for very large graphs, any thoughts on caching shortcuts?

I'm writing a courier/logistics simulation on OpenStreetMap maps and have realised that the basic A* algorithm as pictured below is not going to be fast enough for large maps (like Greater London).
The green nodes correspond to ones that were put in the open set/priority queue and due to the huge number (the whole map is something like 1-2 million), it takes 5 seconds or so to find the route pictured. Unfortunately 100ms per route is about my absolute limit.
Currently, the nodes are stored in both an adjacency list and also a spatial 100x100 2D array.
I'm looking for methods where I can trade off preprocessing time, space and if needed optimality of the route, for faster queries. The straight-line Haversine formula for the heuristic cost is the most expensive function according to the profiler - I have optimised my basic A* as much as I can.
For example, I was thinking if I chose an arbitrary node X from each quadrant of the 2D array and run A* between each, I can store the routes to disk for subsequent simulations. When querying, I can run A* search only in the quadrants, to get between the precomputed route and the X.
Is there a more refined version of what I've described above or perhaps a different method I should pursue. Many thanks!
For the record, here are some benchmark results for arbitrarily weighting the heuristic cost and computing the path between 10 pairs of randomly picked nodes:
Weight   AvgDist%      Time (ms)
1        1             1461.2
1.05     1             1327.2
1.1      1             900.7
1.2      1.019658848   196.4
1.3      1.027619169   53.6
1.4      1.044714394   33.6
1.5      1.063963413   25.5
1.6      1.071694171   24.1
1.7      1.084093229   24.3
1.8      1.092208509   22
1.9      1.109188175   22.5
2        1.122856792   18.2
2.2      1.131574742   16.9
2.4      1.139104895   15.4
2.6      1.140021962   16
2.8      1.14088128    15.5
3        1.156303676   16
4        1.20256964    13
5        1.19610861    12.9
Surprisingly increasing the coefficient to 1.1 almost halved the execution time whilst keeping the same route.
You should be able to make it much faster by trading off optimality. See "Admissibility and optimality" on Wikipedia.
The idea is to use an epsilon value which will lead to a solution no worse than 1 + epsilon times the optimal path, but which will cause fewer nodes to be considered by the algorithm. Note that this does not mean that the returned solution will always be 1 + epsilon times the optimal path. This is just the worst case. I don't know exactly how it would behave in practice for your problem, but I think it is worth exploring.
You are given a number of algorithms that rely on this idea on wikipedia. I believe this is your best bet to improve the algorithm and that it has the potential to run in your time limit while still returning good paths.
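The epsilon idea can be sketched as weighted A*, where the heuristic is multiplied by a weight of 1 + epsilon. This is a toy illustration on a 4-connected grid with unit step costs and a Manhattan heuristic, all of which are assumptions made for the example and not the asker's road graph. The function name and return shape are likewise invented here.

```python
import heapq

def weighted_astar(grid, start, goal, weight=1.0):
    """A* on a grid with the heuristic inflated by `weight`.

    grid[r][c] == 1 marks a wall. Returns (path_cost, nodes_expanded),
    with path_cost None when the goal is unreachable. With an admissible
    heuristic the cost found is at most `weight` times the optimum.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):                       # Manhattan distance to the goal
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best_g = {start: 0}
    frontier = [(weight * h(start), 0, start)]
    expanded = 0
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if g > best_g[cell]:
            continue                   # stale queue entry
        if cell == goal:
            return g, expanded
        expanded += 1
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    heapq.heappush(
                        frontier, (ng + weight * h((nr, nc)), ng, (nr, nc)))
    return None, expanded
```

Running it with weight 1.0 and then with a larger weight on the same grid shows the trade-off: the second call is allowed to return a longer path, but it typically expands fewer nodes.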
Since your algorithm does deal with millions of nodes in 5 seconds, I assume you also use binary heaps for the implementation, correct? If you implemented them manually, make sure they are implemented as simple arrays and that they are binary heaps.
There are specialist algorithms for this problem that do a lot of pre-computation. From memory, the pre-computation adds information to the graph that A* uses to produce a much more accurate heuristic than straight line distance. Wikipedia gives the names of a number of methods at and says that Hub Labelling is the leader. A quick search on this turns up An older one, using A*, is at
Do you really need to use Haversine? To cover London, I would have thought you could have assumed a flat earth and used Pythagoras, or stored the length of each link in the graph.
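As a rough check of the flat-earth suggestion, the sketch below compares Haversine with a Pythagorean (equirectangular) approximation. The function names and the Earth radius constant are assumptions made for this example, and whether the approximation stays admissible as an A* heuristic is not examined here.

```python
import math

EARTH_RADIUS_M = 6_371_000.0   # mean Earth radius, an assumed constant

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def flat_m(lat1, lon1, lat2, lon2):
    """Pythagoras on a locally flat patch (equirectangular approximation)."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    x = math.radians(lon2 - lon1) * math.cos(mean_lat)   # east-west extent
    y = math.radians(lat2 - lat1)                        # north-south extent
    return EARTH_RADIUS_M * math.hypot(x, y)
```

For a map the size of a single city, the cosine term could be precomputed once for a central latitude, which would leave only a few multiplications and one square root per heuristic call.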
There's a really great article that Microsoft Research wrote on the subject:
The original paper is hosted here (PDF):
Essentially there's a few things you can try:
Start from both the source and the destination. This helps to minimize the amount of wasted work that you'd perform when traversing from the source outwards towards the destination.
Use landmarks and highways. Essentially, find some positions in each map that are commonly taken paths and perform some pre-calculation to determine how to navigate efficiently between those points. If you can find a path from your source to a landmark, then to other landmarks, then to your destination, you can quickly find a viable route and optimize from there.
Explore algorithms like the "reach" algorithm. This helps to minimize the amount of work that you'll do when traversing the graph by minimizing the number of vertices that need to be considered in order to find a valid route.
GraphHopper does two more things to get fast, non-heuristic and flexible routing (note: I'm the author and you can try it online here)
A not so obvious optimization is to avoid 1:1 mapping of OSM nodes to internal nodes. Instead GraphHopper uses only junctions as nodes and saves roughly 1/8th of traversed nodes.
It has efficient implementations of A*, Dijkstra and e.g. one-to-many Dijkstra, which makes a route in under 1s possible through the whole of Germany. The (non-heuristic) bidirectional version of A* makes this even faster.
So it should be possible to get you fast routes for Greater London.
Additionally, the default mode is the speed mode, which makes everything an order of magnitude faster (e.g. 30ms for Europe-wide routes) but less flexible, as it requires preprocessing (Contraction Hierarchies). If you don't like this, just disable it and further fine-tune the included streets for cars or, probably better, create a new profile for trucks, e.g. excluding service streets and tracks, which should give you a further 30% boost. And as with any bidirectional algorithm you could easily implement a parallel search.
I think it's worth working out your idea with "quadrants". More strictly, I'd call it a low-resolution route search.
You may pick X connected nodes that are close enough, and treat them as a single low-resolution node. Divide your whole graph into such groups, and you get a low-resolution graph. This is a preparation stage.
In order to compute a route from source to target, first identify the low-res nodes they belong to, and find the low-resolution route. Then improve your result by finding the route on the high-resolution graph, restricting the algorithm to nodes that belong to the low-resolution nodes of the low-resolution route (optionally you may also consider neighbouring low-resolution nodes up to some depth).
This may also be generalized to multiple resolutions, not just high/low.
At the end you should get a route that is close enough to optimal. It's locally optimal, but may be somewhat worse than optimal globally by some extent, which depends on the resolution jump (i.e. the approximation you make when a group of nodes is defined as a single node).
There are dozens of A* variations that may fit the bill here. You have to think about your use cases, though.
Are you memory- (and also cache-) constrained?
Can you parallelize the search?
Will your algorithm implementation be used in one location only (e.g. Greater London and not NYC or Mumbai or wherever)?
There's no way for us to know all the details that you and your employer are privy to. Your first stop thus should be CiteSeer or Google Scholar: look for papers that treat pathfinding with the same general set of constraints as you.
Then downselect to three or four algorithms, do the prototyping, test how they scale up and finetune them. You should bear in mind you can combine various algorithms in the same grand pathfinding routine based on distance between the points, time remaining, or any other factors.
As has already been said, based on the small scale of your target area, dropping Haversine is probably your first step, saving precious time on expensive trig evaluations. NOTE: I do not recommend using Euclidean distance in lat, lon coordinates. Reproject your map into e.g. a transverse Mercator projection near the center and use Cartesian coordinates in yards or meters!
Precomputing is the second one, and changing compilers may be an obvious third idea (switch to C or C++ - see for details).
Extra optimization steps may include getting rid of dynamic memory allocation, and using efficient indexing for search among the nodes (think R-tree and its derivatives/alternatives).
I worked at a major Navigation company, so I can say with confidence that 100 ms should get you a route from London to Athens even on an embedded device. Greater London would be a test map for us, as it's conveniently small (easily fits in RAM - this isn't actually necessary)
First off, A* is entirely outdated. Its main benefit is that it "technically" doesn't require preprocessing. In practice, you need to pre-process an OSM map anyway so that's a pointless benefit.
The main technique to give you a huge speed boost is arc flags. If you divide the map in say 5x6 sections, you can allocate 1 bit position in a 32 bits integer for each section. You can now determine for each edge whether it's ever useful when traveling to section {X,Y} from another section. Quite often, roads are bidirectional and this means only one of the two directions is useful. So one of the two directions has that bit set, and the other has it cleared. This may not appear to be a real benefit, but it means that on many intersections you reduce the number of choices to consider from 2 to just 1, and this takes just a single bit operation.
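The bit test described above can be sketched as follows. This is only the query-side check: the preprocessing that decides which bits to set on each edge is not shown, and the region numbering (row * 6 + col), the helper names, and the edge tuple layout are assumptions made for illustration.

```python
GRID_COLS = 6   # assumed layout: 5 rows x 6 columns = 30 sections, fits in 32 bits

def region_bit(row, col):
    """Bit mask for the map section at (row, col)."""
    return 1 << (row * GRID_COLS + col)

def edge_is_useful(edge_flags, target_row, target_col):
    """True if preprocessing flagged this edge for travel into the target section."""
    return (edge_flags & region_bit(target_row, target_col)) != 0

def prune_outgoing(edges, target_row, target_col):
    """Keep only the (neighbour, cost, flags) edges flagged for the target section."""
    mask = region_bit(target_row, target_col)
    return [(nbr, cost) for nbr, cost, flags in edges if flags & mask]
```

During a query the target section is fixed, so the mask is computed once and every edge relaxation pays for a single AND before deciding whether to explore that edge at all.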
Usually A* suffers from too much memory consumption rather than time struggles.
However, I think it could be useful to first compute only with nodes that are part of "big streets", since you would usually choose a highway over a tiny alley.
I guess you may already use this for your weight function, but you can be faster if you use a priority queue to decide which node to test next for further travelling.
Also, you could try reducing the graph to only those nodes that are part of low-cost edges and then find a way from the start/end to the closest of these nodes.
So you have 2 paths from start to the "big street" and the "big street" to end.
You can now compute the best path between the two nodes that are part of the "big streets" in a reduced graph.
Old question, but yet:
Try to use a different heap than the "binary heap". The heap with the best asymptotic complexity is definitely the Fibonacci heap, and its wiki page has a nice overview:
Note that a binary heap has simpler code, is implemented over an array, and traverses that array predictably, so a modern CPU executes binary heap operations much faster.
However, given a big enough dataset, other heaps will win over the binary heap because of their complexities...
This question seems to have a big enough dataset.

Parallel Binary Search

I'm just starting to learn about parallel programming, and I'm looking at binary search.
This can't really be optimized by throwing more processors at it right? I know it's supposedly dividing and conquering, but you're really "decreasing and conquering" (from Wikipedia).
Or could you possibly parallelize the comparisons? (if X is less than array[mid], search from low to mid - 1; else if X is greater than array[mid] search from mid + 1 to high, else return mid, the index of X)
Or how about you give half of the array to one processor to do binary search on, and the other half to another? Wouldn't that be wasteful though? Because it's decreasing and conquering rather than simply dividing and conquering? Thoughts?
You can easily use parallelism.
With k < n processors, split the array into k groups of n/k elements each and assign a processor to each group.
Run binary search on each group.
Now the time is log(n/k).
There is also a CREW (concurrent read, exclusive write) PRAM method that takes log(n)/log(k+1).
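The log(n)/log(k+1) figure comes from probing k positions per round, which cuts the candidate range into k + 1 pieces. The sketch below simulates that idea sequentially. The name `kary_search` and the returned round counter are inventions for this example, and the probe loop stands in for k processors comparing at the same time.

```python
def kary_search(sorted_values, key, k):
    """Search with k probes per round; returns (index or -1, rounds used)."""
    lo, hi = 0, len(sorted_values)     # candidates are sorted_values[lo:hi]
    rounds = 0
    while lo < hi:
        rounds += 1
        size = hi - lo
        # k probe positions cutting [lo, hi) into at most k + 1 pieces
        probes = sorted({lo + (size * (i + 1)) // (k + 1) for i in range(k)})
        new_lo, new_hi = lo, hi
        for p in probes:               # independent comparisons
            if sorted_values[p] == key:
                return p, rounds
            if sorted_values[p] < key:
                new_lo = p + 1         # key lies to the right of p
            else:
                new_hi = p             # key lies to the left of p
                break
        lo, hi = new_lo, new_hi
    return -1, rounds
```

With k = 1 this reduces to ordinary binary search. With k = 3 on 1024 entries the range shrinks by roughly a factor of 4 per round, so about half as many rounds are needed.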
I would think it certainly qualifies for parallelisation. At least, across two threads. Have one thread do a depth-first search, and the other do a breadth-first search. The winner is the algorithm that performs the fastest, which may be different from data-set to data-set.
I don't have much experience in parallel programming, but I doubt this is a good candidate for parallel processing. Each step of the algorithm depends on performing one comparison, and then proceeding down a set "path" based on this comparison (you either found your value, or now have to keep searching in a set "direction" based on the comparison). Two separate threads performing the same comparison won't get you anywhere any faster, and separate threads will both need to rely on the same comparison to decide what to do next, so they can't really do any useful, divided work on their own.
As far as your idea of splitting the array, I think you are just negating the benefit of binary search in this case. Your value (assuming it's in your array), will either be in the top or the bottom half of your array. The first comparison (at the midpoint) in a binary search is going to tell you which half you should be looking in. If you take that even further, consider breaking an array of N elements into N different binary searches (a naive attempt to parallel-ize). You are now doing N comparisons, when you don't need to. You are losing the power of binary search, in that each comparison will narrow down your search to the appropriate subset.
Hope that helps. Comments welcome.
Yes, in the classical sense of parallelization (multi-core), binary search and BST are not much better.
There are techniques like having multiple copies of the BST on L1 cache for each processor. Only one processor is active but the gains from having multiple L1 caches can be great (4 cycles for L1 vs 14 cycles for L2).
In real world problems you are often searching multiple keys at the same time.
Now, there is another kind of parallelization that can help: SIMD! Check out "Fast architecture sensitive tree search on modern CPUs and GPUs" by a team from Intel/UCSC/Oracle (SIGMOD 2010). It's very cool. BTW I'm basing my current research project on this very paper.
Parallel implementation can speed up a binary search, but the improvement is not particularly significant. Worst case, the time required for a binary search is log_2(n) where n is the number of elements in the list. A simple parallel implementation breaks the master list into k sub-lists to be bin-searched by parallel threads. The resulting worst-case time for the binary search is log_2(n/k) realizing a theoretical decrease in the search time.
A list of 1024 entries takes as many as 10 cycles to binary search using a single thread. Using 4 threads, each thread would only take 8 cycles to complete the search. And using 8 threads, each thread takes 7 cycles. Thus, an 8-threaded parallel binary search could be up to 30% faster than the single-threaded model.
However, this speed-up should not be confused with an improvement in efficiency: the 8-threaded model actually executes 8 * 7 = 56 comparisons to complete the search, compared to the 10 comparisons executed by the single-threaded binary search. It is up to the discretion of the programmer whether the marginal gain in speed from a parallel binary search is appropriate or advantageous for their application.
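The arithmetic above can be reproduced with a few lines of Python. This uses the answer's simplified cost model of log_2(n/k) steps per thread; the function name worst_case_steps is just a label chosen for this sketch.

```python
from math import ceil, log2


def worst_case_steps(n, k):
    """Steps per thread when a list of n items is split into k sub-lists."""
    return ceil(log2(n / k))


n = 1024
for k in (1, 4, 8):
    steps = worst_case_steps(n, k)
    # Wall-clock steps shrink slowly, while total comparisons grow with k.
    print(f"threads={k}  steps per thread={steps}  total comparisons={k * steps}")
```

Going from 1 to 8 threads cuts the per-thread steps from 10 to 7 while the total work rises from 10 to 56 comparisons.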
I am pretty sure binary search can be sped up by a factor of log(M), where M is the number of processors. Note that log(n/M) = log(n) - log(M) > log(n)/log(M) for a constant M. I do not have a proof of a tight lower bound, but if M = n, the execution time is O(1), which cannot be improved. An algorithm sketch follows.
Divide your sorted_arraylist into M chunks of size n/M.
Apply one step of comparison to the middle element of each chunk.
If a comparator signals equality, return the address and terminate.
Otherwise, identify both adjacent chunks where comparators signaled (>) and (<), respectively.
Form a new Chunk starting from the element following the one that signaled (>) and ending at the element preceding the one that signaled (<).
If they are the same element, return fail and terminate.
Otherwise, Parallel_Binary_Search(Chunk)
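A minimal Python sketch of the steps above, under stated assumptions: the M comparisons in each round are simulated with a list comprehension rather than run on real processors, the recursion is written as a loop, and the function name and half-open index convention are choices made for this example rather than part of the original sketch.

```python
def parallel_binary_search(data, target, m):
    """Return an index of target in sorted data, or -1 if it is absent.

    Each round probes the middle element of up to m chunks. The probes
    within one round are independent, so m processors could do them at once.
    """
    lo, hi = 0, len(data)  # current search range is data[lo:hi]
    while lo < hi:
        size = hi - lo
        chunks = min(m, size)  # every chunk holds at least one element
        probes = []
        for j in range(chunks):
            c_lo = lo + j * size // chunks
            c_hi = lo + (j + 1) * size // chunks
            probes.append((c_lo + c_hi) // 2)  # middle of chunk j
        # The "parallel" step: -1 if probe < target, 0 if equal, 1 if greater.
        signs = [(data[p] > target) - (data[p] < target) for p in probes]
        for p, s in zip(probes, signs):
            if s == 0:
                return p    # a comparator signalled equality
            if s < 0:
                lo = p + 1  # target lies to the right of this probe
            else:
                hi = p      # target lies to the left of this probe
                break
    return -1               # range is empty: target not present


data = list(range(0, 2000, 2))
print(parallel_binary_search(data, 1234, 8))  # index of 1234
print(parallel_binary_search(data, 1235, 8))  # absent value
```

After a round, the remaining range lies between two adjacent probes, so it shrinks by roughly a factor of M per round instead of 2. With m = 1 the loop reduces to ordinary binary search.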

Best algorithm to minimize an output value by varying input data

I have an incoming stream of data and a set of transformations, which can be applied to the stream in various combinations to get a numerical output value. I need to find which subset of the transformation minimizes the number.
The data is an ordered list of numbers with metadata attached to each one.
The transformations are quasi-linear: they are technically executable code in a Turing-complete language, but they are known to belong to a restricted subset which always halts, and they transform the input number to output number with arithmetic operations, whose flow is dependent on metadata attached. Moreover, the operations are almost all the time linear (but they are not bound to be—meaning this may be a place for optimization, but not restriction).
Basically, a brute-force approach involving 2^n steps (where n is the number of transformations) would work, but it is woefully inefficient, and I'm almost absolutely sure this would not scale in production. Are there any algorithms to solve this task faster?
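For reference, the brute-force baseline could look roughly like the sketch below. Everything here is an assumption made for illustration: the transformations are modelled as plain Python callables on a list of numbers (metadata is ignored), they are applied in a fixed order, and score stands for whatever reduces the transformed stream to the single output value.

```python
from itertools import combinations


def best_subset(data, transforms, score):
    """Try every subset of transforms (2^n of them) and keep the lowest score."""
    best_value, best_combo = score(data), ()  # empty subset as the baseline
    n = len(transforms)
    for r in range(1, n + 1):
        for combo in combinations(range(n), r):
            out = data
            for i in combo:  # apply the chosen transforms in index order
                out = transforms[i](out)
            value = score(out)
            if value < best_value:
                best_value, best_combo = value, combo
    return best_value, best_combo


# Hypothetical toy transformations, purely for demonstration.
transforms = [
    lambda xs: [x + 1 for x in xs],
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x - 3 for x in xs],
]
print(best_subset([1, 2, 3], transforms, sum))
```

The nested loops visit all 2^n subsets, which is exactly why this approach stops being practical as n grows.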
If almost all operations are linear, can't you use linear programming as a heuristic?
And maybe in between do checks whether some transformations are particularly slow, in which case you can still switch to brute force.
Do you need to find the optimal output?
