
What is the difference between Linear search and Binary search?

A linear search looks down a list, one item at a time, without jumping. In complexity terms this is an O(n) search - the time taken to search the list gets bigger at the same rate as the list does.
A binary search is when you start with the middle of a sorted list, and see whether that's greater than or less than the value you're looking for, which determines whether the value is in the first or second half of the list. Jump to halfway through the sublist, and compare again, etc. This is pretty much how humans typically look up a word in a dictionary (although we use better heuristics, obviously - if you're looking for "cat" you don't start off at "M"). In complexity terms this is an O(log n) search - the number of search operations grows more slowly than the list does, because you're halving the "search space" with each operation.
As an example, suppose you were looking for U in an A-Z list of letters (index 0-25; we're looking for the value at index 20).
A linear search would ask:
list[0] == 'U'? No.
list[1] == 'U'? No.
list[2] == 'U'? No.
list[3] == 'U'? No.
list[4] == 'U'? No.
list[5] == 'U'? No.
...
list[20] == 'U'? Yes. Finished.
The binary search would ask:
Compare list[12] ('M') with 'U': Smaller, look further on. (Range=13-25)
Compare list[19] ('T') with 'U': Smaller, look further on. (Range=20-25)
Compare list[22] ('W') with 'U': Bigger, look earlier. (Range=20-21)
Compare list[20] ('U') with 'U': Found it! Finished.
Comparing the two:
Binary search requires the input data to be sorted; linear search doesn't
Binary search requires an ordering comparison; linear search only requires equality comparisons
Binary search has complexity O(log n); linear search has complexity O(n) as discussed earlier
Binary search requires random access to the data; linear search only requires sequential access (this can be very important - it means a linear search can stream data of arbitrary size; see the sketch below)
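A minimal Python sketch of that streaming point (not from the original answer; the generator is purely illustrative). Because a linear search needs only sequential access and constant memory, it can consume an unbounded stream, which binary search cannot:
import itertools

def linear_search_stream(items, target):
    # Works on any iterable, even an endless one, as long as the
    # target actually appears somewhere in the stream.
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

squares = (n * n for n in itertools.count())  # an unbounded stream
print(linear_search_stream(squares, 144))     # 12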

Think of it as two different ways of finding your way in a phonebook. A linear search is starting at the beginning, reading every name until you find what you're looking for. A binary search, on the other hand, is when you open the book (usually in the middle), look at the name at the top of the page, and decide whether the name you're looking for comes before or after the one on the page. If the name you're looking for comes after, then you continue searching the later part of the book in this very fashion.

A linear search works by looking at each element in a list of data until it either finds the target or reaches the end. This results in O(n) performance on a given list.
A binary search comes with the prerequisite that the data must be sorted. We can leverage this information to decrease the number of items we need to look at to find our target. We know that if we look at a random item in the data (let's say the middle item) and that item is greater than our target, then all items to the right of that item will also be greater than our target. This means that we only need to look at the left part of the data. Basically, each time we search for the target and miss, we can eliminate half of the remaining items. This gives us a nice O(log n) time complexity.
Just remember that sorting data, even with the most efficient algorithm, will always be slower than a single linear search (the fastest comparison sorts are O(n log n)). So you should never sort data just to perform a single binary search later on. But if you will be performing many searches (say, at least O(log n) of them), it may be worthwhile to sort the data so that you can perform binary searches. You might also consider other data structures such as a hash table in such situations. The sketch below illustrates the trade-off.
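A hedged Python sketch of that trade-off (the data and queries are made up for illustration). Sorting costs O(n log n) once; each binary search afterwards costs O(log n), and a hash-based set skips sorting entirely:
import bisect

data = [29, 3, 15, 1, 19, 9, 11]   # unsorted input
queries = [15, 25, 3, 9]

# One-off search: a plain O(n) membership test needs no preprocessing.
print(15 in data)                  # True

# Many searches: pay O(n log n) once, then O(log n) per query.
sorted_data = sorted(data)
for q in queries:
    i = bisect.bisect_left(sorted_data, q)
    print(q, i < len(sorted_data) and sorted_data[i] == q)

# Alternative: a hash table gives O(1) average lookups with no sorting.
lookup = set(data)
print(25 in lookup)                # False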

A linear search starts at the beginning of a list of values, and checks 1 by 1 in order for the result you are looking for.
A binary search starts in the middle of a sorted array, and determines which side (if any) the value you are looking for is on. That "half" of the array is then searched again in the same fashion, halving the search range each time.

Make sure to deliberate about whether the win of the quicker binary search is worth the cost of keeping the list sorted (to be able to use the binary search). I.e. if you have lots of insert/remove operations and only an occasional search, the binary search could in total be slower than the linear search. See the sketch below.
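A small Python sketch of that maintenance cost (illustrative values; bisect.insort is the standard-library way to insert into a sorted list). The search part of each insert is O(log n), but shifting elements to make room is O(n):
import bisect

schedule = []
for value in [19, 3, 29, 11]:
    bisect.insort(schedule, value)  # O(log n) search + O(n) element shift
print(schedule)                     # [3, 11, 19, 29]
If inserts and removals dominate and searches are rare, a plain unsorted list with linear search can therefore win overall.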

Try this: Pick a random name "Lastname, Firstname" and look it up in your phonebook.
1st time: start at the beginning of the book, reading names until you find it, or else find the place where it would have occurred alphabetically and note that it isn't there.
2nd time: Open the book at the half way point and look at the page. Ask yourself, should this person be to the left or to the right. Whichever one it is, take that 1/2 and find the middle of it. Repeat this procedure until you find the page where the entry should be and then either apply the same process to columns, or just search linearly along the names on the page as before.
Time both methods and report back!
[also consider what approach is better if all you have is a list of names, not sorted...]

Linear search, also referred to as sequential search, looks at each element in sequence from the start to see if the desired element is present in the data structure. When the amount of data is small, this search is fast. It is easy, but the work needed is in proportion to the amount of data to be searched. Doubling the number of elements will double the time to search if the desired element is not present.
Binary search is efficient for larger arrays. Here we check the middle element: if the value is bigger than what we are looking for, we look in the first half; otherwise, we look in the second half. We repeat this until the desired item is found. The table must be sorted for binary search. It eliminates half the data at each iteration; it is logarithmic.
If we have 1000 elements to search, binary search takes about 10 steps, linear search up to 1000 steps (the quick calculation below confirms this).
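A one-line check of those step counts in Python (pure arithmetic, nothing assumed beyond the list size): a binary search needs about log2(n) comparisons, rounded up.
import math

n = 1000
linear_steps = n                        # worst case: check every element
binary_steps = math.ceil(math.log2(n))  # halve the range each step
print(linear_steps, binary_steps)       # 1000 10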

Binary search runs in O(log n) time whereas linear search runs in O(n) time; thus binary search has better asymptotic performance.

Linear Search looks through items until it finds the searched value.
Efficiency: O(n)
Example Python Code:
test_list = [1, 3, 9, 11, 15, 19, 29]
test_val1 = 25
test_val2 = 15

def linear_search(input_array, search_value):
    # Walk forward until we reach or pass the place the value would occupy.
    # (The early exit on '<' assumes a sorted input; a fully general linear
    # search would use equality comparisons only.)
    index = 0
    while (index < len(input_array)) and (input_array[index] < search_value):
        index += 1
    if index >= len(input_array) or input_array[index] != search_value:
        return -1
    return index

print(linear_search(test_list, test_val1))  # -1 (not present)
print(linear_search(test_list, test_val2))  # 4
Binary Search finds the middle element of the array and checks whether it is greater or lower than the search value. If the search value is smaller, it takes the left part of the array and finds the middle element of that part; if it is greater, it takes the right part of the array. It repeats the operation until it finds the searched value, or finishes the search if the value is not in the array.
Efficiency: O(log n)
Example Python Code:
test_list = [1, 3, 9, 11, 15, 19, 29]
test_val1 = 25
test_val2 = 15

def binary_search(input_array, value):
    low = 0
    high = len(input_array) - 1
    while low <= high:
        mid = (low + high) // 2  # integer division ('/' in the original Python 2)
        if input_array[mid] == value:
            return mid
        elif input_array[mid] < value:
            low = mid + 1
        else:
            high = mid - 1
    return -1

print(binary_search(test_list, test_val1))  # -1 (not present)
print(binary_search(test_list, test_val2))  # 4
Also you can see visualized information about Linear and Binary Search here: https://www.cs.usfca.edu/~galles/visualization/Search.html

For a clear understanding, please take a look at my codepen implementations https://codepen.io/serdarsenay/pen/XELWqN
The biggest difference is the need to sort your sample before applying binary search; therefore, for most "normal sized" samples (what counts as normal is debatable) it will be quicker to search with a linear search algorithm.
Here is the JavaScript code; for the HTML and CSS and a full running example, please refer to the codepen link above.
var unsortedhaystack = [];
var haystack = [];

function init() {
  unsortedhaystack = document.getElementById("haystack").value.split(' ');
}

function sortHaystack() {
  var t = timer('sort benchmark');
  haystack = unsortedhaystack.sort();
  t.stop();
}

var timer = function (name) {
  var start = new Date();
  return {
    stop: function () {
      var end = new Date();
      var time = end.getTime() - start.getTime();
      console.log('Timer:', name, 'finished in', time, 'ms');
    }
  };
};

function lineerSearch() {
  init();
  var t = timer('lineerSearch benchmark');
  var input = this.event.target.value;
  // scan every element, including the last one
  for (var i = 0; i < unsortedhaystack.length; i++) {
    if (unsortedhaystack[i] === input) {
      document.getElementById('result').innerHTML = 'result is... "' + unsortedhaystack[i] + '", on index: ' + i + ' of the unsorted array. Found' + ' within ' + i + ' iterations';
      console.log(document.getElementById('result').innerHTML);
      t.stop();
      return unsortedhaystack[i];
    }
  }
}
function binarySearch() {
  init();
  sortHaystack();
  var t = timer('binarySearch benchmark');
  var firstIndex = 0;
  var lastIndex = haystack.length - 1;
  var input = this.event.target.value;
  // currently point in the half of the array
  var currentIndex = (haystack.length - 1) / 2 | 0;
  var iterations = 0;
  while (firstIndex <= lastIndex) {
    currentIndex = (firstIndex + lastIndex) / 2 | 0;
    iterations++;
    if (haystack[currentIndex] < input) {
      firstIndex = currentIndex + 1;
    } else if (haystack[currentIndex] > input) {
      lastIndex = currentIndex - 1;
    } else {
      document.getElementById('result').innerHTML = 'result is... "' + haystack[currentIndex] + '", on index: ' + currentIndex + ' of the sorted array. Found' + ' within ' + iterations + ' iterations';
      console.log(document.getElementById('result').innerHTML);
      t.stop();
      return true;
    }
  }
}

Related

Finding the binary sub-tree that puts each element of a list into its own lowest order bucket

First, I have a list of numbers 'L', containing elements 'x' such that 0 < 'x' <= 'M' for all elements 'x'.
Second, I have a binary tree constructed in the following manner:
1) Each node has three properties: 'min', 'max', and 'vals' (vals is a list of numbers).
2) The root node has 'max'='M', 'min'=0, and 'vals'='L' (it contains all the numbers)
3) Each left child node has:
   max = (parent(max) + parent(min))/2
   min = parent(min)
4) Each right child node has:
   max = parent(max)
   min = (parent(max) + parent(min))/2
5) For each node, 'vals' is a list of numbers such that each element 'x' of 'vals' is also an element of 'L' and satisfies min < x <= max.
6) If a node has only one element in 'vals', then it has no children. (I.e., we are only looking for nodes for which 'vals' is non-empty.)
I'm looking for an algorithm to find the smallest sub-tree that satisfies the above properties. In other words, I'm trying to get a list of nodes such that each child-less node contains one - and only one - element in 'vals'.
I'm almost able to brute-force it with perl using insanely baroque data structures, but I keep bumping up against the limits of my mental capacity to keep track of all the temporary variables I've used, so I'm asking for help.
It's even cooler if you know an efficient algorithm to do the above.
If you'd like to know what I'm trying to do, it's this: find the smallest covering for a discrete wavelet packet transform to uniquely separate each frequency of the standard even-tempered musical notes. The trouble is that each iteration of the wavelet transform divides the frequency range it handles in half (hence the .../2 above defining the max and min), and the musical notes have frequencies which go up exponentially, so there's no obvious relationship between the two - not one I'm able to derive analytically (or experimentally, obviously, for that matter), anyway.
Since I'm really trying to find an algorithm so I can write a program, and since the problem is put in general terms, I didn't think it appropriate to put it in DSP. If there were a general "algorithms" group, then I think it would be better there, but this seems to be the right group for algorithms in the absence of such.
Please let me know if I can clarify anything, or if you have any suggestions - even in the absence of a complete answer - any help is appreciated!
After taking a break and two cups of coffee, I answered my own question. Indexing below is done starting at 1, MATLAB-style...
L=[]        // list of numbers to put in bins
sorted=[]   // list of "good" nodes
nodes=[]    // array of nodes to construct
sortedidx=1
nodes[1]={ min = 0, max = 22100, val = -1, lvl = 1, row = 1 }
for(j=1;j<=12;j++) {            // 12 is a reasonable guess
  level=j+1
  row=1
  for(i=2^j;i<2^(j+1);i++) {
    if(i/2 == int(i/2)) {       // even nodes are high-pass filters
      nodes[i]={ min = (nodes[i/2].min + nodes[i/2].max)/2,   // nodes[i/2] is parent
                 max = nodes[i/2].max,
                 val = -1,
                 lvl = level,
                 row = -1 }
    } else {                    // odd nodes are low-pass
      nodes[i]={ min = nodes[(i-1)/2].min,
                 max = (nodes[(i-1)/2].min + nodes[(i-1)/2].max)/2,
                 val = -1,
                 lvl = level,
                 row = -1 }
    }
    temp=[]                     // numbers of L that fall in this node's range
    tempidx=1
    Lidx=0
    for (k=1;k<=size(L);k++) {
      if (nodes[i].min < L[k] && L[k] <= nodes[i].max) {
        temp[tempidx++]=L[k]    // collect the matching number itself
        Lidx=k
      }
    }
    if (size(temp) == 1) {      // exactly one number in this bin: keep the node
      nodes[i].row = row++
      nodes[i].val = temp[1]
      delete L[Lidx]
      sorted[sortedidx++]=nodes[i]
    }
  }
}
Now array sorted[] contains exactly what I was looking for!
Hopefully this helps somebody else someday...

Algorithm to find matching real values in a list

I have a complex algorithm which calculates the result of a function f(x). In the real world f(x) is a continuous function. However, due to rounding errors in the algorithm, this is not the case in the computer program. (The original question illustrated this with a diagram.)
Furthermore, I have a list of several thousand values Fi.
I am looking for all the x values that meet an Fi value, i.e. f(xi) = Fi.
I can solve this problem by simply iterating through the x values, as in the following pseudo code:
for i=0 to NumberOfChecks-1 do
begin
  // calculate the function result with the algorithm
  x = i*(xmax-xmin)/NumberOfChecks;
  FunctionResult = CalculateFunctionResultWithAlgorithm(x);
  // loop through the value list to see if the function result matches a value in the list
  for j=0 to NumberOfValuesInTheList-1 do
  begin
    if Abs(FunctionResult-ListValues[j]) < Epsilon then
    begin
      // mark that element j of the list matches
      // and store the corresponding x value in the list
    end
  end
end
Of course it is necessary to use a high number of checks. Otherwise I will miss some x values. The higher the number of checks the more complete and accurate is the result. It is acceptable that the list is 90% or 95% complete.
The problem is that this brute force approach takes too much time. As I mentioned before the algorithm for f(x) is quite complex and with a high number of checks it takes too much time.
What would be a better solution for this problem?
Another way to do this is in two parts: generate all of the results, sort them, and then merge with the sorted list of existing results.
First step is to compute all of the results and save them along with the x value that generated them. That is:
results = list of <x, result>
for i = 0 to numberOfChecks
  // calculate the function result with the algorithm
  x = i*(xmax-xmin)/NumberOfChecks;
  FunctionResult = CalculateFunctionResultWithAlgorithm(x);
  results.Add(x, FunctionResult)
end for
Now, sort the results list by FunctionResult, and also sort the ListValues array.
You now have two sorted lists that you can move through linearly:
i = 0, j = 0;
while (i < results.length && j < ListValues.length)
{
  diff = ListValues[j] - results[i];
  if (Abs(diff) < Epsilon)
  {
    // mark this one with the x value
    // and move to the next result
    i = i + 1
  }
  else if (diff > 0)
  {
    // list value is much larger than result. Move to next result.
    i = i + 1
  }
  else
  {
    // list value is much smaller than result. Move to next list value.
    j = j + 1
  }
}
Sort the list, producing an array SortedListValues that contains the sorted ListValues and an array SortedListValueIndices that contains the index in the original array of each entry in SortedListValues. You only actually need the second of these, and you can create both of them with a single sort by sorting an array of tuples of (value, index) using value as the sort key.
Iterate over your range in 0..NumberOfChecks-1 and compute the value of the function at each step, then use a binary chop method to search for it in the sorted list.
Pseudo-code:
// sort as described above
SortedListValueIndices = sortIndices(ListValues);
for i=0 to NumberOfChecks-1 do
begin
//calculate the function result with the algorithm
x=i*(xmax-xmin)/NumberOfChecks;
FunctionResult=CalculateFunctionResultWithAlgorithm(x);
// do a binary chop to find the closest element in the list
highIndex = NumberOfValuesInTheList-1;
lowIndex = 0;
while true do
begin
if Abs(FunctionResult-ListValues[SortedListValueIndices[lowIndex]])<Epsilon then
begin
// find all elements in the range that match, breaking out
// of the loop as soon as one doesn't
for j=lowIndex to NumberOfValuesInTheList-1 do
begin
if Abs(FunctionResult-ListValues[SortedListValueIndices[j]])>=Epsilon then
break
//mark that element SortedListValueIndices[j] of the list matches
//and store the corresponding x value in the list
end
// break out of the binary chop loop
break
end
// break out of the loop once the indices match
if highIndex <= lowIndex then
break
// do the binary chop searching, adjusting the indices:
middleIndex = (lowIndex + 1 + highIndex) / 2;
if ListValues[SortedListValueIndices[middleIndex] < FunctionResult then
lowIndex = middleIndex;
else
begin
highIndex = middleIndex;
lowIndex = lowIndex + 1;
end
end
end
Possible complications:
The binary chop isn't taking the epsilon into account. Depending on your data this may or may not be an issue. If it is acceptable that the list is only 90 or 95% complete this might be OK. If not, then you'll need to widen the range to take it into account.
I've assumed you want to be able to match multiple x values for each FunctionResult. If that's not necessary you can simplify the code.
Naturally this depends very much on the data, and especially on the numeric distribution of Fi. Another problem is that f(x) looks very jumpy, which rules out any assumption that nearby x values give nearby results.
But one could optimise the search.
(The original answer illustrated the following with a picture.)
Walking through F(x) at sufficient granularity, define a rough min (red line) and max (green line), using a suitable tolerance (the "air" or "gap" in between). The area between min and max is "AREA".
See where each Fi value hits AREA, and make a stacked marking ("MARKING") on the X axis accordingly (this can cover multiple segments of X).
Where many MARKINGs pile on top of each other (a higher sum), do dense hit tests, increasing the overall chance of getting as many hits as possible. Elsewhere do sparser tests.
Tighten this scheme (decrease the tolerance) as much as you dare.
EDIT: Fi is a bit confusing. Is it an ordered array or does it have random order (as I assumed)?
Jim Mischel's solution would work in O(i+j) time instead of the O(i*j) solution that you currently have. But there is a (very) minor bug in his code. The correct code would be:
diff = ListValues[j] - results[i];  // no abs() here
if (Abs(diff) < Epsilon)            // add abs() here
{
  // mark this one with the x value
  // and move to the next result
  i = i + 1
}
The best method will rely on the nature of your function f(x).
1. The best solution is if you can construct the inverse of F(x) and use it. As you said, F(x) is continuous: therefore you can start by evaluating a small number of widely spaced points, find ranges that make sense, and then refine your estimate of the x for which f(x)=Fi. It is not bulletproof, but it is an option. E.g. Fi=5.7; f(1)=1.4, f(4)=4, f(16)=12.6, f(10)=10.1, f(7)=6.5, f(5)=5.1, f(6)=5.8; you can take 5 < x < 7.
2. Along the same lines as #1, if F(x) is hard to calculate, you can use interpolation and then evaluate F(x) only at the values that are probable.

Algorithm interview from Google

I am a long time lurker, and just had an interview with Google where they asked me this question:
Various artists want to perform at the Royal Albert Hall and you are responsible for scheduling their concerts. Requests for performing at the Hall are accommodated on a first come first served policy. Only one performance is possible per day and, moreover, there cannot be any concerts taking place within 5 days of each other.
Given a requested time d which is impossible (i.e. within 5 days of an already scheduled performance), give an O(log n)-time algorithm to find the next available day d2 (d2 > d).
I had no clue how to solve it, and now that the interview is over, I am dying to figure out how to solve it. Knowing how smart most of you folks are, I was wondering if you can give me a hand here. This is NOT for homework, or anything of that sort. I just want to learn how to solve it for future interviews. I tried asking follow up questions but he said that is all I can tell you.
You need a normal binary search tree of intervals of available dates. Just search for the interval containing d. If it does not exist, take the interval next (in-order) to the point where the search stopped.
Note: contiguous intervals must be fused together in a single node. For example: the available-dates intervals {2 - 15} and {16 - 23} should become {2 - 23}. This might happen if a concert reservation was cancelled.
Alternatively, a tree of non-available dates can be used instead, provided that contiguous non-available intervals are fused together.
Store the scheduled concerts in a binary search tree and find a feasible solution by doing a binary search.
Something like this:
FindDateAfter(tree, x):
  n = tree.root
  if n.date < x
    n = FindDateAfter(n.right, x)
  else if n.date > x and n.left.date < x
    return n
  return FindDateAfter(n.left, x)

FindGoodDay(tree, x):
  n = FindDateAfter(tree, x)
  while (n.date + 10 < n.right.date)
    n = FindDateAfter(n, n.date + 5)
  return n.date + 5
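For comparison, here is a hedged Python sketch of the same idea using a sorted list of booked days plus binary search (bisect). It assumes "within 5 days" means |d - booked| < 5 blocks a request; note that, unlike the interval-tree answers above, the loop can degrade past O(log n) over a long run of closely packed bookings, which is exactly what fusing contiguous blocked intervals avoids:
import bisect

def next_available(booked, d):
    # booked: sorted list of already-scheduled days; d: requested day.
    d2 = d
    i = bisect.bisect_left(booked, d2 - 4)  # first booking that could conflict
    while i < len(booked) and abs(booked[i] - d2) < 5:
        d2 = booked[i] + 5                  # jump past the blocking concert
        i += 1
    return d2

booked = [10, 15, 40]
print(next_available(booked, 12))  # 20 (12 conflicts with both 10 and 15)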
I've used a binary search tree (BST) that holds the ranges for valid free days that can be scheduled for performances.
One of the ranges must end with int.MaxValue, because we have an infinite number of days, so it can't be bounded.
The following code searches for the closest day to the requested day, and returns it.
The time complexity is O(H) where H is the tree height (usually H = log N, but it can degrade to H = N in some cases).
The space complexity is the same as the time complexity.
public static int FindConcertTime(TreeNode<Tuple<int, int>> node, int reqDay)
{
    // Not found!
    if (node == null)
    {
        return -1;
    }
    Tuple<int, int> currRange = node.Value;
    // Found range.
    if (currRange.Item1 <= reqDay &&
        currRange.Item2 >= reqDay)
    {
        // Return requested day.
        return reqDay;
    }
    // Go left.
    else if (currRange.Item1 > reqDay)
    {
        int suggestedDay = FindConcertTime(node.Left, reqDay);
        // Didn't find appropriate range in left nodes, or found day
        // is further than current option.
        if (suggestedDay == -1 || suggestedDay > currRange.Item1)
        {
            // Return current option.
            return currRange.Item1;
        }
        else
        {
            // Return suggested day.
            return suggestedDay;
        }
    }
    // Go right.
    // Will always find because the right-most node has "int.MaxValue" as Item2.
    else //if (currRange.Item2 < reqDay)
    {
        return FindConcertTime(node.Right, reqDay);
    }
}
Store the number of used nights per year, quarter, and month. To find a free night, find the first year that is not fully booked, then the quarter within that year, then the month. Then check each of the nights in that month.
Irregularities in the calendar system make this a little tricky, so instead of using years and months you can apply the idea to units of 4 nights as a "month", 16 nights as a "quarter", and so on.
Assume that at level 1 all schedule details are available.
Group the schedule into 16-day groups at level 2.
Group 16 level-2 entries at level 3.
Group 16 level-3 entries at level 4.
Depending on how many days you need to cover, keep adding levels.
Now search from the highest level downward, and do a binary search at the end. A sketch of this grouping idea follows.
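A hedged Python sketch of this multi-level grouping (the 16-way fan-out and the toy calendar size are illustrative). Each level-k entry is True when all 16 entries below it are booked, so the search can jump past entire fully-booked regions:
DAYS = 16 ** 3  # toy calendar of 4096 days

# full[0] holds per-day booked flags; higher levels summarize groups of 16.
full = [[False] * DAYS]
while len(full[-1]) > 1:
    full.append([False] * ((len(full[-1]) + 15) // 16))

def book(day):
    full[0][day] = True
    for k in range(1, len(full)):
        i = day // (16 ** k)
        full[k][i] = all(full[k - 1][i * 16:i * 16 + 16])

def first_free_from(day):
    d = day
    while d < DAYS:
        if not full[0][d]:
            return d
        # Climb to the largest fully-booked group containing d,
        # then jump past it in a single step.
        k = 0
        while k + 1 < len(full) and full[k + 1][d // 16 ** (k + 1)]:
            k += 1
        d = (d // 16 ** k + 1) * 16 ** k
    return None

for day in range(16):
    book(day)
print(first_free_from(0))  # 16: the whole first 16-day group is skipped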
Asymptotic complexity:
It describes how the runtime changes as the input grows.
Suppose we have an input string "abcd". If we traverse each character to find its length, the time taken is proportional to the number of characters in the string, i.e. n characters. Thus O(n).
But if we store the length of the string "abcd" in a variable, then no matter how long the string is, we can find its length just by looking at the variable len (len=4).
Example: return 23. No matter what the input is, the output is still 23. Thus the complexity is O(1): the program runs in constant time with respect to the input size.
For O(log n), the work shrinks in logarithmic steps.
https://drive.google.com/file/d/0B7eUOnXKVyeERzdPUE8wYWFQZlk/view?usp=sharing
The image at the above link shows the bent (logarithmic) line. For small inputs the two look similar, but as the input grows the O(log n) curve rises far more slowly than the linear O(n) line, which is why binary search wins on large inputs.
There are also best and worst case scenarios to consider, as in the example above.
You can also refer to this cheat for the algorithms: http://bigocheatsheet.com/
It was already mentioned above, but basically keep it simple with a binary search tree. You know a balanced binary tree has O(log n) lookup, so you already know what kind of algorithm you need.
All you have to do is come up with a tree node structure and use the binary tree insertion algorithm to find the next available date:
A possible one:
The tree node has two attributes: d (the date of the concert) and d+5 (the end date of the 5-day blocking period). Again, to keep it simple, use a timestamp for the two date attributes.
Now it is trivial to find the next available date by using the binary tree in-order insertion algorithm with the initial condition root = null.
Why not try Union-Find? You can group each concert day plus the next 5 days into one set, and then perform a FIND on the given day, which would return the representative of the next free slot: your next concert date.
If implemented using a tree, this gives O(log n) time complexity. A sketch of the idea follows.
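A hedged Python sketch of that union-find idea (names and the one-sided blocking are illustrative; the real problem blocks 5 days on both sides of a concert). Each blocked day points, possibly through a chain, at the next candidate day, and find() follows and compresses the chain so later queries stay fast:
parent = {}

def find(d):
    # Next available day at or after d, with path compression.
    root = d
    while root in parent:
        root = parent[root]
    while d in parent:
        parent[d], d = root, parent[d]
    return root

def book(d):
    # Reserve the first free day at or after d and block it plus the
    # 5 following days by pointing them all at the day after the block.
    day = find(d)
    for x in range(day, day + 6):
        parent[x] = day + 6
    return day

print(book(10))  # 10
print(book(12))  # 16 (12 is blocked, so the concert lands on 16)
print(find(11))  # 22 (11..15 blocked by 10; 16..21 blocked by 16)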

How to find the number of occurrences of a number in a sorted array in O(log n)

I have a sorted array of integers. Given a number, how do I find the number of occurrences of that number in O(log n), even in the worst case?
Do one binary search for the point where all elements to the left are smaller than your number and all to the right are equal or greater and one for the point where all elements are smaller or equal to the left and greater to the right.
In other words: in one search your "test" is <, while in the other search the test is <=.
Both searches are log n, so that's your total.
For example, in C++ the two functions you'd need are std::lower_bound and std::upper_bound - it seems like the existing Java binary search functions (on Collections) always try to look for a specific value, so you probably have to implement the search yourself if you're using that.
It's important that you use a binary search variant that uses only a binary predicate! A variant that checks whether it hit a specific key "by accident" will sometimes terminate too soon for this specific task.
Binary search for the number, then binary search for the start and the end of the run.
Search for the insertion points of number + 0.5 and number - 0.5 using two binary searches. The number of elements with value number in the list is the difference of the positions (indices) of these two insertion points. (A Python version of this idea follows.)
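Both preceding answers map directly onto Python's bisect module: bisect_left finds the first index with a value >= x (lower_bound) and bisect_right the first index with a value > x (upper_bound), so their difference is the count. A small sketch:
from bisect import bisect_left, bisect_right

def count_occurrences(sorted_list, x):
    # Two O(log n) searches; no element-by-element scanning.
    return bisect_right(sorted_list, x) - bisect_left(sorted_list, x)

print(count_occurrences([1, 2, 2, 2, 3, 5], 2))  # 3
print(count_occurrences([1, 2, 2, 2, 3, 5], 4))  # 0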
Run the function below once as it is, and once with the condition changed to if(a[mid]<=val), with the corresponding changes in the recursive calls. The two values returned are the starting and ending occurrence of the particular value.
int binmin(int a[], int start, int end, int val)
{
    if (start <= end)
    {
        int mid = (start + end) / 2;
        if (a[mid] >= val)
            return binmin(a, start, mid - 1, val);  // keep looking left
        else
            return binmin(a, mid + 1, end, val);    // keep looking right
    }
    return start;  // first index whose element is >= val
}

How to find an element in a linked list of blocks (containing n elements) as fast as possible?

My data structure is a linked list of blocks. A block contains 31 elements of 4 bytes each and one 4-byte pointer to the next block or NULL (128 bytes per block in total). I add elements from time to time. If the last block is full, I add another block via the pointer.
One objective is to use as little memory (= as few blocks) as possible, with no free space between two elements in a block.
This setting is fixed. All code runs on a 32-bit ARM Cortex-A8 CPU with NEON pipeline.
Question:
How to find a specific element in that data structure as quickly as possible?
Approach (right now):
I use sorted blocks and binary search to check for an element (9 bits of the 4 bytes are the search criterion). If the desired element is not in the current block I jump to the next block. If the element is not in the last block and the last block is not yet full, I use the result of the binary search to insert the new element (if necessary I make space using memmove within this block). Thus all blocks are always sorted.
Do you have an idea to make that faster?
This is how I search right now: (q->getPosition() is an inline function that just extracts the 9-bit position from the element via "& bitmask")
do
{
    // binary search algorithm (bsearch) from
    // http://www.google.com/codesearch/p?hl=en#qoCVjtE_vOw/gcc4/trunk/gcc-4.4.3/libiberty/bsearch.c&q=bsearch&sa=N&cd=2&ct=rc
    base = &(block->points[0]);
    if (block->next == NULL)
    {
        pointsInBlock = pointsInLastBlock;
        stop = true;
    }
    else
    {
        block = block->next;
    }
    for (lim = pointsInBlock; lim != 0; lim >>= 1)
    {
        q = base + (lim >> 1);
        cmp = quantizedPosition - q->getPosition();
        if (cmp > 0)
        {
            // quantizedPosition > q: move right
            base = q + 1;
            lim--;
        }
        else if (cmp == 0)
        {
            // We found the QuantPoint
            *outQuantPoint = q;
            return true;
        }
        // else move left
    }
}
while (!stop);
Since the bulk of the time is spent in the within-block search, that needs to be as fast as possible. Since the number of elements is fixed, you can completely unroll that loop, as in:
if (key < a[16]){
    if (key < a[8]){
        ...
    }
    else { // key >= a[8] && key < a[16]
        ...
    }
}
else { // key >= a[16]
    if (key < a[24]){
        ...
    }
    else { // key >= a[24]
        ...
    }
}
Study the generated assembly language and single-step it in a debugger, to make sure the compiler's giving you good code.
You might want to write a little program to print out the above code, as it will be hard to write by hand, or possibly generate it with macros.
ADDED: Just noticed your 9-bit search criterion. In that case, just pre-allocate an array of 512 4-byte words and index it directly. That's the fastest, and the least code. A tiny sketch follows.
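A hedged Python sketch of that direct-index suggestion (the element type and masking are illustrative; on the ARM target this would be a plain C array). With only 9 bits of key there are at most 512 distinct positions, so one table access replaces the whole search loop:
table = [None] * 512  # index = 9-bit position, value = stored element

def insert(element, position):
    table[position & 0x1FF] = element  # mask keeps the low 9 bits

def lookup(position):
    return table[position & 0x1FF]     # O(1): no loop, no comparisons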
ALSO ADDED: If you need to keep your block structure, there's another way to do the unrolled binary search. It's the Jon Bentley method:
i = 0;
if (key >= a[i+16]) i += 16;
if (key >= a[i+ 8]) i += 8;
if (key >= a[i+ 4]) i += 4;
if (key >= a[i+ 2]) i += 2;
if (i < 30 && key >= a[i+ 1]) i += 1; // this excludes 31
if (key == a[i]) // then key is found
That's slower than the if-tree above, because of manipulating i, but could be substantially less code.
Let the number of elements in each block be m and the total number of blocks currently in the list be n. Then the current time complexity of your algorithm is O(n log m).
If you cannot move elements once they are added to a block, then I don't think you can do better in terms of time complexity than what you are already doing. (You could keep track of the maximum and minimum elements in a block, and skip the blocks if the element does not lie in this range. But this is not going to give you much gain. This will also waste space keeping track of the minimum and maximum for each block)
If you can afford to spend time while inserting the element and can move elements from one block to another, then here is a scheme that has time complexity O(log (mn)).
Basically, you keep all elements in sorted order. When a new element has to be inserted, binary search across block boundaries and insert it in its correct location, shifting elements to create space. This will lead to O(nm) time while inserting elements but O(log (mn)) when finding an element. A sketch of the search side follows.
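A hedged Python sketch of the search side of that scheme (lists stand in for the fixed-size blocks; maintaining the per-block maxima incrementally on insert is assumed rather than shown). One binary search picks the block by its largest element, a second searches within the block:
from bisect import bisect_left

def find(blocks, maxima, value):
    # blocks: list of sorted lists, globally sorted across blocks.
    # maxima: maxima[i] == blocks[i][-1], kept up to date on insert.
    i = bisect_left(maxima, value)     # O(log n): choose the block
    if i == len(blocks):
        return None
    j = bisect_left(blocks[i], value)  # O(log m): search inside it
    if j < len(blocks[i]) and blocks[i][j] == value:
        return (i, j)
    return None

blocks = [[1, 3, 9], [11, 15, 19], [23, 29, 31]]
maxima = [b[-1] for b in blocks]
print(find(blocks, maxima, 15))  # (1, 1)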
If this search criterion for an element is fixed, you would do better to move the searching into a separate index structure, because the maximal number of elements you can distinguish by your search criterion is only 2^9 = 512 indexes, so the maximal size of the search index would be (2 + 4)*512 = 3072 bytes. You could use something other than a static table if you needed to, saving some memory. For now, imagine it as a table of 512 pairs <9-bit index, direct address>, which should be very fast (only one NULL check and one dereference per lookup).
Generally, the answer to your question also depends on what other operations you want to perform on your structure and how frequently each of them occurs (including searches). If all you want is search(9 bits) -> add/modify/read, then your block structure is of little use. You could list the operations here, and maybe add what language you're using.
Edit 3: I just noticed you can't change the blocks' size. But is your search for efficiency reasons only, or do you need the elements of the list to be unique (by those 9 bits)?
