Binary Search vs. Linear Search (data structures & algorithms) - algorithm

Trying to wrap my head around some basic and common algorithms. My current understanding of each question is in bold.
( 1 ) Assuming we have a sorted array with n items: at most how many times will a binary search compare elements?
I keep seeing ' O(log(n)) ' pop up as a general answer for this type of question, but I do not understand why. Is there not a whole number that answers this question (e.g. 2 or 3)?
( 2 ) Assuming we have an array with n items: at most how many times will a linear search compare elements?
Again, same as above, but now ' O(n) ' seems to be the general answer to this question. Again, I do not really understand the reasoning behind this answer and wonder why there is not some whole-number answer.
( 3 ) Can someone explain an example when a linear search would be better than a binary search?
From the information I have gathered, it generally seems like a binary search is the better option, if possible, because of its speed. I'm having trouble seeing when a linear search would be a better option.

Regarding 1 & 2: an absolute number would have been possible as an answer if an absolute number had been given as the size of the input. Since the question asks about an arbitrarily sized array (of length n), the answer is also given in those terms.
You can read more about big O notation for the details, but basically O(n) and O(log n) mean "order of n" and "order of log(n)" respectively. That is, if the input size is 100, for example, the number of elements compared by a linear search will also be on the order of 100, while a binary search would require comparing roughly log2(100) ≈ 7 elements.
As for 3, binary search requires the input to be sorted...
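To put whole numbers on 1 & 2 for a concrete n, here is a quick sketch (my own, in Python, not from the question) that counts the comparisons each search performs in its worst case on a sorted array of 100 items:

def linear_search_comparisons(arr, target):
    comparisons = 0
    for item in arr:
        comparisons += 1          # one comparison per item looked at
        if item == target:
            break
    return comparisons

def binary_search_comparisons(arr, target):
    comparisons = 0
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1          # one comparison per halving step
        if arr[mid] == target:
            break
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return comparisons

arr = list(range(100))
print(linear_search_comparisons(arr, 99))   # 100: had to look at every item
print(binary_search_comparisons(arr, 99))   # 7: roughly log2(100)

So for n = 100 the "whole numbers" are 100 and 7, but they change with n, which is why the answers are stated as O(n) and O(log n).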

The O notation is about limiting behaviour. A binary search divides the list into two: either you have found the item, or you have half of the list left to search. Hence the limiting behaviour of O(log n) - i.e. you stop at a leaf of the search tree.
Linear search just starts at the beginning. The worst case (the limit) is when the element is at the end.
For (3): if the item is the first in the list, a linear search hits the jackpot immediately, so it would be better in that case.

First of all, you are dealing with O notation, so a "whole number" isn't possible. Answers are always in O(f(n)) format, where f(n) is some function of n. If you aren't sure what Big O notation is all about, then starting from http://en.wikipedia.org/wiki/Big_O_notation may help.
(1) With binary search on a sorted array, the search space is repeatedly reduced by half until the item is located. If we think about an ideal binary search implementation where every operation takes constant time, in the worst case we will need to examine approximately log n items, which takes O(log n) time. As for the math behind log n: it isn't terribly difficult, but it's hard to type out on an iPhone. Hint: Google is your friend.
(2) In a linear search of an unsorted array, we may have to examine every item in the array. Again, this is a simplification, but assuming that every operation in our algorithm takes constant time, we may have to look at up to n items. Hence O(n).
(3) What is the difference between the data you must search in (1) and (2)? Remember that sorting is, optimally, O(n log n).

The following video provides a comprehensive explanation. I hope it will help you to understand the difference between binary search and linear search.
For question 3:
Binary search requires a sorted sequence.
For the sequence 1, 2, 3, ..., 100: when you want to find the element 1, linear search will be faster, since it just checks the first element.

Related

Can my algorithm be done any better?

I have been presented with a challenge to make the most efficient algorithm that I can for a task. Right now I have arrived at a complexity of n * log n, and I was wondering if it is even possible to do better. The task: kids are playing a counting-out game. You are given a number n, which is the number of kids, and m, which is how many kids you skip before you execute one. You need to return a list giving the execution order. I tried to do it like this, using a list that you skip through:
current = m
while table.size > 0:
    executed.add(table[current % table.size])
    table.remove(current % table.size)
    current += m
My questions are: is this correct? Is it n * log n, and can you do it better?
Is this correct?
No.
When you remove an element from the table, table.size decreases, and the current % table.size expression generally ends up pointing at a different element than the one intended.
For example, 44 % 11 is 0 but 44 % 10 is 4, an element in a totally different place.
Is it n*logn?
No.
If table is just a random-access array, it can take n operations to remove an element.
For example, if m = 1, the program, after fixing the point above, would always remove the first element of the array.
With a naive array implementation, it takes table.size operations to shift the array each time, leading to about n^2 / 2 operations in total.
Now, it would be n log n if table were backed, for example, by a balanced binary search tree with implicit indexes instead of keys, together with split and merge primitives. A treap is one example of such a structure.
Such a data structure could be used as an array with O(log n) costs for access, merge and split.
But nothing so far suggests this is the case, and there is no such data structure in most languages' standard libraries.
Can you do it better?
Correction: partially, yes; fully, maybe.
If we solve the problem backwards, we have the following sub-problem.
Let there be a circle of k kids, and the pointer is currently at kid t.
We know that, just a moment ago, there was a circle of k + 1 kids, but we don't know where, at which kid x, the pointer was.
Then we counted to m, removed the kid, and the pointer ended up at t.
Whom did we just remove, and what is x?
Turns out the "what is x" part can be solved in O(1) (a drawing can be helpful here), so finding the last kid standing is doable in O(n).
As pointed out in the comments, the whole thing is called Josephus Problem, and its variants are studied extensively, e.g., in Concrete Mathematics by Knuth et al.
However, in O(1) per step, this only finds the number of the last standing kid.
It does not automatically give the whole order of counting the kids out.
There certainly are ways to make it O(log(n)) per step, O(n log(n)) in total.
But as for O(1), I don't know at the moment.
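For reference, the backwards recurrence sketched above looks roughly like this in code (my own sketch; the exact offsets depend on how the skip count m is defined in the task):

def last_kid_standing(n, m):
    # Classic O(n) Josephus recurrence, 0-based positions.
    # With 1 kid the survivor sits at position 0; growing the circle from
    # k - 1 kids to k kids shifts the survivor's position by the skip count.
    pos = 0
    for k in range(2, n + 1):
        pos = (pos + m) % k
    return pos

print(last_kid_standing(5, 2))   # 2 (0-based) under this counting convention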
The complexity of your algorithm depends on the complexity of the operations
executed.add(..) and table.remove(..).
If both of them have complexity of O(1), your algorithm has complexity of O(n) because the loop terminates after n steps.
While executed.add(..) can easily be implemented in O(1), table.remove(..) needs a bit more thinking.
You can make it in O(n):
Store the kids in a linked list and connect the last element to the first; removing an element then costs O(1).
Going to the next person to choose costs O(m), but for a fixed m that is a constant, i.e. O(1).
This way the algorithm has a complexity of O(n*m) = O(n) (for constant m).
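A rough sketch of that circular-list simulation (my own code; the exact skip convention is an assumption, not necessarily the answerer's):

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

def execution_order(n, m):
    # Build a circular singly linked list of kids 0 .. n-1.
    nodes = [Node(i) for i in range(n)]
    for i in range(n):
        nodes[i].next = nodes[(i + 1) % n]
    order = []
    prev = nodes[-1]                 # the node just before the current one
    for _ in range(n):
        for _ in range(m):           # skip m kids: O(m) per execution
            prev = prev.next
        victim = prev.next
        order.append(victim.value)
        prev.next = victim.next      # unlink the victim in O(1)
    return order

print(execution_order(5, 2))         # [2, 0, 4, 1, 3] under this convention

Each of the n executions costs O(m) pointer hops plus an O(1) removal, which is the O(n*m) described above.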

Insertion Sort with binary search

When implementing Insertion Sort, a binary search could be used to locate the position within the first i - 1 elements of the array into which element i should be inserted.
How would this affect the number of comparisons required? How would using such a binary search affect the asymptotic running time for Insertion Sort?
I'm pretty sure this would decrease the number of comparisons, but I'm not exactly sure why.
Straight from Wikipedia:
If the cost of comparisons exceeds the cost of swaps, as is the case for example with string keys stored by reference or with human interaction (such as choosing one of a pair displayed side-by-side), then using binary insertion sort may yield better performance. Binary insertion sort employs a binary search to determine the correct location to insert new elements, and therefore performs ⌈log2(n)⌉ comparisons in the worst case, which is O(n log n). The algorithm as a whole still has a running time of O(n²) on average because of the series of swaps required for each insertion.
Source:
http://en.wikipedia.org/wiki/Insertion_sort#Variants
Here is an example:
http://jeffreystedfast.blogspot.com/2007/02/binary-insertion-sort.html
I'm pretty sure this would decrease the number of comparisons, but I'm not exactly sure why.
Well, if you know insertion sort and binary search already, then it's pretty straightforward. When you insert a piece in insertion sort, you must compare against the previous pieces. Say you want to move this [2] to the correct place; you would have to compare against 7 pieces before you find the right place.
[1][3][3][3][4][4][5] ->[2]<- [11][0][50][47]
However, if you start the comparison at the halfway point (like a binary search), then you'll only compare against about 4 pieces! You can do this because you know the left pieces are already in order (you can only do a binary search if the pieces are in order!).
Now imagine if you had thousands of pieces (or even millions); this would save you a lot of time. I hope this helps. |=^)
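In code, the idea looks something like this (my own sketch using Python's bisect module; the search is O(log i) comparisons, but the shifting is still O(i) moves):

import bisect

def binary_insertion_sort(arr):
    for i in range(1, len(arr)):
        value = arr[i]
        # Binary-search the already-sorted prefix arr[:i] for the insertion point.
        pos = bisect.bisect_right(arr, value, 0, i)
        # Shift the larger elements one slot right, then drop the value in.
        arr[pos + 1:i + 1] = arr[pos:i]
        arr[pos] = value
    return arr

print(binary_insertion_sort([1, 3, 3, 3, 4, 4, 5, 2, 11, 0, 50, 47]))
# [0, 1, 2, 3, 3, 3, 4, 4, 5, 11, 47, 50]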
If you have a good data structure for efficient binary searching, it is unlikely to have O(log n) insertion time. Conversely, a good data structure for fast insert at an arbitrary position is unlikely to support binary search.
To achieve the O(n log n) performance of the best comparison sorts with insertion sort would require both an O(log n) binary search and an O(log n) arbitrary insert.
Binary Insertion Sort - take this array => {4, 5, 3, 2, 1}
Now inside the main loop, imagine we are at the 3rd element. Using binary search we will know where to insert 3, i.e. before 4.
Binary search uses O(log n) comparisons, which is an improvement, but we still need to insert 3 in the right place. For that we need to swap 3 with 5 and then with 4.
Because the insertion takes the same amount of time as it would without binary search, the worst-case complexity still remains O(n^2).
I hope this helps.
Assuming the input array is already sorted, binary search will not reduce the number of comparisons, since the inner loop ends immediately after one compare (the previous element is already smaller). In general, the number of compares in insertion sort is at most the number of inversions plus the array size minus 1.
Since the number of inversions in a sorted array is 0, the maximum number of compares for an already sorted array is N - 1.
With binary search, the comparisons per insertion take log n time, but the shifts are still on the order of n.
For n elements in the worst case: n*(log n + n) is on the order of n^2.

Algorithm to find the top 3 occurring words in a book of 1000 pages [duplicate]

Possible Duplicate:
The Most Efficient Way To Find Top K Frequent Words In A Big Word Sequence
Algorithm to find the top 3 occurring words in a book of 1000 pages. Is there a better solution than using a hashtable?
A potentially better solution is to use a trie-based dictionary. With a trie, you can perform the task in worst-case O(n × N) time where N is the number of words and n is their average length. The difference with a hash table is that the complexity for a trie is independent of any hash function or the book's vocabulary.
There's no way to do better than O(n × N) for arbitrary input since you'll have to scan through all the words.
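For illustration, a bare-bones version of the trie counter could look like this (my own sketch; the class and helper names are made up):

class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.count = 0       # occurrences of the word ending at this node

def count_words(words):
    # O(n * N): each character of each word is visited exactly once.
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1
    return root

def top_three(root):
    # Walk the trie, collect (count, word) pairs, keep the three largest.
    best = []
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.count:
            best.append((node.count, prefix))
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))
    return sorted(best, reverse=True)[:3]

words = "the cat and the dog and the bird".split()
print(top_three(count_words(words)))   # [(3, 'the'), (2, 'and'), (1, 'dog')]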
It is strange that everybody concentrated on going through the word list and forgot about the main issue - taking the k most frequent items. Actually, a hash map is good enough to count occurrences, but this implementation still needs sorting, which is de facto O(n*log n) (at best).
So a hash map implementation needs 1 pass to count words (unguaranteed O(n)) and O(n*log n) to sort. The tries mentioned here may be a better solution for counting, but sorting is still the issue. And again, 1 pass + sorting.
What you actually need is a heap, i.e. a tree-based data structure that keeps the largest (or smallest) elements close to the root. Simple implementations of a heap (e.g. a binary heap) need O(log n) time to insert new elements and O(1) to get the highest element(s), so the resulting algorithm takes O(n*log n) and only 1 pass. More sophisticated implementations (e.g. a Fibonacci heap) take amortized O(1) time for insertion, so the resulting algorithm takes O(n) time, which is better than any suggested solution.
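As a concrete sketch of that count-then-heap idea (my own example; collections.Counter is the hash-map counting pass, heapq.nlargest is the heap part):

import heapq
import re
from collections import Counter

def top_three_words(text):
    words = re.findall(r"[a-z']+", text.lower())   # crude tokenizer, just for the example
    counts = Counter(words)                        # one pass over the words to count them
    # nlargest keeps a small heap instead of sorting the whole vocabulary
    return heapq.nlargest(3, counts.items(), key=lambda kv: kv[1])

print(top_three_words("the cat and the dog and the bird"))
# [('the', 3), ('and', 2), ('cat', 1)]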
You're going to have to go through all of the pages word by word to get an exact answer.
So a linked list implementation that also uses a hashtable to store pointers to the nodes of the linked list would do very well.
You need the linked list to grow dynamically and the hashtable to give quick access to the right node so you can update the count.
A simple approach is to use a Dictionary<string, int> (.NET) or a HashTable and count the occurrence of each word while scanning the whole book.
Wikipedia says this:
"For certain string processing applications, such as spell-checking, hash tables may be less efficient than tries, finite automata, or Judy arrays. Also, if each key is represented by a small enough number of bits, then, instead of a hash table, one may use the key directly as the index into an array of values. Note that there are no collisions in this case."
I would also have guessed a hash tree.
This algorithm solves it with a complexity of n + lg(n) - 2 when n = 3 here: http://www.seeingwithc.org/topic3html.html

Is binary search optimal in worst case?

Is binary search optimal in the worst case? My instructor has said so, but I could not find a book that backs it up. We start with an ordered array, and the claim is that, in its own worst case, any algorithm will take at least as many pairwise comparisons as binary search.
Many people said that the question was unclear. Sorry! So the input is any general sorted array. I am looking for a proof which says that any search algorithm will take at least log2(N) comparisons in the worst case (worst case for the algorithm in consideration).
Yes, binary search is optimal.
This is easily seen by appealing to information theory. It takes log N bits merely to identify a unique element out of N elements. But each comparison only gives you one bit of information. Therefore, you must perform log N comparisons to identify a unique element.
More verbosely... Consider a hypothetical algorithm X that outperforms binary search in the worst case. For a particular element of the array, run the algorithm and record the questions it asks; i.e., the sequence of comparisons it performs. Or rather, record the answers to those questions (like "true, false, false, true").
Convert that sequence into a binary string (1,0,0,1). Call this binary string the "signature of the element with respect to algorithm X". Do this for each element of the array, assigning a "signature" to each element.
Now here is the key. If two elements have the same signature, then algorithm X cannot tell them apart! All the algorithm knows about the array are the answers it gets from the questions it asks; i.e., the comparisons it performs. And if the algorithm cannot tell two elements apart, then it cannot be correct. (Put another way, if two elements have the same signature, meaning they result in the same sequence of comparisons by the algorithm, which one did the algorithm return? Contradiction.)
Finally, prove that if every signature has fewer than log N bits, then there must exist two elements with the same signature (pigeonhole principle). Done.
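In symbols, the counting argument boils down to this (a standard decision-tree bound, written in LaTeX):

% k comparisons produce at most 2^k distinct answer sequences ("signatures"),
% and a correct algorithm needs n distinct ones, so
\[
    2^{k} \ge n
    \quad\Longrightarrow\quad
    k \ge \lceil \log_2 n \rceil .
\]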
[update]
One quick additional comment. The above assumes that the algorithm does not know anything about the array except what it learns from performing comparisons. Of course, in real life, sometimes you do know something about the array a priori. As a toy example, if I know that the array has (say) 10 elements all between 1 and 100, and that they are distinct, and that the numbers 92 through 100 are all present in the array... Then clearly I do not need to perform four comparisons even in the worst case.
More realistically, if I know that the elements are uniformly distributed (or roughly uniformly distributed) between their min and their max, again I can do better than binary search.
But in the general case, binary search is still optimal.
Worst case for which algorithm? There's not one universal "worst case." If your question is...
"Is there a case where binary search takes more comparisons than another algorithm?"
Then, yes, of course. A simple linear search takes less time if the element happens to be the first one in the list.
"Is there even an algorithm with a better worst-case running time than binary search?"
Yes, in cases where you know more about the data. For example, a radix tree or trie is at worst constant-time with regard to the number of entries (but linear in length of the key).
"Is there a general search algorithm with a better worst-case running time than binary search?"
If you can only assume you have a comparison function on keys, no, the best worst-case is O(log n). But there are algorithms that are faster, just not in a big-O sense.
... so I suppose you really would have to define the question first!
Binary search has a worst case complexity of O(log(N)) comparisons - which is optimal for a comparison based search of a sorted array.
In some cases it might make sense to do something other than a purely comparison-based search - in those cases you might be able to beat the O(log(N)) barrier; for example, check out interpolation search.
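For the curious, a rough sketch of interpolation search (my own code; it assumes a sorted array of numbers that are roughly uniformly distributed):

def interpolation_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= target <= arr[hi]:
        if arr[lo] == arr[hi]:                 # avoid division by zero
            return lo if arr[lo] == target else -1
        # Guess the position by linear interpolation instead of taking the midpoint.
        pos = lo + (target - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == target:
            return pos
        if arr[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1

print(interpolation_search([10, 20, 30, 40, 50, 60], 40))   # 3

On uniformly distributed data this makes O(log log N) probes on average, though its worst case degrades to O(N).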
It depends on the nature of the data. For example, the English language and a dictionary: you could write an algorithm that does better than a binary search by making use of the fact that certain letters occur in the English language with different frequencies.
But in general a binary search is a safe bet.
I think the question is a little unclear, but still here are my thoughts.
The worst case of binary search is when the element you are searching for is found only after all log n comparisons; the same data can be the best case for linear search. It depends on the data arrangement and on what you are searching for, but the worst case for binary search ends up being log n comparisons. This cannot be compared against linear search on the same data and the same target, because linear search's worst case is different: the worst case for linear search is finding an element that happens to be at the end of the array.
For example, take the array A = 1, 2, 3, 4, 5, 6. A binary search on A for 1 could be its worst case, whereas for the same array the worst case for a linear search is searching for 6, not for 1.

Algorithmic complexity: why does ordering reduce complexity to O(log n)

I'm reading some texts about algorithmic complexity (and I'm planning to take an algorithms course later), but I don't understand the following.
Say I have to search for an item in an unordered list: the number of steps it takes to find it is proportional to the number of items in that list. Finding it in a list of 10 items could take 10 steps; doing the same for a list of 100,000 items could take 100,000 steps. So the algorithmic complexity would be linear, denoted by 'O(n)'.
Now, this text [1] tells me that if I were to sort the list by some property, say a social security number, the algorithmic complexity of finding an item would be reduced to O(log n), which is a lot faster, of course.
Now I can see this happening in case of a b-tree, but how does this apply to a list? Do I misunderstand the text, since English isn't my native language?
[1]http://msdn.microsoft.com/en-us/library/ms379571.aspx
This works for any container that is randomly accessible. In the case of a list you would go to the middle element first. Assuming that's not the target, the ordering tells you if the target will be in the upper sublist or the lower sublist. This essentially turns into a binary search, no different than searching through a b-tree.
Binary search: check the middle element; if the target is higher, it must reside in the right half; if it is lower, it must reside in the left half; if it is equal, it is the middle element, and so on. Each time you divide the list in two, which leaves you with O(log n).
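A minimal iterative version on a plain Python list (my own sketch), just to show the halving:

def binary_search(items, target):
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # look at the middle element
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1              # target can only be in the upper half
        else:
            hi = mid - 1              # target can only be in the lower half
    return -1                         # not found

print(binary_search([2, 5, 8, 12, 16, 23, 38, 56, 72, 91], 23))   # 5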
