This question is about whether there is some abstract similarity between the solutions that leads to the appearance of log in problems such as sorting and searching. Or, more simply, why does log appear so frequently in algorithmic complexity?
Logarithms often show up when a problem can be repeatedly reduced in size by a multiplicative factor. By definition there's a logarithmic number of steps required to reduce the problem to constant size (e.g. size 1).
A typical example would be repeatedly eliminating half of the dataset, as is done in binary search. This gives O(log2(n)) complexity. Some sorting algorithms work by repeatedly splitting the dataset in half, and therefore also have a logarithmic term in their time complexity.
More generally, logarithms frequently show up in solutions to divide-and-conquer recurrence relations. See Master Theorem in Wikipedia for further discussion.
log appears a lot in algorithm complexity especially in recursive algorithms..
lets take at a binary search for example.
you have a sorted array A of 100 elements and your looking for the number 15..
in a binary search you will look at the middle element (50) and compare it to 15.. if element is greater than 15 then you find the middle element between 50 and 100 which is 75.. and compare again.. if 15 is greater than the element at 75 then you look at the element between 75 and 100 which is element 87... you continue to do this until you find the element or until there is no more middle number...
each time you do this method of checking the middle number you cut the total number of elements remaining to search in half..
so the first pass will give you O(n/2) complexity.. the next pass will be O(n/4)... O(n/8) and so on..
to represent this pattern we use logs..
since we are cutting the number of elements to search in half with each pass of the algorithm that becomes the log base so the binary search will yield a O(log2(n)) complexity
most algorithms try to 'cut' the number of operations down to as few as possible by breaking the original data into separate parts to solve and that is why log shows up so often
log appears very often in computer science because of the boolean logic. Everything can be reduced to true vs false or 1 vs 0 or to be or not to be. If you have an if statement you have one option, otherwise you have the other option. This can be applied for bits (you have 0 or 1) or in high impact problems, but there is a decision. And as it is in the real life, when you take a decision, you don't care about the problems that could have happened if you had decided otherwise. This is why log2(n) appears very often.
Then, every situation that is more complicated ( E.g.: choose one possible state from 3 states ) can be reduced to log2(n) => the logarithm base doesn't matter (a constant doesn't influence trend for a function - it has the same degree ):
Mathematical proof:
loga(y) 1
logx(y) = ------- = -------- * loga(y) = constant * loga(y)
loga(x) loga(x)
Proof for programmers:
switch(a) {
case 1: ... ;
case 2: ... ;
...
default: ... ;
}
is similar to:
if (a == 1) {
...
} else {
if ( a == 2 ) {
...
}
...
}
( a switch for k options is equivalent to k-1 if-else statements, where k = constant )
But why log ? Because it is the inverse for an exponential. At the first decision you break the big problem into 2 parts. Then you break only the "good" half in 2 parts, etc.
n = n/2^0 // step 1
n/2 = n/2^1 // step 2
n/4 = n/2^2 // step 3
n/8 = n/2^3 // step 4
...
n/2^i = n/2^i // step i+1
Q: How many steps are there ?
A: i+1 ( from 0 to i )
Because it stops when you find the wanted element (there are no other decisions you can take) => n = 2^i. If we apply the logarithm, base 2:
log2(n) = log2(2^i)
log2(n) = i
=> i + 1 = log2(n) + 1
But a constant doesn't influence the complexity => you have ~log2(n) steps.
Related
According to Wikipedia, partition-based selection algorithms such as quickselect have runtime of O(n), but I am not convinced by it. Can anyone explain why it is O(n)?
In the normal quick-sort, the runtime is O(n log n). Every time we partition the branch into two branches (greater than the pivot and lesser than the pivot), we need to continue the process in both branches, whereas quickselect only needs to process one branch. I totally understand these points.
However, if you think in the Binary Search algorithm, after we chose the middle element, we are also searching only one side of the branch. So does that make the algorithm O(1)? No, of course, the Binary Search Algorithm is still O(log N) instead of O(1). This is also the same thing as the search element in a Binary Search Tree. We only search for one side, but we still consider O(log n) instead of O(1).
Can someone explain why in quickselect, if we continue the search in one side of pivot, it is considered O(1) instead of O(log n)? I consider the algorithm to be O(n log n), O(N) for the partitioning, and O(log n) for the number of times to continue finding.
There are several different selection algorithms, from the much simpler quickselect (expected O(n), worst-case O(n2)) to the more complex median-of-medians algorithm (Θ(n)). Both of these algorithms work by using a quicksort partitioning step (time O(n)) to rearrange the elements and position one element into its proper position. If that element is at the index in question, we're done and can just return that element. Otherwise, we determine which side to recurse on and recurse there.
Let's now make a very strong assumption - suppose that we're using quickselect (pick the pivot randomly) and on each iteration we manage to guess the exact middle of the array. In that case, our algorithm will work like this: we do a partition step, throw away half of the array, then recursively process one half of the array. This means that on each recursive call we end up doing work proportional to the length of the array at that level, but that length keeps decreasing by a factor of two on each iteration. If we work out the math (ignoring constant factors, etc.) we end up getting the following time:
Work at the first level: n
Work after one recursive call: n / 2
Work after two recursive calls: n / 4
Work after three recursive calls: n / 8
...
This means that the total work done is given by
n + n / 2 + n / 4 + n / 8 + n / 16 + ... = n (1 + 1/2 + 1/4 + 1/8 + ...)
Notice that this last term is n times the sum of 1, 1/2, 1/4, 1/8, etc. If you work out this infinite sum, despite the fact that there are infinitely many terms, the total sum is exactly 2. This means that the total work is
n + n / 2 + n / 4 + n / 8 + n / 16 + ... = n (1 + 1/2 + 1/4 + 1/8 + ...) = 2n
This may seem weird, but the idea is that if we do linear work on each level but keep cutting the array in half, we end up doing only roughly 2n work.
An important detail here is that there are indeed O(log n) different iterations here, but not all of them are doing an equal amount of work. Indeed, each iteration does half as much work as the previous iteration. If we ignore the fact that the work is decreasing, you can conclude that the work is O(n log n), which is correct but not a tight bound. This more precise analysis, which uses the fact that the work done keeps decreasing on each iteration, gives the O(n) runtime.
Of course, this is a very optimistic assumption - we almost never get a 50/50 split! - but using a more powerful version of this analysis, you can say that if you can guarantee any constant factor split, the total work done is only some constant multiple of n. If we pick a totally random element on each iteration (as we do in quickselect), then on expectation we only need to pick two elements before we end up picking some pivot element in the middle 50% of the array, which means that, on expectation, only two rounds of picking a pivot are required before we end up picking something that gives a 25/75 split. This is where the expected runtime of O(n) for quickselect comes from.
A formal analysis of the median-of-medians algorithm is much harder because the recurrence is difficult and not easy to analyze. Intuitively, the algorithm works by doing a small amount of work to guarantee a good pivot is chosen. However, because there are two different recursive calls made, an analysis like the above won't work correctly. You can either use an advanced result called the Akra-Bazzi theorem, or use the formal definition of big-O to explicitly prove that the runtime is O(n). For a more detailed analysis, check out "Introduction to Algorithms, Third Edition" by Cormen, Leisserson, Rivest, and Stein.
Let me try to explain the difference between selection & binary search.
Binary search algorithm in each step does O(1) operations. Totally there are log(N) steps and this makes it O(log(N))
Selection algorithm in each step performs O(n) operations. But this 'n' keeps on reducing by half each time. There are totally log(N) steps.
This makes it N + N/2 + N/4 + ... + 1 (log(N) times) = 2N = O(N)
For binary search it is 1 + 1 + ... (log(N) times) = O(logN)
In Quicksort, the recursion tree is lg(N) levels deep and each of these levels requires O(N) amount of work. So the total running time is O(NlgN).
In Quickselect, the recurision tree is lg(N) levels deep and each level requires only half the work of the level above it. This produces the following:
N * (1/1 + 1/2 + 1/4 + 1/8 + ...)
or
N * Summation(1/i^2)
1 < i <= lgN
The important thing to note here is that i goes from 1 to lgN, but not from 1 to N and also not from 1 to infinity.
The summation evaluates to 2. Hence Quickselect = O(2N).
Quicksort does not have a big-O of nlogn - it's worst case runtime is n^2.
I assume you're asking about Hoare's Selection Algorithm (or quickselect) not the naive selection algorithm that is O(kn). Like quicksort, quickselect has a worst case runtime of O(n^2) (if bad pivots are chosen), not O(n). It can run in expectation time n because it's only sorting one side, as you point out.
Because for selection, you're not sorting, necessarily. You can simply count how many items there are which have any given value. So an O(n) median can be performed by counting how many times each value comes up, and picking the value that has 50% of items above and below it. It's 1 pass through the array, simply incrementing a counter for each element in the array, so it's O(n).
For example, if you have an array "a" of 8 bit numbers, you can do the following:
int histogram [ 256 ];
for (i = 0; i < 256; i++)
{
histogram [ i ] = 0;
}
for (i = 0; i < numItems; i++)
{
histogram [ a [ i ] ]++;
}
i = 0;
sum = 0;
while (sum < (numItems / 2))
{
sum += histogram [ i ];
i++;
}
At the end, the variable "i" will contain the 8-bit value of the median. It was about 1.5 passes through the array "a". Once through the entire array to count the values, and half through it again to get the final value.
I am given a set of balls and my ultimate goal is to find if at least half of the set of balls are the same color. I can only pick two balls each time and determine whether they are the same color or not. so how to design a divide and conquer algorithm that takes O(n log n) determinations to solve this problem? if there anybody has any idea on this problem, thank you so much!
Perhaps you can do it sort of backwards - if you don't know the answer in n(log n) comparisons, then less than half the balls are of the same color. Sort of merge-sort-group them...
r g r b r y r r // worst case arrangement
rg rb ry rr
↓ // 3 * (n / 4) comparisons
rr gb rrr y
↓ // 3 * (n / 8) comparisons
rrrrr gby
We can reduce your problem to following.
If the balls in the given set are grouped by colors, you wish to find if largest group is at least half the size of the set.
This is easier to solve recursively. (This problem is not defined for empty set, handle it separately.)
class Group { // with appropriate constructors
int size;
Color color;
}
Group findLargestGroupWithSameColors(Set<Ball> ballSet) {
if (ballSet.size() > 1) {
// Divide set into two (preferably equal) sets.
Group first = recursive call on first half.
Group second = recursive call on second half.
if (first.color.equals(second.color)) {
return new Group(first.color, first.size + second.size);
} else {
if (first.size > second.size) {
return first;
} else {
return second;
}
}
} else {
return single element's color and size = 1
}
}
Good luck
I assume that you can order the colors (e.g. you can just compute a hash for the color into integers which can be sorted; I cannot think of any data type that cannot be hashed). Then you can simply sort the balls by their color in O(n log n) time, and then sweep once through the sorted collection, and determine the runs of consecutive same-colored balls. Your answer is, whether the number of balls in the largest run is >= the number of balls.
Edit
The problem is actually O(n). Use a hash table with O(1) insertion for the n balls. Whenever you have it already in, increase the count on the element in the hash-table group, and keep track of the largest group count somewhere else.
You can even do an early exit whenever the max count reaches n/2, which should half the average run time on random sets.
Edit2
Sketch of an exemplary proof of O(n^2)
I strongly believe, that there is no O(n log n) solution when only equality comparisons are allowed. Look at the following example, which should yield true, there are exactly half As, the rest are all different:
First divide
n = 16
AAAAAAABCDEFGHIA
AAAAAAAB CDEFGHIA
AAAA AAAB CDEF GHIA
AA AA AA AB CD EF GH IA
now conquer. We need to find all groups in each conquer step, since each group could potentially be merged with another large group so that it is the majority of all groups.
In this example, int the left half A is clearly the winner, but we need one additional A in the right half. Since the right half does not have a knowledge of the left half in a divide and conquer setting, the right side also tries to find the largest group, ending with n/2 groups of size 1, before the final merge.
In the following notation I use a number before a letter to denote a found group of that size.
2A 2A 2A 1A1B 1C1D 1E1F 1G1H 1I1A 1+1+1+1 +1+1+1+1 =4+4 =8
4A 3A1B 1C1D1E1F 1G1H1I1A 1*1+2*1 +2*2+2*2 =3+8 =11
7A1B 1C1D1E1F1G1H1I1A 1*2 +4*4 =2+16=18
8A1B1C1D1E1F1G1H1I 2*8 =16
53
n log2 n = 16*4=64
On the right I note the number of comparisons needed to merge the groups. To merge a set with x groups and a set with y groups, you need O(x y) comparisons, needed when the two sets are disjunct (i.e. compare each group of one set with each one of the other).
53 comparisons are needed in this example, which is below the n log2 n of 64.
The comparisons on the left side behave very linear. If you analyze the pattern you get (for n>7)
Log2(n)-2+ Sum{i=0..Log2(n)}(2^i) = n/2 - 3 + Log2(n)
But wait, there is a square term at the right number of comparisons. Let's examine that. Each row (except the last merge) doubles the previous, and it ends with (n/4)^2 comparisons. This gives
Sum{i=0..Log2(n)-2}( (n/4)^2 (1/2)^i ) = 1/8 (n^2 - 2*n)
So, indeed, with this divide and conquer approach, our worst case number of comparisons is O(n^2), which seems logical. If all entries are different, and we only can test two for equality each time, we need to test each against each to find if there is really no pair.
Don't know if I miss something, but the problem seems for me to be not solvable in O(n log n) with divide and conquer when only comarisons are allowed.
I don't know if it's the right place to ask because my question is about how to calculate a computer science algorithm complexity using differential equation growth and decay method.
The algorithm that I would like to prove is Binary search for a sorted array, which has a complexity of log2(n)
The algorithm says: if the target value are searching for is equal to the mid element, then return its index. If if it's less, then search on the left sub-array, if greater search on the right sub-array.
As you can see each time N(t): [number of nodes at time t] is being divided by half. Therefore, we can say that it takes O(log2(n)) to find an element.
Now using differential equation growth and decay method.
dN(t)/dt = N(t)/2
dN(t): How fast the number of elements is increasing or decreasing
dt: With respect to time
N(t): Number of elements at time t
The above equation says that the number of cells is being divided by 2 with time.
Solving the above equations gives us:
dN(t)/N(t) = dt/2
ln(N(t)) = t/2 + c
t = ln(N(t))*2 + d
Even though we got t = ln(N(t)) and not log2(N(t)), we can still say that it's logarithmic.
Unfortunately, the above method, even if it makes sense while approaching it to finding binary search complexity, turns out it does not work for all algorithms. Here's a counter example:
Searching an array linearly: O(n)
dN(t)/dt = N(t)
dN(t)/N(t) = dt
t = ln(N(t)) + d
So according to this method, the complexity of searching linearly takes O(ln(n)) which is NOT true of course.
This differential equation method is called growth and decay and it's very popluar. So I would like to know if this method could be applied in computer science algorithm like the one I picked, and if yes, what did I do wrong to get incorrect result for the linear search ? Thank you
The time an algorithm takes to execute is proportional to the number
of steps covered(reduced here).
In your linear searching of the array, you have assumed that dN(t)/dt = N(t).
Incorrect Assumption :-
dN(t)/dt = N(t)
dN(t)/N(t) = dt
t = ln(N(t)) + d
Going as per your previous assumption, the binary-search is decreasing the factor by 1/2 terms(half-terms are directly reduced for traversal in each of the pass of array-traversal,thereby reducing the number of search terms by half). So, your point of dN(t)/dt=N(t)/2 was fine. But, when you are talking of searching an array linearly, obviously, you are accessing the element in one single pass and hence, your searching terms are decreasing in the order of one item in each of the passes. So, how come your assumption be true???
Correct Assumption :-
dN(t)/dt = 1
dN(t)/1 = dt
t = N(t) + d
I hope you got my point. The array elements are being accessed sequentially one pass(iteration) each. So, the array accessing is not changing in order of N(t), but in order of a constant 1. So, this N(T) order result!
Question 01:
How can I find T (1) when I measure the complexity of an algorithm?
For example
I have this algorithm
Int Max1 (int *X, int N)
{
int a ;
if (N==1) return X[0] ;
a = Max1 (X, N‐1);
if (a > X[N‐1]) return a;
else return X[N‐1];
}
How can I find T(1)?
Question 2 :
T(n)= T(n-1) + 1 ==> O(n)
what is the meaning of the "1" in this equation
cordially
Max1(X,N-1) Is the actual algorithm the rest is a few checks which would be O(1)
as regardless of input the time taken will be the same.
The Max1 function I can only assume is finding array highest number in array this would be O(n) as it will increase in time in a linear fashion to the number of input n.
Also as far as I can tell 1 stands for 1 in most algorithms only letters have variable meanings, if you mean how they got
T(n-1) + 1 to O(n), it is due to the fact you ignore coefficients and lower order terms so the 1 is both cases is ignored to make O(n)
Answer 1. You are looking for a complexity. You must decide what case complexity you want: best, worst, or average. Depending on what you pick, you find T(1) in different ways:
Best: Think of the easiest input of length 1 that your algorithm could get. If you're searching for an element in a list, the best case is that the element is the first thing in the list, and you can have T(1) = 1.
Worst: Think of the hardest input of length 1 that your algorithm could get. Maybe your linear search algorithm executes 1 instruction for most inputs of length 1, but for the list [77], you take 100 steps (this example is a bit contrived, but it's entirely possible for an algorithm to take more or less steps depending on properties of the input unrelated to the input's "size"). In this case, your T(1) = 100.
Average: Think of all the inputs of length 1 that your algorithm could get. Assign probabilities to these inputs. Then, calculate the average T(1) of all possibilities to get the average-case T(1).
In your case, for inputs of length 1, you always return, so your T(n) = O(1) (the actual number depends on how you count instructions).
Answer 2. The "1" in this context indicates a precise number of instructions, in some system of instruction counting. It is distinguished from O(1) in that O(1) could mean any number (or numbers) that do not depend on (change according to, trend with, etc.) the input. Your equation says "The time it takes to evaluate the function on an input of size n is equal to the time it takes to evaluate the function on an input of size n - 1, plus exactly one additional instruction".
T(n) is what's called a "function of n," which is to say, n is a "variable" (meaning that you can substitute in different values in its place), and each particular (valid) value of n will determine the corresponding value of T(n). Thus T(1) simply means "the value of T(n) when n is 1."
So the question is, what is the running-time of the algorithm for an input value of 1?
According to Wikipedia, partition-based selection algorithms such as quickselect have runtime of O(n), but I am not convinced by it. Can anyone explain why it is O(n)?
In the normal quick-sort, the runtime is O(n log n). Every time we partition the branch into two branches (greater than the pivot and lesser than the pivot), we need to continue the process in both branches, whereas quickselect only needs to process one branch. I totally understand these points.
However, if you think in the Binary Search algorithm, after we chose the middle element, we are also searching only one side of the branch. So does that make the algorithm O(1)? No, of course, the Binary Search Algorithm is still O(log N) instead of O(1). This is also the same thing as the search element in a Binary Search Tree. We only search for one side, but we still consider O(log n) instead of O(1).
Can someone explain why in quickselect, if we continue the search in one side of pivot, it is considered O(1) instead of O(log n)? I consider the algorithm to be O(n log n), O(N) for the partitioning, and O(log n) for the number of times to continue finding.
There are several different selection algorithms, from the much simpler quickselect (expected O(n), worst-case O(n2)) to the more complex median-of-medians algorithm (Θ(n)). Both of these algorithms work by using a quicksort partitioning step (time O(n)) to rearrange the elements and position one element into its proper position. If that element is at the index in question, we're done and can just return that element. Otherwise, we determine which side to recurse on and recurse there.
Let's now make a very strong assumption - suppose that we're using quickselect (pick the pivot randomly) and on each iteration we manage to guess the exact middle of the array. In that case, our algorithm will work like this: we do a partition step, throw away half of the array, then recursively process one half of the array. This means that on each recursive call we end up doing work proportional to the length of the array at that level, but that length keeps decreasing by a factor of two on each iteration. If we work out the math (ignoring constant factors, etc.) we end up getting the following time:
Work at the first level: n
Work after one recursive call: n / 2
Work after two recursive calls: n / 4
Work after three recursive calls: n / 8
...
This means that the total work done is given by
n + n / 2 + n / 4 + n / 8 + n / 16 + ... = n (1 + 1/2 + 1/4 + 1/8 + ...)
Notice that this last term is n times the sum of 1, 1/2, 1/4, 1/8, etc. If you work out this infinite sum, despite the fact that there are infinitely many terms, the total sum is exactly 2. This means that the total work is
n + n / 2 + n / 4 + n / 8 + n / 16 + ... = n (1 + 1/2 + 1/4 + 1/8 + ...) = 2n
This may seem weird, but the idea is that if we do linear work on each level but keep cutting the array in half, we end up doing only roughly 2n work.
An important detail here is that there are indeed O(log n) different iterations here, but not all of them are doing an equal amount of work. Indeed, each iteration does half as much work as the previous iteration. If we ignore the fact that the work is decreasing, you can conclude that the work is O(n log n), which is correct but not a tight bound. This more precise analysis, which uses the fact that the work done keeps decreasing on each iteration, gives the O(n) runtime.
Of course, this is a very optimistic assumption - we almost never get a 50/50 split! - but using a more powerful version of this analysis, you can say that if you can guarantee any constant factor split, the total work done is only some constant multiple of n. If we pick a totally random element on each iteration (as we do in quickselect), then on expectation we only need to pick two elements before we end up picking some pivot element in the middle 50% of the array, which means that, on expectation, only two rounds of picking a pivot are required before we end up picking something that gives a 25/75 split. This is where the expected runtime of O(n) for quickselect comes from.
A formal analysis of the median-of-medians algorithm is much harder because the recurrence is difficult and not easy to analyze. Intuitively, the algorithm works by doing a small amount of work to guarantee a good pivot is chosen. However, because there are two different recursive calls made, an analysis like the above won't work correctly. You can either use an advanced result called the Akra-Bazzi theorem, or use the formal definition of big-O to explicitly prove that the runtime is O(n). For a more detailed analysis, check out "Introduction to Algorithms, Third Edition" by Cormen, Leisserson, Rivest, and Stein.
Let me try to explain the difference between selection & binary search.
Binary search algorithm in each step does O(1) operations. Totally there are log(N) steps and this makes it O(log(N))
Selection algorithm in each step performs O(n) operations. But this 'n' keeps on reducing by half each time. There are totally log(N) steps.
This makes it N + N/2 + N/4 + ... + 1 (log(N) times) = 2N = O(N)
For binary search it is 1 + 1 + ... (log(N) times) = O(logN)
In Quicksort, the recursion tree is lg(N) levels deep and each of these levels requires O(N) amount of work. So the total running time is O(NlgN).
In Quickselect, the recurision tree is lg(N) levels deep and each level requires only half the work of the level above it. This produces the following:
N * (1/1 + 1/2 + 1/4 + 1/8 + ...)
or
N * Summation(1/i^2)
1 < i <= lgN
The important thing to note here is that i goes from 1 to lgN, but not from 1 to N and also not from 1 to infinity.
The summation evaluates to 2. Hence Quickselect = O(2N).
Quicksort does not have a big-O of nlogn - it's worst case runtime is n^2.
I assume you're asking about Hoare's Selection Algorithm (or quickselect) not the naive selection algorithm that is O(kn). Like quicksort, quickselect has a worst case runtime of O(n^2) (if bad pivots are chosen), not O(n). It can run in expectation time n because it's only sorting one side, as you point out.
Because for selection, you're not sorting, necessarily. You can simply count how many items there are which have any given value. So an O(n) median can be performed by counting how many times each value comes up, and picking the value that has 50% of items above and below it. It's 1 pass through the array, simply incrementing a counter for each element in the array, so it's O(n).
For example, if you have an array "a" of 8 bit numbers, you can do the following:
int histogram [ 256 ];
for (i = 0; i < 256; i++)
{
histogram [ i ] = 0;
}
for (i = 0; i < numItems; i++)
{
histogram [ a [ i ] ]++;
}
i = 0;
sum = 0;
while (sum < (numItems / 2))
{
sum += histogram [ i ];
i++;
}
At the end, the variable "i" will contain the 8-bit value of the median. It was about 1.5 passes through the array "a". Once through the entire array to count the values, and half through it again to get the final value.