Amortized Cost in Data Structure

Hi, how can I find the amortized cost of a constant increment in a data structure?
For example, if the size of the array is increased by 1000 (i.e. from N -> N + 1000) every time it overflows,
and also by a fixed factor (i.e. from N -> 7*N)?
I get an idea of how it works when the size is doubled, like 1 -> 2, 2 -> 4, but I am having a hard time getting an idea about constant increments.

Appending to a list that uses a constant-size resizing strategy takes O(n) time amortized.
Let's say you have n append operations to a list for some arbitrary positive integer n.
Then, on every 1000th operation, you need to resize the underlying array, which requires copying m elements, where m is the size of the list at the time of resizing.
The total cost of the resize operations can be expressed as 1000 + 2000 + 3000 + ... + [(n // 1000) * 1000]. Using the formula for the sum of the first k natural numbers, this comes out to 1000 * (n // 1000) * (n // 1000 + 1) / 2, which is O(n^2) in total. Dividing that total by the n append operations gives an amortized cost of O(n) per append for the resizing strategy you mentioned.
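If it helps to see the two strategies from the question side by side, here is a minimal sketch (the function name and the starting capacity of 1 are my own choices; only the copying work is counted):

def total_copy_cost(n, grow):
    # Total elements copied across n appends, where grow(capacity)
    # returns the new capacity whenever the array overflows.
    capacity, size, copied = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            copied += size              # every existing element gets copied
            capacity = grow(capacity)
        size += 1
    return copied

for n in (10_000, 100_000, 1_000_000):
    const = total_copy_cost(n, lambda c: c + 1000)   # N -> N + 1000
    factor = total_copy_cost(n, lambda c: 7 * c)     # N -> 7 * N
    print(n, const / n, factor / n)                  # per-append copy cost

The middle column (constant increment) grows roughly linearly with n, while the last column (fixed factor) stays bounded, which is exactly the O(n) vs. O(1) amortized difference.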

Related

Binary Search through multiple elements

I had this question on an exam:
I have a very slow computer which takes one second for binary search on an array with 1,000 (one thousand) elements. How long should I expect the computer to take for binary search on an array with 1,000,000 (one million) elements?
I have trouble understanding this. Would it take longer to search through 1 million elements than 1,000, or does it depend on the computer?
Binary search has O(log(N)) time complexity, so the completion time is
t = c * log(N, 10)
In given case for N = 1000 we know t and thus we can find out c
1 = c * log(1000, 10)
1 = c * 3
c = 1/3
For N = 1000000 we have
t = 1 / 3 * log(1000000, 10) =
= 1 / 3 * 6 =
= 2
So we can expect that binary search within 1000000 items will be completed in 2 seconds.
Please note that O(log(N)) == O(log(N, m)), since log(N, m) == log(N, k) / log(m, k), which means that when working with O(log(N)) we are free to choose the logarithm base. In our case (1000 and 1000000) it's convenient to use base 10 (or base 1000).
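Here is that calculation spelled out as a quick Python sketch:

import math

c = 1 / math.log10(1_000)        # 1 second = c * log10(1000)  =>  c = 1/3
t = c * math.log10(1_000_000)    # expected time for one million elements
print(t)                         # 2.0 seconds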
First, a binary search requires the list or array to be sorted. And since you are dividing each list/sublist in half it will take log2(N) searches to find the correct item where log2 is log to the base 2.
So for 1_000_000 items it should only take about 20 comparisons to home in on the item. So it should be very quick (sans the time to sort the list to begin with).
1000 items would take about 10 comparisons. And one second, imo, is much too long to do such a simple search on any modern day processor.
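If you want to check those comparison counts empirically, here is a small sketch (the counter function is mine; it counts one comparison per probed element):

def binary_search_comparisons(arr, target):
    # Iterative binary search that also counts how many elements were probed.
    lo, hi, probes = 0, len(arr) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if arr[mid] == target:
            return True, probes
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, probes

# Worst case: search for a value that is not present.
for n in (1_000, 1_000_000):
    print(n, binary_search_comparisons(list(range(n)), -1)[1])   # ~10 and ~20 probes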
It does depend on the computer and also depends on the size of the array. Since in this question the same computer is used, we can abstract the effects of the computer and focus on the sample size.
Binary search has logarithmic time complexity. If you take the ratio of log(1_000_000) to log(1_000), you will see that you should expect double the time (2 seconds) to search the 1-million-element array.
This is assuming the worst case. For the average case, the calculation gets more complex; you can check the Wikipedia page on binary search for a deeper analysis.

How to calculate exponent using only arithmetic operations in constant time?

I'm trying to find a way to loop through an array of integers of size N and multiply each of those integers by 128^((N-1) - i), where N is the length of the array and i is the index of the integer, and then adding all those results together.
For example, an array input of [1, 2, 3, 4] would return 1 * (128^3) + 2 * (128^2) + 3 * (128^1) + 4 * (128^0).
My algorithm needs to run in O(N) time, but the exponent operation is expensive, as, for example, 2^3 takes three operations. So, I need to find a way to operate on each integer in the array in O(1) time, using only arithmetic operations (-, +, *, /, %). The most obvious (incorrect) way I could think of is simply multiplying each integer (N-i) times, but that does not take constant time. I was also thinking of using exponentiation by squaring, but this takes log_2(N-i) time for operating on each integer, which is not constant.
128 is 2^7, and multiplying a number by 128^k shifts its binary representation left by 7*k positions.
1 * (128^3) + 2 * (128^2) + 3 * (128^1) + 4 * (128^0)
= 1000000000000000000000 + 1000000000000000 + 110000000 + 100
To answer the title question: it's possible to prove that with a constant number of those operations, you can't make numbers big enough for sufficiently large exponents.
To answer the underlying question: you can use the polynomial evaluation method sometimes attributed to Horner: ((1 * 128 + 2) * 128 + 3) * 128 + 4. Note that unless you're modding by something, manipulating the bignums is still going to cost you Õ(n^2) time.
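A minimal sketch of that Horner-style evaluation (the function name and base parameter are mine):

def eval_digits(digits, base=128):
    # Horner's rule: ((d0*base + d1)*base + d2)*base + ... + d_last
    acc = 0
    for d in digits:
        acc = acc * base + d
    return acc

print(eval_digits([1, 2, 3, 4]))               # 2130308
print(1 * 128**3 + 2 * 128**2 + 3 * 128 + 4)   # same value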
If you are indeed working with bignums, there's a more complicated divide and conquer method that should be faster assuming that bignum multiplication runs faster than the school method. The idea is to split the input in half, evaluate the lower and upper halves separately using recursion, and then put them together. On your example, this looks like
(1 * 128 + 2) * 128^2 + (3 * 128 + 4),
where we compute the term 128^2 (i.e., 128^(n/2)) by repeated squaring. The operation count is still O(n) since we have the recurrence
T(n) = 2 T(n/2) + O(log n),
which falls into Case 1 of the Master Theorem. In practice, the running time will be dominated by the large multiplications, with whatever asymptotic complexity the particular implementation has.
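And a rough sketch of the divide-and-conquer variant described above (assuming Python's arbitrary-precision integers, whose built-in exponentiation already uses repeated squaring; names are mine):

def eval_digits_dc(digits, base=128):
    # Split the digit list in half, evaluate both halves recursively, then
    # shift the upper half left by the width of the lower half.
    n = len(digits)
    if n == 1:
        return digits[0]
    mid = n // 2
    upper = eval_digits_dc(digits[:mid], base)
    lower = eval_digits_dc(digits[mid:], base)
    return upper * base ** (n - mid) + lower

print(eval_digits_dc([1, 2, 3, 4]))   # (1*128 + 2) * 128^2 + (3*128 + 4) = 2130308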

Binary vs Linear searches for unsorted N elements

I'm trying to understand a formula for when we should use quicksort. For instance, we have an array with N = 1_000_000 elements. If we will search only once, we should use a simple linear search, but if we'll search 10 times we should first sort the array (O(n log n)). How can I find the threshold, for a given input array size, at which I should sort first and then use binary search?
You want to solve an inequality that roughly might be described as
t * n > C * n * log(n) + t * log(n)
where t is the number of checks and C is some constant for the sort implementation (to be determined experimentally). Once you have evaluated this constant, you can solve the inequality numerically (with some uncertainty, of course).
Like you already pointed out, it depends on the number of searches you want to do. A good threshold can come out of the following statement:
n*log[b](n) + x*log[2](n) <= x*n/2
Here x is the number of searches, n the input size, and b the base of the logarithm for the sort, which depends on the partitioning you use.
When this statement evaluates to true, you should switch methods from linear search to sort and search.
Generally speaking, a linear search through an unordered array will take n/2 steps on average, though this average will only play a big role once x approaches n. If you want to stick with big Omicron or big Theta notation then you can omit the /2 in the above.
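As a rough sketch, you could solve that inequality for x numerically like this (the function name is mine, and base b = 2 is assumed for the sort):

import math

def breakeven_searches(n, b=2):
    # n*log_b(n) + x*log2(n) <= x*n/2   =>   x >= n*log_b(n) / (n/2 - log2(n))
    return math.ceil(n * math.log(n, b) / (n / 2 - math.log2(n)))

print(breakeven_searches(1_000_000))   # about 40 searches under these assumptions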
Assuming n elements and m searches, with crude approximations
the cost of the sort will be C0.n.log n,
the cost of the m binary searches C1.m.log n,
the cost of the m linear searches C2.m.n,
with C2 ~ C1 < C0.
Now you compare
C0.n.log n + C1.m.log n vs. C2.m.n
or
C0.n.log n / (C2.n - C1.log n) vs. m
For reasonably large n, the breakeven point is about C0.log n / C2.
For instance, taking C0 / C2 = 5, n = 1000000 gives m = 100.
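For what it's worth, here is that last estimate as a couple of lines of Python (assuming base-2 logs and C0 / C2 = 5, as in the example):

import math

C0_over_C2 = 5                   # assumed ratio of sort cost to linear-scan cost
n = 1_000_000
m = C0_over_C2 * math.log2(n)    # breakeven number of searches ~ (C0/C2) * log n
print(round(m))                  # ~100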
You should plot the complexities of both operations.
Linear search: O(n)
Sort and binary search: O(n log n + log n)
In the plot, you will see for which values of n it makes sense to choose the one approach over the other.
This actually turned into an interesting question for me as I looked into the expected runtime of a quicksort-like algorithm when the expected split at each level is not 50/50.
The first question I wanted to answer was: for random data, what is the average split at each level? It surely must be greater than 50% (for the larger subdivision). Well, given an array of size N of random values, the smallest value has a subdivision of (1, N-1), the second smallest value has a subdivision of (2, N-2), and so on. I put this in a quick script:
split = 0
for x in range(10000):
    split += float(max(x, 10000 - x)) / 10000
split /= 10000
print(split)
And got exactly 0.75 as an answer. I'm sure I could show that this is always the exact answer, but I wanted to move on to the harder part.
Now, let's assume that even a 25/75 split follows an n*log(n) progression for some unknown logarithm base. That means that num_comparisons(n) = n * log_b(n), and the question is to find b via statistical means (since I don't expect that model to be exact at every step). We can do this with a clever application of least-squares fitting after we use a logarithm identity to get:
C(n) = n * log(n) / log(b)
where now the logarithm can have any base, as long as log(n) and log(b) use the same base. This is a linear equation just waiting for some data! So I wrote another script that generates an array of xs filled with n*log(n) and an array of ys filled with C(n), and used numpy to tell me the slope of that least-squares fit, which I expect to equal 1 / log(b). I ran the script and got b inside of [2.16, 2.3] depending on how high I set n (I varied n from 100 to 100'000'000). The fact that b seems to vary depending on n shows that my model isn't exact, but I think that's okay for this example.
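Here is a minimal sketch of that fitting step. Since the original script isn't shown, a simple 25/75-split recurrence stands in for the measured comparison counts, so the b it prints will not match the figures above exactly:

import math
import numpy as np

def comparisons_25_75(n):
    # Model: every partition costs n comparisons and splits the input 25/75.
    if n <= 1:
        return 0
    return n + comparisons_25_75(n // 4) + comparisons_25_75(n - n // 4)

sizes = [10 ** k for k in range(2, 6)]          # 100 .. 100_000
xs = [n * math.log(n) for n in sizes]           # n * log(n)
ys = [comparisons_25_75(n) for n in sizes]      # modelled C(n)

slope = np.polyfit(xs, ys, 1)[0]                # slope ~ 1 / log(b)
print("estimated b:", math.exp(1 / slope))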
To actually answer your question now, with these assumptions, we can solve for the cutoff point of when: N * n/2 = n*log_2.3(n) + N * log_2.3(n). I'm just assuming that the binary search will have the same logarithm base as the sorting method for a 25/75 split. Isolating N you get:
N = n*log_2.3(n) / (n/2 - log_2.3(n))
If your number of searches N exceeds the quantity on the RHS (where n is the size of the array in question) then it will be more efficient to sort once and use binary searches on that.
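Plugging numbers into that expression (a small sketch; the function name is mine and the 2.3 base is taken from the estimate above):

import math

def search_cutoff(n, b=2.3):
    # N = n*log_b(n) / (n/2 - log_b(n))
    log_b_n = math.log(n, b)
    return n * log_b_n / (n / 2 - log_b_n)

print(search_cutoff(1_000_000))   # roughly 33 searches for a million elements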

Examining an algorithm on a sorted array

I have a sorted array of length n, and I am using linear search to compare my value to every element in the array; then I perform a linear search on the array of size n/2, then on one of size n/4, n/8, etc., until I do a linear search on an array of length 1. In this case n is a power of 2; what is the number of comparisons performed?
I am not sure if this is correct, but I thought that the number of comparisons would be
T(2n) = (n/2) + (n/4) + ... + 1.
My reasoning for this was that you have to go through every element and then you just keep adding it up, but I am still not sure. If someone could walk me through this I would appreciate it.
The recurrence you have set up in your question is a bit off, since if n is the length of your input, then you wouldn't denote the length of the input by 2n. Instead, you'd write it as n = 2^k for some choice of k. Once you have this, then you can do the math like this:
The size of half the array is 2^k / 2 = 2^(k-1)
The size of one quarter of the array is 2^k / 4 = 2^(k-2)
...
If you sum up all of these values, you get the following:
2^k + 2^(k-1) + 2^(k-2) + ... + 2 + 1 = 2^(k+1) - 1
You can prove this in several ways: you can use induction, or use the formula for the sum of a geometric series, etc. This arises frequently in computer science, so it's worth committing to memory.
This means that if n = 2^k, your algorithm runs in time
2^(k+1) - 1 = 2(2^k) - 1 = 2n - 1
So the runtime is 2n - 1, which is Θ(n).
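A quick sketch to sanity-check that closed form (names are mine):

def total_comparisons(n):
    # Linear scans over arrays of size n, n/2, n/4, ..., 1.
    total = 0
    while n >= 1:
        total += n
        n //= 2
    return total

for k in range(1, 11):
    n = 2 ** k
    print(n, total_comparisons(n), 2 * n - 1)   # the last two columns always match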
Hope this helps!

What does Logn actually mean?

I am just studying for my class in Algorithms and have been looking over QuickSort. I understand the algorithm and how it works, but not how to get the number of comparisons it does, or what logn actually means, at the end of the day.
I understand the basics, to the extent of:
x = log_b(Y) then
b^x = Y
But what does this mean in terms of algorithm performance? It's the number of comparisons you need to do, I understand that... the whole idea just seems so unintelligible, though. For example, for QuickSort, at each recursion level k there are 2^k invocations, each on a sublist of length n/2^k.
So, summing to find the number of comparisons :
Σ_{k=0}^{log n} 2^k * 2(n/2^k) = 2n(1 + log n)
Why are we summing up to log n ? Where did 2n(1+logn) come from? Sorry for the vagueness of my descriptions, I am just so confused.
If you consider a full, balanced binary tree, then layer by layer you have 1 + 2 + 4 + 8 + ... vertices. If the total number of vertices in the tree is 2^n - 1 then you have 1 + 2 + 4 + 8 + ... + 2^(n-1) vertices, counting layer by layer. Now, let N = 2^n (the size of the tree), then the height of the tree is n, and n = log2(N) (the height of the tree). That's what the log(n) means in these Big O expressions.
Below is a sample tree:
      1
    /   \
   2     3
  / \   / \
 4   5 6   7
The number of nodes in the tree is 7, but the height of the tree is log2(8) = 3. The log shows up in divide-and-conquer methods: in quicksort you divide the list into 2 sublists and continue until you reach very small lists. The number of levels of division is log n (in the average case), because the height of the division tree is log n. Partitioning at each level takes O(n), because on average you partition N numbers per level (there may be many sublists at a level, but their sizes add up to N). So, as a simple observation, if you have a balanced partition tree you get log n levels of partitioning, which is the height of the tree.
1. Forget about trees for a second; here's the math: log2(N) = k is the same as 2^k = N. That's the definition of log. It could also be the natural log, ln(N) = k, i.e. e^k = N, or the decimal log, log10(N) = k, i.e. 10^k = N.
2. Now look at a perfect, balanced binary tree:
1
1 + 1
1 + 1 + 1 + 1
8 ones
16 ones
etc.
How many elements does it have? 1 + 2 + 4 + 8 + ..., so a 2-level tree has 2^2 - 1 elements, a 3-level tree has 2^3 - 1, and so on. So here is the key formula: number_of_tree_elements = 2^(number of levels) - 1, or, using the definition of log: number of levels = log2(number_of_tree_elements) (you can forget about the -1).
3. Now say the task is to find an element in a balanced binary search tree with N elements and K levels (its height). By the way the tree is constructed, K = log2(N).
MOST IMPORTANT: by the way the tree is constructed, you need at most 'height' operations to find an element among all N elements. And the height equals log2(number_of_tree_elements), so you need log2(N) operations, or log(N) for short.
To understand what O(log(n)) means you might want to read up on Big O notation. In short it means that if your data set gets 1024 times bigger, the runtime will only get about 10 times longer, or less (for base 2, since log2(1024) = 10).
MergeSort runs in O(n*log(n)), which means it will take at most about 10,240 times longer. Bubble sort runs in O(n^2), which means it will take 1024^2 = 1,048,576 times longer. So there is really some time to save :)
To understand your sum, you must look at the mergesort algorithm as a tree:
            sort(3,1,2,4)
           /             \
     sort(3,1)         sort(2,4)
     /       \         /       \
 sort(3)  sort(1)  sort(2)  sort(4)
The sum iterates over each level of the tree: k = 0 is the top, k = log(n) is the bottom. The tree will always be of height log2(n) (as it is a balanced binary tree).
To do a little math:
Σ 2^k * 2(n/2^k) =
2 * Σ 2^k * (n/2^k) =
2 * Σ n*2^k/2^k =
2 * Σ n =
2 * n * (1+log(n)) //As there are log(n)+1 steps from 0 to log(n) inclusive
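If it helps, here is a quick numerical check of that sum (a small sketch; the function name is mine):

import math

def level_sum(n):
    # sum over k = 0 .. log2(n) of 2^k * 2*(n / 2^k)
    levels = int(math.log2(n))
    return sum(2 ** k * 2 * (n / 2 ** k) for k in range(levels + 1))

for n in (8, 64, 1024):
    print(n, level_sum(n), 2 * n * (1 + math.log2(n)))   # both expressions agree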
This is of course a lot of work to do, especially if you have more complex algorithms. In those situations you will be really glad to have the Master Theorem, but for the moment it might just confuse you more. It's very theoretical, so don't worry if you don't understand it right away.
For me, this is a good way to think about issues like this.

Resources