What are the ways of calculating average case complexity? - data-structures

Background:
For my Data Structures and Algorithms course I am studying Big O notation. So far I understand how to work out the time complexity for the best and worst case scenarios. However, the average case is just baffling me. The teacher just throws equations at us that I don't understand, and he is not willing to explain them in detail.
Question:
So please guys, what is the best way to calculate this? Is there one equation that calculates this or does it vary from algorithm to algorithm?
What are the steps you take to calculate this?
Let's take the insertion sort algorithm as an example.
Research:
I looked on YouTube and Stack Overflow for answers, but they all use different equations.
Any help would be great
thanks

As mentioned in the comments, you have to look at the average input to the algorithm (which in this case means random). A good way to think about it is to try to trace what the algorithm would do if the input were average.
For the example of insertion sort:
In the best case (when the input is already sorted) the algorithm will look through the input but never exchange anything, clearly resulting in a running time of O(n).
In the worst case (when the input is in exactly the opposite of the desired order) the algorithm will move every element all the way from its current position to the start of the list; that is, the element at index 0 will not be moved, the element at index 1 will be moved once, the element at index 2 will be moved twice, and so on, resulting in a running time of 0 + 1 + 2 + ... + (n-1) ≈ 0.5n² = O(n²).
The same way of thinking can be used to find the average case, but instead of each element moving all the way to the start, we can expect that it will on average move halfway towards the start; that is, the element at index 0 will not be moved, the element at index 1 will be moved half a time (of course this only makes sense on average), the element at index 2 will be moved once, the element at index 3 will be moved 1.5 times, and so on, resulting in a running time of 0 + 0.5 + 1 + 1.5 + ... + (n-1)/2 ≈ 0.25n² (at each index, we have half of what we had in the worst case) = O(n²).
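A quick way to sanity-check this reasoning is to count the moves empirically. The following throwaway sketch (not part of the original answer) counts shifts for sorted, reversed and random inputs; the counts come out near 0, n²/2 and n²/4 respectively:

import random

def insertion_sort_shift_count(a):
    # returns the number of element moves performed while sorting a copy of a
    a = list(a)
    shifts = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]  # shift one element towards the end
            j -= 1
            shifts += 1
        a[j + 1] = key
    return shifts

n = 1000
print(insertion_sort_shift_count(range(n)))                    # best case: 0
print(insertion_sort_shift_count(range(n - 1, -1, -1)))        # worst case: n(n-1)/2 = 499500
print(insertion_sort_shift_count(random.sample(range(n), n)))  # average case: roughly n²/4 ≈ 250000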
Of course not all algorithms are as simple as this, but looking at what the algorithm would do at each step if the input were random usually helps. If you have any kind of information available about the input to the algorithm (for instance, insertion sort is often used as the last step after another algorithm has done most of the sorting, as it is very efficient if the input is almost sorted, and in such a case we might know that no element is going to be moved more than x times), then this can be taken into account when computing the average running time.

Related

Data Structures & Algorithms Optimal Solution Explanation

I'm currently doing a DS&A Udemy course as I am prepping for the heavy recruiting this upcoming fall. I stumbled upon a problem that prompted along the lines of:
"Given two lists, figure out which integer present in the first list is missing from the second list."
There were two solutions given in the course one which was considered a brute force solution and the other one the more optimal.
Here are the solutions:
import collections

def finderBasic(list1, list2):
    list1.sort()
    list2.sort()
    for i in range(len(list2)):
        if list1[i] != list2[i]:
            return list1[i]
    # if no mismatch is found, the missing element is the largest in list1
    return list1[-1]

def finderOptimal(list1, list2):
    d = collections.defaultdict(int)
    for num in list2:
        d[num] += 1  # count occurrences so duplicates are handled
    for num in list1:
        if d[num] == 0:
            return num
        else:
            d[num] -= 1
The course explains that finderOptimal is the more optimal way of solving the problem, as it solves it in O(n), i.e. linearly. Can someone please explain further why that is? I just felt that finderBasic was much simpler and only went through one loop. Any help would be much appreciated, thank you!
You would be correct if it were only about going through loops; the first solution would be better.
-- as you said, going through one (whole) for loop takes O(n) time, and it doesn't matter whether you go through it once, twice or c times (as long as c is small enough).
However, the heavy operation here is sorting, as it takes circa n*log(n) time, which is larger than O(n). That means that even if you run through the for loop twice in the 2nd solution, it will still be much better than sorting once.
Please note that accessing a dictionary key takes approximately O(1) time, so the time with the loop is still O(n).
Refer to: https://wiki.python.org/moin/TimeComplexity
The basic solution may be better for a reader, as it's very simple and straightforward; computationally, however, it's more expensive.
Disclaimer: I am not familiar with python.
There are two loops you are not accounting for in the first example. Each of those sort() calls has at least two nested loops inside to implement the sorting. On top of that, the best performance you can usually get for general-case sorting is O(n log(n)).
The second case avoids all sorting and simply uses a "scorecard" to mark what is present. Additionally, it uses a dictionary, which is a hash table. I am sure you have already learned that hash tables offer constant-time - O(1) - operations.
Simpler does not always mean most efficient. Conversely, efficient is often hard to comprehend.
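If you want to see the difference concretely rather than take it on faith, a rough timing sketch (assuming the two finder functions above are defined in scope) should show the dictionary version growing roughly linearly while the sorting version grows noticeably faster:

import random
import timeit

def measure(finder, n):
    list1 = random.sample(range(10 * n), n)
    list2 = list1[1:]  # drop one element, then shuffle the rest
    random.shuffle(list2)
    # pass fresh copies on every run so the in-place sort doesn't help later runs
    return timeit.timeit(lambda: finder(list(list1), list(list2)), number=10)

for n in (1000, 2000, 4000, 8000):
    print(n, measure(finderBasic, n), measure(finderOptimal, n))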

Is a while loop with a nested for loop O(n) or O(n^2)?

I have 2 blocks of code: one with a single while loop, and a second with a for loop inside the while loop. My professor is telling me that Option 1 has an algorithmic complexity of O(n) and Option 2 has an algorithmic complexity of O(n^2), but can't explain why, other than pointing to the nested for loop. I am confused because both perform the exact same number of calculations for any given size N, which doesn't seem to indicate that they have different complexities.
I'd like to know:
a) if my professor is correct, how can they perform the same calculations but have different big Os?
b) if my professor is incorrect and they are the same complexity, is it O(n) or O(n^2)? Why?
I've used inline comments denoted by '#' to note the computations. The number of packages to deliver should be N. self.trucks is a list. self.workDayCompleted is a boolean determined by whether all packages have been delivered.
Option 1:
# initializes index for fake for loop
truck_index = 0
while(not self.workDayCompleted):
    # checks if truck index has reached end of self.trucks list
    if(truck_index != len(self.trucks)):
        # does X amount of calculations required for delivery of truck's packages
        while(not self.trucks[truck_index].isEmpty()):
            self.trucks[truck_index].travel()
            self.trucks[truck_index].deliverPackage()
            if(hub.packagesExist()):
                self.trucks[truck_index].travelToHub()
                self.trucks[truck_index].loadPackages()
        # increments index
        truck_index += 1
    else:
        # resets index to 0 for next iteration set through truck list
        truck_index = 0
        # does X amount of calculations required for while loop condition
        self.workDayCompleted = self.isWorkDayCompleted()
Option 2:
while(not self.workDayCompleted):
    # initializes index (i)
    # each iteration checks if index has reached end of self.trucks list
    # increments index
    for i in range(len(self.trucks)):
        # does X amount of calculations required for delivery of truck's packages
        while(not self.trucks[i].isEmpty()):
            self.trucks[i].travel()
            self.trucks[i].deliverPackage()
            if(hub.packagesExist()):
                self.trucks[i].travelToHub()
                self.trucks[i].loadPackages()
    # does X amount of calculations required for while loop condition
    self.workDayCompleted = self.isWorkDayCompleted()
Any help is greatly appreciated, thank you!
It certainly seems like these two pieces of code are effectively implementing the same algorithm (i.e. deliver a package with each truck, then check to see if the work day is completed, repeat until the work day is completed). From this perspective you're right to be skeptical.
The question becomes: are they O(n) or O(n^2)? As you've described it, this is impossible to determine, because we don't know what the conditions are for the work day being completed. Is it related to the amount of work that has been done by the trucks? Without that information we have no way to reason about when the outer loop exits. For all we know the condition is that each truck must deliver 2^n packages and the complexity is actually O(n * 2^n).
So if your professor is right, my only guess is that there's a difference in the implementation of isWorkDayCompleted() between the two options. Barring something like that, though, the two options should have the same complexity.
Regardless, when it comes to problems like this it is always important to make sure that you're both talking about the same things:
What n means (presumably the number of trucks)
What you're counting (presumably the number of deliveries and maybe also the checks for the work day being done)
What the end state is (this is the red flag for me -- the work day being completed needs to be better defined)
Subsequent edits lead me to believe both of these options are O(n), since they ultimately perform one or two "travel" operations per package, depending on the number of trucks and their capacity. Given this, I think the answer to your core question (do those different control structures result in different complexity analysis) is no, they don't.
It also seems unlikely that the internals are affecting the code complexity in some important way, so my advice would be to get back together with your professor and see if they can expand on their thoughts. It very well might be that this was an oversight on their part, or that they were trying to make a more subtle point about how some of the components you're using were implemented.
If you get their explanation and there is something more complex going on that you still have trouble understanding, that should probably be a separate question (perhaps linked to this one).
a) if my professor is correct, how can they perform the same calculations but have different big Os?
Two algorithms that do the same number of "basic operations" have the same time complexity, regardless of how the code is structured.
b) if my professor is incorrect and they are the same complexity, is it O(n) or O(n^2)? Why?
First you have to define: what is "n"? Is n the number of trucks? Next, is the number of "basic operations" per truck the same, or does it vary in some way?
For example: If the number of operations per truck is constant C, the total number of operations is C*n. That's in the complexity class O(n).
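To make that concrete, here is a stripped-down sketch (with a hypothetical Truck stub standing in for the asker's classes) that counts the basic operations under both control structures; the counts come out identical, one delivery per package:

class Truck:
    def __init__(self, packages):
        self.packages = packages
    def isEmpty(self):
        return self.packages == 0
    def deliverPackage(self):
        self.packages -= 1

def option1(trucks):
    # single while loop with a manually managed index
    ops = 0
    truck_index = 0
    work_day_completed = False
    while not work_day_completed:
        if truck_index != len(trucks):
            while not trucks[truck_index].isEmpty():
                trucks[truck_index].deliverPackage()
                ops += 1
            truck_index += 1
        else:
            truck_index = 0
            work_day_completed = all(t.isEmpty() for t in trucks)
    return ops

def option2(trucks):
    # for loop nested inside the while loop
    ops = 0
    work_day_completed = False
    while not work_day_completed:
        for t in trucks:
            while not t.isEmpty():
                t.deliverPackage()
                ops += 1
        work_day_completed = all(t.isEmpty() for t in trucks)
    return ops

print(option1([Truck(5) for _ in range(10)]))  # 50 operations
print(option2([Truck(5) for _ in range(10)]))  # 50 operations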

Is there a way to calculate the progress of a sorting algorithm?

I'm trying to make a subjective sort based on shell sort. I'm referring to the original (Donald Shell's) algorithm. I have already built all the logic; it is exactly the same as shell sort, but instead of the computer calculating which element is greater, the user determines subjectively which is greater.
The problem is that I would like to display a percentage or something similar so the user knows how far along the sorting already is. That's why I want to find a way to know it.
I tried asking here (What is the formula to get the number of passes in a shell sort?), but maybe I didn't express myself well last time and the question was closed.
I first tried associating the progress with the number of passes over the array in the shell sort, but later I noticed that it is not a fixed number. So if you have an idea of the best way to display the progress of the sorting, I would really appreciate it.
I made this formula, displaying the progress as a color based on the number of passes. It is the closest I could get, but it doesn't perfectly match the maximum range of the color list.
(Code in Dart/Flutter)
List<Color> colors = [
  Color(0xFFFF0000), // red
  Color(0xFFFF5500),
  Color(0xFFFFAA00),
  Color(0xFFFFFF00), // yellow
  Color(0xFFAAFF00),
  Color(0xFF00FF00),
  Color(0xFF00FF00), // green
];
[...]
style: TextStyle(
  color: colors[(((pass - 1) * (colors.length - 1)) / sqrt(a.length).ceil()).floor()],
),
[...]
It doesn't have to be done the way I tried, so if you have an idea of how to display the progress of the sorting, please share it.
EDIT: I think I found the answer!! At least for shell sort, it works based on the number of passes through the array.
Just replace sqrt(a.length).ceil() with (log(a.length) / log(2)).floor()
This line:
color: colors[(((pass - 1) * (colors.length - 1)) / (log(a.length) / log(2)).floor()).floor()]),
How far along you are in many types of sorts usually depends on the initial order of the elements to be sorted.
For shellsort you have the individual passes further complicating the determination process.
As an example and to illustrate the problem, take insertion sort:
It is the fastest sort of all in one specific set of circumstances, namely sorting a vector that is already sorted in the intended direction - requiring n-1 comparisons.
It is one of the slowest in the opposite circumstance, sorting a vector that is already sorted but in the opposite direction - requiring (n*(n-1))/2 comparisons.
Assuming that n=100, the best case is 99 and the worst 4950 comparisons. That's a factor of 1:50 in the number of comparisons required. So when you've done 50 comparisons, you're 50% through the best case or 1% through the worst.
Shellsort does not have as good a best case for already-sorted data as insertion sort, but it is nonetheless very good. The opposite case - the worst case for insertion sort - is actually not the worst case for shellsort, and shellsort handles it much faster than insertion sort does. Shellsort's worst case is also much better than insertion sort's worst. This means that for a given n you know exactly the best and worst cases for insertion sort, and you know that shellsort will be somewhat slower than insertion sort's best case and significantly faster than its worst - if that helps in your quest.
But however you look at it, you won't be able to reliably predict how far along in a (shell)sort you are unless you know how many comparisons are required for the specific data and you only know that after you have sorted it.
Maybe you should use a progress bar like Microsoft uses in Windows: it starts off really quickly but then suddenly realizes that it is halfway along and maybe it should slow down so as not to reach the end even though a lot of sorting remains. The last few millimeters of its travel may take many minutes in some circumstances.
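That said, if an approximate indicator is acceptable, the asker's pass-based idea can be written down directly. A minimal Python sketch (assuming Shell's original gap sequence n/2, n/4, ..., 1, so about log2(n) passes; the amount of work per pass varies, so the percentage is only a rough guide, as explained above):

import math
import random

def shell_sort_with_progress(a, report):
    # Shell's original gap sequence: n//2, n//4, ..., 1 -> about log2(n) passes
    n = len(a)
    total_passes = max(1, int(math.log2(n)))
    passes_done = 0
    gap = n // 2
    while gap > 0:
        for i in range(gap, n):
            key = a[i]
            j = i
            while j >= gap and a[j - gap] > key:
                a[j] = a[j - gap]
                j -= gap
            a[j] = key
        passes_done += 1
        report(min(1.0, passes_done / total_passes))  # fraction of passes completed
        gap //= 2

data = random.sample(range(100), 100)
shell_sort_with_progress(data, lambda p: print(f"{p:.0%}"))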

Calculate "moving" Covariance

I've been trying to figure out how to efficiently calculate the covariance in a moving window, i.e. moving from a set of values (x[0], y[0])..(x[n-1], y[n-1]) to a new set of values (x[1], y[1])..(x[n], y[n]). In other words, the value (x[0], y[0]) gets replaced by the value (x[n], y[n]). For performance reasons I need to calculate the covariance incrementally, in the sense that I'd like to express the new covariance Cov(x[1]..x[n], y[1]..y[n]) in terms of the previous covariance Cov(x[0]..x[n-1], y[0]..y[n-1]).
Starting off with the naive formula for covariance as described here:
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Covariance
All I can come up with is:
Cov(x[1]..x[n], y[1]..y[n]) =
Cov(x[0]..x[n-1], y[0]..y[n-1]) +
(x[n]*y[n] - x[0]*y[0]) / n -
AVG(x[1]..x[n]) * AVG(y[1]..y[n]) +
AVG(x[0]..x[n-1]) * AVG(y[0]..y[n-1])
I'm sorry about the notation, I hope it's more or less clear what I'm trying to express.
However, I'm not sure if this is sufficiently numerically stable. Dealing with large values, I might run into arithmetic overflow or other issues (for example cancellation).
Is there a better way to do this?
Thanks for any help.
It looks like you are trying some form of "add the new value and subtract the old one". You are correct to worry: this method is not numerically stable. Keeping sums this way is subject to drift, but the real killer is the fact that at each step you are subtracting a large number from another large number to get what is likely a very small number.
One improvement would be to maintain your sums (of x_i, y_i, and x_i*y_i) independently, and recompute the naive formula from them at each step. Your running sums would still drift, and the naive formula is still numerically unstable, but at least you would only have one step of numerical instability.
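For reference, here is a minimal sketch of that improvement (windowed running sums plus the naive formula; it inherits the drift and cancellation caveats just described):

from collections import deque

class RollingCovariance:
    # windowed covariance from running sums of x, y and x*y (naive formula)
    def __init__(self, window):
        self.window = window
        self.xs = deque()
        self.ys = deque()
        self.sum_x = self.sum_y = self.sum_xy = 0.0

    def push(self, x, y):
        self.xs.append(x)
        self.ys.append(y)
        self.sum_x += x
        self.sum_y += y
        self.sum_xy += x * y
        if len(self.xs) > self.window:  # evict the oldest pair
            ox = self.xs.popleft()
            oy = self.ys.popleft()
            self.sum_x -= ox
            self.sum_y -= oy
            self.sum_xy -= ox * oy

    def covariance(self):
        n = len(self.xs)
        return self.sum_xy / n - (self.sum_x / n) * (self.sum_y / n)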
A stable way to solve this problem would be to implement a formula for (stably) merging statistical sets, and evaluate your overall covariance using a merge tree. Moving your window would update one of your leaves, requiring an update of each node from that leaf to the root. For a window of size n, this method would take O(log n) time per update instead of the O(1) naive computation, but the result would be stable and accurate. Also, if you don't need the statistics for each incremental step, you can update the tree once per each output sample instead of once per input sample. If you have k input samples per output sample, this reduces the cost per input sample to O(1 + (log n)/k).
From the comments: the wikipedia page you reference includes a section on Knuth's online algorithm, which is relatively stable, though still prone to drift. You should be able to do something comparable for covariance; and resetting your computation every K*n samples should limit the drift at minimal cost.
Not sure why no one has mentioned this, but you can use the Welford online algorithm, which relies on the running mean.
The equations should look like the following, with the online mean given by:
AVG_n(x) = AVG_{n-1}(x) + (x[n] - AVG_{n-1}(x)) / n
and the covariance co-moment updated as:
C_n = C_{n-1} + (x[n] - AVG_{n-1}(x)) * (y[n] - AVG_n(y))
so that Cov(x[1]..x[n], y[1]..y[n]) = C_n / n.
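A minimal Python sketch of that update (add-only; to slide a window you would still need one of the approaches above to evict the oldest sample):

class WelfordCovariance:
    # online covariance via Welford-style running means (add-only)
    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.c = 0.0  # co-moment: running sum of (x - mean_x) * (y - mean_y)

    def push(self, x, y):
        self.n += 1
        dx = x - self.mean_x              # uses the old mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.c += dx * (y - self.mean_y)  # uses the new mean of y

    def covariance(self):
        return self.c / self.n            # population covariance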

How do you determine the best, worse and average case complexity of a problem generated with random dice rolls?

There is a picture book with 100 pages. Dice are rolled randomly to select one of the pages, and subsequently rerolled in order to search for a certain picture in the book. How do I determine the best, worst and average case complexity of this problem?
Proposed answer:
best case: picture is found on the first dice roll
worst case: picture is found on 100th dice roll or picture does not exist
average case: picture is found after 50 dice rolls (= 100 / 2)
Assumption: incorrect pictures are searched at most one time
Given your description of the problem, I don't think your assumption (that incorrect pictures are only "searched" once) sounds right. If you don't make that assumption, then the answer is as shown below. You'll see the answers are somewhat different from what you proposed.
It is possible that you could be successful on the first attempt. Therefore, the first answer is 1.
If you were unlucky, you could keep rolling the wrong number forever. So the second answer is infinity.
The third question is less obvious.
What is the average number of rolls? You need to be familiar with the Geometric Distribution: the number of trials needed to get a single success.
Define p as the probability for a successful trial; p=0.01.
Let Pr(x = k) be the probability of the first successful trial being the k-th. Then we're going to have (k-1) failures and one success, so Pr(x = k) = (1-p)^(k-1) * p. Verify that this is the "probability mass function" on the wiki page (left column).
The mean of the geometric distribution is 1/p, which is therefore 100. This is the average number of rolls required to find the specific picture.
(Note: We need to consider 1 as the lowest possible value, rather than 0, so use the left hand column of the table on the Wikipedia page.)
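You can confirm the 1/p result with a quick simulation (a throwaway sketch, not part of the original answer); the empirical average comes out close to 100 rolls:

import random

def rolls_until_found(pages=100, target=1):
    # roll a fair 100-sided die until the target page comes up
    rolls = 0
    while True:
        rolls += 1
        if random.randint(1, pages) == target:
            return rolls

trials = 100_000
print(sum(rolls_until_found() for _ in range(trials)) / trials)  # ≈ 100 = 1/p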
To analyze this, think about what the best, worst and average cases actually are. You need to answer three questions to find those three cases:
What is the fewest number of rolls to find the desired page?
What is the largest number of rolls to find the desired page?
What is the average number of rolls to find the desired page?
Once you find the first two, the third should be less tricky. If you need asymptotic notation as opposed to just the number of rolls, think about how the answers to each question change if you change the number of pages in the book (e.g. 200 pages vs 100 pages vs 50 pages).
The worst case is not the page being found after 100 dice rolls. That would be the case if your dice always returned different numbers. The worst case is that you never find the page (the way you stated the problem).
The average case is not the average of the best and worst cases, fortunately.
The average case is:
1 * (probability of finding page on the first dice roll)
+ 2 * (probability of finding page on the second dice roll)
+ ...
And yes, the sum is infinite, since in thinking about the worst case we determined that you may have an arbitrarily large number of dice rolls. It doesn't mean that it can't be computed (it could mean that, but it doesn't have to).
The probability of finding the page on the first try is 1/100. What's the probability of finding it on the second dice roll?
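If you want to check the resulting infinite sum numerically, a quick sketch (terms beyond a few thousand contribute essentially nothing) shows it converging to 1/p = 100:

p = 1 / 100
# E[rolls] = sum over k of k * Pr(found on the k-th roll), truncated where terms vanish
expected = sum(k * (1 - p) ** (k - 1) * p for k in range(1, 10_000))
print(expected)  # ≈ 100 = 1/p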
You're almost there, but (1 + 2 + ... + 100)/100 isn't 50.
It might help to observe that your random selection method is equivalent to randomly shuffling the whole book and then searching it in order for your target. Each position is equally likely, so the average is straightforward to calculate. Except of course that you aren't doing all that work up front, just as much as is needed to generate each random number and access the corresponding element.
Note that if your book were stored as a linked list, then the cost of moving from each randomly-selected page to the next selection depends on how far apart they are, which will complicate the analysis quite a lot. You don't actually state that you have constant-time access, and it's possibly debatable whether a "real book" provides that or not.
For that matter, there's more than one way to choose random numbers without repeats, and not all of them have the same running time.
So, you'd need more detail in order to analyse the algorithm in terms of anything other than "number of pages visited".
