Am I correct in my explanation when calculating the time complexity of the following algorithm?
A HashSet, moduleMarksheetFiles, is being used to add the files that contain the moduleName specified.
for (File file : marksheetFiles) {
    while (csvReader.readRecord()) {
        String moduleName = csvReader.get(ModuleName);
        if (moduleName.equals(module)) {
            moduleMarksheetFiles.add(file);
        }
    }
}
Let m be the number of files
Let k be the average number of records per file.
Each file is added only once, because a HashSet does not allow duplicates. HashSet.add() is O(1) on average and O(n) in the worst case.
Searching for a record with the specified moduleName involves comparing every record in the file to the moduleName, which will take O(n) steps.
Therefore, the average time complexity would be: O((m*k)^2).
Is this correct?
Also, how would you calculate the worst case?
Thanks.
PS. It is not homework, just analysing my system's algorithm to evaluate performance.
No, it's not squared, this is O(mk). (Technically, that means it's also O((mk)²), but we don't care.)
Your misconception is that the worst-case performance of HashSet is what counts here. However, even though a hashtable may have worst-case O(n) insertion time (if it needs to rehash every element), its amortized insertion time is O(1) (assuming your hash function is well behaved; File.hashCode() presumably is). In other words, if you insert multiple things, so many of them will be O(1) that the occasional O(n) insertion does not matter.
Therefore, we can treat insertions as constant-time operations, so performance is purely dictated by the number of iterations through the inner loop body, which is O(mk).
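As a small sketch of that counting argument (the data here is a hypothetical stand-in, not the actual CSV-reading code): the inner loop body runs once per record, so the total work is proportional to m*k, with each HashSet.add amortized O(1).

import java.util.HashSet;
import java.util.Set;

public class IterationCount {
    public static void main(String[] args) {
        int m = 100;   // number of files
        int k = 1000;  // records per file
        Set<Integer> matches = new HashSet<>();
        long iterations = 0;
        for (int file = 0; file < m; file++) {
            for (int record = 0; record < k; record++) {
                iterations++;            // one comparison per record
                if (record % 7 == 0) {   // stand-in for moduleName.equals(module)
                    matches.add(file);   // amortized O(1); duplicates are ignored
                }
            }
        }
        System.out.println(iterations);  // prints m * k = 100000, not (m*k)^2
    }
}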
If an algorithm must make n-1 comparisons to find a certain element, then can we assume that the best possible runtime of the algorithm is O(n)?
I know that the lower bound for sorting algorithms is n log n, but since we only return the one element we found, I figured it would be possible to do better in terms of run time.
Thanks!
To find a certain element in an unsorted list you need O(n).
But if you sort the array (takes O(n log n) in general) you can find a certain element in O(log n).
So if you often need to find elements in the same list, it is most likely worth sorting the list first so that you can then find elements much more efficiently.
If your array is unsorted and you search for an element in it, then in the worst case the linear search algorithm makes n-1 comparisons and the time complexity is O(n).
But if you want to reduce the time complexity, first sort your array and then use the binary search algorithm, which takes O(log n) in the worst case.
So the binary search algorithm is more efficient than linear search.
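To illustrate the sort-then-search idea from the answers above, here is a minimal Java sketch using the standard library's Arrays.sort and Arrays.binarySearch (the values are just example data):

import java.util.Arrays;

public class SortThenSearch {
    public static void main(String[] args) {
        int[] a = {42, 7, 19, 3, 88, 55};
        Arrays.sort(a);                       // O(n log n), done once
        int idx = Arrays.binarySearch(a, 19); // O(log n) per lookup afterwards
        System.out.println(idx >= 0 ? "found at index " + idx : "not found");
    }
}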
For unsorted elements, worst case is when you have to go over all the elements, i.e., O(N). If you need many look-ups then you have several pre-processing alternatives that speed up all future accesses.
Option 1: put the elements in a standard hash table. Creating the hash table costs O(N) on average, and afterwards you pay O(1) on average for each lookup. This assumes that a reasonable hash function can be created for this type of element.
Most languages/libraries implement bucket-based hash-tables, which in pathological cases can put all elements in one bucket, costing O(N) per lookup.
Option 2: there are other hash-table implementations that don't suffer from pathological O(N) cases. The Robin Hood hashing (Wikipedia) (more at Programming.Guide) guarantees O(log N) lookup in the worst case, with average of O(1).
Option 3: another option is to sort elements in O(N log N) once, and then use binary-search to lookup in O(log N). Usually this is slower than Robin Hood hashing (Option 2).
Option 4: If the values are simple integers with limited range, with max-min around N, then it is possible to put the values in an array (list), such that array[value-min] will contain a count of how many times the value appears in the input. It costs O(N) to construct, and O(1) to lookup. Better yet, the constants for both preprocessing and lookup are significantly lower than in any other method. (A small sketch of this option is shown below, after the recommendation.)
Note: I didn't mention the O(N) counting-sort as an alternative to the general case of O(N log N) sorting (option 3), since if max(value)-min(value) is small enough for counting-sort, then option 4 is relevant and is simpler and faster.
If applicable, choose option 4; otherwise, if you are willing to invest time and code, choose option 2. If 4 isn't applicable and 2 is not worth the effort in your case, then choose option 1, provided you don't mind the pathological worst case (never choose option 1 when an adversary may want to harm you with a DoS attack).
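Here is a rough Java sketch of option 4, assuming small integer keys with a known min and max (the names and values are purely illustrative):

public class CountingLookup {
    public static void main(String[] args) {
        int[] values = {3, 1, 4, 1, 5, 9, 2, 6};
        int min = 1, max = 9;                 // assume min/max are known or found in O(N)
        int[] count = new int[max - min + 1];
        for (int v : values) {
            count[v - min]++;                 // O(N) preprocessing
        }
        int query = 5;
        boolean present = query >= min && query <= max && count[query - min] > 0; // O(1) lookup
        System.out.println(present);          // true
    }
}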
Your question has nothing to do with sorting, let alone linear search.
If you claim that n-1 comparisons are mandated, then your problem certainly has complexity Ω(n). But with that information alone, you can't guarantee O(n), because it is not said that these n-1 comparisons are sufficient, nor that the algorithm does not perform extra operations, for instance to decide which comparisons to perform. It could turn out that your algorithm is O(n³) with no chance to do better, but we can't tell.
Best case complexity: Ω(n).
Worst case complexity: unknown.
I have a question about O-notation (big O).
In my code, I am using a for loop to iterate through an array of users.
The for loop has if-statements that make it break out of the loop if the right user is found.
My question is how I measure the O-notation.
Is the O-notation O(N), as I loop through all the users in the array?
Or is the O-notation O(1), as the loop breaks and never runs again?
O notation defines an "order of" relationship between an amount of work (however measured) and the number of items processed (usually 'n'). So "O(n)" means "in direct proportion to the number of items n". "O(1)" means simply "constant".
If a loop processes every item once then the amount of work is intuitively in direct proportion to n. But let's say that your exit condition gets hit on average half way through: we might be tempted to say that this is O(n/2), but instead we still say that it is O(n) because the relationship to n is still direct/linear. Similarly, if you were to assess the relationship to be O(7n^3 + 2n), you'd say the relationship was simply O(n^3) because n^3 is the term that dominates as n grows large.
The answer to your specific question is therefore O(n) because the number of iterations is in direct proportion to n. All that this says is that if N user records take M milliseconds to process, 2N should take about 2M milliseconds.
It is probably worth noting that O notation is strictly concerned with worst case and not the average cost of algorithms (although I have started to find that it is quite common for people to use it in the latter sense). It is always a good idea to specify to avoid ambiguity.
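For reference, a small Java sketch of the kind of loop in the question (user records replaced by plain strings for illustration): the early break helps in the best case, but the worst case still visits all n entries, so the loop is O(n).

public class FindUser {
    static int indexOf(String[] users, String target) {
        for (int i = 0; i < users.length; i++) {
            if (users[i].equals(target)) {
                return i;        // best case: match at the first element, O(1)
            }
        }
        return -1;               // worst case: scanned all n elements, O(n)
    }

    public static void main(String[] args) {
        String[] users = {"alice", "bob", "carol"};
        System.out.println(indexOf(users, "carol")); // 2
    }
}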
Big O notation answers the following two questions:
If there are N data elements, how many steps will the algorithm take?
How will the performance of the algorithm change if the number of data elements increases?
The best-case scenario in your case is that the user you are searching for is found at the first index. The time complexity in this case would be O(1), because the number of steps taken by the algorithm is constant and does not change when the number of elements in the array changes.
The worst-case scenario is that your loop will have to iterate over all the users. That makes the time complexity O(N), because the number of steps taken by the algorithm is directly proportional to the number of elements in the array.
Big O notation generally refers to the worst-case scenario, so you can say that the time complexity in your case is O(N).
The best-case complexity of your for loop is O(1) and the worst-case complexity is O(N); a linear search that has to examine every element is O(N). It also depends on the approach you take to solve the problem: for example, for (int i = n; i > 1; i = i/2) has complexity O(log N), and an if/else condition by itself is O(1).
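A tiny illustration of that halving loop (just a counter added for demonstration): the number of iterations grows like log2(n) rather than n.

public class HalvingLoop {
    public static void main(String[] args) {
        int n = 1_000_000;
        int steps = 0;
        for (int i = n; i > 1; i = i / 2) {
            steps++;               // one step per halving of i
        }
        System.out.println(steps); // prints 19; log2(1,000,000) is about 20
    }
}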
When talking about complexity in general, things like O(3n) tend to be simplified to O(n) and so on. This is merely theoretical, so how does complexity work in reality? Can O(3n) also be simplified to O(n)?
For example, if a task implies that the solution must be of O(n) complexity and in our code we have two linear searches of an array, that is O(n) + O(n). So, in reality, would that solution be considered linear complexity, or not fast enough?
Note that this question is asking about real implementations, not theoretical. I'm already aware that O(n) + O(n) is simplified to O(n).
Bear in mind that O(f(n)) does not give you the amount of real-world time that something takes: only the rate of growth as n grows. O(n) only indicates that if n doubles, the runtime doubles as well, which lumps functions together that take one second per iteration or one millennium per iteration.
For this reason, O(n) + O(n) and O(2n) are both equivalent to O(n), which is the set of functions of linear complexity, and which should be sufficient for your purposes.
Though an algorithm that takes arbitrary-sized inputs will often want the most optimal function as represented by O(f(n)), an algorithm that grows faster (e.g. O(n²)) may still be faster in practice, especially when the data set size n is limited or fixed. However, learning to reason about O(f(n)) representations can help you compose algorithms that have a predictable upper bound, optimal for your use case.
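As a concrete sketch of the situation in the question (two linear passes over the same array; the names are illustrative): the total work is at most 2n array reads, which is still linear, so doubling n roughly doubles the total work.

public class TwoPasses {
    static boolean containsBoth(int[] a, int x, int y) {
        boolean foundX = false, foundY = false;
        for (int v : a) {          // first linear search: up to n steps
            if (v == x) { foundX = true; break; }
        }
        for (int v : a) {          // second linear search: up to n steps
            if (v == y) { foundY = true; break; }
        }
        return foundX && foundY;   // about 2n steps in total, i.e. O(n)
    }

    public static void main(String[] args) {
        System.out.println(containsBoth(new int[]{1, 2, 3, 4}, 2, 4)); // true
    }
}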
Yes, as long as k is a constant, you can write O(kn) = O(n).
The intuition behind is that the constant k doesn't increase with the size of the input space and at some point will be incomparably small to n, so it doesn't have much influence on the overall complexity.
Yes - as long as the number k of array searches is not affected by the input size, even for inputs that are too big to be possible in practice, O(kn) = O(n). The main idea of the O notation is to emphasize how the computation time increases with the size of the input, and so constant factors that stay the same no matter how big the input is aren't of interest.
An example of an incorrect way to apply this is to say that you can perform selection sort in linear time because you can only fit about one billion numbers in memory, and so selection sort is merely one billion array searches. However, with an ideal computer with infinite memory, your algorithm would not be able to handle more than one billion numbers, and so it is not a correct sorting algorithm (algorithms must be able to handle arbitrarily large inputs unless you specify a limit as a part of the problem statement); it is merely a correct algorithm for sorting up to one billion numbers.
(As a matter of fact, once you put a limit on the input size, most algorithms will become constant-time because for all inputs within your limit, the algorithm will solve it using at most the amount of time that is required for the biggest / most difficult input.)
I have seen that in most cases the time complexity is related to the space complexity and vice versa. For example in an array traversal:
for i = 1 to length(v)
    print(v[i])
endfor
Here it is easy to see that the algorithm complexity in terms of time is O(n), but it looks to me like the space complexity is also n (also represented as O(n)?).
My question: is it possible that an algorithm has different time complexity than space complexity?
The time and space complexities are not related to each other. They are used to describe how much space/time your algorithm takes based on the input.
For example when the algorithm has space complexity of:
O(1) - constant - the algorithm uses a fixed (small) amount of space which doesn't depend on the input. For every size of the input the algorithm will take the same (constant) amount of space. This is the case in your example as the input is not taken into account and what matters is the time/space of the print command.
O(n), O(n^2), O(log(n))... - these indicate that you create additional objects based on the length of your input. For example, creating a copy of each element of v, storing the copies in another array, and printing them afterwards takes O(n) space, as you create n additional objects.
In contrast the time complexity describes how much time your algorithm consumes based on the length of the input. Again:
O(1) - no matter how big the input is, it always takes constant time - for example, only one instruction. Like
function(list l) {
    print("i got a list");
}
O(n), O(n^2), O(log(n)) - again it's based on the length of the input. For example
function(list l) {
    for (node in l) {
        print(node);
    }
}
Note that both last examples take O(1) space as you don't create anything. Compare them to
function(list l) {
    list c;
    for (node in l) {
        c.add(node);
    }
}
which takes O(n) space because you create a new list whose size depends on the size of the input in linear way.
Your example shows that time and space complexity might be different. It takes v.length * print.time to print all the elements. But the space is always the same - O(1) because you don't create additional objects. So, yes, it is possible that an algorithm has different time and space complexity, as they are not dependent on each other.
Time and Space complexity are different aspects of calculating the efficiency of an algorithm.
Time complexity deals with finding out how the computational time of an algorithm changes with a change in the size of the input.
On the other hand, space complexity deals with finding out how much (extra) space would be required by the algorithm with a change in the input size.
To calculate the time complexity of an algorithm, the best way is to check whether the number of comparisons (or computational steps) increases as we increase the size of the input; to calculate the space complexity, the best bet is to check whether the additional memory requirement of the algorithm also changes with the size of the input.
A good example could be of Bubble sort.
Let's say you tried to sort an array of 5 elements.
In the first pass you will compare the 1st element with the next 4 elements. In the second pass you will compare the 2nd element with the next 3 elements, and you will continue this procedure till you fully exhaust the list.
Now what will happen if you try to sort 10 elements? In this case you will start by comparing the 1st element with the next 9 elements, then the 2nd with the next 8 elements and so on. In other words, if you have an N-element array you will start off by comparing the 1st element with N-1 elements, then the 2nd element with N-2 elements and so on. This results in O(N^2) time complexity.
But what about space? When you sorted the 5-element or 10-element array, did you use any additional buffer or memory space? You might say: yes, I did use a temporary variable to make the swap. But did the number of variables change when you increased the size of the array from 5 to 10? No. Irrespective of the size of the input, you will always use a single variable to do the swap. This means that the size of the input has nothing to do with the additional space you will require, resulting in O(1) or constant space complexity.
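For concreteness, here is a small Java bubble sort matching the description above: the nested passes give O(N^2) time, while the single temporary variable used for swapping keeps the extra space at O(1) regardless of N.

public class BubbleSort {
    static void sort(int[] a) {
        for (int pass = 0; pass < a.length - 1; pass++) {
            for (int i = 0; i < a.length - 1 - pass; i++) {
                if (a[i] > a[i + 1]) {
                    int temp = a[i];     // the only extra storage: one temporary variable
                    a[i] = a[i + 1];
                    a[i + 1] = temp;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] a = {5, 1, 4, 2, 8};
        sort(a);
        System.out.println(java.util.Arrays.toString(a)); // [1, 2, 4, 5, 8]
    }
}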
Now, as an exercise for you, research the time and space complexity of merge sort.
First of all, the space complexity of this loop is O(1) (the input is customarily not included when calculating how much storage is required by an algorithm).
So the question that I have is whether it's possible that an algorithm has a different time complexity from its space complexity.
Yes, it is. In general, the time and the space complexity of an algorithm are not related to each other.
Sometimes one can be increased at the expense of the other. This is called space-time tradeoff.
There is a well-known relation between time and space complexity.
First of all, time is an obvious bound on space consumption: in time t you cannot reach more than O(t) memory cells. This is usually expressed by the inclusion
DTime(f) ⊆ DSpace(f)
where DTime(f) and DSpace(f) are the sets of languages recognizable by a deterministic Turing machine in time (respectively, space) O(f). That is to say that if a problem can be solved in time O(f), then it can also be solved in space O(f).
Less evident is the fact that space provides a bound on time. Suppose that, on an input of size n, you have at your disposal f(n) memory cells, comprising registers, caches and everything. After having written these cells in all possible ways, you must eventually stop your computation, since otherwise you would re-enter a configuration you already went through and start to loop. Now, on a binary alphabet, f(n) cells can be written in 2^f(n) different ways, which gives our time upper bound: either the computation will stop within this bound, or you may force termination, since the computation will never stop.
This is usually expressed by the inclusion
DSpace(f) ⊆ DTime(2^(cf))
for some constant c. The reason for the constant c is that if L is in DSpace(f) you only know that it will be recognized in space O(f), while in the previous reasoning, f was an actual bound.
The above relations are subsumed by stronger versions, involving nondeterministic models of computation, which is the way they are frequently stated in textbooks (see e.g. Theorem 7.4 in Computational Complexity by Papadimitriou).
Yes, this is definitely possible. For example, sorting n real numbers requires O(n) space, but O(n log n) time. It is true that space complexity is always a lower bound on time complexity, as the time to initialize the space is included in the running time.
Sometimes yes, they are related, and sometimes no, they are not. In fact, we sometimes use more space to get faster algorithms, as in dynamic programming: https://www.codechef.com/wiki/tutorial-dynamic-programming
Dynamic programming uses either memoization or a bottom-up approach. The first technique uses memory to remember repeated solutions, so the algorithm does not need to recompute them and can instead just fetch them from a list of solutions; the bottom-up approach starts with the small solutions and builds on them to reach the final solution.
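As a small illustration of that space-for-time trade (a memoized Fibonacci, added here as an example and not part of the original answer): it spends O(n) extra memory to avoid an exponential amount of recomputation.

import java.util.HashMap;
import java.util.Map;

public class MemoFib {
    static final Map<Integer, Long> memo = new HashMap<>();

    static long fib(int n) {
        if (n <= 1) return n;
        Long cached = memo.get(n);          // reuse a previously computed result
        if (cached != null) return cached;
        long result = fib(n - 1) + fib(n - 2);
        memo.put(n, result);                // O(n) extra space in exchange for O(n) time
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fib(50));        // 12586269025, computed without exponential recursion
    }
}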
Here are two simple examples: one shows a relation between time and space, and the other shows no relation.
Suppose we want to find the sum of all integers from 1 to a given integer n:
code1:
sum = 0
for i = 1 to n
    sum = sum + i
print sum
This code uses only 6 bytes of memory: 2 bytes each for i, n and sum.
Therefore the time complexity is O(n), while the space complexity is O(1).
code2:
array a[n]
a[1] = 1
for i = 2 to n
    a[i] = a[i-1] + i
print a[n]
This code uses at least 2*n bytes of memory for the array.
Therefore the space complexity is O(n), and the time complexity is also O(n).
Space complexity is the way in which the amount of storage space required by an algorithm varies with the size of the problem it is solving. It is normally expressed as an order of magnitude, e.g. O(N^2) means that if the size of the problem (N) doubles then four times as much working storage will be needed.
Space complexity is the total amount of memory space used by an algorithm/program, including the space of the input values during execution, whereas the time complexity is the number of operations an algorithm performs to complete its task. These are two different concepts: a single algorithm can have low time complexity but still take up a lot of memory. For example, hashmaps take more memory than arrays but take less time to look things up.
I am currently reading about amortized analysis. I am not able to fully understand how it is different from the normal analysis we perform to calculate the average or worst-case behaviour of algorithms. Can someone explain it with an example of sorting or something?
Amortized analysis gives the average performance (over time) of each operation in the worst case.
In a sequence of operations, the worst case does not occur often for each operation: some operations may be cheap, some may be expensive. Therefore, a traditional worst-case per-operation analysis can give an overly pessimistic bound. For example, in a dynamic array only some inserts take linear time, while the others take constant time.
When different inserts take different times, how can we accurately calculate the total time? The amortized approach is going to assign an "artificial cost" to each operation in the sequence, called the amortized cost of an operation. It requires that the total real cost of the sequence should be bounded by the total of the amortized costs of all the operations.
Note, there is sometimes flexibility in the assignment of amortized costs.
Three methods are used in amortized analysis:
Aggregate Method (or brute force)
Accounting Method (or the banker's method)
Potential Method (or the physicist's method)
For instance assume we’re sorting an array in which all the keys are distinct (since this is the slowest case, and takes the same amount of time as when they are not, if we don’t do anything special with keys that equal the pivot).
Quicksort chooses a random pivot. The pivot is equally likely to be the smallest key, the second smallest, the third smallest, ..., or the largest. For each key, the probability is 1/n. Let T(n) be a random variable equal to the running time of quicksort on n distinct keys. Suppose quicksort picks the i-th smallest key as the pivot. Then we run quicksort recursively on a list of length i − 1 and on a list of length n − i. It takes O(n) time to partition and concatenate the lists (let's say at most n dollars), so the running time is
T(n) ≤ T(i − 1) + T(n − i) + n.
Here i is a random variable that can be any number from 1 (pivot is the smallest key) to n (pivot is the largest key), each chosen with probability 1/n, so
E[T(n)] ≤ (1/n) · Σ_{i=1..n} ( E[T(i − 1)] + E[T(n − i)] ) + n.
This equation is called a recurrence. The base cases for the recurrence are T(0) = 1 and T(1) = 1. This means that sorting a list of length zero or one takes at most one dollar (unit of time).
So when you solve the recurrence (for instance by guessing a bound and verifying it by induction), you get
E[T(j)] ≤ 1 + 8j log_2 j.
The expression 1 + 8j log_2 j might be an overestimate, but it doesn't matter. The important point is that this proves that E[T(n)] is in O(n log n).
In other words, the expected running time of quicksort is in O(n log n).
Also there’s a subtle but important difference between amortized running time
and expected running time. Quicksort with random pivots takes O(n log n) expected running time, but its worst-case running time is in Θ(n^2). This means that there is a small
possibility that quicksort will cost (n^2) dollars, but the probability that this
will happen approaches zero as n grows large.
Quicksort: O(n log n) expected time.
Quickselect: Θ(n) expected time.
The comparison-based sorting lower bound is Ω(n log n): any comparison sort must perform at least log_2(n!) ≈ n log_2 n − 1.44n comparisons in the worst case.
Finally, you can find more information about quicksort average-case analysis here.
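For reference, a minimal randomized quicksort matching the analysis above (an illustrative sketch, not the code from the cited material): the uniformly random pivot gives O(n log n) expected time, with a Θ(n^2) worst case that becomes very unlikely for large n.

import java.util.Random;

public class RandomizedQuicksort {
    private static final Random RNG = new Random();

    static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;                           // lists of length 0 or 1 are sorted
        int pivot = a[lo + RNG.nextInt(hi - lo + 1)];   // pivot chosen uniformly at random
        int i = lo, j = hi;
        while (i <= j) {                                // O(n) partition step
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++; j--;
            }
        }
        sort(a, lo, j);                                 // recurse on the two sides
        sort(a, i, hi);
    }

    public static void main(String[] args) {
        int[] a = {9, 3, 7, 1, 8, 2};
        sort(a, 0, a.length - 1);
        System.out.println(java.util.Arrays.toString(a)); // [1, 2, 3, 7, 8, 9]
    }
}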
Average: a probabilistic analysis; the average is taken over all of the possible inputs, and it is an estimate of the likely run time of the algorithm.
Amortized: a non-probabilistic analysis, calculated in relation to a batch of calls to the algorithm.
Example: a dynamically sized stack.
Say we define a stack of some size, and whenever we use up the space, we allocate twice the old size and copy the elements into the new location.
Overall our costs are:
O(1) per insertion/deletion
O(n) per insertion (allocation and copying) when the stack is full
So now we ask: how much time would n insertions take?
One might say O(n^2); however, we don't pay O(n) for every insertion, so that would be overly pessimistic. The correct answer is O(n) time for n insertions. Let's see why.
Let's say we start with array size = 1. Ignoring copying, we would pay O(n) for n insertions.
Now, we do a full copy only when the stack has this many elements:
1, 2, 4, 8, ..., n/2, n
For each of these sizes we do a copy and an allocation, so summing the cost we get:
const*(1+2+4+8+...+n/4+n/2+n) = const*(n+n/2+n/4+...+8+4+2+1) <= const*n(1+1/2+1/4+1/8+...)
where (1+1/2+1/4+1/8+...) = 2
So we pay O(n) for all of the copying plus O(n) for the actual n insertions:
O(n) worst case for n operations -> O(1) amortized per operation.
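Here is a compact Java sketch of the doubling stack described above (illustrative only, not a production container): n pushes trigger fewer than n element copies in total, so each push is amortized O(1).

import java.util.Arrays;

public class DoublingStack {
    private int[] data = new int[1];
    private int size = 0;
    long copies = 0;                         // counts element copies caused by resizing

    void push(int x) {
        if (size == data.length) {           // full: allocate twice the old size and copy
            copies += size;
            data = Arrays.copyOf(data, 2 * data.length);
        }
        data[size++] = x;                    // the ordinary O(1) write
    }

    public static void main(String[] args) {
        DoublingStack s = new DoublingStack();
        int n = 1 << 20;
        for (int i = 0; i < n; i++) {
            s.push(i);
        }
        // copies = 1 + 2 + 4 + ... + n/2 = n - 1 < n, so n pushes cost O(n) in total
        System.out.println(s.copies + " copies for " + n + " pushes");
    }
}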