Worst-case time complexity - complexity-theory

I have an algorithm that searches a directory and collects all the text files in that directory and any sub-directory. Assuming I do not know how many sub-directories and sub-sub-directories there are under the parent directory, how do I calculate the complexity?
This is the code I am using:
public List<string> GetFilesInDirectory(string directoryPath)
{
    // Store results in the file results list.
    List<string> files = new List<string>();

    // Store a stack of our directories.
    Stack<string> stack = new Stack<string>();

    // Add initial directory.
    stack.Push(Server.MapPath(directoryPath));

    // Continue while there are directories to process.
    while (stack.Count > 0)
    {
        // Get top directory.
        string dir = stack.Pop();
        try
        {
            // Add all files at this directory to the result List.
            files.AddRange(Directory.GetFiles(dir, "*.txt"));

            // Add all directories at this directory.
            foreach (string dn in Directory.GetDirectories(dir))
            {
                stack.Push(dn);
            }
        }
        catch (Exception ex)
        {
            // Inaccessible directories are silently skipped.
        }
    }
    return files;
}
thanks

Big-O notation says something about how the time complexity grows as the input size grows; in other words, how the running time grows as the number of elements increases. Whether there is 1 or 8972348932 files/directories does not matter. Your code runs in O(N) linear time, assuming directories and files are only visited once. O(123N) is still written as O(N). What does this mean? It means that Big-O notation says nothing about the actual cost of the individual operations, only how the total cost grows.
Compare two algorithms for the same problem, one running in O(N) time and one in O(N log N) time. The O(N log N) algorithm might be faster for small N than the O(N) one, but given a large enough N, the O(N) algorithm will catch up.
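The catch-up effect can be made concrete with invented cost models. In this Java sketch, the constants (20 steps per element for the "slow" linear algorithm, 1 for the N log N one) are purely illustrative, not measurements of any real algorithm:

```java
public class Crossover {
    // Hypothetical cost models (constants invented for illustration):
    // a slow-constant O(N) algorithm costing 20*N steps, versus an
    // O(N log N) algorithm costing N*log2(N) steps.
    static double linearCost(long n) { return 20.0 * n; }
    static double nLogNCost(long n)  { return n * (Math.log(n) / Math.log(2)); }

    public static void main(String[] args) {
        // Small N: log2(1000) is about 10, less than 20, so O(N log N) is cheaper.
        System.out.println(nLogNCost(1000) < linearCost(1000));
        // Large N: log2(2^30) = 30, more than 20, so the O(N) algorithm has caught up.
        System.out.println(nLogNCost(1L << 30) > linearCost(1L << 30));
    }
}
```

The crossover point depends entirely on the constants, which is exactly why Big-O ignores them: past some N, the lower-order growth always wins.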

I'd say it's O(N) in the total number of files across all directories.
Navigating those directories is not a complex task; it's just bookkeeping.

Your algorithm pushes every directory onto the stack once and does a constant amount of work for each directory it pops, so the cost is on the order of twice the number of directories, i.e. O(2n) where n is the number of directories. As far as complexity is concerned, this is equivalent to O(n).
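The "visited once" argument can be sketched without touching a filesystem. In this Java sketch, the in-memory tree and the countPops helper are made up for illustration; each directory is pushed exactly once and popped exactly once, so total work is proportional to the number of directories:

```java
import java.util.*;

public class TraversalSketch {
    // children.get(dir) lists the subdirectories of dir (made-up sample data
    // standing in for Directory.GetDirectories).
    static int countPops(Map<String, List<String>> children, String root) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push(root);
        int pops = 0;
        while (!stack.isEmpty()) {
            String dir = stack.pop();
            pops++; // one unit of "work" per directory, done exactly once each
            for (String sub : children.getOrDefault(dir, List.of())) {
                stack.push(sub); // each directory enters the stack exactly once
            }
        }
        return pops;
    }

    public static void main(String[] args) {
        Map<String, List<String>> tree = Map.of(
            "/", List.of("/a", "/b"),
            "/a", List.of("/a/x"));
        System.out.println(countPops(tree, "/")); // 4 directories, 4 pops
    }
}
```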

Time complexity is calculated in terms of n, where n is the number of items being processed. You don't need exact numbers and, more to the point, you can't use exact numbers in big-O complexity, since you're describing worst-case growth rather than a specific running time.

The worst-case running time is a function of the maximum depth of the directory tree (on Windows it is limited by the maximum path length) and the number of files allowed in a subdirectory.

I would say it is O(N^2) because you have a doubly-nested loop, but the sizes of the two loops are not the same, so we have to modify that a bit.
The number of directories is likely smaller than the number of files. So if the number of files is N and the number of directories is M, then you would get O(N*M). That's my best guess.

Related

How to improve my algorithm time complexity?

We have "n" files and for each file, "m" lines, we want to make some operations, to proceed all files and lines, it is evident to apply the algorithm below:
int n; // n is the number of files
int m; // m is the number of lines in file i
for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        .....
    }
}
Thus, we have O(n*m) complexity.
My question is: is there a possibility to make it O(n log(n)), or some other method to improve the time complexity of the algorithm, by:
1. keeping all files and lines, or
2. ignoring some of them?
Best regards
If you want to play with asymptotic complexities, then be rigorous.
The initial complexity is Θ(F L) (most probably, but you do not specify the processing) where F denotes the number of files and L the average number of lines per file. (Though, as the line lengths may vary, it would be safer to speak in terms of average number of characters.)
If you process only as many files as there are bits in F (about log(F) of them, like you did), the complexity indeed drops to Θ(log(F) L). But if you process every other file, or even one tenth of them, the complexity remains Θ(F L).
There is no magical recipe to reduce the complexity of a problem. Sometimes you can get improvements because the initial algorithm is not efficient, sometimes you cannot. In the case at hand, you probably cannot (though this depends on the particular processing).
What you are doing by subsampling the files is not a complexity improvement: it is a cheat to reduce the size of the problem, and you are no more solving the initial problem.
You can achieve m log(n) by skipping n / log n files, or n log(m) by skipping m / log m lines per file.
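The file-skipping idea can be sketched numerically. Here processedCount is a made-up Java helper that steps through n files with stride n/log2(n), touching only about log2(n) of them; as noted above, this shrinks the problem rather than solving the original one:

```java
public class SubsampleSketch {
    // Counts how many of n files get processed if we step through them
    // with stride n / log2(n): roughly log2(n) of them.
    static int processedCount(int n) {
        int step = Math.max(1, (int) (n / (Math.log(n) / Math.log(2))));
        int processed = 0;
        for (int i = 0; i < n; i += step) {
            processed++; // stand-in for "process file i"
        }
        return processed;
    }

    public static void main(String[] args) {
        // For n = 1024, log2(n) = 10, stride 102: only 11 files are touched.
        System.out.println(processedCount(1024));
    }
}
```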

Heapsort and building heaps using linked list

I know that a linked list is not an appropriate data structure for building heaps.
One of the answers here (https://stackoverflow.com/a/14584517/5841727) says that heap sort can be done in O(nlogn) using a linked list, the same as with arrays.
I think the heapify operation would cost O(n) time on a linked list, and we would need (n/2) heapify operations, leading to a time complexity of O(n^2).
Can someone please tell me how to achieve O(nlogn) complexity (for heap sort) using a linked list?
The Stack Overflow URL you mentioned is merely someone's claim (at least as I write this), so my answer is based on that assumption. Mostly when people say "time complexity", they mean asymptotic analysis: finding the proportion by which the time taken by an algorithm grows with increasing input size, ignoring all constants.
To reason about the time complexity with a linked list, let's assume there is a function which returns the node at a given index (I know linked lists don't offer access by index). For efficiency you'd also need to pass in the level, but we can ignore that for now since it has no impact on the time complexity.
So it comes down to analysing how this function's cost grows with input size. To fix up (heapify) one node you may have to traverse at most 3 times (1. one traversal to compare the two possible children and find out which one to swap with, 2. going back to the parent for the swap, 3. coming back down to the node you just swapped). Even though it may seem you are doing up to n/2 traversal steps 3 times over, for asymptotic analysis this just equals n. You then do this log n times, exactly the same way you would for an array. Therefore the time complexity is O(n log n). See the time-complexity table for heaps on Wikipedia: https://en.wikipedia.org/wiki/Binary_heap#Summary_of_running_times
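For comparison, here is the standard array-backed heapsort that the linked-list argument mirrors: build-heap followed by n extractions, each doing at most about log2(n) sift-down swaps. This is a generic Java sketch, not code from the linked answer; the linked-list version replaces each O(1) index access with a bounded traversal, leaving the O(n log n) total intact:

```java
import java.util.Arrays;

public class HeapSort {
    // Restore the max-heap property below index i (children at 2i+1, 2i+2):
    // at most log2(n) swaps, one per level.
    static void siftDown(int[] a, int i, int n) {
        while (2 * i + 1 < n) {
            int child = 2 * i + 1;
            if (child + 1 < n && a[child + 1] > a[child]) child++;
            if (a[i] >= a[child]) break;
            int t = a[i]; a[i] = a[child]; a[child] = t;
            i = child;
        }
    }

    static void heapSort(int[] a) {
        // Build-heap: n/2 sift-downs, O(n) total.
        for (int i = a.length / 2 - 1; i >= 0; i--) siftDown(a, i, a.length);
        // n extractions, each O(log n): move the max to the end, re-sift.
        for (int end = a.length - 1; end > 0; end--) {
            int t = a[0]; a[0] = a[end]; a[end] = t;
            siftDown(a, 0, end);
        }
    }

    public static void main(String[] args) {
        int[] a = {5, 1, 4, 2, 3};
        heapSort(a);
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 4, 5]
    }
}
```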

Confusion related to the time complexity of an algorithm

I was going through this algorithm https://codereview.stackexchange.com/questions/63921/print-all-nodes-from-root-to-leaves
In one of the comments it is mentioned that printing the paths from the root to the leaves itself has an average time complexity of O(nlogn). I am not quite sure how they came up with that. Any clarification will be much appreciated.
I think this is what they mean:
In the best case, the tree is perfectly balanced and contains N nodes, where log(N)+1 is the number of levels. The tree has N/2 leaves.
Every time we move to a lower level, we duplicate the currently accumulated path. If you count copying an array of length k as an O(k) operation, then moving from the second-to-last level to a leaf is an O(log(N)) operation. As there are N/2 leaves, and for each we do an O(log(N)) operation, you get O(N*log(N)).
Instead of duplicating arrays, the function could recursively pass the same array together with the current level number, making sure that the path is printed only up to the level of the leaf.
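The shared-array variant described in the last paragraph can be sketched as follows (Node, collectPaths, and the sample tree are illustrative, not the code from the linked review):

```java
import java.util.*;

public class PathPrinter {
    static class Node {
        int val; Node left, right;
        Node(int v, Node l, Node r) { val = v; left = l; right = r; }
    }

    // One shared path buffer plus the current depth: no per-level copying.
    static void collectPaths(Node node, int[] path, int depth, List<String> out) {
        if (node == null) return;
        path[depth] = node.val; // overwrite in place
        if (node.left == null && node.right == null) {
            // Emit the path only up to the leaf's level.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i <= depth; i++) {
                if (i > 0) sb.append("->");
                sb.append(path[i]);
            }
            out.add(sb.toString());
        } else {
            collectPaths(node.left, path, depth + 1, out);
            collectPaths(node.right, path, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        Node root = new Node(1, new Node(2, null, null), new Node(3, null, null));
        List<String> paths = new ArrayList<>();
        collectPaths(root, new int[16], 0, paths);
        System.out.println(paths); // [1->2, 1->3]
    }
}
```

Note that actually printing each path still costs its length, so the output itself remains Θ(N log N) characters in a balanced tree; what this avoids is the extra copying at every internal level.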

How to find the time complexity? [duplicate]

Am I correct in my explanation when calculating the time complexity of the following algorithm?
A HashSet, moduleMarksheetFiles, is used to collect the files that contain the specified moduleName.
for (File file : marksheetFiles) {
    while (csvReader.readRecord()) {
        String moduleName = csvReader.get(ModuleName);
        if (moduleName.equals(module)) {
            moduleMarksheetFiles.add(file);
        }
    }
}
Let m be the number of files.
Let k be the average number of records per file.
Each file is added only once, because a HashSet does not allow duplicates. HashSet.add() is O(1) on average and O(n) in the worst case.
Searching for a record with the specified moduleName involves comparing every record in the file to moduleName, which takes O(n) steps.
Therefore, the average time complexity would be O((m*k)^2).
Is this correct?
Also, how would you calculate the worst case?
Thanks.
PS. It is not homework, just analysing my system's algorithm to evaluate performance.
No, it's not squared; this is O(mk). (Technically, that means it's also O((mk)²), but we don't care.)
Your misconception is that the worst-case performance of HashSet is what counts here. However, even though a hashtable may have worst-case O(n) insertion time (when it needs to rehash every element), its amortized insertion time is O(1) (assuming your hash function is well behaved; File.GetHashCode presumably is). In other words, if you insert many things, so many of them will be O(1) that the occasional O(n) insertion does not matter.
Therefore, we can treat insertions as constant-time operations, so performance is purely dictated by the number of iterations of the inner loop body, which is O(mk).
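The counting argument can be made concrete with a toy Java sketch: with m outer iterations and k inner ones, the loop body runs exactly m*k times, and each HashSet.add is amortized O(1). The iterations helper and the stand-in strings are invented for illustration, replacing the CSV reading in the original:

```java
import java.util.*;

public class LoopCount {
    static int iterations(int m, int k) {
        Set<String> matched = new HashSet<>();
        int count = 0;
        for (int file = 0; file < m; file++) {        // m "files"
            for (int record = 0; record < k; record++) { // k "records" each
                count++;                    // one inner-loop body execution
                matched.add("file" + file); // amortized O(1); duplicates ignored
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(iterations(3, 4)); // 12, i.e. m*k iterations
    }
}
```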

What does "O(1) access time" mean? [duplicate]

This question already has answers here:
What is a plain English explanation of "Big O" notation?
I have seen this term "O(1) access time" used to mean "quickly" but I don't understand what it means. The other term that I see with it in the same context is "O(n) access time". Could someone please explain in a simple way what these terms mean?
See Also
What is Big O notation? Do you use it?
Big-O for Eight Year Olds?
You're going to want to read up on Order of complexity.
http://en.wikipedia.org/wiki/Big_O_notation
In short, O(1) means that it takes a constant time, like 14 nanoseconds, or three minutes no matter the amount of data in the set.
O(n) means it takes an amount of time linear with the size of the set, so a set twice the size will take twice the time. You probably don't want to put a million objects into one of these.
In essence, it means that it takes the same amount of time to look up a value in your collection whether you have a small number of items or very many (within the constraints of your hardware).
O(n) would mean that the time it takes to look up an item is proportional to the number of items in the collection.
Typical examples of these are arrays, which can be accessed directly regardless of their size, and linked lists, which must be traversed in order from the beginning to reach a given item.
The other operation usually discussed is insertion. A collection can be O(1) for access but O(n) for insertion. In fact, an array has exactly this behaviour, because to insert an item in the middle you have to move each later item to the right by copying it into the following slot.
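That shifting behaviour can be sketched directly. In this Java sketch, insertAt is a hypothetical helper; it copies into a new array to keep the example self-contained, but the point is the same: every element after the insertion point must be moved, so the cost is O(n):

```java
import java.util.Arrays;

public class ArrayInsert {
    // Insert value at index, shifting everything after it one slot right.
    static int[] insertAt(int[] a, int index, int value) {
        int[] b = new int[a.length + 1];
        System.arraycopy(a, 0, b, 0, index);                        // prefix unchanged
        b[index] = value;
        System.arraycopy(a, index, b, index + 1, a.length - index); // O(n) shift
        return b;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(insertAt(new int[]{1, 2, 4, 5}, 2, 3)));
        // [1, 2, 3, 4, 5]
    }
}
```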
O(1) means the time to access something is independent of the number of items in the collection.
O(N) would mean the time to access an item is a proportional to the number (N) of items in the collection.
Every answer currently responding to this question tells you that O(1) means constant time (whatever it happens to be measuring; it could be runtime, number of operations, etc.). This is not accurate.
To say that the runtime is O(1) means that there is a constant c such that the runtime is bounded above by c, independent of the input. For example, returning the first element of an array of n integers is O(1):
int firstElement(int *a, int n) {
    return a[0];
}
But this function is O(1) too:
int identity(int i) {
    if (i == 0) {
        sleep(60 * 60 * 24 * 365);
    }
    return i;
}
The runtime here is bounded above by 1 year, but most of the time the runtime is on the order of nanoseconds.
To say that runtime is O(n) means that there is a constant c such that the runtime is bounded above by c * n, where n measures the size of the input. For example, finding the number of occurrences of a particular integer in an unsorted array of n integers by the following algorithm is O(n):
int count(int *a, int n, int item) {
    int c = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] == item) c++;
    }
    return c;
}
This is because we have to iterate through the array inspecting each element one at a time.
O(1) does not necessarily mean "quickly". It means that the time it takes is constant, and not based on the size of the input to the function. Constant could be fast or slow. O(n) means that the time the function takes will change in direct proportion to the size of the input to the function, denoted by n. Again, it could be fast or slow, but it will get slower as the size of n increases.
Here is a simple analogy;
Imagine you are downloading movies online. With O(1), if it takes 5 minutes to download one movie, it will still take the same time to download 20 movies. So it doesn't matter how many movies you are downloading: they take the same time (5 minutes) whether it's one movie or 20. An everyday version of this analogy is going to a movie library: whether you are taking one movie or five, you simply pick them up at once, spending the same time.
With O(n), however, if it takes 5 minutes to download one movie, it will take about 50 minutes to download 10 movies. Time is not constant; it is proportional to the number of movies you are downloading.
It's called Big O notation, and it describes how the running time of various algorithms grows.
O(1) means that the worst-case run time is constant. For most situations it means that you don't actually need to search the collection; you can find what you are searching for right away.
"Big O notation" is a way to express the speed of algorithms. n is the amount of data the algorithm is working with. O(1) means that, no matter how much data, it will execute in constant time. O(n) means that it is proportional to the amount of data.
Basically, O(1) means its computation time is constant, while O(n) means it depends linearly on the size of the input. For example, looping through an array is O(n), because the cost depends on the number of items, while calculating the maximum of two ordinary numbers is O(1).
Wikipedia might help as well: http://en.wikipedia.org/wiki/Computational_complexity_theory
An O(1) operation always executes in the same time regardless of the dataset size n.
An example of O(1) is an ArrayList accessing an element by index.
O(n), also known as linear order, means the performance grows linearly, in direct proportion to the size of the input data.
An example of O(n) is ArrayList insertion or deletion at a random position. Each such insertion or deletion forces the elements of the ArrayList's internal array to shift left or right to keep the structure contiguous, and may require allocating a new array and copying the elements from the old one to the new, which takes expensive processing time and hurts performance.
The easiest way to differentiate O(1) and O(n) is comparing array and list.
For array, if you have the right index value, you can access the data instantly.
(If you don't know the index and have to loop through the array, then it won't be O(1) anymore)
For list, you always need to loop through it whether you know the index or not.
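A minimal Java sketch of that contrast (arrayGet and listGet are illustrative names; the manual iterator walk makes the O(n) traversal explicit):

```java
import java.util.*;

public class AccessDemo {
    // Array indexing: a single O(1) step, regardless of size.
    static int arrayGet(int[] a, int i) { return a[i]; }

    // "Indexing" a linked list: walk i links from the head, O(n).
    static int listGet(LinkedList<Integer> list, int i) {
        Iterator<Integer> it = list.iterator();
        int value = it.next();
        for (int step = 0; step < i; step++) value = it.next();
        return value;
    }

    public static void main(String[] args) {
        int[] a = {10, 20, 30};
        LinkedList<Integer> l = new LinkedList<>(List.of(10, 20, 30));
        System.out.println(arrayGet(a, 2)); // 30, one step
        System.out.println(listGet(l, 2));  // 30, reached by walking 2 links
    }
}
```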
It means that access time is constant. Whether you're accessing from 100 or 100,000 records, the retrieval time will be the same.
In contrast, O(n) access time would indicate that the retrieval time is directly proportional to the number of records you're accessing from.
It means that the access takes constant time i.e. does not depend on the size of the dataset. O(n) means that the access will depend on the size of the dataset linearly.
The O is also known as big-O.
Introduction to Algorithms: Second Edition by Cormen, Leiserson, Rivest & Stein says on page 44 that:
"Since any constant is a degree-0 polynomial, we can express any constant function as Theta(n^0), or Theta(1). This latter notation is a minor abuse, however, because it is not clear what variable is tending to infinity. We shall often use the notation Theta(1) to mean either a constant or a constant function with respect to some variable. ... We denote by O(g(n)) ... the set of functions f(n) such that there exist positive constants c and n0 such that 0 <= f(n) <= c*g(n) for all n >= n0. ... Note that f(n) = Theta(g(n)) implies f(n) = O(g(n)), since Theta notation is stronger than O notation."
If an algorithm runs in O(1) time, its cost asymptotically doesn't depend on any variable: there exists a positive constant that bounds the runtime of the function from above for all inputs beyond a certain size. Technically, it's O(n^0) time.
O(1) corresponds to random access. In random-access memory, the time taken to access an element at any location is the same. That time can be any value, but the only thing to remember is that the time taken to retrieve the element at the (n-1)th or nth location is identical (i.e. constant).
O(n), by contrast, depends on the size n.
From my perspective, O(1) means that executing one operation or instruction takes one unit of time, i.e. the best case in the time-complexity analysis of an algorithm.
