Big O Notation - Including Data Structure Costs? - algorithm

For the purpose of my question, I'll include a sample problem.
Say we need to iterate through a vector of N Elements and remove duplicates. So, we'd probably use a set right? (Let's use a C++ Set that's a tree)
O(N) cost to iterate through each element - then insert into the Set Data Structure.
My question: each insert into the Set has a log N cost, and we insert N times, so is this algorithm O(N log N) or simply O(N)? I was discussing this with a professor, and I'm not sure. The Leetcode/SO/online community seems to disregard data structure costs, but from an academic point of view, N inserts into a red/black tree with a log N worst case each is log N, N times, no?
For clarification: yes, it'd make more sense to use unordered_set, but that doesn't make my question invalid.

Complexities express the count of some reference operation.
For example, you can very well count the inserts in some black-box structure and enumerate O(N) inserts.
But if you focus on, say, comparisons and you know that an insert costs Log N comparisons on average, the total number of comparisons is O(N Log N).
Now if you are comparing strings of Log N characters, you will count O(N Log²N) character comparisons...
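To make the "reference operation" idea concrete, here is a minimal Python sketch that counts two different things for the same deduplication task: inserts and comparisons. Python has no tree-based set in its standard library, so a sorted list searched with bisect stands in for the tree. The names Counted and dedupe_with_sorted_list are made up for this illustration.

```python
import bisect


class Counted:
    """Wraps a value and counts every `<` comparison made on wrapped values."""
    comparisons = 0  # shared counter, reset by dedupe_with_sorted_list

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value


def dedupe_with_sorted_list(values):
    """Deduplicate values; return (unique_values, inserts, comparisons)."""
    Counted.comparisons = 0
    unique = []  # kept sorted; a stand-in for a tree-based set
    inserts = 0
    for v in values:
        item = Counted(v)
        # Binary search: O(log N) counted comparisons per element.
        pos = bisect.bisect_left(unique, item)
        # The equality check below compares plain values and is not counted.
        if pos == len(unique) or unique[pos].value != v:
            # Unlike a tree, list.insert also shifts elements (O(N) moves).
            unique.insert(pos, item)
            inserts += 1
    return [u.value for u in unique], inserts, Counted.comparisons
```

Counting inserts gives at most N, while counting comparisons gives on the order of N log N. Both statements describe the same run; they just measure different operations.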

Yes, it is O(n * log(n)). If you have a method like
public void foo(int n) {
    for (int i = 0; i < n; i++) {
        // Call a method that is in O(log n)
        someLogNMethod();
    }
}
then the method foo runs in O(n * log n) time.
Example
There are many non-contrived examples, such as computing the median value of an array of integers. Take a look at the following solution, which solves the problem by sorting the array first. Sorting is in Theta(n log n) (see comparison-based sorting).
public int median(int[] values) {
    int[] sortedValues = sort(values);
    // Let's ignore special cases (even, empty, ...) for simplicity
    int indexOfMedian = values.length / 2;
    return sortedValues[indexOfMedian];
}
Obviously you wouldn't say this median method is in Theta(1), even though everything it does apart from the call to sort runs in constant time.
The method depends on the sort method, and you can't find the median of a general array in O(1). You need to include the sort in your analysis. The method thus actually runs in Theta(n log n + 1), which is Theta(n log n).
Note that the problem can actually be solved in Theta(n) (see Find median of unsorted array in O(n) time).
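As a rough illustration of the linear-time remark, here is a quickselect sketch in Python. Note the assumption: quickselect with a random pivot runs in expected O(n) time, not worst-case O(n) (a worst-case linear bound needs something like median-of-medians pivot selection). The function names are chosen for this example, and like the method above it ignores empty input and uses index length / 2.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based) in expected O(n) time."""
    items = list(values)  # work on a copy so the input is not modified
    while True:
        pivot = random.choice(items)
        lows = [x for x in items if x < pivot]
        equal_count = sum(1 for x in items if x == pivot)
        if k < len(lows):
            items = lows  # answer lies among the smaller elements
        elif k < len(lows) + equal_count:
            return pivot  # answer is the pivot itself
        else:
            # Skip everything <= pivot and search the larger elements.
            k -= len(lows) + equal_count
            items = [x for x in items if x > pivot]


def median(values):
    """Same convention as the sorting version: the element at index n // 2."""
    return quickselect(values, len(values) // 2)
```

Each round discards part of the list, and with a random pivot the expected total work is proportional to n, since the list sizes shrink geometrically on average.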

Related

Search a Sorted Array for First Occurrence of K

I'm trying to solve question 11.1 in Elements of Programming Interviews (EPI) in Java: Search a Sorted Array for First Occurrence of K.
The problem description from the book:
Write a method that takes a sorted array and a key and returns the index of the first occurrence of that key in the array.
The solution they provide in the book is a modified binary search algorithm that runs in O(logn) time. I wrote my own algorithm also based on a modified binary search algorithm with a slight difference - it uses recursion. The problem is I don't know how to determine the time complexity of my algorithm - my best guess is that it will run in O(logn) time because each time the function is called it reduces the size of the candidate values by half. I've tested my algorithm against the 314 EPI test cases that are provided by the EPI Judge so I know it works, I just don't know the time complexity - here is the code:
public static int searchFirstOfKUtility(List<Integer> A, int k, int Lower, int Upper, Integer Index)
{
    while(Lower<=Upper){
        int M = Lower + (Upper-Lower)/2;
        if(A.get(M)<k)
            Lower = M+1;
        else if(A.get(M) == k){
            Index = M;
            if(Lower!=Upper)
                Index = searchFirstOfKUtility(A, k, Lower, M-1, Index);
            return Index;
        }
        else
            Upper=M-1;
    }
    return Index;
}
Here is the code that the tests cases call to exercise my function:
public static int searchFirstOfK(List<Integer> A, int k) {
    Integer foundKey = -1;
    return searchFirstOfKUtility(A, k, 0, A.size()-1, foundKey);
}
So, can anyone tell me what the time complexity of my algorithm would be?
Assuming that passing arguments is O(1) instead of O(n), performance is O(log(n)).
The usual theoretical approach for analyzing recursion is to apply the Master Theorem. It says that if the running time of a recursive algorithm follows a recurrence of the form:
T(n) = a T(n/b) + f(n)
then there are 3 cases. In plain English they correspond to:
1) Performance is dominated by all the calls at the bottom of the recursion, so it is proportional to how many of those there are.
2) Performance is equal between each level of recursion, so it is proportional to how many levels of recursion there are, times the cost of any one level.
3) Performance is dominated by the work done in the very first call, so it is proportional to f(n).
You are in case 2. Each recursive call costs the same, and so performance is dominated by the fact that there are O(log(n)) levels of recursion times the cost of each level. Assuming that passing a fixed number of arguments is O(1), that will indeed be O(log(n)).
Note that this assumption is true for Java because you don't make a complete copy of the array before passing it. But it is important to be aware that it is not true in all languages. For example I recently did a bunch of work in PL/pgSQL, and there arrays are passed by value. Meaning that your algorithm would have been O(n log(n)).
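One way to check the O(log(n)) claim empirically is to count halving steps. The sketch below is an iterative Python variant of the first-occurrence search, not the asker's recursive code and not the book's solution; the function name and the returned step counter are additions for illustration only.

```python
def search_first_of_k(sorted_values, k):
    """Return (index of first occurrence of k or -1, number of loop steps)."""
    lower, upper = 0, len(sorted_values) - 1
    index = -1
    steps = 0
    while lower <= upper:
        steps += 1
        mid = lower + (upper - lower) // 2
        if sorted_values[mid] < k:
            lower = mid + 1
        elif sorted_values[mid] == k:
            index = mid       # remember this match...
            upper = mid - 1   # ...and keep looking further left
        else:
            upper = mid - 1
    return index, steps
```

Every iteration at least halves the candidate range, so for a list of 1024 elements the loop runs at most 11 times, regardless of how many copies of k the list contains.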

Time Complexity when processing output

I'm struggling to figure out what the time complexity for this code would be.
from typing import List

def under_ten(input_list : List[int]) -> List[int]:
    res = []
    for i in input_list:
        if i < 10:
            res.append(i)
    res.sort()
    return res
Since the loop iterates over every element of n, I think the best case should be O(n). What I'm not sure about is how sorting the result list affects the time complexity of the entire function. Is the worst case O(nlogn) (all numbers in n are under 10, so the result list is the same size as the input list)? And what would be the average case?
EDIT: Changed input name from n to input_list and added type hints, sorry if that caused some confusion.
Your first observation is correct: iterating over the input collection is an O(N) operation, where N is the length of the input list (originally named n). The running time of the sort at the end depends on how large the res list is. In the worst case, every number in the input is less than 10 and therefore ends up in res. Python's built-in sort() uses Timsort, a hybrid of merge sort and insertion sort (q.v. this SO question), which runs in O(N*lgN) in the worst case. So, in the worst case, your under_ten() function runs in O(N*lgN).
Let N be the length of the list and K the number of elements smaller than 10.
The complexity is O(N + K log K), assuming that append is done in amortized constant time.
In the worst case, K=N, hence O(N Log N), provided the sort truly has a worst case O(N Log N). Otherwise, it could be O(N²).
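The bound can be written out as a short derivation. The constants c_1 and c_2 are hypothetical placeholders for the per-element cost of the loop and of the sort, and the sort is assumed to be O(K log K) in the worst case:

```latex
\begin{align*}
T(N, K) &\le \underbrace{c_1 N}_{\text{loop and appends}}
          + \underbrace{c_2 K \log K}_{\text{sort}},
          \qquad 0 \le K \le N \\
K = N &\;\Rightarrow\; T(N, K) = O(N \log N) \\
K \le \frac{N}{\log N} &\;\Rightarrow\;
  K \log K \le \frac{N}{\log N} \cdot \log N = N
  \;\Rightarrow\; T(N, K) = O(N)
\end{align*}
```

So the sort only dominates when a large enough fraction of the input ends up in the result list.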

Efficient algorithm to determine if two sets of numbers are disjoint

Practicing for software developer interviews and got stuck on an algorithm question.
Given two sets of unsorted integers, one stored in an array of length m and the other in an array of length n, where m < n, find an efficient algorithm to determine if the sets are disjoint. I've found solutions in O(nm) time, but haven't found any that are more efficient than this, such as in O(n log m) time.
Using a data structure that has O(1) lookup/insertion, you can insert all elements of the first set.
Then, for each element in the second set, check whether it is in the structure: if any element is found, the sets are not disjoint; otherwise they are disjoint.
Pseudocode
function isDisjoint(list1, list2)
    HashMap = new HashMap();
    foreach( x in list1)
        HashMap.put(x, true);
    foreach(y in list2)
        if(HashMap.hasKey(y))
            return false;
    return true;
This will give you an O(n + m) solution
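A runnable Python version of the pseudocode might look like the following sketch. It uses the built-in set type as the hash structure, so the O(n + m) bound holds on average rather than in the worst case; the function name is_disjoint is chosen for this example.

```python
def is_disjoint(list1, list2):
    """Return True if the two lists share no element (average O(n + m))."""
    # Hash the first list: one average-O(1) insertion per element.
    seen = set(list1)
    # Probe with the second list: one average-O(1) lookup per element.
    for y in list2:
        if y in seen:
            return False
    return True
```

Python's built-in set type also offers an isdisjoint method that performs the same check.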
Fairly obvious approach - sort the array of length m - O(m log m).
For every element in the array of length n, use binary search to check if it exists in the array of length m - O(log m) per element = O(n log m). Since m<n, this adds up to O(n log m).
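The sort-and-search approach can be sketched in Python with the standard bisect module. This version sorts a copy of the shorter list, so the inputs are left untouched at the cost of O(m) extra space; the function name is an assumption for this example.

```python
import bisect


def is_disjoint_sorted(a, b):
    """Return True if a and b share no element, in O((m + n) log m) time."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    small_sorted = sorted(small)  # O(m log m), works on a copy
    for y in large:
        # Binary search for y in the sorted shorter list: O(log m).
        pos = bisect.bisect_left(small_sorted, y)
        if pos < len(small_sorted) and small_sorted[pos] == y:
            return False
    return True
```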
Here's a link to a post that I think answers your question.
3) Sort smaller O((m + n)logm)
Say, m < n, sort A
Binary search for each element of B into A
Disadvantage: Modifies the input
Looks like Cheruvian beat me to it, but you can use a hash table to get O(n+m) in average case:
*Insert all elements of m into the table, taking (probably) constant time for each, assuming there aren't a lot with the same hash. This step is O(m)
*For each element of n, check to see if it is in the table. If it is, return false. Otherwise, move on to the next. This takes O(n).
*If none are in the table, return true.
As I said before, this works because a hash table gives constant lookup time in average case. In the rare event that many unique elements in m have the same hash, it will take slightly longer. However, most people don't need to care about hypothetical worst cases. For example, quick sort is used more than merge sort because it gives better average performance, despite the O(n^2) upper bound.

What is the n in big-O notation?

The question is rather simple, but I just can't find a good enough answer. On the most upvoted SO question regarding the big-O notation, it says that:
For example, sorting algorithms are typically compared based on comparison operations (comparing two nodes to determine their relative ordering).
Now let's consider the simple bubble sort algorithm:
for (int i = arr.length - 1; i > 0; i--) {
    for (int j = 0; j < i; j++) {
        if (arr[j] > arr[j+1]) {
            switchPlaces(...)
        }
    }
}
I know that the worst case is O(n²) and the best case is O(n), but what is n exactly? If we attempt to sort an already sorted array (best case), we would end up doing nothing, so why is it still O(n)? We are still looping through 2 for-loops, so if anything it should be O(n²). n can't be the number of comparison operations, because we still compare all the elements, right?
When analyzing the Big-O performance of sorting algorithms, n typically represents the number of elements that you're sorting.
So, for example, if you're sorting n items with Bubble Sort, the runtime performance in the worst case will be on the order of O(n²) operations. This is why Bubble Sort is considered to be an extremely poor sorting algorithm, because it doesn't scale well with increasing numbers of elements to sort. As the number of elements to sort increases linearly, the worst case runtime increases quadratically.
Here is an example graph demonstrating how various algorithms scale in terms of worst-case runtime as the problem size N increases. The dark-blue line represents an algorithm that scales linearly, while the magenta/purple line represents a quadratic algorithm.
Notice that for sufficiently large N, the quadratic algorithm eventually takes longer than the linear algorithm to solve the problem.
Graph taken from http://science.slc.edu/~jmarshall/courses/2002/spring/cs50/BigO/.
See Also
The formal definition of Big-O.
I think two things are getting confused here, n and the function of n that is being bounded by the Big-O analysis.
By convention, for any algorithm complexity analysis, n is the size of the input if nothing different is specified. For any given algorithm, there are several interesting functions of the input size for which one might calculate asymptotic bounds such as Big-O.
The commonest such function for a sorting algorithm is the worst case number of comparisons. If someone says a sorting algorithm is O(n^2), without specifying anything else, I would assume they mean the worst case comparison count is O(n^2), where n is the input size.
Another interesting function is the amount of work space, i.e. space in addition to the array being sorted. Bubble sort's work space is O(1), constant space, because it only uses a few variables regardless of the array size.
Bubble sort can be coded to do only n-1 array element comparisons in the best case, by finishing after any pass that does no exchanges. See this pseudo code implementation, which uses swapped to remember whether there were any exchanges. If the array is already sorted the first pass does no exchanges, so the sort finishes after one pass.
n is usually the size of the input. For an array, that would be the number of elements.
To see the different cases, you would need to change the algorithm:
for (int i = arr.length - 1; i > 0 ; i--) {
    boolean swapped = false;
    for (int j = 0; j<i; j++) {
        if (arr[j] > arr[j+1]) {
            switchPlaces(...);
            swapped = true;
        }
    }
    if(!swapped) {
        break;
    }
}
In your original algorithm, the best and worst cases are both O(n^2); with the possibility of exiting early, the best case becomes O(n).
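To see the two cases in numbers, here is a small Python sketch of the early-exit bubble sort that returns how many element comparisons it performed. The comparison counter is an addition for illustration.

```python
def bubble_sort(arr):
    """Sort arr in place and return the number of element comparisons."""
    comparisons = 0
    for i in range(len(arr) - 1, 0, -1):
        swapped = False
        for j in range(i):
            comparisons += 1
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:
            break  # no exchanges in this pass: the array is sorted
    return comparisons


best = bubble_sort(list(range(10)))          # already sorted input
worst = bubble_sort(list(range(10, 0, -1)))  # reversed input
```

For n = 10, the sorted input costs n - 1 = 9 comparisons, while the reversed input costs n(n - 1)/2 = 45.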
n is the array length. You want to find T(n), the complexity of the algorithm.
Accessing memory is much more expensive than checking an if-condition, so you define T(n) to be the number of memory accesses.
In the given algorithm, both the best case (BC) and the worst case (WC) use O(n^2) memory accesses, because you check the if-condition O(n^2) times.
To improve the complexity, keep a flag: if you don't do any swaps during a pass of the main loop, the array is sorted and you can break.
Now, in the BC the array is already sorted and you access every element once, so it is O(n).
And the WC is still O(n^2).

Find pairs with given difference

Given n, k, and a list of n integers, how would you find the pairs of integers whose difference is k?
There is an n*log n solution, but I cannot figure it out.
You can do it like this:
1) Sort the array
2) For each item data[i], determine its two target values, i.e. data[i]+k and data[i]-k
3) Run a binary search on the sorted array for these two targets; if found, add both data[i] and data[targetPos] to the output.
Sorting is done in O(n*log n). Each of the n search steps takes 2 * log n time to look for the targets, for an overall time of O(n*log n).
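The steps above can be sketched in Python with the bisect module. Two assumptions are made for this illustration: k is positive, and only data[i]+k is searched for, so each pair is reported once instead of twice. Duplicates are dropped first to keep the output simple.

```python
import bisect


def pairs_by_binary_search(values, k):
    """Return sorted (x, x + k) pairs found via binary search; assumes k > 0."""
    if k <= 0:
        raise ValueError("this sketch assumes k > 0")
    data = sorted(set(values))  # O(n log n)
    pairs = []
    for x in data:
        # Binary search for the partner x + k: O(log n) per element.
        pos = bisect.bisect_left(data, x + k)
        if pos < len(data) and data[pos] == x + k:
            pairs.append((x, data[pos]))
    return pairs
```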
A linear solution exists for this problem! Just ask yourself one question: if you have a number a, what other number should be in the array? Of course, a+k or a-k (the special case k = 0 requires a different approach). So, what now?
Create a hash set (for example, unordered_set in C++11) containing all values from the array. Each insertion is O(1) on average, so this step is O(n).
Iterate through the array and check, for each element x, whether (x+k) or (x-k) is present in the set. Each check is O(1) on average and each element is checked once, so this step is linear (O(n)).
If you find an x whose partner (x+k or x-k) is present, that is a pair you are looking for.
So overall it's linear (O(n)). If you really want O(n lg n), use a tree-based set, where each existence check takes O(lg n); then you have an O(n lg n) algorithm.
Addition: there is no need to check both x+k and x-k; checking x+k is sufficient, because if a and b form a valid pair, then:
if a < b then
    a + k == b
else
    b + k == a
Improvement: If you know the range of the values, you can guarantee linear complexity by using a boolean table (set_tab[i] == true when i is in the array).
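A minimal Python sketch of the hash-set idea, checking only x + k as suggested. It assumes k > 0 (the k = 0 case needs separate handling, as noted) and sorts the result only to make the output order predictable, which adds O(p log p) for p pairs on top of the average O(n) search.

```python
def pairs_with_difference(values, k):
    """Return sorted (x, x + k) pairs using a hash set; assumes k > 0."""
    if k <= 0:
        raise ValueError("this sketch assumes k > 0")
    present = set(values)  # average O(1) per insertion
    # For each distinct x, one average-O(1) lookup for its partner x + k.
    return sorted((x, x + k) for x in present if x + k in present)
```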
Solution similar to one above:
1) Sort the array
2) Set variables i = 0; j = 1;
3) Check the difference between array[i] and array[j]:
   * if the difference is too small, increase j
   * if the difference is too big, increase i
   * if the difference is the one you're looking for, add it to the results and increase j
4) Repeat step 3 until j reaches the end of the array
Sorting is O(n*lg n), the next step is, if I'm correct, O(n) (at most 2*n comparisons), so the whole algorithm is O(n*lg n)
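The two-pointer scan can be sketched in Python as follows. As assumptions for this illustration, k is positive and duplicates are removed before the scan so that each pair is reported once.

```python
def pairs_two_pointers(values, k):
    """Return (x, x + k) pairs via sort plus a linear scan; assumes k > 0."""
    if k <= 0:
        raise ValueError("this sketch assumes k > 0")
    data = sorted(set(values))  # O(n log n)
    result = []
    i, j = 0, 1
    while j < len(data):
        diff = data[j] - data[i]
        if diff < k:
            j += 1  # difference too small: move the upper pointer
        elif diff > k:
            i += 1  # difference too big: move the lower pointer
        else:
            result.append((data[i], data[j]))
            j += 1
    return result
```

Each iteration advances i or j and neither pointer ever moves backwards, so the scan takes at most about 2n steps after the sort.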
