Time complexity to get average of large data set using subsets - algorithm

Say you're given a large set of numbers (size n) and asked to compute the average of the data. You only have enough space and memory for c numbers at one time. What is the run-time complexity of this task?

To compute an average for the whole dataset, the complexity would be O(n). Consider the following algorithm:
set sum = 0;
for(i = 0; i < n; i++){ // Loop n times
    add value of arr[i] to sum;
}
set average = sum / n;
Since we can disregard the two constant-time operations (the initialization and the final division), the main operation (adding a value to sum) occurs n times.
In this particular example, you only have data for 'c' numbers at one time. Each individual group takes O(c) time, but this will not change your overall complexity, because ultimately you will still make n passes in total.
To provide a concrete example, consider the case n = 100 and c = 40, with your values passed in an array. Your first loop would make 40 passes, the second another 40, and the third only 20. Regardless, you have made 100 passes through the loop in total.
This also assumes that fetching the next set of numbers is a constant-time operation.

It is O(n).
A basic (though not particularly numerically stable) algorithm computes it iteratively as follows:
mean = 0
for n = 0, 1, 2, ..., length(arr)-1
    mean = (mean*n + arr[n])/(n+1)
A variant of this algorithm can be used to parse the data from the array in sets of c numbers, but it is still linear in n.
To spell out the serialization explicitly, you can do this:
mean = 0
for m = 0, c, 2c, ..., arr_length-1
    sub_arr = request_sub_arr_between(m, min(m+c-1, arr_length-1))
    for i = 0, 1, ..., length(sub_arr)-1
        n = m + i
        mean = (mean*n + sub_arr[i])/(n+1)
This is still O(n), as we are only doing a bounded number of things for each n. In fact, the algorithm given at the top of this answer is a variant of this with c=1. If sub_arr is not kept in local memory, but sub_arr[i] is read at each step, then we are only storing 3 numbers at any step.
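For reference, here is a runnable Python sketch of this chunked running mean. It is hypothetical: the read_chunk callback is a stand-in for whatever constant-time mechanism (like request_sub_arr_between above) fetches the next c values.

def chunked_mean(read_chunk, c):
    # read_chunk(start, length) returns up to `length` values, or [] at end of data
    mean, n = 0.0, 0
    while True:
        chunk = read_chunk(n, c)  # assumed constant time, as noted above
        if not chunk:
            return mean
        for x in chunk:  # never more than c values held at once
            mean = (mean * n + x) / (n + 1)
            n += 1

data = list(range(100))
print(chunked_mean(lambda start, length: data[start:start + length], 40))  # 49.5 (up to float rounding)

One constant-time update is performed per element, so the total work is O(n) regardless of c.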

Related

Fixing this faulty Bingo Sort implementation

While studying Selection Sort, I came across a variation known as Bingo Sort. According to this dictionary entry here, Bingo Sort is:
A variant of selection sort that orders items by first finding the least value, then repeatedly moving all items with that value to their final location and find the least value for the next pass.
Based on the definition above, I came up with the following implementation in Python:
def bingo_sort(array, ascending=True):
    from operator import lt, gt

    def comp(x, y, func):
        return func(x, y)

    i = 0
    while i < len(array):
        min_value = array[i]
        j = i + 1
        for k in range(i + 1, len(array), 1):
            if comp(array[k], min_value, (lt if ascending else gt)):
                min_value = array[k]
                array[i], array[k] = array[k], array[i]
            elif array[k] == min_value:
                array[j], array[k] = array[k], array[j]
                j += 1
        i = j
    return array
I know that this implementation is problematic. When I run the algorithm on an extremely small array, I get a correctly sorted array. However, running the algorithm with a larger array results in an array that is mostly sorted with incorrect placements here and there. To replicate the issue in Python, the algorithm can be run on the following input:
from random import randint, uniform

test_data = [[randint(0, 101) for i in range(0, 101)],
             [uniform(0, 101) for i in range(0, 101)],
             ["a", "aa", "aaaaaa", "aa", "aaa"],
             [5, 5.6],
             [3, 2, 4, 1, 5, 6, 7, 8, 9]]

for dataset in test_data:
    print(dataset)
    print(bingo_sort(dataset, ascending=True))
    print("\n")
I cannot for the life of me figure out where the fault is, since I've been looking at this algorithm for too long and I am not really proficient at these things. I could not find an implementation of Bingo Sort online except an undergraduate graduation project written in 2020. Any help that can point me in the right direction would be greatly appreciated.
I think your main problem is that you're trying to set min_value in your first conditional statement and then to swap based on that same min_value you've just set in your second conditional statement. These processes are supposed to be staggered: the way bingo sort should work is you find the min_value in one iteration, and in the next iteration you swap all instances of that min_value to the front while also finding the next min_value for the following iteration. In this way, min_value should only get changed at the end of every iteration, not during it. When you change the value you're swapping to the front over the course of a given iteration, you can end up unintentionally shuffling things a bit.
I have an implementation of this below if you want to refer to something, with a few notes: since you're allowing a custom comparator, I renamed min_value to swap_value as we're not always grabbing the min, and I modified how the comparator is defined/passed into the function to make the algorithm more flexible. Also, you don't really need three indexes (I think there were even a couple bugs here), so I collapsed i and j into swap_idx, and renamed k to cur_idx. Finally, because of how swapping a given swap_val and finding the next_swap_val is to be staggered, you need to find the initial swap_val up front. I'm using a reduce statement for that, but you could just use another loop over the whole array there; they're equivalent. Here's the code:
from operator import lt, gt
from functools import reduce

def bingo_sort(array, comp=lt):
    if len(array) <= 1:
        return array
    # get the initial swap value as determined by comp
    swap_val = reduce(lambda val, cur: cur if comp(cur, val) else val, array)
    swap_idx = 0  # set the initial swap_idx to 0
    while swap_idx < len(array):
        cur_idx = swap_idx
        next_swap_val = array[cur_idx]
        while cur_idx < len(array):
            if comp(array[cur_idx], next_swap_val):  # find next swap value
                next_swap_val = array[cur_idx]
            if array[cur_idx] == swap_val:  # swap swap_vals to front of the array
                array[swap_idx], array[cur_idx] = array[cur_idx], array[swap_idx]
                swap_idx += 1
            cur_idx += 1
        swap_val = next_swap_val
    return array
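For instance (a hypothetical usage, not from the original question), the function sorts in place and also returns the array, ascending by default and descending with comp=gt:

print(bingo_sort([3, 2, 4, 1, 5, 1, 3]))           # [1, 1, 2, 3, 3, 4, 5]
print(bingo_sort([3, 2, 4, 1, 5, 1, 3], comp=gt))  # [5, 4, 3, 3, 2, 1, 1]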
In general, the complexity of this algorithm depends on how many duplicate values get processed, and when they get processed. This is because every time k duplicate values get processed during a given iteration, the length of the inner loop is decreased by k for all subsequent iterations. Performance is therefore optimized when large clusters of duplicate values are processed early on (as when the smallest values of the array contain many duplicates). From this, there are basically two ways you could analyze the complexity of the algorithm: You could analyze it in terms of where the duplicate values tend to appear in the final sorted array (Type 1), or you could assume the clusters of duplicate values are randomly distributed through the sorted array and analyze complexity in terms of the average size of duplicate clusters (that is, in terms of the magnitude of m relative to n: Type 2).
The definition you linked uses the first type of analysis (based on where duplicates tend to appear) to derive best = Theta(n+m^2), average = Theta(nm), worst = Theta(nm). The second type of analysis produces best = Theta(n), average = Theta(nm), worst = Theta(n^2) as m ranges from Theta(1) through a general m up to Theta(n).
In the best Type 1 case, all duplicates will be among the smallest elements of the array, such that the run-time of the inner loop quickly decreases to O(m), and the final iterations of the algorithm proceed as an O(m^2) selection sort. However, there is still the up-front O(n) pass to select the initial swap value, so the overall complexity is O(n + m^2).
In the worst Type 1 case, all duplicates will be among the largest elements of the array. The length of the inner loop isn't substantially shortened until the last iterations of the algorithm, such that we achieve a run-time looking something like n + (n-1) + (n-2) + ... + (n-m). This is a sum of m terms, each of them O(n), giving us O(nm) total run-time.
In the average Type 1 case (and for all Type 2 cases), we don't assume that the clusters of duplicate values are biased towards the front or back of the sorted array. We take it that the m clusters of duplicate values are randomly distributed through the array in terms of their position and their size. Under this analysis, we expect that after the initial O(n) pass to find the first swap value, each of the m iterations of the outer loop reduces the length of the inner loop by approximately n/m. This leads to an overall run-time for unknown m and randomly distributed data of roughly n + sum_{i=0..m-1} (n - i*n/m) = n + n(m+1)/2 = Theta(nm).
We can use this expression for the average-case run-time with randomly distributed data and unknown m, Theta(nm), as the average Type 2 run-time, and it also directly gives us the best and worst case run-times based on how we vary the magnitude of m.
In the best Type 2 case, m might just be some constant value independent of n. If we have m = Theta(1) randomly distributed duplicate clusters, the best case run-time is then Theta(n*Theta(1)) = Theta(n). For example, you would see O(2n) = O(n) performance from bingo sort with just one unique value (one pass to find the swap value, one pass to swap every single value to the front), and this O(n) asymptotic complexity still holds if m is bounded by any constant.
However, in the worst Type 2 case, we could have m = Theta(n), and bingo sort essentially devolves into O(n^2) selection sort. This is clearly the case for m = n, but even if the expected decrease in the inner loop's length per iteration, n/m, is any constant value, which is the case for any m in Theta(n), we still see O(n^2) complexity.
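If you want to see the Type 1 best/worst gap empirically, one option (hypothetical instrumentation, not part of the answer; exact counts depend on implementation details) is to pass a counting comparator to the bingo_sort above:

calls = {"n": 0}

def counting_lt(a, b):
    calls["n"] += 1  # count every comparison bingo_sort makes
    return a < b

n = 2_000
front = [0] * (n - 100) + list(range(1, 101))    # duplicates among the smallest values
back = list(range(1, 101)) + [1000] * (n - 100)  # duplicates among the largest values

calls["n"] = 0; bingo_sort(front, comp=counting_lt); print(calls["n"])  # near-linear in n
calls["n"] = 0; bingo_sort(back, comp=counting_lt); print(calls["n"])   # closer to n*m

The front array should finish with far fewer comparisons than the back array, matching the best and worst Type 1 cases.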

Is it linear Time Complexity or Bigger?

Can anyone tell me the worst-case time complexity of the code below?
Is it linear or bigger?
void fun(int[] nums){
    int min = min(nums);
    int max = max(nums);
    for(int i = min; i <= max; i++){
        print(i); // constant complexity for print
    }
}
int min(int[] nums); // returns min of nums in linear time
int max(int[] nums); // returns max of nums in linear time
where
0 <= nums.length <= 10^4 and -10^9 <= nums[i] <= 10^9
Can I say that the time complexity of this code is O(max(nums[i]) - min(nums[i])), and can I say this is linear time complexity?
As the complexity is linear with respect to the range R = max - min of the data, I would call it pseudo-linear complexity: O(N + R).
This is detailed in this Wikipedia entry: Pseudo-polynomial time
As mentioned in the introduction of this article:
In computational complexity theory, a numeric algorithm runs in pseudo-polynomial time if its running time is a polynomial in the numeric value of the input (the largest integer present in the input)—but not necessarily in the length of the input (the number of bits required to represent it), which is the case for polynomial time algorithms.
Generally, when analysing the complexity of a given algorithm, we don't make any specific assumption about the inherent range limitation of a particular targeted language, unless of course it is explicitly mentioned in the problem.
If the range of numbers is constant (i.e. -10^9 <= nums[i] <= 10^9) then
for(int i = min; i <= max; i++){
    print(i); // constant complexity for print
}
is in O(1), i.e. constant, because it iterates over at most 2 * 10^9 + 1 numbers, regardless of how many numbers there are in the nums[] array. Thus it does not depend on the size of the input array.
Consider the following input arrays
nums = [-10^9, 10^9]; //size 2
nums = [-10^9, -10^9 + 1, -10^9 + 2, ..., 10^9 - 2, 10^9 - 1, 10^9] //size 2 * 10^9 + 1
for both, min and max will have the same values, -10^9 and 10^9 respectively. Thus your loop will iterate over all numbers from -10^9 to 10^9. And even if there were 10^100000 numbers in the original array, the for loop would still iterate at most from -10^9 to 10^9.
You say min() and max() are in O(n), so your overall algorithm would also be in O(n). But if you take into account that the given maximum length of the array (10^4) is orders of magnitude smaller than the limit of your numbers, you can even neglect the calls to min and max.
And as for your comment
For example, array = [1, 200, 2, 6, 4, 100]. In this case we can find min and max in linear time (O(n), where n is the length of the array). Now, my for loop's complexity is O(200), or O(n^3), which is much more than the length of the array. Can I still say it's linear complexity?
The size of the array and the values in the array are completely independent of each other. Thus you cannot express the complexity of the for loop in terms of n (as explained above). If you really want to also take into account the range of the numbers, you have to express it as something like O(n + r), where n is the size of the array and r is the range of the numbers.
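A small runnable sketch (hypothetical, mirroring the question's pseudocode in Python) makes the O(n + r) characterization concrete: the two calls below do essentially the same amount of loop work because they share the same range r, despite wildly different array sizes n.

def fun(nums):
    lo, hi = min(nums), max(nums)  # two O(n) passes
    for i in range(lo, hi + 1):    # O(r) iterations, where r = hi - lo
        pass                       # stand-in for the constant-time print

fun([-1000, 1000])             # n = 2,    r = 2000
fun(list(range(-1000, 1001)))  # n = 2001, r = 2000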

Calculating Time Complexity of an Algorithm

I am learning about calculating the time complexity of an algorithm, and there are two examples where I can't get my head around why the time complexity is different from what I calculated.
After doing the reading, I learned that a for-loop whose counter increases once per iteration has a time complexity of O(n), and a nested for-loop with different iteration bounds is O(n*m).
This is the first question, where I provided the time complexity as O(n), but the solution says it is O(1):
function PrintColours():
    colours = { "Red", "Green", "Blue", "Grey" }
    foreach colour in colours:
        print(colour)
This is the second one, where I provided the time complexity as O(n^2), but the solution says it's O(n):
function CalculateAverageFromTable(values, total_rows, total_columns):
    sum = 0
    n = 0
    for y from 0 to total_rows:
        for x from 0 to total_columns:
            sum += values[y][x]
            n += 1
    return sum / n
What am I getting wrong with these two questions?
There are several ways of denoting the runtime of an algorithm. One of the most used notations is Big-O notation.
Link to Wikipedia: https://en.wikipedia.org/wiki/Big_O_notation
big O notation is used to classify algorithms according to how their
run time or space requirements grow as the input size grows.
Now, while the mathematical definition of the notation might be daunting, you can think of it as a polynomial function of the input size where you strip away all the constants and lower-degree terms.
For example, ax^2 + bx + c in Big-O would be O(x^2) (we stripped away the constants a, b, and c and the lower-degree term bx).
Now, let's consider your examples. But before doing so, let's assume each operation takes a constant time c.
First example:
Input is: colours = { "Red", "Green", "Blue", "Grey" }, and you are looping through these elements in your for loop. As the input size is four, the runtime would be 4 * c. That's a constant runtime, and constant runtime is written as O(1) in Big-O.
Second example:
The inner for loop runs total_columns times and it has two operations:
for x from 0 to total_columns:
    sum += values[y][x]
    n += 1
So it takes 2c * total_columns time. The outer for loop runs total_rows times, resulting in a total time of total_rows * (2c * total_columns) = 2c * total_rows * total_columns. In Big-O it'd be written as O(total_rows * total_columns) (we stripped away the constant).
When you get out of the outer loop, n, which was set to 0 initially, will have become total_rows * total_columns, and that's why they said the answer is O(n).
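A quick check (hypothetical, not from the answer) is to count the inner-loop operations directly and observe that they grow as total_rows * total_columns:

def count_ops(total_rows, total_columns):
    ops = 0
    for y in range(total_rows):
        for x in range(total_columns):
            ops += 1  # stands in for sum += values[y][x]
    return ops

print(count_ops(10, 20))  # 200
print(count_ops(20, 20))  # 400: doubling the rows doubles the work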
One good definition of time complexity is:
"It is the number of operations an algorithm performs to complete its
task with respect to the input size".
For the question below, the input size can be defined as X = total_rows * total_columns. Then, what is the number of operations? It is X again, because there will be X additions due to the operation sum += values[y][x] (neglecting the increment n += 1 for simplicity). Then, suppose we double the array size from X to 2*X. How many operations will there be? 2*X again. As you can see, the increase in the number of operations is linear as we increase the input size, which makes the time complexity O(N).
function CalculateAverageFromTable(values, total_rows, total_columns):
    sum = 0
    n = 0
    for y from 0 to total_rows:
        for x from 0 to total_columns:
            sum += values[y][x]
            n += 1
    return sum / n
For your first question, the reason is that colours is a set with a fixed four elements, so the loop runs a constant number of times regardless of any input. (In Python, {} with comma-separated values defines a set; membership tests on a set are O(1) on average, though iterating over a set of n elements is still O(n).) For further information you can check here.

Time complexity on multiple variables & functions

I have written an algorithm that reads in a text file, extracts the contents into two arrays, and then sorts them. The program is working, but I am confused about calculating the time complexity. I just need someone to clarify this.
Say I have two functions, a main and a helper.
Helper function
insertion(int array[], int length)
    ...
Main function
int main()
    while(...) // this while loop reads the input text file and pushes integers into a vector
        ...
    while(...)
        ...
        if(...)
            for(...) // this for loop validates array B only
    insertion(arrayA, lengthA)
    insertion(arrayB, lengthB)
The program reads in the text file
Push line 1 to array A, push line 2 to array B
A 'for loop' validates array B's integers, inside an outer 'if'
Perform insertion sort on array A and array B
From what I learnt, I have to let the number of data items be 'n' before calculating the Big-O or the number of operations. Now, obviously there are two data sets here - one for array A and one for array B.
So, array A = n and array B = m.
However, I am unsure whether the number of data items in the helper function should be 'n' or 'm'. Likewise for the nested while loop, whether the number of data items should be 'n' or 'm'.
I tried my best to explain my difficulty in understanding this time complexity along with a simplified form of my program (the actual program has tons of loops...). Hopefully someone can understand what I mean and provide some clarification or else I will modify further to see if I can make it clearer. Thanks!
Edit: I am required to calculate the number of operations before finding the Big-O for my algorithm.
I understand that after you read the file, you will have arrays A and B.
If m and n are close, then you can say that m = n. Otherwise, choose the bigger one and call it n.
Then you read n two times: n + n = 2n, but in big O you can drop the constant, so at this point you have O(n) time.
If validation makes only one pass through array B, then you have 3n, but 3 is still a constant, so the time complexity is still O(n).
But the worst case insertion sort can do is O(n^2). You do it two times: n^2 + n^2 = 2*n^2; two is a constant, so the insertion sort piece takes O(n^2).
Finally, you have O(n) + O(n^2). Since it's big-O notation, the most costly part is the significant one: O(n^2) is your complexity.
For example, if you were to run insertion sort n times, then you'd have O(n * n^2) time, which is O(n^3).
A computer does on the order of 10^9 operations per second, so small n doesn't count for much.
If you are not sure whether n and m are close, say 0 < n < 10^9 and 0 < m < 10^3. You'd say the time complexity of reading the inputs is O(n + m), and insertion sort is O(n^2) + O(m^2). But here, since m << n (m is much less than n), you can choose not to consider m (m is almost optional IF YOU'RE not being strict!). If you need to be strict, do not ignore these small cases at first.
If 0 < n < 10^9 and 0 < m < 10^9, then you shouldn't say m = n or ignore either one, because n could be one and m one million.
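To make the bookkeeping concrete, here is a toy version of the pipeline (hypothetical Python following the question's outline, with the cost of each stage noted):

def insertion(arr):  # insertion sort: O(len(arr)^2) worst case
    for i in range(1, len(arr)):
        key, j = arr[i], i - 1
        while j >= 0 and arr[j] > key:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key

A = [5, 3, 8, 1]                # n = len(A), read in O(n)
B = [9, 2, 7]                   # m = len(B), read in O(m)
valid = all(b >= 0 for b in B)  # validation pass over B: O(m)
insertion(A)                    # O(n^2) worst case
insertion(B)                    # O(m^2) worst case
# Total: O(n + m) to read, O(m) to validate, O(n^2 + m^2) to sort,
# which is O(n^2) whenever m is O(n).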

Sample number with equal probability which is not part of a set

I have a number n and a set of numbers S ⊆ [1..n] with size s (which is substantially smaller than n). I want to sample a number k ∈ [1..n] with equal probability, but the number is not allowed to be in the set S.
I am trying to solve the problem in at worst O(log n + s). I am not sure whether it's possible.
A naive approach is creating an array of numbers from 1 to n excluding all numbers in S and then picking one array element. This runs in O(n) and is not an option.
Another approach may be just generating random numbers ∈ [1..n] and rejecting them if they are contained in S. This has no theoretical worst-case bound, as numbers in the set could be drawn repeatedly. But on average this might be a practical solution if s is substantially smaller than n.
Say s is sorted. Generate a random number between 1 and n - |s|; call it k. We've chosen the k'th element of {1,...,n} - s. Now we need to find it.
Use binary search on s to find the count of elements of s that are <= k. This takes O(log |s|). Add this count to k. In doing so, we may have passed or arrived at additional elements of s. We can adjust for this by incrementing our answer for each such element that we pass, which we find by checking the next larger element of s from the point found in our binary search.
E.g., n = 100, s = {1,4,5,22}, and our random number is 3. So our approach should return the third element of [2,3,6,7,...,21,23,24,...,100], which is 6. Binary search finds that 1 element is at most 3, so we increment to 4. Now we compare to the next larger element of s, which is 4, so increment to 5. Repeating this finds 5 in s, so we increment to 6. We check s once more, see that 6 isn't in it, so we stop.
E.g., n = 100, s = {1,4,5,22}, and our random number is 4. So our approach should return the fourth element of [2,3,6,7,...,21,23,24,...,100] which is 7. Binary search finds that 2 elements are at most 4, so we increment to 6. Now we compare to the next larger element of s which is 5 so increment to 7. We check s once more, see that the next number is > 7, so we stop.
If we assume that "s is substantially smaller than n" means |s| <= log(n), then we will increment at most log(n) times, and in any case at most |s| times.
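Here is a sketch of this approach in Python (a hypothetical implementation, assuming s is given as a sorted list):

import bisect, random

def sample_not_in(n, s):  # s is sorted and |s| < n
    k = random.randint(1, n - len(s))    # pick the k'th allowed value
    idx = bisect.bisect_right(s, k)      # count of elements of s <= k
    k += idx
    while idx < len(s) and s[idx] <= k:  # adjust for elements of s we pass
        k += 1
        idx += 1
    return k

For example, with n = 100 and s = [1, 4, 5, 22], a draw of k = 3 walks up to 6 and a draw of k = 4 walks up to 7, as in the traces above.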
If s is not sorted then we can do the following. Create an array of bits of size |s|. Generate k. Parse s and do two things: 1) count the number of elements < k; call this r. At the same time, set the i'th bit to 1 if k+i is in s (0-indexed, so if k is in s then the first bit is set).
Now, increment k a number of times equal to r plus the number of set bits in the array with an index <= the number of times incremented.
E.g., n = 100, s = {1,4,5,22}, and our random number is 4. So our approach should return the fourth element of [2,3,6,7,...,21,23,24,...,100] which is 7. We parse s and 1) note that 1 element is below 4 (r=1), and 2) set our array to [1, 1, 0, 0]. We increment once for r=1 and an additional two times for the two set bits, ending up at 7.
This is O(s) time, O(s) space.
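A sketch of this unsorted variant (hypothetical implementation, following the description above):

import random

def sample_not_in_unsorted(n, s):  # s is a list, not necessarily sorted
    k = random.randint(1, n - len(s))
    bits = [0] * len(s)
    r = 0
    for v in s:  # a single O(s) pass over s
        if v < k:
            r += 1            # count elements below k
        elif v - k < len(s):  # v lies in the window [k, k + len(s) - 1]
            bits[v - k] = 1
    t = r  # total increments: r, plus one per set bit at index <= t
    i = 0
    while i < len(bits) and i <= t:
        if bits[i]:
            t += 1
        i += 1
    return k + t

With n = 100, s = [1, 4, 5, 22], and k = 4, this counts r = 1, sets bits = [1, 1, 0, 0], and returns 7, matching the example.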
This is an O(1) solution with O(s) initial setup that works by mapping each non-allowed number > s to an allowed number <= s.
Let S be the set of non-allowed values, S(i), where i = [1 .. s] and s = |S|.
Here's a two part algorithm. The first part constructs a hash table based only on S in O(s) time, the second part finds the random value k ∈ {1..n}, k ∉ S in O(1) time, assuming we can generate a uniform random number in a contiguous range in constant time. The hash table can be reused for new random values and also for new n (assuming S ⊂ { 1 .. n } still holds of course).
To construct the hash H: first set j = 1. Then iterate over S(i), the elements of S; they do not need to be sorted. If S(i) > s, add the key-value pair (S(i), j) to the hash table, unless j ∈ S, in which case first increment j until it is not. Finally, increment j.
To find a random value k, first generate a uniform random value in the range s + 1 to n, inclusive. If k is a key in H, then k = H(k). I.e., we do at most one hash lookup to ensure k is not in S.
Python code to generate the hash:
def substitute(S):
    H = dict()
    j = 1
    for s in S:
        if s > len(S):
            while j in S: j += 1
            H[s] = j
            j += 1
    return H
For the actual implementation to be O(s), one might need to convert S into something like a frozenset to ensure the test for membership is O(1), and also move the len(S) loop invariant out of the loop. Assuming the j in S test and the insertion into the hash (H[s] = j) are constant time, this has complexity O(s).
The generation of a random value is simply:
import random

def myrand(n, s, H):
    k = random.randint(s + 1, n)
    return (H[k] if k in H else k)
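Hypothetical usage (the exact key-value pairs depend on set iteration order, but the image of the mapping is the same): with n = 10 and S = {2, 3, 7, 9}, so s = 4, substitute maps the forbidden values above s, {7, 9}, onto the allowed values {1, 4}:

S = frozenset({2, 3, 7, 9})
H = substitute(S)             # maps {7, 9} onto {1, 4}
print(myrand(10, len(S), H))  # uniform over {1, 4, 5, 6, 8, 10}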
If one is only interested in a single random value per S, then the algorithm can be optimized to improve the common case, while the worst case remains the same. This still requires S be in a hash table that allows for a constant time "element of" test.
def rand_not_in(n, S):
    k = random.randint(len(S) + 1, n)
    if k not in S: return k
    j = 1
    for s in S:
        if s > len(S):
            while j in S: j += 1
            if s == k: return j
            j += 1
Optimizations are: Only generate the mapping if the random value is in S. Don't save the mapping to a hash table. Short-circuit the mapping generation when the random value is found.
Actually, the rejection method seems like the practical approach.
Generate a number in 1...n and check whether it is forbidden; regenerate until the generated number is not forbidden.
The probability of a single rejection is p = s/n.
Thus the expected number of random number generations is 1 + p + p^2 + p^3 + ... which is 1/(1-p), which in turn is equal to n/(n-s).
Now, if s is much less than n, or even more up to s = n/2, this expected number is at most 2.
It would take s almost equal to n to make it infeasible in practice.
Multiply the expected time by log s if you use a tree-set to check whether the number is in the set, or by just 1 (expected value again) if it is a hash-set. So the average time is O(1) or O(log s) depending on the set implementation. There is also O(s) memory for storing the set, but unless the set is given in some special way, implicitly and concisely, I don't see how it can be avoided.
(Edit: As per comments, you do this only once for a given set.
If, additionally, we are out of luck, and the set is given as a plain array or list, not some fancier data structure, we get O(s) expected time with this approach, which still fits into the O(log n + s) requirement.)
If attacks against the unbounded algorithm are a concern (and only if they truly are), the method can include a fall-back algorithm for the cases when a certain fixed number of iterations didn't provide the answer.
Similarly to how IntroSort is QuickSort but falls back to HeapSort if the recursion depth gets too high (which is almost certainly a result of an attack resulting in quadratic QuickSort behavior).
Find all numbers that are in the forbidden set and less than or equal to n-s. Call this array A.
Find all numbers that are not in the forbidden set and greater than n-s. Call this array B. This can be done in O(s) if the set is sorted.
Note that the lengths of A and B are equal, and create the mapping map[A[i]] = B[i]
Generate a number t between 1 and n-s. If map[t] exists, return it; otherwise return t
This works with O(s) insertions into the map plus 1 lookup, which is O(s) on average with a hash map or O(s log s) with an ordered map.
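A Python sketch of this mapping approach (hypothetical implementation, assuming the forbidden set is given as a Python set):

import random

def sample_via_map(n, S):
    s = len(S)
    A = [v for v in S if v <= n - s]                        # forbidden values inside the draw range
    B = [v for v in range(n - s + 1, n + 1) if v not in S]  # allowed values above the range
    remap = dict(zip(A, B))                                 # len(A) == len(B)
    t = random.randint(1, n - s)
    return remap.get(t, t)

Each of the n - s draw outcomes maps to a distinct allowed value, so the result is uniform over [1..n] minus S.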
