Better algorithm with better Big O

You are given an unsorted array of n integers, and you would like to find if there are any duplicates in the array (i.e. any integer appearing more than once).
The algorithm works on an unsorted array of n integers. A nested loop is used to find duplicates, so the complexity is O(N^2).
If we limit the input data in order to achieve some best case scenario, how can you limit the input data to achieve a better Big O complexity? Describe an algorithm for handling this limited data to find if there are any duplicates. What is the Big O complexity?
The question asks for the following:
One way in which the data can be limited.
How this changes your algorithm for finding duplicates, and what is the better Big O complexity.
The answer I have come up with:
If we limit the data to, let’s say, array size of 5 (n = 5), we could reduce the complexity to O(N).
If the array is sorted, then all we need is a single loop that compares each element to the next element in the array, and this will find whether duplicates exist.
This simply means that if the array given to us is already sorted (from lowest to highest value), whether by default or by luck, the complexity drops from O(N^2) to O(N). We wouldn't need the inner loop, since the array is already sorted, so we could use a single loop that compares each integer to its successor. If a duplicate is encountered we could, for instance, use a printf statement to print it, and keep iterating until the loop has run n-1 times (which would be 4), ending the program once that is done.
The best case in this algorithm would be O(N), simply because the performance grows linearly and in direct proportion to the size of the input data. So if we have a sorted array of size 50 (50 integers in the array), the loop will iterate n-1 times (50 - 1 = 49 times), where n is the length of the array.
The running time of this algorithm increases in direct proportion to the input size. This simply means that for a sorted array, the amount of time the operations take depends entirely on the size of the array.
Your confirmation (on whether this is correct or not) would be appreciated. I know that there are other algorithms with a better complexity class, but since this is more efficient than O(N^2), it would be a possible answer, since it's what the question asks for.

If you limit the size of the array to 5 (or 1000, or any other constant for that matter), then the complexity of your algorithm becomes O(1), so limiting the size of the array is a non-starter.
What you can do, however, is limit the values that go into the array. If you limit them to, say, 10000, or some other small number like that, you could make an O(N) algorithm like this:
Make an array of booleans called seen. The array needs one entry per possible value, so its size is the maximum value that can go into your data array plus one (assuming the values are non-negative). Set all elements of the seen array to false. Now go through your data array and check whether the boolean for the corresponding value is set; if it is, declare a duplicate. Otherwise, set that flag to true. This algorithm has a complexity of O(N) in the worst case, treating the fixed value limit as a constant.
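The seen-array idea can be sketched as follows. This is only an illustration: the function name and the max_value parameter are made up here, and the values are assumed to be integers in the range 0..max_value.

```python
def has_duplicate_bounded(data, max_value):
    """Return True if any value repeats.

    Assumption (not from the original answer): every value is an
    integer with 0 <= value <= max_value.
    """
    seen = [False] * (max_value + 1)  # one flag per possible value
    for value in data:
        if seen[value]:
            return True               # flag already set: duplicate
        seen[value] = True
    return False
```

The loop touches each element once, so the scan is O(N); allocating the flags costs O(max_value), which is a constant once the value range is fixed.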
You could expand this algorithm to allow any range of values, as long as the value has a good hash function. Replace the array seen with a hash set, and use the same algorithm. Since the time complexity of adding and retrieving data in a hash set is constant on average, the asymptotic complexity of the algorithm would not change.
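A minimal sketch of the hash-set variant, using Python's built-in set (the function name is hypothetical). It works for any hashable values, not only small non-negative integers.

```python
def has_duplicate(data):
    """Return True if any value in data appears more than once."""
    seen = set()
    for value in data:
        if value in seen:   # average O(1) membership test
            return True
        seen.add(value)     # average O(1) insertion
    return False
```

The expected running time is O(N), at the cost of up to O(N) extra space for the set.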
Finally, you can sort the array and look for duplicates in O(N*logN). This algorithm has a slightly worse time complexity, but its space complexity can be O(1) if you sort in place with an algorithm such as heapsort (the algorithm using a hash set has a space complexity of O(N), which may be significant).
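The sort-based approach might look like the sketch below (the function name is made up). Note one assumption: Python's sorted() returns a copy, so this version uses O(N) extra space; getting down to O(1) extra space would need an in-place sort such as a hand-written heapsort.

```python
def has_duplicate_sorted(data):
    """Sort, then compare each element with its successor."""
    ordered = sorted(data)  # O(N log N); makes a sorted copy
    # After sorting, equal values are adjacent, so one pass is enough.
    return any(a == b for a, b in zip(ordered, ordered[1:]))
```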

Related

Complexity - input length

I'm currently learning complexity (or efficiency however you call it), and I read about it in a book I got.
Something written there makes no sense to me, and I need an explanation. I've tried looking online, but I didn't find an answer for the specific example they give.
For an algorithm that gets the max number in a single-dimensional array the size of n the input length would be n.
"For an algorithm that gets the max number in a two-dimensional array the size of n*n the input length would still be n."
I don't understand why the input length would be 'n' in both cases even though for the two-dimensional you have to go through n*n numbers...
It says "input length = the amount of work done ...", which doesn't make any sense to me.
Would anyone care to explain? They certainly don't explain this there.
It's a common misconception (much seen here on SO) that a scan across a 2D array with n*n elements is a quadratic operation. It's not: a scan is linear in the number of elements, visiting one element after another. If the input size is N = n*n elements, the scan is O(N).
The 2D array is a polite fiction; it is really just a convenience for accessing a 1D array. After all, in languages which implement arrays properly (i.e. none of this array of pointers to blocks of memory) a 2D array is just a set of adjacent memory locations. And even in languages which do implement 2D arrays as arrays of pointers, they're just linear segments of memory with interruptions.
If a scan across a 2D array were O(n^2) then you could magically transform it to O(n) by ignoring the 2d-ness and just scanning the underlying 1d block of memory.
O(n^2) describes a different complexity class of operations such as those in which each pair of elements in the input is operated upon.
Reading in the comments that this book is written in Hebrew I would assume that the issue is a translation error or some other error in proofreading. The definition given in the comments of input length "input length is the measurement that indicates the work load of an algorithm" doesn't match what you would assume the term means at all in English.
To answer the question about complexity, they are reusing the variable 'n' in multiple places, which makes it slightly confusing. They use 'n' to describe the dimension of the array and to describe the complexity. O(n) simply means the complexity is linear in the input size. O(n^2) would be a quadratic complexity. In this case, with an array of n*n elements, the input size is n*n or n^2, but the complexity of the algorithm is still linear in that input size. This is because the algorithm still only operates on each input element once, whether the input has n or n*n elements. It would still be linear if it operated on each element two or three times, as 3n and n are both linear functions (any x*n with constant x would be linear).
I hope this helps.
Big-O notation is used to classify TYPES of algorithms (complexity classes), not necessarily how much time it will ACTUALLY take to run. For instance O(cn) is just O(n) where c is a constant.
n is the size of the input whether that input is an nxn matrix or just an 'n' length array. The big-O 'n' and the program variable name are not referring to the same thing.
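A small sketch can make the distinction concrete. The helper below (max_with_visits is a made-up name, not from the book or the answers) finds the maximum of an n by n grid and counts how many elements it touches. Each of the n*n elements is visited exactly once, so the work is linear in the number of elements, which is the same thing as quadratic in the side length n.

```python
def max_with_visits(grid):
    """Return (maximum value, number of elements visited).

    Assumes grid is a non-empty list of non-empty rows.
    """
    best = grid[0][0]
    visits = 0
    for row in grid:
        for value in row:
            visits += 1          # one unit of work per element
            if value > best:
                best = value
    return best, visits
```

For a 3 by 3 grid the counter ends at 9; for a 1000 by 1000 grid it ends at 1,000,000, which is exactly the number of input elements.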

How can the worst case for an algorithm have different bounds?

I've been trying to figure this out all day. Some other threads address this, but I really don't understand the answers. There are also many answers that contradict one another.
I understand that an algorithm will never take longer than the upper bound and never be faster than the lower bound. However, I didn't know an upper bound existed for best case time and a lower bound existed for worst case time. This question really threw me for a loop. I can't wrap my head around this... a given run time can have a different upper and lower bound?
For example, if someone asked: "Show that the worst-case running time of some algorithm on a heap of size n is Big Omega(lg(n))". How do you possibly get a lower bound, any bound for that matter, when given a run time?
So, in summation, an algorithm's worst case upper bound can be different from its worst case lower bound? How can this be? Once given the case, don't bounds become irrelevant? I'm trying to study algorithms independently and I really need to wrap my head around this first.
The meat of my accepted answer to that question is a function whose running time oscillates between n^2 and n^3 depending on whether n is odd. The point that I was trying to make is that sometimes bounds of the form O(n^k) and Omega(n^k) aren't sufficiently descriptive, even though the worst case running time is a perfectly well defined function (which, like all functions, is its own best lower and upper bound). This happens with more natural functions like n log n, which is Omega(n^k) but not O(n^k) for k ≤ 1, and O(n^k) but not Omega(n^k) for k > 1 (and hence not Theta(n^k) regardless of how we choose a constant k).
Suppose you write a program like this to find the smallest prime factor of an integer:
function lpf(n):
    for i = 2 to n
        if n%i == 0 then return i
If you run the function on the number 10^11 + 3, it will take 10^11 + 2 steps. If you run it on the number 10^11 + 4 it will take just one step. So the function's best-case time is O(1) steps and its worst-case time is O(n) steps.
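The pseudocode above translates almost directly to Python. The sketch below adds a step counter (an addition for illustration only) and uses small inputs, since running it on 10^11 + 3 would take far too long.

```python
def lpf(n):
    """Return (smallest factor >= 2 of n, loop iterations used)."""
    steps = 0
    for i in range(2, n + 1):
        steps += 1
        if n % i == 0:
            return i, steps
    return None, steps  # only reached when n < 2


print(lpf(101))  # 101 is prime: the loop runs n - 1 = 100 times
print(lpf(102))  # 102 is even: the loop stops after 1 step
```

Two inputs of almost the same size need 100 steps and 1 step respectively, which is why the best-case and worst-case running times are described separately.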
Big O notation describes efficiency in terms of the number of iterations at run time, generally as a function of the size of the input data set.
The notation is written in its simplest form, ignoring constant multipliers and lower-order additive terms, but keeping the dominant term (including its exponent). If you have an operation of O(1), it is executed in constant time, no matter the input data.
However, if you have something such as O(N) or O(log(N)), the running time will differ depending on the size of the input data.
The high and low bounds describe the most and the fewest iterations, respectively, that an algorithm can take.
Example: for O(N), the high bound corresponds to the largest input data and the low bound to the smallest.
Extra sources:
Big O Cheat Sheet and MIT Lecture Notes
UPDATE:
Looking at the Stack Overflow question mentioned above, that algorithm is broken into three parts, where it has 3 possible types of runtime, depending on the data. Now really, this is three different algorithms designed to handle different data values. An algorithm is generally classified with just one notation of efficiency, and that is the tightest bound that holds for ALL possible values of N.
In the case of O(N^2), larger data will take quadratically longer, while a smaller input will be processed quickly. The algorithm determines how quickly a data set will be processed, yet bounds are given depending on the range of data the algorithm is designed to handle.
I will try to explain it using the quicksort algorithm.
In quicksort you have an array and choose an element as pivot. The next step is to partition the input array into two arrays. The first one will contain elements < pivot and the second one elements > pivot.
Now assume you apply quicksort to an already sorted list and the pivot element is always the last element of the array. The result of the partition will be an array of size n-1 and an array of size 1 (the pivot element). This will result in a runtime of O(n*n). Now assume that the pivot element always splits the array into two equal-sized arrays. In every step the array size will be cut in half. This will result in O(n log n). I hope this example makes this a bit clearer for you.
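To see both cases in numbers, here is a rough sketch that counts comparisons against the pivot. It is a simplified, out-of-place quicksort written for illustration (not a production implementation); the last element is always the pivot, and elements equal to the pivot go to the right-hand side, a detail the description above leaves open.

```python
import random


def quicksort(items):
    """Return (sorted list, comparisons against pivots)."""
    if len(items) <= 1:
        return list(items), 0
    pivot = items[-1]                 # always pick the last element
    rest = items[:-1]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    left, c_left = quicksort(smaller)
    right, c_right = quicksort(larger)
    # One partition pass compares every remaining element to the pivot.
    return left + [pivot] + right, len(rest) + c_left + c_right


n = 200
already_sorted = list(range(n))
shuffled = already_sorted[:]
random.Random(0).shuffle(shuffled)    # fixed seed for repeatability

worst = quicksort(already_sorted)[1]  # 19900 == n * (n - 1) / 2
typical = quicksort(shuffled)[1]      # far fewer, roughly n log n
print(worst, typical)
```

On the sorted input every partition peels off just the pivot, so the comparisons add up to 199 + 198 + ... + 1; on the shuffled input the splits are usually much more balanced.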
Another well-known sorting algorithm is mergesort. Mergesort always has a runtime of O(n log n). In mergesort you cut the array down until only one element is left, then climb back up the call stack, merging the arrays of size one, after that the arrays of size two, and so on.
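A compact sketch of that split-then-merge structure (a textbook top-down mergesort written for illustration, returning a new list rather than sorting in place):

```python
def merge_sort(items):
    """Return a new sorted list using top-down mergesort."""
    if len(items) <= 1:
        return list(items)            # size 0 or 1: already sorted
    mid = len(items) // 2
    left = merge_sort(items[:mid])    # cut the array in half ...
    right = merge_sort(items[mid:])
    merged = []                       # ... and merge on the way back up
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])           # at most one of these is non-empty
    merged.extend(right[j:])
    return merged
```

The array is halved about log n times and each level does O(n) merging work regardless of the input order, which is why the bound does not depend on the data the way quicksort's does.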
Let's say you implement a set using an array. To insert an element you simply put it in the next available bucket. If there is no available bucket, you increase the capacity of the array by a value m.
For the insert algorithm, "there is not enough space" is the worst case.
insert(S, e)
    if size(S) >= capacity(S)
        reserve(S, size(S) + m)
    put(S, e)
Assume we never delete elements. By keeping track of the last available position, put, size and capacity are Θ(1) in time and space.
What about reserve? If it is implemented like realloc in C, in the best case you just allocate new memory at the end of the existing memory (best case for reserve), or you have to move all existing elements as well (worst case for reserve).
The worst case lower bound for insert is the best case of reserve(), which is linear in m if we don't nitpick. In the worst case, insert is Ω(m) in space and time.
The worst case upper bound for insert is the worst case of reserve(), which is linear in m+n. In the worst case, insert is O(m+n) in space and time.
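The insert pseudocode can be sketched in Python as below. Several details are assumptions made for illustration: the class name, the initial capacity of m, and a reserve that always takes the "move everything" path (the worst case for reserve). A copies counter is added purely to make the cost visible.

```python
class ArraySet:
    """Array-backed container that grows by a fixed step m."""

    def __init__(self, m):
        self.m = m
        self.buckets = [None] * m   # assumed initial capacity: m
        self.size = 0
        self.copies = 0             # elements moved by reserve so far

    def reserve(self, new_capacity):
        # Worst case for reserve: allocate a fresh block, move all elements.
        new_buckets = [None] * new_capacity
        for i in range(self.size):
            new_buckets[i] = self.buckets[i]
            self.copies += 1
        self.buckets = new_buckets

    def insert(self, e):
        if self.size >= len(self.buckets):   # no free bucket left
            self.reserve(self.size + self.m)
        self.buckets[self.size] = e          # put: constant time
        self.size += 1
```

With m = 4, the fifth insert moves 4 elements and the ninth moves 8, so an insert that triggers reserve costs time proportional to the current size plus m, while every other insert is constant time.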

Why we can not apply counting sort to general arrays?

Counting sort is known to run in linear time if we know that all elements in the array are bounded above by a given number. If we take a general array, can't we just scan the array in linear time to find its maximum value, and then apply counting sort?
It is not enough to know the upper bound to run a counting sort: you need to have enough memory to fit all the counters.
Consider a situation where you go through an array of 64-bit integers and find that the largest element is 2^60. This would mean two things:
You need O(2^60) memory, and
It is going to take O(2^60) time to complete the sort.
The fact that O(2^60) is the same as O(1) is of little help here, because the constant factor is simply too large. This is very often a problem with pseudo-polynomial time algorithms.
Suppose the largest number is like 235684121.
Then you'll spend incredible amounts of RAM to keep your buckets.
I would like to add something to @dasblinkenlight's and @AlbinSunnanbo's answers: your idea of scanning the array in one O(n) pass to find its maximum value is okay. The following is from Wikipedia:
However, if the value of k is not already known then it may be
computed by an additional loop over the data to determine the maximum
key value that actually occurs within the data.
As the time complexity is O(n + k), k has to stay under a certain limit, so the k you find should be small. As @dasblinkenlight mentioned, O(large_value) can't practically be treated as O(1).
Though I don't know of any major applications of counting sort so far, except as a subroutine of radix sort, it can be used nicely in problems like string sorting (i.e. sorting "android" to "addinor"), as here k is only 255.
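A minimal counting sort sketch, assuming non-negative integer keys no larger than k (the function names are made up here). The second helper applies it to the string case, assuming every character has a code of at most 255.

```python
def counting_sort(values, k):
    """Sort integers in the range 0..k in O(n + k) time, O(k) extra space."""
    counts = [0] * (k + 1)            # one counter per possible key
    for v in values:
        counts[v] += 1
    result = []
    for key, count in enumerate(counts):
        result.extend([key] * count)  # emit each key as often as it occurred
    return result


def sort_string(s):
    """Sort the characters of s; assumes all character codes are <= 255."""
    codes = counting_sort([ord(ch) for ch in s], 255)
    return "".join(chr(c) for c in codes)
```

For a general array, k can be found first with k = max(values), which is the extra O(n) pass discussed above. The catch is the counts list: with a maximum of 2^60 it would need 2^60 + 1 counters, which is where the approach breaks down.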

What is big-O notation? How do you come up with figures like O(n)?

I'd imagine this is probably something taught in classes, but as I am a self-taught programmer, I've only seen it rarely.
I've gathered it is something to do with the time, and O(1) is the best, while stuff like O(n^n) is very bad, but could someone point me to a basic explanation of what it actually represents, and where these numbers come from?
Big O gives an upper bound on the order of growth and is most often quoted for the worst-case run time. It is used to show how well an algorithm scales based on the size of the data set (n -> number of items).
Since we are only concerned with the order, constant multipliers are ignored, and any terms which increase less quickly than the dominant term are also removed. Some examples:
A single operation or set of operations is O(1), since it takes some constant time (does not vary based on data set size).
A loop is O(n). Each element in the data set is looped over.
A nested loop is O(n^2). A nested nested loop is O(n^3), and onward.
Things like binary tree searching are O(log n), which is more difficult to show, but at every level in the tree, the possible number of solutions is halved, so the number of levels is log(n) (provided the tree is balanced).
Something like finding the subset of a set of numbers whose sum is closest to a given value is O(2^n) by brute force, since the sum of each of the 2^n subsets needs to be calculated. This is very bad.
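These growth rates can be checked by simply counting steps. The two helpers below are illustrative only (the names are made up): one counts the iterations of a nested loop, the other counts how often n can be halved, which is the same argument as the one about levels in a balanced tree.

```python
def pair_steps(n):
    """Iterations of a doubly nested loop over n items: n * n."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            steps += 1
    return steps


def halving_steps(n):
    """How many times n can be halved before reaching 1 (about log2 n)."""
    steps = 0
    while n > 1:
        n //= 2
        steps += 1
    return steps


print(pair_steps(10), pair_steps(20))           # 100 400
print(halving_steps(1024), halving_steps(2048)) # 10 11
```

Doubling n quadruples the nested-loop count but adds only one step to the halving count.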
It's a way of expressing time complexity.
O(n) means that for n elements in a list, it takes on the order of n computations to process the list, which isn't bad at all. Each increase in n increases the running time linearly.
O(n^n) is bad, because the amount of computation required to perform a sort (or whatever you are doing) grows even faster than exponentially as you increase n.
O(1) is the best, as it means a constant number of computations to perform a function, regardless of n. Think of hash tables: looking up a value in a hash table has O(1) time complexity on average.
Big O notation as applied to an algorithm refers to how the run time of the algorithm depends on the amount of input data. For example, a sorting algorithm will take longer to sort a large data set than a small data set. If for the sorting algorithm example you graph the run time (vertical-axis) vs the number of values to sort (horizontal-axis), for numbers of values from zero to a large number, the nature of the line or curve that results will depend on the sorting algorithm used. Big O notation is a shorthand method for describing the line or curve.
In big O notation, the expression in the brackets is the function that is graphed. If a variable (say n) is included in the expression, this variable refers to the size of the input data set. You say O(1) is the best. This is true because the graph f(n) = 1 does not vary with n. An O(1) algorithm takes the same amount of time to complete regardless of the size of the input data set. By contrast, the run time of an O(n^2) algorithm increases with the square of the size of the input data set, and an O(n^n) algorithm grows far faster still.
That is the basic idea. For a detailed explanation, consult the Wikipedia page titled 'Big O Notation'.

Number of different elements in an array

Is it possible to compute the number of different elements in an array in linear time and constant space? Let us say it's an array of long integers, and you can not allocate an array of length sizeof(long).
P.S. Not homework, just curious. I've got a book that sort of implies that it is possible.
This is the element uniqueness problem, for which the lower bound is Ω(n log n) in comparison-based models. The obvious hashing or bucket sorting solutions all require linear space too, so I'm not sure this is possible.
You can't use constant space. You can use O(number of different elements) space; that's what a HashSet does.
You can use any sorting algorithm and count the number of different adjacent elements in the array.
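A short sketch of the sort-then-count idea (the function name is hypothetical). As written it sorts a copy, so it uses O(n) extra space; sorting the input in place with a constant-space algorithm would avoid that at the cost of reordering the input.

```python
def count_distinct_sorted(values):
    """Count distinct elements by sorting and comparing neighbours."""
    ordered = sorted(values)          # O(n log n)
    if not ordered:
        return 0
    distinct = 1                      # the first element is always new
    for prev, cur in zip(ordered, ordered[1:]):
        if cur != prev:               # a change marks a new value
            distinct += 1
    return distinct
```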
I do not think this can be done in linear time. One algorithm to solve in O(n log n) requires first sorting the array (then the comparisons become trivial).
If you are guaranteed that the numbers in the array are bounded above and below, by say b and a, then you could allocate an array of size b - a + 1, and use it to keep track of which numbers have been seen.
i.e., you would move through your input array, take each number, and mark a true in your target array at that spot. You would increment a counter of distinct numbers only when you encounter a number whose position in your storage array is false.
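The bounded-range idea can be sketched like this, assuming every value v is an integer with a <= v <= b (the function name and parameter order are chosen here for illustration). Note that the flag array needs b - a + 1 entries so that both endpoints fit.

```python
def count_distinct_bounded(values, a, b):
    """Count distinct integers, assuming a <= v <= b for every v."""
    seen = [False] * (b - a + 1)      # slot v - a tracks value v
    distinct = 0
    for v in values:
        if not seen[v - a]:           # first time this value appears
            seen[v - a] = True
            distinct += 1
    return distinct
```

This runs in O(n) time with O(b - a) extra space, which is only "constant" if the bounds are fixed independently of n.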
Assuming we can partially destroy the input, here's an algorithm for n words of O(log n) bits.
Find the element of order sqrt(n) via linear-time selection. Partition the array using this element as a pivot (O(n)). Using brute force, count the number of different elements in the partition of length sqrt(n). (This is O(sqrt(n)^2) = O(n).) Now use an in-place radix sort on the rest, where each "digit" is log(sqrt(n)) = log(n)/2 bits and we use the first partition to store the digit counts.
If you consider streaming algorithms only ( http://en.wikipedia.org/wiki/Streaming_algorithm ), then it's impossible to get an exact answer with o(n) bits of storage via a communication complexity lower bound ( http://en.wikipedia.org/wiki/Communication_complexity ), but possible to approximate the answer using randomness and little space (Alon, Matias, and Szegedy).
This can be done with a bucket approach, assuming that there is only a constant number of different values. Make a flag for each value (still constant space). Traverse the list and flag the values that occur. If you happen to flag an already flagged value, you've found a duplicate. You have to traverse the buckets for each element in the list, but that's still linear time.

Resources