question about sorting algorithms

Bubble sort is O(n) at best, O(n^2) at worst, and its memory usage is O(1). Merge sort is always O(n log n), but its memory usage is O(n).
Which algorithm would we use to implement a function that takes an array of integers and returns the maximum integer in the collection, assuming that the length of the array is less than 1000? What if the array length is greater than 1000?

A sequential scan of a dataset, looking for a maximum, can be done in O(n) time with O(1) memory usage.
You just set the current maximum to the first element then run through all the other elements, setting the current maximum to the value if the value is greater. Pseudo-code:
max = list[first_index]
for index = first_index + 1 to last_index:
    if list[index] > max:
        max = list[index]
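The same scan in Java might look like this (a minimal sketch; the helper name maxOf and the assumption of a non-empty int[] are mine):

static int maxOf(int[] list) {
    int max = list[0];                        // current maximum starts at the first element
    for (int i = 1; i < list.length; i++) {
        if (list[i] > max) max = list[i];     // replace it whenever a larger value appears
    }
    return max;
}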
The complexity doesn't change with the number of elements in the list, so it doesn't matter whether the array has fewer or more than 1000 entries.
The running time will change, however (since the algorithm is O(n) in time), and, if it's important to find the maximum quickly, there are a number of possibilities. These all hinge on doing work when the list changes rather than every time you want the information, so they're better suited to a list that is read more often than it is written, where the cost can be amortised.
Option 1 is to keep the list sorted so you can just grab the last element. This is probably overkill for just keeping a record of the maximum.
Option 2 is to recalculate the maximum (and number of elements holding it) when you insert into, or delete from, the list. Initially set max to 0 and maxcount to 0 for an empty list.
For an insert:
if maxcount is 0 (the list is empty), set max to this value and maxcount to 1.
otherwise, if this value is greater than max, set max to this value and maxcount to 1.
otherwise, if this value is equal to max, add 1 to maxcount.
For a deletion:
if this value is equal to max, subtract 1 from maxcount.
then, if maxcount is 0, rescan list to get new max and maxcount.
This way, at any time, you have the maximum value (the count is simply an extra "trick" to speed up the algorithm where there is more than one element holding the maximum value). I've used this once before in a data analysis application and it turned out to be much faster than re-sorting - I had to store both minimum and maximum in that case, but it's the same idea.
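A rough Java sketch of option 2, assuming the collection is a simple list of doubles (the class and method names here are illustrative, not from the original answer):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class MaxTracker {
    private final List<Double> list = new ArrayList<>();
    private double max;                          // undefined while the list is empty
    private int maxCount;                        // how many elements currently equal max

    void insert(double value) {
        list.add(value);
        if (maxCount == 0 || value > max) {      // empty list, or a new maximum
            max = value;
            maxCount = 1;
        } else if (value == max) {
            maxCount++;
        }
    }

    void delete(double value) {
        list.remove(value);                      // removes one occurrence
        if (value == max && --maxCount == 0 && !list.isEmpty()) {
            max = Collections.max(list);         // rare full rescan
            for (double x : list) if (x == max) maxCount++;
        }
    }

    double currentMax() { return max; }
}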

Finding the maximum value is always O(n) unless the data is presorted; if it is presorted, it's cheaper. Sorting before searching is always worse than O(n). So, generally, 1,000 elements will take about 1,000 comparisons: just compare them directly. If you're working with sorted structures, lookups are inexpensive; if not, they're expensive. On the other hand, insertion into sorted structures is more expensive. That trade-off is why this keeps coming up as an issue.

Related

Counting Sort different approach

In a counting sort algorithm, we initialize a count array with a size equal to the maximum value in the given array. The runtime of this method is O(n + max value). However, with an extra loop, we can find the minimum and maximum values of the given array:
for i = 0 to Length(given_array) - 1:
    if given_array[i] > max:
        max = given_array[i]
    if given_array[i] < min:
        min = given_array[i]
Then we use that data to create the count array, let's say for values between 95 and 100. We could decrease the runtime tremendously in some cases. However, I haven't seen an approach like this. Would it still be a counting sort algorithm, or does it have another name that I don't know?
Counting sort is typically used when we know upfront that values will be restricted to a certain range.
This range doesn't need to start at zero; it's absolutely fine to use an array of length six whose elements represent the counts of values 95 through 100 (or, for that matter, the counts of values from −2 to 3). So, yes, your approach is still "counting sort".
But if you don't know this restriction upfront, you're not likely to get faster results by doing a complete pass over the data to check.
For example: suppose you have 1,000,000 elements, and you know they're all somewhere in the range 0–200, but you think they're probably all in a much narrower range. Well, the cost of prescanning the entire input array is going to be greater than the cost of working with a 201-element working array, which means it costs more than it can possibly save compared to just doing a counting sort with the range 0–200.
Runtime of this method is O(n + Max value).
The runtime is O(max(num_elements, range_size)), which — due to the magic of Landau (big-O) notation — is the same as O(num_elements + range_size). Your approach only affects the asymptotic complexity if max_value is asymptotically greater than both num_elements and range_size.
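For concreteness, here is a minimal counting-sort sketch in Java that first measures the actual [min, max] range, as the question proposes (the method and variable names are illustrative):

static void countingSortByRange(int[] a) {
    if (a.length == 0) return;
    int min = a[0], max = a[0];
    for (int x : a) {                          // the extra pass: O(n)
        if (x < min) min = x;
        if (x > max) max = x;
    }
    int[] count = new int[max - min + 1];      // one bucket per value in [min, max]
    for (int x : a) count[x - min]++;
    int i = 0;
    for (int value = min; value <= max; value++)
        for (int c = 0; c < count[value - min]; c++)
            a[i++] = value;                    // write values back in sorted order
}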

Most Efficient way to compute the 99th percentile of a data set

I have 100 Integers in my database.
I sort them in ascending order.
Right now for the 99th percentile I am taking the 99th number after sorting.
After a given time t, a new number comes into the database and an older number gets discarded.
The current code just takes the 100 integers and sorts them all over again.
Since there are 99 numbers shared by the set of the original 100 integers and the set of 100 integers after time t, is there a more efficient way of calculating the 99th percentile, 95th percentile, 90th percentile, etc.?
PS: All of this is done in a MySQL database.
Let's call N the size of your array A (here N = 100) and you're looking for the K-th smallest element (after some modification requests).
The easiest solution is probably a kind of modified insertion sort: you keep a (sorted) array of the N-K+1 largest elements (let's call it B).
Discard an element e: walk through B (e.g. while B[i] < e)(*). If B[i] = e, shift all elements < i to the right.
Insert an element e: get the lowest index i such that B[i] > e. Shift all elements >= i to the right and set B[i] := e.
Get the K-th smallest element: return B[0].
Time complexity: O(N-K) per request.
(*) Actually you could speed up the search step using binary search, but it won't change the overall time complexity.
If N-K is very large, it would be interesting to use binary trees instead (with a O(log(N-K)) time complexity per request). But given the actual size of your data sets (and your programming language) it won't pay off.
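For a window this small, a simpler hedged sketch is to keep the entire window sorted (rather than only the top N-K+1 slice) and answer the K-th smallest by index; per update this is O(N) shifting, which is negligible for N = 100 (class and method names are mine):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PercentileWindow {
    private final List<Integer> sorted = new ArrayList<>();

    void add(int x) {                          // insert while keeping the list sorted
        int i = Collections.binarySearch(sorted, x);
        if (i < 0) i = -i - 1;                 // convert "not found" to the insertion point
        sorted.add(i, x);
    }

    void discard(int x) {                      // remove one occurrence of x
        int i = Collections.binarySearch(sorted, x);
        if (i >= 0) sorted.remove(i);
    }

    int kthSmallest(int k) {                   // e.g. k = 99 for the 99th number
        return sorted.get(k - 1);
    }
}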
If your data are randomly distributed, you could try guessing the position by assuming a linear distribution.
guessPosition = (newnumber - min) * 100 / (max - min)
Then do a galloping search outward from that point,
and when the right spot is found, insert the element at the correct position.
So, insert into the normal table, and also add a trigger to insert into an extra, sorted table. Every time you insert into the extra table, add the new element; then, using the index, it should be fast to find the smallest (or largest) element. Drop that element. Now either re-compute the new percentile if the number of items (K) is small, or perhaps keep the sum of the elements stored somewhere, subtract the discarded value and add the new one. Then you have the sum without iterating the whole list, and the total number of elements should also be quick to get from the DB. It should be roughly O(log(N-K)) time. I think this was a Google interview question (minus the DB part).

Running maximum of changing array of fixed size

At first, I am given an array of fixed size, call it v. The typical size of v would be a few thousand entries. I start by computing the maximum of that array.
Following that, I am periodically given a new value for v[i] and need to recompute the value of the maximum.
What is a practically fast way (average time) of computing that maximum?
Edit: we can assume that the process is:
1) uniformly choosing a random entry;
2) changing its value to a value drawn uniformly from [0,1].
I believe this specifies the problem a bit better and allows an unequivocal "best answer" (which will depend on the array size).
You can maintain a max-heap over that array. The heap elements can be indices into the array, and for every element of the array you should also keep an index to its position in the max-heap. That way, every time v[i] is changed, you only need O(log(n)) to maintain the heap (if v[i] increased, its entry moves up in the heap; if v[i] decreased, it moves down).
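A hedged sketch of that idea: the heap stores array indices, and a second array maps each array index back to its heap slot so an update can be repaired with one sift-up or sift-down (all names here are illustrative):

class IndexedMaxHeap {
    private final double[] v;       // the underlying array (modified in place)
    private final int[] heap;       // heap[slot] = array index
    private final int[] where;      // where[arrayIndex] = heap slot

    IndexedMaxHeap(double[] v) {
        this.v = v;
        this.heap = new int[v.length];
        this.where = new int[v.length];
        for (int i = 0; i < v.length; i++) { heap[i] = i; where[i] = i; }
        for (int s = v.length / 2 - 1; s >= 0; s--) siftDown(s);   // O(n) heap build
    }

    double max() { return v[heap[0]]; }

    void update(int i, double newValue) {      // O(log n) per change
        v[i] = newValue;
        int s = where[i];
        siftUp(s);                             // one of these two is a no-op
        siftDown(s);
    }

    private void siftUp(int s) {
        while (s > 0) {
            int parent = (s - 1) / 2;
            if (v[heap[s]] <= v[heap[parent]]) break;
            swap(s, parent);
            s = parent;
        }
    }

    private void siftDown(int s) {
        int n = heap.length;
        while (true) {
            int left = 2 * s + 1, right = left + 1, biggest = s;
            if (left < n && v[heap[left]] > v[heap[biggest]]) biggest = left;
            if (right < n && v[heap[right]] > v[heap[biggest]]) biggest = right;
            if (biggest == s) break;
            swap(s, biggest);
            s = biggest;
        }
    }

    private void swap(int a, int b) {
        int tmp = heap[a]; heap[a] = heap[b]; heap[b] = tmp;
        where[heap[a]] = a;
        where[heap[b]] = b;
    }
}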
If the changes to the array are random, e.g. v[rand()%size] = rand(), then most of the time the max won't decrease.
There are two main ways I can think of to handle this: keep the full collection sorted on the fly, or track just the few (or one) highest elements. The choice depends on the relative importance of worst-case, average case, and fast-path. (Including code and data cache footprint of the common case where the change doesn't affect anything you're tracking.)
Really low complexity / overhead / code size: O(1) average case, O(N) worst case.
Just track the current max, (and optionally its position, if you can't get the old value to see if it == max before applying the change). On the rare occasion that the element holding the max decreased, rescan the whole array. Otherwise just see if the new element is greater than max.
The average complexity should be O(1) amortized: O(N) total for N changes, since on average only one in N changes affects the element holding the max (and only half of those changes decrease it).
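A minimal sketch of this first option in Java (names are illustrative; the array is assumed to be modified only through update()):

class RunningMax {
    private final double[] v;
    private double max;
    private int maxIndex;

    RunningMax(double[] v) { this.v = v; rescan(); }

    void update(int i, double newValue) {
        v[i] = newValue;
        if (newValue >= max) {          // common case: new maximum (or a tie)
            max = newValue;
            maxIndex = i;
        } else if (i == maxIndex) {     // rare case: the element holding the max decreased
            rescan();
        }
    }

    double currentMax() { return max; }

    private void rescan() {             // O(N), but needed only about 1/N of the time
        maxIndex = 0;
        for (int j = 1; j < v.length; j++) if (v[j] > v[maxIndex]) maxIndex = j;
        max = v[maxIndex];
    }
}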
A bit more overhead and code size, but less frequent scans of the full array: O(1) typical case, O(N) worst case.
Keep a priority queue of the 4 or 8 highest elements in the array (position and value). When an element in the PQueue is modified, remove it from the PQueue. Try to re-add the new value to the PQueue, but only if it won't be the smallest element. (It might be smaller than some other element we're not tracking). If the PQueue is empty, rescan the array to rebuild it to full size. The current max is the front of the PQueue. Rescanning the array should be quite rare, and in most cases we only have to touch about one cache line of data holding our PQueue.
Since the small PQueue needs to support fast access to the smallest and the largest element, and even finding elements that aren't the min or max, a sorted-array implementation probably makes the most sense, rather than a heap. If it's only 8 elements, a linear search is probably best, too (from the smallest element upwards, so the search stops right away if the old value of the modified element is less than the smallest value in the PQueue).
If you want to optimize the fast-path (position modified wasn't in the PQueue), you could store the PQueue as struct pqueue { unsigned pos[8]; int val[8]; }, and use vector instructions (e.g. x86 SSE/AVX2) to test i against all 8 positions in one or two tests. Hrm, actually just checking the old val to see if it's less than PQ.val[0] should be a good fast-path.
To track the current size of the PQueue, it's probably best to use a separate counter, rather than a sentinel value in pos[]. Checking for the sentinel every loop iteration is probably slower. (esp. since you'd prob. need to use pos to hold the sentinel values; maybe make it signed after all and use -1?) If there was a sentinel you could use in val[], that might be ok.
slower O(log N) average case, but no full-rescan worst case:
Xiaotian Pei's solution of making the whole array a heap. (This doesn't work if the ordering of v[] matters. You could keep all the elements in a Heap as well as in the ordered array, but that sounds cumbersome.) Re-heapifying after changing a random element will probably write several other cache lines every time, so the common case is much slower than for the methods that only track the top one or few elements.
something else clever I haven't thought of?

set with average O(1) for add/remove and worst max/min O(log n)

Can I have a set where the average add/remove operation is O(1) (this is typical for hashtable-based sets) and worst-case max/min is less than O(n), probably O(log n) (typical for tree-based sets)?
upd hmm, it seems in the simplest case I can just rescan ALL N elements every time the max/min disappears, and in general it gives me O(1). But I apply my algorithm to stock trading, where changes near the min/max are much more likely, so I just don't want to rescan everything every time the max or min disappears; I need something smarter than a full rescan, which is O(n).
upd2 In my case the set contains 100-300 elements. Changes to the max/min elements are very likely, so the max/min changes often, and I need to track the max/min. I still want O(1) for add/remove.
Here's an impossibility result with bad constants for worst-case, non-amortized bounds in a deterministic model where keys can be compared and hashed but nothing else. (Yes, that's a lot of stipulations. I second the recommendation of a van Emde Boas tree.)
As is usually the case with comparison lower bounds, this is an adversary argument. The adversary's game plan is to insert many keys while selectively deleting the ones about which the algorithm has the most information. Eventually, the algorithm will be unable to handle a call to max().
The adversary decides key comparisons as follows. Associated with each key is a binary string. Each key initially has an empty string. When two keys are compared, their strings are extended minimally so that neither is a prefix of the other, and the comparison is decided according to the dictionary order. For example, with keys x, y, z, we could have:
x < y: string(x) is now 0, string(y) is now 1
x < z: string(z) is now 1
y < z: string(y) is now 10, string(z) is now 11.
Let k be a worst-case upper bound on the number of key comparisons made by one operation. Each key comparison increases the total string length by at most two, so for every sequence of at most 3 * n insert/delete operations, the total string length is at most 6 * k * n. If we insert 2 * n distinct keys with interspersed deletions whenever there is a key whose string has length at least 6 * k, then we delete at most n keys on the way to a set with at least n keys where each key has a string shorter than 6 * k bits.
Extend each key's string arbitrarily to 6 * k bits. The (6 * k)-bit prefix of a key's string is the bucket to which the key belongs. Right now, the algorithm has no information about the relative order of keys within a bucket. There are 2 ** (6 * k) buckets, which we imagine laid out left to right in the increasing order dictated by the (6 * k)-bit prefixes. For n sufficiently large, there exists a bucket with a constant (depending on k) fraction of the keys and at least 2 * k times as many keys as the combined buckets to its right. Delete the latter keys, and max() requires a linear number of additional comparisons to sort out the big bucket that now holds the maximum, because at most a little more than half of the required work has been done by the deletions.
Well, you know that max/min < CONST, and the elements are all numbers. Based on this you can get O(1) insertion and O(k + n/k) find min/max. (1)
Have an array of size k; each element in the array will be a hash set. At insertion, insert an element x into array[floor((x - MIN) / (MAX - MIN) * k)] (special case for x = MAX). Assuming uniform distribution of elements, each hash set has an expected number of n/k elements.
At deletion - remove from the relevant set similarly.
findMax() is now done as follows: find the largest index where the set is not empty, which takes O(k) worst case, and then O(n/k) expected to find the maximal element in that first non-empty set.
Finding optimal k:
We need to minimize k+n/k.
d(k + n/k)/dk = 1 - n/k^2 = 0
n = k^2
k = sqrt(n)
This gives us O(sqrt(n) + n/sqrt(n)) = O(sqrt(n)) find min/max on average, with O(1) insertion/deletion.
From time to time you might need to 'reset' the table due to extreme changes of max and min, but given a 'safe boundary' - I believe in most cases this won't be an issue.
Just make sure your MAX is something like 2*max, and MIN is 1/2*min every time you 'reset' the DS.
(1) Assuming all elements are coming from a known distribution. In my answer I assume a uniform distribution - so P(x)=P(y) for each x,y, but it is fairly easy to modify it to any known distribution.
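A hedged Java sketch of this bucketed structure, assuming double-valued elements and known MIN/MAX bounds, with k chosen by the caller (e.g. roughly the square root of the expected size); names are illustrative:

import java.util.HashSet;

class BucketedSet {
    private final double min, max;                 // the assumed value bounds (the CONST above)
    private final HashSet<Double>[] buckets;

    @SuppressWarnings("unchecked")
    BucketedSet(double min, double max, int k) {
        this.min = min;
        this.max = max;
        this.buckets = new HashSet[k];
        for (int i = 0; i < k; i++) buckets[i] = new HashSet<>();
    }

    private int bucketOf(double x) {
        if (x == max) return buckets.length - 1;   // special case for x = MAX
        return (int) Math.floor((x - min) / (max - min) * buckets.length);
    }

    void add(double x)    { buckets[bucketOf(x)].add(x); }      // expected O(1)
    void remove(double x) { buckets[bucketOf(x)].remove(x); }   // expected O(1)

    double findMax() {                             // O(k) bucket scan + O(n/k) expected set scan
        for (int i = buckets.length - 1; i >= 0; i--) {
            if (buckets[i].isEmpty()) continue;
            double best = Double.NEGATIVE_INFINITY;
            for (double x : buckets[i]) best = Math.max(best, x);
            return best;
        }
        throw new IllegalStateException("set is empty");
    }
}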

Find the N-th most frequent number in the array

Find the nth most frequent number in array.
(There is no limit on the range of the numbers)
I think we can
(i) store the occurrence count of every element using a map in C++,
(ii) build a max-heap of the occurrences (frequencies) of the elements in linear time and then extract up to the N-th element;
each extraction takes O(log n) time to re-heapify,
(iii) this gives us the frequency of the N-th most frequent number,
(iv) then we can do a linear search through the hash to find the element having this frequency.
Time - O(N log N)
Space - O(N)
Is there any better method?
It can be done in linear time and space. Let T be the total number of elements in the input array from which we have to find the Nth most frequent number:
Count and store the frequency of every number in the input array in a map. Let M be the total number of distinct elements in the array. So, the size of the map is M. -- O(T)
Find the N-th largest frequency in the map using a selection algorithm (e.g. quickselect). -- O(M)
Total time = O(T) + O(M) = O(T)
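A hedged sketch of these two steps in Java; for brevity the selection step here is a sort of the (usually much smaller) frequency list, and swapping in quickselect would give the expected O(M) bound cited above (names are illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

static int nthMostFrequent(int[] a, int n) {
    Map<Integer, Integer> freq = new HashMap<>();
    for (int x : a) freq.merge(x, 1, Integer::sum);        // count frequencies: O(T)

    List<Integer> counts = new ArrayList<>(freq.values());
    counts.sort(Collections.reverseOrder());               // stand-in for the selection step
    int target = counts.get(n - 1);                        // the N-th largest frequency

    for (Map.Entry<Integer, Integer> e : freq.entrySet())  // find an element carrying it: O(M)
        if (e.getValue() == target) return e.getKey();
    throw new IllegalArgumentException("n is larger than the number of distinct elements");
}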
Your method is basically right. You would avoid the final hash search if you marked each vertex of the constructed heap with the number it represents. Moreover, it is possible to keep a constant watch on the N-th element of the heap as you build it, because at some point the outcome cannot change anymore and the rest of the computation can be dropped. But this would probably not make the algorithm faster in the general case, and maybe not even in special cases. So you answered your own question correctly.
It depends on whether you want the most efficient method or the easiest one to write.
1) If you know that all numbers will be from 0 to 1000, you just make an array of 1001 zeros (occurrence counts), loop through your array and increment the right occurrence slot. Then you sort these counts and select the N-th largest.
2) You have a "bag" of unique items: you loop through your numbers and check whether each number is in the bag; if not, you add it, and if it is, you increment its occurrence count. Then you pick the item with the N-th largest count from it.
Bag can be linear array, BST or Dictionary (hash table).
The question is "N-th most frequent", so I think you cannot avoid sorting (or a clever data structure), so the best complexity cannot be better than O(n*log(n)).
Just wrote a method in Java 8; this is not an efficient solution.
Create a frequency map for each element.
Sort the map contents by value in reverse order.
Skip the first (N-1) entries, then take the first element.
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
// 'frequency' here is the rank N: the N-th most frequent element is returned
private static Integer findMostNthFrequentElement(int[] inputs, int frequency) {
    return Arrays.stream(inputs).boxed()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
            .entrySet().stream().sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            .skip(frequency - 1).findFirst().get().getKey();
}

Resources