Most Efficient way to compute the 99th percentile of a data set - algorithm

I have 100 integers in my database.
I sort them in ascending order.
Right now, for the 99th percentile, I take the 99th number after sorting.
After a given time t, a new number comes into the database and an older number gets discarded.
The current code just takes the 100 integers and sorts them all over again.
Since 99 numbers are shared by the original set of 100 integers and the set of 100 integers after time t, is there a more efficient way of calculating the 99th percentile, 95th percentile, 90th percentile, etc.?
PS: All this is done in a MySQL database.

Let's call N the size of your array A (here N = 100) and you're looking for the K-th smallest element (after some modification requests).
The easiest solution is probably a kind of modified insertion sort: you keep a (sorted) array of the N-K+1 largest elements (let's call it B).
Discard an element e: walk through B (e.g. while B[i] < e)(*). If B[i] = e, shift all elements at indices below i one position to the right.
Insert an element e: get the lowest index i such that B[i] > e. Shift all elements at indices >= i to the right and set B[i] := e.
Get the K-th smallest element: return B[0].
Time complexity: O(N-K) per request.
(*) Actually you could speed up the search step using binary search, but it won't change the overall time complexity.
If N-K is very large, it would be worth using a balanced binary tree instead (with O(log(N-K)) time per request). But given the actual size of your data sets (and your programming language), it won't pay off.
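A minimal Python sketch of the sorted-window idea follows. It is a simplification, not the exact scheme above: it keeps all N values sorted (rather than only the N-K+1 largest) and uses binary search from the standard `bisect` module to locate elements. The class and method names (`SlidingPercentile`, `replace`, `kth_smallest`) are hypothetical, chosen only for illustration.

```python
import bisect


class SlidingPercentile:
    """Keeps a fixed-size window of numbers sorted across updates."""

    def __init__(self, values):
        # One full sort up front; afterwards each update is O(N) at worst
        # (list shifting) instead of O(N log N) for a fresh sort.
        self._sorted = sorted(values)

    def replace(self, old, new):
        """Discard one occurrence of `old` and insert `new`."""
        i = bisect.bisect_left(self._sorted, old)
        if i == len(self._sorted) or self._sorted[i] != old:
            raise ValueError(f"{old!r} is not in the window")
        self._sorted.pop(i)
        bisect.insort(self._sorted, new)

    def kth_smallest(self, k):
        """Return the k-th smallest value (k is 1-based)."""
        return self._sorted[k - 1]


window = SlidingPercentile(range(1, 101))   # the 100 integers 1..100
p99_before = window.kth_smallest(99)        # 99
window.replace(1, 1000)                     # oldest value leaves, new one arrives
p99_after = window.kth_smallest(99)         # 100
```

With this layout the 95th and 90th percentiles come from the same list via `kth_smallest(95)` and `kth_smallest(90)`, so no extra bookkeeping is needed per percentile.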

If your data are randomly distributed, you could try guessing the position by assuming a linear distribution:
guessPosition = (newnumber - min) * 100 / (max - min)
Then do a galloping search outward from that point.
When the position is found, insert the number there.
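The guess-then-gallop idea can be sketched in Python as below. This is an illustrative sketch under assumptions: the guess scales `(x - min)` by the element count over `(max - min)`, which is one reading of the linear-distribution formula, and the function names are hypothetical. The result is the same index that `bisect.bisect_left` would return; only the search order differs.

```python
import bisect


def guess_position(sorted_vals, x):
    """Interpolated guess of where x belongs, clamped to a valid index."""
    lo, hi = sorted_vals[0], sorted_vals[-1]
    n = len(sorted_vals)
    if hi == lo:
        return 0
    pos = int((x - lo) * (n - 1) / (hi - lo))
    return max(0, min(n - 1, pos))


def gallop_insert_index(sorted_vals, x):
    """Leftmost index i with sorted_vals[i] >= x, found by galloping
    outward from the interpolated guess with doubling steps."""
    n = len(sorted_vals)
    if n == 0:
        return 0
    start = guess_position(sorted_vals, x)
    step = 1
    if sorted_vals[start] < x:
        # Answer lies to the right of start: gallop right.
        lo, hi = start, start + step
        while hi < n and sorted_vals[hi] < x:
            lo = hi
            step *= 2
            hi = lo + step
        hi = min(hi, n)
        return bisect.bisect_left(sorted_vals, x, lo + 1, hi)
    # Answer is at or to the left of start: gallop left.
    hi, lo = start, start - step
    while lo >= 0 and sorted_vals[lo] >= x:
        hi = lo
        step *= 2
        lo = hi - step
    return bisect.bisect_left(sorted_vals, x, max(lo + 1, 0), hi)


values = list(range(0, 1000, 10))
index = gallop_insert_index(values, 495)   # 50
values.insert(index, 495)                  # list stays sorted
```

When the data really are close to uniform, the guess lands near the right spot and the gallop ends after a few steps; on skewed data it degrades gracefully to a logarithmic search.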

So, insert into the normal table, and also add a trigger that inserts into an extra, sorted table. Every time you insert into the extra table, add the new element; then, using the index, it should be fast to find the smallest (or largest) element. Drop that element. Now either recompute the percentile, if the number of items (K) is small, or keep the sum of the elements stored somewhere, subtracting the discarded value and adding the new one. That way you have the sum without iterating over the whole list, and the total number of elements should also be quick to get from the DB. This should take roughly O(log(N-K)) time. I think this was a Google interview question (minus the DB part).

Related

Range search with KNN on two different dimensions

I've a few million records (which are updated often) with 2 properties:
Timestamp
Popularity score
I'm looking for a data structure (maybe some metric tree?) that can do fast range search on 1 dimension (e.g. all records greater than a timestamp value), and locate top K records that fall within that range on the other dimension (i.e. popularity score). In other words, I can phrase this query as "Find top K popular records with timestamp greater than T".
I currently have a naive implementation where I filter the N records in linear time complexity and then identify the top K records using a partial sorting algorithm. But this is not fast enough given the number of concurrent users we need to support.
I'm not super familiar with KD trees, but I see that some popular implementations support both range searches and finding K nearest neighbors, but my requirements are a bit peculiar here -- so I'm wondering if there is a way to do this faster, at the expense of maybe additional indexing overhead.
If you invest in initially sorting a list of (record_name, timestamp) tuples by timestamp, and create a dictionary with the record names as keys and (popularity_score, timestamp_list_idx) tuples as values, you will be able to:
Perform a binary search for a particular timestamp in O(log n)
Extract the greater-than values in O(1), since the array is sorted
Look up the matching popularity score in O(1), since the scores are in a dictionary
Update a record's popularity score in O(1), thanks to the dictionary
Locate a particular record's timestamp in O(1) by pulling the record's index from the tuple in the dictionary value (if the new timestamp changes the record's position, the sorted list must also be reordered)
Suppose you have m records within the wanted timestamp range. You can either:
1) Generate a max heap from them by popularity, which takes O(m), and then perform k pops from that heap in O(k log m), since the root must be restored after every pop. This means the whole operation takes O(m + k log m). Assuming k << m, this runs in O(m).
2) Iterate over the m records with a list of size k to keep track of the top k popular records. After passing over all m records you will have the top k in the list. This takes O(m) as well, treating k as a small constant.
Method 1 takes a little more time than method 2 in terms of complexity, but if you suddenly want to know the (k+1)-th most popular record, you can just pop another item from the heap instead of passing over all m records again with a list of length k+1.
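A small Python sketch of the query path described above: a binary search on the timestamp-sorted list, followed by a top-k selection over the matching suffix. The record layout `(timestamp, popularity, name)` and the function name `top_k_after` are assumptions made for illustration. `heapq.nlargest` does the job of the size-k tracking list, keeping a heap of at most k items while scanning.

```python
import bisect
import heapq


def top_k_after(records, timestamps, t, k):
    """Top-k most popular records with timestamp strictly greater than t.

    records    -- list of (timestamp, popularity, name), sorted by timestamp
    timestamps -- the timestamps of `records`, in the same order (kept
                  alongside so the binary search needs no key function)
    """
    start = bisect.bisect_right(timestamps, t)            # O(log n)
    matching = records[start:]                            # the m candidates
    # O(m log k): scans the candidates once with a heap of size k.
    return heapq.nlargest(k, matching, key=lambda r: r[1])


records = [
    (1, 50, "a"),
    (2, 10, "b"),
    (3, 99, "c"),
    (4, 70, "d"),
    (5, 20, "e"),
]
timestamps = [r[0] for r in records]
best = top_k_after(records, timestamps, 2, 2)
# best == [(3, 99, "c"), (4, 70, "d")]
```

The slice `records[start:]` copies the m candidates; iterating with `itertools.islice` would avoid the copy if memory matters.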

Given a set of (value, cost) tuples, is there an algorithm to find the combination of tuples with the least cost for storing a given number?

I have a set of (value, cost) tuples: (2000000,200), (500000,75), (100000,20).
Suppose X is any positive number.
Is there an algorithm to find the combination of tuples with the least cost whose values sum to enough to store X?
The sum of the tuple values can be equal to or greater than the given X.
For example:
given X = 800000, the answer should be (500000,75), (100000,20), (100000,20), (100000,20)
given X = 900000, the answer should be (500000,75), (500000,75)
given X = 1500000, the answer should be (2000000,200)
I can hardcode this, but the set and the tuples are subject to change, so it would be great if this could be replaced with a well-known algorithm.
This can be solved with dynamic programming, as you have no limit on the number of tuples and can accept sums higher than the provided number.
First, you can optimize the tuple set. If one big tuple can be replaced by a number of smaller ones with equal or lower total cost and equal or higher total value, you can remove the bigger tuple altogether.
It is also useful to order the tuples in the optimized set by value/cost in descending order. A tuple is better if its value/cost is bigger.
Time complexity is O(N*T), where N is the number divided by the common factor (F) of the optimized tuple values, and T is the number of tuples in the optimized tuple set.
Memory complexity is O(N).
Set up an array a of size N that will contain:
in a[i].cost, the best cost of a solution for i*F, with 0 as the special case "no solution yet"
in a[i].tuple, the tuple that led to the best solution
Recursion scheme:
the function gets n as its single parameter: the provided number/F at the start, and the leftover of the needed value/F in recursive calls
if array a for n is filled, return a[n].cost
otherwise set current_cost to MAXINT
for each tuple, from best to worst, try to add it to the solution:
if value/F >= n, we have a solution; compare the tuple cost to current_cost and, if it is better, update current_cost, a[n].cost and a[n].tuple
if value/F < n, call recursively for n-value/F, add the tuple cost to the result and compare it with current_cost; update current_cost, a[n].cost and a[n].tuple if needed
after all tuples have been tried, return a[n].cost, or throw an exception if no solution exists
The tuple list can be retrieved from a by following .tuple at each step.
It's possible to reduce the overall array size down to max(tuple.value/F), but you'll have to save a more or less complete solution instead of one best .tuple for each element, and you'll have to implement the "sliding window" carefully.
It's possible to turn the recursion into a loop from 0 to n, as with many other dynamic programming algorithms.
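The loop form of this dynamic program can be sketched in Python as follows. It is a minimal sketch under stated assumptions: all values and costs are positive integers, X is positive, X is rounded up to a multiple of the common factor F, and the pruning of dominated tuples is skipped because it only affects speed. The function name `cheapest_cover` is hypothetical.

```python
from functools import reduce
from math import gcd


def cheapest_cover(tuples, x):
    """Return (total_cost, chosen_tuples) so that the chosen values sum to
    at least x at minimal cost; each tuple may be used any number of times."""
    f = reduce(gcd, (value for value, _ in tuples))   # common factor F
    n = -(-x // f)                                    # ceil(x / F)
    best = [0] + [None] * n       # best[i]: cheapest cost to cover i * F
    choice = [None] * (n + 1)     # choice[i]: tuple that led to best[i]
    for i in range(1, n + 1):
        for value, cost in tuples:
            rest = max(0, i - value // f)   # what is still left to cover
            total = cost + best[rest]
            if best[i] is None or total < best[i]:
                best[i] = total
                choice[i] = (value, cost)
    # Walk back through the stored choices to recover the tuple list.
    chosen, i = [], n
    while i > 0:
        chosen.append(choice[i])
        i = max(0, i - choice[i][0] // f)
    return best[n], chosen


options = [(2000000, 200), (500000, 75), (100000, 20)]
cost_800k, picked_800k = cheapest_cover(options, 800000)     # cost 135
cost_900k, picked_900k = cheapest_cover(options, 900000)     # cost 150
cost_1500k, picked_1500k = cheapest_cover(options, 1500000)  # cost 200
```

For the three examples from the question this yields one 500000 plus three 100000 tuples, two 500000 tuples, and a single 2000000 tuple respectively.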

sum property of consecutive numbers

Suppose we have a list of numbers like [6,5,4,7,3]. How can we tell that the array contains consecutive numbers? One way is of course to sort them, or we can find the minimum and maximum. But can we determine this based on the sum of the elements? E.g. in the example above, it is 25. Could anyone help me with this?
The sum of elements by itself is not enough.
Instead you could check for:
All elements being unique.
and either:
Difference between min and max being n - 1 (where n is the number of elements)
or
Sum of all elements being n*min + n*(n-1)/2, the sum of n consecutive numbers starting at min.
Approach 1
Sort the list and check the first element and last element.
In general this is O( n log(n) ), but if you have a limited data set you can sort in O( n ) time using counting sort or radix sort.
Approach 2
Pass over the data to get the highest and lowest elements.
As you pass through, add each element into a hash table and see if that element has now been added twice. This is more or less O( n ).
Approach 3
To save storage space (hash table), use an approximate approach.
Pass over the data to get the highest and lowest elements.
As you do, run an algorithm that determines, with high (read: user-defined) probability, that each element is distinct. Many such algorithms exist and are in use in data mining. Here's a link to a paper describing different approaches.
The numbers in the array are consecutive if the difference between the maximum and the minimum of the array equals n-1, provided the numbers are unique (where n is the size of the array). And of course the minimum and maximum can be computed in O(n).
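The uniqueness-plus-range check can be sketched in a few lines of Python. This sketch uses a set for the uniqueness test (the hash-table approach) and treats an empty list as not consecutive, which is an assumption; the function name is hypothetical.

```python
def is_consecutive(nums):
    """True if nums, in any order, form a run of consecutive integers."""
    if not nums:
        return False  # assumption: an empty list does not count
    all_unique = len(set(nums)) == len(nums)             # O(n) with hashing
    right_span = max(nums) - min(nums) == len(nums) - 1  # O(n)
    return all_unique and right_span


is_consecutive([6, 5, 4, 7, 3])   # True, sum is 25
is_consecutive([1, 1, 5, 9, 9])   # False, although the sum is also 25
is_consecutive([5, 5, 5, 5, 5])   # False, although the sum is also 25
```

The last two calls show why the sum alone cannot decide the question: very different lists share the sum 25.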

Find the N-th most frequent number in the array

Find the nth most frequent number in array.
(There is no limit on the range of the numbers)
I think we can
(i) store the occurrences of every element using a map in C++
(ii) build a max-heap of the occurrences (or frequencies) of the elements in linear time and then extract up to the N-th element,
Each extraction takes log(n) time to heapify.
(iii) get the frequency of the N-th most frequent number
(iv) then linearly search through the hash to find the element having this frequency.
Time - O(NlogN)
Space - O(N)
Is there any better method ?
It can be done in linear time and space. Let T be the total number of elements in the input array from which we have to find the Nth most frequent number:
Count and store the frequency of every number in the array in a map. Let M be the total number of distinct elements in the array. So, the size of the map is M. -- O(T)
Find Nth largest frequency in map using Selection algorithm. -- O(M)
Total time = O(T) + O(M) = O(T)
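The two steps above can be sketched in Python as follows. The selection step here is a randomized quickselect, which runs in expected (not worst-case) linear time; a median-of-medians selection would be needed for a strict O(M) bound. The function names are hypothetical, and when several numbers share the N-th largest frequency the sketch returns the first one found.

```python
import random
from collections import Counter


def nth_largest(values, n):
    """Randomized quickselect: the n-th largest value (n is 1-based)."""
    values = list(values)
    while True:
        pivot = random.choice(values)
        greater = [v for v in values if v > pivot]
        equal_count = sum(1 for v in values if v == pivot)
        if n <= len(greater):
            values = greater                      # answer is among the larger ones
        elif n <= len(greater) + equal_count:
            return pivot                          # pivot itself is the answer
        else:
            n -= len(greater) + equal_count       # skip everything >= pivot
            values = [v for v in values if v < pivot]


def nth_most_frequent(items, n):
    """Element whose frequency is the n-th largest among distinct elements."""
    counts = Counter(items)                       # O(T)
    if not 1 <= n <= len(counts):
        raise ValueError("n must be between 1 and the number of distinct items")
    target = nth_largest(counts.values(), n)      # expected O(M)
    # Ties: returns the first element with the target frequency.
    return next(item for item, c in counts.items() if c == target)


data = [1, 1, 1, 2, 2, 3]
second = nth_most_frequent(data, 2)   # 2
```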
Your method is basically right. You would avoid the final hash search if you mark each vertex of the constructed heap with the number it represents. Moreover, it is possible to constantly keep watch on the N-th element of the heap as you are building it, because at some point you can get to a situation where the outcome cannot change anymore and the rest of the computation can be dropped. But this would probably not make the algorithm faster in the general case, and maybe not even in special cases. So you answered your own question correctly.
It depends on whether you want the most efficient or the most easy-to-write method.
1) If you know that all numbers will be from 0 to 1000, you just make an array of 1001 zeros (occurrences), loop through your array and increment the right occurrence position. Then you sort these occurrences and select the N-th largest value.
2) You have a "bag" of unique items. You loop through your numbers and check if each number is in the bag: if not, you add it; if it is, you just increment its number of occurrences. Then you pick the item with the N-th largest number of occurrences.
Bag can be linear array, BST or Dictionary (hash table).
The question is "N-th most frequent", so I think you cannot avoid sorting (or clever data structure), so best complexity can not be better than O(n*log(n)).
I just wrote a method in Java 8. This is not an efficient solution.
Create a frequency map for each element
Sort the map content based on values in reverse order.
Skip the first N-1 entries, then take the first remaining entry
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

// Returns the n-th most frequent element (n is 1-based); throws NoSuchElementException if there are fewer than n distinct values.
private static Integer findMostNthFrequentElement(int[] inputs, int n) {
    return Arrays.stream(inputs).boxed()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
            .entrySet().stream()
            .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            .skip(n - 1).findFirst().get().getKey();
}

question about sorting

Bubble sort is O(n) at best, O(n^2) at worst, and its memory usage is O(1) . Merge sort is always O(n log n), but its memory usage is O(n).
Which algorithm would we use to implement a function that takes an array of integers and returns the maximum integer in the collection, assuming that the length of the array is less than 1000? What if the array length is greater than 1000?
A sequential scan of a dataset, looking for a maximum, can be done in O(n) time with O(1) memory usage.
You just set the current maximum to the first element then run through all the other elements, setting the current maximum to the value if the value is greater. Pseudo-code:
max = list[first_index]
for index = first_index+1 to last_index:
    if list[index] > max:
        max = list[index]
The complexity doesn't change regardless of the number of elements in the list so it doesn't matter how many there are.
The running time will change however (since the algorithm is O(n) time) and, if it's important to find the maximum fast, there are a number of possibilities. These all hinge on doing work when the list changes, not every time you want the information, hence they're better for a list that is read more often than written, so the cost can be amortised.
Option 1 is to keep the list sorted so you can just grab the last element. This is probably overkill for just keeping a record of the maximum.
Option 2 is to recalculate the maximum (and number of elements holding it) when you insert into, or delete from, the list. Initially set max to 0 and maxcount to 0 for an empty list.
For an insert:
if maxcount is 0 (the list is empty), set max to this value and maxcount to 1.
otherwise, if this value is greater than max, set max to this value and maxcount to 1.
otherwise, if this value is equal to max, add 1 to maxcount.
For a deletion:
if this value is equal to max, subtract 1 from maxcount.
then, if maxcount is 0, rescan list to get new max and maxcount.
This way, at any time, you have the maximum value (the count is simply an extra "trick" to speed up the algorithm where there is more than one element holding the maximum value). I've used this once before in a data analysis application and it turned out to be much faster than re-sorting - I had to store both minimum and maximum in that case, but it's the same idea.
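Option 2 can be sketched in Python as below. This is an illustrative sketch, not the original application's code: the class name `MaxTracker` and its attributes are hypothetical, the backing store is a plain list (so the deletion itself is O(n); a different container would change that), and an empty tracker reports `None` rather than 0 as its maximum.

```python
class MaxTracker:
    """List wrapper that keeps the current maximum and how often it occurs."""

    def __init__(self):
        self.items = []
        self.max = None      # None while the list is empty (assumption)
        self.maxcount = 0

    def insert(self, value):
        self.items.append(value)
        if self.maxcount == 0 or value > self.max:
            # First element, or a new strict maximum.
            self.max, self.maxcount = value, 1
        elif value == self.max:
            self.maxcount += 1

    def delete(self, value):
        self.items.remove(value)   # raises ValueError if value is absent
        if value == self.max:
            self.maxcount -= 1
            if self.maxcount == 0:
                self._rescan()

    def _rescan(self):
        # Only reached when the last copy of the maximum was deleted.
        if self.items:
            self.max = max(self.items)
            self.maxcount = self.items.count(self.max)
        else:
            self.max = None


tracker = MaxTracker()
for value in (3, 7, 7, 2):
    tracker.insert(value)
# tracker.max == 7, tracker.maxcount == 2
tracker.delete(7)   # still max 7, no rescan needed
tracker.delete(7)   # last 7 gone: rescan finds max 3
```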
Finding the maximum is always O(n) unless the data is presorted, in which case it is cheaper. Sorting before searching is always worse than O(n), so in general 1,000 elements will take about 1,000 comparisons: just compare them one by one. With sorted structures the lookup is inexpensive; otherwise it is expensive. On the other hand, insertions into sorted structures are more expensive, and that trade-off is what makes this a problem.

Resources