Find the 10,000 largest out of 1,000,000 total values - algorithm

I have a file that has 1,000,000 float values in it. I need to find the 10,000 largest values.
I was thinking of:
Reading the file
Converting the strings to floats
Placing the floats into a max-heap (a heap where the largest value is the root)
After all values are in the heap, removing the root 10,000 times and adding those values to a list/arraylist.
I know I will have
1,000,000 inserts into the heap
10,000 removals from the heap
10,000 inserts into the return list
Would this be a good solution? This is for a homework assignment.

Your solution is mostly good. It's basically a heapsort that stops after getting K elements, which improves the running time from O(NlogN) (for a full sort) to O(N + KlogN). Here N = 1000000 and K = 10000.
However, you should not do N inserts to the heap initially, as this would take O(NlogN). Instead, use a heapify operation, which turns an array into a heap in linear time.
If the K numbers don't need to be sorted, you can find the Kth largest number in linear time using a selection algorithm, and then output all numbers larger than it. This gives an O(N) solution.
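A minimal sketch of the heapify-then-pop approach described above, assuming the values are already parsed into a Python list. The function name `k_largest` is made up for illustration; `heapq` is a min-heap, so the sketch negates the values to get max-heap behaviour.

```python
import heapq

def k_largest(values, k):
    """Return the k largest values in descending order (heapify + k pops)."""
    # heapq implements a min-heap, so store negated values to simulate a max-heap.
    heap = [-v for v in values]
    heapq.heapify(heap)                 # linear-time heap construction, O(N)
    k = min(k, len(heap))               # never pop more items than exist
    # Each pop costs O(log N), so the k pops cost O(K log N) in total.
    return [-heapq.heappop(heap) for _ in range(k)]

print(k_largest([3.0, 1.0, 4.0, 1.5, 9.0, 2.6], 3))
```

In practice the standard library's `heapq.nlargest(k, values)` does the same job in one call.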

How about using mergesort (O(n log n) in the worst case) to sort the 1,000,000 values into an array, then taking the last 10,000 directly?

Sorting is expensive, and your input set is not small. Fortunately, you don't care about order. All you need is to know that you have the top X numbers. So, don't sort.
How would you do this problem if, instead of looking for the top 10,000 out of 1,000,000, you were looking for the top 1 (i.e. the single largest value) out of 100? You'd only need to keep track of the largest value you'd seen so far, and compare it to the next number and the next one until you found a larger one or you ran out of input. Could you expand that idea back out to the input size you're looking at? What would be the big-O (hint: you'd only be looking at each input number one time)?
Final note since you said this was homework: if you've just been learning about heaps in class, and you think your teacher/professor is looking for a heap solution, then yes, your idea is good.
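One possible way to expand the "largest seen so far" idea to the top K is to keep only the K best candidates in a small min-heap while reading the input once. This is a sketch of one reading of the hint, not necessarily what the answer had in mind; `top_k_streaming` is a hypothetical name.

```python
import heapq

def top_k_streaming(values, k):
    """Keep the k largest values seen so far while scanning the input once."""
    if k <= 0:
        return []
    heap = []                            # min-heap; heap[0] is the smallest kept value
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)      # still filling up the first k slots
        elif v > heap[0]:
            heapq.heapreplace(heap, v)   # evict the smallest kept value
    return heap                          # the k largest, in heap order (not sorted)

print(sorted(top_k_streaming([5, 1, 9, 3, 7, 2], 3)))
```

Each input value is examined once and costs at most O(log K), so the whole pass is O(N log K) with only O(K) extra memory.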

Could you merge sort the values in the array after you have read them all in? This is a fast way to sort the values. Then, with the array sorted in descending order, your_array[9999] (zero-based) would be the 10,000th largest value, and everything before it would be the rest of the top 10,000. Merge sort sounds like what you want. Also, if you really need speed, you could look into formatting your values for radix sort. That would take a bit of work, but it may well be the fastest way to solve this problem.

Related

how to calculate different groups of one million binary sequences?

I have one million binary sequences, all of the same length, such as (1000010011, 1100110000, ...) and so on. I want to know how many different groups they form (identical sequences belong to the same group). What is the fastest way?
No stoi please.
Depending on the length L of a sequence:
L < ~20: bucket sort
This is short enough in comparison to the input size. A bucket sort with 2^L buckets is all you need. Preallocate an array of size 2^L; since you have ~1 million sequences and 2^20 is ~1 million, you will only need O(n) of additional memory.
Go through your sequence, sort to the buckets
Go through the buckets, count the results. Return them.
And we're done.
The time complexity will be O(n) with O(n) memory cost. This is optimal complexity-wise since you have to visit every element at least once to check its value anyway.
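A rough sketch of the bucket approach for short sequences, assuming each sequence arrives as a string of '0' and '1' characters of length L. The function name and the way the bucket index is built bit by bit are illustrative choices, not part of the original answer.

```python
def count_groups_bucket(sequences, length):
    """Count distinct binary strings using one bucket per possible value."""
    counts = [0] * (1 << length)        # 2^L buckets, one per possible sequence
    # Step 1: go through the sequences and drop each one into its bucket.
    for s in sequences:
        idx = 0
        for ch in s:                    # build the bucket index bit by bit
            idx = (idx << 1) | (ch == "1")
        counts[idx] += 1
    # Step 2: go through the buckets and count the non-empty ones.
    return sum(1 for c in counts if c)

print(count_groups_bucket(["101", "011", "101", "000"], 3))
```

The `counts` list also holds the size of each group, should that be needed.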
L reasonably large: hash table
If you pick a reasonable hashing function and a good size for the hash table (or a dictionary if we need to store the counts) [1], you will have a small number of collisions while inserting. The amortized time will be O(n), since if the hash is good, then each insert is amortized O(1).
As a side note, the bucket sort is technically a perfect hash, since the hash function in this case is a one-to-one function.
L unreasonably large: binary tree
If for some reason constructing a hash is not feasible, or you want consistent worst-case behaviour, then building a balanced binary tree to hold the values is the way to go.
This will take O(n log(n)), as binary trees usually do.
[1] ~2M should be enough and it is still O(n). Maybe you could go even lower, to around 1.5M.
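For the hash-table case, a minimal sketch using Python's built-in hashing, assuming the sequences are strings. `collections.Counter` is a dictionary subclass, so this corresponds to the "dictionary if we need to store the counts" variant; the function name is hypothetical.

```python
from collections import Counter

def group_counts(sequences):
    """Map each distinct sequence to the number of times it occurs."""
    return Counter(sequences)           # one amortized O(1) insert per sequence

groups = group_counts(["1000010011", "1100110000", "1000010011"])
print(len(groups))                      # number of distinct groups
print(groups["1000010011"])             # size of one particular group
```

If only the number of groups matters, `len(set(sequences))` is enough.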

Why we can not apply counting sort to general arrays?

Counting sort is known to run in linear time if we know that all elements in the array are bounded above by a given number. If we take a general array, can't we just scan the array in linear time to find the maximum value, and then apply counting sort?
It is not enough to know the upper bound to run a counting sort: you need to have enough memory to fit all the counters.
Consider a situation when you go through an array of 64-bit integers, and find out that the largest element is 2^60. This would mean two things:
You need O(2^60) memory, and
It is going to take O(2^60) to complete the sort.
The fact that O(2^60) is the same as O(1) is of little help here, because the constant factor is simply too large. This is very often a problem with pseudo-polynomial time algorithms.
Suppose the largest number is like 235684121.
Then you'll spend incredible amounts of RAM to keep your buckets.
I would like to add something to @dasblinkenlight's and @AlbinSunnanbo's answers: your idea of scanning the array in one O(n) pass to find the maximum value is okay. Here is the relevant passage from Wikipedia:
However, if the value of k is not already known then it may be
computed by an additional loop over the data to determine the maximum
key value that actually occurs within the data.
As the time complexity is O(n + k), the k you find should be small for this to pay off. As @dasblinkenlight mentioned, O(large_value) can't practically be treated as O(1).
Though I don't know of any major applications of counting sort other than as a subroutine of radix sort, it can be used nicely in problems like string sorting (i.e. sorting "android" into "addinor"), as here k is only 255.
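A small sketch of counting sort where k is found by an extra pass over the data, as in the Wikipedia passage quoted above. It assumes non-negative integer keys; the function names are made up for illustration, and the string example assumes single-byte characters so that k stays at most 255.

```python
def counting_sort(values):
    """Sort non-negative integers in O(n + k), where k = max(values)."""
    if not values:
        return []
    k = max(values)                     # extra O(n) scan to find the upper bound
    counts = [0] * (k + 1)              # memory grows with k, not with n
    for v in values:
        counts[v] += 1
    out = []
    for key, c in enumerate(counts):    # O(k) walk over all counters
        out.extend([key] * c)
    return out

def sort_string(s):
    """Sort the characters of a string via counting sort on their code points."""
    return "".join(chr(c) for c in counting_sort([ord(ch) for ch in s]))

print(sort_string("android"))           # prints "addinor"
```

The `counts = [0] * (k + 1)` line is where a maximum such as 2^60 becomes a problem: the allocation alone is infeasible regardless of how short the input is.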

Finding the max and min of a BIT in linear or sub-linear time

I have to perform a series of range updates on an array, i.e., adding a constant to or subtracting it from a range. After that I have to find the RANGE of the final array, i.e., (max-min). Initially the numbers are 1 to n.
I'm using a Binary Indexed Tree. Each update is O(log N). I want to know if there is a way to find this RANGE (or max and min) in O(n) or less time. Conventionally, it takes O(n log n).
You need direct indexed access to the array elements since you need to address them for doing the incremental updates.
You also need to maintain a min-heap and max-heap.
When you update an element, you also need to update the corresponding entries in the two heaps. So you need to store the pointers into corresponding elements in the two heaps in the array.
Creating the original heap is O(n) and any modifications are O(lg(N)).
Why not just sort the array once? Then adding or subtracting a constant from the whole array still gives the same ordering, as does multiplying by a positive number. Maybe there's more to the picture though.
This question is almost 2 years old, hence I am not sure if this answer is going to help much. Anyway...
I have never used a BIT to answer minimum or maximum queries. And here there are range updates, which change a lot of numbers all at once, so the maximums and minimums also get updated. As far as I know, BITs are not used for queries other than point queries, range sums, etc.
In general, segment trees provide a better option for searching for minimum and maximum values. After performing all updates, you can find those in O(lg n) time. However, during updates, you must update the min and max values for each node, which can be done using lazy propagation. The update cost is O(lg n).
To sum up, if m lg n < n for your application, you can go with Segment tree, albeit with more space.
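A compact sketch of a segment tree with lazy propagation that supports range add and reports the overall max and min, under the assumption that indices are zero-based and ranges are inclusive. The class and method names are hypothetical.

```python
class MinMaxSegmentTree:
    """Range add with lazy propagation; min and max of the whole array at the root."""

    def __init__(self, values):
        if not values:
            raise ValueError("need at least one value")
        self.n = len(values)
        self.mn = [0] * (4 * self.n)
        self.mx = [0] * (4 * self.n)
        self.lazy = [0] * (4 * self.n)   # pending add not yet pushed to children
        self._build(1, 0, self.n - 1, values)

    def _build(self, node, lo, hi, values):
        if lo == hi:
            self.mn[node] = self.mx[node] = values[lo]
            return
        mid = (lo + hi) // 2
        self._build(2 * node, lo, mid, values)
        self._build(2 * node + 1, mid + 1, hi, values)
        self._pull(node)

    def _pull(self, node):
        self.mn[node] = min(self.mn[2 * node], self.mn[2 * node + 1])
        self.mx[node] = max(self.mx[2 * node], self.mx[2 * node + 1])

    def _apply(self, node, delta):
        # Adding a constant to a whole segment shifts its min and max equally.
        self.mn[node] += delta
        self.mx[node] += delta
        self.lazy[node] += delta

    def add(self, left, right, delta, node=1, lo=0, hi=None):
        """Add delta to every element in [left, right]; O(lg n)."""
        if hi is None:
            hi = self.n - 1
        if right < lo or hi < left:
            return
        if left <= lo and hi <= right:
            self._apply(node, delta)
            return
        if self.lazy[node]:              # push the pending add down one level
            self._apply(2 * node, self.lazy[node])
            self._apply(2 * node + 1, self.lazy[node])
            self.lazy[node] = 0
        mid = (lo + hi) // 2
        self.add(left, right, delta, 2 * node, lo, mid)
        self.add(left, right, delta, 2 * node + 1, mid + 1, hi)
        self._pull(node)

    def value_range(self):
        return self.mx[1] - self.mn[1]   # max - min of the whole array

tree = MinMaxSegmentTree([1, 2, 3, 4, 5])
tree.add(1, 3, 10)                       # array is now [1, 12, 13, 14, 5]
print(tree.value_range())                # prints 13
```

Building the tree is O(n) and each of the m updates is O(lg n), which is where the m lg n versus n comparison comes from.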

Generate N quasi random numbers in less than O(N)

This was inspired by a question at a job interview: how do you efficiently generate N unique random numbers? Their security and distribution/bias don't matter.
I proposed a naive way of calling rand() N times and eliminating dupes by trial and error, thus getting an inefficient and flawed solution. Then I read this SO question; these algorithms are great for getting quality unique numbers and they are O(N).
But I suspect there are ways to get low-quality unique random numbers for dummy tasks in less than O(N) time complexity. I got some possible ideas:
Store many precomputed lists each containing N numbers and retrieve one list randomly. Complexity is O(1) for fixed N. Storage space used is O(NR) where R is number of lists.
Generate N/2 unique random numbers and then divide each of them into 2 unequal parts (floor/ceil for odd numbers, n+1/n-1 for even). I know this is flawed (duplicates can pop up) and O(N/2) is still O(N). This is more food for thought.
Generate one big random number and then squeeze more variants from it by some fixed manipulations like bitwise operations, factorization, recursion, MapReduce or something else.
Use a quasi-random sequence somehow (not a math guy, just googled this term).
Your ideas?
Presumably this routine has some kind of output (i.e. the results are written to an array of some kind). Populating an array (or some other data-structure) of size N is at least an O(N) operation, so you can't do better than O(N).
You can consequently generate a random number, and if the result array contains it, just add to it the maximum number of already generated numbers.
Detecting if a number already generated is O(1) (using a hash set). So it's O(n) and with only N random() calls.
Of course, this is an assumption that we do not overflow the upper limit (i.e. BigInteger).

Is it possible to find two numbers whose difference is minimum in O(n) time

Given an unsorted integer array, and without making any assumptions on
the numbers in the array:
Is it possible to find two numbers whose
difference is minimum in O(n) time?
Edit: Difference between two numbers a, b is defined as abs(a-b)
Find smallest and largest element in the list. The difference smallest-largest will be minimum.
If you're looking for a nonnegative difference, then this is of course at least as hard as checking whether the array has two equal elements. This is called the element uniqueness problem, and without any additional assumptions (like limiting the size of the integers, or allowing operations other than comparison) it requires at least on the order of n log n time. It is the 1-dimensional case of finding the closest pair of points.
I don't think you can do it in O(n). The best I can come up with off the top of my head is to sort them (which is O(n * log n)) and find the minimum difference of adjacent pairs in the sorted list (which adds another O(n)).
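The sort-then-scan idea can be sketched as follows, using Python's built-in comparison sort. The function name is made up for illustration, and the sketch assumes the list holds at least two numbers.

```python
def closest_pair(nums):
    """Return the pair (a, b), a <= b, with the smallest abs(a - b)."""
    if len(nums) < 2:
        raise ValueError("need at least two numbers")
    s = sorted(nums)                     # O(n log n) comparison sort
    # In sorted order the closest pair must be adjacent, so one O(n) pass suffices.
    return min(zip(s, s[1:]), key=lambda pair: pair[1] - pair[0])

print(closest_pair([30, 5, 20, 9]))      # prints (5, 9)
```

The total cost is dominated by the sort, giving O(n log n) overall.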
I think it is possible. The secret is that you don't actually have to sort the list, you just need to create a tally of which numbers exist. This may count as "making an assumption" from an algorithmic perspective, but not from a practical perspective. We know the ints are bounded by a min and a max.
So, create an array of 2 bit elements, 1 pair for each int from INT_MIN to INT_MAX inclusive, set all of them to 00.
Iterate through the entire list of numbers. For each number in the list, if the corresponding 2 bits are 00 set them to 01. If they're 01 set them to 10. Otherwise ignore. This is obviously O(n).
Next, if any of the 2-bit entries is set to 10, that is your answer: the minimum distance is 0 because the list contains a repeated number. If not, scan through the tally array in order and find the minimum distance between consecutive numbers that are present. Many people have already pointed out there are simple O(n) algorithms for this.
So O(n) + O(n) = O(n).
Edit: responding to comments.
Interesting points. I think you could achieve the same results without making any assumptions by finding the min/max of the list first and using a sparse array ranging from min to max to hold the data. Takes care of the INT_MIN/MAX assumption, the space complexity and the O(m) time complexity of scanning the array.
The best I can think of is to counting sort the array (possibly combining equal values) and then do the sorted comparisons -- bin sort is O(n + M) (M being the number of distinct values). This has a heavy memory requirement, however. Some form of bucket or radix sort would be intermediate in time and more efficient in space.
Sort the list with radixsort (which is O(n) for integers), then iterate and keep track of the smallest distance so far.
(I assume your integer is a fixed-bit type. If they can hold arbitrarily large mathematical integers, radixsort will be O(n log n) as well.)
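A sketch of the radix sort route, assuming the inputs are signed 32-bit integers (that fixed width is what makes the number of passes constant). The function names and the choice of 8-bit digits are illustrative assumptions.

```python
def radix_sort_int32(nums):
    """LSD radix sort for signed 32-bit integers: four passes of 8 bits each."""
    offset = 1 << 31
    keys = [x + offset for x in nums]    # shift into the range 0 .. 2^32 - 1
    for shift in range(0, 32, 8):
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)   # stable within each bucket
        keys = [k for bucket in buckets for k in bucket]
    return [k - offset for k in keys]

def min_abs_difference(nums):
    """Smallest abs(a - b) over all pairs; assumes at least two numbers."""
    s = radix_sort_int32(nums)
    return min(b - a for a, b in zip(s, s[1:]))      # adjacent pairs only

print(min_abs_difference([10, -5, 3, 7]))            # prints 3
```

With arbitrarily large integers the number of passes would grow with the key length, which is the caveat raised in the parenthetical above.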
It seems to be possible to sort an unbounded set of integers in O(n*sqrt(log(log(n)))) time. After sorting it is of course trivial to find the minimal difference in linear time.
But I can't think of any algorithm to make it faster than this.
No, not without making assumptions about the numbers/ordering.
It would be possible given a sorted list though.
I think the answer is no, and the proof is similar to the proof that you cannot sort faster than n lg n: you have to compare all of the elements, i.e. create a comparison tree, which implies an Omega(n lg n) algorithm.
EDIT. OK, if you really want to argue, then the question does not say whether it should be a Turing machine or not. With quantum computers, you can do it in linear time :)
