Fast random selection algorithm

Fast random selection algorithm - algorithm

Given an array of true/false values, what is the most efficient algorithm to select an index with a true value at random.
A sketch simple algorithm is
a <- the array
c <- 0
for i in a:
if a[i] is true: c++
e <- random number in (0, c-1)
j <- 0
for i in e:
while j is false: j++
return j
Can anyone come up with a faster algorithm? Maybe there is a way to only walk through the list once even if the number of true elements is not known at first?

Use the "pick a random element from an infinite list" algorithm.
Keep an index of your current pick, and also a count of how many true values you've seen.
When you see a true value, increment the count and then replace your pick with the current index with a probability of P=(1/count). (So you always pick the first one you find... then you might switch to the second one, with probability 1/2, then you might switch to the third one with probabilty 1/3 etc.)
This requires only one scan over the list and constant storage. (It does require you to work out a larger number of random numbers, however.) In particular, it doesn't ever require you to either buffer the list or go back to the start - so it can work on an unbounded input stream.
See this answer for a sample LINQ implementation of the simple "pick a random element" algorithm; it would just need minor tweaks.

Build a list with indexes that point to true values and select one of those at random. Requires O(n) for list traversal and one try for the random number.

Related

Giving a set of tuples (value,cost),Is there an algorithm to find the combination of tuple that have the least cost for storing given number

I have a set of (value,cost) tuples which is (2000000,200) , (500000,75) , (100000,20)
Suppose X is any positive number.
Is there an algorithm to find the combination of tuple that have the least cost for the sum of value that can store X.
The sum of tuple values can be equal or greater than the given X
ex.
giving x = 800000 the answer should be (500000,75) , (100000,20) , (100000,20) , (100000,20)
giving x = 900000 the answer should be (500000,75) , (500000,75)
giving x = 1500000 the answer should be (2000000,200)
I can hardcode this but the set and the tuple are subject to change so if this can be substitute with well-known algorithm it would be great.

This can be solved with dinamic programming, as you have no limit on number of tuples and can afford higher sums that provided number.
First, you can optimize tuples. If one big tuple can be replaced by number of smaller ones with equal or lower cost and equal or higher value, you can remove bigger tuple at all.
Also, it's fruitful for future use to order tuples in optimized set by value/cost in descending order. Tuple is better if value/cost is bigger.
Time complexity O(N*T), where N is number divided by common factor (F) of optimized tuple values, and T is number of tuples in optimized tuple set.
Memory complexity O(N).
Set up array a of size N that will contain:
in a[i].cost best cost for solution for i*F, 0 for special case "no solution yet"
in a[i].tuple the tuple that led to best solution
Recursion scheme:
function gets n as a single parameter - it's provided number/F for start, leftover of needed value/F sums for recusion calls
if array a for n is filled, return a[n].cost
otherwise set current_cost to MAXINT
for each tuple from best to worst try to add it to solution:
if value/F >= n, we've got some solution, compare tuple cost to current_cost and if it's better, update a[n].cost and a[n].tuple
if value/F < n, call recursively for n-value/F and compare cost with current solution, update current solution and a[n].cost, a[n].tuple if needed
after all, return a[n].cost or throw exception is no solution exists
Tuple list can be retrieved from a but traverse through .tuple on each step.
It's possible to reduce overall array size down to max(tuple.value/F), but you'll have to save more or less complete solution instead of one best .tuple for each element, and you'll have to make "sliding window" carefully.
It's possible to turn recursion into cycle from 0 to n, as with many other dynamic programming algorithms.

Algorithm to generate a 'nearly sorted' or 'k sorted' list?

I want to generate some test data to test a function that merges 'k sorted' lists (lists where each element is at most k positions away from it's correct sorted position) into a single fully sorted list. I have an approach that works but I'm not sure how well randomized it is and I feel there should be a simpler / more elegant way to do this. My current approach:
Generate n random elements paired with an integer index.
Sort random elements.
Set paired index for each element to its sorted position.
Work backwards through the elements, swapping each element with an element a random distance between 1 and k positions behind it in the list. Only swap with the target element if its paired index is its current index (this avoids swapping an element that is already out of place and moving it further than k positions away from where it should be).
Copy the perturbed elements out into another list.
Like I say, this works but I'm interested in alternative / better approaches.

I think you could just fill an array with random integers and then run quicksort on it with a custom stopping condition.
If in a particular quicksort recursion your start and end indexes are less than k apart, then just return instead of continuing to recur.
Because of how quicksort works, every number in the start..end interval belongs somewhere in that region; worst case is that array[start] might really belong at array[end] (or vice versa) in truly sorted order. So, assuring that start and end are no more than k apart should be sufficient.

You can generate array of random numbers and then h-sort it like in shellsort, but without fiew last sorting steps when h is less then k.

Step 1: Randomly permute disjoint segments of length k. (Eg. 1 to K, k+1 to 2k ...)
Step 2: Permute conditionally again by swapping (that they don't break k-sorted assumption (1+t yo k+t, k+1+t to 1+2k+t ...) where t is a number between 1 and k (most preferably k/2)
Probably repeat step 2 multiple times with different t.

If I understand the problem, you want an algorithm to randomly pick a single k-sorted list of length n, uniformly selected from the universe U of all k-sorted lists of length n. (You will then run this algorithm m times to produce m lists as input test data.)
The first step is to count them. What is the size of U? |U|
The next step is to enumerate them. Create any one-to-one mapping F between the integers (1,2,...,|U|) and k-sorted lists of length n.
Then randomly select an integer x between 1 and |U| inclusive, and then apply F(x) to get the list.

Find random numbers in a given range with certain possible numbers excluded

Suppose you are given a range and a few numbers in the range (exceptions). Now you need to generate a random number in the range except the given exceptions.
For example, if range = [1..5] and exceptions = {1, 3, 5} you should generate either 2 or 4 with equal probability.
What logic should I use to solve this problem?

If you have no constraints at all, i guess this is the easiest way: create an array containing the valid values, a[0]...a[m] . Return a[rand(0,...,m)].
If you don't want to create an auxiliary array, but you can count the number of exceptions e and of elements n in the original range, you can simply generate a random number r=rand(0 ... n-e), and then find the valid element with a counter that doesn't tick on exceptions, and stops when it's equal to r.

Depends on the specifics of the case. For your specific example, I'd return a 2 if a Uniform(0,1) was below 1/2, 4 otherwise. Similarly, if I saw a pattern such as "the exceptions are odd numbers", I'd generate values for half the range and double. In general, though, I'd generate numbers in the range, check if they're in the exception set, and reject and re-try if they were - a technique known as acceptance/rejection for obvious reasons. There are a variety of techniques to make the exception-list check efficient, depending on how big it is and what patterns it may have.

Let's assume, to keep things simple, that arrays are indexed starting at 1, and your range runs from 1 to k. Of course, you can always shift the result by a constant if this is not the case. We'll call the array of exceptions ex_array, and let's say we have c exceptions. These need to be sorted, which shall turn out to be pretty important in a while.
Now, you only have k-e useful numbers to work with, so it'll be meaningful to find a random number in the range 1 to k-e. Say we end up with the number r. Now, we just need to find the r-th valid number in your array. Simple? Not so much. Remember, you can never simply walk over any of your arrays in a linear fashion, because that can really slow down your implementation when you have a lot of numbers. You have do some sort of binary search, say, to come up with a fast enough algorithm.
So let's try something better. The r-th number would nominally have lied at index r in your original array had you had no exceptions. The number at index r is r, of course, since your range and your array indices start from 1. But, you have a bunch of invalid numbers between 1 and r, and you want to somehow get to the r-th valid number. So, lets do a binary search on the array of exceptions, ex_array, to find how many invalid numbers are equal to or less than r, because we have these many invalid numbers lying between 1 and r. If this number is 0, we're all done, but if it isn't, we have a bit more work to do.
Assume you found there were n invalid numbers between 1 and r after the binary search. Let's advance n indices in your array to the index r+n, and find the number of invalid numbers lying between 1 and r+n, using a binary search to find how many elements in ex_array are less than or equal to r+n. If this number is exactly n, no more invalid numbers were encountered, and you've hit upon your r-th valid number. Otherwise, repeat again, this time for the index r+n', where n' is the number of random numbers that lay between 1 and r+n.
Repeat till you get to a stage where no excess exceptions are found. The important thing here is that you never once have to walk over any of the arrays in a linear fashion. You should optimize the binary searches so they don't always start at index 0. Say if you know there are n random numbers between 1 and r. Instead of starting your next binary search from 1, you could start it from one index after the index corresponding to n in ex_array.
In the worst case, you'll be doing binary searches for each element in ex_array, which means you'll do c binary searches, the first starting from index 1, the next from index 2, and so on, which gives you a time complexity of O(log(n!)). Now, Stirling's approximation tells us that O(ln(x!)) = O(xln(x)), so using the algorithm above only makes sense if c is small enough that O(cln(c)) < O(k), since you can achieve O(k) complexity using the trivial method of extracting valid elements from your array first.

In Python the solution is very simple (given your example):
import random
rng = set(range(1, 6))
ex = {1, 3, 5}
random.choice(list(rng-ex))
To optimize the solution, one needs to know how long is the range and how many exceptions there are. If the number of exceptions is very low, it's possible to generate a number from the range and just check if it's not an exception. If the number of exceptions is dominant, it probably makes sense to gather the remaining numbers into an array and generate random index for fetching non-exception.
In this answer I assume that it is known how to get an integer random number from a range.

Here's another approach...just keep on generating random numbers until you get one that isn't excluded.
Suppose your desired range was [0,100) excluding 25,50, and 75.
Put the excluded values in a hashtable or bitarray for fast lookup.
int randNum = rand(0,100);
while( excludedValues.contains(randNum) )
{
randNum = rand(0,100);
}
The complexity analysis is more difficult, since potentially rand(0,100) could return 25, 50, or 75 every time. However that is quite unlikely (assuming a random number generator), even if half of the range is excluded.
In the above case, we re-generate a random value for only 3/100 of the original values.
So 3% of the time you regenerate once. Of those 3%, only 3% will need to be regenerated, etc.

Suppose the initial range is [1,n] and and exclusion set's size is x. First generate a map from [1, n-x] to the numbers [1,n] excluding the numbers in the exclusion set. This mapping with 1-1 since there are equal numbers on both sides. In the example given in the question the mapping with be as follows - {1->2,2->4}.
Another example suppose the list is [1,10] and the exclusion list is [2,5,8,9] then the mapping is {1->1, 2->3, 3->4, 4->6, 5->7, 6->10}. This map can be created in a worst case time complexity of O(nlogn).
Now generate a random number between [1, n-x] and map it to the corresponding number using the mapping. Map looks can be done in O(logn).

You can do it in a versatile way if you have enumerators or set operations. For example using Linq:
void Main()
{
var exceptions = new[] { 1,3,5 };
RandomSequence(1,5).Where(n=>!exceptions.Contains(n))
.Take(10)
.Select(Console.WriteLine);
}
static Random r = new Random();
IEnumerable<int> RandomSequence(int min, int max)
{
yield return r.Next(min, max+1);
}
I would like to acknowledge some comments that are now deleted:
It's possible that this program never ends (only theoretically) because there could be a sequence that never contains valid values. Fair point. I think this is something that could be explained to the interviewer, however I believe my example is good enough for the context.
The distribution is fair because each of the elements has the same chance of coming up.
The advantage of answering this way is that you show understanding of modern "functional-style" programming, which may be interesting to the interviewer.
The other answers are also correct. This is a different take on the problem.

searching through a vast collection of potential solutions

I have a quite difficult problem (perhaps even a NP-hard problem ^^) with looking for a solution in a massive collection of results. Perhaps there is an algorithm for it.
Below exercise is artificial but is a perfect example to illustrate my issue.
There is a big array with integers. Lets say it has 100.000 elements.
int numbers[] = {-123,32,4,-234564,23,5,....}
I want to check in a relatively quick way if a sum on any 2 numbers from this array is equal to 0. In other words, if the array has "-123" I want to find is there also a "123" number.
The easiest solution would be brute force - check everything with everything. That gives 100.000 x 100.000 a big number ;-) Obviously brute force method can by optimised. Order numbers and check negatives against positive only. My question is - is there something better then optimised brute force to find a solution?

First, sort the array by magnitude of the value.
Then, if the data contains a pair which satisfies the conditions you're after, it contains such a pair adjacent in the array. So just sweep through looking for adjacent pairs whose sum is 0.
Overall time complexity is O(n log n) for the sort, could be O(n) if you use "cheating" sorts not based solely on comparisons. Clearly it can't be done in less than linear time, because in the worst case you can't do it without looking at all the elements. I think n log n is probably optimal in the decision tree model of computing, but only because it "feels a bit like" the element uniqueness problem.
Alternative approach:
Add the elements one at a time to a hash-based or tree-based container. Before adding each element, check whether its negative is present. If so, stop.
This is likely to be faster in the case where there are lots of suitable pairs, because you save the cost of sorting the whole data. That said, you could write a modified sort that exits early by checking for adjacent pairs as soon as any subset of the data is in its final order, but that's effort.

Brute force would be an O(n^2) solution. You can certainly do better.
Off the top of my head, first sort it. Heap sort will have a complexity of O(nlogn).
Now, for the first element, say a, you know you need to find an element b, such that a+b = 0. This can be found using binary search (since your array is now sorted). Binary search has a complexity of O(logn).
This gives you an overall solution of O(nlogn) complexity.

The example you provided can be brute-force solved in O(n^2) time.
You can start ordering the numbers (O(n·logn)) from smaller to bigger. If you place one pointer at the beginning (the "most negative number") and other at the end (the "most positive"), you can check if there is such pair of numbers in an additional O(n) steps by following the next procedure:
If the numbers at both pointers have the same module, you have the solution
If not, move the pointer of the number with bigger module towards "zero" (this is, increase if it is the pointer on the negative side, decrease if it is the positive-side one)
Repeat until finding a solution, or the pointers cross.
Total complexity is O(n·logn)+O(n) = O(n·logn).

Sort your array using Quicksort. After this happened, use two indexes, let's call them positive and negative.
positive <- 0
negative <- size - 1
while ((array[positive] > 0) and (array(negative < 0) and (positive >= 0) and (negative < size)) do
delta <- array[positive] + array[negative]
if (delta = 0) then
return true
else if (delta < 0) then
negative <- negative + 1
else
positive <- positive - 1
end if
end while
return (array[positive] * array[negative] = 0)
You didn't say what should the algorithm do if 0 is part of the array, I've supposed that in this case true should be returned.

Finding the hundred largest numbers in a file of a billion

I went to an interview today and was asked this question:
Suppose you have one billion integers which are unsorted in a disk file. How would you determine the largest hundred numbers?
I'm not even sure where I would start on this question. What is the most efficient process to follow to give the correct result? Do I need to go through the disk file a hundred times grabbing the highest number not yet in my list, or is there a better way?

Obviously the interviewers want you to point out two key facts:
You cannot read the whole list of integers into memory, since it is too large. So you will have to read it one by one.
You need an efficient data structure to hold the 100 largest elements. This data structure must support the following operations:
Get-Size: Get the number of values in the container.
Find-Min: Get the smallest value.
Delete-Min: Remove the smallest value to replace it with a new, larger value.
Insert: Insert another element into the container.
By evaluating the requirements for the data structure, a computer science professor would expect you to recommend using a Heap (Min-Heap), since it is designed to support exactly the operations we need here.
For example, for Fibonacci heaps, the operations Get-Size, Find-Min and Insert all are O(1) and Delete-Min is O(log n) (with n <= 100 in this case).
In practice, you could use a priority queue from your favorite language's standard library (e.g. priority_queue from#include <queue> in C++) which is usually implemented using a heap.

Here's my initial algorithm:
create array of size 100 [0..99].
read first 100 numbers and put into array.
sort array in ascending order.
while more numbers in file:
get next number N.
if N > array[0]:
if N > array[99]:
shift array[1..99] to array[0..98].
set array[99] to N.
else
find, using binary search, first index i where N <= array[i].
shift array[1..i-1] to array[0..i-2].
set array[i-1] to N.
endif
endif
endwhile
This has the (very slight) advantage is that there's no O(n^2) shuffling for the first 100 elements, just an O(n log n) sort and that you very quickly identify and throw away those that are too small. It also uses a binary search (7 comparisons max) to find the correct insertion point rather than 50 (on average) for a simplistic linear search (not that I'm suggesting anyone else proffered such a solution, just that it may impress the interviewer).
You may even get bonus points for suggesting the use of optimised shift operations like memcpy in C provided you can be sure the overlap isn't a problem.
One other possibility you may want to consider is to maintain three lists (of up to 100 integers each):
read first hundred numbers into array 1 and sort them descending.
while more numbers:
read up to next hundred numbers into array 2 and sort them descending.
merge-sort lists 1 and 2 into list 3 (only first (largest) 100 numbers).
if more numbers:
read up to next hundred numbers into array 2 and sort them descending.
merge-sort lists 3 and 2 into list 1 (only first (largest) 100 numbers).
else
copy list 3 to list 1.
endif
endwhile
I'm not sure, but that may end up being more efficient than the continual shuffling.
The merge-sort is a simple selection along the lines of (for merge-sorting lists 1 and 2 into 3):
list3.clear()
while list3.size() < 100:
while list1.peek() >= list2.peek():
list3.add(list1.pop())
endwhile
while list2.peek() >= list1.peek():
list3.add(list2.pop())
endwhile
endwhile
Simply put, pulling the top 100 values out of the combined list by virtue of the fact they're already sorted in descending order. I haven't checked in detail whether that would be more efficient, I'm just offering it as a possibility.
I suspect the interviewers would be impressed with the potential for "out of the box" thinking and the fact that you'd stated that it should be evaluated for performance.
As with most interviews, technical skill is one of the the things they're looking at.

Create an array of 100 numbers all being -2^31.
Check if the the first number you read from disk is greater than the first in the list. If it is copy the array down 1 index and update it to the new number. If not check the next in the 100 and so on.
When you've finished reading all 1 billion digits you should have the highest 100 in the array.
Job done.

I'd traverse the list in order. As I go, I add elements to a set (or multiset depending on duplicates). When the set reached 100, I'd only insert if the value was greater than the min in the set (O(log m)). Then delete the min.
Calling the number of values in the list n and the number of values to find m:
this is O(n * log m)

Speed of the processing algorithm is absolutely irrelevant (unless it's completely dumb).
The bottleneck here is I/O (it's specified that they are on disk). So make sure that you work with large buffers.

Keep a fixed array of 100 integers. Initialise them to a Int.MinValue. When you are reading, from 1 billion integers, compare them with the numbers in the first cell of the array (index 0). If larger, then move up to next. Again if larger, then move up until you hit the end or a smaller value. Then store the value in the index and shift all values in the previous cells one cell down... do this and you will find 100 max integers.

I believe the quickest way to do this is by using a very large bit map to record which numbers are present. In order to represent a 32 bit integer this would need to be 2^32 / 8 bytes which is about == 536MB. Scan through the integers simply setting the corresponding bit in the bit map. Then look for the highest 100 entries.
NOTE: This finds the highest 100 numbers not the highest 100 instances of a number if you see the difference.
This kind of approach is discussed in the very good book Programming Pearls which your interviewer may have read!

You are going to have to check every number, there is no way around that.
Just as a slight improvement on solutions offered,
Given a list of 100 numbers:
9595
8505
...
234
1
You would check to see if the new found value is > min value of our array, if it is, insert it. However doing a search from bottom to top can be quite expensive, and you may consider taking a divide and conquer approach, by for example evaluating the 50th item in the array and doing a comparison, then you know if the value needs to be inserted in the first 50 items, or the bottom 50. You can repeat this process for a much faster search as we have eliminated 50% of our search space.
Also consider the data type of the integers. If they are 32 bit integers and you are on a 64 bit system, you may be able to do some clever memory handling and bitwise operations to deal with two numbers on disk at once if they are continual in memory.

I think someone should have mentioned a priority queue by now. You just need to keep the current top 100 numbers, know what the lowest is and be able to replace that with a higher number. That's what a priority queue does for you - some implementations may sort the list, but it's not required.

Assuming that 1 bill + 100ion numbers fit into memory
the best sorting algorithm is heap sort. form a heap and get the first 100 numbers. complexity o(nlogn + 100(for fetching first 100 numbers))
improving the solution
divide the implementaion to two heap(so that insertion are less complex) and while fetching the first 100 elements do imperial merge algorithm.

Here's some python code which implements the algorithm suggested by ferdinand beyer above. essentially it's a heap, the only difference is that deletion has been merged with insertion operation
import random
import math
class myds:
""" implement a heap to find k greatest numbers out of all that are provided"""
k = 0
getnext = None
heap = []
def __init__(self, k, getnext ):
""" k is the number of integers to return, getnext is a function that is called to get the next number, it returns a string to signal end of stream """
assert k>0
self.k = k
self.getnext = getnext
def housekeeping_bubbleup(self, index):
if index == 0:
return()
parent_index = int(math.floor((index-1)/2))
if self.heap[parent_index] > self.heap[index]:
self.heap[index], self.heap[parent_index] = self.heap[parent_index], self.heap[index]
self.housekeeping_bubbleup(parent_index)
return()
def insertonly_level2(self, n):
self.heap.append(n)
#pdb.set_trace()
self.housekeeping_bubbleup(len(self.heap)-1)
def insertonly_level1(self, n):
""" runs first k times only, can be as slow as i want """
if len(self.heap) == 0:
self.heap.append(n)
return()
elif n > self.heap[0]:
self.insertonly_level2(n)
else:
return()
def housekeeping_bubbledown(self, index, length):
child_index_l = 2*index+1
child_index_r = 2*index+2
child_index = None
if child_index_l >= length and child_index_r >= length: # No child
return()
elif child_index_r >= length: #only left child
if self.heap[child_index_l] < self.heap[index]: # If the child is smaller
child_index = child_index_l
else:
return()
else: #both child
if self.heap[ child_index_r] < self.heap[ child_index_l]:
child_index = child_index_r
else:
child_index = child_index_l
self.heap[index], self.heap[ child_index] = self.heap[child_index], self.heap[index]
self.housekeeping_bubbledown(child_index, length)
return()
def insertdelete_level1(self, n):
self.heap[0] = n
self.housekeeping_bubbledown(0, len(self.heap))
return()
def insert_to_myds(self, n ):
if len(self.heap) < self.k:
self.insertonly_level1(n)
elif n > self.heap[0]:
#pdb.set_trace()
self.insertdelete_level1(n)
else:
return()
def run(self ):
for n in self.getnext:
self.insert_to_myds(n)
print(self.heap)
# import pdb; pdb.set_trace()
return(self.heap)
def createinput(n):
input_arr = range(n)
random.shuffle(input_arr)
f = file('input', 'w')
for value in input_arr:
f.write(str(value))
f.write('\n')
input_arr = []
with open('input') as f:
input_arr = [int(x) for x in f]
myds_object = myds(4, iter(input_arr))
output = myds_object.run()
print output

If you find 100th order statistic using quick sort, it will work in average O(billion). But I doubt that with such numbers and due to random access needed for this approach it will be faster, than O(billion log(100)).

Here is another solution (about an eon later, I have no shame sorry!) based on the second one provided by #paxdiablo. The basic idea is that you should read another k numbers only if they're greater than the minimum you already have and that sorting is not really necessary:
// your variables
n = 100
k = a number > n and << 1 billion
create array1[n], array2[k]
read first n numbers into array2
find minimum and maximum of array2
while more numbers:
if number > maximum:
store in array1
if array1 is full: // I don't need contents of array2 anymore
array2 = array1
array1 = []
else if number > minimum:
store in array2
if array2 is full:
x = n - array1.count()
find the x largest numbers of array2 and discard the rest
find minimum and maximum of array2
else:
discard the number
endwhile
// Finally
x = n - array1.count()
find the x largest numbers of array2 and discard the rest
return merge array1 and array2
The critical step is the function for finding the largest x numbers in array2. But you can use the fact, that you know the minimum and maximum to speed up the function for finding the largest x numbers in array2.
Actually, there are lots of possible optimisations since you don't really need to sort it, you just need the x largest numbers.
Furthermore, if k is big enough and you have enough memory, you could even turn it into a recursive algorithm for finding the n largest numbers.
Finally, if the numbers are already sorted (in any order), the algorithm is O(n).
Obviously, this is just theoretically because in practice you would use standard sorting algorithms and the bottleneck would probably be the IO.

There are lots of clever approaches (like the priority queue solutions), but one of the simplest things you can do can also be fast and efficient.
If you want the top k of n, consider:
allocate an array of k ints
while more input
perform insertion sort of next value into the array
This may sound absurdly simplistic. You might expect this to be O(n^2), but it's actually only O(k*n), and if k is much smaller than n (as is postulated in the problem statement), it approaches O(n).
You might argue that the constant factor is too high because doing an average of k/2 comparisons and moves per input is a lot. But most values will be trivially rejected on the first comparison against the kth largest value seen so far. If you have a billion inputs, only a small fraction are likely to be larger than the 100th so far.
(You could construe a worst-case input where each value is larger than its predecessor, thus requiring k comparisons and moves for every input. But that is essentially a sorted input, and the problem statement said the input is unsorted.)
Even the binary-search improvement (to find the insertion point) only cuts the comparisons to ceil(log_2(k)), and unless you special case an extra comparison against the kth-so-far, you're much less likely to get the trivial rejection of the vast majority of inputs. And it does nothing to reduce the number of moves you need. Given caching schemes and branch prediction, doing 7 non-consecutive comparisons and then 50 consecutive moves doesn't seem likely to be significantly faster than doing 50 consecutive comparisons and moves. It's why many system sorts abandon Quicksort in favor of insertion sort for small sizes.
Also consider that this requires almost no extra memory and that the algorithm is extremely cache friendly (which may or may not be true for a heap or priority queue), and it's trivial to write without errors.
The process of reading the file is probably the major bottleneck, so the real performance gains are likely to be by doing a simple solution for the selection, you can focus your efforts on finding a good buffering strategy for minimizing the i/o.
If k can be arbitrarily large, approaching n, then it makes sense to consider a priority queue or other, smarter, data structure. Another option would be to split the input into multiple chunks, sort each of them in parallel, and then merge.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio