Am writing a logic program for kth_largest(Xs,K) that implements the linear algorithm for finding the
kth largest element K of a list Xs. The algorithm has the following steps:
Break the list into groups of five elements.
Efficiently find the median of each of the groups, which can be done with a fixed number of
comparisons.
Recursively find the median of the medians.
Partition the original list with respect to the median of the medians.
Recursively find the kth largest element in the appropriate smaller list.
How do I go about it? I can select an element from a list, but I don't know how to get the largest using the above procedure. Here is my definition for selecting an element from a list:
select(X, HasXs, OneLessXs)
% The list OneLessXs is the result of removing
% one occurrence of X from the list HasXs.
select(X,[X|Xs],Xs).
select(X,[Y|Ys],[Y|Zs]) :- select(X,Ys,Zs).
I'm going to jump in since no one has attempted an Answer, and hopefully shed some light on the procedure to be programmed.
I've found the Wikipedia article on Selection algorithm to be quite helpful in understanding the bigger picture of "fast" (worst-case linear time) algorithms of this type.
But what you asked at the end of your Question is a somewhat simpler matter. You wrote "How do I go about it? I can select an element from a list but I don't know how to get the largest using the above procedure." (emphasis added by me)
Now there seems to be a bit of confusion about whether you want to implement "the above procedure", which is a general recipe for finding a kth largest element by successive searches for medians, or whether you ask how to use that recipe to find simply the largest element (a special case). Note that the recipe doesn't specifically use a step of finding the largest element on its way to locating the median or the kth largest element.
But you give the code to find an element of a list and the rest of that list after removing that element, a predicate that is nondeterministic and allows backtracking through all members of the list.
The task of finding the largest element is deterministic (at least if all the elements are distinct), and it is an easier task than the general selection of the kth largest element (a task associated with order statistics among other things).
Let's give some simple, hopefully obviously correct, code to find the largest element, and then talk about a more optimized way of doing it.
% maxOfList(X, Xs) :- X is the largest element of the list Xs.
maxOfList(H,[H|T]) :- upperBound(H,T), !.
maxOfList(X,[_|T]) :- maxOfList(X,T).

% upperBound(X, Xs) :- X is greater than or equal to every element of Xs.
upperBound(_,[]).
upperBound(X,[H|T]) :-
    X >= H,
    upperBound(X,T).
The idea should be understandable. We look at the head of the list and ask if that entry is an upper bound for the rest of the list. If so, that must be the maximum value and we're done (the cut makes it deterministic). If not, then the maximum value must occur later in the list, so we discard the head and continue recursively searching for an entry that is an upper bound of all the subsequent elements. The cut is essential here, because we must stop at the first such entry in order to know it is a maximum of the original list.
We've used an auxiliary predicate upperBound/2, which is not unusual, but the overall complexity of this implementation is worst-case quadratic in the length of the list. So there is room for improvement!
Let me pause here to be sure I'm not going totally off-track in trying to address your question. After all you may have meant to ask how to use "the above procedure" to find the kth largest element, and so what I'm describing may be overly specialized. However it may help to understand the cleverness of the general selection algorithms to understand the subtle optimization of the simple case, finding a largest element.
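As a side note, and only to pin down what those steps do (this is Python rather than Prolog, and the helper choices are mine), a rough sketch of the median-of-medians recipe from the question might look like:

def kth_largest(xs, k):
    # k-th largest element of xs, 1-indexed; worst-case linear via median of medians
    if len(xs) <= 5:
        return sorted(xs, reverse=True)[k - 1]
    # Break the list into groups of five and take each group's median
    groups = [xs[i:i + 5] for i in range(0, len(xs), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]
    # Recursively find the median of the medians
    pivot = kth_largest(medians, (len(medians) + 1) // 2)
    # Partition the original list with respect to that pivot
    larger  = [x for x in xs if x > pivot]
    equal   = [x for x in xs if x == pivot]
    smaller = [x for x in xs if x < pivot]
    # Recurse into the part that must contain the k-th largest element
    if k <= len(larger):
        return kth_largest(larger, k)
    if k <= len(larger) + len(equal):
        return pivot
    return kth_largest(smaller, k - len(larger) - len(equal))

For example, kth_largest([3, 1, 4, 1, 5, 9, 2, 6], 2) is 6.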
Added:
Intuitively we can reduce the number of comparisons needed in the worst case
by going through the list and keeping track of the largest value found "so
far". In a procedural language we can easily accomplish this by reassigning
the value of a variable, but Prolog doesn't allow us to do that directly.
Instead a Prolog way of doing this is to introduce an extra argument and
define the predicate maxOfList/2 by a call to an auxiliary predicate
with three arguments:
maxOfList(X,[H|T]) :- maxOfListAux(X,H,T).
The extra argument in maxOfListAux/3 can then be used to track the
largest value "so far" as follows:
maxOfListAux(X,X,[]).
maxOfListAux(Z,X,[H|T]) :-
    ( X >= H -> Y = X ; Y = H ),
    maxOfListAux(Z,Y,T).
Here the first argument of maxOfListAux represents the final answer as to
the largest element of the list, but we don't know that answer until we
have emptied the list. So the first clause here "finalizes" the answer
when that happens, unifying the first argument with the second argument
(the largest value "so far") just when the tail of the list has reached
the end.
The second clause for maxOfListAux leaves the first argument unbound and
"updates" the second argument according to whether the next element of the
list exceeds the previous largest value or not.
It isn't strictly necessary to use an auxiliary predicate in this case,
because we might have kept track of the largest value found by using the
head of the list instead of an extra argument:
maxOfList(X,[X]) :- !.
maxOfList(X,[H1,H2|T]) :-
    ( H1 >= H2 -> Y = H1 ; Y = H2 ),
    maxOfList(X,[Y|T]).
Codility, lesson 14, task TieRopes (https://codility.com/demo/take-sample-test/tie_ropes). Stated briefly, the problem is to partition a list A of positive integers into the maximum number of (contiguous) sublists having sum at least K.
I've only come up with a greedy solution because that's the name of the lesson. It passes all the tests but I don't know why it is an optimal solution (if it is optimal at all).
int solution(int K, vector<int> &A) {
    int sum = 0, count = 0;
    for (int a : A)
    {
        sum += a;          // keep tying adjacent ropes onto the current one
        if (sum >= K)
        {
            ++count;       // the current tied rope is long enough: cut here
            sum = 0;       // and start a new rope
        }
    }
    return count;
}
Can somebody tell me if and why this solution is optimal?
Maybe I'm being naive or making some mistake here, but I think it is not too hard (although not obvious) to see that the algorithm is indeed optimal.
Suppose that you have an optimal partition of the list, that is, one with the maximum number of sublists. It may or may not use all of the elements of the list, but since adding an element to a valid sublist keeps it valid, let's suppose that any "remaining" element that was initially not assigned to any sublist is assigned arbitrarily to one of its adjacent sublists; so we have a proper optimal partition of the list, which we will call P1.
Now let's think about the partition that the greedy algorithm would produce, say P2. There are two things that can happen to the first sublist in P2:
It can be the same as the first sublist in P1.
It can be shorter than the first sublist in P1.
In 1. you would repeat the reasoning starting in the next element after the first sublist. If every subsequent sublist produced by the algorithm is equal to that in P1, then P1 and P2 will be equal.
In 2. you would also repeat the reasoning, but now you have at least one "extra" item available. So, again, the next sublist may:
2.1. Get as far as the next sublist in P1.
2.2. End before the next sublist in P1.
And repeat. So, in every case, you will end up with at least as many sublists as P1, which means that P2 is at least as good as any possible partition of the list and, in particular, as any optimal partition.
It's not a very formal demonstration, but I think it's valid. Please point out anything you think may be wrong.
Here are the ideas that lead to a formal proof.
If A is a suffix of B, then the maximum partition size for A is less than or equal to the maximum partition size for B, because we can extend the first sublist of a partition of A to include the new elements without decreasing its sum.
Every proper prefix of every sublist in the greedy solution sums to less than K.
There is no point in having gaps, because we can add the missing elements to an adjacent list (I thought that my wording of the question had ruled out this possibility by definition, but I'll say it anyway).
The formal proof can be carried out by induction to show that, for every nonnegative integer i, there exists an optimal solution that agrees with the greedy solution on the first i sublists of each. It follows that, when i is sufficiently large, the only solution that agrees with greedy is greedy, so the greedy solution is optimal.
The basis i = 0 is trivial, since an arbitrary optimal solution will do. The inductive step consists of finding an optimal solution that agrees with greedy on the first i sublists and then shrinking the (i+1)th sublist to match the greedy solution (by observation 2, we really are shrinking that sublist, since it starts at the same position as greedy's; by observation 1, we can extend the (i+2)th sublist of the optimal solution correspondingly).
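If the proof still feels slippery, a quick sanity check against brute force on small inputs is easy to write. Here is a hypothetical Python sketch (the function names and test harness are mine, not from the question) that compares the greedy count with the maximum over all ways of cutting the list into contiguous blocks:

from itertools import combinations
import random

def greedy(K, A):
    total, count = 0, 0
    for a in A:
        total += a
        if total >= K:
            count += 1
            total = 0
    return count

def brute_force(K, A):
    n, best = len(A), 0
    for r in range(n):                               # number of cut positions
        for cuts in combinations(range(1, n), r):    # where to cut between elements
            bounds = (0, *cuts, n)
            blocks = [A[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]
            best = max(best, sum(sum(b) >= K for b in blocks))
    return best

random.seed(0)
for _ in range(200):
    A = [random.randint(1, 5) for _ in range(random.randint(1, 8))]
    K = random.randint(1, 10)
    assert greedy(K, A) == brute_force(K, A), (K, A)
print("greedy matched brute force on 200 random cases")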
I need an efficient way of calculating the minimum edit distance between two unordered collections of symbols. Like in the Levenshtein distance, which only works for sequences, I require insertions, deletions, and substitutions with different per-symbol costs. I'm also interested in recovering the edit script.
Since what I'm trying to accomplish is very similar to calculating string edit distance, I figured it might be called unordered string edit distance or maybe just set edit distance. However, Google doesn't turn up anything with those search terms, so I'm interested to learn if the problem is known by another name?
To clarify, the problem would be solved by
def unordered_edit_distance(target, source):
    return min(edit_distance(target, source_perm)
               for source_perm in permutations(source))
So, for instance, unordered_edit_distance('abc', 'cba') would be 0, whereas edit_distance('abc', 'cba') is 2. Unfortunately, the number of permutations grows very quickly and this is not practical even for moderately sized inputs.
EDIT Make it clearer that operations are associated with different costs.
Sort them (not strictly necessary), then remove the items that occur in both sets (in equal numbers!).
Then, if the remaining sets are equal in size, you need that number of substitutions; if one is larger, you also need some insertions or deletions. Either way, the number of operations equals the size of the larger set remaining after the first phase.
Although your observation is kind of correct, you are actually making a simple problem more complex.
Since source can be any permutation of the original source, you only need to check the difference at the character level.
Keep two maps, one counting the individual characters in your target string and one for your source string, for example:
a: 2
c: 1
d: 100
Now compare the two maps: if you are missing a character you need to insert it, and if you have an extra character you delete it. That's it.
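A small Python sketch of that two-count-maps comparison (unit costs, insertions and deletions only; the function name is mine), using collections.Counter:

from collections import Counter

def insert_delete_ops(source, target):
    s, t = Counter(source), Counter(target)
    deletions  = s - t    # characters source has too many of
    insertions = t - s    # characters source is missing
    return sum(deletions.values()) + sum(insertions.values())

print(insert_delete_ops('abc', 'cba'))    # 0
print(insert_delete_ops('aab', 'abcd'))   # 3: delete one 'a', insert 'c' and 'd'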
Let's ignore substitutions for a moment.
Now it becomes a fairly trivial problem of determining the elements only in the first set (which would count as deletions) and those only in the second set (which would count as insertions). This can easily be done by either:
Sorting the sets and iterating through both at the same time, or
Inserting each element from the first set into a hash table, then removing each element from the second set from the hash table, with each element not found being an insertion and each element remaining in the hash table after we're done being a deletion
Now, to include substitutions, all that remains is finding the optimal pairing of inserted elements to deleted elements. This is actually the stable marriage problem:
The stable marriage problem (SMP) is the problem of finding a stable matching between two sets of elements given a set of preferences for each element. A matching is a mapping from the elements of one set to the elements of the other set. A matching is stable whenever it is not the case that both:
Some given element A of the first matched set prefers some given element B of the second matched set over the element to which A is already matched, and
B also prefers A over the element to which B is already matched
Which can be solved with the Gale-Shapley algorithm:
The Gale–Shapley algorithm involves a number of "rounds" (or "iterations"). In the first round, first a) each unengaged man proposes to the woman he prefers most, and then b) each woman replies "maybe" to her suitor she most prefers and "no" to all other suitors. She is then provisionally "engaged" to the suitor she most prefers so far, and that suitor is likewise provisionally engaged to her. In each subsequent round, first a) each unengaged man proposes to the most-preferred woman to whom he has not yet proposed (regardless of whether the woman is already engaged), and then b) each woman replies "maybe" to her suitor she most prefers (whether her existing provisional partner or someone else) and rejects the rest (again, perhaps including her current provisional partner). The provisional nature of engagements preserves the right of an already-engaged woman to "trade up" (and, in the process, to "jilt" her until-then partner).
We just need to get the cost correct. To pair an insertion and deletion, making it a substitution, we'll lose both the cost of the insertion and the deletion, and gain the cost of the substitution, so the net cost of the pairing would be substitutionCost - insertionCost - deletionCost.
Now the above algorithm guarantees that every insertion or deletion gets paired - we don't necessarily want this, but there's an easy fix: just create a bunch of "stay-as-is" elements (on both the insertion and deletion side). Any insertion or deletion paired with a "stay-as-is" element would have a cost of 0 and would remain an insertion or deletion, and nothing would happen for two "stay-as-is" elements that end up paired.
Key observation: you are only concerned with how many 'a's, 'b's, ..., 'z's or other alphabet characters are in your strings, since you can reorder all the characters in each string.
So, the problem boils down to the following: having s['a'] characters 'a', s['b'] characters 'b', ..., s['z'] characters 'z', transform them into t['a'] characters 'a', t['b'] characters 'b', ..., t['z'] characters 'z'. If your alphabet is short, s[] and t[] can be arrays; generally, they are mappings from the alphabet to integers, like map <char, int> in C++, dict in Python, etc.
Now, for each character c, you know s[c] and t[c]. If s[c] > t[c], you must remove s[c] - t[c] characters c from the first unordered string (s). If s[c] < t[c], you must add t[c] - s[c] characters c to it.
Take X, the sum of s[c] - t[c] over all c such that s[c] > t[c]: this is the total number of characters you have to remove from s. Take Y, the sum of t[c] - s[c] over all c such that s[c] < t[c]: this is the total number of characters you have to add to s.
Now, let Z = min(X, Y). We can turn Z removal/addition pairs into substitutions, and what's left is X - Z deletions and Y - Z insertions. Thus the total number of operations is Z + (X - Z) + (Y - Z), or X + Y - min(X, Y).
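A minimal Python sketch of this computation, assuming unit costs for all three operations (the weighted case the question asks about needs more care than this):

from collections import Counter

def unordered_edit_distance(source, target):
    s, t = Counter(source), Counter(target)
    X = sum((s - t).values())        # characters to remove from source
    Y = sum((t - s).values())        # characters to add to source
    Z = min(X, Y)                    # removal/addition pairs turned into substitutions
    return Z + (X - Z) + (Y - Z)     # = X + Y - min(X, Y)

print(unordered_edit_distance('abc', 'cba'))   # 0
print(unordered_edit_distance('aab', 'abc'))   # 1: substitute an 'a' with a 'c'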
Say [a,b] represents the interval on the real line from a to b, a < b, inclusive (ie, [a,b] = set of all x such that a<=x<=b). Also, say [a,b] and [c,d] are 'overlapping' if they share any x such that x is in both [a,b] and [c,d].
Given a list of intervals, ([x1,y1],[x2,y2],...), what is the most efficient way to find all such intervals that overlap with [x,y]?
Obviously, I can try each and get it in O(n). But I was wondering if I could sort the list of intervals in some clever way, so that I could find one overlapping item in O(log N) via a binary search, and then 'look around' from that position in the list to find all overlapping intervals. However, how do I sort intervals so that such a strategy would work?
Note that the intervals in the list may themselves overlap one another, which is what makes this hard.
I've tried sorting the intervals by their left end, right end, and middle, but none of these orderings seems to allow an exhaustive search of that kind.
Help?
For completeness' sake, I'd like to add that there is a well-known data structure for just this sort of problem, known (surprise, surprise) as an interval tree. It's basically an augmented balanced tree (red-black, AVL, your pick) that stores intervals sorted by their left (low) endpoint. The augmentation is that each node stores the largest right (high) endpoint in its subtree. This tree lets you find one overlapping interval in O(log n) time, and all m overlapping intervals in O(min(n, m log n)) time.
It's described in CLRS 14.3.
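Here is a minimal, deliberately unbalanced Python sketch of that augmented-tree idea (class and function names are mine; a real interval tree keeps the tree balanced, e.g. red-black, exactly as described above):

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi        # the stored interval [lo, hi]
        self.max = hi                    # largest right endpoint in this subtree
        self.left = self.right = None

def insert(root, lo, hi):
    if root is None:
        return Node(lo, hi)
    if lo < root.lo:
        root.left = insert(root.left, lo, hi)
    else:
        root.right = insert(root.right, lo, hi)
    root.max = max(root.max, hi)
    return root

def query(root, x, y, out):
    """Collect every stored [lo, hi] overlapping the closed interval [x, y]."""
    if root is None:
        return out
    # Only descend left if some interval there can still reach x.
    if root.left is not None and root.left.max >= x:
        query(root.left, x, y, out)
    if root.lo <= y and root.hi >= x:    # closed intervals overlap
        out.append((root.lo, root.hi))
    # Right-subtree intervals start at >= root.lo; prune once that is past y.
    if root.right is not None and root.lo <= y:
        query(root.right, x, y, out)
    return out

root = None
for lo, hi in [(1, 4), (3, 6), (4, 5), (2, 8), (5, 7), (1, 2), (2, 2.5)]:
    root = insert(root, lo, hi)
print(query(root, 3, 4, []))             # [(1, 4), (2, 8), (3, 6), (4, 5)]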
[a, b] overlaps with [x, y] iff b >= x and a <= y (the intervals are closed). Sorting the intervals by their last elements lets you find those matching the first condition via binary search; sorting them by their first elements lets you find those matching the second condition. Take the intersection of the resulting sets.
A 'quadtree' is a data structure often used to improve the efficiency of collision detection in 2 dimensions.
I think you could come up with a similar 1-d structure. This would require some pre-computation but should result in O(log N) performance.
Basically you start with a root 'node' that covers all possible intervals, and when adding a node to the tree, you decide if it falls on the left or the right of the midpoint. If it crosses the mid point, you break it into two intervals (but record the original parent) and recursively proceed from there. You can set a limit on the depth of the tree, which can save memory and improve performance, but comes at the expense of complicating things a little (you need to store a list of intervals in your nodes).
Then when checking an interval, you basically find all leaf nodes that it would be inserted into were it inserted, check the partial intervals within those nodes for intersection, and then report the interval that is recorded against them as the 'original' parent.
Just a quick thought 'off the cuff' so to speak.
Could you organize them into 2 lists, one for the starts of the intervals and the other for the ends of the intervals?
This way, you can compare y to the items in the start of interval list (say by binary search) to cut down the candidates based on that.
You can then compare x to the items in the end of interval list.
EDIT
Case: Once Off
If you are comparing only a single interval to the list of intervals in a once-off situation, I don't believe sorting will help you out, since the sort itself costs at least as much as the single O(n) linear scan that already answers the query.
By doing a linear search through all x's to trim out any impossible intervals then doing another linear search through the remaining y's you can reduce your total work. While this is still O(n), without this you would be doing 2n comparisons, whereas on average, you would only do (3n-1)/2 comparisons this way.
I believe this is the best you can do for an unsorted list.
Case: Pre-sorting doesn't count
In the case where you will be repeatedly comparing single intervals to this list of intervals and you pre-sort your list, you can achieve better results. The process above still applies, but by doing a binary search on the first list and then the second you can get O(m log n) as opposed to O(mn), where m is the number of single intervals being compared. Note, this still gives you the advantage of reducing total comparisons. [2m log n compared to m(3(log n) - 1)/2]
You could sort by both the left end and the right end at the same time and use both lists to eliminate non-overlapping values. If the list is sorted by the left end then none of the intervals to the right of the right end of the test range can overlap. If the list is sorted by the right end then none of the intervals to the left of the left end of the test range can overlap.
For example if the intervals are
[1,4], [3,6], [4,5], [2,8], [5,7], [1,2], [2,2.5]
and you're finding overlap with [3,4] then sorting by left end and marking position of the right end of the test (with the right end as just greater than its value so that 4 is included in the range)
[1,4], [1,2], [2,2.5], [2,8], [3,6], [4,5], *, [5,7]
you know [5,7] can't overlap, then sorting by right end and marking position of the left end of the test
[1,2], [2,2.5], *, [1,4], [4,5], [3,6], [5,7], [2,8]
you know [1,2] and [2,2.5] can't overlap
Not sure how efficient this would be since you're having to do two sorts and searches.
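To make the two-orderings idea concrete, here is a small Python sketch (the function name is mine) that uses bisect on the example intervals above, keeping indices so the two candidate sets can be intersected:

import bisect

def overlapping(intervals, x, y):
    by_start = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    by_end   = sorted(range(len(intervals)), key=lambda i: intervals[i][1])
    starts = [intervals[i][0] for i in by_start]
    ends   = [intervals[i][1] for i in by_end]
    # left end <= y: a prefix of the start-sorted order
    cond_a = set(by_start[:bisect.bisect_right(starts, y)])
    # right end >= x: a suffix of the end-sorted order
    cond_b = set(by_end[bisect.bisect_left(ends, x):])
    return [intervals[i] for i in sorted(cond_a & cond_b)]

ivs = [(1, 4), (3, 6), (4, 5), (2, 8), (5, 7), (1, 2), (2, 2.5)]
print(overlapping(ivs, 3, 4))   # [(1, 4), (3, 6), (4, 5), (2, 8)]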
As you can see in other answers, most algorithms come together with a special data structure. For example, for an unsorted list of intervals as input, O(n) is the best that you'll get. (And usually it's easier to think in terms of the data structure, which then dictates the algorithm.)
In this case, your question is not complete:
Are you given the whole list, or is it you who actually creates it?
Do you have to perform just one such lookup or many of them?
Do you have any estimations for operations it should support and their frequencies?
For example, if you have to perform just one such lookup, then it's not worth sorting the list first. If you have to perform many, then the more expensive sorting or construction of a "1D quadtree" would be amortized.
However, that structure would make it difficult, because a simple quadtree (as I understand it) is only able to detect a collision; it is not able to produce the list of all the segments that overlap with your input.
One simple implementation would be a list ordered by coordinate into which you insert all the segment endpoints, each with a start/end flag and its segment number. You can then answer a query by scanning this list (still O(n), but I doubt you can make it faster if you also need the list of all the overlapping segments) while keeping track of all the segments that have been opened but not yet closed at the "check points".
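A rough Python sketch of that ordered endpoint scan (flag conventions and names are mine): it reports every segment that closes inside the query, plus everything still open when the scan passes the query's right end:

def overlapping(segments, x, y):
    # One sorted pass over all endpoints; kind 0 = start, 1 = end, so at equal
    # coordinates a start is processed before an end (the intervals are closed).
    events = sorted((c, kind, i)
                    for i, (a, b) in enumerate(segments)
                    for kind, c in ((0, a), (1, b)))
    open_ids, hits = set(), set()
    for c, kind, i in events:
        if c > y:
            break                    # every later event lies past the query
        if kind == 0:
            open_ids.add(i)          # segment i is now open
        elif c < x:
            open_ids.discard(i)      # closed before the query begins: no overlap
        else:
            hits.add(i)              # closes inside [x, y]: it overlaps
    hits |= open_ids                 # still open when we stopped: also overlaps
    return [segments[i] for i in sorted(hits)]

print(overlapping([(1, 4), (3, 6), (4, 5), (2, 8), (5, 7), (1, 2), (2, 2.5)], 3, 4))
# prints [(1, 4), (3, 6), (4, 5), (2, 8)]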
Say I have a linked list of numbers of length N. N is very large and I don’t know in advance the exact value of N.
How can I most efficiently write a function that will return k completely random numbers from the list?
There's a very nice and efficient algorithm for this using a method called reservoir sampling.
Let me start by giving you its history:
Knuth calls this Algorithm R on p. 144 of his 1997 edition of Seminumerical Algorithms (volume 2 of The Art of Computer Programming), and provides some code for it there. Knuth attributes the algorithm to Alan G. Waterman. Despite a lengthy search, I haven't been able to find Waterman's original document, if it exists, which may be why you'll most often see Knuth quoted as the source of this algorithm.
McLeod and Bellhouse, 1983 (1) provide a more thorough discussion than Knuth as well as the first published proof (that I'm aware of) that the algorithm works.
Vitter 1985 (2) reviews Algorithm R and then presents an additional three algorithms which provide the same output, but with a twist. Rather than making a choice to include or skip each incoming element, his algorithm predetermines the number of incoming elements to be skipped. In his tests (which, admittedly, are out of date now) this decreased execution time dramatically by avoiding random number generation and comparisons on each in-coming number.
In pseudocode the algorithm is:
Let R be the result array of size s
Let I be an input queue

# Fill the reservoir array
for j in the range [1, s]:
    R[j] = I.pop()

elements_seen = s
while I is not empty:
    elements_seen += 1
    j = random(1, elements_seen)   # inclusive on both ends
    if j <= s:
        R[j] = I.pop()
    else:
        I.pop()
Note that I've specifically written the code to avoid specifying the size of the input. That's one of the cool properties of this algorithm: you can run it without needing to know the size of the input beforehand and it still assures you that each element you encounter has an equal probability of ending up in R (that is, there is no bias). Furthermore, R contains a fair and representative sample of the elements the algorithm has considered at all times. This means you can use this as an online algorithm.
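For reference, a direct Python rendering of that pseudocode might look like this (assuming the input is any iterable with at least s elements; random.randint is inclusive on both ends, matching the pseudocode):

import random

def reservoir_sample(iterable, s):
    it = iter(iterable)
    reservoir = [next(it) for _ in range(s)]   # fill the reservoir with the first s items
    seen = s
    for item in it:
        seen += 1
        j = random.randint(1, seen)            # uniform over [1, seen]
        if j <= s:
            reservoir[j - 1] = item            # keep the newcomer with probability s/seen
    return reservoir

print(reservoir_sample(range(1_000_000), 5))   # a uniform sample of 5 of the million values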
Why does this work?
McLeod and Bellhouse (1983) provide a proof using the mathematics of combinations. It's pretty, but it would be a bit difficult to reconstruct it here. Therefore, I've generated an alternative proof which is easier to explain.
We proceed via proof by induction.
Say we want to generate a set of s elements and that we have already seen n>s elements.
Let's assume that our current s elements have already each been chosen with probability s/n.
By the definition of the algorithm, we choose element n+1 with probability s/(n+1).
Each element already part of our result set has a probability 1/s of being replaced.
The probability that an element from the n-seen result set is replaced in the n+1-seen result set is therefore (1/s)*s/(n+1)=1/(n+1). Conversely, the probability that an element is not replaced is 1-1/(n+1)=n/(n+1).
Thus, the n+1-seen result set contains an element either if it was part of the n-seen result set and was not replaced---this probability is (s/n)*n/(n+1)=s/(n+1)---or if the element was chosen---with probability s/(n+1).
The definition of the algorithm tells us that the first s elements are automatically included as the first n=s members of the result set. Therefore, the n-seen result set includes each element with s/n (=1) probability giving us the necessary base case for the induction.
References
McLeod, A. Ian, and David R. Bellhouse. "A convenient algorithm for drawing a simple random sample." Journal of the Royal Statistical Society. Series C (Applied Statistics) 32.2 (1983): 182-184.
Vitter, Jeffrey S. "Random sampling with a reservoir." ACM Transactions on Mathematical Software (TOMS) 11.1 (1985): 37-57.
This is called a Reservoir Sampling problem. The simple solution is to assign a random number to each element of the list as you see it, then keep the top (or bottom) k elements as ordered by the random number.
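A compact Python sketch of that random-tag idea (the function name is mine); heapq.nsmallest keeps only k tagged elements in memory at a time:

import heapq
import random

def sample_k(iterable, k):
    # Tag each element with a uniform random key and keep the k smallest keys.
    tagged = ((random.random(), x) for x in iterable)
    return [x for _, x in heapq.nsmallest(k, tagged)]

print(sample_k(range(1_000_000), 5))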
I would suggest: First find your k random numbers. Sort them. Then traverse both the linked list and your random numbers once.
If you somehow don't know the length of your linked list (how?), then you could grab the first k into an array, then for node r (counting the nodes from one), generate a random number in [0, r), and if it is less than k, replace the item at that random index in the array. (Not entirely convinced that doesn't bias...)
Other than that: "If I were you, I wouldn't be starting from here." Are you sure a linked list is right for your problem? Is there not a better data structure, such as a good old flat array list?
If you don't know the length of the list, then you will have to traverse it completely to ensure random picks. The method I've used in this case is the one described by Tom Hawtin (54070). While traversing the list you keep k elements that form your random selection to that point. (Initially you just add the first k elements you encounter.) Then, with probability k/i, you replace a random element from your selection with the ith element of the list (i.e. the element you are at, at that moment).
It's easy to show that this gives a random selection. After seeing m elements (m > k), we have that each of the first m elements of the list is part of your random selection with probability k/m. That this initially holds is trivial. Then for each element m+1, you put it in your selection (replacing a random element) with probability k/(m+1). You now need to show that all other elements also have probability k/(m+1) of being selected. We have that the probability is k/m * (k/(m+1)*(1-1/k) + (1-k/(m+1))) (i.e. the probability that the element was already in the selection times the probability that it is still there). With a little algebra you can straightforwardly show that this is equal to k/(m+1).
Well, you do need to know what N is at runtime at least, even if this involves doing an extra pass over the list to count them. The simplest algorithm to do this is to just pick a random index in [0, N) and remove that item, repeated k times. Or, if it is permissible to return repeated numbers, don't remove the item.
Unless you have a VERY large N, and very stringent performance requirements, this algorithm runs with O(N*k) complexity, which should be acceptable.
Edit: Nevermind, Tom Hawtin's method is way better. Select the random numbers first, then traverse the list once. Same theoretical complexity, I think, but much better expected runtime.
Why can't you just do something like
List<int> GetKRandomFromList(List<int> input, int k)
{
    var rand = new Random();
    var ret = new List<int>();
    for (int i = 0; i < k; i++)
        ret.Add(input[rand.Next(0, input.Count)]);   // note: samples with replacement
    return ret;
}
I'm sure that you don't mean something that simple so can you specify further?