Find the middle of a list of unknown size - algorithm

In a recent interview I was asked:
Find the middle element of a sorted List of unknown length starting from the first position.
I responded with this:
Have 2 position counters:
counter1
counter2
Increment counter1 by 1 and counter2 by 2. When counter2 reaches the end of the list, counter1 will be at the middle. I feel this isn't efficient because I am revisiting nodes I have already seen. Either way, is there a more efficient algorithm?

Assuming a linked list, you can do it while visiting arbitrarily-close-to N of the items.
To do 5/4 N:
Iterate over the list until you hit the end, counting the items.
Drop an anchor at every element whose index is a power of 2. Track only the last 2 anchors.
When you hit the end, advance the before-last anchor until it reaches the middle of the list.
At the moment you hit the end of the list, the before-last anchor is before the mid-point but at least half way there already. So N for the full iteration + at most 1/4 N for the anchor = 5/4 N.
Dropping anchors more frequently, for example at every index that is the ceiling of a power of 1.5, gets you as close to N as needed (at the cost of tracking more anchors; but for any given power step X, the asymptotic memory is still constant).
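For concreteness, here is a small sketch of the idea in Python, assuming a minimal Node class with .value/.next and a non-empty list; the names (middle_with_anchors, anchors) are mine, not from the answer:

    class Node:
        def __init__(self, value, next=None):
            self.value, self.next = value, next

    def middle_with_anchors(head):
        count = 0
        node = head
        anchors = [(head, 1), (head, 1)]   # (node, 1-based index): before-last and last anchor
        next_anchor_at = 1                 # drop an anchor at indices 1, 2, 4, 8, ...
        while node is not None:
            count += 1
            if count == next_anchor_at:
                anchors = [anchors[1], (node, count)]
                next_anchor_at *= 2
            node = node.next
        # The before-last anchor is at most count/4 behind the middle; walk it forward.
        mid_node, mid_index = anchors[0]
        target = (count + 1) // 2          # 1-based index of the middle (first of two for even length)
        while mid_index < target:
            mid_node = mid_node.next
            mid_index += 1
        return mid_node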

I assume you're discussing a linked-list. Indeed your solution is an excellent one. An alternative would be to simply traverse the list counting the number of elements, and then start over from the beginning, traversing half the counted amount. Both methods end up traversing 3n/2 nodes, so there isn't much difference.
It's possible there may be slight cache advantages to either method depending on the architecture; the first method might have the advantage of cached nodes, which could mean quicker retrieval if the cache is large enough and the two pointers are not too far apart. Alternatively, the cache might make better use of its blocks if we traverse the list in one go rather than keeping two pointers alive.

Presuming you can detect that you're beyond the end of the list and can seek to an arbitrary position efficiently within it, you could keep doubling the presumed length of the list (guess the length is 1, then 2, then 4, ...) until you're past the end. Then use a binary search between the last attempted value that was within the list and the first value that exceeded the end to find the actual end of the list.
Then, seek the position END_OF_LIST / 2.
That way, you do not have to visit every node.
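A rough Python sketch of that doubling-plus-binary-search idea. It assumes some accessor get(lst, i) that can read the element at index i efficiently and returns None past the end; that accessor is the answer's premise and does not hold for a plain linked list. The function names here are mine:

    def find_length(lst, get):
        # Double a guess until it falls past the end of the list.
        hi = 1
        while get(lst, hi - 1) is not None:
            hi *= 2
        lo = hi // 2                     # last guess known to be within the list (or 0)
        # Binary search in (lo, hi] for the first index past the end.
        while lo + 1 < hi:
            mid = (lo + hi) // 2
            if get(lst, mid - 1) is not None:
                lo = mid
            else:
                hi = mid
        return lo                        # number of elements

    def middle(lst, get):
        n = find_length(lst, get)
        return get(lst, n // 2)

    # Example with a Python list standing in for the seekable structure:
    data = [1, 3, 5, 7, 9]
    get = lambda xs, i: xs[i] if 0 <= i < len(xs) else None
    print(middle(data, get))             # -> 5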

Technically, you could do this in one traversal if you used O(N) memory (assuming you're using a linked list).
Traverse the list and convert it into an array (of pointers if this is a linked list).
Return arr[N/2]
Edit: I do love your answer though!
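A tiny Python sketch of that array-copy approach, assuming a minimal Node with .value/.next and a non-empty list (middle_via_array is my name for it):

    class Node:
        def __init__(self, value, next=None):
            self.value, self.next = value, next

    def middle_via_array(head):
        nodes = []                        # O(N) extra memory: one pointer per node
        node = head
        while node is not None:
            nodes.append(node)
            node = node.next
        return nodes[len(nodes) // 2]     # middle node (the second of two for even length)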

Assuming the regular in-memory linked list that lets you read the next element, given a reference to the current one, as many times as you like (disclaimer: not tested, but the idea should work):
// assume a non-empty list
slow = fast = first;
count = 0;   // ends up equal to (length - 1)
while (fast)
{
    fast = fast.next;
    if (!fast)
        break;
    count++;
    fast = fast.next;
    if (fast)
    {
        count++;
        slow = slow.next;   // advance slow only after fast completes two steps
    }
}
if (count % 2 == 0)
    return slow.data;                            // odd length: slow is the middle node
else
    return (slow.data + slow.next.data) / 2.0;   // even length: average the two middle values
A more difficult case is when the "list" is not a linked list in memory, but rather a stream that you can read in sorted order and want to read each element of only once; for that I do not have a nice solution.

Related

Finding the kth last element of a singly linked list: answer explanation

When you need to find the kth last element of a singly linked list, the usual naive approach is to perform two passes: the first to find the length of the list and the second to iterate until the (length-k)th element.
The optimized version, by contrast, takes advantage of two pointers:
p1 referring to the head of the list
p2 being k elements ahead of p1
This allows us to return p1's element when p2 reaches the end of the list.
I don't understand why the second approach is faster than the first, when in both cases we have one pointer iterating over the whole list and another up to the (length-k)th element.
Is it due to cache optimization?
Thanks.
If you keep p2 exactly k elements behind p1, then it doesn't really help much, since you have to do the same number of traversals all together.
You can optimize the procedure by using more pointers, though.
As you walk through the list, let's say you remember the pointer at every (k/m)th position, for some m. You only need to remember the last m+1 of those pointers. Then, when you get to the end of the list, instead of iterating again from the beginning, start at the oldest pointer you remembered. It will be between k and k + (k/m) elements behind the end, so you only have to move it forward by at most k/m positions.
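A rough Python sketch of that checkpointing idea, assuming a minimal Node with .value/.next, 1 <= k <= length, and k = 1 meaning the last element; kth_last and the default m are my choices, and with integer division slightly more than m+1 checkpoints may be kept:

    from collections import deque

    class Node:
        def __init__(self, value, next=None):
            self.value, self.next = value, next

    def kth_last(head, k, m=4):
        step = max(k // m, 1)                 # remember a checkpoint every (k/m)th node
        keep = -(-k // step) + 1              # roughly m + 1 checkpoints suffice
        checkpoints = deque(maxlen=keep)      # oldest retained checkpoint sits at the left
        index, node = 0, head
        while node is not None:
            if index % step == 0:
                checkpoints.append((node, index))
            node = node.next
            index += 1
        length = index
        node, pos = checkpoints[0]            # a little more than k nodes behind the end
        while pos < length - k:               # short walk forward to index length - k
            node = node.next
            pos += 1
        return node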
Consider non-uniform memory access times and a singly linked list of length n:
- in the counted iteration approach, accesses to the same node will be n accesses apart
- in the lagging pointer approach, accesses to the same node will be k accesses apart
With an LRU cache (/with each LRU cache level), the former is more likely to induce capacity misses than the latter.

Looking for a limited shuffle algorithm

I have a shuffling problem. There are lots of pages and discussions about shuffling an array of values completely, like a stack of cards.
What I need is a shuffle that will uniformly displace the array elements at most N places away from their starting positions.
That is, if N is 2, then element i will be shuffled to a position between i-2 and i+2 (within the bounds of the array).
This has proven to be tricky, with some simple solutions resulting in a directional bias to the element movement, or moving elements by a non-uniform amount.
You're right, this is tricky! First, we need to establish some more rules, to ensure we don't create artificially non-random results:
Elements can be left in the position they started in. This is a necessary part of any fair shuffle, and also ensures our shuffle will work for N=0.
When N is larger than an element's distance from the start or end of the array, it's allowed to be moved to the other side. We could tweak the algorithm to forbid this, but it would violate the "uniformly" requirement - elements near either end would be more likely to stay put than elements near the middle.
Now we can actually solve the problem.
Generate an array of random values in the range i + [-N, N], where i is the current index in the array. Normalize values outside the array bounds (e.g. -1 should become length-1 and length should become 0).
Look for pairs of duplicate values (collisions) in the array, and recompute them. You have a few options:
Recompute both values until they don't collide with each other; they could both still collide with other values.
Recompute just one until it doesn't collide with the other; the first value could still collide, but the second should now be unique, which might mean fewer calls to the RNG.
Identify the set of available indices for each collision (e.g. in [3, 1, 1, 0] index 2 is available), pick a random value from that set, and set one of the colliding array values to the selected result. This avoids needing to loop until the collision is resolved, but is more complex to code and risks running into a case where the set is empty.
However you address individual collisions, repeat the process until every value in the array is unique.
Now move each element in the original array to the index specified in the array we generated.
I'm not sure how to best implement #2, I'd suggest you benchmark it. If you don't want to take the time to benchmark, I'd go with the first option. The others are optimizations that might be faster, but might actually end up being slower.
This solution has an unbounded runtime in theory, but should terminate reasonably quickly in practice. Again, benchmark and test it before using it anywhere critical.
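A minimal Python sketch of these steps, using the wrap-around rule from point 2 and the simple "recompute one colliding value" option; limited_shuffle and targets are my names, and as noted above the re-roll loop is only probabilistically bounded:

    import random

    def limited_shuffle(items, n):
        length = len(items)
        if length == 0 or n == 0:
            return list(items)
        # Step 1: each index i picks a target in i + [-n, n], wrapped into bounds.
        targets = [(i + random.randint(-n, n)) % length for i in range(length)]
        # Step 2: re-roll one member of each colliding pair until all targets are unique.
        while len(set(targets)) != length:
            seen = {}
            for i, t in enumerate(targets):
                if t in seen:
                    targets[i] = (i + random.randint(-n, n)) % length
                else:
                    seen[t] = i
        # Step 3: move each element to the target index chosen for it.
        result = [None] * length
        for i, t in enumerate(targets):
            result[t] = items[i]
        return result

    print(limited_shuffle(list(range(10)), 2))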
One possible solution I have come up with, though I am not certain how 'naive' it is, especially at the edges (the far edge in particular):
Create an array of N boolean flags (representing elements that have been swapped).
At each index, check whether it has already been swapped (according to the first element in the flags array); if so, move on to the next index (see below).
Rotate the flags array: delete the first element (representing this element) and add a new 'not swapped' element to the end. (Aside: this may be done using a modulus array lookup, to avoid having to actually move the array contents, especially for large N.)
Loop...
Pick a number from 0 to N (or less than N, if N plus the current index is larger than the array being shuffled).
If 0, the element swaps with itself; move to the next index.
Otherwise, if that element is marked as swapped, loop and try again.
Note there are always 2 elements in the flags array that can be picked: the element itself and the last one (unless close to the end of the array being shuffled).
Swap the current element with the selected unswapped element, mark the selected element as swapped in the flags array, and loop to the next element.

How is it possible to do binary search on a singly-linked list in O(n) time?

This earlier question talks about doing binary search over a doubly-linked list in O(n) time. The algorithm in that answer works as follows:
Go to the middle of the list to do the first comparison.
If it's equal to the element we're looking for, we're done.
If it's bigger than the element we're looking for, walk backwards halfway to the start and repeat.
If it's smaller than the element we're looking for, walk forwards halfway to the end and repeat.
This works perfectly well for a doubly-linked list because it's possible to move both forwards and backwards, but this algorithm wouldn't work in a singly-linked list.
Is it possible to make binary search work in time O(n) on a singly-linked list rather than a doubly-linked list?
It's absolutely possible to make this work. In fact, there's pretty much only one change you need to make to the doubly-linked list algorithm to make it work.
The issue with the singly-linked list case is that if you have a pointer to the middle of the list, you can't go backwards to get back to the first quarter of the list. However, if you think about it, you don't need to start from the middle to do this. Instead, you can start at the front of the list and walk to the first quarter. This takes (essentially) the same amount of time as before: rather than going backward n / 4 steps, you can start at the front and go forwards n / 4 steps.
Now suppose you've done the first step and are at position n / 4 or 3n / 4. In this case, you're going to have the same problem as before if you need to back up to position n / 8 or position 5n / 8. In the case that you need to get to position n / 8, you can start at the front of the list again and walk forward n / 8 steps. What about the 5n / 8 case? Here's the trick - if you still have a pointer to the n / 2 point, then you can start there and walk forwards n / 8 steps, which will take you to the right spot.
More generally, instead of storing a pointer to the middle of the list, store two pointers into the list: one at the front of the range where the value might be and one in the middle of the range where the value might be. If you need to advance forward in the list, update the pointer to the start of the range to be the pointer to the middle of the range, then walk the pointer to the middle of the range forward halfway to the end of the range. If you need to advance backward in the list, update the pointer to the middle of the range to be the pointer to the front of the range, then walk forwards halfway.
Overall, this has the exact same time complexity as the doubly-linked case: we take n / 2 steps, then n / 4 steps, then n / 8 steps, etc., which sums up to O(n) total steps. We also only make O(log n) total comparisons. The only difference is the extra pointer we need to keep track of.
Hope this helps!
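Here is a compact Python sketch of that scheme, assuming a sorted singly-linked list built from a minimal Node with .value/.next; it keeps a pointer to the front of the current range and re-walks to its middle, for O(n) total pointer moves and O(log n) comparisons (the names are mine):

    class Node:
        def __init__(self, value, next=None):
            self.value, self.next = value, next

    def sll_binary_search(head, target):
        # One pass to get the length (O(n); the walks below also sum to O(n)).
        n, node = 0, head
        while node is not None:
            n, node = n + 1, node.next
        lo, size = head, n          # 'lo' marks the front of the candidate range
        while size > 0:
            mid = lo                # walk from the front of the range to its middle
            for _ in range(size // 2):
                mid = mid.next
            if mid.value == target:
                return mid
            if mid.value < target:
                lo = mid.next       # reuse the middle pointer: search the upper half
                size -= size // 2 + 1
            else:
                size //= 2          # keep 'lo': search the lower half
        return None

    # Example: search a sorted 1 -> 3 -> 5 -> 7 list.
    head = Node(1, Node(3, Node(5, Node(7))))
    print(sll_binary_search(head, 5).value)   # -> 5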
The comparison takes O(1); what takes more time is traversing the nodes. So even if you hold pointers to n/2, n/4 and 3n/4, the time it takes to find the element will remain O(n).
Further, if you start from the middle and go back (or forward), you might as well compare along the way, because another comparison takes the same amount of time: O(1).
To sum up:
Running a binary search on a linked list makes no sense unless the linked list is backed by an array (ArrayList), which allows direct access to its elements in O(1).
This can be achieved by using Double Pointer Method (provided the list is in sorted order) as described here in this research work: http://www.ijcsit.com/docs/Volume%205/vol5issue02/ijcsit20140502215.pdf

Getting the nth to last element in a linked list

We have a linked list of size L, and we want to retrieve the nth to the last element.
Solution 1: naive solution
make a first pass from the beginning to the end to compute L
make a second pass from the beginning to the expected position
Solution 2: use 2 pointers p1, p2
p1 starts iterating from the beginning, p2 does not move.
when there are n elements between p1 and p2, p2 starts iterating as well
when p1 arrives at the end of the list, p2 is at the expected position
Both solutions seem to have the same time complexity (i.e., 2L - n iterations over list elements).
Which one is better?
Both those algorithms are two-pass. The second may have better performance for reasonably small n because the second pass accesses memory that is already cached by the first pass. (The passes are interleaved.)
A one-pass solution would store the pointers in a circular buffer or queue, and return the "head" of the queue once the end of the list is reached.
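A short Python sketch of that one-pass circular-buffer solution, assuming a minimal Node with .value/.next and n = 1 meaning the last element (nth_to_last and window are my names):

    from collections import deque

    class Node:
        def __init__(self, value, next=None):
            self.value, self.next = value, next

    def nth_to_last(head, n):
        window = deque(maxlen=n)     # keeps the last n nodes seen so far
        node = head
        while node is not None:
            window.append(node)
            node = node.next
        if len(window) < n:
            return None              # list shorter than n
        return window[0]             # the oldest of the last n nodes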
How about using 3 pointers p, q, r and a counter?
Iterate through the list with p, updating the counter.
Every n nodes, assign r to q and q to p.
When you hit the end of the list, you can figure out how far r is from the end of the list.
You can get the answer in no more than O(L + n).
If n << L, solution 2 is typically faster, because of caching, i.e. the memory blocks containing p1 and p2 are copied to the CPU cache once and the pointers moved for a bunch of iterations before RAM needs to be accessed again.
Would it not be much cheaper to simply store the length of the linked list in O(1) memory? The only reason you have to do a "first pass" at all is because you don't know the length of your linked list. If you store the length, you could iterate over (|L|-n) elements every time and retrieve the element easily. For higher values of n in comparison to L, this would save you substantial amounts of time. For example, if n was equal to |L|, you could simply return the head of the list with no iteration at all.
This method uses slightly more memory than your first algorithm since it stores the length in memory, but your second algorithm uses two pointers, whereas this method only uses 1 pointer. If you have the memory for a second pointer, you probably have the memory to store the length of your linked list.
Granted O(|L|-n) is equivalent to O(n) in pure theory, but there are "fast" linear algorithms and then there are "slow" ones. Two-pass algorithms for this kind of problem are slow.
As #HotLicks pointed out in the comments, "One needs to understand that "big O" complexity is only loosely related to actual performance in many cases, since it ignores additive factors and constant multipliers." IMO just go for the laziest method in this case and don't overthink it.

top-k selection/merge

I have n sorted lists (5 < n < 300). These lists are quite long (300000+ tuples). Selecting the top k of the individual lists is of course trivial - they are right at the head of the lists.
Example for k = 2:
top2 (L1: [ 'a': 10, 'b': 4, 'c':3 ]) = ['a':10 'b':4]
top2 (L2: [ 'c': 5, 'b': 2, 'a':0 ]) = ['c':5 'b':2]
Where it gets more interesting is when I want the combined top k across all the sorted lists.
top2(L1+L2) = ['a':10, 'c':8]
Just combining the top k of the individual lists would not necessarily give the correct result:
top2(top2(L1)+top2(L2)) = ['a':10, 'b':6]
The goal is to reduce the required space and keep the sorted lists small.
top2(topX(L1)+topX(L2)) = ['a':10, 'c':8]
The question is whether there is an algorithm to calculate the combined top k with the correct order while cutting off the long tail of the lists at a certain position. And if there is: how does one find the limit X where it is safe to cut?
Note: Correct counts are not important. Only the order is.
top2(magic([L1,L2])) = ['a', 'c']
This algorithm uses O(U) memory where U is the number of unique keys. I doubt a lower memory bound can be achieved, because it is impossible to tell which keys can be discarded until all the keys have been summed.
Make a master list of (key:total_count) tuples. Simply run through each list one item at a time, keeping a tally of how many times each key has been seen.
Use any top-k selection algorithm on the master list that does not use additional memory. One simple solution is to sort the list in place.
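A minimal Python sketch of that tallying approach, using the example lists from the question; heapq.nlargest stands in here for "any top-k selection algorithm" and adds only O(k) scratch space on top of the O(U) master tally (top_k is my name):

    from collections import Counter
    import heapq

    def top_k(lists, k):
        totals = Counter()                       # master list of key -> total_count
        for lst in lists:
            for key, count in lst:
                totals[key] += count
        return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

    L1 = [('a', 10), ('b', 4), ('c', 3)]
    L2 = [('c', 5), ('b', 2), ('a', 0)]
    print(top_k([L1, L2], 2))                    # -> [('a', 10), ('c', 8)]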
If I understand your question correctly, the correct output is the top 10 items, irrespective of the list from which each came. If that's correct, then starting with the first 10 items in each list will allow you to generate the correct output (if you only want unique items in the output, but the inputs might contain duplicates, then you need 10 unique items in each list).
In the most extreme case, all the top items come from one list, and all items from the other lists are ignored. In this case, having 10 items in the one list will be sufficient to produce the correct result.
Associate an index with each of your n lists. Set it to point to the first element in each case.
Create a list-of-lists, and sort it by the indexed elements.
The indexed item on the top list in your list-of-lists is your first element.
Increment the index for the topmost list and remove that list from the list-of-lists and re-insert it based on the new value of its indexed element.
The indexed item on the top list in your list-of-lists is your next element.
Go to step 4 and repeat until done.
You didn't specify how many lists you have. If n is small, then step 4 can be done very simply (just re-sort the lists). As n grows you may want to think about more efficient ways to re-sort an almost-sorted list-of-lists.
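A Python sketch of that merge, using a heap as the sorted list-of-lists so that the re-insert in step 4 is cheap; note it emits items in globally sorted order but, as written, does not combine counts for a key that appears in several lists (merge_descending is my name):

    import heapq

    def merge_descending(lists):
        # Each heap entry is (-count, list_index, position); the smallest tuple
        # corresponds to the largest remaining count.
        heap = [(-lst[0][1], i, 0) for i, lst in enumerate(lists) if lst]
        heapq.heapify(heap)
        while heap:
            neg_count, i, pos = heapq.heappop(heap)
            yield lists[i][pos]                      # indexed item of the top list
            if pos + 1 < len(lists[i]):              # advance that list's index and
                nxt = lists[i][pos + 1]              # re-insert it by its new value
                heapq.heappush(heap, (-nxt[1], i, pos + 1))

    L1 = [('a', 10), ('b', 4), ('c', 3)]
    L2 = [('c', 5), ('b', 2), ('a', 0)]
    print(list(merge_descending([L1, L2])))
    # -> [('a', 10), ('c', 5), ('b', 4), ('c', 3), ('b', 2), ('a', 0)]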
I did not understand at first that if an 'a' appears in two lists, their counts must be combined. Here is a new memory-efficient algorithm:
(New) Algorithm:
(Re-)sort each list by ID (not by count). To release memory, the list can be written back to disk. Only enough memory for the longest list is required.
Get the next lowest unprocessed ID and find the total count across all lists.
Insert the ID into a priority queue of k nodes. Use the total count as the node's priority (not the ID). This priority queue drops the lowest node if more than k nodes are inserted.
Go to step 2 until all ID's have been exhausted.
Analysis: This algorithm can be implemented using only O(k) additional memory to store the min-heap. It makes several trade-offs to accomplish this:
The lists are sorted by ID in place; the original orderings by counts are lost. Otherwise O(U) additional memory is required to make a master list with ID: total_count tuples where U is number of unique ID's.
The next lowest ID is found in O(n) time by checking the first tuple of each list. This is repeated U times where U is the number of unique ID's. This might be improved by using a min-heap to track the next lowest ID. This would require O(n) additional memory (and may not be faster in all cases).
Note: This algorithm assumes ID's can be quickly compared. String comparisons are not trivial. I suggest hashing string ID's to integers. They do not have to be unique hashes, but collisions must be checked so all ID's are properly sorted/compared. Of course, this would add to the memory/time complexity.
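A rough Python sketch of this O(k)-memory variant, assuming the lists have already been re-sorted by ID (here, (ID, count) tuples sorted ascending by ID); top_k_by_id and positions are my names:

    import heapq

    def top_k_by_id(sorted_lists, k):
        # sorted_lists: lists of (ID, count) tuples, each sorted by ID ascending.
        positions = [0] * len(sorted_lists)
        heap = []                                    # min-heap of (total, ID), size <= k
        while True:
            # Next lowest unprocessed ID across all lists (O(n) scan of the heads).
            heads = [lst[pos][0] for lst, pos in zip(sorted_lists, positions) if pos < len(lst)]
            if not heads:
                break
            current = min(heads)
            total = 0
            for i, lst in enumerate(sorted_lists):   # sum this ID's counts, advance lists
                if positions[i] < len(lst) and lst[positions[i]][0] == current:
                    total += lst[positions[i]][1]
                    positions[i] += 1
            heapq.heappush(heap, (total, current))
            if len(heap) > k:
                heapq.heappop(heap)                  # drop the lowest of the k+1
        return sorted(heap, reverse=True)            # highest totals first

    L1 = sorted([('a', 10), ('b', 4), ('c', 3)])     # re-sorted by ID
    L2 = sorted([('c', 5), ('b', 2), ('a', 0)])
    print(top_k_by_id([L1, L2], 2))                  # -> [(10, 'a'), (8, 'c')]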
The perfect solution requires all tuples to be inspected at least once.
However, it is possible to get close to the perfect solution without inspecting every tuple. Discarding the "long tail" introduces a margin of error. You can use some type of heuristic to calculate when the margin of error is acceptable.
For example, if there are n=100 sorted lists and you have inspected down each list until the count is 2, the most the total count for a key could increase by is 200.
I suggest taking an iterative approach:
Tally each list until a certain lower count threshold L is reached.
Lower L to include more tuples.
Add the new tuples to the counts tallied so far.
Go to step 2 until lowering L does not change the top k counts by more than a certain percentage.
This algorithm assumes the counts for the top k keys will approach a certain value the further the long tail is traversed. You can use other heuristics instead of the certain percentage, like the number of new keys in the top k, how much the top k keys were shuffled, etc...
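A small Python sketch of this iterative scheme, assuming non-negative counts and descending-sorted lists of (key, count) tuples; the stopping rule used here (the top-k key set is unchanged after lowering L) is just one possible heuristic, and approximate_top_k and start_threshold are my names:

    from collections import Counter
    import heapq

    def approximate_top_k(lists, k, start_threshold=8):
        totals = Counter()
        positions = [0] * len(lists)
        threshold = start_threshold
        previous_top = None
        while True:
            # Steps 1/3: consume tuples whose count is still >= the current threshold L.
            for i, lst in enumerate(lists):
                while positions[i] < len(lst) and lst[positions[i]][1] >= threshold:
                    key, count = lst[positions[i]]
                    totals[key] += count
                    positions[i] += 1
            top = [key for key, _ in heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])]
            exhausted = all(pos == len(lst) for pos, lst in zip(positions, lists))
            if top == previous_top or exhausted:
                return top
            previous_top = top
            threshold //= 2                      # Step 2: lower L to include more tuples
            if threshold < 1:
                threshold = 0                    # final pass consumes everything left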
There is a sane way to implement this through mapreduce:
http://www.yourdailygeekery.com/2011/05/16/top-k-with-mapreduce.html
In general, I think you are in trouble. Imagine the following lists:
['a':100, 'b':99, ...]
['c':90, 'd':89, ..., 'b':2]
and you have k=1 (i.e. you want only the top one). 'b' is the right answer, but you need to look all the way down to the end of the second list to realize that 'b' beats 'a'.
Edit:
If you have the right distribution (long, low count tails), you might be able to do better. Let's keep with k=1 for now to make our lives easier.
The basic algorithm is to keep a hash map of the keys you've seen so far and their associated totals. Walk down the lists processing elements and updating your map.
The key observation is that a key can gain in count by at most the sum of the counts at the current processing point of each list (call that sum S). So on each step, you can prune from your hash map any keys whose total is more than S below your current maximum count element. (I'm not sure what data structure you would need to prune as you need to look up keys given a range of counts - maybe a priority queue?)
When your hash map has only one element in it, and its count is at least S, then you can stop processing the lists and return that element as the answer. If your count distribution plays nice, this early exit may actually trigger so you don't have to process all of the lists.
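For k = 1, here is a Python sketch of that pruning idea, with the lists as descending (key, count) tuples processed in lockstep (one reasonable reading of "walk down the lists"); top_one_with_pruning is my name, and the pruning here is a plain dict rebuild rather than the cleverer structure the answer wonders about:

    def top_one_with_pruning(lists):
        positions = [0] * len(lists)
        totals = {}
        while True:
            # S = sum of counts at the current processing point of each list.
            s = sum(lst[pos][1] for lst, pos in zip(lists, positions) if pos < len(lst))
            active = [i for i, lst in enumerate(lists) if positions[i] < len(lst)]
            if not active:
                break
            best = max(totals.values(), default=0)
            # Prune keys whose total is more than S below the current leader.
            totals = {key: t for key, t in totals.items() if t + s >= best}
            if len(totals) == 1 and best >= s:
                break                      # early exit: the leader can no longer be beaten
            # Process one element from each list that still has elements.
            for i in active:
                key, count = lists[i][positions[i]]
                totals[key] = totals.get(key, 0) + count
                positions[i] += 1
        return max(totals, key=totals.get)

    L1 = [('a', 100), ('b', 99)]
    L2 = [('c', 90), ('d', 89), ('b', 2)]
    print(top_one_with_pruning([L1, L2]))    # -> 'b'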
