Algorithms: Remember the last n unique numbers, in order

I want to remember the last n unique numbers, in order.
Here is what I mean: Let's say n = 4.
My current list is 5 3 4 2. If I add 6, it turns into 3 4 2 6. If I add 3 instead, the list turns into 5 4 2 3, where 3 moves to the front.
I would do it like this: Store the numbers in a queue. When adding a new number, search through the queue for the number. If the number is not found, pop the number at the end, and push the new number in the front. If the number is found, remove the number at that position, then push the new number in front.
Now obviously, removing a number from an arbitrary position in a container optimized for queue operations (like std::deque in C++) will be quite slow. A linked list, though, will be slower to search. Is there a better combination of algorithm + data structure to accomplish this sort of task?
If it makes any difference, I don't necessarily care about "remembering the last n unique numbers, in order." I specifically need to know, what element has been removed from the list upon an addition (if any).

You could use a doubly linked list together with a hash table. Store the n remembered numbers in the hash table, where the key is the number itself and the value is a pointer to the node of the linked list that contains that number.
Then the step you describe as "search through the queue for the number" becomes a lookup in the hash table, which takes constant time on average instead of linear time in the queue.
The pop and push operations you describe can be performed in constant time if you store a pointer p that points to the first element of the doubly linked list and a pointer q that points to the last element of the list.
Your step "If the number is found, remove the number at that position" can also be performed in constant time, since you already have the position of the number to be removed (by position I mean the pointer you get from the hash table).
UPDATE:
Be careful to update your hash table whenever a number is removed from or added to the list.
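As a rough illustration of this idea, here is a minimal Python sketch. It leans on collections.OrderedDict, which behaves like the hash table plus linked list combination described above; the class name LastNUnique and its methods are made up for this example. The add method returns the element that was dropped, if any, which is what the question asks for.

```python
from collections import OrderedDict


class LastNUnique:
    """Remember the last n unique values, oldest first (hypothetical helper)."""

    def __init__(self, n):
        self.n = n
        self._items = OrderedDict()  # keys are kept in insertion order

    def add(self, value):
        """Record value as the most recent one; return the evicted value or None."""
        if value in self._items:
            # Already remembered: just move it to the most recent end.
            self._items.move_to_end(value)
            return None
        self._items[value] = None
        if len(self._items) > self.n:
            # Drop the oldest entry and report it to the caller.
            evicted, _ = self._items.popitem(last=False)
            return evicted
        return None

    def as_list(self):
        return list(self._items)
```

Both the lookup and the reordering are expected to take constant time on average, so each add should cost O(1).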

Related

Sort a given array based on parent array using only swap function

It is a coding interview question. We are given an array say random_arr and we need to sort it using only the swap function.
Also the number of swaps for each element in random_arr are limited. For this you are given an array parent_arr, containing number of swaps for each element of random_arr.
Constraints:
You should use swap function.
Every element may repeat minimum 5 times and maximum 26 times.
You cannot make elements of given array to 0.
You should not write helper functions.
Now I will explain how parent_arr is declared. If parent_arr is like:
parent_arr[] = {a,b,c,d,...,z} then
a can be swapped at most one time.
b can be swapped at most two times.
if parent_arr[] = {c,b,a,....,z} then
c can be swapped at most one time.
b can be swapped at most two times.
a can be swapped at most three times
My solution:
For each element in random_arr[], store how many elements would be below it if the array were sorted. Now select the element with the minimum swap count from parent_arr[] and check whether it exists in random_arr[]. If it does and it occurs more than once, then there is more than one location where it can be placed. Choose the position (or rather the element at that position, to be precise) with the maximum swap count and swap it. Then decrease the swap count for that element, sort parent_arr[], and repeat the process.
But it is quite inefficient and its correctness can't be proved. Any ideas?

First, let's simplify your algorithm; then let's informally prove its correctness.
Modified algorithm
Observe that once you computed the number of elements below each number in the sorted sequence, you have enough information to determine for each group of equal elements x their places in the sorted array. For example, if c is repeated 7 times and has 21 elements ahead of it, then cs will occupy the range [21..27] (all indexes are zero-based; the range is inclusive of its ends).
Go through the parent_arr in the order of increasing number of swaps. For each element x, find the beginning of its target range rb; also note the end of its target range re. Now go through the elements of random_arr outside of the [rb..re] range. If you see x, swap it into the range. After swapping, increment rb. If you see that random_arr[rb] is equal to x, continue incrementing: these xs are already in the right spot, you wouldn't need to swap them.
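The two paragraphs above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names are invented, parent_arr is assumed to list the distinct values in order of increasing swap allowance, and the sketch does not track or enforce the per-element swap limits.

```python
from collections import Counter


def target_ranges(arr):
    """Map each distinct value to the inclusive index range it fills once arr is sorted."""
    counts = Counter(arr)
    ranges, start = {}, 0
    for value in sorted(counts):
        ranges[value] = (start, start + counts[value] - 1)
        start += counts[value]
    return ranges


def sort_with_swaps(random_arr, parent_arr):
    """Sort a copy of random_arr using swaps only, processing values in parent_arr order."""
    arr = list(random_arr)
    ranges = target_ranges(arr)
    for x in parent_arr:
        if x not in ranges:
            continue
        lo, hi = ranges[x]
        rb = lo  # next candidate slot inside the target range of x
        for i in range(len(arr)):
            # Only copies of x that sit outside the target range need to move.
            if lo <= i <= hi or arr[i] != x:
                continue
            # Skip slots that already hold x; they are in the right spot.
            while arr[rb] == x:
                rb += 1
            arr[i], arr[rb] = arr[rb], arr[i]
            rb += 1
    return arr
```

For example, sort_with_swaps(list("cabbac"), list("abc")) needs a single swap to bring the stray a into its range.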
Informal proof of correctness
Now let's prove the correctness of the above. Observe that once an element is swapped into its place, it is never moved again. When you reach an element x in the parent_arr, all elements with a lower number of swaps have already been processed. By construction of the algorithm this means that these elements are already in place. Suppose that x has k allowed swaps. When you swap it into its place, you move another element out.
This replaced element cannot be x, because the algorithm skips xs when looking for a destination in the target range [rb..re]. Moreover, the replaced element cannot be one of elements below x in the parent_arr, because all elements below x are in their places already, and therefore cannot move. This means that the swap count of the replaced element is necessarily k+1 or more. Since by the time that we finish processing x we have exhausted at most k swaps on any element (which is easy to prove by induction), any element that we swap out to make room for x will have at least one remaining swap that would allow us to swap it in place when we get to it in the order dictated by the parent_arr.

Algorithm to find duplicates in multiple linked lists

What is the fastest method of finding duplicates across multiple (large) linked lists?
I will attempt to illustrate the problem with arrays instead just to make it a bit more readable. (I used numbers from 0-9 for simplicity instead of pointers).
list1[] = {1,2,3,4,5,6,7,8,9,0};
list2[] = {0,2,3,4,5,6,7,8,9,1};
list3[] = {4,5,6,7,8,9,0,1,2,3};
list4[] = {8,2,5};
list5[] = {1,1,2,2,3,3,4,4,5,5};
If I now ask: 'does the number 5 exist in list1-5?' I could sort the lists, remove duplicates, repeat this for all lists and merge them into a "superlist" and see if the number of (new) duplicates equals the number of lists that I search through. Assuming that I got the correct number of duplicates I can assume that what I searched for (5) exists in all of the lists.
If I instead searched for 1 I will only get four duplicates—ergo not found in all of the lists.
Is there a faster/smarter/better way to achieve the above without sorting and/or changing the lists in any way?
P.S.: This question is asked mostly out of pure curiosity and nothing else! :)

Just put each number into a hash table and store the number of occurrences for that item in the table. When you find another, just increment the counter. O(n) algorithm (n items across all the lists).
If you want to store the lists that each item occurs in, then you need a set representation to be stored under each item as well. You can use any set representation: bit vector, list, array, etc. This will tell you the lists that the item is a member of. This does not change the algorithm from O(n); it just increases the work by a constant factor.
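A minimal Python sketch of this approach, with made-up function names and plain Python lists standing in for the linked lists, might look like this. It builds both the occurrence counts and, per item, the set of lists that contain it.

```python
from collections import defaultdict


def index_lists(lists):
    """Return (counts, membership) built in one pass over all lists."""
    counts = defaultdict(int)       # item -> total number of occurrences
    membership = defaultdict(set)   # item -> indexes of the lists containing it
    for list_id, items in enumerate(lists):
        for item in items:
            counts[item] += 1
            membership[item].add(list_id)
    return counts, membership


def in_every_list(value, membership, list_count):
    """True if value was seen in each of the list_count lists."""
    return len(membership.get(value, ())) == list_count
```

The original lists are only read, never sorted or modified, and the table can be reused for any number of later queries.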

Define an array hash and set all of its entries to 0:
define hash[MAX_SYMBOLS] = {0}
define new_list[LENGTH]
define list[LENGTH] and populate it
Now, for each element in your list, use its value as an index into hash and increment that location of hash. Each occurrence of that number increments the value at that hash location once, so a duplicated value i will have hash[i] > 1.
for i=0 to (n - 1)
do
increment hash[list[i]]
endfor
If you want to remove the duplicates and create a new list, scan the original list again and, for each value whose hash entry is still nonzero, copy it into the new list and reset its entry so that later copies are skipped. This keeps the values in the order in which they first appeared in the original list.
define j = 0
for i=0 to (n - 1)
do
if hash[list[i]] is not 0
then
new_list[j] := list[i]
set hash[list[i]] to 0
increment j
endif
endfor
Note that with negative numbers you will not be able to use the values directly as indexes. To handle negative numbers, first find the largest magnitude among the negative numbers and add that magnitude to every number when using it to index the hash array.
find the highest magnitude of the negative values and store it in min_neg
for i=0 to (n - 1)
do
increment hash[list[i] + min_neg]
endfor
Alternatively, in an implementation you can allocate contiguous memory and then define a pointer to the middle of the allocated block, so that you can move both forward and backward from it and use negative indexes. You need to make sure that there is enough memory both before and after the pointer.
int *hash = calloc(SYMBOLS, sizeof(int)); /* counters start at zero */
int *hash_ptr = hash + SYMBOLS / 2;
Now you can use hash_ptr[-6], or any hash_ptr[i] with -SYMBOLS/2 <= i < SYMBOLS - SYMBOLS/2.
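For comparison, the counting-array idea with an offset for negative values could be sketched in Python as below. The function name dedupe_in_order is invented for this example, and the sketch assumes integer values whose range (max minus min) is small enough to allocate a table for.

```python
def dedupe_in_order(values):
    """Remove duplicates from a list of integers, keeping first occurrences in order."""
    if not values:
        return []
    offset = -min(values)  # shifts the smallest value to index 0
    table = [0] * (max(values) + offset + 1)
    for v in values:
        table[v + offset] += 1  # count occurrences, as in the first loop above
    result = []
    for v in values:
        if table[v + offset] != 0:
            result.append(v)
            table[v + offset] = 0  # mark as emitted so later copies are skipped
    return result
```

Shifting by the minimum rather than by the largest negative magnitude also keeps the table small when all values are large and positive.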

The question is a bit vague, so the answer depends on what you want.
A hash table is the correct answer for asking general questions about duplicates, because it allows you to go through each list just once to build a table that will answer most questions; however, some questions will not require one.
Possible cases that seem to answer your question:
Do you just need to know if a certain value is present in each list?
- Search the first list until the value is found. If it is not found, you're done: the value is not present in every list. Otherwise repeat for each successive list. If all lists are searched and the value is found in each of them, it is present in every list. This algorithm does not need to look at every value in each list, or even at every list, so it would be the quickest.
Do you need to know whether any duplicates exist at all?
- If any value in a hash table keyed by number has a count greater than 1, there are duplicates. If that is all you need to know, you can quit right there.
Do you need the number of duplicates in each list, separately?
- Multiply each value by the number of lists and add the number of the list being processed. Store that as the hash key and count duplicates. When all lists are processed, you have a table that can answer all kinds of questions. To check duplicates for a specific value, multiply it by the list count and examine the sequential hash keys. If there is one for each list, the number is present in each list. If all the counts are greater than 1 over that range, the number is duplicated in each list.
Etc.

top-k selection/merge

I have n sorted lists (5 < n < 300). These lists are quite long (300000+ tuples). Selecting the top k of the individual lists is of course trivial - they are right at the head of the lists.
Example for k = 2:
top2 (L1: [ 'a': 10, 'b': 4, 'c':3 ]) = ['a':10 'b':4]
top2 (L2: [ 'c': 5, 'b': 2, 'a':0 ]) = ['c':5 'b':2]
Where it gets more interesting is when I want the combined top k across all the sorted lists.
top2(L1+L2) = ['a':10, 'c':8]
Just combining the top k of the individual lists would not necessarily give the correct result:
top2(top2(L1)+top2(L2)) = ['a':10, 'b':6]
The goal is to reduce the required space and keep the sorted lists small.
top2(topX(L1)+topX(L2)) = ['a':10, 'c':8]
The question is whether there is an algorithm to calculate the combined top k in the correct order while cutting off the long tail of the lists at a certain position. And if there is: how does one find the limit X where it is safe to cut?
Note: Correct counts are not important. Only the order is.
top2(magic([L1,L2])) = ['a', 'c']

This algorithm uses O(U) memory, where U is the number of unique keys. I doubt a lower memory bound can be achieved, because it is impossible to tell which keys can be discarded until all the keys have been summed.
Make a master list of (key:total_count) tuples. Simply run through each list one item at a time, keeping a tally of how many times each key has been seen.
Use any top-k selection algorithm on the master list that does not use additional memory. One simple solution is to sort the list in place.
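A short Python sketch of this tally-then-select approach follows. The function name and the (key, count) tuple layout are assumptions for illustration, and heapq.nlargest is swapped in for the in-place sort, so it uses O(k) extra memory on top of the O(U) master tally.

```python
import heapq
from collections import Counter


def combined_top_k(lists, k):
    """Sum the counts per key across all lists, then pick the k largest totals."""
    totals = Counter()
    for lst in lists:
        for key, count in lst:
            totals[key] += count
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])
```

With the two example lists from the question, the result for k = 2 is a with 10 and c with 8.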

If I understand your question correctly, the correct output is the top 10 items, irrespective of the list from which each came. If that's correct, then starting with the first 10 items in each list will allow you to generate the correct output (if you only want unique items in the output, but the inputs might contain duplicates, then you need 10 unique items in each list).
In the most extreme case, all the top items come from one list, and all items from the other lists are ignored. In this case, having 10 items in the one list will be sufficient to produce the correct result.

1. Associate an index with each of your n lists. Set it to point to the first element in each case.
2. Create a list-of-lists, and sort it by the indexed elements.
3. The indexed item on the top list in your list-of-lists is your first element.
4. Increment the index for the topmost list, remove that list from the list-of-lists, and re-insert it based on the new value of its indexed element.
5. The indexed item on the top list in your list-of-lists is your next element.
6. Go to step 4 and repeat until done.
You didn't specify how many lists you have. If n is small, then step 4 can be done very simply (just re-sort the lists). As n grows you may want to think about more efficient ways to re-sort an almost-sorted list-of-lists.
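One way to keep the list-of-lists ordered cheaply is a heap. The Python sketch below is an assumption-laden illustration (invented function name, lists of (key, count) tuples sorted by descending count); like the steps above, it merges the lists as they are and does not add up counts for a key that appears in several lists.

```python
import heapq


def merge_top_k(lists, k):
    """Return the k largest (key, count) tuples across lists sorted by descending count."""
    # Heap entries are (-count, list index, position); negation turns the
    # min-heap into a max-heap on count.
    heap = [(-lst[0][1], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        _, i, pos = heapq.heappop(heap)
        result.append(lists[i][pos])
        if pos + 1 < len(lists[i]):
            # Advance the index of that list and re-insert it.
            heapq.heappush(heap, (-lists[i][pos + 1][1], i, pos + 1))
    return result
```

Each step costs O(log n) for n lists, instead of a full re-sort of the list-of-lists.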

I did not realize that if an 'a' appears in two lists, its counts must be combined. Here is a new memory-efficient algorithm:
(New) Algorithm:
1. (Re-)sort each list by ID (not by count). To release memory, the list can be written back to disk. Only enough memory for the longest list is required.
2. Get the next lowest unprocessed ID and find the total count across all lists.
3. Insert the ID into a priority queue of k nodes. Use the total count as the node's priority (not the ID). This priority queue drops the lowest node if more than k nodes are inserted.
4. Go to step 2 until all IDs have been exhausted.
Analysis: This algorithm can be implemented using only O(k) additional memory to store the min-heap. It makes several trade-offs to accomplish this:
The lists are sorted by ID in place; the original orderings by counts are lost. Otherwise O(U) additional memory is required to make a master list with ID: total_count tuples where U is number of unique ID's.
The next lowest ID is found in O(n) time by checking the first tuple of each list. This is repeated U times where U is the number of unique ID's. This might be improved by using a min-heap to track the next lowest ID. This would require O(n) additional memory (and may not be faster in all cases).
Note: This algorithm assumes ID's can be quickly compared. String comparisons are not trivial. I suggest hashing string ID's to integers. They do not have to be unique hashes, but collisions must be checked so all ID's are properly sorted/compared. Of course, this would add to the memory/time complexity.
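The algorithm above could be sketched in Python along these lines. The function name is made up, and for brevity the sketch sorts in-memory copies of the lists rather than sorting in place or on disk, so it illustrates the control flow rather than the memory behaviour.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def top_k_by_total(lists, k):
    """Combine counts per ID across lists and keep only the k largest totals."""
    by_id = [sorted(lst, key=itemgetter(0)) for lst in lists]
    # Merging the ID-sorted lists yields all tuples for an ID consecutively.
    merged = heapq.merge(*by_id, key=itemgetter(0))
    heap = []  # min-heap of (total, id) holding at most k entries
    for key, group in groupby(merged, key=itemgetter(0)):
        total = sum(count for _, count in group)
        if len(heap) < k:
            heapq.heappush(heap, (total, key))
        elif (total, key) > heap[0]:
            # The new total beats the smallest retained one, so replace it.
            heapq.heapreplace(heap, (total, key))
    return [(key, total) for total, key in sorted(heap, reverse=True)]
```

Here heapq.merge plays the role of the min-heap that tracks the next lowest ID across the n lists.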

The perfect solution requires all tuples to be inspected at least once.
However, it is possible to get close to the perfect solution without inspecting every tuple. Discarding the "long tail" introduces a margin of error. You can use some type of heuristic to calculate when the margin of error is acceptable.
For example, if there are n=100 sorted lists and you have inspected down each list until the count is 2, the most the total count for a key could increase by is 200.
I suggest taking an iterative approach:
1. Tally each list until a certain lower count threshold L is reached.
2. Lower L to include more tuples.
3. Add the new tuples to the counts tallied so far.
4. Go to step 2 until lowering L does not change the top k counts by more than a certain percentage.
This algorithm assumes the counts for the top k keys will approach a certain value the further the long tail is traversed. Instead of the percentage, you can use other heuristics, such as the number of new keys in the top k, how much the top k keys were shuffled, etc.

There is a sane way to implement this through mapreduce:
http://www.yourdailygeekery.com/2011/05/16/top-k-with-mapreduce.html

In general, I think you are in trouble. Imagine the following lists:
['a':100, 'b':99, ...]
['c':90, 'd':89, ..., 'b':2]
and you have k=1 (i.e. you want only the top one). 'b' is the right answer, but you need to look all the way down to the end of the second list to realize that 'b' beats 'a'.
Edit:
If you have the right distribution (long, low count tails), you might be able to do better. Let's keep with k=1 for now to make our lives easier.
The basic algorithm is to keep a hash map of the keys you've seen so far and their associated totals. Walk down the lists processing elements and updating your map.
The key observation is that a key can gain in count by at most the sum of the counts at the current processing point of each list (call that sum S). So on each step, you can prune from your hash map any keys whose total is more than S below your current maximum count element. (I'm not sure what data structure you would need to prune as you need to look up keys given a range of counts - maybe a priority queue?)
When your hash map has only one element in it, and its count is at least S, then you can stop processing the lists and return that element as the answer. If your count distribution plays nice, this early exit may actually trigger so you don't have to process all of the lists.
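A rough Python sketch of this pruning idea for k=1 is shown below. It is only an illustration under assumptions: the function name is invented, each list holds (key, count) tuples sorted by descending count, counts are non-negative, and a key appears at most once per list. It prunes with a plain dictionary rebuild rather than a specialised data structure, and it requires the leader to be strictly ahead of S before stopping.

```python
def top1_with_early_exit(lists):
    """Return (best key, rows read per list), stopping early when the leader is safe."""
    totals = {}
    depth = 0
    longest = max(len(lst) for lst in lists)
    while depth < longest:
        # Read one more row from every list that still has rows.
        for lst in lists:
            if depth < len(lst):
                key, count = lst[depth]
                totals[key] = totals.get(key, 0) + count
        depth += 1
        # S: the most any key can still gain from the unread rows.
        s = sum(lst[depth][1] for lst in lists if depth < len(lst))
        best = max(totals.values())
        # Drop keys that cannot catch up with the current leader.
        totals = {k: v for k, v in totals.items() if v + s >= best}
        if len(totals) == 1 and best > s:
            break
    winner = max(totals, key=totals.get)
    return winner, depth
```

On the two lists from the start of this answer the sketch still has to read down to 'b':2, while on a friendlier distribution it can stop after the first row.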

Finding the Nth largest value in a group of numbers as they are generated

I'm writing a program that needs to find the Nth largest value in a group of numbers. These numbers are generated by the program, but I don't have enough memory to store N numbers. Is there a better upper bound than N that can be achieved for storage? The upper bound for the size of the group of numbers (and for N) is approximately 100,000,000.
Note: The numbers are decimals and the list can include duplicates.
[Edit]: My memory limit is 16 MB.

This is a multipass algorithm (therefore, you must be able to generate the same list multiple times, or store the list off to secondary storage).
First pass:
Find the highest value and the lowest value. That's your initial range.
Passes after the first:
Divide the range up into 10 equally spaced bins. We don't need to store any numbers in the bins. We're just going to count membership in the bins. So we just have an array of integers (or bigints, whatever can accurately hold our counts). Note that 10 is an arbitrary choice for the number of bins. Your sample size and distribution will determine the best choice.
Spin through each number in the data, incrementing the count of whichever bin holds the number you see.
Figure out which bin holds your answer, and add the number of values in the bins above it to a running count of values above the winning bin.
The winning bin's top and bottom range are your new range.
Loop through these steps again until you have enough memory to hold the numbers in the current bin.
Last pass:
You should know how many numbers are above the current bin by now.
You have enough storage to grab all the numbers within your range of the current bin, so you can spin through and grab the actual numbers. Just sort them and grab the correct number.
Example: if the range you see is 0.0 through 1000.0, your bins' ranges will be:
[0.0 - 100.0]
(100.0 - 200.0]
(200.0 - 300.0]
...
(900.0 - 1000.0]
If you find through the counts that your number is in the (100.0 - 200.0] bin, your next set of bins will be:
(100.0 - 110.0]
(110.0 - 120.0]
etc.
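The passes described above might be sketched in Python as follows. The names are invented, and make_stream is assumed to be a callable that regenerates the same sequence of numbers on every call. One deviation from the description, labeled here as a choice of this sketch: the next range is taken from the smallest and largest values actually seen in the winning bin, which sidesteps floating-point trouble at bin boundaries.

```python
def nth_largest_multipass(make_stream, n, bins=10, max_in_memory=1000):
    """Find the n-th largest value using repeated counting passes over make_stream()."""
    # First pass: overall range and total count.
    lo = hi = None
    total = 0
    for x in make_stream():
        lo = x if lo is None else min(lo, x)
        hi = x if hi is None else max(hi, x)
        total += 1
    if not 1 <= n <= total:
        raise ValueError("n is out of range")
    above = 0          # values known to lie above the current range
    in_range = total   # values inside [lo, hi]
    # Middle passes: narrow the range until it fits in memory.
    while in_range > max_in_memory and lo < hi:
        width = (hi - lo) / bins
        counts = [0] * bins
        bin_lo = [None] * bins
        bin_hi = [None] * bins
        for x in make_stream():
            if lo <= x <= hi:
                b = min(int((x - lo) / width), bins - 1)
                counts[b] += 1
                bin_lo[b] = x if bin_lo[b] is None else min(bin_lo[b], x)
                bin_hi[b] = x if bin_hi[b] is None else max(bin_hi[b], x)
        # Walk down from the top bin until the n-th largest falls inside one.
        b = bins - 1
        while above + counts[b] < n:
            above += counts[b]
            b -= 1
        lo, hi, in_range = bin_lo[b], bin_hi[b], counts[b]
    if lo == hi:
        return lo  # every remaining value is identical
    # Last pass: collect the values in the final range and sort them.
    candidates = sorted((x for x in make_stream() if lo <= x <= hi), reverse=True)
    return candidates[n - above - 1]
```

Only the bin arrays and, in the last pass, at most max_in_memory values are held at once; the price is one full regeneration of the data per pass.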
Another multipass idea:
Simply do a binary search. Choose the midpoint of the range as the first guess. Your passes just need to do an above/below count to determine the next estimate (which can be weighted by the count, or a simple average for code simplicity).

Are you able to regenerate the same group of numbers from start? If you are, you could make multiple passes over the output: start by finding the largest value, restart the generator, find the largest number smaller than that, restart the generator, and repeat this until you have your result.
It's going to be a real performance killer, because you have a lot of numbers and a lot of passes will be required - but memory-wise, you will only need to store 2 elements (the current maximum and a "limit", the number you found during the last pass) and a pass counter.
You could speed it up by using a priority queue to find the M largest elements in each pass (choosing some M that you are able to fit in memory), which reduces the number of passes required to N/M.
If you need to find, say, the 10th largest element in a list of 15 numbers, you could save time by working the other way around. Since it is the 10th largest element, that means there are 15-10=5 elements smaller than this element - so you could look for the 6th smallest element instead.

This is similar to another question -- C Program to search n-th smallest element in array without sorting? -- where you may get some answers.
The logic will work for Nth largest/smallest search similarly.
Note: I am not saying this is a duplicate of that.
Since you have a lot (nearly 1 billion?) numbers, here is another way for space optimization.
Let's assume your numbers fit in 32-bit values, so about 1 billion of them would require something close to 4 GB (32 Gbit) of space. Now, if you can afford about 128MB of working memory, we can do this in one pass.
Imagine a 1 billion bit-vector stored as an array of 32-bit words
Let it be initialized to all zeros
Start running through your numbers and keep setting the correct bit position for the value of the number
When you are done with one pass, start counting set bits from the high end of this bit vector until you reach the Nth set bit (count from the start instead for the Nth smallest)
That bit's position gives you the value for your Nth largest number
You have actually sorted all the numbers in the process (however, count of duplicates is not tracked)

If I understood well, the upper bound memory usage for your program is O(N) (possibly N+1). You can maintain a list of the generated values that are greater than the current X (the Nth largest value so far) ordered by lowest first. As soon as a new greater value is generated, you can replace the current X by the first element of the list and insert the just generated value to its corresponding position in the list.
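A compact way to maintain such a list is a min-heap of at most N values, whose smallest entry is the current Nth largest. The Python sketch below uses the standard heapq module; the function name is an assumption, and like the approach above it still needs O(N) memory.

```python
import heapq


def nth_largest(stream, n):
    """Return the n-th largest value of an iterable, keeping only n values in memory."""
    heap = []  # min-heap holding the n largest values seen so far
    for x in stream:
        if len(heap) < n:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x displaces the current n-th largest value.
            heapq.heapreplace(heap, x)
    if len(heap) < n:
        raise ValueError("stream has fewer than n values")
    return heap[0]
```

Each new value costs O(log N), and duplicates are counted as separate values.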

sort -n | uniq -c and the Nth should be the Nth row

Stability of quicksort partitioning approach

Does the following Quicksort partitioning algorithm result in a stable sort (i.e. does it maintain the relative position of elements with equal values):
partition(A,p,r)
{
    x=A[r];
    i=p-1;
    for j=p to r-1
        if(A[j]<=x)
        {
            i++;
            exchange(A[i],A[j]);
        }
    exchange(A[i+1],A[r]);
    return i+1;
}

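There is one case in which your partitioning algorithm will make a swap that changes the order of equal values. The explanation below refers to an illustration of how your in-place partitioning algorithm works:
[Figure: stages of the in-place partition; the light-gray subarray holds the elements <= the partition value, and the white zone holds the elements not yet examined.]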
We march through each value with the j index, and if the value we see is <= the partition value, we append it to the light-gray subarray by swapping it with the element that is immediately to the right of the light-gray subarray. The light-gray subarray contains all the elements that are <= the partition value.
Now let's look at, say, stage (c) and consider the case in which three 9's are at the beginning of the white zone, followed by a 1. That is, we are about to check whether the 9's are <= the partition value. We look at the first 9 and see that it is not <= 4, so we leave it in place, and march j forward. We look at the next 9 and see that it is not <= 4, so we also leave it in place, and march j forward. We also leave the third 9 in place. Now we look at the 1 and see that it is less than the partition value, so we swap it with the first 9. Then to finish the algorithm, we swap the partition value with the value at i+1, which is the second 9. Now we have completed the partition algorithm, and the 9 that was originally third is now first.
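The scenario above can be reproduced with a small Python version of the partition routine from the question. The labels attached to each key are only there to make the original order visible, and the comparison looks at the key alone; both are assumptions of this sketch.

```python
def partition(a, p, r):
    """Lomuto-style partition of a[p..r] on the key stored in position 0 of each item."""
    x = a[r]
    i = p - 1
    for j in range(p, r):
        if a[j][0] <= x[0]:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[r] = a[r], a[i + 1]
    return i + 1


# Three equal keys (9) followed by a 1, with the partition value 4 at the end.
items = [(9, "first"), (9, "second"), (9, "third"), (1, "x"), (4, "pivot")]
q = partition(items, 0, len(items) - 1)
print(q, items)
```

After the call the three 9's appear in the order third, first, second, which is exactly the reordering that makes the sort unstable.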

Any sort can be converted to a stable sort if you're willing to add a second key. The second key should be something that indicates the original order, such as a sequence number. In your comparison function, if the first keys are equal, use the second key.

A sort is stable when the original order of similar elements doesn't change. Your algorithm isn't stable since it swaps equal elements.
If it didn't, then it still wouldn't be stable:
( 1, 5, 2, 5, 3 )
You have two elements with the sort key "5". If you compare element #2 (5) and #5 (3) for some reason, then the 5 would be swapped with 3, thereby violating the contract of a stable sort. This means that carefully choosing the pivot element doesn't help, you must also make sure that the copying of elements between the partitions never swaps the original order.

Your code looks suspiciously similar to the sample partition function given on Wikipedia, which isn't stable, so your function probably isn't stable either. At the very least you should make sure your pivot point r points to the last position in the array of values equal to A[r].
You can make quicksort stable (I disagree with Matthew Jones there) but not in its default and quickest (heh) form.
Martin (see the comments) is correct that a quicksort on a linked list can be stable if you start with the first element as the pivot and append values at the end of the lower and upper sublists as you go through the list. However, quicksort is supposed to work on a simple array rather than a linked list. One of the advantages of quicksort is its low memory footprint (because everything happens in place). If you're using a linked list you're already incurring a memory overhead for all the pointers to next values etc., and you're swapping those rather than the values.

If you need a stable O(n*log(n)) sort, use mergesort. (The best way to make quicksort stable, by the way, is to choose a median of random values as the pivot. This is not stable when all elements are equivalent, however.)

Quicksort is not stable. Here is a case where it is not stable:
5 5 4 8
Taking the first 5 as the pivot, we will have the following after the first pass:
4 5 5 8
As you can see, the order of the 5's has been changed. If we now continue sorting, the order of the 5's in the sorted array will stay changed.

From Wikipedia:
Quicksort is a comparison sort and, in efficient implementations, is not a stable sort.

One way to address this problem is to not take the last element of the array as the key. Quicksort is often implemented as a randomized algorithm.
Its performance depends heavily on the selection of the key. Although the textbook definition of the algorithm takes the last or first element as the key, in reality we can select any element as the key.
So I tried the median-of-3 approach: take the first, middle, and last elements of the array, sort them, and then use the one in the middle position as the key.
For example, say my array is {9,6,3,10,15}. After sorting the first, middle, and last elements it becomes {3,6,9,10,15}. Now use 9 as the key. Moving the key to the end gives {3,6,15,10,9}.
All we need to take care of is what happens if 9 occurs more than once, that is, if the key itself occurs more than once.
In such cases, after selecting the key at the middle index, we need to go through the elements between the key and the right end, and if any element equal to the key is found (i.e. if a 9 is found between the middle position and the end), make that 9 the key.
Now, in the region of elements greater than 9 (the loop over j), if any 9 is found, swap it into the region of elements less than or equal to the key (the region of i). Your array will then be stably sorted.
