Algorithm to find duplicates in multiple linked lists

What is the fastest method of finding duplicates across multiple (large) linked lists?
I will attempt to illustrate the problem with arrays instead, just to make it a bit more readable (I used the numbers 0-9 for simplicity instead of pointers).
list1[] = {1,2,3,4,5,6,7,8,9,0};
list2[] = {0,2,3,4,5,6,7,8,9,1};
list3[] = {4,5,6,7,8,9,0,1,2,3};
list4[] = {8,2,5};
list5[] = {1,1,2,2,3,3,4,4,5,5};
If I now ask: 'does the number 8 exist in lists 1-5?' I could sort each list, remove its duplicates, repeat this for all lists, merge them into a "superlist", and see if the number of (new) duplicates equals the number of lists I searched through. Assuming I got the correct number of duplicates, I can assume that what I searched for (8) exists in all of the lists.
If I instead searched for 1 I would only get four duplicates, so it is not found in all of the lists.
Is there a faster/smarter/better way to achieve the above without sorting and/or changing the lists in any way?
P.S.: This question is asked mostly out of pure curiosity and nothing else! :)

Just put each number into a hash table and store the number of occurrences for that item in the table. When you find another, just increment the counter. O(n) algorithm (n items across all the lists).
If you want to store the lists that each item occurs in, then you need a set representation stored under each item as well. You can use any set representation -- bit vector, list, array, etc. This will tell you which lists that item is a member of. It does not change the O(n) bound, just increases the work by a constant factor.
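For illustration, here is a minimal sketch of that idea (in C++, with the example lists from the question stored as vectors; walking real linked-list nodes would look the same). The per-item set of lists is kept as a bitmask:
#include <cstdio>
#include <unordered_map>
#include <vector>
struct Info {
    int count = 0;        // total occurrences across all lists
    unsigned lists = 0;   // bit i set => the value occurs in list i
};
int main() {
    std::vector<std::vector<int>> lists = {
        {1,2,3,4,5,6,7,8,9,0},
        {0,2,3,4,5,6,7,8,9,1},
        {4,5,6,7,8,9,0,1,2,3},
        {8,2,5},
        {1,1,2,2,3,3,4,4,5,5},
    };
    std::unordered_map<int, Info> table;
    for (unsigned i = 0; i < lists.size(); ++i)
        for (int v : lists[i]) {
            table[v].count++;            // one pass over all n items: O(n)
            table[v].lists |= 1u << i;   // remember which list it came from
        }
    const unsigned all = (1u << lists.size()) - 1;
    std::printf("2 in every list: %s\n", table[2].lists == all ? "yes" : "no");
    std::printf("8 in every list: %s\n", table[8].lists == all ? "yes" : "no");
    return 0;
}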

Define an array hash and set all the location values to 0
define hash[MAX_SYMBOLS] = {0};
define new_list[LENGTH]
define list[LENGTH] and populate
Now, for each element in your list, use its value as an index into hash and increment that location. Each occurrence of the number increments the value at that location once, so a duplicated value i will end up with hash[i] > 1.
for i=0 to (n - 1)
do
increment hash[list[i]]
endfor
If you want to remove the duplicates and create a new list, then scan the original list and, for each element whose count in hash is still non-zero, append it to the new list and zero out its count so that later copies are skipped. This keeps the elements in the order in which they first appeared in the original list.
define j = 0
for i=0 to (n - 1)
do
    if hash[list[i]] is not 0
    then
        new_list[j] := list[i]
        increment j
        hash[list[i]] := 0
    endif
endfor
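Putting the counting and the in-order duplicate removal together, here is a compilable sketch (C++, assuming the values are small non-negative integers, 0..9 as in the question):
#include <vector>
std::vector<int> dedup(const std::vector<int>& list) {
    const int MAX_SYMBOLS = 10;              // assumption: values are 0..9
    std::vector<int> hash(MAX_SYMBOLS, 0);   // occurrence counts, all start at 0
    for (int v : list)
        hash[v]++;                           // hash[v] > 1 means v is duplicated
    std::vector<int> new_list;
    for (int v : list)
        if (hash[v] != 0) {                  // first time the scan reaches v
            new_list.push_back(v);
            hash[v] = 0;                     // so later copies are skipped
        }
    return new_list;
}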
Note that with negative numbers you will not be able to use the values directly as indexes. To handle negative numbers, first find the largest magnitude among the negative values and add that magnitude to every number when using it to index the hash array.
find the largest magnitude among the negative values into min_neg
for i=0 to (n - 1)
do
    increment hash[list[i] + min_neg]
endfor
Alternatively, you can allocate a contiguous block of memory and set a pointer to the middle of the allocated block, so that you can move in both directions and use negative indexes with it. You need to make sure that there is enough memory on both sides of the pointer.
int *hash = calloc (SYMBOLS, sizeof (int));   /* calloc so the counts start at 0 */
int *hash_ptr = hash + SYMBOLS / 2;
Now you can do hash_ptr[-6] or, in general, hash_ptr[i] with -SYMBOLS/2 <= i < SYMBOLS/2.
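As a function, the same trick looks roughly like this (SYMBOLS and the assumed value range are illustrative; calloc keeps every count at zero initially):
#include <stdlib.h>
#define SYMBOLS 1024    /* assumption: values lie in -SYMBOLS/2 .. SYMBOLS/2 - 1 */
int count_occurrences(const int *list, int n, int value) {
    int *block = (int *) calloc(SYMBOLS, sizeof(int));
    int *hash_ptr = block + SYMBOLS / 2;   /* valid for -SYMBOLS/2 <= i < SYMBOLS/2 */
    int i, result;
    for (i = 0; i < n; i++)
        hash_ptr[list[i]]++;               /* negative values index below the midpoint */
    result = hash_ptr[value];
    free(block);
    return result;
}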

The question is a bit vague, so the answer depends on what you want.
A hash table is the correct answer for asking general questions about duplicates, because it allows you to go through each list just once to build a table that will answer most questions; however, some questions will not require one.
Possible cases that seem to answer your question:
Do you just need to know if a certain value is present in each list? - Check through the first list until the value is found. If it is not found, you're done: the answer is no. Repeat for each successive list. If all lists are searched and the value is found in each, it is duplicated in every list. With this approach it is not necessary to look at each value in each list, or even at each list, so this would be the quickest (see the sketch after this list).
Do you need to know whether any duplicates exist at all?
- If any value in a hash table keyed by number has a count greater than 1, there are duplicates... If that is all you need to know, you can quit right there.
Do you need the number of duplicates in each list, separately?
- Multiply each value by the number of lists and add the number of the list in process. Store that as the hash key and count duplicates. When all lists are processed, you have a table that can answer all kinds of questions. To check duplicates for a specific value, multiply it by the list count and examine sequential hash keys. If there is one for each list, the number is present in each list. If all the counts are greater than 1 over that range, the number is duplicated in each list.
Etc.
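A minimal sketch of the first case above (is a value present in every list?), with the lists shown as vectors for brevity; the scan of each list stops as soon as the value is found, and the whole search stops at the first list that misses:
#include <vector>
bool in_every_list(const std::vector<std::vector<int>>& lists, int value) {
    for (const std::vector<int>& list : lists) {
        bool found = false;
        for (int v : list)
            if (v == value) { found = true; break; }  // stop scanning this list early
        if (!found)
            return false;                             // one miss answers the question
    }
    return true;
}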

Related

Interview Q: Given an input array of unknown size with all 1's in the beginning and 0's at the end, find the index in the array where the 0's start

I was asked the following question in a job interview.
Given an input array of unknown size with all 1's in the beginning and 0's at the end, find the index in the array where the 0's start. Consider that there are millions of 1's and 0's in the array, i.e. the array is very big, e.g. the array contents are 1111111.......1100000.........0000000. On later googling the question, I found it on http://www.careercup.com/question?id=2441 .
The most puzzling thing about this question is: if I don't know the size of the array, how do I know whether *(array_name + index) belongs to the array? Even if someone finds an index where the value changes from 1 to 0, how can one assert that the index belongs to the array?
The best answer I could find was an O(log n) solution where one keeps doubling the index until a 0 is found. Again, what is the guarantee that that particular element belongs to the array?
EDIT:
It's a C array. The constraint is that you don't have the index of the end element (you can't use sizeof(arr)/sizeof(arr[0])). What if I am at, say, 1024 and arr[1024] == 1? arr[2048] is out of bounds because the array length is 1029 (unknown to the programmer). So is it okay to use arr[2048] while finding the solution? It's out of bounds and its value can be anything. So I was wondering whether the question is flawed.
If you don't know the length of the array, and can't read past the end of the array (because it might segfault or give you random garbage), then the only thing you can do is start from the beginning and look at each element until you find a zero:
int i = 0;
while (a[i] != 0) i++;
return i;
And you'd better hope there is at least one zero in the array.
If you can find out the length of the array somehow, then a binary search is indeed more efficient.
P.S. If it's a char array, it would be easier and likely faster to just call strlen() on it. The code above is pretty much what strlen() does, except that the standard library implementation is likely to be better optimized for your CPU architecture.
I would go with a Binary Search in order to find the 0.
First take the middle element: if it is 1 you go to the right half, otherwise to the left half. Keep doing this until you find the first 0.
Now, the problem statement says: Given an input array of unknown size with all 1's in the beginning and 0's at the end. An array is laid out in memory one element after another, so since you know that there are 0's at the end of the array, if your algorithm works correctly then *(array_name + index) will surely belong to the array.
Edit :
Sorry, I just realised that this solution only works if you know the size. Otherwise, yes, doubling the index is the best algorithm that comes to my mind too. But the proof that the index still belongs to the array is the same.
Edit due to comment:
It states that at the end of the array there are 0's. Therefore, if you do a simple
int i = 0;
while (1) {
    if (*(array_name + i) != 1)
        return i;
    i++;
}
It should give you the first index, right?
Now, since you know that the array looks like 1111...000000, you also know that at least one of the 0's, namely the first one, surely belongs to the array.
In your case you do the search by doubling the index and then using a binary search between index/2 and index. Here you can't be sure the index belongs to the array, but the first 0 between index/2 and index surely belongs to the array (the statement said there is at least one 0).
Oops... I just realised that if you keep doubling the index and you get out of the array, you will find "garbage values", which means they might not be 0's. So the best thing you can do is, instead of checking for the first 0, check for the first element which is not 1. Sadly there can be garbage values equal to 1 (extremely small chance, but it might happen). If that's the case you will need to use an O(n) algorithm.
If you don't know the size of the array you can start with index = 1. At each step, look at the element at 2 * index: if it is 0 (or you can otherwise tell you have gone past the end of the array), you now have a boundary to start the binary search from; otherwise set index = 2 * index.
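Here is a sketch of that doubling ("galloping") probe followed by a binary search. It assumes the array really is a run of 1's followed by at least one 0, and that every probed index is still inside the array, which is exactly the assumption being debated above:
#include <cstddef>
std::size_t first_zero_index(const int *a) {
    if (a[0] == 0)
        return 0;
    std::size_t hi = 1;                 // galloping phase: double until a 0 is hit
    while (a[hi] != 0)
        hi *= 2;
    std::size_t lo = hi / 2;            // a[lo] is known to be 1, a[hi] is 0
    while (hi - lo > 1) {               // binary search for the first 0 in (lo, hi]
        std::size_t mid = lo + (hi - lo) / 2;
        if (a[mid] != 0)
            lo = mid;
        else
            hi = mid;
    }
    return hi;
}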

Finding all pairs of sequences that differ at exactly one position

I need a data structure representing a set of sequences (all of the same, known, length) with the following non-standard operation:
Find two sequences in the set that differ at exactly one index. (Or establish that no such pair exists.)
If N is the length of the sequences and M the number of sequences, there is an obvious O(N*M*M) algorithm. I wonder if there is a standard way of solving this more efficiently. I'm willing to apply pre-processing if needed.
Bonus points if instead of returning a pair, the algorithm returns all sequences that differ at the same point.
Alternatively, I am also interested in a solution where I can efficiently check whether a particular sequence differs at exactly one index from any sequence in the set. If it helps, we can assume that in the set, no two sequences have that property.
Edit: you can assume N to be reasonably small. By this, I mean improvements such as O(log(N)*M*M) are not immediately useful for my use case.
For each sequence and each position i in that sequence, calculate a hash of the sequence without position i and add it to a hash table. If there is already an entry in the table, you have found a potential pair that differs only in one position. Using rolling hashes from both start and end and combining them, you can calculate each hash in constant time. The total running time is expected O(N*M).
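A sketch of the idea, using the simpler "wildcard key" variant instead of rolling hashes: the key for position i is the sequence with position i blanked out by a sentinel character (assumed not to occur in the data), so two sequences share that key exactly when they agree everywhere except possibly at i. This does O(N) hashing work per position, so O(N*N*M) in total rather than O(N*M), which is fine when N is small as the edit says:
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>
// Returns index pairs (i, j) of sequences that differ at exactly one position.
std::vector<std::pair<int, int>> one_mismatch_pairs(const std::vector<std::string>& seqs) {
    std::vector<std::pair<int, int>> result;
    std::unordered_map<std::string, std::vector<int>> buckets;
    for (int s = 0; s < (int) seqs.size(); ++s)
        for (std::size_t i = 0; i < seqs[s].size(); ++i) {
            std::string key = seqs[s];
            key[i] = '\0';                        // blank out position i
            for (int other : buckets[key])        // same key: equal everywhere but i
                if (seqs[other] != seqs[s])
                    result.push_back({other, s}); // so they differ at exactly position i
            buckets[key].push_back(s);
        }
    return result;
}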
Select j sets of k indexes each randomly (make sure none of the sets overlap).
For each set XOR the elements.
You now have j fingerprints for each document.
Compare sequences based on these fingerprints. If two sequences differ at exactly one position, at least j-1 of their fingerprints will match (only the set containing the differing index can disagree). But the converse might not be true, so you might have to check location by location.
More clarification on comparison part: Sort all fingerprints from all documents (or use hash table). In that way you don't have to compare every pair, but only the pairs that do have a matching fingerprint.
A simple recursive approach:
Find all sets of sequences that have the same first half through sort or hash.
For each of these sets, repeat the whole process now only looking at the second half.
Find all sets of sequences that have the same second half through sort or hash.
For each of these sets, repeat the whole process now only looking at the first half.
When you've reached length 1, all those that don't match are what you're looking for.
Pseudo-code:
findPairs(set, start, end)           // initial call: findPairs(all sequences, 1, N + 1)
    if end - start == 1
        sort set by the single remaining position start
        last = ''
        for each seq in set
            if last != '' and seq != last
                DONE - PAIR FOUND    // they agree everywhere except position start
            last = seq
    else
        mid = (start + end) / 2
        // pairs whose difference lies in [mid, end): group by the first half
        group set by positions start .. mid-1 (sort or hash)
        for each group containing more than one sequence
            findPairs(group, mid, end)
        // pairs whose difference lies in [start, mid): group by the second half
        group set by positions mid .. end-1 (sort or hash)
        for each group containing more than one sequence
            findPairs(group, start, mid)
It should be easy enough to modify the code to be able to find all pairs.
Complexity? I may be wrong, but:
The max recursion depth is log N, since the range of positions halves at each level. (I believe) the worst case is when all sequences are identical; then the work done is roughly O(N*M*log M*log N), which is still better than O(N*M*M).

top-k selection/merge

I have n sorted lists (5 < n < 300). These lists are quite long (300000+ tuples). Selecting the top k of the individual lists is of course trivial - they are right at the head of the lists.
Example for k = 2:
top2 (L1: [ 'a': 10, 'b': 4, 'c':3 ]) = ['a':10 'b':4]
top2 (L2: [ 'c': 5, 'b': 2, 'a':0 ]) = ['c':5 'b':2]
Where it gets more interesting is when I want the combined top k across all the sorted lists.
top2(L1+L2) = ['a':10, 'c':8]
Just combining the top k of the individual lists would not necessarily give the correct result:
top2(top2(L1)+top2(L2)) = ['a':10, 'b':6]
The goal is to reduce the required space and keep the sorted lists small.
top2(topX(L1)+topX(L2)) = ['a':10, 'c':8]
The question is whether there is an algorithm to calculate the combined top k in the correct order while cutting off the long tail of the lists at a certain position. And if there is: how does one find the limit X where it is safe to cut?
Note: Correct counts are not important. Only the order is.
top2(magic([L1,L2])) = ['a', 'c']
This algorithm uses O(U) memory, where U is the number of unique keys. I doubt a lower memory bound can be achieved, because it is impossible to tell which keys can be discarded until all the keys have been summed.
Make a master list of (key: total_count) tuples. Simply run through each list one item at a time, keeping a running total of each key's count.
Use any top-k selection algorithm on the master list that does not use additional memory. One simple solution is to sort the list in place.
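A sketch of those two steps (the names and the (key, count) vector representation of the input lists are assumptions for illustration):
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>
std::vector<std::pair<std::string, long>> top_k(
        const std::vector<std::vector<std::pair<std::string, long>>>& lists,
        std::size_t k) {
    // Step 1: master table of key -> total count, O(U) memory.
    std::unordered_map<std::string, long> totals;
    for (const auto& list : lists)
        for (const auto& kv : list)
            totals[kv.first] += kv.second;
    // Step 2: any top-k selection; here a partial sort by descending total count.
    std::vector<std::pair<std::string, long>> master(totals.begin(), totals.end());
    if (k > master.size()) k = master.size();
    std::partial_sort(master.begin(), master.begin() + k, master.end(),
                      [](const std::pair<std::string, long>& a,
                         const std::pair<std::string, long>& b) { return a.second > b.second; });
    master.resize(k);
    return master;
}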
If I understand your question correctly, the correct output is the top 10 items, irrespective of the list from which each came. If that's correct, then starting with the first 10 items in each list will allow you to generate the correct output (if you only want unique items in the output, but the inputs might contain duplicates, then you need 10 unique items from each list).
In the most extreme case, all the top items come from one list, and all items from the other lists are ignored. In this case, having 10 items in the one list will be sufficient to produce the correct result.
Associate an index with each of your n lists. Set it to point to the first element in each case.
Create a list-of-lists, and sort it by the indexed elements.
The indexed item on the top list in your list-of-lists is your first element.
Increment the index for the topmost list and remove that list from the list-of-lists and re-insert it based on the new value of its indexed element.
The indexed item on the top list in your list-of-lists is your next element
Goto 4 and repeat until done.
You didn't specify how many lists you have. If n is small, then step 4 can be done very simply (just re-sort the lists). As n grows you may want to think about more efficient ways to re-sort an almost-sorted list-of-lists.
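A sketch of that merge, using a priority queue as the sorted list-of-lists (each entry holds one list's currently indexed element); the (key, count) vector representation is an assumption, and each input list is assumed to be sorted by descending count:
#include <queue>
#include <string>
#include <tuple>
#include <utility>
#include <vector>
// Emits all (key, count) tuples across the lists in descending count order.
std::vector<std::pair<std::string, long>> merge_in_order(
        const std::vector<std::vector<std::pair<std::string, long>>>& lists) {
    std::vector<std::pair<std::string, long>> out;
    // (current count, list number, element index): largest current count on top.
    std::priority_queue<std::tuple<long, std::size_t, std::size_t>> pq;
    for (std::size_t l = 0; l < lists.size(); ++l)
        if (!lists[l].empty())
            pq.emplace(lists[l][0].second, l, (std::size_t) 0);   // steps 1-2: one index per list
    while (!pq.empty()) {
        long count = std::get<0>(pq.top());
        std::size_t l = std::get<1>(pq.top());
        std::size_t i = std::get<2>(pq.top());
        pq.pop();
        out.push_back({lists[l][i].first, count});                // steps 3 and 5: take the top element
        if (i + 1 < lists[l].size())
            pq.emplace(lists[l][i + 1].second, l, i + 1);         // step 4: advance and re-insert
    }
    return out;
}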
I did not initially understand that if an 'a' appears in two lists, their counts must be combined. Here is a new, memory-efficient algorithm:
(New) Algorithm:
(Re-)sort each list by ID (not by count). To release memory, the list can be written back to disk. Only enough memory for the longest list is required.
Get the next lowest unprocessed ID and find the total count across all lists.
Insert the ID into a priority queue of k nodes. Use the total count as the node's priority (not the ID). This priority queue drops the lowest node if more than k nodes are inserted.
Go to step 2 until all ID's have been exhausted.
Analysis: This algorithm can be implemented using only O(k) additional memory to store the min-heap. It makes several trade-offs to accomplish this:
The lists are sorted by ID in place; the original orderings by count are lost. Otherwise O(U) additional memory is required to make a master list of (ID: total_count) tuples, where U is the number of unique ID's.
The next lowest ID is found in O(n) time by checking the first tuple of each list. This is repeated U times where U is the number of unique ID's. This might be improved by using a min-heap to track the next lowest ID. This would require O(n) additional memory (and may not be faster in all cases).
Note: This algorithm assumes ID's can be quickly compared. String comparisons are not trivial. I suggest hashing string ID's to integers. They do not have to be unique hashes, but collisions must be checked so all ID's are properly sorted/compared. Of course, this would add to the memory/time complexity.
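A sketch of steps 2-4, assuming step 1 has already re-sorted each list by an integer ID and that each list is available as a vector of (id, count) pairs; only the size-k min-heap is additional memory:
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>
// Returns the k largest (total_count, id) pairs, largest first.
std::vector<std::pair<long, int>> top_k_by_total(
        const std::vector<std::vector<std::pair<int, long>>>& lists,   // each sorted by id
        std::size_t k) {
    std::vector<std::size_t> pos(lists.size(), 0);      // current position in each list
    std::priority_queue<std::pair<long, int>,
                        std::vector<std::pair<long, int>>,
                        std::greater<std::pair<long, int>>> heap;      // min-heap, kept at size <= k
    for (;;) {
        bool any = false;                                // step 2: find the next lowest unprocessed id
        int id = 0;
        for (std::size_t l = 0; l < lists.size(); ++l)
            if (pos[l] < lists[l].size() && (!any || lists[l][pos[l]].first < id)) {
                id = lists[l][pos[l]].first;
                any = true;
            }
        if (!any)
            break;                                       // step 4: all ids exhausted
        long total = 0;                                  // total count for this id across all lists
        for (std::size_t l = 0; l < lists.size(); ++l)
            if (pos[l] < lists[l].size() && lists[l][pos[l]].first == id)
                total += lists[l][pos[l]++].second;
        heap.push({total, id});                          // step 3: keep only the k largest totals
        if (heap.size() > k)
            heap.pop();
    }
    std::vector<std::pair<long, int>> result;
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    std::reverse(result.begin(), result.end());          // largest total first
    return result;
}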
The perfect solution requires all tuples to be inspected at least once.
However, it is possible to get close to the perfect solution without inspecting every tuple. Discarding the "long tail" introduces a margin of error. You can use some type of heuristic to calculate when the margin of error is acceptable.
For example, if there are n=100 sorted lists and you have inspected down each list until the count is 2, the most the total count for a key could increase by is 200.
I suggest taking an iterative approach:
Tally each list until a certain lower count threshold L is reached.
Lower L to include more tuples.
Add the new tuples to the counts tallied so far.
Go to step 2 until lowering L does not change the top k counts by more than a certain percentage.
This algorithm assumes the counts for the top k keys will approach a certain value the further the long tail is traversed. You can use other heuristics instead of a fixed percentage, such as the number of new keys entering the top k, how much the top k keys were reshuffled, etc.
There is a sane way to implement this through mapreduce:
http://www.yourdailygeekery.com/2011/05/16/top-k-with-mapreduce.html
In general, I think you are in trouble. Imagine the following lists:
['a':100, 'b':99, ...]
['c':90, 'd':89, ..., 'b':2]
and you have k=1 (i.e. you want only the top one). 'b' is the right answer, but you need to look all the way down to the end of the second list to realize that 'b' beats 'a'.
Edit:
If you have the right distribution (long, low-count tails), you might be able to do better. Let's stick with k=1 for now to make our lives easier.
The basic algorithm is to keep a hash map of the keys you've seen so far and their associated totals. Walk down the lists processing elements and updating your map.
The key observation is that a key can gain in count by at most the sum of the counts at the current processing point of each list (call that sum S). So on each step, you can prune from your hash map any keys whose total is more than S below your current maximum count element. (I'm not sure what data structure you would need to prune as you need to look up keys given a range of counts - maybe a priority queue?)
When your hash map has only one element in it, and its count is at least S, then you can stop processing the lists and return that element as the answer. If your count distribution plays nice, this early exit may actually trigger so you don't have to process all of the lists.

How can I extract random elements from an array while not producing any duplicates

I am using the rand() function in my iPhone project to generate a random array index. I generate several random indexes and then get the objects from those indexes. However, I don't want to get one object more than once, so is there a way to generate a random number within the range of the array count (which I am already doing) while excluding previously picked numbers?
i.e. something like this:
int one = rand() % arrayCount
int two = rand() % arrayCount != one
Thanks
Three possibilities:
Shuffling
Shuffle your array (or an array of indexes) and extract the elements in that order (see the sketch at the end of this answer).
Remember
Extract a random element and store it into an NSSet. The next time you extract one, check whether it's already in the set. (This is linear time.)
Delete
Use an NSMutableArray and remove already extracted elements from the array. If you don't want to modify the original one, make a mutable copy.
Which one's the best depends on your needs.
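For completeness, a sketch of the shuffling option (a Fisher-Yates shuffle over an index array), written with rand() to mirror the question; the same idea works on an NSMutableArray of indexes:
#include <cstdlib>
#include <utility>
#include <vector>
// Returns the indexes 0 .. arrayCount-1 in random order; consume them one by
// one to pick elements without ever repeating one.
std::vector<int> shuffled_indexes(int arrayCount) {
    std::vector<int> indexes(arrayCount);
    for (int i = 0; i < arrayCount; i++)
        indexes[i] = i;
    for (int i = arrayCount - 1; i > 0; i--) {
        int j = rand() % (i + 1);            // choose from the not-yet-fixed prefix
        std::swap(indexes[i], indexes[j]);
    }
    return indexes;
}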

Generate sequence of integers in random order without constructing the whole list upfront [duplicate]

How can I generate the list of integers from 1 to N but in a random order, without ever constructing the whole list in memory?
(To be clear: Each number in the generated list must only appear once, so it must be the equivalent to creating the whole list in memory first, then shuffling.)
This has been determined to be a duplicate of this question.
A very simple one: the values power(r, x) mod p, for x from 1 to p-1, run through all of 1 to p-1 in a seemingly random (but fixed) order, where p is prime and r is a primitive root modulo p.
Not the whole list technically, but you could use a bit mask to decide if a number has already been selected. This has a lot less storage than the number list itself.
Set all N bits to 0, then for each desired number:
use one of the normal linear congruential methods to select a number from 1 to N.
if that number has already been used, find the next highest unused (0) bit, with wrap.
set that number's bit to 1 and return it.
That way you're guaranteed only one use per number and relatively random results.
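A sketch of that bit-mask scheme, with rand() standing in for the linear congruential generator; it assumes next() is called at most N times:
#include <cstdlib>
#include <vector>
struct UniquePicker {
    std::vector<bool> used;                     // the N-bit mask, all 0 initially
    explicit UniquePicker(int n) : used(n, false) {}
    // Returns a not-yet-used number in 1..N.
    int next() {
        int n = (int) used.size();
        int candidate = rand() % n;             // the random pick
        while (used[candidate])
            candidate = (candidate + 1) % n;    // next highest unused bit, with wrap
        used[candidate] = true;                 // mark it as used
        return candidate + 1;                   // numbers are 1..N
    }
};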
It might help to specify the language you are seeking a solution in.
You could use a dynamic list where you store your generated numbers, since you will need a reference which numbers you already created. Every time you create a new number you could check if the number is contained in the list and throw it away if it is contained and try again.
The only possible way without such a list would be to use a number space large enough that generating a duplicate is unlikely, like a UUID, assuming the algorithm works correctly. But this doesn't guarantee that no duplicate is generated; it is just highly unlikely.
You will need at least half of the total list's memory, just to remember what you did already.
If you are in tough memory conditions, you may try so:
Keep the results generated so far in a tree: generate a number at random and insert it into the tree. If you cannot insert it (it is already there), generate another number and try again, and so on, until the tree is half full.
When the tree is half full, you invert it: construct a tree holding the numbers that you haven't used yet, then pick those in random order.
It has some overhead for keeping the tree structure, but it may help when your pointers are considerably smaller in size than your data is.
