Find duplicate strings in a large file - algorithm

A file contains a large number (e.g. 10 billion) of strings, and you need to find the duplicate strings. You have N systems available. How would you find the duplicates?

erickson's answer is probably the one expected by whoever set this question.
You could use each of the N machines as a bucket in a hashtable:
for each string (say string number i in sequence), compute a hash function on it, h.
send the values of i and h to machine number n for storage, where n = h % N.
from each machine, retrieve a list of all hash values h for which more than one index was received, together with the list of indexes.
check the sets of strings with equal hash values, to see whether they're actually equal.
To be honest, though, for 10 billion strings you could plausibly do this on 1 PC. The hashtable might occupy something like 80-120 GB with a 32-bit hash, depending on the exact hashtable implementation. If you're looking for an efficient solution, you have to be a bit more specific about what you mean by "machine", because it depends on how much storage each one has and on the relative cost of network communication.
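The hash-bucket scheme described above can be sketched on a single process, with one dictionary standing in for each of the N machines. This is only an illustration under assumptions: `stable_hash` and `find_duplicates` are hypothetical names, the 64-bit truncated SHA-1 is an arbitrary choice of hash, and the strings are assumed to be available as an indexable list so that equal hashes can be verified.

```python
import hashlib
from collections import defaultdict

def stable_hash(s: str) -> int:
    """Deterministic 64-bit hash (the built-in hash() of str is salted per process)."""
    return int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")

def find_duplicates(strings, num_machines):
    # One dict per simulated machine: hash value -> list of string indexes.
    machines = [defaultdict(list) for _ in range(num_machines)]
    for i, s in enumerate(strings):
        h = stable_hash(s)
        machines[h % num_machines][h].append(i)

    duplicates = set()
    for bucket in machines:
        for indexes in bucket.values():
            if len(indexes) < 2:
                continue
            # Equal hashes only suggest equality, so compare the strings themselves.
            seen = set()
            for i in indexes:
                if strings[i] in seen:
                    duplicates.add(strings[i])
                seen.add(strings[i])
    return duplicates
```

In a real deployment each dictionary would live on a separate machine, and only the (i, h) pairs would cross the network.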

Split the file into N pieces. On each machine, load as much of the piece into memory as you can, and sort the strings. Write these chunks to mass storage on that machine. On each machine, merge the chunks into a single stream, and then merge the stream from each machine into a stream that contains all of the strings in sorted order. Compare each string with the previous. If they are the same, it is a duplicate.
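A minimal sketch of the sort-and-merge approach, assuming the sorted chunks can be held as in-memory lists (in practice each chunk would be a sorted file on a machine's local disk). The function names and the `chunk_size` parameter are made up for illustration.

```python
import heapq

def sorted_chunks(strings, chunk_size):
    """Yield sorted lists of at most chunk_size strings (stand-ins for on-disk runs)."""
    chunk = []
    for s in strings:
        chunk.append(s)
        if len(chunk) == chunk_size:
            yield sorted(chunk)
            chunk = []
    if chunk:
        yield sorted(chunk)

def duplicates_by_merge(strings, chunk_size=1000):
    """Yield each duplicated string once, in sorted order."""
    runs = list(sorted_chunks(strings, chunk_size))
    previous = None
    last_reported = None
    # heapq.merge lazily merges already-sorted inputs into one sorted stream.
    for s in heapq.merge(*runs):
        if s == previous and s != last_reported:
            last_reported = s
            yield s
        previous = s
```

Because the merged stream is sorted, comparing each string with its predecessor is enough to detect every duplicate.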

Related

Efficient data structure for storing nongrammatical strings

I need a data structure in which I can store objects of variable size and later modify their bytes or remove them, but not change their size (that would be done only by removing an object and reinserting it with the new size). Objects do not need random access, only sequential. I need its memory efficiency to approach 1 as the total memory allocated approaches infinity (assuming all pointers magically require constant space, and we won't be questioning that).
I know about tries and that they are popular for storing strings, but given my needs a trie is just not what I am looking for. My strings will not have "common" morphemes; they will be technical and pseudo-random. I am not storing words.
Another option I came up with is to have a magic constant M and then M vectors, where the k-th vector stores chunks of k bytes plus one pointer which points to another chunk (the previous block in the following context). Additionally, the elements in the M-th vector will have 2 pointers: one for the previous and one for the next chunk. Then I would split the string I am about to insert into chunks of M bytes each and store them in the M-th vector as a linked list. The last chunk, with possibly fewer than M bytes, I would store in the appropriate other vector. When removing a string, I would remove all its chunks from the vectors and then relocate the remaining chunks and reconnect them so that the vectors again consist of consecutive chunks, i.e. don't have holes.
This idea satisfies my needs except for the requirement that efficiency converge to 1. Additionally, there is the cost of M separate vectors, which can't be ignored on real computers.
Is there any other already existing idea which explains how to build this structure?

A large file containing 1 million integers, what would be the fastest way to find the most occurring?

The basic approach would be to use an array or a hashmap to create a histogram of the numbers and select the most frequent.
In this case let's assume that all the numbers from the file cannot be loaded into the main memory.
One way I can think of is to sort using external merge/quick sort and then chunk by chunk calculate the frequency. As they are sorted, we don't have to worry about the number appearing again after the sequence with a number finishes.
Is there a better and more efficient way to do this?
Well, a million isn't so much anymore, so let's assume we're talking about several billion integers.
In that case, I would suggest that you hash them and partition them into 2^N buckets (separate files or preallocated parts of the same file) using the top N bits of their hash values.
You would choose N so that the resulting buckets were highly likely to be small enough to process in memory.
You would then process each bucket by counting the occurrences of each unique value in a hash table or similar.
In the unlikely event that a bucket has too many unique values to fit in RAM, repartition using the next N bits of the hash and try again.
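The partition-then-count idea can be sketched as follows, with in-memory lists standing in for the bucket files. The names `hash64` and `most_frequent` and the `bits` parameter (the N in "top N bits") are assumptions for illustration; the repartitioning fallback for oversized buckets is omitted.

```python
import hashlib
from collections import Counter

def hash64(n: int) -> int:
    """64-bit hash of an integer, so that the top bits are well mixed."""
    digest = hashlib.blake2b(str(n).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def most_frequent(numbers, bits=4):
    # Partition into 2**bits buckets by the top `bits` bits of the hash.
    buckets = [[] for _ in range(1 << bits)]
    for n in numbers:
        buckets[hash64(n) >> (64 - bits)].append(n)

    best_value, best_count = None, 0
    for bucket in buckets:
        if not bucket:
            continue
        # All copies of a value land in the same bucket, so per-bucket counts are exact.
        value, count = Counter(bucket).most_common(1)[0]
        if count > best_count:
            best_value, best_count = value, count
    return best_value, best_count
```

Only one bucket needs to be counted at a time, which is what keeps the memory requirement bounded when the buckets are files.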

Best algorithm to find N unique random numbers in VERY large array

I have an array with, for example, 1000000000000 elements (integers). What is the best approach to pick, for example, only 3 random and unique elements from this array? The elements must be unique in the whole array, not just within the list of N (3 in my example) picked elements.
I read about reservoir sampling, but it only provides a method to pick random numbers, which can be non-unique.
If the odds of hitting a non-unique value are low, your best bet will be to select 3 random numbers from the array, then check each against the entire array to ensure it is unique - if not, choose another random sample to replace it and repeat the test.
If the odds of hitting a non-unique value are high, this increases the number of times you'll need to scan the array looking for uniqueness and makes the simple solution non-optimal. In that case you'll want to split the task of ensuring unique numbers from the task of making a random selection.
Sorting the array is the easiest way to find duplicates. Most sorting algorithms are O(n log n), but since your keys are integers Radix sort can potentially be faster.
Another possibility is to use a hash table to find duplicates, but that will require significant space. You can use a smaller hash table or Bloom filter to identify potential duplicates, then use another method to go through that smaller list.
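import random
# Counting approach: only feasible if the range MININT..MAXINT (assumed known) is small enough for a table.
counts = [0] * (MAXINT - MININT + 1)
for value in elements:
    counts[value - MININT] += 1
# Keep the values (not the counts) that occur exactly once.
uniques = [i + MININT for i, c in enumerate(counts) if c == 1]
result = random.sample(uniques, 3)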
I assume that you have a reasonable idea what fraction of the array values are likely to be unique. So you would know, for instance, that if you picked 1000 random array values, the odds are good that one is unique.
Step 1. Pick 3 random hash algorithms. They can all be the same algorithm, except that you add different integers to each as a first step.
Step 2. Scan the array. Hash each integer all three ways, and for each hash algorithm, keep track of the X lowest hash codes you get (you can use a priority queue for this), and keep a hash table of how many times each of those integers occurs.
Step 3. For each hash algorithm, look for a unique element in that bucket. If it is already picked in another bucket, find another. (Should be a rare boundary case.)
That is your set of three random unique elements. Every unique triple should have even odds of being picked.
(Note: For many purposes it would be fine to just use one hash algorithm and find 3 things from its list...)
This algorithm will succeed with high likelihood in one pass through the array. What is better yet is that the intermediate data structure that it uses is fairly small and is amenable to merging. Therefore this can be parallelized across machines for a very large data set.
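Here is a rough sketch of the single-hash variant mentioned in the note above: one pass keeps the `keep` values with the lowest hash codes together with exact occurrence counts, then returns the first `k` kept values that occur once. The names, the salted BLAKE2 hash, and the parameters are all assumptions; the result is pseudo-random in the sense that it is determined by the salt.

```python
import hashlib
import heapq

def salted_hash(value: int, salt: int) -> int:
    data = f"{salt}:{value}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def sample_unique(array, k=3, keep=1000, salt=0):
    counts = {}  # value -> occurrences, only for the values currently kept
    heap = []    # max-heap on hash code, stored as (-hash, value)
    for value in array:
        if value in counts:
            counts[value] += 1
            continue
        h = salted_hash(value, salt)
        if len(heap) < keep:
            heapq.heappush(heap, (-h, value))
            counts[value] = 1
        elif h < -heap[0][0]:
            # Evict the kept value with the largest hash; the threshold only
            # decreases, so an evicted value can never re-enter the store.
            _, evicted = heapq.heapreplace(heap, (-h, value))
            del counts[evicted]
            counts[value] = 1
    # Walk the kept values from lowest hash upwards and keep the unique ones.
    by_hash = sorted(heap, reverse=True)
    return [value for _, value in by_hash if counts[value] == 1][:k]
```

If fewer than `k` unique values are among the kept ones, the function returns a shorter list, and the scan would have to be repeated with a larger `keep` or another salt.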

Sorting with limited memory and read-only disk

Imagine the following scenario: I have a 10 Mb array of integers stored on a read-only storage medium. I wish to print out the numbers in ascending order. However, I only have 2 Mb of main memory (and no hard disk).
A very simple O(n^2) solution (which doesn't make use of the available main memory) would be to repeatedly scan the entire input array and incrementally output the next smallest integer. I've tried googling for better sorting algorithms, but the answers keep leading me to in-place or external sorting algorithms, which would not work because of the read-only storage constraint. Is there a better solution?
You can use the main memory to reduce the number of scans quite dramatically, given the ratio of sizes you mentioned.
First scan: Keep an in-memory store, nearly as large as main memory, holding the smallest numbers found so far. While the store is not yet full, add each number read from the array. Once the store is full, compare each new number to the largest number in the store; if the new one is smaller, remove the largest number and add the new one. When the complete array has been scanned, output the stored numbers in order, and remember the largest number stored and how often it occurred in this chunk.
Subsequent scans: If the number scanned equals the largest number from the previous chunk and its occurrence count in the current scan is still smaller than the remembered count, increment the occurrence count but don't add the number to the store. If its occurrence count is larger than or equal to the remembered count, add the number to the store (removing the largest number from the store if necessary). If the scanned number is larger than the largest number of the previous scan, but smaller than the largest number in the store (or the store is not yet full), add it to the store (removing the largest number if necessary). When the scan is complete, output the stored numbers in order, and remember the largest number output so far and the number of times it has been output in total (the largest number might be the same as the one from the previous scan, so you need to know how often it was output in all chunks treated so far).
I'm not sure what the best data structure for the store would be, but I think a heap would be a good choice (comparison with largest: O(1), replacing: O(log size), final sorting for output: O(size*log size), practically no memory overhead as you would have with a binary search tree).
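The multi-scan procedure with a heap as the store could look roughly like this. It is a sketch under assumptions: `read_input` is a hypothetical callable that returns a fresh iterator over the read-only input for each scan, `capacity` is the number of values that fit in memory, and the temporary sorted copy of the store is made only for brevity.

```python
import heapq

def sorted_scan(read_input, capacity):
    """Yield all input values in ascending order, holding at most `capacity` in the store."""
    last = None   # largest value output so far
    emitted = 0   # how many copies of `last` have been output in total
    while True:
        store = []    # max-heap of the smallest candidates, via negated values
        skipped = 0   # copies of `last` skipped during this scan
        for x in read_input():
            if last is not None:
                if x < last:
                    continue  # already output in an earlier scan
                if x == last and skipped < emitted:
                    skipped += 1  # this copy was already output
                    continue
            if len(store) < capacity:
                heapq.heappush(store, -x)
            elif x < -store[0]:
                heapq.heapreplace(store, -x)  # drop the largest, keep x
        if not store:
            return
        chunk = sorted(-v for v in store)
        yield from chunk
        top = chunk[-1]
        top_count = chunk.count(top)
        emitted = emitted + top_count if top == last else top_count
        last = top
```

Each scan outputs up to `capacity` values, so the number of scans is about the input size divided by `capacity` (roughly five for the 10 MB and 2 MB figures in the question).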

Algorithm for generating a list of recurring pairs

Given a text file in the format below, each line is a list of up to 50
names. Write a program that produces a list of pairs of names which appear
together in at least fifty different lists.
Tyra,Miranda,Naomi,Adriana,Kate,Elle,Heidi
Daniela,Miranda,Irina,Alessandra,Gisele,Adriana
In the above sample, Miranda and Adriana appear together twice, but
every other pair appears only once. It should return
"Miranda,Adriana\n". An approximate solution may be returned with
lists which appear at least 50 times with high probability.
I was thinking of the following solution:
Generate a Map <Pair,Integer> pairToCountMap, after reading through the file.
Iterate through the map, and print those with counts >= 50
Is there a better way to do this? The file could be very large, and I'm not sure what is meant by the approximate solution. Any links or resources would be much appreciated.
First let's assume that names are limited in length, so operations on them are constant time.
Your answer should be acceptable if it fits in memory. If you have N lines with m names each, your solution should take O(N*m*m) to complete.
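For reference, the in-memory pair-counting approach from the question might look like this. The function name and the `threshold` parameter are illustrative, and names are assumed to be comma-separated with no quoting.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(lines, threshold=50):
    counts = Counter()
    for line in lines:
        # Deduplicate and sort so each pair is counted once per line, in a canonical order.
        names = sorted({name.strip() for name in line.split(",") if name.strip()})
        counts.update(combinations(names, 2))
    return [pair for pair, count in counts.items() if count >= threshold]
```

With the two sample lines from the question and `threshold=2`, the only pair returned is `("Adriana", "Miranda")`.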
If that data set doesn't fit in memory, you can write the pairs to a file, sort that file using a merge sort, then scan through to count pairs. There are O(N*m*m) pairs, so the running time of this is O(N*m*m*log(N*m)), but because a merge sort reads and writes the disk sequentially, it will run much faster in practice than an approach that accesses the disk at random.
If you have a distributed cluster, then you could use a MapReduce. It would run very similarly to the last solution.
As for the statistics approach, my guess is that they mean running through the list of files to find the frequency of each name, and the number of lines with different numbers of names in them. If we assume that each line is a random assortment of names, using statistics we can estimate how many intersections there are between any pair of common names. This will be roughly linear in the length of the file.
You can for each name obtain the list of the line numbers where it appears (use a hashtable to store the names), then for every pair of names get the size of the intersection of the corresponding line indices (in the case of two increasing sequences this is linear time).
Say the length of a name is limited by a constant. So if you have N names and M lines, then building the list is like O(MN) and the final stage is O(N^2 M).
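A sketch of this inverted-index approach, using Python sets for the line numbers instead of sorted lists (the intersection is still linear in the size of the smaller set on average). The names are illustrative, and the pre-filter on rare names is an added shortcut that is not part of the answer above.

```python
from collections import defaultdict
from itertools import combinations

def pairs_by_intersection(lines, threshold=50):
    # name -> set of line numbers on which the name appears
    line_sets = defaultdict(set)
    for number, line in enumerate(lines):
        for name in line.split(","):
            name = name.strip()
            if name:
                line_sets[name].add(number)
    # A name on fewer than `threshold` lines cannot be part of a qualifying pair.
    frequent = sorted(n for n, s in line_sets.items() if len(s) >= threshold)
    return [
        (a, b)
        for a, b in combinations(frequent, 2)
        if len(line_sets[a] & line_sets[b]) >= threshold
    ]
```

The pre-filter matters when most names are rare, because the pairwise stage is quadratic in the number of names that survive it.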

Resources