What algorithm to use to delete duplicates? - algorithm

Imagine that we have some file, called, for example, "A.txt". We know that there are some duplicate elements. "A.txt" is very big, more than ten times bigger than memory, maybe around 50 GB. Sometimes the size of the deduplicated file B will be approximately equal to the size of A, and sometimes it will be many times smaller than the size of A.
Let it have a structure like this:
a 1
b 2
c 445
a 1
We need to get a file "B.txt" that will not have such duplicates. For this example, it should be:
a 1
b 2
c 445
I thought about an algorithm that copies A to B, then takes the first line in B and looks through all the others, deleting any duplicates it finds; then takes the second line, and so on.
But I think it is way too slow. What can I use?
A is not a database! No SQL, please.
Sorry, I didn't say it before: sorting is OK.
Although it can be sorted, what if it could not be sorted?

One solution would be to sort the file, then copy one line at a time to a new file, filtering out consecutive duplicates.
Then the question becomes: how do you sort a file that is too big to fit in memory?
Here's how Unix sort does it.
See also this question.
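If sorting is acceptable, the dedup step after the sort is a single streaming pass, since duplicates end up on adjacent lines. A minimal Python sketch, assuming A.txt has already been sorted externally into a hypothetical A.sorted.txt (for example by Unix sort):

# Keep only the first line of each run of equal, adjacent lines in a sorted file.
prev = None
with open("A.sorted.txt") as src, open("B.txt", "w") as dst:
    for line in src:
        if line != prev:          # duplicates are consecutive after sorting
            dst.write(line)
            prev = line

With GNU sort you can combine both steps: sort -u A.txt > B.txt sorts and drops duplicate lines in one command.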

Suppose you can fit 1/k'th of the file into memory and still have room for working data structures. The whole file can be processed in k or fewer passes, as below, and this has a chance of being much faster than sorting the whole file depending on line lengths and sort-algorithm constants. Sorting averages O(n ln n) and the process below is O(k n) worst case. For example, if lines average 10 characters and there are n = 5G lines, ln(n) ~ 22.3. In addition, if your output file B is much smaller than the input file A, the process probably will take only one or two passes.
Process:
1) Allocate a few megabytes for an input buffer I, a few gigabytes for a result buffer R, and a gigabyte or so for a hash table H. Open input file F and output file O.
2) Repeat: fill I from F and process it into R, via step 3.
3) For each line L in I, check if L is already in H and R. If so, go on to the next L; else add L to R and its hash to H.
4) When R is full, say with M entries, write it to O. Then repeatedly fill I from F, dedup as in step 3, and write to O. At EOF(F), go to step 5.
5) Repeat (using the old O as input F and a new O for output): read M lines from F and copy them to O. Then load R and H as in steps 2 and 3, and copy to EOF(F) with dedup as before. Set M to the new number of non-duplicated lines at the beginning of each O file.
Note that after each pass, the first M lines of O contain no duplicates, and none of those M lines are duplicated in the rest of O. Thus, at least 1/k'th of the original file is processed per pass, so processing takes at most k passes.
Update 1: Instead of repeatedly writing out and reading back in the already-processed leading lines, a separate output file P should be used, to which process buffer R is appended at the end of each pass. This cuts the amount of reading and writing by a factor of k/2 when result file B is nearly as large as A, or by somewhat less when B is much smaller than A; but in no case does it increase the amount of I/O.
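A rough Python sketch of this multi-pass scheme, closest to the Update 1 variant: each pass appends its newly accepted lines to the result file and defers the rest to a temporary file for the next pass. The cap max_keys is a placeholder standing in for the R and H buffers; a deferred line can never duplicate a line accepted in an earlier pass, because such duplicates are dropped in the pass where that line was first accepted.

import os

def dedup_in_passes(src_path, dst_path, max_keys=50_000_000):
    # Remove duplicate lines from src_path, appending unique lines to dst_path.
    # Each pass keeps at most max_keys distinct lines in memory.
    pending = src_path
    pass_no = 0
    open(dst_path, "w").close()                  # start with an empty result file (P)
    while True:
        pass_no += 1
        deferred_path = dst_path + ".pass%d.tmp" % pass_no
        seen = set()
        deferred = 0
        with open(pending) as f, open(dst_path, "a") as out, open(deferred_path, "w") as rest:
            for line in f:
                if line in seen:
                    continue                     # duplicate of a line accepted in this pass
                if len(seen) < max_keys:
                    seen.add(line)               # accept: cannot match anything accepted earlier
                    out.write(line)
                else:
                    rest.write(line)             # defer to the next pass
                    deferred += 1
        if pending != src_path:
            os.remove(pending)                   # drop the previous pass's temporary file
        if deferred == 0:
            os.remove(deferred_path)
            return
        pending = deferred_path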

You will essentially have to build up a searchable result set (if that language reminds you of database technology, this is no accident, no matter how much you hate the fact that databases deal with the same questions as you do).
One of the possible efficient data structures for that is either a sorted range (implementable as a tree of some sort) or a hash table. So as you process your file, you insert each record into your result set, efficiently, and at that stage you get to check whether the record already exists. When you're done, you will have a reduced set of unique records.
Rather than duplicating the actual record, your result set could also store a reference of some sort to any one of the original records. It depends on whether the records are large enough to make that a more efficient solution.
Or you could simply add a mark to the original data whether or not the record is to be included.
(Also consider using an efficient storage format like NetCDF for binary data, as a textual representation is far far slower to access and process.)
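A hedged Python sketch of the "store a reference" variant: the result set keeps only a hash of each accepted record plus its byte offset in the original file, and re-reads the original record through a second file handle to rule out hash collisions. The function and file names are illustrative.

def dedup_by_offset(src_path, dst_path):
    # Map hash(record) -> list of byte offsets of already-accepted records in src_path.
    # Only hashes and offsets live in memory; records are re-read to confirm a match.
    seen = {}
    with open(src_path, "rb") as src, \
         open(src_path, "rb") as probe, \
         open(dst_path, "wb") as dst:
        while True:
            offset = src.tell()
            line = src.readline()
            if not line:
                break
            duplicate = False
            for prev_offset in seen.get(hash(line), ()):
                probe.seek(prev_offset)
                if probe.readline() == line:     # equal record, not merely equal hash
                    duplicate = True
                    break
            if not duplicate:
                seen.setdefault(hash(line), []).append(offset)
                dst.write(line)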

Related

External merge sort algorithm

I am having some trouble understanding the merge step in the external sort algorithm. I saw this example on Wikipedia but could not understand it.
One example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:
1) Read 100 MB of the data in main memory and sort by some conventional method, like quicksort.
2) Write the sorted data to disk.
3) Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (there are 900MB / 100MB = 9 chunks), which now need to be merged into one single output file.
4) Read the first 10 MB (= 100MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
5) Perform a 9-way merge and store the result in the output buffer. If the output buffer is full, write it to the final sorted file, and empty it. If any of the 9 input buffers gets empty, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available.
I am not able to understand the 4th step here. Why are we reading only the first 10 MB of each chunk when we have 100 MB of available memory? How do we decide the number of passes in an external merge? Will we sort each chunk and store them in 9 files?
Suppose that you've broken apart the range to be sorted into k sorted blocks of elements. If you can perform a k-way merge of these sorted blocks and write the result back to disk, then you'll have sorted the input.
To do a k-way merge, you store k read pointers, one per file, and repeatedly look at all k elements, take the smallest, then write that element to the output stream and advance the corresponding read pointer.
Now, since you have all the data stored in files on disk, you can't actually store pointers to the elements that you haven't yet read because you can't fit everything into main memory.
So let's start with a simple way to simulate what the normal merge algorithm would do. Suppose that you store an array of k elements in memory. You read one element from each file into each array slot. Then, you repeat the following:
Scan across the array slots and take the smallest.
Write that element to the output stream.
Replace that array element by reading the next value from the corresponding file.
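In Python, this element-at-a-time merge looks roughly like the sketch below (names are illustrative; it assumes text chunk files whose line order matches plain string comparison):

def naive_kway_merge(chunk_paths, out_path):
    files = [open(p) for p in chunk_paths]
    current = [f.readline() for f in files]       # one pending element per file
    with open(out_path, "w") as out:
        while any(current):
            # index of the smallest pending value, ignoring exhausted files
            i = min((j for j, v in enumerate(current) if v), key=lambda j: current[j])
            out.write(current[i])
            current[i] = files[i].readline()      # one disk read per element written
    for f in files:
        f.close()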
This approach will work correctly, but it's going to be painfully slow. Remember that disk I/O operations take much, much longer than the corresponding operations in main memory. This merge algorithm ends up doing Θ(n) disk reads (I assume k is much less than n), since every time the next element is chosen, we need to do another read. This is going to be prohibitively expensive, so we need a better approach.
Let's consider a modification. Now, instead of storing an array of k elements, one per file, we store an array of k slots, each of which holds the first R elements from the corresponding file. To find the next element to output, we scan across the array and, for each slot, look at the first element we haven't yet considered. We take the minimum of those values, write it to the output, then remove that element from its slot. If this empties out one of the slots in the array, we replenish it by reading R more elements from the file.
This is more complicated, but it significantly cuts down on how many disk reads we need to do. Specifically, since the elements are read in blocks of size R, we only need to do Θ(n / R) disk reads.
We could take a similar approach for minimizing writes. Instead of writing every element to disk one at a time (requiring Θ(n) writes), we store a buffer of size W, accumulating elements into it as we go and only writing the buffer once it fills up. This requires Θ(n / W) disk writes.
Clearly, making R and W bigger will make this approach go a lot faster, but at the cost of more memory. Specifically, we need space for kR items to store k copies of the read buffers of size R, and we need space for W items to store the write buffer of size W. Therefore, we need to pick R and W so that kR + W items fit into main memory.
In the example given above, you have 100MB of main memory and 900MB to sort. If you split the array into 9 pieces, then you need to pick R and W so that (kR + W) · sizeof(record) ≤ 100MB. If every item is one byte, then picking R = 10MB and W = 10MB ensures that everything fits. This is also probably a pretty good distribution, since it keeps the number of reads and writes low.
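A Python sketch of that buffered scheme, with buffer sizes expressed in lines rather than bytes purely for readability; the function and parameter names are made up, and r_lines and w_lines play the roles of R and W from the discussion above.

import itertools

def refill(f, r_lines):
    # One large sequential read: up to r_lines lines from file f.
    return list(itertools.islice(f, r_lines))

def buffered_kway_merge(chunk_paths, out_path, r_lines=100_000, w_lines=100_000):
    files = [open(p) for p in chunk_paths]
    buffers = [refill(f, r_lines) for f in files]     # k input buffers of size R
    positions = [0] * len(files)                      # next unconsumed index per buffer
    out_buffer = []                                   # the write buffer of size W
    with open(out_path, "w") as out:
        while True:
            best = None
            for i in range(len(files)):
                if positions[i] == len(buffers[i]):   # input buffer drained: refill from disk
                    buffers[i] = refill(files[i], r_lines)
                    positions[i] = 0
                if buffers[i] and (best is None
                                   or buffers[i][positions[i]] < buffers[best][positions[best]]):
                    best = i
            if best is None:
                break                                 # every chunk is exhausted
            out_buffer.append(buffers[best][positions[best]])
            positions[best] += 1
            if len(out_buffer) >= w_lines:
                out.writelines(out_buffer)            # flush the write buffer in one go
                out_buffer = []
        out.writelines(out_buffer)                    # flush whatever is left
    for f in files:
        f.close()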

Comparing five different sources

I need to write a function which will compare 2-5 "files" (well really 2-5 sets of database rows, but similar concept), and I have no clue of how to do it. The resulting diff should present the 2-5 files side by side. The output should show added, removed, changed and unchanged rows, with a column for each file.
What algorithm should I use to traverse rows so as to keep complexity low? The number of rows per file is less than 10,000. I probably won't need External Merge as total data size is in the megabyte range. Simple and readable code would of course also be nice, but it's not a must.
Edit: the files may be derived from some unknown source, there is no "original" to which the other 1-4 files can be compared to; all files will have to be compared to the others in their own right somehow.
Edit 2: I, or rather my colleague, realized that the contents may be sorted, as the output order is irrelevant. This solution means using additional domain knowledge in this part of the application, but also that diff complexity becomes O(N) and the code less complicated. This solution is simple, so I'll disregard any answers to this edit when I close the bounty. However, I'll answer my own question for future reference.
If all of the n files (where 2 <= n <= 5 for the example) have to be compared to the others, then it seems to me that the number of combinations to compare will be C(n,2), defined by (in Python, for instance) as:
import math

def C(n, k):
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
Thus, you would have 1, 3, 6 or 10 pairwise comparisons for n = 2, 3, 4, 5 respectively.
The time complexity would then be C(n,2) times the complexity of the pairwise diff algorithm that you chose to use, which would be an expected O(ND), in the case of Myers' algorithm, where N is the sum of the lengths of the two sequences to be compared, A and B, and D is the size of the minimum edit script for A and B.
I'm not sure about the environment in which you need this code but difflib in Python, as an example, can be used to find the differences between all sorts of sequences - not just text lines - so it might be useful to you. The difflib documentation doesn't say exactly what algorithm it uses, but its discussion of its time complexity makes me think that it is similar to Myers'.
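For instance, SequenceMatcher works on any sequences of hashable items, so database rows represented as tuples can be diffed directly (the rows below are invented):

import difflib

rows_a = [("a", 1), ("b", 2), ("c", 445)]
rows_b = [("a", 1), ("c", 445), ("d", 7)]

sm = difflib.SequenceMatcher(a=rows_a, b=rows_b)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    # tag is one of 'equal', 'replace', 'delete', 'insert'
    print(tag, rows_a[i1:i2], rows_b[j1:j2])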
Pseudo code (for Edit 2):
10: stored cells = <empty list>
for each column:
    if cell < stored cells:
        stored cells = cell
    elif cell == lastCell:
        stored cells += cell
if stored cells == <empty>:
    return result
result += stored cells
goto 10
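One possible Python reading of that pseudocode (a sketch, assuming the rows of each file are comparable and each list is already sorted): walk all the row lists in lockstep, emit the smallest current row, and report which files contain it.

def compare_sorted(sources):
    # sources: list of sorted row lists. Yields (row, present) where present[i]
    # is True if source i has that row at its current position.
    indexes = [0] * len(sources)
    while True:
        candidates = [(sources[i][indexes[i]], i)
                      for i in range(len(sources)) if indexes[i] < len(sources[i])]
        if not candidates:
            return
        smallest = min(row for row, _ in candidates)
        present = [False] * len(sources)
        for row, i in candidates:
            if row == smallest:
                present[i] = True
                indexes[i] += 1          # advance only the sources that had this row
        yield smallest, present

for row, present in compare_sorted([[1, 2, 4], [1, 3, 4]]):
    print(row, present)    # (1, [True, True]), (2, [True, False]), (3, [False, True]), ...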
The case of 2 files can be solved with a standard diff algorithm.
From 3 files on you can use a "majority vote" algorithm:
If more than half of the records are the same (2 out of 3, 3 out of 4, or 3 out of 5), then those form the reference against which the other record(s) are considered changed.
This also means quite a speedup for the algorithm if the number of changes is comparatively low.
Pseudocode:
initialize as many line indexes as there are files
while there are still at least 3 indexes incrementable
    if all indexed records are the same
        increment all line indexes
    else
        // at least one is different - check majority vote
        if there is a majority
            mark minority changes, increment all line indexes
        else
            mark minority additions (maybe randomly deciding e.g. in a 2:2 vote)
            check addition or removal and set line indexes accordingly
            increment all indexes
        endif
    endif
endwhile
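The majority-vote test at a single position reduces to counting equal records among the currently indexed ones. A small Python fragment of just that test (assuming records are hashable), not of the full alignment loop:

from collections import Counter

def majority(records):
    # Return the record shared by more than half of the files at the current
    # position, or None if there is no majority (e.g. a 2:2 vote).
    value, count = Counter(records).most_common(1)[0]
    return value if count > len(records) / 2 else None

# majority(["x", "x", "y"]) -> "x";  majority(["x", "y", "z", "w"]) -> None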

How can I sort a partitioned array efficiently?

I have K files. I call them X1, X2, ..., XK.
Each of these files is an N x 1 array of doubles.
It means that I actually have an NK x 1 array, partitioned into K arrays. Let's call this large array X.
I need to sort X and I cannot load all data into the memory. What is the efficient algorithm to perform this sort and save the results in separate files?
I know how to do it (though of course I'm not sure it is efficient) if I just want the first H elements of the sorted result:
1) sort X1 and save it as sX1
2) A = sX1(1:H,1) //in Matlab
3) sort X2 and A
4) repeat steps 1, 2 and 3 for the other files
But H cannot be very large, again because of memory problems.
Update
The Sort with the limited memory question is different from this question, although it helped. If I want to use that question's answer or MikeB's answer, then this should be answered too:
Should I merge the K files into one file and then use an external sort algorithm? If yes, how?
Thanks.
What you're attempting is called an external sort. Each partition gets sorted by itself. Then, you have to merge all the partitions to build the final sorted list. If you're only looking for the top few items you can exit the merge early.
There seem to be a few existing MATLAB solutions for external merges. Here's a link to one over at the MathWorks File Exchange site: http://www.mathworks.com/matlabcentral/fileexchange/29306-external-merge-sort/content/ext_merge/merge.m
Update: the code I linked shows how it's done in MATLAB. Specifically, the code here: http://www.mathworks.com/matlabcentral/fileexchange/29306-external-merge-sort/content/ext_merge/extmerge.m takes a list of files that need to be merged, and eventually merges them to one file.
In your original problem statement, you said you have K files, from X1 thru XK. An external sort first sorts those files, then merges them into one file. A simple implementation would have pseudocode like this:
// external merge-sort algorithm
For each file F in (X1 ... XK)
    Read file F into memory array R
    Sort R
    Overwrite file F with sorted data from R
    Clear array R in memory

For N = K-1 down to 1
    in-order merge file XN+1 and XN into file X'
    erase file XN+1 and XN
    rename file X' as XN
You should see that the first phase is to sort. We read each file into memory, sort it, and write it back out. This is I/O, but it's efficient; hopefully, we're using as much memory as possible so that we sort in memory as much as we can. At the end of that first loop, we have K files, each one sorted within its own domain of values.
Given the K sorted files, our next step is to merge them. Merging two files needs almost no memory, but does lots of I/O. Merging two files looks like this: given two files named L and R, we can merge them into O:
// merge two files algorithm
Get value LV from L
Get value RV from R
While L is not EOF AND R is not EOF
    if ( LV <= RV )
        write LV into O
        get value LV from L
    else
        write RV into O
        get value RV from R
While L is not EOF
    write LV into O
    get value LV from L
While R is not EOF
    write RV into O
    get value RV from R
The second loop in the merge-sort merges two files, XN+1 and XN, into a single file XN. It loops through each of your files and merges them. This reads and re-writes lots of data, and you can get a bit more efficient than that by handling multiple files in a loop. But it works fine as I've written it.
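Since the same two phases are easy to express in Python, here is a hedged sketch (not MATLAB): each chunk file is assumed to hold one double per line, and heapq.merge replaces the pairwise cascade of the second loop with a single k-way streaming merge that holds only one value per chunk in memory.

import heapq

def external_sort(chunk_paths, out_path):
    # Phase 1: sort each chunk in memory and write it back (each Xi fits in RAM).
    for path in chunk_paths:
        with open(path) as f:
            values = [float(line) for line in f]
        values.sort()
        with open(path, "w") as f:
            f.writelines("%r\n" % v for v in values)
    # Phase 2: one k-way merge over all sorted chunks, streaming to the output file.
    files = [open(p) for p in chunk_paths]
    try:
        streams = [(float(line) for line in f) for f in files]
        with open(out_path, "w") as out:
            for v in heapq.merge(*streams):
                out.write("%r\n" % v)
    finally:
        for f in files:
            f.close()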

Parallel Subset

The setup: I have two arrays which are not sorted and are not of the same length. I want to see if one of the arrays is a subset of the other. Each array is a set in the sense that there are no duplicates.
Right now I am doing this sequentially in a brute-force manner, so it isn't very fast. I have been having trouble finding any algorithms online that A) go faster and B) are in parallel. Say the maximum size of either array is N; right now it is scaling something like N^2. I was thinking maybe if I sorted them and did something clever I could bring it down to something like N log(N), but I'm not sure.
The main thing is I have no idea how to parallelize this operation at all. I could just do something like each processor looks at an equal amount of the first array and compares those entries to all of the second array, but I'd still be doing N^2 work. But I guess it'd be better since it would run in parallel.
Any ideas on how to improve the work and make it parallel at the same time?
Thanks
Suppose you are trying to decide if A is a subset of B, and let len(A) = m and len(B) = n.
If m is a lot smaller than n, then it makes sense to me that you sort A, and then iterate through B, doing a binary search on A for each element to see if there is a match or not. You can partition B into k parts and have a separate thread iterate through each part doing the binary search.
To count the matches you can do two things. Either you could have a num_matched variable that is incremented every time you find a match (you would need to guard this variable with a mutex, though, which might hinder your program's concurrency) and then check whether num_matched == m at the end of the program. Or you could have another array or bit vector of size m, and have a thread set the k'th entry if it found a match for the k'th element of A. Then, at the end, you make sure this array is all 1's. (On second thought, a bit vector might not work out without a mutex, because threads might overwrite each other's updates when they load and store the integer containing the bit relevant to them.) The array approach, at least, would not need any mutex that can hinder concurrency.
Sorting would cost you m log(m) and then, if you only had a single thread doing the matching, that would cost you n log(m). So if n is a lot bigger than m, this would effectively be n log(m). Your worst case still remains N log(N), but I think concurrency would really help you a lot here to make this fast.
Summary: Just sort the smaller array.
Alternatively if you are willing to consider converting A into a HashSet (or any equivalent Set data structure that uses some sort of hashing + probing/chaining to give O(1) lookups), then you can do a single membership check in just O(1) (in amortized time), so then you can do this in O(n) + the cost of converting A into a Set.
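A Python sketch of the sort-the-smaller-array approach with B partitioned across threads (names are made up; in CPython the GIL limits real parallelism, so treat this as an illustration of the structure rather than a performance recipe):

from bisect import bisect_left
from threading import Thread

def is_subset(a, b, num_threads=4):
    # True if every element of a occurs in b (both arrays are duplicate-free).
    a_sorted = sorted(a)                      # m log m
    matched = [False] * len(a_sorted)         # the size-m array discussed above

    def scan(chunk):
        for x in chunk:                       # each thread scans one slice of b
            i = bisect_left(a_sorted, x)      # log m binary search into sorted a
            if i < len(a_sorted) and a_sorted[i] == x:
                matched[i] = True             # entries are only ever set to True, so no lock is needed

    step = max(1, (len(b) + num_threads - 1) // num_threads)
    threads = [Thread(target=scan, args=(b[k:k + step],))
               for k in range(0, len(b), step)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(matched)

# The hashing alternative from the last paragraph is a one-liner: set(a).issubset(b)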

Detect largest suffix of some file that is prefix of another file

I have two files - let's call them file0 and file1.
What I would like to get is a fast algorithm for the following problem (it is clear to me how to write a rather slow algorithm that solves it):
Detect the largest suffix of file0 that is a prefix of file1, that means a memory block B (or more precisely: the number of bytes of such a memory block) of maximum length so that
file0 consists of some memory block A, followed by B
file1 consists of memory block B, followed by some memory block C
Note that the blocks A, B and C can also have a length of zero bytes.
Edit (to answer drysdam's remark): the obvious, rather slow algorithm that I thought of (pseudocode): let the lengths of the files be bounded by m and n, with wlog m <= n.
for each length from m down to 0
    compare the last length bytes of file0 with the first length bytes of file1
    if they are equal
        return length
This is obviously an O(m*min(m, n)) algorithm. If the files are about the same size this is O(n^2).
The files that I have to handle are currently between 10 and a few hundred megabytes in size. But in extreme cases they can also be a few gigabytes - just big enough not to fit into the 32-bit address space of x86 anymore.
Consider treating your bytes as numbers 0..255 held as integers mod p, where p is a prime, optionally much larger than 255. Here are two ways of computing b0*x^2 + b1*x + b2:
(b0*x + b1)*x + b2
b0*x^2 + (b1*x + b2).
Therefore I can compute this quantity efficiently either by working from left to right - multiplying by x and adding b2 - or by working from right to left - adding b0*x^2.
Pick a random x and compute this working from right to left in AB and from left to right in BC. If the values computed match, you note down the location. Later do a slow check of all the matches starting with the longest to see if the B really is identical in both cases.
What is the chance of a match at random? If you have a false match then (a0 - c0)*x^2 + (a1 - c1)*x + (a2 - c2) = 0. A polynomial of degree d has at most d roots, so if x is random the chance of a false match is at most d / p, and you can make this small by working mod p for suitably large p. (If I remember rightly there is a scheme for message authentication which has this idea at its heart).
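A Python sketch of this rolling-hash idea. For clarity it reads both files into memory and stores one hash per length, which defeats the purpose for multi-gigabyte files; the same two recurrences work streaming, with file0 read backwards in blocks. Here p is the Mersenne prime 2^61 - 1 and x is the random evaluation point.

import random

def max_overlap(path0, path1, p=(1 << 61) - 1):
    # Length of the longest suffix of path0 that is a prefix of path1.
    with open(path0, "rb") as f0, open(path1, "rb") as f1:
        data0, data1 = f0.read(), f1.read()
    x = random.randrange(256, p)
    limit = min(len(data0), len(data1))

    # hash of every prefix of file1, left to right: h <- h*x + byte  (mod p)
    prefix_hash = [0] * (limit + 1)
    h = 0
    for i in range(limit):
        h = (h * x + data1[i]) % p
        prefix_hash[i + 1] = h

    # hash of every suffix of file0, right to left: h <- byte*x^len + h  (mod p)
    suffix_hash = [0] * (limit + 1)
    h, power = 0, 1
    for i in range(limit):
        h = (data0[len(data0) - 1 - i] * power + h) % p
        power = (power * x) % p
        suffix_hash[i + 1] = h

    # check candidates from the longest down, verifying bytes to rule out false matches
    for length in range(limit, 0, -1):
        if suffix_hash[length] == prefix_hash[length] and data0[-length:] == data1[:length]:
            return length
    return 0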
Depending on how much memory you have available, you may want to consider building a suffix tree for the first file. Once you have this, you can find the prefix of the second file that maximally overlaps with a suffix of the first file by just walking the suffix tree down from the root along the edges matching the letters of the prefix of the second file. Since suffix trees can be built in linear time, the runtime for this algorithm is O(|A| + |B|) using your terminology, since it takes O(|A| + |B|) time to build the suffix tree and O(|B|) time to walk the suffix tree to find the block B.
If it is not an academic assignment, then it might make sense to implement the simplest solution and see how it behaves on your data.
For example, a theoretically more efficient Knuth-Morris-Pratt-based solution can perform worse than an IndexOf-based solution (see Overlap Detection).
For large files your program might spend all its time waiting for I/O.

Resources