I have K files, which I call X1, X2, ..., XK.
Each of these files is an N x 1 array of doubles.
This means I effectively have an NK x 1 array, partitioned into K arrays. Let's call this large array X.
I need to sort X, and I cannot load all of the data into memory. What is an efficient algorithm to perform this sort and save the results in separate files?
I know how to do it (though I'm not sure it's efficient) if I just want the first H elements of the sorted result:
1. sort X1 and save it as sX1
2. A = sX1(1:H,1) % in MATLAB
3. sort X2 and A together
4. repeat steps 1, 2 and 3 for the other files
But H cannot be very large, again because of memory problems.
Update
The "Sort with the limited memory" question is different from this one, although it helped. If I want to use that question's answer, or MikeB's answer, then this should be answered too:
Should I merge the K files into one file and then use an external sort algorithm? If so, how?
Thanks.
What you're attempting is called an external sort. Each partition gets sorted by itself. Then, you have to merge all the partitions to build the final sorted list. If you're only looking for the top few items you can exit the merge early.
There seem to be a few existing MATLAB solutions for external merges. Here's a link to one over at the MathWorks File Exchange site: http://www.mathworks.com/matlabcentral/fileexchange/29306-external-merge-sort/content/ext_merge/merge.m
Update: the code I linked shows how it's done in MATLAB. Specifically, the code here: http://www.mathworks.com/matlabcentral/fileexchange/29306-external-merge-sort/content/ext_merge/extmerge.m takes a list of files that need to be merged, and eventually merges them down to one file.
In your original problem statement, you said you have K files, from X1 thru XK. An external sort first sorts those files, then merges them into one file. A simple implementation would have pseudocode like this:
// external merge-sort algorithm
For each file F in (X1 ... XK)
    Read file F into memory array R
    Sort R
    Overwrite file F with sorted data from R
    Clear array R in memory

For N = K-1 down to 1
    in-order merge file XN+1 and XN into file X'
    erase file XN+1 and XN
    rename file X' as XN
You should see that the first phase is to sort. We read each file into memory, sort it, and write it back out. This is I/O, but it's efficient; hopefully, we're using as much memory as possible so that we sort in memory as much as we can. At the end of that first loop, we have K files, each one sorted within its own domain of values.
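For concreteness, here is a minimal Python sketch of that first phase (Python rather than MATLAB, and assuming each partition is a flat binary file of little-endian doubles named X1.bin ... XK.bin; those names and that format are assumptions, so adapt them to however your data is actually stored):

import array

K = 4                                          # hypothetical number of partition files

def sort_partition_in_place(path):
    """Read one partition, sort it in memory, and write it back sorted."""
    values = array.array('d')                  # 'd' = C double / float64
    with open(path, 'rb') as f:
        values.frombytes(f.read())             # load the whole partition
    values = array.array('d', sorted(values))  # in-memory sort
    with open(path, 'wb') as f:
        f.write(values.tobytes())              # overwrite with sorted data

for k in range(1, K + 1):
    sort_partition_in_place(f"X{k}.bin")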
Given the K sorted files, our next step is to merge them. Merging two files needs almost no memory (just the two current values), but does lots of I/O. Given two files named L and R, merging them into O looks like this:
// merge two files algorithm
Get value LV from L
Get value RV from R
While L is not EOF AND R is not EOF
    if ( LV <= RV )
        write LV into O
        get value LV from L
    else
        write RV into O
        get value RV from R
While L is not EOF
    write LV into O
    get LV from L
While R is not EOF
    write RV into O
    get RV from R
The second loop in the merge-sort merges two files, XN+1 and XN, into a single file XN. It loops through each of your files and merges them. This reads and re-writes a lot of data, and you could get a bit more efficient by merging more than two files at a time. But it works fine as written.
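And here is the two-file merge as a small Python sketch under the same assumptions (raw little-endian float64 files, with made-up names L.bin, R.bin and O.bin); only the two current values live in memory:

import struct

def read_double(f):
    chunk = f.read(8)
    return struct.unpack('<d', chunk)[0] if chunk else None   # None signals EOF

def merge_files(left_path, right_path, out_path):
    with open(left_path, 'rb') as L, open(right_path, 'rb') as R, \
         open(out_path, 'wb') as O:
        lv, rv = read_double(L), read_double(R)
        while lv is not None and rv is not None:
            if lv <= rv:
                O.write(struct.pack('<d', lv)); lv = read_double(L)
            else:
                O.write(struct.pack('<d', rv)); rv = read_double(R)
        while lv is not None:                  # drain whichever run remains
            O.write(struct.pack('<d', lv)); lv = read_double(L)
        while rv is not None:
            O.write(struct.pack('<d', rv)); rv = read_double(R)

merge_files("L.bin", "R.bin", "O.bin")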
Related
There is decent literature about merging sorted files, or, say, merging K sorted files. It all works on the idea that the first element of each file is put in a heap; then, until the heap is empty, you poll that element and fetch another element from the file the polled one came from. This works as long as one record from each file can be put in the heap.
Now let us say I have N sorted files but I can only bring K records into the heap, with K < N; say N = K*c, where "c" is a multiplier implying that N is so large that it is some multiple of K. Clearly, this will require doing a K-way merge over and over until we are left with only K files, which we then merge one last time into the final sorted output. How do I implement this, and what will be the complexity?
There are multiple examples of k-way merge written in Java. One is http://www.sanfoundry.com/java-program-k-way-merge-algorithm/.
To implement your merge, you just have to write a simple wrapper that continually scans your directory, feeding files to the merge until there's only one left. The basic idea is:
while number of files > 1
    fileList = load all file names
    i = 0
    while i < fileList.length
        filesToMerge = copy files i through i+k-1 from fileList
        merge(filesToMerge, output file name)
        i += k
    end while
end while
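For what it's worth, here is a Python sketch of that wrapper; the run*.txt naming pattern and the one-value-per-line text format are assumptions. heapq.merge does the heap-based k-way merge described above: it keeps one record per input in a heap and, after yielding the smallest, pulls the next record from the file that record came from.

import glob, heapq, itertools, os

K = 16                                         # how many files to merge at once

def merge_group(paths, out_path):
    files = [open(p) for p in paths]
    with open(out_path, 'w') as out:
        out.writelines(heapq.merge(*files, key=float))   # heap-based k-way merge
    for f in files:
        f.close()
        os.remove(f.name)                      # the merged inputs are no longer needed

counter = itertools.count()
while True:
    runs = sorted(glob.glob("run*.txt"))
    if len(runs) <= 1:
        break                                  # one file left: fully sorted
    for i in range(0, len(runs), K):
        # output also matches run*.txt, so it joins the next pass
        merge_group(runs[i:i + K], f"run_merged{next(counter)}.txt")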
Complexity analysis
This is easier to think about if we assume that each file contains the same number of items.
You have to merge M files, each of which contains n items, but you can only merge k files at a time. So you have to do log_k(M) passes. That is, if you have 1,024 files and you can only merge 16 at a time, then you'll make one pass that merges 16 files at a time, creating a total of 64 files. Then you'll make another pass that merges 16 files at a time, creating four files, and your final pass will merge those four files to create the output.
If you have k files, each of which contains n items, then the complexity of merging them is O(n*k log2 k).
So in the first pass you do M/k merges, each of which has complexity O(n*k log2 k). That's O((M/k) * n * k * log2 k), or O(Mn log k).
After that pass, each of your files contains n*k items, and you do M/k^2 merges of k files each. So the second-pass complexity is O((M/k^2) * n * k^2 * log2 k). Simplified, that, too, works out to O(Mn log k).
Note that in every pass you're working with M*n items, so each pass you do is O(Mn log k). And you're doing log_k(M) passes. So the total complexity is O(log_k(M) * Mn log k), and since log_k(M) * log k = log M, that simplifies to:
O(Mn log M)
The assumption that every file contains the same number of items doesn't affect the asymptotic analysis because, as I've shown, every pass manipulates the same number of items: M*n.
These are just my thoughts.
I would do it iteratively. First I would do p = floor(n/k) k-way merges to get p sorted files (plus the leftover n mod k files). Then I would keep repeating this on the p + n mod k files until p + n mod k becomes less than k, and then finally merge those into the single sorted file.
Does that make sense?
I'm watching the Coursera Princeton algorithms lecture on merge sort, and I understand all of the analysis except for the claim that merge sort uses at most 6 n log n array accesses.
Why 6?
To get 6 array accesses per element, consider a somewhat inefficient merge process:
read  - read an element from the even run for the compare
read  - read an element from the odd run for the compare
      - compare
read  - read the lower element again for the copy
write - write the lower element to the output array
...then, after the merge, copy back:
read  - read an element from the output array
write - write the element back to the original array
The normal case is one read and one write for every element moved, but consider the case where elements are too large to fit in a variable, like a string, so after a compare, the string to be moved has to be read again.
Usually the copy back operation can be avoided, depending on how the merge sort is coded.
I also wondered about the 6. I watched Tim Roughgarden's analysis (video '1 - 7 - Merge Sort- Analysis (9 min).mp4') and he says six as well. Each explanation feels like hand waving, but maybe because it's so simple they didn't realize it needed explanation:
You access the array twice for each index k when you copy to the auxiliary array:
aux[k] = a[k];
Then, in the worst case you never exhaust a sub-array early (which would let you skip the compare and test only the plain index variables), so you have four more array accesses per k:
else if (less(aux[j], aux[i])) a[k] = aux[j++];
Either the compare succeeds and the right half's element is moved (e.g., if the input is in reverse order), or the compare fails and the else branch moves the left half's element (correct order), or some combination of the two. That doesn't matter; by definition of the worst case you can't escape the array accesses via the index-only checks (the if (i > mid) or else if (j > hi) branches), so you get two accesses for the compare and two for the move, four more for each k. The total is 6.
(Each line of code is Sedgewick's, from p. 271 of his text.)
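For what it's worth, here is a small Python illustration (not Sedgewick's code) that counts array reads and writes for one merge with the same copy-to-aux-then-merge-back structure; 6 accesses per output position is an upper bound, short only at the tail of the worst case once one run finally runs out:

def merge_count(a, aux, lo, mid, hi):
    """Merge a[lo..mid] and a[mid+1..hi]; return the number of array accesses."""
    accesses = 0
    for k in range(lo, hi + 1):                # copy to aux: 1 read + 1 write per k
        aux[k] = a[k]
        accesses += 2
    i, j = lo, mid + 1
    for k in range(lo, hi + 1):
        if i > mid:                            # left run exhausted: no compare needed
            a[k] = aux[j]; j += 1; accesses += 2
        elif j > hi:                           # right run exhausted: no compare needed
            a[k] = aux[i]; i += 1; accesses += 2
        elif aux[j] < aux[i]:                  # the compare reads both: +2
            accesses += 2
            a[k] = aux[j]; j += 1; accesses += 2
        else:                                  # the failed compare still read both: +2
            accesses += 2
            a[k] = aux[i]; i += 1; accesses += 2
    return accesses

a = [1, 3, 5, 7, 2, 4, 6, 8]                   # interleaved runs: compares almost never stop early
print(merge_count(a, a[:], 0, 3, 7), "<=", 6 * len(a))   # 46 <= 48
print(a)                                       # [1, 2, 3, 4, 5, 6, 7, 8]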
Assume that we are working with a language which stores arrays in column-major order. Assume also that we have a function which uses 2-D array as an argument, and returns it.
I'm wondering whether you can claim that it is (or isn't) in general beneficial to transpose the array when calling the function, in order to work with column-wise operations instead of row-wise operations, or does the transposing negate the benefits of column-wise operations?
As an example, in R I have an object of class ts named y which has dimension n x p, i.e. I have p time series of length n.
I need to make some computations with y in Fortran, where I have two loops with following kind of structure:
do i = 1, n
  do j = 1, p
    ! just an example, some row-wise operations on `y`
    x(i,j) = a*y(i,j)
    D = ddot(m, y(i,1:p), 1, b, 1)
    ! ...
  end do
end do
As Fortran (like R) uses column-major storage, it would be better to do the computations with a p x n array instead. So instead of
out<-.Fortran("something",y=array(y,dim(y)),x=array(0,dim(y)))
ynew<-out$y
x<-out$x
I could use
out<-.Fortran("something2",y=t(array(y,dim(y))),x=array(0,dim(y)[2:1]))
ynew<-t(out$y)
x<-t(out$x)
where Fortran subroutine something2 would be something like
do i = 1, n
  do j = 1, p
    ! just an example, some column-wise operations on `y`
    x(j,i) = a*y(j,i)
    D = ddot(m, y(1:p,i), 1, b, 1)
    ! ...
  end do
end do
Does the choice of approach always depend on the dimensions n and p, or is it possible to say that one approach is better in terms of computation speed and/or memory requirements? In my application n is usually much larger than p, which is between 1 and 10 in most cases.
More of a comment, but I wanted to put in a bit of code: under old-school F77 you would essentially be forced to use the second approach, as
y(1:p,i)
is simply a pointer to y(1,i), with the column's p values contiguous in memory.
the first construct
y(i,1:p)
is a list of values spaced out in memory, so it seems to require making a copy of the data to pass to the subroutine. I say it seems because I haven't the foggiest idea how a modern optimizing compiler deals with these things. I tend to think at best it's a wash; at worst this could really hurt. Imagine an array so large that you need to swap pages to access the whole vector.
In the end, the only way to answer this is to test it yourself.
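One rough way to see the effect, from Python/NumPy rather than Fortran, is to time summing the columns of the same data stored column-major (contiguous columns, as in Fortran/R) versus row-major (strided columns); the dimensions here are made up:

import timeit
import numpy as np

n, p = 1_000_000, 10
y_f = np.asfortranarray(np.random.rand(n, p))  # column-major, like Fortran/R
y_c = np.ascontiguousarray(y_f)                # same values, row-major layout

cols_contiguous = lambda: sum(y_f[:, j].sum() for j in range(p))
cols_strided    = lambda: sum(y_c[:, j].sum() for j in range(p))

print("contiguous columns:", timeit.timeit(cols_contiguous, number=50))
print("strided columns:   ", timeit.timeit(cols_strided, number=50))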
----------edit
Did a little testing and confirmed my hunch: passing rows y(i,1:p) does cost you vs. passing columns y(1:p,i). I used a subroutine that does practically nothing, to isolate the difference. My guess is that with any real subroutine the hit is negligible.
Btw (and maybe this helps explain what's going on), passing every other value in a column, y(1:p:2,i), takes longer (by orders of magnitude) than passing the whole column, while passing every other value in a row cuts the time in half vs. passing a whole row.
(using gfortran 12..)
Merge sort divides the list into the smallest unit (1 element), then compares each element with the adjacent list to sort and merge the two adjacent lists. Finally all the elements are sorted and merged.
I want to implement the merge sort algorithm in such a way that it divides the list into a smallest unit of two elements, and then sorts and merges them.
How can I implement that?
MERGE-SORT (A, p, r)
    IF p < r                          // Check for base case
        THEN q = FLOOR[(p + r)/2]     // Divide step
            MERGE-SORT (A, p, q)      // Conquer step.
            MERGE-SORT (A, q + 1, r)  // Conquer step.
            MERGE (A, p, q, r)        // Combine step.
something like p < r+1 .
I've done something that sounds like this before. Here are two variations.
Variation 1: Go through the list, sorting each pair. Then go through the list, merging each pair of pairs. Then each pair of 4s, and so on. When you've merged the whole list, you're done. (There's a sketch of this below.)
Variation 2: Keep a stack of sorted arrays. Each element merges into the bottom array and then cascades, merging down, until either only one array remains or the second array from the top is larger than the top one. After your last element has been added, collapse the stack by merging everything.
The case where I used variation 2 was one where I had a very large amount of data streaming in. I kept the first few stacks of sorted arrays in memory, with later ones stored on disk. This led to good locality of reference and efficient use of the disk. (You ask why I didn't use an off-the-shelf solution? Well, the dataset I had coming in was bigger than the disk I had to handle it on, there was custom merging logic in there, and the sort really wasn't that hard to write.)
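Here is a minimal in-memory Python sketch of Variation 1 (nothing in it is specific to your data): sort each adjacent pair directly, then repeatedly merge adjacent runs of equal width until one run covers the whole list.

def merge(left, right):
    """Standard two-run merge."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def bottom_up_mergesort(a):
    n = len(a)
    for i in range(0, n - 1, 2):               # pass 0: sort each adjacent pair in place
        if a[i] > a[i + 1]:
            a[i], a[i + 1] = a[i + 1], a[i]
    width = 2                                  # then merge runs of width 2, 4, 8, ...
    while width < n:
        for lo in range(0, n, 2 * width):
            a[lo:lo + 2 * width] = merge(a[lo:lo + width],
                                         a[lo + width:lo + 2 * width])
        width *= 2
    return a

print(bottom_up_mergesort([5, 2, 9, 1, 7, 3, 8, 6, 4]))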
Imagine that we have some file, called, for example, "A.txt". We know that there are some duplicate elements. "A.txt" is very big, more than ten times bigger than memory, maybe around 50 GB. Sometimes the size of B will be approximately equal to the size of A; sometimes it will be many times smaller than the size of A.
Let it have a structure like this:
a 1
b 2
c 445
a 1
We need to get a file "B.txt" that does not have such duplicates. For example, it should be:
a 1
b 2
c 445
I thought about an algorithm that copies A to B, then takes the first line of B and compares it against every other line, deleting any duplicates it finds; then takes the second line, and so on.
But I think it is way too slow. What can I use?
A is not a database! No SQL, please.
Sorry, I didn't say it: sorting is OK.
Although it can be sorted, what if it could not be sorted?
One solution would be to sort the file, then copy one line at a time to a new file, filtering out consecutive duplicates.
Then the question becomes: how do you sort a file that is too big to fit in memory?
Here's how Unix sort does it.
See also this question.
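The filtering step itself is tiny; here is a Python sketch with made-up file names that assumes A has already been sorted externally (e.g. by Unix sort) and keeps only the previous line in memory while dropping consecutive duplicates:

def drop_consecutive_duplicates(src_path, dst_path):
    """Copy src to dst, skipping lines identical to the previous line."""
    prev = None
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            if line != prev:
                dst.write(line)
            prev = line

drop_consecutive_duplicates("A.sorted.txt", "B.txt")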
Suppose you can fit 1/k'th of the file into memory and still have room for working data structures. The whole file can be processed in k or fewer passes, as below, and this has a chance of being much faster than sorting the whole file depending on line lengths and sort-algorithm constants. Sorting averages O(n ln n) and the process below is O(k n) worst case. For example, if lines average 10 characters and there are n = 5G lines, ln(n) ~ 22.3. In addition, if your output file B is much smaller than the input file A, the process probably will take only one or two passes.
Process:
1. Allocate a few megabytes for an input buffer I, a few gigabytes for a result buffer R, and a gigabyte or so for a hash table H. Open input file F and output file O.
2. Repeat: fill I from F and process it into R, via step 3.
3. For each line L in I, check whether L is already in H and R. If so, go on to the next L; else add L to R and its hash to H.
4. When R is full, say with M entries, write it to O. Then repeatedly fill I from F, dedup as in step 3, and write to O. At EOF(F) go to step 5.
5. Repeat (using the old O as input F and a new O for output): read M lines from F and copy them to O. Then load R and H as in steps 2 and 3, and copy to EOF(F) with dedup as before. Set M to the new number of non-dupped lines at the beginning of each O file.
Note that after each pass, the first M lines of O contain no duplicates, and none of those M lines are duplicated in the rest of O. Thus, at least 1/k'th of the original file is processed per pass, so processing takes at most k passes.
Update 1 Instead of repeatedly writing out and reading back in the already-processed leading lines, a separate output file P should be used, to which process buffer R is appended at the end of each pass. This cuts the amount of reading and writing by a factor of k/2 when result file B is nearly as large as A, or by somewhat less when B is much smaller than A; but in no case does it increase the amount of I/O.
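Here is a rough Python sketch of one pass of the process above, with the Update 1 tweak of appending unique lines to a separate output file. A Python set of the actual lines stands in for both R and H; the file names and the memory budget are made up, and a real implementation would track memory use more carefully:

import os

def one_pass(in_path, keep_path, spill_path, budget_bytes=1 << 30):
    """Keep as many new unique lines as fit in the budget; spill the rest."""
    kept = set()                               # plays the role of R and H together
    used = 0
    with open(in_path, 'rb') as f, \
         open(keep_path, 'ab') as keep, open(spill_path, 'wb') as spill:
        for line in f:
            if line in kept:
                continue                       # duplicates a line kept this pass
            if used < budget_bytes:
                kept.add(line)
                keep.write(line)               # appended to the dedup'd output P
                used += len(line)
            else:
                spill.write(line)              # re-examined on the next pass

src, pass_no = "A.txt", 0
while True:                                    # at most about k passes
    spill = f"spill{pass_no}.txt"
    one_pass(src, "B.txt", spill)
    if pass_no > 0:
        os.remove(src)                         # the previous spill is no longer needed
    if os.path.getsize(spill) == 0:
        os.remove(spill)
        break
    src, pass_no = spill, pass_no + 1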
You will essentially have to build up a searchable result set (if the language reminds you of database technology, this is no accident, no matter how much you hate the fact that databases deal with the same questions as you do).
Possible efficient data structures for that are a sorted range (implementable as a tree of some sort) or a hash table. So as you process your file, you insert each record into your result set, efficiently, and at that stage you get to check whether the record already exists. When you're done, you will have a reduced set of unique records.
Rather than duplicating the actual record, your result set could also store a reference of some sort to any one of the original records. It depends on whether the records are large enough to make that a more efficient solution.
Or you could simply add a mark to the original data whether or not the record is to be included.
(Also consider using an efficient storage format like NetCDF for binary data, as a textual representation is far far slower to access and process.)