Two-pass multi-way merge sort? - sorting

If I have a relation (SQL) that does not fit in memory and I want to sort it using TPMMS (the two-pass multi-way merge sort method), how would I divide the table into sub-tables (and how many) that can fit in memory and then merge them?
Let's say I am using C#.

I've not hunted down the current definition of a two-pass multi-way merge sort, but the theory of 'external sorting' (where the data is too big to fit in memory) is pretty much standard. Any decent book on algorithms will cover it; amongst many others, you could look at Knuth, Sedgewick, or (for the software archaeologists) Kernighan & Plauger Software Tools.
The basic technique is simple:
1. Read data until there is no space left.
2. Sort it.
3. Write it to a temporary file.
4. Repeat from step 1 until there is no data left to read.
5. You now know how many temporary files you have, N.
6. You need to determine how many of those files you can read at one time, M.
7. If N > M, then you design your merging phase so that the last phase merges M files.
8. You merge sets of M files into new temporary files until you reach the last merge.
9. You merge the final set of M files (or N if N < M), writing to the final destination.
All dreadfully standard - but there are nit-picky details to get right.
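For concreteness, here is a minimal sketch of the two passes in C++ (the question mentions C#, but the structure is the same in any language). The file names and the run size are placeholders, and it assumes all runs can be open at once for the final merge, i.e. N <= M; if not, first merge groups of M runs into larger runs as described above.

    // Two-pass external merge sort sketch for newline-separated records.
    // Pass 1: read up to RUN_SIZE records, sort in memory, write a sorted run.
    // Pass 2: k-way merge of all runs via a min-heap, writing the final output.
    #include <algorithm>
    #include <fstream>
    #include <queue>
    #include <string>
    #include <vector>

    int main() {
        const std::size_t RUN_SIZE = 1000000;   // records per in-memory run; tune to RAM
        std::ifstream in("input.txt");          // hypothetical input file
        std::vector<std::string> runNames;

        // Pass 1: produce sorted runs.
        std::vector<std::string> buf;
        std::string line;
        auto flush = [&]() {
            if (buf.empty()) return;
            std::sort(buf.begin(), buf.end());
            std::string name = "run" + std::to_string(runNames.size()) + ".tmp";
            std::ofstream out(name);
            for (const auto& s : buf) out << s << '\n';
            runNames.push_back(name);
            buf.clear();
        };
        while (std::getline(in, line)) {
            buf.push_back(line);
            if (buf.size() >= RUN_SIZE) flush();
        }
        flush();

        // Pass 2: merge all runs, always emitting the smallest current record.
        std::vector<std::ifstream> runs;
        for (const auto& name : runNames) runs.emplace_back(name);

        using Item = std::pair<std::string, std::size_t>;   // (record, which run)
        auto cmp = [](const Item& a, const Item& b) { return a.first > b.first; };  // min-heap
        std::priority_queue<Item, std::vector<Item>, decltype(cmp)> heap(cmp);

        for (std::size_t i = 0; i < runs.size(); ++i)
            if (std::getline(runs[i], line)) heap.push({line, i});

        std::ofstream out("sorted.txt");
        while (!heap.empty()) {
            auto [rec, idx] = heap.top();
            heap.pop();
            out << rec << '\n';
            if (std::getline(runs[idx], line)) heap.push({line, idx});
        }
    }

Pass 1 is the first pass and the heap-based merge is the second, which is what makes it "two-pass" as long as all the runs fit into a single merge.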
There was a good article in AT&T's Unix System Readings Volume II called 'Theory and Practice in the Construction of a Working Sort Routine' that you should find and read if you are serious about learning how to handle external sorts. However, when you read it, remember that machines have changed dramatically since it was written, with gigabytes of main memory (instead of megabytes) and terabytes of disk space (or SSD — also instead of megabytes).

How can I efficiently sort a list of variable sized elements?

My motivation for this is to write some Z80 Assembly code to sort the TI-83+ series' Variable Allocation Table (VAT), but I am also interested in this as a general problem.
The part of the VAT that I want to sort is arranged in contiguous memory, with each element consisting of some fixed-size data, followed by a size byte for the name, and then the name itself. To complicate matters, there are two stacks located on either side of the VAT, offering no wiggle room to safely pad it with allocated RAM.
Ideally, I'd want to use O(1) space, as I have ready access to two 768-byte non-user RAM buffers. I also want it to be fast, since the table can contain many entries and this is a 6 MHz processor (effectively 1 MIPS, though, as there is no instruction pipeline). It's also important to note that each entry is at least 8 bytes and at most 15 bytes long.
The best approach I've been able to think up relies on block memory transfers, which aren't particularly fast on the Z80. In the past, others have implemented an insertion sort, but it wasn't particularly efficient. And while I can (and have) written code to collect pointers to all of the entries into an array and sort those, it requires a variable amount of space, so I have to allocate user RAM, which is already in short supply.
I feel like it vaguely reminds me of some combinatorial trick I came across once, but for the life of me, a good solution to this problem has evaded me. Any help would be much appreciated.
Divide the table into N pieces, each of which is small enough to be sorted by your existing code using the fixed-size temporary buffers available. Then perform a merge sort on the N lists to produce the final result.
Instead of an N-way merge, it may be easiest to sort the N pieces pairwise using 2-way merges.
When sorting each piece, it may be an advantage to use hash codes to avoid string comparisons. Radix sorting might also provide some benefit.
For copying data, the Z80's block move instructions LDIR and LDDR are fairly expensive but hard to beat. Unrolling LDIR into a series of LDI can be faster. Pointing the stack pointer at the source and destination and using multiple POPs and then PUSHes can be faster still, but requires that interrupts be disabled and a guarantee that no non-maskable interrupts occur.
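For what it's worth, here is a hedged C++ sketch of one pairwise merge step over variable-length entries, just to show the control flow (on the Z80 the copies would become LDIR/LDI sequences). The 6-byte fixed part is an assumption, the left run is assumed to fit in one 768-byte scratch buffer, and the larger merges at later levels would need an in-place technique that this sketch does not cover.

    // Merge two adjacent sorted runs of variable-length entries laid out
    // contiguously as [FIXED bytes of data][name length byte][name bytes].
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    constexpr int FIXED = 6;   // assumed size of the fixed-size data per entry

    inline int entryLen(const std::uint8_t* e) { return FIXED + 1 + e[FIXED]; }

    // Order entries by name (compare the name bytes, then the length), like strcmp.
    inline int cmpEntries(const std::uint8_t* a, const std::uint8_t* b) {
        int la = a[FIXED], lb = b[FIXED];
        int c = std::memcmp(a + FIXED + 1, b + FIXED + 1, la < lb ? la : lb);
        return c != 0 ? c : la - lb;
    }

    // Merge sorted runs [lo, mid) and [mid, hi). The left run is copied to the
    // scratch buffer first, so the output pointer can never overtake the unread
    // part of the right run.
    void mergeRuns(std::uint8_t* lo, std::uint8_t* mid, std::uint8_t* hi,
                   std::uint8_t* scratch) {
        std::size_t leftBytes = mid - lo;        // must fit the 768-byte scratch buffer
        std::memcpy(scratch, lo, leftBytes);
        const std::uint8_t* a = scratch;
        const std::uint8_t* aEnd = scratch + leftBytes;
        std::uint8_t* b = mid;
        std::uint8_t* out = lo;
        while (a < aEnd && b < hi) {
            if (cmpEntries(a, b) <= 0) {
                int n = entryLen(a);
                std::memcpy(out, a, n);  out += n;  a += n;
            } else {
                int n = entryLen(b);
                std::memmove(out, b, n); out += n;  b += n;  // may overlap exactly
            }
        }
        while (a < aEnd) {
            int n = entryLen(a);
            std::memcpy(out, a, n); out += n; a += n;
        }
        // Any remaining right-run entries are already in their final positions.
    }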

Linked List vs Vector

Over the past few days I have been preparing for my very first phone interview for a software development job. In researching questions I came across this article.
Everything was great until I got to this passage:
"When would you use a linked list vs. a vector? "
Now from experience and research these are two very different data structures, a linked list being a dynamic array and a vector being a 2d point in space. The only correlation I can see between the two is if you use a vector as a linked list, say myVector(my value, pointer to neighbor)
Thoughts?
Vector is another name for a dynamic array. It is the name used for the dynamic array data structure in C++. If you have experience in Java you may know it by the name ArrayList. (Java also has an old collection class called Vector that is not used nowadays because of problems in how it was designed.)
Vectors are good for random read access and for insertion and deletion at the back (amortized constant time), but bad for insertions and deletions at the front or any other position (linear time, as items have to be moved). Vectors are usually laid out contiguously in memory, so traversing one is efficient because the CPU memory cache gets used effectively.
Linked lists on the other hand are good for inserting and deleting items in the front or back (constant time), but not particularly good for much else: For example deleting an item at an arbitrary index in the middle of the list takes linear time because you must first find the node. On the other hand, once you have found a particular node you can delete it or insert a new item after it in constant time, something you cannot do with a vector. Linked lists are also very simple to implement, which makes them a popular data structure.
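A small self-contained C++ illustration of the trade-off described above (the values are arbitrary):

    #include <iostream>
    #include <list>
    #include <vector>

    int main() {
        std::vector<int> v = {1, 2, 3, 4, 5};
        std::list<int> l = {1, 2, 3, 4, 5};

        v[2] = 42;               // random access: O(1) for a vector
        // l[2] = 42;            // not possible: std::list has no operator[]

        v.insert(v.begin(), 0);  // front insert: O(n), shifts every element
        l.push_front(0);         // front insert: O(1) for a list

        // Traversal works for both, but the vector's contiguous layout
        // makes it far more cache-friendly in practice.
        for (int x : v) std::cout << x << ' ';
        std::cout << '\n';
        for (int x : l) std::cout << x << ' ';
        std::cout << '\n';
    }

The commented-out line is the crux of the trade-off: a list gives up random access in exchange for cheap insertion and deletion at positions you already hold.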
I know it's a bit late for this questioner, but this is a very insightful video from Bjarne Stroustrup (the creator of C++) about why you should avoid linked lists on modern hardware.
https://www.youtube.com/watch?v=YQs6IC-vgmo
With the fast memory allocation on computers today, it is much quicker to create a copy of the vector with the items updated.
I don't like the number one answer here, so I figured I'd share some actual research into this conducted by Herb Sutter from Microsoft. The test results covered up to 100k items in a container, but he also claimed that the vector would continue to outperform a linked list even at half a million entities. Unless you expect your container to hold millions of entities, your default choice for a dynamic container should be the vector. I summarized more or less what he says, but will also link the reference at the bottom:
"[Even if] you preallocate the nodes within a linked list, that gives you half the performance back, but it's still worse [than a vector]. Why? First of all it's more space -- The per element overhead (is part of the reason) -- the forward and back pointers involved within a linked list -- but also (and more importantly) the access order. The linked list has to traverse to find an insertion point, doing all this pointer chasing, which is the same thing the vector was doing, but what actually is occurring is that prefetchers are that fast. Performing linear traversals with data that is mapped efficiently within memory (allocating and using say, a vector of pointers that is defined and laid out), it will outperform linked lists in nearly every scenario."
https://youtu.be/TJHgp1ugKGM?t=2948
Use vector unless "data size is big" or "strong safety guarantee is essential".
Data size is big:
Inserting into the middle of a vector takes linear time (because of the need to shuffle things around), but other operations, such as accessing the nth element, are constant time. So there is not much overhead if the data size is small.
As per "C++ coding standards Book by Andrei Alexandrescu and Herb Sutter"
"Using a vector for small lists is almost always superior to using list. Even though insertion in the middle of the sequence is a linear-time operation for vector and a constant-time operation for list, vector usually outperforms list when containers are relatively small because of its better constant factor, and list's Big-Oh advantage doesn't kick in until data sizes get larger."
Strong safety guarantee:
List provides a strong exception-safety guarantee for insertions.
http://www.cplusplus.com/reference/list/list/insert/
As a correction on the Big-O time of insertion and deletion within a linked list: if you have a pointer that holds the position of the current element, and methods to move it around the list (like .moveToStart(), .moveToEnd(), .next(), etc.), you can remove and insert in constant time.
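In C++ std::list terms, that "current position" is an iterator; once you hold one, erasing or inserting there is constant time. A minimal sketch:

    #include <iostream>
    #include <iterator>
    #include <list>

    int main() {
        std::list<int> l = {10, 20, 30, 40};

        auto it = std::next(l.begin(), 2);   // walk to the third element: O(n), done once
        it = l.erase(it);                    // O(1) removal at a known position
        l.insert(it, 25);                    // O(1) insertion before that position

        for (int x : l) std::cout << x << ' ';   // prints: 10 20 25 40
        std::cout << '\n';
    }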

What is the best way to sort 30gb of strings with a computer with 4gb of RAM using Ruby as scripting language?

Hi, I saw this as an interview question and thought it was interesting, but I am not sure about the answer.
What would be the best way?
Assuming *nix:
system("sort <input_file >output_file")
"sort" can use temporary files to work with input files larger than memory. It has switches to tune the amount of main memory and the number of temporary files it will use, if needed.
If not *nix, or the interviewer frowns because of the sideways answer, then I'll code an external merge sort. See #psyho's answer for a good summary of an external sorting algorithm.
Put them in a database and let the database worry about it.
One way to do this is to use an external sorting algorithm:
1. Read a chunk of the file into memory.
2. Sort that chunk using any regular sorting algorithm (like quicksort).
3. Output the sorted strings into a temporary file.
4. Repeat steps 1-3 until you have processed the whole file.
5. Apply the merge-sort algorithm by reading the temporary files line by line.
6. Profit!
Well, this is an interesting interview question... almost all questions of this kind are meant to test your skills and, fortunately, don't apply directly to real-life examples. This looks like one of them, so let's get into the puzzle.
When your interviewer asks for "best", I believe he or she is talking about performance only.
Answer 1
30 GB of strings is a lot of data. All comparison-based sorting algorithms are Omega(n log n), so it will take a long time. While there are O(n) algorithms, such as counting sort, they are not in-place, so you would be multiplying the 30 GB while you have only 4 GB of RAM (consider the amount of swapping...), so I would go with quicksort.
Answer 2 (partial)
Start thinking about counting sort. You may want to first split the strings into groups (using a radix sort approach), one for each initial letter. You may want to scan the file and, for each initial letter, move the string (so copy and delete, no space wasted) into a temporary file. You may want to repeat the process for the first 2, 3 or 4 characters of each string. Then, to reduce the complexity of sorting lots of files, you can sort the strings within each file separately (using quicksort now) and finally merge all the files in order. This way you will still have O(n log n), but on a far smaller n.
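A rough C++ sketch of the bucket-by-first-letter idea (the file names are made up, and it assumes each bucket fits in RAM; if a bucket is still too big, recurse on a longer prefix as described above):

    #include <algorithm>
    #include <fstream>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        std::ifstream in("strings.txt");             // hypothetical 30 GB input
        std::map<char, std::string> bucketNames;     // first letter -> temp file name
        std::map<char, std::ofstream> bucketFiles;
        std::string s;

        // Pass 1: partition the input into one temporary file per initial letter.
        while (std::getline(in, s)) {
            char key = s.empty() ? '\0' : s[0];
            if (!bucketFiles.count(key)) {
                bucketNames[key] = "bucket_" + std::to_string(int(key)) + ".tmp";
                bucketFiles[key].open(bucketNames[key]);
            }
            bucketFiles[key] << s << '\n';
        }
        bucketFiles.clear();                         // closes all the bucket files

        // Pass 2: sort each (hopefully RAM-sized) bucket and append it to the output.
        // std::map iterates keys in order, so concatenation yields a sorted result
        // (assuming plain ASCII, so per-character order matches string order).
        std::ofstream out("sorted.txt");
        for (const auto& kv : bucketNames) {
            std::ifstream bucket(kv.second);
            std::vector<std::string> v;
            while (std::getline(bucket, s)) v.push_back(s);
            std::sort(v.begin(), v.end());           // in-memory quicksort-style sort
            for (const auto& str : v) out << str << '\n';
        }
    }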
Database systems already handle this particular problem well.
A good answer is to use the merge-sort algorithm, adapting it to spool data to and from disk as needed for the merge steps. This can be done with minimal demands on memory.

Diffing more quickly

I'm working on diffing large binary files. I've implemented the celebrated Myers diff algorithm, which produces a minimal diff. However, it is O(ND), so to diff two very different 1 MB files, I expect it to take roughly 1 million squared = 1 trillion operations. That's not good!
What I'd like is an algorithm that produces a potentially non-minimal diff, but does it much faster. I know that one must exist, because Beyond Compare does it. But I don't know how!
To be sure: There are tools like xdelta or bdiff, but these produce a patch meant for computer consumption, which is different than a human-consumable diff. A patch is concerned with transforming one file into another, so it can do things like copying from previous parts of the file. A human-consumable diff is there to visually show the differences, and can only insert and delete. For example, this transform:
"puddi" -> "puddipuddipuddi"
would produce a small patch of "copy [0,4] to [5,9] and to [10, 14]", but a larger diff of "append 'puddipuddi'". I'm interested in algorithms that produce the larger diff.
Thanks!
Diffing is basically the same problem as aligning DNA sequences in bioinformatics. These sequences are often large (millions or billions of nucleotides long), and one strategy that works well on longer genomes is the one used by the program MUMmer:
1. Quickly find all Maximal Unique Matches (substrings that appear in both files and which cannot be extended in either direction with that condition still holding) using a suffix tree.
2. Quickly find the longest subset of MUMs that appear in consecutive order in both files using a longest-increasing-subsequence dynamic programming algorithm.
3. Fix this subset of MUMs in the alignment (i.e. mark those regions as matching).
4. If deemed necessary, perform slower (e.g. Myers) diffing on the inter-MUM regions. In your case, you would probably omit this step entirely if you found the length of the longest MUM was beneath some threshold (which you would take to be evidence that the 2 files are unrelated).
This tends to give a very good (though not guaranteed-optimal) set of aligned regions (or equivalently, a very small set of differences) whenever there are not too many differences. I'm not certain of the exact time bounds for each step, but I know that there are no n^2 or higher terms.
I believe the MUMmer program requires DNA or protein sequences, so it may not work out of the box for you, but the concepts certainly apply to general strings (e.g. files) so if you're prepared to reimplement it yourself I would recommend this approach.
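To make step 2 a little more concrete, here is a simplified C++ sketch of the O(n log n) longest-increasing-subsequence chaining over hypothetical match anchors; real MUM chaining also weights by match length, which is omitted here:

    // Given matches sorted by their position in file A, keep the largest subset
    // whose positions in file B are also increasing (classic O(n log n) LIS).
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Match { std::size_t posA, posB; };   // hypothetical anchor type

    std::vector<Match> longestConsistentChain(std::vector<Match> m) {
        std::sort(m.begin(), m.end(),
                  [](const Match& x, const Match& y) { return x.posA < y.posA; });

        std::vector<std::size_t> tailIdx;   // tailIdx[k] = index of the match with the
                                            // smallest posB ending a chain of length k+1
        std::vector<std::ptrdiff_t> prev(m.size(), -1);
        for (std::size_t i = 0; i < m.size(); ++i) {
            auto it = std::lower_bound(tailIdx.begin(), tailIdx.end(), m[i].posB,
                                       [&](std::size_t idx, std::size_t b) {
                                           return m[idx].posB < b;
                                       });
            if (it != tailIdx.begin())
                prev[i] = static_cast<std::ptrdiff_t>(*(it - 1));
            if (it == tailIdx.end()) tailIdx.push_back(i);
            else *it = i;
        }

        // Reconstruct the chain by following the prev links backwards.
        std::vector<Match> chain;
        if (!tailIdx.empty()) {
            for (auto i = static_cast<std::ptrdiff_t>(tailIdx.back()); i != -1; i = prev[i])
                chain.push_back(m[i]);
            std::reverse(chain.begin(), chain.end());
        }
        return chain;
    }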
From a performance standpoint as file size grows, GNU Diffutils is probably the most robust option. For your situation I'd probably use its side-by-side comparison format, which is probably the most human-friendly of the lot. Otherwise you're off taking its output in another format and doing some work to make it pretty.
A good contender, whose performance has been improving steadily, including numerous speedups, is diff-match-patch. It implements the Myers diff algorithm in several different languages, including Java and JavaScript. See the online demo for an example of the latter with pretty-printed results. If you want to do line diffing, study the wiki for tips on how to use it for that purpose.

Partial sorting algorithm

Say I have 50 million features, each of which comes from disk.
At the beginning of my program, I handle each feature and, depending on some conditions, apply modifications to some of them.
At this point in my program, I am reading a feature from disk, processing it, and writing it back, because I don't have enough RAM to hold all 50 million features at once.
Now say I want to sort these 50 million features. Is there any optimal algorithm to do this, given that I can't load them all at the same time?
Like a partial sorting algorithm or something like that?
In general, the class of algorithms you're looking for is called external sorting. Perhaps the most widely known example of such an algorithm is merge sort.
The idea of the external version of this algorithm is that you split the data into pieces that you can sort in memory (say, 100 thousand elements each) and sort each block independently (using some standard algorithm such as quicksort). Then you take the blocks and merge them pairwise (so you merge two 100k blocks into one 200k block), which can be done by reading elements from both blocks into buffers, since the blocks are already sorted. At the end, you merge the two remaining blocks into one block that contains all the elements in the right order.
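A minimal C++ sketch of one such pairwise merge step, assuming the two blocks have been written to temporary files as sorted, newline-separated records (the file names are placeholders):

    // Merge two sorted temporary files into one, reading a line at a time so only
    // a couple of records need to be in memory at once.
    #include <fstream>
    #include <string>

    int main() {
        std::ifstream a("block_a.tmp"), b("block_b.tmp");
        std::ofstream out("merged.tmp");

        std::string x, y;
        bool haveX = static_cast<bool>(std::getline(a, x));
        bool haveY = static_cast<bool>(std::getline(b, y));

        while (haveX && haveY) {
            if (x <= y) { out << x << '\n'; haveX = static_cast<bool>(std::getline(a, x)); }
            else        { out << y << '\n'; haveY = static_cast<bool>(std::getline(b, y)); }
        }
        while (haveX) { out << x << '\n'; haveX = static_cast<bool>(std::getline(a, x)); }
        while (haveY) { out << y << '\n'; haveY = static_cast<bool>(std::getline(b, y)); }
    }

Repeating this over pairs of files, level by level, eventually leaves one file with everything in order.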
If you are on Unix, use sort ;)
It may seem stupid, but the command-line tool has been programmed to handle this case, so you won't have to reimplement it.

Resources