I am reading Programming Pearls by Jon Bentley (reference).
Here author is mentioning about various sorting algorithms like merge sort, multipass sort.
Questions:
How does algorithm for merge sort work by reading input file once and using work files and writing output file only once?
How does the author denote that the 40 pass i.e. multipass sort algorithm works by writing only once to output file and with no work files?
Can someone explain the above with a simple example, like having memory to store 3 digits and having 10 digits to store, e.g. 9,0,8,6,5,4,1,2,3,7
This is from Chapter 1 of Jon Bentley's
Programming Pearls, 2nd Edn (1999), which is an excellent book. The equivalent example from the first edition is slightly different; the multipass algorithm only made 27 passes over the data (and there was less memory available).
The sort described by Jon Bentley has special setup constraints.
File contains at most 10 million records.
Each record is a 7 digit number.
There is no other data associated with the records.
There is only 1 MiB of memory available when the sort must be done.
Question 1
The single read of the input file slurps as many lines from the input as will fit in memory, sorts that data, and writes it out to a work file. Rinse and repeat until there is no more data in the input file.
Then, complete the process by reading the work files and merging the sorted contents into a single output file. In extreme cases, it might be necessary to create new, bigger work files because the program can't read all the work files at once. If that happens, you arrange for the final pass to have the maximum number of inputs that can be handled, and have the intermediate passes merge appropriate numbers of files.
This is a general purpose algorithm.
Question 2
This is where the peculiar properties of the data are exploited. Since the numbers are unique and limited in range, the algorithm can read the file the first time, extracting numbers from the first fortieth of the range, sorting and writing those; then it extracts the second fortieth of the range, then the third, ..., then the last fortieth.
This is a special-purpose algorithm, exploiting the nature of the numbers.
Related
In question 11.5 of Gayle Laakman's book, Cracking the Technical Interview,
"Imagine you have a 20GB file with one string per line. Explain how you would sort the file"
My initial reaction was exactly the solution that she proposed - splitting the file into smaller chunks (megabytes) by reading in X mb's of data, sorting it, and then writing it to disk. And at the very end, merge the files.
I decided not to pursue this approach because the final merge would involve holding on to all the data in main memory - and we're assuming that's not possible. If that's the case, how exactly does this solution hold?
My other approach is based on the assumption that we have near unlimited disk space, or at least enough to hold 2X the data we already have. We can read in X mb's of data and then generate hash keys for them - each key corresponding to a line in a file. We'll continue doing this until all values have been hashed. Then we just have to write the values of that file into the original file.
Let me know what you think.
http://en.wikipedia.org/wiki/External_sorting gives a more detailed explanation on how external sorting works. It addresses your concern of eventually having to bring the entire 20gB into memory by explaining how you perform the final merge of the N sorted chunks by reading in chunks of the sorted chunks as opposed to reading in all the sorted chunks at the same time.
I have roughly ~600GB of dictionaries I've accumulated over the years, and I decided I want to clean them up and sort them
First of all, each file on average is very large, anywhere from 500MB to 9GB in size. A prerequisite for what I want to do is that I sort each dictionary. My end goal is to entirely remove duplicate words within and throughout all dictionary files.
The reason for this is that most of my dictionaries are sorted and organized by categories, but duplicates still often exist.
Load file
Read each line and put into data structure
Sort and remove any and all duplicate
Load next file and repeat
Once all files are individually unique, compare against eachother and remove duplicates
For Dictionaries D{1} to D{N}:
1) Sort D{1} through D{N} individually.
2) Check uniqueness of each word in D{i}
3) For each word in D{i}, check ALL words across D{i+1} to D{N}. Delete each word if unique in D{i} first.
I am considering using a sort of "hash" to improve this algorithm. Possibly by only checking the first one or two characters, since the list will be sorted (e.g. hash beginning line location for words starting with a, b, etc.).
4) Save and exit.
Example before (but far smaller):
Dictionary 1 Dictionary 2 Dictionary 3
]a 0u3TGNdB 2 KLOCK
all avisskriveri 4BZ32nKEMiqEaT7z
ast chorion 4BZ5
astn chowders bebotch
apiala chroma bebotch
apiales louts bebotch
avisskriveri lowlander chorion
avisskriverier namely PC-Based
avisskriverierne silking PC-Based
avisskriving underwater PC-Based
So it would see avisskriveri, chorion, bebotch and PC-Based are words that repeate both within and among each of the three dictionaries. So I see avisskriveri in D{1} first, so remove it in all other instances that I have seen it in. Then I see chorion in D{2} first, and remove that in all other instances first, and so forth. In D{3} bebotch and PC-Based are replicated, so I want to delete all but one entry of it (unless I've seen it before). Then save all files and close.
Example after:
Dictionary 1 Dictionary 2 Dictionary 3
]a 0u3TGNdB 2 KLOCK
all chorion 4BZ32nKEMiqEaT7z
ast chowders 4BZ5
astn chroma bebotch
apiala louts PC-Based
apiales lowlander
avisskriveri namely
avisskriverier silking
avisskriverierne underwater
avisskriving
Remember: I do NOT want to create any new dictionaries, only remove duplicates across all dictionaries.
Options:
"Hash" the amount of unique words for each file, allowing the program to estimate the computation time.
Specify a way give the location of the first word beginning with the desired first letter. So that the search may "jump" to a line and skip unecessary computational time.
Run on GPU for high performance parallel computing. (This is an issue because getting the data off of the GPU is tricky)
Goal: Reduce computational time and space consumption so that the method is affordable on a standard machine or server with limited abilities. Or device a method for running it remotely on a GPU cluster.
tl;dr - Sorting unique words across hundreds of files, where each file is 1-9GB in size.
Assuming the dictionaries are in alphabetical order and line by line, one word per line (as are most dictionaries), you could do something like this:
Open a file stream to each file.
Open a file stream to the compiled list file.
Read 1 entry from each file and put it onto a heap, priority queue, or other sorted data structure.
while you still have entries
find & remove the first entry, storing the word (it is not necessary to store the file)
read in the next entry from that file, if one exists
find & remove any duplicates of the stored entry
read in the next entry for each of those files, if one exists
write the stored word to your compiled list file
Close all of the streams
The efficiency of this is something like O(n*m*log(n)) and the space efficiency is O(n), where n is the number of files and m is the average number of entries.
Note that you'll want to create a data type that pairs entries (strings) with file pointers/references, and sorts by string storing. You'll also need a data structure that allows you to peek before you pop.
If you have questions in implementation, ask me.
A more thorough analysis of the efficiency:
Space efficiency is pretty easy. You fill the data structure, and for every item you put on, you take one off, so it stays at O(n).
Computational efficiency is more complex. The looping itself is O(n*m), because you will consider each entry, and there are n*m entries. Some c percent of those will be valid, but that's a constant, so we don't care.
Next, adding and removing from a priority queue is log(n) both ways, so to find & remove is 2*log(n).
Because we add and remove each entry, we get n*m add and removes, so O(n*m*log(n)). I think it might actually be a theta in this case, but meh.
As far as I understand, there is no pattern to exploit in a clever way. So we want to do raw sorting.
Let us assume that no cluster farm is available (we could do other things then)
Then I would start with the easiest approach possible, the command line tool sort:
sort -u inp1 inp2 -o sorted
This will sort inp1 and inp2 together in output file sorted without duplicates (u = unique). Sort typically uses a customized mergesort algorithm, which can handle a limited amount of memory. So you should not run in memory problems.
You should have at least 600 gb (double the size) of free disk space.
You should test with only 2 input files how long it takes and what happens. My tests did not show any problems, but they had used different data and an afs server (which is rather slow, but is a better emulation as some HPC filesystem provider):
$ ll
2147483646 big1
2147483646 big2
$ time sort -u big1 big2 -o bigsorted
1009.674u 6.290s 28:01.63 60.4% 0+0k 0+0io 0pf+0w
$ ll
2147483646 big1
2147483646 big2
117440512 bigsorted
I'd start with something like:
#include <string>
#include <set>
int main()
{
typedef std::set<string> Words;
Words words;
std::string word;
while (std::cin >> word)
words.insert(word); // will only work if not seen before
for (Words::const_iterator i = words.begin(); i != words.end(); ++i)
std::cout << *i;
}
Then just:
cat file1 file2... | ./this_wonderful_program > greatest_dictionary.txt
Should be fine assuming the number of non-duplicate words fits in memory (likely on any modern PC, especially if you've 64 bits and > 4GB), this will probably be I/O bound anyway so no point fussing over unordered map vs (binary-tree) map etc.. You may want to convert to lower-case, strip spurious characters etc. before inserting to the map.
EDIT:
If the unique words don't fit in memory, or you're just stubbornly determined to sort each individual input then merge them, you can use the unix sort command on each file, then sort -m to efficiently merge the pre-sorted files. If you're not on UNIX/Linux, you can probably still find a port of sort (e.g. from Cygwin for Windows), your OS may have an equivalent program, or you could try compiling the sort source code. Note that this approach is a little different from tb-'s suggestion of asking one invocation of sort to sort everything (presumably in memory) - I'm not sure how well that would work, so best to try/compare.
On that that scale of 300GB+, you may want to consider using Hadoop or some other scalable store - otherwise, you will have to deal with memory issues through your own coding. You can try other, more direct methods (UNIX scripting, small C/C++ programs, etc...), but you will likely run out of memory unless you have a ton of duplicate words in your data.
Addendum
Just came across memcached which seems very close to what you are trying to accomplish: but you may have to tweak it not to throw away the oldest values. I don't have time to check right now, but you should do a search on Distributed Hash Tables.
Given two files containing list of words(around million), We need to find out the words that are in common.
Use Some efficient algorithm, also not enough memory availble(1 million, certainly not).. Some basic C Programming code, if possible, would help.
The files are not sorted.. We can use some sort of algorithm... Please support it with basic code...
Sorting the external file...... with minimum memory available,, how can it be implement with C programming.
Anybody game for external sorting of a file... Please share some code for this.
Yet another approach.
General. first, notice that doing this sequentially takes O(N^2). With N=1,000,000, this is a LOT. Sorting each list would take O(N*log(N)); then you can find the intersection in one pass by merging the files (see below). So the total is O(2N*log(N) + 2N) = O(N*log(N)).
Sorting a file. Now let's address the fact that working with files is much slower than with memory, especially when sorting where you need to move things around. One way to solve this is - decide the size of the chunk that can be loaded into memory. Load the file one chunk at a time, sort it efficiently and save into a separate temporary file. The sorted chunks can be merged (again, see below) into one sorted file in one pass.
Merging. When you have 2 sorted lists (files or not), you can merge them into one sorted list easily in one pass: have 2 "pointers", initially pointing to the first entry in each list. In each step, compare the values the pointers point to. Move the smaller value to the merged list (the one you are constructing) and advance its pointer.
You can modify the merge algorithm easily to make it find the intersection - if pointed values are equal move it to the results (consider how do you want to deal with duplicates).
For merging more than 2 lists (as in sorting the file above) you can generalize the algorithm for using k pointers.
If you had enough memory to read the first file completely into RAM, I would suggest reading it into a dictionary (word -> index of that word ), loop over the words of the second file and test if the word is contained in that dictionary. Memory for a million words is not much today.
If you have not enough memory, split the first file into chunks that fit into memory and do as I said above for each of that chunk. For example, fill the dictionary with the first 100.000 words, find every common word for that, then read the file a second time extracting word 100.001 up to 200.000, find the common words for that part, and so on.
And now the hard part: you need a dictionary structure, and you said "basic C". When you are willing to use "basic C++", there is the hash_map data structure provided as an extension to the standard library by common compiler vendors. In basic C, you should also try to use a ready-made library for that, read this SO post to find a link to a free library which seems to support that.
Your problem is: Given two sets of items, find the intersaction (items common to both), while staying within the constraints of inadequate RAM (less than the size of any set).
Since finding an intersaction requires comparing/searching each item in another set, you must have enough RAM to store at least one of the sets (the smaller one) to have an efficient algorithm.
Assume that you know for a fact that the intersaction is much smaller than both sets and fits completely inside available memory -- otherwise you'll have to do further work in flushing the results to disk.
If you are working under memory constraints, partition the larger set into parts that fit inside 1/3 of the available memory. Then partition the smaller set into parts the fit the second 1/3. The remaining 1/3 memory is used to store the results.
Optimize by finding the max and min of the partition for the larger set. This is the set that you are comparing from. Then when loading the corresponding partition of the smaller set, skip all items outside the min-max range.
First find the intersaction of both partitions through a double-loop, storing common items to the results set and removing them from the original sets to save on comparisons further down the loop.
Then replace the partition in the smaller set with the second partition (skipping items outside the min-max). Repeat. Notice that the partition in the larger set is reduced -- with common items already removed.
After running through the entire smaller set, repeat with the next partition of the larger set.
Now, if you do not need to preserve the two original sets (e.g. you can overwrite both files), then you can further optimize by removing common items from disk as well. This way, those items no longer need to be compared in further partitions. You then partition the sets by skipping over removed ones.
I would give prefix trees (aka tries) a shot.
My initial approach would be to determine a maximum depth for the trie that would fit nicely within my RAM limits. Pick an arbitrary depth (say 3, you can tweak it later) and construct a trie up to that depth, for the smaller file. Each leaf would be a list of "file pointers" to words that start with the prefix encoded by the path you followed to reach the leaf. These "file pointers" would keep an offset into the file and the word length.
Then process the second file by reading each word from it and trying to find it in the first file using the trie you constructed. It would allow you to fail faster on words that don't match. The deeper your trie, the faster you can fail, but the more memory you would consume.
Of course, like Stephen Chung said, you still need RAM to store enough information to describe at least one of the files, if you really need an efficient algorithm. If you don't have enough memory -- and you probably don't, because I estimate my approach would require approximately the same amount of memory you would need to load a file whose words were 14-22 characters long -- then you have to process even the first file by parts. In that case, I would actually recommend using the trie for the larger file, not the smaller. Just partition it in parts that are no bigger than the smaller file (or no bigger than your RAM constraints allow, really) and do the whole process I described for each part.
Despite the length, this is sort of off the top of my head. I might be horribly wrong in some details, but this is how I would initially approach the problem and then see where it would take me.
If you're looking for memory efficiency with this sort of thing you'll be hard pushed to get time efficiency. My example will be written in python, but should be relatively easy to implement in any language.
with open(file1) as file_1:
current_word_1 = read_to_delim(file_1, delim)
while current_word_1:
with open(file2) as file_2:
current_word_2 = read_to_delim(file_2, delim)
while current_word_2:
if current_word_2 == current_word_1:
print current_word_2
current_word_2 = read_to_delim(file_2, delim)
current_word_1 = read_to_delim(file_1, delim)
I leave read_to_delim to you, but this is the extreme case that is memory-optimal but time-least-optimal.
depending on your application of course you could load the two files in a database, perform a left outer join, and discard the rows for which one of the two columns is null
What's a good algorithm for sorting text files that are larger than available memory (many 10s of gigabytes) and contain variable-length records? All the algorithms I've seen assume 1) data fits in memory, or 2) records are fixed-length. But imagine a big CSV file that I wanted to sort by the "BirthDate" field (the 4th field):
Id,UserId,Name,BirthDate
1,psmith,"Peter Smith","1984/01/01"
2,dmehta,"Divya Mehta","1985/11/23"
3,scohen,"Saul Cohen","1984/08/19"
...
99999999,swright,"Shaun Wright","1986/04/12"
100000000,amarkov,"Anya Markov","1984/10/31"
I know that:
This would run on one machine (not distributed).
The machine that I'd be running this on would have several processors.
The files I'd be sorting could be larger than the physical memory of the machine.
A file contains variable-length lines. Each line would consist of a fixed number of columns (delimiter-separated values). A file would be sorted by a specific field (ie. the 4th field in the file).
An ideal solution would probably be "use this existing sort utility", but I'm looking for the best algorithm.
I don't expect a fully-coded, working answer; something more along the lines of "check this out, here's kind of how it works, or here's why it works well for this problem." I just don't know where to look...
This isn't homework!
Thanks! ♥
This class of algorithms is called external sorting. I would start by checking out the Wikipedia entry. It contains some discussion and pointers.
Suggest the following resources:
Merge Sort: http://en.wikipedia.org/wiki/Merge_sort
Seminumerical Algorithms, vol 2 of The Art of Computer Programming: Knuth: Addison Wesley:ISBN 0-201-03822-6(v.2)
A standard merge sort approach will work. The common schema is
Split the file into N parts of roughly equal size
Sort each part (in memory if it's small enough, otherwise recursively apply the same algorithm)
Merge the sorted parts
No need to sort. Read the file ALL.CSV and append each read line to a file per day, like 19841231.CSV. For each existing day with data, in numerical order, read that CSV file and append those lines to a new file. Optimizations are possible by, for example, processing the original file more than once or by recording days actually occuring in the file ALL.CSV.
So a line containing "1985/02/28" should be added to the file 19850228.CSV. The file 19850228.CSV should be appended to NEW.CSV after the file 19850227.CSV was appended to NEW.CSV. The numerical order avoids the use of all sort algorithms, albeit it could torture the file system.
In reality the file ALL.CSV could be split in a file per, for example, year. 1984.CSV, 1985.CSV, and so on.
We get these ~50GB data files consisting of 16 byte codes, and I want to find any code that occurs 1/2% of the time or more. Is there any way I can do that in a single pass over the data?
Edit: There are tons of codes - it's possible that every code is different.
EPILOGUE: I've selected Darius Bacon as best answer, because I think the best algorithm is a modification of the majority element he linked to. The majority algorithm should be modifiable to only use a tiny amount of memory - like 201 codes to get 1/2% I think. Basically you just walk the stream counting up to 201 distinct codes. As soon as you find 201 distinct codes, you drop one of each code (deduct 1 from the counters, forgetting anything that becomes 0). At the end, you have dropped at most N/201 times, so any code occurring more times than that must still be around.
But it's a two pass algorithm, not one. You need a second pass to tally the counts of the candidates. It's actually easy to see that any solution to this problem must use at least 2 passes (the first batch of elements you load could all be different and one of those codes could end up being exactly 1/2%)
Thanks for the help!
Metwally et al., Efficient Computation of Frequent and Top-k Elements in Data Streams (2005). There were some other relevant papers I read for my work at Yahoo that I can't find now; but this looks like a good start.
Edit: Ah, see this Brian Hayes article. It sketches an exact algorithm due to Demaine et al., with references. It does it in one pass with very little memory, yielding a set of items including the frequent ones you're looking for, if they exist. Getting the exact counts takes a (now-tractable) second pass.
this will depend on the distribution of the codes. if there are a small enough number of distinct codes you can build a http://en.wikipedia.org/wiki/Frequency_distribution in core with a map. otherwise you probably will have to build a http://en.wikipedia.org/wiki/Histogram and then make multiple passes over the data examining frequencies of codes in each bucket.
Sort chunks of the file in memory, as if you were performing and external sort. Rather than writing out all of the sorted codes in each chunk, however, you can just write each distinct code and the number of occurrences in that chunk. Finally, merge these summary records to find the number of occurrences of each code.
This process scales to any size data, and it only makes one pass over the input data. Multiple merge passes may be required, depending on how many summary files you want to open at once.
Sorting the file allows you to count the number of occurrences of each code using a fixed amount of memory, regardless of the input size.
You also know the total number of codes (either by dividing the input size by a fixed code size, or by counting the number of variable length codes during the sorting pass in a more general problem).
So, you know the proportion of the input associated with each code.
This is basically the pipeline sort * | uniq -c
If every code appears just once, that's no problem; you just need to be able to count them.
That depends on how many different codes exist, and how much memory you have available.
My first idea would be to build a hash table of counters, with the codes as keys. Loop through the entire file, increasing the counter of the respective code, and counting the overall number. Finally, filter all keys with counters that exceed (* overall-counter 1/200).
If the files consist solely of 16-byte codes, and you know how large each file is, you can calculate the number of codes in each file. Then you can find the 0.5% threshold and follow any of the other suggestions to count the occurrences of each code, recording each one whose frequency crosses the threshold.
Do the contents of each file represent a single data set, or is there an arbitrary cutoff between files? In the latter case, and assuming a fairly constant distribution of codes over time, you can make your life simpler by splitting each file into smaller, more manageable chunks. As a bonus, you'll get preliminary results faster and can pipeline then into the next process earlier.