Merge sorted files efficiently - sorting

I need to merge about 30 gzip-ed text files, each about 10-15GB compressed, each containing multi-line records, each sorted by the same key. The files reside on an NFS share, I have access to them from several nodes, and each node has its own /tmp filesystem. What would be the fastest way to go about it?
Some possible solutions:
A. Leave it all to sort -m. To do that, I need to pass every input file through awk/sed/grep to collapse each record into a line and extract a key that would be understood by sort. So I would get something like
sort -m -k [...] <(preprocess file1) [...] <(preprocess filen) | postprocess
B. Look into python's heapq.merge.
C. Write my own C code to do this. I could merge the files in small batches, make an OMP thread for each input file, one for the output, and one actually doing the merging in RAM, etc.
Options for all of the above:
D. Merge a few files at a time, in a tournament.
E. Use several nodes for this, copying intermediate results in between the nodes.
What would you recommend? I don't have much experience with secondary storage efficiency, so I find it hard to estimate how any of these would perform.

If you go for your solution B involving heapq.merge, then you will be delighted to know that Python 3.5 adds a key parameter to heapq.merge(), according to docs.python.org, bugs.python.org and github.com. This will be a great solution to your problem.
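A minimal sketch of what option B could look like. The record format is not given in the question, so this assumes (hypothetically) that records are separated by blank lines and that the sort key is the first line of each record; read_records, record_key and merge_files are illustrative names, and record_key must order records the same way the files were sorted.

```python
import gzip
import heapq


def read_records(path):
    """Yield multi-line records from a gzip-ed text file.

    Assumption (hypothetical format): records are separated by blank lines.
    """
    with gzip.open(path, "rt") as handle:
        lines = []
        for line in handle:
            line = line.rstrip("\n")
            if line.strip():
                lines.append(line)
            elif lines:
                yield "\n".join(lines)
                lines = []
        if lines:
            yield "\n".join(lines)


def record_key(record):
    # Assumption: the sort key is the first line of the record.
    return record.split("\n", 1)[0]


def merge_files(paths, out):
    """Stream-merge already-sorted inputs, holding one record per file in memory."""
    streams = [read_records(path) for path in paths]
    for record in heapq.merge(*streams, key=record_key):
        out.write(record)
        out.write("\n\n")  # restore the blank-line separator
```

Calling merge_files(list_of_paths, sys.stdout) streams the merged records to standard output. Decompression happens on the fly, so nothing uncompressed has to be written to /tmp.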

Related

Spark read partitions - Resource cost analysis

Reading data partitioned by column in Spark with something like spark.read.json("/A=1/B=2/C=3/D=4/E=5/") scans only the files in the folder E=5.
But let's say I am interested in reading the partitions in which C = my_value across the whole data source. The instruction will be spark.read.json("/*/*/C=my_value/").
What happens computationally in the described scenario under the hood? Spark will just list through the partition values of A and B? Or it will scan through all the leaves (the actual files) too?
Thank you for an interesting question. Apache Spark uses Hadoop's FileSystem abstraction to deal with wildcard patterns. In the source code they're called glob patterns.
The org.apache.hadoop.fs.FileSystem#globStatus(org.apache.hadoop.fs.Path) method is used to return "an array of paths that match the path pattern". This function then calls org.apache.hadoop.fs.Globber#glob to work out exactly which files match the glob pattern. globStatus is called by org.apache.spark.sql.execution.datasources.DataSource#checkAndGlobPathIfNecessary. You can add some breakpoints to see how it works under the hood.
But long story short:
What happens computationally in the described scenario under the hood? Spark will just list through the partition values of A and B? Or it will scan through all the leaves (the actual files) too?
Spark will split your glob into 3 parts ["*", "*", "C=my_value"]. Then it will list the files at every level using Hadoop's org.apache.hadoop.fs.FileSystem#listStatus(org.apache.hadoop.fs.Path) method. For every file it will build a path and try to match it against the current pattern. The matching files are kept as "candidates" and are only filtered down at the last step, when the algorithm looks for "C=my_value".
Unless you have a lot of files, this operation shouldn't hurt you. And that's probably one of the reasons why you should rather keep fewer but bigger files (the famous data engineering problem of "too many small files").
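To make the level-by-level listing concrete, here is a rough Python illustration on a local file system. It is not Hadoop's Globber code, only a sketch of the same idea; glob_by_level is a hypothetical name, and os.listdir stands in for listStatus.

```python
import fnmatch
import os


def glob_by_level(root, pattern):
    """Expand a glob such as "*/*/C=my_value" one path component at a time."""
    candidates = [root]
    for component in pattern.strip("/").split("/"):
        matched = []
        for directory in candidates:
            if not os.path.isdir(directory):
                continue
            # One directory listing per surviving candidate at this level.
            for name in sorted(os.listdir(directory)):
                if fnmatch.fnmatchcase(name, component):
                    matched.append(os.path.join(directory, name))
        candidates = matched
    return candidates
```

In this sketch the number of listings grows with the number of A and B directories, while the files below the C level are never touched during the expansion itself.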

Perl processing a trillion records

Looking for some advice or insight on what I consider a simple method in Perl to compare text files to one another.
Let's assume you have 90,000 text files that are all structured similarly, say they have a common theme with a small amount of unique data in each.
My logic says to simply loop through the files (breaking into 1000 lines for simplicity), then loop through the # of files ... 90,000 - then loop through the 90,000 files again to compare to each other. This becomes a virtually endless loop of a bazillion lines or processes.
Now the mandatory step here is to "remove" any line that is found in any file except the file we are working on. The ultimate goal is to scrub all the files down to content that is unique across the entire collection, even if it means some files end up empty.
I am saying files, but this could be rows in a database, or elements in an array. (I've tried all.) The fastest solution so far has been to load all the files into MySQL, then run
UPDATE table SET column=REPLACE(column, find, replace);
I also tried Parallel::ForkManager when working with MySQL.
The slowest approach actually led to exhausting my 32 GB of RAM: that was loading all 90k files into an array. 90k files didn't work at all; smaller batches like 1000 work fine, but then they aren't compared to the other 89,000.
Server specs if helpful: Single Quad-Core E3-1240 4Cores x 3.4Ghz w/ HT 32GB DDR3 ECC RAM 1600MHz 1x256SSD
So how does an engineer solve this problem? I am just a Perl hacker...
Tag every line with the filename (and maybe the line number) and sort all the lines using Sort::External. Then you can read the sorted records in order and write only a single unique line to the result files.
A Bloom filter is perfect for this, if you can handle arbitrarily small error.
To quote wikipedia: "A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either 'possibly in set' or 'definitely not in set'."
In essence, you'll use k hashes to hash each row to k spots on a bit array. Each time you encounter a new row, you are guaranteed you haven't seen it if at least one of the k hashed indices has a '0' bit. You can read up on Bloom filters to see how to size the array and choose k to make false positives arbitrarily small.
Then you go through your files, and either delete rows where you get a positive match, or copy the negative match rows into a new file.
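A small sketch of the idea in Python (the question is about Perl, so treat this as pseudocode you can port). BloomFilter and unseen_rows are hypothetical names, and deriving the k indices from one SHA-256 digest by double hashing is just one common choice, not the only one. Note that this variant keeps the first occurrence of each row and drops later ones.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k indices from two 64-bit halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # A single 0 bit proves the item was never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def unseen_rows(rows, bloom):
    """Yield rows the filter has definitely not seen, recording each one."""
    for row in rows:
        if row not in bloom:
            bloom.add(row)
            yield row
```

The array size and k here are left to the caller; the usual sizing formulas trade memory against the false-positive rate, and a false positive in this sketch means a genuinely new row gets dropped.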
Sort the items using an external merge sort algorithm and remove the duplicates on the merge phase.
Actually, you can do that efficiently just by calling the sort command with the -u flag. From Perl:
system "sort -u @files >output";
Your sort command may provide several adjustable parameters to improve its performance. For instance, the number of parallel processes or the amount of memory it can allocate.

Removing Duplicate Words Across Multiple and Large Dictionary Files

I have roughly 600GB of dictionaries I've accumulated over the years, and I decided I want to clean them up and sort them.
First of all, each file on average is very large, anywhere from 500MB to 9GB in size. A prerequisite for what I want to do is that I sort each dictionary. My end goal is to entirely remove duplicate words within and throughout all dictionary files.
The reason for this is that most of my dictionaries are sorted and organized by categories, but duplicates still often exist.
Load file
Read each line and put into data structure
Sort and remove any and all duplicate
Load next file and repeat
Once all files are individually unique, compare against each other and remove duplicates
For Dictionaries D{1} to D{N}:
1) Sort D{1} through D{N} individually.
2) Check uniqueness of each word in D{i}
3) For each word in D{i}, check ALL words across D{i+1} to D{N}. Delete every later copy of a word that appears in D{i} first.
I am considering using a sort of "hash" to improve this algorithm. Possibly by only checking the first one or two characters, since the list will be sorted (e.g. hash beginning line location for words starting with a, b, etc.).
4) Save and exit.
Example before (but far smaller):
Dictionary 1        Dictionary 2    Dictionary 3
]a                  0u3TGNdB        2 KLOCK
all                 avisskriveri    4BZ32nKEMiqEaT7z
ast                 chorion         4BZ5
astn                chowders        bebotch
apiala              chroma          bebotch
apiales             louts           bebotch
avisskriveri        lowlander       chorion
avisskriverier      namely          PC-Based
avisskriverierne    silking         PC-Based
avisskriving        underwater      PC-Based
So it would see that avisskriveri, chorion, bebotch and PC-Based are words that repeat both within and among the three dictionaries. I see avisskriveri in D{1} first, so I remove every other instance of it. Then I see chorion in D{2} first, so I remove every other instance of that, and so forth. In D{3} bebotch and PC-Based are repeated, so I want to delete all but one entry of each (unless I've seen it before). Then save all files and close.
Example after:
Dictionary 1        Dictionary 2    Dictionary 3
]a                  0u3TGNdB        2 KLOCK
all                 chorion         4BZ32nKEMiqEaT7z
ast                 chowders        4BZ5
astn                chroma          bebotch
apiala              louts           PC-Based
apiales             lowlander
avisskriveri        namely
avisskriverier      silking
avisskriverierne    underwater
avisskriving
Remember: I do NOT want to create any new dictionaries, only remove duplicates across all dictionaries.
Options:
"Hash" the amount of unique words for each file, allowing the program to estimate the computation time.
Specify a way to give the location of the first word beginning with the desired first letter, so that the search may "jump" to a line and skip unnecessary computation.
Run on GPU for high performance parallel computing. (This is an issue because getting the data off of the GPU is tricky)
Goal: Reduce computational time and space consumption so that the method is affordable on a standard machine or server with limited abilities. Or devise a method for running it remotely on a GPU cluster.
tl;dr - Sorting unique words across hundreds of files, where each file is 1-9GB in size.
Assuming the dictionaries are in alphabetical order and line by line, one word per line (as are most dictionaries), you could do something like this:
Open a file stream to each file.
Open a file stream to the compiled list file.
Read 1 entry from each file and put it onto a heap, priority queue, or other sorted data structure.
While you still have entries:
    Find & remove the first entry, storing the word (it is not necessary to store the file).
    Read in the next entry from that file, if one exists.
    Find & remove any duplicates of the stored entry.
    Read in the next entry for each of those files, if one exists.
    Write the stored word to your compiled list file.
Close all of the streams.
The efficiency of this is something like O(n*m*log(n)) and the space efficiency is O(n), where n is the number of files and m is the average number of entries.
Note that you'll want to create a data type that pairs entries (strings) with file pointers/references and is ordered by the string. You'll also need a data structure that allows you to peek before you pop.
If you have questions in implementation, ask me.
A more thorough analysis of the efficiency:
Space efficiency is pretty easy. You fill the data structure, and for every item you put on, you take one off, so it stays at O(n).
Computational efficiency is more complex. The looping itself is O(n*m), because you will consider each entry, and there are n*m entries. Some c percent of those will be valid, but that's a constant, so we don't care.
Next, adding and removing from a priority queue is log(n) both ways, so to find & remove is 2*log(n).
Because we add and remove each entry, we get n*m add and removes, so O(n*m*log(n)). I think it might actually be a theta in this case, but meh.
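For what it's worth, here is a short Python sketch of the heap-based merge described above. merge_unique is a hypothetical name, and instead of peeking for duplicates before popping, this variant simply remembers the last word it emitted, which has the same effect.

```python
import heapq


def merge_unique(sorted_streams):
    """Yield each distinct word once from several individually sorted iterables."""
    iterators = [iter(stream) for stream in sorted_streams]
    heap = []
    for index, iterator in enumerate(iterators):
        word = next(iterator, None)
        if word is not None:
            heap.append((word, index))  # pair each entry with its source
    heapq.heapify(heap)

    previous = None
    while heap:
        word, index = heapq.heappop(heap)
        if word != previous:
            yield word
            previous = word
        # Refill from the stream the popped entry came from, if it has more.
        following = next(iterators[index], None)
        if following is not None:
            heapq.heappush(heap, (following, index))
```

The heap never holds more than one entry per file, which is where the O(n) space bound comes from. With real files you could pass generators such as (line.rstrip("\n") for line in open(path)), assuming the files were sorted in the same order Python uses to compare strings (for example with LC_ALL=C sort).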
As far as I understand, there is no pattern to exploit in a clever way. So we want to do raw sorting.
Let us assume that no cluster farm is available (we could do other things then)
Then I would start with the easiest approach possible, the command line tool sort:
sort -u inp1 inp2 -o sorted
This will sort inp1 and inp2 together into the output file sorted, without duplicates (u = unique). sort typically uses a customized merge sort that works within a limited amount of memory, so you should not run into memory problems.
You should have at least 600 GB (double the size) of free disk space.
You should first test with only 2 input files to see how long it takes and what happens. My tests did not show any problems, but they used different data and an AFS server (which is rather slow, but a better emulation of what some HPC filesystem providers offer):
$ ll
2147483646 big1
2147483646 big2
$ time sort -u big1 big2 -o bigsorted
1009.674u 6.290s 28:01.63 60.4% 0+0k 0+0io 0pf+0w
$ ll
2147483646 big1
2147483646 big2
117440512 bigsorted
I'd start with something like:
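#include <iostream>
#include <set>
#include <string>

int main()
{
    typedef std::set<std::string> Words;
    Words words;
    std::string word;
    // Read whitespace-separated words; std::set ignores words already seen.
    while (std::cin >> word)
        words.insert(word);
    // std::set iterates in sorted order, so the output is sorted and unique.
    for (Words::const_iterator i = words.begin(); i != words.end(); ++i)
        std::cout << *i << '\n';
}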
Then just:
cat file1 file2... | ./this_wonderful_program > greatest_dictionary.txt
This should be fine assuming the number of non-duplicate words fits in memory (likely on any modern PC, especially if you have a 64-bit system and more than 4GB of RAM). It will probably be I/O bound anyway, so there is no point fussing over an unordered set versus a (binary-tree) set. You may want to convert to lower case, strip spurious characters, etc. before inserting into the set.
EDIT:
If the unique words don't fit in memory, or you're just stubbornly determined to sort each individual input then merge them, you can use the unix sort command on each file, then sort -m to efficiently merge the pre-sorted files. If you're not on UNIX/Linux, you can probably still find a port of sort (e.g. from Cygwin for Windows), your OS may have an equivalent program, or you could try compiling the sort source code. Note that this approach is a little different from tb-'s suggestion of asking one invocation of sort to sort everything (presumably in memory) - I'm not sure how well that would work, so best to try/compare.
At that scale of 300GB+, you may want to consider using Hadoop or some other scalable store - otherwise, you will have to deal with memory issues through your own coding. You can try other, more direct methods (UNIX scripting, small C/C++ programs, etc...), but you will likely run out of memory unless you have a ton of duplicate words in your data.
Addendum
Just came across memcached which seems very close to what you are trying to accomplish: but you may have to tweak it not to throw away the oldest values. I don't have time to check right now, but you should do a search on Distributed Hash Tables.

splitting files in unix

Just wondering if there is a faster way to split a file into N chunks other than unix "split".
Basically I have large files which I would like to split into smaller chunks and operate on each one in parallel.
I assume you're using split -b which will be more CPU-efficient than splitting by lines, but still reads the whole input file and writes it out to each file. If the serial nature of the execution of this portion of split is your bottleneck, you can use dd to extract the chunks of the file in parallel. You will need a distinct dd command for each parallel process. Here's one example command line (assuming the_input_file is a large file this extracts a bit from the middle):
dd skip=400 count=1 if=the_input_file bs=512 of=_output
To make this work you will need to choose appropriate values of count and bs (those above are very small). Each worker will also need to choose a different value of skip so that the chunks don't overlap. But this is efficient; dd implements skip with a seek operation.
Of course, this is still not as efficient as implementing your data consumer process in such a way that it can read a specified chunk of the input file directly, in parallel with other similar consumer processes. But I assume if you could do that you would not have asked this question.
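If the consumer can be changed, a sketch of that direct approach in Python might look like the following. chunk_ranges and read_chunk are hypothetical helper names; each worker would get one (offset, length) pair, much like choosing skip and count for dd.

```python
def chunk_ranges(file_size, num_chunks):
    """Return (offset, length) byte ranges that cover the file without overlap."""
    base, extra = divmod(file_size, num_chunks)
    ranges = []
    offset = 0
    for i in range(num_chunks):
        # Spread the remainder over the first few chunks.
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


def read_chunk(path, offset, length):
    """Seek straight to the chunk instead of reading the bytes before it."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)
```

Like dd and split -b, these byte ranges can cut a line in half, so a line-oriented consumer would still need to decide which worker owns a line that straddles a boundary.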
Given that this is an OS utility, my inclination would be to think that it's optimized for best performance.
You can see this question (or do a man -k split or man split) to find related commands that you might be able to use instead of split.
If you are thinking of implementing your own solution in, say, C, then I would suggest you run some benchmarks on your own specific system/environment with some sample data to determine which tool to use.
Note: if you aren't going to be doing this regularly, it may not be worth your while to even think about this much. Just go ahead and use a tool that does what you need it to do (in this case split).

Sorting algorithm: Big text file with variable-length lines (comma-separated values)

What's a good algorithm for sorting text files that are larger than available memory (many 10s of gigabytes) and contain variable-length records? All the algorithms I've seen assume 1) data fits in memory, or 2) records are fixed-length. But imagine a big CSV file that I wanted to sort by the "BirthDate" field (the 4th field):
Id,UserId,Name,BirthDate
1,psmith,"Peter Smith","1984/01/01"
2,dmehta,"Divya Mehta","1985/11/23"
3,scohen,"Saul Cohen","1984/08/19"
...
99999999,swright,"Shaun Wright","1986/04/12"
100000000,amarkov,"Anya Markov","1984/10/31"
I know that:
This would run on one machine (not distributed).
The machine that I'd be running this on would have several processors.
The files I'd be sorting could be larger than the physical memory of the machine.
A file contains variable-length lines. Each line would consist of a fixed number of columns (delimiter-separated values). A file would be sorted by a specific field (i.e. the 4th field in the file).
An ideal solution would probably be "use this existing sort utility", but I'm looking for the best algorithm.
I don't expect a fully-coded, working answer; something more along the lines of "check this out, here's kind of how it works, or here's why it works well for this problem." I just don't know where to look...
This isn't homework!
Thanks! ♥
This class of algorithms is called external sorting. I would start by checking out the Wikipedia entry. It contains some discussion and pointers.
Suggest the following resources:
Merge Sort: http://en.wikipedia.org/wiki/Merge_sort
Seminumerical Algorithms, vol 2 of The Art of Computer Programming: Knuth: Addison Wesley:ISBN 0-201-03822-6(v.2)
A standard merge sort approach will work. The common schema is
Split the file into N parts of roughly equal size
Sort each part (in memory if it's small enough, otherwise recursively apply the same algorithm)
Merge the sorted parts
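The three steps above can be sketched in Python roughly as follows. This is only an illustration under a few assumptions: every line ends with a newline, no quoted field contains an embedded newline, the header row has already been skipped, and BirthDate is the 4th field in a sortable YYYY/MM/DD form. birth_date, external_sort and run_size are hypothetical names.

```python
import csv
import heapq
import itertools
import tempfile


def birth_date(line):
    # Assumption: BirthDate is the 4th CSV field, in sortable YYYY/MM/DD form.
    return next(csv.reader([line]))[3]


def _write_run(sorted_lines):
    """Spill one sorted part to a temporary file and rewind it for reading."""
    run = tempfile.TemporaryFile("w+", newline="")
    run.writelines(sorted_lines)
    run.seek(0)
    return run


def external_sort(lines, out, key=birth_date, run_size=100_000):
    """Sort an iterable of newline-terminated lines using bounded memory."""
    runs = []
    lines = iter(lines)
    while True:
        # Step 1 and 2: take a part that fits in memory and sort it.
        chunk = list(itertools.islice(lines, run_size))
        if not chunk:
            break
        chunk.sort(key=key)
        runs.append(_write_run(chunk))
    # Step 3: k-way merge of the sorted parts, one line per part in memory.
    out.writelines(heapq.merge(*runs, key=key))
    for run in runs:
        run.close()
```

Variable-length lines are not a problem here because each part is written and read back line by line. With a very large number of parts you may hit the open-file limit, in which case the parts can be merged in several passes.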
No need to sort. Read the file ALL.CSV and append each line read to a file per day, like 19841231.CSV. Then, for each day that has data, in numerical order, read that day's CSV file and append its lines to a new file. Optimizations are possible, for example by processing the original file more than once or by recording which days actually occur in ALL.CSV.
So a line containing "1985/02/28" should be added to the file 19850228.CSV. The file 19850228.CSV should be appended to NEW.CSV after the file 19850227.CSV has been appended to NEW.CSV. The numerical order avoids the use of any sorting algorithm, although it could torture the file system.
In practice, the file ALL.CSV could first be split into one file per, for example, year: 1984.CSV, 1985.CSV, and so on.
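A rough Python sketch of the per-day variant, assuming the date is the 4th field in YYYY/MM/DD form, the header row has been skipped, and every line ends with a newline. bucket_sort_by_date and work_dir are hypothetical names, and keeping one open handle per day is a simplification that can run into the open-file limit.

```python
import csv
import os
import shutil


def bucket_sort_by_date(lines, out, work_dir):
    """Distribute lines into one file per day, then concatenate them in date order."""
    handles = {}
    try:
        for line in lines:
            # "1985/02/28" -> "19850228"; assumes the 4th field is YYYY/MM/DD.
            day = next(csv.reader([line]))[3].replace("/", "")
            if day not in handles:
                path = os.path.join(work_dir, day + ".CSV")
                handles[day] = open(path, "w+", newline="")
            handles[day].write(line)
        # Eight-digit names sort lexically in the same order as the dates.
        for day in sorted(handles):
            handles[day].seek(0)
            shutil.copyfileobj(handles[day], out)
    finally:
        for handle in handles.values():
            handle.close()
```

Lines that share a date come out in their original relative order. Only the set of distinct days is sorted, which is tiny compared with the data.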
