I have around five hundred 23 MB files of data that I am trying to sort into one database. I have sorted them all individually using sort, and am now trying to merge them using sort --merge so that the whole sample is sorted together. Then I plan to split them back up into five hundred files.
The issue I am running into is that my drive is very congested, and I only have about 805GB available according to df. When I run sort --merge file1 file2 file3... I eventually receive an error that sort has failed because there is no space left on the drive.
Are there any tips for how I could work around or solve this issue, or is the only solution to free up more space?
Given that I have a very large log file, large enough that it cannot be loaded into my main memory, and I want to sort it somehow, what would be the most recommended sorting technique and algorithm?
If you have GNU sort, use it. It knows how to deal with large files. For details, see the answers to "How to sort big files" on Unix SE. You will of course need sufficient free disk space.
If you are looking for an algorithm, you could apply merge sort.
Essentially you split your data into smaller chunks and sort each chunk. Then you take two sorted chunks and merge them (this can be done in a streaming fashion: just repeatedly take the smaller of the two chunks' current values and advance); this results in a bigger chunk. Keep doing this until you have merged all chunks.
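For concreteness, here is a minimal Python sketch of that split-sort-merge idea, assuming one record per line; the chunk size and chunk-file naming are made up for illustration, and in practice GNU sort already does all of this for you:

import heapq
from itertools import islice

def external_sort(input_path, output_path, chunk_lines=1_000_000):
    # Phase 1: split the input into sorted chunks written to temporary files.
    chunk_paths = []
    with open(input_path) as src:
        while True:
            chunk = list(islice(src, chunk_lines))
            if not chunk:
                break
            chunk.sort()
            path = f"{output_path}.chunk{len(chunk_paths)}"  # illustrative naming
            with open(path, "w") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)

    # Phase 2: stream-merge the sorted chunks; heapq.merge holds only one
    # line per chunk in memory at any time.
    chunks = [open(p) for p in chunk_paths]
    with open(output_path, "w") as out:
        out.writelines(heapq.merge(*chunks))
    for f in chunks:
        f.close()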
This depends on the OS. If on Linux/Unix, you can use the sed command to print specific lines
sed -n -e 120p /var/log/syslog
Which would print line 120 of the syslog file. You could also use head
head -n 15 /var/log/syslog
Which would print the first 15 lines of the syslog file. There are also grep, tail, etc. for viewing portions of a large file. More detail on these and others here:
http://www.thegeekstuff.com/2009/08/10-awesome-examples-for-viewing-huge-log-files-in-unix
For Windows, there is Large Text File Viewer
I like to use the -u option of the UNIX sort utility to get unique lines based on a particular subset of columns, e.g. sort -u -k1,1 -k4,4
I have looked extensively in UNIX sort and GNU sort documentation, and I cannot find any guarantee that the -u option will return the first instance (like the uniq utility) after sorting by the specified keys.
It seems to work as desired in practice (sort by keys, then give first instance of each unique key combination), but I was hoping for some kind of guarantee in the documentation to put my paranoia at ease.
Does anyone know of such a guarantee?
I think the code for such a small utility is likely the only place you'll find such a guarantee. You can enable more debugging output as well if you'd like to see how it is working.
If you look through the code for GNU sort, it appears that the uniqueness testing happens after all sorting is completed, when it is iterating through the sorted contents of the temporary files created by the sorting process.
This happens in a while loop that compares the previous line, savedline, with smallest, the next smallest input line that would be output.
Thus, my opinion would be that it will process your sorting criteria first, then unique the output at the last step.
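One way to ease the paranoia without a documented guarantee is to emulate the behaviour you are hoping for and diff it against sort's output on the same data. Here is a minimal Python sketch of "stable-sort by the keys, keep the first input line per key", assuming whitespace-separated columns and the -k1,1 -k4,4 keys from the question:

def first_per_key(lines):
    def key(line):
        cols = line.split()
        return (cols[0], cols[3])  # fields 1 and 4, as in sort -u -k1,1 -k4,4

    seen = set()
    result = []
    for line in sorted(lines, key=key):  # sorted() is stable, so input order breaks ties
        k = key(line)
        if k not in seen:  # keep only the first line seen for each key combination
            seen.add(k)
            result.append(line)
    return result

Comparing this against sort -u -k1,1 -k4,4 on the same input at least tells you what your installed sort does, even if the documentation never promises it.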
Looking for some advice or insight on what I consider a simple method in Perl to compare text files to one another.
Let's assume you have 90,000 text files that are all structured similarly, say they have a common theme with a small amount of unique data in each.
My logic says to simply loop through the files (breaking them into 1,000 lines for simplicity), then loop through the number of files ... 90,000 - then loop through the 90,000 files again to compare them to each other. This becomes a virtually endless loop of a bazillion lines or processes.
Now the mandatory step here is to "remove" any line that is found in any file except the file we are working on. The ultimate goal is to scrub all the files down to content that is unique across the entire collection, even if it means some files end up empty.
I am saying files, but this could be rows in a database, or elements in an array. (I've tried them all.) The fastest solution so far has been to load all the files into MySQL, then run
UPDATE table SET column=REPLACE(column, find, replace); I also tried Parallel::ForkManager when working with MySQL.
The slowest approach actually exhausted my 32 GB of RAM - that was loading all 90k files into an array. 90k files didn't work at all; smaller batches like 1,000 work fine, but then don't get compared to the other 89,000.
Server specs if helpful: single quad-core E3-1240, 4 cores x 3.4 GHz w/ HT, 32 GB DDR3 ECC RAM at 1600 MHz, 1x 256 GB SSD.
So how does an engineer solve this problem? I am just a Perl hacker...
Tag every line with the filename (and maybe the line number) and sort all the lines using Sort::External. Then you can read the sorted records in order and write only the unique lines back to the result files.
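A rough Python sketch of that idea, with an in-memory sort standing in for Sort::External (the real 90,000-file job would need the external sort), assuming the goal is to keep only lines that occur exactly once across the whole collection:

from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def scrub_to_unique(filenames):
    # Tag every line with its source file so survivors can be written back.
    tagged = []
    for name in filenames:
        with open(name) as f:
            tagged.extend((line, name) for line in f)

    # Sort by line content (this is where Sort::External or a disk-based
    # sort would go); equal lines end up adjacent.
    tagged.sort(key=itemgetter(0))

    # Keep only lines that occur exactly once in the whole collection;
    # some files may legitimately end up empty.
    survivors = defaultdict(list)
    for line, group in groupby(tagged, key=itemgetter(0)):
        group = list(group)
        if len(group) == 1:
            survivors[group[0][1]].append(line)

    for name in filenames:
        with open(name, "w") as f:
            f.writelines(survivors[name])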
A Bloom filter is perfect for this, if you can handle arbitrarily small error.
To quote wikipedia: "A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either 'possibly in set' or 'definitely not in set'."
In essence, you'll use k hashes to hash each row to k spots on a bit array. Each time you encounter a new row, you are guaranteed you haven't seen it if at least one of the k hashed indices has a '0' bit. You can read up on Bloom filters to see how to size the array and choose k to make false positives arbitrarily small.
Then you go through your files, and either delete rows where you get a positive match, or copy the negative match rows into a new file.
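A toy Python sketch of such a filter, with the k hashes derived by salting one digest; the bit-array size and hash count below are illustrative placeholders that you would normally size from the expected row count and the false-positive rate you can tolerate:

import hashlib

class BloomFilter:
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _indices(self, row):
        # Derive k bit positions by salting a single digest with the hash number.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{row}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, row):
        for idx in self._indices(row):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, row):
        # False means definitely unseen; True means "possibly seen" (false positives happen).
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indices(row))

The per-file pass then keeps a row (and adds it to the filter) only when the filter says "definitely not seen"; a positive match is treated as a duplicate, which is exactly where the arbitrarily small error comes in.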
Sort the items using an external merge sort algorithm and remove the duplicates in the merge phase.
Actually, you can do that efficiently just by calling the sort command with the -u flag. From Perl:
system "sort -u @files >output";
Your sort command may provide several adjustable parameters to improve its performance. For instance, the number of parallel processes or the amount of memory it can allocate.
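With GNU sort, for example, those knobs are --parallel, -S (buffer size) and -T (where the temporary spill files go). A hedged sketch of the same call from Python rather than Perl, with illustrative file names and values:

import subprocess

files = ["part1.txt", "part2.txt", "part3.txt"]  # hypothetical inputs

# GNU sort: -u drops duplicates, --parallel and -S tune CPU and memory use,
# -T picks the disk used for temporary files.
subprocess.run(
    ["sort", "-u", "--parallel=4", "-S", "4G", "-T", "/scratch/tmp", "-o", "output.txt", *files],
    check=True,
)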
I have roughly ~600GB of dictionaries I've accumulated over the years, and I decided I want to clean them up and sort them.
First of all, each file on average is very large, anywhere from 500MB to 9GB in size. A prerequisite for what I want to do is that I sort each dictionary. My end goal is to entirely remove duplicate words within and throughout all dictionary files.
The reason for this is that most of my dictionaries are sorted and organized by categories, but duplicates still often exist.
Load a file
Read each line and put it into a data structure
Sort and remove any and all duplicates
Load the next file and repeat
Once all files are individually unique, compare them against each other and remove duplicates
For Dictionaries D{1} to D{N}:
1) Sort D{1} through D{N} individually.
2) Check uniqueness of each word in D{i}
3) For each word in D{i}, check ALL words across D{i+1} to D{N} and delete every later occurrence, since D{i} saw the word first.
I am considering using a sort of "hash" to improve this algorithm. Possibly by only checking the first one or two characters, since the list will be sorted (e.g. hash the beginning line location for words starting with a, b, etc.).
4) Save and exit.
Example before (but far smaller):
Dictionary 1 Dictionary 2 Dictionary 3
]a 0u3TGNdB 2 KLOCK
all avisskriveri 4BZ32nKEMiqEaT7z
ast chorion 4BZ5
astn chowders bebotch
apiala chroma bebotch
apiales louts bebotch
avisskriveri lowlander chorion
avisskriverier namely PC-Based
avisskriverierne silking PC-Based
avisskriving underwater PC-Based
So it would see that avisskriveri, chorion, bebotch and PC-Based are words that repeat both within and among the three dictionaries. I see avisskriveri in D{1} first, so I remove every other instance of it that I have seen. Then I see chorion in D{2} first, and remove every other instance of that, and so forth. In D{3}, bebotch and PC-Based are replicated, so I want to delete all but one entry of each (unless I've seen them before). Then save all files and close.
Example after:
Dictionary 1 Dictionary 2 Dictionary 3
]a 0u3TGNdB 2 KLOCK
all chorion 4BZ32nKEMiqEaT7z
ast chowders 4BZ5
astn chroma bebotch
apiala louts PC-Based
apiales lowlander
avisskriveri namely
avisskriverier silking
avisskriverierne underwater
avisskriving
Remember: I do NOT want to create any new dictionaries, only remove duplicates across all dictionaries.
Options:
"Hash" the amount of unique words for each file, allowing the program to estimate the computation time.
Specify a way to give the location of the first word beginning with the desired first letter, so that the search can "jump" to that line and skip unnecessary computation.
Run on a GPU for high-performance parallel computing. (This is an issue because getting the data off of the GPU is tricky.)
Goal: Reduce computational time and space consumption so that the method is affordable on a standard machine or server with limited abilities. Or devise a method for running it remotely on a GPU cluster.
tl;dr - Sorting unique words across hundreds of files, where each file is 1-9GB in size.
Assuming the dictionaries are in alphabetical order and line by line, one word per line (as are most dictionaries), you could do something like this:
Open a file stream to each file.
Open a file stream to the compiled list file.
Read 1 entry from each file and put it onto a heap, priority queue, or other sorted data structure.
while you still have entries
find & remove the first entry, storing the word (it is not necessary to store the file)
read in the next entry from that file, if one exists
find & remove any duplicates of the stored entry
read in the next entry for each of those files, if one exists
write the stored word to your compiled list file
Close all of the streams
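Here is a compact Python rendering of those steps, with heapq.merge playing the role of the priority queue (one open stream and one buffered entry per dictionary); the paths are placeholders, and one word per line is assumed as above:

import heapq
from contextlib import ExitStack

def merge_unique(dictionary_paths, output_path):
    with ExitStack() as stack:
        # One open stream per dictionary file, plus one for the compiled list.
        streams = [stack.enter_context(open(p)) for p in dictionary_paths]
        out = stack.enter_context(open(output_path, "w"))

        def words(stream):
            for line in stream:
                yield line.rstrip("\n")

        # heapq.merge keeps one entry per stream and always yields the smallest;
        # duplicates of the word just written are skipped.
        previous = None
        for word in heapq.merge(*(words(s) for s in streams)):
            if word != previous:
                out.write(word + "\n")
                previous = word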
The efficiency of this is something like O(n*m*log(n)) and the space efficiency is O(n), where n is the number of files and m is the average number of entries.
Note that you'll want to create a data type that pairs entries (strings) with file pointers/references and sorts by the string. You'll also need a data structure that allows you to peek before you pop.
If you have questions in implementation, ask me.
A more thorough analysis of the efficiency:
Space efficiency is pretty easy. You fill the data structure, and for every item you put on, you take one off, so it stays at O(n).
Computational efficiency is more complex. The looping itself is O(n*m), because you will consider each entry, and there are n*m entries. Some c percent of those will be valid, but that's a constant, so we don't care.
Next, adding and removing from a priority queue is log(n) both ways, so to find & remove is 2*log(n).
Because we add and remove each entry, we get n*m add and removes, so O(n*m*log(n)). I think it might actually be a theta in this case, but meh.
As far as I understand, there is no pattern to exploit in a clever way. So we want to do raw sorting.
Let us assume that no cluster farm is available (we could do other things then)
Then I would start with the easiest approach possible, the command line tool sort:
sort -u inp1 inp2 -o sorted
This will sort inp1 and inp2 together into the output file sorted without duplicates (-u = unique). sort typically uses a customized mergesort algorithm that can work within a limited amount of memory, so you should not run into memory problems.
You should have at least 600 GB (double the size) of free disk space.
You should test with only 2 input files to see how long it takes and what happens. My tests did not show any problems, but they used different data and an AFS server (which is rather slow, but a better approximation of some HPC filesystem setups):
$ ll
2147483646 big1
2147483646 big2
$ time sort -u big1 big2 -o bigsorted
1009.674u 6.290s 28:01.63 60.4% 0+0k 0+0io 0pf+0w
$ ll
2147483646 big1
2147483646 big2
117440512 bigsorted
I'd start with something like:
#include <iostream>
#include <string>
#include <set>

int main()
{
    typedef std::set<std::string> Words;  // std::set keeps words sorted and unique
    Words words;
    std::string word;
    while (std::cin >> word)
        words.insert(word);  // insert is a no-op if the word was seen before
    for (Words::const_iterator i = words.begin(); i != words.end(); ++i)
        std::cout << *i << '\n';  // one word per line
}
Then just:
cat file1 file2... | ./this_wonderful_program > greatest_dictionary.txt
This should be fine assuming the number of non-duplicate words fits in memory (likely on any modern PC, especially with 64 bits and > 4 GB); it will probably be I/O bound anyway, so there is no point fussing over an unordered set vs. a (binary-tree) set, etc. You may want to convert to lower-case, strip spurious characters, etc. before inserting into the set.
EDIT:
If the unique words don't fit in memory, or you're just stubbornly determined to sort each individual input then merge them, you can use the unix sort command on each file, then sort -m to efficiently merge the pre-sorted files. If you're not on UNIX/Linux, you can probably still find a port of sort (e.g. from Cygwin for Windows), your OS may have an equivalent program, or you could try compiling the sort source code. Note that this approach is a little different from tb-'s suggestion of asking one invocation of sort to sort everything (presumably in memory) - I'm not sure how well that would work, so best to try/compare.
At that scale of 300GB+, you may want to consider using Hadoop or some other scalable store - otherwise, you will have to deal with memory issues through your own coding. You can try other, more direct methods (UNIX scripting, small C/C++ programs, etc...), but you will likely run out of memory unless you have a ton of duplicate words in your data.
Addendum
Just came across memcached, which seems very close to what you are trying to accomplish, but you may have to tweak it not to throw away the oldest values. I don't have time to check right now, but you should do a search on distributed hash tables.
I need to merge about 30 gzip-ed text files, each about 10-15GB compressed, each containing multi-line records, each sorted by the same key. The files reside on an NFS share, I have access to them from several nodes, and each node has its own /tmp filesystem. What would be the fastest way to go about it?
Some possible solutions:
A. Leave it all to sort -m. To do that, I need to pass every input file through awk/sed/grep to collapse each record into a line and extract a key that would be understood by sort. So I would get something like
sort -m -k [...] <(preprocess file1) [...] <(preprocess filen) | postprocess
B. Look into python's heapq.merge.
C. Write my own C code to do this. I could merge the files in small batches, make an OMP thread for each input file, one for the output, and one actually doing the merging in RAM, etc.
Options for all of the above:
D. Merge a few files at a time, in a tournament.
E. Use several nodes for this, copying intermediate results in between the nodes.
What would you recommend? I don't have much experience with secondary storage efficiency, and as such, I find it hard to estimate how each of these would perform.
If you go for your solution B involving heapq.merge, then you will be delighted to know that Python 3.5 will add a key parameter to heapq.merge() according to docs.python.org, bugs.python.org and github.com. This will be a great solution to your problem.
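As a sketch of what that can look like once the key parameter is available: the record layout below (a record begins with a line starting with "KEY ") is purely a placeholder, since the question does not describe the actual format.

import gzip
import heapq

def records(path):
    # Yield (key, record_text) pairs from one gzip-ed file of multi-line records.
    # Hypothetical layout: a record starts with a line "KEY <key>" and runs
    # until the next such line; adapt this to the real format.
    key, lines = None, []
    with gzip.open(path, "rt") as f:
        for line in f:
            if line.startswith("KEY "):
                if key is not None:
                    yield key, "".join(lines)
                key, lines = line.split()[1], [line]
            else:
                lines.append(line)
        if key is not None:
            yield key, "".join(lines)

def merge_files(paths, out_path):
    streams = [records(p) for p in paths]
    with open(out_path, "w") as out:
        # Python 3.5+: heapq.merge can compare records on the key field directly.
        for _, record in heapq.merge(*streams, key=lambda rec: rec[0]):
            out.write(record)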