I am using Unix 'sort textfile.txt' to sort a text file of more than 155,000 KB.
It takes 2 minutes. Is there any way to make it faster, or is there a faster alternative?
Edit 1:
I tried both:
sort -S 50% --parallel=4
It takes the same time, 2 minutes!
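For reference, a complete invocation with those flags, writing the result out to a file, might look like this (file names are illustrative):

# GNU sort: use up to 50% of RAM for the buffer, 4 sorting threads,
# and write the sorted output with -o.
sort -S 50% --parallel=4 -o sorted.txt textfile.txt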
Related
I am using bash and shuf to shuffle a 400-million-line file, and it took about two hours when I operated on the file directly.
Since this is a little long for my taste and I have to repeat this shuffling, I split the file into about 400 chunks of 1x10^6 lines each and cat them back together after shuffling.
Even with the split operation and the cat, it takes just short of 10 minutes. Could anybody help me understand this bad scaling?
Thanks very much!
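For reference, a rough sketch of the chunked workflow described above, with illustrative file names (split and shuf are GNU coreutils tools):

# Split into chunks of 1,000,000 lines each (chunk_aa, chunk_ab, ...).
split -l 1000000 bigfile.txt chunk_

# Shuffle each chunk independently, then concatenate the results.
for f in chunk_*; do shuf "$f" > "$f.shuffled"; done
cat chunk_*.shuffled > shuffled.txt
rm chunk_*

Note that shuffling chunks independently and concatenating them is not the same as shuffling the whole file, since lines never leave their own chunk.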
Given that I have a very large log file, large enough that it cannot be loaded into main memory, and I want to sort it somehow, what would be the most recommended sorting technique and algorithm?
If you have GNU sort, use it. It knows how to deal with large files. For details, see the answers to How to sort big files on Unix SE. You will of course need sufficient free disk space.
If you are looking for an algorithm, you could apply merge sort.
Essentially, you split your data into smaller chunks and sort each chunk. Then you take two sorted chunks and merge them (this can be done in a streaming fashion: just take the smaller of the two chunks' current values and advance); this results in a bigger chunk. Keep doing this until you have merged all chunks.
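In GNU-tool terms, a minimal sketch of that split-sort-merge idea (file names and chunk size are illustrative; GNU sort already does all of this internally when a file does not fit in memory):

# Split the input into chunks small enough to sort in memory.
split -l 1000000 huge.log part_

# Sort each chunk individually.
for f in part_*; do sort "$f" -o "$f"; done

# Merge the already-sorted chunks in one streaming pass.
sort -m part_* > huge.sorted
rm part_*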
This depends on the OS. If you are on Linux/Unix, you can use the sed command to print specific lines:
sed -n -e 120p /var/log/syslog
Which would print line 120 of the syslog file. You could also use head
head -n 15 /var/log/syslog
Which would print the first 15 lines of the syslog file. There are also grep, tail, etc. for viewing portions of a large file. More detail on these and more is here:
http://www.thegeekstuff.com/2009/08/10-awesome-examples-for-viewing-huge-log-files-in-unix
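As hedged examples of the tail and grep usage mentioned above (the path and pattern are just illustrations):

tail -n 15 /var/log/syslog    # print the last 15 lines of the file
grep "error" /var/log/syslog  # print only the lines containing "error"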
For Windows, there is Large Text File Viewer
I have around five hundred 23M files of data that I am trying to sort into one database. I have sorted them all individually using sort, and am now trying to merge them using sort --merge so that the whole sample is sorted together. Then I plan to split them up into five hundred files again.
The issue I am running into is that my drive is very congested, and I only have about 805 GB available according to df. When I run sort --merge file1 file2 file3..., I eventually receive an error that sort has failed because there is no space left on the drive.
Are there any tips for how I could work around or solve this issue, or is the only solution to free up more space?
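For reference, the merge-and-split workflow described above might look like the sketch below (file names are illustrative; -T, if needed, points GNU sort's temporary files at whichever disk has room, and GNU split's -n l/500 divides the result into 500 pieces without breaking lines):

# Merge the individually pre-sorted files into one output file.
sort --merge -T /path/to/roomy/tmp sorted_part_* > merged.txt

# Split the merged result back into 500 line-aligned pieces.
split -n l/500 merged.txt resorted_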
How do I get the line count of a large file, at least 5 GB? What is the fastest approach using the shell?
Step 1: head -n $n filename > newfile   # copy the first n lines into newfile, e.g. n = 5
Step 2: Get the huge file's size, A.
Step 3: Get newfile's size, B.
Step 4: (A/B)*n is approximately equal to the exact line count.
Set n to a few different values, repeat, and average the results.
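A sketch of that estimate as a script, assuming GNU stat and bash arithmetic (the file name and n are illustrative):

n=5
head -n "$n" bigfile.txt > newfile
A=$(stat -c %s bigfile.txt)   # size of the huge file in bytes
B=$(stat -c %s newfile)       # size of the first n lines in bytes
echo $(( A * n / B ))         # approximate line count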
The fastest approach is likely to be wc -l.
The wc command is optimized to do exactly this kind of thing. It's very unlikely that anything else you can do (other than doing it on more powerful hardware) is going to be any faster.
Yes, counting lines in a 5 gigabyte text file is slow. It's a big file.
The only alternative would be to store the data in some different format in the first place, perhaps a database, perhaps a file with fixed-length records. Converting your 5 gigabyte text file to some other format is going to take at least as long as running wc -l on it, but it might be worth it if you're going to be counting lines a lot. It's impossible to say what the tradeoffs are without more information.
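For example, with fixed-length records the count becomes plain arithmetic on the file size, with no scan of the contents (the 128-byte record length and file name are illustrative; assumes GNU stat):

# Record (line) count = file size / record size, when every record is exactly 128 bytes.
echo $(( $(stat -c %s data.bin) / 128 ))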
I have a file of about 1.5 GB.
I need to find 3 billion byte sequences in this file. Each sequence is 4 or 5 bytes.
For each sequence, I need to find its first position, or to confirm that the sequence is not in the file.
What is the fastest way to do this?
The RAM limit on the computer is 4 GB.
Use grep. It's highly optimized for finding things in large files.
If that's not an option, read about the Boyer-Moore algorithm it uses and implement it yourself. It'll take a lot of tweaking to reproduce the same speed grep has though.
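As a hedged sketch of pointing grep at raw bytes (the example sequence is arbitrary): -F takes the pattern as a fixed string, -a treats the binary file as text, and -b with -o reports each match's byte offset, so the first line of output gives the first position and empty output means the sequence is absent.

LC_ALL=C grep -obaF "$(printf '\xde\xad\xbe\xef')" bigfile.bin | head -n 1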
Use preprocessing.
I think you should just create an index: make one pass through the file, recording the first occurrence of every unique 4-byte sequence. Store each 4-byte sequence and its first position in a separate file, sorted by the byte sequence.
A simple binary search on the index file will then find your sequence efficiently.
You could be more clever and use hashing to reduce the search to O(1).
Check out the Searchlight search engine.
This program allows multiple search sequences of up to 10 ASCII bytes each to be stored within a single file. You then point it at a file, a directory, a file of filenames, a file of directory names, an array list of filenames, or an array list of directory names, and away it goes!
Furthermore, it reports the file byte position/offset of each sequence found.