I am using bash and shuf to shuffle a 400 million line file, and it takes about two hours when I operate on the file directly.
Since this is a little long for my taste and I have to repeat this shuffling, I split the file into about 400 chunks of 1x10^6 lines each and cat them back together after shuffling.
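Roughly, the pipeline looks like this (the file names and the chunk prefix are just illustrative):
split -l 1000000 bigfile chunk_                     # ~400 pieces of 1x10^6 lines each
for f in chunk_*; do shuf "$f" > "$f.shuffled"; done
cat chunk_*.shuffled > bigfile.shuffled
rm chunk_*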
Even with the split operation and the cat, it takes just short of 10 minutes. Could anybody help me understand this bad scaling?
Thanks very much!
Let's assume that we have a file of 100k lines, or ~2 GB, and we want to split it into 10 chunks of 10k lines each, so that the chunks can then be processed in parallel. Is there any way to create pointers to the starting line of each of the 10 chunks without needing to traverse the whole file? I was thinking of somehow dividing the file according to its size, so that a pointer is created every 200 MB. Is this even feasible?
Yes, of course. But you need to make some assumptions and accept that your chunks will not be exact.
Either assume a standard line length or scan a few lines and measure it. Then you multiply that by the number of lines you are aiming for and just hope it's a good estimate.
Or if you just want 10 chunks take the file size and divide by 10.
So then you jump to that point in the file, either by using lseek and read, pread, or mmap. Then you scan forward until you find the end of a line and the start of the next.
It won't be exact line counts unless you actually count every line. But it will be pretty close.
I was bored and curious so check this out:
https://github.com/zlynx/linesection
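A rough shell version of the same idea, just as a sketch (the file name is a placeholder; GNU stat and tail assumed; the repo above does it properly at the syscall level):
size=$(stat -c %s bigfile)                  # total size in bytes
chunk=$(( size / 10 ))                      # approximate size of one of the 10 chunks
offset=$(( chunk * 3 ))                     # e.g. the rough start of the 4th chunk
# jump near that byte offset, then scan forward to the next newline
skip=$(tail -c +"$(( offset + 1 ))" bigfile | head -n 1 | wc -c)
echo "chunk 4 starts at byte $(( offset + skip ))"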
Given that I have a very large log file, large enough that it cannot be loaded into main memory, and I want to sort it somehow, what would be the most recommended sorting technique and algorithm?
If you have GNU sort, use it. It knows how to deal with large files. For details, see the answers to How to sort big files on Unix SE. You will of course need sufficient free disk space.
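For example, something along these lines (the temporary directory and buffer size are just assumptions about your machine):
sort -S 2G -T /scratch/tmp big.log -o big.sorted    # GNU sort spills sorted runs to the -T directory and merges them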
If you are looking for an algorithm, you could apply merge sort.
Essentially, you split your data into smaller chunks and sort each chunk. Then you take two sorted chunks and merge them (this can be done in a streaming fashion: repeatedly take the smaller of the two chunks' front values and advance past it); this results in a bigger sorted chunk. Keep doing this until you have merged all the chunks.
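A minimal shell sketch of that idea, leaning on sort's own merge mode (file names are illustrative):
split -l 1000000 big.log piece_                 # break the input into ~1M-line pieces
for f in piece_*; do sort "$f" -o "$f"; done    # sort each piece on its own
sort -m piece_* > big.sorted                    # -m merges already-sorted files in one streaming pass
rm piece_*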
This depends on OS. If on Linux/Unix, you can use the sed command to print specific lines
sed -n -e 120p /var/log/syslog
Which would print line 120 of the syslog file. You could also use head
head -n 15 /var/log/syslog
Which would print the first 15 lines of the syslog file. There are also grep, tail, etc. for viewing portions of a large file. More detail on these and more here:
http://www.thegeekstuff.com/2009/08/10-awesome-examples-for-viewing-huge-log-files-in-unix
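For instance, a few more of those, using the same example file:
tail -n 20 /var/log/syslog                # the last 20 lines
sed -n '100,120p' /var/log/syslog         # lines 100 through 120
grep -n 'error' /var/log/syslog           # matching lines, with their line numbers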
For Windows, there is Large Text File Viewer
How would I split a text file into thirds using the Terminal if I do not know exactly how many lines of text there are? I know it is around 3,000 or so.
Do you mean splitting the total number of lines into three?
If yes, then write a small script that would:
- Count the number of lines; wc -l should do that.
- Divide that number by 3, then read exactly that many lines from the text file into each of the first two output files, and write all remaining lines to the third file, since the total number of lines is not always an exact multiple of 3 (see the sketch below). Hope that helps.
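A sketch of that in the shell (the file and output names are placeholders):
total=$(wc -l < textfile.txt)
per=$(( total / 3 ))
head -n "$per" textfile.txt > part1
head -n "$(( per * 2 ))" textfile.txt | tail -n "$per" > part2
tail -n +"$(( per * 2 + 1 ))" textfile.txt > part3    # the rest, including any remainder lines
With GNU split you could also skip the counting entirely with split -n l/3 textfile.txt part_.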
I am using unix 'sort textfile.txt' to sort a text file of more than 155,000 KB.
It takes 2 min. Is there any way to make it faster, or is there any faster alternative?
Edit1:
I tried both:
sort -S 50% --parallel=4
It takes the same time, "2 min"!
How do I get the line count of a large file, at least 5 GB? What is the fastest approach using the shell?
Step 1: head -n 5 filename > newfile (get the first n lines into newfile, e.g. n = 5)
Step 2: Get the huge file's size, A.
Step 3: Get newfile's size, B.
Step 4: (A/B)*n is approximately equal to the exact line count.
Set n to different values, repeat a few more times, then take the average.
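A sketch of those steps, assuming GNU stat for the byte sizes:
n=1000
head -n "$n" filename > newfile
A=$(stat -c %s filename)                  # size of the huge file in bytes
B=$(stat -c %s newfile)                   # size of the n-line sample in bytes
echo $(( A * n / B ))                     # approximate line count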
The fastest approach is likely to be wc -l.
The wc command is optimized to do exactly this kind of thing. It's very unlikely that anything else you can do (other than doing it on more powerful hardware) is going to be any faster.
Yes, counting lines in a 5 gigabyte text file is slow. It's a big file.
The only alternative would be to store the data in some different format in the first place, perhaps a database, perhaps a file with fixed-length records. Converting your 5 gigabyte text file to some other format is going to take at least as long as running wc -l on it, but it might be worth it if you're going to be counting lines a lot. It's impossible to say what the tradeoffs are without more information.
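For instance, with fixed-length records the count falls out of the file size alone; the file name and the 120-byte record length here are purely hypothetical:
size=$(stat -c %s records.dat)            # file size in bytes (GNU stat)
echo $(( size / 120 ))                    # one record every 120 bytes, so no scan is needed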