`uniq` without sorting an immense text file? - bash

I have a stupidly large text file (i.e. 40 gigabytes as of today) that I would like to filter for unique lines without sorting the file.
The file has unix line endings, and all content matches [[:print:]]. I tried the following awk script to display only unique lines:
awk 'a[$0] {next} 1' stupid.txt > less_stupid.txt
The thought was that I'd populate an array by referencing its elements, using the contents of the file as keys, then skip lines that were already in the array. But this fails for two reasons -- firstly, because it inexplicably just doesn't work (even on small test files), and secondly because I know that my system will run out of memory before the entire set of unique lines is loaded into memory by awk.
After searching, I found this answer which recommended:
awk '!x[$0]++'
And while this works on small files, it also will run out of memory before reading my entire file.
What's a better (i.e. working) solution? I'm open to just about anything, though I'm more partial to solutions in languages I know (bash & awk, hence the tags). In trying to visualize the problem, the best I've come up with would be to store an array of line checksums or MD5s rather than the lines themselves, but that only saves a little space and runs the risk of checksum collisions.
Any tips would be very welcome. Telling me this is impossible would also be welcome, so that I stop trying to figure it out. :-P

The awk '!x[$0]++' trick is one of the most elegant ways to de-duplicate a file or stream without sorting. However, it is memory-inefficient and unsuitable for large files, since it stores every unique line in memory.
A much more memory-efficient approach is to store a fixed-length hash of each line in the array rather than the whole line. You can do this with a Perl one-liner that is quite similar to the awk script:
perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' huge.txt
Here I used md5_base64 instead of md5_hex because the base64 encoding takes 22 bytes, while the hex representation takes 32.
However, since the Perl implementation of hashes still requires around 120 bytes per key, you may quickly run out of memory for your huge file.
The solution in this case is to process the file in chunks, splitting manually or using GNU Parallel with the --pipe, --keep-order and --block options (taking advantage of the fact that duplicate lines are not far apart, as you mentioned). Here is how you could do it with parallel:
cat huge.txt | pv |
parallel --pipe --keep-order --block 100M -j4 -q \
perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' > uniq.txt
The --block 100M option tells parallel to process the input in chunks of 100MB. -j4 means start 4 processes in parallel. An important argument here is --keep-order since you want the unique lines output to remain in the same order. I have included pv in the pipeline to get some nice statistics while the long running process is executing.
In a benchmark I performed with a random-data 1GB file, I reached a 130MB/sec throughput with the above settings, meaning you could de-duplicate your 40GB file in about 5 minutes (if you have a sufficiently fast hard disk able to write at this rate).
Other options include:
Use an efficient trie structure to store keys and check for duplicates. For example, marisa-trie is a very efficient implementation, written in C++ with Python wrappers.
Sort your huge file with an external merge sort or distribution/bucket sort
Store your file in a database and use SELECT DISTINCT on an indexed column containing your lines or, more efficiently, MD5 sums of your lines.
Or use bloom filters
Here is an example of using the Bloom::Faster module of Perl:
perl -e 'use Bloom::Faster; my $f = new Bloom::Faster({n => 100000000, e => 0.00001}); while(<>) { print unless $f->add($_); }' huge.txt > uniq.txt
You may install Bloom::Faster from CPAN (sudo cpan and then install "Bloom::Faster")
Explanation:
You have to specify the probabilistic error rate e and the number of available buckets n. The memory required for each bucket is about 2.5 bytes. If your file has 100 million unique lines then you will need 100 million buckets and around 260MB of memory.
The $f->add($_) function adds the hash of a line to the filter and returns true if the key (i.e. the line here) is a duplicate.
You can get an estimate of the number of unique lines in your file by parsing a small section of it with dd if=huge.txt bs=400M count=1 | awk '!a[$0]++' | wc -l (400MB) and multiplying that number by 100 (40GB). Then set the n option a little higher to be on the safe side.
In my benchmarks, this method achieved a 6MB/s processing rate. You may combine this approach with the GNU parallel suggestion above to utilize multiple cores and achieve a higher throughput.
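For example, a rough, untested sketch of that combination, with each worker getting its own Bloom filter sized for its ~100MB chunk (the n value here is just a guess you would tune after estimating your data):
cat huge.txt | pv |
parallel --pipe --keep-order --block 100M -j4 -q \
perl -e 'use Bloom::Faster; my $f = new Bloom::Faster({n => 10000000, e => 0.00001}); while (<>) { print unless $f->add($_); }' > uniq.txt
As with the chunked MD5 approach, this only removes duplicates that fall within the same 100MB block, so it relies on duplicate lines being close together.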

I don't have your data (or anything like it) handy, so I can't test this, but here's a proof of concept for you:
$ t='one\ntwo\nthree\none\nfour\nfive\n'
$ printf "$t" | nl -w14 -nrz -s, | sort -t, -k2 -u | sort -n | cut -d, -f2-
one
two
three
four
five
Our raw data includes one duplicated line. The pipes function as follows:
nl adds line numbers. It's a standard, low-impact unix tool.
sort the first time 'round sorts on the SECOND field -- what would have been the beginning of the line before nl. Adjust this as required for your data.
sort the second time puts things back in the order defined by the nl command.
cut merely strips off the line numbers. There are multiple ways to do this, but some of them depend on your OS. This one's portable, and works for my example.
Now... For obscenely large files, the sort command will need some additional options. In particular, --buffer-size and --temporary-directory. Read man sort for details about this.
I can't say I expect this to be fast, and I suspect you'll be using a ginormous amount of disk IO, but I don't see why it wouldn't at least work.
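For what it's worth, a sketch of how the full pipeline might look with those options (the buffer size and temporary directory are placeholders to adjust for your machine):
nl -w14 -nrz -s, stupid.txt |
sort -t, -k2 -u --buffer-size=8G --temporary-directory=/big/tmp |
sort -n --buffer-size=8G --temporary-directory=/big/tmp |
cut -d, -f2- > less_stupid.txt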

Assuming you can sort the file in the first place (i.e. that you can get sort file to work) then I think something like this might work (it depends on whether a large awk script file is better than a large awk array in terms of memory usage, etc.).
sort file | uniq -dc | awk '{gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + 2)"\"{x["NR"]++; if (x["NR"]>1){next}}"} END{print 7}' > dedupe.awk
awk -f dedupe.awk file
Which on a test input file like:
line 1
line 2
line 3
line 2
line 2
line 3
line 4
line 5
line 6
creates an awk script of:
$0=="line 2"{x[1]++; if (x[1]>1){next}}
$0=="line 3"{x[2]++; if (x[2]>1){next}}
7
and run as awk -f dedupe.awk file outputs:
line 1
line 2
line 3
line 4
line 5
line 6
If the size of the awk script itself is a problem (probably unlikely), you could cut that down by using another sentinel value, something like:
sort file | uniq -dc | awk 'BEGIN{print "{f=1}"} {gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + 2)"\"{x["NR"]++;f=(x["NR"]<=1)}"} END{print "f"}'
which cuts seven characters off each line (six if you remove the space from the original too) and generates:
{f=1}
$0=="line 2"{x[1]++;f=(x[1]<=1)}
$0=="line 3"{x[2]++;f=(x[2]<=1)}
f
This solution will probably run slower though because it doesn't short-circuit the script as matches are found.
If runtime of the awk script is too great it might even be possible to improve the time by sorting the duplicate lines based on match count (but whether that matters is going to be fairly data dependent).
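For example, something along these lines would test the most frequently duplicated lines first (untested; the only change is the extra sort -rn stage):
sort file | uniq -dc | sort -rn | awk '{gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + 2)"\"{x["NR"]++; if (x["NR"]>1){next}}"} END{print 7}' > dedupe.awk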

I'd do it like this:
#! /bin/sh
usage ()
{
echo "Usage: ${0##*/} <file> [<lines>]" >&2
exit 1
}
if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
if [ "$2" ]; then
expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
fi
LC_ALL=C
export LC_ALL
split -l ${2:-10000} -d -a 6 "$1"
for x in x*; do
awk '!x[$0]++' "$x" >"y${x}" && rm -f "$x"
done
cat $(ls yx* | sort -n) | sort | uniq -d | \
while IFS= read -r line; do
fgrep -x -n "$line" /dev/null yx* | sort -n | sed 1d | \
while IFS=: read -r file nr rest; do
sed -i -e ${nr}d "$file"
done
done
cat $(ls yx* | sort -n) >uniq_"$1" && rm -f yx*
(proof of concept; needs more polishing before being used in production).
What's going on here:
split splits the file in chunks of 10000 lines (configurable); the chunks are named x000000, x000001, ...
awk removes duplicates from each chunk, without messing with the line order; the resulting files are yx000000, yx000001, ... (since awk can't portably do changes in place)
cat $(ls yx* | sort -n) | sort | uniq -d reassembles the chunks and finds a list of duplicates; because of the way the chunks were constructed, each duplicated line can appear at most once in each chunk
fgrep -x -n "$line" /dev/null yx* finds where each duplicated line lives; the result is a list of lines yx000005:23:some text
sort -n | sed 1d removes the first chunk from the list above (this is the first occurrence of the line, and it should be left alone)
IFS=: read -r file nr rest splits yx000005:23:some text into file=yx000005, nr=23, and the rest
sed -i -e ${nr}d "$file" removes line $nr from chunk $file
cat $(ls yx* | sort -n) reassembles the chunks; they need to be listed in sorted order, to make sure they come back in the right order.
This is probably not very fast, but I'd say it should work. Increasing the number of lines in each chunk from 10000 can speed things up, at the expense of using more memory. The operation is O(N^2) in the number of duplicate lines across chunks; with luck, this wouldn't be too large.
The above assumes GNU sed (for -i). It also assumes there are no files named x* or yx* in the current directory (that's the part that could use some cleanup, perhaps by moving the junk into a directory created by mktemp -d).
Edit: Second version, after feedback from @EtanReisner:
#! /bin/sh
usage ()
{
echo "Usage: ${0##*/} <file> [<lines>]" >&2
exit 1
}
if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
if [ "$2" ]; then
expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
fi
tdir=$(mktemp -d -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
dupes=$(mktemp -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
trap 'rm -rf "$tdir" "$dupes"' EXIT HUP INT QUIT TERM
LC_ALL=C
export LC_ALL
split -l ${2:-10000} -d -a 6 "$1" "${tdir}/x"
ls -1 "$tdir" | while IFS= read -r x; do
awk '!x[$0]++' "${tdir}/${x}" >"${tdir}/y${x}" && \
rm -f "${tdir}/$x" || exit 1
done
find "$tdir" -type f -name 'yx*' | \
xargs -n 1 cat | \
sort | \
uniq -d >"$dupes" || exit 1
find "$tdir" -type f -name 'yx*' -exec fgrep -x -n -f "$dupes" /dev/null {} + | \
sed 's!.*/!!' | \
sort -t: -n -k 1.3,1 -k 2,2 | \
perl -e '
while(<STDIN>) {
chomp;
m/^(yx\d+):(\d+):(.*)$/o;
if ($dupes{$3}++)
{ push @{$del{$1}}, int($2) }
else
{ $del{$1} ||= [] }
}
undef %dupes;
chdir $ARGV[0];
for $fn (sort <"yx*">) {
open $fh, "<", $fn
or die qq(open $fn: $!);
$line = $idx = 0;
while(<$fh>) {
$line++;
if ($idx < @{$del{$fn}} and $line == $del{$fn}->[$idx])
{ $idx++ }
else
{ print }
}
close $fh
or die qq(close $fn: $!);
unlink $fn
or die qq(remove $fn: $!);
}
' "$tdir" >uniq_"$1" || exit 1

If there's a lot of duplication, one possibility is to split the file using split(1) into manageable pieces and use something conventional like sort/uniq to make a summary of unique lines for each piece. Each summary will be shorter than the piece itself. After this, you can merge the summaries to arrive at the final result.
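A minimal sketch of that idea (the chunk size and file names are arbitrary; note that, unlike the awk-based approaches above, this does not preserve the original line order):
split -l 5000000 -d -a 4 stupid.txt chunk_
for c in chunk_*; do
sort -u "$c" > "$c.uniq" && rm -f "$c"
done
sort -m -u chunk_*.uniq > unique_lines.txt
rm -f chunk_*.uniq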

Maybe not the answer you've been looking for but here goes: use a bloom filter.
https://en.wikipedia.org/wiki/Bloom_filter This sort of problem is one of the main reasons they exist.

Related

How to loop a variable range in cut command

I have a file with 2 columns, and I want to use the values from the second column to set the range in the cut command to select a range of characters from another file. The range I desire is the character at the position of the value in the second column plus the next 10 characters. I will give an example below.
My files are something like that:
File with 2 columns and no blank lines between lines (file1.txt):
NAME1 10
NAME2 25
NAME3 48
NAME4 66
File that I want to extract the variable ranges of characters from (just one very long line with no spaces) (file2.txt):
GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC
Desired resulting file, one sequence per line (result.txt):
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
The resulting file would have the characters from 10-20, 25-35, 48-58 and 66-76, each range on a new line. So it always keeps a range of 10 characters, but with different start points, and those start points are set by the values in the second column of the first file.
I tried the command:
for i in $(awk '{print $2}' file1.txt);
do
p1=$i;
p2=`expr "$1" + 10`
cut -c$p1-$2 file2.txt > result.txt;
done
I don't get any output or error message.
I also tried:
while read line; do
set $line
p2=`expr "$2" + 10`
cut -c$2-$p2 file2.txt > result.txt;
done <file1.txt
This last command gives me an error message:
cut: invalid range with no endpoint: -
Try 'cut --help' for more information.
expr: non-integer argument
There's no need for cut here; dd can do the job of indexing into a file and reading only the number of bytes you want. (Note that status=none is a GNUism; you may need to leave it out on other platforms and redirect stderr instead if you want to suppress informational logging.)
while read -r name index _; do
dd if=file2.txt bs=1 skip="$index" count=10 status=none
printf '\n'
done <file1.txt >result.txt
This approach avoids excessive memory requirements (as present when reading the whole of file2 -- assuming it's large), and has bounded performance requirements (overhead is equal to starting one copy of dd per sequence to extract).
Using awk
$ awk 'FNR==NR{a=$0; next} {print substr(a,$2+1,10)}' file2 file1
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
If file2.txt is not too large, then you can read it in memory,
and use Bash sub-strings to extract the desired ranges:
data=$(<file2.txt)
while read -r name index _; do
echo "${data:$index:10}"
done <file1.txt >result.txt
This will be much more efficient than running cut or another process for every single range definition.
(Thanks to @CharlesDuffy for the tip to read data without a useless cat, and for the while loop.)
One way to solve it:
#!/bin/bash
while read line; do
pos=$(echo "$line" | cut -f2 -d' ')
x=$(head -c $(( $pos + 10 )) file2.txt | tail -c 10)
echo "$x"
done < file1.txt > result.txt
It's not the solution an experienced bash hacker would use, but it is very good for someone who is new to bash. It uses tools that are very versatile, although somewhat poor if you need high performance. Shell scripting is commonly used by people who rarely write shell scripts, but know a few commands and just want to get the job done. That's why I'm including this solution, even if the other answers are superior for more experienced people.
The first line is pretty easy. It just extracts the numbers from file1.txt. The second line uses the very nice tools head and tail. Usually, they are used with lines instead of characters. Nevertheless, I print the first pos + 10 characters with head. The result is piped into tail which prints the last 10 characters.
Thanks to @CharlesDuffy for improvements.

Finding a uniq -c substitute for big files

I have a large file (50 GB) and I would like to count the number of occurrences of different lines in it. Normally I'd use
sort bigfile | uniq -c
but the file is large enough that sorting takes a prohibitive amount of time and memory. I could do
grep -cfx 'one possible line'
for each unique line in the file, but this would mean n passes over the file (one per possible line), which (although much more memory friendly) takes even longer than the original.
Any ideas?
A related question asks about a way to find unique lines in a big file, but I'm looking for a way to count the number of instances of each -- I already know what the possible lines are.
Use awk
awk '{c[$0]++} END {for (line in c) print c[line], line}' bigfile.txt
This is O(n) in time, and O(unique lines) in space.
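If you also want the most common lines first, the count output (which is only as big as the set of unique lines) can be sorted cheaply afterwards, for example:
awk '{c[$0]++} END {for (line in c) print c[line], line}' bigfile.txt | sort -rn | head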
Here is a solution using jq 1.5. It is essentially the same as the awk solution, both in approach and performance characteristics, but the output is a JSON object representing the hash. (The program can be trivially modified to produce output in an alternative format.)
Invocation:
$ jq -nR 'reduce inputs as $line ({}; .[$line] += 1)' bigfile.txt
If bigfile.txt consisted of these lines:
a
a
b
a
c
then the output would be:
{
  "a": 3,
  "b": 1,
  "c": 1
}
#!/bin/bash
# port this logic to awk or ksh93 to make it fast
declare -A counts=( )
while IFS= read -r line; do
counts[$line]=$(( counts[$line] + 1 )) # increment counter
done
# print results
for key in "${!counts[@]}"; do
count=${counts[$key]}
echo "Found $count instances of $key"
done

how to split a file into smaller files (one file per line) [split doesn't work]

I'm trying to split a very large file to one new file per line.
Why? It's going to be input for Mahout, but there are too many lines and not enough suffixes for split.
Is there a way to do this in bash?
Increase Your Suffix Length with Split
If you insist on using split, then you have to increase your suffix length. For example, assuming you have 10,000 lines in your file:
split --suffix-length=5 --lines=1 foo.txt
If you really want to go nuts with this approach, you can even set the suffix length dynamically with the wc command and some shell arithmetic. For example:
file='foo.txt'
split \
--suffix-length=$(( $(wc --chars < <(wc --lines < "$file")) - 1 )) \
--lines=1 \
"$file"
Use Xargs Instead
However, the above is really just a kludge anyway. A more correct solution would be to use xargs from the GNU findutils package to invoke some command once per line. For example:
xargs --max-lines=1 --arg-file=foo.txt your_command
This will pass one line at a time to your command. This is a much more flexible approach and will dramatically reduce your disk I/O.
split --lines=1 --suffix-length=5 input.txt output.
This will use 5 characters per suffix, which is enough for 26^5 = 11881376 files. If you really have more than that, increase suffix-length.
Here's another way to do something for each line:
while IFS= read -r line; do
do_something_with "$line"
done < big.file
GNU Parallel can do this:
cat big.file | parallel --pipe -N1 'cat > {#}'
But if Mahout can read from stdin then you can avoid the temporary files:
cat big.file | parallel --pipe -N1 mahout --input-file -
Learn more about GNU Parallel https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1 and walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

Fastest way to print a single line in a file

I have to fetch one specific line out of a big file (1500000 lines), multiple times in a loop over multiple files. I was asking myself what would be the best option (in terms of performance).
There are many ways to do this; I mainly use these two:
cat ${file} | head -1
or
cat ${file} | sed -n '1p'
I could not find an answer to this: do they both fetch only the first line, or does one of the two (or both) first open the whole file and then fetch row 1?
Drop the useless use of cat and do:
$ sed -n '1{p;q}' file
This will quit the sed script after the line has been printed.
Benchmarking script:
#!/bin/bash
TIMEFORMAT='%3R'
n=25
heading=('head -1 file' 'sed -n 1p file' "sed -n '1{p;q}' file" 'read line < file && echo $line')
# files up to a hundred million lines (if you're on a slow machine, decrease!!)
for (( j=1; j<=100000000; j=j*10 ))
do
echo "Lines in file: $j"
# create file containing j lines
seq 1 $j > file
# initial read of file
cat file > /dev/null
for comm in {0..3}
do
avg=0
echo
echo ${heading[$comm]}
for (( i=1; i<=$n; i++ ))
do
case $comm in
0)
t=$( { time head -1 file > /dev/null; } 2>&1);;
1)
t=$( { time sed -n 1p file > /dev/null; } 2>&1);;
2)
t=$( { time sed -n '1{p;q}' file > /dev/null; } 2>&1);;
3)
t=$( { time read line < file && echo $line > /dev/null; } 2>&1);;
esac
avg=$avg+$t
done
echo "scale=3;($avg)/$n" | bc
done
done
Just save as benchmark.sh and run bash benchmark.sh.
Results:
head -1 file
.001
sed -n 1p file
.048
sed -n '1{p;q}' file
.002
read line < file && echo $line
0
*Results from a file with 1,000,000 lines.*
So the times for sed -n 1p grow linearly with the length of the file, but the timings for the other variations are constant (and negligible), as they all quit after reading the first line.
Note: timings are different from original post due to being on a faster Linux box.
If you are really just getting the very first line and reading hundreds of files, then consider shell builtins instead of external commands: use read, which is a builtin in bash and ksh. This eliminates the overhead of process creation with awk, sed, head, etc.
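For example, a minimal way to grab just the first line with the read builtin (here "$file" stands for whichever file you are processing):
IFS= read -r line < "$file" && printf '%s\n' "$line"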
The other issue is doing timed performance analysis on I/O. The first time you open and then read a file, the file data is probably not cached in memory. However, if you try a second command on the same file, the data as well as the inode have been cached, so the timed results may be faster, pretty much regardless of the command you use. Plus, inodes can stay cached practically forever. They do on Solaris, for example. Or anyway, several days.
For example, Linux caches everything and the kitchen sink, which is a good performance attribute. But it makes benchmarking problematic if you are not aware of the issue.
All of this caching effect "interference" is both OS and hardware dependent.
So - pick one file, read it with a command. Now it is cached. Run the same test command several dozen times, this is sampling the effect of the command and child process creation, not your I/O hardware.
Here is sed vs read for 10 iterations of getting the first line of the same file, after reading the file once:
sed: sed '1{p;q}' uopgenl20121216.lis
real 0m0.917s
user 0m0.258s
sys 0m0.492s
read: read foo < uopgenl20121216.lis ; export foo; echo "$foo"
real 0m0.017s
user 0m0.000s
sys 0m0.015s
This is clearly contrived, but does show the difference between builtin performance vs using a command.
If you want to print only 1 line (say the 20th one) from a large file you could also do:
head -20 filename | tail -1
I did a "basic" test with bash and it seems to perform better than the sed -n '1{p;q} solution above.
The test takes a large file and prints a line from somewhere in the middle (line 10000000), repeating 100 times and selecting the next line each time. So it selects lines 10000000, 10000001, 10000002, ... and so on up to 10000099.
$wc -l english
36374448 english
$time for i in {0..99}; do j=$((i+10000000)); sed -n $j'{p;q}' english >/dev/null; done;
real 1m27.207s
user 1m20.712s
sys 0m6.284s
vs.
$time for i in {0..99}; do j=$((i+10000000)); head -$j english | tail -1 >/dev/null; done;
real 1m3.796s
user 0m59.356s
sys 0m32.376s
For printing a line out of multiple files
$wc -l english*
36374448 english
17797377 english.1024MB
3461885 english.200MB
57633710 total
$time for i in english*; do sed -n '10000000{p;q}' $i >/dev/null; done;
real 0m2.059s
user 0m1.904s
sys 0m0.144s
$time for i in english*; do head -10000000 $i | tail -1 >/dev/null; done;
real 0m1.535s
user 0m1.420s
sys 0m0.788s
How about avoiding pipes?
Both sed and head support the filename as an argument. In this way you avoid piping through cat. I didn't measure it, but head should be faster on larger files as it stops the computation after N lines (whereas sed goes through all of them, even if it doesn't print them - unless you specify the quit option as suggested above).
Examples:
sed -n '1{p;q}' /path/to/file
head -n 1 /path/to/file
Again, I didn't test the efficiency.
I have done extensive testing, and found that, if you want every line of a file:
while IFS=$'\n' read LINE; do
echo "$LINE"
done < your_input.txt
is much, much faster than any other (Bash-based) method out there. All other methods (like sed) read the file each time, at least up to the matching line. If the file is 4 lines long, you will get: 1 -> 1,2 -> 1,2,3 -> 1,2,3,4 = 10 reads, whereas the while loop just maintains a position cursor (based on IFS) and so only does 4 reads in total.
On a file with ~15k lines, the difference is phenomenal: ~25-28 seconds (sed based, extracting a specific line from each time) versus ~0-1 seconds (while...read based, reading through the file once)
The above example also shows how to set IFS in a better way (to a newline, with thanks to Peter from the comments below), and this will hopefully fix some of the other issues sometimes seen when using while ... read ... in Bash.
For the sake of completeness you can also use the basic linux command cut:
cut -d $'\n' -f <linenumber> <filename>

How could the UNIX sort command sort a very large file?

The UNIX sort command can sort a very large file like this:
sort large_file
How is the sort algorithm implemented?
How come it does not cause excessive consumption of memory?
The Algorithmic details of UNIX Sort command says Unix Sort uses an External R-Way merge sorting algorithm. The link goes into more details, but in essence it divides the input up into smaller portions (that fit into memory) and then merges each portion together at the end.
The sort command stores working data in temporary disk files (usually in /tmp).
#!/bin/bash
usage ()
{
echo Parallel sort
echo usage: psort file1 file2
echo Sorts text file file1 and stores the output in file2
}
# test if we have two arguments on the command line
if [ $# != 2 ]
then
usage
exit
fi
pv $1 | parallel --pipe --files sort -S512M | parallel -Xj1 sort -S1024M -m {} ';' rm {} > $2
WARNING: This script starts one shell per chunk; for really large files, this could be hundreds.
Here is a script I wrote for this purpose. On a 4 processor machine it improved the sort performance by 100% !
#! /bin/ksh
MAX_LINES_PER_CHUNK=1000000
ORIGINAL_FILE=$1
SORTED_FILE=$2
CHUNK_FILE_PREFIX=$ORIGINAL_FILE.split.
SORTED_CHUNK_FILES=$CHUNK_FILE_PREFIX*.sorted
usage ()
{
echo Parallel sort
echo usage: psort file1 file2
echo Sorts text file file1 and stores the output in file2
echo Note: file1 will be split in chunks up to $MAX_LINES_PER_CHUNK lines
echo and each chunk will be sorted in parallel
}
# test if we have two arguments on the command line
if [ $# != 2 ]
then
usage
exit
fi
#Cleanup any leftover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
rm -f $SORTED_FILE
#Splitting $ORIGINAL_FILE into chunks ...
split -l $MAX_LINES_PER_CHUNK $ORIGINAL_FILE $CHUNK_FILE_PREFIX
for file in $CHUNK_FILE_PREFIX*
do
sort $file > $file.sorted &
done
wait
#Merging chunks to $SORTED_FILE ...
sort -m $SORTED_CHUNK_FILES > $SORTED_FILE
#Cleanup any leftover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
See also:
"Sorting large files faster with a shell script"
I'm not familiar with the program but I guess it is done by means of external sorting (most of the problem is held in temporary files while relatively small part of the problem is held in memory at a time). See Donald Knuth's The Art of Computer Programming, Vol. 3 Sorting and Searching, Section 5.4 for very in-depth discussion of the subject.
Look carefully at the options of sort to speed up performance and understand its impact on your machine and problem.
Key parameters on Ubuntu are:
Location of temporary files: -T directory_name
Amount of memory to use: -S N% (N% of all memory to use; the more the better, but avoid oversubscription that causes swapping to disk. You can use it like "-S 80%" to use 80% of available RAM, or "-S 2G" for 2 GB of RAM.)
The questioner asks "Why no high memory usage?" The answer to that comes from history: older unix machines were small and the default memory size was set small. Adjust this as big as possible for your workload to vastly improve sort performance. Set the working directory to a place on your fastest device that has enough space to hold at least 1.25 * the size of the file being sorted.
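Putting the two together, an invocation might look something like this (the directory is a placeholder for a fast disk of your own):
sort -S 80% -T /fast_disk/sort_tmp large_file > sorted_large_file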
How to use -T option for sorting large file
I have to sort a large file's 7th column.
I was using:
grep vdd "file name" | sort -nk 7 |
I faced below error:
sort: write failed: /tmp/sort1hc37c: No space left on device
then I used -T option as below it worked:
grep vdda "file name" | sort -nk 7 -T /dev/null/ |
Memory should not be a problem - sort already takes care of that. If you want to make optimal usage of your multi-core CPU, I have implemented this in a small script (similar to some you might find on the net, but simpler/cleaner than most of those ;)).
#!/bin/bash
# Usage: psort filename <chunksize> <threads>
# In this example the file largefile.txt is split into chunks of 20 MB.
# The parts are sorted in 4 simultaneous threads before getting merged.
#
# psort largefile.txt 20m 4
#
# by h.p.
split -b $2 $1 $1.part
suffix=sorttemp.`date +%s`
nthreads=$3
i=0
for fname in `ls *$1.part*`
do
let i++
sort $fname > $fname.$suffix &
mres=$(($i % $nthreads))
test "$mres" -eq 0 && wait
done
wait
sort -m *.$suffix
rm $1.part*
