Stream filter large number of lines that are specified by line number from stdin - bash

I have a huge xz compressed text file huge.txt.xz with millions of lines that is too large to keep around uncompressed (60GB).
I would like to quickly filter/select a large number of lines (~1000s) from that huge text file into a file filtered.txt. The line numbers to select could for example be specified in a separate text file select.txt with a format as follows:
10
14
...
1499
15858
Overall, I envisage a shell command as follows where "TO BE DETERMINED" is the command I'm looking for:
xz -dcq huge.txt.xz | "TO BE DETERMINED" select.txt >filtered.txt
I've managed to find an awk program from a closely related question that almost does the job - the only problem being that it takes a file name instead of reading from stdin. Unfortunately, I don't really understand the awk script and don't know enough awk to alter it in such a way to work in this case.
This is what works right now, with the disadvantage of having a 60GB file lying around rather than streaming:
xz -dcq huge.txt.xz >huge.txt
awk '!firstfile_proceed { nums[$1]; next }
(FNR in nums)' select.txt firstfile_proceed=1 >filtered.txt
Inspiration: https://unix.stackexchange.com/questions/612680/remove-lines-with-specific-line-number-specified-in-a-file

Keeping with OP's current idea:
xz -dcq huge.txt.xz | awk '!firstfile_proceed { nums[$1]; next } (FNR in nums)' select.txt firstfile_proceed=1 -
Where the - (at the end of the line) tells awk to read from stdin (in this case the output from xz that's being piped to the awk call).
Another way to do this (replaces all of the above code):
awk '
FNR==NR { nums[$1]; next } # process first file
FNR in nums # process subsequent file(s)
' select.txt <(xz -dcq huge.txt.xz)
Comments removed and cut down to a 'one-liner':
awk 'FNR==NR {nums[$1];next} FNR in nums' select.txt <(xz -dcq huge.txt.xz)
Adding some logic to implement Ed Morton's comment (exit processing once FNR > largest value from select.txt):
awk '
# process first file
FNR==NR { nums[$1]
maxFNR= ($1>maxFNR ? $1 : maxFNR)
next
}
# process subsequent file(s):
FNR > maxFNR { exit }
FNR in nums
' select.txt <(xz -dcq huge.txt.xz)
NOTES:
keeping in mind we're talking about scanning millions of lines of input ...
FNR > maxFNR will obviously add some cpu/processing time to the overall operation (though less time than FNR in nums)
if the operation routinely needs to pull rows from, say, the last 25% of the file then FNR > maxFNR is likely providing little benefit (and probably slowing down the operation)
if the operation routinely finds all desired rows in, say, the first 50% of the file then FNR > maxFNR is probably worth the cpu/processing time to keep from scanning the entire input stream (then again, the xz operation, on the entire file, is likely the biggest time consumer)
net result: the additional FNR > maxFNR test may speed up or slow down the overall process depending on how much of the input stream needs to be processed in a typical run; OP would need to run some tests to see if there's a (noticeable) difference in overall runtime
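For completeness, the early-exit version should drop straight into the OP's original pipeline, reading the xz output from stdin via - just as in the first suggestion (a sketch, untested against the real 60GB file):
xz -dcq huge.txt.xz | awk '
FNR==NR { nums[$1]
          maxFNR = ($1>maxFNR ? $1 : maxFNR)
          next
        }
FNR > maxFNR { exit }
FNR in nums
' select.txt - >filtered.txt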

To clarify my previous comment, I'll show a simple reproducible sample:
linelist content:
10
15858
14
1499
To simulate a long input, I'll use seq -w 100000000.
Comparing the sed solution with my suggestion, we have:
#!/bin/bash
time (
sed 's/$/p/' linelist > selector
seq -w 100000000 | sed -nf selector
)
time (
sort -n linelist | sed '$!{s/$/p/};$s/$/{p;q}/' > my_selector
seq -w 100000000 | sed -nf my_selector
)
output:
000000010
000000014
000001499
000015858
real 1m23.375s
user 1m38.004s
sys 0m1.337s
000000010
000000014
000001499
000015858
real 0m0.013s
user 0m0.014s
sys 0m0.002s
Comparing my solution with awk:
#!/bin/bash
time (
awk '
# process first file
FNR==NR { nums[$1]
maxFNR= ($1>maxFNR ? $1 : maxFNR)
next
}
# process subsequent file(s):
FNR > maxFNR { exit }
FNR in nums
' linelist <(seq -w 100000000)
)
time (
sort -n linelist | sed '$!{s/$/p/};$s/$/{p;q}/' > my_selector
sed -nf my_selector <(seq -w 100000000)
)
output:
000000010
000000014
000001499
000015858
real 0m0.023s
user 0m0.020s
sys 0m0.001s
000000010
000000014
000001499
000015858
real 0m0.017s
user 0m0.007s
sys 0m0.001s
My conclusion: the sed solution using q is comparable to the awk solution. For readability and maintainability, I prefer the awk solution.
Anyway, this test is simplistic and only useful for small comparisons. I don't know, for example, what the result would be if I tested this against the real compressed file, with heavy disk I/O.
EDIT by Ed Morton:
Any speed test where all of the resulting times are less than a second is a bad test because:
In general no-one cares if X runs in 0.1 or 0.2 secs, they're both fast enough unless being called in a large loop, and
Things like cache-ing can impact the results, and
Often a script that runs faster for a small input set where execution speed doesn't matter will run slower for a large input set where execution speed DOES matter (e.g. if the script that's slower for the small input spends time setting up data structures that will allow it to run faster for the larger)
The problem with the above example is that it only prints 4 lines rather than the 1000s of lines the OP said they'd have to select, so it doesn't exercise the difference that makes the sed solution much slower than the awk one: the sed solution has to test every target line number against every line of input, while the awk solution just does a single hash lookup on the current line number. It's an O(N) vs an O(1) algorithm per line of the input file.
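To make that concrete, the my_selector script generated above for the four sample line numbers is just a list of addresses that sed has to re-test against every single input line:
10p
14p
1499p
15858{p;q}
whereas the awk solution does one hash lookup (FNR in nums) per input line, no matter how many target line numbers there are.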
Here's a better example, printing every 100th line from a 1000000-line file (i.e. it will select 10,000 lines) rather than just 4 lines from a file of any size:
$ cat tst_awk.sh
#!/usr/bin/env bash
n=1000000
m=100
awk -v n="$n" -v m="$m" 'BEGIN{for (i=1; i<=n; i+=m) print i}' > linelist
seq "$n" |
awk '
FNR==NR {
nums[$1]
maxFNR = $1
next
}
FNR in nums {
print
if ( FNR == maxFNR ) {
exit
}
}
' linelist -
$ cat tst_sed.sh
#!/usr/bin/env bash
n=1000000
m=100
awk -v n="$n" -v m="$m" 'BEGIN{for (i=1; i<=n; i+=m) print i}' > linelist
sed '$!{s/$/p/};$s/$/{p;q}/' linelist > my_selector
seq "$n" |
sed -nf my_selector
$ time ./tst_awk.sh > ou.awk
real 0m0.376s
user 0m0.311s
sys 0m0.061s
$ time ./tst_sed.sh > ou.sed
real 0m33.757s
user 0m33.576s
sys 0m0.045s
As you can see the awk solution ran 2 orders of magnitude faster than the sed one, and they produced the same output:
$ diff ou.awk ou.sed
$
If I make the input file bigger and select 10,000 lines from it by setting:
n=10000000
m=1000
in each script, which is probably getting more realistic for the OP's usage, the difference becomes really impressive:
$ time ./tst_awk.sh > ou.awk
real 0m2.474s
user 0m2.843s
sys 0m0.122s
$ time ./tst_sed.sh > ou.sed
real 5m31.539s
user 5m31.669s
sys 0m0.183s
i.e. awk runs in 2.5 seconds while sed takes 5.5 minutes!

If you have a file of line numbers, add p to the end of each and run it as a sed script.
If linelist contains
10
14
1499
15858
then sed 's/$/p/' linelist > selector creates
10p
14p
1499p
15858p
then
$: for n in {1..1500}; do echo $n; done | sed -nf selector
10
14
1499
I didn't send enough lines through to match 15858 so that one didn't print.
This works the same when decompressing from a file.
$: tar xOzf x.tgz | sed -nf selector
10
14
1499
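Applied to the question's xz-compressed file, the same pattern would presumably be (a sketch, untested):
xz -dcq huge.txt.xz | sed -nf selector >filtered.txt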

Related

Why is my awk script much slower than the head+tail script?

I want to split a huge file (big.txt) by given line numbers. For example, if the given numbers are 10, 15 and 30, I will get 4 files: lines 1-10, 11-15, 16-30, and from line 31 to the end of big.txt.
Solving the problem is not a challenge for me; I wrote 3 different solutions. However, I cannot explain the performance: why is the awk script (GNU Awk) the slowest?
For big.txt, I just did seq 1500000000 > big.txt (about 1.5 billion lines, ~15 GB).
First, head and tail:
INPUT_FILE="big.txt" # The input file
LINE_NUMBERS=( 400000 700000 1200000 ) # Given line numbers
START=0 # The offset to calculate lines
IDX=1 # The index used in the name of generated files: file1, file2 ...
for i in "${LINE_NUMBERS[#]}"
do
# Extract the lines
head -n $i "$INPUT_FILE" | tail -n +$START > "file$IDX.txt"
#
(( IDX++ ))
START=$(( i+1 ))
done
# Extract the last given line - last line in the file
tail -n +$START "$INPUT_FILE" > "file$IDX.txt"
The 2nd: sed:
INPUT_FILE="big.txt" # The input file
LINE_NUMBERS=( 400000 700000 1200000 ) # Given line numbers
START=1 # The offset to calculate lines
IDX=1 # The index used in the name of generated files: file1, file2 ...
for i in "${LINE_NUMBERS[#]}"
do
T=$(( i+1 ))
# Extract the lines using sed command
sed -n -e " $START, $i p" -e "$T q" "$INPUT_FILE" > "file$IDX.txt"
(( IDX++ ))
START=$T
done
# Extract the last given line - last line in the file
sed -n "$START, $ p" "$INPUT_FILE" > "file$IDX.txt"
The last, awk:
awk -v nums="400000 700000 1200000" 'BEGIN{c=split(nums,a)} {
for(i=1; i<=c; i++){
if( NR<=a[i] ){
print > "file" i ".txt"
next
}
}
print > "file" c+1 ".txt"
}' big.txt
From my testing (using time command), the head+tail is the fastest:
real 73.48
user 1.42
sys 17.62
the sed one:
real 144.75
user 105.68
sys 15.58
the awk one:
real 234.21
user 187.92
sys 3.98
The awk went through the file only once, so why is it much slower than the other two? Also, I thought head and tail would be the slowest solution; how come it's so fast? I guess it might have something to do with awk's redirection (print > file)?
Can someone explain it to me? Thank you.
Can awk be faster than head and tail for this?
No, it will be slower, at least for a reasonable number of chunks of a large input file, because awk has to read every line and do some work with it. head and tail, on the other hand, mostly just scan through the newline characters in bulk, without doing anything else, until they reach the line number given as an argument. From that point they don't have to read line by line and decide what to do; they simply dump the content, much like cat.
If we increase the number of chunks, i.e. if the array of splitting line numbers gets larger and larger, then we will reach a point where the cost of spawning many head and tail processes overcomes the cost of one awk process, and from that point on awk would be faster.
awk script improvement
This awk is slow because of that loop! Just consider that for the last output file, for every line we print, we run 4 iterations before printing it. Of course the time complexity still remains linear in the input, but all these checks and assignments have costs that become observable as the input grows. It can be much improved, e.g. like this:
> cat tst.awk
BEGIN {
a[1]
a[40000]
a[70000]
a[120000]
}
NR in a {
close(out)
out = "file" ++i ".txt"
}
{ print > out }
Here we only test NR once per line; mostly we just print.
awk -f tst.awk big.txt
Testing
Here is some basic testing. I made a file, not huge, with 5.2M lines.
> wc -l big.txt
5288558 big.txt
Now, with that loop, it really matters where you split the file! If you have to write most of the rows into the last chunks, that means more iterations, and it is slower:
> head -1 test.sh
awk -v nums="100000 200000 300000" 'BEGIN{c=split(nums,a)} {
> time sh test.sh
real 0m10.960s
user 0m10.823s
sys 0m0.066s
If most rows go to the first file (meaning one iteration and then next), it becomes faster!
> head -1 test.sh
awk -v nums="5000000 5100000 5200000" 'BEGIN{c=split(nums,a)} {
> time sh test.sh
real 0m6.914s
user 0m6.838s
sys 0m0.043s
With the above modification it should be fast enough regardless of the cut points.
> time awk -f tst.awk big.txt
real 0m4.270s
user 0m4.185s
sys 0m0.048s
For awk, each line requires a loop, comparisons, and building the filename. Perhaps awk also performs the hard task of parsing each line.
You may want to try the following experiments:
try mawk (a fast implementation of awk) and check whether it is much faster.
remove print > "file" i ".txt" and see how much time it saves (a sketch follows).

mawk syntax appropriate for a >1000-field file, for counting non-numeric data column-wise?

The following silly hard-coding of what ought to be some kind of loop or parallel construct works nominally, but it is poor mawk syntax. My good-mawk-syntax attempts have all failed, using for loops in mawk (not shown) and GNU parallel (not shown).
It really needs to read the CSV file from disk just once, not once per column, because I have a really big CSV file (millions of rows, thousands of columns). My original code worked fine-ish (not shown), but it read the whole disk file again for every column; it was taking hours and I killed it after realizing what was happening. I have a fast solid-state disk on a GPU connector slot, so disk reads are blazing fast on this device. Thus CPU is the bottleneck here. Code silliness is even more of a bottleneck if I have to hard-code 4000 lines of basically the same statements, differing only in column number.
The code makes column-wise counts of non-numeric values. I need some looping (a for loop) or parallelism (preferred), because while the following works correctly on 2 columns, it is not a scalable way to write mawk code for thousands of columns.
tail -n +1 pht.csv | awk -F"," '(($1+0 != $1) && ($1!="")){cnt1++}; (($2+0 != $2) && ($2!="")){cnt2++} END{print cnt1+0; print cnt2+0}'
2
1
How can the "column 1 processing; column 2 processing;" duplicate code be reduced? How can looping be introduced? How can gnu parallel be introduced? Thanks much. New to awk, I am. Not new to other languages.
I keep expecting some clever combo of one or more of the following bash commands is going to solve this handily, but here I am many hours later with nothing to show. I come with open hands. Alms for the code-poor?
seq 1 2 ( >>2 for real life CSV file)
tail (to skip the header or not as needed)
mawk (nice-ish row-wise CSV file processing, with that handy syntax I showed you in my demo for finding non-numerics easily in a supposedly all-numeric CSV datafile of jumbo dimensions)
tr (removes newline which is handy for transpose-ish operations)
cut (to grab a column at a time)
parallel (fast is good, and I have mucho cores needing something to work on, and phat RAM)
Sorry, I am absolutely required to not use CSV specific libraries like python pandas or R dataframes. My hands are tied here. Sorry. Thank you for being so cool about it. I can only use bash command lines in this case.
My mawk can handle 32000+ columns so NF is not a problem here, unlike some other awk I've seen. I have less than 32000 columns (but not by that much).
Datafile pht.csv contains the following 3-line dataset:
cat pht.csv
8,T1,
T13,3,T1
T13,,-6.350818276405334473e-01
I don't have access to mawk, but you can do something equivalent to this:
awk -F, 'NR>1 {for(i=1;i<=NF;i++) if($i~/[[:alpha:]]/) a[i]++}
END {for(i=1;i in a;i++) print a[i]}' file
It shouldn't take more than a few minutes even for a million records.
For recognizing exponential notation the regex test is not going to work, and you need to revert to the $1+0!=$1 test, as mentioned in the comments. Note that you don't have to check the null string separately.
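For example, that variant might look like this sketch, swapping the regex for the numeric test while keeping the explicit empty-field guard that the OP uses in their own version further down (untested):
awk -F, 'NR>1 {for(i=1;i<=NF;i++) if(($i+0)!=$i && $i!="") a[i]++}
         END  {for(i=1;i<=NF;i++) print a[i]+0}' file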
None of the solutions so far parallelize. Let's change that.
Assume you have a solution that works in serial and can read from a pipe:
doit() {
# This solution gives 5-10 MB/s depending on system
# Changed so it now also treats '' as zero
perl -F, -ane 'for(0..$#F) {
# Perl has no beautiful way of matching scientific notation
$s[$_] += $F[$_] !~ /^-?\d*(?:\.\d*)?(?:[eE][+\-]?\d+)?$/m
}
END { $" = ","; print "#s\n" }';
}
export -f doit
doit() {
# Somewhat faster - but regards empty fields as zero
mawk -F"," '{
for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") }
}
END { for(i=1;i<NF;i++) printf cnt[i]","; print cnt[NF] }';
}
export -f doit
To parallelize this we need to split the big file into chunks and pass each chunk to the serial solution:
# This will spawn a process for each core
parallel --pipe-part -a pht.csv --block -1 doit > blocksums
(You need version 20161222 or later to use '--block -1').
To deal with the header we compute the result for the header line, but negate it:
head -n1 pht.csv | doit | perl -pe 's/(^|,)/$1-/g' > headersum
Now we can simply sum up the headersum and the blocksums:
cat headersum blocksums |
perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] }
END { $" = ",";print "#s\n" }'
Or if you prefer the output line by line:
cat headersum blocksums |
perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] }
END { $" = "\n";print "#s\n" }'
Is this what you're trying to do?
$ awk -v RS='[\n,]' '($1+0) != $1' file | sort | uniq -c
1 T1
2 T13
The above uses GNU awk for multi-char RS and should run in seconds for an input file like you describe. If you don't have GNU awk you could do:
$ tr ',' $'\n' < file | awk '($1+0) != $1' | sort | uniq -c
1 T1
2 T13
I'm avoiding the approach of using , as the FS since then you'd have to use $i in a loop, which would cause awk to do field splitting for every input line and adds time, but you could try it:
$ awk -F, '{for (i=1;i<=NF;i++) if (($i+0) != $i) print $i}' file | sort | uniq -c
1 T1
2 T13
You could do the unique counting all in awk with an array indexed by the non-numeric values, but then you potentially have to store a lot of data in memory (unlike with sort, which uses temp files as necessary), so YMMV with that approach.
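A minimal sketch of that all-awk variant (same GNU awk requirement for the multi-char RS, and the same memory caveat):
awk -v RS='[\n,]' '($1+0) != $1 { cnt[$0]++ } END { for (v in cnt) print cnt[v], v }' file
Unlike sort | uniq -c, the output order of the values is arbitrary.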
I solved it independently. What finally did it for me was the dynamic variable creation examples at the following URL. http://cfajohnson.com/shell/cus-faq-2.html#Q24
Here is the solution I developed. Note: I have added another column with some missing data for a more complete unit test. Mine is not necessarily the best solution; that is TBD. All I know at the moment is that it works correctly on the small CSV shown. The best solution will also need to run really fast on a 40 GB CSV file (not shown, haha).
$ cat pht.csv
8,T1,
T13,3,T1
T13,,0
$ tail -n +1 pht.csv | awk -F"," '{ for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") } } END { for(i=1;i<=NF;i++) print cnt[i] }'
2
1
1
P.S. Honestly, I am not satisfied with my own answer. They say that premature optimization is the root of all evil; well, that maxim does not apply here. I really, really want GNU parallel in there instead of the for loop, if possible, because I have a need for speed.
Final note: below I am sharing performance timings of the sequential and parallel versions, and the best available unit-test dataset. Special thanks to Ole Tange for his big help in developing the code that uses his nice GNU parallel command in this application.
Unit test datafile, final version:
$ cat pht2.csv
COLA99,COLB,COLC,COLD
8,T1,,T1
T13,3,T1,0.03
T13,,-6.350818276405334473e-01,-0.036
Timing on big data (not shown) for sequential version of column-wise non-numeric counts:
ga#ga-HP-Z820:/mnt/fastssd$ time tail -n +2 train_all.csv | awk -F"," '{ for(i=1; i<=NF; i++){ cnt[i]+=(($i+0)!=$i) && ($i!="") } } END { for(i=1;i<=NF;i++) print cnt[i] }' > /dev/null
real 35m37.121s
Timing on big data for parallel version of column-wise non-numeric counts:
# Correctness - 2 1 1 1 is the correct output.
#
# pht2.csv: 2 1 1 1 :GOOD
# train_all.csv:
# real 1m14.253s
doit1() {
perl -F, -ane 'for(0..$#F) {
# Perl has no beautiful way of matching scientific notation
$s[$_] += $F[$_] !~ /^-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?$/m
}
END { $" = ","; print "#s\n" }';
}
# pht2.csv: 2 1 1 1 :GOOD
# train_all.csv:
# real 1m59.960s
doit2() {
mawk -F"," '{
for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") }
}
END { for(i=1;i<NF;i++) printf cnt[i]","; print cnt[NF] }';
}
export -f doit1
parallel --pipe-part -a "$fn" --block -1 doit1 > blocksums
if [ $csvheader -eq 1 ]
then
head -n1 "$fn" | doit1 | perl -pe 's/(^|,)/$1-/g' > headersum
cat headersum blocksums | perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] } END { $" = "\n";print "@s\n" }' > "$outfile"
else
cat blocksums | perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] } END { $" = "\n";print "@s\n" }' > "$outfile"
fi
NEW: Here are the ROW-wise (not column-wise) counts in sequential code:
tail -n +2 train_all.csv | awk -F"," '{ cnt=0; for(i=1; i<=NF; i++){ cnt+=(($i+0)!=$i) && ($i!="") } print cnt; }' > train_all_cnt_nonnumerics_rowwwise.out.txt
Context: Project is machine learning. This is part of a data exploration. ~25x parallel speedup seen on Dual Xeon 32 virtual / 16 physical core shared memory host using Samsung 950 Pro SSD storage: (32x60) seconds sequential time, 74 sec parallel time. AWESOME!

Performance Issue with While and Read

I have a many-line file containing commas. For each line, I want to delete from the part before the comma every character that appears after the comma, and then drop the comma and everything after it. I have a bash script which does this, but it isn't fast enough.
Input:
hello world, def
Output:
hllo worl
My slow script:
#!/bin/bash
while read line; do
values="${line#*, }"
phrase="${line%, *}"
echo "${phrase//[$values]}"
done < "$1"
I want to improve the performance.
Any suggestions?
Using Perl
$ perl -F',' -lane '$F[0] =~ s/[$F[1]]//g; print $F[0]' file
hlloworl
If you don't want to count the space after the comma:
$ perl -F',\s*' -lane '$F[0] =~ s/[$F[1]]//g; print $F[0]' file
hllo worl
Perl excels at text manipulation like this, so I'd expect this to be pretty quick.
Getting rid of the while loop could give your code a boost; most programs take a file as input and will do the reading for you.
You can replace your program with the following and report the times:
cut -d"," -f1 < file
You can try with awk, changing the field separator to ,:
awk 'BEGIN {FS=","}; {print $1}' file
Also you could try with sed (with the modifications suggested by @Qualia):
sed -r -i "s/,.*//g" file
Beware though, that the -i flag will inplace edit your file, if that is not the desired effect you can just do:
sed -r "s/,.*//g" file
An AWK solution (edited taking inspiration from @glenn jackman's perl solution):
awk -F", " '{ gsub("["$2"]",""); print $1 }' "$1"
With this sort of line processing, it's often better to use a compiled solution. I would use Haskell for its expressiveness:
-- answer.hs
import Data.List(nub, delete)
import Data.Char(isSpace)
main = interact (unlines . (map perLine) . lines)
perLine = strSetDiff . break (==',')
strSetDiff (s, ',':' ':sub) = filter (`notElem` sub) s
strSetDiff (s, _) = s
Compile with the command ghc -O2 answer.hs.
This breaks each line into two lists s and sub on ,, removes the ", " from sub, and then filters s to remove characters that are elements of sub. If there is no comma, the result is the whole line.
This assumes a space always follows a ,. Otherwise remove the ' ': and replace notElem sub with notElem (dropWhile isSpace sub)
Time taken for an 80000 line file consisting of 10 lines repeated 8000 times:
$ time ./answer <infile >outfile
0.38s user 0.00s system 99% cpu 0.386 total
$ time [glenn jackman\'s perl]
0.68s user 0.00s system 99% cpu 0.691 total
$ time awk -F", " '{ gsub("["$2"]",""); print $1 }' infile > outfile
0.85s user 0.04s system 99% cpu 0.897 total
$ time ./ElBarajas.sh infile > outfile
2.77s user 0.32s system 99% cpu 3.105 total
Personally, I'm willing to admit defeat - the perl solution seems best to me.

How to get the biggest number in a file?

I want to get the maximum number in a file, where numbers are integers that can occur in any place of the file.
I thought about doing the following:
grep -o '[-0-9]*' myfile | sort -rn | head -1
This uses grep to get all the integers from the file, outputting one per line. Then, sort sorts them and head prints the very first one.
But then I thought that the reverse sort (-r) might cause some overhead, so I went for:
grep -o '[-0-9]*' myfile | sort -n | tail -1
To see which is fastest, I created a huge file with some random data, such as this:
$ cat a
hello 123 how are you i am fine 42342234 and blab bla bla
and 3624 is another number
but this is not enough for -23 234245
$ for i in {1..50000}; do cat a >> myfile ; done
So that the file contains 150K lines.
Now I compare the performance with my GNU bash version 4.2; the sys time is way smaller for sort -rn:
$ time grep -o '[-0-9]*' myfile | sort -n | tail -1
42342234
real 0m1.823s
user 0m1.865s
sys 0m0.045s
$ cp myfile myfile2 #to prevent using cached info
$ time grep -o '[-0-9]*' myfile2 | sort -rn | head -1
42342234
real 0m1.864s
user 0m1.926s
sys 0m0.027s
So I have two questions here:
Which is best, sort -n | tail -1 or sort -rn | head -1?
Is there a faster way to get the maximum integer in a given file?
Testing the solutions
So I ran all the commands and compared the time it takes each of them to find the value. To make things more reliable, I created a bigger file, 10 times bigger than the one I mentioned in the question:
$ cat a
hello 123 how are you i am fine 42342234 and blab bla bla
and 3624 is another number
but this is not enough for -23 234245
$ time awk -v s="$(cat a)" 'BEGIN{for (i=1;i<=500000;i++) print s}' > myfile
$ wc myfile
1500000 13000000 62000000 myfile
Benchmark, from which I see hek2mgl's solution is the fastest:
$ time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' myfile
42342234
real 0m3.979s
user 0m3.970s
sys 0m0.007s
$ time awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' myfile
42342234
real 0m2.203s
user 0m2.196s
sys 0m0.006s
$ time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile
42342234
real 0m0.926s
user 0m0.848s
sys 0m0.077s
$ time tr ' ' '\n' < myfile | sort -rn | head -1
42342234
real 0m11.089s
user 0m11.049s
sys 0m0.086s
$ time perl -MList::Util=max -lane '$m = max $m, map {0+$_} @F} END {print $m' myfile
real 0m6.166s
user 0m6.146s
sys 0m0.011s
I'm surprised by awk's speed here. perl is usually pretty speedy, but:
$ for ((i=0; i<1000000; i++)); do echo $RANDOM; done > rand
$ time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' rand
32767
real 0m0.890s
user 0m0.887s
sys 0m0.003s
$ time perl -MList::Util=max -lane '$m = max $m, map {0+$_} @F} END {print $m' rand
32767
real 0m1.110s
user 0m1.107s
sys 0m0.002s
I think I've found a winner: With perl, slurp the file as a single string, find the (possibly negative) integers, and take the max:
$ time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' rand
32767
real 0m0.565s
user 0m0.539s
sys 0m0.025s
Takes a little more "sys" time, but less real time.
Works with a file with only negative numbers too:
$ cat file
hello -42 world
$ perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file
-42
In awk you can say:
awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file
Explanation
In my experience awk is the fastest text processing language for most tasks and the only thing I have seen of comparable speed (on Linux systems) are programs written in C/C++.
In the code above using minimal functions and commands will allow for faster execution.
for(i=1;i<=NF;i++) - Loops through the fields on the line. Using the default FS/RS and looping this way is usually faster than using custom ones, as awk is optimised to use the defaults.
if(int($i))        - Checks that the field is not equal to zero; as strings are set to zero by int, it does not execute the next block if the field is a string. I believe this is the quickest way to perform this check.
{a[$i]=$i}         - Sets an array variable with the number as key and value. This means there will only be as many array variables as there are numbers in the file, which will hopefully be quicker than a comparison of every number.
END{x=asort(a)     - At the end of the file, use asort on the array and store the size of the array in x.
print a[x]         - Print the last element in the array.
Benchmark
Mine:
time awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file
took
real 0m0.434s
user 0m0.357s
sys 0m0.008s
hek2mgl's:
awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]' file
took
real 0m1.256s
user 0m1.134s
sys 0m0.019s
For those wondering why it is faster: it is due to using the default FS and RS, which awk is optimised for.
Changing
awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]'
to
awk '{for(i=1;i<=NF;i++)m=(m<$i && int($i))?$i:m}END{print m}'
provides the time
real 0m0.574s
user 0m0.497s
sys 0m0.011s
Which is still a little slower than my command.
I believe the slight difference that is still present is due to asort() only working on around 6 numbers as they are only saved once in the array.
In comparison, the other command is performing a comparison on every single number in the file which will be more computationally expensive.
I think they would be around the same speed if all the numbers in the file were unique.
Tom Fenech's:
time awk -v RS="[^-0-9]+" '$0>max{max=$0}END{print max}' myfile
real 0m0.716s
user 0m0.612s
sys 0m0.013s
A drawback of this approach, though, is that if all the numbers are below zero then max will be blank.
Glenn Jackman's:
time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' file
real 0m1.492s
user 0m1.258s
sys 0m0.022s
and
time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file
real 0m0.790s
user 0m0.686s
sys 0m0.034s
The good thing about perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' is that it is the only answer on here that will work if 0 appears in the file as the largest number and also works if all numbers are negative.
Notes
All times are representative of the average of 3 tests
I'm sure a C implementation optimized using assembler will be the fastest. Also, I could imagine a program which separates the file into multiple chunks, maps every chunk onto a single processor core, and afterwards just gets the maximum of the nproc remaining numbers.
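A rough sketch of that chunked idea, combining the GNU parallel pattern used elsewhere on this page with the awk scan shown below in this answer; maxchunk is a made-up helper name and this is untested:
# per-chunk worker: the whole chunk becomes one record (RS='$') and FPAT pulls out the integers
maxchunk() {
  awk -v RS='$' -v FPAT='-{0,1}[0-9]+' '
    { for (i=1; i<=NF; i++) m = (m<$i) ? $i : m }
    END { print m }'
}
export -f maxchunk
# one chunk per core, then take the max of the per-chunk maxima
parallel --pipe-part -a myfile --block -1 maxchunk | sort -rn | head -1
(Same caveat as the one-liner below: a chunk containing only negative numbers would print an empty maximum.)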
Just using the existing command line tools, have you tried awk?
time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile
Looks like it can do the job in ~50% of the time compared to the perl command in the accepted answer:
time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' myfile
cp myfile myfile2
time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile2
Gave me:
42342234
real 0m0.360s
user 0m0.340s
sys 0m0.020s
42342234
real 0m0.193s <-- Good job awk! You are the winner.
user 0m0.185s
sys 0m0.008s
I suspect this will be fastest:
$ tr ' ' '\n' < file | sort -rn | head -1
42342234
Third run:
$ time tr ' ' '\n' < file | sort -rn | head -1
42342234
real 0m0.078s
user 0m0.000s
sys 0m0.076s
btw DON'T WRITE SHELL LOOPS to manipulate text, even if it's creating sample input files:
$ time awk -v s="$(cat a)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfile
real 0m0.109s
user 0m0.031s
sys 0m0.061s
$ wc -l myfile
150000 myfile
compared to the shell loop suggested in the question:
$ time for i in {1..50000}; do cat a >> myfile2 ; done
real 26m38.771s
user 1m44.765s
sys 17m9.837s
$ wc -l myfile2
150000 myfile2
If we want something that more robustly handles input files that contain digits in strings that are not integers, we need something like this:
$ cat b
hello 123 how are you i am fine 42342234 and blab bla bla
and 3624 is another number
but this is not enough for -23 234245
73 starts a line
avoid these: 3.14 or 4-5 or $15 or 2:30 or 05/12/2015
$ grep -o -E '(^| )[-]?[0-9]+( |$)' b | sort -rn
42342234
3624
123
73
-23
$ time awk -v s="$(cat b)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfileB
real 0m0.109s
user 0m0.000s
sys 0m0.076s
$ wc -l myfileB
250000 myfileB
$ time grep -o -E '(^| )-?[0-9]+( |$)' myfileB | sort -rn | head -1 | tr -d ' '
42342234
real 0m2.480s
user 0m2.509s
sys 0m0.108s
Note that the input file has more lines than the original and with this input the above robust grep solution is actually faster than the original I posted at the start of this question:
$ time tr ' ' '\n' < myfileB | sort -rn | head -1
42342234
real 0m4.836s
user 0m4.445s
sys 0m0.277s

Extract specified lines from a file

I have a file and I want to extract specific lines from that file like lines 2, 10, 15,21, .... and so on. There are around 200 thousand lines to be extracted from the file. How can I do it efficiently in bash
Maybe you're looking for:
sed -n -e 1p -e 4p afile
Put the line numbers of the lines you want in a file called "wanted", like this:
2
10
15
21
Then run this script:
#!/bin/bash
while read w
do
sed -n ${w}p yourfile
done < wanted
TOTALLY ALTERNATIVE METHOD
Or you could let "awk" do it all for you, like this which is probably miles faster since you won't have to create 200,000 sed processes:
awk 'FNR==NR{a[$1]=1;next}{if(FNR in a){print;}}' wanted yourfile
The FNR==NR portion detects when awk is reading the file called "wanted" and, if so, sets element $1 of array a to 1 so we know that this line number is wanted. The stuff in the second set of curly braces is active only when processing your bigger file, and it prints the current line if its line number is in the array a we created when reading the "wanted" file.
$ gawk 'ARGIND==1 { L[$0]++ }; ARGIND==2 && FNR in L' lines file > file.lines
Wanted line numbers have to be stored one per line, and they may safely be in random order. It is almost exactly the same as @Mark Setchell's second method, but uses a slightly clearer way to determine which file is current. ARGIND is a GNU extension though, hence gawk. If you are limited to the original AWK or mawk, you can write it as:
$ awk 'FILENAME==ARGV[1] { L[$0]++ }; FILENAME==ARGV[2] && FNR in L' lines file > file.lines
Efficiency test:
$ awk 'BEGIN { for (i=1; i<=1000000; i++) print i }' > file
$ shuf -i 1-1000000 -n 200000 > lines
$ time gawk 'ARGIND==1 { L[$0]++ }; ARGIND==2 && FNR in L' lines file > file.lines
real 0m1.734s
user 0m1.460s
sys 0m0.052s
UPD:
As @Costi Ciudatu pointed out, there is room for improvement when all the wanted lines are near the head of the file.
#!/usr/bin/gawk -f
ARGIND==1 { L[$0]++ }
ENDFILE { L_COUNT = FNR }
ARGIND==2 && FNR in L { L_PRINTED++; print }
ARGIND==2 && L_PRINTED == L_COUNT { exit 0 }
The script exits when the last wanted line is printed, so it now takes a few milliseconds to filter out 2000 random lines from the first 1% of a one-million-line file.
$ time ./getlines.awk lines file > file.lines
real 0m0.016s
user 0m0.012s
sys 0m0.000s
Whereas reading the whole file still takes about a second:
$ time gawk 'ARGIND==1 { L[$0]++ }; ARGIND==2 && FNR in L' lines file > file.lines
real 0m0.780s
user 0m0.756s
sys 0m0.016s
Provided your system supports sed -f - (i.e. for sed to read its script on standard input; it works on Linux, but not on some other platforms) you can turn the file of line numbers into a sed script, naturally using sed:
sed 's/$/p/' lines | sed -n -f - inputfile >output
If the lines you're interested in are close to the beginning of the file, you can make use of head and tail to efficiently extract specific lines.
For your example line numbers (assuming that list doesn't go on until close to 200,000), a dummy but still efficient approach to read those lines would be the following:
for n in 2 10 15 21; do
head -n $n /your/large/file | tail -1
done
sed Example
sed -n '2p' file
awk Example
awk 'NR==2' file
This will print the 2nd line of the file.
Use the same logic in a loop and try it,
say, a for loop:
for VARIABLE in 2 10 15 21
do
awk "NR==$VARIABLE" file
done
Give your line numbers this way.
