How to get the biggest number in a file? - performance

I want to get the maximum number in a file, where numbers are integers that can occur in any place of the file.
I thought about doing the following:
grep -o '[-0-9]*' myfile | sort -rn | head -1
This uses grep to get all the integers from the file, outputting one per line. Then, sort sorts them and head prints the very first one.
But then I thought that sort -r might cause some overhead, so I went for:
grep -o '[-0-9]*' myfile | sort -n | tail -1
To see which is fastest, I created a huge file with some sample data like this:
$ cat a
hello 123 how are you i am fine 42342234 and blab bla bla
and 3624 is another number
but this is not enough for -23 234245
$ for i in {1..50000}; do cat a >> myfile ; done
So that the file contains 150K lines.
Now I compare the performance with my GNU bash version 4.2; the sys time is way smaller for sort -rn:
$ time grep -o '[-0-9]*' myfile | sort -n | tail -1
42342234
real 0m1.823s
user 0m1.865s
sys 0m0.045s
$ cp myfile myfile2 #to prevent using cached info
$ time grep -o '[-0-9]*' myfile2 | sort -rn | head -1
42342234
real 0m1.864s
user 0m1.926s
sys 0m0.027s
So I have two questions here:
What is best, sort -n | tail -1 or sort -rn | head -1?
Is there a faster way to get the maximum integer in a given file?
Testing the solutions
So I ran all the commands and compared how long it takes each of them to find the value. To make things more reliable, I created a bigger file, 10 times bigger than the one I mentioned in the question:
$ cat a
hello 123 how are you i am fine 42342234 and blab bla bla
and 3624 is another number
but this is not enough for -23 234245
$ time awk -v s="$(cat a)" 'BEGIN{for (i=1;i<=500000;i++) print s}' > myfile
$ wc myfile
1500000 13000000 62000000 myfile
Benchmark, from which I see hek2mgl's solution is the fastest:
$ time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' myfile
42342234
real 0m3.979s
user 0m3.970s
sys 0m0.007s
$ time awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' myfile
42342234
real 0m2.203s
user 0m2.196s
sys 0m0.006s
$ time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile
42342234
real 0m0.926s
user 0m0.848s
sys 0m0.077s
$ time tr ' ' '\n' < myfile | sort -rn | head -1
42342234
real 0m11.089s
user 0m11.049s
sys 0m0.086s
$ time perl -MList::Util=max -lane '$m = max $m, map {0+$_} @F} END {print $m' myfile
real 0m6.166s
user 0m6.146s
sys 0m0.011s

I'm surprised by awk's speed here. perl is usually pretty speedy, but:
$ for ((i=0; i<1000000; i++)); do echo $RANDOM; done > rand
$ time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' rand
32767
real 0m0.890s
user 0m0.887s
sys 0m0.003s
$ time perl -MList::Util=max -lane '$m = max $m, map {0+$_} @F} END {print $m' rand
32767
real 0m1.110s
user 0m1.107s
sys 0m0.002s
I think I've found a winner: With perl, slurp the file as a single string, find the (possibly negative) integers, and take the max:
$ time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' rand
32767
real 0m0.565s
user 0m0.539s
sys 0m0.025s
Takes a little more "sys" time, but less real time.
Works with a file with only negative numbers too:
$ cat file
hello -42 world
$ perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file
-42

In awk you can say:
awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file
Explanation
In my experience awk is the fastest text-processing language for most tasks, and the only things I have seen of comparable speed (on Linux systems) are programs written in C/C++.
In the code above, using minimal functions and commands allows for faster execution.
for(i=1;i<=NF;i++) - Loops through the fields on the line. Using the default FS/RS and looping this way is usually faster than using custom ones, as awk is optimised to use the defaults.
if(int($i)) - Checks that the field is not equal to zero; since strings are converted to zero by int, the next block is not executed if the field is a string. I believe this is the quickest way to perform this check.
{a[$i]=$i} - Sets an array element with the number as both key and value. This means there will only be as many array elements as there are distinct numbers in the file, which will hopefully be quicker than a comparison of every number.
END{x=asort(a)} - At the end of the file, asort sorts the array and the size of the array is stored in x.
print a[x] - Prints the last (largest) element in the array.
Benchmark
Mine:
time awk '{for(i=1;i<=NF;i++)if(int($i)){a[$i]=$i}}END{x=asort(a);print a[x]}' file
took
real 0m0.434s
user 0m0.357s
sys 0m0.008s
hek2mgl's:
awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]' file
took
real 0m1.256s
user 0m1.134s
sys 0m0.019s
For those wondering why it is faster: it is due to using the default FS and RS, which awk is optimised for.
Changing
awk '{m=(m<$0 && int($0))?$0:m}END{print m}' RS='[[:space:]*]'
to
awk '{for(i=1;i<=NF;i++)m=(m<$i && int($i))?$i:m}END{print m}'
provides the time
real 0m0.574s
user 0m0.497s
sys 0m0.011s
Which is still a little slower than my command.
I believe the slight difference that is still present is due to asort() only having to work on around 6 numbers, as each one is saved only once in the array.
In comparison, the other command is performing a comparison on every single number in the file which will be more computationally expensive.
I think they would be around the same speed if all the numbers in the file were unique.
Tom Fenech's:
time awk -v RS="[^-0-9]+" '$0>max{max=$0}END{print max}' myfile
real 0m0.716s
user 0m0.612s
sys 0m0.013s
A drawback of this approach, though, is that if all the numbers are below zero then max will be blank.
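One way to guard against that (a sketch of mine, not Tom's code; it assumes GNU awk, since a regex RS is a gawk extension) is to skip records that contain no digit and initialise max from the first real number only:
awk '$0 ~ /[0-9]/ { n = $0+0; if (!seen++ || n > max) max = n } END { if (seen) print max }' RS='[^-0-9]+' myfile
With hello -42 world as the input this prints -42 instead of a blank line.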
Glenn Jackman's:
time awk 'NR==1 || max < 0+$0 {max=0+$0} END {print max}' RS='[[:space:]]+' file
real 0m1.492s
user 0m1.258s
sys 0m0.022s
and
time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' file
real 0m0.790s
user 0m0.686s
sys 0m0.034s
The good thing about perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' is that it is the only answer on here that will work if 0 appears in the file as the largest number and also works if all numbers are negative.
Notes
All times are representative of the average of 3 tests

I'm sure a C implementation optimized using assembler would be the fastest. I could also imagine a program which splits the file into multiple chunks, maps every chunk onto a separate processor core, and afterwards just takes the maximum of the nproc remaining per-chunk maxima.
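That chunked approach could be sketched with shell tools alone (purely illustrative; it assumes GNU split for the l/ chunk syntax and gawk for FPAT, reuses the awk command shown below, and chunk_ is a throw-away file prefix):
split -n l/4 myfile chunk_          # split into 4 line-aligned chunks
for f in chunk_*; do                # one awk per chunk, run in parallel
    awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' "$f" > "$f.max" &
done
wait                                # each chunk_*.max now holds one per-chunk maximum
sort -rn chunk_*.max | head -1      # the maximum of the remaining numbers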
Just using the existing command line tools, have you tried awk?
time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile
Looks like it can do the job in ~50% of the time compared to the perl command in the accepted answer:
time perl -MList::Util=max -0777 -nE 'say max /-?\d+/g' myfile
cp myfile myfile2
time awk '{for(i=1;i<=NF;i++){m=(m<$i)?$i:m}}END{print m}' RS='$' FPAT='-{0,1}[0-9]+' myfile2
Gave me:
42342234
real 0m0.360s
user 0m0.340s
sys 0m0.020s
42342234
real 0m0.193s <-- Good job awk! You are the winner.
user 0m0.185s
sys 0m0.008s

I suspect this will be fastest:
$ tr ' ' '\n' < file | sort -rn | head -1
42342234
Third run:
$ time tr ' ' '\n' < file | sort -rn | head -1
42342234
real 0m0.078s
user 0m0.000s
sys 0m0.076s
btw DON'T WRITE SHELL LOOPS to manipulate text, even if it's creating sample input files:
$ time awk -v s="$(cat a)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfile
real 0m0.109s
user 0m0.031s
sys 0m0.061s
$ wc -l myfile
150000 myfile
compared to the shell loop suggested in the question:
$ time for i in {1..50000}; do cat a >> myfile2 ; done
real 26m38.771s
user 1m44.765s
sys 17m9.837s
$ wc -l myfile2
150000 myfile2
If we want something that more robustly handles input files that contain digits in strings that are not integers, we need something like this:
$ cat b
hello 123 how are you i am fine 42342234 and blab bla bla
and 3624 is another number
but this is not enough for -23 234245
73 starts a line
avoid these: 3.14 or 4-5 or $15 or 2:30 or 05/12/2015
$ grep -o -E '(^| )[-]?[0-9]+( |$)' b | sort -rn
42342234
3624
123
73
-23
$ time awk -v s="$(cat b)" 'BEGIN{for (i=1;i<=50000;i++) print s}' > myfileB
real 0m0.109s
user 0m0.000s
sys 0m0.076s
$ wc -l myfileB
250000 myfileB
$ time grep -o -E '(^| )-?[0-9]+( |$)' myfileB | sort -rn | head -1 | tr -d ' '
42342234
real 0m2.480s
user 0m2.509s
sys 0m0.108s
Note that the input file has more lines than the original, and with this input the robust grep solution above is actually faster than the original one I posted at the start of this answer:
$ time tr ' ' '\n' < myfileB | sort -rn | head -1
42342234
real 0m4.836s
user 0m4.445s
sys 0m0.277s

Related

Stream filter large number of lines that are specified by line number from stdin

I have a huge xz compressed text file huge.txt.xz with millions of lines that is too large to keep around uncompressed (60GB).
I would like to quickly filter/select a large number of lines (~1000s) from that huge text file into a file filtered.txt. The line numbers to select could for example be specified in a separate text file select.txt with a format as follows:
10
14
...
1499
15858
Overall, I envisage a shell command as follows where "TO BE DETERMINED" is the command I'm looking for:
xz -dcq huge.txt.xz | "TO BE DETERMINED" select.txt >filtered.txt
I've managed to find an awk program from a closely related question that almost does the job - the only problem being that it takes a file name instead of reading from stdin. Unfortunately, I don't really understand the awk script and don't know enough awk to alter it in such a way to work in this case.
This is what works right now with the disadvantage of having a 60GB file lie around rather than streaming:
xz -dcq huge.txt.xz >huge.txt
awk '!firstfile_proceed { nums[$1]; next }
(FNR in nums)' select.txt firstfile_proceed=1 >filtered.txt
Inspiration: https://unix.stackexchange.com/questions/612680/remove-lines-with-specific-line-number-specified-in-a-file
Keeping with OP's current idea:
xz -dcq huge.txt.xz | awk '!firstfile_proceed { nums[$1]; next } (FNR in nums)' select.txt firstfile_proceed=1 -
Where the - (at the end of the line) tells awk to read from stdin (in this case the output from xz that's being piped to the awk call).
Another way to do this (replaces all of the above code):
awk '
FNR==NR { nums[$1]; next } # process first file
FNR in nums # process subsequent file(s)
' select.txt <(xz -dcq huge.txt.xz)
Comments removed and cut down to a 'one-liner':
awk 'FNR==NR {nums[$1];next} FNR in nums' select.txt <(xz -dcq huge.txt.xz)
Adding some logic to implement Ed Morton's comment (exit processing once FNR > largest value from select.txt):
awk '
# process first file
FNR==NR { nums[$1]
maxFNR= ($1>maxFNR ? $1 : maxFNR)
next
}
# process subsequent file(s):
FNR > maxFNR { exit }
FNR in nums
' select.txt <(xz -dcq huge.txt.xz)
NOTES:
keeping in mind we're talking about scanning millions of lines of input ...
FNR > maxFNR will obviously add some cpu/processing time to the overall operation (though less time than FNR in nums)
if the operation routinely needs to pull rows from, say, the last 25% of the file then FNR > maxFNR is likely providing little benefit (and probably slowing down the operation)
if the operation routinely finds all desired rows in, say, the first 50% of the file then FNR> maxFNR is probably worth the cpu/processing time to keep from scanning the entire input stream (then again, the xz operation, on the entire file, is likely the biggest time consumer)
net result: the additional FNR > maxFNR test may speed up or slow down the overall process depending on how much of the input stream needs to be processed in a typical run; OP would need to run some tests to see if there's a (noticeable) difference in overall runtime
To clarify my previous comment, I'll show a simple reproducible sample:
linelist content:
10
15858
14
1499
To simulate a long input, I'll use seq -w 100000000.
Comparing sed solution with my suggestion, we have:
#!/bin/bash
time (
sed 's/$/p/' linelist > selector
seq -w 100000000 | sed -nf selector
)
time (
sort -n linelist | sed '$!{s/$/p/};$s/$/{p;q}/' > my_selector
seq -w 100000000 | sed -nf my_selector
)
output:
000000010
000000014
000001499
000015858
real 1m23.375s
user 1m38.004s
sys 0m1.337s
000000010
000000014
000001499
000015858
real 0m0.013s
user 0m0.014s
sys 0m0.002s
Comparing my solution with awk:
#!/bin/bash
time (
awk '
# process first file
FNR==NR { nums[$1]
maxFNR= ($1>maxFNR ? $1 : maxFNR)
next
}
# process subsequent file(s):
FNR > maxFNR { exit }
FNR in nums
' linelist <(seq -w 100000000)
)
time (
sort -n linelist | sed '$!{s/$/p/};$s/$/{p;q}/' > my_selector
sed -nf my_selector <(seq -w 100000000)
)
output:
000000010
000000014
000001499
000015858
real 0m0.023s
user 0m0.020s
sys 0m0.001s
000000010
000000014
000001499
000015858
real 0m0.017s
user 0m0.007s
sys 0m0.001s
My conclusion: sed using q is comparable with the awk solution. For readability and maintainability I prefer the awk solution.
Anyway, this test is simplistic and only useful for small comparisons. I don't know, for example, what the result would be if I tested this against the real compressed file, with heavy disk I/O.
EDIT by Ed Morton:
Any speed test that results in all output values that are less than a second is a bad test because:
In general no-one cares if X runs in 0.1 or 0.2 secs, they're both fast enough unless being called in a large loop, and
Things like cache-ing can impact the results, and
Often a script that runs faster for a small input set where execution speed doesn't matter will run slower for a large input set where execution speed DOES matter (e.g. if the script that's slower for the small input spends time setting up data structures that will allow it to run faster for the larger)
The problem with the above example is that it only tries to print 4 lines rather than the 1000s of lines the OP said they'd have to select, so it doesn't exercise the difference that makes the sed solution much slower than the awk one: the sed solution has to test every target line number against every line of input, while the awk solution just does a single hash lookup of the current line number. It's an O(N) vs O(1) algorithm on each line of the input file.
Here's a better example showing printing every 100th line from a 1000000 line file (i.e. will select 1000 lines) rather than just 4 lines from any size file:
$ cat tst_awk.sh
#!/usr/bin/env bash
n=1000000
m=100
awk -v n="$n" -v m="$m" 'BEGIN{for (i=1; i<=n; i+=m) print i}' > linelist
seq "$n" |
awk '
FNR==NR {
nums[$1]
maxFNR = $1
next
}
FNR in nums {
print
if ( FNR == maxFNR ) {
exit
}
}
' linelist -
$ cat tst_sed.sh
#!/usr/bin/env bash
n=1000000
m=100
awk -v n="$n" -v m="$m" 'BEGIN{for (i=1; i<=n; i+=m) print i}' > linelist
sed '$!{s/$/p/};$s/$/{p;q}/' linelist > my_selector
seq "$n" |
sed -nf my_selector
$ time ./tst_awk.sh > ou.awk
real 0m0.376s
user 0m0.311s
sys 0m0.061s
$ time ./tst_sed.sh > ou.sed
real 0m33.757s
user 0m33.576s
sys 0m0.045s
As you can see the awk solution ran 2 orders of magnitude faster than the sed one, and they produced the same output:
$ diff ou.awk ou.sed
$
If I make the input file bigger and select 10,000 lines from it by setting:
n=10000000
m=1000
in each script, which is probably getting more realistic for the OP's usage, the difference becomes really impressive:
$ time ./tst_awk.sh > ou.awk
real 0m2.474s
user 0m2.843s
sys 0m0.122s
$ time ./tst_sed.sh > ou.sed
real 5m31.539s
user 5m31.669s
sys 0m0.183s
i.e. awk runs in 2.5 seconds while sed takes 5.5 minutes!
If you have a file of line numbers, add p to the end of each and run it as a sed script.
If linelist contains
10
14
1499
15858
then sed 's/$/p/' linelist > selector creates
10p
14p
1499p
15858p
then
$: for n in {1..1500}; do echo $n; done | sed -nf selector
10
14
1499
I didn't send enough lines through to match 15858 so that one didn't print.
This works the same with a decompression from a file.
$: tar xOzf x.tgz | sed -nf selector
10
14
1499

How to search for a matching string in a file bottom-to-top without using tac? [closed]

I need to grep through a file, starting at the bottom of the file, until I get to the first date that appears ("2021-04-04"), and then return that date. I don't want to start from the top and work my way down, as there are thousands of lines in each file.
Example file contents:
random text on first line
random text on second line
2021-01-01
random text on fourth line
2021-02-03
random text on sixth line
2021-03-03
2021-04-04
Random text on ninth line
tac isn't available on MacOS so I can't use it.
"thousands of lines" are nothing, they'll be processed in the blink of an eye. Once you get into 10s of millions of lines THEN you could start thinking about a performance improvement if it became necessary.
All you need is:
awk '/[0-9]{4}(-[0-9]{2}){2}/{line=$0} END{if (line!="") print line}' file
Here's the 3rd-run timing comparison for finding the last line containing 2 or more consecutive 5s in a 100000 line file generated by seq 100000 > file100k, i.e. where the target string is just 45 lines from the end of the input file, with and without tac:
$ time awk '/5{2}/{line=$0} END{if (line!="") print line}' file100k
99955
real 0m0.056s
user 0m0.031s
sys 0m0.000s
$ time tac file100k | awk '/5{2}/{print; exit}'
99955
real 0m0.056s
user 0m0.015s
sys 0m0.030s
As you can see, both ran in a fraction of a second and using tac did nothing to improve the speed of execution. Switching to tac+grep doesn't make it any faster either, it still just takes 1/20th of a second:
$ time tac file100k | grep -m1 '5\{2\}'
99955
real 0m0.057s
user 0m0.015s
sys 0m0.015s
In case you ever do need it in future, though, here's how to implement an efficient tac if you don't have it:
$ mytac() { cat -n "${@:--}" | sort -k1,1rn | cut -d$'\t' -f2-; }
$ seq 5 | mytac
5
4
3
2
1
The above mytac() function just adds line numbers to the input, sorts those in reverse and then removes them again. If your cat doesn't have -n to add line numbers then you can use nl if you have it or awk -v OFS='\t' '{print NR, $0}' will always work.
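For example, a version of mytac() built on that awk command might look like this (same idea, just without cat -n; a sketch rather than a battle-tested function):
mytac() { awk -v OFS='\t' '{print NR, $0}' "${@:--}" | sort -k1,1rn | cut -d$'\t' -f2-; }
It is used exactly like the cat -n version above.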
Use tac:
#!/bin/bash
function process_file_backwards(){
tac "$1" | while IFS= read -r line; do
# Grep for xxxx-xx-xx date matching
first_date=$(echo "$line" | grep '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}' | awk -F '"' '{ print $2 }')
# Check if the variable is not empty; if so, print it and break the loop
[ -n "$first_date" ] && echo "$first_date" && break
done
}
process_file_backwards "$1"
Note: Make sure you add an empty line at the end of the file so tac will not concatenate the last two lines.
Note: Remove the awk part if the dates in the file are not wrapped in " quotes.
On MacOS
You can use tail -r which will do the same thing as tac but you may have to supply the number of lines you want tail to output from your file. Something like this should work:
tail -r -n $(wc -l myfile.txt | cut -d ' ' -f 1) myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
-r tells tail to output its last line first
-n takes a numeric argument telling how many lines tail should output
wc -l outputs the line count and filename of a given file
cut -d ' ' splits the above on the space character and -f 1 takes the first "field" which will be our line count
$ cat myfile.txt
foo
this is a date 2021-04-03
bar
this is another date 2021-04-04 for example
$ tail -r -n $(wc -l myfile.txt | cut -d ' ' -f 1) myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
2021-04-04
grep options:
The -m 1 option will quit after the first result.
The -o option will return only the string matching the pattern (i.e. your date).
The -P option uses the perl regex engine, which is really down to preference, but I personally prefer its regex syntax (it seems to use fewer backslashes).
On Linux
You can use tac (cat in reverse) and pipe that into your grep. e.g.:
$ tac myfile.txt
this is another date 2021-04-04 for example
bar
this is a date 2021-04-03
foo
$ tac myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
2021-04-04
You can use perl to reverse the lines and grep for the 1st match too.
perl -e 'print reverse<>' inputFile | grep -m1 '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}'

How to use grep -c to count occurrences of various strings in a file?

I have a bunch of files with data from a company and I need to count, let's say, how many people from certain cities there are. Initially I was doing it manually with
grep -c 'Chicago' file.csv
But now I have to look for a lot of cities and it would be time-consuming to do this manually every time. So I did some research and found this:
#!/bin/sh
for p in 'Chicago' 'Washington' 'New York'; do
grep -c '$p' 'file.csv'
done
But it doesn't work. It keeps giving me 0s as output and I'm not sure what is wrong. Anyway, basically what I need is an output with every result (just the values) given by grep in a column, so I can copy it directly to a spreadsheet, e.g.:
132
407
523
Thanks in advance.
You should use sort + uniq for that:
$ awk '{print $<N>}' file.csv | sort | uniq -c
where N is the column number of cities (I assume it structured, as it's CSV file).
For example, to see which shell is used how often on my system:
$ awk -F: '{print $7}' /etc/passwd | sort | uniq -c
1 /bin/bash
1 /bin/sync
1 /bin/zsh
1 /sbin/halt
41 /sbin/nologin
1 /sbin/shutdown
$
From the title, it sounds like you want to count the number of occurrences of the string rather than the number of lines on which the string appears, but since you accept the grep -c answer I'll assume you actually only care about the latter. Do not use grep and read the file multiple times. Count everything in one pass:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' input-file
Note that this will print a blank line instead of "0" for any string that does not appear, so you might want to initialize the counters. There are several ways to do that. I like:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' c=0 w=0 n=0 input-file
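If you really did want the number of occurrences rather than the number of matching lines, a sketch using gsub() (whose return value is the number of substitutions it performed) could be:
awk '{ c += gsub(/Chicago/,"&"); w += gsub(/Washington/,"&"); n += gsub(/New York/,"&") }
     END { print c+0; print w+0; print n+0 }' input-file
Replacing each match with & (the match itself) leaves the line unchanged, and the +0 prints 0 instead of a blank for strings that never appear.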

Sort a text file by line length including spaces

I have a CSV file that looks like this
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
I need to sort it by line length, including spaces. The following command doesn't include spaces; is there a way to modify it so it will work for me?
cat "$@" | awk '{ print length, $0 }' | sort -n | awk '{$1=""; print $0}'
Answer
cat testfile | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-
Or, to do your original (perhaps unintentional) sub-sorting of any equal-length lines:
cat testfile | awk '{ print length, $0 }' | sort -n | cut -d" " -f2-
In both cases, we have solved your stated problem by moving away from awk for your final cut.
Lines of matching length - what to do in the case of a tie:
The question did not specify whether or not further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s (--stable) to prevent such lines being sorted against each other, and keep them in the relative order in which they occur in the input.
(Those who want more control of sorting these ties might look at sort's --key option.)
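For instance (my own illustration, not part of the original answer), breaking ties alphabetically by line content instead of keeping input order could look like:
cat testfile | awk '{ print length, $0 }' | sort -k1,1n -k2 | cut -d" " -f2-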
Why the question's attempted solution fails (awk line-rebuilding):
It is interesting to note the difference between:
echo "hello awk world" | awk '{print}'
echo "hello awk world" | awk '{$1="hello"; print}'
They yield respectively
hello awk world
hello awk world
The relevant section of (gawk's) manual only mentions as an aside that awk is going to rebuild the whole of $0 (based on the separator, etc) when you change one field. I guess it's not crazy behaviour. It has this:
"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"
$1 = $1 # force record to be reconstituted
print $0 # or whatever else with $0
"This forces awk to rebuild the record."
Test input including some lines of equal length:
aa A line with MORE spaces
bb The very longest line in the file
ccb
9 dd equal len. Orig pos = 1
500 dd equal len. Orig pos = 2
ccz
cca
ee A line with some spaces
1 dd equal len. Orig pos = 3
ff
5 dd equal len. Orig pos = 4
g
The AWK solution from neillb is great if you really want to use awk, and it explains why it's a hassle there, but if you just want to get the job done quickly and don't care what you do it in, one solution is to use Perl's sort() function with a custom comparison routine to iterate over the input lines. Here is a one-liner:
perl -e 'print sort { length($a) <=> length($b) } <>'
You can put this in your pipeline wherever you need it, either receiving STDIN (from cat or a shell redirect) or just give the filename to perl as another argument and let it open the file.
In my case I needed the longest lines first, so I swapped out $a and $b in the comparison.
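For reference, that longest-first variation is simply:
perl -e 'print sort { length($b) <=> length($a) } <>'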
Benchmark results
Below are the results of a benchmark across solutions from other answers to this question.
Test method
10 sequential runs on a fast machine, averaged
Perl 5.24
awk 3.1.5 (gawk 4.1.0 times were ~2% faster)
The input file is a 550MB, 6 million line monstrosity (British National Corpus txt)
Results
Caleb's perl solution took 11.2 seconds
my perl solution took 11.6 seconds
neillb's awk solution #1 took 20 seconds
neillb's awk solution #2 took 23 seconds
anubhava's awk solution took 24 seconds
Jonathan's awk solution took 25 seconds
Fritz's bash solution takes 400x longer than the awk solutions (using a truncated test case of 100000 lines). It works fine, just takes forever.
Another perl solution
perl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file
Try this command instead:
awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-
Pure Bash:
declare -a sorted
while read line; do
if [ -z "${sorted[${#line}]}" ] ; then # does line length already exist?
sorted[${#line}]="$line" # element for new length
else
sorted[${#line}]="${sorted[${#line}]}\n$line" # append to lines with equal length
fi
done < data.csv
for key in ${!sorted[*]}; do # iterate over existing indices
echo -e "${sorted[$key]}" # echo lines with equal length
done
Python Solution
Here's a Python one-liner that does the same, tested with Python 3.9.10 and 2.7.18. It's about 60% faster than Caleb's perl solution, and the output is identical (tested with a 300MiB wordlist file with 14.8 million lines).
python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'
Benchmark:
python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'
real 0m5.308s
user 0m3.733s
sys 0m1.490s
perl -e 'print sort { length($a) <=> length($b) } <>'
real 0m8.840s
user 0m7.117s
sys 0m2.279s
The length() function does include spaces. I would make just minor adjustments to your pipeline (including avoiding UUOC).
awk '{ printf "%d:%s\n", length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]*://'
The sed command directly removes the digits and colon added by the awk command. Alternatively, keeping your formatting from awk:
awk '{ print length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]* //'
I found these solutions will not work if your file contains lines that start with a number, since they will be sorted numerically along with all the counted lines. The solution is to give sort the -g (general-numeric-sort) flag instead of -n (numeric-sort):
awk '{ print length, $0 }' lines.txt | sort -g | cut -d" " -f2-
With POSIX Awk:
{
c = length
m[c] = m[c] ? m[c] RS $0 : $0
} END {
for (c in m) print m[c]
}
Example
1) Pure awk solution. Let's suppose that the line length cannot be more than 1024; then:
cat filename | awk 'BEGIN {min = 1024; s = "";} {l = length($0); if (l < min) {min = l; s = $0;}} END {print s}'
2) One-liner bash solution, assuming all lines have just 1 word, but it can be reworked for any case where all lines have the same number of words:
LINES=$(cat filename); for k in $LINES; do printf "$k "; echo $k | wc -L; done | sort -k2 | head -n 1 | cut -d " " -f1
using Raku (formerly known as Perl6)
~$ cat "BinaryAve.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};'
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
To reverse the sort, add .reverse in the middle of the chain of method calls--immediately after .sort(). Here's code showing that .chars includes spaces:
~$ cat "number_triangle.txt" | raku -e 'given lines() {.map(*.chars).say};'
(1 3 5 7 9 11 13 15 17 19 0)
~$ cat "number_triangle.txt"
1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
1 2 3 4 5 6
1 2 3 4 5 6 7
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9 0
Here's a time comparison between awk and Raku using a 9.1MB txt file from Genbank:
~$ time cat "rat_whole_genome.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};' > /dev/null
real 0m1.308s
user 0m1.213s
sys 0m0.173s
~$ #awk code from neillb
~$ time cat "rat_whole_genome.txt" | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2- > /dev/null
real 0m1.189s
user 0m1.170s
sys 0m0.050s
HTH.
https://raku.org
Here is a multibyte-compatible method of sorting lines by length. It requires:
wc -m is available to you (macOS has it).
Your current locale supports multi-byte characters, e.g., by setting LC_ALL=UTF-8. You can set this either in your .bash_profile, or simply by prepending it before the following command.
testfile has a character encoding matching your locale (e.g., UTF-8).
Here's the full command:
cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-
Explaining part-by-part:
l=$0; gsub(/\047/, "\047\"\047\"\047", l); ← makes of a copy of each line in awk variable l and double-escapes every ' so the line can safely be echoed as a shell command (\047 is a single-quote in octal notation).
cmd=sprintf("echo \047%s\047 | wc -m", l); ← this is the command we'll execute, which echoes the escaped line to wc -m.
cmd | getline c; ← executes the command and copies the character count value that is returned into awk variable c.
close(cmd); ← close the pipe to the shell command to avoid hitting a system limit on the number of open files in one process.
sub(/ */, "", c); ← trims white space from the character count value returned by wc.
{ print c, $0 } ← prints the line's character count value, a space, and the original line.
| sort -ns ← sorts the lines (by prepended character count values) numerically (-n), and maintaining stable sort order (-s).
| cut -d" " -f2- ← removes the prepended character count values.
It's slow (only 160 lines per second on a fast Macbook Pro) because it must execute a sub-command for each line.
Alternatively, just do this solely with gawk (as of version 3.1.5, gawk is multibyte aware), which would be significantly faster. Doing all the escaping and double-quoting to safely pass the lines through a shell command from awk is a lot of trouble, but the method above is the only one I could find that doesn't require installing additional software (gawk is not available by default on macOS).
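A sketch of what that gawk-only alternative could look like (my interpretation, not the author's tested command; it relies on the locale being a UTF-8 one so that length() counts characters rather than bytes):
cat testfile | gawk '{ print length($0), $0 }' | sort -n -s | cut -d" " -f2-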
Revisiting this one. This is how I approached it (count the length of LINE and store it as LEN, prepend LEN, sort by it, then keep only the LINE):
cat test.csv | while read -r LINE; do LEN=$(echo "${LINE}" | wc -c); echo "${LEN} ${LINE}"; done | sort -n | cut -d ' ' -f 2-

Bash command to sum a column of numbers [duplicate]

This question already has answers here:
Shell command to sum integers, one per line?
(45 answers)
Closed 7 years ago.
I want a bash command that I can pipe into that will sum a column of numbers. I just want a quick one-liner that will do something essentially like this:
cat FileWithColumnOfNumbers.txt | sum
Using existing file:
paste -sd+ infile | bc
Using stdin:
<cmd> | paste -sd+ | bc
Edit:
With some paste implementations you need to be more explicit when reading from stdin:
<cmd> | paste -sd+ - | bc
Options used:
-s (serial) - merges all the lines into a single line
-d - use a non-default delimiter (the character + in this case)
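To see what paste actually hands to bc, here is a tiny example:
$ printf '1\n2\n3\n' | paste -sd+ -
1+2+3
$ printf '1\n2\n3\n' | paste -sd+ - | bc
6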
I like the chosen answer. However, it tends to be slower than awk since 2 tools are needed to do the job.
$ wc -l file
49999998 file
$ time paste -sd+ file | bc
1448700364
real 1m36.960s
user 1m24.515s
sys 0m1.772s
$ time awk '{s+=$1}END{print s}' file
1448700364
real 0m45.476s
user 0m40.756s
sys 0m0.287s
The following command will add up all the lines (the first field of each line):
awk '{s+=$1} END {print s}' filename
Does two lines count?
awk '{ sum += $1; }
END { print sum; }' "$@"
You can then use it without the superfluous 'cat':
sum < FileWithColumnOfNumbers.txt
sum FileWithColumnOfNumbers.txt
FWIW: on MacOS X, you can do it with a one-liner:
awk '{ sum += $1; } END { print sum; }' "$@"
[a followup to ghostdog74's comments]
bash-2.03$ uname -sr
SunOS 5.8
bash-2.03$ perl -le 'print for 1..49999998' > infile
bash-2.03$ wc -l infile
49999998 infile
bash-2.03$ time paste -sd+ infile | bc
bundling space exceeded on line 1, teletype
Broken Pipe
real 0m0.062s
user 0m0.010s
sys 0m0.010s
bash-2.03$ time nawk '{s+=$1}END{print s}' infile
1249999925000001
real 2m0.042s
user 1m59.220s
sys 0m0.590s
bash-2.03$ time /usr/xpg4/bin/awk '{s+=$1}END{print s}' infile
1249999925000001
real 2m27.260s
user 2m26.230s
sys 0m0.660s
bash-2.03$ time perl -nle'
$s += $_; END { print $s }
' infile
1.249999925e+15
real 1m34.663s
user 1m33.710s
sys 0m0.650s
You can use bc (calculator). Assuming your file with #s is called "n":
$ cat n
1
2
3
$ (cat n | tr "\012" "+" ; echo "0") | bc
6
The tr changes all newlines to "+"; then we append 0 after the last plus, then we pipe the expression (1+2+3+0) to the calculator
Or, if you are OK with using awk or perl, here's a Perl one-liner:
$ perl -nle '$sum += $_ } END { print $sum' n
6
while read -r num; do ((sum += num)); done < inputfile; echo $sum
Use a for loop to iterate over your file …
sum=0; for x in `cat <your-file>`; do let sum+=x; done; echo $sum
If you have ruby installed
cat FileWithColumnOfNumbers.txt | xargs ruby -e "puts ARGV.map(&:to_i).inject(&:+)"
[root@pentest3r ~]# (find / -xdev -size +1024M) | (while read a ; do aa=$(du -sh $a | cut -d "." -f1 ); o=$(( $o+$aa )); done; echo "$o";)
