I have a huge xz compressed text file huge.txt.xz with millions of lines that is too large to keep around uncompressed (60GB).
I would like to quickly filter/select a large number of lines (~1000s) from that huge text file into a file filtered.txt. The line numbers to select could for example be specified in a separate text file select.txt with a format as follows:
10
14
...
1499
15858
Overall, I envisage a shell command as follows where "TO BE DETERMINED" is the command I'm looking for:
xz -dcq huge.txt.xz | "TO BE DETERMINED" select.txt >filtered.txt
I've managed to find an awk program from a closely related question that almost does the job - the only problem being that it takes a file name instead of reading from stdin. Unfortunately, I don't really understand the awk script and don't know enough awk to alter it in such a way to work in this case.
This is what works right now with the disadvantage of having a 60GB file lie around rather than streaming:
xz -dcq huge.txt.xz >huge.txt
awk '!firstfile_proceed { nums[$1]; next }
(FNR in nums)' select.txt firstfile_proceed=1 huge.txt >filtered.txt
Inspiration: https://unix.stackexchange.com/questions/612680/remove-lines-with-specific-line-number-specified-in-a-file
Keeping with OP's current idea:
xz -dcq huge.txt.xz | awk '!firstfile_proceed { nums[$1]; next } (FNR in nums)' select.txt firstfile_proceed=1 -
Where the - (at the end of the line) tells awk to read from stdin (in this case the output from xz that's being piped to the awk call).
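A minimal sketch of the same idea on a small stream, in case it helps to see it run. The function name select_lines and the throwaway selection file are only for illustration; the awk program is the one from the question.

```shell
# select_lines SELECT_FILE
# Reads the wanted line numbers from SELECT_FILE, then filters stdin ("-").
select_lines() {
    awk '!firstfile_proceed { nums[$1]; next } (FNR in nums)' \
        "$1" firstfile_proceed=1 -
}

# Demo with a throwaway selection file (hypothetical data).
sel=$(mktemp)
printf '%s\n' 2 5 > "$sel"
seq 11 20 | select_lines "$sel"    # prints 12 and 15
rm -f "$sel"
```

The firstfile_proceed=1 assignment is only processed once awk has finished select.txt, which is why the first block handles the selection file and the second handles stdin.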
Another way to do this (replaces all of the above code):
awk '
FNR==NR { nums[$1]; next } # process first file
FNR in nums # process subsequent file(s)
' select.txt <(xz -dcq huge.txt.xz)
Comments removed and cut down to a 'one-liner':
awk 'FNR==NR {nums[$1];next} FNR in nums' select.txt <(xz -dcq huge.txt.xz)
Adding some logic to implement Ed Morton's comment (exit processing once FNR > largest value from select.txt):
awk '
# process first file
FNR==NR { nums[$1]
maxFNR= ($1>maxFNR ? $1 : maxFNR)
next
}
# process subsequent file(s):
FNR > maxFNR { exit }
FNR in nums
' select.txt <(xz -dcq huge.txt.xz)
NOTES:
keeping in mind we're talking about scanning millions of lines of input ...
FNR > maxFNR will obviously add some cpu/processing time to the overall operation (though less time than FNR in nums)
if the operation routinely needs to pull rows from, say, the last 25% of the file then FNR > maxFNR is likely providing little benefit (and probably slowing down the operation)
if the operation routinely finds all desired rows in, say, the first 50% of the file then FNR> maxFNR is probably worth the cpu/processing time to keep from scanning the entire input stream (then again, the xz operation, on the entire file, is likely the biggest time consumer)
net result: the additional FNR > maxFNR test may speed up or slow down the overall process depending on how much of the input stream needs to be processed in a typical run; OP would need to run some tests to see if there's a (noticeable) difference in overall runtime
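One way to see how much of the stream actually gets read is to have awk report where it stopped. This is only a sketch: the function name select_lines_early and the stderr message are made up for illustration, and it exits right after printing the largest requested line rather than one line later.

```shell
# select_lines_early SELECT_FILE
# Like the script above, but reports on stderr where it stopped reading.
select_lines_early() {
    awk '
        FNR == NR {                        # first file: the selection list
            nums[$1]
            if ($1 + 0 > max) max = $1 + 0
            next
        }
        FNR in nums                        # stdin: print selected lines
        FNR >= max {                       # nothing left to select
            print "stopped after input line " FNR > "/dev/stderr"
            exit
        }
    ' "$1" -
}

sel=$(mktemp)
printf '%s\n' 15 3 > "$sel"                # unsorted on purpose
seq 1000000 | select_lines_early "$sel"    # prints 3 and 15, stops at line 15
rm -f "$sel"
```

If the reported line number is routinely close to the end of the input, the early exit is buying very little.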
To clarify my previous comment. I'll show a simple reproducible sample:
linelist content:
10
15858
14
1499
To simulate a long input, I'll use seq -w 100000000.
Comparing sed solution with my suggestion, we have:
#!/bin/bash
time (
sed 's/$/p/' linelist > selector
seq -w 100000000 | sed -nf selector
)
time (
sort -n linelist | sed '$!{s/$/p/};$s/$/{p;q}/' > my_selector
seq -w 100000000 | sed -nf my_selector
)
output:
000000010
000000014
000001499
000015858
real 1m23.375s
user 1m38.004s
sys 0m1.337s
000000010
000000014
000001499
000015858
real 0m0.013s
user 0m0.014s
sys 0m0.002s
Comparing my solution with awk:
#!/bin/bash
time (
awk '
# process first file
FNR==NR { nums[$1]
maxFNR= ($1>maxFNR ? $1 : maxFNR)
next
}
# process subsequent file(s):
FNR > maxFNR { exit }
FNR in nums
' linelist <(seq -w 100000000)
)
time (
sort -n linelist | sed '$!{s/$/p/};$s/$/{p;q}/' > my_selector
sed -nf my_selector <(seq -w 100000000)
)
output:
000000010
000000014
000001499
000015858
real 0m0.023s
user 0m0.020s
sys 0m0.001s
000000010
000000014
000001499
000015858
real 0m0.017s
user 0m0.007s
sys 0m0.001s
My conclusion: sed using q is comparable with the awk solution. For readability and maintainability I prefer the awk solution.
Anyway, this test is simplistic and only useful for small comparisons. I don't know, for example, what the result would be if I test this against the real compressed file, with heavy disc I/O.
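For reference, here is a sketch of what the generated selector script looks like. The helper name build_selector is made up, and it uses two -e expressions and a trailing ; inside the braces, which I believe is the more portable spelling (GNU sed accepts both forms).

```shell
# build_selector LINELIST
# Turns a list of line numbers into a sed script: "Np" for every number,
# and "N{p;q;}" for the largest one so sed quits as soon as it is printed.
build_selector() {
    sort -n "$1" | sed -e '$!s/$/p/' -e '$s/$/{p;q;}/'
}

list=$(mktemp)
script=$(mktemp)
printf '%s\n' 10 15858 14 1499 > "$list"
build_selector "$list" > "$script"
cat "$script"                       # 10p, 14p, 1499p, 15858{p;q;}
seq 100000 | sed -n -f "$script"    # prints 10, 14, 1499, 15858 and quits
rm -f "$list" "$script"
```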
EDIT by Ed Morton:
Any speed test that results in all output values that are less than a second is a bad test because:
In general no-one cares if X runs in 0.1 or 0.2 secs, they're both fast enough unless being called in a large loop, and
Things like cache-ing can impact the results, and
Often a script that runs faster for a small input set where execution speed doesn't matter will run slower for a large input set where execution speed DOES matter (e.g. if the script that's slower for the small input spends time setting up data structures that will allow it to run faster for the larger)
The problem with the above example is that it only prints 4 lines rather than the 1000s of lines the OP said they'd have to select, so it doesn't exercise the difference that makes the sed solution much slower than the awk one: the sed solution has to test every target line number for every line of input, while the awk solution does a single hash lookup of the current line number. Per input line, that's an order(N) vs. order(1) algorithm, where N is the number of line numbers to select.
Here's a better example showing printing every 100th line from a 1000000 line file (i.e. will select 1000 lines) rather than just 4 lines from any size file:
$ cat tst_awk.sh
#!/usr/bin/env bash
n=1000000
m=100
awk -v n="$n" -v m="$m" 'BEGIN{for (i=1; i<=n; i+=m) print i}' > linelist
seq "$n" |
awk '
FNR==NR {
nums[$1]
maxFNR = $1
next
}
FNR in nums {
print
if ( FNR == maxFNR ) {
exit
}
}
' linelist -
$ cat tst_sed.sh
#!/usr/bin/env bash
n=1000000
m=100
awk -v n="$n" -v m="$m" 'BEGIN{for (i=1; i<=n; i+=m) print i}' > linelist
sed '$!{s/$/p/};$s/$/{p;q}/' linelist > my_selector
seq "$n" |
sed -nf my_selector
$ time ./tst_awk.sh > ou.awk
real 0m0.376s
user 0m0.311s
sys 0m0.061s
$ time ./tst_sed.sh > ou.sed
real 0m33.757s
user 0m33.576s
sys 0m0.045s
As you can see the awk solution ran 2 orders of magnitude faster than the sed one, and they produced the same output:
$ diff ou.awk ou.sed
$
If I make the input file bigger and select 10,000 lines from it by setting:
n=10000000
m=1000
in each script, which is probably getting more realistic for the OP's usage, the difference becomes really impressive:
$ time ./tst_awk.sh > ou.awk
real 0m2.474s
user 0m2.843s
sys 0m0.122s
$ time ./tst_sed.sh > ou.sed
real 5m31.539s
user 5m31.669s
sys 0m0.183s
i.e. awk runs in 2.5 seconds while sed takes 5.5 minutes!
If you have a file of line numbers, add p to the end of each and run it as a sed script.
If linelist contains
10
14
1499
15858
then sed 's/$/p/' linelist > selector creates
10p
14p
1499p
15858p
then
$: for n in {1..1500}; do echo $n; done | sed -nf selector
10
14
1499
I didn't send enough lines through to match 15858 so that one didn't print.
This works the same with a decompression from a file.
$: tar xOzf x.tgz | sed -nf selector
10
14
1499
I need to grep through a file, starting at the bottom of the file until I get to the first date that appears "2021-04-04", and then return that date. I don't want to start from the top and work my way down to the first line as there's thousands of lines in each file.
Example file contents:
random text on first line
random text on second line
2021-01-01
random text on fourth line
2021-02-03
random text on sixth line
2021-03-03
2021-04-04
Random text on ninth line
tac isn't available on MacOS so I can't use it.
"thousands of lines" are nothing, they'll be processed in the blink of an eye. Once you get into 10s of millions of lines THEN you could start thinking about a performance improvement if it became necessary.
All you need is:
awk '/[0-9]{4}(-[0-9]{2}){2}/{line=$0} END{if (line!="") print line}' file
Here's the 3rd-run timing comparison for finding the last line containing 2 or more consecutive 5s in a 100000 line file generated by seq 100000 > file100k, i.e. where the target string is just 45 lines from the end of the input file, with and without tac:
$ time awk '/5{2}/{line=$0} END{if (line!="") print line}' file100k
99955
real 0m0.056s
user 0m0.031s
sys 0m0.000s
$ time tac file100k | awk '/5{2}/{print; exit}'
99955
real 0m0.056s
user 0m0.015s
sys 0m0.030s
As you can see, both ran in a fraction of a second and using tac did nothing to improve the speed of execution. Switching to tac+grep doesn't make it any faster either, it still just takes 1/20th of a second:
$ time tac file100k | grep -m1 '5\{2\}'
99955
real 0m0.057s
user 0m0.015s
sys 0m0.015s
In case you ever do need it in future, though, here's how to implement an efficient tac if you don't have it:
$ mytac() { cat -n "${@:--}" | sort -k1,1rn | cut -d$'\t' -f2-; }
$ seq 5 | mytac
5
4
3
2
1
The above mytac() function just adds line numbers to the input, sorts those in reverse and then removes them again. If your cat doesn't have -n to add line numbers then you can use nl if you have it or awk -v OFS='\t' '{print NR, $0}' will always work.
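A sketch of that awk-based variant, for systems where cat -n isn't an option. The name mytac_awk is just for illustration.

```shell
# mytac_awk [FILE...]
# Number the lines with awk, sort by that number in reverse, drop the numbers.
mytac_awk() {
    awk -v OFS='\t' '{ print NR, $0 }' "$@" | sort -k1,1rn | cut -f2-
}

seq 5 | mytac_awk    # prints 5 4 3 2 1, one per line
```

cut uses a tab as its default delimiter, so no -d is needed here.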
Use tac:
#!/bin/bash
function process_file_backwards(){
tac "$1" | while IFS= read -r line; do
# Grep for xxxx-xx-xx number matching
first_date=$(echo "$line" | grep '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}' | awk -F '"' '{ print $2}')
# If the variable is not empty, print it and break the loop
[ -n "$first_date" ] && echo "$first_date" && break
done
}
process_file_backwards "$1"
Note: Make sure the file ends with a newline so tac will not concatenate the last two lines.
Note: Remove the awk part if the file contains strings without ".
On MacOS
You can use tail -r which will do the same thing as tac but you may have to supply the number of lines you want tail to output from your file. Something like this should work:
tail -r -n $(wc -l myfile.txt | cut -d ' ' -f 1) myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
-r tells tail to output its last line first
-n takes a numeric argument telling how many lines tail should output
wc -l outputs the line count and filename of a given file
cut -d ' ' splits the above on the space character and -f 1 takes the first "field" which will be our line count
$ cat myfile.txt
foo
this is a date 2021-04-03
bar
this is another date 2021-04-04 for example
$ tail -r -n $(wc -l myfile.txt | cut -d ' ' -f 1) myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
2021-04-04
grep options:
The -m 1 option will quit after the first result.
The -o option
will return only the string matching the pattern (i.e. your date)
The -P option uses the perl regex engine which is really down to
preference but I personally prefer the regex syntax (seems to use
fewer backslashes \)
On Linux
You can use tac (cat in reverse) and pipe that into your grep. e.g.:
$ tac myfile.txt
this is another date 2021-04-04 for example
bar
this is a date 2021-04-03
foo
$ tac myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
2021-04-04
You can use perl to reverse the lines and grep for the 1st match too.
perl -e 'print reverse<>' inputFile | grep -m1 '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}'
I have a bunch of files with data from a company and I need to count, let's say, how many people there are from certain cities. Initially I was doing it manually with
grep -c 'Chicago' file.csv
But now I have to look for a lot of cities and it would be time consuming to do this manually every time. So I did some research and found this:
#!/bin/sh
for p in 'Chicago' 'Washington' 'New York'; do
grep -c '$p' 'file.csv'
done
But it doesn't work. It keeps giving me 0s as output and I'm not sure what is wrong. Anyway, what I need is an output with every result (just the values) given by grep in a column so I can copy it directly to a spreadsheet. Ex.:
132
407
523
Thanks in advance.
You should use sort + uniq for that:
$ awk '{print $<N>}' file.csv | sort | uniq -c
where N is the number of the column that holds the cities (I assume the file is structured, since it's a CSV file).
For example, to see how often each shell is used on my system:
$ awk -F: '{print $7}' /etc/passwd | sort | uniq -c
1 /bin/bash
1 /bin/sync
1 /bin/zsh
1 /sbin/halt
41 /sbin/nologin
1 /sbin/shutdown
$
From the title, it sounds like you want to count the number of occurrences of the string rather than the number of lines on which the string appears, but since you accept the grep -c answer I'll assume you actually only care about the latter. Do not use grep and read the file multiple times. Count everything in one pass:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' input-file
Note that this will print a blank line instead of "0" for any string that does not appear, so you might want to initialize. There are several ways to do that. I like:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' c=0 w=0 n=0 input-file
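If the list of cities keeps changing, the same one-pass idea can be sketched with the names passed as arguments. The function name count_cities is made up for illustration, and it does a literal substring match with index() rather than a regex match. (As an aside, the loop in the question prints 0 because '$p' in single quotes is searched for literally; "$p" in double quotes would expand.)

```shell
# count_cities FILE CITY...
# Prints one count per city, in the order given: the number of lines of FILE
# that contain that city name. Zero is printed for cities that never appear.
count_cities() {
    file=$1
    shift
    printf '%s\n' "$@" | awk '
        NR == FNR { city[++n] = $0; count[n] = 0; next }   # stdin: one city per line
        { for (i = 1; i <= n; i++) if (index($0, city[i])) count[i]++ }
        END { for (i = 1; i <= n; i++) print count[i] }
    ' - "$file"
}

# Demo on a small hypothetical file.
demo=$(mktemp)
printf '%s\n' 'Ann,Chicago' 'Bob,New York' 'Cy,Chicago' 'Di,Boston' > "$demo"
count_cities "$demo" 'Chicago' 'Washington' 'New York'    # prints 2, 0, 1
rm -f "$demo"
```

The output is a plain column of numbers, so it can be pasted straight into a spreadsheet.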
I have a CSV file that looks like this
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
I need to sort it by line length including spaces. The following command doesn't
include spaces, is there a way to modify it so it will work for me?
cat $@ | awk '{ print length, $0 }' | sort -n | awk '{$1=""; print $0}'
Answer
cat testfile | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-
Or, to do your original (perhaps unintentional) sub-sorting of any equal-length lines:
cat testfile | awk '{ print length, $0 }' | sort -n | cut -d" " -f2-
In both cases, we have solved your stated problem by moving away from awk for your final cut.
Lines of matching length - what to do in the case of a tie:
The question did not specify whether or not further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s (--stable) to prevent such lines being sorted against each other, and keep them in the relative order in which they occur in the input.
(Those who want more control of sorting these ties might look at sort's --key option.)
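A small sketch of the stable variant, to show what happens to ties. The wrapper name sort_by_length is only for illustration.

```shell
# sort_by_length [FILE...]
# Prefix each line with its length, sort numerically and stably, strip the prefix.
sort_by_length() {
    awk '{ print length, $0 }' "$@" | sort -n -s | cut -d' ' -f2-
}

printf '%s\n' 'ccb' 'g' 'cca' 'ff' | sort_by_length
# prints g, ff, ccb, cca: the two 3-character lines keep their input order
```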
Why the question's attempted solution fails (awk line-rebuilding):
It is interesting to note the difference between:
echo "hello   awk   world" | awk '{print}'
echo "hello   awk   world" | awk '{$1="hello"; print}'
They yield respectively
hello   awk   world
hello awk world
The relevant section of (gawk's) manual only mentions as an aside that awk is going to rebuild the whole of $0 (based on the separator, etc) when you change one field. I guess it's not crazy behaviour. It has this:
"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"
$1 = $1 # force record to be reconstituted
print $0 # or whatever else with $0
"This forces awk to rebuild the record."
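Applied to the pipeline in the question, the effect can be sketched like this. The two helper names are hypothetical; each one strips the length prefix from a line that has runs of spaces in it.

```shell
# Two ways of dropping the leading "length" field from each line.
strip_with_awk() { awk '{ $1 = ""; print $0 }'; }   # rebuilds $0 using OFS
strip_with_cut() { cut -d' ' -f2-; }                # leaves the rest untouched

printf '8 a  b   c\n' | strip_with_awk    # prints " a b c"  (spacing collapsed)
printf '8 a  b   c\n' | strip_with_cut    # prints "a  b   c" (spacing preserved)
```

The awk version also leaves a leading space, because the emptied first field is still joined to the rest with OFS.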
Test input including some lines of equal length:
aa A line with MORE spaces
bb The very longest line in the file
ccb
9 dd equal len. Orig pos = 1
500 dd equal len. Orig pos = 2
ccz
cca
ee A line with some spaces
1 dd equal len. Orig pos = 3
ff
5 dd equal len. Orig pos = 4
g
The AWK solution from neillb is great if you really want to use awk and it explains why it's a hassle there, but if what you want is to get the job done quickly and don't care what you do it in, one solution is to use Perl's sort() function with a custom comparison routine to iterate over the input lines. Here is a one-liner:
perl -e 'print sort { length($a) <=> length($b) } <>'
You can put this in your pipeline wherever you need it, either receiving STDIN (from cat or a shell redirect) or just give the filename to perl as another argument and let it open the file.
In my case I needed the longest lines first, so I swapped out $a and $b in the comparison.
Benchmark results
Below are the results of a benchmark across solutions from other answers to this question.
Test method
10 sequential runs on a fast machine, averaged
Perl 5.24
awk 3.1.5 (gawk 4.1.0 times were ~2% faster)
The input file is a 550MB, 6 million line monstrosity (British National Corpus txt)
Results
Caleb's perl solution took 11.2 seconds
my perl solution took 11.6 seconds
neillb's awk solution #1 took 20 seconds
neillb's awk solution #2 took 23 seconds
anubhava's awk solution took 24 seconds
Jonathan's awk solution took 25 seconds
Fritz's bash solution takes 400x longer than the awk solutions (using a truncated test case of 100000 lines). It works fine, just takes forever.
Another perl solution
perl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file
Try this command instead:
awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-
Pure Bash:
declare -a sorted
while IFS= read -r line; do
if [ -z "${sorted[${#line}]}" ] ; then # does line length already exist?
sorted[${#line}]="$line" # element for new length
else
sorted[${#line}]="${sorted[${#line}]}\n$line" # append to lines with equal length
fi
done < data.csv
for key in ${!sorted[*]}; do # iterate over existing indices
echo -e "${sorted[$key]}" # echo lines with equal length
done
Python Solution
Here's a Python one-liner that does the same, tested with Python 3.9.10 and 2.7.18. It's about 60% faster than Caleb's perl solution, and the output is identical (tested with a 300MiB wordlist file with 14.8 million lines).
python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'
Benchmark:
python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'
real 0m5.308s
user 0m3.733s
sys 0m1.490s
perl -e 'print sort { length($a) <=> length($b) } <>'
real 0m8.840s
user 0m7.117s
sys 0m2.279s
The length() function does include spaces. I would make just minor adjustments to your pipeline (including avoiding UUOC).
awk '{ printf "%d:%s\n", length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]*://'
The sed command directly removes the digits and colon added by the awk command. Alternatively, keeping your formatting from awk:
awk '{ print length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]* //'
I found these solutions will not work if your file contains lines that start with a number, since they will be sorted numerically along with all the counted lines. The solution is to give sort the -g (general-numeric-sort) flag instead of -n (numeric-sort):
awk '{ print length, $0 }' lines.txt | sort -g | cut -d" " -f2-
With POSIX Awk:
{
c = length
m[c] = m[c] ? m[c] RS $0 : $0
} END {
for (c in m) print m[c]
}
Example
1) pure awk solution. Let's suppose that the line length cannot exceed 1024
then
cat filename | awk 'BEGIN {min = 1024; s = "";} {l = length($0); if (l < min) {min = l; s = $0;}} END {print s}'
2) one-liner bash solution assuming all lines have just 1 word, but it can be reworked for any case where all lines have the same number of words:
LINES=$(cat filename); for k in $LINES; do printf "$k "; echo $k | wc -L; done | sort -k2 | head -n 1 | cut -d " " -f1
using Raku (formerly known as Perl6)
~$ cat "BinaryAve.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};'
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
To reverse the sort, add .reverse in the middle of the chain of method calls--immediately after .sort(). Here's code showing that .chars includes spaces:
~$ cat "number_triangle.txt" | raku -e 'given lines() {.map(*.chars).say};'
(1 3 5 7 9 11 13 15 17 19 0)
~$ cat "number_triangle.txt"
1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
1 2 3 4 5 6
1 2 3 4 5 6 7
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9 0
Here's a time comparison between awk and Raku using a 9.1MB txt file from Genbank:
~$ time cat "rat_whole_genome.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};' > /dev/null
real 0m1.308s
user 0m1.213s
sys 0m0.173s
~$ #awk code from neillb
~$ time cat "rat_whole_genome.txt" | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2- > /dev/null
real 0m1.189s
user 0m1.170s
sys 0m0.050s
HTH.
https://raku.org
Here is a multibyte-compatible method of sorting lines by length. It requires:
wc -m is available to you (macOS has it).
Your current locale supports multi-byte characters, e.g., by setting LC_ALL=UTF-8. You can set this either in your .bash_profile, or simply by prepending it before the following command.
testfile has a character encoding matching your locale (e.g., UTF-8).
Here's the full command:
cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-
Explaining part-by-part:
l=$0; gsub(/\047/, "\047\"\047\"\047", l); ← makes a copy of each line in awk variable l and double-escapes every ' so the line can safely be echoed as a shell command (\047 is a single-quote in octal notation).
cmd=sprintf("echo \047%s\047 | wc -m", l); ← this is the command we'll execute, which echoes the escaped line to wc -m.
cmd | getline c; ← executes the command and copies the character count value that is returned into awk variable c.
close(cmd); ← close the pipe to the shell command to avoid hitting a system limit on the number of open files in one process.
sub(/ */, "", c); ← trims white space from the character count value returned by wc.
{ print c, $0 } ← prints the line's character count value, a space, and the original line.
| sort -ns ← sorts the lines (by prepended character count values) numerically (-n), and maintaining stable sort order (-s).
| cut -d" " -f2- ← removes the prepended character count values.
It's slow (only 160 lines per second on a fast Macbook Pro) because it must execute a sub-command for each line.
Alternatively, just do this solely with gawk (as of version 3.1.5, gawk is multibyte aware), which would be significantly faster. It's a lot of trouble doing all the escaping and double-quoting to safely pass the lines through a shell command from awk, but this is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).
Revisiting this one. This is how I approached it (count length of LINE and store it as LEN, sort by LEN, keep only the LINE):
cat test.csv | while read LINE; do LEN=$(echo ${LINE} | wc -c); echo ${LINE} ${LEN}; done | sort -k 2n | cut -d ' ' -f 1
I want a bash command that I can pipe into that will sum a column of numbers. I just want a quick one liner that will do something essentially like this:
cat FileWithColumnOfNumbers.txt | sum
Using existing file:
paste -sd+ infile | bc
Using stdin:
<cmd> | paste -sd+ | bc
Edit:
With some paste implementations you need to be more explicit when reading from stdin:
<cmd> | paste -sd+ - | bc
Options used:
-s (serial) - merges all the lines into a single line
-d - use a non-default delimiter (the character + in this case)
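To see what bc actually receives, here is a sketch that stops after the paste step. The name join_plus is made up for illustration.

```shell
# join_plus: merge all lines of stdin into one line, separated by "+".
join_plus() {
    paste -sd+ -
}

printf '%s\n' 1 2 3 | join_plus    # prints 1+2+3
```

Piping that single line into bc then evaluates the expression and prints 6.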
I like the chosen answer. However, it tends to be slower than awk since 2 tools are needed to do the job.
$ wc -l file
49999998 file
$ time paste -sd+ file | bc
1448700364
real 1m36.960s
user 1m24.515s
sys 0m1.772s
$ time awk '{s+=$1}END{print s}' file
1448700364
real 0m45.476s
user 0m40.756s
sys 0m0.287s
The following command will add all the lines(first field of the awk output)
awk '{s+=$1} END {print s}' filename
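Wrapped as a shell function, this can be piped into as the question asks. A minimal sketch: the name sum_column is made up, and the + 0 is an addition so that empty input prints 0 instead of a blank line.

```shell
# sum_column [FILE...]
# Adds up the first field of every line from the files, or from stdin if none.
sum_column() {
    awk '{ s += $1 } END { print s + 0 }' "$@"
}

printf '%s\n' 1 2 3 | sum_column    # prints 6
```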
Does two lines count?
awk '{ sum += $1; }
END { print sum; }' "$@"
You can then use it without the superfluous 'cat':
sum < FileWithColumnOfNumbers.txt
sum FileWithColumnOfNumbers.txt
FWIW: on MacOS X, you can do it with a one-liner:
awk '{ sum += $1; } END { print sum; }' "$@"
[a followup to ghostdog74s comments]
bash-2.03$ uname -sr
SunOS 5.8
bash-2.03$ perl -le 'print for 1..49999998' > infile
bash-2.03$ wc -l infile
49999998 infile
bash-2.03$ time paste -sd+ infile | bc
bundling space exceeded on line 1, teletype
Broken Pipe
real 0m0.062s
user 0m0.010s
sys 0m0.010s
bash-2.03$ time nawk '{s+=$1}END{print s}' infile
1249999925000001
real 2m0.042s
user 1m59.220s
sys 0m0.590s
bash-2.03$ time /usr/xpg4/bin/awk '{s+=$1}END{print s}' infile
1249999925000001
real 2m27.260s
user 2m26.230s
sys 0m0.660s
bash-2.03$ time perl -nle'
$s += $_; END { print $s }
' infile
1.249999925e+15
real 1m34.663s
user 1m33.710s
sys 0m0.650s
You can use bc (calculator). Assuming your file with #s is called "n":
$ cat n
1
2
3
$ (cat n | tr "\012" "+" ; echo "0") | bc
6
The tr changes all newlines to "+"; then we append 0 after the last plus, then we pipe the expression (1+2+3+0) to the calculator
Or, if you are OK with using awk or perl, here's a Perl one-liner:
$ perl -nle '$sum += $_ } END { print $sum' n
6
while read -r num; do ((sum += num)); done < inputfile; echo $sum
Use a for loop to iterate over your file …
sum=0; for x in `cat <your-file>`; do let sum+=x; done; echo $sum
If you have ruby installed
cat FileWithColumnOfNumbers.txt | xargs ruby -e "puts ARGV.map(&:to_i).inject(&:+)"
[root@pentest3r ~]# (find / -xdev -size +1024M) | (while read a ; do aa=$(du -sh $a | cut -d "." -f1 ); o=$(( $o+$aa )); done; echo "$o";)