How to implement a counter using a dictionary in Bash

#!/bin/bash
clear
Counter() {
declare -A dict
while read line; do
if [[ -n "${dict[$line]}" ]]; then
((${dict[$line]}+1))
else
dict["$line"]=1
fi
done < /home/$USER/.bash_history
echo ${!dict[@]} ${dict[@]}
}
Counter
I'm trying to write a script that counts the most used commands in my bash history, using a dictionary to store commands as keys and the number of times each command was used as the values, but my code fails successfully.
Can you help me write code that works?
Python code for reference:
def Counter(file):
    dict = {}
    for line in file.read().splitlines():
        if line not in dict:
            dict[line] = 1
        else:
            dict[line] += 1
    for k, v in sorted(dict.items(), key=lambda x: x[1]):
        print(f"{k} was used {v} times")

with open("/home/igor/.bash_history") as bash:
    Counter(bash)
Output:
echo $SHELL was used 11 times
sudo apt-get update was used 14 times
ls -l was used 14 times
ldd /opt/pt/bin/PacketTracer7 was used 15 times
zsh was used 17 times
ls was used 26 times

There's no need to initialize the value to 1 for the first occurrence. Bash can do that for you.
The problem is you can't use an empty string as a key, so prepend something and remove it when showing the value.
#! /bin/bash
Counter() {
    declare -A dict
    while read line; do
        c=${dict[x$line]}
        dict[x$line]=$((c+1))
    done < /home/$USER/.bash_history
    for k in "${!dict[@]}" ; do
        echo "${dict[$k]}"$'\t'"${k#x}"
    done
}
Counter | sort -n

Python code for reference:
To count occurrences of lines in shell, you would typically do sort | uniq -c | sort:
sort ~/.bash_history | uniq -c | sort -rn | head
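If you want the output phrased like the Python version ("command was used N times"), a small follow-up sketch (the awk reformatting is my own illustration, not part of the original answer) could be:
sort ~/.bash_history | uniq -c | sort -n | awk '{c=$1; $1=""; sub(/^ /,""); printf "%s was used %s times\n", $0, c}'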

Related

while loops in parallel with input from splited file

I am stuck on this. I have a while-read loop within my code that is taking very long, and I would like to run it on many processors. I'd like to split the input file and run 14 loops (because I have 14 threads), one for each split file, in parallel. The thing is that I don't know how to tell each while loop which file to get and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file into 14 files and run 14 while loops in parallel, one for each split file.
I tried :
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But in neither case was I able to tell which file would be the input to the loop, so that parallel would understand "take each of those xa* files and place them into individual loops in parallel".
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequence alignments already made.
The loop takes an input_file (no broken lines, varies from 500 to ~45000 lines, 800 Kb) with DNA sequence accessions, reads it line by line and looks up the corresponding full taxonomy for each accession in a databank (~45000 lines). Then it does a few sums/divisions. The output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input_file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also previously made outside this loop. It can vary from hundreds to thousands of lines. Only columns 1 and 2 matter here. Column 1 is the name of the sequence and column 2 is the name of every sequence it matched in the databank. It looks like this:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies correspondent to the DNA sequences. Each taxonomy is in one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, given sequence KF625180.1.1799: I previously aligned it against a databank containing ~45000 other DNA sequences and got an output which has all the accessions of the sequences it matched. What the loop does is find the taxonomies for all those sequences and calculate the "statistics" I mentioned previously. The code does this for all the DNA sequence accessions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
    #find hits
    hits=$(grep $line alnout_file | cut -f 2)
    completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
    printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
    printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
    #find hits and calculate the frequence (%) of the taxonomy in the alignment output
    # ex.: Bacillus_subtilis 33
    freqHits=$(grep "${hits[@]}" $TAXONOMY | \
        cut -f 2 | \
        awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
        sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
        sort -k2 -hr)
    # print frequence of each taxonomy in the databank
    freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
    #print cols with taxonomy and calculations
    paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing, so it takes about ~12h running on one processor to do it for all 45000 DNA sequence accessions. So I would like to split input_file and process it on all the processors I have (14), because that would reduce the time spent on it.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
    while read line
    do
        #find hits
        hits=$(grep $line alnout_file | cut -f 2)
        completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
        printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
        printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
        #find hits and calculate the frequence (%) of the taxonomy in the alignment output
        # ex.: Bacillus_subtilis 33
        freqHits=$(grep "${hits[@]}" $TAXONOMY | \
            cut -f 2 | \
            awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
            sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
            sort -k2 -hr)
        # print frequence of each taxonomy in the databank
        freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
        #print cols with taxonomy and calculations
        paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
    done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, and run ncpu jobs in parallel.
That said, I think your real problem is spawning too many programs: if you rewrite doit in Perl or Python, I expect you will see a major speedup.
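As a rough illustration of that point (my own sketch, not part of this answer, and assuming the two files are laid out as shown in the question, with the accession in column 1), a single awk pass can keep the whole taxonomy file in an associative array instead of grepping it once per line:
# hypothetical one-pass replacement for the repeated greps (core idea only,
# without the percentage arithmetic of the original loop)
awk '
    NR == FNR { tax[$1] = $2; next }     # taxonomy.file.tsv: accession -> taxonomy
    { count[$1 "\t" tax[$2]]++ }         # alnout_file: query accession, hit accession
    END { for (k in count) print k "\t" count[k] }
' taxonomy.file.tsv alnout_file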
As an alternative I threw together a quick test.
#! /bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric digits and if I nest 3 loops of that I get 36^3 or 46,656 lines which matches your file sizes roughly. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a, finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
I hope this small script helps you out:
function process {
    while read line; do
        echo "$line"
    done < $1
}
function loop {
    file=$1
    chunks=$2
    dir=`mktemp -d`
    cd $dir
    split -n l/$chunks $file
    for i in *; do
        process "$i" &
    done
    wait    # wait for the background jobs to finish before cleaning up
    rm -rf $dir
}
loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for you. I am not familiar with parallel; instead this uses native bash, spawning processes with &:
function loop () {
    while IFS= read -r -d $'\n'
    do
        # YOUR BIG STUFF
    done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[@]}"
do loop "${i}" &
done
wait

Loop Script from Input File

I have a reference file with device names in it, for example WABEL8499IPM101. I'm using this script to take the base name (without the last 3 digits), look at the reference file and see what is already used. If 101 is used it will create a file for me with 102 and 103 if I request 2 in total. I'm looking to use an input file to run it multiple times. I'm also trying to figure out how to start at 101 if no matching name is found when searching the reference file.
I would like to loop this using an input file instead of manually entering bash test.sh WABEL8499IPM 2 each time. I would like to be able to build an input file of all the names that need to be compared and then get the output. It would also be nice if, when there isn't a match, it started creating names at WABEL8499IPM101 instead of just WABEL8499IPM1.
Input file example:
ColumnA (BASE NAME) ColumnB (QUANTITY)
WABEL8499IPM 2
Script:
SRCFILE="~/Desktop/deviceinfo.csv"
LOGDIR="~/Desktop/"
LOGFILE="$LOGDIR/DeviceNames.csv"
# base name, such as "WABEL8499IPM"
device_name=$1
# quantity, such as "2"
quantityNum=$2
# the largest in sequence, such as "WABEL8499IPM108"
max_sequence_name=$(cat $SRCFILE | grep -o -e "$device_name[0-9]*" | sort --reverse | head -n 1)
# extract the last 3-digit number (such as "108") from max_sequence_name
max_sequence_num=$(echo $max_sequence_name | rev | cut -c 1-3 | rev)
# create new sequence_name
# such as ["WABEL8499IPM109", "WABEL8499IPM110"]
array_new_sequence_name=()
for i in $(seq 1 $quantityNum);
do
    cnum=$((max_sequence_num + i))
    array_new_sequence_name+=($(echo $device_name$cnum))
done
#CODE FOR CREATING OUTPUT FILE HERE
#for fn in ${array_new_sequence_name[@]}; do touch $fn; done;
# write log
for sqn in ${array_new_sequence_name[@]};
do
    echo $sqn >> $LOGFILE
done
Usage:
bash test.sh WABEL8499IPM 2
Result in the log file:
WABEL8499IPM109
WABEL8499IPM110
Just wrap a loop around your code instead of assuming the args come in on the command line.
SRCFILE="~/Desktop/deviceinfo.csv"
LOGDIR="~/Desktop/"
LOGFILE="$LOGDIR/DeviceNames.csv"
while read device_name quantityNum
do  max_sequence_name=$( grep -o -e "$device_name[0-9]*" $SRCFILE |
        sort --reverse | head -n 1)
    max_sequence_num=${max_sequence_name: -3}
    array_new_sequence_name=()
    for i in $(seq 1 $quantityNum)
    do  cnum=$((max_sequence_num + i))
        array_new_sequence_name+=("$device_name$cnum")
    done
    for sqn in ${array_new_sequence_name[@]}
    do  echo $sqn >> $LOGFILE
    done
done < input.file
I'd maybe pass the input file as the parameter now.
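The OP also asked for numbering to start at 101 when the base name is not found in the reference file; neither version above handles that, because an empty max_sequence_num makes the first generated name end in 1. A minimal tweak (my own sketch, reusing the variable names from the loop above) is to fall back to 100 before incrementing:
max_sequence_num=${max_sequence_name: -3}
# nothing matched in the reference file, so start counting from 100,
# which makes the first generated name end in 101
[ -z "$max_sequence_num" ] && max_sequence_num=100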

How do I pick random unique lines from a text file in shell?

I have a text file with an unknown number of lines. I need to grab some of those lines at random, but I don't want there to be any risk of repeats.
I tried this:
jot -r 3 1 `wc -l<input.txt` | while read n; do
awk -v n=$n 'NR==n' input.txt
done
But this is ugly, and doesn't protect against repeats.
I also tried this:
awk -vmax=3 'rand() > 0.5 {print;count++} count>max {exit}' input.txt
But that obviously isn't the right approach either, as I'm not guaranteed even to get max lines.
I'm stuck. How do I do this?
This might work for you:
shuf -n3 file
shuf is one of GNU coreutils.
If you have Python accessible (change the 10 to what you'd like):
python -c 'import random, sys; print("".join(random.sample(sys.stdin.readlines(), 10)).rstrip("\n"))' < input.txt
(This will work in Python 2.x and 3.x.)
Also, (again change the 10 to the appropriate value):
sort -R input.txt | head -10
If jot is on your system, then I guess you're running FreeBSD or OSX rather than Linux, so you probably don't have tools like rl or sort -R available.
No worries. I had to do this a while ago. Try this instead:
$ printf 'one\ntwo\nthree\nfour\nfive\n' > input.txt
$ cat rndlines
#!/bin/sh
# default to 3 lines of output
lines="${1:-3}"
# default to "input.txt" as input file
input="${2:-input.txt}"
# First, put a random number at the beginning of each line.
while read line; do
    printf '%8d%s\n' $(jot -r 1 1 99999999) "$line"
done < "$input" |
sort -n | # Next, sort by the random number.
sed 's/^.\{8\}//' | # Last, remove the number from the start of each line.
head -n "$lines" # Show our output
$ ./rndlines input.txt
two
one
five
$ ./rndlines input.txt
four
two
three
$
Here's a 1-line example that also inserts the random number a little more cleanly using awk:
$ printf 'one\ntwo\nthree\nfour\nfive\n' | awk 'BEGIN{srand()} {printf("%8d%s\n", rand()*10000000, $0)}' | sort -n | head -n 3 | cut -c9-
Note that different versions of sed (in FreeBSD and OSX) may require the -E option instead of -r to handle ERE instead of the BRE dialect in the regular expression, if you want to use that explicitly, though everything I've tested works with escaped bounds in BRE. (Ancient versions of sed (HP/UX, etc.) might not support this notation, but you'd only be using those if you already knew how to do this.)
This should do the trick, at least with bash and assuming your environment has the other commands available:
cat chk.c | while read x; do
echo $RANDOM:$x
done | sort -t: -k1 -n | tail -10 | sed 's/^[0-9]*://'
It basically outputs your file, placing a random number at the start of each line.
Then it sorts on that number, grabs the last 10 lines, and removes that number from them.
Hence, it gives you ten random lines from the file, with no repeats.
For example, here's a transcript of it running three times with that chk.c file:
====
pax$ testprog chk.c
} else {
}
newNode->next = NULL;
colm++;
====
pax$ testprog chk.c
}
arg++;
printf (" [%s] n", currNode->value);
free (tempNode->value);
====
pax$ testprog chk.c
char tagBuff[101];
}
return ERR_OTHER;
#define ERR_MEM 1
===
pax$ _
sort -Ru filename | head -5
will ensure no duplicates. Not all implementations of sort have the -R option.
To get N random lines from FILE with Perl:
perl -MList::Util=shuffle -e 'print shuffle <>' FILE | head -N
Here's an answer using ruby if you don't want to install anything else:
cat filename | ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
for example, given a file (dups.txt) that looks like:
1 2
1 3
2
1 2
3
4
1 3
5
6
6
7
You might get the following output (or some permutation):
cat dups.txt| ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
4
6
5
1 2
2
3
7
1 3
Further example from the comments:
printf 'test\ntest1\ntest2\n' | ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
test1
test
test2
Of course if you have a file with repeated lines of test you'll get just one line:
printf 'test\ntest\ntest\n' | ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
test

What's an easy way to read random line from a file?

What's an easy way to read random line from a file in a shell script?
You can use shuf:
shuf -n 1 $FILE
There is also a utility called rl. In Debian it's in the randomize-lines package that does exactly what you want, though not available in all distros. On its home page it actually recommends the use of shuf instead (which didn't exist when it was created, I believe). shuf is part of the GNU coreutils, rl is not.
rl -c 1 $FILE
Another alternative:
head -$((${RANDOM} % `wc -l < file` + 1)) file | tail -1
sort --random-sort $FILE | head -n 1
(I like the shuf approach above even better though - I didn't even know that existed and I would have never found that tool on my own)
This is simple.
cat file.txt | shuf -n 1
Granted this is just a tad slower than the "shuf -n 1 file.txt" on its own.
perlfaq5: How do I select a random line from a file? Here's a reservoir-sampling algorithm from the Camel Book:
perl -e 'srand; rand($.) < 1 && ($line = $_) while <>; print $line;' file
This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
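The same reservoir idea can be sketched in plain bash (my own rough sketch; note that $RANDOM only spans 0..32767, so the selection is only approximately uniform and degrades for files with more lines than that):
n=0
while IFS= read -r line; do
    n=$((n + 1))
    # keep the current line with probability roughly 1/n
    if (( RANDOM % n == 0 )); then
        chosen=$line
    fi
done < file
printf '%s\n' "$chosen"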
using a bash script:
#!/bin/bash
# replace with file to read
FILE=tmp.txt
# count number of lines
NUM=$(wc -l < ${FILE})
# generate random number in range 1-NUM
let X=${RANDOM} % ${NUM} + 1
# extract X-th line
sed -n ${X}p ${FILE}
Single bash line:
sed -n $((1+$RANDOM%`wc -l test.txt | cut -f 1 -d ' '`))p test.txt
Slight problem: duplicate filename.
Here's a simple Python script that will do the job:
import random, sys
lines = open(sys.argv[1]).readlines()
print(lines[random.randrange(len(lines))])
Usage:
python randline.py file_to_get_random_line_from
Another way using 'awk'
awk NR==$((${RANDOM} % `wc -l < file.name` + 1)) file.name
A solution that also works on MacOSX, and should also work on Linux(?):
N=5
awk 'NR==FNR {lineN[$1]; next}(FNR in lineN)' <(jot -r $N 1 $(wc -l < $file)) $file
Where:
N is the number of random lines you want
NR==FNR {lineN[$1]; next}(FNR in lineN) file1 file2
--> save line numbers written in file1 and then print corresponding line in file2
jot -r $N 1 $(wc -l < $file) --> draw N numbers randomly (-r) in range (1, number_of_line_in_file) with jot. The process substitution <() will make it look like a file for the interpreter, so file1 in previous example.
#!/bin/bash
IFS=$'\n' wordsArray=($(<$1))
numWords=${#wordsArray[@]}
sizeOfNumWords=${#numWords}
while [ True ]
do
    for ((i=0; i<$sizeOfNumWords; i++))
    do
        let ranNumArray[$i]=$(( ( $RANDOM % 10 ) + 1 ))-1
        ranNumStr="$ranNumStr${ranNumArray[$i]}"
    done
    if [ $ranNumStr -le $numWords ]
    then
        break
    fi
    ranNumStr=""
done
noLeadZeroStr=$((10#$ranNumStr))
echo ${wordsArray[$noLeadZeroStr]}
Here is what I discovered, since my Mac OS doesn't have all of the easy answers available. I used the jot command to generate a number, since the $RANDOM variable solutions seemed not to be very random in my testing. When testing my solution I had a wide variance in the solutions provided in the output.
RANDOM1=`jot -r 1 1 235886`
#range of jot ( 1 235886 ) found from earlier wc -w /usr/share/dict/web2
echo $RANDOM1
head -n $RANDOM1 /usr/share/dict/web2 | tail -n 1
The echo of the variable is to get a visual of the generated random number.
Using only vanilla sed and awk, and without using $RANDOM, a simple, space-efficient and reasonably fast "one-liner" for selecting a single line pseudo-randomly from a file named FILENAME is as follows:
sed -n $(awk 'END {srand(); r=rand()*NR; if (r<NR) {sub(/\..*/,"",r); r++;}; print r}' FILENAME)p FILENAME
(This works even if FILENAME is empty, in which case no line is emitted.)
One possible advantage of this approach is that it only calls rand() once.
As pointed out by @AdamKatz in the comments, another possibility would be to call rand() for each line:
awk 'rand() * NR < 1 { line = $0 } END { print line }' FILENAME
(A simple proof of correctness can be given based on induction.)
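(Sketch of that induction, my own summary: after the first line, it is kept with probability 1. Assume that after n-1 lines each of them is the kept line with probability 1/(n-1). Line n replaces the kept line with probability 1/n, since rand()*n < 1 exactly when rand() < 1/n, and any earlier line stays kept with probability (1/(n-1)) * (1 - 1/n) = 1/n, so after n lines every line is the kept one with probability 1/n.)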
Caveat about rand()
"In most awk implementations, including gawk, rand() starts generating numbers from the same starting number, or seed, each time you run awk."
-- https://www.gnu.org/software/gawk/manual/html_node/Numeric-Functions.html
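A simple workaround (my addition, not from the quoted manual) is to seed awk explicitly from the shell, e.g. with $RANDOM:
awk -v seed="$RANDOM" 'BEGIN { srand(seed) } rand() * NR < 1 { line = $0 } END { print line }' FILENAME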

Shell command to sum integers, one per line?

I am looking for a command that will accept (as input) multiple lines of text, each line containing a single integer, and output the sum of these integers.
As a bit of background, I have a log file which includes timing measurements. Through grepping for the relevant lines and a bit of sed reformatting I can list all of the timings in that file. I would like to work out the total. I can pipe this intermediate output to any command in order to do the final sum. I have always used expr in the past, but unless it runs in RPN mode I do not think it is going to cope with this (and even then it would be tricky).
How can I get the summation of integers?
Bit of awk should do it?
awk '{s+=$1} END {print s}' mydatafile
Note: some versions of awk have some odd behaviours if you are going to be adding anything exceeding 2^31 (2147483647). See comments for more background. One suggestion is to use printf rather than print:
awk '{s+=$1} END {printf "%.0f", s}' mydatafile
Paste typically merges lines of multiple files, but it can also be used to convert the individual lines of a file into a single line. The delimiter flag allows you to pass an x+x type equation to bc.
paste -s -d+ infile | bc
Alternatively, when piping from stdin,
<commands> | paste -s -d+ - | bc
The one-liner version in Python:
$ python -c "import sys; print(sum(int(l) for l in sys.stdin))"
I would put a big WARNING on the commonly approved solution:
awk '{s+=$1} END {print s}' mydatafile # DO NOT USE THIS!!
that is because in this form awk uses a 32 bit signed integer representation: it will overflow for sums that exceed 2147483647 (i.e., 2^31).
A more general answer (for summing integers) would be:
awk '{s+=$1} END {printf "%.0f\n", s}' mydatafile # USE THIS INSTEAD
Plain bash:
$ cat numbers.txt
1
2
3
4
5
6
7
8
9
10
$ sum=0; while read num; do ((sum += num)); done < numbers.txt; echo $sum
55
With jq:
seq 10 | jq -s 'add' # 'add' is equivalent to 'reduce .[] as $item (0; . + $item)'
dc -f infile -e '[+z1<r]srz1<rp'
Note that negative numbers prefixed with minus sign should be translated for dc, since it uses _ prefix rather than - prefix for that. For example, via tr '-' '_' | dc -f- -e '...'.
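For instance (my own small example, relying on dc accepting -f- for standard input, as the translation note above implies):
printf '%s\n' 5 -3 10 | tr '-' '_' | dc -f- -e '[+z1<r]srz1<rp'
12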
Edit: Since this answer got so many votes "for obscurity", here is a detailed explanation:
The expression [+z1<r]srz1<rp does the following:
[ interpret everything to the next ] as a string
+ push two values off the stack, add them and push the result
z push the current stack depth
1 push one
<r pop two values and execute register r if the original top-of-stack (1)
is smaller
] end of the string, will push the whole thing to the stack
sr pop a value (the string above) and store it in register r
z push the current stack depth again
1 push 1
<r pop two values and execute register r if the original top-of-stack (1)
is smaller
p print the current top-of-stack
As pseudo-code:
Define "add_top_of_stack" as:
Pop the two top values off the stack, add them, and push the result back
If the stack has two or more values, run "add_top_of_stack" recursively
If the stack has two or more values, run "add_top_of_stack"
Print the result, now the only item left in the stack
To really understand the simplicity and power of dc, here is a working Python script that implements some of the commands from dc and executes a Python version of the above command:
### Implement some commands from dc
registers = {'r': None}
stack = []

def add():
    stack.append(stack.pop() + stack.pop())

def z():
    stack.append(len(stack))

def less(reg):
    if stack.pop() < stack.pop():
        registers[reg]()

def store(reg):
    registers[reg] = stack.pop()

def p():
    print stack[-1]

### Python version of the dc command above
# The equivalent to -f: read a file and push every line to the stack
import fileinput
for line in fileinput.input():
    stack.append(int(line.strip()))

def cmd():
    add()
    z()
    stack.append(1)
    less('r')

stack.append(cmd)
store('r')
z()
stack.append(1)
less('r')
p()
Pure and short bash.
f=$(cat numbers.txt)
echo $(( ${f//$'\n'/+} ))
perl -lne '$x += $_; END { print $x; }' < infile.txt
My fifteen cents:
$ cat file.txt | xargs | sed -e 's/\ /+/g' | bc
Example:
$ cat text
1
2
3
3
4
5
6
78
9
0
1
2
3
4
576
7
4444
$ cat text | xargs | sed -e 's/\ /+/g' | bc
5148
I've done a quick benchmark on the existing answers which
use only standard tools (sorry for stuff like lua or rocket),
are real one-liners,
are capable of adding huge amounts of numbers (100 million), and
are fast (I ignored the ones which took longer than a minute).
I always added the numbers from 1 to 100 million, which was doable on my machine in less than a minute for several solutions.
Here are the results:
Python
:; seq 100000000 | python -c 'import sys; print sum(map(int, sys.stdin))'
5000000050000000
# 30s
:; seq 100000000 | python -c 'import sys; print sum(int(s) for s in sys.stdin)'
5000000050000000
# 38s
:; seq 100000000 | python3 -c 'import sys; print(sum(int(s) for s in sys.stdin))'
5000000050000000
# 27s
:; seq 100000000 | python3 -c 'import sys; print(sum(map(int, sys.stdin)))'
5000000050000000
# 22s
:; seq 100000000 | pypy -c 'import sys; print(sum(map(int, sys.stdin)))'
5000000050000000
# 11s
:; seq 100000000 | pypy -c 'import sys; print(sum(int(s) for s in sys.stdin))'
5000000050000000
# 11s
Awk
:; seq 100000000 | awk '{s+=$1} END {print s}'
5000000050000000
# 22s
Paste & Bc
This ran out of memory on my machine. It worked for half the size of the input (50 million numbers):
:; seq 50000000 | paste -s -d+ - | bc
1250000025000000
# 17s
:; seq 50000001 100000000 | paste -s -d+ - | bc
3750000025000000
# 18s
So I guess it would have taken ~35s for the 100 million numbers.
Perl
:; seq 100000000 | perl -lne '$x += $_; END { print $x; }'
5000000050000000
# 15s
:; seq 100000000 | perl -e 'map {$x += $_} <> and print $x'
5000000050000000
# 48s
Ruby
:; seq 100000000 | ruby -e "puts ARGF.map(&:to_i).inject(&:+)"
5000000050000000
# 30s
C
Just for comparison's sake I compiled the C version and tested this also, just to have an idea how much slower the tool-based solutions are.
#include <stdio.h>
int main(int argc, char** argv) {
    long sum = 0;
    long i = 0;
    while (scanf("%ld", &i) == 1) {
        sum = sum + i;
    }
    printf("%ld\n", sum);
    return 0;
}
 
:; seq 100000000 | ./a.out
5000000050000000
# 8s
Conclusion
C is of course the fastest with 8s, but the Pypy solution adds only a little overhead, about 30%, at 11s. But, to be fair, Pypy isn't exactly standard. Most people only have CPython installed, which is significantly slower (22s), exactly as fast as the popular Awk solution.
The fastest solution based on standard tools is Perl (15s).
Using the GNU datamash util:
seq 10 | datamash sum 1
Output:
55
If the input data is irregular, with spaces and tabs at odd places, this may confuse datamash, then either use the -W switch:
<commands...> | datamash -W sum 1
...or use tr to clean up the whitespace:
<commands...> | tr -d '[[:blank:]]' | datamash sum 1
If the input is large enough, the output will be in scientific notation.
seq 100000000 | datamash sum 1
Output:
5.00000005e+15
To convert that to decimal, use the --format option:
seq 100000000 | datamash --format '%.0f' sum 1
Output:
5000000050000000
Plain bash one liner
$ cat > /tmp/test
1
2
3
4
5
^D
$ echo $(( $(cat /tmp/test | tr "\n" "+" ) 0 ))
BASH solution, if you want to make this a command (e.g. if you need to do this frequently):
addnums () {
    local total=0
    while read val; do
        (( total += val ))
    done
    echo $total
}
Then usage:
addnums < /tmp/nums
You can use num-utils, although it may be overkill for what you need. This is a set of programs for manipulating numbers in the shell, and it can do several nifty things, including of course adding them up. It's a bit out of date, but the tools still work and can be useful if you need to do something more.
https://suso.suso.org/programs/num-utils/index.phtml
It's really simple to use:
$ seq 10 | numsum
55
But runs out of memory for large inputs.
$ seq 100000000 | numsum
Terminado (killed)
The following works in bash:
I=0
for N in `cat numbers.txt`
do
I=`expr $I + $N`
done
echo $I
I realize this is an old question, but I like this solution enough to share it.
% cat > numbers.txt
1
2
3
4
5
^D
% cat numbers.txt | perl -lpe '$c+=$_}{$_=$c'
15
If there is interest, I'll explain how it works.
I cannot avoid submitting this, as it is the most generic approach to this question, please check:
jot 1000000 | sed '2,$s/$/+/;$s/$/p/' | dc
It is to be found over here, I was the OP and the answer came from the audience:
Most elegant unix shell one-liner to sum list of numbers of arbitrary precision?
And here are its special advantages over awk, bc, perl, GNU's datamash and friends:
it uses standards utilities common in any unix environment
it does not depend on buffering and thus does not choke on really long inputs.
it implies no particular precision limits (or integer size, for that matter), hello AWK friends!
no need for different code if floating point numbers need to be added instead (see the small example after this list).
it theoretically runs unhindered in the most minimal of environments.
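A quick illustration of the floating-point claim above (my own example, using printf instead of jot so it is easy to paste):
printf '%s\n' 1.5 2.25 | sed '2,$s/$/+/;$s/$/p/' | dc
3.75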
sed 's/^/.+/' infile | bc | tail -1
Pure bash and in a one-liner :-)
$ cat numbers.txt
1
2
3
4
5
6
7
8
9
10
$ I=0; for N in $(cat numbers.txt); do I=$(($I + $N)); done; echo $I
55
Alternative pure Perl, fairly readable, no packages or options required:
perl -e "map {$x += $_} <> and print $x" < infile.txt
For Ruby Lovers
ruby -e "puts ARGF.map(&:to_i).inject(&:+)" numbers.txt
Here's a nice and clean Raku (formerly known as Perl 6) one-liner:
say [+] slurp.lines
We can use it like so:
% seq 10 | raku -e "say [+] slurp.lines"
55
It works like this:
slurp without any arguments reads from standard input by default; it returns a string. Calling the lines method on a string returns a list of lines of the string.
The brackets around + turn + into a reduction meta operator which reduces the list to a single value: the sum of the values in the list. say then prints it to standard output with a newline.
One thing to note is that we never explicitly convert the lines to numbers—Raku is smart enough to do that for us. However, this means our code breaks on input that definitely isn't a number:
% echo "1\n2\nnot a number" | raku -e "say [+] slurp.lines"
Cannot convert string to number: base-10 number must begin with valid digits or '.' in '⏏not a number' (indicated by ⏏)
in block <unit> at -e line 1
You can do it in python, if you feel comfortable:
Not tested, just typed:
out = open("filename").read()
lines = out.splitlines()   # splitlines() avoids a trailing empty entry that would break int()
ints = map(int, lines)
s = sum(ints)
print(s)
Sebastian pointed out a one liner script:
cat filename | python -c"from fileinput import input; print sum(map(int, input()))"
The following should work (assuming your number is the second field on each line).
awk 'BEGIN {sum=0} \
{sum=sum + $2} \
END {print "tot:", sum}' Yourinputfile.txt
$ cat n
2
4
2
7
8
9
$ perl -MList::Util -le 'print List::Util::sum(<>)' < n
32
Or, you can type in the numbers on the command line:
$ perl -MList::Util -le 'print List::Util::sum(<>)'
1
3
5
^D
9
However, this one slurps the file so it is not a good idea to use on large files. See j_random_hacker's answer which avoids slurping.
One-liner in Racket:
racket -e '(define (g) (define i (read)) (if (eof-object? i) empty (cons i (g)))) (foldr + 0 (g))' < numlist.txt
C (not simplified)
seq 1 10 | tcc -run <(cat << EOF
#include <stdio.h>
int main(int argc, char** argv) {
    int sum = 0;
    int i = 0;
    while (scanf("%d", &i) == 1) {
        sum = sum + i;
    }
    printf("%d\n", sum);
    return 0;
}
EOF)
My version:
seq -5 10 | xargs printf "- - %s" | xargs | bc
C++ (simplified):
echo {1..10} | scc 'WRL n+=$0; n'
SCC project - http://volnitsky.com/project/scc/
SCC is a C++ snippet evaluator at the shell prompt
