Bash summing numbers in a file

I have a file which looks like this:
aaa 15
aaa 12
bbb 131
bbb 12
ccc 123
ddddd 1
ddddd 2
ddddd 3
I would like to get a sum for each unique element in the left column, like this, and also calculate the rounded percentage each of these represents out of the total:
aaa 27 - 9%
bbb 143 - 48%
ccc 123 - 41%
ddddd 6 - 2%
How would I accomplish this in BASH?

Since I cannot find any proper duplicate, I am posting an answer. Feel free to report a good one, and I will delete my answer and close as a duplicate.
awk '{count[$1]+=$2} END {for (i in count) print i, count[i]}' file
This creates an array count[key]=value that accumulates the values for a given key. At the end, it loops through the keys and prints each one with its total.
It returns:
aaa 27
ccc 123
bbb 143
ddddd 6
To show percentages, just keep track of the total sum and divide accordingly:
awk '{tot+=$2; count[$1]+=$2}
END {for (i in count)
printf "%s %d - %d%%\n", i, count[i], (count[i]/tot)*100
}' file
So you can get:
aaa 27 - 9%
ccc 123 - 41%
bbb 143 - 47%
ddddd 6 - 2%
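Note that printf's %d truncates, which is why bbb shows 47% here while the question expects 48%. If you want rounding to the nearest integer instead, a small tweak (same input, just adding 0.5 before truncating) does it:
awk '{tot+=$2; count[$1]+=$2}
END {for (i in count)
printf "%s %d - %d%%\n", i, count[i], int((count[i]/tot)*100 + 0.5)
}' file
With the sample data this prints 9%, 48%, 41% and 2%, matching the percentages requested in the question.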

Since you asked for Bash, here's a pure-Bash solution (it needs Bash ≥ 4 for associative arrays):
#!/bin/bash
declare -Ai sums
while read -r ref num; do
# check that num is a valid number or continue
[[ $num = +([[:digit:]]) ]] || continue
sums[$ref]+=$(( 10#$num ))
done < file
for ref in "${!sums[#]}"; do
printf '%s %d\n' "$ref" "${sums[$ref]}"
done
The output is not sorted; pipe through sort (or use a sorting algorithm) to sort it.
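For example, assuming the script above is saved as sums.sh (the file name is just for illustration), sorting by key is as simple as:
./sums.sh | sort -k1,1
aaa 27
bbb 143
ccc 123
ddddd 6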
So now you added the percentage requirement! I hope you're not going to edit the question further adding more and more stuff…
Once we have the associative array sums, we can sum the sums:
sum=0
for x in "${sums[#]}"; do ((sum+=x)); done
and print the percentage:
for ref in "${!sums[#]}"; do
printf '%s %d - %d%%\n' "$ref" "${sums[$ref]}" "$((100*${sums[$ref]}/sum))"
done

And a solution for bash 3, without associative arrays:
while read key value
do
keys=$(echo -e "$keys\n$key")
var=data_$key
(($var=${!var}+$value))
((total=total+$value))
done < input_file
unique=$(echo "${keys:1}" | sort -u)
while read key
do
var=data_$key
((percentage=100*${!var} / total))
echo "$key $percentage%"
done <<EOF
$unique
EOF
Changed to use indirect variable references, rather than the more traditional eval.
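For comparison, a minimal sketch of the two techniques side by side (the variable name data_aaa is just the example naming scheme used above):
key=aaa
data_aaa=27
var=data_$key

# indirect reference: expand the variable whose name is stored in $var
echo "${!var}"        # prints 27

# the more traditional eval equivalent
eval echo "\$$var"    # prints 27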

Related

Replace a value in a file by another one (bash/awk)

I have a file (a coordinates file, for those who know what that is) like the following:
1 C 1
2 C 1 1 1.60000
3 H 5 1 1.10000 2 109.4700
4 H 5 1 1.10000 2 109.4700 3 109.4700 1
and so on. My idea is to replace the value "1.60000" in the second line by other values, using a for loop.
I would like the value to start at, let's say, 0 and stop at 2.0, for example, with an increment step of 0.05.
Here is what I already tried:
#! /bin/bash
a=0;
for ((i=0; i<=10 (for example); i++)); do
awk '{if ((NR==2) && ($5=="1.60000")) {($5=a)} print $0 }' file.dat > ${i}_file.dat
a=$((a+0.05))
done
But unfortunately it doesn't work. I tried a lot of combinations for the {$5=a} statement, but without conclusive results.
Here is what I obtained:
1 C 1
2 C 1 1
3 H 5 1 1.10000 2 109.4700
4 H 5 1 1.10000 2 109.4700 3 109.4700 1
The value 1.60000 simply disappears, or at least is replaced by a blank.
Any advice?
Thanks a lot,
Pierre-Louis
For this, perhaps sed is a better alternative:
$ v=0.00; for((i=0; i<=40; i++)) do
sed '2s/1.60/'"$v"'/' file > file_"$i";
v=$(echo "$v + 0.05" | bc | xargs printf "%.2f\n");
done
Explanation
sed '2s/1.60/'"$v"'/' file changes the value 1.60 on the second line to the value of variable v.
Floating-point arithmetic in bash is hard; this adds 0.05 to the value and formats it (0.05 instead of .05) so that we can use it in the substitution with sed.
Exercise for you: in bash, try to add 0.05 to 0.05 and format the output as 0.10, with a leading zero.
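One possible way to do that (just a sketch; bc and printf do the real work, since bash itself has no floating point):
v=0.05
v=$(printf "%.2f" "$(echo "$v + 0.05" | bc)")
echo "$v"   # 0.10, with the leading zero restored by printf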
Example with awk (glenn's suggestion):
for ((i=0; i<=10; i++)); do
awk -v "i=$i" '
FNR==2 { $5 = sprintf("%2.1f", i*0.5) }  # rewrite the 5th field on line 2
{ print }                                # print every line, modified or not
' file.dat # > "${i}_file.dat" # uncomment for a file output
done
Advantage: awk is the one managing the floating-point arithmetic.
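Pushing that idea further, awk can also take over the outer loop so the shell never touches a number at all. A sketch (the output file names file_0.dat … file_40.dat are an assumption, and your awk needs to allow roughly 40 output files open at once):
awk '
FNR == 2 { for (i = 0; i <= 40; i++) { $5 = sprintf("%.2f", i * 0.05); print > ("file_" i ".dat") }; next }
         { for (i = 0; i <= 40; i++) print > ("file_" i ".dat") }
' file.dat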

Convert decimal to Base-4 in bash

I have been using a pretty basic, and for the most part straightforward, method of converting base-10 numbers {1..256} to base-4 (quaternary) numbers. I use simple division, $(($NUM/4)), to get the quotients, $(($NUM%4)) to get the remainders, and then print the remainders in reverse to arrive at the result. I use the following bash script to do this:
#!/bin/bash
NUM="$1"
main() {
local EXP1=$(($NUM/4))
local REM1=$(($NUM%4))
local EXP2=$(($EXP1/4))
local REM2=$(($EXP1%4))
local EXP3=$(($EXP2/4))
local REM3=$(($EXP2%4))
local EXP4=$(($EXP3/4))
local REM4=$(($EXP3%4))
echo "
$EXP1 remainder $REM1
$EXP2 remainder $REM2
$EXP3 remainder $REM3
$EXP4 remainder $REM4
Answer: $REM4$REM3$REM2$REM1
"
}
main
This script works fine for numbers 0-255 or 1-256, but beyond these ranges the results become mixed and are often repeated or inaccurate. This isn't much of a problem, as I don't intend to convert numbers beyond 256 or less than 0 (negative numbers [yet]).
My question is: "Is there a more simplified method to do this, possibly using expr or bc?"
Base 4 conversion in bash
int2b4() {
local val out num ret=\\n;
for ((val=$1;val;val/=4)){
out=$((val%4))$out;
}
printf ${2+-v} $2 %s${ret[${2+1}]} $out
}
Invoked with only 1 argument, this will convert to base 4 and print the result followed by a newline. If a second argument is present, a variable of this name will be populated, no printing.
int2b4 135
2013
int2b4 12345678
233012011032
int2b4 5432 var
echo $var
1110320
Detailed explanation:
The main part is (and could be written as):
out=""
for (( val=$1 ; val > 0 ; val = val / 4 )) ;do
out="$((val%4))$out"
done
This conversion loop should be easy to understand (I hope).
The local line ensures out, val and num are local empty variables and initialises ret='\n' locally.
The printf line uses some bashisms:
${2+-v} is empty if $2 is empty and expands to -v if it is not.
${ret[${2+1}]} becomes, respectively, ${ret[]} (i.e. ${ret[0]}) and ${ret[1]}.
So this line becomes
printf "%s\n" $out
if there is no second argument ($2), and
printf -v var "%s" $out
if the second argument is var (note that no newline is appended when populating a variable, but one is added for terminal printing).
Conversion back to decimal:
There is a bashism that lets you compute with an arbitrary base in bash:
echo $((4#$var))
5432
echo $((4#1110320))
5432
In a script:
for integer in {1234..1248};do
int2b4 $integer quaternary
backint=$((4#$quaternary))
echo $integer $quaternary $backint
done
1234 103102 1234
1235 103103 1235
1236 103110 1236
1237 103111 1237
1238 103112 1238
1239 103113 1239
1240 103120 1240
1241 103121 1241
1242 103122 1242
1243 103123 1243
1244 103130 1244
1245 103131 1245
1246 103132 1246
1247 103133 1247
1248 103200 1248
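Since the question explicitly mentions bc: bc can also do the conversion on its own with obase, e.g. (a quick sketch, handy as a cross-check for the functions above):
NUM=135
echo "obase=4; $NUM" | bc
2013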
Create a look-up table taking advantage of brace expansion
$ echo {a..c}
a b c
$ echo {a..c}{r..s}
ar as br bs cr cs
$ echo {0..3}{0..3}
00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33
and so, for 0-255 in decimal to base-4
$ base4=({0..3}{0..3}{0..3}{0..3})
$ echo "${base4[34]}"
0202
$ echo "${base4[255]}"
3333
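Note that this table only covers 0-255; since the question goes up to 256, adding one more digit group extends it (at the cost of 5-digit, zero-padded results), for example:
$ base4=({0..3}{0..3}{0..3}{0..3}{0..3})   # 1024 entries, indices 0-1023
$ echo "${base4[256]}"
10000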

bash grep -e to array in a loop

I have a text file with repeated data patterns, and grep keeps returning all of the matches on every loop iteration.
for ((count = 1; count !=17; count++)); do # 17 times
xuz1[count]=`grep -e "1 O1" $out_file | cut -c10-29`
xuz2[count]=`grep -e "2 O2" $out_file | cut -c10-29`
xuz3[count]=`grep -e "3 O3" $out_file | cut -c10-29`
echo ${xuz1[count]}
echo ${xuz2[count]}
echo ${xuz3[count]}
done
data looks like:
some text.....
Text....
.....
1 O1 111111 111111 111111
2 O2 222211 222211 222211
3 O3 643653 652346 757686
some text.....
1 O1 111122 111122 111122
2 O2 222222 222222 222222
3 O3 343653 652346 757683
some text.....
1 O1 111333 111333 111333
2 O2 222333 222333 222333
3 O3 343653 652346 757684
.
.
.
And the result I'm getting:
xuz1[1] = 111111 111111 111111
xuz2[1] = 222211 222211 222211
xuz3[1] = 643653 652346 757686
xuz1[2] = 111111 111111 111111
xuz2[2] = 222211 222211 222211
xuz3[2] = 643653 652346 757686
...
I'm looking for a result like this:
xuz1[1]=111111 111111 111111
xuz2[1]=222211 222211 222211
xuz3[1]=343653 652346 757683
xuz1[2]=111122 111122 111122
xuz2[2]=222222 222222 222222
xuz3[2]=343653 652346 757684
I also tried "grep -m 1 -e".
Which way should I go?
For now I ended up with this one line:
grep -A4 -e "1 O1" $out_file | cut -c10-29
(The "some text....." parts are huge blocks of text.)
A little bash script with a single grep is enough:
grep -E '^[0-9]+ +O[0-9]+ +.*' "$out_file" |
while read idx oidx cols; do
if ((idx == 1)); then
let ++i
name=xuz$i
let j=1
fi
echo "$name[$j]=$cols"
let ++j
done
You haven't really described what you want, but I guess something like this.
awk '! /^[1-9][0-9]* O[0-9] / { n++; m=0; if (NR>1) print ""; next }
{ print "xuz" ++m "[" n "]=" substr($0, 10) }' "$out_file"
If the regex doesn't match, we assume we are looking at one of the "some text" pieces, and that this starts a new record. Increment n and reset m. Otherwise, print the output for this item within this record.
If some text could be more than one line, you will need a minor change, but I hope this should be enough at least to send you in the right direction.
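That minor change could look like this (a sketch: a flag so that a run of several consecutive text lines only starts one new record):
awk '! /^[1-9][0-9]* O[0-9] / { inblock = 0; next }
{ if (!inblock) { inblock = 1; n++; m = 0; if (n > 1) print "" }
print "xuz" ++m "[" n "]=" substr($0, 10) }' "$out_file"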
You can do this in pure Bash, too, though this is going to be highly inefficient - you would expect a Bash while read loop to be at least a hundred times slower than Awk, and the code is markedly less idiomatic and elegant.
while read -r m x result; do
case $m::$x in
[1-9]::O[1-9])
printf 'xuz%d[%d]=%s\n' $m $n "$result";;
*)
# print a blank line between records, but not before the first one
[ -n "${n+x}" ] && printf '\n'
((n++));;
esac
done <"$out_file"
I would aggressively challenge any requirement to do this in pure Bash. If it's for homework, the requirement is unrealistic, and a core skill for shell script authors is to understand the limits of the shell and the strengths of common support tools like Awk. The Awk language is virtually guaranteed to be available wherever you have a shell, in particular a heavy shell like Bash. (In a limited, e.g. embedded, environment a lighter shell like Dash would make more sense; there the let keyword won't be available, though it should not be hard to make this script properly portable.)
The case statement accepts glob patterns, not regular expressions, so the pattern here is slightly less general (we accept one positive digit in the first field).
Thank you all for participating in the discussion.
This is my home project to help my wife extract data from research calculations; the speed-up is around 400 times.
The file used for extracting the data contains around 2000 lines; the needed data blocks look like this and are repeated 10-20 times in the file.
uiyououy COORDINATES
NR ATOM CCCCC X Y Z
1 O1 8.00 0.000000000 0.882236820 -0.789494235
2 O2 8.00 0.000000000 -1.218250722 -1.644061652
3 O3 8.00 0.000000000 1.218328524 0.400260050
4 O4 8.00 0.000000000 -0.882314622 2.033295837
Text text text text
tons of text
To extract the 4 lines I used the expression below:
grep -A4 --no-group-separator -e "1 O1" $from_file | cut -c23-64 > xyz_temp.txt
# grep 4 lines at once to txt
sed -i '/^[ \t]*$/d' xyz_temp.txt
#del empty lines from xyz txt
Next is to convert the strings into numbers (using '| bc -l' for the arithmetic):
while IFS= read line
do
IFS=' ' read -r -a arr_line <<< "$line"
# break line of xyz into 3 numbers
s1=$(echo "${arr_line[0]}" \* 0.529177249 | bc -l)
# some math conversion
s2=$(echo "${arr_line[1]}" \* 0.529177249 | bc -l)
s3=$(echo "${arr_line[2]}" \* 0.529177249 | bc -l)
#-------to array non sorted ------------
arr[$n]=${n}";"${from_file}";"${gd_}";"${frt[count_4s]}";"${n4}";"${s1}";"${s2}";"${s3}
echo ${arr[n]}
#--------------------------------------------
done <"$from_file_txt"
Sort the array:
IFS=$'\n' sorted=($(sort -t \; -k4 -k5 -g <<<"${arr[*]}"))
# -t separator ';' -k column -g generic * to get new line output
#-k4 -k5 sort by column 4 then5
#printf "%s\n" "${sorted[*]}"
unset IFS
The last part combines the data into the result view:
echo "$n"
n2=1
n42=1
count_4s2=1
i=0
echo "============================== sorted =============================="
################### loop for empty 4s lines
printf "%s" ";" ";" ";" ";" ";" "${count_4s2}" ";"
printf "%s\n"
printf "%s\n" "${sorted[i]}"
while [ $i -lt $((n-2)) ]
do
i=$((i+1))
if [ "$n42" = "4" ] # 1234
then n42=0
count_4s2=$((count_4s2+1))
printf "%s" ";" ";" ";" ";" ";" "${count_4s2}" ";"
printf "%s\n"
fi
#--------------------------------------------
n2=$((n2+1))
n42=$((n42+1))
printf "%s\n" "${sorted[i]}"
done ############# while
#00000000000000000000000000000000000000
printf "%s\n"
echo ==END===END===END==
The output looks like this:
============================== sorted ==============================
;;;;;1;
17;A-13_A1+.out;1.3;0.4;1;0;.221176355474853043;-.523049776514580244
18;A-13_A1+.out;1.3;0.4;2;0;-.550350051428402955;-.734584881824005358
19;A-13_A1+.out;1.3;0.4;3;0;.665269869069959489;.133910683627893251
20;A-13_A1+.out;1.3;0.4;4;0;-.336096173116409577;1.123723974181515102
;;;;;2;
13;A-13_A1+.out;1.3;0.45;1;0;.279265277182782148;-.504490787956469897
14;A-13_A1+.out;1.3;0.45;2;0;-.583907412327951988;-.759310392973448167
15;A-13_A1+.out;1.3;0.45;3;0;.662538493711206290;.146829200993661293
16;A-13_A1+.out;1.3;0.45;4;0;-.357896358566036450;1.116971979936256771
;;;;;3;
9;A-13_A1+.out;1.3;0.5;1;0;.339333719743262501;-.482029749553797105
10;A-13_A1+.out;1.3;0.5;2;0;-.612395507070451545;-.788968880150283253
11;A-13_A1+.out;1.3;0.5;3;0;.658674809217196345;.163289820251690233
12;A-13_A1+.out;1.3;0.5;4;0;-.385613021360830052;1.107708808923212876
==END===END===END==
Note: some of the code might not be shown here.
The next step is to paste it into Excel with ';' as the separator.

Bash/shell script: create four random-length strings with fixed total length

I would like to create four strings, each with a random length, but their total length should be 10. So possible length combinations could be:
3 3 3 1
or
4 0 2 2
Which would then (respectively) result in strings like this:
111 222 333 4
or
1111 33 44
How could I do this?
$RANDOM will give you a random integer in range 0..32767.
Using some arithmetic expansion you can do:
remaining=10
for i in {1..3}; do
next=$((RANDOM % remaining)) # get a number in range 0..$remaining
echo -n "$next "
((remaining -= next))
done
echo $remaining
Update: to repeat the number N times, you can use a function like this:
repeat() {
for ((i=0; i<$1; i++)); do
echo -n $1
done
echo
}
repeat 3
333
Here is an algorithm:
Make the first 3 strings with random lengths, each no greater than the remaining total length (subtract from it each time). The rest of the length is your last string.
Consider this:
sumlen=10
for i in {1..3}
do
strlen=$(($RANDOM % $sumlen)); sumlen=$(($sumlen-$strlen)); echo $strlen
done
echo $sumlen
This will output your lengths; now you can create the strings (I assume you know how, but see the sketch below).
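A sketch of that last step, one way among many: turn each length into a run of the digit i with printf and tr (everything here follows the loop above):
sumlen=10
out=""
for i in {1..3}
do
strlen=$(($RANDOM % $sumlen)); sumlen=$(($sumlen-$strlen))
out+="$(printf "%${strlen}s" | tr ' ' "$i") "   # e.g. strlen=3, i=2 -> "222 "
done
out+="$(printf "%${sumlen}s" | tr ' ' 4)"       # the last string uses digit 4
echo "$out"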
An alternative awk solution:
awk 'function r(n) {return int(n*rand())}
BEGIN{srand(); s=10;
for(i=1;i<=3;i++) {a=r(s); s-=a; print a}
print s}'
3
5
1
1
srand() sets a randomized seed; otherwise awk will generate the same random numbers each time.
Here you can combine the next task of generating the strings into the same awk script:
$ awk 'function r(n) {return int(n*rand())};
function rep(n,t) {c="";for(i=1;i<=n;i++) c=c t; return c}
BEGIN{srand(); s=10;
for(j=1;j<=3;j++) {a=r(s); s-=a; printf("%s ", rep(a,j))}
printf("%s\n", rep(s,j))}'
Generated output:
1111 2 3 4444

Pick and print one of three strings at random in Bash script

How can I print a value, either 1, 2 or 3, at random? My best guess failed:
#!/bin/bash
1 = "2 million"
2 = "1 million"
3 = "3 million"
print randomint(1,2,3)
To generate random numbers with bash, use the $RANDOM internal Bash variable:
arr[0]="2 million"
arr[1]="1 million"
arr[2]="3 million"
rand=$[ $RANDOM % 3 ]
echo ${arr[$rand]}
From the bash manual, on RANDOM:
Each time this parameter is referenced, a random integer between 0 and 32767 is generated. The sequence of random numbers may be initialized by assigning a value to RANDOM. If RANDOM is unset, it loses its special properties, even if it is subsequently reset.
Coreutils shuf
Present in coreutils, the shuf command works well if the strings don't contain newlines.
E.g. to pick a letter at random from a, b and c:
printf 'a\nb\nc\n' | shuf -n1
POSIX eval array emulation + RANDOM
Modifying Marty's eval technique to emulate arrays (which are non-POSIX):
a1=a
a2=b
a3=c
eval echo \$$(expr $RANDOM % 3 + 1)
This still leaves RANDOM, which is non-POSIX.
awk's rand() is a POSIX way to get around that.
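A sketch of that awk route for the same three strings (note srand() seeds from the current time, so two runs within the same second pick the same entry):
awk 'BEGIN { srand(); split("2 million:1 million:3 million", a, ":"); print a[int(rand()*3)+1] }'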
64-character alphanumeric string
randomString32() {
index=0
str=""
for i in {a..z}; do arr[index]=$i; index=`expr ${index} + 1`; done
for i in {A..Z}; do arr[index]=$i; index=`expr ${index} + 1`; done
for i in {0..9}; do arr[index]=$i; index=`expr ${index} + 1`; done
for i in {1..64}; do str="$str${arr[$RANDOM%$index]}"; done
echo $str
}
~.$ set -- "First Expression" Second "and Last"
~.$ eval echo \$$(expr $RANDOM % 3 + 1)
and Last
~.$
I want to corroborate using shuf from coreutils with the nice -n1 -e approach.
Example usage, for a random pick among the values a, b, c:
CHOICE=$(shuf -n1 -e a b c)
echo "choice: $CHOICE"
I looked at the balance for two sample sizes (1000 and 10000):
$ for lol in $(seq 1000); do shuf -n1 -e a b c; done > shufdata
$ less shufdata | sort | uniq -c
350 a
316 b
334 c
$ for lol in $(seq 10000); do shuf -n1 -e a b c; done > shufdata
$ less shufdata | sort | uniq -c
3315 a
3377 b
3308 c
Ref: https://www.gnu.org/software/coreutils/manual/html_node/shuf-invocation.html
