String to Integer conversions in shell script and back to String - bash

I would like to convert a string to an integer, operate on that integer, and convert it back to a string in shell.
I have
input_sub=000
while [ -d $input_dir ]
do
echo $input_sub
# HERE I would like to first convert 000 to 0
# then add 1 to it 0-> 1
# then convert that 1 to 001
done
Don't mind much about the while condition.
I would like to do what is described in the comments.
How can I do this?

You can do what you need in a plain sh script, but you must protect against numbers with leading zeros being interpreted as octal numbers. To do what you want, you need a way to remove the leading zeros before converting to a number. Bash provides built-ins that make this simple; in a plain sh script, one option is the old expr syntax, another is calling a utility like sed or grep.
To trim the leading zeros using expr, you must first know how many there are. expr provides two expressions that will work (note that index and substr are extensions found in GNU expr; POSIX leaves their behavior unspecified, so check your system's expr). The first, index, returns the 1-based position of the first character in $input_sub that belongs to a given set of characters, or 0 if there is none. Passing the digits 1 to 9 as the set gives you the position of the first non-zero digit. The form you can use is:
## get index of first non-zero digit (GNU expr)
nonzero=$(expr index "$input_sub" 123456789)
With the index of the first non-zero digit in $nonzero, you can use the substr expression to obtain the number without leading zeros (you know the max number of digits is 3, so obtain the substring from the index to 3), e.g.
num=$(expr substr "$input_sub" "$nonzero" 3) ## remove leading 0's
You need to be able to handle 000 as $input_sub, so go ahead and add an if .. then ... else ... fi to handle that case, e.g.
if [ "$nonzero" -eq 0 ]; then
num=0
else
num=$(expr substr "$input_sub" "$nonzero" 3) ## remove leading 0's
fi
Now you can simply add 1 to get your new number:
newnum=$((num + 1))
To convert the number back to a string of 3 characters representing the number with leading zeros replaced, just use printf with the "%03d" conversion specifier, e.g.
# then convert that 1 to 001
input_sub=$(printf "%03d" "$newnum")
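As an alternative to expr, the same three steps (strip the zeros, add 1, pad again) can be sketched with parameter expansion alone. This is a minimal sketch rather than part of the answer's script: next_sub is a hypothetical helper name, and the width of 3 is taken from the question.

```shell
#!/bin/sh
# next_sub (hypothetical helper): takes a zero-padded string, prints the next one.
next_sub() {
    zeros=${1%%[!0]*}      # the run of leading zeros (everything before the first non-zero digit)
    num=${1#"$zeros"}      # the input with that run removed
    num=${num:-0}          # "000" leaves an empty string, so fall back to 0
    printf '%03d\n' "$((num + 1))"
}

next_sub 000   # prints 001
next_sub 009   # prints 010
next_sub 099   # prints 100
```

Because the zeros are removed before the arithmetic, the shell never sees a value such as 009 that it would reject as invalid octal.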
Putting together a short example showing the progression that takes place, I have replaced your while loop with a loop that runs 20 times, taking input_sub from 000 to 020, to show the operation, and I have added printf statements to show the numbers and the conversion back to string. You simply restore your while and remove the extra printf statements for your use:
#!/bin/sh
input_sub=000
# while [ -d $input_dir ]
while [ "$input_sub" != "020" ] ## temporary loop 000 to 020
do
printf "input_sub: %s " "$input_sub"
# HERE I would like to first convert 000 to 0
# then add 1 to it 0-> 1
## get index of first non-zero digit (GNU expr)
nonzero=$(expr index "$input_sub" 123456789)
if [ "$nonzero" -eq 0 ]; then
num=0
else
num=$(expr substr "$input_sub" "$nonzero" 3) ## remove leading 0's
fi
newnum=$((num + 1))
# then convert that 1 to 001
input_sub=$(printf "%03d" "$newnum")
printf "%2d + 1 = %2d => input_sub: %s\n" "$num" "$newnum" "$input_sub"
done
Example Use/Output
Showing the conversions with the modified while loop, you would get:
$ sh str2int2str.sh
input_sub: 000 0 + 1 = 1 => input_sub: 001
input_sub: 001 1 + 1 = 2 => input_sub: 002
input_sub: 002 2 + 1 = 3 => input_sub: 003
input_sub: 003 3 + 1 = 4 => input_sub: 004
input_sub: 004 4 + 1 = 5 => input_sub: 005
input_sub: 005 5 + 1 = 6 => input_sub: 006
input_sub: 006 6 + 1 = 7 => input_sub: 007
input_sub: 007 7 + 1 = 8 => input_sub: 008
input_sub: 008 8 + 1 = 9 => input_sub: 009
input_sub: 009 9 + 1 = 10 => input_sub: 010
input_sub: 010 10 + 1 = 11 => input_sub: 011
input_sub: 011 11 + 1 = 12 => input_sub: 012
input_sub: 012 12 + 1 = 13 => input_sub: 013
input_sub: 013 13 + 1 = 14 => input_sub: 014
input_sub: 014 14 + 1 = 15 => input_sub: 015
input_sub: 015 15 + 1 = 16 => input_sub: 016
input_sub: 016 16 + 1 = 17 => input_sub: 017
input_sub: 017 17 + 1 = 18 => input_sub: 018
input_sub: 018 18 + 1 = 19 => input_sub: 019
input_sub: 019 19 + 1 = 20 => input_sub: 020
This has been done with sh and expr given your tag [shell]. If you have bash available, you can shorten the script and make it a bit more efficient by using bash built-ins instead of expr. That said, for 1000 directories max, you won't notice much difference. Let me know if you have further questions.
Bash Solution Per-Request in Comment
If you do have bash available, then the [[ ... ]] expression provides the =~ operator which allows an extended REGEX match on the right hand side (e.g. [[ $var =~ REGEX ]]) The REGEX can contain capture groups (parts of the REGEX enclosed by (..)), that are used to fill the BASH_REMATCH array where ${BASH_REMATCH[0]} contains the total expression matched and ${BASH_REMATCH[1]} ... contain each captured part of the regex.
So using [[ ... =~ ... ]] with a capture on the number beginning with [123456789] will leave the wanted number in ${BASH_REMATCH[1]} allowing you to compute the new number using the builtin, e.g.
#!/bin/bash
input_sub=000
# while [ -d $input_dir ]
while [ "$input_sub" != "020" ] ## temporary loop 000 to 020
do
printf "input_sub: %s " "$input_sub"
# HERE I would like to first convert 000 to 0
# then add 1 to it 0-> 1
## [[ .. =~ REGEX ]], captures between (...) in array BASH_REMATCH
if [[ $input_sub =~ ^0*([123456789]+[0123456789]*)$ ]]
then
num=${BASH_REMATCH[1]} ## use number if not all zeros
else
num=0 ## handle 000 case
fi
newnum=$((num + 1))
# then convert that 1 to 001
input_sub=$(printf "%03d" "$newnum")
printf "%2d + 1 = %2d => input_sub: %s\n" "$num" "$newnum" "$input_sub"
done
(same output)
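Bash arithmetic also accepts a base prefix of the form base#number, which gives a shorter route than the regex capture. The sketch below assumes bash (it will not work in dash or other plain sh implementations); bump is a hypothetical helper name, and the output width is taken from the length of the input string.

```shell
#!/bin/bash
# bump (hypothetical helper): increment a zero-padded decimal string, keeping its width.
bump() {
    local width=${#1}
    # 10#$1 forces base 10, so 008 and 009 are not rejected as invalid octal.
    printf '%0*d\n' "$width" "$((10#$1 + 1))"
}

bump 000   # prints 001
bump 008   # prints 009
bump 019   # prints 020
```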
Let me know if you have further questions.

Related

Count overlapping occurrences of a substring *in a very large file* using Bash

I have files on the order of a few dozen gigabytes (genome data) on which I need to find the number of occurrences for a substring. While the answers I've seen here use grep -o then wc -l, this seems like a hacky way that might not work for the very large files I need to work with.
Does the grep -o/wc -l method scale well for large files? If not, how else would I go about doing it?
For example,
aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt
111
    222
     333
            444
             555
              666
must return 6 occurrences for aaa. (Except there are maybe 10 million more lines of this.)
Find 6 overlapping substrings aaa in the string
line="aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt"
You don't want to see the strings, you want to count them.
When you try
# wrong
grep -o -F "aaa" <<< "${line}" | wc -l
you are missing the overlapping strings.
With the substring aaa you have 5 hits in aaaaaaa, so how do you handle ${line}?
Start with
grep -Eo "a{3,}" <<< "${line}"
Result
aaa
aaaa
aaaaa
How many hits do we have? 1 for aaa, 2 for aaaa and 3 for aaaaa.
Compare the total count of characters with the number of lines (wc):
match lines chars add_to_total
aaa 1 4 1
aaaa 1 5 2
aaaaa 1 6 3
For each line, subtract 3 from the count of characters for that line (wc counts the trailing newline as a character).
When the result has 3 lines and 15 characters, calculate
15 characters - (3 lines * 3 characters) = 15 - 9 = 6
In code:
read -r lines chars < <(grep -Eo "a{3,}" <<< "${line}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"
Or for a file
read -r lines chars < <(grep -Eo "a{3,}" "${file}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"
aaa was "easy", how about other searchstrings?
I think you have to look for the substring and think of a formula that works for that substring. abcdefghi will have no overlapping strings, but abcdabc might.
Potential matches with abcdabc are
abcdabc
abcdabcdabc
abcdabcdabcdabc
Use testline
line="abcdabcdabcdabc something else abcdabcdabcdabc no match here abcdabc and abcdabcdabc"
you need "abc(dabc)+" and have
match lines chars add_to_total
abcdabcdabcdabc 1 16 3
abcdabcdabcdabc 1 16 3
abcdabc 1 8 1
abcdabcdabc 1 12 2
For each line, subtract 4 from the count of characters and divide the result by 4. Summed over all lines, this is (characters / 4) - number of lines. When the result has 4 lines and 52 characters, calculate
(52 characters / 4) - 4 lines = 13 - 4 = 9
In code:
read -r lines chars < <(grep -Eo "abc(dabc)+" <<< "${line}" | wc -lc)
echo "Substring count: $(( chars / 4 - lines))"
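If you would rather not derive a formula per search string, a generic counter is possible with awk. The following is a minimal sketch under a few assumptions: count_overlapping is a hypothetical helper name, the search string is a fixed string without backslashes, and matches never span a line break. Each line is scanned with index(), and the search restarts one character after each hit so that overlapping hits are counted.

```shell
# count_overlapping (hypothetical helper): $1 = fixed search string, stdin = text.
count_overlapping() {
    awk -v pat="$1" '
    {
        s = $0
        while ((i = index(s, pat)) > 0) {
            n++
            s = substr(s, i + 1)   # step one character past the hit start, so overlaps are found
        }
    }
    END { print n + 0 }'
}

line="aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt"
printf '%s\n' "$line" | count_overlapping aaa   # prints 6
```

The input is streamed line by line, so memory use depends on the longest line rather than on the file size. The repeated substr() copies can become slow on extremely long lines with many hits, so it is worth timing this on a sample before running it on the full data.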
When you have a large file, you might want to split it first.
I suppose there are 2 approaches to this (both methods report 29/6 for the 2 test lines):
Use the summation method :
# WHINY_USERS=1 is a shell param for mawk-1 to pre-sort array
${input……} | WHINY_USERS=1 {m,g}awk '
BEGIN {
1 FS = "[^a]+(aa?[^a]+)*"
1 OFS = "|"
1 PROCINFO["sorted_in"] = "#ind_str_asc"
} {
2 _ = ""
2 OFS = "|"
2 gsub("^[|]*|[|]*$",_, $!(NF=NF))
2 split(_,__)
split($-_,___,"[|]+")
12 for (_ in ___) {
12 __[___[_]]++
}
2 _____=____=_<_
2 OFS = "\t"
2 print " -- line # "(NR)
7 for (_ in __) {
7 print sprintf(" %20s",_), __[_], \
______=__[_] * (length(_)-2),\
"| "(____+=__[_]), _____+=______
}
print "" }'
|
-- line # 1
aaa 3 3 | 3 3
aaaa 2 4 | 5 7
aaaaa 3 9 | 8 16
aaaaaaaaaaaaaaa 1 13 | 9 29
-- line # 2
aaa 1 1 | 1 1
aaaa 1 2 | 2 3
aaaaa 1 3 | 3 6
Print out all the copies of that substring :
{m,g}awk' {
2 printf("%s%.*s",____=$(_=_<_),_, NF=NF)
9 do { _+=gsub(__,_____)
} while(index($+__,__))
2 if(_) {
2 ____=substr(____,-_<_,_)
2 gsub(".", (":")__, ____)
2 print "}-[(# " (_) ")]--;\f\b" substr(____, 2)
} else { print "" } }' FS='[^a]+(aa?[^a]+)*' OFS='|' __='aaa' _____='aa'
|
aaagtcgaaaaagtccatgcaaataaaagtcgaaaaagtccatgcatatgatactttttttttt
tttttttaaagtcgaaaaagaaaaaaaaaaaaaaatataaaatccatgc}-[(# 29)]--;
aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:
aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa
aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt}-[(# 6)]--;
aaa:aaa:aaa:aaa:aaa:aaa

How to generate N columns with printf

I'm currently using:
printf "%14s %14s %14s %14s %14s %14s\n" $(cat NFE.txt)>prueba.txt
This reads a list in NFE.txt and generates 6 columns. I need to generate N columns where N is a variable.
Is there a simple way of saying something like:
printf "N*(%14s)\n" $(cat NFE.txt)>prueba.txt
Which generates the desired output?
# T1 is a string of N blanks
T1=$(printf "%${N}s")
# Replace every blank in T1 with the string "%14s " and assign the result to T2
T2="${T1// /%14s }"
# Note that T2 contains a trailing blank.
# ${T2% } stands for T2 without that trailing blank
printf "${T2% }\n" $(cat NFE.txt)>prueba.txt
You can do this, although I don't know how robust it will be
$(printf 'printf '; printf '%%14s%0.s' {1..6}; printf '\\n') $(<file)
                                           ^
This is your variable number of strings
It prints out the command with the correct number of string and executes it in a subshell.
Input
10 20 30 40 50 1 0
1 3 45 6 78 9 4 3
123 4
5 4 8 4 2 4
Output
10 20 30 40 50 1
0 1 3 45 6 78
9 4 3 123 4 5
4 8 4 2 4
You could write this in pure bash, but then you could just use an existing language. For example:
printf "$(python -c 'print("%14s "*6)')\n" $(<NFE.txt)
In pure bash, you could write, for example:
repeat() { (($1)) && printf "%s%s" "$2" "$(repeat $(($1-1)) "$2")"; }
and then use that in the printf:
printf "$(repeat 6 "%14s ")\n" $(<NFE.txt)
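The repetition can also be done with a plain loop that builds the format string for any N. This is a minimal sketch: make_format is a hypothetical helper name, and the short demo uses %4s and literal arguments instead of %14s and NFE.txt to keep the output readable.

```shell
# make_format (hypothetical helper): $1 = number of columns, $2 = conversion for each column.
# Prints the conversions joined by blanks, followed by a literal \n for printf to interpret.
make_format() {
    fmt=
    i=0
    while [ "$i" -lt "$1" ]; do
        fmt="$fmt${fmt:+ }$2"    # add a separating blank except before the first column
        i=$((i + 1))
    done
    printf '%s\\n' "$fmt"
}

N=3
printf "$(make_format "$N" %4s)" a b c d e f
```

With N=3 the six arguments are printed as two rows of three columns, because printf reuses the format until all arguments are consumed.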

How do i split the input into chunks of six entries each using bash?

This is the script which i run to output the raw data of data_tripwire.sh
#!/bin/sh
LOG=/var/log/syslog-ng/svrs/sec2tes1
for count in 6 5 4 3 2 1 0
do
MONTH=`date -d"$count month ago" +"%Y-%m"`
CBS=`bzcat $LOG/$MONTH*.log.bz2|grep 10.55.22.41 |sort|uniq | wc -l`
echo $CBS >> /home/secmgr/attmrms1/data_tripwire1.sh
done
for count in 6 5 4 3 2 1 0
do
MONTH=`date -d"$count month ago" +"%Y-%m"`
GFS=`bzcat $LOG/$MONTH*.log.bz2|grep 10.55.22.31 |sort|uniq | wc -l`
echo $GFS >> /home/secmgr/attmrms1/data_tripwire1.sh
done
for count in 6 5 4 3 2 1 0
do
MONTH=`date -d"$count month ago" +"%Y-%m"`
HR1=`bzcat $LOG/$MONTH*.log.bz2|grep 10.55.10.1 |sort|uniq | wc -l `
echo $HR1 >> /home/secmgr/attmrms1/data_tripwire1.sh
done
for count in 6 5 4 3 2 1 0
do
MONTH=`date -d"$count month ago" +"%Y-%m"`
HR2=`bzcat $LOG/$MONTH*.log.bz2|grep 10.55.21.12 |sort|uniq | wc -l`
echo $HR2 >> /home/secmgr/attmrms1/data_tripwire1.sh
done
for count in 6 5 4 3 2 1 0
do
MONTH=`date -d"$count month ago" +"%Y-%m"`
PAYROLL=`bzcat $LOG/$MONTH*.log.bz2|grep 10.55.21.18 |sort|uniq | wc -l`
echo $PAYROLL >> /home/secmgr/attmrms1/data_tripwire1.sh
done
for count in 6 5 4 3 2 1 0
do
MONTH=`date -d"$count month ago" +"%Y-%m"`
INCV=`bzcat $LOG/$MONTH*.log.bz2|grep 10.55.22.71 |sort|uniq | wc -l`
echo $INCV >> /home/secmgr/attmrms1/data_tripwire1.sh
done
data_tripwire.sh
91
58
54
108
52
18
8
81
103
110
129
137
84
15
14
18
11
17
12
6
1
28
6
14
8
8
0
0
28
24
25
23
21
13
9
4
18
17
18
30
13
3
I want to process the first 6 entries (91, 58, 54, 108, 52, 18) from the output above and then break out of the loop. After that, it should continue with the next 6 entries, then break out of the loop again, and so on.
The problem now is that it reads all the 42 numbers without breaking out of the loop.
This is the output of the table
Tripwire
Month CBS GFS HR HR Payroll INCV
cb2db1 gfs2db1 hr2web1 hrm2db1 hrm2db1a incv2svr1
2013-07 85 76 12 28 26 4
2013-08 58 103 18 6 24 18
2013-09 54 110 11 14 25 17
2013-10 108 129 17 8 23 18
2013-11 52 137 12 8 21 30
2013-12 18 84 6 0 13 13
2014-01 8 16 1 0 9 3
The problem now is that it reads all 42 numbers, from 85 through 3.
I want to make a loop which runs from July to January for one server. Then it will calculate the arithmetic mean and standard deviation, which is already done below.
After that is done, it will continue with the next cycle of 6 numbers for the next server and do the same as in the initial cycle. I need assistance with the for loops, using break and continue or anything simpler.
This is my standard deviation calculation
count=0 # Number of data points; global.
SC=3 # Scale to be used by bc. three decimal places.
E_DATAFILE=90 # Data file error
## ----------------- Set data file ---------------------
if [ ! -z "$1" ] # Specify filename as cmd-line arg?
then
datafile="$1" # ASCII text file,
else #+ one (numerical) data point per line!
datafile=/home/secmgr/attmrms1/data_tripwire1.sh
fi # See example data file, below.
if [ ! -e "$datafile" ]
then
echo "\""$datafile"\" does not exist!"
exit $E_DATAFILE
fi
Calculate the mean
arith_mean ()
{
local rt=0 # Running total.
local am=0 # Arithmetic mean.
local ct=0 # Number of data points.
while read value # Read one data point at a time.
do
rt=$(echo "scale=$SC; $rt + $value" | bc)
(( ct++ ))
done
am=$(echo "scale=$SC; $rt / $ct" | bc)
echo $am; return $ct # This function "returns" TWO values!
# Caution: This little trick will not work if $ct > 255!
# To handle a larger number of data points,
#+ simply comment out the "return $ct" above.
} <"$datafile" # Feed in data file.
sd ()
{
mean1=$1 # Arithmetic mean (passed to function).
n=$2 # How many data points.
sum2=0 # Sum of squared differences ("variance").
avg2=0 # Average of $sum2.
sdev=0 # Standard Deviation.
while read value # Read one line at a time.
do
diff=$(echo "scale=$SC; $mean1 - $value" | bc)
# Difference between arith. mean and data point.
dif2=$(echo "scale=$SC; $diff * $diff" | bc) # Squared.
sum2=$(echo "scale=$SC; $sum2 + $dif2" | bc) # Sum of squares.
done
avg2=$(echo "scale=$SC; $sum2 / $n" | bc) # Avg. of sum of squares.
sdev=$(echo "scale=$SC; sqrt($avg2)" | bc) # Square root =
echo $sdev # Standard Deviation.
} <"$datafile" # Rewinds data file.
Showing the output
mean=$(arith_mean); count=$? # Two returns from function!
std_dev=$(sd $mean $count)
echo
echo "<tr><th>Servers</th><th>"Number of data points in \"$datafile"\"</th> <th>Arithmetic mean (average)</th><th>Standard Deviation</th></tr>" >> $HTML
echo "<tr><td>cb2db1<td>$count<td>$mean<td>$std_dev</tr>" >> $HTML
echo "<tr><td>gfs2db1<td>$count<td>$mean<td>$std_dev</tr>" >> $HTML
echo "<tr><td>hr2web1<td>$count<td>$mean<td>$std_dev</tr>" >> $HTML
echo "<tr><td>hrm2db1<td>$count<td>$mean<td>$std_dev</tr>" >> $HTML
echo "<tr><td>hrm2db1a<td>$count<td>$mean<td>$std_dev</tr>" >> $HTML
echo "<tr><td>incv21svr1<td>$count<td>$mean<td>$std_dev</tr>" >> $HTML
echo
I want to split the input into chunks of six entries each with the arithmetic mean and the sd of the entries 1..6, then of the entries 7..12, then of 13..18 etc.
This is the output of the table i want.
Tripwire
Month CBS GFS HR HR Payroll INCV
cb2db1 gfs2db1 hr2web1 hrm2db1 hrm2db1a incv2svr1
2013-07 85 76 12 28 26 4
2013-08 58 103 18 6 24 18
2013-09 54 110 11 14 25 17
2013-10 108 129 17 8 23 18
2013-11 52 137 12 8 21 30
2013-12 18 84 6 0 13 13
2014-01 8 16 1 0 9 3
*Standard
deviation
(7mths) 31.172 35.559 5.248 8.935 5.799 8.580
* Mean
(7mths) 54.428 94.285 11.142 9.142 20.285 14.714
paste - - - - - - < data_tripwire.sh | while read -a values; do
# values is an array with 6 values
# ${values[0]} .. ${values[5]}
arith_mean "${values[@]}"
done
This means you have to rewrite your functions so they don't use read: change
while read value
to
for value in "$@"
@Matt, yes, change both functions to iterate over arguments instead of reading from stdin. Then pass the data file (now called "data_tripwire1.sh"; a terrible file extension for data, use .txt or .dat) into paste to reformat the data so that the first 6 values form the first row. Read the line into the array values (using read -a values) and invoke the functions:
arith_mean () {
local sum=$(IFS=+; echo "$*")
echo "scale=$SC; ($sum)/$#" | bc
}
sd () {
local mean=$1
shift
local sum2=0
for i in "$@"; do
sum2=$(echo "scale=$SC; $sum2 + ($mean-$i)^2" | bc)
done
echo "scale=$SC; sqrt($sum2/$#)"|bc
}
paste - - - - - - < data_tripwire1.sh | while read -a values; do
mean=$(arith_mean "${values[@]}")
sd=$(sd $mean "${values[@]}")
echo "${values[@]} $mean $sd"
done | column -t
91 58 54 108 52 18 63.500 29.038
8 81 103 110 129 137 94.666 42.765
84 15 14 18 11 17 26.500 25.811
12 6 1 28 6 14 11.166 8.648
8 8 0 0 28 24 11.333 10.934
25 23 21 13 9 4 15.833 7.711
18 17 18 30 13 3 16.500 7.973
Note you don't need to return a fancy value from the functions: you know how many points you pass in.
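The chunking and both statistics can also be sketched in a single awk pass, which avoids starting bc once per value. This is a minimal sketch and not the code from the answer above: chunk_stats is a hypothetical helper name, the chunk size is passed as an argument, and the standard deviation is the population form (dividing by the number of points), like the sd function above. awk rounds to three decimals where bc with scale=3 truncates, so the last digit can differ from the table above.

```shell
# chunk_stats (hypothetical helper): $1 = chunk size, stdin = one number per line.
# Prints "count mean sd" for each chunk; a final partial chunk is reported too.
chunk_stats() {
    awk -v size="$1" '
    function emit(   i, mean, ss) {
        mean = sum / n
        for (i = 1; i <= n; i++) ss += (v[i] - mean) ^ 2
        printf "%d %.3f %.3f\n", n, mean, sqrt(ss / n)
        n = 0; sum = 0
    }
    {
        v[++n] = $1; sum += $1
        if (n == size) emit()
    }
    END { if (n > 0) emit() }'
}

printf '%s\n' 91 58 54 108 52 18 8 81 103 110 129 137 | chunk_stats 6
```

Reading the data file instead of a literal list would look like chunk_stats 6 < data_tripwire1.sh.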
Based on Glenn's answer I propose this which needs very little changes to the original:
paste - - - - - - < data_tripwire.sh | while read -a values
do
for value in "${values[@]}"
do
echo "$value"
done | arith_mean
for value in "${values[@]}"
do
echo "$value"
done | sd
done
You can type (or copy & paste) this code directly in an interactive shell. It should work out of the box. Of course, this is not feasible if you intend to use this often, so you can put that code into a text file, make that executable and call that text file as a shell script. In this case you should add #!/bin/bash as first line in that file.
Credit to Glenn Jackman for the use of paste - - - - - - which is the real solution I'd say.
The functions will now be able to only read 6 items in datafile.
arith_mean ()
{
local rt=0 # Running total.
local am=0 # Arithmetic mean.
local ct=0 # Number of data points.
while read value # Read one data point at a time.
do
rt=$(echo "scale=$SC; $rt + $value" | bc)
(( ct++ ))
done
am=$(echo "scale=$SC; $rt / $ct" | bc)
echo $am; return $ct # This function "returns" TWO values!
# Caution: This little trick will not work if $ct > 255!
# To handle a larger number of data points,
#+ simply comment out the "return $ct" above.
} < <(awk -v block="$i" 'NR > (6 * (block - 1)) && NR < (6 * block + 1) {print}' "$datafile") # Feed in one block of the data file.
sd ()
{
mean1=$1 # Arithmetic mean (passed to function).
n=$2 # How many data points.
sum2=0 # Sum of squared differences ("variance").
avg2=0 # Average of $sum2.
sdev=0 # Standard Deviation.
while read value # Read one line at a time.
do
diff=$(echo "scale=$SC; $mean1 - $value" | bc)
# Difference between arith. mean and data point.
dif2=$(echo "scale=$SC; $diff * $diff" | bc) # Squared.
sum2=$(echo "scale=$SC; $sum2 + $dif2" | bc) # Sum of squares.
done
avg2=$(echo "scale=$SC; $sum2 / $n" | bc) # Avg. of sum of squares.
sdev=$(echo "scale=$SC; sqrt($avg2)" | bc) # Square root =
echo $sdev # Standard Deviation.
} < <(awk -v block="$i" 'NR > (6 * (block - 1)) && NR < (6 * block + 1) {print}' "$datafile") # Re-reads the same block of the data file.
From main you will need to set your blocks to read.
for ((i = 1; i <= $(wc -l < "$datafile") / 6; i++))
do
mean=$(arith_mean); count=$? # Two returns from function!
std_dev=$(sd $mean $count)
done
Of course it is better to move the wc -l outside of the loop for faster execution. But you get the idea.
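The block selection itself can be tried in isolation before wiring it into the functions. The sketch below uses sed instead of awk for the same line-range arithmetic; print_block is a hypothetical helper name, and the block size is passed as an argument rather than hard-coded to 6.

```shell
# print_block (hypothetical helper): $1 = block number (1-based), $2 = block size, stdin = data.
print_block() {
    first=$(( ($1 - 1) * $2 + 1 ))   # first line of the block
    last=$(( $1 * $2 ))              # last line of the block
    sed -n "${first},${last}p"
}

# Block 2 of size 6 covers lines 7 to 12.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 13 14 | print_block 2 6
```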
The syntax error occurred between < and ( because of the space. There shouldn't be a space between them. Sorry for the typo.
cat <(awk -F: '{print $1}' /etc/passwd) works.
cat < (awk -F: '{print $1}' /etc/passwd) syntax error near unexpected token `('

Calculate mean, variance and range using Bash script

Given a file file.txt:
AAA 1 2 3 4 5 6 3 4 5 2 3
BBB 3 2 3 34 56 1
CCC 4 7 4 6 222 45
Does any one have any ideas on how to calculate the mean, variance and range for each item, i.e. AAA, BBB, CCC respectively using Bash script? Thanks.
Here's a solution with awk, which calculates:
minimum = smallest value on each line
maximum = largest value on each line
average = μ = sum of all values on each line, divided by the count of the numbers.
variance = 1/n × [(Σx)² - Σ(x²)] where
n = number of values on the line = NF - 1 (in awk, NF = number of fields on the line)
(Σx)² = square of the sum of the values on the line
Σ(x²) = sum of the squares of the values on the line
awk '{
min = max = sum = $2; # Initialize to the first value (2nd field)
sum2 = $2 * $2 # Running sum of squares
for (n=3; n <= NF; n++) { # Process each value on the line
if ($n < min) min = $n # Current minimum
if ($n > max) max = $n # Current maximum
sum += $n; # Running sum of values
sum2 += $n * $n # Running sum of squares
}
print $1 ": min=" min ", avg=" sum/(NF-1) ", max=" max ", var=" ((sum*sum) - sum2)/(NF-1);
}' filename
Output:
AAA: min=1, avg=3.45455, max=6, var=117.273
BBB: min=1, avg=16.5, max=56, var=914.333
CCC: min=4, avg=48, max=222, var=5253
Note that you can save the awk script (everything between, but not including, the single-quotes) in a file, say called script, and execute it with awk -f script filename
You can use python:
$ AAA() { echo "$@" | python -c 'from sys import stdin; nums = [float(i) for i in stdin.read().split()]; print(sum(nums)/len(nums))'; }
$ AAA 1 2 3 4 5 6 3 4 5 2 3
3.45454545455
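Part 1 (mean; note that this function returns the middle element of the sorted values, which is the median rather than the arithmetic mean):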
mean () {
len=$#
echo $* | tr " " "\n" | sort -n | head -n $(((len+1)/2)) | tail -n 1
}
nMean () {
echo -n "$1 "
shift
mean $*
}
mean usage:
nMean AAA 3 4 5 6 3 4 3 6 2 4
AAA 4
Part 2 (variance):
variance () {
count=$1
avg=$2
shift
shift
sum=0
for n in $*
do
diff=$((avg-n))
quad=$((diff*diff))
sum=$((sum+quad))
done
echo $((sum/count))
}
sum () {
form="$(echo $*)"
formula=${form// /+}
echo $((formula))
}
nVariance () {
echo -n "$1 "
shift
count=$#
s=$(sum $*)
avg=$((s/$count))
var=$(variance $count $avg $*)
echo $var
}
usage:
nVariance AAA 3 4 5 6 3 4 3 6 2 4
AAA 1
Part 3 (range):
range () {
min=$1
max=$1
for p in $* ; do
(( $p < $min )) && min=$p
(( $p > $max )) && max=$p
done
echo $min ":" $max
}
nRange () {
echo -n "$1 "
shift
range $*
}
usage:
nRange AAA 1 2 3 4 5 6 3 4 5 2 3
AAA 1 : 6
nX is short for named X, named mean, named variance, ... .
Note that I use integer arithmetic, which is what the shell supports. To use floating point arithmetic, you would use bc, for instance. Here you lose precision, which might be acceptable for big natural numbers.
Process all 3 commands for an input line:
processLine () {
nVariance $*
nMean $*
nRange $*
}
Read the data from a file, line by line:
# data:
# AAA 1 2 3 4 5 6 3 4 5 2 3
# BBB 3 2 3 34 56 1
# CCC 4 7 4 6 222 45
while read line
do
processLine $line
done < data
update:
Contrary to my expectation, it doesn't seem easy to handle an unknown number of arguments with functions in bc, for example min (3, 4, 5, 2, 6).
But the need to call bc can be reduced to 2 places, if the input are integers. I used a precision of 2 ("scale=2") - you may change this to your needs.
variance () {
count=$1
avg=$2
shift
shift
sum=0
for n in $*
do
diff="($avg-$n)"
quad="($diff*$diff)"
sum="($sum+$quad)"
done
# echo "$sum/$count"
echo "scale=2;$sum/$count" | bc
}
nVariance () {
echo -n "$1 "
shift
count=$#
s=$(sum $*)
avg=$(echo "scale=2;$s/$count" | bc)
var=$(variance $count $avg $*)
echo $var
}
The rest of the code can stay the same. Please verify that the formula for the variance is correct - I used what I had in mind:
For values (1, 5, 9), I sum up (15) divide by count (3) => 5.
Then I create the diff to the avg for each value (-4, 0, 4), build the square (16, 0, 16), sum them up (32) and divide by count (3) => 10.66
Is this correct, or do I need a square root somewhere ;) ?
Note, that I had to correct the mean calculation. For 1, 5, 9, the mean is 5, not 1 - am I right? It now uses sort -n (numeric) and (len+1)/2.
There is a typo in the accepted answer that causes the variance to be miscalculated. In the print statement:
", var=" ((sum*sum) - sum2)/(NF-1)
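should be (with n = NF-1 values per line, since the first field is the label, and dividing by n-1 for the sample variance):
", var=" (sum2 - ((sum*sum)/(NF-1)))/(NF-2)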
Also, it is better to use something like Welford's algorithm to calculate variance; the algorithm in the accepted answer is unstable when the variance is small relative to the mean:
foo="1 2 3 4 5 6 3 4 5 2 3";
awk '{
M = 0;
S = 0;
for (k=1; k <= NF; k++) {
x = $k;
oldM = M;
M = M + ((x - M)/k);
S = S + (x - M)*(x - oldM);
}
var = S/(NF - 1);
print " var=" var;
}' <<< $foo
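Applied to the labeled file from the question, the same running update can produce the mean, variance and range per line in one pass. This is a minimal sketch under stated assumptions: line_stats is a hypothetical helper name, the first field is taken to be the label, and the variance is the sample variance (dividing by n-1), as in the snippet above.

```shell
# line_stats (hypothetical helper): reads "LABEL v1 v2 ..." lines from the given files or stdin.
line_stats() {
    awk '{
        n = 0; mean = 0; s = 0
        min = max = $2 + 0
        for (k = 2; k <= NF; k++) {
            x = $k + 0; n = k - 1          # n counts the values seen so far on this line
            if (x < min) min = x
            if (x > max) max = x
            old = mean
            mean += (x - mean) / n         # Welford update of the running mean
            s += (x - mean) * (x - old)    # running sum of squared deviations
        }
        var = (n > 1) ? s / (n - 1) : 0
        printf "%s: min=%s max=%s range=%s mean=%.3f var=%.3f\n", $1, min, max, max - min, mean, var
    }' "$@"
}

echo "AAA 1 2 3 4 5 6 3 4 5 2 3" | line_stats
# prints: AAA: min=1 max=6 range=5 mean=3.455 var=2.273
```

For the file from the question, the call would be line_stats file.txt.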

In bash, how could I add integers with leading zeroes and maintain a specified buffer

For example, I want to count from 001 to 100. Meaning the zero buffer would start off with 2, 1, then eventually 0 when it reaches 100 or more.
ex:
001
002
...
010
011
...
098
099
100
I could do this if the numbers had a predefined number of zeroes with printf "%02d" $i. But that's static and not dynamic and would not work in my example.
If by static versus dynamic you mean that you'd like to be able to use a variable for the width, you can do this:
$ padtowidth=3
$ for i in 0 {8..11} {98..101}; do printf "%0*d\n" $padtowidth $i; done
000
008
009
010
011
098
099
100
101
The asterisk is replaced by the value of the variable it corresponds to in the argument list ($padtowidth in this case).
Otherwise, the only reason your example doesn't work is that you use "2" (perhaps as if it were the maximum padding to apply) when it should be "3" (as in my example) since that value is the resulting total width (not the pad-only width).
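The same %0*d form also helps when the starting value is itself a zero-padded string and the padding should be kept after adding to it. This is a minimal sketch: add_padded is a hypothetical helper name, the width is taken from the length of the input string, and the leading zeros are stripped before the arithmetic so that values like 008 are not read as octal.

```shell
# add_padded (hypothetical helper): $1 = zero-padded string, $2 = integer to add.
add_padded() {
    width=${#1}                # total width to preserve
    zeros=${1%%[!0]*}          # the run of leading zeros
    num=${1#"$zeros"}          # the digits without them (empty for all zeros)
    printf '%0*d\n' "$width" "$(( ${num:-0} + $2 ))"
}

add_padded 007 5     # prints 012
add_padded 0099 1    # prints 0100
add_padded 99 1      # prints 100 (the result no longer fits in 2 digits)
```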
If your system has it, try seq with the -w (--equal-width) option:
$ seq -s, -w 1 10
01,02,03,04,05,06,07,08,09,10
$ for i in `seq -w 95 105` ; do echo -n " $i" ; done
095 096 097 098 099 100 101 102 103 104 105
In Bash version 4 or later (check with bash --version) you can use brace expansion. Putting a 0 before either limit forces the numbers to be zero-padded to the same width.
echo {01..100} # 001 002 003 ...
echo {03..100..3} # 003 006 009 ...
#!/bin/bash
max=100;
for ((i=1;i<=$max;i++)); do
printf "%0*d\n" ${#max} $i
done
The code above will auto-pad your numbers with the correct number of 0's based upon how many digits the max/terminal value contains. All you need to do is change the max variable and it will handle the rest.
Examples:
max=10
01
02
03
04
05
06
07
08
09
10
max=100
001
002
003
004
005
006
...
097
098
099
100
max=1000
0001
0002
0003
0004
0005
0006
...
0997
0998
0999
1000
# jot is available on FreeBSD, Mac OS X, ...
jot -s " " -w '%03d' 5
jot -s " " -w '%03d' 10
jot -s " " -w '%03d' 50
jot -s " " -w '%03d' 100
If you need to pad values up to a variable number with variable padding:
values_count=514
padding_width=5
for i in 0 `seq 1 $(($values_count - 1))`; do printf "%0*d\n" $padding_width $i; done;
This would print out 00000, 00001, ... 00513.
(I didn't find any of the current answers meeting my need)
