How would I loop over pairs of values without repetition in bash? - bash

I'm using a particular program that would require me to examine pairs of variables in a text file by specifying the pairs using indices.
For example:
gcta --reml-bivar 1 2 --grm test --pheno test.phen --out test
Where 1 and 2 would correspond to values from the first two columns in a text file. If I had 50 columns and wanted to examine each pair without repetition (1&2, 2&3, 1&3 ... 50), what would be the best way to automate this by looping through this? So essentially the script would be executing the same command but taking in pairs of indices like:
gcta --reml-bivar 1 3 --grm test --pheno test.phen --out test
gcta --reml-bivar 1 4 --grm test --pheno test.phen --out test
... so on and so forth. Thanks!

Since you haven't shown us any sample input we're just guessing but if your input is list of numbers (extracted from a file or otherwise) then here's an approach:
$ cat combinations.awk
###################
# Calculate all combinations of a set of strings, see
# https://rosettacode.org/wiki/Combinations#AWK
###################
function get_combs(A,B, i,n,comb) {
## Default value for r is to choose 2 from pool of all elements in A.
## Can alternatively be set on the command line:-
## awk -v r=<number of items being chosen> -f <scriptname>
n = length(A)
if (r=="") r = 2
comb = ""
for (i=1; i <= r; i++) { ## First combination of items:
indices[i] = i
comb = (i>1 ? comb OFS : "") A[indices[i]]
}
B[comb]
## While 1st item is less than its maximum permitted value...
while (indices[1] < n - r + 1) {
## loop backwards through all items in the previous
## combination of items until an item is found that is
## less than its maximum permitted value:
for (i = r; i >= 1; i--) {
## If the equivalently positioned item in the
## previous combination of items is less than its
## maximum permitted value...
if (indices[i] < n - r + i) {
## increment the current item by 1:
indices[i]++
## Save the current position-index for use
## outside this "for" loop:
p = i
break
}
}
## Put consecutive numbers in the remainder of the array,
## counting up from position-index p.
for (i = p + 1; i <= r; i++) indices[i] = indices[i - 1] + 1
## Print the current combination of items:
comb = ""
for (i=1; i <= r; i++) {
comb = (i>1 ? comb OFS : "") A[indices[i]]
}
B[comb]
}
}
# Input should be a list of strings
{
split($0,A)
delete B
get_combs(A,B)
PROCINFO["sorted_in"] = "#ind_str_asc"
for (comb in B) {
print comb
}
}
.
$ awk -f combinations.awk <<< '1 2 3 4'
1 2
1 3
1 4
2 3
2 4
3 4
.
$ while read -r a b; do
echo gcta --reml-bivar "$a" "$b" --grm test --pheno test.phen --out test
done < <(awk -f combinations.awk <<< '1 2 3 4')
gcta --reml-bivar 1 2 --grm test --pheno test.phen --out test
gcta --reml-bivar 1 3 --grm test --pheno test.phen --out test
gcta --reml-bivar 1 4 --grm test --pheno test.phen --out test
gcta --reml-bivar 2 3 --grm test --pheno test.phen --out test
gcta --reml-bivar 2 4 --grm test --pheno test.phen --out test
gcta --reml-bivar 3 4 --grm test --pheno test.phen --out test
Remove the echo when you're done testing and happy with the output.
In case anyone's reading this and wants permutations instead of combinations:
$ cat permutations.awk
###################
# Calculate all permutations of a set of strings, see
# https://en.wikipedia.org/wiki/Heap%27s_algorithm
function get_perm(A, i, lgth, sep, str) {
lgth = length(A)
for (i=1; i<=lgth; i++) {
str = str sep A[i]
sep = " "
}
return str
}
function swap(A, x, y, tmp) {
tmp = A[x]
A[x] = A[y]
A[y] = tmp
}
function generate(n, A, B, i) {
if (n == 1) {
B[get_perm(A)]
}
else {
for (i=1; i <= n; i++) {
generate(n - 1, A, B)
if ((n%2) == 0) {
swap(A, 1, n)
}
else {
swap(A, i, n)
}
}
}
}
function get_perms(A,B) {
generate(length(A), A, B)
}
###################
# Input should be a list of strings
{
split($0,A)
delete B
get_perms(A,B)
PROCINFO["sorted_in"] = "#ind_str_asc"
for (perm in B) {
print perm
}
}
.
$ awk -f permutations.awk <<< '1 2 3 4'
1 2 3 4
1 2 4 3
1 3 2 4
1 3 4 2
1 4 2 3
1 4 3 2
2 1 3 4
2 1 4 3
2 3 1 4
2 3 4 1
2 4 1 3
2 4 3 1
3 1 2 4
3 1 4 2
3 2 1 4
3 2 4 1
3 4 1 2
3 4 2 1
4 1 2 3
4 1 3 2
4 2 1 3
4 2 3 1
4 3 1 2
4 3 2 1
Both of the above use GNU awk for sorted_in to sort the output. If you don't have GNU awk you can still use the scripts as-is and if you need to sort the output then pipe it to sort.

If I understand you correctly and you don't need pairs looks like '1 1', '2 2', ... and '1 2', '2 1' ... try this script
#!/bin/bash
for i in $(seq 1 49);
do
for j in $(seq $(($i + 1)) 50);
do gcta --reml-bivar "$i $j" --grm test --pheno test.phen --out test
done;
done;

1 and 2 would correspond to values from the first two columns in a text file.
each pair without repetition
So let's walk through this process:
We repeat the first column from the file times the file length
We repeat each value (each line) from the second column from the file times the file length
We join the repeated columns -> we have all combinations
We need to filter "repetitions", we can just join the file with the original file and filter out repeating columns
So we get each pair without repetitions.
Then we just read the file line by line.
The script:
# create an input file cause you didn't provide any
cat << EOF > in.txt
1 a
2 b
3 c
4 d
EOF
# get file length
inlen=$(<in.txt wc -l)
# join the columns
paste -d' ' <(
# repeat the first column inlen times
# https://askubuntu.com/questions/521465/how-can-i-repeat-the-content-of-a-file-n-times
seq "$inlen" |
xargs -I{} cut -d' ' -f1 in.txt
) <(
# repeat each line inlen times
# https://unix.stackexchange.com/questions/81904/repeat-each-line-multiple-times
awk -v IFS=' ' -v v="$inlen" '{for(i=0;i<v;i++)print $2}' in.txt
) |
# filter out repetitions - ie. filter original lines from the file
sort |
comm --output-delimiter='' -3 <(sort in.txt) - |
# read the file line by line
while read -r one two; do
echo "$one" "$two"
done
will output:
1 b
1 c
1 d
2 a
2 c
2 d
3 a
3 b
3 d
4 a
4 b
4 c

#!/bin/bash
#set the length of the combination depending the
#user's choice
eval rg+=({1..$2})
#the code builds the script and runs it (eval)
eval `
#Character range depending on user selection
for i in ${rg[#]} ; do
echo "for c$i in {1..$1} ;do "
done ;
#Since the script is based on a code that brings
#all possible combinations even with duplicates -
#this is where the deduplication
#prevention conditioning set by (the script writes
#the conditioning code)
op1=$2
op2=$(( $2 - 1 ))
echo -n "if [ 1 == 1 ] "
while [ $op1 -gt 1 ] ; do
echo -n \&\& [ '$c'$op1 != '$c'$op2 ]' '
op2=$(( op2 -1 )
if [ $op2 == 0 ] ; then
op1=$(( op1 - 1 ))
op2=$(( op1 - 1 ))
fi
done ;
echo ' ; then'
echo -n "echo "
for i in ${rg[#]} ;
do
echo -n '$c'$i
done ;
echo \;
echo fi\;
for i in ${rg[#]} ; do
echo 'done ;'
done;`
example: range length
$ ./combs.bash '{1..2} {a..c} \$ \#' 4
12ab$
12ab#
12acb
12ac$
12ac#
12a$b
12a$c
12a$#
12a#b
12a#c
12a#$
..........

#!/bin/bash
len=$2
eval c=($1)
per()
{
((`grep -Poi '[^" ".]'<<<$2|sort|uniq|wc -l` < $((len - ${1}))))&&{ return;}
(($1 == 0))&&{ echo $2;return;}
for i in ${c[#]} ; do
per "$((${1} - 1 ))" "$2 $i"
done
}
per "$2" ""
#example
$ ./neto '{0..3} {a..d} \# \!' 7
0 1 2 3 a b c
0 1 2 3 a b d
0 1 2 3 a b #
0 1 2 3 a b !
0 1 2 3 a c b
0 1 2 3 a c d
0 1 2 3 a c #
0 1 2 3 a c !
0 1 2 3 a d b
...

Related

Bash - Sum of all the multiples of 3 or 5 below N - timed-out

I'm trying to calculate the sum of all the multiples of 3 or 5 below N in bash but my attempts fail at the speed benchmark.
The input format is described as follow:
The first line is T, which denotes the number of test cases, followed by T lines, each containing a value of N.
Sample input:
2
10
100
Expected output:
23
2318
Here are my attemps:
With bc:
#!/bin/bash
readarray input
printf 'n=%d-1; x=n/3; y=n/5; z=n/15; (1+x)*x/2*3 + (1+y)*y/2*5 - (1+z)*z/2*15\n' "${input[#]:1}" |
bc
With pure bash:
#!/bin/bash
read t
while (( t-- ))
do
read n
echo "$(( --n, x=n/3, y=n/5, z=n/15, (1+x)*x/2*3 + (1+y)*y/2*5 - (1+z)*z/2*15 ))"
done
remark: I'm using t because the input doesn't end with a newline...
Both solutions are evaluated as "too slow", but I really don't know what could be further improved. Do you have an idea?
With awk:
BEGIN {
split("0 0 3 3 8 14 14 14 23 33 33 45 45 45", sums)
split("0 0 1 1 2 3 3 3 4 5 5 6 6 6", ns)
}
NR > 1 {
print fizzbuzz_sum($0 - 1)
}
function fizzbuzz_sum(x, q, r) {
q = int(x / 15)
r = x % 15
return q*60 + q*(q-1)/2*105 + sums[r] + (x-r)*ns[r]
}
It's pretty fast on my old laptop that has an AMD A9-9410 processor
$ printf '%s\n' 2 10 100 | awk -f fbsum.awk
23
2318
$
$ time seq 0 1000000 | awk -f fbsum.awk >/dev/null
real 0m1.532s
user 0m1.542s
sys 0m0.010s
$
And with bc, in case you need it to be capable of handling big numbers too:
{
cat <<EOF
s[1] = 0; s[2] = 0; s[3] = 3; s[4] = 3; s[5] = 8
s[6] = 14; s[7] = 14; s[8] = 14; s[9] = 23; s[10] = 33
s[11] = 33; s[12] = 45; s[13] = 45; s[14] = 45
n[1] = 0; n[2] = 0; n[3] = 1; n[4] = 1; n[5] = 2
n[6] = 3; n[7] = 3; n[8] = 3; n[9] = 4; n[10] = 5
n[11] = 5; n[12] = 6; n[13] = 6; n[14] = 6
define f(x) {
auto q, r
q = x / 15
r = x % 15
return q*60 + q*(q-1)/2*105 + s[r] + (x-r)*n[r]
}
EOF
awk 'NR > 1 { printf "f(%s - 1)\n", $0 }'
} | bc
It's much slower though.
$ printf '%s\n' 2 10 100 | sh ./fbsum.sh
23
2318
$
$ time seq 0 1000000 | sh ./fbsum.sh >/dev/null
real 0m4.980s
user 0m5.224s
sys 0m0.358s
$
Let's start from the basics and try to optimize it as much as possible:
#!/usr/bin/env bash
read N
sum=0
for ((i=1;i<N;++i)); do
if ((i%3 == 0 )) || (( i%5 == 0 )); then
(( sum += i ))
fi
done
echo $sum
In the above, we run the loop N times, perform minimally N comparisons and maximally 2N sums (i and sum). We could speed this up by doing multiple loops with steps of 3 and 5, however, we have to take care of double counting:
#!/usr/bin/env bash
read N
sum=0
for ((i=N-N%3;i>=3;i-=3)); do (( sum+=i )); done
for ((i=N-N%5;i>=5;i-=5)); do (( i%3 == 0 )) && continue; ((sum+=i)); done
echo $sum
We have now maximally 2N/3 + 2N/5 = 16N/15 sums and N/5 comparisons. This is already much faster. We could still optimise it by adding an extra loop with a step of 3*5 to subtract the double counting.
#!/usr/bin/env bash
read N
sum=0
for ((i=N-N%3 ; i>=3 ; i-=3 )); do ((sum+=i)); done
for ((i=N-N%5 ; i>=5 ; i-=5 )); do ((sum+=i)); done
for ((i=N-N%15; i>=15; i-=15)); do ((sum-=i)); done
echo $sum
This brings us to maximally 2(N/3 + N/5 + N/15) = 17N/15 additions and zero comparisons. This is optimal, however, we still have a call to an arithmetic expression per cycle. This we could absorb into the for-loop:
#!/usr/bin/env bash
read N
sum=0
for ((i=N-N%3 ; i>=3 ; sum+=i, i-=3 )); do :; done
for ((i=N-N%5 ; i>=5 ; sum+=i, i-=5 )); do :; done
for ((i=N-N%15; i>=15; sum-=i, i-=15)); do :; done
echo $sum
Finally, the easiest would be to use the formula of the Arithmetic Series removing all loops. Having in mind that bash uses integer arithmetic (i.e m = p*(m/p) + m%p), one can write
#!/usr/bin/env bash
read N
(( sum = ( (3 + N-N%3) * (N/3) + (5 + N-N%5) * (N/5) - (15 + N-N%15) * (N/15) ) / 2 ))
echo $sum
The latter is the fastest possible way (with the exception of numbers below 15) as it does not call any external binary such as bc or awk and performs the task without any loops.
What about something like this
#! /bin/bash
s35() {
m=$(($1-1)); echo $(seq -s+ 3 3 $m) $(seq -s+ 5 5 $m) 0 | bc
}
read t
while read n
do
s35 $n
done
or
s35() {
m=$(($1-1));
{ sort -nu <(seq 3 3 $m) <(seq 5 5 $m) | tr '\n' +; echo 0; } | bc
}
to remove duplicates.
This Shellcheck-clean pure Bash code processes input from echo 1000000; seq 1000000 (one million inputs) in 40 seconds on an unexotic Linux VM:
#! /bin/bash -p
a=( -15 1 -13 -27 -11 -25 -9 7 -7 -21 -5 11 -3 13 -1 )
b=( 0 -8 -2 18 22 40 42 28 28 42 40 22 18 -2 -8 )
read -r t
while (( t-- )); do
read -r n
echo "$(( m=n%15, ((7*n+a[m])*n+b[m])/30 ))"
done
The code depends on the fact that the sum for each value n can be calculated with a quadratic function of the form (7*n**2+A*n+B)/30. The values of A and B depend on the value of n modulo 15. The arrays a and b in the code contain the values of A and B for each possible modulus value ({0..14}). (To avoid doing the algebra I wrote a little Bash program to generate the a and b arrays.)
The code can easily be translated to other programming languages, and would run much faster in many of them.
For a pure bash approach,
#!/bin/bash
DBG=1
echo -e "This will generate the series sum for multiples of each of 3 and 5 ..."
echo -e "\nEnter the number of summation sets to be generated => \c"
read sets
for (( k=1 ; k<=${sets} ; k++))
do
echo -e "\n============================================================"
echo -e "Enter the maximum value of a multiple => \c"
read max
echo ""
for multiplier in 3 5
do
sum=0
iter=$((max/${multiplier}))
for (( i=1 ; i<=${iter} ; i++ ))
do
next=$((${i}*${multiplier}))
sum=$((sum+=${next}))
test ${DBG} -eq 1 && echo -e "\t ${next} ${sum}"
done
echo -e "TOTAL: ${sum} for ${iter} multiples of ${multiplier} <= ${max}\n"
done
done
The session log when DBG=1:
This will generate the series sum for multiples of each of 3 and 5 ...
Enter the number of summation sets to be generated => 2
============================================================
Enter the maximum value of a multiple => 15
3 3
6 9
9 18
12 30
15 45
TOTAL: 45 for 5 multiples of 3 <= 15
5 5
10 15
15 30
TOTAL: 30 for 3 multiples of 5 <= 15
============================================================
Enter the maximum value of a multiple => 12
3 3
6 9
9 18
12 30
TOTAL: 30 for 4 multiples of 3 <= 12
5 5
10 15
TOTAL: 15 for 2 multiples of 5 <= 12
While awk will always be faster than shell, with bash you can use ((m % 3 == 0)) || ((m % 5 == 0)) to identify the multiples of 3 and 5 less than n. You will have to see if it passes the time constraints, but it should be relatively quick,
#!/bin/bash
declare -i t n sum ## handle t, n and sum as integer values
read t || exit 1 ## read t or handle error
while ((t--)); do ## loop t times
sum=0 ## initialize sum zero
read n || exit 1 ## read n or handle error
## loop from 3 to < n
for ((m = 3; m < n; m++)); do
## m is multiple of 3 or multiple of 5
((m % 3 == 0)) || ((m % 5 == 0)) && {
sum=$((sum + m)) ## add m to sum
}
done
echo $sum ## output sum
done
Example Use/Output
With the script in mod35sum.sh and your data in dat/mod35sum.txt you would have:
$ bash sum35mod.sh < dat/sum35mod.txt
23
2318

Sorting Columns From File BASH

I have the following shell script that reads in data from a file inputted at the command line. The file is a matrix of numbers, and I need to separate the file by columns and then sort the columns. Right now I can read the file and output the individual columns but I am getting lost on how to sort. I have inputted a sort statement, but it only sorts the first column.
EDIT:
I have decided to take another route and actual transpose the matrix to turn the columns into rows. Since I have to later calculate the mean and median and have already successfully done this for the file row-wise earlier in the script - it was suggested to me to try and "spin" the matrix if you will to turn the columns into rows.
Here is my UPDATED code
declare -a col=( )
read -a line < "$1"
numCols=${#line[#]} # save number of columns
index=0
while read -a line ; do
for (( colCount=0; colCount<${#line[#]}; colCount++ )); do
col[$index]=${line[$colCount]}
((index++))
done
done < "$1"
for (( width = 0; width < numCols; width++ )); do
for (( colCount = width; colCount < ${#col[#]}; colCount += numCols ) ); do
printf "%s\t" ${col[$colCount]}
done
printf "\n"
done
This gives me the following output:
1 9 6 3 3 6
1 3 7 6 4 4
1 4 8 8 2 4
1 5 9 9 1 7
1 5 7 1 4 7
Though I'm now looking for:
1 3 3 6 6 9
1 3 4 4 6 7
1 2 4 4 8 8
1 1 5 7 9 9
1 1 4 5 7 7
To try and sort the data, I have tried the following:
sortCol=${col[$colCount]}
eval col[$colCount]='($(sort <<<"${'$sortCol'[*]}"))'
Also: (which is how I sorted the row after reading in from line)
sortCol=( $(printf '%s\t' "${col[$colCount]}" | sort -n) )
If you could provide any insight on this, it would be greatly appreciated!
Note, as mentioned in the comments, a pure bash solution isn't pretty. There are a number of ways to do it, but this is probably the most straight forward. The following requires reading all values per line into the array, and saving the matrix stride so it can be transposed to read all column values into a row matrix and sorted. All sorted columns are inserted into new row matrix a2. Transposing that row matrix yields your original matrix back in column sort order.
Note this will work for any rank of column matrix in your file.
#!/bin/bash
test -z "$1" && { ## validate number of input
printf "insufficient input. usage: %s <filename>\n" "${0//*\//}"
exit 1;
}
test -r "$1" || { ## validate file was readable
printf "error: file not readable '%s'. usage: %s <filename>\n" "$1" "${0//*\//}"
exit 1;
}
## function: my sort integer array - accepts array and returns sorted array
## Usage: array=( "$(msia ${array[#]})" )
msia() {
local a=( "$#" )
local sz=${#a[#]}
local _tmp
[[ $sz -lt 2 ]] && { echo "Warning: array not passed to fxn 'msia'"; return 1; }
for((i=0;i<$sz;i++)); do
for((j=$((sz-1));j>i;j--)); do
[[ ${a[$i]} -gt ${a[$j]} ]] && {
_tmp=${a[$i]}
a[$i]=${a[$j]}
a[$j]=$_tmp
}
done
done
echo ${a[#]}
unset _tmp
unset sz
return 0
}
declare -a a1 ## declare arrays and matrix variables
declare -a a2
declare -i cnt=0
declare -i stride=0
declare -i sz=0
while read line; do ## read all lines into array
a1+=( $line );
(( cnt == 0 )) && stride=${#a1[#]} ## calculate matrix stride
(( cnt++ ))
done < "$1"
sz=${#a1[#]} ## calculate matrix size
## print original array
printf "\noriginal array:\n\n"
for ((i = 0; i < sz; i += stride)); do
for ((j = 0; j < stride; j++)); do
printf " %s" ${a1[i+j]}
done
printf "\n"
done
## sort columns from stride array
for ((j = 0; j < stride; j++)); do
for ((i = 0; i < sz; i += stride)); do
arow+=( ${a1[i+j]} )
done
a2+=( $(msia ${arow[#]}) ) ## create sorted array
unset arow
done
## print the sorted array
printf "\nsorted array:\n\n"
for ((j = 0; j < cnt; j++)); do
for ((i = 0; i < sz; i += cnt)); do
printf " %s" ${a2[i+j]}
done
printf "\n"
done
exit 0
Output
$ bash sort_cols2.sh dat/matrix.txt
original array:
1 1 1 1 1
9 3 4 5 5
6 7 8 9 7
3 6 8 9 1
3 4 2 1 4
6 4 4 7 7
sorted array:
1 1 1 1 1
3 3 2 1 1
3 4 4 5 4
6 4 4 7 5
6 6 8 9 7
9 7 8 9 7
Awk script
awk '
{for(i=1;i<=NF;i++)a[i]=a[i]" "$i} #Add to column array
END{
for(i=1;i<=NF;i++){
split(a[i],b) #Split column
x=asort(b) #sort column
for(j=1;j<=x;j++){ #loop through sort
d[j]=d[j](d[j]~/./?" ":"")b[j] #Recreate lines
}
}
for(i=1;i<=NR;i++)print d[i] #Print lines
}' file
Output
1 1 1 1 1
3 3 2 1 1
3 4 4 5 4
6 4 4 7 5
6 6 8 9 7
9 7 8 9 7
Here's my entry in this little exercise. Should handle an arbitrary number of columns. I assume they're space-separated:
#!/bin/bash
linenumber=0
while read line; do
i=0
# Create an array for each column.
for number in $line; do
[ $linenumber == 0 ] && eval "array$i=()"
eval "array$i+=($number)"
(( i++ ))
done
(( linenumber++ ))
done <$1
IFS=$'\n'
# Sort each column
for j in $(seq 0 $i ); do
thisarray=array$j
eval array$j='($(sort <<<"${'$thisarray'[*]}"))'
done
# Print each array's 0'th entry, then 1, then 2, etc...
for k in $(seq 0 ${#array0[#]}); do
for j in $(seq 0 $i ); do
eval 'printf ${array'$j'['$k']}" "'
done
echo ""
done
Not bash but i think this python code worths a look showing how this task can be achieved using built-in functions.
From the interpreter:
$ cat matrix.txt
1 1 1 1 1
9 3 4 5 5
6 7 8 9 7
3 6 8 9 1
3 4 2 1 4
6 4 4 7 7
$ python
Python 2.7.3 (default, Jun 19 2012, 17:11:17)
[GCC 4.4.3] on hp-ux11
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> f = open('./matrix.txt')
>>> for row in zip(*[sorted(list(a))
for a in zip(*[a.split() for a in f.readlines()])]):
... print ' '.join(row)
...
1 1 1 1 1
3 3 2 1 1
3 4 4 5 4
6 4 4 7 5
6 6 8 9 7
9 7 8 9 7

Repeat an element n number of times in an array

Basically, I am trying to repeat each element in the following array [1 2 3] 4 times such that I will get something like this:
[1 1 1 1 2 2 2 2 3 3 3 3]
I tried a very stupid line of code i.e. abc=('1%.0s' {1..4}). But it failed miserably.
I am looking for an efficient one line solution to this problem and preferably, without using loops. If it is not possible to achieve this with just one line, then use loops.
Unless you're trying to avoid loops you can do:
arr=(1 2 3)
for i in ${arr[#]}; do for ((n=1; n<=4; n++)) do echo -n "$i ";done; done; echo
1 1 1 1 2 2 2 2 3 3 3 3
To store the results in an array:
aarr=($(for i in ${arr[#]}; do for ((n=1; n<=4; n++)) do echo -n "$i ";done; done;))
declare -p aarr
declare -a aarr='([0]="1" [1]="1" [2]="1" [3]="1" [4]="2" [5]="2" [6]="2" [7]="2" [8]="3" [9]="3" [10]="3" [11]="3")'
This does what you need and stores it in an array:
declare -a res=($(for v in 1 2 3; do for i in {1..4}; do echo $v; done; done))
Taking your idea to the next step:
$ a=(1 2 3)
$ b=($(for x in "${a[#]}"; do printf "$x%.0s " {1..4}; done))
$ echo ${b[#]}
1 1 1 1 2 2 2 2 3 3 3 3
Alternatively, using sed:
$ echo ${a[*]} | sed -r 's/[[:alnum:]]+/& & & &/g'
1 1 1 1 2 2 2 2 3 3 3 3
Or, using awk:
$ echo ${a[*]} | awk -v RS='[ \n]' '{for (i=1;i<=4;i++)printf "%s ", $0;} END{print""}'
1 1 1 1 2 2 2 2 3 3 3 3
Simple one liner:
for x in 1 2 3 ; do array+="$(printf "%1.0s$x" {1..4})" ;done
Similar to what you wanted.

how to subtract fields pairwise in bash?

I have a large dataset that looks like this:
5 6 5 6 3 5
2 5 3 7 1 6
4 8 1 8 6 9
1 5 2 9 4 5
For every line, I want to subtract the first field from the second, third from fourth and so on deepening on the number of fields (always even). Then, I want to report those lines for which difference from all the pairs exceeds a certain limit (say 2). I should also be able to report next best lines i.e., lines in which one pairwise comparison fails to meet the limit, but all other pairs meet the limit.
from the above example, if I set a limit to 2 then, my output file should contain
best lines:
2 5 3 7 1 6 # because (5-2), (7-3), (6-1) are all > 2
4 8 1 8 6 9 # because (8-4), (8-1), (9-6) are all > 2
next best line(s)
1 5 2 9 4 5 # because except (5-4), both (5-1) and (9-2) are > 2
My current approach is to read every line, save each field as a variable, do subtraction.
But I don't know how to proceed further.
Thanks,
Prints "best" lines to the file "best", and prints "next best" lines to the file "nextbest"
awk '
{
fail_count=0
for (i=1; i<NF; i+=2){
if ( ($(i+1) - $i) <= threshold )
fail_count++
}
if (fail_count == 0)
print $0 > "best"
else if (fail_count == 1)
print $0 > "nextbest"
}
' threshold=2 inputfile
Pretty straightforward stuff.
Loop through fields 2 at a time.
If (next field - current field) does not exceed threshold, increment fail_count
If that line's fail_count is zero, that means it belongs to "best" lines.
Else if that line's fail_count is one, it belongs to "next best" lines.
Here's a bash-way to do it:
#!/bin/bash
threshold=$1
shift
file="$#"
a=($(cat "$file"))
b=$(( ${#a[#]}/$(cat "$file" | wc -l) ))
for ((r=0; r<${#a[#]}/b; r++)); do
br=$((b*r))
for ((c=0; c<b; c+=2)); do
if [[ $(( ${a[br + c+1]} - ${a[br + c]} )) < $threshold ]]; then
break; fi
if [[ $((c+2)) == $b ]]; then
echo ${a[#]:$br:$b}; fi
done
done
Usage:
$ ./script.sh 2 yourFile.txt
2 5 3 7 1 6
4 8 1 8 6 9
This output can then easily be redirected:
$ ./script.sh 2 yourFile.txt > output.txt
NOTE: this does not work properly if you have those empty lines between each line...But I'm sure the above will get you well on your way.
I probably wouldn't do that in bash. Personally, I'd do it in Python, which is generally good for those small quick-and-dirty scripts.
If you have your data in a text file, you can read here about how to get that data into Python as a list of lines. Then you can use a for-loop to process each line:
threshold = 2
results = []
for line in content:
numbers = [int(n) for n in line.split()] # Split it into a list of numbers
pairs = zip(numbers[::2],numbers[1::2]) # Pair up the numbers two and two.
result = [abs(y - x) for (x,y) in pairs] # Subtract the first number in each pair from the second.
if sum(result) > threshold:
results.append(numbers)
Yet another bash version:
First a check function that return nothing but a result code:
function getLimit() {
local pairs=0 count=0 limit=$1 wantdiff=$2
shift 2
while [ "$1" ] ;do
[ $(( $2-$1 )) -ge $limit ] && : $((count++))
: $((pairs++))
shift 2
done
test $((pairs-count)) -eq $wantdiff
}
than now:
while read line ;do getLimit 2 0 $line && echo $line;done <file
2 5 3 7 1 6
4 8 1 8 6 9
and
while read line ;do getLimit 2 1 $line && echo $line;done <file
1 5 2 9 4 5
If you can use awk
$ cat del1
5 6 5 6 3 5
2 5 3 7 1 6
4 8 1 8 6 9
1 5 2 9 4 5
1 5 2 9 4 5 3 9
$ cat del1 | awk '{
> printf "%s _ ",$0;
> for(i=1; i<=NF; i+=2){
> printf "%d ",($(i+1)-$i)};
> print NF
> }' | awk '{
> upper=0;
> for(i=1; i<=($NF/2); i++){
> if($(NF-i)>threshold) upper++
> };
> printf "%d _ %s\n", upper, $0}' threshold=2 | sort -nr
3 _ 4 8 1 8 6 9 _ 4 7 3 6
3 _ 2 5 3 7 1 6 _ 3 4 5 6
3 _ 1 5 2 9 4 5 3 9 _ 4 7 1 6 8
2 _ 1 5 2 9 4 5 _ 4 7 1 6
0 _ 5 6 5 6 3 5 _ 1 1 2 6
You can process result further according to your needs. The result is sorted by ‘goodness’ order.

Calculate mean, variance and range using Bash script

Given a file file.txt:
AAA 1 2 3 4 5 6 3 4 5 2 3
BBB 3 2 3 34 56 1
CCC 4 7 4 6 222 45
Does any one have any ideas on how to calculate the mean, variance and range for each item, i.e. AAA, BBB, CCC respectively using Bash script? Thanks.
Here's a solution with awk, which calculates:
minimum = smallest value on each line
maximum = largest value on each line
average = μ = sum of all values on each line, divided by the count of the numbers.
variance = 1/n × [(Σx)² - Σ(x²)] where
n = number of values on the line = NF - 1 (in awk, NF = number of fields on the line)
(Σx)² = square of the sum of the values on the line
Σ(x²) = sum of the squares of the values on the line
awk '{
min = max = sum = $2; # Initialize to the first value (2nd field)
sum2 = $2 * $2 # Running sum of squares
for (n=3; n <= NF; n++) { # Process each value on the line
if ($n < min) min = $n # Current minimum
if ($n > max) max = $n # Current maximum
sum += $n; # Running sum of values
sum2 += $n * $n # Running sum of squares
}
print $1 ": min=" min ", avg=" sum/(NF-1) ", max=" max ", var=" ((sum*sum) - sum2)/(NF-1);
}' filename
Output:
AAA: min=1, avg=3.45455, max=6, var=117.273
BBB: min=1, avg=16.5, max=56, var=914.333
CCC: min=4, avg=48, max=222, var=5253
Note that you can save the awk script (everything between, but not including, the single-quotes) in a file, say called script, and execute it with awk -f script filename
You can use python:
$ AAA() { echo "$#" | python -c 'from sys import stdin; nums = [float(i) for i in stdin.read().split()]; print(sum(nums)/len(nums))'; }
$ AAA 1 2 3 4 5 6 3 4 5 2 3
3.45454545455
Part 1 (mean):
mean () {
len=$#
echo $* | tr " " "\n" | sort -n | head -n $(((len+1)/2)) | tail -n 1
}
nMean () {
echo -n "$1 "
shift
mean $*
}
mean usage:
nMean AAA 3 4 5 6 3 4 3 6 2 4
4
Part 2 (variance):
variance () {
count=$1
avg=$2
shift
shift
sum=0
for n in $*
do
diff=$((avg-n))
quad=$((diff*diff))
sum=$((sum+quad))
done
echo $((sum/count))
}
sum () {
form="$(echo $*)"
formula=${form// /+}
echo $((formula))
}
nVariance () {
echo -n "$1 "
shift
count=$#
s=$(sum $*)
avg=$((s/$count))
var=$(variance $count $avg $*)
echo $var
}
usage:
nVariance AAA 3 4 5 6 3 4 3 6 2 4
1
Part 3 (range):
range () {
min=$1
max=$1
for p in $* ; do
(( $p < $min )) && min=$p
(( $p > $max )) && max=$p
done
echo $min ":" $max
}
nRange () {
echo -n "$1 "
shift
range $*
}
usage:
nRange AAA 1 2 3 4 5 6 3 4 5 2 3
AAA 1 : 6
nX is short for named X, named mean, named variance, ... .
Note, that I use integer arithmetic, which is, what is possible with the shell. To use floating point arithmetic, you would use bc, for instance. Here you loose precision, which might be acceptable for big natural numbers.
Process all 3 commands for an input line:
processLine () {
nVariance $*
nMean $*
nRange $*
}
Read the data from a file, line by line:
# data:
# AAA 1 2 3 4 5 6 3 4 5 2 3
# BBB 3 2 3 34 56 1
# CCC 4 7 4 6 222 45
while read line
do
processLine $line
done < data
update:
Contrary to my expectation, it doesn't seem easy to handle an unknown number of arguments with functions in bc, for example min (3, 4, 5, 2, 6).
But the need to call bc can be reduced to 2 places, if the input are integers. I used a precision of 2 ("scale=2") - you may change this to your needs.
variance () {
count=$1
avg=$2
shift
shift
sum=0
for n in $*
do
diff="($avg-$n)"
quad="($diff*$diff)"
sum="($sum+$quad)"
done
# echo "$sum/$count"
echo "scale=2;$sum/$count" | bc
}
nVariance () {
echo -n "$1 "
shift
count=$#
s=$(sum $*)
avg=$(echo "scale=2;$s/$count" | bc)
var=$(variance $count $avg $*)
echo $var
}
The rest of the code can stay the same. Please verify that the formula for the variance is correct - I used what I had in mind:
For values (1, 5, 9), I sum up (15) divide by count (3) => 5.
Then I create the diff to the avg for each value (-4, 0, 4), build the square (16, 0, 16), sum them up (32) and divide by count (3) => 10.66
Is this correct, or do I need a square root somewhere ;) ?
Note, that I had to correct the mean calculation. For 1, 5, 9, the mean is 5, not 1 - am I right? It now uses sort -n (numeric) and (len+1)/2.
There is a typo in the accepted answer that causes the variance to be miscalculated. In the print statement:
", var=" ((sum*sum) - sum2)/(NF-1)
should be:
", var=" (sum2 - ((sum*sum)/NF))/(NF-1)
Also, it is better to use something like Welford's algorithm to calculate variance; the algorithm in the accepted answer is unstable when the variance is small relative to the mean:
foo="1 2 3 4 5 6 3 4 5 2 3";
awk '{
M = 0;
S = 0;
for (k=1; k <= NF; k++) {
x = $k;
oldM = M;
M = M + ((x - M)/k);
S = S + (x - M)*(x - oldM);
}
var = S/(NF - 1);
print " var=" var;
}' <<< $foo

Resources