I am using bash to process software responses on the fly, and I am looking for a way to find the
index of the maximum element in an array.
The data that gets fed to the bash script is like this:
25 9
72 0
3 3
0 4
0 7
And so I create two arrays:
arr1 = [ 25 72 3 0 0 ]
arr2 = [ 9 0 3 4 7 ]
And what I need is to find the index of the maximum number in arr1 in order to use it also for arr2.
But I would like to see if there is a quicker, more optimal way to do this.
Would it maybe be better to use a dictionary structure [key][value] with the data I have? Would that make the process easier?
I have also found [1] (from user jhnc), but I don't quite think it is what I want.
My brute-force approach is the following:
function MAX {
    arr1=( 25 72 3 0 0 )
    arr2=( 9 0 3 4 7 )
    local indx=0
    local max=${arr1[0]}
    local flag
    for ((i=1; i<${#arr1[@]}; i++)); do
        # To avoid invalid arithmetic operators when items are floats/doubles
        flag=$( python <<< "print(${arr1[${i}]} > ${max})" )
        if [ "$flag" == "True" ]; then
            indx=${i}
            max=${arr1[${i}]}
        fi
    done
    echo "MAX:INDEX = ${max}:${indx}"
    echo "${arr1[${indx}]}"
    echo "${arr2[${indx}]}"
}
This approach obviously works, but is it the optimal one? Is there a faster way to perform the task? For example, consider float data like:
arr1 = [ 99.97 0.01 0.01 0.01 0 ]
arr2 = [ 0 6 4 3 2 ]
In this example, where the array contains floats, a bash arithmetic comparison fails with
syntax error: invalid arithmetic operator (error token is ".97")
So, I am using
flag=$( python <<< "print(${arr1[${i}]} > ${max})" )
in order to overcome this issue.
Finding a maximum is inherently an O(n) operation. But there's no need to spawn a Python process on each iteration to perform the comparison. Write a single awk script instead.
awk 'BEGIN {
    split(ARGV[1], a1);
    split(ARGV[2], a2);
    max = a1[1];
    indx = 1;
    for (i in a1) {
        if (a1[i] > max) {
            indx = i;
            max = a1[i];
        }
    }
    print "MAX:INDEX = " max ":" (indx - 1)
    print a1[indx]
    print a2[indx]
}' "${arr1[*]}" "${arr2[*]}"
The two shell arrays are passed as space-separated strings to awk, which splits them back into awk arrays.
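For example, with the arrays from the question, this should print (the awk index is converted back to bash's zero-based indexing on output):
MAX:INDEX = 72:1
72
0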
It's difficult to do this efficiently if you really do need to compare floats. Bash can't do floats, which means invoking an external program for every number comparison. However, comparing every number in bash is not necessarily needed.
Here is a fast, pure-bash, integer-only solution using comparison:
#!/bin/bash
arr1=( 25 72 3 0 0 )
arr2=( 9 0 3 4 7 )

# Get the maximum, and also save its index(es)
# (arr1_max starts unset, i.e. 0, so this assumes at least one element is positive)
for i in "${!arr1[@]}"; do
    if ((arr1[i] > arr1_max)); then
        arr1_max=${arr1[i]}
        max_indexes=($i)
    elif [[ "${arr1[i]}" == "$arr1_max" ]]; then
        max_indexes+=($i)
    fi
done

# Print the results
printf '%s\n' \
    "Array 1 max is $arr1_max" \
    "The index(es) of the maximum are:" \
    "${max_indexes[@]}" \
    "The corresponding values from array 2 are:"
for i in "${max_indexes[@]}"; do
    echo "${arr2[i]}"
done
Here is another method that can handle floats. Comparison in bash is avoided altogether; instead, the much faster sort(1) is used, and only once, rather than starting a new Python instance for every number.
#!/bin/bash
arr1=( 25 72 3 0 0 )
arr2=( 9 0 3 4 7 )

arr1_max=$(printf '%s\n' "${arr1[@]}" | sort -n | tail -1)

for i in "${!arr1[@]}"; do
    [[ "${arr1[i]}" == "$arr1_max" ]] &&
        max_indexes+=($i)
done

# Print the results
printf '%s\n' \
    "Array 1 max is $arr1_max" \
    "The index(es) of the maximum are:" \
    "${max_indexes[@]}" \
    "The corresponding values from array 2 are:"
for i in "${max_indexes[@]}"; do
    echo "${arr2[i]}"
done
Example output:
Array 1 max is 72
The index(es) of the maximum are:
1
The corresponding values from array 2 are:
0
Unless you need those arrays, you can also feed your input script directly into something like this:
#!/bin/bash
input-script |
sort -nr |
awk '
(NR==1) {print "Max: "$1"\nCorresponding numbers:"; max = $1}
{if (max == $1) print $2; else exit}'
Example (with some extra numbers):
$ echo \
'25 9
72 0
72 11
72 4
3 3
3 14
0 4
0 1
0 7' |
sort -nr |
awk '(NR==1) {max = $1; print "Max: "$1"\nCorresponding numbers:"}
{if (max == $1) print $2; else exit}'
Max: 72
Corresponding numbers:
4
11
0
You can also do it 100% in awk, including sorting:
$ echo \
'25 9
72 0
72 11
72 4
3 3
3 14
0 4
0 1
0 7' |
awk '
{
    col1[a++] = $1
    line[a-1] = $0
}
END {
    asort(col1)              # gawk-only; the sorted values get indices 1..a
    col1_max = col1[a]       # so the largest value is at index a
    print "Max is "col1_max"\nCorresponding numbers are:"
    for (i in line) {
        if (line[i] ~ "^" col1_max "\\s") {
            split(line[i], max_line)
            print max_line[2]
        }
    }
}'
Max is 72
Corresponding numbers are:
0
11
4
Or, as simply as possible, just to get the maximum of column 1 and any single number from column 2 that corresponds to it:
$ echo \
'25 9
72 0
3 3
0 4
0 7' |
sort -nr |
head -1
72 0
Related
I'm working on a shell script right now. I need to loop through a text file, grab the text from it, find the average, max, and min numbers from each line of numbers, and then print them in a chart with the name of each line. This is the text file:
Experiment1 9 8 1 2 9 0 2 3 4 5
collect1 83 39 84 2 1 3 0 9
jump1 82 -1 9 26 8 9
exp2 22 0 7 1 0 7 3 2
jump2 88 7 6 5
taker1 5 5 44 2 3
So far, all I can do is loop through it and print each line, like so:
#!/bin/bash
while read line
do
echo $line
done < mystats.txt
I'm a beginner and nothing I've found online has helped me.
One way, using perl for all the calculations:
$ perl -MList::Util=min,max,sum -anE 'BEGIN { say "Name\tAvg\tMin\tMax" }
    $n = shift @F; say join("\t", $n, sum(@F)/@F, min(@F), max(@F))' mystats.txt
Name Avg Min Max
Experiment1 4.3 0 9
collect1 27.625 0 84
jump1 22.1666666666667 -1 82
exp2 5.25 0 22
jump2 26.5 5 88
taker1 11.8 2 44
It uses autosplit mode (-a) to split each line into an array (much like awk), and the standard List::Util module's math functions to calculate the mean, min, and max of each line's numbers.
And here's a pure bash version using nothing but builtins (though I don't recommend doing this; among other things, bash doesn't do floating-point math, so the averages are off):
#!/usr/bin/env bash
printf "Name\tAvg\tMin\tMax\n"
while read name nums; do
    read -a numarr <<< "$nums"
    total=0
    min=${numarr[0]}
    max=${numarr[0]}
    for n in "${numarr[@]}"; do
        (( total += n ))
        if [[ $n -lt $min ]]; then
            min=$n
        fi
        if [[ $n -gt $max ]]; then
            max=$n
        fi
    done
    (( avg = total / ${#numarr[*]} ))
    printf "%s\t%d\t%d\t%d\n" "$name" "$avg" "$min" "$max"
done < mystats.txt
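For reference, run against the same mystats.txt, this should print something like the following; the integer division truncates the averages (4 instead of 4.3, and so on):
Name Avg Min Max
Experiment1 4 0 9
collect1 27 0 84
jump1 22 -1 82
exp2 5 0 22
jump2 26 5 88
taker1 11 2 44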
Using awk:
awk '{
    min = $2; max = $2; sum = $2;
    for (i=3; i<=NF; i++) {
        if (min > $i) min = $i;
        if (max < $i) max = $i;
        sum += $i
    }
    printf "for %-20s min=%10i max=%10i avg=%10.3f\n", $1, min, max, sum/(NF-1)
}' mystats.txt
I have an array in Bash that will print out a series of numbers. I would like to find the first available (read: not in the array) number divisible by 8 (including 0).
for i in "${NUMS[@]}"
do
    echo "$i"
done
Will output:
0
1
2
3
8
9
10
11
So in this example, the value would be "16". If 0 or 8 were missing from that array, those would have been selected.
I'm looking at something like:
echo "${NUMS[#]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 0; i in a; ++i); print i }'
which will give me the first missing integer (4), but I have not yet gotten a working result for a multiple of 8.
This should work:
printf '%s\n' "${NUMS[#]}" |
sort -n |
awk 'BEGIN { num=0 } $0 == num { num+=8 } END { print num }'
The idea is to start looking for the number 0, if you find it you start looking for 8 and so on. The variable num gets incremented by 8 each time the number is found to give the next multiple of 8 that hasn't been seen yet.
Sort is only needed if the array isn't already ordered.
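The same scan works in pure bash too, with no external tools, assuming NUMS is already sorted ascending; a minimal sketch:
num=0
for n in "${NUMS[@]}"; do
    # each time the candidate is found, jump to the next multiple of 8
    (( n == num )) && (( num += 8 ))
done
echo "$num"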
Another solution I had working prior to reading Graeme's (much better) solution:
POSSIBLE_VALUES=($(seq 0 8 255))
for i in ${POSSIBLE_VALUES[@]}
do
    match=0
    for j in ${NUMS[@]}
    do
        if [ "${i}" == "${j}" ]
        then
            match=1
            break
        fi
    done
    if [ "${match}" == 0 ]
    then
        c+=($i)
    fi
done
echo ${c[0]}
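Running the above with the NUMS array from the question collects 16 24 32 ... into c, so it prints:
16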
I have a line that goes like:
string 2 2 3 3 1 4
where the 2nd, 4th and 6th columns represent an ID (assuming each ID number is unique) and the 3rd, 5th and 7th columns represent some data associated with the respective ID.
How can I re-arrange the line so that it will be sorted by the ID?
string 1 4 2 2 3 3
Note: a line may have any number of IDs, unlike the example.
Using a shell script, I'm thinking something like
while read n
do
    echo $(echo $n | sort -k (... stuck here) )
done < infile
Another bash alternative, which does not rely on how many IDs there are:
#!/usr/bin/env bash
x='string 2 2 3 3 1 4'
out="${x%% *}"
in=($x)
for (( i = 1; i < ${#in[*]}; i += 2 ))
do
    new[${in[i]}]=${in[i+1]}
done
for i in ${!new[@]}
do
    out="$out $i ${new[i]}"
done
echo $out
You can put a loop around the lot if you then want to read a file.
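Running it as-is should print:
string 1 4 2 2 3 3
The implicit sort comes from bash itself: ${!new[@]} expands an indexed array's keys in ascending numeric order, so no separate sort step is needed.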
I'll add a gawk solution to your long list of options.
This is a standalone script:
#!/usr/bin/env gawk -f
{
    line = $1
    # Collect the tuples into values of an array,
    for (i=2; i<NF; i+=2) a[i] = $i FS $(i+1)
    # This sorts the array "a" by value, numerically, ascending...
    asort(a, a, "@val_num_asc")
    # ...and this for loop gathers the result (asort re-indexes from 1).
    for (i=1; i<=length(a); i++) line = line FS a[i]
    # Finally, print the line,
    print line
    # and clear the array for the next round.
    delete a
}
This works by copying your tuples into an array, sorting the array, then reassembling the sorted tuples in a for loop that prints the array elements.
Note that it's gawk-only (not traditional awk) because of the use of asort().
$ cat infile
string 2 2 3 3 1 4
other 5 1 20 9 3 7
$ ./sorttuples infile
string 1 4 2 2 3 3
other 3 7 5 1 20 9
As a bash script this can be done with:
Code:
#!/usr/bin/env bash
# send field pairs as separate lines
function emit_line() {
    while [ $# -gt 0 ] ; do
        echo "$1" "$2"
        shift; shift
    done
}
# break the line into pieces and send to sort
function sort_line() {
    echo $1
    shift
    emit_line $* | sort
}
# loop through the lines in the file and sort by key-value pairs
while read n; do
    echo $(sort_line $n)
done < infile
File infile:
string 2 2 3 3 1 4
string 2 2 0 3 4 4 1 7
string 2 2 0 3 2 1
Output:
string 1 4 2 2 3 3
string 0 3 1 7 2 2 4 4
string 0 3 2 1 2 2
Update:
Cribbing the sort from grail's version, to remove the (much slower) external sort:
function sort_line() {
    line="$1"
    shift
    while [ $# -gt 0 ] ; do
        data[$1]=$2
        shift; shift
    done
    # bash expands ${!data[@]} in ascending key order, which does the sorting
    for i in ${!data[@]}; do
        line="$line $i ${data[i]}"
    done
    unset data
    echo $line
}
while read n; do
    sort_line $n
done < infile
You can use python for this. This function breaks the columns up into a list of tuples that can then be sorted; itertools.chain is then used to re-assemble the key-value pairs.
Code:
import itertools as it

def sort_line(line):
    # split the line on white space
    x = line.split()
    # make a tuple of key-value pairs
    as_tuples = [tuple(x[i:i+2]) for i in range(1, len(x), 2)]
    # sort the tuples, and flatten them with chain
    sorted_kv = list(it.chain(*sorted(as_tuples)))
    # join the results back into a string
    return ' '.join([x[0]] + sorted_kv)
Test Code:
data = [
"string 2 2 3 3 1 4",
"string 2 2 0 3 4 4 1 7",
]
for line in data:
print(sort_line(line))
Results:
string 1 4 2 2 3 3
string 0 3 1 7 2 2 4 4
I have an input file as below, and I need to do this conversion: col1*0 + col2*1 + col3*2 for every 3-column triplet.
input.txt - all positive numbers, can be decimals; the real file has 1000s of columns.
0 0 0 1 0 0
0 1 0 0 0 1
0 0 1 0 0 0
I have the below gawk line that does that:
gawk '{for(i=1;i<=NF;i+=3)x=(x?x FS:"")(($(i+1))+($(i+2)*2));print x;x=y}' input.txt
0 0
1 2
2 0
Additionally, I need to check whether the 3 numbers are all zeros; if they are all zeros, then the conversion should be -9.
Pseudo code:
if($i==0 && $(i+1)==0 && $(i+2)==0) {-9} else {$(i+1)+$(i+2)*2}
# or, as all numbers are positive:
if(($i+$(i+1)+$(i+2))==0) {-9} else {$(i+1)+$(i+2)*2}
Expected output:
-9 0
1 2
2 -9
Data description:
This data is output from IMPUTE2 - a genotype imputation and haplotype phasing program. Rows are SNPs, columns are samples. Every SNP is represented by 3 columns: 3 numbers per SNP with range 0-1 (the probabilities of alleles AA, AB, BB). So in the above example we have 3 SNPs and 2 samples. Imputation can also be represented as a dosage value: 1 number per SNP with range 0-2. We are trying to convert the probability format into the dosage format. When IMPUTE2 can't assign any probabilities to any of the alleles, it outputs 0 0 0, and then we should convert it to a no-call, -9.
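As a worked example of the conversion formula (a hypothetical SNP with probabilities AA=0.1, AB=0.2, BB=0.7):
$ echo '0.1 0.2 0.7' | gawk '{print $1*0 + $2*1 + $3*2}'
1.6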
You want the result to be different if the three given columns are all 0. For this, you can expand the ternary operator to something like:
gawk '{ for (i=1; i<=NF; i+=3) {
        x = $(i+1) + $(i+2)*2                                 # the sum
        res = res (res ? FS : "") ($i==0 && $(i+1)==0 && $(i+2)==0 ? -9 : x)
    }
    print res; res=""   # print the stored line and empty it for the next loop
}' file
That is, append the value -9 if all three elements are 0; otherwise, append the calculated x:
res = res (res ? FS : "") ($i==0 && $(i+1)==0 && $(i+2)==0 ? -9 : x)
The first parenthesized expression inserts the field separator before every value except the first; the ternary at the end selects -9 when the three columns are all 0, and x otherwise.
If all values are positive, the check can be simplified to just test whether the sum is 0 or not:
($i + $(i+1) + $(i+2)) ? x : -9
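Plugging that into the same one-liner gives identical results for this input (a sketch of the shortened check):
$ gawk '{for(i=1;i<=NF;i+=3){x=$(i+1)+$(i+2)*2; res=res (res?FS:"") (($i+$(i+1)+$(i+2))?x:-9)} print res; res=""}' file
-9 0
1 2
2 -9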
Testing with your file, this works:
$ gawk '{for(i=1;i<=NF;i+=3) {x=$(i+1) + $(i+2)*2; res=res (res ? FS : "") ($i==0 && $(i+1)==0 && $(i+2)==0 ?-9:x)} print res; res=""}' file
-9 0
1 2
2 -9
Another awk one-liner (assuming non-negative input values; note that it hardcodes the two samples per row from the example):
$ awk '{c1=$2+2*$3;c2=$5+2*$6; print c1||$1?c1:-9,c2||$4?c2:-9}' lop
-9 0
1 2
2 -9
I have a file named numbers that simply contains a bunch of random numbers:
1 2 3
7 5 9
2 2 9
5 4 5
7 2 6
I have to create a script that finds the median for each row, and here is my code:
while read -a row
do
    for i in "${row[@]}"
    do
        length=`expr ${#row[@]} % 2`
        if [ $length -ne 0 ] ; then
            mid=`expr ${#row[@]} / 2`
            echo ${row[middle]}
        elif [ $length -eq 0 ] ; then
            val1=`expr ${#row[@]} / 2`
            val2=`expr (${#row[@]} / 2) + 1`
            mid=`expr ($val1 + $val2) / 2`
            echo $mid
    done | sort -n
done < numbers
However, this doesn't work; it shows an error instead. What mistake did I make in this code? Also, I still haven't figured out the proper place for the sort -n, since the numbers need to be sorted before calculating the median, right?
Bash can only do integer arithmetic; you need a tool like bc to compute the average of the two middle elements:
#!/bin/bash
while read -a n ; do
    n=($(IFS=$'\n' ; echo "${n[*]}" | sort -n))
    len=${#n[@]}
    if (( len % 2 )) ; then
        echo ${n[ len / 2 ]}
    else
        bc -l <<< "scale=1; (${n[ len / 2 - 1 ]} + ${n[ len / 2 ]}) / 2"
    fi
done
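Fed the numbers file from the question, this should print the per-row medians (every row there has an odd count, so the bc branch is never taken); median.sh is a hypothetical name for the script above:
$ ./median.sh < numbers
2
7
2
5
6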
I'd probably reach for a higher level language, e.g. Perl:
#!/usr/bin/perl
use warnings;
use strict;

while (<>) {
    my @n = sort { $a <=> $b } split;
    print @n % 2 ? $n[ @n / 2 ]
                 : ($n[ @n / 2 - 1 ] + $n[ @n / 2 ]) / 2,
          "\n";
}
I just had to awk it, for the fun of it.
Notice that it doesn't use an if; it uses fractional index arithmetic instead.
awk '{
    split($0,a)   # create array a from the input line
    asort(a,b)    # sort the values into array b (gnu awk specific)
    # add the median twice (odd count) or the two around it (even count), then halve
    print ( b[int(NF/2+0.7)] + b[int(NF/2+1.2)] )/2
}' numbers
Shortened (67 chars):
awk '{split($0,a);asort(a,b);print(b[int(NF/2+0.7)]+b[int(NF/2+1.2)])/2}' numbers
66 chars golf :-)
awk '{split($0,a);asort(a,b);$0=(b[int(NF/2+0.7)]+b[int(NF/2+1.2)])/2}1' numbers
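All three variants should print the per-row medians of the numbers file:
$ awk '{split($0,a);asort(a,b);print(b[int(NF/2+0.7)]+b[int(NF/2+1.2)])/2}' numbers
2
7
2
5
6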