Calculate Median in Multiple Rows - bash

I have a file named numbers that simply contains a bunch of random numbers:
1 2 3
7 5 9
2 2 9
5 4 5
7 2 6
I have to create a script that finds the median for each row, and here is my code:
while read -a row
do
    for i in "${row[@]}"
    do
        length=`expr ${#row[@]} % 2`
        if [ $length -ne 0 ] ; then
            mid=`expr ${#row[@]} / 2`
            echo ${row[middle]}
        elif [ $length -eq 0 ] ; then
            val1=`expr ${#row[@]} / 2`
            val2=`expr (${$row[@]} / 2) + 1`
            mid=`expr ($val1 + $val2) / 2`
            echo $mid
    done | sort -n
done < numbers
However, this doesn't work; it shows an error instead. What mistake did I make in this code? Also, I still haven't figured out the proper place to put the sort -n, since each row needs to be sorted before calculating the median, right?

Bash can only do integer arithmetic; you need a tool like bc to compute the average:
#!/bin/bash
while read -a n ; do
    n=($(IFS=$'\n' ; echo "${n[*]}" | sort -n))
    len=${#n[@]}
    if (( len % 2 )) ; then
        echo ${n[ len / 2 ]}
    else
        bc -l <<< "scale=1; (${n[ len / 2 - 1 ]} + ${n[ len / 2 ]}) / 2"
    fi
done
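Saved as, say, median.sh (the name is arbitrary) and run as ./median.sh < numbers, it prints one median per line; every row of the sample file has an odd count of three values:
2
7
2
5
6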
I'd probably reach for a higher-level language, e.g. Perl:
#!/usr/bin/perl
use warnings;
use strict;

while (<>) {
    my @n = sort { $a <=> $b } split;
    print @n % 2 ? $n[ @n / 2 ]
                 : ($n[ @n / 2 - 1 ] + $n[ @n / 2 ]) / 2,
          "\n";
}
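Run on the question's numbers file (perl median.pl numbers, with whatever name you saved it under), it prints the same medians: 2, 7, 2, 5 and 6, one per line.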

I just had to awk it, for the fun of it.
Notice I don't use an if, but fractional indexes instead.
awk '{
    split($0,a)   # create array a from the input line
    asort(a,b)    # sort array a into array b (gnu awk specific)
    # add the median twice (odd NF), or the two values around
    # the median (even NF), and divide by 2
    print ( b[int(NF/2+0.7)] + b[int(NF/2+1.2)] )/2
}' numbers
Shortened (67 chars):
awk '{split($0,a);asort(a,b);print(b[int(NF/2+0.7)]+b[int(NF/2+1.2)])/2}' numbers
66 chars golf :-)
awk '{split($0,a);asort(a,b);$0=(b[int(NF/2+0.7)]+b[int(NF/2+1.2)])/2}1' numbers
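To see why the fractional offsets work: for an odd row of NF=3 values, NF/2 is 1.5, so int(1.5+0.7) and int(1.5+1.2) are both 2, and the middle element b[2] is added to itself and halved. For an even NF=4, NF/2 is 2, so int(2.7)=2 and int(3.2)=3, giving the average of the two middle elements b[2] and b[3].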

Related

Optimally finding the index of the maximum element in BASH array

I am using bash in order to process software responses on-the-fly and I am looking for a way to find the
index of the maximum element in the array.
The data that gets fed to the bash script is like this:
25 9
72 0
3 3
0 4
0 7
And so I create two arrays:
arr1 = [ 25 72 3 0 0 ]
arr2 = [ 9 0 3 4 7 ]
And what I need is to find the index of the maximum number in arr1 in order to use it also for arr2.
But I would like to see if there is a quick, optimal way to do this.
Would it maybe be better to use a dictionary structure [key][value] with the data I have? Would this make the process easier?
I have also found [1] (from user jhnc) but I don't quite think it is what I want.
My brute-force approach is the following:
function MAX {
    arr1=( 25 72 3 0 0 )
    arr2=( 9 0 3 4 7 )
    local indx=0
    local max=${arr1[0]}
    local flag
    for ((i=1; i<${#arr1[@]}; i++)); do
        # To avoid invalid arithmetic operators when items are floats/doubles
        flag=$( python <<< "print(${arr1[${i}]} > ${max})")
        if [ $flag == "True" ]; then
            indx=${i}
            max=${arr1[${i}]}
        fi
    done
    echo "MAX:INDEX = ${max}:${indx}"
    echo "${arr1[${indx}]}"
    echo "${arr2[${indx}]}"
}
This approach obviously will work, BUT, is it the optimal one? Is there a faster way to perform the task?
arr1 = [ 99.97 0.01 0.01 0.01 0 ]
arr2 = [ 0 6 4 3 2 ]
In this example, if an array contains floats then I would get a
syntax error: invalid arithmetic operator (error token is ".97)
So I am using
flag=$( python <<< "print(${arr1[${i}]} > ${max})")
in order to overcome this issue.
Finding a maximum is inherently an O(n) operation. But there's no need to spawn a Python process on each iteration to perform the comparison. Write a single awk script instead.
awk 'BEGIN {
    split(ARGV[1], a1);
    split(ARGV[2], a2);
    max=a1[1];
    indx=1;
    for (i in a1) {
        if (a1[i] > max) {
            indx = i;
            max = a1[i];
        }
    }
    print "MAX:INDEX = " max ":" (indx - 1)
    print a1[indx]
    print a2[indx]
}' "${arr1[*]}" "${arr2[*]}"
The two shell arrays are passed as space-separated strings to awk, which splits them back into awk arrays.
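With the question's sample arrays, this prints:
MAX:INDEX = 72:1
72
0
(indx - 1 converts awk's 1-based array index back to the 0-based bash index.)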
It's difficult to do this efficiently if you really do need to compare floats. Bash can't do floats, which means invoking an external program for every number comparison. However, comparing every number in bash is not necessarily needed.
Here is a fast, pure bash, integer-only solution, using comparison:
#!/bin/bash
arr1=( 25 72 3 0 0 )
arr2=( 9 0 3 4 7 )

# Get the maximum, and also save its index(es)
for i in "${!arr1[@]}"; do
    if ((arr1[i]>arr1_max)); then
        arr1_max=${arr1[i]}
        max_indexes=($i)
    elif [[ "${arr1[i]}" == "$arr1_max" ]]; then
        max_indexes+=($i)
    fi
done

# Print the results
printf '%s\n' \
    "Array1 max is $arr1_max" \
    "The index(s) of the maximum are:" \
    "${max_indexes[@]}" \
    "The corresponding values from array 2 are:"
for i in "${max_indexes[@]}"; do
    echo "${arr2[i]}"
done
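With the sample arrays, the output is:
Array1 max is 72
The index(s) of the maximum are:
1
The corresponding values from array 2 are:
0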
Here is another method that can handle floats. Comparison in bash is avoided altogether; instead, the much faster sort(1) is used, and only once, rather than starting a new python instance for every number.
#!/bin/bash
arr1=( 25 72 3 0 0 )
arr2=( 9 0 3 4 7 )

arr1_max=$(printf '%s\n' "${arr1[@]}" | sort -n | tail -1)
for i in "${!arr1[@]}"; do
    [[ "${arr1[i]}" == "$arr1_max" ]] &&
        max_indexes+=($i)
done

# Print the results
printf '%s\n' \
    "Array 1 max is $arr1_max" \
    "The index(s) of the maximum are:" \
    "${max_indexes[@]}" \
    "The corresponding values from array 2 are:"
for i in "${max_indexes[@]}"; do
    echo "${arr2[i]}"
done
Example output:
Array 1 max is 72
The index(s) of the maximum are:
1
The corresponding values from array 2 are:
0
Unless you need those arrays, you can also feed the output of your input script directly into something like this:
#!/bin/bash
input-script |
sort -nr |
awk '
(NR==1) {print "Max: "$1"\nCorresponding numbers:"; max = $1}
{if (max == $1) print $2; else exit}'
Example (with some extra numbers):
$ echo \
'25 9
72 0
72 11
72 4
3 3
3 14
0 4
0 1
0 7' |
sort -nr |
awk '(NR==1) {max = $1; print "Max: "$1"\nCorresponding numbers:"}
{if (max == $1) print $2; else exit}'
Max: 72
Corresponding numbers:
4
11
0
You can also do it 100% in awk, including sorting:
$ echo \
'25 9
72 0
72 11
72 4
3 3
3 14
0 4
0 1
0 7' |
awk '
{
    col1[a++] = $1
    line[a-1] = $0
}
END {
    asort(col1)
    col1_max = col1[a]
    print "Max is "col1_max"\nCorresponding numbers are:"
    for (i in line) {
        if (line[i] ~ col1_max"\\s") {
            split(line[i], max_line)
            print max_line[2]
        }
    }
}'
Max is 72
Corresponding numbers are:
0
11
4
Or, as simply as possible, just to get the maximum of column 1 and any single number from column 2 that corresponds to it:
$ echo \
'25 9
72 0
3 3
0 4
0 7' |
sort -nr |
head -1
72 0

calculate mean on even and odd lines independently in bash

I have a file with 2 columns and many rows. I would like to calculate the mean of each column for odd and even lines independently, so that in the end I would have a file with 4 values: 2 columns with the odd and even means.
My file looks like this:
2 4
4 4
6 8
3 5
6 9
2 1
In the end I would like to obtain a file with the mean of 2,6,6 and 4,3,2 in the first column and the mean of 4,8,9 and 4,5,1 in the second column, that is:
4.66 7
3 3.33
If anyone could give me some advice I'd really appreciate it; for the moment I'm only able to calculate the mean of all rows (not split into even and odd). Thank you very much in advance!
This is a hardcoded awk example, but you get the point:
awk 'NR%2{o1+=$1;o2+=$2;c++;next}
     {e1+=$1;e2+=$2;d++}
     END{print o1/c"\t"o2/c"\n"e1/d"\t"e2/d}' your_file
4.66667 7
3 3.33333
A more generalized version of Juan Diego Godoy's answer. Relies on GNU awk:
gawk '
{
    parity = NR % 2 == 1 ? "odd" : "even"
    for (i=1; i<=NF; i++) {
        sum[parity][i] += $i
        count[parity][i] += 1
    }
}
function result(parity) {
    for (i=1; i<=NF; i++)
        printf "%g\t", sum[parity][i] / count[parity][i]
    print ""
}
END { result("odd"); result("even") }
'
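Fed the question's file on standard input, it prints the odd-line means and then the even-line means, tab-separated:
4.66667	7
3	3.33333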
This answer uses Bash and bc. It assumes that the input file consists of only integers and that there is an even number of lines.
#!/bin/bash
while read -r oddcol1 oddcol2; read -r evencol1 evencol2
do
    (( oddcol1sum += oddcol1 ))
    (( oddcol2sum += oddcol2 ))
    (( evencol1sum += evencol1 ))
    (( evencol2sum += evencol2 ))
    (( count++ ))
done < inputfile

cat <<EOF | bc -l
scale=2
print "Odd Column 1 Mean: "; $oddcol1sum / $count
print "Odd Column 2 Mean: "; $oddcol2sum / $count
print "Even Column 1 Mean: "; $evencol1sum / $count
print "Even Column 2 Mean: "; $evencol2sum / $count
EOF
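With the sample input file, this prints (scale=2 truncates rather than rounds):
Odd Column 1 Mean: 4.66
Odd Column 2 Mean: 7.00
Even Column 1 Mean: 3.00
Even Column 2 Mean: 3.33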
It could be modified to use arrays to make it more flexible.

How do I perform this calculation and output it to standard out?

I am trying to do this in Bash:
read n
echo int(math.ceil((math.sqrt(1 + 8 * n) - 1) / 2))
Of course this isn't working syntax but I am just putting it there so you can tell what I am trying to do.
Is there an easy way to actually make this into valid Bash?
Although you ask to do this in Bash, there's no native support for functions like square root or ceiling. It would be simpler to delegate to Perl:
perl -wmPOSIX -e "print POSIX::ceil((sqrt(1 + 8 * $n) - 1) / 2)"
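For example, with n=3 (so the square root is exactly 5), this prints 2:
$ n=3
$ perl -wmPOSIX -e "print POSIX::ceil((sqrt(1 + 8 * $n) - 1) / 2)"
2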
Alternatively, you could use bc to calculate the square root, and some Bash to calculate the ceiling.
Let's define a function that prints the result of the formula, using bc's sqrt:
formula() {
    local n=$1
    bc -l <<< "(sqrt(1 + 8 * $n) - 1) / 2"
}
The -l flag changes the scale from the default 0 to 20.
This affects the scale in the display of floating point results.
For example, with the default zero, 10 / 3 would print just 3.
We need the floating point details in the next step to compute the ceiling.
ceil() {
    local n=$1
    local intpart=${n%%.*}
    if [[ $n =~ \.00*$ ]]; then
        echo $intpart
    else
        echo $((intpart + 1))
    fi
}
The idea here is to extract the integer part,
and if the decimal part is all zeros, then we print simply the integer part,
otherwise the integer part + 1, as that is the ceiling.
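One caveat: ceil assumes its argument contains a decimal point, which is always true of the bc -l output from formula; a bare integer argument like 5 would incorrectly come back as 6.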
And a final simple function that combines the above functions to get the result that you want:
compute() {
    local n=$1
    ceil $(formula $n)
}
And a checker function to test it:
check() {
    local n num
    for n; do
        num=$(formula $n)
        echo $n $num $(compute $n)
    done
}
Let's try it:
check 1 2 3 4 7 11 12 16 17
It produces:
1 1.00000000000000000000 1
2 1.56155281280883027491 2
3 2.00000000000000000000 2
4 2.37228132326901432992 3
7 3.27491721763537484861 4
11 4.21699056602830190566 5
12 4.42442890089805236087 5
16 5.17890834580027361089 6
17 5.35234995535981255455 6
You can use bc's sqrt function:
echo "(sqrt(1 + 8 * 3) - 1) / 2" | bc
A ceil function can be implemented using any of the methods described in the answer "Getting Ceil integer". For example:
ceiling_divide() {
    ceiling_result=`echo "($1 + $2 - 1)/$2" | bc`
}
You can use bc for the whole job:
$ cat filebc
print "Enter a number\n";
scale=20
a=read()
b=((sqrt(1 + 8 * a) - 1) / 2)
scale=0
print "ceil = ";
((b/1)+((b%1)>0))
quit
Call it like this:
bc -q filebc
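A sample session, entering 3 (where the formula gives exactly 2):
$ bc -q filebc
Enter a number
3
ceil = 2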

sort a line with bunch of numbers

I have a line that goes like:
string 2 2 3 3 1 4
where the 2nd, 4th and 6th columns represent an ID (assuming each ID number is unique) and 3rd, 5th and 7th columns represent some data associated with respective ID.
How can I re-arrange the line so that it will be sorted by the ID?
string 1 4 2 2 3 3
Note: a line may have any number of IDs, unlike the example.
Using shell script, I'm thinking something like
while read n
do
echo $(echo $n | sort -k (... stuck here) )
done < infile
Another bash alternative which does not rely on how many ids there are:
#!/usr/bin/env bash
x='string 2 2 3 3 1 4'
out="${x%% *}"
in=($x)
for (( i = 1; i < ${#in[*]}; i += 2 ))
do
    new[${in[i]}]=${in[i+1]}
done
for i in "${!new[@]}"
do
    out="$out $i ${new[i]}"
done
echo $out
You can put a loop around the lot if you then want to read a file.
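A minimal sketch of that loop (assuming the input file is named infile, as elsewhere in this thread):
#!/usr/bin/env bash
# re-pair each line: IDs become array indexes, so they come back out sorted
while read -r -a in; do
    out="${in[0]}"
    unset new
    for (( i = 1; i < ${#in[*]}; i += 2 )); do
        new[${in[i]}]=${in[i+1]}
    done
    for i in "${!new[@]}"; do
        out="$out $i ${new[i]}"
    done
    echo "$out"
done < infile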
I'll add a gawk solution to your long list of options.
This is a standalone script:
#!/usr/bin/env gawk -f
{
    line=$1
    # Collect the tuples into values of an array,
    for (i=2;i<NF;i+=2) a[i]=$i FS $(i+1)
    # This sorts the array "a" by value, numerically, ascending...
    asort(a, a, "@val_num_asc")
    # And this for loop gathers the result.
    for (i=1; i<=length(a); i++) line=line FS a[i]
    # Finally, print the line,
    print line
    # and clear the array for the next round.
    delete a
}
This works by copying your tuples into an array, sorting the array, then reassembling the sorted tuples in a for loop that prints the array elements.
Note that it's gawk-only (not traditional awk) because of the use of asort().
$ cat infile
string 2 2 3 3 1 4
other 5 1 20 9 3 7
$ ./sorttuples infile
string 1 4 2 2 3 3
other 3 7 5 1 20 9
As a bash script this can be done with:
#!/usr/bin/env bash
# send field pairs as separate lines
function emit_line() {
    while [ $# -gt 0 ] ; do
        echo "$1" "$2"
        shift; shift
    done
}

# break the line into pieces and send to sort
function sort_line() {
    echo $1
    shift
    emit_line $* | sort
}

# loop through the lines in the file and sort by key-value pairs
while read n; do
    echo $(sort_line $n)
done < infile
File infile:
string 2 2 3 3 1 4
string 2 2 0 3 4 4 1 7
string 2 2 0 3 2 1
Output:
string 1 4 2 2 3 3
string 0 3 1 7 2 2 4 4
string 0 3 2 1 2 2
Update:
Cribbing the sort from grail's version, to remove the (much slower) external sort:
function sort_line() {
    line="$1"
    shift
    while [ $# -gt 0 ] ; do
        data[$1]=$2
        shift; shift
    done
    for i in ${!data[@]}; do
        line="$line $i ${data[i]}"
    done
    unset data
    echo $line
}
while read n; do
    sort_line $n
done < infile
You can use python for this. This function breaks the line up into a list of tuples that can then be sorted. itertools.chain is then used to re-assemble the key/value pairs.
Code:
import itertools as it

def sort_line(line):
    # split the line on white space
    x = line.split()
    # make a tuple of key value pairs
    as_tuples = [tuple(x[i:i+2]) for i in range(1, len(x), 2)]
    # sort the tuples numerically by id (a plain string sort would
    # misorder multi-digit ids), and flatten them with chain
    sorted_kv = list(it.chain(*sorted(as_tuples, key=lambda t: int(t[0]))))
    # join the results back into a string
    return ' '.join([x[0]] + sorted_kv)
Test Code:
data = [
    "string 2 2 3 3 1 4",
    "string 2 2 0 3 4 4 1 7",
]

for line in data:
    print(sort_line(line))
Results:
string 1 4 2 2 3 3
string 0 3 1 7 2 2 4 4

Find smallest missing integer in an array

I'm writing a bash script which requires searching for the smallest available integer in an array and piping it into a variable.
I know how to identify the smallest or the largest integer in an array but I can't figure out how to identify the 'missing' smallest integer.
Example array:
1
2
4
5
6
In this example I would need 3 as a variable.
Using sed for this would be silly. With GNU awk you could do
array=(1 2 4 5 6)
echo "${array[#]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 1; i in a; ++i); print i }'
...which remembers all numbers, then counts from 1 until it finds one that it doesn't remember and prints that. You can then remember this number in bash with
array=(1 2 4 5 6)
number=$(echo "${array[@]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 1; i in a; ++i); print i }')
However, if you're already using bash, you could just do the same thing in pure bash:
#!/bin/bash
array=(1 2 4 5 6)
declare -a seen
for i in ${array[@]}; do
    seen[$i]=1
done
for ((number = 1; seen[number] == 1; ++number)); do true; done
echo $number
You can iterate from the minimal to the maximal number and take the first non-existing element:
use List::Util qw( first );

my @arr = sort {$a <=> $b} qw(1 2 4 5 6);
my $min = $arr[0];
my $max = $arr[-1];
my %seen;
@seen{@arr} = ();
my $first = first { !exists $seen{$_} } $min .. $max;
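With this sample list, $first ends up as 3.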
This code will do as you ask. It can easily be accelerated by using a binary search, but it is clearest stated in this way.
The first element of the array can be any integer, and the subroutine returns the first value that isn't in the sequence. It returns undef if the complete array is contiguous.
use strict;
use warnings;
use 5.010;

my @data = qw/ 1 2 4 5 6 /;
say first_missing(@data);

@data = ( 4 .. 99, 101 .. 122 );
say first_missing(@data);

sub first_missing {
    my $start = $_[0];
    for my $i ( 1 .. $#_ ) {
        my $expected = $start + $i;
        return $expected unless $_[$i] == $expected;
    }
    return;
}
output
3
100
Here is a Perl one-liner:
$ echo '1 2 4 5 6' | perl -lane '}
{@a=sort { $a <=> $b } @F; %h=map {$_=>1} @a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;'
If you want to switch from a pipeline to a file input:
$ perl -lane '}
{@a=sort { $a <=> $b } @F; %h=map {$_=>1} @a;
foreach ($a[0]..$a[-1]) { if (!exists($h{$_})) {print $_}} ;' file
Since it is sorted in the process, input can be in arbitrary order.
$ cat tst.awk
BEGIN {
    split("1 2 4 5 6",a)
    for (i=1;a[i+1]==a[i]+1;i++) ;
    print a[i]+1
}
$ awk -f tst.awk
3
Having fun with @Borodin's excellent answer:
#!/usr/bin/env perl
use 5.020; # why not?
use strict;
use warnings;

sub increasing_stream {
    my $start = int($_[0]);
    return sub {
        $start += 1 + (rand(1) > 0.9);
    };
}

my $stream = increasing_stream(rand(1000));
my $first = $stream->();
say $first;
while (1) {
    my $next = $stream->();
    say $next;
    last unless $next == ++$first;
    $first = $next;
}
say "Skipped: $first";
Output:
$ ./tyu.pl
381
382
383
384
385
386
387
388
389
390
391
392
393
395
Skipped: 394
Here's one bash solution (assuming the numbers are in a file, one per line):
sort -n numbers.txt | grep -n . |
grep -v -m1 '\([0-9]\+\):\1' | cut -f1 -d:
The first part sorts the numbers and then adds a sequence number to each one, and the second part finds the first sequence number which doesn't correspond to the number in the array.
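For example:
$ printf '%s\n' 1 2 4 5 6 > numbers.txt
$ sort -n numbers.txt | grep -n . |
grep -v -m1 '\([0-9]\+\):\1' | cut -f1 -d:
3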
Same thing, using sort and awk (bog-standard, no extensions in either):
sort -n numbers.txt | awk '$1!=NR{print NR;exit}'
Here is a slight variation on the theme set by other answers. Values coming in are not necessarily pre-sorted:
$ cat test
sort -nu <<END-OF-LIST |
1
5
2
4
6
END-OF-LIST
awk 'BEGIN { M = 1 } M > $1 { next } M == $1 { M++; next }
M < $1 { exit } END { print M }'
$ sh test
3
Notes:
If numbers are pre-sorted, do not bother with the sort.
If there are no missing numbers, the next higher number is output.
In this example, a here document supplies numbers, but one can use a file or pipe.
M may start greater than the smallest to ignore missing numbers below a threshold.
To auto-start the search at the lowest number, change BEGIN { M = 1 } to NR == 1 { M = $1 }.
