Need to calculate standard deviation from an array using bash and awk?

I'm new to awk and I'm struggling to write an awk command that finds the standard deviation.
I got the mean using the following:
echo "${GfieldList[@]}" | awk 'NF {sum=0;for (i=1;i<=NF;i++)sum+=$i; print "Mean= " sum / NF; }'
The standard deviation formula is:
sqrt((1/N)*(sum of (value - mean)^2))
I have found the mean using the command above.
Can you help me with the awk command for the standard deviation?

An alternate formula for the standard deviation is the square root of the quantity: (the mean square minus the square of the mean). This is used below:
$ echo 20 21 22 | awk 'NF {sum=0;ssq=0;for (i=1;i<=NF;i++){sum+=$i;ssq+=$i**2}; print "Std Dev=" (ssq/NF-(sum/NF)**2)**0.5}'
Std Dev=0.816497
Notes:
In awk, NF is the number of "fields" on a line. In our case, every field is a number, so NF is the number of numbers on a given line.
ssq is the sum of the squares of each number on the line. Thus, ssq/NF is the mean square.
sum is the sum of the numbers on the line. Thus sum/NF is the mean and (sum/NF)**2 is the square of the mean.
As per the formula, then, the standard deviation is (ssq/NF-(sum/NF)**2)**0.5.
The awk code
NF
This serves as a condition: the statements which follow will only be executed if the number of fields on this line, NF, evaluates to true, meaning non-zero. In other words, this condition will cause empty lines to be skipped.
sum=0;ssq=0;
This initializes sum and ssq to zero. This is only needed if there is more than one line of input.
for (i=1;i<=NF;i++){sum+=$i;ssq+=$i**2}
This puts the sum of all the numbers in sum and the sum of the square of the numbers in ssq.
print "Std Dev=" (ssq/NF-(sum/NF)**2)**0.5
This prints out the standard deviation.

First compute the mean:
awk '{
    for (i = 1; i <= NF; i++) {
        sum += $i
    }
    print sum / NF
}' # for 2, 4, 4, 4, 5, 5, 7, 9 gives 5
Once you know the mean, the standard deviation can be found thus:
awk -v M=5 '{
    for (i = 1; i <= NF; i++) {
        sum += ($i-M) * ($i-M)
    }
    print sqrt(sum / NF)
}' # for 2, 4, 4, 4, 5, 5, 7, 9 gives 2
In "compressed" form:
awk '{for(i=1;i<=NF;i++){sum+=$i};print sum/NF}'
awk -v M=5 '{for(i=1;i<=NF;i++){sum+=($i-M)*($i-M)};print sqrt(sum/NF)}'
(changing the value of M to the actual mean extracted from the first command).
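If you would rather not copy the mean by hand between two commands, here is a minimal sketch that does both passes inside one awk call. The function name stddev and the "mean sd" output format are my own choices, not something from the question, and it computes the population standard deviation (dividing by N), as in the formula you posted.

```shell
# Hypothetical helper: population standard deviation of its arguments.
# Prints "mean sd" on one line.
stddev() {
    printf '%s\n' "$@" | awk '
        NF { n++; x[n] = $1; sum += $1 }      # pass 1: store values, accumulate the sum
        END {
            if (n == 0) exit 1                # no numbers given: nothing to report
            mean = sum / n
            for (i = 1; i <= n; i++)          # pass 2: squared deviations from the mean
                ss += (x[i] - mean) ^ 2
            printf "%.6g %.6g\n", mean, sqrt(ss / n)
        }'
}

stddev 2 4 4 4 5 5 7 9      # prints: 5 2
```

With a bash array it should be callable as stddev "${GfieldList[@]}". I used ^ rather than ** because, as far as I know, ^ is the exponent operator that POSIX awk specifies.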

Related

Divide as evenly as possible the value defined in var into a variable length array

I'm trying to split the remainder as evenly as possible when var is not evenly divisible by the number of array elements.
I've tried the following, which gives me a rounded split in each array item. I'm looking for a way to identify the remainder and then spread it as evenly as possible across the array index values.
for n in ${!variableLengthArray[@]} ; do
divideCount=$(( ${variableLengthArray[$n]} / $var ))
variableLengthArray[$n]=$(echo "($divideCount+0.5)/1" | bc )
done
EXAMPLE1:
Input:
var=11
variableLengthArray[0]=0
variableLengthArray[1]=0
variableLengthArray[2]=0
Ideal Output:
variableLengthArray[0]=4
variableLengthArray[1]=4
variableLengthArray[2]=3
EXAMPLE2:
Input:
var=33
variableLengthArray[0]=0
variableLengthArray[1]=0
variableLengthArray[2]=0
variableLengthArray[3]=0
variableLengthArray[4]=0
variableLengthArray[5]=0
Ideal Output:
variableLengthArray[0]=6
variableLengthArray[1]=6
variableLengthArray[2]=6
variableLengthArray[3]=5
variableLengthArray[4]=5
variableLengthArray[5]=5

You just need to divide the input by the number of output slots. The shell only does integer division, so the result will be the number to store in each slot. The remainder of the division tells you how many slots get the result plus one.
As a concrete example,
$ var=11
$ slots=3
$ result=$((var / slots))
$ k=$((var % slots ))
$ for ((i=0; i<k; i++)); do
> variableLengthArray[i]=$(( result + 1 ))
> done
$ for ((i=k; i < slots; i++)); do
> variableLengthArray[i]=$result
> done

Assuming that the indices of your array variable start at 0 and are contiguous, the following code will do what you want:
n=${#variableLengthArray[@]}
ratio=$(($var / $n))
rem=$(($var % $n))
for i in ${!variableLengthArray[@]} ; do
variableLengthArray[$i]=$(( $ratio + ($i < $rem ? 1 : 0) ))
done
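The quotient-plus-remainder idea from the answers above can also be wrapped in a small function and checked against both examples in the question. This is only a sketch: the name distribute and its two positional arguments (total, number of slots) are made up for illustration, and it assumes both are non-negative integers with at least one slot.

```shell
# Hypothetical helper: split an integer total across a number of slots.
# Prints the slot values on one line, separated by spaces.
distribute() {
    total=$1 slots=$2
    base=$(( total / slots ))     # integer quotient: every slot gets at least this
    extra=$(( total % slots ))    # remainder: this many leading slots get one more
    i=0 out=
    while [ "$i" -lt "$slots" ]; do
        if [ "$i" -lt "$extra" ]; then v=$(( base + 1 )); else v=$base; fi
        out="$out${out:+ }$v"     # append with a single separating space
        i=$(( i + 1 ))
    done
    printf '%s\n' "$out"
}

distribute 11 3    # prints: 4 4 3
distribute 33 6    # prints: 6 6 6 5 5 5
```

In bash the result could then be read back into the array with something like variableLengthArray=( $(distribute "$var" "${#variableLengthArray[@]}") ).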

Perl sqrt, cube issue: 1 showing up after each line

I am having a small issue with a short Perl script that uses arithmetic operators: a 1 shows up after the lines that print my cube root and square root. I was testing this script on an openSUSE 42.1 VM.
I'm not sure what the 1 after each line is. I have tried looking it up, but I'm still not certain. I mainly script in bash and ksh, so Perl syntax is a bit new to me.
Script:
#!/usr/bin/perl
# Provide a sum, cube of the sum, and square root of the sum of three set variables
# Set variables
$v1=10;
$v2=9;
$v3=8;
$val=$v1+$v2+$v3;
$cube=$val ** (1/3);
$square= sqrt($val);
print "Sum of 10, 9, 8: $val\n";
print
print "Cube of Sum: $cube\n";
print
print "Square of Sum: $square\n";
print
print "Thanks for using this script!"

Your lines just saying
print
are not statements in themselves, as they are not terminated by a ;. Instead they are part of statements of the form
print print "text";
The inner print has an argument of "text" and prints that; the outer print has an argument of print "text" and prints the value of that. When successful, print returns a value of 1 (perldoc only says it returns true, so don't rely on it being 1), so a 1 is printed.
If the point was to format your output nicely, you should explicitly print "\n".

As has been explained, half of your print calls are printing the return value of the following print statement, because you are missing a semicolon at the end of the line to terminate the statement.
Also, print on its own will print the value of the default variable $_, not a newline as you expected. You need to write print "\n"; to achieve what you intend.
It's also very important to add use strict and use warnings 'all' to the top of every Perl program you write. You will also need to declare all of your variables using my.
#!/usr/bin/perl
use strict;
use warnings 'all';
# Provide a sum, cube of the sum, and square root of the sum of three set variables
# Set variables
my $v1 = 10;
my $v2 = 9;
my $v3 = 8;
my $val = $v1 + $v2 + $v3;
my $cube = $val**( 1 / 3 );
my $square = sqrt($val);
print "Sum of 10, 9, 8: $val\n";
print "\n";
print "Cube root of Sum: $cube\n";
print "\n";
print "Square root of Sum: $square\n";
print "\n";
print "Thanks for using this script!\n";
print "\n";
output
Sum of 10, 9, 8: 27
Cube root of Sum: 3
Square root of Sum: 5.19615242270663
Thanks for using this script!
It's also worth pointing out that there's a construct called a here document that will let you do this more neatly and clearly. If you change those print statements to just one, like this, then the intention is clear and the output is identical to that of the original code
print <<END;
Sum of 10, 9, 8: $val
Cube root of Sum: $cube
Square root of Sum: $square
Thanks for using this script!
END

As Henrik states in his answer, the lines with print and no ; are the problem.
An alternate way to get Perl to print a blank line between the main lines of output is to add an additional newline character, \n, at the end of each of the print lines. The code would become:
#!/usr/bin/perl
# Provide a sum, cube of the sum, and square root of the sum of three set variables
# Set variables
$v1=10;
$v2=9;
$v3=8;
$val=$v1+$v2+$v3;
$cube=$val ** (1/3);
$square= sqrt($val);
print "Sum of 10, 9, 8: $val\n\n";
print "Cube of Sum: $cube\n\n";
print "Square of Sum: $square\n\n";
print "Thanks for using this script!";
The output is:
Sum of 10, 9, 8: 27
Cube of Sum: 3
Square of Sum: 5.19615242270663
By the way, your expression for the cube of the sum actually calculates the cube root. To calculate the cube of the sum you need:
$cube=$val ** (3);
Likewise, your expression for the square of the sum calculates the square root, not the square. To find the square of the sum you need to raise the sum to the power of 2.

Unix Bash...To sum up each row in a csv file, from the second entry onwards and then find the highest number from the sum of rows

I have created a one-liner that sums up each row in a CSV file, from the second entry onwards. But I want to find the highest of those row sums.
Example file contents (there are thousands of rows):
03/Mar/2016:00:14,19772,7494,11293,9467
03/Mar/2016:00:15,18041,13241,9715,8968
03/Mar/2016:00:16,17441,13534,9926,9301
03/Mar/2016:00:17,17709,14243,9022,9209
03/Mar/2016:00:18,16368,13535,8761,8313
03/Mar/2016:00:19,17074,13224,8868,7789
03/Mar/2016:00:20,16783,13666,9499,8763
03/Mar/2016:00:21,16665,12962,8821,8862
Example script:
This is what I have so far. It calculates the sum of each row, but I need to find just the highest of the calculated sums. Any ideas?
awk 'BEGIN {FS=OFS=","} {sum=0; for(i=2;i<=NF;i++) {sum+=$i}; print $0,"sum:"sum}' /tmp/101.20160304.csv
cheers

awk is quite capable of remembering a maximum value.
awk -F, '
# for every row, calculate the sum
{sum = 0; for (i=2; i<=NF; i++) sum += $i}
# set the max value (if the first row, initialize the max value)
NR == 1 || sum > max {max = sum}
END {print max}
' file
For your sample data, this is the max:
50202
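If you also want to know which row produced the maximum, a small variation can remember the row alongside the sum. This is just a sketch along the same lines: the function name max_row and the variables seen and row are my own, and it assumes comma-separated input with the numbers in fields 2 onwards.

```shell
# Hypothetical helper: report the largest row sum (fields 2..NF) and the row it came from.
# Reads the files given as arguments, or standard input if there are none.
max_row() {
    awk -F, '
        NF {
            sum = 0
            for (i = 2; i <= NF; i++) sum += $i
            # first non-empty row initializes max; later rows replace it only if larger
            if (!seen || sum > max) { max = sum; row = $0; seen = 1 }
        }
        END { if (seen) print max, row }' "$@"
}

printf '%s\n' \
    '03/Mar/2016:00:15,18041,13241,9715,8968' \
    '03/Mar/2016:00:16,17441,13534,9926,9301' \
    '03/Mar/2016:00:17,17709,14243,9022,9209' | max_row
# prints: 50202 03/Mar/2016:00:16,17441,13534,9926,9301
```

On the real file it would be called as max_row /tmp/101.20160304.csv. When several rows tie for the maximum, this keeps the first one.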

You can pipe your awk output to:
awk_output|sort -t':' -nrk4|head -1
This sorts by the sum in descending order, then picks the first row. Of course, you can also rewrite your awk to do this in one shot.

Sorting multiple arrays simultaneously in awk

Introduction
Consider the following example sort.awk:
BEGIN {
a[1]="5";
a[2]="3";
a[3]="6";
asort(a)
for (i=1; i<=3; i++) print a[i]
}
Running with awk -f sort.awk prints the sorted numbers in array a in ascending order:
3
5
6
Question
Consider the extended case of two (and, in general, N) corresponding arrays a and b
a[1]="5"; b[1]="fifth"
a[2]="3"; b[2]="third"
a[3]="6"; b[3]="sixth"
and the problem of sorting all arrays "simultaneously". To achieve this, I need to sort array a but also to obtain the indices of the sorting. For this simple case, the indices would be given by
ind[1]=2; ind[2]=1; ind[3]=3;
Having these indices, I can then also print out the sorted b array based on the result of sorting array a. For instance:
for (i=1;i<=3;i++) print a[ind[i]], b[ind[i]]
will print the sorted arrays.
See also Sort associative array with AWK.

I came up with two methods to do your "simultaneous" sort.
One combines the two arrays and then sorts. This is useful when you just need the output.
The other uses gawk's asorti().
Read the code for details; I think it is easy to understand:
BEGIN{
a[1]="5"; b[1]="fifth"
a[2]="3"; b[2]="third"
a[3]="6"; b[3]="sixth"
#method 1: combine the two arrays before sort
for(;++i<=3;)
n[i] = a[i]" "b[i]
asort(n)
print "--- method 1: ---"
for(i=0;++i<=3;)
print n[i]
#method 2:
#here we build a new array/hashtable, and use asorti()
for(i=0;++i<=3;)
x[a[i]]=b[i]
asorti(x,t)
print "--- method 2: ---"
for(i=0;++i<=3;)
print t[i],x[t[i]]
}
output:
kent$ awk -f sort.awk
--- method 1: ---
3 third
5 fifth
6 sixth
--- method 2: ---
3 third
5 fifth
6 sixth
EDIT
If you want to get the original index, you can try method 3, as follows:
#method 3:
print "--- method 3: ---"
for(i=0;++i<=3;)
c[a[i]] = i;
asort(a)
for(i=0;++i<=3;)
print a[i], " | related element in b: "b[c[a[i]]], " | original idx: " c[a[i]]
the output is:
--- method 3: ---
3 | related element in b: third | original idx: 2
5 | related element in b: fifth | original idx: 1
6 | related element in b: sixth | original idx: 3
As you can see, the original idx is there. If you want to save the indices into an array, just add idx[i]=c[a[i]] in the for loop.
EDIT2
method 4: combine with different order, then split to get idx array:
#method 4:
for(i=0;++i<=3;)
m[i] = a[i]"\x99"i
asort(m)
print "--- method 4: ---"
for(i=0;++i<=3;){
split(m[i],x,"\x99")
ind[i]=x[2]
}
#test ind array:
for(i=0;++i<=3;)
print i"->"ind[i]
output:
--- method 4: ---
1->2
2->1
3->3

Based on Kent's answer, here is a solution that should also obtain the indices:
BEGIN {
a[1]="5";
a[2]="3";
a[3]="6";
for (i=1; i<=3; i++) b[i]=a[i]" "i
asort(b)
for (i=1; i<=3; i++) {
split(b[i],c," ")
ind[i]=c[2]
}
for (i=1; i<=3; i++) print ind[i]
}
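For what it's worth, the same decorate-then-sort idea can be sketched without gawk's asort(), by letting sort(1) do the ordering. The function name sort_with_index is invented for this example, and it assumes the keys are plain numbers with no embedded whitespace.

```shell
# Hypothetical helper: print "value original_index" pairs, ascending by value.
sort_with_index() {
    # Decorate each value with its 1-based position (awk NR), then sort
    # numerically on the value; ties fall back to the original position.
    printf '%s\n' "$@" |
        awk '{ print $1, NR }' |
        sort -k1,1n -k2,2n
}

sort_with_index 5 3 6
# prints:
# 3 2
# 5 1
# 6 3
```

The second column is the ind array from the question (2, 1, 3), so it can drive a loop over b or any number of other arrays.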

Need an algorithm to split a series of numbers

After a few busy nights my head isn't working so well, but this needs to be fixed yesterday, so I'm asking the more refreshed community of SO.
I've got a series of numbers. For example:
1, 5, 7, 13, 3, 3, 4, 1, 8, 6, 6, 6
I need to split this series into three parts so the sum of the numbers in all parts is as close as possible. The order of the numbers needs to be maintained, so the first part must consist of the first X numbers, the second - of the next Y numbers, and the third - of whatever is left.
What would be the algorithm to do this?
(Note: the actual problem is to arrange text paragraphs of differing heights into three columns. Paragraphs must maintain order (of course) and they may not be split in half. The columns should be as equal of height as possible.)

First, we'll need to define the goal better:
Suppose the partial sums are A1, A2, A3. We are trying to minimize |A-A1|+|A-A2|+|A-A3|, where A is the average: A=(A1+A2+A3)/3.
Therefore, multiplying through by 3, we are trying to minimize |A2+A3-2A1|+|A1+A3-2A2|+|A1+A2-2A3|.
Let S denote the sum (which is constant): S=A1+A2+A3, so A3=S-A1-A2.
We're trying to minimize:
|A2+S-A1-A2-2A1|+|A1+S-A1-A2-2A2|+|A1+A2-2S+2A1+2A2|=|S-3A1|+|S-3A2|+|3A1+3A2-2S|
Denoting this function as f, we can do two nested loops, O(n^2), and keep track of the minimum.
Something like:
for (x=1; x<items; x++)
{
A1= sum(Item[0]..Item[x-1])
for (y=x; y<items; y++)
{
A2= sum(Item[x]..Item[y-1])
calc f, if new minimum found -keep x,y
}
}
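Here is one way the two-loop search might look as a runnable sketch in shell/awk. The function name split3, the prefix-sum array pre, and the output format are all my own choices. It minimizes f = |S-3A1|+|S-3A2|+|S-3A3| as defined above, assumes at least three numbers, and requires every part to be non-empty. Using prefix sums keeps each evaluation of f constant-time, so the whole search stays O(n^2).

```shell
# Hypothetical helper: split a series into three contiguous parts with sums
# as close as possible. Prints one line per part: "items... = sum".
split3() {
    printf '%s\n' "$@" | awk '
        { v[NR] = $1; pre[NR] = pre[NR-1] + $1 }    # pre[k] = v[1] + ... + v[k]
        END {
            n = NR; S = pre[n]; best = -1
            # x = last index of part 1, y = last index of part 2
            for (x = 1; x <= n - 2; x++)
                for (y = x + 1; y <= n - 1; y++) {
                    f = abs(S - 3*pre[x]) + abs(S - 3*(pre[y] - pre[x])) + abs(S - 3*(S - pre[y]))
                    if (best < 0 || f < best) { best = f; bx = x; by = y }
                }
            show(1, bx); show(bx + 1, by); show(by + 1, n)
        }
        function abs(t) { return t < 0 ? -t : t }
        function show(lo, hi,   k, line) {
            line = ""
            for (k = lo; k <= hi; k++) line = line v[k] " "
            print line "= " (pre[hi] - pre[lo-1])
        }'
}

split3 1 5 7 13 3 3 4 1 8 6 6 6
# prints:
# 1 5 7 13 = 26
# 3 3 4 1 8 = 19
# 6 6 6 = 18
```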

Find the sum and the cumulative sum of the series.
Let a = sum/3.
Then locate the values nearest to a and 2*a in the cumulative sum; these cut points divide your list into three roughly equal parts.
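A rough sketch of that cumulative-sum heuristic, in case it helps. The function name nearest_cuts and the helper dist are invented here; it prints the 1-based index of the last item in the first part and in the second part. Note that it is a quick heuristic and not guaranteed to find the best split in general, and for degenerate inputs the two cut points could coincide.

```shell
# Hypothetical helper: find the cumulative sums nearest to S/3 and 2S/3.
# Prints "c1 c2": the last index of part 1 and the last index of part 2.
nearest_cuts() {
    printf '%s\n' "$@" | awk '
        { pre[NR] = pre[NR-1] + $1 }                 # running (cumulative) sum
        END {
            a = pre[NR] / 3
            c1 = 1; c2 = 1
            for (k = 1; k < NR; k++) {               # the last item always stays in part 3
                if (dist(pre[k], a)   < dist(pre[c1], a))   c1 = k
                if (dist(pre[k], 2*a) < dist(pre[c2], 2*a)) c2 = k
            }
            print c1, c2
        }
        function dist(p, q) { return p < q ? q - p : p - q }'
}

nearest_cuts 1 5 7 13 3 3 4 1 8 6 6 6    # prints: 4 9
```

For the series in the question, cutting after items 4 and 9 gives part sums of 26, 19 and 18.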

Let's say p is your array of paragraph heights:
int len = p.sum()/3; // the average value
int currlen = 0;
int templen = 0;
int indexes[2];
int j = 0;
for (i = 0; i < p.length; i++)
{
    currlen = currlen + p[i];
    if (currlen > len)
    {
        if ((currlen-len) < abs((currlen-p[i])-len))
        { // check which one is closer to the average value
            indexes[j++] = i;
            len = (p.sum()-currlen)/2; // optional: compute a new average height from the remaining lengths
            currlen = 0;
        }
        else
        {
            indexes[j++] = i-1;
            len = (p.sum()-currlen)/2;
            currlen = p[i];
        }
    }
    if (j >= 2) // both cut points found; indexes has only two slots
        break;
}
You will get the starting index of the 2nd and 3rd sequences. Note that it's kind of pseudocode :)

I believe that this can be solved with a dynamic programming algorithm for line breaking invented by Donald Knuth for use in TeX.

Following Aasmund Eldhuset's answer: I previously answered this question on SO.
Word wrap to X lines instead of maximum width (Least raggedness)
This algo doesn't rely on the max line size but just gives an optimal cut.
I modified it to work with your problem:
L = [1, 5, 7, 13, 3, 3, 4, 1, 8, 6, 6, 6]
def minragged(words, n=3):
    P = 2
    cumwordwidth = [0]
    # cumwordwidth[-1] is the last element
    for word in words:
        cumwordwidth.append(cumwordwidth[-1] + word)
    totalwidth = cumwordwidth[-1] + len(words) - 1  # len(words) - 1 spaces
    linewidth = float(totalwidth - (n - 1)) / float(n)  # n - 1 line breaks
    print("number of words:", len(words))
    def cost(i, j):
        """
        cost of a line words[i], ..., words[j - 1] (words[i:j])
        """
        actuallinewidth = max(j - i - 1, 0) + (cumwordwidth[j] - cumwordwidth[i])
        return (linewidth - float(actuallinewidth)) ** P
    """
    printing the reasoning and reversing the return list
    """
    F = {}  # Total cost function
    for stage in range(n):
        print("------------------------------------")
        print("stage :", stage)
        print("------------------------------------")
        print("word i to j in line", stage, "\t\tTotalCost (f(j))")
        print("------------------------------------")
        if stage == 0:
            F[stage] = []
            i = 0
            for j in range(i, len(words) + 1):
                print("i=", i, "j=", j, "\t\t\t", cost(i, j))
                F[stage].append([cost(i, j), 0])
        elif stage == (n - 1):
            F[stage] = [[float('inf'), 0] for i in range(len(words) + 1)]
            for i in range(len(words) + 1):
                j = len(words)
                if F[stage-1][i][0] + cost(i, j) < F[stage][j][0]:  # calculating min cost (cf f formula)
                    F[stage][j][0] = F[stage-1][i][0] + cost(i, j)
                    F[stage][j][1] = i
                    print("i=", i, "j=", j, "\t\t\t", F[stage][j][0])
        else:
            F[stage] = [[float('inf'), 0] for i in range(len(words) + 1)]
            for i in range(len(words) + 1):
                for j in range(i, len(words) + 1):
                    if F[stage-1][i][0] + cost(i, j) < F[stage][j][0]:
                        F[stage][j][0] = F[stage-1][i][0] + cost(i, j)
                        F[stage][j][1] = i
                        print("i=", i, "j=", j, "\t\t\t", F[stage][j][0])
    print('reversing list')
    print("------------------------------------")
    listWords = []
    a = len(words)
    for k in range(n - 1, 0, -1):  # reverse loop from n-1 to 1
        listWords.append(words[F[k][a][1]:a])
        a = F[k][a][1]
    listWords.append(words[0:a])
    listWords.reverse()
    for line in listWords:
        print(line, '\t\t', sum(line))
    return listWords
The result I get is:
[1, 5, 7, 13] 26
[3, 3, 4, 1, 8] 19
[6, 6, 6] 18
[[1, 5, 7, 13], [3, 3, 4, 1, 8], [6, 6, 6]]
Hope it helps
