How to determine statistical significance in shell - shell

I would like to determine the statistical significance of my results using shell scripting. My input file shows the number of errors in each trial in 10000 observations. Part of it is listed as: (using a threshold of having at least 1 error)
Then I calculated the probability of each error count as:
awk '{ count[$0]++; total++ }
END { for(i in count) printf("%d %.3f\n", i, count[i]/total) }' ifile.txt | sort -n > ofile.txt
where the first column in ofile.txt shows the number of errors and the second column shows its probability:
1 0.133
2 0.400
3 0.200
4 0.200
6 0.067
Now I need to determine the statistical significance of this result, e.g. to highlight those results which are not statistically significant at the 1% level. That is, we accept those error counts which have a p-value < 0.005, and if an error count has a p-value > 0.005 we reject it.
I can't think of any method to do this in shell. Can anybody help or suggest something?
Desired output is something like:
1 99999
2 0.400
3 0.200
4 0.200
6 99999
Here, I assumed the probability of showing 1 error is not statistically significant at the 1% level, but the probability of showing 2 errors is statistically significant, and so on.

With no statistics education or gnuplot experience, it's a bit difficult to decipher exactly the desired methods for a solution. The problem may not be described well enough or my knowledge is ill-equipped for it.
Either way, after looking at the relationships between the data presented and desired output, I came up with this Awk script to achieve it:
$ cat script.awk
function abs(v) { return v < 0 ? -v : v }
{ a[$0]++ }
END {
    obs = 10000
    sig = 1
    for (i in a) {
        r = a[i]/NR
        if (abs(r-sig/10) <= sig/20)
            print i, obs-sig
        else
            printf "%d %.3f\n", i, r
    }
}
$ awk -f script.awk ifile.txt | sort > outfile.txt
$ cat outfile.txt
1 9999
2 0.400
3 0.200
4 0.200
6 9999
This assumes 9999 (10000 (number of observations) - 1 (error)) was meant as the second field in the 1st and 5th lines in the desired output, not 99999.
Also, if using GNU Awk, the need for a pipe into sort could be eliminated by using the sort function asorti.
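As a quick check, the script above can be run on a small made-up input. This is only a sketch: the 15 sample values are an assumption, chosen so that their relative frequencies match the table in the question (2/15 = 0.133, 6/15 = 0.400 and so on), and the flagging rule is still the pattern-matching heuristic from this answer, not a real significance test.

```shell
# Hypothetical sample: 15 trials whose frequencies reproduce the question's table.
result=$(printf '%s\n' 1 1 2 2 2 2 2 2 3 3 3 4 4 4 6 | awk '
function abs(v) { return v < 0 ? -v : v }
{ a[$0]++ }                       # count how often each error count occurs
END {
    obs = 10000; sig = 1
    for (i in a) {
        r = a[i] / NR             # relative frequency of this error count
        if (abs(r - sig/10) <= sig/20) print i, obs - sig   # flagged row
        else printf "%d %.3f\n", i, r                        # kept row
    }
}' | sort -n)
printf '%s\n' "$result"
```

Piping through sort -n keeps the sketch portable to any POSIX awk, since for (i in a) visits keys in an unspecified order.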


Combining multiple awk output statements into one line

I have some ascii files I’m processing, with 35 columns each, and variable number of rows. I need to take the difference between two columns (N+1), and place the results into a duplicate ascii file on column number 36. Then, I need to take another column, and divide it (row by row) by column 36, and place that result into the same duplicate ascii file in column 37.
I’ve done similar processing in the past, but by outputting temp files for each awk command, reading each successive temp file in to eventually create a final ascii file. Then, I would delete the temp files after. I’m hoping there is an easier/faster method than having to create a bunch of temp files.
Below is an initial working processing step, that the above awk commands would need to follow and fit into. This step gets the data from foo.txt, removes the header, and processes only the rows containing a particular, but varying, string.
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
There’s another processing step for different data files, that I would also need the 2 new columns discussed earlier. This is simply appending a unique file name from what’s being catted to the last column of every row in a new ascii file. This command is actually in a loop with varying input files, but I’ve simplified it here.
cat foo.txt | tail -n +2 | awk -v fname="$fname" '{print $0 OFS fname;}' >> foo_new.txt
An example of one of the foo.txt files.
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
Below would be the example foo_new.txt desired. The requested 2 columns of output from awk (last 2 columns). In this example, column 5 is the difference between column 3 and 2 plus 1. Column 6 is the result of column 1 divided by column 5.
20 0 5 F001 6 3.3
4 2 3 F002 2 2.0
12 4 8 F003 5 2.4
For the second example foo_new.txt. The last column is an example of fname. These are computed in the shell script, and passed to awk. I don't care if the results in column 7 (fname) are at the end or placed between columns 4 and 5, so long as it gets along with the other awk statements.
20 0 5 F001 6 3.3 C1
4 2 3 F002 2 2.0 C2
12 4 8 F003 5 2.4 C3
This is the closest I have come so far, but unfortunately it produces a file with the original output first and the added output below it. I'd like to have the added output appended as columns (#5 and #6).
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
cat foo_new.txt | awk '{print $4=$3-$2+1, $5=$1/($3-$2+1)}' >> foo_new.txt
Consider an input file data with header line like this (based closely on your minimal example):
Col1 Col2 Col3 Col4
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
You want the output to contain a column 5 that is the value of $3 - $2 + 1 (column 3 minus column 2 plus 1), and a column 6 that is the value of column 1 divided by column 5 (with 1 decimal place in the output), and a file name that is based on a variable fname passed to the script but that has a unique value for each line. And you only want lines where column 4 matches F and 3 digits, and you want to skip the first line. That can all be written directly in awk:
awk -v fname=C '
NR == 1 { next }
$4 ~ /^F[0-9][0-9][0-9]$/ { c5 = $3 - $2 + 1
c6 = sprintf("%.1f", $1 / c5)
print $0, c5, c6, fname NR
}' data
You could write that on one line too:
awk -v fname=C 'NR==1{next} $4~/^F[0-9][0-9][0-9]$/ { c5=$3-$2+1; print $0,c5,sprintf("%.1f",$1/c5), fname NR }' data
The output is:
20 0 5 F001 6 3.3 C2
4 2 3 F002 2 2.0 C3
12 4 8 F003 5 2.4 C4
Clearly, you could change the file name so that the counter starts from 0 or 1 by using counter++ or ++counter respectively in place of the NR in the print statement, and you could format it with leading zeros or whatever else you want with sprintf() again. If you want to drop the first line of each file, rather than just the first file, change the NR == 1 condition to FNR == 1 instead.
Note that this does not need the preprocessing provided by cat foo.txt | tail -n +2.
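The FNR and counter variations described above could look roughly like this. It is a sketch under assumptions: the two input files a.txt and b.txt are invented for the demonstration, each with its own header line, and the two-digit zero padding is just one possible format.

```shell
# Hypothetical inputs: two files, each with a header line.
tmp=$(mktemp -d)
printf 'Col1 Col2 Col3 Col4\n20 0 5 F001\n4 2 3 F002\n' > "$tmp/a.txt"
printf 'Col1 Col2 Col3 Col4\n12 4 8 F003\n100 10 29 O001\n' > "$tmp/b.txt"

result=$(awk -v fname=C '
FNR == 1 { next }                      # skip the header of every file, not just the first
$4 ~ /^F[0-9][0-9][0-9]$/ {
    c5 = $3 - $2 + 1
    # ++counter numbers the matching rows from 1; sprintf pads it to two digits
    print $0, c5, sprintf("%.1f", $1 / c5), fname sprintf("%02d", ++counter)
}' "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the counter only advances on matching rows, the suffix stays contiguous (C01, C02, C03) even though headers and non-matching rows are skipped.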
"I need to take the difference between two columns (N+1), and place the results into a duplicate ascii file on column number 36. Then, I need to take another column, and divide it (row by row) by column 36, and place that result into the same duplicate ascii file in column 37."
That's just:
awk -vN=9 -vanother_column=10 '{ v36 = $N - $(N+1); print $0, v36, $another_column / v36 }' input_file.tsv
I guess your file has some "header"/special "first line", so if it's the first line, then preserve it:
awk ... 'NR==1{print $0, "36_header", "37_header"} NR>1{ ... the script above ... }'
Taking the first 3 columns from the example input you presented, and substituting 2 for N and 1 for another_column, we get the following script:
# recreate input file
cat <<EOF |
20 0 5
4 2 3
12 4 8
100 10 29
EOF
tr -s ' ' |
tr ' ' '\t' > input_file.tsv
awk -vOFS=$'\t' -vFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N; print $0, tmp, $another_column / tmp }' input_file.tsv
and it will output:
20 0 5 5 4
4 2 3 1 4
12 4 8 4 3
100 10 29 19 5.26316
This script:
awk -vOFS=$'\t' -vFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N + 1; print $0, tmp, sprintf("%.1f", $another_column / tmp) }' input_file.tsv
gets closer, I think, to the output you want:
20 0 5 6 3.3
4 2 3 2 2.0
12 4 8 5 2.4
100 10 29 20 5.0
And I guess that by that (N+1) you meant "the difference between two columns with 1 added".
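One thing the scripts above do not handle is a row where the computed difference is zero. Depending on the awk implementation, the division then either aborts with an error or yields inf. A minimal sketch of a guard follows; the second input row and the "NA" placeholder are assumptions made up for illustration, and num is a hypothetical name standing in for another_column.

```shell
# Second row is a made-up case where $(N+1) - $N + 1 evaluates to 0.
result=$(printf '20 0 5\n7 3 2\n' | awk -v N=2 -v num=1 '{
    d = $(N + 1) - $N + 1
    # Print a placeholder instead of dividing when the divisor is zero
    print $0, d, (d == 0 ? "NA" : sprintf("%.1f", $num / d))
}')
printf '%s\n' "$result"
```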

Add up every 5 rows in a column of integers BASH

I am writing a parser, and have to do some fancy stuff. I am trying not to use python, but I might have to at this point.
Given an STDOUT that looks like this:
For about 100,000 lines. What I need to do is add up every 5, like so:
1 - start
0 |
2 | - 6
3 |
0 - end
0 - start
1 |
0 | - 3
0 |
2 - end
0 - start
3 |
0 | - 7
4 |
0 - end
The -, |, start, end, are all for visual representation, I just need it in a column list:
I currently have a method of doing this by using an incrementing head -n $i and tail -n 5 to cut 5 rows out of the list, then I use paste -sd+ - | bc to add up all the values. But this is way too slow because there are 100,000 lines.
If anyone has anything to add I would appreciate it. Let me know if more info is needed.
Thank you
It looks like awk is a natural tool to use:
awk '{ sum += $1 } NR % 5 == 0 { print sum; sum = 0 }'
Add values in column 1 to sum. If the record number modulo 5 is 0, print the sum and reset it to 0. Note that if the last group of records is short (1-4 elements in the group), their sum is not printed. If you want the sum for the short group printed, add END { if (NR % 5 != 0) print sum } to the script.
Since this makes a single pass over the data file using a single command, it will be hard to beat it. Using Perl might be a little faster. I don't know how Python would fare against either Awk or Perl.
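Putting the main rule and the END rule for a short final group together, a sketch on a small made-up input (17 values, so the last group has only 2 elements) might look like this:

```shell
# Hypothetical input: three full groups of 5 followed by a short group (5, 1).
result=$(printf '%s\n' 1 0 2 3 0 0 1 0 0 2 0 3 0 4 0 5 1 |
    awk '{ sum += $1 }                        # accumulate column 1
         NR % 5 == 0 { print sum; sum = 0 }   # emit and reset after every 5th record
         END { if (NR % 5 != 0) print sum }') # emit the leftover short group, if any
printf '%s\n' "$result"
```

The first three lines of output are the sums 6, 3 and 7 from the question's illustration; the fourth is the sum of the leftover group.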
You can use awk for it.
Say file named file1 contains
So the awk command goes like:
awk 'BEGIN{sum=0;} {sum=sum+$1;if(NR%5==0){print sum;sum=0;}}' file1

bash search output for similar text and perform calculation between the 2

I am working on a script that will run a pm2 list and assign it to a variable, wait X seconds and run it again assigning it to a different variable. Then I run those through a comm <(echo "$pm2_1") <(echo "$pm2_2") -3 that gives me only the output that is different between the 2 in a nice format
name ID restart count
prog-name 0 1
prog-name 0 2
prog-name-live 10 1
prog-name-live 10 8
prog-name-live 3 1
prog-name-live 3 4
prog-name-live 6 1
prog-name-live 6 6
What I need is a way to compare the restart counts on the 2 lines that share the same name and ID, e.g.:
name ID restart count
prog-name 0 1
prog-name 0 2
prog-name-worker 10 1
prog-name-worker 10 8
Any ideas would be very helpful!
awk supports associative arrays (hashes), which should help here:
awk '{k=$1" "$2; a[k]=$3; print k, a[k]}'
Here is an example of using one to find the difference; you can adapt the logic as needed:
awk '{k=$1" "$2; if (!(k in a)) a[k]=$3; else {d=a[k]-$3; print k, (d>0?d:-d)}}'
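Applied to the sample listing from the question, the idea could be sketched as follows. The header skip and the variable names (first, d) are assumptions added for the demonstration; the data is copied from the question.

```shell
# Sample comm output from the question, header line included.
input='name ID restart count
prog-name 0 1
prog-name 0 2
prog-name-live 10 1
prog-name-live 10 8
prog-name-live 3 1
prog-name-live 3 4
prog-name-live 6 1
prog-name-live 6 6'

result=$(printf '%s\n' "$input" | awk '
NR == 1 { next }                                 # skip the header line
{
    k = $1 " " $2                                # key: name plus ID
    if (!(k in first)) { first[k] = $3; next }   # remember the first restart count
    d = $3 - first[k]
    print k, (d < 0 ? -d : d)                    # absolute change in restart count
}')
printf '%s\n' "$result"
```

Testing membership with (k in first) rather than comparing against 0 keeps the logic correct when the first restart count is itself 0.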

Awk - extracting information from an xyz-format matrix

I have an x y z matrix of the format:
1 1 0.02
1 2 0.10
1 4 0.22
2 1 0.70
2 2 0.22
3 2 0.44
3 3 0.42
...and so on. I'm interested in summing all of the z values (column 3) for a particular x value (column 1) and printing the output on separate lines (with the x value as a prefix), such that the output for the previous example would appear as:
1 0.34
2 0.92
3 0.86
I have a strong feeling that awk is the right tool for the job, but my knowledge of awk is really lacking and I'd really appreciate any help that anyone can offer.
Thanks in advance.
I agree that awk is a good tool for this job — this is pretty much exactly the sort of task it was designed for.
awk '{ sum[$1] += $3 } END { for (i in sum) print i, sum[i] }' data
For the given data, I got:
2 0.92
3 0.86
1 0.34
Clearly, you could pipe the output to sort -n and get the results in sorted order after all.
To get that in sorted order with awk, you have to go outside the realm of POSIX awk and use the GNU awk extension function asorti:
gawk '{ sum[$1] += $3 }
END { n = asorti(sum, map); for (i = 1; i <= n; i++) print map[i], sum[map[i]] }' data
1 0.34
2 0.92
3 0.86
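If GNU awk is not available, a portable sketch is to let sort -n do the ordering and to fix the number of decimals with printf. The two-decimal format is an assumption based on the sample data.

```shell
# Sample matrix from the question, fed in on standard input.
result=$(printf '%s\n' '1 1 0.02' '1 2 0.10' '1 4 0.22' '2 1 0.70' \
                       '2 2 0.22' '3 2 0.44' '3 3 0.42' |
    awk '{ sum[$1] += $3 }                                        # sum z per x value
         END { for (i in sum) printf "%s %.2f\n", i, sum[i] }' |  # unordered output
    sort -n)                                                      # order by x
printf '%s\n' "$result"
```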

Converting a series of matrix files into an index of coordinates in awk

I have a time series of files 0000.vx.dat, 0000.vy.dat, 0000.vz.dat; ...; 0077.vx.dat, 0077.vy.dat, 0077.vz.dat... Each file is a space-separated 2D matrix. I would like to take each triplet of files and combine them all into a coordinate-based data format, i.e.:
[timestep + 1] [i] [j] [vx(i,j)] [vy(i,j)] [vz(i,j)]
Each file number corresponds to a particular time step. Given the amount of data I have in this time series (~ 4 GB), bash wasn't cutting it so it seemed to be time to head over to awk... specifically mawk. It was pretty stupid to try this in bash but here is my ill-fated attempt:
for x in $(seq 1 78); do
    tfx=${tf[$x]} # an array of padded zeros
    for y in $(seq 1 1568); do
        for z in $(seq 1 1344); do
            echo $x $y $z $(awk -v i=$z -v j=$y "FNR == i {print j}" $tfx.vx.dat) $(awk -v i=$z -v j=$y "FNR == i {print j}" $tfx.vy.dat) $(awk -v i=$z -v j=$y "FNR == i {print j}" $tfx.vz.dat) >> $file
        done
    done
done
edit: Thank you, ruakh, for pointing out that I had kept j in shell variable format with a $ in front! This is just a snippet of the original script, but I guess would be considered the guts of it!
Suffice it to say this would have taken about six months because of all the overhead in bash associated with O(MxN) algorithms, subshells and pipes and whatnot. I was looking for more along the lines of a day at most. Each file is around 18 MB, so it should not be that much of a problem. I would be happy with doing this one timestep at a time in awk provided that I get one output file per timestep. I could just cat them all together without much issue afterwards, I think. It is important, though, that the time step number be the first item on the coordinate list. I could achieve this with an awk -v argument (see above) within a bash routine. I do not know how to look up specific elements of matrices in three separate files and put them all together into one output. This is the main hurdle I would like to overcome. I was hoping mawk could provide a nice balance between effort and computational speed. If this seems to be too much for an awk script, I could go to something lower level, and would appreciate it if anyone answering would let me know whether I should just go to C instead.
Thank you in advance! I really like awk, but am afraid I am a novice.
The three files, 0000.vx.dat, 0000.vy.dat, and 0000.vz.dat would read as follows (except huge and of the correct dimensions):
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
19 20 21
22 23 24
25 26 27
I would like to be able to input:
awk -v t=1 -f stackoverflow.awk 0000.vx.dat 0000.vy.dat 0000.vz.dat
and get the following output:
1 1 1 1 10 19
1 1 2 2 11 20
1 1 3 3 12 21
1 2 1 4 13 22
1 2 2 5 14 23
1 2 3 6 15 24
1 3 1 7 16 25
1 3 2 8 17 26
1 3 3 9 18 27
edit: Thank you, shellter, for suggesting I put the desired input and output more clearly!
Personally, I use gawk to process most of my text files. However, since you have requested a mawk compatible solution, here's one way to solve your problem. Run, in your present working directory:
for i in *.vx.dat; do nawk -f script.awk "$i" "${i%%.*}.vy.dat" "${i%%.*}.vz.dat"; done
Contents of script.awk:
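FNR==1 { c=0 }                       # restart the cell counter for each file
{
    for (i=1;i<=NF;i++) {
        c++
        a[c] = (a[c] ? a[c] : (FILENAME + 1) FS NR FS i) FS $i
    }
}
END {                                # FILENAME + 1: leading digits of the name, plus 1
    for (j=1;j<=c;j++) print a[j] > sprintf("%04d.dat", FILENAME + 1)
}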
When you run the above, the result should be a single file for each set of three files, containing your coordinates. These output files will have filenames of the form: timestamp + 1 ".dat". I decided to pad these filenames to four digits with zeros for your convenience, but you can change this to whatever format you like. Here are the results I get from the sample data you've posted. Contents of 0001.dat:
1 1 1 1 10 19
1 1 2 2 11 20
1 1 3 3 12 21
1 2 1 4 13 22
1 2 2 5 14 23
1 2 3 6 15 24
1 3 1 7 16 25
1 3 2 8 17 26
1 3 3 9 18 27
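For the invocation style asked for in the question (awk -v t=1 ... with the time step passed in and the result written to standard output), a variant could be sketched as below. This is not the script from this answer but an assumption-laden rewrite of the same idea: it keeps one array entry per matrix cell in memory, takes the time step from the hypothetical variable t, and uses the 3x3 sample matrices from the question.

```shell
# Recreate the three sample matrices from the question in a scratch directory.
tmp=$(mktemp -d)
printf '1 2 3\n4 5 6\n7 8 9\n'          > "$tmp/0000.vx.dat"
printf '10 11 12\n13 14 15\n16 17 18\n' > "$tmp/0000.vy.dat"
printf '19 20 21\n22 23 24\n25 26 27\n' > "$tmp/0000.vz.dat"

result=$(awk -v t=1 '
FNR == 1 { c = 0 }                              # restart the cell counter per file
{
    for (i = 1; i <= NF; i++) {
        c++
        if (!(c in a)) a[c] = t FS FNR FS i     # first file: "timestep row column"
        a[c] = a[c] FS $i                       # append this file value for the cell
    }
}
END { for (j = 1; j <= c; j++) print a[j] }     # cells in row-major order
' "$tmp/0000.vx.dat" "$tmp/0000.vy.dat" "$tmp/0000.vz.dat")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Writing to standard output leaves the choice of output file to the calling shell loop, which can redirect each time step wherever it likes.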
