Calculating mean from values in columns specified on the first line using awk - bash

I have a huge file (hundreds of lines, ca. 4,000 columns) structured like this
locus 1 1 1 2 2 3 3 3
exon 1 2 3 1 2 1 2 3
data1 17.07 7.11 10.58 10.21 19.34 14.69 3.32 21.07
data2 21.42 11.46 7.88 9.89 27.24 12.40 0.58 19.82
and I need to calculate the mean of all values (on each data line separately) that share the same locus number (i.e., the same number in the first line). For example, for
data1: the mean of the first three values (the three columns with locus '1':
17.07, 7.11, 10.58), then of the next two values (10.21, 19.34), and then of the next three values (14.69, 3.32, 21.07)
I would like to have output like this
data1 mean1 mean2 mean3
data2 mean1 mean2 mean3
I was thinking about using bash and awk...
Thank you for your advice.

You can use GNU datamash version 1.1.0 or newer (I used the latest version, 1.1.1):
#!/bin/bash
lines=$(wc -l < "$1")
datamash -W transpose < "$1" |
datamash -H groupby 1 mean 3-"$lines" |
datamash transpose
Usage: ./mean_value.sh input.txt | column -t (the column -t is only for pretty-printing; it can be omitted)
Output:
GroupBy(locus) 1 2 3
mean(data1) 11.586666666667 14.775 13.026666666667
mean(data2) 13.586666666667 18.565 10.933333333333

If it were me, I would use R, not awk:
library(data.table)
x = fread('data.txt')
#> x
# V1 V2 V3 V4 V5 V6 V7 V8 V9
#1: locus 1.00 1.00 1.00 2.00 2.00 3.00 3.00 3.00
#2: exon 1.00 2.00 3.00 1.00 2.00 1.00 2.00 3.00
#3: data1 17.07 7.11 10.58 10.21 19.34 14.69 3.32 21.07
#4: data2 21.42 11.46 7.88 9.89 27.24 12.40 0.58 19.82
# save first column of names for later
cnames = x$V1
# remove first column
x[,V1:=NULL]
# matrix transpose: makes rows into columns
x = t(x)
# convert back from matrix to data.table
x = data.table(x,keep.rownames=F)
# set the column names
colnames(x) = cnames
#> x
# locus exon data1 data2
#1: 1 1 17.07 21.42
#...
# ditch useless column
x[,exon:=NULL]
#> x
# locus data1 data2
#1: 1 17.07 21.42
# apply mean() function to each column, grouped by locus
x[,lapply(.SD,mean),locus]
# locus data1 data2
#1: 1 11.58667 13.58667
#2: 2 14.77500 18.56500
#3: 3 13.02667 10.93333
For convenience, here's the whole thing again without comments:
library(data.table)
x = fread('data.txt')
cnames = x$V1
x[,V1:=NULL]
x = t(x)
x = data.table(x,keep.rownames=F)
colnames(x) = cnames
x[,exon:=NULL]
x[,lapply(.SD,mean),locus]

awk '
NR==1 {                         # first line: remember which locus each column belongs to
    for (i=2; i<=NF; i++)
        multi[i] = $i
}
NR>2 {                          # data lines (line 2, the exon line, is skipped)
    for (i in multi) {          # reset the per-locus accumulators for this line
        data[multi[i]]  = 0
        count[multi[i]] = 0
    }
    for (i=2; i<=NF; i++) {     # sum and count the values per locus
        data[multi[i]]  += $i
        count[multi[i]] += 1
    }
    printf "%s ", $1
    for (i in data)             # note: "for (i in ...)" iteration order is unspecified
        printf "%s ", data[i]/count[i]
    print ""
}' <file_name>
Replace <file_name> with your data file
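As a usage sketch, here is the same idea run end-to-end on the question's sample data (the file name mean.txt is hypothetical; this variant also remembers the order in which the locus labels first appear, since a plain for (i in data) loop iterates in an unspecified order):

```shell
# Hypothetical input file holding the question's sample data.
cat > mean.txt <<'EOF'
locus 1 1 1 2 2 3 3 3
exon 1 2 3 1 2 1 2 3
data1 17.07 7.11 10.58 10.21 19.34 14.69 3.32 21.07
data2 21.42 11.46 7.88 9.89 27.24 12.40 0.58 19.82
EOF

out=$(awk '
NR == 1 {                          # line 1: map each column to its locus
    for (i = 2; i <= NF; i++) {
        locus[i] = $i
        if (!($i in seen)) { seen[$i] = 1; order[++n] = $i }
    }
    next
}
NR > 2 {                           # data lines only (line 2, exon, is skipped)
    split("", sum); split("", cnt) # portable way to reset the accumulators
    for (i = 2; i <= NF; i++) { sum[locus[i]] += $i; cnt[locus[i]]++ }
    printf "%s", $1
    for (k = 1; k <= n; k++)       # print means in first-appearance order
        printf " %.3f", sum[order[k]] / cnt[order[k]]
    print ""
}' mean.txt)
printf '%s\n' "$out"
# data1 11.587 14.775 13.027
# data2 13.587 18.565 10.933
```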

Related

Passing for loop using non-integers to awk

I am trying to write code which will achieve:
Where $7 is less than or equal to $i (0 to 1 in increments of 0.05), print the line and pass it to word count. The way I tried to do this was:
for i in $(seq 0 0.05 1); do awk '{if ($7 <= $i) print $0}' file.txt | wc -l ; done
This just ends up returning the word count of the full file (~40 million lines) for each instance of $i, whereas, for example, with $7 <= 0.00 it should return ~67K.
I feel like there may be a way to do this within awk, but I have not seen any suggestions which allow for non-integers.
Thanks in advance.
Pass $i into awk as a variable with -v, like so:
for i in $(seq 0 0.05 1); do awk -v i="$i" '{if ($7 <= i) print $0}' file.txt | wc -l ; done
Some made up data:
$ cat file.txt
1 2 3 4 5 6 7 a b c d e f
1 2 3 4 5 6 0.6 a b c
1 2 3 4 5 6 0.57 a b c d e f g h i j
1 2 3 4 5 6 1 a b c d e f g
1 2 3 4 5 6 0.21 a b
1 2 3 4 5 6 0.02 x y z
1 2 3 4 5 6 0.00 x y z l j k
One possible 100% awk solution:
awk '
BEGIN { line_count=0 }
{
    printf "================= %s\n", $0
    for (i=0; i<=20; i++) {
        if ($7 <= i/20) {
            printf "matching seq : %1.2f\n", i/20
            line_count++
            seq_count[i]++
            next
        }
    }
}
END {
    printf "=================\n\n"
    for (i=0; i<=20; i++) {
        if (seq_count[i] > 0)
            printf "seq = %1.2f : %8s (count)\n", i/20, seq_count[i]
    }
    printf "\nseq = all : %8s (count)\n", line_count
}
' file.txt
# the output:
================= 1 2 3 4 5 6 7 a b c d e f
================= 1 2 3 4 5 6 0.6 a b c
matching seq : 0.60
================= 1 2 3 4 5 6 0.57 a b c d e f g h i j
matching seq : 0.60
================= 1 2 3 4 5 6 1 a b c d e f g
matching seq : 1.00
================= 1 2 3 4 5 6 0.21 a b
matching seq : 0.25
================= 1 2 3 4 5 6 0.02 x y z
matching seq : 0.05
================= 1 2 3 4 5 6 0.00 x y z l j k
matching seq : 0.00
=================
seq = 0.00 : 1 (count)
seq = 0.05 : 1 (count)
seq = 0.25 : 1 (count)
seq = 0.60 : 2 (count)
seq = 1.00 : 1 (count)
seq = all : 6 (count)
BEGIN { line_count=0 } : initialize a total line counter
printf "================= %s\n",$0 : merely for debugging; prints every line of file.txt as it is processed
for (i=0; i<=20; i++) : depending on the implementation, some versions of awk have rounding/accuracy problems with non-integer sequence increments (e.g., stepping by 0.05), so we loop over whole integers and divide by 20 (for this particular case) to get our 0.05 increments during the follow-on test
$7 <= i/20 : if field #7 is less than or equal to (i/20) ...
printf "matching seq ... : print the sequence value we just matched on (i/20)
line_count++ : add 1 to our total line counter
seq_count[i]++ : add 1 to our per-sequence counter array
next : break out of the sequence loop (we found our matching sequence value (i/20)) and process the next line of the file
END ... : print out the line counts
for (i=0; ...) / if / printf : loop through the array of sequences, printing the line count for each sequence value (i/20)
printf "\nseq = all... : print out the total line count
NOTE: Some of the awk code can be further reduced but I'll leave this as is since it's a little easier to understand if you're new to awk.
One (obvious?) benefit of a 100% awk solution is that the sequence/looping construct is internal to awk, which lets us limit ourselves to a single pass through the input file (file.txt). When the sequence/looping construct is outside of awk, the input file has to be processed once per pass through the sequence/loop (for this exercise that would mean processing the input file 21 times).
Using a bit of guesswork as to what you actually want to accomplish, I came up with this:
awk '{ for (i=20; 20*$7<=i && i>0; i--) bucket[i]++ }
END { for (i=1; i<=20; i++) print bucket[i] " lines where $7 <= " i/20 }'
With the mock data from mark's second answer I get this output:
2 lines where $7 <= 0.05
2 lines where $7 <= 0.1
2 lines where $7 <= 0.15
2 lines where $7 <= 0.2
3 lines where $7 <= 0.25
3 lines where $7 <= 0.3
3 lines where $7 <= 0.35
3 lines where $7 <= 0.4
3 lines where $7 <= 0.45
3 lines where $7 <= 0.5
3 lines where $7 <= 0.55
5 lines where $7 <= 0.6
5 lines where $7 <= 0.65
5 lines where $7 <= 0.7
5 lines where $7 <= 0.75
5 lines where $7 <= 0.8
5 lines where $7 <= 0.85
5 lines where $7 <= 0.9
5 lines where $7 <= 0.95
6 lines where $7 <= 1

How to determine statistical significance in shell

I would like to determine the statistical significance of my results using shell scripting. My input file shows the number of errors in each trial of 10000 observations. Part of it is listed below (using a threshold of at least 1 error):
ifile.txt
1
2
2
4
1
3
2
3
4
2
3
4
2
6
2
Then I calculated the probability of each error count as:
awk '{ count[$0]++; total++ }
END { for(i in count) printf("%d %.3f\n", i, count[i]/total) }' ifile.txt | sort -n > ofile.txt
where the first column in ofile.txt shows the number of errors and the 2nd column shows its probability
ofile.txt
1 0.133
2 0.400
3 0.200
4 0.200
6 0.067
Now I need to determine the statistical significance of this result, e.g. to highlight those results which are not statistically significant at the 1% level, i.e. we will accept those errors having p-value < 0.005, and if an error has p-value > 0.005 we will reject it.
I can't think of any method to do this in shell. Can anybody help or suggest something?
The desired output is something like:
outfile.txt
1 99999
2 0.400
3 0.200
4 0.200
6 99999
Here, I assumed the probability of showing 1 error is not statistically significant at the 1% level, but the probability of showing 2 errors is statistically significant, and so on.
With no statistics education or gnuplot experience, it's a bit difficult to decipher exactly the desired methods for a solution. The problem may not be described well enough or my knowledge is ill-equipped for it.
Either way, after looking at the relationships between the data presented and desired output, I came up with this Awk script to achieve it:
$ cat script.awk
function abs(v) { return v < 0 ? -v : v }
{ a[$0]++ }
END {
    obs = 10000
    sig = 1
    for (i in a) {
        r = a[i]/NR
        if (abs(r - sig/10) <= sig/20)
            print i, obs - sig
        else
            printf "%d %.3f\n", i, r
    }
}
$ awk -f script.awk ifile.txt | sort > outfile.txt
$ cat outfile.txt
1 9999
2 0.400
3 0.200
4 0.200
6 9999
This assumes the intended value in the 1st and 5th lines of the desired output was 9999 (10000 observations minus 1 error), not 99999.
Also, if using GNU Awk, the pipe into sort could be eliminated by using the built-in sort function asorti.

awk condition always TRUE in a loop [duplicate]

This question already has answers here:
How do I use shell variables in an awk script?
(7 answers)
Closed 7 years ago.
Good morning,
I'm sorry this question will seem trivial to some. It has been driving me mad for hours. My problem is the following:
I have these two files:
head <input file>
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 751756 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 1 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 753474 G C 1.14 0.009
rs2073813 1 753541 A G 0.85 0.0095
and
head <interval file>
1 112667912 114334946
1 116220516 117220516
1 160997252 161997252
1 198231312 199231314
2 60408994 61408994
2 64868452 65868452
2 99649474 100719272
2 190599907 191599907
2 203245673 204245673
2 203374196 204374196
I would like to use a bash script to remove all lines from the input file in which the BP column lies within an interval specified in the interval file and in which the CHR column matches the first column of the interval file.
Here is the code I've been working with (although a simpler solution would be welcomed):
while read interval; do
chr=$(echo $interval | awk '{print $1}')
START=$(echo $interval | awk '{print $2}')
STOP=$(echo $interval | awk '{print $3}')
awk '$2!=$chr {print} $2==$chr && ($3<$START || $3>$STOP) {print}' < input_file > tmp
mv tmp <input file>
done <
My problem is that no lines are removed from the input file, even though the command
awk '$2==1 && ($3>112667912 && $3<114334946) {print}' < input_file | wc -l
returns >4000 lines, so the matching lines clearly are in the input file.
Thank you very much for your help.
You can try perl instead of awk. The reason is that in perl you can create a hash of arrays to store the data of the interval file, and then look it up more easily when processing your input, like:
perl -lane '
    $. == 1 && next;
    @F == 3 && do {
        push @{$h{$F[0]}}, [@F[1..2]];
        next;
    };
    @F == 7 && do {
        $ok = 1;
        if (exists $h{$F[1]}) {
            for (@{$h{$F[1]}}) {
                if ($F[2] > $_->[0] and $F[2] < $_->[1]) {
                    $ok = 0;
                    last;
                }
            }
        }
        printf qq|%s\n|, $_ if $ok;
    };
' interval input
$. skips the header of the interval file. @F checks the number of columns, and the push creates the hash of arrays.
Your test data is not well suited because no line would be filtered out, so I changed it to:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 112667922 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 199231312 G C 1.14 0.009
rs2073813 2 204245670 A G 0.85 0.0095
So you can run it and get as result:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
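Since the question asked about awk, here is a hedged awk sketch of the same idea (my own variant, not a line-for-line translation of the perl): load the intervals into arrays on the first pass, then drop any input line whose CHR matches and whose BP falls strictly inside an interval. The names interval and input are the same hypothetical file names used above, recreated here with the modified test data:

```shell
# Recreate the modified test files from above (hypothetical names).
cat > interval <<'EOF'
1 112667912 114334946
1 116220516 117220516
1 160997252 161997252
1 198231312 199231314
2 60408994 61408994
2 64868452 65868452
2 99649474 100719272
2 190599907 191599907
2 203245673 204245673
2 203374196 204374196
EOF
cat > input <<'EOF'
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 112667922 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 199231312 G C 1.14 0.009
rs2073813 2 204245670 A G 0.85 0.0095
EOF

out=$(awk '
NR == FNR { chr[NR] = $1; lo[NR] = $2; hi[NR] = $3; n = NR; next } # 1st file: intervals
FNR == 1  { print; next }                                          # keep the header
{
    for (i = 1; i <= n; i++)                           # linear scan is fine for a
        if ($2 == chr[i] && $3 > lo[i] && $3 < hi[i])  # handful of intervals; index
            next                                       # by CHR if you have thousands
    print
}' interval input)
printf '%s\n' "$out"
```

This keeps a line unless its BP lies strictly inside a matching-chromosome interval, mirroring the strict > and < comparisons of the perl version.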

Add different value to each column in array

How can I add a different value to each column in a bash script?
Example: Three function f1(x) f2(x) f3(x) plotted over x
test.dat:
# x f1 f2 f3
1 0.1 0.01 0.001
2 0.2 0.02 0.002
3 0.3 0.03 0.003
Now I want to add a different offset value to each function
values = 1 2 3
Desired result:
# x f1 f2 f3
1 1.1 2.01 3.001
2 1.2 2.02 3.002
3 1.3 2.03 3.003
So the first column should be unaffected; to every other column the corresponding value should be added.
I tried this, but it doesn't work:
declare -a energy_array=( 1 2 3 )
for (( i=0 ; i < ${#energy_array[@]} ; i++ ))
do
local energy=${energy_array[${i}]}
cat "test.dat" \
| awk -v "offset=${energy}" \
'{ for(j=2; j<NF;j++) printf "%s",$j+offset OFS; if (NF) printf "%s",$NF; printf ORS} '
done
You can try the following:
declare -a energy_array=( 1 2 3 )
awk -v offset="${energy_array[*]}" '
    BEGIN { n = split(offset, a) }
    NR > 1 {
        for (j = 2; j <= NF; j++)
            $j = $j + a[j-1]
        print; next
    }
    1' test.dat
With output:
# x f1 f2 f3
1 1.1 2.01 3.001
2 1.2 2.02 3.002
3 1.3 2.03 3.003

Awk - extracting information from an xyz-format matrix

I have an x y z matrix of the format:
1 1 0.02
1 2 0.10
1 4 0.22
2 1 0.70
2 2 0.22
3 2 0.44
3 3 0.42
...and so on. I'm interested in summing all of the z values (column 3) for a particular x value (column 1) and printing the output on separate lines (with the x value as a prefix), such that the output for the previous example would appear as:
1 0.34
2 0.92
3 0.86
I have a strong feeling that awk is the right tool for the job, but knowledge of awk is really lacking and I'd really appreciate any help that anyone can offer.
Thanks in advance.
I agree that awk is a good tool for this job — this is pretty much exactly the sort of task it was designed for.
awk '{ sum[$1] += $3 } END { for (i in sum) print i, sum[i] }' data
For the given data, I got:
2 0.92
3 0.86
1 0.34
Clearly, you could pipe the output to sort -n to get the results in sorted order.
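For instance, with the question's matrix saved to a hypothetical file named data:

```shell
# Recreate the question's x y z matrix (hypothetical file name).
cat > data <<'EOF'
1 1 0.02
1 2 0.10
1 4 0.22
2 1 0.70
2 2 0.22
3 2 0.44
3 3 0.42
EOF

# Sum column 3 per column-1 key, then sort the groups numerically.
out=$(awk '{ sum[$1] += $3 } END { for (i in sum) print i, sum[i] }' data | sort -n)
printf '%s\n' "$out"
# 1 0.34
# 2 0.92
# 3 0.86
```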
To get that in sorted order with awk, you have to go outside the realm of POSIX awk and use the GNU awk extension function asorti:
gawk '{ sum[$1] += $3 }
END { n = asorti(sum, map); for (i = 1; i <= n; i++) print map[i], sum[map[i]] }' data
Output:
1 0.34
2 0.92
3 0.86
