awk condition always TRUE in a loop [duplicate] - bash

This question already has answers here:
How do I use shell variables in an awk script?
(7 answers)
Closed 7 years ago.
Good morning,
I'm sorry this question will seem trivial to some. It has been driving me mad for hours. My problem is the following:
I have these two files:
head <input file>
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 751756 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 1 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 753474 G C 1.14 0.009
rs2073813 1 753541 A G 0.85 0.0095
and
head <interval file>
1 112667912 114334946
1 116220516 117220516
1 160997252 161997252
1 198231312 199231314
2 60408994 61408994
2 64868452 65868452
2 99649474 100719272
2 190599907 191599907
2 203245673 204245673
2 203374196 204374196
I would like to use a bash script to remove from the input file all lines in which the BP column lies within an interval specified in the interval file and the CHR column matches the first column of the interval file.
Here is the code I've been working with (although a simpler solution would be welcomed):
while read interval; do
chr=$(echo $interval | awk '{print $1}')
START=$(echo $interval | awk '{print $2}')
STOP=$(echo $interval | awk '{print $3}')
awk '$2!=$chr {print} $2==$chr && ($3<$START || $3>$STOP) {print}' < input_file > tmp
mv tmp <input file>
done < <interval file>
My problem is that no lines are removed from the input file, even though the command
awk '$2==1 && ($3>112667912 && $3<114334946) {print}' < input_file | wc -l
returns >4000 lines, so the lines clearly are in the input file.
Thank you very much for your help.

You can try with perl instead of awk. The reason is that in perl you can create a hash of arrays to save the data of the interval file and look it up more easily when processing your input, like:
perl -lane '
    $. == 1 && next;
    @F == 3 && do {
        push @{$h{$F[0]}}, [@F[1..2]];
        next;
    };
    @F == 7 && do {
        $ok = 1;
        if (exists $h{$F[1]}) {
            for (@{$h{$F[1]}}) {
                if ($F[2] > $_->[0] and $F[2] < $_->[1]) {
                    $ok = 0;
                    last;
                }
            }
        }
        printf qq|%s\n|, $_ if $ok;
    };
' interval input
$. skips the header of the interval file. @F checks the number of columns, and the push creates the hash of arrays.
Your test data is not ideal because no line would be filtered out, so I changed it to:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 112667922 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 199231312 G C 1.14 0.009
rs2073813 2 204245670 A G 0.85 0.0095
So you can run it and get this result:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
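If you would rather stay with awk, the underlying problem in your loop is that $chr, $START and $STOP sit inside single quotes, so the shell never expands them (this is exactly what the linked duplicate covers). A minimal awk-only sketch, assuming the files are called interval_file and input_file (stand-ins for your placeholders) and that the interval bounds are inclusive, which reads the intervals first and then filters the data in one pass:
awk 'NR == FNR { chr[NR] = $1; lo[NR] = $2; hi[NR] = $3; n = NR; next }   # first file: store the intervals
     FNR == 1 { print; next }                                             # keep the header of the data file
     {
         keep = 1
         for (i = 1; i <= n; i++)                                         # drop the line if CHR matches and BP falls inside any interval
             if ($2 == chr[i] && $3 >= lo[i] && $3 <= hi[i]) { keep = 0; break }
         if (keep) print
     }' interval_file input_file > filtered_file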

Related

Passing variable grep command

Given text files (say foo*.txt) with data as follows:
1 g = 0.54 0.00
2 g = 0.32 0.00
3 g = 0.45 0.00
...
5000 g = 0.5 0.00
Basically, I want to extract the 10 lines before and after a matching line (including the matching line itself). The matching line is 59 characters long and contains letters, spaces and numbers.
I have a script as follow:
#!/usr/bin/bash
for file in file*.txt;
do
var=$(command_to_extract_var) # 59 characters containing strings, spaces and numbers
# to get this var, I use grep and head
grep -C 10 "$var" "$file"
done > bar.csv
Running the script with bash -x script_name.sh gives the following:
+ for file in 'foo*.txt'
++ grep 'match_pattern' foo1.txt
++ awk '{print $6}'
++ head -n1
++ grep '[0-9]'
+ basis=150
++ grep 'match_pattern' foo1.txt
++ tail -n1
++ awk '{print $3}'
+ number=25
++ grep '[0-9] f = ' foo.txt
++ tail -n150
This is followed by a large number of lines (up to 1000) like:
001 h = 0.000000000000000E+00 e = 3.543218084205956E+00
Finally,
File name too long
+ final=
+ grep -C 10 '' foo1.txt
The output I expect is (one column from each file):
0.54 0.62 0.36 ... 0.45
0.32 3.25 0.89 ... 0.25
0.45 0.96 0.14 ... 0.14
... .... .... ... 0.96
0.25 0.00 7.23 ... 0.77
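The trace suggests that $var ends up empty (note the bare final= line) or spans several lines, so the last grep matches everything. A minimal, more defensive sketch, assuming the 59-character pattern can be treated as a literal string and that match_pattern stands in for whatever you actually search on (both are assumptions, not the original script):
#!/usr/bin/env bash
for file in foo*.txt; do
    # hypothetical extraction: keep only the first matching line as the pattern
    var=$(grep -m1 'match_pattern' "$file")
    # -F treats $var as a fixed string (no regex metacharacters);
    # quoting "$var" keeps spaces from splitting it into several arguments
    [ -n "$var" ] && grep -F -C 10 -- "$var" "$file"
done > bar.csv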

Calculating mean from values in columns specified on the first line using awk

I have a huge file (hundreds of lines, ca. 4,000 columns) structured like this
locus 1 1 1 2 2 3 3 3
exon 1 2 3 1 2 1 2 3
data1 17.07 7.11 10.58 10.21 19.34 14.69 3.32 21.07
data2 21.42 11.46 7.88 9.89 27.24 12.40 0.58 19.82
and I need to calculate the mean of all values (on each data line separately) that share the same locus number (i.e., the same number in the first line), i.e.
data1: mean from first three values (three columns with locus '1':
17.07, 7.11, 10.58), next two values (10.21, 19.34) and next three values (14.69, 3.32, 21.07)
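For example, mean1 for data1 would be (17.07 + 7.11 + 10.58) / 3 ≈ 11.59, which the outputs shown in the answers below agree with.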
I would like to have output like this
data1 mean1 mean2 mean3
data2 mean1 mean2 mean3
I was thinking about using bash and awk...
Thank you for your advice.
You can use GNU datamash version 1.1.0 or newer (I used the latest version, 1.1.1):
#!/bin/bash
lines=$(wc -l < "$1")
datamash -W transpose < "$1" |
datamash -H groupby 1 mean 3-"$lines" |
datamash transpose
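In short: the first datamash call transposes the table so that each original row becomes a column, the second groups the transposed rows by the locus column and averages the data columns (3 through the original line count, i.e. one per data row), and the final call transposes the result back into the original orientation.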
Usage: mean_value.sh input.txt | column -t (column -t is only for pretty-printing; it is not required)
Output:
GroupBy(locus) 1 2 3
mean(data1) 11.586666666667 14.775 13.026666666667
mean(data2) 13.586666666667 18.565 10.933333333333
If it were me, I would use R, not awk:
library(data.table)
x = fread('data.txt')
#> x
# V1 V2 V3 V4 V5 V6 V7 V8 V9
#1: locus 1.00 1.00 1.00 2.00 2.00 3.00 3.00 3.00
#2: exon 1.00 2.00 3.00 1.00 2.00 1.00 2.00 3.00
#3: data1 17.07 7.11 10.58 10.21 19.34 14.69 3.32 21.07
#4: data2 21.42 11.46 7.88 9.89 27.24 12.40 0.58 19.82
# save first column of names for later
cnames = x$V1
# remove first column
x[,V1:=NULL]
# matrix transpose: makes rows into columns
x = t(x)
# convert back from matrix to data.table
x = data.table(x,keep.rownames=F)
# set the column names
colnames(x) = cnames
#> x
# locus exon data1 data2
#1: 1 1 17.07 21.42
#...
# ditch useless column
x[,exon:=NULL]
#> x
# locus data1 data2
#1: 1 17.07 21.42
# apply mean() function to each column, grouped by locus
x[,lapply(.SD,mean),locus]
# locus data1 data2
#1: 1 11.58667 13.58667
#2: 2 14.77500 18.56500
#3: 3 13.02667 10.93333
For convenience, here's the whole thing again without comments:
library(data.table)
x = fread('data.txt')
cnames = x$V1
x[,V1:=NULL]
x = t(x)
x = data.table(x,keep.rownames=F)
colnames(x) = cnames
x[,exon:=NULL]
x[,lapply(.SD,mean),locus]
awk 'NR==1 { for (i=2; i<NF+1; i++) multi[i]=$i }   # remember which locus each column belongs to
NR>2 {
    for (i in multi) {            # reset the per-locus accumulators for this data line
        data[multi[i]] = 0
        count[multi[i]] = 0
    }
    for (i=2; i<NF+1; i++) {      # sum the values column by column, grouped by locus
        data[multi[i]] += $i
        count[multi[i]] += 1
    }
    printf "%s ", $1
    for (i in data)               # print the mean for each locus group
        printf "%s ", data[i]/count[i]
    print ""
}' <file_name>
Replace <file_name> with the name of your data file.
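Note that for (i in data) visits array indices in an unspecified order, so the means may not be printed in ascending locus order on every awk implementation. A minimal sketch of one way to pin the order down in GNU awk (this is a gawk extension available since gawk 4.0, not POSIX awk) is to add at the top of the program:
BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }   # gawk only: iterate arrays in ascending numeric index order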

Passing for loop using non-integers to awk

I am trying to write code which will achieve:
Where $7 is less than or equal to $i (0 to 1 in increments of 0.05), print the line and pass it to wc to count the matches. The way I tried to do this was:
for i in $(seq 0 0.05 1); do awk '{if ($7 <= $i) print $0}' file.txt | wc -l ; done
This just ends up returning the line count of the full file (~40 million lines) for each value of $i, whereas, for example, with $7 <= 0.00 it should return ~67K.
I feel like there may be a way to do this within awk, but I have not seen any suggestions which allow for non-integers.
Thanks in advance.
Inside single quotes the shell never expands $i, so awk does not see your loop value. Pass $i to awk as a variable with -v, like so:
for i in $(seq 0 0.05 1); do awk -v i="$i" '{if ($7 <= i) print $0}' file.txt | wc -l ; done
Some made up data:
$ cat file.txt
1 2 3 4 5 6 7 a b c d e f
1 2 3 4 5 6 0.6 a b c
1 2 3 4 5 6 0.57 a b c d e f g h i j
1 2 3 4 5 6 1 a b c d e f g
1 2 3 4 5 6 0.21 a b
1 2 3 4 5 6 0.02 x y z
1 2 3 4 5 6 0.00 x y z l j k
One possible 100% awk solution:
awk '
BEGIN { line_count=0 }
{ printf "================= %s\n",$0
for (i=0; i<=20; i++)
{ if ($7 <= i/20)
{ printf "matching seq : %1.2f\n",i/20
line_count++
seq_count[i]++
next
}
}
}
END { printf "=================\n\n"
for (i=0; i<=20; i++)
{ if (seq_count[i] > 0)
{ printf "seq = %1.2f : %8s (count)\n",i/20,seq_count[i] }
}
printf "\nseq = all : %8s (count)\n",line_count
}
' file.txt
# the output:
================= 1 2 3 4 5 6 7 a b c d e f
================= 1 2 3 4 5 6 0.6 a b c
matching seq : 0.60
================= 1 2 3 4 5 6 0.57 a b c d e f g h i j
matching seq : 0.60
================= 1 2 3 4 5 6 1 a b c d e f g
matching seq : 1.00
================= 1 2 3 4 5 6 0.21 a b
matching seq : 0.25
================= 1 2 3 4 5 6 0.02 x y z
matching seq : 0.05
================= 1 2 3 4 5 6 0.00 x y z l j k
matching seq : 0.00
=================
seq = 0.00 : 1 (count)
seq = 0.05 : 1 (count)
seq = 0.25 : 1 (count)
seq = 0.60 : 2 (count)
seq = 1.00 : 1 (count)
seq = all : 6 (count)
BEGIN { line_count=0 } : initialize a total line counter
printf "================= %s\n",$0 : merely for debug purposes; prints every line from file.txt as it is processed
for (i=0; i<=20; i++) : depending on the implementation, some versions of awk may have rounding/accuracy problems with non-integer sequences (e.g., incrementing by 0.05), so we use whole integers for our sequence and divide by 20 (for this particular case) to get our 0.05 increments during the follow-on testing
$7 <= i/20 : if field #7 is less than or equal to (i/20) ...
printf "matching seq ... : print the sequence value we just matched on (i/20)
line_count++ : add '1' to our total line counter
seq_count[i]++ : add '1' to our sequence counter array
next : break out of our sequence loop (since we found our matching sequence value (i/20)) and process the next line in the file
END ... : print out our line counts
for (i=0; ...) / if / printf : loop through our array of sequences, printing the line count for each sequence (i/20)
printf "\nseq = all... : print out our total line count
NOTE: Some of the awk code can be further reduced but I'll leave this as is since it's a little easier to understand if you're new to awk.
One (obvious?) benefit of a 100% awk solution is that our sequence/looping construct is internal to awk, allowing us to limit ourselves to a single pass through the input file (file.txt); when the sequence/looping construct is outside of awk, we end up processing the input file once for each pass through the sequence/loop (e.g., for this exercise we would have to process the input file 21 times!).
Using a bit of guesswork as to what you actually want to accomplish, I came up with this:
awk '{ for (i=20; 20*$7<=i && i>0; i--) bucket[i]++ }
END { for (i=1; i<=20; i++) print bucket[i] " lines where $7 <= " i/20 }'
With the mock data from mark's second answer I get this output:
2 lines where $7 <= 0.05
2 lines where $7 <= 0.1
2 lines where $7 <= 0.15
2 lines where $7 <= 0.2
3 lines where $7 <= 0.25
3 lines where $7 <= 0.3
3 lines where $7 <= 0.35
3 lines where $7 <= 0.4
3 lines where $7 <= 0.45
3 lines where $7 <= 0.5
3 lines where $7 <= 0.55
5 lines where $7 <= 0.6
5 lines where $7 <= 0.65
5 lines where $7 <= 0.7
5 lines where $7 <= 0.75
5 lines where $7 <= 0.8
5 lines where $7 <= 0.85
5 lines where $7 <= 0.9
5 lines where $7 <= 0.95
6 lines where $7 <= 1
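To see why this works, take the line with $7 = 0.21 as a worked example: 20*$7 = 4.2, so the loop runs for i = 20 down to 5 and increments bucket[5] through bucket[20]. In other words, the line is counted once for every threshold i/20 from 0.25 upwards, which is exactly the set of thresholds it satisfies.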

Add different value to each column in array

How can I add a different value to each column in a bash script?
Example: three functions f1(x), f2(x), f3(x) plotted over x
test.dat:
# x f1 f2 f3
1 0.1 0.01 0.001
2 0.2 0.02 0.002
3 0.3 0.03 0.003
Now I want to add a different offset value to each function:
values = 1 2 3
Desired result:
# x f1 f2 f3
1 1.1 2.01 3.001
2 1.2 2.02 3.002
3 1.3 2.03 3.003
So the first column should be unaffected; the corresponding value should be added to each of the other columns.
I tried this, but it doesn't work:
declare -a energy_array=( 1 2 3 )
for (( i=0 ; i < ${#energy_array[@]} ; i++ ))
do
local energy=${energy_array[${i}]}
cat "test.dat" \
| awk -v "offset=${energy}" \
'{ for(j=2; j<NF;j++) printf "%s",$j+offset OFS; if (NF) printf "%s",$NF; printf ORS} '
done
You can try the following:
declare -a energy_array=( 1 2 3 )
awk -v offset="${energy_array[*]}" '
    BEGIN { n = split(offset, a) }     # offset is the string "1 2 3"; a[1..n] holds the per-column offsets
    NR > 1 {
        for (j = 2; j <= NF; j++)      # add the matching offset to every column except the first
            $j = $j + a[j-1]
        print; next
    } 1' test.dat
With output:
# x f1 f2 f3
1 1.1 2.01 3.001
2 1.2 2.02 3.002
3 1.3 2.03 3.003
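The trick here is that "${energy_array[*]}" expands the whole array to the single string 1 2 3, which split() then turns into a[1]=1, a[2]=2, a[3]=3 inside awk, so the file is processed in one awk pass instead of once per offset as in the original loop.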

Awk - extracting information from an xyz-format matrix

I have an x y z matrix of the format:
1 1 0.02
1 2 0.10
1 4 0.22
2 1 0.70
2 2 0.22
3 2 0.44
3 3 0.42
...and so on. I'm interested in summing all of the z values (column 3) for a particular x value (column 1) and printing the output on separate lines (with the x value as a prefix), such that the output for the previous example would appear as:
1 0.34
2 0.92
3 0.86
I have a strong feeling that awk is the right tool for the job, but my knowledge of awk is really lacking and I'd really appreciate any help that anyone can offer.
Thanks in advance.
I agree that awk is a good tool for this job — this is pretty much exactly the sort of task it was designed for.
awk '{ sum[$1] += $3 } END { for (i in sum) print i, sum[i] }' data
For the given data, I got:
2 0.92
3 0.86
1 0.34
Clearly, you could pipe the output to sort -n and get the results in sorted order after all.
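For example:
awk '{ sum[$1] += $3 } END { for (i in sum) print i, sum[i] }' data | sort -n
which prints the three x groups in ascending numeric order.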
To get that in sorted order with awk, you have to go outside the realm of POSIX awk and use the GNU awk extension function asorti:
gawk '{ sum[$1] += $3 }
END { n = asorti(sum, map); for (i = 1; i <= n; i++) print map[i], sum[map[i]] }' data
Output:
1 0.34
2 0.92
3 0.86
