Print only '+' or '-' if string matches (with two conditions) - bash

I would like to add two additional conditions to the code I have: print '+' only if, in File2, field 5 is greater than 35 and field 7 is also greater than 90.
Code:
while read -r line
do
grep -q "$line" File2.txt && echo "$line +" || echo "$line -"
done < File1.txt
Input file 1:
HAPS_0001
HAPS_0002
HAPS_0005
HAPS_0006
HAPS_0007
HAPS_0008
HAPS_0009
HAPS_0010
Input file 2 (tab-delimited):
Query DEG_ID E-value Score %Identity %Positive %Matching_Len
HAPS_0001 protein:plasmid:149679 3.00E-67 645 45 59 91
HAPS_0002 protein:plasmid:139928 4.00E-99 924 34 50 85
HAPS_0005 protein:plasmid:134646 3.00E-98 915 38 55 91
HAPS_0006 protein:plasmid:111988 1.00E-32 345 33 54 86
HAPS_0007 - - 0 0 0 0
HAPS_0008 - - 0 0 0 0
HAPS_0009 - - 0 0 0 0
HAPS_0010 - - 0 0 0 0
Desired output (tab-delimited):
HAPS_0001 +
HAPS_0002 -
HAPS_0005 +
HAPS_0006 -
HAPS_0007 -
HAPS_0008 -
HAPS_0009 -
HAPS_0010 -
Thanks!

This should work:
$ awk '
BEGIN {FS = OFS = "\t"}
NR==FNR {if($5>35 && $7>90) a[$1]++; next}    # 1st file (f2): remember the IDs that pass both tests
{print (($1 in a) ? $0 FS "+" : $0 FS "-")}   # 2nd file (f1): append "+" or "-"
' f2 f1
HAPS_0001 +
HAPS_0002 -
HAPS_0005 +
HAPS_0006 -
HAPS_0007 -
HAPS_0008 -
HAPS_0009 -
HAPS_0010 -

join File1.txt <(tail -n +2 File2.txt) | awk '
$2 = ($5 > 35 && $7 > 90)?"+":"-" { print $1, $2 }'
You don't care about the second field in the output, so overwrite it with the appropriate sign.
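If you'd rather keep the original while-read loop, here is a minimal sketch that adds the two field tests (assuming File2.txt is tab-delimited with the query ID in column 1, as in the sample). Note that it rescans File2.txt once per ID, so the one-pass awk answer above is faster on large files:
while read -r line
do
    # look the ID up in File2.txt and test fields 5 and 7
    if awk -F'\t' -v q="$line" '$1 == q && $5 > 35 && $7 > 90 { found = 1 } END { exit !found }' File2.txt
    then
        printf '%s\t+\n' "$line"
    else
        printf '%s\t-\n' "$line"
    fi
done < File1.txt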

Related

shell script for extracting line of file using awk

I want the selected lines of the file to be printed in the output file side by side, separated by spaces. Here is what I have done so far:
for file in SAC*
do
awk 'FNR==2 {print $4}' $file >>exp
awk 'FNR==3 {print $4}' $file >>exp
awk 'FNR==4 {print $4}' $file >>exp
awk 'FNR==5 {print $4}' $file >>exp
awk 'FNR==7 {print $4}' $file >>exp
awk 'FNR==8 {print $4}' $file >>exp
awk 'FNR==24 {print $0}' $file >>exp
done
My output is:
XV
AMPY
BHZ
2012-08-15T08:00:00
2013-12-31T23:59:59
I want output should be
XV AMPY BHZ 2012-08-15T08:00:00 2013-12-31T23:59:59
First the test data (only 9 rows, though):
$ cat file
1 2 3 14
1 2 3 24
1 2 3 34
1 2 3 44
1 2 3 54
1 2 3 64
1 2 3 74
1 2 3 84
1 2 3 94
Then the awk. There is no need for that for loop in the shell; awk can handle multiple files:
$ awk '
BEGIN {
ORS=" "
a[2];a[3];a[4];a[5];a[7];a[8] # list of records for which $4 should be output
}
FNR in a { print $4 } # output the $4s
FNR==9 { printf "%s\n",$0 } # replace 9 with 24
' file file # ... the files you want to process (SAC*)
24 34 44 54 74 84 1 2 3 94
24 34 44 54 74 84 1 2 3 94
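Applied to the original question (the SAC* glob, the exp output file and the line numbers are taken from the question; this assumes each header file has at least 24 lines, with line 24 being the last one you need):
awk '
BEGIN {
    ORS = " "
    a[2];a[3];a[4];a[5];a[7];a[8]    # lines whose 4th field we want
}
FNR in a  { print $4 }               # print the selected $4 values
FNR == 24 { printf "%s\n", $0 }      # print line 24 whole and end the output line
' SAC* >> exp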

How to use awk to search for min and max values of column in certain files

I know that awk is helpful for finding certain things in columns of files, but I'm not sure how to use it to find the min and max values of a column across a group of files. Any advice? To be specific, I have four files in a directory that I want to run through awk.
If you're looking for the absolute maximum and minimum of column N over all the files, then you might use:
N=6
awk -v N=$N 'NR == 1 { min = max = $N }
{ if ($N > max) max = $N; else if ($N < min) min = $N }
END { print min, max }' "$@"
You can change the column number using a command-line option or by editing the script (crude but effective; option handling is the nicer route), or any other method that takes your fancy.
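A minimal sketch of the option-handling route (the script name minmax.sh and the -c flag are illustrations of mine, not part of the original answer):
#!/bin/sh
# minmax.sh: print the min and max of one column across the given files
N=6                                      # default column
while getopts c: opt; do
    case $opt in
        c) N=$OPTARG ;;
        *) echo "usage: $0 [-c column] file ..." >&2; exit 1 ;;
    esac
done
shift $((OPTIND - 1))
awk -v N="$N" 'NR == 1 { min = max = $N }
    { if ($N > max) max = $N; else if ($N < min) min = $N }
    END { print min, max }' "$@"
Run it as, for example, sh minmax.sh -c 6 file1 file2 file3 file4.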
If you want the maximum and minimum of column N for each file, then you have to detect new files, and you probably want to identify the files, too:
awk -v N=$N 'FNR == 1 { if (NR != 1) print file, min, max; min = max = $N; file = FILENAME }
{ if ($N > max) max = $N; else if ($N < min) min = $N }
END { print file, min, max }' "$@"
Try this: it will give the min and max in the file. Simple:
awk 'BEGIN {max = 0} {if ($6>max) max=$6} END {print max}' yourfile.txt
or
awk 'BEGIN {min=1000000; max=0;}; { if($2<min && $2 != "") min = $2; if($2>max && $2 != "") max = $2; } END {print min, max}' file
or, a more awk-ish way (two passes over the file: the first finds the min and max of column 1, the second prints each line with field 2 replaced by column 1 rescaled into the 0-1 range):
awk 'NR==1 { max=$1 ; min=$1 }
FNR==NR { if ($1>=max) max=$1 ; $1<=min?min=$1:0 ; next}
{ $2=($1-min)/(max-min) ; print }' file file
sort can do the sorting, and you can pick out the first and last lines by any means, for example with awk:
sort -nk2 file{1..4} | awk 'NR==1{print "min:"$2} END{print "max:"$2}'
This sorts numerically by the second field of files file1, file2, file3, file4 and prints the min and max values.
Since you didn't provide any input files, here is a worked example for the files
==> file_0 <==
23 29 84
15 58 19
81 17 48
15 36 49
91 26 89
==> file_1 <==
22 63 57
33 10 50
56 85 4
10 63 1
72 10 48
==> file_2 <==
25 67 89
75 72 90
92 37 89
77 32 19
99 16 70
==> file_3 <==
50 93 71
10 20 55
70 7 51
19 27 63
44 3 46
If you run the script, now with a variable column number n,
n=1; sort -k${n}n file_{0..3} |
awk -v n=$n 'NR==1{print "min ("n"):",$n} END{print "max ("n"):",$n}'
you'll get
min (1): 10
max (1): 99
and for the other values of n
n=2; sort ...
min (2): 3
max (2): 93
n=3; sort ...
min (3): 1
max (3): 90

Awk flag to remove unwanted data

Another awk question.
I have a large text file that is divided into sections by lines of numerical values:
43 47
abc
efg
hig
21 122
hijk
lmnop
39 41
somemore
texthere
What I would like to do is print the text only if a condition is satisfied.
Here's what I have tried, with no luck:
awk '{a=$1; b=$2; if (a < 43 && a > 37 && b < 52 && b > 41) {f=1} elif (a > 43 && a < 37 && b > 52 && b < 41) {print; f=0} } f' file
I'd like to print all of the text if the condition is satisfied and skip it if the condition isn't satisfied.
Desired output from above:
43 47
abc
efg
hig
39 41
somemore
texthere
awk '
# on a line with 2 numbers:
NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
# set a flag if the numbers fall in the given ranges
f = (37 <= $1 && $1 <= 43 && 41 <= $2 && $2 <= 52)
}
f    # bare pattern with no action: print the record whenever the flag is set
' file
Self-explaining solution:
awk '
function inrange(x, a, b) { return a <= x && x <= b }
/^[0-9]+[\t ]+[0-9]/ {
f = inrange($1, 37, 43) && inrange($2, 41, 52)
}
f
' file
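If the bounds change often, they can be passed in with -v variables instead of being hard-coded; a small variation on the same idea (the names lo1/hi1/lo2/hi2 are my own):
awk -v lo1=37 -v hi1=43 -v lo2=41 -v hi2=52 '
function inrange(x, a, b) { return a <= x && x <= b }
/^[0-9]+[\t ]+[0-9]/ { f = inrange($1, lo1, hi1) && inrange($2, lo2, hi2) }
f
' file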

Shell script to find common values and write in particular pattern with subtraction math to range pattern

I need a shell script to get the common values in two files and write them in a range pattern to a new file, and also have the first value of each range reduced by 1.
$ cat file1
2
3
4
6
7
8
10
12
13
16
20
21
22
23
27
30
$ cat file2
2
3
4
8
10
12
13
16
20
21
22
23
27
Script that works:
awk 'NR==FNR{x[$1]=1} NR!=FNR && x[$1]' file1 file2 | sort | awk 'NR==1 {s=l=$1; next} $1!=l+1 {if(l == s) print l; else print s ":" l; s=$1} {l=$1} END {if(l == s) print l; else print s ":" l; s=$1}'
Script out:
2:4
8
10
12:13
16
20:23
27
Desired output:
1:4
8
10
11:13
16
19:23
27
Similar to sputnick's, except using comm to find the intersection of the file contents. (comm needs its inputs in lexicographic order, hence the per-file sorts; the sort -n afterwards restores numeric order before the ranges are built.)
comm -12 <(sort file1) <(sort file2) |
sort -n |
awk '
function print_range() {
if (start != prev)
printf "%d:", start-1
print prev
}
FNR==1 {start=prev=$1; next}
$1 > prev+1 {print_range(); start=$1}
{prev=$1}
END {print_range()}
'
1:4
8
10
11:13
16
19:23
27
Try doing this:
awk 'NR==FNR{x[$1]=1} NR!=FNR && x[$1]' file1 file2 |
sort |
awk 'NR==1 {s=l=$1; next}
$1!=l+1 {if(l == s) print l; else print (s-1) ":" l; s=$1}
{l=$1}
END {if(l == s) print l; else print (s-1) ":" l}'
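For completeness, the intersection and the range building can also be done in a single awk program; a sketch that assumes both files are already in numeric order, as in the example (the function name emit is mine):
awk '
function emit() {                          # print one range (or a lone value)
    if (start == prev) print prev
    else print (start - 1) ":" prev
}
NR == FNR      { seen[$1]; next }          # pass 1: remember every value in file1
!($1 in seen)  { next }                    # pass 2: keep only the common values
prev == ""     { start = prev = $1; next } # first common value opens a range
$1 != prev + 1 { emit(); start = $1 }      # gap found: close the current range
               { prev = $1 }
END            { if (prev != "") emit() }
' file1 file2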

Using awk create two arrays from two column values, find difference and sum differences, and output data

I have a file with the following fields (and an example value to the right):
hg18.ensGene.bin 0
hg18.ensGene.name ENST00000371026
hg18.ensGene.chrom chr1
hg18.ensGene.strand -
hg18.ensGene.txStart 67051161
hg18.ensGene.txEnd 67163158
hg18.ensGene.exonStarts 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932,
hg18.ensGene.exonEnds 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158,
hg18.ensGene.name2 ENSG00000152763
hg18.ensGene.exonFrames 0,2,0,0,1,2,0,0,1,1,1,2,1,2,0,2,0,
This is a shortened version of the file:
0 ENST00000371026 chr1 - 67051161 67163158 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158, ENSG00000152763 0,2,0,0,1,2,0,0,1,1,1,2,1,2,0,2,0, uc009waw.1,uc009wax.1,uc001dcx.1,
0 ENST00000371023 chr1 - 67075869 67163055 67075869,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163055, ENSG00000152763 0,1,1,1,2,1,2,0,2,0, uc001dcy.1
0 ENST00000395250 chr1 - 67075991 67163158 67075991,67076022,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67076018,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158, ENSG00000152763 0,0,1,1,1,2,0,-1,-1,-1,-1, n/a
I need to sum the differences between the exon starts and ends, for example:
hg18.ensGene.exonStarts 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932,
hg18.ensGene.exonEnds 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158,
difference:
1290,157,227,99,122,158,152,87,203,195,156,140,157,113,185,175,226
sum (hg18.ensGene.exonLenSum):
3842
And I would like the output to have the following fields:
hg18.ensGene.name
hg18.ensGene.name2
hg18.ensGene.exonLenSum
such as this:
ENST00000371026 ENST00000371023 3842
I would like to do this with one awk script for all lines in the input file. How can I do this? This is useful for calculating exon lengths, say for an RPKM (Reads Per Kilobase of exon Model per million mapped reads) calculation.
so ross$ awk -f gene.awk gene.dat
ENST00000371026 ENSG00000152763 3842
ENST00000371023 ENSG00000152763 1645
ENST00000395250 ENSG00000152763 1622
so ross$ cat gene.awk
/./ {
    name = $2                       # hg18.ensGene.name
    name2 = $9                      # hg18.ensGene.name2
    s = $7                          # hg18.ensGene.exonStarts (comma-terminated list)
    e = $8                          # hg18.ensGene.exonEnds (comma-terminated list)
    sc = split(s, sa, ",")
    ec = split(e, ea, ",")
    if (sc != ec) {
        print "starts != ends ", name, name2, sc, ec
    }
    diffsum = 0
    for (i = 1; i <= sc; ++i) {
        diffsum += ea[i] - sa[i]    # exon length = end - start
    }
    print name, name2, diffsum
}
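Note that gene.awk assumes the flattened one-record-per-line form shown above, where name is $2, exonStarts is $7, exonEnds is $8 and name2 is $9. If tab-delimited output is preferred, to match the input, the final print can be swapped for a printf:
printf "%s\t%s\t%d\n", name, name2, diffsum    # instead of: print name, name2, diffsum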
Using the UCSC MySQL anonymous server:
mysql -N -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18 -e 'select name,name2,exonStarts,exonEnds from ensGene' |\
awk -F ' ' '{n=split($3,a1,"[,]"); split($4,a2,"[,]"); size=0; for(i=1;i<=n;++i) {size+=int(a2[i]-a1[i]);} printf("%s\t%s\t%d\n",$1,$2,size); }'
result:
ENST00000404059 ENSG00000219789 632
ENST00000326632 ENSG00000146556 1583
ENST00000408384 ENSG00000221311 138
ENST00000409575 ENSG00000222003 1187
ENST00000409981 ENSG00000222027 1187
ENST00000359752 ENSG00000197490 126
ENST00000379479 ENSG00000205292 873
ENST00000326183 ENSG00000177693 918
ENST00000407826 ENSG00000219467 2820
ENST00000405199 ENSG00000220902 1231
(...)
