Print only '+' or '-' if string matches (with two conditions) - bash
I would like to add two additional conditions to the code I have: print '+' only when, in File2, field 5 is greater than 35 and field 7 is greater than 90.
Code:
while read -r line
do
grep -q "$line" File2.txt && echo "$line +" || echo "$line -"
done < File1.txt
Input file 1:
HAPS_0001
HAPS_0002
HAPS_0005
HAPS_0006
HAPS_0007
HAPS_0008
HAPS_0009
HAPS_0010
Input file 2 (tab-delimited):
Query DEG_ID E-value Score %Identity %Positive %Matching_Len
HAPS_0001 protein:plasmid:149679 3.00E-67 645 45 59 91
HAPS_0002 protein:plasmid:139928 4.00E-99 924 34 50 85
HAPS_0005 protein:plasmid:134646 3.00E-98 915 38 55 91
HAPS_0006 protein:plasmid:111988 1.00E-32 345 33 54 86
HAPS_0007 - - 0 0 0 0
HAPS_0008 - - 0 0 0 0
HAPS_0009 - - 0 0 0 0
HAPS_0010 - - 0 0 0 0
Desired output (tab-delimited):
HAPS_0001 +
HAPS_0002 -
HAPS_0005 +
HAPS_0006 -
HAPS_0007 -
HAPS_0008 -
HAPS_0009 -
HAPS_0010 -
Thanks!
This should work:
$ awk '
BEGIN {FS = OFS = "\t"}
NR==FNR {if($5>35 && $7>90) a[$1]++; next}
{print (($1 in a) ? $0 FS "+" : $0 FS "-")}' f2 f1
HAPS_0001 +
HAPS_0002 -
HAPS_0005 +
HAPS_0006 -
HAPS_0007 -
HAPS_0008 -
HAPS_0009 -
HAPS_0010 -
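If you'd rather keep the original while read loop, here's a rough sketch of the same check in plain bash plus awk (assuming File2.txt is tab-delimited exactly as shown; the 35 and 90 thresholds are hard-coded):

while read -r id
do
    # awk exits 0 only when File2.txt has a row whose first field is $id
    # and whose fields 5 and 7 clear the two thresholds
    if awk -F '\t' -v id="$id" '$1 == id && $5 > 35 && $7 > 90 { found = 1 } END { exit !found }' File2.txt
    then
        printf '%s\t+\n' "$id"
    else
        printf '%s\t-\n' "$id"
    fi
done < File1.txt

Note that this rereads File2.txt once per line of File1.txt, so the single-pass awk above is the better choice for large files.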
join file1.txt <( tail -n +2 file2.txt) | awk '
$2 = ($5 > 35 && $7 > 90)?"+":"-" { print $1, $2 }'
You don't care about the second field in the output, so overwrite it with the appropriate sign for the output.
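One caveat worth adding (not part of the answer itself): join expects both inputs to be sorted on the join field, which the sample files already are, and print $1, $2 emits a space-separated result unless OFS is set to a tab. With the question's filenames it would presumably read:

join File1.txt <( tail -n +2 File2.txt ) | awk '
BEGIN { OFS = "\t" }                    # tab-delimited output, as in the desired result
$2 = ($5 > 35 && $7 > 90) ? "+" : "-" { print $1, $2 }'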
Related
shell script for extracting line of file using awk
I want the selected lines of a file to be printed to the output file side by side, separated by a space. Here is what I have done so far:

for file in SAC*
do
    awk 'FNR==2 {print $4}' $file >> exp
    awk 'FNR==3 {print $4}' $file >> exp
    awk 'FNR==4 {print $4}' $file >> exp
    awk 'FNR==5 {print $4}' $file >> exp
    awk 'FNR==7 {print $4}' $file >> exp
    awk 'FNR==8 {print $4}' $file >> exp
    awk 'FNR==24 {print $0}' $file >> exp
done

My output is:

XV
AMPY
BHZ
2012-08-15T08:00:00
2013-12-31T23:59:59

I want the output to be:

XV AMPY BHZ 2012-08-15T08:00:00 2013-12-31T23:59:59
First the test data (only 9 rows, tho):

$ cat file
1 2 3 14
1 2 3 24
1 2 3 34
1 2 3 44
1 2 3 54
1 2 3 64
1 2 3 74
1 2 3 84
1 2 3 94

Then the awk. No need for that for loop in shell, awk can handle multiple files:

$ awk '
BEGIN {
    ORS=" "
    a[2];a[3];a[4];a[5];a[7];a[8]   # list of records for which $4 should be output
}
FNR in a { print $4 }               # output the $4s
FNR==9   { printf "%s\n",$0 }       # replace 9 with 24
' file file                         # ... the files you want to process (SAC*)
24 34 44 54 74 84 1 2 3 94
24 34 44 54 74 84 1 2 3 94
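Following the comments in that script, the version adapted to the question (fields from lines 2, 3, 4, 5, 7 and 8, the whole of line 24, files SAC*) would presumably look like:

awk '
BEGIN {
    ORS=" "
    a[2];a[3];a[4];a[5];a[7];a[8]   # records whose 4th field we want
}
FNR in a { print $4 }               # print the 4th field of those records
FNR==24  { printf "%s\n", $0 }      # print all of record 24 and end the output line
' SAC* > exp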
How to use awk to search for min and max values of column in certain files
I know that awk is helpful for finding certain things in columns in files, but I'm not sure how to use it to find the min and max values of a column in a group of files. Any advice? To be specific, I have four files in a directory that I want to run through awk.
If you're looking for the absolute maximum and minimum of column N over all the files, then you might use:

N=6
awk -v N=$N 'NR == 1 { min = max = $N }
             { if ($N > max) max = $N; else if ($N < min) min = $N }
             END { print min, max }' "$@"

You can change the column number using a command line option or by editing the script (crude, but effective; go with option handling), or any other method that takes your fancy.

If you want the maximum and minimum of column N for each file, then you have to detect new files, and you probably want to identify the files, too:

awk -v N=$N 'FNR == 1 { if (NR != 1) print file, min, max; min = max = $N; file = FILENAME }
             { if ($N > max) max = $N; else if ($N < min) min = $N }
             END { print file, min, max }' "$@"
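To turn the column number into a real command line option, as suggested above, a minimal wrapper script (hypothetical name minmax.sh) might look like:

#!/bin/sh
# minmax.sh COLUMN FILE...  -- overall min and max of COLUMN across the given files
# (hypothetical wrapper around the first awk one-liner above)
N=${1:?usage: minmax.sh COLUMN FILE...}
shift
awk -v N="$N" '
    NR == 1 { min = max = $N }
    { if ($N > max) max = $N; else if ($N < min) min = $N }
    END { print min, max }
' "$@"

It would be invoked as, e.g., ./minmax.sh 6 file1 file2 file3 file4.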
Try this; it will give the min and max in the file, comma separated. Simple:

awk 'BEGIN {max = 0} {if ($6>max) max=$6} END {print max}' yourfile.txt

or

awk 'BEGIN {min=1000000; max=0;}; { if($2<min && $2 != "") min = $2; if($2>max && $2 != "") max = $2; } END {print min, max}' file

or a more awkish way:

awk 'NR==1 { max=$1 ; min=$1 } FNR==NR { if ($1>=max) max=$1 ; $1<=min?min=$1:0 ; next} { $2=($1-min)/(max-min) ; print }' file file
sort can do the sorting and you can pick up the first and last by any means, for example, with awk:

sort -nk2 file{1..4} | awk 'NR==1{print "min:"$2} END{print "max:"$2}'

sorts numerically by the second field of files file1, file2, file3, file4 and prints the min and max values.

Since you didn't provide any input files, here is a worked example, for the files

==> file_0 <==
23 29 84
15 58 19
81 17 48
15 36 49
91 26 89

==> file_1 <==
22 63 57
33 10 50
56 85 4
10 63 1
72 10 48

==> file_2 <==
25 67 89
75 72 90
92 37 89
77 32 19
99 16 70

==> file_3 <==
50 93 71
10 20 55
70 7 51
19 27 63
44 3 46

If you run the script, now with a variable column number n

n=1; sort -k${n}n file_{0..3} | awk -v n=$n 'NR==1{print "min ("n"):",$n} END{print "max ("n"):",$n}'

you'll get

min (1): 10
max (1): 99

and for the other values of n

n=2; sort ...
min (2): 3
max (2): 93

n=3; sort ...
min (3): 1
max (3): 90
Awk flag to remove unwanted data
Another awk question. I have a large text file that is separated by numerical values:

43 47
abc
efg
hig
21 122
hijk
lmnop
39 41
somemore
texthere

What I would like to do is print the text only if a condition is satisfied. Here's what I have tried, with no luck:

awk '{a=$1; b=$2;
      if (a < 43 && a > 37 && b < 52 && b > 41) {f=1}
      elif (a > 43 && a < 37 && b > 52 && b < 41) {print; f=0}
     } f' file

I'd like to print all of the text if the statement is satisfied, and I'd like to skip the text if the statement isn't satisfied. Desired output from above:

43 47
abc
efg
hig
39 41
somemore
texthere
awk '
# on a line with 2 numbers:
NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
    # set a flag if the numbers fall in the given ranges
    f = (37 <= $1 && $1 <= 43 && 41 <= $2 && $2 <= 52)
}
f
' file
Self-explaining solution:

awk '
function inrange(x, a, b) { return a <= x && x <= b }
/^[0-9]+[\t ]+[0-9]/ {
    f = inrange($1, 37, 43) && inrange($2, 41, 52)
}
f
'
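Either answer can be checked against the sample input from the question; with that input saved in a file called data (the name is just for this test), the first script reproduces the desired output:

$ awk 'NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { f = (37 <= $1 && $1 <= 43 && 41 <= $2 && $2 <= 52) } f' data
43 47
abc
efg
hig
39 41
somemore
texthere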
Shell script to find common values and write in particular pattern with subtraction math to range pattern
Shell script to get common values in two files and write them in a pattern to a new file, AND also have the first value of the range pattern be subtracted by 1.

$ cat file1
2
3
4
6
7
8
10
12
13
16
20
21
22
23
27
30

$ cat file2
2
3
4
8
10
12
13
16
20
21
22
23
27

Script that works:

awk 'NR==FNR{x[$1]=1} NR!=FNR && x[$1]' file1 file2 | sort |
awk 'NR==1 {s=l=$1; next}
     $1!=l+1 {if(l == s) print l; else print s ":" l; s=$1}
     {l=$1}
     END {if(l == s) print l; else print s ":" l; s=$1}'

Script output:

2:4
8
10
12:13
16
20:23
27

Desired output:

1:4
8
10
11:13
16
19:23
27
Similar to sputnick's, except using comm to find the intersection of the file contents.

comm -12 <(sort file1) <(sort file2) | sort -n | awk '
    function print_range() {
        if (start != prev)
            printf "%d:", start-1
        print prev
    }
    FNR==1 {start=prev=$1; next}
    $1 > prev+1 {print_range(); start=$1}
    {prev=$1}
    END {print_range()}
'
1:4
8
10
11:13
16
19:23
27
Try doing this:

awk 'NR==FNR{x[$1]=1} NR!=FNR && x[$1]' file1 file2 | sort |
awk 'NR==1 {s=l=$1; next}
     $1!=l+1 {if(l == s) print l; else print s -1 ":" l; s=$1}
     {l=$1}
     END {if(l == s) print l; else print s -1 ":" l; s=$1}'
Using awk create two arrays from two column values, find difference and sum differences, and output data
I have a file with the following fields (and an example value to the right of each):

hg18.ensGene.bin         0
hg18.ensGene.name        ENST00000371026
hg18.ensGene.chrom       chr1
hg18.ensGene.strand      -
hg18.ensGene.txStart     67051161
hg18.ensGene.txEnd       67163158
hg18.ensGene.exonStarts  67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932,
hg18.ensGene.exonEnds    67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158,
hg18.ensGene.name2       ENSG00000152763
hg18.ensGene.exonFrames  0,2,0,0,1,2,0,0,1,1,1,2,1,2,0,2,0,

This is a shortened version of the file (one record per line):

0 ENST00000371026 chr1 - 67051161 67163158 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158, ENSG00000152763 0,2,0,0,1,2,0,0,1,1,1,2,1,2,0,2,0, uc009waw.1,uc009wax.1,uc001dcx.1,
0 ENST00000371023 chr1 - 67075869 67163055 67075869,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163055, ENSG00000152763 0,1,1,1,2,1,2,0,2,0, uc001dcy.1
0 ENST00000395250 chr1 - 67075991 67163158 67075991,67076022,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67076018,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158, ENSG00000152763 0,0,1,1,1,2,0,-1,-1,-1,-1, n/a

I need to sum the differences of the exon starts and ends, for example:

hg18.ensGene.exonStarts  67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932,
hg18.ensGene.exonEnds    67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158,
difference: 1290,157,227,99,122,158,152,87,203,195,156,140,157,113,185,175,226
sum (hg18.ensGene.exonLenSum): 3842

And I would like the output to have the following fields:

hg18.ensGene.name hg18.ensGene.name2 hg18.ensGene.exonLenSum

such as this:

ENST00000371026 ENST00000371023 3842

I would like to do this with one awk script for all lines in the input file. How can I do this? This is useful for calculating exon lengths, say for an RPKM (Reads Per Kilobase of exon Model per million mapped reads) calculation.
so ross$ awk -f gene.awk gene.dat
ENST00000371026 ENSG00000152763 3842
ENST00000371023 ENSG00000152763 1645
ENST00000395250 ENSG00000152763 1622
so ross$ cat gene.awk
/./ {
    name = $2
    name2 = $9
    s = $7
    e = $8
    sc = split(s, sa, ",")
    ec = split(e, ea, ",")
    if (sc != ec) {
        print "starts != ends ", name, name2, sc, ec
    }
    diffsum = 0
    for (i = 1; i <= sc; ++i) {
        diffsum += ea[i] - sa[i]
    }
    print name, name2, diffsum
}
Using the UCSC mysql anonymous server:

mysql -N -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18 -e 'select name,name2,exonStarts,exonEnds from ensGene' |\
awk -F ' ' '{n=split($3,a1,"[,]"); split($4,a2,"[,]"); size=0;
  for(i=1;i<=n;++i) {size+=int(a2[i]-a1[i]);}
  printf("%s\t%s\t%d\n",$1,$2,size); }'

Result:

ENST00000404059 ENSG00000219789 632
ENST00000326632 ENSG00000146556 1583
ENST00000408384 ENSG00000221311 138
ENST00000409575 ENSG00000222003 1187
ENST00000409981 ENSG00000222027 1187
ENST00000359752 ENSG00000197490 126
ENST00000379479 ENSG00000205292 873
ENST00000326183 ENSG00000177693 918
ENST00000407826 ENSG00000219467 2820
ENST00000405199 ENSG00000220902 1231
(...)