blast tab-delimited output extraction - sorting

I have BLAST results like this:
GCA_001188035.1_1 GCA_001188035.1_1 100.00 159 0 0 1 159 1 159 8e-113 324
GCA_001188035.1_1 GCF_000878595.1_1595 100.00 159 0 0 1 159 853 1011 2e-104 327
GCA_001188035.1_1 GCA_001267965.1_78 100.00 159 0 0 1 159 853 1011 2e-104 327
I want to extract rows from the above result based on the 3rd column (<=90) using awk. Please help me.

awk '$3<=90' File
If the 3rd column is less than or equal to 90, print the line.
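As a minimal sketch of the same filter with the threshold passed in as a variable: the function name filter_by_identity and the three sample rows are made up for illustration, and the `+ 0` forces a numeric comparison of the 3rd field.

```shell
# filter_by_identity MAX: keep input rows whose 3rd field is numerically <= MAX.
# (Hypothetical helper name; the rows below are invented sample data.)
filter_by_identity() {
    awk -v max="$1" '$3 + 0 <= max + 0'
}

printf '%s\n' \
    'q1 s1 100.00 159 0 0 1 159 1 159 8e-113 324' \
    'q1 s2 90.00 159 16 0 1 159 853 1011 2e-60 200' \
    'q1 s3 85.53 159 23 0 1 159 853 1011 1e-40 150' |
    filter_by_identity 90
```

With these rows, only the s2 and s3 lines are printed; the 100.00 row is dropped.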


Matching Column numbers from two different txt files

I have two text files of different sizes. The first one, example1.txt, shown below, has only one column of numbers:
101
102
103
104
111
120
120
125
131
131
131
131
131
131
The second text file, example2.txt, has two columns:
101 3
102 3
103 3
104 4
104 4
111 5
120 1
120 1
125 2
126 2
127 2
128 2
129 2
130 2
130 2
130 2
131 10
131 10
131 10
131 10
131 10
131 10
132 10
The first column of example1.txt is a subset of the first column of example2.txt. The second column of example2.txt holds the values associated with the first column.
What I want is to get, for each number in example1.txt, the associated second-column value from example2.txt. I have tried but couldn't figure it out yet. Any suggestions or solutions in bash or awk would be appreciated.
Therefore the result would be:
101 3
102 3
103 3
104 4
111 5
120 1
120 1
125 2
131 10
131 10
131 10
131 10
131 10
131 10
UPDATE:
I have been trying to do the column matching like this:
awk -F'|' 'NR==FNR{c[$1]++;next};c[$1] > 0' example1.txt example2.txt > output.txt
In both files, the first column is in ascending order, but the same number may not occur equally often in each. For example, 104 occurs once in example1.txt but twice in example2.txt. The important thing is that each number in example1.txt gets the same associated second-column value that it has in example2.txt. See the expected output.
$ awk 'NR==FNR{a[$1]++; next} ($1 in a) && b[$1]++ < a[$1]' f1 f2
101 3
102 3
103 3
104 4
111 5
120 1
120 1
125 2
131 10
131 10
131 10
131 10
131 10
131 10
This solution doesn't make use of the fact that the first column is in ascending order. Perhaps some optimization can be done based on that.
($1 in a) && b[$1]++ < a[$1] is the main difference from your solution. This checks if the field exists as well as that the count doesn't exceed that of the first file.
Also, not sure why you set the field separator as | because there is no such character in the sample given.
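To make the counting logic easier to follow, here is a commented sketch of the same idea on a cut-down input. The function name match_counts, the array names want and seen, and the temporary file names are chosen for illustration only.

```shell
# match_counts FILE1 FILE2: print rows of FILE2 whose first field occurs in
# FILE1, at most as many times as that value occurs in FILE1.
match_counts() {
    awk '
        NR == FNR { want[$1]++; next }          # first file: count each key
        ($1 in want) && seen[$1]++ < want[$1]   # second file: print until the quota is used up
    ' "$1" "$2"
}

dir=$(mktemp -d)
printf '%s\n' 104 120 120 > "$dir/f1"
printf '%s\n' '104 4' '104 4' '111 5' '120 1' '120 1' > "$dir/f2"
match_counts "$dir/f1" "$dir/f2"
rm -r "$dir"
```

Here 104 is printed once (it occurs once in the first file), 111 is skipped, and 120 is printed twice.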

if parts of two columns match in a file, copy row and move to new file

I have a blastn output file with tens of thousands of rows. I'm only interested in rows where part of the query sequence ID does not match part of the subject sequence ID, and I'd like to put those rows into a new text file. Here is an excerpt of the massive output file from which I want to extract information, as an example:
qseqid qlen qstart qend sseqid slen sstart send evalue bitscore length pident nident mismatch gaps
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 121 679 OFAS003927-RA-EXON03_Anisoscelini_Anisoscelis_flavolineatus_CMF_0018_S7_L005_UQ_trinity_assembled 557 1 557 0 832 562 93.594 526 28 8
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 155 650 OFAS003927-RA-EXON03_Placoscelini_Plaxiscelis_limbata_CMF_0072_S29_L005_UQ_trinity_assembled 820 327 819 0 808 496 96.169 477 16 3
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 222 686 OFAS003927-RA-EXON03_Anisoscelini_Leptoscelis_tricolor_CMF_0079_S32_L005_UQ_trinity_assembled 465 1 465 0 793 465 97.419 453 12 0
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 429 635 OFAS003927-RA-EXON03B_Clavigrallini_Clavigralla_sp_CMF_0335_S81_L005_UQ_trinity_assembled 655 1 207 4.30E-87 316 207 94.203 195 12 0
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 531 629 OFAS003927-RA-EXON07_Mictini_Anoplocnemis_sp_CMF_0052_S20_L005_UQ_trinity_assembled 668 1 99 9.92E-39 156 99 94.949 94 5 0
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 0 1286 696 100 696 0 0
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_declivis_CMF_0069_S26_L005_UQ_trinity_assembled 1060 332 1025 0 1212 696 98.132 683 11 2
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_thomasi_CMF_0028_S13_L005_UQ_trinity_assembled 814 50 745 0 1147 698 96.418 673 21 4
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 695 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_confraterna_CMF_0123_S44_L005_UQ_trinity_assembled 1313 578 1274 0 1131 699 95.994 671 22 6
qseqid = query sequence ID
sseqid = subject sequence ID
What should match between the two IDs in each row is the OFAS#-RA-EXON# part. When it doesn't, e.g., in the 4th and 5th rows, I want to extract the entire row and place it into a new text file. I know some regex pattern will need to be employed, but how to indicate columns and search on a per-row basis isn't clear to me.
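This compares the first 21 characters of the two IDs, i.e. the 20-character OFAS#-RA-EXON# prefix of the sample plus the underscore after it, so that EXON03 and EXON03B count as different. Note that substr() positions start at 1 in awk:
tail -n+2 input.txt | awk '{ if( substr($1,1,21) != substr($5,1,21)) { print $0 } }'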
Regards!
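If the prefix is not always the same length, a variant that compares everything before the first underscore may be more robust. This is only a sketch: it assumes the OFAS#-RA-EXON# part never contains an underscore itself, and the function name extract_mismatches and the shortened IDs are invented for the example.

```shell
# extract_mismatches FILE: print data rows (header skipped) whose query ID ($1)
# and subject ID ($5) differ in the part before the first underscore.
extract_mismatches() {
    awk 'NR > 1 {
        split($1, q, "_"); split($5, s, "_")
        if (q[1] != s[1]) print
    }' "$1"
}

tmp=$(mktemp)
printf '%s\n' \
    'qseqid qlen qstart qend sseqid slen' \
    'OFAS003927-RA-EXON03_sampleA 744 121 679 OFAS003927-RA-EXON03_sampleB 557' \
    'OFAS003927-RA-EXON03_sampleA 744 429 635 OFAS003927-RA-EXON03B_sampleC 655' \
    'OFAS003927-RA-EXON03_sampleA 744 531 629 OFAS003927-RA-EXON07_sampleD 668' > "$tmp"
extract_mismatches "$tmp"
rm "$tmp"
```

Redirecting the function's output (for example `> mismatches.txt`, a file name picked here for illustration) puts the selected rows into a new file.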

increasing range parsing challenge with awk

I wrote this in response to Reddit's daily programmer challenge, and I would like to get some of your feedback on it to improve the code (it seems to work). The challenge is as follows:
We are given a list of numbers in a "short-hand" range notation where only the significant part of the next number is written because we know the numbers are always increasing (ex. "1,3,7,2,4,1" represents [1, 3, 7, 12, 14, 21]). Some people use different separators for their ranges (ex. "1-3,1-2", "1:3,1:2", "1..3,1..2" represent the same numbers [1, 2, 3, 11, 12]) and they sometimes specify a third digit for the range step (ex. "1:5:2" represents [1, 3, 5]).
NOTE: For this challenge range limits are always inclusive.
Our job is to return a list of the complete numbers.
The possible separators are: ["-", ":", ".."]
Sample input:
104..02
545,64:11
Sample output:
104 105 106...200 201 202 # truncated for simplicity
545 564 565 566...609 610 611 # truncated for simplicity
My solution:
BEGIN { FS = "," }
function next_value(current_value, previous_value) {
    regexp = current_value "$"
    while(current_value <= previous_value || !(current_value ~ regexp)) {
        current_value += 10
    }
    return current_value;
}
{
    j = 0
    delete number_list
    for(i = 1; i <= NF; i++) {
        # handle fields with ranges
        if($i ~ /-|:|\.\./) {
            split($i, range, /-|:|\.\./)
            if(range[1] > range[2]) {
                if(j != 0) {
                    range[1] = next_value(range[1], number_list[j-1])
                    range[2] = next_value(range[2], range[1])
                }
                else
                    range[2] = next_value(range[2], range[1]);
            }
            if(range[3] == "")
                number_to_iterate_by = 1;
            else
                number_to_iterate_by = range[3];
            range_iterator = range[1]
            while(range_iterator <= range[2]) {
                number_list[j] = range_iterator
                range_iterator += number_to_iterate_by
                j++
            }
        }
        else {
            number_list[j] = $i
            j++
        }
    }
    # apply increasing range logic and print
    for(i = 0; i < j; i++ ) {
        if(i == 0) {
            if(NR != 1) printf "\n"
            current_value = number_list[i]
        }
        else {
            previous_value = current_value
            current_value = next_value(number_list[i], previous_value)
        }
        printf "%s ", current_value
    }
}
END { printf "\n" }
This is BASH (Not AWK).
I believe it is a valid answer because the original challenge doesn't specify a language.
#!/bin/bash
mkord(){ local v=$1 dig base
    max=$2
    (( dig=10**${#v} , base=max/dig*dig , v+=base ))
    while (( v < max )); do (( v+=dig )); done
    max=$v
}
while read line; do
    line="${line//[,\"]/ }" line="${line//[:-]/..}"
    IFS=' ' read -a arr <<<"$line"
    max=0 a='' res=''
    for val in "${arr[@]//../ }"; do
        IFS=" " read v1 v2 v3 <<<"$val"
        (( a==0 )) && max=$v1
        [[ $v1 ]] && mkord "$v1" "$max" && v1=$max
        [[ $v2 ]] && mkord "$v2" "$max" && v2=$max
        res=$res${a:+,}${v2:+\{}$v1${v2:+\.\.}$v2${v3:+\.\.}$v3${v2:+\}}
        a=1
    done
    (( ${#arr[@]} > 1 )) && res={$res}
    eval set -- $res
    echo "\"$*\""
done <"infile"
If the source of the tests is:
$ cat infile
"1,3,7,2,4,1"
"1-3,1-2"
"1:5:2"
"104-2"
"104..02"
"545,64:11"
The result will be:
"1 3 7 12 14 21"
"1 2 3 11 12"
"1 3 5"
"104 105 106 107 108 109 110 111 112"
"104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202"
"545 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611"
This gets the list done in 7 milliseconds.
My solution uses gawk. It relies on RT (which holds the input text that matched the record separator RS) and on a next_n function that uses the modulo operation to find the next number based on the last one.
cat range.awk
BEGIN{
    RS="\\.\\.|,|:|-"
    start = ""
    end = 0
    temp = ""
}
function next_n(n, last){
    mod = last % (10**length(n))
    if(mod < n) return last - mod + n
    return last + ((10**length(n))-mod) + n
}
{
    if(RT==":" || RT==".." || RT=="-"){
        if(start=="") start = next_n($1,end)
        else temp = $1
    }else{
        if(start != ""){
            if(temp==""){
                end = next_n($1,start)
                step = 1
            }else {
                end = next_n(temp,start)
                step = $1
            }
            for(i=start; i<=end; i+=step) printf "%s ", i
            start = ""
            temp = ""
        }else{
            end = next_n($1,end)
            printf "%s ", end
        }
    }
}
END{
    print ""
}
TEST 1
echo "104..02" | awk -f range.awk
OUTPUT 1
104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202
TEST 2
echo "545,64:11" | awk -f range.awk
OUTPUT 2
545 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611
TEST 3
echo "2..5,7,2-1,2:1,0-3,2-7,8..0,4,4,2..1" | awk -f range.awk
OUTPUT 3
2 3 4 5 7 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 40 41 42 43 52 53 54 55 56 57 58 59 60 64 74 82 83 84 85 86 87 88 89 90 91
TEST 4 with step
echo "1:5:2,99,88..7..3" | awk -f range.awk
OUTPUT 4
1 3 5 99 188 191 194 197
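The modulo arithmetic behind next_n can be checked on its own. The sketch below is a standalone variant, not the answer's exact code: it writes `10 ^ length(n)` instead of `**` so that it should also run outside gawk, it forces numeric comparison with `+ 0`, and the wrapper name next_n_demo is made up for illustration.

```shell
# next_n_demo: read "suffix last" pairs on stdin and print, for each pair,
# the smallest number greater than last whose trailing digits equal suffix.
next_n_demo() {
    awk '
    function next_n(n, last,    width, mod) {
        width = 10 ^ length(n)      # 10^k for a k-digit suffix
        mod = last % width          # current trailing k digits of last
        if (mod + 0 < n + 0) return last - mod + n
        return last + (width - mod) + n
    }
    { print next_n($1, $2) }'
}

printf '%s\n' '02 104' '64 545' '1 14' | next_n_demo
```

This prints 202, 564 and 21, which correspond to steps in the examples above: 104..02 ends at 202, 545,64 continues at 564, and in 1,3,7,2,4,1 the final 1 after 14 becomes 21.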

how to add text to next line in tab separated file from other file?

I have a set of files containing tab-separated values. The values I need are on the third line from the end of each file. I extracted them with
cat result1.tsv | tail -3 | head -1 > final1.tsv
cat result2.tsv | tail -3 | head -1 > final2.tsv
... and so on (I have almost 30-40 files)
I want the content of each final tsv file on its own line in a single new file.
I tried
cat final1.tsv final2.tsv > final.tsv
but this only works for a small number of files; it is difficult to write out the names of all the files.
I tried to put the file names in a loop as variables, but it did not work.
final1.tsv contains:
270 96 284 139 271 331 915 719 591 1679 1751 1490 968 1363 1513 1184 1525 490 839 425 967 855 356
final2.tsv contains:
1 1 0 2 6 5 1 1 11 7 1 3 4 1 0 3 2 1 0 3 2 1 28
All the files (final1.tsv, final2.tsv, final3.tsv, final5.....) contain the same number of columns but different values.
I want the rows of each file merged into a new file, like
final.tsv
final1 270 96 284 139 271 331 915 719 591 1679 1751 1490 968 1363 1513 1184 1525 490 839 425 967 855 356
final2 1 1 0 2 6 5 1 1 11 7 1 3 4 1 0 3 2 1 0 3 2 1 28
final3 270 96 284 139 271 331 915 719 591 1679 1751 1490 968 1363 1513 1184 1525 490 839 425 967 855 356
final4 1 1 0 2 6 5 1 1 11 7 1 3 4 1 0 3 2 1 0 3 2 1 28
here you go...
for f in final{1..4}.tsv;
do
    printf '%s\t' "${f%.tsv}" >> final.tsv;
    cat "$f" >> final.tsv;
done
Try this:
rm -f final.tsv
for FILE in result*.tsv
do
    tail -3 "$FILE" | head -1 >> final.tsv
done
As long as the files aren't enormous, it's simplest to read each file into an array and select the third record from the end.
This Perl program looks for all files in the current directory that match result*.tsv and writes the required line from each of them to final.tsv, prefixed with the file's base name.
use strict;
use warnings 'all';
my @results = sort {
    my ($aa, $bb) = map /(\d+)/, ($a, $b);
    $aa <=> $bb;
} glob 'result*.tsv';
open my $out_fh, '>', 'final.tsv';
for my $result_file ( @results ) {
    open my $fh, '<', $result_file or die qq{Unable to open "$result_file" for input: $!};
    my @data = <$fh>;
    next unless @data >= 3;
    my ($name) = $result_file =~ /([^.]+)/;
    print { $out_fh } "$name\t$data[-3]";
}
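The same idea (one labelled line per input file, taken from the third line from the end) can also be sketched in plain shell. The function name collect_third_from_last is hypothetical, and the sketch assumes every input file ends with a newline.

```shell
# collect_third_from_last OUT FILE...: for each FILE with at least 3 lines,
# append "<base name><TAB><third line from the end>" to OUT.
collect_third_from_last() {
    out=$1; shift
    : > "$out"                                  # start with an empty output file
    for f in "$@"; do
        lines=$(wc -l < "$f")
        [ "$((lines))" -ge 3 ] || continue      # skip files shorter than 3 lines
        name=$(basename "$f" .tsv)
        printf '%s\t%s\n' "$name" "$(tail -n 3 "$f" | head -n 1)" >> "$out"
    done
}

# Usage, assuming the inputs are named result*.tsv as in the question:
# collect_third_from_last final.tsv result*.tsv
```

Unlike the Perl version, the shell glob orders names lexically, so result10.tsv would come before result2.tsv.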

awk find the closest match of a list in a matrix

I am looking for the common elements of two files, or more precisely for the row of a matrix that shares the most elements with a given row. So far I have only worked out how to compare fields in the same position: I get the lines that hold the same value in the same field number.
But how can I extend the search to the other field numbers?
awk 'NR==FNR{a[$1];next}$1 in a{print $1" "FNR}' file1 file2
104 3
Expected output:
104 3
111 4
117 2
134 2
148 -
156 4
166 4
176 3
186 -
198 1
221 6
236 -
best match row 4 with 3 elements common.
file 1
104 111 117 134 148 156 166 176 186 198 221 236
file 2
102 108 116 124 132 141 151 162 173 185 198 211
103 109 117 125 134 143 153 163 175 187 200 213
104 110 118 126 135 144 154 165 176 188 201 215
105 111 119 127 136 145 156 166 178 190 203 217
106 112 120 128 137 147 157 168 179 192 205 219
107 113 121 130 139 148 158 169 181 193 207 221
108 114 122 131 140 150 160 171 183 195 208 200
This solution assumes 1) that file1 contains unique values as shown in the provided example and 2) there is only one top ranked line in file2.
awk -v string=$(cat file1 | tr " " ",") \
'{split(string,array,","); cnt=0;
for(i in array) {for(j=1;j<=NF;j++) if(array[i]==$j) cnt++};
if(cnt>cntmax) {cntmax=cnt; NRmax=NR}} END{print NRmax}' file2
4
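A variant that reads file1 inside awk, instead of passing it in through a variable, could look like the sketch below. It is not the answer's code: the function name best_match_row is invented, it also prints the number of common elements, and it assumes the values within each row of file2 are unique. On ties the first row with the highest count is reported.

```shell
# best_match_row FILE1 FILE2: print the row number of FILE2 that shares the
# most values with FILE1, followed by the number of shared values.
best_match_row() {
    awk '
        NR == FNR { for (i = 1; i <= NF; i++) want[$i]; next }  # file1: remember every value
        {
            cnt = 0
            for (j = 1; j <= NF; j++) if ($j in want) cnt++     # fields of this row found in file1
            if (cnt > best) { best = cnt; row = FNR }           # first row wins on ties
        }
        END { print row, best }
    ' "$1" "$2"
}

# Usage with the files from the question:
# best_match_row file1 file2
```

For the file1 and file2 shown in the question this prints `4 3`, matching the expected "row 4 with 3 elements".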
