gawk -v ff=${fileB} '
/^1017/ { print $0 >> ff; next; }
!(/^#/||/^1016/||/^1018/||/^1013/||/^1014/||/^1013/||/^1014/) {
f=substr($0,11,2)".csv"; print $0 >>"../../" f;
}
' ${csvfiles}
The big file contains around 20 million lines, and we have to read each line: if it starts with 1017, it is printed to fileB regardless of the line content.
If it does not start with one of the codes in the skip list above (1016, 1013, ...), it is written to a file whose name is taken from the line content. For example, the line
1010,abcdefg,123453,343,3434, is written to fg.csv; we take the substring fg from the second column.
The problem is that performance is only about 35k lines per second. Is it possible to make it faster?
sample input
Excluded codes: 1016, 1013, ...
Included codes: 1010, 1017, ...
1016,abcdefg,123453,343,3434,
1010,abcdefg,123453,343,3434,
1017,sdfghhj,123453,343,3434,
1034,zxczcvf,123453,343,3434,
1055,zxczcfg,123453,343,3434,
sample output
fileB.csv
1017,sdfghhj,123453,343,3434,
fg.csv
1055,zxczcfg,123453,343,3434,
vf.csv
1034,zxczcvf,123453,343,3434,
Try this:
gawk -v ff="$fileB" '
!/^(#|10(1[6834]|24|55))/{ print > (/^1017/ ? ff : "../../" substr($0,20,2) ".csv") }
' "$csvfiles"
This MAY speed things up if all the time is being spent on file opens/closes:
awk '!/^(#|10(1[6834]|24|55))/{print substr($0,20,2), $0}' "$csvfiles" |
sort -t ' ' |
awk -v ff="$fileB" '
{
  curr = substr($0,1,2)              # 2-char key prepended by the first awk
  str  = substr($0,4)                # original record, after the key and the separating space
  if ( index(str,"1017") == 1 ) {    # 1017 records always go to fileB
    print str > ff
    next
  }
  if ( curr != prev ) {
    close(out)
    out = "../../" curr ".csv"
    prev = curr
  }
  print str > out
}
'
I'm really not sure it'll be any faster, but it might be thanks to the simpler regexp; at the very least it's concise.
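If you want to check whether either variant actually helps before committing to a full 20-million-line run, a rough benchmark on a slice of the data is usually enough. A minimal sketch, assuming $csvfiles and $fileB are set as in your script and that the ../../ target directory exists (the slice size and /tmp path are just placeholders):
# take a 1M-line slice so each timing run stays short
head -n 1000000 ${csvfiles} > /tmp/sample.csv

# time the single-regexp variant on the slice; rerun your original script
# (and the sort-based pipeline) on the same slice to compare
time gawk -v ff=${fileB} '
!/^(#|10(1[6834]|24|55))/{ print > (/^1017/ ? ff : "../../" substr($0,20,2) ".csv") }
' /tmp/sample.csv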
Related
I have around 65,000 product codes in a text file. I want to split those numbers into groups of 999 each, and then have each group of 999 numbers single-quoted and separated by commas.
Could you please suggest how I can achieve this with a Unix script?
87453454
65778445
.
.
.
.
... and so on, up to 65,000 product codes.
They need to be arranged in the pattern below:
'87453454','65778445',
With awk:
awk '
++c == 1 { out = "\047" $0 "\047"; next }
{ out = out ",\047" $0 "\047" }
c == 999 { print out; c = 0 }
END { if (c) print out }
' file
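If you want to sanity-check the result, counting the quote characters per output line should give 2 x 999 = 1998 everywhere except possibly the last line. A quick sketch, assuming you redirected the awk output to a file called grouped.txt:
awk '{ print gsub(/\047/, "&") }' grouped.txt | sort -nu
# expect a single value of 1998, plus possibly one smaller value for the final, shorter group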
Or, with GNU sed:
sed "
:a
\$bb
N
0~999{
:b
s/\n/','/g
s/^/'/
s/$/'/
b
}
ba" file
With Perl:
perl -ne '
sub pq { chomp; print "\x27$_\x27" } pq;
for (1 .. 998) {
if (defined($_ = <>)) {
print ",";
pq
}
}
print "\n"
' < file
Credit to Mauke, #perl on Libera.Chat.
65000 isn't that many lines for awk - just do it all in one shot:
mawk 'BEGIN { FS = RS; RS = "^$"; OFS = (_="\47")(",")_
} gsub(/^|[^0-9]*$/,_, $!(NF = NF))'
'66771756','69562431','22026341','58085790','22563930',
'63801696','24044132','94255986','56451624','46154427'
That's for grouping them all on one line. To make groups of 999, try:
jot -r 50 10000000 99999999 |
# change "5" to "999" here
rs -C= 0 5 |
mawk 'sub(".*", "\47&\47", $!(NF -= _==$NF ))' FS== OFS='\47,\47'
'36452530','29776340','31198057','36015730','30143632'
'49664844','83535994','86871984','44613227','12309645'
'58002568','31342035','72695499','54546650','21800933'
'38059391','36935562','98323086','91089765','65672096'
'17634208','14009291','39114390','35338398','43676356'
'14973124','19782405','96782582','27689803','27438921'
'79540212','49141859','25714405','42248622','25589123'
'11466085','87022819','65726165','86718075','56989625'
'12900115','82979216','65469187','63769703','86494457'
'26544666','89342693','64603075','26102683','70528492'
_==$NF checks whether the rightmost column is empty, i.e. whether there is a trailing separator that needs to be trimmed.
If your input file only contains short codes as shown in your example, you could use the following hack:
xargs -L 999 bash -c "printf \'%s\', \"\$@\"; echo" . <inputFile >outputFile
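For readability, here is the same hack with the quoting moved into the printf format string (\x27 is just the single quote; the lone dot fills $0 of the inline bash script, and each batch of 999 codes becomes its positional parameters):
xargs -L 999 bash -c 'printf "\x27%s\x27," "$@"; echo' . <inputFile >outputFile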
Alternatively, you can use this sed command:
sed -Ene"s/(.*)/'\1',/;H" -e{'0~999','$'}'{z;x;s/\n//g;p}' <inputFile >outputFile
s/(.*)/'\1',/ wraps each line in '...',
but does not print it (-n)
instead, H appends the modified line to the so-called hold space; basically a helper variable storing a single string.
(This also adds a line break as a separator, but we remove that later).
Every 999 lines (0~999) and at the end of the input file ($) ...
... the hold space is then printed and cleared (z;x;...;p)
while deleting all delimiter-linebreaks (s/\n//g) mentioned earlier.
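To make the hold-space mechanics concrete, here is the same idea on a three-line toy input, using only the end-of-input address (GNU sed):
printf 'a\nb\nc\n' |
sed -nE "s/(.*)/'\1',/;H; \${z;x;s/\n//g;p}"
# prints: 'a','b','c',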
I want to plot some data of a spray simulation. There is a variable called the vaporpenetrationlength, which describes the distance from the injector to the position where the mass fraction is 0.1%. The simulation created many folders for each time step. Inside those folders there is one file which contains the mass fraction and the distance.
I want to create a script which goes through all the time step folders, searches inside this one file, and prints out the distance where the 0.1% was measured, together with the time step it belongs to.
I found a script, but I don't understand it because I have only just started learning shell scripting.
Could someone please help me build such a script step by step? I am interested in learning, and therefore I want to understand every line of the code.
Thanks in advance :)
This little script outputs Time, Length and Mass as tab-separated columns, based on the value of the "mass fraction":
printf '%s\t%s\t%s\n' 'Time' 'Length' 'Mass'
awk '
BEGIN { FS = OFS = "\t"}
FNR == 1 {
n = split(FILENAME,path,"/")
time = sprintf("%0.7f",path[n-1])
}
NF != 2 {next}
0.001 <= $2 && $2 < 0.00101 { print time,$1,$2 }
' postProcessing/singleGraphVapPen/*/*
remark: In fact, printing the header could be done within the awk program, but doing it with a separate printf command allows you to post-process the output of awk (for example, if you need to sort the times and/or lengths and/or masses), as sketched below.
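For instance, if you wanted the data rows ordered by time, such post-processing could look like this (same awk program as above, with only a sort added; the output file name VapPenTable.txt is just a placeholder):
{
  printf '%s\t%s\t%s\n' 'Time' 'Length' 'Mass'
  awk '
  BEGIN { FS = OFS = "\t"}
  FNR == 1 {
    n = split(FILENAME,path,"/")
    time = sprintf("%0.7f",path[n-1])
  }
  NF != 2 {next}
  0.001 <= $2 && $2 < 0.00101 { print time,$1,$2 }
  ' postProcessing/singleGraphVapPen/*/* |
  sort -n -k1,1      # order the data rows by time; the header printed above stays first
} > VapPenTable.txt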
notes:
FNR == 1 is true for the first line of each input file. In the corresponding block, I extract the time value from the directory name.
NF != 2 {next} is for filtering out the gnuplot commands that are at the beginning of the input files. In words, this statement means "if the number of (tab-delimited) fields in the line isn't 2, then skip"
0.001 <= $2 && $2 < 0.00101 selects the lines based on the value of their second field, which is referred to as yheptane in your script. IDK the margin of error of your "0.1% of mass fraction" so I chose convenient conditions for the sample output below.
With the sample data, the output will be:
Time Length Mass
0.0001500 0.0895768 0.00100839
0.0002000 0.102057 0.00100301
0.0002000 0.0877939 0.00100832
0.0003500 0.0827694 0.00100114
0.0009000 0.0657509 0.00100015
0.0015000 0.0501911 0.00100016
0.0016500 0.0469495 0.00100594
0.0018000 0.0436538 0.00100853
0.0021500 0.0369005 0.00100809
0.0023000 0.100328 0.00100751
As an aside, here's a script for replacing your original code:
#!/bin/bash
set -- postProcessing/singleGraphVapPen/*/*
if ! [ -f VapPen.txt ]
then
{
printf '%s\t%s\n' 'Time [s]' 'VapPen [m]'
awk '
BEGIN {FS = OFS = "\t"}
FNR == 1 {
if (NR > 1)
print time,vappen
vappen = 0
n = split(FILENAME,path,"/")
time = sprintf("%0.7f",path[n-1])
}
NF != 2 {next}
$2 >= 0.001 { vappen = $1 }
END { if (NR) print time,vappen }
' "$#" |
sort -n -k1,1
} > VapPen.txt
fi
gnuplot -e '
set title "Verdunstungspenetration";
set xlabel "Zeit [s]";
set ylabel "Verdunstungspenetrationslänge [m]";
set grid;
plot "VapPen.txt" using 1:2 with linespoints title "Vapor penetraion 0,1% mass";
pause -1 "Hit return to continue";
'
With the provided data, it reduces the execution time from several minutes to 0.15s on my computer.
I am processing text files with thousands of records per file. Each record is made up of two lines: a header that starts with ">", followed by a line with a long string made up of the characters "-AGTCNR".
Here is what a simple file looks like:
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-----TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTTTCCT----TAAATAAT-----
Now I am trying to search through the second field (line) of each record and extract only the records which have at most a certain number of "-" characters (referred to as gaps) at the beginning ($start_gaps) and at the end ($end_gaps) of the line (field $2).
I have tried a few approaches and the following one works well:
read -p "Please enter the muximum number of gaps allowed at start position: " start_gaps &&
read -p "Please enter the maximum number of gaps allowed at the end position: " end_gaps &&
awk -v start_g=$start_gaps -v end_g=$end_gaps 'BEGIN{
RS="\n>"; FS="\n"; ORS="\n"; OFS="\n"; }; (x=start_g+1)(y=end_g+1) {
if ( match($2, "^-{5,}") && match($2, "-{6,}$") ) {
next} else {print x y ">"$0}}' infile > outfile
But I need to keep using variable numbers without explicitly editing the script every time I run the pattern matching. So I tried the following, but the regex does not accept variables. What is the best workaround for this?
read -p "Please enter the muximum number of gaps allowed at start position: " start_gaps &&
read -p "Please enter the maximum number of gaps allowed at the end position: " end_gaps &&
awk -v start_g=$start_gaps -v end_g=$end_gaps 'BEGIN{
RS="\n>"; FS="\n"; ORS="\n"; OFS="\n"; }; (x=start_g+1)(y=end_g+1) {
if ( match($2, "^-{x,}") && match($2, "-{y,}$") ) {
next} else {print x y ">"$0}}' infile > outfile
Expected results:
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-----TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
match() sets the variable RLENGTH to the matched substring's length; make use of it. Also, you don't need a multi-char RS for this.
awk -v start_g="$start_gaps" -v end_g="$end_gaps" '
/^>/ { hdr=$0; next }
match($0,/^-*/) && RLENGTH<=start_g && match($0,/-*$/) && RLENGTH<=end_g { print hdr; print }
' file
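If you specifically want the regex-with-variables workaround you asked about, you can build the pattern as a string and let awk treat it as a dynamic regex. A minimal sketch, assuming your awk supports interval expressions like {3,} (gawk does) and using the same keep/skip logic as the answer above:
awk -v start_g="$start_gaps" -v end_g="$end_gaps" '
BEGIN {
    # a run of gaps longer than the allowed maximum means "skip this record"
    start_re = sprintf("^-{%d,}", start_g + 1)
    end_re   = sprintf("-{%d,}$", end_g + 1)
}
/^>/ { hdr = $0; next }
$0 !~ start_re && $0 !~ end_re { print hdr; print }
' infile > outfile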
I need to split a big file into smaller chunks based on the last occurrence of a pattern in the big file, using a shell script. For example:
Sample.txt (the file is sorted on the third field, which is the field the pattern is searched in):
NORTH EAST|0004|00001|Fost|Weaather|<br/>
NORTH EAST|0004|00001|Fost|Weaather|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
EAST|0007|00016|uytr|kert|<br/>
EAST|0007|00016|uytr|kert|<br/>
WEST|0002|00112|WERT|fersg|<br/>
WEST|0002|00112|WERT|fersg|<br/>
SOUTHWEST|3456|01134|GDFSG|EWRER|<br/>
"Pattern 1 = 00003 " to be searched output file must contain sample_00003.txt
NORTH EAST|0004|00001|Fost|Weaather|<br/>
NORTH EAST|0004|00001|Fost|Weaather|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
"Pattren 2 = 00112" to be searched output file must contain sample_00112.txt
EAST|0007|00016|uytr|kert|<br/>
EAST|0007|00016|uytr|kert|<br/>
WEST|0002|00112|WERT|fersg|<br/>
WEST|0002|00112|WERT|fersg|<br/>
I used
awk -F'|' -v pattern="00003" '$3 ~ pattern' big_file > smallfile
and grep commands, but it was very time consuming since the file is 300+ MB in size.
Not sure if you'll find a faster tool than awk, but here's a variant that fixes your own attempt and also speeds things up a little by using string matching rather than regex matching.
It processes lookup values in a loop, and outputs everything from where the previous iteration left off through the last occurrence of the value at hand to a file named smallfile<n>, where <n> is an index starting with 1.
ndx=0; fromRow=1
for val in '00003' '00112' '|'; do # 2 sample values to match, plus dummy value
chunkFile="smallfile$(( ++ndx ))"
fromRow=$(awk -F'|' -v fromRow="$fromRow" -v outFile="$chunkFile" -v val="$val" '
NR < fromRow { next }
{ if ($3 != val) { if (p) { print NR; exit } } else { p=1 } } { print > outFile }
' big_file)
done
Note that the dummy value | ensures that any remaining rows after the last true match are saved to a chunk file too.
Note that moving all the logic into a single awk script should be much faster, because big_file would only have to be read once:
awk -F'|' -v vals='00003|00112' '
BEGIN { split(vals, val); outFile="smallfile" ++ndx }
{
if ($3 != val[ndx]) {
if (p) { p=0; close(outFile); outFile="smallfile" ++ndx }
} else {
p=1
}
print > outFile
}
' big_file
You can try with Perl:
perl -ne '/00003/ && print' big_file > small_file
and compare its timing with other solutions...
EDIT
Limiting my answer to the tools you didn't try already... you can also use:
sed -n '/00003/p' big_file > small_file
But I tend to believe Perl will be faster. Again, I'd suggest you measure the elapsed time of the different solutions on your own, for example along the lines of the sketch below.
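A rough way to compare them (the figures will vary with your system and with how warm the filesystem cache is; redirecting to /dev/null isolates the filtering cost from the cost of writing the output):
time awk -F'|' '$3 == "00003"' big_file > /dev/null
time perl -ne '/00003/ && print' big_file > /dev/null
time sed -n '/00003/p' big_file > /dev/null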
I've recently started using the incredibly fast awk, since I needed to parse very big files.
I had to parse this kind of input...
ID 001R_FRG3G Reviewed; 256 AA.
AC Q6GZX4;
[...]
SQ SEQUENCE 256 AA; 29735 MW; B4840739BF7D4121 CRC64;
MAFSAEDVLK EYDRRRRMEA LLLSLYYPND RKLLDYKEWS PPRVQVECPK APVEWNNPPS
EKGLIVGHFS GIKYKGEKAQ ASEVDVNKMC CWVSKFKDAM RRYQGIQTCK IPGKVLSDLD
AKIKAYNLTV EGVEGFVRYS RVTKQHVAAF LKELRHSKQY ENVNLIHYIL TDKRVDIQHL
EKDLVKDFKA LVESAHRMRQ GHMINVKYIL YQLLKKHGHG PDGPDILTVK TGSKGVLYDD
SFRKIYTDLG WKFTPL
//
ID 002L_FRG3G Reviewed; 320 AA.
AC Q6GZX3;
[...]
SQ SEQUENCE 320 AA; 34642 MW; 9E110808B6E328E0 CRC64;
MSIIGATRLQ NDKSDTYSAG PCYAGGCSAF TPRGTCGKDW DLGEQTCASG FCTSQPLCAR
IKKTQVCGLR YSSKGKDPLV SAEWDSRGAP YVRCTYDADL IDTQAQVDQF VSMFGESPSL
AERYCMRGVK NTAGELVSRV SSDADPAGGW CRKWYSAHRG PDQDAALGSF CIKNPGAADC
KCINRASDPV YQKVKTLHAY PDQCWYVPCA ADVGELKMGT QRDTPTNCPT QVCQIVFNML
DDGSVTMDDV KNTINCDFSK YVPPPPPPKP TPPTPPTPPT PPTPPTPPTP PTPRPVHNRK
VMFFVAGAVL VAILISTVRW
//
ID 004R_FRG3G Reviewed; 60 AA.
AC Q6GZX1; dog;
[...]
SQ SEQUENCE 60 AA; 6514 MW; 12F072778EE6DFE4 CRC64;
MNAKYDTDQG VGRMLFLGTI GLAVVVGGLM AYGYYYDGKT PSSGTSFHTA SPSFSSRYRY
...filter it with a file like this...
Q6GZX4
dog
...to get an output like this:
Q6GZX4 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL 256
dog MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY 60
To do this, I came up with this code:
BEGIN{
while(getline<"filterFile.txt">0)B[$1];
}
{
if ($1=="ID")
len=$4;
else{
if ($1=="AC"){
acc=0;
line = substr($0,6,length($0)-6);
split(line,A,"; ");
for (i in A){
if (A[i] in B){
acc=A[i];
}
}
if (acc){
printf acc"\t";
}
}
if (acc){
if(substr($0, 1, 5) == "     "){   # five spaces: sequence lines are indented
printf $1$2$3$4$5$6;
}
if ($1 == "//"){
print "\t"len
}
}
}
}
However, since I've seen many examples of similar tasks done with awk, I think there probably is a much more elegant and efficient way to do it. But I can't really grasp the super-compact examples usually found around the internet.
Since this is my input, my output and my code, I think this is a good occasion to understand more about awk optimization in terms of performance and coding style, if some awk guru has some time and patience to spend on this task.
Perl to the rescue:
#!/usr/bin/perl
use warnings;
use strict;
open my $FILTER, '<', 'filterFile.txt' or die $!;
my %wanted; # Hash of the wanted ids.
chomp, $wanted{$_} = 1 for <$FILTER>;
$/ = "//\n"; # Record separator.
while (<>) {
my ($id_string) = /^ AC \s+ (.*) /mx;
my @ids = split /\s*;\s*/, $id_string;
if (my ($id) = grep $wanted{$_}, @ids) {
print "$id\t";
my ($seq) = /^ SQ \s+ .* $ ((?s:.*)) /mx;
$seq =~ s/\s+//g; # Remove whitespace.
$seq =~ s=//$==; # Remove the final //.
print "$seq\t", length $seq, "\n";
}
}
An awk solution with a different field separator (this way, you avoid using substr and split):
BEGIN {
while (getline<"filterFile.txt">0) filter[$1] = 1;
FS = "[ \t;]+"; OFS = ""; ORS = "";
}
{
if (flag) {
if (len)
if ($1 == "//") {
print "\t" len "\n";
flag = 0; len = 0;
} else {
$1 = $1;
print;
}
else if ($1 == "SQ") len = $3;
} else if ($1 == "AC") {
for (i = 1; ++i < NF;)
if (filter[$i]) {
flag = 1;
print $i "\t";
break;
}
}
}
END { if (flag) print "\t" len }
Note: this code is not designed to be short but to be fast. That's why I didn't try to remove the nested if/else conditions, but I did try to reduce as much as possible the overall number of tests for a whole file.
However, after several changes since my first version and after several benchmarks, I must admit that choroba's Perl version is a little faster.
For that kind of task, an idea is to pipe your second file through awk or sed in order to create on the fly a new awk script parsing the big file. As an example:
Control file (f1):
test
dog
Data (f2):
tree 5
test 2
nothing
dog 1
An idea to start with:
sed 's/^\(.*\)$/\/\1\/ {print $2}/' f1 | awk -f - f2
(where -f - means: read the awk script from the standard input rather than from a named file).
This may not be much shorter than the original, but multiple awk scripts make the code simpler: the first awk generates the records of interest, the second extracts the information, and the third formats it:
$ awk 'NR==FNR{keys[$0];next}
       {RS="//";
        for(k in keys)
            if($0~k)
                {print "key",k; print $0}}' keys file |
  awk '/key/{key=$2;f=0;next}
       /SQ/{f=1;print "\n\n"key,$3;next}
       f{gsub(" ","");printf $0}
       END{print}' |
  awk -vRS= -vOFS="\t" '{print $1,$3,$2}'
will print
Q6GZX4 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL 256
dog MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY 60
Your code looks almost OK as-is. Keep it simple, single-pass like that.
Only a couple suggestions:
1) The business around the split is too messy/brittle. Maybe try it this way:
acc="";
n=split($0,A,"[; ]+");
for (i=2;i<=n;++i){
if (A[i] in B){
acc=A[i];
break;
}
}
2) Don't use input data in the first argument to your printfs. You never know when something that looks like printf formatting might come in and really mess things up:
printf "%s\t",acc";
printf "%s%s%s%s%s%s",$1,$2,$3,$4,$5,$6;
Update with one more possible "elegance":
3) The awk style of pattern{action} is already a form of if/then, so you can avoid a lot of your outer if/then nesting:
$1="ID" {len=$4}
$1="AC" {
acc="";
...
}
acc {
if(substr($0, 1, 5) == "     "){
...
}
In Vim it's actually a one-liner to find the pattern:
/^AC.\{-}Q6GZX4;\_.\{-}\nSQ\_.\{-}\n\zs\_.\{-}\ze\/\//
where Q6GZX4; is your pattern to find in order to match the sequence characters.
The above basically will do:
Search for the line with AC at the beginning (^) which is followed by Q6GZX4;.
Follow across multiple lines (\_.\{-}) to the line starting with SQ (\nSQ).
Then follow to the next line ignoring what's in the current (\_.\{-}\n).
Now start selecting the main pattern (\zs) which is basically everything across multiple lines (\_.\{-}) until (\ze) the // pattern if found.
Then execute normal Vim commands (norm) which select the pattern (gn) and yank it into the x register ("xy).
You may now print the register (echo @x) or remove whitespace characters from it.
This can be extended into Ex editor script as below (e.g. cmd.ex):
let s="Q6GZX4"
exec '/^AC.\{-}' . s . ';\_.\{-}\nSQ\_.\{-}\n\zs\_.\{-}\ze\/\//norm gn"xy'
let @x=substitute(@x,'\W','','g')
silent redi>>/dev/stdout
echon s . " " . @x
redi END
q!
Then run from the command-line as:
$ ex inputfile < cmd.ex
Q6GZX4 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL
The above example can be further extended for multiple files or matches.
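For instance, running the same Ex script over several input files only needs a small shell loop (the file names are placeholders):
for f in inputfile1 inputfile2 inputfile3; do
    ex "$f" < cmd.ex
done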
awk 'FNR == NR { aFilter[ $1 ";"] = $1; next }
/^AC/ {
if (String !~ /^$/) print Taken "\t" String "\t" Len
Taken = ""; String = ""
for ( i = 2; i <= NF && Taken ~ /^$/; i++) {
if( $i in aFilter) Taken = aFilter[ $i]
}
Take = Taken !~ /^$/
next
}
Take && /^SQ/ { Len = $3; next }
Take && /^[[:blank:]]/ {
gsub( /[[:blank:]]*/, "")
String = String $0
}
END { if( String !~ /^$/) print Taken "\t" String "\t" Len }
' filter.txt YourFile
Not really shorter, maybe a bit more generic. The heavy part is extracting the value that serves as the filter from the line.