Extracting multiple parts of a string using bash

I have a caret-delimited (key=value) input and would like to extract multiple tokens of interest from it.
For example: Given the following input
$ echo -e "1=A00^35=D^150=1^33=1\n1=B000^35=D^150=2^33=2"
1=A00^35=D^150=1^33=1
1=B000^35=D^150=2^33=2
I would like the following output
35=D^150=1^
35=D^150=2^
I have tried the following
$ echo -e "1=A00^35=D^150=1^33=1\n1=B000^35=D^150=2^33=2"|egrep -o "35=[^/^]*\^|150=[^/^]*\^"
35=D^
150=1^
35=D^
150=2^
My problem is that egrep returns each match on a separate line. Is it possible to get one line of output for one line of input? Please note that due to the constraints of the larger script, I cannot simply do a blind replace of all the \n characters in the output.
Thank you for any suggestions. This script is for bash 3.2.25. Any egrep alternatives are welcome. Please note that the tokens of interest (35 and 150) may change, and I am already generating the egrep pattern in the script, so a one-liner (if possible) would be great.

You have two options. Option 1 is to change the word-splitting characters (IFS) and use set --:
OFS=$IFS
IFS="^ "
line="1=A00^35=D^150=1^33=1"
set -- $line # No quotes here!! (word splitting only applies to an unquoted expansion, not to a literal)
IFS="$OFS"
Now you have your values in $1, $2, etc.
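For illustration, a hedged, self-contained sketch; the positions $2 and $3 are an assumption based on the sample input and shift if the key order changes:
OFS=$IFS
IFS="^ "
line="1=A00^35=D^150=1^33=1"
set -- $line          # no quotes, so the expansion is split on IFS
IFS="$OFS"
echo "$2^$3^"         # prints: 35=D^150=1^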
Or you can use an array:
tmp=$(echo "1=A00^35=D^150=1^33=1" | sed -e 's:\([0-9]\+\)=: [\1]=:g' -e 's:\^ : :g')
eval value=($tmp)
echo "35=${value[35]}^150=${value[150]}"

To get rid of the newline, you can just echo it again:
$ echo $(echo "1=A00^35=D^150=1^33=1"|egrep -o "35=[^/^]*\^|150=[^/^]*\^")
35=D^ 150=1^
If that's not satisfactory (I think it may give you one line for the whole input file), you can use awk:
pax> echo '
1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLIST=35,150 -F^ '{
    sep = "";
    split(LIST, srch, ",");
    for (i = 1; i <= NF; i++) {
        for (idx in srch) {
            split($i, arr, "=");
            if (arr[1] == srch[idx]) {
                printf sep "" arr[1] "=" arr[2];
                sep = "^";
            }
        }
    }
    if (sep != "") {
        print sep;
    }
}'
35=D^150=1^
35=d^
pax> echo '
1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLIST=1,33 -F^ '{
    sep = "";
    split(LIST, srch, ",");
    for (i = 1; i <= NF; i++) {
        for (idx in srch) {
            split($i, arr, "=");
            if (arr[1] == srch[idx]) {
                printf sep "" arr[1] "=" arr[2];
                sep = "^";
            }
        }
    }
    if (sep != "") {
        print sep;
    }
}'
1=A00^33=1^
1=a00^33=11^
This one allows you to use a single awk script and all you need to do is to provide a comma-separated list of keys to print out.
And here's the one-liner version :-)
echo '1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLST=1,33 -F^ '{s="";split(LST,k,",");for(i=1;i<=NF;i++){for(j in k){split($i,arr,"=");if(arr[1]==k[j]){printf s""arr[1]"="arr[2];s="^";}}}if(s!=""){print s;}}'

Given a file 'in' containing your strings:
$ for i in $(cut -d^ -f2,3 < in);do echo $i^;done
35=D^150=1^
35=D^150=2^
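Equivalently, without the word-splitting loop, a hedged variant (it still assumes the wanted keys are always fields 2 and 3):
$ cut -d^ -f2,3 < in | sed 's/$/^/'
35=D^150=1^
35=D^150=2^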

Related

Generic "append to file if not exists" function in Bash

I am trying to write a util function in a bash script that can take a multi-line string and append it to the supplied file if it does not already exist.
This works fine using grep if the pattern does not contain \n.
if grep -qF "$1" $2
then
    return 1
else
    echo "$1" >> $2
fi
Example usage
append 'sometext\nthat spans\n\tmultiple lines' ~/textfile.txt
I am on macOS, by the way, which has presented some problems: some of the solutions I've seen posted elsewhere are very Linux-specific. I'd also like to avoid installing any other tools to achieve this if possible.
Many thanks
If the files are small enough to slurp into a Bash variable (you should be OK up to a megabyte or so on a modern system), and don't contain NUL (ASCII 0) characters, then this should work:
IFS= read -r -d '' contents <"$2"
if [[ "$contents" == *"$1"* ]]; then
    return 1
else
    printf '%s\n' "$1" >>"$2"
fi
In practice, the speed of Bash's built-in pattern matching might be more of a limitation than the ability to slurp the file contents.
See the accepted, and excellent, answer to Why is printf better than echo? for an explanation of why I replaced echo with printf.
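For reference, a hedged sketch of the above wrapped into the question's append function; note that read -d '' returns non-zero at end-of-file, which is expected here, and $'...' is used so the escapes become real newlines and tabs:
append() {
    local contents
    IFS= read -r -d '' contents <"$2"
    if [[ "$contents" == *"$1"* ]]; then
        return 1
    else
        printf '%s\n' "$1" >>"$2"
    fi
}
append $'sometext\nthat spans\n\tmultiple lines' ~/textfile.txt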
Using awk:
awk '
    BEGIN {
        n = 0   # length of pattern in lines
        m = 0   # number of matching lines
    }
    NR == FNR {
        pat[n++] = $0
        next
    }
    {
        if ($0 == pat[m])
            m++
        else if (m > 0 && $0 == pat[0])
            m = 1
        else
            m = 0
    }
    m == n {
        exit
    }
    END {
        if (m < n) {
            for (i = 0; i < n; i++)
                print pat[i] >> FILENAME
        }
    }
' - "$2" <<EOF
$1
EOF
If necessary, one would need to properly escape any regex metacharacters inside FS/OFS:
jot 7 9 |
{m,g,n}awk 'BEGIN { FS = OFS = "11\n12\n13\n"
_^= RS = (ORS = "") "^$" } _<NF || ++NF'
9
10
11
12
13
14
15
jot 7 -2 | (... awk stuff ...)
-2
-1
0
1
2
3
4
11
12
13

To split and arrange numbers in single inverted commas

I have around 65000 product codes in a text file. I want to split those numbers into groups of 999 each, and then have each group of 999 numbers wrapped in single quotes and separated by commas.
Could you please suggest how I can achieve the above scenario through a Unix script.
87453454
65778445
.
.
.
.
up to 65000 product codes
They need to be arranged in the pattern below:
'87453454','65778445',
With awk:
awk '
++c == 1 { out = "\047" $0 "\047"; next }
{ out = out ",\047" $0 "\047" }
c == 999 { print out; c = 0 }
END { if (c) print out }
' file
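A quick hedged demo of the same program with the group size dropped from 999 to 3 (and seq standing in for the real file):
$ seq 101 107 | awk '
++c == 1 { out = "\047" $0 "\047"; next }
{ out = out ",\047" $0 "\047" }
c == 3 { print out; c = 0 }
END { if (c) print out }
'
'101','102','103'
'104','105','106'
'107'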
Or, with GNU sed:
sed "
:a
\$bb
N
0~999{
:b
s/\n/','/g
s/^/'/
s/$/'/
b
}
ba" file
With Perl:
perl -ne '
    sub pq { chomp; print "\x27$_\x27" } pq;
    for (1 .. 998) {
        if (defined($_ = <>)) {
            print ",";
            pq
        }
    }
    print "\n"
' < file
Credit to Mauke of #perl on Libera.Chat.
65000 isn't that many lines for awk - just do it all in one shot:
mawk 'BEGIN { FS = RS; RS = "^$"; OFS = (_="\47")(",")_
} gsub(/^|[^0-9]*$/,_, $!(NF = NF))'
'66771756','69562431','22026341','58085790','22563930',
'63801696','24044132','94255986','56451624','46154427'
That groups them all on one line. To make groups of 999, try
jot -r 50 10000000 99999999 |
# change "5" to "999" here
rs -C= 0 5 |
mawk 'sub(".*", "\47&\47", $!(NF -= _==$NF ))' FS== OFS='\47,\47'
'36452530','29776340','31198057','36015730','30143632'
'49664844','83535994','86871984','44613227','12309645'
'58002568','31342035','72695499','54546650','21800933'
'38059391','36935562','98323086','91089765','65672096'
'17634208','14009291','39114390','35338398','43676356'
'14973124','19782405','96782582','27689803','27438921'
'79540212','49141859','25714405','42248622','25589123'
'11466085','87022819','65726165','86718075','56989625'
'12900115','82979216','65469187','63769703','86494457'
'26544666','89342693','64603075','26102683','70528492'
_==$NF checks whether the rightmost column is empty, i.e. whether there's a trailing separator that needs to be trimmed.
If your input file only contains short codes as shown in your example, you could use the following hack:
xargs -L 999 bash -c "printf \'%s\', \"\$@\"; echo" . <inputFile >outputFile
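Roughly how the hack works (a hedged reading):
xargs -L 999 hands bash at most 999 codes per invocation;
the lone . becomes $0 of bash -c, so the codes arrive as "$@";
printf \'%s\', re-applies its format to every argument, wrapping each code in quotes with a trailing comma;
echo terminates each batch with a newline.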
Alternatively, you can use this sed command:
sed -Ene"s/(.*)/'\1',/;H" -e{'0~999','$'}'{z;x;s/\n//g;p}' <inputFile >outputFile
s/(.*)/'\1',/ wraps each line in '...',
but does not print it (-n)
instead, H appends the modified line to the so-called hold space, basically a helper variable storing a single string.
(This also adds a line break as a separator, but we remove that later).
Every 999 lines (0~999) and at the end of the input file ($) ...
... the hold space is then printed and cleared (z;x;...;p)
while deleting all delimiter-linebreaks (s/\n//g) mentioned earlier.

How to define the range '{n,}' using a variable in match() function with awk

I am processing text files with thousands of records per file. Each record is made up of two lines: a header that starts with ">", followed by a line with a long string of the characters "-AGTCNR".
Here is what a simple file looks like:
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-----TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTTTCCT----TAAATAAT-----
Now I am trying to search through the second field (line) of each record and extract only the records which have up to a certain maximum number of "-" characters (referred to as gaps) at the beginning, $start_gaps, and at the end, $end_gaps, of the line (field $2).
I have tried a few approaches, and the following works well:
read -p "Please enter the muximum number of gaps allowed at start position: " start_gaps &&
read -p "Please enter the maximum number of gaps allowed at the end position: " end_gaps &&
awk -v start_g=$start_gaps -v end_g=$end_gaps 'BEGIN{
RS="\n>"; FS="\n"; ORS="\n"; OFS="\n"; }; (x=start_g+1)(y=end_g+1) {
if ( match($2, "^-{5,}") && match($2, "-{6,}$") ) {
next} else {print x y ">"$0}}' infile > outfile
But I need to keep using variable numbers without explicitly editing the script every time I do the regex pattern matching. So I tried the following, but the regex does not accept variables. What is the best workaround for this?
read -p "Please enter the muximum number of gaps allowed at start position: " start_gaps &&
read -p "Please enter the maximum number of gaps allowed at the end position: " end_gaps &&
awk -v start_g=$start_gaps -v end_g=$end_gaps 'BEGIN{
RS="\n>"; FS="\n"; ORS="\n"; OFS="\n"; }; (x=start_g+1)(y=end_g+1) {
if ( match($2, "^-{x,}") && match($2, "-{y,}$") ) {
next} else {print x y ">"$0}}' infile > outfile
Expected results:
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-----TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
match() sets the variable RLENGTH to the matched substring's length; make use of it. Also, you don't need a multi-char RS for this.
awk -v start_g="$start_gaps" -v end_g="$end_gaps" '
/^>/ { hdr=$0; next }
match($0,/^-*/) && RLENGTH<=start_g && match($0,/-*$/) && RLENGTH<=end_g { print hdr; print }
' file
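For completeness, a dynamic regexp in awk is just a string, so the interval bound can also be spliced in by string concatenation. A hedged sketch along the same lines (it assumes an awk whose regexps support POSIX intervals, e.g. gawk):
awk -v start_g="$start_gaps" -v end_g="$end_gaps" '
/^>/ { hdr = $0; next }
# reject sequences with more than start_g leading or end_g trailing gaps
match($0, "^-{" start_g+1 ",}") || match($0, "-{" end_g+1 ",}$") { next }
{ print hdr; print }
' file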

awk: More elegant way to filter a file with another one

I've recently started using the incredibly fast awk, since I needed to parse very big files.
I had to parse this kind of input...
ID 001R_FRG3G Reviewed; 256 AA.
AC Q6GZX4;
[...]
SQ SEQUENCE 256 AA; 29735 MW; B4840739BF7D4121 CRC64;
MAFSAEDVLK EYDRRRRMEA LLLSLYYPND RKLLDYKEWS PPRVQVECPK APVEWNNPPS
EKGLIVGHFS GIKYKGEKAQ ASEVDVNKMC CWVSKFKDAM RRYQGIQTCK IPGKVLSDLD
AKIKAYNLTV EGVEGFVRYS RVTKQHVAAF LKELRHSKQY ENVNLIHYIL TDKRVDIQHL
EKDLVKDFKA LVESAHRMRQ GHMINVKYIL YQLLKKHGHG PDGPDILTVK TGSKGVLYDD
SFRKIYTDLG WKFTPL
//
ID 002L_FRG3G Reviewed; 320 AA.
AC Q6GZX3;
[...]
SQ SEQUENCE 320 AA; 34642 MW; 9E110808B6E328E0 CRC64;
MSIIGATRLQ NDKSDTYSAG PCYAGGCSAF TPRGTCGKDW DLGEQTCASG FCTSQPLCAR
IKKTQVCGLR YSSKGKDPLV SAEWDSRGAP YVRCTYDADL IDTQAQVDQF VSMFGESPSL
AERYCMRGVK NTAGELVSRV SSDADPAGGW CRKWYSAHRG PDQDAALGSF CIKNPGAADC
KCINRASDPV YQKVKTLHAY PDQCWYVPCA ADVGELKMGT QRDTPTNCPT QVCQIVFNML
DDGSVTMDDV KNTINCDFSK YVPPPPPPKP TPPTPPTPPT PPTPPTPPTP PTPRPVHNRK
VMFFVAGAVL VAILISTVRW
//
ID 004R_FRG3G Reviewed; 60 AA.
AC Q6GZX1; dog;
[...]
SQ SEQUENCE 60 AA; 6514 MW; 12F072778EE6DFE4 CRC64;
MNAKYDTDQG VGRMLFLGTI GLAVVVGGLM AYGYYYDGKT PSSGTSFHTA SPSFSSRYRY
...filter it with a file like this...
Q6GZX4
dog
...to get an output like this:
Q6GZX4 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL 256
dog MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY 60
To do this, I came up with this code:
BEGIN {
    while (getline < "filterFile.txt" > 0) B[$1];
}
{
    if ($1 == "ID")
        len = $4;
    else {
        if ($1 == "AC") {
            acc = 0;
            line = substr($0, 6, length($0) - 6);
            split(line, A, "; ");
            for (i in A) {
                if (A[i] in B) {
                    acc = A[i];
                }
            }
            if (acc) {
                printf acc "\t";
            }
        }
        if (acc) {
            if (substr($0, 1, 5) == "     ") {
                printf $1 $2 $3 $4 $5 $6;
            }
            if ($1 == "//") {
                print "\t" len
            }
        }
    }
}
However, since I've seen many examples of similar tasks done with awk, I think there probably is a much more elegant and efficient way to do it. But I can't really grasp the super-compact examples usually found around the internet.
Since this is my input, my output, and my code, I think this is a good occasion to understand more about awk optimization in terms of performance and coding style, if some awk guru has some time and patience to spend on this task.
Perl to the rescue:
#!/usr/bin/perl
use warnings;
use strict;

open my $FILTER, '<', 'filterFile.txt' or die $!;
my %wanted; # Hash of the wanted ids.
chomp, $wanted{$_} = 1 for <$FILTER>;

$/ = "//\n"; # Record separator.
while (<>) {
    my ($id_string) = /^ AC \s+ (.*) /mx;
    my @ids = split /\s*;\s*/, $id_string;
    if (my ($id) = grep $wanted{$_}, @ids) {
        print "$id\t";
        my ($seq) = /^ SQ \s+ .* $ ((?s:.*)) /mx;
        $seq =~ s/\s+//g;  # Remove whitespace.
        $seq =~ s=//$==;   # Remove the final //.
        print "$seq\t", length $seq, "\n";
    }
}
An awk solution with a different field separator (in this way, you avoid to use substr and split):
BEGIN {
    while (getline < "filterFile.txt" > 0) filter[$1] = 1;
    FS = "[ \t;]+"; OFS = ""; ORS = "";
}
{
    if (flag) {
        if (len)
            if ($1 == "//") {
                print "\t" len "\n";
                flag = 0; len = 0;
            } else {
                $1 = $1;
                print;
            }
        else if ($1 == "SQ") len = $3;
    } else if ($1 == "AC") {
        for (i = 1; ++i < NF;)
            if (filter[$i]) {
                flag = 1;
                print $i "\t";
                break;
            }
    }
}
END { if (flag) print "\t" len }
Note: this code is not designed to be short but to be fast. That's why I didn't try to remove the nested if/else conditions, but I did try to reduce as much as possible the total number of tests over a whole file.
However, after several changes since my first version and after several benchmarks, I must admit that choroba's Perl version is a little faster.
For that kind of task, an idea is to pipe your second file through awk or sed in order to create on the fly a new awk script parsing the big file. As an example:
Control file (f1):
test
dog
Data (f2):
tree 5
test 2
nothing
dog 1
An idea to start with:
sed 's/^\(.*\)$/\/\1\/ {print $2}/' f1 | awk -f - f2
(where -f - means: read the awk script from the standard input rather than from a named file).
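For the control file above, the sed step emits a throw-away awk program along these lines:
/test/ {print $2}
/dog/ {print $2}
which awk then runs against f2, printing 2 and 1.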
This may not be much shorter than the original, but multiple awk scripts make the code simpler: the first awk generates the records of interest, the second extracts the information, and the third formats the output.
$ awk 'NR==FNR{keys[$0];next}
       {RS="//";
        for(k in keys)
            if($0~k)
                {print "key",k; print $0}}' keys file |
  awk '/key/{key=$2;f=0;next}
       /SQ/{f=1;print "\n\n"key,$3;next}
       f{gsub(" ","");printf $0}
       END{print}' |
  awk -vRS= -vOFS="\t" '{print $1,$3,$2}'
will print
Q6GZX4 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL 256
dog MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY 60
Your code looks almost OK as-is. Keep it simple, single-pass like that.
Only a couple suggestions:
1) The business around the split is too messy/brittle. Maybe try it this way:
acc="";
n=split($0,A,"[; ]+");
for (i=2;i<=n;++i){
if (A[i] in B){
acc=A[i];
break;
}
}
2) Don't use input data in the first argument to your printfs. You never know when something that looks like printf formatting might come in and really mess things up:
printf "%s\t",acc";
printf "%s%s%s%s%s%s",$1,$2,$3,$4,$5,$6;
Update with one more possible "elegance":
3) The awk style of pattern{action} is already a form of if/then, so you can avoid a lot of your outer if/then nesting:
$1="ID" {len=$4}
$1="AC" {
acc="";
...
}
acc {
if(substr($0, 1, 5) == " "){
...
}
In Vim it's actually a one-liner to find the pattern:
/^AC.\{-}Q6GZX4;\_.\{-}\nSQ\_.\{-}\n\zs\_.\{-}\ze\/\//
where Q6GZX4; is your pattern to find in order to match the sequence characters.
The above basically will do:
Search for the line with AC at the beginning (^) which is followed by Q6GZX4;.
Follow across multiple lines (\_.\{-}) to the line starting with SQ (\nSQ).
Then follow to the next line ignoring what's in the current (\_.\{-}\n).
Now start selecting the main pattern (\zs) which is basically everything across multiple lines (\_.\{-}) until (\ze) the // pattern if found.
Then execute normal Vim commands (norm) which select the pattern (gn) and yank it into the x register ("xy).
You may now print the register (echo @x) or remove whitespace characters from it.
This can be extended into an Ex editor script as below (e.g. cmd.ex):
let s="Q6GZX4"
exec '/^AC.\{-}' . s . ';\_.\{-}\nSQ\_.\{-}\n\zs\_.\{-}\ze\/\//norm gn"xy'
let @x=substitute(@x,'\W','','g')
silent redi>>/dev/stdout
echon s . " " . @x
redi END
q!
Then run from the command-line as:
$ ex inputfile < cmd.ex
Q6GZX4 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL
The above example can be further extended for multiple files or matches.
awk 'FNR == NR { aFilter[$1 ";"] = $1; next }
     /^AC/ {
         if (String !~ /^$/) print Taken "\t" String "\t" Len
         Taken = ""; String = ""
         for (i = 2; i <= NF && Taken ~ /^$/; i++) {
             if ($i in aFilter) Taken = aFilter[$i]
         }
         Take = Taken !~ /^$/
         next
     }
     Take && /^SQ/ { Len = $3; next }
     Take && /^[[:blank:]]/ {
         gsub(/[[:blank:]]*/, "")
         String = String $0
     }
     END { if (String !~ /^$/) print Taken "\t" String "\t" Len }
    ' filter.txt YourFile
Not really shorter, maybe a bit more generic. The heavy part is extracting the value that serves as the filter from the line.

Conditional insert of line breaks to number sequence (preferably using bash, awk, or sed)

I'm trying to add line breaks to a text file each time a subsequent number is smaller than the immediately preceding number (e.g. a break between "72.774" and "7.009") in a text file with this structure:
7.007 28.929 50.851 72.774 7.009 28.932 50.854 72.777 7.015 32.939 54.862 76.784
I want the output to be in this format:
7.007 28.929 50.851 72.774
7.009 28.932 50.854 72.777
7.015 32.939 54.862 76.784
Files do not always have the same number of numerical entries (either in total or before the series begins counting up again), nor is the same number of line breaks required in all text files.
I've been trying to use conditionals in awk or sed but haven't had any luck.
Thank you in advance for any suggestions/solutions.
note: edited to reflect 1st comment.
This may be what you want: with RS=' ', every number becomes its own record, and the printf prepends ORS (a newline) when the current record is smaller than the previous one, or OFS (a space) otherwise:
$ awk -v RS=' ' '{printf "%s%s", (NR>1?($0<p?ORS:OFS):""), $0; p=$0}' file
7.007 28.929 50.851 72.774
7.009 28.932 50.854 72.777
7.015 32.939 54.862 76.784
Here's one solution using awk:
{
    for (i = 1; i <= NF; ++i) {
        if ($i < last) {
            printf "\n"
            last = -1
        } else if (last > 0) {
            printf " "
        }
        printf "%s", $i
        last = $i
    }
}
END { printf "\n" }
Example run:
$ awk -f foo.awk bar.txt
7.007 28.929 50.851 72.774 94.696 116.619 138.542 160.464 182.387 204.309 226.232 248.155 270.077 292 313.922 335.845 357.768 379.69 401.613 423.535 445.458 467.381
7.009 28.932 50.854 72.777 94.699 116.622 138.545 160.467 182.39 204.312 226.235 248.158 270.08 292.003 313.925 335.848 357.771 379.693 401.616 423.538 445.461 467.384 489.306
7.015 32.939 54.862 76.784 102.708 124.631 146.553 168.476 190.398 212.321 234.244 260.167 282.09 308.013 333.937 355.86 377.782 403.706
7.005 28.928 50.85 72.773 94.696 116.618 138.541 160.463 186.387 212.311 234.233 256.156 278.079 300.001 321.924 347.847
Another AWK solution:
awk 'BEGIN { RS=" "; ORS=" "; prev=-999 }
{ if ( $1<prev ) { printf "\n%.3f", $1 }
else { printf "%.3f ", $1 } prev=$1
}
END { print }'
