R: how to access the data associated with a list in an S4 class - bioinformatics

I’m trying to create an S4 object with a list and a data.frame. The list contains sequences in Biostrings (AAStringSet or DNAStringSet) format. This is the code:
setClass(Class='myclass',
representation(
mytable = 'data.frame',
sequences = 'list'
))
table, assignment table methods
setGeneric('mytable', function(x) standardGeneric('mytable'))
setMethod('mytable', 'myclass', function(x) { x@mytable })
setGeneric("mytable<-", function(x, value) standardGeneric("mytable<-"))
setMethod("mytable<-", "myclass", function(x, value) {
  x@mytable <- value
  return(x)
})
seq, assignment-methods
setGeneric('seq', function(x) standardGeneric('seq'))
setMethod('seq', 'myclass', function(x) { x@sequences })
setGeneric("seq<-", function(x, value) standardGeneric("seq<-"))
setMethod("seq<-", "myclass", function(x, value) {
  x@sequences <- value
  return(x)
})
The data.frame
df[40:45, ]
sseqid gene
40 VFG007051(gi:27364316) epsM
41 VFG007052(gi:37678407) epsM
42 VFG007057(gi:27364317) epsL
43 VFG007063(gi:27364318) epsK
44 VFG007064(gi:37678405) epsK
45 VFG007069(gi:27364319) epsJ
the sequences list
sequences_list
$`sample-1`
AAStringSet object of length 141:
width seq names
[1] 741 MSYDLDEDILQDFLIEAGEILELLSEQLVELENNPEDRDLLNAI...IKPLDNLLQGTPGMAGATITSDGHIALILDVPDLLKQYAAASRL DJCA_01737
... ... ...
[23] 723 MTTPQNRSSVDNSSDEIDLGKLLGILLDAKWIILVTTFLFAVGG...VARSRFEQAGIEVKGVILNAIEKKASSSYGYGYYNYSYGESNKA DJCA_03821
$`sample-2`
AAStringSet object of length 141:
width seq names
[1] 741 MSYDLDEDILQDFLIEAGEILELLSEQLVELENNPEDRDLLNAI...IKPLDNLLQGTPGMAGATITSDGHIALILDVPDLLKQYAAASRL EKPI_01584
... ... ...
[37] 471 MKKITLFTLSLLATAVQVGAQEYVPIVEKPIYITSSKIKCVLHT...ISRYVDGNNTRYLLNIVGGRNVQVTPENEATQARWKPTLQQVKL EKPI_00335
$`sample-3`
AAStringSet object of length 143:
width seq names
[1] 741 MSYDLDEDILQDFLIEAGEILELLSEQLVELENNPEDRDLLNAI...IKPLDNLLQGTPGMAGATITSDGHIALILDVPDLLKQYAAASRL BGDK_02838
... ... ...
[39] 388 MRITVVGAGYVGLANAALLAQYHSITLLDVDSKRVEQINAQSSP...FHSRVIKDLNEFKQTADVIVSNRMVEELTDVADKVYTRDLFGSD BGDK_04023
create the new object-myclass
x <- new("myclass", mytable=df, sequences=sequences_list)
Because x@sequences is a list, I have to use [[]]. My question is: how can I provide easy access to the list data?
to print all the list (in list format)
seq(x)
seq(x) %>% class()
[1] "list"
or
seq(x)["sample-1"] %>% class()
[1] "list"
If I want to extract in AAStringSet format, I have to use [[]]
seq(x)[["sample-1"]] %>% class()
[1] "AAStringSet"
attr(,"package")
[1] "Biostrings"
seq(x)[["sample-1"]]
AAStringSet object of length 141:
width seq names
[1] 741 MSYDLDEDILQDFLIEAGEILELLSEQLVELENNPE...GTPGMAGATITSDGHIALILDVPDLLKQYAAASRL DJCA_01737
... ... ...
[141] 723 MTTPQNRSSVDNSSDEIDLGKLLGILLDAKWIILVT...GIEVKGVILNAIEKKASSSYGYGYYNYSYGESNKA DJCA_03821
This gives the sequences associated with sample-1 in Biostrings format (AAStringSet in this example). That is fine for me, but I would like simpler access, maybe with just one pair of brackets, that extracts in AAStringSet format and gives the same result as seq(x)[["sample-1"]], something like:
seq(x)["sample-1"]
So what do you recommend: just use the double bracket [[]], as with a list, or is there a way to access with a single bracket [] and still extract in AAStringSet format?

Related

How to build an empirical codon substitution matrix from a multiple sequence alignment

I have been trying to build an empirical codon substitution matrix given a multiple sequence alignment in fasta format using Biopython.
It appears to be relatively straightforward for single-nucleotide substitution matrices using the AlignInfo module when the aligned sequences have the same length. Here is what I managed to do using python2.7:
#!/usr/bin/env python
import os
import argparse
from Bio import AlignIO
from Bio.Align import AlignInfo
from Bio import SubsMat
import sys
version = "0.0.1 (23.04.20)"
name = "Aln2SubMatrix.py"
parser=argparse.ArgumentParser(description="Outputs a codon substitution matrix given a multi-alignment in FastaFormat. Will raise error if alignments contain dots (\".\"), so replace those with dashes (\"-\") beforehand (e.g. using sed)")
parser.add_argument('-i','--input', action = "store", dest = "input", required = True, help = "(aligned) input fasta")
parser.add_argument('-o','--output', action = "store", dest = "output", help = "Output filename (default = <Input-file>.codonSubmatrix")
args=parser.parse_args()
if not args.output:
    args.output = args.input + ".codonSubmatrix" #if no outputname was specified set outputname based on inputname
def main():
    infile = open(args.input, "r")
    outfile = open(args.output, "w")
    align = AlignIO.read(infile, "fasta")
    summary_align = AlignInfo.SummaryInfo(align)
    replace_info = summary_align.replacement_dictionary()
    mat = SubsMat.SeqMat(replace_info)
    print >> outfile, mat
    infile.close()
    outfile.close()
    sys.stderr.write("\nfinished\n")
main()
Using a multiple sequence alignment file in fasta format with sequences of the same length (aln.fa), the output is a half-matrix corresponding to the number of nucleotide substitutions observed in the alignment (note that gaps (-) are allowed):
python Aln2SubMatrix.py -i aln.fa
- 0
a 860 232
c 596 75 129
g 571 186 75 173
t 892 58 146 59 141
- a c g t
What I am aiming to do is to compute similar empirical substitution matrix but for all nucleotide triplets (codons) present in a multiple sequence alignment.
I have tried to tweak the _pair_replacement function of the AlignInfo module in order to accept nucleotide triplets by changing:
line 305 to 308
for residue_num in range(len(seq1)):
    residue1 = seq1[residue_num]
    try:
        residue2 = seq2[residue_num]
to
for residue_num in range(0, len(seq1), 3):
    residue1 = seq1[residue_num:residue_num+3]
    try:
        residue2 = seq2[residue_num:residue_num+3]
At this stage it can retrieve the codons from the alignment but complains about the alphabet (does the module only accept single-character alphabets?).
Note that
(i) I would like to get a substitution matrix that accounts for the three possible reading frames
Any help is highly appreciated.

Get unicode block element based on matrix

A unique question I guess, given these Unicode block elements:
https://en.wikipedia.org/wiki/Block_Elements
I want to get the relevant block element based on the matrix I get, so
11
01 will give ▜
00
10 will give ▖
and so on
I managed to do this in python, but I wonder if anyone got a more elegant solution.
from itertools import product
elements = [0, 1]
a = product(elements, repeat=2)
b = product(a, repeat=2)
matrices = [c for c in b]
"""
Matrices generated possibilities
00 00 00 00 01 01 01 01 10 10 10 10 11 11 11 11
00 01 10 11 00 01 10 11 00 01 10 11 00 01 10 11
"""
blocks = [' ', '▗', '▖', '▄', '▝', '▐', '▞', '▟', '▘', '▚', '▌', '▙', '▀', '▜', '▛', '█']
given = (
(0,1),
(1,0)
)
print(blocks[matrices.index(given)])
output: ▞
These characters, although existing, were not meant to have a direct correlation
of numbers-to-set-1/4 blocks.
So, I have a solution in a published package, and it is not necessarily
more "elegant" than yours, as it is far more verbose.
However, the code around it allows one to "draw" on a text terminal
using these 1/4 blocks as pixels, in a somewhat clean API.
So, this is the class I use to set/reset pixels in a character block. The relevant methods can be used straight from the class, and they take the "pixel coordinates" and the current character block upon which to set or reset the addressed pixel. The code instantiates the class just to be able to use the in operator to check for block-characters.
The project can be installed with "pip install terminedia".
The function and class below, extracted from the project, work standalone to do the same as your code:
# Snippets from jsbueno/terminedia, v. 0.2.0
def _mirror_dict(dct):
    """Creates a new dictionary exchanging values for keys
    Args:
      - dct (mapping): Dictionary to be inverted
    """
    return {value: key for key, value in dct.items()}
class BlockChars_:
    """Used internally to emulate pixel setting/resetting/reading inside 1/4 block characters
    Contains a listing and other mappings of all block characters used in order, so that
    bits in numbers from 0 to 15 will match the "pixels" on the corresponding block character.
    Although this class is purposed for internal use in the emulation of
    a higher resolution canvas, its functions can be used by any application
    that decides to manipulate block chars.
    The class itself is stateless, and it is used as a single-instance which
    uses the name :any:`BlockChars`. The instance is needed so that one can use the operator
    ``in`` to check if a character is a block-character.
    """
    EMPTY = " "
    QUADRANT_UPPER_LEFT = '\u2598'
    QUADRANT_UPPER_RIGHT = '\u259D'
    UPPER_HALF_BLOCK = '\u2580'
    QUADRANT_LOWER_LEFT = '\u2596'
    LEFT_HALF_BLOCK = '\u258C'
    QUADRANT_UPPER_RIGHT_AND_LOWER_LEFT = '\u259E'
    QUADRANT_UPPER_LEFT_AND_UPPER_RIGHT_AND_LOWER_LEFT = '\u259B'
    QUADRANT_LOWER_RIGHT = '\u2597'
    QUADRANT_UPPER_LEFT_AND_LOWER_RIGHT = '\u259A'
    RIGHT_HALF_BLOCK = '\u2590'
    QUADRANT_UPPER_LEFT_AND_UPPER_RIGHT_AND_LOWER_RIGHT = '\u259C'
    LOWER_HALF_BLOCK = '\u2584'
    QUADRANT_UPPER_LEFT_AND_LOWER_LEFT_AND_LOWER_RIGHT = '\u2599'
    QUADRANT_UPPER_RIGHT_AND_LOWER_LEFT_AND_LOWER_RIGHT = '\u259F'
    FULL_BLOCK = '\u2588'
    # This depends on Python 3.6+ ordered behavior for local namespaces and dicts:
    block_chars_by_name = {key: value for key, value in locals().items() if key.isupper()}
    block_chars_to_name = _mirror_dict(block_chars_by_name)
    blocks_in_order = {i: value for i, value in enumerate(block_chars_by_name.values())}
    block_to_order = _mirror_dict(blocks_in_order)
    def __contains__(self, char):
        """True if a char is a "pixel representing" block char"""
        return char in self.block_chars_to_name
    @classmethod
    def _op(cls, pos, data, operation):
        number = cls.block_to_order[data]
        index = 2 ** (pos[0] + 2 * pos[1])
        return operation(number, index)
    @classmethod
    def set(cls, pos, data):
        """"Sets" a pixel in a block character
        Args:
          - pos (2-sequence): coordinate of the pixel inside the character
            (0,0) is top-left corner, (1,1) bottom-right corner and so on)
          - data: initial character to be composed with the bit to be set. Use
            space ("\x20") to start with an empty block.
        """
        op = lambda n, index: n | index
        return cls.blocks_in_order[cls._op(pos, data, op)]
    @classmethod
    def reset(cls, pos, data):
        """"resets" a pixel in a block character
        Args:
          - pos (2-sequence): coordinate of the pixel inside the character
            (0,0) is top-left corner, (1,1) bottom-right corner and so on)
          - data: initial character to be composed with the bit to be reset.
        """
        op = lambda n, index: n & (0xf - index)
        return cls.blocks_in_order[cls._op(pos, data, op)]
    @classmethod
    def get_at(cls, pos, data):
        """Retrieves whether a pixel in a block character is set
        Args:
          - pos (2-sequence): The pixel coordinate
          - data (character): The character where to look at blocks.
        Raises KeyError if an invalid character is passed in "data".
        """
        op = lambda n, index: bool(n & index)
        return cls._op(pos, data, op)
#: :any:`BlockChars_` single instance: enables ``__contains__``:
BlockChars = BlockChars_()
After pasting only this in the terminal it is possible to do:
In [131]: pixels = BlockChars.set((0,0), " ")
In [132]: print(BlockChars.set((1,1), pixels))
# And this internal "side-product" is closer to what you have posted:
In [133]: BlockChars.blocks_in_order[0b1111]
Out[133]: '█'
In [134]: BlockChars.blocks_in_order[0b1010]
Out[134]: '▐'
The project at https://github.com/jsbueno/terminedia has a complete
drawing API to use these as pixels in an ANSI text terminal,
including bezier curves, filled ellipses, and RGB image display
(check the "examples" folder).
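The lookup from the question can also be sketched with plain bit arithmetic instead of a search through generated matrices. This is only a minimal sketch: the names BLOCKS and block_for are hypothetical, and the character order assumes the four cells are read as bits in the order upper-left, upper-right, lower-left, lower-right (most significant first), which is the order itertools.product yields in the question.

```python
# Block characters ordered so that the index, read as four bits, is
# (upper-left, upper-right, lower-left, lower-right), most significant first.
BLOCKS = " ▗▖▄▝▐▞▟▘▚▌▙▀▜▛█"


def block_for(matrix):
    """Return the quarter-block character for a 2x2 matrix of 0/1 values."""
    (upper_left, upper_right), (lower_left, lower_right) = matrix
    # Pack the four cells into one number between 0 and 15.
    index = upper_left << 3 | upper_right << 2 | lower_left << 1 | lower_right
    return BLOCKS[index]


given = ((0, 1),
         (1, 0))
print(block_for(given))
```

Because the index is computed directly, no list of all sixteen matrices has to be built or searched.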

bash merging tables on unique id

I have two similar, 'table format' text files, each several million records long. In the inputfile1, the unique identifier is a merger of values in two other columns (neither of which are unique identifiers on their own). In inputfile2, the unique identifier is two letters followed by a random four-digit number.
How can I replace the unique identifiers in inputfile1 with the corresponding unique identifiers in inputfile2? All of the records in the first table are present in the second, though not vice versa. Below are toy examples of the files.
Input file 1:
Grp Len ident data
A 20 A_20 3k3bj52
A 102 A_102 3k32rf2
A 352 A_352 3w3bj52
B 60 B_60 3k3qwrg
B 42 B_42 3kerj52
C 89 C_89 3kftj55
C 445 C_445 fy5763b
Input file 2:
Grp Len ident
A 20 fz2525
A 102 fz5367
A 352 fz4678
A 356 fz1543
B 60 fz5732
B 11 fz2121
B 42 fz3563
C 89 fz8744
C 245 fz2653
C 445 fz2985
C 536 fz8983
Desired output:
Grp Len ident data
A 20 fz2525 3k3bj52
A 102 fz5367 3k32rf2
A 352 fz4678 3w3bj52
B 60 fz5732 3k3qwrg
B 42 fz3563 3kerj52
C 89 fz8744 3kftj55
C 445 fz2985 fy5763b
My provisional plan is:
Generate extra identifiers for input2, in the style of input1 (easy)
Filter out lines from input2 that don't occur in input1 (hardish)
Then stick on the data from input1 (easy)
I might be able to do this in R but the data is large and complex, and I was wondering if there was a way in bash or perl. Any tips in the right direction would be good.
This should work for you, assuming the Grp and Len values are in the same order in both files, as per my comment.
Essentially it reads a line from the first file and then reads from the second file, forming the Grp_Len key from each record until it finds an entry that matches. Then it's just a matter of building the new output record
use strict;
use warnings;
open my $f1, '<', 'file1.txt';
print scalar <$f1>;
open my $f2, '<', 'file2.txt';
<$f2>;
while ( <$f1> ) {
  my @f1 = split;
  my @f2;
  while () {
    @f2 = split ' ', <$f2>;
    last if join('_', @f2[0,1]) eq $f1[2];
  }
  print "@f2 $f1[3]\n";
}
output
Grp Len ident data
A 20 fz2525 3k3bj52
A 102 fz5367 3k32rf2
A 352 fz4678 3w3bj52
B 60 fz5732 3k3qwrg
B 42 fz3563 3kerj52
C 89 fz8744 3kftj55
C 445 fz2985 fy5763b
Update
Here's another version which is identical except that it builds a printf format string from the spacing of the column headers in the first file. That results in a much neater output
use strict;
use warnings;
open my $f1, '<', 'file1.txt';
my $head = <$f1>;
print $head;
my $format = create_format($head);
open my $f2, '<', 'file2.txt';
<$f2>;
while ( <$f1> ) {
  my @f1 = split;
  my @f2;
  while () {
    @f2 = split ' ', <$f2>;
    last if join('_', @f2[0,1]) eq $f1[2];
  }
  printf $format, @f2, $f1[3];
}
sub create_format {
  my ($head) = @_;
  my ($format, $pos);
  while ( $head =~ /\b\S/g ) {
    $format .= sprintf("%%-%ds", $-[0] - $pos) if defined $pos;
    $pos = $-[0];
  }
  $format . "%s\n";
}
output
Grp Len ident data
A 20 fz2525 3k3bj52
A 102 fz5367 3k32rf2
A 352 fz4678 3w3bj52
B 60 fz5732 3k3qwrg
B 42 fz3563 3kerj52
C 89 fz8744 3kftj55
C 445 fz2985 fy5763b
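The Perl versions above rely on both files listing Grp and Len in the same order. If that cannot be guaranteed, a lookup keyed on the Grp_Len identifier is one alternative. Below is a minimal sketch in Python rather than bash or Perl; the function name replace_identifiers is hypothetical, and it assumes whitespace-separated columns, a header line in each table, no blank lines, and that the second table fits in memory.

```python
def replace_identifiers(lines1, lines2):
    """Swap the Grp_Len identifier in table 1 for the one from table 2.

    lines1 -- iterable of lines with columns: Grp Len ident data
    lines2 -- iterable of lines with columns: Grp Len ident
    Both are assumed to start with a header line.
    """
    rows2 = iter(lines2)
    next(rows2)  # skip the header of table 2
    # Map "Grp_Len" (the identifier style of table 1) to the table 2 identifier.
    lookup = {}
    for line in rows2:
        grp, length, ident = line.split()
        lookup[f"{grp}_{length}"] = ident

    rows1 = iter(lines1)
    output = [next(rows1).rstrip("\n")]  # keep the header of table 1
    for line in rows1:
        grp, length, old_ident, data = line.split()
        output.append(" ".join((grp, length, lookup[old_ident], data)))
    return output


# A few rows taken from the toy tables in the question.
table1 = ["Grp Len ident data", "A 20 A_20 3k3bj52", "B 42 B_42 3kerj52"]
table2 = ["Grp Len ident", "A 20 fz2525", "B 11 fz2121", "B 42 fz3563"]
result = replace_identifiers(table1, table2)
print("\n".join(result))
```

With real files, open file objects can be passed in place of the lists, since the function only iterates over lines. A record of table 1 that is missing from table 2 raises a KeyError, which matches the stated assumption that every record of the first table is present in the second.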

Show duplicates in internal table

Each and every item should have a unique SecondNo + Drawing combination. Due to misentries, some combinations appear twice.
I need to create a report in ABAP which identifies those combinations and ignores the others.
Item: SecNo: Drawing:
121 904 5000 double
122 904 5000 double
123 816 5100
124 813 5200
125 812 4900 double
126 812 4900 double
127 814 5300
How can I solve this? I tried 2 approaches and failed:
Sorting the data and printing each row whose value is equal to the value of the next row
Counting the duplicates and showing all combinations that occur more than once.
Where do I put the condition? In the loop area?
I tried this:
REPORT duplicates.
DATA: BEGIN OF lt_duplicates OCCURS 0,
f2(10),
f3(10),
END OF lt_duplicates,
it_test TYPE TABLE OF ztest WITH HEADER LINE,
i TYPE i.
SELECT DISTINCT f2 f3 FROM ztest INTO TABLE lt_duplicates.
LOOP AT lt_duplicates.
IF f2 = lt_duplicates-f2 AND f3 = lt_duplicates-f3.
ENDIF.
i = LINES( it_test ).
IF i > 1.
LOOP AT it_test.
WRITE :/ it_test-f1,it_test-f2,it_test-f3.
ENDLOOP.
ENDIF.
ENDLOOP.
From ABAP 7.40, you may use the GROUP BY constructs with the GROUP SIZE addition, so as to take into account only the groups with at least 2 elements.
ABAP statement LOOP AT ... GROUP BY ( <columns...> gs = GROUP SIZE ) ...
Loop at grouped lines:
Either LOOP AT GROUP ...
Or ... FOR ... IN GROUP ...
ABAP expression ... VALUE|REDUCE|NEW type|#( FOR GROUPS ... GROUP BY ( <columns...> gs = GROUP SIZE ) ...
Loop at grouped lines: ... FOR ... IN GROUP ...
For both constructs, it's possible to loop at the grouped lines in two ways:
* LOOP AT GROUP ...
* ... FOR ... IN GROUP ...
Line# Item SecNo Drawing
1 121 904 5000 double
2 122 904 5000 double
3 123 816 5100
4 124 813 5200
5 125 812 4900 double
6 126 812 4900 double
7 127 814 5300
You might want to produce the following table containing the duplicates:
SecNo Drawing Lines
904 5000 [1,2]
812 4900 [5,6]
Solution with LOOP AT ... GROUP BY ...:
TYPES: BEGIN OF t_line,
item TYPE i,
secno TYPE i,
drawing TYPE i,
END OF t_line,
BEGIN OF t_duplicate,
secno TYPE i,
drawing TYPE i,
num_dup TYPE i, " number of duplicates
lines TYPE STANDARD TABLE OF REF TO t_line WITH EMPTY KEY,
END OF t_duplicate,
t_lines TYPE STANDARD TABLE OF t_line WITH EMPTY KEY,
t_duplicates TYPE STANDARD TABLE OF t_duplicate WITH EMPTY KEY.
DATA(table) = VALUE t_lines(
( item = 121 secno = 904 drawing = 5000 )
( item = 122 secno = 904 drawing = 5000 )
( item = 123 secno = 816 drawing = 5100 )
( item = 124 secno = 813 drawing = 5200 )
( item = 125 secno = 812 drawing = 4900 )
( item = 126 secno = 812 drawing = 4900 )
( item = 127 secno = 814 drawing = 5300 ) ).
DATA(expected_duplicates) = VALUE t_duplicates(
( secno = 904 drawing = 5000 num_dup = 2 lines = VALUE #( ( REF #( table[ 1 ] ) ) ( REF #( table[ 2 ] ) ) ) )
( secno = 812 drawing = 4900 num_dup = 2 lines = VALUE #( ( REF #( table[ 5 ] ) ) ( REF #( table[ 6 ] ) ) ) ) ).
DATA(actual_duplicates) = VALUE t_duplicates( ).
LOOP AT table
ASSIGNING FIELD-SYMBOL(<line>)
GROUP BY
( secno = <line>-secno
drawing = <line>-drawing
gs = GROUP SIZE )
ASSIGNING FIELD-SYMBOL(<group_table>).
IF <group_table>-gs >= 2.
actual_duplicates = VALUE #( BASE actual_duplicates
( secno = <group_table>-secno
drawing = <group_table>-drawing
num_dup = <group_table>-gs
lines = VALUE #( FOR <line2> IN GROUP <group_table> ( REF #( <line2> ) ) ) ) ).
ENDIF.
ENDLOOP.
WRITE : / 'List of duplicates:'.
SKIP 1.
WRITE : / 'Secno Drawing List of concerned items'.
WRITE : / '---------- ---------- ---------------------------------- ...'.
LOOP AT actual_duplicates ASSIGNING FIELD-SYMBOL(<duplicate>).
WRITE : / <duplicate>-secno, <duplicate>-drawing NO-GROUPING.
LOOP AT <duplicate>-lines INTO DATA(line).
WRITE line->*-item.
ENDLOOP.
ENDLOOP.
ASSERT actual_duplicates = expected_duplicates. " short dump if not equal
Output:
List of duplicates:
Secno Drawing List of concerned items
---------- ---------- ---------------------------------- ...
904 5000 121 122
812 4900 125 126
Solution with ... VALUE type|#( FOR GROUPS ... GROUP BY ...:
DATA(actual_duplicates) = VALUE t_duplicates(
FOR GROUPS <group_table> OF <line> IN table
GROUP BY
( secno = <line>-secno
drawing = <line>-drawing
gs = GROUP SIZE )
( secno = <group_table>-secno
drawing = <group_table>-drawing
num_dup = <group_table>-gs
lines = VALUE #( FOR <line2> IN GROUP <group_table> ( REF #( <line2> ) ) ) ) ).
DELETE actual_duplicates WHERE num_dup = 1.
Note: for deleting non-duplicates, instead of using an additional DELETE statement, it can be done inside the VALUE construct by adding a LINES OF COND construct which adds 1 line if group size >= 2, or none otherwise (if group size = 1):
...
gs = GROUP SIZE )
( LINES OF COND #( WHEN <group_table>-gs >= 2 THEN VALUE #( "<== new line
( secno = <group_table>-secno
...
... REF #( <line2> ) ) ) ) ) ) ) ). "<== 3 extra right parentheses
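For readers who do not use ABAP, the grouping logic behind GROUP BY with GROUP SIZE can be illustrated in Python. This is only a language-neutral sketch of the idea, not a replacement for the ABAP code: the function name find_duplicates is hypothetical, and rows are assumed to be (item, secno, drawing) tuples.

```python
from collections import defaultdict


def find_duplicates(rows):
    """Group rows by (secno, drawing) and keep groups with 2+ items.

    rows -- iterable of (item, secno, drawing) tuples
    Returns {(secno, drawing): [item, ...]} for duplicated combinations only.
    """
    groups = defaultdict(list)
    for item, secno, drawing in rows:
        groups[(secno, drawing)].append(item)
    # The equivalent of checking the group size (gs >= 2) for each group.
    return {key: items for key, items in groups.items() if len(items) >= 2}


# The example table from the question.
table = [
    (121, 904, 5000),
    (122, 904, 5000),
    (123, 816, 5100),
    (124, 813, 5200),
    (125, 812, 4900),
    (126, 812, 4900),
    (127, 814, 5300),
]
duplicates = find_duplicates(table)
for (secno, drawing), items in duplicates.items():
    print(secno, drawing, items)
```

As in the ABAP solutions, each duplicated combination is reported once, together with the items that share it.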
You can use AT...ENDAT for this, provided that you arrange the fields correctly:
TYPES: BEGIN OF t_my_line,
secno TYPE foo,
drawing TYPE bar,
item TYPE baz, " this field has to appear AFTER the other ones in the table
END OF t_my_line.
DATA: lt_my_table TYPE TABLE OF t_my_line,
lt_duplicates TYPE TABLE OF t_my_line.
FIELD-SYMBOLS: <ls_line> TYPE t_my_line.
START-OF-WHATEVER.
* ... fill the table ...
SORT lt_my_table BY secno drawing.
LOOP AT lt_my_table ASSIGNING <ls_line>.
AT NEW drawing. " whenever drawing or any field left of it changes...
FREE lt_duplicates.
ENDAT.
APPEND <ls_line> TO lt_duplicates.
AT END OF drawing.
IF lines( lt_duplicates ) > 1.
* congrats, here are your duplicates...
ENDIF.
ENDAT.
ENDLOOP.
I needed simply to report duplicate lines in error based on two fields so used the following.
LOOP AT gt_data INTO DATA(gs_data)
GROUP BY ( columnA = gs_data-columnA columnB = gs_data-columnB
size = GROUP SIZE index = GROUP INDEX ) ASCENDING
REFERENCE INTO DATA(group_ref).
IF group_ref->size > 1.
PERFORM insert_error USING group_ref->columnA group_ref->columnB.
ENDIF.
ENDLOOP.
Here is my 2p worth, you could cut some out of this depending on what you want to do, and you should consider the amount of data being processed too. This method is only really for smaller sets.
Personally I like to prevent erroneous records at the source. Catching an error during input. But if you do end up in a pickle there is definitely more than one way to solve the issue.
TYPES: BEGIN OF ty_itab,
item TYPE i,
secno TYPE i,
drawing TYPE i,
END OF ty_itab.
TYPES: itab_tt TYPE STANDARD TABLE OF ty_itab.
DATA: lt_itab TYPE itab_tt,
lt_itab2 TYPE itab_tt,
lt_itab3 TYPE itab_tt.
lt_itab = VALUE #(
( item = '121' secno = '904' drawing = '5000' )
( item = '122' secno = '904' drawing = '5000' )
( item = '123' secno = '816' drawing = '5100' )
( item = '124' secno = '813' drawing = '5200' )
( item = '125' secno = '812' drawing = '4900' )
( item = '126' secno = '812' drawing = '4900' )
( item = '127' secno = '814' drawing = '5300' )
).
APPEND LINES OF lt_itab TO lt_itab2.
APPEND LINES OF lt_itab TO lt_itab3.
SORT lt_itab2 BY secno drawing.
DELETE ADJACENT DUPLICATES FROM lt_itab2 COMPARING secno drawing.
* Loop at what is hopefully the smaller itab.
LOOP AT lt_itab2 ASSIGNING FIELD-SYMBOL(<line>).
DELETE TABLE lt_itab3 FROM <line>.
ENDLOOP.
* itab1 has all originals.
* itab2 has the unique.
* itab3 has the duplicates.

multiline matching with ruby

I have a string variable with multiple lines, e.g.
"SClone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"
I want to get both of the lines that start with "Seq_vec SVEC" and extract the integer values that match...
string = "Clone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"
seqvector = Regexp.new("Seq_vec\\s+SVEC\\s+(\\d+\\s+\\d+)",Regexp::MULTILINE )
vector = string.match(seqvector)
if vector
  vector_start,vector_stop = vector[1].split(/ /)
  puts vector_start.to_i
  puts vector_stop.to_i
end
However, this only grabs the first match's values and not the second, as I would like.
Any ideas what I could be doing wrong?
Thank you
To capture groups use String#scan
vector = string.scan(seqvector)
=> [["1 65"], ["102 1710"]]
match finds just the first match. To find all matches use String#scan e.g.
string.scan(seqvector)
=> [["1 65"], ["102 1710"]]
or to do something with each match:
string.scan(seqvector) do |match|
  # match[0] will be the substring captured by your first regexp grouping
  puts match.inspect
end
Just to make this a bit easier to handle, I would split the whole string into an array first and then would do:
string = "SClone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"
selected_strings = string.split("\n").select{|x| /Seq_vec SVEC/.match(x)}
selected_strings.collect{|x| x.scan(/\s\d+/)}.flatten # => [" 1", " 65", " 102", " 1710"]
