Regular expression to split each of 6 lines into substrings

1~6 lines:
/Ab7c142...m1l???t1???ygeeae===13245gea123
/Ab7c142???t1???ygeeae===13245gea123
/Ab7c...m1l???t1???ygeeae===13245gea123
/Ab7c???t1???ygeeae===13245gea123
/???ygeeae===13245gea123
/Ab7c
Line 1, /Ab7c142...m1l???t1???ygeeae===13245gea123, should split into:
/
Ab7c
142
.
..m1l
?
??ygeeae
=
==13245gea123
Line 6, /Ab7c, should split into:
/
Ab7c
Keep array position.

Related

Ruby Regular Expression for Parcelify (Shopify)

I'm trying to write a Regex in Ruby for a shipping query.
If postcodes match MK1 - MK10, MK19, MK43, MK46 or MK77, then allow it.
If postcodes match NN1 - NN7, NN12, NN13, NN29 or NN77, then allow it.
If postcodes match MK11 - MK18 then don't allow it.
My trouble is that UK postcodes are a bit funny: you can write MK1 1TS or MK11TS and they're considered the same. So when MK11 is not allowed, MK11TY could be misread as MK11.
I've written the regex below. So far it disallows MK111TS and MK11\s1TS and allows MK1\s1TS, but not MK11TS. Any help would be greatly appreciated; I've only tested this for MK11 so far.
^((?!MK11\d).)*$&^((?!MK11\s\d).)*$|(MK(1 |2 |3 |4 |5 |6 |7 |8 |9 |10 ))|(MK19)|(MK43)|(MK46)|(MK77)|(NN1)|(NN2)|(NN3)|(NN4)|(NN5)|(NN6)|(NN7)|(NN12)|(NN13)|(NN29)|(NN77)
Thanks in advance.
r = /
(?: # begin non-capture group
MK # match characters
(?:1|2|3|4|5|6|7|8|9|10|19|43|46|77) # match one of the choices
| # or
NN # match characters
(?:1|2|3|4|5|6|7|12|13|29|77) # match one of the choices
) # end non-capture group
(?![^\sA-Z]) # negative lookahead: next char, if any, must be whitespace or A-Z
/ix # case-insensitive and free-spacing
# regex definition mode
This is conventionally written
r = /(?:MK(?:1|2|...|10|19|...|77)|NN(?:1|2|...|7|12|13|29|77))(?![^\sA-Z])/i
"MK4 abc def MK11MK19ghi NN6 jkl NN13 NN29NN77".scan(r)
# => ["MK4", "NN6", "NN13", "NN29", "NN77"]
"MK11" is not matched because "11" is not in the list. "MK19" is not matched because it is followed by a character that is neither a space nor a capital letter.
Alternatively, one could write
s = (['MK'].product(%w{1 2 3 4 5 6 7 8 9 10 19 43 46 77}).map(&:join) +
['NN'].product(%w{1 2 3 4 5 6 7 12 13 29 77}).map(&:join)).join('|')
# => "MK1|MK2|...|MK10|MK19|MK43|MK46|MK77|NN1|NN2|...|NN7|NN12|NN13|NN29|NN77"
r = /(?:#{s})(?![^\sA-Z])/i
#=> /(?:MK1|MK2|...|MK10|MK19|...|MK77|NN1|NN2|...|NN7|NN12|NN13|NN29|NN77)(?![^\sA-Z])/i
If the remainder of the postal code is to be included in the regex, perhaps something like the following could be done.
suffixes = %w|ES AB CD EF|.join('|')
#=> "ES|AB|CD|EF"
Then replace (?![^\sA-Z]) and the closing /ix with the following.
\s? # optionally match a space
(?:#{suffixes}) # match a valid suffix in a non-capture group
(?!\S) # do not match a non-whitespace char (negative lookahead)
/ix # case-indifferent and free-spacing regex definition mode
Note the negative lookahead is satisfied if the suffix is at the end of the string.
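As a quick check of that last point, here is a minimal sketch. The pattern uses only the made-up suffix AB so that the (?!\S) lookahead can be seen in isolation; SUFFIX_CHECK is a name chosen for this illustration.

```ruby
# Illustrative pattern: the suffix "AB" followed by the same negative lookahead.
SUFFIX_CHECK = /AB(?!\S)/

at_end        = SUFFIX_CHECK.match?("MK4 AB")       # suffix ends the string
before_space  = SUFFIX_CHECK.match?("MK4 AB next")  # suffix followed by whitespace
before_letter = SUFFIX_CHECK.match?("MK4 ABC")      # suffix followed by a non-space

p [at_end, before_space, before_letter]
# => [true, true, false]
```

The lookahead only rejects a following non-whitespace character, so "nothing follows" passes just like "whitespace follows".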
Now I have written the following to match the postcodes format exactly:
#format: Area Code, Localities accepted, whitespace (or not), any digit, any single letter, any single letter
((MK|mk|Mk|mK)(?:1|2|3|4|5|6|7|8|9|10|19|43|46|77)\s\d[A-Za-z][A-Za-z]) #with whitespace
|
((MK|mk|Mk|mK)(?:1|2|3|4|5|6|7|8|9|10|19|43|46|77)\d[A-Za-z][A-Za-z]) #without whitespace
|
((NN|nn|Nn|nN)(?:1|2|3|4|5|6|7|12|13|29|77)\s\d[A-Za-z][A-Za-z]) #with whitespace
|
((NN|nn|Nn|nN)(?:1|2|3|4|5|6|7|12|13|29|77)\d[A-Za-z][A-Za-z]) #without whitespace
This works for my purposes. I got here using Cary's answer, which has been extremely helpful. Thank you; I have upvoted it.
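A different approach, not taken from the answers above, is sketched here. It relies on the fact that the second half of a UK postcode is always a digit followed by two letters, so after removing whitespace the last three characters can be split off and the remaining district compared with an allow-list. The names ALLOWED_OUTWARD and allowed_postcode? are hypothetical, and the pattern only covers the simple area-plus-digits shape used in the question.

```ruby
require 'set'

# Districts to allow, copied from the lists in the question.
ALLOWED_OUTWARD = (
  %w[1 2 3 4 5 6 7 8 9 10 19 43 46 77].map { |n| "MK#{n}" } +
  %w[1 2 3 4 5 6 7 12 13 29 77].map { |n| "NN#{n}" }
).to_set.freeze

# Hypothetical helper: true when the district part of the postcode is allowed.
def allowed_postcode?(postcode)
  compact = postcode.upcase.gsub(/\s+/, '')
  # Group 1 is the district (letters plus one or two digits);
  # group 2 is the final digit-letter-letter part.
  m = compact.match(/\A([A-Z]{1,2}\d{1,2})(\d[A-Z]{2})\z/)
  return false unless m

  ALLOWED_OUTWARD.include?(m[1])
end

p allowed_postcode?("MK1 1TS")  # => true
p allowed_postcode?("MK11TS")   # => true  (same postcode, written without the space)
p allowed_postcode?("MK11 1TS") # => false
p allowed_postcode?("MK111TS")  # => false
```

Because the final three characters are fixed, MK11TS can only be read as MK1 + 1TS, which removes the ambiguity without any lookaheads.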

Best way to capture multiple matches

I have a text message that contains a fixed part once (the ID of an item) and multiple repeated lines (several references and the dimensions of each part):
..some random text here..
ID/11000082734
REF/D14-109-0
REF/D14-209-0
REF/D14-219-0
CMT/59-40-25
CMT/38-25-28
CMT/59-40-25
CMT/37-37-20
CMT/40-40-20
CMT/37-37-20
CMT/49-41-31
CMT/44-34-53
I want to parse and store the IdCode, the References, and an array of dimensions.
When I apply the REGEX.match(my_text) method, I get only one occurrence each of REF and CMT:
REGEX = %r{
ID\/(?<IdCode> \d{10})\s
(REF\/(?<ReferenceCode> \w{3}\-\d{3}\-\d)\s)+
(CMT\/(?<Length> \d+)\-(?<Width> \d+)\-(?<Height> \d+)\s)+
}x
The result looks like this:
IdCode: "1100008273"
ReferenceCode: "D14-219-0"
Length: "37"
Width: "37"
Height: "20"
Is there a way to capture multiple occurrences without iterating?
Suppose your string were:
str = %w| dog
ID/11000082734
REF/D14-109-0
REF/D14-209-0
CMT/49-41-31
CMT/44-34-53
cat
ID/11000082735
REF/D14-109-1
REF/D14-209-1
CMT/49-41-32
CMT/44-34-54
pig |.join("\n")
#=> "dog\nID/11000082734\nREF/D14-109-0\nREF/D14-209-0\nCMT/49-41-31\nCMT/44-34-53\ncat\nID/11000082735\nREF/D14-109-1\nREF/D14-209-1\nCMT/49-41-32\nCMT/44-34-54\npig"
Then you could write:
r = /(ID\/\d{11}) # match string in capture group 1
\n # match newline
((?:REF\/[A-Z]\d{2}-\d{3}-\d\n)+) # match consecutive REF lines in capture group 2
((?:CMT\/\d{2}-\d{2}-\d{2}\n)+) # match consecutive CMT lines in capture group 3
/x # free-spacing regex definition mode
arr = str.scan(r)
#=> [["ID/11000082734", "REF/D14-109-0\nREF/D14-209-0\n",
# "CMT/49-41-31\nCMT/44-34-53\n"],
# ["ID/11000082735", "REF/D14-109-1\nREF/D14-209-1\n",
# "CMT/49-41-32\nCMT/44-34-54\n"]]
This extracts the desired information without iterating.
At this point it may be desirable to convert arr to a more convenient data structure. For example:
arr.map do |a,b,c|
{ :id => a[/\d+/],
:ref => b.split("\n").map { |s| s[4..-1] },
:cmt => c.scan(/(\d{2})-(\d{2})-(\d{2})/).map { |e|
[:length, :width, :height].zip(e.map(&:to_i)).to_h }
}
end
#=> [{ :id=>"11000082734",
# :ref=>["D14-109-0", "D14-209-0"],
# :cmt=>[{ :length=>49, :width=>41, :height=>31 },
# { :length=>44, :width=>34, :height=>53 }
# ]
# },
# { :id=>"11000082735",
# :ref=>["D14-109-1", "D14-209-1"],
# :cmt=>[{ :length=>49, :width=>41, :height=>32 },
# { :length=>44, :width=>34, :height=>54 }
# ]
# }
# ]
Try this
(?<IdCode>\d{10,})|REF\/(?<ReferenceCode>\w{3}\-\d{3}\-\d)|CMT\/(?<Length>\d+)\-(?<Width>\d+)\-(?<Height>\d+)
Explanation:
( … ): Capturing group
?: Once or none
\: Escapes a special character
|: Alternation / OR operand
+: One or more
Input
..some random text here..
ID/11000082734
REF/D14-109-0
REF/D14-209-0
REF/D14-219-0
CMT/59-40-25
CMT/38-25-28
CMT/59-40-25
CMT/37-37-20
CMT/40-40-20
CMT/37-37-20
CMT/49-41-31
CMT/44-34-53
Output:
MATCH 1
IdCode [29-40] `11000082734`
MATCH 2
ReferenceCode [45-54] `D14-109-0`
MATCH 3
ReferenceCode [59-68] `D14-209-0`
MATCH 4
ReferenceCode [73-82] `D14-219-0`
MATCH 5
Length [87-89] `59`
Width [90-92] `40`
Height [93-95] `25`
MATCH 6
Length [100-102] `38`
Width [103-105] `25`
Height [106-108] `28`
MATCH 7
Length [113-115] `59`
Width [116-118] `40`
Height [119-121] `25`
MATCH 8
Length [126-128] `37`
Width [129-131] `37`
Height [132-134] `20`
MATCH 9
Length [139-141] `40`
Width [142-144] `40`
Height [145-147] `20`
MATCH 10
Length [152-154] `37`
Width [155-157] `37`
Height [158-160] `20`
MATCH 11
Length [165-167] `49`
Width [168-170] `41`
Height [171-173] `31`
MATCH 12
Length [178-180] `44`
Width [181-183] `34`
Height [184-186] `53`
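The match list above comes from an online regex tester. In Ruby, the same alternation pattern could be applied with String#scan, roughly as sketched below. ITEM_PATTERN and parse_item are names invented for this illustration, and the sample text is a shortened version of the input.

```ruby
# The alternation pattern from the answer, unchanged.
ITEM_PATTERN = %r{(?<IdCode>\d{10,})|REF\/(?<ReferenceCode>\w{3}\-\d{3}\-\d)|CMT\/(?<Length>\d+)\-(?<Width>\d+)\-(?<Height>\d+)}

# Hypothetical helper: collects every match into one hash.
def parse_item(text)
  item = { id: nil, refs: [], dims: [] }
  # With capture groups present, scan yields one array of captures per match;
  # groups belonging to branches that did not match are nil.
  text.scan(ITEM_PATTERN) do |id, ref, length, width, height|
    if id
      item[:id] = id
    elsif ref
      item[:refs] << ref
    else
      item[:dims] << { length: length.to_i, width: width.to_i, height: height.to_i }
    end
  end
  item
end

sample = "..some random text here..\nID/11000082734\nREF/D14-109-0\nREF/D14-209-0\nCMT/59-40-25\nCMT/38-25-28\n"
p parse_item(sample)
```

Note that scan still visits each match internally; the block only sorts the captures into the right slot.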

count specific lines in specific files in a folder

I'm fairly new to Ruby, but this is testing me.
I want to count all the lines in any file in a folder whose name ends in bowtie.txt.
The lines have to start with a number of varying length followed by a '+' or a '-' (with or without whitespace in between). Sometimes the lines are wrapped, but I don't know if this matters.
I then want to create a hash that stores each filename with its associated count.
I think I've got as far as looping through the directory to select the files and counting the number of lines in each file, but how do I then create the hash and return it?
The file data looks like:
0 + chr12 129402816 ACACAGGGAGGGGAATAACACACACTGGGACCTGTCAGGAGAGGGTAGGGCTGGGGGCATCAGGAGAGCATCAGGAAAAATAGCTAATGCATGCTGGGCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
2 - chr5 93625939 TCAACCTGTCATCTACATTAGGTATTTCTCCTAATGCTATCCCTCCCCTAGCCCCCCACCACCCAACAGACCCTGGTGTGTGATGTTCCCCTCCCTGTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 5:T>C
5 + chr3 155023119 ACACAGGGAGGGGAACATCACACACCGGGGCCTGTAGTGGGGGTGAGGGGCAAGAGGAGGAATAGCATTAGGAGAAATACCTAATGTAGATGACCGGTTG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
7 + chr2 22818055 ACACAGGGAGGGGAAAAACACACACTGGGGCTTCTCAGGGGTGGTGGGGGGAGAGCATCAGGATAAATAGCTAATGCATGCAGGGCTTAATACCTAGGTG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
8 + chr3 131206106 ACACAGGGAGGGGAACATCACACACCAGGCCCTGTCAGCGGTGAGGGGCTGGGGGAGGGATAGCATTAAGAGAAATACCTAATATAAATGACGAGTTGAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 8:C>A
10 + chrX 108455592 ACACAGGGAGGGGAACATCACACACCAGGGCCTGTCGGGCAGTGGGGGGGCAAAGGGAGGGATTAAGTCATACACCCAATGCATGTGGGGCTTAAAACCC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 7:A>G
11 - chr2 31936302 ACCCATTAACTCGTCATTTACATTAGGTATATCTCCTAATGCTATCCCTCCCCCCACCCCACAACAGGCCCCCCGGTGTGTGATGTTCCCCTCCCTGTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 7:T>C
This is what I am trying to get at the end
blablabla.bowtie.txt : 27998
blablafsfds.bowtie.txt : 25987
etc
This is my attempt at the code:
Dir[File.join('/Volumes/SeagateBackupPlusDriv/SequencingRawFiles/TumourOesophagealOCCAMS/SequencingScripts/3finalcounts', '*.bowtie.txt')].each |file| do
puts File.open(file) { |f| f.grep(/^[0-9]*.\+|\-/).count }
end
Untested, since I have no input files, but likely working:
# `Dir[]` expects its own format
# ⇓ will inject results into hash
Dir['/Volumes/.../*.bowtie.txt'].inject({}) do |memo, file|
memo[file] = File.readlines(file).select do |line|
line =~ /^[0-9]+\s*(\+|\-)/ # only those, matching
end.count
memo
end
Additional references: IO#readlines, Enumerable#select, Enumerable#inject.
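The same idea can be wrapped in a method that returns the hash and prints it in the requested "name : count" form. This is only a sketch: bowtie_line_counts is a hypothetical helper, it keys the hash by base filename rather than full path, and it reuses the regex from the answer above unchanged.

```ruby
# Hypothetical helper: maps each *.bowtie.txt file in dir to the number of
# lines that start with digits followed by '+' or '-'.
def bowtie_line_counts(dir)
  Dir.glob(File.join(dir, '*.bowtie.txt')).each_with_object({}) do |path, counts|
    # File.foreach reads one line at a time instead of loading the whole file.
    counts[File.basename(path)] = File.foreach(path).count do |line|
      line.match?(/^[0-9]+\s*(\+|\-)/)
    end
  end
end

# Example call; the current directory is just a placeholder.
bowtie_line_counts('.').each { |name, count| puts "#{name} : #{count}" }
```

Using File.foreach rather than File.readlines avoids holding a large sequencing file in memory while counting.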

ruby multiple loop sets but with limited rows per set

Alrightie, so I'm building a CSV file this time with Ruby. The outer loop should run up to num_of_loops rows, but it runs for an entire set of days rather than stopping at the specified row. I want to change the first column of the CSV file to a new name for each row.
If I do this:
class_days = %w[Wednesday Thursday Friday]
num_of_loops = (num_of_loops / class_days.size).ceil
num_of_loops.times {
["Wednesday","Thursday","Friday"].each do |x|
data[0] = x
data[4] = classname()
# Write all to file
#
csv << data
end
}
Then the loop will run only 3 times for a 5 row request.
I'd like it to run the full 5 rows such that instead of stopping at Wed/Thurs/Fri it goes to Wed/Thurs/Fri/Wed/Thurs instead.
class_days = %w[Wednesday Thursday Friday]
num_of_loops.times do |i|
data[0] = class_days[i % class_days.size]
data[4] = classname
csv << data
end
The interesting part is here:
class_days[i % class_days.size]
We need an index into class_days that is between 0 and class_days.size - 1. We can get that with the % (modulo) operator. That operator yields the remainder after dividing i by class_days.size. This table shows how it works:
i i % 3
0 0
1 1
2 2
3 0
4 1
5 2
...
The other key part is that the times method yields indices starting with 0.
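For comparison, here is a small sketch of the modulo lookup next to Array#cycle, which produces the same wrap-around sequence. The helper names day_for_row and days_for_rows, and the row count of 5, are assumptions made for this illustration.

```ruby
# Hypothetical helper: picks the day for a zero-based row index via modulo.
def day_for_row(days, row_index)
  days[row_index % days.size]
end

# Hypothetical helper: the same sequence built with Array#cycle.
def days_for_rows(days, row_count)
  days.cycle.first(row_count)
end

CLASS_DAYS = %w[Wednesday Thursday Friday].freeze

by_modulo = (0...5).map { |i| day_for_row(CLASS_DAYS, i) }
by_cycle  = days_for_rows(CLASS_DAYS, 5)

p by_modulo
# => ["Wednesday", "Thursday", "Friday", "Wednesday", "Thursday"]
p by_cycle == by_modulo
# => true
```

The cycle version is handy when the whole list of days is needed up front; the modulo version fits better inside an existing indexed loop.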

multiline matching with ruby

I have a string variable with multiple lines, e.g.
"SClone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"
I want to get both of the lines that start with "Seq_vec SVEC" and extract the integer values that match...
string = "Clone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"
seqvector = Regexp.new("Seq_vec\\s+SVEC\\s+(\\d+\\s+\\d+)",Regexp::MULTILINE )
vector = string.match(seqvector)
if vector
vector_start,vector_stop = vector[1].split(/ /)
puts vector_start.to_i
puts vector_stop.to_i
end
However, this only grabs the first match's values and not the second, as I would like.
Any ideas what I could be doing wrong?
Thank you
To capture groups use String#scan
vector = string.scan(seqvector)
=> [["1 65"], ["102 1710"]]
match finds just the first match. To find all matches use String#scan e.g.
string.scan(seqvector)
=> [["1 65"], ["102 1710"]]
or to do something with each match:
string.scan(seqvector) do |match|
# match[0] will be the substring captured by your first regexp grouping
puts match.inspect
end
Just to make this a bit easier to handle, I would split the whole string into an array first and then would do:
string = "SClone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"
selected_strings = string.split("\n").select{|x| /Seq_vec SVEC/.match(x)}
selected_strings.collect{|x| x.scan(/\s\d+/)}.flatten # => [" 1", " 65", " 102", " 1710"]
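Putting the String#scan suggestion together with the integer conversion from the question gives something like the sketch below. Here seq_vec_ranges is a hypothetical helper, and the pattern captures the two numbers in separate groups instead of splitting afterwards. Regexp::MULTILINE is left out because in Ruby it only makes . match newlines; ^ already matches at the start of every line.

```ruby
# Hypothetical helper: returns [start, stop] integer pairs for every
# line that begins with "Seq_vec SVEC".
def seq_vec_ranges(text)
  text.scan(/^Seq_vec\s+SVEC\s+(\d+)\s+(\d+)/).map do |start, stop|
    [start.to_i, stop.to_i]
  end
end

SEQ_SAMPLE = "Clone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"

p seq_vec_ranges(SEQ_SAMPLE)
# => [[1, 65], [102, 1710]]
```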
