Regex for First Line (Only) that Contains a String - ruby

I have a bunch of phone numbers with one per line:
[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s
I want to grab the first one that contains the letter "c" upper or lower case.
So far, I have this /^.*[C].*$/i and that matches C (202) 456-1111, [Cell] (505) 555-1234 and c 12346567s. How do I return only the first? In other words, the match should only be C (202) 456-1111.
I have been blindly putting question marks everywhere without success.
I am using Ruby if it makes a difference http://www.rubular.com/r/h6ReB9IN8t
Edit: Here is another question that Hrishi pointed to but I cannot figure out how to adapt it to match the whole line.

Try the match method, which returns only the first match. Here is an example:
list = <<EOF
[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s
EOF
Update:
# match the first line containing the letter "c", even when it is part of a word
puts list.match(/^.*C.*$/i)
# match the first line whose first word character is a standalone "c" (not part of a word)
puts list.match(/^\W*C\W.*$/i)

I'd go about this a bit differently. I prefer to reduce regular expressions to very simple patterns:
str = <<EOT
[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s
EOT
Finding the right line to work with is easily done using either select or find:
str.split("\n").select{ |s| s[/c/i] }.first # => "C (202) 456-1111"
str.split("\n").find{ |s| s[/c/i] } # => "C (202) 456-1111"
I'd recommend find because it stops at the first occurrence and returns only that element, whereas select filters the whole list.
Once the desired string is found, use scan to grab the numbers:
str.split("\n").find{ |s| s[/c/i] }.scan(/\d+/) # => ["202", "456", "1111"]
Then join them. When you have phone numbers stored in a database you don't really want them to be formatted, you just want the numbers. Formatting occurs later when you're outputting them again.
phone_number = str.split("\n").find{ |s| s[/c/i] }.scan(/\d+/).join # => "2024561111"
When you need to output the number, break it into the right grouping based on the regional phone-number representation. You should have some idea where the person is located, because you've usually also got their country code. Based on that you know how many digits you should have, plus the groups:
area_code, prefix, number = phone_number[0 .. 2], phone_number[3 .. 5], phone_number[6 .. 9] # => ["202", "456", "1111"]
Then output them so they're displayed correctly:
"(%s) %s-%s" % [area_code, prefix, number] # => "(202) 456-1111"
As far as your original pattern /^.*[C].*$/i, there are some things wrong with your understanding of regex:
^.* says "start at the beginning of a line and find zero or more characters" (in Ruby, ^ anchors at the start of each line), which is no more effective than saying /[C]/.
Using [C] creates an unnecessary character set, which means "find one of the letters in the set 'C'". It does nothing useful, so just use C, as in /C/.
.*$ artificially finds the end of the line also, but since you're not capturing it there's nothing accomplished, so don't bother with it. The regex is now /C/.
Since you want to match upper and lower-case, use /C/i or /c/i. (Or you could use /[cC]/ but why?)
Instead:
To find a "c" or "C" anywhere in the string, just use /c/i. That's all that's needed. http://rubular.com/r/uPyxACOWls
To find "c", "C", "cell" or "Cell", you can use /c(?:ell)?/i. http://rubular.com/r/TkSRPWG2y6
To find "c", "C", "cell" or "Cell" as a separate word, use word-break markers like /\bc(?:ell)?\b/i. http://rubular.com/r/Smo0bFs9w8
You can get a whole lot more complicated, but if you're not accomplishing anything with the additional pattern information, you're just wasting the regex-engine's CPU-time, and slowing your code. A confused regex-engine can waste a LOT of CPU-time, so be efficient and aware of what you're asking it to do.
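The difference between these patterns can be seen side by side. The following is only a sketch: it assumes the /i flag on every pattern, the helper name first_line_matching is hypothetical, and the "[Office]" line is made-up data added so that the patterns disagree.

```ruby
# Sample lines; the "[Office]" entry is invented for illustration.
LINES = <<~EOT.lines.map(&:chomp)
  [Office] (202) 121-7777
  [Cell] (505) 555-1234
  C (202) 456-1111
EOT

# Hypothetical helper: return the first line the pattern matches, or nil.
def first_line_matching(lines, pattern)
  lines.find { |line| line.match?(pattern) }
end

# "c" anywhere in the line, including inside "Office".
any_c = first_line_matching(LINES, /c/i)

# "c" optionally followed by "ell"; selects the same lines as /c/i.
c_or_cell = first_line_matching(LINES, /c(?:ell)?/i)

# "c" or "cell" only when it stands as a separate word.
as_word = first_line_matching(LINES, /\bc(?:ell)?\b/i)

puts any_c, c_or_cell, as_word
```

Only the word-boundary version skips the "[Office]" line, which is the practical reason to prefer it when labels other than "C" or "Cell" may contain the letter.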

EDIT Added two more ways of handling this. The last one is preferable.
This will do what you want. It searches for matches of your regex and returns the first one. Please note that the first form raises an error if the string has no matches, because match returns nil and nil has no captures method.
string = "[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s"
puts string.match(/^(.*[C].*)$/i).captures.first
puts string.match(/^(.*[C].*)$/i)
puts string[/^(.*[C].*)$/i]
Ruby Docs String#match.
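Since String#match returns nil when nothing matches, it can help to see the nil-safe forms next to each other. This is a minimal sketch with made-up input strings; the variable names are illustrative only.

```ruby
with_match    = "[Home] (202) 121-7777\nC (202) 456-1111\n[Cell] (505) 555-1234"
without_match = "[Home] (202) 121-7777\nW 303-555-5555"

pattern = /^(.*c.*)$/i

# String#[] with a regex returns the first matched substring, or nil.
first = with_match[pattern]
none  = without_match[pattern]

# The safe-navigation operator (&.) skips the call when match returns nil,
# so no NoMethodError is raised.
safe = without_match.match(pattern)&.captures

puts first.inspect
puts none.inspect
puts safe.inspect
```

Either form lets the caller test for nil instead of rescuing an exception.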

Split the string by the new line characters, and select the substring which matches your requirements and grab the first one:
str = '[Home] (202) 121-7777
C (202) 456-1111
[mobile] 55 55 5 55555
[Work] (404) 555-1234
[Cell] (505) 555-1234
W 303-555-5555
M 777-555-5555
c 12346567s'
p str.split(/\n/).select{|el| el =~ /^.*[C].*$/i}[0]
or use match:
p str.match(/^.*[C].*$/i)[0]
EDITED:
Or, in case you want to find the first line that starts with an uppercase C, try this:
p str.match(/^C.*$/)[0]

Related

How to parse username, ID or whole part using Ruby Regex in this sentence?

I have a sentence like this:
Hello #[Pratha](user:1), did you see #[John](user:3)'s answer?
What I want is to get #[Pratha](user:1) and #[John](user:3), either as names and ids or just as the text quoted above, so that I can split and parse the name and id myself.
But there is an issue here. The names Pratha and John may include non-alphabetic characters such as ', ,, -, +, etc., but not [] and ().
What I tried so far:
c = ''
f = c.match(/(?:\s|^)(?:#(?!(?:\d+|\w+?_|_\w+?)(?:\s(\[)|$)))(\w+)(?=\s|$)/i)
But no success.
You may use
/#\[([^\]\[]*)\]\([^()]*:(\d+)\)/
See the regex demo
Details
# - a # char
\[ - a [
([^\]\[]*) - Group 1: 0+ chars other than [ and ]
\] - a ] char
\( - a ( char
[^()]*- 0+ chars other than ( and )
: - a colon
(\d+) - Group 2: 1 or more digits
\) - a ) char.
Sample Ruby code:
s = "Hello #[Pratha](user:1), did you see #[John](user:3)'s answer?"
rx = /#\[([^\]\[]*)\]\([^()]*:(\d+)\)/
res = s.scan(rx)
p res
# => [["Pratha", "1"], ["John", "3"]]
"Hello #[Pratha](user:1), did you see #[John](user:3)'s answer?".scan(/#.*?\)/)
#⇒ ["#[Pratha](user:1)", "#[John](user:3)"]
Since the line is not coming from user input, you can rely on the fact that the part you are interested in starts with # and ends with ).
You could use 2 capturing groups to get the names and the id's:
#\[([^]]+)]\([^:]+:([^)]+)\)
That will match
# Match # literally
\[ Match [
([^]]+) 1st capturing group, which matches any char except ] 1+ times using a negated character class
] Match ]
\( Match (
[^:]+: Match any char except : 1+ times, then match :
([^)]+) 2nd capturing group, which matches any char except ) 1+ times
\) Match )
Regex demo | Ruby demo
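If hashes are more convenient than nested arrays, the scan results can be mapped into them. This is a sketch only: the group names name and id and the variable mentions are assumptions added for illustration, and the pattern is the first one suggested above with names attached to its groups.

```ruby
s = "Hello #[Pratha](user:1), did you see #[John](user:3)'s answer?"

# Same pattern as above, with the two groups given (hypothetical) names.
rx = /#\[(?<name>[^\]\[]*)\]\([^()]*:(?<id>\d+)\)/

# scan yields one array of captures per match, in group order.
mentions = s.scan(rx).map do |name, id|
  { name: name, id: id.to_i }
end

p mentions
```

Converting the id with to_i is a design choice; keep it as a string if the ids are not guaranteed to be numeric.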

RNA Splicing Python

I have a gene sequence –
"acguccgcaagagaagccuuaauauauucaaaaagcuacgccucagauuucgcgcucgagcccaaaacaacugguguacggguugaucacaucaaaugaagucgcuaaagucggugaucucacuauccuugucuucggcuuuugcucucucggcuaucaucuaagcaggcgaguuccauggugaccggaacgacggcuacuggaguccaugaucgcaagcgucgggcugggguaaaagaggcucagcucauaauaguccgccccaccaguacgggacucgauaggccccgucguugccguagaaacgcaauuuuccucagacccacuauacgcaccucgauuuagcaugguuccgggguugcgcuuugagaaucauacguaaggaucggaaccuaggaaugcaccacagaacuuugaaauacuagaacaaguugauugacaacggaguaucggcgccccacauuuaacgaauaauugcaggcgccagacgaugcuaggugcguccguaucaagauucgaggucgcuacuggcuucgcuugccgaucgagcucagaguuugugagaguuguuacuaauugcguggucgccuaauauccuugauacuacguggguguacuagacaucccggacagaaaaucucuuaaacgcuagaguucucuuggaagcgccugcacuucuugugaacauacgaugauagccacucuaagcccaacgcacuucgcuuggcccacauugcccccagagcuuauucaucgacaggcguuccacucuuggauucaucaguaaacuuuauuauacgugguaagcgugcuuauagcugucggaaucucacuuaggcggauugaagugagacagccugaaaguaaccguguacaggcgccgucaauguguuuugagugugcaccuacaaaaaguguuauuuaggcaggggagcuuuguaguuucuuuagaagagccgcgaaugaaccaacgguagacugcgagcgcguucaaccuaau"
I want to splice the RNA and want to extract two lists (exons and introns). The key is that the intron section of RNA starts with gu and ends with ag. However, if ag appears before gu, it is a part of the exon and not the intron.
def splice(sequence):
    introns = list()
    exons = list()
    while(sequence.count("gu")):
        if "gu" not in sequence:
            break
        else:
            exons.append(sequence[:sequence.find("gu")])
            sequence = sequence[sequence.find("gu"):]
        if "ag" not in sequence:
            break
        else:
            introns.append(sequence[:sequence.find("ag")+2])
            sequence = sequence[sequence.find("ag")+2:]
    return introns, exons
This is what I have so far. It works for most of the sequence, but the problem appears at the end, when gu occurs without an ag in the remaining string.
Output:
Exons:
['ac',
'agaagccuuaauauauucaaaaagcuacgccucagauuucgcgcucgagcccaaaacaacug',
'ucgcuaaa',
'caggcga',
'uccaugaucgcaagc',
'aggcucagcucauaaua',
'uacgggacucgauaggcccc',
'aaacgcaauuuuccucagacccacuauacgcaccucgauuuagcaug',
'aaucauac',
'gaucggaaccuaggaaugcaccacagaacuuugaaauacuagaacaa',
'uaucggcgccccacauuuaacgaauaauugcaggcgccagacgaugcuag',
'auucgag',
'cucaga',
'a',
'acaucccggacagaaaaucucuuaaacgcuaga',
'cgccugcacuucuu',
'ccacucuaagcccaacgcacuucgcuuggcccacauugcccccagagcuuauucaucgacaggc',
'uaaacuuuauuauac',
'c',
'cu',
'gcggauugaa',
'acagccugaaa',
'gcgcc',
'u',
'u',
'gcaggggagcuuu',
'uuucuuuagaagagccgcgaaugaaccaacg',
'acugcgagcgc']
Introns:
['guccgcaag',
'guguacggguugaucacaucaaaugaag',
'gucggugaucucacuauccuugucuucggcuuuugcucucucggcuaucaucuaag',
'guuccauggugaccggaacgacggcuacuggag',
'gucgggcugggguaaaag',
'guccgccccaccag',
'gucguugccguag',
'guuccgggguugcgcuuugag',
'guaag',
'guugauugacaacggag',
'gugcguccguaucaag',
'gucgcuacuggcuucgcuugccgaucgag',
'guuugugag',
'guuguuacuaauugcguggucgccuaauauccuugauacuacguggguguacuag',
'guucucuuggaag',
'gugaacauacgaugauag',
'guuccacucuuggauucaucag',
'gugguaag',
'gugcuuauag',
'gucggaaucucacuuag',
'gugag',
'guaaccguguacag',
'gucaauguguuuugag',
'gugcaccuacaaaaag',
'guuauuuag',
'guag',
'guag']
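I fixed the problem by using regular expressions.
import re

def splice(gene_sequence):
    # Lazily match the shortest run that starts with "gu" and ends with "ag".
    regex = r"gu\w*?ag"
    introns = re.findall(regex, gene_sequence)
    # Everything between the introns is exon sequence; drop empty pieces.
    exons = [piece for piece in re.split(regex, gene_sequence) if piece]
    return introns, exons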

Ruby Regular Expression for Parcelify (Shopify)

I'm trying to write a Regex in Ruby for a shipping query.
If postcodes match MK1 - MK10, MK19, MK43, MK46 or MK77, then allow it.
If postcodes match NN1 - NN7, NN12, NN13, NN29 or NN77, then allow it.
If postcodes match MK11 - MK18 then don't allow it.
My trouble is that in the UK our postcodes are a bit funny where you can put MK1 1TS and MK11TS and they're considered the same. By not allowing MK11, MK11TY could be misread as MK11.
I've written a regex below, and so far it will disallow MK111TS and MK11\s1TS, and allow MK1\s1TS but not MK11TS. Any help would be greatly appreciated, I've only tested this for MK11 so far.
^((?!MK11\d).)*$&^((?!MK11\s\d).)*$|(MK(1 |2 |3 |4 |5 |6 |7 |8 |9 |10 ))|(MK19)|(MK43)|(MK46)|(MK77)|(NN1)|(NN2)|(NN3)|(NN4)|(NN5)|(NN6)|(NN7)|(NN12)|(NN13)|(NN29)|(NN77)
Thanks in advance.
r = /
(?: # begin non-capture group
MK # match characters
(?:1|2|3|4|5|6|7|8|9|10|19|43|46|77) # match one of the choices
| # or
NN # match characters
(?:1|2|3|4|5|6|7|12|13|29|77) # match one of the choices
) # end non-capture group
(?![^\sA-Z]) # not followed by a char other than whitespace or A-Z
/ix # case indifferent and free-spacing
# regex definition mode
This is conventionally written
r = /(?:MK(?:1|2|...|10|19|...|77)|NN(?:1|2|...|7|12|13|29|77))(?![^\sA-Z])/i
"MK4 abc def MK11MK19ghi NN6 jkl NN13 NN29NN77".scan(r)
# => ["MK4", "NN6", "NN13", "NN29", "NN77"]
"MK11" is not matched because "11" is not in the list. "MK19" is not matched because it is followed by a character that is neither a space nor a capital letter.
Alternatively, one could write
s = (['MK'].product(%w{1 2 3 4 5 6 7 8 9 10 19 43 46 77}).map(&:join) +
['NN'].product(%w{1 2 3 4 5 6 7 12 13 29 77}).map(&:join)).join('|')
# => "MK1|MK2|...|MK10|MK19|MK43|MK46|MK77|NN1|NN2|...|NN7|NN12|NN13|NN29|NN77"
r = /(?:#{s})(?![^\sA-Z])/i
#=> /(?:MK1|MK2|...|MK10|MK19|...|MK77|NN1|NN2|...|NN7|NN12|NN13|NN29|NN77)(?![^\sA-Z])/i
If the remainder of the postal code is to be included in the regex, perhaps something like the following could be done.
suffixes = %w|ES AB CD EF|.join('|')
#=> "ES|AB|CD|EF"
Then replace (?![^\sA-Z])/ix with the following.
\s? # optionally match a space
(?:#{suffixes}) # match a valid suffix in a non-capture group
(?!\S) # do not match a non-whitespace char (negative lookahead)
/ix # case-indifferent and free-spacing regex definition mode
Note the negative lookahead is satisfied if the suffix is at the end of the string.
Now I have written the following to match the postcodes format exactly:
#format: area code, accepted localities, whitespace (or not), any digit, any single letter, any single letter
((MK|mk|Mk|mK)(?:1|2|3|4|5|6|7|8|9|10|19|43|46|77)\s\d[A-Za-z][A-Za-z]) #with whitespace
|
((MK|mk|Mk|mK)(?:1|2|3|4|5|6|7|8|9|10|19|43|46|77)\d[A-Za-z][A-Za-z]) #without whitespace
|
((NN|nn|Nn|nN)(?:1|2|3|4|5|6|7|12|13|29|77)\s\d[A-Za-z][A-Za-z]) #with whitespace
|
((NN|nn|Nn|nN)(?:1|2|3|4|5|6|7|12|13|29|77)\d[A-Za-z][A-Za-z]) #without whitespace
This works for my purposes. I got here using Cary's answer, which has been extremely helpful. Thank you; I have upvoted it.
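The MK1 1TS versus MK11TS ambiguity from the question can also be handled outside the regex. The sketch below is not the method used in the answers above: it relies on the fact that the second half of a UK postcode is always one digit followed by two letters, so whatever precedes those three characters is the district. The names ALLOWED_DISTRICTS and deliverable? are hypothetical, and the pattern assumes districts shaped like the MK and NN ones here (letters followed by one or two digits).

```ruby
# Districts taken from the rules in the question.
ALLOWED_DISTRICTS = (
  %w[1 2 3 4 5 6 7 8 9 10 19 43 46 77].map { |n| "MK#{n}" } +
  %w[1 2 3 4 5 6 7 12 13 29 77].map { |n| "NN#{n}" }
).freeze

# Hypothetical helper: true when the postcode's district is on the list.
def deliverable?(postcode)
  compact = postcode.upcase.gsub(/\s+/, "")
  # Group 1 is the district; group 2 is the trailing digit plus two letters.
  # Backtracking gives the last three characters to group 2.
  match = compact.match(/\A([A-Z]{1,2}\d{1,2})(\d[A-Z]{2})\z/)
  return false unless match

  ALLOWED_DISTRICTS.include?(match[1])
end

["MK1 1TS", "MK11TS", "MK11 1TS", "MK111TS", "nn12 9ab"].each do |code|
  puts "#{code}: #{deliverable?(code)}"
end
```

Stripping whitespace first means the spaced and unspaced spellings are treated identically, which avoids maintaining separate "with whitespace" and "without whitespace" alternatives.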

Match Multiple Patterns in a String and Return Matches as Hash

I'm working with some log files, trying to extract pieces of data.
Here's an example of a file which, for the purposes of testing, I'm loading into a variable named sample. NOTE: The column layout of the log files is not guaranteed to be consistent from one file to the next.
sample = "test script result
Load for five secs: 70%/50%; one minute: 53%; five minutes: 49%
Time source is NTP, 23:25:12.829 UTC Wed Jun 11 2014
D
MAC Address IP Address MAC RxPwr Timing I
State (dBmv) Offset P
0000.955c.5a50 192.168.0.1 online(pt) 0.00 5522 N
338c.4f90.2794 10.10.0.1 online(pt) 0.00 3661 N
990a.cb24.71dc 127.0.0.1 online(pt) -0.50 4645 N
778c.4fc8.7307 192.168.1.1 online(pt) 0.00 3960 N
"
Right now, I'm just looking for IPv4 and MAC address; eventually the search will need to include more patterns. To accomplish this, I'm using two regular expressions and passing them to Regexp.union
patterns = Regexp.union(/(?<mac_address>\h{4}\.\h{4}\.\h{4})/, /(?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/)
As you can see, I'm using named groups to identify the matches.
The result I'm trying to achieve is a Hash. The key should equal the capture group name, and the value should equal what was matched by the regular expression.
Example:
{"mac_address"=>"0000.955c.5a50", "ip_address"=>"192.168.0.1"}
{"mac_address"=>"338c.4f90.2794", "ip_address"=>"10.10.0.1"}
{"mac_address"=>"990a.cb24.71dc", "ip_address"=>"127.0.0.1"}
{"mac_address"=>"778c.4fc8.7307", "ip_address"=>"192.168.1.1"}
Here's what I've come up with so far:
sample.split(/\r?\n/).each do |line|
  hashes = []
  line.split(/\s+/).each do |val|
    match = val.match(patterns)
    if match
      hashes << Hash[match.names.zip(match.captures)].delete_if { |k,v| v.nil? }
    end
  end
  results = hashes.reduce({}) { |r,h| h.each {|k,v| r[k] = v}; r }
  puts results if results.length > 0
end
```
I feel like there should be a more "elegant" way to do this. My chief concern, though, is performance.
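One possible simplification, offered as a sketch rather than a benchmarked answer, is to keep the patterns separate and match each one against the whole line, merging the named captures into a single hash. The names PATTERNS and extract_fields are hypothetical, the sample is a shortened copy of the log above, and MatchData#named_captures requires Ruby 2.4 or later. Whether this is faster than splitting every line into tokens would need to be measured on real files.

```ruby
# One regex per field; add more entries as new fields are needed.
PATTERNS = [
  /(?<mac_address>\h{4}\.\h{4}\.\h{4})/,
  /(?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/
].freeze

# Hypothetical helper: collect the first match of every pattern in a line.
def extract_fields(line)
  PATTERNS.each_with_object({}) do |pattern, fields|
    match = pattern.match(line)
    fields.merge!(match.named_captures) if match
  end
end

sample = <<~LOG
  Time source is NTP, 23:25:12.829 UTC Wed Jun 11 2014
  0000.955c.5a50 192.168.0.1 online(pt) 0.00 5522 N
  338c.4f90.2794 10.10.0.1 online(pt) 0.00 3661 N
LOG

# Keep only the lines that produced at least one field.
results = sample.each_line.map { |line| extract_fields(line) }.reject(&:empty?)
results.each { |fields| puts fields }
```

Because each pattern is matched independently, the result does not depend on the column order, which fits the note that the layout may differ between files.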

multiline matching with ruby

I have a string variable with multiple lines: e.g.
"SClone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"
I want to get both of the lines that start with "Seq_vec SVEC" and extract the integer values that follow.
string = "Clone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"
seqvector = Regexp.new("Seq_vec\\s+SVEC\\s+(\\d+\\s+\\d+)",Regexp::MULTILINE )
vector = string.match(seqvector)
if vector
vector_start,vector_stop = vector[1].split(/ /)
puts vector_start.to_i
puts vector_stop.to_i
end
However, this only grabs the first match's values and not the second, as I would like.
Any ideas what i could be doing wrong?
Thank you
To capture groups use String#scan
vector = string.scan(seqvector)
=> [["1 65"], ["102 1710"]]
match finds just the first match. To find all matches use String#scan e.g.
string.scan(seqvector)
=> [["1 65"], ["102 1710"]]
or to do something with each match:
string.scan(seqvector) do |match|
# match[0] will be the substring captured by your first regexp grouping
puts match.inspect
end
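To get the start and stop values as integers directly, each number can be given its own capture group. This is a small sketch under that assumption; the variable name ranges is illustrative, and Regexp::MULTILINE is left out because in Ruby it only changes whether . matches a newline, which this pattern does not rely on.

```ruby
string = "Clone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"

# ^ anchors at the start of each line in Ruby, so every Seq_vec line is found.
ranges = string.scan(/^Seq_vec\s+SVEC\s+(\d+)\s+(\d+)/).map do |start, stop|
  [start.to_i, stop.to_i]
end

ranges.each do |vector_start, vector_stop|
  puts "#{vector_start}..#{vector_stop}"
end
```

With two groups, scan returns one two-element array per matching line, so no further splitting on spaces is needed.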
Just to make this a bit easier to handle, I would split the whole string into an array first and then would do:
string = "SClone VARPB63A\nSeq_vec SVEC 1 65 pCR2.1-topo\nSequencing_vector \"pCR2.1-topo\"\nSeq_vec SVEC 102 1710 pCR2.1-topo\nClipping QUAL 46 397\n"
selected_strings = string.split("\n").select{|x| /Seq_vec SVEC/.match(x)}
selected_strings.collect{|x| x.scan(/\s\d+/)}.flatten # => [" 1", " 65", " 102", " 1710"]

Resources