I have the following regular expression:
REGEX = /^.+(\d+.+(?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?).+/
I have the following string:
str = "fdsfd 8126 E Bowen AVE Bensalem, PA 19020-1642 dfdf"
Notice my capturing group begins with one or more digits that match the pattern. Yet this is what I get:
str =~ REGEX
$1
=> "6 E Bowen AVE Bensalem, PA 19020-1642"
Or
match = str.match(REGEX)
match[1]
=> "6 E Bowen AVE Bensalem, PA 19020-1642"
Why is it missing the first 3 digits of 812?
The below regex works properly, as you can see at Regex101
REGEX = /^.+?(\d+.+(?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?).+/
Note the addition of the question mark near the beginning of the regex
/^.+?(\d+...
^
By default, your first .+ is greedy: it consumes as many characters as it can, including the leading digits, while still allowing the rest of the regex to match. By adding ? after the plus, you make it lazy instead of greedy.
An alternative would be to not let the prefix match digits at all, like this:
/^[^\d]+(\d+...
[^\d]+ (equivalent to \D+) matches one or more non-digit characters, so it cannot consume digits that belong to the capturing group.
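The difference is easier to see on a shortened pattern. The following is only a sketch: it drops the state lookahead and the ZIP code from the regex above, and address_with is a made-up helper name used for illustration.

```ruby
# Hypothetical helper: builds a simplified address regex whose prefix
# (the part that consumes the text before the street number) is configurable.
def address_with(prefix)
  /^#{prefix}(\d+ .+)/
end

str = "fdsfd 8126 E Bowen AVE"

puts str[address_with('.+'), 1]   # greedy prefix:    6 E Bowen AVE
puts str[address_with('.+?'), 1]  # lazy prefix:      8126 E Bowen AVE
puts str[address_with('\D+'), 1]  # non-digit prefix: 8126 E Bowen AVE
```

The greedy prefix gives back only as much as it must, so a single digit is left for \d+; the lazy and non-digit prefixes stop before the first digit.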
I want to match characters across multiple lines so I enabled the m flag. However, I do not want to match a specific \n. Instead I want to match a space \s only. But it seems like the newline is matching spaces too:
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+\s.+,.+,.+\d+)/m
=> 0
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+[ ].+,.+,.+\d+)/m
=> 3
Even when I try to explicitly exclude the newline:
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+[^\n].+,.+,.+\d+)/m
=> 0
Why is the newline matching a space character? And what can I do to ensure that it does not and still matches characters across multiple lines everywhere else?
The /\s(\d+[^\n].+,.+,.+\d+)/m pattern matches " 41\n6332 Hardin Rd, Bensalem, PA\n 19020" because backtracking occurs when the regex engine reaches [^\n] after matching 41 with \d+. The next char is \n, which [^\n] cannot match, so the engine steps back into \d+, matches only 4, and lets [^\n] match 1. Since 1 is not a newline, matching continues.
You may anchor the search at the start of the string and prevent backtracking with a possessive quantifier, also implementing the negative check with a lookahead:
/\A\s*(\d++(?!\n).+,.+,.+\d)/m
Details
\A - start of string
\s* - 0+ whitespaces
(\d++(?!\n).+,.+,.+\d) - Capturing group 1:
\d++(?!\n) - 1+ digits (matched possessively with ++ quantifier) not followed with a newline (as (?!\n) is a negative lookahead that fails the match if there is a newline immediately to the right of the current location)
.+,.+, - 2 occurrences of any 1+ chars as many as possible, followed with ,
.+\d - any 1+ chars as many as possible followed with a digit.
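To see the effect of the possessive quantifier in isolation, here is a small sketch. It assumes an unanchored adaptation that keeps the leading \s from the question (so it is not the exact pattern above), and offset_and_capture is a made-up helper name.

```ruby
# Hypothetical helper: returns [match offset, capture group 1], or nil.
def offset_and_capture(pattern, text)
  m = pattern.match(text)
  m && [m.begin(0), m[1]]
end

str = " 41\n6332 Hardin Rd, Bensalem, PA\n 19020"

# \d+ can give back the "1", so [^\n] matches it and the match starts at 0.
p offset_and_capture(/\s(\d+[^\n].+,.+,.+\d+)/m, str)
# [0, "41\n6332 Hardin Rd, Bensalem, PA\n 19020"]

# \d++ never gives back digits, so offset 0 fails and the match starts at the newline.
p offset_and_capture(/\s(\d++(?!\n).+,.+,.+\d+)/m, str)
# [3, "6332 Hardin Rd, Bensalem, PA\n 19020"]
```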
I am trying to find a right regex expression to select substrings between another substring, which I'd like to exclude. For example in this string:
11 - 12£ in $ + 13
I want to select 12£ and $. Basically, it's substrings around in, until I hit an array of values I want to use as end/start, in this case, arithmetic operators %w(+ - / *)
So far closest I got was using this regex /(.\d\p{Sc})\sin\s(\p{Sc})/
Some more examples:
10 - 12$ in £ - 13$ should return 12$ and £
12 $ in £ should return 12$ and £
100£in$ should return 100£ and $
sentence.match(/[^-+*\/]*in[^-+*\/]*/).to_s.strip.split(/ *in */)
[^-+*\/]* matches zero or more characters that are not arithmetic operators
this will hence get everything between the "opening" and the "closing" operator that surround an in
#strip removes the leading and trailing whitespace
finally, #split breaks the result into two strings, removing in and the spaces around it
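As a quick check, the one-liner can be wrapped in a method and run against the examples from the question. The method name operands_around_in is made up for this sketch.

```ruby
# Hypothetical helper wrapping the one-liner above.
def operands_around_in(sentence)
  sentence.match(/[^-+*\/]*in[^-+*\/]*/).to_s.strip.split(/ *in */)
end

p operands_around_in('11 - 12£ in $ + 13')   # ["12£", "$"]
p operands_around_in('10 - 12$ in £ - 13$')  # ["12$", "£"]
p operands_around_in('100£in$')              # ["100£", "$"]
p operands_around_in('12 $ in £')            # ["12 $", "£"]
```

Note that the last case keeps the inner space ("12 $" rather than "12$"), since only the whitespace around in and at both ends is removed.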
r = /
\s+[+*\/-]\s+ # match 1+ whitespaces, 1 char in char class, 1+ whitespaces
(\S+) # match 1+ non-whitespaces in capture group 1
\s+in\s+ # match 1+ whitespaces, 'in', 1+ whitespaces
(\S+) # match 1+ non-whitespaces in capture group 2
\s+[+*\/-]\s+ # match 1+ whitespaces, 1 char in char class, 1+ whitespaces
/x # free-spacing regex definition mode
str = '11 - 12£ in $ + 13 / 13F in % * 4'
str.scan(r)
#=> [["12£", "$"], ["13F", "%"]]
See the doc for String#scan to see how scan handles capture groups.
Note that '-' must be first or last in the character class [+*\/-].
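For readers unfamiliar with how String#scan treats capture groups, here is a minimal sketch using two simplified patterns of my own (they are not the regex above): without groups, scan returns whole matches; with groups, it returns one array of captures per match.

```ruby
str = '11 - 12£ in $ + 13 / 13F in % * 4'

# No capture groups: each element is the whole match.
whole_matches = str.scan(/\S+\s+in\s+\S+/)
p whole_matches  # ["12£ in $", "13F in %"]

# Capture groups: each element is an array of the groups' contents.
captured_pairs = str.scan(/(\S+)\s+in\s+(\S+)/)
p captured_pairs  # [["12£", "$"], ["13F", "%"]]
```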
I am trying to write a script that parses the filename of a comic book and tries to extract info such as the series name, publication year, etc. In this case, I am trying to extract the publication year from the name. Consider the names listed below, from which I need to match and get the value 2003. This is the expression I have for it:
r = %r{ (?i)(^|[,\s-_])v(\d{4})($|[,\s-_]) }
However this matches the number irrespective of what character I have before the v or after the number
I expect the first two to not match and the third to match.
010 - All Star Batman & Robin The Boy Wonder 01 - av2003
010 - All Star Batman & Robin The Boy Wonder 01 - v2003t
010 - All Star Batman & Robin The Boy Wonder 01 - v2003
What am I doing wrong in this case?
Inside character classes (i.e. [...]) the - character has a special meaning when it's between two other characters: it creates a range starting at the character before it and ending at the character after it.
Here, you want it literally, so you should either escape the - or (more idiomatically in regex) put it as the first or last character in the [].
Also, you have literal space characters in the pattern but no /x modifier, and you probably don't want to capture what comes before and after the year, so the final pattern would be:
%r{(?i)(?:^|[,\s_-])v(\d{4})(?:$|[,\s_-])}
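A short sketch to check this pattern against the three filenames from the question. The method name publication_year is made up for illustration.

```ruby
# Hypothetical helper: returns the year as a string, or nil if none is found.
def publication_year(filename)
  filename[%r{(?i)(?:^|[,\s_-])v(\d{4})(?:$|[,\s_-])}, 1]
end

base = '010 - All Star Batman & Robin The Boy Wonder 01 - '

p publication_year(base + 'av2003')  # nil
p publication_year(base + 'v2003t')  # nil
p publication_year(base + 'v2003')   # "2003"
p publication_year(base + 'V2003')   # "2003" (because of (?i))
```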
@smathy answered your question (rather nicely). I want to point out that you could write your regex without a capture group:
r = /
(?: # begin a non-capture group
^|[,\s_-] # match the beginning of the string, a ws char or char in ',_-'
) # end the non-capture group
v # match v
\K # forget everything matched so far
\d{4} # match 4 digits
(?= # begin a positive look-ahead
$|[,\s_-] # match the end of the string, a ws char or char in ',_-'
) # end positive lookahead
/x
"010 - All Star Batman & Robin The Boy Wonder 01 - av2003"[r]
#=> nil
"010 - All Star Batman & Robin The Boy Wonder 01 - v2003t"[r]
#=> nil
"010 - All Star Batman & Robin The Boy Wonder 01 - v2003"[r]
#=> "2003"
If you wish to match v or V, change the line v to [vV].
If you wish the regex to be case independent, change /x to /ix (in which case there is no need to replace v with [vV]).
If you wish to ensure the publication date is (say) in the 20th or 21st century, change \d{4} to (?:19|20)\d{2}, which accepts 1900 through 2099 ([12]\d{3} would accept every year from 1000 to 2999).
You could alternatively replace the non-capture group, v and \K with a positive lookbehind that includes the v ((?<=^v|[,\s_-]v)). A lookbehind without the v, followed by v, would leave the v in the match.
I'm trying to write a program that will take in a string and use RegEx to search for certain mathematical expressions, such as 1 * 3 + 4 / 2. Only operators to look for are [- * + /].
so far:
string = "something something nothing 1/ 2 * 3 nothing hello world"
a = /\d+\s*[\+ \* \/ -]\s*\d+/
puts a.match(string)
produces:
1/ 2
I want to grab the whole equation 1/ 2 * 3. I'm essentially brand new to the world of regex, so any help will be appreciated!
New Information:
a = /\s*-?\d+(?:\s*[-\+\*\/]\s*\d+)+/
Thank you to zx81 for his answer. I had to modify it to get it to work. For some reason, with ^ and $ in place, a.match(string) produces no output (or perhaps nil). Also, certain operators need a \ before them.
Version that works with parentheses (note the x flag, so that the spaces in the pattern are ignored):
a = /\(* \s* \d+ \s* (( [-\+\*\/] \s* \d+ \)* \s* ) | ( [-\+\*\/] \s* \(* \s* \d+ \s* ))+/x
Regex Calculators
First off, you might want to have a look at this question about Regex Calculators (both RPN and non-RPN version).
But we're not dealing with parentheses, so we can go with something like:
^\s*-?\d+(?:\s*[-+*/]\s*\d+)+$
Explanation
The ^ anchor asserts that we are at the beginning of the string
\s* allows optional spaces
-? allows an optional minus before the first digit
\d+ matches the first digits
The non-capturing group (?:\s*[-+*/]\s*\d+) matches optional spaces, an operator, optional spaces and digits
the + quantifier matches that one or more times
The $ anchor asserts that we are at the end of the string
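Because of the two anchors, this pattern only matches when the expression makes up the whole line. A small Ruby sketch of the difference, where find_expression is a made-up helper and the unanchored variant simply drops ^\s* and $:

```ruby
# Hypothetical helper: with whole_line: true it uses the anchored pattern
# above; otherwise it searches for an expression anywhere in the text.
def find_expression(text, whole_line: false)
  anchored   = %r{^\s*-?\d+(?:\s*[-+*/]\s*\d+)+$}
  unanchored = %r{-?\d+(?:\s*[-+*/]\s*\d+)+}
  text[whole_line ? anchored : unanchored]
end

string = "something something nothing 1/ 2 * 3 nothing hello world"

p find_expression('1 * 3 + 4 / 2', whole_line: true)  # "1 * 3 + 4 / 2"
p find_expression(string, whole_line: true)           # nil
p find_expression(string)                             # "1/ 2 * 3"
```

This is consistent with the observation in the question that the anchored version returned nil for a string with surrounding text.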
I would like to insert a <wbr> tag every 5 characters.
Input: s = 'HelloWorld-Hello guys'
Expected outcome: Hello<wbr>World<wbr>-Hell<wbr>o guys
s = 'HelloWorld-Hello guys'
s.scan(/.{5}|.+/).join("<wbr>")
Explanation:
scan collects all matches of the regexp into an array. .{5} matches any 5 characters (other than newlines). If fewer than 5 characters are left at the end of the string, they are matched by .+. Finally, join the array with your separator string.
There are several options to do this. If you just want to insert a delimiter string you can use scan followed by join as follows:
s = '12345678901234567'
puts s.scan(/.{1,5}/).join(":")
# 12345:67890:12345:67
.{1,5} matches between 1 and 5 of "any" character, but since it's greedy, it will take 5 if it can. The allowance for taking fewer is to accommodate the last match, where there may not be enough characters left over.
Another option is to use gsub, which allows for more flexible substitutions:
puts s.gsub(/.{1,5}/, '<\0>')
# <12345><67890><12345><67>
\0 is a backreference to what group 0 matched, i.e. the whole match. So substituting with <\0> effectively puts whatever the regex matched in literal brackets.
If whitespaces are not to be counted, then instead of ., you want to match \s*\S (i.e. a non whitespace, possibly preceded by whitespaces).
s = '123 4 567 890 1 2 3 456 7 '
puts s.gsub(/(\s*\S){1,5}/, '[\0]')
# [123 4 5][67 890][ 1 2 3 45][6 7]
Here is a solution that is adapted from the answer to a recent question:
class String
  def in_groups_of(n, sep = ' ')
    chars.each_slice(n).map(&:join).join(sep)
  end
end
p 'HelloWorld-Hello guys'.in_groups_of(5,'<wbr>')
# "Hello<wbr>World<wbr>-Hell<wbr>o guy<wbr>s"
The result differs from your example in that the space counts as a character, leaving the final s in a group of its own. Was your example flawed, or do you mean to exclude spaces (whitespace in general?) from the character count?
To only count non-whitespace (“sticking” trailing whitespace to the last non-whitespace, leaving whitespace-only strings alone):
# count "hard coded" into regexp
s.scan(/(?:\s*\S(?:\s+\z)?){1,5}|\s+\z/).join('<wbr>')
# parametric count
s.scan(/\s*\S(?:\s+\z)?|\s+\z/).each_slice(5).map(&:join).join('<wbr>')
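The parametric version can be wrapped in a small method. This is only a sketch, and the name insert_every is made up for illustration.

```ruby
# Hypothetical helper: inserts separator after every `count` non-whitespace
# characters. Whitespace sticks to the non-whitespace character after it,
# except trailing whitespace, which sticks to the last one.
def insert_every(text, count, separator)
  text.scan(/\s*\S(?:\s+\z)?|\s+\z/).each_slice(count).map(&:join).join(separator)
end

p insert_every('HelloWorld-Hello guys', 5, '<wbr>')
# "Hello<wbr>World<wbr>-Hell<wbr>o guys"
p insert_every('ab cd ef ', 2, ':')
# "ab: cd: ef "
```

The first call reproduces the outcome expected in the question, where the space in "o guys" is not counted.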