Ruby Regex matching unexpected characters - ruby

I am trying to write a script that parses filename of a comicbook and tries to extract info such as Seriesname, Publication year etc.In this case, I am trying to extract publication year from the name. Consider the following name, I would need to match and get value 2003. Below is the expression I had for this.
r = %r{ (?i)(^|[,\s-_])v(\d{4})($|[,\s-_]) }
However this matches the number irrespective of what character I have before the v or after the number
I expect the first two to not match and the third to match.
010 - All Star Batman & Robin The Boy Wonder 01 - av2003
010 - All Star Batman & Robin The Boy Wonder 01 - v2003t
010 - All Star Batman & Robin The Boy Wonder 01 - v2003
What am I doing wrong in this case?

Inside character classes (ie. []s) the - character has a special meaning when it's between two other characters: it creates a range starting the character before and ending at the character after.
Here, you want it literally, so you should either escape the - or (more idiomatically in regex) put it as the first or last character in the [].
Also, btw, you have literal space characters, but no /x modifier, also you probably don't want to capture what's before and after the year, so the final pattern would be:
%r{(?i)(?:^|[,\s_-])v(\d{4})(?:$|[,\s_-])}

#smathy answered your question (rather nicely). I want to point out that you could write your regex without a capture group:
r = /
(?: # begin a non-capture group
^|[,\s_-] # match the beginning of the string, a ws char or char in ',_-'
) # end the non-capture group
v # match v
\K # forget everything matched so far
\d{4} # match 4 digits
(?= # begin a positive look-ahead
$|[,\s_-] # match the end of the string, a ws char or char in ',_-'
) # end positive lookahead
/x
"010 - All Star Batman & Robin The Boy Wonder 01 - av2003"[r]
#=> nil
"010 - All Star Batman & Robin The Boy Wonder 01 - v2003t"[r]
#=> nil
"010 - All Star Batman & Robin The Boy Wonder 01 - v2003"[r]
#=> "2003
If you wish to match v or V, change the line v to [vV].
If you wish the regex to be case independent, change /x to /ix (in which case there is no need to replace v with [vV]).
If you wish to ensure the publication date is (say) in the 20th or 21st century, change \d{4} to [12]\d{3}.
You could alternatively change the non-capture group to a positive lookbehind ((?<=^|[,\s_-])) and delete \K.

Related

ignore a specific \n character while still enabling the m flag

I want to match characters across multiple lines so I enabled the m flag. However, I do not want to match a specific \n. Instead I want to match a space \s only. But it seems like the newline is matching spaces too:
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+\s.+,.+,.+\d+)/m
=> 0
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+[ ].+,.+,.+\d+)/m
=> 3
Even I try to explicitly ignore the newline:
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+[^\n].+,.+,.+\d+)/m
=> 0
Why is the newline matching a space character? And what can I do to ensure that it does not and still matches characters across multiple lines everywhere else?
The /\s(\d+[^\n].+,.+,.+\d+)/m pattern matches " 41\n6332 Hardin Rd, Bensalem, PA\n 19020" because when the regex engine gets to [^\n] after matching 41 with \d+ backtracking occurs: the regex engine tries to match the string differently since it encountered \n and the next char should be a different char. So, it steps back to \d+ and matches 4, and 1 is not a newline, so matching continues.
You may anchor the search at the start of the string and prevent backtracking with a possessive quantifier, also implementing the negative check with a lookahead:
/\A\s*(\d++(?!\n).+,.+,.+\d)/m
See the regex demo
Details
\A - start of string
\s* - 0+ whitespaces
(\d++(?!\n).+,.+,.+\d) - Capturing group 1:
\d++(?!\n) - 1+ digits (matched possessively with ++ quantifier) not followed with a newline (as (?!\n) is a negative lookahead that fails the match if there is a newline immediately to the right of the current location)
.+,.+, - 2 occurrences of any 1+ chars as many as possible, followed with ,
.+\d - any 1+ chars as many as possible followed with a digit.

why is \d+ not matching all digits?

I have the following regular expression:
REGEX = /^.+(\d+.+(?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?).+/
I have the following string:
str = "fdsfd 8126 E Bowen AVE Bensalem, PA 19020-1642 dfdf"
Notice my capturing group begins with one or more digits that match the pattern. Yet this is what I get:
str =~ REGEX
$1
=> "6 E Bowen AVE Bensalem, PA 19020-1642"
Or
match = str.match(REGEX)
match[1]
=> "6 E Bowen AVE Bensalem, PA 19020-1642"
Why is it missing the first 3 digits of 812?
The below regex works properly, as you can see at Regex101
REGEX = /^.+?(\d+.+(?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?).+/
Note the addition of the question mark near the beginning of the regex
/^.+?(\d+...
^
By default, your first .+ is being greedy, consuming all digits it can, and still allowing the regex pass. By adding ? after the plus, you can make it lazy instead of greedy.
An alternative would be to not capture digits, like this:
/^[^\d]+(\d+...
[^\d]+ will capture everything except for digits.

How do I insert a space before a number in a string if it matches a regular expression?

I want to insert a space before a number in a string if my string has a capital letter, one or more lower case letters, and then a number right after it. That is, if I have
Bdefg23
I want to insert a space between the "g" and the "23" making the above string
Bdefg 23
So this string would not get changed
BabcdD55
because there is a capital letter before "55". I tried this below
str.split(/([A-Z][a-z]+)/).delete_if(&:empty?).join(' ')
but it works too well. That is, if my string is
Ptannex..
it will turn it into
Ptannex ..
How can I adjust what I have to make it work for only the condition I outlined? Btw, I'm using Ruby 2.4.
You can always do it roughly this way:
%w[
Bdefg23
Ptannex95
BigX901
littleX902
CS101
xx900
].each do |example|
puts '%s -> %s' % [
example,
example.sub(/\A([A-Z][a-z]+)(\d+)\z/, '\1 \2')
]
end
Which gives you output like:
Bdefg23 -> Bdefg 23
Ptannex95 -> Ptannex 95
BigX901 -> BigX901
littleX902 -> littleX902
CS101 -> CS101
xx900 -> xx900
You may use
s.sub(/\A(\p{Lu}\p{L}*\p{Ll})(\d+)\z/, '\1 \2')
See the Ruby demo.
Details
\A - start of string
(\p{Lu}\p{L}*\p{Ll}) - Group 1:
\p{Lu} - any Unicode uppercase letter
\p{L}* - any 0+ Unicode letters
\p{Ll} - any lowercase Unicode letter
(\d+) - Group 2: one or more digits
\z - end of string.
The \1 \2 replacement pattern replaces the whole match with the contents of Group 1, then a space, and then the contents of Group 2.

Matching repeated pattern in string

I have street names and numbers in a file, like so:
Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29
I parse the lines one by one with regex. I want a regex that will find and match:
The name of the street,
The street numbers with its possible a,b,c,d attached.
I've come up with this mean while:
/(\D{2,})\s+(\d{1,3}[a-d|א-ד]?)(?:[,\s]{1,3})?/
It finds the street name and first number. I need to find all the numbers.
I don't want to use two separate regex's if possible, and I prefer not to use Ruby's scan but just have it in one regex.
You can use regex to find all the numbers, with their separators:
re = /\A(.+?)\s+((?:\d+[a-z]*[,\s]+)*\d+[a-z]*)/
txt = "Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29"
matches = txt.lines.map{ |line| line.match(re).to_a[1..-1] }
p matches
#=> [["Sokolov", "19, 20, 23 ,25"],
#=> ["Hertzl", "80,82,84,86"],
#=> ["Hertzl", "80a,82b,84e,90"],
#=> ["Aba Hillel Silver", "2,3,5,6"],
#=> ["Weizman", "8"],
#=> ["Ahad Ha'am", "9 13 29"]]
The above regex says:
\A Starting at the front of the string
(…) Capture the result
.+? Find one or more characters, as few as possible that make the rest of this pattern match.
\s+ Followed by one or more whitespace characters (which we don't capture)
(…) Capture the result
(?:…)* Find zero or more of what's in here, but don't capture them
\d+ One or more digits (0–9)
[a-z]* Zero or more lowercase letters
[,\s]+ One or more commas and/or whitespace characters
\d+ Followed by one or more digits
[a-z]* And zero or more lowercase letters
However, if you want to break the number up into pieces you will need to use scan or split or the equivalent.
result = matches.map{ |name,numbers| [name,numbers.scan(/[^,\s]+/)] }
p result
#=> [["Sokolov", ["19", "20", "23", "25"]],
#=> ["Hertzl", ["80", "82", "84", "86"]],
#=> ["Hertzl", ["80a", "82b", "84e", "90"]],
#=> ["Aba Hillel Silver", ["2", "3", "5", "6"]],
#=> ["Weizman", ["8"]],
#=> ["Ahad Ha'am", ["9", "13", "29"]]]
This is because regex captures inside a repeating group do not capture each repetition. For example:
re = /((\d+) )+/
txt = "hello 11 2 3 44 5 6 77 world"
p txt.match(re)
#=> #<MatchData "11 2 3 44 5 6 77 " 1:"77 " 2:"77">
The whole regex matches the whole string, but each capture only saves the last-seen instance. In this case, the outer capture only gets "77 " and the inner capture only gets "77".
Why do you prefer not to use scan? This is what it is made for.
If you want your 3rd example to work, you need to have the [a-d] change to include the e in the range. After changing that you can use (\D{2,})\s+(\d{1,3}[a-e]?(?:[,\s]{1,3})*)*. Using the examples you gave I did some testing using Rubular.
Using some more groupings you can have the repetition on those last few conditions (which seem to be pretty tricky. This way the spacing and comma at the end will get caught in the repetition after consuming the space initially.
The only way around the limitation that you can only capture the last instance of a repeated expression is to write your regex for a single instance and let the regex machine do the repeating for you, as occurs with the global substitute options, admittedly similar to scan. Unfortunately, in that case, you have to match for either the street name or the street number and then have no way to easily associate the captured numbers with the captured names.
Regex is great at what it does, but when you try to extend its application beyond it's natural limitations, it's not pretty. ;-)
I want a regex that will find and match....
Do the street names also contain digits (0-9), other characters beside an apostrophe?
Are the street numbers based off arbitrary data? Is it always just an optional a, b, c, or d?
Are you needing a minimum and maximum limitation of string length?
Here are some possible options:
If you are unsure about what the street name contains, but know your street number pattern will be numbers with an optional letter, commas or spaces.
/^(.*?)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/
See working demo
If the street names contain only letters with optional apostrophe's and the street numbers contain numbers with an optional letter, comma.
/^([a-zA-Z' ]+)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/
See working demo
If your street name and street number pattern are always consistant, you could easily do.
/^([a-zA-Z' ]+)\s+([0-9a-z, ]+)$/
See working demo

How to insert tag every 5 characters in a Ruby String?

I would like to insert a <wbr> tag every 5 characters.
Input: s = 'HelloWorld-Hello guys'
Expected outcome: Hello<wbr>World<wbr>-Hell<wbr>o guys
s = 'HelloWorld-Hello guys'
s.scan(/.{5}|.+/).join("<wbr>")
Explanation:
Scan groups all matches of the regexp into an array. The .{5} matches any 5 characters. If there are characters left at the end of the string, they will be matched by the .+. Join the array with your string
There are several options to do this. If you just want to insert a delimiter string you can use scan followed by join as follows:
s = '12345678901234567'
puts s.scan(/.{1,5}/).join(":")
# 12345:67890:12345:67
.{1,5} matches between 1 and 5 of "any" character, but since it's greedy, it will take 5 if it can. The allowance for taking less is to accomodate the last match, where there may not be enough leftovers.
Another option is to use gsub, which allows for more flexible substitutions:
puts s.gsub(/.{1,5}/, '<\0>')
# <12345><67890><12345><67>
\0 is a backreference to what group 0 matched, i.e. the whole match. So substituting with <\0> effectively puts whatever the regex matched in literal brackets.
If whitespaces are not to be counted, then instead of ., you want to match \s*\S (i.e. a non whitespace, possibly preceded by whitespaces).
s = '123 4 567 890 1 2 3 456 7 '
puts s.gsub(/(\s*\S){1,5}/, '[\0]')
# [123 4 5][67 890][ 1 2 3 45][6 7]
Attachments
Source code and output on ideone.com
References
regular-expressions.info
Finite Repetition, Greediness
Character classes
Grouping and Backreferences
Dot Matches (Almost) Any Character
Here is a solution that is adapted from the answer to a recent question:
class String
def in_groups_of(n, sep = ' ')
chars.each_slice(n).map(&:join).join(sep)
end
end
p 'HelloWorld-Hello guys'.in_groups_of(5,'<wbr>')
# "Hello<wbr>World<wbr>-Hell<wbr>o guy<wbr>s"
The result differs from your example in that the space counts as a character, leaving the final s in a group of its own. Was your example flawed, or do you mean to exclude spaces (whitespace in general?) from the character count?
To only count non-whitespace (“sticking” trailing whitespace to the last non-whitespace, leaving whitespace-only strings alone):
# count "hard coded" into regexp
s.scan(/(?:\s*\S(?:\s+\z)?){1,5}|\s+\z/).join('<wbr>')
# parametric count
s.scan(/\s*\S(?:\s+\z)?|\s+\z/).each_slice(5).map(&:join).join('<wbr>')

Resources