Finding all groups of contiguous words in string - ruby

I need to find all groups of two contiguous words in a string, but only words that are 2-3 characters long. So far I've come up with this:
'toolong fee fi fo fum toolong verylong aa bb'.scan(/\b[a-z]{2,3}\s+\b[a-z]{2,3}/)
=> ["fee fi", "fo fum", "aa bb"]
But I want something like this:
=> ["fee fi", "fi fo", "fo fum", "aa bb"]
Any help greatly appreciated.

You need to use a lookahead together with a capturing group in order to get overlapping matches.
> 'toolong fee fi fo fum toolong verylong aa bb'.scan(/(?=\b([a-z]{2,3}\s+[a-z]{2,3})\b)/)
=> [["fee fi"], ["fi fo"], ["fo fum"], ["aa bb"]]
> 'toolong fee fi fo fum toolong verylong aa bb'.scan(/\b(?=([a-z]{2,3}\s+[a-z]{2,3})\b)/).flatten
=> ["fee fi", "fi fo", "fo fum", "aa bb"]
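For comparison, the same pairs can be produced without a lookahead by splitting the string into words and walking consecutive pairs. This is only a sketch of an alternative, not part of the answer above: the method name short_word_pairs is made up for illustration, and unlike the scan version it normalizes the whitespace between the two words to a single space.

```ruby
# Hypothetical helper: returns every pair of adjacent words in which both
# words consist of 2-3 lowercase letters.
def short_word_pairs(str)
  str.split
     .each_cons(2) # sliding window over adjacent words
     .select { |pair| pair.all? { |w| w.match?(/\A[a-z]{2,3}\z/) } }
     .map { |pair| pair.join(' ') }
end

p short_word_pairs('toolong fee fi fo fum toolong verylong aa bb')
# => ["fee fi", "fi fo", "fo fum", "aa bb"]
```

Because each_cons yields overlapping windows, the overlap comes from the iteration rather than from the regex engine.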

The logical way is to consume the first 2-3 letter word, then look ahead for the next one.
Since you want both words together, you'd capture each one, then join them after each match.
\b([a-z]{2,3})(?=(\s+[a-z]{2,3})\b)
\b
( [a-z]{2,3} )        # (1)
(?=
  (                   # (2 start)
    \s+
    [a-z]{2,3}
  )                   # (2 end)
  \b
)
The next logical way (though not intuitive) is to look ahead for the combined two words, then consume the first one to advance the match position.
(?=\b(([a-z]{2,3})\s+[a-z]{2,3})\b)\2
This way lets you just grab group 1 without the need to join.
(?=
  \b
  (                   # (1 start)
    ( [a-z]{2,3} )    # (2)
    \s+
    [a-z]{2,3}
  )                   # (1 end)
  \b
)
\2
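A minimal sketch of how both patterns might be used from Ruby with String#scan. The variable names are illustrative only; the regexes are the two given above.

```ruby
str = 'toolong fee fi fo fum toolong verylong aa bb'

# Approach 1: consume the first word, capture the second inside the lookahead,
# then join groups 1 and 2 (group 2 already includes the separating whitespace).
consume_then_peek = /\b([a-z]{2,3})(?=(\s+[a-z]{2,3})\b)/
pairs_joined = str.scan(consume_then_peek).map(&:join)

# Approach 2: look ahead for the whole pair (group 1), then consume the first
# word through the backreference \2 so the next match starts at the second word.
peek_then_consume = /(?=\b(([a-z]{2,3})\s+[a-z]{2,3})\b)\2/
pairs_direct = str.scan(peek_then_consume).map(&:first)

p pairs_joined # => ["fee fi", "fi fo", "fo fum", "aa bb"]
p pairs_direct # => ["fee fi", "fi fo", "fo fum", "aa bb"]
```

When the pattern contains groups, scan returns one array of captures per match, which is why the first version joins the two captures and the second simply takes the first capture.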


why is \d+ not matching all digits?

I have the following regular expression:
REGEX = /^.+(\d+.+(?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?).+/
I have the following string:
str = "fdsfd 8126 E Bowen AVE Bensalem, PA 19020-1642 dfdf"
Notice my capturing group begins with one or more digits that match the pattern. Yet this is what I get:
str =~ REGEX
$1
=> "6 E Bowen AVE Bensalem, PA 19020-1642"
Or
match = str.match(REGEX)
match[1]
=> "6 E Bowen AVE Bensalem, PA 19020-1642"
Why is it missing the first 3 digits of 812?
The below regex works properly, as you can see at Regex101
REGEX = /^.+?(\d+.+(?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?).+/
Note the addition of the question mark near the beginning of the regex:
/^.+?(\d+...
    ^
By default, your first .+ is greedy: it consumes as much as it can, including the digits 812, while still letting the rest of the regex match (\d+ needs only one digit). Adding ? after the plus makes it lazy instead of greedy.
An alternative would be to stop the prefix from matching digits at all, like this:
/^[^\d]+(\d+...
[^\d]+ matches everything except digits.
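The difference is easier to see on a cut-down version of the problem. The sketch below uses a shortened string and simplified patterns (the state-code lookahead and ZIP part are left out), so it illustrates the greedy and lazy behaviour only, not the full address regex.

```ruby
str = 'fdsfd 8126 E Bowen AVE'

greedy  = /^.+(\d+.+)/     # .+ takes as much as it can, leaving \d+ one digit
lazy    = /^.+?(\d+.+)/    # .+? takes as little as it can, so \d+ gets 8126
negated = /^[^\d]+(\d+.+)/ # the prefix is not allowed to contain digits

p str[greedy, 1]  # => "6 E Bowen AVE"
p str[lazy, 1]    # => "8126 E Bowen AVE"
p str[negated, 1] # => "8126 E Bowen AVE"
```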

Ruby Regex matching unexpected characters

I am trying to write a script that parses the filename of a comic book and extracts info such as series name, publication year, etc. In this case, I am trying to extract the publication year from the name. Consider the names below; I need to match and get the value 2003. This is the expression I had for this:
r = %r{ (?i)(^|[,\s-_])v(\d{4})($|[,\s-_]) }
However this matches the number irrespective of what character I have before the v or after the number
I expect the first two to not match and the third to match.
010 - All Star Batman & Robin The Boy Wonder 01 - av2003
010 - All Star Batman & Robin The Boy Wonder 01 - v2003t
010 - All Star Batman & Robin The Boy Wonder 01 - v2003
What am I doing wrong in this case?
Inside character classes (i.e. []) the - character has a special meaning when it's between two other characters: it creates a range starting at the character before it and ending at the character after it.
Here you want it literally, so you should either escape the - or (more idiomatically in regex) put it as the first or last character in the [].
Also, you have literal space characters but no /x modifier, and you probably don't want to capture what's before and after the year, so the final pattern would be:
%r{(?i)(?:^|[,\s_-])v(\d{4})(?:$|[,\s_-])}
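A small sketch of the range behaviour and of the corrected pattern applied to the three filenames. The classes [,-_] and [,_-] are chosen purely for illustration, and the i flag is used here in place of the inline (?i), which is an assumption of equivalence for this pattern.

```ruby
range_class   = /[,-_]/ # '-' between ',' and '_' forms the range U+002C..U+005F
literal_class = /[,_-]/ # '-' placed last is a literal hyphen

p range_class.match?('A')   # => true, because 'A' lies inside the range
p literal_class.match?('A') # => false
p literal_class.match?('-') # => true

year = %r{(?:^|[,\s_-])v(\d{4})(?:$|[,\s_-])}i
names = [
  '010 - All Star Batman & Robin The Boy Wonder 01 - av2003',
  '010 - All Star Batman & Robin The Boy Wonder 01 - v2003t',
  '010 - All Star Batman & Robin The Boy Wonder 01 - v2003'
]
p names.map { |name| name[year, 1] } # => [nil, nil, "2003"]
```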
@smathy answered your question (rather nicely). I want to point out that you could write your regex without a capture group:
r = /
  (?:           # begin a non-capture group
    ^|[,\s_-]   # match the beginning of the string, a ws char or char in ',_-'
  )             # end the non-capture group
  v             # match v
  \K            # forget everything matched so far
  \d{4}         # match 4 digits
  (?=           # begin a positive look-ahead
    $|[,\s_-]   # match the end of the string, a ws char or char in ',_-'
  )             # end positive lookahead
/x
"010 - All Star Batman & Robin The Boy Wonder 01 - av2003"[r]
#=> nil
"010 - All Star Batman & Robin The Boy Wonder 01 - v2003t"[r]
#=> nil
"010 - All Star Batman & Robin The Boy Wonder 01 - v2003"[r]
#=> "2003"
If you wish to match v or V, change the line v to [vV].
If you wish the regex to be case independent, change /x to /ix (in which case there is no need to replace v with [vV]).
If you wish to ensure the publication date is (say) in the 20th or 21st century, change \d{4} to [12]\d{3}.
You could alternatively replace the non-capture group with a positive lookbehind ((?<=^|[,\s_-])) and delete \K, though the v then becomes part of the match unless it is moved inside the lookbehind as well.
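As a sketch of the /ix suggestion, the pattern below is the same \K regex with the i flag added. The shortened filenames and the name year_re are made up for illustration.

```ruby
year_re = /
  (?:^|[,\s_-])   # start of line, whitespace, or a separator character
  v               # the letter v, either case because of the i flag
  \K              # drop everything matched so far from the reported match
  \d{4}           # four digits
  (?=$|[,\s_-])   # must be followed by end of line or a separator
/ix

p 'Wonder 01 - V2003'[year_re]       # => "2003"
p 'Wonder 01 - v2003 (c2c)'[year_re] # => "2003"
p 'Wonder 01 - av2003'[year_re]      # => nil
```

Because of \K, String#[] returns only the digits, so no capture group index is needed.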

Need some help understanding backreferences in ruby

I was working on a coderbyte problem where I output the number of occurrences of a character along with the corresponding character. For example "wwwggopp" would return 3w2g1o2p. I was able to solve it but I compared my answer to someone else's and they came up with the following:
def RunLength(str)
  chunks = str.scan(/((\w)\2*)/)
  output = ''
  chunks.each do |chunk|
    output << chunk[0].size.to_s + chunk[1]
  end
  output
end
I get most of the code but what exactly is happening here?
(/((\w)\2*)/)
I understand that \w refers to any character and \2 is a 'backreference' and * refers to 0 or more instances...but together, I'm not sure what it means, mostly because I don't really know what a backreference is and how it works. I've been reading about it but I'm still struggling to grasp the concept. Does the \2 refer to the "2nd group" and if so, what exactly is the "2nd group"?
Backreferences recall what was matched by a capturing group. A backreference is written as a backslash (\) followed by a digit indicating the number of the group to be recalled. Groups are numbered by the position of their opening parenthesis, so here the outer group is \1 and the inner (\w) is \2.
Your regular expression broken down:
(         # group and capture to \1:
  (       # group and capture to \2:
    \w    # word characters (a-z, A-Z, 0-9, _)
  )       # end of \2
  \2*     # what was matched by capture \2 (0 or more times)
)         # end of \1
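To see what the two groups hold, it can help to print the raw scan result. The sketch below uses the example string from the question; run_length is an illustrative rewrite, not the code from the question.

```ruby
# Each match yields [group 1, group 2]: the whole run and the repeated character.
runs = 'wwwggopp'.scan(/((\w)\2*)/)
p runs # => [["www", "w"], ["gg", "g"], ["o", "o"], ["pp", "p"]]

# Illustrative variant of the method from the question.
def run_length(str)
  str.scan(/((\w)\2*)/).map { |run, char| "#{run.size}#{char}" }.join
end

p run_length('wwwggopp') # => "3w2g1o2p"
```

Group 2 captures a single character and \2* then repeats that exact character, so group 1 always ends up containing one full run of identical characters.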

Ruby regex need to exclude pattern

I have the following strings
ALEXANDRITE OVAL 5.1x7.9 GIA# 6167482443 FINE w:1.16
ALEXANDRITE OVAL 4x6 FINE w:1.16
I want to match the 5.1 and 7.9 and the 4 and 6 and not w:1.16 or w: 1.16 or the 6167482443. So far I managed to come up with these:
Matching the w:1.16 w: 1.16
([w][:]\d\.?\d*|[w][:]\s?\d\.?\d*)
Matching the other digits:
\d+\.?\d{,3}
I kind of expected this not to return the long number sequence because of the {,3}, but it still does.
My questions are :
1. How do I combine the two patterns excluding one and returning the other?
2. How do I exclude the long sequence of numbers? Why is it not being excluded now?
Thanks!
You could simply use the below regex.
\b(\d+(?:\.\d+)?)x(\d+(?:\.\d+)?)
Explanation:
\b             the boundary between a word char (\w) and
               something that is not a word char
(              group and capture to \1:
  \d+          digits (0-9) (1 or more times)
  (?:          group, but do not capture (optional):
    \.         '.'
    \d+        digits (0-9) (1 or more times)
  )?           end of grouping
)              end of \1
x              'x'
(              group and capture to \2:
  \d+          digits (0-9) (1 or more times)
  (?:          group, but do not capture (optional):
    \.         '.'
    \d+        digits (0-9) (1 or more times)
  )?           end of grouping
)              end of \2
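A short sketch of applying that regex to the two sample strings from the question. The constant name DIMENSIONS and the conversion to floats are additions for illustration.

```ruby
DIMENSIONS = /\b(\d+(?:\.\d+)?)x(\d+(?:\.\d+)?)/

lines = [
  'ALEXANDRITE OVAL 5.1x7.9 GIA# 6167482443 FINE w:1.16',
  'ALEXANDRITE OVAL 4x6 FINE w:1.16'
]

sizes = lines.map do |line|
  m = line.match(DIMENSIONS)
  m && [Float(m[1]), Float(m[2])] # nil when a line has no NxM part
end

p sizes # => [[5.1, 7.9], [4.0, 6.0]]
```

The long GIA number and the w: weight are skipped because neither is followed by an x, so no exclusion pattern is needed.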
([\d\.])+x([\d\.])+
matches
5.1x7.9
4x6
(\d+(?:\.\d+)?)(?=x)|(?<=x)(\d+(?:\.\d+)?)
You can try this. See the demo:
http://regex101.com/r/wQ1oW3/6
2) To ignore the long number, use \b\d{1,3}\b to specify boundaries:
http://regex101.com/r/wQ1oW3/7
Otherwise, part of the long number will still match.

How do I strip spaces & special characters from a string in specific positions? - Ruby

Say I have a string like this:
May 12 -
Where what I want to end up with is:
May 12
I tried doing a gsub(/\s+\W/, '') and that works to strip out the trailing space and the last hyphen.
But I am not sure how I remove the first space before the M.
Thoughts?
Use match instead of gsub (i.e. extract the relevant string, instead of trying to strip irrelevant parts), using the regex /\w+(?:\W+\w+)*/:
" May 12 - ".match(/\w+(?:\W+\w+)*/).to_s # => "May 12"
Note that this is noticeably faster than using gsub. Pitting my match regex against the suggested gsub regex, I get these benchmarks (on 5 million repetitions):
                   user     system      total        real
match:        19.520000   0.060000  19.580000 ( 22.046307)
gsub:         31.830000   0.120000  31.950000 ( 35.781152)
Adding a strip! step as suggested does not significantly change this:
                   user     system      total        real
match:        19.390000   0.060000  19.450000 ( 20.537461)
gsub.strip!:  30.800000   0.110000  30.910000 ( 34.140044)
Use .strip! on your result. Note that strip! returns nil when there is nothing to strip, so plain strip is safer when chaining.
" May 12".strip! # => "May 12"
How about:
/^\s+|\s+\W+$/
Explanation:
/       : regex delimiter
^       : beginning of string
\s+     : 1 or more spaces
|       : OR
\s+\W+  : 1 or more spaces followed by 1 or more non-word chars
$       : end of string
/       : regex delimiter
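A minimal sketch comparing the gsub regex above with the match-style extraction, using a sample string with a leading space. The variable names are illustrative.

```ruby
raw = ' May 12 - '

# Delete leading whitespace, or trailing whitespace plus non-word characters.
stripped = raw.gsub(/^\s+|\s+\W+$/, '')

# Extract the run of words instead of deleting what surrounds it;
# to_s turns a nil (no match) into an empty string.
extracted = raw[/\w+(?:\W+\w+)*/].to_s

p stripped  # => "May 12"
p extracted # => "May 12"
```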
