How to limit a regular expression to the closest match in Ruby

Assuming I have:
[24] pry(main)> str="these (are) things (that) I want (to) know"
and I want
=> ["these", "things", "I want", "know"]
but
[25] pry(main)> str.split(/\(.*\)/)
I get:
=> ["these ", " know"]
[26] pry(main)>
How would I fix this? Sorry for the multiple questions; they are somewhat separate issues.
edit #1
Since we're splitting on a regex, is there any way to also get the matched elements back?
Like:
=> [["these", "things", "I want", "know"],["(are)","(that)","(to)"]]
where the first part is the split values and the second is the array of matches?

Make the * quantifier ungreedy by putting a ? after it. Like so:
str.split(/\(.*?\)/)
.* without ? will match as much as possible, while you want the opposite effect.
You could also use a different approach and restrict what characters you want to match. For example:
str.split(/\([^()]*\)/)
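As a rough sketch of the difference between the greedy and lazy patterns, and of one way to answer the follow-up about getting the matched elements back, the snippet below uses a hypothetical helper, split_with_matches, that is not part of the original answers. It assumes the parentheses are never nested and also swallows the whitespace around each group so the pieces come back trimmed.

```ruby
# Hypothetical helper (an assumption, not from the original answers):
# returns the text outside the parentheses and the parenthesised groups.
def split_with_matches(str)
  group   = /\([^()]*\)/                  # one non-nested (...) group
  pieces  = str.split(/\s*#{group}\s*/)   # split, dropping surrounding spaces
  matches = str.scan(group)               # the groups that were split on
  [pieces, matches]
end

str = "these (are) things (that) I want (to) know"

p str.split(/\(.*\)/)    # greedy: ["these ", " know"]
p str.split(/\(.*?\)/)   # lazy:   ["these ", " things ", " I want ", " know"]
p split_with_matches(str)
# [["these", "things", "I want", "know"], ["(are)", "(that)", "(to)"]]
```

Note that the plain lazy split keeps the spaces next to each group; the \s* on both sides is what removes them.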

Non regexp version:
s = "these (are) things (that) I want (to) know"
is_parenthesised = -> x {x.start_with?('(') && x.end_with?(')')}
p s.split(' ').partition &is_parenthesised #=> [["(are)", "(that)", "(to)"], ["these", "things", "I", "want", "know"]]

Here's another way:
[str.gsub(/\s*\(.*?\)\s*/, 0.chr).split(0.chr), str.scan(/(\(.*?\))/).flatten]
#=> [["these", "things", "I want", "know"], ["(are)", "(that)", "(to)"]]
I could have gsub'ed to any string I was certain was not in the data. ASCII 0 seemed a safe choice. split is definitely better for the first element, but I offer this in the interest of diversity.

Related

Regular expression returns only one match

I have a set of keywords. Any keyword can contain a space, e.g. ['one', 'one two']. I generate a regexp from these keywords like this: /\b(?i:one|one\ two|three)\b/. Full example below:
keywords = ['one', 'one two', 'three']
re = /\b(?i:#{ Regexp.union(keywords).source })\b/
text = 'Some word one and one two other word'
text.downcase.scan(re)
the result of this code is
=> ["one", "one"]
How can I match the second keyword, one two, and get a result like this?
=> ["one", "one two"]
Regexes are eager to match. Once they find a match, they don't try to find another possibly longer one (with one important exception).
/\b(?i:one|one\ two|three)\b/ is never going to match one two because it will always match one first. You'd need /\b(?i:one two|one|three)\b/ so it tries one two first. Probably the simplest way to automate this is to sort by the longest keywords first.
keywords = ['one', 'one two', 'three']
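re = Regexp.union(keywords.sort { |a,b| b.length <=> a.length }).source
re = /\b(?:#{re})\b/i
text = 'Some word one and one two other word'
puts text.scan(re)
Note that I made the whole regex case-insensitive with /i, which is easier to read than (?i:...), and that downcasing the string is redundant. The alternatives are wrapped in (?:...) so that both \b anchors apply to every keyword rather than only to the first and the last.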
The exception is repetition like +, * and friends. They are greedy by default: .+ matches as many characters as it can. You can make a quantifier lazy by adding a ?, so that it matches as few characters as possible. On its own, .+? matches a single character, and it only grows when the rest of the pattern needs it to.
"A foot of fools".match(/(.*foo)/); # matches "A foot of foo"
"A foot of fools".match(/(.*?foo)/); # matches "A foo"
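To make the point about eager alternation concrete, here is a minimal sketch. The helper name keyword_regex is hypothetical (it does not appear in the question or answers); it simply sorts the keywords longest first and wraps them in a non-capturing group between the two word boundaries.

```ruby
# Hypothetical helper (an assumption for illustration): builds one
# case-insensitive regex that tries longer keywords before shorter ones.
def keyword_regex(keywords)
  longest_first = keywords.sort_by { |k| -k.length }
  /\b(?:#{Regexp.union(longest_first).source})\b/i
end

text = 'Some word one and one two other word'

# "one" is listed first, so it wins before "one two" is ever tried.
p text.scan(/\b(?:one|one two|three)\b/i)                # ["one", "one"]

# Longest first: "one two" is attempted before "one".
p text.scan(keyword_regex(['one', 'one two', 'three']))  # ["one", "one two"]
```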
The point is that \bone\b matches one in one two and since this branch appears before one two branch, it "wins" (see Remember That The Regex Engine Is Eager).
You need to sort the keyword array in descending order before building the regex, so that a keyword always comes before any shorter keyword that is its prefix. It will then look like
(?-mix:\b(?i:three|one\ two|one)\b)
This way the longer one two will be before the shorter one and will get matched.
See the Ruby demo:
keywords = ['one', 'one two', 'three']
keywords = keywords.dup.sort.reverse
re = /\b(?i:#{ Regexp.union(keywords).source })\b/
text = 'Some word one and one two other word'
puts text.downcase.scan(re)
# prints one and one two on separate lines
I tried your example by moving the first element to the second position of the array and it works (e.g. http://rubular.com/r/4F2Hc46wHT).
In fact, it looks like the first keyword "overlaps" the second.
This response may be unhelpful if you can't change the keyword order.

(Ruby) regex optional matches

I'm writing a Rack app to split hostnames ending with certain suffixes.
For example, the hostname (and port) hello.world.lvh.me:3000 needs to be split into tokens hello.world, .lvh.me and :3000. Additionally, the prefix (hello.world), suffix (.lvh.me) and port (:3000) are all optional.
So far, I have a (Ruby) regex that looks like /(.*)(\.lvh\.me)(\:\d+)?/.
This successfully breaks the hostname into component parts but it falls down when one or more of the optional components are missing, e.g. for hello.world:3000 or lvh.me:3000 or even plain old hello.world.
I've tried adding ? to each group to make them optional (/(.*)?(\.lvh\.me)?(\:\d+)?/) but this invariably ends up with the first group, (.*), capturing the entire string and stopping there.
My gut feeling is that this is something which might be solved using lookaround but I'll admit this is a totally new realm of regex for me.
You can try with this pattern:
\A(?=[^:])(.+?)??((?:\.|\A)lvh\.me)?(:[0-9]+)?\z
The lookahead (?=[^:]) checks that the first character is not a : (in other words, that the string is not just a port). This means that at least hello.world or lvh.me is present.
The first group is optional and non-greedy (??), which means that it is matched only when needed.
\A and \z anchor the start and the end of the string (whereas ^ and $ anchor the start and end of a line).
Note that in Ruby \d already matches only ASCII digits (it is [[:digit:]] and \p{Nd} that match other Unicode digits), so [0-9] is equivalent here and simply makes the intent explicit.
Note too that \A(?=[^:])((?>[^l:\n.]+|\.|\Bl|l(?!vh\.me\b))*)((?:\.|\A)lvh\.me)?(:[0-9]+)?\z may be more performant.
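As a quick check of the first pattern above against the hostnames from the question, here is a small sketch. The constant HOST_RE and the helper host_parts are hypothetical names introduced only for this illustration; groups that do not take part in the match come back as nil.

```ruby
# The first pattern from the answer above, stored in a hypothetical constant.
HOST_RE = /\A(?=[^:])(.+?)??((?:\.|\A)lvh\.me)?(:[0-9]+)?\z/

# Hypothetical helper: returns [prefix, suffix, port], or nil when the
# string does not match at all (for example a bare port).
def host_parts(host)
  m = HOST_RE.match(host)
  m && m.captures
end

p host_parts("hello.world.lvh.me:3000")  # ["hello.world", ".lvh.me", ":3000"]
p host_parts("hello.world:3000")         # ["hello.world", nil, ":3000"]
p host_parts("lvh.me:3000")              # [nil, "lvh.me", ":3000"]
p host_parts("hello.world")              # ["hello.world", nil, nil]
p host_parts(":3000")                    # nil
```

When the prefix is absent, the suffix is captured as lvh.me without a leading dot, because the \A alternative is used in place of the \. one.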
Try ^(.*?)?(\.?lvh\.me)?(\:\d+)?$
I added:
a ? to the first group making the * non-greedy
^,$ to anchor it to the start and end.
a ? to the \. before lvh because you want to match lvh.me:3000 not .lvh.me:3000
A Tokenizing Answer
Just for fun, I decided to see if there was a relatively simple way to do what you wanted without a complicated regular expression. The only regular expressions I used were for splitting and validation.
This works for me with your provided corpus, and several variations.
str = 'hello.world.lvh.me:3000'
tokens = str.split /[.:]/
port = tokens.last =~ /\A\d+\z/ ? ":#{tokens.pop}" : ''
domain = sprintf('.%s.%s', *tokens.pop(2))
prefix = tokens.join('.')
You'll certainly need to check for empty strings in certain cases, but it seems like it might be more straightforward and/or flexible than a pure regex solution. I find it more readable, anyway. If you truly need a single regular expression, though, I'm sure one of the other answers will help you out.
You could try splitting rather than matching,
irb(main):012:0> "hello.world.lvh.me:3000".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["hello.world", "lvh.me", "3000"]
irb(main):013:0> "hello.world:3000".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["hello.world", "3000"]
irb(main):014:0> "lvh.me:3000".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["lvh.me", "3000"]
irb(main):015:0> "hello.world".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["hello.world"]
irb(main):016:0> "hello.world.lvh.me".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["hello.world", "lvh.me"]
Look, ma, no regex!
def split_up(str)
  str.sub(':', '.:')
     .split('.')
     .each_slice(2)
     .map { |arr| arr.join('.') }
end
split_up("hello.world.lvh.me:3000") #=> ["hello.world", "lvh.me", ":3000"]
split_up("hello.world:3000") #=> ["hello.world", ":3000"]
split_up("hello.world.lvh.me") #=> ["hello.world", "lvh.me"]
split_up("hello.world") #=> ["hello.world"]
split_up("") #=> []
Steps:
str1 = "hello.world.lvh.me:3000" #=> "hello.world.lvh.me:3000"
str2 = str1.sub(':','.:') #=> "hello.world.lvh.me.:3000"
arr = str2.split('.') #=> ["hello", "world", "lvh", "me", ":3000"]
enum = arr.each_slice(2) #=> #<Enumerator: ["hello", "world", "lvh",
# "me", ":3000"]:each_slice(2)>
enum.to_a #=> [["hello", "world"], ["lvh", "me"],
# [":3000"]]
enum.map { |arr| arr.join('.') } #=> ["hello.world", "lvh.me", ":3000"]

Ruby on Rails - How do I split string and Number?

I have a string "FooFoo2014".
I want the result to be => "Foo Foo 2014"
Any idea?
This works fine:
puts "FooFoo2014".scan(/(\d+|[A-Z][a-z]+)/).join(' ')
# => Foo Foo 2014
Of course, this assumes that every word starts with a capital letter and that numbers are to be separated from words.
"FooFoo2014"
.gsub(/(?<=\d)(?=\D)|(?<=\D)(?=\d)|(?<=[a-z])(?=[A-Z])/, " ")
# => "Foo Foo 2014"
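The lookaround-based substitution above inserts a space at three kinds of zero-width boundaries: digit to non-digit, non-digit to digit, and lowercase to uppercase. A small sketch, wrapping the same regex in a hypothetical constant BOUNDARY and helper space_out (neither is from the original answer), shows how it behaves on a few inputs:

```ruby
# Same pattern as in the answer above; the constant name is an assumption.
BOUNDARY = /(?<=\d)(?=\D)|(?<=\D)(?=\d)|(?<=[a-z])(?=[A-Z])/

# Hypothetical helper: inserts a space at every boundary position.
def space_out(str)
  str.gsub(BOUNDARY, " ")
end

p space_out("FooFoo2014")  # "Foo Foo 2014"
p space_out("Foo2014Bar")  # "Foo 2014 Bar"
p space_out("foo")         # "foo" (no boundary, unchanged)
```

Because every alternative is zero-width, no characters of the input are consumed or lost; only spaces are added.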
Your example is a little generic. So this might be guessing in the wrong direction. That being said, it seems like you want to reformat the string a little:
"FooFoo2014".scan(/^([A-Z].*)([A-Z].*\D)(\d+)$/).flatten.join(" ")
# => "Foo Foo 2014"
As "FooFoo2014" is a string with some internal structure important to you, you need to come up with the right regular expression yourself.
From your question, I extract two tasks:
split the FooFoo at the capital letter.
/([A-Z].*)([A-Z].*)/ would do that, given you only have standard latin letters
split the letter from the digits
/(.*\D)(\d+)/ achieves that.
With capture groups, scan returns an array of arrays (one inner array per match), hence the flatten.
If you think that regular expressions are too complicated for this, I suggest that you take a good look into ActiveSupport. http://api.rubyonrails.org/v3.2.1/ might help you.
If it's only letters followed by only digits:
target = "FooFoo2014"
match_data = target.match(/([A-Za-z]+)(\d+)/)
p match_data[1] # => "FooFoo"
p match_data[2] # => "2014"
If it is two words each made of one capitalized letter then lowercase letters, then digits:
target = "FooBar2014"
match_data = target.match(/([A-Z][a-z]+)([A-Z][a-z]+)(\d+)/)
p match_data[1] # => "Foo"
p match_data[2] # => "Bar"
p match_data[3] # => "2014"
Better regexes are probably possible.

Ruby split on numbers vs letters

I'd like to split the following string on letters:
1234B
There are always only ever 4 digits and one letter. I just want to split those out.
Here is my attempt. I think I have the right method and the regex matches the number, but I don't think my syntax or my regex fits the problem I'm attempting to solve.
"1234A".split(/^\d{4}/)
What you want is not clear, but a general solution to this kind of situation is:
"1234A".scan(/\d+|\D+/)
# => ["1234", "A"]
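To see why the original split attempt does not give the expected result, and what a stricter variant might look like, here is a brief sketch. The helper digits_and_letter is hypothetical and assumes the input is exactly four digits followed by one ASCII letter, as the question describes.

```ruby
# Hypothetical helper: validates the "4 digits + 1 letter" shape and
# returns both parts, or nil if the input has a different shape.
def digits_and_letter(code)
  m = code.match(/\A(?<digits>\d{4})(?<letter>[A-Za-z])\z/)
  m && [m[:digits], m[:letter]]
end

# split treats the match as a delimiter and discards it, so the digits vanish:
p "1234A".split(/^\d{4}/)          # ["", "A"]

# Splitting on the zero-width position between a digit and a non-digit keeps both:
p "1234A".split(/(?<=\d)(?=\D)/)   # ["1234", "A"]

p digits_and_letter("1234B")       # ["1234", "B"]
p digits_and_letter("12345")       # nil
```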
If there are always 4 digits and 1 letter, there's no need to use regular expressions to split the string. Just do this:
str = "1234A"
digits,letter = str[0..3],str[4]
Looking at it purely from the perspective of splitting any string into groups of 4:
"1234A".scan(/.{1,4}/)
# => ["1234", "A"]
Another no-regex version:
str = "1234A"
str.chars.to_a.last # => "A"
str.chop # => "1234"

What's the difference between scan and match on Ruby string

I am new to Ruby and have always used String.scan to search for the first occurrence of a number. It is kind of strange that the returned value is a nested array, but I just use [0][0] to get the values I want. (I am sure it has its purpose, I just haven't used it yet.)
I just found out that there is a String.match method. And it seems to be more convenient because the returned array is not nested.
Here is an example of the two, first is scan:
>> 'a 1-night stay'.scan(/(a )?(\d*)[- ]night/i).to_a
=> [["a ", "1"]]
then is match
>> 'a 1-night stay'.match(/(a )?(\d*)[- ]night/i).to_a
=> ["a 1-night", "a ", "1"]
I have checked the API, but I can't really tell the difference, as both are described as 'matching the pattern'.
This question is simply out of curiosity: what can scan do that match can't, and vice versa? Is there any specific scenario that only one of them can handle? Is match inferior to scan?
Short answer: scan will return all matches. This doesn't make it superior, because if you only want the first match, str.match(re)[2] reads much nicer than str.scan(re)[0][1].
ruby-1.9.2-p290 :002 > 'a 1-night stay, a 2-night stay'.scan(/(a )?(\d*)[- ]night/i).to_a
=> [["a ", "1"], ["a ", "2"]]
ruby-1.9.2-p290 :004 > 'a 1-night stay, a 2-night stay'.match(/(a )?(\d*)[- ]night/i).to_a
=> ["a 1-night", "a ", "1"]
#scan returns everything that the Regex matches.
#match returns the first match as a MatchData object, which contains data held by special variables like $& (what was matched by the Regex; that's what's mapping to index 0), $1 (match 1), $2, et al.
Previous answers state that scan will return every match from the string the method is called on, but this is not quite accurate: scan returns only non-overlapping matches.
Scan keeps track of an index and continues looking for subsequent matches after the last character of the previous match.
string = 'xoxoxo'
p string.scan('xo') # => ["xo", "xo", "xo"]
# so far so good but...
p string.scan('xox') # => ["xox"]
# if this returned EVERY instance of 'xox' it would include the substrings
# starting at indices 0 and 2 but only one match is found
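Pulling the answers above together, the sketch below contrasts match (first match, as a MatchData) with scan (all non-overlapping matches). It also shows one common workaround for overlapping matches: putting the pattern inside a lookahead so that each match is zero-width. The helper overlapping_matches is a hypothetical name used only for this illustration.

```ruby
text = 'a 1-night stay, a 2-night stay'
re   = /(a )?(\d*)[- ]night/i

first = text.match(re)         # MatchData for the first match only
p first[0]                     # "a 1-night" (the whole match)
p first[2]                     # "1"         (second capture group)

p text.scan(re)                # [["a ", "1"], ["a ", "2"]] (groups per match)
p text.scan(/\d+-night/)       # ["1-night", "2-night"]     (no groups: flat array)

# Hypothetical helper: finds overlapping occurrences by capturing inside a
# zero-width lookahead, so scan advances one character at a time.
def overlapping_matches(str, substring)
  str.scan(/(?=(#{Regexp.escape(substring)}))/).flatten
end

p 'xoxoxo'.scan('xox')                 # ["xox"]
p overlapping_matches('xoxoxo', 'xox') # ["xox", "xox"]
```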

Resources