What's the difference between match method and the =~ operator? - ruby

Two expressions:
puts "String has vowels" if "This is a test".match(/[aeiou]/)
and
puts "String has vowels" if "This is a test" =~ /[aeiou]/
seem identical. Are they not? I did some testing below:
"This is a test" =~ /[aeiou]/
# => 2
"This is a test".match(/[aeiou]/)
# => MatchData "i"
So it seems like =~ gives you the position of the first match and match method gives you the first character that matches. Is this correct? They both return true and so what's the difference here?

They just differ on what they return if there is a match. If there is no match, both return nil.
=~ returns the numerical index of the character in the string where the match started
.match returns an instance of the class MatchData
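A minimal sketch of the difference, using the question's string plus a second string ("rhythm", picked here only because it contains none of the vowels in the pattern) to show the no-match case:

```ruby
text  = "This is a test"
vowel = /[aeiou]/

index = (text =~ vowel)      # Integer: index where the first match starts
match = text.match(vowel)    # MatchData: an object describing the first match

puts index                   # position of the first vowel
puts match[0]                # the matched text itself
puts match.begin(0)          # the same position that =~ reports
puts match.pre_match         # everything before the match

# Both return nil when nothing matches, so either works in an `if`.
no_index = ("rhythm" =~ vowel)
no_match = "rhythm".match(vowel)
puts no_index.inspect
puts no_match.inspect
```

So if you only need a yes/no answer or the position, =~ is enough; reach for match when you need the matched text, captures, or offsets.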

You're correct.
Expanding on Nobita's answer, match is less efficient if you want to just check to see if a string matches a regexp (like in your case). In that case, you should use =~. See the answer to "Fastest way to check if a string matches or not a regexp in ruby?", which contains these benchmarks:
require 'benchmark'
"test123" =~ /1/
=> 4
Benchmark.measure{ 1000000.times { "test123" =~ /1/ } }
=> 0.610000 0.000000 0.610000 ( 0.578133)
...
"test123".match(/1/)
=> #<MatchData "1">
Benchmark.measure{ 1000000.times { "test123".match(/1/) } }
=> 1.703000 0.000000 1.703000 ( 1.578146)
So, in this case, =~ is a little less than three times faster than match.
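If you want to re-run the comparison yourself, something like the following should do. It assumes Ruby 2.4 or later, because it also times String#match?, which only returns true or false and does not build a MatchData object. The iteration count n is arbitrary, and the numbers will differ by machine and Ruby version, so no output is shown here.

```ruby
require 'benchmark'

str = "test123"
n   = 100_000   # arbitrary iteration count, raise it for steadier numbers

Benchmark.bm(8) do |x|
  x.report("=~")     { n.times { str =~ /1/ } }        # returns an index or nil
  x.report("match")  { n.times { str.match(/1/) } }    # builds a MatchData
  x.report("match?") { n.times { str.match?(/1/) } }   # returns true/false only
end
```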

Related

Optional whitespace in regexp

I want to create a regexp to specify words between parenthesis. For example, I have a string like this:
"something(a,b)"
and
"something(c, d)"
and I want to extract the letters from between the parentheses.
In the first string I want to get an array ['a','b']. In the second, I want the array ['c','d'].
I have following method:
def suffixes(t)
(t.scan /\((\w+),(\w+)\)/).flatten
end
but this works only for the first case. In the second variant I have:
def suffixes(t)
(t.scan /\((\w+),[\s](\w+)\)/).flatten
end
But this works only for the second case. I don't know what regexp will operate in both cases.
You can use:
def suffixes(t)
(t.scan /\((\w+)\s*,\s*(\w+)\)/).flatten
end
\s* will match 0 or more spaces before and after comma.
Make the \s in between optional.
def suffixes(t)
(t.scan /\((\w+),\s?(\w+)\)/).flatten
end
The ? after the \s makes the space optional (0 or 1 occurrences).
I would suggest separating the two steps: "scanning" for the text between the parentheses and "splitting" the result by comma:
s = "something(c, d)"
s.match( /\((.+)\)/ )[1] # found text between parentheses
.split(/,/) # split the result by comma
.map(&:strip) # stripped the values
It’s more Ruby-like, in my understanding. Hope it helps.
UPD Thanks #theTinMan, there are two ways to improve this answer. First, s[/\((.+)\)/, 1] looks better and executes faster than s.match( /\((.+)\)/ )[1]. Second, splitting by a string is faster than splitting by a regexp. Summing up:
s = "something(c, d)"
s[ /\((.+)\)/, 1 ] # found text between parentheses
.split(',') # split the result by comma
.map(&:strip) # stripped the values
Proof:
require 'benchmark'
n = 1_000_000
s = "something(c, d)"
Benchmark.bm do |x|
x.report { n.times { s.match( /\((.+)\)/ )[1].split(/,/).map(&:strip) } }
x.report { n.times { s.match( /\((.+)\)/ )[1].split(',').map(&:strip) } }
x.report { n.times { s[/\((.+)\)/, 1].split(/,/).map(&:strip) } }
x.report { n.times { s[/\((.+)\)/, 1].split(',').map(&:strip) } }
end
# user system total real
# 3.590000 0.000000 3.590000 ( 3.598151)
# 3.030000 0.000000 3.030000 ( 3.028137)
# 2.940000 0.000000 2.940000 ( 2.942490)
# 2.180000 0.000000 2.180000 ( 2.182447)
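Wrapped up as the suffixes method from the question, the scan-then-split idea could look like the sketch below. Two things here are my own assumptions and not part of the answer above: the capture uses [^)]* instead of .+ so that it stops at the first closing parenthesis, and the method returns an empty array when there are no parentheses at all (the bare chain would raise NoMethodError on nil in that case).

```ruby
# Sketch: extract comma-separated items from the first (...) group.
def suffixes(t)
  inner = t[/\(([^)]*)\)/, 1]      # text between the parentheses, or nil
  return [] if inner.nil?          # no parentheses found
  inner.split(',').map(&:strip)    # split on commas, trim whitespace
end

p suffixes("something(a,b)")
p suffixes("something(c, d)")
p suffixes("something(x, y, z)")   # any number of items works
p suffixes("something")            # no parentheses
```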
\((\w+)|(?!^)\G\s*,\s*(\w+)
Try this. It will work for any number of arguments. See the demo:
https://regex101.com/r/vN3sH3/27
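In Ruby that pattern can be fed to scan, where \G anchors each attempt to the point where the previous match ended. A sketch, assuming the input contains a single parenthesized list (the three-item string is made up for illustration). Each match fills only one of the two capture groups, so the nils have to be dropped with compact:

```ruby
def suffixes(t)
  # First alternative: "(" plus the first word.
  # Second alternative: a comma plus the next word, only directly
  # after the previous match (\G) and never at the start of the text.
  pattern = /\((\w+)|(?!^)\G\s*,\s*(\w+)/
  t.scan(pattern).flatten.compact
end

p suffixes("something(a,b)")
p suffixes("something(c, d)")
p suffixes("something(x ,y, z)")
```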

Regexp to match repeated substring

I would like to verify a string containing repeated substrings. The substrings have a particular structure. Whole string has a particular structure (substring split by "|"). For instance, the string can be:
1=23.00|6=22.12|12=21.34|112=20.34
1=23.00|6=22.12|12=21.34
1=23.00|12=21.34
1=23.00**
How can I check that all repeated substrings match a regexp? I tried to check it with:
"1=23.00|6=22.12|12=21.34".match(/([1-9][0-9]*[=][0-9\.]+)+/)
But checking gives true even when several substrings do not match the regexp:
"1=23.00|6=ass|=21.34".match(/([1-9][0-9]*[=][0-9\.]+)+/)
# => #<MatchData "1=23.00" 1:"1=23.00">
The question is whether every repeated substring matches a regex. I understand that the substrings are separated by the character | or $/, the latter being the end of a line. We first need to obtain the repeated substrings:
a = str.split(/[#{$/}\|]/)
.map(&:strip)
.group_by {|s| s}
.select {|_,v| v.size > 1 }
.keys
Next we specify whatever regex you wish to use. I am assuming it is this:
REGEX = /[1-9][0-9]*=[1-9]+\.[0-9]+/
but it could be altered if you have other requirements.
As we wish to determine if all repeated substrings match the regex, that is simply:
a.all? {|s| s =~ REGEX}
Here are the calculations:
str =<<_
1=23.00|6=22.12|12=21.34|112=20.34
1=23.00|6=22.12|12=21.34
1=23.00|12=21.34
1=23.00**
_
c = str.split(/[#{$/}\|]/)
#=> ["1=23.00", "6=22.12", "12=21.34", "112=20.34", "1=23.00",
# "6=22.12", "12=21.34", "1=23.00", "12=21.34", "1=23.00**"]
d = c.map(&:strip)
# same as c, possibly not needed or not wanted
e = d.group_by {|s| s}
# => {"1=23.00" =>["1=23.00", "1=23.00", "1=23.00"],
# "6=22.12" =>["6=22.12", "6=22.12"],
# "12=21.34" =>["12=21.34", "12=21.34", "12=21.34"],
# "112=20.34"=>["112=20.34"], "1=23.00**"=>["1=23.00**"]}
f = e.select {|_,v| v.size > 1 }
#=> {"1=23.00"=>["1=23.00", "1=23.00" , "1=23.00"],
# "6=22.12"=>["6=22.12", "6=22.12"],
# "12=21.34"=>["12=21.34", "12=21.34", "12=21.34"]}
a = f.keys
#=> ["1=23.00", "6=22.12", "12=21.34"]
a.all? {|s| s =~ REGEX}
#=> true
This will return true if there are any duplicate values, false if there are not:
s = "1=23.00|6=22.12|12=21.34|112=20.34|3=23.00"
arr = s.split('|').map { |pair| pair.sub(/\A\d+=/, "") } # strip the whole "key=" prefix, keep the value
arr != arr.uniq # => true
If you want to solve it with a regexp alone (not Ruby code), you should match the whole string, not the substrings. I added the [|] symbol and the line ending to your regexp, and it should work the way you want.
([1-9][0-9]*[=][0-9\.]+[|]*)+$
Try it out.
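In Ruby, a stricter variant would also anchor the start, since a pattern anchored only at the end still accepts junk in front of the last valid field. The sketch below is built on assumptions of mine: the names FIELD_SOURCE, line_pattern and valid_line? are hypothetical, each value is taken to be digits, a dot, then digits, and match? needs Ruby 2.4 or later.

```ruby
# One "key=value" field: key is a positive integer, value is like 23.00.
FIELD_SOURCE = '[1-9][0-9]*=[0-9]+\.[0-9]+'

# Whole line: one field, then zero or more "|field", nothing else.
def line_pattern
  /\A#{FIELD_SOURCE}(?:\|#{FIELD_SOURCE})*\z/
end

def valid_line?(line)
  line_pattern.match?(line)
end

p valid_line?("1=23.00|6=22.12|12=21.34")   # every field is well formed
p valid_line?("1=23.00|6=ass|=21.34")       # two broken fields
p valid_line?("junk|1=23.00")               # junk before a valid field
```

\A and \z are used instead of ^ and $ so that a multi-line string cannot pass by having just one good line.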

How to delete specific characters from a string in Ruby?

I have several strings that look like this:
"((String1))"
They are all different lengths. How could I remove the parentheses from all these strings in a loop?
Use String#tr as below:
"((String1))".tr('()', '')
# => "String1"
If you just want to remove the first two characters and the last two, then you can use negative indexes on the string:
s = "((String1))"
s = s[2...-2]
p s # => "String1"
If you want to remove all parentheses from the string you can use the delete method on the string class:
s = "((String1))"
s.delete! '()'
p s # => "String1"
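One thing to watch with the bang version: delete! changes the string in place and returns nil when there was nothing to remove, so don't chain on its result. A small sketch (the strings array is made up to show the "in a loop" part of the question):

```ruby
s = "((String1))"
cleaned = s.delete('()')        # returns a new string, s is left alone
p cleaned
p s

t = "String1".dup               # dup so this also runs with frozen string literals
result = t.delete!('()')        # nothing to remove, so the return value is nil
p result
p t

# Cleaning several strings at once with the non-bang version:
strings = ["((String1))", "((Another))", "((Third one))"]
p strings.map { |str| str.delete('()') }
```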
For those coming across this and looking for performance, it looks like #delete and #tr are about the same in speed and 2-4x faster than gsub.
require 'benchmark'
text = "Here is a string with / some forwa/rd slashes"
tr = Benchmark.measure { 10000.times { text.tr('/', '') } }
# tr.total => 0.01
delete = Benchmark.measure { 10000.times { text.delete('/') } }
# delete.total => 0.01
gsub = Benchmark.measure { 10000.times { text.gsub('/', '') } }
# gsub.total => 0.02 - 0.04
Using String#gsub with regular expression:
"((String1))".gsub(/^\(+|\)+$/, '')
# => "String1"
"(((((( parentheses )))".gsub(/^\(+|\)+$/, '')
# => " parentheses "
This will remove surrounding parentheses only.
"(((((( This (is) string )))".gsub(/^\(+|\)+$/, '')
# => " This (is) string "
Here is an even shorter way of achieving this:
Using a negated character class:
irb(main)> "((String1))"[/[^()]+/]
=> "String1"
^ - At the start of a character class, matches anything NOT in the class. Inside the character class, we have ( and ).
Or with global substitution (gsub), as others have mentioned:
irb(main)> "((String1))".gsub(/[)(]/, '')
=> "String1"
Use String#delete:
"((String1))".delete "()"
=> "String1"

Check for a substring at the end of string

Let's say I have two strings:
"This-Test has a "
"This has a-Test"
How do I match the "Test" at the end of the string, so that I only get the second string as a result and not the first? I am using include?, but it matches all occurrences and not just the ones where the substring occurs at the end of the string.
You can do this very simply using end_with?, e.g.
"Test something Test".end_with? 'Test'
Or, you can use a regex that matches the end of the string:
/Test$/ === "Test something Test"
"This-Test has a ".end_with?("Test") # => false
"This has a-Test".end_with?("Test") # => true
Oh, the possibilities are many...
Let's say we have two strings, a = "This-Test has a" and b = "This has a-Test".
Because you want to match any string that ends exactly in "Test", a good RegEx would be /Test$/ which means "capital T, followed by e, then s, then t, then the end of the line ($)".
Ruby has the =~ operator which performs a RegEx match against a string (or string-like object):
a =~ /Test$/ # => nil (because the string does not match)
b =~ /Test$/ # => 11 (the index at which the match starts)
You could also use String#match:
a.match(/Test$/) # => nil (because the string does not match)
b.match(/Test$/) # => a MatchData object (indicating at least one hit)
Or you could use String#scan:
a.scan(/Test$/) # => [] (because there are no matches)
b.scan(/Test$/) # => ['Test'] (which is the matching part of the string)
Or you could just use ===:
/Test$/ === a # => false (because there are no matches)
/Test$/ === b # => true (because there was a match)
Or you can use String#end_with?:
a.end_with?('Test') # => false
b.end_with?('Test') # => true
...or one of several other methods. Take your pick.
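One difference between these options is worth knowing: in Ruby, $ matches at the end of any line, not only at the end of the whole string, while end_with? and the \z anchor look at the very end. A short sketch with a made-up two-line string:

```ruby
multi = "ends with Test\nbut has a second line"

line_end   = (multi =~ /Test$/)     # $ matches before the newline
string_end = (multi =~ /Test\z/)    # \z only matches at the very end
suffix     = multi.end_with?("Test")

p line_end
p string_end
p suffix
```

For single-line input the two anchors behave the same, so this only matters if the strings can contain newlines.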
You can use the regex /Test$/ to test:
"This-Test has a " =~ /Test$/
#=> nil
"This has a-Test" =~ /Test$/
#=> 11
You can use a range:
"Your string"[-4..-1] == "Test"
You can use a regex:
"Your string " =~ /Test$/
String's [] makes it nice and easy and clean:
"This-Test has a "[/Test$/] # => nil
"This has a-Test"[/Test$/] # => "Test"
If you need case-insensitive:
"This-Test has a "[/test$/i] # => nil
"This has a-Test"[/test$/i] # => "Test"
If you want true/false:
str = "This-Test has a "
!!str[/Test$/] # => false
str = "This has a-Test"
!!str[/Test$/] # => true

How do I get the match data for all occurrences of a Ruby regular expression in a string?

I need the MatchData for each occurrence of a regular expression in a string. This is different than the scan method suggested in Match All Occurrences of a Regex, since that only gives me an array of strings (I need the full MatchData, to get begin and end information, etc).
input = "abc12def34ghijklmno567pqrs"
numbers = /\d+/
numbers.match input # #<MatchData "12"> (only the first match)
input.scan numbers # ["12", "34", "567"] (all matches, but only the strings)
I suspect there is some method that I've overlooked. Suggestions?
You want
"abc12def34ghijklmno567pqrs".to_enum(:scan, /\d+/).map { Regexp.last_match }
which gives you
[#<MatchData "12">, #<MatchData "34">, #<MatchData "567">]
The "trick" is, as you see, to build an enumerator in order to get each last_match.
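For example, to get at the begin and end offsets the question asks about (a sketch; the variable names are only for illustration):

```ruby
input   = "abc12def34ghijklmno567pqrs"
matches = input.to_enum(:scan, /\d+/).map { Regexp.last_match }

# Each element is a full MatchData, so offsets are available.
matches.each do |m|
  puts "Found #{m[0]} at #{m.begin(0)} until #{m.end(0)}"
end

offsets = matches.map { |m| [m[0], m.begin(0), m.end(0)] }
p offsets
```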
My current solution is to add an each_match method to Regexp:
class Regexp
def each_match(str)
start = 0
while matchdata = self.match(str, start)
yield matchdata
start = matchdata.end(0)
end
end
end
Now I can do:
numbers.each_match input do |match|
puts "Found #{match[0]} at #{match.begin(0)} until #{match.end(0)}"
end
Tell me there is a better way.
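One caveat with the loop above: if the pattern can match an empty string, end(0) never moves forward and the loop does not terminate. A sketch of a guarded variant, written as a plain method so it does not reopen Regexp (the argument order and the one-character step after an empty match are my own choices):

```ruby
# Yields a MatchData for every match of regexp in str.
def each_match(regexp, str)
  pos = 0
  while pos <= str.length && (m = regexp.match(str, pos))
    yield m
    # After an empty match, step one character ahead to avoid looping forever.
    pos = m.end(0) > m.begin(0) ? m.end(0) : m.end(0) + 1
  end
end

input = "abc12def34ghijklmno567pqrs"
each_match(/\d+/, input) do |m|
  puts "Found #{m[0]} at #{m.begin(0)} until #{m.end(0)}"
end
```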
I’ll put it here to make the code available via a search:
input = "abc12def34ghijklmno567pqrs"
numbers = /\d+/
input.gsub(numbers) { |m| p $~ }
The result is as requested:
⇒ #<MatchData "12">
⇒ #<MatchData "34">
⇒ #<MatchData "567">
See "input.gsub(numbers) { |m| p $~ } Matching data in Ruby for all occurrences in a string" for more information.
I'm surprised nobody mentioned the amazing StringScanner class included in Ruby's standard library:
require 'strscan'
s = StringScanner.new('abc12def34ghijklmno567pqrs')
while s.skip_until(/\d+/)
num, offset = s.matched.to_i, [s.pos - s.matched_size, s.pos - 1]
# ..
end
No, it doesn't give you the MatchData objects, but it does give you an index-based interface into the string.
input = "abc12def34ghijklmno567pqrs"
n = Regexp.new("\\d+")
[n.match(input)].tap { |a| a << n.match(input, a.last.end(0)) until a.last.nil? }[0..-2] # resume exactly where the previous match ended
=> [#<MatchData "12">, #<MatchData "34">, #<MatchData "567">]

Resources