Ruby Regexp: difference between new and union with a single regexp

I have simplified the examples. Say I have a string containing the code for a regex. I would like the regex to match a literal dot and thus I want it to be:
\.
So I create the following Ruby string:
"\\."
However when I use it with Regexp.union to create my regex, I get this:
irb(main):017:0> Regexp.union("\\.")
=> /\\\./
That will match a backslash followed by a dot, not just a single dot. Compare the previous result to this:
irb(main):018:0> Regexp.new("\\.")
=> /\./
which gives the Regexp I want, but without the union I need.
Could you explain why Ruby behaves like that and how to build the correct union of regexes? The use case is importing JSON strings that describe regexes and union-ing them in Ruby.

Passing a string to Regexp.union is designed to match that string literally. There is no need to escape it; Regexp.escape is already called internally.
Regexp.union(".")
#=> /\./
If you want to pass regular expressions to Regexp.union, don't use strings:
Regexp.union(Regexp.new("\\."))
#=> /\./
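For the JSON use case in the question, one way to apply this is to compile each imported source string with Regexp.new before handing the results to Regexp.union. A minimal sketch, assuming the JSON has already been parsed into an array of pattern strings; the helper name union_from_sources is hypothetical:

```ruby
# Hypothetical helper: compile each pattern source first, then union the
# resulting Regexp objects so nothing is escaped a second time.
def union_from_sources(sources)
  Regexp.union(sources.map { |source| Regexp.new(source) })
end

# "\\." is the two-character string \. (backslash, dot).
dot_or_digits = union_from_sources(["\\.", "\\d+"])

dot_or_digits.match?(".")   # a literal dot matches
dot_or_digits.match?("42")  # a run of digits matches
dot_or_digits.match?("x")   # neither alternative matches
```

Regexp.union also accepts the compiled regexps as separate arguments, so passing an array is only a convenience here.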

I think \\. is where you went wrong. If you want to match a ., you should just use the first form, \. As it stands you have a \ followed by \., and the first backslash is itself escaped.
To be safe, just use the standard regex provided by Ruby, which would be Regexp.new(/\./) in your case.
If you want to use union, just use Regexp.union("."), which should return /\./
From the Ruby Regexp class documentation:
Regexp.union("a+b*c") #=> /a\+b\*c/

Related

What is the difference between these three alternative ways to write Ruby regular expressions?

I want to match the path "/". I've tried the following alternatives, and the first two do match, but I don't know why the third doesn't:
/\A\/\z/.match("/") # <MatchData "/">
"/\A\/\z/".match("/") # <MatchData "/">
Regexp.new("/\A\/\z/").match("/") # nil
What's going on here? Why are they different?
The first snippet is the only correct one.
The second example is... misleading. The string literal "/\A\/\z/" is, obviously, not a regex. It's a string. Strings have a #match method, which converts its argument to a regexp (if it is not already one) and matches the receiver against it. So, in this example, it's '/' that is the regular expression, and it matches a forward slash found in the other string.
The third line is completely broken: you don't need the surrounding slashes there. They are part of the regex literal syntax, which you didn't use. Also, use single-quoted strings, not double-quoted ones (which try to interpret escape sequences like \A).
Regexp.new('\A/\z').match("/") # => #<MatchData "/">
And, of course, none of the above is needed if you just want to check if a string consists of only one forward slash. Just use the equality check in this case.
s == '/'
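To see why the three attempts behave differently, it can help to look at what each string literal actually contains before it reaches Regexp.new. A small sketch; the helper name root_path? is made up for illustration:

```ruby
# In a double-quoted literal, unknown escapes such as \A and \z simply
# drop the backslash, so the anchors are lost before Regexp.new runs.
double_quoted = "/\A\/\z/"   # the five characters /A/z/
single_quoted = '\A/\z'      # backslashes preserved: \A/\z

puts double_quoted
puts single_quoted

# Hypothetical helper built from the single-quoted source.
def root_path?(path)
  Regexp.new('\A/\z').match?(path)
end

root_path?("/")      # true
root_path?("/home")  # false
```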

Regular expression to clean string

I'm struggling to figure out even where to start with this. I believe there is a regular expression to make this a fairly straightforward task. I want to trim off the extra asterisks in a string.
Example string:
test="AM*BE*3***LAST****~"
I would like it to trim the asterisks off the end only, leaving the ones elsewhere in the string alone. So the resulting value in the variable would be:
test="AM*BE*3***LAST~"
In Perl I was able to use this:
s/\*+~+/~/;
Is there something similar I can do in Ruby? I'm sure there is, just struggling to find it for some reason. Any help would be greatly appreciated.
You could use this regex:
/\*+~$/
Then use the gsub method to replace all matches with a tilde ~:
test = "AM*BE*3***LAST****~"
test.gsub!(/\*+~$/, '~')
# => "AM*BE*3***LAST~"
Or you could use this more flexible regex, which matches any number of non-asterisk characters after the asterisks, up to the end of the line:
/\*+([^*]+)$/
Then use the first capture group ($1) as the replacement:
test.gsub(/\*+([^*]+)$/) { $1 }
Ruby's String class has the [] method, which accepts a regexp as its argument. We can also assign to it, allowing us to do things like:
foo = "AM*BE*3***LAST****~"
foo[/\*+~+$/] = '~'
foo # => "AM*BE*3***LAST~"
That reuses the match pattern from your Perl search/replace. (I'm assuming you only want to match at the end of the line because of your examples. If it needs to be anywhere in the string remove the trailing $ from the pattern.)
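If you would rather not modify the original string, the same pattern also works with the non-destructive String#sub. A sketch, with a hypothetical method name:

```ruby
# Hypothetical helper: collapse the run of asterisks that sits directly
# before the trailing tilde(s) into a single "~", returning a new string.
def trim_trailing_asterisks(segment)
  segment.sub(/\*+~+$/, "~")
end

puts trim_trailing_asterisks("AM*BE*3***LAST****~")  # prints AM*BE*3***LAST~
puts trim_trailing_asterisks("AM*BE~")               # unchanged, nothing to trim
```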
You can use Rubular to test the regex, and work out what you need from the quick reference at the bottom of the page.
http://rubular.com/

Extracting numbers with regex in Ruby from a number with a dot as the thousands delimiter

Trying to extract '4995' from the string '4.995,-' with regex in Ruby.
I tried with
/\d+/
Which seems to work from this Rubular screenshot: http://cl.ly/image/111c2x0N3s0C
but running it only outputs
4
You cannot match it in a single regex because it is not a single substring.
"4.995,-".gsub(/\D/, "") # => "4995"
I'm up-voting sawa's answer because it's a good answer.
But since you are new to regular expressions, you may want further explanation as to why his answer works for you.
When you are trying to match with the regexp /\d+/, what you are saying is "Match for me 1 or more consecutive digits." But your target string, 4.995,-, is not made up of only consecutive digits. It has a 4 and it has a 995. The first match of "1 or more consecutive digits" is 4. That's why what you're getting as a result is 4.
Try to look at your problem differently. Instead of saying, "Find me all the digits and extract those out," you could say, "Find me anything that's not a digit, and get rid of it." To do this, you can use Ruby's search-and-replace method, gsub. gsub searches a target string for anything that matches a given regular expression and replaces each match with a replacement string that you also provide. See the String#gsub documentation for details.
The regular expression for "non-digit" is /\D/. So, you can do a gsub that looks for any /\D/ and replaces it with a blank string.
'4.995,-'.gsub(/\D/,'')
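Both views of the problem, deleting the non-digits or collecting the digit runs, can be sketched side by side. The method names below are illustrative only:

```ruby
# Remove everything that is not a digit.
def digits_by_deletion(text)
  text.gsub(/\D/, "")
end

# Collect each run of digits and join the runs back together.
def digits_by_scan(text)
  text.scan(/\d+/).join
end

puts digits_by_deletion("4.995,-")  # prints 4995
puts digits_by_scan("4.995,-")      # prints 4995
puts "4.995,-"[/\d+/]               # prints 4, only the first run
```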
Do as below, using String#[] and String#tr:
"4.995,-"[/\d+\.\d+/].tr('.','') # => "4995"
# more Rubyish way using #tr method only
"4.995,-".tr("^0-9",'') # => "4995"
p '4.995,-1'.delete('.')[/\d+/] #=> "4995"
Here's another way that, like @Arup's solution, works when a digit follows the first non-digit:
'4.995,-1'.sub('.','').to_i.to_s #=> "4995"
This works because
'4.995,-1'.sub('.','') #=> "4995,-1"
and to_i takes the first part of a string that can be converted to an Integer (a Fixnum in Ruby versions before 2.4).
Alternatively:
'4.995,-1'.to_f.to_s.sub('.','') #=> "4995"
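These conversion-based variants depend on the exact shape of the input, so it may be worth checking them against a few more prices before relying on them. A sketch with made-up sample values; price_to_digits is a hypothetical helper:

```ruby
# Hypothetical helper: drop every thousands dot, then let to_i read the
# leading integer and ignore whatever follows (",-" and so on).
def price_to_digits(text)
  text.delete(".").to_i.to_s
end

puts price_to_digits("4.995,-1")     # prints 4995
puts price_to_digits("1.234.567,-")  # prints 1234567

# The to_f route treats the dot as a decimal point, so a trailing zero
# is lost along the way:
puts "4.990,-".to_f.to_s.sub(".", "")  # prints 499
```

Using delete rather than sub matters once a number contains more than one thousands dot, since sub only replaces the first occurrence.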

ruby rspec and strings comparison

I'm not a Ruby expert and maybe this will seem a silly question... but I'm too curious about an oddity (I think) I've found in the RSpec matcher called match.
As you know, match takes a string or a regex as input. Example:
"test".should match "test" #=> will pass
"test".should match /test/ #=> will pass
The strangeness begins when you insert special regex characters into the input string:
"*test*".should match "*test*" #=> will fail throwing a regex exception
This means (I thought) that input strings are interpreted as regexes, so I should escape the special regex characters to make it work:
"*test*".should match "\*test\*" #=> will fail with same exception
"*test*".should match /\*test\*/ #=> will pass
From this basic test, I understand that match treats input strings as regular expressions but does not allow you to escape special regex characters.
Am I right? Isn't this odd behavior? I mean, it's either a string or a regex!
EDIT AFTER ANSWER:
Following DigitalRoss's (correct) answer, the following tests passed:
"*test*".should match "\\*test\\*" #=> pass
"*test*".should match '\*test\*' #=> pass
"*test*".should match /\*test\*/ #=> pass
What you are seeing is the different interpretation of backslash-escaped characters in String vs Regexp. In a soft (") quoted string, \* becomes a *, but /\*/ is really a backslash followed by a star.
If you use hard quotes (') for the String objects or double the backslash characters (only for the Strings, though) then your tests should produce the same results.
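The difference is easy to check outside RSpec by comparing what each literal contains. If the goal is to match a string literally, Regexp.escape avoids hand-written backslashes altogether. A sketch; literal_pattern is a hypothetical helper, not part of RSpec:

```ruby
# Double quotes drop the backslash before "*"; single quotes keep it.
puts "\*test\*" == "*test*"       # prints true
puts '\*test\*' == "\\*test\\*"   # prints true

# Hypothetical helper: build a regexp that matches the text literally.
def literal_pattern(text)
  Regexp.new(Regexp.escape(text))
end

puts literal_pattern("*test*").match?("*test*")  # prints true

# The unescaped string is not a valid regexp source at all.
begin
  Regexp.new("*test*")
rescue RegexpError => error
  puts "invalid pattern: #{error.message}"
end
```

For a plain literal comparison in a spec, the eq or include matchers may be a better fit than match.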

Backslash + captured group within Ruby regular expression

How do I escape a backslash before a captured group?
Example:
"foo+bar".gsub(/(\+)/, '\\\1')
What I expect (and want):
foo\+bar
what I unfortunately get:
foo\\1bar
How do I escape here correctly?
As others have said, you need to escape everything in that string twice. So in your case the solution is to use '\\\\\1' or '\\\\\\1'. But since you asked why, I'll try to explain that part.
The reason is that the replacement string is parsed twice: once by Ruby and once by the underlying regular expression engine, for which \1 is its own escape sequence. (It's probably easier to understand with double-quoted strings, since single quotes introduce an ambiguity where '\\1' and '\1' are equivalent but '\' and '\\' are not.)
So for example, a simple replacement here with a captured group and a double quoted string would be:
"foo+bar".gsub(/(\+)/, "\\1") #=> "foo+bar"
This passes the string \1 to the regexp engine, which it understands as a reference to a capture group. In a double-quoted Ruby string literal, "\1" means something else entirely (the character with code 1).
What we actually want in this case is for the regexp engine to receive \\\1. It also understands \ as an escape character, so \\1 is not sufficient and will simply evaluate to the literal output \1. So, we need \\\1 in the regexp engine, but to get to that point we need to also make it past Ruby's string literal parser.
To do that, we take our desired regexp input and double every backslash again to get through Ruby's string literal parser. \\\1 therefore requires "\\\\\\1". In the case of single quotes, one backslash can be omitted, since \1 is not a valid escape sequence in single quotes and is treated literally.
Addendum
One of the reasons this problem is usually hidden is thanks to the use of /.+/ style regexp quotes, which Ruby treats in a special way to avoid the need to double escape everything. (Of course, this doesn't apply to gsub replacement strings.) But you can still see it in action if you use a string literal instead of a regexp literal in Regexp.new:
Regexp.new("\.").match("a") #=> #<MatchData "a">
Regexp.new("\\.").match("a") #=> nil
As you can see, we had to double-escape the . for it to be understood as a literal . by the regexp engine, since "." and "\." both evaluate to . in double-quoted strings, but we need the engine itself to receive \..
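One way to make the two layers visible is to print what a replacement literal contains after Ruby has parsed it, which is exactly what the regexp engine receives. A short sketch:

```ruby
# After Ruby's own parsing, each literal below is the same 4 characters:
# three backslashes followed by "1".
single = '\\\\\1'
double = "\\\\\\1"
puts single == double  # prints true
puts single            # prints \\\1

# The engine then reads \\ as a literal backslash and \1 as group 1.
puts "foo+bar".gsub(/(\+)/, single)  # prints foo\+bar

# With only three backslashes in single quotes the engine receives \\1,
# a literal backslash followed by a plain "1".
puts "foo+bar".gsub(/(\+)/, '\\\1')  # prints foo\1bar
```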
This happens because the string is escaped twice. You should use 5 backslashes in this case.
"foo+bar".gsub(/([+])/, '\\\\\1')
Adding \ two more times escapes this properly.
irb(main):011:0> puts "foo+bar".gsub(/(\+)/, '\\\\\1')
foo\+bar
=> nil
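An alternative that sidesteps the second layer of escaping is the block form of gsub, whose return value is inserted as-is rather than scanned for backreferences. A sketch; escape_plus is a hypothetical name:

```ruby
# Hypothetical helper: prefix every "+" with one backslash.
# The block's result is used verbatim, so a single escaped backslash
# ("\\" in the literal) is all that is needed.
def escape_plus(text)
  text.gsub(/\+/) { |match| "\\#{match}" }
end

puts escape_plus("foo+bar")  # prints foo\+bar
```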
