Stumped by a simple regex - ruby

I am trying to see if the string s contains any of the symbols in a regex. The regex below works fine on rubular.
s = "asd#d"
s =~ /[~!@#$%^&*()]+/
But in Ruby 1.9.2, it gives this error message:
syntax error, unexpected ']', expecting tCOLON2 or '[' or '.'
s = "asd#d"; s =~ /[~!@#$%^&*()]/
What is wrong?

This is actually a special case of string interpolation with global and instance variables that most seem not to know about. Since string interpolation also occurs within regex in Ruby, I'll illustrate below with strings (since they provide for an easier example):
@foo = "instancefoo"
$foo = "globalfoo"
"#@foo" # => "instancefoo"
"#$foo" # => "globalfoo"
Thus you need to escape the # to prevent it from being interpolated:
/[~!@\#$%^&*()]+/
The only way that I know of to create a non-interpolated regex in Ruby is from a string (note single quotes):
Regexp.new('[~!@#$%^&*()]+')
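Here is a minimal sketch of both fixes. It uses a shorter character class and a made-up global, $demo, neither of which comes from the question. It shows that #$name is expanded inside a regex literal, and that either an escaped \# or a single-quoted string handed to Regexp.new keeps the symbols literal.

```ruby
# $demo is a made-up global, used only to show the interpolation.
$demo = "b+"

# "#$demo" is expanded inside a regex literal just as in a double-quoted string.
def interpolated_source
  /a#$demo/.source
end

# Escaping the # keeps "#$" from starting an interpolation.
def escaped_symbols
  /[!\#$%&]+/
end

# A single-quoted string is never interpolated, so no escaping is needed.
def unescaped_symbols
  Regexp.new('[!#$%&]+')
end

puts interpolated_source            # prints ab+
puts(escaped_symbols =~ "asd#d")    # prints 3
puts(unescaped_symbols =~ "asd#d")  # prints 3
```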

I was able to replicate this behavior in 1.9.3p0. Apparently there is a problem with the '#$' combination. If you escape either it works. If you reverse them it works:
s =~ /[~!@$#%^&*()]+/
Edit: in Ruby 1.9 #$ invokes variable interpolation, even when followed by a % which is not a valid variable name.

I disagree, you need to escape the $; it's the end-of-string character.
s =~ /[~!@#\$%^&*()]/ # => 3
That is correct.

Related

Use ARGV[] argument vector to pass a regular expression in Ruby

I am trying to use gsub or sub on a regex passed through terminal to ARGV[].
Query in terminal: $ ruby script.rb input.json "\[\{\"src\"\:\"
Input file first 2 lines:
[{
"src":"http://something.com",
"label":"FOO.jpg","name":"FOO",
"srcName":"FOO.jpg"
}]
[{
"src":"http://something123.com",
"label":"FOO123.jpg",
"name":"FOO123",
"srcName":"FOO123.jpg"
}]
script.rb:
dir = File.dirname(ARGV[0])
output = File.new(dir + "/output_" + Time.now.strftime("%H_%M_%S") + ".json", "w")
open(ARGV[0]).each do |x|
x = x.sub(ARGV[1], '')
output.puts(x) if !x.nil?
end
output.close
This is very basic stuff really, but I am not quite sure on how to do this. I tried:
Regexp.escape with this pattern: [{"src":".
Escaping the characters and not escaping.
Wrapping the pattern between quotes and not wrapping.
Meditate on this:
I wrote a little script containing:
puts ARGV[0].class
puts ARGV[1].class
and saved it to disk, then ran it using:
ruby ~/Desktop/tests/test.rb foo /abc/
which returned:
String
String
The documentation says:
The pattern is typically a Regexp; if given as a String, any regular expression metacharacters it contains will be interpreted literally, e.g. '\d' will match a backslash followed by 'd', instead of a digit.
That means the argument only looks like a regular expression. It is a String, because ARGV can only return strings: everything on the command line arrives as text.
When we pass a string into sub, Ruby recognizes it's not a regular expression, so it treats it as a literal string. Here's the difference in action:
'foo'.sub('/o/', '') # => "foo"
'foo'.sub(/o/, '') # => "fo"
The first can't find "/o/" in "foo", so nothing changes. The second can find /o/, and returns the result after removing the first "o" (sub replaces only the first match).
Another way of looking at it is:
'foo'.match('/o/') # => nil
'foo'.match(/o/) # => #<MatchData "o">
where match finds nothing for the string but can find a hit for /o/.
And all that leads to what's happening in your code. Because sub is being passed a string, it's trying to do a literal match for the regex, and won't be able to find it. You need to change the code to:
sub(Regexp.new(ARGV[1]), '')
but that's not all that has to change. Regexp.new(...) will convert what's passed in into a regular expression, but if you're passing in '/o/' the resulting regular expression will be:
Regexp.new('/o/') # => /\/o\//
which is probably not what you want:
'foo'.match(/\/o\//) # => nil
Instead you want:
Regexp.new('o') # => /o/
'foo'.match(/o/) # => #<MatchData "o">
So, besides changing your code, you'll need to make sure that what you pass in is a valid expression, minus any leading and trailing /.
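The advice above could be sketched roughly like this. pattern_from_arg is a hypothetical helper name, not something from the original script, and stripping one leading and one trailing slash is an assumption about what the user is likely to type.

```ruby
# Hypothetical helper: turn a command-line string into a Regexp.
def pattern_from_arg(arg)
  # Drop one leading and one trailing "/" in case the user typed /abc/ out of habit.
  source = arg.sub(%r{\A/}, '').sub(%r{/\z}, '')
  Regexp.new(source)
end

puts 'foo'.sub('/o/', '')                      # prints foo (String pattern, literal match)
puts 'foo'.sub(pattern_from_arg('/o/'), '')    # prints fo
puts 'foo'.sub(pattern_from_arg('o'), '')      # prints fo
puts 'a1b22'.gsub(pattern_from_arg('\d+'), '') # prints ab
```

In the script this would be called once, before the loop, with ARGV[1] as the argument.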
Based on this answer in the thread Convert a string to regular expression ruby, you should use
x = x.sub(/#{ARGV[1]}/,'')
I tested it with this file (test.rb):
puts "You should not see any number [0123456789].".gsub(/#{ARGV[0]}/,'')
I called the file like so:
ruby test.rb "\d+"
# => You should not see any number [].
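If the argument is meant as literal text rather than as a pattern, like the [{"src":" prefix from the question, interpolating it unescaped will not work, because "[" opens a character class that is never closed. A small sketch, assuming a sample line in which the prefix and the URL sit on the same line (the helper names are made up for illustration):

```ruby
# Hypothetical helper: true if the string compiles as a regular expression.
def valid_pattern?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

# Hypothetical helper: remove the first literal occurrence of text from line.
def strip_literal(line, text)
  line.sub(Regexp.new(Regexp.escape(text)), '')
end

literal = '[{"src":"'                        # what the shell would hand to ARGV[1]
line    = '[{"src":"http://something.com",'  # assumed sample input

puts valid_pattern?(literal)       # prints false
puts strip_literal(line, literal)  # prints http://something.com",
puts line.sub(literal, '')         # same result: a String pattern is matched literally
```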

Regex string with grouping?

I see in the documentation I'm able to do:
/\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ "$3.67" #=> 0
puts dollars #=> prints 3
I was wondering if this would be possible:
string = "\$(\?<dlr>\d+)\.(\?<cts>\d+)"
/#{Regexp.escape(string)}/ =~ "$3.67"
I get:
`<main>': undefined local variable or method `dlr' for main:Object (NameError)
There are a few mistakes in your approach. First of all, let's look at your string:
string = "\$(\?<dlr>\d+)\.(\?<cts>\d+)"
You escape the dollar sign with "\$", but that is the same as just writing "$", consider:
"\$" == "$"
#=> true
To actually end up with the string "backslash followed by dollar" you would need to write "\\$". The same thing applies to the decimal character classes, you would have to write "\\d" to end up with the correct string.
The question marks on the other hand are actually part of the regex syntax, so you do not want to escape these at all. I recommend using single quotes for your original string, because that makes the input much easier:
string = '\$(?<dlr>\d+)\.(?<cts>\d+)'
#=> "\\$(?<dlr>\\d+)\\.(?<cts>\\d+)"
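To make the difference visible, here is a short sketch comparing the two quoting styles. The constant names are only for illustration.

```ruby
# The string from the question: double quotes silently drop the backslashes
# in front of $, ?, d and . because none of them is a string escape.
DOUBLE_QUOTED = "\$(\?<dlr>\d+)\.(\?<cts>\d+)"

# Single quotes keep every backslash, so the regex syntax survives.
SINGLE_QUOTED = '\$(?<dlr>\d+)\.(?<cts>\d+)'

puts DOUBLE_QUOTED  # prints $(?<dlr>d+).(?<cts>d+)
puts SINGLE_QUOTED  # prints \$(?<dlr>\d+)\.(?<cts>\d+)

p Regexp.new(DOUBLE_QUOTED) =~ "$3.67"  # => nil
p Regexp.new(SINGLE_QUOTED) =~ "$3.67"  # => 0
```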
The next issue is with Regexp.escape. Take a look at what regular expression it produces with the above string:
string = '\$(?<dlr>\d+)\.(?<cts>\d+)'
Regexp.escape(string)
#=> "\\\\\\$\\(\\?<dlr>\\\\d\\+\\)\\\\\\.\\(\\?<cts>\\\\d\\+\\)"
That's one level too much escaping. Regexp.escape can be used when you want to match the literal characters that are contained in the string. For example, the escaped regex above will match the source string itself:
/#{Regexp.escape(string)}/ =~ string
#=> 0 # matches at offset 0
Instead, you can use Regexp.new to treat the source as an actual regular expression.
The last issue is how you access the match result. You are getting a NameError because dlr is not defined. You might think that the match result is stored in local variables called dlr and cts, but that only happens when a regex literal without interpolation appears on the left-hand side of =~, which is not the case here. You have two options to access the match data:
Use Regexp.match, it will return a MatchData object as result
Use regexp =~ string and then access the last match data with the global variable $~
I prefer the former, because it is easier to read. The full code would then look like this:
string = '\$(?<dlr>\d+)\.(?<cts>\d+)'
regexp = Regexp.new(string)
result = regexp.match("$3.67")
#=> #<MatchData "$3.67" dlr:"3" cts:"67">
result[:dlr]
#=> "3"
result[:cts]
#=> "67"
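For comparison, here is a sketch of the three ways of getting at the captures. The method names are hypothetical, and the second one assumes the pattern defines a group called dlr.

```ruby
# Option 1: build the regex from a string and read the MatchData.
# Returns a Hash of group name to captured text, or nil when nothing matches.
def named_captures_for(source, text)
  match = Regexp.new(source).match(text)
  match && match.names.zip(match.captures).to_h
end

# Option 2: use =~ and read the last match from $~.
def dollars_via_last_match(source, text)
  Regexp.new(source) =~ text
  $~ ? $~[:dlr] : nil
end

# Local variables are only assigned for a regex literal without interpolation
# on the left-hand side of =~.
def dollars_and_cents_via_literal(text)
  if /\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ text
    [dollars, cents]
  end
end

source = '\$(?<dlr>\d+)\.(?<cts>\d+)'
captures = named_captures_for(source, "$3.67")
puts captures["dlr"]                             # prints 3
puts captures["cts"]                             # prints 67
puts dollars_via_last_match(source, "$3.67")     # prints 3
p dollars_and_cents_via_literal("$3.67")         # => ["3", "67"]
```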

Replacing regex capture with the same capture and an extra string

I am trying to escape certain characters in a string. In particular, I want to turn
abc/def.ghi into abc\/def\.ghi
I tried to use the following syntax:
1.9.3p125 :076 > "abc/def.ghi".gsub(/([\/.])/, '\\\1')
=> "abc\\1def\\1ghi"
Hmm. This behaves as if capture replacements didn't work. Yet, when I tried this:
1.9.3p125 :075 > "abc/def.ghi".gsub(/([\/.])/, '\1')
=> "abc/def.ghi"
... I got the replacement to work, but, of course, my prefixes weren't part of it.
What is the correct syntax to do something like this?
This should be easier
gsub(/(?=[.\/])/, "\\")
If you are trying to prepare a string to be used as a regex pattern, use the right tool:
Regexp.escape('abc/def.ghi')
=> "abc/def\\.ghi"
You can then use the resulting string to create a regex:
/#{ Regexp.escape('abc/def.ghi') }/
=> /abc\/def\.ghi/
or:
Regexp.new(Regexp.escape('abc/def.ghi'))
=> /abc\/def\.ghi/
From the docs:
Escapes any characters that would have special meaning in a regular expression. Returns a new escaped string, or self if no characters are escaped. For any string, Regexp.new(Regexp.escape(str))=~str will be true.
Regexp.escape('\*?{}.') #=> \\\*\?\{\}\.
You can pass a block to gsub:
>> "abc/def.ghi".gsub(/([\/.])/) {|m| "\\#{m}"}
=> "abc\\/def\\.ghi"
Not nearly as elegant as @sawa's answer, but it was the only way I could find to get it to work if you need the replacing string to contain the captured group/backreference (rather than inserting the replacement before the look-ahead).
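A backreference in the replacement string can also be made to work if the literal backslash is doubled once more. This is a sketch with made-up method names: in single quotes, '\\\\' is two backslashes in the string, which gsub reads as one literal backslash, and '\1' then inserts the capture.

```ruby
# Replacement string version: literal backslash followed by capture group 1.
def escape_with_backref(str)
  str.gsub(/([\/.])/, '\\\\\1')
end

# Block version: the block's return value is inserted as-is,
# so no replacement-string escaping rules apply.
def escape_with_block(str)
  str.gsub(/[\/.]/) { |m| "\\#{m}" }
end

puts escape_with_backref("abc/def.ghi")  # prints abc\/def\.ghi
puts escape_with_block("abc/def.ghi")    # prints abc\/def\.ghi
```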

Ruby remove everything except some characters?

How can I remove from a string all characters except white spaces, numbers, and some others?
Something like this:
oneLine.gsub(/[^ULDR0-9\<\>\s]/i,'')
I need only: 0-9 l d u r < > <space>
Also, is there a good document about the use of regex in Ruby, like a list of special characters with examples?
The regex you have is already working correctly. However, you do need to assign the result back to the string you're operating on. Otherwise, you're not changing the string (.gsub() does not modify the string in-place).
You can improve the regex a bit by adding a '+' quantifier (so consecutive characters can be replaced in one go). Also, you don't need to escape angle brackets:
oneLine = oneLine.gsub(/[^ULDR0-9<>\s]+/i, '')
A good resource with special consideration of Ruby regexes is the Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan. A good online tutorial by the same author is here.
Good old String#delete does this without a regular expression. The ^ means 'NOT'.
str = "12eldabc8urp pp"
p str.delete('^0-9ldur<> ') #=> "12ld8ur "
Just for completeness: you don't need a regular expression for this particular task, this can be done using simple string manipulation:
irb(main):005:0> "asdasd123".tr("^ULDRuldr0-9<>\t\r\n ", '')
=> "dd123"
There's also the tr! method if you want to replace the old value:
irb(main):009:0> oneLine = 'UasdL asd 123'
irb(main):010:0> oneLine.tr!("^ULDRuldr0-9<>\t\r\n ", '')
irb(main):011:0> oneLine
=> "UdL d 123"
This should be a bit faster as well (but performance shouldn't be a big concern in Ruby :)
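A small sketch putting the regex and the non-regex versions side by side (the method names are made up). One thing to watch with the character-set versions: \t, \r and \n are only whitespace characters inside double quotes; in single quotes '\t' is a backslash followed by the letter t.

```ruby
# Character-set version: everything not listed after ^ is deleted.
def keep_allowed(line)
  # Double quotes so that \t, \r and \n are real whitespace characters.
  line.delete("^ULDRuldr0-9<> \t\r\n")
end

# Regex version from the accepted answer, for comparison.
def keep_allowed_regex(line)
  line.gsub(/[^ULDR0-9<>\s]+/i, '')
end

p keep_allowed("UasdL asd\t123")        # => "UdL d\t123"
p keep_allowed_regex("UasdL asd\t123")  # => "UdL d\t123"
p 'a\tb'.length                         # => 4
p "a\tb".length                         # => 3
```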

How to strip leading and trailing quote from string, in Ruby

I want to strip leading and trailing quotes, in Ruby, from a string. The quote character will occur 0 or 1 time. For example, all of the following should be converted to foo,bar:
"foo,bar"
"foo,bar
foo,bar"
foo,bar
You could also use the chomp function, but unfortunately it only works at the end of the string. Assuming there was a reverse chomp, you could:
'"foo,bar"'.rchomp('"').chomp('"')
Implementing rchomp is straightforward:
class String
def rchomp(sep = $/)
self.start_with?(sep) ? self[sep.size..-1] : self
end
end
Note that you could also do it inline, with the slightly less efficient version:
'"foo,bar"'.chomp('"').reverse.chomp('"').reverse
EDIT: Since Ruby 2.5, rchomp(x) is available under the name delete_prefix, and chomp(x) is available as delete_suffix, meaning that you can use
'"foo,bar"'.delete_prefix('"').delete_suffix('"')
I can use gsub to search for the leading or trailing quote and replace it with an empty string:
s = "\"foo,bar\""
s.gsub!(/^\"|\"?$/, '')
As suggested by comments below, a better solution is:
s.gsub!(/\A"|"\Z/, '')
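The difference between the two patterns only shows up with newlines in the string. A sketch with hypothetical method names: ^ and $ match at every line boundary, \A and \z only at the very start and end, and \Z also matches just before a final newline.

```ruby
# Anchored to the whole string.
def strip_edge_quotes(str)
  str.gsub(/\A"|"\z/, '')
end

# Anchored to each line, kept for comparison.
def strip_line_quotes(str)
  str.gsub(/^"|"$/, '')
end

two_lines = "\"foo\n\"bar\""
p strip_edge_quotes(two_lines)  # => "foo\n\"bar"
p strip_line_quotes(two_lines)  # => "foo\nbar"

# \Z tolerates one trailing newline, \z does not.
p "foo\"\n".gsub(/"\Z/, '')     # => "foo\n"
p "foo\"\n".gsub(/"\z/, '')     # => "foo\"\n"
```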
As usual everyone grabs regex from the toolbox first. :-)
As an alternate I'll recommend looking into .tr('"', '') (AKA "translate") which, in this use, is really stripping the quotes.
Another approach would be
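def remove_quotations(str)
  if str.start_with?('"')
    str = str.slice(1..-1)
  end
  if str.end_with?('"')
    str = str.slice(0..-2)
  end
  str # return the result; without this the method returns nil when there is no trailing quote
end
remove_quotations('"foo,bar"') # => "foo,bar"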
It is without RegExps and start_with?/end_with? are nicely readable.
It frustrates me that strip only works on whitespace. I need to strip all kinds of characters! Here's a String extension that will fix that:
class String
def trim sep=/\s/
sep_source = sep.is_a?(Regexp) ? sep.source : Regexp.escape(sep)
pattern = Regexp.new("\\A(#{sep_source})*(.*?)(#{sep_source})*\\z")
self[pattern, 2]
end
end
Output
'"foo,bar"'.trim '"' # => "foo,bar"
'"foo,bar'.trim '"' # => "foo,bar"
'foo,bar"'.trim '"' # => "foo,bar"
'foo,bar'.trim '"' # => "foo,bar"
' foo,bar'.trim # => "foo,bar"
'afoo,bare'.trim /[aeiou]/ # => "foo,bar"
Assuming that quotes can only appear at the beginning or end, you could just remove all quotes, without any custom method:
'"foo,bar"'.delete('"')
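That assumption matters. A quick sketch of what happens when it does not hold, next to a version that only touches the ends (strip_outer_quotes is a made-up name, and delete_prefix/delete_suffix need Ruby 2.5 or later):

```ruby
# Removes at most one quote from each end and leaves the middle alone.
def strip_outer_quotes(str)
  str.delete_prefix('"').delete_suffix('"')
end

puts '"say "hi""'.delete('"')           # prints say hi
puts strip_outer_quotes('"say "hi""')   # prints say "hi"

# The four inputs from the question all come out as foo,bar.
['"foo,bar"', '"foo,bar', 'foo,bar"', 'foo,bar'].each do |input|
  puts strip_outer_quotes(input)
end
```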
I wanted the same but for slashes in url path, which can be /test/test/test/ (so that it has the stripping characters in the middle) and eventually came up with something like this to avoid regexps:
'/test/test/test/'.split('/').reject { |i| i.empty? }.join('/')
Which in this case translates obviously to:
'"foo,bar"'.split('"').select{|i| i != ""}.join('"')
or
'"foo,bar"'.split('"').reject{|i| i.empty?}.join('"')
Regexes can be pretty heavy and lead to some funky errors. If you are not dealing with massive strings and the data is pretty uniform, you can use a simpler approach.
If you know the strings have leading and trailing quotes, you can slice the ends off the string:
string = "'This has quotes!'"
trimmed = string[1..-2]
puts trimmed # "This has quotes!"
This can also be turned into a simple function:
# In this case, 34 is " and 39 is '; you can add other codes.
def trim_chars(string, char_codes=[34, 39])
  # String#[] returns a one-character String in Ruby 1.9+, so compare via ord.
  if string.length >= 2 && char_codes.include?(string[0].ord) && char_codes.include?(string[-1].ord)
    string[1..-2]
  else
    string
  end
end
You can strip non-optional quotes with scan:
'"foo"bar"'.scan(/"(.*)"/)[0][0]
# => "foo\"bar"
