I was trying to use gsub to remove non word characters in a string in a rails app. I used the following code:
somestring.gsub(/[\W]/i, '') #=> ""
but it is actually incorrect, it will remove letter k as well. The correct one should be:
somestring.gsub(/\W/i, '') #=> "kkk"
But my problem is that the unit test of a rails controller which contains the above code using rspec does not work, the unit test actually passes. So I created a pretty extreme test case in rspec
it "test this gsub" do
'kkk'.gsub(/[\W]/i, '').should == 'kkk'
end
the above test case should fail, but it actually passes. What is the problem here? Why would the test pass?
Ruby 1.9 switched to a different regular expression engine (Oniguruma), which accounts for the behavior change. This seems like a bug in it.
For your example, you can get around the issue by not specifying a case insensitive match:
irb(main):001:0> 'kkk'.gsub(/[\W]/i, '')
=> ""
irb(main):002:0> 'kkk'.gsub(/[\W]/, '')
=> "kkk"
irb(main):004:0> 'kkk'.gsub(/\W/i, '')
=> "kkk"
irb(main):003:0> 'kkk'.gsub(/\W/, '')
=> "kkk"
Update: It looks like removing the character group is another approach. It might be that negated matches like that aren't necessarily valid in a character group?
Related
Using Ruby, I am trying to weed out spam messages the manual way, so why exactly does the below test return false when it should return true? The tested string is the original one, so you can literally copy/paste the whole thing into your ruby console to verify this example:
irb(main):053:0> "Веautiful women fоr sеx in yоur town АU: https://links.wtf/qLFs".include? "sex"
=> false
Hint: If you replace the word "sex" inside the entire string by typing it in yourself, the test will return true as expected. So, somehow, the two "sex" strings are not the same, but on what level? How to test that correctly?
EDIT:
I have narrowed it all down to this (copy/paste it to test it!):
irb(main):073:0> "е" == "e"
=> false
JavaScript's charCodeAt method tells me that the two characters are a different Unicode value. Ruby's .ord method tells me the same thing. You could check against those Unicode values more literally in Ruby, but I'd recommend finding a way to normalize the data instead of adding endless conditionals for unusual characters. It looks like that is a 0x0435 1077 CYRILLIC SMALL LETTER IE е according to a Unicode lookup table I found online.
Alternatively, here's one approach where you could just ban all Cyrillic characters. I used a full range of excluded characters so you could add exclusions as needed.
#!/usr/bin/env ruby
CYRILLIC_UNICODE_DECIMALS = *(1024..1273).freeze
for arg in ARGV
# next unless arg.is_a?(String)
arg.split('').each do |char|
p char if CYRILLIC_UNICODE_DECIMALS.include?(char.ord)
end
end
For reference, these are the .ord and .charCodeAt methods I used against your example. I started with JavaScript because it's a simple test in the browser console.
2.6.3 :005 > 'е'.ord
=> 1077
2.6.3 :006 > 'e'.ord
=> 101
'"е" == "e"'.charCodeAt(1)
1077
'"e" == "e"'.charCodeAt(1)
101
Here are my test cases.
Expected:
JUNKINFRONThttp://francium.tech should be http://francium.tech
JUNKINFRONThttp://francium.tech/http should be http://francium.tech/http
francium.tech/http should be francium.tech/http (unaffected)
Actual result:
http://francium.tech
francium.tech/http
http
I am trying to write a regex replace for this. I tried this,
text.sub(/.*http/,'http')
However, my second and third test cases fail because it searches till the end. It would help if the answer could also do the case insensitivity.
2.5.0 :001 > url = 'francium.tech/http'
=> "francium.tech/http"
2.5.0 :002 > url.sub(/^.*?(?=http)/i,'')
=> "http"
As per my original comments, you can use the pattern as shown below. If you want a really small performance gain, you can remove one step in the regex by using the second pattern instead. If you're especially concerned with performance, the last one performs even quicker.
^.*?(?=https?://)
^.*?(?=https?:/{2})
^.*?(?=ht{2}ps?:/{2})
See code in use here
strings = [
"JUNKINFRONThttp://francium.tech",
"JUNKINFRONThttp://francium.tech/http",
"francium.tech/http"
]
strings.each { |s| puts s.sub(%r{^.*?(?=https?://)}, '') }
Outputs the following:
http://francium.tech
http://francium.tech/http
francium.tech/http
I think this may solve your problem.
str1 = 'JUNKINFRONThttp://francium.tech'# should be http://francium.tech
str2 = 'JUNKINFRONThttp://francium.tech/http'# should be http://francium.tech/http
str3 = 'francium.tech/http' #should be francium.tech/http (unaffected)
str4 = 'JUNKINFRONThttps://francium.tech/http'# should be https://francium.tech/http
[str1, str2, str3, str4].each do |str|
puts str.gsub(/^.*(http|https):\/\//i, "\\1://")
end
Result:
http://francium.tech
http://francium.tech/http
francium.tech/http
https://francium.tech/http
When using regex you should make sure to use unique strings like http:\\ or better http:\\[SOMETHING].[AT_LEAST_TWO_CHARS][MAYBE_A_SLASH] and so on...
This works for your given cases:
str = ['JUNKINFRONThttp://francium.tech',
'JUNKINFRONThttp://francium.tech/http',
'francium.tech/http']
str.each do |str|
puts str.sub(/^.*?(https?:\/{2})/, '\1') # with capturing group
puts str.sub(/^.*?(?=https?:\/{2})/, '') # with positive lookahead
end
By using a group we can use it for the replacement, another method would be to use a positive lookahead
I have just started playing with Ruby and I'm stuck on something. Is
there some trick to modify the casefold attribute of a Regexp object after
it's been instantiated?
The best idea what I tried is the following:
irb(main):001:0> a = Regexp.new('a')
=> /a/
irb(main):002:0> aA = Regexp.new(a.to_s, Regexp::IGNORECASE)
=> /(?-mix:a)/i
But none of the below seems to work:
irb(main):003:0> a =~ 'a'
=> 0
irb(main):004:0> a =~ 'A'
=> nil
irb(main):005:0> aA =~ 'a'
=> 0
irb(main):006:0> aA =~ 'A'
=> nil
Something I don't understand is happening here. Where did the 'i' go on line
8?
irb(main):07:0> aA = Regexp.new(a.to_s, Regexp::IGNORECASE)
=> /(?-mix:a)/i
irb(main):08:0> aA.to_s
=> "(?-mix:a)"
irb(main):09:0>
I am using Ruby 1.9.3.
I am also unable understand the below code: why returning false:
/(?i:a)/.casefold? #=> false
As your console output shows, a.to_s includes the case sensitiveness as an option for your subexpression, so aA is being defined as
/(?-mix:a)/i
so you're asking ruby for a regular expression that is case insensitive, but the only thing in that case insensitive regexp is a group for when case sensitivity has be turned on, so the net effect is that 'a' is matched case sensitively
Since the result of to_s is just the regular expression string itself - no delimiters or external flags - the flags are translated into the (?i:...) syntax that sets or clears them temporarily inside the expression itself. This lets you get a Regexp object back out via a simple Regexp.new(s) call that will match the same strings.
The wrapping, unfortunately, includes explicitly clearing the flags that are not set on the object. So your first regex gets stringified into something between (?:-i...) - that is, the casefold option is explicitly turned off between the parentheses. Turning it back on for the object doesn't have any effect.
You can use a.source instead of a.to_s to get just the original expression, without the flag settings:
irb(main):001:0> a=/a/
=> /a/
irb(main):002:0> aA = Regexp.new(a.source, Regexp::IGNORECASE)
=> /a/i
irb(main):003:0> a =~ 'a'
=> 0
irb(main):004:0> a =~ 'A'
=> nil
irb(main):005:0> aA =~ 'a'
=> 0
irb(main):006:0> aA =~ 'A'
=> 0
As Frederick already explains, calling to_s on a regex will add modifiers around it that ensure that its properties like case-sensitiveness are preserved. So if you insert a case-sensitive regex into a case-insensitive regex, the inserted part will still be case-sensitive. Likewise the modifiers given to Regexp.new will have no effect if the first argument is a regex or the result of calling to_s on one.
To solve this issue, call source on the regex instead of to_s. Unlike to_s, source simply returns the source of regex without adding anything:
aA = Regexp.new(a.source, Regexp::IGNORECASE)
I am also unable understand the below code: why returning false:
/(?i:a)/.casefold?
Because (?i:...) sets the i flag locally, not globally. It only applies to the part of the regex within the parentheses, not the whole regex. Of course in this case the whole regex is within the parentheses, but that doesn't matter as far as methods like casefold? are concerned.
Take this snippet of code which is supposed to replace a href tag with its URL:
irb> s='<p>Click here!</p>'
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
=> "<p></p>"
This regex fails (URL is not found). Then I escape the < character in the regex, and it works:
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^\<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/57f7e805827f</p>"
1: According to RubyMine's inspections, this escape should not be necessary. Is this correct? If so, why is the escape of > apparently not necessary as well?
2: Afterwards in the same IRB session, with the same string, the original regex suddenly works too:
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/57f7e805827f</p>"
Is this because the $1 variable is not cleared when calling gsub again? If so, is it intentional behaviour or is this a Ruby regex bug?
3: When I change the string, and reexecute the same command, $1 will only change after calling gsub twice on the changed string:
irb> s='<p>Click here!</p>'
=> "<p>Click here!</p>"
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^\<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/57f7e805827f</p>"
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^\<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/xxxxyyy</p>"
Is this intentional? If so, what is the logic behind this?
4: As replacement character, some tutorials suggest using "#{$n}", others suggest using '\n'. With the backslash variant, the problems above do not appear. Why - what is the difference between the two?
Thank you!
$1 contains the first capture of the last match. In your example, it is evaluated before the matching (actually even before gsub is called), therefore the value of $1 is fixed to nil (because you did not match anything, yet). So you always get the first capture of the previous match, you do not even need to change your original regex to get the expected result the second time:
s='<p>Click here!</p>'
s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
# => "<p></p>"
s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
# => "<p>http://localhost/activate/57f7e805827f</p>"
You can pass a block to gsub though, which is evaluated after the matching, e. g.
s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/){ $1 }
# => "<p>http://localhost/activate/57f7e805827f</p>"
This way, $1 behaves as you'd expect. I like to always use named captures so i don't have to keep track of the numbers when i add a capture, though:
s.gsub(/<a href="(?<href>([^ '"]*))"([^>]*)?>([^<]*)<\/a>/){ $~[:href] }
# => "<p>http://localhost/activate/57f7e805827f</p>"
Does Ruby have a some_string.starts_with("abc") method that's built in?
It's called String#start_with?, not String#startswith: In Ruby, the names of boolean-ish methods end with ? and the words in method names are separated with an _. On Rails you can use the alias String#starts_with? (note the plural - and note that this method is deprecated). Personally, I'd prefer String#starts_with? over the actual String#start_with?
Your question title and your question body are different. Ruby does not have a starts_with? method. Rails, which is a Ruby framework, however, does, as sepp2k states. See his comment on his answer for the link to the documentation for it.
You could always use a regular expression though:
if SomeString.match(/^abc/)
# SomeString starts with abc
^ means "start of string" in regular expressions
If this is for a non-Rails project, I'd use String#index:
"foobar".index("foo") == 0 # => true
You can use String =~ Regex. It returns position of full regex match in string.
irb> ("abc" =~ %r"abc") == 0
=> true
irb> ("aabc" =~ %r"abc") == 0
=> false