I have a string that looks like this:
d = "foo\u00A0\bar"
When I check its length, Ruby reports 7 characters. I looked it up and found that \u00A0 is a non-breaking space. Could someone show me how to remove all the non-breaking spaces from a string?
If you do not care about the non-breaking space specifically, but about any "special" Unicode whitespace character that might appear in your string, you can remove them all using the POSIX bracket expression for whitespace:
s.gsub(/[[:space:]]/, '')
These bracket expressions (as opposed to matchers like \s) match not only ASCII characters, but all Unicode characters of a class.
For more details, see the Ruby documentation.
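As a rough illustration of the difference between \s and the bracket expression, here is a minimal sketch. The sample string and variable names are made up for the example.

```ruby
# Hypothetical sample: "foo", a non-breaking space (U+00A0), then "bar".
sample = "foo\u00A0bar"

# \s only covers ASCII whitespace, so the non-breaking space survives.
ascii_result = sample.gsub(/\s/, '')

# The POSIX bracket expression is Unicode-aware and removes it.
posix_result = sample.gsub(/[[:space:]]/, '')

puts ascii_result.length  # 7, nothing was removed
puts posix_result         # foobar
```

Keep in mind that [[:space:]] also removes ordinary spaces, tabs and newlines, so this only fits when all whitespace should go.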
irb(main):001:0> d = "foo\u00A0\bar"
=> "foo \bar"
irb(main):002:0> d.gsub("\u00A0", "")
=> "foo\bar"
It's an old thread but maybe it helps somebody.
I found myself looking for a solution to the same problem when I discovered that strip doesn't do the job. I used ord to check what the character was, and chr to reproduce it in gsub:
2.2.3 :010 > 160.chr("UTF-8")
=> " "
2.2.3 :011 > 160.chr("UTF-8").strip
=> " "
2.2.3 :012 > nbsp = 160.chr("UTF-8")
=> " "
2.2.3 :013 > nbsp.gsub(160.chr("UTF-8"),"")
=> ""
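I couldn't understand why strip doesn't remove something that looked like a space to me, so I looked up what character 160 actually is: the non-breaking space (U+00A0), which lies outside ASCII and is not among the characters strip removes.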
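If the goal is a strip that also handles the non-breaking space, something along these lines might do. This is only a sketch: strip_unicode_space is a hypothetical helper name, not a built-in method.

```ruby
nbsp = 160.chr("UTF-8")          # the same character as "\u00A0"
padded = "#{nbsp}hello#{nbsp}"

puts padded[0].ord               # 160
puts padded.strip == padded      # true: strip leaves the non-breaking spaces alone

# Hypothetical helper: trim any Unicode whitespace, but only at the two ends.
def strip_unicode_space(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
end

puts strip_unicode_space(padded) # hello
```

Unlike a plain gsub on the character, this leaves non-breaking spaces in the middle of the string untouched.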
d.gsub("\u00A0", "") does not work in Ruby 1.8. Instead, use d.gsub(/\302\240/, ""), which matches the two bytes of the UTF-8 encoding of U+00A0.
See http://blog.grayproductions.net/articles/understanding_m17n for lots more on the character encoding differences between 1.8 and 1.9.
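To see where the octal escapes \302\240 come from, one can inspect the bytes of the character in a current Ruby. A small sketch, with variable names chosen for the example:

```ruby
nbsp = "\u00A0"

bytes = nbsp.bytes                       # the UTF-8 encoding, one integer per byte
octal = bytes.map { |b| b.to_s(8) }      # the same bytes written in octal

p bytes  # [194, 160]
p octal  # ["302", "240"]
```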
I am trying to escape certain characters in a string. In particular, I want to turn
abc/def.ghi into abc\/def\.ghi
I tried to use the following syntax:
1.9.3p125 :076 > "abc/def.ghi".gsub(/([\/.])/, '\\\1')
=> "abc\\1def\\1ghi"
Hmm. This behaves as if capture replacements didn't work. Yet, when I tried this:
1.9.3p125 :075 > "abc/def.ghi".gsub(/([\/.])/, '\1')
=> "abc/def.ghi"
... I got the replacement to work, but, of course, my prefixes weren't part of it.
What is the correct syntax to do something like this?
This should be easier
gsub(/(?=[.\/])/, "\\")
If you are trying to prepare a string to be used as a regex pattern, use the right tool:
Regexp.escape('abc/def.ghi')
=> "abc/def\\.ghi"
You can then use the resulting string to create a regex:
/#{ Regexp.escape('abc/def.ghi') }/
=> /abc\/def\.ghi/
or:
Regexp.new(Regexp.escape('abc/def.ghi'))
=> /abc\/def\.ghi/
From the docs:
Escapes any characters that would have special meaning in a regular expression. Returns a new escaped string, or self if no characters are escaped. For any string, Regexp.new(Regexp.escape(str))=~str will be true.
Regexp.escape('\*?{}.') #=> \\\*\?\{\}\.
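A short sketch of why the escaping matters when the string is meant to be matched literally. The near-miss input is invented for the example.

```ruby
literal = 'abc/def.ghi'

loose  = Regexp.new(literal)                 # "." still matches any character
strict = Regexp.new(Regexp.escape(literal))  # "." must now be a literal dot

candidate = 'abc/defXghi'  # hypothetical near-miss input

p(candidate =~ loose)   # 0
p(candidate =~ strict)  # nil
p(literal =~ strict)    # 0
```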
You can pass a block to gsub:
>> "abc/def.ghi".gsub(/([\/.])/) {|m| "\\#{m}"}
=> "abc\\/def\\.ghi"
Not nearly as elegant as @sawa's answer, but it was the only way I could find to get it to work if you need the replacement string to contain the captured group/backreference (rather than inserting the replacement before the look-ahead).
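For what it's worth, a replacement string with a backreference can also work if the backslash is doubled once more. The sketch below compares it with the block form; the variable names are made up for the example.

```ruby
input = 'abc/def.ghi'

# The single-quoted '\\\\\1' holds the four characters \ \ \ 1. gsub reads the
# first two as one literal backslash and the last two as backreference \1.
with_backreference = input.gsub(/([\/.])/, '\\\\\1')

# Block form: the block's return value is inserted as-is, with no backslash
# processing, so "\\#{m}" is simply one backslash followed by the match.
with_block = input.gsub(/([\/.])/) { |m| "\\#{m}" }

puts with_backreference  # abc\/def\.ghi
puts with_block          # abc\/def\.ghi
```

The attempt with '\\\1' fails because that literal holds only \ \ 1, which gsub reads as an escaped backslash followed by a plain "1".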
I came upon a strange character (using Nokogiri).
irb(main):081:0> sss.dump
=> "\"\\u{a0}\""
irb(main):082:0> puts sss
=> nil
irb(main):083:0> sss
=> " "
irb(main):084:0> sss =~ /\s/
=> nil
irb(main):085:0> sss =~ /[[:print:]]/
=> 0
irb(main):087:0> sss == ' '
=> false
irb(main):088:0> sss.length
=> 1
Any idea what this strange character is?
When it's displayed in a webpage, it looks like whitespace, but it doesn't match \s in a regular expression. Ruby even thinks it's a printable character!
How do I detect characters like this and exclude them or flag them as whitespace (if possible)?
Thanks
It's the non-breaking space. In HTML, it's used pretty frequently and often written as &nbsp;. One way to find out the identity of a character like "\u{a0}" is to search the web for U+00A0 (using four or more hexadecimal digits) because that's how the Unicode specification notates Unicode code points.
The non-breaking space and other things like it are included in the regex /[[:space:]]/.
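One possible way to flag such characters is to collect everything [[:space:]] matches that \s does not, and report the code points. This is a sketch only; unusual_whitespace is a hypothetical helper and the sample text is invented.

```ruby
# Hypothetical helper: code points of whitespace characters that \s misses.
def unusual_whitespace(text)
  text.scan(/[[:space:]]/)
      .reject { |ch| ch =~ /\s/ }
      .map { |ch| format('U+%04X', ch.ord) }
end

# Sample with a non-breaking space, a normal space and an em space (U+2003).
sample = "a\u00A0b c\u2003d"

p unusual_whitespace(sample)                  # ["U+00A0", "U+2003"]

# Alternatively, turn every whitespace character into a plain space.
normalized = sample.gsub(/[[:space:]]/, ' ')
p normalized                                  # "a b c d"
```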
Here is some Ruby code that demonstrates the issue:
1.8.7 :018 > pattern[:key] = '554 5.7.1.*Service unavailable; Helo command .* blocked using'
=> "554 5.7.1.*Service unavailable; Helo command .* blocked using"
1.8.7 :019 > line = '554 5.7.1 Service unavailable; Helo command [abc] blocked using dbl'
=> "554 5.7.1 Service unavailable; Helo command [abc] blocked using dbl"
1.8.7 :020 > line =~ /554 5.7.1.*Service unavailable; Helo command .* blocked using/
=> 0
1.8.7 :021 > line =~ /pattern[:key]/
=> nil
1.8.7 :022 >
The regex works when I write it directly as a literal, but not when I use it as the value of a hash.
This isn't a Ruby question per se, it's how to construct a regex pattern that accomplishes what you want.
In "regex-ese", /pattern[:key]/ means:
Find pattern.
Following pattern look for one of :, k, e or y.
Ruby doesn't automatically interpolate variables in strings or regex patterns like Perl does, so, instead, we have to mark where the variable is using #{...} inline.
If pattern[:key] is the entire pattern, don't bother interpolating it into a regex literal. Instead, take the direct path and let Regexp do it for you:
pattern[:key] = 'foo'
Regexp.new(pattern[:key])
=> /foo/
Which is the same result as:
/#{pattern[:key]}/
=> /foo/
but doesn't waste CPU cycles.
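Putting the pieces side by side with the data from the question, the difference can be sketched like this. The variable names literal, interpolated and constructed are made up for the example.

```ruby
pattern = { :key => '554 5.7.1.*Service unavailable; Helo command .* blocked using' }
line = '554 5.7.1 Service unavailable; Helo command [abc] blocked using dbl'

literal      = /pattern[:key]/            # the text "pattern" plus a character class
interpolated = /#{pattern[:key]}/         # the hash value becomes the pattern
constructed  = Regexp.new(pattern[:key])  # same pattern, built directly

p literal.source           # "pattern[:key]"
p(line =~ literal)         # nil
p('patterne' =~ literal)   # 0, "pattern" followed by one of : k e y
p(line =~ interpolated)    # 0
p(line =~ constructed)     # 0
```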
Another of your attempts used ., [ and ], which are reserved characters in patterns, used to help define patterns. If you need to use such characters, you can have Ruby's Regexp.escape add \ escape characters appropriately, preserving their normal/literal meaning in the string:
Regexp.escape('5.7.1 [abc]')
=> "5\\.7\\.1\\ \\[abc\\]"
which, in real life is "5\.7\.1\ \[abc\]" (when not being displayed in IRB)
To use that in a regex, use:
Regexp.new(Regexp.escape('5.7.1 [abc]'))
=> /5\.7\.1\ \[abc\]/
line =~ /#{pattern[:key]}/
or...
line =~ Regexp.new(pattern[:key])
If you want to escape regex special characters...
line =~ /#{Regexp.quote pattern[:key]}/
Edit: Since you're new to ruby, may I suggest you do this, wherever pattern is defined:
pattern[:key] = Regexp.new '554 5.7.1.*Service unavailable; Helo command .* blocked using'
Then you can simply use the Regexp object stored in pattern:
line =~ pattern[:key]
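Storing Regexp objects also makes it easy to check a line against several patterns. A minimal sketch, assuming a hypothetical table of patterns; the :mailbox_full entry is invented for illustration.

```ruby
# Hypothetical table of precompiled patterns, keyed by a label.
patterns = {
  :helo_blocked => Regexp.new('554 5.7.1.*Service unavailable; Helo command .* blocked using'),
  :mailbox_full => Regexp.new('552 .*mailbox full')
}

line = '554 5.7.1 Service unavailable; Helo command [abc] blocked using dbl'

# Hash#find yields [key, value] pairs and returns the first pair whose
# regex matches, or nil when nothing matches.
label, _regex = patterns.find { |_name, regex| line =~ regex }

p label  # :helo_blocked
```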
Shouldn't the diacritical mark above the "a" be removed by the regex?
"hǎo".gsub(/\p{Nonspacing_Mark}/, '')
=> "hǎo"
"hǎo".gsub(/\p{Mn}/, '')
=> "hǎo"
Update:
I kind of get it from how it works in Java.
Normalizer.normalize("hǎo", Form.NFD).replaceAll("\\p{Mn}+", "")
I need to normalize it first to split the "ǎ" into "a" and the diacritical mark.
puts UnicodeUtils.nfkd("ﻺ (hǎo)").gsub(/[\p{Nonspacing_Mark}]/, '')
See How to replace the Unicode gem on Ruby 1.9?
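On Ruby 2.2 and later, the same idea can be sketched without an extra gem by using String#unicode_normalize from the standard library. That method is not mentioned in the answers above, so treat this as an alternative under that version assumption.

```ruby
word = "h\u01CEo"  # "hǎo" written with the precomposed letter ǎ (U+01CE)

# Without normalization there is no separate mark to remove.
unchanged = word.gsub(/\p{Mn}/, '')

# NFD splits ǎ into "a" followed by U+030C (combining caron).
decomposed = word.unicode_normalize(:nfd)
stripped = decomposed.gsub(/\p{Mn}/, '')

p unchanged == word   # true
p decomposed.length   # 4
p stripped            # "hao"
```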
What is the Ruby idiomatic way for retrieving a single character from a string as a one-character string? There is the str[n] method of course, but (as of Ruby 1.8) it returns a character code as a fixnum, not a string. How do you get to a single-character string?
In Ruby 1.9, it's easy. In Ruby 1.9, Strings are encoding-aware sequences of characters, so you can just index into it and you will get a single-character string out of it:
'µsec'[0] # => 'µ'
However, in Ruby 1.8, Strings are sequences of bytes and thus completely unaware of the encoding. If you index into a string and that string uses a multibyte encoding, you risk indexing right into the middle of a multibyte character (in this example, the 'µ' is encoded in UTF-8):
'µsec'[0] # => 194
'µsec'[0].chr # => Garbage
'µsec'[0,1] # => Garbage
However, Regexps and some specialized string methods support at least a small subset of popular encodings, among them some Japanese encodings (e.g. Shift-JIS) and (in this example) UTF-8:
'µsec'.split('')[0] # => 'µ'
'µsec'.split(//u)[0] # => 'µ'
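In Ruby 1.9 and later, the character view and the byte view of the same string can be compared directly. A small sketch, assuming a UTF-8 string:

```ruby
word = "\u00B5sec"   # "µsec"; µ takes two bytes in UTF-8

p word[0]            # "µ", indexing works on characters
p word.chars.first   # "µ"
p word.length        # 4 characters
p word.bytesize      # 5 bytes
p word.getbyte(0)    # 194, the value Ruby 1.8 returned for word[0]

# Cutting after the first byte splits the character in half.
p word.byteslice(0, 1).valid_encoding?  # false
```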
Before Ruby 1.9:
'Hello'[1].chr # => "e"
Ruby 1.9+:
'Hello'[1] # => "e"
A lot has changed in Ruby 1.9 including string semantics.
Should work for Ruby before and after 1.9:
'Hello'[2,1] # => "l"
Please see Jörg Mittag's comment: this is correct only for single-byte character sets.
'abc'[1..1] # => "b"
'abc'[1].chr # => "b"