Shouldn't the diacritical mark above "a" be removed by the regex?
"hǎo".gsub(/\p{Nonspacing_Mark}/, '')
=> "hǎo"
"hǎo".gsub(/\p{Mn}/, '')
=> "hǎo"
Update:
I kind of get it from how it works in Java.
Normalizer.normalize("hǎo", Form.NFD).replaceAll("\\p{Mn}+", "")
I need to normalize it first, to split the "ǎ" into "a" and the diacritical mark.
puts UnicodeUtils.nfkd("ﻺ (hǎo)").gsub(/[\p{Nonspacing_Mark}]/, '')
See How to replace the Unicode gem on Ruby 1.9?
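On Ruby 2.2 and later the same normalize-then-strip idea can be sketched without an extra gem, assuming the standard String#unicode_normalize method is available. The helper name strip_marks is made up for illustration.

```ruby
# Hypothetical helper: decompose the string, then drop nonspacing marks.
def strip_marks(str)
  # NFD splits a precomposed letter such as U+01CE into "a" + U+030C.
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

word = "h\u01CEo"               # "hǎo" written with a precomposed ǎ
puts word.gsub(/\p{Mn}/, '')    # unchanged, there is no separate mark yet
puts strip_marks(word)          # hao
```

The first gsub leaves the string alone because the precomposed ǎ is a single letter, not a letter followed by a mark.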
In ruby, I am trying to convert
'\"\"'
to
'""'
In short, what's the cleanest way to remove the backslashes?
You can use the gsub method on any string to remove unwanted characters.
some_string.gsub('\\"', '"')
Yet another:
'\"\"'.delete("\\") # => "\"\""
Use String#tr
s = '\"\"'
s.tr('\\', '') # => "\"\""
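A small sketch of what the string actually contains may help when comparing these answers. It assumes the input really is the four-character single-quoted literal from the question; the variable names are illustrative only.

```ruby
s = '\"\"'                      # single quotes: both backslashes are literal
puts s.length                   # 4 characters: \ " \ "

# gsub only removes a backslash that sits directly before a double quote.
by_gsub = s.gsub('\\"', '"')

# delete removes every backslash, wherever it appears.
by_delete = s.delete('\\')

puts by_gsub                    # ""
puts by_delete                  # ""
```

For this input both approaches give the same result; they differ only when backslashes appear in front of other characters.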
I need to replace certain ASCII characters like # and & with their hex representations for a URL, which would be 23 and 26 respectively.
How can I do this in Ruby? There are also some characters, most notably '-', which do not need to be replaced.
require 'uri'
URI.escape str, /[#&]/
Obviously, you can widen the regex with more characters you want to escape. Or, if you want to do a whitelisting approach, you can do, say,
URI.escape str, /[^-\w]/
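Note that URI.escape was removed in Ruby 3.0, so the calls above only run on older versions. A minimal sketch of the same behaviour using only String#gsub follows; the method name percent_encode and its default pattern are assumptions for illustration, not part of any library.

```ruby
# Hypothetical helper: percent-encode every character matched by `unsafe`.
def percent_encode(str, unsafe = /[#&]/)
  str.gsub(unsafe) do |ch|
    # Encode each byte of the matched character as %XX (uppercase hex).
    ch.bytes.map { |b| format('%%%02X', b) }.join
  end
end

puts percent_encode('a#b&c-d')            # a%23b%26c-d
puts percent_encode('a b-c_d', /[^-\w]/)  # a%20b-c_d
```

The second call mirrors the whitelisting approach: everything except '-' and word characters is escaped.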
This is ruby, so there's a mandatory 20 different ways to do it. Here's mine:
>> a = 'one&two%three'
=> "one&two%three"
>> a.gsub(/[&%]/, '&' => '&'.ord, '%' => '%'.ord)
=> "one38two37three"
I'm pretty sure Ruby has this functionality built in for URLs. However, if you want to define some more general translation facility you may use code like the following:
s = "h#llo world"
t = { " " => "%20", "#" => "%23" }
puts s.split(//).map { |c| t[c] || c }.join
Which would output
h%23llo%20world
In the above code, t is a hash defining the mapping from specific characters to their representation. The string is broken into characters and the hash is searched for each character's equivalent.
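The same hash lookup can also be sketched with gsub, which accepts a hash as its replacement argument. The names table and pattern are illustrative, and the hex values assume the ASCII codes 0x20 for space and 0x23 for #.

```ruby
table = { ' ' => '%20', '#' => '%23' }   # character => replacement

# Regexp.union escapes each key and builds an alternation of them.
pattern = Regexp.union(table.keys)

# Each match is looked up in the hash and replaced by its value.
encoded = 'h#llo world'.gsub(pattern, table)
puts encoded                             # h%23llo%20world
```

This avoids splitting the string into an array of characters and joining it again.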
More generically and easily:
require 'uri'
URI.escape(your_string, Regexp.new("[^#{URI::PATTERN::UNRESERVED}]"))
I have a string that looks like this:
d = "foo\u00A0\bar"
When I check the length, it says that it is 7 characters long. I checked online and found out that \u00A0 is a non-breaking space. Could someone show me how to remove all the non-breaking spaces in a string?
In case you do not care about the non-breaking space specifically, but about any "special" unicode whitespace character that might appear in your string, you can replace it using the POSIX bracket expression for whitespace:
s.gsub(/[[:space:]]/, '')
These bracket expressions (as opposed to matchers like \s) do not only match ASCII characters, but all unicode characters of a class.
For more details see the ruby documentation
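A short sketch of the difference, assuming Ruby 1.9 or later and a UTF-8 string; the variable names are illustrative.

```ruby
d = "foo\u00A0bar"                        # contains a non-breaking space

with_backslash_s = d.gsub(/\s/, '')       # \s only covers ASCII whitespace
with_posix_class = d.gsub(/[[:space:]]/, '')

puts with_backslash_s.length              # 7, the non-breaking space is still there
puts with_posix_class                     # foobar
```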
irb(main):001:0> d = "foo\u00A0\bar"
=> "foo \bar"
irb(main):002:0> d.gsub("\u00A0", "")
=> "foo\bar"
It's an old thread but maybe it helps somebody.
I found myself looking for a solution to the same problem when I discovered that strip doesn't do the job. I checked with the ord method what the character was, and used chr to represent it in gsub.
2.2.3 :010 > 160.chr("UTF-8")
=> " "
2.2.3 :011 > 160.chr("UTF-8").strip
=> " "
2.2.3 :012 > nbsp = 160.chr("UTF-8")
=> " "
2.2.3 :013 > nbsp.gsub(160.chr("UTF-8"),"")
=> ""
I couldn't understand why strip doesn't remove something that looked like a space to me, so I checked here what character 160 actually is (it lies outside the ASCII range).
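If the goal is a strip that also trims non-breaking spaces at the ends, one possible sketch is a regex anchored to the start and end of the string. This assumes Ruby 1.9 or later; the variable names are made up for illustration.

```ruby
nbsp   = "\u00A0"
padded = "#{nbsp} foo #{nbsp}"

puts nbsp.ord                    # 160

# String#strip treats the non-breaking space as ordinary text,
# so nothing is removed from either end here.
stripped = padded.strip

# Trim any run of Unicode whitespace at the start or end instead.
trimmed = padded.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
puts trimmed                     # foo
```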
d.gsub("\u00A0", "") does not work in Ruby 1.8. Instead use d.gsub(/\302\240/,"")
See http://blog.grayproductions.net/articles/understanding_m17n for lots more on the character encoding differences between 1.8 and 1.9.
I'm trying to search a text for a match and return it with a snippet around it. For this, I want to find the match with a regex, then cut the string using the match index +- the snippet radius (text.mb_chars[start..finish]).
However, I cannot get Ruby's (1.8) regex to return a match index that is multi-byte aware.
I understand that regex is one place in 1.8 which is supposed to be UTF-8 aware, but it doesn't seem to work despite the /u switch:
"Résumé" =~ /s/u
=> 3
"Resume" =~ /s/u
=> 2
The result should be the same if the regex were really working in multibyte mode (/u), but it's returning the byte index.
How do you get the match index in characters, not bytes?
Or is there maybe some other way to get a snippet around (each) match?
Not a real answer, but too long for a comment.
The code
print "Résumé" =~ /s/u
print "\n"
print "Resume" =~ /s/u
on Windows (Ruby 1.8.6, release 26.) prints:
2
2
And on Linux (ruby 1.8.7 (2009-06-12 patchlevel 174) [i486-linux]) it prints:
3
2
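How about using this jindex function I wrote, which corresponds to the other methods in the jcode library:
require 'jcode'    # Ruby 1.8 only: provides String#jlength
require 'English'  # provides $PREMATCH as an alias for $`
$KCODE = 'u'       # assumed: needed so that // splits on UTF-8 characters

class String
  # Character-aware slice; takes the same arguments as Array#[].
  def jslice(*args)
    split(//)[*args].join rescue ""
  end

  # Character index of the first match at or after start, or nil.
  def jindex(match, start = 0)
    match = Regexp.new(Regexp.escape(match)) if match.is_a?(String)
    if jslice(start..-1) =~ match
      $PREMATCH.jlength + start
    else
      nil
    end
  end
end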
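For comparison, on Ruby 1.9 and later (not the 1.8 the question targets) strings are encoding-aware and match offsets are already counted in characters, so the snippet can be sketched directly. The method name snippet and its parameters are hypothetical.

```ruby
# Hypothetical helper: return the match plus `radius` characters on each side.
def snippet(text, pattern, radius)
  m = pattern.match(text)
  return nil unless m

  # MatchData#begin and #end are character offsets on Ruby 1.9+.
  start  = [m.begin(0) - radius, 0].max
  finish = m.end(0) + radius - 1
  text[start..finish]
end

text = "R\u00E9sum\u00E9 writing"   # "Résumé writing"
puts text =~ /s/                    # 2, a character index
puts snippet(text, /s/, 2)          # Résum
```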
What is the Ruby idiomatic way for retrieving a single character from a string as a one-character string? There is the str[n] method of course, but (as of Ruby 1.8) it returns a character code as a fixnum, not a string. How do you get to a single-character string?
In Ruby 1.9 it's easy: Strings are encoding-aware sequences of characters, so you can just index into one and get a single-character string back:
'µsec'[0] # => 'µ'
However, in Ruby 1.8, Strings are sequences of bytes and thus completely unaware of the encoding. If you index into a string and that string uses a multibyte encoding, you risk indexing right into the middle of a multibyte character (in this example, the 'µ' is encoded in UTF-8):
'µsec'[0] # => 194
'µsec'[0].chr # => Garbage
'µsec'[0,1] # => Garbage
However, Regexps and some specialized string methods support at least a small subset of popular encodings, among them some Japanese encodings (e.g. Shift-JIS) and (in this example) UTF-8:
'µsec'.split('')[0] # => 'µ'
'µsec'.split(//u)[0] # => 'µ'
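On Ruby 1.9 and later the character and byte views can be compared side by side. This is only a sketch, assuming a UTF-8 string; the variable names are illustrative.

```ruby
s = "\u00B5sec"                 # "µsec", where µ takes two bytes in UTF-8

first_char = s[0]               # indexing works on characters
first_byte = s.getbyte(0)       # the raw first byte, as 1.8's s[0] returned

puts first_char                 # µ
puts first_byte                 # 194
puts s.length                   # 4 characters
puts s.bytesize                 # 5 bytes
```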
Before Ruby 1.9:
'Hello'[1].chr # => "e"
Ruby 1.9+:
'Hello'[1] # => "e"
A lot has changed in Ruby 1.9 including string semantics.
Should work for Ruby before and after 1.9:
'Hello'[2,1] # => "l"
Please see Jörg Mittag's comment: this is correct only for single-byte character sets.
'abc'[1..1] # => "b"
'abc'[1].chr # => "b"