I have the following string:
Ruby :: \u041D\u043E\u0432\u0438\u043D\u0438
My question is how to convert it to UTF-8 characters (in my case, Cyrillic letters)?
In Ruby 1.9:
"\u041D\u043E\u0432\u0438\u043D\u0438".encode("UTF-8")
=> "Новини"
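Note that in a Ruby 1.9+ source file the "\u041D…" literal already contains the Cyrillic characters, so the encode call is essentially a no-op. If the escapes instead arrive as literal text at runtime (e.g. read from a file), a small gsub can decode them; decode_u_escapes below is a hypothetical helper name, not part of the stdlib:

```ruby
# Decode literal "\uXXXX" escape sequences appearing in runtime data.
# (decode_u_escapes is a hypothetical helper, not a stdlib method.)
def decode_u_escapes(s)
  s.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack('U') }
end

# Single-quoted, so the string really contains backslash-u sequences:
puts decode_u_escapes('Ruby :: \u041D\u043E\u0432\u0438\u043D\u0438')
# => Ruby :: Новини
```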
I have a requirement where I want to dynamically create a Unicode string using interpolation. For example, see the following code tried out in irb:
2.1.2 :016 > hex = 0x0905
=> 2309
2.1.2 :017 > b = "\u#{hex}"
SyntaxError: (irb):17: invalid Unicode escape
b = "\u#{hex}"
The hex code 0x0905 corresponds to the Unicode codepoint for DEVANAGARI LETTER A, an independent vowel.
I am unable to figure how to achieve the desired result.
You can pass an encoding to Integer#chr:
hex = 0x0905
hex.chr('UTF-8') #=> "अ"
The argument can be omitted if Encoding::default_internal is set to UTF-8:
$ ruby -E UTF-8:UTF-8 -e "p 0x0905.chr"
"अ"
You can also append codepoints to other strings:
'' << hex #=> "अ"
String interpolation happens after Ruby decodes the escapes, so Ruby parses what you wrote as an incomplete Unicode escape.
To create a Unicode character from a codepoint number, you need to pack it:
hex = 0x0905
[hex].pack("U")
=> "अ"
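The approaches in the answers above are equivalent; a quick sketch comparing them (the last one assumes the default UTF-8 source encoding):

```ruby
hex = 0x0905

a = hex.chr(Encoding::UTF_8)  # Integer#chr with an explicit encoding
b = [hex].pack('U')           # Array#pack, 'U' = UTF-8 character
c = ''.dup << hex             # appending a codepoint to a UTF-8 string

puts [a, b, c].all? { |s| s == "अ" }  # => true
```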
Is it possible to get the old Ruby 1.8 behavior on a string, and work with it as a stream of bytes rather than an encoded string?
In particular, I'm trying to get a few bytes combined with a Unicode-encoded string, so:
\xFF\x00\x01#{Unicode encoded string}
However, if I try to do that, it's also trying to encode \xFF\x00\x01 which won't work.
Code
What I'm trying to do in irb:
"#{[4278190080].pack("V").force_encoding("BINARY")}\xFF".force_encoding("BINARY")
This is giving me:
Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
from (irb):41
from /usr/bin/irb:12:in `<main>'
I also tried with ASCII-8BIT with no luck.
Just do string = string.force_encoding("ASCII-8BIT") to any string that you want to treat as a plain old series of bytes. Then you should be able to add the two strings together.
I think .force_encoding("BINARY") might work too.
Your interpolated string literal is in UTF-8 by default. I think the Encoding::CompatibilityError is caused by interpolating a BINARY-encoded string into a UTF-8 string.
Try just concatenating strings with compatible encodings, e.g.:
irb> s = [4278190080].pack("V") + "\xFF".force_encoding("BINARY")
=> "\x00\x00\x00\xFF\xFF"
irb> s.encoding
=> #<Encoding:ASCII-8BIT>
irb> s=[4278190080].pack("V") + [0xFF].pack("C")
=> "\x00\x00\x00\xFF\xFF"
irb> s.encoding
=> #<Encoding:ASCII-8BIT>
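Putting the advice together, here is a minimal sketch of building the whole packet from the question (the header bytes \xFF\x00\x01 are the ones from the question; the payload is an arbitrary UTF-8 string):

```ruby
utf8_payload = 'Новини'                 # any UTF-8 string
header = [0xFF, 0x00, 0x01].pack('C*')  # pack('C*') yields ASCII-8BIT

# Reinterpret the payload's bytes as BINARY so the encodings match:
packet = header + utf8_payload.dup.force_encoding('BINARY')

puts packet.encoding  # => ASCII-8BIT
puts packet.bytesize  # 3 header bytes + 12 payload bytes
```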
I have a problem with saving records in MongoDB using Mongoid when they contain multibyte characters. This is the string:
a="Chris \xA5\xEB\xAE\xDFe\xA5"
I first convert it to BINARY and I then gsub it like this:
a.force_encoding("BINARY").gsub(0xA5.chr,"oo")
...which works fine:
=> "Chris oo\xEB\xAE\xDFeoo"
But it seems that I can not use the chr method if I use Regexp:
a.force_encoding("BINARY").gsub(/0x....?/.chr,"")
NoMethodError: undefined method `chr' for /0x....?/:Regexp
Anybody with the same issue?
Thanks a lot...
You can do that with interpolation:
a.force_encoding("BINARY").gsub(/#{0xA5.chr}/,"")
gives
"Chris \xEB\xAE\xDFe"
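A variation on the same idea, sketched with String#b (Ruby 2.0+, returns a BINARY copy) and a byte-range regexp with the /n (binary) flag, in case you want to strip every high byte rather than one specific value:

```ruby
a = "Chris \xA5\xEB\xAE\xDFe\xA5".b   # .b gives an ASCII-8BIT copy
cleaned = a.gsub(/[\x80-\xFF]/n, "")  # /n forces a binary regexp
puts cleaned                          # => Chris e
```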
EDIT: based on the comments, here's a version that translates the binary-encoded string to an ASCII representation and does a regex replacement on that string:
a.unpack('A*').to_s.gsub(/\\x[A-F0-9]{2}/,"")[2..-3] #=> "Chris e"
the [2..-3] at the end is to get rid of the leading [" and the trailing "]
NOTE: to just get rid of the special characters you could also force the string to BINARY first and strip the non-word bytes:
a.force_encoding('BINARY').gsub(/\W/,"") #=> "Chrise"
The actual string does not contain the literal characters \xA5: that is just how characters that would otherwise be unprintable are shown to you (similarly, when a string contains a newline, Ruby shows you \n).
If you want to change any non ascii stuff you could do this
a="Chris \xA5\xEB\xAE\xDFe\xA5"
a.force_encoding('BINARY').encode('ASCII', :invalid => :replace, :undef => :replace, :replace => 'oo')
This starts by forcing the string to the binary encoding (you always want to start with a string whose bytes are valid for its encoding; binary is always valid, since it can contain arbitrary bytes). Then it converts the string to ASCII. Normally this would raise an error, since there are characters it doesn't know what to do with, but the extra options tell it to replace invalid/undefined sequences with the characters 'oo'.
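For the string from the question, the result is pure ASCII, with each high byte replaced independently by 'oo' (a sketch of the same call with its effects made explicit):

```ruby
a = "Chris \xA5\xEB\xAE\xDFe\xA5".dup.force_encoding('BINARY')
cleaned = a.encode('ASCII', invalid: :replace, undef: :replace,
                            replace: 'oo')

puts cleaned.ascii_only?  # => true; every high byte became "oo"
puts cleaned
```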
I am using URI.unescape to unescape text; unfortunately I run into a weird error:
# encoding: utf-8
require('uri')
URI.unescape("%C3%9Fą")
results in
C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:331:in `gsub': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
from C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:331:in `unescape'
from C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:649:in `unescape'
from exe/fail.rb:3:in `<main>'
why?
Don't know why, but you can use the CGI.unescape method instead:
# encoding: utf-8
require 'cgi'
CGI.unescape("%C3%9Fą")
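CGI.unescape returns a UTF-8 string directly, so no manual encoding juggling is needed:

```ruby
require 'cgi'

s = CGI.unescape("%C3%9F")  # %C3 %9F are the UTF-8 bytes of "ß"
puts s           # => ß
puts s.encoding  # => UTF-8
```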
The implementation of URI.unescape is broken for non-ASCII inputs. The 1.9.3 version looks like this:
def unescape(str, escaped = @regexp[:ESCAPED])
  str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(str.encoding)
end
The regex in use is /%[a-fA-F\d]{2}/. So it goes through the string looking for a percent sign followed by two hex digits; in the block $& will be the matched text ('%C3' for example) and $&[1, 2] will be the matched text without the leading percent sign ('C3'). Then we call String#hex to convert that hexadecimal number to a Fixnum (195) and wrap it in an Array ([195]) so that we can use Array#pack to do the byte mangling for us. The problem is that pack gives us a single binary byte:
> puts [195].pack('C').encoding
ASCII-8BIT
The ASCII-8BIT encoding is also known as "binary" (i.e. plain bytes with no particular encoding). Then the block returns that byte, String#gsub tries to insert it into the UTF-8 encoded copy of str that gsub is working on, and you get your error:
incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
because you can't (in general) just stuff binary bytes into a UTF-8 string. You can often get away with it, though:
URI.unescape("%C3%9F") # Works
URI.unescape("%C3µ") # Fails
URI.unescape("µ") # Works, but nothing to gsub here
URI.unescape("%C3%9Fµ") # Fails
URI.unescape("%C3%9Fpancakes") # Works
Things start falling apart once you start mixing non-ASCII data into your URL encoded string.
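The failure can be reproduced without URI at all; the pattern is a gsub block that returns non-ASCII binary bytes while the string being rewritten already contains non-ASCII UTF-8 characters:

```ruby
# "µ" makes the buffer non-ASCII UTF-8; the block's replacement is a
# non-ASCII ASCII-8BIT byte, so gsub cannot combine the two:
begin
  "µ%C3".gsub(/%[0-9A-Fa-f]{2}/) { [$&[1, 2].hex].pack('C') }
rescue Encoding::CompatibilityError => e
  puts e.message  # incompatible character encodings: UTF-8 and ASCII-8BIT
end
```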
One simple fix is to switch the string to binary before trying to decode it:
def unescape(str, escaped = @regexp[:ESCAPED])
  encoding = str.encoding
  str = str.dup.force_encoding('binary')
  str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(encoding)
end
Another option is to push the force_encoding into the block:
def unescape(str, escaped = @regexp[:ESCAPED])
  str.gsub(escaped) { [$&[1, 2].hex].pack('C').force_encoding(str.encoding) }
end
I'm not sure why the gsub fails in some cases but succeeds in others.
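Extracted as a standalone method (the ESCAPED pattern from uri/common.rb is inlined here as an assumption, and unescape_binary is a hypothetical name), the first fix handles the mixed case that URI.unescape rejects:

```ruby
ESCAPED = /%[a-fA-F\d]{2}/  # same pattern URI uses internally

def unescape_binary(str, escaped = ESCAPED)
  encoding = str.encoding
  str = str.dup.force_encoding(Encoding::BINARY)
  # All work happens on binary strings, so no compatibility check fires:
  str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(encoding)
end

puts unescape_binary("%C3%9Fµ")  # => ßµ  (the case that failed above)
```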
To expand on Vasiliy's answer that suggests using CGI.unescape:
As of Ruby 2.5.0, URI.unescape is obsolete.
See https://ruby-doc.org/stdlib-2.5.0/libdoc/uri/rdoc/URI/Escape.html#method-i-unescape.
"This method is obsolete and should not be used. Instead, use CGI.unescape, URI.decode_www_form or URI.decode_www_form_component depending on your specific use case."
I want to know how to convert these kinds of characters to their Unicode form, such as the following one:
Delphi_7.0%E6%95%B0%E6%8D%AE%E5%BA%93%E5%BC%80%E5%8F%91%E5%85%A5%E9%97%A8%E4%B8%8E%E8%8C%83%E4%BE%8B%E8%A7%A3%E6%9E%90
The Unicode form of the string above is:
Delphi_7.0数据库开发入门与范例解析
Does anybody know how to do the conversion using Ruby? Thanks.
This is a URI-encoded string:
require 'uri'
#=> true
s = 'Delphi_7.0%E6%95%B0%E6%8D%AE%E5%BA%93%E5%BC%80%E5%8F%91%E5%85%A5%E9%97%A8%E4%B8%8E%E8%8C%83%E4%BE%8B%E8%A7%A3%E6%9E%90'
#=> "Delphi_7.0%E6%95%B0%E6%8D%AE%E5%BA%93%E5%BC%80%E5%8F%91%E5%85%A5%E9%97%A8%E4%B8%8E%E8%8C%83%E4%BE%8B%E8%A7%A3%E6%9E%90"
URI.decode s
#=> "Delphi_7.0数据库开发入门与范例解析"
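As noted earlier in this thread, URI.unescape/URI.decode are obsolete (and removed in recent Rubies), so the same conversion is better done with CGI.unescape:

```ruby
require 'cgi'

s = 'Delphi_7.0%E6%95%B0%E6%8D%AE%E5%BA%93%E5%BC%80%E5%8F%91%E5%85%A5%E9%97%A8%E4%B8%8E%E8%8C%83%E4%BE%8B%E8%A7%A3%E6%9E%90'
puts CGI.unescape(s)
# => Delphi_7.0数据库开发入门与范例解析
```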