convert utf-8 to unicode in ruby - ruby

The UTF-8 of "龅" is E9BE85 and the unicode is U+9F85. Following code did not work as expected:
irb(main):004:0> "龅"
=> "\351\276\205"
irb(main):005:0> Iconv.iconv("unicode","utf-8","龅").to_s
=> "\377\376\205\237"
P.S: I am using Ruby1.8.7.

Ruby 1.9+ is much better equipped to deal with Unicode than 1.8.7, so, I strongly suggest running under 1.9.2 if at all possible.
Part of the problem is that 1.8 didn't understand that a UTF-8 or Unicode character could be more than one byte long. 1.9 does understand that and introduces things like String#each_char.
require 'iconv'
# encoding: UTF-8
RUBY_VERSION # => "1.9.2"
"龅".encoding # => #<Encoding:UTF-8>
"龅".each_char.entries # => ["龅"]
Iconv.iconv("unicode","utf-8","龅").to_s # =>
# ~> -:8:in `iconv': invalid encoding ("unicode", "utf-8") (Iconv::InvalidEncoding)
# ~> from -:8:in `<main>'
To get the list of available encodings with Iconv, do:
require 'iconv'
puts Iconv.list
It's a long list so I won't add it here.

You can try this:
"%04x" % "龅".unpack("U*")[0]
=> "9f85"

Should use UNICODEBIG// as the target encoding
irb(main):014:0> Iconv.iconv("UNICODEBIG//","utf-8","龅")[0].each_byte {|b| puts b.to_s(16)}
9f
85
=> "\237\205"

Related

Ruby to_json issue with error "illegal/malformed utf-8"

I got an error JSON::GeneratorError: source sequence is illegal/malformed utf-8 when trying to convert a hash into json string. I am wondering if this has anything to do with encoding, and how can I make to_json just treat \xAE as it is?
$ irb
2.0.0-p247 :001 > require 'json'
=> true
2.0.0-p247 :002 > a = {"description"=> "iPhone\xAE"}
=> {"description"=>"iPhone\xAE"}
2.0.0-p247 :003 > a.to_json
JSON::GeneratorError: source sequence is illegal/malformed utf-8
from (irb):3:in `to_json'
from (irb):3
from /Users/cchen21/.rvm/rubies/ruby-2.0.0-p247/bin/irb:16:in `<main>'
\xAE is not a valid character in UTF-8, you have to use \u00AE instead:
"iPhone\u00AE"
#=> "iPhone®"
Or convert it accordingly:
"iPhone\xAE".force_encoding("ISO-8859-1").encode("UTF-8")
#=> "iPhone®"
Every string in Ruby has a underlaying encoding. Depending on your LANG and LC_ALL environment variables, the interactive shell might be executing and interpreting your strings in a given encoding.
$ irb
1.9.3p392 :008 > __ENCODING__
=> #<Encoding:UTF-8>
(ignore that I’m using Ruby 1.9 instead of 2.0, the ideas are still the same).
__ENCODING__ returns the current source encoding. Yours will probably also say UTF-8.
When you create literal strings and use byte escapes (the \xAE) in your code, Ruby is trying to interpret that according to the string encoding:
1.9.3p392 :003 > a = {"description" => "iPhone\xAE"}
=> {"description"=>"iPhone\xAE"}
1.9.3p392 :004 > a["description"].encoding
=> #<Encoding:UTF-8>
So, the byte \xAE at the end of your literal string will be tried to be treated as a UTF-8 stream byte, but it is invalid. See what happens when I try to print it:
1.9.3-p392 :001 > puts "iPhone\xAE"
iPhone�
=> nil
You either need to provide the registered mark character in a valid UTF-8 encoding (either using the real character, or providing the two UTF-8 bytes):
1.9.3-p392 :002 > a = {"description1" => "iPhone®", "description2" => "iPhone\xc2\xae"}
=> {"description1"=>"iPhone®", "description2"=>"iPhone®"}
1.9.3-p392 :005 > a.to_json
=> "{\"description1\":\"iPhone®\",\"description2\":\"iPhone®\"}"
Or, if your input is ISO-8859-1 (Latin 1) and you know it for sure, you can tell Ruby to interpret your string as another encoding:
1.9.3-p392 :006 > a = {"description1" => "iPhone\xAE".force_encoding('ISO-8859-1') }
=> {"description1"=>"iPhone\xAE"}
1.9.3-p392 :007 > a.to_json
=> "{\"description1\":\"iPhone®\"}"
Hope it helps.

Delete non-UTF characters from a string in Ruby?

How do I delete non-UTF8 characters from a ruby string? I have a string that has for example "xC2" in it. I want to remove that char from the string so that it becomes a valid UTF8.
This:
text.gsub!(/\xC2/, '')
returns an error:
incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
I was looking at text.unpack('U*') and string.pack as well, but did not get anywhere.
You can use encode for that.
text.encode('UTF-8', :invalid => :replace, :undef => :replace)
For more info look into Ruby-Docs
You could do it like this
# encoding: utf-8
class String
def validate_encoding
chars.select(&:valid_encoding?).join
end
end
puts "testing\xC2 a non UTF-8 string".validate_encoding
#=>testing a non UTF-8 string
You text have ASCII-8BIT encoding, instead you should use this:
String.delete!("^\u{0000}-\u{007F}");
It will serve the same purpose.
You can use /n, as in
text.gsub!(/\xC2/n, '')
to force the Regexp to operate on bytes.
Are you sure this is what you want, though? Any Unicode character in the range [U+80, U+BF] will have a \xC2 in its UTF-8 encoded form.
Try Iconv
1.9.3p194 :001 > require 'iconv'
# => true
1.9.3p194 :002 > string = "testing\xC2 a non UTF-8 string"
# => "testing\xC2 a non UTF-8 string"
1.9.3p194 :003 > ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
# => #<Iconv:0x000000026c9290>
1.9.3p194 :004 > ic.iconv string
# => "testing a non UTF-8 string"
The best solution to this problem that I've found is this answer to the same question: https://stackoverflow.com/a/8711118/363293.
In short: "€foo\xA0".chars.select(&:valid_encoding?).join
data = '' if not (data.force_encoding("UTF-8").valid_encoding?)

How do I get YAML in Ruby as of 1.9.3 to dump ASCII-8Bit strings as strings?

Here's the problem: I might have strings that are UTF-8, and I might have strings that are US-ASCII. Regardless of the encoding, I'd like YAML.dump(str) to actually dump String objects, instead of these useless !binary objects as the example shows.
Is there a flag or something I'm not seeing to force YAML.dump() to do the right thing?
Ruby 1.9.1 example
YAML::VERSION # "0.60"
a = "foo" # => "foo"
a.force_encoding("BINARY") # => "foo"
YAML.dump(a) # => "--- foo\n"
Ruby 1.9.3 example
YAML::VERSION # "1.2.2"
a = "foo" # => "foo"
a.force_encoding("BINARY") # => "foo"
YAML.dump(a) # => "--- !binary |-\n Zm9v\n"
Update: Got my own answer
YAML::ENGINE.yamler='syck'
YAML.dump(a) # => "--- foo\n"
So, looks like using the old yamler engine with force the old behavior.
Update: Got my own answer
YAML::ENGINE.yamler='syck'
YAML.dump(a) # => "--- foo\n"

Thor & YAML outputting as binary?

I'm using Thor and trying to output YAML to a file. In irb I get what I expect. Plain text in YAML format. But when part of a method in Thor, its output is different...
class Foo < Thor
include Thor::Actions
desc "bar", "test"
def set
test = {"name" => "Xavier", "age" => 30}
puts test
# {"name"=>"Xavier", "age"=>30}
puts test.to_yaml
# !binary "bmFtZQ==": !binary |-
# WGF2aWVy
# !binary "YWdl": 30
File.open("data/config.yml", "w") {|f| f.write(test.to_yaml) }
end
end
Any ideas?
All Ruby 1.9 strings have an encoding attached to them.
YAML encodes some non-UTF8 strings as binary, even when they look innocent, without any high-bit characters. You might think that your code is always using UTF8, but builtins can return non-UTF8 strings (ex File path routines).
To avoid binary encoding, make sure all your strings encodings are UTF-8 before calling to_yaml. Change the encoding with force_encoding("UTF-8") method.
For example, this is how I encode my options hash into yaml:
options = {
:port => 26000,
:rackup => File.expand_path(File.join(File.dirname(__FILE__), "../sveg.rb"))
}
utf8_options = {}
options.each_pair { |k,v| utf8_options[k] = ((v.is_a? String) ? v.force_encoding("UTF-8") : v)}
puts utf8_options.to_yaml
Here is an example of yaml encoding simple strings as binary
>> x = "test"
=> "test"
>> x.encoding
=> #<Encoding:UTF-8>
>> x.to_yaml
=> "--- test\n...\n"
>> x.force_encoding "ASCII-8BIT"
=> "test"
>> x.to_yaml
=> "--- !binary |-\n dGVzdA==\n"
After version 1.9.3p125, ruby build-in YAML engine will treat all BINARY encoding differently than before. All you need to do is to set correct non-BINARY encoding before your String.to_yaml.
in Ruby 1.9, All String object have attached a Encoding object
and as following blog ( by James Edward Gray II ) mentioned, ruby have build in three type of encoding when String is generated:
http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings.
One of encoding may solve your problem => Source code Encoding
This is the encoding of your source code, and can be specify by adding magic encoding string at the first line or second line ( if you have a sha-bang string at the first line of your source code )
the magic encoding code could be one of following:
# encoding: utf-8
# coding: utf-8
# -- encoding : utf-8 --
so in your case, if you use ruby 1.9.3p125 or later, this should be solved by adding one of magic encoding in the beginning of your code.
# encoding: utf-8
require 'thor'
class Foo < Thor
include Thor::Actions
desc "bar", "test"
def bar
test = {"name" => "Xavier", "age" => 30}
puts test
#{"name"=>"Xavier", "age"=>30}
puts test["name"].encoding.name
#UTF-8
puts test.to_yaml
#---
#name: Xavier
#age: 30
puts test.to_yaml.encoding.name
#UTF-8
end
end
I have been struggling with this using 1.9.3p545 on Windows - just with a simple hash containing strings - and no Thor.
The gem ZAML solves the problem quite simply:
require 'ZAML'
yaml = ZAML.dump(some_hash)
File.write(path_to_yaml_file, yaml)

How can i transform the utf8 chars to iso8859-1

the question is that the title say! who can tell me how do this in ruby!
~ UPDATE ~
ruby-iconv has been superseded from Ruby 1.9.3 onwards by the encode method. See
Jörg W Mittag's answer for details, but in short:
utf8string = "èàòppè"
iso_string = utf8string.encode('ISO-8859-1')
I agree with Williham Totlandt in thinking that this type of conversion might not be the smartest idea ever, but anyway: use ruby-iconv :)
utf8string = "èàòppè"
iso_string = Iconv.conv 'iso8859-1', 'UTF-8', utf8string
With Ruby 1.9, that's particularly easy, because all strings carry their encoding with them:
# coding: UTF-8
u = 'µ'
As you can see, the string is encoded as UTF-8:
p u.encoding # => #<Encoding:UTF-8>
p u.bytes.to_a # => [194, 181]
Transcoding the string is quite easy:
i = u.encode('ISO-8859-1')
i is now in ISO-8859-1 encoding:
p i.encoding # => #<Encoding:ISO8859-1>
p i.bytes.to_a # => [181]
If you want to write to a file, the network, an IO stream or the console, it gets even easier. In Ruby 1.9, those objects are tagged with an encoding just like strings are, and transcoding happens automatically. Just say print or puts and Ruby will do the transcoding for you:
File.open('test.txt', 'w', encoding: 'ISO-8859-1') do |f|
f.puts u
end

Resources