Ruby to_json issue with error "illegal/malformed utf-8" - ruby

I get the error JSON::GeneratorError: source sequence is illegal/malformed utf-8 when trying to convert a hash into a JSON string. Does this have anything to do with encoding, and how can I make to_json treat \xAE as-is?
$ irb
2.0.0-p247 :001 > require 'json'
=> true
2.0.0-p247 :002 > a = {"description"=> "iPhone\xAE"}
=> {"description"=>"iPhone\xAE"}
2.0.0-p247 :003 > a.to_json
JSON::GeneratorError: source sequence is illegal/malformed utf-8
from (irb):3:in `to_json'
from (irb):3
from /Users/cchen21/.rvm/rubies/ruby-2.0.0-p247/bin/irb:16:in `<main>'

\xAE is not valid UTF-8; you have to use \u00AE instead:
"iPhone\u00AE"
#=> "iPhone®"
Or convert it accordingly:
"iPhone\xAE".force_encoding("ISO-8859-1").encode("UTF-8")
#=> "iPhone®"
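Putting it together, a minimal sketch of the full round trip to JSON, assuming the input bytes really are Latin-1 (ISO-8859-1):

```ruby
require 'json'

# Re-encode the Latin-1 byte string to UTF-8 before serializing;
# to_json then succeeds because the string is valid UTF-8.
h = { "description" => "iPhone\xAE".force_encoding("ISO-8859-1").encode("UTF-8") }
json = h.to_json
puts json  # {"description":"iPhone®"}
```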

Every string in Ruby has an underlying encoding. Depending on your LANG and LC_ALL environment variables, the interactive shell may execute and interpret your strings in a particular encoding.
$ irb
1.9.3p392 :008 > __ENCODING__
=> #<Encoding:UTF-8>
(ignore that I’m using Ruby 1.9 instead of 2.0; the ideas are the same).
__ENCODING__ returns the current source encoding. Yours will probably also say UTF-8.
When you create literal strings and use byte escapes (the \xAE) in your code, Ruby interprets them according to the string's encoding:
1.9.3p392 :003 > a = {"description" => "iPhone\xAE"}
=> {"description"=>"iPhone\xAE"}
1.9.3p392 :004 > a["description"].encoding
=> #<Encoding:UTF-8>
So the byte \xAE at the end of your literal string is treated as part of a UTF-8 byte sequence, but it is not valid UTF-8. See what happens when I try to print it:
1.9.3-p392 :001 > puts "iPhone\xAE"
iPhone�
=> nil
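You can check this without printing anything: String#valid_encoding? (available since 1.9) reports whether a string's bytes are valid for its declared encoding. A minimal sketch:

```ruby
# The literal keeps the UTF-8 source encoding, but the \xAE byte
# is not a valid UTF-8 sequence, so validation fails.
s = "iPhone\xAE"
s.encoding          # => #<Encoding:UTF-8>
s.valid_encoding?   # => false
```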
You need to provide the registered-trademark character in valid UTF-8 (either as the literal character or as its two UTF-8 bytes):
1.9.3-p392 :002 > a = {"description1" => "iPhone®", "description2" => "iPhone\xc2\xae"}
=> {"description1"=>"iPhone®", "description2"=>"iPhone®"}
1.9.3-p392 :005 > a.to_json
=> "{\"description1\":\"iPhone®\",\"description2\":\"iPhone®\"}"
Or, if your input is ISO-8859-1 (Latin 1) and you know it for sure, you can tell Ruby to interpret your string as another encoding:
1.9.3-p392 :006 > a = {"description1" => "iPhone\xAE".force_encoding('ISO-8859-1') }
=> {"description1"=>"iPhone\xAE"}
1.9.3-p392 :007 > a.to_json
=> "{\"description1\":\"iPhone®\"}"
Hope it helps.

Related

Displaying a string unescaped in IRB or in general

Suppose I have a string "this\n\tis \"helpful\"" and I'd like it to be displayed in the terminal, unescaped, for copy/paste reasons, i.e.
this
is "helpful"
Is this possible in terminal, either in IRB or otherwise?
$ irb
1.9.3-p448 :001 > s = "this\n\tis \"helpful\""
=> "this\n\tis \"helpful\""
1.9.3-p448 :002 > puts s
this
is "helpful"
=> nil
1.9.3-p448 :003 >
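In short: p (and inspect) show the escaped representation, while puts writes the raw characters; a minimal sketch:

```ruby
# puts interprets the escapes; p shows the escaped literal form.
s = "this\n\tis \"helpful\""
puts s   # prints the unescaped text across two lines
p s      # prints the escaped representation
```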

Ruby JSON.parse returning incorrect data for unicode

I'm trying to parse some JSON containing escaped unicode characters using JSON.parse. But on one machine, using json/ext, it gives back incorrect values. For example, \u2030 should return E2 80 B0 in UTF-8, but instead I'm getting 01 00 00. It fails with either the escaped "\\u2030" or the unescaped "\u2030".
1.9.2p180 :001 > require 'json/ext'
=> true
1.9.2p180 :002 > s = JSON.parse '{"f":"\\u2030"}'
=> {"f"=>"\u0001\u0000\u0000"}
1.9.2p180 :003 > s["f"].encoding
=> #<Encoding:UTF-8>
1.9.2p180 :004 > s["f"].valid_encoding?
=> true
1.9.2p180 :005 > s["f"].bytes.map do |x| x; end
=> [1, 0, 0]
It works on my other machine with the same version of ruby and similar environment variables. The Gemfile.lock on both machines is identical, including json (= 1.6.3). It does work with json/pure on both machines.
1.9.2p180 :001 > require 'json/pure'
=> true
1.9.2p180 :002 > s = JSON.parse '{"f":"\\u2030"}'
=> {"f"=>"‰"}
1.9.2p180 :003 > s["f"].encoding
=> #<Encoding:UTF-8>
1.9.2p180 :004 > s["f"].valid_encoding?
=> true
1.9.2p180 :005 > s["f"].bytes.map do |x| x; end
=> [226, 128, 176]
So is there something else in my environment or setup that could be causing it to parse incorrectly?
I recently ran into this same problem, and I tracked it down to this Ruby bug, caused by the declaration of this buffer in Ruby 1.9.2 and how it gets optimized by GCC. It's fixed in this commit.
You can recompile Ruby with -O0 or use a newer version of Ruby (1.9.3 or better) to fix it.
Try upgrading your JSON gem (to at least 1.6.6, or the newest 1.7.1).
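On a fixed Ruby/gem combination the parse behaves as expected; a quick sanity check:

```ruby
require 'json'

# The escaped codepoint \u2030 should decode to the per-mille
# sign, whose UTF-8 bytes are E2 80 B0.
s = JSON.parse('{"f":"\u2030"}')
s["f"]        # => "‰"
s["f"].bytes  # => [226, 128, 176]
```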

JSON with JRuby - Not parsing the result in UTF-8

I am using the JSON implementation for Ruby in my Rails project to parse the JSON string sent by Ajax, but I found that although the JSON string is in UTF-8, the result comes out as ASCII-8BIT by default; see below:
jruby-1.6.7 :068 > json_text = '["に到着を待っている"]'
=> "[\"に到着を待っている\"]"
jruby-1.6.7 :069 > json_text.encoding
=> #<Encoding:UTF-8>
jruby-1.6.7 :070 > json_parsed = JSON.parse(json_text)
=> ["\u00E3\u0081\u00AB\u00E5\u0088\u00B0\u00E7\u009D\u0080\u00E3\u0082\u0092\u00E5\u00BE\u0085\u00E3\u0081\u00A3\u00E3\u0081\u00A6\u00E3\u0081\u0084\u00E3\u0082\u008B"]
jruby-1.6.7 :071 > json_parsed.first.encoding
=> #<Encoding:ASCII-8BIT>
I don't want it escaped; I would like a UTF-8 result. Is there a way to set that? I checked the documentation of the JSON project and found no encoding options for the method JSON.parse. Maybe I missed something; how can I do that?
UPDATE:
as noted by fl00r, this example works fine in MRI, but not in JRuby
This looks like a bug, as this actually works when using the pure version:
jruby-1.6-head :001 > require 'json/pure'
=> true
jruby-1.6-head :002 > json_text = '["に到着を待っている"]'
=> "[\"に到着を待っている\"]"
jruby-1.6-head :003 > json_parsed = JSON.parse(json_text)
=> ["に到着を待っている"]
jruby-1.6-head :004 > json_parsed.first.encoding
=> #<Encoding:UTF-8>
jruby-1.6-head :005 >
Edit: Just saw you opened a ticket for this...
Edit 2: This actually seems to have already been fixed by this commit. To install the latest code from json:
$ git clone https://github.com/flori/json.git
$ cd json
$ rake jruby_gem
$ jruby -S gem install pkg/json-1.6.6-java.gem

Why does Iconv work different in irb and the Ruby interpreter?

I have to convert Latin characters like éáéíóúÀÉÍÓÚ into similar ones without accents or weird symbols:
é -> e
è -> e
Ä -> A
I have a file named "test.rb":
require 'iconv'
puts Iconv.iconv("ASCII//translit", "utf-8", 'è').join
When I paste those lines into irb it works, returning "e" as expected.
Running:
$ ruby test.rb
I get "?" as output.
I'm using irb 0.9.5(05/04/13) and Ruby 1.8.7 (2011-06-30 patchlevel 352) [i386-linux].
Ruby 1.8.7 was not multibyte character savvy like 1.9+ is. In general, it treats a string as a series of bytes, rather than characters. If you need better handling of such characters, consider upgrading to 1.9+.
James Gray has a series of articles about dealing with multibyte characters in Ruby 1.8. I highly recommend taking the time to read through them. It's a complex subject so you'll want to read the entire series he wrote a couple times.
Also, 1.8 encoding support needs the $KCODE flag set:
$KCODE = "U"
so you'll need to add that to code running in 1.8.
Here is a bit of sample code:
#encoding: UTF-8
require 'rubygems'
require 'iconv'
chars = "éáéíóúÀÉÍÓÚ"
puts Iconv.iconv("ASCII//translit", "utf-8", chars)
puts chars.split('')
puts chars.split('').join
Using ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-darwin10.7.0] and running it in IRB, I get:
1.8.7 :001 > #encoding: UTF-8
1.8.7 :002 >
1.8.7 :003 > require 'iconv'
true
1.8.7 :004 >
1.8.7 :005 > chars = "\303\251\303\241\303\251\303\255\303\263\303\272\303\200\303\211\303\215\303\223\303\232"
"\303\251\303\241\303\251\303\255\303\263\303\272\303\200\303\211\303\215\303\223\303\232"
1.8.7 :006 >
1.8.7 :007 > puts Iconv.iconv("ASCII//translit", "utf-8", chars)
'e'a'e'i'o'u`A'E'I'O'U
nil
1.8.7 :008 >
1.8.7 :009 > puts chars.split('')
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
nil
1.8.7 :010 > puts chars.split('').join
éáéíóúÀÉÍÓÚ
At prompt :009 in the output I told Ruby to split the string into its concept of characters, which in 1.8.7 was bytes. The resulting '?' marks mean the terminal didn't know what to do with the individual bytes. At prompt :010 I told it to split, which resulted in an array of bytes, which join then reassembled into the normal string, allowing the multibyte characters to be displayed normally.
Running the same code using Ruby 1.9.2 shows better, and more expected and desirable, behavior:
1.9.2p290 :001 > #encoding: UTF-8
1.9.2p290 :002 >
1.9.2p290 :003 > require 'iconv'
true
1.9.2p290 :004 >
1.9.2p290 :005 > chars = "éáéíóúÀÉÍÓÚ"
"éáéíóúÀÉÍÓÚ"
1.9.2p290 :006 >
1.9.2p290 :007 > puts Iconv.iconv("ASCII//translit", "utf-8", chars)
'e'a'e'i'o'u`A'E'I'O'U
nil
1.9.2p290 :008 >
1.9.2p290 :009 > puts chars.split('')
é
á
é
í
ó
ú
À
É
Í
Ó
Ú
nil
1.9.2p290 :010 > puts chars.split('').join
éáéíóúÀÉÍÓÚ
Ruby maintained the multibyte-ness of the characters, through the split('').
Notice that in both cases Iconv.iconv did the right thing: it produced characters visually similar to the input. While the leading apostrophe looks out of place, it's there as a reminder that the characters were originally accented.
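Iconv was removed from the standard library in Ruby 2.0. A modern alternative (a sketch, assuming Ruby 2.2+ for unicode_normalize) is to decompose the accented characters and strip the combining marks:

```ruby
# NFD decomposition splits "é" into "e" plus a combining acute
# accent; \p{Mn} matches those combining marks so gsub removes them.
chars = "éáéíóúÀÉÍÓÚ"
ascii = chars.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
puts ascii  # => eaeiouAEIOU
```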

convert utf-8 to unicode in ruby

The UTF-8 encoding of "龅" is E9 BE 85 and the code point is U+9F85. The following code does not work as expected:
irb(main):004:0> "龅"
=> "\351\276\205"
irb(main):005:0> Iconv.iconv("unicode","utf-8","龅").to_s
=> "\377\376\205\237"
P.S: I am using Ruby1.8.7.
Ruby 1.9+ is much better equipped to deal with Unicode than 1.8.7, so, I strongly suggest running under 1.9.2 if at all possible.
Part of the problem is that 1.8 didn't understand that a UTF-8 or Unicode character could be more than one byte long. 1.9 does understand that and introduces things like String#each_char.
require 'iconv'
# encoding: UTF-8
RUBY_VERSION # => "1.9.2"
"龅".encoding # => #<Encoding:UTF-8>
"龅".each_char.entries # => ["龅"]
Iconv.iconv("unicode","utf-8","龅").to_s # =>
# ~> -:8:in `iconv': invalid encoding ("unicode", "utf-8") (Iconv::InvalidEncoding)
# ~> from -:8:in `<main>'
To get the list of available encodings with Iconv, do:
require 'iconv'
puts Iconv.list
It's a long list so I won't add it here.
You can try this:
"%04x" % "龅".unpack("U*")[0]
=> "9f85"
You should use UNICODEBIG// as the target encoding:
irb(main):014:0> Iconv.iconv("UNICODEBIG//","utf-8","龅")[0].each_byte {|b| puts b.to_s(16)}
9f
85
=> "\237\205"
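On Ruby 1.9+ you can get the code point without Iconv at all; a minimal sketch (assuming a UTF-8 source encoding):

```ruby
# ord gives the Unicode code point; pack("U") goes back to UTF-8.
cp   = "龅".ord          # => 40837
hex  = "%04X" % cp       # => "9F85"
back = [cp].pack("U")    # => "龅"
```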
