JSON with JRuby - Not parsing the result in UTF-8

I am using the JSON implementation for Ruby in my Rails project to parse a JSON string sent by Ajax. Although the JSON string is in UTF-8, the parsed result comes out as ASCII-8BIT by default, as shown below:
jruby-1.6.7 :068 > json_text = '["に到着を待っている"]'
=> "[\"に到着を待っている\"]"
jruby-1.6.7 :069 > json_text.encoding
=> #<Encoding:UTF-8>
jruby-1.6.7 :070 > json_parsed = JSON.parse(json_text)
=> ["\u00E3\u0081\u00AB\u00E5\u0088\u00B0\u00E7\u009D\u0080\u00E3\u0082\u0092\u00E5\u00BE\u0085\u00E3\u0081\u00A3\u00E3\u0081\u00A6\u00E3\u0081\u0084\u00E3\u0082\u008B"]
jruby-1.6.7 :071 > json_parsed.first.encoding
=> #<Encoding:ASCII-8BIT>
I don't want the result to be escaped; I would like a UTF-8 result. Is there a way to set that? I checked the documentation of the JSON project and found no encoding options for JSON.parse. Maybe I missed something. How can I do that?
UPDATE:
As pointed out by @fl00r, this example works fine in MRI, but not in JRuby.

This looks like a bug, as this actually works when using the pure version:
jruby-1.6-head :001 > require 'json/pure'
=> true
jruby-1.6-head :002 > json_text = '["に到着を待っている"]'
=> "[\"に到着を待っている\"]"
jruby-1.6-head :003 > json_parsed = JSON.parse(json_text)
=> ["に到着を待っている"]
jruby-1.6-head :004 > json_parsed.first.encoding
=> #<Encoding:UTF-8>
jruby-1.6-head :005 >
Edit: Just saw you opened a ticket for this...
Edit 2: This actually seems to have already been fixed by this commit. To install the latest code from the json repository:
$ git clone https://github.com/flori/json.git
$ cd json
$ rake jruby_gem
$ jruby -S gem install pkg/json-1.6.6-java.gem
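If upgrading the gem is not an option right away, one possible stopgap is to re-tag the parsed strings yourself. This is only a sketch: it assumes the bytes returned by the parser are intact UTF-8 and that just the encoding label is wrong, which I have not verified against the affected JRuby build. retag_utf8 is a made-up helper name, not part of the json gem.

```ruby
require 'json'

# Hypothetical helper (not part of the json gem): walk a parsed JSON
# structure and re-tag every binary (ASCII-8BIT) string as UTF-8.
# force_encoding only changes the label; the bytes are left untouched.
def retag_utf8(value)
  case value
  when String
    if value.encoding == Encoding::ASCII_8BIT
      value.dup.force_encoding(Encoding::UTF_8)
    else
      value
    end
  when Array
    value.map { |v| retag_utf8(v) }
  when Hash
    value.each_with_object({}) { |(k, v), h| h[retag_utf8(k)] = retag_utf8(v) }
  else
    value
  end
end

# Simulate the buggy result: correct UTF-8 bytes, wrong encoding label.
broken = ["に到着を待っている".dup.force_encoding(Encoding::ASCII_8BIT)]
fixed = retag_utf8(broken)

puts fixed.first.encoding
puts fixed.first.valid_encoding?
```

You would call it as retag_utf8(JSON.parse(json_text)). On an interpreter without the bug it leaves the strings as they are.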

Related

Ruby to_json issue with error "illegal/malformed utf-8"

I got the error JSON::GeneratorError: source sequence is illegal/malformed utf-8 when trying to convert a hash into a JSON string. I am wondering whether this has anything to do with encoding, and how I can make to_json just treat \xAE as it is.
$ irb
2.0.0-p247 :001 > require 'json'
=> true
2.0.0-p247 :002 > a = {"description"=> "iPhone\xAE"}
=> {"description"=>"iPhone\xAE"}
2.0.0-p247 :003 > a.to_json
JSON::GeneratorError: source sequence is illegal/malformed utf-8
from (irb):3:in `to_json'
from (irb):3
from /Users/cchen21/.rvm/rubies/ruby-2.0.0-p247/bin/irb:16:in `<main>'
\xAE on its own is not a valid UTF-8 byte sequence; you have to use \u00AE instead:
"iPhone\u00AE"
#=> "iPhone®"
Or convert it accordingly:
"iPhone\xAE".force_encoding("ISO-8859-1").encode("UTF-8")
#=> "iPhone®"
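If the strings come from somewhere you don't control, you could wrap that conversion in a small guard. A minimal sketch, assuming that anything which is not valid UTF-8 is really ISO-8859-1 (that assumption has to hold for your data); to_utf8 is a hypothetical helper name.

```ruby
require 'json'

# Hypothetical helper: return a valid UTF-8 copy of +str+.
# Assumption: a string that is not already valid UTF-8 holds ISO-8859-1 bytes.
def to_utf8(str)
  return str if str.encoding == Encoding::UTF_8 && str.valid_encoding?
  # Re-label the bytes as Latin-1, then transcode them to UTF-8.
  str.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

raw = "iPhone\xAE"   # tagged UTF-8, but 0xAE alone is not valid UTF-8
clean = to_utf8(raw)

puts clean.valid_encoding?
puts({ "description" => clean }.to_json)
```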
Every string in Ruby has an underlying encoding. Depending on your LANG and LC_ALL environment variables, the interactive shell might be executing and interpreting your strings in a given encoding.
$ irb
1.9.3p392 :008 > __ENCODING__
=> #<Encoding:UTF-8>
(ignore that I’m using Ruby 1.9 instead of 2.0, the ideas are still the same).
__ENCODING__ returns the current source encoding. Yours will probably also say UTF-8.
When you create literal strings and use byte escapes (the \xAE) in your code, Ruby interprets those bytes according to the string's encoding:
1.9.3p392 :003 > a = {"description" => "iPhone\xAE"}
=> {"description"=>"iPhone\xAE"}
1.9.3p392 :004 > a["description"].encoding
=> #<Encoding:UTF-8>
So Ruby will try to treat the byte \xAE at the end of your literal string as part of a UTF-8 byte stream, where it is invalid. See what happens when I try to print it:
1.9.3-p392 :001 > puts "iPhone\xAE"
iPhone�
=> nil
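To see why that lone byte is rejected, it helps to compare the Latin-1 and UTF-8 byte sequences for the same character. A quick sketch:

```ruby
# The registered sign is one byte in ISO-8859-1 but two bytes in UTF-8.
latin1 = "\xAE".dup.force_encoding(Encoding::ISO_8859_1)
utf8   = latin1.encode(Encoding::UTF_8)

p latin1.bytes.to_a              # [174]
p utf8.bytes.to_a                # [194, 174]

# In a UTF-8 string, 0xAE alone is a continuation byte with no lead byte.
p "\xAE".valid_encoding?         # false
p "\xC2\xAE".valid_encoding?     # true
```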
You need to either provide the registered mark character as valid UTF-8 (using the real character, or its two UTF-8 bytes):
1.9.3-p392 :002 > a = {"description1" => "iPhone®", "description2" => "iPhone\xc2\xae"}
=> {"description1"=>"iPhone®", "description2"=>"iPhone®"}
1.9.3-p392 :005 > a.to_json
=> "{\"description1\":\"iPhone®\",\"description2\":\"iPhone®\"}"
Or, if your input is ISO-8859-1 (Latin 1) and you know it for sure, you can tell Ruby to interpret your string as another encoding:
1.9.3-p392 :006 > a = {"description1" => "iPhone\xAE".force_encoding('ISO-8859-1') }
=> {"description1"=>"iPhone\xAE"}
1.9.3-p392 :007 > a.to_json
=> "{\"description1\":\"iPhone®\"}"
Hope it helps.

Convert Ruby object into HTML string with awesome_print

I'm trying to prettify a Ruby object using awesome_print so I can place the resulting string inside an email and send it off. In terms of code (I know this is wrong), here's what I'm trying to achieve:
my_str = (ap error.object).to_str
# Do something with my_str, like stick it in a <pre> tag inside an html email.
How do I convert the output from ap to a string? I'm asking because, as far as I can tell, ap only returns the object.
It doesn't seem to be documented in the README.md, but if you look at the Kernel modifications the library makes here: https://github.com/michaeldv/awesome_print/blob/master/lib/awesome_print/core_ext/kernel.rb
You can see that in addition to the ap method, the awesome_print gem also adds an ai method to all objects.
1.9.3p392 :001 > require 'awesome_print'
=> true
1.9.3p392 :002 > test = {a: "b"}
=> {:a=>"b"}
1.9.3p392 :003 > ap test
{
:a => "b"
}
1.9.3p392 :006 > test.ai
=> "{\n :a\e[0;37m => \e[0m\e[0;33m\"b\"\e[0m\n}"
1.9.3p392 :007 > test.ai(html:true)
=> "<pre>{\n <pre>:a</pre><kbd style=\"color:slategray\"> =&gt; </kbd><pre><kbd style=\"color:brown\">&quot;b&quot;</kbd></pre>\n}</pre>"
That said, the output formatting might not be that useful (the html version adds a ton of whitespace, and the non-html version has the weird terminal coloring characters), and being an undocumented feature, it's liable to break without warning in a minor version update.
The other thing worth noting in the kernel.rb above is that ap and ai have aliases: awesome_print and awesome_inspect.
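If you would rather not depend on an undocumented method, a plain fallback is to build the string with Ruby's standard library. Note that this swaps in pp and ERB::Util instead of awesome_print, so you lose the colors. A minimal sketch; html_dump is a made-up helper name.

```ruby
require 'pp'
require 'erb'

# Hypothetical helper: pretty-print +obj+ into a string, HTML-escape it,
# and wrap it in a <pre> tag so it can be embedded in an HTML email.
def html_dump(obj)
  text = obj.pretty_inspect            # same layout pp would print, as a String
  "<pre>#{ERB::Util.html_escape(text)}</pre>"
end

html = html_dump({ a: "b", list: [1, 2, 3] })
puts html
```

The escaping step matters because inspect output routinely contains characters such as < and > that would otherwise be read as markup.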
awesome_print is meant for printing ANSI-colored terminal output, not HTML. What I'd use instead is the pygments.rb gem:
# gem install pygments.rb
require 'pygments'
str = <<EOT
# This is an awesome comment on my rb script
a = 2
puts a
hsh = {asdf: 1, qwer: 2, uiop: 3}
EOT
Pygments.highlight str
https://github.com/tmm1/pygments.rb

How do I avoid pretty-printing HTML in Nokogiri while using to_html?

I am using Nokogiri with Ruby on Rails v2.3.8.
Is there a way in which I can avoid pretty-printing in Nokogiri while using to_html?
I read that to_xml allows this to be done using to_xml(:indent => 0), but this doesn't work with to_html.
Right now I am using gsub to strip away new-line characters. Does Nokogiri provide any option to do it?
I solved this using .to_html(save_with: 0):
2.1.0 :001 > require 'nokogiri'
=> true
2.1.0 :002 > doc = Nokogiri::HTML.fragment('<ul><li><span>hello</span> boom!</li></ul>')
=> #<Nokogiri::HTML::DocumentFragment:0x4e4cbd2 name="#document-fragment" children=[#<Nokogiri::XML::Element:0x4e4c97a name="ul" children=[#<Nokogiri::XML::Element:0x4e4c47a name="li" children=[#<Nokogiri::XML::Element:0x4e4c240 name="span" children=[#<Nokogiri::XML::Text:0x4e4c0a6 "hello">]>, #<Nokogiri::XML::Text:0x4e4c86c " boom!">]>]>]>
2.1.0 :003 > doc.to_html
=> "<ul><li>\n<span>hello</span> boom!</li></ul>"
2.1.0 :004 > doc.to_html(save_with: 0)
=> "<ul><li><span>hello</span> boom!</li></ul>"
tested on: nokogiri (1.6.5) + libxml2 2.7.6.dfsg-1ubuntu1 + ruby 2.1.0p0 (2013-12-25 revision 44422) [i686-linux]
You can use Nokogiri::HTML.fragment() instead of just Nokogiri::HTML(). When you perform to_html it won't add newlines, a DOCTYPE header or make it 'pretty' in any way.

How to Call/Require Ruby 1.8 Lib from Ruby 1.9

I'm using a Ruby 1.8 lib kakasi-ruby, but it seems that it can only be compiled against Ruby 1.8 (https://github.com/hogelog/kakasi-ruby/issues/2)
My application is Ruby 1.9.3, so I need to call kakasi-ruby from Ruby 1.9.3.
How should I do this?
Do I have to open a subprocess with Ruby 1.8 and wait for it to finish to get the process's return value?
Edit:
https://github.com/hogelog/kakasi-ruby
Found 3 possible paths:
1. There seems to be a branch for 1.9 in the repo. Maybe try to compile that instead?
2. Otherwise, your fastest option is probably to go back to 1.8, depending on what kind of app it is.
3. Calling it from 1.8 may work, but since the library seems to be a binding to some C code, you could probably call that code directly just as well.
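For the subprocess route, something along these lines could work. It is only a sketch: run_in_subprocess is a hypothetical helper, and the demo uses the current interpreter with a trivial one-liner as a stand-in. In practice the command would be the path to your Ruby 1.8 binary plus a small script that requires kakasi and writes the result to stdout.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: send +input+ to +command+ on stdin and return
# whatever the child process writes to stdout (as raw bytes).
def run_in_subprocess(command, input)
  output, status = Open3.capture2(*command, stdin_data: input, binmode: true)
  raise "subprocess failed with status #{status.exitstatus}" unless status.success?
  output
end

# Stand-in command: the current Ruby running a one-liner.
# Replace it with your Ruby 1.8 binary and a script that calls Kakasi.
command = [RbConfig.ruby, '-e', 'print STDIN.read.upcase']
puts run_in_subprocess(command, 'kakasi')
```

Because binmode is set, the returned string is tagged as binary, so you would still have to call force_encoding on it before transcoding.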
BTW, here is the usage in Ruby 1.9
plee@sos:~/Japanese$ irb
1.9.3p194 :001 > require 'kakasi'
=> true
1.9.3p194 :002 > src="前原誠司経済財政相は4日、朝日新聞などのインタビューに対し"
=> "前原誠司経済財政相は4日、朝日新聞などのインタビューに対し"
1.9.3p194 :003 > src=src.encode("EUC-JP", "UTF-8")
=> "\x{C1B0}\x{B8B6}\x{C0BF}\x{BBCA}\x{B7D0}\x{BAD1}\x{BAE2}\x{C0AF}\x{C1EA}\x{A4CF}\x{A3B4}\x{C6FC}\x{A1A2}\x{C4AB}\x{C6FC}\x{BFB7}\x{CAB9}\x{A4CA}\x{A4C9}\x{A4CE}\x{A5A4}\x{A5F3}\x{A5BF}\x{A5D3}\x{A5E5}\x{A1BC}\x{A4CB}\x{C2D0}\x{A4B7}"
1.9.3p194 :004 > dst=Kakasi.kakasi("-w", src)
=> "\xC1\xB0\xB8\xB6 \xC0\xBF\xBB\xCA \xB7\xD0\xBA\xD1 \xBA\xE2\xC0\xAF \xC1\xEA \xA4\xCF \xA3\xB4 \xC6\xFC \xA1\xA2 \xC4\xAB\xC6\xFC\xBF\xB7\xCA\xB9 \xA4\xCA\xA4\xC9\xA4\xCE \xA5\xA4\xA5\xF3\xA5\xBF\xA5\xD3\xA5\xE5\xA1\xBC \xA4\xCB \xC2\xD0\xA4\xB7"
1.9.3p194 :005 > dst.force_encoding("EUC-JP")
=> "\x{C1B0}\x{B8B6} \x{C0BF}\x{BBCA} \x{B7D0}\x{BAD1} \x{BAE2}\x{C0AF} \x{C1EA} \x{A4CF} \x{A3B4} \x{C6FC} \x{A1A2} \x{C4AB}\x{C6FC}\x{BFB7}\x{CAB9} \x{A4CA}\x{A4C9}\x{A4CE} \x{A5A4}\x{A5F3}\x{A5BF}\x{A5D3}\x{A5E5}\x{A1BC} \x{A4CB} \x{C2D0}\x{A4B7}"
1.9.3p194 :006 > dst=dst.encode("UTF-8", "EUC-JP")
=> "前原 誠司 経済 財政 相 は 4 日 、 朝日新聞 などの インタビュー に 対し"
1.9.3p194 :007 >
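The encode / force_encoding / encode dance above can be folded into one wrapper. A sketch under the assumption that the extension takes EUC-JP bytes and returns untagged EUC-JP bytes, as in the session above; via_euc_jp is a made-up name, and the block below is a stand-in for the real Kakasi.kakasi call so the example runs without the gem.

```ruby
# Hypothetical wrapper: convert UTF-8 text to EUC-JP, pass it to a block
# (where Kakasi.kakasi would be called), and convert the result back.
def via_euc_jp(utf8_text)
  euc = utf8_text.encode(Encoding::EUC_JP, Encoding::UTF_8)
  result = yield(euc)
  # The result comes back without a useful encoding label, so tag the
  # bytes as EUC-JP first and only then transcode them to UTF-8.
  result.dup.force_encoding(Encoding::EUC_JP).encode(Encoding::UTF_8)
end

# Stand-in for Kakasi.kakasi("-w", src): hand back the same bytes, untagged.
out = via_euc_jp("朝日新聞") { |euc| euc.b }
puts out
puts out.encoding
```

With the real library the call would look like via_euc_jp(src) { |euc| Kakasi.kakasi("-w", euc) }.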

Ruby JSON.parse returning incorrect data for unicode

I'm trying to parse some JSON containing escaped unicode characters using JSON.parse. But on one machine, using json/ext, it gives back incorrect values. For example, \u2030 should return E2 80 B0 in UTF-8, but instead I'm getting 01 00 00. It fails with either the escaped "\\u2030" or the unescaped "\u2030".
1.9.2p180 :001 > require 'json/ext'
=> true
1.9.2p180 :002 > s = JSON.parse '{"f":"\\u2030"}'
=> {"f"=>"\u0001\u0000\u0000"}
1.9.2p180 :003 > s["f"].encoding
=> #<Encoding:UTF-8>
1.9.2p180 :004 > s["f"].valid_encoding?
=> true
1.9.2p180 :005 > s["f"].bytes.map do |x| x; end
=> [1, 0, 0]
It works on my other machine with the same version of ruby and similar environment variables. The Gemfile.lock on both machines is identical, including json (= 1.6.3). It does work with json/pure on both machines.
1.9.2p180 :001 > require 'json/pure'
=> true
1.9.2p180 :002 > s = JSON.parse '{"f":"\\u2030"}'
=> {"f"=>"‰"}
1.9.2p180 :003 > s["f"].encoding
=> #<Encoding:UTF-8>
1.9.2p180 :004 > s["f"].valid_encoding?
=> true
1.9.2p180 :005 > s["f"].bytes.map do |x| x; end
=> [226, 128, 176]
So is there something else in my environment or setup that could be causing it to parse incorrectly?
I recently ran into this same problem and tracked it down to this Ruby bug, caused by the declaration of this buffer in Ruby 1.9.2 and how it gets optimized by GCC. It's fixed in this commit.
You can recompile Ruby with -O0 or use a newer version of Ruby (1.9.3 or later) to fix it.
Try upgrading your json gem to at least 1.6.6, or to the newest version, 1.7.1.
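To tell whether a given machine is affected, you could run a small self-check at boot. A minimal sketch; unicode_escapes_ok? is a hypothetical helper, and the expected bytes are the UTF-8 encoding of U+2030 quoted in the question.

```ruby
require 'json'

# Hypothetical self-check: does the loaded JSON parser decode the
# \u2030 escape into the expected UTF-8 bytes (E2 80 B0)?
def unicode_escapes_ok?
  # Single quotes keep the backslash, so the parser sees the escape itself.
  parsed = JSON.parse('{"f":"\u2030"}')["f"]
  parsed.bytes.to_a == [0xE2, 0x80, 0xB0]
end

if unicode_escapes_ok?
  puts "JSON unicode escapes decode correctly"
else
  warn "JSON parser mangles \\u escapes; consider json/pure or an upgrade"
end
```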

Resources