Ruby to_yaml utf8 string - ruby

How can I make Ruby's to_yaml method store UTF-8 strings with their original characters rather than as escape sequences?

require 'yaml'
YAML::ENGINE.yamler='psych'
'Résumé'.to_yaml # => "--- Résumé\n...\n"
Ruby ships with two YAML engines: syck and psych. Syck is old and unmaintained, but it is the default in 1.9.2, so you need to switch to psych. Psych dumps UTF-8 strings as UTF-8.
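As a minimal sketch of the round trip, assuming a Ruby where Psych is available (1.9.3 or later, or 1.9.2 with the psych gem installed), the following dumps a UTF-8 string and loads it back. The helper name dump_utf8 is made up for illustration, and the YAML::ENGINE guard only matters on 1.9.x, where that constant exists.

```ruby
# encoding: utf-8
require 'yaml'

# Select Psych explicitly on Ruby 1.9.x. Newer Rubies have no YAML::ENGINE
# and always use Psych, so the guard skips this line there.
YAML::ENGINE.yamler = 'psych' if defined?(YAML::ENGINE)

# Hypothetical helper: dump a string to YAML text.
def dump_utf8(text)
  YAML.dump(text)
end

dumped = dump_utf8('Résumé')
puts dumped             # the accented characters appear literally
puts YAML.load(dumped)  # loads back to the original string
```

If the dump still shows \x escapes, the engine in use is most likely still syck.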

This is probably a really bad idea as I'm sure YAML has its reasons for encoding the characters as it does, but it doesn't seem too hard to undo:
require 'yaml'
require 'yaml/encoding'
text = "Ça va bien?"
puts text.to_yaml(:Encoding => :Utf8) # => --- "\xC3\x87a va bien?"
puts YAML.unescape(YAML.dump(text)) # => --- "Ça va bien?"

Check out Ya2Yaml at RubyForge.

For Ruby 1.9.3+, this is not a problem: the default YAML engine is Psych, which supports UTF-8 by default.
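For Ruby 1.9.2 and earlier, you need to install the psych gem and require it before you require yaml. The first session below requires yaml first and stays on syck; the second requires psych first: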
irb(main):001:0> require 'yaml'
#=> true
irb(main):002:0> require 'psych'
#=> true
irb(main):003:0> YAML::ENGINE
#=> #<YAML::EngineManager:0x00000001a1f642 #yamler="syck">
irb(main):004:0> "ça va?".to_yaml
#=> "--- \"\\xC3\\xA7a va?\"\n"
irb(main):001:0> require 'psych' # gem install psych
#=> true
irb(main):002:0> require 'yaml'
#=> true
irb(main):003:0> YAML::ENGINE
#=> #<YAML::EngineManager:0x00000001a1f828 #yamler="psych">
irb(main):004:0> "ça va bien!".to_yaml
#=> "--- ça va bien!\n...\n"
Alternatively, set the yamler as Evgeny suggests (assuming you have installed the psych gem):
irb(main):001:0> require 'yaml'
#=> true
irb(main):002:0> YAML::ENGINE.yamler
#=> "syck"
irb(main):003:0> "ça va?".to_yaml
#=> "--- \"\\xC3\\xA7a va?\"\n"
irb(main):004:0> YAML::ENGINE.yamler = 'psych'
#=> "psych"
irb(main):005:0> "ça va".to_yaml
#=> "--- ça va\n...\n"

Related

Encoding and decoding ruby symbols

I discovered this behavior of multi_json ruby gem:
2.1.0 :001 > require 'multi_json'
=> true
2.1.0 :002 > sym = :symbol
=> :symbol
2.1.0 :003 > sym.class
=> Symbol
2.1.0 :004 > res = MultiJson.load MultiJson.dump(sym)
=> "symbol"
2.1.0 :005 > res.class
=> String
Is this an appropriate way to store Ruby symbols? Does JSON provide some way to distinguish :symbol from "string"?
The simple answer is no. Most of the time this only matters for hash keys, and there is a shortcut for hashes: symbolize_keys!. The bottom line is that JSON does not understand symbols, only strings.
Since you are using MultiJson, you can also ask MultiJson to do this for you...
MultiJson.load('{"abc":"def"}', :symbolize_keys => true)
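The same behaviour can be seen with Ruby's standard json library, which is used here instead of multi_json so the sketch has no gem dependency. The helper name json_round_trip is made up for illustration. Symbols always come back as strings, and symbolize_names only affects hash keys, not values.

```ruby
require 'json'

# Hypothetical helper: serialize a value to JSON and parse it back.
# symbolize controls whether hash keys come back as symbols.
def json_round_trip(value, symbolize = false)
  JSON.parse(JSON.generate(value), :symbolize_names => symbolize)
end

p json_round_trip([:symbol])               # the symbol comes back as a String
p json_round_trip({ :abc => :def })        # keys and values are Strings
p json_round_trip({ :abc => :def }, true)  # keys are Symbols, values stay Strings
```

If symbol values have to survive, the application needs its own convention (for example a type field), since JSON itself has nothing to record the difference.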

ruby 1.8.7 why .to_yaml converts some Strings to non-readable bytes

While parsing some webpages with Nokogiri, I ran into issues when cleaning some strings and saving them with YAML. This IRB session reproduces the problem:
irb(main):001:0> require 'yaml'
=> true
irb(main):002:0> "1,000 €".to_yaml
=> "--- !binary |\nMSwwMDAg4oKs\n\n"
irb(main):003:0> "1,0000 €".to_yaml
=> "--- \"1,0000 \\xE2\\x82\\xAC\"\n"
irb(main):004:0> "1,00 €".to_yaml
=> "--- !binary |\nMSwwMCDigqw=\n\n"
irb(main):005:0> "1 €".to_yaml
=> "--- !binary |\nMSDigqw=\n\n"
irb(main):006:0> "23 €".to_yaml
=> "--- !binary |\nMjMg4oKs\n\n"
irb(main):007:0> "12000 €".to_yaml
=> "--- !binary |\nMTIwMDAg4oKs\n\n"
irb(main):008:0> "1200000 €".to_yaml
=> "--- \"1200000 \\xE2\\x82\\xAC\"\n"
irb(main):009:0> "120000 €".to_yaml
=> "--- \"120000 \\xE2\\x82\\xAC\"\n"
irb(main):010:0> "12000 €".to_yaml
=> "--- !binary |\nMTIwMDAg4oKs\n\n"
To sum up, sometimes .to_yaml outputs are readable while other times the output is unreadable. The most intriguing aspect is that the strings are very similar.
How can I avoid those !binary ... outputs?
Whether YAML dumps a string as text or as binary depends on the ratio of ASCII to non-ASCII characters in it.
If you want to avoid !binary as much as possible, you should use the ya2yaml gem. It tries hard to dump strings as ASCII + escaped UTF-8.
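For comparison, here is a sketch of how the same string behaves under Psych (Ruby 1.9.3+), not the Syck engine of the 1.8.7 session above. As far as I know, Psych looks at the string's encoding: a string tagged as UTF-8 is dumped as text, while the same bytes tagged as binary get the !binary treatment. The helper name dumped_as_binary? is made up for illustration.

```ruby
# encoding: utf-8
require 'yaml'

# Hypothetical helper: true when the YAML dump of text uses the !binary tag.
def dumped_as_binary?(text)
  YAML.dump(text).include?('!binary')
end

utf8  = '1,000 €'                               # tagged as UTF-8
bytes = '1,000 €'.dup.force_encoding('BINARY')  # same bytes, tagged as binary

puts dumped_as_binary?(utf8)   # expected to be dumped as plain text
puts dumped_as_binary?(bytes)  # expected to be dumped with !binary
```

So on a Psych-based Ruby, making sure scraped strings carry the right encoding is usually enough to keep them readable in the dump.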

replicate CSV.generate_line behaviour of ruby 1.8.7 in ruby 1.9.2

Ruby 1.9 now uses FasterCSV as its csv library, but how do I replicate the generate_line behaviour of Ruby 1.8.7?
ruby-1.8.7-p334 :010 > require 'csv'
=> true
ruby-1.8.7-p334 :010 > CSV.generate_line(["ab","cd"], "\t")
=> "ab\tcd"
ruby-1.9.2-p180 :002 > require 'csv'
=> true
ruby-1.9.2-p180 :007 > CSV.generate_line(["ab","cd"], :row_sep => ?\t)
=> "ab,cd\t"
Notice how \t is between the two array items in Ruby 1.8.7 but at the end of the line in 1.9.2.
You have to use col_sep instead. row_sep is the row separator:
CSV.generate_line(["ab","cd"], :col_sep => ?\t)
=> "ab\tcd\n"
or
CSV.generate_line(["ab","cd"], :col_sep => ?\t, :row_sep => '')
=> "ab\tcd"
You can find more details and additional options in the documentation.
CSV.generate_line(['a','b','c'],:col_sep=>"\t")
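Putting the two options together, a small wrapper can stand in for the old two-argument call. This is only a sketch; the helper name tab_line is made up for illustration.

```ruby
require 'csv'

# Hypothetical helper: join fields with tabs and no trailing row separator,
# like CSV.generate_line(fields, "\t") did in 1.8.7.
def tab_line(fields)
  CSV.generate_line(fields, :col_sep => "\t", :row_sep => '')
end

puts tab_line(%w[ab cd]).inspect  # => "ab\tcd"
```

Leaving out :row_sep keeps the default newline at the end of the line, which is usually what you want when writing to a file.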

Using Netbeans, why does Ruby debug not display multibytes string properly?

The environment is: NetBeans (v=6.9.1), ruby-debug-base (v=0.10.4), ruby-debug-ide (v=0.4.16), Ruby (v=1.8.7)
While debugging a Ruby script, the debugger cannot display multibyte strings properly and always shows "Binary Data" for them in the variables window:
require 'rubygems'
require 'active_support'
str = "调试程序"
str = str.mb_chars
puts "length: #{str.length}"
BTW, I tried versions 0.4.16 and 0.4.11 of ruby-debug-ide, but both give the same output.
Can someone tell me how to make the debugger display multibyte strings properly in the variables window?
Part of the problem is that Ruby 1.8.7 had only the beginnings of multi-byte support. You probably need to define your $KCODE value for your source. See The $KCODE Variable and jcode Library.
Ruby 1.9.2 has much better support for it, so give it a try if that's an option.
This is from messing around with 1.9.2 and irb:
Greg:~ greg$ irb -f
irb(main):001:0> RUBY_VERSION
=> "1.9.2"
irb(main):002:0> str = "调试程序"
=> "调试程序"
irb(main):003:0> str
=> "调试程序"
irb(main):004:0> str.each_char.to_a
=> ["调", "试", "程", "序"]
irb(main):005:0> str.each_byte.to_a
=> [232, 176, 131, 232, 175, 149, 231, 168, 139, 229, 186, 143]
irb(main):006:0> str.valid_encoding?
=> true
irb(main):007:0> str.codepoints
=> #<Enumerator: "调试程序":codepoints>
irb(main):008:0> str.each_codepoint.to_a
=> [35843, 35797, 31243, 24207]
irb(main):009:0> str.each_codepoint.to_a.map { |i| i.to_s(16) }
=> ["8c03", "8bd5", "7a0b", "5e8f"]
irb(main):010:0> str.encoding
=> #<Encoding:UTF-8>
irb(main):011:0>
And, if I run the following in Textmate while 1.9.2 is set as my default:
# encoding: UTF-8
puts RUBY_VERSION
str = "调试程序"
puts str
which outputs:
# >> 1.9.2
# >> 调试程序
Ruby Debug19 gets mad with the same code so I need to look into what its problem is.
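To check quickly what Ruby 1.9+ thinks of a string before blaming the debugger, something like the following can help. It is only a sketch and the helper name describe is made up for illustration.

```ruby
# encoding: utf-8

# Hypothetical helper: summarize how Ruby 1.9+ sees a string.
def describe(str)
  {
    :encoding => str.encoding.name,
    :chars    => str.length,          # counted in characters
    :bytes    => str.bytesize,        # counted in bytes
    :valid    => str.valid_encoding?
  }
end

p describe("调试程序")
```

If the encoding is reported as ASCII-8BIT or valid comes back false, the string was probably read or tagged with the wrong encoding somewhere upstream.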

What is the difference between Ruby 1.8 and Ruby 1.9

I'm not clear on the differences between the "current" version of Ruby (1.8) and the "new" version (1.9). Is there an "easy" or a "simple" explanation of the differences and why it is so different?
Sam Ruby has a cool slideshow that outlines the differences.
In the interest of bringing this information inline for easier reference, and in case the link goes dead in the future, here's an overview of Sam's slides. The slideshow is less overwhelming to review, but having it all laid out in a list like this is also helpful.
Ruby 1.9 - Major Features
Performance
Threads/Fibers
Encoding/Unicode
gems is (mostly) built-in now
if statements do not introduce scope in Ruby.
What's changed?
Single character strings.
Ruby 1.9
irb(main):001:0> ?c
=> "c"
Ruby 1.8.6
irb(main):001:0> ?c
=> 99
String index.
Ruby 1.9
irb(main):001:0> "cat"[1]
=> "a"
Ruby 1.8.6
irb(main):001:0> "cat"[1]
=> 97
{"a","b"} No Longer Supported
Ruby 1.9
irb(main):002:0> {1,2}
SyntaxError: (irb):2: syntax error, unexpected ',', expecting tASSOC
Ruby 1.8.6
irb(main):001:0> {1,2}
=> {1=>2}
Action: Convert to {1 => 2}
Array.to_s Now Contains Punctuation
Ruby 1.9
irb(main):001:0> [1,2,3].to_s
=> "[1, 2, 3]"
Ruby 1.8.6
irb(main):001:0> [1,2,3].to_s
=> "123"
Action: Use .join instead
Colon No Longer Valid In When Statements
Ruby 1.9
irb(main):001:0> case 'a'; when /\w/: puts 'word'; end
SyntaxError: (irb):1: syntax error, unexpected ':',
expecting keyword_then or ',' or ';' or '\n'
Ruby 1.8.6
irb(main):001:0> case 'a'; when /\w/: puts 'word'; end
word
Action: Use semicolon, then, or newline
Block Variables Now Shadow Local Variables
Ruby 1.9
irb(main):001:0> i=0; [1,2,3].each {|i|}; i
=> 0
irb(main):002:0> i=0; for i in [1,2,3]; end; i
=> 3
Ruby 1.8.6
irb(main):001:0> i=0; [1,2,3].each {|i|}; i
=> 3
Hash.index Deprecated
Ruby 1.9
irb(main):001:0> {1=>2}.index(2)
(irb):18: warning: Hash#index is deprecated; use Hash#key
=> 1
irb(main):002:0> {1=>2}.key(2)
=> 1
Ruby 1.8.6
irb(main):001:0> {1=>2}.index(2)
=> 1
Action: Use Hash.key
Fixnum.to_sym Now Gone
Ruby 1.9
irb(main):001:0> 5.to_sym
NoMethodError: undefined method 'to_sym' for 5:Fixnum
Ruby 1.8.6
irb(main):001:0> 5.to_sym
=> nil
(Cont'd) Ruby 1.9
# Find an argument value by name or index.
def [](index)
lookup(index.to_sym)
end
svn.ruby-lang.org/repos/ruby/trunk/lib/rake.rb
Hash Keys Now Keep Insertion Order
Ruby 1.9
irb(main):001:0> {:a=>"a", :c=>"c", :b=>"b"}
=> {:a=>"a", :c=>"c", :b=>"b"}
Ruby 1.8.6
irb(main):001:0> {:a=>"a", :c=>"c", :b=>"b"}
=> {:a=>"a", :b=>"b", :c=>"c"}
Order is insertion order
Stricter Unicode Regular Expressions
Ruby 1.9
irb(main):001:0> /\x80/u
SyntaxError: (irb):2: invalid multibyte escape: /\x80/
Ruby 1.8.6
irb(main):001:0> /\x80/u
=> /\x80/u
tr and Regexp Now Understand Unicode
Ruby 1.9
unicode(string).tr(CP1252_DIFFERENCES, UNICODE_EQUIVALENT).
gsub(INVALID_XML_CHAR, REPLACEMENT_CHAR).
gsub(XML_PREDEFINED) {|c| PREDEFINED[c.ord]}
pack and unpack
Ruby 1.8.6
def xchr(escape=true)
n = XChar::CP1252[self] || self
case n when *XChar::VALID
XChar::PREDEFINED[n] or
(n<128 ? n.chr : (escape ? "&##{n};" : [n].pack('U*')))
else
Builder::XChar::REPLACEMENT_CHAR
end
end
unpack('U*').map {|n| n.xchr(escape)}.join
BasicObject More Brutal Than BlankSlate
Ruby 1.9
irb(main):001:0> class C < BasicObject; def f; Math::PI; end; end; C.new.f
NameError: uninitialized constant C::Math
Ruby 1.8.6
irb(main):001:0> require 'blankslate'
=> true
irb(main):002:0> class C < BlankSlate; def f; Math::PI; end; end; C.new.f
=> 3.14159265358979
Action: Use ::Math::PI
Delegation Changes
Ruby 1.9
irb(main):002:0> class C < SimpleDelegator; end
=> nil
irb(main):003:0> C.new('').class
=> String
Ruby 1.8.6
irb(main):002:0> class C < SimpleDelegator; end
=> nil
irb(main):003:0> C.new('').class
=> C
irb(main):004:0>
Defect 17700
Use of $KCODE Produces Warnings
Ruby 1.9
irb(main):004:1> $KCODE = 'UTF8'
(irb):4: warning: variable $KCODE is no longer effective; ignored
=> "UTF8"
Ruby 1.8.6
irb(main):001:0> $KCODE = 'UTF8'
=> "UTF8"
instance_methods Now an Array of Symbols
Ruby 1.9
irb(main):001:0> {}.methods.sort.last
=> :zip
Ruby 1.8.6
irb(main):001:0> {}.methods.sort.last
=> "zip"
Action: Replace instance_methods.include? with method_defined?
Source File Encoding
Basic
# coding: utf-8
Emacs
# -*- encoding: utf-8 -*-
Shebang
#!/usr/local/rubybook/bin/ruby
# encoding: utf-8
Real Threading
Race Conditions
Implicit Ordering Assumptions
Test Code
What's New?
Alternate Syntax for Symbol as Hash Keys
Ruby 1.9
{a: b}
redirect_to action: show
Ruby 1.8.6
{:a => b}
redirect_to :action => show
Block Local Variables
Ruby 1.9
[1,2].each {|value; t| t=value*value}
Inject Methods
Ruby 1.9
[1,2].inject(:+)
Ruby 1.8.6
[1,2].inject {|a,b| a+b}
to_enum
Ruby 1.9
short_enum = [1, 2, 3].to_enum
long_enum = ('a'..'z').to_enum
loop do
puts "#{short_enum.next} #{long_enum.next}"
end
No block? Enum!
Ruby 1.9
e = [1,2,3].each
Lambda Shorthand
Ruby 1.9
p = -> a,b,c {a+b+c}
puts p.(1,2,3)
puts p[1,2,3]
Ruby 1.8.6
p = lambda {|a,b,c| a+b+c}
puts p.call(1,2,3)
Complex Numbers
Ruby 1.9
Complex(3,4) == 3 + 4.im
Decimal Is Still Not The Default
Ruby 1.9
irb(main):001:0> 1.2-1.1
=> 0.0999999999999999
Regex “Properties”
Ruby 1.9
/\p{Space}/
Ruby 1.8.6
/[[:space:]]/
Splat in Middle
Ruby 1.9
def foo(first, *middle, last)
(->a, *b, c {p a-c}).(*5.downto(1))
Fibers
Ruby 1.9
f = Fiber.new do
a,b = 0,1
Fiber.yield a
Fiber.yield b
loop do
a,b = b,a+b
Fiber.yield b
end
end
10.times {puts f.resume}
Break Values
Ruby 1.9
match =
while line = gets
next if line =~ /^#/
break line if line.include?('ruby')
end
“Nested” Methods
Ruby 1.9
def toggle
def toggle
"subsequent times"
end
"first time"
end
HTH!
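Several of the changes listed above are easy to verify yourself. The sketch below wraps a few of them in small methods (the method names are made up for illustration). Each returns the Ruby 1.9+ result, with the 1.8 result noted in the comment.

```ruby
def char_literal
  ?c                 # a one-character String in 1.9+, 99 in 1.8
end

def string_index
  "cat"[1]           # "a" in 1.9+, 97 in 1.8
end

def array_to_s
  [1, 2, 3].to_s     # "[1, 2, 3]" in 1.9+, "123" in 1.8
end

def hash_key_order
  { :a => "a", :c => "c", :b => "b" }.keys  # insertion order is kept in 1.9+
end

def block_param_shadowing
  i = 0
  [1, 2, 3].each { |i| }   # the block's i is a separate variable in 1.9+
  i                        # 0 in 1.9+, 3 in 1.8
end

def for_loop_variable
  i = 0
  for i in [1, 2, 3]; end  # for still assigns the outer i
  i                        # 3 in both versions
end

def inject_with_symbol
  [1, 2].inject(:+)        # symbol form is new in 1.9
end

def stabby_lambda
  add = ->(a, b, c) { a + b + c }  # lambda shorthand is new in 1.9
  add.(1, 2, 3)
end

p char_literal, string_index, array_to_s, hash_key_order
p block_param_shadowing, for_loop_variable, inject_with_symbol, stabby_lambda
```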
One huge difference would be the move from Matz's interpreter to YARV, a bytecode virtual machine that helps significantly with performance.
Many now recommend The Ruby Programming Language over the Pickaxe - more to the point, it has all the details of the 1.8/1.9 differences.
Some more changes:
Returning a splat singleton array:
def function
return *[1]
end
a=function
ruby 1.9 : [1]
ruby 1.8 : 1
array arguments
def function(array)
array.each { |v| p v }
end
function "1"
ruby 1.8: "1"
ruby 1.9: undefined method `each' for "1":String
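Both points can be tried out with a short sketch. The Array() wrapper in the second method is not from the answer above. It is just one common way, assumed here, to accept either a single value or a list now that String no longer responds to each. The method names are made up for illustration.

```ruby
# Returning a splatted one-element array yields the array itself on 1.9+.
def splat_return
  return *[1]
end

# Hypothetical helper: Array() wraps a lone String in an array and leaves
# an existing array alone, so each works for both kinds of argument.
def print_each(values)
  Array(values).each { |v| p v }
end

p splat_return
print_each "1"
print_each ["1", "2"]
```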
