Thor & YAML outputting as binary? - ruby

I'm using Thor and trying to output YAML to a file. In irb I get what I expect: plain text in YAML format. But when the same code runs inside a Thor method, the output is different...
require 'thor'
require 'yaml'

class Foo < Thor
  include Thor::Actions

  desc "bar", "test"
  def set
    test = {"name" => "Xavier", "age" => 30}

    puts test
    # {"name"=>"Xavier", "age"=>30}

    puts test.to_yaml
    # !binary "bmFtZQ==": !binary |-
    #   WGF2aWVy
    # !binary "YWdl": 30

    File.open("data/config.yml", "w") { |f| f.write(test.to_yaml) }
  end
end
Any ideas?

All Ruby 1.9 strings have an encoding attached to them.
YAML encodes some non-UTF-8 strings as binary, even when they look innocent and contain no high-bit characters. You might think your code always uses UTF-8, but builtins can return non-UTF-8 strings (e.g. the File path routines).
To avoid binary encoding, make sure all your strings' encodings are UTF-8 before calling to_yaml. You can change a string's encoding with the force_encoding("UTF-8") method.
For example, this is how I encode my options hash into yaml:
options = {
  :port => 26000,
  :rackup => File.expand_path(File.join(File.dirname(__FILE__), "../sveg.rb"))
}

utf8_options = {}
options.each_pair { |k, v| utf8_options[k] = (v.is_a?(String) ? v.force_encoding("UTF-8") : v) }
puts utf8_options.to_yaml
Here is an example of yaml encoding simple strings as binary
>> x = "test"
=> "test"
>> x.encoding
=> #<Encoding:UTF-8>
>> x.to_yaml
=> "--- test\n...\n"
>> x.force_encoding "ASCII-8BIT"
=> "test"
>> x.to_yaml
=> "--- !binary |-\n dGVzdA==\n"

Since version 1.9.3p125, Ruby's built-in YAML engine treats BINARY-encoded strings differently than before. All you need to do is set a correct non-BINARY encoding before calling to_yaml on your strings.
In Ruby 1.9, every String object has an Encoding object attached,
and as the following blog post (by James Edward Gray II) explains, Ruby has three default encodings that come into play when a String is created:
http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings.
One of them solves your problem here: the source code encoding.
This is the encoding of your source file. It can be specified with a magic encoding comment on the first line, or on the second line if the first line is a shebang.
The magic comment can be any of the following:
# encoding: utf-8
# coding: utf-8
# -*- encoding: utf-8 -*-
So in your case, if you use Ruby 1.9.3p125 or later, the problem should be solved by adding one of these magic comments at the beginning of your code:
# encoding: utf-8
require 'thor'
require 'yaml'

class Foo < Thor
  include Thor::Actions

  desc "bar", "test"
  def bar
    test = {"name" => "Xavier", "age" => 30}

    puts test
    # {"name"=>"Xavier", "age"=>30}

    puts test["name"].encoding.name
    # UTF-8

    puts test.to_yaml
    # ---
    # name: Xavier
    # age: 30

    puts test.to_yaml.encoding.name
    # UTF-8
  end
end
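To confirm the fix end to end, a quick round trip (my own check, not part of the original answer) can read the written file back; it assumes the data/ directory from the question exists:
require 'yaml'

test = {"name" => "Xavier", "age" => 30}
File.write("data/config.yml", test.to_yaml)

# The file should now contain plain readable YAML and load back unchanged.
loaded = YAML.load_file("data/config.yml")
puts loaded == test # => true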

I have been struggling with this using 1.9.3p545 on Windows - just with a simple hash containing strings - and no Thor.
The gem ZAML solves the problem quite simply:
require 'ZAML'
yaml = ZAML.dump(some_hash)
File.write(path_to_yaml_file, yaml)
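ZAML emits plain YAML that the standard library can read back. As a sanity check (my own addition; note that on case-sensitive filesystems the require is the lowercase 'zaml'):
require 'zaml'
require 'yaml'

some_hash = { 'name' => 'Xavier', 'age' => 30 }
yaml = ZAML.dump(some_hash)

puts yaml                         # plain text, no !binary nodes
puts YAML.load(yaml) == some_hash # => true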

Related

`scan': invalid byte sequence in UTF-8 (ArgumentError)

I'm trying to read a .txt file in ruby and split the text line-by-line.
Here is my code:
def file_read(filename)
  File.open(filename, 'r').read
end

puts f = file_read('alice_in_wonderland.txt')
This works perfectly. But when I add the method line_cutter like this:
def file_read(filename)
  File.open(filename, 'r').read
end

def line_cutter(file)
  file.scan(/\w/)
end

puts f = line_cutter(file_read('alice_in_wonderland.txt'))
I get an error:
`scan': invalid byte sequence in UTF-8 (ArgumentError)
I found suggestions for this error online and tried to apply them to my own code, but it's not working. How can I get rid of this error?
Link to the file: http://www.ccs.neu.edu/home/vip/teach/Algorithms/7_hash_RBtree_simpleDS/hw_hash_RBtree/alice_in_wonderland.txt
The linked text file contains the following line:
Character set encoding: ISO-8859-1
If converting the file isn't desired or possible, then you have to tell Ruby that this file is ISO-8859-1 encoded. Otherwise the default external encoding is used (UTF-8 in your case). A possible way to do that is:
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1')
s.encoding # => #<Encoding:ISO-8859-1>
Or even like this if you prefer your string UTF-8 encoded (see utf8everywhere.org):
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
s.encoding # => #<Encoding:UTF-8>
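If the source encoding can't be known up front, another fallback (my own addition, not from the answer above) is to read the raw bytes and replace anything that doesn't convert to UTF-8. This is lossy for the non-ASCII characters, but it makes scan safe to call:
raw = File.read('alice_in_wonderland.txt', mode: 'rb') # raw bytes, ASCII-8BIT

# Replace bytes that are invalid in, or unmappable to, UTF-8 with '?'.
utf8 = raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
utf8.scan(/\w/) # no more ArgumentError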
It seems to work if you read the file directly from the page, maybe there's something funny about the local copy you have. Try this:
require 'net/http'
uri = 'http://www.ccs.neu.edu/home/vip/teach/Algorithms/7_hash_RBtree_simpleDS/hw_hash_RBtree/alice_in_wonderland.txt'
scanned = Net::HTTP.get_response(URI.parse(uri)).body.scan(/\w/)

How do I get YAML in Ruby as of 1.9.3 to dump ASCII-8BIT strings as strings?

Here's the problem: I might have strings that are UTF-8, and I might have strings that are US-ASCII. Regardless of the encoding, I'd like YAML.dump(str) to actually dump String objects, instead of these useless !binary objects as the example shows.
Is there a flag or something I'm not seeing to force YAML.dump() to do the right thing?
Ruby 1.9.1 example
YAML::VERSION # "0.60"
a = "foo" # => "foo"
a.force_encoding("BINARY") # => "foo"
YAML.dump(a) # => "--- foo\n"
Ruby 1.9.3 example
YAML::VERSION # "1.2.2"
a = "foo" # => "foo"
a.force_encoding("BINARY") # => "foo"
YAML.dump(a) # => "--- !binary |-\n Zm9v\n"
Update: Got my own answer
YAML::ENGINE.yamler = 'syck'
YAML.dump(a) # => "--- foo\n"
So it looks like using the old syck yamler engine will force the old behavior.
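Note that syck was removed from the standard library in Ruby 2.0, so on Psych the portable fix is the same as in the Thor answer above: give the string a real encoding before dumping. A minimal sketch of my own:
require 'yaml'

a = "foo".force_encoding("BINARY")
YAML.dump(a) # => "--- !binary |-\n  Zm9v\n"

# If the bytes are known to be valid UTF-8, relabel them before dumping.
YAML.dump(a.dup.force_encoding("UTF-8")) # => "--- foo\n"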

ruby 1.9 character conversion errors while testing regex

I know there are tons of docs and debates out there, but still:
This is my best shot in my Rails attempt at testing scraped data from various websites. The strange thing is that if I manually copy-paste the source of a URL, everything works fine.
What can I do?
# encoding: utf-8
require 'rubygems'
require 'iconv'
require 'nokogiri'
require 'open-uri'
require 'uri'

url = 'http://www.website.com/url/test'

sio = open(url)
cur_encoding = sio.charset
doc = Nokogiri::HTML(sio, nil, cur_encoding)
txtdoc = doc.to_s

# 1) String manipulation test
p doc.search('h1')[0].text        # "Nove36  "
p doc.search('h1')[0].text.strip! # nil <- ERROR

# 2) Regex test
# txtdoc = "test test 44.00 € test test" # <- THIS WORKS
regex = "[0-9.]+ €"
p /#{regex}/i =~ txtdoc # integer expected
I realize that my OS (Ubuntu) plus my text editor are probably doing some helpful conversion over a broken encoding: that's fine, but how can I fix this problem in my app while it's running live?
cur_encoding = doc.encoding # => ISO-8859-15
ISO-8859-15 is not the correct encoding for the quoted page; it should have been UTF-8. iconving it to UTF-8 as if it were 8859-15 only compounds the problem.
This encoding is coming from a faulty <meta> tag in the document. A browser will ignore that tag and use the overriding encoding from the Content-Type: text/html;charset=utf-8 HTTP response header.
However, Nokogiri appears not to be able to read this header from the open()ed stream. With the caveat that I know nothing about Ruby, looking at the source, the problem seems to be that it uses the encoding property of the string-or-IO instead of charset, which is what open-uri writes.
You can pass in an override encoding of your own, so I guess try:
sio = open(url)
doc = Nokogiri::HTML.parse(sio, nil, sio.charset) # should be UTF-8?
The problems you're having are caused by non breaking space characters (Unicode U+00A0) in the page.
In your first problem, the string:
"Nove36 "
actually ends with U+00A0, and String#strip! doesn't consider this character to be whitespace to be removed:
1.9.3-p125 :001 > s = "Foo \u00a0"
 => "Foo  "
1.9.3-p125 :002 > s.strip
 => "Foo  " # unchanged
1.9.3-p125 :003 > s =~ /Foo  / # 2 spaces, no match
 => nil
1.9.3-p125 :004 > s =~ /Foo / # 1 space, match
 => 0
1.9.3-p125 :005 > s =~ /Foo \u00a0/ # space and non-breaking space, match
 => 0
When you copy and paste the source, the browser probably normalises the non-breaking spaces, so you only copy normal space characters, which is why it works that way.
The simplest fix would be to do a global substitution of \u00a0 for space before you start processing:
sio = open(url)
cur_encoding = sio.charset

txt = sio.read           # read the whole file
txt.gsub!("\u00a0", " ") # global replace

doc = Nokogiri::HTML(txt, nil, cur_encoding) # use this new string instead...
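As an aside (my own addition, not from the answer above): Ruby's POSIX bracket classes are Unicode-aware on UTF-8 strings, so you can also strip all Unicode whitespace, including U+00A0, with a regex instead of String#strip:
s = "Nove36 \u00a0"

# [[:space:]] matches Unicode whitespace (including the non-breaking
# space), while \s and String#strip only handle ASCII whitespace.
stripped = s.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
puts stripped.inspect # => "Nove36"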

hash strings get improperly encoded

I have a simple constant hash with string keys defined:
MY_CONSTANT_HASH = {
  'key1' => 'value1'
}
Now, I've noticed that encoding.name on the key is US-ASCII. However, Encoding.default_internal is set to UTF-8 beforehand. Why is it not being properly encoded? I can't force_encoding later, because the object is frozen at that point, so I get this error:
can't modify frozen String
P.S.: I'm using ruby 1.9.3p0 (2011-10-30 revision 33570).
The default internal and external encodings are aimed at IO operations:
CSV
File data read from disk
File names from Dir
etc...
The easiest thing for you to do is to add a # encoding=utf-8 comment to tell Ruby that the source file is UTF-8 encoded. For example, if you run this:
# encoding=utf-8
H = { 'this' => 'that' }
puts H.keys.first.encoding
as a stand-alone Ruby script you'll get UTF-8, but if you run this:
H = { 'this' => 'that' }
puts H.keys.first.encoding
you'll probably get US-ASCII.
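To make the distinction concrete, here is a small sketch of my own showing that Encoding.default_internal transcodes data arriving through IO, while literals keep the source encoding (assuming Ruby 1.9 and a UTF-8 locale):
# No magic comment here, so on Ruby 1.9 ASCII-only literals are US-ASCII.
Encoding.default_internal = Encoding::UTF_8

literal = 'key1'
puts literal.encoding # => US-ASCII: default_internal has no effect on literals

File.write('demo.txt', 'key1')
puts File.read('demo.txt').encoding # => UTF-8: IO transcodes to default_internal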

Ruby: Generate a utf-8 character from code point as string

I need to write UTF-8 characters to a file. I have the code points as strings, e.g. "5363" or "328E", but I can't just prepend \u to them to build an escape like "\u5363". Please help.
(this will work if you have ruby 1.9 or newer)
$ irb -E utf-8
irb(main):032:0> s=""
=> ""
irb(main):033:0> i=0x328e
=> 12942
irb(main):034:0> s<<i
=> "㊎"
irb(main):036:0> s<<0x5363
=> "㊎卣"
for your case:
my_char_codes = ["5363","328E"]
s = ""
my_char_codes.each{ |c| s << c.to_i(16) }
# now s contains "㊎卣"
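An equivalent one-liner (my own variant) uses Array#pack with the U directive, which packs integers as UTF-8 code points:
my_char_codes = ["5363", "328E"]

# "U*" packs each integer as a UTF-8 encoded character.
s = my_char_codes.map { |c| c.to_i(16) }.pack("U*")
puts s # => "卣㊎"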
