Ruby: parse YAML from ANSI to UTF-8

Problem:
I have the YAML file test.yml, which may be encoded in UTF-8 or ANSI:
:excel:
  "Test":
    "eins_Ä": :eins
    "zwei_ä": :zwei
When I load the file I need it to be encoded in UTF-8, so I tried to convert all of the Strings:
require 'yaml'
file = YAML::load_file('C:/Users/S61256/Desktop/test.yml')

require 'iconv'
CONV = Iconv.new("UTF-8", "ASCII")

class Test
  def convert(hash)
    hash.each { |key, value|
      convert(value) if value.is_a? Hash
      CONV.iconv(value) if value.is_a? String
      CONV.iconv(key) if key.is_a? String
    }
  end
end

t = Test.new
converted = t.convert(file)
p file
p converted
But when I try to run this example script it prints:
in 'iconv': eins_- (Iconv::IllegalSequence)
Questions:
1. Why does the error show up and how can I solve it?
2. Is there another (more appropriate) way to get the file's content in UTF-8?
Note:
I need this code to be compatible with Ruby 1.8 as well as Ruby 2.2. For Ruby 2.2 I would replace all the Iconv stuff with String::encode, but that's another topic.

The easiest way to deal with wrongly encoded files is to read them in their original encoding, convert to UTF-8, and then pass the result to the receiver (YAML in this case):
▶ YAML.load File.read('/tmp/q.yml', encoding: 'ISO-8859-1').force_encoding 'UTF-8'
#⇒ {:excel=>{"Test"=>{"eins_Ä"=>:eins, "zwei_ä"=>:zwei}}}
For Ruby 1.8 you should probably use Iconv, but the whole process (read as-is, then convert, then YAML-load) remains the same.
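A minimal sketch of that flow for Ruby 1.8, assuming "ANSI" means ISO-8859-1 here (the path is the one from the question):

require 'yaml'
require 'iconv'

raw  = File.open('C:/Users/S61256/Desktop/test.yml', 'rb') { |f| f.read } # read as-is
utf8 = Iconv.conv('UTF-8', 'ISO-8859-1', raw)                             # then convert
file = YAML.load(utf8)                                                    # then yaml-load
# On Ruby 2.2 the Iconv line would become:
# utf8 = raw.force_encoding('ISO-8859-1').encode('UTF-8')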

Related

convert byte to string in rails

How do I convert byte data to a string so that I can Base64-decode it and then zlib-decompress it? Example:
data = b'eJzLSM3JyQcABiwCFQ=='
Zlib::Inflate.inflate(Base64.decode64(bin_to_hex(data)))
def bin_to_hex(s)
  s.unpack('C*').first
end
I'm getting "\xE2" from ASCII-8BIT to UTF-8 and also undefined method `unpack'.
You are overcomplicating things. I have no idea what the leading b in the data literal is supposed to mean, but this would work:
require 'base64'
require 'zlib'
Zlib::Inflate.inflate Base64.decode64('eJzLSM3JyQcABiwCFQ==')
#⇒ "hello"

Encode string as \uXXXX

I am trying to port code from Python to Ruby and am having difficulties with one of the functions, which encodes a UTF-8 string as JSON.
I have stripped the code down to what I believe is the problem.
I would like to make Ruby produce exactly the same output as Python.
The python code:
#!/usr/bin/env python
# encoding: utf-8
import json
import hashlib
text = "ÀÈG"
js = json.dumps( { 'data': text } )
print 'Python:'
print js
print hashlib.sha256(js).hexdigest()
The ruby code:
#!/usr/bin/env ruby
require 'json'
require 'digest'
text = "ÀÈG"
obj = {'data': text}
# js = obj.to_json # not using this, in order to get the space below
js = %Q[{"data": "#{text}"}]
puts 'Ruby:'
puts js
puts Digest::SHA256.hexdigest js
When I run both, this is the output:
$ ./test.rb && ./test.py
Ruby:
{"data": "ÀÈG"}
6cbe518180308038557d28ecbd53af66681afc59aacfbd23198397d22669170e
Python:
{"data": "\u00c0\u00c8G"}
a6366cbd6750dc25ceba65dce8fe01f283b52ad189f2b54ba1bfb39c7a0b96d3
What do I need to change in the ruby code to make its output identical to the python output (at least the final hash)?
Notes:
I have tried things from this SO question (and others) without success.
The code above produces identical results when using only english characters, so I know the hashing is the same.
Surely someone will come along with a more elegant (or at least a more efficient and robust) solution, but here's one for the time being:
#!/usr/bin/env ruby
require 'json'
require 'digest'

text = 'ÀÈG'
  .encode('UTF-16')                           # convert UTF-8 characters to UTF-16
  .inspect                                    # escape UTF-16 characters and convert back to UTF-8
  .sub(/^"\\u[Ff][Ee][Ff][Ff](.*?)"$/, '\1')  # remove outer quotes and BOM
  .gsub(/\\u\w{4}/, &:downcase!)              # downcase alphas in escape sequences

js = { data: text }                           # wrap in containing data structure
  .to_json(:space=>' ')                       # convert to JSON with spaces after colons
  .gsub(/\\\\u(?=\w{4})/, '\\u')              # remove extra backslashes

puts 'Ruby:', js, Digest::SHA256.hexdigest(js)
Output:
$ ./test.rb
Ruby:
{"data": "\u00c0\u00c8G"}
a6366cbd6750dc25ceba65dce8fe01f283b52ad189f2b54ba1bfb39c7a0b96d3
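A simpler alternative (my own sketch, not part of the answer above) is to escape each non-ASCII character by hand. This matches Python's default ensure_ascii output for characters in the Basic Multilingual Plane; codepoints above U+FFFF would need surrogate pairs, which this does not handle:

#!/usr/bin/env ruby
require 'digest'

# Replace every non-ASCII character with its lowercase \uXXXX escape.
def escape_non_ascii(s)
  s.each_char.map { |c| c.ord < 0x80 ? c : format('\u%04x', c.ord) }.join
end

text = 'ÀÈG'
js = %Q[{"data": "#{escape_non_ascii(text)}"}]
puts js                            #⇒ {"data": "\u00c0\u00c8G"}
puts Digest::SHA256.hexdigest(js)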

`scan': invalid byte sequence in UTF-8 (ArgumentError)

I'm trying to read a .txt file in ruby and split the text line-by-line.
Here is my code:
def file_read(filename)
  File.open(filename, 'r').read
end
puts f = file_read('alice_in_wonderland.txt')
This works perfectly. But when I add the method line_cutter like this:
def file_read(filename)
  File.open(filename, 'r').read
end

def line_cutter(file)
  file.scan(/\w/)
end
puts f = line_cutter(file_read('alice_in_wonderland.txt'))
I get an error:
`scan': invalid byte sequence in UTF-8 (ArgumentError)
I found suggested fixes online and tried to use them in my own code, but it's not working. How can I get rid of this error?
Link to the file: File
The linked text file contains the following line:
Character set encoding: ISO-8859-1
If converting the file itself isn't desired or possible, then you have to tell Ruby that this file is ISO-8859-1 encoded. Otherwise the default external encoding is used (UTF-8 in your case). A possible way to do that is:
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1')
s.encoding # => #<Encoding:ISO-8859-1>
Or even like this if you prefer your string UTF-8 encoded (see utf8everywhere.org):
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
s.encoding # => #<Encoding:UTF-8>
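If the real encoding were unknown and you just wanted the invalid bytes gone, Ruby 2.1+ also offers String#scrub, which replaces invalid sequences instead of reinterpreting them (a sketch):

s = File.read('alice_in_wonderland.txt') # tagged with the default external encoding
clean = s.scrub('?')                     # replace each invalid byte sequence with '?'
clean.scan(/\w/)                         # scans without raising ArgumentError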
It seems to work if you read the file directly from the page; maybe there's something funny about the local copy you have. Try this:
require 'net/http'
uri = 'http://www.ccs.neu.edu/home/vip/teach/Algorithms/7_hash_RBtree_simpleDS/hw_hash_RBtree/alice_in_wonderland.txt'
scanned = Net::HTTP.get_response(URI.parse(uri)).body.scan(/\w/)

hash strings get improperly encoded

I have a simple constant hash with string keys defined:
MY_CONSTANT_HASH = {
  'key1' => 'value1'
}
Now, I've noticed that encoding.name on the key is US-ASCII. However, Encoding.default_internal is set to UTF-8 beforehand. Why is it not being properly encoded? I can't force_encoding later, because the object is frozen at that point, so I get this error:
can't modify frozen String
P.S.: I'm using ruby 1.9.3p0 (2011-10-30 revision 33570).
The default internal and external encodings are aimed at IO operations:
- CSV
- file data read from disk
- file names from Dir
- etc.
The easiest thing for you to do is to add a # encoding=utf-8 comment to tell Ruby that the source file is UTF-8 encoded. For example, if you run this:
# encoding=utf-8
H = { 'this' => 'that' }
puts H.keys.first.encoding
as a stand-alone Ruby script you'll get UTF-8, but if you run this:
H = { 'this' => 'that' }
puts H.keys.first.encoding
you'll probably get US-ASCII.
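If you can't add the magic comment, or the string is already frozen, note that String#encode returns a new (unfrozen) string, so nothing needs to be modified in place (a sketch using the hash from the question):

key = MY_CONSTANT_HASH.keys.first
key.frozen?                    # => true, so force_encoding would raise
utf8_key = key.encode('UTF-8') # new unfrozen string, tagged UTF-8
utf8_key.encoding.name         # => "UTF-8"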

Ruby read CSV file as UTF-8 and/or convert ASCII-8Bit encoding to UTF-8

I'm using ruby 1.9.2
I'm trying to parse a CSV file that contains some French words (e.g. spécifié) and place the contents in a MySQL database.
When I read the lines from the CSV file,
file_contents = CSV.read("csvfile.csv", col_sep: "$")
The elements come back as Strings that are ASCII-8BIT encoded (spécifié becomes sp\xE9cifi\xE9), and strings like "spécifié" are then NOT properly saved into my MySQL database.
Yehuda Katz says that ASCII-8BIT is really "binary" data, meaning that CSV has no idea what the appropriate encoding is.
So, if I try to make CSV force the encoding like this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "UTF-8")
I get the following error
ArgumentError: invalid byte sequence in UTF-8:
If I go back to my original ASCII-8BIT encoded Strings and examine the String that my CSV read as ASCII-8BIT, it looks like this "Non sp\xE9cifi\xE9" instead of "Non spécifié".
I can't convert "Non sp\xE9cifi\xE9" to "Non spécifié" by doing this
"Non sp\xE9cifi\xE9".encode("UTF-8")
because I get this error:
Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8,
which Katz indicated would happen because ASCII-8BIT isn't really a proper String "encoding".
Questions:
Can I get CSV to read my file in the appropriate encoding? If so, how?
How do I convert an ASCII-8BIT string to UTF-8 for proper storage in MySQL?
deceze is right: that is ISO-8859-1 (AKA Latin-1) encoded text. Try this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")
And if that doesn't work, you can use Iconv to fix up the individual strings with something like this:
require 'iconv'
utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first
If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Also, Iconv.iconv can unmangle whole arrays at a time:
utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)
With newer Rubies, you can do things like this:
utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')
where latin1_string thinks it is in ASCII-8BIT but is really in ISO-8859-1.
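For example, with the string from the question:

latin1_string = "Non sp\xE9cifi\xE9" # the literal is tagged ASCII-8BIT
latin1_string.force_encoding('iso-8859-1').encode('utf-8')
#⇒ "Non spécifié"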
With ruby >= 1.9 you can use
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1:utf-8")
The ISO8859-1:utf-8 means: the CSV file is ISO8859-1-encoded, but the content is converted to UTF-8 as it is read.
If you prefer a more verbose code, you can use:
file_contents = CSV.read("csvfile.csv", col_sep: "$",
                         external_encoding: "ISO8859-1",
                         internal_encoding: "utf-8")
I had been dealing with this issue for a while, and none of the other solutions worked for me.
The thing that did the trick was to store the problematic string in a binary file, then read the file back normally and use that string to feed the CSV module:
tempfile = Tempfile.new("conflictive_string")
tempfile.binmode
tempfile.write(conflictive_string)
tempfile.close
cleaned_string = File.read(tempfile.path)
File.delete(tempfile.path)
csv = CSV.new(cleaned_string)
