Ruby CSV BOM|UTF-8 encoding for StringIO - ruby

Ruby 2.6.3.
I have been trying to parse a StringIO object into a CSV instance with the bom|utf-8 encoding, so that the BOM character (undesired) is stripped and the content is encoded to UTF-8:
require 'csv'
CSV_READ_OPTIONS = { headers: true, encoding: 'bom|utf-8' }.freeze
content = StringIO.new("\xEF\xBB\xBFid\n123")
first_row = CSV.parse(content, CSV_READ_OPTIONS).first
first_row.headers.first.include?("\xEF\xBB\xBF") # This returns true
Apparently the bom|utf-8 encoding does not work for StringIO objects, but I found that it does work for files, for instance:
require 'csv'
CSV_READ_OPTIONS = { headers: true, encoding: 'bom|utf-8' }.freeze
# File content is: "\xEF\xBB\xBFid\n12"
first_row = CSV.read('bom_content.csv', CSV_READ_OPTIONS).first
first_row.headers.first.include?("\xEF\xBB\xBF") # This returns false
Considering that I need to work with StringIO directly, why does CSV ignore the bom|utf-8 encoding? Is there any way to remove the BOM character from the StringIO instance?
Thank you!

Ruby 2.7 added the set_encoding_by_bom method to IO. This method consumes the byte order mark and sets the encoding.
require 'csv'
require 'stringio'
CSV_READ_OPTIONS = { headers: true }.freeze
content = StringIO.new("\xEF\xBB\xBFid\n123")
content.set_encoding_by_bom
first_row = CSV.parse(content, CSV_READ_OPTIONS).first
first_row.headers.first.include?("\xEF\xBB\xBF")
#=> false

Ruby doesn't like BOMs. It only handles them when reading a file, never anywhere else, and even then it only reads them so that it can get rid of them. If you want a BOM for your string, or a BOM when writing a file, you have to handle it manually.
There are probably gems for doing this, though it's easy to do yourself:
if string[0...3] == "\xef\xbb\xbf"
  string = string[3..-1].force_encoding('UTF-8')
elsif string[0...2] == "\xff\xfe"
  string = string[2..-1].force_encoding('UTF-16LE')
elsif string[0...2] == "\xfe\xff"
  string = string[2..-1].force_encoding('UTF-16BE')
# etc. for the other BOMs
end
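For the other direction mentioned above (writing a BOM yourself), a minimal sketch could look like this; out.csv and csv_string are only placeholders:
# Prepend a UTF-8 BOM by hand when producing a file.
# csv_string stands in for whatever CSV text you generated.
File.open('out.csv', 'w:UTF-8') do |f|
  f.write("\xEF\xBB\xBF") # lets tools like Excel detect the file as UTF-8
  f.write(csv_string)
end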

I found out that forcing the encoding to UTF-8 on the StringIO's string and removing the BOM to generate a new StringIO worked:
require 'csv'
CSV_READ_OPTIONS = { headers: true }.freeze
content = StringIO.new("\xEF\xBB\xBFid\n123")
csv_file = StringIO.new(content.string.force_encoding('utf-8').sub("\xEF\xBB\xBF", ''))
first_row = CSV.parse(csv_file, CSV_READ_OPTIONS).first
first_row.headers.first.include?("\xEF\xBB\xBF") # => false
The encoding option is no longer needed. It may not be the best option memory-wise, but it works.
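If the extra string copy is a concern, a rough alternative (assuming the StringIO's underlying string is already UTF-8, as in the example) is to advance the StringIO past the three BOM bytes before parsing; this also works on Ruby 2.6:
require 'csv'
require 'stringio'
content = StringIO.new("\xEF\xBB\xBFid\n123")
# Skip the BOM bytes in place; no second string or StringIO is built.
content.pos = 3 if content.string.start_with?("\xEF\xBB\xBF")
first_row = CSV.parse(content, headers: true).first
first_row.headers.first.include?("\xEF\xBB\xBF") # => false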

Related

`scan': invalid byte sequence in UTF-8 (ArgumentError)

I'm trying to read a .txt file in Ruby and split the text line by line.
Here is my code:
def file_read(filename)
  File.open(filename, 'r').read
end
puts f = file_read('alice_in_wonderland.txt')
This works perfectly. But when I add the method line_cutter like this:
def file_read(filename)
  File.open(filename, 'r').read
end
def line_cutter(file)
  file.scan(/\w/)
end
puts f = line_cutter(file_read('alice_in_wonderland.txt'))
I get an error:
`scan': invalid byte sequence in UTF-8 (ArgumentError)
I found a fix suggested online (for handling text from untrusted websites) and tried to use it in my own code, but it's not working. How can I remove this error?
Link to the file: File
The linked text file contains the following line:
Character set encoding: ISO-8859-1
If converting it isn't desired or possible, then you have to tell Ruby that this file is ISO-8859-1 encoded. Otherwise the default external encoding is used (UTF-8 in your case). A possible way to do that is:
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1')
s.encoding # => #<Encoding:ISO-8859-1>
Or even like this if you prefer your string UTF-8 encoded (see utf8everywhere.org):
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
s.encoding # => #<Encoding:UTF-8>
It seems to work if you read the file directly from the page; maybe there's something funny about the local copy you have. Try this:
require 'net/http'
uri = 'http://www.ccs.neu.edu/home/vip/teach/Algorithms/7_hash_RBtree_simpleDS/hw_hash_RBtree/alice_in_wonderland.txt'
scanned = Net::HTTP.get_response(URI.parse(uri)).body.scan(/\w/)

Ruby: parse yaml from ANSI to UTF-8

Problem:
I have the YAML file test.yml, which can be encoded in UTF-8 or ANSI:
:excel:
  "Test":
    "eins_Ä": :eins
    "zwei_ä": :zwei
When I load the file I need it to be encoded in UTF-8, therefore I tried to convert all of the strings:
require 'yaml'
file = YAML::load_file('C:/Users/S61256/Desktop/test.yml')
require 'iconv'
CONV = Iconv.new("UTF-8", "ASCII")
class Test
  def convert(hash)
    hash.each{ |key, value|
      convert(value) if value.is_a? Hash
      CONV.iconv(value) if value.is_a? String
      CONV.iconv(key) if key.is_a? String
    }
  end
end
t = Test.new
converted = t.convert(file)
p file
p converted
But when I try to run this example script it prints:
in `iconv': eins_ (Iconv::IllegalSequence)
Questions:
1. Why does the error show up and how can I solve it?
2. Is there another (more appropriate) way to get the file's content in UTF-8?
Note:
I need this code to be compatible with Ruby 1.8 as well as Ruby 2.2. For Ruby 2.2 I would replace all the Iconv stuff with String#encode, but that's another topic.
The easiest way to deal with wrongly encoded files is to read them in their original encoding, convert to UTF-8, and then pass the result to the receiver (YAML in this case):
▶ YAML.load File.read('/tmp/q.yml', encoding: 'ISO-8859-1').force_encoding 'UTF-8'
#⇒ {:excel=>{"Test"=>{"eins_Ä"=>:eins, "zwei_ä"=>:zwei}}}
For Ruby 1.8 you should probably use Iconv, but the whole process (read as is, then encode, then yaml-load) remains the same.
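For Ruby 1.8 that could look roughly like this (a sketch, assuming the file really is ISO-8859-1/ANSI; Iconv.conv comes from the standard iconv library):
require 'yaml'
require 'iconv'
raw  = File.read('test.yml')                   # Ruby 1.8 reads the raw bytes
utf8 = Iconv.conv('UTF-8', 'ISO-8859-1', raw)  # convert before handing it to YAML
data = YAML.load(utf8)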

SmarterCSV and file encoding issues in Ruby

I'm working with a file that appears to have UTF-16LE encoding. If I run
File.read(file, :encoding => 'utf-16le')
the first line of the file is:
"<U+FEFF>=\"25/09/2013\"\t18:39:17\t=\"Unknown\"\t=\"+15168608203\"\t\"Message.\"\r\n
If I read the file using something like
csv_text = File.read(file, :encoding => 'utf-16le')
I get an error stating
ASCII incompatible encoding needs binmode (ArgumentError)
If I switch the encoding in the above to
csv_text = File.read(file, :encoding => 'utf-8')
I make it to the SmarterCSV section of the code, but get an error that states
`=~': invalid byte sequence in UTF-8 (ArgumentError)
The full code is below. If I run this in the Rails console, it works just fine, but if I run it using ruby test.rb, it gives me the first error:
require 'smarter_csv'
headers = ["date_of_message", "timestamp_of_message", "sender", "phone_number", "message"]
path = '/path/'
Dir.glob("#{path}*.CSV").each do |file|
  csv_text = File.read(file, :encoding => 'utf-16le')
  File.open('/tmp/tmp_file', 'w') { |tmp_file| tmp_file.write(csv_text) }
  puts 'made it here'
  SmarterCSV.process('/tmp/tmp_file', {
    :col_sep => "\t",
    :force_simple_split => true,
    :headers_in_file => false,
    :user_provided_headers => headers
  }).each do |row|
    converted_row = {}
    converted_row[:date_of_message] = row[:date_of_message][2..-2].to_date
    converted_row[:timestamp] = row[:timestamp]
    converted_row[:sender] = row[:sender][2..-2]
    converted_row[:phone_number] = row[:phone_number][2..-2]
    converted_row[:message] = row[:message][1..-2]
    converted_row[:room] = file.gsub(path, '')
  end
end
Update - 05/13/15
Ultimately, I decided to encode the file string as UTF-8 rather than diving deeper into the SmarterCSV code. The first problem in the SmarterCSV code is that it does not allow a user to specify binary mode when reading in a file, but after adjusting the source to handle that, a myriad of other encoding-related issues popped up, many of which related to the handling of various parameters on files that were not UTF-8 encoded. It may have been the easy way out, but encoding everything as UTF-8 before feeding it into SmarterCSV solved my issue.
Add binmode to the File.read call.
File.read(file, :encoding => 'utf-16le', mode: "rb")
"b" Binary file mode
Suppresses EOL <-> CRLF conversion on Windows. And
sets external encoding to ASCII-8BIT unless explicitly
specified.
ref: http://ruby-doc.org/core-2.0.0/IO.html#method-c-read
Now pass the correct encoding to SmarterCSV
SmarterCSV.process('/tmp/tmp_file', {
:file_encoding => "utf-16le", ...
Update
It was found that SmarterCSV does not support binary mode. After the OP attempted to modify the code with no success, it was decided that the simple solution was to convert the input to UTF-8, which SmarterCSV supports.
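A sketch of that convert-to-UTF-8-first approach, reusing the paths and headers from the question (file stands for one path from the Dir.glob loop):
require 'smarter_csv'
headers = ["date_of_message", "timestamp_of_message", "sender", "phone_number", "message"]
# Read as UTF-16LE in binary mode, transcode to UTF-8, and drop the BOM character.
raw  = File.read(file, mode: 'rb', encoding: 'utf-16le')
utf8 = raw.encode('utf-8').sub("\uFEFF", '')
File.open('/tmp/tmp_file', 'w') { |tmp| tmp.write(utf8) }
rows = SmarterCSV.process('/tmp/tmp_file',
                          :col_sep => "\t",
                          :force_simple_split => true,
                          :headers_in_file => false,
                          :user_provided_headers => headers)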
Unfortunately, you're using a 'flat-file' style of storage and character encoding is going to be an issue on both ends (reading or writing).
I would suggest using something along the lines of str = str.force_encoding("UTF-8") and see if you can get that to work.

how to save StringIO (pdf) data into file

I want to save a PDF file, which is located on an external remote server, with Ruby. The PDF file comes in as a StringIO. I tried saving the data with File.write but it is not working. I received the error below:
ArgumentError: string contains null byte
How can I save it now?
require 'stringio'
sio = StringIO.new("he\x00llo")
File.open('data.txt', 'w') do |f|
  f.puts(sio.read)
end
$ cat data.txt
hello
Response to comment:
Okay, try this:
require 'stringio'
sio = StringIO.new("\c2\xb5")
sio.set_encoding('ASCII-8BIT') #Apparently, this is what you have.
File.open('data.txt', 'w:utf-8') do |f|
  f.puts(sio.read)
end
--output:--
1.rb:7:in `write': "\xB5" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
To get rid of that error, you can set the encoding of the StringIO to UTF-8:
require 'stringio'
sio = StringIO.new("\c2\xb5")
sio.set_encoding('ASCII-8BIT') #Apparently, this is what you have.
sio.set_encoding('UTF-8') #Change the encoding to what it should be.
File.open('data.txt', 'w:UTF-8') do |f|
  f.puts(sio.read)
end
Or, you can use the File.open modes:
require 'stringio'
sio = StringIO.new("\c2\xb5")
sio.set_encoding('ASCII-8BIT') #Apparently, this is what you have.
File.open('data.txt', 'w:UTF-8:ASCII-8BIT') do |f|
  f.puts(sio.read)
end
But, that assumes the data is encoded in UTF-8. If you actually have binary data, i.e. data that isn't encoded because it represents a .jpg file for instance, then that won't work.
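If the content really is raw binary (a PDF, for instance), a safer sketch is to skip the encoding juggling and write in binary mode; pdf_data and document.pdf are placeholders here:
require 'stringio'
sio = StringIO.new(pdf_data) # pdf_data: the raw bytes fetched from the remote server
# 'wb' disables newline conversion and transcoding, so null bytes survive intact.
File.open('document.pdf', 'wb') do |f|
  f.write(sio.read)
end
# Or, equivalently: File.binwrite('document.pdf', sio.read)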

Thor & YAML outputting as binary?

I'm using Thor and trying to output YAML to a file. In irb I get what I expect: plain text in YAML format. But when it's part of a method in Thor, the output is different...
class Foo < Thor
  include Thor::Actions
  desc "bar", "test"
  def set
    test = {"name" => "Xavier", "age" => 30}
    puts test
    # {"name"=>"Xavier", "age"=>30}
    puts test.to_yaml
    # !binary "bmFtZQ==": !binary |-
    #   WGF2aWVy
    # !binary "YWdl": 30
    File.open("data/config.yml", "w") { |f| f.write(test.to_yaml) }
  end
end
Any ideas?
All Ruby 1.9 strings have an encoding attached to them.
YAML encodes some non-UTF-8 strings as binary, even when they look innocent, without any high-bit characters. You might think that your code is always using UTF-8, but builtins can return non-UTF-8 strings (e.g. the File path routines).
To avoid binary encoding, make sure all your strings' encodings are UTF-8 before calling to_yaml. Change the encoding with the force_encoding("UTF-8") method.
For example, this is how I encode my options hash into yaml:
options = {
  :port => 26000,
  :rackup => File.expand_path(File.join(File.dirname(__FILE__), "../sveg.rb"))
}
utf8_options = {}
options.each_pair { |k,v| utf8_options[k] = ((v.is_a? String) ? v.force_encoding("UTF-8") : v)}
puts utf8_options.to_yaml
Here is an example of YAML encoding a simple string as binary:
>> x = "test"
=> "test"
>> x.encoding
=> #<Encoding:UTF-8>
>> x.to_yaml
=> "--- test\n...\n"
>> x.force_encoding "ASCII-8BIT"
=> "test"
>> x.to_yaml
=> "--- !binary |-\n dGVzdA==\n"
Since version 1.9.3p125, Ruby's built-in YAML engine treats BINARY-encoded strings differently than before. All you need to do is set a correct, non-BINARY encoding before calling to_yaml on your String.
In Ruby 1.9, every String object has an Encoding object attached to it, and as the following blog post (by James Edward Gray II) explains, Ruby has three default encodings that come into play when a String is created:
http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings
One of them can solve your problem: the source code encoding.
This is the encoding of your source code. It can be specified with a magic encoding comment on the first line, or on the second line if the first line is a shebang.
The magic encoding comment can be one of the following:
# encoding: utf-8
# coding: utf-8
# -*- encoding: utf-8 -*-
So in your case, if you use Ruby 1.9.3p125 or later, this should be solved by adding one of these magic encoding comments at the beginning of your code:
# encoding: utf-8
require 'thor'
class Foo < Thor
  include Thor::Actions
  desc "bar", "test"
  def bar
    test = {"name" => "Xavier", "age" => 30}
    puts test
    # {"name"=>"Xavier", "age"=>30}
    puts test["name"].encoding.name
    # UTF-8
    puts test.to_yaml
    # ---
    # name: Xavier
    # age: 30
    puts test.to_yaml.encoding.name
    # UTF-8
  end
end
I have been struggling with this using 1.9.3p545 on Windows - just with a simple hash containing strings - and no Thor.
The gem ZAML solves the problem quite simply:
require 'ZAML'
yaml = ZAML.dump(some_hash)
File.write(path_to_yaml_file, yaml)
