Currently I'm generating a CSV with UTF-8 encoding from my Administrate UI, but the Swedish letters "åäö" are not shown correctly in Excel or in the label printer programs I'm using (P-touch Editor 5.4 and DYMO Connect).
After talking to their support I've been told the CSV needs to be ANSI encoded. How do I do that?
My code:
def to_csv
  attributes = %w{full_name street_address postal_code city}

  CSV.generate(headers: true, col_sep: ",") do |csv|
    csv << attributes

    orders.all.each do |order|
      csv << attributes.map { |attr| order.address.send(attr) }
    end
  end
end
By default CSV uses Encoding.default_external as its encoding; most likely this is UTF-8.
In your case you have to override it, but first you need to know which ANSI encoding you actually need. (See: What is ANSI format?)
Most likely you can use Windows-1252 or ISO-8859-1.
Then you can set the external encoding of the CSV string like this:
CSV.generate(headers: true, col_sep: ",", encoding: Encoding::ISO_8859_1)
CSV.generate(headers: true, col_sep: ",", encoding: Encoding::WINDOWS_1252)
Strings work, too:
CSV.generate(headers: true, col_sep: ",", encoding: 'ISO-8859-1')
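Note that CSV itself will not transcode the values you append, so if the model attributes come back as UTF-8 strings, convert them explicitly. A minimal sketch of the original to_csv, assuming every character in your data has a Windows-1252 mapping (which holds for "åäö"):

require 'csv'

def to_csv
  attributes = %w{full_name street_address postal_code city}

  CSV.generate(headers: true, col_sep: ",", encoding: Encoding::WINDOWS_1252) do |csv|
    csv << attributes
    orders.all.each do |order|
      # Encoding::UndefinedConversionError is raised here if a value
      # contains a character with no Windows-1252 equivalent
      csv << attributes.map { |attr| order.address.send(attr).to_s.encode(Encoding::WINDOWS_1252) }
    end
  end
end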
Related
Ruby 2.6.3.
I have been trying to parse a StringIO object into a CSV instance with the bom|utf-8 encoding, so that the BOM character (undesired) is stripped and the content is encoded to UTF-8:
require 'csv'
CSV_READ_OPTIONS = { headers: true, encoding: 'bom|utf-8' }.freeze
content = StringIO.new("\xEF\xBB\xBFid\n123")
first_row = CSV.parse(content, CSV_READ_OPTIONS).first
first_row.headers.first.include?("\xEF\xBB\xBF") # This returns true
Apparently the bom|utf-8 encoding does not work for StringIO objects, but I found that it does work for files, for instance:
require 'csv'
CSV_READ_OPTIONS = { headers: true, encoding: 'bom|utf-8' }.freeze
# File content is: "\xEF\xBB\xBFid\n12"
first_row = CSV.read('bom_content.csv', CSV_READ_OPTIONS).first
first_row.headers.first.include?("\xEF\xBB\xBF") # This returns false
Considering that I need to work with StringIO directly, why does CSV ignore the bom|utf-8 encoding? Is there any way to remove the BOM character from the StringIO instance?
Thank you!
Ruby 2.7 added the set_encoding_by_bom method to IO. This method consumes the byte order mark and sets the external encoding accordingly.
require 'csv'
require 'stringio'
CSV_READ_OPTIONS = { headers: true }.freeze
content = StringIO.new("\xEF\xBB\xBFid\n123")
content.set_encoding_by_bom
first_row = CSV.parse(content, CSV_READ_OPTIONS).first
first_row.headers.first.include?("\xEF\xBB\xBF")
#=> false
Ruby doesn't like BOMs. It only handles them when reading a file, never anywhere else, and even then it only reads them so that it can get rid of them. If you want a BOM for your string, or a BOM when writing a file, you have to handle it manually.
There are probably gems for doing this, though it's easy to do yourself:
if string[0...3] == "\xef\xbb\xbf"
  string = string[3..-1].force_encoding('UTF-8')
elsif string[0...2] == "\xff\xfe"
  string = string[2..-1].force_encoding('UTF-16LE')
# etc. for the remaining BOMs (UTF-16BE, UTF-32LE, UTF-32BE)
end
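Going the other way, writing a BOM, is the same kind of manual work. A minimal sketch for UTF-8 output (out.csv and csv_string are placeholders):

# Prepend the UTF-8 BOM by hand; Ruby will not add it for you
File.open('out.csv', 'w:UTF-8') do |f|
  f.write("\uFEFF")
  f.write(csv_string)
end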
I found out that forcing the encoding to UTF-8 on the StringIO string and removing the BOM to build a new StringIO worked:
require 'csv'
CSV_READ_OPTIONS = { headers: true }.freeze
content = StringIO.new("\xEF\xBB\xBFid\n123")
csv_file = StringIO.new(content.string.force_encoding('utf-8').sub("\xEF\xBB\xBF", ''))
first_row = CSV.parse(csv_file, CSV_READ_OPTIONS).first
first_row.headers.first.include?("\xEF\xBB\xBF") # => false
The encoding option is no longer needed. It may not be the best option memory-wise, but it works.
I am generating CSV files that need to be opened and reviewed in Excel once they have been generated. It seems that Excel requires a different encoding than UTF-8.
Here is my config and generation code:
csv_config = { col_sep: ";",
               row_sep: "\n",
               encoding: Encoding::UTF_8 }

csv_string = CSV.generate(csv_config) do |csv|
  csv << ["Text a", "Text b", "Text æ", "Text ø", "Text å"]
end
When opening this in Excel, the special characters are not displayed properly:
Text a Text b Text Ã¦ Text Ã¸ Text Ã¥
Any idea how to ensure proper encoding?
Excel understands UTF-8 CSV if it has a BOM. That can be done like this:
Use CSV.generate
# CSV.generate appends to the string you pass in, so seed it with a BOM
csv_string = CSV.generate("\uFEFF") do |csv|
  csv << ["Text a", "Text b", "Text æ", "Text ø", "Text å"]
end
Use CSV.open
filename = "/tmp/example.csv"

# Default output encoding is UTF-8
CSV.open(filename, "w") do |csv|
  csv.to_io.write "\uFEFF" # use CSV#to_io to write the BOM directly
  csv << ["Text a", "Text b", "Text æ", "Text ø", "Text å"]
end
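To confirm that the BOM is written but stays invisible to Ruby, the file can be read back with the bom|utf-8 mode discussed above, which consumes the marker again:

CSV.read(filename, headers: true, encoding: 'bom|utf-8').headers
#=> ["Text a", "Text b", "Text æ", "Text ø", "Text å"]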
The top-voted answer from @joaofraga worked for me, but I found an alternative solution that also worked, with no UTF-8 to ISO-8859-1 transcoding required.
From what I've read, Excel can indeed handle UTF-8, but for some reason it doesn't recognize it by default. If you add a BOM to the beginning of the CSV data, this seems to make Excel realise that the file is UTF-8.
So, if you have a CSV like so:
csv_string = CSV.generate(csv_config) do |csv|
  csv << ["Text a", "Text b", "Text æ", "Text ø", "Text å"]
end
just add a BOM byte like so:
"\uFEFF" + csv_string
In my case, my controller is sending the CSV as a file, so this is what my controller looks like:
def show
  respond_to do |format|
    format.csv do
      # Add a BOM to force Excel to realise this file is encoded in UTF-8,
      # so it respects the special characters
      send_data "\uFEFF" + csv_string, type: :csv, filename: "csv.csv"
    end
  end
end
I should note that UTF-8 itself does not require or recommend a BOM at all, but as I mentioned, adding it in this case seemed to nudge Excel into realising that the file was indeed UTF-8.
You should switch the encoding to ISO-8859-1 as follows:
CSV.generate(encoding: 'ISO-8859-1') { |csv| csv << ["Text á", "Text é", "Text æ"] }
For your context, you can do this:

config = {
  col_sep: ';',
  row_sep: "\n", # row_sep must differ from col_sep for the output to be parseable
  encoding: 'ISO-8859-1'
}

CSV.generate(config) { |csv| csv << ["Text á", "Text é", "Text æ"] }
I had the same issue, and this encoding fixed it:

config = {
  encoding: 'ISO-8859-1'
}

CSV.generate(config) { |csv| csv << ["Text á", "Text é", "Text æ"] }
With https://github.com/gtd/csv_builder, I had to do both of the following.
In the controller action:
@output_encoding = 'UTF-8'
send_data "\uFEFF" + render_to_string(), type: :csv, filename: @filename
Atop the csv.csvbuilder template:
faster_csv.to_io.write("\uFEFF")
I don't know why I had to add the BOM twice, but it did not work with either one on its own.
I have this CSV file:
$ file data.csv
data.csv: ASCII text
The file has ~10000 lines with some UTF-8 literal escape sequences in it. For example:
1388357672.209253000,48:a2:2d:78:84:10,\xe5\x87\xb6\xe5\xb7\xb4\xe5\xb7\xb4\xe8\x87\xad\xe7\x98\xaa\xe7\x98\xaa\xe7\x9a\x84\xe6\x80\xaa\xe5\x85\xbd\xe5\x87\xba
I iterate over this file in Ruby and save every line to my PostgreSQL database:
File.open(filename, "r").each_line do |line|
  CSV.parse(line, encoding: 'UTF-8') do |row|
    # Save to PostgreSQL
  end
end
The problem is that the literal string is saved to the database rather than the decoded UTF-8 string. I can convert every line with echo -e "line", but that takes much time. Is there a way Ruby can do this task?
Try this:
CSV.parse(line, encoding: 'UTF-8') do |row|
  row = row.map do |elem|
    # Replace each literal \xNN sequence with the byte it names,
    # then tag the result as UTF-8
    elem.gsub(/\\x../) { |s| [s[2..-1].hex].pack("C") }.force_encoding("UTF-8")
  end
  # Save to PostgreSQL
end
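To see the substitution at work on a single cell from the sample line:

elem = '\xe5\x87\xb6' # single-quoted, so the backslashes are literal, as in the file
elem.gsub(/\\x../) { |s| [s[2..-1].hex].pack("C") }.force_encoding("UTF-8")
#=> "凶"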
Just put each cell in double quotes:
"\xe5\x87\xb6\xe5\xb7\xb4\xe5\xb7\xb4\xe8\x87\xad\xe7\x98\xaa\xe7\x98\xaa\xe7\x9a\x84\xe6\x80\xaa\xe5\x85\xbd\xe5\x87\xba"
=> "凶巴巴臭瘪瘪的怪兽出"
I'm using Ruby 1.9.2.
I'm trying to parse a CSV file that contains some French words (e.g. spécifié) and place the contents in a MySQL database.
When I read the lines from the CSV file,
file_contents = CSV.read("csvfile.csv", col_sep: "$")
The elements come back as Strings that are ASCII-8BIT encoded (spécifié becomes sp\xE9cifi\xE9), and strings like "spécifié" are then NOT properly saved into my MySQL database.
Yehuda Katz says that ASCII-8BIT is really "binary" data, meaning that CSV has no idea what the appropriate encoding is.
So, if I try to make CSV force the encoding like this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "UTF-8")
I get the following error
ArgumentError: invalid byte sequence in UTF-8:
If I go back to my original ASCII-8BIT encoded Strings and examine the String that my CSV read as ASCII-8BIT, it looks like this "Non sp\xE9cifi\xE9" instead of "Non spécifié".
I can't convert "Non sp\xE9cifi\xE9" to "Non spécifié" by doing this
"Non sp\xE9cifi\xE9".encode("UTF-8")
because I get this error:
Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8,
which Katz indicated would happen because ASCII-8BIT isn't really a proper String "encoding".
Questions:
Can I get CSV to read my file in the appropriate encoding? If so, how?
How do I convert an ASCII-8BIT string to UTF-8 for proper storage in MySQL?
deceze is right, that is ISO8859-1 (AKA Latin-1) encoded text. Try this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")
And if that doesn't work, you can use Iconv to fix up the individual strings with something like this:
require 'iconv'
utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first
If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Also, Iconv.iconv can unmangle whole arrays at a time:
utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)
With newer Rubies, you can do things like this:
utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')
where latin1_string thinks it is in ASCII-8BIT but is really in ISO-8859-1.
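For example, with the string from the question:

"Non sp\xE9cifi\xE9".force_encoding('iso-8859-1').encode('utf-8')
#=> "Non spécifié"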
With Ruby >= 1.9 you can use
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1:utf-8")
The ISO8859-1:utf-8 means: the CSV file is ISO8859-1-encoded, but the content is converted to UTF-8 as it is read.
If you prefer more verbose code, you can use:
file_contents = CSV.read("csvfile.csv", col_sep: "$",
                         external_encoding: "ISO8859-1",
                         internal_encoding: "utf-8")
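Either way, the parsed cells come back tagged as UTF-8, which you can verify (assuming the first cell is non-empty):

file_contents.first.first.encoding
#=> #<Encoding:UTF-8>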
I had been dealing with this issue for a while, and none of the other solutions worked for me. What did the trick was to store the problematic string in a binary file, then read the file back normally and use that string to feed the CSV module:
tempfile = Tempfile.new("conflictive_string")
tempfile.binmode
tempfile.write(conflictive_string)
tempfile.close
cleaned_string = File.read(tempfile.path)
File.delete(tempfile.path)
csv = CSV.new(cleaned_string)
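What the round trip really does is re-tag the raw bytes with the default external encoding, so the same effect can likely be had in memory without touching the disk; a sketch, assuming your data actually is in the default external encoding:

# Hypothetical in-memory equivalent of the tempfile round trip
cleaned_string = conflictive_string.dup.force_encoding(Encoding.default_external)
csv = CSV.new(cleaned_string)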
I have a Mac VBA script making a request to a Ruby Sinatra web app. The text passed from Excel contains characters such as é. Ruby (version 1.9.2) chokes on these characters, as Excel is not sending them as UTF-8.
# encoding: utf-8
require 'rubygems'
require 'sinatra'
require "sinatra/reloader" if development?

configure do
  class << Sinatra::Base
    def options(path, opts = {}, &block)
      route 'OPTIONS', path, opts, &block
    end
  end
  Sinatra::Delegator.delegate :options
end

options '/' do
  response.headers["Access-Control-Allow-Origin"] = "*"
  response.headers["Access-Control-Allow-Methods"] = "POST"
  halt 200
end

post '/fetch' do
  chars = []
  params['excel_input'].valid_encoding? # returns false
  params['excel_input']
end
My Excel VBA:
Sub FetchAddress()
    For Each oDest In Selection
        With ActiveSheet.QueryTables.Add(Connection:="URL;http://localhost:4567/fetch", Destination:=oDest)
            .PostText = "excel_input=" & oDest.Offset(0, -1).Value
            .RefreshStyle = xlOverwriteCells
            .SaveData = True
            .Refresh
        End With
    Next
End Sub
The character é comes out the other end as Ž.
It looks like the text coming from Excel is encoded as Windows-1252 (http://en.wikipedia.org/wiki/Windows-1252).
The byte representation of the character is 142 (which is Ž in Windows-1252).
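You can reproduce that interpretation in irb:

# Byte 142 (0x8E), read as Windows-1252 and transcoded to UTF-8
[142].pack('C').force_encoding('Windows-1252').encode('UTF-8')
#=> "Ž"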
iconv can convert the input to UTF-8; it converts text from one character encoding to another. So something like this should work:
require "iconv"
...
post '/fetch' do
  excel_input = Iconv.conv("UTF-8", "WINDOWS-1252", params['excel_input'])
  ...
end
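On current Rubies, where Iconv is no longer part of the standard library, String#encode can do the same conversion; a sketch under the same Windows-1252 assumption:

post '/fetch' do
  # Re-tag the raw bytes as Windows-1252, then transcode to UTF-8
  excel_input = params['excel_input'].force_encoding('Windows-1252').encode('UTF-8')
  excel_input
end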
You can also look at https://github.com/jmhodges/rchardet: it can autodetect the charset, and then you can convert the input to UTF-8.
Ruby 1.9 Encodings: A Primer and the Solution for Rails by Yehuda Katz is a good read if you have some time; it goes into depth about encodings and how to convert between them.