Ruby CSV UTF-8 Encoding/Translation Issues - ruby

I'm having problems with Ruby 1.9 CSV and invalid UTF-8 characters in my data.
My code looks something like this:
CSV.foreach("small-test2.csv", options) do |row|
name, workgroup, address, actual, output = row
next if name == "NBName"
#ssl_info[name] = workgroup, address, actual, output
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
clean = ic.iconv(output + ' ')[0..-2]
puts clean
end
However I'm still getting the following:
ArgumentError: invalid byte sequence in UTF-8
=~ at org/jruby/RubyRegexp.java:1487
=~ at org/jruby/RubyString.java:1686
Is there anything I'm missing here?

Try this:
output.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
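Two things to keep in mind: Iconv was removed from the standard library in Ruby 2.0, and the backtrace suggests the error may be raised while the CSV library itself is matching the line, before the block ever runs. A minimal sketch that cleans the raw text first and parses afterwards, assuming Ruby 2.1+ for String#scrub (the helper name parse_csv_leniently is made up for illustration):

```ruby
require 'csv'

# Hypothetical helper (not part of the CSV library): read the file as raw
# bytes, tag them as UTF-8, drop invalid sequences, then parse.
def parse_csv_leniently(path)
  raw = File.binread(path)                      # no decoding happens here
  text = raw.force_encoding('UTF-8').scrub('')  # invalid bytes are removed
  CSV.parse(text)
end

# Usage, mirroring the loop in the question:
# parse_csv_leniently("small-test2.csv").each do |row|
#   name, workgroup, address, actual, output = row
#   next if name == "NBName"
#   puts output
# end
```

Reading the whole file into memory is fine for small inputs; for large files a line-by-line variant would be the better choice.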

Related

How to encode csv file in Roo (Rails) : invalid byte sequence in UTF-8

I am trying to upload a CSV file, but I get an "invalid byte sequence in UTF-8" error. I am using the 'roo' gem.
My code looks like this:
def upload_results_csv file
spreadsheet = MyFileUtil.open_file(file)
header = spreadsheet.row(1) # THIS LINE RAISES THE ERROR
(2..spreadsheet.last_row).each do |i|
row = Hash[[header, spreadsheet.row(i)].transpose]
...
...
end
end
class MyFileUtil
def self.open_file(file)
case File.extname(file.original_filename)
when ".csv" then
Roo::Csv.new(file.path,csv_options: {encoding: Encoding::UTF_8})
when ".xls" then
Roo::Excel.new(file.path, nil, :ignore)
when ".xlsx" then
Roo::Excelx.new(file.path, nil, :ignore)
else
raise "Unknown file type: #{file.original_filename}"
end
end
end
I don't know how to encode csv file. Please help!
Thanks
To safely convert a string to utf-8 you can do:
str.encode('utf-8', 'binary', invalid: :replace, undef: :replace, replace: '')
also see this blog post.
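One caveat worth knowing: with 'binary' as the source encoding, every byte above 0x7F counts as undefined, so this call strips valid non-ASCII characters along with the broken ones. A small comparison, assuming Ruby 2.1+ for String#scrub (the method names are illustrative only):

```ruby
# Treats the input as raw bytes: everything outside ASCII is dropped.
def ascii_only(str)
  str.encode('utf-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end

# Keeps valid UTF-8 characters and drops only the invalid byte sequences.
def valid_utf8_only(str)
  str.dup.force_encoding('UTF-8').scrub('')
end

# "naïve" in UTF-8 (ï is \xC3\xAF) followed by a stray 0xFF byte.
sample = "na\xC3\xAFve\xFF"

puts ascii_only(sample)      # the ï is lost as well
puts valid_utf8_only(sample) # the ï survives
```

Which of the two is appropriate depends on whether the data is expected to contain legitimate non-ASCII text.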
Since the roo gem will only take filenames as constructor argument, not plain IO objects, the only solution I can think of is to write a sanitized version to a tempfile and pass it to roo, along the lines of
require 'tempfile'
def upload_results_csv file
tmpfile = Tempfile.new(File.basename(file.path))
tmpfile.write(File.read(file.path).encode('utf-8', 'binary', invalid: :replace, undef: :replace, replace: ''))
tmpfile.rewind
spreadsheet = MyFileUtil.open_file(tmpfile, file.original_filename)
header = spreadsheet.row(1) # now reads from the sanitized copy
# ...
ensure
tmpfile.close
tmpfile.unlink
end
You need to alter MyFileUtil as well, because the original filename needs to be passed down:
class MyFileUtil
def self.open_file(file, original_filename)
case File.extname(original_filename)
when ".csv" then
Roo::Csv.new(file.path,csv_options: {encoding: Encoding::UTF_8})
when ".xls" then
Roo::Excel.new(file.path, nil, :ignore)
when ".xlsx" then
Roo::Excelx.new(file.path, nil, :ignore)
else
raise "Unknown file type: #{original_filename}"
end
end
end
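The same tempfile idea, sketched without roo so it can be tried on its own. This assumes Ruby 2.1+ (Tempfile.create with a block and String#scrub); with_sanitized_copy is a made-up helper name, and it keeps valid non-ASCII characters instead of dropping them:

```ruby
require 'tempfile'

# Hypothetical helper: writes a cleaned copy of `path` to a temp file,
# yields the temp file's path, and deletes the copy afterwards.
def with_sanitized_copy(path)
  Tempfile.create(['sanitized', File.extname(path)]) do |tmp|
    tmp.binmode
    raw = File.binread(path).force_encoding('UTF-8')
    tmp.write(raw.scrub(''))   # drop invalid UTF-8 sequences only
    tmp.flush                  # make sure readers see the full content
    yield tmp.path
  end
end

# Usage sketch (Roo::Csv call copied from the question, untested here):
# with_sanitized_copy(file.path) do |clean_path|
#   spreadsheet = Roo::Csv.new(clean_path, csv_options: {encoding: Encoding::UTF_8})
# end
```

The block form removes the temporary file automatically, so no ensure clause is needed.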

File.readlines invalid byte sequence in UTF-8 (ArgumentError)

I am processing a file which contains data from the web and encounter invalid byte sequence in UTF-8 (ArgumentError) error on certain log files.
a = File.readlines('log.csv').grep(/watch\?v=/).map do |s|
s = s.parse_csv;
{ timestamp: s[0], url: s[1], ip: s[3] }
end
puts a
I am trying to get this solution working. I have seen people doing
.encode!('UTF-8', 'UTF-8', :invalid => :replace)
but it doesn't appear to work with File.readlines.
File.readlines('log.csv').encode!('UTF-8', 'UTF-8', :invalid => :replace).grep(/watch\?v=/)
' : undefined method `encode!' for # (NoMethodError)
What's the most straightforward way to filter or convert invalid UTF-8 characters while reading a file?
Attempt 1
I tried this, but it failed with the same invalid byte sequence error.
IO.foreach('test.csv', 'r:bom|UTF-8').grep(/watch\?v=/).map do |s|
# extract three columns: time stamp, url, ip
s = s.parse_csv;
{ timestamp: s[0], url: s[1], ip: s[3] }
end
Solution
This seems to have worked for me.
a = File.readlines('log.csv', :encoding => 'ISO-8859-1').grep(/watch\?v=/).map do |s|
s = s.parse_csv;
{ timestamp: s[0], url: s[1], ip: s[3] }
end
puts a
Does Ruby provide a way to do File.read() with specified encoding?
I am trying to get this solution working. I have seen people doing
.encode!('UTF-8', 'UTF-8', :invalid => :replace)
but it doesn't appear to work with File.readlines.
File.readlines returns an Array. Arrays don't have an encode method. On the other hand, strings do have an encode method.
could you please provide an example to the alternative above.
require 'csv'
CSV.foreach("log.csv", encoding: "utf-8") do |row|
md = row[0].match /watch\?v=/
puts row[0], row[1], row[3] if md
end
Or,
CSV.foreach("log.csv", 'rb:utf-8') do |row|
If you need more speed on Ruby 1.8, use the fastercsv gem; from Ruby 1.9 on, the standard CSV library is already based on FasterCSV.
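Since File.readlines returns an Array, the cleanup has to be applied to each line. A sketch of the original filter under that approach, assuming Ruby 2.1+ for String#scrub (watch_lines is a hypothetical name):

```ruby
# Reads the log, replaces invalid UTF-8 in every line, then filters.
def watch_lines(path)
  lines = File.readlines(path, encoding: 'UTF-8')
  cleaned = lines.map { |line| line.scrub('?') }  # per line: Array has no encode
  cleaned.grep(/watch\?v=/)
end

# Usage:
# watch_lines('log.csv').each { |line| puts line }
```

Each returned line can then be passed to parse_csv as in the question.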
This seems to have worked for me.
File.readlines('log.csv', :encoding => 'ISO-8859-1')
Yes, in order to read a file you have to know its encoding.
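Why the ISO-8859-1 trick stops the error: every byte is a valid character in ISO-8859-1, so regexp matching never raises. The catch is that the text is only correct if the file really is Latin-1. A small illustration, assuming Ruby 2.0+ for String#b (the helper name is made up):

```ruby
# Interprets raw bytes as ISO-8859-1 and converts them to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

raw = "caf\xE9 watch?v=1".b   # 0xE9 is "é" in ISO-8859-1

puts raw.dup.force_encoding('UTF-8').valid_encoding?       # false
puts raw.dup.force_encoding('ISO-8859-1').valid_encoding?  # true
puts latin1_to_utf8(raw)
```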
In my case the script defaulted to US-ASCII and I wasn't at liberty to change it on the server for risk of other conflicts.
I did
File.readlines(email, :encoding => 'UTF-8').each do |line|
but this didn't work with some Japanese characters so I added this on the next line and that worked fine.
line = line.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

UndefinedConversionError trying to parse Arabic from email body

Using the mail gem for Ruby, I am getting this message:
mail.rb:22:in `encode': "\xC7" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
from mail.rb:22:in `<main>'
If I remove encode, I get this message instead:
/var/lib/gems/1.9.1/gems/bson-1.7.0/lib/bson/bson_ruby.rb:63:in `rescue in to_utf8_binary': String not valid utf-8: "<div dir=\"ltr\"><div class=\"gmail_quote\">l<br><br><br><div dir=\"ltr\"><div class=\"gmail_quote\"><br><br><br><div dir=\"ltr\"><div class=\"gmail_quote\"><br><br><br><div dir=\"ltr\"><div dir=\"rtl\">\xC7\xE1\xE4\xD5 \xC8\xC7\xE1\xE1\xDB\xC9 \xC7\xE1\xDA\xD1\xC8\xED\xC9</div></div>\r\n</div><br></div>\r\n</div><br></div>\r\n</div><br></div>" (BSON::InvalidStringEncoding)
This is my code:
require 'mail'
require 'mongo'
connection = Mongo::Connection.new
db = connection.db("DB")
newsCollection = db["news"]
Mail.defaults do
retriever_method :pop3, :address => "pop.gmail.com",
:port => 995,
:user_name => 'my_username',
:password => '*****',
:enable_ssl => true
end
emails = Mail.last
# Checks whether the email is multipart and decodes it accordingly, to extract UTF-8 from the body
plain_part = emails.multipart? ? (emails.text_part ? emails.text_part.body.decoded : nil) : emails.body.decoded
html_part = emails.html_part ? emails.html_part.body.decoded : nil
mongoMessage = {"date" => emails.date.to_s , "subject" => emails.subject , "body" => plain_part.encode('UTF-8') }
msgID = newsCollection.insert(mongoMessage) # adds the document to the database and returns its ID
puts msgID
For English and Hebrew it works perfectly, but it seems Gmail is sending Arabic with a different encoding. Replacing UTF-8 with ASCII-8BIT gives a similar error.
I get the same result when using plain_part for plain email messages. I am handling emails from one specific source, so I am confident that html_part is not what causes the error.
To make it extra weird, the Arabic subject is rendered perfectly.
What encoding should I use?
If you use encode without options, it will raise this error if your string claims to be in one encoding but contains bytes from another.
Try it this way:
plain_part.encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '?')
This replaces characters that are invalid or undefined for the given encoding with a "?". If this is not sufficient for your needs, you need to find a way to check whether your plain_part string is valid.
For example, you can use valid_encoding? for this.
I recently stumbled across a similar problem where I couldn't be sure what the real encoding was, so I wrote this (admittedly simple) method. Maybe it helps you find a way to fix your problem.
def self.encode!(str)
return nil if str.nil?
known_encodings = %w(
UTF-8
ISO-8859-1
)
begin
str.encode(Encoding.find('UTF-8'))
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
known_encodings.each do |encoding|
# Work on a copy, because force_encoding modifies its receiver
fixed_str = str.dup.force_encoding(encoding)
if fixed_str.valid_encoding?
return fixed_str.encode(Encoding.find('UTF-8'))
end
end
return str.encode(Encoding.find('UTF-8'), :invalid => :replace, :undef => :replace, :replace => '?')
end
end
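As for which encoding to use: the bytes quoted in the error (\xC7\xE1\xE4\xD5 ...) come out as readable Arabic when interpreted as Windows-1256, so that is a plausible guess for this message. It is only a guess; the charset declared by the mail part itself is the authoritative source. A sketch under that assumption (decode_as is a hypothetical helper):

```ruby
# Hypothetical helper: tag the decoded bytes with their presumed source
# encoding, then transcode to UTF-8.
def decode_as(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode('UTF-8')
end

# First word of the body from the error message, as raw bytes.
raw = "\xC7\xE1\xE4\xD5".b

text = decode_as(raw, 'Windows-1256')  # assumption: the body is Windows-1256
puts text.encoding
puts text.valid_encoding?
```

In the question's code this would mean replacing plain_part.encode('UTF-8') with decode_as(plain_part, 'Windows-1256').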
I found a workaround.
Only specific emails will be sent to this account, purely for use by this application, so I have full control over the formatting. For some reason, mail decodes a text/plain attachment perfectly,
so:
body = nil # declared outside the block so it is still visible afterwards
emails.attachments.each do |attachment|
if attachment.content_type.start_with?('text/plain')
# extracting txt file
begin
body = attachment.body.decoded
rescue StandardError => e
puts "Unable to save data for #{attachment.filename} because #{e.message}"
end
end
end
mongoMessage = {"date" => emails.date.to_s , "subject" => emails.subject , "body" => body }

Delete non-UTF characters from a string in Ruby?

How do I delete non-UTF-8 characters from a Ruby string? I have a string that contains, for example, "\xC2". I want to remove that character so that the string becomes valid UTF-8.
This:
text.gsub!(/\xC2/, '')
returns an error:
incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
I was looking at text.unpack('U*') and string.pack as well, but did not get anywhere.
You can use encode for that.
text.encode('UTF-8', :invalid => :replace, :undef => :replace)
For more info look into Ruby-Docs
You could do it like this
# encoding: utf-8
class String
def validate_encoding
chars.select(&:valid_encoding?).join
end
end
puts "testing\xC2 a non UTF-8 string".validate_encoding
#=>testing a non UTF-8 string
Your text has ASCII-8BIT encoding; instead you could use this:
text.delete!("^\u{0000}-\u{007F}")
Note that this removes every non-ASCII character, not only the invalid bytes.
You can use /n, as in
text.gsub!(/\xC2/n, '')
to force the Regexp to operate on bytes.
Are you sure this is what you want, though? Any Unicode character in the range [U+80, U+BF] will have a \xC2 in its UTF-8 encoded form.
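To illustrate the caveat: the copyright sign is encoded as \xC2\xA9, so deleting every \xC2 byte breaks it, whereas dropping only invalid sequences leaves it alone. A sketch with illustrative method names, assuming Ruby 2.0+ for String#b:

```ruby
# Byte-level removal: works on a binary copy, so it also hits valid characters.
def remove_c2_bytes(text)
  text.b.gsub(/\xC2/n, '').force_encoding('UTF-8')
end

# Character-level removal: keeps every character that is valid UTF-8.
def remove_invalid_chars(text)
  text.chars.select(&:valid_encoding?).join
end

# A valid copyright sign (\xC2\xA9) plus one stray \xC2 byte.
sample = "\xC2\xA9 2012 testing\xC2 string".dup.force_encoding('UTF-8')

puts remove_c2_bytes(sample).valid_encoding?   # false: the sign was cut in half
puts remove_invalid_chars(sample)
```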
Try Iconv (Ruby 1.9 only; Iconv was removed from the standard library in Ruby 2.0)
1.9.3p194 :001 > require 'iconv'
# => true
1.9.3p194 :002 > string = "testing\xC2 a non UTF-8 string"
# => "testing\xC2 a non UTF-8 string"
1.9.3p194 :003 > ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
# => #<Iconv:0x000000026c9290>
1.9.3p194 :004 > ic.iconv string
# => "testing a non UTF-8 string"
The best solution to this problem that I've found is this answer to the same question: https://stackoverflow.com/a/8711118/363293.
In short: "€foo\xA0".chars.select(&:valid_encoding?).join
data = '' if not (data.force_encoding("UTF-8").valid_encoding?)

Ruby: Generate a utf-8 character from code point as string

I need to write all UTF-8 characters to a file. I have all the code points as strings such as "5363" or "328E", but I can't append them to \u to build something like "\u5363". Please help.
(this will work if you have ruby 1.9 or newer)
#irb -E utf-8
irb(main):032:0> s=""
=> ""
irb(main):033:0> i=0x328e
=> 12942
irb(main):034:0> s<<i
=> "㊎"
irb(main):036:0> s<<0x5363
=> "㊎卣"
for your case:
my_char_codes = ["5363","328E"]
s = ""
my_char_codes.each{ |c| s << c.to_i(16) }
# now s contains "卣㊎"
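An alternative that does not depend on the encoding of the receiving string is Integer#chr with an explicit encoding. A short sketch (chars_from_hex is a hypothetical name):

```ruby
# Converts hex code point strings such as "5363" into one UTF-8 string.
def chars_from_hex(codes)
  codes.map { |hex| hex.to_i(16).chr(Encoding::UTF_8) }.join
end

puts chars_from_hex(["5363", "328E"])
```

The result can be written to a file opened with an explicit encoding, for example File.open(path, 'w:UTF-8').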

Resources