File.readlines invalid byte sequence in UTF-8 (ArgumentError) - ruby

I am processing a file that contains data from the web, and I encounter an invalid byte sequence in UTF-8 (ArgumentError) on certain log files.
a = File.readlines('log.csv').grep(/watch\?v=/).map do |s|
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
puts a
I am trying to get this solution working. I have seen people doing
.encode!('UTF-8', 'UTF-8', :invalid => :replace)
but it doesn't appear to work with File.readlines.
File.readlines('log.csv').encode!('UTF-8', 'UTF-8', :invalid => :replace).grep(/watch\?v=/)
`<main>': undefined method `encode!' for #<Array:0x...> (NoMethodError)
What's the most straightforward way to filter or convert invalid UTF-8 characters during a file read?
Attempt 1
I tried this, but it failed with the same invalid byte sequence error.
IO.foreach('test.csv', 'r:bom|UTF-8').grep(/watch\?v=/).map do |s|
  # extract three columns: timestamp, url, ip
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
Solution
This seems to have worked for me.
a = File.readlines('log.csv', :encoding => 'ISO-8859-1').grep(/watch\?v=/).map do |s|
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
puts a
Does Ruby provide a way to do File.read() with specified encoding?
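For reference (my addition, not part of the original post): File.read accepts the same options as File.readlines, so the encoding can be given either as an option or as part of the mode string. A minimal sketch:
text = File.read('log.csv', :encoding => 'ISO-8859-1')   # same :encoding option as readlines
text = File.read('log.csv', mode: 'r:ISO-8859-1')        # or as part of the mode string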

I am trying to get this solution working. I have seen people doing
.encode!('UTF-8', 'UTF-8', :invalid => :replace)
but it doesn't appear to work with File.readlines.
File.readlines returns an Array. Arrays don't have an encode method. On the other hand, strings do have an encode method.
Could you please provide an example of the alternative above?
require 'csv'
CSV.foreach("log.csv", encoding: "utf-8") do |row|
  md = row[0].match /watch\?v=/
  puts row[0], row[1], row[3] if md
end
Or,
CSV.foreach("log.csv", 'rb:utf-8') do |row|
If you need more speed, use the fastercsv gem.
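Alternatively, since each line returned by File.readlines is a String, encode can be mapped over the lines before grepping. A minimal sketch of my own, not the answerer's code:
lines = File.readlines('log.csv').map do |line|
  # Transcoding through 'binary' forces the invalid bytes to be replaced; note it
  # also strips valid non-ASCII characters, which is acceptable when grepping for
  # ASCII-only URLs. String#scrub (Ruby >= 2.1) replaces only the invalid bytes.
  line.encode('UTF-8', 'binary', :invalid => :replace, :undef => :replace, :replace => '')
end
puts lines.grep(/watch\?v=/)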
This seems to have worked for me.
File.readlines('log.csv', :encoding => 'ISO-8859-1')
Yes, in order to read a file you have to know its encoding.
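As an aside (my addition, not part of the original answer): Ruby also accepts an "external:internal" encoding pair, so the file can be transcoded to UTF-8 as it is read. A sketch, assuming the log really is Latin-1:
# Read the bytes as ISO-8859-1 and transcode each line to UTF-8 in memory.
lines = File.readlines('log.csv', :encoding => 'ISO-8859-1:UTF-8')
lines.first.encoding   # => #<Encoding:UTF-8>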

In my case the script defaulted to US-ASCII and I wasn't at liberty to change it on the server for risk of other conflicts.
I did
File.readlines(email, :encoding => 'UTF-8').each do |line|
but this didn't work with some Japanese characters, so I added this on the next line and that worked fine.
line = line.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
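On Ruby 2.1 or newer, String#scrub gives a similar cleanup while keeping valid multibyte characters. A sketch (my addition, reusing the email variable from above):
File.readlines(email, :encoding => 'UTF-8').each do |line|
  line = line.scrub('')   # replaces only the invalid byte sequences (Ruby >= 2.1)
  # ...
end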

Related

SmarterCSV and file encoding issues in Ruby

I'm working with a file that appears to have UTF-16LE encoding. If I run
File.read(file, :encoding => 'utf-16le')
the first line of the file is:
"<U+FEFF>=\"25/09/2013\"\t18:39:17\t=\"Unknown\"\t=\"+15168608203\"\t\"Message.\"\r\n
If I read the file using something like
csv_text = File.read(file, :encoding => 'utf-16le')
I get an error stating
ASCII incompatible encoding needs binmode (ArgumentError)
If I switch the encoding in the above to
csv_text = File.read(file, :encoding => 'utf-8')
I make it to the SmarterCSV section of the code, but get an error that states
`=~': invalid byte sequence in UTF-8 (ArgumentError)
The full code is below. If I run this in the Rails console, it works just fine, but if I run it using ruby test.rb, it gives me the first error:
require 'smarter_csv'
headers = ["date_of_message", "timestamp_of_message", "sender", "phone_number", "message"]
path = '/path/'
Dir.glob("#{path}*.CSV").each do |file|
  csv_text = File.read(file, :encoding => 'utf-16le')
  File.open('/tmp/tmp_file', 'w') { |tmp_file| tmp_file.write(csv_text) }
  puts 'made it here'
  SmarterCSV.process('/tmp/tmp_file', {
    :col_sep => "\t",
    :force_simple_split => true,
    :headers_in_file => false,
    :user_provided_headers => headers
  }).each do |row|
    converted_row = {}
    converted_row[:date_of_message] = row[:date_of_message][2..-2].to_date
    converted_row[:timestamp] = row[:timestamp]
    converted_row[:sender] = row[:sender][2..-2]
    converted_row[:phone_number] = row[:phone_number][2..-2]
    converted_row[:message] = row[:message][1..-2]
    converted_row[:room] = file.gsub(path, '')
  end
end
Update - 05/13/15
Ultimately, I decided to encode the file string as UTF-8 rather than diving deeper into the SmarterCSV code. The first problem in the SmarterCSV code is that it does not allow a user to specify binary mode when reading in a file, but after adjusting the source to handle that, a myriad of other encoding-related issues popped up, many of which related to the handling of various parameters on files that were not UTF-8 encoded. It may have been the easy way out, but encoding everything as UTF-8 before feeding it into SmarterCSV solved my issue.
Add binmode to the File.read call.
File.read(file, :encoding => 'utf-16le', mode: "rb")
"b" Binary file mode
Suppresses EOL <-> CRLF conversion on Windows. And
sets external encoding to ASCII-8BIT unless explicitly
specified.
ref: http://ruby-doc.org/core-2.0.0/IO.html#method-c-read
Now pass the correct encoding to SmarterCSV
SmarterCSV.process('/tmp/tmp_file', {
:file_encoding => "utf-16le", ...
Update
It was found that SmarterCSV does not support binary mode. After the OP attempted to modify the code without success, it was decided that the simple solution was to convert the input to UTF-8, which SmarterCSV supports.
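A rough sketch of that conversion (my reconstruction, reusing file and headers from the question; not the OP's exact code):
# Read with binmode plus the UTF-16LE external encoding, transcode to UTF-8,
# then hand the UTF-8 temp file to SmarterCSV.
csv_text = File.read(file, :encoding => 'utf-16le', mode: 'rb')
utf8_text = csv_text.encode('utf-8', invalid: :replace, undef: :replace, replace: '')
utf8_text.sub!(/\A\uFEFF/, '')   # optionally drop the BOM seen in the first line
File.write('/tmp/tmp_file', utf8_text)
SmarterCSV.process('/tmp/tmp_file',
                   :col_sep => "\t",
                   :force_simple_split => true,
                   :headers_in_file => false,
                   :user_provided_headers => headers)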
Unfortunately, you're using a 'flat-file' style of storage and character encoding is going to be an issue on both ends (reading or writing).
I would suggest using something along the lines of str = str.force_encoding("UTF-8") and see if you can get that to work.
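A small sketch of that suggestion (my addition; file comes from the question's loop, and the validity check is extra):
raw = File.read(file, mode: 'rb')                  # raw bytes, tagged ASCII-8BIT
str = raw.force_encoding('UTF-8')                  # relabel the bytes as UTF-8
str = str.scrub('') unless str.valid_encoding?     # drop anything that still isn't valid UTF-8 (Ruby >= 2.1)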

How to encode csv file in Roo (Rails) : invalid byte sequence in UTF-8

I am trying to upload a CSV file but am getting an invalid byte sequence in UTF-8 error. I am using the 'roo' gem.
My code is like this :
def upload_results_csv file
  spreadsheet = MyFileUtil.open_file(file)
  header = spreadsheet.row(1) # THIS LINE RAISES THE ERROR
  (2..spreadsheet.last_row).each do |i|
    row = Hash[[header, spreadsheet.row(i)].transpose]
    ...
    ...
  end
end
class MyFileUtil
  def self.open_file(file)
    case File.extname(file.original_filename)
    when ".csv" then
      Roo::Csv.new(file.path, csv_options: {encoding: Encoding::UTF_8})
    when ".xls" then
      Roo::Excel.new(file.path, nil, :ignore)
    when ".xlsx" then
      Roo::Excelx.new(file.path, nil, :ignore)
    else
      raise "Unknown file type: #{file.original_filename}"
    end
  end
end
I don't know how to encode the CSV file. Please help!
Thanks
To safely convert a string to utf-8 you can do:
str.encode('utf-8', 'binary', invalid: :replace, undef: :replace, replace: '')
also see this blog post.
Since the roo gem will only take filenames as constructor argument, not plain IO objects, the only solution I can think of is to write a sanitized version to a tempfile and pass it to roo, along the lines of
require 'tempfile'

def upload_results_csv file
  tmpfile = Tempfile.new(file.path)
  tmpfile.write(File.read(file.path).encode('utf-8', 'binary', invalid: :replace, undef: :replace, replace: ''))
  tmpfile.rewind
  spreadsheet = MyFileUtil.open_file(tmpfile, file.original_filename)
  header = spreadsheet.row(1) # THIS LINE RAISES THE ERROR
  # ...
ensure
  tmpfile.close
  tmpfile.unlink
end
You need to alter MyFileUtil as well, because the original filename needs to be passed down:
class MyFileUtil
  def self.open_file(file, original_filename)
    case File.extname(original_filename)
    when ".csv" then
      Roo::Csv.new(file.path, csv_options: {encoding: Encoding::UTF_8})
    when ".xls" then
      Roo::Excel.new(file.path, nil, :ignore)
    when ".xlsx" then
      Roo::Excelx.new(file.path, nil, :ignore)
    else
      raise "Unknown file type: #{original_filename}"
    end
  end
end

Removing whitespaces in a CSV file

I have a string with extra whitespace:
First,Last,Email ,Mobile Phone ,Company,Title ,Street,City,State,Zip,Country, Birthday,Gender ,Contact Type
I want to parse this line and remove the whitespace.
My code looks like:
namespace :db do
  task :populate_contacts_csv => :environment do
    require 'csv'
    csv_text = File.read('file_upload_example.csv')
    csv = CSV.parse(csv_text, :headers => true)
    csv.each do |row|
      puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
    end
  end
end
@prices = CSV.parse(IO.read('prices.csv'), :headers => true,
                    :header_converters => lambda { |f| f.strip },
                    :converters => lambda { |f| f ? f.strip : nil })
The nil test is added to the field converter but not the header converter, on the assumption that the headers are never nil while the data might be, and nil doesn't have a strip method. I'm really surprised that, AFAIK, :strip is not a pre-defined converter!
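If you want a reusable :strip converter, you can register one in CSV's converter registries yourself. A sketch (my addition, not part of the original answer):
require 'csv'

# Register :strip for both fields and headers; nil fields pass through untouched.
CSV::Converters[:strip] = lambda { |field| field.respond_to?(:strip) ? field.strip : field }
CSV::HeaderConverters[:strip] = lambda { |header| header.strip }

@prices = CSV.parse(IO.read('prices.csv'), :headers => true,
                    :header_converters => :strip, :converters => :strip)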
You can strip your hash first:
csv.each do |unstriped_row|
  row = {}
  unstriped_row.each { |k, v| row[k.strip] = v.strip }
  puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
end
Edited to strip hash keys too
CSV supports "converters" for the headers and fields, which let you get inside the data before it's passed to your each loop.
Writing a sample CSV file:
csv = "First,Last,Email ,Mobile Phone ,Company,Title ,Street,City,State,Zip,Country, Birthday,Gender ,Contact Type
first,last,email ,mobile phone ,company,title ,street,city,state,zip,country, birthday,gender ,contact type
"
File.write('file_upload_example.csv', csv)
Here's how I'd do it:
require 'csv'
csv = CSV.open('file_upload_example.csv', :headers => true)
[:convert, :header_convert].each { |c| csv.send(c) { |f| f.strip } }
csv.each do |row|
  puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
end
Which outputs:
First Name: 'first'
Last Name: 'last'
Email: 'email'
The converters simply strip leading and trailing whitespace from each header and each field as they're read from the file.
Also, as a programming design choice, don't read your file into memory using:
csv_text = File.read('file_upload_example.csv')
Then parse it:
csv = CSV.parse(csv_text, :headers => true)
Then loop over it:
csv.each do |row|
Ruby's IO system supports "enumerating" over a file, line by line. Once my code does CSV.open, the file is readable and each reads one line at a time. The entire file doesn't need to be in memory at once; slurping a whole file isn't scalable (though on new machines it's becoming a lot more reasonable), and, if you test, you'll find that reading a file using each is extremely fast, probably as fast as reading it, parsing it, and then iterating over the parsed content.
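For completeness (my addition), CSV.foreach gives the same line-at-a-time behavior in a single call; a minimal sketch using the same file and converters as above:
require 'csv'

CSV.foreach('file_upload_example.csv',
            :headers => true,
            :header_converters => lambda { |h| h.strip },
            :converters => lambda { |f| f ? f.strip : f }) do |row|
  puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
end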

UndefinedConversionError trying to parse Arabic from email body

using mail for ruby I am getting this message:
mail.rb:22:in `encode': "\xC7" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
from mail.rb:22:in `<main>'
If I remove encode, I get this message:
/var/lib/gems/1.9.1/gems/bson-1.7.0/lib/bson/bson_ruby.rb:63:in `rescue in to_utf8_binary': String not valid utf-8: "<div dir=\"ltr\"><div class=\"gmail_quote\">l<br><br><br><div dir=\"ltr\"><div class=\"gmail_quote\"><br><br><br><div dir=\"ltr\"><div class=\"gmail_quote\"><br><br><br><div dir=\"ltr\"><div dir=\"rtl\">\xC7\xE1\xE4\xD5 \xC8\xC7\xE1\xE1\xDB\xC9 \xC7\xE1\xDA\xD1\xC8\xED\xC9</div></div>\r\n</div><br></div>\r\n</div><br></div>\r\n</div><br></div>" (BSON::InvalidStringEncoding)
This is my code:
require 'mail'
require 'mongo'

connection = Mongo::Connection.new
db = connection.db("DB")
db = Mongo::Connection.new.db("DB")
newsCollection = db["news"]

Mail.defaults do
  retriever_method :pop3, :address => "pop.gmail.com",
                          :port => 995,
                          :user_name => 'my_username',
                          :password => '*****',
                          :enable_ssl => true
end

emails = Mail.last

# Checks if email is multipart and decodes accordingly. Used to extract UTF-8 from the body.
plain_part = emails.multipart? ? (emails.text_part ? emails.text_part.body.decoded : nil) : emails.body.decoded
html_part = emails.html_part ? emails.html_part.body.decoded : nil

mongoMessage = {"date" => emails.date.to_s, "subject" => emails.subject, "body" => plain_part.encode('UTF-8')}
msgID = newsCollection.insert(mongoMessage) # add the document to the database and return its ID
puts msgID
For English and Hebrew it works perfectly, but it seems Gmail is sending Arabic with a different encoding. Replacing UTF-8 with ASCII-8BIT gives a similar error.
I get the same result when using plain_part for plain email messages. I am handling emails from one specific source, so I can say with confidence that html_part is not causing the error.
To make it extra weird, the Subject in Arabic is rendered perfectly.
What encoding should I use?
If you use encode without options, it will raise this error if your string claims to be one encoding but contains characters from another encoding.
try it in this way:
plain_part.encode('UTF-8', {:invalid => :replace, :undef => :replace, :replace => '?'})
This replaces invalid and undefined characters for the given encoding with a "?". If this is not sufficient for your needs, you need to find a way to check whether your plain_part string is valid.
For example, you can use valid_encoding? for this.
I recently stumbled across a similar problem where I couldn't be sure what encoding the string really had, so I wrote this (maybe a little humble) method. Maybe it helps you find a way to fix your problem.
def self.encode!(str)
  return nil if str.nil?
  known_encodings = %w(
    UTF-8
    ISO-8859-1
  )
  begin
    str.encode(Encoding.find('UTF-8'))
  rescue Encoding::UndefinedConversionError
    fixed_str = ""
    known_encodings.each do |encoding|
      fixed_str = str
      if fixed_str.force_encoding(encoding).valid_encoding?
        return fixed_str.encode(Encoding.find('UTF-8'))
      end
    end
    return str.encode(Encoding.find('UTF-8'), {:invalid => :replace, :undef => :replace, :replace => '?'})
  end
end
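Hypothetical usage with the question's plain_part (my addition; SanitizeHelper is a made-up name for wherever the class method above lives):
safe_body = SanitizeHelper.encode!(plain_part)
mongoMessage = {"date" => emails.date.to_s, "subject" => emails.subject, "body" => safe_body}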
I found a workaround.
Since only specific emails will be sent to this account, just for use with this application, I have full control over the formatting. For some reason mail decodes the text/plain attachment perfectly
so:
emails.attachments.each do |attachment|
  if attachment.content_type.start_with?('text/plain')
    # extracting txt file
    begin
      body = attachment.body.decoded
    rescue Exception => e
      puts "Unable to save data for #{filename} because #{e.message}"
    end
  end
end
mongoMessage = {"date" => emails.date.to_s , "subject" => emails.subject , "body" => body }

Ruby CSV UTF-8 Encoding/Translation Issues

I'm having problems with Ruby 1.9 CSV and invalid UTF-8 characters in my data.
My code looks something like this:
CSV.foreach("small-test2.csv", options) do |row |
name, workgroup, address, actual, output = row
next if nbname == "NBName"
#ssl_info[name] = workgroup, address, actual, output
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
clean = ic.iconv(output + ' ')[0..-2]
puts clean
end
However I'm still getting the following:
ArgumentError: invalid byte sequence in UTF-8
=~ at org/jruby/RubyRegexp.java:1487
=~ at org/jruby/RubyString.java:1686
Is there anything I'm missing here?
Try this:
output.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
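Dropped into the loop from the question, that would look roughly like this (my sketch: Iconv removed, the undefined nbname compared as name, and String#scrub used because a same-encoding encode can be a no-op on older Rubies):
require 'csv'

CSV.foreach("small-test2.csv", :headers => false) do |row|
  name, workgroup, address, actual, output = row
  next if name == "NBName"
  clean = output.to_s.scrub('')   # replaces only the invalid UTF-8 byte sequences (Ruby >= 2.1)
  puts clean
end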
