UndefinedConversionError trying to parse Arabic from email body


Using the mail gem for Ruby, I get this error:
mail.rb:22:in `encode': "\xC7" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
from mail.rb:22:in `<main>'
If I remove the encode call, I get this error instead:
/var/lib/gems/1.9.1/gems/bson-1.7.0/lib/bson/bson_ruby.rb:63:in `rescue in to_utf8_binary': String not valid utf-8: "<div dir=\"ltr\"><div class=\"gmail_quote\">l<br><br><br><div dir=\"ltr\"><div class=\"gmail_quote\"><br><br><br><div dir=\"ltr\"><div class=\"gmail_quote\"><br><br><br><div dir=\"ltr\"><div dir=\"rtl\">\xC7\xE1\xE4\xD5 \xC8\xC7\xE1\xE1\xDB\xC9 \xC7\xE1\xDA\xD1\xC8\xED\xC9</div></div>\r\n</div><br></div>\r\n</div><br></div>\r\n</div><br></div>" (BSON::InvalidStringEncoding)
This is my code:
require 'mail'
require 'mongo'
db = Mongo::Connection.new.db("DB")
newsCollection = db["news"]
Mail.defaults do
retriever_method :pop3, :address => "pop.gmail.com",
:port => 995,
:user_name => 'my_username',
:password => '*****',
:enable_ssl => true
end
emails = Mail.last
# Checks whether the email is multipart and decodes accordingly. Used to extract UTF-8 text from the body
plain_part = emails.multipart? ? (emails.text_part ? emails.text_part.body.decoded : nil) : emails.body.decoded
html_part = emails.html_part ? emails.html_part.body.decoded : nil
mongoMessage = {"date" => emails.date.to_s , "subject" => emails.subject , "body" => plain_part.encode('UTF-8') }
msgID = newsCollection.insert(mongoMessage) # adds the document to the database and returns its ID
puts msgID
For English and Hebrew it works perfectly, but it seems Gmail sends Arabic in a different encoding. Replacing UTF-8 with ASCII-8BIT gives a similar error.
I get the same result when using plain_part for plain email messages. I am handling emails from one specific source, so I can use html_part with confidence that it is not causing the error.
To make it extra weird, the Subject in Arabic is rendered perfectly.
What encoding should I use?

If you call encode without options, it raises this error when your string is tagged with one encoding but contains bytes from another.
Try it this way:
plain_part.encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '?')
This replaces characters that are invalid or undefined for the given encoding with a "?" (more info). If this is not sufficient for your needs, you need a way to check whether your plain_part string is valid.
For example, you can use valid_encoding? (more info) for this.
I recently stumbled across a similar problem where I couldn't be sure what the encoding really was, so I wrote this (maybe a little humble) method. Maybe it helps you find a way to fix your problem.
def self.encode!(str)
  return nil if str.nil?
  known_encodings = %w(
    UTF-8
    ISO-8859-1
  )
  begin
    str.encode(Encoding.find('UTF-8'))
  rescue Encoding::UndefinedConversionError
    known_encodings.each do |encoding|
      # dup so that force_encoding does not relabel the caller's string
      fixed_str = str.dup
      if fixed_str.force_encoding(encoding).valid_encoding?
        return fixed_str.encode(Encoding.find('UTF-8'))
      end
    end
    # no candidate encoding fits: replace whatever cannot be converted
    return str.encode(Encoding.find('UTF-8'), :invalid => :replace, :undef => :replace, :replace => '?')
  end
end
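The bytes quoted in the error message (\xC7\xE1\xE4\xD5 ...) read as Arabic letters when interpreted as Windows-1256, so another option, assuming that is the charset the sender really used, is to label the binary string with that encoding before transcoding. This is only a sketch: the helper name to_utf8_from and the default source encoding are illustrative assumptions and not part of the mail gem.

```ruby
# Relabel a binary (ASCII-8BIT) string with the encoding its bytes are
# assumed to be in, then transcode it to UTF-8.
# `to_utf8_from` is a hypothetical helper; Windows-1256 is an assumption
# based on the bytes shown in the error message.
def to_utf8_from(raw, source_encoding = 'Windows-1256')
  # dup so the caller's string keeps its original encoding label
  raw.dup.force_encoding(source_encoding).encode('UTF-8')
end

raw  = "\xC7\xE1\xE4\xD5".b   # first word of the failing body, as raw bytes
text = to_utf8_from(raw)
puts text.encoding            # UTF-8
puts text.valid_encoding?     # true
```

If the source can send more than one charset, reading the charset declared on the part and passing it as source_encoding would be more robust than hard-coding one.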

I found a workaround.
Since only specific emails will be sent to this account, just for use by this application, I have full control over the formatting. For some reason mail decodes a text/plain attachment perfectly,
so:
body = nil # declared outside the block so it is still visible afterwards
emails.attachments.each do |attachment|
  if attachment.content_type.start_with?('text/plain')
    # extract the text file
    begin
      body = attachment.body.decoded
    rescue StandardError => e
      puts "Unable to save data for #{attachment.filename} because #{e.message}"
    end
  end
end
mongoMessage = {"date" => emails.date.to_s , "subject" => emails.subject , "body" => body }

Related

Ruby compatibility error encoding

I'm having a problem. Let's look:
C:\temp> ruby script.rb
script.rb => Powershell output
puts "ę" => ę #irb \xA9
puts "\xA9" => ▯
puts "ę"=="\xA9" => false
input = $stdin.gets.chomp => input=="ę"
puts "e#{input}e" => eęe
puts "ę"==input => false
puts "ę#{input}" => Encoding::Compatibility Error Utf8 & CP852
irb => #command line in ruby
puts "ę"=="\xA9" => true
input = $stdin.gets.chomp => input=="ę"
puts "ę"==input => true && "\xA9"==input => true
puts "ę#{input}" => ęę
It looks like PowerShell's input uses a different font for all special characters than Ruby and Notepad++ (?). Can I change that so it works when I type at the prompt (when asked) and does not raise an error?
Edit: Sorry for the misdirection. I added the invocation and specified that the file has the extension ".rb", not ".txt".
Edit 2: OK, I've researched some more and tried encoding a variable to UTF-8. Something strange occurred.
puts "ę#{input.encoding}" => ęCP852
puts "\xA9" => UTF-8
Encoding to CP852 revealed that encodings operate on bytes. I learned that in UTF-8 "ę" is the bytes C4 99 and "ą" is C4 85.
OK, got it: ".encoding" shows which encoding a string uses, and that resolves the problem.
puts "ę#{input.encode "UTF-8"}" => ęę
Thanks everyone for your input.

If your input at the prompt (either cmd or PowerShell) causes problems because different encodings are mixed, encode it in the script, for example input.encode "UTF-8" in Ruby. If you don't know which methods do that in your language, search for "your_language_name encoding".
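A small sketch of what goes wrong and why encode fixes it. It simulates console input tagged as CP852, which is an assumption taken from the input.encoding value reported in the question; no real console is involved.

```ruby
# Simulated console input: the single byte 0xA9 is "ę" in CP852.
input   = "\xA9".b.force_encoding('CP852')
literal = "\u0119"   # "ę" as a UTF-8 string, like a literal in the script

# Mixing two non-ASCII strings with different encodings raises an error.
begin
  literal + input
rescue Encoding::CompatibilityError => e
  puts "cannot concatenate: #{e.message}"
end

# Transcoding the input to UTF-8 first makes comparison and
# concatenation behave as expected.
utf8_input = input.encode('UTF-8')
puts literal == utf8_input              # true
puts((literal + utf8_input).length)     # 2
```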

Ruby_send the result of scraping through email

With Ruby, my app:
checks if the page status is 200
parses the PDF files if so
sends the result of the scraping via email
I have tested all parts of the code and everything works fine except one thing: the mail that is sent doesn't contain the result of my scraping.
What is the issue? Is it related to the variable @monscrape, which may not be recognised in the final part of the code?
My code:
require 'open-uri'
require "net/http"
require 'rubygems'
require 'pdf/reader'
require 'mail'
options = { :address => "smtp.gmail.com",
:port => 587,
:domain => 'gmail.com',
:user_name => 'mail@gmail.com',
:password => 'pwd',
:authentication => 'plain',
:enable_starttls_auto => true
}
lien= "http://www.example.com"
url = URI.parse(lien)
req = Net::HTTP.new(url.host, url.port)
res = req.request_head(url.path)
if res.code == "200"
io = open('http://www.example.com')
reader = PDF::Reader.new(io)
reader.pages.each do |page|
res = page.text
@monscrape = res.scan(/text[\s\S]*text/)
end
Mail.defaults do
delivery_method :smtp, options
end
Mail.deliver do
to 'mail@hotmail.com'
from 'Author <mail@gmail.com>'
subject 'testing sendmail'
html_part do
content_type 'text/html; charset=UTF-8'
body '<h1>Please find below the scrape <%= @monscrape %></h1>'
end
end
else
puts "the link doesn't work"
end

The problem is that the Mail.deliver block is evaluated using instance_eval. Therefore no instance variables (@variables) from the surrounding scope are visible inside the Mail block.
So @monscrape will always be nil inside the Mail.deliver block.
One solution is to use a local (non-instance) variable instead:
monscrape = "test"
Mail.deliver do
...
body "<h1>Please find below the scrape #{monscrape}</h1>"
...
end
Also note that Mail does not support ERB(!), so you cannot use something like <%= monscrape %> in the body. You have to treat it like a normal string and use string interpolation, which requires double quotes " and not single quotes '.
See further discussion and options here:
Why can't the Mail block see my variable?
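The effect can be reproduced without the mail gem. In this sketch, TinyMessage is a hypothetical stand-in for any DSL object that runs its block with instance_eval, as described above; it is not Mail's actual implementation.

```ruby
# TinyMessage is a hypothetical stand-in for a DSL object that evaluates
# its block with instance_eval.
class TinyMessage
  attr_reader :content

  def initialize(&block)
    instance_eval(&block)   # inside the block, self is the TinyMessage
  end

  def body(text)
    @content = text
  end
end

def build_messages
  @monscrape = 'from instance variable'   # belongs to the caller, not TinyMessage
  monscrape  = 'from local variable'      # captured by the block's closure

  with_ivar  = TinyMessage.new { body "scrape: #{@monscrape.inspect}" }
  with_local = TinyMessage.new { body "scrape: #{monscrape}" }
  [with_ivar.content, with_local.content]
end

puts build_messages
# prints:
# scrape: nil
# scrape: from local variable
```

The instance variable is looked up on the TinyMessage object, where it was never set, while the local variable travels with the block.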

You can't use
res = req.request_head(url.path)
when url.path returns "". request_head expects a path of at least "/". That implies you need to fix up the URL being passed so it at least has the root path "/".
url = URI.parse('http://www.example.com')
url.path # => ""
req.request_head(url.path)
*** ArgumentError Exception: HTTP request path is empty
vs.
url = URI.parse('http://www.example.com/')
url.path # => "/"
req.request_head(url.path)
#<Net::HTTPOK 200 OK readbody=true>
The second problem is you're trying to read something as PDF that isn't a PDF file. Example.com returns HTML, which is text. You can't use:
io = open('http://www.example.com')
reader = PDF::Reader.new(io)
Trying to do so raises "PDF does not contain EOF marker".
It's really important that you understand what types of objects/resources are being returned by a site when you request a URL. You can't declare them willy-nilly and expect code to accept it without errors.
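One way to guard against the empty path, sketched with a hypothetical helper named request_path, is to fall back to the root path before calling request_head:

```ruby
require 'uri'

# Return a path that is safe to pass to Net::HTTP#request_head:
# an empty path is replaced with the root path "/".
# `request_path` is a hypothetical helper name.
def request_path(link)
  path = URI.parse(link).path
  path.empty? ? '/' : path
end

puts request_path('http://www.example.com')        # prints "/"
puts request_path('http://www.example.com/a.pdf')  # prints "/a.pdf"
```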

File.readlines invalid byte sequence in UTF-8 (ArgumentError)

I am processing a file which contains data from the web and encounter invalid byte sequence in UTF-8 (ArgumentError) error on certain log files.
a = File.readlines('log.csv').grep(/watch\?v=/).map do |s|
s = s.parse_csv;
{ timestamp: s[0], url: s[1], ip: s[3] }
end
puts a
I am trying to get this solution working. I have seen people doing
.encode!('UTF-8', 'UTF-8', :invalid => :replace)
but it doesn't appear to work with File.readlines.
File.readlines('log.csv').encode!('UTF-8', 'UTF-8', :invalid => :replace).grep(/watch\?v=/)
' : undefined method `encode!' for # (NoMethodError)
What's the most straightforward way to filter or convert invalid UTF-8 characters while reading a file?
Attempt 1
Tried this but it failed with same invalid byte sequence error.
IO.foreach('test.csv', 'r:bom|UTF-8').grep(/watch\?v=/).map do |s|
# extract three columns: time stamp, url, ip
s = s.parse_csv;
{ timestamp: s[0], url: s[1], ip: s[3] }
end
Solution
This seems to have worked for me.
a = File.readlines('log.csv', :encoding => 'ISO-8859-1').grep(/watch\?v=/).map do |s|
s = s.parse_csv;
{ timestamp: s[0], url: s[1], ip: s[3] }
end
puts a
Does Ruby provide a way to do File.read() with specified encoding?

I am trying to get this solution working. I have seen people doing
.encode!('UTF-8', 'UTF-8', :invalid => :replace)
but it doesn't appear to work with File.readlines.
File.readlines returns an Array. Arrays don't have an encode method; strings do, so the call has to be made on each line.
Could you please provide an example of the alternative above?
require 'csv'
CSV.foreach("log.csv", encoding: "utf-8") do |row|
md = row[0].match /watch\?v=/
puts row[0], row[1], row[3] if md
end
Or,
CSV.foreach("log.csv", 'rb:utf-8') do |row|
If you need more speed on Ruby 1.8, use the fastercsv gem; from Ruby 1.9 onward the standard CSV library is FasterCSV.
This seems to have worked for me.
File.readlines('log.csv', :encoding => 'ISO-8859-1')
Yes, in order to read a file you have to know its encoding.
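When the real encoding is unknown and losing a few bad bytes is acceptable, the cleanup can be applied to each line rather than to the Array. A minimal sketch, assuming Ruby 2.1 or later for String#scrub; utf8_lines is a hypothetical helper name and the log contents are made up for the demo:

```ruby
require 'tempfile'

# Read a file as raw bytes and return its lines as valid UTF-8 strings,
# replacing each invalid byte sequence with "?".
# `utf8_lines` is a hypothetical helper; String#scrub needs Ruby 2.1+.
def utf8_lines(path)
  File.readlines(path, mode: 'rb').map do |line|
    line.force_encoding('UTF-8').scrub('?')
  end
end

# Demo with a throwaway file whose first line contains an invalid byte (\xFF).
Tempfile.create('log') do |f|
  f.binmode
  f.write("1,http://example.com/watch?v=abc\xFF,x,10.0.0.1\n")
  f.write("2,http://example.com/,x,10.0.0.2\n")
  f.flush
  matches = utf8_lines(f.path).grep(/watch\?v=/)
  puts matches.length   # 1
end
```

Scrubbing hides damage rather than repairing it, so passing the correct encoding to File.readlines remains the better fix when the encoding is known.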

In my case the script defaulted to US-ASCII and I wasn't at liberty to change it on the server for risk of other conflicts.
I did
File.readlines(email, :encoding => 'UTF-8').each do |line|
but this didn't work with some Japanese characters so I added this on the next line and that worked fine.
line = line.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

Ruby CSV UTF-8 Encoding/Translation Issues

I'm having problems with Ruby 1.9 CSV and invalid UTF-8 characters in my data.
My code looks something like this:
CSV.foreach("small-test2.csv", options) do |row |
name, workgroup, address, actual, output = row
next if nbname == "NBName"
#ssl_info[name] = workgroup, address, actual, output
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
clean = ic.iconv(output + ' ')[0..-2]
puts clean
end
However I'm still getting the following:
ArgumentError: invalid byte sequence in UTF-8
=~ at org/jruby/RubyRegexp.java:1487
=~ at org/jruby/RubyString.java:1686
Is there anything I'm missing here?

Try this:
output.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

How to send a binary file via Pony?

I use the function below to build a hash of the whole contents of a directory so I can send all the files as attachments.
def get_attachments_from_directory(dir)
attachment_to_send = Hash.new
Dir[dir.gsub("\\","/")+"/*"].each {|file|
file_to_send = File.read(file)
#file_to_send = File.read(file, :binmode => true)
attachment_to_send[File.basename(file)]=file_to_send
}
return attachment_to_send
end
and then I use the function below to send the attachments:
def email_it(body, subject, to, from, attachment_to_send)
$smtp = 'mail.com'
$smtp_port = 25
Pony.mail(
:to => to,
:from => from,
:subject => subject,
:body => Nokogiri::HTML(body).text,
:html_body => body,
:attachments => attachment_to_send,
:via => :smtp,
:via_options => {
:address => $smtp,
:port => $smtp_port,
:enable_starttls_auto => false
}
)
end
There are two files in my test directory: a .log and a .png. Both are sent and received, but the .png is corrupted: Gmail says the image cannot be displayed because it contains errors. The .png file name is correct in my Gmail account, but the file size is wrong, much smaller than the original.
"Show original" in Gmail gives me:
----==_mimepart_4fd9515347359_fc1e853c88342d
Date: Thu, 14 Jun 2012 12:49:55 +1000
Mime-Version: 1.0
Content-Type: image/png;
charset=UTF-8;
filename="error_when_time_out - login at 2012-06-14 12.48.55.png"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="error_when_time_out - login at 2012-06-14 12.48.55.png"
Content-ID: <4fd95153648c7_fc1e853c883518@RATionalxp.mail>
iVBORwq/XNErlnbxOOrmtdDZDYaMWm16lTatQptSpk4t12RW6HNq6IJGvvyB
sabbUovDe5+loc9U3yPX9Yr1vWJDv9Q4KNcPydUDQnkfV9LNFrTTOc2GrEZd
Rr0uo06fUUdn1jOZ9RxRnZBZJ2bXG3M4yoTplFqeWJFGFoVBjDanha5JoWOM
bx3hi0aTQPSQNcikNoMVeYrndSi9YeVl1jxLr07oNrrn11F1kv3AeL9C8Mpi
bkTrjvku73RaeOP6/KvXVv5yzfC6vfCqHf/H64Y/XNf//obujzf0f7lp+PMt
... it continues ....
Xvjq8X//p/Ocdy68s2/DZ//5/Muvf/rvt319XzQf8p9J+7wpSTTguXYPo3Dy
TYiIaNAvYXs5ir9gv4akEz5MOO6DxGPf150oPfApIe6Yu5SVblRBYgL1TrWq
QqWsUnFag5rYTagbCD4lJCgO2hYdpGzQteqR9NCgo3ZTmh0=
----==_mimepart_4fd9515347359_fc1e853c88342d--
Calling inspect on the hash outputs:
{"error_when_time_out - login at 2012-06-14 12.50.12.png"=>"\211PNG\n\277\\\321
+\226v\3618\352\346\265\320\331\r\206\214Zmz\2256\255B\233R\246N-\327dV\350sj\35
0\202F\276\374\201\261\246\333R\213\303{\237\245\241\317T\337#\327\365\212\365\2
75bC\277\3248(\327\017\311\325\003By\037W\322\315\026\264\3239\315\206\254F]F\27
5.\243N\237QGg\3263\231\365\034Q\235\220Y'f\327\es8\312\204\351\224Z\236X\221F\0
26\205A\2146\247\205\256I\241c\214o\035\341\213F\223#\364\2205\310\2446\203\025y
If I try to read the file with file_to_send = File.read(file, :binmode => true),
I get an error: TypeError - can't convert Hash into Integer.
ruby 1.8.7 (2010-08-16 patchlevel 302) [i386-mingw32]
mime-types (1.16)
pony (1.3)

The conventional way to read binary data without any CR+LF translation is:
File.open(file, 'rb').read
Ruby 1.9 introduces a few new ways to do this that you might be inadvertently trying in your 1.8.7 environment. The second argument to read is the number of bytes you want to read, not the mode of the file.
Be sure to read the documentation on any method you're unfamiliar with. Sometimes things aren't quite what you'd expect.
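A sketch of the binary read using the block form of File.open, which also closes the file handle. The read_binary helper name is hypothetical, and the demo part uses Tempfile.create, which needs a newer Ruby than 1.8.7; the File.open(path, 'rb') call itself is the conventional form mentioned above.

```ruby
require 'tempfile'

# Read a file's bytes unchanged. The block form of File.open closes the
# handle automatically; 'rb' disables newline translation on Windows.
# `read_binary` is a hypothetical helper name.
def read_binary(path)
  File.open(path, 'rb') { |f| f.read }
end

# Demo: the 8-byte PNG signature contains \r\n and \x1A, bytes that
# text mode on Windows can translate or treat as end-of-file.
png_signature = "\x89PNG\r\n\x1A\n".b
Tempfile.create('sig') do |f|
  f.binmode
  f.write(png_signature)
  f.flush
  puts read_binary(f.path).bytesize   # 8
end
```

In get_attachments_from_directory, this would replace the File.read(file) call.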
