Is there a MIME header decoder (RFC 2047) for Ruby? [duplicate]

I have the following header:
From: =?iso-8859-1?Q?Marta_Falc=E3o?= <marta.falcao#example.com.br>
I can easily split out the stuff before the <, which leaves me with
"=?iso-8859-1?Q?Marta_Falc=E3o?="
What can I use to turn this into "Marta Falcão"?

Using the newer Mail gem:
Mail::Encodings.value_decode(str) or
Mail::Encodings.unquote_and_convert_to(str, to_encoding)

Thanks to Roland Illig for his comment, which led me to two options:
install rfc2047-ruby and call Rfc2047.decode(header)
install TMail and call TMail::Unquoter.unquote_and_convert_to(header, 'utf-8') or better yet TMail::Address.parse(header).friendly, the latter of which strips out the <email address> part

Implementing RFC 2047 decoding in plain Ruby isn't hard:
module Rfc2047
  TOKEN = /[\041\043-\047\052\053\055\060-\071\101-\132\134\136\137\141-\176]+/.freeze
  ENCODED_TEXT = /[\041-\076\100-\176]+/.freeze
  ENCODED_WORD = /=\?(?<charset>#{TOKEN})\?(?<encoding>[QB])\?(?<encoded_text>#{ENCODED_TEXT})\?=/i.freeze
  class << self
    # Wrap the string in a Base64 ("B") encoded-word labelled with its own encoding.
    def encode(input)
      "=?#{input.encoding}?B?#{[input].pack('m0')}?="
    end
    # Decode the first encoded-word found in the input.
    def decode(input)
      match_data = ENCODED_WORD.match(input)
      raise ArgumentError if match_data.nil?
      charset, encoding, encoded_text = match_data.captures
      decoded =
        case encoding
        # In Q encoding, "_" stands for a space (RFC 2047, section 4.2).
        when 'Q', 'q' then encoded_text.tr('_', ' ').unpack1('M')
        when 'B', 'b' then encoded_text.unpack1('m')
        end
      decoded.force_encoding(charset)
    end
  end
end
Rfc2047.decode '=?iso-8859-1?Q?Marta_Falc=E3o?=' # => Marta Falcão (as an ISO-8859-1 string)
Update
mikel/mail currently has an encoding issue and might not decode the string correctly.
If that really bothers you, you can try new_rfc_2047:
$ gem install new_rfc_2047
$ ruby -rrfc_2047 -e 'puts Rfc2047.decode "From: =?iso-8859-1?Q?Marta_Falc=E3o?= <marta.falcao#example.com.br>"'
From: Marta Falcão <marta.falcao#example.com.br>
Since the source code of mikel/mail is a little too complicated for me to modify, I made my own gem for this.
Gem source is here: https://github.com/tonytonyjan/rfc_2047/
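A header can contain several encoded words plus plain text, so a stdlib-only variant that replaces every encoded word in place may be handy. The following is a minimal sketch, not the code of any of the gems mentioned above: `decode_header_value` and `ENCODED_WORD_RE` are hypothetical names, the pattern is deliberately looser than the RFC grammar, and an unknown charset will raise an ArgumentError from `force_encoding`.

```ruby
# Hypothetical helper (not part of any gem): decode every RFC 2047
# encoded-word in a header value and return the result as UTF-8.
ENCODED_WORD_RE = /=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=/

def decode_header_value(value)
  # Whitespace between two adjacent encoded-words is not significant,
  # so drop it before decoding.
  joined = value.gsub(/(\?=)\s+(=\?)/, '\1\2')
  joined.gsub(ENCODED_WORD_RE) do
    charset, encoding, text = $1, $2, $3
    bytes =
      if encoding.casecmp?('Q')
        # Q encoding: "_" means space, "=XX" is a hex-escaped byte.
        text.tr('_', ' ').unpack1('M')
      else
        # B encoding is plain Base64.
        text.unpack1('m')
      end
    # Label the raw bytes with the declared charset, then transcode.
    bytes.force_encoding(charset).encode('UTF-8')
  end
end

puts decode_header_value('From: =?iso-8859-1?Q?Marta_Falc=E3o?= <marta.falcao#example.com.br>')
```

Transcoding to UTF-8 at the end means the caller does not have to care which charset the sender used.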

Related

Ruby Scripts - Reference Gems

I am writing my first Ruby script and am curious how to actually reference gems in the script. I am unable to test the code beforehand because it reads from an email in /etc/aliases through a pipe.
Does anyone with experience writing Ruby scripts have advice?
P.S. There are many bugs because the script has not been tested or refactored.
Sample Script
#!/usr/bin/env ruby
# Reading files
mail = File.open(ARGV[0])
lines = []
mail.each_with_index do |i,line|
line[i] = lines.#remove leading and trailing spaces
end
first_line = line[1].strip
if line[1] /^(256)/
phone_number = first_line.gsub("+", "")
else
phone_number = "256#{first_line.gsub(/^0+/,"")}"
end
message = line[2].strip
# Sending message
url = "http://xxxxxxxxxxx.com/api/v2/json/messages?token=XXXXXXXXXXXXXXXXXXXXXXXXXXX&to=#{phone_number}&from=XXXXXX&message=#{CGI.escape(message)}"
5.times do |i|
response = HTTParty.get(url)
body = JSON.parse(response.body)
if body["status"] == "Success"
break
end
end
The gems in question are CGI, HTTParty, and JSON.
External gems are loaded by calling the "require" method.
So to include them in your script, the first few lines could be something like this:
#!/usr/bin/env ruby
require "json"
require "cgi"
require "httparty"
#rest of your code...
I assume you have installed your gems with gem install <gemname>?
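To make the require step concrete, here is a small sketch of how the top of such a script might look. It is only an illustration under assumptions: `normalize_phone_number` and `build_query` are hypothetical helpers, the phone number rule is a guess at what the sample script intends, and the HTTParty require is wrapped in a rescue so the sketch still runs when that gem is not installed.

```ruby
#!/usr/bin/env ruby
require "cgi"   # part of the standard library
require "json"  # part of the standard library

begin
  require "httparty" # third-party gem, needs `gem install httparty`
rescue LoadError
  warn "httparty is not installed; HTTP calls are skipped in this sketch"
end

# Assumed intent of the sample script: keep numbers that already carry
# the 256 country code (minus any "+"), otherwise replace leading zeros
# with 256.
def normalize_phone_number(raw)
  digits = raw.strip.delete("+")
  digits.start_with?("256") ? digits : "256#{digits.sub(/\A0+/, '')}"
end

# Escape each value separately before composing the query string.
def build_query(phone_number, message)
  "to=#{CGI.escape(phone_number)}&message=#{CGI.escape(message)}"
end

puts normalize_phone_number(" 0772123456 ")
puts build_query("256772123456", "hello world")
```

Keeping the parsing in small methods like these also makes it possible to test the logic without piping a real email through /etc/aliases.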

UTF-8 Error in Ruby

I'm scraping a few websites and eventually I hit a UTF-8 error that looks like this:
/usr/local/lib/ruby/gems/1.9.1/gems/dm-core-1.2.0/lib/dm-core/support/ext/blank.rb:19:in
`=~': invalid byte sequence in UTF-8 (ArgumentError)
Now, I don't care about the websites being 100% accurate. Is there a way I can take the page I get and strip out any problem encodings and then pass it around inside my program?
I'm using ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-darwin11.2.0] if that matters.
Update:
  def self.blank?(value)
    return value.blank? if value.respond_to?(:blank?)
    case value
    when ::NilClass, ::FalseClass
      true
    when ::TrueClass, ::Numeric
      false
    when ::Array, ::Hash
      value.empty?
    when ::String
      value !~ /\S/ ###This is the line 19 that has the issue.
    else
      value.nil? || (value.respond_to?(:empty?) && value.empty?)
    end
  end
end
When I try to save the following line:
What Happens in The Garage Tin Sign2. � � Newsletter Our monthly newsletter,
It throws the error. It's on page: http://www.stationbay.com/. But what is odd is that when I view it in my web browser it doesn't show the funny symbols in the source.
What do I do next?
The problem is that your string contains bytes that are not valid UTF-8, but it is tagged with the UTF-8 encoding anyway. The following short code demonstrates the issue:
a = "\xff"
a.force_encoding "utf-8"
a.valid_encoding? # returns false
a =~ /x/ # provokes ArgumentError: invalid byte sequence in UTF-8
The best way to fix this is to apply the proper encoding right from the beginning. If this is not an option, you can use String#encode:
a = "\xff"
a.force_encoding "utf-8"
a.valid_encoding? # returns false
a.encode!("utf-8", "utf-8", :invalid => :replace)
a.valid_encoding? # returns true now
a =~ /x/ # works now
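On Ruby 2.1 and later, String#scrub is another way to drop or replace invalid bytes. The sketch below assumes such a Ruby version (the question uses 1.9.3, where scrub is not available); `sanitize_utf8` is a hypothetical helper name and the sample text is made up.

```ruby
# Hypothetical helper: return a copy of the string that is valid UTF-8,
# replacing every invalid byte sequence with the given replacement.
def sanitize_utf8(text, replacement = "?")
  text.dup.force_encoding("UTF-8").scrub(replacement)
end

raw = "Tin Sign2. \xFF Newsletter"
puts raw.valid_encoding?          # false, \xFF is never valid in UTF-8

clean = sanitize_utf8(raw)
puts clean.valid_encoding?        # true
puts clean                        # Tin Sign2. ? Newsletter
puts (clean !~ /\S/).inspect      # false, and no ArgumentError is raised
```

Passing an empty string as the replacement removes the bad bytes instead of marking them.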

Interpreting non-latin characters in Sinatra coming from Mac Excel 2011

I've a Mac VBA script making a request to a Ruby Sinatra web app.
The text passing from Excel contains characters such as é. Ruby (version 1.9.2) chokes on these characters as Excel is not sending them as UTF-8.
# encoding: utf-8
require 'rubygems'
require 'sinatra'
require "sinatra/reloader" if development?
configure do
  class << Sinatra::Base
    def options(path, opts={}, &block)
      route 'OPTIONS', path, opts, &block
    end
  end
  Sinatra::Delegator.delegate :options
end
options '/' do
  response.headers["Access-Control-Allow-Origin"] = "*"
  response.headers["Access-Control-Allow-Methods"] = "POST"
  halt 200
end
post '/fetch' do
  chars = []
  params['excel_input'].valid_encoding? #returns false
  params['excel_input']
end
My Excel VBA:
Sub FetchAddress()
    For Each oDest In Selection
        With ActiveSheet.QueryTables.Add(Connection:="URL;http://localhost:4567/fetch", Destination:=oDest)
            .PostText = "excel_input=" & oDest.Offset(0, -1).Value
            .RefreshStyle = xlOverwriteCells
            .SaveData = True
            .Refresh
        End With
    Next
End Sub
The character é comes out the other end as Ž.
It looks like the text in Excel is encoded as Windows-1252 (http://en.wikipedia.org/wiki/Windows-1252).
The byte representation of the character is 142 (or Ž in Windows-1252).
Iconv converts text from one character encoding to another, so it can turn the input into UTF-8. Something like this should work:
require "iconv"
...
post '/fetch' do
excel_input = Iconv.conv("UTF-8", "WINDOWS-1252", params['excel_input'])
...
end
You could also look at https://github.com/jmhodges/rchardet, which can autodetect the charset so that you can then convert the text to UTF-8.
"Ruby 1.9 Encodings: A Primer and the Solution for Rails" by Yehuda Katz is a good read if you have some time. It goes into depth about encodings and how to convert between them.
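Iconv left the standard library in Ruby 2.0, so on newer Rubies the same conversion can be done with String#force_encoding and String#encode. The sketch below uses a hypothetical helper, `to_utf8`, and also tries Mac OS Roman as the source encoding: byte 142 is Ž in Windows-1252 but é in Mac OS Roman, which suggests (as an assumption, not something confirmed in the question) that Mac Excel is sending Mac OS Roman.

```ruby
# Hypothetical helper: treat the raw bytes as text in the given legacy
# encoding and transcode them to UTF-8.
def to_utf8(raw, source_encoding)
  raw.dup.force_encoding(source_encoding).encode("UTF-8")
end

byte = 142.chr # the single byte 0x8E reported in the question

puts to_utf8(byte, "Windows-1252") # Ž
puts to_utf8(byte, "macRoman")     # é
```

In the Sinatra handler this would be applied to params['excel_input'] before the value is used anywhere else.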

In Ruby/Rails, how can I encode/escape special characters in URLs?

How do I encode or 'escape' the URL before I use OpenURI to open(url)?
We're using OpenURI to open a remote url and return the xml:
getresult = open(url).read
The problem is the URL contains some user-input text that contains spaces and other characters, including "+", "&", "?", etc. potentially, so we need to safely escape the URL. I saw lots of examples when using Net::HTTP, but have not found any for OpenURI.
We also need to be able to un-escape a similar string we receive in a session variable, so we need the reciprocal function.
Don't use URI.escape, as it has been deprecated since Ruby 1.9.
Rails' Active Support adds Hash#to_query:
{foo: 'asd asdf', bar: '"<#$dfs'}.to_query
# => "bar=%22%3C%23%24dfs&foo=asd+asdf"
Also, as you can see, it always orders the query parameters the same way, which is good for HTTP caching.
Ruby Standard Library to the rescue:
require 'uri'
user_text = URI.escape(user_text)
url = "http://example.com/#{user_text}"
result = open(url).read
See the docs for the URI::Escape module for more. It also has a method for the inverse operation (unescape).
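Because URI.escape is deprecated (and no longer present in Ruby 3.0 and later), the form-encoding helpers in the same standard library are a possible replacement for query strings. This is a minimal sketch: `build_url` is a hypothetical helper and the URL is a placeholder. Note that these methods encode a space as "+", which suits query strings but not path segments.

```ruby
require 'uri'

# Hypothetical helper: append form-encoded parameters to a base URL.
def build_url(base, params)
  "#{base}?#{URI.encode_www_form(params)}"
end

# Escape and unescape a single component.
escaped = URI.encode_www_form_component('a b&c?+')
puts escaped                                      # a+b%26c%3F%2B
puts URI.decode_www_form_component(escaped)       # a b&c?+

# Escape keys and values separately and join them into a query string.
puts build_url('http://example.com/search', 'q' => 'ruby & rails', 'page' => 2)
# http://example.com/search?q=ruby+%26+rails&page=2
```

URI.decode_www_form_component covers the reverse direction asked about for the session variable.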
The main thing you have to consider is that you have to escape the keys and values separately before you compose the full URL.
All the methods which get the full URL and try to escape it afterwards are broken, because they cannot tell whether any & or = character was supposed to be a separator, or maybe a part of the value (or part of the key).
The CGI library seems to do a good job, except for the space character, which was traditionally encoded as +, and nowadays should be encoded as %20. But this is an easy fix.
Please, consider the following:
require 'cgi'
def encode_component(s)
  # The space-encoding is a problem:
  CGI.escape(s).gsub('+','%20')
end
def url_with_params(path, args = {})
  return path if args.empty?
  path + "?" + args.map do |k,v|
    "#{encode_component(k.to_s)}=#{encode_component(v.to_s)}"
  end.join("&")
end
def params_from_url(url)
  path,query = url.split('?',2)
  return [path,{}] unless query
  q = query.split('&').inject({}) do |memo,p|
    k,v = p.split('=',2)
    memo[CGI.unescape(k)] = CGI.unescape(v)
    memo
  end
  return [path, q]
end
u = url_with_params( "http://example.com",
                     "x[1]" => "& ?=/",
                     "2+2=4" => "true" )
# "http://example.com?x%5B1%5D=%26%20%3F%3D%2F&2%2B2%3D4=true"
params_from_url(u)
# ["http://example.com", {"x[1]"=>"& ?=/", "2+2=4"=>"true"}]
Ruby has the built-in URI library, and there is also the Addressable gem, in particular Addressable::URI.
I prefer Addressable::URI. It's very full featured and handles the encoding for you when you use the query_values= method.
I've seen some discussions about URI going through some growing pains so I tend to leave it alone for handling encoding/escaping until these things get sorted out:
http://osdir.com/ml/ruby-core/2010-06/msg00324.html
http://osdir.com/ml/lang-ruby-core/2009-06/msg00350.html
http://osdir.com/ml/ruby-core/2011-06/msg00748.html
