How to properly decode string with quoted-printable encoding in Ruby - ruby

I'm trying to decode what I think is some quoted-printable encoded text that appears in an MBox email archive. I will give one example of some text I am having trouble with.
In the MBox, the following text appears:
"Demarcation by Theresa Castel=E3o-Lawless"
Properly decoded, I think this should appear as:
"Demarcation by Theresa Castelão-Lawless"
I'm basing my statement of what it should properly look like both off of
1) a web archive of the email in which the text is properly rendered as "Demarcation by Theresa Castelão-Lawless"
and 2) this page, which shows "=E3" as corresponding to a "ã" for quoted-printable https://www.ic.unicamp.br/~stolfi/EXPORT/www/ISO-8859-1-Encoding.html
I've tried the code below but it gives the wrong output.
string = "Demarcation by Theresa Castel=E3o-Lawless"
decoded_string = Mail::Encodings::QuotedPrintable.decode(string)
puts decoded_string + "\n"
The result from the code above is
"Demarcation by Theresa Castel?o-Lawless"
but as stated above, I want
"Demarcation by Theresa Castelão-Lawless"

Try to avoid weird Rails stuff when you have plain old good ruby to accomplish a task. String#unpack is your friend.
"Demarcation by Theresa Castel=E3o-Lawless".
unpack("M").first. # unpack as quoted printable
force_encoding(Encoding::ISO_8859_1).
encode(Encoding::UTF_8)
#⇒ "Demarcation by Theresa Castelão-Lawless"
or, as suggested in comments by #Stefan, one can pass the source encoding as the 2nd argument:
"Demarcation by Theresa Castel=E3o-Lawless".
unpack("M").first. # unpack as quoted printable
encode('utf-8', 'iso-8859-1')
Note: force_encoding is needed to tell the engine this is single-byte ISO with european accents before encoding into target UTF-8.

Related

Ruby decode string

In Ruby how can I get:
"b\x81rger" by providing the string "bürger".
I need to print special characters to a Zebra printer, I can see that "b\x81rger" prints "bürger", but sending "bürger" does not print the correct character.
Turns out it’s CP850.
Proper solution (Ruby 2.5+)
Normalize the unicode string and then encode it into CP850:
"bürger".unicode_normalize(:nfc).encode(Encoding::CP850)
#⇒ "b\x81rger"
Works for both special characters and combined diacritics.
Fallback solution (Ruby 2.5-)
Encode and pray it’s a composed umlaut:
"bürger".encode(Encoding::CP850)
#⇒ "b\x81rger"

How to decode a string in Ruby

I am working with the Mandrill Inbound Email API, and when an email has an attachment with one or more spaces in its file name, then the file name is encoded in a format that I do not know how to decode.
Here is a an example string I receive for the file name: =?UTF-8?B?TWlzc2lvbmFyecKgRmFpdGjCoFByb21pc2XCoGFuZMKgQ2FzaMKgUmVjZWlwdHPCoFlURMKgMjUzNQ==?= =?UTF-8?B?OTnCoEp1bHktMjAxNS5jc3Y=?=
I tried Base64.decode64(#{encoded_value}) but that didn't return a readable text.
How do I decode that value into a readable string?
This is MIME encoded-word syntax as defined in RFC-2822. From Wikipedia:
The form is: "=?charset?encoding?encoded text?=".
charset may be any character set registered with IANA. Typically it would be the same charset as the message body.
encoding can be either "Q" denoting Q-encoding that is similar to the quoted-printable encoding, or "B" denoting base64 encoding.
encoded text is the Q-encoded or base64-encoded text.
Fortunately you don't need to write a decoder for this. The Mail gem comes with a Mail::Encodings.value_decode method that works perfectly and is very well-tested:
subject = "=?UTF-8?B?TWlzc2lvbmFyecKgRmFpdGjCoFByb21pc2XCoGFuZMKgQ2FzaMKgUmVjZWlwdHPCoFlURMKgMjUzNQ==?= =?UTF-8?B?OTnCoEp1bHktMjAxNS5jc3Y=?="
Mail::Encodings.value_decode(subject)
# => "Missionary Faith Promise and Cash Receipts YTD 253599 July-2015.csv"
It gracefully handles lots of edge cases you probably wouldn't think of (until your app tries to handle them and falls over):
subject = "Re:[=?iso-2022-jp?B?GyRCJTAlayE8JV0lcyEmJTglYyVRJXMzdDwwMnEbKEI=?=\n =?iso-2022-jp?B?GyRCPFIbKEI=?=] =?iso-2022-jp?B?GyRCSlY/LiEnGyhC?=\n =?iso-2022-jp?B?GyRCIVolMCVrITwlXSVzIVskKkxkJCQ5ZyRvJDsbKEI=?=\n =?iso-2022-jp?B?GyRCJE43byRLJEQkJCRGIUolaiUvJSglOSVIGyhC?=#1056273\n =?iso-2022-jp?B?GyRCIUsbKEI=?="
Mail::Encodings.value_decode(subject)
# => "Re:[グルーポン・ジャパン株式会社] 返信:【グルーポン】お問い合わせの件について(リクエスト#1056273\n )"
If you're using Rails you already have the Mail gem. Otherwise just add gem "mail" to your Gemfile, then bundle install and, in your script, require "mail".
Thanks to the comment from #Yevgeniy-Anfilofyev who pointed me in the right direction, I was able to write the following method that correctly parsed the encoded value and returned an ASCII string.
def self.decode(value)
# It turns out the value is made up of multiple encoded parts
# so we first need to split each part so we can decode them seperately
encoded_parts = name.split('=?UTF-8?B?').
map{|x| x.sub(/\?.*$/, '') }.
delete_if{|x| x.blank? }
encoded_parts.map{|x| Base64.decode64(x)}. # decode each part
join(''). # join the parts together
force_encoding('utf-8'). # force UTF-8 encoding
gsub("\xC2\xA0", " ") # remove the UTF-8 encoded spaces with an ASCII space
end

Encoding Unicode code points with Ruby

I'm retrieving an HTML document that is parsed with Nokogiri. The HTML is using charset ISO-8859-1. The problem is there are some Unicode chars in the document which are converted to Unicode code points instead of their respective character.
For example, this is some text in the HTML as received (in ISO-8859-1):
\x95\x95 JOHNNY VENETTI \x95\x95
And when attempting to work with this text, it gets converted to this:
\u0095\u0095 JOHNNY VENETTI \u0095\u0095
So my question is, how can I ensure those characters are represented as their appropriate character instead of the code point? I've tried doing a gsub on the text, but that seems wrong for this. Also, I do not have control over the encoding of the HTML document.
First you should realize that this string is NOT ISO-8859-1 encoded (file says "Non-ISO extended-ASCII text" and the codepage verifies this). May well be this is your problem, in that case you should specify the right encoding (probably something like Windows-1252, in this case) in your HTML document.
In Nokogiri, you can also set the encoding explicitly in cases where the document specifies the wrong encoding:
Nokogiri.HTML("<p>\x95\x95 JOHNNY VENETTI \x95\x95</p>", nil, "Windows-1252")
# => #<Nokogiri::HTML::Document: ...
# children=[#<Nokogiri::XML::Text:0x15744cc "•• JOHNNY VENETTI ••">]>]>]>]>
If you don't have the option to solve this cleanly like above, you can also do it the hard way and associated the string with its correct encoding:
s = "\x95\x95 JOHNNY VENETTI \x95\x95"
s.encoding # => #<Encoding:ASCII-8BIT>
s.force_encoding 'Windows-1252'
s.encode! 'utf-8'
s # => "•• JOHNNY VENETTI ••"
Note that this last piece of code is Ruby 1.9 only. If you want, you can read more about the new encoding system in Ruby 1.9.

Incompatible encodings with ruby and Nokogiri HTML

I'm parsing an external HTML page with Nokogiri. That page is encoded with ISO-8859-1. Part of the data I want to extract, contains some – (dash) html entities:
xml = Nokogiri.HTML(open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
f = xml.xpath("//div[#style='background-color:#D9DBD9; padding:15px 12px 10px 10px;']//div[#class='tit_inter_cnz']/text()")
f[0].text #=> Preview M/E/C/A \u0096 John Digweed
In the last line, the String should be rendered on the browser with a dash. The browser correctly renders it if I specify my page as ISO-8859-1 encoding, however, my Sinatra app uses UTF-8. How can I correctly display that text in the browser? Today is is being displayed as a square with a small number inside.
I tried force_encoding('ISO-8859-1'), but then I get a CompatibilityError from Sinatra.
Any clues?
[Edit]
Below are screenshots of the app:
-> Firefox with character encoding UTF-8
-> [Firefox with character encoding Western (ISO-8859-1)
It's worth mentioning that in the ISO-8859-1 mode above, the dash is shown correctly, but there is another incorrect character with it just before the dash. Weird :(
After parsing a document in Nokogiri you can tell it to assume a different encoding. Try:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML((open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
doc.encoding = 'UTF-8'
I can't see that page from here, to confirm this fixes the problem, but it's worked for similar problems.
Summary: The problematic characters are control characters from ISO-8859-1, not intended for display.
Details and Investigation:
Here's a test showing that you are getting valid UTF-8 from Nokogiri and Sinatra:
require 'sinatra'
require 'open-uri'
get '/' do
html = open("http://flybynight.com.br/agenda.php").read
p [ html.encoding, html.valid_encoding? ]
#=> [#<Encoding:ISO-8859-1>, true]
str = html[ /Preview.+?John Digweed/ ]
p [ str, str.encoding, str.valid_encoding? ]
#=> ["Preview M/E/C/A \x96 John Digweed", #<Encoding:ISO-8859-1>, true]
utf8 = str.encode('UTF-8')
p [ utf8, utf8.encoding, utf8.valid_encoding? ]
#=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]
require 'nokogiri'
doc = Nokogiri.HTML(html, nil, 'ISO-8859-1')
p doc.encoding
#=> "ISO-8859-1"
dig = doc.xpath("//div[#class='tit_inter_cnz']")[1]
p [ dig.text, dig.text.encoding, dig.text.valid_encoding? ]
#=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]
<<-ENDHTML
<!DOCTYPE html>
<html><head><title>Dig it!</title></head><body>
<p>Here it comes...</p>
<p>#{dig.text}</p>
</body></html>
ENDHTML
end
This properly serves up content with Content-Type:text/html;charset=utf-8 on my computer. Chrome does not show my this character in the browser, however.
Analyzing that response, the same Unicode byte pair comes back for the dash as is seen in the above: \xC2\x96. This appears to be this Unicode character which seem to be an odd dash.
I would chalk this up to bad source data, and simply throw:
#encoding: UTF-8
at the top of your Ruby source file(s), and then put in:
f = ...text.gsub( "\xC2\x96", "-" ) # Or a better Unicode character
Edit: If you look at the browser test page for that character you will see (at least in in Chrome and Firefox for me) that the UTF-8 literal version is blank, but the hex and decimal escape versions show up. I cannot fathom why this is, but there you have it. The browsers are simply not displaying your character correctly when presented in raw form.
Either make it an HTML entity, or a different Unicode dash. Either way a gsub is called for.
Edit #2: One more odd note: the character in the source encoding has a hexadecimal byte value of 0x96. As far as I can tell, this does not appear to be a printable ISO-8859-1 character. As shown in the official spec for ISO-8859-1, this falls in one of the two non-printing regions.
I work in publishing of scientific manuscripts and there are many dashes. The dash that you are using is not an ASCII dash, it is a unicode dash. Forcing the ISO encoding is probably having the effect of making the dash change.
http://www.fileformat.info/info/unicode/char/96/index.htm
That site is excellent for unicode issues.
The reason you are getting a square is that perhaps your browser does not support this. It is probably correctly rendered. I would keep UTF-8 encoding, and if you want to make that dash so everyone can see it, convert it to an ascii dash.
You may want to try Iconv to convert the characters to ASCII/UTF-8 http://craigjolicoeur.com/blog/ruby-iconv-to-the-rescue

Ruby 1.9 regex encoding

I am parsing this feed http://www.sixapart.com/labs/update/developers/ with nokogiri and then running some regex on the contents of some tags. The content is UTF-8 mostly, but is occasionally corrupt. However, for my case I don't really care and just need to pass the right parts of the content through, so I'm happy to treat the data as binary/ASCII-8BIT. The problem is that no matter what I do, regexes in my script are treated as either UTF-8 or ASCII. No matter what I set the encoding comment to, or what I do to create the regex.
Is there a solution to this? Can I force the regex to binary? Can I do a gsub without a regex easily? (I am just replacing & with &)
You need to encode the initial string and use the FIXEDENCODING option.
1.9.3-head :018 > r = Regexp.new("chars".force_encoding("binary"), Regexp::FIXEDENCODING)
=> /chars/
1.9.3-head :019 > r.encoding
=> #<Encoding:ASCII-8BIT>
Strings have a property of encoding. Try to use method String#force_encoding before applying regex.
UPD: To make your regexp be ascii, look on accepted answer here: Ruby 1.9: Regular Expressions with unknown input encoding
def get_regex(pattern, encoding='ASCII', options=0)
Regexp.new(pattern.encode(encoding),options)
end

Resources