How to decode a string in Ruby - ruby

I am working with the Mandrill Inbound Email API, and when an email has an attachment with one or more spaces in its file name, then the file name is encoded in a format that I do not know how to decode.
Here is a an example string I receive for the file name: =?UTF-8?B?TWlzc2lvbmFyecKgRmFpdGjCoFByb21pc2XCoGFuZMKgQ2FzaMKgUmVjZWlwdHPCoFlURMKgMjUzNQ==?= =?UTF-8?B?OTnCoEp1bHktMjAxNS5jc3Y=?=
I tried Base64.decode64(#{encoded_value}) but that didn't return a readable text.
How do I decode that value into a readable string?

This is MIME encoded-word syntax as defined in RFC-2822. From Wikipedia:
The form is: "=?charset?encoding?encoded text?=".
charset may be any character set registered with IANA. Typically it would be the same charset as the message body.
encoding can be either "Q" denoting Q-encoding that is similar to the quoted-printable encoding, or "B" denoting base64 encoding.
encoded text is the Q-encoded or base64-encoded text.
Fortunately you don't need to write a decoder for this. The Mail gem comes with a Mail::Encodings.value_decode method that works perfectly and is very well-tested:
subject = "=?UTF-8?B?TWlzc2lvbmFyecKgRmFpdGjCoFByb21pc2XCoGFuZMKgQ2FzaMKgUmVjZWlwdHPCoFlURMKgMjUzNQ==?= =?UTF-8?B?OTnCoEp1bHktMjAxNS5jc3Y=?="
Mail::Encodings.value_decode(subject)
# => "Missionary Faith Promise and Cash Receipts YTD 253599 July-2015.csv"
It gracefully handles lots of edge cases you probably wouldn't think of (until your app tries to handle them and falls over):
subject = "Re:[=?iso-2022-jp?B?GyRCJTAlayE8JV0lcyEmJTglYyVRJXMzdDwwMnEbKEI=?=\n =?iso-2022-jp?B?GyRCPFIbKEI=?=] =?iso-2022-jp?B?GyRCSlY/LiEnGyhC?=\n =?iso-2022-jp?B?GyRCIVolMCVrITwlXSVzIVskKkxkJCQ5ZyRvJDsbKEI=?=\n =?iso-2022-jp?B?GyRCJE43byRLJEQkJCRGIUolaiUvJSglOSVIGyhC?=#1056273\n =?iso-2022-jp?B?GyRCIUsbKEI=?="
Mail::Encodings.value_decode(subject)
# => "Re:[グルーポン・ジャパン株式会社] 返信:【グルーポン】お問い合わせの件について(リクエスト#1056273\n )"
If you're using Rails you already have the Mail gem. Otherwise just add gem "mail" to your Gemfile, then bundle install and, in your script, require "mail".

Thanks to the comment from #Yevgeniy-Anfilofyev who pointed me in the right direction, I was able to write the following method that correctly parsed the encoded value and returned an ASCII string.
def self.decode(value)
# It turns out the value is made up of multiple encoded parts
# so we first need to split each part so we can decode them seperately
encoded_parts = name.split('=?UTF-8?B?').
map{|x| x.sub(/\?.*$/, '') }.
delete_if{|x| x.blank? }
encoded_parts.map{|x| Base64.decode64(x)}. # decode each part
join(''). # join the parts together
force_encoding('utf-8'). # force UTF-8 encoding
gsub("\xC2\xA0", " ") # remove the UTF-8 encoded spaces with an ASCII space
end

Related

Best way to find potential UTF-8 characters from an imported email

So when you view the source of an email, there are several characters in there that should be converted back to UTF-8 by the email client.
For example, in Outlook, a source email may contain =C2=A9 which converts to the copyright symbol.
In ruby, is there a way that I can find these types of characters/patterns and convert them to HTML so that it's displayed in an HTML form? For example, taking things like =C2=A9 and converting it into its associated HTML format ©?
There are two things to consider. First, the original string format using = is called "quoted-printable". Force UTF-8 encoding. Then, use htmlentities to convert to HTML entities. Here is an example:
require 'htmlentities'
coder = HTMLEntities.new
string = '=C2=A9'.unpack("M").first.force_encoding('UTF-8')
coder.encode(string) # => "©"
coder.encode(string, :named) # => "©"
I hope you find that helpful.

In ruby, how do I turn a text representation of a byte in to a byte?

What is the best way to turn the string "FA" into /xFA/ ?
To be clear, I don't want to turn "FA" into 7065 or "FA".to_i(16).
In Java the equivalent would be this:
byte b = (byte) Integer.decode("0xFA");
So you're using / markers, but you aren't actually asking about regexps, right?
I think this does what you want:
['FA'].pack('H*')
# => "\xFA"
There is no actual byte type in ruby stdlib (I don't think? unless there's one I don't know about?), just Strings, that can be any number of bytes long (in this case, one). A single "byte" is typically represented as a 1-byte long String in ruby. #bytesize on a String will always return the length in bytes.
"\xFA".bytesize
# => 1
Your example happens not to be a valid UTF-8 character, by itself. Depending on exactly what you're doing and how you're environment is set up, your string might end up being tagged with a UTF-8 encoding by default. If you are dealing with binary data, and want to make sure the string is tagged as such, you might want to #force_encoding on it to be sure. It should NOT be neccesary when using #pack, the results should be tagged as ASCII-8BIT already (which has a synonym of BINARY, it's basically the "null encoding" used in ruby for binary data).
['FA'].pack('H*').encoding
=> #<Encoding:ASCII-8BIT
But if you're dealing with string objects holding what's meant to be binary data, not neccesarily valid character data in any encoding, it is useful to know you may sometimes need to do str.force_encoding("ASCII-8BIT") (or force_encoding("BINARY"), same thing), to make sure your string isn't tagged as a particular text encoding, which would make ruby complain when you try to do certain operations on it if it includes invalid bytes for that encoding -- or in other cases, possibly do the wrong thing
Actually for a regexp
Okay, you actually do want a regexp. So we have to take our string we created, and embed it in a regexp. Here's one way:
representation = "FA"
str = [representation].pack("H*")
# => "\xFA"
data = "\x01\xFA\xC2".force_encoding("BINARY")
regexp = Regexp.new(str)
data =~ regexp
# => 1 (matched on byte 1; the first byte of data is byte 0)
You see how I needed the force_encoding there on the data string, otherwise ruby would default to it being a UTF-8 string (depending on ruby version and environment setup), and complain that those bytes aren't valid UTF-8.
In some cases you might need to explicitly set the regexp to handle binary data too, the docs say you can pass a second argument 'n' to Regexp.new to do that, but I've never done it.

Ruby URI.extract returns empty array or ArgumentError: invalid byte sequence in UTF-8

I'm trying to get a list of files from url like this:
require 'uri'
require 'open-uri'
url = 'http://www.wmprof.com/media/niti/download'
html = open(url).read
puts URI.extract(html).select{ |link| link[/(PL)/]}
This code returns ArgumentError: invalid byte sequence in UTF-8 in line with URI.extract (even though html.encoding returns utf-8)
I've found some solutions to encoding problems, but when I'm changing the code to
html.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
URI.extract returns empty string, even when I'm not calling the select method on it. Any suggestions?
The character encoding of the website might be ISO-8859-1 or a related one. We can't tell for sure since there are only two occurrences of the same non-US-ASCII-character and it doesn't really matter anyway.
html.each_char.reject(&:ascii_only?) # => ["\xDC", "\xDC"]
Finding the actual encoding is done by guessing. The age of HTML 3.2 or the used language/s might be a clue. And in this case especially the content of the PDF file is helpful (it contains SPRÜH-EX and the file has the name TI_DE_SPR%dcH_EX.pdf). Then we only need to find the encoding for which "\xDC" and "Ü" are equal. Either by knowing it or writing some Ruby:
Encoding.list.select { |e| "Ü" == "\xDC".encode!(Encoding::UTF_8, e) rescue next }.map(&:name)
Of course, letting a program do the guessing is an option too. There is the libguess library. The web browser can do it too. However you you need to download the file though unless the server might tell the browser it's UTF-8 even if it isn't (like in this case). Any decent text editor will also try to detect the file encoding: e.g. ST3 thinks it's Windows 1252 which is a superset of ISO-8859-1 (like UTF-8 is of US-ASCII).
Possible solutions are manually setting the string encoding to ISO-8859-1:
html.force_encoding(Encoding::ISO_8859_1)
Or (preferably) transcoding the string from ISO-8859-1 to UTF-8:
html.encode!(Encoding::UTF_8, Encoding::ISO_8859_1)
To answer the other question: URI.extract isn't the method you're looking for. Apparently it's obsolete and more importantly, it doesn't extract relative URI.
A simple alternative is using a regular expression with String#scan. It works with this site but it might not with other ones. You have to use a HTML parser for the best reliability (there might be also a gem). Here's an example that should do what you want:
html.scan(/href="(.*?PL.*?)"/).flatten # => ["SI_PL_ACTIV_bicompact.pdf", ...]

UTF-8 Encoding Character set

I'm working on an e-mail app for fun and practice in Ruby and one of the mails has this subject:
=?UTF-8?B?4p22IEFuZHJvaWQgc3RpY2sgbWsgODA5aXYgKyB1c2IyZXRoZXJuZXQgYWRh?=\r\n
=?UTF-8?B?cHRlciAtNDYlIOKdtyBKb3NlcGggSm9zZXBoIGtldWtlbmNhcnJvdXNlbCAt?=\r\n
=?UTF-8?B?NTUlIOKduCA0IENlcnJ1dGkgYm94ZXJzaG9ydHMgLTcxJSDinbkgQXJub3Zh?=\r\n
=?UTF-8?B?IDkwIEc0IHRhYmxldCAtNDIl?=
I found out I look at a Base64 string and the parts between =?UTF-8?B? and ?= need to be decoded from Base64 to UTF-8.
Can someone explain how I need to decode a string like this in Ruby?
Try the Base64 module of ruby-1.9 stdlib, see example:
require "base64"
enc = Base64.encode64('Send reinforcements')
# -> "U2VuZCByZWluZm9yY2VtZW50cw==\n"
plain = Base64.decode64(enc)
# -> "Send reinforcements"
Since the =?UTF-8?B? is set the proper codepage or encoding, in which the original string was coded, it is required to be present in email messages. I believe strings without the defined codepage are defaulted to utf-8

Encoding Unicode code points with Ruby

I'm retrieving an HTML document that is parsed with Nokogiri. The HTML is using charset ISO-8859-1. The problem is there are some Unicode chars in the document which are converted to Unicode code points instead of their respective character.
For example, this is some text in the HTML as received (in ISO-8859-1):
\x95\x95 JOHNNY VENETTI \x95\x95
And when attempting to work with this text, it gets converted to this:
\u0095\u0095 JOHNNY VENETTI \u0095\u0095
So my question is, how can I ensure those characters are represented as their appropriate character instead of the code point? I've tried doing a gsub on the text, but that seems wrong for this. Also, I do not have control over the encoding of the HTML document.
First you should realize that this string is NOT ISO-8859-1 encoded (file says "Non-ISO extended-ASCII text" and the codepage verifies this). May well be this is your problem, in that case you should specify the right encoding (probably something like Windows-1252, in this case) in your HTML document.
In Nokogiri, you can also set the encoding explicitly in cases where the document specifies the wrong encoding:
Nokogiri.HTML("<p>\x95\x95 JOHNNY VENETTI \x95\x95</p>", nil, "Windows-1252")
# => #<Nokogiri::HTML::Document: ...
# children=[#<Nokogiri::XML::Text:0x15744cc "•• JOHNNY VENETTI ••">]>]>]>]>
If you don't have the option to solve this cleanly like above, you can also do it the hard way and associated the string with its correct encoding:
s = "\x95\x95 JOHNNY VENETTI \x95\x95"
s.encoding # => #<Encoding:ASCII-8BIT>
s.force_encoding 'Windows-1252'
s.encode! 'utf-8'
s # => "•• JOHNNY VENETTI ••"
Note that this last piece of code is Ruby 1.9 only. If you want, you can read more about the new encoding system in Ruby 1.9.

Resources