How to decode a subject fetched via Net::IMAP which is in UTF-8? (ruby)

I'm using Net::IMAP.fetch to fetch some messages from Gmail. However, when I fetch a message which has a UTF-8 subject (e.g., in Cyrillic), I get something like this:
=?UTF-8?B?0KHRgNC/0YHQutC4INGE0L7RgNGD0Lwg0YLRgNCw?= =?UTF-8?B?0LbQuCDQuNC30LHQvtGA0L3QuCDQvNCw0YLQtdGA0Lg=?= =?UTF-8?B?0ZjQsNC7INC4INC90LAg0ZvQuNGA0LjQu9C40YY=?= =?UTF-8?B?0LggLSBjaXJpbGFjZSB0ZXN0?=
How can I convert the above string into UTF8?
NOTE: this is for ruby 1.8.7

The answer is:
Mail::Encodings.unquote_and_convert_to( string, 'utf-8' )
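The point is that email subjects containing non-ASCII text are sent in MIME encoded-word syntax (RFC 2047). In the example above, the ?B? marker means each part is Base64-encoded; a ?Q? marker would instead mean Q-encoding, which is similar to quoted-printable.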
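To see what decoding such a header involves, here is a minimal sketch that uses only the standard library. It assumes Ruby 1.9 or later (it relies on String#force_encoding and String#encode, which 1.8.7 does not have), and decode_encoded_words is a hypothetical helper name, not part of Net::IMAP or the Mail gem. It decodes each encoded-word on its own, so it will raise if the charset is unknown to Ruby or if a non-UTF-8 multibyte character is split across two words.

```ruby
require "base64"

# Hypothetical helper: decodes RFC 2047 encoded-words inside a header value.
def decode_encoded_words(header)
  # One encoded-word looks like =?charset?encoding?text?=
  encoded_word = /=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=/

  # Whitespace between two adjacent encoded-words is not part of the text.
  joined = header.gsub(/(\?=)\s+(?==\?)/, '\1')

  joined.gsub(encoded_word) do
    charset, kind, text = $1, $2, $3
    bytes =
      if kind.upcase == "B"
        Base64.decode64(text)
      else
        # Q-encoding: "_" stands for a space, "=XX" for a hex-encoded byte.
        text.tr("_", " ").unpack("M").first
      end
    # Label the raw bytes with the declared charset, then transcode to UTF-8.
    bytes.force_encoding(charset).encode("UTF-8")
  end
end

puts decode_encoded_words("=?UTF-8?B?0LggLSBjaXJpbGFjZSB0ZXN0?=")
```

The last line feeds in the final part of the subject from the question. Text outside encoded-words is passed through unchanged.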

Related

How to decode a string in Ruby

I am working with the Mandrill Inbound Email API, and when an email has an attachment with one or more spaces in its file name, then the file name is encoded in a format that I do not know how to decode.
Here is an example string I receive for the file name: =?UTF-8?B?TWlzc2lvbmFyecKgRmFpdGjCoFByb21pc2XCoGFuZMKgQ2FzaMKgUmVjZWlwdHPCoFlURMKgMjUzNQ==?= =?UTF-8?B?OTnCoEp1bHktMjAxNS5jc3Y=?=
I tried Base64.decode64(#{encoded_value}) but that didn't return readable text.
How do I decode that value into a readable string?
This is MIME encoded-word syntax as defined in RFC 2047. From Wikipedia:
The form is: "=?charset?encoding?encoded text?=".
charset may be any character set registered with IANA. Typically it would be the same charset as the message body.
encoding can be either "Q" denoting Q-encoding that is similar to the quoted-printable encoding, or "B" denoting base64 encoding.
encoded text is the Q-encoded or base64-encoded text.
Fortunately you don't need to write a decoder for this. The Mail gem comes with a Mail::Encodings.value_decode method that works perfectly and is very well-tested:
subject = "=?UTF-8?B?TWlzc2lvbmFyecKgRmFpdGjCoFByb21pc2XCoGFuZMKgQ2FzaMKgUmVjZWlwdHPCoFlURMKgMjUzNQ==?= =?UTF-8?B?OTnCoEp1bHktMjAxNS5jc3Y=?="
Mail::Encodings.value_decode(subject)
# => "Missionary Faith Promise and Cash Receipts YTD 253599 July-2015.csv"
It gracefully handles lots of edge cases you probably wouldn't think of (until your app tries to handle them and falls over):
subject = "Re:[=?iso-2022-jp?B?GyRCJTAlayE8JV0lcyEmJTglYyVRJXMzdDwwMnEbKEI=?=\n =?iso-2022-jp?B?GyRCPFIbKEI=?=] =?iso-2022-jp?B?GyRCSlY/LiEnGyhC?=\n =?iso-2022-jp?B?GyRCIVolMCVrITwlXSVzIVskKkxkJCQ5ZyRvJDsbKEI=?=\n =?iso-2022-jp?B?GyRCJE43byRLJEQkJCRGIUolaiUvJSglOSVIGyhC?=#1056273\n =?iso-2022-jp?B?GyRCIUsbKEI=?="
Mail::Encodings.value_decode(subject)
# => "Re:[グルーポン・ジャパン株式会社] 返信:【グルーポン】お問い合わせの件について(リクエスト#1056273\n )"
If you're using Rails you already have the Mail gem. Otherwise just add gem "mail" to your Gemfile, then bundle install and, in your script, require "mail".
Thanks to the comment from @Yevgeniy-Anfilofyev, who pointed me in the right direction, I was able to write the following method, which correctly parses the encoded value and returns a readable string.
require "base64"

def self.decode(value)
  # It turns out the value is made up of multiple encoded parts,
  # so we first need to split out each part so we can decode them separately
  encoded_parts = value.split('=?UTF-8?B?').
    map { |x| x.sub(/\?.*$/, '') }.     # drop the trailing "?=" and anything after it
    delete_if { |x| x.strip.empty? }    # plain Ruby stand-in for ActiveSupport's blank?
  encoded_parts.map { |x| Base64.decode64(x) }. # decode each part
    join('').                 # join the parts together
    force_encoding('utf-8').  # mark the bytes as UTF-8
    gsub("\xC2\xA0", " ")     # replace UTF-8 non-breaking spaces with ASCII spaces
end

How to send UTF-8 encoded strings via TCPSocket in Ruby

How do I send UTF-8 encoded strings via TCPSocket in Ruby? When I try to use the following code
msg = $stdin.gets.chomp
@server.puts(msg.encode('utf-8'))
it gives me the "ASCII-8BIT" encoding on the server:
msg = client.gets.chomp
puts msg.encoding
Output
ASCII-8BIT
Why? What am I doing wrong?
The data sent over the connection is just the raw bytes that make up the string, not the encoding that the client associates with them. The server therefore has no way to determine what the encoding should be and defaults to ASCII-8BIT which effectively means unknown.
If you know that the data will always be UTF-8 you can use set_encoding on the socket to always mark the received data as the correct encoding:
client.set_encoding('UTF-8')
msg = client.gets.chomp
If it is possible that the data is in a different encoding from each client you will need to work out some protocol where the client tells the server what that encoding is before sending the actual data. The server can then use set_encoding as above, or use force_encoding on the resulting string.
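As a minimal sketch of such a protocol, the example below assumes a made-up convention in which the first line names the encoding and the second line carries the message. The helper name read_tagged_message is hypothetical, and a StringIO over binary bytes stands in for the socket so the sketch runs without a network connection.

```ruby
require "stringio"

# Hypothetical protocol: line 1 names the encoding, line 2 is the message.
def read_tagged_message(io)
  encoding = Encoding.find(io.gets.chomp)   # raises ArgumentError if the name is unknown
  io.gets.chomp.force_encoding(encoding)    # relabel the raw bytes, no transcoding
end

# A StringIO over binary (ASCII-8BIT) bytes stands in for the socket here.
wire = StringIO.new("UTF-8\nпривет\n".b)
msg = read_tagged_message(wire)

puts msg.encoding
puts msg.valid_encoding?
```

force_encoding only changes the label on the bytes, so it is worth checking valid_encoding? before trusting what a client claims to have sent.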

UTF-8 Encoding Character set

I'm working on an e-mail app for fun and practice in Ruby and one of the mails has this subject:
=?UTF-8?B?4p22IEFuZHJvaWQgc3RpY2sgbWsgODA5aXYgKyB1c2IyZXRoZXJuZXQgYWRh?=\r\n
=?UTF-8?B?cHRlciAtNDYlIOKdtyBKb3NlcGggSm9zZXBoIGtldWtlbmNhcnJvdXNlbCAt?=\r\n
=?UTF-8?B?NTUlIOKduCA0IENlcnJ1dGkgYm94ZXJzaG9ydHMgLTcxJSDinbkgQXJub3Zh?=\r\n
=?UTF-8?B?IDkwIEc0IHRhYmxldCAtNDIl?=
I found out that I am looking at Base64 strings, and that the parts between =?UTF-8?B? and ?= need to be decoded from Base64 to UTF-8.
Can someone explain how I need to decode a string like this in Ruby?
Try the Base64 module of ruby-1.9 stdlib, see example:
require "base64"
enc = Base64.encode64('Send reinforcements')
# -> "U2VuZCByZWluZm9yY2VtZW50cw==\n"
plain = Base64.decode64(enc)
# -> "Send reinforcements"
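The =?UTF-8?B? prefix declares the character set (UTF-8) and the encoding (B, i.e. Base64) in which the original string was encoded, so it has to be present in the email message. I believe many clients treat strings without a declared charset as UTF-8, although strictly speaking unencoded header text is expected to be plain ASCII.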
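Applied to a subject like the one in the question, this could look roughly as follows. It is a sketch under two assumptions: Ruby 1.9 or later, and a header that consists only of UTF-8 Base64 encoded-words (any text outside them is dropped). The method name decode_utf8_base64_subject is made up for this example.

```ruby
require "base64"

# Hypothetical helper: handles only =?UTF-8?B?...?= parts, ignores everything else.
def decode_utf8_base64_subject(raw)
  # Collect the Base64 payload of every encoded-word, across folded lines.
  payloads = raw.scan(/=\?UTF-8\?B\?([^?]*)\?=/i).flatten
  # Decode to raw bytes, join them, then label the result as UTF-8.
  payloads.map { |p| Base64.decode64(p) }.join.force_encoding("UTF-8")
end

# The last part of the subject from the question.
puts decode_utf8_base64_subject("=?UTF-8?B?IDkwIEc0IHRhYmxldCAtNDIl?=")
```

Joining the raw bytes before calling force_encoding means a multibyte character that happens to be split across two parts still comes out intact.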

Failed to compare UTF-8 chrs in Ruby

I'm using Ruby - Cucumber for automation.
I'm trying to send Japanese chars as a parameter to the user defined function to verify in db.
Below is the statement I used:
x=$objDB.run_select_query_verifyText('select name from xxxx where id=1','ごせり槎ゃぱ')
The run_select_query_verifyText() function contains the code that connects to the db, fetches the records, and verifies the text passed as a parameter (the Japanese chars).
This function returns true if the string matches the table data in the db, and false otherwise.
But I always get false, and I found that the Japanese string is converted to "??????" when the data is compared.
Note: My program is working fine with English chars.
Your problem is most likely with character encodings. The database returns the content in a different encoding than the Ruby string you are working with. You need to figure out what the db encoding is and make sure both are the same.
If you are using Ruby 1.9, you can check the current encoding with yourstring.encoding and change it to e.g. UTF-8 with yourstring.encode("UTF-8").
If you are on Ruby 1.8, things are a bit more tricky, as the String class doesn't natively support encodings. You can use e.g. the character-encodings gem to work around this.
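For the Ruby 1.9+ case, a small sketch of the effect: the same text in two different encodings does not compare equal until both sides are transcoded to a common encoding. The Shift_JIS value below only simulates what a database driver might hand back; the actual db encoding in the question is unknown, and same_text? is a hypothetical helper.

```ruby
# Hypothetical helper: compare two strings after transcoding both to UTF-8.
def same_text?(a, b)
  a.encode("UTF-8") == b.encode("UTF-8")
end

expected = "ごせり"                       # UTF-8 literal (source file assumed UTF-8)
from_db  = expected.encode("Shift_JIS")  # simulated db value in another encoding

puts from_db.encoding
puts from_db == expected         # false: different bytes, incompatible encodings
puts same_text?(from_db, expected)
```

In a real setup it is usually cleaner to configure the database connection to return UTF-8 than to transcode every value by hand.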

String not valid UTF-8 (BSON::InvalidStringEncoding) when saving a UTF8 compatible string to MongoDB through Mongoid ORM

I am importing data from a MySQL table into MongoDB using Mongoid for my ORM. I am getting an error when trying to save an email address as a string. The error is:
/Library/Ruby/Gems/1.8/gems/bson-1.2.4/lib/../lib/bson/bson_c.rb:24:in `serialize': String not valid UTF-8 (BSON::InvalidStringEncoding)
from /Library/Ruby/Gems/1.8/gems/bson-1.2.4/lib/../lib/bson/bson_c.rb:24:in `serialize'
From my GUI - this is a screenshot of the table info. You can see it's encoded in UTF8.
Also from my GUI - this is a screen shot of the field in my MySQL table that I am importing
This is what happens when I grab the data from MySQL CLI.
And finally, when I inspect the data in my ruby object, I get something that looks like this:
I'm a bit confused here because my table is in UTF-8 regardless, and that funky character is apparently a valid double-byte UTF-8 character. Anyone know why I'm getting this error?
Try using this helper:
http://snippets.dzone.com/posts/show/4527
It puts a method utf8? on the String. So you can grab the String from mysql and see if it is utf8:
my_string.utf8?
If it is not, then you can try changing the encoding of your String using other methods like:
my_string.asciify_utf8
my_string.latin1_to_utf8
my_string.cp1252_to_utf8
my_string.utf16le_to_utf8
Maybe this String is saved on mysql in one of these encodings.
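On Ruby 1.9 or later the same idea can be sketched with built-in String methods instead of the helper above. The to_utf8 name and the ISO-8859-1 fallback are assumptions made for this example; the real source encoding of the MySQL data would have to be confirmed first.

```ruby
# Hypothetical helper. Assumption: bytes that are not valid UTF-8
# are ISO-8859-1 (Latin-1) unless another fallback is given.
def to_utf8(raw, fallback = "ISO-8859-1")
  as_utf8 = raw.dup.force_encoding("UTF-8")
  return as_utf8 if as_utf8.valid_encoding?   # already valid UTF-8, keep as-is
  raw.dup.force_encoding(fallback).encode("UTF-8")
end

latin1_bytes = "caf\xE9".b   # "café" as Latin-1 bytes, not valid UTF-8
puts to_utf8(latin1_bytes)
```

valid_encoding? plays the role of utf8? here, and force_encoding followed by encode corresponds to the latin1_to_utf8 style conversions.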
