How to send UTF-8 encoded strings via TCPSocket in Ruby

How can I send UTF-8 encoded strings via TCPSocket in Ruby? When I use the following code on the client
msg = $stdin.gets.chomp
#server.puts(msg.encode('utf-8'))
it gives me the "ASCII-8BIT" encoding on the server:
msg = client.gets.chomp
puts msg.encoding
Output
ASCII-8BIT
Why? What am I doing wrong?

The data sent over the connection is just the raw bytes that make up the string; the encoding that the client associates with them is not transmitted. The server therefore has no way to determine what the encoding should be and defaults to ASCII-8BIT, which effectively means "unknown".
If you know that the data will always be UTF-8 you can use set_encoding on the socket to always mark the received data as the correct encoding:
client.set_encoding('UTF-8')
msg = client.gets.chomp
If the data may arrive in a different encoding from each client, you will need to work out some protocol where the client tells the server what that encoding is before sending the actual data. The server can then use set_encoding as above, or use force_encoding on the resulting string.
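As a minimal sketch of the set_encoding approach, the following script runs both ends of a connection in one process. The loopback address, the ephemeral port, the sample text and the variable names are assumptions made for illustration; only puts, gets and set_encoding come from the answer.

```ruby
require 'socket'

# Listen on an ephemeral loopback port (an assumption for this sketch).
server = TCPServer.new('127.0.0.1', 0)
port = server.addr[1]

# Client side: puts writes the raw bytes of the string plus a newline.
sender = Thread.new do
  sock = TCPSocket.new('127.0.0.1', port)
  sock.puts("h\u00E9llo w\u00F6rld")
  sock.close
end

# Server side: tag everything read from this socket as UTF-8.
client = server.accept
client.set_encoding('UTF-8')
msg = client.gets.chomp

sender.join
client.close
server.close

puts msg.encoding
puts msg.valid_encoding?
```

If the encoding only becomes known after the bytes have been read, msg.force_encoding(name) retags the same bytes without changing them.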

Related

Ruby: How to decode strings which are partially encoded or fully encoded?

I am getting encoded strings while parsing text files. I have no idea how to decode them to English or their original language.
"info#cloudag.com"
is the encoded string, and I need to decode it.
I want to decode it using Ruby.
Here is a link for your reference and I am expecting the same.
This looks like HTML encoding, not URL encoding.
require 'cgi'
CGI.unescapeHTML("info#cloudag.com")
#=> "info#cloudag.com"
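For illustration, here is a minimal sketch with made-up inputs (the addresses below are assumptions, not the asker's original string) showing that CGI.unescapeHTML resolves both numeric character references and named entities:

```ruby
require 'cgi'

# Hypothetical input: "info@" written as decimal character references.
encoded = '&#105;&#110;&#102;&#111;&#64;example.com'
decoded = CGI.unescapeHTML(encoded)
puts decoded

# Named entities are resolved the same way.
named = CGI.unescapeHTML('Tom &amp; Jerry &lt;tj@example.com&gt;')
puts named
```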

Sinch sms api and chars with accent mark

When I try to send an SMS with the "ó" char I get a blank char instead.
I have read in the doc that:
the default alphabet is the GSM 7-bit, but characters in languages such
as Arabic, Chinese, Korean, Japanese, or Cyrillic alphabet languages
(e.g., Ukrainian, Serbian, Bulgarian, etc.) must be encoded using the
16-bit UCS–2 character encoding.
But if I encode the message with UTF-16 (I have read UCS-2 is UTF-16) I get a 40001 error. So, is it possible to send special chars with Sinch?
GSM-7 and UCS-2 are encodings used by the Sinch backend to send the message over SMPP. Currently Latin1 (ISO-8859-1) is also used, and this is probably why you're getting the missing character: some SMS providers do not support it and therefore decode the message using a different decoder. Sinch is removing Latin1 support (Latin1 results in a shorter encoded short message than UCS-2) and will use UCS-2 instead for messages that cannot be encoded with GSM-7 or ASCII.
I'm interested in the 40001 that you're getting. If you're setting the charset to utf-16 on the HTTP request, you should not do that. If you're doing something else, please post your code (without appKey and secret) so I can see more clearly how you generate that error.

How to decode a string in Ruby

I am working with the Mandrill Inbound Email API, and when an email has an attachment with one or more spaces in its file name, then the file name is encoded in a format that I do not know how to decode.
Here is an example string I receive for the file name: =?UTF-8?B?TWlzc2lvbmFyecKgRmFpdGjCoFByb21pc2XCoGFuZMKgQ2FzaMKgUmVjZWlwdHPCoFlURMKgMjUzNQ==?= =?UTF-8?B?OTnCoEp1bHktMjAxNS5jc3Y=?=
I tried Base64.decode64(#{encoded_value}), but that didn't return readable text.
How do I decode that value into a readable string?
This is MIME encoded-word syntax, as defined in RFC 2047. From Wikipedia:
The form is: "=?charset?encoding?encoded text?=".
charset may be any character set registered with IANA. Typically it would be the same charset as the message body.
encoding can be either "Q" denoting Q-encoding that is similar to the quoted-printable encoding, or "B" denoting base64 encoding.
encoded text is the Q-encoded or base64-encoded text.
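To make the three fields concrete, here is a minimal sketch that takes a single encoded-word apart by hand. The helper name decode_encoded_word and the second sample string are assumptions for illustration; it covers only the simple case of one well-formed word whose charset name Ruby recognizes.

```ruby
# Hypothetical helper: decodes a single encoded-word to a UTF-8 string.
def decode_encoded_word(word)
  # =?charset?encoding?encoded text?=
  match = /=\?([^?]+)\?([BbQq])\?([^?]*)\?=/.match(word)
  return word unless match

  charset, encoding, text = match[1], match[2].upcase, match[3]
  bytes =
    if encoding == 'B'
      text.unpack1('m')               # Base64 (what Base64.decode64 does)
    else
      text.tr('_', ' ').unpack1('M')  # Q: "_" is a space, "=XX" is a hex byte
    end
  # Tag the raw bytes with the declared charset, then convert to UTF-8.
  bytes.force_encoding(charset).encode('UTF-8')
end

puts decode_encoded_word('=?UTF-8?B?OTnCoEp1bHktMjAxNS5jc3Y=?=')
puts decode_encoded_word('=?ISO-8859-1?Q?caf=E9_au_lait?=')
```

Note that the first sample decodes to text containing a non-breaking space (U+00A0) rather than an ordinary space, which is why the file name in the question looks as if it contains spaces.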
Fortunately you don't need to write a decoder for this. The Mail gem comes with a Mail::Encodings.value_decode method that works perfectly and is very well-tested:
subject = "=?UTF-8?B?TWlzc2lvbmFyecKgRmFpdGjCoFByb21pc2XCoGFuZMKgQ2FzaMKgUmVjZWlwdHPCoFlURMKgMjUzNQ==?= =?UTF-8?B?OTnCoEp1bHktMjAxNS5jc3Y=?="
Mail::Encodings.value_decode(subject)
# => "Missionary Faith Promise and Cash Receipts YTD 253599 July-2015.csv"
It gracefully handles lots of edge cases you probably wouldn't think of (until your app tries to handle them and falls over):
subject = "Re:[=?iso-2022-jp?B?GyRCJTAlayE8JV0lcyEmJTglYyVRJXMzdDwwMnEbKEI=?=\n =?iso-2022-jp?B?GyRCPFIbKEI=?=] =?iso-2022-jp?B?GyRCSlY/LiEnGyhC?=\n =?iso-2022-jp?B?GyRCIVolMCVrITwlXSVzIVskKkxkJCQ5ZyRvJDsbKEI=?=\n =?iso-2022-jp?B?GyRCJE43byRLJEQkJCRGIUolaiUvJSglOSVIGyhC?=#1056273\n =?iso-2022-jp?B?GyRCIUsbKEI=?="
Mail::Encodings.value_decode(subject)
# => "Re:[グルーポン・ジャパン株式会社] 返信:【グルーポン】お問い合わせの件について(リクエスト#1056273\n )"
If you're using Rails you already have the Mail gem. Otherwise just add gem "mail" to your Gemfile, then bundle install and, in your script, require "mail".
Thanks to the comment from @Yevgeniy-Anfilofyev, who pointed me in the right direction, I was able to write the following method that correctly parsed the encoded value and returned an ASCII string.
require 'base64'
def self.decode(value)
  # It turns out the value is made up of multiple encoded parts,
  # so we first need to split them so we can decode each one separately
  encoded_parts = value.split('=?UTF-8?B?').
    map { |x| x.sub(/\?.*$/, '') }.
    reject { |x| x.strip.empty? } # plain-Ruby equivalent of Rails' blank?
  encoded_parts.map { |x| Base64.decode64(x) }. # decode each part
    join(''). # join the parts together
    force_encoding('utf-8'). # force UTF-8 encoding
    gsub("\xC2\xA0", " ") # replace the UTF-8 encoded non-breaking spaces with ASCII spaces
end

UTF-8 Encoding Character set

I'm working on an e-mail app for fun and practice in Ruby and one of the mails has this subject:
=?UTF-8?B?4p22IEFuZHJvaWQgc3RpY2sgbWsgODA5aXYgKyB1c2IyZXRoZXJuZXQgYWRh?=\r\n
=?UTF-8?B?cHRlciAtNDYlIOKdtyBKb3NlcGggSm9zZXBoIGtldWtlbmNhcnJvdXNlbCAt?=\r\n
=?UTF-8?B?NTUlIOKduCA0IENlcnJ1dGkgYm94ZXJzaG9ydHMgLTcxJSDinbkgQXJub3Zh?=\r\n
=?UTF-8?B?IDkwIEc0IHRhYmxldCAtNDIl?=
I found out that I am looking at a Base64 string and that the parts between =?UTF-8?B? and ?= need to be decoded from Base64 to UTF-8.
Can someone explain how I need to decode a string like this in Ruby?
Try the Base64 module from the Ruby 1.9 stdlib; see this example:
require "base64"
enc = Base64.encode64('Send reinforcements')
# -> "U2VuZCByZWluZm9yY2VtZW50cw==\n"
plain = Base64.decode64(enc)
# -> "Send reinforcements"
The =?UTF-8?B? prefix declares the character set (UTF-8) and the transfer encoding (Base64) in which the original string was encoded, so it has to be present in email messages. I believe strings without a declared charset are treated as UTF-8 by default.
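Applied to the subject from the question, a minimal sketch could look like this. It assumes every part is a UTF-8 "B" word, as in that header; the variable names are illustrative.

```ruby
require 'base64'

# The folded subject header from the question, rebuilt line by line.
subject = [
  '=?UTF-8?B?4p22IEFuZHJvaWQgc3RpY2sgbWsgODA5aXYgKyB1c2IyZXRoZXJuZXQgYWRh?=',
  '=?UTF-8?B?cHRlciAtNDYlIOKdtyBKb3NlcGggSm9zZXBoIGtldWtlbmNhcnJvdXNlbCAt?=',
  '=?UTF-8?B?NTUlIOKduCA0IENlcnJ1dGkgYm94ZXJzaG9ydHMgLTcxJSDinbkgQXJub3Zh?=',
  '=?UTF-8?B?IDkwIEc0IHRhYmxldCAtNDIl?='
].join("\r\n ")

# Collect the Base64 payload of every =?UTF-8?B?...?= part.
payloads = subject.scan(/=\?UTF-8\?B\?([^?]*)\?=/i).flatten

# Decode each part to raw bytes, join them, then tag the result as UTF-8.
decoded = payloads.map { |part| Base64.decode64(part) }
                  .join
                  .force_encoding('UTF-8')

puts decoded
```

Joining the raw bytes before calling force_encoding keeps the result valid even if a sender splits a multi-byte character across two parts.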

How to decode a subject fetched via Net::IMAP which is in UTF-8? (ruby)

I'm using Net::IMAP.fetch to fetch some messages from Gmail. However, when I fetch a message which has a UTF-8 subject (e.g., in Cyrillic) I get something like this:
=?UTF-8?B?0KHRgNC/0YHQutC4INGE0L7RgNGD0Lwg0YLRgNCw?= =?UTF-8?B?0LbQuCDQuNC30LHQvtGA0L3QuCDQvNCw0YLQtdGA0Lg=?= =?UTF-8?B?0ZjQsNC7INC4INC90LAg0ZvQuNGA0LjQu9C40YY=?= =?UTF-8?B?0LggLSBjaXJpbGFjZSB0ZXN0?=
How can I convert the above string into UTF-8?
NOTE: this is for ruby 1.8.7
The answer is:
Mail::Encodings.unquote_and_convert_to( string, 'utf-8' )
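The point is that email subjects containing non-ASCII characters are sent as MIME encoded-words. The subject above uses the Base64 ("B") form; the alternative "Q" form is similar to quoted-printable.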
