Unescaping characters in a string with Ruby - ruby

Given a string in the following format (the Posterous API returns posts in this format):
s="\\u003Cp\\u003E"
How can I convert it to the actual ascii characters such that s="<p>"?
On OSX, I successfully used Iconv.iconv('ascii', 'java', s) but once deployed to Heroku, I receive an Iconv::IllegalSequence exception. I'm guessing that the system Heroku deploys to doesn't support the java encoder.
I am using HTTParty to make a request to the Posterous API. If I use curl to make the same request then I do not get the double slashes.
From the HTTParty GitHub page:
Automatic parsing of JSON and XML into
ruby hashes based on response
content-type
The Posterous API returns JSON (no double slashes) and HTTParty's JSON parsing is inserting the double slash.
Here is a simple example of the way I am using HTTParty to make the request.
class Posterous
  include HTTParty
  base_uri "http://www.posterous.com/api/2"
  basic_auth "username", "password"
  format :json
  def get_posts
    # the query string starts with "?", not "&"
    response = Posterous.get("/users/me/sites/9876/posts?api_token=1234")
    # snip, see below...
  end
end
With the obvious information (username, password, site_id, api_token) replaced with valid values.
At the point of snip, response.body contains a Ruby string that is in JSON format and response.parsed_response contains a Ruby hash object which HTTParty created by parsing the JSON response from the Posterous API.
In both cases the unicode sequences such as \u003C have been changed to \\u003C.

I've found a solution to this problem. I ran across this gist. elskwid had the identical problem and ran the string through a JSON parser:
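s = ::JSON.parse('["\\u003Cp\\u003E"]').first
The fragment is wrapped in a JSON array because a bare \u003Cp\u003E is not a valid JSON document on its own.
Now, s = "<p>".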
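To illustrate why this works, here is a minimal sketch that assumes only Ruby's bundled json library. The payload is made up for the example and is not an actual Posterous response. Any conforming JSON parser decodes \uXXXX escapes inside string values, whether the hex digits are upper or lower case:

```ruby
require 'json'

# Hypothetical payload. Single quotes keep the backslashes literal, so the
# Ruby string contains \u003C exactly as it would arrive over HTTP.
raw = '{"body":"\u003Cp\u003EHello\u003C/p\u003E"}'

post = JSON.parse(raw)
puts post["body"]   # prints <p>Hello</p>
```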

I ran into this exact problem the other day. There is a bug in the JSON parser that HTTParty uses (the Crack gem): it uses a case-sensitive regexp for the Unicode sequences, so because Posterous puts out A-F instead of a-f, Crack isn't unescaping them. I submitted a pull request to fix this.
In the meantime HTTParty nicely lets you specify alternate parsers so you can do ::JSON.parse bypassing Crack entirely like this:
class JsonParser < HTTParty::Parser
  def json
    ::JSON.parse(body)
  end
end
class Posterous
  include HTTParty
  parser ::JsonParser
  #....
end
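The effect of a case-sensitive pattern, as described at the top of this answer, can be reproduced in isolation. This is only a sketch: the two regexps and the unescape helper are hypothetical stand-ins, not Crack's actual source.

```ruby
# Hypothetical patterns for illustration only; not taken from Crack.
LOWER_ONLY = /\\u([0-9a-f]{4})/    # matches a-f but not A-F
ANY_CASE   = /\\u([0-9a-f]{4})/i   # same pattern, case-insensitive

# Replace every \uXXXX sequence matched by the pattern with its character.
def unescape(str, pattern)
  str.gsub(pattern) { [$1.hex].pack("U") }
end

upper = '\u003Cp\u003E'   # upper-case hex digits, as Posterous sends them
lower = '\u003cp\u003e'   # the same text with lower-case hex digits

puts unescape(lower, LOWER_ONLY)  # prints <p>
puts unescape(upper, LOWER_ONLY)  # prints \u003Cp\u003E (left untouched)
puts unescape(upper, ANY_CASE)    # prints <p>
```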

You can also use pack (matching exactly four hex digits, so that a \u followed by anything else is left alone):
"a\\u00e4\\u3042".gsub(/\\u([0-9a-fA-F]{4})/){[$1.hex].pack("U")} # "aäあ"
Or to do the reverse:
"aäあ".gsub(/[^ -~\n]/){"\\u%04x"%$&.ord} # "a\\u00e4\\u3042"

The doubled-backslashes almost look like a regular string being viewed in a debugger.
The string "\u003Cp\u003E" really is "<p>", only the \u003C is unicode for < and \u003E is >.
>> "\u003Cp\u003E" #=> "<p>"
If you are truly getting the string with doubled backslashes then you could try stripping one of the pair.
As a test, see how long the string is:
>> "\\u003Cp\\u003E".size #=> 13
>> "\u003Cp\u003E".size #=> 3
>> "<p>".size #=> 3
All the above was done using Ruby 1.9.2, which is Unicode aware. v1.8.7 wasn't. Here's what I get using 1.8.7's IRB for comparison:
>> "\u003Cp\u003E" #=> "u003Cpu003E"
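A quick way to tell whether the backslashes are really doubled, or only displayed that way, is to compare puts with inspect. This sketch uses the string from the question:

```ruby
# In source code "\\" is Ruby's escape for ONE backslash character.
s = "\\u003Cp\\u003E"

puts s                     # prints \u003Cp\u003E   (the actual contents)
puts s.inspect             # prints "\\u003Cp\\u003E" (debugger-style view)
puts s.size                # prints 13
puts s.chars.count("\\")   # prints 2: only two backslashes are really there
```

If the count matches the number of escape sequences, the doubling is only an artifact of how IRB or a debugger displays the string.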

Related

API problem with ruby on rails : resultat?Message=Authorization+has+been+denied+for+this+request

I have a problem with API connection. I see this message in the URL:
resultat?Message=Authorization+has+been+denied+for+this+request.
My code is the following :
def find_with(siren)
  @request = HTTParty.get('https://api.datainfogreffe.fr/api/v1/Entreprise/notapme/performance/(siren)?millesime=2020?token=KEY')
  @result = JSON.parse(@request.body)
end
I can connect with the command line and I have an API key.
I do not know the API you are trying to use, but the query parameters look weird to me. Usually, query parameters start with a ? and are separated with an &.
Additionally, it looks like you are trying to use string interpolation to add the value of siren to the URL. String interpolation only works in strings delimited by double quotes ("); single-quoted strings (') do not support it. And string interpolation is done with #{...}, not with ordinary parentheses.
Therefore I suggest changing the method to:
def find_with(siren)
  @request = HTTParty.get(
    "https://api.datainfogreffe.fr/api/v1/Entreprise/notapme/performance/#{siren}?millesime=2020&token=KEY"
  )
  @result = JSON.parse(@request.body)
end
It is not clear from the question where siren comes from or what a siren might look like. Please keep in mind that it has to be in a specific format to generate a valid URL. If siren is provided by the user and is not guaranteed to be safe to embed in a URL, then I would suggest using a proper URL builder to ensure that it is URL-encoded correctly.
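As a rough sketch of the points above, using only the standard uri library: the helper name performance_url is hypothetical, and the SIREN, millesime and token values are placeholders. Form-style encoding is used for the path segment here, which is adequate for a numeric SIREN but turns spaces into +.

```ruby
require 'uri'

# Hypothetical helper; values passed in below are placeholders.
def performance_url(siren, millesime:, token:)
  base  = "https://api.datainfogreffe.fr/api/v1/Entreprise/notapme/performance"
  # encode_www_form joins the pairs with "&" and percent-encodes the values
  query = URI.encode_www_form({ millesime: millesime, token: token })
  # double quotes + #{...} interpolate; the segment is escaped first
  "#{base}/#{URI.encode_www_form_component(siren)}?#{query}"
end

puts 'single quotes keep (siren) and #{siren} literal'
puts performance_url("123456789", millesime: 2020, token: "KEY")
# prints https://api.datainfogreffe.fr/api/v1/Entreprise/notapme/performance/123456789?millesime=2020&token=KEY
```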

How to decode a string in Ruby

I am working with the Mandrill Inbound Email API, and when an email has an attachment with one or more spaces in its file name, then the file name is encoded in a format that I do not know how to decode.
Here is an example string I receive for the file name: =?UTF-8?B?TWlzc2lvbmFyecKgRmFpdGjCoFByb21pc2XCoGFuZMKgQ2FzaMKgUmVjZWlwdHPCoFlURMKgMjUzNQ==?= =?UTF-8?B?OTnCoEp1bHktMjAxNS5jc3Y=?=
I tried Base64.decode64(#{encoded_value}) but that didn't return a readable text.
How do I decode that value into a readable string?
This is MIME encoded-word syntax as defined in RFC 2047. From Wikipedia:
The form is: "=?charset?encoding?encoded text?=".
charset may be any character set registered with IANA. Typically it would be the same charset as the message body.
encoding can be either "Q" denoting Q-encoding that is similar to the quoted-printable encoding, or "B" denoting base64 encoding.
encoded text is the Q-encoded or base64-encoded text.
Fortunately you don't need to write a decoder for this. The Mail gem comes with a Mail::Encodings.value_decode method that works perfectly and is very well-tested:
subject = "=?UTF-8?B?TWlzc2lvbmFyecKgRmFpdGjCoFByb21pc2XCoGFuZMKgQ2FzaMKgUmVjZWlwdHPCoFlURMKgMjUzNQ==?= =?UTF-8?B?OTnCoEp1bHktMjAxNS5jc3Y=?="
Mail::Encodings.value_decode(subject)
# => "Missionary Faith Promise and Cash Receipts YTD 253599 July-2015.csv"
It gracefully handles lots of edge cases you probably wouldn't think of (until your app tries to handle them and falls over):
subject = "Re:[=?iso-2022-jp?B?GyRCJTAlayE8JV0lcyEmJTglYyVRJXMzdDwwMnEbKEI=?=\n =?iso-2022-jp?B?GyRCPFIbKEI=?=] =?iso-2022-jp?B?GyRCSlY/LiEnGyhC?=\n =?iso-2022-jp?B?GyRCIVolMCVrITwlXSVzIVskKkxkJCQ5ZyRvJDsbKEI=?=\n =?iso-2022-jp?B?GyRCJE43byRLJEQkJCRGIUolaiUvJSglOSVIGyhC?=#1056273\n =?iso-2022-jp?B?GyRCIUsbKEI=?="
Mail::Encodings.value_decode(subject)
# => "Re:[グルーポン・ジャパン株式会社] 返信:【グルーポン】お問い合わせの件について(リクエスト#1056273\n )"
If you're using Rails you already have the Mail gem. Otherwise just add gem "mail" to your Gemfile, then bundle install and, in your script, require "mail".
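Purely to illustrate the "=?charset?encoding?encoded text?=" form quoted above, here is a minimal standard-library sketch that decodes a single encoded word. The constant and method names are hypothetical, it ignores the edge cases the Mail gem handles (multiple words, folding, unknown charsets), and it is not a substitute for Mail::Encodings.value_decode.

```ruby
# Hypothetical helper for illustration; prefer the Mail gem in real code.
ENCODED_WORD = /=\?([^?]+)\?([BbQq])\?([^?]*)\?=/

def decode_encoded_word(word)
  match = ENCODED_WORD.match(word)
  return word unless match          # not an encoded word: return as-is

  charset, encoding, text = match.captures
  bytes =
    if encoding.upcase == "B"
      text.unpack1("m")               # "B" means base64
    else
      text.tr("_", " ").unpack1("M")  # "Q": "_" stands for a space, =XX is a byte
    end
  # Interpret the raw bytes in the declared charset, then convert to UTF-8.
  bytes.force_encoding(charset).encode("UTF-8")
end

puts decode_encoded_word("=?UTF-8?Q?caf=C3=A9_menu?=")   # prints café menu
puts decode_encoded_word("=?ISO-8859-1?B?Y2Fm6Q==?=")    # prints café
```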
Thanks to the comment from @Yevgeniy-Anfilofyev who pointed me in the right direction, I was able to write the following method that correctly parsed the encoded value and returned an ASCII string.
def self.decode(value)
  # It turns out the value is made up of multiple encoded parts,
  # so we first need to split out each part so we can decode them separately
  encoded_parts = value.split('=?UTF-8?B?').
    map { |x| x.sub(/\?.*$/, '') }.
    delete_if { |x| x.blank? } # blank? comes from ActiveSupport (Rails)
  encoded_parts.map { |x| Base64.decode64(x) }. # decode each part
    join(''). # join the parts together
    force_encoding('utf-8'). # force UTF-8 encoding
    gsub("\xC2\xA0", " ") # replace the UTF-8 non-breaking spaces with ASCII spaces
end

How do I get a UTF-8 string out of an MD5 digest?

I am trying to use an API that requires an MD5 hash to be sent in UTF-8 format.
Problem is, I can't find any way to actually make that happen.
require 'digest/md5'
api_sig = Digest::MD5.digest "api_key=blahblahblah"
puts api_sig
>> Decode error: not UTF-8
So I try force_encoding(Encoding::UTF_8). Same error. inspect, to_s, nothing gives me what I want.
How can I get a UTF-8 string representing an MD5 digest of another string?
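Call Digest::MD5.hexdigest "api_key=blahblahblah" instead. Digest::MD5.digest returns the 16 raw bytes of the hash, which are generally not valid UTF-8; hexdigest returns the same hash as a 32-character hexadecimal string, which is plain ASCII and therefore also valid UTF-8.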
The documentation of this is very poor, but you can find a lackluster explanation here: http://www.ruby-doc.org/stdlib-2.0/libdoc/digest/rdoc/Digest/Class.html#method-c-hexdigest
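A short sketch of the difference, using the placeholder key from the question:

```ruby
require 'digest/md5'

raw = Digest::MD5.digest("api_key=blahblahblah")     # 16 raw bytes (binary)
hex = Digest::MD5.hexdigest("api_key=blahblahblah")  # same hash as hex text

puts raw.bytesize               # prints 16
puts hex.length                 # prints 32
puts raw.unpack1("H*") == hex   # prints true: hex is just the bytes in base 16

api_sig = hex.encode("UTF-8")   # safe: hex digits are plain ASCII
puts api_sig.valid_encoding?    # prints true
```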

Ruby 1.9.3 add unsafe characters to URI.escape

I am using Sinatra and get parameters from the URL using the get '/foo/:bar' {} method. Unfortunately, the value in :bar can contain nasty things like /, which leads to a 404, since no route matches /foo/:bar/baz/. I use URI.escape to escape the URL parameter, but it considers / a valid character. As mentioned here, this is because the default Regexp to check against does not differentiate between unsafe and reserved characters. I would like to change this and did this:
URI.escape("foo_<_>_&_3_#_/_+_%_bar", Regexp.union(URI::REGEXP::UNSAFE, '/'))
just to test it.
URI::REGEXP::UNSAFE is the default regexp to match against according to the Ruby 1.9.3 documentation:
escape(*arg)
Synopsis
URI.escape(str [, unsafe])
Args
str
String to replaces in.
unsafe
Regexp that matches all symbols that must be replaced with
codes. By default uses REGEXP::UNSAFE. When this argument is
a String, it represents a character set.
Description
Escapes the string, replacing all unsafe characters with codes.
Unfortunately, I get this error:
uninitialized constant URI::REGEXP::UNSAFE
And as this GitHub issue suggests, this Regexp was removed from Ruby in 1.9.3. Unfortunately, the URI module's documentation is generally kind of bad, and I really cannot figure this out. Any hints?
Thanks in advance!
URI.escape is not what you are looking for. You want CGI.escape:
require 'cgi'
CGI.escape("foo_<_>_&_3_#_/_+_%_bar")
# => "foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar"
This will properly encode it to allow Sinatra to retrieve it.
Perhaps you would have better luck with CGI.escape?
>> require 'uri'; URI.escape("foo_<_>_&_3_#_/_+_%_bar")
=> "foo_%3C_%3E_&_3_%23_/_+_%25_bar"
>> require 'cgi'; CGI.escape("foo_<_>_&_3_#_/_+_%_bar")
=> "foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar"
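Note that URI.escape was removed entirely in Ruby 3.0, so on current Rubies CGI.escape is the practical choice anyway. A minimal round-trip sketch, with a route path that is only illustrative:

```ruby
require 'cgi'

segment = "foo_<_>_&_3_#_/_+_%_bar"
escaped = CGI.escape(segment)

puts escaped                            # prints foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar
puts "/foo/#{escaped}"                  # path with no raw "/" inside the parameter
puts CGI.unescape(escaped) == segment   # prints true
```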

Coding japanese characters in a google API search string in Ruby

I am trying to perform searches in Japanese using the custom google search api as follows:
require 'httparty'
require 'json'
class Search
include HTTParty
format :json
end
@response = Search.get('https://www.googleapis.com/customsearch/v1?key=etcetc&q=JAPANESE SEARCH TERM')
When Japanese text is used it fails complaining of "invalid multibyte char (US-ASCII)"
How can I input Japanese text in a format which Ruby allows and google custom api also accepts?
Thanks for any advice.
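Add
# encoding: utf-8
to the top of the file. (Ruby 1.9 assumes source files are US-ASCII unless told otherwise; Ruby 2.0 and later default to UTF-8.)
As a follow-up: the Google API may still not accept these Japanese search terms unescaped. It is simple to escape them with URI.escape before using them in your search: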
require 'uri'
retVal = URI.escape("Japanese term", Regexp.new("[^#{URI::PATTERN::UNRESERVED}]"))
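On Ruby 3.0 and later, where URI.escape no longer exists, the same result can be had with URI.encode_www_form. This is a sketch only: the key is the placeholder from the question and the search term is an arbitrary example.

```ruby
require 'uri'

term  = "日本語"   # example search term; any UTF-8 string works
query = URI.encode_www_form({ key: "etcetc", q: term })
url   = "https://www.googleapis.com/customsearch/v1?#{query}"

puts url
# prints https://www.googleapis.com/customsearch/v1?key=etcetc&q=%E6%97%A5%E6%9C%AC%E8%AA%9E
```

The resulting URL is pure ASCII, so it can be passed straight to Search.get.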
