Simple alternative to URI.escape - ruby

While using URI.parse on a URL, I was confronted with an error message:
URI::InvalidURIError: URI must be ascii only
I found a StackOverflow question that recommended using URI.escape (also available under the alias URI.encode), which works. Using the URL in that question as an example:
URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> URI::InvalidURIError: URI must be ascii only "http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/\u041E\u0443\u044D\u043D-\u041C\u044D\u0442\u044C\u044E\u0441.jpg"
URI.encode('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> "http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg"
However, URI.escape is obsolete, as Rubocop warns:
URI.escape method is obsolete and should not be used. Instead, use CGI.escape, URI.encode_www_form or URI.encode_www_form_component depending on your specific use case.
But while URI.escape gives us a usable result, the alternatives don’t:
CGI.escape('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> "http%3A%2F%2Fdxczjjuegupb.cloudfront.net%2Fwp-content%2Fuploads%2F2017%2F10%2F%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg"
This is a bother because in my case I’m constructing a URL from data I get via Nokogiri:
my_url = page.at('.someclass').at('img').attr('src')
I only need to escape the last part of the resulting URL, but CGI.escape and similar methods transform the whole string, including necessary characters such as : and /. Getting the escaped result therefore becomes a multi-line ordeal: splitting the path and juggling several variables to achieve what a single function (URI.escape) used to do.
Is there a simple alternative I’m not seeing? It needs to be done without external gems.

I tend to use Addressable for parsing URLs since the standard URI has flaws:
require 'addressable'
Addressable::URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
#<Addressable::URI:0x3fc37ecc1c40 URI:http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg>
Addressable::URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg').path
# "/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg"
It isn't part of the Ruby core or the standard library but it should be and it always ends up in my Gemfiles.
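For the no-external-gems constraint in the question, one possible sketch is to percent-encode only the non-ASCII characters and leave everything else alone. The helper name escape_non_ascii is hypothetical, and the sketch assumes the rest of the URL is already valid; it does not touch spaces or other ASCII characters that URI.parse may also reject:

```ruby
require 'uri'

# Hypothetical helper (not part of the standard library): percent-encodes
# only non-ASCII characters, so ":" and "/" are left untouched.
def escape_non_ascii(url)
  url.gsub(/[^[:ascii:]]/) { |char| URI.encode_www_form_component(char) }
end

url = 'http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg'
escaped = escape_non_ascii(url)
puts escaped
puts URI.parse(escaped).host
```

Because each matched character is encoded on its own, the scheme, host, and path separators never pass through the encoder.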

Related

Is postgres byte array escaping broken in ruby?

Easy example:
Zlib::Inflate.inflate(PG::Connection.unescape_bytea(PG::Connection.escape_bytea(Zlib::Deflate.deflate('["128,491,128,487"]'))))
Zlib::DataError: incorrect data check
This is not an issue with zlib, because the following succeeds:
Zlib::Inflate.inflate(Zlib::Deflate.deflate('["128,491,128,487"]'))
=> "[\"128,491,128,487\"]"
Parsing fails or succeeds depending on the provided string:
Zlib::Inflate.inflate(PG::Connection.unescape_bytea(PG::Connection.escape_bytea(Zlib::Deflate.deflate('["128,491,128,487", "128,491,128,490", "38,465,40,463"]'))))
=> "[\"128,491,128,487\", \"128,491,128,490\", \"38,465,40,463\"]"
Am I doing something wrong, or is postgres bytea field escaping in ruby broken? What can I do as an alternative?
Tried on: Ruby 2.2.3p173, gem pg-0.18.4; Ruby 2.4.1p111, gem pg-0.21.0
Answer from https://bitbucket.org/ged/ruby-pg/issues/262/postgres-bytearray-escaping :
The description of (un)escape_bytea is misleading. The two methods are not exact opposites of each other.
PG::Connection.escape_bytea uses the old and deprecated escaping mechanism, which must be double escaped: first as a BYTEA and second as a string literal (by adding ' before and after the string) for insertion into the SQL string. This is done in one step for convenience and performance reasons.
In contrast, PG::Connection.unescape_bytea does only BYTEA unescaping, because it is meant to decode column data retrieved by a query. This data is not escaped as a string literal.
You now have two options to make the above work:
enco = PG::TextEncoder::Bytea.new
deco = PG::TextDecoder::Bytea.new
Zlib::Inflate.inflate(deco.decode(enco.encode(Zlib::Deflate.deflate('["128,491,128,487"]'))))
This makes use of the type encoders of the pg gem instead of libpq. The encoder uses the newer BYTEA escaping mechanism and is a bit faster than libpq's functions. This is still independent of a server connection.
conn = PG.connect
Zlib::Inflate.inflate(conn.unescape_bytea(conn.escape_bytea(Zlib::Deflate.deflate('["128,491,128,487"]'))))
This makes use of the connection bound escaping functions of libpq. They also use the newer escaping mechanism, so that no double escaping occurs.
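To illustrate why a single-layer encoding round-trips cleanly, here is a pure-Ruby sketch of PostgreSQL's hex bytea text format (\x followed by two hex digits per byte). It assumes the "newer escaping mechanism" mentioned above refers to that hex format. The helper names are hypothetical, and this is not how the pg gem or libpq implement it; it only demonstrates the format and needs no database:

```ruby
require 'zlib'

# Hypothetical helper: encodes raw bytes as hex-format bytea text.
def hex_bytea_encode(bytes)
  '\\x' + bytes.unpack1('H*')
end

# Hypothetical helper: decodes hex-format bytea text back to raw bytes.
def hex_bytea_decode(text)
  raise ArgumentError, 'not hex-format bytea' unless text.start_with?('\\x')

  [text[2..-1]].pack('H*')
end

compressed = Zlib::Deflate.deflate('["128,491,128,487"]')
round_trip = hex_bytea_decode(hex_bytea_encode(compressed))
puts Zlib::Inflate.inflate(round_trip)
```

Encoding and decoding here are exact inverses, which is the property the escape_bytea/unescape_bytea class-method pair lacks.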

Ruby 1.9.3 add unsafe characters to URI.escape

I am using Sinatra and get parameters from the URL using the get '/foo/:bar' {} method. Unfortunately, the value in :bar can contain nasty things like /, which leads to a 404, since no route matches /foo/:bar/baz/. I use URI.escape to escape the URL parameter, but it considers / a valid character. As mentioned here, this is because the default Regexp to check against does not differentiate between unsafe and reserved characters. I would like to change this and did this:
URI.escape("foo_<_>_&_3_#_/_+_%_bar", Regexp.union(URI::REGEXP::UNSAFE, '/'))
just to test it.
URI::REGEXP::UNSAFE is the default regexp to match against according to the Ruby 1.9.3 documentation:
escape(*arg)
Synopsis
URI.escape(str [, unsafe])
Args
str
String to replaces in.
unsafe
Regexp that matches all symbols that must be replaced with
codes. By default uses REGEXP::UNSAFE. When this argument is
a String, it represents a character set.
Description
Escapes the string, replacing all unsafe characters with codes.
Unfortunately, I get this error:
uninitialized constant URI::REGEXP::UNSAFE
And as this GitHub Issue suggests, this Regexp was removed from Ruby in 1.9.3. Unfortunately, the URI module's documentation is generally kind of bad, and I really cannot figure this out. Any hints?
Thanks in advance!
URI.escape is not what you are looking for. You want CGI.escape:
require 'cgi'
CGI.escape("foo_<_>_&_3_#_/_+_%_bar")
# => "foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar"
This will properly encode it to allow Sinatra to retrieve it.
Perhaps you would have better luck with CGI.escape?
>> require 'uri'; URI.escape("foo_<_>_&_3_#_/_+_%_bar")
=> "foo_%3C_%3E_&_3_%23_/_+_%25_bar"
>> require 'cgi'; CGI.escape("foo_<_>_&_3_#_/_+_%_bar")
=> "foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar"
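As a small sketch of how the escaped value would be used, the snippet below builds a path for the '/foo/:bar' route from the question and checks that the value survives a round trip. The variable names are illustrative, and whether the framework or web server decodes %2F before routing is outside what this shows:

```ruby
require 'cgi'

value = 'foo_<_>_&_3_#_/_+_%_bar'

# Escape the value as a single path segment; the "/" becomes %2F.
segment = CGI.escape(value)
path = "/foo/#{segment}"
puts path

# CGI.unescape restores the original string.
puts CGI.unescape(segment) == value
```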

Alternative to URI.parse that allows hostnames to contain an underscore

I'm using DMOZ's list of url topics, which contains some urls that have hostnames that contain an underscore.
For example:
<ExternalPage about="http://outer_heaven4.tripod.com/index2.htm">
<d:Title>The Outer Heaven</d:Title>
<d:Description>Information and image gallery of McFarlane's action figures for Trigun, Akira, Tenchi Muyo and other Japanese Sci-Fi animations.</d:Description>
<topic>Top/Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures</topic>
</ExternalPage>
While this url will work in a web browser (or, at least, it does in mine :p), it's not legal according to the standard:
a hostname may not contain other characters, such as the underscore character (_),
which causes errors when trying to parse such URL with URI.parse:
[2] pry(main)> require 'uri'
=> true
[3] pry(main)> URI.parse "http://outer_heaven4.tripod.com/index2.htm"
URI::InvalidURIError: the scheme http does not accept registry part: outer_heaven4.tripod.com (or bad hostname?)
from ~/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/uri/generic.rb:213:in `initialize'
Is there an alternative to URI.parse I can use that has lower strictness without just rolling my own?
Try Addressable::URI. It follows the RFCs more closely than URI and is very flexible.
require 'addressable/uri'
uri = Addressable::URI.parse('http://outer_heaven4.tripod.com/index2.htm')
uri.host
=> "outer_heaven4.tripod.com"
I've used it for some projects and have been happy with it. URI is getting a bit... rusty and is in need of TLC. Others have commented on it too:
http://www.cloudspace.com/blog/2009/05/26/replacing-rubys-uri-with-addressable/
There was quite a discussion about URI's state several years ago among the Ruby developers. I can't find the link to it right now, but there was a recommendation that Addressable::URI be used as a replacement. I don't know if someone stepped up to take over URI development, or where things stand right now. In my own code I continue to use URI for simple things and switch to Addressable::URI when URI proves to do the wrong thing for me.
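If only the components are needed and a gem is not an option, a lower-strictness fallback can be sketched with the component-splitting regular expression given in RFC 3986, Appendix B. The constant and helper names below are hypothetical. Note that the authority is returned raw, so it may still include userinfo or a port, and nothing is validated:

```ruby
# Component-splitting regex from RFC 3986, Appendix B. It does no validation,
# so characters such as "_" in the host are accepted as-is.
RFC3986_SPLIT = %r{\A(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?\z}

# Hypothetical helper: returns the raw components, or nil if nothing matches.
def loose_split(url)
  m = RFC3986_SPLIT.match(url)
  return nil unless m

  { scheme: m[2], authority: m[4], path: m[5], query: m[7], fragment: m[9] }
end

parts = loose_split('http://outer_heaven4.tripod.com/index2.htm')
puts parts[:authority]
puts parts[:path]
```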

Escaping URLs (without double escaping)

How do I escape a URL as needed, without double escaping?
Is there a Ruby library that already does this? I wonder what algorithm WebKit or Chrome uses.
Two examples:
This URL is not valid, since the % is not escaped: http://x.co/op&k=21%. If you type it into the URL bar, it knows to escape it. (It is escaping the '%' behind the scenes, right?)
If you type http://localhost:3000/?s=hello%20world into a browser, it knows to not escape %20 again.
I want to reuse great code that has already worked the edge cases that browsers have to handle. I don't mind calling an external library if necessary.
Update: Yes, I know about URI.parse. No need to show me the syntax. My question is harder than that.
So far, the winners are:
Addressable::URI#normalize: "Returns a normalized URI object. NOTE: This method does not attempt to fully conform to specifications. It exists largely to correct other people’s failures to read the specifications, and also to deal with caching issues since several different URIs may represent the same resource and should not be cached multiple times."
Addressable::URI.heuristic_parse: "Converts an input to a URI. The input does not have to be a valid URI -- the method will use heuristics to guess what URI was intended. This is not standards-compliant, merely user-friendly."
Knowing whether you need to encode or decode multiple times is up to you. You're the programmer and need to be aware of what state the URL is in as you massage it.
A browser can assume that a % not followed by two hexadecimal digits is bare and should be escaped. See "Uniform Resource Identifier (URI): Generic Syntax" for more information.
You can use Ruby's built-in URI or the Addressable gem to encode/decode.
require 'uri'
uri = URI.parse('http://x.co/op')
uri.query = URI.encode_www_form('k' => '21%')
puts uri.to_s # => http://x.co/op?k=21%25
or:
require 'addressable/uri'
uri = Addressable::URI.parse('http://x.co/op')
uri.query_values = {'k' => '21%'}
puts uri.to_s # => "http://x.co/op?k=21%25"
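The bare-percent heuristic described above can be sketched directly: escape a % only when it is not followed by two hexadecimal digits, so sequences that are already escaped are left alone. The helper name is hypothetical, and this is only a guess at what the user meant, not what any particular browser does (a literal "%20" that was meant as text is indistinguishable from an escaped space):

```ruby
# Hypothetical helper: escapes only "%" characters that do not already
# start a percent-encoded sequence (% followed by two hex digits).
def escape_bare_percents(url)
  url.gsub(/%(?![0-9A-Fa-f]{2})/, '%25')
end

puts escape_bare_percents('http://x.co/op&k=21%')
puts escape_bare_percents('http://localhost:3000/?s=hello%20world')
```

Applying the helper a second time changes nothing, which is the "no double escaping" property the question asks for.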

Hpricot error parsing special characters in URI

I'm working on a ruby script to grab historical stock prices from Yahoo, using Hpricot to parse the pages. This is mostly straightforward: the URL is "http://finance.yahoo.com/q/hp?s=TickerSymbol". For example, to look up Google, I would use "http://finance.yahoo.com/q/hp?s=GOOG".
Unfortunately, it breaks down when I'm looking up the price of an index. The indexes are prefixed with a caret, such as "http://finance.yahoo.com/q/hp?s=^DJI" for the Dow.
This code:
ticker_symbol = '^DJI'
doc = Hpricot(open("http://finance.yahoo.com/q/hp?s=#{ticker_symbol}"))
throws this exception:
bad URI(is not URI?): http://finance.yahoo.com/q/hp?s=^DJI
Hpricot chokes on the caret (I think because the underlying Ruby URI library does). Is there a way to escape that character or force the library to try it?
Well, don't I feel dumb. Five more minutes and I got this working:
doc = Hpricot(open(URI.encode("http://finance.yahoo.com/q/hp?s=#{ticker_symbol}")))
So if anyone else is wondering, that's how you do it. facepalm
The escape for ^ is %5E; you could do a straight substitution on the URL.
http://finance.yahoo.com/q/hp?s=%5EDJI
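Since URI.encode was removed in Ruby 3.0, a sketch that avoids it is to build the query string with URI.encode_www_form, which escapes only the parameter value. The variable names follow the question; no request is made here:

```ruby
require 'uri'

ticker_symbol = '^DJI'

# Build the URL from its parts so that only the query value is escaped.
uri = URI.parse('http://finance.yahoo.com/q/hp')
uri.query = URI.encode_www_form('s' => ticker_symbol)
puts uri.to_s
```

The resulting string can then be handed to whatever fetches the page.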
