Escaping URLs (without double escaping) - ruby

How do I escape a URL as needed, without double escaping?
Is there a Ruby library that already does this? I wonder what algorithm WebKit or Chrome uses.
Two examples:
This URL is not valid, since the % is not escaped: http://x.co/op&k=21%. If you type it into the URL bar, it knows to escape it. (It is escaping the '%' behind the scenes, right?)
If you type http://localhost:3000/?s=hello%20world into a browser, it knows to not escape %20 again.
I want to reuse great code that has already worked the edge cases that browsers have to handle. I don't mind calling an external library if necessary.
Update: Yes, I know about URI.parse. No need to show me the syntax. My question is harder than that.

So far, the winners are:
Addressable::URI#normalize: "Returns a normalized URI object. NOTE: This method does not attempt to fully conform to specifications. It exists largely to correct other people’s failures to read the specifications, and also to deal with caching issues since several different URIs may represent the same resource and should not be cached multiple times."
Addressable::URI.heuristic_parse: "Converts an input to a URI. The input does not have to be a valid URI -- the method will use heuristics to guess what URI was intended. This is not standards-compliant, merely user-friendly."

Knowing whether you need to encode or decode multiple times is up to you. You're the programmer and need to be aware of what state the URL is in as you massage it.
A browser can assume that a % not followed by two hexadecimal digits is bare and should be escaped. See "Uniform Resource Identifier (URI): Generic Syntax" (RFC 3986) for more information.
You can use Ruby's built-in URI or the Addressable::URI gem to encode/decode.
require 'uri'
uri = URI.parse('http://x.co/op')
uri.query = URI.encode_www_form('k' => '21%')
puts uri.to_s # => http://x.co/op?k=21%25
or:
require 'addressable/uri'
uri = Addressable::URI.parse('http://x.co/op')
uri.query_values = {'k' => '21%'}
puts uri.to_s # => "http://x.co/op?k=21%25"
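The "bare %" rule described above can be sketched in a few lines. This only illustrates the heuristic and is not a claim about what any browser actually does; escape_bare_percents is a hypothetical helper name, not part of URI or Addressable.

```ruby
# Hypothetical helper: escape only "bare" percent signs, i.e. a % that is
# not already followed by two hexadecimal digits.
def escape_bare_percents(url)
  url.gsub(/%(?![0-9A-Fa-f]{2})/, '%25')
end

puts escape_bare_percents('http://x.co/op&k=21%')
# => http://x.co/op&k=21%25
puts escape_bare_percents('http://localhost:3000/?s=hello%20world')
# => http://localhost:3000/?s=hello%20world
```

The heuristic cannot tell a literal % that happens to be followed by two hex-looking characters (for example "50%ab") from a real escape sequence, which is why knowing the state of your input still matters.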

Related

Simple alternative to URI.escape

While using URI.parse on a URL, I was confronted with an error message:
URI::InvalidURIError: URI must be ascii only
I found a Stack Overflow question that recommended using URI.escape (also available as URI.encode), which works. Using the URL in that question as an example:
URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> URI::InvalidURIError: URI must be ascii only "http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/\u041E\u0443\u044D\u043D-\u041C\u044D\u0442\u044C\u044E\u0441.jpg"
URI.encode('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> "http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg"
However, URI.escape is obsolete, as RuboCop warns:
URI.escape method is obsolete and should not be used. Instead, use CGI.escape, URI.encode_www_form or URI.encode_www_form_component depending on your specific use case.
But while URI.escape gives us a usable result, the alternatives don’t:
CGI.escape('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> "http%3A%2F%2Fdxczjjuegupb.cloudfront.net%2Fwp-content%2Fuploads%2F2017%2F10%2F%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg"
This is a bother because in my case I’m constructing a URL from data I get via Nokogiri:
my_url = page.at('.someclass').at('img').attr('src')
I only need to escape the last part of the resulting URL, but CGI.escape and similar methods transform the whole string, including necessary characters such as : and /. Getting the escaped result therefore becomes a multi-line ordeal: I have to split the path and juggle several variables to achieve what URI.escape previously did in a single call.
Is there a simple alternative I’m not seeing? It needs to be done without external gems.
I tend to use Addressable for parsing URLs since the standard URI has flaws:
require 'addressable'
Addressable::URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
#<Addressable::URI:0x3fc37ecc1c40 URI:http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg>
Addressable::URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg').path
# "/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg"
It isn't part of the Ruby core or the standard library but it should be and it always ends up in my Gemfiles.
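If external gems really are off the table, one possible standard-library-only approach is to percent-encode just the non-ASCII characters yourself. The sketch below assumes the input is a UTF-8 string and that everything already ASCII is meant to stay as-is; escape_non_ascii is a hypothetical helper name.

```ruby
require 'uri'

# Hypothetical helper: percent-encode only the non-ASCII characters of a URL,
# leaving ":", "/" and every other ASCII character untouched.
def escape_non_ascii(url)
  url.gsub(/[^[:ascii:]]/) do |char|
    # Encode each UTF-8 byte of the character as %XX.
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

escaped = escape_non_ascii('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
puts escaped
puts URI.parse(escaped).path # parses without raising URI::InvalidURIError
```

For this URL the escaped string is the same as the URI.encode result shown in the question. The helper does not touch ASCII characters that are invalid in a URL, such as spaces, so it is not a full replacement for URI.escape.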

Remove everything before the first slash in a URL?

Using regex, how could I remove everything before the first path / in a URL?
Example URL: https://www.example.com/some/page?user=1&email=joe#schmoe.org
From that, I just want /some/page?user=1&email=joe#schmoe.org
If it's just the root domain (i.e. https://www.example.com/), then I just want the / to be returned.
The domain may or may not have a subdomain, and it may or may not use a secure protocol. Ultimately, I just want to strip out anything before that first path slash.
In the event that it matters, I'm running Ruby 1.9.3.
Don't use regex for this. Use the URI class. You can write:
require 'uri'
u = URI.parse('https://www.example.com/some/page?user=1&email=joe#schmoe.org')
u.path #=> "/some/page"
u.query #=> "user=1&email=joe"
u.fragment #=> "schmoe.org"
# Path and query together - returns only the path if there is no query (no ?).
# The fragment (everything after #) is never included.
u.request_uri #=> "/some/page?user=1&email=joe"
require 'uri'
uri = URI.parse("https://www.example.com/some/page?user=1&email=joe#schmoe.org")
> uri.path + '?' + uri.query
=> "/some/page?user=1&email=joe"
> uri.fragment
=> "schmoe.org"
As Gavin also mentioned, it's not a good idea to use RegExp for this, although it's tempting.
You could have URLs with special characters, even Unicode characters, that you did not expect when you wrote the RegExp. This is particularly likely in the query string. Using the URI library is the safer approach.
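Because URI splits off everything after # as the fragment, a helper that should return the path, query and fragment together has to reassemble them explicitly. A minimal sketch, assuming a bare domain should yield "/"; path_with_query_and_fragment is a hypothetical name:

```ruby
require 'uri'

# Hypothetical helper: everything from the first path slash onwards,
# rebuilt from the parts that URI.parse extracts.
def path_with_query_and_fragment(url)
  uri = URI.parse(url)
  result = uri.path.empty? ? '/' : uri.path
  result += "?#{uri.query}" if uri.query
  result += "##{uri.fragment}" if uri.fragment
  result
end

puts path_with_query_and_fragment('https://www.example.com/some/page?user=1&email=joe#schmoe.org')
# => /some/page?user=1&email=joe#schmoe.org
puts path_with_query_and_fragment('https://www.example.com/')
# => /
```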
The same can be done using String#index
index(substring[, offset])
str = "https://www.example.com/some/page?user=1&email=joe#schmoe.org"
offset = str.index("//") # => 6
str[str.index('/',offset + 2)..-1]
# => "/some/page?user=1&email=joe#schmoe.org"
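The String#index approach above assumes the URL contains both "//" and a later "/". A slightly more defensive sketch, under the assumption that a URL without a path should yield "/", might look like this; strip_origin is a hypothetical name:

```ruby
# Hypothetical helper: String#index variant that also copes with URLs
# that have no scheme separator or no path slash at all.
def strip_origin(url)
  scheme_end = url.index('//')
  return url unless scheme_end            # no "//": leave the input alone

  slash = url.index('/', scheme_end + 2)  # first slash after the host
  slash ? url[slash..-1] : '/'
end

puts strip_origin('https://www.example.com/some/page?user=1&email=joe#schmoe.org')
# => /some/page?user=1&email=joe#schmoe.org
puts strip_origin('http://test.com')
# => /
```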
I strongly agree with the advice to use the URI module in this case, and I don't consider myself great with regular expressions. Still, it seems worthwhile to demonstrate one possible way to do what you ask.
test_url1 = 'https://www.example.com/some/page?user=1&email=joe#schmoe.org'
test_url2 = 'http://test.com/'
test_url3 = 'http://test.com'
regex = /^https?:\/\/[^\/]+(.*)/
regex.match(test_url1)[1]
# => "/some/page?user=1&email=joe#schmoe.org"
regex.match(test_url2)[1]
# => "/"
regex.match(test_url3)[1]
# => ""
Note that in the last case, the URL had no trailing '/' so the result is the empty string.
The regular expression (/^https?:\/\/[^\/]+(.*)/) says the string starts with (^) http (http), optionally followed by s (s?), followed by :// (:\/\/) followed by at least one non-slash character ([^\/]+), followed by zero or more characters, and we want to capture those characters ((.*)).
I hope that you find that example and explanation educational, and I again recommend against actually using a regular expression in this case. The URI module is simpler to use and far more robust.

Ruby 1.9.3 add unsafe characters to URI.escape

I am using Sinatra and get parameters from the URL using the get '/foo/:bar' {} method. Unfortunately, the value in :bar can contain nasty things like /, which leads to a 404, since no route matches /foo/:bar/baz/. I use URI.escape to escape the URL parameter, but it considers / a valid character. As mentioned here, this is because the default Regexp to check against does not differentiate between unsafe and reserved characters. I would like to change this, so I tried:
URI.escape("foo_<_>_&_3_#_/_+_%_bar", Regexp.union(URI::REGEXP::UNSAFE, '/'))
just to test it.
URI::REGEXP::UNSAFE is the default regexp to match against according to the Ruby 1.9.3 documentation:
escape(*arg)
Synopsis
URI.escape(str [, unsafe])
Args
str
String to replaces in.
unsafe
Regexp that matches all symbols that must be replaced with
codes. By default uses REGEXP::UNSAFE. When this argument is
a String, it represents a character set.
Description
Escapes the string, replacing all unsafe characters with codes.
Unfortunately, I get this error:
uninitialized constant URI::REGEXP::UNSAFE
And as this GitHub Issue suggests, this Regexp was removed from Ruby in 1.9.3. The URI module's documentation is generally kind of bad, and I really cannot figure this out. Any hints?
Thanks in advance!
URI.escape is not what you are looking for. You want CGI.escape:
require 'cgi'
CGI.escape("foo_<_>_&_3_#_/_+_%_bar")
# => "foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar"
This will properly encode it to allow Sinatra to retrieve it.
Perhaps you would have better luck with CGI.escape?
>> require 'uri'; URI.escape("foo_<_>_&_3_#_/_+_%_bar")
=> "foo_%3C_%3E_&_3_%23_/_+_%25_bar"
>> require 'cgi'; CGI.escape("foo_<_>_&_3_#_/_+_%_bar")
=> "foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar"
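As a quick sanity check, the CGI.escape result can be round-tripped with CGI.unescape. This sketch only demonstrates the encoding itself; how Sinatra decodes a route parameter is not verified here.

```ruby
require 'cgi'

raw = 'foo_<_>_&_3_#_/_+_%_bar'
escaped = CGI.escape(raw)

puts escaped                      # => foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar
puts CGI.unescape(escaped) == raw # => true

# CGI.escape follows form encoding, so a space becomes "+" rather than "%20".
puts CGI.escape('a b')            # => a+b
```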

Alternative to URI.parse that allows hostnames to contain an underscore

I'm using DMOZ's list of URL topics, which contains some URLs whose hostnames contain an underscore.
For example:
<ExternalPage about="http://outer_heaven4.tripod.com/index2.htm">
<d:Title>The Outer Heaven</d:Title>
<d:Description>Information and image gallery of McFarlane's action figures for Trigun, Akira, Tenchi Muyo and other Japanese Sci-Fi animations.</d:Description>
<topic>Top/Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures</topic>
</ExternalPage>
While this url will work in a web browser (or, at least, it does in mine :p), it's not legal according to the standard:
a hostname may not contain other characters, such as the underscore character (_),
which causes errors when trying to parse such a URL with URI.parse:
[2] pry(main)> require 'uri'
=> true
[3] pry(main)> URI.parse "http://outer_heaven4.tripod.com/index2.htm"
URI::InvalidURIError: the scheme http does not accept registry part: outer_heaven4.tripod.com (or bad hostname?)
from ~/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/uri/generic.rb:213:in `initialize'
Is there an alternative to URI.parse I can use that has lower strictness without just rolling my own?
Try Addressable::URI. It follows the RFCs more closely than URI and is very flexible.
require 'addressable/uri'
uri = Addressable::URI.parse('http://outer_heaven4.tripod.com/index2.htm')
uri.host
=> "outer_heaven4.tripod.com"
I've used it for some projects and have been happy with it. URI is getting a bit... rusty and is in need of TLC. Others have commented on it too:
http://www.cloudspace.com/blog/2009/05/26/replacing-rubys-uri-with-addressable/
There was quite a discussion about URI's state several years ago among the Ruby developers. I can't find the link to it right now, but there was a recommendation that Addressable::URI be used as a replacement. I don't know if someone stepped up to take over URI development, or where things stand right now. In my own code I continue to use URI for simple things and switch to Addressable::URI when URI proves to do the wrong thing for me.

What regex can I use to get the domain name from a url in Ruby?

I am trying to construct a regex to extract a domain given a url.
for:
http://www.abc.google.com/
http://abc.google.com/
https://www.abc.google.com/
http://abc.google.com/
should give:
abc.google.com
URI.parse('http://www.abc.google.com/').host
#=> "www.abc.google.com"
Not a regex, but probably more robust than anything we come up with here.
URI.parse('http://www.abc.google.com/').host.gsub(/^www\./, '')
If you want to remove the www. as well, this will work without raising any errors when the www. is not there.
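Wrapped up as a small method and run against the URLs from the question, the idea looks like this. It is only a sketch: host_without_www is a hypothetical name, and it assumes "domain" means the host with a leading www. removed.

```ruby
require 'uri'

# Hypothetical helper: host of a URL with a leading "www." stripped.
# Returns nil when the URL has no host component.
def host_without_www(url)
  host = URI.parse(url).host
  host && host.sub(/\Awww\./, '')
end

%w[
  http://www.abc.google.com/
  http://abc.google.com/
  https://www.abc.google.com/
].each do |url|
  puts host_without_www(url) # => abc.google.com (for each URL)
end
```

Using sub with \A anchors the match to the very start of the host, so a "www." appearing later in the name is left alone.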
I don't know much about Ruby, but this regex pattern gives you the last three parts of the URL, excluding the trailing slash, with a minimum of 2 characters per part.
([\w-]{2,}\.[\w-]{2,}\.[\w-]{2,})/$
You may be able to use the domain_name gem for this kind of work. From the README:
require "domain_name"
host = DomainName("a.b.example.co.uk")
host.domain #=> "example.co.uk"
Your question is a little bit vague. Can you give a precise specification of what exactly you want to do? (Preferably with a test suite.) Right now, all your question says is that you want a method that always returns 'abc.google.com'. That's easy:
def extract_domain
  return 'abc.google.com'
end
But that's probably not what you meant …
Also, you say that you need a Regexp. Why? What's wrong with, for example, using the URI class? After all, parsing and manipulating URIs is exactly what it was made for!
require 'uri'
URI.parse('https://abc.google.com/').host # => 'abc.google.com'
And lastly, you say you are "trying to extract a domain", but you never specify what you mean by "domain". It looks like you sometimes mean the FQDN and sometimes randomly drop parts of the FQDN, but according to what rules? For example, for the FQDN abc.google.com, the domain name is google.com and the host name is abc, but you want it to return abc.google.com, which is not just the domain name but the full FQDN. Why?

Resources