Alternative to URI.parse that allows hostnames to contain an underscore - ruby

I'm using DMOZ's list of url topics, which contains some urls that have hostnames that contain an underscore.
For example:
<ExternalPage about="http://outer_heaven4.tripod.com/index2.htm">
<d:Title>The Outer Heaven</d:Title>
<d:Description>Information and image gallery of McFarlane's action figures for Trigun, Akira, Tenchi Muyo and other Japanese Sci-Fi animations.</d:Description>
<topic>Top/Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures</topic>
</ExternalPage>
While this url will work in a web browser (or, at least, it does in mine :p), it's not legal according to the standard:
a hostname may not contain other characters, such as the underscore character (_),
which causes errors when trying to parse such a URL with URI.parse:
[2] pry(main)> require 'uri'
=> true
[3] pry(main)> URI.parse "http://outer_heaven4.tripod.com/index2.htm"
URI::InvalidURIError: the scheme http does not accept registry part: outer_heaven4.tripod.com (or bad hostname?)
from ~/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/uri/generic.rb:213:in `initialize'
Is there an alternative to URI.parse I can use that has lower strictness without just rolling my own?

Try Addressable::URI. It follows the RFCs more closely than URI and is very flexible.
require 'addressable/uri'
uri = Addressable::URI.parse('http://outer_heaven4.tripod.com/index2.htm')
uri.host
=> "outer_heaven4.tripod.com"
I've used it for some projects and have been happy with it. URI is getting a bit... rusty and is in need of TLC. Others have commented on it too:
http://www.cloudspace.com/blog/2009/05/26/replacing-rubys-uri-with-addressable/
There was quite a discussion about URI's state several years ago among the Ruby developers. I can't find the link to it right now, but there was a recommendation that Addressable::URI be used as a replacement. I don't know if someone stepped up to take over URI development, or where things stand right now. In my own code I continue to use URI for simple things and switch to Addressable::URI when URI proves to do the wrong thing for me.
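If adding a gem is not an option, a rough standard-library fallback could look like the sketch below. The helper name lenient_host and its regular expression are assumptions made for illustration, not part of URI or Addressable, and the helper only recovers the host rather than parsing the whole URL. Newer Ruby versions use an RFC 3986 based parser that may accept this hostname outright, in which case the fallback is never reached.

```ruby
require 'uri'

# Hypothetical fallback pattern: scheme, "://", optional userinfo, then the host.
FALLBACK_HOST = %r{\A[a-z][a-z0-9+.-]*://(?:[^/?#@]*@)?([^/?#:]+)}i

# Try the strict parser first; fall back to the loose pattern if it
# raises or returns no host.
def lenient_host(url)
  host = begin
    URI.parse(url).host
  rescue URI::InvalidURIError
    nil
  end
  host || url[FALLBACK_HOST, 1]
end

lenient_host('http://outer_heaven4.tripod.com/index2.htm')
# => "outer_heaven4.tripod.com"
```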

Related

Simple alternative to URI.escape

While using URI.parse on a URL, I was confronted with an error message:
URI::InvalidURIError: URI must be ascii only
I found a StackOverflow question that recommended using URI.escape, which works. Using the URL in that question as an example:
URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> URI::InvalidURIError: URI must be ascii only "http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/\u041E\u0443\u044D\u043D-\u041C\u044D\u0442\u044C\u044E\u0441.jpg"
URI.encode('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> "http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg"
However, URI.escape is obsolete, as Rubocop warns:
URI.escape method is obsolete and should not be used. Instead, use CGI.escape, URI.encode_www_form or URI.encode_www_form_component depending on your specific use case.
But while URI.escape gives us a usable result, the alternatives don’t:
CGI.escape('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> "http%3A%2F%2Fdxczjjuegupb.cloudfront.net%2Fwp-content%2Fuploads%2F2017%2F10%2F%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg"
This is a bother because in my case I’m constructing a URL from data I get via Nokogiri:
my_url = page.at('.someclass').at('img').attr('src')
Since I only need to escape the last part of the resulting URL, but CGI.escape and similar transform the whole string (including necessary characters, such as : and /), getting the escaped result now becomes a multiple-lines-of-code ordeal, having to split the path and using several variables to achieve what could be previously done with a single function (URI.escape).
Is there a simple alternative I’m not seeing? It needs to be done without external gems.
I tend to use Addressable for parsing URLs since the standard URI has flaws:
require 'addressable'
Addressable::URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
#<Addressable::URI:0x3fc37ecc1c40 URI:http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg>
Addressable::URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg').path
# "/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg"
It isn't part of the Ruby core or the standard library but it should be and it always ends up in my Gemfiles.
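For the "no external gems" constraint in the question, one possible sketch is to percent-encode only the non-ASCII characters and leave everything else alone. The helper name escape_non_ascii and the example.com URL are made up for illustration; the sketch deliberately does not touch spaces or other ASCII characters that may also need escaping.

```ruby
require 'uri'

# Hypothetical helper: percent-encode each non-ASCII character byte by byte,
# leaving ":" and "/" (and all other ASCII characters) untouched.
def escape_non_ascii(url)
  url.gsub(/[^\x00-\x7F]/) do |char|
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

# A Cyrillic file name, written with \u escapes to keep this snippet ASCII-only.
raw = "http://example.com/uploads/\u041E\u0443\u044D\u043D.jpg"
escaped = escape_non_ascii(raw)
# => "http://example.com/uploads/%D0%9E%D1%83%D1%8D%D0%BD.jpg"
URI.parse(escaped).path
# => "/uploads/%D0%9E%D1%83%D1%8D%D0%BD.jpg"
```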

Ruby URI.extract returns empty array or ArgumentError: invalid byte sequence in UTF-8

I'm trying to get a list of files from url like this:
require 'uri'
require 'open-uri'
url = 'http://www.wmprof.com/media/niti/download'
html = open(url).read
puts URI.extract(html).select{ |link| link[/(PL)/]}
This code raises ArgumentError: invalid byte sequence in UTF-8 on the line with URI.extract (even though html.encoding returns UTF-8).
I've found some solutions to encoding problems, but when I change the code to
html.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
URI.extract returns an empty array, even when I'm not calling the select method on it. Any suggestions?
The character encoding of the website might be ISO-8859-1 or a related one. We can't tell for sure since there are only two occurrences of the same non-US-ASCII-character and it doesn't really matter anyway.
html.each_char.reject(&:ascii_only?) # => ["\xDC", "\xDC"]
Finding the actual encoding is done by guessing. The age of HTML 3.2 or the used language/s might be a clue. And in this case especially the content of the PDF file is helpful (it contains SPRÜH-EX and the file has the name TI_DE_SPR%dcH_EX.pdf). Then we only need to find the encoding for which "\xDC" and "Ü" are equal. Either by knowing it or writing some Ruby:
Encoding.list.select { |e| "Ü" == "\xDC".encode!(Encoding::UTF_8, e) rescue next }.map(&:name)
Of course, letting a program do the guessing is an option too. There is the libguess library. The web browser can do it too. However, you need to download the file first, since otherwise the server might tell the browser it's UTF-8 even if it isn't (as in this case). Any decent text editor will also try to detect the file encoding: e.g. ST3 thinks it's Windows 1252, which is a superset of ISO-8859-1 (like UTF-8 is of US-ASCII).
Possible solutions are manually setting the string encoding to ISO-8859-1:
html.force_encoding(Encoding::ISO_8859_1)
Or (preferably) transcoding the string from ISO-8859-1 to UTF-8:
html.encode!(Encoding::UTF_8, Encoding::ISO_8859_1)
To answer the other question: URI.extract isn't the method you're looking for. Apparently it's obsolete and, more importantly, it doesn't extract relative URIs.
A simple alternative is using a regular expression with String#scan. It works with this site but it might not with other ones. You have to use an HTML parser for the best reliability (there might also be a gem). Here's an example that should do what you want:
html.scan(/href="(.*?PL.*?)"/).flatten # => ["SI_PL_ACTIV_bicompact.pdf", ...]
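Putting the two steps together, a minimal sketch might look like this. The sample markup and the helper name extract_pl_links are invented for illustration, and ISO-8859-1 is only the guessed source encoding discussed above. The pattern uses [^"]* instead of .*? so that a match cannot run across two href attributes.

```ruby
# Made-up sample: byte 0xDC is "Ü" in ISO-8859-1 but is not valid UTF-8 here.
SAMPLE_HTML = %Q(<a href="SI_PL_ACTIV_bicompact.pdf">SPR\xDCH-EX</a> <a href="TI_DE_EX.pdf">DE</a>)

# Transcode from the guessed source encoding, then collect href values
# that contain "PL".
def extract_pl_links(raw, source_encoding = Encoding::ISO_8859_1)
  utf8 = raw.encode(Encoding::UTF_8, source_encoding)
  utf8.scan(/href="([^"]*PL[^"]*)"/).flatten
end

extract_pl_links(SAMPLE_HTML)
# => ["SI_PL_ACTIV_bicompact.pdf"]
```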

ruby regex for removing url prefix and ending

I've been trying to figure this out, and I've searched but I'm stuck.
Let's say I have the string www.google.com or http://google.com or just google.com
and I want to extract the string google out of those parameters.
A solution I can think of is first removing the first part (www.) and then removing the last section of the string (.com), but I suspect there is a similar, more efficient way.
Any help would be greatly appreciated!
First, start with a tool designed to work with URLs. Ruby includes URI, and there's also Addressable::URI.
Using these you can strip down a URI into its defined components:
require 'uri'
uri = URI.parse('http://www.ruby-doc.org/stdlib-2.1.1/libdoc/uri/rdoc/URI.html')
uri.host # => "www.ruby-doc.org"
If your string doesn't start with a scheme, you can add one. (Schemes are important.)
url = 'foo.bar.com/some/path'
URI.parse('http://' + url).host
# => "foo.bar.com"
From that point you're going to have a tough time determining what is the true host versus the domain. A domain can be (pretty much) anything, and the host can be the domain name. You can possibly get a list of domains, but remember that the list is constantly changing.
ICANN has a list of TLDs, as does IANA. Those are ONLY the top-level-domains, not the hosts that sit under them. However, using those lists you can strip the TLD from a host, and at least be a tiny bit closer to where you want to be.
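As a rough sketch of that idea, the snippet below adds a scheme when one is missing, drops a leading www., and strips a known suffix. KNOWN_SUFFIXES is a tiny hypothetical sample standing in for a real TLD list, and bare_name is a made-up helper name.

```ruby
require 'uri'

# Hypothetical sample; a real list would come from IANA or ICANN.
KNOWN_SUFFIXES = %w[co.uk com org net].freeze

def bare_name(input)
  # Add a scheme if the string does not start with one.
  url = input.match?(%r{\A[a-z][a-z0-9+.-]*://}i) ? input : "http://#{input}"
  host = URI.parse(url).host.downcase.sub(/\Awww\./, '')
  # Prefer the longest matching suffix, e.g. "co.uk" over a shorter one.
  suffix = KNOWN_SUFFIXES.sort_by { |s| -s.length }
                         .find { |s| host.end_with?(".#{s}") }
  suffix ? host.delete_suffix(".#{suffix}") : host
end

bare_name('www.google.com')    # => "google"
bare_name('http://google.com') # => "google"
bare_name('google.com')        # => "google"
```

Anything not covered by the suffix list is returned as the full host, so the quality of the result depends entirely on how current that list is.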

Escaping URL's (without double escaping)

How do I escape a URL as needed, without double escaping?
Is there a Ruby library that already does this? I wonder what algorithm WebKit or Chrome uses.
Two examples:
This URL is not valid, since the % is not escaped: http://x.co/op&k=21%. If you type it into the URL bar, it knows to escape it. (It is escaping the '%' behind the scenes, right?)
If you type http://localhost:3000/?s=hello%20world into a browser, it knows to not escape %20 again.
I want to reuse great code that has already worked the edge cases that browsers have to handle. I don't mind calling an external library if necessary.
Update: Yes, I know about URI.parse. No need to show me the syntax. My question is harder than that.
So far, the winners are:
Addressable::URI#normalize: "Returns a normalized URI object. NOTE: This method does not attempt to fully conform to specifications. It exists largely to correct other people’s failures to read the specifications, and also to deal with caching issues since several different URIs may represent the same resource and should not be cached multiple times."
Addressable::URI.heuristic_parse: "Converts an input to a URI. The input does not have to be a valid URI -- the method will use heuristics to guess what URI was intended. This is not standards-compliant, merely user-friendly."
Knowing whether you need to encode or decode multiple times is up to you. You're the programmer and need to be aware of what state the URL is in as you massage it.
A browser can assume that a % not followed by two hexadecimal digits is bare and should be escaped. See "Uniform Resource Identifier (URI): Generic Syntax" for more information.
You can use Ruby's built-in URI or the Addressable gem to encode/decode.
require 'uri'
uri = URI.parse('http://x.co/op')
uri.query = URI.encode_www_form('k' => '21%')
puts uri.to_s # => http://x.co/op?k=21%25
or:
require 'addressable/uri'
uri = Addressable::URI.parse('http://x.co/op')
uri.query_values = {'k' => '21%'}
puts uri.to_s # => "http://x.co/op?k=21%25"
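The bare-% heuristic mentioned above can be sketched with a negative lookahead. The helper name escape_bare_percent is made up, and this handles only stray percent signs, not the other characters a browser would fix up.

```ruby
# Hypothetical helper: escape a "%" only when it is not already the start
# of a percent-encoded triplet ("%" followed by two hex digits).
def escape_bare_percent(url)
  url.gsub(/%(?!\h\h)/, '%25')
end

escape_bare_percent('http://x.co/op&k=21%')
# => "http://x.co/op&k=21%25"
escape_bare_percent('http://localhost:3000/?s=hello%20world')
# => "http://localhost:3000/?s=hello%20world"
```

Because an escaped % becomes %25, which is itself a valid triplet, running the helper twice gives the same result as running it once.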

What regex can I use to get the domain name from a url in Ruby?

I am trying to construct a regex to extract a domain given a url.
for:
http://www.abc.google.com/
http://abc.google.com/
https://www.abc.google.com/
http://abc.google.com/
should give:
abc.google.com
URI.parse('http://www.abc.google.com/').host
#=> "www.abc.google.com"
Not a regex, but probably more robust than anything we come up with here.
URI.parse('http://www.abc.google.com/').host.gsub(/^www\./, '')
If you want to remove the www. as well this will work without raising any errors if the www. is not there.
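Applied to the four URLs from the question, that approach can be sketched as follows (host_without_www is a made-up helper name):

```ruby
require 'uri'

# Return the host with a leading "www." removed, if there is one.
def host_without_www(url)
  URI.parse(url).host.sub(/\Awww\./, '')
end

urls = %w[
  http://www.abc.google.com/
  http://abc.google.com/
  https://www.abc.google.com/
  http://abc.google.com/
]
urls.map { |url| host_without_www(url) }.uniq
# => ["abc.google.com"]
```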
Don't know much about Ruby, but this regex pattern gives you the last 3 parts of the URL, excluding the trailing slash, with a minimum of 2 characters per part.
([\w-]{2,}\.[\w-]{2,}\.[\w-]{2,})/$
You may be able to use the domain_name gem for this kind of work. From the README:
require "domain_name"
host = DomainName("a.b.example.co.uk")
host.domain #=> "example.co.uk"
Your question is a little bit vague. Can you give a precise specification of what it is exactly that you want to do? (Preferably with a test suite.) Right now, all your question says is that you want a method that always returns 'abc.google.com'. That's easy:
def extract_domain
  return 'abc.google.com'
end
But that's probably not what you meant …
Also, you say that you need a Regexp. Why? What's wrong with, for example, using the URI class? After all, parsing and manipulating URIs is exactly what it was made for!
require 'uri'
URI.parse('https://abc.google.com/').host # => 'abc.google.com'
And lastly, you say you are "trying to extract a domain", but you never specify what you mean by "domain". It looks like you sometimes mean the FQDN and sometimes randomly drop parts of the FQDN, but according to what rules? For example, for the FQDN abc.google.com, the domain name is google.com and the host name is abc, but you want it to return abc.google.com, which is not just the domain name but the full FQDN. Why?

Resources