Ruby regexp: capture the path of a URL

From any URL I want to extract its path.
For example:
URL: https://stackoverflow.com/questions/ask
Path: questions/ask
It shouldn't be difficult:
url[/(?:\w{2,}\/).+/]
But I think I'm using the wrong pattern for the 'ignore this' part ('?:' doesn't seem to work). What is the right way?

I would suggest you don't do this with a regular expression, and instead use the built in URI lib:
require 'uri'
uri = URI.parse('http://stackoverflow.com/questions/ask')
puts uri.path # results in: /questions/ask
It has a leading slash, but that's easy to deal with =)
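If you want it without the leading slash, one simple option (a minimal sketch continuing from the snippet above):
uri.path.sub(%r{\A/}, '')  # => "questions/ask"
# On Ruby 2.5+, uri.path.delete_prefix('/') does the same thing.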

You can use regex in this case, which is faster than URI.parse:
s = 'http://stackoverflow.com/questions/ask'
s[s[/.*?\/\/[^\/]*\//].size..-1]
# => "questions/ask" (6,8 times faster)
s[/\/(?!.*\.).*/]
# => "/questions/ask" (9,9 times faster, but with an extra slash)
But if you don't care with the speed, use uri, as ctcherry showed, is more readable.
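If you want to check those numbers on your own machine, here is a rough sketch using the standard Benchmark library (the iteration count is arbitrary; the ratios quoted above are the answerer's measurements):
require 'benchmark'
require 'uri'

s = 'http://stackoverflow.com/questions/ask'
n = 100_000

Benchmark.bmbm do |x|
  x.report('URI.parse')   { n.times { URI.parse(s).path } }
  x.report('regex slice') { n.times { s[s[/.*?\/\/[^\/]*\//].size..-1] } }
end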

The approach presented by ctcherry is perfectly correct, but I prefer to use request.fullpath instead of pulling in the URI library. Just call request.fullpath in your views or controllers. Be careful, though: if the URL has any GET parameters, they will be caught as well; in that case use split('?').first.
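A quick illustration of that split (the example fullpath value is made up):
# `request.fullpath` in a Rails/Rack app might be "/questions/ask?tab=active";
# splitting on "?" keeps just the path part:
"/questions/ask?tab=active".split('?').first  # => "/questions/ask"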

Related

Regex in Ruby for a URL that is an image

So I'm working on a crawler to get a bunch of images on a page that are saved as links. The relevant code, at the moment, is:
def parse_html(html)
  html_doc = Nokogiri::HTML(html)
  nodes = html_doc.xpath("//a[@href]")
  nodes.inject([]) do |uris, node|
    uris << node.attr('href').strip
  end.uniq
end
I am currently getting a bunch of links, most of which are images, but not all. I want to narrow down the links with a regex before downloading. So far, I haven't been able to come up with a Ruby-friendly regex for the job. The best I have is:
^https?:\/\/(?:[a-z0-9\-]+\.)+[a-z]{2,6}(?:/[^\/?]+)+\.(?:jpg|gif|png)$.match(nodes)
Admittedly, I got that regex from someone else and tried to edit it to work, and I'm failing. One of the big problems I'm having is that the original regex I took had a few "#"s in it, and I don't know whether that is a character I can escape or whether Ruby just stops reading at that point. Help much appreciated.
I would consider modifying your XPath to include your logic. For example, if you only wanted the a elements that contained an img you can use the following:
"//a[img][#href]"
Or even go further and extract just the URIs directly from the href values:
uris = html_doc.xpath("//a[img]/@href").map(&:value)
As some have said, you may not want to use Regex for this, but if you're determined to:
^http(s?):\/\/.*\.(jpeg|jpg|gif|png)
This is a pretty simple one that will grab anything beginning with http or https and ending with one of the listed file extensions. You should be able to figure out how to extend it; Rubular.com is good for experimenting with these.
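If you do go the regex route, here is a quick sketch of applying that pattern to a single href string in Ruby (the sample URLs are made up):
IMAGE_URL_RE = /^http(s?):\/\/.*\.(jpeg|jpg|gif|png)/
IMAGE_URL_RE.match?("http://example.com/images/cat.png")     # => true
IMAGE_URL_RE.match?("http://example.com/articles/cats.html") # => false
# Regexp#match? needs Ruby 2.4+; use =~ on older versions.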
Regexps are a very powerful tool, but compared to simple string comparisons they are pretty slow.
For your simple example, I would suggest using a simple condition like:
IMAGE_EXTS = %w[gif jpg png]
if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  # ...
end
In the context of your question, you might want to change your method to:
IMAGE_EXTS = %w[gif jpg png]

def parse_html(html)
  uris = []
  Nokogiri::HTML(html).xpath("//a[@href]").each do |node|
    uri = node.attr('href').strip
    uris << uri if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  end
  uris.uniq
end
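As a small aside that is not part of the answer above: String#end_with? accepts several suffixes at once, so the any? call can be dropped, and including the leading dot avoids matching names that merely end in the same letters. A sketch:
IMAGE_EXTS = %w[.gif .jpg .png]
"http://example.com/images/cat.png".end_with?(*IMAGE_EXTS)  # => true
"http://example.com/notapng".end_with?(*IMAGE_EXTS)         # => false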

Capturing parts of an absolute filepath in Ruby

I'm writing a class that parses a filename. I've got 3 questions:
The regex
Given hello/this/is/my/page.html I want to capture three parts:
The parent folders: hello/this/is/my
The filename itself: page
The extension: .html
This is the regex: /^((?:[^\/]+\/)*)(\w+)(\.\w+)$/
The problem is that when I tried this (using Rubular) with a relative file path such as page.html, everything gets captured into the first capturing group.
Can someone suggest a regex that works correctly for both relative and absolute filepaths?
The class
Would this class be ok?
class RegexFilenameHelper
  # a constant, since a local variable here would not be visible inside the class methods
  FILENAME_REGEX = /^((?:[^\/]+\/)*)(\w+)(\.\w+)$/
  def self.getParentFolders(filePath)
    matchData = FILENAME_REGEX.match(filePath)
    return matchData[1]
  end
  def self.getFileName(filePath)
    # ...
  end
  def self.getFileExtension(filePath)
    # ...
  end
end
I understand that it's inefficient to call .match for every function, but I don't intend to use all three functions sequentially.
I also intend to call the class itself, and not instantiate an object.
An aside
Assuming this is important: would you rather capture .html or html, and why?
Using the standard library:
As Tim Pietzcker suggested, the functionality is already implemented in the Pathname and File classes.
filepath = "hello/this/is/my/page.html"
Getting the parents: File.dirname(filepath) => "hello/this/is/my"
Getting the name: File.basename(filepath) => "page.html"
without extension: File.basename(filepath, File.extname(filepath)) => "page"
Getting the extension: File.extname(filepath) => ".html"
We call class methods without having to instantiate any class, which is exactly what I wanted.
It's not necessary for the file or folders to actually exist in the file system!
Thanks to Tim Pietzcker for letting me know!
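Pathname is mentioned above but only File is shown; roughly the same calls exist on Pathname (a sketch, and again nothing needs to exist on disk):
require 'pathname'

path = Pathname.new("hello/this/is/my/page.html")
path.dirname.to_s                 # => "hello/this/is/my"
path.basename.to_s                # => "page.html"
path.basename(path.extname).to_s  # => "page"
path.extname                      # => ".html"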
Using regex:
If I had wanted to do it with regex, the correct regex would be ((?:^.*\/)?)([^\/]+)(\..*$).
((?:^.*\/)?): Captures everything up to and including the last /, or nothing at all (that's what the trailing ? is for). This is the parent path, which is optional.
([^\/]+): Gets everything that's not /, which is the filename.
(\..*$): Captures everything coming after the last ., including it.
I tried this in Rubular and it worked like a charm, but I'm still not sure if the second capturing group is too broad, so be careful if you use this!
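A quick check of that regex in plain Ruby, against both a nested path and a bare filename (note that the first group keeps the trailing slash):
FILE_RE = /((?:^.*\/)?)([^\/]+)(\..*$)/
"hello/this/is/my/page.html".match(FILE_RE).captures
# => ["hello/this/is/my/", "page", ".html"]
"page.html".match(FILE_RE).captures
# => ["", "page", ".html"]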
Thanks to user230910 for helping me get there! :)

Get full path in Sinatra route including everything after question mark

I have the following path:
http://192.168.56.10:4567/browse/foo/bar?x=100&y=200
I want absolutely everything that comes after "http://192.168.56.10:4567/browse/" in a string.
Using a splat doesn't work (only catches "foo/bar"):
get '/browse/*' do
Neither does the regular expression (also only catches "foo/bar"):
get %r{/browse/(.*)} do
The x and y params are all accessible in the params hash, but doing a .map on the ones I want seems unreasonable and un-ruby-like (also, this is just an example.. my params are actually very dynamic and numerous). Is there a better way to do this?
More info: my path looks this way because it is communicating with an API and I use the route to determine the API call I will make. I need the string to look this way.
If you are willing to ignore the hash fragment in the path, this should work (a browser won't send anything after the # to the server anyway).
Updated answer:
get "/browse/*" do
  p "#{request.path}?#{request.query_string}".split("browse/")[1]
end
Or even simpler:
request.fullpath.split("browse/")[1]
The original answer was:
get "/browse/*" do
  a = "#{params[:splat]}?#{request.env['rack.request.query_string']}"
  "Got #{a}"
end

Parsing URL string in Ruby

I have a pretty simple string I want to parse in Ruby, and I'm trying to find the most elegant solution. The string is of the format:
/xyz/mov/exdaf/daeed.mov?arg1=blabla&arg2=3bla3bla
What I would like to have is :
string1: /xyz/mov/exdaf/daeed.mov
string2: arg1=blabla&arg2=3bla3bla
so basically tokenise on ?
but can't find a good example.
Any help would be appreciated.
I think the best solution would be to use the URI module. (You can do things like URI.parse('your_uri_string').query to get the part to the right of the ?.) See http://www.ruby-doc.org/stdlib/libdoc/uri/rdoc/
Example:
002:0> require 'uri' # or even 'net/http'
true
003:0> URI
URI
004:0> URI.parse('/xyz/mov/exdaf/daeed.mov?arg1=bla&arg2=asdf')
#<URI::Generic:0xb7c0a190 URL:/xyz/mov/exdaf/daeed.mov?arg1=bla&arg2=asdf>
005:0> URI.parse('/xyz/mov/exdaf/daeed.mov?arg1=bla&arg2=asdf').query
"arg1=bla&arg2=asdf"
006:0> URI.parse('/xyz/mov/exdaf/daeed.mov?arg1=bla&arg2=asdf').path
"/xyz/mov/exdaf/daeed.mov"
Otherwise, you can capture on a regex: /^(.*?)\?(.*?)$/. Then $1 and $2 are what you want. (URI makes more sense in this case though.)
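A quick sketch of that regex alternative, run against the question's string:
s = '/xyz/mov/exdaf/daeed.mov?arg1=blabla&arg2=3bla3bla'
s =~ /^(.*?)\?(.*?)$/
$1  # => "/xyz/mov/exdaf/daeed.mov"
$2  # => "arg1=blabla&arg2=3bla3bla"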
Split the initial string on question marks.
str.split("?")
=> ["/xyz/mov/exdaf/daeed.mov", "arg1=blabla&arg2=3bla3bla"]
This seems to be what you're looking for: String's built-in split method:
"abc?def".split("?") => ["abc", "def"]
Edit: Bah, too slow ;)

Remove subdomain from string in ruby

I'm looping over a series of URLs and want to clean them up. I have the following code:
# Parse url to remove http, path and check format
o_url = URI.parse(node.attributes['href'])
# Remove www
new_url = o_url.host.gsub('www.', '').strip
How can I extend this to remove the subdomains that exist in some URLs?
I just wrote a library to do this called Domainatrix. You can find it here: http://github.com/pauldix/domainatrix
require 'rubygems'
require 'domainatrix'
url = Domainatrix.parse("http://www.pauldix.net")
url.public_suffix # => "net"
url.domain # => "pauldix"
url.canonical # => "net.pauldix"
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
This is a tricky issue. Some top-level domains do not accept registrations at the second level.
Compare example.com and example.co.uk. If you would simply strip everything except the last two domains, you would end up with example.com, and co.uk, which can never be the intention.
Firefox solves this by filtering by effective top-level domain, and they maintain a list of all these domains. More information at publicsuffix.org.
You can use this list to filter out everything except the domain right next to the effective TLD. I don't know of any Ruby library that does this, but it would be a great idea to release one!
Update: there are C, Perl and PHP libraries that do this. Given the C version, you could create a Ruby extension. Alternatively, you could port the code to Ruby.
For posterity, here's an update from Oct 2014:
I was looking for a more up-to-date dependency to rely on and found the public_suffix gem (RubyGems) (GitHub). It's being actively maintained and handles all the top-level domain and nested-subdomain issues by maintaining a list of the known public suffixes.
In combination with URI.parse for stripping protocol and paths, it works really well:
PublicSuffix.parse(URI.parse('https://subdomain.google.co.uk/path/on/path').host).domain
# => "google.co.uk"
The regular expression you'll need here can be a bit tricky, because hostnames can be arbitrarily complex: you could have multiple subdomains (e.g. foo.bar.baz.com), or the top-level domain (TLD) can itself have multiple parts (e.g. .co.uk, as in www.baz.co.uk).
Ready for a complex regular expression? :)
re = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i
new_url = o_url.host.gsub(re, '\1').strip
Let's break this into two sections. ^(?:(?>[a-z0-9-]*\.)+?|) collects the subdomains by matching one or more groups of characters followed by a dot, expanding as needed so that every subdomain ends up here. The empty alternation is needed for the case of no subdomain at all (such as foo.com). ([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$ collects the actual hostname and the TLD. It allows either a one-part TLD (like .info, .com or .museum) or a two-part TLD whose second part is two characters (like .oh.us or .org.uk).
I tested this expression on the following samples:
foo.com => foo.com
www.foo.com => foo.com
bar.foo.com => foo.com
www.foo.ca => foo.ca
www.foo.co.uk => foo.co.uk
a.b.c.d.e.foo.com => foo.com
a.b.c.d.e.foo.co.uk => foo.co.uk
Note that this regex will not properly match hostnames that have more than two "parts" to the TLD!
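For reference, a small script to reproduce that list of results using the host samples from above (the sub call is the same one shown earlier):
re = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i
hosts = %w[foo.com www.foo.com bar.foo.com www.foo.ca www.foo.co.uk a.b.c.d.e.foo.com a.b.c.d.e.foo.co.uk]
hosts.each { |host| puts "#{host} => #{host.sub(re, '\1')}" }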
Something like:
def remove_subdomain(host)
  # Not complete. Add all root domains to the regexp
  host.sub(/.*?([^.]+(\.com|\.co\.uk|\.uk|\.nl))$/, "\\1")
end
puts remove_subdomain("www.example.com") # -> example.com
puts remove_subdomain("www.company.co.uk") # -> company.co.uk
puts remove_subdomain("www.sub.domain.nl") # -> domain.nl
You still need to add every suffix you want to treat as a root domain. '.uk' might technically be the root domain, but you probably want to keep the part of the host just before the '.co.uk' part.
Detecting the subdomain of a URL is non-trivial to do in a general sense - it's easy if you just consider the basic ones, but once you get into international territory this becomes tricky.
Edit: Consider stuff like http://mylocalschool.k12.oh.us et al.
Why not just strip the .com or .co.uk and then split on '.' and get the last element?
some_url.host.sub(/(\.co\.uk|\.[^.]*)$/, '').split('.')[-1] + $1
Have to say it feels hacky. Are there any other domains like .co.uk?
I've wrestled with this a lot in writing various and sundry crawlers and scrapers over the years. My favorite gem for solving this is FuzzyUrl by Pete Gamache: https://github.com/gamache/fuzzyurl. It's available for Ruby, JavaScript and Elixir.
