How do I safely join relative url segments? - ruby

I'm trying to find a robust method of joining partial url path segments together. Is there a quick way to do this?
I tried the following:
puts URI::join('resource/', '/edit', '12?option=test')
I expect:
resource/edit/12?option=test
But I get the error:
`merge': both URI are relative (URI::BadURIError)
I have used File.join() in the past for this but something does not seem right about using the file library for urls.

URI's API is not necessarily great.
URI::join will work only if the first one starts out as an absolute uri with protocol, and the later ones are relative in the right ways... except I try to do that and can't even get that to work.
This at least doesn't error, but why is it skipping the middle component?
URI::join('http://somewhere.com/resource', './edit', '12?option=test')
I think maybe URI just kind of sucks. It lacks significant api on instances, such as an instance #join or method to evaluate relative to a base uri, that you'd expect. It's just kinda crappy.
I think you're going to have to write it yourself. Or just use File.join and other File path methods, after testing all the edge cases you can think of to make sure it does what you want/expect.
Edit (9 Dec 2016): I figured out that the addressable gem does it very nicely.
base = Addressable::URI.parse("http://example.com")
base + "foo.html"
# => #<Addressable::URI:0x3ff9964aabe4 URI:http://example.com/foo.html>
base = Addressable::URI.parse("http://example.com/path/to/file.html")
base + "relative_file.xml"
# => #<Addressable::URI:0x3ff99648bc80 URI:http://example.com/path/to/relative_file.xml>
base = Addressable::URI.parse("https://example.com/path")
base + "//newhost/somewhere.jpg"
# => #<Addressable::URI:0x3ff9960c9ebc URI:https://newhost/somewhere.jpg>
base = Addressable::URI.parse("http://example.com/path/subpath/file.html")
base + "../up-one-level.html"
# => #<Addressable::URI:0x3fe13ec5e928 URI:http://example.com/path/up-one-level.html>

Have uri as a URI::Generic or a subclass thereof:
uri.path += '/123'
Enjoy!
06/25/2016 UPDATE for skeptical folk
require 'uri'
uri = URI('http://ioffe.net/boris')
uri.path += '/123'
p uri
Outputs
#<URI::HTTP:0x2341a58 URL:http://ioffe.net/boris/123>

The problem is that resource/ is relative to the current directory, but /edit refers to the top-level directory because of its leading slash. It's impossible to join the two without already knowing for certain that resource contains edit.
If you're looking for purely string operations, simply remove the leading or trailing slashes from all parts, then join them with / as the glue.
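As a rough sketch of that string-only approach (the helper name join_segments is made up for illustration, and it deliberately drops any leading slash on the first part and trailing slash on the last):

```ruby
# Hypothetical helper: strip slashes from both ends of every part,
# skip nil and empty parts, then glue the rest together with "/".
def join_segments(*segments)
  segments.compact
          .map { |segment| segment.to_s.gsub(%r{\A/+|/+\z}, '') }
          .reject(&:empty?)
          .join('/')
end

puts join_segments('resource/', '/edit', '12?option=test')
# => resource/edit/12?option=test
```

Because it treats its input as plain strings, it knows nothing about schemes, hosts, or query strings; it only suits cases where every argument is a path fragment.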

The way to do it using URI.join is:
URI.join('http://example.com', '/foo/', 'bar')
Pay attention to the trailing slashes. You can find the complete documentation here:
http://www.ruby-doc.org/stdlib-1.9.3/libdoc/uri/rdoc/URI.html#method-c-join

As you noticed, URI::join won't combine paths with repeated slashes, so it doesn't fit the bill.
Turns out it doesn't require a lot of Ruby code to achieve this:
module GluePath
  def self.join(*paths, separator: '/')
    paths = paths.compact.reject(&:empty?)
    last = paths.length - 1
    paths.each_with_index.map { |path, index|
      _expand(path, index, last, separator)
    }.join
  end
  def self._expand(path, current, last, separator)
    if path.start_with?(separator) && current != 0
      path = path[1..-1]
    end
    unless path.end_with?(separator) || current == last
      path = [path, separator]
    end
    path
  end
end
The algorithm takes care of consecutive slashes, preserves start and end slashes, and ignores nil and empty strings.
puts GluePath::join('resource/', '/edit', '12?option=test')
outputs
resource/edit/12?option=test

Use this code:
File.join('resource/', '/edit', '12?option=test').
gsub(File::SEPARATOR, '/').
sub(/^\//, '')
# => resource/edit/12?option=test
example with empty strings:
File.join('', '/edit', '12?option=test').
gsub(File::SEPARATOR, '/').
sub(/^\//, '')
# => edit/12?option=test
Or, if you can pass the segments in the form resource/, edit/, 12?option=test, use URI.join with http: as a placeholder base to get a valid URI. Note that .path drops the query string. This works for me.
URI.
join('http:', 'resource/', 'edit/', '12?option=test').
path.
sub(/^\//, '')
# => "resource/edit/12"

An unoptimized solution. Note that it doesn't take query params into account. It only handles paths.
class URL
  def self.join(*str)
    str.map { |path|
      new_path = path
      # Check the first character
      if path[0] == "/"
        new_path = new_path[1..-1]
      end
      # Check the last character
      if path[-1] != "/"
        new_path += "/"
      end
      new_path
    }.join
  end
end

This question is nearly a decade old, yet it seems that there is no perfect solution posted.
A handful of posted answers fail to handle multiple //, e.g. stuff like path = path[1..-1] if path.start_with?('/')
Answers that simply call File.join(*paths) seem to be the accepted "Ruby way," yet they fail in cases where you pass a URI object, e.g. File.join(URI.join('some/path')) fails with TypeError: no implicit conversion of URI::Generic into String.
Below is what I ended up using:
module UrlHelper
  def self.join(*paths)
    # yes, Ruby's stdlib really does lack a functional join method for URLs
    File.join(*paths.map(&:to_s))
  end
end

You can use File.join('resource/', '/edit', '12?option=test')

I improved @Maximo Mussini's script to make it work gracefully:
SmartURI.join('http://example.com/subpath', 'hello', query: { token: 'secret' })
# => "http://example.com/subpath/hello?token=secret"
https://gist.github.com/zernel/0f10c71f5a9e044653c1a65c6c5ad697
require 'uri'
module SmartURI
  SEPARATOR = '/'
  def self.join(*paths, query: nil)
    paths = paths.compact.reject(&:empty?)
    last = paths.length - 1
    url = paths.each_with_index.map { |path, index|
      _expand(path, index, last)
    }.join
    if query.nil?
      return url
    elsif query.is_a? Hash
      return url + "?#{URI.encode_www_form(query.to_a)}"
    else
      raise "Unexpected input type for query: #{query}, it should be a hash."
    end
  end
  # Uses the core String#start_with? / #end_with? so no ActiveSupport is needed.
  def self._expand(path, current, last)
    if path.start_with?(SEPARATOR) && current != 0
      path = path[1..-1]
    end
    unless path.end_with?(SEPARATOR) || current == last
      path = [path, SEPARATOR]
    end
    path
  end
end

You can use this:
URI.join('http://exemple.com', '/a/', 'b/', 'c/', 'd')
# => #<URI::HTTP http://exemple.com/a/b/c/d>
URI.join('http://exemple.com', '/a/', 'b/', 'c/', 'd').to_s
# => "http://exemple.com/a/b/c/d"
See: http://ruby-doc.org/stdlib-2.4.1/libdoc/uri/rdoc/URI.html#method-c-join-label-Synopsis

My understanding of URI::join is that it thinks like a web browser does.
To evaluate it, point your mental web browser to the first parameter, and keep clicking links until you browse to the last parameter.
For example, URI::join('http://example.com/resource/', '/edit', '12?option=test'), you would browse like this:
http://example.com/resource/, click a link to /edit (a file at the root of the site)
http://example.com/edit, click a link to 12?option=test (a file in the same directory as edit)
http://example.com/12?option=test
If the first link were /edit/ (with a trailing slash), or /edit/foo, then the next link would be relative to /edit/ rather than /.
This page possibly explains it better than I can: Why is URI.join so counterintuitive?
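To make the browser analogy concrete, here is a small sketch using only the standard library. The example.com URLs are placeholders; the results follow the relative-reference rules described above.

```ruby
require 'uri'

base = 'http://example.com/resource/'

# "/edit" jumps to the site root, so "resource/" is lost, and "12?..." then
# replaces "edit" because "edit" has no trailing slash.
absolute_link = URI.join(base, '/edit', '12?option=test').to_s
puts absolute_link
# => http://example.com/12?option=test

# With relative links that end in a slash, each step descends one level.
relative_links = URI.join(base, 'edit/', '12?option=test').to_s
puts relative_links
# => http://example.com/resource/edit/12?option=test

# A base without a trailing slash is treated as a file, so it gets replaced.
file_like_base = URI.join('http://example.com/resource', 'edit/', '12?option=test').to_s
puts file_like_base
# => http://example.com/edit/12?option=test
```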

This is my simple take on this problem: just splitting up all the path segments and joining them together again. This only works if you're only working with relative path segments, but if that's all you want to do, it is handy.
def join_paths(*paths)
  paths.map { |p| p.split('/') }
       .flatten
       .reject(&:empty?)
       .compact
       .join('/')
end
Then you can use it like so:
join_paths 'foo/', '/bar', 'a/b/c', 'd' #=> "foo/bar/a/b/c/d"

Related

Normalize HTTP URI

I get URIs from Akamai's log files that include entries such as the following:
/foo/jim/jam
/foo/jim/jam?
/foo/./jim/jam
/foo/bar/../jim/jam
/foo/jim/jam?autho=<randomstring>&file=jam
I would like to normalize all of these to the same entry, under the rules:
If there is a query string, strip autho and file from it.
If the query string is empty, remove the trailing ?.
Directory entries for ./ should be removed.
Directory entries for <fulldir>/../ should be removed.
I would have thought that the URI library for Ruby would cover this, but:
It does not provide any mechanism for parsing parts of the query string. (Not that this is hard to do, nor standard.)
It does not remove a trailing ? if the query string is emptied.
URI.parse('/foo?jim').tap{ |u| u.query='' }.to_s #=> "/foo?"
The normalize method does not clean up . or .. in the path.
So, failing an official library, I find myself writing a regex-based solution.
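def normalize(path)
  result = path.dup
  # Strip the autho and file parameters from the query string.
  result.sub!(/(?<=\?).+$/) do |query|
    query.split('&').reject do |kv|
      %w[ autho file ].include?(kv[/^[^=]+/])
    end.join('&')
  end
  # Drop a trailing ? left behind by an empty query string.
  result.sub!(/\?$/, '')
  # Collapse dir/../ and ./ entries in the path portion only.
  result.sub!(/^[^?]+/){ |dirs| dirs.gsub(%r{[^/]+/\.\./},'').gsub('/./','/') }
  # Return the copy; sub! returns nil when nothing was replaced.
  result
end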
It happens to work for the test cases I've listed above, but with 450,000 paths to clean up I cannot hand check them all.
Is there any glaring error with the above, considering likely log file entries?
Is there a better way to accomplish the same that leans on proven parsing techniques instead of my hand-rolled regex?
The addressable gem will normalize these for you:
require 'addressable/uri'
# normalize relative paths
uri = Addressable::URI.parse('http://example.com/foo/bar/../jim/jam')
puts uri.normalize.to_s #=> "http://example.com/foo/jim/jam"
# removes trailing ?
uri = Addressable::URI.parse('http://example.com/foo/jim/jam?')
puts uri.normalize.to_s #=> "http://example.com/foo/jim/jam"
# leaves empty parameters alone
uri = Addressable::URI.parse('http://example.com/foo/jim/jam?jim')
puts uri.normalize.to_s #=> "http://example.com/foo/jim/jam?jim"
# remove specific query parameters
uri = Addressable::URI.parse('http://example.com/foo/jim/jam?autho=<randomstring>&file=jam')
cleaned_query = uri.query_values
cleaned_query.delete('autho')
cleaned_query.delete('file')
uri.query_values = cleaned_query
uri.normalize.to_s #=> "http://example.com/foo/jim/jam"
Something that is REALLY important, like, ESSENTIAL to remember, is that a URL/URI is a protocol, a host, a file-path to a resource, followed by options/parameters being passed to the resource being referenced. (For the pedantic, there are other, optional, things in there too but this is sufficient.)
We can extract the path from a URL by parsing it using the URI class, and using the path method. Once we have the path, we have either an absolute path or a relative path based on the root of the site. Dealing with absolute paths is easy:
require 'uri'
%w[
  /foo/jim/jam
  /foo/jim/jam?
  /foo/./jim/jam
  /foo/bar/../jim/jam
  /foo/jim/jam?autho=<randomstring>&file=jam
].each do |url|
  uri = URI.parse(url)
  path = uri.path
  puts File.absolute_path(path)
end
# >> /foo/jim/jam
# >> /foo/jim/jam
# >> /foo/jim/jam
# >> /foo/jim/jam
# >> /foo/jim/jam
Because the paths are file paths based on the root of the server, we can play games using Ruby's File.absolute_path method to normalize the '.' and '..' away and get a true absolute path. This will break if there are more .. (parent directory) than the chain of directories, but you shouldn't find that in extracted paths since that would also break the server/browser ability to serve/request/receive resources.
It gets a bit more "interesting" when dealing with relative paths but File is still our friend then, but that's a different question.
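For completeness, here is a hedged sketch that covers all four rules from the question with the standard library only. The method name normalize_log_path and the DROPPED_PARAMS constant are invented for this example, and Pathname#cleanpath is swapped in for File.absolute_path so the result does not depend on the current drive or working directory. One caveat: cleanpath also strips a trailing slash from the path.

```ruby
require 'uri'
require 'pathname'

# Assumption: these are the only query parameters to strip.
DROPPED_PARAMS = %w[autho file].freeze

# Hypothetical helper: cleans "." and ".." out of the path, removes the
# unwanted query parameters, and omits the "?" when no parameters remain.
def normalize_log_path(raw)
  uri = URI.parse(raw)
  uri.path = Pathname.new(uri.path).cleanpath.to_s
  pairs = URI.decode_www_form(uri.query.to_s)
             .reject { |key, _value| DROPPED_PARAMS.include?(key) }
  # Setting query to nil (not "") is what drops the trailing "?".
  uri.query = pairs.empty? ? nil : URI.encode_www_form(pairs)
  uri.to_s
end

puts normalize_log_path('/foo/bar/../jim/jam?autho=abc123&file=jam')
# => /foo/jim/jam
```

Note that encode_www_form re-encodes whatever parameters are kept, so their escaping may be normalized along the way.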

Get directory of file that instantiated a class ruby

I have a gem that has code like this inside:
def read(file)
  @file = File.new file, "r"
end
Now the problem is, say you have a directory structure like so:
app/main.rb
app/templates/example.txt
and main.rb has the following code:
require 'mygem'
example = MyGem.read('templates/example.txt')
It comes up with File Not Found: templates/example.txt. It would work if example.txt was in the same directory as main.rb but not if it's in a subdirectory. To solve this problem I've added an optional argument called relative_to in read(). This takes an absolute path, so the above code would need to be:
require 'mygem'
example = MyGem.read('templates/example.txt', File.dirname(__FILE__))
That works fine, but I think it's a bit ugly. Is there any way to make it so the class knows what file read() is being called in and works out the path based on that?
There is an interesting library - i told you it was private. One can protect their methods with it from being called from outside. The code finds the caller method's file and removes it. The offender is found using this line:
offender = caller[0].split(':')[0]
I guess you can use it in your MyGem.read code:
require 'pathname'
def read( file )
  fpath = Pathname.new(file)
  if fpath.relative?
    offender = caller[0].split(':')[0]
    fpath = File.join( File.dirname( offender ), file )
  end
  @file = File.new( fpath, "r" )
end
This way you can use paths, relative to your Mygem caller and not pwd. Exactly the way you tried in your app/main.rb
Well, you can use caller, and a lot more reliably than what the other people said too.
In your gem file, outside of any class or module, put this:
c = caller
req_file = nil
c.each do |s|
  if(s =~ /(require|require_relative)/)
    req_file = File.dirname(File.expand_path(s.split(':')[0])) #Does not work for filepaths with colons!
    break
  end
end
REQUIRING_FILE_PATH = req_file
This will work 90% of the time, unless the requiring script executed a Dir.chdir. The File.expand_path depends on that. I'm afraid that unless your requirer passes their __FILE__, there's nothing you can do if they change the working dir.
Also you may check for caller:
require 'pathname'
def read(file)
  # The capture is named caller_path so it does not overwrite the file argument.
  if /^(?<caller_path>.+?):.*?/ =~ caller(1).first
    caller_dir, _caller_file = Pathname.new(caller_path).split
    file_with_path = File.join caller_dir.to_s, file
    @file = File.new file_with_path, "r"
  end
end
I would not suggest doing so (the code above will break when called indirectly, because of caller(1); see the documentation on caller). Furthermore, the regex above should be tuned more accurately if the caller path is intended to contain colons.
This should work for typical uses (I'm not sure how resistant it is to indirect use, as mentioned by madusobwa above):
def read_relative(file)
  @file = File.new File.join(File.dirname(caller.first), file)
end
On a side note, consider adding a block form of your method that closes the file after yielding. In the current form you're forcing clients to wrap their use of your gem with an ensure.
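A minimal sketch combining the caller-based lookup with the suggested block form might look like the following. It assumes Ruby 2.0 or later for caller_locations, and read_relative here is a standalone illustration rather than the gem's actual API.

```ruby
# Hypothetical helper: resolves `file` against the directory of the file
# that called this method, then opens it. With a block, the file is closed
# automatically after the block returns; without one, the open File is returned.
def read_relative(file, &block)
  location = caller_locations(1, 1).first
  # absolute_path can be nil (for example under eval), so fall back to path.
  base_dir = File.dirname(location.absolute_path || location.path)
  # expand_path leaves an already absolute `file` untouched.
  File.open(File.expand_path(file, base_dir), 'r', &block)
end
```

A caller in app/main.rb could then write read_relative('templates/example.txt') { |f| f.read } without passing File.dirname(__FILE__). Like the other caller-based answers, this breaks if the method is invoked through an intermediate wrapper, because the direct caller is then the wrapper's file.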
Accept a file path String as an argument. Convert to a Pathname object. Check if the path is relative. If yes, then convert to absolute.
require 'pathname'
def read(file)
  fpath = Pathname.new(file)
  if fpath.relative?
    fpath = File.expand_path(File.join(File.dirname(__FILE__),file))
  end
  @file = File.new(fpath,"r")
end
You can make this code more succinct (less verbose).

How to change the value of a url parameter in ruby?

What is a better way to do this?
Ideally using a URI parsing class of some sort, rather than relying on my own regex
url = "http://example.com" # or "http://example.com?after=111"
next_url = url.gsub(/after=\d+/, "after=666")
# Nothing was replaced, so the parameter has to be appended instead.
if next_url.eql?(url)
  if (url.include?('?') == false)
    next_url = url + "?after=666"
  else
    next_url = url + "&after=666"
  end
end
puts next_url
I recommend using the Addressable gem when you are taking URLs apart or putting them together. It's very comprehensive, and has query_values(options = {}) and query_values=(new_query_values) to extract all the query components into a hash, or to rebuild it from a hash. It will also handle decoding and encoding the parameters as needed, things that URI will not do for you.
Not sure about your question, but you probably want something like this?
path, query = url.split('?', 2)
# Split on & first so each key=value pair is handled on its own.
query = (query||'').split('&').map{|kv| k, v = kv.split('=', 2); "#{k}=#{k == 'after' ? 666 : v}"}.join('&')
puts [path, query].join('?')
There's a Ruby library called URI that handles parsing and building of URLs.
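As a hedged sketch of how that might look with the standard library alone (set_query_param is a made-up helper name, and going through a Hash means repeated keys collapse to the last value):

```ruby
require 'uri'

# Hypothetical helper: sets `key` to `value`, replacing an existing value
# in place or appending the parameter if it is missing.
def set_query_param(url, key, value)
  uri = URI.parse(url)
  params = URI.decode_www_form(uri.query.to_s).to_h
  params[key] = value.to_s
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

puts set_query_param('http://example.com', 'after', 666)
# => http://example.com?after=666
puts set_query_param('http://example.com?after=111', 'after', 666)
# => http://example.com?after=666
```

decode_www_form and encode_www_form also take care of percent-decoding and re-encoding the keys and values.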

In Ruby/Rails, how can I encode/escape special characters in URLs?

How do I encode or 'escape' the URL before I use OpenURI to open(url)?
We're using OpenURI to open a remote url and return the xml:
getresult = open(url).read
The problem is the URL contains some user-input text that contains spaces and other characters, including "+", "&", "?", etc. potentially, so we need to safely escape the URL. I saw lots of examples when using Net::HTTP, but have not found any for OpenURI.
We also need to be able to un-escape a similar string we receive in a session variable, so we need the reciprocal function.
Don't use URI.escape as it has been deprecated in 1.9.
Rails' Active Support adds Hash#to_query:
{foo: 'asd asdf', bar: '"<#$dfs'}.to_query
# => "bar=%22%3C%23%24dfs&foo=asd+asdf"
Also, as you can see it tries to order query parameters always the same way, which is good for HTTP caching.
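Outside Rails, one stdlib alternative to URI.escape for a single query value is URI.encode_www_form_component, with URI.decode_www_form_component as its inverse. The sketch below assumes the user text ends up in a query parameter named q, which is purely illustrative.

```ruby
require 'uri'

user_text = 'rock & roll + jazz?'

# Form-encode one component: spaces become "+", reserved characters become %XX.
encoded = URI.encode_www_form_component(user_text)
puts encoded
# => rock+%26+roll+%2B+jazz%3F

# The reciprocal function restores the original text.
decoded = URI.decode_www_form_component(encoded)

url = "http://example.com/search?q=#{encoded}"
```

Because this is form encoding, it suits query-string values; a space inside a path segment should be encoded as %20 instead.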
Ruby Standard Library to the rescue:
require 'uri'
user_text = URI.escape(user_text)
url = "http://example.com/#{user_text}"
result = open(url).read
See more at the docs for the URI::Escape module. It also has a method to do the inverse (unescape)
The main thing you have to consider is that you have to escape the keys and values separately before you compose the full URL.
All the methods which get the full URL and try to escape it afterwards are broken, because they cannot tell whether any & or = character was supposed to be a separator, or maybe a part of the value (or part of the key).
The CGI library seems to do a good job, except for the space character, which was traditionally encoded as +, and nowadays should be encoded as %20. But this is an easy fix.
Please, consider the following:
require 'cgi'
def encode_component(s)
  # The space-encoding is a problem:
  CGI.escape(s).gsub('+','%20')
end
def url_with_params(path, args = {})
  return path if args.empty?
  path + "?" + args.map do |k,v|
    "#{encode_component(k.to_s)}=#{encode_component(v.to_s)}"
  end.join("&")
end
def params_from_url(url)
  path,query = url.split('?',2)
  return [path,{}] unless query
  q = query.split('&').inject({}) do |memo,p|
    k,v = p.split('=',2)
    memo[CGI.unescape(k)] = CGI.unescape(v)
    memo
  end
  return [path, q]
end
u = url_with_params( "http://example.com",
                     "x[1]" => "& ?=/",
                     "2+2=4" => "true" )
# "http://example.com?x%5B1%5D=%26%20%3F%3D%2F&2%2B2%3D4=true"
params_from_url(u)
# ["http://example.com", {"x[1]"=>"& ?=/", "2+2=4"=>"true"}]
Ruby has the built-in URI library, and the Addressable gem, in particular Addressable::URI
I prefer Addressable::URI. It's very full featured and handles the encoding for you when you use the query_values= method.
I've seen some discussions about URI going through some growing pains so I tend to leave it alone for handling encoding/escaping until these things get sorted out:
http://osdir.com/ml/ruby-core/2010-06/msg00324.html
http://osdir.com/ml/lang-ruby-core/2009-06/msg00350.html
http://osdir.com/ml/ruby-core/2011-06/msg00748.html

How to check if a URL is valid

How can I check if a string is a valid URL?
For example:
http://hello.it => yes
http:||bra.ziz, => no
If this is a valid URL, how can I check whether it refers to an image file?
Notice:
As pointed out by @CGuess, there's a bug with this issue and it's been documented for over 9 years now that validation is not the purpose of this regular expression (see https://bugs.ruby-lang.org/issues/6520).
Use the URI module distributed with Ruby:
require 'uri'
if url =~ URI::regexp
  # Correct URL
end
Like Alexander Günther said in the comments, it checks if a string contains a URL.
To check if the string is a URL, use:
url =~ /\A#{URI::regexp}\z/
If you only want to check for web URLs (http or https), use this:
url =~ /\A#{URI::regexp(['http', 'https'])}\z/
Similar to the answers above, I find using this regex to be slightly more accurate:
URI::DEFAULT_PARSER.regexp[:ABS_URI]
That will invalidate URLs with spaces, as opposed to URI.regexp which allows spaces for some reason.
I have recently found a shortcut that is provided for the different URI regexps. You can access any of URI::DEFAULT_PARSER.regexp.keys directly from URI::#{key}.
For example, the :ABS_URI regexp can be accessed from URI::ABS_URI.
The problem with the current answers is that a URI is not a URL.
A URI can be further classified as a locator, a name, or both. The
term "Uniform Resource Locator" (URL) refers to the subset of URIs
that, in addition to identifying a resource, provide a means of
locating the resource by describing its primary access mechanism
(e.g., its network "location").
Since URLs are a subset of URIs, it is clear that matching specifically for URIs will successfully match undesired values. For example, URNs:
"urn:isbn:0451450523" =~ URI::regexp
=> 0
That being said, as far as I know, Ruby doesn't have a default way to parse URLs, so you'll most likely need a gem to do so. If you need to match URLs specifically in HTTP or HTTPS format, you could do something like this:
uri = URI.parse(my_possible_url)
if uri.kind_of?(URI::HTTP) or uri.kind_of?(URI::HTTPS)
  # do your stuff
end
I prefer the Addressable gem. I have found that it handles URLs more intelligently.
require 'addressable/uri'
SCHEMES = %w(http https)
def valid_url?(url)
  parsed = Addressable::URI.parse(url) or return false
  SCHEMES.include?(parsed.scheme)
rescue Addressable::URI::InvalidURIError
  false
end
This is a fairly old entry, but I thought I'd go ahead and contribute:
String.class_eval do
  def is_valid_url?
    uri = URI.parse self
    uri.kind_of? URI::HTTP
  rescue URI::InvalidURIError
    false
  end
end
Now you can do something like:
if "http://www.omg.wtf".is_valid_url?
  p "huzzah!"
end
For me, I use this regular expression:
/\A(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?\z/ix
Option:
i - case insensitive
x - ignore whitespace in regex
You can set this method to check URL validation:
def valid_url?(url)
  return false if url.include?("<script")
  url_regexp = /\A(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?\z/ix
  url =~ url_regexp ? true : false
end
To use it:
valid_url?("http://stackoverflow.com/questions/1805761/check-if-url-is-valid-ruby")
Testing with wrong URLs:
http://ruby3arabi - result is invalid
http://http://ruby3arabi.com - result is invalid
http:// - result is invalid
http://test.com\n<script src=\"nasty.js\"> (Just simply check "<script")
127.0.0.1 - not support IP address
Test with correct URLs:
http://ruby3arabi.com - result is valid
http://www.ruby3arabi.com - result is valid
https://www.ruby3arabi.com - result is valid
https://www.ruby3arabi.com/article/1 - result is valid
https://www.ruby3arabi.com/websites/58e212ff6d275e4bf9000000?locale=en - result is valid
In general,
/^#{URI::regexp}$/
will work well, but if you only want to match http or https, you can pass those in as options to the method:
/^#{URI::regexp(%w(http https))}$/
That tends to work a little better, if you want to reject protocols like ftp://.
This is a little bit old but here is how I do it. Use Ruby's URI module to parse the URL. If it can be parsed then it's a valid URL. (But that doesn't mean accessible.)
URI supports many schemes, plus you can add custom schemes yourself:
irb> uri = URI.parse "http://hello.it" rescue nil
=> #<URI::HTTP:0x10755c50 URL:http://hello.it>
irb> uri.instance_values
=> {"fragment"=>nil,
"registry"=>nil,
"scheme"=>"http",
"query"=>nil,
"port"=>80,
"path"=>"",
"host"=>"hello.it",
"password"=>nil,
"user"=>nil,
"opaque"=>nil}
irb> uri = URI.parse "http:||bra.ziz" rescue nil
=> nil
irb> uri = URI.parse "ssh://hello.it:5888" rescue nil
=> #<URI::Generic:0x105fe938 URL:ssh://hello.it:5888>
irb> uri.instance_values
=> {"fragment"=>nil,
"registry"=>nil,
"scheme"=>"ssh",
"query"=>nil,
"port"=>5888,
"path"=>"",
"host"=>"hello.it",
"password"=>nil,
"user"=>nil,
"opaque"=>nil}
See the documentation for more information about the URI module.
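Building on the parse-based check, the sketch below also takes a stab at the image part of the question. The helper names and the IMAGE_EXTENSIONS list are assumptions made for this example, and the extension test only looks at the URL's path; it does not fetch the resource or inspect its content type.

```ruby
require 'uri'

# Assumption: a file extension in this list is treated as "an image".
IMAGE_EXTENSIONS = %w[.png .jpg .jpeg .gif .webp].freeze

# True when the string parses as an http(s) URL with a host.
def http_url?(string)
  uri = URI.parse(string)
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

# True when the string is an http(s) URL whose path ends in an image extension.
def image_url?(string)
  return false unless http_url?(string)
  IMAGE_EXTENSIONS.include?(File.extname(URI.parse(string).path).downcase)
end
```

URI::HTTPS is a subclass of URI::HTTP, so the single is_a? check covers both schemes.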
You could also use a regex, maybe something like http://www.geekzilla.co.uk/View2D3B0109-C1B2-4B4E-BFFD-E8088CBC85FD.htm assuming this regex is correct (I haven't fully checked it) the following will show the validity of the url.
# Single quotes keep \w and \d intact; in a double-quoted string they would turn into plain "w" and "d".
url_regex = Regexp.new('((https?|ftp|file):((//)|(\\\\))+[\w\d:\#@%/;$()~_?\+-=\\\\.&]*)')
urls = [
  "http://hello.it",
  "http:||bra.ziz"
]
urls.each { |url|
  if url =~ url_regex then
    puts "%s is valid" % url
  else
    puts "%s not valid" % url
  end
}
The above example outputs:
http://hello.it is valid
http:||bra.ziz not valid
