How to parse a URL and extract the required substring - ruby

Say I have a string like this: "http://something.example.com/directory/"
What I want to do is to parse this string, and extract the "something" from the string.
The first step, is to obviously check to make sure that the string contains "http://" - otherwise, it should ignore the string.
But, how do I then just extract the "something" in that string? Assume that all the strings that this will be evaluating will have a similar structure (i.e. I am trying to extract the subdomain of the URL - if the string being examined is indeed a valid URL - where valid is starts with "http://").
Thanks.
P.S. I know how to check the first part, i.e. I can just simply split the string at the "http://" but that doesn't solve the full problem because that will produce "http://something.example.com/directory/". All I want is the "something", nothing else.

I'd do it this way:
require 'uri'
uri = URI.parse('http://something.example.com/directory/')
uri.host.split('.').first
=> "something"
URI is built into Ruby. It's not the most full-featured but it's plenty capable of doing this task for most URLs. If you have IRIs then look at Addressable::URI.

You could use URI like
uri = URI.parse("http://something.example.com/directory/")
puts uri.host
# "something.example.com"
and you could then just work on the host.
Or there is a gem domainatrix from Remove subdomain from string in ruby
require 'rubygems'
require 'domainatrix'
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
and you could just take the subdomain.

Well, you can use regular expressions.
Something like /http:\/\/([^\.]+)/, that is, the first group of non '.' letters after http.
Check out http://rubular.com/. You can test your regular expressions against a set of tests too, it's great for learning this tool.

with URI.parse you can get:
require "uri"
uri = URI.parse("http://localhost:3000")
uri.scheme # http
uri.host # localhost
uri.port # 3000

Related

How to replace string in URL with captured regex pattern

I want to replace 'hoge' to 'foo' with regex. But the user's value is dynamic so I can't use str.gsub('hoge', 'foo').
str = '?user=hoge&tab=fuga'
What should I do?
Don't do this with a regular expression.
This is how to manipulate URIs using the existing wheels:
require 'uri'
str = 'http://example.com?user=hoge&tab=fuga'
uri = URI.parse(str)
query = URI.decode_www_form(uri.query).to_h # => {"user"=>"hoge", "tab"=>"fuga"}
query['user'] = 'foo'
uri.query = URI.encode_www_form(query)
uri.to_s # => "http://example.com?user=foo&tab=fuga"
Alternately:
require 'addressable'
uri = Addressable::URI.parse('http://example.com?tab=fuga&user=hoge')
query = uri.query_values # => {"tab"=>"fuga", "user"=>"hoge"}
query['user'] = 'foo'
uri.query_values = query
uri.to_s # => "http://example.com?tab=fuga&user=foo"
Note that in the examples the order of the parameters changed, but the code handled the difference without problems.
The reason you want to use URI or Addressable is because parameters and values have to be correctly encoded when they contain illegal characters. URI and Addressable know the rules and will follow them, whereas naive code assumes it's OK to not bother with encoding, causing broken URIs.
URI is part of the Ruby Standard Library, and Addressable is more full-featured. Take your pick.
You can try below regex
([?&]user=)([^&]+)
DEMO
You probably want to find out what the user query maps to first before using a .gsub to replace whatever value it is.
First, parse the URL string into an URI object using the URI module. And then, you can use the CGI query methods to get the key value pairs of the query params off the URI object using the CGI module. And finally, you can .gsub off the values in that hash.

Regular expression in ruby?

I have a URL like below.
/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db"
I need to extract only the id of the play (i.e. 5b35a825-d372-4375-b2f0-f641a38067db) using regular expression. How can I do it?
I would not use a regexp to parse a url. I would use Ruby's libraries to handle URLs:
require 'uri'
url = '/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db'
uri = URI.parse(url)
params = URI::decode_www_form(uri.query).to_h
params['play']
# => 5b35a825-d372-4375-b2f0-f641a38067db
You can do:
str = '/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db'
match = str.match(/.*\?play=([^&]+)/)
puts match[1]
=> "5b35a825-d372-4375-b2f0-f641a38067db"
The regex /.*\?play=([^&]+)/ will match everything up until ?play=, and then capture anything that is not a & (the query string parameter separator)
A match will create a MatchData object, represented here by match variable, and captures will be indices of the object, hence your matched data is available at match[1].
url = '/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db'
url.split("play=")[1] #=> "5b35a825-d372-4375-b2f0-f641a38067db"
Ruby's built-in URI class has everything needed to correctly parse, split and decode URLs:
require 'uri'
uri = URI.parse('/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db')
URI::decode_www_form(uri.query).to_h['play'] # => "5b35a825-d372-4375-b2f0-f641a38067db"
If you're using an older Ruby that doesn't support to_h, use:
Hash[URI::decode_www_form(uri.query)]['play'] # => "5b35a825-d372-4375-b2f0-f641a38067db"
You should use URI, rather than try to split/extract using a regexp, because the query of a URI will be encoded if any values are not within the characters allowed by the spec. URI, or Addressable::URI, will decode those back to their original values for you.

How can I remove Google tracking parameters (UTM) from an URL?

I have a bunch of URLs which I would like to clean. They all contain UTM parameters, which are not necessary, or rather harmful in this case. Example:
http://houseofbuttons.tumblr.com/post/22326009438?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+HouseOfButtons+%28House+of+Buttons%29
All potential parameters begin with utm_.
How can I remove them easily with a ruby script / structure without destroying other potentialy "good" URL parameters?
You can apply a regex to the urls to clean them up. Something like this should do the trick:
url = 'http://houseofbuttons.tumblr.com/post/22326009438?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+HouseOfButtons+%28House+of+Buttons%29&normal_param=1'
url.gsub(/&?utm_.+?(&|$)/, '') => "http://houseofbuttons.tumblr.com/post/22326009438?normal_param=1"
This uses the URI lib to deconstruct and change the querystring (no regex):
require 'uri'
str ='http://houseofbuttons.tumblr.com/post/22326009438?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+HouseOfButtons+%28House+of+Buttons%29&normal_param=1'
uri = URI.parse(str)
clean_key_vals = URI.decode_www_form(uri.query).reject{|k, _| k.start_with?('utm_')}
uri.query = URI.encode_www_form(clean_key_vals)
p uri.to_s #=> "http://houseofbuttons.tumblr.com/post/22326009438?normal_param=1"

Extract all urls inside a string in Ruby

I have some text content with a list of URLs contained in it.
I am trying to grab all the URLs out and put them in an array.
I have this code
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"
urls = content.scan(/^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix)
I am trying to get the end results to be:
['http://www.google.com', 'http://www.google.com/index.html']
The above code does not seem to be working correctly. Does anyone know what I am doing wrong?
Thanks
Easy:
ruby-1.9.2-p136 :006 > require 'uri'
ruby-1.9.2-p136 :006 > URI.extract(content, ['http', 'https'])
=> ["http://www.google.com", "http://www.google.com/index.html"]
A different approach, from the perfect-is-the-enemy-of-the-good school of thought:
urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
I haven't checked the syntax of your regex, but String.scan will produce an array, each of whose members is an array of the groups matched by your regex. So I'd expect the result to be:
[['http', '.google.com'], ...]
You'll need non-matching groups /(?:stuff)/ if you want the format you've given.
Edit (looking at regex): Also, your regex does look a bit wrong. You don't want the start and end anchors (^ and $), since you don't expect the matches to be at start and end of content. Secondly, if your ([0-9]{1,5})? is trying to capture a port number, I think you're missing a colon to separate the domain from the port.
Further edit, after playing: I think you want something like this:
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html http://example.com:3000/foo"
urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
# => ["http://www.google.com", "http://www.google.com/index.html", "http://example.com:3000/foo"]
... but note that it won't match pure IP-address URLs (like http://127.0.0.1), because of the [a-z]{2,5} for the TLD.
just for your interest:
Ruby has an URI Module, which has a regex implemented to do such things:
require "uri"
uris_you_want_to_grap = ['ftp','http','https','ftp','mailto','see']
html_string.scan(URI.regexp(uris_you_want_to_grap)) do |*matches|
urls << $&
end
For more information visit the Ruby Ref: URI
The most upvoted answer was causing issues with Markdown URLs for me, so I had to figure out a regex to extract URLs. Below is what I use:
URL_REGEX = /(https?:\/\/\S+?)(?:[\s)]|$)/i
content.scan(URL_REGEX).flatten
The last part here (?:[\s)]|$) is used to identify the end of the URL and you can add characters there as per your need and content. Right now it looks for any space characters, closing bracket or end of string.
content = "link in text [link1](http://www.example.com/test) and [link2](http://www.example.com/test2)
http://www.example.com/test3
http://www.example.com/test4"
returns ["http://www.example.com/test", "http://www.example.com/test2", "http://www.example.com/test3", "http://www.example.com/test4"].

How to check if a URL is valid

How can I check if a string is a valid URL?
For example:
http://hello.it => yes
http:||bra.ziz, => no
If this is a valid URL how can I check if this is relative to a image file?
Notice:
As pointed by #CGuess, there's a bug with this issue and it's been documented for over 9 years now that validation is not the purpose of this regular expression (see https://bugs.ruby-lang.org/issues/6520).
Use the URI module distributed with Ruby:
require 'uri'
if url =~ URI::regexp
# Correct URL
end
Like Alexander Günther said in the comments, it checks if a string contains a URL.
To check if the string is a URL, use:
url =~ /\A#{URI::regexp}\z/
If you only want to check for web URLs (http or https), use this:
url =~ /\A#{URI::regexp(['http', 'https'])}\z/
Similar to the answers above, I find using this regex to be slightly more accurate:
URI::DEFAULT_PARSER.regexp[:ABS_URI]
That will invalidate URLs with spaces, as opposed to URI.regexp which allows spaces for some reason.
I have recently found a shortcut that is provided for the different URI rgexps. You can access any of URI::DEFAULT_PARSER.regexp.keys directly from URI::#{key}.
For example, the :ABS_URI regexp can be accessed from URI::ABS_URI.
The problem with the current answers is that a URI is not an URL.
A URI can be further classified as a locator, a name, or both. The
term "Uniform Resource Locator" (URL) refers to the subset of URIs
that, in addition to identifying a resource, provide a means of
locating the resource by describing its primary access mechanism
(e.g., its network "location").
Since URLs are a subset of URIs, it is clear that matching specifically for URIs will successfully match undesired values. For example, URNs:
"urn:isbn:0451450523" =~ URI::regexp
=> 0
That being said, as far as I know, Ruby doesn't have a default way to parse URLs , so you'll most likely need a gem to do so. If you need to match URLs specifically in HTTP or HTTPS format, you could do something like this:
uri = URI.parse(my_possible_url)
if uri.kind_of?(URI::HTTP) or uri.kind_of?(URI::HTTPS)
# do your stuff
end
I prefer the Addressable gem. I have found that it handles URLs more intelligently.
require 'addressable/uri'
SCHEMES = %w(http https)
def valid_url?(url)
parsed = Addressable::URI.parse(url) or return false
SCHEMES.include?(parsed.scheme)
rescue Addressable::URI::InvalidURIError
false
end
This is a fairly old entry, but I thought I'd go ahead and contribute:
String.class_eval do
def is_valid_url?
uri = URI.parse self
uri.kind_of? URI::HTTP
rescue URI::InvalidURIError
false
end
end
Now you can do something like:
if "http://www.omg.wtf".is_valid_url?
p "huzzah!"
end
For me, I use this regular expression:
/\A(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?\z/ix
Option:
i - case insensitive
x - ignore whitespace in regex
You can set this method to check URL validation:
def valid_url?(url)
return false if url.include?("<script")
url_regexp = /\A(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?\z/ix
url =~ url_regexp ? true : false
end
To use it:
valid_url?("http://stackoverflow.com/questions/1805761/check-if-url-is-valid-ruby")
Testing with wrong URLs:
http://ruby3arabi - result is invalid
http://http://ruby3arabi.com - result is invalid
http:// - result is invalid
http://test.com\n<script src=\"nasty.js\"> (Just simply check "<script")
127.0.0.1 - not support IP address
Test with correct URLs:
http://ruby3arabi.com - result is valid
http://www.ruby3arabi.com - result is valid
https://www.ruby3arabi.com - result is valid
https://www.ruby3arabi.com/article/1 - result is valid
https://www.ruby3arabi.com/websites/58e212ff6d275e4bf9000000?locale=en - result is valid
In general,
/^#{URI::regexp}$/
will work well, but if you only want to match http or https, you can pass those in as options to the method:
/^#{URI::regexp(%w(http https))}$/
That tends to work a little better, if you want to reject protocols like ftp://.
This is a little bit old but here is how I do it. Use Ruby's URI module to parse the URL. If it can be parsed then it's a valid URL. (But that doesn't mean accessible.)
URI supports many schemes, plus you can add custom schemes yourself:
irb> uri = URI.parse "http://hello.it" rescue nil
=> #<URI::HTTP:0x10755c50 URL:http://hello.it>
irb> uri.instance_values
=> {"fragment"=>nil,
"registry"=>nil,
"scheme"=>"http",
"query"=>nil,
"port"=>80,
"path"=>"",
"host"=>"hello.it",
"password"=>nil,
"user"=>nil,
"opaque"=>nil}
irb> uri = URI.parse "http:||bra.ziz" rescue nil
=> nil
irb> uri = URI.parse "ssh://hello.it:5888" rescue nil
=> #<URI::Generic:0x105fe938 URL:ssh://hello.it:5888>
[26] pry(main)> uri.instance_values
=> {"fragment"=>nil,
"registry"=>nil,
"scheme"=>"ssh",
"query"=>nil,
"port"=>5888,
"path"=>"",
"host"=>"hello.it",
"password"=>nil,
"user"=>nil,
"opaque"=>nil}
See the documentation for more information about the URI module.
You could also use a regex, maybe something like http://www.geekzilla.co.uk/View2D3B0109-C1B2-4B4E-BFFD-E8088CBC85FD.htm assuming this regex is correct (I haven't fully checked it) the following will show the validity of the url.
url_regex = Regexp.new("((https?|ftp|file):((//)|(\\\\))+[\w\d:\##%/;$()~_?\+-=\\\\.&]*)")
urls = [
"http://hello.it",
"http:||bra.ziz"
]
urls.each { |url|
if url =~ url_regex then
puts "%s is valid" % url
else
puts "%s not valid" % url
end
}
The above example outputs:
http://hello.it is valid
http:||bra.ziz not valid

Resources