Extract all urls inside a string in Ruby - ruby

I have some text content with a list of URLs contained in it.
I am trying to grab all the URLs out and put them in an array.
I have this code
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"
urls = content.scan(/^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix)
I am trying to get the end results to be:
['http://www.google.com', 'http://www.google.com/index.html']
The above code does not seem to be working correctly. Does anyone know what I am doing wrong?
Thanks

Easy:
ruby-1.9.2-p136 :006 > require 'uri'
ruby-1.9.2-p136 :006 > URI.extract(content, ['http', 'https'])
=> ["http://www.google.com", "http://www.google.com/index.html"]

A different approach, from the perfect-is-the-enemy-of-the-good school of thought:
urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }

I haven't checked the syntax of your regex, but String.scan will produce an array, each of whose members is an array of the groups matched by your regex. So I'd expect the result to be:
[['http', '.google.com'], ...]
You'll need non-matching groups /(?:stuff)/ if you want the format you've given.
Edit (looking at regex): Also, your regex does look a bit wrong. You don't want the start and end anchors (^ and $), since you don't expect the matches to be at start and end of content. Secondly, if your ([0-9]{1,5})? is trying to capture a port number, I think you're missing a colon to separate the domain from the port.
Further edit, after playing: I think you want something like this:
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html http://example.com:3000/foo"
urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
# => ["http://www.google.com", "http://www.google.com/index.html", "http://example.com:3000/foo"]
... but note that it won't match pure IP-address URLs (like http://127.0.0.1), because of the [a-z]{2,5} for the TLD.

just for your interest:
Ruby has an URI Module, which has a regex implemented to do such things:
require "uri"
uris_you_want_to_grap = ['ftp','http','https','ftp','mailto','see']
html_string.scan(URI.regexp(uris_you_want_to_grap)) do |*matches|
urls << $&
end
For more information visit the Ruby Ref: URI

The most upvoted answer was causing issues with Markdown URLs for me, so I had to figure out a regex to extract URLs. Below is what I use:
URL_REGEX = /(https?:\/\/\S+?)(?:[\s)]|$)/i
content.scan(URL_REGEX).flatten
The last part here (?:[\s)]|$) is used to identify the end of the URL and you can add characters there as per your need and content. Right now it looks for any space characters, closing bracket or end of string.
content = "link in text [link1](http://www.example.com/test) and [link2](http://www.example.com/test2)
http://www.example.com/test3
http://www.example.com/test4"
returns ["http://www.example.com/test", "http://www.example.com/test2", "http://www.example.com/test3", "http://www.example.com/test4"].

Related

What is a regex to check to see if some text contains only URLs?

I'm trying to make a regular expression that checks if some text only contains urls and whitespaces and nothing else so:
http://www.google.com http://www.stackoverflow.com
would match, but:
http://www.google.com and http://www.stackoverflow.com
would not match.
Is this possible?
you can use this regex (only test if that is between spaces begin with http://):
/^(?:https?:\/\/\S++\s*+)++$/ =~ text
Ruby already has a method to extract URLs, so that's a great starting place, rather than reinventing a working wheel:
require 'uri'
[
'http://www.google.com http://www.stackoverflow.com',
'http://www.google.com and http://www.stackoverflow.com'
].each do |url|
print url
if url.split.all? { |u| !URI.extract(u).empty? }
puts " contains only URLs"
else
puts " doesn't contain only URLs"
end
end
Which, after running, is:
http://www.google.com http://www.stackoverflow.com contains only URLs
http://www.google.com and http://www.stackoverflow.com doesn't contain only URLs
This doesn't support all the recognized URL schemes, but it is a starting point. You can specify which you want by passing an array of schemes to extract. You can get the IANA's permanent list using:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open('http://www.iana.org/assignments/uri-schemes.html'))
schemes = doc.at('table table').search('tr').map{ |tr| tr.at('td').text }[1..-1]
words.split.all? { |word| word.match(/^http:/) }
This will check for any URL and the string should be URLs with single white-space as URLs separator only
Look at this live demo
(((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)\s){1,}((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)$
Reference:
http://www.regular-expressions.info/reference.html
http://regexlib.com/Search.aspx?k=URL&AspxAutoDetectCookieSupport=1
If you really want to use regex, please try this:
(?< protocol>\w+):\/\/(?< domain>[\w#][\w.:#]+)\/?[\w\.?=%&=\-#/$,]*
Please remove the space before 'protocol' and 'domain'.
Split the string with the whitespaces, and check each string if it is match with the regex above.
Hope it helps!

How to parse a URL and extract the required substring

Say I have a string like this: "http://something.example.com/directory/"
What I want to do is to parse this string, and extract the "something" from the string.
The first step, is to obviously check to make sure that the string contains "http://" - otherwise, it should ignore the string.
But, how do I then just extract the "something" in that string? Assume that all the strings that this will be evaluating will have a similar structure (i.e. I am trying to extract the subdomain of the URL - if the string being examined is indeed a valid URL - where valid is starts with "http://").
Thanks.
P.S. I know how to check the first part, i.e. I can just simply split the string at the "http://" but that doesn't solve the full problem because that will produce "http://something.example.com/directory/". All I want is the "something", nothing else.
I'd do it this way:
require 'uri'
uri = URI.parse('http://something.example.com/directory/')
uri.host.split('.').first
=> "something"
URI is built into Ruby. It's not the most full-featured but it's plenty capable of doing this task for most URLs. If you have IRIs then look at Addressable::URI.
You could use URI like
uri = URI.parse("http://something.example.com/directory/")
puts uri.host
# "something.example.com"
and you could then just work on the host.
Or there is a gem domainatrix from Remove subdomain from string in ruby
require 'rubygems'
require 'domainatrix'
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
and you could just take the subdomain.
Well, you can use regular expressions.
Something like /http:\/\/([^\.]+)/, that is, the first group of non '.' letters after http.
Check out http://rubular.com/. You can test your regular expressions against a set of tests too, it's great for learning this tool.
with URI.parse you can get:
require "uri"
uri = URI.parse("http://localhost:3000")
uri.scheme # http
uri.host # localhost
uri.port # 3000

In Ruby/Rails, how can I encode/escape special characters in URLs?

How do I encode or 'escape' the URL before I use OpenURI to open(url)?
We're using OpenURI to open a remote url and return the xml:
getresult = open(url).read
The problem is the URL contains some user-input text that contains spaces and other characters, including "+", "&", "?", etc. potentially, so we need to safely escape the URL. I saw lots of examples when using Net::HTTP, but have not found any for OpenURI.
We also need to be able to un-escape a similar string we receive in a session variable, so we need the reciprocal function.
Don't use URI.escape as it has been deprecated in 1.9.
Rails' Active Support adds Hash#to_query:
{foo: 'asd asdf', bar: '"<#$dfs'}.to_query
# => "bar=%22%3C%23%24dfs&foo=asd+asdf"
Also, as you can see it tries to order query parameters always the same way, which is good for HTTP caching.
Ruby Standard Library to the rescue:
require 'uri'
user_text = URI.escape(user_text)
url = "http://example.com/#{user_text}"
result = open(url).read
See more at the docs for the URI::Escape module. It also has a method to do the inverse (unescape)
The main thing you have to consider is that you have to escape the keys and values separately before you compose the full URL.
All the methods which get the full URL and try to escape it afterwards are broken, because they cannot tell whether any & or = character was supposed to be a separator, or maybe a part of the value (or part of the key).
The CGI library seems to do a good job, except for the space character, which was traditionally encoded as +, and nowadays should be encoded as %20. But this is an easy fix.
Please, consider the following:
require 'cgi'
def encode_component(s)
# The space-encoding is a problem:
CGI.escape(s).gsub('+','%20')
end
def url_with_params(path, args = {})
return path if args.empty?
path + "?" + args.map do |k,v|
"#{encode_component(k.to_s)}=#{encode_component(v.to_s)}"
end.join("&")
end
def params_from_url(url)
path,query = url.split('?',2)
return [path,{}] unless query
q = query.split('&').inject({}) do |memo,p|
k,v = p.split('=',2)
memo[CGI.unescape(k)] = CGI.unescape(v)
memo
end
return [path, q]
end
u = url_with_params( "http://example.com",
"x[1]" => "& ?=/",
"2+2=4" => "true" )
# "http://example.com?x%5B1%5D=%26%20%3F%3D%2F&2%2B2%3D4=true"
params_from_url(u)
# ["http://example.com", {"x[1]"=>"& ?=/", "2+2=4"=>"true"}]
Ruby has the built-in URI library, and the Addressable gem, in particular Addressable::URI
I prefer Addressable::URI. It's very full featured and handles the encoding for you when you use the query_values= method.
I've seen some discussions about URI going through some growing pains so I tend to leave it alone for handling encoding/escaping until these things get sorted out:
http://osdir.com/ml/ruby-core/2010-06/msg00324.html
http://osdir.com/ml/lang-ruby-core/2009-06/msg00350.html
http://osdir.com/ml/ruby-core/2011-06/msg00748.html

Getting portion of href attribute using hpricot

I think I need a combo of hpricot and regex here. I need to search for 'a' tags with an 'href' attribute that starts with 'abc/', and returns the text following that until the next forward slash '/'.
So, given:
One
Two
I need to get back:
'12345'
and
'67890'
Can anyone lend a hand? I've been struggling with this.
You don't need regex but you can use it. Here's two examples, one with regex and the other without, using Nokogiri, which should be compatible with Hpricot for your use, and uses CSS accessors:
require 'nokogiri'
html = %q[
One
Two
]
doc = Nokogiri::HTML(html)
doc.css('a[#href]').map{ |h| h['href'][/(\d+)/, 1] } # => ["12345", "67890"]
doc.css('a[#href]').map{ |h| h['href'].split('/')[2] } # => ["12345", "67890"]
or use regex:
s = 'One'
s =~ /abc\/([^\/]*)/
return $1
What about splitting the string by /?
(I don't know Hpricot, but according to the docs):
doc.search("a[#href]").each do |a|
return a.somemethodtogettheattribute("href").split("/")[2]; // 2, because the string starts with '/'
end

Regex to remove text before "http://"?

I have a ruby app parsing a bunch of URLs from strings:
#text = "a string with a url http://example.com"
#text.split.grep(/http[s]?:\/\/\w/)
#text[0] = "http://example.com"
This works fine ^^
But sometimes the URLs have text before the HTTP:// for example
#text = "What's a spacebar? ...http://example.com"
#text[0] = "...http://example.com"
Is there a regex that can select just the text before "http://" in a string so I can strip it out?
Perhaps a nicer way to achieve the same result is to use the URI standard library.
require 'uri'
text = "a string with a url http://example.com and another URL here:http://2.example.com and this here"
URI.extract(text, ['http', 'https'])
# => ["http://example.com", "http://2.example.com"]
Documentation: URI.extract
Spliting and then grepping is an odd way to do this. Why don't you just use String#scan:
#text = "a string with a url http://example.com"
urls = #text.scan(/http[s]?:\/\/\S+/)
url[0] # => "http://example.com"
.*(?=http://)
or you could combine the two.
.*(?=(f|ht)tp[s]://)
Just search for http://, then remove the parts of the string before that (as the =~ returns the offset into the string)

Resources