Regex to remove text before "http://"? - ruby

I have a Ruby app parsing a bunch of URLs from strings:
@text = "a string with a url http://example.com"
urls = @text.split.grep(/https?:\/\/\w/)
urls[0] # => "http://example.com"
This works fine ^^
But sometimes the URLs have text before the http://, for example:
@text = "What's a spacebar? ...http://example.com"
urls = @text.split.grep(/https?:\/\/\w/)
urls[0] # => "...http://example.com"
Is there a regex that can select just the text before "http://" in a string so I can strip it out?

Perhaps a nicer way to achieve the same result is to use the URI standard library.
require 'uri'
text = "a string with a url http://example.com and another URL here:http://2.example.com and this here"
URI.extract(text, ['http', 'https'])
# => ["http://example.com", "http://2.example.com"]
Documentation: URI.extract

Splitting and then grepping is an odd way to do this. Why don't you just use String#scan:
@text = "a string with a url http://example.com"
urls = @text.scan(/https?:\/\/\S+/)
urls[0] # => "http://example.com"

.*(?=http://)

Or you could combine the two:
.*(?=(f|ht)tps?://)
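As a rough sketch of how a lookahead pattern like the ones above might be applied in Ruby, String#sub can delete whatever precedes the scheme. The names LEADING_JUNK and strip_before_url are made up for illustration, and the lazy `.*?` (instead of the greedy `.*` shown above) is an assumption on my part, so that only the text before the first URL is removed:

```ruby
# \A anchors at the start of the string, .*? lazily consumes the junk,
# and the lookahead (?=...) asserts that a scheme follows without consuming it.
LEADING_JUNK = %r{\A.*?(?=(?:f|ht)tps?://)}m

# Hypothetical helper: returns the token with any leading non-URL text removed.
def strip_before_url(token)
  token.sub(LEADING_JUNK, "")
end

puts strip_before_url("...http://example.com")  # prints http://example.com
puts strip_before_url("no url here")             # prints the input unchanged
```

A string with no URL comes back untouched because the pattern never matches, so this is safe to run over every token.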

Just search for http://, then remove the parts of the string before that (as the =~ returns the offset into the string)
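The offset idea above could look roughly like this. The helper name from_first_url is hypothetical, and the sketch assumes it is applied to a single whitespace-separated token (on a whole sentence it would also keep any words after the URL):

```ruby
# Hypothetical helper: returns the part of the string starting at the first
# http:// or https://, or nil when no scheme is found.
def from_first_url(text)
  offset = text =~ %r{https?://}   # Integer index of the match, or nil
  offset ? text[offset..-1] : nil
end

puts from_first_url("...http://example.com")  # prints http://example.com
```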

Related

Regular expression in ruby?

I have a URL like below.
/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db
I need to extract only the id of the play (i.e. 5b35a825-d372-4375-b2f0-f641a38067db) using regular expression. How can I do it?
I would not use a regexp to parse a url. I would use Ruby's libraries to handle URLs:
require 'uri'
url = '/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db'
uri = URI.parse(url)
params = URI::decode_www_form(uri.query).to_h
params['play']
# => "5b35a825-d372-4375-b2f0-f641a38067db"
You can do:
str = '/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db'
match = str.match(/.*\?play=([^&]+)/)
puts match[1]
# prints 5b35a825-d372-4375-b2f0-f641a38067db
The regex /.*\?play=([^&]+)/ matches everything up to and including ?play=, and then captures anything that is not a & (the query string parameter separator).
A successful match returns a MatchData object, held here in the match variable. Captures are available by index on that object, so the captured id is at match[1].
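One thing the snippet above glosses over is that String#match returns nil when nothing matches, so match[1] would raise. A small sketch with a named capture and a nil guard follows; the helper name play_id and the extra autoplay parameter are invented for the example:

```ruby
# Hypothetical helper: returns the value of the play parameter, or nil if absent.
def play_id(url)
  # (?<id>...) is a named capture; [^&#]+ stops at the next parameter or fragment.
  match = url.match(/[?&]play=(?<id>[^&#]+)/)
  match && match[:id]
end

puts play_id('/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db&autoplay=1')
```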
url = '/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db'
url.split("play=")[1] #=> "5b35a825-d372-4375-b2f0-f641a38067db"
Ruby's built-in URI class has everything needed to correctly parse, split and decode URLs:
require 'uri'
uri = URI.parse('/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db')
URI::decode_www_form(uri.query).to_h['play'] # => "5b35a825-d372-4375-b2f0-f641a38067db"
If you're using an older Ruby that doesn't support to_h, use:
Hash[URI::decode_www_form(uri.query)]['play'] # => "5b35a825-d372-4375-b2f0-f641a38067db"
You should use URI, rather than try to split/extract using a regexp, because the query of a URI will be encoded if any values are not within the characters allowed by the spec. URI, or Addressable::URI, will decode those back to their original values for you.
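To make the encoding point concrete, here is a minimal sketch comparing the two approaches on a percent-encoded value. The URL, the title parameter and the helper name query_params are made up for illustration:

```ruby
require 'uri'

# Hypothetical helper: parses the query string of a URL into a Hash.
def query_params(url)
  query = URI.parse(url).query
  query ? URI.decode_www_form(query).to_h : {}
end

url = '/search?title=Ruby%20%26%20Rails&play=5b35a825'

puts query_params(url)['title']   # prints Ruby & Rails
puts url[/title=([^&]+)/, 1]      # prints Ruby%20%26%20Rails (still encoded)
```

The regexp version hands back the raw, still-encoded text, which is the problem the answer above warns about.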

What is a regex to check to see if some text contains only URLs?

I'm trying to make a regular expression that checks if some text only contains urls and whitespaces and nothing else so:
http://www.google.com http://www.stackoverflow.com
would match, but:
http://www.google.com and http://www.stackoverflow.com
would not match.
Is this possible?
You can use this regex (it only checks that each whitespace-separated piece begins with http:// or https://):
/^(?:https?:\/\/\S++\s*+)++$/ =~ text
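One caveat: in Ruby, ^ and $ match at line boundaries rather than at the ends of the string, so the pattern above can succeed on a single line inside a longer multi-line text. A hedged variant using \A and \z is sketched below; the constant and method names are my own, not part of the answer:

```ruby
# Same idea as the pattern above, but anchored to the whole string.
# The ++ and *+ quantifiers are possessive, so the engine never backtracks into them.
ONLY_URLS = %r{\A(?:https?://\S++\s*+)++\z}

# Hypothetical helper: true when the text consists solely of http(s) URLs and whitespace.
def only_urls?(text)
  ONLY_URLS.match?(text)
end

puts only_urls?("http://www.google.com http://www.stackoverflow.com")      # prints true
puts only_urls?("http://www.google.com and http://www.stackoverflow.com")  # prints false
puts only_urls?("foo bar\nhttp://www.google.com")                          # prints false
```

Leading whitespace is rejected by this sketch; call strip on the input first if that matters.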
Ruby already has a method to extract URLs, so that's a great starting place, rather than reinventing a working wheel:
require 'uri'
[
  'http://www.google.com http://www.stackoverflow.com',
  'http://www.google.com and http://www.stackoverflow.com'
].each do |url|
  print url
  if url.split.all? { |u| !URI.extract(u).empty? }
    puts " contains only URLs"
  else
    puts " doesn't contain only URLs"
  end
end
Which, after running, is:
http://www.google.com http://www.stackoverflow.com contains only URLs
http://www.google.com and http://www.stackoverflow.com doesn't contain only URLs
This doesn't support all the recognized URL schemes, but it is a starting point. You can specify which you want by passing an array of schemes to extract. You can get the IANA's permanent list using:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(URI.open('http://www.iana.org/assignments/uri-schemes.html')) # Ruby 3.0+ needs URI.open; plain open no longer accepts URLs
schemes = doc.at('table table').search('tr').map{ |tr| tr.at('td').text }[1..-1]
words.split.all? { |word| word.match(/^http:/) }
This checks for any URL (mailto, news, http(s) or ftp(s)); the string must consist only of URLs, separated by single whitespace characters:
(((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)\s){1,}((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)$
Reference:
http://www.regular-expressions.info/reference.html
http://regexlib.com/Search.aspx?k=URL&AspxAutoDetectCookieSupport=1
If you really want to use a regex, try this:
(?<protocol>\w+):\/\/(?<domain>[\w#][\w.:#]+)\/?[\w\.?=%&=\-#/$,]*
Split the string on whitespace and check whether each piece matches the regex above.
Hope it helps!

How to parse a URL and extract the required substring

Say I have a string like this: "http://something.example.com/directory/"
What I want to do is to parse this string, and extract the "something" from the string.
The first step, obviously, is to check that the string contains "http://"; otherwise, it should be ignored.
But how do I then extract just the "something" from that string? Assume that all the strings being evaluated have a similar structure (i.e. I am trying to extract the subdomain of the URL, provided the string is a valid URL, where valid means it starts with "http://").
Thanks.
P.S. I know how to check the first part, i.e. I can just simply split the string at the "http://" but that doesn't solve the full problem because that will produce "http://something.example.com/directory/". All I want is the "something", nothing else.
I'd do it this way:
require 'uri'
uri = URI.parse('http://something.example.com/directory/')
uri.host.split('.').first
# => "something"
URI is built into Ruby. It's not the most full-featured but it's plenty capable of doing this task for most URLs. If you have IRIs then look at Addressable::URI.
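Putting the question's two requirements together (ignore strings that don't start with "http://", then take the first host label), a minimal sketch might look like this. The method name first_host_label is hypothetical, and treating unparseable strings as "ignore" is my assumption:

```ruby
require 'uri'

# Hypothetical helper: returns the first label of the host ("something"),
# or nil when the string does not start with http:// or cannot be parsed.
def first_host_label(str)
  return nil unless str.start_with?('http://')

  host = URI.parse(str).host
  host && host.split('.').first
rescue URI::InvalidURIError
  nil
end

puts first_host_label('http://something.example.com/directory/')  # prints something
```

Note that for a host like example.com this returns "example", since the sketch has no notion of public suffixes.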
You could use URI like
uri = URI.parse("http://something.example.com/directory/")
puts uri.host
# "something.example.com"
and you could then just work on the host.
Or there is the domainatrix gem, mentioned in the question "Remove subdomain from string in ruby":
require 'rubygems'
require 'domainatrix'
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
and you could just take the subdomain.
Well, you can use regular expressions.
Something like /http:\/\/([^\.]+)/, that is, capture the first run of characters other than '.' after http://.
Check out http://rubular.com/. You can test your regular expressions against a set of tests too, it's great for learning this tool.
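For illustration, the regular expression approach could be applied with String#[] and a capture index. The method name is hypothetical, and compared with the pattern above I have added a \A anchor and excluded "/" from the character class so that a host without dots stops at the path; both are assumptions:

```ruby
# Hypothetical helper: returns the first host label after http://, or nil.
def subdomain_via_regex(str)
  # str[regex, 1] returns the text of capture group 1, or nil when there is no match.
  str[%r{\Ahttp://([^./]+)}, 1]
end

puts subdomain_via_regex('http://something.example.com/directory/')  # prints something
```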
With URI.parse you can get:
require "uri"
uri = URI.parse("http://localhost:3000")
uri.scheme # => "http"
uri.host # => "localhost"
uri.port # => 3000

How to get the file extension from a url?

New to Ruby, how would I get the file extension from a URL like:
http://www.example.com/asdf123.gif
Also, how would I format this string? In C# I would do:
string.format("http://www.example.com/{0}.{1}", filename, extension);
Use File.extname
File.extname("test.rb") #=> ".rb"
File.extname("a/b/d/test.rb") #=> ".rb"
File.extname("test") #=> ""
File.extname(".profile") #=> ""
To format the string
"http://www.example.com/%s.%s" % [filename, extension]
This works for URLs with a query string:
require 'uri'
file = 'http://recyclewearfashion.com/stylesheets/page_css/page_css_4f308c6b1c83bb62e600001d.css?1343074150'
File.extname(URI.parse(file).path) # => ".css"
It also returns "" if the file has no extension.
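Both halves of the question (getting the extension, then formatting a URL from its parts) can be combined in one short sketch. The method name rebuild_url and the query string in the sample URL are invented; note that File.extname keeps the leading dot, so the format string here uses "%s%s" and not "%s.%s":

```ruby
require 'uri'

# Hypothetical helper: splits the URL's path into name and extension,
# then formats a new URL from the two parts.
def rebuild_url(url)
  path = URI.parse(url).path                 # "/asdf123.gif", query string dropped
  extension = File.extname(path)             # ".gif" (includes the dot)
  filename = File.basename(path, extension)  # "asdf123"
  format('http://www.example.com/%s%s', filename, extension)
end

puts rebuild_url('http://www.example.com/asdf123.gif?size=large')
# prints http://www.example.com/asdf123.gif
```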
url = 'http://www.example.com/asdf123.gif'
extension = url.split('.').last
This gets you the extension for a URL in the simplest possible manner (it gives the wrong result if the URL has a query string or no extension). Now, for output formatting:
printf "http://www.example.com/%s.%s", filename, extension
You could use Ruby's URI class like this to get the path component of the URI and match the extension after its last dot (this also works if the URL contains a query part):
require 'uri'
your_url = 'http://www.example.com/asdf123.gif'
path = URI.split(your_url)[5] # index 5 of the array returned by URI.split is the path
extension = path[/\.([\w+-]+)\z/, 1] # => "gif"
I realize this is an ancient question, but here's another vote for using Addressable. You can use the .extname method, which works as desired even with a query string:
require 'addressable/uri' # from the addressable gem
Addressable::URI.parse('http://www.example.com/asdf123.gif').extname # => ".gif"
Addressable::URI.parse('http://www.example.com/asdf123.gif?foo').extname # => ".gif"

Extract all urls inside a string in Ruby

I have some text content with a list of URLs contained in it.
I am trying to grab all the URLs out and put them in an array.
I have this code
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"
urls = content.scan(/^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix)
I am trying to get the end results to be:
['http://www.google.com', 'http://www.google.com/index.html']
The above code does not seem to be working correctly. Does anyone know what I am doing wrong?
Thanks
Easy:
require 'uri'
URI.extract(content, ['http', 'https'])
# => ["http://www.google.com", "http://www.google.com/index.html"]
A different approach, from the perfect-is-the-enemy-of-the-good school of thought:
urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
I haven't checked the syntax of your regex, but String#scan will produce an array, each of whose members is an array of the groups captured by your regex. So I'd expect the result to be:
[['http', '.google.com'], ...]
You'll need non-capturing groups /(?:stuff)/ if you want the format you've given.
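The difference is easy to see on a deliberately simplified pattern. The sample text, the method names and the loose \S+ matcher are all illustrative, not the regex from the question:

```ruby
# With capturing groups, scan returns one array of group contents per match.
def scan_capturing(text)
  text.scan(%r{(https?)://(\S+)})
end

# With only non-capturing groups, scan returns the full matched strings.
def scan_non_capturing(text)
  text.scan(%r{(?:https?)://\S+})
end

sample = "see http://a.example.com and https://b.example.org"
p scan_capturing(sample)      # [["http", "a.example.com"], ["https", "b.example.org"]]
p scan_non_capturing(sample)  # ["http://a.example.com", "https://b.example.org"]
```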
Edit (looking at regex): Also, your regex does look a bit wrong. You don't want the start and end anchors (^ and $), since you don't expect the matches to be at start and end of content. Secondly, if your ([0-9]{1,5})? is trying to capture a port number, I think you're missing a colon to separate the domain from the port.
Further edit, after playing: I think you want something like this:
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html http://example.com:3000/foo"
urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
# => ["http://www.google.com", "http://www.google.com/index.html", "http://example.com:3000/foo"]
... but note that it won't match pure IP-address URLs (like http://127.0.0.1), because of the [a-z]{2,5} for the TLD.
Just for your interest:
Ruby has a URI module, which has a regex implemented to do such things:
require "uri"
schemes_you_want_to_grab = ['ftp', 'http', 'https', 'mailto', 'see']
urls = []
html_string.scan(URI.regexp(schemes_you_want_to_grab)) do |*matches|
  urls << $&
end
For more information visit the Ruby Ref: URI
The most upvoted answer was causing issues with Markdown URLs for me, so I had to figure out a regex to extract URLs. Below is what I use:
URL_REGEX = /(https?:\/\/\S+?)(?:[\s)]|$)/i
content.scan(URL_REGEX).flatten
The last part here (?:[\s)]|$) is used to identify the end of the URL and you can add characters there as per your need and content. Right now it looks for any space characters, closing bracket or end of string.
content = "link in text [link1](http://www.example.com/test) and [link2](http://www.example.com/test2)
http://www.example.com/test3
http://www.example.com/test4"
returns ["http://www.example.com/test", "http://www.example.com/test2", "http://www.example.com/test3", "http://www.example.com/test4"].
