Regular Expression find usage of word after "/" in URL - ruby

I am trying to parse URLs using Ruby and return the URLs that contain a given word after a "/" following the .com, .org, etc.
If I am trying to capture "questions" in a URL such as
https://stackoverflow.com/questions I also want to be able to capture https://stackoverflow.com/blah/questions. But I do not want to capture https://stackoverflow.com/queStioNs.
Currently my expression can match https://stackoverflow.com/questions but cannot match with "questions" after another "/", or 2 "/"s, etc.
The end of my regular expression is using \bquestions\.
I tried doing ([a-zA-Z]+\W{1}+\bjob\b|\bjob\b) but this only gets me URLs with /questions and /blah/questions but not /blah/bleh/questions.
What am I doing wrong and how do I match what I need?

You don't actually need a regex for this; you can use the URI module instead:
require 'uri'
urls = ['https://stackoverflow.com/blah/questions', 'https://stackoverflow.com/queStioNs']
urls.each do |url|
  the_path = URI(url).path
  puts the_path if the_path.include?('questions')
end

I don't know whether there is a simpler way, but here is my solution:
regexp = '^(https|http)?:\/\/[\w]+\.(com|org|edu)(\/{1}[a-z]+)*$'
group_length = "https://stackoverflow.com/blah/questions".match(regexp).length
"https://stackoverflow.com/blah/questions".match(regexp)[group_length - 1].gsub("/","")
It will return 'questions'.
Update as per your comments below:
Use [\S]*(\/questions){1}$
Hope it helps :)
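A minimal sketch of the URI-based idea, tightened so that "questions" has to be a whole path segment at any depth and the comparison stays case-sensitive. The helper name path_has_segment? and the extra sample URLs are made up for illustration; they are not from the question.

```ruby
require 'uri'

# Hypothetical helper (not from the question): true when `word` appears as a
# complete, case-sensitive segment of the URL's path, at any depth.
def path_has_segment?(url, word)
  URI.parse(url).path.split('/').include?(word)
end

urls = [
  'https://stackoverflow.com/questions',
  'https://stackoverflow.com/blah/questions',
  'https://stackoverflow.com/blah/bleh/questions',
  'https://stackoverflow.com/queStioNs',
  'https://stackoverflow.com/myquestions'
]

# Keep only the URLs whose path contains the exact segment "questions".
matching = urls.select { |url| path_has_segment?(url, 'questions') }
puts matching
```

Splitting the path on "/" avoids the substring problem that a plain include? on the whole path has, where /myquestions would also match.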

Related

Cleanest way to inject into string

We are looking to optimize images with a thumbnail version, which are stored under a funky version of the existing URL:
Original Image:
https://image.s3-us-west-2.amazonaws.com/8/flower.jpg
Thumbnail Image:
https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg
I was going to look from the end of the string for the last '/' and replace it with '/thumbnails/medium_'. In my case this is always safe, but I can't figure out how to do this kind of mutation in Ruby on Rails.
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = s.split('/')[-1] # should give 'flower.jpg'
The issue is to get everything before the last '/' to inject in 'thumbnails/medium_'. Any ideas?
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = s.insert(s.rindex('/')+1, 'thumbnails/medium_')
# The above approach modifies the original string, if this is unsatisfactory, use:
img_url = s.dup.insert(s.rindex('/')+1, 'thumbnails/medium_')
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = "#{File.dirname(s)}/thumbnails/medium_#{File.basename(s)}"
# => "https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg"
I would probably use URI and Pathname to work with URLs and file paths:
require 'uri'
require 'pathname'
url = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
uri = URI.parse(url)
path = Pathname.new(uri.path)
uri.path = "#{path.dirname}/thumbnails/medium_#{path.basename}"
uri.to_s
#=> "https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg"
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
s.sub(/([^\/]+)$/, 'thumbnails/medium_\1')
The second argument to s.sub should be a single-quoted string; in a double-quoted string you have to escape the backslash in \1.
UPDATE
s.sub(/([^\/?#]+)(?=[?#]|$)/, 'thumbnails/medium_\1')
This handles a query string, a fragment, or both after the path, even when they contain slashes.
What you need is Array#[] with a Range:
# a small performance optimization - no need to split the string twice
parts = s.split('/')
img_url = parts[0..-2].join('/') + "/thumbnails/medium_" + parts[-1]
On a side note. If you are using some Rails plugin for handling images (CarrierWave or Paperclip), you should use built-in mechanisms for URL interpolation.
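Here is one way the URI approach could be wrapped into a reusable method that leaves any query string untouched. The method name thumbnail_url and its default prefix argument are assumptions for this sketch, and so is the second sample URL with a query string.

```ruby
require 'uri'

# Hypothetical helper: inserts `prefix` in front of the file name in the
# URL's path, leaving scheme, host, query and fragment as they were.
def thumbnail_url(url, prefix = 'thumbnails/medium_')
  uri = URI.parse(url)
  dir, file = File.split(uri.path)            # "/8/flower.jpg" => ["/8", "flower.jpg"]
  uri.path = File.join(dir, "#{prefix}#{file}")
  uri.to_s
end

plain      = thumbnail_url('https://image.s3-us-west-2.amazonaws.com/8/flower.jpg')
with_query = thumbnail_url('https://image.s3-us-west-2.amazonaws.com/8/flower.jpg?v=1/2')

puts plain
puts with_query
```

Because only uri.path is rewritten, slashes inside the query string do not confuse the "last slash" logic.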

Insert a string into an URL after a specific character

I have a URL string:
http://ip-address/user/reset/1/1379631719drush/owad_yta75
into which I need to insert a short string:
help after the third occurrence of "/" so that the new string will be:
http://ip-address/**help**/user/reset/1/1379631719drush/owad_yta75
I cannot use Ruby's string.insert(#, 'string') since the IP address is not always the same length.
I'm looking at using a regex to do that, but I am not exactly sure how to find the third '/' occurrence.
One way to do this:
url = "http://ip-address/user/reset/1/1379631719drush/owad_yta75"
url.split("/").insert(3, 'help').join('/')
# => "http://ip-address/help/user/reset/1/1379631719drush/owad_yta75"
You can do that with capturing groups using a regex like this:
(.*?//.*?/)(.*)
 ^    ^     ^- Everything after 3rd slash
 |    \- domain
 \- protocol
Working demo
And use the following replacement:
\1help/\2
If you check the Substitution section you can see your expected output
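In Ruby, that capture-group replacement could be applied with String#sub, roughly as follows. The \A and \z anchors and the variable name with_help are additions for this sketch.

```ruby
url = 'http://ip-address/user/reset/1/1379631719drush/owad_yta75'

# Group 1 lazily captures everything up to and including the third slash
# (protocol, "//", host, "/"); group 2 captures the rest of the URL.
# Single quotes keep \1 and \2 as backreferences for sub.
with_help = url.sub(%r{\A(.*?//.*?/)(.*)\z}, '\1help/\2')

puts with_help
```

The lazy quantifiers matter here: a greedy .* in the first group would run to the last slash instead of the third.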
The thing you're forgetting is that a URL is just a host plus file path to a resource, so you should take advantage of tools designed to work with those. While it'll seem unintuitive at first, in the long run it'll work better.
require 'uri'
url = 'http://ip-address/user/reset/1/1379631719drush/owad_yta75'
uri = URI.parse(url)
path = uri.path # => "/user/reset/1/1379631719drush/owad_yta75"
dirs = path.split('/') # => ["", "user", "reset", "1", "1379631719drush", "owad_yta75"]
uri.path = (dirs[0,1] + ['help'] + dirs[1 .. -1]).join('/')
uri.to_s # => "http://ip-address/help/user/reset/1/1379631719drush/owad_yta75"

Proper gsub regular expression for this URL?

Say I have a string representing a URL:
http://www.mysite.com/somepage.aspx?id=33
..I'd like to escape the forward slashes and the question mark:
http:\/\/www.mysite.com\/somepage.aspx\?id=33
How can I do this via gsub? I've been playing with some regular expressions in there but haven't hit on the winning formula yet.
I suggest you use
url = url.gsub(/(?=[\/?])/, '\\')
As shown here
url = 'http://www.mysite.com/somepage.aspx?id=33'
url = url.gsub(/(?=[\/?])/, '\\')
puts url
output
http:\/\/www.mysite.com\/somepage.aspx\?id=33
How about this one: result = searchText.gsub(/(\/|\?)/, "\\\\\\1")
I would suggest using a block to make it more readable:
url.gsub(/[\/?]/) { |c| "\\#{c}" }
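A short comparison of three ways to write this substitution, to show how the backslashes behave in each. The variable names are made up for illustration. The hash form of gsub inserts its values literally, so it avoids the backreference escaping entirely.

```ruby
url = 'http://www.mysite.com/somepage.aspx?id=33'

# Block form: the block's return value is inserted as-is.
with_block = url.gsub(/[\/?]/) { |c| "\\#{c}" }

# Backreference form: in a single-quoted string, '\\\\\1' holds the
# characters \\\1, which gsub reads as a literal backslash plus group 1.
with_backref = url.gsub(/([\/?])/, '\\\\\1')

# Hash form: each matched character is looked up and replaced literally.
with_hash = url.gsub(/[\/?]/, '/' => '\/', '?' => '\?')

puts with_block
puts with_backref
puts with_hash
```

All three should print the same escaped URL; the block and hash forms are easier to read because the replacement text is not parsed for backreferences.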

How do I get just the sitename from url in ruby?

I have a url such as:
http://www.relevantmagazine.com/life/relationship/blog/23317-pursuing-singleness
And would like to extract just relevantmagazine from it.
Currently I have:
@urlroot = URI.parse(@link.url).host
But it returns www.relevantmagazine.com. Can anyone help me?
Using a gem for this might be overkill, but anyway: there's a handy gem called domainatrix that can extract the sitename for you while dealing with things like two-element top-level domains and more.
url = Domainatrix.parse("http://www.pauldix.net")
url.url # => "http://www.pauldix.net" (the original url)
url.public_suffix # => "net"
url.domain # => "pauldix"
url.canonical # => "net.pauldix"
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
How about:
@urlroot = URI.parse(@link.url).host.gsub("www.", "").split(".")[0]
Try this regular expression:
regex = %r{http://[w]*[\.]*[^/|$]*}
Given the following URL strings, it returns:
url = 'http://www.google.com/?q=blah'
url.scan(regex) # => ["http://www.google.com"]
url = 'http://google.com/?q=blah'
url.scan(regex) # => ["http://google.com"]
url = 'http://google.com'
url.scan(regex) # => ["http://google.com"]
url = 'http://foo.bar.pauldix.co.uk/asdf.html?q=arg'
url.scan(regex) # => ["http://foo.bar.pauldix.co.uk"]
It's not perfect, but it strips out everything except the prefix and the host name. You can then clean up the prefix with some other code, since you only need to look for http:// or http://www. at the beginning of the string. You may also need to tweak the regex a little if you are going to parse https:// URLs as well. I hope this helps you get started!
Edit:
I reread the question and realized my answer doesn't really do what you asked. It would help to know whether the URLs you're parsing have a set format, such as always starting with www. If they do, you could use a regular expression that extracts everything between the first and second period in the URL. If not, perhaps you could tweak my regex so that it matches everything between the / or www. and the first period. That might be the easiest way to get just the site name with none of the www. or the .com or .au.uk and such.
Revised regex:
regex = %r{http://[w]*[\.]*[^\.]*}
url = 'http://foo.bar.pauldix.co.uk/asdf.html?q=arg'
url.scan(regex) # => ["http://foo"]
The results can be odd. If you go the regex route, you'll probably have to clean up the URL incrementally to extract the part you want.
Maybe you can just split it?
URI.parse(@link.url).host.split('.')[1]
Keep in mind that some registered domains may have more than one component to the registered country domain, like .co.uk or .co.jp or .com.au for example.
I found the answer, inspired by tadman's answer and the answer to another question:
@urlroot = URI.parse(item.url).host
@urlroot = @urlroot.start_with?('www.') ? @urlroot[4..-1] : @urlroot
@urlroot = @urlroot.split('.')[0]
The first line gets the host, the second removes the www. prefix if there is one, and the third takes everything before the next dot.
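The same three steps can be folded into one small method using only the standard library. This is a naive sketch: the name naive_site_name is made up, String#delete_prefix assumes Ruby 2.5 or later, and hosts with extra subdomains are not handled (that is the case the domainatrix answer addresses).

```ruby
require 'uri'

# Hypothetical helper: host without a leading "www.", cut at the first dot.
# Assumes the URL has a host and that nothing but "www." precedes the site name.
def naive_site_name(url)
  host = URI.parse(url).host
  host.delete_prefix('www.').split('.').first
end

relevant  = naive_site_name('http://www.relevantmagazine.com/life/relationship/blog/23317-pursuing-singleness')
google    = naive_site_name('http://google.com/?q=blah')
subdomain = naive_site_name('http://foo.bar.pauldix.co.uk/asdf.html?q=arg')

puts relevant   # the site name the question asks for
puts google
puts subdomain  # shows the limitation: the first subdomain, not "pauldix"
```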

Extract URLs from text using Ruby while handling matched parens

URI.extract claims to do this, but it doesn't handle matched parens:
>> URI.extract("text here (http://foo.example.org/bla) and here")
=> ["http://foo.example.org/bla)"]
What's the best way to extract URLs from text without breaking parenthesized URLs (which users like to use)?
If the URLs are always enclosed in parentheses, a regular expression might be a better solution.
text = "text here (http://foo.example.org/bla) and here and here is (http://yet.another.url/with/parens) and some more text"
text.scan(/\(([^\)]*)\)/)
Before using this
>> URI.extract("text here (http://foo.example.org/bla) and here")
=> ["http://foo.example.org/bla)"]
You need to add this
require 'uri'
You could use this regexp to extract URLs from a string:
"some thing http://abcd.com/ and http://google.com are great".scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
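Another option, sketched here under some assumptions, is to scan with a deliberately loose pattern and then trim closing parentheses that have no matching opener inside the URL. URL_PATTERN, strip_unbalanced_parens and extract_urls are names invented for this example; the pattern is simplistic and does not deal with other trailing punctuation such as commas or periods.

```ruby
# Loose pattern (an assumption for this sketch): http/https followed by
# anything that is not whitespace, an angle bracket or a double quote.
URL_PATTERN = %r{https?://[^\s<>"]+}

# Drops trailing ")" characters while the URL has more ")" than "(".
def strip_unbalanced_parens(url)
  url = url.chomp(')') while url.end_with?(')') && url.count(')') > url.count('(')
  url
end

def extract_urls(text)
  text.scan(URL_PATTERN).map { |url| strip_unbalanced_parens(url) }
end

wrapped  = extract_urls('text here (http://foo.example.org/bla) and here')
balanced = extract_urls('see http://en.wikipedia.org/wiki/Ruby_(programming_language) for more')
nested   = extract_urls('(see http://en.wikipedia.org/wiki/Ruby_(programming_language))')

puts wrapped
puts balanced
puts nested
```

Counting parentheses keeps URLs that legitimately end in a balanced ")" while dropping the one that belongs to the surrounding text.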
