I have a url such as:
http://www.relevantmagazine.com/life/relationship/blog/23317-pursuing-singleness
And would like to extract just relevantmagazine from it.
Currently I have:
@urlroot = URI.parse(@link.url).host
But it returns www.relevantmagazine.com. Can anyone help me?
Using a gem for this might be overkill, but anyway: there's a handy gem called domainatrix that can extract the site name for you while dealing with things like two-element top-level domains and more.
url = Domainatrix.parse("http://www.pauldix.net")
url.url # => "http://www.pauldix.net" (the original url)
url.public_suffix # => "net"
url.domain # => "pauldix"
url.canonical # => "net.pauldix"
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
how about
@urlroot = URI.parse(@link.url).host.gsub("www.", "").split(".")[0]
Try this regular expression:
regex = %r{http://[w]*[\.]*[^/|$]*}
If you had the following url strings, it gives the following:
url = 'http://www.google.com/?q=blah'
url.scan(regex) # => ["http://www.google.com"]
url = 'http://google.com/?q=blah'
url.scan(regex) # => ["http://google.com"]
url = 'http://google.com'
url.scan(regex) # => ["http://google.com"]
url = 'http://foo.bar.pauldix.co.uk/asdf.html?q=arg'
url.scan(regex) # => ["http://foo.bar.pauldix.co.uk"]
It's not perfect, but it will strip out everything but the prefix and the host name. You can then easily clean up the prefix with some other code knowing now you only need to look for an http:// or http://www. at the beginning of the string. Another thought is you may need to tweak the regex I gave you a little if you are also going to parse https://. I hope this helps you get started!
Edit:
I reread the question and realized my answer doesn't really do what you asked. It would help to know whether the URLs you're parsing follow a set format, such as always starting with www. If they do, you could use a regular expression that extracts everything between the first and second period in the URL. If not, perhaps you could tweak my regex so that it captures everything between the / or www. and the first period. That might be the easiest way to get just the site name with none of the www. or the .com or .au.uk and such.
Revised regex:
regex = %r{http://[w]*[\.]*[^\.]*}
url = 'http://foo.bar.pauldix.co.uk/asdf.html?q=arg'
url.scan(regex) # => ["http://foo"]
It'll be weird. If you use the regex stuff, you'll probably have to do it incrementally to clean up the url to extract the part you want.
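As a rough illustration of doing the cleanup in one step, here is a minimal sketch that skips the scheme and an optional www. and captures the first remaining label. The constant and method names are made up for this example, and it deliberately ignores subdomains other than www, so for foo.bar.pauldix.co.uk it still returns the first label rather than the registered name.

```ruby
# Hypothetical pattern: http or https, an optional "www.", then capture
# everything up to the next dot or slash.
SITE_NAME_PATTERN = %r{\Ahttps?://(?:www\.)?([^./]+)}

# Returns the captured label, or nil when the string does not look like a URL.
def first_label(url)
  url[SITE_NAME_PATTERN, 1]
end

first_label('http://www.google.com/?q=blah')                # => "google"
first_label('https://google.com')                           # => "google"
first_label('http://foo.bar.pauldix.co.uk/asdf.html?q=arg') # => "foo"
```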
Maybe you can just split it?
URI.parse(@link.url).host.split('.')[1]
Keep in mind that some registered domains may have more than one component to the registered country domain, like .co.uk or .co.jp or .com.au for example.
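To make that caveat concrete, here is a minimal sketch that counts from the right-hand side of the host instead of the left. The suffix list is a tiny hard-coded assumption for illustration only; a real solution would need a complete list of public suffixes, which is what a gem like domainatrix provides.

```ruby
require 'uri'

# Hypothetical, deliberately incomplete list of two-part suffixes.
TWO_PART_SUFFIXES = %w[co.uk co.jp com.au].freeze

# Returns the label directly to the left of the public suffix.
def registered_name(url)
  labels = URI.parse(url).host.split('.')
  suffix_size = TWO_PART_SUFFIXES.include?(labels.last(2).join('.')) ? 2 : 1
  labels[-(suffix_size + 1)]
end

registered_name('http://foo.bar.pauldix.co.uk/asdf.html?q=arg') # => "pauldix"
registered_name('http://www.relevantmagazine.com/life')         # => "relevantmagazine"
```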
I found the answer inspired by tadman's answer and the answer in another question
@urlroot = URI.parse(item.url).host
@urlroot = @urlroot.start_with?('www.') ? @urlroot[4..-1] : @urlroot
@urlroot = @urlroot.split('.')[0]
The first line gets the host, the second line removes the www. if there is one, and the third line gets everything before the next dot.
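The same three steps can be wrapped in a small helper. This is only a sketch: the method name site_name is made up here, and String#delete_prefix assumes Ruby 2.5 or newer.

```ruby
require 'uri'

# Host -> strip a leading "www." -> keep everything before the next dot.
def site_name(url)
  host = URI.parse(url).host
  return nil if host.nil? # e.g. a relative URL with no host part

  host.delete_prefix('www.').split('.').first
end

site_name('http://www.relevantmagazine.com/life/relationship/blog/23317-pursuing-singleness')
# => "relevantmagazine"
```

Like the original, it returns the first remaining label, so a host with a subdomain other than www yields that subdomain.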
Related
We are looking to optimize images with a thumbnail version, which are stored under a funky version of the existing URL:
Original Image:
https://image.s3-us-west-2.amazonaws.com/8/flower.jpg
Thumbnail Image:
https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg
I was going to look from the end of the string for the last '/' and replace it with '/thumbnails/medium_'. In my case this is always safe, but I can't figure out this kind of mutation in Ruby on Rails.
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = s.split('/')[-1] # should give 'flower.jpg'
The issue is to get everything before the last '/' to inject in 'thumbnails/medium_'. Any ideas?
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = s.insert(s.rindex('/')+1, 'thumbnails/medium_')
# The above approach modifies the original string, if this is unsatisfactory, use:
img_url = s.dup.insert(s.rindex('/')+1, 'thumbnails/medium_')
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = "#{File.dirname(s)}/thumbnails/medium_#{File.basename(s)}"
# => "https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg"
I would probably use URI and Pathname to work with URLs and file paths:
require 'uri'
require 'pathname'
url = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
uri = URI.parse(url)
path = Pathname.new(uri.path)
uri.path = "#{path.dirname}/thumbnails/medium_#{path.basename}"
uri.to_s
#=> "https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg"
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
s.sub /([^\/]+)$/, 'thumbnails/medium_\1'
The second argument to s.sub should be written with single quotation marks; otherwise you have to escape the backslash in the \1 part.
UPDATE
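s.sub(/([^\/]+?)(?=$|\?|#)/, 'thumbnails/medium_\1')
This handles the case where a query string, a fragment, or both follow the path and contain slashes. (The lookahead must not be followed by a trailing $, or the pattern would only ever match at the very end of the string.)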
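If query strings or fragments are a real concern, another option is to rewrite only the path component and leave the rest to URI. A minimal sketch, with a made-up method name and a hypothetical prefix: keyword argument:

```ruby
require 'uri'

# Rewrites only the path; any query string or fragment is kept untouched.
def thumbnail_url(url, prefix: 'medium_')
  uri = URI.parse(url)
  uri.path = File.join(File.dirname(uri.path), 'thumbnails',
                       "#{prefix}#{File.basename(uri.path)}")
  uri.to_s
end

thumbnail_url('https://image.s3-us-west-2.amazonaws.com/8/flower.jpg?v=a/b#top')
# => "https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg?v=a/b#top"
```

File.join is used instead of string interpolation so that an image sitting directly under the root does not end up with a doubled slash.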
What you need is the #[] method with a Range:
# a small performance optimization: no need to split the string twice
parts = s.split('/')
img_url = parts[0..-2].join('/') + "/thumbnails/medium_" + parts[-1]
On a side note. If you are using some Rails plugin for handling images (CarrierWave or Paperclip), you should use built-in mechanisms for URL interpolation.
I have a URL string:
http://ip-address/user/reset/1/1379631719drush/owad_yta75
into which I need to insert a short string:
help after the third occurrence of "/" so that the new string will be:
http://ip-address/**help**/user/reset/1/1379631719drush/owad_yta75
I cannot use Ruby's string.insert(#, 'string') since the IP address is not always the same length.
I'm looking at using a regex to do that, but I am not exactly sure how to find the third '/' occurrence.
One way to do this:
url = "http://ip-address/user/reset/1/1379631719drush/owad_yta75"
url.split("/").insert(3, 'help').join('/')
# => "http://ip-address/help/user/reset/1/1379631719drush/owad_yta75"
You can do that with capturing groups using a regex like this:
(.*?//.*?/)(.*)
  ^    ^     ^- Everything after 3rd slash
  |    \- domain
  \- protocol
Working demo
And use the following replacement:
\1help/\2
If you check the Substitution section you can see your expected output
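In Ruby, the same pattern and replacement can be applied with String#sub. A minimal sketch using the URL from the question; the \A anchor is an addition here so the match always starts at the beginning of the string:

```ruby
url = 'http://ip-address/user/reset/1/1379631719drush/owad_yta75'

# Group 1 is everything up to and including the third slash, group 2 is the rest.
# The replacement is single-quoted so \1 and \2 stay as backreferences.
new_url = url.sub(%r{\A(.*?//.*?/)(.*)}, '\1help/\2')
# => "http://ip-address/help/user/reset/1/1379631719drush/owad_yta75"
```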
The thing you're forgetting is that a URL is just a host plus file path to a resource, so you should take advantage of tools designed to work with those. While it'll seem unintuitive at first, in the long run it'll work better.
require 'uri'
url = 'http://ip-address/user/reset/1/1379631719drush/owad_yta75'
uri = URI.parse(url)
path = uri.path # => "/user/reset/1/1379631719drush/owad_yta75"
dirs = path.split('/') # => ["", "user", "reset", "1", "1379631719drush", "owad_yta75"]
uri.path = (dirs[0,1] + ['help'] + dirs[1 .. -1]).join('/')
uri.to_s # => "http://ip-address/help/user/reset/1/1379631719drush/owad_yta75"
I have the following mod_rewrite code for lighttpd, but it does not properly forward the user:
$SERVER["socket"] == ":3041" {
server.document-root = server_root + "/paste"
url.rewrite-once = ( "^/([^/\.]+)/?$" => "?page=paste&id=$1")
}
It should turn the URL domain.com/H839jec into domain.com/index.php?page=paste&id=H839jec. However, it is not doing that; instead it redirects everything to domain.com. I don't know much about mod_rewrite and would appreciate some input on why it is doing this.
Use the following :
url.rewrite-once = ("^/(.*)$" => "/?page=paste&id=$1")
I don't know the exact issue in your code, but first, the regex looks unnecessarily complicated and may not match what you expect it to match. Second, you're redirecting to a bare query string, whereas I would expect you still need to redirect to a valid path before the query string; that's why I redirect to /?page... instead of just ?page....
I am trying to parse through URLs using Ruby and return the URLs that match a word after the "/" in .com , .org , etc.
If I am trying to capture "questions" in a URL such as
https://stackoverflow.com/questions I also want to be able to capture https://stackoverflow.com/blah/questions. But I do not want to capture https://stackoverflow.com/queStioNs.
Currently my expression can match https://stackoverflow.com/questions but cannot match with "questions" after another "/", or 2 "/"s, etc.
The end of my regular expression is using \bquestions\.
I tried doing ([a-zA-Z]+\W{1}+\bjob\b|\bjob\b) but this only gets me URLs with /questions and /blah/questions but not /blah/bleh/questions.
What am I doing wrong and how do I match what I need?
You don't actually need a regex for this, you can instead use the URI module:
require 'uri'
urls = ['https://stackoverflow.com/blah/questions', 'https://stackoverflow.com/queStioNs']
urls.each do |url|
  the_path = URI(url).path
  puts the_path if the_path.include?('questions')
end
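One caveat with include? on the whole path is that it also accepts substrings such as /questionsabc. A minimal sketch that compares whole path segments instead; the method name and the extra sample URLs are made up for illustration:

```ruby
require 'uri'

# True when one path segment equals the word exactly (case-sensitive).
def path_has_segment?(url, word)
  URI(url).path.split('/').include?(word)
end

urls = [
  'https://stackoverflow.com/questions',
  'https://stackoverflow.com/blah/questions',
  'https://stackoverflow.com/blah/bleh/questions',
  'https://stackoverflow.com/queStioNs',
  'https://stackoverflow.com/questionsabc'
]

matches = urls.select { |url| path_has_segment?(url, 'questions') }
# matches holds the first three URLs only
```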
I don't know whether there is any simple way around, here is my solution:
regexp = '^(https|http)?:\/\/[\w]+\.(com|org|edu)(\/{1}[a-z]+)*$'
group_length = "https://stackoverflow.com/blah/questions".match(regexp).length
"https://stackoverflow.com/blah/questions".match(regexp)[group_length - 1].gsub("/","")
It will return 'questions'.
Update as per your comments below:
use [\S]*(\/questions){1}$
Hope it helps :)
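For reference, here is how a pattern of that shape might be applied to a list in Ruby. This is a sketch: the constant name and the sample URLs are assumptions, and the pattern is simplified to the part that matters, a literal /questions at the very end of the string.

```ruby
# Matches only URLs whose final path segment is exactly "questions".
ENDS_WITH_QUESTIONS = %r{/questions\z}

urls = [
  'https://stackoverflow.com/questions',
  'https://stackoverflow.com/blah/bleh/questions',
  'https://stackoverflow.com/queStioNs',
  'https://stackoverflow.com/questions/123'
]

selected = urls.grep(ENDS_WITH_QUESTIONS)
# selected holds the first two URLs only
```

Ruby regexes are case-sensitive unless the i flag is given, so queStioNs is rejected without extra work.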
I have links like this:
<div class="zg_title">
Thermos Foogo Leak-Proof Stainless St...
</div>
And I'm scraping them like this:
product_asin = product.xpath('//div[@class="zg_title"]/a/@href').first.value
The problem is that it takes the whole URL and I want to just get the ID:
B000O3GCFU
I think I need to do something like this:
product_asin = product.xpath('//div[@class="zg_title"]/a/@href').first.value[ReGEX_HERE]
What's the simplest regex I can use in this case?
EDIT:
Strange the link URL doesn't appear complete:
http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1
Use /\w+$/:
p doc.xpath('//div[@class="zg_title"]/a/@href').first.value[/\w+$/]
/\w+$/ matches the trailing run of letters, digits, and underscores.
require 'nokogiri'
s = <<EOF
<div class="zg_title">
Thermos Foogo Leak-Proof Stainless St...
</div>
EOF
doc = Nokogiri::HTML(s)
p doc.xpath('//div[@class="zg_title"]/a/@href').first.value[/\w+$/]
# => "B000O3GCFU"
Given that the product code is always preceded by /dp/ and followed by a /:
url[/(?<=\/dp\/)[^\/]+/]
Or, perhaps more readable:
url[%r{(?<=/dp/)[^/]+}]
Alternatively, without using regular expressions:
parts = url.split('/')
parts[parts.index('dp') + 1]
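Note that parts.index('dp') returns nil when the URL has no dp segment, so the snippet above would raise in that case. A minimal sketch with a guard, using the full URL from the question; the method name asin_from is made up:

```ruby
# Returns the segment right after "dp", or nil when there is no such segment.
def asin_from(url)
  parts = url.split('/')
  index = parts.index('dp')
  index && parts[index + 1]
end

asin_from('http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1')
# => "B000O3GCFU"
```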
An approach based on available parsers (to please Nicolas Tyler or anyone else who would rather avoid regex for parsing in this sort of case)
require 'uri'
product_uri = product.xpath('//div[@class="zg_title"]/a/@href').first.value
# e.g. http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1
product_path = URI.parse(product_uri).path.split('/')
# => ["", "Thermos-Foogo-Leak-Proof-Stainless-10-Ounce",
# "dp", "B000O3GCFU", "ref=zg_bs_baby-products_1"]
# This relies on (un-researched assumption) location in path being consistent
# Now we have components though, we can look at Amazon's documentation and
# select based on position in path, relative position from some other identifier
# etc, without risk of a regex mismatch
product_asin = product_path[3]
# => "B000O3GCFU"