We want to serve thumbnail versions of our images, which are stored under a modified version of the original URL:
Original Image:
https://image.s3-us-west-2.amazonaws.com/8/flower.jpg
Thumbnail Image:
https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg
I was going to search from the end of the string for the last '/' and replace it with '/thumbnails/medium_'. In my case this is always safe, but I can't figure out how to do this kind of string manipulation in Ruby on Rails.
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = s.split('/')[-1] # should give 'flower.jpg'
The issue is getting everything before the last '/' so that I can inject 'thumbnails/medium_'. Any ideas?
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = s.insert(s.rindex('/')+1, 'thumbnails/medium_')
# The above approach modifies the original string; if that is undesirable, use:
img_url = s.dup.insert(s.rindex('/')+1, 'thumbnails/medium_')
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = "#{File.dirname(s)}/thumbnails/medium_#{File.basename(s)}"
# => "https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg"
I would probably use URI and Pathname to work with URLs and file paths:
require 'uri'
require 'pathname'
url = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
uri = URI.parse(url)
path = Pathname.new(uri.path)
uri.path = "#{path.dirname}/thumbnails/medium_#{path.basename}"
uri.to_s
#=> "https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg"
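The URI approach above can be wrapped in a small helper. This is only a sketch: the method name thumbnail_url and the prefix argument are made up for illustration, and it assumes the URL always has a file name as its last path segment. Because only the path is rewritten, a query string stays untouched.

```ruby
require 'uri'

# Hypothetical helper (name and default prefix are assumptions, not from the question).
def thumbnail_url(url, prefix = 'thumbnails/medium_')
  uri = URI.parse(url)
  # File.split("/8/flower.jpg") => ["/8", "flower.jpg"]
  dir, file = File.split(uri.path)
  uri.path = File.join(dir, "#{prefix}#{file}")
  uri.to_s
end

puts thumbnail_url('https://image.s3-us-west-2.amazonaws.com/8/flower.jpg')
# A query string containing a slash is left alone, since only uri.path changes.
puts thumbnail_url('https://image.s3-us-west-2.amazonaws.com/8/flower.jpg?v=a/b')
```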
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
s.sub(/([^\/]+)$/, 'thumbnails/medium_\1')
The second argument to s.sub should be single-quoted; otherwise you have to escape the backslash in the \1 part.
UPDATE
s.sub(/([^\/]+?)(?=$|\?|#)/, 'thumbnails/medium_\1')
This handles the case where the path is followed by a query string, a fragment, or both, any of which may contain slashes.
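To see how a lookahead-based pattern behaves, here is a small sketch. The constant and method names are hypothetical, and the query string and fragment in the second URL are invented test input. The lazy group takes the first slash-free run that is directly followed by end of line, "?" or "#", which is the file name.

```ruby
# Lazily match the last path segment, stopping before end of line, "?" or "#".
LAST_SEGMENT = %r{([^/]+?)(?=$|\?|#)}

# Hypothetical helper name, for illustration only.
def medium_thumbnail(url)
  url.sub(LAST_SEGMENT, 'thumbnails/medium_\1')
end

plain      = medium_thumbnail('https://image.s3-us-west-2.amazonaws.com/8/flower.jpg')
with_extra = medium_thumbnail('https://image.s3-us-west-2.amazonaws.com/8/flower.jpg?v=a/b#x/y')

puts plain
puts with_extra
```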
What you need is the #[] method with a Range argument:
# a small performance optimization: no need to split the string twice
parts = s.split('/')
img_url = parts[0..-2].join('/') + "/thumbnails/medium_" + parts[-1]
On a side note, if you are using a Rails plugin for handling images (CarrierWave or Paperclip), you should use its built-in mechanisms for URL interpolation.
Related
I have a URL string:
http://ip-address/user/reset/1/1379631719drush/owad_yta75
into which I need to insert a short string:
help after the third occurrence of "/" so that the new string will be:
http://ip-address/help/user/reset/1/1379631719drush/owad_yta75
I cannot use Ruby's string.insert(#, 'string') since the IP address is not always the same length.
I'm looking at using a regex to do that, but I am not exactly sure how to find the third '/' occurrence.
One way to do this:
url = "http://ip-address/user/reset/1/1379631719drush/owad_yta75"
url.split("/").insert(3, 'help').join('/')
# => "http://ip-address/help/user/reset/1/1379631719drush/owad_yta75"
You can do that with capturing groups using a regex like this:
(.*?//.*?/)(.*)
  ^    ^    ^- Everything after 3rd slash
  |    \- domain
  \- protocol
Use the following replacement:
\1help/\2
This puts help/ right after the third slash, which gives your expected output.
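In Ruby, the capturing-group pattern and the replacement can be applied with String#sub. A minimal sketch, using the URL from the question (the variable name new_url is just for illustration):

```ruby
url = 'http://ip-address/user/reset/1/1379631719drush/owad_yta75'

# Group 1 is the protocol plus host up to and including the third slash,
# group 2 is everything after it. Single quotes keep \1 and \2 as backreferences.
new_url = url.sub(%r{(.*?//.*?/)(.*)}, '\1help/\2')

puts new_url
```

Using %r{...} instead of /.../ avoids having to escape every forward slash in the pattern.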
The thing you're forgetting is that a URL is just a host plus file path to a resource, so you should take advantage of tools designed to work with those. While it'll seem unintuitive at first, in the long run it'll work better.
require 'uri'
url = 'http://ip-address/user/reset/1/1379631719drush/owad_yta75'
uri = URI.parse(url)
path = uri.path # => "/user/reset/1/1379631719drush/owad_yta75"
dirs = path.split('/') # => ["", "user", "reset", "1", "1379631719drush", "owad_yta75"]
uri.path = (dirs[0,1] + ['help'] + dirs[1 .. -1]).join('/')
uri.to_s # => "http://ip-address/help/user/reset/1/1379631719drush/owad_yta75"
I am trying to parse through URLs using Ruby and return the URLs that match a word after the "/" in .com , .org , etc.
If I am trying to capture "questions" in a URL such as
https://stackoverflow.com/questions I also want to be able to capture https://stackoverflow.com/blah/questions. But I do not want to capture https://stackoverflow.com/queStioNs.
Currently my expression can match https://stackoverflow.com/questions but cannot match with "questions" after another "/", or 2 "/"s, etc.
The end of my regular expression is using \bquestions\b.
I tried doing ([a-zA-Z]+\W{1}+\bjob\b|\bjob\b) but this only gets me URLs with /questions and /blah/questions but not /blah/bleh/questions.
What am I doing wrong and how do I match what I need?
You don't actually need a regex for this, you can instead use the URI module:
require 'uri'
urls = ['https://stackoverflow.com/blah/questions', 'https://stackoverflow.com/queStioNs']
urls.each do |url|
  the_path = URI(url).path
  puts the_path if the_path.include?('questions')
end
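One caveat with include? on the whole path is that it also matches substrings of a longer segment. If "questions" has to be a complete path segment, a sketch along these lines might be closer to what was asked. The last two URLs in the list are invented test cases, not from the question.

```ruby
require 'uri'

urls = [
  'https://stackoverflow.com/questions',
  'https://stackoverflow.com/blah/questions',
  'https://stackoverflow.com/blah/bleh/questions',
  'https://stackoverflow.com/queStioNs',
  'https://stackoverflow.com/myquestionsarchive' # hypothetical: substring only
]

# Split the path into segments and require an exact, case-sensitive match.
matching = urls.select do |url|
  URI(url).path.split('/').include?('questions')
end

puts matching
```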
I don't know whether there is a simpler way; here is my solution:
regexp = '^(https|http)?:\/\/[\w]+\.(com|org|edu)(\/{1}[a-z]+)*$'
group_length = "https://stackoverflow.com/blah/questions".match(regexp).length
"https://stackoverflow.com/blah/questions".match(regexp)[group_length - 1].gsub("/","")
It will return 'questions'.
Update as per your comments below:
Use [\S]*(\/questions){1}$
Hope it helps :)
I have links like this:
<div class="zg_title">
Thermos Foogo Leak-Proof Stainless St...
</div>
And I'm scraping them like this:
product_asin = product.xpath('//div[@class="zg_title"]/a/@href').first.value
The problem is that it takes the whole URL and I want to just get the ID:
B000O3GCFU
I think I need to do something like this:
product_asin = product.xpath('//div[@class="zg_title"]/a/@href').first.value[ReGEX_HERE]
What's the simplest regex I can use in this case?
EDIT:
Strange, the link URL doesn't appear complete:
http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1
Use /\w+$/:
p doc.xpath('//div[@class="zg_title"]/a/@href').first.value[/\w+$/]
/\w+$/ matches the trailing run of letters, digits, and underscores.
require 'nokogiri'
s = <<EOF
<div class="zg_title">
Thermos Foogo Leak-Proof Stainless St...
</div>
EOF
doc = Nokogiri::HTML(s)
p doc.xpath('//div[@class="zg_title"]/a/@href').first.value[/\w+$/]
# => "B000O3GCFU"
Given that the product code is always preceded by /dp/ and followed by a /:
url[/(?<=\/dp\/)[^\/]+/]
Or, perhaps more readable:
url[%r{(?<=/dp/)[^/]+}]
Alternatively, without using regular expressions:
parts = url.split('/')
parts[parts.index('dp') + 1]
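Here is a sketch that wraps the /dp/ idea in a helper using a capture group instead of a lookbehind. The method name extract_asin is made up, and the assumption that the ID always directly follows /dp/ comes from the answer above, not from any Amazon documentation. String#[] with a regex and a group index returns nil when nothing matches.

```ruby
# Hypothetical helper: returns the path segment right after "/dp/", or nil.
def extract_asin(url)
  # Stop at the next slash, query string, or fragment.
  url[%r{/dp/([^/?#]+)}, 1]
end

full_url  = 'http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1'
short_url = 'http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU'

p extract_asin(full_url)
p extract_asin(short_url)
p extract_asin('http://www.amazon.com/') # no /dp/ segment
```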
An approach based on available parsers (to please Nicolas Tyler or anyone else who would rather avoid regex for parsing in this sort of case)
require 'uri'
product_uri = product.xpath('//div[@class="zg_title"]/a/@href').first.value
# e.g. http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1
product_path = URI.parse(product_uri).path.split('/')
# => ["", "Thermos-Foogo-Leak-Proof-Stainless-10-Ounce",
# "dp", "B000O3GCFU", "ref=zg_bs_baby-products_1"]
# This relies on (un-researched assumption) location in path being consistent
# Now we have components though, we can look at Amazon's documentation and
# select based on position in path, relative position from some other identifier
# etc, without risk of a regex mismatch
product_asin = product_path[3]
# => "B000O3GCFU"
Say I have a string representing a URL:
http://www.mysite.com/somepage.aspx?id=33
..I'd like to escape the forward slashes and the question mark:
http:\/\/www.mysite.com\/somepage.aspx\?id=33
How can I do this via gsub? I've been playing with some regular expressions in there but haven't hit on the winning formula yet.
I suggest you use
url = url.gsub(/(?=[\/?])/, '\\')
As shown here
url = 'http://www.mysite.com/somepage.aspx?id=33'
url = url.gsub(/(?=[\/?])/, '\\')
puts url
output
http:\/\/www.mysite.com\/somepage.aspx\?id=33
How about this one: result = searchText.gsub(/(\/|\?)/, '\\\\\1')
I would suggest using a block to make it more readable:
url.gsub(/[\/?]/) { |c| "\\#{c}" }
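For comparison, here is a sketch that runs three ways of doing this side by side on the URL from the question. The variable names are only for illustration. The backslash counting is the tricky part: in a single-quoted replacement string, '\\\\\1' is a literal backslash followed by the backreference \1, while inside a block the return value is used as-is with no backreference processing.

```ruby
url = 'http://www.mysite.com/somepage.aspx?id=33'

# Zero-width lookahead: insert a backslash before each "/" or "?".
with_lookahead = url.gsub(/(?=[\/?])/, '\\')

# Capture group: literal backslash followed by the captured character.
with_backref = url.gsub(/([\/?])/, '\\\\\1')

# Block form: the block's return value is inserted without further escaping rules.
with_block = url.gsub(/[\/?]/) { |c| "\\#{c}" }

puts with_lookahead
puts with_backref
puts with_block
```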
In Ruby, I want to replace a given URL in an HTML string.
Here is my unsuccessful attempt:
escaped_url = url.gsub(/\//,"\/").gsub(/\./,"\.").gsub(/\?/,"\?")
path_regexp = Regexp.new(escaped_url)
html.gsub!(path_regexp, new_url)
Note: url is actually a Google Chart request URL I wrote, which will not have more special characters than /?|.=%:
The gsub method can take a string or a Regexp as its first argument, same goes for gsub!. For example:
>> 'here is some ..text.. xxtextxx'.gsub('..text..', 'pancakes')
=> "here is some pancakes xxtextxx"
So you don't need to bother with a regex or escaping at all, just do a straight string replacement:
html.gsub!(url, new_url)
Or better, use an HTML parser to find the particular node you're looking for and do a simple attribute assignment.
I think you're looking for something like:
path_regexp = Regexp.new(Regexp.escape(url))
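A quick sketch comparing the plain-string replacement with the Regexp.escape variant. The chart URL, the replacement path and the HTML snippet are all invented for illustration; the URL just contains some of the special characters mentioned in the question.

```ruby
# Hypothetical inputs, chosen to include ".", "?", "|", "=" and ":".
url     = 'http://chart.apis.google.com/chart?cht=p3&chd=t:60,40&chl=Hello|World'
new_url = '/images/chart.png'
html    = %(<img src="#{url}"> and again <img src="#{url}">)

# A String pattern is matched literally, so no escaping is needed.
by_string = html.gsub(url, new_url)

# Regexp.escape backslash-escapes every regex metacharacter in the URL.
by_regexp = html.gsub(Regexp.new(Regexp.escape(url)), new_url)

puts by_string
puts by_regexp
```

Both calls should give the same result here. The string form is simpler; the escaped regexp mainly makes sense when the URL has to be embedded in a larger pattern.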