I have links like this:
<div class="zg_title">
Thermos Foogo Leak-Proof Stainless St...
</div>
And I'm scraping them like this:
product_asin = product.xpath('//div[@class="zg_title"]/a/@href').first.value
The problem is that it takes the whole URL and I want to just get the ID:
B000O3GCFU
I think I need to do something like this:
product_asin = product.xpath('//div[@class="zg_title"]/a/@href').first.value[ReGEX_HERE]
What's the simplest regex I can use in this case?
EDIT:
Strange, the link URL doesn't appear in full:
http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1
Use /\w+$/:
p doc.xpath('//div[@class="zg_title"]/a/@href').first.value[/\w+$/]
/\w+$/ matches the trailing run of word characters (letters, digits and _) at the end of the string.
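require 'nokogiri'
s = <<EOF
<div class="zg_title">
<!-- href reconstructed for this example; assumed to end at the product ID -->
<a href="http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU">Thermos Foogo Leak-Proof Stainless St...</a>
</div>
EOF
doc = Nokogiri::HTML(s)
p doc.xpath('//div[@class="zg_title"]/a/@href').first.value[/\w+$/]
# => "B000O3GCFU"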
Given that the product code is always preceded by /dp/ and followed by a /:
url[/(?<=\/dp\/)[^\/]+/]
Or, perhaps more readable:
url[%r{(?<=/dp/)[^/]+}]
Alternatively, without using regular expressions:
parts = url.split('/')
parts[parts.index('dp') + 1]
An approach based on available parsers (to please Nicolas Tyler or anyone else who would rather avoid regex for parsing in this sort of case):
require 'uri'
product_uri = product.xpath('//div[@class="zg_title"]/a/@href').first.value
# e.g. http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1
product_path = URI.parse(product_uri).path.split('/')
# => ["", "Thermos-Foogo-Leak-Proof-Stainless-10-Ounce",
# "dp", "B000O3GCFU", "ref=zg_bs_baby-products_1"]
# This relies on the (un-researched) assumption that the position in the path is consistent.
# Now that we have the components, though, we can look at Amazon's documentation and
# select based on position in the path, position relative to some other identifier,
# etc., without risk of a regex mismatch.
product_asin = product_path[3]
# => "B000O3GCFU"
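Selecting the segment relative to the "dp" marker, rather than by fixed position, can be sketched as follows. This is only an illustration: the helper name asin_from is made up here, and it assumes (as the other answers do) that the ID is the path segment immediately after "dp".

```ruby
require 'uri'

# Hypothetical helper (not from the answers above): return the path segment
# that follows "dp", or nil when the URL has no such segment.
def asin_from(url)
  segments = URI.parse(url).path.split('/')
  dp_index = segments.index('dp')
  dp_index && segments[dp_index + 1]
end

asin_from('http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1')
# => "B000O3GCFU"
asin_from('http://www.amazon.com/some/other/page')
# => nil
```

Returning nil instead of raising makes it easier to skip links that are not product pages.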
Related
We are looking to optimize images with a thumbnail version, which are stored under a funky version of the existing URL:
Original Image:
https://image.s3-us-west-2.amazonaws.com/8/flower.jpg
Thumbnail Image:
https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg
I was going to look from the end of the string for the last '/' and replace it with '/thumbnails/medium_'. In my case this is always safe, but I can't figure out how to do this kind of mutation in Ruby on Rails.
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = s.split('/')[-1] # should give 'flower.jpg'
The issue is getting everything before the last '/' so that I can inject 'thumbnails/medium_'. Any ideas?
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = s.insert(s.rindex('/')+1, 'thumbnails/medium_')
# The above approach modifies the original string; if that is undesirable, use:
img_url = s.dup.insert(s.rindex('/')+1, 'thumbnails/medium_')
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = "#{File.dirname(s)}/thumbnails/medium_#{File.basename(s)}"
# => "https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg"
I would probably use URI and Pathname to work with URLs and file paths:
require 'uri'
require 'pathname'
url = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
uri = URI.parse(url)
path = Pathname.new(uri.path)
uri.path = "#{path.dirname}/thumbnails/medium_#{path.basename}"
uri.to_s
#=> "https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg"
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
s.sub(/([^\/]+)$/, 'thumbnails/medium_\1')
The second argument to sub should be single-quoted; in a double-quoted string you have to escape the backslash in the \1 part.
UPDATE
s.sub(/([^\/]+?)(?=$|\?|#)/, 'thumbnails/medium_\1')
This handles the case where the path is followed by a query string, a fragment, or both, any of which may contain slashes.
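To see how the lookahead behaves, here is a small sketch. The helper name thumbnail_url and the query string in the second call are invented for illustration, and it assumes the URL always has a file name in its path.

```ruby
# Hypothetical helper: prefix the last path segment with "thumbnails/medium_".
def thumbnail_url(url)
  # Lazily match a slash-free run that is followed by the end of the string,
  # a "?" or a "#". sub replaces only the first match, which is the file name.
  url.sub(%r{([^/]+?)(?=\z|[?#])}, 'thumbnails/medium_\1')
end

thumbnail_url('https://image.s3-us-west-2.amazonaws.com/8/flower.jpg')
# => "https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg"
thumbnail_url('https://image.s3-us-west-2.amazonaws.com/8/flower.jpg?size=a/b#top')
# => "https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg?size=a/b#top"
```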
What you need is Array#[] with a Range:
# a small performance optimization: no need to split the string twice
parts = s.split('/')
img_url = parts[0..-2].join('/') + "/thumbnails/medium_" + parts[-1]
On a side note: if you are using a Rails plugin for handling images (CarrierWave or Paperclip), you should use its built-in mechanisms for URL interpolation.
I have a URL string:
http://ip-address/user/reset/1/1379631719drush/owad_yta75
into which I need to insert a short string:
help after the third occurrence of "/" so that the new string will be:
http://ip-address/**help**/user/reset/1/1379631719drush/owad_yta75
I cannot use Ruby's string.insert(#, 'string') since the IP address is not always the same length.
I'm looking at using a regex to do that, but I am not exactly sure how to find the third '/' occurrence.
One way to do this:
url = "http://ip-address/user/reset/1/1379631719drush/owad_yta75"
url.split("/").insert(3, 'help').join('/')
# => "http://ip-address/help/user/reset/1/1379631719drush/owad_yta75"
You can do that with capturing groups using a regex like this:
(.*?//.*?/)(.*)
  ^    ^    ^- Everything after 3rd slash
  |    \- domain
  \- protocol
And use the following replacement:
\1help/\2
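In Ruby, the pattern and replacement above could be applied with String#sub, roughly like this. The \A anchor is an addition for this sketch, and the replacement is single-quoted so that \1 and \2 are passed through as backreferences.

```ruby
url = 'http://ip-address/user/reset/1/1379631719drush/owad_yta75'

# Group 1 captures everything up to and including the third slash,
# group 2 captures the rest of the URL.
new_url = url.sub(%r{\A(.*?//.*?/)(.*)}, '\1help/\2')
# => "http://ip-address/help/user/reset/1/1379631719drush/owad_yta75"
```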
The thing you're forgetting is that a URL is just a host plus file path to a resource, so you should take advantage of tools designed to work with those. While it'll seem unintuitive at first, in the long run it'll work better.
require 'uri'
url = 'http://ip-address/user/reset/1/1379631719drush/owad_yta75'
uri = URI.parse(url)
path = uri.path # => "/user/reset/1/1379631719drush/owad_yta75"
dirs = path.split('/') # => ["", "user", "reset", "1", "1379631719drush", "owad_yta75"]
uri.path = (dirs[0,1] + ['help'] + dirs[1 .. -1]).join('/')
uri.to_s # => "http://ip-address/help/user/reset/1/1379631719drush/owad_yta75"
I'm trying to parse an XML document from the Google Directions API.
This is what I've got so far:
x = Nokogiri::XML(GoogleDirections.new("48170", "48104").xml)
x.xpath("//DirectionsResponse//route//leg//step").each do |q|
q.xpath("html_instructions").each do |h|
puts h.inner_html
end
end
The output looks like this:
Head &lt;b&gt;south&lt;/b&gt; on &lt;b&gt;Hidden Pond Dr&lt;/b&gt; toward &lt;b&gt;Ironwood Ct&lt;/b&gt;
Turn &lt;b&gt;right&lt;/b&gt; onto &lt;b&gt;N Territorial Rd&lt;/b&gt;
Turn &lt;b&gt;left&lt;/b&gt; onto &lt;b&gt;Gotfredson Rd&lt;/b&gt;
...
I would like the output to be:
Turn <b>right</b> onto <b>N Territorial Rd</b>
The problem seems to be Nokogiri escaping the html within the xml
I trust Google, but I think it would be also good to sanitize it further to:
Turn right onto N Territorial Rd
But I can't (using sanitize perhaps) without the raw xml. Ideas?
Because I don't have the Google Directions API installed I can't access the XML, but I have a strong suspicion the problem is the result of telling Nokogiri you're dealing with XML. As a result it's going to return you the HTML encoded like it should be in XML.
You can unescape the HTML using something like:
CGI::unescape_html('Head &lt;b&gt;south&lt;/b&gt; on &lt;b&gt;Hidden Pond Dr&lt;/b&gt; toward &lt;b&gt;Ironwood Ct&lt;/b&gt;')
# => "Head <b>south</b> on <b>Hidden Pond Dr</b> toward <b>Ironwood Ct</b>"
unescape_html is an alias to unescapeHTML:
Unescape a string that has been HTML-escaped
CGI::unescapeHTML("Usage: foo &quot;bar&quot; &lt;baz&gt;")
# => "Usage: foo \"bar\" <baz>"
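Putting the two steps from the question together (unescape, then strip the markup), a stdlib-only sketch might look like this. The regex-based tag stripping is a crude assumption that only suits simple, trusted markup such as these instructions; it is not a general HTML sanitizer.

```ruby
require 'cgi'

escaped = 'Turn &lt;b&gt;right&lt;/b&gt; onto &lt;b&gt;N Territorial Rd&lt;/b&gt;'

# Step 1: turn the entities back into real markup.
html = CGI.unescapeHTML(escaped)
# => "Turn <b>right</b> onto <b>N Territorial Rd</b>"

# Step 2 (crude): drop anything that looks like a tag.
plain = html.gsub(/<[^>]*>/, '')
# => "Turn right onto N Territorial Rd"
```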
I had to think about this a bit more. It's something I've run into, but it was one of those things that escaped me during the rush at work. The fix is simple: You're using the wrong method to retrieve the content. Instead of:
puts h.inner_html
Use:
puts h.text
I proved this using:
require 'httpclient'
require 'nokogiri'
# This URL comes from: https://developers.google.com/maps/documentation/directions/#XML
url = 'http://maps.googleapis.com/maps/api/directions/xml?origin=Chicago,IL&destination=Los+Angeles,CA&waypoints=Joplin,MO|Oklahoma+City,OK&sensor=false'
clnt = HTTPClient.new
doc = Nokogiri::XML(clnt.get_content(url))
doc.search('html_instructions').each do |html|
puts html.text
end
Which outputs:
Head <b>south</b> on <b>S Federal St</b> toward <b>W Van Buren St</b>
Turn <b>right</b> onto <b>W Congress Pkwy</b>
Continue onto <b>I-290 W</b>
[...]
The difference is that inner_html is reading the content of the node directly, without decoding. text decodes it for you. text, to_str and inner_text are aliased to content internally in Nokogiri::XML::Node for our parsing pleasure.
Wrap your nodes in CDATA:
def wrap_in_cdata(node)
# Using Nokogiri::XML::Node#content instead of #inner_html (which
# escapes HTML entities) so nested nodes will not work
node.inner_html = node.document.create_cdata(node.content)
node
end
Nokogiri::XML::Node#inner_html escapes HTML entities except in CDATA sections.
fragment = Nokogiri::HTML.fragment "<div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span></div>"
puts fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left &gt; right &gt; straight &amp; reach your destination.</span></div>
fragment.xpath(".//span").each {|node| node.inner_html = node.document.create_cdata(node.content) }
fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span>\n</div>
This is not a great or DRY solution, but it works:
puts h.inner_html.gsub("<b>" , "").gsub("</b>", "").gsub("<div style=\"font-size:0.9em\">", "").gsub("</div>", "")
I have a url such as:
http://www.relevantmagazine.com/life/relationship/blog/23317-pursuing-singleness
And would like to extract just relevantmagazine from it.
Currently I have:
@urlroot = URI.parse(@link.url).host
But it returns www.relevantmagazine.com. Can anyone help me?
Using a gem for this might be overkill, but anyway: there's a handy gem called domainatrix that can extract the site name for you while dealing with things like two-element top-level domains and more.
url = Domainatrix.parse("http://www.pauldix.net")
url.url # => "http://www.pauldix.net" (the original url)
url.public_suffix # => "net"
url.domain # => "pauldix"
url.canonical # => "net.pauldix"
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
How about:
@urlroot = URI.parse(@link.url).host.gsub("www.", "").split(".")[0]
Try this regular expression:
regex = %r{http://[w]*[\.]*[^/|$]*}
If you had the following url strings, it gives the following:
url = 'http://www.google.com/?q=blah'
url.scan(regex) # => ["http://www.google.com"]
url = 'http://google.com/?q=blah'
url.scan(regex) # => ["http://google.com"]
url = 'http://google.com'
url.scan(regex) # => ["http://google.com"]
url = 'http://foo.bar.pauldix.co.uk/asdf.html?q=arg'
url.scan(regex) # => ["http://foo.bar.pauldix.co.uk"]
It's not perfect, but it strips out everything except the prefix and the host name. You can then clean up the prefix with some other code, since you only need to look for http:// or http://www. at the beginning of the string. You may also need to tweak the regex a little if you are going to parse https:// URLs as well. I hope this helps you get started!
Edit:
I reread the question and realized my answer doesn't really do what you asked. It would help to know whether the URLs you're parsing have a set format, such as always starting with www. If they do, you could use a regular expression that extracts everything between the first and second period in the URL. If not, perhaps you could tweak my regex so that it captures everything between the // or www. and the first period. That might be the easiest way to get just the site name with none of the www., .com, .co.uk and such.
Revised regex:
regex = %r{http://[w]*[\.]*[^\.]*}
url = 'http://foo.bar.pauldix.co.uk/asdf.html?q=arg'
url.scan(regex) # => ["http://foo"]
It'll be weird. If you use the regex stuff, you'll probably have to do it incrementally to clean up the url to extract the part you want.
Maybe you can just split it?
URI.parse(@link.url).host.split('.')[1]
Keep in mind that some registered domains may have more than one component to the registered country domain, like .co.uk or .co.jp or .com.au for example.
I found a solution, inspired by tadman's answer and by an answer to another question:
@urlroot = URI.parse(item.url).host
@urlroot = @urlroot.start_with?('www.') ? @urlroot[4..-1] : @urlroot
@urlroot = @urlroot.split('.')[0]
The first line gets the host, the second removes the www. prefix if there is one, and the third keeps everything before the next dot.
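The three steps can be wrapped in one method. This is a sketch only: the name site_name is invented here, and, as noted above, the approach returns the first label of the host, which is not the registered domain when there are extra subdomains.

```ruby
require 'uri'

# Hypothetical helper: first host label after an optional leading "www.".
def site_name(url)
  host = URI.parse(url).host
  return nil if host.nil?
  host.sub(/\Awww\./, '').split('.').first
end

site_name('http://www.relevantmagazine.com/life/relationship/blog/23317-pursuing-singleness')
# => "relevantmagazine"
site_name('http://foo.bar.pauldix.co.uk/asdf.html?q=arg')
# => "foo" (not "pauldix"; handling that case needs a public-suffix list)
```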
In Ruby, I want to replace a given URL in an HTML string.
Here is my unsuccessful attempt:
escaped_url = url.gsub(/\//,"\/").gsub(/\./,"\.").gsub(/\?/,"\?")
path_regexp = Regexp.new(escaped_url)
html.gsub!(path_regexp, new_url)
Note: url is actually a Google Chart request URL I wrote, which will not have more special characters than /?|.=%:
The gsub method can take a string or a Regexp as its first argument, same goes for gsub!. For example:
>> 'here is some ..text.. xxtextxx'.gsub('..text..', 'pancakes')
=> "here is some pancakes xxtextxx"
So you don't need to bother with a regex or escaping at all, just do a straight string replacement:
html.gsub!(url, new_url)
Or better, use an HTML parser to find the particular node you're looking for and do a simple attribute assignment.
I think you're looking for something like:
path_regexp = Regexp.new(Regexp.escape(url))
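A short comparison of the options, using a made-up chart URL and replacement path purely for illustration. It also shows why the attempt in the question fails: inside double quotes "\/" is just "/", so those gsub calls change nothing, and the unescaped ? and | keep their regex meaning.

```ruby
# Illustrative values only; neither URL comes from the question.
url     = 'http://chart.apis.google.com/chart?cht=p3&chd=t:60,40&chs=250x100&chl=Hello|World'
new_url = '/images/chart.png'
html    = %(<img src="#{url}">)

# Unescaped: "?" makes the preceding "t" optional and "|" splits the pattern
# into two alternatives, so only the second alternative matches.
html[Regexp.new(url)]
# => "World"

# Escaped: every metacharacter is matched literally.
literal = Regexp.new(Regexp.escape(url))
html.gsub(literal, new_url)
# => "<img src=\"/images/chart.png\">"

# Plain string replacement gives the same result without any regex.
html.gsub(url, new_url)
# => "<img src=\"/images/chart.png\">"
```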