get the ASIN number from amazon URL - ruby

Given a amazon product URL, which can either be
http://amazon.com/gp/product/ASIN/*
http://amazon.com/*/dp/ASIN/*
http://amazon.com/dp/ASIN/*
how do i scrap the ASIN number from the URL in Ruby ? I am not good at writing regular expressions.

Use should find match by:
scan(/https?:\/\/(?:www\.|)amazon\.com\/(?:gp\/product|[^\/]+\/dp|dp)\/([^\/]+)/)

If you are going to do a lot of URL parsing, I'd recommend looking at the Addressable::URI gem. It will make it a lot easier to maintain than parsing URLs with regex. Take a look at its Template module too, which is designed just for this purpose.
Look at the examples on the main Addressable page for more information.
You could also use Ruby's built-in URI module, to get the path using path, along with a simple string split and some logic to look at which element has the "dp" and then take the next element in the array or "gp" and take the second following element.

Related

How can I find the content between multiple parameter strings (E.G. /contact-group/{ID}/member/{CONTACT-ID})

I had searched around but couldn't find something for it specifically. I was looking for a way to find the content of a URL (In this case these are URIs in a rest API)
A few examples of these look like:
/currency/{currency-id}
Or
/contact-group/{ID}/member/{CONTACT-ID}
The parameters always can be different, however they always are between {}, in different forms within the string. I know how I can replace these when there is only one in the URI without issue, but at runtime the programmer won't know these, and I'm trying to prevent having to define them and because of this when URIs contain multiple parameters I'm not sure how to obtain each case of them.
Happy for any ideas on how to get around this!
Seems like you're looking for a basic example of routing:
# in config/routes.rb
get "/:param_1/:param_2", to: "MyController#some_action"
Then in the controller you'd be able to get params[:param_1] and such.
You can see Rails' routing guide for more info
Maybe I'm not totally understanding your question, though. If you're looking to be able to capture a variable number of params, there's a special syntax for passing arrays in the query param.
See this: Passing array of parameters through get in rails
The answer to this was here
Basically using parameterset = url.scan(/{.+?}/) (replace url with your string name), and what's in scan with your parameter list, I can use this to do
parameterset.each { |x| x.... etc}

Wiki quotes API?

I would want to get a structured version of a Wikiquote page via JSON (basically I need all phrases)
Example: http://en.wikiquote.org/wiki/Fight_Club_(film)
I tried with: http://en.wikiquote.org/w/api.php?format=xml&action=parse&page=Fight_Club_(film)&prop=text
but I get all HTML source code. I need each pharse as an element of an Array
How could I achieve that with DBPEDIA?
For one thing Iam not sure whether you can query wiki quotes using DBpedia and secondly, DBpedia gives you only info box data in a structured way, it does not in a any way the article content in a structured way. Instead with a little bit of trouble you can use the Media wiki api to get the data
EDIT
The URI you are trying gives you a text so this will make things easier, but not completely.
Try this piece of code in your console:
require 'Nokogiri'
content = JSON.parse(open("http://en.wikiquote.org/w/api.php?format=json&action=parse&page=Fight_Club_%28film%29&prop=text").read)
data = content['parse']['text']['*']
xpath_data = Nokogiri::HTML data
xpath_data.xpath("//ul/li").map{|data_node| data_node.text}
This is the closest I have come to an answer, of course this is not completely right because you will get a lot on unnecessary data. But if you dig into Nokogiri and xpath and find out how to pin point the nodes you need you can get a solution which will give you correct quotes at least 90% of the time.
Just change the format to JSON. Look up the Wikipedia API for more details.
http://en.wikiquote.org/w/api.php?format=json&action=parse&page=Fight_Club_(film)&prop=text

How do I extract a HTML topic heading from a web page?

Given a page like "What popular startup advice is plain wrong?", I'd like to be able to extract the first topic under the topic heading on the upper right hand side, in this case, "Common Misconceptions".
What's the best way for me to do this in Ruby? Is it with Nokogiri or a regex? Presumably I need to do some HTML parsing?
First, you almost never, ever, want to use regular expressions to parse/extract/fold/spindle/mutilate XML or HTML. There are too many ways it can go wrong. Regular expressions are great for some jobs, but XML/HTML extractions are not a good fit.
That said, here's what I'd do using Nokogiri:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.quora.com/What-popular-startup-advice-is-plain-wrong'))
topic = doc.at('span a.topic_name span').content
puts topic
Running that outputs:
Common Misconceptions
The code is taking a couple shortcuts, that should work consistently:
Using Ruby's OpenURI allows easy accessing of Internet resources. It's my go-to for most simple to average apps. There are more powerful tools but none as convenient.
doc.at tells Nokogiri to traverse the document, and find the first occurrence of the CSS accessor 'span a.topic_name span', which should be consistent in that page as the first entry.
Note that Nokogiri supports some variants of searching for a node: at vs. search. at and % and things like css_at find the first occurrence and return a Node, which is an individual tag or text or comment. search, /, and those variants return a NodeSet which is like an array of Nodes. You'll have to walk that list or extract the individual nodes you want using some sort of Array accessor. In the above code I could have said doc.search(...).first to get the node I wanted.
Nokogiri also supports using XPath accessors, but for most things I'll usually go with CSS. It's simpler, and easier to read, but your mileage might vary.

find repeat patterns in webpages in ruby

I am trying to find a way of finding repeat patterns in webpages so that i can extract the content into my database.
EDIT : I don't know what the repeat pattern is before hand so i can't just search for a given pattern via a regex or something.
For example if you have 10 sites selling cars but the sites are all different, looking on each site the cars are listed in html in a repeated way down the page for this site.
The other sites will be listed in a different way but each with a repeated pattern.
Does anyone know how, or have any experience of this sort of thing?
i love ruby so was hoping to do it in ruby if any one has or knows of any libs / gems that may help me out ?
Rick, machine pattern matching is a complicated topic, and not something that you'll find a good library for out of the box on Ruby.
Kyle's answer was a start, once you get the page with Ruby, the typical techology for this would be xpath or "The XML Path Language".
Using Xpath you can write simple selectors that will extract every item matching a pattern, for instance, every link on an HTML document might be //a, every h1 would be //h1, and every image directly inside a div, where the image has the class "car" would be something like: //div/image[class="car"].
The result of the XPath is an enumerable list of each item, you can then query for sub-elements, get the content() of the elements, and build relationships to extract the data you need.
The go-to library for Ruby is called Nokogiri, and is avaiable as a gem - the direct documentation is a little weak, but it's all covered there if you know what to look for.
Some libraries for Ruby combine the crawling, with an easy way to access the underlying HTML/XML as a Nokogiri document, one such example is Anemone which is a "framework for building web spiders in Ruby" - and I can recomment it very highly.
In Ruby, if you want to get the text of a webpage all you have to do is use the Net::HTTP namespace. The get method returns a string representation of the webpage.
Net::HTTP.get 'http://www.target-site.com', '/target-page.html'
You're probably going to want to use some sort of XML Parser after that to make a model of the page and navigate over it. I've heard good things about Hpricot.

Extracting a Hostname's TLD with a Regular Expression

Extracting an accurate representation of the top-level domain of a hostname is complicated by the fact that each top-level domain registry is free to make up its own policies regarding how domains are issued and what subdomains are defined. As there doesn't appear to be any standards body coordinating these or establishing standards, this has made determining the actual TLD a somewhat complicated affair.
Since web browsers assign cookies only to registered domains, and for security reasons must be vigilant about ensuring cookies cannot be assigned on a broader level, these browsers typically contain a database of all known TLDs in some form. I've found that Firefox has a fairly complete database:
http://hg.mozilla.org/mozilla-central/raw-file/3f91606bd115/netwerk/dns/effective_tld_names.dat
I have two specific questions:
Although it is fairly trivial to convert this listing into a regular expression, is there a gem or reference regexp that's a better solution than rolling your own? The tld gem only provides country-level info for the root-level domain.
Is there a better reference than the Firefox TLD listing? All of the local Google sites are correctly parsed by this specification, but that's hardly an exhaustive test.
If there's nothing out there, is anyone interested in a gem that performs this kind of operation? This sort of thing should be present in the URI module but is apparently missing.
Here's my take on converting this file into a usable Regexp in Ruby:
TLD_SPEC = Regexp.new(
'[^\.]+\.(' + %q[
// ***** BEGIN LICENSE BLOCK *****
// ... (Rest of file)
].split(/\n/).collect do |line|
line.sub(%r[//.*], '').sub(/\s+$/, '')
end.reject(&:blank?).collect do |s|
Regexp.escape(s).sub(/^\\\*\\\./, '[^\.]+\.')
end.join('|') + ')$'
)
You might want to look into using Addressable to see if that has what you need. It's got a lot more features than Ruby's default URI library. In particular, its template ability might help you.
From the docs:
Addressable is a replacement for the URI implementation that is part of Ruby's standard library. It more closely conforms to the relevant RFCs and adds support for IRIs and URI templates. Additionally, it provides extensive support for URI templates.
With the recent opening of the new TLDs, it's going to be a nightmare for a while. Check out the related list to the right to see how many people are trying to find a solution. Regex to match Domain.CCTLD recommends using a function to break it down into smaller steps and is what I'd do. Trying to do this with a regex assumes you can do it all in one expression, which starts to smell like using regex to parse XML or HTML. The target is too wiggly for a single pattern, or at least for a single maintainable pattern.
That answer mentions the public TLD list. Using the information there you could quickly use Ruby's Regexp.escape and Regexp.union methods to build a reasonably good regex on the fly. It'd be nice if we had Perl's Regexp::Assemble module available to us, but we don't so union will have to do. (See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for a way to work around this.)
There is another flat-file db here at http://guava-libraries.googlecode.com/svn-history/r42/trunk/src/com/google/common/net/TldPatterns.java
Perhaps you could combine the 2, and upload it to somewhere like OData.org, github, sourceforge, etc.
There's a gem called public-suffix-list which provides access to a more formalized version of the Mozilla listing.

Resources