How do I extract an HTML topic heading from a web page? - ruby

Given a page like "What popular startup advice is plain wrong?", I'd like to be able to extract the first topic under the topic heading on the upper right hand side, in this case, "Common Misconceptions".
What's the best way for me to do this in Ruby? Is it with Nokogiri or a regex? Presumably I need to do some HTML parsing?

First, you almost never, ever, want to use regular expressions to parse/extract/fold/spindle/mutilate XML or HTML. There are too many ways it can go wrong. Regular expressions are great for some jobs, but XML/HTML extractions are not a good fit.
That said, here's what I'd do using Nokogiri:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://www.quora.com/What-popular-startup-advice-is-plain-wrong'))
topic = doc.at('span a.topic_name span').content
puts topic
Running that outputs:
Common Misconceptions
The code takes a couple of shortcuts that should work consistently:
Using Ruby's OpenURI allows easy accessing of Internet resources. It's my go-to for most simple to average apps. There are more powerful tools but none as convenient.
doc.at tells Nokogiri to traverse the document, and find the first occurrence of the CSS accessor 'span a.topic_name span', which should be consistent in that page as the first entry.
Note that Nokogiri supports two families of search methods: at vs. search. at, %, and variants like at_css find the first occurrence and return a Node, which is an individual tag, text, or comment. search, /, and their variants return a NodeSet, which is like an array of Nodes. You'll have to walk that list or extract the individual nodes you want using some sort of Array accessor. In the above code I could have said doc.search(...).first to get the node I wanted.
Nokogiri also supports using XPath accessors, but for most things I'll usually go with CSS. It's simpler, and easier to read, but your mileage might vary.

Related

Wiki quotes API?

I would want to get a structured version of a Wikiquote page via JSON (basically I need all phrases)
Example: http://en.wikiquote.org/wiki/Fight_Club_(film)
I tried with: http://en.wikiquote.org/w/api.php?format=xml&action=parse&page=Fight_Club_(film)&prop=text
but I get all the HTML source code. I need each phrase as an element of an Array.
How could I achieve that with DBpedia?
For one thing, I am not sure whether you can query Wikiquote using DBpedia, and secondly, DBpedia only gives you infobox data in a structured way; it does not in any way provide the article content in a structured way. Instead, with a little bit of trouble, you can use the MediaWiki API to get the data.
EDIT
The URI you are trying gives you a text so this will make things easier, but not completely.
Try this piece of code in your console:
require 'nokogiri'
require 'open-uri'
require 'json'
content = JSON.parse(URI.open("http://en.wikiquote.org/w/api.php?format=json&action=parse&page=Fight_Club_%28film%29&prop=text").read)
data = content['parse']['text']['*']
xpath_data = Nokogiri::HTML(data)
xpath_data.xpath("//ul/li").map { |data_node| data_node.text }
This is the closest I have come to an answer. Of course it is not completely right, because you will get a lot of unnecessary data. But if you dig into Nokogiri and XPath and find out how to pinpoint the nodes you need, you can get a solution which will give you correct quotes at least 90% of the time.
Just change the format to JSON. Look up the Wikipedia API for more details.
http://en.wikiquote.org/w/api.php?format=json&action=parse&page=Fight_Club_(film)&prop=text

get the ASIN number from amazon URL

Given an Amazon product URL, which can be any of
http://amazon.com/gp/product/ASIN/*
http://amazon.com/*/dp/ASIN/*
http://amazon.com/dp/ASIN/*
how do I scrape the ASIN number from the URL in Ruby? I am not good at writing regular expressions.
You can find the match with:
scan(/https?:\/\/(?:www\.|)amazon\.com\/(?:gp\/product|[^\/]+\/dp|dp)\/([^\/]+)/)
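As a rough usage sketch, the same pattern can be applied to each of the three URL shapes. The URLs and the ASIN "B00EXAMPLE" below are made up for illustration, and the pattern is written with %r{} only to avoid escaping the slashes:

```ruby
# Same pattern as above, written with %r{} so the slashes need no escaping.
asin_pattern = %r{https?://(?:www\.|)amazon\.com/(?:gp/product|[^/]+/dp|dp)/([^/]+)}

# Hypothetical URLs; "B00EXAMPLE" is a placeholder, not a real ASIN.
urls = [
  'http://amazon.com/gp/product/B00EXAMPLE/ref=foo',
  'http://www.amazon.com/Some-Title/dp/B00EXAMPLE/ref=bar',
  'https://amazon.com/dp/B00EXAMPLE/'
]

# String#[] with a regex and a group index returns that capture, or nil.
asins = urls.map { |url| url[asin_pattern, 1] }
puts asins.inspect
```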
If you are going to do a lot of URL parsing, I'd recommend looking at the Addressable::URI gem. It will make it a lot easier to maintain than parsing URLs with regex. Take a look at its Template module too, which is designed just for this purpose.
Look at the examples on the main Addressable page for more information.
You could also use Ruby's built-in URI module, to get the path using path, along with a simple string split and some logic to look at which element has the "dp" and then take the next element in the array or "gp" and take the second following element.
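A minimal sketch of that split-and-look approach, using only the standard library. The method name asin_from and the sample URLs are hypothetical, and it assumes no other path segment happens to be "dp" or "gp":

```ruby
require 'uri'

# Returns the path segment that follows "dp", or the one two places
# after "gp" (i.e. after "gp/product"); nil if neither marker is found.
def asin_from(url)
  segments = URI.parse(url).path.split('/').reject(&:empty?)
  if (i = segments.index('dp'))
    segments[i + 1]
  elsif (i = segments.index('gp'))
    segments[i + 2]
  end
end

# Hypothetical URLs; "B00EXAMPLE" is a placeholder, not a real ASIN.
puts asin_from('http://amazon.com/gp/product/B00EXAMPLE/ref=foo')
puts asin_from('http://amazon.com/Some-Title/dp/B00EXAMPLE/ref=bar')
puts asin_from('http://amazon.com/dp/B00EXAMPLE/')
```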

Partial Markdown parsing

I have an application that needs to parse a subset of Markdown. I basically only want to support inline elements (bold, italic, links, etc), not block level elements (p, h1, h2, etc).
There are a lot of different libraries, so I need some help narrowing it down (and a code sample would be helpful). I started using RedCarpet until I realized that I can't specify which elements I want to parse.
What Ruby Markdown library can I use to achieve this?
I haven't found a library that allows you to specify on a granular level what parts of Markdown syntax are allowed. RDiscount has some configurability, however it doesn't take into account block level elements.
You could also give Sanitize a try (I know, parsing twice isn't exactly an ideal solution) and strip out the elements you don't want afterward.

How can I scrape, parse and crawl files in Ruby?

I have a number of data files to process from a data warehouse that have the following format:
:header 1 ...
:header n
# remarks 1 ...
# remarks n
# column header 1
# column header 2
DATA ROWS
(Example: "#### ## ## ##### ######## ####### ###afp## ##e###")
The data is separated by white spaces and has both numbers and other ASCII chars. Some of those pieces of data will be split up and made more meaningful.
All of the data will go into a database, initially an SQLite db for development, and then pushed up to another, more permanent, storage.
These files will be pulled in actually via HTTP from the remote server and I will have to crawl a bit to get some of it as they span folders and many files.
I was hoping to get some input on what the best tools and methods might be to accomplish this the "Ruby way", as well as to abstract out some of this. Otherwise, I'll probably tackle it similarly to how I would in Perl, or with other approaches I've taken before.
I was thinking along the lines of using OpenURI to open each url, then if input is HTML collect links to crawl, otherwise process the data. I would use String.scan to break apart the file appropriately each time into a multi-dimensional array parsing each component based on the established formatting by the data provider. Upon completion, push the data into the database. Move on to next input file/uri. Rinse and repeat.
I figure I must be missing some libs that those with more experience would use to clean/quicken this process up dramatically and make the script much more flexible for reuse on other data sets.
Additionally, I will be graphing and visualizing this data as well as generating reports, so perhaps that should too be considered.
Any input as to perhaps a better approach or libs to simplify this?
Your question focuses a lot on "low level" details -- parsing URLs and so on. One key aspect of the "Ruby Way" is "Don't reinvent the wheel." Leverage existing libraries. :)
My recommendation? First, leverage a crawler such as spider or anemone. Second, use Nokogiri for HTML/XML parsing. Third, store the results. I recommend this because you might do different analyses later and you don't want to throw away the hard work of your spidering.
Without knowing too much about your constraints, I would look at storing your results in MongoDB. After thinking this, I did a quick search and found a nice tutorial Scraping a blog with Anemone and MongoDB.
I've written probably a bajillion spiders and site analyzers and find that Ruby has some nice tools that should make this an easy process.
OpenURI makes it easy to retrieve pages.
URI.extract makes it easy to find links in pages. From the docs:
Description
Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.
require "uri"
URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.")
# => ["http://foo.example.org/bla", "mailto:test@example.com"]
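If only web links on one site are of interest, URI.extract also accepts a list of schemes, and the result can be filtered by host. This is just a sketch; the text and host names are placeholders:

```ruby
require 'uri'

# Hypothetical page text; the hosts are placeholders.
text = 'See http://www.example.com/page2 and http://other.example.org/x ' \
       'or write to mailto:test@example.com today.'

# Passing a list of schemes limits the matches to those schemes.
http_links = URI.extract(text, %w[http https])

# Keep only links that stay on the site being walked.
same_site = http_links.select { |u| URI.parse(u).host == 'www.example.com' }

puts http_links.inspect
puts same_site.inspect
```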
Simple, untested, logic to start might look like:
require "open-uri"
require "uri"
urls_to_scan = %w[
  http://www.example.com/page1
  http://www.example.com/page2
]
loop do
  break if urls_to_scan.empty?
  url = urls_to_scan.shift
  html = URI.open(url).read
  # you probably want to do something to make sure the URLs are not
  # pointing outside the site you're walking.
  #
  # Something like:
  #
  # URI.extract(html).select{ |u| u[%r{^http://www\.example\.com}i] }
  #
  new_urls = URI.extract(html)
  if new_urls.any?
    urls_to_scan += new_urls
  else
    # parse your file as data using the content in html
  end
end
Unless you own the site you're crawling, you want to be kind and gentle: Don't run as fast as possible because it's not your pipe. Pay attention to the site's robots.txt file or risk being banned.
There are true web-crawler gems for Ruby, but the basic task is so simple I never bother with them. If you want to check out other alternatives, visit some of the links to the right for other questions on SO that touch on this subject.
If you need more power or flexibility, the Nokogiri gem makes short work of parsing HTML, allowing you to use CSS accessors to search for tags of interest. There are some pretty powerful gems for making it easy to grab pages such as typhoeus.
Finally, while ActiveRecord, which is recommended in some comments, is nice, finding documentation for using it outside of Rails can be difficult or confusing. I recommend using Sequel. It is a great ORM, very flexible, and well documented.
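For the "parse your file as data" branch of the loop, a rough sketch based on the layout described in the question might look like this. It assumes header lines start with ":", remark and column-header lines start with "#", and everything else is a whitespace-separated data row. The method name parse_data_file and the sample content are made up:

```ruby
# Splits one file's text into headers, remark lines and data rows.
# Assumes data rows never begin with ":" or "#".
def parse_data_file(text)
  result = { headers: [], remarks: [], rows: [] }
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    if line.start_with?(':')
      result[:headers] << line[1..-1].strip
    elsif line.start_with?('#')
      result[:remarks] << line[1..-1].strip
    else
      result[:rows] << line.split # split on runs of whitespace
    end
  end
  result
end

# Hypothetical file content, invented for illustration.
sample = <<~DATA
  :station 42
  # a remark
  # id value flag
  1001 3.5 afp
  1002 4.0 e
DATA

parsed = parse_data_file(sample)
puts parsed.inspect
```

Each row comes back as an array of strings, so splitting individual fields further or converting them to numbers would happen just before the insert into the database.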
I would start by taking a very close look at the gem called Mechanize before firing up any basic open-uri stuff, because that functionality is built into Mechanize. It's a brilliant, fast, and easy-to-use gem for automating web crawling. Since your data format is pretty strange (at least compared to JSON, XML or HTML), I don't think you will make much use of the built-in parser, but you could still take a look at it. It's called Nokogiri and is extremely smart as well. In the end, though, after crawling and fetching the resources, you will probably have to go with some good old regular expressions.
Good luck!

find repeat patterns in webpages in ruby

I am trying to find a way of finding repeat patterns in webpages so that I can extract the content into my database.
EDIT: I don't know what the repeat pattern is beforehand, so I can't just search for a given pattern via a regex or something.
For example if you have 10 sites selling cars but the sites are all different, looking on each site the cars are listed in html in a repeated way down the page for this site.
The other sites will be listed in a different way but each with a repeated pattern.
Does anyone know how, or have any experience of this sort of thing?
I love Ruby, so I was hoping to do it in Ruby, if anyone has or knows of any libs/gems that may help me out.
Rick, machine pattern matching is a complicated topic, and not something that you'll find a good library for out of the box on Ruby.
Kyle's answer was a start: once you get the page with Ruby, the typical technology for this would be XPath, or "The XML Path Language".
Using XPath you can write simple selectors that will extract every item matching a pattern. For instance, every link in an HTML document would be //a, every h1 would be //h1, and every image directly inside a div, where the image has the class "car", would be something like //div/img[@class="car"].
The result of the XPath is an enumerable list of each item, you can then query for sub-elements, get the content() of the elements, and build relationships to extract the data you need.
The go-to library for Ruby is called Nokogiri, and is available as a gem. The direct documentation is a little weak, but it's all covered there if you know what to look for.
Some libraries for Ruby combine the crawling with an easy way to access the underlying HTML/XML as a Nokogiri document. One such example is Anemone, which is a "framework for building web spiders in Ruby", and I can recommend it very highly.
In Ruby, if you want to get the text of a webpage, all you have to do is use the Net::HTTP class. Its get method takes a host name (without the scheme) and a path, and returns the page body as a string.
require 'net/http'
Net::HTTP.get('www.target-site.com', '/target-page.html')
You're probably going to want to use some sort of XML Parser after that to make a model of the page and navigate over it. I've heard good things about Hpricot.

Resources