How can I scrape, parse and crawl files in Ruby? - ruby

I have a number of data files to process from a data warehouse that have the following format:
:header 1 ...
:header n
# remarks 1 ...
# remarks n
# column header 1
# column header 2
DATA ROWS
(Example: "#### ## ## ##### ######## ####### ###afp## ##e###")
The data is separated by white spaces and has both numbers and other ASCII chars. Some of those pieces of data will be split up and made more meaningful.
All of the data will go into a database, initially an SQLite db for development, and then pushed up to another, more permanent, storage.
These files will be pulled in actually via HTTP from the remote server and I will have to crawl a bit to get some of it as they span folders and many files.
I was hopeful to get some input what the best tools and methods may be to accomplish this the "Ruby way", as well as to abstract out some of this. Otherwise, I'll tackle it probably similar to how I would in Perl or other such approaches I've taken before.
I was thinking along the lines of using OpenURI to open each url, then if input is HTML collect links to crawl, otherwise process the data. I would use String.scan to break apart the file appropriately each time into a multi-dimensional array parsing each component based on the established formatting by the data provider. Upon completion, push the data into the database. Move on to next input file/uri. Rinse and repeat.
I figure I must be missing some libs that those with more experience would use to clean/quicken this process up dramatically and make the script much more flexible for reuse on other data sets.
Additionally, I will be graphing and visualizing this data as well as generating reports, so perhaps that should too be considered.
Any input as to perhaps a better approach or libs to simply this?

Your question focuses on a lot on "low level" details -- parsing URL's and so on. One key aspect of the "Ruby Way" is "Don't reinvent the wheel." Leverage existing libraries. :)
My recommendation? First, leverage a crawler such as spider or anemone. Second, use Nokogiri for HTML/XML parsing. Third, store the results. I recommend this because you might do different analyses later and you don't want to throw away the hard work of your spidering.
Without knowing too much about your constraints, I would look at storing your results in MongoDB. After thinking this, I did a quick search and found a nice tutorial Scraping a blog with Anemone and MongoDB.

I've written probably a bajillion spiders and site analyzers and find that Ruby has some nice tools that should make this an easy process.
OpenURI makes it easy to retrieve pages.
URI.extract makes it easy to find links in pages. From the docs:
Description
Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.
require "uri"
URI.extract("text here http://foo.example.org/bla and here mailto:test#example.com and here also.")
# => ["http://foo.example.com/bla", "mailto:test#example.com"]
Simple, untested, logic to start might look like:
require "openuri"
require "uri"
urls_to_scan = %w[
http://www.example.com/page1
http://www.example.com/page2
]
loop do
break if urls_to_scan.empty?
url = urls_to_scan.shift
html = open(url).read
# you probably want to do something to make sure the URLs are not
# pointing outside the site you're walking.
#
# Something like:
#
# URI.extract(html).select{ |u| u[%r{^http://www\.example\.com}i] }
#
new_urls = URI.extract(html)
if (new_urls.any?)
urls_to_scan += new_urls
else
; # parse your file as data using the content in html
end
end
Unless you own the site you're crawling, you want to be kind and gentle: Don't run as fast as possible because it's not your pipe. Pay attention to the site's robot.txt file or risk being banned.
There are true web-crawler gems for Ruby, but the basic task is so simple I never bother with them. If you want to check out other alternatives, visit some of the links to the right for other questions on SO that touch on this subject.
If you need more power or flexibility, the Nokogiri gem makes short work of parsing HTML, allowing you to use CSS accessors to search for tags of interest. There are some pretty powerful gems for making it easy to grab pages such as typhoeus.
Finally, while ActiveRecord, which is recommended in some comments, is nice, finding documentation for using it outside of Rails can be difficult or confusing. I recommend using Sequel. It is a great ORM, very flexible, and well documented.

Hi I would start by taking a very close look at the gem called Mechanize before firing up any basic open-uri stuff - cause it's build into mechanize. It's a brilliant, fast, and easy to use gem for automating web-crawling. Since your data-format is pretty strange (at least compared to json, xml or html) I don't think you will make any use of the build-in parser - but you could still take a look at it. it's called nokogiri and is extremely smart as well. But in the last end, after crawling and fetching the resources, you will probably have to go with some good old regular expression stuff.
Good luck!

Related

Accessing and scraping sporadically available Wikipedia sections

I need to fetch some data but I'm completely stumped after trying a few things.
I want to access Airlines & Destinations from the Albuquerque_International_Sunport's wiki page - keep in mind, I'll be going through a prepopulated list of airports with this data.
There are multiple "types" of Airlines: Passenger, Cargo, sometimes there's other (sub?)sections; other times there are none:
Articles for multiple airports will be accessed automatically - including some less known airports. This means I need to:
Check if "Airlines & Destinations" section exists
Take all data inside of any table
Scrape it; otherwise do nothing
I've tried using the ruby wikipedia-client gem however, the .raw_data method isn't even returning the section data:
Next, I went to Wikipedia's API: unless I am mistaken, but it doesn't return "section" names! This doesn't seem right but I wasn't able to get it working.
So I suppose that leaves Nokogiri. I can grab and parse the pages fine, but:
How would I go about detecting "Airlines & Destinations" section presence, getting all table data BEFORE end of section? I have a suspicion I need some tricky Xpath for this.
Seems to be the only viable solution.
Any thoughts welcome. Putting a bounty on this question when I can.
Edit: Perhaps it's better to simply somehow grab a list of all airlines in the world and hit them against HTML? Seems like it could be computationally expensive.
Well, I'm not an expert user of Nokogiri but maybe this can give you some idea.
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open("https://en.wikipedia.org/wiki/Albuquerque_International_Sunport"))
# this is the passenger table
page.xpath('//*[#id="mw-content-text"]/div/table[2]/tr').each do |tr|
p tr.text()
puts "-"*50
end
# this is the cargo table
page.xpath('//*[#id="mw-content-text"]/div/table[3]/tr').each do |tr|
p tr.text()
puts "-"*50
end

How do I extract a HTML topic heading from a web page?

Given a page like "What popular startup advice is plain wrong?", I'd like to be able to extract the first topic under the topic heading on the upper right hand side, in this case, "Common Misconceptions".
What's the best way for me to do this in Ruby? Is it with Nokogiri or a regex? Presumably I need to do some HTML parsing?
First, you almost never, ever, want to use regular expressions to parse/extract/fold/spindle/mutilate XML or HTML. There are too many ways it can go wrong. Regular expressions are great for some jobs, but XML/HTML extractions are not a good fit.
That said, here's what I'd do using Nokogiri:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.quora.com/What-popular-startup-advice-is-plain-wrong'))
topic = doc.at('span a.topic_name span').content
puts topic
Running that outputs:
Common Misconceptions
The code is taking a couple shortcuts, that should work consistently:
Using Ruby's OpenURI allows easy accessing of Internet resources. It's my go-to for most simple to average apps. There are more powerful tools but none as convenient.
doc.at tells Nokogiri to traverse the document, and find the first occurrence of the CSS accessor 'span a.topic_name span', which should be consistent in that page as the first entry.
Note that Nokogiri supports some variants of searching for a node: at vs. search. at and % and things like css_at find the first occurrence and return a Node, which is an individual tag or text or comment. search, /, and those variants return a NodeSet which is like an array of Nodes. You'll have to walk that list or extract the individual nodes you want using some sort of Array accessor. In the above code I could have said doc.search(...).first to get the node I wanted.
Nokogiri also supports using XPath accessors, but for most things I'll usually go with CSS. It's simpler, and easier to read, but your mileage might vary.

How do I scrape a site, with multiple pages, and create one single html page with Ruby?

So what I would like to do is scrape this site: http://boxerbiography.blogspot.com/
and create one HTML page that I can either print or send to my Kindle.
I am thinking of using Hpricot, but am not too sure how to proceed.
How do I set it up so it recursively checks each link, gets the HTML, either stores it in a variable or dumps it to the main HTML page and then goes back to the table of contents and keeps doing that?
You don't have to tell me EXACTLY how to do it, but just the theory behind how I might want to approach it.
Do I literally have to look at the source of one of the articles (which is EXTREMELY ugly btw), e.g. view-source:http://boxerbiography.blogspot.com/2006/12/10-progamer-lim-yohwan-e-sports-icon.html and manually programme the script to extract text between certain tags (e.g. h3, p, etc.)?
If I do that approach, then I will have to look at each individual source for each chapter/article and then do that. Kinda defeats the purpose of writing a script to do it, no?
Ideally I would like a script that will be able to tell the difference between JS and other code and just the 'text' and dump it (formatted with the proper headings and such).
Would really appreciate some guidance.
Thanks.
I'd recomment using Nokogiri instead of Hpricot. It's more robust, uses less resources, fewer bugs, it's easier to use, and faster.
I did some scraping extensively for work on time, and had to switch to Nokogiri, because Hpricot would crash on some pages unexplicably.
Check this RailsCast:
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
and:
http://nokogiri.org/
http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html
http://www.engineyard.com/blog/2010/getting-started-with-nokogiri/

find repeat patterns in webpages in ruby

I am trying to find a way of finding repeat patterns in webpages so that i can extract the content into my database.
EDIT : I don't know what the repeat pattern is before hand so i can't just search for a given pattern via a regex or something.
For example if you have 10 sites selling cars but the sites are all different, looking on each site the cars are listed in html in a repeated way down the page for this site.
The other sites will be listed in a different way but each with a repeated pattern.
Does anyone know how, or have any experience of this sort of thing?
i love ruby so was hoping to do it in ruby if any one has or knows of any libs / gems that may help me out ?
Rick, machine pattern matching is a complicated topic, and not something that you'll find a good library for out of the box on Ruby.
Kyle's answer was a start, once you get the page with Ruby, the typical techology for this would be xpath or "The XML Path Language".
Using Xpath you can write simple selectors that will extract every item matching a pattern, for instance, every link on an HTML document might be //a, every h1 would be //h1, and every image directly inside a div, where the image has the class "car" would be something like: //div/image[class="car"].
The result of the XPath is an enumerable list of each item, you can then query for sub-elements, get the content() of the elements, and build relationships to extract the data you need.
The go-to library for Ruby is called Nokogiri, and is avaiable as a gem - the direct documentation is a little weak, but it's all covered there if you know what to look for.
Some libraries for Ruby combine the crawling, with an easy way to access the underlying HTML/XML as a Nokogiri document, one such example is Anemone which is a "framework for building web spiders in Ruby" - and I can recomment it very highly.
In Ruby, if you want to get the text of a webpage all you have to do is use the Net::HTTP namespace. The get method returns a string representation of the webpage.
Net::HTTP.get 'http://www.target-site.com', '/target-page.html'
You're probably going to want to use some sort of XML Parser after that to make a model of the page and navigate over it. I've heard good things about Hpricot.

Extracting a Hostname's TLD with a Regular Expression

Extracting an accurate representation of the top-level domain of a hostname is complicated by the fact that each top-level domain registry is free to make up its own policies regarding how domains are issued and what subdomains are defined. As there doesn't appear to be any standards body coordinating these or establishing standards, this has made determining the actual TLD a somewhat complicated affair.
Since web browsers assign cookies only to registered domains, and for security reasons must be vigilant about ensuring cookies cannot be assigned on a broader level, these browsers typically contain a database of all known TLDs in some form. I've found that Firefox has a fairly complete database:
http://hg.mozilla.org/mozilla-central/raw-file/3f91606bd115/netwerk/dns/effective_tld_names.dat
I have two specific questions:
Although it is fairly trivial to convert this listing into a regular expression, is there a gem or reference regexp that's a better solution than rolling your own? The tld gem only provides country-level info for the root-level domain.
Is there a better reference than the Firefox TLD listing? All of the local Google sites are correctly parsed by this specification, but that's hardly an exhaustive test.
If there's nothing out there, is anyone interested in a gem that performs this kind of operation? This sort of thing should be present in the URI module but is apparently missing.
Here's my take on converting this file into a usable Regexp in Ruby:
TLD_SPEC = Regexp.new(
'[^\.]+\.(' + %q[
// ***** BEGIN LICENSE BLOCK *****
// ... (Rest of file)
].split(/\n/).collect do |line|
line.sub(%r[//.*], '').sub(/\s+$/, '')
end.reject(&:blank?).collect do |s|
Regexp.escape(s).sub(/^\\\*\\\./, '[^\.]+\.')
end.join('|') + ')$'
)
You might want to look into using Addressable to see if that has what you need. It's got a lot more features than Ruby's default URI library. In particular, its template ability might help you.
From the docs:
Addressable is a replacement for the URI implementation that is part of Ruby's standard library. It more closely conforms to the relevant RFCs and adds support for IRIs and URI templates. Additionally, it provides extensive support for URI templates.
With the recent opening of the new TLDs, it's going to be a nightmare for a while. Check out the related list to the right to see how many people are trying to find a solution. Regex to match Domain.CCTLD recommends using a function to break it down into smaller steps and is what I'd do. Trying to do this with a regex assumes you can do it all in one expression, which starts to smell like using regex to parse XML or HTML. The target is too wiggly for a single pattern, or at least for a single maintainable pattern.
That answer mentions the public TLD list. Using the information there you could quickly use Ruby's Regexp.escape and Regexp.union methods to build a reasonably good regex on the fly. It'd be nice if we had Perl's Regexp::Assemble module available to us, but we don't so union will have to do. (See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for a way to work around this.)
There is another flat-file db here at http://guava-libraries.googlecode.com/svn-history/r42/trunk/src/com/google/common/net/TldPatterns.java
Perhaps you could combine the 2, and upload it to somewhere like OData.org, github, sourceforge, etc.
There's a gem called public-suffix-list which provides access to a more formalized version of the Mozilla listing.

Resources