Ruby/Rails HTML page parsing - ruby

I want to parse a web page (a catalog) using some Ruby library and store the results in a database. Currently it is hard for me to choose which library is best for this kind of purpose. I'm familiar with Hpricot, but I'm not really sure that it is still cutting-edge nowadays.
P.S. - Or is there some other way to parse data from URLs?
Thank you!

I think Nokogiri together with open-uri is best for HTML parsing.

Why do you care whether a library is "on the edge" nowadays? If you feel confident with Hpricot, then use it. Don't waste your time on an endless search: just start writing the program. That is my answer.

Hehe, I was looking to quote Hpricot author on this matter, and I've found this comment:
Hpricot was the work of the hacker _why who has now disappeared. But
even before he disappeared nokogiri overtook hpricot in performance.
He even tweeted "caller asks, “should i use hpricot or nokogiri?” if
you're NOT me: use nokogiri. and if you're me: well cut it out, stop
being me"
And here is a link to a comment I've quoted:
http://news.ycombinator.com/item?id=1955644
Summing this up: go with Nokogiri.

Related

What is the difference between Ruby's 'open-uri' and 'Net::HTTP' libraries?

It seems like both of these gems perform very similar tasks. Can anyone give examples of where one gem would be more useful than the other? I don't have specific code that I'm referring to, I'm more wondering about general use cases for each gem. I know this is a short question, I will fill in the blanks upon request. Thanks.
The reason they look like they perform similar tasks is OpenURI is a wrapper for Net::HTTP, Net::HTTPS, and Net::FTP.
Usually, unless you feel you need a lower level interface, using OpenURI is better as you can get by with less code. Using OpenURI you can open a URL/URI and treat it as a file.
See: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/open-uri/rdoc/OpenURI.html
and http://ruby-doc.org/stdlib-1.9.3//libdoc/net/http/rdoc/Net.html
I just found out that open does follow redirections, while Net::HTTP doesn't, which is an important difference.
For example, open('http://www.stackoverflow.com') { |content| puts content.read } will display the proper HTML after following the redirection, while Net::HTTP.get(URI('http://www.stackoverflow.com')) will return the body of the 302 redirect response instead of the page it points to.
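Because Net::HTTP does not follow redirects on its own, a small hand-rolled helper is the usual workaround. A sketch (the method name and the redirect limit of 5 are arbitrary choices):

```ruby
require 'net/http'
require 'uri'

# Follow up to `limit` HTTP redirects by hand - the part that
# OpenURI would otherwise do for you.
def fetch_following_redirects(uri, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  response = Net::HTTP.get_response(uri)
  case response
  when Net::HTTPRedirection
    # The Location header holds the next URL to try.
    fetch_following_redirects(URI(response['location']), limit - 1)
  else
    response
  end
end

# Usage (requires network access):
#   res = fetch_following_redirects(URI('http://www.stackoverflow.com'))
#   res.code  # "200" once the redirect chain has been followed
```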

Using a modified Nokogiri to parse Wikitext?

Apologies for the length of this question, it's more of a "is this possible" than "how do I do this".
My objective is to remove everything but plain text from Wikipedia markup -- tables, templates, formatting -- whether these are in wikitext markup (e.g. ''bold text'') or HTML (<b>bold text</b>).
Wikipedia text is a mix of custom tags: templates {{ ... }}, tables {| ... |}, links [[ ... ]] and HTML elements. Parsing it is kind of a nightmare. You can't use regular expressions because the tags can be nested, and it can contain HTML so almost anything is possible. Some of the text within the HTML I'd want to keep (stuff within bold text), but other things like tables would need to be stripped entirely.
I thought about re-purposing an XML parser like Nokogiri, adding {{/}} as alternatives to <x>/</x>.
Does anyone who knows Nokogiri (or another Ruby XML parser) know if this is possible or even a good idea?
My alternative is to repurpose an existing parser like WikiCloth for the wiki markup, and then try to remove any leftover HTML via another method.
This sounds like a good idea. However, it would not be possible for you to 'patch' Nokogiri, "adding {{/}} as alternatives to <x>/</x>". This is because the bulk of the work done by Nokogiri—parsing and XPath and generating the string representation of a DOM—is actually done by libxml2 in the back end. You'd have to patch and recompile libxml2 (and then rebuild Nokogiri against your new version)…but at that point I have no idea how Nokogiri would behave.
You might have better luck trying to patch REXML, since that is written in pure Ruby.
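Short of patching a parser, the nested {{ ... }} templates specifically can be handled by a plain-Ruby scan with a depth counter, which copes with the nesting that a regular expression cannot. A sketch covering templates only, not tables or links (the method name is made up):

```ruby
# Remove {{ ... }} templates, including nested ones, by tracking
# the current nesting depth and only emitting text at depth zero.
def strip_templates(text)
  out = ''
  depth = 0
  i = 0
  while i < text.length
    if text[i, 2] == '{{'
      depth += 1
      i += 2
    elsif text[i, 2] == '}}' && depth > 0
      depth -= 1
      i += 2
    else
      out << text[i] if depth.zero?
      i += 1
    end
  end
  out
end
```

The same counter idea extends to {| ... |} tables and [[ ... ]] links, each with its own depth variable.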

How do I write a web scraper in Ruby?

I would like to crawl a popular site (say Quora) that doesn't have an API and get some specific information and dump it into a file - say either a csv, .txt, or .html formatted nicely :)
E.g. return only a list of all the 'Bios' of the Users of Quora that have, listed in their publicly available information, the occupation 'UX designer'.
How would I do that in Ruby ?
I have a moderate enough level of understanding of how Ruby & Rails work. I just completed a Rails app - mainly all written by myself. But I am no guru by any stretch of the imagination.
I understand RegExs, etc.
Your best bet would be to use Mechanize. It can follow links, submit forms - anything you will need, web-client-wise. By the way, don't use regexes to parse HTML. Use an HTML parser.
If you want something more high level, try wombat, which is this gem I built on top of Mechanize and Nokogiri. It is able to parse pages and follow links using a really simple and high level DSL.
I know the answer has been accepted, but Hpricot is also very popular for parsing HTML.
All you have to do is take a look at the html source of the pages and try to find a XPath or CSS expression that matches the desired elements, then use something like:
doc.search("//p[@class='posted']")
Mechanize is awesome. If you're looking to learn something new though, you could take a look at Scrubyt: https://github.com/scrubber/scrubyt. It looks like Mechanize + Hpricot. I've never used it, but it seems interesting.
Nokogiri is great, but I find the output messy to work with. I wrote a Ruby gem to easily create classes from HTML: https://github.com/jassa/hyper_api
The HyperAPI gem uses Nokogiri to parse HTML with CSS selectors.
E.g.
Post = HyperAPI.new_class do
  string title: 'div#title'
  string body: 'div#body'
  string author: '#details .author'
  integer comments_count: '#extra .comment' do
    size
  end
end
# => Post
post = Post.new(html_string)
# => #<Post title: 'Hi there!', body: 'This blog post will talk about...', author: 'Bob', comments_count: 74>

Any Ruby models to traverse DOM's quickly?

Does anyone know of any Ruby libraries/gems that allow you to traverse a DOM quickly?
I need something which is fast and doesn't have a lot of dependencies. I've been trying to use Nokogiri, but I'm concerned about the number of segmentation-fault bugs I've been hitting.
Hpricot is a personal favourite of mine.

Better ruby markdown interpreter?

I'm trying to find a markdown interpreter class/module that I can use in a rakefile.
So far I've found maruku, but I'm a bit wary of beta releases.
Has anyone had any issues with maruku? Or, do you know of a better alternative?
I use Maruku to process 100,000 - 200,000 documents per day. Mostly forum posts but I also use it on large documents like wiki pages. Maruku is much faster than BlueCloth and it doesn't choke on large documents. It's all Ruby and although the code isn't especially easy to extend and augment, it is doable. We have a few tweaks and extras in our dialect of Markdown.
If you want something that is pure Ruby, I definitely recommend Maruku.
For the fastest option out there, you probably want RDiscount. The guts are implemented in C.
See also: "Moving Past BlueCloth" on Ryan Tomayko's blog.
Ryan's post includes the following benchmark of 100 iterations of a markdown test:
BlueCloth: 13.029987s total time, 00.130300s average
Maruku: 08.424132s total time, 00.084241s average
RDiscount: 00.082019s total time, 00.000820s average
Update August 2009
BlueCloth2 was released (http://www.deveiate.org/projects/BlueCloth)
Its speed is on par with RDiscount because it is based on RDiscount - it is not pure Ruby.
(Thanks Jim)
Update November 2009
Kramdown 1.0 was just released. I haven't tried it yet, but it is a pure-Ruby Markdown parser that claims to be 5x faster than Maruku.
Update April 2011
Maruku hasn't seen a commit since June 2010. You may want to look into Kramdown instead.
A new fast option that is not pure Ruby: GitHub has released Redcarpet, which is based on libupskirt: https://github.com/blog/832-rolling-out-the-redcarpet
Update August 2013
Kramdown is still a very healthy project (based on recent commits, outstanding issues, pull requests) and a great choice for a pure Ruby Markdown engine https://github.com/gettalong/kramdown
Redcarpet is probably still the most commonly used and actively maintained option for people that don't need or want pure Ruby.
The listing at http://ruby-toolbox.com/categories/markup_processors.html would be a good place to start looking.
RDiscount is fast and simple to use.
Try RDiscount. BlueCloth is slow and buggy.
The benchmark in the answer given by casey uses BlueCloth 1. BlueCloth 2 is among the fastest these days: http://www.deveiate.org/projects/BlueCloth
I believe BlueCloth is the most prominent one.
Looks like a lot of these answers are outdated.
Best thing I've found out there as of now (summer 2013) is the Redcarpet gem: https://github.com/vmg/redcarpet
To ensure you're getting BlueCloth 2, install like this:
gem install bluecloth
Note that "bluecloth" should be in all lowercase, not camel case.
Source: http://rubygems.org/gems/bluecloth
If you need a fair example of how to use something like Kramdown in a rakefile, there is a repo on GitHub with code and articles in Markdown (.md) files that can be converted to HTML with Ruby syntax highlighting - but, alas, with line numbers as well (I would prefer to turn off line numbering).
If anyone knows how to turn off the default line numbering, please tell us.
Anyway the link is https://github.com/elm-city-craftworks/practicing-ruby-manuscripts
