Getting HTML table values using XPath in Nokogiri? - ruby

I'm trying to get some values from a table using the XPath of this table but it only returns [] (empty):
require 'nokogiri'
require 'open-uri'
url = "http://riopretrans.com.br/linhas.php?ln=106"
doc = Nokogiri::HTML(open(url))
doc.xpath("html/body/table[1]/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[2]/td/div/table[1]/tbody/tr[3]/td/div/div/center/font/table").each do |lines|
puts lines.content
end
I found the table's XPath using Firebug so I think it's correct.
Can anyone help me?

Remove tbody/ from your XPath.
The tbody tag is part of the HTML spec for tables, but it's rarely written into the actual HTML. Browsers insert it into the DOM even when it isn't in the page's source. Firebug then sees it, you see what Firebug shows, and you think it must be in the source.
Even using "view source" can confuse you, because you expect it to be accurate, but the browser may have already munged the content to include tbody, so, well, basically it's lying to you.
You can confirm this by looking at the HTML that Nokogiri is getting. Use puts doc.to_html['tbody'] and see if you get "tbody" or nil.
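For instance, a quick sanity check (a sketch; doc is the parsed page from the question):
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://riopretrans.com.br/linhas.php?ln=106"))
# String#[] with a string argument returns the match if found, nil otherwise
puts doc.to_html['tbody'].inspect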
...Because in the HTML file all of them were specified (written by the programmer)
If you are positive they actually belong there, because they exist in the HTML source, then you'll need to take apart your XPath. Start with a broad path, and slowly add to it to narrow down your search.
The server is unreachable for me right now, so I can't confirm that, or dig into what the hierarchy should be, and show an example. (That's why actually giving us REAL HTML in your question is SO much better than a link which might not work.)
An alternative is to use XPath's // (search anywhere) with a less restrictive path, or CSS selectors. Either way, actually examine the HTML instead of relying on Firebug's XPath, and determine what "landmarks" you can use in the source to navigate to your desired table. Today's HTML is chock-full of id and class attributes, or a particular series of tags that acts as a fingerprint for the table you want. Search for the minimum needed to pinpoint that table.
If the table is something like <table id="foo">, then use doc.at('table#foo'). If it's in a <div class="bar"><table> use doc.at('div.bar table'). In any case, use the smallest sized accessor necessary to get the job done. That will increase your chances of success if anything in the HTML changes in the future.
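For example, a minimal sketch (the id "foo" and class "bar" are hypothetical; substitute whatever landmarks the real page offers):
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://riopretrans.com.br/linhas.php?ln=106"))

# prefer an id if the table has one (hypothetical "foo")
table = doc.at('table#foo')
# otherwise hang off a nearby landmark, e.g. a wrapping div (hypothetical "bar")
table ||= doc.at('div.bar table')

table.css('tr').each { |tr| puts tr.text.strip } if table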

Related

Accessing and scraping sporadically available Wikipedia sections

I need to fetch some data but I'm completely stumped after trying a few things.
I want to access Airlines & Destinations from the Albuquerque_International_Sunport's wiki page - keep in mind, I'll be going through a prepopulated list of airports with this data.
There are multiple "types" of Airlines: Passenger, Cargo, sometimes there's other (sub?)sections; other times there are none:
Articles for multiple airports will be accessed automatically - including some less known airports. This means I need to:
Check if "Airlines & Destinations" section exists
Take all data inside of any table
Scrape it; otherwise do nothing
I've tried using the Ruby wikipedia-client gem; however, the .raw_data method isn't even returning the section data.
Next, I went to Wikipedia's API: unless I'm mistaken, it doesn't return section names! That doesn't seem right, but I wasn't able to get it working.
So I suppose that leaves Nokogiri. I can grab and parse the pages fine, but:
How would I go about detecting the presence of the "Airlines & Destinations" section, and getting all table data BEFORE the end of the section? I have a suspicion I need some tricky XPath for this.
Seems to be the only viable solution.
Any thoughts welcome. Putting a bounty on this question when I can.
Edit: Perhaps it's better to simply grab a list of all airlines in the world somehow and match them against the HTML? Seems like it could be computationally expensive.
Well, I'm not an expert user of Nokogiri but maybe this can give you some idea.
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open("https://en.wikipedia.org/wiki/Albuquerque_International_Sunport"))
# this is the passenger table (note @id, not #id, in an XPath attribute test)
page.xpath('//*[@id="mw-content-text"]/div/table[2]/tr').each do |tr|
  p tr.text
  puts "-" * 50
end

# this is the cargo table
page.xpath('//*[@id="mw-content-text"]/div/table[3]/tr').each do |tr|
  p tr.text
  puts "-" * 50
end
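Hard-coded table indices will break whenever the page layout shifts, so a sturdier approach is to locate the section heading first. A sketch (the span id "Airlines_and_destinations" is an assumption; Wikipedia generates heading ids from the heading text, so verify it against the actual page):
require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(open("https://en.wikipedia.org/wiki/Albuquerque_International_Sunport"))

# the heading id is an assumption; inspect the page to confirm it
heading = page.at('span#Airlines_and_destinations')

if heading
  # walk forward from the heading's h2 until the next h2,
  # collecting every table that belongs to this section
  node = heading.parent
  tables = []
  while (node = node.next_element)
    break if node.name == 'h2'
    tables.concat(node.xpath('descendant-or-self::table').to_a)
  end
  tables.each { |t| puts t.text.strip }
end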

Wiki quotes API?

I want to get a structured version of a Wikiquote page via JSON (basically I need all the phrases)
Example: http://en.wikiquote.org/wiki/Fight_Club_(film)
I tried with: http://en.wikiquote.org/w/api.php?format=xml&action=parse&page=Fight_Club_(film)&prop=text
but I get all the HTML source code. I need each phrase as an element of an Array.
How could I achieve that with DBPEDIA?
For one thing, I am not sure whether you can query Wikiquote using DBpedia, and secondly, DBpedia only gives you infobox data in a structured way; it does not give you the article content in any structured form. Instead, with a little bit of trouble, you can use the MediaWiki API to get the data.
EDIT
The URI you are trying gives you a text so this will make things easier, but not completely.
Try this piece of code in your console:
require 'json'
require 'open-uri'
require 'nokogiri'

content = JSON.parse(open("http://en.wikiquote.org/w/api.php?format=json&action=parse&page=Fight_Club_%28film%29&prop=text").read)
data = content['parse']['text']['*'] # the rendered HTML sits under parse.text.*
xpath_data = Nokogiri::HTML(data)
xpath_data.xpath("//ul/li").map { |data_node| data_node.text }
This is the closest I have come to an answer. Of course it isn't completely right, because you will get a lot of unnecessary data. But if you dig into Nokogiri and XPath and find out how to pinpoint the nodes you need, you can get a solution that gives you correct quotes at least 90% of the time.
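One way to trim the noise, sketched under the assumption that each quote is a top-level li whose nested ul holds attribution lines:
quotes = xpath_data.xpath('//ul/li').map do |li|
  # keep only this node's own text and inline markup, skipping nested <ul> sub-bullets
  li.xpath('text() | i | b | a').text.strip
end.reject(&:empty?)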
Just change the format to JSON. Look up the Wikipedia API for more details.
http://en.wikiquote.org/w/api.php?format=json&action=parse&page=Fight_Club_(film)&prop=text

Letting Nokogiri decide whether to use #fragment or #parse

I have a piece of HTML that I would like to parse with Nokogiri, but I do not know whether it is a full HTML document (with DOCTYPE, etc) or a fragment (e.g. just a div with some elements in it).
This makes a difference for Nokogiri, because it should use #fragment for parsing fragments but #parse for parsing full documents.
Is there a way to determine whether a given piece of text is a fragment or a full HTML document?
Denis
Depends on how trashed your page is, but
/^(?:\s*<!DOCTYPE)|(?:\s*<html)/
should work in most cases.
The simplest way would be to look for the mandatory <html> tag, using for instance the regular expression /<html[\s>]/i (allowing attributes).
Is this sufficient to solve your problem?
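Putting that together, a minimal sketch of a dispatcher (using the regex heuristic from above; adjust it to your input):
require 'nokogiri'

# parse as a full document if the text looks like one, otherwise as a fragment
def parse_html(text)
  if text =~ /\A\s*(?:<!DOCTYPE|<html)/i
    Nokogiri::HTML(text)
  else
    Nokogiri::HTML.fragment(text)
  end
end

puts parse_html("<div><p>just a fragment</p></div>").class
# => Nokogiri::HTML::DocumentFragment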

How do I scrape a site, with multiple pages, and create one single html page with Ruby?

So what I would like to do is scrape this site: http://boxerbiography.blogspot.com/
and create one HTML page that I can either print or send to my Kindle.
I am thinking of using Hpricot, but am not too sure how to proceed.
How do I set it up so it recursively checks each link, gets the HTML, either stores it in a variable or dumps it to the main HTML page and then goes back to the table of contents and keeps doing that?
You don't have to tell me EXACTLY how to do it, but just the theory behind how I might want to approach it.
Do I literally have to look at the source of one of the articles (which is EXTREMELY ugly btw), e.g. view-source:http://boxerbiography.blogspot.com/2006/12/10-progamer-lim-yohwan-e-sports-icon.html and manually programme the script to extract text between certain tags (e.g. h3, p, etc.)?
If I do that approach, then I will have to look at each individual source for each chapter/article and then do that. Kinda defeats the purpose of writing a script to do it, no?
Ideally I would like a script that will be able to tell the difference between JS and other code and just the 'text' and dump it (formatted with the proper headings and such).
Would really appreciate some guidance.
Thanks.
I'd recommend using Nokogiri instead of Hpricot. It's more robust, uses fewer resources, has fewer bugs, is easier to use, and is faster.
I did quite a bit of scraping for work at one time, and had to switch to Nokogiri because Hpricot would crash inexplicably on some pages.
Check this RailsCast:
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
and:
http://nokogiri.org/
http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html
http://www.engineyard.com/blog/2010/getting-started-with-nokogiri/
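As for the theory: grab the table of contents, collect the article links, fetch each one, extract just the article body, and append it to a single output document. A rough sketch (the link filter and the .post-title/.post-body selectors are assumptions about Blogger's markup; inspect the actual pages and adjust):
require 'nokogiri'
require 'open-uri'

index = Nokogiri::HTML(open("http://boxerbiography.blogspot.com/"))

# collect article links from the table of contents
# (the URL pattern is an assumption; find the real landmark in the source)
links = index.css('a')
             .map { |a| a['href'] }
             .compact
             .select { |href| href =~ %r{boxerbiography\.blogspot\.com/\d{4}/\d{2}/} }
             .uniq

book = Nokogiri::HTML('<html><body></body></html>')
body = book.at('body')

links.each do |url|
  page = Nokogiri::HTML(open(url))
  # Blogger posts typically carry these classes, but verify against the source
  title   = page.at('.post-title')
  content = page.at('.post-body')
  next unless content
  body.add_child("<h3>#{title ? title.text.strip : url}</h3>")
  body.add_child(content.to_html)
end

File.write('book.html', book.to_html)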

HPricot css search: How do I select the parent/ancestor of a particular element using a string selector?

I'm using HPricot's css search to identify a table within a web page. Here's a sample html snippet I'm parsing:
<table height=61 width=700>
<tbody>
<tr>
<td><font size=3pt color = 'Blue'><b><A NAME=a1>Some header text</A></b></font></td></tr>
...
</tbody></table>
There are lots of tables in the page. I want to find the table which contains the A Name=a1 reference.
Right now, the way I'm doing it is
(page/"a[#name=a1]")[0].parent.parent.parent.parent.parent
I don't like this because
It is ugly
It is error prone (what if the folks who maintain the web page remove the tbody?)
Is there a way to tell hpricot to get me the table ancestor of the specified element?
Edit: Here's the full blown page I'm parsing: http://www.blonnet.com/businessline/scoboard/a.htm
The bits I'm interested in are the two tables, one with quarterly results and another with the annual results. Right now, the way I'm extracting those tables is by finding the named anchor and moving up from there.
Rohith is right. It is ugly, and it is error prone (more than it needs to be). Again, as he says, it is much clearer to state the intent as "find the closest parent that is a table", and this applies to any child/parent relationship.
If it's "not possible" to do that with Hpricot, then just say so. But don't just say "it's hopeless to try to do that anyway, what's the point"; that's a bogus answer. It also doesn't help the next person who comes along (myself) looking for the answer to the same question but for different reasons, namely parsing many pages where differences are ASSUMED and not just feared.
To actually answer the question... I don't know, yet. And I don't have much hope of finding out with hpricot. The documentation is absolutely horridly nonexistent.
But here's a workaround that does about the same thing.
table = (page%"a[#name=a1]").parent
table = table.parent while table.name != "table"
Without seeing the whole page it's hard to give a definitive answer, but often the way you're going about it is the right answer. You have to find a decent landmark, then navigate from there, and if it involves backing up the chain then that's what you do.
You might be able to use XPath to find the table and then look inside it for the link, but that doesn't really improve things; it only changes them. Firebug, the Firefox plugin, makes it easy to get the XPath to an element in the page, so you could find the table in question and have Firebug show you the path, or just copy it by right-clicking on the node in the XPath display, and paste that into your lookup.
"It is ugly", well, maybe, but not all code is beautiful or elegant because not all problems lend themselves to beautiful and/or elegant solutions. Sometimes we have to be happy with "it works". As long as it works reliably and you know why then you're ahead of many other coders.
"... what if the folks who maintain the web page remove the tbody?", almost all parsing of HTML or XML suffers from the same concern because we're not in control of the source. You write your code as best as you can, comment the spots that are likely to fail if content changes, then cross your fingers and move on. Even if you were parsing tabular data from a TPS report you could run into the same problem.
The only thing I'd suggest doing differently, is to use the % (AKA "at") instead of / (AKA search). % returns only the first occurrence so you can drop the [0] index.
(page%"a[#name=a1]").parent.parent.parent.parent.parent
or
page%'//a[#name="a1"]/../../../../../..'
which uses the XPath engine to step back up the chain. That should be a little faster if speed is a consideration.
If you know that the target table is the only one with that width and height, you can use a more specific xpath:
page%'//table[@height=61 and @width=700]'
I recommend Nokogiri over Hpricot.
You can also use XPath from the top of the document down:
irb(main):039:0> print (doc/'//body/table[2]/tr/td[2]/table[2]').to_html[0..100]
<table height="61" width="700"><tbody>
<tr><td width="700" colspan="7" align="center"> <font size="3p=> nil
Basically the XPath pattern means:
Find the body tag, then its second table, then the second cell in its row. Inside that cell, locate the second table. (XPath indexes are 1-based.)
Note: Firefox automatically adds the <tbody> tag to the source, even if it wasn't there in the HTML file received. That can really mess you up trying to use Firefox to view the source to develop your own XPaths.
The other table you are after is /html/body/table[2]/tbody/tr/td[2]/table[3] according to Firefox so you have to strip the tbody. Also you don't need to anchor at /html.
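For what it's worth, in Nokogiri the original question has a direct answer: XPath's ancestor:: axis walks up the chain for you, so there's no counting of .parent calls. A sketch:
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.blonnet.com/businessline/scoboard/a.htm"))

# XPath's ancestor:: axis finds the nearest enclosing <table> in one step
table = doc.at_xpath('//a[@name="a1"]/ancestor::table[1]')

# equivalently, Nokogiri's Node#ancestors accepts a selector
anchor = doc.at('a[name="a1"]')
table  = anchor.ancestors('table').first if anchor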
