Best way to parse a table in Ruby - ruby

I'd like to parse a simple table into a Ruby data structure. The table looks like this:
(Image: http://img232.imageshack.us/img232/446/picture5cls.png)
Edit: Here is the HTML
and I'd like to parse it into an array of hashes. E.g.,:
schedule[0]['NEW HAVEN'] == '4:12AM'
schedule[0]['Travel Time In Minutes'] == '95'
Any thoughts on how to do this? Perl has HTML::TableExtract, which I think would do the job, but I can't find any similar library for Ruby.

You might like to try Hpricot (gem install hpricot, prepend the usual sudo for *nix systems)
I placed your HTML into input.html, then ran this:
require 'hpricot'
doc = Hpricot.XML(open('input.html'))
table = doc/:table
(table/:tr).each do |row|
(row/:td).each do |cell|
puts cell.inner_html
end
end
which, for the first row, gives me
<span class="black">12:17AM </span>
<span class="black">
</span>
<span class="black">1:22AM </span>
<span class="black">
</span>
<span class="black">65</span>
<span class="black">TRANSFER AT STAMFORD (AR 1:01AM & LV 1:05AM) </span>
<span class="black">
N
</span>
So already we're down to the content of the TD tags. A little more work and you're about there.
(BTW, the HTML looks a little malformed: you have <th> tags in <tbody>, which seems a bit perverse: <tbody> is fairly pointless if it's just going to be another level within <table>. It makes much more sense if your <tr><th>...</th></tr> stuff is in a separate <thead> section within the table. But it may not be "your" HTML, of course!)

In case there isn't a library to do that for ruby, here's some code to get you started writing this yourself:
require 'nokogiri'
doc=Nokogiri("<table><tr><th>la</th><th><b>lu</b></th></tr><tr><td>lala</td><td>lulu</td></tr><tr><td><b>lila</b></td><td>lolu</td></tr></table>")
header, *rest = (doc/"tr").map do |row|
row.children.map do |c|
c.text
end
end
header.map! do |str| str.to_sym end
item_struct = Struct.new(*header)
table = rest.map do |row|
item_struct.new(*row)
end
table[1].lu #=> "lolu"
This code is far from perfect, obviously, but it should get you started.
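If an array of hashes is preferred over Structs, as the question asks, the same header/rows split can be zipped together. This is only a sketch: it assumes the header and data rows have already been extracted as plain arrays of strings, and it reuses the two sample values from the question rather than the real table.

```ruby
# Header cells and data rows, assumed to be already extracted from the table.
# The two columns and their values are the examples given in the question.
header = ["NEW HAVEN", "Travel Time In Minutes"]
rows   = [["4:12AM", "95"]]

# Pair each header with the cell in the same position, one hash per row.
schedule = rows.map { |row| header.zip(row).to_h }

puts schedule[0]["NEW HAVEN"]              # prints 4:12AM
puts schedule[0]["Travel Time In Minutes"] # prints 95
```

Keeping the headers as strings (instead of converting them to symbols) allows keys with spaces such as 'Travel Time In Minutes'.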

Related

Why is the following Nokogiri/XPath code removing tags inside the node?

The document going in has a structure like this:
<span class="footnote">Hello there, link</span>
The XPath search is:
@doc = set_nokogiri(html)
footnotes = @doc.xpath(".//span[@class = 'footnote']")
footnotes.each_with_index do |footnote, index|
puts footnote
end
The above footnote becomes:
<span>Hello there, link</span>
I assume my XPath is wrong but I'm having a hard time figuring out why.
I had the wrong tag in the output and should have been more careful. The point being that the <a> tag is getting stripped but its contents are still included.
I also added the set_nokogiri line in case that's relevant.
I can't duplicate the problem:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<span class="footnote">Hello there, link</span>
EOT
footnotes = doc.xpath(".//span[@class = 'footnote']")
footnotes.to_xml # => "<span class=\"footnote\">Hello there, link</span>"
footnotes.each do |f|
puts f
end
# >> <span class="footnote">Hello there, link</span>
An additional problem is that the <a> tag has an invalid href URL.
link
should be:
link

ruby selenium xpath td css

I am testing a webapp using Ruby and Selenium web-driver. I have not been able to examine the contents of a cell in the displayed webpage.
What I would like to get is the IP in the td.
<td class="multi_select_column"><input name="object_ids" type="checkbox"
value="adcf0467-2756-4c02-9edd-bb83c40b8685" /></td>
<td class="sortable normal_column">Core</td>
<td class="sortable nowrap-col normal_column">r1-c4-b4</td>
<td class="sortable anchor normal_column"><a href="/horizon/admin/instances
/adcf0467-2756-4c02-9edd-bb83c40b8685/detail" class="">pg-gtmpg--675</a></td>
<td class="sortable normal_column">column_name</td><td class="sortable normal_column">
<tr
<ul>
<li>172.25.1.12</li>
</ul>
I used the Firefox addon firepath to get the Xpath of the IP.
It gives "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1]/td[6]/ul/li", which looks correct.
However I have not been able to display the IP.
Here is my test code;
#!/usr/bin/env ruby
#
# Sample Ruby script using the Selenium client API
#
require "rubygems"
require "selenium/client"
require "test/unit"
require "selenium/client"
begin
driver = Selenium::WebDriver.for(:remote, :url =>"http://dog.dog.jump.acme.com:4444/wd/hub")
driver.navigate.to "http://10.87.252.37/acme/auth/login/"
g_user_name = driver.find_element(:id, 'id_username')
g_user_name.send_keys("user")
g_user_name.submit
g_password = driver.find_element(:id, 'id_password')
g_password.send_keys("password")
g_password.submit
g_instance_1 = driver.find_element(:xpath, "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1] /td[4]/a")
puts g_instance_1.text() <- here, I can see the text
g_instance_2 = driver.find_elements(:xpath, "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1] /td[6]/ul/li[1]")
puts g_instance_2
output is #<Selenium::WebDriver::Element:0x000000023c1700>
puts g_instance_2.inspect
output is :[#<Selenium::WebDriver::Element:0x22f3b7c6e7724d4a id="4">]
puts g_instance_2.class
Output: Array
puts g_instance_2.count
Output:1
When there is no /a in the td it doesn't seem to work.
I have tried puts g_instance_2.text, g_instance_2.text() and many others with no success.
I must be missing something obvious, but I am not seeing it
ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-linux] on
Linux ubuntu 3.8.0-34-generic #49~precise1-Ubuntu
I decided to try a different approach using the css selector instead of xpath.
When I insert the following css selector into the FirePath window the desired html section is selected.
g_instance_2 = driver.find_elements(:css, "table#instances tbody tr:nth-of-type(1) td:nth-of-type(6) ul li:nth-of-type(1)" )
The problem is the same as before: I don't seem to be able to access the contents of g_instance_2.
I have tried;
puts g_instance_2
g_instance_22 = [g_instance_2]
puts g_instance_22
Both return;
#<Selenium::WebDriver::Element:0x000000028a6ba8>
#<Selenium::WebDriver::Element:0x000000028a6ba8>
How can I check the value returned from the remote web-server?
Would Python be a better choice to do this?
The HTML code fragment you are trying to test is not valid HTML. It might be worth filing a bug report for it.
With the given code, the following CSS selector retrieves the <a> you want:
[href^="/horizon/admin/instances"]
Translated into: any element that has the "href" attribute starting with "/horizon/admin/instances"
For XPATH this is the selector
("//a[contains(@href,'/horizon/admin/instances')]")
Same translation just an uglier syntax.
The problem was that I was not accessing the returned array properly.
puts g_instance_2[0].text()
works for css and xpath
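The difference can be illustrated without a browser. In the sketch below, StubElement is a hypothetical stand-in for Selenium::WebDriver::Element (it is not part of Selenium); the point is only that a find_elements-style result is an Array, so text has to be called on its members rather than on the Array itself.

```ruby
# StubElement is a made-up stand-in for Selenium::WebDriver::Element,
# used so this sketch runs without a browser or a Selenium server.
StubElement = Struct.new(:text)

# A find_elements-style result: an Array, even when only one node matches.
elements = [StubElement.new("172.25.1.12")]

# The Array itself has no text method, which is why calling text on it fails.
array_has_text = elements.respond_to?(:text)

# Index into the Array, or map over it, to reach each element's text.
first_ip = elements[0].text
all_ips  = elements.map(&:text)

puts array_has_text # prints false
puts first_ip       # prints 172.25.1.12
```

When exactly one match is expected, find_element (singular) returns the element directly and avoids the indexing step.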

Formatting HTML into CSV

I'm scraping a website using Ruby with Nokogiri.
This script creates a local text file, opens a URL, and writes to the file if the expression tr td is met. It is working fine.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
DOC_URL_FILE = "doc.csv"
url = "http://www.SuperSecretWebSite.com"
data = Nokogiri::HTML(open(url))
all_data = data.xpath('//tr/td').text
File.open(DOC_URL_FILE, 'w'){|file| file.write all_data}
Each line has five fields which I would like to run horizontally then go to the next line after five cells are filled. The data is all there but isn't usable.
I was hoping to learn or get the code from someone that knows how to create a CSV formatting code that:
While the script is reading the code, dump every new td /td x5 into its own cells horizontally.
Go to the next line, etc.
The layout of the HTML is:
<tr>
<td>John Smith</td>
<td>I live here 123</td>
<td>phone ###</td>
<td>Birthday</td>
<td>Other Data</td>
</tr>
What the final product should look like.
http://picpaste.com/pics/Screenshot-KRnqRGrP.1361813552.png
current output
john Smith I live here 123 phone ### Birthday Other Data,
This is pretty standard code to walk a table and extract its cells into an array of arrays. What you do with the data at that point is up to you, but it's very easy to pass it to CSV.
require 'nokogiri'
require 'pp'
doc = Nokogiri::HTML(<<EOT)
<table>
<tr>
<td>John Smith</td>
<td>I live here 123</td>
<td>phone ###</td>
<td>Birthday</td>
<td>Other Data</td>
</tr>
<tr>
<td>John Smyth</td>
<td>I live here 456</td>
<td>phone ###</td>
<td>Birthday</td>
<td>Other Data</td>
</tr>
</table>
EOT
data = []
doc.at('table').search('tr').each do |tr|
data << tr.search('td').map(&:text)
end
pp data
Which outputs:
[["John Smith", "I live here 123", "phone ###", "Birthday", "Other Data"],
["John Smyth", "I live here 456", "phone ###", "Birthday", "Other Data"]]
The code uses at to locate the first <table>, then iterates over each <tr> using search. For each row, it iterates over the cells and extracts their text.
Nokogiri's at finds the first occurrence of something, and returns a Node. search finds all occurrences and returns a NodeSet, which acts like an array. I'm using CSS accessors, instead of XPath, for simplicity.
As a FYI:
File.open(DOC_URL_FILE, 'w'){|file| file.write all_data}
can be written more succinctly as:
File.write(DOC_URL_FILE, all_data)
I've been working on this problem for awhile. Can you give me any more help?
Sigh...
Did you read the CSV documentation, especially the examples? What happens if, instead of defining data = [], we replace it with:
CSV.open("path/to/file.csv", "wb") do |data|
and wrap the loop with the CSV block, like:
require 'csv'
CSV.open("path/to/file.csv", "wb") do |data|
doc.at('table').search('tr').each do |tr|
data << tr.search('td').map(&:text)
end
end
That's not tested, but it's really that simple. Go and fiddle with that.
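A different route, if the cells have already been flattened into one list (as with the question's //tr/td query), is to regroup them five at a time with each_slice. This is a sketch under that assumption, not the approach used in the answer above; the constant name FIELDS_PER_ROW is made up for the example, and CSV.generate is used instead of CSV.open so that nothing is written to disk.

```ruby
require 'csv'

# Flat list of cell texts in document order, taken from the sample HTML above.
cells = [
  "John Smith", "I live here 123", "phone ###", "Birthday", "Other Data",
  "John Smyth", "I live here 456", "phone ###", "Birthday", "Other Data"
]

# Each record in the question has five fields.
FIELDS_PER_ROW = 5

# Build the CSV text in memory, one line per group of five cells.
csv_text = CSV.generate do |csv|
  cells.each_slice(FIELDS_PER_ROW) { |row| csv << row }
end

puts csv_text
```

This only works when every row really has five cells; walking the table row by row, as in the answer, is more robust when rows can vary in length.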

Nokogiri Xpath Double Looping

What I'm trying to do is pull the code block that contains the td with the class default. This works perfectly fine. But then I need to sort out the different parts of the code block. When I try to do this with the second xpath call, each iteration prints all the comheads from every block.
def HeaderProcessor(doc)
doc.xpath("//td[@class='default']").each do |block|
puts block.xpath("//span[@class='comhead']").text
end
end
When I just print out block each block prints out once and contains the comment header and the comment. When I try to run the xpath it prints out EVERY comhead found in doc and seems to be ignoring the block variable.
Any ideas on how I can make this work? What am I misunderstanding about xpath?
UPDATE:
<td class="default">
<div style="margin-top:2px; margin-bottom:-10px; ">
<span class="comhead">
#some data
</span></div>
<br><span class="comment"><font color="#000000">#some more data</span>
</td>
When you say //span[@class='comhead'], you're telling Nokogiri to search from the document root. You just want */span[@class='comhead'], which is relative to the current block:
doc.xpath("//td[@class='default']").each do |block|
block.xpath("*/span[@class='comhead']").each do |span|
puts span.text
end
end
or even just this:
doc.xpath('//td[@class="default"]/*/span[@class="comhead"]').each do |span|
puts span.text
end
if you don't need to do anything with the <td> elements.

How does Nokogiri handle unclosed HTML tags like <br>?

When parsing an HTML document, how does Nokogiri handle <br> tags? Suppose we have a document that looks like this one:
<div>
Hi <br>
How are you? <br>
</div>
Does Nokogiri know that <br> tags are something special, not just regular XML tags, and ignore them when parsing the node feed? I think Nokogiri is that smart, but I want to make sure before I accept this project, which involves scraping a site written in HTML4. You know what I mean (How are you? is not the content of the first <br>, as it would be in XML).
Here's how Nokogiri behaves when parsing (malformed) XML:
require 'nokogiri'
doc = Nokogiri::XML("<div>Hello<br>World</div>")
puts doc.root
#=> <div>Hello<br>World</br></div>
Here's how Nokogiri behaves when parsing HTML:
require 'nokogiri'
doc = Nokogiri::HTML("<div>Hello<br>World</div>")
puts doc.root
#=> <html><body><div>Hello<br>World</div></body></html>
p doc.at('div').text
#=> "HelloWorld"
I'm assuming that by "something special" you mean that you want Nokogiri to treat it like a newline in the source text. A <br> is not something special, and so appropriately Nokogiri does not treat it differently than any other element.
If you want it to be treated as a newline, you can do this:
doc.css('br').each{ |br| br.replace("\n") }
p doc.at('div').text
#=> "Hello\nWorld"
Similarly, if you wanted a space instead:
doc.css('br').each{ |br| br.replace(" ") }
p doc.at('div').text
#=> "Hello World"
You must parse this fragment using the HTML parser, as obviously this is not valid XML. When using the HTML one, Nokogiri then behaves as you'd expect it:
require 'nokogiri'
doc = Nokogiri::HTML(<<-EOS
<div>
Hi <br>
How are you? <br>
</div>
EOS
)
doc.xpath("//br").each{ |e| puts e }
prints
<br>
<br>
Mechanize is based on Nokogiri for doing web scraping, so it is quite appropriate for the task.
As far as I can remember from doing some HTML parsing last year it'll view them as separate.
EDIT: My bad. I've just had someone send me the code and retested it; we ended up dealing with some things, including <br>, separately.

Resources