Extracting text from a webtable in Watir / Ruby - ruby

I'm trying to extract the data from an Income Statement, url is http://finance.yahoo.com/q/is?s=LMT+Income+Statement&annual
I was unable to find the table using the browser.table(:name, 'blah') or (:id, 'blah'), but had some luck using the xpath with Nokogiri using this code, which picks up after I've initialized everything and browsed to the page:
page_html = Nokogiri::HTML.parse(browser.html)
tobj = page_html.xpath('//*[#id="yfncsumtab"]').inner_text
Now I'm able to take tobj and pull the data out, but it doesn't do me any good for trying to manipulate the object as a table. Any suggestions on how to go about storing the table as a variable would help. I can probably figure out iterating through the rows/columns from there, but I wouldn't mind if you tacked on some code that would do that.

Do you know Watir has xpath support?
browser.element(:xpath => '//*[#id="yfncsumtab"]')

Look at it this way:
doc = Nokogiri::HTML.parse(browser.html)
table = doc.at('table#yfncsumtab')
# iterate through tr's
table.search('tr').each do |tr|
# do something with tr
end

Try browser.element(id: "yfncsumtab").text

Related

Translating jQuery to Watir

It's entirely possible that I'm missing something fundamental, but this is a new realm for me and I could use some pointers. I'm getting started using Ruby and Watir to drive/test a web application that's all AJAX-built. Many of the items don't have explicit classes/ids, and the dev team of course uses jQuery to get to them. I'm looking for a way to translate their jQuery into Watir to use/modify/check values of the same objects.
For example, they use this to see if there are values in a data grid's fifth column:
$("div.dataTable table tbody tr").has("td:eq(4):not(:empty)").length > 0
How would I go about doing something similar?
You could make the same check in Watir using:
#Get the rows of the table (assuming there is just one dataTable)
table_trs = browser.div(:class, 'dataTable').table.tbody.trs
#Find how many rows have data in the 5th cell
# Note that both jQuery and Watir are 0-based index (ie 4 means 5th cell)
rows_with_data = table_trs.count{ |tr| tr.td(:index, 4).text != '' }
#Do your comparison
rows_with_data > 0
You can write it all as one line, but I broke it up here for readability.
You could also use Pincers. It's a small ruby gem, like Watir, but offers an API similar to jQuery on top of Webdriver.
Example:
require('selenium-webdriver')
require('pincers')
driver = Selenium::WebDriver.for :firefox
pincers = Pincers.for_webdriver driver
pincers.goto 'www.somesite.com'
pincers.css('a#link-id').click
(Disclosure: I work at Platanus.)

performance issue of watir table object processing. How to make Nokogiri html table into array?

The following works but is always very slow, seemingly halting my scraping program and its Firefox or Chrome browser for even whole minutes per page:
pp recArray = $browser.table(:id,"recordTable").to_a
Getting the HTML table's text or html source is fast though:
htmlcode = $browser.table(:id,"recordTable").html # .text shows only plaintext portion like lynx
How might I be able to create the same recArray (each element from a <TR>) using for example a Nokogiri object holding only that table's html?
recArray = Nokogiri::HTML(htmlcode). ??
I wrote a blog post about that a few days ago: http://zeljkofilipin.com/watir-nokogiri/
If you have further questions, ask.
You want each tr in the table?
Nokogiri::HTML($browser.html).css('table[#id="recordTable"] > tr')
This gives a NodeSet which can be more useful than Array. Of course there's still to_a
Thought it would be useful to sum up all the steps here and there:
The question was how to produce the same array object filled with strings from the page's text content that a Watir::Webdriver Table #to_a might produce, but much faster:
recArray = Nokogiri::HTML(htmlcode). **??**
So instead of this as I was doing before:
recArray=$browser.table(:class, 'detail-table w-Positions').to_a
I send the whole page's html as a string to Nokogiri to let it do the parsing:
recArray=Nokogiri::HTML($browser.html).css('table[#class="detail-table w-Positions"] tr').to_a
Which found me the rows of the table I want and put them into an array.
Not done yet since the elements of that array are still Nokogiri (Table Row?) types, which barfed when I attempted things like .join(",") (useful for writing into a .CSV file or database for instance)
So the following iterates through each row element, turning each into an array of pure Ruby String types, containing only the text content of each table cell stripped of html tags:
recArray= recArray.map {|row| row.css("td").map {|c| c.text}.to_a } # Could of course be merged with above to even longer, nastier one-liner
Each cell had previously also been a Nokogiri Element type, done away with the .text mapping.
Significant speedup achieved.
Next I wonder what it would take to simply override the #to_a method of every Watir::Webdriver Table object globally in my Ruby code files....
(I realize that may not be 100% compatible but it would spare me so much code rewriting. Am willing to try in my personal.lib.rb include file.)

Read Dates as Strings with Spreadsheet Ruby Gem

I have been looking for a way to read out an Excel spreadsheet with keeping the dates that are in them being kept as a string. Unfortunately I can't see if this is possible or not, has anyone managed to do this or know how?
You may want to have a look at the Row class of the spreadsheet gem:
http://spreadsheet.rubyforge.org/Spreadsheet/Row.html
There's a lot that you can get there, but the Row#formatted method is probably what you want:
row = sheet.to_a[row_index] # Get row object
value = row.formatted[column_index]
The formatted method takes all the Excel formatting data for you and gives you an array of Ruby-classed objects
I think you can try row.at(col_index) method..
You can refer to this page

Using Ruby CSV to extract one column

I've been trying to work with getting a single column out of a csv file.
I've gone through the documentation, http://www.ruby-doc.org/stdlib/libdoc/csv/rdoc/index.html
but still don't really understand how to use it.
If I use CSV.table, the response is incredibly slow compared to CSV.read. I admit the dataset I'm loading is quite large, which is exactly the reason I only want to get a single column from it.
My request is simply currently looks like this
#dataTable = CSV.table('path_to_csv.csv')
and when I debug I get a response of
#<CSV::Table mode:col_or_row row_count:2104 >
The documentation says I should be able to use by_col(), but when I try to output
<%= debug #dataTable.by_col('col_name or index') %>
It gives me "undefined method 'col' error"
Can somebody explain to me how I'm supposed to use CSV? and if there is a way to get columns faster using 'read' instead of 'table'?
I'm using Ruby 1.92, which says that it is using fasterCSV, so I don't need to use the FasterCSV gem.
To pluck a column out of a csv I'd probably do something like the following:
col_data = []
CSV.foreach(FILENAME) {|row| col_data << row[COL_INDEX]}
That should be substantially faster than any operations on CSV.Table
You can get the values from single column of the csv files using the following snippet.
#dataTable = CSV.table('path_to_csv.csv')
#dataTable[:columnname]

extract single string from HTML using Ruby/Mechanize (and Nokogiri)

I am extracting data from a forum. My script based on is working fine. Now I need to extract date and time (21 Dec 2009, 20:39) from single post. I cannot get it work. I used FireXPath to determine the xpath.
Sample code:
require 'rubygems'
require 'mechanize'
post_agent = WWW::Mechanize.new
post_page = post_agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708')
puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.at_xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.xpath('//[#id="post1960370"]/tbody/tr[1]/td/div[2]/text()')
all my attempts end with empty string or an error.
I cannot find any documentation on using Nokogiri within Mechanize. The Mechanize documentation says at the bottom of the page:
After you have used Mechanize to navigate to the page that you need to scrape, then scrape it using Nokogiri methods.
But what methods? Where can I read about them with samples and explained syntax? I did not find anything on Nokogiri's site either.
Radek. I'm going to show you how to fish.
When you call Mechanize::Page::parser, it's giving you the Nokogiri document. So your "xpath" and "at_xpath" calls are invoking Nokogiri. The problem is in your xpaths. In general, start out with the most general xpath you can get to work, and then narrow it down. So, for example, instead of this:
puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
start with this:
puts post_page.parser.xpath('//table').to_html
This gets the any tables, anywhere, and then prints them as html. Examine the HTML, to see what tables it brought back. It probably grabbed several when you want only one, so you'll need to tell it how to pick out the one table you want. If, for example, you notice that the table you want has CSS class "userdata", then try this:
puts post_page.parser.xpath("//table[#class='userdata']").to_html
Any time you don't get back an array, you goofed up the xpath, so fix it before proceding. Once you're getting the table you want, then try to get the rows:
puts post_page.parser.xpath("//table[#class='userdata']//tr").to_html
If that worked, then take off the "to_html" and you now have an array of Nokogiri nodes, each one a table row.
And that's how you do it.
I think you have copied this from Firebug, firebug gives you an extra tbody, which might not be there in actual code... so my suggestion is to remove that tbody and try again.
if it still doesn't work ... then follow Wayne Conrad's process that's the best!

Resources