It's entirely possible that I'm missing something fundamental, but this is a new realm for me and I could use some pointers. I'm getting started using Ruby and Watir to drive/test a web application that's all AJAX-built. Many of the items don't have explicit classes/ids, and the dev team of course uses jQuery to get to them. I'm looking for a way to translate their jQuery into Watir to use/modify/check values of the same objects.
For example, they use this to see if there are values in a data grid's fifth column:
$("div.dataTable table tbody tr").has("td:eq(4):not(:empty)").length > 0
How would I go about doing something similar?
You could make the same check in Watir using:
#Get the rows of the table (assuming there is just one dataTable)
table_trs = browser.div(:class, 'dataTable').table.tbody.trs
#Find how many rows have data in the 5th cell
# Note that both jQuery and Watir are 0-based index (ie 4 means 5th cell)
rows_with_data = table_trs.count{ |tr| tr.td(:index, 4).text != '' }
#Do your comparison
rows_with_data > 0
You can write it all as one line, but I broke it up here for readability.
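For reference, the same check collapsed into a single expression would look something like this:
# One-line version of the check above
browser.div(:class, 'dataTable').table.tbody.trs.count { |tr| tr.td(:index, 4).text != '' } > 0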
You could also use Pincers. It's a small Ruby gem, like Watir, but it offers an API similar to jQuery on top of WebDriver.
Example:
require 'selenium-webdriver'
require 'pincers'
driver = Selenium::WebDriver.for :firefox
pincers = Pincers.for_webdriver driver
pincers.goto 'www.somesite.com'
pincers.css('a#link-id').click
(Disclosure: I work at Platanus.)
I'm using Selenium in Ruby ( a language that I am currently learning) and I have a drop down menu that I want to iterate though, select each option, do some stuff, and then move onto the next option.
I have looked at several answers that are somewhat similar. Only one Stack Overflow question had a similar idea in mind to mine, but it's in Python and I just don't know the equivalent syntax for Ruby.
I have read through the documentation for Ruby and haven't found anything that does the same thing as the Python approach.
Essentially what I want to do is:
select first option
click a button
navigate to a different page
download a csv
return back to the previous page
select second option
do the same thing
etc...until all the options are done
Is this possible? I can figure out returning to the previous page and clicking the csv option but I would like some help on the syntax part.
Thank you
The Ruby bindings for selenium-webdriver have a Select class for manipulating select lists.
Here's a contrived example that locates a select_list element, passes the element to a Select object, and prints the text of each option in the list. YMMV...
require "selenium-webdriver"
driver = Selenium::WebDriver.for :firefox
driver.navigate.to "https://www.seleniumeasy.com/test/basic-select-dropdown-demo.html"
element = driver.find_element(id: 'select-demo')
select_list = Selenium::WebDriver::Support::Select.new(element)
select_list.options.each { |option| puts option.text}
#=> Please select
#=> Sunday
#=> Monday
#=> Tues
...
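Building on that, here is a rough sketch of the select-click-download-return loop from the question. The button click and CSV download steps are placeholders you would fill in for your own site, and the select list is re-located on each pass because navigating away and back makes the old element reference stale:
require "selenium-webdriver"

driver = Selenium::WebDriver.for :firefox
driver.navigate.to "https://www.seleniumeasy.com/test/basic-select-dropdown-demo.html"

# Count the options once up front
option_count = Selenium::WebDriver::Support::Select.new(
  driver.find_element(id: 'select-demo')
).options.size

option_count.times do |i|
  # Re-locate the select list after each round trip to avoid stale element errors
  select_list = Selenium::WebDriver::Support::Select.new(
    driver.find_element(id: 'select-demo')
  )
  select_list.select_by(:index, i)

  # ...click your button, download the CSV, etc. (site-specific steps)...

  driver.navigate.back  # return to the page with the drop-down
end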
I want to remake the Olympic medals count on London2012 to better reflect the value of the medals. Currently it is only sorted by gold medals. I'd like to relist it by points, so gold=4, silver=2 and bronze=1, to make a new, more rational list. I probably want to remember the previous rank and then add a new rank column as well.
I'd like to try mechanize to get raw data from site, then parse the data into rows and cols, apply the new counts, then remake the list.
From source at http://www.london2012.com/medals/medal-count/ each country has a block with medals like so:
<span class="countryName">Canada</span></a></div></div></td><td class="gold c">0</td><td class="silver c">2</td><td class="bronze c">5</td>
If I use agent.get('http://www.london2012.com/medals/medal-count') it shows the whole list. How do I parse specific spans and table data?
I also need to remember the rank, then when I make the new page put the new rank beside it.
Any tips on Mechanize parsing and remembering data would be really helpful. More importantly, I'd appreciate hearing your thinking process for doing something like this to get me started. This doesn't have to be a code answer.
Thanks
First, identify the table. In Chrome, load the page and right-click anywhere on the table. Go to Inspect Element. Go up the hierarchy until you're on the table. Now select it and you'll see it looks like this:
<table class="or-tbl overall_medals sortable" summary="Schedule">
The overall_medals class looks like it will be unique so that's a good one to use. Now start irb and do:
require 'mechanize'
agent = Mechanize.new
page = agent.get 'http://www.london2012.com/medals/medal-count/'
double check that the table is unique:
page.search('table.overall_medals').size
#=> 1 (good, it is)
You can get all the data from the table into an array with:
page.search('table.overall_medals tr').map{|tr| tr.search('td').map(&:text)}
Notice that the first 2 rows are empty; let's get rid of them by using a range:
data = page.search('table.overall_medals tr')[2..-1].map{|tr| tr.search('td').map(&:text)}
The second row isn't really empty; it has the column names (in th's instead of td's). You can get those with:
columns = page.search('table.overall_medals tr[2] th').map{|th| th.text.strip}
You can get these into hashes with:
rows = data.map{|row| Hash[columns.zip row]}
Now you can do
rows[0]['Country']
#=> "United States of America"
Or even one big hash:
countries = rows.map{|row| {row['Country'] => row}}.reduce &:merge
now:
countries['France']['Gold']
#=> "8"
You might find this Medals API useful (assuming your question is not specifically about Mechanize):
http://apify.heroku.com/resources/5014626da8cdbb0002000006
It uses Nokogiri to parse the site and the output is available as JSON:
http://apify.heroku.com/api/olympics2012_medals.json
I'm trying to extract the data from an Income Statement, url is http://finance.yahoo.com/q/is?s=LMT+Income+Statement&annual
I was unable to find the table using browser.table(:name, 'blah') or (:id, 'blah'), but had some luck using XPath with Nokogiri and this code, which picks up after I've initialized everything and browsed to the page:
page_html = Nokogiri::HTML.parse(browser.html)
tobj = page_html.xpath('//*[@id="yfncsumtab"]').inner_text
Now I'm able to take tobj and pull the data out, but it doesn't do me any good for trying to manipulate the object as a table. Any suggestions on how to go about storing the table as a variable would help. I can probably figure out iterating through the rows/columns from there, but I wouldn't mind if you tacked on some code that would do that.
Do you know Watir has xpath support?
browser.element(:xpath => '//*[@id="yfncsumtab"]')
Look at it this way:
doc = Nokogiri::HTML.parse(browser.html)
table = doc.at('table#yfncsumtab')
# iterate through tr's
table.search('tr').each do |tr|
# do something with tr
end
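To actually keep the table in a variable you can work with, one rough approach (assuming the selector above finds the right table) is to map each row to an array of its cell texts:
# Rows become arrays of stripped cell text
rows = table.search('tr').map { |tr| tr.search('th, td').map { |cell| cell.text.strip } }
rows.each { |cells| puts cells.join(' | ') }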
Try browser.element(id: "yfncsumtab").text
The following works but is always very slow, seemingly halting my scraping program and its Firefox or Chrome browser even for whole minutes per page:
pp recArray = $browser.table(:id,"recordTable").to_a
Getting the HTML table's text or html source is fast though:
htmlcode = $browser.table(:id,"recordTable").html # .text shows only plaintext portion like lynx
How might I be able to create the same recArray (each element from a <TR>) using for example a Nokogiri object holding only that table's html?
recArray = Nokogiri::HTML(htmlcode). ??
I wrote a blog post about that a few days ago: http://zeljkofilipin.com/watir-nokogiri/
If you have further questions, ask.
You want each tr in the table?
Nokogiri::HTML($browser.html).css('table[@id="recordTable"] > tr')
This gives a NodeSet, which can be more useful than an Array. Of course there's still to_a if you need it.
Thought it would be useful to sum up all the steps scattered here and there:
The question was how to produce, much faster, the same array of strings (the page's text content) that a Watir::Webdriver Table's #to_a would produce:
recArray = Nokogiri::HTML(htmlcode). ??
So instead of this as I was doing before:
recArray = $browser.table(:class, 'detail-table w-Positions').to_a
I send the whole page's html as a string to Nokogiri to let it do the parsing:
recArray = Nokogiri::HTML($browser.html).css('table[@class="detail-table w-Positions"] tr').to_a
This found the rows of the table I want and put them into an array.
I'm not done yet, though: the elements of that array are still Nokogiri node types (table rows), which barfed when I attempted things like .join(",") (useful for writing into a .CSV file or a database, for instance).
So the following iterates through each row element, turning each into an array of pure Ruby String types, containing only the text content of each table cell stripped of html tags:
recArray = recArray.map { |row| row.css("td").map { |c| c.text } } # could of course be merged with the above into an even longer, nastier one-liner
Each cell had previously also been a Nokogiri element type; the .text mapping takes care of that.
Significant speedup achieved.
Next I wonder what it would take to simply override the #to_a method of every Watir::Webdriver Table object globally in my Ruby code files....
(I realize that may not be 100% compatible but it would spare me so much code rewriting. Am willing to try in my personal.lib.rb include file.)
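For what it's worth, a very rough monkey-patch sketch (untested, and assuming the Nokogiri approach above is what you want everywhere; adjust the class name to whatever your Watir version actually uses) might look like:
require 'watir-webdriver'
require 'nokogiri'

class Watir::Table
  # Replace the slow cell-by-cell #to_a with a Nokogiri parse of this table's html
  def to_a
    Nokogiri::HTML(html).css('tr').map do |row|
      row.css('th, td').map { |cell| cell.text }
    end
  end
end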
I am extracting data from a forum. My script is working fine so far. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post. I cannot get it to work. I used FireXPath to determine the XPath.
Sample code:
require 'rubygems'
require 'mechanize'
post_agent = WWW::Mechanize.new
post_page = post_agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708')
puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.at_xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.xpath('//*[@id="post1960370"]/tbody/tr[1]/td/div[2]/text()')
All my attempts end with an empty string or an error.
I cannot find any documentation on using Nokogiri within Mechanize. The Mechanize documentation says at the bottom of the page:
After you have used Mechanize to navigate to the page that you need to scrape, then scrape it using Nokogiri methods.
But what methods? Where can I read about them with samples and explained syntax? I did not find anything on Nokogiri's site either.
Radek, I'm going to show you how to fish.
When you call Mechanize::Page::parser, it's giving you the Nokogiri document. So your "xpath" and "at_xpath" calls are invoking Nokogiri. The problem is in your xpaths. In general, start out with the most general xpath you can get to work, and then narrow it down. So, for example, instead of this:
puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
start with this:
puts post_page.parser.xpath('//table').to_html
This gets any tables, anywhere, and then prints them as HTML. Examine the HTML to see what tables it brought back. It probably grabbed several when you want only one, so you'll need to tell it how to pick out the one table you want. If, for example, you notice that the table you want has the CSS class "userdata", then try this:
puts post_page.parser.xpath("//table[#class='userdata']").to_html
Any time you don't get back an array, you goofed up the xpath, so fix it before proceeding. Once you're getting the table you want, then try to get the rows:
puts post_page.parser.xpath("//table[#class='userdata']//tr").to_html
If that worked, then take off the "to_html" and you now have an array of Nokogiri nodes, each one a table row.
And that's how you do it.
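As a last step for the date/time specifically, once you have those row nodes you can call .text on each one (or keep narrowing with further xpath calls) and pick out the one containing the timestamp. A sketch, still using the hypothetical 'userdata' class from above:
rows = post_page.parser.xpath("//table[@class='userdata']//tr")
rows.each do |tr|
  text = tr.text.strip
  puts text if text =~ /\d{1,2} \w{3} \d{4}, \d{2}:\d{2}/  # e.g. "21 Dec 2009, 20:39"
end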
I think you have copied this from Firebug. Firebug gives you an extra tbody, which might not be there in the actual code... so my suggestion is to remove that tbody and try again.
If it still doesn't work... then follow Wayne Conrad's process; that's the best!