ruby watir to get html of a page


I have looked through the examples on these pages
http://watir.com/examples/
http://wiki.openqa.org/display/WTR/Examples
I still don't see a simple example of getting html of a page.
browser = Watir::Browser.new
browser.goto 'mysite.com'
I have tried
puts browser.text
It does not seem to work.
Thanks

This should do it:
puts browser.html

puts browser.html
This returns all of the HTML. If you only want to print the active objects, you can use:
puts browser.show_active
Similarly if you only want the links to be printed, you can use:
puts browser.show_links

IE8, Ruby 1.9.3, Watir 3.0, WindowsXP
I need to grab the text in a cell with id="numberCovered".
<table cellpadding="0" cellspacing="0" class="thisThemeBodyColor"><tr style="height:22px;"><td id="numberCoveredlabel" style="cursor:default;" class="smallHeadingBlack" width="200">Number of individuals to be covered</td><td id="numberCovered" class="smallHeadingBlack" style="font-weight:bold;">1</td><input type="hidden" name="numberCovered" tooltip="" value="1" onpropertychange="variableAsTextChanged(this);"/></tr><tr><td id="numberSpouseslabel" style="cursor:default;" class="smallHeadingBlack" width="200">Number of spouses to be covered</td><td id="numberSpouses" class="smallHeadingBlack" style="font-weight:bold;">0</td><input type="hidden" name="numberSpouses" tooltip="" value="0" onpropertychange="variableAsTextChanged(this);"/></tr></table>
As @icn mentioned, a raw page source dump is sometimes nice to have as a fallback when you can't find an appropriate Watir builtin method.
--Update--
The above-mentioned $browser.html was spewing empty lines, but this seems to be working:
require 'nokogiri'
page_html = Nokogiri::HTML.parse($browser.html)
entry = page_html.css('td[id=numberCovered]')

puts browser.html will return all the objects on the page. If you want only the active objects, you can use puts browser.show_active. Similarly, if you want only the links to be displayed, you can use puts browser.show_links, which will show all the links on the page.

Related

How to iterate a table with watir when no html element has identifiers

I have an html table which has a table with unequal number of columns for each row. The table and cells/columns have no identifiers such as id, name, class etc. How do I iterate over such a table and print it in tabular form ? I am using ruby 1.8 for now.
Html -
<table>
<tr><td colspan="2">Student Info</td></tr>
<tr><td>Age:</td> <td>15</td></tr>
<tr><td>Home:</td> <td>251 Palm Avenue</td></tr>
<tr><td>City:</td> <td>New York</td></tr>
<tr><td colspan="2">Parent Info</td></tr>
<tr><td>Parent Phone:</td> <td>231-1234-123</td></tr>
<tr><td>More parent info</td> <td><a href="http://www.school.com">school</a><br></td></tr>
</table>
Ruby code -
require 'rubygems'
require 'watir-webdriver'
url = "url has tables with no identifiable attributes. Just a table tag"
browser = Watir::Browser.new :firefox
browser.goto url
browser.table.trs.each do |tr|
tr.each do |td|
puts td.to_s
end
end
Trace -
C:/ruby/lib/ruby/gems/1.8/gems/watir-webdriver-0.6.2/lib/watir-webdriver/elements/element.rb:553:in `method_missing': undefined method `each' for #<Watir::TableRow:0x517bf9c> (NoMethodError)
from tables.rb:10
from C:/ruby/lib/ruby/gems/1.8/gems/watir-webdriver-0.6.2/lib/watir-webdriver/element_collection.rb:29:in `each'
from C:/ruby/lib/ruby/gems/1.8/gems/watir-webdriver-0.6.2/lib/watir-webdriver/element_collection.rb:29:in `each'
from tables.rb:9

Just grab the table, and send it to a file (or variable) iterating over the rows and placing a tab between the elements
browser = Watir::Browser.new :firefox
browser.goto url
f = File.new('table.txt', 'w+')
t = browser.table
t.trs.each do |trd|
  trd.tds.each do |td|
    f.print "#{td.text}\t"
  end
  f.print "\n"
end
f.close
EDIT** in answer to the question in the comments:
Well, don't be hard on yourself; I don't think the documentation is beginner friendly. I had to extrapolate from what Justin_Ko said and the docs to see that <tr> is referenced by tr and the collection of <tr> elements by trs. The thing to remember is that those collections, and almost everything else returned by the Watir methods, are objects, but they might not behave the way you expect. trs is enumerable, but it yields row objects, not the text of the row itself. The same goes for tds. That's why I had to iterate through the collection of rows, then iterate through each row's td objects, then call .text on each one. Think about Watir this way: you can reference anything by a class or identifier or, as in this case, just by HTML element. browser reads everything on the page, and from there you can target any element(s) using the Watir methods.
The cheat sheet is very handy:
https://github.com/watir/watir/wiki/Cheat-Sheet

ruby selenium xpath td css

I am testing a webapp using Ruby and Selenium web-driver. I have not been able to examine the contents of a cell in the displayed webpage.
What I would like to get is the IP in the td.
<td class="multi_select_column"><input name="object_ids" type="checkbox"
value="adcf0467-2756-4c02-9edd-bb83c40b8685" /></td>
<td class="sortable normal_column">Core</td>
<td class="sortable nowrap-col normal_column">r1-c4-b4</td>
<td class="sortable anchor normal_column"><a href="/horizon/admin/instances
/adcf0467-2756-4c02-9edd-bb83c40b8685/detail" class="">pg-gtmpg--675</a></td>
<td class="sortable normal_column">column_name</td><td class="sortable normal_column">
<tr
<ul>
<li>172.25.1.12</li>
</ul>
I used the Firefox addon firepath to get the Xpath of the IP.
It gives "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1]/td[6]/ul/li", which looks correct.
However I have not been able to display the IP.
Here is my test code;
#!/usr/bin/env ruby
#
# Sample Ruby script using the Selenium client API
#
require "rubygems"
require "selenium/client"
require "test/unit"
begin
driver = Selenium::WebDriver.for(:remote, :url =>"http://dog.dog.jump.acme.com:4444/wd/hub")
driver.navigate.to "http://10.87.252.37/acme/auth/login/"
g_user_name = driver.find_element(:id, 'id_username')
g_user_name.send_keys("user")
g_user_name.submit
g_password = driver.find_element(:id, 'id_password')
g_password.send_keys("password")
g_password.submit
g_instance_1 = driver.find_element(:xpath, "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1] /td[4]/a")
puts g_instance_1.text() # <- here I can see the text
g_instance_2 = driver.find_elements(:xpath, "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1] /td[6]/ul/li[1]")
puts g_instance_2
output is <Selenium::WebDriver::Element:0x000000023c1700
puts g_instance_2.inspect
output is :[#<Selenium::WebDriver::Element:0x22f3b7c6e7724d4a id="4">]
puts g_instance_2.class
Output: Array
puts g_instance_2.count
Output:1
When there is no /a in the td it doesn't seem to work.
I have tried puts g_instance_2.text, g_instance_2.text() and many others with no success.
I must be missing something obvious, but I am not seeing it
ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-linux] on
Linux ubuntu 3.8.0-34-generic #49~precise1-Ubuntu
I decided to try a different approach using the css selector instead of xpath.
When I insert the following css selector into the FirePath window the desired html section is selected.
g_instance_2 = driver.find_elements(:css, "table#instances tbody tr:nth-of-type(1) td:nth-of-type(6) ul li:nth-of-type(1)" )
The problem is the same as before: I don't seem to be able to access the contents of g_instance_2.
I have tried;
puts g_instance_2
g_instance_22 = [g_instance_2]
puts g_instance_22
Both return;
#<Selenium::WebDriver::Element:0x000000028a6ba8>
#<Selenium::WebDriver::Element:0x000000028a6ba8>
How can I check the value returned from the remote web-server?
Would Python be a better choice to do this?

The HTML code fragment you are trying to test is not valid HTML. It might be worth filing a bug report for it.
With the given code, the following CSS selector retrieves the <a> you want:
[href^="/horizon/admin/instances"]
Translated into: any element that has the "href" attribute starting with "/horizon/admin/instances"
For XPath, this is the selector:
("//a[contains(@href,'/horizon/admin/instances')]")
Roughly the same translation (it matches "contains" rather than "starts with"), just an uglier syntax.

The problem was that I was not accessing the returned array properly.
puts g_instance_2[0].text()
works for css and xpath
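A minimal sketch of why indexing matters: find_elements returns an Array even when only one element matches, so the text has to be read from an item in that array. To keep the sketch runnable without a browser, FakeElement is a hypothetical stand-in for Selenium::WebDriver::Element and only mimics its text method:

```ruby
# Hypothetical stand-in for Selenium::WebDriver::Element (assumption:
# only the #text method matters for this illustration).
FakeElement = Struct.new(:text)

# Shape of what find_elements returns: an Array of elements.
g_instance_2 = [FakeElement.new("172.25.1.12")]

# Index into the array (or use .first) before asking for the text.
first_ip = g_instance_2[0].text

# When several elements match, collect the text of each one.
all_ips = g_instance_2.map(&:text)

puts first_ip
puts all_ips.inspect
```

With the real driver, find_element (singular) returns a single element, so .text can be called on it directly, as in the g_instance_1 line of the question.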

Clicking the "Show more" link on a LinkedIn group page using Ruby Mechanize

I have logged in to LinkedIn and reached my groups page using Ruby Mechanize. I am also able to retrieve the list of questions on the page. However, I am unable to click the "Show more" link at the bottom so that I can get the entire page and hence all the questions:
require 'rubygems'
require 'mechanize'
require 'open-uri'
a = Mechanize.new { |agent|
  # LinkedIn probably refreshes after login
  agent.follow_meta_refresh = true
}
a.get('http://linkedin.com/') do |home_page|
  my_page = home_page.form_with(:name => 'login') do |form|
    form.session_key = '********' # put your email ID here
    form.session_password = '********' # put your password here
  end.submit
  mygroups_page = a.click(my_page.link_with(:text => /Groups/))
  #puts mygroups_page.links
  link_to_analyse = a.click(mygroups_page.link_with(:text => 'Semantic Web'))
  link_to_test = link_to_analyse.link_with(:text => 'Show more...')
  puts link_to_test.class
  # link_to_analyse.search(".user-contributed .groups a").each do |item|
  #   puts item['href']
  # end
end
Although a link with the text 'Show more...' exists on the page, I am somehow not able to click it. link_to_test.class shows NilClass. What could the problem be?
The part of the page I need to reach is:
<div id="inline-pagination">
<span class="running-count">20</span>
<span class="total-count">1134</span>
<a href="groups?mostPopularList=&gid=49970&split_page=2&ajax=ajax" class="btn-quaternary show-more-comments" title="Show more...">
<span>Show more...</span>
<img src="http://static01.linkedin.com/scds/common/u/img/anim/anim_loading_16x16.gif" width="16" height="16" alt="">
</a>
</div>
I need to click the "Show more..." link. I could use links_with(:href => ..), but that doesn't seem to work.

NEW ANSWER:
I just inspected the page source of the group and it seems that for the "Show more" link they actually use the three full stop characters and not an ellipsis.
Have you tried targeting the link by its title attribute?
link_to_analyse.link_with(:title => 'Show more...')
If that's still not working, have you tried dumping the text of all the links on the page with
link_to_analyse.links.each do |link|
puts link.text
end
---- OLD ANSWER INCORRECT ----
LinkedIn use the "Horizontal Ellipsis" Unicode character (code U+2026) for their links that "look" like they have "..." at the end. So your code is not actually finding the link.
Character you need: http://www.fileformat.info/info/unicode/char/2026/index.htm
Sneaky :)
EDIT: and of course, to get the link you need to put the appropriate Unicode character in your link text, like so (double quotes are needed for Ruby to interpret the \u escape):
link_to_analyse.link_with(:text => "Show more\u2026")

The tags inside the anchor will create some white space around the anchor text. You can account for that with:
link_to_analyse.link_with :text => /\A\s*Show more...\s*\Z/
But it's probably good enough to just do:
link_to_analyse.link_with :text => /Show more.../
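One caveat with the patterns above: an unescaped "." in a regex matches any character, so /Show more.../ also matches text that does not end in three full stops. A small sketch in plain Ruby, where link_text is a hypothetical example of anchor text padded with whitespace by the nested tags:

```ruby
# Unescaped dots: each "." matches any single character.
loose = /Show more.../

# Escaped dots match literal full stops; \A and \z anchor the whole string,
# and \s* allows the surrounding whitespace.
strict = /\A\s*Show more\.\.\.\s*\z/

# Hypothetical link text as it might appear with whitespace around it.
link_text = "\n  Show more...\n  "

loose_false_positive = loose.match?("Show more now")  # dots match " no"
strict_match         = strict.match?(link_text)
strict_reject        = strict.match?("Show more now")

puts loose_false_positive
puts strict_match
puts strict_reject
```

The strict pattern could be passed to link_with(:text => ...) in the same way as the patterns above.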

Why does this Nokogiri XPath have a null return?

I'm XPath-ing through a web page with NOKOGIRI. I'm familiar with XPath, but I cannot figure out why my XPath fails to pick up the specific row. See the ruby code.
I used FireBug XML to validate my XPath, so I am 99% sure my XPath is correct.
require 'nokogiri'
require 'open-uri'
@searchURL = 'http://www.umn.edu/lookup?UID=smit4562'
@xpath = '//html/body/p/table/tbody/tr/td[2]/table/tbody/tr[2]'
doc = Nokogiri::HTML(open(@searchURL))
puts 'row should be = Email Address: smit4562@umn.edu'
puts '=> ' + doc.xpath(@xpath).to_s
puts 'is row empty?'
puts '=> ' + doc.xpath(@xpath).empty?().to_s

The <tbody> tag is an optional tag which is implicit if it is omitted. This means the <tbody> tags are inserted automatically by the browser when not present. They are not in the source code in your example, so nokogiri doesn't know about them.
Firebug uses the generated DOM, which does contain the tbody elements, so the statement does match inside a browser.
Remove both the tbody selectors and you should be fine.

How does Nokogiri handle unclosed HTML tags like <br>?

When parsing an HTML document, how does Nokogiri handle <br> tags? Suppose we have a document that looks like this one:
<div>
Hi <br>
How are you? <br>
</div>
Does Nokogiri know that <br> tags are something special, not just regular XML tags, and ignore them when parsing the node feed? I think Nokogiri is that smart, but I want to make sure before I accept this project, which involves scraping a site written in HTML4. You know what I mean (How are you? is not the content of the first <br>, as it would be in XML).

Here's how Nokogiri behaves when parsing (malformed) XML:
require 'nokogiri'
doc = Nokogiri::XML("<div>Hello<br>World</div>")
puts doc.root
#=> <div>Hello<br>World</br></div>
Here's how Nokogiri behaves when parsing HTML:
require 'nokogiri'
doc = Nokogiri::HTML("<div>Hello<br>World</div>")
puts doc.root
#=> <html><body><div>Hello<br>World</div></body></html>
p doc.at('div').text
#=> "HelloWorld"
I'm assuming that by "something special" you mean that you want Nokogiri to treat it like a newline in the source text. A <br> is not something special, and so appropriately Nokogiri does not treat it differently than any other element.
If you want it to be treated as a newline, you can do this:
doc.css('br').each{ |br| br.replace("\n") }
p doc.at('div').text
#=> "Hello\nWorld"
Similarly, if you wanted a space instead:
doc.css('br').each{ |br| br.replace(" ") }
p doc.at('div').text
#=> "Hello World"

You must parse this fragment using the HTML parser, as obviously this is not valid XML. When using the HTML one, Nokogiri then behaves as you'd expect it:
require 'nokogiri'
doc = Nokogiri::HTML(<<-EOS
<div>
Hi <br>
How are you? <br>
</div>
EOS
)
doc.xpath("//br").each{ |e| puts e }
prints
<br>
<br>
Mechanize is based on Nokogiri for doing web scraping, so it is quite appropriate for the task.

As far as I can remember from doing some HTML parsing last year it'll view them as separate.
EDIT: My bad. I've just had someone send me the code and retested it; we ended up dealing with some things, including <br>, separately.
