select tr>3 with nokogiri - ruby

I want to select the table rows that contain more than 3 columns.
How do I write the XPath for this with Nokogiri?
require 'rubygems'
require 'nokogiri'
item='sometext'
doc = Nokogiri::HTML.parse(open(item))
data=doc.xpath('/html/body/table/tr[#td.size>3]')
puts data
It does not run. Help and advice appreciated.

The correct XPath will be something like this.
doc.xpath('/html/body/table/tr[count(td)>3]')
However, in my test program, I can't get Nokogiri to like absolute XPaths like this. I have to use the double-slash XPath instead.
require 'rubygems'
require 'nokogiri'
html = %{
<table>
<tr class=wrong><td><td></tr>
<tr class=right><td><td><td></tr>
</table>
}
doc = Nokogiri::HTML.parse(html)
data = doc.xpath('//table/tr[count(td)>2]')
puts data.attribute('class')

Related

ruby selenium xpath td css

I am testing a webapp using Ruby and Selenium web-driver. I have not been able to examine the contents of a cell in the displayed webpage.
What I would like to get is the IP in the td.
<td class="multi_select_column"><input name="object_ids" type="checkbox"
value="adcf0467-2756-4c02-9edd-bb83c40b8685" /></td>
<td class="sortable normal_column">Core</td>
<td class="sortable nowrap-col normal_column">r1-c4-b4</td>
<td class="sortable anchor normal_column"><a href="/horizon/admin/instances
/adcf0467-2756-4c02-9edd-bb83c40b8685/detail" class="">pg-gtmpg--675</a></td>
<td class="sortable normal_column">column_name</td><td class="sortable normal_column">
<tr
<ul>
<li>172.25.1.12</li>
</ul>
I used the Firefox add-on FirePath to get the XPath of the IP.
It gives "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1]/td[6]/ul/li", which looks correct.
However, I have not been able to display the IP.
Here is my test code:
#!/usr/bin/env ruby
#
# Sample Ruby script using the Selenium WebDriver API
#
require "rubygems"
require "selenium-webdriver"  # provides Selenium::WebDriver
require "test/unit"
driver = Selenium::WebDriver.for(:remote, :url =>"http://dog.dog.jump.acme.com:4444/wd/hub")
driver.navigate.to "http://10.87.252.37/acme/auth/login/"
g_user_name = driver.find_element(:id, 'id_username')
g_user_name.send_keys("user")
g_user_name.submit
g_password = driver.find_element(:id, 'id_password')
g_password.send_keys("password")
g_password.submit
g_instance_1 = driver.find_element(:xpath, "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1]/td[4]/a")
puts g_instance_1.text()     # here I can see the text
g_instance_2 = driver.find_elements(:xpath, "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1]/td[6]/ul/li[1]")
puts g_instance_2            # output: #<Selenium::WebDriver::Element:0x000000023c1700>
puts g_instance_2.inspect    # output: [#<Selenium::WebDriver::Element:0x22f3b7c6e7724d4a id="4">]
puts g_instance_2.class      # output: Array
puts g_instance_2.count      # output: 1
When there is no /a in the td it doesn't seem to work.
I have tried puts g_instance_2.text, g_instance_2.text() and many others with no success.
I must be missing something obvious, but I am not seeing it. My environment:
ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-linux] on
Linux ubuntu 3.8.0-34-generic #49~precise1-Ubuntu
I decided to try a different approach, using a CSS selector instead of XPath.
When I insert the following CSS selector into the FirePath window, the desired HTML section is selected.
g_instance_2 = driver.find_elements(:css, "table#instances tbody tr:nth-of-type(1) td:nth-of-type(6) ul li:nth-of-type(1)" )
The problem is the same as before: I don't seem to be able to access the contents of g_instance_2.
I have tried:
puts g_instance_2
g_instance_22 = [g_instance_2]
puts g_instance_22
Both return;
#<Selenium::WebDriver::Element:0x000000028a6ba8>
#<Selenium::WebDriver::Element:0x000000028a6ba8>
How can I check the value returned from the remote web-server?
Would Python be a better choice to do this?
The HTML code fragment you are trying to test is not valid HTML. It might be worth filing a bug report for it.
With the given code, the following CSS selector retrieves the <a> you want:
[href^="/horizon/admin/instances"]
Translated into: any element that has the "href" attribute starting with "/horizon/admin/instances"
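For XPath, this is the selector:
("//a[contains(@href,'/horizon/admin/instances')]")
Nearly the same translation, just an uglier syntax. Note that contains() matches the substring anywhere in the attribute; starts-with(@href, '/horizon/admin/instances') would mirror the CSS ^= operator exactly.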
The problem was that I was not accessing the returned array properly. find_elements returns an Array of elements, so I have to index into it:
puts g_instance_2[0].text()
This works for both the CSS and the XPath locator.
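To illustrate the difference without needing a browser, here is a minimal sketch. The anonymous Struct is only a hypothetical stand-in for Selenium::WebDriver::Element, and fake_find_elements and element_texts are made-up names. The point is that find_elements (plural) hands back an Array, so text has to be called on each item rather than on the array itself.

```ruby
# Hypothetical stand-in for what driver.find_elements returns:
# an Array of objects that respond to #text.
def fake_find_elements
  element = Struct.new(:text)      # anonymous struct with a #text reader
  [element.new("172.25.1.12")]
end

# Collects the text of every element in the array.
def element_texts(elements)
  elements.map(&:text)
end

found = fake_find_elements
puts found.first.text              # text of the first match
puts element_texts(found).inspect  # text of all matches, as an Array
```

If only one match is expected, find_element (singular) returns the element itself, so .text can be called on it directly.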

How to create an array scraping HTML?

I have a Rake task set-up, and it works almost how I want it to.
I'm scraping information from a site and want to get all of the player ratings into an array, ordered by how they appear in the HTML. I have player_ratings and want to do exactly what I did with the player_names variable.
I only want the fourth <td> within a <tr> in the specified part of the doc because that corresponds to the ratings. If I use Nokogiri's text, I only get the first player rating when I really want an array of all of them.
task :update => :environment do
require "nokogiri"
require "open-uri"
team_ids = [7689, 7679, 7676, 7680]
player_names = []
for team_id in team_ids do
url = URI.encode("http://modules.ussquash.com/ssm/pages/leagues/Team_Information.asp?id=#{team_id}")
doc = Nokogiri::HTML(open(url))
player_names = doc.css('.table.table-bordered.table-striped.table-condensed')[1].css('tr td a').map(&:content)
player_ratings = doc.css('.table.table-bordered.table-striped.table-condensed')[1].css('tr td')[3]
puts player_ratings
player_names.map{|player| puts player}
end
end
Any advice on how to do this?
I think changing your XPath might help. Here is the XPath:
nodes = doc.xpath "//table[@class='table table-bordered table-striped table-condensed'][2]//tr/td[4]"
data = nodes.map {|node| node.text }
Mapping the nodes with node.text gives me
4.682200 
5.439000 
5.568400 
5.133700 
4.480800 
4.368700 
2.768100 
3.814300 
5.103400 
4.567000 
5.103900 
3.804400 
3.737100 
4.742400 
I'd recommend using Wombat (https://github.com/felipecsl/wombat), where you can specify that you want to retrieve a list of elements matched by your CSS selector, and it will do all the hard work for you.
It's not well known, but Nokogiri implements some of jQuery's JavaScript extensions for searching using CSS selectors. In your case, the :eq(n) selector will be useful:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<html>
<body>
<table>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
</table>
</body>
</html>
EOT
doc.at('td:eq(4)').text # => "4"

Using Nokogiri to scrape a value from Yahoo Finance?

I wrote a simple script:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://au.finance.yahoo.com/q/bs?s=MYGN"
doc = Nokogiri::HTML(open(url))
name = doc.at_css("#yfi_rt_quote_summary h2").text
market_cap = doc.at_css("#yfs_j10_mygn").text
ebit = doc.at("//*[@id='yfncsumtab']/tbody/tr[2]/td/table[2]/tbody/tr/td/table/tbody/tr[11]/td[2]/strong").text
puts "#{name} - #{market_cap} - #{ebit}"
The script grabs three values from Yahoo finance. The problem is that the ebit XPath returns nil. The way I got the XPath was using the Chrome developer tools and copy and pasting.
This is the page I'm trying to get the value from http://au.finance.yahoo.com/q/bs?s=MYGN and the actual value is 483,992 in the total current assets row.
Any help would be appreciated, especially if there is a way to get this value with CSS selectors.
Nokogiri supports CSS selectors for this:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://au.finance.yahoo.com/q/bs?s=MYGN"))
ebit = doc.at('strong:contains("Total Current Assets")').parent.next_sibling.text.gsub(/[^,\d]+/, '')
puts ebit
# >> 483,992
I'm using the <strong> tag as a placeholder with the :contains pseudo-class, then backing up to the containing <td>, moving to the next <td> and grabbing its text, and finally stripping the whitespace using gsub(/[^,\d]+/, ''), which removes everything that isn't a digit or a comma.
Nokogiri supports a number of jQuery's JavaScript extensions, which is why :contains works.
This seems to work for me
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.tr(",","").to_i
#=> 483992
Or as a string
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.strip.gsub(/\u00A0/,"")
#=> "483,992"

Accessing XPath query results when all that is returned is a LibXML object in Ruby

require 'net/http'; require 'libxml'
data = Net::HTTP.get_response(URI.parse('http://myurl.com')).body
source = LibXML::XML::Parser.string(data).parse
tables = source.find('//table')
returns
=> #<LibXML::XML::XPath::Object:0x1f4f50>
How do I access this? There are at least 11 tables there.
p.s. I can't use Nokogiri on my current setup.
You access the XPath results by asking for the node item like this.
require 'net/http'
require 'libxml'
# Sample text with a few tables
xml=<<END
<html>
<table id="t1"><tr><td>foo</td></tr></table>
<table id="t2"><tr><td>goo</td></tr></table>
<table id="t3"><tr><td>hoo</td></tr></table>
</html>
END
# Parse the text into tables
source = LibXML::XML::Parser.string(xml).parse
tables = source.find('//table')
# The XPath result's #each iterator yields each matching XML node
tables.each {|node|
puts node["id"]
}
If you have an older version of libxml:
- puts node["id"]
+ puts node.property("id")
Managed to work it out with Hpricot!

How to get text from 'td' tags from 'table' tag on html page using Mechanize

How can I get the text of the 'td' tags in a 'table' on an HTML page using the Mechanize gem?
I almost always use mechanize with nokogiri. This guide helped me get started.
Something like this should work (Untested):
require 'mechanize'
require 'nokogiri'
agent = Mechanize.new
page = agent.get("http://www.google.com/")
doc = Nokogiri::HTML(page.body, nil, "UTF-8")  # arguments are (input, url, encoding)
doc.xpath('//td').each do |node|
puts node.text
end
More information on nokogiri here
