Nokogiri Xpath Double Looping - ruby

What I'm trying to do is pull the code block that contains the td with the class default. This works perfectly fine. But then I need to sort out the different parts of the code block. When I try to do this with the second xpath call, it prints all the comheads from every block on each iteration.
def HeaderProcessor(doc)
  doc.xpath("//td[@class='default']").each do |block|
    puts block.xpath("//span[@class='comhead']").text
  end
end
When I just print out block, each block prints once and contains the comment header and the comment. When I run the xpath, it prints EVERY comhead found in doc and seems to ignore the block variable.
Any ideas on how I can make this work? What am I misunderstanding about xpath?
UPDATE:
<td class="default">
<div style="margin-top:2px; margin-bottom:-10px; ">
<span class="comhead">
#some data
</span></div>
<br><span class="comment"><font color="#000000">#some more data</span>
</td>

You're telling Nokogiri to search from the root when you say //span[@class='comhead']. A path that starts with // is anchored at the document, not at block. You just want */span[@class='comhead']:
doc.xpath("//td[@class='default']").each do |block|
  block.xpath("*/span[@class='comhead']").each do |span|
    puts span.text
  end
end
or even just this:
doc.xpath('//td[@class="default"]/*/span[@class="comhead"]').each do |span|
  puts span.text
end
if you don't need to do anything with the <td> elements.
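The difference between a root-anchored and a context-relative path can be sketched as follows. This is only an illustration: it uses REXML, which ships with Ruby, instead of Nokogiri so that it runs without extra gems, and the sample markup and the names alice and bob are made up. Nokogiri follows the same XPath rules.

```ruby
require 'rexml/document'

# Hypothetical sample markup: two "default" cells, one comhead span each.
xml = <<~XML
  <table>
    <tr><td class="default"><div><span class="comhead">alice</span></div><span class="comment">first</span></td></tr>
    <tr><td class="default"><div><span class="comhead">bob</span></div><span class="comment">second</span></td></tr>
  </table>
XML

doc = REXML::Document.new(xml)
blocks = REXML::XPath.match(doc, "//td[@class='default']")

# "//span" is anchored at the document root, so the block is ignored
# and every comhead in the document is returned for each block.
from_root = blocks.map do |block|
  REXML::XPath.match(block, "//span[@class='comhead']").map(&:text)
end

# ".//span" searches only below the current block.
from_block = blocks.map do |block|
  REXML::XPath.match(block, ".//span[@class='comhead']").map(&:text)
end

p from_root
p from_block
```

With Nokogiri the equivalent relative call would be block.xpath(".//span[@class='comhead']"), which also matches spans nested deeper than one level, unlike */span.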

Related

Why is the following Nokogiri/XPath code removing tags inside the node?

The document going in has a structure like this:
<span class="footnote">Hello there, link</span>
The XPath search is:
@doc = set_nokogiri(html)
footnotes = @doc.xpath(".//span[@class = 'footnote']")
footnotes.each_with_index do |footnote, index|
  puts footnote
end
The above footnote becomes:
<span>Hello there, link</span>
I assume my XPath is wrong but I'm having a hard time figuring out why.
I had the wrong tag in the output and should have been more careful. The point being that the <a> tag is getting stripped but its contents are still included.
I also added the set_nokogiri line in case that's relevant.
I can't duplicate the problem:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<span class="footnote">Hello there, link</span>
EOT
footnotes = doc.xpath(".//span[@class = 'footnote']")
footnotes.to_xml # => "<span class=\"footnote\">Hello there, link</span>"
footnotes.each do |f|
  puts f
end
# >> <span class="footnote">Hello there, link</span>
An additional problem is that the <a> tag has an invalid href URL.
link
should be:
link

ruby selenium xpath td css

I am testing a webapp using Ruby and Selenium web-driver. I have not been able to examine the contents of a cell in the displayed webpage.
What I would like to get is the IP in the td.
<td class="multi_select_column"><input name="object_ids" type="checkbox"
value="adcf0467-2756-4c02-9edd-bb83c40b8685" /></td>
<td class="sortable normal_column">Core</td>
<td class="sortable nowrap-col normal_column">r1-c4-b4</td>
<td class="sortable anchor normal_column"><a href="/horizon/admin/instances
/adcf0467-2756-4c02-9edd-bb83c40b8685/detail" class="">pg-gtmpg--675</a></td>
<td class="sortable normal_column">column_name</td><td class="sortable normal_column">
<tr
<ul>
<li>172.25.1.12</li>
</ul>
I used the Firefox addon firepath to get the Xpath of the IP.
It gives "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1]/td[6]/ul/li", which looks correct.
However I have not been able to display the IP.
Here is my test code;
#!/usr/bin/env ruby
#
# Sample Ruby script using the Selenium client API
#
require "rubygems"
require "selenium/client"
require "test/unit"
begin
  driver = Selenium::WebDriver.for(:remote, :url => "http://dog.dog.jump.acme.com:4444/wd/hub")
  driver.navigate.to "http://10.87.252.37/acme/auth/login/"
  g_user_name = driver.find_element(:id, 'id_username')
  g_user_name.send_keys("user")
  g_user_name.submit
  g_password = driver.find_element(:id, 'id_password')
  g_password.send_keys("password")
  g_password.submit
  g_instance_1 = driver.find_element(:xpath, "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1]/td[4]/a")
  puts g_instance_1.text() # <- here I can see the text
  g_instance_2 = driver.find_elements(:xpath, "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1]/td[6]/ul/li[1]")
  puts g_instance_2
  # output is <Selenium::WebDriver::Element:0x000000023c1700
  puts g_instance_2.inspect
  # output is: [#<Selenium::WebDriver::Element:0x22f3b7c6e7724d4a id="4">]
  puts g_instance_2.class
  # Output: Array
  puts g_instance_2.count
  # Output: 1
end
When there is no /a in the td it doesn't seem to work.
I have tried puts g_instance_2.text, g_instance_2.text() and many others with no success.
I must be missing something obvious, but I am not seeing it.
ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-linux] on
Linux ubuntu 3.8.0-34-generic #49~precise1-Ubuntu
I decided to try a different approach using the css selector instead of xpath.
When I insert the following css selector into the FirePath window the desired html section is selected.
g_instance_2 = driver.find_elements(:css, "table#instances tbody tr:nth-of-type(1) td:nth-of-type(6) ul li:nth-of-type(1)" )
The problem is the same as before: I don't seem to be able to access the contents of g_instance_2.
I have tried;
puts g_instance_2
g_instance_22 = [g_instance_2]
puts g_instance_22
Both return;
#<Selenium::WebDriver::Element:0x000000028a6ba8>
#<Selenium::WebDriver::Element:0x000000028a6ba8>
How can I check the value returned from the remote web-server?
Would Python be a better choice to do this?
The HTML code fragment you are trying to test is not valid HTML. It might be worth filing a bug report for it.
With the given code, the following CSS selector retrieves the <a> you want:
[href^="/horizon/admin/instances"]
Translated into: any element that has the "href" attribute starting with "/horizon/admin/instances".
For XPath this is the selector:
("//a[contains(@href,'/horizon/admin/instances')]")
Nearly the same meaning (contains matches the substring anywhere in the attribute, not only at the start), just an uglier syntax.
The problem was that I was not accessing the returned array properly.
puts g_instance_2[0].text()
works for css and xpath
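The reason indexing helps is that find_elements (plural) returns an Array, while find_element (singular) returns a single element. A minimal sketch of that difference, assuming nothing about the real page: FakeElement below is a hypothetical stand-in for Selenium::WebDriver::Element so the snippet runs without a browser or a Selenium server.

```ruby
# FakeElement is a hypothetical stand-in for Selenium::WebDriver::Element,
# used only so this sketch runs without a browser or Selenium server.
FakeElement = Struct.new(:text)

# find_elements (plural) returns an Array, even when only one node matches.
g_instance_2 = [FakeElement.new("172.25.1.12")]

# Array itself has no #text method, so the call must go to an element in it.
array_has_text = g_instance_2.respond_to?(:text)

first_ip = g_instance_2[0].text      # index into the Array first
all_ips  = g_instance_2.map(&:text)  # or collect the text of every match

puts array_has_text
puts first_ip
puts all_ips.inspect
```

If only one node is expected, calling find_element instead avoids the Array altogether, as the g_instance_1 lookup in the question already does.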

How do you traverse a HTML document, search, and skip to the next item using Nokogiri?

How do you traverse up to a certain found element and then continue to the next found item? In my example I am trying to search for the first element, grab the text, and then continue until I find the next tag or until I hit a specific tag. The reason I need to also take into account the tag is because I want to do something there.
Html
<table border=0>
<tr>
<td width=180>
<font size=+1><b>apple</b></font>
</td>
<td>Description of an apple</td>
</tr>
<tr>
<td width=180>
<font size=+1><b>banana</b></font>
</td>
<td>Description of a banana</td>
</tr>
<tr>
<td><img vspace=4 hspace=0 src="common/dot_clear.gif"></td>
</tr>
...Then this repeats itself in a similar format
Current scrape.rb
#...
document.at_css("body").traverse do |node|
#if <font> is found
#puts text in font
#else if <img> is found then
#puts img src and continue loop until end of document
end
Thank you!
Interesting. You basically want to traverse through all the children in your tree and perform some operations on basis of the nodes obtained.
So here is how we can do that:
require 'nokogiri'
require 'open-uri'
# Acquiring dummy page (URI.open needs Ruby 2.5+; older versions use plain open)
page = Nokogiri::HTML(URI.open('http://en.wikipedia.org/wiki/Ruby_%28programming_language%29'))
Now, if you want to start traversing all body elements, XPath comes to the rescue. The XPath expression //body//* returns every descendant element of body (children, grandchildren, and so on).
This returns a Nokogiri::XML::NodeSet whose members are of class Nokogiri::XML::Element:
page.xpath('//body//*')
page.xpath('//body//*').first.node_name
#=> "div"
So, you can now traverse on that array and perform your operations:
page.xpath('//body//*').each do |node|
  case node.name
  when 'div' then #do this
  when 'font' then #do that
  end
end
Something like this perhaps:
document.at_css("body").traverse do |node|
  if node.name == 'font'
    puts node.content
  elsif node.name == 'img'
    puts node.attribute("src")
  end
end

how to get src attribute of <img> tag using ruby watir

<table>
<tr>
<td>hello</td>
<td><img src="xyz.png" width="100" height="100"></td>
</tr>
</table>
tabledata.rows.each do |row|
  row.cells.each do |cell|
    puts cell.text
  end
end
puts "end"
getting output ->
hello
end
What should I do to get output like this ->
hello
xyz.png
end
without using Nokogiri.
Getting an attribute
You can get the attribute of an element using the Element#attribute_value method. For example,
element.attribute_value('attribute')
For many standard attributes, you can also do:
element.attribute
Output cell text or image text
Assuming that a cell either has text or an image:
You can iterate through the cells
Check if an image exists
Output the image src if it exists
Else output the cell text
This would look like:
tabledata.rows.each do |row|
  row.cells.each do |cell|
    if cell.image.exists?
      puts cell.image.src #or cell.image.attribute_value('src')
    else
      puts cell.text
    end
  end
end
puts "end"

Best way to parse a table in Ruby

I'd like to parse a simple table into a Ruby data structure. The table looks like this:
(Image of the table: http://img232.imageshack.us/img232/446/picture5cls.png)
Edit: Here is the HTML
and I'd like to parse it into an array of hashes. E.g.,:
schedule[0]['NEW HAVEN'] == '4:12AM'
schedule[0]['Travel Time In Minutes'] == '95'
Any thoughts on how to do this? Perl has HTML::TableExtract, which I think would do the job, but I can't find any similar library for Ruby.
You might like to try Hpricot (gem install hpricot, prepend the usual sudo for *nix systems)
I placed your HTML into input.html, then ran this:
require 'hpricot'
doc = Hpricot.XML(open('input.html'))
table = doc/:table
(table/:tr).each do |row|
  (row/:td).each do |cell|
    puts cell.inner_html
  end
end
which, for the first row, gives me
<span class="black">12:17AM </span>
<span class="black">
</span>
<span class="black">1:22AM </span>
<span class="black">
</span>
<span class="black">65</span>
<span class="black">TRANSFER AT STAMFORD (AR 1:01AM & LV 1:05AM) </span>
<span class="black">
N
</span>
So already we're down to the content of the TD tags. A little more work and you're about there.
(BTW, the HTML looks a little malformed: you have <th> tags in <tbody>, which seems a bit perverse: <tbody> is fairly pointless if it's just going to be another level within <table>. It makes much more sense if your <tr><th>...</th></tr> stuff is in a separate <thead> section within the table. But it may not be "your" HTML, of course!)
In case there isn't a library to do that for ruby, here's some code to get you started writing this yourself:
require 'nokogiri'
doc = Nokogiri("<table><tr><th>la</th><th><b>lu</b></th></tr><tr><td>lala</td><td>lulu</td></tr><tr><td><b>lila</b></td><td>lolu</td></tr></table>")
# The first row becomes the header; the remaining rows are data.
header, *rest = (doc/"tr").map do |row|
  row.children.map do |c|
    c.text
  end
end
header.map! do |str| str.to_sym end
# One Struct member per column header.
item_struct = Struct.new(*header)
table = rest.map do |row|
  item_struct.new(*row)
end
table[1].lu #=> "lolu"
This code is far from perfect, obviously, but it should get you started.
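If an array of hashes keyed by column name is preferred over a Struct, as the question asks for, one possible sketch is to pair each row with the header row once the cell text has been extracted. The rows below are made-up stand-ins for the extracted cell text (the GRAND CENTRAL column and all values except those quoted in the question are hypothetical), so this shows only the reshaping step, not the HTML parsing.

```ruby
# Cell text as it might look after extraction. The values are made up
# for illustration and are not taken from the original table.
rows = [
  ["NEW HAVEN", "GRAND CENTRAL", "Travel Time In Minutes"],
  ["4:12AM ", "5:47AM ", "95"],
  ["5:02AM ", "6:40AM ", "98"]
]

# Trim stray whitespace, then split off the header row from the data rows.
header, *body = rows.map { |row| row.map(&:strip) }

# Pair each cell with its column header to get one hash per data row.
schedule = body.map { |cells| header.zip(cells).to_h }

puts schedule[0]["NEW HAVEN"]
puts schedule[0]["Travel Time In Minutes"]
```

Unlike the Struct version, string keys allow headers with spaces such as 'Travel Time In Minutes' to be used directly.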

Resources