Nokogiri use of DOT '.' notation for XPATH - ruby

All my searches indicate that using './/' in XPath should start the next search at the current node. In the code below I expected to get only the first h3 element, "Top Level", and for the search to stop there. Instead I also get the second h3 tag, which is in another node entirely. What am I missing?
=begin Sample HTML contains
<DIV><DIV><DIV id ='1'><h3> Top Level </h3></DIV></DIV></DIV>
<DIV><DIV><DIV id ='2'><h3> Bottom Level </h3></DIV></DIV></DIV>
=end
require 'rubygems'
require 'nokogiri'
page = Nokogiri::HTML(open("Sample.html"))
el = page.xpath("html/body/div/div/div[@id='1']") # set position in tree
puts el.inspect
=begin
[#<Nokogiri::XML::Element:0x1990410 name="div" attributes=[#<Nokogiri::XML::Attr:0x1990200 name="id" value="1">]
children=[#<Nokogiri::XML::Text:0x197dda4 " \r\n\t\t\t">, #<Nokogiri::XML::Element:0x197dcc0 name="h3"
children=[#<Nokogiri::XML::Text:0x197da74 " Top Level ">]>, #<Nokogiri::XML::Text:0x197d768 "\r\n
=end
el = page.xpath(".//h3")
puts el.inspect
=begin
[ #<Nokogiri::XML::Element:0x197dcc0 name="h3" children=[#<Nokogiri::XML::Text:0x197da74 " Top Level ">]>,
#<Nokogiri::XML::Element:0x197c37c name="h3" children=[#<Nokogiri::XML::Text:0x197c190 " Bottom Level ">]>]
=end

Related

Nokogiri: non-destructively print node without its children

I have a piece of ruby code to replace the value of an attribute:
# -*- coding: utf-8 -*-
require "nokogiri"
xml = <<-eos
<a blubb="blah">
<b>irrelevant</b>
<b>also irrelevant</b>
<b blubb="blah">
<c>irrelevant</c>
<c>irrelevant</c>
</b>
<b blubb="foo">
<c>irrelevant</c>
<c>irrelevant</c>
</b>
</a>
eos
doc = Nokogiri::XML(xml) { |config| config.noent }
doc.xpath("//*[@blubb='blah']").each {|node|
puts "Node before:\n#{node.to_s}" ## replace me!
node['blubb'] = "NEW"
puts "Node after:\n#{node.to_s}" ## replace me!
}
When I execute this code, I get the whole node element printed, but I only need to see the start tag to confirm that my script works correctly. Is there a way to display only the start tag of a node, or at least only the element itself without its child nodes? The important thing is that the node itself doesn't change when printed (besides the replacement in the attribute), so removing the children is not an option!
You can print the node's name and its attribute_nodes instead:
doc.xpath("//*[@blubb='blah']").each {|node|
puts "Node before:\n #{node.name} "+node.attribute_nodes.reduce('') { |out, n| out+="#{n.name}='#{n.value}'"}
node['blubb'] = "NEW"
puts "Node after:\n #{node.name} "+node.attribute_nodes.reduce('') { |out, n| out+="#{n.name}='#{n.value}'"}
}

Nokogiri children method

I have the following XML here:
<listing>
<seller_info>
<payment_types>Visa, Mastercard, , , , 0, Discover, American Express </payment_types>
<shipping_info>siteonly, Buyer Pays Shipping Costs </shipping_info>
<buyer_protection_info/>
<auction_info>
<bid_history>
<item_info>
</listing>
The following code works fine for displaying first child of the first //listing node:
require 'nokogiri'
require 'open-uri'
html_data = URI.open('http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/321gone.xml')
nokogiri_object = Nokogiri::XML(html_data)
listing_elements = nokogiri_object.xpath("//listing")
puts listing_elements[0].children[1]
This also works:
puts listing_elements[0].children[3]
I tried to access the second node <payment_types> with the following code:
puts listing_elements[0].children[2]
but a blank line was displayed. Looking through Firebug, it is clearly the 2nd child of the listing node. In general, only odd numbers work with the children method.
Is this a bug in Nokogiri? Any thoughts?
It's not a bug; it's the whitespace text nodes created when parsing strings that contain "\n" (or empty nodes). You can use the noblanks option to avoid them:
nokogiri_object = Nokogiri::XML(html_data) { |conf| conf.noblanks }
Use that and you will have no blanks in your array.
The problem is you are not parsing the document correctly. children returns more than you think, and its use is painting you into a corner.
Here's a simplified example of how I'd do it:
require 'nokogiri'
doc = Nokogiri::XML(DATA.read)
auctions = doc.search('listing').map do |listing|
seller_info = listing.at('seller_info')
auction_info = listing.at('auction_info')
hash = [:seller_name, :seller_rating].each_with_object({}) do |s, h|
h[s] = seller_info.at(s.to_s).text.strip
end
[:current_bid, :time_left].each do |s|
hash[s] = auction_info.at(s.to_s).text.strip
end
hash
end
__END__
<?xml version='1.0' ?>
<!DOCTYPE root SYSTEM "http://www.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/321gone.dtd">
<root>
<listing>
<seller_info>
<seller_name>537_sb_3 </seller_name>
<seller_rating> 0</seller_rating>
</seller_info>
<auction_info>
<current_bid> $839.93</current_bid>
<time_left> 1 Day, 6 Hrs</time_left>
</auction_info>
</listing>
<listing>
<seller_info>
<seller_name> lapro8</seller_name>
<seller_rating> 0</seller_rating>
</seller_info>
<auction_info>
<current_bid> $210.00</current_bid>
<time_left> 4 Days, 21 Hrs</time_left>
</auction_info>
</listing>
</root>
After running, auctions will be:
auctions
# => [{:seller_name=>"537_sb_3",
# :seller_rating=>"0",
# :current_bid=>"$839.93",
# :time_left=>"1 Day, 6 Hrs"},
# {:seller_name=>"lapro8",
# :seller_rating=>"0",
# :current_bid=>"$210.00",
# :time_left=>"4 Days, 21 Hrs"}]
Notice there are no empty text nodes to deal with because I told Nokogiri exactly which nodes to grab text from. You should be able to extend the code to grab any information you want easily.
A typically formatted XML or HTML document that displays nesting or indentation uses text nodes to provide that indenting:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
Here's what your code is seeing:
doc.at('body').children.map(&:to_html)
# => ["\n" +
# " ", "<p>foo</p>", "\n" +
# " "]
The Text nodes are what are confusing you:
doc.at('body').children.first.class # => Nokogiri::XML::Text
doc.at('body').children.first.text # => "\n "
If you don't drill down far enough you will pick up the Text nodes and have to clean up the results:
doc.at('body')
.text # => "\n foo\n "
.strip # => "foo"
Instead, explicitly find the node you want and extract the information:
doc.at('body p').text # => "foo"
In the suggested code above I used strip because the incoming XML had spaces surrounding some text:
h[s] = seller_info.at(s.to_s).text.strip
which is the result of the original XML creation code not cleaning the lines prior to generating the XML. So sometimes we have to clean up their mess, but the proper accessing of the node can reduce that a lot.
The problem is that children includes text nodes, such as the whitespace between elements. If you use element_children instead, you get just the child elements (i.e. the elements themselves, not the whitespace between them).

How can I perform an action based on the contents of a div with Selenium Webdriver?

I have a Ruby application using Selenium Webdriver and Nokogiri. I want to choose a class, and then for each div corresponding to that class, I want to perform an action based on the contents of the div.
For example, I'm parsing the following page:
https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=puppies
It's a page of search results, and I'm looking for the first result with the word "Adoption" in the description. So the bot should look for divs with className: "result", for each one check if its .description div contains the word "adoption", and if it does, click on the .link div. In other words, if the .description does not include that word, then the bot moves on to the next .result.
This is what I have so far, which just clicks on the first result:
require "selenium-webdriver"
require "nokogiri"
driver = Selenium::WebDriver.for :chrome
driver.navigate.to "https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=puppies"
driver.find_element(:class, "link").click
You can get a list of the elements that contain "adopt" or "Adopt" with XPath by using contains(), then use the union operator (|) to combine the results for "adopt" and "Adopt". See the code below:
driver = Selenium::WebDriver.for :chrome
driver.navigate.to "https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=puppies"
sleep 5
items = driver.find_elements(:xpath,"//div[@class='g']/div[contains(.,'Adopt')]/h3/a|//div[@class='g']/div[contains(.,'adopt')]/h3/a")
for element in items
linkText = element.text
print linkText
element.click
end
The pattern to handle each iteration will be determined by the type of action executed on each item. If the action is a click, then you can't list all the links to click on each of them since the first click will load a new page, making the elements list obsolete.
So if you wish to click on each link, one way is to use an XPath containing the position of the link for each iteration:
# iteration 1
driver.find_element(:xpath, "(//h3[@class='r']/a)[1]").click # click first link
# iteration 2
driver.find_element(:xpath, "(//h3[@class='r']/a)[2]").click # click second link
Here is an example that clicks on each link from a result page:
require 'selenium-webdriver'
driver = Selenium::WebDriver.for :chrome
wait = Selenium::WebDriver::Wait.new(timeout: 10) # timeout is in seconds
driver.navigate.to "https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=puppies"
# define the xpath
search_word = "Puppies"
xpath = ("(//h3[@class='r']/a[contains(.,'%s')]" % search_word) + ")[%s]"
# iterate each result by inserting the position in the XPath
i = 0
while true do
# wait for the results to be loaded
wait.until {driver.find_elements(:xpath, "(//h3[@class='r']/a)[1]").any?}
# get the next link
link = driver.find_elements(:xpath, xpath % [i+=1]).first
break if !link
# click the link
link.click
# wait for a new page
wait.until {driver.find_elements(:xpath, "(//h3[@class='r']/a)[1]").empty?}
# handle the new page
puts "Page #{i}: " + driver.title
# return to the main page
driver.navigate.back
end
puts "The end!"
I don't code in Ruby, but one way you could do it in Python is with:
driver.find_elements
Notice that "elements" is plural. I would grab all the links and put them into a list, like:
href = [a.get_attribute("href") for a in driver.find_elements_by_xpath("//div[@class='rc']/h3/a")]
Then get all of the descriptions the same way. Loop over every description, and if a description contains the word "Adoption", navigate to the matching website.
For example:
if description[6] contains the word "adoption", look up href[6] and navigate to it.
I hope that makes sense!

How do I handle control flow and nil objects better in Ruby

I have this script that is part of a bigger one. I have three different XML files that look a little different from each other, and I need some type of control structure to handle nil objects and XPath expressions better.
The script that I have right now, outputs nil objects:
require 'open-uri'
require 'rexml/document'
include REXML
@urls = Array.new()
@urls << "http://testnavet.skolverket.se/SusaNavExport/EmilObjectExporter?id=186956355&strId=info.uh.kau.KTADY1&EMILVersion=1.1"
@urls << "http://testnavet.skolverket.se/SusaNavExport/EmilObjectExporter?id=184594606&strId=info.uh.gu.GS5&EMILVersion=1.1"
@urls << "http://testnavet.skolverket.se/SusaNavExport/EmilObjectExporter?id=185978100&strId=info.uh.su.ARO720&EMILVersion=1.1"
@urls.each do |url|
doc = REXML::Document.new(URI.open(url).read)
doc.elements.each("/educationInfo/extensionInfo/nya:textualDescription/nya:textualDescriptionPhrase | /ns:educationInfo/ns:extensionInfo/gu:guInfoExtensions/gu:guSubject/gu:descriptions/gu:description | //*[name()='ct:text']"){
|e| m = e.text
m.gsub!(/<.+?>/, "")
puts "Description: " + m
puts ""
}
end
OUTPUT:
Description: bestrykning, kalandrering, tryckning, kemiteknik
Description: Vill du jobba med internationella och globala frågor med...
Description: The study of globalisation is becoming ever more
important for our understanding of today´s world and the School of
Global Studies is a unique environment for research.
Description:
Description:
Description: Kursen behandlar identifieringen och beskrivningen av
sjukliga förändringar i mänskliga skelett. Kursen ger en
ämneshistorisk bakgrund och skelettförändringars förhållanden till
moderna kliniska data diskuteras.
See this post on how to skip over entries when using a block in Ruby. The method each() on doc.elements is being called with a block (which is your code containing the gsub and puts calls). The "next" keyword lets you stop executing the block for the current element and move on to the next one.
doc.elements.each("/educationInfo/extensionInfo/nya:textualDescription/nya:textualDescriptionPhrase | /ns:educationInfo/ns:extensionInfo/gu:guInfoExtensions/gu:guSubject/gu:descriptions/gu:description | //*[name()='ct:text']"){
|e| m = e.text
m.gsub!(/<.+?>/, "")
next if m.empty?
puts "Description: " + m
puts ""
}
We know that "m" is a string (and not nil) when using the "next" keyword because we just called gsub! on it, which did not throw an error when executing that line. That means the blank Descriptions are caused by empty strings, not nil objects.
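The skip pattern can be tried without fetching the XML. This is a minimal sketch, assuming raw_texts stands in for the e.text values from REXML. The sample strings are made up, and the nil entry and the to_s guard are an extra precaution for elements that have no text at all:

```ruby
# Hypothetical stand-ins for the e.text values returned for each element.
raw_texts = ["<p>Kursen behandlar ...</p>", "<br/>", nil, "  ", "Vill du jobba ..."]

descriptions = []
raw_texts.each do |text|
  # to_s turns nil into "", so gsub is always called on a String.
  m = text.to_s.gsub(/<.+?>/, "")
  # Skip this element when nothing but whitespace is left.
  next if m.strip.empty?
  descriptions << m
  puts "Description: " + m
  puts ""
end
```

Only the two entries that still contain text after the tags are stripped reach the puts calls.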

Get text of a paragraph with all the markup (and their content) removed

How can I get only the text of a <p> node that has other tags in it, like:
<p>hello my website is <a>click here</a> <b>test</b></p>
I only want "hello my website is"
This is what I tried:
begin
node = html_doc.css('p')
node.each do |node|
node.children.remove
end
return (node.nil?) ? '' : node.text
rescue
return ''
end
Update 2: All right. You are removing all children with node.children.remove, including the text nodes. A possible solution might look like:
# 1. select all <p> nodes
doc.css('p').
# 2. map children, and flatten
map { |node| node.children }.flatten.
# 3. select text nodes only
select { |node| node.text? }.
# 4. get text and join
map { |node| node.text }.join(' ').strip
This sample returns "hello my website is", but note that doc.css('p') also finds <p> tags within <p> tags.
Update: sorry, misread your question, you only want "hello my website is", see solution above, original answer:
Not directly with nokogiri, but the sanitize gem might be an option: https://github.com/rgrove/sanitize/
Sanitize.clean(html, {}) # => " hello my website is click here test "
FYI, it uses nokogiri internally.
Your test case did not include any interesting text interleaved with the markup.
If you want to turn <p>Hello <b>World</b>!</p> into "Hello !", then removing the children is one way to do it. Simpler (and less destructive) is to just find all the text nodes and join them:
require 'nokogiri'
html = Nokogiri::HTML('<p>Hello <b>World</b>!</p>')
# Find the first paragraph (in this case the only one)
para = html.at('p')
# Find all the text nodes that are children (not descendants),
# change them from nodes into the strings of text they contain,
# and then smush the results together into one big string.
p para.search('text()').map(&:text).join
#=> "Hello !"
If you want to turn <p>Hello <b>World</b>!</p> into "Hello " (no exclamation point) then you can simply do:
p para.children.first.text # if you know that text is the first child
p para.at('text()').text # if you want to find the first text node
As @Iwe showed, you can use the String#strip method to remove leading/trailing whitespace from the result, if you like.
There's a different way to go about this. Rather than bother with removing nodes, remove the text that those nodes contain:
require 'nokogiri'
doc = Nokogiri::HTML('<p>hello my website is <a>click here</a> <b>test</b></p>')
text = doc.search('p').map{ |p|
p_text = p.text
a_text = p.at('a').text
p_text[a_text] = ''
p_text
}
puts text
>>hello my website is test
This is a simple example, but the idea is to find the <p> tags, then scan inside those for the tags that contain the text you don't want. For each of those unwanted tags, grab their text and delete it from the surrounding text.
In the sample code, you'd have a list of undesirable nodes at the a_text assignment, loop over them, and iteratively remove the text, like so:
text = doc.search('p').map{ |p|
p_text = p.text
%w[a].each do |bad_nodes|
bad_nodes_text = p.at(bad_nodes).text
p_text[bad_nodes_text] = ''
end
p_text
}
You get back text which is an array of the tweaked text contents of the <p> nodes.

Resources