I am trying to find elements in a page whose class name has white space at the end:
@agent = Mechanize.new
page = @agent.get(somepage)
Where the tag is:
<div class="Example ">
When trying:
page.search('.Example')
the element is not found and when trying:
page.search('.Example ') # <- space following the name
Nokogiri raises an exception:
Nokogiri::CSS::SyntaxError: unexpected '$' after 'DESCENDANT_SELECTOR'
Your implied premise, that a class cannot be found because it contains a space, is incorrect. Class names do not include spaces. Proof:
require 'nokogiri'
html = <<End
<html>
<span class="Example ">One</span>
<span class="Example foo">Two</span>
</html>
End
doc = Nokogiri::HTML(html)
puts doc.search('.Example')
Output:
<span class="Example ">One</span>
<span class="Example foo">Two</span>
So I think your HTML document simply doesn't have a class containing Example in it. If you provided the sample HTML, this question would have been easier to answer.
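The reason `.Example` still matches `class="Example "` is that a class attribute is treated as a whitespace-separated list of names, so the trailing space never becomes part of a name. Here is a minimal pure-Ruby sketch of that matching rule. The helper `has_class?` is hypothetical and only for illustration; it is not part of Nokogiri.

```ruby
# Hypothetical helper: mimics how a CSS class selector tests an attribute value.
def has_class?(class_attr, name)
  # String#split with no arguments splits on runs of whitespace and
  # ignores leading and trailing whitespace.
  class_attr.to_s.split.include?(name)
end

puts has_class?("Example ", "Example")     # trailing space is ignored
puts has_class?("Example foo", "Example")  # one of several names
puts has_class?("Examples", "Example")     # different name, no match
```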
To find all elements having class attribute ending in whitespace:
page.search('*').select{|e| e[:class] =~ /\s$/}
If you specifically target the class attribute you can include spaces. In my case the class value had a space:
<p class="Event_CategoryTree category">
Here is how I targeted that element using Nokogiri:
page.at_css("[class='Event_CategoryTree category']")
You can use Xpath instead.
The following code will return all div elements whose class attribute is exactly "a class with spaces":
doc = Nokogiri::HTML(page)
result = doc.xpath('//div[@class="a class with spaces"]')
Related
I'm trying to use Nokogiri to get a page's full HTML but with all of the text stripped out.
I tried this:
require 'nokogiri'
x = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]").each { |a| a.children.remove }
puts y.to_s
This outputs:
<div class="example"></div>
I've also tried running it without the children.remove part:
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]")
puts y.to_s
But then I get:
<div class="example"><span>Hello</span></div>
But what I actually want is:
<html><body><div class='example'><span></span></div></body></html>
NOTE: This is a very aggressive approach. Tags like <script>, <style>, and <noscript> also have child text() nodes containing CSS, HTML, and JS that you might not want to filter out depending on your use case.
If you operate on the parsed document instead of capturing the return value of your iterator, you'll be able to remove the text nodes, and then return the document:
require 'nokogiri'
html = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
# Parse HTML
doc = Nokogiri::HTML.parse(html)
puts doc.inner_html
# => "<html> <body> <div class=\"example\"><span>Hello</span></div>\n</body>\n</html>"
# Remove text nodes from parsed document
doc.xpath("//text()").each { |t| t.remove }
puts doc.inner_html
# => "<html><body><div class=\"example\"><span></span></div></body></html>"
I would like to remove all the new line characters between
<div class="some class"> arbitrary amount of text here with possible new line characters </div>
Is this possible in Ruby?
Yes, you can easily do this using the Nokogiri gem. For example:
require "rubygems"
require "nokogiri"
html = %q!
<div class="some class"> arbitrary amount of text
here with possible
new line
characters </div>
!
doc = Nokogiri::HTML::DocumentFragment.parse(html)
div = doc.at('div')
div.inner_html = div.inner_html.gsub(/[\n\r]/, " ").strip
puts html
puts '-' * 60
puts doc.to_s
When run will output this:
<div class="some class"> arbitrary amount of text
here with possible
new line
characters </div>
------------------------------------------------------------
<div class="some class">arbitrary amount of text here with possible new line characters</div>
The document going in has a structure like this:
<span class="footnote">Hello there, link</span>
The XPath search is:
@doc = set_nokogiri(html)
footnotes = @doc.xpath(".//span[@class = 'footnote']")
footnotes.each_with_index do |footnote, index|
puts footnote
end
The above footnote becomes:
<span>Hello there, link</span>
I assume my XPath is wrong but I'm having a hard time figuring out why.
I had the wrong tag in the output and should have been more careful. The point being that the <a> tag is getting stripped but its contents are still included.
I also added the set_nokogiri line in case that's relevant.
I can't duplicate the problem:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<span class="footnote">Hello there, link</span>
EOT
footnotes = doc.xpath(".//span[@class = 'footnote']")
footnotes.to_xml # => "<span class=\"footnote\">Hello there, link</span>"
footnotes.each do |f|
puts f
end
# >> <span class="footnote">Hello there, link</span>
An additional problem is that the <a> tag has an invalid href URL.
link
should be:
link
For example:
content=Nokogiri::HTML(open(url)).at_css(".appwindow").text
This example extracts only the text from .appwindow.
How can I get this content with the <p> tags included?
I think you want to find either the full HTML of the first element that has an appwindow class, or perhaps the inner HTML. If so:
require 'nokogiri'
html = Nokogiri::HTML <<ENDHTML
<div id='menu'>menu</div>
<div class='appwindow'><p>Hello <b>World</b>!</p></div>
ENDHTML
puts html.at_css('.appwindow').text
#=> Hello World!
puts html.at_css('.appwindow').to_html
#=> <div class="appwindow"><p>Hello <b>World</b>!</p></div>
puts html.at_css('.appwindow').inner_html
#=> <p>Hello <b>World</b>!</p>
See the list of methods on Nokogiri::XML::Node for other options available to you.
Say I have:
<div class="amt" id="displayFare-1_69-61-0" style="">
<div class="per">per person</div>
<div class="per" id="showTotalSubIndex-1_69-61-0" style="">Total $334</div>
$293
</div>
I want to grab just the $334. It will always have "Total $" but the id showTotalSubIndex... will be dynamic so I can't use that.
You can use a Nokogiri XPath expression to iterate over the text nodes of all the div elements and check each string for the 'Total $' prefix, like this:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML.parse( open( "test.xml" ))
doc.xpath("//div/text()").each{ |t|
tmp = t.to_str.strip
puts tmp[7..-1] if tmp.index('Total $') == 0
}
Rather than finding the text:
html = Nokogiri::HTML(html)
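html.at_css("div.amt").element_children[1].text.gsub(/^Total /, '')
I assume here that the HTML is structured in such a way that the second child element of the div.amt tag holds the value that you're after, and then we just grab its text and gsub away the "Total " prefix. element_children is used rather than children because children also includes whitespace text nodes, which would shift the index.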
Both of these work:
require 'nokogiri'
doc = Nokogiri::XML(xml)
doc.search('//div[@id]/text()').select{ |n| n.text['Total'] }.first.text.split.last
and
doc.search('//div/text()').select{ |n| n.text['Total'] }.first.text.split.last
The difference is the first should run a bit faster if you know the div you're looking for always has an id.
If the ID always starts with "showTotalSubIndex" you could use:
doc.search('//div[starts-with(@id,"showTotalSubIndex")]').first.text.split.last
and if you know there's only going to be one in the document, you can use:
doc.at('//div[starts-with(@id,"showTotalSubIndex")]').text.split.last
EDIT:
Ryan suggests that the XML structure might be consistent. If so:
doc.at('//div[2]').text[/(\$\d+)/, 1]
:-)