How to select nodes by matching text - ruby

If I have a bunch of elements like:
<p>A paragraph <ul><li>Item 1</li><li>Apple</li><li>Orange</li></ul></p>
Is there a built-in method in Nokogiri that would get me all p elements that contain the text "Apple"? (The example element above would match, for instance).

Nokogiri can do this (now) using jQuery extensions to CSS:
require 'nokogiri'
html = '
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
'
doc = Nokogiri::HTML(html)
doc.at('p:contains("bar")').text.strip
=> "bar"

Here is an XPath that works:
require 'nokogiri'
doc = Nokogiri::HTML(DATA)
p doc.xpath('//li[contains(text(), "Apple")]')
__END__
<p>A paragraph <ul><li>Item 1</li><li>Apple</li><li>Orange</li></ul></p>

You can also do this very easily with Nikkou:
doc.search('p').text_includes('bar')

Try using this XPath:
p = doc.xpath('//p[//*[contains(text(), "Apple")]]')

Related

How to use Nokogiri to get the full HTML without any text content

I'm trying to use Nokogiri to get a page's full HTML but with all of the text stripped out.
I tried this:
require 'nokogiri'
x = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]").each { |a| a.children.remove }
puts y.to_s
This outputs:
<div class="example"></div>
I've also tried running it without the children.remove part:
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]")
puts y.to_s
But then I get:
<div class="example"><span>Hello</span></div>
But what I actually want is:
<html><body><div class='example'><span></span></div></body></html>
NOTE: This is a very aggressive approach. Tags like <script>, <style>, and <noscript> also have child text() nodes containing CSS, HTML, and JS that you might not want to filter out depending on your use case.
If you operate on the parsed document instead of capturing the return value of your iterator, you'll be able to remove the text nodes, and then return the document:
require 'nokogiri'
html = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
# Parse HTML
doc = Nokogiri::HTML.parse(html)
puts doc.inner_html
# => "<html> <body> <div class=\"example\"><span>Hello</span></div>\n</body>\n</html>"
# Remove text nodes from parsed document
doc.xpath("//text()").each { |t| t.remove }
puts doc.inner_html
# => "<html><body><div class=\"example\"><span></span></div></body></html>"

How to get Nokogiri inner_HTML object to ignore/remove escape sequences

Currently, I am trying to get the inner HTML of an element on a page using nokogiri. However I'm not just getting the text of the element, I'm also getting its escape sequences. Is there a way i can suppress or remove them with nokogiri?
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open("http://the.page.url.com"))
page.at_css("td[custom-attribute='foo']").parent.css('td').css('a').inner_html
this returns => "\r\n\t\t\t\t\t\t\t\tTheActuallyInnerContentThatIWant\r\n\t"
What is the most effective and direct nokogiri (or ruby) way of doing this?
page.at_css("td[custom-attribute='foo']")
.parent
.css('td')
.css('a')
.text # since you need a text, not inner_html
.strip # this will strip a result
String#strip.
Sidenote: css('td a') is likely more efficient than css('td').css('a').
It's important to drill in to the closest node containing the text you want. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
doc.at('body').inner_html # => "\n <p>foo</p>\n "
doc.at('body').text # => "\n foo\n "
doc.at('p').inner_html # => "foo"
doc.at('p').text # => "foo"
at, at_css and at_xpath return a Node/XML::Element. search, css and xpath return a NodeSet. There's a big difference in how text or inner_html return information when looking at a Node or NodeSet:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.at('p') # => #<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>
doc.search('p') # => [#<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>, #<Nokogiri::XML::Element:0x3fd635cf32bc name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf30dc "bar">]>]
doc.at('p').class # => Nokogiri::XML::Element
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.at('p').text # => "foo"
doc.search('p').text # => "foobar"
Notice that using search returned a NodeSet and that text returned the node's text concatenated together. This is rarely what you want.
Also notice that Nokogiri is smart enough to figure out whether a selector is CSS or XPath 99% of the time, so using the generic search and at for either type of selector is very convenient.

Parsing webpage with some html tags using Nokogiri

For example:
content=Nokogiri::HTML(open(url)).at_css(".appwindow").text
This example parse text from .appwindow (only text).
How can I parse this text with <p> tag?
I think you want to find either the full HTML of the first element that has an appwindow class, or perhaps the inner HTML. If so:
require 'nokogiri'
html = Nokogiri::HTML <<ENDHTML
<div id='menu'>menu</div>
<div class='appwindow'><p>Hello <b>World</b>!</p></div>
ENDHTML
puts html.at_css('.appwindow').text
#=> Hello World!
puts html.at_css('.appwindow').to_html
#=> <div class="appwindow"><p>Hello <b>World</b>!</p></div>
puts html.at_css('.appwindow').inner_html
#=> <p>Hello <b>World</b>!</p>
See the list of methods on Nokogiri::XML::Node for other options available to you.

How do I select IDs using xpath in Nokogiri?

Using this code:
doc = Nokogiri::HTML(open("text.html"))
doc.xpath("//span[#id='startsWith_']").remove
I would like to select every span#id starting with 'startsWith_' and remove it. I tried searching, but failed.
Here's an example:
require 'nokogiri'
html = '
<html>
<body>
<span id="doesnt_start_with">foo</span>
<span id="startsWith_bar">bar</span>
</body>
</html>'
doc = Nokogiri::HTML(html)
p doc.search('//span[starts-with(#id, "startsWith_")]').to_xml
That's how to select them.
doc.search('//span[starts-with(#id, "startsWith_")]').each do |n|
n.remove
end
That's how to remove them.
p doc.to_xml
# >> "<span id=\"startsWith_bar\">bar</span>"
# >> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n <span id=\"doesnt_start_with\">foo</span>\n \n</body></html>\n"
The page "XPath, XQuery, and XSLT Functions" has a list of the available functions.
Try this xpath expression:
//span[starts-with(#id, 'startsWith_')]

how to find all the child nodes inside the matched elements (including text nodes)?

in jquery its quite simple
for instance
$("br").parent().contents().each(function() {
but for nokogiri, xpath,
its not working out quite well
var = doc.xpath('//br/following-sibling::text()|//br/preceding-sibling::text()').map do |fruit| fruit.to_s.strip end
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(DATA.read)
fruits = doc.xpath('//br/../text()').map { |text| text.content.strip }
p fruits
__END__
<html>
<body>
<div>
apple<br>
banana<br>
cherry<br>
orange<br>
</div>
</body>
I'm not familiar with nokogiri, but are you trying to find all the children of any element that contains a <br/>? If so, then try:
//*[br]/node()
In any case, using text() will only match text nodes, and not any sibling elements, which may or may not be what you want. If you actually only want text nodes, then
//*[br]/text()
should do the trick.

Resources