Nokogiri text node contents - ruby

Is there any clean way to get the contents of text nodes with Nokogiri? Right now I'm using
some_node.at_xpath( "//whatever" ).first.content
which seems really verbose for just getting text.

You want only the text?
doc.search('//text()').map(&:text)
Maybe you don't want all the whitespace and noise. If you want only the text nodes containing a word character,
doc.search('//text()').map(&:text).delete_if{|x| x !~ /\w/}
Edit: It appears you only wanted the text content of a single node:
some_node.at_xpath( "//whatever" ).text

Just look for text nodes:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>This is a text node </p>
<p> This is another text node</p>
</body>
</html>
EOT
doc.search('//text()').each do |t|
t.replace(t.content.strip)
end
puts doc.to_html
Which outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>This is a text node</p>
<p>This is another text node</p>
</body></html>
BTW, your code example doesn't work. at_xpath finds only the first occurrence and returns a single Node rather than a NodeSet, so chaining first onto it is redundant, and the chain will fail.
I have <data><foo>bar</foo></data>; how do I get at the "bar" text without doing doc.at_xpath( "//data/foo" ).children.first.content?
Assuming doc contains the parsed DOM:
doc.to_xml # => "<?xml version=\"1.0\"?>\n<data>\n <foo>bar</foo>\n</data>\n"
Get the first occurrence:
doc.at('foo').text # => "bar"
doc.at('//foo').text # => "bar"
doc.at('/data/foo').text # => "bar"
Get all occurrences and take the first one:
doc.search('foo').first.text # => "bar"
doc.search('//foo').first.text # => "bar"
doc.search('data foo').first.text # => "bar"

Related

How to use Nokogiri to get the full HTML without any text content

I'm trying to use Nokogiri to get a page's full HTML but with all of the text stripped out.
I tried this:
require 'nokogiri'
x = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]").each { |a| a.children.remove }
puts y.to_s
This outputs:
<div class="example"></div>
I've also tried running it without the children.remove part:
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]")
puts y.to_s
But then I get:
<div class="example"><span>Hello</span></div>
But what I actually want is:
<html><body><div class='example'><span></span></div></body></html>
NOTE: This is a very aggressive approach. Tags like <script>, <style>, and <noscript> also have child text() nodes containing CSS, HTML, and JS that you might not want to filter out depending on your use case.
If you operate on the parsed document instead of capturing the return value of your iterator, you'll be able to remove the text nodes, and then return the document:
require 'nokogiri'
html = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
# Parse HTML
doc = Nokogiri::HTML.parse(html)
puts doc.inner_html
# => "<html> <body> <div class=\"example\"><span>Hello</span></div>\n</body>\n</html>"
# Remove text nodes from parsed document
doc.xpath("//text()").each { |t| t.remove }
puts doc.inner_html
# => "<html><body><div class=\"example\"><span></span></div></body></html>"

How to extract text from HTML without concatenating it using Nokogiri and Ruby

When I scrape several related nodes from HTML or XML to extract the text, all the text is joined into one long string, making it impossible to recover the individual text strings.
For instance:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
But what I want is:
["foo", "bar", "baz"]
The same happens when scraping XML:
doc = Nokogiri::XML(<<EOT)
<root>
<block>
<entries>foo</entries>
<entries>bar</entries>
<entries>baz</entries>
</block>
</root>
EOT
doc.search('entries').text # => "foobarbaz"
Why does this happen and how do I avoid it?
This is an easily solved problem that results from not reading the documentation about how text behaves when used on a NodeSet versus a Node (or Element).
The NodeSet documentation says text will:
Get the inner text of all contained Node objects
Which is what we're seeing happen with:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
because:
doc.search('p').class # => Nokogiri::XML::NodeSet
Instead, we want to get each Node and extract its text:
doc.search('p').first.class # => Nokogiri::XML::Element
doc.search('p').first.text # => "foo"
which can be done using map:
doc.search('p').map { |node| node.text } # => ["foo", "bar", "baz"]
Ruby allows us to write that more concisely using:
doc.search('p').map(&:text) # => ["foo", "bar", "baz"]
The same applies whether we're working with HTML or XML, because Nokogiri represents both with the same Node and NodeSet classes.
A Node has several aliased methods for getting at its embedded text. From the documentation:
#content ⇒ Object
Also known as: text, inner_text
Returns the contents for this Node.

How to replace outer tags using Nokogiri

Using Nokogiri, I'm trying to replace the outer tags of an HTML node where the most reliable way to detect it is through one of its children.
Before:
<div>
<div class="smallfont" >Quote:</div>
Words of wisdom
</div>
After:
<blockquote>
Words of wisdom
</blockquote>
The following code snippet detects the element I'm after, but I'm not sure how to go on from there:
doc = Nokogiri::HTML(html)
if doc.at('div.smallfont:contains("Quote:")') != nil
q = doc.parent
# replace tags of q
# remove first_sibling
end
Does this work for you?
doc = Nokogiri::HTML(html)
if quote = doc.at('div.smallfont:contains("Quote:")')
text = quote.next # gets the ' Words of wisdom'
quote.remove # removes div.smallfont
puts text.parent.replace("<blockquote>#{text}</blockquote>") # replaces wrapping div with blockquote block
end
I'd do it like this:
require 'nokogiri'
doc = Nokogiri::HTML(DATA.read)
smallfont_div = doc.at('.smallfont')
smallfont_div.parent.name = 'blockquote'
smallfont_div.remove
puts doc.to_html
__END__
<div>
<div class="smallfont" >Quote:</div>
Words of wisdom
</div>
Which results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<blockquote>
Words of wisdom
</blockquote>
</body></html>
The whitespace inside <blockquote> will be gobbled up by the browser when it's displayed, so it's usually not an issue, but some browsers will still show a leading space and/or trailing space.
If you want to clean up the text node containing "Words of wisdom" then I'd do this instead:
smallfont_div = doc.at('.smallfont')
smallfont_parent = smallfont_div.parent
smallfont_div.remove
smallfont_parent.name = 'blockquote'
smallfont_parent.content = smallfont_parent.text.strip
Which results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<blockquote>Words of wisdom</blockquote>
</body></html>
Alternately, this will generate the same result:
smallfont_div = doc.at('.smallfont')
smallfont_parent = smallfont_div.parent
smallfont_parent_content = smallfont_div.next_sibling.text
smallfont_parent.name = 'blockquote'
smallfont_parent.content = smallfont_parent_content.strip
What the code is doing should be pretty easy to figure out as Nokogiri's methods are pretty self-explanatory.

How to remove white space from HTML text

How do I remove spaces in my code? If I parse this HTML with Nokogiri:
<div class="address-thoroughfare mobile-inline-comma ng-binding">Kühlungsborner Straße
10
</div>
I get the following output:
Kühlungsborner Straße
10
which is not left-justified.
My code is:
address_street = page_detail.xpath('//div[@class="address-thoroughfare mobile-inline-comma ng-binding"]').text
Please try strip:
address_street = page_detail.xpath('//div[@class="address-thoroughfare mobile-inline-comma ng-binding"]').text.strip
Consider this:
require 'nokogiri'
doc = Nokogiri::HTML('<div class="address-thoroughfare mobile-inline-comma ng-binding">Kühlungsborner Straße
10
</div>')
doc.search('div').text
# => "Kühlungsborner Straße\n 10\n "
puts doc.search('div').text
# >> Kühlungsborner Straße
# >> 10
# >>
The given HTML doesn't replicate the problem you're having. It's really important to present valid input that duplicates the problem. Moving on....
Don't use xpath, css or search with text. You usually won't get what you expect:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div>
<span>foo</span>
<span>bar</span>
</div>
</body>
</html>
EOT
doc.search('span').class # => Nokogiri::XML::NodeSet
doc.search('span') # => [#<Nokogiri::XML::Element:0x3fdb6981bcd8 name="span" children=[#<Nokogiri::XML::Text:0x3fdb6981b5d0 "foo">]>, #<Nokogiri::XML::Element:0x3fdb6981aab8 name="span" children=[#<Nokogiri::XML::Text:0x3fdb6981a054 "bar">]>]
doc.search('span').text
# => "foobar"
Note that text returned the concatenated text of all nodes found.
Instead, walk the NodeSet and grab the individual node's text:
doc.search('span').map(&:text)
# => ["foo", "bar"]

How to get Nokogiri inner_HTML object to ignore/remove escape sequences

Currently, I am trying to get the inner HTML of an element on a page using Nokogiri. However, I'm not just getting the text of the element; I'm also getting its escape sequences. Is there a way I can suppress or remove them with Nokogiri?
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(URI.open("http://the.page.url.com"))
page.at_css("td[custom-attribute='foo']").parent.css('td').css('a').inner_html
This returns => "\r\n\t\t\t\t\t\t\t\tTheActuallyInnerContentThatIWant\r\n\t"
What is the most effective and direct Nokogiri (or Ruby) way of doing this?
page.at_css("td[custom-attribute='foo']")
.parent
.css('td')
.css('a')
.text # you want the text, not inner_html
.strip # removes leading and trailing whitespace
See String#strip.
Sidenote: css('td a') is likely more efficient than css('td').css('a').
It's important to drill in to the closest node containing the text you want. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
doc.at('body').inner_html # => "\n <p>foo</p>\n "
doc.at('body').text # => "\n foo\n "
doc.at('p').inner_html # => "foo"
doc.at('p').text # => "foo"
at, at_css and at_xpath return a Node/XML::Element. search, css and xpath return a NodeSet. There's a big difference in how text or inner_html return information when looking at a Node or NodeSet:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.at('p') # => #<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>
doc.search('p') # => [#<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>, #<Nokogiri::XML::Element:0x3fd635cf32bc name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf30dc "bar">]>]
doc.at('p').class # => Nokogiri::XML::Element
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.at('p').text # => "foo"
doc.search('p').text # => "foobar"
Notice that using search returned a NodeSet and that text returned the nodes' text concatenated together. This is rarely what you want.
Also notice that Nokogiri is smart enough to figure out whether a selector is CSS or XPath 99% of the time, so using the generic search and at for either type of selector is very convenient.

Resources