What are some examples of using Nokogiri?

I am trying to understand Nokogiri. Does anyone have a link to a basic example of a Nokogiri parse/scrape showing the resultant tree? I think it would really help my understanding.

Using IRB and Ruby 1.9.2:
Load Nokogiri:
> require 'nokogiri'
#=> true
Parse a document:
> doc = Nokogiri::HTML('<html><body><p>foobar</p></body></html>')
#=> #<Nokogiri::HTML::Document:0x1012821a0
#node_cache = [],
attr_accessor :errors = [],
attr_reader :decorators = nil
Nokogiri likes well formed docs. Note that it added the DOCTYPE because I parsed as a document. It's possible to parse as a document fragment too, but that is pretty specialized.
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foobar</p></body></html>\n"
Search the document to find the first <p> node using CSS and grab its content:
> doc.at('p').text
#=> "foobar"
Use a different method name to do the same thing:
> doc.at('p').content
#=> "foobar"
Search the document for all <p> nodes inside the <body> tag, and grab the content of the first one. search returns a NodeSet, which is like an array of nodes.
> doc.search('body p').first.text
#=> "foobar"
This is an important point, and one that trips up almost everyone when first using Nokogiri. search and its css and xpath variants return a NodeSet. NodeSet.text or content concatenates the text of all the returned nodes into a single String which can make it very difficult to take apart again.
Using a little different HTML helps illustrate this:
> doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
> puts doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>foo</p>
<p>bar</p>
</body></html>
> doc.search('p').text
#=> "foobar"
> doc.search('p').map(&:text)
#=> ["foo", "bar"]
Returning to the original HTML (the one-paragraph document parsed at the start)...
Change the content of the node:
> doc.at('p').content = 'bar'
#=> "bar"
Emit a parsed document as HTML:
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>bar</p></body></html>\n"
Remove a node:
> doc.at('p').remove
#=> #<Nokogiri::XML::Element:0x80939178 name="p" children=[#<Nokogiri::XML::Text:0x8091a624 "bar">]>
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body></body></html>\n"
As for scraping, there are a lot of questions on SO about using Nokogiri for tearing apart HTML from sites. Searching StackOverflow for "nokogiri and open-uri" should help.

Related

How to replace outer tags using Nokogiri

Using Nokogiri, I'm trying to replace the outer tags of an HTML node where the most reliable way to detect it is through one of its children.
Before:
<div>
<div class="smallfont" >Quote:</div>
Words of wisdom
</div>
After:
<blockquote>
Words of wisdom
</blockquote>
The following code snippet detects the element I'm after, but I'm not sure how to go on from there:
doc = Nokogiri::HTML(html)
if doc.at('div.smallfont:contains("Quote:")') != nil
  q = doc.parent
  # replace tags of q
  # remove first_sibling
end

Does this work for you?
doc = Nokogiri::HTML(html)
if quote = doc.at('div.smallfont:contains("Quote:")')
  text = quote.next # the text node following the label, ' Words of wisdom'
  quote.remove # removes div.smallfont
  puts text.parent.replace("<blockquote>#{text}</blockquote>") # replaces the wrapping div with a blockquote
end

I'd do it like this:
require 'nokogiri'
doc = Nokogiri::HTML(DATA.read)
smallfont_div = doc.at('.smallfont')
smallfont_div.parent.name = 'blockquote'
smallfont_div.remove
puts doc.to_html
__END__
<div>
<div class="smallfont" >Quote:</div>
Words of wisdom
</div>
Which results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<blockquote>
Words of wisdom
</blockquote>
</body></html>
The whitespace inside <blockquote> will be gobbled up by the browser when it's displayed, so it's usually not an issue, but some browsers will still show a leading space and/or trailing space.
If you want to clean up the text node containing "Words of wisdom" then I'd do this instead:
smallfont_div = doc.at('.smallfont')
smallfont_parent = smallfont_div.parent
smallfont_div.remove
smallfont_parent.name = 'blockquote'
smallfont_parent.content = smallfont_parent.text.strip
Which results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<blockquote>Words of wisdom</blockquote>
</body></html>
Alternately, this will generate the same result:
smallfont_div = doc.at('.smallfont')
smallfont_parent = smallfont_div.parent
smallfont_parent_content = smallfont_div.next_sibling.text
smallfont_parent.name = 'blockquote'
smallfont_parent.content = smallfont_parent_content.strip
What the code is doing should be pretty easy to figure out as Nokogiri's methods are pretty self-explanatory.

Ruby: regex to remove tags if attributes don't have allowed values

I have such a text:
click here!blah-blah-some-text-here-blahclick here!
What's the correct way to remove all <a></a> tags (and everything inside them) if the href in <a href= does NOT contain some-good-website?

A possible solution using Nokogiri:
require 'nokogiri'
TEST = 'click here!blah-blah-some-text-here-blahclick here!'
page = Nokogiri::HTML(TEST)
links = page.css("a") # parse all <a></a> elements from content
links.each do |link|
  if link['href'] =~ /http:\/\/www\.i-am-hacker\.com\/blah/
    link.remove
  end
end
puts page # output content for debugging
Output
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>blah-blah-some-text-here-blahclick here!
</body></html>
Useful resource: http://ruby.bastardsbook.com/chapters/html-parsing/
This site helped me a lot in understanding how to use Nokogiri.
If you need to install nokogiri, you can do that by using the following command:
gem install nokogiri

Nokogiri text node contents

Is there any clean way to get the contents of text nodes with Nokogiri? Right now I'm using
some_node.at_xpath( "//whatever" ).first.content
which seems really verbose for just getting text.

You want only the text?
doc.search('//text()').map(&:text)
Maybe you don't want all the whitespace and noise. If you want only the text nodes containing a word character,
doc.search('//text()').map(&:text).delete_if{|x| x !~ /\w/}
Edit: It appears you only wanted the text content of a single node:
some_node.at_xpath( "//whatever" ).text

Just look for text nodes:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>This is a text node </p>
<p> This is another text node</p>
</body>
</html>
EOT
doc.search('//text()').each do |t|
  t.replace(t.content.strip)
end
puts doc.to_html
Which outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>This is a text node</p>
<p>This is another text node</p>
</body></html>
BTW, your code example doesn't work. In at_xpath( "//whatever" ).first, the call to first is redundant and will fail: at_xpath finds only the first occurrence and returns a single Node rather than a NodeSet, so there is no list of nodes to take the first of.
I have <data><foo>bar</foo></data>, how do I get at the "bar" text without doing doc.at_xpath( "//data/foo" ).children.first.content?
Assuming doc contains the parsed DOM:
doc.to_xml # => "<?xml version=\"1.0\"?>\n<data>\n <foo>bar</foo>\n</data>\n"
Get the first occurrence:
doc.at('foo').text # => "bar"
doc.at('//foo').text # => "bar"
doc.at('/data/foo').text # => "bar"
Get all occurrences and take the first one:
doc.search('foo').first.text # => "bar"
doc.search('//foo').first.text # => "bar"
doc.search('data foo').first.text # => "bar"

Nokogiri equivalent of Hpricot's html method

Hpricot's html method spits out just the HTML in the document:
> Hpricot('<p>a</p>').html
=> "<p>a</p>"
By contrast, the closest I can come with Nokogiri is the inner_html method, which wraps its output in <html> and <body> tags:
> Nokogiri.HTML('<p>a</p>').inner_html
=> "<html><body><p>a</p></body></html>"
How can I get the behavior of Hpricot's html method with Nokogiri? I.e., I want this:
> Nokogiri.HTML('<p>a</p>').some_method_i_dont_know_about
=> "<p>a</p>"

How about:
require 'nokogiri'
puts Nokogiri.HTML('<p>a</p>').to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>a</p></body></html>
If you don't want Nokogiri to create an HTML document, then you can tell it to parse it as a document fragment:
puts Nokogiri::HTML::DocumentFragment.parse('<p>a</p>').to_html
# >> <p>a</p>
In either case, the to_html method returns the HTML version of the document.

> Nokogiri.HTML('<p>a</p>').xpath('/html/body').inner_html
=> "<p>a</p>"

Preserve structure of an HTML page, removing all text nodes

I want to remove all text from an HTML page that I load with Nokogiri. For example, if a page has the following:
<body><script>var x = 10;</script><div>Hello</div><div><h1>Hi</h1></div></body>
I want to process it with Nokogiri and return HTML like the following, with the text stripped:
<body><script>var x = 10;</script><div></div><div><h1></h1></div></body>
(That is, remove the actual h1 text, text between divs, text in p elements etc, but keep the tags. Also, don't remove text in the script tags.)

require 'nokogiri'
html = "<body><script>var x = 10;</script><div>Hello</div><div><h1>Hi</h1></div></body>"
hdoc = Nokogiri::HTML(html)
hdoc.xpath( '//*[text()]' ).each do |el|
  el.content = '' unless el.name == "script"
end
puts hdoc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <script>var x = 10;</script><div></div>
#=> <div><h1></h1></div>
#=> </body></html>
Warning: As you did not specify how to handle a case like <div>foo<h1>bar</h1></div> the above may or may not do what you expect. Alternatively, the following may match your needs:
hdoc.xpath( '//text()' ).each do |el|
  el.remove unless el.parent.name == "script"
end
Update
Here's a more elegant solution using a single xpath to select all text nodes not part of a <script> element. I've added more text nodes to show how it handles them.
require 'nokogiri'
hdoc = Nokogiri::HTML <<ENDHTML
<body>
<script>var x = 10;</script>
<div>Hello</div>
<div>foo<h1>Hi</h1>bar</div>
</body>
ENDHTML
hdoc.xpath( '//text()[not(parent::script)]' ).each{ |text| text.remove }
puts hdoc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <script>var x = 10;</script><div></div>
#=> <div><h1></h1></div>
#=> </body></html>
In Ruby 1.9 and later, the key line can be written more simply:
hdoc.xpath( '//text()[not(parent::script)]' ).each(&:remove)
