I am trying to understand Nokogiri. Does anyone have a link to a basic example of Nokogiri parse/scrape showing the resultant tree. Think it would really help my understanding.
Using IRB and Ruby 1.9.2:
Load Nokogiri:
> require 'nokogiri'
#=> true
Parse a document:
> doc = Nokogiri::HTML('<html><body><p>foobar</p></body></html>')
#=> #<Nokogiri::HTML::Document:0x1012821a0
#node_cache = [],
attr_accessor :errors = [],
attr_reader :decorators = nil
Nokogiri likes well formed docs. Note that it added the DOCTYPE because I parsed as a document. It's possible to parse as a document fragment too, but that is pretty specialized.
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foobar</p></body></html>\n"
Search the document to find the first <p> node using CSS and grab its content:
> doc.at('p').text
#=> "foobar"
Use a different method name to do the same thing:
> doc.at('p').content
#=> "foobar"
Search the document for all <p> nodes inside the <body> tag, and grab the content of the first one. search returns a nodeset, which is like an array of nodes.
> doc.search('body p').first.text
#=> "foobar"
This is an important point, and one that trips up almost everyone when first using Nokogiri. search and its css and xpath variants return a NodeSet. NodeSet.text or content concatenates the text of all the returned nodes into a single String which can make it very difficult to take apart again.
Using a little different HTML helps illustrate this:
> doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
> puts doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>foo</p>
<p>bar</p>
</body></html>
> doc.search('p').text
#=> "foobar"
> doc.search('p').map(&:text)
#=> ["foo", "bar"]
Returning back to the original HTML...
Change the content of the node:
> doc.at('p').content = 'bar'
#=> "bar"
Emit a parsed document as HTML:
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>bar</p></body></html>\n"
Remove a node:
> doc.at('p').remove
#=> #<Nokogiri::XML::Element:0x80939178 name="p" children=[#<Nokogiri::XML::Text:0x8091a624 "bar">]>
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body></body></html>\n"
As for scraping, there are a lot of questions on SO about using Nokogiri for tearing apart HTML from sites. Searching StackOverflow for "nokogiri and open-uri" should help.
Related
Using Nokogiri, I'm trying to replace the outer tags of a HTML node where the most reliable way to detect it is through one of its children.
Before:
<div>
<div class="smallfont" >Quote:</div>
Words of wisdom
</div>
After:
<blockquote>
Words of wisdom
</blockquote>
The following code snippet detects the element I'm after, but I'm not sure how to go on from there:
doc = Nokogiri::HTML(html)
if doc.at('div.smallfont:contains("Quote:")') != nil
q = doc.parent
# replace tags of q
# remove first_sibling
end
Does it work ok?
doc = Nokogiri::HTML(html)
if quote = doc.at('div.smallfont:contains("Quote:")')
text = quote.next # gets the ' Words of wisdom'
quote.remove # removes div.smallfont
puts text.parent.replace("<blockquote>#{text}</blockquote>") # replaces wrapping div with blockquote block
end
I'd do it like this:
require 'nokogiri'
doc = Nokogiri::HTML(DATA.read)
smallfont_div = doc.at('.smallfont')
smallfont_div.parent.name = 'blockquote'
smallfont_div.remove
puts doc.to_html
__END__
<div>
<div class="smallfont" >Quote:</div>
Words of wisdom
</div>
Which results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<blockquote>
Words of wisdom
</blockquote>
</body></html>
The whitespace inside <blockquote> will be gobbled up by the browser when it's displayed, so it's usually not an issue, but some browsers will still show a leading space and/or trailing space.
If you want to cleanup the text node containing "Words of wisdom" then I'd do this instead:
smallfont_div = doc.at('.smallfont')
smallfont_parent = smallfont_div.parent
smallfont_div.remove
smallfont_parent.name = 'blockquote'
smallfont_parent.content = smallfont_parent.text.strip
Which results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<blockquote>Words of wisdom</blockquote>
</body></html>
Alternately, this will generate the same result:
smallfont_div = doc.at('.smallfont')
smallfont_parent = smallfont_div.parent
smallfont_parent_content = smallfont_div.next_sibling.text
smallfont_parent.name = 'blockquote'
smallfont_parent.content = smallfont_parent_content.strip
What the code is doing should be pretty easy to figure out as Nokogiri's methods are pretty self-explanatory.
I have such a text:
click here!blah-blah-some-text-here-blahclick here!
What's the correct way to remove all <a></a> tags (end everything inside them) if <a href= does NOT have some-good-website?
A possible solution using Nokogiri:
require 'nokogiri'
TEST = 'click here!blah-blah-some-text-here-blahclick here!'
page = Nokogiri::HTML(TEST)
links = page.css("a") # parse all <a></a> elements from content
links.each do |link|
if link['href'] =~ /http:\/\/www\.i-am-hacker\.com\/blah/
link.remove
end
end
puts page # output content for debugging
Output
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>blah-blah-some-text-here-blahclick here!
</body></html>
Useful resource: http://ruby.bastardsbook.com/chapters/html-parsing/
This site helped me a lot understanding how to use nokogiri
If you need to install nokogiri, you can do that by using the following command:
gem install nokogiri
Is there any clean way to get the contents of text nodes with Nokogiri? Right now I'm using
some_node.at_xpath( "//whatever" ).first.content
which seems really verbose for just getting text.
You want only the text?
doc.search('//text()').map(&:text)
Maybe you don't want all the whitespace and noise. If you want only the text nodes containing a word character,
doc.search('//text()').map(&:text).delete_if{|x| x !~ /\w/}
Edit: It appears you only wanted the text content of a single node:
some_node.at_xpath( "//whatever" ).text
Just look for text nodes:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>This is a text node </p>
<p> This is another text node</p>
</body>
</html>
EOT
doc.search('//text()').each do |t|
t.replace(t.content.strip)
end
puts doc.to_html
Which outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>This is a text node</p>
<p>This is another text node</p>
</body></html>
BTW, your code example doesn't work. at_xpath( "//whatever" ).first is redundant and will fail. at_xpath will find only the first occurrence, returning a Node. first is superfluous at that point, if it would work, but it won't because Node doesn't have a first method.
I have <data><foo>bar</foo></bar>, how I get at the "bar" text without doing doc.xpath_at( "//data/foo" ).children.first.content?
Assuming doc contains the parsed DOM:
doc.to_xml # => "<?xml version=\"1.0\"?>\n<data>\n <foo>bar</foo>\n</data>\n"
Get the first occurrence:
doc.at('foo').text # => "bar"
doc.at('//foo').text # => "bar"
doc.at('/data/foo').text # => "bar"
Get all occurrences and take the first one:
doc.search('foo').first.text # => "bar"
doc.search('//foo').first.text # => "bar"
doc.search('data foo').first.text # => "bar"
Hpricot's html method spits out just the HTML in the document:
> Hpricot('<p>a</p>').html
=> "<p>a</p>"
By contrast, the closest I can come with Nokogiri is the inner_html method, which wraps its output in <html> and <body> tags:
> Nokogiri.HTML('<p>a</p>').inner_html
=> "<html><body><p>a</p></body></html>"
How can I get the behavior of Hpricot's html method with Nokogiri? I.e., I want this:
> Nokogiri.HTML('<p>a</p>').some_method_i_dont_know_about
=> "<p>a</p>"
How about:
require 'nokogiri'
puts Nokogiri.HTML('<p>a</p>').to_html #
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>a</p></body></html>
If you don't want Nokogiri to create a HTML document, then you can tell it to parse it as a document fragment:
puts Nokogiri::HTML::DocumentFragment.parse('<p>a</p>').to_html
# >> <p>a</p>
In either case, the to_html method returns the HTML version of the document.
> Nokogiri.HTML('<p>a</p>').xpath('/html/body').inner_html
=> "<p>a</p>"
I want to remove all text from html page that I load with nokogiri. For example, if a page has the following:
<body><script>var x = 10;</script><div>Hello</div><div><h1>Hi</h1></div></body>
I want to process it with Nokogiri and return html like the following after stripping the text like so:
<body><script>var x = 10;</script><div></div><div><h1></h1></div></body>
(That is, remove the actual h1 text, text between divs, text in p elements etc, but keep the tags. Also, don't remove text in the script tags.)
require 'nokogiri'
html = "<body><script>var x = 10;</script><div>Hello</div><div><h1>Hi</h1></div></body>"
hdoc = Nokogiri::HTML(html)
hdoc.xpath( '//*[text()]' ).each do |el|
el.content='' unless el.name=="script"
end
puts hdoc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <script>var x = 10;</script><div></div>
#=> <div><h1></h1></div>
#=> </body></html>
Warning: As you did not specify how to handle a case like <div>foo<h1>bar</h1></div> the above may or may not do what you expect. Alternatively, the following may match your needs:
hdoc.xpath( '//text()' ).each do |el|
el.remove unless el.parent.name=="script"
end
Update
Here's a more elegant solution using a single xpath to select all text nodes not part of a <script> element. I've added more text nodes to show how it handles them.
require 'nokogiri'
hdoc = Nokogiri::HTML <<ENDHTML
<body>
<script>var x = 10;</script>
<div>Hello</div>
<div>foo<h1>Hi</h1>bar</div>
</body>
ENDHTML
hdoc.xpath( '//text()[not(parent::script)]' ).each{ |text| text.remove }
puts hdoc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <script>var x = 10;</script><div></div>
#=> <div><h1></h1></div>
#=> </body></html>
For Ruby 1.9, the meat is more simply:
hdoc.xpath( '//text()[not(parent::script)]' ).each(&:remove)