Nokogiri equivalent of Hpricot's html method - ruby

Hpricot's html method spits out just the HTML in the document:
> Hpricot('<p>a</p>').html
=> "<p>a</p>"
By contrast, the closest I can come with Nokogiri is the inner_html method, which wraps its output in <html> and <body> tags:
> Nokogiri.HTML('<p>a</p>').inner_html
=> "<html><body><p>a</p></body></html>"
How can I get the behavior of Hpricot's html method with Nokogiri? I.e., I want this:
> Nokogiri.HTML('<p>a</p>').some_method_i_dont_know_about
=> "<p>a</p>"

How about:
require 'nokogiri'
puts Nokogiri.HTML('<p>a</p>').to_html #
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>a</p></body></html>
If you don't want Nokogiri to create a HTML document, then you can tell it to parse it as a document fragment:
puts Nokogiri::HTML::DocumentFragment.parse('<p>a</p>').to_html
# >> <p>a</p>
In either case, the to_html method returns the HTML version of the document.

> Nokogiri.HTML('<p>a</p>').xpath('/html/body').inner_html
=> "<p>a</p>"

Related

Ruby: regex to remove tags if attributes don't have allowed values

I have such a text:
click here!blah-blah-some-text-here-blahclick here!
What's the correct way to remove all <a></a> tags (end everything inside them) if <a href= does NOT have some-good-website?
A possible solution using Nokogiri:
require 'nokogiri'
TEST = 'click here!blah-blah-some-text-here-blahclick here!'
page = Nokogiri::HTML(TEST)
links = page.css("a") # parse all <a></a> elements from content
links.each do |link|
if link['href'] =~ /http:\/\/www\.i-am-hacker\.com\/blah/
link.remove
end
end
puts page # output content for debugging
Output
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>blah-blah-some-text-here-blahclick here!
</body></html>
Useful resource: http://ruby.bastardsbook.com/chapters/html-parsing/
This site helped me a lot understanding how to use nokogiri
If you need to install nokogiri, you can do that by using the following command:
gem install nokogiri

How to use a regex search phrase in HTTP response body

I am trying to search for a phrase like this in HTTP response body:
>> myvar1
<HTML>
<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>
</HTML>
When I do this, I do not get any result:
>> myvar.scan(/<HEAD> <TITLE>TestExample [Date]<\/TITLE><\/HEAD>/)
[]
Here, [Date] is a dynamic variable that gets its value via loop iteration.
What should I add/change in the regex?
I am using Nokogiri to scan for keyword in HTTP response body.
Please do not parse any markup like HTML with regular expressions. For such purposes it is much more maintainable to feed it into a proper SAX or DOM parser and just extract what you want that way. The reason for this is that no matter how clever you formulate your regex, there will always be corner cases you probably forgot.
require 'nokogiri'
response = "<HTML> <HEAD> <TITLE>TestExample [Date]</TITLE></HEAD> </HTML>"
doc = Nokogiri::HTML( response )
doc.css( "title" ).text
This will work
<HEAD> <TITLE>TestExample (.*?)<\/TITLE><\/HEAD>
http://rubular.com/r/latepMqrjx
You probably don't need something as specific as <HEAD> <TITLE> as I doubt that there will be more than one title. Case sensitivity and newlines may also be an issue. I'd probably use
/<title>TestExample (.*?)<\//im
You're making it much too hard. Using Nokogiri, you can easily parse and search HTML and/or XML.
To get the <title> text simply use Nokogiri's HTML::Document#title method:
require 'nokogiri'
doc = Nokogiri::HTML('<HTML> <HEAD> <TITLE>TestExample [Date]</TITLE></HEAD> </HTML>')
doc.title # => "TestExample [Date]"
There's no regex to write or maintain, and this will work as long as the HTML is reasonably valid.
Since you're trying to get what looks like a template for a date, you'll probably want to rewrite that string, which Nokogiri also makes easy using title =:
require 'date'
require 'nokogiri'
doc = Nokogiri::HTML('<HTML> <HEAD> <TITLE>TestExample [Date]</TITLE></HEAD> </HTML>')
title = doc.title
title['[Date]'] = Date.today.to_s
doc.title = title
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html> <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <title>TestExample 2020-03-18</title>
# >> </head> </html>

What are some examples of using Nokogiri?

I am trying to understand Nokogiri. Does anyone have a link to a basic example of Nokogiri parse/scrape showing the resultant tree. Think it would really help my understanding.
Using IRB and Ruby 1.9.2:
Load Nokogiri:
> require 'nokogiri'
#=> true
Parse a document:
> doc = Nokogiri::HTML('<html><body><p>foobar</p></body></html>')
#=> #<Nokogiri::HTML::Document:0x1012821a0
#node_cache = [],
attr_accessor :errors = [],
attr_reader :decorators = nil
Nokogiri likes well formed docs. Note that it added the DOCTYPE because I parsed as a document. It's possible to parse as a document fragment too, but that is pretty specialized.
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foobar</p></body></html>\n"
Search the document to find the first <p> node using CSS and grab its content:
> doc.at('p').text
#=> "foobar"
Use a different method name to do the same thing:
> doc.at('p').content
#=> "foobar"
Search the document for all <p> nodes inside the <body> tag, and grab the content of the first one. search returns a nodeset, which is like an array of nodes.
> doc.search('body p').first.text
#=> "foobar"
This is an important point, and one that trips up almost everyone when first using Nokogiri. search and its css and xpath variants return a NodeSet. NodeSet.text or content concatenates the text of all the returned nodes into a single String which can make it very difficult to take apart again.
Using a little different HTML helps illustrate this:
> doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
> puts doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>foo</p>
<p>bar</p>
</body></html>
> doc.search('p').text
#=> "foobar"
> doc.search('p').map(&:text)
#=> ["foo", "bar"]
Returning back to the original HTML...
Change the content of the node:
> doc.at('p').content = 'bar'
#=> "bar"
Emit a parsed document as HTML:
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>bar</p></body></html>\n"
Remove a node:
> doc.at('p').remove
#=> #<Nokogiri::XML::Element:0x80939178 name="p" children=[#<Nokogiri::XML::Text:0x8091a624 "bar">]>
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body></body></html>\n"
As for scraping, there are a lot of questions on SO about using Nokogiri for tearing apart HTML from sites. Searching StackOverflow for "nokogiri and open-uri" should help.

parsing out the html doctype tag in Nokogiri

How can I parse out the doctype tag to get the html version from a html file?
Trying to use doctype(or DOCTYPE or !DOCTYPE) as an argument in xpath raises an invalide expression error.
The doctype is not part of the document, but part of its DTD
require 'rubygems'
require 'nokogiri'
html = <<EOF
<!DOCTYPE foo PUBLIC "bar" "qux">
<html>
</html>
EOF
doc = Nokogiri::HTML(html)
puts doc.internal_subset.name
puts doc.internal_subset.external_id
puts doc.internal_subset.system_id

Preserve structure of an HTML page, removing all text nodes

I want to remove all text from html page that I load with nokogiri. For example, if a page has the following:
<body><script>var x = 10;</script><div>Hello</div><div><h1>Hi</h1></div></body>
I want to process it with Nokogiri and return html like the following after stripping the text like so:
<body><script>var x = 10;</script><div></div><div><h1></h1></div></body>
(That is, remove the actual h1 text, text between divs, text in p elements etc, but keep the tags. Also, don't remove text in the script tags.)
require 'nokogiri'
html = "<body><script>var x = 10;</script><div>Hello</div><div><h1>Hi</h1></div></body>"
hdoc = Nokogiri::HTML(html)
hdoc.xpath( '//*[text()]' ).each do |el|
el.content='' unless el.name=="script"
end
puts hdoc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <script>var x = 10;</script><div></div>
#=> <div><h1></h1></div>
#=> </body></html>
Warning: As you did not specify how to handle a case like <div>foo<h1>bar</h1></div> the above may or may not do what you expect. Alternatively, the following may match your needs:
hdoc.xpath( '//text()' ).each do |el|
el.remove unless el.parent.name=="script"
end
Update
Here's a more elegant solution using a single xpath to select all text nodes not part of a <script> element. I've added more text nodes to show how it handles them.
require 'nokogiri'
hdoc = Nokogiri::HTML <<ENDHTML
<body>
<script>var x = 10;</script>
<div>Hello</div>
<div>foo<h1>Hi</h1>bar</div>
</body>
ENDHTML
hdoc.xpath( '//text()[not(parent::script)]' ).each{ |text| text.remove }
puts hdoc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <script>var x = 10;</script><div></div>
#=> <div><h1></h1></div>
#=> </body></html>
For Ruby 1.9, the meat is more simply:
hdoc.xpath( '//text()[not(parent::script)]' ).each(&:remove)

Resources