Nokogiri should not include DOCTYPE [duplicate] - ruby

I am trying to produce a partial HTML document using Nokogiri, e.g. something along the lines of:
html_content = Nokogiri::HTML::Builder.new() do |doc|
# producing document here, redacted for brevity
end.to_html
This works well enough, except for a little catch: data will later be dispatched to a remote Drupal-powered server and rendered as part of a page and thus should not contain the initial <!DOCTYPE html ...> declaration.
How would I go about convincing Nokogiri not to produce the DOCTYPE tag? Or is Nokogiri's HTML builder the wrong way to go about that?
Thanks in advance.

To achieve this you could use document fragments and the Builder.with method, like this:
require 'nokogiri'
include Nokogiri
fragment = HTML.fragment('')
HTML::Builder.with(fragment) do |f|
f.div('foo')
end
fragment.to_html
# => <div>foo</div>

Nokogiri makes it easy to create templates you can populate on the fly; I'd do it this way:
require 'nokogiri'
DESTINATION_HOST = 'http://www.example.com/some/API/call'
HTML_TEMPLATE = <<EOT
<form method="post">
<input name="user" type="text">
<input name="desc" type="text">
</form>
<div id="quote">
</div>
EOT
doc = Nokogiri::HTML::DocumentFragment.parse(HTML_TEMPLATE)
doc.at('form')['action'] = DESTINATION_HOST
doc.at('div').content = "Danger is my middle name."
[
['user', 'Austin Powers'],
['desc', 'Man of Mystery'],
].each do |name, value|
doc.at("input[name=\"#{name}\"]")['value'] = value
end
puts doc.to_html
# >> <form method="post" action="http://www.example.com/some/API/call">
# >> <input name="user" type="text" value="Austin Powers"><input name="desc" type="text" value="Man of Mystery">
# >> </form>
# >> <div id="quote">Danger is my middle name.</div>
The array and other fields that are populated could easily be loaded from a CSV or YAML file, JSON retrieved on the fly from another host, or directly from a database call.
You know how your document should look beforehand, so use that knowledge to create a template. Nokogiri's Builder is better suited for those times you're not even sure what tags you're going to need and need to dynamically build the entire document structure on the fly.
The hardest part is to define how you're going to loop over various tags in the document to stuff them with content or fill in the parameters, but once you've done that it's easy to create boilerplate you fill in and forward to something else.

Related

Why is XPath returning value of '0' using Ruby, Nokogiri and Watir?

I'm working on a white-hat web-crawler that will periodically log into my account and check some information for me using Ruby with Watir and Nokogiri.
Here's the simplified HTML I'm trying to pull information from:
<div class="navbar navbar-default navbar-fixed-top hidden-lg hidden-md" style="z-index: 1002">
<div class="banner-g">
<div class="container">
<div id="user-info">
<div id="acct-value">
GAIN/LOSS <span class="SPShares">-$12.85</span>
</div>
<div id="committed">
INVESTED <span class="SPPortfolio">$152.11</span>
</div>
<div id="avail">
AVAILABLE <span class="SPBalance">$26.98</span>
</div>
I'm trying to pull the $26.98 at the bottom of the excerpt.
Here are three snippets of code I'm using. They're all pretty much identical except for the XPath. The first two return their values perfectly, but the third always returns a value of "0" even though it 'should' return "$26.98" or "26.98".
val_one = page_html.xpath(".//*[@id='openone']/div/div[2]/div[1]/div/div[2]/table/tbody/tr[2]/td[1]").text.gsub(/\D/,'').to_i
val_two = page_html.xpath(".//*[@id='opentwo']/div/div[2]/div[2]/div/div[2]/table/tbody/tr[2]/td[1]").text.gsub(/\D/,'').to_i
val_three = page_html.xpath(".//*[@id='avail']/a/span").text.gsub(/\D/,'').to_i
puts val_three
I assume it's a problem with the XPath, but I've gone through dozens of XPath troubleshooting questions here and none have worked. I checked the XPath with both FirePath and "XPath Checker". I also tried having the XPath search for the "SPBalance" class but that gave the same result.
When I remove .to_i from the end, it returns a blank line instead of a zero.
Elsewhere in the site when using Watir, I was able to fix problems recording a value by calling .focus, but for this piece of the code, which is more Nokogiri, using .focus causes the error message:
undefined method `focus' for []:Nokogiri::XML::NodeSet (NoMethodError)
I assume .focus doesn't work for Nokogiri.
Update: Replaced HTML with a cleaner/more complete version.
I've continued to play around with different ways of reaching that data cell, including xpath, css and a search method. Someone told me xpath wouldn't work for this page so I spent even more time trying to get css to work. Someone else told me the page had Javascript, which would prevent Watir from working. So I tried rewriting the app for Selenium instead. Selenium did not solve the problem, and created a whole host of other problems.
Update: After following advice from the Tin Man, I've found that the node is not actually visible in the HTML when it is downloaded using curl.
I'm now trying to access the node using Watir instead of Nokogiri (as he suggested).
Here's some of what I've tried so far:
avail_funds = browser.span :class => 'SPBalance'
avail_funds.exists?
avail_funds.text
avail_funds = browser.span(:css, 'span[customattribute]').text
avail_funds = browser.div(:id => "avail").a(:href => "/Profile/MyShares").span(:class => "SPBalance").text
avail_funds = browser.span(:xpath, ".//*[@id='avail']/a/span").text
avail_funds = browser.span(:css, 'span[class="SPBalance"]').text
avail_funds = browser.span.text
avail_funds = browser.div.text
browser.span(:class, "SPBalance").focus
avail_funds = browser.span(:class, "SPBalance").text
avail_funds = @browser.span(:class => 'SPBalance').inner_html
puts @browser.spans(:class => "SPBalance")
puts @browser.span(:class => "SPBalance")
texts = @browser.spans(:class => "SPBalance").map do |span|
span.text
end
So far all of the above return either blank lines or an error message.
The div class with the ID "user-info" is visible within the HTML as downloaded via curl. Everything beneath that, however, is not visible.
When I try:
avail_funds = browser.div(:id => "user-info").text
I get only blank lines.
When I try:
avail_funds = browser.div(:class => "navbar navbar-default navbar-fixed-top hidden-xs hidden-sm").text
I get actual text back! But unfortunately the string does not contain the value I want.
I also tried:
puts browser.html
I thought that if the value were visible in that version of the HTML, as it is through my Firefox plug-in, I could parse down to the value I want. Unfortunately, the value is not visible in that version of the HTML either.
With the first two commands you fetch data directly from a table cell, starting from the root of the document; with the last one you start from the middle.
Try giving the span an id and fetching the data again, then increase the complexity step by step until you find the error in your XPath.
The first problem is you're trying to use a long, too-long, selector that is referencing tags that don't exist:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<head>
<body class="cbp-spmenu-push">
<div id="FreshWidget" class="freshwidget-container responsive" data-html2canvas-ignore="true" style="display: none;">
<div id="freshwidget-button" class="freshwidget-button fd-btn-right" data-html2canvas-ignore="true" style="display: none; top: 235px;">
<link rel="stylesheet" href="/Content/css/NavPushComponent.css"/>
<script src="/Scripts/classie.js"/>
<script src="/Scripts/modernizr.custom.js"/>
<div class="navbar navbar-default navbar-fixed-top hidden-lg hidden-md" style="z-index: 1002">
<div class="banner-g">
<div class="container">
<div id="user-info">
<div id="acct-value">
<div id="committed">
<div id="avail">
<a href="/Profile/MyBalance">
AVAILABLE
<span class="SPBalance">$31.59</span>
EOT
doc.at('tbody') # => nil
".//*[@id='openone']/div/div[2]/div[1]/div/div[2]/table/tbody/tr[2]/td[1]"
".//*[@id='opentwo']/div/div[2]/div[2]/div/div[2]/table/tbody/tr[2]/td[1]"
There is no <tbody> tag in your sample, and there rarely is in HTML created in the wild, especially if people created it manually. We usually see <tbody> in HTML someone grabbed from a browser's "View Source" display, which is the resulting output after their engine has mangled the HTML in an attempt to make it readable. Don't use that output. Instead, ALWAYS go straight to the source and use wget or curl and download the page and inspect it with an editor, or even use nokogiri some_url on the command-line and look at it there.
A second problem is your HTML snippet is invalid because it's full of unterminated tags. Nokogiri will do fixups on bad HTML, which can actually move nodes around, making it difficult to find nodes, especially when debugging. In this particular case Nokogiri is able to terminate them, but it's important to honor tag closures.
Here's what I'd use:
value = doc.at('span.SPBalance').text # => "$31.59"
This is using CSS which is usually much more readable than XPath. at means "find the first occurrence" and is equivalent to search('span.SPBalance').first.
The XPath equivalent would be:
doc.at('//span[@class="SPBalance"]')
doc.at('//span[@class="SPBalance"]').text # => "$31.59"
Once I have the value then it's easy to manipulate it.
value[/[\d.]+/].to_f # => 31.59
Moving on...
the third always returns a value of "0" even though it should return "$31.59" or "31.59"
'$31.58'.to_i # => 0
'$'.to_i # => 0
'31.58'.to_i # => 31
'$31.58'.to_f # => 0.0
'31.58'.to_f # => 31.58
The documentation for to_f and to_i says, respectively:
Returns the result of interpreting leading characters in str as a floating point number.
and
Returns the result of interpreting leading characters in str as an integer base base (between 2 and 36).
In both cases "leading characters" is significant.
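To make the "leading characters" point concrete, here is a minimal sketch of a helper that pulls a signed amount out of display text before converting it. The name parse_money is hypothetical and not part of Nokogiri or Watir; it also returns nil for empty text, which is what an empty NodeSet's .text produces.

```ruby
# Hypothetical helper: extract a signed decimal amount from display text.
def parse_money(text)
  match = text.match(/(-)?\$?\s*(\d[\d,]*(?:\.\d+)?)/)
  return nil unless match # no digits at all, e.g. text from an empty NodeSet

  amount = match[2].delete(',').to_f
  match[1] ? -amount : amount
end

parse_money('$26.98')  # => 26.98
parse_money('-$12.85') # => -12.85
parse_money('')        # => nil

# For comparison, the approach from the question:
''.gsub(/\D/, '').to_i       # => 0    (nothing matched, so the result is silently 0)
'$26.98'.gsub(/\D/, '').to_i # => 2698 (the decimal point is stripped as well)
```

Returning nil instead of 0 makes a selector that matched nothing easy to tell apart from a genuine zero balance.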
using .focus causes the error message:
undefined method `focus' for []:Nokogiri::XML::NodeSet (NoMethodError)
I assume .focus doesn't work for Nokogiri.
You could always check the NodeSet documentation, which confirms that focus is not a method.

Access two elements simultaneously in Nokogiri

I have some weirdly formatted HTML files which I have to parse.
This is my Ruby code:
File.open('2.html', 'r:utf-8') do |f|
@parsed = Nokogiri::HTML(f, nil, 'windows-1251')
puts @parsed.xpath('//span[@id="f5"]//div[@id="f5"]').inner_text
end
I want to parse a file containing:
<span style="position:absolute;top:156pt;left:24pt" id=f6>36.4.1.1. варенье, джемы, конфитюры, сиропы</span>
<div style="position:absolute;top:167.6pt;left:24.7pt;width:709.0;height:31.5;padding-top:23.8;font:0pt Arial;border-width:1.4; border-style:solid;border-color:#000000;"><table></table></div>
<span style="position:absolute;top:171pt;left:28pt" id=f5>003874</span>
<div style="position:absolute;top:171pt;left:99pt" id=f5>ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div style="position:absolute;top:180pt;left:99pt" id=f5>325гр. </div>
<div style="position:absolute;top:167.6pt;left:95.8pt;width:2.8;height:31.5;padding-top:23.8;font:0pt Arial;border-width:0 0 0 1.4; border-style:solid;border-color:#000000;"><table></table></div>
I need to select every <div> or <span> with id "f5". With my current XPath selector that's not possible. If I remove //span[@id="f5"], for example, then the divs are selected correctly. I can output them one after another:
puts @parsed.xpath('//div[@id="f5"]').inner_text
puts @parsed.xpath('//span[@id="f5"]').inner_text
but then the order would be a complete mess. Each parsed span has to stay next to its div, in the same order as in the original file.
Am I missing some basics? I haven't found anything on the web regarding parallel parsing of two elements. Most posts are concerned with parsing two classes of a div for example, but not two different elements at a time.
If I understand this correctly, you can use the following XPath :
//*[self::div or self::span][@id="f5"]
xpathtester demo
The XPath above finds every element named either div or span whose id attribute equals "f5".
output :
<span id="f5" style="position:absolute;top:171pt;left:28pt">003874</span>
<div id="f5" style="position:absolute;top:171pt;left:99pt">ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div id="f5" style="position:absolute;top:180pt;left:99pt">325гр.</div>

Ruby Mechanize, save html as file after filling a form

I want to save the html after filling a form. lets say:
page.form.field.value = 'testing'
page.save 'test.html'
The generated test.html file doesn't have the modified value attribute:
<input name='something' value=''>
I'm expecting:
<input name='something' value='testing'>
You want to use dom functions for that:
page.at('[name=something]')['value'] = 'testing'
In other words there's no reason to expect that changes to the Form fields will update the dom.

Building an HTML document with content from another

I have a document A and want to build a new one, B, using A's node values.
Given A looks like this...
<html>
<head></head>
<body>
<div id="section0">
<h1>Section 0</h1>
<div>
<p>Some <b>important</b> info here</p>
<div>Some unimportant info here</p>
</div>
<div>
<div id="section1">
<h1>Section 1</h1>
<div>
<p>Some <i>important</i> info here</p>
<div>Some unimportant info here</div>
</div>
<div>
</body>
</html>
When building a B document, I'm using method a.at_css("#section#{n} h1").text to grab the data from A's h1 tags like this:
require 'nokogiri'
a = Nokogiri::HTML(html)
Nokogiri::HTML::Builder.new do |doc|
...
doc.h1 a.at_css("#section#{n} h1").text
...
end
So there are three questions:
How do I grab the content of <p> tags preserving tags inside <p>?
Currently, a.at_css("#section#{n} p").text returns plain text, which is not what's needed.
If I use .to_html or .inner_html instead of .text, the HTML appears escaped. So I get, for example, &lt;p&gt; instead of <p>.
Is there any known true way of assigning nodes at the document building stage? So that I wouldn't dance with text method at all? I.e. how do I assign doc.h1 node with value of a.at_css("#section#{n} h1") node at building stage?
What's the profit of Nokogiri::Builder.with(...) method? I wonder if I can get use of it...
How do I grab the content of <p> tags preserving tags inside <p>?
Use .inner_html. The entities are not escaped when accessing them. They will be escaped if you do something like builder.node_name raw_html. Instead:
require 'nokogiri'
para = Nokogiri.HTML( '<p id="foo">Hello <b>World</b>!</p>' ).at('#foo')
doc = Nokogiri::HTML::Builder.new do |d|
d.body do
d.div(id:'content') do
d.parent << para.inner_html
end
end
end
puts doc.to_html
#=> <body><div id="content">Hello <b>World</b>!</div></body>
Is there any known true way of assigning nodes at the document building stage?
Similar to the above, one way is:
puts Nokogiri::HTML::Builder.new{ |d| d.body{ d.parent << para } }.to_html
#=> <body><p id="foo">Hello <b>World</b>!</p></body>
Voila! The node has moved from one document to the other.
What's the profit of Nokogiri::Builder.with(...) method?
That's rather unrelated to the rest of your question. As the documentation says:
Create a builder with an existing root object. This is for use when you have an existing document that you would like to augment with builder methods. The builder context created will start with the given root node.
I don't think it would be useful to you here.
In general, I find the Builder to be convenient when writing a large number of custom nodes from scratch with a known hierarchy. When not doing that you may find it simpler to just create a new document and use DOM methods to add nodes as appropriate. It's hard to tell how much hard-coded nodes/hierarchy your document will have versus procedurally created.
One other, alternative suggestion: perhaps you should create a template XML document and then augment that with details from the other, scraped HTML?

Pass Nokogiri object to ERB page, and then back to post

This is in Sinatra. In my 'get', I create an instance variable which is a nokogiri object, created from an external xml file. I go to an erb file and parse through that nokogiri object in order to do the page layout. In my post method, I need access to that same nokogiri object (I may return to post numerous times and may modify the nokogiri object). The way I've been doing this is to set a hidden variable in the erb page, like this:
<input type="hidden" name="test" value= '<%= @test %>' >
Then in my post, I create a nokogiri object from that variable like this:
@test = Nokogiri::XML(params["test"])
This seemed clunky, but I'm not experienced in this stuff. Anyway, everything worked fine, except that somewhere along the line, my embedded quotes in the xml get mangled. For example, node in my file starts like this:
<property name="blah" value='{"name:foo"}'> </property>
And when I do a puts in my post of params["test"], I get this:
<property name="blah" value="{"name:foo"}"> </property>
(single quotes became double quotes), and finally, after converting it back into a nokogiri object, with the following code:
@test = Nokogiri::XML(params["test"])
I get this:
<property name="blah" value="{"/>name:foo"}"> </root>
Is there a better way to retain access to the object? If not, is there a way to retain my embedded quotes? (I think setting the hidden variable in the erb file is where they get mangled.)
Summary
Cache the Nokogiri documents in a constant (e.g. a hash or module), which lives across requests (within the same server run; see below).
Send only the key to the hash in your form.
Use the key to get the document back out of the constant later on.
Example
package_32.xml
<packages><kittens in_box="true">32</kittens></packages>
cache_nokodocs.rb
require 'sinatra'
require 'nokogiri'
module NokoDocs
@docs_by_file = {}
def self.[](file)
@docs_by_file[file] ||= Nokogiri::XML(IO.read(file))
end
end
get "/xml/:doc" do
@doc = params['doc']
@xml = NokoDocs[@doc]
<<-ENDHTML
The XML starts with '#{@xml.root.name}'
<form method="post" action="/">
<input type="hidden" name="xml" value="#{@doc}">
<button type="submit">Go</button>
</form>
ENDHTML
end
post "/" do
@xml = NokoDocs[params['xml']]
@xml.to_s
end
Using
C:\>curl http://localhost:4567/xml/package_32.xml
The XML starts with 'packages'
<form method="post" action="/">
<input type="hidden" name="xml" value="package_32.xml">
<button type="submit">Go</button>
</form>
# simulate post that the web browser does from the command line
C:\>curl -d xml="package_32.xml" http://localhost:4567/
<?xml version="1.0"?>
<packages>
<kittens in_box="true">32</kittens>
</packages>
The first time any user requests a particular XML file, it will be loaded into the hash; subsequent requests for that file will fetch it from the hash directly, pre-parsed.
Beware!
The documents will not be cached across multiple instances of the server (e.g. if you're behind a reverse proxy). They will also not be cached across server restarts. However, if these are static files on disk, the worst that will happen is that the particular server session will just have to re-create the Nokogiri document once before caching it.
Using the file name on disk and then letting the user post it back to you is probably a really, really dangerous thing to do. Instead, you might create a custom or random key when you load the document and use that. For example:
require 'digest'
module NokoDocs
@docs_by_file = {}
def self.from_file(file)
# Digest::SHA1 is a class, so call hexdigest on it; rand returns an Integer, so convert it to a String
key = Digest::SHA1.hexdigest( file + rand(100).to_s )
[
@docs_by_file[key] ||= Nokogiri::XML(IO.read(file)),
key
]
end
def self.from_key(key)
@docs_by_file[key]
end
end
get "/xml/:doc" do
@xml, @key = NokoDocs.from_file params['doc']
...
"<input type=\"hidden\" name=\"key\" value=\"#{@key}\">"
...
end
post "/" do
@xml = NokoDocs.from_key params['key']
end
This is a potential memory leak. Each unique document your users request is parsed as a big Nokogiri document and preserved forever (until you restart the server). You might want a system that records the last access time and have a timed job that periodically sweeps out items that haven't been accessed in a while.
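One possible shape for such a system, as a sketch only: a cache that hands out random keys, records the last access time, and can be swept. The class name ExpiringCache, its methods, and the injectable clock are all assumptions made for this example. It uses SecureRandom for the key rather than the SHA1-of-filename idea above, and it is not thread-safe as written (a Mutex around the hash would be needed under a threaded server).

```ruby
require 'securerandom'

# Hypothetical cache: stores values under unguessable keys and tracks last access.
class ExpiringCache
  Entry = Struct.new(:value, :last_access)

  # clock is injectable so the expiry logic can be exercised without sleeping.
  def initialize(clock: -> { Time.now })
    @entries = {}
    @clock = clock
  end

  # Stores a value and returns the random key to embed in the form.
  def store(value)
    key = SecureRandom.hex(16)
    @entries[key] = Entry.new(value, @clock.call)
    key
  end

  # Returns the value (or nil) and refreshes its last-access time.
  def fetch(key)
    entry = @entries[key]
    return nil unless entry

    entry.last_access = @clock.call
    entry.value
  end

  # Drops entries not accessed within max_age seconds; returns how many remain.
  def sweep(max_age)
    cutoff = @clock.call - max_age
    @entries.delete_if { |_key, entry| entry.last_access < cutoff }
    @entries.size
  end
end
```

In the Sinatra example, store would receive the parsed Nokogiri document, the returned key would go into the hidden input, and sweep could be called from a periodic job or at the start of each request.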

Resources