This is in Sinatra. In my 'get', I create an instance variable holding a Nokogiri object built from an external XML file. I then go to an erb file and walk that Nokogiri object to do the page layout. In my post method, I need access to that same Nokogiri object (I may return to post numerous times and may modify the Nokogiri object). The way I've been doing this is to set a hidden variable in the erb page, like this:
<input type="hidden" name="test" value='<%= @test %>'>
Then in my post, I create a Nokogiri object from that variable like this:
@test = Nokogiri::XML(params["test"])
This seemed clunky, but I'm not experienced in this stuff. Anyway, everything worked fine, except that somewhere along the line the embedded quotes in my XML get mangled. For example, a node in my file starts like this:
<property name="blah" value='{"name:foo"}'> </property>
And when I do a puts in my post of params["test"], I get this:
<property name="blah" value="{"name:foo"}"> </property>
(single quotes became double quotes), and finally, after converting it back into a Nokogiri object with the following code:
@test = Nokogiri::XML(params["test"])
I get this:
<property name="blah" value="{"/>name:foo"}"> </root>
Is there a better way to retain access to the object? If not, is there a way to retain my embedded quotes? (I think setting the hidden variable in the erb file is where it gets mangled.)
Summary
Cache the Nokogiri documents in a constant (e.g. a hash or module), which lives across requests (within the same server run; see below).
Send only the key to the hash in your form.
Use the key to get the document back out of the constant later on.
Example
package_32.xml
<packages><kittens in_box="true">32</kittens></packages>
cache_nokodocs.rb
require 'sinatra'
require 'nokogiri'
module NokoDocs
  @docs_by_file = {}
  def self.[](file)
    @docs_by_file[file] ||= Nokogiri::XML(IO.read(file))
  end
end
get "/xml/:doc" do
  @doc = params['doc']
  @xml = NokoDocs[@doc]
  <<-ENDHTML
The XML starts with '#{@xml.root.name}'
<form method="post" action="/">
<input type="hidden" name="xml" value="#{@doc}">
<button type="submit">Go</button>
</form>
  ENDHTML
end
post "/" do
  @xml = NokoDocs[params['xml']]
  @xml.to_s
end
Using
C:\>curl http://localhost:4567/xml/package_32.xml
The XML starts with 'packages'
<form method="post" action="/">
<input type="hidden" name="xml" value="package_32.xml">
<button type="submit">Go</button>
</form>
# simulate post that the web browser does from the command line
C:\>curl -d xml="package_32.xml" http://localhost:4567/
<?xml version="1.0"?>
<packages>
<kittens in_box="true">32</kittens>
</packages>
The first time any user requests a particular XML file, it will be loaded into the hash; subsequent requests for that file will fetch it from the hash directly, pre-parsed.
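If you would rather keep round-tripping the XML through the hidden field, the mangling described in the question is most likely an escaping problem: the serialized XML is dropped into an HTML attribute unescaped, so the browser decodes entities such as &quot; and treats the embedded quotes as attribute delimiters. Here is a minimal sketch of the fix using only the standard library. The `serialized` string is an assumed stand-in for what the document serializes to, and `CGI.unescapeHTML` is an assumed stand-in for the decoding a browser does before posting the form.

```ruby
require 'erb'
require 'cgi'

# Stand-in for the serialized document (assumption: the serializer writes the
# attribute with double quotes and encodes the inner quotes as &quot;).
serialized = '<property name="blah" value="{&quot;name:foo&quot;}"> </property>'

# Escape the whole string before placing it in the attribute value.
escaped = ERB::Util.html_escape(serialized)
hidden_field = %(<input type="hidden" name="test" value="#{escaped}">)

# Stand-in for the browser: it decodes the attribute value once and
# posts the result back as params["test"].
posted = CGI.unescapeHTML(escaped)

puts hidden_field
puts posted == serialized
```

In the erb template the equivalent would be something like value="<%= ERB::Util.html_escape(@test.to_xml) %>", so that what arrives in params["test"] is the same string that was serialized.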
Beware!
The documents will not be cached across multiple instances of the server (e.g. if you're behind a reverse proxy). They will also not be cached across server restarts. However, if these are static files on disk, the worst that will happen is that the particular server session will just have to re-create the Nokogiri document once before caching it.
Using the file name on disk and then letting the user post it back to you is probably a really, really dangerous thing to do. Instead, you might create a custom or random key when you load the document and use that. For example:
require 'digest'
module NokoDocs
  @docs_by_file = {}
  def self.from_file(file)
    # Digest::SHA1 is a class, so call hexdigest on it; interpolate the
    # random number because String + Integer raises a TypeError
    key = Digest::SHA1.hexdigest("#{file}#{rand(100)}")
    [
      @docs_by_file[key] ||= Nokogiri::XML(IO.read(file)),
      key
    ]
  end
  def self.from_key(key)
    @docs_by_file[key]
  end
end
get "/xml/:doc" do
  @xml, @key = NokoDocs.from_file params['doc']
  ...
  %Q{<input type="hidden" name="key" value="#{@key}">}
  ...
end
post "/" do
  @xml = NokoDocs.from_key params['key']
end
This is a potential memory leak. Each unique document your users request is parsed as a big Nokogiri document and preserved forever (until you restart the server). You might want a system that records the last access time and have a timed job that periodically sweeps out items that haven't been accessed in a while.
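One way to sketch the "record the last access time and sweep" idea is below. This is an illustration only: `ExpiringCache`, `max_age`, and `sweep` are hypothetical names, the cached value is a plain string standing in for a parsed document, and the class is not thread-safe, so a real server would need a Mutex around the hash.

```ruby
# Hypothetical cache that forgets entries not accessed within max_age seconds.
class ExpiringCache
  def initialize(max_age)
    @max_age = max_age
    @entries = {} # key => [value, last_access_time]
  end

  # Return the cached value, building it with the block only on a miss,
  # and record the access time either way.
  def fetch(key, now = Time.now)
    entry = (@entries[key] ||= [yield, now])
    entry[1] = now
    entry[0]
  end

  # Drop every entry whose last access is older than max_age seconds.
  def sweep(now = Time.now)
    @entries.delete_if { |_key, (_value, last_access)| now - last_access > @max_age }
  end

  def size
    @entries.size
  end
end

cache = ExpiringCache.new(60)
start = Time.at(0)
cache.fetch('package_32.xml', start) { 'parsed document' }

cache.sweep(start + 30)
puts cache.size # entry was used 30 seconds ago, so it is kept

cache.sweep(start + 120)
puts cache.size # entry is now older than 60 seconds, so it is gone
```

The `now` parameters exist only so the behaviour can be exercised without sleeping; the periodic job itself would simply call `sweep` on a timer.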
Related
I'm trying to write "Private Equity Group; USA" to a file.
"Private Equity Group" prints fine, but I get an error for the "USA" portion
TypeError: null is not an object (evaluating 'style.display')
HTML code:
<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA
</div>
The XPath for "USA" is:
//*[@id="addrDiv-Id"]/div/div[3]/text()
I get the error when I print the XPath or have it in an if statement:
if (internet.has_xpath?('//*[@id="addrDiv-Id"]/div/div[3]/text()')){
file.puts "#{internet.find(:xpath, '//*[@id="addrDiv-Id"]/div/div[3]/text()')}"
}
Capybara is not a general purpose xpath library - it is a library aimed at testing, and therefore is element centric. The xpaths used need to refer to elements, not text nodes.
if internet.has_xpath?('//*[@id="addrDiv-Id"]/div/div[3]')
  file.puts internet.find(:xpath, '//*[@id="addrDiv-Id"]/div/div[3]').text
end
That said, XPath is a poor fit for this. Whenever possible, default to CSS: it's easier to read and faster for the browser to process. Something like
if internet.has_css?('#addrDiv-Id > div > div:nth-of-type(3)')
  file.puts internet.find('#addrDiv-Id > div > div:nth-of-type(3)').text
end
or if the HTML allows it (I don't know without seeing more of the HTML)
if internet.has_css?('#addrDiv-Id .cl.profile-xsmall')
  file.puts internet.find('#addrDiv-Id .cl.profile-xsmall').text
end
or even cleaner if it works for your use case
file.puts internet.first('#addrDiv-Id .cl.profile-xsmall')&.text
Another way to do it :
xml = %{<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA</div>}
require 'rexml/document'
doc = REXML::Document.new xml
print(REXML::XPath.match(doc, 'normalize-space(string(//div[@class="cl profile-xsmall"]))'))
Output :
["Private Equity Group USA"]
I'd say the HTML isn't well-formed, using span would have been better, but this works:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA
</div>
EOT
div = doc.at('.profile-small-bold')
[div.text.strip, div.next_sibling.text.strip].join(' ')
# => "Private Equity Group USA"
which can be reduced to:
[div, div.next_sibling].map { |n| n.text.strip }.join(' ')
# => "Private Equity Group USA"
The problem is that you have two nested divs, with "USA" trailing, so it's important to point to the inner node which has the main text you want. Then "USA" is in the following text node, which is accessible using next_sibling:
div.next_sibling.class # => Nokogiri::XML::Text
div.next_sibling # => #<Nokogiri::XML::Text:0x3c "\n USA\n">
Note, I'm using CSS selectors; they're easier to read, which is echoed by the Nokogiri documentation. I have no proof they're faster, and, because Nokogiri uses libxml to process both, there's probably no real difference worth worrying about, so use whatever makes more sense, and run benchmarks if you're curious.
You might be tempted to use text against the div class="cl profile-xsmall" node, but don't be sucked into that, as it's a trap:
doc.at('.profile-xsmall').text # => "\n Private Equity Group\n USA\n"
doc.at('.profile-xsmall').text.gsub(/\s+/, ' ').strip # => "Private Equity Group USA"
text returns a single string made by concatenating all the text nodes. In this particular, rare case the result is somewhat usable; usually, however, you'll get something like this:
doc = Nokogiri::HTML('<div><p>foo</p><p>bar</p></div>')
doc.at('div').text # => "foobar"
doc.search('p').text # => "foobar"
Once those text nodes have been concatenated it's really difficult to take them apart again. Nokogiri's documentation talks about this:
Note: This joins the text of all Node objects in the NodeSet:
doc = Nokogiri::XML('<xml><a><d>foo</d><d>bar</d></a></xml>')
doc.css('d').text # => "foobar"
Instead, if you want to return the text of all nodes in the NodeSet:
doc.css('d').map(&:text) # => ["foo", "bar"]
The XPath for "USA" is:
//*[@id="addrDiv-Id"]/div/div[3]/text()
Um, no, not according to the HTML you gave us. But, let's pretend.
Using an absolute path to a node is a good way to write fragile selectors. It takes only a small change in the HTML to break your access to the node. Instead, find way-points to skip through the HTML to find the node you want, taking advantage of CSS and XPath to search downward through the DOM.
Typically, a selector like yours is generated by a browser, which isn't a good source to trust. Browsers often do fixups on malformed HTML, which changes it from what Nokogiri or another parser would see and results in a non-existent target. The browser may also present the HTML after JavaScript has had a chance to run, which can move nodes, hide them, add new ones, etc.
Instead of trusting the browser, use curl, wget or nokogiri at the command-line to dump the file and look at it using a text editor. Then you'll be seeing it just as Nokogiri sees it, prior to any fixups or mangling.
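The way-point advice above can be sketched with the standard library's REXML, used here only to keep the example dependency-free; the same idea applies with Nokogiri. The outer div carrying the id is an assumption added for illustration, since the question's HTML did not include it.

```ruby
require 'rexml/document'

# Assumption: a well-formed snippet shaped like the HTML in the question,
# wrapped in a hypothetical outer div carrying the id from the XPath.
html = <<~HTML
  <div id="addrDiv-Id">
    <div class="cl profile-xsmall">
      <div class="cl profile-small-bold">Private Equity Group</div>
      USA
    </div>
  </div>
HTML

doc = REXML::Document.new(html)

# Way-point: the element that holds the company name, found by its class
# rather than by its position in the tree.
name_div = REXML::XPath.first(doc, '//div[@class="cl profile-small-bold"]')

# The country is the text node that directly follows that element.
country = name_div.next_sibling_node

company = name_div.text.strip
place   = country.value.strip
puts [company, place].join('; ')
```

Because the selector keys on the class name, adding or removing wrapper divs above it does not break the lookup the way a positional path such as div/div[3] would.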
How do I interact with a file_field that's hidden by its parent?
<span class="btn button-large fileinput-button">
Select files...
<input accept="image/png, image/gif, image/jpeg" id="gallery_files" multiple="multiple" name="gallery_files" type="file">
</span>
The button overlays the input, therefore it's not visible.
Edit
For the record, here's my code:
data[:photos].each do |photo|
$browser.file_field.set photo
end
and the error: Watir::Wait::TimeoutError: timed out after 20 seconds, waiting for {:tag_name=>"input", :type=>"file"} to become present
Workable example in a Gist
I was a bit surprised, but I was able to set the file field in the sample HTML without any issue using:
browser.file_field.set('path/to/file.txt')
From the code, it looks like setting the file field only requires the input to exist. It does not require it to be visible (or present).
Given that you are getting a Watir::Wait::TimeoutError exception, I would guess that your code is actually failing before the file_field.set. As it looks like the page has the input in a dialog, I am guessing your code actually looks more like:
$browser.file_field.wait_until_present
data[:photos].each do |photo|
$browser.file_field.set photo
end
It would be the wait_until_present method that is actually throwing the exception.
Solution 1
Assuming that an explicit wait method is being called for the file field, you should be able to just remove the wait.
If you have the wait because the dialog is being loaded by Ajax, you could try waiting for a different element instead - eg the parent span.
Solution 2
If for some reason you need the file field to be present, you will need to change its CSS. In this case, the opacity:
p $browser.file_field.present?
#=> false
$browser.execute_script('arguments[0].style.opacity = "1.0";', $browser.file_field)
p $browser.file_field.present?
#=> true
For my situation, this worked:
$browser.execute_script("jQuery(function($) {
$('.fileinput-button').css('visibility', 'hidden')
$('#gallery_files').css('visibility', 'visible').css('opacity', '1').css('width', '100').css('height', '50')
})")
I had to hide the parent span, then show, resize, and change the opacity of the input.
I have an XML-like document which is pre-processed by a system out of my control. The format of the document is like this:
<template>
Hello, there <RECALL>first_name</RECALL>. Thanks for giving me your email.
<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>. I have just sent you something.
</template>
However, I only get as a text string what is between the <template> tags.
I would like to be able to extract the tags without specifying them ahead of time when parsing. I can do this with the Crack gem, but only if there is a single tag and it is at the end of the string.
With Crack, I can put a string like
string = "<SETPROFILE><NAME>email</NAME><VALUE>go@go.com</VALUE></SETPROFILE>"
and my output from Crack is:
{"SETPROFILE"=>{"NAME"=>"email", "VALUE"=>"go@go.com"}}
Then I can use a case statement for the possible values I care about.
Given that I need to have multiple <tags> in the string and they cannot be at the end of the string, how can I parse out the node names and the values easily, similar to what I do with crack?
These tags also need to be removed. I would like to continue to use the excellent suggestion from @TinMan.
It works perfectly once I know the name of the tag. The number of tags will be finite. I send the tag to the appropriate method once I know it, but it needs to get parsed out easily first.
Using Nokogiri, you can treat the string as a DocumentFragment, then find the embedded nodes:
require 'nokogiri'
doc = Nokogiri::XML::DocumentFragment.parse(<<EOT)
Hello, there <RECALL>first_name</RECALL>. Thanks for giving me your email.
<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>. I have just sent you something.
EOT
nodes = doc.search('*').each_with_object({}){ |n, h|
h[n] = n.text
}
nodes # => {#<Nokogiri::XML::Element:0x3ff96083b744 name="RECALL" children=[#<Nokogiri::XML::Text:0x3ff96083a09c "first_name">]>=>"first_name", #<Nokogiri::XML::Element:0x3ff96083b5c8 name="SETPROFILE" children=[#<Nokogiri::XML::Element:0x3ff96083a678 name="NAME" children=[#<Nokogiri::XML::Text:0x3ff960836884 "email">]>, #<Nokogiri::XML::Element:0x3ff96083a650 name="VALUE" children=[#<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">]>]>=>"email", #<Nokogiri::XML::Element:0x3ff96083a678 name="NAME" children=[#<Nokogiri::XML::Text:0x3ff960836884 "email">]>=>"email", #<Nokogiri::XML::Element:0x3ff96083a650 name="VALUE" children=[#<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">]>=>"", #<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">=>""}
Or, more legibly:
nodes = doc.search('*').each_with_object({}){ |n, h|
h[n.name] = n.text
}
nodes # => {"RECALL"=>"first_name", "SETPROFILE"=>"email", "NAME"=>"email", "VALUE"=>"", "star"=>""}
Getting the content of a particular tag is easy then:
nodes['RECALL'] # => "first_name"
Iterating over all the tags is also easy:
nodes.keys.each do |k|
...
end
You can even replace a tag and its content with text:
doc.at('RECALL').replace('Fred')
doc.to_xml # => "Hello, there Fred. Thanks for giving me your email. \n<SETPROFILE>\n <NAME>email</NAME>\n <VALUE>\n <star/>\n </VALUE>\n</SETPROFILE>. I have just sent you something.\n"
How to replace the nested tags is left to you as an exercise.
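For comparison, here is a rough sketch of the same discover-then-strip flow using the standard library's REXML instead of Nokogiri. It is an alternative illustration, not the approach shown above: it wraps the text in a throwaway root element so it parses as one document, and it only looks at top-level tags and their direct children.

```ruby
require 'rexml/document'

# The text as received, i.e. without the surrounding <template> tags.
text = 'Hello, there <RECALL>first_name</RECALL>. Thanks for giving me your email. ' \
       '<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>. ' \
       'I have just sent you something.'

# Wrap the text in a throwaway root so it parses as a single document.
doc = REXML::Document.new("<template>#{text}</template>")

# Describe each top-level tag without knowing its name in advance.
commands = doc.root.elements.to_a.map do |el|
  children = el.elements.to_a.map { |child| [child.name, child.text] }.to_h
  { tag: el.name, text: el.text, children: children }
end

# Remove the tags, then read back the plain text that is left.
doc.root.elements.to_a.each(&:remove)
plain = doc.root.texts.map(&:value).join

p commands
puts plain
```

Each entry in `commands` can then be dispatched with a case statement on `:tag`, much like the Crack-based approach described in the question.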
I am trying to produce a partial HTML document using Nokogiri, e.g. something along the lines of:
html_content = Nokogiri::HTML::Builder.new() do |doc|
# producing document here, redacted for brevity
end.to_html
This works well enough, except for a little catch: data will later be dispatched to a remote Drupal-powered server and rendered as part of a page and thus should not contain the initial <!DOCTYPE html ...> declaration.
How would I go about convincing Nokogiri not to produce the DOCTYPE tag? Or is Nokogiri's HTML builder the wrong way to go about that?
Thanks in advance.
To achieve this you could use document fragments and the Builder.with method, like this:
require 'nokogiri'
include Nokogiri
fragment = HTML.fragment('')
HTML::Builder.with(fragment) do |f|
f.div('foo')
end
fragment.to_html
# => <div>foo</div>
Nokogiri makes it easy to create templates you can populate on the fly; I'd do it this way:
require 'nokogiri'
DESTINATION_HOST = 'http://www.example.com/some/API/call'
HTML_TEMPLATE = <<EOT
<form method="post">
<input name="user" type="text">
<input name="desc" type="text">
</form>
<div id="quote">
</div>
EOT
doc = Nokogiri::HTML::DocumentFragment.parse(HTML_TEMPLATE)
doc.at('form')['action'] = DESTINATION_HOST
doc.at('div').content = "Danger is my middle name."
[
['user', 'Austin Powers'],
['desc', 'Man of Mystery'],
].each do |name, value|
doc.at("input[name=\"#{name}\"]")['value'] = value
end
puts doc.to_html
# >> <form method="post" action="http://www.example.com/some/API/call">
# >> <input name="user" type="text" value="Austin Powers"><input name="desc" type="text" value="Man of Mystery">
# >> </form>
# >> <div id="quote">Danger is my middle name.</div>
The array and other fields that are populated could easily be loaded from a CSV or YAML file, JSON retrieved on the fly from another host, or directly from a database call.
You know how your document should look beforehand, so use that knowledge to create a template. Nokogiri's Builder is better suited for those times you're not even sure what tags you're going to need and need to dynamically build the entire document structure on the fly.
The hardest part is to define how you're going to loop over various tags in the document to stuff them with content or fill in the parameters, but once you've done that it's easy to create boilerplate you fill in and forward to something else.
I'm making a URL shortener application in Sinatra. It works as follows:
The front page is a form with one field, to enter a long url into:
<form action="" method="post">
<input type='url' name='url' placeholder='Enter URL to be shortened'>
<input type="submit">
</form>
The form posts to the same front page and the code for posting to '/' is this:
post '/' do
  # Makes variable of POSTed url.
  @long = params[:url]
  loop do
    # makes random six letter hash
    @rand = (0...6).map{(65+rand(26)).chr}.join.downcase
    # don't generate another one if it isn't found in the database
    break if Short.first(id: "#{@rand}").nil?
  end
  # saves url and hash to database
  @input = Short.create(url: @long, id: @rand)
  # displays link with hash to be copied into browser address bar
  "http://192.168.1.3:999/"+@rand
end
The problem is that when I submit the form, it doesn't return the http://192.168.1.3:999/... link or anything I put after the @input = Short.create(... line. It doesn't raise any errors, even when raise_on_save_failure is true. If I comment that line out, it works fine (except when trying to use the shortened url).
EDIT: When I change the code to allow non-urls, it functions perfectly normally. It only breaks with the exact url format.
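Separately from the save failure, the code-generation loop in the handler can be pulled out into a small helper, which makes it easier to test on its own. A minimal sketch, assuming a `taken` collection that answers `include?` stands in for the `Short.first(id: ...)` lookup; `random_code` and `unique_code` are hypothetical names.

```ruby
require 'securerandom'
require 'set'

# Build a random lowercase code of the given length.
def random_code(length = 6)
  Array.new(length) { (97 + SecureRandom.random_number(26)).chr }.join
end

# Keep generating codes until one is not already in `taken`.
# `taken` is a stand-in for the database lookup (an assumption).
def unique_code(taken, length = 6)
  loop do
    code = random_code(length)
    return code unless taken.include?(code)
  end
end

taken = Set.new(%w[aaaaaa bbbbbb])
puts unique_code(taken)
```

This does not address why the record fails to save for URL-shaped input; it only isolates the part of the handler that does not depend on the database.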