How to build an XPath parse tree in Ruby? - ruby

I need to be able to understand the contents of specific XPath statements provided as data, which I realize is kind of unusual.
Rather than build a full-fledged XPath statement parser or fall back to regular expressions (since XPath is recursive), I was hoping there would be a Ruby XML library out there with an implementation that provides the root of an XPath parse tree. It looks like Nokogiri does not. Is there a Ruby library that does? Several searches suggest that the results are going to be about evaluating XPath statements against specific XML documents, which is not what I'm looking for.

If XPath 1.0 is enough for you, then you can use REXML's XPath parser; no extra gem is needed.
require 'rexml/parsers/xpathparser.rb'
xpath_parser = REXML::Parsers::XPathParser.new
xpath_parser.parse('//guests/person[@id="jane doe"]')
#=> [:document, :descendant_or_self, :node, :child, :qname, "", "guests", :child, :qname, "", "person", :predicate, [:eq, [:attribute, :qname, "", "id"], [:literal, "jane doe"]]]
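The result is a flat array in which each :qname marker is followed by a prefix and a local name, with predicates nested as sub-arrays. As a rough sketch of how such a result could be inspected, the snippet below collects every name in the expression. The helper name qnames is made up for illustration, and the array layout is assumed from the output shown above rather than taken from REXML's documentation.

```ruby
require 'rexml/parsers/xpathparser'

# Hypothetical helper: collect the local name that follows every :qname marker.
# Assumes the layout [:qname, prefix, name], as seen in the parser output above.
def qnames(tree, out = [])
  i = 0
  while i < tree.length
    item = tree[i]
    if item == :qname
      out << tree[i + 2] # tree[i + 1] is the namespace prefix
      i += 3
    else
      qnames(item, out) if item.is_a?(Array) # descend into predicates
      i += 1
    end
  end
  out
end

parsed = REXML::Parsers::XPathParser.new.parse('//guests/person[@id="jane doe"]')
names = qnames(parsed)
p names
```

Walking the array like this keeps the code independent of any XML document, which matches the goal of inspecting XPath statements as data.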

Related

Crystal xpath library or implementation

Is there any way to use XPath when parsing an HTML file?
I am looking for an equivalent of Ruby's Nokogiri, but Crystagiri does not implement it (yet?). I also tried myhtml and modest, but to no avail.
You don't need to use external libraries for this! Crystal has an XML module built in, which has xpath support.
Here's a basic example:
nodes = XML.parse_html(html_content)
nodes.xpath_nodes(query).each do |node|
  # do something
end
where html_content is your HTML as a string, and query is your xpath query.
Found one: hq by maiha.
It implements XPath by wrapping Crystal's XML module and myhtml, and it works well.
require "hq"
node = Hq.parse("<html><body><div>82 users</div></body></html>")
node.xpath("/html/body/div").text # => "82 users"
node.xpath("/html/body/div").int # => 82

Ruby Nokogiri - How to prevent Nokogiri from printing HTML character entities

I have an HTML file that I parse with Nokogiri, modify, and then write back out as HTML, like this:
htmltext = File.read('input.html')
h_doc = Nokogiri::HTML(htmltext)
# ... modify h_doc ...
File.open('output.html', 'w+') do |file|
  file.write(h_doc)
end
The question is how to prevent Nokogiri from printing HTML character entities (&lt;, &gt;, &amp;) in the final generated HTML file.
Instead of HTML character entities, I want it to print the actual characters (<, >, etc.).
As an example, it prints the HTML like this:
<title>&lt;%= ("/emailclient=sometext") %&gt;</title>
and I want it to output this:
<title><%= ("/emailclient=sometext")%></title>
So... you want Nokogiri to output incorrect or invalid XML/HTML?
Best suggestion I have, replace those sequences with something else beforehand, cut it up with Nokogiri, then replace them back. Your input is not XML/HTML, there is no point expecting Nokogiri to know how to handle it correctly. Because look:
<div>To write "&amp;", you need to write "&amp;amp;".</div>
This renders:
To write "&", you need to write "&amp;".
If you had your way, you'd get this HTML:
<div>To write "&", you need to write "&amp;".</div>
which would render as:
To write "&", you need to write "&".
Even worse is this scenario, say, in XHTML:
<div>Use the &lt;script&gt; tag for JavaScript</div>
If you replace the entities, you get an undisplayable file, due to the unclosed <script> tag:
<div>Use the <script> tag for JavaScript</div>
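The "replace beforehand, then replace back" suggestion above can be sketched in plain Ruby. This is only an illustration under assumptions: the helper names protect_erb and restore_erb are invented here, the tokens are random alphanumeric strings chosen because a parser has no reason to escape them, and the Nokogiri step itself is left as a comment.

```ruby
require 'securerandom'

# Hypothetical helper: swap every <% ... %> tag for an opaque token.
def protect_erb(source)
  saved = {}
  protected_source = source.gsub(/<%.*?%>/m) do |tag|
    token = "ERB#{SecureRandom.hex(8)}"
    saved[token] = tag
    token
  end
  [protected_source, saved]
end

# Hypothetical helper: put the original tags back after serialization.
def restore_erb(html, saved)
  # The block form of gsub inserts the tag literally (no backslash handling).
  saved.reduce(html) { |text, (token, tag)| text.gsub(token) { tag } }
end

template = '<title><%= ("/emailclient=sometext") %></title>'
safe, saved = protect_erb(template)
# ... parse `safe` with Nokogiri, modify the document, serialize it ...
restored = restore_erb(safe, saved)
puts restored
```

Because the parser only ever sees the tokens, it never has a chance to turn the template delimiters into entities.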
EDIT I still think you're trying to get Nokogiri to do something it is not designed to do: handle template HTML. I'd rather assume that your documents normally don't contain those sequences, and post-correct them:
doc.traverse do |node|
  if node.text?
    node.content = node.content.gsub(/^(\s*)(\S.+?)(\s*)$/,
                                     "\\1<%= \\2 %>\\3")
  end
end
puts doc.to_html.gsub('&lt;%=', '<%=').gsub('%&gt;', '%>')
You absolutely can prevent Nokogiri from transforming your entities. It's even a built-in feature; no voodoo or hacking needed. Be warned: I'm not a Nokogiri guru, and I've only gotten this to work when acting directly on a node inside a document, but I'm sure a little digging will show you how to do it with a standalone node too.
When you create or load your document, you need to include the NOENT option. That's it. You're done; you can now add entities to your heart's content.
It is important to note that there are about half a dozen ways to load a doc with options; below is my personal favorite.
require 'nokogiri'
noko_doc = File.open('<my/doc/path>') { |f| Nokogiri.<XML_or_HTML>(f, &:noent)}
xpath = '<selector_for_element>'
noko_doc.at_<css_or_xpath>(xpath).set_attribute('I_can_now_safely_add_preformatted_entities!', '&&&&&')
puts noko_doc.at_xpath(xpath).attributes['I_can_now_safely_add_preformatted_entities!']
>>> &&&&&
As for the usefulness of this feature: I find it incredibly useful. There are plenty of cases where you are dealing with preformatted data that you do not control, and it would be a serious pain to have to manage incoming entities just so Nokogiri could put them back the way they were.

XPath tokenize() method not recognized by msxml3.dll

I'm attempting to use the tokenize method in a SelectNodes(" ") call, to filter some things out.
I have something along the lines of:
<nodes>
<node colors="RED,BLUE,YELLOW"/>
</nodes>
And my xpath is as such:
nodes/node[not(empty(tokenize("GREEN,YELLOW,PURPLE", ",") intersect tokenize(@colors, ",")))]
Simply put, I've got two comma-delimited lists, one as an attribute and one as a "filter" for the attributes. I want to select all nodes where @colors contains, somewhere, one of the words inside of "GREEN,YELLOW,PURPLE".
I thought I had the solution for it with that XPath, but it seems either:
A: I did something wrong, or
B: The version of XML DOM document I am using does not support tokenize()
The XPath above, in a SelectNodes( ) call, throws an error message saying "msxml3.dll: Unknown method.", then points to the tokenize() method.
I tried doing setProperty("SelectionLanguage", "XPath"), but that did not seem to solve the issue either.
Is there any way for me to perform an equivalent XPath selection, without resorting to a bunch of and contains(@colors, "GREEN") and contains(@colors, "YELLOW")...?
As JLRishe says, msxml does not support XPath 2.0.
Depending on the environment that you are in there is probably third-party software you can use that supports either XPath 2.0 or XQuery 1.0 (which is a superset of XPath 2.0).
Microsoft's XML software is getting very dated and there has been little new development for 10 years now. It's time to consider alternatives.
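If switching processors is not an option, one possible workaround is to generate the chain of contains() tests instead of writing it by hand, using only the XPath 1.0 functions concat() and contains(). The sketch below is written in Ruby purely for illustration; the helper name any_token_predicate is made up, and it assumes the tokens contain no quotes or commas.

```ruby
# Hypothetical helper: build an XPath 1.0 test that is true when the
# comma-separated attribute shares at least one token with `tokens`.
# Wrapping both sides in commas avoids partial matches such as RED in DARKRED.
def any_token_predicate(attribute, tokens)
  tests = tokens.map do |token|
    %(contains(concat(",", @#{attribute}, ","), ",#{token},"))
  end
  tests.join(' or ')
end

xpath = "nodes/node[#{any_token_predicate('colors', %w[GREEN YELLOW PURPLE])}]"
puts xpath
```

The generated string can then be passed to SelectNodes( ) in place of the tokenize() version.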

How do I get the input value from a Nokogiri::XML::NodeSet?

I am looking for my input element using Nokogiri's xpath method.
It's returning an object of class Nokogiri::XML::NodeSet:
[#<Nokogiri::XML::Element:0x3fcc0e07de14 name="input" attributes=[#<Nokogiri::XML::Attr:0x3fcc0e07dba8 name="type" value="text">, #<Nokogiri::XML::Attr:0x3fcc0e07db94 name="name" value="creditInstallmentAmount">, #<Nokogiri::XML::Attr:0x3fcc0e07db44 name="style" value="width:240px">, #<Nokogiri::XML::Attr:0x3fcc0e07dae0 name="value" value="94.8">, #<Nokogiri::XML::Attr:0x3fcc0e07da18 name="readonly" value="true">]>
Is there a faster and cleaner way to get the value of input than casting this using to_s:
"<input type=\"text\" name=\"creditInstallmentAmount\" style=\"width:240px\" value=\"94.8\" readonly>"
and match with regular expressions?
A couple things will help:
Nokogiri has the at method, which is the equivalent of search(...).first, and, instead of returning a NodeSet, it returns the Node itself, making it easy to grab values from it:
require 'nokogiri'
doc = Nokogiri::HTML('<input type="text" name="creditInstallmentAmount" style="width:240px" value="94.8" readonly>')
doc.at('input')['value'] # => "94.8"
doc.at('input')['value'].to_f # => 94.8
Also, notice I'm using CSS notation, instead of XPath. Nokogiri supports both, and a lot of times the CSS is more obvious and easily readable. The at_css method is an alias to at for convenience.
Note that Nokogiri uses a little test in search and at to try to determine whether the selector is CSS or XPath, and then branches accordingly to the specific method. The test can be fooled, at which point you should use the specific CSS or XPath variant, or always use them if you're paranoid. In years of using Nokogiri I've only once encountered the situation where the code was confused.
If you want to be more explicit about which input you want, you can look into the parameters for the tag:
doc.at('input[@name="creditInstallmentAmount"]')['value'] # => "94.8"
Get familiar with the difference between search and at and their variants, and Nokogiri will really become useful to you. Learn how to access the parameters and text() nodes and you'll know 99% of what you need to know for parsing HTML and XML.
Ok, I found the answer:
.map{|node| node["value"]}.first
Ok, this works for me
require 'nokogiri'
require 'open-uri'
html = URI.open(ARGV[0]) # Kernel#open no longer opens URLs in Ruby 3.0+
doc = Nokogiri::HTML(html)
inputs = doc.search 'input'
inputs.map{|node| node['name']}
or all in one
inputs = Nokogiri::HTML(html).search('input').map{|node| node['name']}

What would be the best way to take a string of html, chop it up, and put each piece into an array?

I have a general idea of how I can do this, but can't pinpoint how exactly to get it done. I am sure it can be done with a regex of some sort. Wondering if anyone here can point me in the right direction.
If I have a string of html such as this
some_html = '<div><b>This is some BOLD text</b></div>'
I want to divide it into logical pieces, and then put those pieces into an array so I end up with a result like this
html_array = ["<div>", "<b>", "This is some BOLD text", "</b>","</div>" ]
Rather than use regex I'd use the nokogiri gem (a gem for parsing html written by Aaron Patterson - contributor to Rails and Ruby). Here's a sample of how to use it:
html_doc = Nokogiri::HTML("<html><body><h1>Mr. Belvedere Fan Club</h1></body></html>")
You can then call html_doc.children to get a nodeset and work your way from there
html_doc.children # returns a nodeset
Use an HTML parser, for instance, Nokogiri. Using SAX you can add tags/elements to the array as events are triggered.
It's not a good idea to try to regex HTML, unless you're planning to treat only a small determined subset of it.
some_html.split(/(<[^>]*>)/).reject{|x| '' == x}
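For the small, well-formed example in the question, the one-liner above can be run as is. The sketch below shows it next to an equivalent String#scan version; both are only suitable for simple markup like this, for the reasons given in the other answers.

```ruby
some_html = '<div><b>This is some BOLD text</b></div>'

# Split on tags; the capture group keeps the tags in the result,
# and reject drops the empty strings between adjacent tags.
by_split = some_html.split(/(<[^>]*>)/).reject(&:empty?)

# Alternative: scan for either a tag or a run of text between tags.
by_scan = some_html.scan(/<[^>]*>|[^<]+/)

p by_split
p by_scan
```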
