I am looking for my input element using Nokogiri's xpath method.
It's returning an object of class Nokogiri::XML::NodeSet:
[#<Nokogiri::XML::Element:0x3fcc0e07de14 name="input" attributes=[#<Nokogiri::XML::Attr:0x3fcc0e07dba8 name="type" value="text">, #<Nokogiri::XML::Attr:0x3fcc0e07db94 name="name" value="creditInstallmentAmount">, #<Nokogiri::XML::Attr:0x3fcc0e07db44 name="style" value="width:240px">, #<Nokogiri::XML::Attr:0x3fcc0e07dae0 name="value" value="94.8">, #<Nokogiri::XML::Attr:0x3fcc0e07da18 name="readonly" value="true">]>
Is there a faster and cleaner way to get the value of input than casting this using to_s:
"<input type=\"text\" name=\"creditInstallmentAmount\" style=\"width:240px\" value=\"94.8\" readonly>"
and matching it with regular expressions?
A couple things will help:
Nokogiri has the at method, which is the equivalent of search(...).first, and, instead of returning a NodeSet, it returns the Node itself, making it easy to grab values from it:
require 'nokogiri'
doc = Nokogiri::HTML('<input type="text" name="creditInstallmentAmount" style="width:240px" value="94.8" readonly>')
doc.at('input')['value'] # => "94.8"
doc.at('input')['value'].to_f # => 94.8
Also, notice I'm using CSS notation instead of XPath. Nokogiri supports both, and a lot of the time the CSS is more obvious and easily readable. The at_css method is the CSS-only counterpart of at, and at_xpath is the XPath-only one.
Note that Nokogiri uses a little test in search and at to try to determine whether the selector is CSS or XPath, and then branches accordingly to the specific method. The test can be fooled, at which point you should use the specific CSS or XPath variant, or always use them if you're paranoid. In years of using Nokogiri I've only once encountered the situation where the code was confused.
If you want to be more explicit about which input you want, you can look into the parameters for the tag:
doc.at('input[name="creditInstallmentAmount"]')['value'] # => "94.8"
Get familiar with the difference between search and at and their variants, and Nokogiri will really become useful to you. Learn how to access the parameters and text() nodes and you'll know 99% of what you need to know for parsing HTML and XML.
Ok, I found the answer:
.map{|node| node["value"]}.first
Ok, this works for me
require 'nokogiri'
require 'open-uri'
html = URI.open(ARGV[0]) # Kernel#open no longer handles URLs on Ruby 3.0+
doc = Nokogiri::HTML(html)
inputs = doc.search 'input'
inputs.map{|node| node['name']}
or all in one
inputs = Nokogiri::HTML(html).search('input').map{|node| node['name']}
I have a html which I am parsing using Nokogiri and then generating a html out of this like this
htmltext = File.read('input.html')
h_doc = Nokogiri::HTML(htmltext)
# ... modify h_doc here ...
File.open('output.html', 'w') do |file|
  file.write(h_doc)
end
Question is how to prevent Nokogiri from printing HTML character entities (&lt;, &gt;, &amp;) in the final generated html file.
Instead of HTML character entities (&lt;, &gt;, &amp;) I want to print the actual characters (<, >, etc.).
As an example it is printing the html like
<title>&lt;%= ("/emailclient=sometext") %&gt;</title>
and I want it to output like this
<title><%= ("/emailclient=sometext")%></title>
So... you want Nokogiri to output incorrect or invalid XML/HTML?
Best suggestion I have, replace those sequences with something else beforehand, cut it up with Nokogiri, then replace them back. Your input is not XML/HTML, there is no point expecting Nokogiri to know how to handle it correctly. Because look:
<div>To write "&amp;", you need to write "&amp;amp;".</div>
This renders:
To write "&", you need to write "&amp;".
If you had your way, you'd get this HTML:
<div>To write "&", you need to write "&".</div>
which would render as:
To write "&", you need to write "&".
Even worse in this scenario, say, in XHTML:
<div>Use the &lt;script&gt; tag for JavaScript</div>
if you replace the entities, you get an undisplayable file, due to the unclosed <script> tag:
<div>Use the <script> tag for JavaScript</div>
EDIT I still think you're trying to get Nokogiri to do something it is not designed to do: handle template HTML. I'd rather assume that your documents normally don't contain those sequences, and post-correct them:
doc.traverse do |node|
  if node.text?
    node.content = node.content.gsub(/^(\s*)(\S.+?)(\s*)$/,
                                     "\\1<%= \\2 %>\\3")
  end
end
puts doc.to_html.gsub('&lt;%=', '<%=').gsub('%&gt;', '%>')
You absolutely can prevent Nokogiri from transforming your entities. It's a built-in function even, no voodoo or hacking needed. Be warned, I'm not a Nokogiri guru and I've only got this to work when I'm acting directly on a node inside a document, but I'm sure a little digging can show you how to do it with a standalone node too.
When you create or load your document you need to include the NOENT option. That's it. You're done, you can now add entities to your heart's content.
It is important to note that there are about half a dozen ways to call a doc with options, below is my personal favorite method.
require 'nokogiri'
noko_doc = File.open('<my/doc/path>') { |f| Nokogiri.<XML_or_HTML>(f, &:noent)}
xpath = '<selector_for_element>'
noko_doc.at_<css_or_xpath>(xpath).set_attribute('I_can_now_safely_add_preformatted_entities!', '&&&&&')
puts noko_doc.at_xpath(xpath).attributes['I_can_now_safely_add_preformatted_entities!']
>>> &&&&&
As far as the usefulness of this feature goes... I find it incredibly useful. There are plenty of cases where you are dealing with preformatted data that you do not control and it would be a serious pain to have to manage incoming entities just so Nokogiri could put them back the way they were.
I'm having an awful time trying to use a library to parse an XML File into a hash like object, modify it, then print it back out to another XML file in Ruby. For a class I'm taking, we're supposed to use a Java JAXB like library where we convert XML into an object. We've already done SAX and DOM methods so we can't use those methods of XML de-serialization. Nokogiri helped me with both of these in Ruby.
The only problem is that besides the SIMPLE modifications I'm making to the objects, when I write to file it has drastic differences. Is there a Ruby library meant for doing just this? I've tried: ROXML, XML::Mapping, and ActiveSupport::CoreExt. The only one I can get to even run is ActiveSupport, and even then it starts putting element attributes as child elements in the output XML.
I'm willing to try out XmlSimple, but I'm curious has anyone actually had to do this before/run into the same problems? Again, I can't read in lines one at a time like SAX or build a Tree like structure like DOM, it needs to be a hash like object.
Any help is much appreciated!
You should have a look into nokogiri: http://nokogiri.org/
Then you can parse the XML like this :
xml_file = "some_path"
@xml = Nokogiri::XML(File.open(xml_file))
@xml.xpath('//listing').each do |node|
  style = node.search("style").text
end
With XPath, you can perform queries on the XML:
@xml.xpath("//listing[name='John']").first(10)
OK, I got it working. After looking at the source code of ActiveSupport::CoreExt, I found it just uses a gem called xml-simple. What's obnoxious is that the gem name, the library name in the require statement, and the class name are a mixture of hyphenated and non-hyphenated spellings. For future reference here's what I did:
# gem install xml-simple
# ^ all lowercase, hyphenated
require 'xmlsimple'
# ^ all lowercase, not hyphenated
doc = XmlSimple.xml_in 'hw3.xml', 'KeepRoot' => true
# ^ Camel cased (it's a class), not hyphenated
# doc.class => Hash
# manipulate doc as a hash
File.open('HW3a.xml', 'w') do |file| # the block form closes the file for us
  file.write("<?xml version='1.0' encoding='utf-8'?>\n")
  file.write(XmlSimple.xml_out(doc, 'KeepRoot' => true))
end
I hope this helps someone. Also make sure you pay attention to case and hyphens with this gem!!!
I have a general idea of how I can do this, but can't pinpoint how exactly to get it done. I am sure it can be done with a regex of some sort. Wondering if anyone here can point me in the right direction.
If I have a string of html such as this
some_html = '<div><b>This is some BOLD text</b></div>'
I want to divide it into logical pieces, and then put those pieces into an array so I end up with a result like this
html_array = ["<div>", "<b>", "This is some BOLD text", "</b>","</div>" ]
Rather than use regex I'd use the nokogiri gem (a gem for parsing html written by Aaron Patterson - contributor to Rails and Ruby). Here's a sample of how to use it:
require 'nokogiri'
html_doc = Nokogiri::HTML("<html><body><h1>Mr. Belvedere Fan Club</h1></body></html>")
You can then call html_doc.children to get a nodeset and work your way from there
html_doc.children # returns a nodeset
Use an HTML parser, for instance, Nokogiri. Using SAX you can add tags/elements to the array as events are triggered.
It's not a good idea to try to regex HTML, unless you're planning to treat only a small determined subset of it.
some_html.split(/(<[^>]*>)/).reject{|x| '' == x}
require 'nokogiri'
doc = Nokogiri::XML "<root>
<a>foo<c>bar</c></a>
<b>jim<d>jam></d></b>
<a>more</a>
<x>no no no</x>
</root>"
doc.css("a, b").each {|o| p o.to_s}
# "<a>foo<c>bar</c></a>"
# "<a>more</a>"
# "<b>jim<d>jam></d></b>"
How can I keep tags in their original order? Or also remove nested tags?
You might want to look at whitelist/blacklist/scrubbing gems. Sanitize and Loofah come to mind.
From Sanitize's description:
Given a list of acceptable elements and attributes, Sanitize will remove all unacceptable HTML from a string.
From Loofah's description:
Loofah excels at HTML sanitization (XSS prevention). It includes some nice HTML sanitizers, which are based on HTML5lib’s whitelist, so it most likely won’t make your codes less secure. (These statements have not been evaluated by Netexperts.)
In either case, they'll save you from reinventing a wheel.
require 'nokogiri'
doc = Nokogiri::XML "
<root>
<a>foo<c>bar</c></a>
<b>jim<d>jam></d></b>
<a>more</a>
<x>no no no</x>
</root>"
doc.xpath('root//*[name()!="a"][name()!="b"]').remove
puts doc
#=> <?xml version="1.0"?>
#=> <root>
#=> <a>foo</a>
#=> <b>jim</b>
#=> <a>more</a>
#=>
#=> </root>
If this is just an issue of order and none of the tags you need to isolate are nested, using XPath instead of CSS selectors in Nokogiri should return the tags in the same order they are in the document:
doc.xpath("//a | //h3").each { |o| puts o }
I'm not sure if this behavior is in any spec for Nokogiri, so you may want to be careful, but in my experience it is true.
Of course, if the tags you're after are ever nested you may need to define what it means to "remove all but certain tags" (e.g. what happens to removed tags and their contents that exist inside non-removed tags and their contents, etc.).
If your requirement is sufficiently complicated such that XPath queries won't cut it, you may need to "walk the DOM" using something like doc.root.children and recursively examine the children of each node.
I'm trying to extract elements with an attribute, and not extract the descendants separately that have the same attribute.
Using the following html:
<html><body>
<div box>
some text
<div box>
some more text
</div>
</div>
<div box>
this needs to be included as well
</div>
</body></html>
I want to be able to extract the two outer <div box> and its descendants including the inner <div box>, but don't want to have the inner <div box> extracted separately.
I have tried using all sorts of different expressions but think I am missing something quite fundamental. The main expression I have been trying is: //[@box and not(ancestor::@box)] but this still returns two elements.
I am trying to do this using the 'Hpricot' (0.8.3) Gem in Ruby 1.9.2 as follows:
# Assuming html is set to the html above
doc = Hpricot(html)
elements = doc.search('//[@box and not(ancestor::@box)]')
# The following is returning 3 instead of 2
elements.size
Any help on this would be great.
Your XPath is invalid. You have to address something in order to use the predicate filter (e.g. []). Otherwise, there isn't anything to filter.
This XPath works:
//div[@box and not(ancestor::div/@box)]
If the elements aren't all guaranteed to be <div>, you can use a more generic match for elements:
//*[@box and not(ancestor::*/@box)]
Using elements = doc.search('//[@box and not(ancestor::@box)]') isn't correct.
Use elements = doc.at('//div[@box]') which will find the first occurrence.
I'd recommend using Nokogiri over Hpricot. Nokogiri is well supported, very flexible and more robust.
EDIT: Added because original question changed:
Thanks that worked perfectly, except I forget to mention that I want to return multiple outer elements. Sorry about that, I have updated the question. I will look into Nokogiri further, I didn't choose it originally because Hpricot seemed more approachable.
Remember that XPath acts like accessing a file in a directory at its simplest form, so you can drill down and search in "subdirectories". If you only want the outer <div> tags, then look inside the <body> level and no further:
doc.search('/html/body/div')
or, if you might have unadorned div tags along with the targets:
doc.search('/html/body/div[@box]')
Regarding Hpricot seeming more approachable:
Nokogiri implements a superset of Hpricot's accessors, allowing you to drop it into place for most uses. It supports XPath and CSS accessors allowing more intuitive ways of getting at data if you live in CSS and HTML and don't grok XPath. In addition there are many methods to find your desired target:
doc.search('body > div[box]')
(doc / 'body > div[box]')
doc.css('body > div[box]')
Nokogiri also supports the at and % synonyms found in Hpricot, along with at_css, if you only want the first occurrence of something.
I started using Nokogiri after running into some situations where Hpricot exploded because it couldn't handle malformed news feeds in the wild.