Is there a Nokogiri example code for parsing Acrobat XFDF? - ruby

I am looking for a Ruby code snippet that shows how to use Nokogiri to parse Acrobat XFDF data.

It's no different than parsing any other XML:
require 'nokogiri'
xfdf = '<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
  <f href="Demo PDF Form.pdf"/>
  <fields>
    <field name="Date of Birth">
      <value>01-01-1960</value>
    </field>
    <field name="Your Name">
      <value>Mr. Customer</value>
    </field>
  </fields>
  <ids original="FEBDB19E0CD32274C16CE13DCF244AD2" modified="5BE74DD4F607B7409DC03D600E466E12"/>
</xfdf>
'
doc = Nokogiri::XML(xfdf)
doc.at('//xmlns:f')['href'] # => "Demo PDF Form.pdf"
doc.at('//xmlns:field[@name="Date of Birth"]').text # => "\n      01-01-1960\n    "
doc.at('//xmlns:field[@name="Your Name"]').text # => "\n      Mr. Customer\n    "
The document uses an XML namespace, so you either have to honor it in the XPath expressions or tell Nokogiri to ignore namespaces. This is common in XML.

You can use the nguyen gem, which builds XFDF from a hash of field values:
xfdf = Nguyen::Xfdf.new(:key => 'value', :other_key => 'other value')
# use to_xfdf if you just want the XFDF data, without writing it to a file
puts xfdf.to_xfdf
# write xfdf file
xfdf.save_to('path/to/file.xfdf')

Related

Nokogiri XSLT tagging document as XML type when using JSON

I am using Nokogiri to transform an XML document to JSON. The code is straightforward:
@document = Nokogiri::XML(entry.data)
xslt = Nokogiri::XSLT(File.read("#{File.dirname(__FILE__)}/../../xslt/my.xslt"))
transform = xslt.transform(@document)
entry in this case is a Mongoid based model and data is an XML blob attribute stored as a string on MongoDB.
When I dump the contents of transform, the JSON is there. The problem is, Nokogiri is tagging the top of the document with:
<?xml version="1.0"?>
What's the correct way of addressing that?
Try the #apply_to method, as below:
require 'nokogiri'
doc = Nokogiri::XML('<?xml version="1.0"?><root />')
xslt = Nokogiri::XSLT("<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'/>")
puts xslt.transform(doc)
puts "######"
puts xslt.apply_to(doc)
# >> <?xml version="1.0"?>
# >> ######
# >>

How to navigate a XML object in Ruby

I have a regular xml object created from a response of a web service.
I need to get some specific values from some specific keys... for example:
<tag>
<tag2>
<tag3>
<needThisValue>3</needThisValue>
<tag4>
<needThisValue2>some text</needThisValue2>
</tag4>
</tag3>
</tag2>
</tag>
How can I get <needThisValue> and <needThisValue2> in Ruby?
I'm a big fan of Nokogiri:
xml = <<EOT
<tag>
<tag2>
<tag3>
<needThisValue>3</needThisValue>
<tag4>
<needThisValue2>some text</needThisValue2>
</tag4>
</tag3>
</tag2>
</tag>
EOT
This creates a document for parsing:
require 'nokogiri'
doc = Nokogiri::XML(xml)
Use at to find the first node matching the accessor:
doc.at('needThisValue2').class # => Nokogiri::XML::Element
Or search to find all nodes matching the accessor as a NodeSet, which acts like an Array:
doc.search('needThisValue2').class # => Nokogiri::XML::NodeSet
doc.search('needThisValue2')[0].class # => Nokogiri::XML::Element
This uses a CSS accessor to locate the first instance of each node:
doc.at('needThisValue').text # => "3"
doc.at('needThisValue2').text # => "some text"
Again with the NodeSet using CSS:
doc.search('needThisValue')[0].text # => "3"
doc.search('needThisValue2')[0].text # => "some text"
You can use XPath accessors instead of CSS if you want:
doc.at('//needThisValue').text # => "3"
doc.search('//needThisValue2').first.text # => "some text"
Go through the tutorials to get a jumpstart. It's very powerful and quite easy to use.
require "rexml/document"
include REXML
doc = Document.new string
puts XPath.first(doc, "//tag/tag2/tag3/needThisValue").text
puts XPath.first(doc, "//tag/tag2/tag3/tag4/needThisValue2").text
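For reference, a self-contained version of the REXML approach might look like the sketch below. It assumes the XML from the question is held in a local string, and it uses fully qualified names instead of `include REXML` to avoid pulling REXML's constants into the top-level namespace:

```ruby
require 'rexml/document'

xml = <<~XML
  <tag>
    <tag2>
      <tag3>
        <needThisValue>3</needThisValue>
        <tag4>
          <needThisValue2>some text</needThisValue2>
        </tag4>
      </tag3>
    </tag2>
  </tag>
XML

doc = REXML::Document.new(xml)

# A full path from the root, and a shorter "anywhere in the document" path.
value1 = REXML::XPath.first(doc, '//tag/tag2/tag3/needThisValue').text
value2 = REXML::XPath.first(doc, '//needThisValue2').text
```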
Try this Nokogiri tutorial.
You'll need to install nokogiri gem.
Good luck.
Check out the Nokogiri gem and read some of its tutorials. It's fast and simple.

How do I validate specific attributes in XML using Ruby's REXML?

I'm trying to read some XML I've retrieved from a web service, and validate a specific attribute within the XML.
This is the XML up to the tag that I need to validate:
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body>
<QueryResponse xmlns="http://tempuri.org/">
<QueryResult xmlns:a="http://schemas.datacontract.org/2004/07/Entity"
xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<a:Navigation i:nil="true" />
<a:SearchResult>
<a:EntityList>
<a:BaseEntity i:type="a:Product">
<a:ExtractDateTime>1290398428</a:ExtractDateTime>
<a:ExtractDateTimeFormatted>11/22/2010
04:00:28</a:ExtractDateTimeFormatted>
Here's the code I have thus far using REXML in Ruby:
require 'xmlsimple'
require 'rexml/document'
require 'rexml/streamlistener'
include REXML
class Listener
include StreamListener
xmlfile = File.new("rbxml_CS_Query.xml")
xmldoc = Document.new(xmlfile)
# Now get the root element
root = xmldoc.root
puts root.attributes["a:EntityList"]
# This will output the date/time of the query response
xmldoc.elements.each("a:BaseEntity"){
|e| puts e.attributes["a:ExtractDateTimeFormatted"]
}
end
I need to validate that ExtractDateTimeFormatted is there and has a valid value for that attribute. Any help is greatly appreciated. :)
Reading from local xml file.
File.open('temp.xml', 'w') { |f|
f.puts request
f.close
}
xml = File.read('temp.xml')
doc = Nokogiri::XML::Reader(xml)
extract_date_time_formatted = doc.at(
'//a:ExtractDateTimeFormatted',
'a' => 'http://schemas.datacontract.org/2004/07/Entity'
).inner_text
show = DateTime.strptime(extract_date_time_formatted, '%m/%d/%Y')
puts show
When I run this code I get an error on line 21: "undefined method 'at' for #"
Are you tied to REXML or can you switch to Nokogiri? I highly recommend Nokogiri over the other Ruby XML parsers.
I had to add enough XML tags to make the sample validate.
require 'date'
require 'nokogiri'
xml = %q{<?xml version="1.0"?>
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body>
<QueryResponse xmlns="http://tempuri.org/">
<QueryResult xmlns:a="http://schemas.datacontract.org/2004/07/Entity" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<a:Navigation i:nil="true"/>
<a:SearchResult>
<a:EntityList>
<a:BaseEntity i:type="a:Product">
<a:ExtractDateTime>1290398428</a:ExtractDateTime>
<a:ExtractDateTimeFormatted>11/22/2010</a:ExtractDateTimeFormatted>
</a:BaseEntity>
</a:EntityList>
</a:SearchResult>
</QueryResult>
</QueryResponse>
</s:Body>
</s:Envelope>
}
doc = Nokogiri::XML(xml)
extract_date_time_formatted = doc.at(
'//a:ExtractDateTimeFormatted',
'a' => 'http://schemas.datacontract.org/2004/07/Entity'
).inner_text
puts DateTime.strptime(extract_date_time_formatted, '%m/%d/%Y')
# >> 2010-11-22T00:00:00+00:00
There are a couple of things going on that could make this harder to handle than a simple XML file.
The XML is using namespaces. They are useful, but you have to tell the parser how to handle them. That is why I had to add the second parameter to the at() accessor.
The date value is in a format that is often ambiguous. For many days of the year it is hard to tell whether it is mm/dd/yyyy or dd/mm/yyyy. Here in the U.S. we assume it's the first, but Europe uses the second. The DateTime parser tries to figure it out but often gets it wrong, especially when it thinks it's supposed to be dealing with a month 22. So, rather than let it guess, I told it to use mm/dd/yyyy format. If a date doesn't match that format, or the date's values are out of range, Ruby will raise an exception, so you'll need to code for that.
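Coding for that exception could look something like the sketch below. The helper name `parse_us_date` is hypothetical, and it assumes that returning `nil` is an acceptable way to signal a missing or invalid value:

```ruby
require 'date'

# Returns a Date for a strict mm/dd/yyyy string, or nil when it does not parse.
def parse_us_date(text)
  Date.strptime(text.strip, '%m/%d/%Y')
rescue ArgumentError # in recent Rubies, Date::Error is a subclass of ArgumentError
  nil
end

valid   = parse_us_date('11/22/2010')
invalid = parse_us_date('22/11/2010') # month 22 is out of range
```

A `nil` result then doubles as the validation failure for the attribute check.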
This is an example of how to retrieve and parse XML on the fly:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(URI.open('http://java.sun.com/developer/earlyAccess/xml/examples/samples/book-order.xml'))
puts doc.class
puts doc.to_xml
And an example of how to read a local XML file and parse it:
require 'nokogiri'
doc = Nokogiri::XML(File.read('test.xml'))
puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <root xmlns:foo="bar">
# >> <bar xmlns:hello="world"/>
# >> </root>

How to generate XML file using Ruby and Builder::XMLMarkup templates?

As you all know, with Rails it is possible to use Builder::XMLMarkup templates to provide an HTTP response in XML format instead of HTML (with the respond_to command). My problem is that I would like to use the Builder::XMLMarkup templating system not with Rails but with Ruby only (i.e. a standalone program that generates/outputs an XML file from an XML template). The question is then twofold:
How do I tell the Ruby program which is the template I want to use? and
How do I tell the Builder class which is the output XML file ?
There is already a similar answer on Stack Overflow (How do I generate XML from XMLBuilder using a .xml.builder file?), but I am afraid it is only valid for Rails.
Here's a simple example showing the basics:
require 'builder'
@received_data = {:books => [{ :author => "John Doe", :title => "Doeisms" }, { :author => "Jane Doe", :title => "Doeisms II" }]}
@output = ""
xml = Builder::XmlMarkup.new(:target => @output, :indent => 1)
xml.instruct!
xml.books do
  @received_data[:books].each do |book|
    xml.book do
      xml.title book[:title]
      xml.author book[:author]
    end
  end
end
The @output object will contain your xml markup:
<?xml version="1.0" encoding="UTF-8"?>
<books>
<book>
<title>Doeisms</title>
<author>John Doe</author>
</book>
<book>
<title>Doeisms II</title>
<author>Jane Doe</author>
</book>
</books>
The Builder docs at github.com provide more examples and links to more documentation.
To select a specific template, you could pass arguments to your program for this decision.
Anyway, I prefer to use libxml-ruby to parse and build XML documents, but that's a matter of taste.
I used Tilt to do (the first part of) this. It's really easy:
require 'builder'
require 'tilt'
template = Tilt.new('templates/foo.builder')
output = template.render
That will get you a string representation of your xml. At that point you can write it out to disk yourself.

Get element text from XML doc

I'm trying to extract some information from XML from Weather Underground.
I can open the resource and pull out the desired elements, but I really want to return the element text as a variable, without the containing XML element tags, so I can manipulate it and display it on a web page.
Perhaps there is a way to do this using regexp to strip off the tags, but I suspect/hope I can do this in a more elegant fashion directly in Nokogiri.
Currently I am using irb to work out the syntax:
irb>require 'rubygems'
irb>require 'nokogiri'
irb>require 'open-uri'
irb>doc = Nokogiri::XML(open('http://api.wunderground.com/auto/wui/geo/WXCurrentObXML/index.xml?query=KBHB'))
=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
=> <?xml version="1.0"?>
# [...]
<!-- 0.036:0 -->
irb>doc.xpath('/current_observation/weather')
=> <weather>Clear</weather>
irb>doc.xpath('/current_observation/wind_dir')
=> <wind_dir>North</wind_dir>
irb>doc.xpath('/current_observation/wind_mph')
=> <wind_mph>10</wind_mph>
irb>doc.xpath('/current_observation/pressure_string')
=> <pressure_string>31.10 in (1053 mb)</pressure_string>
I need help with the specific syntax while using constructs such as:
doc.xpath.element('/current_observation/weather')
doc.xpath.text('/current_observation/weather')
doc.xpath.node('/current_observation/weather')
doc.xpath.element.text('/current_observation/weather')
All return errors.
As per XPath, you can return the text node of an element with text().
In your example it should be doc.xpath('/current_observation/weather/text()') to get the content of weather's text node.
Something like this works for me:
irb(main):019:0> doc.xpath('//current_observation/weather').first.content
=> "Clear"
One of the nice things about Nokogiri is its flexibility when writing accessors. You're not limited to XPath; you can use CSS accessors instead:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(URI.open('http://api.wunderground.com/auto/wui/geo/WXCurrentObXML/index.xml?query=KBHB'))
weather_report = %w[weather wind_dir wind_mph pressure_string].inject({}) { |h, n|
h[n.to_sym] = doc.at('current_observation ' << n).text
h
}
weather_report # => {:weather=>"Overcast", :wind_dir=>"South", :wind_mph=>"6", :pressure_string=>"29.67 in (1005 mb)"}
