How to parse a DTD file in Ruby - ruby

I was trying to convert a DTD file to a YAML file, and I've tried loading it both in libXML and Nokogiri, but it seems that a DTD file is not a valid XML file. I'm fine with using any third-party gems as long as I can parse the DTD file.
My attempt at conversion:
wget "http://xml.evernote.com/pub/enml2.dtd"
irb
require 'nokogiri'
xml = Nokogiri::XML::Document.parse('enml2.dtd')
xml.to_yaml
=> "--- !ruby/object:Nokogiri::XML::Document\ndecorators: \nnode_cache: []\nerrors:\n- !ruby/exception:Nokogiri::XML::SyntaxError\n message: |\n Start tag expected, '<' not found\n domain: 1\n code: 4\n level: 3\n file: \n line: 1\n str1: \n str2: \n str3: \n int1: 0\n column: 1\n"
Any online XML validator also returns the error "Start tag expected". I assume it is because all valid XML docs start with <?xml, which DTD files seem to be missing. This is what has led me to the conclusion that all DTD files are invalid XML files, however, it does feel weird that the XML definition syntax itself was not defined as valid XML. Why?
I'm parsing the DTD file to remove invalid attributes from an XML file, to know which attributes to keep and which to remove, so I need a way to parse the DTD file.
And ultimately, this is all just a step in trying to convert HTML to ENML (Evernote Markup Language). The steps involved in it include:
1. Converting HTML to valid XHTML
2. Converting the body to an en-note element
3. Removing invalid tags and attributes as per the DTD file
4. Validating the ENML file against the DTD
I'm currently thinking to just copy the disallowed attributes and tags from "Understanding the Evernote Markup Language" and using that to validate my XHTML, but I'd prefer to use the DTD as my source.
The Nokogiri DTD class is a Node class for holding an inline DTD node and validating against it. In my case, I have an external DTD file specified using the SYSTEM attribute, which Nokogiri does not seem to support. And even if it did work, all I would get is validation.
I did get validation to work properly using:
@dtd = XML::Dtd.new File.read Rails.root.join('lib', 'assets','enml2.dtd')
@enml_document = XML::Document.string enml
@ret = @enml_document.validate @dtd
I haven't tried REXML. I will give that a go and report back.
I'm trying to convert an HTML document to a XML document that validates with the given DTD. Most HTML elements and attributes are not allowed in the ENML schema, so I have to strip them, or remove them. I also need to know which attributes are allowed and which are not, so that I can parse the XML properly and remove/sanitize the offending elements and attributes.
For the cleanup purpose, I'm using Loofah, but to use it, I need a list of tag->attributes (which attributes are available for each tag). Instead of making multiple passes validating the doc, which I am doing at the end of cleanup, I'm just looping through each XML tag, and cleaning them up. But to know how to clean them, I need to know which tags and elements are supported in the valid schema. Thus, I need to parse the DTD file.
From what I understand, XSLT is the right tool for the job, but I'm not comfortable enough to use it.

However, it does feel weird to me that the xml definition syntax itself was not defined as valid XML. I'd love to know any reasons behind this.
DTDs are a holdover from SGML, the precursor of XML, so it is actually not very strange that DTDs are not XML files. Keeping DTDs and their particular syntax was a deliberate decision when XML was created.
More modern schema languages such as W3C XML Schema and RELAX NG do use XML syntax.
The reason I'm parsing the DTD file is that I want to remove invalid attributes from an XML file. To know which attributes to keep and which to remove, I need a way to parse the DTD file. (from question)
I am just looking for a way to parse DTD files, not just validate using them, because I want to perform custom cleanup and validation using the dtd. (from bounty text)
I don't really understand what you mean by "custom cleanup". I also don't see the point in trying to parse the DTD in the first place.
In order to find out if any elements or attributes in an XML file are invalid (if they break the rules in an associated DTD), you need to parse the XML file using a validating XML parser. The parser will then tell you if there are any errors that need to be fixed.
Nokogiri is based on libxml2 which provides a validating parser. It does support external DTDs that are specified using <!DOCTYPE foo SYSTEM "bar.dtd"> syntax (how to make this work is shown in a comment on the issue that you refer to: https://github.com/sparklemotion/nokogiri/issues/440#issuecomment-3031164).
Here is how the validation can be done:
require 'nokogiri'
xml = File.read("yourfile.xml")
options = Nokogiri::XML::ParseOptions::DTDLOAD # Needed for the external DTD to be loaded
doc = Nokogiri::XML::Document.parse(xml, nil, nil, options)
puts doc.external_subset.validate(doc)
If there is no output from this code, then the XML document is valid against the DTD.
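If a tag-to-attributes list is still wanted for the cleanup pass described in the question, one rough option is to read the ATTLIST declarations directly. What follows is a minimal, regex-based sketch, not a real DTD parser. It assumes parameter entities are declared inline with quoted values and that no default value contains ">". The method name dtd_attributes and the sample DTD are made up for illustration.

```ruby
# Rough sketch: collect the attribute names declared for each element in a DTD.
# Regex-based and deliberately simplified; conditional sections, external
# entities and '>' inside default values are not handled.
def dtd_attributes(dtd_text)
  text = dtd_text.gsub(/<!--.*?-->/m, '') # drop comments

  # Parameter entities: <!ENTITY % name "replacement text">
  entities = {}
  text.scan(/<!ENTITY\s+%\s+([\w.-]+)\s+(["'])(.*?)\2\s*>/m) do |name, _quote, value|
    entities[name] ||= value # the first declaration wins, as in XML
  end

  # Expand %name; references; entities may nest, so repeat a bounded number of times.
  expand = lambda do |string|
    10.times do
      expanded = string.gsub(/%([\w.-]+);/) { entities.fetch(Regexp.last_match(1), '') }
      break if expanded == string
      string = expanded
    end
    string
  end

  allowed = Hash.new { |hash, element| hash[element] = [] }
  text.scan(/<!ATTLIST\s+(\S+)\s+(.*?)>/m) do |element, body|
    # Tokens: enumerations "(a|b)", quoted defaults, or bare words.
    tokens = expand.call(body).scan(/\([^)]*\)|"[^"]*"|'[^']*'|[^\s()"']+/)
    i = 0
    while i < tokens.length
      allowed[element] << tokens[i]      # attribute name
      i += 1
      i += 1 if tokens[i] == 'NOTATION'  # keyword precedes its enumeration
      i += 1                             # attribute type
      i += 1 if tokens[i] == '#FIXED'    # #FIXED is followed by a value
      i += 1                             # default declaration
    end
  end
  allowed
end

# Hypothetical sample, loosely shaped like an XHTML-style DTD.
sample_dtd = <<~DTD
  <!-- shared attributes -->
  <!ENTITY % coreattrs "style CDATA #IMPLIED title CDATA #IMPLIED">
  <!ENTITY % attrs "%coreattrs; lang NMTOKEN #IMPLIED">
  <!ELEMENT a (#PCDATA)>
  <!ATTLIST a
    %attrs;
    href  CDATA #IMPLIED
    shape (rect|circle) "rect">
  <!ELEMENT br EMPTY>
  <!ATTLIST br %coreattrs; clear (left|all|right|none) "none">
DTD

p dtd_attributes(sample_dtd)
```

The resulting hash maps each element name to the attributes its ATTLIST declares, which is the shape a per-tag cleanup loop needs. The validating parser remains the authority on whether the final document is valid.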

Related

Ruby: parsing message from confluence xml macro

I am trying to parse the message that says "this is a test"
<ac:structured-macro ac:name="warning"><ac:rich-text-body><strong>High</strong> This is a test!</ac:rich-text-body></ac:structured-macro>
I am using nokogiri in ruby and was able to parse this much and nothing else. To get this far, my code looks something like this:
xml = Nokogiri::XML(response)
body = xml.at("body").text
alert_body = alert[3]
I have wasted too many hours looking in the confluence rest api documentation and google for just general xml parsing.
The problems are:
There is no body tag in your example XML.
You're dealing with XML-Namespaces so your selector needs to change.
Your XML sample is incomplete since it's missing the line that would define the namespaces, so this is a bit of a hack but should give you an idea what needs to be done:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<foo xmlns:ac="http://www.w3.org/2005/Atom">
<ac:structured-macro ac:name="warning"><ac:rich-text-body><strong>High</strong> This is a test!</ac:rich-text-body></ac:structured-macro>
</foo>
EOT
doc.at('ac|rich-text-body').text # => "High This is a test!"
Namespaces are useful but they can be a major pain in the neck. Nokogiri makes it pretty easy to deal with them, especially when using CSS selectors. Read Nokogiri's "Searching an HTML / XML Document" page's "Namespaces" section for more information.
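For comparison, here is a hedged sketch of the same lookup using XPath with an explicit prefix-to-URI map, written against REXML so it runs without extra gems. The namespace URI is the same placeholder used in the hack above, not the one Confluence actually declares.

```ruby
require 'rexml/document'

xml = <<~XML
  <foo xmlns:ac="http://www.w3.org/2005/Atom">
  <ac:structured-macro ac:name="warning"><ac:rich-text-body><strong>High</strong> This is a test!</ac:rich-text-body></ac:structured-macro>
  </foo>
XML

doc = REXML::Document.new(xml)

# Map the prefix used in the XPath to the namespace URI declared in the document.
namespaces = { 'ac' => 'http://www.w3.org/2005/Atom' }
body = REXML::XPath.first(doc, '//ac:rich-text-body', namespaces)

# Element#text returns only the first direct text node, so the <strong>
# child is skipped and just the message after it remains.
message = body.text.strip
puts message
```

Because only the first direct text node is read, this yields the message without the "High" label.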

How to create an XSLT Filter to extract all contents of an ODT (containing an XForm) to XHTML

I'm trying to use the Export feature of OpenOffice Writer to create an XHtml File from an ODT containing an XForm.
What I noticed was that the XForm Model was not getting exported. I copied the default XSL file used and I changed the "xsl:stylesheet" node's "exclude-result-prefixes" to an empty string.
The output was the same. I searched the internet for more help and came across
https://issues.apache.org/ooo/show_bug.cgi?id=87731
The "xsl:template" tags provided there helped in exporting most of the content:
1. The XForm instance, model, binding etc.
However the actual controls were still missing...
I believe the trick lies in "xsl:template" tags, but have no documentation to understand how the export feature uses them.
Any ideas on this?
Decide which XHTML element maps to each missing XForms control, then create an xsl:template matching each control which includes the desired XHTML output. Here are a few examples of similar conversion stylesheets:
Atom2HTML
RDF2HTML(View Source)
XML2XHTML
XSLTForms is an excellent reference as well.

Can Nokogiri retain attribute quoting style?

Here is the contents of my file (note the nested quotes):
<?xml version="1.0" encoding="utf-8"?>
<property name="eventData" value='{"key":"value"}'/>
in Ruby I have:
file = File.read(settings.test_file)
@xml = Nokogiri::XML(file)
puts "@xml " + @xml.to_s
and here is the output:
<property name="eventData" value="{&quot;key&quot;:&quot;value&quot;}"/>
Is there a way to convert it so the output would preserve the quotes exactly? i.e. single on the outside, double on the inside?
No, it cannot. There is no information stored in a Nokogiri::XML::Attr (nor the underlying data structure in libxml2) about what type of quotes were (or should be) used to delimit an attribute. As such, all serialization (done by libxml2) uses the same attribute quoting style.
Indeed, this information is not even properly retained within the XML Information Set, as described by the specs:
Appendix D: What is not in the Information Set
The following information is not represented in the current version of the XML Information Set (this list is not intended to be exhaustive):
[...]
17) The kind of quotation marks (single or double) used to quote attribute values.
The good news is that the two XML serialization styles describe the exact same content. The bad news is that unless you're using a Canonical XML serialization (which Nokogiri has only recently become able to produce), there is a large variety of ways to represent the same document, and these would look like many spurious 'changes' to a standard text-diffing tool.
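A small sketch of the "same content" point, using REXML so it runs without extra gems: both quoting styles parse to an identical attribute value. The two strings are the input and output from the question.

```ruby
require 'rexml/document'

single = %q(<property name="eventData" value='{"key":"value"}'/>)
double = %q(<property name="eventData" value="{&quot;key&quot;:&quot;value&quot;}"/>)

# Read the attribute value from each serialization.
values = [single, double].map do |xml|
  REXML::Document.new(xml).root.attributes['value']
end

p values
puts values.uniq.length == 1 # true when both styles carry the same value
```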
Perhaps if you can describe why you wanted this functionality (what is the end goal you are trying to accomplish?) we could help you further.
You might also be interested in this similar question.

How can I render XML character entity references in Ruby?

I am reading some data from an XML webservice with Ruby, something like this:
<phrases>
<phrase language="en_US">&iexcl;I&apos;m highly annoyed with character references!</phrase>
</phrases>
I'm parsing the XML and grabbing an array of phrases. As you can see, the phrase text contains some XML character entity references. I'd like to replace them with the actual character being referenced. This is simple enough with the numeric references, but nasty with the XML and HTML ones. I'd like to avoid having a big hash in my code that holds the character for each XML or HTML character reference, i.e. http://www.java2s.com/Code/Java/XML/Resolvesanentityreferenceorcharacterreferencetoitsvalue.htm
Surely there's a library for this out there, right?
Update
Yes, there is a library out there, and it's called HTMLEntities:
: jmglov@laurana; sudo gem install htmlentities
Successfully installed htmlentities-4.2.4
: jmglov@laurana; irb
irb(main):001:0> require 'htmlentities'
=> []
irb(main):002:0> HTMLEntities.new.decode "&iexcl;I&apos;m highly annoyed with character references!"
=> "¡I'm highly annoyed with character references!"
REXML can do it, though it won't handle "&iexcl;" or "&nbsp;". The list of predefined XML entities (aside from Unicode numeric entities) is actually quite small. See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
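Because the predefined list is that small, it can also be handled by hand. Here is a minimal sketch in plain Ruby that decodes the five predefined XML entities plus numeric character references and leaves anything else (such as HTML-only names) untouched. The names XML_PREDEFINED and decode_xml_entities are made up for this example.

```ruby
# The five entities predefined by XML itself.
XML_PREDEFINED = {
  'lt' => '<', 'gt' => '>', 'amp' => '&', 'quot' => '"', 'apos' => "'"
}.freeze

# Decode predefined entities and numeric references in a single pass,
# so that "&amp;lt;" becomes "&lt;" rather than "<".
def decode_xml_entities(text)
  text.gsub(/&(?:#x(\h+)|#(\d+)|(\w+));/) do |match|
    hex, dec, name = Regexp.last_match.captures
    if hex
      [hex.hex].pack('U')              # hexadecimal reference, e.g. &#xA9;
    elsif dec
      [dec.to_i].pack('U')             # decimal reference, e.g. &#169;
    else
      XML_PREDEFINED.fetch(name, match) # unknown names are left as-is
    end
  end
end

puts decode_xml_entities('&quot;I&apos;m highly annoyed&#169; &iexcl;')
```

HTML-only names such as &iexcl; pass through unchanged here, which is where a gem like HTMLEntities earns its keep.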
Given this input XML:
<phrases>
<phrase language="en_US">&quot;I&apos;m highly annoyed with character references!&#169;</phrase>
</phrases>
you can parse the XML and the embedded entities like this (for example):
require 'rexml/document'
doc = REXML::Document.new(File.read('/tmp/foo.xml'))
phrase = REXML::XPath.first(doc, '//phrases/phrase')
text = phrase.first # Type is REXML::Text
puts(text.value)
Obviously, that example assumes that the XML is in file /tmp/foo.xml. You can just as easily pass a string of XML. On my Mac and Ubuntu systems, running it produces:
$ ruby /tmp/foo.rb
"I'm highly annoyed with character references!©
This isn't an attempt to provide a solution, it's to relate some of my own experiences dealing with XML from the wild. I was using Perl at first, then later using Ruby, and the experiences are something you can encounter easily if you grab enough XML or RDF/RSS/Atom feeds.
I've often seen XML CDATA contain HTML, both encoded and unencoded. The encoded HTML was probably the result of someone doing things the right way, via some API or library to generate XML. The unencoded HTML was probably someone using a script to wrap the HTML with tags, resulting in invalid XML, but I had to deal with it anyway.
I've also seen XML CDATA containing HTML that had been encoded multiple times, requiring me to unencode everything, even after the XML engine had done its thing. Sometimes during an intermediate pass I'd suddenly have non-UTF8 characters in the string along with encoded ones, as a result of someone appending comments or joining multiple HTML streams together that were from different character-sets. For whatever the reason, it was really ugly and caused XML parsing to break or emit a lot of warnings. I'd have to loop over the content, decoding and checking to see if the previous pass was the same as the current decoding pass, and bailing if nothing had changed. There was no guarantee I'd have a string in a valid character-set at the time though, so I'd have to tell iconv to convert it to UTF8 and throw away characters that wouldn't convert cleanly.
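The decode-until-nothing-changes loop described above could look roughly like this. It is a sketch under assumptions: the decoder is a tiny stand-in rather than a full entity decoder, the pass limit is arbitrary, and String#scrub is used in place of iconv (which is no longer part of Ruby's standard library) to drop bytes that are invalid in the string's encoding.

```ruby
# Apply a decoding step repeatedly until the text stops changing.
# max_passes is an arbitrary safety limit chosen for this sketch.
def decode_until_stable(text, max_passes: 5)
  max_passes.times do
    decoded = yield(text)
    return decoded if decoded == text
    text = decoded
  end
  text
end

# Stand-in decoder for the example; swap in HTMLEntities or similar in practice.
BASIC_ENTITIES = { 'amp' => '&', 'lt' => '<', 'gt' => '>' }.freeze
decoder = ->(s) { s.gsub(/&(amp|lt|gt);/) { BASIC_ENTITIES[Regexp.last_match(1)] } }

# Markup that was encoded three times over.
triple_encoded = '&amp;amp;lt;b&amp;amp;gt;bold'
puts decode_until_stable(triple_encoded) { |s| decoder.call(s) }

# String#scrub removes byte sequences that are invalid in the string's encoding.
dirty = "caf\xC3\xA9 \xFFok"
puts dirty.scrub('')
```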
Nokogiri can decode the content of a node various ways, by creative use of the to_xml and to_html methods. You can also look at the HTMLEntities gem, Loofah, and others to go after the CDATA contents. Loofah is nice because it's designed to whitelist/blacklist tags you might encounter.
The XML spec is supposed to protect us from such shenanigans, but, as one of my co-workers used to tell me, "We can make it fool-proof, but not damn-fool-proof". People are SO inventive and the specs mean nothing to someone who didn't bother to read them or doesn't care.

Ruby XMLParsing Exception

I get a ParseException every time I try to parse a http get_response data in Ruby. The Exception is because of the presence of '&' in the data. How do I solve this?
Illegal character '&' in raw string (REXML::ParseException)
Is the data you're passing to the parser XML? Do other parsers complain about it?
Check to make sure that the data that you're trying to parse is well-formed XML. If you are trying to pass it HTML or RSS from the web, then it almost certainly isn't well-formed XML (HTML is not XML, though XHTML might be, and while RSS is supposed to be XML, there are lots of bad RSS generators out there that generate RSS that is not well-formed or is invalid).
If you need to parse HTML, try Hpricot. If you need to parse RSS, use the built-in RSS parser; there are some examples here.
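To check the well-formedness point quickly, a sketch along these lines may help. It shows REXML rejecting a raw "&" and accepting the same text once stray ampersands are escaped. The helper names are made up, and blindly escaping ampersands is only a workaround for otherwise sound XML, not a fix for HTML input.

```ruby
require 'rexml/document'

# True when REXML can parse the string, false when it raises a parse error.
def well_formed?(xml)
  REXML::Document.new(xml)
  true
rescue REXML::ParseException
  false
end

# Escape ampersands that do not start an entity or character reference.
def escape_stray_ampersands(text)
  text.gsub(/&(?!(?:#\d+|#x\h+|\w+);)/, '&amp;')
end

raw   = '<title>Tom & Jerry &amp; friends</title>'
fixed = escape_stray_ampersands(raw)

puts well_formed?(raw)
puts well_formed?(fixed)
puts REXML::Document.new(fixed).root.text
```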
If you're trying to parse HTML consider using Nokogiri.
Nokogiri::HTML("<html>...</html>")
You can also try Nokogiri::XML but I believe that requires valid markup.
