I need to compare two XML files and display the differences in an HTML report. To do this, I installed the Ruby gem Diffy (and the gems rspec and diff-lcs, as directed by the Diffy documentation), but it does not seem to be working properly: no differences are returned.
I have two XML files I want to compare.
XML file one:
<?xml version="1.0" encoding="UTF-8"?>
<SourceDetails>
<Origin>Origin</Origin>
<Identifier>Identifier</Identifier>
<Timestamp>2001-12-31T12:00:00</Timestamp>
</SourceDetails>
<AsOfDate>2001-01-01</AsOfDate>
<Instrument>
<ASXExchangeSecurityIdentifier>ASX</ASXExchangeSecurityIdentifier>
</Instrument>
<Rate>0.0</Rate>
XML file two:
<?xml version="1.0" encoding="UTF-8"?>
<SourceDetails>
<Origin>FEED</Origin>
<Identifier>IR</Identifier>
<Timestamp>2017-01-01T02:11:01Z</Timestamp>
</SourceDetails>
<AsOfDate>2017-01-02</AsOfDate>
<Instrument>
<CommonCode>GB0</CommonCode>
</Instrument>
<Rate>0.69</Rate>
When I supply the two xml files as arguments to the diffy function:
puts Diffy::Diff.new('xmldoc1', 'xmldoc2', :source => 'files').to_s(:html)
no differences are returned. When I store the two xml files in String variables and supply these variables as arguments to the Diffy function:
puts Diffy::Diff.new(doc1, doc2, :include_plus_and_minus_in_html => true).to_s(:html)
again no differences are returned. To figure out if my xmls were causing the problem, I also tried supplying two different strings to the Diffy function:
puts Diffy::Diff.new("Hello how are you", "Hello how are you\nI'm fine\nThat's great\n")
but this also returned nothing when there are clear differences.
Does anyone know what the problem may be?
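One thing worth ruling out: Diffy does not compute diffs itself but shells out to the system's diff program, so it needs a working diff on the search path (on Windows this usually means installing one separately). The following is only a minimal sketch for checking that; find_executable is a hypothetical helper written for this example, not part of Diffy.

```ruby
# Hypothetical helper (not part of Diffy): returns the full path of the first
# executable named `cmd` found on PATH, or nil when there is none.
def find_executable(cmd)
  # On Windows, executables carry one of the extensions listed in PATHEXT.
  exts = ENV['PATHEXT'] ? ENV['PATHEXT'].split(';') : ['']
  ENV.fetch('PATH', '').split(File::PATH_SEPARATOR).each do |dir|
    exts.each do |ext|
      candidate = File.join(dir, "#{cmd}#{ext}")
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

diff_path = find_executable('diff')
if diff_path
  puts "diff found at #{diff_path}"
else
  puts 'No diff executable found on PATH.'
end
```

If this reports that no diff was found, that would at least be consistent with even the plain-string comparison returning nothing.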
In an .xsl file I want to use nodes from a separate file ("foo.xsd"). The .xsl file uses an explicit namespace prefix, the external file doesn't but rather relies on a default namespace. Their namespace URIs match up.
When reading in the nodes with ant.xslt, the following XPath expression later results in an ArrayIndexOutOfBoundsException:
document('foo.xsd')/xsd:schema/xsd:*
while it works when removing the last reference to the namespace prefix
document('foo.xsd')/xsd:schema/*
Here is a minimal example that reproduces the issue. The transformation input file
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema">
<element name="bar"/>
</schema>
and a transformation .xsl file
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
version="1.0">
<xsl:variable name="nodeSet" select="document('foo.xsd')/xsd:schema/xsd:*[@name]"/>
<xsl:template match="/xsd:schema/xsd:element">
<xsl:value-of select="$nodeSet/@name"/>
</xsl:template>
</xsl:stylesheet>
The referenced foo.xsd is just a copy of the input file, so in this cut down example I'm running over one instance of the file and reading in the other instance in the stylesheet.
Good ol' xsltproc extracts the right attribute value ("bar"). ant.xslt with the default processor (Xalan) throws an ArrayIndexOutOfBoundsException (I presume when looking for the colon inside the element name). The problem only arises when referencing nodeSet, as in the <xsl:value-of> element.
The <xsl:template> matches in all cases, using prefixes.
My question is: Did I hit a bug in Xalan, or am I doing something generally wrong?
I'm aware of the various work-arounds concerning namespace prefixes, like using [local-name() = 'element'] and such, so please don't post answers in that vein. I'm looking for a general answer whether this should work (like, according to the specs).
Background Material
Stacktrace (part.) that hints at Xalan:
...
Caused by: javax.xml.transform.TransformerException: java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 512
at java.xml/com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:783)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:370)
at org.apache.tools.ant.taskdefs.optional.TraXLiaison.transform(TraXLiaison.java:201)
at org.apache.tools.ant.taskdefs.XSLTProcess.process(XSLTProcess.java:870)
... 126 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 512
at java.xml/com.sun.org.apache.xml.internal.utils.SuballocatedIntVector.elementAt(SuballocatedIntVector.java:441)
at java.xml/com.sun.org.apache.xml.internal.dtm.ref.DTMDefaultBase._firstch(DTMDefaultBase.java:523)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.SAXImpl.access$200(SAXImpl.java:73)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.SAXImpl$NamespaceChildrenIterator.next(SAXImpl.java:1431)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.CurrentNodeListIterator.setStartNode(CurrentNodeListIterator.java:158)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.StepIterator.setStartNode(StepIterator.java:97)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.StepIterator.setStartNode(StepIterator.java:97)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.DupFilterIterator.setStartNode(DupFilterIterator.java:97)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.dom.CachedNodeListIterator.setStartNode(CachedNodeListIterator.java:57)
at jdk.translet/die.verwandlung.test.topLevel()
at jdk.translet/die.verwandlung.test.transform()
at java.xml/com.sun.org.apache.xalan.internal.xsltc.runtime.AbstractTranslet.transform(AbstractTranslet.java:624)
at java.xml/com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:776)
... 129 more
Invocation is through Gradle via the shell command gradlew mytask, using Gradle's built-in Ant and its ant.xslt task.
build.gradle:
tasks.register('mytask') {
doLast {
ant.xslt(
baseDir: '.',
in: 'input.xsd',
out: 'out.xml',
style: 'stylefile.xsl'
)
}
}
Your xslt/xml code is fine (works with Saxon). So it's either something in the way you're running the transformation, or it's a bug in the version of Xalan that you're using.
Xalan shouldn't be throwing an ArrayIndexOutOfBounds exception anyway. It's presumably Xalan code on the stack trace?
I have a sample XML file (let's call it example.xml for the sake of this question) and want to turn it into a Nokogiri object.
According to documentation and lots of other online sources, this should work:
xml = Nokogiri::XML(File.read("example.txt"))
But the value of xml.to_xml is only:
"<?xml version=\"1.0\"?>\n"
In other words, it's ignoring the rest of the file. There are many tags afterwards and none of them are in the xml object.
How do I get Nokogiri to get all the tags?
Here's the XML I'm using:
<? xml version="1.0" encoding="UTF-8" ?>
<Document>
<Test>Test</Test>
</Document>
It looks like you are trying to parse an invalid XML doc.
This can be fixed by removing the spaces in the XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
<Document>
<Test>Test</Test>
</Document>
How I figured this out
By default, when Nokogiri has errors parsing a document it populates an errors array.
xml = Nokogiri::XML(File.read("example.txt"))
p xml.errors
# => [#<Nokogiri::XML::SyntaxError: xmlParsePI : no target name>, #<Nokogiri::XML::SyntaxError: Start tag expected, '<' not found>]
You can also configure Nokogiri to raise an exception if it has parsing errors:
xml = Nokogiri::XML(File.read("example.txt")) do |config|
config.strict
end
Both of these cases show that there were issues parsing the document.
I am using Ruby 1.9.3 with Rails 3.1. I need to parse a file like the one below. When I open it in a browser, the tags are not displayed in order: after <item>, the data is clubbed together. The file also contains
<?xml version="1.0" encoding="utf-8"?>
When I open it in Sublime Text, it shows the following after <item>:
<![CDATA[<?xml version="1.0" encoding="utf-8"?>
and ]]> is also present after </item>. The data that needs to be parsed is inside <item></item>. Nokogiri's parse_file method only calls start_element and end_element. When we edit the file manually and remove the statements above, it does call the characters method to fetch the data. Below is the example. Is there any other way?
<batch transactionType="HC"><item><?xml version="1.0" encoding="utf-8"?><C><CI><Ve>00501</Ve></CI></C></item></batch>
You can do it easily using "xml-simple". Assuming your XML file name is "test.xml", first install the gem:
gem install xml-simple
Then, you can try:
require "xmlsimple"
abc = XmlSimple.xml_in File.read("test.xml")
puts abc['item']
The output should be:
{"C"=>[{"CI"=>[{"Ve"=>["00501"]}]}]}
This is probably an XML namespace newbie question, but I can't figure out how to get an XPath to work with the following truncated XML with this particular root element:
<?xml version="1.0" encoding="UTF-8"?>
<CreateOrUpdateEventsRequest xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://dhamma.org" version="3-0-0">
<LanguageKey>
<IsoCode>en</IsoCode>
</LanguageKey>
<Publish>
<Value>true</Value>
</Publish>
<Events>
<Event>
<EventKey>
<LocationKey>
<SubDomain>rasmi</SubDomain>
</LocationKey>
<EventId>10DayPDFStdTag</EventId>
</EventKey>
</Event>
</Events>
</LanguageKey>
</CreateOrUpdateEventsRequest>
Using Ruby and Nokogiri (with a just updated libxml2), it works fine with XPath only if I delete all the extra info in the root element, making it:
<CreateOrUpdateEventsRequest>
Otherwise nothing works:
$> @doc.xpath("//CreateOrUpdateEventsRequest") #=> [] with original header, an array of nodes with modified header
$> @doc.xpath("//LanguageKey") #=> [] with the original header, an array of nodes with modified header
$> @doc.xpath("//xmlns:LanguageKey") #=> undefined namespace prefix with the original
How do I address namespaces like this with XPath?
Many thanks for the help.
The answer seems to be that the XML re-declared XMLNS when it should have declared the namespace with a prefix as in xmlns:myns.
From www.w3.org:
The XML specification reserves all names beginning with the letters 'x', 'm', 'l' in any combination of upper- and lower-case for use by the W3C. To date three such names have been given definitions—although these names are not in the XML namespace, they are listed here as a convenience to readers and users:
xml: See http://www.w3.org/TR/xml/#NT-XMLDecl and http://www.w3.org/TR/xml-names/#xmlReserved
xmlns: See http://www.w3.org/TR/xml-names/#ns-decl
xml-stylesheet: See The xml-stylesheet processing instruction
I don't use Nokogiri nor Ruby,
but you need to register a prefix for namespace http://dhamma.org
When I read http://nokogiri.org/tutorials/searching_a_xml_html_document.html
I understand you must do something like
$> @doc.xpath('//dha:LanguageKey', 'dha' => 'http://dhamma.org')
Here's some code to consider. Starting with code to create a Nokogiri::XML::Document:
require 'nokogiri'
XML = <<EOT
<?xml version="1.0" encoding="UTF-8"?>
<CreateOrUpdateEventsRequest xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://dhamma.org" version="3-0-0">
<LanguageKey>
<IsoCode>en</IsoCode>
</LanguageKey>
<Publish>
<Value>true</Value>
</Publish>
<Events>
<Event>
<EventKey>
<LocationKey>
<SubDomain>rasmi</SubDomain>
</LocationKey>
<EventId>10DayPDFStdTag</EventId>
</EventKey>
</Event>
</Events>
</LanguageKey>
</CreateOrUpdateEventsRequest>
EOT
doc = Nokogiri::XML(XML)
Here's the root node's name:
doc.root.name # => "CreateOrUpdateEventsRequest"
The docs say:
When using CSS, if the namespace is called “xmlns”, you can even omit the namespace name.
doc.at('CreateOrUpdateEventsRequest').name # => "CreateOrUpdateEventsRequest"
doc.at('LanguageKey').to_xml # => "<LanguageKey>\n <IsoCode>en</IsoCode>\n </LanguageKey>"
Using XPath, we can specify the default namespace as:
doc.at('//xmlns:LanguageKey').to_xml # => "<LanguageKey>\n <IsoCode>en</IsoCode>\n </LanguageKey>"
Sometimes, if there are a lot of namespaces it makes sense to use collect_namespaces and pass them in:
name_spaces = doc.collect_namespaces # =>
doc.at('//xmlns:LanguageKey', name_spaces).to_xml # => "<LanguageKey>\n <IsoCode>en</IsoCode>\n </LanguageKey>"
You'll need to look through the documentation for Nokogiri::XML::Node for more information on the various methods.
I recommend using CSS selectors for simplicity and readability over XPath, as a first try. I think XPath has more functionality but it makes my eyes bug out sometimes, so I prefer CSS.
I'm trying to pull out the dc:title element using an xpath. I can pull out the metadata using the following code.
doc = <<END
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
<metadata xmlns:dc="URI">
<dc:title>title text</dc:title>
</metadata>
</package>
END
doc = Nokogiri::XML(doc)
# Awesome this works!
puts '//xmlns:metadata'
puts doc.xpath('//xmlns:metadata')
# => <metadata xmlns:dc="URI"><dc:title>title text</dc:title></metadata>
As you can see the above appears to work correctly. However I don't seem to be able to get the title information from this node tree, all of the below fail.
puts doc.xpath('//xmlns:metadata/title')
# => nil
puts doc.xpath('//xmlns:metadata/dc:title')
# => ERROR: `evaluate': Undefined namespace prefix
puts doc.xpath('//xmlns:dc:title')
# => ERROR: 'evaluate': Invalid expression: //xmlns:dc:title
Could someone please explain how namespaces should be used in an XPath with the above XML doc?
All namespaces need to be registered when parsing. Nokogiri automatically registers namespaces on the root node. Any namespaces that are not on the root node you have to register yourself. This should work:
puts doc.xpath('//dc:title', 'dc' => "URI")
Alternately, you can remove namespaces altogether. Only do this if you are certain there will be no conflicting node names.
doc.remove_namespaces!
puts doc.xpath('//title')
With properly registered prefix opf for 'http://www.idpf.org/2007/opf' namespace URI, and dc for 'URI', you need:
/*/opf:metadata/dc:title
Note: xmlns and xml are reserved prefixes that can't be bound to any other namespace URI than the built-in 'http://www.w3.org/2000/xmlns/' and 'http://www.w3.org/XML/1998/namespace'.
As an alternative to explicitly constructing a hash of namespace URIs, you can retrieve the namespace definitions from the xml element where they're defined.
Using your example:
# First grab the metadata node, because that's where "dc" is defined.
metadata = doc.at_xpath('//xmlns:metadata')
# Pass metadata's namespaces as the resolver.
metadata.at_xpath('dc:title', metadata.namespaces)
Note that the second xpath could've also been:
doc.at_xpath('//dc:title', metadata.namespaces).to_s
But why search from the root when you have a nearer ancestor? Also, you should consider the namespace-defining element plus its children as the "scope" of the namespace. Searching a limited scope is less confusing and avoids subtle bugs.