How to turn a file into a Nokogiri::XML object? - ruby

I have a sample XML file (let's call it example.xml for the sake of this question) and want to turn it into a Nokogiri object.
According to documentation and lots of other online sources, this should work:
xml = Nokogiri::XML(File.read("example.txt"))
But the value of xml.to_xml is only:
"<?xml version=\"1.0\"?>\n"
In other words, it's ignoring the rest of the file. There are many tags afterwards and none of them are in the xml object.
How do I get Nokogiri to get all the tags?
Here's the XML I'm using:
<? xml version="1.0" encoding="UTF-8" ?>
<Document>
<Test>Test</Test>
</Document>

It looks like you are trying to parse an invalid XML doc.
This can be fixed by removing the spaces in the XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
<Document>
<Test>Test</Test>
</Document>
How I figured this out
By default, when Nokogiri has errors parsing a document it populates an errors array.
xml = Nokogiri::XML(File.read("example.txt"))
p xml.errors
# => [#<Nokogiri::XML::SyntaxError: xmlParsePI : no target name>, #<Nokogiri::XML::SyntaxError: Start tag expected, '<' not found>]
You can also configure Nokogiri to raise an exception if it encounters parsing errors:
xml = Nokogiri::XML(File.read("example.txt")) do |config|
  config.strict
end
Both approaches show that there were problems parsing the document.

Related

Ruby gem Diffy not returning differences

I need to compare two XML files and display the differences in an HTML report. To do this, I installed the Ruby gem Diffy (and the gems rspec and diff-lcs, as directed by the Diffy documentation), but it does not seem to be working properly: no differences are returned.
I have two XML files I want to compare.
Xml file one:
<?xml version="1.0" encoding="UTF-8"?>
<SourceDetails>
<Origin>Origin</Origin>
<Identifier>Identifier</Identifier>
<Timestamp>2001-12-31T12:00:00</Timestamp>
</SourceDetails>
<AsOfDate>2001-01-01</AsOfDate>
<Instrument>
<ASXExchangeSecurityIdentifier>ASX</ASXExchangeSecurityIdentifier>
</Instrument>
<Rate>0.0</Rate>
Xml file two:
<?xml version="1.0" encoding="UTF-8"?>
<SourceDetails>
<Origin>FEED</Origin>
<Identifier>IR</Identifier>
<Timestamp>2017-01-01T02:11:01Z</Timestamp>
</SourceDetails>
<AsOfDate>2017-01-02</AsOfDate>
<Instrument>
<CommonCode>GB0</CommonCode>
</Instrument>
<Rate>0.69</Rate>
When I supply the two xml files as arguments to the diffy function:
puts Diffy::Diff.new('xmldoc1', 'xmldoc2', :source => 'files').to_s(:html)
no differences are returned. When I store the two xml files in String variables and supply these variables as arguments to the Diffy function:
puts Diffy::Diff.new(doc1, doc2, :include_plus_and_minus_in_html => true).to_s(:html)
again no differences are returned. To figure out if my xmls were causing the problem, I also tried supplying two different strings to the Diffy function:
puts Diffy::Diff.new("Hello how are you", "Hello how are you\nI'm fine\nThat's great\n")
but this also returned nothing when there are clear differences.
Does anyone know what the problem may be?

Can't read XML file using SAX parser with Nokogiri

I am using Ruby 1.9.3 with Rails 3.1. I need to parse a file like the one below. When I open it in a browser, the tags are not laid out in order; after the <item>, the data is run together. This declaration is also present:
<?xml version="1.0" encoding="utf-8"?>
When I open the file in Sublime Text, it shows this after the <item>:
<![CDATA[<?xml version="1.0" encoding="utf-8"?>
There is also a ]]> present after the </item>. The data that needs to be parsed is inside this <item></item>. Nokogiri's parse_file method only calls start_element and end_element. When we edit the file manually and remove the statements above, it calls the characters method and fetches the data. Below is an example. Is there any other way?
<batch transactionType="HC"><item><?xml version="1.0" encoding="utf-8"?><C><CI><Ve>00501</Ve></CI></C></item></batch>
You can do it easily using "xml-simple". Assuming your XML file name is "test.xml", first install the gem:
gem install xml-simple
Then, you can try:
require "xmlsimple"
abc = XmlSimple.xml_in File.read("test.xml")
puts abc['item']
The output should be:
{"C"=>[{"CI"=>[{"Ve"=>["00501"]}]}]}

Why can't I get a result from an XPath with namespace in the root element? [duplicate]

This question already has answers here:
Nokogiri/Xpath namespace query
(3 answers)
Closed 8 years ago.
This is probably an XML namespace newbie question, but I can't figure out how to get an XPath to work with the following truncated XML with this particular root element:
<?xml version="1.0" encoding="UTF-8"?>
<CreateOrUpdateEventsRequest xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://dhamma.org" version="3-0-0">
<LanguageKey>
<IsoCode>en</IsoCode>
</LanguageKey>
<Publish>
<Value>true</Value>
</Publish>
<Events>
<Event>
<EventKey>
<LocationKey>
<SubDomain>rasmi</SubDomain>
</LocationKey>
<EventId>10DayPDFStdTag</EventId>
</EventKey>
</Event>
</Events>
</LanguageKey>
</CreateOrUpdateEventsRequest>
Using Ruby and Nokogiri (with a just updated libxml2), it works fine with XPath only if I delete all the extra info in the root element, making it:
<CreateOrUpdateEventsRequest>
Otherwise nothing works:
$> @doc.xpath("//CreateOrUpdateEventsRequest") #=> [] with the original header, an array of nodes with the modified header
$> @doc.xpath("//LanguageKey") #=> [] with the original header, an array of nodes with the modified header
$> @doc.xpath("//xmlns:LanguageKey") #=> undefined namespace prefix with the original header
How do I address namespaces like this with XPath?
Many thanks for the help.
The answer seems to be that the XML re-declared XMLNS when it should have declared the namespace with a prefix as in xmlns:myns.
From www.w3.org:
The XML specification reserves all names beginning with the letters 'x', 'm', 'l' in any combination of upper- and lower-case for use by the W3C. To date three such names have been given definitions—although these names are not in the XML namespace, they are listed here as a convenience to readers and users:
xml: See http://www.w3.org/TR/xml/#NT-XMLDecl and http://www.w3.org/TR/xml-names/#xmlReserved
xmlns: See http://www.w3.org/TR/xml-names/#ns-decl
xml-stylesheet: See The xml-stylesheet processing instruction
I don't use Nokogiri or Ruby, but you need to register a prefix for the namespace http://dhamma.org.
From reading http://nokogiri.org/tutorials/searching_a_xml_html_document.html, I understand you must do something like:
$> @doc.xpath('//dha:LanguageKey', 'dha' => 'http://dhamma.org')
Here's some code to consider. Starting with code to create a Nokogiri::XML::Document:
require 'nokogiri'
XML = <<EOT
<?xml version="1.0" encoding="UTF-8"?>
<CreateOrUpdateEventsRequest xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://dhamma.org" version="3-0-0">
<LanguageKey>
<IsoCode>en</IsoCode>
</LanguageKey>
<Publish>
<Value>true</Value>
</Publish>
<Events>
<Event>
<EventKey>
<LocationKey>
<SubDomain>rasmi</SubDomain>
</LocationKey>
<EventId>10DayPDFStdTag</EventId>
</EventKey>
</Event>
</Events>
</LanguageKey>
</CreateOrUpdateEventsRequest>
EOT
doc = Nokogiri::XML(XML)
Here's the root node's name:
doc.root.name # => "CreateOrUpdateEventsRequest"
The docs say:
When using CSS, if the namespace is called “xmlns”, you can even omit the namespace name.
doc.at('CreateOrUpdateEventsRequest').name # => "CreateOrUpdateEventsRequest"
doc.at('LanguageKey').to_xml # => "<LanguageKey>\n <IsoCode>en</IsoCode>\n </LanguageKey>"
Using XPath, we can specify the default namespace as:
doc.at('//xmlns:LanguageKey').to_xml # => "<LanguageKey>\n <IsoCode>en</IsoCode>\n </LanguageKey>"
Sometimes, if there are a lot of namespaces it makes sense to use collect_namespaces and pass them in:
name_spaces = doc.collect_namespaces # =>
doc.at('//xmlns:LanguageKey', name_spaces).to_xml # => "<LanguageKey>\n <IsoCode>en</IsoCode>\n </LanguageKey>"
You'll need to look through the documentation for Nokogiri::XML::Node for more information on the various methods.
I recommend using CSS selectors for simplicity and readability over XPath, as a first try. I think XPath has more functionality but it makes my eyes bug out sometimes, so I prefer CSS.

Nokogiri/Xpath namespace query

I'm trying to pull out the dc:title element using an xpath. I can pull out the metadata using the following code.
doc = <<END
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
<metadata xmlns:dc="URI">
<dc:title>title text</dc:title>
</metadata>
</package>
END
doc = Nokogiri::XML(doc)
# Awesome this works!
puts '//xmlns:metadata'
puts doc.xpath('//xmlns:metadata')
# => <metadata xmlns:dc="URI"><dc:title>title text</dc:title></metadata>
As you can see, the above appears to work correctly. However, I can't get the title information from this node tree; all of the attempts below fail.
puts doc.xpath('//xmlns:metadata/title')
# => nil
puts doc.xpath('//xmlns:metadata/dc:title')
# => ERROR: `evaluate': Undefined namespace prefix
puts doc.xpath('//xmlns:dc:title')
# => ERROR: 'evaluate': Invalid expression: //xmlns:dc:title
Could someone please explain how namespaces should be used in an XPath with the above XML doc?
All namespaces need to be registered when parsing. Nokogiri automatically registers namespaces on the root node. Any namespaces that are not on the root node you have to register yourself. This should work:
puts doc.xpath('//dc:title', 'dc' => "URI")
Alternately, you can remove namespaces altogether. Only do this if you are certain there will be no conflicting node names.
doc.remove_namespaces!
puts doc.xpath('//title')
With properly registered prefix opf for 'http://www.idpf.org/2007/opf' namespace URI, and dc for 'URI', you need:
/*/opf:metadata/dc:title
Note: xmlns and xml are reserved prefixes that can't be bound to any other namespace URI than the built-in 'http://www.w3.org/2000/xmlns/' and 'http://www.w3.org/XML/1998/namespace'.
As an alternative to explicitly constructing a hash of namespace URIs, you can retrieve the namespace definitions from the xml element where they're defined.
Using your example:
# First grab the metadata node, because that's where "dc" is defined.
metadata = doc.at_xpath('//xmlns:metadata')
# Pass metadata's namespaces as the resolver.
metadata.at_xpath('dc:title', metadata.namespaces)
Note that the second xpath could've also been:
doc.at_xpath('//dc:title', metadata.namespaces).to_s
But why search from the root when you have a nearer ancestor? Also, you should consider the namespace-defining element plus its children as the "scope" of the namespace. Searching a limited scope is less confusing and avoids subtle bugs.

Associate an XML-Stylesheet with an XML Document with Nokogiri

Is it possible to associate a stylesheet with an XML document using Nokogiri, to create this structure?
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="http://www.my-site.com/sitemap.xsl"?>
<root>
...
</root>
OMG, there is so much fail here that I am breaking the unofficial policy of Team Nokogiri and am providing the correct, sane answer to this question:
require "nokogiri"
doc = Nokogiri::XML "<root>foo</root>"
doc.root.add_previous_sibling Nokogiri::XML::ProcessingInstruction.new(doc, "xml-stylesheet", 'type="text/xsl" href="foo.xsl"')
puts doc.to_xml
# => <?xml version="1.0"?>
# <?xml-stylesheet type="text/xsl" href="foo.xsl"?>
# <root>foo</root>
In the future, please ask questions about Nokogiri on the nokogiri-talk mailing list (http://groups.google.com/group/nokogiri-talk), get the correct answer in a timely fashion, and save everyone a little effort.
There is not.
The way I did it:
xml.gsub!("<?xml version=\"1.0\"?>") do |head|
  result = head
  result << "\n"
  result << "<?xml-stylesheet type=\"text/xsl\" href=\"#{stylesheet}\"?>"
end
Cheers.
