Ignore namespaces on xmldocument in nokogiri - ruby

I'm trying to learn to parse XML with Nokogiri.
I don't have control of how the XML file is generated, and it seems the namespaces are causing issues because they are not defined.
I'm using the following test code to try to get this to work.
require 'nokogiri'
def getxml
xml_str = <<EOF
<root>
<THING1:things type="Container">
<PART1:Id type="Property">1234</PART1:Id>
<PART1:Name type="Property">The Name1</PART1:Name>
</THING1:things>
<THING2:things type="Container">
<PART2:Id type="Property">2234</PART2:Id>
<PART2:Name type="Property">The Name2</PART2:Name>
</THING2:things>
</root>
EOF
doc = Nokogiri::XML(xml_str)
puts(doc.errors())
doc.xpath('//Id').each do |thing|
puts(thing.inspect)
#puts "ID = " + thing.at_xpath('Id').content
#puts "Name = " + thing.at_xpath('Name').content
end
end
getxml()
I'm getting the following errors:
2:38: ERROR: Namespace prefix THING1 on things is not defined
3:34: ERROR: Namespace prefix PART1 on Id is not defined
4:36: ERROR: Namespace prefix PART1 on Name is not defined
6:38: ERROR: Namespace prefix THING2 on things is not defined
7:34: ERROR: Namespace prefix PART2 on Id is not defined
8:36: ERROR: Namespace prefix PART2 on Name is not defined
How am I supposed to deal with namespaces that are not defined? Is there a way to ignore namespaces?

Nokogiri does have the remove_namespaces! method, but it won't help in your case as your XML isn’t actually using namespaces.
As there are no namespace declarations, your XML elements are just treated as non-namespaced elements that contain a : character in their name. This makes it difficult to use with XPath as XPath assumes a : indicates a namespace.
One way to get round this is to use the local-name() function to select elements. For example to select all elements named PART1:Id you could use this:
doc.xpath('//*[local-name()="PART1:Id"]')
If you want to select all elements where the final part is Id, regardless of what the prefix is, such as PART1:Id and PART2:Id, you could combine local-name() with substring-after():
doc.xpath('//*[substring-after(local-name(), ":")="Id"]')

Related

XQuery/Xpath referring to xml elements with no namespace, in a namespace environment

In Xquery 3.1 (under eXist-DB 4.7) I receive xml data like this, with no namespace:
<edit-request id="TC9999">
<title-collection>foocolltitle</title-collection>
<title-exempla>fooextitle</title-exempla>
<title-short>fooshorttitle</title-short>
</edit-request>
This is assigned to a variable $content and this statement:
let $collid := $content/edit-request/@id
...correctly returns: TC9999
Now, I need to actually transform all the data in $content into a TEI xml document.
I first need to get some info from an existing TEI file, so I assigned another variable:
let $oldcontent := doc(concat($globalvar:URIdata,$collid,"/",$collid,".xml"))
And then I create the new TEI document, referring to both $content and $oldcontent:
let $xml := <listBibl xmlns="http://www.tei-c.org/ns/1.0"
type="collection"
xml:id="{$collid}">
<bibl>
<idno type="old_sql_id">{$oldcontent//tei:idno[@type="old_sql_id"]/text()}</idno>
<title type="collection">{$content//title-exempla/text()}</title>
</bibl>
</listBibl>
The references to the TEI namespace in $oldcontent come through, but to my surprise the references to $content (no namespace) don't show up:
<listBibl xmlns="http://www.tei-c.org/ns/1.0"
type="collection"
xml:id="TC9999">
<bibl>
<idno type="old_sql_id">1</idno>
<title type="collection"/>
</bibl>
</listBibl>
The question is: how do I refer to the non-namespace elements in $content in the context of let $xml=...?
NB: the XQuery document has a declaration at the top (as it is the principal namespace of virtually all the documents):
declare namespace tei = "http://www.tei-c.org/ns/1.0";
In essence you are asking how to write an XPath expression to select nodes in an empty namespace in a context where the default element namespace is non-empty. One of the most direct solutions is to use the "URI plus local-name syntax" for writing QNames. Here is an example:
xquery version "3.1";
let $x := <x><y>Jbrehr</y></x>
return
<p xmlns="foo">Hey there,
{ $x/Q{}y => string() }!</p>
If instead of $x/Q{}y the example had used the more common form of the path expression, $x/y, its result would have been an empty sequence, since the local name y used to select the <y> element specifies no namespace and thus inherits the foo element namespace from its context. By using the "URI plus local-name syntax", though, we are able to specify the empty namespace we are looking for.
For more information on this, see the XPath 3.1 specification's discussion of expanded QNames: https://www.w3.org/TR/xpath-31/#doc-xpath31-EQName.

Avoiding Nokogiri::XML::XPath::SyntaxError: ERROR: Undefined namespace prefix

I get the error "Nokogiri::XML::XPath::SyntaxError: ERROR: Undefined namespace prefix" when I do this:
doc.search('//text()[not(ancestor::w:delText)]')
Based on this answer: How do I use xpath on nodes with a prefix but without a namespace?
*[name()="w:delText"]
can sort of solve the problem. But how do I do something similar like this to avoid the namespace error:
doc.search('//text()[not(ancestor::*[name()="w:delText"])]')
I ended up solving the problem by editing the XML file and adding the namespaces in the root. Here is an example:
temp = Nokogiri::XML(@document_xml)
temp.root['xmlns:w'] = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
@doc = Nokogiri::XML(temp.to_xml(:save_with => Nokogiri::XML::Node::SaveOptions::AS_XML))

I can't extract the node text with a Xpath

I have a XML file (test.xml) like this one:
<?xml version="1.0" encoding="ISO-8859-1"?>
<s2xResponse>
<s2xData>
<Name>This is the name</Name>
<InfocomData>
<DateOfUpdate day="07" month="02" year="2018">20180207</DateOfUpdate>
<CompanyName>MY COMPANY</CompanyName>
<TaxCode FlagCheck="0">XXXYYYWWWZZZ</TaxCode>
</InfocomData>
<AssessmentSummary>
<Rating Code="2">Rating Description for Code 2</Rating>
</AssessmentSummary>
<AssessmentData>
<SectorialDistribution>
<CompaniesNumber>11650</CompaniesNumber>
<ScoreDistribution />
<CervedScoreDistribution>
<DistributionData>
<Rating Code="1">SICUREZZA</Rating>
<Percentage>1.91</Percentage>
</DistributionData>
<DistributionData>
<Rating Code="2">SOLVIBILITA' ELEVATA</Rating>
<Percentage>35.56</Percentage>
</DistributionData>
</CervedScoreDistribution>
</SectorialDistribution>
</AssessmentData>
</s2xData>
</s2xResponse>
I'm trying to get the "Name" node text ("This is the name") with a U-SQL script using the XmlExtractor. The following is the code I'm using:
USE TestXML; // It contains the registered assembly
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
@xml = EXTRACT xml_text string
FROM "textxpath/test.xml"
USING Extractors.Text(rowDelimiter: "^", quoting: false);
@xml_cleaned =
SELECT
xml_text.Replace("\r\n", "").Replace("\t", " ") AS xml_text
FROM @xml;
@values =
SELECT Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(xml_text, "s2xResponse/s2xData/Name")[1] AS value
FROM @xml_cleaned;
OUTPUT @values TO @"outputs/test_xpath.txt" USING Outputters.Text(quoting: false);
But I'm getting this runtime error:
Execution failed with error '1_SV1_Extract Error :
'{"diagnosticCode":195887116,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXPRESSIONEVALUATION","message":"Error
while evaluating expression
Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(xml_text.Replace(\"\r\n\",
\"\").Replace(\"\t\", \" \"),
\"s2xResponse/s2xData/Name\")[1]","description":"Inner exception from
user expression: Index was out of range. Must be non-negative and less
than the size of the collection.
I get the same error even if I use a zero index for the Evaluate result ([0]).
What's wrong with my query?
The problem here is that you are applying the subscript [1] to the result of XPath.Evaluate, which I believe will be returning the Name nodes. However, you are applying the [1] subscript in code, not in XPath, so the subscript is likely to be zero based, and not 1-based as it is in XPath, hence the Index out of range error.
Here's one solution - simply apply the subscript operator in Xpath (where it is still 1-based), and select the text() there
.Evaluate("s2xResponse/s2xData/Name[1]/text()")
Is there a particular reason you want to use the Evaluate method? I got this to work using the XmlDomExtractor, which would allow you to extract multiple values from the XML, e.g.
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
DECLARE @inputFile string = "/input/input100.xml";
@input =
EXTRACT Name string
FROM @inputFile
USING new Microsoft.Analytics.Samples.Formats.Xml.XmlDomExtractor(rowPath : "/s2xResponse",
columnPaths : new SQL.MAP<string, string>{
{ "s2xData/Name", "Name" },
}
);
@output =
SELECT *
FROM @input;

How to add a namespace to existing xml file

I want to open this file and get all elements that start with us-gaap.
ftp://ftp.sec.gov/edgar/data/916789/0001558370-15-001143.txt
To get elements I tried like this:
str = '<html><body><us-gaap:foo>foo</us-gaap:foo></body></html>'
doc = Nokogiri::XML(str)
doc.xpath('//us-gaap:*')
Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: //us-gaap:*
from /Users/ironsand/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/searchable.rb:165:in `evaluate'
doc.namespaces returns {}, so I think I have to add namespace us-gaap.
There are some questions about "adding namespace with Nokogiri", but it looks like about how to create a new XML document, not how to add a namespace to existing documents.
How can I add a namespace to existing document?
I know I can remove the namespace with Nokogiri::XML::Document#remove_namespaces!, but I don't want to use it because it also removes necessary information.
You have asked an XY Problem. You think that the problem is that you need to add a missing namespace; the real problem is that the file you're trying to parse is not valid XML.
require 'nokogiri'
doc = Nokogiri.XML( IO.read('0001558370-15-001143.txt') )
doc.errors.length
#=> 5716
For example, the <ACCEPTANCE-DATETIME> 'element' opened on line 3 is never closed, and on line 16 there is a raw ampersand in the text:
STANDARD INDUSTRIAL CLASSIFICATION: ELECTRIC HOUSEWARES & FANS [3634]
which ought to be escaped as an entity.
However, the document has valid XML fragments within it! In particular, there is one XML document that defines xmlns:us-gaap namespace, from lines 27243-49312. Let's extract just that, using only the knowledge that the root element defines the namespace we want, and the assumptions that no element with the same name is nested within the document, and that the root element does not have an unescaped > character in any attribute. (These assumptions are valid for this file, but may not be valid for every XML file.)
txt = IO.read('0001558370-15-001143.txt')
gaap_finder = %r{(<(\w+) [^>]+xmlns:us-gaap\s*=.+?</\2>)}m
txt.scan(gaap_finder) do |xml,_|
doc = Nokogiri.XML( xml )
gaaps = doc.xpath('//us-gaap:*')
p gaaps.length
#=> 569
end
The code above handles the case where there may be more than one XML document in the txt file, though in this case there is only one.
Decoded, the gaap_finder regex says this:
%r{...}m — this is a regular expression (that allows slashes in it, unescaped) with "multiline mode", where a period will match newline characters
(...) — capture everything we find
< — start with a literal "less-than" symbol
(\w+) — find one or more word characters (the tag name), and save them
" " (a literal space) — the word characters must be followed by a space (important to avoid capturing the <xsd:xbrl ...> element in this file)
[^>]+ — followed by one or more characters that is NOT a "greater-than" symbol (to ensure that we stay in the same element that we started in)
xmlns:us-gaap\s*= — followed by this literal namespace declaration (which may have whitespace separating it from the equals sign)
.+? — followed by anything (as little as possible)...
</\2> — ...up until you see a closing tag with the same name as what we captured for the name of the starting tag
Because of the way scan works when the regex has capturing groups, each result is a two-element array, where the first element is the entire captured XML and the second element is the name of the tag that we captured (which we "discard" by assigning it to the _ variable).
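To see the shape of what scan returns with this regex, here is a small sketch run against a made-up stand-in for the text file. The sample content and the example.com namespace URIs are assumptions for illustration only.

```ruby
# Hypothetical stand-in for the EDGAR text file: invalid as a whole,
# but with one well-formed XML document embedded in it.
txt = <<~TXT
  <TYPE>10-K
  ELECTRIC HOUSEWARES & FANS [3634]
  <xbrl xmlns="http://example.com/xbrl" xmlns:us-gaap="http://example.com/us-gaap">
    <us-gaap:Assets>100</us-gaap:Assets>
  </xbrl>
  <TEXT>
TXT

gaap_finder = %r{(<(\w+) [^>]+xmlns:us-gaap\s*=.+?</\2>)}m

# With two capturing groups, scan yields one [xml, tag_name] pair per match.
matches = txt.scan(gaap_finder)

matches.each do |xml, tag_name|
  puts "root tag: #{tag_name}"
  puts "first line: #{xml.lines.first}"
end
```

Note that the [^>]+ part needs at least one character between the space and xmlns:us-gaap, so in this sketch another attribute precedes the declaration.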
If you want to be less magic about your capturing, the text file format appears to always wrap each XML document in <XBRL>...</XBRL>. So, you could do this to process every XML file (there are seven, five of which do not happen to have any us-gaap namespaces):
txt = IO.read('0001558370-15-001143.txt')
xbrls = %r{(?<=<XBRL>).+?(?=</XBRL>)}m # find text inside <XBRL>…</XBRL>
txt.scan(xbrls) do |xml|
doc = Nokogiri.XML( xml )
if doc.namespaces["xmlns:us-gaap"]
gaaps = doc.xpath('//us-gaap:*')
p gaaps.length
end
end
#=> 569
#=> 0 (for the XML Schema document that defines the namespace)
I couldn't figure out how to update an existing doc with a new namespace, but since Nokogiri will recognize namespaces on the root element, and those namespaces are, syntactically, just attributes, you can update the document with a new namespace declaration, serialize the doc to a string, and re-parse it:
str = '<html><body><us-gaap:foo>foo</us-gaap:foo></body></html>'
doc_without_ns = Nokogiri::XML(str)
doc_without_ns.root['xmlns:us-gaap'] = 'http://your/actual/ns/here'
doc = Nokogiri::XML(doc_without_ns.to_xml)
doc.xpath("//us-gaap:*")
# Returns [#<Nokogiri::XML::Element:0x3ff375583f9c name="foo" namespace=#<Nokogiri::XML::Namespace:0x3ff375583f24 prefix="us-gaap" href="http://your/actual/ns/here"> children=[#<Nokogiri::XML::Text:0x3ff375583768 "foo">]>]

Parse namespaced xml with ruby nokogiri

I have a section of XML:
<Environment
Name="test"
xmlns="http://schemas.dmtf.org/ovf/environment/1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:oe="http://schemas.dmtf.org/ovf/environment/1"
oe:id="123456789">
<PropertySection>
<Property oe:key="mykey" oe:value="test"/>
</PropertySection>
</Environment>
I'm using ruby and nokogiri to parse the document. i.e.
file = File.open('/tmp/myxml.xml')
doc = Nokogiri::XML(file)
env = doc.at('Environment')
id = env['id']
printf("ID [%s]\n", id)
properties = env.at('PropertySection')
This works and successfully prints the id from the xml.
I now want to access the Property attribute with the key 'mykey'. I tried the following:
value = properties.at('Property[@key="mykey"]')['value']
printf("Value %s\n", value)
Unfortunately the properties.at method returns a nil object. I tried modifying the xml itself to remove the 'oe' namespace from the attribute 'key'. Re-running my script it works.
How can I get nokogiri to recognise the namespace when calling .at() ?
You should use the Nokogiri namespace syntax: http://nokogiri.org/tutorials/searching_a_xml_html_document.html#namespaces.
First, make sure you have namespaces you can use:
ns = {
'xmlns' => 'http://schemas.dmtf.org/ovf/environment/1',
'oe' => 'http://schemas.dmtf.org/ovf/environment/1'
}
(I'm defining both even though they are the same in this example). You might also look into using the namespaces already available in doc.collect_namespaces.
Then you can just do:
value = properties.at('./xmlns:Property[@oe:key="mykey"]/@oe:value', ns).content
Note that I am using ./ here because, for this specific search, Nokogiri interprets the XPath as CSS without it. You may wish to use .//.
