XQuery: the function parse-xml() produces an error on &amp;? - xpath

As XML content in an HTTP POST request, I receive the following, which I process in XQuery 3.1 (eXist-db 5.2):
<request id="foo">
<p>The is a description with a line break&lt;br/&gt;and another linebreak&lt;br/&gt;and
here is an ampersand&amp;.</p>
</request>
My objective is to take the node <p> and insert it into a TEI file in eXist-db. If I just insert the fragment as-is, no errors are thrown.
However, I need to transform any instances of the string &lt;br/&gt; into the element <lb/> before adding it to the TEI document. I try that with fn:parse-xml.
Applying the following, however, throws an error on &amp;, which surprises me:
let $xml := <request id="foo">
<p>The is a description with a line break&lt;br/&gt;and
another linebreak&lt;br/&gt;and here is an ampersand&amp;.</p>
</request>
let $newxml := <p>{replace($xml//p/text(),"<br/>","<lb/>")}</p>
return <p>{fn:parse-xml($newxml)}</p>
error:
Description: err:FODC0006 String passed to fn:parse-xml is not a well-formed XML document.: Document is not valid.
Fatal : The entity name must immediately follow the '&' in the entity reference.
If I remove &amp; the fragment parses just fine. Why is this producing an error if it is legal XML? How can I achieve the needed result?
Many thanks in advance.
ps. I am open to both Xquery and XSLT solutions.

It seems that the issue is the HTML entities. It would work with numeric entities (i.e. &#60; instead of &lt; and &#62; instead of &gt;), but the XML parser doesn't know about HTML character entities.
Use util:parse-html() instead of fn:parse-xml().
let $xml := <request id="foo">
<p>The is a description with a line break&lt;br/&gt;and
another linebreak&lt;br/&gt;and here is an ampersand&amp;.</p>
</request>
return <p>{util:parse-html($xml/p/text())/HTML/BODY/node()}</p>
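What happens to the entities can be sketched in Python, with the standard-library xml.etree.ElementTree standing in for the XQuery engine (an assumption for illustration; the helper names text_of_p and br_to_lb are hypothetical). Reading the text of <p> as a string decodes &lt;br/&gt; to <br/> and &amp; to a bare &, and a bare & is what a strict XML parser rejects. Escaping the ampersand again before re-parsing is one possible workaround, assuming the text contains no other stray markup characters.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    '<request id="foo">'
    '<p>The is a description with a line break&lt;br/&gt;and another '
    'linebreak&lt;br/&gt;and here is an ampersand&amp;.</p>'
    '</request>'
)


def text_of_p(source):
    """Return the text of <p>; the parser has already decoded the entities."""
    return ET.fromstring(source).find('p').text


def br_to_lb(text):
    """Re-escape bare ampersands, turn <br/> into <lb/>, and parse the result."""
    escaped = text.replace('&', '&amp;')
    fragment = '<p>' + escaped.replace('<br/>', '<lb/>') + '</p>'
    return ET.fromstring(fragment)


decoded = text_of_p(SOURCE)   # contains literal "<br/>" and a bare "&"
paragraph = br_to_lb(decoded)  # a <p> element with two <lb/> children
```

Parsing '<p>' + decoded + '</p>' directly raises ET.ParseError because of the bare ampersand, which mirrors the FODC0006 error in the question.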

Related

I want to access a node from an XML which value is a string

I have an XML document in which one node has XML as its
value (&lt;name&gt; &lt;name1&gt; &lt;name2&gt;value&lt;/name2&gt; &lt;/name1&gt; &lt;/name&gt;).
The problem is that this value is escaped text, not parsed XML.
How can I access this value by the node name name2?
I used these two functions:
substring-before(string, string)
substring-after(string, string)
to return what I want, and it works.
The problem is that I can't use these functions in the application I'm using, because it has a primitive version of XPath that doesn't accept advanced XPath.
<Input>
<Response>
&lt;name&gt; &lt;name1&gt; &lt;name2&gt;value&lt;/name2&gt; &lt;/name1&gt; &lt;/name&gt;
</Response>
</Input>
<name2>value</name2>
Your code is not well-formed XML.
But you can pre-process the entities with sed, for example:
sed -e "s/&lt;/</g; s/&gt;/>/g" input.xml
This command changes all &lt; and &gt; entities to < and >, respectively.
After this conversion, you can apply an XML parser to your data.
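The same two-step idea (unescape, then parse) can be sketched in Python with the standard library, assuming the Response value arrives as escaped text. The helper name inner_value is hypothetical. The outer parser decodes the entities while reading the text, so no separate sed pass is needed in this variant.

```python
import xml.etree.ElementTree as ET

OUTER = (
    '<Input><Response>'
    '&lt;name&gt; &lt;name1&gt; &lt;name2&gt;value&lt;/name2&gt; '
    '&lt;/name1&gt; &lt;/name&gt;'
    '</Response></Input>'
)


def inner_value(outer_xml, path):
    """Parse the outer document, then parse the Response text as XML."""
    # findtext returns the decoded string: "<name> <name1> ... </name>"
    escaped_text = ET.fromstring(outer_xml).findtext('Response')
    inner_root = ET.fromstring(escaped_text.strip())
    return inner_root.findtext(path)


result = inner_value(OUTER, './/name2')
```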

How to get parent element with attribute using xpath

I have posted sample XML and the expected output below; kindly help me get the result.
Sample XML
<root>
<A id="1">
<B id="2"/>
<C id="2"/>
</A>
</root>
Expected output:
<A id="1"/>
You can formulate this query in several ways:
Find elements that have a matching attribute, only descending all the time:
//*[@id=1]
Find the attribute, then ascend a step:
//@id[.=1]/..
Use the fn:id($id) function, given the document is validated and the ID-attribute is defined as such:
/id('1')
I think what you're after is not possible. There is no way to select a node without its children using XPath (meaning that it would always return the nodes B and C in your case).
You could achieve this using XQuery. I'm not sure if this is what you want, but here's an example where you create a new node based on an existing node that's stored in the $doc variable.
declare variable $doc := <root><A id="1"><B id="2"/><C id="2"/></A></root>;
element {fn:node-name($doc/*)} {$doc/*/@*}
The above returns <A id="1"></A>.
Is that what you are looking for?
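The "new element with the same name and attributes" idea can also be sketched in Python with xml.etree.ElementTree. This is an illustration under assumptions, not the XQuery above; shallow_copy is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

DOC = '<root><A id="1"><B id="2"/><C id="2"/></A></root>'


def shallow_copy(element):
    """Return a new element with the same tag and attributes but no children."""
    return ET.Element(element.tag, dict(element.attrib))


root = ET.fromstring(DOC)
matched = root.find(".//*[@id='1']")  # the <A> element, children included
bare = shallow_copy(matched)          # <A id="1"> without <B> and <C>
```

The selection step still returns the full subtree; only the copy drops the children, which matches the point made above about XPath.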
//*[@id='1']/parent::* , similar to //*[@id='1']/..
If you want to verify that the parent is root:
//*[@id='1']/parent::root
https://en.wikipedia.org/wiki/XPath
If you need not just the parent but a higher-level element with some attribute, read about axis specifiers and use the "ancestor::" axis.

XPath to only select the text contained within an element

I am new to XPath, so I apologize in advance for how basic this question is.
How do I extract just the text from a specific element? For example, how would I extract just "text" from the following?
<h1>text</h1>
I tried the following but it seems to select everything including the tags instead of just the text.
//h1/text()
Thanks for your help
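import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Parse the file into a DOM document.
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.parse(new File("src/myFile.xml"));

// Evaluating the path as a STRING returns the element's text content.
XPath xpath = XPathFactory.newInstance().newXPath();
String sessionId = (String) xpath.evaluate(
    "/Envelope/Body/LoginProcessResponse/loginResponse/sessionId",
    doc, XPathConstants.STRING);
Here Envelope is my parent element, and I just traversed to the required path (in my case, sessionId).
Hope it helps.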
This answer is rather an XSLT answer than an XPath answer, but many of the concepts are nevertheless applicable.
The XPath expression
//h1/text()
seems to be correct. It does select all text() nodes that are direct children of <h1> elements.
But one problem may be that the XSL default template still copies all the other text() nodes, as described in the W3C specification:
In the absence of a select attribute, the xsl:apply-templates instruction processes all of the children of the current node, including text nodes.
So to solve your problem, you have to define an explicit template that
ignores all other text() nodes like this:
<xsl:template match="text()" />
If you add this line to your XSL processing, the result will most likely be more pleasant to you.
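Outside XSLT, the difference between selecting the element and selecting only its text can be seen in a small Python sketch. The sample document is an assumption. ElementTree's path subset has no text() step, so the .text attribute stands in for //h1/text() here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; only the <h1> comes from the question.
DOC = '<html><body><h1>text</h1><p>other text</p></body></html>'

root = ET.fromstring(DOC)
h1 = root.find('.//h1')

# Serializing the element includes the tags.
serialized = ET.tostring(h1, encoding='unicode')

# The .text attribute holds only the text directly inside <h1>.
only_text = h1.text
```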

How to find the parent node by matching text using XPath

I have some XML:
<sys>
<lang>
<employee>
<name>Employee 1</name>
<code>4fdaa994-7015-4ec1-b365-de4ee0279966</code>
</employee>
<employee>
<name>Employee 2</name>
<code>1d960bdc-0853-49af-bb83-18cf92493897</code>
</employee>
</lang>
</sys>
How can I search and get the employee node where name = "Employee 1"?
I tried this but it didn't work:
obj.xpath("//sys/lang[/employee/name = 'Employee 1']")
This XPath
/sys/lang/employee[name = 'Employee 1']
will select the employee element whose name is Employee 1.
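As a quick check of that predicate, here is a minimal sketch using Python's xml.etree.ElementTree (a stand-in chosen for illustration, not the OP's library). ElementTree paths are relative to the root element, so the leading /sys step is dropped, and the outer double quotes leave the inner single quotes untouched.

```python
import xml.etree.ElementTree as ET

DOC = """<sys>
<lang>
<employee>
<name>Employee 1</name>
<code>4fdaa994-7015-4ec1-b365-de4ee0279966</code>
</employee>
<employee>
<name>Employee 2</name>
<code>1d960bdc-0853-49af-bb83-18cf92493897</code>
</employee>
</lang>
</sys>"""

root = ET.fromstring(DOC)  # root is the <sys> element

# Predicate on the child element <name>, as in the XPath above.
employee = root.find("lang/employee[name='Employee 1']")
code = employee.findtext('code')
```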
Why might OP be getting an "Invalid expression" using the above XPath?
1. Transcription error.
   Resolution: Use copy and paste.
2. Single quotes around single quotes.
   Resolution: Use outer double quotes: "/sys/lang/employee[name = 'Employee 1']"
3. Smart quotes.
   Resolution: Replace ‘ and ’ with single quote '.
4. Misinterpretation of error message.
   Resolution: Carefully check any line number mentioned in the error, or carve away surrounding code as much as possible, and see if the error goes away.
If none of the above possibilities apply, post an MCVE (Minimal, Complete, and Verifiable Example, including the provided XPath and the calling code -- the "complete" in MCVE) that produces the invalid expression error, and someone will likely immediately spot the problem.
I'm a big fan of using CSS over XPath for readability reasons. Nokogiri implements a number of jQuery's extensions to make it easier to use CSS for things we'd usually use XPath for.
I'd do it this way:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<sys>
<lang>
<employee>
<name>Employee 1</name>
<code>4fdaa994-7015-4ec1-b365-de4ee0279966</code>
</employee>
<employee>
<name>Employee 2</name>
<code>1d960bdc-0853-49af-bb83-18cf92493897</code>
</employee>
</lang>
</sys>
EOT
emp1 = doc.at('employee name:contains("Employee 1")') # => #<Nokogiri::XML::Element:0x3ffed05285b4 name="name" children=[#<Nokogiri::XML::Text:0x3ffed05283d4 "Employee 1">]>
emp1.to_xml # => "<name>Employee 1</name>"
emp1.parent.to_xml # => "<employee>\n <name>Employee 1</name>\n <code>4fdaa994-7015-4ec1-b365-de4ee0279966</code>\n </employee>"
Also note, it's not good practice to define the full path in the selector for a node. If the HTML or XML changes the structure that selector will break. Instead, find useful landmarks and hop from one to the next. That way your selector is more likely to survive changes in the markup. I only care about finding the appropriate <employee>...<name> combination, not those two tags embedded under <sys> and <lang>.
Sometimes an alternate way of getting to the information you want is to use search and look at a particular index:
doc.search('employee').first.to_xml # => "<employee>\n <name>Employee 1</name>\n <code>4fdaa994-7015-4ec1-b365-de4ee0279966</code>\n </employee>"
Or:
doc.at('employee').to_xml # => "<employee>\n <name>Employee 1</name>\n <code>4fdaa994-7015-4ec1-b365-de4ee0279966</code>\n </employee>"
at('some selector') is equivalent to search('some selector').first.
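The "landmark" advice and the at/search relationship carry over to other libraries. A minimal Python sketch with xml.etree.ElementTree follows; the RESTRUCTURED document and its <region> wrapper are hypothetical, added only to show that a relative selector survives a change in the surrounding markup.

```python
import xml.etree.ElementTree as ET

EMPLOYEE = (
    '<employee><name>Employee 1</name>'
    '<code>4fdaa994-7015-4ec1-b365-de4ee0279966</code></employee>'
)
ORIGINAL = '<sys><lang>' + EMPLOYEE + '</lang></sys>'
# Hypothetical later version of the markup with an extra wrapper element.
RESTRUCTURED = '<sys><region><lang>' + EMPLOYEE + '</lang></region></sys>'

# Relative selector: it names only the landmarks, not the full path.
SELECTOR = ".//employee[name='Employee 1']"


def first_match(xml_text):
    """Return the first element matching SELECTOR, or None."""
    return ET.fromstring(xml_text).find(SELECTOR)


root = ET.fromstring(ORIGINAL)
# find() returns the first match, like taking findall()[0].
same_element = root.find(SELECTOR) is root.findall(SELECTOR)[0]
```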

use YQL with substring-before in xpath

I am trying to get the string before '--' within a paragraph of an HTML page using XPath and send it to YQL.
For example, I want to get the date from the following article:
<div>
<p>Date --- the body of the article</p>
</div>
I tried this query in yql:
select * from html where url="article url" and xpath="//div/p/text()/[substring-before(.,'--')]"
but it does not work.
How can I get the date of the article, which is before the '--'?
You can simply use:
substring-before(//div/p,'--')
Use:
substring-before(/div/p/text(), '--')
This XPath expression evaluates to the string immediately preceding '--' in the first text node in the XML document, that is a child of a p that is a child of the div top element.
In case you want to get this value for every such text node, you have to use an expression like:
substring-before((//div/p/text())[$k], '--')
and evaluate this expression $N times, for $k = 1,2, ..., $N
where $N is count(//div/p/text())
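The per-node evaluation described above can be sketched in Python, assuming the page has already been parsed. The second <p> in the sample and the helper name substring_before are hypothetical. The helper returns an empty string when the separator is absent, which is how XPath's substring-before behaves.

```python
import xml.etree.ElementTree as ET

# First <p> is from the question; the second is a made-up extra paragraph.
DOC = '<div><p>Date --- the body of the article</p><p>Other -- text</p></div>'


def substring_before(text, separator):
    """Return the part of text before separator, or '' if it is absent."""
    head, found, _ = text.partition(separator)
    return head if found else ''


root = ET.fromstring(DOC)
# One value per <p>, instead of evaluating the expression N times.
dates = [substring_before(p.text or '', '--') for p in root.findall('p')]
```

Note that the result keeps the trailing space before '--'; stripping it is left to the caller.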
Do note: avoid the // XPath pseudo-operator whenever the structure of the XML document is statically known. Using // usually results in significant inefficiency (O(N^2)), which is especially painful on big XML documents.
