I have quite a large XML structure that in its simplest form looks kinda like this:
<document>
<body>
<section>
<p>Some text</p>
</section>
</body>
<backm>
<section>
<p>Some text</p>
<figure><title>This</title></figure>
</section>
</backm>
</document>
The section levels can be almost limitless (both within the body and backm elements), so I can have a section in a section in a section, etc., and the figure element can be within a numlist, an itemlist, a p, and a lot more elements.
What I want to do is check whether the title of a figure element is somewhere within the backm element. Is this possible?
A document could have multiple <backm> elements and it could have multiple <figure><title>Title</title></figure> elements in it. How you build your query depends on the situations you're trying to distinguish between.
//backm/descendant::figure/title
Will return the <title> elements that are the child of a <figure> element and the descendant of a <backm> element.
So:
count(//backm/descendant::figure/title) > 0
Will return True if there are 1 or more such title elements.
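A double-negation form is also possible, but note that it asks a slightly different question: it is true only when every <backm> element contains such a title (and also when the document has no <backm> element at all):
not(//backm[not(descendant::figure/title)])
I'm under the impression that this might perform better, but that depends on the XPath engine.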
//title[parent::figure][ancestor::backm]
Lists all <title> elements with a <figure> parent and a <backm> ancestor.
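For anyone who wants to try this check outside an XPath tool, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. That module supports only a subset of XPath (no descendant:: or ancestor:: axes, no count()), so the path .//backm//figure/title is used as the abbreviated equivalent of the first expression above. The sample document and the helper name backm_figure_titles are made up for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<document>
  <body>
    <section><p>Some text</p></section>
  </body>
  <backm>
    <section>
      <p>Some text</p>
      <section><figure><title>This</title></figure></section>
    </section>
  </backm>
</document>
"""

def backm_figure_titles(root):
    # Abbreviated form of //backm/descendant::figure/title:
    # any <title> child of a <figure> at any depth below a <backm>.
    return root.findall(".//backm//figure/title")

root = ET.fromstring(SAMPLE)
titles = backm_figure_titles(root)
has_title = len(titles) > 0  # same idea as count(...) > 0
print(has_title, [t.text for t in titles])
```

With a full XPath engine the expressions from the answers can be used unchanged; the Python list length simply stands in for count().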
I'm trying to access a certain element using XPath, but I just can't seem to get it, and I don't quite understand why.
<ul class="test1" id="content">
<li class="list">
<p>Insert random text here</p>
<div class="author">
</div>
</li>
<li class="list">
<p>I need this text here</p>
</li>
</ul>
Basically, the text I want is the second one, but I want/need to use something similar to p[not(div)] to retrieve it.
I have tried the methods from the following link but to no avail (xpath find node that does not contain child)
Here is how I tried accessing the text:
ul[contains(@id,"content")]//p[not(.//div)]/text()
If you have any possible answers, thank you!
The HTML snippet posted in the question shows that neither p element contains a div, so the expression //p[not(.//div)] would match both p elements. The first p element is a sibling of the div (both share the same parent element li), not its parent or ancestor. The following XPath expression would match text nodes from the second p and not those from the first one:
//ul[contains(@id,"content")]/li[not(div)]/p/text()
Brief explanation:
//ul[contains(@id,"content")]: find ul elements whose id attribute value contains the text "content"
/li[not(div)]: from such ul, find child li elements that don't have a child div element. This will match only the second li in the example HTML
/p/text(): from such li, find child p elements and then return the child text nodes from such p
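The same filtering logic can be sketched in Python with the standard-library xml.etree.ElementTree. This is only an illustration under some assumptions: ElementTree does not support not() or contains(), so the li[not(div)] test is written as a Python condition and the id is matched exactly. The wrapping body element is added here so the snippet parses as one document.

```python
import xml.etree.ElementTree as ET

HTML = """
<body>
<ul class="test1" id="content">
  <li class="list">
    <p>Insert random text here</p>
    <div class="author"></div>
  </li>
  <li class="list">
    <p>I need this text here</p>
  </li>
</ul>
</body>
"""

root = ET.fromstring(HTML)
texts = [
    p.text
    for li in root.findall(".//ul[@id='content']/li")
    if li.find("div") is None  # same idea as li[not(div)]
    for p in li.findall("p")   # child p elements of the surviving li
]
print(texts)
```

The condition is applied to the li, not to the p, which mirrors the point of the answer: the div is a sibling of the p, so the test has to happen one level up.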
Given a document:
<html>
<body>
<div>
<div>No span</div>
<span>Target</span>
</div>
</body>
</html>
I would like to select the <div> containing the <span>. However, when I use this selector:
//div[//span]
It matches both <div>s:
<div><div>No span</div><span>Target</span></div> <-- what I wanted
<div>No span</div> <-- this is also matched
I tested this on Google Chrome's Devtools, as well as several online XPath evaluators, so I assume this is the correct behavior.
Why is this happening, and how can I fix my selector?
select the <div> containing the <span>
Use relative paths.
//div[.//span]
// starts from the document root. .// starts from the context element.
Predicates evaluate to true when the contained expression selects nodes. This means that //div[//span] is always true when there is a <span> anywhere in the document, in which case all <div>s in the document will be selected. //div[.//span] is only true when there is a <span> anywhere in the respective <div>.
If you mean "has a <span> child" (as opposed to "has a <span> descendant") this will work:
//div[span]
which is a shorthand for this (to underline the difference between / and //):
//div[./span]
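To make the difference between the three predicates concrete, here is a small sketch in Python using the standard-library xml.etree.ElementTree. ElementTree cannot evaluate //div[//span] directly, so the first two predicates are emulated with Python conditions (an assumption of this sketch); only the child-element predicate [span] is handled natively.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<html><body><div><div>No span</div><span>Target</span></div></body></html>"
)
divs = root.findall(".//div")

# //div[//span]: the predicate ignores the current div and searches the whole document,
# so it is either true for every div or false for every div.
absolute = [d for d in divs if root.find(".//span") is not None]

# //div[.//span]: the predicate searches only below the current div.
relative = [d for d in divs if d.find(".//span") is not None]

# //div[span]: direct children only; ElementTree supports this predicate as written.
child = root.findall(".//div[span]")

print(len(absolute), len(relative), len(child))
```

The absolute version keeps both divs, while the relative and child versions keep only the outer one.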
Consider the following html
<div id="relevantID">
<div class="column left">
<h1> Section-Header-1 </h1>
<ul>
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
</ul>
</div>
<div class="column">
<ul> <!-- Pay attention here -->
<li>item1e</li>
<li>item1f</li>
</ul>
<h1> Section-Header-2 </h1>
<ul>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
</ul>
</div>
<div class="column right">
<h1> Section-Header-3 </h1>
<ul>
<li>item3a</li>
<li>item3b</li>
<li>item3c</li>
<li>item3d</li>
</ul>
</div>
</div>
My objective is to extract the items for each section header. However, inconveniently, the designer of the webpage decided to break up the data into three columns, adding an additional div (with classes column right, etc.).
My current method of extraction uses XPath. For section headers, I use this expression (get all h1 elements within a div with the given id):
//div[@id="relevantID"]//h1
This returns a list of h1 elements. Looping over each element, I apply an additional selector: for each matched h1 element, look up the next ul node and retrieve all its li nodes.
following-sibling::ul//li
But thanks to the designer's aesthetics, this fails in the one particular case I've marked in the HTML, where the items are split across two different column divs.
I can probably bypass this problem by stripping out the column divs entirely, but I don't think modifying the html to make a selector match is considered good (I haven't seen it needed anywhere in the examples I've browsed so far).
What would be a good way to extract data that has been formatted like this? Full solutions are not necessary; hints/tips will do. Thanks!
The columns do frustrate use of following-sibling:: and preceding-sibling::, but you could instead use the following:: and preceding:: axes if the columns at least keep the list items in proper document order. (That is indeed the case in your example.)
The following XPath will select all li items, regardless of column, occurring after the "Section-Header-1" h1 and before the "Section-Header-2" h1 header in document order:
//div[@id='relevantID']//li[normalize-space(preceding::h1) = 'Section-Header-1'
and normalize-space(following::h1) = 'Section-Header-2']
Specifically, it selects the following items from your example HTML:
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
<li>item1e</li>
<li>item1f</li>
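If the extraction is done from a host language anyway, the document-order idea can also be sketched without the following:: and preceding:: axes. The example below is a minimal illustration in Python with the standard-library xml.etree.ElementTree: it walks the elements in document order and assigns each li to the most recent h1. The shortened sample markup and the helper name items_by_header are assumptions made for this sketch, and it presumes the page parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

HTML = """
<div id="relevantID">
  <div class="column left">
    <h1> Section-Header-1 </h1>
    <ul><li>item1a</li><li>item1b</li></ul>
  </div>
  <div class="column">
    <ul><li>item1c</li></ul>
    <h1> Section-Header-2 </h1>
    <ul><li>item2a</li></ul>
  </div>
</div>
"""

def items_by_header(container):
    groups = {}
    current = None
    # iter() visits elements in document order, ignoring the column divs.
    for el in container.iter():
        if el.tag == "h1":
            current = el.text.strip()
            groups[current] = []
        elif el.tag == "li" and current is not None:
            groups[current].append(el.text)
    return groups

groups = items_by_header(ET.fromstring(HTML))
print(groups)
```

Here item1c ends up under Section-Header-1 even though it sits in a different column div, which is the case the sibling axes miss.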
You can combine following-sibling and preceding-sibling with the union operator | to also get li elements that appear in the same div before the h1. As an example, for the second h1:
((//div[@id="relevantID"]//h1)[2]/preceding-sibling::ul//li) |
((//div[@id="relevantID"]//h1)[2]/following-sibling::ul//li)
Result:
<li>item1e</li>
<li>item1f</li>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
As you're already selecting all h1 elements using //div[@id="relevantID"]//h1 and then, as a second step, retrieving the li items for each h1 using following-sibling::ul//li, you could change that second step to following-sibling::ul//li | preceding-sibling::ul//li.
I have an xml structure that looks like this:
<document>
<body>
<section>
<title>something</title>
<subtitle>Something again</subtitle>
<section>
<p xml:id="1234">Some text</p>
<figure id="2121"></figure>
<section>
<p xml:id="somethingagain">Some text</p>
<figure id="939393"></figure>
<p xml:id="countelement"></p>
</section>
</section>
</section>
<section>
<title>something2</title>
<subtitle>Something again2</subtitle>
<section>
<p xml:id="12345678">Some text2</p>
<figure id="939394"></figure>
<p xml:id="countelement2"></p>
</section>
</section>
</body>
</document>
How can I count the figure elements I have before the <p xml:id="countelement"></p> element using XPath?
Edit:
I only want to count figure elements within the parent section; in the next section it should start from 0 again.
If you're using an XPath 2.0 compatible engine, find the count elements and call fn:count() for each of them, passing the preceding figure elements as input.
This will return the number of figures preceding each "countelement" on the same level (I guess this is what you actually want):
//p[@xml:id="countelement"]/count(preceding-sibling::figure)
This will return the number of figures preceding each "countelement" on the same level and on the level above:
//p[@xml:id="countelement"]/count(preceding-sibling::figure | parent::*/preceding-sibling::figure)
This will return the number of all figures preceding each "countelement" in document order, regardless of level:
//p[@xml:id="countelement"]/count(preceding::figure)
If you're bound to XPath 1.0, you won't be able to get multiple results. If @id really is an id (and thus unique), you will be able to use this query:
count(//p[@xml:id="countelement"]/preceding::figure)
If there are "countelements" which are not <p/> elements, replace p by *.
count(id("countelement")/preceding-sibling::figure)
Please note that the xml:id attributes of two different elements cannot have the same value, such as "countelement". If you want two different elements to have a same-named attribute with the same value "countelement", it must be some other attribute, perhaps "kind", that is not of DTD attribute type ID. In that case, in place of id("countelement") you would use *[@kind="countelement"].
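For comparison, the "same level only" count, count(preceding-sibling::figure), can be sketched in Python with the standard-library xml.etree.ElementTree. ElementTree has no preceding-sibling axis, so this sketch scans each parent's children by hand. The trimmed sample and the helper name figures_before are assumptions for illustration. ElementTree exposes xml:id under the namespaced key shown in XML_ID.

```python
import xml.etree.ElementTree as ET

XML = """
<section>
  <p xml:id="1234">Some text</p>
  <figure id="2121"></figure>
  <section>
    <p xml:id="somethingagain">Some text</p>
    <figure id="939393"></figure>
    <p xml:id="countelement"></p>
  </section>
</section>
"""

# The xml: prefix is bound to this fixed namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def figures_before(root, target_id):
    # Look at every element's children; when the target is found,
    # count the <figure> siblings that come before it.
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.get(XML_ID) == target_id:
                return sum(1 for c in children[:i] if c.tag == "figure")
    return None  # no element with that xml:id below root

root = ET.fromstring(XML)
print(figures_before(root, "countelement"))
```

Only siblings inside the same section are counted, so the figure in the outer section does not contribute, which matches the "start from 0 again" requirement in the edit to the question.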
Take a look at the sample XML below--
<div id="main">
<div id="1">
Some random text
</div>
<div id="2">
Some random text
</div>
<div id="3">
Some random text
</div>
<p> Some more random text</p>
<div id="4">
Some random text
</div>
</div>
Now, how do I find out the number of divs within the main div using XQuery? And how do I do this in XPath?
You can use the following XPath:
count(div[@id="main"]/div)
The function count does the counting, the main div is selected by its id.
The XPath expressions below can be used in both XPath and XQuery, because XPath (2.0) is a proper subset of XQuery.
Use:
count(/*//div)
If "the main div" isn't the top element of the XML document, and this is the only div whose id attribute has string value of "main", use:
count((//div[@id='main'])[1]//div)
If it is guaranteed that the div children of the "main div" don't have div descendants, use:
count((//div[@id='main'])[1]/div)
Do note: The XPath pseudo-operator // can be very inefficient, so try to avoid it whenever the structure of the XML document is statically known and specific paths can be used.
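The difference between counting div children and div descendants can be checked with a short sketch in Python's standard-library xml.etree.ElementTree. ElementTree has no count() function, so len() of the result list is used instead. The wrapping body element and the extra nested div with id "nested" are not in the question's sample; they are added here only so the two counts differ.

```python
import xml.etree.ElementTree as ET

XML = """
<body>
<div id="main">
  <div id="1">Some random text</div>
  <div id="2">Some random text</div>
  <div id="3">Some random text</div>
  <p> Some more random text</p>
  <div id="4">Some random text<div id="nested">extra</div></div>
</div>
</body>
"""

root = ET.fromstring(XML)
main = root.find(".//div[@id='main']")  # first div whose id is "main"

child_count = len(main.findall("div"))         # like count(.../div)
descendant_count = len(main.findall(".//div")) # like count(...//div)
print(child_count, descendant_count)
```

The p element is skipped in both counts, and the nested div is only picked up by the descendant search.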