Only xpath for extracting text for multiple conditions in xml - no code possible - xpath

I have an example file with three conditions to be met... I also have no control over the xml file I get:
<?xml version="1.0" encoding="UTF-8"?>
<rootelement>
<Description>
<Note countries="AR,GB,US" >
<P countries="AR" >We want this one as it's AR.</P>
<P countries="US" >We don't want this one as it's not AR.</P>
<P countries="GB" >We don't want this either as it's not AR.</P>
</Note>
</Description>
<Description>
<Note countries="AR,GB,US" >
<P>Everyone in AR, GB and US gets to buy.</P>
<P>No restrictions for this product in these countries.</P>
</Note>
</Description>
<Description>
<Note>
<P>No country, that's because it will be treated as AR.</P>
</Note>
</Description>
</rootelement>
The task is threefold:
Extract text from <P> where countries="AR", other values are always ignored
Extract text from <P> where its parent element (in this example, but it's not always the case) contains AR in the countries attribute (countries="AR,GB,US" for example)
Extract text from the current element (<P> in this example, not always) when there is no countries attribute present on the current element or any of its ancestors
I hope that's clear, I tried to put three examples in the xml above and I need to extract these texts with my rule(s):
<P countries="AR" >We want this one as it's AR.</P>
<P>Everyone in AR, GB and US gets to buy.</P>
<P>No restrictions for this product in these countries.</P>
<P>No country, that's because it will be treated as AR.</P>
Ideally I want one rule. But I could use several as the rules are applied hierarchically.
If I use this in the application I'm feeding:
//*[contains(@countries,'AR')]/*
All good to get the first three, but I also get US and GB which I don't want. I can exclude them with this:
//*[contains(@countries,'AR')]/*[not(contains(@countries,'US')) and not(contains(@countries,'GB'))]
But the expression will become unmanageable in practice as there are many languages and I often need to change the ones I'm looking for. I cannot figure out how to say just exclude any that don't contain AR.
And then I still have the last problem of being able to extract if the countries attribute is missing altogether. This bit I'm at a complete loss to know how to resolve without affecting the previous results.

Here's an XPath 1 expression which I think captures the logic you've described:
//*[text()[normalize-space()]]
[
not(ancestor-or-self::*/@countries) or
contains(ancestor-or-self::*[@countries][1]/@countries, 'AR')
]
Any element which has a child text node which is not just white space, and
which has no countries attribute of its own or on any of its ancestor elements, or
has 'AR' in the nearest countries attribute, i.e. its own if it has one, otherwise that of the closest ancestor which has one.
NB the ancestor-or-self axis is a 'reverse' axis, which means the expression ancestor-or-self::* will return the context node itself, then its parent, then its grand-parent, etc, in that order, finishing at the root element of the document. The expression ancestor-or-self::*[@countries] will filter that list to include only the elements which have a countries attribute, and ancestor-or-self::*[@countries][1] will return the first element in that list. If the element that contains the text has a countries attribute, then it will be first in that list, otherwise the nearest ancestor with one will be first. I think this "inheritance" is what you're wanting to achieve?
Results:
<P countries="AR">We want this one as it's AR.</P>
<P>Everyone in AR, GB and US gets to buy.</P>
<P>No restrictions for this product in these countries.</P>
<P>No country, that's because it will be treated as AR.</P>
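For anyone who wants to check the "inheritance" logic outside the XPath engine, here is a minimal Python sketch of the same rule using only the standard library. It is an illustration, not part of the answer: the helper names has_own_text and select_for_country are made up, and the substring test mirrors contains(), so it would also match a longer code that merely contains 'AR'.

```python
import xml.etree.ElementTree as ET

XML = """<rootelement>
<Description>
<Note countries="AR,GB,US" >
<P countries="AR" >We want this one as it's AR.</P>
<P countries="US" >We don't want this one as it's not AR.</P>
<P countries="GB" >We don't want this either as it's not AR.</P>
</Note>
</Description>
<Description>
<Note countries="AR,GB,US" >
<P>Everyone in AR, GB and US gets to buy.</P>
<P>No restrictions for this product in these countries.</P>
</Note>
</Description>
<Description>
<Note>
<P>No country, that's because it will be treated as AR.</P>
</Note>
</Description>
</rootelement>"""


def has_own_text(elem):
    # Child text nodes of an element are its .text plus the .tail of each child.
    chunks = [elem.text] + [child.tail for child in elem]
    return any(chunk and chunk.strip() for chunk in chunks)


def select_for_country(root, code):
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent = {child: p for p in root.iter() for child in p}
    selected = []
    for elem in root.iter():
        if not has_own_text(elem):
            continue
        # Walk up to the nearest element (self included) carrying @countries.
        node = elem
        while node is not None and "countries" not in node.attrib:
            node = parent.get(node)
        # Keep the element if nothing declares countries, or the nearest
        # declaration contains the code (substring match, like contains()).
        if node is None or code in node.get("countries"):
            selected.append(elem)
    return selected


root = ET.fromstring(XML)
texts = [elem.text for elem in select_for_country(root, "AR")]
for text in texts:
    print(text)
```

Changing the second argument of select_for_country is the equivalent of changing 'AR' in the XPath expression, which is the only place the country code appears.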

Related

Trying to find two different text nodes from a descendant

Someone decided to make a site as unfriendly as possible by intention so I'm trying what I can to have our scraper still get to where it should.
<div class="issueDetails">
<div class="issueTitle ng-binding" style="">FANCY UNIQUE TEXT dd.MM.yyyy</div>
<a>COMPLETELY DIFFERENT TEXT</a>
I've left out the unnecessary details here, but I'm trying to find a match within the site through XPath (can't use anything else for this) that will find something which fulfils both conditions: FANCY UNIQUE TEXT dd.MM.yyyy as well as COMPLETELY DIFFERENT TEXT.
I've tried my luck with //div[@class='issueDetails']/descendant::*[contains(text(), 'FANCY UNIQUE TEXT dd.MM.yyyy') and contains(text(), 'COMPLETELY DIFFERENT TEXT')]
but it contains the erroneous logic that both unique things I need are in the same thing.
The first, FANCY UNIQUE TEXT, is the unique identifier for where I want to go. The second, COMPLETELY DIFFERENT TEXT, is what I need the scraper to click on to actually head to that specific one. So an XPath that finds both despite them being different descendants is necessary.
Is this what you're looking for :
//div[@class="issueDetails"]/*[contains(.,"COMPLETELY DIFFERENT TEXT") or translate(substring(.,string-length(.)-9,10),"123456789","000000000")="00.00.0000" and contains(.,'FANCY UNIQUE TEXT')]
It will return the two elements matching your conditions: div and a.
The translate, string-length and substring functions are used to check whether a date pattern is present at the end of the div element's text.
EDIT: Check whether the parent and its children together contain the text you're looking for, then get the children with:
//div[@class='issueDetails'][contains(.,"FANCY UNIQUE TEXT dd.MM.yyyy") and contains(.,"COMPLETELY DIFFERENT TEXT")]/*[contains(.,"FANCY UNIQUE TEXT dd.MM.yyyy") or contains(.,"COMPLETELY DIFFERENT TEXT")]
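The idea behind the edit (test the container for the identifying text, then pick the link inside it) can be sketched in Python with the standard library. This is only an illustration under assumptions: the closing tags and the second issueDetails block are invented to get a well-formed sample, and find_link is a hypothetical helper, not something from the scraper.

```python
import xml.etree.ElementTree as ET

# Assumed, well-formed version of the fragment; the second block is made up
# to show that only the container with the matching title is used.
HTML = """<body>
<div class="issueDetails">
<div class="issueTitle ng-binding" style="">FANCY UNIQUE TEXT dd.MM.yyyy</div>
<a>COMPLETELY DIFFERENT TEXT</a>
</div>
<div class="issueDetails">
<div class="issueTitle ng-binding" style="">SOME OTHER TITLE</div>
<a>COMPLETELY DIFFERENT TEXT</a>
</div>
</body>"""


def find_link(root, title, link_text):
    for div in root.iter("div"):
        if div.get("class") != "issueDetails":
            continue
        # Condition 1 is checked on the container's whole string value.
        if title not in "".join(div.itertext()):
            continue
        # Condition 2 is checked on a descendant of that same container.
        for a in div.iter("a"):
            if link_text in "".join(a.itertext()):
                return a
    return None


root = ET.fromstring(HTML)
link = find_link(root, "FANCY UNIQUE TEXT dd.MM.yyyy", "COMPLETELY DIFFERENT TEXT")
print(link.text)
```

In XPath terms this corresponds to putting one predicate on the div and the other on the a, for example //div[@class='issueDetails'][contains(., 'FANCY UNIQUE TEXT dd.MM.yyyy')]//a[contains(., 'COMPLETELY DIFFERENT TEXT')], which selects only the element to click.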

Avoid parentheses in path using XPath 1.0

The following XML structure represents a website with many articles. Every article contains, among many other things, date of its creation and possibly arbitrarily many dates of its modification. I want to get the date of the last access (either creation or last modification) to every article using XPath 1.0.
<website>
<article>
<date><strong>22.11.2017</strong></date>
<edits>
<edit><strong>17.12.2017</strong></edit>
</edits>
</article>
<article>
<date><strong>17.4.2016</strong></date>
<edits></edits>
</article>
<article>
<date><strong>3.5.2011</strong></date>
<edits>
<edit><strong>4.5.2011</strong></edit>
<edit><strong>12.8.2012</strong></edit>
</edits>
</article>
<article>
<date><strong>12.2.2009</strong></date>
<edits></edits>
</article>
<article>
<date><strong>23.11.1987</strong></date>
<edits>
<edit><strong>3.4.2001</strong></edit>
<edit><strong>11.5.2006</strong></edit>
<edit><strong>13.9.2012</strong></edit>
</edits>
</article>
</website>
In other words, the expected output is:
<strong>17.12.2017</strong>
<strong>17.4.2016</strong>
<strong>12.8.2012</strong>
<strong>12.2.2009</strong>
<strong>13.9.2012</strong>
So far I've only created this path:
//article/*[self::date or self::edits/edit][last()]
that looks for date and nonempty edits nodes in every article and selects the latter one. But I don't know how to access the latest strong of every such selection and the naive //strong[last()] appended to the end of the path doesn't work.
I found a solution in XPath 2.0. Either of these paths should work, if I'm not mistaken:
//article/(*[self::date or self::edits/edit][last()]//strong)[last()]
//article/(*//strong)[last()]
Such use of parentheses within path is invalid in XPath 1.0 though.
This XPath 1.0 expression
/website/article/descendant::strong[parent::date|parent::edit][last()]
Selects the nodes:
<strong>17.12.2017</strong>
<strong>17.4.2016</strong>
<strong>12.8.2012</strong>
<strong>12.2.2009</strong>
<strong>13.9.2012</strong>
Tested in http://www.xpathtester.com/xpath/56d8f7bc4b9c8c064fdad16f22469026
Do note: position predicates act over the context list of the step they are attached to, so here last() is evaluated separately for each article.
Here is the simple xpath to get your output.
//article/descendant-or-self::strong[last()]

XPATH Select All Attributes attr Except One On Specific Element elem

I was selecting all attributes id and everything was going nicely then one day requirements changed and now I have to select all except one!
Given the following example:
<root>
<structs id="123">
<struct>
<comp>
<data id="asd"/>
</comp>
</struct>
</structs>
</root>
I want to select all attributes id except the one at /root/structs/struct/comp/data
Please note that the Xml could be different.
Meaning, what I really want is: given any Xml tree, I want to select all attributes id except the one on element /root/structs/struct/comp/data
I tried the following:
//#id[not(ancestor::struct)] It kinda worked but I want to provide a full xpath to the ancestor axis which I couldn't
//#id[not(contains(name(), 'data'))] It didn't work because name selector returns the name of the underlying node which is the attribute not its parent element
The following should achieve what you're describing:
//#id[not(parent::data/parent::comp/parent::struct/parent::structs/parent::root)]
As you can see, it simply checks from bottom to top whether the id attribute's parent matches the path root/structs/struct/comp/data.
I think this should be sufficient for your needs, but it does not 100% ensure that the parent is at the path /root/structs/struct/comp/data because it could be, for example, at the path /someOtherHigherRoot/root/structs/struct/comp/data. I'm guessing that's not a possible scenario in your XML structure, but if you had to check for that, you could do this:
//#id[not(parent::data/parent::comp/parent::struct/parent::structs/parent::root[not(parent::*)])]

XPath (1.0) Match consecutive elements until specific child or end

This is for XPath 1.0.
Here is an example of the mark up that I am matching against. The actual number of elements is not known ahead of time and thus varies, but following this sort of of pattern:
<div class="entry">
<p><iframe /></p>
<p>Text 1</p>
<p>Text 2</p>
<p>Test 3</p>
<p><iframe /></p>
<p>
<a>Test 4</a>
<br />
<a>Test 5</a>
</p>
</div>
I am trying to to match every <p> that does not contain an <iframe>, up until the next <p> that does contain an <iframe> or until the end of the enclosing <div> element.
To make things slightly more complicated, for specific reasons I need to use each <iframe> as the base, a la //div[#class='entry']//iframe, so that each nodeset is based from
(//div[#class='entry']//iframe)[1]
(//div[#class='entry']//iframe)[2]
...
and thus, in this case, matching
<p>Text 1</p>
<p>Text 2</p>
<p>Test 3</p>
and
<p>
<a>Test 4</a>
<br />
<a>Test 5</a>
</p>
respectively.
I tried some of the following for testing to no avail:
(//div[#class='entry']//iframe)/ancestor::p/following-sibling::p[preceding-sibling::p[iframe]]
(or for testing):
(//div[#class='entry']//iframe)[1]/ancestor::p/following-sibling::p[preceding-sibling::p[iframe]]
(//div[#class='entry']//iframe)[2]/ancestor::p/following-sibling::p[preceding-sibling::p[iframe]]
and some variations thereof but what happens for the first set is it gets all <iframe>-less <p> elements all the way to the end instead of stopping at the next <p> that contains a <iframe>.
I've been at this for a while and even though I'm usually quite handy with this sort of thing, I can't quite work my way thorigh this one and none of the search results from Google and such have helped.
Thanks. Any help is always appreciated.
Edit: It can be assumed that there is only one occurrence of <div class="entry"> in the document.
What you are asking for can't be done in one single XPath 1.0 expression without help. The problem is that the question you want to ask is
Starting from an element X (the p-containing-an-iframe), find the other p elements for which that element's nearest preceding p-with-an-iframe is the original node X
If we had a variable $x holding a reference to the top-level context node (the p[iframe] we're starting from) then you could say something like the following (in XPath 2.0)
following-sibling::p[not(iframe)][preceding-sibling::p[iframe][1] is $x]
XPath 1.0 doesn't have an is operator to compare node identity but there are other proxies you can use for this, for example
following-sibling::p[not(iframe)][count(preceding-sibling::p[iframe])
= (count($x/preceding-sibling::p[iframe]) + 1)]
i.e. those following p elements that have one more preceding-sibling::p[iframe] than $x has.
The nub of the problem then is how to get at the outer context node from inside the inner predicate - pure XPath 1.0 has no way to do this. In XSLT you have the current() function, but otherwise you have two basic choices:
If your XPath library allows you to provide variable bindings to your expressions, then inject a variable $x containing the context node and use the expression I've given above.
If you can't inject variables then use two separate XPath queries in sequence.
First execute the expression
count(preceding-sibling::p[iframe]) + 1
with the relevant p[iframe] as context node, and take the result as a number. Or alternatively, if you're already iterating over these p[iframe] elements in your host language then just take the iteration number from there directly, you don't need to count it up using XPath. Either way, you can then build a second expression dynamically:
following-sibling::p[not(iframe)][count(preceding-sibling::p[iframe]) = N]
(where N is the result of the first expression/iteration counter) and evaluate that with the same context node, taking the final result as a node set.
I'm not sure I understood completely, but sometimes it helps to comment on an attempted solution rather than trying to explain.
Please try the following XPath expression:
//div[#class='entry']//iframe//p[not(descendant::iframe)]
And let me know if this yields the correct result.
If not,
explain how the result differs from what you need
please show a more complete HTML sample: a reasonable document with multiple div elements, and more than one where div[#class = 'entry'] - and otherwise covering all the complexity you describe.
explain why you added [1] and [2] to your expressions
give more details about the platform you're using XPath with, perhaps post code

Is my understanding of XPath axes correct?

I have made an info-graphic depicting the various axes in XPath. However, I am not sure as to whether they are correct.
I get confused in following, following-sibling, preceding and preceding-sibling
Is my diagram correct ?
The original image is here: http://imgur.com/4ekJxca
(Taken from Pro XML Development with Java)
Here is my understanding of the nodes I get confused in:
descendant:: selects the nodes (element and text only) which are children and grandchildren of the context node.
following:: selects any node (text only) which was not selected by descendant.
following-sibling:: all the 'brothers' of the context node. That is, text and element nodes which are children of the same parent as the context node, after the context node.
preceding::sibling all the 'brothers' of the context node. That is, text and element nodes which are children of the same parent as the context node, before the context node.
preceeding:: all the nodes (text only) that do not appear along the ancestor:: axis and are not nested in any element node. (I am sure I screwed this up)
XML
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns:journal="http://www.apress.com/catalog/journal" >
<journal:journal title="XML" publisher="IBM developerWorks">
<article journal:level="Intermediate"
date="February-2003">
<title>Design XML Schemas Using UML</title>
<author>Ayesha Malik</author>
</article>
</journal:journal>
<journal title="Java Technology" publisher="IBM developerWorks">
<article level="Advanced" date="January-2004">
<title>Design service-oriented architecture
frameworks with J2EE technology</title>
<author>Naveen Balani</author>
</article>
<article level="Advanced" date="October-2003">
<title>Advance DAO Programming</title>
<author>Sean Sullivan </author>
</article>
</journal>
</catalog>
The best way to gain accurate intuition about preceding and following axes is to imagine XML as a set of nested boxes or intervals, where each interval extends from the start tag to its matching end tag. In this picture you can see that any two distinct intervals a and b must be in exactly one of the following relationships:
a contains b (a/descendant::b);
a is contained by b (a/ancestor::b);
a is followed by b (a/following::b).
a is preceded by b (a/preceding::b);
If you keep to this model, you will never have a doubt in the semantics of the XPath axes.
Incidentally, this is why the tree model is bad for your intuition: it doesn't put the "nested boxes" paradigm to the forefront, so it's easy to get confused.

Resources