How do I choose a chunk of tags that are between two tags? - xpath

(and including the ending tag)
For example:
<xml>
<a></a>
<a><b></b></a>
<a></a>
<a></a>
<a><c></c></a>
<a></a>
<a><b></b></a>
<a><b></b></a>
<a></a>
<a></a>
<a><b></b></a>
</xml>
I need the three <a> elements that come after the one that includes <b>, ending with the one that includes <c>.
Or rather, "start from the one with <c> and select backwards until you see one with <b> or reach the start of the document"; that would be even better, because there can be a case with no 'start' <b> marker.
I need it to write an element-blocking rule for the uBlock Origin Chrome extension.

I think this should do the trick:
//a[c][1] |
//a[c][1]/preceding-sibling::a
[
not(
b or following-sibling::a[b]/following-sibling::a/c
)
]
Explanation:
the first a that contains a c, and also...
the a elements that precede that a, so long as they don't themselves either:
contain a b or
have a following a that contains a b and which is followed by another a that contains a c

I came up with this:
//a[not(b)][c | following-sibling::*[./*][1][./c]]
It selects every a without a b that either contains a c (the ending) or whose next sibling with any child element contains a c.
Or:
//a[c or not(*) and following-sibling::*[*][1][./c]]
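To check the intended result outside the browser, the "walk back from the <c> marker" rule can be sketched in Python. This is only an illustration: it uses the standard library's xml.etree.ElementTree, which has no sibling axes, so the walk is done in plain Python, and the function name select_block is made up for this example.

```python
import xml.etree.ElementTree as ET

XML = """<xml>
<a></a>
<a><b></b></a>
<a></a>
<a></a>
<a><c></c></a>
<a></a>
<a><b></b></a>
<a><b></b></a>
<a></a>
<a></a>
<a><b></b></a>
</xml>"""


def select_block(parent):
    """Return the run of <a> children ending at the first <a> that holds a <c>.

    Hypothetical helper: walks backwards from the <c> marker and stops at
    the first <a> containing a <b>, or at the start of the parent.
    """
    children = [el for el in parent if el.tag == "a"]
    # Index of the first <a> with a <c> child (the end marker).
    end = next((i for i, el in enumerate(children) if el.find("c") is not None), None)
    if end is None:
        return []
    start = end
    # Extend the block backwards while the previous <a> has no <b> child.
    while start > 0 and children[start - 1].find("b") is None:
        start -= 1
    return children[start:end + 1]


root = ET.fromstring(XML)
block = select_block(root)
print(len(block))  # 3
```

If there is no <b> marker before the <c>, the loop simply runs to the first <a>, which matches the "or start of document" case from the question.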

Related

XPath without specifying the tag? [duplicate]

Given this XML, what XPath returns all elements whose prop attribute contains Foo (the first three nodes):
<bla>
<a prop="Foo1"/>
<a prop="Foo2"/>
<a prop="3Foo"/>
<a prop="Bar"/>
</bla>
//a[contains(@prop,'Foo')]
Works if I use this XML to get results back.
<bla>
<a prop="Foo1">a</a>
<a prop="Foo2">b</a>
<a prop="3Foo">c</a>
<a prop="Bar">a</a>
</bla>
Edit:
Another thing to note is that while the XPath above will return the correct answer for that particular XML, if you want to guarantee you only get the "a" elements in element "bla", you should, as others have mentioned, also use
/bla/a[contains(@prop,'Foo')]
This will find all "a" elements in your entire XML document, regardless of whether they are nested in a "bla" element:
//a[contains(@prop,'Foo')]
I added this for the sake of thoroughness and in the spirit of stackoverflow. :)
This XPath will give you all nodes that have attributes containing 'Foo' regardless of node name or attribute name:
//attribute::*[contains(., 'Foo')]/..
Of course, if you're more interested in the contents of the attribute themselves, and not necessarily their parent node, just drop the /..
//attribute::*[contains(., 'Foo')]
descendant-or-self::*[contains(@prop,'Foo')]
Or:
/bla/a[contains(@prop,'Foo')]
Or:
/bla/a[position() <= 3]
Dissected:
descendant-or-self::
The Axis - search through every node underneath and the node itself. It is often better to say this than //. I have encountered some implementations where // means anywhere (descendant or self of the root node). Others use the default axis.
* or /bla/a
The Tag - a wildcard match, and /bla/a is an absolute path.
[contains(@prop,'Foo')] or [position() <= 3]
The condition within [ ]. @prop is shorthand for attribute::prop, as attribute is another search axis. Alternatively you can select the first 3 by using the position() function.
Have you tried something like:
//a[contains(@prop, "Foo")]
I've never used the contains function before but suspect that it should work as advertised...
John C is the closest, but XPath is case sensitive, so the correct XPath would be:
/bla/a[contains(@prop, 'Foo')]
If you also need to match the content of the link itself, use text():
//a[contains(@href,"/some_link")][text()="Click here"]
/bla/a[contains(@prop, "foo")]
try this:
//a[contains(@prop,'foo')]
that should work for any "a" tags in the document
For the code above...
//*[contains(@prop,'foo')]
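As a rough cross-check of which elements the case-sensitive 'Foo' test should return, here is a small Python sketch. It assumes only the standard library's xml.etree.ElementTree, whose XPath subset has no contains() function, so the substring test is done in Python instead.

```python
import xml.etree.ElementTree as ET

XML = """<bla>
<a prop="Foo1"/>
<a prop="Foo2"/>
<a prop="3Foo"/>
<a prop="Bar"/>
</bla>"""

root = ET.fromstring(XML)

# Equivalent of /bla/a[contains(@prop,'Foo')]: root is <bla>, so
# findall("a") returns its direct <a> children; the substring test
# (case sensitive, like contains()) is done in Python.
matches = [a for a in root.findall("a") if "Foo" in a.get("prop", "")]
props = [a.get("prop") for a in matches]
print(props)  # ['Foo1', 'Foo2', '3Foo']
```

Searching for lowercase "foo" with the same code returns an empty list, which is the case-sensitivity point made above.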

Select adjacent sibling elements without intervening non-whitespace text nodes

Given markup like:
<p>
<code>foo</code><code>bar</code>
<code>jim</code> and then <code>jam</code>
</p>
I need to select the first three <code>—but not the last. The logic is "Select all code elements that have a preceding or following sibling element that is also a code, unless there exist one or more text nodes with non-whitespace content between them."
Given that I am using Nokogiri (which uses libxml2) I can only use XPath 1.0 expressions.
Although a tricky XPath expression is desired, Ruby code/iterations to perform the same on a Nokogiri document are also acceptable.
Note that the CSS adjacent sibling selector ignores non-element nodes, and so selecting nokodoc.css('code + code') will incorrectly select the last <code> block.
Nokogiri.XML('<r><a/><b/> and <c/></r>').css('* + *').map(&:name)
#=> ["b", "c"]
Edit: More test cases, for clarity:
<section><ul>
<li>Go to <code>N</code> and
then <code>Y</code><code>Y</code><code>Y</code>.
</li>
<li>If you see <code>N</code> or <code>N</code> then…</li>
</ul>
<p>Elsewhere there might be: <code>N</code></p>
<p><code>N</code> across parents.</p>
<p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
<p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
</section>
All the Y above should be selected. None of the N should be selected. The content of the <code> are used only to indicate which should be selected: you may not use the content to determine whether or not to select an element.
The context elements in which the <code> appear are irrelevant. They may appear in <li>, they may appear in <p>, they may appear in something else.
I want to select all the consecutive runs of <code> at once. It is not a mistake that there is a space character in the middle of one of the sets of Y.
Use:
//code
[preceding-sibling::node()[1][self::code]
or
preceding-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
preceding-sibling::node()[2][self::code]
or
following-sibling::node()[1][self::code]
or
following-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
following-sibling::node()[2][self::code]
]
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"//code
[preceding-sibling::node()[1][self::code]
or
preceding-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
preceding-sibling::node()[2][self::code]
or
following-sibling::node()[1][self::code]
or
following-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
following-sibling::node()[2][self::code]
]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<section><ul>
<li>Go to <code>N</code> and
then <code>Y</code><code>Y</code><code>Y</code>.
</li>
<li>If you see <code>N</code> or <code>N</code> then…</li>
</ul>
<p>Elsewhere there might be: <code>N</code></p>
<p><code>N</code> across parents.</p>
<p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
<p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
</section>
the contained XPath expression is evaluated and the selected nodes are copied to the output:
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
//code[
(
following-sibling::node()[1][self::code]
or (
following-sibling::node()[1][self::text() and normalize-space() = ""]
and
following-sibling::node()[2][self::code]
)
)
or (
preceding-sibling::node()[1][self::code]
or (
preceding-sibling::node()[1][self::text() and normalize-space() = ""]
and
preceding-sibling::node()[2][self::code]
)
)
]
I think this does what you want, though I won’t claim you’d actually want to use it.
I’m assuming text nodes are always merged together so that there won’t be two adjacent to each other, which I believe is generally the case, but might not be if you’re doing DOM manipulations beforehand. I’ve also assumed that there won’t be any other elements between code elements, or that if there are they prevent selection like non-whitespace text.
I think this is what you want:
/p/code[not(preceding-sibling::text()[not(normalize-space(.)="")])]
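Since the question also accepts an iterative solution, the adjacency rule can be sketched as a loop. The version below is in Python rather than Ruby and assumes only the standard library's xml.etree.ElementTree, where the text between two sibling elements is stored as the left element's tail. The helper name adjacent_code_elements is made up, and the sample is the question's test document with the trailing ellipsis character dropped.

```python
import xml.etree.ElementTree as ET

XML = """<section><ul>
<li>Go to <code>N</code> and
then <code>Y</code><code>Y</code><code>Y</code>.
</li>
<li>If you see <code>N</code> or <code>N</code> then</li>
</ul>
<p>Elsewhere there might be: <code>N</code></p>
<p><code>N</code> across parents.</p>
<p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
<p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
</section>"""

# The characters XPath's normalize-space() treats as whitespace.
XML_WHITESPACE = " \t\r\n"


def adjacent_code_elements(root):
    """Collect <code> elements that touch another <code> sibling.

    Hypothetical helper: two siblings count as adjacent when the text
    between them (the left element's tail) is empty or whitespace only.
    """
    selected, seen = [], set()
    for parent in root.iter():
        children = list(parent)
        for left, right in zip(children, children[1:]):
            gap = (left.tail or "").strip(XML_WHITESPACE)
            if left.tag == "code" and right.tag == "code" and not gap:
                for el in (left, right):
                    if id(el) not in seen:
                        seen.add(id(el))
                        selected.append(el)
    return selected


root = ET.fromstring(XML)
labels = [el.text for el in adjacent_code_elements(root)]
print(labels)  # ['Y', 'Y', 'Y', 'Y', 'Y', 'Y']
```

The element text is read only to print which elements were picked; the selection itself looks at tags and the gaps between siblings, as the question requires.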

XPath expression for selecting all text in a given node, and the text of its children

Basically I need to scrape some text that has nested tags.
Something like this:
<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>
And I want an expression that will produce this:
This is an example bolded text
I have been struggling with this for an hour or more with no result.
Any help is appreciated
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
You want to call the XPath string() function on the div element.
string(//div[@id='theNode'])
You can also use the normalize-space function to reduce unwanted whitespace that might appear due to newlines and indenting in the source document. This will remove leading and trailing whitespace and replace sequences of whitespace characters with a single space. When you pass a nodeset to normalize-space(), the nodeset will first be converted to its string-value. If no arguments are passed to normalize-space it will use the context node.
normalize-space(//div[@id='theNode'])
// if theNode was the context node, you could use this instead
normalize-space()
You might want to use a more efficient way of selecting the context node than the example XPath I have been using. E.g., the following JavaScript example can be run against this page in some browsers.
var el = document.getElementById('question');
var result = document.evaluate('normalize-space()', el, null, XPathResult.STRING_TYPE, null).stringValue;
The whitespace only text node between the span and b elements might be a problem.
Use:
string(//div[@id='theNode'])
When this expression is evaluated, the result is the string value of the first (and hopefully only) div element in the document.
As the string value of an element is defined in the XPath Specification as the concatenation in document order of all of its text-node descendants, this is exactly the wanted string.
Because this can include a number of all-white-space text nodes, you may want to eliminate contiguous leading and trailing white-space and replace any such intermediate white-space by a single space character:
Use:
normalize-space(string(//div[@id='theNode']))
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
"<xsl:copy-of select="string(//div[@id='theNode'])"/>"
===========
"<xsl:copy-of select="normalize-space(string(//div[@id='theNode']))"/>"
</xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document:
<div id='theNode'> This is an
<span style="color:red">example</span>
<b>bolded</b> text
</div>
the two XPath expressions are evaluated and the results of these evaluations are copied to the output:
" This is an
example
bolded text
"
===========
"This is an example bolded text"
If you are using Scrapy in Python, you can use descendant-or-self::*/text(). Full example:
import scrapy

txt = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""
selector = scrapy.Selector(text=txt, type="html") # Create HTML doc from HTML text
all_txt = selector.xpath('//div/descendant-or-self::*/text()').getall()
final_txt = ''.join(all_txt).strip() # Concatenate the text nodes in document order
print(final_txt) # 'This is an example bolded text'
How about this:
/div/text()[1] | /div/span/text() | /div/b/text() | /div/text()[2]
Hmm, I am not sure about the last part, though. You might have to play with that.
The normal expression
//div[@id='theNode']
gets all the text, but if the text nodes come back split, then use
//div[@id='theNode']/text()
I'm not sure, but if you provide the link I will try it.
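For a dependency-free comparison, the string-value and normalize-space() behaviour described in the answers above can be approximated with Python's standard library. This is a minimal sketch, assuming xml.etree.ElementTree and the question's sample markup; str.split() treats a slightly wider set of characters as whitespace than normalize-space() does, so it is an approximation rather than an exact equivalent.

```python
import xml.etree.ElementTree as ET

XML = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""

div = ET.fromstring(XML)

# itertext() yields every descendant text node in document order,
# which is how XPath defines the string-value of an element.
raw = "".join(div.itertext())

# split() + join approximates normalize-space(): trim both ends and
# collapse each run of whitespace to a single space.
normalized = " ".join(raw.split())
print(normalized)  # This is an example bolded text
```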

Simple xpath question

I'm thinking this is a very simple xpath question .. I'm just not sure why my xpath isn't working.
Here's what my XML looks like
<A>
<B>foo</B>
</A>
<C>
<A>
<B>foo</B>
</A>
</C>
Now .. I want to grab all "A" elements which contain a "B" with contained text "foo".
//A[B[text()='foo']]
//A matches all As
//A[B] that have a B as a child
//A[B[text()='foo']] which contains foo as text.
I suggest reading the XPath tutorial at w3schools.com
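To try the predicate step by step, a small Python sketch may help. It assumes the standard library's xml.etree.ElementTree, whose XPath subset has no text() test, so [B='foo'] is used as the nearest supported form. The question's fragment has two top-level elements, so it is wrapped in a made-up <root> element here.

```python
import xml.etree.ElementTree as ET

# The wrapper <root> is an assumption added to make the fragment well-formed.
XML = """<root>
<A>
<B>foo</B>
</A>
<C>
<A>
<B>foo</B>
</A>
</C>
</root>"""

root = ET.fromstring(XML)

# .//A matches every <A> below the root; [B='foo'] keeps those with
# a <B> child whose text content is exactly "foo".
matches = root.findall(".//A[B='foo']")
print(len(matches))  # 2
```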

XPath - Get node with no child of specific type

XML: /A/B or /A
I want to get all A nodes that do not have any B children.
I've tried
/A[not(B)]
/A[not(exists(B))]
without success
I prefer a solution with the syntax /*[local-name()="A" and .... ], if possible. Any ideas that work?
Clarification. The xml looks like:
<WhatEver>
<A>
<B></B>
</A>
</WhatEver>
or
<WhatEver>
<A></A>
</WhatEver>
Maybe
*[local-name() = 'A' and not(descendant::*[local-name() = 'B'])]?
Also, there should be only one root element, so for /A[...] you're either getting all your XML back or none. Maybe //A[not(B)] or /*/A[not(B)]?
I don't really understand why /A[not(B)] doesn't work for you.
~/xml% xmllint ab.xml
<?xml version="1.0"?>
<root>
<A id="1">
<B/>
</A>
<A id="2">
</A>
<A id="3">
<B/>
<B/>
</A>
<A id="4"/>
</root>
~/xml% xpath ab.xml '/root/A[not(B)]'
Found 2 nodes:
-- NODE --
<A id="2">
</A>
-- NODE --
<A id="4" />
Try this "/A[not(.//B)]" or this "/A[not(./B)]".
The first / causes XPath to start at the root of the document, I doubt that is what you intended.
Perhaps you meant //A[not(B)] which would find all A nodes in the document at any level that do not have a direct B child.
Or perhaps you are already at a node that contains A nodes in which case you just want A[not(B)] as the XPath.
If you are trying to get A anywhere in the hierarchy from the root, this works (for XSLT 1.0 as well as 2.0, in case it's used in XSLT)
//descendant-or-self::node()[local-name(.) = 'a' and not(count(b))]
OR you can also do
//descendant-or-self::node()[local-name(.) = 'a' and not(b)]
OR also
//descendant-or-self::node()[local-name(.) = 'a' and not(child::b)]
There are any number of ways in XSLT to achieve the same thing.
Note: XPaths are case-sensitive, so if your node names are different (which I am sure they are; no one is going to use A and B), then please make sure the case matches.
Use this:
/*[local-name()='A' and not(descendant::*[local-name()='B'])]
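The difference between "has no B child" and the other variants is easy to test on a small document. Below is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree, which does not support not() in paths, so the negation is written in Python. The sample mirrors the ab.xml file from the xmllint answer above.

```python
import xml.etree.ElementTree as ET

XML = """<root>
<A id="1"><B/></A>
<A id="2"></A>
<A id="3"><B/><B/></A>
<A id="4"/>
</root>"""

root = ET.fromstring(XML)

# Equivalent of //A[not(B)]: every <A> at any depth with no direct <B> child.
# find() returns None when nothing matches, so compare against None
# rather than relying on the truthiness of an element.
without_b = [a.get("id") for a in root.iter("A") if a.find("B") is None]
print(without_b)  # ['2', '4']
```

Replacing a.find("B") with a.find(".//B") would give the not(descendant::B) variant instead.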
