Select distinct values with Xpath - xpath

I'm using this XPath query:
//li[contains(@class, 'cmil_header')]/span[contains(@class, 'cmil_theatre')] and the result of this query is:
Park
Saga Tokey
Latvia
Latvia
Skande
Paramount
Paramount
Paramount
Oslo
Oslo
...
I have been searching and have come to the conclusion that there is an option to select unique or distinct node values/items with XPath, but I can't get it to work.
I have managed to select a specific item with //li[contains(@class, 'cmil_header')][1]/span[contains(@class, 'cmil_theatre')] (Park in this case), and I thought //li[contains(@class, 'cmil_header')][distinct-values()]/span[contains(@class, 'cmil_theatre')] would work, but it does not.
My question:
What should my query look like to produce:
Park
Saga Tokey
Latvia
Skande
Paramount
Oslo
...
Edit: Pastebin with a sample:
http://pastebin.com/a3x7hRFu

XPath 1.0 solution (where there is no distinct-values function) that relies on the duplicates being sequential:
//li[contains(@class, 'cmil_header')]/span[contains(@class, 'cmil_theatre') and (not(../preceding-sibling::li[contains(@class, 'cmil_header')]) or ../preceding-sibling::li[contains(@class, 'cmil_header')][1]/span[contains(@class, 'cmil_theatre')]/text() != ./text())]
find all li nodes that contain the cmil_header class: //li[contains(@class, 'cmil_header')]
find the child span nodes that contain the cmil_theatre class: /span[contains(@class, 'cmil_theatre') and
where there is no previous li node containing the cmil_header class: (not(../preceding-sibling::li[contains(@class, 'cmil_header')])
or the previous li node containing the cmil_header class has a span node child that contains the cmil_theatre class: or ../preceding-sibling::li[contains(@class, 'cmil_header')][1]/span[contains(@class, 'cmil_theatre')]
and the text content of that span is not the same as the text content of... : /text() !=
...this span: ./text())]
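The same "differs from the previous header" rule can be sketched outside XPath. The following is a minimal Python illustration using the standard-library ElementTree module, which does not support contains() or preceding-sibling, so the filtering is done in plain Python. The sample markup, the cmil_row class, and the function names are assumptions made for the example, not taken from the linked page.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; the real page structure is only assumed here.
SAMPLE = """
<ul>
  <li class="cmil_header"><span class="cmil_theatre">Park</span></li>
  <li class="cmil_row">a showing</li>
  <li class="cmil_header"><span class="cmil_theatre">Latvia</span></li>
  <li class="cmil_header"><span class="cmil_theatre">Latvia</span></li>
  <li class="cmil_header"><span class="cmil_theatre">Oslo</span></li>
  <li class="cmil_header"><span class="cmil_theatre">Oslo</span></li>
</ul>
"""

def theatre_names(root):
    """Text of each cmil_theatre span inside a cmil_header li, in document order."""
    names = []
    for li in root.iter("li"):
        if "cmil_header" not in li.get("class", ""):
            continue
        for span in li.findall("span"):
            if "cmil_theatre" in span.get("class", ""):
                names.append(span.text or "")
    return names

def drop_adjacent_duplicates(values):
    """Keep a value only when it differs from the one immediately before it."""
    kept = []
    for value in values:
        if not kept or kept[-1] != value:
            kept.append(value)
    return kept

root = ET.fromstring(SAMPLE)
result = drop_adjacent_duplicates(theatre_names(root))
print(result)
```

Like the XPath expression, this only removes duplicates that sit next to each other; a name that reappears later in the list would be kept a second time.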

I thought //li[contains(@class, 'cmil_header')][distinct-values()]/span[contains(@class, 'cmil_theatre')] would work, but not.
No, there is no way this could work. I find it hard to know what you were imagining. The most basic error is that distinct-values() expects an argument. More subtly, you really don't seem to have understood how predicates (expressions in square brackets) work.
What would work -- assuming your XPath processor supports XPath 2.0 -- is
distinct-values(//li[contains(@class, 'cmil_header')]/
span[contains(@class, 'cmil_theatre')])
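If no XPath 2.0 processor is available, the de-duplication can also be done in the host language after an XPath 1.0 query has returned the raw values. A minimal Python sketch, assuming the theatre names have already been extracted into a list (the function name is made up for the example):

```python
def distinct_values(values):
    """Order-preserving de-duplication, similar in spirit to distinct-values()."""
    # dict keys are unique and keep insertion order, so first occurrences win.
    return list(dict.fromkeys(values))

names = ["Park", "Saga Tokey", "Latvia", "Latvia", "Skande",
         "Paramount", "Paramount", "Paramount", "Oslo", "Oslo"]
unique_names = distinct_values(names)
print(unique_names)
```

Unlike the sequential XPath 1.0 approach, this also removes duplicates that are not adjacent. Note that XPath 2.0 leaves the order of the distinct-values() result up to the implementation, whereas this sketch always keeps first-seen order.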

Related

Using Xpath contains and string() to get the innermost node

Here is the html:
<div id="div_1">
<div id="div_2">
<span>today</span>
<span>tomorrow</span>
</div>
</div>
I want to get the innermost node whose child nodes contain the string 'todaytomorrow'.
I can use $x('//*[contains(string(), "todaytomorrow")]'), but div_1 and div_2 are both returned. So how can I return only div_2, which is the innermost node?
It's not clear how the XPath expression you tried is actually producing the result stated without accounting for whitespace unless you formatted the XML differently in the question. This version of the initial attempt produces the stated undesirable result.
//*[contains(normalize-space(string()), "today tomorrow")]
You can modify this with a predicate to eliminate all but the last selected node:
(//*[contains(normalize-space(string()), "today tomorrow")])[last()]
If the original XPath expression is really selecting both nodes for you, then you can modify that original expression like:
(//*[contains(string(), "todaytomorrow")])[last()]
As suggested in other comments, there are definitely other solutions, such as those taking advantage of the differences between text() and string().
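One such alternative is to define "innermost" explicitly: an element matches when its text contains the string and none of its child elements do. The sketch below illustrates that idea in Python with the standard-library ElementTree module. It is a different technique from the [last()] expressions above, and it strips all whitespace before comparing, which is an assumption made so that "todaytomorrow" matches across the line break between the two spans.

```python
import xml.etree.ElementTree as ET

HTML = """
<div id="div_1">
<div id="div_2">
<span>today</span>
<span>tomorrow</span>
</div>
</div>
"""

def string_value(element):
    """Concatenated descendant text with all whitespace removed."""
    return "".join("".join(element.itertext()).split())

def innermost_containing(root, needle):
    """Elements whose text contains needle while no child element's text does."""
    matches = []
    for element in root.iter():
        if needle not in string_value(element):
            continue
        if not any(needle in string_value(child) for child in element):
            matches.append(element)
    return matches

root = ET.fromstring(HTML)
ids = [element.get("id") for element in innermost_containing(root, "todaytomorrow")]
print(ids)
```

Selecting the last match in document order happens to give the innermost element for this markup, but the child check above does not depend on where the element appears in the document.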

Extract value of div element which isn't an ID or a CLASS with XPATH

I would like to extract the info of "data-history-node-id" of this kind of code :
<div data-history-node-id="1001" role="article" about="/url-article" typeof="schema:Article" class="main-content">
here it would be 1001
I know how to select by an ID or a CLASS, but not by an attribute like this one.
Thanks
Try one of these xpath expressions:
/div/@data-history-node-id
or
/div/data(@data-history-node-id)
Depending on your implementation, at least one should output 1001.
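For comparison, many XML libraries expose attributes directly on the element, so the value can be read without an attribute-axis step. A minimal Python sketch with the standard-library ElementTree module, assuming the div from the question is the root of a well-formed fragment (a closing tag has been added for that reason):

```python
import xml.etree.ElementTree as ET

# The div from the question, closed so that it parses as well-formed XML.
SNIPPET = ('<div data-history-node-id="1001" role="article" about="/url-article" '
           'typeof="schema:Article" class="main-content"></div>')

div = ET.fromstring(SNIPPET)
# Element.get returns the attribute value as a string, or None if it is absent.
node_id = div.get("data-history-node-id")
print(node_id)
```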

XPath without specifying the tag? [duplicate]

Given this XML, what XPath returns all elements whose prop attribute contains Foo (the first three nodes):
<bla>
<a prop="Foo1"/>
<a prop="Foo2"/>
<a prop="3Foo"/>
<a prop="Bar"/>
</bla>
//a[contains(@prop,'Foo')]
Works if I use this XML to get results back.
<bla>
<a prop="Foo1">a</a>
<a prop="Foo2">b</a>
<a prop="3Foo">c</a>
<a prop="Bar">a</a>
</bla>
Edit:
Another thing to note is that while the XPath above will return the correct answer for that particular XML, if you want to guarantee that you only get the "a" elements in the "bla" element, you should, as others have mentioned, also use
/bla/a[contains(@prop,'Foo')]
The following searches all "a" elements in your entire XML document, regardless of whether they are nested in a "bla" element:
//a[contains(@prop,'Foo')]
I added this for the sake of thoroughness and in the spirit of stackoverflow. :)
This XPath will give you all nodes that have attributes containing 'Foo' regardless of node name or attribute name:
//attribute::*[contains(., 'Foo')]/..
Of course, if you're more interested in the contents of the attribute themselves, and not necessarily their parent node, just drop the /..
//attribute::*[contains(., 'Foo')]
descendant-or-self::*[contains(@prop,'Foo')]
Or:
/bla/a[contains(@prop,'Foo')]
Or:
/bla/a[position() <= 3]
Dissected:
descendant-or-self::
The Axis - search through every node underneath and the node itself. It is often better to say this than //. I have encountered some implementations where // means anywhere (descendant or self of the root node). Others use the default axis.
* or /bla/a
The Tag - a wildcard match, and /bla/a is an absolute path.
[contains(@prop,'Foo')] or [position() <= 3]
The condition within [ ]. @prop is shorthand for attribute::prop, as attribute is another search axis. Alternatively you can select the first 3 by using the position() function.
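The substring test is easy to check outside an XPath engine. The sketch below uses Python's standard-library ElementTree module, whose XPath subset has no contains() function, so the predicate is written as a plain Python condition. It is only an illustration of what the expressions select, not a replacement for them.

```python
import xml.etree.ElementTree as ET

XML = """
<bla>
<a prop="Foo1"/>
<a prop="Foo2"/>
<a prop="3Foo"/>
<a prop="Bar"/>
</bla>
"""

root = ET.fromstring(XML)

# Similar in spirit to /bla/a[contains(@prop, 'Foo')]: direct "a" children only.
by_substring = [a.get("prop") for a in root.findall("a") if "Foo" in a.get("prop", "")]

# Similar in spirit to //*[contains(@prop, 'Foo')]: any element name.
any_element = [el.tag for el in root.iter() if "Foo" in el.get("prop", "")]

# The comparison is case sensitive: lowercase 'foo' matches nothing here.
lowercase = [a.get("prop") for a in root.iter("a") if "foo" in a.get("prop", "")]

print(by_substring, any_element, lowercase)
```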
Have you tried something like:
//a[contains(@prop, "Foo")]
I've never used the contains function before but suspect that it should work as advertised...
John C is the closest, but XPath is case sensitive, so the correct XPath would be:
/bla/a[contains(@prop, 'Foo')]
If you also need to match the content of the link itself, use text():
//a[contains(@href,"/some_link")][text()="Click here"]
/bla/a[contains(@prop, "foo")]
try this:
//a[contains(@prop,'foo')]
that should work for any "a" tags in the document
For the code above...
//*[contains(@prop,'foo')]

XPATH - get all inner nodes except a particular one

this is my HTML
<book>
<div id="name"></div>
<span id="age"></span>
<p id="contact_number"></p>
...
...
(more attributes)
</book>
I need to extract all the text() inside <book></book> except the p with id="contact_number",
so basically I need //book//text() except //book//p[@id="contact_number"]//text()
How can i do this in a single xpath query?
There might be a better way if you can put the requirement differently. Anyway, to answer the question the way it was asked, you can try this:
//book//text()[not(ancestor::p/@id='contact_number')]
or maybe just use parent::p instead of ancestor::p :
//book//text()[not(parent::p/@id='contact_number')]
add [normalize-space()] at the end if you need to filter out empty text nodes.
Try the following:
//*[not(self::p[@id = 'contact_number'])]/text()[normalize-space()]
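The exclusion can also be written as a small tree walk, which may help when checking what the expressions above are expected to return. The following Python sketch uses the standard-library ElementTree module and skips the whole excluded subtree, which corresponds to the ancestor::p variant. The element contents and the extra "city" paragraph are hypothetical, since the question shows empty elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents; the question only shows empty elements.
XML = """
<book>
<div id="name">Ann</div>
<span id="age">30</span>
<p id="contact_number">555-0100</p>
<p id="city">Oslo</p>
</book>
"""

def texts_except(element, skip_tag, skip_id):
    """Yield stripped, non-blank text under element, skipping the excluded subtree."""
    if element.tag == skip_tag and element.get("id") == skip_id:
        return
    if element.text and element.text.strip():
        yield element.text.strip()
    for child in element:
        yield from texts_except(child, skip_tag, skip_id)
        # Tail text sits after the child but belongs to the parent element.
        if child.tail and child.tail.strip():
            yield child.tail.strip()

book = ET.fromstring(XML)
texts = list(texts_except(book, "p", "contact_number"))
print(texts)
```

Unlike the [normalize-space()] predicate, which only filters out blank text nodes, this sketch also trims the surrounding whitespace from the text it keeps.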

Use Nokogiri to get all nodes in an element that contain a specific attribute name

I'd like to use Nokogiri to extract all nodes in an element that contain a specific attribute name.
e.g., I'd like to find the 2 nodes that contain the attribute "blah" in the document below.
@doc = Nokogiri::HTML::DocumentFragment.parse <<-EOHTML
<body>
<h1 blah="afadf">Three's Company</h1>
<div>A love triangle.</div>
<b blah="adfadf">test test test</b>
</body>
EOHTML
I found this suggestion (below) at this website: http://snippets.dzone.com/posts/show/7994, but it doesn't return the 2 nodes in the example above. It returns an empty array.
# get elements with attribute:
elements = @doc.xpath("//*[@*[blah]]")
Thoughts on how to do this?
Thanks!
I found this here
elements = @doc.xpath("//*[@*[blah]]")
This is not a useful XPath expression. It says to give you all elements that have attributes that have child elements named 'blah'. And since attributes can't have child elements, this XPath will never return anything.
The DZone snippet is confusing in that when they say
elements = @doc.xpath("//*[@*[attribute_name]]")
the inner square brackets are not literal... they're there to indicate that you put in the attribute name. Whereas the outer square brackets are literal. :-p
They also have an extra * in there, after the @.
What you want is
elements = @doc.xpath("//*[@blah]")
This will give you all the elements that have an attribute named 'blah'.
You can use CSS selectors:
elements = @doc.css "[blah]"
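The attribute-existence predicate [@blah] is not specific to Nokogiri. As a rough cross-check, here is the same selection sketched in Python with the standard-library ElementTree module, whose XPath subset supports this predicate. This is not the Nokogiri API, and the leading ".//" is needed because ElementTree paths are relative to the element they are called on.

```python
import xml.etree.ElementTree as ET

DOC = """
<body>
<h1 blah="afadf">Three's Company</h1>
<div>A love triangle.</div>
<b blah="adfadf">test test test</b>
</body>
"""

root = ET.fromstring(DOC)
# './/*[@blah]' selects every descendant element that has a 'blah' attribute.
elements = root.findall(".//*[@blah]")
tags = [el.tag for el in elements]
print(tags)
```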
