XPath: How to select elements with a specific attribute?

I am learning XPath and am trying to extract some data from HTML with it.
How do I select the "A" elements (identified by Number) whose C contains more than one C1 and a C1 with Price < 20?
I want to select these elements:
<A Number=1234 Date=05-25-2007>
<A Number=1235 Date=05-26-2007>
<A Number=1237 Date=05-25-2007>
<A>
<A Number="1234" Date="05-25-2007">
<B>
<B1>Judith Miller</B1>
<Tax N="Yes" Rate="21"/>
</B>
<C>
<C1 x="xxxxx" Price="20"/>
<C1 x="yyyyy" Price="15"/>
</C>
</A>
<A Number="1235" Date="05-26-2007">
<B>
<B1>Herbert Marshall</B1>
<Adress Street="Saint Marc 2250" City="Oslo"/>
<Tax N="Yes" Rate="21"/>
</B>
<C>
<C1 x="yyyy" Price="25"/>
<C1 x="zzzz" Price="12"/>
<C1 x="xxxx" Price="22"/>
</C>
</A>
<A Number="1236" Date="05-26-2007">
<B>
<Nazwa>Judith Miller</Nazwa>
<Adress Street="Kennedy 511" City="Florida"/>
<Tax N="Yes" Rate="21"/>
</B>
<C>
<C1 x="fffff" Price="15"/>
</C>
</A>
<A Number="1237" Date="05-25-2007">
<B>
<B1>Harrison Faber</B1>
<Adress Street="Street 326" City="London"/>
<Tax N="No" Rate="0"/>
</B>
<C>
<C1 x="xxx" Price="20"/>
<C1 x="yyy" Price="9"/>
</C>
</A>
</A>
What is the XPath expression for selecting these elements?
Cheers

After fixing your data to be well-formed and uniform, I was able to get the three A's with
//A[count(C/C1) > 1][C/C1/@Price < 20]
Fixed data:
<root>
<A Number="1234" Date="05-25-2007">
<B>
<B1>Judith Miller</B1>
<Tax due="Yes" Rate="21"/>
</B>
<C>
<C1 x="xxxxx" Price="20"/>
<C1 x="yyyyy" Price="15"/>
</C>
</A>
<A Number="1235" Date="05-26-2007">
<B>
<B1>Herbert Marshall</B1>
<Adress Street="Saint Marc 2250" City="Oslo"/>
<Tax due="Yes" Rate="21"/>
</B>
<C>
<C1 x="yyyy" Price="25"/>
<C1 x="zzzz" Price="12"/>
<C1 x="xxxx" Price="22"/>
</C>
</A>
<A Number="1236" Date="05-26-2007">
<B>
<B1>Judith Miller</B1>
<Adress Street="Kennedy 511" City="Florida"/>
<Tax due="Yes" Rate="21"/>
</B>
<C>
<C1 x="fffff" Price="15"/>
</C>
</A>
<A Number="1237" Date="05-25-2007">
<B>
<B1>Harrison Faber</B1>
<Adress Street="Street 326" City="London"/>
<Tax due="No" Rate="0"/>
</B>
<C>
<C1 x="xxx" Price="20"/>
<C1 x="yyy" Price="9"/>
</C>
</A>
</root>
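As a quick cross-check of the expression above, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. Its XPath subset does not support count() or numeric comparisons, so the two predicates are applied in plain Python instead. The trimmed XML string and the helper name select_numbers are illustrative assumptions, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the fixed data: only the attributes the predicates need.
XML = """
<root>
  <A Number="1234"><C><C1 Price="20"/><C1 Price="15"/></C></A>
  <A Number="1235"><C><C1 Price="25"/><C1 Price="12"/><C1 Price="22"/></C></A>
  <A Number="1236"><C><C1 Price="15"/></C></A>
  <A Number="1237"><C><C1 Price="20"/><C1 Price="9"/></C></A>
</root>
"""


def select_numbers(xml_text):
    """Return the Number of each A with more than one C/C1 and some Price < 20."""
    root = ET.fromstring(xml_text)
    selected = []
    for a in root.iter("A"):
        prices = [float(c1.get("Price")) for c1 in a.findall("C/C1")]
        # Mirrors [count(C/C1) > 1][C/C1/@Price < 20]: the second predicate
        # is true as soon as any one C1 has a Price below 20.
        if len(prices) > 1 and any(p < 20 for p in prices):
            selected.append(a.get("Number"))
    return selected


print(select_numbers(XML))  # ['1234', '1235', '1237']
```

Note that the price test is existential: an A qualifies if at least one of its C1 children is cheaper than 20, not only if all of them are.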

Related

how to select all items between two specific items based on id tags using xpath/scrapy

I have the following HTML:
<p id="d117" class="gen">Genre</p>
<a href='...'>Movie 1</a>
<a href='...'>Movie 2</a>
<p id="d127" class="gen">Genre</p>
<li>one</li>
<a href='...'>Movie 3</a>
<a href='...'>Movie 4</a>
<p id="d147" class="gen">Genre</p>
<li>two</li>
<a href='...'>Movie 5</a>
<a href='...'>Movie 6</a>
</root>
I want to select all the nodes after a p with a given id, up to the next p with a new id. I tried the following XPath expressions:
//p[@id="d147"]/preceding-sibling::*
//p[@id="d147"]/following-sibling::*
Both of the above work fine.
But the expression below does not give me the desired result, which is the elements between two ids (an interval):
//p[@id="d117"]/following-sibling::*[preceding-sibling::p[@id="d127"]]
Please guide me on how to get the data between two ids using XPath.
The predicate in your attempt keeps the siblings that come after d127; you want the ones that come before it, so test following-sibling instead:
//a[preceding-sibling::p[@id='d117'] and following-sibling::p[@id='d127']]
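To see which nodes fall inside the interval, here is a rough sketch with Python's standard-library xml.etree.ElementTree. ElementTree does not implement the sibling axes, so the siblings are walked in document order instead. The wrapping root element and the helper name links_between are assumptions made for illustration, and the sketch assumes both marker paragraphs are present and in order.

```python
import xml.etree.ElementTree as ET

# The question's fragment, wrapped in a hypothetical <root> so it parses as XML.
HTML = """
<root>
  <p id="d117" class="gen">Genre</p>
  <a href="...">Movie 1</a>
  <a href="...">Movie 2</a>
  <p id="d127" class="gen">Genre</p>
  <li>one</li>
  <a href="...">Movie 3</a>
  <a href="...">Movie 4</a>
  <p id="d147" class="gen">Genre</p>
  <li>two</li>
  <a href="...">Movie 5</a>
  <a href="...">Movie 6</a>
</root>
"""


def links_between(xml_text, start_id, end_id):
    """Collect the text of <a> siblings after p#start_id and before p#end_id."""
    root = ET.fromstring(xml_text)
    inside = False
    found = []
    for child in root:
        if child.tag == "p":
            if child.get("id") == start_id:
                inside = True   # interval opens after the start marker
            elif child.get("id") == end_id:
                break           # interval closes at the end marker
            continue
        if inside and child.tag == "a":
            found.append(child.text)
    return found


print(links_between(HTML, "d117", "d127"))  # ['Movie 1', 'Movie 2']
```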

knime xpath node multiple tag selection

I am trying to extract XML fragments from an HTML source. The source looks like this:
.
.
.
<h5>
<u>A</u>
</h5>
<ul class="listss">
<li>
<d>
<a href="link">
linktext
</a>
</d>
</li>
<li>
<d>
<a href="link2">
linktext2
</a>
</d>
</li>
</ul>
<h5>
<u>B</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>C</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>D</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
Actually, I need the parent-child relation, so I first need to extract each group with the XPath node. But I couldn't manage to get the range of XML from "h5" to "/ul", so I need the "h5" and "ul" tags together. The output must look like this:
<h5>
<u>A</u>
</h5>
<ul class="listss">
<li>
<d>
<a href="link">
linktext
</a>
</d>
</li>
<li>
<d>
<a href="link2">
linktext2
</a>
</d>
</li>
</ul>
I searched tons of links and tried everything, but none of these XPath expressions worked:
/.../*[self::dns:h5 or self::dns:ul]
/.../*[self::dns:h5|self::dns:ul]
/.../*[self::h5 or self::ul]
Any ideas? Thanks.
If you use Python, you can do this:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<h5>
<u>A</u>
</h5>
<ul class="listss">
<li>
<d>
<a href="link">
linktext
</a>
</d>
</li>
<li>
<d>
<a href="link2">
linktext2
</a>
</d>
</li>
</ul>
<h5>
<u>B</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>C</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>D</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>'''
doc = SimplifiedDoc(html)
items = doc.children
lastName = None
for item in items:
    if item.tag == 'h5':
        lastName = item.text
    else:
        links = item.getElementsByTag('a')
        print(lastName, links)
result:
A [{'href': 'link', 'tag': 'a', 'html': 'linktext\n '}, {'href': 'link2', 'tag': 'a', 'html': 'linktext2\n '}]
B []
C []
D []

XPath Select parent node based on child node

I need to extract the href of the a element whose descendant is i[@class="icon-right-open rotate180"].
I tried the following, but it didn't work for me:
//a[@class="arrowDot "]/@href /descendant::i[@class="icon-right-open rotate180"]
Here is the HTML code:
<div class="paginationDots sMargTop centered">
<a href="https://www.mubawab.tn/fr/cc/immobilier-a-vendre-all:p:2:sc:apartments-for-sale,commercial-property-for-sale,farms-for-sale,houses-for-sale,land-for-sale,villas-and-luxury-homes-for-sale" class="arrowDot ">
<i class="icon-left-open rotate180"/>
</a>
<a href="https://www.mubawab.tn/fr/cc/immobilier-a-vendre-all:sc:apartments-for-sale,commercial-property-for-sale,farms-for-sale,houses-for-sale,land-for-sale,villas-and-luxury-homes-for-sale" class="Dots ">
1
</a>
<a href="https://www.mubawab.tn/fr/cc/immobilier-a-vendre-all:p:2:sc:apartments-for-sale,commercial-property-for-sale,farms-for-sale,houses-for-sale,land-for-sale,villas-and-luxury-homes-for-sale" class="Dots ">
2
</a>
<a class="Dots currentDot">
3
</a>
<a href="https://www.mubawab.tn/fr/cc/immobilier-a-vendre-all:p:4:sc:apartments-for-sale,commercial-property-for-sale,farms-for-sale,houses-for-sale,land-for-sale,villas-and-luxury-homes-for-sale" class="Dots ">
4
</a>
<a href="https://www.mubawab.tn/fr/cc/immobilier-a-vendre-all:p:5:sc:apartments-for-sale,commercial-property-for-sale,farms-for-sale,houses-for-sale,land-for-sale,villas-and-luxury-homes-for-sale" class="Dots ">
5
</a>
<a href="https://www.mubawab.tn/fr/cc/immobilier-a-vendre-all:p:6:sc:apartments-for-sale,commercial-property-for-sale,farms-for-sale,houses-for-sale,land-for-sale,villas-and-luxury-homes-for-sale" class="Dots ">
6
</a>
<a href="https://www.mubawab.tn/fr/cc/immobilier-a-vendre-all:p:7:sc:apartments-for-sale,commercial-property-for-sale,farms-for-sale,houses-for-sale,land-for-sale,villas-and-luxury-homes-for-sale" class="Dots ">
7
</a>
<a href="https://www.mubawab.tn/fr/cc/immobilier-a-vendre-all:p:4:sc:apartments-for-sale,commercial-property-for-sale,farms-for-sale,houses-for-sale,land-for-sale,villas-and-luxury-homes-for-sale" class="arrowDot ">
<i class="icon-right-open rotate180"/>
</a>
</div>
The expected result is the following URL:
https://www.mubawab.tn/fr/cc/immobilier-a-vendre-all:p:4:sc:apartments-for-sale,commercial-property-for-sale,farms-for-sale,houses-for-sale,land-for-sale,villas-and-luxury-homes-for-sale
But the actual output is empty.
You almost had it. An attribute has no descendants, so the descendant test belongs in a predicate on the a element, with /@href as the last step:
//a[@class="arrowDot "][descendant::i[@class="icon-right-open rotate180"]]/@href
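The same "filter the parent by its descendant, then read the attribute" idea can be sketched with Python's standard-library xml.etree.ElementTree. ElementTree does not accept nested predicates like the one above, so the descendant check is written as a loop. The markup is a shortened stand-in for the pagination block, the example.test URLs are placeholders, and next_page_href is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the pagination markup; the URLs are placeholders.
HTML = """
<div class="paginationDots sMargTop centered">
  <a href="https://example.test/page/2" class="arrowDot "><i class="icon-left-open rotate180"/></a>
  <a href="https://example.test/page/2" class="Dots ">2</a>
  <a class="Dots currentDot">3</a>
  <a href="https://example.test/page/4" class="Dots ">4</a>
  <a href="https://example.test/page/4" class="arrowDot "><i class="icon-right-open rotate180"/></a>
</div>
"""


def next_page_href(xml_text):
    """Return the href of the arrow link that contains the right-open icon."""
    root = ET.fromstring(xml_text)
    for a in root.iter("a"):
        # The class value keeps its trailing space, as in the original markup.
        if a.get("class") != "arrowDot ":
            continue
        # Equivalent of the [descendant::i[@class=...]] predicate.
        if any(i.get("class") == "icon-right-open rotate180" for i in a.iter("i")):
            return a.get("href")
    return None


print(next_page_href(HTML))  # https://example.test/page/4
```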

Xpath - Get parent class by matching two child nodes

I'd like to use XPath to select a link with class="watchListItem" that contains a span with class="icon icon_checked" and an h3 with the text "a test". I can use XPath to match either the link and span, or the link and h3, but not the link, span, and h3 together.
Here's what I've tried:
//*[@class = 'watchListItem']/span[@class = 'icon icon_checked']
//*[@class= 'watchListItem']/h3[text()='AA']
I'm looking for something like this:
//*[@class = 'watchListItem']//*[span[@class = 'icon icon_checked'] and h3[text()='AA']]
<li>
<a class="watchListItem" data-id="thisid1" href="javascript:void(0);">
<span class="icon icon_checked"/>
<h3 class="itemList_heading">a test</h3>
</a>
</li>
<li>
<a class="watchListItem" data-id="thisid2" href="javascript:void(0);">
<span class="icon icon_unchecked"/>
<h3 class="itemList_heading">another test</h3>
</a>
</li>
<li>
<a class="watchListItem" data-id="thisid3" href="javascript:void(0);">
<span class="icon icon_checked"/>
<h3 class="itemList_heading">yet another test</h3>
</a>
</li>
You can use the child:: location paths like so:
//a[@class="watchListItem"
and child::span[@class="icon icon_checked"]
and child::h3[text()="a test"]]
This would select the anchor with data-id="thisid1".
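For comparison, here is a small sketch that checks the same three conditions (link class, checked span child, h3 text) against the sample markup using Python's standard-library xml.etree.ElementTree. The wrapping ul element and the helper name matching_ids are assumptions added for illustration. With the markup as posted, only the heading "a test" or "yet another test" sits next to a checked icon.

```python
import xml.etree.ElementTree as ET

# The question's list items, wrapped in a hypothetical <ul> so they parse as XML.
HTML = """
<ul>
  <li><a class="watchListItem" data-id="thisid1" href="javascript:void(0);">
    <span class="icon icon_checked"/><h3 class="itemList_heading">a test</h3></a></li>
  <li><a class="watchListItem" data-id="thisid2" href="javascript:void(0);">
    <span class="icon icon_unchecked"/><h3 class="itemList_heading">another test</h3></a></li>
  <li><a class="watchListItem" data-id="thisid3" href="javascript:void(0);">
    <span class="icon icon_checked"/><h3 class="itemList_heading">yet another test</h3></a></li>
</ul>
"""


def matching_ids(xml_text, heading):
    """Return data-id of each watchListItem link with a checked span and the given h3 text."""
    root = ET.fromstring(xml_text)
    ids = []
    for a in root.iter("a"):
        if a.get("class") != "watchListItem":
            continue
        # findall("span") / findall("h3") look at direct children only, like child::
        has_span = any(s.get("class") == "icon icon_checked" for s in a.findall("span"))
        has_h3 = any(h.text == heading for h in a.findall("h3"))
        if has_span and has_h3:
            ids.append(a.get("data-id"))
    return ids


print(matching_ids(HTML, "a test"))        # ['thisid1']
print(matching_ids(HTML, "another test"))  # []
```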

Right way to list image banners

I would like to ask what the right way to use 'ul' is. Is it okay to use 'ul' to list some image banners? For example, I have 3 image banners with titles, all floated left. I run into this situation all the time, and the approach I came up with is the first markup, using 'ul'.
Is it okay to use the markup below:
<section class="banners">
<ul>
<li>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
</figure>
title here
</li>
<li>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
</figure>
title here
</li>
<li>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
</figure>
title here
</li>
</ul>
</section>
or should I use:
<section class="banners">
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
<figcaption>
title here
</figcaption>
</figure>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
<figcaption>
title here
</figcaption>
</figure>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
<figcaption>
title here
</figcaption>
</figure>
</section>
Do they both represent semantic coding?
This is the sample of the image banner
Since the HTML5 spec is so mercurial and the semantics don't seem to play a major role practically, it's hard to say. However, based on your image, it looks like this is a navigation section. As such, you would want to section it with <nav>.
<ul> spec: http://www.w3.org/TR/html5/grouping-content.html#the-ul-element
<figure> spec: http://www.w3.org/TR/html5/grouping-content.html#the-figure-element
I don't think that these are much help. They are both used for grouping content. The order does not matter for <ul>.
From what I've read, it seems to me that the purpose of <figure> is for annotations of a document -- describing related images, etc. The spec specifically says that these could be moved elsewhere, like an appendix, but that doesn't seem to apply to your situation.
I don't think that <figure> is appropriate here. Instead, use <nav>. You can use the <ul> for styling if you need -- it doesn't provide much semantic meaning (just a somewhat generic grouping content element).
