What is the XPath expression that involves multiple exclusions? - xpath

Suppose I have html like this:
<div id="wrap">
<div id="content">
<span>some content</span>
<div id="s1">
<p> some text </p>
</div>
<h2 id="sec1">
<span> some text </span>
<p> some text </p>
</h2>
<h2 id="sec1">
<span> some text </span>
<div> some more text </div>
<p> some text </p>
</h2>
<h2 id="sec2">
<span> do not select me some text </span>
<div> do not select me some more text </div>
<p> do not select me some text </p>
</h2>
<h2 id="sec3">
<span> do not select me some text </span>
<div> do not select me some more text </div>
<p> do not select me some text </p>
</h2>
</div>
</div>
What is the XPath expression that selects all text nodes except those under h2 id=sec2 and h2 id=sec3?

Literally, "all text nodes except those under h2 id=sec2 and h2 id=sec3":
//text()[not(ancestor::h2[@id='sec2' or @id='sec3'])]
However I suspect that you don't really want that, because you would be losing the <span> and <p> structure. Would it be correct to infer that you want to select all the child elements of the content <div>, except for the <h2>s whose id's are sec2 and sec3? If so,
/div/div[@id = 'content']/*[not(self::h2 and (@id = 'sec2' or @id = 'sec3'))]
But you should also be aware that the text content of the <h2> element is merely the title of a section, not the whole text of the section. So it looks like by putting div's and p's inside an h2, you are not using it the way it is designed.
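As a minimal sketch of how the two expressions above behave, the following uses lxml (assumed to be installed) on a shortened version of the sample. The sample is parsed as XML rather than HTML, on the assumption that the markup is well-formed; the shortened text values are made up for illustration.

```python
from lxml import etree

# Shortened, well-formed version of the sample (illustrative text values).
SAMPLE = """
<div id="wrap">
  <div id="content">
    <span>some content</span>
    <h2 id="sec1"><span>keep me</span><p>keep me too</p></h2>
    <h2 id="sec2"><span>do not select me</span></h2>
    <h2 id="sec3"><p>do not select me</p></h2>
  </div>
</div>
"""

root = etree.fromstring(SAMPLE)

# Every text node that has no h2 ancestor with id sec2 or sec3.
texts = root.xpath("//text()[not(ancestor::h2[@id='sec2' or @id='sec3'])]")
# Drop the whitespace-only text nodes that come from indentation.
kept_text = [t.strip() for t in texts if t.strip()]
print(kept_text)  # ['some content', 'keep me', 'keep me too']

# Child elements of the content div, minus the two excluded h2 elements.
children = root.xpath(
    "/div/div[@id = 'content']/*[not(self::h2 and (@id = 'sec2' or @id = 'sec3'))]"
)
kept_children = [(el.tag, el.get("id")) for el in children]
print(kept_children)  # [('span', None), ('h2', 'sec1')]
```

The first query returns loose strings, while the second keeps the element structure, which is the trade-off described above.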

All elements under an <h2> (except …):
//h2[not(@id = 'sec2' or @id = 'sec3')]/*
All <span>, <div> or <p> elements anywhere (except …):
//*[self::span or self::div or self::p][not(parent::h2/@id = 'sec2' or parent::h2/@id = 'sec3')]
alternative notation (note the parens and the slightly changed predicate):
(//span|//div|//p)[not(parent::h2[@id = 'sec2' or @id = 'sec3'])]
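A quick way to check that the two notations select the same nodes is to run both and compare the results. This is only a sketch with lxml (assumed to be installed) on a reduced, made-up document parsed as XML; the helper name `paths` is hypothetical.

```python
from lxml import etree

# Reduced, illustrative document; parsed as XML on the assumption it is well-formed.
doc = etree.fromstring("""
<div id="content">
  <span>top-level span</span>
  <h2 id="sec1"><span>a</span><div>b</div><p>c</p></h2>
  <h2 id="sec2"><span>x</span><div>y</div><p>z</p></h2>
  <h2 id="sec3"><span>x</span><p>z</p></h2>
</div>
""")

def paths(elements):
    """Hypothetical helper: absolute path of each element, for easy comparison."""
    tree = doc.getroottree()
    return [tree.getpath(el) for el in elements]

# All elements directly under an h2 that is not sec2 or sec3.
under_h2 = doc.xpath("//h2[not(@id = 'sec2' or @id = 'sec3')]/*")

# span/div/p anywhere, unless the parent is one of the excluded h2 elements.
by_self = doc.xpath(
    "//*[self::span or self::div or self::p]"
    "[not(parent::h2/@id = 'sec2' or parent::h2/@id = 'sec3')]"
)
by_union = doc.xpath("(//span|//div|//p)[not(parent::h2[@id = 'sec2' or @id = 'sec3'])]")

print([el.tag for el in under_h2])        # ['span', 'div', 'p']
print(paths(by_self) == paths(by_union))  # True
```

Note that the "anywhere" variants also pick up the outer div and the top-level span, because neither has an excluded h2 as its parent.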

Related

Scrapy xpath select parent element based on text value in subelement and lacking of element

I want to select all elements article that don't contain a span element with class status and where the nested a element contains a href attribute which contains the text "rent.html".
I've managed to get the a element like so:
response.xpath('//article[@class="car"]//a[contains(@href,"rent.html")]')
But reading here and trying to select the first parent element article like so returns "data=0"
response.xpath('//article[@class="car"]//a[contains(@href,"rent.html")]//parent::article and not //article[@class="car"]//span[@class="status"]')
I also tried this.
response.xpath('//article[@class="car"][//a[contains(@href,"rent.html")]/article and not //article[@class="car"]//span[@class="status"]')')
I don't know what the expression is for my use case.
<article class="car">
<div>
<div class="container">
<a href="/34625030/rent.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/34625230/rent.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/12325230/buy.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/34632230/rent.html">
</a>
</div>
</div>
<span class="status">Rented</span>
</article>
This XPath expression will do the work:
"//article[not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]"
The entire command is:
response.xpath("//article[not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]")
Explanations:
Translating your requirements into XPath syntax.
"select all elements article" - //article
"that don't contain a span element with class status" - [not(.//span[@class='status'])]
"and where the nested a element contains a href attribute which contains the text "rent.html"" - [.//a[contains(@href,'rent.html')]]
I tested the XPath above on the shared sample XML and it worked properly.
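For readers who want to try the expression outside Scrapy, here is a small sketch using lxml (assumed to be installed) instead of a Scrapy response. The `<root>` wrapper is an assumption added so that the condensed sample parses as a single XML document.

```python
from lxml import etree

# Condensed version of the sample; <root> is a hypothetical wrapper element.
listing = etree.fromstring("""
<root>
  <article class="car"><div><div class="container"><a href="/34625030/rent.html"></a></div></div></article>
  <article class="car"><div><div class="container"><a href="/12325230/buy.html"></a></div></div></article>
  <article class="car"><div><div class="container"><a href="/34632230/rent.html"></a></div></div><span class="status">Rented</span></article>
</root>
""")

# Two stacked predicates: no status span inside, and a rent.html link inside.
matches = listing.xpath(
    "//article[not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]"
)

# Collect the link targets of the matching articles to see which ones survived.
hrefs = [a.get("href") for article in matches for a in article.xpath(".//a")]
print(hrefs)  # ['/34625030/rent.html']
```

The buy.html article fails the second predicate and the rented article fails the first, so only one article remains.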

Get Element using XPath in Puppeteer

I am trying to scrape multiple elements with the same class names, but each has a different number of children. I am looking for a way to select specific elements using XPath (this would make it easiest for my loop).
const gameTimeElement = await page.$$('//*[@id="section-content"]/div[2]/div[1]/div/div['+ i + ']');
const gameTimeString = await gameTimeElement[j].$eval('h3', (h3) => h3.innerHTML);
This currently does not work.
After I select the element, I grab the h3 tag inside and evaluate it to get the innerHTML.
Is there a way to do this utilizing xpath?
<div id="section-content" style="display: block;">
</div>
<div class="matches">
<div class="day day-28-1" data-week="1" style="display: block;">
<h4>Sat, March 28, 2020</h4>
<div class="day-wrap">
<div class="match region-7-57d5ab4-9qs98v" data-week="1">
<h3 class="time">2:00PM
<span>(Central Daylight Time)</span>
<span class="fr">Best of 7</span>
</h3>
<div class="row ac ">
<div class="col-xs-3 ar">
<img class="team-logo" src="url"></div>
<div class="col-xs-2 al">
<h4 class="loss">(NA)<br>
<span class="team-name">Team1</span>
<br>
<span class="win spoiler-wrap">0</span>
</h4>
</div>
<div class="col-xs-2">
<img class="league-logo" src="url">
<h4> V.S.</h4>
</div>
<div class="col-xs-2 ar">
<h4 class="">(NA)<br>
<span class="team-name">Team2</span>
<br>
<span class="win spoiler-wrap">4</span>
</h4>
</div>
This is a sample of what I am working with for HTML on the website.
Yes, div class="day-wrap" could have a different number of children, but I don't think that's a problem.
You want to get the game times of all Rocket League matches. As you've noticed, game times are located within h3 elements. You can access them directly with any of the following XPaths:
//div[@id="section-content"]//h3
//div[@class="day-wrap"]//h3
//div[contains(@class,"match region")]//h3
If you want something for a loop, you can try:
(//div[@class="day-wrap"]//h3)[i]
where i is the number to increment (from 1 to x).
Side note: your sample data doesn't match your XPath. There is a closing div on line 2, and it seems you omitted div class="row middle-xs center-xs weeks" before div class="matches".
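The indexed form can be tried out as follows. This is a sketch in Python with lxml (assumed to be installed) rather than Puppeteer, on a stripped-down document; the second and third game times are invented sample data.

```python
from lxml import etree

# Stripped-down, illustrative structure; only the first time comes from the sample.
page = etree.fromstring("""
<div id="section-content">
  <div class="day-wrap">
    <div class="match"><h3 class="time">2:00PM</h3></div>
    <div class="match"><h3 class="time">4:00PM</h3></div>
  </div>
  <div class="day-wrap">
    <div class="match"><h3 class="time">6:00PM</h3></div>
  </div>
</div>
""")

# count() returns a float, so convert it before using it as a loop bound.
total = int(page.xpath('count(//div[@class="day-wrap"]//h3)'))

times = []
for i in range(1, total + 1):  # XPath positions start at 1
    h3 = page.xpath(f'(//div[@class="day-wrap"]//h3)[{i}]')[0]
    times.append(h3.text.strip())
print(times)  # ['2:00PM', '4:00PM', '6:00PM']

# Without the parentheses, [1] applies per parent: "each h3 that is the
# first h3 child of its parent", which here is every h3.
per_parent = page.xpath('//div[@class="day-wrap"]//h3[1]')
print(len(per_parent))  # 3
```

The parentheses matter: they make the index apply to the whole result set instead of to each parent separately.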

XPath: how to select elements that are related to other on the same level

The question is simple, but I don't have enough practice for this case :)
How do I get the price text from every div inside a "block", but only for blocks that contain an item_promo element?
<div class="block">
<div class="item_promo">item</div>
<div class="item_price">123</div>
</div>
<div class="block">
<div class="item_promo">item</div>
<div class="item_price">456</div>
</div>
<div class="block">
<div class="item_promo">item</div>
<div class="item_price">789</div>
</div>
<div class="block">
<div class="item">item</div>
<div class="item_price">222</div>
</div>
<div class="block">
<div class="item">item</div>
<div class="item_price">333</div>
</div>
You could use this XPath:
//div[@class='block']/*[@class='item_promo']/following-sibling::div[@class='item_price']/text()
It finds the elements whose class attribute is item_promo, moves to the following sibling div whose class attribute is item_price, and grabs its text.
This XPath,
//div[div/@class='item_promo']/div[@class='item_price']
will return the item_price div elements that have a sibling item_promo div element:
<div class="item_price">123</div>
<div class="item_price">456</div>
<div class="item_price">789</div>
This will work regardless of label/price order.
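The difference between the two answers shows up when the order of the siblings changes. The sketch below uses lxml (assumed to be installed); the `<root>` wrapper and the reversed order in the second block are assumptions made for the demonstration and are not part of the original sample.

```python
from lxml import etree

# <root> is a hypothetical wrapper; the second block deliberately puts the
# price before the promo label to show the effect of sibling order.
blocks = etree.fromstring("""
<root>
  <div class="block"><div class="item_promo">item</div><div class="item_price">123</div></div>
  <div class="block"><div class="item_price">456</div><div class="item_promo">item</div></div>
  <div class="block"><div class="item">item</div><div class="item_price">222</div></div>
</root>
""")

# following-sibling only looks forward from the promo element.
sibling_axis = blocks.xpath(
    "//div[@class='block']/*[@class='item_promo']"
    "/following-sibling::div[@class='item_price']/text()"
)
print(sibling_axis)  # ['123']

# Filtering on the parent does not depend on the order of the children.
parent_predicate = blocks.xpath(
    "//div[div/@class='item_promo']/div[@class='item_price']/text()"
)
print(parent_predicate)  # ['123', '456']
```

With the original sample, where the promo label always comes first, both expressions return the same three prices.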

xpath string to select a specific node with specific attribute

I have the following HTML:
<div class=""postrow first"">
<h2 class=""title icon"">
This is the title
</h2>
<div class=""content"">
<div id=""post_message_1668079"">
<blockquote class=""postcontent restore "">
<div>Category</div>
line 1<br /> line2
</blockquote>
</div>
</div>
</div>
<div class=""postrow"">
<h2 class=""title icon"">
second title
</h2>
<div class=""content"">
<div id=""post_message_1668079"">
<blockquote class=""postcontent restore "">
<div>Category</div>
line 1<br /> line2
</blockquote>
</div>
</div>
</div>
What is the XPath string to select all DIVs whose class attribute is "postrow" or "postrow "?
This answer assumes that for each "", you actually have " in your document.
There are a number of alternative XPaths available to you. Here are just two:
Using the union operator |:
//div[@class = "postrow"] | //div[@class = "postrow "]
Using starts-with:
//div[starts-with(@class, "postrow")]
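The two alternatives are not equivalent, which a short experiment makes visible. This is a sketch with lxml (assumed to be installed) on a made-up four-element document; the `<root>` wrapper and the element texts are assumptions for illustration.

```python
from lxml import etree

# Illustrative document covering the class values discussed above.
doc = etree.fromstring("""
<root>
  <div class="postrow first">first</div>
  <div class="postrow">second</div>
  <div class="postrow ">third</div>
  <div class="other">fourth</div>
</root>
""")

# Union of two exact matches: only "postrow" and "postrow " qualify.
union = doc.xpath('//div[@class = "postrow"] | //div[@class = "postrow "]')
union_texts = [el.text for el in union]
print(union_texts)  # ['second', 'third']

# Prefix match: also accepts "postrow first" (and any other "postrow..." value).
prefix = doc.xpath('//div[starts-with(@class, "postrow")]')
prefix_texts = [el.text for el in prefix]
print(prefix_texts)  # ['first', 'second', 'third']
```

So the union form is the stricter one, while starts-with is the one that also catches the "postrow first" div from the sample.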

XPath to look for subtree

I'm scraping an html document, whose structure changes all the time. Css class names even change, so I can't rely on that. However, one thing never changes, the value is always contained in a subtree exactly like the following:
<span>
<span>
<span>wanted value</span>
<span></span>wanted value
</span>
</span>
Can this be expressed as an XPath expression?
It should not match:
<span>
<span>
<span> 1, one too little </span>
<span> 2 </span>
<span> 3, one too many </span>
<span> 4, two too many </span>
</span>
</span>
I plan to do this using lxml for Python.
If the wanted value is always in the first span on the third level of nested spans, the following XPath will work:
//span/span/span[1]
When applied to the following HTML document:
<html>
<head>
<title>Your Title</title>
</head>
<body>
<div>
<span>
<span>
<span>wanted value</span>
<span></span>
</span>
</span>
</div>
<div>
<span>
<span>
<span>wanted value</span>
<span></span>
</span>
</span>
</div>
</body>
</html>
The result will be:
wanted value
wanted value
EDIT
If you only want the value of the first third-level span when its parent contains exactly two span children, you can use the following XPath:
//span/span[count(span) = 2]/span[1]
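Since the question mentions lxml, here is a minimal sketch comparing the two expressions on both the wanted and the unwanted subtree. It assumes lxml is installed; the `<root>` wrapper and the shortened text values are assumptions added so that both subtrees fit in one well-formed document.

```python
from lxml import etree

# One matching subtree (two inner spans) and one non-matching subtree (four).
doc = etree.fromstring("""
<root>
  <span><span><span>wanted value</span><span></span></span></span>
  <span><span><span>1</span><span>2</span><span>3</span><span>4</span></span></span>
</root>
""")

# First third-level span, regardless of how many siblings it has.
loose = [el.text for el in doc.xpath("//span/span/span[1]")]
print(loose)  # ['wanted value', '1']

# Same, but only when the second-level span has exactly two span children.
strict = [el.text for el in doc.xpath("//span/span[count(span) = 2]/span[1]")]
print(strict)  # ['wanted value']
```

The count() predicate is what rules out the subtree with too many spans; the plain positional version would match it as well.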
