XPath to look for subtree - xpath

I'm scraping an HTML document whose structure changes all the time. Even the CSS class names change, so I can't rely on them. However, one thing never changes: the value is always contained in a subtree exactly like the following:
<span>
<span>
<span>wanted value</span>
<span></span>wanted value
</span>
</span>
Can this be expressed as an XPath expression?
It should not match:
<span>
<span>
<span> 1, one too little </span>
<span> 2 </span>
<span> 3, one too many </span>
<span> 4, two too many </span>
</span>
</span>
I plan to do this using lxml for Python.

If the wanted value is always in the first span on the third level of nested spans, the following XPath will work:
//span/span/span[1]
When applied to the following HTML document:
<html>
<head>
<title>Your Title</title>
</head>
<body>
<div>
<span>
<span>
<span>wanted value</span>
<span></span>
</span>
</span>
</div>
<div>
<span>
<span>
<span>wanted value</span>
<span></span>
</span>
</span>
</div>
</body>
</html>
The result will be:
wanted value
wanted value
EDIT
If you only want the first span on the third level when that level contains exactly two spans, you can use the following XPath:
//span/span[count(span) = 2]/span[1]
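Since the question mentions lxml, here is a minimal sketch of how the expression above could be checked. The sample document and the variable names are made up for illustration; the second block has four spans on the third level and should not match.

```python
from lxml import html

# Hypothetical sample: the first block matches, the second has too many spans.
DOC = """<html><body>
<div><span><span><span>wanted value</span><span></span></span></span></div>
<div><span><span><span>1</span><span>2</span><span>3</span><span>4</span></span></span></div>
</body></html>"""

tree = html.fromstring(DOC)

# A span inside a span, with exactly two span children; take the first child.
nodes = tree.xpath("//span/span[count(span) = 2]/span[1]")
values = [node.text for node in nodes]
print(values)
```

Only the block with exactly two spans on the third level contributes a value here.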

Related

Scrapy xpath select parent element based on text value in subelement and lacking of element

I want to select all elements article that don't contain a span element with class status and where the nested a element contains a href attribute which contains the text "rent.html".
I've managed to get the a element like so:
response.xpath('//article[@class="car"]//a[contains(@href,"rent.html")]')
But reading here and trying to select the first parent element article like so returns "data=0"
response.xpath('//article[@class="car"]//a[contains(@href,"rent.html")]//parent::article and not //article[@class="car"]//span[@class="status"]')
I also tried this.
response.xpath('//article[@class="car"][//a[contains(@href,"rent.html")]/article and not //article[@class="car"]//span[@class="status"]')')
I don't know what the expression is for my use case.
<article class="car">
<div>
<div class="container">
<a href="/34625030/rent.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/34625230/rent.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/12325230/buy.html">
</a>
</div>
</div>
</article>
<article class="car">
<div>
<div class="container">
<a href="/34632230/rent.html">
</a>
</div>
</div>
<span class="status">Rented</span>
</article>
This XPath expression will do the work:
"//article[not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]"
The entire command is:
response.xpath("//article[not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]")
Explanation:
Translating your requirements into XPath syntax.
"select all elements article" - //article
"that don't contain a span element with class status" - [not(.//span[@class='status'])]
"and where the nested a element contains a href attribute which contains the text "rent.html"" - [.//a[contains(@href,'rent.html')]]
I tested the XPath above on the shared sample XML and it worked properly.
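As a rough way to reproduce that test outside Scrapy, the sketch below runs the same expression with lxml directly (an assumption made for illustration; the answer itself uses response.xpath). The markup is a condensed copy of the sample from the question.

```python
from lxml import html

# Condensed copy of the four sample articles from the question.
DOC = """<html><body>
<article class="car"><div><div class="container"><a href="/34625030/rent.html"></a></div></div></article>
<article class="car"><div><div class="container"><a href="/34625230/rent.html"></a></div></div></article>
<article class="car"><div><div class="container"><a href="/12325230/buy.html"></a></div></div></article>
<article class="car"><div><div class="container"><a href="/34632230/rent.html"></a></div></div><span class="status">Rented</span></article>
</body></html>"""

tree = html.fromstring(DOC)

# Articles without a status span that contain a link to rent.html.
XPATH = "//article[not(.//span[@class='status'])][.//a[contains(@href,'rent.html')]]"
articles = tree.xpath(XPATH)

# Collect the link targets of the selected articles for inspection.
hrefs = [a.get("href") for article in articles for a in article.xpath(".//a")]
print(hrefs)
```

The "buy" article and the article carrying a status span are both filtered out, each by its own predicate.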

Get Element using XPath in Puppeteer

I am trying to scrape multiple elements that share the same class name, but each has a different number of children. I am looking for a way to select specific elements using XPath (this would be easiest for my loop).
const gameTimeElement = await page.$$('//*[@id="section-content"]/div[2]/div[1]/div/div['+ i + ']');
const gameTimeString = await gameTimeElement[j].$eval('h3', (h3) => h3.innerHTML);
This currently does not work.
After I select the element, I grab the h3 tag inside and evaluate it to get the innerHTML.
Is there a way to do this utilizing xpath?
<div id="section-content" style="display: block;">
</div>
<div class="matches">
<div class="day day-28-1" data-week="1" style="display: block;">
<h4>Sat, March 28, 2020</h4>
<div class="day-wrap">
<div class="match region-7-57d5ab4-9qs98v" data-week="1">
<h3 class="time">2:00PM
<span>(Central Daylight Time)</span>
<span class="fr">Best of 7</span>
</h3>
<div class="row ac ">
<div class="col-xs-3 ar">
<img class="team-logo" src="url"></div>
<div class="col-xs-2 al">
<h4 class="loss">(NA)<br>
<span class="team-name">Team1</span>
<br>
<span class="win spoiler-wrap">0</span>
</h4>
</div>
<div class="col-xs-2">
<img class="league-logo" src="url">
<h4> V.S.</h4>
</div>
<div class="col-xs-2 ar">
<h4 class="">(NA)<br>
<span class="team-name">Team2</span>
<br>
<span class="win spoiler-wrap">4</span>
</h4>
</div>
This is a sample of what I am working with for HTML on the website.
Yes, div class="day-wrap" could have a different number of children, but I don't think that's a problem.
You want to get the game times of all Rocket League matches. As you've noticed, game times are located within h3 elements. You can access them directly with any of the following XPaths:
//div[@id="section-content"]//h3
//div[@class="day-wrap"]//h3
//div[contains(@class,"match region")]//h3
If you want something for a loop, you can try:
(//div[@class="day-wrap"]//h3)[i]
where i is the index to increment (from 1 to x).
Side note: your sample data looks inconsistent with your XPath. There is a closing div on line 2, and it seems you omitted div class="row middle-xs center-xs weeks" before div class="matches".
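The indexed form can be tried out independently of Puppeteer. The sketch below uses Python with lxml purely to exercise the XPath; the markup is a trimmed, hypothetical version of the sample (the second match and its time are invented), and the sample's stray closing div is left out so that the matches sit inside section-content.

```python
from lxml import html

# Trimmed, hypothetical markup; the 4:00PM match is invented for the example.
DOC = """<html><body>
<div id="section-content">
<div class="matches">
<div class="day-wrap">
<div class="match region-a"><h3 class="time">2:00PM <span>(Central Daylight Time)</span></h3></div>
<div class="match region-b"><h3 class="time">4:00PM <span>(Central Daylight Time)</span></h3></div>
</div>
</div>
</div>
</body></html>"""

tree = html.fromstring(DOC)

# count() returns a float in lxml, so convert it before using it in range().
total = int(tree.xpath('count(//div[@class="day-wrap"]//h3)'))

times = []
for i in range(1, total + 1):  # XPath positions start at 1
    h3 = tree.xpath(f'(//div[@class="day-wrap"]//h3)[{i}]')[0]
    # h3.text is only the text before the first child span.
    times.append(h3.text.strip())
print(times)
```

The parentheses matter: without them, [i] would apply to each h3 relative to its own parent rather than to the whole result list.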

Get an element from the text of its descendants

I have the text of all the descendants of an element as a single line.
How can I get the matching element with class="Block"?
I could find an element by the text of a single descendant, but another element may contain the same text.
1 - The text of all descendants must be used.
2 - I don't know which tags the descendants are.
3 - I don't know the positions of the elements and descendants; they always change.
4 - The number of descendants can vary.
<!DOCTYPE html>
<html>
<head>
<title>Test</title>
</head>
<body>
<div class="AllBlock">
<div class="Block">
<span>First text</span> <span>different text</span> <a>first link</a>
</div>
<div class="Block">
<span>Second text</span> <span>different text</span> <a>Second link</a>
</div>
<div class="Block">
<span>Third text</span> <span>different text</span> <a>Third link</a>
</div>
<div class="Block">
<span>Fourth text</span> <span>different text</span> <a>Fourth link</a>
</div>
<div class="Block">
<span>Fifth text</span> <span>different text</span> <a>Fifth link</a>
</div>
</div>
</body>
</html>
To select a node by its space-normalized string value, ignoring its inner structure, try the following:
//div[@class="Block" and normalize-space()="First text different text first link"]
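A small sketch of this lookup with lxml (assumed here only for illustration, since the question does not name a tool). The search text is passed as an XPath variable, so it does not need to be quoted inside the expression; the variable name wanted is arbitrary.

```python
from lxml import html

# Two of the sample blocks are enough to show the lookup.
DOC = """<html><body><div class="AllBlock">
<div class="Block">
<span>First text</span> <span>different text</span> <a>first link</a>
</div>
<div class="Block">
<span>Second text</span> <span>different text</span> <a>Second link</a>
</div>
</div></body></html>"""

tree = html.fromstring(DOC)

# normalize-space() with no argument uses the string value of the div,
# i.e. the concatenated text of all its descendants, with whitespace collapsed.
search_text = "Second text different text Second link"
blocks = tree.xpath(
    '//div[@class="Block" and normalize-space()=$wanted]',
    wanted=search_text,
)
link_text = blocks[0].xpath("string(a)")
print(len(blocks), link_text)
```

Because the comparison is against the whole string value, the tags and their number inside the block do not matter, only the text and its order.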

Marking up HTML code with microdata when there are multiple products on a page

I have a page that compares 4 products at a time in a parallel tabular form, i.e. it lists the features of each of them one after another. Here is a sample page.
I wish to tag these features so that they are easier for search engines to interpret. However, in all the examples given here, all the features of a product have to be listed together in one div. This is a problem in my case, where I list the same feature of all products together.
A typical example goes like this:
<div itemscope itemtype="http://schema.org/Offer">
<span itemprop="name">Blend-O-Matic</span>
<span itemprop="price">$19.95</span>
</div>
However, I would like it to be in this way :-
<div itemscope itemtype="http://schema.org/Offer">
<span itemprop="name">Blend-O-Matic</span> // Item 1
</div>
<div itemscope itemtype="http://schema.org/Offer">
<span itemprop="name">Blend-O-Matic2</span> // Item 2
</div>
Further followed by :-
<div itemscope itemtype="http://schema.org/Offer">
<span itemprop="price">$19.95</span> // Item 1
</div>
<div itemscope itemtype="http://schema.org/Offer">
<span itemprop="price">$21.95</span> // Item 2
</div>
So, in a nutshell, is there a way to tag an item with some identifier and then use it to refer to other details of that item?
Please comment if my question is unclear!
Use itemref:
<div itemscope itemtype="http://schema.org/Offer" itemref="item1_price">
<span itemprop="name">Blend-O-Matic</span>
</div>
<div id="item1_price">
<span itemprop="price">$19.95</span>
</div>
See results from Google Structured Data Testing Tool here
You might want to have a look at this for SERP. It shows how to have multiple products in an "ItemList":
http://scottgale.com/schema-org-markup-serp/2013/03/17/
Hth
PS: This works without error or issue on the Google Structured Data testing tool over at http://www.google.com/webmasters/tools/richsnippets
However, to be more realistic: you usually have a WebPage itemtype as well, right?
In that case the markup looks roughly like this:
<div itemscope="" itemtype="http://schema.org/WebPage">
<div itemscope itemtype="http://schema.org/Offer" itemref="item1_price">
<span itemprop="name">Blend-O-Matic</span>
</div>
<div id="item1_price">
<span itemprop="price">$19.95</span>
</div>
</div>
See the google result
And we get an error. If we add the same itemscope="" itemtype="http://schema.org/Offer" to the price block, we end up with one complete offer and one duplicate offer containing only the price. Code:
<div itemscope="" itemtype="http://schema.org/WebPage">
<div itemscope="" itemtype="http://schema.org/Offer" itemref="item1_price">
<span itemprop="name">Blend-O-Matic</span>
</div>
<div itemscope="" itemtype="http://schema.org/Offer">
<span id="item1_price" itemprop="price">$19.95</span>
</div>
</div>
Google result
So, as I understand it, we need a different approach. Am I right?

What is the Xpath expression that involves multiple exclusions?

Suppose I have html like this:
<div id="wrap">
<div id="content">
<span>some content</span>
<div id="s1">
<p> some text </p>
</div>
<h2 id="sec1">
<span> some text </span>
<p> some text </p>
</h2>
<h2 id="sec1">
<span> some text </span>
<div> some more text </div>
<p> some text </p>
</h2>
<h2 id="sec2">
<span> do not select me some text </span>
<div> do not select me some more text </div>
<p> do not select me some text </p>
</h2>
<h2 id="sec3">
<span> do not select me some text </span>
<div> do not select me some more text </div>
<p> do not select me some text </p>
</h2>
</div>
</div>
What is the XPath expression that selects all text nodes except those that are under h2 id=sec2 and h2 id=sec3?
Literally, "all text nodes except those that are under h2 id=sec2 and h2 id=sec3":
//text()[not(ancestor::h2[@id='sec2' or @id='sec3'])]
However I suspect that you don't really want that, because you would be losing the <span> and <p> structure. Would it be correct to infer that you want to select all the child elements of the content <div>, except for the <h2>s whose id's are sec2 and sec3? If so,
/div/div[@id = 'content']/*[not(self::h2 and (@id = 'sec2' or @id = 'sec3'))]
But you should also be aware that the text content of the <h2> element is merely the title of a section, not the whole text of the section. So it looks like by putting div's and p's inside an h2, you are not using it the way it is designed.
All elements under an <h2> (except …):
//h2[not(@id = 'sec2' or @id = 'sec3')]/*
All <span>, <div> or <p> elements anywhere (except …):
//*[self::span or self::div or self::p][not(parent::h2/@id = 'sec2' or parent::h2/@id = 'sec3')]
alternative notation (note the parens and the slightly changed predicate):
(//span|//div|//p)[not(parent::h2[@id = 'sec2' or @id = 'sec3'])]
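To compare the text-node selection with the element selection, the sketch below evaluates two of the expressions with lxml (an assumption; the question does not name a tool). The markup is a shortened, hypothetical version of the sample and is parsed as XML on purpose, because an HTML parser may restructure <p> or <div> elements placed inside an <h2>.

```python
from lxml import etree

# Shortened, hypothetical version of the sample, parsed as well-formed XML.
DOC = """<div id="wrap"><div id="content">
<span>some content</span>
<h2 id="sec1"><span>keep me</span><p>keep me too</p></h2>
<h2 id="sec2"><span>do not select me</span></h2>
<h2 id="sec3"><p>do not select me</p></h2>
</div></div>"""

root = etree.fromstring(DOC)

# All text nodes that have no sec2/sec3 <h2> among their ancestors.
texts = root.xpath("//text()[not(ancestor::h2[@id='sec2' or @id='sec3'])]")
# Whitespace-only nodes between elements are selected too; drop them here.
kept = [t.strip() for t in texts if t.strip()]

# Child elements of every <h2> other than sec2 and sec3.
children = root.xpath("//h2[not(@id = 'sec2' or @id = 'sec3')]/*")
tags = [child.tag for child in children]
print(kept, tags)
```

The first expression returns bare strings with no structure, while the second keeps the elements themselves, which is the trade-off the answer points out.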

Resources