xpath find attribute by id and get the attribute parent content

I have an XML-structure that looks like this:
<document>
<body>
<section>
<title>something</title>
<subtitle>Something again</subtitle>
<section>
<p xml:id="1234">Some text</p>
</section>
</section>
<section>
<title>something2</title>
<subtitle>Something again2</subtitle>
<section>
<p xml:id="12345678">Some text2</p>
</section>
</section>
</body>
</document>
What I want is to search for the attribute xml:id containing 12345678 and, once found, get the content of the previous sibling (subtitle). Is this possible with XPath? I have this:
//p[contains(@xml:id,'12345678')]/preceding-sibling::subtitle

If I have understood the post correctly, the expected answer for your query is Something again2. Your expression returns nothing because subtitle is not a sibling of p; it is a sibling of the inner section that contains p. You can use the following query instead:
UPDATED as the document schema has changed
//section[section/p[@xml:id="12345678"]]/subtitle
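To check the logic of this query outside an XPath engine, here is a minimal Python sketch using only the standard library. The helper name subtitle_for_id is hypothetical, and the loop only imitates the predicate; it is not a full XPath evaluation.

```python
import xml.etree.ElementTree as ET

# Qualified name under which ElementTree exposes the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

DOC = """<document>
<body>
<section>
<title>something</title>
<subtitle>Something again</subtitle>
<section>
<p xml:id="1234">Some text</p>
</section>
</section>
<section>
<title>something2</title>
<subtitle>Something again2</subtitle>
<section>
<p xml:id="12345678">Some text2</p>
</section>
</section>
</body>
</document>"""


def subtitle_for_id(root, wanted):
    """Imitate //section[section/p[@xml:id=wanted]]/subtitle (hypothetical helper)."""
    for outer in root.iter("section"):
        # Only outer sections have a section/p grandchild; inner ones yield nothing.
        for p in outer.findall("section/p"):
            if p.get(XML_ID) == wanted:
                return outer.findtext("subtitle")
    return None


root = ET.fromstring(DOC)
print(subtitle_for_id(root, "12345678"))  # Something again2
```

The subtitle is read from the outer section, not from p, which is the same move the XPath makes by putting the id test inside a predicate on section.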

Related

How to properly get the value contained inside a section using XPath?

Given the following HTML (a snippet grabbed from the web page I want to scrape):
<div class="ulListContainer">
<section class="stockUpdater">
<ul class="column4">
<li>
<img src="1.png" alt="">
<strong>
Buy*
</strong>
<strong>
Sell*
</strong>
</li>
<li>
<header>
$USD
</header>
<span class="">
20.90
</span>
<span class="">
23.15
</span>
</li>
</ul>
<ul>...</ul>
</section>
</div>
How do I get the value of the 1st span in the 2nd li using XPath? The result should be 20.90.
I have tried //div[@class="ulListContainer"]/section/ul[1]/li[2]/span[1] but I am not getting any values. I must say this is being used in a Google Sheet with the IMPORTXML function (I am not sure which version of XPath it uses). Can I get some help?
Update
Apparently Google Sheets does not support such "complex" XPath expression since it seems to work fine:
Update 1
As requested I've shared the Google Sheet I am using to test this, here is the link
What you need is :
=IMPORTXML(A1;"//li[contains(text(),'USD')]/span[1]")
Removing section from your original XPath will work too:
=IMPORTXML(A1;"//div[@class='ulListContainer']/ul[1]/li[2]/span[1]")
Try this:
=IMPORTXML("URL","//span[1]")
Change URL to the actual website link/URL

How to stop Sphinx to use Pygments for code blocks?

Pygments translates code blocks into very ugly, semantically invalid markup.
For example, this reST
.. code-block:: html

   <html>
   <head>... head of the document ...</head>
   </html>
will be translated into a pre wrapped by two divs, containing a span for each line:
<div class="highlight-html">
<div class="highlight">
<pre>
<span class="x">&lt;html&gt;</span>
<span class="x">&lt;head&gt;... head of the document ...&lt;/head&gt;</span>
<span class="x">&lt;/html&gt;</span>
</pre>
</div>
</div>
Is there a way to translate a code block into code wrapped by pre (as W3C recommends) like below?
<pre>
<code>
<html>
<head>... head of the document ...</head>
</html>
</code>
</pre>

Best way to markup "mainContentOfPage"?

Other areas of a web page are simple to mark up, e.g. the navigation element, header, footer, and sidebar.
Not so with mainContentOfPage. I've seen a number of different ways to implement this, most recently (and this is the one I found strangest) on schema.org itself:
<div itemscope itemtype="http://schema.org/Table">
<meta itemprop="mainContentOfPage" content="true"/>
<h2 itemprop="about">list of presidents</h2>
<table>
<tr><th>President</th><th>Party</th></tr>
<tr>
<td>George Washington (1789-1797)</td>
<td>no party</td>
</tr>
<tr>
<td>John Adams (1797-1801)</td>
<td>Federalist</td>
</tr>
...
</table>
</div>
I could use some examples; the main content of my page is in this case a search results page, but I would plan to use this on other pages too (homepage, product page, etc.)
Edit, I found some more examples:
Would this be valid? I found this on a blog:
<div id="main" itemscope itemtype="http://schema.org/WebPageElement" itemprop="mainContentOfPage">
<p>The content</p>
</div>
I also found this even simpler example on another blog (might be too simple?):
<div id="content" itemprop="mainContentOfPage">
<p>The content</p>
</div>
The mainContentOfPage property can be used on WebPage and expects a WebPageElement as value.
But Table is not a subtype of WebPage, so the property cannot be used on it, and true is not an expected value. So this example is in fact strange, as it doesn’t follow the specification.
A parent WebPage should use Table as value for mainContentOfPage:
<body itemscope itemtype="http://schema.org/WebPage">
<div itemprop="mainContentOfPage" itemscope itemtype="http://schema.org/Table">
</div>
</body>
EDIT: Update
Your second example is the same as mine; it just uses the more general WebPageElement instead of Table. (Of course you’d still need a parent WebPage item, like in my example.)
Your third example is not in line with schema.org’s definition, as the value is Text and not the expected WebPageElement (or child) item.
A valid option would be:
<body itemscope itemtype="http://schema.org/WebPage">
<main itemprop="mainContentOfPage" itemscope itemtype="http://schema.org/WebPageElement">
<div itemprop="about" itemscope="" itemtype="http://schema.org/Thing">
<h1 itemprop="name">whatever</h1>
</div>
</main>
</body>
Of course you may add related properties to top-level or nested elements, and change Thing into any other item type listed at Full Hierarchy. I also recommend using mainEntity. The documentation still doesn't clarify whether it's really necessary, but according to the 1st example here, when using WebPage you may want to specify a mainEntity:
<body itemscope itemtype="http://schema.org/WebPage">
<header><h1 itemscope itemprop="mainEntity" itemtype="http://schema.org/Thing">whatever</h1></header>
<main itemprop="mainContentOfPage" itemscope itemtype="http://schema.org/WebPageElement">
<div itemprop="about" itemscope="" itemtype="http://schema.org/Thing">
<h2 itemprop="name">whatever</h2>
</div>
</main>
</body>
I cannot tell whether this would also be valid:
<body itemscope itemtype="http://schema.org/WebPage">
<main itemprop="mainContentOfPage" itemscope itemtype="http://schema.org/WebPageElement">
<div itemprop="mainEntity" itemscope="" itemtype="http://schema.org/Thing">
<h1 itemprop="name">whatever</h1>
</div>
</main>
</body>
The documentation says nothing about setting mainEntity on nested items.
In any case, consider that "[...] Every web page is implicitly assumed to be declared to be of type WebPage [...]", as stated in the WebPage description, and that HTML tags such as <main>, <footer> or <header> already give information about what type of elements are used in a page. So if you do not actually need to add relevant information to those elements or to your web page itself, with proper use of HTML tags you could easily do without mainContentOfPage or even WebPage.

Phantom <span> element using ImportXML with XPath in Google Spreadsheet

I am trying to get the value of an element attribute from this site via IMPORTXML in Google Spreadsheet using XPath.
The attribute value I seek is content, found in the <span> with itemprop="price".
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
...
</div>
I can access <div class="left"> but I can't get to the <span> element.
I tried using:
//span[@class='pret']/@content I get #N/A;
//span[@itemprop='price']/@content I get #N/A;
//div[@class='left']/span[@class='pret' and @itemprop='price']/@content I get #N/A;
//div[@class='left']/span[1]/@content I get #N/A;
//div[@class='left']/span/text() to get the text node of <span> I get #N/A;
//div[@class='left']//span/text() I get the text node of a <span> lower in div.left.
To get the text node of <span> I have to use //div[@class='left']/text(). But I can't use that text node because the layout of the span changes if a product is on sale, so I need the attribute.
It's as if the span I'm looking for does not exist, although it appears in Chrome's developer tools and in the page source, and all the XPath expressions work in the console using $x("").
I tried to generate the XPath directly from the developer tools by right-clicking, and I get //*[@id='produs']/div[4]/div[4]/div[1]/span, which does not work. I also tried to generate the XPath with Firefox and with plugins for Firefox and Chrome, to no avail. The XPath generated in these ways did not even work on sites I managed to scrape with "hand coded XPath".
Now, the strangest thing is that on this other site, with an apparently similar code structure, the XPath //span[@itemprop='price']/@content works.
I have struggled with this for 4 days now. I'm starting to think it has something to do with the unclosed meta tag, but why doesn't this happen on the other site?
Perhaps the following formulas can help you:
=ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']/text()")
Or
=INDEX(ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']"), 1, 2)
UPDATE
It seems that IMPORTXML fails because it does not properly parse the entire document. An extract of the document, something like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<div class="product-info-price">
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
<div class="resealed-info">
» Vezi 1 resigilat din aceasta categorie
</div>
<ul style="margin-left: auto;margin-right: auto;width: 200px;text-align: center;margin-top: 20px;">
<li style="color: #000000; font-size: 11px;">Rata de la <b>28,18 RON</b> prin BRD</li>
<li style="color: #5F5F5F;text-align: center;">Pretul include TVA</li>
<li style="color: #5F5F5F;">Cod produs: <span style="margin-left: 0;text-align: center;font-weight: bold;" itemprop="identifier" content="mol:GA-Z87X-UD3H">GA-Z87X-UD3H</span> </li>
</ul>
</div>
<div class="right" style="height: 103px;line-height: 103px;">
<form action="/?a=shopping&sa=addtocart" method="post" id="add_to_cart_form">
<input type="hidden" name="product-183641" value="on"/>
<img src="/templates/marketonline/images/pag-prod/buton_cumpara.jpg"/>
</form>
</div>
</div>
</html>
works with the following XPath query:
"//div[@class='product-info-price']//div[@class='left']//span[@itemprop='price']/@content"
UPDATE
It occurs to me that one option is that you can use Apps Script to create your own ImportXML function, something like:
/* CODE FOR DEMONSTRATION PURPOSES */
function MyImportXML(url) {
  var content = '';
  var response = UrlFetchApp.fetch(url);
  var html = response ? response.getContentText() : '';
  // Capture only up to the closing quote, so later attributes are not swallowed.
  var match = html.match(/<span class="pret" itemprop="price" content="([^"]*)">/i);
  // Return an empty string instead of throwing when the span is missing.
  if (match) content = match[1];
  return content;
}
Then you can use as follows:
=MyImportXML("http://...")
At this time, the web page referred to in the first link doesn't include a span tag with itemprop="price", but the following XPath returns 639
//b[@itemprop='price']
Looks to me that the problem was that the meta tag was not XHTML compliant but now all the meta tags are properly closed.
Before:
<meta itemprop="currency" content="RON">
Now
<meta itemprop="priceCurrency" content="RON" />
For web pages that are not XHTML compliant, a solution other than IMPORTXML should be used, such as IMPORTDATA with REGEXEXTRACT, or Google Apps Script with the UrlFetch Service and the JavaScript match function, among other alternatives.
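If running a script outside Google Sheets is an option (an assumption; the thread itself stays within Sheets and Apps Script), a tolerant HTML parser is another alternative that does not mind the unclosed meta tag. The sketch below uses Python's standard html.parser; the class name PriceFinder and the embedded fragment are illustrative only.

```python
from html.parser import HTMLParser

# Fragment adapted from the question; note the unclosed <meta> tag.
PAGE = """<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
</div>"""


class PriceFinder(HTMLParser):
    """Collect the content attribute of every tag with itemprop="price"."""

    def __init__(self):
        super().__init__()
        self.prices = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("itemprop") == "price" and "content" in attributes:
            self.prices.append(attributes["content"])


finder = PriceFinder()
finder.feed(PAGE)
print(finder.prices)  # ['698,31 RON']
```

Because the parser works on start tags one at a time, it never needs the document to be well-formed XML, which is the property IMPORTXML appears to depend on here.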
Try something like this:
# assumes `tree` is an lxml tree already parsed from the page
print('content by key', tree.xpath('//*[@itemprop="price"]')[0].get('content'))
or
nodes = tree.xpath('//div/meta/span')
for node in nodes:
    print('content =', node.get('content'))
But I haven't tried that.

xpath: find attribute value from identifier in current element attribute

I have an XML structure that looks like this:
<document>
<body>
<section>
<title>something</title>
<subtitle>Something again</subtitle>
<section>
<p xml:id="1234">Some text</p>
</section>
</section>
<section>
<title>something2</title>
<subtitle>Something again2</subtitle>
<section>
<p xml:id="12345678">Some text2</p>
<p getelement="1234"></p>
</section>
</section>
</body>
</document>
I want to search for the attribute value defined in "getelement". I got this code from a friendly soul here:
//section[section/p[@xml:id=@getelement]]/subtitle
but it doesn't work, and I can't use current() since it is not supported in Arbortext.
Your expression compares two attributes of the same p element, but xml:id and getelement sit on different elements. You have to look up the getelement value separately:
//section[section/p[@xml:id=//@getelement]]/subtitle
Also note that xml:id attributes cannot start with digits.
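The effect of //@getelement can be imitated in plain Python to see which subtitle the corrected query selects. This is a minimal sketch using only the standard library; the helper name subtitles_for_references is hypothetical, and the loop mirrors the predicate without being a real XPath evaluation.

```python
import xml.etree.ElementTree as ET

# Qualified name under which ElementTree exposes the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

DOC = """<document>
<body>
<section>
<title>something</title>
<subtitle>Something again</subtitle>
<section>
<p xml:id="1234">Some text</p>
</section>
</section>
<section>
<title>something2</title>
<subtitle>Something again2</subtitle>
<section>
<p xml:id="12345678">Some text2</p>
<p getelement="1234"></p>
</section>
</section>
</body>
</document>"""


def subtitles_for_references(root):
    """Imitate //section[section/p[@xml:id=//@getelement]]/subtitle."""
    # //@getelement: every getelement value anywhere in the document.
    wanted = {
        el.get("getelement")
        for el in root.iter()
        if el.get("getelement") is not None
    }
    found = []
    for outer in root.iter("section"):
        # Keep the outer section when any grandchild p carries a wanted id.
        if any(p.get(XML_ID) in wanted for p in outer.findall("section/p")):
            found.append(outer.findtext("subtitle"))
    return found


root = ET.fromstring(DOC)
print(subtitles_for_references(root))  # ['Something again']
```

Note that //@getelement matches every getelement attribute in the document, so with several references the query returns the subtitles for all of them, not only for the one next to a particular p.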
