xpath: find attribute value from identifier in current element attribute - xpath

I have an XML structure that looks like this:
<document>
<body>
<section>
<title>something</title>
<subtitle>Something again</subtitle>
<section>
<p xml:id="1234">Some text</p>
</section>
</section>
<section>
<title>something2</title>
<subtitle>Something again2</subtitle>
<section>
<p xml:id="12345678">Some text2</p>
<p getelement="1234"></p>
</section>
</section>
</body>
</document>
I want to look up the attribute value defined in "getelement". I got this code from a friendly soul here:
//section[section/p[@xml:id=@getelement]]/subtitle
but it doesn't work, and I can't use current() since it is not supported in Arbortext.

Your expression compares two attributes of the same p element, but xml:id and getelement sit on different elements. You have to look up getelement anywhere in the document:
//section[section/p[@xml:id=//@getelement]]/subtitle
Also note that xml:id values cannot start with a digit.
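To check the difference between the two predicates outside Arbortext, here is a minimal sketch using lxml (an assumption; the question itself targets Arbortext). The ids are prefixed with "p" so that they are valid xml:id values; that prefix is illustrative and not part of the original document.

```python
from lxml import etree

# Sample document from the question. The "p" prefix on the ids is an
# assumption made here so that the xml:id values do not start with a digit.
XML = """<document>
  <body>
    <section>
      <title>something</title>
      <subtitle>Something again</subtitle>
      <section>
        <p xml:id="p1234">Some text</p>
      </section>
    </section>
    <section>
      <title>something2</title>
      <subtitle>Something again2</subtitle>
      <section>
        <p xml:id="p12345678">Some text2</p>
        <p getelement="p1234"></p>
      </section>
    </section>
  </body>
</document>"""

root = etree.fromstring(XML)

# Both attributes are read from the same <p>, so no <p> ever matches.
same_element = root.xpath("//section[section/p[@xml:id=@getelement]]/subtitle")

# Each xml:id is compared against every getelement value in the document.
any_element = root.xpath("//section[section/p[@xml:id=//@getelement]]/subtitle")

print([s.text for s in same_element])  # []
print([s.text for s in any_element])   # ['Something again']
```

Because //@getelement collects every getelement attribute in the document, this form matches whenever any reference points at the id. If several references exist and only one of them matters, the predicate would need to be narrowed.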

Related

Auto generate XPath for known element in HTML tree using python

Is there any way (libs, not manually) for generating relative XPath for a known element in HTML?
Let say the second P element inside class="content"
<html>
<body>
<div class="title">
<h1>***</h1>
</div>
<p> *** </p>
<h3>***</h3>
<div class="content">
<p>****</p>
<p>****</p>
</div>
</body>
</html>
Use case:
The idea is to guess where the elements I might be interested in are located, for example title, content or author. Once I've found an element, I want to generate an XPath for it and use it later from Python 3.
Try something like this:
from lxml import etree
datum = """
<html>
<body>
<div class="title">
<h1>***</h1>
</div>
<p> *** </p>
<h3>***</h3>
<div class="content">
<p>something</p>
<p>target</p>
</div>
</body>
</html>
"""
root = etree.fromstring(datum)
tree = etree.ElementTree(root)
find_text = etree.XPath("//p[text()='target']")
for target in find_text(root):
    print(tree.getpath(target))
Output:
/html/body/div[2]/p[2]
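As a sanity check on the generated paths, the sketch below builds a path for every p element with getpath and confirms that evaluating each path selects exactly the element it came from. The markup is the sample from the answer above; the variable names are illustrative.

```python
from lxml import etree

DATUM = """<html>
<body>
<div class="title">
<h1>***</h1>
</div>
<p> *** </p>
<h3>***</h3>
<div class="content">
<p>something</p>
<p>target</p>
</div>
</body>
</html>"""

root = etree.fromstring(DATUM)
tree = etree.ElementTree(root)

# Map each generated absolute path to the <p> element it was built from.
paths = {tree.getpath(p): p for p in root.iter("p")}

for path, element in paths.items():
    # Evaluating the path again must return that element and nothing else.
    assert tree.xpath(path) == [element]
    print(path, repr(element.text))
```

The paths are positional, so they stop matching as soon as the page structure shifts. That matters if they are stored and replayed against later versions of the page.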

How to stop Sphinx to use Pygments for code blocks?

Pygments translates code blocks into very ugly, semantically invalid markup.
For example, this reST:
.. code-block:: html

   <html>
   <head>... head of the document ...</head>
   </html>
will be translated into pre wrapped by two divs and will contain spans for each line:
<div class="highlight-html">
<div class="highlight">
<span class="x"><html></span>
<span class="x"><head>... head of the document ...</head></span>
<span class="x"></html></span>
</div>
</div>
Is there a way to translate a code block into code wrapped by pre (as W3C recommends) like below?
<pre>
<code>
<html>
<head>... head of the document ...</head>
</html>
</code>
</pre>

Best way to markup "mainContentOfPage"?

Other areas of a web page are simple to mark up, e.g. navigation element, header, footer, sidebar.
Not so with mainContentOfPage. I've seen a number of different ways to implement this, most recently (and this is the one I found strangest) on schema.org itself:
<div itemscope itemtype="http://schema.org/Table">
<meta itemprop="mainContentOfPage" content="true"/>
<h2 itemprop="about">list of presidents</h2>
<table>
<tr><th>President</th><th>Party</th></tr>
<tr>
<td>George Washington (1789-1797)</td>
<td>no party</td>
</tr>
<tr>
<td>John Adams (1797-1801)</td>
<td>Federalist</td>
</tr>
...
</table>
</div>
I could use some examples. In this case the main content of my page is a search results page, but I plan to use this on other pages too (homepage, product page, etc.).
Edit: I found some more examples.
Would this be valid? I found this on a blog:
<div id="main" itemscope itemtype="http://schema.org/WebPageElement" itemprop="mainContentOfPage">
<p>The content</p>
</div>
I also found this even simpler example on another blog (might be too simple?):
<div id="content" itemprop="mainContentOfPage">
<p>The content</p>
</div>
The mainContentOfPage property can be used on WebPage and expects a WebPageElement as value.
But Table is not a child of WebPage and true is not an expected value. So this example is in fact strange, as it doesn’t follow the specification.
A parent WebPage should use Table as value for mainContentOfPage:
<body itemscope itemtype="http://schema.org/WebPage">
<div itemprop="mainContentOfPage" itemscope itemtype="http://schema.org/Table">
</div>
</body>
EDIT: Update
Your second example is the same as mine; it just uses the more general WebPageElement instead of Table. (Of course you’d still need a parent WebPage item, like in my example.)
Your third example is not in line with schema.org’s definition, as the value is Text and not the expected WebPageElement (or child) item.
A valid option would be:
<body itemscope itemtype="http://schema.org/WebPage">
<main itemprop="mainContentOfPage" itemscope itemtype="http://schema.org/WebPageElement">
<div itemprop="about" itemscope="" itemtype="http://schema.org/Thing">
<h1 itemprop="name">whatever</h1>
</div>
</main>
</body>
Of course you may add related properties to top-level or nested elements, and change Thing to any other item type listed at Full Hierarchy. I also recommend using mainEntity. The documentation still doesn't clarify whether it is really necessary, but according to the first example here, when using WebPage you may want to specify a mainEntity:
<body itemscope itemtype="http://schema.org/WebPage">
<header><h1 itemscope itemprop="mainEntity" itemtype="http://schema.org/Thing">whatever</h1></header>
<main itemprop="mainContentOfPage" itemscope itemtype="http://schema.org/WebPageElement">
<div itemprop="about" itemscope="" itemtype="http://schema.org/Thing">
<h2 itemprop="name">whatever</h2>
</div>
</main>
</body>
I cannot tell whether this would also be valid:
<body itemscope itemtype="http://schema.org/WebPage">
<main itemprop="mainContentOfPage" itemscope itemtype="http://schema.org/WebPageElement">
<div itemprop="mainEntity" itemscope="" itemtype="http://schema.org/Thing">
<h1 itemprop="name">whatever</h1>
</div>
</main>
</body>
The documentation doesn't say anything about setting mainEntity on nested items.
In any case, consider that "[...] Every web page is implicitly assumed to be declared to be of type WebPage [...]", as stated in the WebPage description, and that HTML tags such as <main>, <footer> or <header> already indicate what type of elements are used in a page. So if you do not actually need to add relevant information to those elements or to the web page itself, proper use of HTML tags lets you do without mainContentOfPage or even WebPage.

Rich Snippets : Microdata itemprop out of the itemtype?

I've recently decided to update a website by adding rich snippets - microdata.
The thing is, I'm a newbie to this kind of thing and I have a small question about it.
I'm trying to define the Organization as you can see from the code below:
<div class="block-content" itemscope itemtype="http://schema.org/Organization">
<p itemprop="name">SOME ORGANIZATION</p>
<p itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Manufacture Street no 4</span>,
<span itemprop="postalCode">4556210</span><br />
<span itemprop="addressLocality">CityVille</span>,
<span itemprop="addressCountry">SnippetsLand</span></p>
<hr>
<p itemprop="telephone">0444 330 226</p>
<hr>
<p>info@snippets.com</p>
</div>
Now, my problem is the following: I'd also like to tag the logo to make a complete Organization profile, but the logo sits in the header of my page while the div I've posted above sits in the footer, and the style/layout of the page doesn't permit me to add the logo there and also make it visible.
So, how can I solve this thing? What's the best solution?
Thanks.
You can use the itemref attribute.
Give your logo in the header an id and add the corresponding itemprop:
<img src="acme-logo.png" alt="ACME Inc." itemprop="logo" id="logo" />
Now add itemref="logo" to your div in the footer:
<div class="block-content" itemscope itemtype="http://schema.org/Organization" itemref="logo">
…
</div>
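To illustrate what itemref does, here is a simplified sketch of how a consumer might gather an item's properties: it starts from the item's children plus every element named in itemref, and stops descending at nested items. It uses only the Python standard library, the page is rewritten as well-formed XML so that it parses, and the helper names (property_value, item_properties) are hypothetical. It is not a complete microdata parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, written as well-formed XML so ElementTree can parse it.
PAGE = """
<body>
  <header>
    <img src="acme-logo.png" alt="ACME Inc." itemprop="logo" id="logo" />
  </header>
  <footer>
    <div class="block-content" itemscope=""
         itemtype="http://schema.org/Organization" itemref="logo">
      <p itemprop="name">SOME ORGANIZATION</p>
      <p itemprop="telephone">0444 330 226</p>
    </div>
  </footer>
</body>
"""

def property_value(el):
    """Simplified value rules: URL attributes for img/a/link, content for meta, else text."""
    if el.tag == "img":
        return el.get("src")
    if el.tag in ("a", "link"):
        return el.get("href")
    if el.tag == "meta":
        return el.get("content")
    return "".join(el.itertext()).strip()

def item_properties(item, root):
    """Collect properties from the item's children and from elements listed in itemref."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    pending = list(item)  # direct element children of the item
    for ref in item.get("itemref", "").split():
        if ref in by_id:
            pending.append(by_id[ref])
    props = {}
    while pending:
        el = pending.pop(0)
        name = el.get("itemprop")
        if name is not None:
            props.setdefault(name, []).append(property_value(el))
        if el.get("itemscope") is None:
            # Not a nested item, so its children may carry more properties.
            pending.extend(el)
    return props

root = ET.fromstring(PAGE)
org = next(el for el in root.iter()
           if el.get("itemtype") == "http://schema.org/Organization")
props = item_properties(org, root)
print(props)
```

The logo ends up among the Organization's properties even though the img lives in the header, which is the effect the itemref attribute is meant to achieve.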
If this is not possible in your case, you could "duplicate" the logo so that it’s included in your div but not visible. Microdata allows meta and link elements in the body for this case. You should use the link element, as http://schema.org/Organization expects a URL for the logo property. (Alternatively, add it via meta as a separate ImageObject.)
<div class="block-content" itemscope itemtype="http://schema.org/Organization">
…
<link itemprop="logo" href="logo.png" />
…
</div>
Side note: I don’t think that you are using the hr element correctly in your example. If you simply want to display a horizontal line, you should use CSS (e.g. border-top on the p) instead.
Dan, you could simply add in the logo schema with this code:
<img itemprop="logo" src="http://www.example.com/logo.png" />
So in your example, you could simply tag it as:
<div class="block-content" itemscope itemtype="http://schema.org/Organization">
<p itemprop="name">SOME ORGANIZATION</p>
<img itemprop="logo" src="http://www.example.com/logo.png" />
<p itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Manufacture Street no 4</span>,
<span itemprop="postalCode">4556210</span><br />
<span itemprop="addressLocality">CityVille</span>,
<span itemprop="addressCountry">SnippetsLand</span></p>
<hr>
<p itemprop="telephone">0444 330 226</p>
<hr>
<p>info@snippets.com</p>
</div>
I believe that should work for your particular case, and you wouldn't have to mark up the logo separately. Note that an img element is rendered like any other image, so it will be visible unless you hide it with CSS. Hope that helps.

xpath find attribute by id and get the attribute parent content

I have an XML-structure that looks like this:
<document>
<body>
<section>
<title>something</title>
<subtitle>Something again</subtitle>
<section>
<p xml:id="1234">Some text</p>
</section>
</section>
<section>
<title>something2</title>
<subtitle>Something again2</subtitle>
<section>
<p xml:id="12345678">Some text2</p>
</section>
</section>
</body>
</document>
What I want is to search for the attribute xml:id containing 12345678 and, once found, get the content of the previous sibling (subtitle). Is this possible with XPath? I have this:
//p[contains(@xml:id,'12345678')]/preceding-sibling::subtitle
If I have understood the post correctly, the expected answer for the specific query you have posted is Something again2. You can use the following query to get it:
UPDATED because the document structure in the question changed:
//section[section/p[@xml:id="12345678"]]/subtitle
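The original attempt returns nothing because subtitle is not a sibling of p: it is a child of the outer section, two levels above the p. A minimal lxml sketch comparing the attempt with two working forms follows. The ids are prefixed with "p" as an assumption so that they are valid xml:id values, and the parent-step variant is an illustrative alternative that does not come from the answer.

```python
from lxml import etree

# Sample document from the question, with "p"-prefixed ids (an assumption).
XML = """<document>
  <body>
    <section>
      <title>something</title>
      <subtitle>Something again</subtitle>
      <section>
        <p xml:id="p1234">Some text</p>
      </section>
    </section>
    <section>
      <title>something2</title>
      <subtitle>Something again2</subtitle>
      <section>
        <p xml:id="p12345678">Some text2</p>
      </section>
    </section>
  </body>
</document>"""

root = etree.fromstring(XML)

# <subtitle> is not a sibling of <p>, so this selects nothing.
attempt = root.xpath("//p[@xml:id='p12345678']/preceding-sibling::subtitle")

# Select the outer <section> by what it contains, then take its <subtitle>.
by_predicate = root.xpath("//section[section/p[@xml:id='p12345678']]/subtitle")

# Alternative: climb from <p> to the inner and then the outer <section>.
by_parent = root.xpath("//p[@xml:id='p12345678']/../../subtitle")

print([s.text for s in attempt])
print([s.text for s in by_predicate])
print([s.text for s in by_parent])
```

The predicate form keeps working if extra wrapper elements are added around the outer section. The parent-step form depends on the p sitting exactly two levels below the section that holds the subtitle.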
