Auto generate XPath for known element in HTML tree using python

Is there any way (using a library, not manually) to generate a relative XPath for a known element in HTML?
Let's say the second p element inside the div with class="content":
<html>
<body>
<div class="title">
<h1>***</h1>
</div>
<p> *** </p>
<h3>***</h3>
<div class="content">
<p>****</p>
<p>****</p>
</div>
</body>
</html>
Use case:
The idea is to guess where the elements I might be interested in are, for example the title, content, or author. Once I've found an element, I want to generate an XPath for it and use it later from Python 3.

Try something like this:
from lxml import etree
datum = """
<html>
<body>
<div class="title">
<h1>***</h1>
</div>
<p> *** </p>
<h3>***</h3>
<div class="content">
<p>something</p>
<p>target</p>
</div>
</body>
</html>
"""
root = etree.fromstring(datum)
tree = etree.ElementTree(root)
# Locate the known element, then ask the tree for its absolute path
find_text = etree.XPath("//p[text()='target']")
for target in find_text(root):
    print(tree.getpath(target))
Output:
/html/body/div[2]/p[2]
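If a path anchored at a class attribute is preferred over the absolute path that getpath returns, the idea can be sketched with the standard library alone. This is only an illustration under stated assumptions: relative_xpath is a hypothetical helper (it is not part of lxml or ElementTree), the markup is well-formed, and the anchor's class value is assumed to be distinctive enough on the page.

```python
import xml.etree.ElementTree as ET

DOC = """
<html>
<body>
<div class="title"><h1>***</h1></div>
<p> *** </p>
<h3>***</h3>
<div class="content">
<p>something</p>
<p>target</p>
</div>
</body>
</html>
"""


def relative_xpath(root, target, anchor_attr="class"):
    """Hypothetical helper: build an XPath for `target`, anchored at the
    nearest ancestor that carries `anchor_attr`. Falls back to an absolute
    path when no such ancestor exists."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        if node is not target and anchor_attr in node.attrib:
            anchor = f"//{node.tag}[@{anchor_attr}='{node.get(anchor_attr)}']"
            return "/".join([anchor] + steps[::-1])
        parent = parents.get(node)
        if parent is None:
            steps.append(node.tag)  # reached the document root
            break
        # Add a positional index only when same-tag siblings exist.
        same_tag = [sib for sib in parent if sib.tag == node.tag]
        step = node.tag
        if len(same_tag) > 1:
            step += f"[{same_tag.index(node) + 1}]"
        steps.append(step)
        node = parent
    return "/" + "/".join(reversed(steps))


root = ET.fromstring(DOC)
target = root.findall(".//div[@class='content']/p")[1]
print(relative_xpath(root, target))  # //div[@class='content']/p[2]
```

The same walk works on lxml elements, where getparent() could replace the parent map.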

Related

Issues with preceding sibling/parent/ancestor

<div class='productHolder'>
<a href="https://ap.com" class="tea-time-with-ap">
<div class="aptime-8" dataInfo="name">Hammer</div>
<div class="aptime-9" dataInfo="price">$980</div>
</div>
</div>
</a>
</div>
Note: there are over 20 productHolder classes on the same page.
I am able to get the price data; how can I use parent or preceding-sibling to get the href?
I use the following code to get price:
rawPrice = response.xpath("//*[contains(text(),'$')]/text()")[counter].extract()
I've spent 2 hours trying to use preceding-sibling, parent, and even changing the code to use other values, but I run into issues elsewhere.
Any help is appreciated, cheers!

Were you looking for something like:
from io import StringIO
from lxml import etree
html = """
<div class='productHolder'>
<a href="https://ap.com" class="tea-time-with-ap">
<div class="aptime-8" dataInfo="name">Hammer</div>
<div class="aptime-9" dataInfo="price">$980</div>
</div>
</div>
</a>
</div>
"""
root = etree.parse(StringIO(html), etree.HTMLParser())
print(root.xpath('//*[contains(text(),"$")]/../@href')[0])
Result:
https://ap.com
Of course you can easily build from this:
item = root.xpath('//*[contains(text(),"$")]')
print(item[0].text)
print(item[0].xpath('../@href')[0])
Result:
$980
https://ap.com
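With more than 20 productHolder blocks on a page, one way to keep each price paired with its own link is to loop over the holders first and query inside each one, which avoids the parent axis entirely. The sketch below uses the standard library and makes several assumptions: prices_with_links is a hypothetical helper, the sample is well-formed, and the second product and its URL are invented for illustration. Under an HTML parser the attribute name dataInfo may be normalized to lowercase, so the predicate might need adjusting there.

```python
import xml.etree.ElementTree as ET

# Assumed sample: two well-formed product blocks modelled on the question.
# The second block and its example.com URL are made up for illustration.
PAGE = """
<div>
  <div class="productHolder">
    <a href="https://ap.com" class="tea-time-with-ap">
      <div class="aptime-8" dataInfo="name">Hammer</div>
      <div class="aptime-9" dataInfo="price">$980</div>
    </a>
  </div>
  <div class="productHolder">
    <a href="https://example.com/saw" class="tea-time-with-ap">
      <div class="aptime-8" dataInfo="name">Saw</div>
      <div class="aptime-9" dataInfo="price">$45</div>
    </a>
  </div>
</div>
"""


def prices_with_links(markup):
    """Return (price, href) pairs, one per productHolder block."""
    root = ET.fromstring(markup)
    pairs = []
    for holder in root.findall(".//div[@class='productHolder']"):
        # Both lookups are relative to the current holder only.
        link = holder.find("a")
        price = holder.find(".//div[@dataInfo='price']")
        if link is not None and price is not None:
            pairs.append((price.text, link.get("href")))
    return pairs


print(prices_with_links(PAGE))
```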

Simple dom document iteration

I have an HTML as so:
<html>
<body>
<div class="somethingunneccessary"></div>
<div class="container">
<div>
<p>text1</p>
<p>text2</p>
<p>text3</p>
</div>
<div>
<p>text4</p>
<p>text5</p>
<p>text6</p>
</div>
<div>
<p>text7</p>
<p>text8</p>
<p>text9</p>
</div>
<div>
<p>text10</p>
<p>text11</p>
<p>text12</p>
</div>
<div>
<p>text13</p>
<p>text14</p>
<p>text15</p>
</div>
</div>
</body>
</html>
What I'm trying to accomplish is the following:
1./ Loop over the div elements within the div having a class container.
2./ During the iteration I want to grab the text from the 3rd p tag.
The looping part is essential instead of just slicing out the p tags by themselves
I've got some code done but it doesn't do looping:
$doc=new DOMDocument();
$doc->loadHTML($htmlsource);
$xpath = new DOMXpath($doc);
$commentxpath = $xpath->query("/html/body/div[2]/div[5]/p[3]");
$commentdata = $commentxpath->item(0)->nodeValue;
How do I loop through each inner div element and extract the 3rd p tag?
Like I said, the looping is essential.

During the iteration I want to grab the text from the 3rd p tag
Try:
"//div[@class='container']/div/p[3]"
This should return the third p of every div inside the div with class container.
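The question uses PHP's DOMXPath, but the two-step idea (select the inner divs, then take the third p relative to each one) can be sketched in Python with the standard library. This is an illustration only: third_paragraphs is a hypothetical helper and the sample is a shortened, well-formed version of the question's markup.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the markup from the question.
HTML = """<html><body>
<div class="somethingunneccessary"></div>
<div class="container">
  <div><p>text1</p><p>text2</p><p>text3</p></div>
  <div><p>text4</p><p>text5</p><p>text6</p></div>
  <div><p>text7</p><p>text8</p><p>text9</p></div>
</div>
</body></html>"""


def third_paragraphs(markup):
    """Loop over the inner divs and collect the text of each third <p>."""
    root = ET.fromstring(markup)
    texts = []
    for inner in root.findall(".//div[@class='container']/div"):
        third = inner.find("p[3]")  # relative to the current inner div
        if third is not None:
            texts.append(third.text)
    return texts


print(third_paragraphs(HTML))  # ['text3', 'text6', 'text9']
```

Keeping the loop explicit leaves room to read other children of each inner div in the same pass.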

You may have to query over attributes: php xpath get attribute value
$xpath->query("/html/body/div[@class='container']");

Just try
/html/body/div/div//p
That should return only the p elements XD

Best way to markup "mainContentOfPage"?

For other areas of a web page it is simple to mark up, i.e. navigation element, header, footer, sidebar.
Not so with mainContentOfPage; I've seen a number of different ways to implement this, most recently (and I found this one to be the most strange) on schema.org itself:
<div itemscope itemtype="http://schema.org/Table">
<meta itemprop="mainContentOfPage" content="true"/>
<h2 itemprop="about">list of presidents</h2>
<table>
<tr><th>President</th><th>Party</th></tr>
<tr>
<td>George Washington (1789-1797)</td>
<td>no party</td>
</tr>
<tr>
<td>John Adams (1797-1801)</td>
<td>Federalist</td>
</tr>
...
</table>
</div>
I could use some examples; the main content of my page is in this case a search results page, but I would plan to use this on other pages too (homepage, product page, etc.)
Edit, I found some more examples:
Would this be valid? I found this on a blog:
<div id="main" itemscope itemtype="http://schema.org/WebPageElement" itemprop="mainContentOfPage">
<p>The content</p>
</div>
I also found this even simpler example on another blog (might be too simple?):
<div id="content" itemprop="mainContentOfPage">
<p>The content</p>
</div>

The mainContentOfPage property can be used on WebPage and expects a WebPageElement as value.
But Table is not a child of WebPage and true is not an expected value. So this example is in fact strange, as it doesn’t follow the specification.
A parent WebPage should use Table as value for mainContentOfPage:
<body itemscope itemtype="http://schema.org/WebPage">
<div itemprop="mainContentOfPage" itemscope itemtype="http://schema.org/Table">
</div>
</body>
EDIT: Update
Your second example is the same as mine; it just uses the more general WebPageElement instead of Table. (Of course you’d still need a parent WebPage item, like in my example.)
Your third example is not in line with schema.org’s definition, as the value is Text and not the expected WebPageElement (or child) item.

A valid option would be:
<body itemscope itemtype="http://schema.org/WebPage">
<main itemprop="mainContentOfPage" itemscope itemtype="http://schema.org/WebPageElement">
<div itemprop="about" itemscope="" itemtype="http://schema.org/Thing">
<h1 itemprop="name">whatever</h1>
</div>
</main>
</body>
Of course you may add related properties to top-level or nested elements, and change Thing into any other item type listed at Full Hierarchy. I also recommend using mainEntity. The documentation still doesn't clarify whether it's really necessary, but according to the 1st example here, when using WebPage you may want to specify a mainEntity:
<body itemscope itemtype="http://schema.org/WebPage">
<header><h1 itemscope itemprop="mainEntity" itemtype="http://schema.org/Thing">whatever</h1></header>
<main itemprop="mainContentOfPage" itemscope itemtype="http://schema.org/WebPageElement">
<div itemprop="about" itemscope="" itemtype="http://schema.org/Thing">
<h2 itemprop="name">whatever</h2>
</div>
</main>
</body>
I cannot tell whether this would also be valid:
<body itemscope itemtype="http://schema.org/WebPage">
<main itemprop="mainContentOfPage" itemscope itemtype="http://schema.org/WebPageElement">
<div itemprop="mainEntity" itemscope="" itemtype="http://schema.org/Thing">
<h1 itemprop="name">whatever</h1>
</div>
</main>
</body>
The documentation doesn't say anything about setting mainEntity on nested items.
In any case, consider that "[...] Every web page is implicitly assumed to be declared to be of type WebPage [...]" as stated in the WebPage description, and that HTML tags such as <main>, <footer> or <header> already give information about what type of elements are used in a page. So if you do not actually need to add relevant information to those elements or to the web page itself, proper use of HTML tags lets you do without mainContentOfPage or even WebPage.

Phantom <span> element using ImportXML with XPath in Google Spreadsheet

I am trying to get the value of an element attribute from this site via importXML in Google Spreadsheet using XPath.
The attribute value I seek is content, found in the <span> with itemprop="price".
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
...
</div>
I can access <div class="left"> but I can't get to the <span> element.
I tried using:
//span[@class='pret']/@content I get #N/A;
//span[@itemprop='price']/@content I get #N/A;
//div[@class='left']/span[@class='pret' and @itemprop='price']/@content I get #N/A;
//div[@class='left']/span[1]/@content I get #N/A;
//div[@class='left']/span/text() to get the text node of <span> I get #N/A;
//div[@class='left']//span/text() I get the text node of a <span> lower in div.left.
To get the text node of <span> I have to use //div[@class='left']/text(). But I can't use that text node because the layout of the span changes if a product is on sale, so I need the attribute.
It's as if the span I'm looking for does not exist, although it appears in the development view of Chrome and in the page source, and all the XPaths work in the console using $x("").
I tried to generate the XPath directly from the development tool by right-clicking and I get //*[@id='produs']/div[4]/div[4]/div[1]/span, which does not work. I also tried to generate the XPath with Firefox and plugins for FF and Chrome, to no avail. The XPaths generated in these ways did not even work on sites I managed to scrape with "hand coded XPath".
Now, the strangest thing is that on this other site with apparently similar code structure the XPath //span[@itemprop='price']/@content works.
I have struggled with this for 4 days now. I'm starting to think it has something to do with the auto-closing meta tag, but why doesn't this happen on the other site?

Perhaps the following formulas can help you:
=ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']/text()")
Or
=INDEX(ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']"), 1, 2)
UPDATE
It seems that ImportXML fails because it cannot properly parse the entire document. An extract of the document, something like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<div class="product-info-price">
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
<div class="resealed-info">
» Vezi 1 resigilat din aceasta categorie
</div>
<ul style="margin-left: auto;margin-right: auto;width: 200px;text-align: center;margin-top: 20px;">
<li style="color: #000000; font-size: 11px;">Rata de la <b>28,18 RON</b> prin BRD</li>
<li style="color: #5F5F5F;text-align: center;">Pretul include TVA</li>
<li style="color: #5F5F5F;">Cod produs: <span style="margin-left: 0;text-align: center;font-weight: bold;" itemprop="identifier" content="mol:GA-Z87X-UD3H">GA-Z87X-UD3H</span> </li>
</ul>
</div>
<div class="right" style="height: 103px;line-height: 103px;">
<form action="/?a=shopping&sa=addtocart" method="post" id="add_to_cart_form">
<input type="hidden" name="product-183641" value="on"/>
<img src="/templates/marketonline/images/pag-prod/buton_cumpara.jpg"/>
</form>
</div>
</div>
</html>
works with the following XPath query:
"//div[@class='product-info-price']//div[@class='left']//span[@itemprop='price']/@content"
UPDATE
It occurs to me that one option is that you can use Apps Script to create your own ImportXML function, something like:
/* CODE FOR DEMONSTRATION PURPOSES */
function MyImportXML(url) {
  var content = '';
  var response = UrlFetchApp.fetch(url);
  if (response) {
    var html = response.getContentText();
    // Capture the content attribute of the price span; [^"]* stops at the closing quote
    var match = html && html.match(/<span class="pret" itemprop="price" content="([^"]*)"/i);
    // Leave content empty when the span is not found instead of throwing
    if (match) content = match[1];
  }
  return content;
}
Then you can use as follows:
=MyImportXML("http://...")

At this time, the referred web page in the first link doesn't include a span tag with itemprop="price", but the following XPath returns 639
//b[@itemprop='price']
Looks to me that the problem was that the meta tag was not XHTML compliant but now all the meta tags are properly closed.
Before:
<meta itemprop="currency" content="RON">
Now
<meta itemprop="priceCurrency" content="RON" />
For web pages that are not XHTML compliant, another solution should be used instead of IMPORTXML, such as IMPORTDATA with REGEXEXTRACT, or Google Apps Script with the UrlFetch Service and the JavaScript match function, among other alternatives.
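The effect of the unclosed meta tag can be reproduced with any strict XML parser. The sketch below is only an analogy: it uses Python's standard library rather than ImportXML, whose internals are not documented here, and price_via_xml and price_via_regex are hypothetical helpers. The regular expression also assumes that itemprop comes before content inside the span tag.

```python
import re
import xml.etree.ElementTree as ET

# "Before" markup from the question: the meta tag is never closed.
BEFORE = """<div class="left">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">698,31 RON</span>
</div>"""

# Same markup with the meta tag closed the XHTML way.
AFTER = BEFORE.replace(
    '<meta itemprop="currency" content="RON">',
    '<meta itemprop="currency" content="RON" />',
)


def price_via_xml(markup):
    """Strict XML parsing: returns None when the markup is not well-formed."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    span = root.find(".//span[@itemprop='price']")
    return span.get("content") if span is not None else None


def price_via_regex(markup):
    """Regex fallback, similar in spirit to REGEXEXTRACT or match()."""
    match = re.search(r'<span[^>]*itemprop="price"[^>]*content="([^"]*)"', markup)
    return match.group(1) if match else None


print(price_via_xml(BEFORE))    # None: the unclosed <meta> breaks strict parsing
print(price_via_xml(AFTER))     # 698,31 RON
print(price_via_regex(BEFORE))  # 698,31 RON
```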

Try something like this:
# tree is assumed to be an lxml tree of the page (not defined in this snippet)
print('content by key', tree.xpath('//*[@itemprop="price"]')[0].get('content'))
or
nodes = tree.xpath('//div/meta/span')
for node in nodes:
    print('content =', node.get('content'))
But I haven't tried that.

Sinatra indent partial erb at statement

Is there a way to make partials in Sinatra be indented at the level where I call them?
Example
<body>
  <div>
    <%= partial :"mypartial" %>
  </div>
</body>
Results in
<body>
  <div>
<div id="i am defined in mypartial">
  //etc
</div>
  </div>
</body>
When I want
<body>
  <div>
    <div id="i am defined in mypartial">
      //etc
    </div>
  </div>
</body>
This is possible if I indent the partial correctly, but that makes it hard to work with. I want the partial's indentation to start all the way to the left (in the source file).
Maybe there is some kind of post-processor that can format that HTML for me?
This is for a project that is not going public, but it's important that internal users can read the generated HTML easily, including correct indentation.
