selecting specific text node with xpath? - xpath

<html>
apple
<Br>
orange
<br>
drugs
</html>
can you do something like
//html/text()[2]
it doesn't work.

<?xml version="1.0"?>
<html>
apple
<br/>
orange
<br/>
drugs
</html>
//html/text()[2]
returns orange for me # http://www.xmlme.com/XpathTool.aspx. What language are you dealing with?

Related

How to get Xpath for the following?

I have a xml file like following
<topic>
<title>Abstract
</title>
<body>
<p>
abstract data
</p>
</body>
</topic>
<topic>
<title>Keywords</title>
<body>
<p>
keywords data
</p>
</body>
</topic>
I have to check if title is "Keywords" than show the <p>text in </p>.
can anyone help me to get the exact xpath for this?
Thanks in advance
Try this one and let me know the result:
//title[text()="Keywords"]/following::p
or
//topic[title[text()="Keywords"]]//p
//title[text()="Keywords"]/body/p
for text only
//title[text()="Keywords"]/body/p/text()
please avoid double slash "//" and following, it will travel all the P tag
Try this below xpath
//title[text()="Keywords"]/following::p
Explanation of xpath:- Start your xpath with <title> along with text method and move ahead to the <p> tag using the following keyword.

Simple wkhtmltopdf conversion with framesets creating empty pdf

We need to convert/provide our html-based in-app HelpSystem to an on-disc pdf for the client to view outside of the application.
I'm trying to use wkhtmltopdf with a very basic file (3 frames with links to simple .html files) but getting an empty .pdf when I run the following from the command line:
wkhtmltopdf "C:\Program Files (x86)\wkhtmltopdf\index.html" "c:\delme\test.pdf"
I know frames are somewhat deprecated but it’s what I’ve got to deal with. Are the frames causing the empty pdf?
Index.html:
<html>
<head>
<title>Help</title>
</head>
<frameset cols="28%, 72%">
<frameset rows="8%, 92%">
<frame noresize="noresize" src="Buttons.html" name="UPPERLEFT" />
<frame noresize="noresize" src="mytest2.html" name="LOWERLEFT" />
</frameset>
<frame noresize="noresize" src="mytest.html" name="RIGHT" />
</frameset>
</html>
mytest.html:
<html>
<body>
<p>
<b>This text is bold</b>
</p>
<p>
<strong>This text is strong</strong>
</p>
<p>
<em>This text is emphasized</em>
</p>
<p>
<i>This text is italic</i>
</p>
<p>
<small>This text is small</small>
</p>
<p>This is
<sub>subscript</sub> and
<sup>superscript</sup></p>
</body>
</html>
mytest2.html:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<h2>The blockquote Element</h2>
<p>The blockquote element specifies a section that is quoted from another source.</p>
<p>Here is a quote from WWF's website:</p>
<blockquote cite="http://www.worldwildlife.org/who/index.html">For 50 years, WWF has been protecting the future of nature. The
world’s leading conservation organization, WWF works in 100 countries and is supported by 1.2 million members in the United
States and close to 5 million globally.</blockquote>
<p>
<b>Note:</b> Browsers usually indent blockquote elements.</p>
<h2>The q Element</h2>
<p>The q element defines a short quotation.</p>
<p>WWF's goal is to:
<q>Build a future where people live in harmony with nature.</q> We hope they succeed.</p>
<p>
<b>Note:</b> Browsers insert quotation marks around the q element.</p>
</body>
</html>
buttons.html:
![<html>
<body>
<center>
<table>
<tr>
<td>
<form method="link" action="mytest.html" target="LOWERLEFT">
<input type="submit" value="Contents" />
</form>
</td>
<td>
<form method="link" action="mytest2.html" target="LOWERLEFT">
<input type="submit" value="Index" />
</form>
</td>
</tr>
</table>
</center>
</body>
</html>][2]
Taken from the official wkhtmltopdf issues area from a code project member’s answer; emphasis is mine:
wkhtmltopdf calculates the TOC based on the H* (e.g. H1, H2 and so on)
tags in the supplied documents. It does not recurse into frames and
iframes.. It will nest dependend on the number, to make sure that it
does the right thing, it is good to make sure that you only have
tags under a tag and not for some k larger
then 1. 2000+ files sounds like a lot. You might run out of memory
while converting the output. If it does not work for you.. you could
try using the switch to dump the outline to a xml file, to see what it
would but into a TOC.

Phantom <span> element using ImportXML with XPath in Google Spreadsheet

I am trying to get the value of an element attribute from this site via importXML in Google Spreadsheet using XPath.
The attribute value i seek is content found in the <span> with itemprop="price".
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
...
</div>
I can access <div class="left"> but i can't get to the <span> element.
Tried using:
//span[#class='pret']/#content i get #N/A;
//span[#itemprop='price']/#content i get #N/A;
//div[#class='left']/span[#class='pret' and #itemprop='price']/#content i get #N/A;
//div[#class='left']/span[1]/#content i get #N/A;
//div[#class='left']/span/text() to get the text node of <span> i get #N/A;
//div[#class='left']//span/text() i get the text node of a <span> lower in div.left.
To get the text node of <span> i have to use //div[#class='left']/text(). But i can't use that text node because the layout of the span changes if a product is on sale, so i need the attribute.
It's like the span i'm looking for does not exist, although it appears in the development view of Chrome and in the page source and all XPath work in the console using $x("").
I tried to generate the XPath directly form the development tool by right clicking and i get //*[#id='produs']/div[4]/div[4]/div[1]/span which does not work. I also tried to generate the XPath with Firefox and plugins for FF and Chrome to no avail. The XPath generated in these ways did not even work on sites i managed to scrape with "hand coded XPath".
Now, the strangest thing is that on this other site with apparently similar code structure the XPath //span[#itemprop='price']/#content works.
I struggled with this for 4 days now. I'm starting to think it's something to do with the auto-closing meta tag, but why doesn't this happen on the other site?
Perhaps the following formulas can help you:
=ImportXML("http://...","//div[#class='product-info-price']//div[#class='left']/text()")
Or
=INDEX(ImportXML("http://...","//div[#class='product-info-price']//div[#class='left']"), 1, 2)
UPDATE
It seems that not properly parse the entire document, it fails. A document extraction, something like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<div class="product-info-price">
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
<div class="resealed-info">
» Vezi 1 resigilat din aceasta categorie
</div>
<ul style="margin-left: auto;margin-right: auto;width: 200px;text-align: center;margin-top: 20px;">
<li style="color: #000000; font-size: 11px;">Rata de la <b>28,18 RON</b> prin BRD</li>
<li style="color: #5F5F5F;text-align: center;">Pretul include TVA</li>
<li style="color: #5F5F5F;">Cod produs: <span style="margin-left: 0;text-align: center;font-weight: bold;" itemprop="identifier" content="mol:GA-Z87X-UD3H">GA-Z87X-UD3H</span> </li>
</ul>
</div>
<div class="right" style="height: 103px;line-height: 103px;">
<form action="/?a=shopping&sa=addtocart" method="post" id="add_to_cart_form">
<input type="hidden" name="product-183641" value="on"/>
<img src="/templates/marketonline/images/pag-prod/buton_cumpara.jpg"/>
</form>
</div>
</div>
</html>
works with the following XPath query:
"//div[#class='product-info-price']//div[#class='left']//span[#itemprop='price']/#content"
UPDATE
It occurs to me that one option is that you can use Apps Script to create your own ImportXML function, something like:
/* CODE FOR DEMONSTRATION PURPOSES */
function MyImportXML(url) {
var found, html, content = '';
var response = UrlFetchApp.fetch(url);
if (response) {
html = response.getContentText();
if (html) content = html.match(/<span class="pret" itemprop="price" content="(.*)">/gi)[0].match(/content="(.*)"/i)[1];
}
return content;
}
Then you can use as follows:
=MyImportXML("http://...")
At this time, the referred web page in the first link doesn't include a span tag with itemprop="price", but the following XPath returns 639
//b[#itemprop='price']
Looks to me that the problem was that the meta tag was not XHTML compliant but now all the meta tags are properly closed.
Before:
<meta itemprop="currency" content="RON">
Now
<meta itemprop="priceCurrency" content="RON" />
For web pages that are not XHTML compliant, instead of IMPORTXML another solution should be used, like using IMPORTDATA and REGEXEXTRACT or Google Apps Script, the UrlFetch Service and the match JavasScript function, among other alternatives.
Try smth like this:
print 'content by key',tree.xpath('//*[#itemprop="price"]')[0].get('content')
or
nodes = tree.xpath('//div/meta/span')
for node in nodes:
print 'content =',node.get('content')
But i haven't tried that.

xpath: find attribute value from identifier in current element attribute

I have an XML structure that looks like this:
<document>
<body>
<section>
<title>something</title>
<subtitle>Something again</subtitle>
<section>
<p xml:id="1234">Some text</p>
</section>
</section>
<section>
<title>something2</title>
<subtitle>Something again2</subtitle>
<section>
<p xml:id="12345678">Some text2</p>
<p getelement="1234"></p>
</section>
</section>
</body>
</document>
I want to search for the attribut value defined in "getelement". I got this code from a friendly soule here:
//section[section/p[#xml:id=#getelement]]/subtitle
but it doesnt work and i cant use current() since it is not supported in Arbortext.
You are comparing the attributes of the same element, but they are not. You have to find the getelement:
//section[section/p[#xml:id=//#getelement]]/subtitle
Also note that xml:id attributes cannot start with digits.

xpath find attribute by id and get the attribute parent content

I have an XML-structure that looks like this:
<document>
<body>
<section>
<title>something</title>
<subtitle>Something again</subtitle>
<section>
<p xml:id="1234">Some text</p>
</section>
</section>
<section>
<title>something2</title>
<subtitle>Something again2</subtitle>
<section>
<p xml:id="12345678">Some text2</p>
</section>
</section>
</body>
</document>
What i want to is to find search for the attribute xml:id containing 12345678 and once found, get the previous sibling (subtitle) content. Is this possible with xpath? I have this:
//p[contains(#xml:id,'12345678')]/preceding-sibling::subtitle
If I have understood the post correctly, for the specific query that you have put, the expected answer is Something Again2. You can use the following query to do this:
UPDATED as the document schema is changed
//section[section/p[#xml:id="12345678"]]/subtitle

Resources