XPath syntax for a query that excludes some specific elements - xpath

You can find my test html page at https://sabbiobet.netsons.org/test.html
This is the html markup of the page:
<table border="1" class="class_table">
<tbody>
<tr class="class_tr">
<td class="class_td">&nbsp;</td>
</tr>
<tr class="class_tr">
<td class="class_td"><span class="class_span_ok"></span>square</td>
</tr>
<tr class="class_tr">
<td class="class_td"><span class="class_span_ko"></span>circle</td>
</tr>
<tr class="class_tr">
<td class="class_td"><span class="class_span_ok"></span>triangle</td>
</tr>
</tbody>
</table>
I need the text of every <td> with class="class_td", except those whose text is empty or that have a child <span> with class="class_span_ko".
In other words I want to obtain only these values:
Square
Triangle
Using the IMPORTXML function of Google Sheets and following another user's suggestion, I tried:
//td[@class='class_td' and span and not(span[@class='class_ko'])]
but it works only if I put some text between <span> and </span>.
Without any text I get only an empty result.
Can somebody help me?

If the provided document, which is not well-formed, is corrected to a well-formed one by replacing the undefined entity &nbsp; with the equivalent character reference &#xA0;:
<table border="1" class="class_table">
<tbody>
<tr class="class_tr">
<td class="class_td">&#xA0;</td>
</tr>
<tr class="class_tr">
<td class="class_td"><span class="class_span_ok"></span>square</td>
</tr>
<tr class="class_tr">
<td class="class_td"><span class="class_span_ko"></span>circle</td>
</tr>
<tr class="class_tr">
<td class="class_td"><span class="class_span_ok"></span>triangle</td>
</tr>
</tbody>
</table>
then this XPath expression (in which &#xA0; stands for the no-break space character, U+00A0):
/*/*/*/td
[@class='class_td'
and not(span[@class='class_span_ko'])
and normalize-space(translate(., '&#xA0;', ''))
]
when evaluated, selects exactly the wanted td elements:
<td class="class_td">
<span class="class_span_ok"/>square</td>
<td class="class_td">
<span class="class_span_ok"/>triangle</td>
XSLT - based verification
This transformation evaluates the above XPath expression and copies the selected elements to the output:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"/*/*/*/td
[@class='class_td'
and not(span[@class='class_span_ko'])
and normalize-space(translate(., '&#xA0;', ''))
]"/>
</xsl:template>
</xsl:stylesheet>
The wanted, correct result is produced:
<td class="class_td">
<span class="class_span_ok"/>square</td>
<td class="class_td">
<span class="class_span_ok"/>triangle</td>
Note:
If only the string values of the wanted elements are needed, then the XPath expression (with &#xA0; denoting the no-break space character) can be:
/*/*/*/td
[@class='class_td'
and not(span[@class='class_span_ko'])
and normalize-space(translate(., '&#xA0;', ''))
]/text()
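To check the logic of the predicate outside an XPath engine, here is a minimal Python sketch. It is not the XPath expression itself: the standard library's xml.etree.ElementTree supports only a small XPath subset (no not(), normalize-space() or translate()), so the three conditions are re-implemented in plain Python. The function name wanted_cells and the inline SAMPLE string are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the sample table; "\u00a0" stands in for &#xA0;.
SAMPLE = (
    '<table border="1" class="class_table"><tbody>'
    '<tr class="class_tr"><td class="class_td">\u00a0</td></tr>'
    '<tr class="class_tr"><td class="class_td"><span class="class_span_ok"></span>square</td></tr>'
    '<tr class="class_tr"><td class="class_td"><span class="class_span_ko"></span>circle</td></tr>'
    '<tr class="class_tr"><td class="class_td"><span class="class_span_ok"></span>triangle</td></tr>'
    '</tbody></table>'
)


def wanted_cells(xml_text):
    """Return the text of each td that passes the three conditions of the predicate."""
    root = ET.fromstring(xml_text)
    results = []
    for td in root.iter("td"):
        # @class='class_td'
        if td.get("class") != "class_td":
            continue
        # not(span[@class='class_span_ko'])
        if any(span.get("class") == "class_span_ko" for span in td.findall("span")):
            continue
        # normalize-space(translate(., NBSP, '')) must be a non-empty string
        text = "".join(td.itertext()).replace("\u00a0", "")
        text = " ".join(text.split())
        if text:
            results.append(text)
    return results


print(wanted_cells(SAMPLE))  # ['square', 'triangle']
```

The empty-looking cell is dropped by the third condition and the "circle" cell by the second, which matches the two elements selected above.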

Related

Why does XPath 3.0 work, but XQuery 3.0 doesn't, with the same expression?

I ran my expression in oXygen. As XPath 3.0 it finds what I need, but as XQuery 3.0 it finds nothing.
This is my XPath expression:
//table[tbody/tr/th/p[contains(text(), 'All Water System Contacts')]]/tbody/tr[3]/td[1]
This is my XML; I have included only part of it:
<table border="1" cellpadding="1" cellspacing="1" summary="." width="640">
<tbody>
<tr>
<th colspan="3">
<p>All Water System Contacts </p></th>
</tr>
<tr>
<th>Type</th>
<th>Contact</th>
<th>Communication</th>
</tr>
<tr>
<td align="center">AC - Administrative Contact - GENERAL MANAGER </td>
<td align="center">GRANT, JOHN, W <br/> PO BOX 869<br/> BIG SPRING, TX 79721-0869 </td>
<td align="center">
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse"
width="100%">
<tbody>
<tr>
<th><b>Electronic Type</b></th>
<th><b>Value</b></th>
</tr>
</tbody>
</table>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse"
width="100%">
<tbody>
<tr>
<th><b>Phone Type</b></th>
<th><b>Value</b></th>
</tr>
<tr>
<td align="center">BUS - Business</td>
<td align="center">432-267-6341 </td>
</tr>
<tr>
<td align="center">FAX - Facsimile</td>
<td align="center">432-267-3121 </td>
</tr>
<tr>
<td align="center">BUS - Business</td>
<td align="center">432-267-6070 </td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td align="center">OW - Owner </td>
<td align="center">COLORADO RIVER MUNICIPAL WATER DISTRICT <br/> PO BOX 869<br/> BIG
SPRING, TX 79721-0869 </td>
<td align="center"> </td>
</tr>
</tbody>
</table>
I tried different functions.
I don't know why it doesn't work or what the difference is.
Please help me.
I suspect that your real, complete input declares the XHTML default namespace, xmlns="http://www.w3.org/1999/xhtml", and that the oXygen XPath setting "use the default namespace of the root element" is enabled. That would make your path work in XPath out of the box, whereas for XQuery you have to declare the namespace explicitly with
declare default element namespace 'http://www.w3.org/1999/xhtml';
in the prolog of your XQuery file or code sample.
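The effect of a default namespace on unprefixed names can be reproduced in a few lines of Python. This is only a sketch of the general mechanism, not of oXygen's behaviour: the miniature document below is hypothetical, and the prefix h is an arbitrary choice bound to the XHTML namespace.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical miniature of the page, with the suspected default namespace.
DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><table><tbody>'
    '<tr><th><p>All Water System Contacts </p></th></tr>'
    '<tr><th>Type</th></tr>'
    '<tr><td>AC - Administrative Contact - GENERAL MANAGER </td></tr>'
    '</tbody></table></body></html>'
)

root = ET.fromstring(DOC)

# Unqualified names only match elements in no namespace, so nothing is found.
no_namespace = root.findall(".//table/tbody/tr[3]/td[1]")

# Binding a prefix to the XHTML namespace makes the same path match.
ns = {"h": XHTML_NS}
with_namespace = root.findall(".//h:table/h:tbody/h:tr[3]/h:td[1]", ns)

print(len(no_namespace))                # 0
print(with_namespace[0].text.strip())   # AC - Administrative Contact - GENERAL MANAGER
```

Declaring a default element namespace in the XQuery prolog plays the same role as the prefix binding here: it tells the processor which namespace the unprefixed element names belong to.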

Xpath's //tr/td/a[text()="France"] shows 3 matches. How to get value of "x" result?

For example, given this piece of code:
<tr>
<td width="270" class="news">
Ukraine
<span>17.06.15<span>
</td>
</tr>
<tr>
<td width="270" class="news">
France
<span>17.06.15<span>
</td>
</tr>
<tr>
<td width="270" class="news">
USA
<span>17.06.15<span>
</td>
</tr>
<tr>
<td width="270" class="news">
France
<span>10.10.15<span>
</td>
</tr>
<tr>
<td width="270" class="news">
Germany
<span>17.06.15<span>
</td>
</tr>
<tr>
<td width="270" class="news">
France
<span>23.12.15<span>
</td>
</tr>
In Chrome's developer tools, the XPath //tr/td/a[text()="France"] shows me 3 results.
I can't find a way to get the value I need.
For example, how do I get the date of the 2nd result (10.10.15)?
I tried different approaches with position, such as //tr/td/a[text()="France"][position=2]/span/text() and //tr/td/a[position(text()="France")][2]/span/text(), but without success. Maybe "position" is not what I need?
EDITED:
The positions of "France" in the code differ from page to page. In the example above the positions are 2, 4 and 6, but on another page they could be 3, 10, 15, 25 and 27. How can I get the 2nd result and the last one, regardless of where 'France' is actually located?
This is one possible way:
(//tr/td[a='France'])[2]/span/text()
The parentheses before the position index are required because a predicate ([]) binds more tightly than the path operators (// and /) *. Without them, the expression //tr/td[a='France'][2]/span/text(), for example, looks for the 2nd matching td within a single tr parent, which doesn't exist in the HTML sample you posted.
*: for further explanation on this matter: How to select specified node within Xpath node sets by index with Selenium?
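The difference between indexing the whole result and indexing per parent can be sketched in Python. The sample below is a hypothetical, well-formed version of the rows with each country name inside an <a> element, as the XPath in the question implies. Taking the last list item corresponds to (//tr/td[a='France'])[last()], which covers the "last result" part of the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample: one td per tr, country name inside <a>.
DOC = (
    "<table>"
    "<tr><td><a>Ukraine</a><span>17.06.15</span></td></tr>"
    "<tr><td><a>France</a><span>17.06.15</span></td></tr>"
    "<tr><td><a>USA</a><span>17.06.15</span></td></tr>"
    "<tr><td><a>France</a><span>10.10.15</span></td></tr>"
    "<tr><td><a>Germany</a><span>17.06.15</span></td></tr>"
    "<tr><td><a>France</a><span>23.12.15</span></td></tr>"
    "</table>"
)

root = ET.fromstring(DOC)

# All matching cells in document order, like (//tr/td[a='France']).
france_cells = root.findall(".//tr/td[a='France']")

second = france_cells[1].findtext("span")   # like (...)[2]/span/text()
last = france_cells[-1].findtext("span")    # like (...)[last()]/span/text()

# A positional predicate attached to the step counts per parent tr instead,
# so asking for the second td of each row finds nothing in this sample.
second_td_per_row = root.findall(".//tr/td[2]")

print(second, last, len(second_td_per_row))  # 10.10.15 23.12.15 0
```

Because the list is built in document order first and indexed afterwards, the result does not depend on which rows happen to contain 'France'.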
Your 'xml' was invalid so first I corrected it (I know it isn't valid HTML but good enough to test):
<?xml version="1.0" encoding="utf-8"?>
<form>
<tr>
<td width="270" class="news">
France
<span>17.06.15</span>
</td>
</tr>
<tr>
<td width="270" class="news">
France
<span>10.10.15</span>
</td>
</tr>
<tr>
<td width="270" class="news">
France
<span>23.12.15</span>
</td>
</tr>
</form>
Then, this worked:
//tr[position()=2]/td/a[text()="France"]/../span/text()

Optimal XPath Query for processing the sample HTML fragment

I have a feed that outputs HTML. The following segment is part of the output
<div class="leftnav">
<table border="0" cols="2">
<tr>
<td colspan="2" class="topline"><span style="font-size: 1px"> </span></td>
</tr>
<tr>
<td colspan="2"><span class="bold">Article Cat1 </span></td>
</tr>
<tr>
<td class="date" colspan="2">
ArticleTitle1</td>
</tr>
<tr>
<td width="20"></td>
<td class="date">
ArticleLink1
</td>
</tr>
<tr>
<td colspan="2" class="topline"><span style="font-size: 1px"> </span></td>
</tr>
<tr>
<td colspan="2"><span class="bold">Article Cat2 </span></td>
</tr>
<tr>
<td class="date" colspan="2">
ArticleTitle2</td>
</tr>
<tr>
<td width="20"></td>
<td class="date">
ArticleLink2
</td>
</tr>
</table>
</div>
I want to process the above segment using XPath so that the output looks like this:
Article Cat1
ArticleTitle1
ArticleLink1 Article Cat2
ArticleTitle2
ArticleLink2
What is the optimal XPath that will produce the desired output? I tried //div[@class="leftnav"]/table/tr, but this returns all the TR elements. I want to skip the first TR element so that I get the output in the format described above.
//div[@class="leftnav"]/table/tr[position() > 1]
Try the above.
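For comparison, the same "everything except the first row" selection can be sketched with Python's standard library, where the positional filter becomes a list slice. The document below is a shortened, hypothetical stand-in for the feed fragment (first category only), and the whitespace clean-up is an assumption about the desired output.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the feed fragment (first category only).
DOC = (
    '<div class="leftnav"><table border="0" cols="2">'
    '<tr><td colspan="2" class="topline"><span>\u00a0</span></td></tr>'
    '<tr><td colspan="2"><span class="bold">Article Cat1 </span></td></tr>'
    '<tr><td class="date" colspan="2">ArticleTitle1</td></tr>'
    '<tr><td width="20"></td><td class="date">ArticleLink1</td></tr>'
    '</table></div>'
)

div = ET.fromstring(DOC)

# Counterpart of tr[position() > 1]: take the rows in order and drop the first.
rows = div.findall("./table/tr")[1:]

# Collapse whitespace in each row's text, much as normalize-space() would.
lines = [" ".join("".join(row.itertext()).split()) for row in rows]
print(lines)  # ['Article Cat1', 'ArticleTitle1', 'ArticleLink1']
```

Note that only the very first row is skipped; in the full fragment the second "topline" separator row would still be selected and would need its own filter.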
Stupid simple way:
substring-after(normalize-space(string(//*:div)), normalize-space(string(//*:div/*:table/*[1])))
Result: "Article Cat1 ArticleTitle1 ArticleLink1 nbsp Article Cat2 ArticleTitle2 ArticleLink2"
I don't know why, but (position() > 1) doesn't work in my environment, so I've used strings instead.

How to use "selenium-webdriver-xpath" to get the text value from <td>?

Consider the HTML code below:
<table id="supplier_list_data" cellspacing="1" cellpadding="0" class="data">
<tr class="rowLight">
<td class="extraWidthSmall">
Cdata
</td>
<td class="extraWidthSmall">
xyz
</td>
<td class="extraWidthSmall">
ppm
</td>
</tr>
</table>
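Now, using XPath, how can I get the value xyz (that is, always the second <td>)? Please give me an idea!
Try //tr/td[2]/data().
//tr/td selects all <td/> elements, [2] keeps the second one inside each <tr/>, and data() returns its contents. Note that data() as a path step needs XPath 2.0 or later; browsers evaluate XPath 1.0, and WebDriver locators must resolve to elements, so with Selenium you would typically locate //tr/td[2] and read the element's text instead.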
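As a rough illustration of "select the element, then read its text", here is a standard-library Python sketch run against the table from the question. It does not use Selenium; the table is simply parsed from an inline string.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<table id="supplier_list_data" cellspacing="1" cellpadding="0" class="data">'
    '<tr class="rowLight">'
    '<td class="extraWidthSmall">\nCdata\n</td>'
    '<td class="extraWidthSmall">\nxyz\n</td>'
    '<td class="extraWidthSmall">\nppm\n</td>'
    '</tr></table>'
)

table = ET.fromstring(DOC)

# td[2] is evaluated per row: the second td child of each tr.
second_cells = table.findall(".//tr/td[2]")

# Reading the element's text takes the place of data() here.
values = [cell.text.strip() for cell in second_cells]
print(values)  # ['xyz']
```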

Get siblings from header until next header in an unsemantic table

Using Scrapy I'd like to parse a webpage containing a very unsemantic table. What I'm looking for is a "print every following-sibling until you meet the following element"-XPath-query.
<table>
<tr>
<th>Title</th>
<th>Name</th>
<th>Comment</th>
<th>Note</th>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4"> <b>HEADER1</b></td>
</tr>
<tr>
<td>Title1.1</td>
<td>-</td>
<td>Info1.1</td>
<td></td>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4"> <b>HEADER2</b></td>
</tr>
<tr>
<td>Title2.1</td>
<td>Name2.1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Title2.2</td>
<td>Name2.2</td>
<td>Info2.2</td>
<td></td>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4"> <b>HEADER3</b></td>
</tr>
<tr>
<td>Title3.1</td>
<td>Name3.1</td>
<td></td>
<td></td>
</tr>
</table>
I'd like to group every Title, Name, Comment and Note under each header. I have tried with various XPaths (with variations of following-sibling, preceding-sibling and count) but I either get nothing, everything or every tr which is not a header.
I'm currently getting the headers with //tr[@style] or //tr[td[@colspan="4"]].
The following is the parse function in my Scrapy spider (it prints the header and all of the tr's that are not headers):
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Header rows: each contains a td that spans all four columns.
    sites = hxs.select('//*[@id="content-text"]//tr[td[@colspan="4"]]')
    for site in sites:
        # Header title
        print(site.select('./td/b/text()').extract())
        # Every later non-header row (not limited to this header's group)
        print(site.select('./following-sibling::tr[not(td[@colspan])]'))
This XPath expression:
/*/tr[@style or td[@colspan='4']][1]/following-sibling::tr
[count(. | /*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr)
=
count(/*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr)
]
selects all tr elements that are between the 1st and 2nd headers:
<tr>
<td>Title1.1</td>
<td>-</td>
<td>Info1.1</td>
<td/>
</tr>
To select all tr elements that are between the Kth and (K+1)th headers, simply replace in the above expression 1 with K (the number) and 2 with K+1 (the number).
XSLT - based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select=
"/*/tr[@style or td[@colspan='4']][1]/following-sibling::tr
[count(. | /*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr)
=
count(/*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr)
]
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<table>
<tr>
<th>Title</th>
<th>Name</th>
<th>Comment</th>
<th>Note</th>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4">
<b>HEADER1</b>
</td>
</tr>
<tr>
<td>Title1.1</td>
<td>-</td>
<td>Info1.1</td>
<td></td>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4">
<b>HEADER2</b>
</td>
</tr>
<tr>
<td>Title2.1</td>
<td>Name2.1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Title2.2</td>
<td>Name2.2</td>
<td>Info2.2</td>
<td></td>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4">
<b>HEADER3</b>
</td>
</tr>
<tr>
<td>Title3.1</td>
<td>Name3.1</td>
<td></td>
<td></td>
</tr>
</table>
the XPath expression is evaluated and the selected nodes are copied to the output:
<tr>
<td>Title1.1</td>
<td>-</td>
<td>Info1.1</td>
<td/>
</tr>
Explanation:
This is a simple application of the Kayessian (after Dr. Michael Kay) formula for node-set intersection:
$ns1[count(.|$ns2) = count($ns2)]
In this particular case we substitute $ns1 with:
/*/tr[@style or td[@colspan='4']][1]/following-sibling::tr
and we substitute $ns2 with:
/*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr
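The same intersection idea can be sketched in Python, with sets of row positions standing in for the two node-sets. This is a plain re-implementation for illustration, not the XPath expression itself; the helper names is_header and rows_between are hypothetical. It assumes a header K+1 exists, so the rows after the last header would need separate handling.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<table>"
    "<tr><th>Title</th><th>Name</th><th>Comment</th><th>Note</th></tr>"
    '<tr style="background-color:#CCDDEF;"><td colspan="4"><b>HEADER1</b></td></tr>'
    "<tr><td>Title1.1</td><td>-</td><td>Info1.1</td><td></td></tr>"
    '<tr style="background-color:#CCDDEF;"><td colspan="4"><b>HEADER2</b></td></tr>'
    "<tr><td>Title2.1</td><td>Name2.1</td><td></td><td></td></tr>"
    "<tr><td>Title2.2</td><td>Name2.2</td><td>Info2.2</td><td></td></tr>"
    '<tr style="background-color:#CCDDEF;"><td colspan="4"><b>HEADER3</b></td></tr>'
    "<tr><td>Title3.1</td><td>Name3.1</td><td></td><td></td></tr>"
    "</table>"
)


def is_header(tr):
    # Mirrors tr[@style or td[@colspan='4']]
    return tr.get("style") is not None or tr.find("td[@colspan='4']") is not None


def rows_between(table, k):
    """Rows after the k-th header and before the (k+1)-th header (k is 1-based)."""
    rows = table.findall("tr")
    header_positions = [i for i, tr in enumerate(rows) if is_header(tr)]
    start = header_positions[k - 1]
    end = header_positions[k]
    ns1 = set(range(start + 1, len(rows)))  # following-sibling::tr of header k
    ns2 = set(range(0, end))                # preceding-sibling::tr of header k+1
    return [rows[i] for i in sorted(ns1 & ns2)]


table = ET.fromstring(DOC)
for k in (1, 2):
    titles = [tr.find("td").text for tr in rows_between(table, k)]
    print(k, titles)
```

In a Scrapy spider the grouping could then be done once per header instead of printing every following non-header row for each header.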
