Get siblings from header until next header in an unsemantic table - xpath

Using Scrapy I'd like to parse a webpage containing a very unsemantic table. What I'm looking for is a "print every following-sibling until you meet the following element"-XPath-query.
<table>
<tr>
<th>Title</th>
<th>Name</th>
<th>Comment</th>
<th>Note</th>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4"> <b>HEADER1</b></td>
</tr>
<tr>
<td>Title1.1</td>
<td>-</td>
<td>Info1.1</td>
<td></td>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4"> <b>HEADER2</b></td>
</tr>
<tr>
<td>Title2.1</td>
<td>Name2.1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Title2.2</td>
<td>Name2.2</td>
<td>Info2.2</td>
<td></td>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4"> <b>HEADER3</b></td>
</tr>
<tr>
<td>Title3.1</td>
<td>Name3.1</td>
<td></td>
<td></td>
</tr>
</table>
I'd like to group every Title, Name, Comment and Note under each header. I have tried with various XPaths (with variations of following-sibling, preceding-sibling and count) but I either get nothing, everything or every tr which is not a header.
I'm currently getting the headers with //tr[@style] or //tr[td[@colspan="4"]].
The following is the parse function in my Scrapy spider (it prints the header and all of the tr elements that are not headers):
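def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Header rows are the ones whose single cell spans all four columns.
    sites = hxs.select('//*[@id="content-text"]//tr[td[@colspan="4"]]')
    for site in sites:
        print(site.select('./td/b/text()').extract())
        # Problem: this matches every later non-header row, not only this header's group.
        print(site.select('./following-sibling::tr[not(td[@colspan])]'))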

This XPath expression:
/*/tr[@style or td[@colspan='4']][1]/following-sibling::tr
   [count(. | /*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr)
   =
    count(/*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr)
   ]
selects all tr elements that are between the 1st and 2nd headers:
<tr>
<td>Title1.1</td>
<td>-</td>
<td>Info1.1</td>
<td/>
</tr>
To select all tr elements that are between the Kth and (K+1)th headers, simply replace in the above expression 1 with K (the number) and 2 with K+1 (the number).
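The substitution of K and K+1 can be sketched as a small Python helper that builds the expression as a string. The function name rows_between_headers_xpath is hypothetical; the header test is the one used in the expression above.

```python
def rows_between_headers_xpath(k):
    """Build the XPath selecting the rows between the k-th and (k+1)-th headers."""
    if k < 1:
        raise ValueError("k is 1-based")
    header = "/*/tr[@style or td[@colspan='4']]"
    following = "{h}[{k}]/following-sibling::tr".format(h=header, k=k)
    preceding = "{h}[{n}]/preceding-sibling::tr".format(h=header, n=k + 1)
    # Kayessian intersection of the two node-sets.
    return "{f}[count(. | {p}) = count({p})]".format(f=following, p=preceding)


print(rows_between_headers_xpath(1))
```

Note that for the last header there is no (K+1)th header, so the second node-set is empty and the expression selects nothing; the rows after the last header need to be selected separately.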
XSLT - based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select=
"/*/tr[@style or td[@colspan='4']][1]/following-sibling::tr
   [count(. | /*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr)
   =
    count(/*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr)
   ]
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<table>
<tr>
<th>Title</th>
<th>Name</th>
<th>Comment</th>
<th>Note</th>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4">
<b>HEADER1</b>
</td>
</tr>
<tr>
<td>Title1.1</td>
<td>-</td>
<td>Info1.1</td>
<td></td>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4">
<b>HEADER2</b>
</td>
</tr>
<tr>
<td>Title2.1</td>
<td>Name2.1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Title2.2</td>
<td>Name2.2</td>
<td>Info2.2</td>
<td></td>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4">
<b>HEADER3</b>
</td>
</tr>
<tr>
<td>Title3.1</td>
<td>Name3.1</td>
<td></td>
<td></td>
</tr>
</table>
the XPath expression is evaluated and the selected nodes are copied to the output:
<tr>
<td>Title1.1</td>
<td>-</td>
<td>Info1.1</td>
<td/>
</tr>
Explanation:
This is a simple application of the Kayessian (after Dr. Michael Kay) formula for node-set intersection:
$ns1[count(.|$ns2) = count($ns2)]
In this particular case we substitute $ns1 with:
/*/tr[@style or td[@colspan='4']][1]/following-sibling::tr
and we substitute $ns2 with:
/*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr
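The same intersection can be mimicked outside XPath. The following is only a sketch using Python's standard xml.etree.ElementTree, which does not support the sibling axes or count(), so the two node-sets are built with list slices instead. The names TABLE and rows_between are hypothetical, and header rows are assumed to be exactly the tr elements that carry a style attribute.

```python
import xml.etree.ElementTree as ET

TABLE = """<table>
<tr><th>Title</th><th>Name</th><th>Comment</th><th>Note</th></tr>
<tr style="background-color:#CCDDEF;"><td colspan="4"> <b>HEADER1</b></td></tr>
<tr><td>Title1.1</td><td>-</td><td>Info1.1</td><td></td></tr>
<tr style="background-color:#CCDDEF;"><td colspan="4"> <b>HEADER2</b></td></tr>
<tr><td>Title2.1</td><td>Name2.1</td><td></td><td></td></tr>
<tr><td>Title2.2</td><td>Name2.2</td><td>Info2.2</td><td></td></tr>
<tr style="background-color:#CCDDEF;"><td colspan="4"> <b>HEADER3</b></td></tr>
<tr><td>Title3.1</td><td>Name3.1</td><td></td><td></td></tr>
</table>"""

rows = list(ET.fromstring(TABLE))  # all tr children, in document order
headers = [i for i, tr in enumerate(rows) if tr.get("style") is not None]


def rows_between(k):
    """Rows between the k-th and (k+1)-th header (k is 1-based)."""
    ns1 = rows[headers[k - 1] + 1:]  # following siblings of header k
    ns2 = rows[:headers[k]]          # preceding siblings of header k+1
    ns2_ids = {id(tr) for tr in ns2}
    # Kayessian test: a node belongs to ns2 if adding it does not grow the set.
    return [tr for tr in ns1 if len(ns2_ids | {id(tr)}) == len(ns2_ids)]


first_cells = [[td.text for td in tr] for tr in rows_between(1)]
print(first_cells)
```

As with the XPath expression, this sketch has no (K+1)th header to use for the last group, so rows_between raises IndexError there and the trailing rows need separate handling.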

Related

Correct mrtg cfgmaker file

mrtg cfgmaker reads incorrect values over SNMP v1 and v2, so I need to correct the resulting file.
I would like to run a script after the file is created, using sed if possible.
The lines that need to be corrected in my case are for LAGs and normal ports:
MaxBytes[switch01_lag_26]: 125000000 should go to MaxBytes[switch01_lag_26]: 250000000
(switch01_lag_26 can be switch01_lag_1 until switch01_lag_26)
MaxBytes[switch01_g1]: 12500000 should go to MaxBytes[switch01_g1]: 125000000
(switch01_g1 can be switch01_g1 until switch01_g16)
What sed patterns do I have to use to check whether the name in the square brackets is a LAG or a port, and then replace the number after the colon?
The HTML part should show the correct speed too, if possible. This is the original for port g1:
<h1>Traffic Analysis for g1-- switch01</h1>
<div id="sysdetails">
<table>
<tr>
<td>System:</td>
<td>switch01</td>
</tr>
<tr>
<td>Maintainer:</td>
<td></td>
</tr>
<tr>
<td>Description:</td>
<td>1-Gigabit---Level </td>
</tr>
<tr>
<td>ifType:</td>
<td>ethernetCsmacd (6)</td>
</tr>
<tr>
<td>ifName:</td>
<td>g1</td>
</tr>
<tr>
<td>Max Speed:</td>
<td>12.5 MBytes/s</td>
</tr>
<tr>
<td>Ip:</td>
<td>No Ip (No DNS name)</td>
</tr>
</table>
</div>
and it should end up like this (the line below "Max Speed:" is changed):
<h1>Traffic Analysis for g1-- switch01</h1>
<div id="sysdetails">
<table>
<tr>
<td>System:</td>
<td>switch01</td>
</tr>
<tr>
<td>Maintainer:</td>
<td></td>
</tr>
<tr>
<td>Description:</td>
<td>1-Gigabit---Level </td>
</tr>
<tr>
<td>ifType:</td>
<td>ethernetCsmacd (6)</td>
</tr>
<tr>
<td>ifName:</td>
<td>g1</td>
</tr>
<tr>
<td>Max Speed:</td>
<td>125.0 MBytes/s</td>
</tr>
<tr>
<td>Ip:</td>
<td>No Ip (No DNS name)</td>
</tr>
</table>
</div>
This is the original for LAG 1:
<h1>Traffic Analysis for lag 1 -- switch01</h1>
<div id="sysdetails">
<table>
<tr>
<td>System:</td>
<td>switch01</td>
</tr>
<tr>
<td>Maintainer:</td>
<td></td>
</tr>
<tr>
<td>Description:</td>
<td>lag-1 </td>
</tr>
<tr>
<td>ifType:</td>
<td>IEEE 802.3ad Link Aggregate (161)</td>
</tr>
<tr>
<td>ifName:</td>
<td>lag 1</td>
</tr>
<tr>
<td>Max Speed:</td>
<td>125.0 MBytes/s</td>
</tr>
<tr>
<td>Ip:</td>
<td>No Ip (No DNS name)</td>
</tr>
</table>
</div>
which should end up like this (the line below "Max Speed:" is changed):
<h1>Traffic Analysis for lag 1 -- switch01</h1>
<div id="sysdetails">
<table>
<tr>
<td>System:</td>
<td>switch01</td>
</tr>
<tr>
<td>Maintainer:</td>
<td></td>
</tr>
<tr>
<td>Description:</td>
<td>lag-1 </td>
</tr>
<tr>
<td>ifType:</td>
<td>IEEE 802.3ad Link Aggregate (161)</td>
</tr>
<tr>
<td>ifName:</td>
<td>lag 1</td>
</tr>
<tr>
<td>Max Speed:</td>
<td>250.0 MBytes/s</td>
</tr>
<tr>
<td>Ip:</td>
<td>No Ip (No DNS name)</td>
</tr>
</table>
</div>
I can change all speeds in the HTML using sed -i 's/\([0-9.]\+\) MBytes/125.0 MBytes/' /switch01.cfg, but this changes the LAGs too. How can I detect whether the HTML part belongs to a LAG?
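One possible approach, sketched here in Python rather than sed, is to read the file line by line and remember which kind of interface the most recent <h1> heading announced. Everything below is an assumption based only on the examples in this question: the target values, the switch01_lag_N and switch01_gN naming, and that each HTML block starts with a "Traffic Analysis for ..." heading before its "Max Speed" cell. The function names are hypothetical.

```python
import re

# Assumed target values, copied from the examples in the question.
LAG_MAX_BYTES, PORT_MAX_BYTES = 250000000, 125000000
LAG_SPEED, PORT_SPEED = "250.0 MBytes/s", "125.0 MBytes/s"

MAXBYTES_RE = re.compile(r"^(MaxBytes\[switch01_(lag_|g)\d+\]:\s*)\d+")
H1_RE = re.compile(r"<h1>Traffic Analysis for (.*?)\s*--")
SPEED_RE = re.compile(r"[0-9.]+ MBytes/s")


def fix_maxbytes(line):
    """Replace the MaxBytes value, choosing it from the name in the brackets."""
    def repl(match):
        value = LAG_MAX_BYTES if match.group(2) == "lag_" else PORT_MAX_BYTES
        return match.group(1) + str(value)
    return MAXBYTES_RE.sub(repl, line)


def fix_config(lines):
    """Fix MaxBytes lines and 'N MBytes/s' cells in a sequence of config lines."""
    is_lag = False
    fixed = []
    for line in lines:
        heading = H1_RE.search(line)
        if heading:
            # The most recent <h1> tells whether the following HTML is for a LAG.
            is_lag = heading.group(1).lower().startswith("lag")
        line = fix_maxbytes(line)
        line = SPEED_RE.sub(LAG_SPEED if is_lag else PORT_SPEED, line)
        fixed.append(line)
    return fixed


sample = [
    "MaxBytes[switch01_g1]: 12500000",
    "<h1>Traffic Analysis for g1-- switch01</h1>",
    "<td>12.5 MBytes/s</td>",
    "MaxBytes[switch01_lag_26]: 125000000",
    "<h1>Traffic Analysis for lag 1 -- switch01</h1>",
    "<td>125.0 MBytes/s</td>",
]
for line in fix_config(sample):
    print(line)
```

Because the replacement values are fixed constants, running the sketch twice on the same file leaves it unchanged. The same state-tracking idea could be expressed in sed or awk.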

XPath syntax to make a query that exclude some specific element

You can find my test html page at https://sabbiobet.netsons.org/test.html
This is the html markup of the page:
<table border="1" class="class_table">
<tbody>
<tr class="class_tr">
<td class="class_td">&nbsp;</td>
</tr>
<tr class="class_tr">
<td class="class_td"><span class="class_span_ok"></span>square</td>
</tr>
<tr class="class_tr">
<td class="class_td"><span class="class_span_ko"></span>circle</td>
</tr>
<tr class="class_tr">
<td class="class_td"><span class="class_span_ok"></span>triangle</td>
</tr>
</tbody>
</table>
I need to obtain the text of every <td> with class="class_td", except those whose text is empty or that have a child <span> with class="class_span_ko".
In other words I want to obtain only these values:
Square
Triangle
Using the IMPORTXML function of Google Sheets, and following another user's suggestion, I tried:
//td[@class='class_td' and span and not(span[@class='class_ko'])]
but it works only if I put some text between <span> and </span>.
Without any text I get only an empty result.
Can somebody help me?
In case the provided non-well-formed document is corrected to a well-formed one by replacing the undefined entity &nbsp; with the equivalent character reference &#xA0;:
<table border="1" class="class_table">
<tbody>
<tr class="class_tr">
<td class="class_td">&#xA0;</td>
</tr>
<tr class="class_tr">
<td class="class_td"><span class="class_span_ok"></span>square</td>
</tr>
<tr class="class_tr">
<td class="class_td"><span class="class_span_ko"></span>circle</td>
</tr>
<tr class="class_tr">
<td class="class_td"><span class="class_span_ok"></span>triangle</td>
</tr>
</tbody>
</table>
then this XPath expression:
/*/*/*/td
   [@class='class_td'
  and not(span[@class='class_span_ko'])
  and normalize-space(translate(., '&#xA0;', ''))
   ]
when evaluated, selects exactly the wanted td elements:
<td class="class_td">
<span class="class_span_ok"/>square</td>
<td class="class_td">
<span class="class_span_ok"/>triangle</td>
XSLT - based verification
This transformation evaluates the above XPath expression and copies the selected elements to the output:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"/*/*/*/td
   [@class='class_td'
  and not(span[@class='class_span_ko'])
  and normalize-space(translate(., '&#xA0;', ''))
   ]"/>
</xsl:template>
</xsl:stylesheet>
The wanted, correct result is produced:
<td class="class_td">
<span class="class_span_ok"/>square</td>
<td class="class_td">
<span class="class_span_ok"/>triangle</td>
Note:
If only the string values of the wanted elements are needed, then the XPath expression can be:
/*/*/*/td
   [@class='class_td'
  and not(span[@class='class_span_ko'])
  and normalize-space(translate(., '&#xA0;', ''))
   ]/text()
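The three conditions of the predicate can also be checked step by step in Python. This is only a sketch with the standard xml.etree.ElementTree module, not the XPath itself: the name wanted_texts is hypothetical, and strip() only approximates normalize-space() (it trims the ends but does not collapse inner whitespace).

```python
import xml.etree.ElementTree as ET

HTML = """<table border="1" class="class_table">
<tbody>
<tr class="class_tr"><td class="class_td">&#xA0;</td></tr>
<tr class="class_tr"><td class="class_td"><span class="class_span_ok"></span>square</td></tr>
<tr class="class_tr"><td class="class_td"><span class="class_span_ko"></span>circle</td></tr>
<tr class="class_tr"><td class="class_td"><span class="class_span_ok"></span>triangle</td></tr>
</tbody>
</table>"""


def wanted_texts(root):
    """Texts of class_td cells that are non-empty and have no class_span_ko child."""
    texts = []
    for td in root.iter("td"):
        if td.get("class") != "class_td":
            continue
        if any(span.get("class") == "class_span_ko" for span in td.findall("span")):
            continue
        # String value of the cell with no-break spaces and outer whitespace removed.
        text = "".join(td.itertext()).replace("\u00a0", "").strip()
        if text:
            texts.append(text)
    return texts


print(wanted_texts(ET.fromstring(HTML)))
```

The first cell is dropped because its string value is empty once the no-break space is removed, which is the role translate() plays in the XPath expression.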

Using contains returns too many results

In the html below, I'm trying to get the two nodes that contain values for shipment_number, but instead I get 6 <td> nodes - why? Doesn't contains limit the nodes to only those that match the text value? If so the statement below should only return two, not six?
In Chrome dev console:
$x("//tr//td[contains(.,'shipment number')]/following::td[1]")
html:
<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title></title>
</head>
<body>
<table border="1">
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>Date</td>
<td>11/15/2019</td>
</tr>
<tr>
<td>shipment number</td>
<td>abc_123_florida-45</td>
</tr>
<tr>
<td>Departure time:</td>
<td>0430</td>
</tr>
</tbody>
</table>
</td>
<td>
<table>
<tbody>
<tr>
<td>Time arrival</td>
<td>1715</td>
</tr>
<tr>
<td>customer</td>
<td>bob smith</td>
</tr>
<tr>
<td>box type</td>
<td>square</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table border="1">
<tbody>
<tr>
<td>
<table>
<tbody>
<tr name="laneStop">
<td>box1</td>
<td>23.45</td>
<td>lane1</td>
<td>south</td>
</tr>
<tr name="laneStop">
<td>box2</td>
<td>17.14</td>
<td>lane1</td>
<td>south</td>
</tr>
<tr name="laneStop">
<td>box3</td>
<td>17.18</td>
<td>lane1</td>
<td>north</td>
</tr>
<tr name="laneStop">
<td>box2</td>
<td>199.14</td>
<td>lane1</td>
<td>west</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table border="1">
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>Date</td>
<td>11/16/2019</td>
</tr>
<tr>
<td>shipment number</td>
<td>abc_222_florida-35</td>
</tr>
<tr>
<td>Departure time:</td>
<td>0630</td>
</tr>
</tbody>
</table>
</td>
<td>
<table>
<tbody>
<tr>
<td>Time arrival</td>
<td>1715</td>
</tr>
<tr>
<td>customer</td>
<td>sue smith</td>
</tr>
<tr>
<td>box type</td>
<td>rect</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table border="1">
<tbody>
<tr>
<td>
<table>
<tbody>
<tr name="laneStop">
<td>box1</td>
<td>33.45</td>
<td>lane1</td>
<td>south</td>
</tr>
<tr name="laneStop">
<td>box2</td>
<td>1.14</td>
<td>lane1</td>
<td>south</td>
</tr>
<tr name="laneStop">
<td>box3</td>
<td>27.18</td>
<td>lane1</td>
<td>north</td>
</tr>
<tr name="laneStop">
<td>box2</td>
<td>299.14</td>
<td>lane1</td>
<td>west</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</body>
</html>
You need
//tr//td[contains(text(),'shipment number')]/following::td[1]
That's because contains(., '...') converts . to string by expanding all its text descendants, not just children.
I'm adding this answer because the text() node test might conflict with other requirements, mainly those dealing with inline markup.
The reason you are getting six td elements is that six td elements have "shipment number" as part of their string value (the concatenation of all descendant text nodes). That is because you have nested tables, and thus nested td elements. So you want a td element that has no descendant td element.
The expression:
//tr//td[not(.//td)][contains(.,'shipment number')]/following::td[1]
It selects:
<td>abc_123_florida-45</td>
<td>abc_222_florida-35</td>
Check in http://www.xpathtester.com/xpath/37bd889231ad68bb7bfa377433aeca00
Do note that your input sample has a default namespace declaration with the namespace URI http://www.w3.org/1999/xhtml. Because neither your code sample nor your selected answer uses namespaces, I assume you know how to work with them.
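The string-value effect is easy to reproduce on a cut-down version of the nested tables. The sketch below uses Python's standard xml.etree.ElementTree and plain list comprehensions in place of XPath; the reduced document DOC and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<table><tr><td>
  <table><tr><td>
    <table>
      <tr><td>Date</td><td>11/15/2019</td></tr>
      <tr><td>shipment number</td><td>abc_123_florida-45</td></tr>
    </table>
  </td></tr></table>
</td></tr></table>"""

root = ET.fromstring(DOC)
cells = list(root.iter("td"))

# contains(., ...) tests the string value: all descendant text concatenated.
by_string_value = [td for td in cells if "shipment number" in "".join(td.itertext())]
# Adding not(.//td) keeps only the cells that have no nested td.
innermost = [td for td in by_string_value if td.find(".//td") is None]

print(len(by_string_value), len(innermost))
```

With one shipment block nested three levels deep, three td elements match by string value and only one is innermost. The page in the question has two such blocks, which gives the six matches.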

Why XPath 3.0 works, but XQuery 3.0 doesn't work with the same expression

I ran an XPath expression in oXygen. With XPath 3.0 it finds what I need, but with XQuery 3.0 it finds nothing.
This is my XPath expression:
//table[tbody/tr/th/p[contains(text(), 'All Water System Contacts')]]/tbody/tr[3]/td[1]
This is an excerpt of my XML:
<table border="1" cellpadding="1" cellspacing="1" summary="." width="640">
<tbody>
<tr>
<th colspan="3">
<p>All Water System Contacts </p></th>
</tr>
<tr>
<th>Type</th>
<th>Contact</th>
<th>Communication</th>
</tr>
<tr>
<td align="center">AC - Administrative Contact - GENERAL MANAGER </td>
<td align="center">GRANT, JOHN, W <br/> PO BOX 869<br/> BIG SPRING, TX 79721-0869 </td>
<td align="center">
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse"
width="100%">
<tbody>
<tr>
<th><b>Electronic Type</b></th>
<th><b>Value</b></th>
</tr>
</tbody>
</table>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse"
width="100%">
<tbody>
<tr>
<th><b>Phone Type</b></th>
<th><b>Value</b></th>
</tr>
<tr>
<td align="center">BUS - Business</td>
<td align="center">432-267-6341 </td>
</tr>
<tr>
<td align="center">FAX - Facsimile</td>
<td align="center">432-267-3121 </td>
</tr>
<tr>
<td align="center">BUS - Business</td>
<td align="center">432-267-6070 </td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td align="center">OW - Owner </td>
<td align="center">COLORADO RIVER MUNICIPAL WATER DISTRICT <br/> PO BOX 869<br/> BIG
SPRING, TX 79721-0869 </td>
<td align="center"> </td>
</tr>
</tbody>
</table>
I tried different functions.
I don't know why it doesn't work or what the difference is.
Please help me.
I suspect your real, complete input has an XHTML default namespace declaration xmlns="http://www.w3.org/1999/xhtml" and in oXygen for XPath you have the setting enabled to "use the default namespace of the root element" so your path works with XPath out of the box while for XQuery you need to make sure you explicitly set
declare default element namespace 'http://www.w3.org/1999/xhtml';
in the prolog of your XQuery file or code sample.
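The effect of a default namespace on unqualified names is not specific to oXygen. As an illustration only, here is the same mismatch in Python's standard xml.etree.ElementTree; the tiny document DOC and the prefix x are assumptions, not taken from the asker's file.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
<body><table><tbody><tr><td>AC - Administrative Contact</td></tr></tbody></table></body>
</html>"""

root = ET.fromstring(DOC)

# Unqualified names match elements in no namespace, so nothing is found.
plain = root.findall(".//td")
# Qualifying the name with the XHTML namespace finds the cell.
qualified = root.findall(".//x:td", {"x": XHTML_NS})

print(len(plain), len(qualified))
```

Declaring the default element namespace in the XQuery prolog has the same effect as the prefix mapping here: it tells the processor which namespace the element names in the path refer to.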

Optimal XPath Query for processing the sample HTML fragment

I have a feed that outputs HTML. The following segment is part of the output
<div class="leftnav">
<table border="0" cols="2">
<tr>
<td colspan="2" class="topline"><span style="font-size: 1px"> </span></td>
</tr>
<tr>
<td colspan="2"><span class="bold">Article Cat1 </span></td>
</tr>
<tr>
<td class="date" colspan="2">
ArticleTitle1</td>
</tr>
<tr>
<td width="20"></td>
<td class="date">
ArticleLink1
</td>
</tr>
<tr>
<td colspan="2" class="topline"><span style="font-size: 1px"> </span></td>
</tr>
<tr>
<td colspan="2"><span class="bold">Article Cat2 </span></td>
</tr>
<tr>
<td class="date" colspan="2">
ArticleTitle2</td>
</tr>
<tr>
<td width="20"></td>
<td class="date">
ArticleLink2
</td>
</tr>
</table>
</div>
I want to process the above segment using XPath so that the output looks like this:
Article Cat1
ArticleTitle1
ArticleLink1
Article Cat2
ArticleTitle2
ArticleLink2
What is the optimal XPath that will produce the desired output? I tried //div[@class="leftnav"]/table/tr but this gives all the tr elements. I want to skip the first tr element so that I can get the output in the format I described above.
//div[@class="leftnav"]/table/tr[position() > 1]
Try the above
Stupid simple way:
substring-after(normalize-space(string(//*:div)), normalize-space(string(//*:div/*:table/*[1])))
Result: "Article Cat1 ArticleTitle1 ArticleLink1 nbsp Article Cat2 ArticleTitle2 ArticleLink2"
I don't know why, but (position() > 1) doesn't work in my environment, so I've used strings instead.
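If position() is unavailable or misbehaves in a given environment, the first row can also be skipped after selection. A minimal sketch with Python's standard xml.etree.ElementTree, assuming a shortened copy of the fragment (FRAGMENT is hypothetical) and dropping rows whose text is empty, such as the "topline" separator rows:

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<div class="leftnav">
<table border="0" cols="2">
<tr><td colspan="2" class="topline"><span style="font-size: 1px"> </span></td></tr>
<tr><td colspan="2"><span class="bold">Article Cat1 </span></td></tr>
<tr><td class="date" colspan="2">
ArticleTitle1</td></tr>
<tr><td width="20"></td><td class="date">
ArticleLink1
</td></tr>
<tr><td colspan="2" class="topline"><span style="font-size: 1px"> </span></td></tr>
<tr><td colspan="2"><span class="bold">Article Cat2 </span></td></tr>
</table>
</div>"""

div = ET.fromstring(FRAGMENT)
rows = div.findall("./table/tr")[1:]  # same rows as tr[position() > 1]
# Collapse whitespace in each row's text, like normalize-space().
texts = (" ".join("".join(tr.itertext()).split()) for tr in rows)
lines = [text for text in texts if text]  # drop empty separator rows
print(lines)
```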
