Selecting all rows under a particular column header (variable column headers) - xpath

I have a table like this:
<table>
<tr class="header">
<td>
Col1
</td>
<td>
Col2
</td>
<td>
Col3
</td>
<td>
Col4
</td>
</tr>
<tr class="rowOdd">
<td>
Value1
</td>
<td>
Value2
</td>
<td>
Value3
</td>
<td>
Value4
</td>
</tr>
<tr class="rowEven">
<td>
Value1
</td>
<td>
Value2
</td>
<td>
Value3
</td>
<td>
Value4
</td>
</tr>
</table>
Essentially, I want to select all the cells under designated column headers using XPath. The column headers are actually shirt sizes from "5XS" to "6XL". The catch is that the number of columns varies per table: some have 10 columns, others have 5 or fewer (i.e., fewer available shirt sizes). The header names themselves are fixed (item_name, 5xs, 4xs, and so on), but not every size is present in every table. So I need an XPath that searches for a particular header name (shirt size) and gets all the rows under that header. I have tried this expression and it seems to work:
*[count(../../tr/td[@class='header'][.='5xs']/preceding-sibling::*)+1]
However, if the header '5xs' is not present in the table I am currently processing, I get the values of column 1 (item_name) instead. Is there a way to get blanks, or to select nothing at all, when the header name I'm looking for is not present in the current table?
Thank you!

You can use the XPath function contains(). If your markup uses a different class name for each header row, like this:
<table>
<tr class="header_Name1">
<td>
Col1
</td>
</tr>
<tr>
<td>
Value1
</td>
</tr>
<tr class="header_Name2">
<td>
Col2
</td>
</tr>
<tr>
<td>
Value1
</td>
</tr>
</table>
The following:
//tr[contains(@class,'header_Name')]
Should return:
["\nCol1\n\n", "\nCol2\n\n"]
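On the original problem: the expression in the question falls back to the first column because count() over an empty node-set is 0, so the predicate becomes [1] when the header is missing. One way around that is to check that the header exists before using its position. Below is a minimal sketch in Python with the standard library's xml.etree.ElementTree, which does not support count() or preceding-sibling, so the index is computed in Python instead. The function name column_values and the sample data are made up for illustration, and it assumes well-formed markup with the header row marked class="header" as in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample table; the "5xs" column is deliberately absent.
TABLE = """
<table>
  <tr class="header"><td>item_name</td><td>4xs</td><td>3xs</td></tr>
  <tr class="rowOdd"><td>Shirt A</td><td>10</td><td>12</td></tr>
  <tr class="rowEven"><td>Shirt B</td><td>7</td><td>9</td></tr>
</table>
"""

def column_values(table, header_name):
    """Return the cell texts under header_name, or [] if the header is absent."""
    header_row = table.find("tr[@class='header']")
    if header_row is None:
        return []
    headers = [(td.text or "").strip() for td in header_row.findall("td")]
    if header_name not in headers:
        return []  # no silent fallback to the first column
    index = headers.index(header_name)
    values = []
    for row in table.findall("tr"):
        if row.get("class") == "header":
            continue  # skip the header row itself
        cells = row.findall("td")
        if index < len(cells):
            values.append((cells[index].text or "").strip())
    return values

table = ET.fromstring(TABLE)
print(column_values(table, "4xs"))  # header present
print(column_values(table, "5xs"))  # header absent, so nothing is selected
```

The same guard can be expressed in other XPath engines by testing that the header cell exists before comparing positions.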

Related

When I add a link to a table cell another table is created

I created a website using TYPO3 version 10.4.21. TYPO3 uses the CKEditor as rich text editor.
On one page I need to display many links because it's a summary of archive documents. For that I created a table inside the RTE, each cell contains the doc-name.
The problem is when I create a link on the doc-name it will automatically create another table inside the cell around the text. As my tables are styled it will mess up the whole design. How can I disable that?
I can confirm as NextThursday said that the error only occurs if there is only one word in the cell or if I select the whole cell.
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>Doc1</td>
</tr>
</tbody>
</table>
</td>
<td>Doc2</td>
<td>Doc3</td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
These are the only configurations I made to the CKEditor:
imports:
  - { resource: "EXT:rte_ckeditor/Configuration/RTE/Default.yaml" }
  - { resource: "EXT:rte_ckeditor_image/Configuration/RTE/Plugin.yaml" }
editor:
  externalPlugins:
    indent:
      resource: "EXT:layout/Resources/Public/RTE/Plugins/indent/plugin.js"
    indentblock:
      resource: "EXT:layout/Resources/Public/RTE/Plugins/indentblock/plugin.js"
  config:
    externalPlugins:
      - indent
      - indentblock
    removePlugins: null
processing:
  allowAttributes: [class, id, title, dir, lang, xml:lang, itemscope, itemtype, itemprop, style]

xpath that exclude some specific elements

This is a simplified version of the HTML of the page that I want to analyse:
<table class="class_1">
<tbody>
<tr class="class_2">
<td class="class_3"> </td>
<td class="class_4"> </td>
<td class="class_5"> </td>
</tr>
<tr class="class_2">
<td class="class_3"> </td>
<td class="class_4"> </td>
<td class="class_5"><span class="class_6"></span>square</td>
</tr>
<tr class="class_2">
<td class="class_3"> </td>
<td class="class_4"> </td>
<td class="class_5"><span class="class_7"></span>circle</td>
</tr>
<tr class="class_2">
<td class="class_3"> </td>
<td class="class_4"> </td>
<td class="class_5"><span class="class_6"></span>triangle</td>
</tr>
</tbody>
</table>
You can find the page at
https://sabbiobet.netsons.org/test.html
If you try this function in Google Sheets:
=IMPORTXML("https://sabbiobet.netsons.org/test.html";"//td[@class='class_5']")
you'll obtain:
square
circle
triangle
I need to obtain all the <td> elements with class="class_5", minus the ones that contain only &nbsp; or a <span class="class_7">.
In other words, I want to obtain only these values:
square
triangle
Can somebody help me?
The following XPath expression
//td[@class='class_5' and span and not(span[@class='class_7'])]
selects all td elements having an attribute class with value class_5, having a child element span and not having a child element span where its class attribute has the value class_7.
Note that you could also use
//td[@class='class_5' and span[@class='class_6']]
to get the same result in this case.
This should work:
//td[@class='class_5'][not(text()=' ')][not(./span[@class='class_7'])]
where [not(text()=' ')] is not testing for a regular space but for the non-breaking space character (Unicode U+00A0), which you can type on Windows with Alt+0160, entering the digits on the numeric keypad.
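For anyone who wants to try the filtering logic outside Google Sheets, here is a rough Python sketch using the standard library's xml.etree.ElementTree. It mirrors the "has a span, but no span with class_7" condition in Python rather than in a single XPath expression. The function name wanted_cells is made up, the markup is a trimmed copy of the question's table, and the empty cell is written as U+00A0 because the XML parser does not know the &nbsp; entity.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's table; \u00a0 stands in for &nbsp;.
HTML = """
<table class="class_1">
  <tr class="class_2"><td class="class_5">\u00a0</td></tr>
  <tr class="class_2"><td class="class_5"><span class="class_6"></span>square</td></tr>
  <tr class="class_2"><td class="class_5"><span class="class_7"></span>circle</td></tr>
  <tr class="class_2"><td class="class_5"><span class="class_6"></span>triangle</td></tr>
</table>
"""

def wanted_cells(root):
    """Texts of td.class_5 cells that have a span child, none of them class_7."""
    result = []
    for td in root.findall(".//td[@class='class_5']"):
        spans = td.findall("span")
        # Same idea as: span and not(span[@class='class_7'])
        if spans and all(s.get("class") != "class_7" for s in spans):
            result.append("".join(td.itertext()).strip())
    return result

root = ET.fromstring(HTML)
print(wanted_cells(root))
```

The cell holding only the non-breaking space is dropped because it has no span child, so no separate whitespace test is needed in this version.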

Xpath help - select a node, then child nodes

I'm stumped on why and how to do this query.
My html structure is like this (tables nested inside tables):
<root>
<table>
</table>
<table>
<tr>
<td>
<table>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
</table>
</td>
</tr>
</table>
</root>
If I start out my xpath like:
var tables = blah.SelectNodes("//table");
which returns me the 3 parent tables, then I want to select the td's from the 2nd tr like this:
var td = tables[2].SelectNodes("//tr[2]/td");
But, when I do this, it goes back to the parent/root, the "blah" level. Why is this, and how can I keep filtering my search results down?
Note: The example xml structure may not directly match the queries written, just trying to give a general idea...
Just keep extending the XPath.
This one returns the <tr> items (four of them) of the nested table:
/root/table/tr/td/table/tr
This one returns the second <tr> item:
/root/table/tr/td/table/tr[2]
Your best bet, though, is to give individual id attributes to each table, so that you can find it directly using that attribute.
Using something like this:
<root>
<table id="1">
</table>
<table id="2">
<tr>
<td>
<table id="3">
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
</table>
</td>
</tr>
</table>
</root>
You can get the items in the innermost table with:
//table[@id="3"]
You can get an individual <td> item from that innermost table with:
//table/tr/td/table/tr[2]/td[1]
Assigning an id attribute makes it a little easier (note missing /tr/td items after the first table):
//table[@id="3"]/tr[2]/td[1]
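As for why the second query "goes back to the root": in XPath, a path that starts with // is absolute, so it is evaluated from the document root no matter which node you call it on. Prefixing it with a dot makes it relative to the current node, so in the question's code tables[2].SelectNodes(".//tr[2]/td") should stay inside that table. The sketch below shows the same idea in Python with the standard library's xml.etree.ElementTree, which only accepts relative paths on an element anyway. The cell texts (a1, b1, ...) are made up so the result is visible.

```python
import xml.etree.ElementTree as ET

# Same shape as the answer's example, with made-up cell texts.
XML = """
<root>
  <table id="1"></table>
  <table id="2">
    <tr><td>
      <table id="3">
        <tr><td>a1</td><td>a2</td></tr>
        <tr><td>b1</td><td>b2</td></tr>
      </table>
    </td></tr>
  </table>
</root>
"""

root = ET.fromstring(XML)

# All tables in document order: ids 1, 2, 3.
tables = root.findall(".//table")
table_ids = [t.get("id") for t in tables]

# Search relative to the third table only: second row, all of its cells.
inner = tables[2]
texts = [td.text for td in inner.findall("./tr[2]/td")]

print(table_ids, texts)
```

Each further query starts from the node returned by the previous one, which is how the results keep narrowing.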

How to combine XPATH "select //td/text() or //td/p/text()" into one sentence?

I want to select text() within each row in the following HTML. However, the text I want is either in the td element or the p element, so I have to write two statements to ensure each row is selected.
How do I combine the two statements into one?
XPATH:
//table/tr/td[not(p)]/text() | //table/tr/td/p/text()
With the result desired:
['1', '2', '3', '4']
Original html:
<table>
<tr>
<td>1</td>
</tr>
<tr>
<td>
<p>2
</td>
</tr>
<tr>
<td>3</td>
</tr>
<tr>
<td>
<p>4
</td>
</tr>
</table>
Probably you want something like this:
//table/tbody/tr/td//text()[normalize-space()]
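This finds every text node that is not whitespace-only, at any depth under each td. The tbody step is there because browsers insert a tbody element when parsing the table; drop it if your parser does not.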
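If you are not using a full XPath engine, the same "all non-blank text under each td" idea can be sketched with Python's standard xml.etree.ElementTree, which cannot select text() nodes but offers itertext(). This is only an approximation of the expression above: the function name cell_texts is made up, and the <p> tags are closed here because this parser needs well-formed XML.

```python
import xml.etree.ElementTree as ET

# The question's table, with the <p> tags closed so it parses as XML.
HTML = """
<table>
  <tr><td>1</td></tr>
  <tr><td>
    <p>2</p>
  </td></tr>
  <tr><td>3</td></tr>
  <tr><td>
    <p>4</p>
  </td></tr>
</table>
"""

def cell_texts(root):
    """Non-blank text pieces found at any depth inside each td, in document order."""
    return [
        text.strip()
        for td in root.iter("td")
        for text in td.itertext()
        if text.strip()  # skip whitespace-only pieces, like normalize-space()
    ]

root = ET.fromstring(HTML)
print(cell_texts(root))
```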

tfooter doesn't validate for xhtml?

I had my webpage validated for xhtml transitional till I added this table (see below). Since then it doesn't validate and says "
document type does not allow element "tfoot" here <tfoot>
The element named above was found in a context where it is not allowed. This could mean that you have incorrectly nested elements -- such as a "style" element in the "body" section instead of inside "head" -- or two elements that overlap (which is not allowed).
One common cause for this error is the use of XHTML syntax in HTML documents. Due to HTML's rules of implicitly closed elements, this error can create cascading effects. For instance, using XHTML's "self-closing" tags for "meta" and "link" in the "head" section of a HTML document may cause the parser to infer the end of the "head" section and the beginning of the "body" section (where "link" and "meta" are not allowed; hence the reported error)."
Any ideas as what is happening? I checked for any opened and not closed tags but did not find any so I don't know what else is wrong.
<table>
<caption>
My first table, Anna
</caption>
<thead>
<tr>
<th>
June
</th>
<th>
July
</th>
<th>
August
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
Data 1
</td>
<td>
Data 2
</td>
<td>
Data 3
</td>
<td>
Data 4
</td>
</tr>
<tr>
<td>
Data a
</td>
<td>
Date b
</td>
<td>
Data c
</td>
<td>
Data d
</td>
</tr>
<tfoot>
<tr>
<td>
Result1
</td>
</tr>
</tfoot>
</tbody>
</table>
You've got the <tfoot> at the end of the table. It should be between the <thead> and the <tbody>. It will appear at the bottom, but it's coded at the top. One of the original ideas is that as a large table loaded, the heading and footer would be visible quickly, with the rest filling in (esp. useful if the body was scrollable between them). It hasn't quite worked out like that in practice, but it does make more sense if you know that.
In the DTD it lists:
<!ELEMENT table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
That is, optional caption, then zero-or-more col or colgroup, then optional thead, then optional tfoot, then at least one tbody or tr.
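To make that ordering rule concrete, here is a rough Python sketch that checks a table's direct children against the content model quoted above, written as a regular expression over tag names. It also checks that thead, tfoot and tbody contain only tr elements, which is what the XHTML 1.0 DTD requires of them and is what the table in the question violates. The function name and the stripped-down sample tables are made up, and this is nowhere near a full validator.

```python
import re
import xml.etree.ElementTree as ET

# The DTD's child order for <table>, as a regex over "tag " tokens.
CONTENT_MODEL = re.compile(
    r"^(caption )?((col )*|(colgroup )*)(thead )?(tfoot )?((tbody )+|(tr )+)$"
)
SECTION_TAGS = {"thead", "tfoot", "tbody"}

def follows_xhtml1_table_model(table):
    """True if the table's children follow the XHTML 1.0 order quoted above."""
    sequence = "".join(child.tag + " " for child in table)
    if not CONTENT_MODEL.match(sequence):
        return False
    for section in table:
        if section.tag in SECTION_TAGS:
            # thead, tfoot and tbody may hold only tr elements, at least one
            if len(section) == 0 or any(c.tag != "tr" for c in section):
                return False
    return True

# tfoot nested inside tbody, as in the question
nested = ET.fromstring(
    "<table><thead><tr/></thead><tbody><tr/><tfoot><tr/></tfoot></tbody></table>"
)
# tfoot between thead and tbody
ordered = ET.fromstring(
    "<table><thead><tr/></thead><tfoot><tr/></tfoot><tbody><tr/></tbody></table>"
)
# tfoot after tbody
trailing = ET.fromstring(
    "<table><thead><tr/></thead><tbody><tr/></tbody><tfoot><tr/></tfoot></table>"
)

for name, table in [("nested", nested), ("ordered", ordered), ("trailing", trailing)]:
    print(name, follows_xhtml1_table_model(table))
```

Only the table with tfoot between thead and tbody passes this XHTML 1.0 check.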
UPDATE: Note that HTML 5 now allows one to put the <tfoot> at the end of the table, instead of before the first <tbody> (or the first <tr> that isn't in a <thead>, <tfoot> or <tbody> and hence in a single implicit <tbody>). As such the code in the question would now be considered valid. The older approach is also still valid, and probably advisable.
The tfoot element should be outside of the tbody element, like this:
<table>
<caption>
My first table, Anna
</caption>
<thead>
<tr>
<th>
June
</th>
<th>
July
</th>
<th>
August
</th>
</tr>
</thead>
<tfoot>
<tr>
<td>
Result1
</td>
</tr>
</tfoot>
<tbody>
<tr>
<td>
Data 1
</td>
<td>
Data 2
</td>
<td>
Data 3
</td>
<td>
Data 4
</td>
</tr>
<tr>
<td>
Data a
</td>
<td>
Date b
</td>
<td>
Data c
</td>
<td>
Data d
</td>
</tr>
</tbody>
</table>
Here is a small example of the correct nesting for those who need it.
<table>
<caption></caption>
<thead>
<tr>
<th></th>
</tr>
</thead>
<tfoot>
<tr>
<td></td>
</tr>
</tfoot>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>

Resources