How do I get the <td> text OR a link from 'a/href' inside the <td> if the td has a link in it - xpath

Using IMPORTXML in Google Sheets to get the links from a table/td.
Page Source: (Some of the 'td' contain text with no links, whereas some 'td' have links but no text.)
<tr>
<td>Cameroon - Brazil</td>
</tr>
<tr>
<td>Netherlands - USA</td>
</tr>
<tr>
<td>Winner Group C vs Australia</td>
</tr>
<tr>
<td>France vs Runner-up Group C</td>
</tr>
<tr>
<td>England - Senegal</td>
</tr>
<tr>
<td>Winner Group E vs Runner-up Group F</td>
</tr>
=IMPORTXML(B3,"//table//tr/td/a/@href")
CURRENT OUTPUT: (This extracts only the 'td' that have a/href and omits those that do not.)
1. /match/cameroon-brazil-2022-12-02
2. /match/netherlands-usa-2022-12-03
3. /match/england-senegal-2022-12-04
EXPECTED OUTPUT: (ignore the numbers (#1 to #6) in the output)
Include empty rows (when 'td' doesn't have a/href)
1. /match/cameroon-brazil-2022-12-02
2. /match/netherlands-usa-2022-12-03
3.
4.
5. /match/england-senegal-2022-12-04
6.
OR Include 'td' text (when 'td' doesn't have a/href)
1. /match/cameroon-brazil-2022-12-02
2. /match/netherlands-usa-2022-12-03
3. Winner Group C vs Australia
4. France vs Runner-up Group C
5. /match/england-senegal-2022-12-04
6. Winner Group E vs Runner-up Group F

The second version can be achieved with this XPath-1.0 expression:
//table/tr/td[not(a)] | //table/tr/td/a/@href
It merges the values of <td>s which have no <a> children with the href attributes of the <td>s that have one.
Its output is
/match/cameroon-brazil-2022-12-02
/match/netherlands-usa-2022-12-03
Winner Group C vs Australia
France vs Runner-up Group C
/match/england-senegal-2022-12-04
Winner Group E vs Runner-up Group F
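Outside Google Sheets, the same "href if present, otherwise text" rule can be sketched in Python. This is only an illustration under assumptions: the anchor markup in SAMPLE is reconstructed from the expected output above (the posted page source does not show the links), the function name href_or_text is hypothetical, and the standard-library ElementTree is used, which does not support not() or the union operator, so the fallback is done per cell in Python.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the <a href> elements are reconstructed from the expected output.
SAMPLE = """
<table>
  <tr><td><a href="/match/cameroon-brazil-2022-12-02">Cameroon - Brazil</a></td></tr>
  <tr><td>Winner Group C vs Australia</td></tr>
  <tr><td><a href="/match/england-senegal-2022-12-04">England - Senegal</a></td></tr>
</table>
"""

def href_or_text(table_xml):
    """Return one value per <td>: the link target if the cell has a link, else its text."""
    root = ET.fromstring(table_xml)
    values = []
    for td in root.iter("td"):          # cells in document order
        link = td.find("a")             # direct <a> child, if any
        if link is not None:
            values.append(link.get("href", ""))
        else:
            values.append((td.text or "").strip())
    return values

print(href_or_text(SAMPLE))
```

Because every cell contributes exactly one value, the rows stay aligned with the table, which is what the "include empty rows" variant of the question also needs (append "" instead of the text).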

Related

correct way to scrape this table (using scrapy / xpath)

Given a table (unknown number of <tr> but always three <td>, and sometimes containing a strikethrough (<s>) of the first element which should be captured as additional item (with value 0 or 1))
<table id="my_id">
<tr>
<td>A1</td>
<td>A2</td>
<td>A3</td>
</tr>
<tr>
<td><s>B1</s></td>
<td>B2</td>
<td>B3</td>
</tr>
...
</table>
Where scraping should yield [[A1,A2,A3,0],[B1,B2,B3,1], ...], I currently try along those lines:
my_xpath = response.xpath("//table[@id='my_id']")
for my_cell in my_xpath.xpath(".//tr"):
    print('record 0:', my_cell.xpath(".//td")[0])
    print('record 1:', my_cell.xpath(".//td")[1])
    print('record 2:', my_cell.xpath(".//td")[2])
In principle it works (e.g. by adding a pipeline after add_xpath()), but I am sure there is a more natural and elegant way to do this.
Try contains:
my_xpath = response.xpath("//table[contains(@id, 'my_id')]").getall()
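One way to build the [[A1,A2,A3,0],[B1,B2,B3,1], ...] records is to handle each row as a unit: collect the text of its cells and add a flag for whether the first cell contains an <s>. The sketch below is an assumption-laden illustration, not Scrapy code: it uses the standard-library ElementTree so that it runs on its own, and parse_rows is a hypothetical name. The same per-row logic should carry over to Scrapy selectors.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<table id="my_id">
  <tr><td>A1</td><td>A2</td><td>A3</td></tr>
  <tr><td><s>B1</s></td><td>B2</td><td>B3</td></tr>
</table>
"""

def parse_rows(table_xml):
    """Return one list per <tr>: the cell texts plus 1 if the first cell is struck through, else 0."""
    table = ET.fromstring(table_xml)
    records = []
    for row in table.findall("tr"):
        cells = row.findall("td")
        # itertext() also picks up text nested inside <s>
        texts = ["".join(cell.itertext()).strip() for cell in cells]
        struck = 1 if cells and cells[0].find("s") is not None else 0
        records.append(texts + [struck])
    return records

print(parse_rows(SAMPLE))
```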

Print Checkboxes list vertically

How can I print the checkbox list vertically?
Currently, it is printing horizontally as shown below.
Subject 1 Subject 2 Subject 3 Subject 4 Subject 5
I want it this way
Subject 2
Subject 4
Subject 5
Here is the code I am using to print it.
<tr>
<td>
<form:checkboxes items="${favouriteList}" path="subjects" ></form:checkboxes>
</td>
</tr>

XPath: Find first occurrence in children and siblings

So I have some HTML that looks like this:
<tr class="a">
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>....</td>
<td class="b">A</td>
</tr>
<tr>....</tr>
<tr class="a">
<td class="b">B</td>
<td>....</td>
</tr>
<tr>
<td class="b">Not this</td>
<td>....</td>
</tr>
I basically want to find the first instance of td class b following a tr with a class of a. The problem is that it could be either in a child of the tr or in the next tr after it.
I can get the second case with:
//tr[@class="a"]//td[@class="b"]
But that misses the first case, because the TD is in a sibling, not a direct descendant. Ideas?
For the 2nd case (the td is a descendant of the tr):
//tr[@class="a"]//td[@class="b"][1]
For the 1st case (the td follows the tr), excluding rows that already fall under the 2nd case:
//tr[@class="a" and not(.//td[@class="b"])]/following::td[@class="b"][1]
Combining the two XPath queries with the union operator (|) yields the expected output:
//tr[@class="a"]//td[@class="b"][1] | //tr[@class="a" and not(.//td[@class="b"])]/following::td[@class="b"][1]
output :
Element='<td class="b">A</td>'
Element='<td class="b">B</td>'
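The idea behind the union can also be checked procedurally. Since descendants of a tr come before everything that follows it in document order, "first td.b inside the tr, otherwise the first one after it" amounts to "first td.b after the tr in document order". The sketch below emulates that in Python with the standard-library ElementTree, which lacks the following:: axis. The wrapping <table> element and the name first_b_after_each_a are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# The question's rows, wrapped in an assumed <table> root so the snippet parses.
SAMPLE = """
<table>
  <tr class="a"><td>...</td><td>...</td></tr>
  <tr><td>....</td><td class="b">A</td></tr>
  <tr><td>....</td></tr>
  <tr class="a"><td class="b">B</td><td>....</td></tr>
  <tr><td class="b">Not this</td><td>....</td></tr>
</table>
"""

def first_b_after_each_a(table_xml):
    """For each tr.a, return the text of the first td.b that comes after it in document order."""
    root = ET.fromstring(table_xml)
    nodes = list(root.iter())  # all elements in document order
    found = []
    for index, node in enumerate(nodes):
        if node.tag == "tr" and node.get("class") == "a":
            for later in nodes[index + 1:]:
                if later.tag == "td" and later.get("class") == "b":
                    found.append(later.text)
                    break
    return found

print(first_b_after_each_a(SAMPLE))
```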

Find all preceding sibling nodes until one is found with a specific child node attribute

I would like to get all table rows after a specific row identifier (an attribute on the row column) until that specific row identifier is found.
Here is the html I'm trying to parse:
<tr>
<td colspan="4">
<h3>Header 1</h3>
</td>
</tr>
<tr>
<td>Item desc - Header 1</td>
<td>more info</td>
<td>30</td>
<td>500</td>
</tr>
<tr>
<td colspan="4">
<h3>Header 2</h3>
</td>
</tr>
<tr>
<td>Item desc - header 2</td>
<td>other</td>
<td>4</td>
<td>49</td>
</tr>
<tr>
<td>Item 2 desc - header 2</td>
<td>other 2</td>
<td>65</td>
<td>87</td>
</tr>
I want to be able to grab the item under header 1 and stop when it finds header 2; then the items under header 2 and stop when it finds a header 3; etc.
Is this possible under xpath? I can't get it to only find the TR nodes until it finds a child node with a specific attribute (of colspan="4").
This is not possible under XPath 1.0. You somehow have to fixate the header tr, because you are trying to find all its following siblings whose first preceding header tr is the original one. Without the reference to the original header, everything is possible. But you probably work in some kind of a language that you can use to remember the value.
For example, in xsh:
for my $x in //tr[td/@colspan="4"] {
    echo ($x/td/h3) ;
    for $x/following-sibling::tr[count(td)=4
            and preceding-sibling::tr[count(td)=1][1]=$x]
        echo " " (td) ;
}
Output:
Header 1
Item desc - Header 1 more info 30 500
Header 2
Item desc - header 2 other 4 49
Item 2 desc - header 2 other 2 65 87
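The same "remember the current header in the host language" approach can be sketched in Python. This is a minimal illustration under assumptions: it uses the standard-library ElementTree instead of xsh, wraps the rows in an assumed <table> root, and group_rows_by_header is a hypothetical name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<table>
  <tr><td colspan="4"><h3>Header 1</h3></td></tr>
  <tr><td>Item desc - Header 1</td><td>more info</td><td>30</td><td>500</td></tr>
  <tr><td colspan="4"><h3>Header 2</h3></td></tr>
  <tr><td>Item desc - header 2</td><td>other</td><td>4</td><td>49</td></tr>
  <tr><td>Item 2 desc - header 2</td><td>other 2</td><td>65</td><td>87</td></tr>
</table>
"""

def group_rows_by_header(table_xml):
    """Map each header text to the list of item rows that follow it, up to the next header."""
    root = ET.fromstring(table_xml)
    groups = {}
    current = None  # header the following item rows belong to
    for row in root.findall("tr"):
        heading = row.find("td/h3")
        if heading is not None:
            current = (heading.text or "").strip()
            groups[current] = []
        elif current is not None:
            groups[current].append([(td.text or "").strip() for td in row.findall("td")])
    return groups

for header, rows in group_rows_by_header(SAMPLE).items():
    print(header)
    for cells in rows:
        print("  " + " ".join(cells))
```

A single pass over the rows is enough here because each item row belongs to the most recent header row seen.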
This might give you what you're looking for, not the most orthodox means though:
//*/tr/td[not(child::h3)]/ancestor::tr
This will give you every <tr> that contains a <td> without an <h3> child, i.e. the rows that aren't header blocks.
And you can specify the header with:
//*/tr/td[not(child::h3/text()='Header 1')]/ancestor::tr
Or a more general:
//*/tr/td[not(child::h3[contains(text(),'Header')])]/ancestor::tr

xpath expression to find url and data

I want to get the values of every table and the href value for every link within the table given below.
Being new to XPath, I am finding it difficult to write an XPath expression.
Understanding what an existing XPath expression does, however, is somewhat easier.
The expected output:
http://a.com/ data for a 526735 Z
http://b.com/ data for b 522273 Z
http://c.com/ data for c 513335 Z
<table class = dataTabe>
<tbody>
<tr>
<td>data for a</td>
<td class="numericalColumn">526735</td>
<td class="numericalColumn">Z</td></tr>
<tr>
<td>data for b</td>
<td class="numericalColumn">522273</td>
<td class="numericalColumn">B</td></tr>
<tr>
<td>data for c</td>
<td class="numericalColumn">513335</td>
<td class="numericalColumn">B</td></tr>
</tbody>
</table>
You'll need two things: an XPath query which locates the wanted nodes and a second which outputs the text as you want it. Since you don't give more information about the languages you're using I'm putting together some pseudocode:
foreach node in document.select("//table[@class='dataTable']//tr[td/a/@HREF]")
    write node.select("concat(td/a/@HREF,' ',.)")
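As a rough Python rendering of that pseudocode, the sketch below selects the rows that contain a link and joins the link target with the row's cell texts. Several things here are assumptions: the posted table does not show the links, so SAMPLE places a hypothetical <a href> in the first cell of each row; the attribute is written in lowercase; rows_with_links is a made-up name; and the standard-library ElementTree stands in for whatever XPath engine you actually use.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the position and spelling of the <a href> elements are guesses.
SAMPLE = """
<table class="dataTable">
  <tbody>
    <tr><td><a href="http://a.com/">data for a</a></td>
        <td class="numericalColumn">526735</td><td class="numericalColumn">Z</td></tr>
    <tr><td><a href="http://b.com/">data for b</a></td>
        <td class="numericalColumn">522273</td><td class="numericalColumn">B</td></tr>
  </tbody>
</table>
"""

def rows_with_links(table_xml):
    """Return one 'href cell1 cell2 ...' string per row that contains a link."""
    root = ET.fromstring(table_xml)
    lines = []
    for row in root.iter("tr"):
        link = row.find("td/a")
        if link is None or link.get("href") is None:
            continue  # skip rows without a link, like tr[td/a/@HREF] does
        cell_texts = ["".join(td.itertext()).strip() for td in row.findall("td")]
        lines.append(" ".join([link.get("href")] + cell_texts))
    return lines

for line in rows_with_links(SAMPLE):
    print(line)
```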
This site has a great free tool for building XPath Expressions (XPath Builder):
http://www.bubasoft.net/
Use this XPath: //tr/td/a/@HREF | //tr//text()
