XPath: Wildcards for descendant nodes not working

Desired output: 3333
<tbody>
<tr>
<td class="name">
<p class="desc">Intel</p>
</td>
</tr>
Other tr tags
<tr>
<td class="tel">
<p class="desc">3333</p>
</td>
</tr>
</tbody>
I want to select the last tr tag after the tr tag that has "Intel" in its p tag:
//tbody//tr[td[p[contains(text(),'Intel')]]]/following-sibling::tr[position()=last()]//p/text()
The above works, but I don't want to reference td and p explicitly. I tried the wildcards ? and *, but they don't work:
//tbody//tr[?[?[contains(text(),'Intel')]]]/following-sibling::tr[position()=last()]//p/text()

To select the last tr after the tr "...which contains a text node equal to 'Intel'":
//tbody/tr[.//text() = 'Intel']/following-sibling::tr[last()]/td/p/text()
Or, for the tr "...which contains only the string 'Intel', once you remove all insignificant white-space":
//tbody/tr[normalize-space() = 'Intel']/following-sibling::tr[last()]/td/p/text()
The key take-away is that you can use descendant paths (//) inside predicates, as long as you make them relative to the context node (.//). Note also that ? is not an XPath wildcard; * is the wildcard for element names.
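As a rough cross-check in Python, here is a minimal sketch of the logic behind the second expression, using only the standard library. xml.etree.ElementTree implements just a subset of XPath (no following-sibling axis and no normalize-space()), so those two steps are done in plain Python here. The middle row stands in for the "Other tr tags" placeholder, and the helper name last_row_after is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question; the middle row is an assumed stand-in
# for the "Other tr tags" placeholder.
HTML = """<tbody>
<tr><td class="name"><p class="desc">Intel</p></td></tr>
<tr><td class="other"><p class="desc">skip me</p></td></tr>
<tr><td class="tel"><p class="desc">3333</p></td></tr>
</tbody>"""


def last_row_after(tbody, needle):
    """Return the last sibling row after the first row whose text is `needle`."""
    rows = tbody.findall("tr")
    for i, row in enumerate(rows):
        # Rough equivalent of normalize-space(): join all descendant text
        # and collapse runs of whitespace.
        text = " ".join("".join(row.itertext()).split())
        if text == needle:
            following = rows[i + 1:]
            return following[-1] if following else None
    return None


tbody = ET.fromstring(HTML)
row = last_row_after(tbody, "Intel")
result = row.findtext(".//p")
print(result)
```

Note that neither td nor p is named when locating the "Intel" row, which is the point of the answer. With a full XPath 1.0 engine the expressions above can be used directly instead.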

Related

Html Agility Pack search all nodes and save them

I need to search a whole website for entries containing "00:00-00:01" and replace that part with "", as shown below.
<td id="tb"> Fr, 3.Sep.2021 00:00-00:01 </td>...<td id="tb"> Fr,3.Sep.2021 </td>
or
<td class="tbda">Fr, 3.Sep.2021 00:00-00:01</td>...<td class="tbda">Fr, 3.Sep.2021 </td>
or
<b>Fr, 3.Sep.2021 00:00-00:01</b>...<b>Fr, 3.Sep.2021</b>
A single one is no problem, but how can I find all of them, and how can I save the path to each match?
One way is to use regex:
re.findall(r'<td\s+id="tb">\s*(\w+,\s*\d+\.\w+\.2021\s+\d{2}:\d{2}-\d{2}:\d{2})\s*</td>', text)
If you also need to know how and where each match was found, find all the matching tags first, then the content between them, and save each match together with its enclosing HTML tags, like below:
<div>
<tr> # this is the start tag
<td id="tb">Fr, 3.Sep.2021 00:00-00:01</td> # this is the content
</tr> # this is the end tag
... more tr ...
</div>
The idea can be found in "How to convert an XML file to nice pandas dataframe?".
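For the replacement itself, a minimal Python sketch could look like the following. It treats the page as plain text, which is an assumption that only holds if the time range never appears where it should be kept; an HTML parser such as Html Agility Pack is the more robust choice when the node path is needed. The variable names are illustrative only.

```python
import re

# The three sample fragments from the question, joined into one string.
html = (
    '<td id="tb"> Fr, 3.Sep.2021 00:00-00:01 </td>'
    '<td class="tbda">Fr, 3.Sep.2021 00:00-00:01</td>'
    '<b>Fr, 3.Sep.2021 00:00-00:01</b>'
)

# The time range plus any whitespace directly in front of it.
TIME_RANGE = re.compile(r"\s*00:00-00:01")

# Record where each match was found: (character offset, matched text).
found = [(m.start(), m.group()) for m in TIME_RANGE.finditer(html)]

# Remove every occurrence, regardless of the enclosing tag.
cleaned = TIME_RANGE.sub("", html)

print(found)
print(cleaned)
```

The offsets in found give a crude "where" for each hit; they are not a substitute for a real node path.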

XPath: Getting a node by attribute value of subnode

Could you please help me with this XPath? Let's say I have the following HTML code:
<table>
<tr>
<td class="clickable">text</td>
<td>value1</td>
</tr>
<tr>
<td>value2</td>
<td>text</td>
</tr>
</table>
I need to build an XPath that will pick the <tr> that has a <td> with the value text AND a class attribute equal to clickable.
I tried the following XPath expressions:
//tr[contains(.,'text')][contains(./td/@class,'clickable')]
//tr[contains(.,'text')][contains(td/@class,'clickable')]
but neither of them worked.
Any help is appreciated
Thanks
You are almost there:
//tr[contains(td/@class,'clickable') and contains(td, 'text')]
Demo using xmllint:
$ xmllint input.xml --xpath "//tr[contains(td/@class,'clickable') and contains(td, 'text')]"
<tr>
<td class="clickable">text</td>
<td>value1</td>
</tr>
If you want a tr with a td whose value is text and a td (possibly a different one) whose class attribute equals clickable, use the answer by @alecxe.
If a single td must satisfy both conditions, then use:
//tr[td[.='text' and @class='clickable']]
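A small Python sketch of the "one td, two conditions" case, using the standard library only. xml.etree.ElementTree does not support `and` or nested predicates, so the same test is expressed here as two chained predicates on the td step; that rewrite is my assumption of an equivalent form, not part of the answer above.

```python
import xml.etree.ElementTree as ET

# Sample table from the question.
HTML = """<table>
<tr><td class="clickable">text</td><td>value1</td></tr>
<tr><td>value2</td><td>text</td></tr>
</table>"""

table = ET.fromstring(HTML)

# td[@class='clickable'][.='text'] requires both conditions on the same td.
# (The [.='text'] predicate needs Python 3.7 or later.)
rows = [
    tr for tr in table.iter("tr")
    if tr.find("td[@class='clickable'][.='text']") is not None
]

# Identify the matched rows by the text of their second cell.
second_cells = [tr.findall("td")[1].text for tr in rows]
print(second_cells)
```

Only the first row is kept: the second row also contains a td with the value text, but that td has no class attribute.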

XPath: Find first occurrence in children and siblings

So I have some HTML that looks like this:
<tr class="a">
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>....</td>
<td class="b">A</td>
</tr>
<tr>....</tr>
<tr class="a">
<td class="b">B</td>
<td>....</td>
</tr>
<tr>
<td class="b">Not this</td>
<td>....</td>
</tr>
I basically want to find the first instance of a td with class b following a tr with class a. The problem is that it could be either in a child of that tr or in the next tr after it.
I can get the second case with:
//tr[@class="a"]//td[@class="b"]
But that misses the first case, because the td is in a sibling row, not a descendant. Ideas?
For the 2nd case (the td is a descendant of the tr):
//tr[@class="a"]//td[@class="b"][1]
For the 1st case (the td follows the tr), excluding rows already covered by the 2nd case:
//tr[@class="a" and not(.//td[@class="b"])]/following::td[@class="b"][1]
Combining the two XPath queries with the union operator (|) yields the expected output:
//tr[@class="a"]//td[@class="b"][1] | //tr[@class="a" and not(.//td[@class="b"])]/following::td[@class="b"][1]
Output:
Element='<td class="b">A</td>'
Element='<td class="b">B</td>'
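The same idea can be sketched in Python with the standard library. xml.etree.ElementTree has no following axis, so this version walks the elements in document order instead: everything that comes after a tr in that order is either one of its descendants or a following element, which covers both cases at once. The wrapping table element and the function name are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# Rows from the question, wrapped in an assumed <table> root so they parse.
HTML = """<table>
<tr class="a"><td>...</td><td>...</td></tr>
<tr><td>....</td><td class="b">A</td></tr>
<tr>....</tr>
<tr class="a"><td class="b">B</td><td>....</td></tr>
<tr><td class="b">Not this</td><td>....</td></tr>
</table>"""


def first_b_after_each_a(root):
    """For every tr.a, collect the text of the first td.b after it in document order."""
    elements = list(root.iter())  # pre-order traversal, i.e. document order
    results = []
    for i, el in enumerate(elements):
        if el.tag == "tr" and el.get("class") == "a":
            for later in elements[i + 1:]:
                if later.tag == "td" and later.get("class") == "b":
                    results.append(later.text)
                    break
    return results


found = first_b_after_each_a(ET.fromstring(HTML))
print(found)
```

For the sample markup this picks the cells containing A and B and skips the one containing "Not this", matching the output listed above.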

Finding first matching sibling element while traversing the DOM

I am trying to create an XPath expression that will find the first matching sibling further down the DOM, given an initial sibling (note: the initial siblings will be Tom and Steve). For example, I want to find 'jerry1' under the 'Tom' tr. I have looked into the following-sibling axis, but I'm not sure that's the best approach for this. Any ideas?
<tr>
<a title="Tom"/>
</tr>
<tr>
<a title="jerry1"/>
</tr>
<tr>
<a title="jerry2"/>
</tr>
<tr>
<a title="jerry3"/>
</tr>
<tr>
<a title="Steve"/>
</tr>
<tr>
<a title="jerry1"/>
</tr>
<tr>
<a title="jerry2"/>
</tr>
<tr>
<a title="jerry3"/>
</tr>
following-sibling will work. This will select the a node with the title "jerry1":
//a[@title='Tom']/../following-sibling::tr[1]/a
The /.. traverses up to Tom's parent <tr>, following-sibling::tr[1] moves to the next <tr>, and finally /a selects the <a> node within it. Without the [1], the expression would select the <a> nodes of every following <tr>.
The following XPath worked for me:
(//a[@title='Tom']/parent::*/following-sibling::tr/a[@title='jerry1'])[1]
This selects the first a with title jerry1 that follows a tr with an a child titled Tom.
Starting at a[@title='Tom'], /parent::* goes up to the parent tr, /following-sibling::tr selects all following sibling tr nodes, and /a[@title='jerry1'] keeps their a children titled jerry1. Because this would select two jerry1 nodes and only the first one after Tom is wanted, the whole expression is wrapped in () and the first match is chosen with [1].
The following XPath expression finds the first tr element that has an a with the @title "jerry1" and that is a following sibling of the tr element that has an a with the @title "Tom":
//tr[a/@title='Tom']/following-sibling::tr[a/@title='jerry1'][1]
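To see the "first matching following sibling" logic outside an XPath engine, here is a minimal Python sketch using only the standard library. xml.etree.ElementTree does not implement the following-sibling axis, so the sibling scan is written as a loop. The wrapping table element and the function name first_following_row are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TITLES = ["Tom", "jerry1", "jerry2", "jerry3",
          "Steve", "jerry1", "jerry2", "jerry3"]

# The rows from the question, wrapped in an assumed <table> root.
HTML = "<table>" + "".join(
    f'<tr><a title="{t}"/></tr>' for t in TITLES
) + "</table>"


def first_following_row(table, start_title, wanted_title):
    """Return the index of the first row after the start row that holds wanted_title."""
    rows = table.findall("tr")
    for i, row in enumerate(rows):
        if row.find(f"a[@title='{start_title}']") is None:
            continue
        # Scan only the siblings that come after the start row.
        for j in range(i + 1, len(rows)):
            if rows[j].find(f"a[@title='{wanted_title}']") is not None:
                return j
        return None
    return None


table = ET.fromstring(HTML)
after_tom = first_following_row(table, "Tom", "jerry1")
after_steve = first_following_row(table, "Steve", "jerry1")
print(after_tom, after_steve)
```

Starting from Tom the scan stops at the first jerry1 row, and starting from Steve it stops at the second one, which is the behaviour the [1] predicate gives in the expressions above.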

xpath expression to find url and data

I want to get the values of every row in the table given below, together with the href value of every link in it.
Being new to XPath, I am finding it difficult to write the expression myself, although understanding what a given XPath expression does is somewhat easier.
The expected output:
http://a.com/ data for a 526735 Z
http://b.com/ data for b 522273 Z
http://c.com/ data for c 513335 Z
<table class = dataTabe>
<tbody>
<tr>
<td>data for a</td>
<td class="numericalColumn">526735</td>
<td class="numericalColumn">Z</td></tr>
<tr>
<td>data for b</td>
<td class="numericalColumn">522273</td>
<td class="numericalColumn">B</td></tr>
<tr>
<td>data for c</td>
<td class="numericalColumn">513335</td>
<td class="numericalColumn">B</td></tr>
</tbody>
</table>
You'll need two things: an XPath query which locates the wanted nodes and a second which outputs the text as you want it. Since you don't give more information about the languages you're using I'm putting together some pseudocode:
foreach node in document.select("//table[@class='dataTable']//tr[td/a/@HREF]")
    write node.select("concat(td/a/@HREF,' ',.)")
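The pseudocode above could be sketched in Python roughly as follows, using only the standard library. The sample table in the question contains no link elements, so the a elements, their lowercase href attributes, and the URLs' placement in the first cell are all assumptions made so that the example has something to select. The concat() call is replaced by plain string joining, because xml.etree.ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the question's table with a hypothetical <a href> per row.
HTML = """<table class="dataTable"><tbody>
<tr><td><a href="http://a.com/">data for a</a></td>
<td class="numericalColumn">526735</td><td class="numericalColumn">Z</td></tr>
<tr><td><a href="http://b.com/">data for b</a></td>
<td class="numericalColumn">522273</td><td class="numericalColumn">B</td></tr>
<tr><td><a href="http://c.com/">data for c</a></td>
<td class="numericalColumn">513335</td><td class="numericalColumn">B</td></tr>
</tbody></table>"""

table = ET.fromstring(HTML)
lines = []
for row in table.findall(".//tr"):
    link = row.find("td/a[@href]")  # rows without a link are skipped
    if link is None:
        continue
    # Text of each cell, with surrounding whitespace collapsed.
    cells = [" ".join("".join(td.itertext()).split()) for td in row.findall("td")]
    lines.append(" ".join([link.get("href")] + cells))

for line in lines:
    print(line)
```

Each output line is the link target followed by the row's cell values, in the shape of the expected output in the question.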
This site has a great free tool for building XPath Expressions (XPath Builder):
http://www.bubasoft.net/
Use this XPath: //tr/td/a/@HREF | //tr//text()
