How to make b and a optional in the following expression?
//td[@class='ttr_interest']/b/a/text()
Basically, /b/a may or may not be present in the tree (only a, only b, both, or neither can be present). How, in general, can I specify optional elements?
I want to capture the text enclosed in td whether or not that text is additionally wrapped in <a> and <b>.
Sample as requested
<td>
text_to_capture
</td>
<td>
<b>text_to_capture</b>
</td>
<td>
text_to_capture
</td>
Use:
(//td[@class='ttr_interest']
|
//td[@class='ttr_interest']/a
|
//td[@class='ttr_interest']/b/a
)
/text()
This selects any text-node child of any element selected by one of the three XPath expressions that are unioned together inside the parentheses.
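To see the union at work, here is a minimal sketch using Python and lxml. The sample markup and the cell contents (`plain`, `in_a`, `in_b_a`) are assumptions made up for illustration; only the XPath expression comes from the answer above.

```python
from lxml import etree

# Hypothetical sample: same class on each cell, text nested at different depths.
XML = """<root>
<td class="ttr_interest">plain</td>
<td class="ttr_interest"><a>in_a</a></td>
<td class="ttr_interest"><b><a>in_b_a</a></b></td>
<td class="other">ignored</td>
</root>"""

doc = etree.fromstring(XML)

# The three alternatives are unioned, then /text() is applied to the result.
expr = (
    "(//td[@class='ttr_interest']"
    " | //td[@class='ttr_interest']/a"
    " | //td[@class='ttr_interest']/b/a"
    ")/text()"
)

# Drop whitespace-only text nodes and trim the rest.
texts = [t.strip() for t in doc.xpath(expr) if t.strip()]
print(texts)  # ['plain', 'in_a', 'in_b_a']
```

A cell whose text sits in a bare <b> (with no <a> inside) is not covered by these three branches; it would need a further `| //td[@class='ttr_interest']/b` alternative.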
You don't say in which context you do this (XSLT?), but here is a Python/lxml suggestion:
from lxml import etree
XML = """
<root>
<td>
text_to_capture
</td>
<td>
<b>text_to_capture</b>
</td>
<td>
text_to_capture
</td>
</root>"""
doc = etree.fromstring(XML)
expr = "//td//text()"
texts = doc.xpath(expr)
print(texts)  # includes whitespace-only nodes
for t in texts:
    if t.strip():
        print(t.strip())
Output:
['\n ', 'text_to_capture', '\n ', '\n ', 'text_to_capture', '\n ', '\n text_to_capture\n ']
text_to_capture
text_to_capture
text_to_capture
This solution selects all text in <td> regardless of the names of any <td> child elements.
EDIT: After comments changed xpath to fit question
<bar>
xxxx
<foo>xxx</foo>
<barfoo>
<foo>xxx</foo>
</barfoo>
</bar>
Use this xpath
//bar//*/text()|//bar/text()
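As a quick check, the union can be run with lxml against a small sample. This is only a sketch: the sample document is an assumption (the inner text was changed to `yyy` so the nodes can be told apart). It also shows that, for this kind of input, the shorter `//bar//text()` returns the same text nodes.

```python
from lxml import etree

# Hypothetical sample modelled on the snippet above, written compactly
# so that no whitespace-only text nodes appear.
XML = "<bar>xxxx<foo>xxx</foo><barfoo><foo>yyy</foo></barfoo></bar>"
doc = etree.fromstring(XML)

# Text children of every element below <bar>, plus <bar>'s own text children.
union = doc.xpath("//bar//*/text()|//bar/text()")

# All text nodes anywhere below <bar>.
short = doc.xpath("//bar//text()")

print(union)  # ['xxxx', 'xxx', 'yyy']
print(short)  # ['xxxx', 'xxx', 'yyy']
```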
Related
Desired output: 3333
<tbody>
<tr>
<td class="name">
<p class="desc">Intel</p>
</td>
</tr>
Other tr tags
<tr>
<td class="tel">
<p class="desc">3333</p>
</td>
</tr>
</tbody>
I want to select the last tr tag after the tr tag that has "Intel" in the p tag
//tbody//tr[td[p[contains(text(),'Intel')]]]/following-sibling::tr[position()=last()]//p/text()
The above works but I don't wish to reference td and p explicitly. I tried wildcards ? or *, but it doesn't work.
//tbody//tr[?[?[contains(text(),'Intel')]]]/following-sibling::tr[position()=last()]//p/text()
"...which contains a text node equal to 'Intel'"
//tbody/tr[.//text() = 'Intel']/following-sibling::tr[last()]/td/p/text()
"...which contains only the string 'Intel', once you remove all insignificant white-space"
//tbody/tr[normalize-space() = 'Intel']/following-sibling::tr[last()]/td/p/text()
I think the key take-away here is that you can use descendant paths (//) and pay attention to context in predicates once you make them relative (.//).
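Both variants can be tried with lxml. The sketch below assumes a cut-down table with one extra row (`1111`) standing in for the "other tr tags"; the two expressions are the ones given above.

```python
from lxml import etree

# Hypothetical table: the row after "Intel" that we want is the last one.
XML = """<tbody>
<tr><td class="name"><p class="desc">Intel</p></td></tr>
<tr><td class="tel"><p class="desc">1111</p></td></tr>
<tr><td class="tel"><p class="desc">3333</p></td></tr>
</tbody>"""
doc = etree.fromstring(XML)

# Row that contains a text node equal to 'Intel' at any depth (.// is relative to the tr).
by_text_node = doc.xpath(
    "//tbody/tr[.//text() = 'Intel']/following-sibling::tr[last()]/td/p/text()"
)

# Row whose whole string value is 'Intel' after whitespace normalization.
by_string_value = doc.xpath(
    "//tbody/tr[normalize-space() = 'Intel']/following-sibling::tr[last()]/td/p/text()"
)

print(by_text_node)     # ['3333']
print(by_string_value)  # ['3333']
```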
I need to search all pages of a website for entries containing "00:00-00:01" and replace that part with "", as shown below.
<td id="tb"> Fr, 3.Sep.2021 00:00-00:01 </td>...<td id="tb"> Fr,3.Sep.2021 </td>
or
<td class="tbda">Fr, 3.Sep.2021 00:00-00:01</td>...<td class="tbda">Fr, 3.Sep.2021 </td>
or
<b>Fr, 3.Sep.2021 00:00-00:01</b>...<b>Fr, 3.Sep.2021</b>
A single one is no problem, but how can I find all of them, and how can I save the path to each match?
One way is to use regex:
re.findall(r'<td\s+id="tb">\s*(\w+,\s*\d+\.\w+\.2021\s+[0-9:]{5}-[0-9:]{5})\s*</td>', text)
But you want more details, how it was found and where. So find all matched tags first, then find all content between them, then save it with an html tag. Like below:
<div>
<tr> # this is the start tag </tr>
<td id="tb">Fr, 3.Sep.2021 00:00-00:01</td> # this is the end content </td> # this is the end tag </tr>
... more tr ...
</div>
The idea can be found in "How to convert an XML file to nice pandas dataframe?".
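For the find-and-replace part, a minimal sketch with Python's `re` module could look like the following. The input string, the pattern and the names `hits` and `cleaned` are assumptions for illustration. It records only the tag name and character offset of each match, not a full path, and regex matching on HTML is fragile once the markup gets more varied.

```python
import re

# Hypothetical input covering the three variants from the question.
html = (
    '<td id="tb"> Fr, 3.Sep.2021 00:00-00:01 </td>'
    '<td class="tbda">Fr, 3.Sep.2021 00:00-00:01</td>'
    '<b>Fr, 3.Sep.2021 00:00-00:01</b>'
)

# Group 1: opening tag (any name and attributes) plus the date.
# Group 2: the tag name. The time range after the date is what gets dropped.
pattern = re.compile(r'(<(\w+)[^>]*>\s*\w+,\s*\d+\.\w+\.\d{4})\s+00:00-00:01')

# Where each entry was found: (tag name, character offset in the input).
hits = [(m.group(2), m.start()) for m in pattern.finditer(html)]

# Keep the tag and date, remove the time range.
cleaned = pattern.sub(r'\1', html)

print([name for name, _ in hits])  # ['td', 'td', 'b']
print(cleaned)
```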
I have an extremely long HTML file with many different tables. I want to parse only certain tables, but unfortunately the <table> tag is of no help here.
The tables I do want to parse look like this:
<tr>
<td> TEXT1 </td>
<td> <a class='unique identifier' ...> TEXT2 </a></td>
</tr>
I want both "TEXT1" and "TEXT2". I know how to get "TEXT2": It is always in an <a> tag and my solution so far is
//a[(@class="unique identifier")]
Note: Sometimes "TEXT1" is in a <p> tag, sometimes it isn't. Sometimes there are other tags after it, like <b>, <br> or <em>. I thought I would need to get the previous <td> content for every <a> that I have found, but ignore any other elements that are in between.
How can I tell Nokogiri that for every "TEXT2" that I have found to go back and get the previous <td> as well, so that I can get "TEXT1"?
I'd do something like:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<tr>
<td> TEXT1 </td>
<td> <a class='uid'> TEXT2 </a></td>
</tr>
EOT
wrapping_tr = doc.at('//a[@class="uid"]/../..')
nodes = wrapping_tr.search('td')
nodes.map(&:text)
# => [" TEXT1 ", " TEXT2 "]
I'd recommend spending time reading the XPath documentation as this is pretty elementary.
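The same idea can also be expressed with XPath axes alone. The sketch below uses Python and lxml rather than Nokogiri, and the sample row, with TEXT1 wrapped in a <p> and followed by a <br/>, is an assumption. For each matching <a> it steps back to the previous <td> and takes its whitespace-normalized string value, so any markup around TEXT1 is ignored.

```python
from lxml import etree

# Hypothetical row: TEXT1 is wrapped in extra markup, TEXT2 sits in the <a>.
doc = etree.fromstring("""<table><tr>
<td><p> TEXT1 </p><br/></td>
<td> <a class="uid"> TEXT2 </a></td>
</tr></table>""")

pairs = []
for a in doc.xpath("//a[@class='uid']"):
    # Nearest enclosing cell, then the cell immediately before it in the row.
    prev_td = a.xpath("ancestor::td[1]/preceding-sibling::td[1]")[0]
    # normalize-space() on the cell flattens nested tags and trims whitespace.
    text1 = prev_td.xpath("normalize-space()")
    text2 = a.text.strip()
    pairs.append((text1, text2))

print(pairs)  # [('TEXT1', 'TEXT2')]
```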
I am trying to create an xpath expression that will find the first matching sibling 'down' the dom given an initial sibling (note: initial siblings will be Tom and Steve). For example, I want to find 'jerry1' under the 'Tom' tr. I have looked into the following-sibling argument, but I'm not sure that's the best approach for this? Any ideas?
<tr>
<a title="Tom"/>
</tr>
<tr>
<a title="jerry1"/>
</tr>
<tr>
<a title="jerry2"/>
</tr>
<tr>
<a title="jerry3"/>
</tr>
<tr>
<a title="Steve"/>
</tr>
<tr>
<a title="jerry1"/>
</tr>
<tr>
<a title="jerry2"/>
</tr>
<tr>
<a title="jerry3"/>
</tr>
following-sibling will work. This will select the a node with the title "jerry1":
//a[@title='Tom']/../following-sibling::tr[1]/a
The /.. step moves up to Tom's parent <tr>, following-sibling::tr[1] moves to the next <tr>, and the final /a selects the <a> node within it.
Following XPath worked for me:
(//a[@title='Tom']/parent::*/following-sibling::tr/a[@title='jerry1'])[1]
First matching a with title jerry1 following a tr with an a-child with title Tom.
Starting at a[@title='Tom'], /parent::* goes up to the parent tr, /following-sibling::tr selects all following sibling tr nodes, and /a[@title='jerry1'] selects their a children with that title. Because this selects both jerry1 nodes and only the first one after Tom is wanted, the whole XPath is wrapped in () and the first match is chosen with [1].
The following XPath statement finds the first tr element that has an a with the @title "jerry1" and that is a following-sibling of the tr element that has an a with the @title "Tom":
//tr[a/@title='Tom']/following-sibling::tr[a/@title='jerry1'][1]
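A small lxml sketch of this last expression follows. The `id` attributes on the rows and the helper name `first_after` are assumptions added so the two jerry1 rows can be told apart; the titles are passed in as XPath variables, which lxml accepts as keyword arguments to `xpath()`.

```python
from lxml import etree

# Hypothetical rows; the id attributes exist only to identify the result.
XML = """<table>
<tr id="r1"><a title="Tom"/></tr>
<tr id="r2"><a title="jerry1"/></tr>
<tr id="r3"><a title="jerry2"/></tr>
<tr id="r4"><a title="Steve"/></tr>
<tr id="r5"><a title="jerry1"/></tr>
<tr id="r6"><a title="jerry2"/></tr>
</table>"""
doc = etree.fromstring(XML)


def first_after(start, wanted):
    """Return the first <tr> after the `start` row whose <a> has title `wanted`."""
    expr = "//tr[a/@title=$start]/following-sibling::tr[a/@title=$wanted][1]"
    rows = doc.xpath(expr, start=start, wanted=wanted)
    return rows[0] if rows else None


print(first_after("Tom", "jerry1").get("id"))    # r2
print(first_after("Steve", "jerry1").get("id"))  # r5
```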
I am trying to get the text "Weeeeee", but when I use //td[@class='something']/text() I get nothing.
<td class="something">
<a href='http://www.google.com'>Google</a>
Weeeeee
<div>
<a>something</a>
</div>
</td>
Try
//td[@class='something']/text()[normalize-space() != ''][1]
as there are three text nodes in your example, the first and the last one consist of whitespace only.
Highlighted with square brackets:
<td class="something">[\n
----]<a href='http://www.google.com'>Google</a>[\n
----Weeeeee\n
----]<div>
<a>something</a>
</div>[\n
]</td>
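The three text nodes can be inspected directly with lxml. This is a minimal sketch that assumes the markup from the question is parsed as XML with its indentation intact.

```python
from lxml import etree

XML = """<td class="something">
    <a href='http://www.google.com'>Google</a>
    Weeeeee
    <div>
        <a>something</a>
    </div>
</td>"""
doc = etree.fromstring(XML)

# All direct text-node children of the td: three of them.
all_nodes = doc.xpath("//td[@class='something']/text()")
print([t.strip() for t in all_nodes])  # ['', 'Weeeeee', '']

# Only the first text node that is not whitespace-only.
non_blank = doc.xpath("//td[@class='something']/text()[normalize-space() != ''][1]")
print(non_blank[0].strip())  # Weeeeee
```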