Html Agility Pack search all nodes and save them - xpath

I want to search all entries across a whole website for "00:00-00:01" and replace it with "", like below.
<td id="tb"> Fr, 3.Sep.2021 00:00-00:01 </td>...<td id="tb"> Fr,3.Sep.2021 </td>
or
<td class="tbda">Fr, 3.Sep.2021 00:00-00:01</td>...<td class="tbda">Fr, 3.Sep.2021</td>
or
<b>Fr, 3.Sep.2021 00:00-00:01</b>...<b>Fr, 3.Sep.2021</b>
A single one is no problem, but how can I find them all, and how can I save the path to each one?

One way is to use a regex:
re.findall(r'<td\s+id="tb">(\w+,\s+\d+\.\w+\.2021\s+\d{2}:\d{2}-\d{2}:\d{2})</td>', text)
But if you want more detail (how each match was found, and where), first find all matching tags, then the content between them, and save it together with its HTML tag, like below:
<div>
<tr>                                           <!-- start tag -->
<td id="tb">Fr, 3.Sep.2021 00:00-00:01</td>    <!-- content -->
</tr>                                          <!-- end tag -->
... more tr ...
</div>
The idea can be found in "How to convert an XML file to nice pandas dataframe?".
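A sketch of the "find every match and remember where it was" idea in Python with lxml (the condensed sample markup, and the choice of lxml's getpath to record the location, are my assumptions, not something the question prescribes):

```python
from lxml import etree

# Condensed sample built from the question's three markup variants.
doc = etree.fromstring(
    '<div>'
    '<td id="tb">Fr, 3.Sep.2021 00:00-00:01</td>'
    '<td class="tbda">Fr, 3.Sep.2021 00:00-00:01</td>'
    '<b>Fr, 3.Sep.2021 00:00-00:01</b>'
    '</div>')
tree = doc.getroottree()

paths = []
for node in doc.xpath('//*[contains(text(), "00:00-00:01")]'):
    paths.append(tree.getpath(node))                    # where it was found
    node.text = node.text.replace(' 00:00-00:01', '')   # the replacement

print(paths)
print(etree.tostring(doc, encoding='unicode'))
```

Each saved path (e.g. /div/td[1]) can later be fed back into doc.xpath() to revisit the node.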

Xpath: Wildcards for descendant nodes not working

Desired output: 3333
<tbody>
<tr>
<td class="name">
<p class="desc">Intel</p>
</td>
</tr>
Other tr tags
<tr>
<td class="tel">
<p class="desc">3333</p>
</td>
</tr>
</tbody>
I want to select the last tr tag after the tr tag that has "Intel" in its p tag.
//tbody//tr[td[p[contains(text(),'Intel')]]]/following-sibling::tr[position()=last()]//p/text()
The above works, but I don't want to reference td and p explicitly. I tried wildcards (? or *), but it doesn't work:
//tbody//tr[?[?[contains(text(),'Intel')]]]/following-sibling::tr[position()=last()]//p/text()
"...which contains a text node equal to 'Intel'"
//tbody/tr[.//text() = 'Intel']/following-sibling::tr[last()]/td/p/text()
"...which contains only the string 'Intel', once you remove all insignificant white-space"
//tbody/tr[normalize-space() = 'Intel']/following-sibling::tr[last()]/td/p/text()
I think the key take-away here is that you can use descendant paths (//) and pay attention to context in predicates once you make them relative (.//).
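A quick check of both answer expressions with Python's lxml (the middle 1111 row is my filler, added so that following-sibling::tr[last()] actually has to skip a row):

```python
from lxml import etree

tbody = etree.fromstring(
    '<tbody>'
    '<tr><td class="name"><p class="desc">Intel</p></td></tr>'
    '<tr><td class="tel"><p class="desc">1111</p></td></tr>'
    '<tr><td class="tel"><p class="desc">3333</p></td></tr>'
    '</tbody>')

# "...contains a text node equal to 'Intel'"
by_text_node = tbody.xpath(
    "//tbody/tr[.//text() = 'Intel']/following-sibling::tr[last()]/td/p/text()")
# "...contains only the string 'Intel' after stripping insignificant white-space"
by_normalize = tbody.xpath(
    "//tbody/tr[normalize-space() = 'Intel']/following-sibling::tr[last()]/td/p/text()")

print(by_text_node, by_normalize)  # ['3333'] ['3333']
```

Neither predicate names td or p, which is exactly what the asker wanted to avoid.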

correct way to scrape this table (using scrapy / xpath)

Given a table (unknown number of <tr>, but always three <td>; the first cell sometimes contains a strikethrough (<s>), which should be captured as an additional item with value 0 or 1):
<table id="my_id">
<tr>
<td>A1</td>
<td>A2</td>
<td>A3</td>
</tr>
<tr>
<td><s>B1</s></td>
<td>B2</td>
<td>B3</td>
</tr>
...
</table>
Where scraping should yield [[A1,A2,A3,0],[B1,B2,B3,1], ...], I currently try something along these lines:
my_xpath = response.xpath("//table[@id='my_id']")
for my_cell in my_xpath.xpath(".//tr"):
    print('record 0:', my_cell.xpath(".//td")[0])
    print('record 1:', my_cell.xpath(".//td")[1])
    print('record 2:', my_cell.xpath(".//td")[2])
And in principle it works (e.g. by adding a pipeline after add_xpath()); I am just sure there is a more natural and elegant way to do this.
Try contains:
my_xpath = response.xpath("//table[contains(@id, 'my_id')]").getall()
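For the [[A1,A2,A3,0],[B1,B2,B3,1], ...] shape itself, one possible sketch, shown here with plain lxml instead of a live Scrapy response (the same XPath expressions work on response.xpath inside a spider):

```python
from lxml import etree

table = etree.fromstring(
    '<table id="my_id">'
    '<tr><td>A1</td><td>A2</td><td>A3</td></tr>'
    '<tr><td><s>B1</s></td><td>B2</td><td>B3</td></tr>'
    '</table>')

rows = []
for tr in table.xpath('//table[@id="my_id"]/tr'):
    # normalize-space() reads the cell text whether or not it is wrapped in <s>
    cells = [td.xpath('normalize-space()') for td in tr.xpath('td')]
    struck = 1 if tr.xpath('td[1]/s') else 0  # is the first cell struck through?
    rows.append(cells + [struck])

print(rows)  # [['A1', 'A2', 'A3', 0], ['B1', 'B2', 'B3', 1]]
```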

XPath: Getting a node by attribute value of subnode

People, could you please help me with this XPath? Let's say I have the following HTML code:
<table>
<tr>
<td class="clickable">text</td>
<td>value1</td>
</tr>
<tr>
<td>value2</td>
<td>text</td>
</tr>
</table>
I need to build an XPath that will pick the <tr> that has a <td> with the value text AND an attribute class equal to clickable.
I tried the following XPath expressions:
//tr[contains(.,'text')][contains(./td/@class,'clickable')]
//tr[contains(.,'text')][contains(td/@class,'clickable')]
but none of those worked.
Any help is appreciated.
Thanks
You are almost there:
//tr[contains(td/@class,'clickable') and contains(td, 'text')]
Demo using xmllint:
$ xmllint input.xml --xpath "//tr[contains(td/@class,'clickable') and contains(td, 'text')]"
<tr>
<td class="clickable">text</td>
<td>value1</td>
</tr>
If you want to find a tr with a td whose value is text and a td (possibly another one) whose class attribute equals clickable, use the answer by @alecxe.
If both conditions must hold for a single td, then
//tr[td[.='text' and @class='clickable']]
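Both variants, checked against the question's markup with Python's lxml (the harness below is my addition, not part of either answer):

```python
from lxml import etree

table = etree.fromstring(
    '<table>'
    '<tr><td class="clickable">text</td><td>value1</td></tr>'
    '<tr><td>value2</td><td>text</td></tr>'
    '</table>')

# text and class may sit on different <td>s of the same row
loose = table.xpath("//tr[contains(td/@class,'clickable') and contains(td, 'text')]")
# both conditions must hold on the same <td>
strict = table.xpath("//tr[td[.='text' and @class='clickable']]")

print(len(loose), len(strict))  # 1 1
```

On this input both expressions match only the first row, since the second row has neither a clickable class nor (in its first cell, the one XPath 1.0's contains() converts to a string) the text "text".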

Parsing the previous <td> of an element (ignoring other elements inbetween)

I have an extremely long HTML file with many different tables. I want to parse only certain tables, but unfortunately the <table> tag is of no help here.
The tables I do want to parse look like this:
<tr>
<td> TEXT1 </td>
<td> <a class='unique identifier' ...> TEXT2 </a></td>
</tr>
I want both "TEXT1" and "TEXT2". I know how to get "TEXT2": it is always in an <a> tag, and my solution so far is
//a[@class="unique identifier"]
Note: sometimes "TEXT1" is in a <p> tag, sometimes it isn't. Sometimes there are other tags after it, like <b>, <br>, or <em>. I figured I would need to get the content of the <td> preceding every <a> that I find, while ignoring any other elements in between.
How can I tell Nokogiri, for every "TEXT2" that it finds, to go back and also get the previous <td>, so that I can get "TEXT1"?
I'd do something like:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<tr>
<td> TEXT1 </td>
<td> <a class='uid'> TEXT2 </a></td>
</tr>
EOT
wrapping_tr = doc.at('//a[@class="uid"]/../..')
nodes = wrapping_tr.search('td')
nodes.map(&:text)
# => [" TEXT1 ", " TEXT2 "]
I'd recommend spending time reading the XPath documentation as this is pretty elementary.
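For readers not on Nokogiri, the same climb-up-then-collect idea can be sketched in Python with lxml (the class name 'uid' is taken from the shortened example above; ancestor::tr is used instead of /../.. so it works at any nesting depth):

```python
from lxml import etree

tr = etree.fromstring(
    "<tr><td> TEXT1 </td><td> <a class='uid'> TEXT2 </a></td></tr>")

a = tr.xpath("//a[@class='uid']")[0]
row = a.xpath("ancestor::tr")[0]  # climb back to the wrapping <tr>
texts = [td.xpath("normalize-space()") for td in row.xpath("td")]
print(texts)  # ['TEXT1', 'TEXT2']
```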

How to get any text between an opening and closing node with xpath?

I want to get the text marked in the example below. When I use strong[3] it returns "Text5:", as expected, but how can I get the airport name section that follows it with XPath?
Code:
<tr>
<td>
<strong>Text1 </strong>Text2
<strong> Text3: </strong>Text4
<strong>Text5:</strong> Text_Text_Text_Text_Text
</td>
</tr>
The part that I need:
Text_Text_Text_Text_Text
The solution is /tr/td/text()[3], i.e. select the third text-node child of the <td> directly; /tr/td/text()[last()] also works here and does not depend on counting the preceding strong/text pairs.
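A minimal check with Python's lxml. Note the markup is written on one line here on purpose: with pretty-printed input, the whitespace between tags becomes its own text node and shifts the index, which is why text()[last()] is the safer variant:

```python
from lxml import etree

tr = etree.fromstring(
    '<tr><td><strong>Text1 </strong>Text2'
    '<strong> Text3: </strong>Text4'
    '<strong>Text5:</strong> Text_Text_Text_Text_Text</td></tr>')

third = tr.xpath('/tr/td/text()[3]')[0].strip()
last = tr.xpath('/tr/td/text()[last()]')[0].strip()
print(third)  # Text_Text_Text_Text_Text
```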