Parsing the previous <td> of an element (ignoring other elements in between) - ruby

I have an extremely long HTML file with many different tables. I want to parse only certain tables, but unfortunately the <table> tag is of no help here.
The tables I do want to parse look like this:
<tr>
<td> TEXT1 </td>
<td> <a class='unique identifier' ...> TEXT2 </a></td>
</tr>
I want both "TEXT1" and "TEXT2". I know how to get "TEXT2": It is always in an <a> tag and my solution so far is
//a[(@class="unique identifier")]
Note: Sometimes "TEXT1" is in a <p> tag, sometimes it isn't. Sometimes there are other tags after it, like <b>, <br> or <em>. I think I need to get the content of the previous <td> for every <a> that I have found, while ignoring any other elements in between.
How can I tell Nokogiri that for every "TEXT2" that I have found to go back and get the previous <td> as well, so that I can get "TEXT1"?

I'd do something like:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<tr>
<td> TEXT1 </td>
<td> <a class='uid'> TEXT2 </a></td>
</tr>
EOT
wrapping_tr = doc.at('//a[@class="uid"]/../..')
nodes = wrapping_tr.search('td')
nodes.map(&:text)
# => [" TEXT1 ", " TEXT2 "]
I'd recommend spending time reading the XPath documentation as this is pretty elementary.

Related

Xpath: Wildcards for descendant nodes not working

Desired output: 3333
<tbody>
<tr>
<td class="name">
<p class="desc">Intel</p>
</td>
</tr>
Other tr tags
<tr>
<td class="tel">
<p class="desc">3333</p>
</td>
</tr>
</tbody>
I want to select the last tr tag after the tr tag that has "Intel" in the p tag
//tbody//tr[td[p[contains(text(),'Intel')]]]/following-sibling::tr[position()=last()]//p/text()
The above works, but I don't wish to reference td and p explicitly. I tried the wildcards ? and *, but they don't work.
//tbody//tr[?[?[contains(text(),'Intel')]]]/following-sibling::tr[position()=last()]//p/text()
"...which contains a text node equal to 'Intel'"
//tbody/tr[.//text() = 'Intel']/following-sibling::tr[last()]/td/p/text()
"...which contains only the string 'Intel', once you remove all insignificant white-space"
//tbody/tr[normalize-space() = 'Intel']/following-sibling::tr[last()]/td/p/text()
I think the key take-away here is that you can use descendant paths (//) and pay attention to context in predicates once you make them relative (.//).

Html Agility Pack search all nodes and save them

I need to search a whole website for entries containing "00:00-00:01" and replace that part with "", as shown below.
<td id="tb"> Fr, 3.Sep.2021 00:00-00:01 </td>...<td id="tb"> Fr,3.Sep.2021 </td>
or
<td class="tbda">Fr, 3.Sep.2021 00:00-00:01</td>...<td class="tbda">Fr, 3.Sep.2021 </td>
or
<b>Fr, 3.Sep.2021 00:00-00:01</b>...<b>Fr, 3.Sep.2021</b>
A single one is no problem, but how can I find all of them, and how can I save the path to each one?
One way is to use regex:
re.findall(r'<td\s+id="tb">\s*(\w+,\s*\d+\.\w+\.2021\s+[0-9:]{5}-[0-9:]{5})\s*</td>', text)
But you want more details, how it was found and where. So find all matched tags first, then find all content between them, then save it with an html tag. Like below:
<div>
<tr> # this is the start tag </tr>
<td id="tb">Fr, 3.Sep.2021 00:00-00:01</td> # this is the end content </td> # this is the end tag </tr>
... more tr ...
</div>
The idea can be found in How to convert an XML file to nice pandas dataframe? .

How to get any text between an opening and closing node with xpath?

I want to get the text marked in the example below, but when I use strong[3] it returns "Text5:", as you would expect. How can I get the airport name section with xpath?
Code:
<tr>
<td>
<strong>Text1 </strong>Text2
<strong> Text3: </strong>Text4
<strong>Text5:</strong> Text_Text_Text_Text_Text
</td>
</tr>
The part that I need:
Text_Text_Text_Text_Text
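One solution is /tr/td/text()[3], provided the whitespace before the first <strong> is not counted as a text node; if it is, the index shifts by one.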

XPath to get siblings between two elements

With the following markup I need to get the middle tr's
<tr class="H03">
<td>Artist</td>
...
<tr class="row_alternate">
<td>LIMP</td>
<td>Orion</td>
...
</tr>
<tr class="row_normal">
<td>SND</td>
<td>Tender Love</td>
...
</tr>
<tr class="report_total">
<td> </td>
<td> </td>
...
</tr>
That is every sibling tr between <tr class="H03"> and <tr class="report_total">. I'm scraping using mechanize and nokogiri, so am limited to their xpath support. My best attempt after looking at various StackOverflow questions is
page.search('/*/tr[@class="H03"]/following-sibling::tr[count(. | /*/tr[@class="report_total"]/preceding-sibling::tr)=count(/*/tr[@class="report_total"]/preceding-sibling::tr)]')
which returns an empty array, and is so ridiculously complicated that my limited xpath fu is completely overwhelmed!
You can try the following xpath :
//tr[@class='H03']/following-sibling::tr[following-sibling::tr[@class='report_total']]
The XPath above selects every <tr> that follows tr[@class='H03'] and that itself has a following sibling tr[@class='report_total']; in other words, the selected <tr> elements are the ones located between the two.
Mechanize has a few helper methods here that would be useful to employ.
Presuming you are doing something like the following:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.website.com')
start_tr = page.at('.H03')
At this point, start_tr will be a Nokogiri XML element for the first tr you list in your question.
You can then iterate through siblings with:
next_tr = start_tr.next_sibling
Do this until you hit the tr at which you want to stop.
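trs = []
# Stop at the report_total row (or at the end of the siblings); skip whitespace text nodes.
until next_tr.nil? || (next_tr.element? && next_tr['class'] == 'report_total')
  trs << next_tr if next_tr.element?
  next_tr = next_tr.next_sibling
end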
If you want the range to be inclusive of the start and stop trs (H03 and report_total) just tweak the code above to include them in the trs array.

Watir: Search table cell by bgcolor tag and get column number

Consider the following HTML as an example. It's a scratch sheet I made to practice, but it has a snippet of the real HTML I am trying to work with.
http://www.carbide-red.com/prog/test_table.html
I am trying to locate a column, and the only consistent identifier I can find is the background color (bgcolor).
<tr bgcolor="#ffffcc">
<td bgcolor="yellow" class="date" align=center>Equipment</td>
<td bgcolor="#ccccff" align=center class="date"><font color=black>8/12/12</font></td>
<td bgcolor="#ccccff" align=center class="date"><font color=black>8/19/12</font></td>
<td bgcolor="#ccccff" align=center class="date"><font color=black>8/26/12</font></td>
<td bgcolor="#ccccff" align=center class="date"><font color=black>9/2/12</font></td>
<td bgcolor="red" align=center class="date"><font color=yellow>9/9/12</font></td>
<td bgcolor="#ccffcc" align=center class="date"><font color=black>9/16/12</font></td>
<td bgcolor="#ccffcc" align=center class="date"><font color=black>9/23/12</font></td>
<td bgcolor="#ccffcc" align=center class="date"><font color=black>9/30/12</font></td>
<td bgcolor="#ccffcc" align=center class="date"><font color=black>10/7/12</font></td>
</tr>
I'm trying to find the <td> that has bgcolor=red. I would then like to save the column index of that cell, so that I can then use it to select the same column of the following rows.
But I can't seem to find a way to search by the bgcolor attribute, and I have not been able to find a way to get Watir to report the column/row indexes so that I can store them in a variable. If I could match on bgcolor, I could search for something like "Equipment" and then count cells until I reach the correct one.
I know the HTML is not ideal, since there is no "name" or other unique identifier, but I can't change that.
I am very new to Ruby and Watir. I tried to manipulate a website in Perl and it was not going very well; then I discovered Watir, which did exactly what I needed (and surprisingly easily), but now I am trying to understand Ruby as well as the finer semantics.
Thanks for any help!
To get text of <td bgcolor="red"> try this:
browser.element(:css => "td[bgcolor=red]").text
You should get back "9/9/12". To click the element, replace text with click.
To put its index in the variable index, try this:
index = nil
browser.tds.each_with_index {|td, i| index = i if td.attribute_value("bgcolor") == "red" or td.attribute_value("bgcolor") == "#ff0000"}
index variable should be 5.
I would use nokogiri if I were you:
doc = Nokogiri::HTML(@browser.html)
td = doc.at('td[@bgcolor="red"]')
index = td.search('./preceding-sibling::td').length
Unless there's tricky javascript on the page you're probably better off with mechanize than watir.
Yes, the webpage I'm dealing with uses JavaScript, which is why I had a very hard time using Mechanize::Firefox under Perl. Watir worked much more smoothly.
Thank you for your suggestion! It didn't work at first, but it helped me with Google searches and I was able to get a working version.
require "watir"
require "nokogiri"
browser = Watir::Browser.new
browser.goto "http://www.carbide-red.com/prog/test_table.html"
doc = Nokogiri::HTML.parse(browser.html)
td = doc.at('td[@bgcolor="red"]')
columnindex = td.search('./preceding-sibling::td').length
puts columnindex
browser.close
This returned "5"
Update:
For the sake of others who may find this while searching and learning: to use the columnindex variable to find a specific column within a row, use this code:
textvariable = browser.td(:text => "A58004").parent.td(:index => columnindex).text
puts "Textvariable: #{textvariable}"
This finds a <td> that contains the text "A58004", then moves to the cell at index 5 in the same row (zero-based, so the sixth cell) and returns its text. Using the webpage linked in my original question, that would be "W=Sa".

Resources