Using Ruby / Nokogiri to parse randomized class names - ruby

I've been doing calculations by hand when it comes to the remaining percentage of the US Presidential election votes in various states. With so many updates and states – this is getting tiring. So why not automate the process?
Here's what I'm looking at:
The problem is that the class names have been randomized. For example, here's the one I'm interested in:
<td class="jsx-3768461732 votes votes-row">2,450,186</td>
Playing around in irb, I tried to use a wildcard on "votes votes-row", since this only appears when I need it in the doc:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("https://www.politico.com/2020-election/results/georgia/"))
votes = doc.css("[td*='votes-row']")
...which yields no results (=> [])
What am I doing wrong, and how do I fix it? I'm OK with XPath; I just want to make sure changes made elsewhere in the doc don't affect finding these elements.

There's probably a better way but...
require 'nokogiri'
require 'open-uri'
# On Ruby 3.0+, open-uri needs URI.open; Kernel#open no longer fetches URLs
doc = Nokogiri::HTML(URI.open("https://www.politico.com/2020-election/results/georgia/"))
votes = doc.css('tr[class*="candidate-row"]').map { |row| row.css('td').map { |cell| cell.content } }
biden_row = votes.find_index { |row| row[0] =~ /biden/i }
trump_row = votes.find_index { |row| row[0] =~ /trump/i }
biden_votes = votes[biden_row][1].split('%')[1]
trump_votes = votes[trump_row][1].split('%')[1]
Edit: from the HTML source the relevant table looks like:
<table class="jsx-1526769828 candidate-table">
<thead class="jsx-3554868417 table-head">
<tr class="jsx-3554868417">
<th class="table-header jsx-3554868417 candidate-name">
<h5 class="jsx-3554868417">Candidate</h5>
</th>
<th class="table-header jsx-3554868417 percent">
<h5 class="jsx-3554868417">Pct.</h5>
</th>
<th class="table-header jsx-3554868417 vote-bar"></th>
</tr>
</thead>
<tbody class="jsx-2085888330 table-head">
<tr class="jsx-2677388595 candidate-row">
<td class="jsx-3948343365 candidate-name name-row">
<div class="jsx-1912693590 name-only candidate-short-name">Biden</div>
<div class="jsx-3948343365 candidate-party-tag">
<div class="jsx-1420258095 party-label dem">dem</div>
</div>
<div class="jsx-3948343365 candidate-winner-check"></div>
</td>
<td class="jsx-3830922081 percent percent-row">
<div class="candidate-percent-only jsx-3830922081">49.4%</div>
<div class="candidate-votes-next-to-percent jsx-3830922081">2,450,193</div>
</td>
<td class="jsx-3458171655 vote-bar vote-bar-row">
<div style="width:49.4%" class="jsx-3458171655 bar dem"></div>
</td>
</tr>
<tr class="jsx-2677388595 candidate-row">
<td class="jsx-3948343365 candidate-name name-row">
<div class="jsx-1912693590 name-only candidate-short-name">Trump*</div>
<div class="jsx-3948343365 candidate-party-tag">
<div class="jsx-1420258095 party-label gop">gop</div>
</div>
<div class="jsx-3948343365 candidate-winner-check"></div>
</td>
<td class="jsx-3830922081 percent percent-row">
<div class="candidate-percent-only jsx-3830922081">49.4%</div>
<div class="candidate-votes-next-to-percent jsx-3830922081">2,448,635</div>
</td>
<td class="jsx-3458171655 vote-bar vote-bar-row">
<div style="width:49.4%" class="jsx-3458171655 bar gop"></div>
</td>
</tr>
</tbody>
</table>
So you could probably use the candidate-votes-next-to-percent to get this value. e.g.:
require 'nokogiri'
require 'open-uri'
# On Ruby 3.0+, open-uri needs URI.open; Kernel#open no longer fetches URLs
doc = Nokogiri::HTML(URI.open("https://www.politico.com/2020-election/results/georgia/"))
votes = doc.css('tr[class*="candidate-row"]').map do |row|
[
row.css('div[class*="candidate-short-name"]').first.content,
row.css('div[class*="candidate-votes-next-to-percent"]').first.content
]
end
# => [["Biden", "2,450,193"], ["Trump*", "2,448,635"]]

Related

Locating table cell using the header cell text

I have table kind of appearance as shown below but it's not a single table. Header is in one table and rows are in another table.
The header, which has Primary, Language and the add button, is one table, and the two data rows are in another table. I have to identify a cell using the header text. For example, if I give "1 Language", it has to locate the second cell of the first row, in which Arabic is chosen. Likewise, if I give "2 Primary", it has to locate the first cell of the second row.
The HTML code is shown in the pic below. If it's possible to solve this problem, then I will give the actual code.
<div id="d78Pt30" style="width: 35em;" class="gridMaxHeight z-grid">
<div id="d78Pt30-head" class="z-grid-header" style="">
<table id="d78Pt30-headtbl" style="table-layout: fixed;" width="100%">
<colgroup id="d78Px30-hdfaker">
<col id="d78Py30-hdfaker" style="width: 61px;">
<col id="d78Pz30-hdfaker" style="">
<col id="d78P_40-hdfaker" style="width: 50px;">
<col id="d78Px30-hdfaker-bar" style="width: 0px">
</colgroup>
<tbody id="d78Pt30-headrows">
<tr id="d78Px30" class="z-columns" style="text-align: left;">
<th id="d78Py30" class="z-column">
<div id="d78Py30-cave" class="z-column-content">
<div class="z-column-sorticon"><i id="d78Py30-sort-icon"></i></div>
Primary
</div>
</th>
<th id="d78Pz30" class="z-column">
<div id="d78Pz30-cave" class="z-column-content">
<div class="z-column-sorticon"><i id="d78Pz30-sort-icon"></i></div>
Language
</div>
</th>
<th id="d78P_40" class="z-column">
<div id="d78P_40-cave" class="z-column-content">
<div class="z-column-sorticon"><i id="d78P_40-sort-icon"></i></div>
<a id="d78P040" class="z-a" href="javascript:;"><img src="assets/images/add.png"
align="absmiddle"></a></div>
</th>
<th id="d78Px30-bar" class="z-columns-bar"></th>
</tr>
</tbody>
</table>
</div>
<div class="z-grid-header-border"></div>
<div id="d78Pt30-body" class="z-grid-body" style="overflow: auto;">
<table id="d78Pt30-cave" style="table-layout: fixed;" width="100%">
<colgroup id="d78Px30-bdfaker">
<col id="d78Py30-bdfaker" style="width: 61px;">
<col id="d78Pz30-bdfaker" style="">
<col id="d78P_40-bdfaker" style="width: 50px;">
</colgroup>
<tbody id="d78Pi50" class="z-rows">
<tr id="d78P260" class="gridMaxHeight z-row">
<td id="d78P360-chdextr" class="z-row-inner">
<div id="d78P360-cell" class="z-row-content"><span id="d78P360"
class="z-radio z-radio-default"><input
type="radio" id="d78P360-real" name="d78P360" checked="checked"><label for="d78P360-real"
id="d78P360-cnt"
class="z-radio-content"></label></span>
</div>
</td>
<td id="d78P460-chdextr" class="z-row-inner">
<div id="d78P460-cell" class="z-row-content"><span id="d78P460" class="z-combobox"
style="width: 225px;"><input id="d78P460-real"
class="z-combobox-input"
autocomplete="off"
value="" type="text"
size="20"
style="width: 196px;"><a
id="d78P460-btn" class="z-combobox-button"><i id="d78P460-icon"
class="z-combobox-icon z-icon-caret-down"></i></a><div
id="d78P460-pp" style="display: none;"></div></span></div>
</td>
<td id="d78Py60-chdextr" class="z-row-inner">
<div id="d78Py60-cell" class="z-row-content">
<div id="d78Py60" class="z-hlayout">
<div id="d78Pz60-chdex" class="z-hlayout-inner" style=""><a id="d78Pz60" class="z-a"
href="javascript:;"><img
src="assets/images/delete.png" align="absmiddle"></a></div>
</div>
</div>
</td>
</tr>
</tbody>
<tbody class="z-grid-emptybody">
<tr>
<td id="d78Pt30-empty" style="display: none;" colspan="3">
<div id="d78Pt30-empty-content" class="z-grid-emptybody-content">No data available</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
As you can see in the HTML, there are two tables and first one is having the header and second one is having the rest of the rows.
Since the full markup of the table was not provided, I've used the HTML at the end of the answer to illustrate the possible solutions.
Solution 1 - Hardcode the column index
If the table columns are static, the easiest solution is to hardcode an index lookup in your code. This solution also makes it easier to deal with header cells that do not have text, e.g. the add icon.
For example, you know that "Language" header is always column index 1 and "Primary" is always column index 0. Therefore, you know that if you want the "Language" of a data row, it will be the 2nd cell in the row.
def cell_by_header_text(browser, data_row_index, header_text)
columns = ['Primary', 'Language', 'Add'] # must match the order on the page
column_index = columns.index(header_text)
data_table = browser.div(class: 'z-grid-body').table
data_table[data_row_index - 1][column_index] # returns Watir::Cell
end
p cell_by_header_text(browser, 1, 'Language').html
#=> "<td><select><option selected=\"selected\">Arabic</option><option>Bengali</option></select></td>"
p cell_by_header_text(browser, 2, 'Primary').html
#=> "<td><input type=\"radio\" checked=\"\"></td>"
Solution 2 - Dynamic lookup of column index
If the table columns are dynamic or you want a more general solution, you can look up the column index from the header table.
def cell_by_header_text(browser, data_row_index, header_text)
header_table = browser.div(class: 'z-grid-header').table
column_index = header_table.tds.find_index { |td| td.text == header_text }
data_table = browser.div(class: 'z-grid-body').table
data_table[data_row_index - 1][column_index] # returns Watir::Cell
end
Solution 3 - Domain-specific collection
If you want to improve readability and have more flexibility, you could take it a step further and create a domain-specific collection for the grid:
class LanguageRowCollection
include Enumerable
def initialize(browser)
@browser = browser
end
def each
data_rows.map { |data| yield LanguageRow.new(header_row, data) }
end
def [](value)
to_a[value]
end
private
def header_row
@browser.div(class: 'z-grid-header').table.tr
end
def data_rows
@browser.div(class: 'z-grid-body').table.trs
end
end
class LanguageRow
def initialize(header_row, tr)
@header_row = header_row
@tr = tr
end
def primary_cell
@tr.tds[@header_row.tds.map(&:text).index('Primary')]
end
def primary?
primary_cell.radio.selected?
end
def set_primary(value)
primary_cell.radio.set(value)
end
def language_cell
@tr.tds[@header_row.tds.map(&:text).index('Language')]
end
def language
language_cell.select.text
end
def set_language(value)
language_cell.select.set(value)
end
def remove_cell
# Locating the 3rd column by its image since it doesn't have text
@tr.tds[@header_row.tds.find_index { |td| td.image(class: 'add').exists? }]
end
def remove
remove_cell.link.click
end
end
def languages(browser)
grid = browser.div(class: 'z-grid')
LanguageRowCollection.new(grid)
end
You get a more readable way to get/set values:
# Get/set the language of the first row (note the 0-based index)
languages(browser)[0].language
#=> "Arabic"
languages(browser)[0].set_language('Bengali')
You also get the flexibility of locating rows based on their values:
# Get the primary language
languages(browser).find(&:primary?).language
#=> "Bengali"
# Remove the Arabic row
languages(browser).find { |l| l.language == 'Arabic' }.remove
HTML Example
The following HTML was used for the above examples.
<html>
<body>
<div id="d78Pt30" class="gridMaxHeight z-grid">
<div class="z-grid-header">
<table>
<tr>
<td>Primary</td>
<td>Language</td>
<td>Add</td>
</tr>
</table>
</div>
<div class="z-grid-header-border"></div>
<div class="z-grid-body">
<table>
<tr>
<td><input type="radio"></td>
<td><select><option selected="selected">Arabic</option><option>Bengali</option></select></td>
<td>Minus</td>
</tr>
<tr>
<td><input type="radio" checked></td>
<td><select><option>Arabic</option><option selected="selected">Bengali</option></select></td>
<td>Minus</td>
</tr>
</table>
</div>
</div>
</body>
</html>

How to parse text within span tags using Nokogiri

I want to build an application displaying artists from a popular venue and want to extract only the artist's name.
Here is my code:
data.css('.headliner').each do |artist|
puts artist
end
It's currently returning:
<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span>
<span class="headliner">Hozier</span>
<span class="headliner"><span class="prepend"><i>KFOG presents</i></span><br>Ben Howard<br><span class="append"><i>with special guest</i><br></span></span>
<span class="headliner">Dr. Dog</span>
Some elements have more than one span tag and I'm having trouble getting the data I want. All I want returned is the artist's name such as 'London Grammar', 'Hozier', 'Ben Howard', and 'Dr. Dog'.
Currently, when I run artist.text it returns "Rescheduled DateLondon Grammar" and so on.
<table class="concert_calendar" cellspacing="0" width="720" style="margin-top:35px;">
<tbody><tr><td class="noborder"><img src="images/title_date2.gif" alt="Date"></td>
<td class="noborder" colspan="2"><img src="images/title_show2.gif" alt="Show"></td>
<td class="noborder"><img src="images/title_time2.gif" alt="Time"></td>
<td class="noborder"><img src="images/title_tickets2.gif" alt="Tickets"></td></tr>
<tr><td colspan="5" class="noborder"><hr size="1" color="#550818" noshade="" style="margin:0px; padding:0px;"></td></tr>
<tr><td style="width:100px;" class="">Saturday,<br>February 7</td>
<td style="width:115px;" valign="top" class=""><img src="http://www.apeconcerts.com/concertimages/LondonGrammar_100.jpg" alt="London Grammar"></td>
<td valign="top" style="width:345px; padding-right:10px;" class="">
<a href="popartist.php?cID=4600&KeepThis=true&TB_iframe=true&height=600&width=475" style="text-decoration:none;" class="thickbox">
<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span></a>
<div><span class="warmup">Until The Ribbon Breaks</span><br>
<span class="warmup"></span></div></td>
<td style="width:80px;">show<br>8:00PM</td>
<td style="width:80px;">
<img src="images/cal_soldout.gif" alt="SOLD OUT - Thank you!"> </td></tr>
<tr><td style="width:100px;">Tuesday,<br>February 10</td>
<td style="width:115px;" valign="top"><img src="http://www.apeconcerts.com/concertimages/Hozier_1001.jpg" alt="Hozier"></td>
<td valign="top" style="width:345px; padding-right:10px;" class="">
<a href="popartist.php?cID=4733&KeepThis=true&TB_iframe=true&height=600&width=475" style="text-decoration:none;" class="thickbox">
<span class="headliner">Hozier</span></a>
<div class=""><span class="warmup">Ásgeir</span><br>
<span class="warmup"></span></div></td>
<td style="width:80px;">show<br>8:00PM</td>
<td style="width:80px;">
<img src="images/cal_soldout.gif" alt="SOLD OUT - Thank you!"> </td></tr>
All I want returned is the artist's name such as 'London Grammar',
'Hozier', 'Ben Howard', and 'Dr. Dog'
Here's one way:
require 'nokogiri'
html = %q{
<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span>
<span class="headliner">Hozier</span>
<span class="headliner"><span class="prepend"><i>KFOG presents</i></span><br>Ben Howard<br><span class="append"><i>with special guest</i><br></span></span>
<span class="headliner">Dr. Dog</span>
}
html_doc = Nokogiri::HTML(html)
headliners = html_doc.css('.headliner')
headliners.each do |headliner|
headliner.css('i').each do |i|
i.content = ''
end
puts headliner.text
end
--output:--
London Grammar
Hozier
Ben Howard
Dr. Dog
If all you're trying to do is remove the <i> tag's content, then just remove the tags entirely:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span>
<span class="headliner">Hozier</span>
<span class="headliner"><span class="prepend"><i>KFOG presents</i></span><br>Ben Howard<br><span class="append"><i>with special guest</i><br></span></span>
<span class="headliner">Dr. Dog</span>
EOT
doc.search('.headliner i').map(&:remove)
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <span class="headliner"><span class="prepend"></span><br>London Grammar</span>
# >> <span class="headliner">Hozier</span>
# >> <span class="headliner"><span class="prepend"></span><br>Ben Howard<br><span class="append"><br></span></span>
# >> <span class="headliner">Dr. Dog</span>
# >> </body></html>
At that point it's really easy to iterate over the .headliner tags and output their content:
puts doc.search('.headliner').map(&:text)
# >> London Grammar
# >> Hozier
# >> Ben Howard
# >> Dr. Dog
I'd probably do it a little differently for a big page with a lot of tags matching .headliner, but this is sufficient for normal pages.
See "How to avoid joining all text from Nodes when scraping" also.

Getting attributed html element

I'm trying to get table with content of MMEL codes from this site and I'm trying to accomplish it with CSS Selectors.
What I've got so far is:
require_relative 'sources/Downloader'
require 'nokogiri'
html_content = Downloader.download_page('http://www.s-techent.com/ATA100.htm')
parsed_html = Nokogiri::HTML(html_content)
tmp = parsed_html.css("tr[*]")
puts tmp.text
And I'm getting an error while trying to get this tr with an attribute. How can I get this table in a simple form? I want to parse it to JSON. It would be nice to get it in sections and process them in an .each block.
EDIT:
It'd be nice if I could get things in blocks like this (look at the page's source):
<TR><TD WIDTH="10%" VALIGN="TOP" ROWSPAN=5>
<B><FONT FACE="Arial" SIZE=2><P ALIGN="CENTER">11</B></FONT></TD>
<TD WIDTH="40%" VALIGN="TOP" COLSPAN=2>
<B><FONT FACE="Arial" SIZE=2><P>PLACARDS AND MARKINGS</B></FONT></TD>
<TD WIDTH="50%" VALIGN="TOP">
<FONT FACE="Arial" SIZE=2><P ALIGN="LEFT">All procurable placards, labels, etc., shall be included in the illustrated Parts Catalog. They shall be illustrated, showing the part number, Legend and Location. The Maintenance Manual shall provide the approximate Location (i.e., FWD -UPPER -RH) and illustrate each placard, label, marking, self -illuminating sign, etc., required for safety information, maintenance significant information or by government regulations. Those required by government regulations shall be so identified.</FONT></TD>
</TR>
This should print all those TR's from source at line 96. There are three tables in that page and table[1] has all the text you needed:
require 'nokogiri'
require 'open-uri'
# On Ruby 3.0+, open-uri needs URI.open; Kernel#open no longer fetches URLs
doc = Nokogiri::HTML(URI.open('http://www.s-techent.com/ATA100.htm'))
doc.css("table")[1].css("tr").each do |i|
puts i #=> prints the exact html between TR tags (including)
puts i.text #=> prints the text
end
For instance:
puts doc.css("table")[1].css("tr")[2]
prints the following:
<tr>
<td valign="TOP" colspan="3">
<b><font face="Arial" size="2"><p align="CENTER">GROUP DEFINITION - AIRCRAFT</p></font></b>
</td>
<td valign="TOP">
<font face="Arial" size="2"><p align="LEFT">The complete operational unit. Includes dimensions and
areas, lifting and shoring, leveling and weighing, towing and taxiing, parking and mooring,
required placards, servicing.</p></font>
</td>
</tr>
You could do the same using xpath also:
Below is the content from the first table of the webpage given in the post by OP:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri.HTML(URI.open('http://www.s-techent.com/ATA100.htm'))
doc.xpath('(//table)[1]/tr').each do |tr|
puts tr.to_html(:encoding => 'utf-8')
end
Output:
<tr>
<td width="33%" valign="MIDDLE" colspan="2">
<p><img src="S-Tech-Logo-Blue2.gif" width="274" height="127"></p>
</td>
<td width="67%" valign="MIDDLE">
<b><i><font face="Arial" color="#0000ff">
<p align="CENTER"><big>AIRCRAFT PARTS MANUFACTURING ASSISTANCE (PMA)</big><br><big>DAR SERVICES</big></p></font></i></b>
</td>
</tr>
Now, if you want to collect the last table rows, then do:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri.HTML(URI.open('http://www.s-techent.com/ATA100.htm'))
p doc.xpath('(//table)[3]/tr').to_a.size # => 1
doc.xpath('(//table)[3]/tr').each do |tr|
puts tr.to_html(:encoding => 'utf-8')
end
Output:
<tr>
<td width="40%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2" color="#0000ff">149 AZALEA CIRCLE • LIMERICK, PA 19468-1330</font></b></p>
</td>
<td width="30%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2" color="#0000ff">610-495-6898 (Office) • 484-680-0507 (Cell)</font></b></p>
</td>
<td width="110%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2">E-mail S-Tech</font></b></p>
</td>
</tr>

xpath selecting text from link in <td> & text from <td>

I have the following code which works very well:
rows = diary_HTML.xpath('//*[@id="main"]/div[2]/table/tbody/tr')
food_diary = rows.collect do |row|
detail = {}
[
["Food", 'td[1]/text()'],
["Calories", 'td[2]/text()'],
["Carbs", 'td[3]/text()'],
["Fat", 'td[4]/text()'],
["Protein", 'td[5]/text()'],
["Cholest", 'td[6]/text()'],
].each do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
However the "Food" td does not only include text, but also a link from which I want to get the text.
I know I can use 'td[1]/a/text()' to get the link text, but how do I do both?
'td[1]/a/text()' or 'td[1]/text()'
EDITED - Added Snippet.
I am trying to include the <td class="first alt">Breakfast</td> cell of the <tr class="meal_header"> row, all the regular tds on the other rows, and to exclude td[1] on the bottom row.
<tr class="meal_header">
<td class="first alt">Breakfast</td>
<td class="alt">Calories</td>
<td class="alt">Carbs</td>
<td class="alt">Fat</td>
<td class="alt">Protein</td>
<td class="alt">Sodium</td>
<td class="alt">Sugar</td>
</tr>
<tr>
<td class="first alt">
<a onclick="showEditFood(3992385560);" href="#">Hovis (Uk - White Bread (40g) Toasted With Flora Light Marg, 2 slice</a> </td>
<td>262</td>
<td>36</td>
<td>9</td>
<td>7</td>
<td>0</td>
<td>3</td>
</tr>
<tr class="bottom">
<td class="first alt" style="z-index: 10">
Add Food
<div class="quick_tools">
Quick Tools
<div id="quick_tools_0" class="quick_tools_options hidden">
<ul>
<li><a onclick="showLightbox(200, 250, '/food/quick_add?meal=0&date=2013-04-15'); return false;">Quick add calories</a></li>
<li>Remember meal</li>
<li>Copy yesterday</li>
<li>Copy from date</li>
<li>Copy to date</li>
</ul>
</div>
<div id="recent_meals_0" class="recent_meal_options hidden">
<ul id="recent_meal_options_0">
<li class="header">Copy from which date?</li>
<li>Sunday, April 14</li>
<li>Saturday, April 13</li>
</ul>
</div>
</div>
</td>
<td>285</td>
<td>39</td>
<td>9</td>
<td>10</td>
<td>0</td>
<td>3</td>
<td></td>
The short answer is: use Nokogiri::XML::Element#text, it will give the text of the element plus subelements (your a for example).
You can also clean that code up quite a bit:
keys = ["Food", "Calories", "Carbs", "Fat", "Protein", "Cholest"]
food_diary = rows.collect do |row|
Hash[keys.zip row.search('td').map(&:text)]
end
And as a final tip, avoid using xpath with html, css is so much nicer.
I think you can achieve this by altering the logic to look at element content when you don't have an explicit text() extraction in the xpath
rows = diary_HTML.xpath('//*[@id="main"]/div[2]/table/tbody/tr')
food_diary = rows.collect do |row|
detail = {}
[
["Food", 'td[1]'],
["Calories", 'td[2]/text()'],
["Carbs", 'td[3]/text()'],
["Fat", 'td[4]/text()'],
["Protein", 'td[5]/text()'],
["Cholest", 'td[6]/text()'],
].each do |name, xpath|
if xpath.include?('/text()')
detail[name] = row.at_xpath(xpath).to_s.strip
else
detail[name] = row.at_xpath(xpath).content.strip
end
end
detail
end
You could also add, e.g., a symbol to each array entry describing how the data should be extracted, and use a case block to handle each item according to that last step after evaluating the xpath.
Note you could also do what you want by walking the node structure returned by xpath recursively, but that seems like overkill if you just want to ignore markup, links etc.

Nokogiri next_element with filter

Let's say I've got an ill formed html page:
<table>
<thead>
<th class="what_I_need">Super sweet text<th>
</thead>
<tr>
<td>
I also need this
</td>
<td>
and this (all td's in this and subsequent tr's)
</td>
</tr>
<tr>
...all td's here too
</tr>
<tr>
...all td's here too
</tr>
</table>
On BeautifulSoup, we were able to get the <th> and then call findNext("td"). Nokogiri has the next_element call, but that might not return what I want (in this case, it would return the tr element).
Is there a way to filter the next_element call of Nokogiri? e.g. next_element("td")?
EDIT
For clarification, I'll be looking at many sites, most of them ill formed in different ways.
For instance, the next site might be:
<table>
<th class="what_I_need">Super sweet text<th>
<tr>
<td>
I also need this
</td>
<td>
and this (all td's in this and subsequent tr's)
</td>
</tr>
<tr>
...all td's here too
</tr>
<tr>
...all td's here too
</tr>
</table>
I can't assume any structure other than there will be trs below the item that has the class what_I_need
First, note that your closing th tag is malformed: <th>. It should be </th>. Fixing that helps.
One way to do it is to use XPath to navigate to it once you've found the th node:
require 'nokogiri'
html = '
<table>
<thead>
<th class="what_I_need">Super sweet text<th>
</thead>
<tr>
<td>
I also need this
</td>
<tr>
</table>
'
doc = Nokogiri::HTML(html)
th = doc.at('th.what_I_need')
th.text # => "Super sweet text"
td = th.at('../../tr/td')
td.text # => "\n I also need this\n "
This is taking advantage of Nokogiri's ability to use either CSS accessors or XPath, and to do it pretty transparently.
Once you have the <th> node, you could also navigate using some of Node's methods:
th.parent.next_element.at('td').text # => "\n I also need this\n "
One more way to go about it is to start at the top of the table and look down:
table = doc.at('table')
th = table.at('th')
th.text # => "Super sweet text"
td = table.at('td')
td.text # => "\n I also need this\n "
If you need to access all <td> tags within a table you can iterate over them easily:
table.search('td').each do |td|
# do something with the td...
puts td.text
end
If you want the contents of all <td> by their containing <tr> iterate over the rows then the cells:
table.search('tr').each do |tr|
cells = tr.search('td').map(&:text)
# do something with all the cells
end
