Nokogiri parsing HTML - ruby

I am using Nokogiri to parse my HTML code. My HTML looks like this:
<table>
<tr>
<td>
<p>Important Preferences</p>
To see as much as possible
<br />Relaxation
<br />Quality of accommodation
<br />Quality of activities
<br />Independence & flexibility
<br />Safety & security
</td>
<td>
<p>Budget Preferences</p>
4000 to 5000 USD per person
<br />5000 to 6000 USD per person
<br />Above 6000 USD per person
</td>
</tr>
</table>
I am trying to make a hash from it, which would be like this:
{
"Important Preferences" => "To see as much as possible, Relaxation, Quality of accommodation, Quality of activities, Independence & flexibility, Safety & security",
"Budget Preferences" => "4000 to 5000 USD per person, 5000 to 6000 USD per person, Above 6000 USD per person"
}
I tried:
params = {}
Nokogiri::HTML("my HTML pls see above").css("td p").each do |item|
params.merge!({item.text => item.next.text})
end
But I couldn't collect values inside <BR>.
My result was:
{
"Important Preferences" => "To see as much as possible",
"Budget Preferences" => "4000 to 5000 USD per person"
}

First, find all <td> elements with xpath('//td'). Then, for each one, iterate over its children and collect the content of every child that is a Nokogiri::XML::Text node (you don't want to collect the <br> tags):
doc = Nokogiri::HTML.parse(html)
h = {}
doc.xpath('//td').each do |td|
p = td.at_xpath('p')
a = []
td.children.each do |child|
if Nokogiri::XML::Text === child
t = child.text.strip
a << t unless t.empty?
end
end
h[p.text] = a.join(', ')
end
result:
{"Important Preferences"=>"To see as much as possible, Relaxation, Quality of accommodation, Quality of activities, Independence & flexibility, Safety & security",
"Budget Preferences"=>"4000 to 5000 USD per person, 5000 to 6000 USD per person, Above 6000 USD per person"}
or in a more compact form, without explicit loops:
doc = Nokogiri::HTML.parse(html)
h = {}
doc.xpath('//td').each do |td|
h[td.at_xpath('p').text] = td.children
.select{|x| Nokogiri::XML::Text === x && !x.text.strip.empty?}
.map{|x| x.text.strip}.join(', ')
end

You basically want all the siblings of td p.
You can get the list of the parent's children and remove the p itself:
item.parent.children.to_a - [item]

I'd do it like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<table>
<tr>
<td>
<p>Important Preferences</p>
To see as much as possible
<br />Relaxation
<br />Quality of accommodation
<br />Quality of activities
<br />Independence & flexibility
<br />Safety & security
</td>
<td>
<p>Budget Preferences</p>
4000 to 5000 USD per person
<br />5000 to 6000 USD per person
<br />Above 6000 USD per person
</td>
</tr>
</table>
EOT
doc.search('td').map { |td|
key = td.at('p').text
[
key,
td.text.sub(/#{key}/, '').lstrip.gsub(/\n +/, ', ')
]
}.to_h
# => {"Important Preferences"=>"To see as much as possible, Relaxation, Quality of accommodation, Quality of activities, Independence & flexibility, Safety & security, ", "Budget Preferences"=>"4000 to 5000 USD per person, 5000 to 6000 USD per person, Above 6000 USD per person, "}
If you're on an older version of Ruby that doesn't have to_h, use:
Hash[
doc.search('td').map { |td|
key = td.at('p').text
[
key,
td.text.sub(/#{key}/, '').lstrip.gsub(/\n +/, ', ')
]
}
]
# => {"Important Preferences"=>"To see as much as possible, Relaxation, Quality of accommodation, Quality of activities, Independence & flexibility, Safety & security, ", "Budget Preferences"=>"4000 to 5000 USD per person, 5000 to 6000 USD per person, Above 6000 USD per person, "}
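The trailing ", " in the results above most likely comes from the whitespace before the closing </td>. One way around it, sketched here under the assumption that each value sits on its own line of the cell text, is to split the text into lines rather than substitute the newlines. The string cell_text is a hand-written stand-in for what td.text returns, and cell_to_pair is a hypothetical helper name, so the sketch runs without Nokogiri:

```ruby
# Stand-in for td.text on the first cell (assumed shape: one value per line).
cell_text = <<~TEXT

  Important Preferences
  To see as much as possible
  Relaxation
  Quality of accommodation

TEXT

# Hypothetical helper: turn one cell's text into a [key, joined values] pair.
def cell_to_pair(text)
  # Strip each line and drop the blank ones, so no trailing separator is left.
  key, *values = text.lines.map(&:strip).reject(&:empty?)
  [key, values.join(', ')]
end

pair = cell_to_pair(cell_text)
result = [pair].to_h
```

With Nokogiri loaded, the same helper could be fed from doc.search('td').map { |td| cell_to_pair(td.text) }.to_h.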

Related

How can I add paging for results in a table created in Classic ASP?

I have some code done in VBScript that creates a table. Specifically, the code pulls information from a database and then loops through the result adding them to a table. The problem is that there are 14,000 rows in this table. Every time this page tries to load, I get a 500 Internal Server error which I assume is due to lack of memory.
For the loop, I have this:
<%
fHideNavBar = False
fHideNumber = False
fHideRequery = False
fHideRule = False
stQueryString = ""
fEmptyRecordset = False
fFirstPass = True
fNeedRecordset = False
fNoRecordset = False
tBarAlignment = "Left"
tHeaderName = "DataRangeHdr1"
tPageSize = 0
tPagingMove = ""
tRangeType = "Text"
tRecordsProcessed = 0
tPrevAbsolutePage = 0
intCurPos = 0
intNewPos = 0
fSupportsBookmarks = True
fMoveAbsolute = False
If IsEmpty(Session("DataRangeHdr1_Recordset")) Then
fNeedRecordset = True
Else
If Session("DataRangeHdr1_Recordset") Is Nothing Then
fNeedRecordset = True
Else
Set DataRangeHdr1 = Session("DataRangeHdr1_Recordset")
End If
End If
If fNeedRecordset Then
Set DataConn = Server.CreateObject("ADODB.Connection")
DataConn.Open "DSN=MYDSN","MyUserName","MyPassword"
Set cmdTemp = Server.CreateObject("ADODB.Command")
Set DataRangeHdr1 = Server.CreateObject("ADODB.Recordset")
cmdTemp.CommandText = "SELECT PHONE, FAX, FIRM, ID FROM NNYBEA ORDER BY ID"
cmdTemp.CommandType = 1
Set cmdTemp.ActiveConnection = DataConn
DataRangeHdr1.Open cmdTemp, , 0, 1
End If
On Error Resume Next
If DataRangeHdr1.BOF And DataRangeHdr1.EOF Then fEmptyRecordset = True
On Error Goto 0
If Err Then fEmptyRecordset = True
If Not IsEmpty(Session("DataRangeHdr1_Filter")) And Not fEmptyRecordset Then
DataRangeHdr1.Filter = Session("DataRangeHdr1_Filter")
If DataRangeHdr1.BOF And DataRangeHdr1.EOF Then fEmptyRecordset = True
End If
If fEmptyRecordset Then
fHideNavBar = True
fHideRule = True
End If
Do
If fEmptyRecordset Then Exit Do
If Not fFirstPass Then
DataRangeHdr1.MoveNext
Else
fFirstPass = False
End If
If DataRangeHdr1.EOF Then Exit Do
%>
<tr>
<td><p align="center"><%= DataRangeHdr1("FIRM") %></td>
<td><p align="center"><%= DataRangeHdr1("PHONE") %></td>
<td><p align="center"><%= DataRangeHdr1("FAX") %></td>
<%end if%>
</tr>
<%
Loop%>
Now, I believe that the programmer before me essentially copied the code from this website: http://www.nnybe.com/board%20members/DEFAULT.ASP
In fact, I actually changed the column names in my loop to match the website, since it was so similar (my real column names are different). After the loop, the code I have is as follows:
</TABLE>
<%
If tRangeType = "Table" Then Response.Write "</TABLE>"
If tPageSize > 0 Then
If Not fHideRule Then Response.Write "<HR>"
If Not fHideNavBar Then
%>
<TABLE WIDTH=100% >
<TR>
<TD WIDTH=100% >
<P ALIGN=<%= tBarAlignment %> >
<FORM <%= "ACTION=""" & Request.ServerVariables("PATH_INFO") & stQueryString & """" %> METHOD="POST">
<INPUT TYPE="Submit" NAME="<%= tHeaderName & "_PagingMove" %>" VALUE=" << ">
<INPUT TYPE="Submit" NAME="<%= tHeaderName & "_PagingMove" %>" VALUE=" < ">
<INPUT TYPE="Submit" NAME="<%= tHeaderName & "_PagingMove" %>" VALUE=" > ">
<% If fSupportsBookmarks Then %>
<INPUT TYPE="Submit" NAME="<%= tHeaderName & "_PagingMove" %>" VALUE=" >> ">
<% End If %>
<% If Not fHideRequery Then %>
<INPUT TYPE="Submit" NAME="<% =tHeaderName & "_PagingMove" %>" VALUE=" Requery ">
<% End If %>
</FORM>
</P>
</TD>
<TD VALIGN=MIDDLE ALIGN=RIGHT>
<FONT SIZE=2>
<%
If Not fHideNumber Then
If tPageSize > 1 Then
Response.Write "<NOBR>Page: " & Session(tHeaderName & "_AbsolutePage") & "</NOBR>"
Else
Response.Write "<NOBR>Record: " & Session(tHeaderName & "_AbsolutePage") & "</NOBR>"
End If
End If
%>
</FONT>
</TD>
</TR>
</TABLE>
<%
End If
End If
%>
</TABLE>
I'm guessing from the < and > around the PagingMove part, this is supposed to allow paging. However, I'm not even seeing this on my page. I don't know if the code on the link above works on their website, but for my own website I'd ask:
How can I modify this code to provide an option to click through pages of the data result so the server doesn't run out of memory?
If there is a more elegant solution to this that can accomplish the same thing, I'd appreciate that as well!!!
In your SQL you could add a LIMIT with an offset (this is MySQL-style syntax; other databases page differently):
SELECT PHONE, FAX, FIRM, ID FROM NNYBEA ORDER BY ID LIMIT 0,10 ' Results 1 to 10
SELECT PHONE, FAX, FIRM, ID FROM NNYBEA ORDER BY ID LIMIT 10,10 ' 11 - 20
SELECT PHONE, FAX, FIRM, ID FROM NNYBEA ORDER BY ID LIMIT 20,10 ' 21 - 30
...
If you're using MySQL you can use...
SELECT SQL_CALC_FOUND_ROWS PHONE, FAX, FIRM, ID FROM NNYBEA ORDER BY ID LIMIT 0,10
... to get a total count of the results and calculate the number of page links to display:
(total_results/results_per_page) ' and round up.
Then link to the pages below the results table and pass the page numbers as a query string:
default.asp?page=1
default.asp?page=2
default.asp?page=3
...
Have some code at the top of your page that gets the requested page number and calculates the correct offset value:
<%
Const results_per_page = 10
Dim limit_offset, page_num
limit_offset = 0 ' default
page_num = request.querystring("page")
if isNumeric(page_num) then
page_num = int(page_num)
if page_num > 0 then
limit_offset = (page_num-1)*results_per_page
else
page_num = 1 ' default
end if
else
page_num = 1 ' default
end if
%>
Finally, apply the limit offset to your SQL:
cmdTemp.CommandText = "SELECT PHONE, FAX, FIRM, ID FROM NNYBEA ORDER BY ID LIMIT " & limit_offset & "," & results_per_page
You could also use GetRows() to convert the recordset to a 2D array and apply a limit when looping
<%
Dim r, rs_loops, theData
theData = DataRangeHdr1.GetRows()
' index of the last row on this page; the For loop is inclusive, so subtract 1
rs_loops = (page_num * results_per_page) - 1
if rs_loops > uBound(theData,2) then rs_loops = uBound(theData,2)
for r = limit_offset to rs_loops
' output one row; column indexes follow the SELECT order: 0 = PHONE, 1 = FAX, 2 = FIRM
%>
<tr>
<td><p align="center"><%= theData(2,r) %></td>
<td><p align="center"><%= theData(0,r) %></td>
<td><p align="center"><%= theData(1,r) %></td>
</tr>
<%
next
%>
But this would mean storing large amounts of unseen data in memory. Using a LIMIT offset in the SQL would make more sense.
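The page arithmetic described above (round the page count up, turn a 1-based page number into a 0-based offset) can be checked in isolation. This is only a sketch, written in Ruby rather than VBScript, and the names page_count and limit_offset are illustrative. The validation is slightly stricter than the isNumeric test above: anything that is not a whole positive number falls back to page 1.

```ruby
RESULTS_PER_PAGE = 10

# Number of page links to show: total / per-page, rounded up.
def page_count(total_results, per_page = RESULTS_PER_PAGE)
  (total_results.to_f / per_page).ceil
end

# Zero-based row offset for a 1-based page number taken from the query string.
# Anything that is not a positive whole number falls back to page 1.
def limit_offset(page_param, per_page = RESULTS_PER_PAGE)
  page = page_param.to_s.match?(/\A\d+\z/) ? page_param.to_i : 1
  page = 1 if page < 1
  (page - 1) * per_page
end

pages  = page_count(14_000)  # page links needed for the 14,000 rows
offset = limit_offset("3")   # offset to use for default.asp?page=3
```

The offset is then the first number in LIMIT offset,count, exactly as in the CommandText line above.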

How can I search a table faster?

I am trying to search a table for a specific value using Ruby and selenium-webdriver. I have a method that works, but it takes a lot of time for some reason. It is a one-row table, and the page HTML looks like this:
<div id="permitGridContainer">
<table id="calendar" class="items" style="width:430px;" name="calendar">
<thead>
<tbody>
<tr>
<td id="avail1" class="status r slct" onmouseout="return nd();" onmouseover="return overlib("Available Quota<br>River Launches : 0 of 4");">
<div class="permitStatus">R</div>
</td>
<td id="avail2" class="status r" onmouseout="return nd();" onmouseover="return overlib("Available Quota<br>River Launches : 0 of 4");">
<div class="permitStatus">R</div>
</td>
<td id="avail3" class="status a" onmouseout="return nd();" onmouseover="return overlib("Available Quota<br>River Launches : 89 of 99");">
<a onclick="javascript:setNewArrivalDate("Sun Sep 06 2015", 2);return false;" href="#">
A
<br>
<small>89</small>
</a>
</td>
<td id="avail4" class="status a" onmouseout="return nd();" onmouseover="return overlib("Available Quota<br>River Launches : 97 of 99");">
</tr>
</tbody>
</table>
</div>
... I shortened the table; it has 14 columns.
I am looking for a column that has an item available, and I am checking the class for this, but the text also changes, so there are other things I could look for.
This is the code I am using, but it is visibly slow. I used puts statements to see the progress. My sense is that it has to do with the time spent accessing each element, so I was hoping there is a better way to process the table quickly. Thank you.
for j in 1..days_to_check[i]
check_avail = driver.find_element(id: "avail#{j}")
check_availclass = check_avail.attribute("class")
if check_availclass == "status a" or check_availclass == "status a slct"
#process if
end
end
Based on your comment, I would suggest the following XPath. I often find it easier to write a more specific XPath than to loop through the HTML table:
//td[(@class='status a') or (@class='status A')]
This XPath matches the td elements whose class attribute is exactly "status a" or "status A".

Advice for replacing img tags with text in Ruby?

I'm trying to work out how to store an html table of drive stats in a database, but the developers have been a bit clever, and started using gifs to represent pass/fail/health stats
Here's a snippet of what I've got:
<tr class="status">
<td class="status"><img border="0" src="/tick_green.gif"></td>
<td class="status">8</td>
<td class="status">Ready</td>
<td class="status"><img border="0" src="/bar10.gif"></td>
<td class="status">SEAGATE ST3146807FC</td>
<td class="status">10000 RPM</td>
<td class="status">3HY61AG9</td>
<td class="status">XR12</td>
<td class="status">286749488</td>
<td class="status"> 28.0°C</td>
<td class="status" style="background-color: #00fa00"> 
</td>
And here's some of the ruby that I've written so far to strip the tags:
table = page.parser.xpath('//table/caption[contains(.,"Drive")]/..')
table.xpath('//table//tr').each do |row|
row.xpath('td').each do |cell|
puts cell.to_html.gsub(/<a[^>]+>/,'').gsub(/<td[^>]+>/,'').gsub(/<\/td[^>]*>/,'').gsub(/<\/a[^>]*>/,'')
#puts cell.text
end
end
I can now get semi-rational output
<img border="0" src="/tick_green.gif">
15
Ready
<img border="0" src="/bar10.gif">
SEAGATE ST3146807FC
10000 RPM
3HY61ASW
XR12
286749488
29.0°C
 
But I want to replace a couple of other cell elements with other bits.
For example, the tick_green.gif can also be '/cross_red.gif' or '/caution.gif', which I want to replace with regular text. Likewise, I want to replace the bar10.gif img with just the text '10'.
Is it best to come up with a whole bunch of values for all of my special cases?
I'd do some 'gsub'iing.
E.g.:
example = <<-STRING
<img border="0" src="/tick_green.gif">
15
Ready
<img border="0" src="/bar10.gif">
SEAGATE ST3146807FC
10000 RPM
3HY61ASW
XR12
286749488
29.0°C
 
STRING
replace = Hash.new("#unknown")
replace['tick_green.gif'] = "[OK]"
replace['bar10.gif'] = "[10]"
regex = /<img [^>]* src="\/(.*)">/
result = example.gsub(regex) { replace[$1] }
I'd like to replace the $1 with a named backreference, but I don't know how yet.
http://ruby-doc.org/core-1.9.3/String.html#method-i-gsub
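One way to use a named group here, as a sketch: name the capture with (?<file>...) and read it from the match data inside the block. The group name file is an arbitrary choice for this illustration, the sample input is shortened, and the pattern is tightened slightly to [^"]* so that it stops at the closing quote:

```ruby
example = <<~STRING
  <img border="0" src="/tick_green.gif">
  15
  Ready
  <img border="0" src="/bar10.gif">
STRING

replace = Hash.new("#unknown")
replace['tick_green.gif'] = "[OK]"
replace['bar10.gif'] = "[10]"

# (?<file>...) names the capture. Inside the gsub block the current match is
# available as Regexp.last_match (also known as $~), indexed by that name.
regex = /<img [^>]*src="\/(?<file>[^"]*)">/
result = example.gsub(regex) { replace[Regexp.last_match[:file]] }
```

For the full sample input this produces the same output as the $1 version.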
edit: result from above
[OK]
15
Ready
[10]
SEAGATE ST3146807FC
10000 RPM
3HY61ASW
XR12
286749488
29.0°C
 
A case statement will clean that up a little bit:
row.css('td').each do |td|
img = td.at('img')
puts case
when img && img[:src][/bar(\d+)\.gif/] then $1
when img && img[:src][/tick_green/] then 'ok'
else td.text.strip
end
end

Can't assign value to Variable: undefined method `[]' for nil:NilClass (NoMethodError)

I am completely stumped on this one.
I have the following code:
puts block.at_xpath("*/img")["width"].to_i
but when I change it to
width = block.at_xpath("*/img")["width"].to_i
I get this error:
NokogiriTUT.rb:70:in `blockProcessor': undefined method `[]' for nil:NilClass (NoMethodError)
When I have the puts in there it returns the expected value.
Update:
def blockProcessor(block)
header = block.xpath('td[@class="default"]/*/span[@class="comhead"]')
array = header.text.split
if array[0] != nil #checks to make sure we aren't at the top of the parent list
### Date and Time ###
if array[2] == 'hours' || array[2] == 'minutes'
date = Time.now
else
days = (array[1].to_i * 24 * 60 * 60)
date = Time.now - days
end
##Get Comment##
comment = block.at_xpath('*/span[@class="comment"]')
hash = comment.text.hash
#puts hash
##Manage Parent Here##
width = block.at_xpath("*/img")["width"].to_i
prevlevel = @parent_array[@parent_array.length-1][1]
if width == 0 #has parents
parentURL = header.xpath('a[@href][3]').to_s
parentURL = parentURL[17..23]
parentURL = "http://news.ycombinator.com/item?id=#{parentURL}"
parentdoc = Nokogiri::HTML(open(parentURL))
a = parentdoc.at_xpath("//html/body/center/table/tr[3]/td/table/tr")
nodeparent = blockProcessor(a)
@parent_array = []
node = [hash, width, nodeparent] #id, level, parent
@parent_array.push node
elsif width > prevlevel
nodeparent = @parent_array[@parent_array.length-1][0]
node = [hash, width, nodeparent]
@parent_array.push node
elsif width == prevlevel
nodeparent = @parent_array[@parent_array.length-1][2]
node = [hash, width, nodeparent]
@parent_array.push node
elsif width < prevlevel
until prevlevel == w do
@parent_array.pop
prevlevel = @parent_array[@parent_array.length-1][1]
end
nodeparent = @parent_array[@parent_array.length-1][2]
node = [hash, width, nodeparent]
@parent_array.push node
end
puts "Author: #{array[0]} with hash #{hash} with parent: #{nodeparent}"
##Handles Any Parents of Existing Comments ##
return hash
end
end
end
Here is the block that it is acting on.
<tr>
<td><img src="http://ycombinator.com/images/s.gif" height="1" width="0"></td>
<td valign="top"><center>
<a id="up_3004849" href="vote?for=3004849&dir=up&whence=%2f%78%3f%66%6e%69%64%3d%34%6b%56%68%71%6f%52%4d%38%44"><img src="http://ycombinator.com/images/grayarrow.gif" border="0" vspace="3" hspace="2"></a><span id="down_3004849"></span>
</center></td>
<td class="default">
<div style="margin-top:2px; margin-bottom:-10px; "><span class="comhead">patio11 12 days ago | link | parent | on: Ask HN: What % of your job interviewees pass FizzB...</span></div>
<br><span class="comment"><font color="#000000">Every time FizzBuzz problems come up among engineers, people race to solve them and post their answers, then compete to see who can write increasingly more nifty answers for a question which does not seek niftiness at all.<p>I'm all for intellectual gamesmanship, but these are our professional equivalent of a doctor being asked to identify the difference between blood and water. You can do it. <i>We know</i>. Demonstrating that you can do it is not the point of the exercise. We do it to have a cheap-to-administer test to exclude people-who-cannot-actually-program-despite-previous-job-titles from the expensive portions of the hiring process.</p></font></span><p><font size="1"><u>reply</u></font></p>
</td>
</tr>
Your basic problem is that you don't understand XPath. (You are in good company there; XPath is quite confusing.) Your selectors simply don't match what you think they match. In particular, the one that blows up
*/img
should be
//img
or something like that.
Now, because the xpath selector doesn't match anything, the value of this Ruby statement
block.at_xpath("*/img")
is nil. And nil doesn't support [], so when you try to call ["width"] on it, Ruby complains with an undefined method `[]' for nil:NilClass error.
And as for why it only blows up when you assign it to a variable... yeah, that's not actually what's happening. You probably changed something else too.
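The nil case can also be guarded explicitly. A minimal sketch, using a plain Hash as a stand-in for the Nokogiri element (both respond to []) and a hypothetical helper name width_of:

```ruby
# Return the "width" attribute as an Integer, or nil when no node matched.
# `node` is whatever at_xpath returned: an element-like object or nil.
def width_of(node)
  # &. (safe navigation, Ruby 2.3+) skips the call when the receiver is nil
  # instead of raising NoMethodError.
  node&.[]("width")&.to_i
end

matched   = { "width" => "40" }  # stand-in for a matched <img> element
unmatched = nil                  # what at_xpath returns when nothing matches

width_of(matched)    # an Integer
width_of(unmatched)  # nil, and no exception
```

The caller then has to decide what a missing width means (skip the block, treat it as 0, or raise a clearer error); the sketch only moves that decision to one place.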
And now, please allow me to make some other hopefully constructive code criticisms:
Your question was apparently designed to make it difficult to answer. In the future, please isolate the code in question, don't just paste in your whole homework assignment (or whatever this screen scraper is for).
It would be extra great if you made it into a single runnable Ruby file that we can execute verbatim on our computers, e.g.:
require "nokogiri"
doc = Nokogiri.parse <<-HTML
<tr>
<td><img src="http://ycombinator.com/images/s.gif" height="1" width="0"></td>
<td valign="top"><center>
<a id="up_3004849" href="vote?for=3004849&dir=up&whence=%2f%78%3f%66%6e%69%64%3d%34%6b%56%68%71%6f%52%4d%38%44"><img src="http://ycombinator.com/images/grayarrow.gif" border="0" vspace="3" hspace="2"></a><span id="down_3004849"></span>
</center></td>
<td class="default">
<div style="margin-top:2px; margin-bottom:-10px; ">
<span class="comhead">
patio11 12 days ago | link | parent | on: Ask HN: What % of your job interviewees pass FizzB...
</span>
</div>
<br><span class="comment"><font color="#000000">Every time FizzBuzz problems come up among engineers, people race to solve them and post their answers, then compete to see who can write increasingly more nifty answers for a question which does not seek niftiness at all.<p>I'm all for intellectual gamesmanship, but these are our professional equivalent of a doctor being asked to identify the difference between blood and water. You can do it. <i>We know</i>. Demonstrating that you can do it is not the point of the exercise. We do it to have a cheap-to-administer test to exclude people-who-cannot-actually-program-despite-previous-job-titles from the expensive portions of the hiring process.</p></font></span><p><font size="1"><u>reply</u></font></p>
</td>
</tr>
HTML
width = doc.at_xpath("*/img")["width"].to_i
That way we can debug with our computers, not just with our minds.
You're writing Ruby now, not Java, so conform to Ruby's spacing and naming conventions: file names are snake_case, indentation is 2 spaces, no tabs, etc. It really is difficult to read code that's formatted wrong -- where "wrong" means "non-standard."
Everywhere you have one of those descriptive comments (### Date and Time ###) is an opportunity to extract a method (def date_and_time(array)) and make your code cleaner and easier to debug.
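As an illustration of that last point, the "### Date and Time ###" section could be pulled out roughly like this. It is a sketch that assumes the split comhead text has the shape ["patio11", "12", "days", "ago", ...]; the now parameter and the SECONDS_PER_DAY constant are additions made so the method can be run with a fixed time:

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical extraction of the "### Date and Time ###" section.
# `words` is the split comhead text; `now` is injectable for testing.
def date_and_time(words, now = Time.now)
  count, unit = words[1], words[2]
  # Ages given in hours or minutes are treated as "now", as in the original.
  return now if %w[hours minutes].include?(unit)
  # Everything else is treated as a number of days, as in the original.
  now - count.to_i * SECONDS_PER_DAY
end

now = Time.at(0).utc + 30 * SECONDS_PER_DAY
date_and_time(%w[patio11 12 days ago], now)  # 12 days before `now`
date_and_time(%w[someone 3 hours ago], now)  # `now` itself
```

Inside blockProcessor the section then shrinks to date = date_and_time(array).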

How do I parse a plain HTML table with Nokogiri?

I'd like to parse an HTML page with Nokogiri. There is a table in part of the page which does not use any specific ID. Is it possible to extract something like:
Today,3,455,34
Today,1,1300,3664
Today,10,100000,3444
Yesterday,3454,5656,3
Yesterday,3545,1000,10
Yesterday,3411,36223,15
From this HTML:
<div id="__DailyStat__">
<table>
<tr class="blh"><th colspan="3">Today</th><th class="r" colspan="3">Yesterday</th></tr>
<tr class="blh"><th>Qnty</th><th>Size</th><th>Length</th><th class="r">Length</th><th class="r">Size</th><th class="r">Qnty</th></tr>
<tr class="blr">
<td>3</td>
<td>455</td>
<td>34</td>
<td class="r">3454</td>
<td class="r">5656</td>
<td class="r">3</td>
</tr>
<tr class="bla">
<td>1</td>
<td>1300</td>
<td>3664</td>
<td class="r">3545</td>
<td class="r">1000</td>
<td class="r">10</td>
</tr>
<tr class="blr">
<td>10</td>
<td>100000</td>
<td>3444</td>
<td class="r">3411</td>
<td class="r">36223</td>
<td class="r">15</td>
</tr>
</table>
</div>
As a quick and dirty first pass I'd do:
html = <<EOT
<div id="__DailyStat__">
<table>
<tr class="blh"><th colspan="3">Today</th><th class="r" colspan="3">Yesterday</th></tr>
<tr class="blh"><th>Qnty</th><th>Size</th><th>Length</th><th class="r">Length</th><th class="r">Size</th><th class="r">Qnty</th></tr>
<tr class="blr">
<td>3</td>
<td>455</td>
<td>34</td>
<td class="r">3454</td>
<td class="r">5656</td>
<td class="r">3</td>
</tr>
<tr class="bla">
<td>1</td>
<td>1300</td>
<td>3664</td>
<td class="r">3545</td>
<td class="r">1000</td>
<td class="r">10</td>
</tr>
<tr class="blr">
<td>10</td>
<td>100000</td>
<td>3444</td>
<td class="r">3411</td>
<td class="r">36223</td>
<td class="r">15</td>
</tr>
</table>
</div>
EOT
# Today Yesterday
# Qnty Size Length Length Size Qnty
# 3 455 34 3454 5656 3
# 1 1300 3664 3545 1000 10
# 10 100000 3444 3411 36223 15
require 'nokogiri'
doc = Nokogiri::HTML(html)
Use CSS to find the start of the table, and define some places to hold the data we're capturing:
table = doc.at('div#__DailyStat__ table')
today_data = []
yesterday_data = []
Loop over the rows in the table, rejecting the headers:
table.search('tr').each do |tr|
next if (tr['class'] == 'blh')
Initialize arrays to capture the pertinent data from each row, selectively push the data into the appropriate array:
today_td_data = [ 'Today' ]
yesterday_td_data = [ 'Yesterday' ]
tr.search('td').each do |td|
if (td['class'] == 'r')
yesterday_td_data << td.text.to_i
else
today_td_data << td.text.to_i
end
end
today_data << today_td_data
yesterday_data << yesterday_td_data
end
And output the data:
puts today_data.map{ |a| a.join(',') }
puts yesterday_data.map{ |a| a.join(',') }
> Today,3,455,34
> Today,1,1300,3664
> Today,10,100000,3444
> Yesterday,3454,5656,3
> Yesterday,3545,1000,10
> Yesterday,3411,36223,15
Just to help you visualize what's going on, at the exit from the "tr" loop, the today_data and yesterday_data arrays are arrays-of-arrays looking like:
[["Today", 3, 455, 34], ["Today", 1, 1300, 3664], ["Today", 10, 100000, 3444]]
Alternatively, instead of looping over the "td" tags and sensing the class for the tag, I could have grabbed the contents of the "tr" and then used scan to grab the numbers and sliced the resulting array into "today" and "yesterday" arrays:
tr_data = tr.text.scan(/\d+/).map{ |i| i.to_i }
today_td_data = [ 'Today', *tr_data[0, 3] ]
yesterday_td_data = [ 'Yesterday', *tr_data[3, 3] ]
In real-world development, like at work, I'd use that instead of what I first wrote because it's succinct.
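The scan-and-slice step can be tried on its own. In this sketch row_text is a hand-written stand-in for what tr.text returns for the first data row (whitespace between the cells is assumed), so it runs without Nokogiri:

```ruby
# Stand-in for tr.text on the first data row.
row_text = "\n3\n455\n34\n3454\n5656\n3\n"

# Pull out every run of digits, in document order, as integers.
tr_data = row_text.scan(/\d+/).map(&:to_i)

# The first three numbers belong to "Today", the next three to "Yesterday".
today_td_data     = ['Today', *tr_data[0, 3]]
yesterday_td_data = ['Yesterday', *tr_data[3, 3]]

puts today_td_data.join(',')      # prints Today,3,455,34
puts yesterday_td_data.join(',')  # prints Yesterday,3454,5656,3
```

Note that /\d+/ would split values such as 1,300 or 3.5 into separate numbers, so this shortcut only fits tables whose cells hold plain integers, as they do here.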
And notice that I didn't use XPath. It's very doable in Nokogiri to use XPath and accomplish this, but for simplicity I prefer CSS accessors. XPath would have allowed accessing individual "td" tag contents, but it also would begin to look like line-noise, which is something we want to avoid when writing code, because it impacts maintenance. I could also have used CSS to drill down to the correct "td" tags like 'tr td.r', but I don't think it would improve the code, it would just be an alternate way of doing it.

Resources