Scraping multiple table row siblings with Nokogiri

I’m trying to parse a table with the following markup.
<table>
<tr class="athlete">
<td colspan="2" class="name">Alex</td>
</tr>
<tr class="run">
<td>5.00</td>
<td>10.00</td>
</tr>
<tr class="run">
<td>5.20</td>
<td>10.50</td>
</tr>
<tr class="end"></tr>
<tr class="athlete">
<td colspan="2" class="name">John</td>
</tr>
<tr class="run">
<td>5.00</td>
<td>10.00</td>
</tr>
<tr class="end"></tr>
</table>
I need to loop through each .athlete table row and get each sibling .run table row underneath until I reach the .end row. Then repeat for the next athlete and so on. Some .athlete rows have two .run rows, others have one.
Here’s what I have so far. I loop through the athletes:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://myurl.com"
doc = Nokogiri::HTML(URI.open(url))
doc.css(".athlete").each do |athlete|
puts athlete.at_css(".name").text
# Loop through the sibling .run rows until I reach the .end row
# output the value of the td’s in the .run row
end
I can’t figure out how to get each sibling .run row, and stop at the .end row. I feel like it would be easier if the table was better formed, but unfortunately I don’t have control of the markup. Any help would be greatly appreciated!

Voilà
require 'nokogiri'
doc = <<DOC
<table>
<tr class="athlete">
<td colspan="2" class="name">Alex</td>
</tr>
<tr class="run">
<td>5.00</td>
<td>10.00</td>
</tr>
<tr class="run">
<td>5.20</td>
<td>10.50</td>
</tr>
<tr class="end"></tr>
<tr class="athlete">
<td colspan="2" class="name">John</td>
</tr>
<tr class="run">
<td>5.00</td>
<td>10.00</td>
</tr>
<tr class="end"></tr>
</table>
DOC
doc = Nokogiri::HTML(doc)
# You can drop .end from the selector if those rows are always empty and not needed
trs = doc.css('.athlete, .run, .end').to_a
# This will return [['athlete', 'run', ...,'end'], ['athlete', 'run', ...,'end'] ...]
athletes = trs.slice_before{ |elm| elm.attr('class') =='athlete' }.to_a
athletes.map! do |athlete|
{
name: athlete.shift.at_css('.name').text,
runs: athlete
.select{ |tr| tr.attr('class') == 'run' }
.map{|run| run.text.to_f }
}
end
puts athletes.inspect
#[{:name=>"Alex", :runs=>[5.0, 5.2]}, {:name=>"John", :runs=>[5.0]}]
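The grouping step itself does not depend on Nokogiri. As a minimal sketch, assume the rows have already been reduced to hypothetical `[css_class, cell_texts]` pairs (the `ROWS` data and the `group_athletes` helper are illustrations, not part of the answer above). `slice_before` starts a new group at each athlete row, and both cell values of every run are kept:

```ruby
# Hypothetical stand-in for the parsed <tr> elements: [css_class, cell_texts].
ROWS = [
  ['athlete', ['Alex']],
  ['run',     ['5.00', '10.00']],
  ['run',     ['5.20', '10.50']],
  ['end',     []],
  ['athlete', ['John']],
  ['run',     ['5.00', '10.00']],
  ['end',     []]
].freeze

# Start a new group at every 'athlete' row, then keep only the 'run' rows
# that follow it. The 'end' rows are dropped by the select.
def group_athletes(rows)
  rows.slice_before { |css_class, _cells| css_class == 'athlete' }.map do |group|
    header, *rest = group
    runs = rest.select { |css_class, _cells| css_class == 'run' }
               .map { |_css_class, cells| cells.map(&:to_f) }
    { name: header[1].first, runs: runs }
  end
end

athletes = group_athletes(ROWS)
athletes.each { |athlete| puts "#{athlete[:name]}: #{athlete[:runs].inspect}" }
# Alex: [[5.0, 10.0], [5.2, 10.5]]
# John: [[5.0, 10.0]]
```

With real Nokogiri rows, the class would come from `tr['class']` and the cell texts from `tr.css('td').map(&:text)`.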

I would process the table as follows:
Locate the table you want to process
table = doc.at_css("table")
Get all the immediate rows in the table
rows = table.css("> tr")
Group the rows with boundary .athlete and .end
grouped = [[]]
rows.each do |row|
if row['class'] == 'athlete' and grouped.last.empty?
grouped.last << row
elsif row['class'] == 'end' and not grouped.last.empty?
grouped.last << row
grouped << []
elsif not grouped.last.empty?
grouped.last << row
end
end
grouped.pop if grouped.last.empty? || grouped.last.last['class'] != 'end'
Process the grouped rows
grouped.each do |group|
puts "BEGIN: >> #{group.first.text} <<"
group[1..-2].each do |row|
puts " #{row.text.squeeze}"
end
puts "END: >> #{group.last.text} <<"
end

Related

Ruby code to display table element details

I have a HTML which displays the Product Details in the following way:
<div class="column">
<h3 class="hidden-xs">Product Details</h3>
<table class="table table-striped">
<tbody>
<tr class="header-row hidden-xs">
<th>Product</th>
<th>Duration</th>
<th>Unit Price</th>
<th>Price</th>
</tr>
<tr>
<td>google.com</td>
<td>1 Year</td>
<td class="hidden-xs">$5</td>
<td>$5</td>
</tr>
</tbody>
</table>
<div class="totals text-right">
<p>Subtotal: $5</p>
<p>Total: $5</p>
</div>
</div>
Ruby code is given below:
require 'watir'
browser = Watir::Browser.new(:chrome)
browser.goto('file:///C:/Users/Ashwin/Desktop/text.html')
browser.table(:class, 'table table-striped').trs.each do |tr|
p tr[0].text
p tr[1].text
p tr[2].text
p tr[3].text
end
I am getting the output this way:
"Product"
"Duration"
"Unit Price"
"Price"
"google.com"
"1 Year"
"$5"
"$5"
But I want the details to be displayed as below:
Product : google.com
Duration : 1 Year
Unit Price : $5
Price : $5
Can anyone please help?

The table looks quite simple, so you can use the Table#strings method to convert the table into an array of strings. Then you can output each column header with each row value.
# Get the table you want to work with
table = browser.table(class: 'table table-striped')
# Get the text of the table
rows = table.strings
# Get the column headers and determine the longest one
header = rows.shift
column_width = header.max { |a, b| a.length <=> b.length }.length
# Loop through the data rows and output the header/value
rows.each do |row|
header.zip(row) do |header, value|
puts "#{header.ljust(column_width)} : #{value}"
end
end
#=> Product : google.com
#=> Duration : 1 Year
#=> Unit Price : $5
#=> Price : $5
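The same formatting can be tried without a browser. This is a small sketch that assumes `Table#strings` returns one array of cell texts per row with the header row first; the literal `rows` array and the `label_rows` helper are hypothetical stand-ins:

```ruby
# Assumed shape of the data: one array of cell texts per row, header first.
rows = [
  ['Product', 'Duration', 'Unit Price', 'Price'],
  ['google.com', '1 Year', '$5', '$5']
]

# Pair every data cell with its column header, padding the header
# to the width of the longest one so the colons line up.
def label_rows(rows)
  header, *data = rows
  width = header.map(&:length).max
  data.flat_map do |row|
    header.zip(row).map { |name, value| "#{name.ljust(width)} : #{value}" }
  end
end

lines = label_rows(rows)
puts lines
```

Because `flat_map` walks every data row, the helper also handles tables with more than one product.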

This code is only for the given table with two rows
require 'watir'
browser = Watir::Browser.new(:chrome)
browser.goto('file:///C:/Users/Ashwin/Desktop/text.html')
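first_row = nil # declared outside the block so it survives between iterations
browser.table(:class, 'table table-striped').rows.each_with_index do |row, index|
  # Remember the header row, then move on to the data rows
  if index.zero?
    first_row = row
    next
  end
  p first_row[0].text + ":" + row[0].text
  p first_row[1].text + ":" + row[1].text
  p first_row[2].text + ":" + row[2].text
  p first_row[3].text + ":" + row[3].text
end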

Display Nokogiri children nodes as raw HTML instead of &lt;tag&gt;

I am changing an XML table into an HTML table, and have to do some rearranging of nodes.
To accomplish the transformation, I scrape the XML, put it into a two-dimensional array, and then build the new HTML to output.
But some of the cells have HTML tags in them, and after my conversion <su> becomes &lt;su&gt;.
The XML data is:
<BOXHD>
<CHED H="1">Disc diameter, inches (cm)</CHED>
<CHED H="1">One-half or more of disc covered</CHED>
<CHED H="2">Number <SU>1</SU>
</CHED>
<CHED H="2">Exhaust foot <SU>3</SU>/min.</CHED>
<CHED H="1">Disc not covered</CHED>
<CHED H="2">Number <SU>1</SU>
</CHED>
<CHED H="2">Exhaust foot<SU>3</SU>/min.</CHED>
</BOXHD>
The steps I'm taking to convert this to an HTML table are:
class TableCell
attr_accessor :text, :rowspan, :colspan
def initialize(text='')
@text = text
@rowspan = 1
@colspan = 1
end
end
@frag = Nokogiri::HTML(xml)
# make a 2d array to store how the cells should be arranged
column = 0
prev_row = -1
@frag.xpath("boxhd/ched").each do |ched|
row = ched.xpath("@h").first.value.to_i - 1
if row <= prev_row
column +=1
end
prev_row = row
@data[row][column] = TableCell.new(ched.inner_html)
end
# methods to find colspan and rowspan, put them in #data
# ... snip ...
# now build an html table
doc = Nokogiri::HTML::DocumentFragment.parse ""
Nokogiri::HTML::Builder.with(doc) do |html|
html.table {
@data.each do |tr|
html.tr {
tr.each do |th|
next if th.nil?
html.th(:rowspan => th.rowspan, :colspan => th.colspan).table_header th.text
end
}
end
}
end
This gives the following HTML (notice the superscripts are escaped):
<table>
<tr>
<th rowspan="2" colspan="1" class="table_header">Disc diameter, inches (cm)</th>
<th rowspan="1" colspan="2" class="table_header">One-half or more of disc covered</th>
<th rowspan="1" colspan="2" class="table_header">Disc not covered</th>
</tr>
<tr>
<th rowspan="1" colspan="1" class="table_header">Number &lt;su&gt;1&lt;/su&gt; </th>
<th rowspan="1" colspan="1" class="table_header">Exhaust foot &lt;su&gt;3&lt;/su&gt;/min.</th>
<th rowspan="1" colspan="1" class="table_header">Number &lt;su&gt;1&lt;/su&gt;</th>
<th rowspan="1" colspan="1" class="table_header">Exhaust foot&lt;su&gt;3&lt;/su&gt;/min.</th>
</tr>
</table>
How do I get the raw HTML instead of the entities?
I've tried these with no success
@data[row][column] = TableCell.new(ched.children)
@data[row][column] = TableCell.new(ched.children.to_s)
@data[row][column] = TableCell.new(ched.to_s)

This might help you understand what's happening:
require 'nokogiri'
doc = Nokogiri::XML('<root><foo></foo></root>')
doc.at('foo').content = '<html><body>bar</body></html>'
doc.to_xml # => "<?xml version=\"1.0\"?>\n<root>\n <foo>&lt;html&gt;&lt;body&gt;bar&lt;/body&gt;&lt;/html&gt;</foo>\n</root>\n"
doc.at('foo').children = '<html><body>bar</body></html>'
doc.to_xml # => "<?xml version=\"1.0\"?>\n<root>\n <foo>\n <html>\n <body>bar</body>\n </html>\n </foo>\n</root>\n"
doc.at('foo').children = Nokogiri::XML::Document.new.create_cdata '<html><body>bar</body></html>'
doc.to_xml # => "<?xml version=\"1.0\"?>\n<root>\n <foo><![CDATA[<html><body>bar</body></html>]]></foo>\n</root>\n"
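The first assignment sets text content, so the markup characters are serialized as entities. The second parses the string into child nodes. As a rough illustration of that difference, the sketch below uses `CGI.escapeHTML` from the Ruby standard library as a stand-in for what a serializer does to text. This is an assumption for demonstration only: Nokogiri does its own escaping, and the details may differ.

```ruby
require 'cgi'

cell_markup = 'Number <SU>1</SU>'

# Treated as text: the markup characters are replaced by entities,
# which is what shows up when a builder is handed the string as content.
as_text = CGI.escapeHTML(cell_markup)

# Treated as markup: the string is placed into the HTML unchanged,
# which is what plain string concatenation does.
as_markup = "<th>#{cell_markup}</th>"

puts as_text    # Number &lt;SU&gt;1&lt;/SU&gt;
puts as_markup  # <th>Number <SU>1</SU></th>
```

So a cell that already contains markup has to be added as parsed nodes or as a raw string, not as text.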

I abandoned the builder, and simply built the HTML:
def html_headers()
rows = Array.new
@data.each do |row|
cells = Array.new
row.each do |cell|
next if cell.nil?
# cell.text is interpolated as-is, so any markup inside it stays unescaped
cells << "<th rowspan=\"%d\" colspan=\"%d\">%s</th>" %
[cell.rowspan,
cell.colspan,
cell.text]
end
rows << "<tr>%s</tr>" % cells.join
end
rows.join
end
headers = html_headers()
def replace_nodes(headers)
# ... snip ...
@frag.xpath("boxhd").each do |old|
puts "replacing boxhd..."
old.replace headers
end
# ... snip ...
end
I don't understand why, but it appears that the text I replaced the <BOXHD> tags with is parsed and searchable, as I was able to change tag names that came from the data in cell.text.

Parsing bank statement and returning values from 2d array

I'm trying to parse my online bank statement, retrieve the values, and then get the individual values. Here's a sample statement. otherrefcode stands for the money I sent, and refcode stands for the money I received.
Date Description Type [?] In (£) Out (£) Balance (£)
29 Aug 13 person1 otherrefcode 29AUG13 18:23 FPO 42.81 662.68
29 Aug 13 person2 otherrefcode 29AUG13 18:21 FPO 599.91 705.49
29 Aug 13 person3 refcode TFR 30.80 1,305.40
28 Aug 13 person4 otherrefcode 28AUG13 14:23 FPO 25.27 1,336.20
28 Aug 13 person5 refcode TFR 41.08 1,361.47
And here's my ruby code. How do I grab the individual values?
require 'watir-webdriver'
require 'nokogiri'
def toprice(data)
data.to_s.match(/\d\d\.\d\d/).to_s
end
$browser = Watir::Browser.new :firefox
$browser.goto("bankurl")
$page_html = Nokogiri::HTML.parse($browser.html)
table_array = Array.new
table = $browser.table(:class,'statement smartRewardsOffers')
table.rows.each do |row|
row_array = Array.new
row.cells.each do |cell|
row_array << cell.text
end
table_array << row_array
end
puts "1strun"
puts table_array[1..4][1]
puts "2ndrun"
puts table_array[1][1..4]
That outputs
1strun
person1 otherrefcode 29AUG13 18:23
FPO
42.81
2ndrun
29 Aug 13
person2 otherrefcode 29AUG13 18:21
FPO
599.91
705.49
The HTML of the statement (well, the first 3 transactions - warning, 76 lines long.)
<table id="pnlgrpStatement:conS1:tblTransactionListView" class="statement smartRewardsOffers" summary="Table displaying the statement for your account Classic xxxxxxxxx xxxxxxxxx">
<thead>
<tr>
<th class="{sorter:false} first" scope="col">
<form id="pnlgrpStatement:conS1:tblTransactionListView:frmToggle" class="validationName:(pnlgrpStatement:conS1:tblTransactionListView:frmToggle) validate:()" enctype="application/x-www-form-urlencoded" autocomplete="off" action="/personal/a/viewproductdetails/ViewProductDetails.jsp" method="post" name="pnlgrpStatement:conS1:tblTransactionListView:frmToggle">
<input id="pnlgrpStatement:conS1:tblTransactionListView:frmToggle:btnASCSortStatements" class="tableSorter tableSorterReverse" type="image" title="Sort by oldest first" alt="Sort by oldest first" src="/wps/wcm/connect/xxxxxxxxxxxx/sort_arrow_up-8-1375113571.png?MOD=AJPERES&CACHEID=xxxxxxxxxxx" name="pnlgrpStatement:conS1:tblTransactionListView:frmToggle:btnASCSortStatements">
Date
<input type="hidden" value="pnlgrpStatement:conS1:tblTransactionListView:frmToggle" name="pnlgrpStatement:conS1:tblTransactionListView:frmToggle">
<input type="hidden" value="xxxxxxx" name="submitToken">
<input type="hidden" name="hasJS" value="true">
</form>
</th>
<th class="{sorter:false} description" scope="col">Description</th>
<th class="{sorter:false} transactionType" scope="col">
Type
<span class="cxtHelp">
<a class="cxtTrigger" href="#transForView" title="Click to find out more about transaction types">[?]</a>
</span>
</th>
<th class="{sorter:false} numeric" scope="col">In (£)</th>
<th class="{sorter:false} numeric" scope="col">Out (£)</th>
<th class="{sorter:false} numeric" scope="col">Balance (£)</th>
</tr>
</thead>
<tbody>
<tr class="alt">
<th class="first">29 Aug 13</th>
<td>
<span class="splitString">person1</span>
<span class="splitString"> </span>
<span class="splitString">ref</span>
<span class="splitString"> </span>
<span class="splitString">29AUG13 18:23</span>
<span class="splitString"> </span>
</td>
<td>
<abbr title="Faster Payments Outgoing">FPO</abbr>
</td>
<td class="numeric"></td>
<td class="numeric">42.81</td>
<td class="numeric">662.68</td>
</tr>
<tr>
<th class="first">29 Aug 13</th>
<td>
<span class="splitString">person2</span>
<span class="splitString"> </span>
<span class="splitString">ref</span>
<span class="splitString"> </span>
<span class="splitString">29AUG13 18:21</span>
<span class="splitString"> </span>
</td>
<td>
<abbr title="Faster Payments Outgoing">FPO</abbr>
</td>
<td class="numeric"></td>
<td class="numeric">599.91</td>
<td class="numeric">705.49</td>
</tr>
<tr class="alt">
<th class="first">29 Aug 13</th>
<td>
<span class="splitString">person3</span>
<span class="splitString"> </span>
<span class="splitString">ref</span>
</td>
<td>
<abbr title="Transfer">TFR</abbr>
</td>
<td class="numeric"></td>
<td class="numeric">30.80</td>
<td class="numeric">1,305.40</td>
</tr>
</tbody>
</table>

You have already gotten the text of each cell into the table_array. You just need to get the right cell. It is a 2D array, so the first index is the row and the second index is the column. Note that the array is 0-indexed (i.e. 0 represents the first row/column), and that row 0 here is the header row, so the first data row is table_array[1].
# type in the first row
puts table_array[1][2]
#=> "FPO"
# person in the first row
puts table_array[1][1].split[0]
#=> "person1"
# out value in the second row
puts table_array[2][4]
#=> "599.91"
Working with these indices is not so nice. Splitting the description column is also harder at this point. Instead, I would suggest creating a hash for each row.
table_array = Array.new
table_rows = $browser.table(:class,'statement smartRewardsOffers')
table_rows.rows.to_a[1..-1].each do |row|
row_hash = Hash.new
row_hash[:date] = row.cell(:index => 0).text
row_hash[:person] = row.cell(:index => 1).span(:index => 0).text
row_hash[:code] = row.cell(:index => 1).span(:index => 2).text rescue ''
row_hash[:time] = row.cell(:index => 1).span(:index => 4).text rescue ''
row_hash[:type] = row.cell(:index => 2).text
row_hash[:in] = row.cell(:index => 3).text
row_hash[:out] = row.cell(:index => 4).text
row_hash[:balance] = row.cell(:index => 5).text
table_array << row_hash
end
# First data row's information
row = 0 # Note that the rows are 0-based index
puts table_array[row][:date] #=> "29 Aug 13"
puts table_array[row][:person] #=> "person1"
puts table_array[row][:code] #=> "ref"
puts table_array[row][:time] #=> "29AUG13 18:23"
puts table_array[row][:type] #=> "FPO"
puts table_array[row][:in] #=> ""
puts table_array[row][:out] #=> "42.81"
puts table_array[row][:balance] #=> "662.68"
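Once each row is a hash, the amounts can be turned into numbers. Note that the `toprice` regex in the question, `/\d\d\.\d\d/`, would return "05.40" for "1,305.40". The sketch below uses hypothetical sample rows shaped like the hashes above; the `to_amount` helper is an illustration, not part of the original code:

```ruby
# Hypothetical rows, shaped like the hashes built above.
statement = [
  { date: '29 Aug 13', person: 'person1', type: 'FPO', in: '', out: '42.81', balance: '662.68' },
  { date: '29 Aug 13', person: 'person3', type: 'TFR', in: '', out: '30.80', balance: '1,305.40' }
]

# Strip thousands separators and convert to a Float; blank cells become nil.
def to_amount(text)
  cleaned = text.to_s.delete(',').strip
  cleaned.empty? ? nil : cleaned.to_f
end

parsed = statement.map do |row|
  row.merge(in: to_amount(row[:in]), out: to_amount(row[:out]), balance: to_amount(row[:balance]))
end

# Sum the outgoing payments, treating blank cells as zero.
total_out = parsed.sum { |row| row[:out] || 0.0 }
puts format('%.2f', total_out)
```

Floats are fine for display like this; for exact money arithmetic, `BigDecimal` would be the safer choice.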

Scraping Table with Nokogiri and need JSON output

So, I have a table with multiple rows and columns.
<table>
<tr>
<th>Employee Name</th>
<th>Reg Hours</th>
<th>OT Hours</th>
</tr>
<tr>
<td>Employee 1</td>
<td>10</td>
<td>20</td>
</tr>
<tr>
<td>Employee 2</td>
<td>5</td>
<td>10</td>
</tr>
</table>
There is also another table:
<table>
<tr>
<th>Employee Name</th>
<th>Revenue</th>
</tr>
<tr>
<td>Employee 2</td>
<td>$10</td>
</tr>
<tr>
<td>Employee 1</td>
<td>$50</td>
</tr>
</table>
Notice that the employee order may be random between the tables.
How can I use nokogiri to create a json file that has each employee as an object, with their total hours and revenue?
Currently, I'm able to just get the individual table cells with some xpath. For example:
puts page.xpath(".//*[@id='UC255_tblSummary']/tbody/tr[2]/td[1]/text()").inner_text
Edit:
Using the page-object gem and the link from @Dave_McNulla, I tried this piece of code just to see what I get:
class MyPage
include PageObject
table(:report, :id => 'UC255_tblSummary')
def get_some_information
report_element[1][2].text
end
end
puts get_some_information
Nothing's being returned, however.
Data: https://gist.github.com/anonymous/d8cc0524160d7d03d37b
There's a duplicate of the hours table. The first one is fine. The other table needed is the accessory revenue table. (I'll also need the activations table, but I'll try to merge that using the code that merges the hours and accessory revenue tables.)

I think the general approach is:
1. Create a hash for each table where the key is the employee
2. Merge the results from both tables together
3. Convert to JSON
Create a hash for each table where the key is the employee
This part you can do in Watir or Nokogiri. It only makes sense to use Nokogiri if Watir is giving poor performance due to large tables.
Watir:
#I assume you would have a better way to identify the tables than by index
hours_table = browser.table(:index, 0)
wage_table = browser.table(:index, 1)
#Turn the tables into a hash
employee_hours = {}
hours_table.trs.drop(1).each do |tr|
tds = tr.tds
employee_hours[ tds[0].text ] = {"Reg Hours" => tds[1].text, "OT Hours" => tds[2].text}
end
#=> {"Employee 1"=>{"Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Reg Hours"=>"5", "OT Hours"=>"10"}}
employee_wage = {}
wage_table.trs.drop(1).each do |tr|
tds = tr.tds
employee_wage[ tds[0].text ] = {"Revenue" => tds[1].text}
end
#=> {"Employee 2"=>{"Revenue"=>"$10"}, "Employee 1"=>{"Revenue"=>"$50"}}
Nokogiri:
page = Nokogiri::HTML.parse(browser.html)
hours_table = page.search('table')[0]
wage_table = page.search('table')[1]
employee_hours = {}
hours_table.search('tr').drop(1).each do |tr|
tds = tr.search('td')
employee_hours[ tds[0].text ] = {"Reg Hours" => tds[1].text, "OT Hours" => tds[2].text}
end
#=> {"Employee 1"=>{"Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Reg Hours"=>"5", "OT Hours"=>"10"}}
employee_wage = {}
wage_table.search('tr').drop(1).each do |tr|
tds = tr.search('td')
employee_wage[ tds[0].text ] = {"Revenue" => tds[1].text}
end
#=> {"Employee 2"=>{"Revenue"=>"$10"}, "Employee 1"=>{"Revenue"=>"$50"}}
Merge the results from both tables together
You want to merge the two hashes together so that for a specific employee, the hash will include their hours as well as their revenue.
employee = employee_hours.merge(employee_wage){ |key, old, new| new.merge(old) }
#=> {"Employee 1"=>{"Revenue"=>"$50", "Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Revenue"=>"$10", "Reg Hours"=>"5", "OT Hours"=>"10"}}
Convert to JSON
Based on this previous question, you can then convert the hash to json.
require 'json'
employee.to_json
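Since the question asks for total hours per employee, the merged hash can be reshaped before it is serialized. A minimal sketch, assuming the two hashes look like the ones built above; the output keys `name`, `total_hours` and `revenue` are hypothetical choices:

```ruby
require 'json'

# Sample data in the shape produced by the table-scraping step.
employee_hours = {
  'Employee 1' => { 'Reg Hours' => '10', 'OT Hours' => '20' },
  'Employee 2' => { 'Reg Hours' => '5',  'OT Hours' => '10' }
}
employee_wage = {
  'Employee 2' => { 'Revenue' => '$10' },
  'Employee 1' => { 'Revenue' => '$50' }
}

# Merge by employee name; the block combines the two inner hashes
# whenever an employee appears in both tables.
employees = employee_hours.merge(employee_wage) { |_name, hours, wage| hours.merge(wage) }

# One object per employee, with the hour strings converted and summed.
report = employees.map do |name, data|
  {
    'name' => name,
    'total_hours' => data['Reg Hours'].to_i + data['OT Hours'].to_i,
    'revenue' => data['Revenue']
  }
end

puts JSON.pretty_generate(report)
```

An employee who appears in only one table keeps a partial entry: missing hours count as 0 and a missing revenue is serialized as `null`.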

watir-webdriver iterating table for field comparison

How can I iterate over this table's rows and compare the value of the first column in each row (i.e. 'Wallet 1' == 'Wallet 2')?
And when the first column has the same value, how do I compare the 4th column values?
<table class="simple">
<tbody><tr>
<th class="labelCenter">Saldo</th>
(...)
</tr>
<tr class="odd">
<td class="labelCenter">Wallet 1</td>
<td class="labelCenter">Decresing</td>
<td class="labelCenter">16/02/2012 19:06:01</td>
<td class="labelCenter">19/02/2012 14:03:01</td>
<td class="labelCenter">
</td>
<td class="labelRight">
78,90
</td>
<td class="labelRight">
0,00
</td>
<td class="labelCenter">Value</td>
</tr>
<tr class="even">
<td class="labelCenter">Wallet 2</td>
<td class="labelCenter">Increasing</td>
<td class="labelCenter">16/02/2012 19:06:01</td>
<td class="labelCenter">19/02/2012 11:09:01</td>
<td class="labelCenter">
</td>
<td class="labelRight">
0,00
</td>
<td class="labelRight">
0,00
</td>
<td class="labelCenter">Value</td>
</tr>
</tbody></table>
My first approach used variations of,
$browser.table(:class, 'simple').rows[1..-1].each do |row|
but I'm facing a roadblock.
Also, why doesn't this work?
$browser.tr(:class => /odd|even/) do |row|
puts row(:index => 0).text

This could probably be made faster by sorting the collection first, or by extracting all the values into arrays and using some array functions to find the matching rows. I'm sure there is room to make this more elegant.
Start with the first row and compare it to every row below it, then move to the next row and repeat:
rows = $browser.table(:class, 'simple').rows
last = rows.length - 1
last.times do |current|
  remaining = last - current
  remaining.times do |j|
    other = current + j + 1 # index of a row below the current one
    if rows[current].cell.text == rows[other].cell.text
      if rows[current].cell(:index => 3).text == rows[other].cell(:index => 3).text
        # do something
      end
    end
  end
end
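The "extract the values into arrays" idea mentioned above could look like the following sketch. It assumes the cell texts have already been read into plain arrays, one per data row; the third row is invented so that there is a match to find. `Array#combination(2)` yields every unordered pair of rows exactly once:

```ruby
# Hypothetical cell texts, one array per data row (header row already dropped).
rows = [
  ['Wallet 1', 'Decreasing', '16/02/2012 19:06:01', '19/02/2012 14:03:01'],
  ['Wallet 2', 'Increasing', '16/02/2012 19:06:01', '19/02/2012 11:09:01'],
  ['Wallet 1', 'Increasing', '17/02/2012 08:00:00', '19/02/2012 14:03:01']
]

# Keep the pairs whose first column matches and whose 4th column also matches.
matches = rows.combination(2).select do |a, b|
  a[0] == b[0] && a[3] == b[3]
end

matches.each do |a, b|
  puts "#{a[0]} appears twice with the same 4th column value: #{a[3]}"
end
```

Reading the text out of the browser once and comparing in memory also avoids repeated element lookups inside the nested loop.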

$browser.table.trs(:class => /odd|even/).each do |tr|
puts tr.td(:index => 0).text
end
