Scraping Table with Nokogiri and need JSON output - ruby

So, I have a table with multiple rows and columns.
<table>
<tr>
<th>Employee Name</th>
<th>Reg Hours</th>
<th>OT Hours</th>
</tr>
<tr>
<td>Employee 1</td>
<td>10</td>
<td>20</td>
</tr>
<tr>
<td>Employee 2</td>
<td>5</td>
<td>10</td>
</tr>
</table>
There is also another table:
<table>
<tr>
<th>Employee Name</th>
<th>Revenue</th>
</tr>
<tr>
<td>Employee 2</td>
<td>$10</td>
</tr>
<tr>
<td>Employee 1</td>
<td>$50</td>
</tr>
</table>
Notice that the employee order may be random between the tables.
How can I use Nokogiri to create a JSON file that has each employee as an object, with their total hours and revenue?
Currently, I'm able to just get the individual table cells with some xpath. For example:
puts page.xpath(".//*[@id='UC255_tblSummary']/tbody/tr[2]/td[1]/text()").inner_text
Edit:
Using the page-object gem and the link from @Dave_McNulla, I tried this piece of code just to see what I get:
class MyPage
include PageObject
table(:report, :id => 'UC255_tblSummary')
def get_some_information
report_element[1][2].text
end
end
puts get_some_information
Nothing's being returned, however.
Data: https://gist.github.com/anonymous/d8cc0524160d7d03d37b
There's a duplicate of the hours table. The first one is fine. The other table needed is the accessory revenue table. (I'll also need the activations table, but I'll try to merge that using the code that merges the hours and accessory revenue tables.)

I think the general approach is:
Create a hash for each table where the key is the employee
Merge the results from both tables together
Convert to JSON
Create a hash for each table where the key is the employee
You can do this part in Watir or Nokogiri. It only makes sense to use Nokogiri if Watir is giving poor performance due to large tables.
Watir:
#I assume you would have a better way to identify the tables than by index
hours_table = browser.table(:index, 0)
wage_table = browser.table(:index, 1)
#Turn the tables into a hash
employee_hours = {}
hours_table.trs.drop(1).each do |tr|
tds = tr.tds
employee_hours[ tds[0].text ] = {"Reg Hours" => tds[1].text, "OT Hours" => tds[2].text}
end
#=> {"Employee 1"=>{"Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Reg Hours"=>"5", "OT Hours"=>"10"}}
employee_wage = {}
wage_table.trs.drop(1).each do |tr|
tds = tr.tds
employee_wage[ tds[0].text ] = {"Revenue" => tds[1].text}
end
#=> {"Employee 2"=>{"Revenue"=>"$10"}, "Employee 1"=>{"Revenue"=>"$50"}}
Nokogiri:
page = Nokogiri::HTML.parse(browser.html)
hours_table = page.search('table')[0]
wage_table = page.search('table')[1]
employee_hours = {}
hours_table.search('tr').drop(1).each do |tr|
tds = tr.search('td')
employee_hours[ tds[0].text ] = {"Reg Hours" => tds[1].text, "OT Hours" => tds[2].text}
end
#=> {"Employee 1"=>{"Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Reg Hours"=>"5", "OT Hours"=>"10"}}
employee_wage = {}
wage_table.search('tr').drop(1).each do |tr|
tds = tr.search('td')
employee_wage[ tds[0].text ] = {"Revenue" => tds[1].text}
end
#=> {"Employee 2"=>{"Revenue"=>"$10"}, "Employee 1"=>{"Revenue"=>"$50"}}
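Both loops above follow the same pattern, so they could be folded into one helper. A minimal sketch, assuming each table has already been reduced to an array of rows of cell text with the header row first; rows_to_hash is a hypothetical name, not part of Watir or Nokogiri:

```ruby
# Hypothetical helper: turns [[header...], [cells...], ...] into a hash
# keyed by the first column of each data row.
def rows_to_hash(rows)
  header, *body = rows
  columns = header.drop(1) # every header except the employee-name column
  body.each_with_object({}) do |cells, result|
    result[cells.first] = columns.zip(cells.drop(1)).to_h
  end
end

# Stand-in data for the hours table from the question.
hours_rows = [
  ['Employee Name', 'Reg Hours', 'OT Hours'],
  ['Employee 1', '10', '20'],
  ['Employee 2', '5', '10']
]

employee_hours = rows_to_hash(hours_rows)
p employee_hours
```

With Nokogiri, the rows could be collected with something like table.search('tr').map { |tr| tr.search('th, td').map(&:text) } before being passed to the helper.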
Merge the results from both tables together
You want to merge the two hashes together so that for a specific employee, the hash will include their hours as well as their revenue.
employee = employee_hours.merge(employee_wage){ |key, old, new| new.merge(old) }
#=> {"Employee 1"=>{"Revenue"=>"$50", "Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Revenue"=>"$10", "Reg Hours"=>"5", "OT Hours"=>"10"}}
Convert to JSON
Based on this previous question, you can then convert the hash to json.
require 'json'
employee.to_json
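Putting the merge and JSON steps together, and adding the "total hours" figure the question asks for, might look like the sketch below. It assumes the two hashes have the shape shown above; the output keys name, total_hours and revenue are illustrative choices, not something the question specifies.

```ruby
require 'json'

# Stand-ins for the hashes built from the two tables.
employee_hours = {
  'Employee 1' => { 'Reg Hours' => '10', 'OT Hours' => '20' },
  'Employee 2' => { 'Reg Hours' => '5',  'OT Hours' => '10' }
}
employee_wage = {
  'Employee 2' => { 'Revenue' => '$10' },
  'Employee 1' => { 'Revenue' => '$50' }
}

# Combine the per-employee hashes; row order in either table does not matter.
merged = employee_hours.merge(employee_wage) { |_name, hours, wage| hours.merge(wage) }

# One object per employee. to_s/to_i keep this from raising when an
# employee appears in only one of the tables.
report = merged.map do |name, data|
  {
    'name' => name,
    'total_hours' => data['Reg Hours'].to_i + data['OT Hours'].to_i,
    'revenue' => data['Revenue'].to_s.delete('$').to_f
  }
end

json = JSON.pretty_generate(report)
puts json
```

The resulting string can be written to disk with File.write('employees.json', json), where the file name is only an example.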

Related

Calculate and display the balances from bottom to top in ruby on rails

To display the records in descending order I used "created_at DESC", and it worked for every column of the table (Date, Particulars, Debit, and Credit) except Balance, which is still calculated and displayed from top to bottom. I want it calculated and displayed from bottom to top, as shown in the image below.
expenses_controller.rb
class ExpensesController < ApplicationController
def index
@expenses = Expense.order("created_at DESC")
end
end
For better understanding, find the below image of the Bank statement, as I need to achieve the same.
index.html.erb
<% balance = 0 %>
<div class="container">
<table style="width:100%">
<thead>
<tr>
<th>Date</th>
<th>Particulars</th>
<th>Debit</th>
<th>Credit</th>
<th>Balance</th>
</tr>
</thead>
<tbody>
<% @expenses.each do |expense| %>
<tr>
<td><%= expense.date.strftime('%d/%m/%Y') %></td>
<td><%= expense.particulars %></td>
<td class="pos"><%= expense.debit %></td>
<td class="neg"><%= expense.credit %></td>
<% balance += expense.debit.to_f-expense.credit.to_f %>
<% color = balance >= 0 ? "pos" : "neg" %>
<td class="<%= color %>"><%= number_with_precision(balance.abs, :delimiter => ",", :precision => 0) %></td>
</tr>
<% end %>
</tbody>
</table>
</div>
Any suggestions are most welcome.
Thank you in advance.
I still don't understand how the balance column [2500, 1500, 2000] is calculated, but I can infer something from the screenshot.
Basically you are sorting by a column not existing in the model. So, first you need to build that helper column, populate it, then sort by that column.
It should be possible to do it in SQL, but I'm showing it in plain Ruby, using an array of hashes as a fake database. You can adapt it to your case easily or look for a more efficient way (SQL).
Let's say data are the following:
expenses = [{date: 1, narration: :a, debit: 3.0, credit: 0},
{date: 2, narration: :b, debit: 0.15, credit: 0},
{date: 3, narration: :c, debit: 75.0, credit: 0}]
And the initial balance is:
balance = 1434.64
Now let's loop over the data, adding the new field balance, and sort at the end of the loop:
expenses.each do |h|
balance += h[:credit] - h[:debit]
h[:balance] = balance
end.sort_by! { |h| h[:balance] }
Now your sorted expenses are:
[
{:date=>3, :narration=>:c, :debit=>75.0, :credit=>0, :balance=>1356.49},
{:date=>2, :narration=>:b, :debit=>0.15, :credit=>0, :balance=>1431.49},
{:date=>1, :narration=>:a, :debit=>3.0, :credit=>0, :balance=>1431.64}
]
You can do calculation in the controller, then pass expenses to the view and loop without any need of calculation there.
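If the goal is a bank-statement layout (newest entry on top, balance accumulated from the oldest entry upward), sorting by balance is not strictly needed: the balance can be accumulated in chronological order and the list reversed for display. A rough sketch in plain Ruby, reusing the sample data and opening balance from above; rounding to two decimals is an assumption made here to keep float noise out of the output:

```ruby
expenses = [
  { date: 1, narration: :a, debit: 3.0,  credit: 0 },
  { date: 2, narration: :b, debit: 0.15, credit: 0 },
  { date: 3, narration: :c, debit: 75.0, credit: 0 }
]
balance = 1434.64 # opening balance, as in the example above

# Accumulate oldest-first so each row carries the balance after that entry.
with_balance = expenses.sort_by { |e| e[:date] }.map do |e|
  balance += e[:credit] - e[:debit]
  e.merge(balance: balance.round(2))
end

# Reverse only for display: newest entry first, balances unchanged.
newest_first = with_balance.reverse
newest_first.each { |e| puts e.inspect }
```

This keeps the calculation independent of the display order, so changing the sort direction in the view cannot change the balances.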
For your Rails app, you could implement it as follows.
Add the temporary field balance to your model (no need to add a column to the database) and initialize to value 0:
class Expense < ApplicationRecord
attr_accessor :balance
after_initialize :init
def init
self.balance = 0
end
end
Do the calculation in controller, I'm using an initial value of balance, just to emulate the example:
def index
@expenses = Expense.all
balance = 1434.64
@expenses.each do |e|
balance += e.credit - e.debit
e.balance = balance
end
@expenses = @expenses.sort_by { |e| e.balance }
end
Then in your view, just loop:
<% @expenses.each do |expense| %>
<tr>
<td><%= expense.narration %></td>
<td><%= expense.debit %></td>
<td><%= expense.credit %></td>
<td><%= expense.balance %></td>
</tr>
<% end %>
If you insert the records as in your example, you should end up with this result:
# ["c", "0.0", "75.0", "1356.49"]
# ["b", "0.0", "0.15", "1431.49"]
# ["a", "0.0", "3.0", "1431.64"]
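If you need to order by creation date first and by balance second, you could use the following (note that order is translated to SQL, so this only works if balance is a real database column rather than the attr_accessor above):
@expenses = Expense.order('created_at DESC, balance DESC')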
Since you told it to order by Expense.order('created_at desc'), then that's what it's doing. If you want to order by balance, you must instead say Expense.order('balance desc')

Ruby code to display table element details

I have an HTML page that displays the Product Details in the following way:
<div class="column">
<h3 class="hidden-xs">Product Details</h3>
<table class="table table-striped">
<tbody>
<tr class="header-row hidden-xs">
<th>Product</th>
<th>Duration</th>
<th>Unit Price</th>
<th>Price</th>
</tr>
<tr>
<td>google.com</td>
<td>1 Year</td>
<td class="hidden-xs">$5</td>
<td>$5</td>
</tr>
</tbody>
</table>
<div class="totals text-right">
<p>Subtotal: $5</p>
<p>Total: $5</p>
</div>
</div>
Ruby code is given below:
require 'watir'
browser = Watir::Browser.new(:chrome)
browser.goto('file:///C:/Users/Ashwin/Desktop/text.html')
browser.table(:class, 'table table-striped').trs.each do |tr|
p tr[0].text
p tr[1].text
p tr[2].text
p tr[3].text
end
I am getting the output this way:
"Product"
"Duration"
"Unit Price"
"Price"
"google.com"
"1 Year"
"$5"
"$5"
But I want the details to be displayed as below:
Product : google.com
Duration : 1 Year
Unit Price : $5
Price : $5
Can anyone please help?
The table looks quite simple, so you can use the Table#strings method to convert the table into an array of strings. Then you can output each column header with each row value.
# Get the table you want to work with
table = browser.table(class: 'table table-striped')
# Get the text of the table
rows = table.strings
# Get the column headers and determine the longest one
header = rows.shift
column_width = header.max { |a, b| a.length <=> b.length }.length
# Loop through the data rows and output the header/value
rows.each do |row|
header.zip(row) do |header, value|
puts "#{header.ljust(column_width)} : #{value}"
end
end
#=> Product    : google.com
#=> Duration   : 1 Year
#=> Unit Price : $5
#=> Price      : $5
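The same header/value pairing can be tried without a browser. The sketch below assumes rows holds what Table#strings would return for the table in the question (an array of string arrays, header row first); the literal data is a stand-in.

```ruby
# Stand-in for table.strings on the Product Details table.
rows = [
  ['Product', 'Duration', 'Unit Price', 'Price'],
  ['google.com', '1 Year', '$5', '$5']
]

header, *data_rows = rows
column_width = header.map(&:length).max # width of the longest label

# One "label : value" line per cell, labels padded to the same width.
lines = data_rows.flat_map do |row|
  header.zip(row).map { |label, value| "#{label.ljust(column_width)} : #{value}" }
end

puts lines
```

Because the labels come from the header row, the code keeps working if columns are added or the table gains more data rows.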
This code is only for the given table with two rows
require 'watir'
browser = Watir::Browser.new(:chrome)
browser.goto('file:///C:/Users/Ashwin/Desktop/text.html')
# Declared outside the block so the header row survives between iterations
first_row = nil
browser.table(:class, 'table table-striped').rows.each_with_index do |row, index|
if index.zero?
first_row = row
next
end
p first_row[0].text + ":" + row[0].text
p first_row[1].text + ":" + row[1].text
p first_row[2].text + ":" + row[2].text
p first_row[3].text + ":" + row[3].text
end

Display Nokogiri children nodes as raw HTML instead of &lt;tag&gt;

I am changing an XML table into an HTML table, and have to do some rearranging of nodes.
To accomplish the transformation, I scrape the XML, put it into a two-dimensional array, and then build the new HTML to output.
But some of the cells have HTML tags in them, and after my conversion <su> becomes &lt;su&gt;.
The XML data is:
<BOXHD>
<CHED H="1">Disc diameter, inches (cm)</CHED>
<CHED H="1">One-half or more of disc covered</CHED>
<CHED H="2">Number <SU>1</SU>
</CHED>
<CHED H="2">Exhaust foot <SU>3</SU>/min.</CHED>
<CHED H="1">Disc not covered</CHED>
<CHED H="2">Number <SU>1</SU>
</CHED>
<CHED H="2">Exhaust foot<SU>3</SU>/min.</CHED>
</BOXHD>
The steps I'm taking to convert this to an HTML table are:
class TableCell
attr_accessor :text, :rowspan, :colspan
def initialize(text='')
@text = text
@rowspan = 1
@colspan = 1
end
end
@frag = Nokogiri::HTML(xml)
# make a 2d array to store how the cells should be arranged
column = 0
prev_row = -1
@frag.xpath("boxhd/ched").each do |ched|
row = ched.xpath("@h").first.value.to_i - 1
if row <= prev_row
column +=1
end
prev_row = row
@data[row][column] = TableCell.new(ched.inner_html)
end
# methods to find colspan and rowspan, put them in @data
# ... snip ...
# now build an html table
doc = Nokogiri::HTML::DocumentFragment.parse ""
Nokogiri::HTML::Builder.with(doc) do |html|
html.table {
@data.each do |tr|
html.tr {
tr.each do |th|
next if th.nil?
html.th(:rowspan => th.rowspan, :colspan => th.colspan).table_header th.text
end
}
end
}
end
This gives the following HTML (notice the superscripts are escaped):
<table>
<tr>
<th rowspan="2" colspan="1" class="table_header">Disc diameter, inches (cm)</th>
<th rowspan="1" colspan="2" class="table_header">One-half or more of disc covered</th>
<th rowspan="1" colspan="2" class="table_header">Disc not covered</th>
</tr>
<tr>
<th rowspan="1" colspan="1" class="table_header">Number &lt;su&gt;1&lt;/su&gt; </th>
<th rowspan="1" colspan="1" class="table_header">Exhaust foot &lt;su&gt;3&lt;/su&gt;/min.</th>
<th rowspan="1" colspan="1" class="table_header">Number &lt;su&gt;1&lt;/su&gt;</th>
<th rowspan="1" colspan="1" class="table_header">Exhaust foot&lt;su&gt;3&lt;/su&gt;/min.</th>
</tr>
</table>
How do I get the raw HTML instead of the entities?
I've tried these with no success
@data[row][column] = TableCell.new(ched.children)
@data[row][column] = TableCell.new(ched.children.to_s)
@data[row][column] = TableCell.new(ched.to_s)
This might help you understand what's happening:
require 'nokogiri'
doc = Nokogiri::XML('<root><foo></foo></root>')
doc.at('foo').content = '<html><body>bar</body></html>'
doc.to_xml # => "<?xml version=\"1.0\"?>\n<root>\n <foo>&lt;html&gt;&lt;body&gt;bar&lt;/body&gt;&lt;/html&gt;</foo>\n</root>\n"
doc.at('foo').children = '<html><body>bar</body></html>'
doc.to_xml # => "<?xml version=\"1.0\"?>\n<root>\n <foo>\n <html>\n <body>bar</body>\n </html>\n </foo>\n</root>\n"
doc.at('foo').children = Nokogiri::XML::Document.new.create_cdata '<html><body>bar</body></html>'
doc.to_xml # => "<?xml version=\"1.0\"?>\n<root>\n <foo><![CDATA[<html><body>bar</body></html>]]></foo>\n</root>\n"
I abandoned the builder, and simply built the HTML:
headers = html_headers()
def html_headers()
rows = Array.new
@data.each do |row|
cells = Array.new
row.each do |cell|
next if cell.nil?
cells << "<th rowspan=\"%d\" colspan=\"%d\">%s</th>" %
[cell.rowspan,
cell.colspan,
cell.text]
end
rows << "<tr>%s</tr>" % cells.join
end
rows.join
end
def replace_nodes(headers)
# ... snip ...
@frag.xpath("boxhd").each do |old|
puts "replacing boxhd..."
old.replace headers
end
# ... snip ...
end
I don't understand why, but it appears that the text I replaced the <BOXHD> tags with is parsed and searchable, as I was able to change tag names that came from the data in cell.text.
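A self-contained version of the string-building approach can be run without Nokogiri. In this sketch TableCell is redefined as a Struct and the grid is passed in as an argument instead of being read from @data; both are simplifications made for illustration, not the original code.

```ruby
# Simplified stand-in for the TableCell class from the question.
TableCell = Struct.new(:text, :rowspan, :colspan)

# Builds <tr>/<th> markup by plain string formatting, so any markup
# already present in cell.text is emitted as-is rather than escaped.
def html_headers(data)
  data.map do |row|
    cells = row.compact.map do |cell|
      format('<th rowspan="%d" colspan="%d">%s</th>', cell.rowspan, cell.colspan, cell.text)
    end
    "<tr>#{cells.join}</tr>"
  end.join
end

grid = [
  [TableCell.new('Disc diameter', 2, 1), TableCell.new('Covered', 1, 2)],
  [nil, TableCell.new('Number <SU>1</SU>', 1, 1)]
]

puts html_headers(grid)
```

The trade-off is that nothing is escaped at all, so this is only reasonable when the cell text comes from a trusted source.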

Scraping multiple table row siblings with Nokogiri

I’m trying to parse a table with the following markup.
<table>
<tr class="athlete">
<td colspan="2" class="name">Alex</td>
</tr>
<tr class="run">
<td>5.00</td>
<td>10.00</td>
</tr>
<tr class="run">
<td>5.20</td>
<td>10.50</td>
</tr>
<tr class="end"></tr>
<tr class="athlete">
<td colspan="2" class="name">John</td>
</tr>
<tr class="run">
<td>5.00</td>
<td>10.00</td>
</tr>
<tr class="end"></tr>
</table>
I need to loop through each .athlete table row and get each sibling .run table row underneath until I reach the .end row. Then repeat for the next athlete and so on. Some .athlete rows have two .run rows, others have one.
Here’s what I have so far. I loop through the athletes:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://myurl.com"
doc = Nokogiri::HTML(open(url))
doc.css(".athlete").each do |athlete|
puts athlete.at_css(".name").text
# Loop through the sibling .run rows until I reach the .end row
# output the value of the td’s in the .run row
end
I can’t figure out how to get each sibling .run row, and stop at the .end row. I feel like it would be easier if the table was better formed, but unfortunately I don’t have control of the markup. Any help would be greatly appreciated!
Voilà
require 'nokogiri'
doc = <<DOC
<table>
<tr class="athlete">
<td colspan="2" class="name">Alex</td>
</tr>
<tr class="run">
<td>5.00</td>
<td>10.00</td>
</tr>
<tr class="run">
<td>5.20</td>
<td>10.50</td>
</tr>
<tr class="end"></tr>
<tr class="athlete">
<td colspan="2" class="name">John</td>
</tr>
<tr class="run">
<td>5.00</td>
<td>10.00</td>
</tr>
<tr class="end"></tr>
</table>
DOC
doc = Nokogiri::HTML(doc)
# You can exclude .end, if it is always empty? and not required
trs = doc.css('.athlete, .run, .end').to_a
# This will return [['athlete', 'run', ...,'end'], ['athlete', 'run', ...,'end'] ...]
athletes = trs.slice_before{ |elm| elm.attr('class') =='athlete' }.to_a
athletes.map! do |athlete|
{
name: athlete.shift.at_css('.name').text,
runs: athlete
.select{ |tr| tr.attr('class') == 'run' }
.map{|run| run.text.to_f }
}
end
puts athletes.inspect
#[{:name=>"Alex", :runs=>[5.0, 5.2]}, {:name=>"John", :runs=>[5.0]}]
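The grouping step relies on Enumerable#slice_before, which can be tried on plain data. In the sketch below each hash is a stand-in for a parsed tr (its class plus the text of its cells); unlike the code above, it keeps both cell values of every run instead of only the first.

```ruby
# Stand-ins for the <tr> elements, in document order.
rows = [
  { class: 'athlete', cells: ['Alex'] },
  { class: 'run',     cells: ['5.00', '10.00'] },
  { class: 'run',     cells: ['5.20', '10.50'] },
  { class: 'end',     cells: [] },
  { class: 'athlete', cells: ['John'] },
  { class: 'run',     cells: ['5.00', '10.00'] },
  { class: 'end',     cells: [] }
]

# Start a new group at every athlete row.
groups = rows.slice_before { |row| row[:class] == 'athlete' }

athletes = groups.map do |head, *rest|
  {
    name: head[:cells].first,
    runs: rest.select { |row| row[:class] == 'run' }
              .map { |row| row[:cells].map(&:to_f) }
  }
end

p athletes
```

Since the .end rows are filtered out by the select, they only matter here if rows of other kinds can appear between one athlete and the next.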
I would process the table as follows:
Locate the table you want to process
table = doc.at_css("table")
Get all the immediate rows in the table
rows = table.css("> tr")
Group the rows with boundary .athlete and .end
grouped = [[]]
rows.each do |row|
if row['class'] == 'athlete' and grouped.last.empty?
grouped.last << row
elsif row['class'] == 'end' and not grouped.last.empty?
grouped.last << row
grouped << []
elsif not grouped.last.empty?
grouped.last << row
end
end
grouped.pop if grouped.last.empty? || grouped.last.last['class'] != 'end'
Process the grouped rows
grouped.each do |group|
puts "BEGIN: >> #{group.first.text} <<"
group[1..-2].each do |row|
puts " #{row.text.squeeze}"
end
puts "END: >> #{group.last.text} <<"
end

watir-webdriver iterating table for field comparison

How can I iterate over this table's rows and compare the value of the first column of each row,
i.e. 'Wallet 1' == 'Wallet 2'?
And when the fields have the same value, how can I compare the 4th column value?
<table class="simple">
<tbody><tr>
<th class="labelCenter">Saldo</th>
(...)
</tr>
<tr class="odd">
<td class="labelCenter">Wallet 1</td>
<td class="labelCenter">Decresing</td>
<td class="labelCenter">16/02/2012 19:06:01</td>
<td class="labelCenter">19/02/2012 14:03:01</td>
<td class="labelCenter">
</td>
<td class="labelRight">
78,90
</td>
<td class="labelRight">
0,00
</td>
<td class="labelCenter">Value</td>
</tr>
<tr class="even">
<td class="labelCenter">Wallet 2</td>
<td class="labelCenter">Increasing</td>
<td class="labelCenter">16/02/2012 19:06:01</td>
<td class="labelCenter">19/02/2012 11:09:01</td>
<td class="labelCenter">
</td>
<td class="labelRight">
0,00
</td>
<td class="labelRight">
0,00
</td>
<td class="labelCenter">Value</td>
</tr>
</tbody></table>
My first approach used variations of,
$browser.table(:class, 'simple').rows[1..-1].each do |row|
but I'm facing a roadblock.
Also, why doesn't this work?
$browser.tr(:class => /odd|even/) do |row|
puts row(:index => 0).text
This could probably be made faster by sorting the collection first, or by extracting all the values into arrays and using array methods to find the matching rows. I'm sure there is room to make this more elegant.
The approach: start with the first row and compare it to every row below it,
then move to the next row and repeat.
rows = $browser.table(:class, 'simple').rows
last = rows.length - 1
last.times do |current|
remaining = last - current
remaining.times do |j|
other = rows[current + j + 1] # a row below the current one
if rows[current].cell.text == other.cell.text
if rows[current].cell(:index => 3).text == other.cell(:index => 3).text
#do something
end
end
end
end
$browser.table.trs(:class => /odd|even/).each do |tr|
puts tr.td(:index => 0).text
end
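The "extract the values into arrays" idea can be sketched in plain Ruby with Array#combination, which yields every pair of rows exactly once. The rows below are stand-ins for the cell text of the data rows (with Watir they could probably be obtained from the table's strings minus the header row), and the third row is invented so that there is something to match.

```ruby
# Stand-in cell text for the data rows; the third row is hypothetical.
rows = [
  ['Wallet 1', 'Decreasing', '16/02/2012 19:06:01', '19/02/2012 14:03:01'],
  ['Wallet 2', 'Increasing', '16/02/2012 19:06:01', '19/02/2012 11:09:01'],
  ['Wallet 1', 'Increasing', '17/02/2012 08:00:00', '19/02/2012 14:03:01']
]

# Every unordered pair of rows, kept when both the first and the
# fourth column agree.
matches = rows.combination(2).select do |a, b|
  a[0] == b[0] && a[3] == b[3]
end

matches.each do |a, b|
  puts "#{a[0]}: fourth column matches (#{a[3]})"
end
```

Working on extracted strings also avoids repeated browser round-trips, which tends to be the slow part of the nested-loop version.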
