How do I parse HTML using Nokogiri?

I'd like to parse an HTML file extracting relevant data to use in my research. Here's a piece of the HTML:
<td class="color_line1" valign="center"><a class="linkpadrao" href="javascript:Direciona('5453*SERRA#TALHADA');">Serra Talhada</a></td>
<td class="color_line" valign="center" align="center">9</td>
<td class="color_line" valign="center" align="center">2,973</td>
<td class="color_line" valign="center" align="center">0,016</td>
<td class="color_line" valign="center" align="center">2,939</td>
<td class="color_line" valign="center" align="center">3,000</td>
<td class="color_line" valign="center" align="center">0,572</td>
<td class="color_line" valign="center" align="center">2,401</td>
<td class="color_line" valign="center" align="center">0,024</td>
<td class="color_line" valign="center" align="center">2,378</td>
<td class="color_line" valign="center" align="center">2,426</td>
</tr>
More specifically, I'd like to get "Serra Talhada" (the city name) along with all of the numbers that follow it (the max, min and average gas prices).
I tried this so far:
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri'
url = "http://www.anp.gov.br/preco/prc/Resumo_Por_Estado_Municipio.asp"
agent = Mechanize.new
parameters = {'selSemana' => '737*De+28%2F07%2F2013+a+03%2F08%2F2013',
'desc_semana' => 'de+28%2F07%2F2013+a+03%2F08%2F2013',
'cod_Semana' => '737',
'tipo' => '1',
'Cod_Combustivel' => 'undefined',
'selEstado' => 'PE*PERNAMBUCO',
'selCombustivel' => '487*Gasolina',
}
municipio = []
page = agent.post(url, parameters)
extrair = page.parser
extrair.css('.linkpadrao').each do |posto|
# Municipios
municipio << posto.text
end
I can't figure out how to get the numbers, since all of those cells share the same HTML structure.
Any thoughts?!

Since you need to find the cells relative to the city link, you should start from their common ancestor, which in this case is their tr.
Using XPath, you can locate a specific row by the text of the link it contains:
# This is the table that contains all of the city data
data_table = extrair.at_css('.table_padrao')
# This is the specific row that contains the specified city
row = data_table.xpath('//tr[td/a[@class="linkpadrao" and text()="Serra Talhada"]]')
# This is the data in the specific row
data = row.css(".color_line").map{|e| e.text }
#=> ["9", "2,973", "0,016", "2,939", "3,000", "0,572", "2,401", "0,024", "2,378", "2,426"]

You can get the numbers following each posto with:
posto.parent.search('~ td').map &:text

Related

How to merge 3 hashes?

I have an HTML table like the one below and I'm trying to get its data into hashes. So far I'm extracting the party names and types and merging them into a single hash. Now I need to merge in another hash with the party addresses. I can get the addresses, but the table structure is a bit unusual, so I'm not sure how to attach each address to the party it belongs to.
require 'nokogiri'
html = ' <table class="detailRecordTable"><tbody><tr>
<td width="3%" class="detailSeperator" style="width:3%;"></td>
<td width="30%" class="detailSeperator" style="width:30%;text-align:left">
SMALL , DANIEL, Appellant &nbsp </td> <td width="20%" class="detailSeperator" style="width:20%;font-weight: normal"> represented by
</td>
<td width="47%" class="detailSeperator" style="width:47%;text-align:left">
KELLY , MARK EDWARD
, Attorney for Appellant
</td>
</tr>
<tr>
<td width="3%" class="detailData" style="width:3%;text-align:right">
</td>
<td width="30%" class="detailData">
</td> <td width="20%" class="detailData">
</td><td width="47%" class="detailData">
134 N WATER STREET<br>
LIBERTY,
MO
64068<br> <br>
<p></p>
</td>
</tr>
<tr>
<td width="3%" class="detailData"> </td>
<td width="30%" class="detailData"> </td>
<td width="20%" class="detailData"> </td>
<td width="47%" class="detailData"></td>
</tr>
<tr>
<td class="detailSeperator" style="width:3%;text-align:right"></td>
<td class="detailSeperator" style="width:30%;text-align:left"></td>
<td class="detailSeperator" style="width:20%;font-weight: normal">co-counsel</td>
<td class="detailSeperator" style="width:47%;text-align:left">
PITTMAN , KRISTI LANAE , Co-Counsel for Appellant</td>
</tr>
<tr>
<td width="3%" class="detailData"> </td>
<td width="30%" class="detailData"> </td>
<td width="20%" class="detailData"> </td>
<td width="47%" class="detailData">
134 NORTH WATER STREET<br>
LIBERTY,
MO
64068<br> <br>
</td>
</tr>
<tr>
<td width="3%" class="detailSeperator" style="width:3%;">
</td>
<td width="30%" class="detailSeperator" style="width:30%;text-align:left">
RED SIMPSON, INC.
, Respondent
</td>
<td width="20%" class="detailSeperator" style="width:20%;font-weight: normal"> represented by
</td>
<td width="47%" class="detailSeperator" style="width:47%;text-align:left">
GREENWALD , DOUGLAS MARK
, Attorney for Respondent
</td>
</tr>
<tr>
<td width="3%" class="detailData" style="width:3%;text-align:right">
</td>
<td width="30%" class="detailData">
</td>
<td width="20%" class="detailData">
</td>
<td width="47%" class="detailData">
10 EAST CAMBRIDGE CIRCLE DRIVE<br>
KANSAS CITY,
KS
66103<br><br>
<p></p>
</td>
</tr>
<tr>
<td width="3%" class="detailData"> </td>
<td width="30%" class="detailData"> </td>
<td width="20%" class="detailData"> </td>
<td width="47%" class="detailData"></td>
</tr>
<tr>
<td class="detailSeperator" style="width:3%;text-align:right"></td>
<td class="detailSeperator" style="width:30%;text-align:left"></td>
<td class="detailSeperator" style="width:20%;font-weight: normal">co-counsel</td>
<td class="detailSeperator" style="width:47%;text-align:left">
BENJAMIN, SAMANTHA NICOLE
, Co-Counsel for Respondent</td>
</tr>
<tr>
<td width="3%" class="detailData"> </td>
<td width="30%" class="detailData"> </td>
<td width="20%" class="detailData"> </td>
<td width="47%" class="detailData">
MCANANY VAN CLEVE AND PHILLIPS<br>
10 E CAMBRIDGE CIRCLE DR<br>
STE 300<br>
KANSAS CITY,
KS
66103<br>
<b>Business: </b>
(913)
573-3319 <br> <br>
</td>
</tr>
</tbody></table>'
doc = Nokogiri::HTML(html)
rows = doc.xpath("//table[@class='detailRecordTable']//tr")
# address2 = doc.css('td:nth-of-type(4)').text.strip
# puts address2
@party_names = []
@party_types = []
@party_des = []
rows.each do |row|
nodes = row.css('.detailSeperator:nth-of-type(2), .detailSeperator:nth-of-type(4)')
nodes.each do |node|
name = node.text.strip.gsub("\n", '').gsub("\t", '')
parts = name.split(',')
name = if parts.length == 3
"#{parts[0]}, #{parts[1]}"
else
parts[0]
end
party_type = parts[-1].strip if parts && parts.length >= 2
addr = ("#{parts[0]}, #{parts[1]}" if parts.length == 2)
@party_names << name
@party_types << party_type
@party_des << addr
end
address = row.css('td:nth-of-type(2),td:nth-of-type(4)')
address.each do |node|
addr = node.text.strip.gsub("\n", '').gsub("\t", '')
parts = addr.split(',')
addr = ("#{parts[0]}, #{parts[1]}" if parts.length == 2)
@party_des << addr
end
end
@party_names.compact!
@party_names.reject(&:empty?)
@party_types.compact!
@party_des.compact!
@party_names_and_types = @party_names.zip(@party_types).map { |name, type| { part_name: name, party_type: type } }
The output I currently have looks like this:
{:part_name=>"SMALL, DANIEL", :party_type=>"Appellant &nbsp"}
{:part_name=>"KELLY, MARK EDWARD", :party_type=>"Attorney for Appellant"}
{:part_name=>"PITTMAN, KRISTI LANAE", :party_type=>"Co-Counsel for Appellant"}
{:part_name=>"RED SIMPSON, INC.", :party_type=>"Respondent "}
{:part_name=>"GREENWALD, DOUGLASMARK", :party_type=>"Attorney for Respondent"}
{:part_name=>"BENJAMIN, SAMANTHA NICOLE", :party_type=>"Co-Counsel for Respondent"}
I am able to get the party addresses, but how can I merge them into @party_names_and_types so that the output looks like this?
{:part_name=>"SMALL, DANIEL", :party_type=>"Appellant &nbsp"}
{:part_name=>"KELLY, MARK EDWARD", :party_type=>"Attorney for Appellant", :party_address => "134 N WATER STREETLIBERTY,MO 64068"}
{:part_name=>"PITTMAN, KRISTI LANAE", :party_type=>"Co-Counsel for Appellant",:party_address => "134 N WATER STREETLIBERTY,MO 64068"}
{:part_name=>"RED SIMPSON, INC.", :party_type=>"Respondent "}
{:part_name=>"GREENWALD, DOUGLASMARK", :party_type=>"Attorney for Respondent", :party_address => " 10 EAST CAMBRIDGE CIRCLE DRIVE KANSAS CITY,KS 66103"}
{:part_name=>"BENJAMIN, SAMANTHA NICOLE", :party_type=>"Co-Counsel for Respondent", :party_address => " MCANANY VAN CLEVE AND PHILLIPS 10 E CAMBRIDGE CIRCLE DR STE 300 KANSAS CITY,KS 66103", :party_des => "Business:(913) 573-3319"}

You were right about the table structure being "a bit unusual".
I wouldn't call your logic wrong, but it doesn't suit this table, because associated values (such as a party's name and its address) sit in different rows.
Here is the code I wrote to produce the output you described:
require 'nokogiri'
# html = 'your provided html code...'
doc = Nokogiri::HTML(html)
rows = doc.xpath("//table[@class='detailRecordTable']//tr")
@party_names_and_types = []
start = 0
step = 5
def format_text(text)
text.strip.gsub(" ", "").gsub("\n", ' ').gsub("\t", '')
end
def get_party_name_and_type(text)
parts = text.split(',')
name = parts.length == 3 ? "#{parts[0]}, #{parts[1]}" : parts[0]
party_type = format_text(parts[-1].strip) if parts && parts.length >= 2
{ party_name: name, party_type: party_type }
end
while start < rows.count
data_rows = rows.slice(start, step)
[0, 3].each do |row_num|
if row_num == 0
[1, 3].each do |col_num|
party_details = get_party_name_and_type(
format_text(data_rows[row_num].children.filter("td")[col_num].text)
)
address = data_rows[row_num+1].children.filter("td")[3].text if col_num == 3
party_details[:party_address] = format_text(address) unless address.nil? || address.empty?
@party_names_and_types << party_details
end
else
party_details = get_party_name_and_type(
format_text(data_rows[3].children.filter("td")[3].text)
)
address = data_rows[row_num+1].children.filter("td")[3].text
party_details[:party_address] = format_text(address) unless address.nil? || address.empty?
@party_names_and_types << party_details
end
end
start += step
end
puts "======@party_names_and_types======"
puts @party_names_and_types
Output:
======@party_names_and_types======
{:party_name=>"SMALL , DANIEL", :party_type=>"Appellant  &nbsp"}
{:party_name=>"KELLY , MARK EDWARD ", :party_type=>"Attorney for Appellant", :party_address=>"134 N WATER STREETLIBERTY, MO 64068"}
{:party_name=>"PITTMAN , KRISTI LANAE ", :party_type=>"Co-Counsel for Appellant", :party_address=>"134 NORTH WATER STREETLIBERTY, MO 64068"}
{:party_name=>"RED SIMPSON, INC. ", :party_type=>"Respondent   "}
{:party_name=>"GREENWALD , DOUGLAS MARK ", :party_type=>"Attorney for Respondent", :party_address=>"10 EAST CAMBRIDGE CIRCLE DRIVEKANSAS CITY, KS 66103"}
{:party_name=>"BENJAMIN, SAMANTHA NICOLE ", :party_type=>"Co-Counsel for Respondent", :party_address=>"MCANANY VAN CLEVE AND PHILLIPS10 E CAMBRIDGE CIRCLE DRSTE 300KANSAS CITY, KS 66103 Business:(913) 573-3319"}
I'll update the answer to explain the logic in some time.
Hope this helps.

Nokogiri iterating over tr tags too many times

I'm scraping this page https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=Duhig and for each tr I am collecting and returning the level name and the number of computers available.
The problem is that it is being iterated over too many times. There are only 4 tr tags but the loop goes through 5 iterations. This causes an extra nil to be appended to the return array. Why is this?
Scraped Section:
<table class="chart">
<tr valign="middle">
<td class="left">Level 1</td>
<td class="middle"><div style="width:68%;"><strong>68%</strong></div></td>
<td class="right">23 Free of 34 PC's</td>
</tr>
<tr valign="middle">
<td class="left">Level 2</td>
<td class="middle"><div style="width:78%;"><strong>78%</strong></div></td>
<td class="right">83 Free of 107 PC's</td>
</tr>
<tr valign="middle">
<td class="left">Level 4</td>
<td class="middle"><div style="width:64%;"><strong>64%</strong></div></td>
<td class="right">9 Free of 14 PC's</td>
</tr>
<tr valign="middle">
<td class="left">Level 5</td>
<td class="middle"><div style="width:97%;"><strong>97%</strong></div></td>
<td class="right">28 Free of 29 PC's</td>
</tr>
</table>
Shortened Method:
def self.scrape_details_page(library_url)
details_page = Nokogiri::HTML(open(library_url))
library_name = details_page.css("h3")
details_page.css("table tr").collect do |level|
case level.css("a[href]").text.downcase
when "level 1"
name = level.css("a[href]").text
total_available = level.css(".right").text.split(" ")[0]
out_of_available = level.css(".right").text.split(" ")[3]
level = {name: name, total_available: total_available, out_of_available: out_of_available}
when "level 2"
name = level.css("a[href]").text
total_available = level.css(".right").text.split(" ")[0]
out_of_available = level.css(".right").text.split(" ")[3]
level = {name: name, total_available: total_available, out_of_available: out_of_available}
end
end
end

You can select the table by its class attribute and then access only the tr tags inside it. That way you avoid the "additional" tr:
details_page.css("table.chart tr").map do |level|
...
You can also simplify the scrape_details_page method a little:
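require 'nokogiri'
require 'open-uri'
def scrape_details_page(library_url)
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
details_page = Nokogiri::HTML(URI.open(library_url))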
details_page.css('table.chart tr').map do |level|
right = level.css('.right').text.split
{ name: level.css('a[href]').text, total_available: right[0], out_of_available: right[3] }
end
end
p scrape_details_page('https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=Duhig')
# [{:name=>"Level 1", :total_available=>"22", :out_of_available=>"34"},
# {:name=>"Level 2", :total_available=>"98", :out_of_available=>"107"},
# {:name=>"Level 4", :total_available=>"12", :out_of_available=>"14"},
# {:name=>"Level 5", :total_available=>"26", :out_of_available=>"29"}]

xpath: searching a node in a html table row (multiple conditions)

I'm looking for an XPath expression that selects a node whose table row must fulfill several conditions.
I want the "col_functions" node in the row whose values are "John" and "Wayne", within the table with @class="table_list".
("col_functions", "col_firstname" and "col_lastname" are sibling nodes and descendants of the table.)
<table class="table_list">
<tbody>
<tr>
<td class="col_firstname">John</td>
<td class="col_lastname">Lennon</td>
<td class="col_functions"></td>
</tr>
<tr>
<td class="col_firstname">John</td>
<td class="col_lastname">Wayne</td>
<td class="col_functions"></td> <=== looking for this node!!
</tr>
<tr>
<td class="col_firstname">Wayne</td>
<td class="col_lastname">John</td>
<td class="col_functions"></td>
</tr>
</tbody>
</table>

One option would be to check for class names all over the place:
//table[@class="table_list"]//tr[td[@class="col_firstname"] = "John" and td[@class="col_lastname"] = "Wayne"]/td[@class="col_functions"]/text()
Here we are basically checking all rows inside the table for cells with first name John and last name Wayne, getting the cell with col_functions as an output.

Using sibling axes, it would look like this:
//table[@class='table_list']//td[@class='col_firstname'][text()='John']/following-sibling::td[@class='col_lastname'][text()='Wayne']/following-sibling::td[@class='col_functions']

How to find elements using several conditions in Watir?

There is a table whose rows have different class names: first_row, odd_row, even_row and subjectField.
HTML:
<table class="color_table">
<thead></thead>
<tbody>
<tr class="first_row"></tr>
<td colspan="1" rowspan="1"></td>
<td colspan="1" rowspan="1"></td>
<td colspan="1" rowspan="1"></td>
<td colspan="1" rowspan="1">
**63**
</td>
<tr class="subjectField" style="display:none"></tr>
<tr class="odd_row"></tr>
<tr class="subjectField" style="display:none"></tr>
<tr class="even_row"></tr>
<tr class="subjectField" style="display:none"></tr>
</tbody>
Additional HTML:
<tbody>
<tr class="first_row"></tr>
<tr class="subjectField" style="display:none"></tr>
<tr class="odd_row"></tr>
<tr class="subjectField" style="display:none"></tr>
<tr>
<td class="separator" rowspan="1" colspan="10"></td>
</tr>
<tr class="even_row"></tr>
<tr class="subjectField" style="display:none"></tr>
</tbody>
I need to get information from all rows except those whose class name is 'subjectField'.
My code:
table = @f.div(:id => 'household').table(:class => 'color_table')
table.tbody.trs(:class => 'first_row', :class => 'odd_row', :class =>'even_row').each do |tr|
age = tr.td(:index => 3).text
puts age
end
This code returns all rows, including the subjectField rows.
Does anybody know how to make it work with only the rows I need?

To find everything except a class, use a regex with a negative lookahead:
table.trs(:class => /^(?!subjectField)/).size
If you want to get the text for each of these rows:
puts table.trs(:class => /^(?!subjectField)/).collect(&:text)
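If you want to get the text of the fourth column for each of these rows:
# Assign first: a do...end block after "puts x.collect" would bind to puts, not collect
ages = table.trs(:class => /^(?!subjectField)/).collect do |row|
row.td(:index => 3).text
end
puts ages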
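The negative lookahead can be tried in plain Ruby, without a browser, to see which class names it keeps. This sketch uses only the class names listed in the question; the second, positive pattern is an alternative shown for comparison and is not part of the answer above.

```ruby
# Class names taken from the rows in the question.
row_classes = %w[first_row subjectField odd_row subjectField even_row subjectField]

# (?!subjectField) is a negative lookahead: the match succeeds at the start
# of the string only if "subjectField" does not come next.
not_subject = /^(?!subjectField)/
kept = row_classes.select { |name| name.match?(not_subject) }

# Alternative: list the wanted classes explicitly instead of excluding one.
wanted = /\A(?:first|odd|even)_row\z/
kept_by_whitelist = row_classes.select { |name| name.match?(wanted) }
```

Both patterns keep first_row, odd_row and even_row here. The explicit list is stricter, since it would also drop any other class that might appear later.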

it is really simple:
table = @f.div(:id => 'household').table(:class => 'color_table')
table_element.count #it will display the count of all rows corresponding to specified table.
table_elements(:class => 'first_row').index #will return the array count [0]
table_elements(:class => 'even_row').index #will return the array count [2]

not a problem actually
table_elements(:class => 'first_row').text # if you need to take the text from the row with corresponding class
or
table_elements[0].text

Have you tried something like this
...(:xpath, "//table/tbody/tr[@class !='subjectField']")...

Getting attributed html element

I'm trying to get table with content of MMEL codes from this site and I'm trying to accomplish it with CSS Selectors.
What I've got so far is:
require_relative 'sources/Downloader'
require 'nokogiri'
html_content = Downloader.download_page('http://www.s-techent.com/ATA100.htm')
parsed_html = Nokogiri::HTML(html_content)
tmp = parsed_html.css("tr[*]")
puts tmp.text
I get an error when trying to select this tr by attribute. How can I get this table in a simple form? I want to convert it to JSON, so it would be nice to get it in sections and process each one in an .each block.
EDIT:
It'd be nice if I could get things in blocks like this (look at the page's source):
<TR><TD WIDTH="10%" VALIGN="TOP" ROWSPAN=5>
<B><FONT FACE="Arial" SIZE=2><P ALIGN="CENTER">11</B></FONT></TD>
<TD WIDTH="40%" VALIGN="TOP" COLSPAN=2>
<B><FONT FACE="Arial" SIZE=2><P>PLACARDS AND MARKINGS</B></FONT></TD>
<TD WIDTH="50%" VALIGN="TOP">
<FONT FACE="Arial" SIZE=2><P ALIGN="LEFT">All procurable placards, labels, etc., shall be included in the illustrated Parts Catalog. They shall be illustrated, showing the part number, Legend and Location. The Maintenance Manual shall provide the approximate Location (i.e., FWD -UPPER -RH) and illustrate each placard, label, marking, self -illuminating sign, etc., required for safety information, maintenance significant information or by government regulations. Those required by government regulations shall be so identified.</FONT></TD>
</TR>

This should print all those TR's from source at line 96. There are three tables in that page and table[1] has all the text you needed:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://www.s-techent.com/ATA100.htm'))
doc.css("table")[1].css("tr").each do |i|
puts i #=> prints the exact html between TR tags (including)
puts i.text #=> prints the text
end
For instance:
puts doc.css("table")[1].css("tr")[2]
prints the following:
<tr>
<td valign="TOP" colspan="3">
<b><font face="Arial" size="2"><p align="CENTER">GROUP DEFINITION - AIRCRAFT</p></font></b>
</td>
<td valign="TOP">
<font face="Arial" size="2"><p align="LEFT">The complete operational unit. Includes dimensions and areas, lifting and shoring, leveling and weighing, towing and taxiing, parking and mooring, required placards, servicing.</p></font>
</td>
</tr>

You could do the same using xpath also:
Below is the content from the first table of the webpage given in the post by OP:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri.HTML(URI.open('http://www.s-techent.com/ATA100.htm'))
doc.xpath('(//table)[1]/tr').each do |tr|
puts tr.to_html(:encoding => 'utf-8')
end
Output:
<tr>
<td width="33%" valign="MIDDLE" colspan="2">
<p><img src="S-Tech-Logo-Blue2.gif" width="274" height="127"></p>
</td>
<td width="67%" valign="MIDDLE">
<b><i><font face="Arial" color="#0000ff">
<p align="CENTER"><big>AIRCRAFT PARTS MANUFACTURING ASSISTANCE (PMA)</big><br><big>DAR SERVICES</big></p></font></i></b>
</td>
</tr>
Now, if you want to collect the last table rows, then do:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri.HTML(URI.open('http://www.s-techent.com/ATA100.htm'))
p doc.xpath('(//table)[3]/tr').to_a.size # => 1
doc.xpath('(//table)[3]/tr').each do |tr|
puts tr.to_html(:encoding => 'utf-8')
end
Output:
<tr>
<td width="40%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2" color="#0000ff">149 AZALEA CIRCLE • LIMERICK, PA 19468-1330</font></b></p>
</td>
<td width="30%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2" color="#0000ff">610-495-6898 (Office) • 484-680-0507 (Cell)</font></b></p>
</td>
<td width="110%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2">E-mail S-Tech</font></b></p>
</td>
</tr>
