Getting a table with Mechanize in Ruby - ruby

I'd like to get items from this table:
<table style="margin: auto;width: 800px" id="myTable" class="tablesorter">
<thead>
<tr class="TableHeader">
<th >Game</th><th>Icon</th><th>Achievement</th>
<th>Achievers</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td><img alt="Logo" src="http://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/440/07385eb55b5ba974aebbe74d3c99626bda7920b8.jpg" width=133 height=50 ></td>
<td> <table>
<tr>
<td class="AchievementBox" style="background-color: #347C17">
<a href="Steam_Achievement_Info.php?AchievementID=169&AppID=440"> <img alt="Icon" src="http://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/440/924764eea604817d3c14de9640ae6422c7cdfb7a.jpg" height='50' width='50'>
</a> </td>
</tr>
</table>
</td>
<td style="text-align: left" >Race for the Pennant<br>Run 25 kilometers.</td>
<td style="text-align: right">35505</td><td style="text-align: right">1.3</td>
The table has an id myTable so what I'd like to do is this:
go inside <tbody>
for each <tr> in table:
do something; maybe go inside <td> or get a link from <href>
I have:
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://astats.astats.nl/astats/TopListAchievements.php?DisplayType=2")
puts page.body
This prints the page but how do I actually iterate through the table rows?

Parse page.body with Nokogiri and use a CSS selector to print the text and href attribute of each matching link:
require 'nokogiri'
doc = Nokogiri::HTML(page.body)
doc.css('table#myTable tbody td[3] a').each {|a|
puts a.text, a[:href]
}

Related

How to extract TD values only with Nokogiri?

I am extracting the forexfactory calendar. I can do a puts of the td values but it's printing the full HTML tags. How do I get the td value of a tag with class "calendar__time"?
My code is:
require 'httparty'
require 'nokogiri'
require 'pry'
require 'csv'
page = HTTParty.get('http://www.forexfactory.com/calendar.php?day=aug31.2016')
p= Nokogiri::HTML(page)
rows=p.css('tr.calendar_row')
rows.map do |row|
puts row.css('td.calendar__date')
puts row.css('td.calendar__time')
end
When I check with irb, it's returning it with the tags:
<td class="calendar__cell calendar__time time">9:45pm</td>
<td class="calendar__cell calendar__date date"></td>
The HTML snippet for this TR is:
<tr class="calendar__row calendar_row calendar__row--grey " data-eventid="62529" data-touchable>
<td class="calendar__cell calendar__date date"></td>
<td class="calendar__cell calendar__time time">2:00am</td>
<td class="calendar__cell calendar__currency currency">CHF</td>
<td class="calendar__cell calendar__impact impact calendar__impact calendar__impact--low">
<div class="calendar__impact-icon calendar__impact-icon--screen"> <span title="Low Impact Expected" class="low"></span> </div>
<div class="calendar__impact-icon calendar__impact-icon--print"> <img src="resources/images/icons/impact/impact-yellow.png" alt="" border="0" /> </div>
</td>
<td class="calendar__cell calendar__event event">
<div> <span class="calendar__event-title">UBS Consumption Indicator</span> </div>
</td>
<td class="calendar__cell calendar__detail detail"><a class="calendar__detail-link calendar__detail-link--level-1 calendar_detail level1" data-level="1"></a></td>
<td class="calendar__cell calendar__actual actual">1.32</td>
<td class="calendar__cell calendar__forecast forecast"></td>
<td class="calendar__cell calendar__previous previous"><span class="revised worse" title="Revised From 1.34">1.21</span></td>
<td class="calendar__cell calendar__graph graph"><a class="calendar__detail-link calendar__detail-link--graph-icon calendar_chart"></a></td>
</tr>
<tr class="calendar__row calendar__expand " data-eventid="62529">
<td> </td>
<td colspan="4" class="calendarexpanded__container">
<table class="calendarexpanded">
<tbody>
<tr>
<td class="calendarexpanded__cell"><strong>Actual</strong>1.32</td>
<td class="calendarexpanded__cell"><strong>Forecast</strong> </td>
<td class="calendarexpanded__cell"><strong>Previous</strong><span class="revised worse" title="Revised From 1.34">1.21</span></td>
<td class="calendarexpanded__cell calendarexpanded__cell--small"> <a href="javascript:void(0);" class="calendarexpanded__details calendarexpanded__details--1" data-touchable><span>Details</span></a> </td>
<td class="calendarexpanded__cell calendarexpanded__cell--small"> <a href="javascript:void(0);" class="calendarexpanded__graph" data-touchable><span>Graph</span></a> </td>
</tr>
</tbody>
</table>
</td>
</tr>
I got it to work by changing as follows:
puts row.css('td.calendar__date').text

Nokogiri: parse, extract and return <tr> content in HTML table

I am trying to parse a HTML table. It is basically the sixth <tr> tag in the HTML:
<HTML>
<HEAD>
<TITLE>date</TITLE>
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">
</HEAD>
<BODY bgcolor="white">
<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=right colspan=2 id=ptitle name=ptitle>
<font size=3>this is my title</font><br>
</td>
</tr>
<tr>
<td height=10 align=left colspan=2 valign=top>
<table border=0 width="100%" cellpadding=0 cellspacing=0>
<tr>
<td width="50%" align=right><font size=2>this is my subtitle</font></td>
</tr>
</table>
</td>
</tr>
<td valign=top>
<table border=0 cellpadding=0 cellspacing=0>
<tr>
this is a line
</tr>
<tr>
this is a line</tr>
<tr>
this is a line</tr>
<tr>
this is a line</tr>
<tr>
this is a line</tr>
<tr>
this is a line</tr>
<tr>
this is a line</tr>
<tr>
this is a line</tr>
<tr>
this is a line</tr>
</table>
</td>
</tr>
</table>
<br>
</BODY>
</HTML>
My Ruby code looks like this:
require 'nokogiri'
require 'open-uri'
url = <website-name>
data = Nokogiri::HTML(open(url))
data.at('<tr>').next[6].text
But it won't work. How do I use Nokogiri to extract all of these <tr>this is a line</tr> rows?
Ideally I'd like them in one variable, including the HTML, as I would like to put it into another website.
Thanks a lot!
This way:
data = Nokogiri::HTML(URI.open(url)) # needs require 'open-uri'; plain open(url) no longer fetches URLs on Ruby 3.0+
rows = data.css("td[valign='top'] table tr") # All the <tr>this is a line</tr>
rows.each do |row|
puts row.text # Will print all the 'this is a line'
end

Need help to locate the text of element with class?

I have a file from which I select elements with page.css("table.vc_result span a"), but I am not able to get the second and third span elements of the file:
File
<table border="0" bgcolor="#FFFFFF" onmouseout="resDef(this)" onmouseover="resEmp(this)" class="vc_result">
<tbody>
<tr>
<td width="260" valign="top">
<table>
<tbody>
<tr>
<td width="40%" valign="top"><span><a class="cAddName" href="/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733">
Gateway Megatech</a></span><br>
<span class="cAddText">P.O. BOX 99682, Chicago IL 60696</span></td>
</tr>
<tr>
<td><span class="cAddText">Cook County Illinois</span></td>
</tr>
<tr>
<td><span class="cAddCategory">Yellow Page Advertising And Telephone
Directory Publica Chicago</span></td>
</tr>
</tbody>
</table>
</td>
<td width="260">
<table align="center">
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<div style=
"background: url('images/listings.png');background-position: -0px -0px; width: 16px; height: 16px">
</div>
</td>
<td><font style="font-weight:bold">847-506-7800</font></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<div style=
"background: url('images/listings.png');background-position: -0px -78px; width: 16px; height: 16px">
</div>
</td>
<td><a href=
"/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733"
class="cAddNearby">Businesses near 60696</a></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
... This is not the complete file; there are plenty more span entries in it.
The code that I am using is able to locate the exact text but not able to associate it with the text of the nested element Span A.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
name="yellow"
city="Chicago"
state="IL"
burl="http://www.sitename.com/"
url="#{burl}Business_Listings.php?name=#{name}&city=#{city}&state=#{state}&current=1&Submit=Search"
page = Nokogiri::HTML(open(url))
rows = page.css("table.vc_result span a")
rows.each do |arow|
if arow.text == "Gateway Megatech"
puts(arow.next_element.text)
puts("Capturing the next span text")
found="Got it"
break
else
puts("Found nothing")
found="None"
end
end
Assuming that each business is a new <tr> inside the top table you have supplied, the following code gives you an array of Hashes with the values:
require 'nokogiri'
doc = Nokogiri.HTML(html)
business_rows = doc.css('table.vc_result > tbody > tr')
details = business_rows.map do |tr|
# Inside the first <td> of the row, find a <td> with a.cAddName in it
business = tr.at_xpath('td[1]//td[.//a[@class="cAddName"]]')
name = business.at_css('a.cAddName').text.strip
address = business.at_css('.cAddText').text.strip
# Inside the second <td> of the row, find the first <font> tag
phone = tr.at_xpath('td[2]//font').text.strip
# Return a hash of values for this row, using the capitalization requested
{ Name:name, Address:address, Phone:phone }
end
p details
#=> [
#=> {
#=> :Name=>"Gateway Megatech",
#=> :Address=>"P.O. BOX 99682, Chicago IL 60696",
#=> :Phone=>"847-506-7800"
#=> }
#=> ]
This is pretty fragile, but works for what you've given, and there do not seem to be very many semantic items to hang onto in this insane, horrific abuse of HTML.
Parsing HTML with regular expressions is a bad idea, because HTML is not a regular language. Ideally, you want to parse the DOM / XML to a tree structure.
http://nokogiri.org/ is pretty popular.

What xpath query would solve this

What XPath query could I use to solve the problem below? I'm actually using Nokogiri (in Ruby), so ideally the answer would be in Ruby/Nokogiri form, but otherwise plain XPath is fine and I can adapt it.
Required Output
I'm seeking to parse the below HTML (a full html page, but I've just copy/pasted the relevant part for clarity), and end up with basically the following:
Phone Number Plan ID
545454545 12345
3434343434 67890
So in the context of Ruby/nokogiri this could be in a Hash for example:
% result = { "545454545" => "12345", "3434343434" => "67890" }
HTML to be Parsed
.
.
.
<form method="post">
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 12345 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 545454545 </td>
</tr>
.
.
.
</table>
</div>
<br>
.
.
.
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 67890 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 3434343434 </td>
</tr>
.
.
.
</table>
</div>
<br>
How about:
xpath = '//td[contains(text(),"Phone Number") or contains(text(),"Plan ID")]/following-sibling::td'
Hash[*doc.xpath(xpath).map{|x| x.text.strip}.reverse]
Assuming the lines you've replaced with periods do not contain data you want to collect, which would mean each table provides a unique result set, the following would work:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri.HTML DATA.read
results = {}
doc.search('table').each do |table|
plan_id = table.at('tr[1]/td[2]')
phone_number = table.at('tr[2]/td[2]')
if plan_id && phone_number
results[phone_number.text.strip] = plan_id.text.strip
end
end
p results #=> {"545454545"=>"12345", "3434343434"=>"67890"}
__END__
<form method="post">
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 12345 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 545454545 </td>
</tr>
.
.
.
</table>
</div>
<br>
.
.
.
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 67890 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 3434343434 </td>
</tr>
.
.
.
</table>
</div>
<br>

How do I use Nokogiri and Ruby to scrape values from HTML with nested tables?

I am trying to extract the name, ID, Phone, Email, Gender, Ethnicity, DOB, Class, Major, School and GPA from a page I am parsing with Nokogiri.
I tried some different XPaths, but everything I try grabs much more than I want:
<span class="subTitle"><b>Recruit Profile</b></span>
<br><table border="0" width="100%"><tr>
<td>
<table bgcolor="#afafaf" border="0" cellpadding="0" width="100%">
<tr>
<td>
<table bgcolor="#cccccc" border="0" cellpadding="2" cellspacing="2" width="100%">
<tr>
<td bgcolor="#dddddd"><b>Name</b></td>
<td bgcolor="#dddddd">Some Person</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>EDU ID</b></td>
<td bgcolor="#dddddd">A12345678</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Phone</b></td>
<td bgcolor="#dddddd">123-456-7890</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Address</b></td>
<td bgcolor="#dddddd">1234 Somewhere Dr.<br>City ST, 12345</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Email</b></td>
<td bgcolor="#dddddd">someone@email.com</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Gender</b></td>
<td bgcolor="#dddddd">Female</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Ethnicity</b></td>
<td bgcolor="#dddddd">Unknown</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Date of Birth</b></td>
<td bgcolor="#dddddd">Jan 1st, 1901</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Class</b></td>
<td bgcolor="#dddddd">Sophomore</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Major</b></td>
<td bgcolor="#dddddd">Biology</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>School</b></td>
<td bgcolor="#dddddd">University of Somewhere</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>GPA</b></td>
<td bgcolor="#dddddd">0.00</td>
</tr>
<tr>
<td bgcolor="#dddddd" valign="top"><b>Availability</b></td>
<td bgcolor="#dddddd">
<table border="0" cellspacing="0" cellpadding="0">
<tr>
I assume that there will be many "Recruit Profile" spans that are followed by tables that wrap up all the details. The following method takes your entire HTML page, finds just those spans, and for each of them it finds the following table and then finds the fields you want anywhere below that table:
require 'nokogiri'
# Pass in or set the array of labels you want to use
# Returns an array of hashes mapping these labels to the values
def recruits_details(html,fields=%W[Name #{"EDU ID"} Phone Email Gender])
doc = Nokogiri::HTML(html)
recruit_labels = doc.xpath('//span[b[text()="Recruit Profile"]]')
recruit_labels.map do |recruit_label|
recruit_table = recruit_label.at_xpath('following-sibling::table')
Hash[ fields.map do |field_label|
label_td = recruit_table.at_xpath(".//td[b[text()='#{field_label}']]")
[field_label, label_td.at_xpath('following-sibling::td/text()').text ]
end ]
end
end
require 'pp'
pp recruits_details(html_string)
#=> [{"Name"=>"Some Person",
#=> "EDU ID"=>"A12345678",
#=> "Phone"=>"123-456-7890",
#=> "Email"=>"someone@email.com",
#=> "Gender"=>"Female"}]
An XPath expression like .//foo[bar[text()="jim"]] means:
Find a 'foo' element anywhere under the current node
...but only if it has a 'bar' element as a child
...but only if that 'bar' element has the text "jim" as its content
An XPath expression like following-sibling::... means: find any elements that are siblings after the current node and that match the expression ...
The XPath expression .../text() selects the Text node; the text method is used to extract the value (actual string) of that text node.
Nokogiri's xpath method returns a NodeSet (an array-like collection) of all nodes matching the expression, while the at_xpath method returns the first matching node, or nil if nothing matches.
