Need help to locate the text of element with class? - ruby

I have a file that I got using the command page.css("table.vc_result span a"), but I am not able to get the second and third span elements of the file:
File
<table border="0" bgcolor="#FFFFFF" onmouseout="resDef(this)" onmouseover="resEmp(this)" class="vc_result">
<tbody>
<tr>
<td width="260" valign="top">
<table>
<tbody>
<tr>
<td width="40%" valign="top"><span><a class="cAddName" href="/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733">
Gateway Megatech</a></span><br>
<span class="cAddText">P.O. BOX 99682, Chicago IL 60696</span></td>
</tr>
<tr>
<td><span class="cAddText">Cook County Illinois</span></td>
</tr>
<tr>
<td><span class="cAddCategory">Yellow Page Advertising And Telephone
Directory Publica Chicago</span></td>
</tr>
</tbody>
</table>
</td>
<td width="260">
<table align="center">
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<div style=
"background: url('images/listings.png');background-position: -0px -0px; width: 16px; height: 16px">
</div>
</td>
<td><font style="font-weight:bold">847-506-7800</font></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<div style=
"background: url('images/listings.png');background-position: -0px -78px; width: 16px; height: 16px">
</div>
</td>
<td><a href=
"/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733"
class="cAddNearby">Businesses near 60696</a></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
...This is not the complete file; there are plenty more span entries in it.
The code that I am using is able to locate the exact text of the nested span > a element, but I am not able to associate it with the text of the spans that follow it.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
name="yellow"
city="Chicago"
state="IL"
burl="http://www.sitename.com/"
url="#{burl}Business_Listings.php?name=#{name}&city=#{city}&state=#{state}&current=1&Submit=Search"
page = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer accepts URLs on Ruby 3.0+
rows = page.css("table.vc_result span a")
rows.each do |arow|
if arow.text == "Gateway Megatech"
puts(arow.next_element.text)
puts("Capturing the next span text")
found="Got it"
break
else
puts("Found nothing")
found="None"
end
end

Assuming that each business is a new <tr> inside the top table you have supplied, the following code gives you an array of Hashes with the values:
require 'nokogiri'
doc = Nokogiri.HTML(html)
business_rows = doc.css('table.vc_result > tbody > tr')
details = business_rows.map do |tr|
# Inside the first <td> of the row, find a <td> with a.cAddName in it
# (.//a keeps the search relative to the candidate <td>)
business = tr.at_xpath('td[1]//td[.//a[@class="cAddName"]]')
name = business.at_css('a.cAddName').text.strip
address = business.at_css('.cAddText').text.strip
# Inside the second <td> of the row, find the first <font> tag
phone = tr.at_xpath('td[2]//font').text.strip
# Return a hash of values for this row, using the capitalization requested
{ Name:name, Address:address, Phone:phone }
end
p details
#=> [
#=> {
#=> :Name=>"Gateway Megatech",
#=> :Address=>"P.O. BOX 99682, Chicago IL 60696",
#=> :Phone=>"847-506-7800"
#=> }
#=> ]
This is pretty fragile, but it works for what you've given, and there do not seem to be many semantic hooks to hang onto in this insane, horrific abuse of HTML.

Parsing HTML with regular expressions is a bad idea, because HTML is not a regular language. Ideally, you want to parse the DOM / XML to a tree structure.
http://nokogiri.org/ is pretty popular.

Related

xpath that exclude some specific elements

This is a simplified version of the HTML of the page that I want to analyse:
<table class="class_1">
<tbody>
<tr class="class_2">
<td class="class_3"> </td>
<td class="class_4"> </td>
<td class="class_5"> </td>
</tr>
<tr class="class_2">
<td class="class_3"> </td>
<td class="class_4"> </td>
<td class="class_5"><span class="class_6"></span>square</td>
</tr>
<tr class="class_2">
<td class="class_3"> </td>
<td class="class_4"> </td>
<td class="class_5"><span class="class_7"></span>circle</td>
</tr>
<tr class="class_2">
<td class="class_3"> </td>
<td class="class_4"> </td>
<td class="class_5"><span class="class_6"></span>triangle</td>
</tr>
</tbody>
</table>
You can find the page at
https://sabbiobet.netsons.org/test.html
If you try this function in Google Sheets:
=IMPORTXML("https://sabbiobet.netsons.org/test.html";"//td[@class='class_5']")
you will obtain:
square
circle
triangle
I need to obtain all the <td> with class="class_5" except the ones that contain only a non-breaking space (&nbsp;) or a <span class="class_7">.
In other words I want to obtain only these values:
Square
Triangle
can somebody help me?
The following XPath expression
//td[@class='class_5' and span and not(span[@class='class_7'])]
selects all td elements having an attribute class with value class_5, having a child element span and not having a child element span where its class attribute has the value class_7.
Note that you could also use
//td[@class='class_5' and span[@class='class_6']]
to get the same result in this case.
This should work:
//td[@class='class_5'][not(text()=' ')][not(./span[@class='class_7'])]
where [not(text()=' ')] is not testing for a regular space but for the non-breaking space character, Unicode U+00A0, which you can type on Windows with Alt+0160 (digits entered on the numeric keypad).

nokogiri parsing first td in tr ignoring specific class

I have the following html
<table>
<tr>
<th>value</th>
<th>description</th>
</tr>
<tr>
<td>OverallHealthScore</td>
<td>
Overall HealthScore.
</td>
</tr>
<tr>
<td class="deprecated">DESTAGED_TRACKS_PER_SEC</td>
<td>
The tracks per second saved into disks.
</td>
</tr>
</table>
There are many more tr's, but this is an excerpt showing the two scenarios.
I need to print out only OverallHealthScore:
table.css('tr').map do |row|
puts row.css('td:not(.deprecated)').map(&:text)[0]
end
This gets me almost there, but it prints the "description" td for the deprecated items. I can't figure out what I need to change to get the results I need.
Assuming you want the text of each first td that is not deprecated:
<table>
<tr>
<th>value</th>
<th>description</th>
</tr>
<tr>
<td>OverallHealthScore</td>
<td>
Overall HealthScore.
</td>
</tr>
<tr>
<td class="deprecated">DESTAGED_TRACKS_PER_SEC</td>
<td>
The tracks per second saved into disks.
</td>
</tr>
<tr>
<td>AvaiableAnother</td>
<td>
Another Available HealthScore.
</td>
</tr>
<tr>
<td class="deprecated">OTHER_DEPRE</td>
<td>
The tracks per second saved into disks.
</td>
</tr>
</table>
Then
puts table.css('td:first-child:not(.deprecated)').map(&:text)
# OverallHealthScore
# AvaiableAnother

Getting a table with Mechanize in Ruby

I'd like to get items from this table:
<table style="margin: auto;width: 800px" id="myTable" class="tablesorter">
<thead>
<tr class="TableHeader">
<th >Game</th><th>Icon</th><th>Achievement</th>
<th>Achievers</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td><img alt="Logo" src="http://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/440/07385eb55b5ba974aebbe74d3c99626bda7920b8.jpg" width=133 height=50 ></td>
<td> <table>
<tr>
<td class="AchievementBox" style="background-color: #347C17">
<a href="Steam_Achievement_Info.php?AchievementID=169&AppID=440"> <img alt="Icon" src="http://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/440/924764eea604817d3c14de9640ae6422c7cdfb7a.jpg" height='50' width='50'>
</a> </td>
</tr>
</table>
</td>
<td style="text-align: left" >Race for the Pennant<br>Run 25 kilometers.</td>
<td style="text-align: right">35505</td><td style="text-align: right">1.3</td>
The table has an id myTable so what I'd like to do is this:
go inside <tbody>
for each <tr> in table:
do something; maybe go inside <td> or get a link from <href>
I have:
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://astats.astats.nl/astats/TopListAchievements.php?DisplayType=2")
puts page.body
This prints the page but how do I actually iterate through the table rows?
Use a CSS selector to print the text and the href attribute of each link:
require 'nokogiri'
doc = Nokogiri::HTML(page.body)
doc.css('table#myTable tbody td[3] a').each {|a|
puts a.text, a[:href]
}

What xpath query would solve this

What XPath query could I use to solve the problem below? I'm actually using Nokogiri (in Ruby), so ideally the answer would be in Nokogiri form, but plain XPath is fine too and I can adapt it.
Required Output
I'm seeking to parse the below HTML (a full html page, but I've just copy/pasted the relevant part for clarity), and end up with basically the following:
Phone Number Plan ID
545454545 12345
3434343434 67890
So in the context of Ruby/nokogiri this could be in a Hash for example:
% result = { "545454545" => "12345", "3434343434" => "67890" }
HTML to be Parsed
.
.
.
<form method="post">
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 12345 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 545454545 </td>
</tr>
.
.
.
</table>
</div>
<br>
.
.
.
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 67890 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 3434343434 </td>
</tr>
.
.
.
</table>
</div>
<br>
How about:
xpath = '//td[contains(text(),"Phone Number") or contains(text(),"Plan ID")]/following-sibling::td'
Hash[*doc.xpath(xpath).map{|x| x.text.strip}.reverse]
Assuming the lines you've replaced with periods do not contain data you want to collect, which would mean each table provides a unique result set, the following would work:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri.HTML DATA.read
results = {}
doc.search('table').each do |table|
plan_id = table.at('tr[1]/td[2]')
phone_number = table.at('tr[2]/td[2]')
if plan_id && phone_number
results[phone_number.text.strip] = plan_id.text.strip
end
end
p results #=> {"545454545"=>"12345", "3434343434"=>"67890"}
__END__
<form method="post">
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 12345 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 545454545 </td>
</tr>
.
.
.
</table>
</div>
<br>
.
.
.
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 67890 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 3434343434 </td>
</tr>
.
.
.
</table>
</div>
<br>

How do I use Nokogiri and Ruby to scrape values from HTML with nested tables?

I am trying to extract the name, ID, Phone, Email, Gender, Ethnicity, DOB, Class, Major, School and GPA from a page I am parsing with Nokogiri.
I tried several different XPath expressions, but everything I try grabs much more than I want:
<span class="subTitle"><b>Recruit Profile</b></span>
<br><table border="0" width="100%"><tr>
<td>
<table bgcolor="#afafaf" border="0" cellpadding="0" width="100%">
<tr>
<td>
<table bgcolor="#cccccc" border="0" cellpadding="2" cellspacing="2" width="100%">
<tr>
<td bgcolor="#dddddd"><b>Name</b></td>
<td bgcolor="#dddddd">Some Person</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>EDU ID</b></td>
<td bgcolor="#dddddd">A12345678</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Phone</b></td>
<td bgcolor="#dddddd">123-456-7890</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Address</b></td>
<td bgcolor="#dddddd">1234 Somewhere Dr.<br>City ST, 12345</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Email</b></td>
<td bgcolor="#dddddd">someone@email.com</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Gender</b></td>
<td bgcolor="#dddddd">Female</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Ethnicity</b></td>
<td bgcolor="#dddddd">Unknown</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Date of Birth</b></td>
<td bgcolor="#dddddd">Jan 1st, 1901</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Class</b></td>
<td bgcolor="#dddddd">Sophomore</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Major</b></td>
<td bgcolor="#dddddd">Biology</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>School</b></td>
<td bgcolor="#dddddd">University of Somewhere</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>GPA</b></td>
<td bgcolor="#dddddd">0.00</td>
</tr>
<tr>
<td bgcolor="#dddddd" valign="top"><b>Availability</b></td>
<td bgcolor="#dddddd">
<table border="0" cellspacing="0" cellpadding="0">
<tr>
I assume that there will be many "Recruit Profile" spans that are followed by tables that wrap up all the details. The following method takes your entire HTML page, finds just those spans, and for each of them it finds the following table and then finds the fields you want anywhere below that table:
require 'nokogiri'
# Pass in or set the array of labels you want to use
# Returns an array of hashes mapping these labels to the values
def recruits_details(html,fields=%W[Name #{"EDU ID"} Phone Email Gender])
doc = Nokogiri::HTML(html)
recruit_labels = doc.xpath('//span[b[text()="Recruit Profile"]]')
recruit_labels.map do |recruit_label|
recruit_table = recruit_label.at_xpath('following-sibling::table')
Hash[ fields.map do |field_label|
label_td = recruit_table.at_xpath(".//td[b[text()='#{field_label}']]")
[field_label, label_td.at_xpath('following-sibling::td/text()').text ]
end ]
end
end
require 'pp'
pp recruits_details(html_string)
#=> [{"Name"=>"Some Person",
#=> "EDU ID"=>"A12345678",
#=> "Phone"=>"123-456-7890",
#=> "Email"=>"someone@email.com",
#=> "Gender"=>"Female"}]
An XPath expression like .//foo[bar[text()="jim"]] means:
Find a 'foo' element anywhere under the current node
...but only if it has a 'bar' element as a child
...but only if that 'bar' element has the text "jim" as its content
An XPath expression like following-sibling::... means: find any elements that are siblings after the current node and that match the expression ...
The XPath expression .../text() selects the Text node; the text method is used to extract the value (actual string) of that text node.
Nokogiri's xpath method returns a NodeSet (an array-like collection) of all nodes matching the expression, while the at_xpath method returns only the first matching node, or nil if nothing matches.

Resources