What xpath query would solve this - ruby

What XPath query could I use to solve the problem below? I'm actually using Nokogiri (in Ruby), so ideally the answer would be in Ruby/Nokogiri form, but otherwise plain XPath is fine and I can adapt it.
Required Output
I'm seeking to parse the below HTML (a full html page, but I've just copy/pasted the relevant part for clarity), and end up with basically the following:
Phone Number Plan ID
545454545 12345
3434343434 67890
So in the context of Ruby/nokogiri this could be in a Hash for example:
% result = { "545454545" => "12345", "3434343434" => "67890" }
HTML to be Parsed
.
.
.
<form method="post">
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 12345 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 545454545 </td>
</tr>
.
.
.
</table>
</div>
<br>
.
.
.
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 67890 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 3434343434 </td>
</tr>
.
.
.
</table>
</div>
<br>

How about:
xpath = '//td[contains(text(),"Phone Number") or contains(text(),"Plan ID")]/following-sibling::td'
Hash[*doc.xpath(xpath).map{|x| x.text.strip}.reverse]

Assuming the lines you've replaced with periods do not contain data you want to collect, which would mean each table provides a unique result set, the following would work:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri.HTML DATA.read
results = {}
doc.search('table').each do |table|
  plan_id = table.at('tr[1]/td[2]')
  phone_number = table.at('tr[2]/td[2]')
  if plan_id && phone_number
    results[phone_number.text.strip] = plan_id.text.strip
  end
end
p results #=> {"545454545"=>"12345", "3434343434"=>"67890"}
__END__
<form method="post">
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 12345 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 545454545 </td>
</tr>
.
.
.
</table>
</div>
<br>
.
.
.
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 67890 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 3434343434 </td>
</tr>
.
.
.
</table>
</div>
<br>

Related

How to extract TD values only with Nokogiri?

I am extracting the forexfactory calendar. I can do a puts of the td values but it's printing the full HTML tags. How do I get the td value of a tag with class "calendar__time"?
My code is:
require 'httparty'
require 'nokogiri'
require 'pry'
require 'csv'
page = HTTParty.get('http://www.forexfactory.com/calendar.php?day=aug31.2016')
p = Nokogiri::HTML(page)
rows = p.css('tr.calendar_row')
rows.map do |row|
  puts row.css('td.calendar__date')
  puts row.css('td.calendar__time')
end
When I check with irb, it's returning it with the tags:
<td class="calendar__cell calendar__time time">9:45pm</td><a href="javascript:void(0);" class="calendarexpanded__graph" data-touchable><span>Graph</span></a> </td>
</tr>
</tbody>
</table>
</td>
</tr>
<td class="calendar__cell calendar__date date"></td>
The HTML snippet for this TR is:
<tr class="calendar__row calendar_row calendar__row--grey " data-eventid="62529" data-touchable>
<td class="calendar__cell calendar__date date"></td>
<td class="calendar__cell calendar__time time">2:00am</td>
<td class="calendar__cell calendar__currency currency">CHF</td>
<td class="calendar__cell calendar__impact impact calendar__impact calendar__impact--low">
<div class="calendar__impact-icon calendar__impact-icon--screen"> <span title="Low Impact Expected" class="low"></span> </div>
<div class="calendar__impact-icon calendar__impact-icon--print"> <img src="resources/images/icons/impact/impact-yellow.png" alt="" border="0" /> </div>
</td>
<td class="calendar__cell calendar__event event">
<div> <span class="calendar__event-title">UBS Consumption Indicator</span> </div>
</td>
<td class="calendar__cell calendar__detail detail"><a class="calendar__detail-link calendar__detail-link--level-1 calendar_detail level1" data-level="1"></a></td>
<td class="calendar__cell calendar__actual actual">1.32</td>
<td class="calendar__cell calendar__forecast forecast"></td>
<td class="calendar__cell calendar__previous previous"><span class="revised worse" title="Revised From 1.34">1.21</span></td>
<td class="calendar__cell calendar__graph graph"><a class="calendar__detail-link calendar__detail-link--graph-icon calendar_chart"></a></td>
</tr>
<tr class="calendar__row calendar__expand " data-eventid="62529">
<td> </td>
<td colspan="4" class="calendarexpanded__container">
<table class="calendarexpanded">
<tbody>
<tr>
<td class="calendarexpanded__cell"><strong>Actual</strong>1.32</td>
<td class="calendarexpanded__cell"><strong>Forecast</strong> </td>
<td class="calendarexpanded__cell"><strong>Previous</strong><span class="revised worse" title="Revised From 1.34">1.21</span></td>
<td class="calendarexpanded__cell calendarexpanded__cell--small"> <a href="javascript:void(0);" class="calendarexpanded__details calendarexpanded__details--1" data-touchable><span>Details</span></a> </td>
<td class="calendarexpanded__cell calendarexpanded__cell--small">
I got it to work by changing as follows:
puts row.css('td.calendar__date').text

Getting a table with Mechanize in Ruby

I'd like to get items from this table:
<table style="margin: auto;width: 800px" id="myTable" class="tablesorter">
<thead>
<tr class="TableHeader">
<th >Game</th><th>Icon</th><th>Achievement</th>
<th>Achievers</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td><img alt="Logo" src="http://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/440/07385eb55b5ba974aebbe74d3c99626bda7920b8.jpg" width=133 height=50 ></td>
<td> <table>
<tr>
<td class="AchievementBox" style="background-color: #347C17">
<a href="Steam_Achievement_Info.php?AchievementID=169&AppID=440"> <img alt="Icon" src="http://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/440/924764eea604817d3c14de9640ae6422c7cdfb7a.jpg" height='50' width='50'>
</a> </td>
</tr>
</table>
</td>
<td style="text-align: left" >Race for the Pennant<br>Run 25 kilometers.</td>
<td style="text-align: right">35505</td><td style="text-align: right">1.3</td>
The table has an id myTable so what I'd like to do is this:
go inside <tbody>
for each <tr> in table:
do something; maybe go inside <td> or get a link from <href>
I have:
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://astats.astats.nl/astats/TopListAchievements.php?DisplayType=2")
puts page.body
This prints the page but how do I actually iterate through the table rows?
Using a CSS selector, you can print the text and href attribute value of each matching link:
require 'nokogiri'
doc = Nokogiri::HTML(page.body)
doc.css('table#myTable tbody td[3] a').each { |a|
  puts a.text, a[:href]
}

Scraping info of a product that is spread among different <tr> elements?

I'm using something like the following to scrape the info of a page:
def self.parse_products
  product_hash = {}
  product = @data.css('.simGrid')
  product.css('td').each do |product|
    product_asin = product.css('.simImage a img').first.value[/(?<=\/)[A-Z\d]{5,}/]
    product_image_url = product.css('.simProductInfo a').to_s
    product_hash[:product] ||= []
    product_hash[:product] << { :image_url => product_image_url,
                                :asin => product_asin }
  end
  product_hash
end
The problem is that the structure is something like this:
<table class="simGrid">
<tbody>
<tr class="middle">
<td>
<div class="simImage"></div>
</td>
<td>
<div class="simImage"></div>
</td>
<td>
<div class="simImage"></div>
</td>
</tr>
<tr>
<td>
<div class="simProductInfo"></div>
</td>
<td>
<div class="simProductInfo"></div>
</td>
<td>
<div class="simProductInfo"></div>
</td>
</tr>
<tr>
<td>
<hr class="divider" />
</td>
<td>
<hr class="divider" />
</td>
<td>
<hr class="divider" />
</td>
</tr>
<tr class="middle">
<td>
<div class="simImage"></div>
</td>
<td>
<div class="simImage"></div>
</td>
<td>
<div class="simImage"></div>
</td>
</tr>
<tr>
<td>
<div class="simProductInfo"></div>
</td>
<td>
<div class="simProductInfo"></div>
</td>
<td>
<div class="simProductInfo"></div>
</td>
</tr>
<tr>
<td>
<hr class="divider" />
</td>
<td>
<hr class="divider" />
</td>
<td>
<hr class="divider" />
</td>
</tr>
</tbody>
</table>
So as you can see, the info for each product is spread across several <tr> elements. If I try to scrape them by <td>, I end up with many nil values, since some of the <td> elements contain .simImage and others don't. The same goes for .simProductInfo.
Has anyone encountered something similar before? Is there any workaround for this?
You can try collecting ASINs and URLs in two separate arrays and then zipping them.
asins = product.css('.simImage a img').map { |n| n.value[/(?<=\/)[A-Z\d]{5,}/] }
urls = product.css('.simProductInfo a').map(&:to_s)
asins.zip(urls).map { |asin, url| {image_url: url, asin: asin} }

Need help to locate the text of element with class?

I have a file that I got using the command page.css("table.vc_result span a"). I am not able to get the second and third span elements of the file:
File
<table border="0" bgcolor="#FFFFFF" onmouseout="resDef(this)" onmouseover="resEmp(this)" class="vc_result">
<tbody>
<tr>
<td width="260" valign="top">
<table>
<tbody>
<tr>
<td width="40%" valign="top"><span><a class="cAddName" href="/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733">
Gateway Megatech</a></span><br>
<span class="cAddText">P.O. BOX 99682, Chicago IL 60696</span></td>
</tr>
<tr>
<td><span class="cAddText">Cook County Illinois</span></td>
</tr>
<tr>
<td><span class="cAddCategory">Yellow Page Advertising And Telephone
Directory Publica Chicago</span></td>
</tr>
</tbody>
</table>
</td>
<td width="260">
<table align="center">
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<div style=
"background: url('images/listings.png');background-position: -0px -0px; width: 16px; height: 16px">
</div>
</td>
<td><font style="font-weight:bold">847-506-7800</font></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<div style=
"background: url('images/listings.png');background-position: -0px -78px; width: 16px; height: 16px">
</div>
</td>
<td><a href=
"/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733"
class="cAddNearby">Businesses near 60696</a></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
This is not the complete file; there are many more span entries in it.
The code I am using is able to locate the exact text, but it cannot associate it with the text of the spans that follow the nested span/a element.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
name="yellow"
city="Chicago"
state="IL"
burl="http://www.sitename.com/"
url="#{burl}Business_Listings.php?name=#{name}&city=#{city}&state=#{state}&current=1&Submit=Search"
page = Nokogiri::HTML(URI.open(url))
rows = page.css("table.vc_result span a")
rows.each do |arow|
  if arow.text == "Gateway Megatech"
    puts(arow.next_element.text)
    puts("Capturing the next span text")
    found="Got it"
    break
  else
    puts("Found nothing")
    found="None"
  end
end
Assuming that each business is a new <tr> inside the top table you have supplied, the following code gives you an array of Hashes with the values:
require 'nokogiri'
# html is assumed to hold the markup shown in the question
doc = Nokogiri.HTML(html)
business_rows = doc.css('table.vc_result > tbody > tr')
details = business_rows.map do |tr|
  # Inside the first <td> of the row, find a <td> with a.cAddName in it
  # (the leading "." keeps the predicate relative to the candidate <td>)
  business = tr.at_xpath('td[1]//td[.//a[@class="cAddName"]]')
  name = business.at_css('a.cAddName').text.strip
  address = business.at_css('.cAddText').text.strip
  # Inside the second <td> of the row, find the first <font> tag
  phone = tr.at_xpath('td[2]//font').text.strip
  # Return a hash of values for this row, using the capitalization requested
  { Name:name, Address:address, Phone:phone }
end
p details
#=> [
#=> {
#=> :Name=>"Gateway Megatech",
#=> :Address=>"P.O. BOX 99682, Chicago IL 60696",
#=> :Phone=>"847-506-7800"
#=> }
#=> ]
This is pretty fragile, but it works for what you've given, and there do not seem to be many semantic items to hang onto in this insane, horrific abuse of HTML.
Parsing HTML with regular expressions is a bad idea, because HTML is not a regular language. Ideally, you want to parse the DOM / XML to a tree structure.
http://nokogiri.org/ is pretty popular.

How to get the `HREF` values only when `<legend>tax</legend>`?

<fieldset class="attachmentTable large"><legend>SMF</legend>
<table cellspacing="2" cellpadding="2" border="0">
<tr>
<td>
<a href="
/aems/file/test.html">
</a>
</td>
<td>
foo
</td>
</tr>
</table>
</fieldset>
<fieldset class="attachmentTable large"><legend>tax</legend>
<table cellspacing="2" cellpadding="2" border="0">
<tr>
<td>
<a href="
/relf/file/test.html">
</a>
</td>
<td>
foo
</td>
</tr>
</table>
</fieldset>
I have the HTML source of a web page, part of which is given above. I want to get the HREF values only from the fieldset whose legend is <legend>tax</legend>. Could you help me with this?
I would do:
page.search('legend[text()="tax"] + table a').each do |a|
  puts a[:href]
end
