How to extract TD values only with Nokogiri? - ruby

I am extracting the forexfactory calendar. I can do a puts of the td values but it's printing the full HTML tags. How do I get the td value of a tag with class "calendar__time"?
My code is:
require 'HTTParty'
require 'Nokogiri'
require 'Pry'
require 'csv'
page = HTTParty.get('http://www.forexfactory.com/calendar.php?day=aug31.2016')
p= Nokogiri::HTML(page)
rows=p.css('tr.calendar_row')
rows.map do |row|
puts row.css('td.calendar__date')
puts row.css('td.calendar__time')
end
When I check with irb, it's returning it with the tags:
<td class="calendar__cell calendar__time time">9:45pm</td><a href="javascript:void(0);" class="calendarexpanded__graph" data-touchable><span>Graph</span></a> </td>
</tr>
</tbody>
</table>
</td>
</tr>
<td class="calendar__cell calendar__date date"></td>
The HTML snippet for this TR is:
<tr class="calendar__row calendar_row calendar__row--grey " data-eventid="62529" data-touchable>
<td class="calendar__cell calendar__date date"></td>
<td class="calendar__cell calendar__time time">2:00am</td>
<td class="calendar__cell calendar__currency currency">CHF</td>
<td class="calendar__cell calendar__impact impact calendar__impact calendar__impact--low">
<div class="calendar__impact-icon calendar__impact-icon--screen"> <span title="Low Impact Expected" class="low"></span> </div>
<div class="calendar__impact-icon calendar__impact-icon--print"> <img src="resources/images/icons/impact/impact-yellow.png" alt="" border="0" /> </div>
</td>
<td class="calendar__cell calendar__event event">
<div> <span class="calendar__event-title">UBS Consumption Indicator</span> </div>
</td>
<td class="calendar__cell calendar__detail detail"><a class="calendar__detail-link calendar__detail-link--level-1 calendar_detail level1" data-level="1"></a></td>
<td class="calendar__cell calendar__actual actual">1.32</td>
<td class="calendar__cell calendar__forecast forecast"></td>
<td class="calendar__cell calendar__previous previous"><span class="revised worse" title="Revised From 1.34">1.21</span></td>
<td class="calendar__cell calendar__graph graph"><a class="calendar__detail-link calendar__detail-link--graph-icon calendar_chart"></a></td>
</tr>
<tr class="calendar__row calendar__expand " data-eventid="62529">
<td> </td>
<td colspan="4" class="calendarexpanded__container">
<table class="calendarexpanded">
<tbody>
<tr>
<td class="calendarexpanded__cell"><strong>Actual</strong>1.32</td>
<td class="calendarexpanded__cell"><strong>Forecast</strong> </td>
<td class="calendarexpanded__cell"><strong>Previous</strong><span class="revised worse" title="Revised From 1.34">1.21</span></td>
<td class="calendarexpanded__cell calendarexpanded__cell--small"> <a href="javascript:void(0);" class="calendarexpanded__details calendarexpanded__details--1" data-touchable><span>Details</span></a> </td>
<td class="calendarexpanded__cell calendarexpanded__cell--small">

I got it to work by changing as follows:
puts row.css('td.calendar__date').text

Related

HTML signature leaves big spaces between rows (Gmail App from Outlook)

This is how it looks in GMail mobile app when sent from Outlook:
How can I avoid those big gaps?
My code is as follows:
<table id="sig" width='320' cellspacing='0' cellpadding='0' border-spacing='0' style="width:320px;margin:0;padding:0;">
<tr>
<td valign='top' width="120" height="48" style="width:120px;height:48px;margin:0;padding:0;vertical-align:top;">
<a style="border:none;text-decoration:none;">
<img moz-do-not-send="true" src="https://s3.amazonaws.com/media_crisalix/signatures/logo.jpg" alt="Crisalix" width='120' height='48' style="border:none;width:120px;height:48px;display:block;">
</a>
</td>
</tr>
<tr>
<td>
<table id="sig1" cellspacing='0' width='320' cellpadding='0' border-spacing='0' style="padding:0;margin:0;font-family:sans-serif,Arial,'Helvetica Neue',Helvetica;mso-line-height-rule:exactly;line-height:11px;color:#b0b0b0;border-collapse:collapse;-webkit-text-size-adjust:none;width:320px;">
<tr style="margin:0;padding:0;">
<td style="width:320px;margin:0;padding:0;font-family:sans-serif,Arial,'Helvetica Neue',Helvetica;white-space:nowrap;font-weight:600;line-height:1.6;font-size:13px;">
<span style="color:#137191">Jaime</span>
</td>
</tr>
<tr style="margin:0;padding:0;">
<td style="width:320px;margin:0;padding:0;font-family:sans-serif,Arial,'Helvetica Neue',Helvetica;white-space:nowrap;font-size:12px;line-height:1">
<span style="color:#555555">Chief Executive Officer</span>
</td>
</tr>
<tr>
<td valign='top' width="27" height='21' style="width:27px;height:1px;margin:0;padding:0;vertical-align:top;">
<img moz-do-not-send="true" src="https://s3.amazonaws.com/media_crisalix/signatures/separator.jpg" alt="Crisalix" width='27' height='21' style="border:none;width:27px;height:21px;display:block;">
</td>
</tr>
<tr style="margin:0;padding:0;">
<td style="width:320px;margin:0;padding:0;font-family:sans-serif,Arial,'Helvetica Neue',Helvetica;white-space:nowrap;font-size:12px;line-height:1.4;">
<div><span style="color:#137191;font-weight:bold">P / </span><span style="color:#555555;"></span></div>
<div><span style="color:#137191;font-weight:bold">A / </span><span style="color:#555555">Parc Scientifique (PSE-A) - EPFL1015</span></div>
<div> <span style="color:#555555"> Lausanne | Switzerland </span></div>
</td>
</tr>
<tr style="margin:0;padding:0;">
<td style="margin:0;padding:0;font-family:sans-serif,Arial,'Helvetica Neue',Helvetica;white-space:nowrap;font-weight:600;line-height:1.6;font-size:13px;width:320px;">
<span style="color:#137191;border:none;text-decoration:none!important;color:#137191;">www.crisalix.com</span>
</td>
</tr>
<tr>
<td valign='top' width="27" height='21' style="width:27px;height:1px;margin:0;padding:0;vertical-align:top;">
<img moz-do-not-send="true" src="https://s3.amazonaws.com/media_crisalix/signatures/separator.jpg" alt="Crisalix" width='27' height='21' style="border:none;width:27px;height:21px;display:block;">
</td>
</tr>
</td>
</table>
</tr>
<tr>
<td valign='top' width="230" height="225" style="width:230px;height:225px;margin:0;padding:0;vertical-align:top;">
<a href='http://www.crisalix.com' title="Crisalix" style="border:none;text-decoration:none;">
<img moz-do-not-send="true" src="https://s3.amazonaws.com/media_crisalix/signatures/signature-banner.jpg" alt="Crisalix" width='230' height='225' style="border:none;width:230px;height:225px;display:block;">
</a>
</td>
</tr>
</table>

Unable to select column with its header through XPath

My HTML
<table id="flex1" cellspacing="0" cellpadding="0" border="0">
<thead>
<tr class="hDiv">
<th width="6%">
<div class="text-left field-sorting asc" rel="IFSC_CODE"> IFSC CODE </div>
</th>
<th width="6%">
<div class="text-left field-sorting " rel="BRANCH_NAME"> BRANCH NAME </div>
</th>
</tr>
</thead>
<tbody>
<tr>
<td class="sorted" width="6%">
<div class="text-left">SACS011151</div>
</td>
<td width="6%">
<div class="text-left">check</div>
</td>
</tr>
<tr class="erow">
<td class="sorted" width="6%">
<div class="text-left">SACS011152</div>
</td>
<td width="6%">
<div class="text-left">Motiram</div>
</td>
</tr>
<tr class="erow">
<td class="sorted" width="6%">
<div class="text-left">SACS011158</div>
</td>
<td width="6%">
<div class="text-left">TESTNAME</div>
</td>
</tr>
</tbody>
</table>
My XPath
//table/tbody/tr/td[count(//table/thead/tr/th[.='BRANCH NAME']/preceding-sibling::th)+4]
Above XPath is Selecting all the column but not selecting its header name 'BRANCH NAME' and I want to select the header name with all its column.Any Idea how to do this?
You can simply use xpath union operator (|) to combine two xpath queries, for example* :
//table/tbody/tr/td[count(//table/thead/tr/th[.='BRANCH NAME']/preceding-sibling::th)+4]
|
//table/thead/tr/th[.='BRANCH NAME']
*: formatted into multiple lines just to make it visible without horizontal scroll

Getting a table with Mechanize in Ruby

I'd like to get items from this table:
<table style="margin: auto;width: 800px" id="myTable" class="tablesorter">
<thead>
<tr class="TableHeader">
<th >Game</th><th>Icon</th><th>Achievement</th>
<th>Achievers</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td><img alt="Logo" src="http://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/440/07385eb55b5ba974aebbe74d3c99626bda7920b8.jpg" width=133 height=50 ></td>
<td> <table>
<tr>
<td class="AchievementBox" style="background-color: #347C17">
<a href="Steam_Achievement_Info.php?AchievementID=169&AppID=440"> <img alt="Icon" src="http://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/440/924764eea604817d3c14de9640ae6422c7cdfb7a.jpg" height='50' width='50'>
</a> </td>
</tr>
</table>
</td>
<td style="text-align: left" >Race for the Pennant<br>Run 25 kilometers.</td>
<td style="text-align: right">35505</td><td style="text-align: right">1.3</td>
The table has an id myTable so what I'd like to do is this:
go inside <tbody>
for each <tr> in table:
do something; maybe go inside <td> or get a link from <href>
I have:
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://astats.astats.nl/astats/TopListAchievements.php?DisplayType=2")
puts page.body
This prints the page but how do I actually iterate through the table rows?
Using css selector, to print text and href attribute values:
require 'nokogiri'
doc = Nokogiri::HTML(page.body)
doc.css('table#myTable tbody td[3] a').each {|a|
puts a.text, a[:href]
}

Parsing a table on a webpage without id's or classes - using Nokogiri or xpath

I wish to parse through a epinions.com page to gather some statistics about a few companies. Epinions have almost no id's or classes, so it's quite difficult to parse the site.
I need to loop through all <tr bgcolor="white"> objects. I have put in 2 samples of this.
From the sample 1, I need to extract:
The alt on this line:
<img src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif" alt="Store Rating: 5.0" width="79" height="13" border="0">
The href this line:
CHUMBO ROCKS!
The author at this line:
<span class="rgr">by whitey436, Jan 18, 2006
Here is sample 1:
<tr bgcolor="white">
<td style="padding:10px 5px" align="right" valign="top" height="100%">
<table cellspacing="4" cellpadding="0" border="0" width=100% height="100%">
<tr valign="top">
<td class="rkr" nowrap>Overall Rating:</td>
<td width=80>
<img src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif" alt="Store Rating: 5.0" width="79" height="13" border="0">
</td>
</tr>
<span class="rgr">
<tr>
<td class="rgr" nowrap>Ease of Ordering:</td>
<td>
<img src="http://img.epinions.com/images/epi_images/e3/quant_5.gif" width=80 height=11>
</td>
</tr>
<tr>
<td class="rgr" nowrap>Customer Service:</td>
<td>
<img src="http://img.epinions.com/images/epi_images/e3/quant_5.gif" width=80 height=11>
</td>
</tr>
<tr>
<td class="rgr" nowrap>Selection:</td>
<td>
<img src="http://img.epinions.com/images/epi_images/e3/quant_5.gif" width=80 height=11>
</td>
</tr>
<tr>
<td class="rgr" nowrap>On-Time Delivery:</td>
<td>
<img src="http://img.epinions.com/images/epi_images/e3/quant_5.gif" width=80 height=11>
</td>
</tr>
</span>
<tr valign="bottom" height="100%">
<td class="rkb" colspan="2">
<div align="center"> </div>
<div align="center"> </div>
</td>
</tr>
</table>
</td>
<td style="padding:10px;" colspan=2 width="100%" align="left" valign="top">
<h2 style="font-family:arial,helvetica,sans-serif; font-size:87%; color:#000000; font-weight:bold; margin-bottom:0px;">
CHUMBO ROCKS!
</h2>
<span style="line-height:110%">
<span class="rgr">by whitey436, Jan 18, 2006
Rated a <span style="color:#000;">Very Helpful Review</span> by the Epinions community</span>
</span>
<span class="rkr">
<div style="padding:5px 0px"> Its just this simple, I tried buying this receiver from another online supplier who had the lowest price only to find they didnt have any of these units and they wanted to sell me extra warranty then tried to sell a different model in stock from Yamaha ...</div>
<b>
Read the full review
</b>
</span>
</td>
</tr>
From the sample 2, I need to extract:
The alt on this line:
<img src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif" alt="Store Rating: 5.0" width="79" height="13" border="0">
The href on this line:
Read more
The author at this line:
<span class="rgr">by whitey436, Jan 18, 2006
Rated a <span style="color:#000;">Very Helpful Review</span> by the Epinions community</span>
Here is sample 2:
<tr bgcolor="white">
<td style="padding:10px 5px" align="right" valign="top">
<table cellspacing="4" cellpadding="0" border="0" width=100%>
<tr>
<td class="rkr" nowrap>Overall Rating:</td>
<td width=80>
<img src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif" alt="Store Rating: 5.0" width="79" height="13" border="0">
</td>
</tr>
<tr>
<td class='rgr' > </td>
<td>
<img src='http://img.epinions.com/images/epi_images/spacer.gif' width=80 height=11>
</td>
</tr>
</table>
</td>
<td style="padding:10px;" colspan=2 width="100%" align="left" valign="top">
<span class="rgr">Mar 27, 2006 <br>(Not Yet Rated)</span><br>
<span class="rkr"> Very helpful in giving me the information I needed to make a purchase.<br><b>
Read more
</b></span>
</td>
</tr>
Here is some Nokogiri code to print out the information you want using XPath:
xml.xpath("//tr[#bgcolor='white']").each do |el|
# Get the "Overall rating" tr block from the first td and get (first) img alt
puts el.at_xpath("td[1]//tr[td/text()='Overall Rating:']//img/#alt")
# Get the first link from the second td that contains "content" and get href
puts el.at_xpath("td[2]//a[contains(#href, '/content')][1]/#href")
# Get the (first) link that has an itemprop author value and get the href
puts el.at_xpath("td[2]//a[#itemprop='author']/#href")
end
use Nokogiri will be ok.
to get alt, get back all the image tags and keep the img tag with the specified src
imgs = doc.css('img[src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif"]')
to get back the href
links = doc.css('a[href*="/content"]')
to get back the author
links = doc.css('a[href*="/user"]')

How to get the `HREF` values only when `<legend>tax</legend>?

<fieldset class="attachmentTable large"><legend>SMF</legend>
<table cellspacing="2" cellpadding="2" border="0">
<tr>
<td>
<a href="
/aems/file/test.html">
</a>
</td>
<td>
foo
</td>
</tr>
</table>
</fieldset>
<fieldset class="attachmentTable large"><legend>tax</legend>
<table cellspacing="2" cellpadding="2" border="0">
<tr>
<td>
<a href="
/relf/file/test.html">
</a>
</td>
<td>
foo
</td>
</tr>
</table>
</fieldset>
I have an html source from a webpage,part of which are given above.Now I want to get the HREF values only when <legend>tax</legend>? So could you guys help me here for the same?
I would do:
page.search('legend[text()="tax"] + table a').each do |a|
puts a[:href]
end

Resources