How to get the `HREF` values only when `<legend>tax</legend>`? - ruby

<fieldset class="attachmentTable large"><legend>SMF</legend>
<table cellspacing="2" cellpadding="2" border="0">
<tr>
<td>
<a href="
/aems/file/test.html">
</a>
</td>
<td>
foo
</td>
</tr>
</table>
</fieldset>
<fieldset class="attachmentTable large"><legend>tax</legend>
<table cellspacing="2" cellpadding="2" border="0">
<tr>
<td>
<a href="
/relf/file/test.html">
</a>
</td>
<td>
foo
</td>
</tr>
</table>
</fieldset>
I have the HTML source of a web page, part of which is shown above. I want to get the HREF values only from the fieldset whose legend is <legend>tax</legend>. How can I do that?

I would do:
page.search('legend[text()="tax"] + table a').each do |a|
puts a[:href]
end

Related

HTML signature leaves big spaces between rows (Gmail App from Outlook)

This is how it looks in the Gmail mobile app when sent from Outlook:
How can I avoid those big gaps?
My code is as follows:
<table id="sig" width='320' cellspacing='0' cellpadding='0' border-spacing='0' style="width:320px;margin:0;padding:0;">
<tr>
<td valign='top' width="120" height="48" style="width:120px;height:48px;margin:0;padding:0;vertical-align:top;">
<a style="border:none;text-decoration:none;">
<img moz-do-not-send="true" src="https://s3.amazonaws.com/media_crisalix/signatures/logo.jpg" alt="Crisalix" width='120' height='48' style="border:none;width:120px;height:48px;display:block;">
</a>
</td>
</tr>
<tr>
<td>
<table id="sig1" cellspacing='0' width='320' cellpadding='0' border-spacing='0' style="padding:0;margin:0;font-family:sans-serif,Arial,'Helvetica Neue',Helvetica;mso-line-height-rule:exactly;line-height:11px;color:#b0b0b0;border-collapse:collapse;-webkit-text-size-adjust:none;width:320px;">
<tr style="margin:0;padding:0;">
<td style="width:320px;margin:0;padding:0;font-family:sans-serif,Arial,'Helvetica Neue',Helvetica;white-space:nowrap;font-weight:600;line-height:1.6;font-size:13px;">
<span style="color:#137191">Jaime</span>
</td>
</tr>
<tr style="margin:0;padding:0;">
<td style="width:320px;margin:0;padding:0;font-family:sans-serif,Arial,'Helvetica Neue',Helvetica;white-space:nowrap;font-size:12px;line-height:1">
<span style="color:#555555">Chief Executive Officer</span>
</td>
</tr>
<tr>
<td valign='top' width="27" height='21' style="width:27px;height:1px;margin:0;padding:0;vertical-align:top;">
<img moz-do-not-send="true" src="https://s3.amazonaws.com/media_crisalix/signatures/separator.jpg" alt="Crisalix" width='27' height='21' style="border:none;width:27px;height:21px;display:block;">
</td>
</tr>
<tr style="margin:0;padding:0;">
<td style="width:320px;margin:0;padding:0;font-family:sans-serif,Arial,'Helvetica Neue',Helvetica;white-space:nowrap;font-size:12px;line-height:1.4;">
<div><span style="color:#137191;font-weight:bold">P / </span><span style="color:#555555;"></span></div>
<div><span style="color:#137191;font-weight:bold">A / </span><span style="color:#555555">Parc Scientifique (PSE-A) - EPFL1015</span></div>
<div> <span style="color:#555555"> Lausanne | Switzerland </span></div>
</td>
</tr>
<tr style="margin:0;padding:0;">
<td style="margin:0;padding:0;font-family:sans-serif,Arial,'Helvetica Neue',Helvetica;white-space:nowrap;font-weight:600;line-height:1.6;font-size:13px;width:320px;">
<span style="color:#137191;border:none;text-decoration:none!important;color:#137191;">www.crisalix.com</span>
</td>
</tr>
<tr>
<td valign='top' width="27" height='21' style="width:27px;height:1px;margin:0;padding:0;vertical-align:top;">
<img moz-do-not-send="true" src="https://s3.amazonaws.com/media_crisalix/signatures/separator.jpg" alt="Crisalix" width='27' height='21' style="border:none;width:27px;height:21px;display:block;">
</td>
</tr>
</td>
</table>
</tr>
<tr>
<td valign='top' width="230" height="225" style="width:230px;height:225px;margin:0;padding:0;vertical-align:top;">
<a href='http://www.crisalix.com' title="Crisalix" style="border:none;text-decoration:none;">
<img moz-do-not-send="true" src="https://s3.amazonaws.com/media_crisalix/signatures/signature-banner.jpg" alt="Crisalix" width='230' height='225' style="border:none;width:230px;height:225px;display:block;">
</a>
</td>
</tr>
</table>

How to extract TD values only with Nokogiri?

I am extracting the forexfactory calendar. I can do a puts of the td values but it's printing the full HTML tags. How do I get the td value of a tag with class "calendar__time"?
My code is:
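# Gem names are lowercase; capitalized names fail on case-sensitive filesystems
require 'httparty'
require 'nokogiri'
require 'pry'
require 'csv'
# Fetch the calendar page and parse it
page = HTTParty.get('http://www.forexfactory.com/calendar.php?day=aug31.2016')
doc = Nokogiri::HTML(page)
rows = doc.css('tr.calendar_row')
rows.each do |row|
  # These print the matched <td> nodes as markup, not just their text
  puts row.css('td.calendar__date')
  puts row.css('td.calendar__time')
end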
When I check with irb, it's returning it with the tags:
<td class="calendar__cell calendar__time time">9:45pm</td><a href="javascript:void(0);" class="calendarexpanded__graph" data-touchable><span>Graph</span></a> </td>
</tr>
</tbody>
</table>
</td>
</tr>
<td class="calendar__cell calendar__date date"></td>
The HTML snippet for this TR is:
<tr class="calendar__row calendar_row calendar__row--grey " data-eventid="62529" data-touchable>
<td class="calendar__cell calendar__date date"></td>
<td class="calendar__cell calendar__time time">2:00am</td>
<td class="calendar__cell calendar__currency currency">CHF</td>
<td class="calendar__cell calendar__impact impact calendar__impact calendar__impact--low">
<div class="calendar__impact-icon calendar__impact-icon--screen"> <span title="Low Impact Expected" class="low"></span> </div>
<div class="calendar__impact-icon calendar__impact-icon--print"> <img src="resources/images/icons/impact/impact-yellow.png" alt="" border="0" /> </div>
</td>
<td class="calendar__cell calendar__event event">
<div> <span class="calendar__event-title">UBS Consumption Indicator</span> </div>
</td>
<td class="calendar__cell calendar__detail detail"><a class="calendar__detail-link calendar__detail-link--level-1 calendar_detail level1" data-level="1"></a></td>
<td class="calendar__cell calendar__actual actual">1.32</td>
<td class="calendar__cell calendar__forecast forecast"></td>
<td class="calendar__cell calendar__previous previous"><span class="revised worse" title="Revised From 1.34">1.21</span></td>
<td class="calendar__cell calendar__graph graph"><a class="calendar__detail-link calendar__detail-link--graph-icon calendar_chart"></a></td>
</tr>
<tr class="calendar__row calendar__expand " data-eventid="62529">
<td> </td>
<td colspan="4" class="calendarexpanded__container">
<table class="calendarexpanded">
<tbody>
<tr>
<td class="calendarexpanded__cell"><strong>Actual</strong>1.32</td>
<td class="calendarexpanded__cell"><strong>Forecast</strong> </td>
<td class="calendarexpanded__cell"><strong>Previous</strong><span class="revised worse" title="Revised From 1.34">1.21</span></td>
<td class="calendarexpanded__cell calendarexpanded__cell--small"> <a href="javascript:void(0);" class="calendarexpanded__details calendarexpanded__details--1" data-touchable><span>Details</span></a> </td>
<td class="calendarexpanded__cell calendarexpanded__cell--small">
I got it to work by changing as follows:
puts row.css('td.calendar__date').text

Scraping info of a product that is spread among different <tr> elements?

I'm using something like the following to scrape the info of a page:
def self.parse_products
product_hash = {}
product = @data.css('.simGrid')
product.css('td').each do | product |
product_asin = product.css('.simImage a img').first.value[/(?<=\/)[A-Z\d]{5,}/]
product_image_url = product.css('.simProductInfo a').to_s
product_hash[:product] ||= []
product_hash[:product] << { :image_url => product_image_url,
:asin => product_asin }
end
product_hash
end
The problem is that the structure is something like this:
<table class="simGrid">
<tbody>
<tr class="middle">
<td>
<div class="simImage"></div>
</td>
<td>
<div class="simImage"></div>
</td>
<td>
<div class="simImage"></div>
</td>
</tr>
<tr>
<td>
<div class="simProductInfo"></div>
</td>
<td>
<div class="simProductInfo"></div>
</td>
<td>
<div class="simProductInfo"></div>
</td>
</tr>
<tr>
<td>
<hr class="divider" />
</td>
<td>
<hr class="divider" />
</td>
<td>
<hr class="divider" />
</td>
</tr>
<tr class="middle">
<td>
<div class="simImage"></div>
</td>
<td>
<div class="simImage"></div>
</td>
<td>
<div class="simImage"></div>
</td>
</tr>
<tr>
<td>
<div class="simProductInfo"></div>
</td>
<td>
<div class="simProductInfo"></div>
</td>
<td>
<div class="simProductInfo"></div>
</td>
</tr>
<tr>
<td>
<hr class="divider" />
</td>
<td>
<hr class="divider" />
</td>
<td>
<hr class="divider" />
</td>
</tr>
</tbody>
</table>
As you can see, the info for each product is spread across several <tr> elements. If I scrape by <td>, I end up with many nil values, because some <td> elements contain .simImage and others don't. The same goes for .simProductInfo.
Has anyone encountered something similar before? Is there a workaround for this?
You can try collecting ASINs and URLs in two separate arrays and then zipping them.
asins = product.css('.simImage a img').map { |n| n.value[/(?<=\/)[A-Z\d]{5,}/] }
urls = product.css('.simProductInfo a').map(&:to_s)
asins.zip(urls).map { |asin, url| {image_url: url, asin: asin} }
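To show what the zip step does on its own, here is a minimal sketch with placeholder data. The ASINs and URLs below are made up for illustration and stand in for the two scraped lists.

```ruby
# Placeholder data standing in for the two scraped lists (hypothetical values).
asins = %w[B00AAAAAA1 B00AAAAAA2 B00AAAAAA3]
urls  = %w[/dp/B00AAAAAA1 /dp/B00AAAAAA2 /dp/B00AAAAAA3]

# zip pairs elements by position: first ASIN with first URL, and so on.
products = asins.zip(urls).map { |asin, url| { asin: asin, image_url: url } }

# If the second list is shorter, zip fills the gap with nil.
short = %w[A B].zip(%w[x])

p products.first
p short
```

Because the pairing is purely positional, this only works if both selectors return the same number of nodes in the same order; a product with a missing image or link would shift every pair after it.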

Content Won't Scroll With Footer And Header Fixed

I have set both the header and the footer to fixed, but I still cannot get the content to adjust on iPhone.
Here is my layout code
<body>
<div data-role="page" @(Page.Id == null ? string.Empty : "id=" + Page.Id) data-fullscreen="false">
@* @RenderSection("MoreCode", false)*@
@if (IsSectionDefined("Header"))
{
<div data-role="header" data-position="fixed">
@RenderSection("Header", false)
</div><!-- /header -->
}
<div data-role="content">
@RenderBody()
</div><!-- /content -->
<div data-role="footer" data-tap-toggle="false" data-position="fixed">
<div class="center-wrapper">
<table border="0">
<tr>
<td>
<label for="basic">
Place Of Interest
</label>
</td>
</tr>
<tr>
<td>
<table border="0">
<tr>
<td>
Terms of Use
</td>
<td>
<label for="basic">
|
</label>
</td>
<td>
Privacy Policy
</td>
</tr>
</table>
</td>
</tr>
</table>
</div>
</div>
</body>
Here is my View
<h3 style='color:#4FA600; font-weight:bold;'>@ViewBag.Number</h3>
<table border="1" width="100%">
<tr>
<td align="center" valign=middle style=' font-weight:bold; color:#FFFFFF; background:#4FA600; '>
<label for="basic">Available Documents</label>
</td>
</tr>
<tr>
<td>
<table width="100%" height="100%">
<tr>
<td align="left" valign="middle" style=' font-weight:bold;'>
Type
</td>
<td align="center" valign="middle" style=' font-weight:bold;'>
Description
</td>
</tr>
@{
foreach(var row in Model)
{
<tr>
<td align="left" valign="middle">
@row.DocType
</td>
<td align="center" valign="middle">
@row.Description
</td>
</tr>
}
}
</table>
</td>
</tr>
</table>
Isn't jQuery Mobile supposed to adjust this automatically? It renders fine in Firefox. What can I do to make the content display fully?
The issue was that I had this in my CSS:
[data-role=page]{height: 100% !important; position:relative !important;}
I got this rule from ThemeRoller and I'm not sure why it broke the layout. If someone can clarify, that would be appreciated.

What xpath query would solve this

What XPath query could I use to solve the problem below? I'm using Nokogiri (in Ruby), so ideally the answer would be in Nokogiri form, but plain XPath is fine too and I can adapt it.
Required Output
I'm seeking to parse the below HTML (a full html page, but I've just copy/pasted the relevant part for clarity), and end up with basically the following:
Phone Number Plan ID
545454545 12345
3434343434 67890
So in the context of Ruby/nokogiri this could be in a Hash for example:
% result = { "545454545" => "12345", "3434343434" => "67890" }
HTML to be Parsed
.
.
.
<form method="post">
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 12345 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 545454545 </td>
</tr>
.
.
.
</table>
</div>
<br>
.
.
.
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 67890 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 3434343434 </td>
</tr>
.
.
.
</table>
</div>
<br>
How about:
xpath = '//td[contains(text(),"Phone Number") or contains(text(),"Plan ID")]/following-sibling::td'
Hash[*doc.xpath(xpath).map{|x| x.text.strip}.reverse]
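The `reverse` is what turns the values into phone => plan pairs. As a sketch of why, assume the query returns the cell texts in document order (plan ID first, then phone number, for each table). The `values` array below is that assumed result, written out by hand so the step can be run without the HTML.

```ruby
# Assumed result of the XPath query, in document order:
# plan ID, phone number, plan ID, phone number.
values = ['12345', '545454545', '67890', '3434343434']

# Reversing puts each phone number directly before its plan ID,
# and Hash[*list] reads the flat list as key, value, key, value.
by_reverse = Hash[*values.reverse]

# An equivalent that does not rely on reversing: take the values
# two at a time and swap each pair explicitly.
by_slices = values.each_slice(2).map { |plan_id, phone| [phone, plan_id] }.to_h

p by_reverse
p by_slices
```

Both forms depend on every table contributing exactly one plan ID followed by one phone number; a table missing either cell would misalign all later pairs.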
Assuming the lines you've replaced with periods do not contain data you want to collect, which would mean each table provides a unique result set, the following would work:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri.HTML DATA.read
results = {}
doc.search('table').each do |table|
plan_id = table.at('tr[1]/td[2]')
phone_number = table.at('tr[2]/td[2]')
if plan_id && phone_number
results[phone_number.text.strip] = plan_id.text.strip
end
end
p results #=> {"545454545"=>"12345", "3434343434"=>"67890"}
__END__
<form method="post">
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 12345 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 545454545 </td>
</tr>
.
.
.
</table>
</div>
<br>
.
.
.
<div style='line-height:18px;background-color:#FFFFFF;border: 1px #dedede solid;padding:10px;'>
<table width='90%' border=0>
<tr>
<td width='30%'> Plan ID </td>
<td width='70%'> 67890 </td>
</tr>
<tr>
<td> Phone Number </td>
<td> 3434343434 </td>
</tr>
.
.
.
</table>
</div>
<br>
