Using Nokogiri to extract optional address components - ruby

This is my first attempt parsing a webpage using Nokogiri.
I am trying to extract the addresses from a webpage and store them in a CSV file. So far, I've only been able to extract the City, State, and Zip fields.
I don't know how to extract the facility name, address, phone numbers, and company information. The address may contain one or two street components.
For the phone, there may be one or more phone numbers. The phone numbers may be regular numbers or fax numbers, but they are only indicated in the text as opposed to a tag. For the company, I'd like to be able to extract the URL and the name.
Each address on the page is enclosed as follows:
<!-- address entry -->
<div id='1234' class='address'>
<div class='address_header'>
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div>
<div class='address_details'>
<div class='info'>
<p class='address'>
<span class='street'>123 ABC St</span><br />
<span class='street'>Unit 1</span><br />
<span class='city'>New York</span>,
<span class='state'>NY</span>
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>999.999.9999</span>
</p>
<p class='phone'>
Fax: <span class='tel'>888.888.8888</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>
</div>
</div>
<!-- address entry -->
<!-- address entry -->
<div id='4567' class='address'>
<div class='address_header'>
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div>
<div class='address_details'>
<div class='info'>
<p class='address'>
<span class='street'>456 DEF Rd</span><br />
<span class='city'>New York</span>,
<span class='state'>NY</span>
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>555.555.5555</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>
</div>
</div>
<!-- address entry -->
Here's my very basic set-up.
require 'nokogiri'
require 'open-uri'
require 'csv'
# Since Ruby 3.0, Kernel#open no longer opens URLs; open-uri provides URI.open instead.
doc = Nokogiri::HTML(URI.open('[URL]'))
Cities = Array.new
States = Array.new
Zips = Array.new
doc.css("p[class='address']").css("span[class='city']").each do |city|
Cities << city.content
end
doc.css("p[class='address']").css("span[class='state']").each do |state|
States << state.content
end
doc.css("p[class='address']").css("span[class='zip']").each do |zip|
Zips << zip.content
end
CSV.open("myCSV.csv", "wb") do |row|
row << ["City", "State", "Zip"]
(0..Cities.length - 1).each do |index|
row << [Cities[index], States[index], Zips[index]]
end
end
Storing the information in separate arrays here seems very clunky. I'd basically like to make a row entry in a CSV table for each occurrence of the address node in the source document, and then populate it with fields if they exist:
Facility St_1 St_2 City State Zip Phone Fax URL Company
======== ===== ===== ===== ====== ==== ====== ==== ==== ============
xxxxxxxx xxxx xxxx xxxxx xxxx xxxxx xxxx xxxxxxxx
xxxxxxxx xxxx xxxxx xxxx xxxxx xxxx xxxxx xxxx xxxx xxxxxxxx
Can someone help me?

You probably have some edge cases that this won't handle, but this takes care of your example. You'll need to change the doc to read from the real page instead of the data segment, and you'll need to change the csv to print to a file instead of display inline like I've done.
require 'nokogiri'
require 'open-uri'
require 'csv'
doc = Nokogiri::HTML(DATA.read)
CompanyInfo = Struct.new :facility, :street1, :street2, :city, :state, :zip, :phone, :fax, :url, :company
company_infos = []
doc.css("div.address").each do |address_div|
facility = address_div.at_css('.address_header .header_name').text.strip
info = address_div.css('div.address_details .info')
street1, street2 = info.css('.street').map(&:text)
city = info.at_css('.city').text
state = info.at_css('.state').text
zip = info.at_css('.zip').text
phone, fax = info.css('.phone .tel').map(&:text)
url = info.at_css('.company a')['href']
company = info.at_css('.company a').text
company_infos << CompanyInfo.new(facility, street1, street2, city, state, zip, phone, fax, url, company)
end
csv = CSV.generate do |csv|
csv << %w[Facility Street1 Street2 City State Zip Phone Fax URL Company]
company_infos.each do |company_info|
csv << company_info.to_a
end
end
csv # => "Facility,Street1,Street2,City,State,Zip,Phone,Fax,URL,Company\nFacility Name,123 ABC St,Unit 1,New York,NY,10022,999.999.9999,888.888.8888,{URL},Company Name\n"
__END__
<!-- address entry -->
<div id='1234' class='address'>
<div class='address_header'>
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div>
<div class='address_details'>
<div class='info'>
<p class='address'>
<span class='street'>123 ABC St</span><br />
<span class='street'>Unit 1</span><br />
<span class='city'>New York</span>,
<span class='state'>NY</span>
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>999.999.9999</span>
</p>
<p class='phone'>
Fax: <span class='tel'>888.888.8888</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>
</div>
</div>

You're asking for a lot, but I'll get you started:
fields = %w{street1 street2 phone fax city state zip}
doc.search('div.address').each do |div|
address = {}
address['street1'], address['street2'] = *div.search('span.street').map(&:text)
address['phone'], address['fax'] = *div.search('span.tel').map(&:text)
['city', 'state', 'zip'].each{|f| address[f] = div.at("span.#{f}").text}
csv << fields.map{|f| address[f]}
end
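The snippet above appends to a csv object that it never defines. As a minimal sketch of that missing piece, assuming each address has already been collected into a hash keyed by field name, the rows can be turned into CSV text like this. The method name addresses_to_csv and the sample rows are illustrative only; a key that is missing from a hash yields nil, which CSV writes as an empty cell.

```ruby
require 'csv'

FIELDS = %w[facility street1 street2 city state zip phone fax url company].freeze

# Build CSV text from an array of hashes. Missing keys come back as nil,
# which the CSV library writes as an empty cell.
def addresses_to_csv(addresses)
  CSV.generate do |csv|
    csv << FIELDS.map(&:capitalize)
    addresses.each { |address| csv << FIELDS.map { |field| address[field] } }
  end
end

# Assumed sample rows modelled on the two entries in the question.
rows = [
  { 'facility' => 'Facility Name', 'street1' => '123 ABC St', 'street2' => 'Unit 1',
    'city' => 'New York', 'state' => 'NY', 'zip' => '10022',
    'phone' => '999.999.9999', 'fax' => '888.888.8888' },
  { 'facility' => 'Facility Name', 'street1' => '456 DEF Rd',
    'city' => 'New York', 'state' => 'NY', 'zip' => '10022',
    'phone' => '555.555.5555' }
]

puts addresses_to_csv(rows)
# To write a file instead: File.write('myCSV.csv', addresses_to_csv(rows))
```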

Related

Scrapy and XPath issue with nested Xpaths

I'm trying to read Amazon products into scrapy.
Starting from a random category using this XPath:
products = Selector(response).xpath('//div[@class="s-item-container"]')
for product in products:
    item = AmzItem()
    item['title'] = product.xpath('//a[@class="s-access-detail-page"]/@title').extract()[0]
    item['url'] = product.xpath('//a[@class="s-access-detail-page"]/@href').extract()[0]
    yield item
('//div[@class="s-item-container"]') returns all the divs with the products on one category page - that's correct.
Now, how would I get the link to the product?
// stands for anywhere in the document
a with the @class should select the right class
But I get:
item['title'] = product.xpath('//a[@class="s-access-detail-page"]/@title').extract()[0]
exceptions.IndexError: list index out of range
So the list matching this XPath must be empty, but I don't understand why.
EDIT:
The HTML would look like that:
<div class="s-item-container" style="height: 343px;">
<div class="a-row a-spacing-base">
<div class="a-column a-span12 a-text-left">
<div class="a-section a-spacing-none a-inline-block s-position-relative">
<a class="a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer"><img alt="Product Details" src="http://ecx.images-amazon.com/images/I/41%2BzrAY74UL._AA160_.jpg" onload="viewCompleteImageLoaded(this, new Date().getTime(), 24, false);" class="s-access-image cfMarker" height="160" width="160"></a>
<div class="a-section a-spacing-none a-text-center">
<div class="a-row a-spacing-top-mini">
<a class="a-size-mini a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">
<div class="a-box">
<div class="a-box-inner a-padding-mini"><span class="a-color-secondary">See more choices</span></div>
</div>
</a>
</div>
</div>
</div>
</div>
</div>
<div class="a-row a-spacing-mini">
<div class="a-row a-spacing-none">
<a class="a-link-normal s-access-detail-page a-text-normal" title="Harry Potter Gryffindor School Fancy Robe Cloak Costume And Tie (Size S)" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">
<h2 class="a-size-base a-color-null s-inline s-access-title a-text-normal">Harry Potter Gryffindor School Fancy Robe Cloak Costume And Tie (Size S)</h2>
</a>
</div>
<div class="a-row a-spacing-mini"><span class="a-size-small a-color-secondary">by </span><span class="a-size-small a-color-secondary">Legend</span></div>
</div>
<div class="a-row a-spacing-mini">
<div class="a-row a-spacing-none"><a class="a-size-small a-link-normal a-text-normal" href="http://www.amazon.com/gp/offer-listing/B0105S434A/ref=sr_1_21_olp?s=pet-supplies&ie=UTF8&qid=1435391788&sr=1-21&keywords=pet+supplies&condition=new"><span class="a-size-base a-color-price a-text-bold">$28.99</span><span class="a-letter-space"></span>new<span class="a-letter-space"></span><span class="a-color-secondary">(1 offer)</span><span class="a-letter-space"></span><span class="a-color-secondary a-text-strike"></span></a></div>
</div>
<div class="a-row a-spacing-none"><span name="B0105S434A">
<span class="a-declarative" data-action="a-popover" data-a-popover="{"max-width":"700","closeButton":"false","position":"triggerBottom","url":"/review/widgets/average-customer-review/popover/ref=acr_search__popover?ie=UTF8&asin=B0105S434A&contextId=search&ref=acr_search__popover"}"><i class="a-icon a-icon-star a-star-4"><span class="a-icon-alt">3.9 out of 5 stars</span></i><i class="a-icon a-icon-popover"></i></span></span>
<a class="a-size-small a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">48</a>
</div>
</div>
It should be:
# The leading dot makes the query relative to product
product.xpath('.//a[@class="s-access-detail-page"]/@title')
//a[@class="s-access-detail-page"] requires the class attribute to be exactly "s-access-detail-page", because XPath compares the attribute as a plain string and knows nothing about the meaning of CSS classes :) When an element has several classes, use the contains function:
//a[contains(concat(' ', @class, ' '), " s-access-detail-page ")]/@title

How can I extract URLs from HTML content with a Ruby regexp?

This is an example since it is not easy to explain:
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
In the above content I want to extract from
javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')
the string "f6a1ok3n4d4p" and "site2.com" then make it as
http://site2.com/f6a1ok3n4d4p
and same for
javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')
to become
http://site1.com/zsgn82c4b96d
I need it to be done with Ruby regex.
You can proceed like this:
require 'uri'
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
# regex scan to get values within javascript:show
vals = str.scan(/javascript:show\((.*)\)/)[0][0].split(',')
# => ["'f6a1ok3n4d4p'", "'random%20strings%204'", "%20'site2.com'"]
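# joining resultant Array elements to generate url
# (URI.decode was removed in Ruby 3.0; URI.decode_www_form_component also decodes %20)
url = "http://" + URI.decode_www_form_component(vals.last).tr("'", '').strip + "/" + vals.first.tr("'", '')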
# => "http://site2.com/f6a1ok3n4d4p"
Obviously my answer is not foolproof. You can make it better by checking for the case where scan returns [].
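As a rough sketch of that kind of check, a single match with two capture groups returns nil when the input is not a javascript:show(...) link, which is easy to test for. The constant SHOW_LINK and the method show_url are names I made up for illustration, and the pattern assumes the three-argument, single-quoted form shown in the question.

```ruby
# Assumed pattern: id and host are the first and third single-quoted arguments.
SHOW_LINK = /javascript:show\('([^']+)','[^']*',(?:%20|\s)*'([^']+)'\)/

# Returns the rebuilt URL, or nil when the string does not match.
def show_url(str)
  match = SHOW_LINK.match(str)
  return nil unless match

  id, host = match.captures
  "http://#{host}/#{id}"
end

puts show_url("javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')")
# prints http://site2.com/f6a1ok3n4d4p
p show_url('javascript:report(3191274,%203,%202164691,%201)')
# prints nil
```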
This should do the trick, though the regexp isn't particularly flexible.
js_link_regex = /href=\"javascript:show\('([^']+)','[^']+',%20'([^']+)'\)/
link = <<eos
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
eos
matches = link.scan(js_link_regex)
matches.each do |match|
puts "http://#{match[1]}/#{match[0]}"
end
To just match your case,
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
parts = str.scan(/'([\w.]+)'/).flatten # => ["f6a1ok3n4d4p", "site2.com"]
puts "http://#{parts[1]}/#{parts[0]}" # => http://site2.com/f6a1ok3n4d4p

How can I create a custom xpath query?

This is my HTML file data:
<article class='course-box'>
<div class='row-fluid'>
<div class='span2'>
<div class='course-cover' style='width: 100%'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle'>
<a href='https://novoed.com/hc'>Hippocrates Challenge</a>
</h2>
<figure class='pricetag'>
Free
</figure>
<div class='timeline independent-text'>
<div class='timeline inline-block'>
Starting Spring 2014
</div>
</div>
By Jill Helms
<div class='university' style='margin-top:0px; font-style:normal;'>
Stanford University
</div>
</div>
</div>
<div class='hovered row-fluid' onclick="location.href='https://novoed.com/hc'">
<div class='span2'>
<div class='course-cover'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955' style='width: 100%'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle' style='margin-top: 10px'>
<a href='https://novoed.com/hc'>
Hippocrates Challenge
</a>
</h2>
<p class='description' style='width: 70%'>
Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine. The course focuses on teaching anatomy in an interactive way, students will learn about diagnosis and treatment planning while...
</p>
<div style='margin-right: 10px'>
<a class='btn action-btn novoed-primary' href='https://novoed.com/users/sign_up?class=hc'>
Sign Up
</a>
</div>
</div>
</div>
From the code above I need to fetch the values of the following:
coursetitle
coursetitle href link
pricetag
timeline inline-block
university
description
instructor name
coursetitle appears in two places, but I need it only once. Also, the instructor name is not wrapped in any specific tag that I could use to fetch it.
My XPath queries are:
novoedData = HtmlXPathSelector(response)
courseTitle = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/text()').extract()
courseDetailLink = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/@href').extract()
courseInstructorName = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/text()').extract()
coursePriceType = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/figure[re:test(@class, "pricetag")]/text()').extract()
courseShortSummary = novoedData.xpath('//div[re:test(@class, "hovered row-fluid")]/div[re:test(@class, "span10")]/p[re:test(@class, "description")]/text()').extract()
courseUniversity = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/div[re:test(@class, "university")]/text()').extract()
But the number of values in each list variable is different:
len(courseTitle) = 40 (two times because of repetition)
len(courseDetailLink) = 40 (two times because of repetition)
len(courseInstructorName) = 160 (some unwanted character is coming because no specific tag for this value)
len(coursePriceType) = 20 (correct count no repetition)
len(courseShortSummary)= 20 (correct count no repetition)
len(courseUniversity) = 20 (correct count no repetition)
Could you modify my XPath queries to solve this problem? Thanks in advance.
You don't need re:test; simply do:
>>> s = sel.xpath('//div[@class="row-fluid"]/div[@class="span10"]')[0]
>>> len(s)
1
>>> s.xpath('h2[@class="coursetitle"]/a/@href').extract()
[u'https://novoed.com/hc']
Also note that once s points at the right node, you can keep querying relative to it.

How can I get several similar tags data with HtmlAgilityPack?

Before explaining, I am using VB.net and HtmlAgilityPack.
I have the below html, all three sections have the same format. I am using htmlagilitypack to extract the data from the Title and Date. My code extracts the title correctly but the date is only extracted from the first instance and repeated 3 times:
HtmlAgilityPack code:
For Each h4 As HtmlNode In docnews.DocumentNode.SelectNodes("//h4[(@class='title')]")
    Dim date1 As HtmlNode = docnews.DocumentNode.SelectSingleNode("//span[starts-with(@class, 'date ')]")
    Dim newsdate As String = date1.InnerText
    MessageBox.Show(h4.InnerText)
    MessageBox.Show(newsdate)
Next
I thought being in each h4, I get its associated date accordingly...
HTML code:
<div class="article-header" style="" data-itemid="920729" data-source="ABC" data-preview="Text 1">
<h4 class="title">Text for Mr. A</h4>
<div class="byline">
<span class="date timestamp"><span title="29 November 2013">29-11-2013</span></span>
<span class="source" title="AGE">18</span>
</div>
<div class="preview">Text 1 Preview</div>
</div>
<div class="article-header" style="" data-itemid="920720" data-source="ABC" data-preview="Text 2">
<h4 class="title">Text for Mr. B</h4>
<div class="byline">
<span class="date timestamp"><span title="27 November 2013">27-11-2013</span></span>
<span class="source" title="AGE">25</span>
</div>
<div class="preview">Text 2 Preview</div>
</div>
<div class="article-header" style="" data-itemid="920719" data-source="ABC" data-preview="Text 3">
<h4 class="title">Text for Mr. C</h4>
<div class="byline">
<span class="date timestamp"><span title="22 October 2013">22-10-2013</span></span>
<span class="source" title="AGE">20</span>
</div>
<div class="preview">Text 3 Preview</div>
</div>
Final Output should be:
Text for Mr. A
29-11-2013
Text for Mr. B
27-11-2013
Text for Mr. C
22-10-2013
What I am getting with my code:
Text for Mr. A
29-11-2013
Text for Mr. B
29-11-2013
Text for Mr. C
29-11-2013
Any help is much appreciated.
You need to anchor your second XPath to the parent of the current h4 rather than to the whole document (HtmlAgilityPack exposes the parent as ParentNode):
Dim date1 As HtmlNode = h4.ParentNode.SelectSingleNode(".//span[starts-with(@class, 'date ')]")
                        ^^^^^^^^^^^^^                   ^^^
The leading .// tells XPath to search beneath the node the query is executed on. By calling SelectSingleNode on the h4's parent, you get the date that sits under the same div as that h4.

get div nested in div element using Nokogiri

I want to parse the following HTML with Nokogiri and get this result:
event_name = "folk concert 2"
event_link = "http://www.douban.com/event/12761580/"
event_date = "20th,11,2010"
I know doc.xpath('//div[@class="nof clearfix"]') could get each div element, but how should I proceed to get each attribute like event_name, and especially the date?
HTML
<div class="nof clearfix">
<h2><a href="http://www.douban.com/event/12761580/">folk concert 2</a> <span class="pl2"> </span></h2>
<div class="pl intro">
Date:25th,11,2010<br/>
</div>
</div>
<div class="nof clearfix">
<h2><a href="http://www.douban.com/event/12761581/">folk concert</a> <span class="pl2"> </span></h2>
<div class="pl intro">
Date:10th,11,2010<br/>
</div>
</div>
I don't know xpaths, I prefer to use css selectors, they make more sense to me. This tutorial might be useful for you.
require 'rubygems'
require 'nokogiri'
require 'pp'
Event = Struct.new :name , :link , :date
doc = Nokogiri::HTML DATA
events = doc.css("div.nof.clearfix").map do |eventnode|
name = eventnode.at_css("h2 a").text.strip
link = eventnode.at_css("h2 a")['href']
date = eventnode.at_css("div.pl.intro").text.strip
Event.new name , link , date
end
pp events
__END__
<div class="nof clearfix">
<h2><a href="http://www.douban.com/event/12761580/">folk concert 2</a> <span class="pl2"> </span></h2>
<div class="pl intro">
Date: 25th,11,2010<br/>
</div>
</div>
<div class="nof clearfix">
<h2><a href="http://www.douban.com/event/12761581/">folk concert</a> <span class="pl2"> </span></h2>
<div class="pl intro">
Date: 10th,11,2010<br/>
</div>
</div>
This outputs:
[#<struct Event
name="folk concert 2",
link="http://www.douban.com/event/12761580/",
date="Date: 25th,11,2010">,
#<struct Event
name="folk concert",
link="http://www.douban.com/event/12761581/",
date="Date: 10th,11,2010">]
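The date field above still carries the "Date:" label, while the question asks for the bare value. A small sketch of trimming it, assuming the label always appears at the start of the text; the method name event_date is made up for illustration.

```ruby
# Strip surrounding whitespace and a leading "Date:" label, if present.
def event_date(intro_text)
  intro_text.strip.sub(/\ADate:\s*/, '')
end

puts event_date("\n  Date: 25th,11,2010\n") # prints 25th,11,2010
puts event_date('Date:10th,11,2010')        # prints 10th,11,2010
```

Inside the map block this would become date = event_date(eventnode.at_css("div.pl.intro").text).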
