Ruby Nokogiri Parsing Multiple Elements within Lists

<div class='prdlist'>
<ul>
<li class='first'>
<a href="some url 1">
<div class="text">
<br>product number 1
</div>
</a>
</li>
<li class='second'>
<a href="some url 2">
<div class="text">
<br>product number 2
</div>
</a>
</li>
</ul>
</div>
Using the example above, I would like to parse the values inside each list item, item by item. Something like:
arr = []
html.xpath("//*[@class='prdlist']/ul/li").each do |each|
  url = each.xpath/css (parse the href from each list)
  name = each.xpath/css (parse the text from each list)
  arr << [url,name]
end
which would eventually output:
arr = [["some url 1","product number 1"],["some url 2","product number 2"]]
I am currently using regex and xpath("//*[@href]/@href") to get all URLs, something similar to get all product names, and then .zip to put the arrays together. But I've come across an HTML page where I would like to do it list item by list item.
Thanks for the help!

And there you have it.
arr = []
html.css("div.prdlist li").each do |me|
  url = me.css("a").map{|link| link['href']}[0]
  name = me.text.delete("\n").split.join(" ")
  arr << [url,name]
end

Related

Extract text inside anchor tag using xpath

I am trying to determine how many pages there are for a search result on a site, so that I can scrape data from all of the pages using lxml and XPath.
There is a pagination tab with the following structure:
Page: 1 2 3 ... 7 next
The HTML for it looks something like this:
<ul class="ulclass">
<li></li>
<li>
<span> You are on the first page</span>
"1"
</li>
<li>
<a href="link to second page">
<span></span>
"2"
</a>
</li>
<li>
</li>
...
<li>
<a href="link to last page">
<span></span>
"7"
</a>
</li>
My approach is to extract the page numbers 1, 2, 3, 7 so that I can repeat the scrape for each of the 7 pages, because otherwise it only scrapes the first page of results.
I have written the following XPath, but it does not return the correct page numbers:
xpath('//ul[@class="ulclass"]/li/a/text()')
If I expand your example to form this,
<ul class="ulclass">
<li><span>You are on the first page</span>"1"</li>
<li><a><span></span>"2"</a></li>
<li><a><span></span>"3"</a></li>
<li><a><span></span>"4"</a></li>
<li><a><span></span>"5"</a></li>
<li><a><span></span>"6"</a></li>
<li><a><span></span>"7"</a></li>
</ul>
then using scrapy in Python I can get this:
>>> from scrapy.selector import Selector
>>> selector = Selector(text=open('temp.htm').read())
>>> selector.xpath('//ul[@class="ulclass"]/li/a/text()').extract()
['"2"', '"3"', '"4"', '"5"', '"6"', '"7"']

How can I extract URLs from HTML content with a Ruby regexp?

This is an example since it is not easy to explain:
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
In the above content I want to extract from
javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')
the strings "f6a1ok3n4d4p" and "site2.com", then combine them into
http://site2.com/f6a1ok3n4d4p
and same for
javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')
to become
http://site1.com/zsgn82c4b96d
I need it to be done with Ruby regex.
You can proceed like this:
require 'uri'
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
# regex scan to get values within javascript:show
vals = str.scan(/javascript:show\((.*)\)/)[0][0].split(',')
# => ["'f6a1ok3n4d4p'", "'random%20strings%204'", "%20'site2.com'"]
# join the resulting Array elements to build the url
# (URI.decode was removed in Ruby 3.0; decode_www_form_component decodes the %20 here)
url = "http://" + URI.decode_www_form_component(vals.last).tr("'", '').strip + "/" + vals.first.tr("'", '')
# => "http://site2.com/f6a1ok3n4d4p"
Obviously my answer is not foolproof. You can improve it by checking for the case where scan returns [].
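One way to add that check is sketched below. The helper name show_link_to_url is made up for illustration, and the sketch assumes the javascript:show(...) call always carries exactly three comma-separated arguments (id, label, host), as in the question. It returns nil instead of raising when the input does not fit that shape. Note that decode_www_form_component raises ArgumentError on malformed percent-escapes, which this sketch does not handle.

```ruby
require 'uri'

# Hypothetical helper: returns "http://host/id", or nil when the input
# does not contain a javascript:show(...) call with three arguments.
def show_link_to_url(str)
  match = str.match(/javascript:show\(([^)]*)\)/)
  return nil unless match

  # Decode %20 and friends, then drop surrounding whitespace and single quotes.
  args = match[1].split(',').map do |arg|
    URI.decode_www_form_component(arg).strip.delete("'")
  end
  return nil unless args.size == 3

  id, _label, host = args
  "http://#{host}/#{id}"
end

puts show_link_to_url("javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')")
puts show_link_to_url("javascript:report(3191274,%203,%202164691,%201)").inspect
```

Using match instead of scan(...)[0][0] means a string without javascript:show yields nil rather than a NoMethodError on an empty array.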
This should do the trick, though the regexp isn't particularly flexible.
js_link_regex = /href=\"javascript:show\('([^']+)','[^']+',%20'([^']+)'\)/
link = <<eos
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
eos
matches = link.scan(js_link_regex)
matches.each do |match|
  puts "http://#{match[1]}/#{match[0]}"
end
To just match your case,
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
parts = str.scan(/'([\w.]+)'/).flatten # => ["f6a1ok3n4d4p", "site2.com"]
puts "http://#{parts[1]}/#{parts[0]}" # => http://site2.com/f6a1ok3n4d4p

Scraping data based on the text of other neighboring elements?

I have a code like this:
<div id="left">
<div id="leftNav">
<div id="leftNavContainer">
<div id="refinements">
<h2>Department</h2>
<ul id="ref_2975312011">
<li>
<a href="#">
<span class="expand">Pet Supplies</span>
</a>
</li>
<li>
<strong>Dogs</strong>
</li>
<li>
<a>
<span class="refinementLink">Carriers & Travel Products</span>
<span class="narrowValue"> (5,570)</span>
</a>
</li>
(etc...)
Which I'm scraping like this:
html = file
data = Nokogiri::HTML(open(html))
categories = data.css('#ref_2975312011')
@categories_hash = {}
categories.css('li').drop(2).each do |category|
  categories_title = category.css('.refinementLink').text
  categories_count = category.css('.narrowValue').text[/[\d,]+/].delete(",").to_i
  @categories_hash[:categories] ||= {}
  @categories_hash[:categories]["Dogs"] ||= {}
  @categories_hash[:categories]["Dogs"][categories_title] = categories_count
end
So now I want to do the same, but without using #ref_2975312011 and "Dogs".
So I was thinking I could tell Nokogiri the following:
Scrape the li elements (starting from the third one) that are right
below the li element which has the text Pet Supplies enclosed by a link and a span tag.
Any ideas on how to accomplish that?
The Pet Supplies li would be:
puts doc.at('li:has(a span[text()="Pet Supplies"])')
The following sibling li's would be (skipping the first one):
puts doc.search('li:has(a span[text()="Pet Supplies"]) ~ li:gt(1)')

Extract a link with Nokogiri from the text of link?

I want to extract a specific link from a webpage, searching for it by its text, using Nokogiri:
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
I would like the href of "site 3" and return:
http://example.org/site/3/
Or I would like the href of "site 1" and return:
http://example.org/site/1/
How can I do it?
Original:
text = <<TEXT
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
TEXT
link_text = "site 1"
doc = Nokogiri::HTML(text)
p doc.xpath("//a[text()='#{link_text}']/@href").to_s
Updated:
As far as I know, Nokogiri's XPath implementation doesn't support regular expressions. For basic starts-with matching there's a function called starts-with, which you can use like this (links starting with "s"):
doc = Nokogiri::HTML(text)
array_of_hrefs = doc.xpath("//a[starts-with(text(), 's')]/@href").map(&:to_s)
Maybe you will like css style selection better:
doc.at('a[text()="site 1"]')[:href] # exact match
doc.at('a[text()^="site 1"]')[:href] # starts with
doc.at('a[text()*="site 1"]')[:href] # match anywhere
require 'nokogiri'
text = "site 1"
doc = Nokogiri::HTML(DATA) # DATA reads the HTML placed after __END__ in the same script
p doc.xpath("//div[@class='links']//a[contains(text(), '#{text}')]/@href").to_s
Just to document another way we can do this in Ruby, using the URI module:
require 'uri'
html = %q[
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
]
uris = Hash[URI.extract(html).map.with_index{ |u, i| [1 + i, u] }]
=> {
1 => "http://example.org/site/1/'",
2 => "http://example.org/site/2/'",
3 => "http://example.org/site/3/'"
}
uris[1]
=> "http://example.org/site/1/'"
uris[3]
=> "http://example.org/site/3/'"
Under the covers URI.extract uses a regular expression, which isn't the most robust way of finding links in a page, but it is pretty good since a URI usually is a string without whitespace if it is to be useful.
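Two caveats about the snippet above: URI.extract is marked obsolete in current Ruby documentation, and the output shown keeps the trailing quote on each URL. As a rough alternative that still avoids a parser, the sketch below uses String#scan to key each href by its link text, which is what the question asks for. It assumes every link has exactly the shape <a href='...'>text</a> with single-quoted hrefs, as in the sample; the variable name links is just for illustration. For arbitrary HTML, a real parser such as Nokogiri remains the more robust choice.

```ruby
html = %q[
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
]

# Capture (href, text) pairs, then build a Hash keyed by the link text.
# Assumption: anchors look exactly like <a href='...'>text</a>.
pairs = html.scan(/<a\s+href='([^']*)'\s*>([^<]*)<\/a>/)
links = pairs.map { |href, text| [text.strip, href] }.to_h

puts links["site 3"]
puts links["site 1"]
```

Looking up a text that is not present, such as links["site 9"], simply returns nil.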

Accessing a div element in an array of li elements

I am trying to access a div in an li array
<ul>
<li class="views-row views-row-1 views-row-odd views-row-first">
<div class="news-item">
</li>
<li class="views-row views-row-2 views-row-even">
<li class="views-row views-row-3 views-row-odd">
<div class="news-item">
<div class="image">
<div class="details with-image">
<h2>
<p class="standfirst">The best two-seat </p>
<div class="meta">
<div class="pub-date">26 April 2012</div>
<div class="topic-bar clearfix">
<div class="topic car_review">review</div>
</div>
</div>
</div>
</div>
</li>
I am trying to access the <div class="topic car_review"> element and get its text.
The reason I am specifically using that text is that, depending on what the text is, the script would enter specific steps.
The code that I am using is
@topic = @browser.li(:class => /views-row-#{x}/).div(:class,'news-item').div(:class,'details').div(:class,'meta').div(:class,/topic /).text
The script was working fine before, but it has suddenly stopped working and can no longer find div(:class,'news-item').
The error message I get is
unable to locate element, using {:class=>"news-item", :tag_name=>"div"} (Watir::Exception::UnknownObjectException)
I tried div(:class => /news-/), but it's still not able to find that element.
I am really stuck!
I assume that when you are doing li(:class => /views-row-#{x}/), the x means you are iterating over all rows? If so, then your script will fail on the row-2 since it does not contain the news-item div (resulting in the error that you see).
If there is only one of these 'topic car_review' div tags, you can just do:
@topic = @browser.div(:class, 'topic car_review')
Update - Iterating over each LI:
If you need to iterate over each LI, then you could do:
@browser.lis.each do |li|
  @topic = li.div(:class, 'topic car_review').text
end

Resources