How can I extract URLs from HTML content with a Ruby regexp? - ruby

This is an example since it is not easy to explain:
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
In the above content I want to extract from
javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')
the string "f6a1ok3n4d4p" and "site2.com" then make it as
http://site2.com/f6a1ok3n4d4p
and same for
javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')
to become
http://site1.com/zsgn82c4b96d
I need it to be done with Ruby regex.

You can proceed like this:
require 'uri'
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
# regex scan to get values within javascript:show
vals = str.scan(/javascript:show\((.*)\)/)[0][0].split(',')
# => ["'f6a1ok3n4d4p'", "'random%20strings%204'", "%20'site2.com'"]
# joining resultant Array elements to generate url
url = "http://" + URI.decode_www_form_component(vals.last).tr("'", '').strip + "/" + vals.first.tr("'", '')
# => "http://site2.com/f6a1ok3n4d4p"
Obviously, this answer is not foolproof. You can make it more robust by checking whether scan returns [] (no match) before indexing into the result.
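The empty-match check mentioned above could look roughly like this. It is only a sketch: the helper name build_url is hypothetical, and URI.decode_www_form_component is used because URI.decode was removed in Ruby 3.0.

```ruby
require 'uri'

# Hypothetical helper: returns the rebuilt URL, or nil when the input
# does not contain a javascript:show(...) call with three arguments.
def build_url(str)
  match = str.match(/javascript:show\(([^)]*)\)/)
  return nil unless match

  # Decode %20 and friends, then drop the quotes and surrounding spaces.
  args = match[1].split(',').map do |arg|
    URI.decode_www_form_component(arg).delete("'").strip
  end
  return nil unless args.size == 3

  "http://#{args.last}/#{args.first}"
end

puts build_url("javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')")
puts build_url("javascript:report(1,%202)").inspect
```

Returning nil instead of raising lets the caller skip links that do not follow the expected shape.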

This should do the trick, though the regexp isn't particularly flexible.
js_link_regex = /href=\"javascript:show\('([^']+)','[^']+',%20'([^']+)'\)/
link = <<eos
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
eos
matches = link.scan(js_link_regex)
matches.each do |match|
puts "http://#{match[1]}/#{match[0]}"
end
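If the markup varies a little (for example a plain space instead of %20 after a comma), a slightly looser pattern may help. The following is a sketch only; the variable names and the second sample link are made up for illustration.

```ruby
# Separator between arguments: a comma with optional spaces or %20 around it.
sep = /(?:\s|%20)*,(?:\s|%20)*/

# Named groups capture the id (first argument) and the host (third argument).
js_link = /javascript:show\('(?<id>[^']+)'#{sep}'[^']*'#{sep}'(?<host>[^']+)'\)/

html = %q{<a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20">x</a> <a href="javascript:show('f6a1ok3n4d4p', 'random', 'site2.com')">y</a>}

# scan yields one [id, host] pair per match.
urls = html.scan(js_link).map { |id, host| "http://#{host}/#{id}" }
puts urls
```

This is still a regexp over HTML, so it will miss links whose quoting or argument order differs from the sample.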

To just match your case,
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
parts = str.scan(/'([\w.]+)'/).flatten # => ["f6a1ok3n4d4p", "site2.com"]
puts "http://#{parts[1]}/#{parts[0]}" # => http://site2.com/f6a1ok3n4d4p

Related

Remove leading or trailing whitespace of string except html tag in ruby

I want to remove the leading and trailing whitespace around the text in a string while leaving the HTML tags intact.
Example:
html = <a class=\"c-grid__quotation--link\" target=\"_blank\" href=\"https://www.yahoo.com/\"><div class=\"c-grid__quotation text--s-md p-topic__quotation__border c-border-r-5\">\n <div class=\"c-flex\">\n <div class=\"c-grid__quotation--main\">\n <img src=\"https://s.yimg.com/dh/ap/default/130909/y_200_a.png\" alt=\"Y 200 a\" />\n </div>\n <div class=\"c-grid__quotation--side\">\n <div class=\"c-grid__quotation--side-title text--b\">\n Yahoo\n </div>\n <div class=\"c-grid__quotation--side-description\">\n News, email and search are just the beginning. Discover more every day. Find your yodel.\n </div>\n <div class=\"c-grid__quotation--side-url\">\n www.yahoo.com\n </div>\n </div>\n </div>\n</div></a>
My current approach:
html.gsub(/>\s{1,8}</, "><").gsub(/>\s{1,8}/, ">").gsub(/\s{1,8}</, "<")
How much whitespace this removes depends on the hard-coded pattern.
Is there a better way to write it?
Use positive lookarounds:
html = %| <a class=\"c-.......| # your line goes here
html.gsub(/(?<=>)\s+|\s+(?=<)/, '')
The above means “remove all whitespace after '>' or before '<'.”
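A small sketch of the lookaround substitution on a made-up fragment (the sample HTML is an assumption, not the string from the question):

```ruby
html = "<div>\n  <p>\n    Hello world\n  </p>\n</div>"

# (?<=>)\s+  matches whitespace that directly follows '>'
# \s+(?=<)   matches whitespace that directly precedes '<'
compact = html.gsub(/(?<=>)\s+|\s+(?=<)/, '')
puts compact
```

Whitespace inside the text itself (between "Hello" and "world") is untouched because it borders neither '>' nor '<'.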
Try the following:
html = "<a class=\"c-grid__quotation--link\" target=\"_blank\" href=\"https://www.yahoo.com/\"><div class=\"c-grid__quotation text--s-md p-topic__quotation__border c-border-r-5\">\n <div class=\"c-flex\">\n <div class=\"c-grid__quotation--main\">\n <img src=\"https://s.yimg.com/dh/ap/default/130909/y_200_a.png\" alt=\"Y 200 a\" />\n </div>\n <div class=\"c-grid__quotation--side\">\n <div class=\"c-grid__quotation--side-title text--b\">\n Yahoo\n </div>\n <div class=\"c-grid__quotation--side-description\">\n News, email and search are just the beginning. Discover more every day. Find your yodel.\n </div>\n <div class=\"c-grid__quotation--side-url\">\n www.yahoo.com\n </div>\n </div>\n </div>\n</div></a>"
-> html.squeeze(' ').strip
output:
"<a class=\"c-grid__quotation--link\" target=\"_blank\" href=\"https://www.yahoo.com/\"><div class=\"c-grid__quotation text--s-md p-topic__quotation__border c-border-r-5\"> <div class=\"c-flex\"> <div class=\"c-grid__quotation--main\"> <img src=\"https://s.yimg.com/dh/ap/default/130909/y_200_a.png\" alt=\"Y 200 a\" /> </div> <div class=\"c-grid__quotation--side\"> <div class=\"c-grid__quotation--side-title text--b\"> Yahoo </div> <div class=\"c-grid__quotation--side-description\"> News, email and search are just the beginning. Discover more every day. Find your yodel. </div> <div class=\"c-grid__quotation--side-url\"> www.yahoo.com </div> </div> </div> </div></a>"
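One caveat worth checking before relying on this: String#squeeze(' ') collapses only runs of spaces and leaves "\n" characters in place. If the input contains real newlines, collapsing every whitespace run needs something like gsub(/\s+/, ' '). A sketch on a made-up fragment:

```ruby
html = "<div>\n    <p>  Yahoo  </p>\n</div>  "

# squeeze(' ') merges repeated spaces but keeps the newlines.
squeezed = html.squeeze(' ').strip
puts squeezed.inspect

# gsub(/\s+/, ' ') turns every whitespace run, newlines included, into one space.
collapsed = html.gsub(/\s+/, ' ').strip
puts collapsed.inspect
```

Neither variant removes the single space left between a tag and its text.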

Scrapy and XPath issue with nested Xpaths

I'm trying to read Amazon products into scrapy.
Starting from a random category using this XPath:
products = Selector(response).xpath('//div[@class="s-item-container"]')
for product in products:
    item = AmzItem()
    item['title'] = product.xpath('//a[@class="s-access-detail-page"]/@title').extract()[0]
    item['url'] = product.xpath('//a[@class="s-access-detail-page"]/@href').extract()[0]
    yield item
('//div[@class="s-item-container"]') returns all the divs with the products on one category page, which is correct.
Now, how would I get the link to the product?
// should match anywhere in the document
a with the @class predicate should select the link with the right class
But I get:
item['title'] = product.xpath('//a[@class="s-access-detail-page"]/@title').extract()[0]
exceptions.IndexError: list index out of range
So the list matching this XPath must be empty, but I don't understand why.
EDIT:
The HTML looks like this:
<div class="s-item-container" style="height: 343px;">
<div class="a-row a-spacing-base">
<div class="a-column a-span12 a-text-left">
<div class="a-section a-spacing-none a-inline-block s-position-relative">
<a class="a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer"><img alt="Product Details" src="http://ecx.images-amazon.com/images/I/41%2BzrAY74UL._AA160_.jpg" onload="viewCompleteImageLoaded(this, new Date().getTime(), 24, false);" class="s-access-image cfMarker" height="160" width="160"></a>
<div class="a-section a-spacing-none a-text-center">
<div class="a-row a-spacing-top-mini">
<a class="a-size-mini a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">
<div class="a-box">
<div class="a-box-inner a-padding-mini"><span class="a-color-secondary">See more choices</span></div>
</div>
</a>
</div>
</div>
</div>
</div>
</div>
<div class="a-row a-spacing-mini">
<div class="a-row a-spacing-none">
<a class="a-link-normal s-access-detail-page a-text-normal" title="Harry Potter Gryffindor School Fancy Robe Cloak Costume And Tie (Size S)" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">
<h2 class="a-size-base a-color-null s-inline s-access-title a-text-normal">Harry Potter Gryffindor School Fancy Robe Cloak Costume And Tie (Size S)</h2>
</a>
</div>
<div class="a-row a-spacing-mini"><span class="a-size-small a-color-secondary">by </span><span class="a-size-small a-color-secondary">Legend</span></div>
</div>
<div class="a-row a-spacing-mini">
<div class="a-row a-spacing-none"><a class="a-size-small a-link-normal a-text-normal" href="http://www.amazon.com/gp/offer-listing/B0105S434A/ref=sr_1_21_olp?s=pet-supplies&ie=UTF8&qid=1435391788&sr=1-21&keywords=pet+supplies&condition=new"><span class="a-size-base a-color-price a-text-bold">$28.99</span><span class="a-letter-space"></span>new<span class="a-letter-space"></span><span class="a-color-secondary">(1 offer)</span><span class="a-letter-space"></span><span class="a-color-secondary a-text-strike"></span></a></div>
</div>
<div class="a-row a-spacing-none"><span name="B0105S434A">
<span class="a-declarative" data-action="a-popover" data-a-popover="{"max-width":"700","closeButton":"false","position":"triggerBottom","url":"/review/widgets/average-customer-review/popover/ref=acr_search__popover?ie=UTF8&asin=B0105S434A&contextId=search&ref=acr_search__popover"}"><i class="a-icon a-icon-star a-star-4"><span class="a-icon-alt">3.9 out of 5 stars</span></i><i class="a-icon a-icon-popover"></i></span></span>
<a class="a-size-small a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B0105S434A" rel="nofollow noreferrer">48</a>
</div>
</div>
It should be:
# ------------- The dot makes the query relative to product
product.xpath('.//a[@class="s-access-detail-page"]/@title')
//a[@class="s-access-detail-page"] requires the attribute to be exactly class="s-access-detail-page", because XPath compares the attribute as a plain string and knows nothing about class lists. When an element has multiple classes, use the contains function:
//a[contains(concat(' ', @class, ' '), " s-access-detail-page ")]/@title
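The difference between the two predicates can be mimicked with plain string operations. This Ruby sketch only illustrates the matching logic and does not run any XPath; the class string is copied from the HTML in the question.

```ruby
class_attr = "a-link-normal s-access-detail-page a-text-normal"

# @class="s-access-detail-page" is plain string equality on the whole attribute.
exact = (class_attr == "s-access-detail-page")

# contains(concat(' ', @class, ' '), ' s-access-detail-page ') pads the
# attribute with spaces so that only whole class names can match.
token = " #{class_attr} ".include?(" s-access-detail-page ")

# A bare contains(@class, 'detail') would also match partial class names.
partial = class_attr.include?("detail")

puts exact, token, partial
```

The padding is what keeps a class such as s-access-detail-page-old from matching by accident.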

Ruby Nokogiri Parsing Multiple Elements within Lists

<div class='prdlist'>
<ul>
<li class='first'>
<a href="some url 1">
<div class="text">
<br>product number 1
</div>
</a>
</li>
<li class='second'>
<a href="some url 2">
<div class="text">
<br>product number 2
</div>
</a>
</li>
</ul>
</div>
Using the example above, I would like to parse the values inside each list item, one item at a time. Something like:
html.xpath("//*[@class='prdlist']/ul/li").each do |each|
  url = each.xpath/css (parse the href from each list item)
  name = each.xpath/css (parse the text from each list item)
  arr << [url,name]
end
which would eventually output:
arr = [["some url 1","product number 1"],["some url 2","product number 2"]]
I am currently using regex & xpath("//*[@href]/@href") to get all URLs, something similar to get all product names, and then .zip to put the arrays together. But I've come across an HTML page where I would like to do it list by list.
Thanks for the help!
And there you have it.
arr = []
html.css("div.prdlist li").each do |me|
  url = me.css("a").map{|link| link['href']}[0]
  name = me.text.delete("\n").split.join(" ")
  arr << [url,name]
end

How can I create a custom xpath query?

This is my HTML file data:
<article class='course-box'>
<div class='row-fluid'>
<div class='span2'>
<div class='course-cover' style='width: 100%'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle'>
<a href='https://novoed.com/hc'>Hippocrates Challenge</a>
</h2>
<figure class='pricetag'>
Free
</figure>
<div class='timeline independent-text'>
<div class='timeline inline-block'>
Starting Spring 2014
</div>
</div>
By Jill Helms
<div class='university' style='margin-top:0px; font-style:normal;'>
Stanford University
</div>
</div>
</div>
<div class='hovered row-fluid' onclick="location.href='https://novoed.com/hc'">
<div class='span2'>
<div class='course-cover'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955' style='width: 100%'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle' style='margin-top: 10px'>
<a href='https://novoed.com/hc'>
Hippocrates Challenge
</a>
</h2>
<p class='description' style='width: 70%'>
Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine. The course focuses on teaching anatomy in an interactive way, students will learn about diagnosis and treatment planning while...
</p>
<div style='margin-right: 10px'>
<a class='btn action-btn novoed-primary' href='https://novoed.com/users/sign_up?class=hc'>
Sign Up
</a>
</div>
</div>
</div>
From the code above, I need to fetch the values of the following tags/classes:
coursetitle
coursetitle href link
pricetag
timeline inline-block
university
description
instructor name
However, coursetitle appears in two places and I need it only once. Also, the instructor name is not wrapped in any specific tag that I could fetch.
My XPath queries are:
novoedData = HtmlXPathSelector(response)
courseTitle = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/text()').extract()
courseDetailLink = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/@href').extract()
courseInstructorName = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/text()').extract()
coursePriceType = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/figure[re:test(@class, "pricetag")]/text()').extract()
courseShortSummary = novoedData.xpath('//div[re:test(@class, "hovered row-fluid")]/div[re:test(@class, "span10")]/p[re:test(@class, "description")]/text()').extract()
courseUniversity = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/div[re:test(@class, "university")]/text()').extract()
But the number of values in each list variable is different:
len(courseTitle) = 40 (two times because of repetition)
len(courseDetailLink) = 40 (two times because of repetition)
len(courseInstructorName) = 160 (some unwanted character is coming because no specific tag for this value)
len(coursePriceType) = 20 (correct count no repetition)
len(courseShortSummary)= 20 (correct count no repetition)
len(courseUniversity) = 20 (correct count no repetition)
Kindly modify my XPath queries to solve my problem. Thanks in advance.
You don't need re:test; simply do:
>>> s = sel.xpath('//div[@class="row-fluid"]/div[@class="span10"]')[0]
>>> len(s)
1
>>> s.xpath('h2[@class="coursetitle"]/a/@href').extract()
[u'https://novoed.com/hc']
Also note that once s points at the right element, you can run the remaining relative queries from it.

Scraping data based on the text of other neighboring elements?

I have HTML like this:
<div id="left">
<div id="leftNav">
<div id="leftNavContainer">
<div id="refinements">
<h2>Department</h2>
<ul id="ref_2975312011">
<li>
<a href="#">
<span class="expand">Pet Supplies</span>
</a>
</li>
<li>
<strong>Dogs</strong>
</li>
<li>
<a>
<span class="refinementLink">Carriers & Travel Products</span>
<span class="narrowValue"> (5,570)</span>
</a>
</li>
(etc...)
Which I'm scraping like this:
html = file
data = Nokogiri::HTML(open(html))
categories = data.css('#ref_2975312011')
@categories_hash = {}
categories.css('li').drop(2).each do |category|
  categories_title = category.css('.refinementLink').text
  categories_count = category.css('.narrowValue').text[/[\d,]+/].delete(",").to_i
  @categories_hash[:categories] ||= {}
  @categories_hash[:categories]["Dogs"] ||= {}
  @categories_hash[:categories]["Dogs"][categories_title] = categories_count
end
Now I want to do the same without hard-coding #ref_2975312011 and "Dogs".
So I was thinking I could tell Nokogiri the following:
Scrape the li elements (starting from the third one) that are right
below the li element which has the text Pet Supplies enclosed by a link and a span tag.
Any ideas on how to accomplish that?
The Pet Supplies li would be:
puts doc.at('li:has(a span[text()="Pet Supplies"])')
The following sibling li's would be (skipping the first one):
puts doc.search('li:has(a span[text()="Pet Supplies"]) ~ li:gt(1)')
