Extract a link with Nokogiri from the text of link?

I want to extract a specific link from a webpage, searching for it by its text, using Nokogiri:
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
I would like the href of "site 3" and return:
http://example.org/site/3/
Or I would like the href of "site 1" and return:
http://example.org/site/1/
How can I do it?

Original:
text = <<TEXT
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
TEXT
link_text = "site 1"
doc = Nokogiri::HTML(text)
p doc.xpath("//a[text()='#{link_text}']/@href").to_s
Updated:
As far as I know, Nokogiri's XPath implementation doesn't support regular expressions. For basic prefix matching there's a function called starts-with, which you can use like this (links starting with "s"):
doc = Nokogiri::HTML(text)
array_of_hrefs = doc.xpath("//a[starts-with(text(), 's')]/@href").map(&:to_s)

Maybe you will like css style selection better:
doc.at('a[text()="site 1"]')[:href] # exact match
doc.at('a[text()^="site 1"]')[:href] # starts with
doc.at('a[text()*="site 1"]')[:href] # match anywhere

require 'nokogiri'
text = "site 1"
doc = Nokogiri::HTML(DATA)
p doc.xpath("//div[@class='links']//a[contains(text(), '#{text}')]/@href").to_s

Just to document another way we can do this in Ruby, using the URI module:
require 'uri'
html = %q[
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
]
uris = Hash[URI.extract(html).map.with_index{ |u, i| [1 + i, u] }]
=> {
1 => "http://example.org/site/1/'",
2 => "http://example.org/site/2/'",
3 => "http://example.org/site/3/'"
}
uris[1]
=> "http://example.org/site/1/'"
uris[3]
=> "http://example.org/site/3/'"
Under the covers URI.extract uses a regular expression, which isn't the most robust way of finding links in a page, but it is pretty good since a URI usually is a string without whitespace if it is to be useful.
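
Note that each result above still carries the trailing quote from the attribute. A rough alternative, still regex-based and so no more robust than URI.extract, is to capture the href value and the link text together so the lookup can be done by text. This sketch assumes single-quoted href attributes exactly as in the sample; the names html and links are illustrative.

```ruby
html = %q[
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
]

# Each match is a [href, text] pair; flip it into a text => href Hash.
pairs = html.scan(/<a\s+href='([^']*)'>([^<]*)<\/a>/)
links = pairs.map { |href, text| [text.strip, href] }.to_h

puts links['site 3']
puts links['site 1']
```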

Related

How can I extract URLs from HTML content with a Ruby regexp?

This is an example since it is not easy to explain:
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
In the above content I want to extract from
javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')
the strings "f6a1ok3n4d4p" and "site2.com", then combine them into
http://site2.com/f6a1ok3n4d4p
and same for
javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')
to become
http://site1.com/zsgn82c4b96d
I need it to be done with Ruby regex.

You can proceed like this:
require 'uri'
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
# regex scan to get values within javascript:show
vals = str.scan(/javascript:show\((.*)\)/)[0][0].split(',')
# => ["'f6a1ok3n4d4p'", "'random%20strings%204'", "%20'site2.com'"]
# joining resultant Array elements to generate url
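# URI.decode was removed in Ruby 3.0; URI.decode_www_form_component also turns %20 into a space
url = "http://" + URI.decode_www_form_component(vals.last).tr("'", '').strip + "/" + vals.first.tr("'", '')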
# => "http://site2.com/f6a1ok3n4d4p"
Obviously my answer is not foolproof. You can improve it by checking for the case where scan returns [].
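
One way to handle the no-match case mentioned above is to use String#match, which returns nil when nothing matches, instead of indexing into the result of scan. The following is a sketch under the assumption that the link always has the three-argument javascript:show('id','label',%20'host') shape from the question; show_link_to_url is a hypothetical helper name.

```ruby
# Hypothetical helper: returns nil instead of raising when the input
# does not contain a javascript:show(...) call of the expected shape.
def show_link_to_url(str)
  m = str.match(/javascript:show\('([^']+)',\s*'[^']*',\s*(?:%20)*'([^']+)'\)/)
  return nil unless m

  # m[1] is the id (first argument), m[2] is the host (third argument).
  "http://#{m[2]}/#{m[1]}"
end

puts show_link_to_url("javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')")
p show_link_to_url("javascript:report(3191274,%203,%202164691,%201)")
```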

This should do the trick, though the regexp isn't particularly flexible.
js_link_regex = /href=\"javascript:show\('([^']+)','[^']+',%20'([^']+)'\)/
link = <<eos
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
eos
matches = link.scan(js_link_regex)
matches.each do |match|
puts "http://#{match[1]}/#{match[0]}"
end

To just match your case,
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
parts = str.scan(/'([\w.]+)'/).flatten # => ["f6a1ok3n4d4p", "site2.com"]
puts "http://#{parts[1]}/#{parts[0]}" # => http://site2.com/f6a1ok3n4d4p

Ruby Nokogiri Parsing Multiple Elements within Lists

<div class='prdlist'>
<ul>
<li class='first'>
<a href="some url 1">
<div class="text">
<br>product number 1
</div>
</a>
</li>
<li class='second'>
<a href="some url 2">
<div class="text">
<br>product number 2
</div>
</a>
</li>
</ul>
</div>
Using above example,
I would like to parse the values inside each list, list by list. Something like:
html.xpath("//*[@class='prdlist']/ul/li").each do |each|
url = each.xpath/css (parse the href from each list)
name = each.xpath/css (parse the text from each list)
end
arr << [url,name]
which would eventually output:
arr = [["some url 1","product number 1"],["some url 2","product number 2"]]
I am currently using regex and xpath("//*[@href]/@href") to get all URLs, and something similar to get all product names, then using .zip to put the arrays together. But I've come across HTML where I would like to do it list by list.
Thanks for the help!

And there you have it.
arr = []
html.css("div.prdlist li").each do |me|
url = me.css("a").map{|link| link['href']}[0]
name = me.text.delete("\n").split.join(" ")
arr << [url,name]
end

How to find an embedded "a" tag

I can't get an a tag in p.user_info:
<p class="user_info">
<a href="javascript:;" onClick="showSideView(this, 'login_id', 'user_name', 'ZmFubmlAaGFubWFpbC5uZXQ=', '');" title="[login_id]user_name">
<img src='/cs2/data/member/fa/login_id.gif?dt=20130117095107' align='absmiddle' border='0'> of
</a>
</p>
Using:
p_user_info = page.css("p.user_info")
puts p_user_info.css("a") # => []
puts p_user_info.css("a")[0] # => null
puts p_user_info.css("a").text # => ""
Is it possible to get login_id, user_name in a tag using Nokogiri?
I found a more important problem:
url = "http://clien.net/cs2/bbs/board.php?bo_table=park&wr_id=23895599"
html = open(url).read
puts html
# => ...
<p class="user_info"> <img src='/cs2/data/member/at/atlantis33.gif?dt=20130506110916' align='absmiddle' border='0'>님 </p>
...
I don't know why I can't get the a tag.

Try following:
require 'nokogiri'
html = <<eoh
<p class="user_info">
<a href="javascript:;" onClick="showSideView(this, 'login_id', 'user_name', 'ZmFubmlAaGFubWFpbC5uZXQ=', '');" title="[login_id]user_name">
<img src='/cs2/data/member/fa/login_id.gif?dt=20130117095107' align='absmiddle' border='0'> of
</a>
</p>
eoh
page = Nokogiri::HTML(html)
a = page.at_css("p.user_info a")
p a[:onclick].split(',')[1,2]
# => [" 'login_id'", " 'user_name'"]
p a[:onclick].split(',')[1,2].map { |x| x.gsub(/^[' ]+|[' ]+$/, '') }
# => ["login_id", "user_name"]
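
Splitting on commas works for this sample but would break if an argument itself contained a comma. A slightly sturdier sketch is to scan for the single-quoted arguments instead. It uses only the onclick string from the snippet above and assumes the arguments never contain an escaped quote; the variable names are illustrative.

```ruby
onclick = "showSideView(this, 'login_id', 'user_name', 'ZmFubmlAaGFubWFpbC5uZXQ=', '');"

# Collect every single-quoted argument, in order of appearance.
args = onclick.scan(/'([^']*)'/).flatten

# The first two quoted arguments are the login id and the user name.
login_id, user_name = args.first(2)

puts login_id
puts user_name
```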

Answering my own question: that a tag is only visible after logging in, so I need the Mechanize library.

require 'nokogiri'
a =%{<p class="user_info">
<a href="javascript:;" onClick="showSideView(this, 'login_id', 'user_name', 'ZmFubmlAaGFubWFpbC5uZXQ=', '');" title="[login_id]user_name">
<img src='/cs2/data/member/fa/login_id.gif?dt=20130117095107' align='absmiddle' border='0'> of
</a>
</p>}
html = Nokogiri::HTML(a)
link = html.at_css "a"
puts link.values[1].split[1]
puts link.values[1].split[2]

Scraping data based on the text of other neighboring elements?

I have a code like this:
<div id="left">
<div id="leftNav">
<div id="leftNavContainer">
<div id="refinements">
<h2>Department</h2>
<ul id="ref_2975312011">
<li>
<a href="#">
<span class="expand">Pet Supplies</span>
</a>
</li>
<li>
<strong>Dogs</strong>
</li>
<li>
<a>
<span class="refinementLink">Carriers & Travel Products</span>
<span class="narrowValue"> (5,570)</span>
</a>
</li>
(etc...)
Which I'm scraping like this:
html = file
data = Nokogiri::HTML(open(html))
categories = data.css('#ref_2975312011')
@categories_hash = {}
categories.css('li').drop(2).each do | categories |
categories_title = categories.css('.refinementLink').text
categories_count = categories.css('.narrowValue').text[/[\d,]+/].delete(",").to_i
@categories_hash[:categories] ||= {}
@categories_hash[:categories]["Dogs"] ||= {}
@categories_hash[:categories]["Dogs"][categories_title] = categories_count
end
Now I want to do the same, but without using #ref_2975312011 and "Dogs".
So I was thinking I could tell Nokogiri the following:
Scrape the li elements (starting from the third one) that come right
after the li element whose text, Pet Supplies, is enclosed by a link and a span tag.
Any ideas on how to accomplish that?

The Pet Supplies li would be:
puts doc.at('li:has(a span[text()="Pet Supplies"])')
The following sibling li's would be (skipping the first one):
puts doc.search('li:has(a span[text()="Pet Supplies"]) ~ li:gt(1)')

Ruby/Nokogiri inspect reveals more than class. I need the extra item inspect shows

In the following code:
page = Nokogiri::HTML($browser.html)
page_links = page.css("a").select
page_links.each do |link|
if not link.nil?
if not link['href'].nil? and !!link['href']["/about"]
puts link.class
puts link.inspect
end
end
end
the link.class outputs the following:
Nokogiri::XML::Element
#<Nokogiri::XML::Element:0x..fdb623d3c name="a" attributes=[#<Nokogiri::XML::Attr:0x..fdb623c7e name="action-type" value="8">, #<Nokogiri::XML::Attr:0x..fdb623c74 name="class" value="a-n g-s-n-aa g-s-n-aa I8 EjFvwd VP">, #<Nokogiri::XML::Attr:0x..fdb623c6a name="target" value="_top">, #<Nokogiri::XML::Attr:0x..fdb623c60 name="href" value="./104882190640970316938/about">] children=[#<Nokogiri::XML::Text:0x..fdb623792 "PetSmart Winchester">]>
And link.inspect outputs the following:
Nokogiri::XML::Element
#<Nokogiri::XML::Element:0x..fdb623666 name="a" attributes=[#<Nokogiri::XML::Attr:0x..fdb6235a8 name="action-type" value="8">, #<Nokogiri::XML::Attr:0x..fdb62359e name="class" value="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP">, #<Nokogiri::XML::Attr:0x..fdb623594 name="target" value="_top">, #<Nokogiri::XML::Attr:0x..fdb62358a name="href" value="./104882190640970316938/about">] children=[#<Nokogiri::XML::Element:0x..fdb6230bc name="div" attributes=[#<Nokogiri::XML::Attr:0x..fdb62304e name="style" value="height:110px; width:110px;">] children=[#<Nokogiri::XML::Element:0x..fdb622e1e name="img" attributes=[#<Nokogiri::XML::Attr:0x..fdb622db0 name="style" value=" height: 110px; width: 110px;">, #<Nokogiri::XML::Attr:0x..fdb622da6 name="class" value="mja">, #<Nokogiri::XML::Attr:0x..fdb622d9c name="src" value="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg">]>]>]>
In Nokogiri I can access the link text with link.content and the link URL with link['href']. Yet neither of these methods works for the image source shown in the inspect results.
How can I get the img src within this example code that inspect is revealing?
UPDATE: HERE IS THE HTML CODE
<div class="HWb">
<div class="erb">
<div class="ubb">
<div role="button" class="a-f-e c-b c-b-T c-b-Oe c-b-H-ra L0a X9" tabindex="0"
data-placeid="6817440171144926830" data-source="lo-gp" data-inline="true"
data-tooltip-delay="600" data-tooltip-align="b,l" data-oid="104882190640970316938"
data-size="small">
<span class="TIa c-b-fa"></span>
</div>
</div>
<h3 class="drb">
<a href="./104882190640970316938/about" target="_top" class="a-n g-s-n-aa g-s-n-aa I8 EjFvwd VP"
action-type="8">PetSmart Winchester</a>
</h3>
</div>
<div class="Qbb">
<span class="vqb SIa">Pet Store</span>
<span class="lja SIa">
<a href="//www.google.com/url?sa=D&oi=plus&q=https://maps.google.com/maps?q%3DPetsmart%2Bloc:22601%26numal%3D1%26hl%3Den-US%26gl%3DUS%26mix%3D2%26opth%3Dplatter_request:2%26ie%3DUTF8%26cid%3D6817440171144926830%26iwloc%3DA"
target="_blank" class="a-n uqb">2310 Legge Boulevard, Winchester, VA</a>
</span>
<span class="SIa">(540) 662-5544</span>
</div>
<div class="crb">
<div class="Pbb a-f-e">
<div class="Fbb">
<div class="cca">
<div class="tob">
<div class="xob">“Do not bother with the grooming salon, the staff are unusually stupid.
Otherwise the store is a typical petsmart.”</div>
</div>
</div>
</div>
</div>
<div class="dWa">
<a href="./104882190640970316938/about" target="_top" class="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP"
action-type="8"><div style="height:110px; width:110px;"><img src="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg" class="mja" style=" height: 110px; width: 110px;"></div></a>
</div>
</div>

Without the HTML you're making it a lot harder, but after some digging into the inspect output, I think I have a reasonable HTML snippet.
This is how I'd go about getting to the <img src="..."> tag:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<a action-type="8" class="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP" target="_top" href="./104882190640970316938/about">
<div style="height:110px; width:110px;">
<img style=" height: 110px; width: 110px;" class="mja" src="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg">
</div>
</a>
EOT
doc.at('img')['src'] # => "https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg"
You'll need to take the time to improve your question and provide more detail if that doesn't work.
If you are not sure whether you will have 0, 1 or 1+ instances of a tag, use search because it returns a NodeSet, which acts like an Array, making it easy to deal with no, single or multiple occurrences:
doc.search('img').map{ |img| img['src'] }
will return all the <img src="..."> values in the document in an array. You can iterate over those easily or use empty? to see if there are no hits:
doc.search('img').map{ |img| img['src'] }.each do |src|
# do something with src if any are found.
end
If it's possible you'll have <img> tags without the src="..." parameter, use compact to filter them out before iterating:
doc.search('img').map{ |img| img['src'] }.compact.each do |src|
# do something with src if any are found.
end
If you only expect 0 or 1 occurrence, try:
src = doc.at('img') && doc.at('img')['src']
as in:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<img src="blah">
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> "blah"
or, without the src parameter:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<img>
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> nil
or missing the <img> tag entirely:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> nil
If you want to continue to use an if block:
if doc.at('img')
puts doc.at('img')['src']
end
will accomplish what your:
if not doc.at('img').nil?
puts doc.at('img')['src']
end
accomplishes, but in a more straightforward and concise manner, while maintaining readability.
The downside to doing two at lookups is it can be costly in big documents, especially inside a loop. You could get all Perlish and use:
if (img = doc.at('img'))
puts img['src']
end
but that's not really the Ruby way. For clarity and long-term maintenance, I'd probably use:
img = doc.at('img')
if (img)
puts img['src']
end
but that exposes the img variable, cluttering up things. It's programmer's choice at that point.

Your two outputs look like they are two different links (ie both the link.class and link.inspect for each).
Assuming we are talking about getting the image source in the second output, it looks like the HTML is something like:
<div><img src="image_src" /></div>
Assuming that is true, then you need to do:
puts link.at_css("img")['src']

I have found if you take the results from link.inspect, since they are a string, and use regex you can grab the image URL.
link.inspect[/http.*com.*"/].chop # Since all other urls are relative ./
I don't believe this is the best method. I will try working with the other answers first.
