How to parse the image href in Nokogiri - ruby

I am parsing a web page using Nokogiri, and would like to parse out an image URL. This is my setup:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('https://themeforest.net/search?sort=sales'))
I can see the following code block if I inspect the page in Chrome:
<div class="_2_3rp " style="padding-top:50.847457627118644%">
<div style="">
<img class="_1xvs1" src="https://themeforest.img.customer.envatousercontent.com/files/274559780/screenshots/00-Preview.jpg?auto=compress%2Cformat&fit=crop&crop=top&w=590&h=300&s=37354d884fd0f3b574238e013b4ea423"
title="Avada | Responsive Multi-Purpose Theme"
alt="Avada | Responsive Multi-Purpose Theme" style="left: 0%;">
</div>
</div>
However, when I run:
puts doc.search("//div[@class = '_2_3rp ']")
I get the following:
<div class="_2_3rp " style="padding-top:50.847457627118644%"><div style="height:100%" class="lazyload-placeholder"></div></div>
<div class="_2_3rp " style="padding-top:50.847457627118644%"><div style="height:100%" class="lazyload-placeholder"></div></div>
.....
=> nil
Why am I not getting the img class, and instead getting lazyload-placeholder? Is there any way I can get over this, and escape the image placeholder?

Here's the minimal code I came up with that's necessary to test your assertion:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="12345">
<div>
<img class="67890" src="https://foo.bar">
</div>
</div>
EOT
doc.search('//div[@class=12345]').map(&:to_html)
# => ["<div class=\"12345\">\n" +
# " <div>\n" +
# " <img class=\"67890\" src=\"https://foo.bar\">\n" +
# " </div>\n" +
# "</div>"]
It looks like the img tag is there.
If you're using Nokogiri::XML to parse, don't. It parses strictly, and HTML is anything but strict, so problems can occur if the HTML is malformed. Use Nokogiri::HTML instead.

Related

Images inserted into HTML properly but not displayed - size is 0*0px

I am using a HAML template of a bootstrap carousel to display all images from a folder.
The images should not be displayed with a size of 0 by 0 pixels. There is no CSS property that would be setting this, the width of the element is set to 100% in CSS and even changing the size in a browser console does nothing.
All the images are accessible directly from a browser otherwise (like http://localhost:4567/car-images/fb_1.jpg) and there are no 404 errors.
This is the HAML template with a block of Ruby code:
.col-sm-6#carousel
  .carousel.slide#myCarousel{ "data-ride" => "carousel", :style => "height:inherit"}
    %ol.carousel-indicators
    .carousel-inner{ :role => "listbox"}
      - @images.each do |image|
        .item
          %img{ :src => "car-images/#{image}"}
    %a.left.carousel-control{ "data-slide" => "prev", :href => "#myCarousel", :role => "button"}
      %span.glyphicon.glyphicon-chevron-left{ "aria-hidden" => "true"}
      %span.sr-only Previous
    %a.right.carousel-control{ "data-slide" => "next", :href => "#myCarousel", :role => "button"}
      %span.glyphicon.glyphicon-chevron-right{ "aria-hidden" => "true"}
      %span.sr-only Next
And this is how it renders in a browser:
<div class='col-sm-6' id='carousel'>
<div class='carousel slide' data-ride='carousel' id='myCarousel' style='height:inherit'>
<ol class='carousel-indicators'></ol>
<div class='carousel-inner' role='listbox'>
<div class='item'>
<img src='car-images/fb_1.jpg'>
</div>
<div class='item'>
<img src='car-images/fb_2.jpg'>
</div>
<div class='item'>
<img src='car-images/fb_3.jpg'>
</div>
<div class='item'>
<img src='car-images/fb_4.jpg'>
</div>
<div class='item'>
<img src='car-images/fb_5.jpg'>
</div>
<div class='item'>
<img src='car-images/fb_6.jpg'>
</div>
<div class='item'>
<img src='car-images/fb_7.jpg'>
</div>
<div class='item'>
<img src='car-images/fb_8.jpg'>
</div>
</div>
<a class='left carousel-control' data-slide='prev' href='#myCarousel' role='button'>
<span aria-hidden='true' class='glyphicon glyphicon-chevron-left'></span>
<span class='sr-only'>Previous</span>
</a>
<a class='right carousel-control' data-slide='next' href='#myCarousel' role='button'>
<span aria-hidden='true' class='glyphicon glyphicon-chevron-right'></span>
<span class='sr-only'>Next</span>
</a>
</div>
</div>
Also, the Ruby code that runs the view:
require 'sinatra'
require 'haml'
$car_img_dir = 'public/car-images'
get '/' do
@images = Dir.foreach($car_img_dir).select { |x| File.file?("#{$car_img_dir}/#{x}") }
haml :index
end
get '/about' do
haml :about
end
get '/products' do
haml :products
end
I have been trying to solve this for about 2.5 hours now, and being a beginner, I am unaware of any solutions.
Your use of a global is bad practice. I'd recommend learning about variable scoping and using constants. Eventually you'll learn why and when to use a global, but until then it'd be better to avoid them.
The code in question could be written more idiomatically like:
CAR_IMG_DIR = 'public/car-images'
get '/' do
@images = Dir.foreach(CAR_IMG_DIR).select { |x| File.file?(File.join(CAR_IMG_DIR, x)) }
haml :index
end
Rather than assume that the path separator character is / you should let Ruby determine what it is and supply it. Using File.join allows Ruby to do that. See the beginning of the documentation for the IO class and join.
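To make the File.join point concrete, here is a minimal sketch that lists only the regular files in a directory. The helper name image_files and the temporary directory contents are made up for illustration; they are not part of the original app.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: return the names of regular files in dir,
# building each path with File.join instead of hard-coding "/".
def image_files(dir)
  Dir.children(dir).select { |name| File.file?(File.join(dir, name)) }.sort
end

# Demo against a throwaway directory (assumed layout, not the real public/car-images).
Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, 'fb_1.jpg'))
  FileUtils.touch(File.join(dir, 'fb_2.jpg'))
  Dir.mkdir(File.join(dir, 'thumbs')) # a subdirectory, which the helper skips
  p image_files(dir)
end
```

Dir.children is used here instead of Dir.foreach only because it leaves out "." and ".." up front; the File.file? check would filter them out either way.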
You can't trust the HTML that the browser shows you. Browsers can, and will, modify HTML. You can use curl or wget, or various HTTP clients for Ruby, or even HAML itself, to view the emitted HTML without the browser getting in the way. I'd recommend getting used to doing that. You always need to know how the browser will render the page, but don't trust its rendered HTML to be an accurate rendition of the real HTML.
Regarding the image size of 0x0: You can change your template to set the size, or you can add CSS to do it for all images of that class. I'd add the CSS, but you're going to have to do it one place or another. By default the browser should show the images in their native sizes so something is telling the browser to set it to 0x0 and you need to override it. Perhaps looking at the style attributes of the images will give you an idea where the problem is.

Parsing nodes with Nokogiri?

I'm parsing web pages and I want to get the link from the <img src> by finding the <div id="image">.
How do I do this in Nokogiri? I tried walking through the child nodes but it fails.
<div id="image" class="image textbox ">
<div class="">
<img src="img.jpg" alt="" original-title="">
</div>
</div>
This is my code:
doc = Nokogiri::HTML(open("site.com"))
doc.css("div.image").each do |node|
node.children().each do |c|
puts c.attr("src")
end
end
Any ideas?
Try this and let me know if it works for you
require 'nokogiri'
source = <<-HTML
<div id="image" class="image textbox ">
<div class="">
<img src="img.jpg" alt="" original-title="">
</div>
</div>
HTML
doc = Nokogiri::HTML(source)
doc.css('div#image > div > img').each do |image|
puts image.attr('src')
end
Output:
img.jpg
Here is a great resource: http://ruby.bastardsbook.com/chapters/html-parsing/
Modifying an example a bit, I get this:
doc = Nokogiri::HTML(open("site.com"))
doc.css("div.image img").each do |img|
puts img.attr("src")
end
Although you should use the ID selector, #image, rather than the class selector, .image, when you can. It is very much faster.

Extract a link with Nokogiri from the text of link?

I want to extract a specific link from a webpage, searching for it by its text, using Nokogiri:
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
I would like the href of "site 3" and return:
http://example.org/site/3/
Or I would like the href of "site 1" and return:
http://example.org/site/1/
How can I do it?
Original:
text = <<TEXT
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
TEXT
link_text = "site 1"
doc = Nokogiri::HTML(text)
p doc.xpath("//a[text()='#{link_text}']/@href").to_s
Updated:
As far as I know, Nokogiri's XPath implementation doesn't support regular expressions. For basic starts-with matching there's a function called starts-with that you can use like this (links starting with "s"):
doc = Nokogiri::HTML(text)
array_of_hrefs = doc.xpath("//a[starts-with(text(), 's')]/@href").map(&:to_s)
Maybe you will like css style selection better:
doc.at('a[text()="site 1"]')[:href] # exact match
doc.at('a[text()^="site 1"]')[:href] # starts with
doc.at('a[text()*="site 1"]')[:href] # match anywhere
require 'nokogiri'
text = "site 1"
doc = Nokogiri::HTML(DATA)
p doc.xpath("//div[@class='links']//a[contains(text(), '#{text}')]/@href").to_s
Just to document another way we can do this in Ruby, using the URI module:
require 'uri'
html = %q[
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
]
uris = Hash[URI.extract(html).map.with_index{ |u, i| [1 + i, u] }]
=> {
1 => "http://example.org/site/1/'",
2 => "http://example.org/site/2/'",
3 => "http://example.org/site/3/'"
}
uris[1]
=> "http://example.org/site/1/'"
uris[3]
=> "http://example.org/site/3/'"
Under the covers URI.extract uses a regular expression, which isn't the most robust way of finding links in a page, but it is pretty good since a URI usually is a string without whitespace if it is to be useful.
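As a rough sketch of the same regex idea without URI.extract, String#scan with a hand-written pattern can stop at quotes, so the trailing ' seen in the output above is not captured. The pattern here is my own simplification, not the one URI.extract uses, and it only looks for http and https URLs.

```ruby
html = %q[
<div class="links">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
]

# Match http(s) URLs, stopping at whitespace, quotes, or angle brackets
# so the closing quote of the href attribute is left out.
urls = html.scan(%r{https?://[^\s'"<>]+})

p urls
```

This is still a regex over raw markup, so the same caveat about robustness applies; a parser is the safer choice when the HTML is not under your control.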

Ruby/Nokogiri inspect reveals more than class. I need the extra item inspect shows

In the following code:
page = Nokogiri::HTML($browser.html)
page_links = page.css("a").select
page_links.each do |link|
if not link.nil?
if not link['href'].nil? and !!link['href']["/about"]
puts link.class
puts link.inspect
end
end
end
the link.class outputs the following:
Nokogiri::XML::Element
#<Nokogiri::XML::Element:0x..fdb623d3c name="a" attributes=[#<Nokogiri::XML::Attr:0x..fdb623c7e name="action-type" value="8">, #<Nokogiri::XML::Attr:0x..fdb623c74 name="class" value="a-n g-s-n-aa g-s-n-aa I8 EjFvwd VP">, #<Nokogiri::XML::Attr:0x..fdb623c6a name="target" value="_top">, #<Nokogiri::XML::Attr:0x..fdb623c60 name="href" value="./104882190640970316938/about">] children=[#<Nokogiri::XML::Text:0x..fdb623792 "PetSmart Winchester">]>
And link.inspect outputs the following:
Nokogiri::XML::Element
#<Nokogiri::XML::Element:0x..fdb623666 name="a" attributes=[#<Nokogiri::XML::Attr:0x..fdb6235a8 name="action-type" value="8">, #<Nokogiri::XML::Attr:0x..fdb62359e name="class" value="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP">, #<Nokogiri::XML::Attr:0x..fdb623594 name="target" value="_top">, #<Nokogiri::XML::Attr:0x..fdb62358a name="href" value="./104882190640970316938/about">] children=[#<Nokogiri::XML::Element:0x..fdb6230bc name="div" attributes=[#<Nokogiri::XML::Attr:0x..fdb62304e name="style" value="height:110px; width:110px;">] children=[#<Nokogiri::XML::Element:0x..fdb622e1e name="img" attributes=[#<Nokogiri::XML::Attr:0x..fdb622db0 name="style" value=" height: 110px; width: 110px;">, #<Nokogiri::XML::Attr:0x..fdb622da6 name="class" value="mja">, #<Nokogiri::XML::Attr:0x..fdb622d9c name="src" value="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg">]>]>]>
In Nokogiri I can access the link text by link.content and the link url by link['href']. Yet neither of these methods works for the image source shown in the inspect results.
How can I get the img src within this example code that inspect is revealing?
UPDATE: HERE IS THE HTML CODE
<div class="HWb">
<div class="erb">
<div class="ubb">
<div role="button" class="a-f-e c-b c-b-T c-b-Oe c-b-H-ra L0a X9" tabindex="0"
data-placeid="6817440171144926830" data-source="lo-gp" data-inline="true"
data-tooltip-delay="600" data-tooltip-align="b,l" data-oid="104882190640970316938"
data-size="small">
<span class="TIa c-b-fa"></span>
</div>
</div>
<h3 class="drb">
<a href="./104882190640970316938/about" target="_top" class="a-n g-s-n-aa g-s-n-aa I8 EjFvwd VP"
action-type="8">PetSmart Winchester</a>
</h3>
</div>
<div class="Qbb">
<span class="vqb SIa">Pet Store</span>
<span class="lja SIa">
<a href="//www.google.com/url?sa=D&oi=plus&q=https://maps.google.com/maps?q%3DPetsmart%2Bloc:22601%26numal%3D1%26hl%3Den-US%26gl%3DUS%26mix%3D2%26opth%3Dplatter_request:2%26ie%3DUTF8%26cid%3D6817440171144926830%26iwloc%3DA"
target="_blank" class="a-n uqb">2310 Legge Boulevard, Winchester, VA</a>
</span>
<span class="SIa">(540) 662-5544</span>
</div>
<div class="crb">
<div class="Pbb a-f-e">
<div class="Fbb">
<div class="cca">
<div class="tob">
<div class="xob">“Do not bother with the grooming salon, the staff are unusually stupid.
Otherwise the store is a typical petsmart.”</div>
</div>
</div>
</div>
</div>
<div class="dWa">
<a href="./104882190640970316938/about" target="_top" class="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP"
action-type="8"><div style="height:110px; width:110px;"><img src="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg" class="mja" style=" height: 110px; width: 110px;"></div></a>
</div>
</div>
Without the HTML you're making it a lot harder, but after some digging into the inspect output, I think I have a reasonable HTML snippet.
This is how I'd go about getting to the <img src="..."> tag:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<a action-type="8" class="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP" target="_top" href="./104882190640970316938/about">
<div style="height:110px; width:110px;">
<img style=" height: 110px; width: 110px;" class="mja" src="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg">
</div>
</a>
EOT
doc.at('img')['src'] # => "https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg"
You'll need to take the time to improve your question and provide more detail if that doesn't work.
If you are not sure whether you will have 0, 1 or 1+ instances of a tag, use search because it returns a NodeSet, which acts like an Array, making it easy to deal with no, single or multiple occurrences:
doc.search('img').map{ |img| img['src'] }
will return all the <img src="..."> values in the document in an array. You can iterate over those easily or use empty? to see if there are no hits:
doc.search('img').map{ |img| img['src'] }.each do |src|
# do something with src if any are found.
end
If it's possible you'll have <img> tags without the src="..." parameter, use compact to filter them out before iterating:
doc.search('img').map{ |img| img['src'] }.compact.each do |src|
# do something with src if any are found.
end
If you only expect 0 or 1 occurrence, try:
src = doc.at('img') && doc.at('img')['src']
as in:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<img src="blah">
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> "blah"
or, without the src parameter:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<img>
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> nil
or missing the <img> tag entirely:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> nil
If you want to continue to use an if block:
if doc.at('img')
puts doc.at('img')['src']
end
will accomplish what your:
if not doc.at('img').nil?
puts doc.at('img')['src']
end
accomplishes, but in a more straightforward and concise manner, while maintaining readability.
The downside to doing two at lookups is it can be costly in big documents, especially inside a loop. You could get all Perlish and use:
if (img = doc.at('img'))
puts img['src']
end
but that's not really the Ruby way. For clarity and long-term maintenance, I'd probably use:
img = doc.at('img')
if (img)
puts img['src']
end
but that exposes the img variable, cluttering up things. It's programmer's choice at that point.
Your two outputs look like they come from two different links (i.e., each one shows both the link.class and the link.inspect output for its link).
Assuming we are talking about getting the image source in the second output, it looks like the HTML is something like:
<div><img src="image_src" /></div>
Assuming that is true, then you need to do:
puts link.at_css("img")['src']
I have found if you take the results from link.inspect, since they are a string, and use regex you can grab the image URL.
link.inspect[/http.*com.*"/].chop # Since all other urls are relative ./
I don't believe this is the best method. I will try working with the other answers first.

How do I exclude a nested element when grabbing content using Nokogiri?

I have a page with content that looks similar to this:
<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>
My goal is to capture the text in #level2 but the #level3 <div> is nested inside of it at the same level as the text I want.
Is it possible to some how exclude that <div>? Should I be modifying the document and simply removing the element before parsing?
require 'nokogiri'
xml = <<-XML
<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>
XML
page = Nokogiri::XML(xml)
p page.xpath("//*[@id='level3']").remove.xpath("//*[@id='level2']").inner_text
# => "\n \n Here is some text i want\n \n Here is some more text i want\n \n Oh i want this text too :)\n "
Now, you may clean the output text if you wish.
If your HTML fragment is in html, then you could do something like this:
doc = Nokogiri::HTML(html)
div = doc.at_css('#level2') # Extract <div id="level2">
div.at_css('#level3').remove # Remove <div id="level3">
text_you_want = div.inner_text
You could also do it with XPath but I find CSS selectors a bit simpler for simple cases like this.
