How to parse url in <a alt="url attribute""> - ruby

I have html code on a site:
<a alt="Кроссовки adidas. Цвет черный. Категории: Женская обувь, Лучшие отзывы, Кеды, кроссовки, ботинки. Вид 3."
class="enabledZoom MagicThumb-swap" href="http://img2.site.ru/big/120000/129102-3.jpg" rel="zoom-id:Azoom;zoom-width:450;zoom-height:598;zoom-distance:10;zoom-position:right;opacity:50;"
rev="http://img2.site.ru/large/120000/129102-3.jpg" style="outline: 0px; " id="mt-1334303054133">
<img src="http://img2.site.ru/tm/120000/129102-3.jpg" class=""></a>
How to extract "http://img2.site.ru/large/120000/129102-3.jpg" with nokogiri gem?
P.S. Nokogiri is parsing element :
[#<Nokogiri::XML::Element:0x42c1ad8 name="a" attributes=[#<Nokogiri::XML::Attr:0x42c1a7e name="alt" value="Кроссовки adidas. Цвет черный. Категории: Женская обувь, Лучшие отзывы, Кеды, кроссовки, ботинки. Вид 1.">, #<Nokogiri::XML::Attr:0x42c1a74 name="class" value="enabledZoom">, #<Nokogiri::XML::Attr:0x42c1a6a name="href" value="http://img2.site.ru/big/120000/129102-1.jpg">, #<Nokogiri::XML::Attr:0x42c1a60 name="rel" value="zoom-id:Azoom;zoom-width:450;zoom-height:598;zoom-distance:10;zoom-position:right;opacity:50;">, #<Nokogiri::XML::Attr:0x42c1a4c name="rev" value="http://img2.site.ru/large/120000/129102-1.jpg">] children=[#<Nokogiri::XML::Element:0x42c0ee4 name="img" attributes=[#<Nokogiri::XML::Attr:0x42c0e94 name="src" value="http://img2.site.ru/tm/120000/129102-1.jpg">, #<Nokogiri::XML::Attr:0x42c0e8a name="class" value="current">]>]>]

You can use the at method if you know the <img> tag you want is in the first <a> tag:
doc.at('a img')['src'] => "http://img2.site.ru/tm/120000/129102-3.jpg"
If it's not, then you'll need to isolate the <a> or the <img>. I'd probably go after the <a id="..."> using something like:
doc.at('a#mt-1334303054133 img')['src'] => "http://img2.site.ru/tm/120000/129102-3.jpg"
If there are multiple <a> or <img> tags then your sample isn't good enough and we'd need more information about the HTML you're receiving.

Related

XPath Query for URL Extraction

I need to extract http://site.ru/ from this code:
<div class="one">
<dl>
<dt class="two">
<span class="name">Site</span>
</dt>
<dd class="three">
<span class="js-pseudo-link" data-url="rAnDoMlEtTeRsAnDnUmBeRs" style>
<a href="http://site.ru/" class rel="nofollow" target="_blank" style> http://site.ru/ </a>
</span>
</dd>
</dl>
</div>
I use this XPath query: //div//dl//dd//span//a/#href
But it doesn't work. It doesn't return anything.
I'm a newbie in XPath.
Unfortunately, the data source you are looking for is an empty span node (class js-pseudo-link). The data-url attribute has the base64 encoded link you want. This node only gets populated after loading. ImportXML for some reason ignores nodes with no text and there's no way to get it not to do that. To get around this, looks like you'll have to write an apps script that can handle empty nodes or just gets the raw HTML code and parse it.

Scraping text within several span tags (Ruby & Nokogiri)

I am trying to scrape "Description" from this HTML structure
<div class="menu-index-page__item-content">
<h6 class="menu-index-page__item-title">
<span> Item title </span>
</h6>
<p class="menu-index-page__item-desc">
<span>
<span>
<span>Description</span>
</span>
</span>
Each tag has an element with it that I don't know how to handle:
data-reactid=".3wrqgx5340.3.5.0.4:$523105.2.$3959254.$menuItemContent.1.0"
Each data-reactid is different. So if I target this attribute I will scrape stuff I don't want.
I've tried .search .xpath, using tags and classes but nothing seems to work.
Is there a way to say: give me the p tag that has a class="menu-index-page__item-desc" and scrape the 3rd span from there?
You can get the required value via xpath
//text()[contains(.,'Description')]
You code and xpath:

handling iframe with capybara ruby

Below is the html code ..
<iframe id="I0_1366100881331" frameborder="0" width="100%">
<div class="ZRa">
<span id="button" class="hAa Qo Bg" tabindex="0" role="button" title="" aria- label="Click here to publicly +1 this." aria-pressed="false">
</div>
</iframe>
In the above scenario, I want to switch into the IFRAME (iframe id="I0_1366100881331") to perform some actions on the SPAN present in that IFRAME. I have tried with most of the cases but no result :(... any one please help.
I want the solution for cucumber using capybara ruby only..
Note: I tried with following code but no result.
page.driver.browser.switch_to.frame "I0_1366100881331"
within_frame(find('<css rule>')) do
<code for dealing with iframe entries>
end
I think you can try to use method:
within_frame 'id' do
<code for dealing with iframe entries>
end
Can have this code:
withinframe((:xpath,"//div")) do
#code
end

Ruby/Nokogiri inspect reveals more then class. I need the extra item inspect shows

In the following code:
page = Nokogiri::HTML($browser.html)
page_links = page.css("a").select
page_links.each do |link|
if not link.nil?
if not link['href'].nil? and !!link['href']["/about"]
puts link.class
puts link.inspect
end
end
end
the link.class outputs the following:
Nokogiri::XML::Element
#<Nokogiri::XML::Element:0x..fdb623d3c name="a" attributes=[#<Nokogiri::XML::Attr:0x..fdb623c7e name="action-type" value="8">, #<Nokogiri::XML::Attr:0x..fdb623c74 name="class" value="a-n g-s-n-aa g-s-n-aa I8 EjFvwd VP">, #<Nokogiri::XML::Attr:0x..fdb623c6a name="target" value="_top">, #<Nokogiri::XML::Attr:0x..fdb623c60 name="href" value="./104882190640970316938/about">] children=[#<Nokogiri::XML::Text:0x..fdb623792 "PetSmart Winchester">]>
And link.inspect outputs the following:
Nokogiri::XML::Element
#<Nokogiri::XML::Element:0x..fdb623666 name="a" attributes=[#<Nokogiri::XML::Attr:0x..fdb6235a8 name="action-type" value="8">, #<Nokogiri::XML::Attr:0x..fdb62359e name="class" value="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP">, #<Nokogiri::XML::Attr:0x..fdb623594 name="target" value="_top">, #<Nokogiri::XML::Attr:0x..fdb62358a name="href" value="./104882190640970316938/about">] children=[#<Nokogiri::XML::Element:0x..fdb6230bc name="div" attributes=[#<Nokogiri::XML::Attr:0x..fdb62304e name="style" value="height:110px; width:110px;">] children=[#<Nokogiri::XML::Element:0x..fdb622e1e name="img" attributes=[#<Nokogiri::XML::Attr:0x..fdb622db0 name="style" value=" height: 110px; width: 110px;">, #<Nokogiri::XML::Attr:0x..fdb622da6 name="class" value="mja">, #<Nokogiri::XML::Attr:0x..fdb622d9c name="src" value="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg">]>]>]>
In Nokogiri I can access the link text by link.content and the link url by link['href'] . Yet neither of these methods work for image source from the inspect results.
How can I get the img src within this example code that inspect is revealing?
UPDATE: HERE IS THE HTML CODE
<div class="HWb">
<div class="erb">
<div class="ubb">
<div role="button" class="a-f-e c-b c-b-T c-b-Oe c-b-H-ra L0a X9" tabindex="0"
data-placeid="6817440171144926830" data-source="lo-gp" data-inline="true"
data-tooltip-delay="600" data-tooltip-align="b,l" data-oid="104882190640970316938"
data-size="small">
<span class="TIa c-b-fa"></span>
</div>
</div>
<h3 class="drb">
<a href="./104882190640970316938/about" target="_top" class="a-n g-s-n-aa g-s-n-aa I8 EjFvwd VP"
action-type="8">PetSmart Winchester</a>
</h3>
</div>
<div class="Qbb">
<span class="vqb SIa">Pet Store</span>
<span class="lja SIa">
<a href="//www.google.com/url?sa=D&oi=plus&q=https://maps.google.com/maps?q%3DPetsmart%2Bloc:22601%26numal%3D1%26hl%3Den-US%26gl%3DUS%26mix%3D2%26opth%3Dplatter_request:2%26ie%3DUTF8%26cid%3D6817440171144926830%26iwloc%3DA"
target="_blank" class="a-n uqb">2310 Legge Boulevard, Winchester, VA</a>
</span>
<span class="SIa">(540) 662-5544</span>
</div>
<div class="crb">
<div class="Pbb a-f-e">
<div class="Fbb">
<div class="cca">
<div class="tob">
<div class="xob">“Do not bother with the grooming salon, the staff are unusually stupid.
Otherwise the store is a typical petsmart.”</div>
</div>
</div>
</div>
</div>
<div class="dWa">
<a href="./104882190640970316938/about" target="_top" class="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP"
action-type="8"><div style="height:110px; width:110px;"><img src="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg" class="mja" style=" height: 110px; width: 110px;"></div></a>
</div>
</div>
Without the HTML you're making it a lot harder, but after some digging into the inspect output, I think I have a reasonable HTML snippet.
This is how I'd go about getting to the <img src="..."> tag:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<a action-type="8" class="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP" target="_top" href="./104882190640970316938/about">
<div style="height:110px; width:110px;">
<img style=" height: 110px; width: 110px;" class="mja" src="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg">
</div>
</a>
EOT
doc.at('img')['src'] # => "https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg"
You'll need to take the time to improve your question and provide more detail if that doesn't work.
If you are not sure whether you will have 0, 1 or 1+ instances of a tag, use search because it returns a NodeSet, which acts like an Array, making it easy to deal with no, single or multiple occurrences:
doc.search('img').map{ |img| img['src'] }
will return all the <img src="..."> values in the document in an array. You can iterate over those easily or use empty? to see if there are no hits:
doc.search('img').map{ |img| img['src'] }.each do |src|
# do something with src if any are found.
end
If it's possible you'll have <img> tags without the src="..." parameter, use compact to filter them out before iterating:
doc.search('img').map{ |img| img['src'] }.compact.each do |src|
# do something with src if any are found.
end
If you only expect 0 or 1 occurrence, try:
src = doc.at('img') && doc.at('img')['src']
as in:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<img src="blah">
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> "blah"
or, without the src parameter:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<img>
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> nil
or missing the <img> tag entirely:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> nil
If you want to continue to use an if block:
if doc.at('img')
puts doc.at('img')['src']
end
will accomplish what your:
if not doc.at('img').nil?
puts doc.at('img')['src']
end
accomplishes, but in a more straightforward and concise manner, while maintaining readability.
The downside to doing two at lookups is it can be costly in big documents, especially inside a loop. You could get all Perlish and use:
if (img = doc.at('img'))
puts img['src']
end
but that's not really the Ruby way. For clarity and long-term maintenance, I'd probably use:
img = doc.at('img')
if (img)
puts img['src']
end
but that exposes the img variable, cluttering up things. It's programmer's choice at that point.
Your two outputs look like they are two different links (ie both the link.class and link.inspect for each).
Assuming we are talking about getting the image source in the second output, it looks like the HTML is something like:
<div><img src="image_src" /></div>
Assuming that is true, then you need to do:
puts link.at_css("img")['src']
I have found if you take the results from link.inspect, since they are a string, and use regex you can grab the image URL.
link.inspect[/http.*com.*"/].chop # Since all other urls are relative ./
I don't believe this is the best method. I will try working with the other answers first.

How do I exclude a nested element when grabbing content using Nokogiri?

I have a page with content that looks similar to this:
<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>
My goal is to capture the text in #level2 but the #level3 <div> is nested inside of it at the same level as the text I want.
Is it possible to some how exclude that <div>? Should I be modifying the document and simply removing the element before parsing?
require 'nokogiri'
xml = <<-XML
<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>
XML
page = Nokogiri::XML(xml)
p page.xpath("//*[#id='level3']").remove.xpath("//*[#id='level2']").inner_text
# => "\n \n Here is some text i want\n \n Here is some more text i want\n \n Oh i want this text too :)\n "
Now, you may clean the output text if you wish.
If your HTML fragment is in html, then you could do something like this:
doc = Nokogiri::HTML(html)
div = doc.at_css('#level2') # Extract <div id="level2">
div.at_css('#level3').remove # Remove <div id="level3">
text_you_want = div.inner_text
You could also do it with XPath but I find CSS selectors a bit simpler for simple cases like this.

Resources