Use nokogiri with xpath - ruby

How can I use Nokogiri to fetch an image via XPath? My main problem is that the div may be present without an image inside it:
image_node = @get_doc.xpath( '//*[@id="recaptcha_image"]/img/@src').map {|a| a.value }
#binding.pry
if image_node != nil
rec = Net::HTTP.get( URI.parse( "#{image_node['src']}" ) )
end
But I get:
in `[]': can't convert String into Integer (TypeError)
What is the correct way to do this?
Part of the HTML:
<div id="recaptcha_widget" style="display: none">
<div id="recaptcha_image">
<img *****>
</div>
<input type="text" id="recaptcha_response_field" name="recaptcha_response_field"
style="width: 295px">

I recommend CSS over XPath for most HTML queries, and many XML ones. Using CSS makes this very "visible":
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div id="recaptcha_widget" style="display: none">
<div id="recaptcha_image">
<img src="path_to_image.jpg">
</div>
<input type="text" id="recaptcha_response_field" name="recaptcha_response_field" style="width: 295px">
EOT
doc.at('#recaptcha_widget img')['src'] # => "path_to_image.jpg"
How do I check whether I have the div but no image?
To check for a missing <img> tag inside the <div>, break your lookup into two parts and check for nil:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div id="recaptcha_widget" style="display: none">
<div id="recaptcha_image">
<img src="path_to_image.jpg">
</div>
<div id="recaptcha_image2">
</div>
<input type="text" id="recaptcha_response_field" name="recaptcha_response_field" style="width: 295px">
EOT
img = doc.at('#recaptcha_widget img')
img_src = img['src'] # => "path_to_image.jpg"
If the <img> tag doesn't exist you'll get nil:
img = doc.at('#recaptcha_image2 img') # => nil
From that point you'd continue with a check to see if img was set:
if (img)
# ...do something...
end
Or use a trailing rescue to catch the NoMethodError raised by calling [] on nil, assign nil to img_src, and then test for it:
img_src = doc.at('#recaptcha_widget img')['src'] rescue nil # => "path_to_image.jpg"
img_src = doc.at('#recaptcha_image2 img')['src'] rescue nil # => nil
if (img_src)
# do something
end

Related

How to use Ruby/Nokogiri to Strip <tr> and <td> tags not enclosed in <table> tag?

I'm dealing with some badly formed HTML where table elements aren't enclosed in a table tag, such as the following:
<div class="row">
<div class="large-12 columns main-content">
<tr>
<td colspan="4"><img src="../img/H006265.jpg"></td>
</tr><tr valign="top">
<td> </td>
</tr>
</div>
</div>
I'd like to get rid of the junk tags and end up with something like this:
<div class="row">
<div class="large-12 columns main-content">
<img src="../img/H006265.jpg">
</div>
</div>
There are legit tables elsewhere in the document, so I'm not able to just strip <tr> and <td> tags altogether, only those not enclosed in a <table> tag.
I've tried having Nokogiri parse it, thinking it would clean up the incorrect HTML, to no avail:
Nokogiri::HTML::DocumentFragment.parse(badly_formed_html_string)
You can use the parsed fragment to clean your HTML:
frag = Nokogiri::HTML::DocumentFragment.parse(badly_formed_html_string)
frag.css('tr').each do |tr|
tr.add_previous_sibling tr.children
tr.remove
end
frag.css('td').each do |td|
td.add_previous_sibling td.children
td.remove
end
puts frag.to_s
# <div class="row">
# <div class="large-12 columns main-content">
# <img src="../img/H006265.jpg">
#
# </div>
# </div>
Thanks to Uri's code for helping me find a good answer. His was close, but this did the trick for me, stripping those tags only when they are not enclosed in a <table> tag:
def strip_tag_if_not_in_parent(doc, tag, parent)
doc.css(tag).each do |element|
if (element.parent.name != parent)
new_element = Nokogiri::HTML::DocumentFragment.parse(element.inner_html)
element.replace new_element
end
end
doc
end
doc = strip_tag_if_not_in_parent(doc, 'tr', 'table')
doc = strip_tag_if_not_in_parent(doc, 'td', 'tr')

How to get ID of an element using Watir where the child contains the string i search for

<div class="wrapper">
<div id="minHeightBlock" style="min-height: 430px;">
<div class="borderbox"><div class="standaloneBox">
<div class="sysHeaderContainer clearfix"> … </div>
<div class="notesForGuests"> … </div>
<div class="filterBox clearfix"> … </div>
<div class="resListHeader"> … </div>
<div id="corporaContainer" class="fullList">
<div id="c-a06ffa6a-dc62-4640-9760-dbd661c7ffe8" class="resItem clearfix">
<div class="resTitle">
<span id="filter-empty" class="statBall statFile empty" title="Status: Empty corpus"></span>
<span class="theText">
12321 corpora
</span>
</div>
<div class="resType"> … </div>
<div class="resSize"> … </div>
<div class="resPermission private"> … </div>
<div class="resDomain"> … </div>
<div class="resDescr"> … </div>
<div class="resDetails clearfix" style="display:none;"> … </div>
</div>
<div id="c-b8c0faba-e662-4998-836f-0ee58009b7fa" class="resItem clearfix"> … </div>
<div id="c-9d02b887-4835-4606-ad4b-775b39af9f48" class="resItem clearfix"> … </div>
<div id="c-021d3ba1-db03-4c4e-81a5-294737eb5b54" class="resItem clearfix"> … </div>
This is the HTML of the web page I'm trying to script using Watir. All I know is what kind of span text the element should contain. There are many of these elements, and I need to collect all of their ID values so I can use them in further actions.
I have commented the places in the above code to show what I know and what I need to get.
So far I have tried this code:
@b.div(:id, "pageHeader").link(:text, "Corpora").click
sleep 5
@b.div(:id, "corporaContainer").spans(:text => /TestAuto\s.*/).each do |span|
puts span.parent.attribute_value("id")
end
But it produces no output. Maybe I'm doing something wrong. Please help me crack this nut.
Your attempt was close. The problem is that span.parent only goes up to the <div class="resTitle">. You need to go up one more parent:
@b.div(:id, "corporaContainer").spans(:text => /corpora/).each do |span|
puts span.parent.parent.attribute_value("id")
end
(Note that I changed the text in the locator of the spans since TestAuto\s.* did not match the sample html.)
Alternatively, I sometimes find it better to find the divs that contain the span. This way you do not have to worry about the number of parents changing:
p @b.divs(:class => 'resItem')
.find_all { |div| div.span(:text => /corpora/).exists? }
.collect { |div| div.id }
#=> ["c-a06ffa6a-dc62-4640-9760-dbd661c7ffe8"]
Below is a working example. Note that there are 2 important things:
The list of results is loaded asynchronously. Therefore you need to wait for the list to finish loading before capturing the results. sleep(5) might work, but you are better off using an actual wait method (since it seems to take longer than 5 seconds).
Make sure the search text actually exists on the page. In the below example, there is no "12321 corpora" title that was mentioned in the sample html.
Example:
require 'watir-webdriver'
# Title to search for:
title_text = /UniAdm/
# Go to the Corpora page:
@b = Watir::Browser.new :ff
@b.goto "https://www.letsmt.eu/Corpora.aspx"
# Wait for the results to load:
container = @b.div(:id, "corporaContainer")
container.div(:class => 'resItem').wait_until_present
# Find the matching ids:
p container.divs(:class => 'resItem')
.find_all { |div| div.span(:class => 'theText', :text => title_text).exists? }
.collect { |div| div.id }
#=> ["c-87ee80a9-e529-48b2-92be-bc8d76375478", "c-f139e781-4789-41f9-82e8-914e0e3eff81", "c-e17641d2-9364-4e87-9047-ba35580dc32f"]

Parsing nodes with Nokogiri?

I'm parsing web pages and I want to get the link from the <img src> by finding the <div id="image">.
How do I do this in Nokogiri? I tried walking through the child nodes but it fails.
<div id="image" class="image textbox ">
<div class="">
<img src="img.jpg" alt="" original-title="">
</div>
</div>
This is my code:
doc = Nokogiri::HTML(open("site.com"))
doc.css("div.image").each do |node|
node.children().each do |c|
puts c.attr("src")
end
end
Any ideas?
Try this and let me know if it works for you
require 'nokogiri'
source = <<-HTML
<div id="image" class="image textbox ">
<div class="">
<img src="img.jpg" alt="" original-title="">
</div>
</div>
HTML
doc = Nokogiri::HTML(source)
doc.css('div#image > div > img').each do |image|
puts image.attr('src')
end
Output:
img.jpg
Here is a great resource: http://ruby.bastardsbook.com/chapters/html-parsing/
Modifying an example a bit, I get this:
doc = Nokogiri::HTML(open("site.com"))
doc.css("div.image img").each do |img|
puts img.attr("src")
end
That said, prefer the ID selector, #image, over the class selector, .image, when you can. An ID should identify a single element, and ID lookups are typically faster.

Ruby/Nokogiri inspect reveals more than class. I need the extra item inspect shows

In the following code:
page = Nokogiri::HTML($browser.html)
page_links = page.css("a").select
page_links.each do |link|
if not link.nil?
if not link['href'].nil? and !!link['href']["/about"]
puts link.class
puts link.inspect
end
end
end
the link.class outputs the following:
Nokogiri::XML::Element
#<Nokogiri::XML::Element:0x..fdb623d3c name="a" attributes=[#<Nokogiri::XML::Attr:0x..fdb623c7e name="action-type" value="8">, #<Nokogiri::XML::Attr:0x..fdb623c74 name="class" value="a-n g-s-n-aa g-s-n-aa I8 EjFvwd VP">, #<Nokogiri::XML::Attr:0x..fdb623c6a name="target" value="_top">, #<Nokogiri::XML::Attr:0x..fdb623c60 name="href" value="./104882190640970316938/about">] children=[#<Nokogiri::XML::Text:0x..fdb623792 "PetSmart Winchester">]>
And link.inspect outputs the following:
Nokogiri::XML::Element
#<Nokogiri::XML::Element:0x..fdb623666 name="a" attributes=[#<Nokogiri::XML::Attr:0x..fdb6235a8 name="action-type" value="8">, #<Nokogiri::XML::Attr:0x..fdb62359e name="class" value="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP">, #<Nokogiri::XML::Attr:0x..fdb623594 name="target" value="_top">, #<Nokogiri::XML::Attr:0x..fdb62358a name="href" value="./104882190640970316938/about">] children=[#<Nokogiri::XML::Element:0x..fdb6230bc name="div" attributes=[#<Nokogiri::XML::Attr:0x..fdb62304e name="style" value="height:110px; width:110px;">] children=[#<Nokogiri::XML::Element:0x..fdb622e1e name="img" attributes=[#<Nokogiri::XML::Attr:0x..fdb622db0 name="style" value=" height: 110px; width: 110px;">, #<Nokogiri::XML::Attr:0x..fdb622da6 name="class" value="mja">, #<Nokogiri::XML::Attr:0x..fdb622d9c name="src" value="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg">]>]>]>
In Nokogiri I can access the link text by link.content and the link url by link['href'] . Yet neither of these methods work for image source from the inspect results.
How can I get the img src within this example code that inspect is revealing?
UPDATE: HERE IS THE HTML CODE
<div class="HWb">
<div class="erb">
<div class="ubb">
<div role="button" class="a-f-e c-b c-b-T c-b-Oe c-b-H-ra L0a X9" tabindex="0"
data-placeid="6817440171144926830" data-source="lo-gp" data-inline="true"
data-tooltip-delay="600" data-tooltip-align="b,l" data-oid="104882190640970316938"
data-size="small">
<span class="TIa c-b-fa"></span>
</div>
</div>
<h3 class="drb">
<a href="./104882190640970316938/about" target="_top" class="a-n g-s-n-aa g-s-n-aa I8 EjFvwd VP"
action-type="8">PetSmart Winchester</a>
</h3>
</div>
<div class="Qbb">
<span class="vqb SIa">Pet Store</span>
<span class="lja SIa">
<a href="//www.google.com/url?sa=D&oi=plus&q=https://maps.google.com/maps?q%3DPetsmart%2Bloc:22601%26numal%3D1%26hl%3Den-US%26gl%3DUS%26mix%3D2%26opth%3Dplatter_request:2%26ie%3DUTF8%26cid%3D6817440171144926830%26iwloc%3DA"
target="_blank" class="a-n uqb">2310 Legge Boulevard, Winchester, VA</a>
</span>
<span class="SIa">(540) 662-5544</span>
</div>
<div class="crb">
<div class="Pbb a-f-e">
<div class="Fbb">
<div class="cca">
<div class="tob">
<div class="xob">“Do not bother with the grooming salon, the staff are unusually stupid.
Otherwise the store is a typical petsmart.”</div>
</div>
</div>
</div>
</div>
<div class="dWa">
<a href="./104882190640970316938/about" target="_top" class="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP"
action-type="8"><div style="height:110px; width:110px;"><img src="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg" class="mja" style=" height: 110px; width: 110px;"></div></a>
</div>
</div>
Without the HTML you're making it a lot harder, but after some digging into the inspect output, I think I have a reasonable HTML snippet.
This is how I'd go about getting to the <img src="..."> tag:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<a action-type="8" class="a-n g-s-n-aa g-s-n-aa Gbb EjFvwd VP" target="_top" href="./104882190640970316938/about">
<div style="height:110px; width:110px;">
<img style=" height: 110px; width: 110px;" class="mja" src="https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg">
</div>
</a>
EOT
doc.at('img')['src'] # => "https://mts0.google.com/vt/data=TSwRVVf0DGlwBQqarpBU3wUz-i2gqbuWEbxTilWKINf30Au9l0oLM_ojk4KI0oPUi8kL5fJaJWte45O3abOXMzE3L7xDBg"
You'll need to take the time to improve your question and provide more detail if that doesn't work.
If you are not sure whether you will have 0, 1 or 1+ instances of a tag, use search because it returns a NodeSet, which acts like an Array, making it easy to deal with no, single or multiple occurrences:
doc.search('img').map{ |img| img['src'] }
will return all the <img src="..."> values in the document in an array. You can iterate over those easily or use empty? to see if there are no hits:
doc.search('img').map{ |img| img['src'] }.each do |src|
# do something with src if any are found.
end
If it's possible you'll have <img> tags without the src="..." parameter, use compact to filter them out before iterating:
doc.search('img').map{ |img| img['src'] }.compact.each do |src|
# do something with src if any are found.
end
If you only expect 0 or 1 occurrence, try:
src = doc.at('img') && doc.at('img')['src']
as in:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<img src="blah">
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> "blah"
or, without the src parameter:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<img>
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> nil
or missing the <img> tag entirely:
doc = Nokogiri::HTML(<<EOT)
<html><body><p>foo</p>
<p>bar</p></body></html>
EOT
src = doc.at('img') && doc.at('img')['src']
=> nil
If you want to continue to use an if block:
if doc.at('img')
puts doc.at('img')['src']
end
will accomplish what your:
if not doc.at('img').nil?
puts doc.at('img')['src']
end
accomplishes, but in a more straightforward and concise manner, while maintaining readability.
The downside to doing two at lookups is it can be costly in big documents, especially inside a loop. You could get all Perlish and use:
if (img = doc.at('img'))
puts img['src']
end
but that's not really the Ruby way. For clarity and long-term maintenance, I'd probably use:
img = doc.at('img')
if (img)
puts img['src']
end
but that exposes the img variable, cluttering up things. It's programmer's choice at that point.
Your two outputs look like they are two different links (ie both the link.class and link.inspect for each).
Assuming we are talking about getting the image source in the second output, it looks like the HTML is something like:
<div><img src="image_src" /></div>
Assuming that is true, then you need to do:
puts link.at_css("img")['src']
I have found if you take the results from link.inspect, since they are a string, and use regex you can grab the image URL.
link.inspect[/http.*com.*"/].chop # Since all other urls are relative ./
I don't believe this is the best method. I will try working with the other answers first.

Why does parsing HTML with Nokogiri return a blank?

This is the HTML I am parsing:
<div class="audio" id="audio59779184_153635497_-28469067_16663">
<table width="100%" cellspacing="0" cellpadding="0"><tbody><tr>
<td>
<a onclick="playAudioNew('59779184_153635497_-28469067_16663')"><div class="play_new" id="play59779184_153635497_-28469067_16663"></div></a>
<input id="audio_info59779184_153635497_-28469067_16663" type="hidden" value="http://cs5888.userapi.com/u59779184/audio/0fc0fc5d8799.mp3,245">
</td>
<td class="info">
<div class="duration fl_r" onmousedown="if (window.audioPlayer) audioPlayer.switchTimeFormat('59779184_153635497_-28469067_16663', event);">4:05</div>
<div class="audio_title_wrap">
<b>Don Omar feat. Lucenzo and Pallada</b> – <span id="title59779184_153635497_-28469067_16663"> Danza Kuduro (Dj Fleep Mashup)(21.05.12).ılııllı.♫♪Новая Клубная Музыка♫♪.ıllıılı.http://vkontakte.ru/public28469067 </span>
</div>
</td>
</tr></tbody></table>
<div class="player_wrap">
<div class="playline" id="line59779184_153635497_-28469067_16663"><div></div></div>
<div class="player" id="player59779184_153635497_-28469067_16663" ondragstart="return false;" onselectstart="return false;">
<table width="100%" border="0" cellspacing="0" cellpadding="0"><tbody><tr id="audio_tr59779184_153635497_-28469067_16663" valign="top">
<td style="padding: 0px; width: 100%; position: relative;">
<div class="audio_white_line" id="audio_white_line59779184_153635497_-28469067_16663" onmousedown="audioPlayer.prClick(event);"></div>
<div class="audio_load_line" id="audio_load_line59779184_153635497_-28469067_16663" onmousedown="audioPlayer.prClick(event);"><!-- --></div>
<div class="audio_progress_line" id="audio_progress_line59779184_153635497_-28469067_16663" onmousedown="audioPlayer.prClick(event);">
<div class="audio_pr_slider" id="audio_pr_slider59779184_153635497_-28469067_16663"><!-- --></div>
</div>
</td>
<td id="audio_vol59779184_153635497_-28469067_16663" style="position: relative;"></td>
</tr></tbody></table>
</div>
</div>
</div>
And the code I'm using:
require 'watir'
require 'nokogiri'
require 'open-uri'
ff = Watir::Browser.new
ff.goto 'http://vk.com/wall-28469067_16663'
htmlSource = ff.html
doc = Nokogiri::HTML(htmlSource, nil, 'UTF-8')
doc.xpath('//div[@class="audio"]/@id').each do |idSongs|
divSong = doc.css('div#'+idSongs)
aa = idSongs.text
link = doc.xpath("//input[@id='#{aa}']//@value")
puts link
puts '========================='
end
ff.close
If I hard-code the id:
aa = 'audio_info59779184_153625626_-28469067_16663'
then puts link prints the expected result, "http://cs5333.userapi.com/u14251690/audio/bcf80f297520.mp3,217".
Why does puts link print " " when aa = idSongs.text?
To answer the question asked, link returns "", because it's an empty NodeSet. In other words, Nokogiri didn't find what you were looking for. A NodeSet behaves like an Array, so when you try to puts an empty array you get "".
Because it's a NodeSet you should iterate over it, as you would an array. (The same is true of your doc.css, which would also return a NodeSet.)
The reason it's empty is because Nokogiri can't find what you want. You're looking for the contents of aa which are:
"audio59779184_153635497_-28469067_16663"
Substituting that into "//input[@id='#{aa}']" gives:
"//input[@id='audio59779184_153635497_-28469067_16663']"
but should be:
"//input[@id='audio_info59779184_153635497_-28469067_16663']"
Searching for that finds content:
doc.search("//input[@id='audio_info59779184_153635497_-28469067_16663']").size # => 1
Short answer to "Why is it, if aa = idSongs.text, does puts link return " "?": you're trying to find an input element that has the same DOM id as the div you've already matched. No such input exists, so Nokogiri gives you an empty NodeSet, which puts prints as a blank line.
It looks like they reuse the audio identifier in several places, so to make your code more versatile, extract it once and then prefix it with whatever you need to access. For example:
doc.xpath('//div[@class="audio"]/@id').each do |idSongs|
divSong = doc.css('div#'+idSongs)
aa = idSongs.text
identifier = (match = aa.match(/^audio(.*)$/)) ? match[1] : ""
link = doc.xpath("//input[@id='audio_info#{identifier}']//@value")
puts link
puts '========================='
## now if you want the title (a <span> in the sample HTML, so read its text):
title = doc.xpath("//span[@id='title#{identifier}']").text
puts title
end
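The prefix handling can be checked on its own, without a browser or parser. A small sketch in plain Ruby, using the div id from the sample HTML; the variable names are just for illustration.

```ruby
# Id of the matched <div class="audio">, taken from the sample HTML.
div_id = 'audio59779184_153635497_-28469067_16663'

# Strip the "audio" prefix once, then rebuild the ids of the related elements.
identifier = div_id.sub(/\Aaudio/, '')
input_id   = "audio_info#{identifier}"
title_id   = "title#{identifier}"

# XPath for the hidden input that carries the mp3 URL.
input_xpath = "//input[@id='#{input_id}']/@value"

puts input_id
puts title_id
puts input_xpath
```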
